Comparing GeoParquet vs FlatGeobuf Performance

Modern geospatial pipelines have largely moved past legacy constraints. As platform teams and cloud architects evaluate cloud-native alternatives, Comparing GeoParquet vs FlatGeobuf Performance becomes a critical architectural decision. Both formats solve distinct bottlenecks: one optimizes analytical throughput via columnar encoding and vectorized decoding, while the other prioritizes spatial indexing and streaming reads. This guide provides a reproducible benchmark workflow, tested Python patterns, and production-grade error handling for GIS data engineers and backend developers. For foundational context on format evolution and storage paradigms, see Geospatial Storage Fundamentals & Format Comparison.

Architectural Foundations: Columnar Analytics vs Spatial Streaming

The performance delta between these formats stems from their underlying serialization strategies. GeoParquet extends the Apache Parquet columnar format by embedding geometry metadata and coordinate reference system (CRS) definitions directly into the schema. This structure enables predicate pushdown, allowing query engines to skip irrelevant row groups and decode only the required columns in memory. When evaluating Understanding Parquet Columnar Storage for GIS, you will notice that analytical workloads benefit significantly from this layout, especially when aggregating non-geometric attributes alongside spatial filters.

FlatGeobuf takes a fundamentally different approach. It serializes geometries into a single, contiguous binary file with an embedded Hilbert curve spatial index. This design eliminates the need for external index files or directory structures, making it ideal for streaming reads and web-based tile serving. The format emerged as a direct response to Shapefile Limitations in Modern Data Stacks, particularly the 2GB file cap, fragmented metadata, and lack of native spatial indexing. By packing geometries sequentially and maintaining a compact index header, FlatGeobuf achieves near-instantaneous bounding-box queries without loading the entire dataset into memory.

Benchmark Environment & Prerequisites

Reproducible performance testing requires strict environmental control. Variability in I/O scheduling, garbage collection, or library versions will skew results. Standardize your setup using the following baseline:

  • Python 3.10+ with geopandas>=0.14, pyarrow>=14.0, flatgeobuf>=3.26, shapely>=2.0
  • Dataset Profile: ~500k–1.2M features (e.g., OpenStreetMap building footprints, US census block groups, or global administrative boundaries)
  • Storage Tier: Local NVMe SSD for baseline I/O testing; AWS S3 or GCS with s3fs/gcsfs for cloud latency simulation
  • Monitoring Stack: psutil for RSS tracking, time.perf_counter for wall-clock timing, and py-spy or cProfile for CPU profiling
  • CRS Standardization: All datasets normalized to EPSG:4326 before serialization to eliminate on-the-fly projection overhead
  • Execution Policy: Run benchmarks in a clean virtual environment with PYTHONHASHSEED=0 and disable OS-level readahead where possible

Step-by-Step Benchmark Workflow

A structured testing matrix prevents cherry-picked results and ensures statistical validity.

  1. Data Ingestion & Validation: Load raw source data via geopandas.read_file(), execute shapely.make_valid(), drop null geometries, and enforce a unified schema. Invalid polygons will cause silent decoding failures in columnar readers.
  2. Serialization Matrix: Write identical datasets using multiple compression codecs. For GeoParquet, test snappy, zstd, and lz4. For FlatGeobuf, use default binary packing with optional ZSTD compression if supported by your binding version.
  3. Read Pattern Testing: Execute three distinct query profiles:
  • Full-table scan: Load all rows without filters to measure raw decode throughput.
  • Spatial bounding-box filter: Query a 10km × 10km window to test index efficiency.
  • Attribute predicate pushdown: Filter on a categorical column (e.g., building_type = 'residential') to evaluate column pruning.
  1. Resource Tracking: Capture peak memory (RSS), CPU utilization, and I/O wait time per operation. Flush OS page caches between runs to prevent cold/warm read bias.
  2. Statistical Aggregation: Execute each test 7 times, discard the highest and lowest values, and report the median alongside the interquartile range (IQR). This mitigates garbage collection spikes and background process interference.

Production-Ready Implementation Patterns

The following implementation avoids naive read_file() calls in favor of engine-specific optimizations. It includes explicit resource tracking, error boundaries, and engine routing suitable for backend services.

python
import logging
import os
import time
from typing import Any, Callable, Dict, Tuple

import flatgeobuf
import geopandas as gpd
import psutil
import pyarrow.dataset as ds
from shapely.geometry import box

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

def track_resources(func: Callable, *args: Any, **kwargs: Any) -> Tuple[Any, float, float]:
    """Wrapper to capture peak memory (MB) and execution time (seconds)."""
    process = psutil.Process()
    start_mem = process.memory_info().rss
    start_time = time.perf_counter()
    
    try:
        result = func(*args, **kwargs)
    except Exception as e:
        logging.error(f"Execution failed: {e}")
        raise
    
    end_time = time.perf_counter()
    end_mem = process.memory_info().rss
    peak_mem_mb = max(end_mem, start_mem) / (1024 ** 2)
    
    return result, (end_time - start_time), peak_mem_mb

def benchmark_geoparquet(file_path: str, bbox: Tuple[float, float, float, float]) -> Dict[str, Any]:
    """Benchmark GeoParquet with PyArrow dataset API and predicate pushdown."""
    dataset = ds.dataset(file_path, format="parquet")

    # Spatial filter via bounding box (requires GeoParquet 1.0+ metadata)
    bbox_filter = ds.field("geometry").intersects(box(*bbox))
    
    def load_with_pushdown():
        # Pushdown applies both spatial and attribute filters
        table = dataset.filter(bbox_filter).to_table()
        return gpd.GeoDataFrame(table.to_pandas(), geometry="geometry")
        
    gdf, duration, peak_mem = track_resources(load_with_pushdown)
    return {"rows": len(gdf), "duration_s": duration, "peak_mem_mb": peak_mem}

def benchmark_flatgeobuf(file_path: str, bbox: Tuple[float, float, float, float]) -> Dict[str, Any]:
    """Benchmark FlatGeobuf using native spatial index streaming."""
    def load_with_index():
        # flatgeobuf supports direct bbox filtering via the embedded index
        with open(file_path, "rb") as f:
            # The Python binding returns an iterator of features
            features = list(flatgeobuf.deserialize(f, bbox=bbox))
        # Convert to GeoDataFrame for parity
        gdf = gpd.GeoDataFrame.from_features(features)
        return gdf
        
    gdf, duration, peak_mem = track_resources(load_with_index)
    return {"rows": len(gdf), "duration_s": duration, "peak_mem_mb": peak_mem}

# Example execution block
if __name__ == "__main__":
    TEST_BBOX = (-74.05, 40.65, -73.90, 40.75)  # Lower Manhattan
    PARQUET_PATH = "data/buildings.parquet"
    FGB_PATH = "data/buildings.fgb"
    
    if not os.path.exists(PARQUET_PATH) or not os.path.exists(FGB_PATH):
        raise FileNotFoundError("Benchmark files missing. Run serialization step first.")
        
    print("Running GeoParquet benchmark...")
    pq_res = benchmark_geoparquet(PARQUET_PATH, TEST_BBOX)
    print(f"GeoParquet: {pq_res['rows']} rows | {pq_res['duration_s']:.3f}s | {pq_res['peak_mem_mb']:.1f}MB")
    
    print("\nRunning FlatGeobuf benchmark...")
    fgb_res = benchmark_flatgeobuf(FGB_PATH, TEST_BBOX)
    print(f"FlatGeobuf: {fgb_res['rows']} rows | {fgb_res['duration_s']:.3f}s | {fgb_res['peak_mem_mb']:.1f}MB")

Performance Trade-offs & Selection Criteria

Raw decode speed rarely tells the full story. GeoParquet consistently outperforms in analytical pipelines where multiple attribute columns are aggregated alongside spatial joins. The columnar layout minimizes cache misses and enables vectorized execution engines like DuckDB or Polars to process millions of rows in parallel. However, this advantage diminishes when querying highly sparse regions or when the dataset size is under 100k features, where index overhead outweighs decode latency.

FlatGeobuf excels in low-latency, read-heavy scenarios. Its spatial index allows the reader to jump directly to relevant byte offsets, making it highly efficient for bounding-box queries and tile generation. When evaluating Comparing FlatGeobuf Read Speeds in Python, you will observe that streaming architectures benefit from its single-file design, which eliminates directory metadata lookups and reduces HTTP round trips in cloud storage.

The decision matrix ultimately depends on your access patterns. If your workload involves heavy filtering, aggregation, or integration with modern data warehouses, GeoParquet is the default choice. For real-time APIs, web mapping backends, or edge deployments where memory is constrained, FlatGeobuf provides predictable latency. Detailed guidance on Optimizing FlatGeobuf for High-Read APIs covers connection pooling, range-request tuning, and index pre-warming strategies. For teams still evaluating trade-offs across multiple projects, How to Choose Between GeoParquet and FlatGeobuf provides a decision tree aligned with cloud-native architecture patterns.

Production Deployment Checklist

Before promoting either format to production, validate the following operational requirements:

  • Metadata Compliance: Ensure GeoParquet files include geo metadata keys (primary_column, columns, crs) per the OGC GeoParquet standard. Missing metadata will break cross-engine compatibility.
  • Index Validation: Verify FlatGeobuf spatial index integrity using flatgeobuf --verify or equivalent CLI tools. Corrupted Hilbert sequences cause silent read failures.
  • Cloud Storage Optimization: Enable HTTP range requests for S3/GCS. Both formats benefit from partial reads, but FlatGeobuf requires precise byte-range alignment to avoid downloading full files.
  • Schema Evolution: Implement strict schema validation on write. GeoParquet supports schema evolution, but type mismatches (e.g., int32 vs int64) will break downstream query plans.
  • Monitoring Integration: Track bytes_scanned vs bytes_returned ratios. A ratio approaching 1.0 indicates inefficient filtering and suggests index or partitioning adjustments.

By aligning format selection with query topology and enforcing strict serialization standards, platform teams can eliminate I/O bottlenecks and scale geospatial workloads predictably across cloud environments.

Continue exploring