Comparing GeoParquet vs FlatGeobuf Performance
Modern geospatial pipelines have largely moved past legacy constraints. As platform teams and cloud architects evaluate cloud-native alternatives, Comparing GeoParquet vs FlatGeobuf Performance becomes a critical architectural decision. Both formats solve distinct bottlenecks: one optimizes analytical throughput via columnar encoding and vectorized decoding, while the other prioritizes spatial indexing and streaming reads. This guide provides a reproducible benchmark workflow, tested Python patterns, and production-grade error handling for GIS data engineers and backend developers. For foundational context on format evolution and storage paradigms, see Geospatial Storage Fundamentals & Format Comparison.
Architectural Foundations: Columnar Analytics vs Spatial Streaming
The performance delta between these formats stems from their underlying serialization strategies. GeoParquet extends the Apache Parquet columnar format by embedding geometry metadata and coordinate reference system (CRS) definitions directly into the schema. This structure enables predicate pushdown, allowing query engines to skip irrelevant row groups and decode only the required columns in memory. When evaluating Understanding Parquet Columnar Storage for GIS, you will notice that analytical workloads benefit significantly from this layout, especially when aggregating non-geometric attributes alongside spatial filters.
FlatGeobuf takes a fundamentally different approach. It serializes geometries into a single, contiguous binary file with an embedded Hilbert curve spatial index. This design eliminates the need for external index files or directory structures, making it ideal for streaming reads and web-based tile serving. The format emerged as a direct response to Shapefile Limitations in Modern Data Stacks, particularly the 2GB file cap, fragmented metadata, and lack of native spatial indexing. By packing geometries sequentially and maintaining a compact index header, FlatGeobuf achieves near-instantaneous bounding-box queries without loading the entire dataset into memory.
Benchmark Environment & Prerequisites
Reproducible performance testing requires strict environmental control. Variability in I/O scheduling, garbage collection, or library versions will skew results. Standardize your setup using the following baseline:
- Python 3.10+ with
geopandas>=0.14,pyarrow>=14.0,flatgeobuf>=3.26,shapely>=2.0 - Dataset Profile: ~500k–1.2M features (e.g., OpenStreetMap building footprints, US census block groups, or global administrative boundaries)
- Storage Tier: Local NVMe SSD for baseline I/O testing; AWS S3 or GCS with
s3fs/gcsfsfor cloud latency simulation - Monitoring Stack:
psutilfor RSS tracking,time.perf_counterfor wall-clock timing, andpy-spyorcProfilefor CPU profiling - CRS Standardization: All datasets normalized to
EPSG:4326before serialization to eliminate on-the-fly projection overhead - Execution Policy: Run benchmarks in a clean virtual environment with
PYTHONHASHSEED=0and disable OS-level readahead where possible
Step-by-Step Benchmark Workflow
A structured testing matrix prevents cherry-picked results and ensures statistical validity.
- Data Ingestion & Validation: Load raw source data via
geopandas.read_file(), executeshapely.make_valid(), drop null geometries, and enforce a unified schema. Invalid polygons will cause silent decoding failures in columnar readers. - Serialization Matrix: Write identical datasets using multiple compression codecs. For GeoParquet, test
snappy,zstd, andlz4. For FlatGeobuf, use default binary packing with optionalZSTDcompression if supported by your binding version. - Read Pattern Testing: Execute three distinct query profiles:
- Full-table scan: Load all rows without filters to measure raw decode throughput.
- Spatial bounding-box filter: Query a 10km × 10km window to test index efficiency.
- Attribute predicate pushdown: Filter on a categorical column (e.g.,
building_type = 'residential') to evaluate column pruning.
- Resource Tracking: Capture peak memory (RSS), CPU utilization, and I/O wait time per operation. Flush OS page caches between runs to prevent cold/warm read bias.
- Statistical Aggregation: Execute each test 7 times, discard the highest and lowest values, and report the median alongside the interquartile range (IQR). This mitigates garbage collection spikes and background process interference.
Production-Ready Implementation Patterns
The following implementation avoids naive read_file() calls in favor of engine-specific optimizations. It includes explicit resource tracking, error boundaries, and engine routing suitable for backend services.
import logging
import os
import time
from typing import Any, Callable, Dict, Tuple
import flatgeobuf
import geopandas as gpd
import psutil
import pyarrow.dataset as ds
from shapely.geometry import box
logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
def track_resources(func: Callable, *args: Any, **kwargs: Any) -> Tuple[Any, float, float]:
"""Wrapper to capture peak memory (MB) and execution time (seconds)."""
process = psutil.Process()
start_mem = process.memory_info().rss
start_time = time.perf_counter()
try:
result = func(*args, **kwargs)
except Exception as e:
logging.error(f"Execution failed: {e}")
raise
end_time = time.perf_counter()
end_mem = process.memory_info().rss
peak_mem_mb = max(end_mem, start_mem) / (1024 ** 2)
return result, (end_time - start_time), peak_mem_mb
def benchmark_geoparquet(file_path: str, bbox: Tuple[float, float, float, float]) -> Dict[str, Any]:
"""Benchmark GeoParquet with PyArrow dataset API and predicate pushdown."""
dataset = ds.dataset(file_path, format="parquet")
# Spatial filter via bounding box (requires GeoParquet 1.0+ metadata)
bbox_filter = ds.field("geometry").intersects(box(*bbox))
def load_with_pushdown():
# Pushdown applies both spatial and attribute filters
table = dataset.filter(bbox_filter).to_table()
return gpd.GeoDataFrame(table.to_pandas(), geometry="geometry")
gdf, duration, peak_mem = track_resources(load_with_pushdown)
return {"rows": len(gdf), "duration_s": duration, "peak_mem_mb": peak_mem}
def benchmark_flatgeobuf(file_path: str, bbox: Tuple[float, float, float, float]) -> Dict[str, Any]:
"""Benchmark FlatGeobuf using native spatial index streaming."""
def load_with_index():
# flatgeobuf supports direct bbox filtering via the embedded index
with open(file_path, "rb") as f:
# The Python binding returns an iterator of features
features = list(flatgeobuf.deserialize(f, bbox=bbox))
# Convert to GeoDataFrame for parity
gdf = gpd.GeoDataFrame.from_features(features)
return gdf
gdf, duration, peak_mem = track_resources(load_with_index)
return {"rows": len(gdf), "duration_s": duration, "peak_mem_mb": peak_mem}
# Example execution block
if __name__ == "__main__":
TEST_BBOX = (-74.05, 40.65, -73.90, 40.75) # Lower Manhattan
PARQUET_PATH = "data/buildings.parquet"
FGB_PATH = "data/buildings.fgb"
if not os.path.exists(PARQUET_PATH) or not os.path.exists(FGB_PATH):
raise FileNotFoundError("Benchmark files missing. Run serialization step first.")
print("Running GeoParquet benchmark...")
pq_res = benchmark_geoparquet(PARQUET_PATH, TEST_BBOX)
print(f"GeoParquet: {pq_res['rows']} rows | {pq_res['duration_s']:.3f}s | {pq_res['peak_mem_mb']:.1f}MB")
print("\nRunning FlatGeobuf benchmark...")
fgb_res = benchmark_flatgeobuf(FGB_PATH, TEST_BBOX)
print(f"FlatGeobuf: {fgb_res['rows']} rows | {fgb_res['duration_s']:.3f}s | {fgb_res['peak_mem_mb']:.1f}MB")
Performance Trade-offs & Selection Criteria
Raw decode speed rarely tells the full story. GeoParquet consistently outperforms in analytical pipelines where multiple attribute columns are aggregated alongside spatial joins. The columnar layout minimizes cache misses and enables vectorized execution engines like DuckDB or Polars to process millions of rows in parallel. However, this advantage diminishes when querying highly sparse regions or when the dataset size is under 100k features, where index overhead outweighs decode latency.
FlatGeobuf excels in low-latency, read-heavy scenarios. Its spatial index allows the reader to jump directly to relevant byte offsets, making it highly efficient for bounding-box queries and tile generation. When evaluating Comparing FlatGeobuf Read Speeds in Python, you will observe that streaming architectures benefit from its single-file design, which eliminates directory metadata lookups and reduces HTTP round trips in cloud storage.
The decision matrix ultimately depends on your access patterns. If your workload involves heavy filtering, aggregation, or integration with modern data warehouses, GeoParquet is the default choice. For real-time APIs, web mapping backends, or edge deployments where memory is constrained, FlatGeobuf provides predictable latency. Detailed guidance on Optimizing FlatGeobuf for High-Read APIs covers connection pooling, range-request tuning, and index pre-warming strategies. For teams still evaluating trade-offs across multiple projects, How to Choose Between GeoParquet and FlatGeobuf provides a decision tree aligned with cloud-native architecture patterns.
Production Deployment Checklist
Before promoting either format to production, validate the following operational requirements:
- Metadata Compliance: Ensure GeoParquet files include
geometadata keys (primary_column,columns,crs) per the OGC GeoParquet standard. Missing metadata will break cross-engine compatibility. - Index Validation: Verify FlatGeobuf spatial index integrity using
flatgeobuf --verifyor equivalent CLI tools. Corrupted Hilbert sequences cause silent read failures. - Cloud Storage Optimization: Enable HTTP range requests for S3/GCS. Both formats benefit from partial reads, but FlatGeobuf requires precise byte-range alignment to avoid downloading full files.
- Schema Evolution: Implement strict schema validation on write. GeoParquet supports schema evolution, but type mismatches (e.g.,
int32vsint64) will break downstream query plans. - Monitoring Integration: Track
bytes_scannedvsbytes_returnedratios. A ratio approaching 1.0 indicates inefficient filtering and suggests index or partitioning adjustments.
By aligning format selection with query topology and enforcing strict serialization standards, platform teams can eliminate I/O bottlenecks and scale geospatial workloads predictably across cloud environments.