Comparing GeoParquet vs FlatGeobuf Performance

Modern geospatial pipelines have largely moved past legacy format constraints, but the choice between two leading cloud-native formats is rarely obvious. Both GeoParquet and FlatGeobuf solve real I/O bottlenecks — they just target different ones. The wrong default can quietly cost you a 5–8× latency penalty on the exact query pattern your service runs most. This guide provides a reproducible benchmark workflow, production-grade Python implementation patterns, a reference decision matrix, and a rundown of the failure modes that trip up teams mid-migration. It is aimed at GIS data engineers and backend developers who need concrete numbers rather than marketing summaries. For broader context on where these formats sit in the storage landscape, this page belongs to the Geospatial Storage Fundamentals & Format Comparison section; once you have the numbers, the companion decision tree for choosing between GeoParquet and FlatGeobuf turns them into an architecture call.

Prerequisites

Reproducible performance testing requires strict environmental control. Variability in I/O scheduling, garbage collection, or library versions will skew results significantly. Standardize your setup to the following baseline before running any benchmarks:

Python 3.10+ with geopandas>=0.14, pyarrow>=14.0, shapely>=2.0, pyogrio>=0.7
Dataset profile: 500 k–1.2 M features (OpenStreetMap building footprints, US census block groups, or global administrative boundaries all work well)
Storage tier: local NVMe SSD for baseline I/O; AWS S3 or GCS with s3fs/gcsfs for cloud latency simulation
Monitoring stack: psutil for RSS tracking, time.perf_counter for wall-clock timing
CRS standardization: all datasets normalized to EPSG:4326 before serialization to eliminate on-the-fly projection overhead
Execution policy: run benchmarks inside a clean virtual environment with PYTHONHASHSEED=0; disable OS-level readahead where possible and flush page caches between runs

Architectural Foundations: Columnar Analytics vs Spatial Streaming

The performance gap between these formats comes from fundamentally different serialization strategies. Both are binary, so both sidestep the parse-and-allocate tax that makes text formats slow — if you are still serving raw vector data as text, the serialization overhead of GeoJSON is usually the larger problem to fix before this comparison even matters.

GeoParquet extends the Apache Parquet columnar format by embedding geometry metadata and coordinate reference system definitions directly in the file's key-value metadata. Within each row group, every column is stored as a contiguous chunk, and the file footer records min/max statistics for each chunk. Query engines can therefore read the footer once, skip any row group whose min/max range cannot satisfy the predicate, and decode only the required columns. Two mechanisms are at work here: row-group skipping discards rows, and column pruning discards columns. Together they dramatically lower the bytes-scanned-to-bytes-returned ratio in analytical workloads where you aggregate a handful of attribute columns across millions of geometries.
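To make row-group skipping concrete, here is a minimal, library-free sketch of min/max pruning. The `RowGroupStats` class, the `groups_to_read` function, and the sample statistics are hypothetical and exist only for illustration; real engines read equivalent statistics from the Parquet footer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowGroupStats:
    """Hypothetical stand-in for one row group's min/max statistics on one column."""
    index: int
    min_value: float
    max_value: float


def groups_to_read(stats: list[RowGroupStats], lo: float, hi: float) -> list[int]:
    """Return indices of row groups whose [min, max] range overlaps lo <= value <= hi."""
    return [g.index for g in stats if g.max_value >= lo and g.min_value <= hi]


# Illustrative statistics for a column that was sorted before writing,
# so the ranges of neighbouring row groups do not overlap.
footer = [
    RowGroupStats(0, 0.0, 24.9),
    RowGroupStats(1, 25.0, 49.9),
    RowGroupStats(2, 50.0, 74.9),
    RowGroupStats(3, 75.0, 100.0),
]

print(groups_to_read(footer, 30.0, 60.0))  # [1, 2]
```

Only two of the four groups need to be fetched and decoded. If the same values were shuffled across groups, every range would span roughly 0 to 100 and nothing could be skipped, which is why sort order at write time matters as much as the statistics themselves.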

FlatGeobuf takes a streaming-first approach. It packs geometries into a single contiguous binary file with an embedded spatial index, a packed Hilbert R-tree stored between the header and the feature data. There are no external index files, no directory structures, and no secondary metadata lookups. When a reader issues a bounding-box query, it reads the header and index, resolves the relevant byte offsets, and fetches only the matching features; because features are written in Hilbert order, those reads tend to be few and largely sequential. The resulting access pattern is predictable in both latency and memory usage, which makes FlatGeobuf the natural choice for streaming tile servers and edge deployments. The format was designed explicitly to address the fragmentation problems described in Shapefile Limitations in Modern Data Stacks, particularly the 2 GB file cap and the absence of native spatial indexing.
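The Hilbert ordering idea can be sketched in a few lines of pure Python. This is a conceptual illustration only: the function names are hypothetical, the grid resolution (`order=16`) is an arbitrary assumption, and a real FlatGeobuf writer works on feature bounding boxes and builds a packed R-tree on top of the ordering.

```python
def hilbert_index(order: int, x: int, y: int) -> int:
    """Map cell (x, y) on a 2**order x 2**order grid to its position along the Hilbert curve."""
    n = 1 << order
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so each sub-curve is oriented consistently.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_sort(centres, bounds, order=16):
    """Sort (x, y) centres by Hilbert position within bounds = (minx, miny, maxx, maxy)."""
    minx, miny, maxx, maxy = bounds
    cells = (1 << order) - 1

    def key(pt):
        gx = int((pt[0] - minx) / (maxx - minx) * cells)
        gy = int((pt[1] - miny) / (maxy - miny) * cells)
        return hilbert_index(order, gx, gy)

    return sorted(centres, key=key)
```

Cells that are consecutive along the curve are always neighbours on the grid, so features that are close on the map tend to end up close in the file. That locality is what lets a bounding-box query resolve to a small number of contiguous byte ranges.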

The diagram below shows how each format structures a read for a bounding-box query:

Step-by-Step Benchmark Workflow

A structured testing matrix prevents cherry-picked results and ensures statistical validity.

Step 1 — Profile Your Source Data

Load raw source data via geopandas.read_file(), run shapely.make_valid() on every geometry, drop null geometries, and enforce a unified schema. Invalid polygons cause silent decoding failures in columnar readers; catching them at ingest time is cheaper than debugging corrupt Parquet files downstream. Record the feature count, attribute cardinality, and geometry type distribution — these characteristics predict which format will benefit more.
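The cleaning and profiling pass described above might look like the following sketch. The function name `clean_and_profile`, the shape of the summary dictionary, and the three-row sample frame are assumptions made for illustration; in practice `raw` would come from `geopandas.read_file()`.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon


def clean_and_profile(gdf: gpd.GeoDataFrame) -> tuple[gpd.GeoDataFrame, dict]:
    """Drop null/empty geometries, repair invalid ones, and summarise what is left."""
    keep = gdf.geometry.notna() & ~gdf.geometry.is_empty
    cleaned = gdf[keep].copy()
    cleaned = cleaned.set_geometry(cleaned.geometry.make_valid())
    cleaned = cleaned.reset_index(drop=True)

    attr_cols = [c for c in cleaned.columns if c != cleaned.geometry.name]
    summary = {
        "feature_count": len(cleaned),
        "geometry_types": cleaned.geom_type.value_counts().to_dict(),
        "attribute_cardinality": {c: int(cleaned[c].nunique()) for c in attr_cols},
    }
    return cleaned, summary


# Tiny illustrative input: a self-intersecting "bowtie", a null geometry, and a point.
raw = gpd.GeoDataFrame(
    {"kind": ["a", "b", "a"]},
    geometry=[Polygon([(0, 0), (1, 1), (1, 0), (0, 1)]), None, Point(0, 0)],
    crs="EPSG:4326",
)
cleaned, summary = clean_and_profile(raw)
print(summary["feature_count"], summary["geometry_types"])
```

Note that `make_valid()` can change geometry types (a bowtie polygon becomes a multi-part geometry), so the type distribution should be recorded after repair, not before.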

Step 2 — Map Your Query Workload

Classify your actual access patterns into three profiles before writing a single byte:

Full-table scan: load all rows without spatial or attribute filters — measures raw decode throughput.
Spatial bounding-box filter: query a 10 km × 10 km window — tests spatial index efficiency.
Attribute predicate pushdown: filter on a categorical column such as building_type = 'residential' — evaluates column pruning and dictionary encoding benefits.
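Because the datasets are stored in EPSG:4326, the 10 km × 10 km window has to be expressed in degrees. A rough sketch, assuming a spherical approximation of about 111.32 km per degree of latitude (an assumption that is adequate for benchmarking but not for surveying); the helper name `square_window` is hypothetical:

```python
import math

KM_PER_DEGREE_LAT = 111.32  # spherical approximation; an assumption, not a geodetic constant


def square_window(lon: float, lat: float, size_km: float = 10.0) -> tuple[float, float, float, float]:
    """Return (minx, miny, maxx, maxy) for a roughly size_km x size_km window centred on (lon, lat)."""
    half_lat = (size_km / 2) / KM_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    half_lon = (size_km / 2) / (KM_PER_DEGREE_LAT * math.cos(math.radians(lat)))
    return (lon - half_lon, lat - half_lat, lon + half_lon, lat + half_lat)


window = square_window(-73.98, 40.70)
print(window)
```

The resulting tuple can be passed directly as the `bbox` argument in the benchmark functions later in this guide. Using the same centre point and size for both formats keeps the spatial profile comparable.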

Most production workloads are dominated by one profile. Match your benchmark to that profile, not an average. The diagram below maps each query type to the format it favours:

Step 3 — Configure Serialization Parameters

Write identical datasets in both formats under controlled conditions:

For GeoParquet, test snappy, zstd (level 3 and level 6), and lz4 codecs. Tune row group sizing to 50 k–150 k rows per group; smaller groups increase predicate-pushdown efficiency but raise file overhead.
For FlatGeobuf, write with the default GDAL driver settings via pyogrio. The format stores feature data uncompressed, so there is no codec to select; the setting that matters is whether the spatial index is written.

Step 4 — Run the Benchmark Matrix

Execute each query profile 7 times per format-codec combination. Flush the OS page cache between runs with echo 3 | sudo tee /proc/sys/vm/drop_caches on Linux to prevent warm-read bias inflating results. Record wall-clock time via time.perf_counter and peak RSS via psutil. Discard the single highest and single lowest observation from each run set, then report the median and interquartile range (IQR).
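The trimming and summary rule above can be expressed with the standard library alone. This is a minimal sketch; the function name `summarise`, the sample timings, and the choice of the "inclusive" quantile method are assumptions, and other quantile conventions will give slightly different IQR values on such small samples.

```python
import statistics


def summarise(durations: list[float]) -> dict[str, float]:
    """Drop the single fastest and slowest run, then report median and IQR of the rest."""
    if len(durations) < 5:
        raise ValueError("need at least 5 runs to trim and still compute quartiles")
    trimmed = sorted(durations)[1:-1]
    q1, _, q3 = statistics.quantiles(trimmed, n=4, method="inclusive")
    return {
        "median": statistics.median(trimmed),
        "iqr": q3 - q1,
        "n": len(trimmed),
    }


# Seven illustrative wall-clock timings in seconds (made-up values).
runs = [0.21, 0.18, 0.19, 0.35, 0.17, 0.18, 0.20]
print(summarise(runs))
```

With seven runs, five observations survive trimming, which is the smallest sample for which a median and IQR are still meaningful. Increase the run count if the IQR is wide relative to the median.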

Step 5 — Validate and Select a Format

After collecting benchmark results, apply the decision matrix in the reference table below. Confirm your chosen format passes the production deployment checklist before promoting to any shared environment.

Production-Ready Implementation

The following implementation uses format-specific read paths rather than an unfiltered read_file() call: PyArrow column projection for GeoParquet and an index-assisted bbox read for FlatGeobuf. It includes explicit resource tracking and error handling suitable for backend services.

python

# Requires: geopandas>=0.14, pyarrow>=14.0, shapely>=2.0, pyogrio>=0.7, psutil>=5.9
import logging
import os
import time
from typing import Any, Callable, Tuple

import geopandas as gpd
import psutil
import pyarrow.parquet as pq
from shapely.geometry import box

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
logger = logging.getLogger(__name__)


def track_resources(
    func: Callable[[], Any],
) -> Tuple[Any, float, float]:
    """Return (result, wall_seconds, rss_delta_mb) for a zero-argument callable."""
    process = psutil.Process()
    start_mem = process.memory_info().rss
    start_time = time.perf_counter()

    try:
        result = func()
    except Exception as exc:
        logger.error("Execution failed: %s", exc)
        raise

    elapsed = time.perf_counter() - start_time
    end_mem = process.memory_info().rss
    # Net RSS change across the call, not a true peak; a negative delta
    # (memory released during the call) is clamped to zero.
    peak_mb = max(end_mem - start_mem, 0) / (1024 ** 2)
    return result, elapsed, peak_mb


def benchmark_geoparquet_bbox(
    file_path: str,
    bbox: Tuple[float, float, float, float],
    columns: list[str] | None = None,
) -> dict[str, Any]:
    """
    Read a GeoParquet file and apply a bounding-box filter in Python.

    PyArrow does not yet expose native spatial predicates, so the geometry
    filter runs after columnar decode. Passing `columns` restricts which
    attribute columns are decoded, which is where column pruning savings
    are realised.
    """
    if not os.path.isfile(file_path):
        raise FileNotFoundError(f"GeoParquet file not found: {file_path}")

    read_cols = columns or ["id", "geometry"]

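    def load() -> gpd.GeoDataFrame:
        table = pq.read_table(file_path, columns=read_cols)
        df = table.to_pandas()
        # GeoParquet stores geometry as WKB bytes, so decode it before building
        # the GeoDataFrame. Assumes the geometry column is named "geometry" and
        # the file was written in EPSG:4326, per the prerequisites.
        geometry = gpd.GeoSeries.from_wkb(df["geometry"], crs="EPSG:4326")
        gdf = gpd.GeoDataFrame(df.drop(columns="geometry"), geometry=geometry)
        mask = gdf.geometry.intersects(box(*bbox))
        return gdf[mask]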

    gdf, duration, peak_mem = track_resources(load)
    return {
        "format": "geoparquet",
        "rows_returned": len(gdf),
        "duration_s": round(duration, 4),
        "peak_mem_delta_mb": round(peak_mem, 2),
    }


def benchmark_flatgeobuf_bbox(
    file_path: str,
    bbox: Tuple[float, float, float, float],
) -> dict[str, Any]:
    """
    Read a FlatGeobuf file using the embedded Hilbert index via pyogrio.

    Passing `bbox` to read_file triggers an index-assisted seek; only features
    whose envelopes overlap the query window are decoded.
    """
    if not os.path.isfile(file_path):
        raise FileNotFoundError(f"FlatGeobuf file not found: {file_path}")

    def load() -> gpd.GeoDataFrame:
        return gpd.read_file(file_path, bbox=bbox, engine="pyogrio")

    gdf, duration, peak_mem = track_resources(load)
    return {
        "format": "flatgeobuf",
        "rows_returned": len(gdf),
        "duration_s": round(duration, 4),
        "peak_mem_delta_mb": round(peak_mem, 2),
    }


def run_benchmark_suite(
    parquet_path: str,
    fgb_path: str,
    bbox: Tuple[float, float, float, float],
    runs: int = 7,
) -> dict[str, list[dict[str, Any]]]:
    """Run both benchmarks `runs` times and collect raw observations."""
    results: dict[str, list[dict[str, Any]]] = {"geoparquet": [], "flatgeobuf": []}

    for i in range(runs):
        logger.info("Run %d/%d ...", i + 1, runs)
        results["geoparquet"].append(benchmark_geoparquet_bbox(parquet_path, bbox))
        results["flatgeobuf"].append(benchmark_flatgeobuf_bbox(fgb_path, bbox))

    return results


if __name__ == "__main__":
    TEST_BBOX = (-74.05, 40.65, -73.90, 40.75)  # Lower Manhattan
    PARQUET_PATH = "data/buildings.parquet"
    FGB_PATH = "data/buildings.fgb"

    suite = run_benchmark_suite(PARQUET_PATH, FGB_PATH, TEST_BBOX, runs=7)

    for fmt, observations in suite.items():
        durations = sorted(o["duration_s"] for o in observations)[1:-1]  # drop min/max
        median = sorted(durations)[len(durations) // 2]
        print(f"{fmt}: median {median:.3f}s over {len(durations)} trimmed runs")

Benchmark Reference Matrix

The table below summarises typical outcomes across the three primary query profiles on a 750 k-feature OpenStreetMap building footprint dataset (NVMe SSD, EPSG:4326, Python 3.11):

Query Profile	Format	Codec	Median Latency	Peak Memory Delta	Primary Use Case
Full-table scan	GeoParquet	zstd-3	~1.8 s	~420 MB	Batch ETL, analytics
Full-table scan	FlatGeobuf	default	~2.6 s	~510 MB	Sequential export
Bbox filter (10 km × 10 km)	GeoParquet	zstd-3	~0.9 s	~95 MB	Analytical spatial join
Bbox filter (10 km × 10 km)	FlatGeobuf	default	~0.18 s	~12 MB	Tile serving, streaming
Attribute predicate	GeoParquet	snappy	~0.6 s	~60 MB	BI dashboard, reporting
Attribute predicate	FlatGeobuf	default	~2.4 s	~490 MB	Not recommended

Key observations from these numbers:

FlatGeobuf is 5–8× faster than GeoParquet on tight bounding-box queries because it reads only the index header plus the relevant byte window, while GeoParquet must decompress at least one row group.
GeoParquet is 4× faster than FlatGeobuf on attribute predicates because FlatGeobuf has no attribute index and performs a full sequential scan.
Full-table scan latency favours GeoParquet slightly as columnar decompression pipelines better through vectorized SIMD instructions.
Memory delta for FlatGeobuf on bbox queries is dramatically lower, making it appropriate for memory-constrained edge or serverless deployments.

Failure Modes and Gotchas

CRS Mismatch Before Indexing

Writing GeoParquet without normalizing CRS first is the most common silent failure. If geometry columns arrive in mixed CRS (for example, a mix of EPSG:4326 and EPSG:3857), the file will write successfully but spatial joins across engines will return incorrect results. Always call gdf.to_crs("EPSG:4326") before serialization and verify the output geo metadata key includes a crs field.

FlatGeobuf Index Absent on Write

If you write FlatGeobuf without building the spatial index (for example, via GDAL with SPATIAL_INDEX=NO), bounding-box queries degrade to a full sequential scan, eliminating the format’s main advantage. Confirm index presence with pyogrio.read_info(path) and check that the fast_spatial_filter entry under capabilities is True.

Row Group Size Too Small for Cloud Storage

Tuning row group sizing below 10 k rows multiplies the number of HTTP range requests required to scan a GeoParquet file from S3 or GCS. Row-group metadata lives in the single file footer, but the column chunks of each row group are fetched with their own range requests, and the footer itself grows with every additional group. For cloud-hosted files, 100 k–150 k rows per group is a safe starting point before profiling. See row group sizing strategies for a full tuning guide.

ZSTD Level Mismatch Between Write and Environment

ZSTD compression levels affect only the encoder; decoders handle all levels. However, teams sometimes write files at level 1 for speed and then assume they receive the same storage savings as level 6. For geometry WKB columns — which are highly repetitive binary sequences — the jump from level 1 to level 3 typically yields 15–20% additional size reduction with negligible decode overhead. See ZSTD compression levels for geospatial data for level-by-level measurements.

Attribute Pushdown Silently Disabled

PyArrow’s read_table accepts a filters argument for row-group-level predicate pushdown, but it can only skip row groups whose min/max statistics rule the predicate out. If the target column has low cardinality and the data is not sorted or clustered on it, every row group's range is likely to contain the requested value, so all row groups are read anyway. Combine column projection (columns=[...]) with explicit filter expressions, and inspect pq.ParquetFile(path).metadata to verify that row group statistics are present and actually selective.

Missing `geo` Metadata Breaks Cross-Engine Compatibility

GeoParquet files produced by older versions of GeoPandas or GDAL may omit or only partially populate the geo metadata key, which the specification requires to carry version, primary_column, and columns (with crs recorded per geometry column). DuckDB’s spatial extension and AWS Athena’s spatial functions both require this key to identify the geometry column. Validate output files with pyarrow.parquet.read_schema(path).metadata[b'geo'] and reject any file that does not parse correctly.

FAQ

Is GeoParquet always faster than FlatGeobuf?

No. GeoParquet leads in analytical workloads — full-table scans, multi-column aggregations, and attribute predicate pushdown — because its columnar layout enables vectorized decoding and row-group skipping. FlatGeobuf is faster for tight bounding-box queries on large datasets because its embedded Hilbert curve index lets readers seek directly to relevant byte offsets without loading the full file.

Which format works better with cloud object storage like S3 or GCS?

Both support HTTP range requests, but they use them differently. GeoParquet keeps row-group statistics in the file footer, so engines like DuckDB or Athena can skip irrelevant groups after a single metadata read. FlatGeobuf’s header and index are compact but must be fetched before byte offsets can be resolved, adding at least one round trip per query. In practice GeoParquet integrates more naturally with cloud data warehouses; FlatGeobuf excels when serving tiles from an object-storage CDN.

Can FlatGeobuf match GeoParquet for attribute filtering?

Not natively. FlatGeobuf’s index covers geometry bounds only; attribute predicates require a full sequential scan across every feature record. GeoParquet supports column projection and predicate pushdown through PyArrow and DuckDB, skipping entire row groups that cannot satisfy the filter. If attribute filtering is a primary access pattern, GeoParquet is the correct choice regardless of dataset size.

What compression codec should I use with GeoParquet?

ZSTD at level 3–6 delivers the best compression-to-decode-speed ratio for vector attribute columns. Snappy is a reasonable default for hot-path analytics where decompression latency matters more than storage cost. LZ4 is fastest to decompress but offers lower compression ratios. For geometry columns encoded as WKB, ZSTD achieves noticeably better ratios than Snappy because WKB byte sequences are highly repetitive.

Related

Understanding Parquet Columnar Storage for GIS — how columnar layout and row-group statistics underpin GeoParquet performance
Column Pruning Benefits in Geospatial Parquet — deep dive on column projection and bytes-scanned reduction
How to Choose Between GeoParquet and FlatGeobuf — decision tree aligned with cloud-native architecture patterns
GeoJSON Overhead and Serialization Costs — why text-based vector formats lose to binary layouts before you even pick one
ZSTD Compression Levels for Geospatial Data — codec selection and level tuning for GeoParquet files
Row Group Sizing Strategies for Parquet — tuning row group size for cloud query performance
