Automating Shapefile to GeoParquet Conversion

To automate shapefile-to-GeoParquet conversion at scale, use geopandas (≥ 1.0) backed by pyogrio for I/O and pyarrow for validation. Read each .shp with the pyogrio engine, normalize the CRS to EPSG:4326, sanitize column names for downstream SQL engines, write GeoParquet with compression="zstd", then verify that the required geo metadata key is present in the output schema. This single pass removes the 2 GB component-file cap, the 10-character field-name truncation, and the projection ambiguity that make shapefiles a poor fit for modern data stacks, and it produces files ready for predicate-pushdown queries in DuckDB, AWS Athena, or BigQuery. This page sits under Building Batch Conversion Pipelines with Python; the script below drops straight into that batch framework.

Quick-Reference: Shapefile vs GeoParquet Conversion Settings

| Variable | Recommended Setting | Rationale | Primary Use Case |
| --- | --- | --- | --- |
| I/O engine | `pyogrio` | Bulk GDAL reads with far less Python-level overhead than feature-by-feature readers | High-volume batch conversion over many files |
| Target CRS | `EPSG:4326` (WGS84) | Standard for cloud spatial joins; widely supported by query engines | Cross-dataset joins in DuckDB / Athena / BigQuery |
| Compression | `zstd` level 3 | Good write-speed / ratio balance for vector data at scale | General-purpose vector outputs queried frequently |
| Column sanitization | Lowercase + `[^a-z0-9_]` → `_` | SQL engines restrict, or require quoting for, spaces, hyphens, and leading digits | Legacy `.dbf` headers with non-ASCII or spaced names |
| Metadata check | Assert `b"geo"` in `pq.read_schema().metadata` | GeoParquet 1.0 spec compliance | CI/CD validation gate before promoting to production |

Why This Approach Works

The shapefile I/O problem

A shapefile is not a single file but a collection of sidecars: .shp (geometry), .shx (index), .dbf (attributes), and optionally .prj (projection). Reading across these files on a network filesystem or object store multiplies round-trips, and feature-by-feature readers add Python overhead on top. pyogrio addresses this by reading whole layers in bulk through GDAL, largely outside the Python interpreter, instead of constructing one Python object per feature. Setting PYOGRIO_USE_ARROW=1 additionally routes the read through Arrow, so attribute data lands in Arrow memory before GeoPandas builds the GeoDataFrame.
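Because a missing sidecar changes what a reader can recover, a pre-flight check before the read can be worthwhile. The sketch below uses only the standard library; the helper name `missing_sidecars` and the split into required and optional extensions are assumptions made for illustration, not part of pyogrio or GeoPandas.

```python
from pathlib import Path

# Assumed grouping: geometry, index, and attributes are needed for a usable read;
# the projection file is optional, but without it the CRS is unknown.
REQUIRED_SIDECARS = (".shp", ".shx", ".dbf")
OPTIONAL_SIDECARS = (".prj",)


def missing_sidecars(shp_path: Path) -> dict[str, list[str]]:
    """Report which sidecar files are absent next to a .shp path."""

    def absent(extensions: tuple[str, ...]) -> list[str]:
        return [ext for ext in extensions if not shp_path.with_suffix(ext).exists()]

    return {
        "required": absent(REQUIRED_SIDECARS),
        "optional": absent(OPTIONAL_SIDECARS),
    }
```

The check is case-sensitive, so on a case-sensitive filesystem an upper-case `.DBF` would be reported as missing; normalizing extensions first is left out of this sketch.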

CRS normalization before serialization

GeoParquet embeds CRS metadata as PROJJSON inside the geo key at the file level. If the source .prj is absent, a common occurrence with legacy municipal datasets, gdf.crs is None, and the written file carries no usable CRS: readers must treat its coordinates as unknown, and query engines may reject the file or silently misinterpret it. The pipeline must therefore detect a None CRS and apply a deterministic, logged fallback rather than silently writing ambiguous output.
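To make the three possible CRS states concrete, the following sketch classifies the `crs` entry of one geometry column in the `geo` metadata JSON. It follows the GeoParquet 1.0 rules as I understand them (an omitted `crs` key means OGC:CRS84, an explicit null means the CRS is undefined); the function name `describe_crs` and its return strings are hypothetical.

```python
import json


def describe_crs(geo_json: str, column: str = "geometry") -> str:
    """Classify the CRS entry of one geometry column in GeoParquet "geo" metadata."""
    column_meta = json.loads(geo_json)["columns"][column]
    if "crs" not in column_meta:
        return "default (OGC:CRS84)"  # key omitted: the spec default applies
    crs = column_meta["crs"]
    if crs is None:
        return "unknown"  # explicit null: the CRS is undefined
    ident = crs.get("id") or {}
    if "authority" in ident and "code" in ident:
        return f'{ident["authority"]}:{ident["code"]}'
    return "custom PROJJSON"  # a CRS definition without an authority identifier
```

A validation gate could reject anything that comes back as "unknown", since that is the state a missing .prj produces when no fallback is applied.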

Arrow schema constraints

Parquet’s columnar layout stores each column as a typed array under its field name. Arrow and Parquet themselves accept arbitrary UTF-8 field names, but downstream SQL engines frequently restrict identifiers or force quoting for names with spaces, hyphens, leading digits, or non-ASCII characters, all of which are legal in legacy .dbf headers. Sanitizing names before writing keeps those failures out of the query layer, where the error message is least informative about the original source file.
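The sanitization rule can be shown on a plain list of names, without GeoPandas. In this sketch the underscore prefix for leading digits and the numeric suffix for collisions are illustrative assumptions; the regular expression is the one from the quick-reference table.

```python
import re


def sanitize_names(names: list[str]) -> list[str]:
    """Lowercase names, replace unsafe characters, and keep the results unique."""
    seen: dict[str, int] = {}
    result = []
    for raw in names:
        name = re.sub(r"[^a-z0-9_]", "_", raw.strip().lower())
        if not name or name[0].isdigit():
            name = f"_{name}"  # identifiers should not be empty or start with a digit
        count = seen.get(name, 0)
        seen[name] = count + 1
        # Suffix repeats instead of dropping them, so no column is silently lost
        result.append(name if count == 0 else f"{name}_{count}")
    return result


print(sanitize_names(["Parcel ID", "parcel-id", "2020_POP", "Zoning"]))
# ['parcel_id', 'parcel_id_1', '_2020_pop', 'zoning']
```

Suffixing collisions rather than discarding them is a deliberate choice here: two legacy fields that normalize to the same name usually hold different data.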

Pipeline Architecture

Each shapefile flows through four mandatory stages (read, CRS normalization, column sanitization, write) and then through a validation gate: files that carry the geo metadata key are promoted to object storage, while files that fail are rejected to a dead-letter path for triage rather than silently shipped.
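The gate at the end of that flow can be sketched as a small routing function. This is a minimal illustration, assuming local directories stand in for object storage and the dead-letter path; `route_output` and its parameters are hypothetical names, and the validator is passed in as a callable so that any check (such as a geo-metadata test) can be plugged in.

```python
import shutil
from pathlib import Path
from typing import Callable


def route_output(
    parquet_path: Path,
    is_valid: Callable[[Path], bool],
    promoted_dir: Path,
    dead_letter_dir: Path,
) -> Path:
    """Move a converted file to the promoted or the dead-letter directory."""
    try:
        ok = is_valid(parquet_path)
    except Exception:
        ok = False  # a validator crash counts as a failed check
    target_dir = promoted_dir if ok else dead_letter_dir
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / parquet_path.name
    shutil.move(str(parquet_path), str(destination))
    return destination
```

Treating a validator exception as a rejection keeps unreadable files out of the promoted location without aborting the rest of the batch.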

Production-Ready Conversion Script

python

# Requires: geopandas>=1.0, pyogrio>=0.7, pyarrow>=14.0
import logging
import re
from pathlib import Path

import geopandas as gpd
import pyarrow.parquet as pq

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")


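def sanitize_column_names(gdf: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """Normalize column names to lowercase ASCII identifiers safe for SQL engines."""
    seen: dict[str, int] = {}
    new_names = []
    for col in gdf.columns:
        name = re.sub(r"[^a-z0-9_]", "_", str(col).strip().lower())
        if name[:1].isdigit():
            name = f"_{name}"  # identifiers should not start with a digit
        # Suffix names that collide after normalization instead of dropping data
        count = seen.get(name, 0)
        seen[name] = count + 1
        new_names.append(name if count == 0 else f"{name}_{count}")
    gdf = gdf.copy()
    gdf.columns = new_names
    return gdf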


def convert_shp_to_geoparquet(
    shp_path: Path,
    out_dir: Path,
    target_crs: str = "EPSG:4326",
    compression: str = "zstd",
    compression_level: int = 3,  # pass None for codecs without levels (e.g. snappy)
) -> Path:
    """Convert one shapefile to GeoParquet and return the output path."""
    if not shp_path.exists():
        raise FileNotFoundError(f"Source shapefile missing: {shp_path}")

    # Read via pyogrio; set PYOGRIO_USE_ARROW=1 to route the read through Arrow
    try:
        gdf = gpd.read_file(shp_path, engine="pyogrio")
    except Exception as e:
        raise RuntimeError(f"Shapefile read failed: {e}") from e

    # Enforce CRS: on None, warn and apply the fallback instead of writing a null CRS
    if gdf.crs is None:
        logging.warning(
            "No CRS in %s: applying %s (coordinates assumed correct)",
            shp_path.name, target_crs,
        )
        gdf = gdf.set_crs(target_crs)
    elif gdf.crs != target_crs:  # pyproj compares against any CRS-like input
        gdf = gdf.to_crs(target_crs)

    gdf = sanitize_column_names(gdf)

    out_path = out_dir / f"{shp_path.stem}.parquet"
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # GeoPandas >= 1.0 auto-injects spec-compliant GeoParquet geo metadata.
    # The level is passed explicitly because pyarrow's own zstd default may not be 3.
    gdf.to_parquet(
        out_path,
        compression=compression,
        compression_level=compression_level,
        index=False,
    )

    _verify_geoparquet_metadata(out_path)
    logging.info("Converted %s → %s", shp_path.name, out_path.name)
    return out_path


def _verify_geoparquet_metadata(parquet_path: Path) -> None:
    """Assert the output carries the required GeoParquet 1.0 geo metadata key."""
    schema = pq.read_schema(parquet_path)
    if b"geo" not in (schema.metadata or {}):
        raise ValueError(
            f"{parquet_path.name} is missing 'geo' metadata — not GeoParquet compliant."
        )


def batch_convert(
    input_dir: Path,
    output_dir: Path,
    **kwargs,
) -> list[Path]:
    shp_files = sorted(input_dir.rglob("*.shp"))
    logging.info("Found %d shapefiles. Starting conversion...", len(shp_files))

    converted: list[Path] = []
    for shp in shp_files:
        try:
            out = convert_shp_to_geoparquet(shp, output_dir, **kwargs)
            converted.append(out)
        except Exception as e:
            logging.error("Skipped %s: %s", shp.name, e)
    return converted

Validation and Cloud-Native Query Readiness

After conversion, confirm the geo metadata key and test predicate pushdown before promoting files to production storage. The _verify_geoparquet_metadata function in the script above is the minimal check; a fuller audit also reports the primary geometry column, the CRS authority recorded for it, and the row-group count:

python

import json
from pathlib import Path

import pyarrow.parquet as pq


def audit_geoparquet(path: Path) -> dict:
    """Summarize the GeoParquet metadata needed for a promotion decision."""
    meta = pq.read_schema(path).metadata or {}
    geo = json.loads(meta.get(b"geo", b"{}"))
    primary = geo.get("primary_column")
    column_meta = geo.get("columns", {}).get(primary) or {}
    # "crs" can be an explicit null when the CRS is unknown, so guard with `or {}`
    crs = column_meta.get("crs") or {}
    return {
        "has_geo_key": b"geo" in meta,
        "primary_column": primary,
        "crs_authority": (crs.get("id") or {}).get("authority"),
        "row_groups": pq.read_metadata(path).num_row_groups,
    }

To benchmark actual predicate pushdown in DuckDB, run the spatial extension against a converted file:

sql

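-- DuckDB with the spatial extension; httpfs is needed for s3:// paths
INSTALL spatial; LOAD spatial;
INSTALL httpfs; LOAD httpfs;

-- Note: recent DuckDB releases may already expose GeoParquet geometry as
-- GEOMETRY once spatial is loaded; if so, drop the ST_GeomFromWKB() wrapper.
SELECT COUNT(*)
FROM read_parquet('s3://bucket/data/*.parquet')
WHERE ST_Intersects(
    ST_GeomFromWKB(geometry),
    ST_GeomFromText('POLYGON((...))')
)
AND category = 'target_value';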

Cloud engines use the embedded row-group statistics and, where they support it, the spatial metadata to skip chunks that cannot satisfy a filter. Compared to shapefile or GeoJSON baselines, which must be read in full, scan costs can drop 60–90% on selective queries. Pairing this with row group sizing tuned to your query engine further reduces the data scanned per query.

Expected output ranges. A healthy audit_geoparquet() result returns has_geo_key: True, a non-empty primary_column (usually "geometry"), and crs_authority: "EPSG". For typical municipal or parcel datasets, the converted GeoParquet lands at roughly 15–35% of the source shapefile’s on-disk size (a 3–7× reduction), because ZSTD compresses the redundant coordinate runs and Parquet dictionary-encodes low-cardinality attributes. The row_groups count should be ≥ 1; a single row group for a multi-million-row file is a warning sign that you should partition before writing (see the edge cases below). If has_geo_key is False or crs_authority is None, treat the file as non-compliant and do not promote it.
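The promotion rule and the size comparison described above can be expressed as two small helpers. This is a sketch under stated assumptions: `is_promotable` and `size_ratio` are hypothetical names, the audit dict is expected to have the keys returned by audit_geoparquet(), and the source size is taken as the sum of the .shp, .shx, .dbf, and .prj files that exist.

```python
from pathlib import Path

SIDECAR_SUFFIXES = (".shp", ".shx", ".dbf", ".prj")


def is_promotable(audit: dict) -> bool:
    """Promote only when the geo key is present and a CRS authority is recorded."""
    return bool(audit.get("has_geo_key")) and audit.get("crs_authority") is not None


def size_ratio(parquet_path: Path, shp_path: Path) -> float:
    """Output size as a fraction of the combined size of the source sidecars."""
    source_bytes = sum(
        shp_path.with_suffix(suffix).stat().st_size
        for suffix in SIDECAR_SUFFIXES
        if shp_path.with_suffix(suffix).exists()
    )
    if source_bytes == 0:
        raise ValueError(f"No source bytes found for {shp_path}")
    return parquet_path.stat().st_size / source_bytes
```

A ratio well outside the 0.15 to 0.35 range quoted above is not an error in itself, but it is a reasonable trigger for a closer look at the attribute types or the compression settings.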

Edge Cases and Caveats

Mixed geometry types

A shapefile holds a single shape type per file, but that type does not distinguish single-part from multi-part geometries, so a layer read into GeoPandas commonly mixes Polygon with MultiPolygon or LineString with MultiLineString records. GeoParquet allows mixed geometry types within a column, but analytical engines such as DuckDB and Athena are easier to work with, and often faster, when a column is homogeneous. If your source contains mixed types, split the GeoDataFrame by gdf.geom_type before writing, or cast to the Multi* supertype (MultiPoint, MultiLineString, MultiPolygon). Upcasting preserves every geometry and yields one homogeneous type, at the cost of wrapping simple features in a multi-part container.
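The upcasting option amounts to wrapping each single-part geometry in its multi-part counterpart. The sketch below shows the idea on GeoJSON-style dicts so that it runs without shapely; `to_multi` is a hypothetical helper, and in a GeoPandas pipeline the same mapping would be applied to shapely geometries instead.

```python
SIMPLE_TO_MULTI = {
    "Point": "MultiPoint",
    "LineString": "MultiLineString",
    "Polygon": "MultiPolygon",
}


def to_multi(geometry: dict) -> dict:
    """Wrap a single-part GeoJSON geometry in its Multi* counterpart."""
    geom_type = geometry["type"]
    if geom_type not in SIMPLE_TO_MULTI:
        return geometry  # already multi-part, or a type this sketch leaves alone
    return {
        "type": SIMPLE_TO_MULTI[geom_type],
        # A Multi* geometry is a list of single-part coordinate arrays
        "coordinates": [geometry["coordinates"]],
    }
```

Multi-part inputs pass through unchanged, so the function can be mapped over a whole layer without first filtering by type.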

python

# Split by geometry type to produce homogeneous outputs
for geom_type, subset in gdf.groupby(gdf.geom_type):
    out = out_dir / f"{shp_path.stem}_{geom_type.lower()}.parquet"
    subset.to_parquet(out, compression="zstd", index=False)

Files larger than 500 MB

For very large shapefiles, writing a single GeoParquet row group per file produces poor query performance because the engine must scan entire row groups even for small spatial predicates. Partition by bounding-box tile, H3 hexagon index, or geohash prefix before writing to enable partition pruning — the same locality principle behind quadtree spatial partitioning. The schema mapping guide covers how to propagate partition keys through the schema without losing attribute fidelity. If your attribute table carries many low-cardinality string fields (zoning codes, land-use classes), enabling dictionary encoding for categorical attributes before the write step compounds the ZSTD savings.
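Of the partitioning schemes mentioned, a fixed-degree bounding-box grid is the simplest to sketch. The function below assumes WGS84 longitude/latitude input; the name `tile_key`, the key format, and the one-degree default are illustrative choices, not a convention taken from GeoParquet or any query engine.

```python
import math


def tile_key(lon: float, lat: float, tile_deg: float = 1.0) -> str:
    """Return a partition key for the grid tile containing a lon/lat point."""
    if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
        raise ValueError(f"Coordinates out of range: ({lon}, {lat})")
    # Shift to non-negative space so tile indices start at zero
    col = math.floor((lon + 180.0) / tile_deg)
    row = math.floor((lat + 90.0) / tile_deg)
    return f"x{col:04d}_y{row:04d}"


print(tile_key(-122.4, 37.8))
# x0057_y0127
```

In practice the key would typically be computed from each feature's centroid or bounding-box center and used as a directory-level partition, so that an engine can skip whole tiles that fall outside the query window.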

Streaming vs batch execution

The batch_convert function above processes files sequentially, which is safe but slow for hundreds of inputs. For parallel execution, wrap each convert_shp_to_geoparquet call in a concurrent.futures.ProcessPoolExecutor. Because pyogrio releases the GIL during GDAL reads, a thread pool also works; processes avoid shared-memory pitfalls with large GeoDataFrames. Cap workers at os.cpu_count() // 2 to leave headroom for GDAL’s internal threading.
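A parallel variant along those lines might look like the following sketch. The names `convert_in_parallel` and `default_worker_count` are hypothetical, and the conversion function is passed in as a callable; with the default process pool it must be picklable, for example a module-level function or a functools.partial over one.

```python
import os
from concurrent.futures import Executor, ProcessPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Optional


def default_worker_count() -> int:
    """Half the visible CPUs, but never fewer than one worker."""
    return max(1, (os.cpu_count() or 2) // 2)


def convert_in_parallel(
    shp_files: list[Path],
    convert: Callable[[Path], Path],
    executor_cls: type[Executor] = ProcessPoolExecutor,
    max_workers: Optional[int] = None,
) -> tuple[list[Path], dict[Path, str]]:
    """Run `convert` over many shapefiles; return outputs and per-file errors."""
    converted: list[Path] = []
    failed: dict[Path, str] = {}
    workers = max_workers or default_worker_count()
    with executor_cls(max_workers=workers) as pool:
        futures = {pool.submit(convert, shp): shp for shp in shp_files}
        for future in as_completed(futures):
            shp = futures[future]
            try:
                converted.append(future.result())
            except Exception as exc:  # one bad file should not stop the batch
                failed[shp] = str(exc)
    return sorted(converted), failed
```

Passing `executor_cls=ThreadPoolExecutor` switches to the thread-based variant without changing the calling code, which makes it easy to compare the two on a real workload.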

For pipeline-level fallback routing when individual jobs fail, consider a dead-letter queue pattern that captures failed shapefile paths and retries them with relaxed CRS assumptions after the main batch completes.
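One lightweight way to realize such a dead-letter queue is a JSON Lines manifest on disk. The sketch below is an assumption-laden illustration rather than a prescribed design: `record_failure`, `pending_retries`, and the manifest fields are hypothetical, and a production setup might use a message queue or a database table instead.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_failure(manifest: Path, shp_path: Path, error: str) -> None:
    """Append one failed conversion to a JSON Lines dead-letter manifest."""
    entry = {
        "path": str(shp_path),
        "error": error,
        "failed_at": datetime.now(timezone.utc).isoformat(),
    }
    manifest.parent.mkdir(parents=True, exist_ok=True)
    with manifest.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")


def pending_retries(manifest: Path) -> list[Path]:
    """Return the distinct failed paths, in first-seen order, for a retry pass."""
    if not manifest.exists():
        return []
    seen: dict[Path, None] = {}
    with manifest.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                seen.setdefault(Path(json.loads(line)["path"]), None)
    return list(seen)
```

The retry pass can then iterate over `pending_retries(manifest)` once the main batch has finished, keeping the recorded error text available for triage.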

Frequently Asked Questions

Do I need GDAL installed separately to use pyogrio?

No. The pyogrio wheel bundles GDAL as a shared library. pip install pyogrio is sufficient on Linux, macOS, and Windows without a system GDAL installation.

What happens if the .prj sidecar file is missing?

GeoPandas sets gdf.crs to None. The pipeline detects this, logs a warning, and applies the configured target_crs via set_crs(). Records are not reprojected — coordinates are assumed to already be in that system. If you cannot confirm the source CRS from the data owner, record the uncertainty in a sidecar manifest rather than silently assuming WGS84.

Will this pipeline handle shapefiles larger than 2 GB?

Yes. The 2 GB ceiling applies to the individual component files of a shapefile (.shp and .dbf), so a source dataset of that size usually arrives split across several shapefiles. GeoParquet has no per-file size restriction. For very large sources, partition outputs by spatial tile so downstream query engines can apply partition pruning rather than scanning entire files.

Which ZSTD compression level should I use for GeoParquet?

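Level 3, the zstd library's own default, balances write speed and compression ratio for most vector workloads. Passing compression="zstd" alone leaves the level to pyarrow, so set compression_level explicitly if you depend on a specific value. For archival outputs queried infrequently, levels 9–12 reduce storage cost at the expense of write throughput. Datasets dominated by dense coordinate arrays often do well at lower levels (1–3), because coordinates gain little from higher compression effort.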
