Automating Shapefile to GeoParquet Conversion

Automating shapefile-to-GeoParquet conversion requires a deterministic pipeline that normalizes coordinate reference systems, enforces strict schema validation, and writes columnar data with spatial metadata intact. A reliable production approach uses geopandas backed by pyogrio for high-throughput I/O and pyarrow for serialization, wrapped in a batch processor that checks geometry types, strips legacy .shp constraints, and outputs compliant GeoParquet files ready for cloud-native querying. This pipeline removes the 2 GB per-file size cap and the 10-character field-name limit of legacy shapefiles, makes the geometry types of each column explicit in metadata, and lets engines such as DuckDB, AWS Athena, or BigQuery apply column pruning and predicate pushdown on attribute filters.

Pipeline Architecture & Design Principles

A robust conversion workflow must address three core failure points before writing to disk: CRS ambiguity, schema drift, and I/O bottlenecks. Shapefiles distribute geometry and attributes across multiple sidecar files (.shp, .shx, .dbf, .prj), making atomic reads fragile. GeoParquet consolidates everything into a single, self-describing columnar file.
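Because a missing sidecar is a common cause of fragile reads, a pre-flight check can catch it before any conversion starts. The following is a minimal sketch using only the standard library; the function name missing_sidecars and the split into required and optional extensions are illustrative assumptions, not part of any library.

```python
from pathlib import Path

# Assumption for this sketch: .shx and .dbf are treated as required, .prj as optional.
REQUIRED = (".shp", ".shx", ".dbf")
OPTIONAL = (".prj",)


def missing_sidecars(shp_path: Path) -> dict[str, list[str]]:
    """Report which sidecar files are absent next to a .shp file."""

    def absent(extensions):
        missing = []
        for ext in extensions:
            # Sidecars are sometimes stored with upper-case extensions.
            candidates = (shp_path.with_suffix(ext), shp_path.with_suffix(ext.upper()))
            if not any(c.exists() for c in candidates):
                missing.append(ext)
        return missing

    return {"required": absent(REQUIRED), "optional": absent(OPTIONAL)}
```

A batch runner could skip any file whose "required" list is non-empty and log a warning when ".prj" appears under "optional".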

As covered in Building Batch Conversion Pipelines with Python, prioritize these architectural rules:

  • Bulk I/O: Use pyogrio instead of fiona. It reads whole layers through GDAL in vectorized form rather than building one Python object per feature, which is where most of the speedup comes from.
  • Explicit CRS Normalization: Never assume .prj files exist. Fail fast or apply a deterministic fallback (e.g., EPSG:4326).
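  • Schema Sanitization: The Parquet format itself accepts most column names, but the pandas-to-Arrow conversion fails on duplicate names, and many query engines restrict or require quoting for spaces and special characters. Normalize headers before serialization.
  • Metadata Compliance: GeoParquet requires a geo key in the Parquet file’s schema metadata containing version, primary_column, and columns, where each column entry declares its encoding and geometry_types. The crs entry is optional and defaults to OGC:CRS84 when omitted. Modern geopandas writes all of this automatically, but validation should be explicit.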
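The schema sanitization rule can be sketched as a small pure-Python helper. This is one possible approach, assuming that duplicates should be kept with numeric suffixes rather than dropped and that the geometry column name must never change; sanitize_columns and its protected parameter are hypothetical names.

```python
import re


def sanitize_columns(names, protected=("geometry",)):
    """Return lower-case, underscore-only column names with duplicates suffixed."""
    # Reserve protected names up front so no attribute column can take them over.
    used = {p for p in protected if p in names}
    result = []
    for name in names:
        if name in protected:
            result.append(name)
            continue
        clean = re.sub(r"[^a-z0-9_]+", "_", name.strip().lower()).strip("_")
        if not clean:
            clean = "col"
        elif clean[0].isdigit():
            # Many SQL engines dislike identifiers that start with a digit.
            clean = f"c_{clean}"
        candidate, n = clean, 1
        while candidate in used:
            n += 1
            candidate = f"{clean}_{n}"
        used.add(candidate)
        result.append(candidate)
    return result
```

Assigning the result with gdf.columns = sanitize_columns(list(gdf.columns)) keeps every attribute column, which avoids silently losing data when two .dbf headers normalize to the same name.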

Scaling these patterns across enterprise datasets is a core component of modern Data Conversion & Migration Pipelines, where idempotency and resumable execution prevent partial writes during network or disk failures.
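Idempotency and protection against partial writes can be approximated with two small helpers: skip work whose output is already newer than its source, and write to a temporary file that is renamed into place only on success. This is a sketch under those assumptions; is_up_to_date and write_atomically are hypothetical helper names, and the modification-time check is a simple heuristic rather than a content checksum.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def is_up_to_date(src: Path, dst: Path) -> bool:
    """True when dst exists and is at least as new as src (mtime heuristic)."""
    return dst.exists() and dst.stat().st_mtime >= src.stat().st_mtime


def write_atomically(dst: Path, write_fn: Callable[[Path], None]) -> None:
    """Run write_fn against a temp file, then rename it over dst on success."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=dst.parent, suffix=".tmp")
    os.close(fd)
    tmp = Path(tmp_name)
    try:
        write_fn(tmp)
        os.replace(tmp, dst)
    finally:
        # After a successful replace the temp path is gone; after a failure it is removed here.
        if tmp.exists():
            tmp.unlink()
```

In a conversion loop this might be used as write_atomically(out_path, lambda p: gdf.to_parquet(p)), guarded by is_up_to_date(shp_path, out_path) so that a rerun resumes where the previous one stopped.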

Production-Ready Conversion Script

```python
import logging
from pathlib import Path

import geopandas as gpd
import pyarrow.parquet as pq

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")


def convert_shp_to_geoparquet(
    shp_path: Path,
    out_dir: Path,
    target_crs: str = "EPSG:4326",
    compression: str = "snappy",
) -> Path:
    """Convert one shapefile to GeoParquet and return the output path."""
    if not shp_path.exists():
        raise FileNotFoundError(f"Source shapefile missing: {shp_path}")

    # Bulk read through GDAL via pyogrio (avoids per-feature Python overhead)
    try:
        gdf = gpd.read_file(shp_path, engine="pyogrio")
    except Exception as e:
        raise RuntimeError(f"Shapefile read failed: {e}") from e

    # Enforce CRS normalization
    if gdf.crs is None:
        # set_crs only labels the data; it assumes coordinates are already in target_crs
        logging.warning(f"No CRS detected in {shp_path.name}. Assuming {target_crs}")
        gdf = gdf.set_crs(target_crs)
    elif gdf.crs != target_crs:
        # Compare CRS objects, not their string forms, which vary between EPSG codes and WKT
        gdf = gdf.to_crs(target_crs)

    # Sanitize column names for downstream query engines
    gdf.columns = [c.strip().replace(" ", "_").replace("-", "_").lower() for c in gdf.columns]

    # Drop later columns whose sanitized name duplicates an earlier one
    gdf = gdf.loc[:, ~gdf.columns.duplicated()]

    out_path = out_dir / f"{shp_path.stem}.parquet"
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # Write GeoParquet (geopandas adds the 'geo' metadata; recent releases default to GeoParquet 1.0.0)
    gdf.to_parquet(out_path, compression=compression, index=False)

    # Verify GeoParquet compliance
    _verify_geoparquet_metadata(out_path)

    logging.info(f"Converted {shp_path.name} -> {out_path.name}")
    return out_path


def _verify_geoparquet_metadata(parquet_path: Path) -> None:
    """Ensure the output contains required GeoParquet schema metadata."""
    schema = pq.read_schema(parquet_path)
    if b"geo" not in (schema.metadata or {}):
        raise ValueError(f"Output {parquet_path.name} missing 'geo' metadata. Not GeoParquet compliant.")


def batch_convert(input_dir: Path, output_dir: Path, **kwargs) -> list[Path]:
    """Convert every .shp under input_dir, logging and skipping files that fail."""
    shp_files = sorted(input_dir.rglob("*.shp"))
    logging.info(f"Found {len(shp_files)} shapefiles. Starting conversion...")

    converted = []
    for shp in shp_files:
        # Mirror the input layout so same-named shapefiles in different folders do not collide
        target_dir = output_dir / shp.parent.relative_to(input_dir)
        try:
            out = convert_shp_to_geoparquet(shp, target_dir, **kwargs)
            converted.append(out)
        except Exception as e:
            logging.error(f"Skipped {shp.name}: {e}")
    return converted
```

Implementation Breakdown & Best Practices

| Step | Technical Rationale | Production Tip |
| --- | --- | --- |
| I/O Engine | pyogrio reads layers in bulk through GDAL and can hand results over as Arrow tables. | Set the PYOGRIO_USE_ARROW=1 environment variable to enable the Arrow read path on large datasets. |
| CRS Handling | GeoParquet stores the CRS as PROJJSON in the column metadata and treats a missing crs as OGC:CRS84, so an unlabeled source needs an explicit decision. | Transform to EPSG:4326 (WGS84) or EPSG:3857 (Web Mercator) before cloud ingestion to standardize spatial joins. |
| Column Sanitization | Spaces, hyphens, and leading digits are legal in Parquet but awkward or rejected in many SQL engines. | Apply regex normalization: re.sub(r"[^a-z0-9_]", "_", col.lower()) for strict compliance. |
| Metadata Injection | The GeoParquet 1.0.0 Specification mandates geo metadata with column-level geometry encoding. | geopandas writes this metadata in to_parquet. Verify with pq.read_schema(path).metadata. |
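The metadata check can go one step further than testing for the presence of the geo key. The sketch below validates the decoded JSON against the fields GeoParquet 1.0.0 marks as required; validate_geo_metadata is a hypothetical helper, and it assumes the raw bytes were obtained with pq.read_schema(path).metadata[b"geo"].

```python
import json


def validate_geo_metadata(raw: bytes, expected_primary: str = "geometry") -> dict:
    """Check the required GeoParquet keys and return the decoded metadata."""
    geo = json.loads(raw)
    for key in ("version", "primary_column", "columns"):
        if key not in geo:
            raise ValueError(f"'geo' metadata missing required key: {key}")

    primary = geo["primary_column"]
    if primary != expected_primary:
        raise ValueError(f"primary_column is {primary!r}, expected {expected_primary!r}")

    column = geo["columns"].get(primary)
    if column is None:
        raise ValueError(f"No column entry for primary column {primary!r}")
    for key in ("encoding", "geometry_types"):
        if key not in column:
            raise ValueError(f"Column {primary!r} missing required key: {key}")
    # 'crs' is optional in the spec, so its absence is not treated as an error here.
    return geo
```

Returning the decoded dictionary lets the caller log the recorded geometry_types or CRS alongside the conversion result.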

Geometry Type Enforcement

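A shapefile stores a single shape type per file, but it does not distinguish single-part from multi-part geometries, so a polygon layer often reads back as a mix of Polygon and MultiPolygon (and a line layer as LineString and MultiLineString). GeoParquet permits mixed types in one column and lists them in the geometry_types metadata, yet many downstream tools and table schemas behave more predictably with one type per column. If you need that, promote single-part geometries to their multi-part equivalents, or split the GeoDataFrame by gdf.geom_type before writing. Casting to GeometryCollection is a last resort, since it hides the real type from consumers.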
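Choosing a strategy starts with knowing which types a layer actually contains. The following sketch works on plain type-name strings, such as the values of gdf.geom_type; summarize_geometry_types and the "promotable" flag are illustrative names, and null geometries are assumed to appear as None.

```python
from collections import Counter


def summarize_geometry_types(type_names):
    """Count geometry types and flag layers that differ only by single vs multi."""
    counts = Counter(t for t in type_names if t)  # skip None for null geometries
    # Strip the "Multi" prefix so Polygon and MultiPolygon fall into one family.
    families = {t[len("Multi"):] if t.startswith("Multi") else t for t in counts}
    return {
        "geometry_types": sorted(counts),
        "counts": dict(counts),
        # True when promoting single-part shapes to multi-part would unify the column.
        "promotable": len(counts) > 1 and len(families) == 1,
    }
```

A layer that is not promotable and has more than one entry in "geometry_types" is a candidate for splitting into one output file per type.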

Compression & Chunking

Snappy decompresses quickly at a modest compression ratio, which suits query-heavy workloads on cloud engines. For archival storage, zstd or brotli typically produce smaller files at a higher CPU cost. When files exceed 500MB, consider partitioning by spatial index (e.g., H3 hexagons or geohash prefixes) to enable partition pruning in Athena or BigQuery.
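Geohash-prefix partitioning can be sketched without extra dependencies. The encoder below follows the standard geohash bit-interleaving scheme; geohash_encode and partition_dir are hypothetical names, the point passed in is assumed to be a representative coordinate such as a feature centroid in WGS84, and the Hive-style geohash=... directory name is an assumption about the target layout.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, precision: int = 5) -> str:
    """Encode a WGS84 point as a geohash string of the given length."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bit_count = 0
    index = 0
    use_lon = True  # geohash interleaves bits starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            index = (index << 1) | 1
            rng[0] = mid
        else:
            index = index << 1
            rng[1] = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits map to one base-32 character
            chars.append(_BASE32[index])
            bit_count = 0
            index = 0
    return "".join(chars)


def partition_dir(lat: float, lon: float, prefix_len: int = 3) -> str:
    """Build a Hive-style partition directory name from a geohash prefix."""
    return f"geohash={geohash_encode(lat, lon, prefix_len)}"
```

Shorter prefixes give fewer, larger partitions; the right length depends on data density and should be tuned so that individual files stay a reasonable size.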

Validation & Cloud-Native Query Readiness

Once converted, verify spatial integrity and query performance before promoting to production storage. Run a quick validation pass using pyarrow to confirm the geo metadata key exists and matches the expected primary_column. Then, test predicate pushdown in your target engine:

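```sql
-- DuckDB syntax (assumes the spatial extension is loaded and S3 access is configured).
-- Athena and BigQuery query registered tables instead of read_parquet().
SELECT COUNT(*)
FROM read_parquet('s3://bucket/data/*.parquet')
WHERE ST_Intersects(geometry, ST_GeomFromText('POLYGON((...))'))
  AND attribute_col = 'target_value';
```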

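Engines use Parquet row-group statistics to skip data that cannot match an attribute filter such as attribute_col = 'target_value'. Skipping row groups on the spatial predicate additionally needs per-row bounding-box columns (standardized as a covering in GeoParquet 1.1) or spatially sorted data, since GeoParquet 1.0 metadata describes each geometry column as a whole. How much scan cost this saves depends on file layout and query selectivity, so measure it on your own data rather than assuming a fixed percentage. For detailed configuration guidance, consult the official GeoPandas I/O documentation and your cloud provider’s spatial query tuning guides.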

Troubleshooting Common Edge Cases

| Symptom | Root Cause | Resolution |
| --- | --- | --- |
| Mixed Polygon/MultiPolygon (or LineString/MultiLineString) rows after reading | Shapefiles do not distinguish single-part from multi-part shapes | Promote to the multi-part type, filter by gdf.geom_type, or use gdf.explode() before writing |
| ValueError: Duplicate column names found, or a query engine rejecting a column name | Legacy .dbf headers use spaces or special chars, or collide after truncation | Apply the sanitization list comprehension in the script |
| Missing CRS / Projection mismatch | .prj file absent or malformed | Explicitly pass target_crs and log warnings for manual review |
| File size > 2GB after conversion | High-precision coordinates or excessive attributes | Enable compression="zstd", drop unused columns, or partition spatially |

Automating shapefile-to-GeoParquet conversion transforms brittle, desktop-bound workflows into scalable, cloud-optimized data products. By enforcing strict schema validation, using pyogrio for bulk I/O, and embedding compliant spatial metadata, platform teams can eliminate legacy bottlenecks and unlock high-performance spatial analytics at scale.