Preserving Metadata During GeoParquet Conversion

Modern geospatial analytics demand columnar storage for performance, yet migrating legacy vector datasets to GeoParquet frequently strips critical provenance, coordinate reference system (CRS) definitions, and custom attributes. Preserving this metadata during conversion is a non-negotiable requirement for platform teams building compliant data lakes. Shapefiles keep metadata in sidecar files and GeoJSON carries it inline, whereas GeoParquet relies on Parquet’s footer metadata and a JSON object stored under the geo key that must follow the specification's schema. Naïve conversions drop everything outside geometry and tabular columns, breaking downstream lineage tracking, spatial indexing, and regulatory compliance.

This guide provides a production-tested workflow for extracting, mapping, and injecting metadata during format transitions, ensuring full alignment with the official GeoParquet specification while maintaining pipeline reliability.

The Metadata Architecture Challenge

Traditional GIS formats store metadata heterogeneously. GeoPackage uses SQLite tables, Shapefile relies on .prj and .xml sidecars, and FlatGeobuf packs metadata in a header block. Parquet, by design, optimizes for analytical query performance and stores metadata as key-value pairs in the file footer. The GeoParquet standard bridges this gap by reserving a geo key containing a JSON object that defines geometry columns, CRS, bounding boxes, and encoding.

When designing Data Conversion & Migration Pipelines, engineers must account for three distinct metadata layers:

  1. Geospatial Core: CRS, geometry type, bounding box, and primary geometry column.
  2. Schema & Type Mapping: Precision, scale, nullability constraints, and temporal formats.
  3. Custom/Provenance: Source system identifiers, processing timestamps, and domain-specific tags.

Failing to serialize these layers correctly leads to silent data degradation. Understanding Metadata Loss Prevention During Format Swaps is essential before implementing conversion logic, as different I/O drivers handle metadata truncation differently. Some libraries silently drop unrecognized keys, while others raise exceptions on malformed CRS strings.
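One way to keep the three layers from bleeding into each other is to serialize each under its own footer key. The sketch below uses only the standard library. The MetadataLayers class and the schema_notes and provenance key names are hypothetical conventions chosen for illustration; only geo is reserved by the GeoParquet specification.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class MetadataLayers:
    """Hypothetical container for the three metadata layers."""

    geo: Dict[str, Any]                                         # geospatial core
    schema_notes: Dict[str, Any] = field(default_factory=dict)  # type mapping notes
    provenance: Dict[str, Any] = field(default_factory=dict)    # custom tags

    def to_footer(self) -> Dict[bytes, bytes]:
        """Serialize each non-empty layer under its own footer key."""
        layers = {
            "geo": self.geo,
            "schema_notes": self.schema_notes,
            "provenance": self.provenance,
        }
        return {
            key.encode("utf-8"): json.dumps(value, sort_keys=True).encode("utf-8")
            for key, value in layers.items()
            if value  # skip layers with nothing to record
        }


layers = MetadataLayers(
    geo={"version": "1.0.0", "primary_column": "geometry", "columns": {}},
    provenance={"source_system": "legacy-gis", "converted_at": "2024-01-01T00:00:00Z"},
)
footer = layers.to_footer()
```

A dictionary in this shape can be merged into an Arrow table's schema metadata with replace_schema_metadata, so custom tags travel in the same footer without being mixed into the geo object.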

Prerequisites & Environment Configuration

A reliable conversion stack requires modern, actively maintained libraries that expose Parquet metadata APIs. The following environment is validated for cloud-native and on-premise deployments:

  • Python 3.9+ (type hints and pathlib required for pipeline orchestration)
  • pyarrow>=14.0.0 (direct Parquet metadata manipulation and schema enforcement)
  • geopandas>=0.14.0 (spatial DataFrame operations and CRS normalization)
  • pyogrio>=0.7.0 (GDAL-backed I/O with optimized vector reading)
  • shapely>=2.0.0 (geometry validation and topology checks)
  • pandas>=2.0.0 (tabular type coercion and null handling)

Install via pip:

bash
pip install pyarrow geopandas pyogrio shapely pandas

For cloud deployments, ensure your IAM roles grant s3:PutObject and s3:GetObject permissions if writing directly to object storage. Configure pyogrio to use the system GDAL installation for maximum format compatibility.

Step 1: Extracting Source Metadata and CRS Definitions

The first phase involves reading the source dataset while explicitly capturing all embedded metadata. Many legacy formats store CRS information in non-standard locations, requiring careful parsing. pyogrio's read_info call inspects a layer without loading its features and returns a dictionary describing the CRS, extent, geometry type, and any layer-level metadata tags.

python
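import pyogrio
from pathlib import Path
from typing import Dict, Any

def extract_source_metadata(source_path: Path) -> Dict[str, Any]:
    """Extract CRS, bounding box, and custom metadata from source vector."""
    meta = pyogrio.read_info(str(source_path))

    # pyogrio reports the CRS as an authority string ("EPSG:4326") or as WKT;
    # only the authority form is normalized here, everything else is "unknown"
    crs_raw = meta.get("crs")
    if crs_raw and crs_raw.upper().startswith("EPSG:"):
        crs_epsg = crs_raw.split(":", 1)[1]
    else:
        crs_epsg = "unknown"

    # read_info exposes the extent as "total_bounds" (xmin, ymin, xmax, ymax)
    bounds = meta.get("total_bounds")

    return {
        "crs": crs_epsg,
        "bbox": [float(v) for v in bounds] if bounds is not None else None,
        "geometry_type": meta.get("geometry_type") or "Unknown",
        "encoding": meta.get("encoding") or "UTF-8",
        # Layer-level tags live under "layer_metadata" and may be None
        "custom_tags": meta.get("layer_metadata") or {}
    }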

Automating this extraction step prevents manual oversight and establishes a baseline for downstream validation. Teams scaling across hundreds of datasets should integrate this logic into a metadata registry. For deeper implementation patterns, review Automating Metadata Extraction During Migration.
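Because CRS values arrive either as authority strings or as WKT, a registry usually needs a parser that handles both forms. The following is a minimal standard-library sketch, not a replacement for a full CRS library: parse_epsg_code is a hypothetical helper, and it only recognizes a WKT definition when the root node ends with an EPSG AUTHORITY or ID entry.

```python
import re
from typing import Optional

# Matches an authority string such as "EPSG:4326" (case-insensitive).
_AUTHORITY_RE = re.compile(r"^\s*EPSG:(\d+)\s*$", re.IGNORECASE)

# Matches a WKT1 AUTHORITY[...] or WKT2 ID[...] node that closes the root
# definition, i.e. the last node before the final closing bracket.
_WKT_ID_RE = re.compile(
    r'(?:AUTHORITY|ID)\[\s*"EPSG"\s*,\s*"?(\d+)"?\s*\]\s*\]\s*$',
    re.IGNORECASE,
)


def parse_epsg_code(crs_raw: Optional[str]) -> Optional[int]:
    """Return the EPSG code for a CRS string, or None when it cannot be derived."""
    if not crs_raw:
        return None
    match = _AUTHORITY_RE.match(crs_raw) or _WKT_ID_RE.search(crs_raw)
    return int(match.group(1)) if match else None


wkt1 = (
    'GEOGCS["WGS 84",DATUM["WGS_1984",'
    'SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],'
    'AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0],'
    'UNIT["degree",0.0174532925199433],AUTHORITY["EPSG","4326"]]'
)
code_from_authority = parse_epsg_code("EPSG:3857")
code_from_wkt = parse_epsg_code(wkt1)
```

WKT written without an authority entry, as some .prj files are, returns None here and should be routed to manual review rather than guessed.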

Step 2: Schema Mapping and Type Coercion

Parquet enforces strict typing, which conflicts with the loosely typed nature of many GIS formats. Integer fields stored as strings, mixed precision floats, and datetime formats must be explicitly coerced before serialization. Improper mapping triggers schema drift during batch processing.

python
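import geopandas as gpd
import pandas as pd
from pyarrow import Table

def map_and_coerce_schema(gdf: gpd.GeoDataFrame) -> Table:
    """Normalize types and prepare for Parquet serialization."""
    gdf = gdf.copy()  # avoid mutating the caller's frame

    # Normalize datetime columns to millisecond-precision UTC timestamps
    for col in gdf.columns:
        if pd.api.types.is_datetime64_any_dtype(gdf[col]):
            series = gdf[col]
            # Naive values are assumed to be UTC; aware values are converted
            if series.dt.tz is None:
                series = series.dt.tz_localize("UTC")
            else:
                series = series.dt.tz_convert("UTC")
            gdf[col] = series.dt.as_unit("ms")

    # Treat empty strings as nulls so Arrow infers consistent column types
    for col in gdf.select_dtypes(include=["object"]).columns:
        gdf[col] = gdf[col].replace("", pd.NA)

    # Encode geometry columns as WKB bytes; Arrow cannot infer a type
    # for shapely geometry objects
    df = gdf.to_wkb()

    # Convert to a PyArrow Table, inferring the schema from the pandas dtypes
    return Table.from_pandas(df, preserve_index=False)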

When migrating from legacy systems, column names often contain spaces, special characters, or reserved keywords. Standardizing these during the mapping phase prevents query failures in downstream engines like DuckDB or AWS Athena. Refer to Schema Mapping for Legacy to Modern Formats for comprehensive naming conventions and type coercion matrices.
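As an illustration of the renaming step, here is a minimal sketch that maps legacy column names to lowercase snake_case and resolves collisions with numeric suffixes. The sanitize_column_names helper and its c_ prefix for names starting with a digit are assumptions made for this example; it does not check engine-specific reserved keywords, and non-ASCII characters are replaced rather than transliterated.

```python
import re
from typing import Dict, Iterable


def sanitize_column_names(names: Iterable[str]) -> Dict[str, str]:
    """Map original column names to unique lowercase snake_case names."""
    mapping: Dict[str, str] = {}
    used = set()
    for original in names:
        # Collapse every run of non-alphanumeric characters into one underscore
        cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", original).strip("_").lower()
        if not cleaned:
            cleaned = "column"
        # Many SQL engines require quoting for identifiers that start with a digit
        if cleaned[0].isdigit():
            cleaned = f"c_{cleaned}"
        # Resolve collisions created by the normalization itself
        candidate, suffix = cleaned, 1
        while candidate in used:
            suffix += 1
            candidate = f"{cleaned}_{suffix}"
        used.add(candidate)
        mapping[original] = candidate
    return mapping


mapping = sanitize_column_names(["Parcel ID", "parcel-id", "2020 Pop.", "geometry"])
```

The resulting dictionary can be passed to the DataFrame rename method with columns=mapping before the table is built, and stored alongside the provenance tags so the original names stay recoverable.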

Step 3: Injecting GeoParquet Metadata

GeoParquet requires a specific JSON structure attached to the geo key in the Parquet file metadata. This structure must conform to the specification, or spatial engines may fail to recognize the geometry column. The following function builds a compliant geo metadata payload and attaches it to the Arrow table.

python
import json
from typing import Dict, Any
from pyarrow import Table
from pyproj import CRS  # pyproj is installed as a geopandas dependency

def build_geoparquet_metadata(
    table: Table,
    source_meta: Dict[str, Any],
    primary_column: str = "geometry"
) -> Table:
    """Attach compliant GeoParquet metadata to the Arrow table."""
    epsg = source_meta.get("crs", "unknown")
    geometry_type = source_meta.get("geometry_type")

    column_meta: Dict[str, Any] = {
        "encoding": "WKB",
        # An empty list signals that the geometry types are not known; other
        # values are assumed to already use the specification's type names
        "geometry_types": [] if geometry_type in (None, "Unknown") else [geometry_type],
        # Full PROJJSON generated by pyproj; null marks the CRS as explicitly
        # unknown instead of silently claiming a default projection
        "crs": CRS.from_epsg(int(epsg)).to_json_dict() if epsg != "unknown" else None,
    }

    # bbox is optional, so omit the key instead of writing null
    if source_meta.get("bbox") is not None:
        column_meta["bbox"] = [float(v) for v in source_meta["bbox"]]

    # Construct the mandatory geo JSON object
    geo_meta = {
        "version": "1.0.0",
        "primary_column": primary_column,
        "columns": {primary_column: column_meta},
    }

    # Serialize to bytes and attach to table metadata
    geo_bytes = json.dumps(geo_meta, separators=(",", ":")).encode("utf-8")

    existing_meta = table.schema.metadata or {}
    updated_meta = {**existing_meta, b"geo": geo_bytes}

    return table.replace_schema_metadata(updated_meta)

Note that in GeoParquet v1.0.0 the encoding field must be WKB; WKT is not an accepted value, and most modern engines expect WKB. If you are migrating QGIS projects that rely on proprietary styling or layer metadata, consider Preserving QGIS Metadata in FlatGeobuf before converting to Parquet, as FlatGeobuf handles QGIS-specific tags more gracefully.
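A lightweight structural check can catch the most common payload mistakes before a file is written. The sketch below uses only the standard library and covers a handful of rules from the v1.0.0 specification (required keys, WKB encoding, geometry_types as a list, bbox length). The check_geo_payload helper is hypothetical and is not a substitute for validating against the specification's full JSON schema.

```python
import json
from typing import List

REQUIRED_TOP_LEVEL = ("version", "primary_column", "columns")


def check_geo_payload(raw: bytes) -> List[str]:
    """Return a list of problems found in a geo footer value (empty if none)."""
    try:
        geo = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        return [f"geo value is not valid UTF-8 JSON: {exc}"]
    if not isinstance(geo, dict):
        return ["geo value must be a JSON object"]

    problems = [f"missing required key: {key}" for key in REQUIRED_TOP_LEVEL if key not in geo]

    columns = geo.get("columns")
    if not isinstance(columns, dict):
        return problems + ["columns must be a JSON object"]

    primary = geo.get("primary_column")
    if primary is not None and primary not in columns:
        problems.append(f"primary_column {primary!r} is not described in columns")

    for name, column in columns.items():
        if column.get("encoding") != "WKB":
            problems.append(f"{name}: encoding must be WKB")
        if not isinstance(column.get("geometry_types"), list):
            problems.append(f"{name}: geometry_types must be a list")
        if "bbox" in column:
            bbox = column["bbox"]
            if not isinstance(bbox, list) or len(bbox) not in (4, 6):
                problems.append(f"{name}: bbox must be a list of 4 or 6 numbers")
    return problems


good = json.dumps({
    "version": "1.0.0",
    "primary_column": "geometry",
    "columns": {"geometry": {"encoding": "WKB", "geometry_types": ["Point"]}},
}).encode("utf-8")
bad = json.dumps({
    "version": "1.0.0",
    "primary_column": "geometry",
    "columns": {"geometry": {"encoding": "WKT"}},
}).encode("utf-8")

good_problems = check_geo_payload(good)
bad_problems = check_geo_payload(bad)
```

Returning a list of messages rather than raising on the first failure lets a batch job log every defect in a payload in one pass.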

Step 4: Validation and Production Pipeline Integration

Writing the file is straightforward, but validation ensures downstream compatibility. Use pyarrow.parquet to write the table with compression and row group sizing optimized for analytical workloads.

python
from pathlib import Path
from pyarrow import Table, parquet

def write_geoparquet(table: Table, output_path: Path) -> None:
    """Write GeoParquet with optimized compression, then verify the footer."""
    parquet.write_table(
        table,
        str(output_path),
        compression="ZSTD",
        use_dictionary=True,
        write_statistics=True,
        row_group_size=100_000  # Optimize for cloud query engines
    )

    # Quick validation: read the footer key-value metadata back and verify
    # that the geo key exists (metadata is None when the footer has no keys)
    footer_meta = parquet.read_metadata(str(output_path)).metadata or {}
    if b"geo" not in footer_meta:
        raise RuntimeError("GeoParquet metadata missing in output file footer.")

For enterprise-scale operations, wrap these functions in an orchestration framework like Apache Airflow or Prefect. Implement retry logic, dead-letter queues for malformed geometries, and checksum verification. When scaling horizontally, partition datasets by spatial index or temporal buckets to avoid skew. Detailed orchestration patterns are covered in Building Batch Conversion Pipelines with Python.
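The retry and checksum ideas can be sketched without committing to a particular orchestrator. The helpers below are hypothetical and use only the standard library: sha256_of_file streams a file through SHA-256, and with_retries re-runs a task on OSError with exponential backoff. Which exceptions count as transient is an assumption that depends on the storage backend.

```python
import hashlib
import time
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large outputs never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def with_retries(task: Callable[[], T], attempts: int = 3, base_delay: float = 1.0) -> T:
    """Run task, retrying on OSError with exponential backoff between attempts."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except OSError:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

In a pipeline, the write step would typically be wrapped in with_retries, and the checksum recorded next to the output so a later stage can detect truncated or corrupted uploads.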

Common Pitfalls and Fallback Strategies

Even with careful implementation, certain edge cases consistently break conversions:

  1. Multi-CRS Datasets: GeoParquet v1.0.0 supports only one CRS per geometry column. If your source contains mixed projections, you must normalize to a single target CRS (typically EPSG:4326 or EPSG:3857) before conversion.
  2. Large Geometry Blobs: WKB encoding can produce massive binary payloads. Enable ZSTD compression and consider tiling or simplifying geometries at scale.
  3. Missing CRS Definitions: Shapefiles without .prj files default to unknown CRS. Implement a fallback routing mechanism that quarantines these files for manual review rather than injecting a default projection.
  4. Metadata Bloat: Parquet footers are loaded into memory during file open. Keep the geo JSON lean and move extensive provenance logs to a companion metadata catalog (e.g., OpenMetadata or AWS Glue).
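The quarantine approach for missing CRS definitions can be sketched with the standard library alone. The route_by_crs helper, the "unknown" sentinel, and the quarantine directory layout are assumptions made for this example; the function moves the source file together with any sidecars that share its stem.

```python
import shutil
from pathlib import Path
from typing import Optional


def route_by_crs(source_path: Path, crs: Optional[str], quarantine_dir: Path) -> Optional[Path]:
    """Return the path to convert, or None after moving the dataset to quarantine."""
    if crs and crs != "unknown":
        return source_path

    quarantine_dir.mkdir(parents=True, exist_ok=True)
    # Move the file and its sidecars (.dbf, .shx, .cpg, ...) as one unit;
    # list() materializes the matches before the directory is modified
    for sibling in list(source_path.parent.glob(f"{source_path.stem}.*")):
        shutil.move(str(sibling), str(quarantine_dir / sibling.name))
    return None
```

A caller would skip conversion whenever the function returns None, leaving the quarantined files for manual review instead of assigning a default projection.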

Always validate the geo metadata of output files against the GeoParquet specification, consult the Apache Parquet Python documentation for footer and writer behavior, and run spatial queries in DuckDB or PostGIS to confirm geometry integrity before promoting to production.

Conclusion

Preserving metadata during format transitions is not a cosmetic requirement; it is foundational to data lineage, spatial accuracy, and regulatory compliance. By explicitly extracting source attributes, coercing types to Parquet standards, and injecting a compliant geo footer, platform teams can eliminate silent degradation and build resilient geospatial data lakes. Implement the extraction, mapping, and injection steps outlined above, integrate automated validation, and your pipelines will consistently produce query-ready, standards-compliant GeoParquet files.
