Schema Mapping for Legacy to Modern Formats

Migrating geospatial datasets from legacy storage formats to modern columnar architectures requires precise schema translation. Shapefiles, legacy PostGIS exports, and proprietary CAD/GIS formats often carry implicit type assumptions, inconsistent coordinate precision, and fragmented metadata. When targeting compressed, cloud-native formats like GeoParquet, FlatGeobuf, or Zarr, schema mapping becomes the foundational step that dictates query performance, storage efficiency, and downstream analytical reliability.

This guide provides a production-ready workflow for GIS data engineers, Python backend developers, and cloud architects. It covers schema discovery, type alignment, spatial normalization, and validation patterns required to build resilient migration pipelines within broader Data Conversion & Migration Pipelines architectures.

Prerequisites

Before implementing schema mapping logic, ensure your environment meets the following baseline requirements:

  • Python 3.9+ with pip or conda environment isolation
  • Core Libraries: geopandas>=0.14, pyarrow>=14.0, shapely>=2.0, fiona>=1.9, pyproj>=3.0
  • Cloud SDK (optional but recommended): boto3 or gcsfs for object storage staging
  • Baseline Knowledge: Understanding of WKT/WKB geometry serialization, CRS transformation (EPSG/OGC URN), and columnar storage principles
  • Test Dataset: A representative legacy dataset (e.g., .shp, .gdb layer, or legacy CSV with lat/lon columns) containing mixed types, null geometries, and legacy attribute names

Install dependencies:

bash
pip install "geopandas>=0.14" "pyarrow>=14.0" "shapely>=2.0" "fiona>=1.9" "pyproj>=3.0"

1. Schema Discovery & Profiling

Legacy formats rarely expose a complete, trustworthy schema. Shapefiles declare field types in the DBF header, but only a handful of them (character, numeric, date, logical), and field names are capped at 10 characters, while legacy databases may use VARCHAR(255) for numeric codes. Begin by extracting raw field names, detected types, geometry types, and CRS metadata. Profile null ratios, string cardinality, and numeric ranges to identify truncation risks before committing to a target schema.

Automated profiling should run as a pre-flight check in any Building Batch Conversion Pipelines with Python workflow. Use geopandas to inspect the DataFrame schema, then cross-reference with pyarrow type inference to flag mismatches early.

python
import geopandas as gpd

gdf = gpd.read_file("legacy_data.shp")
print(gdf.dtypes)
print(f"Geometry Type: {gdf.geometry.geom_type.unique()}")
print(f"CRS: {gdf.crs}")
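
The attribute profiling described above (null ratios, cardinality, numeric ranges) could be sketched roughly as follows. This is a minimal illustration rather than a definitive implementation: `profile_columns` and the sample column names are hypothetical, and the sketch assumes the attributes are already in a plain pandas DataFrame (for a GeoDataFrame, drop the geometry column first).

```python
import pandas as pd

def profile_columns(df: pd.DataFrame) -> dict:
    """Summarize each column: dtype, null ratio, cardinality, and range or max length."""
    report = {}
    for name in df.columns:
        col = df[name]
        stats = {
            "dtype": str(col.dtype),
            "null_ratio": float(col.isna().mean()),
            "cardinality": int(col.nunique(dropna=True)),
        }
        if pd.api.types.is_numeric_dtype(col):
            # Numeric range helps spot overflow risk for narrower integer types
            stats["min"] = col.min()
            stats["max"] = col.max()
        else:
            # Longest string helps spot truncation risk
            lengths = col.dropna().astype(str).str.len()
            stats["max_len"] = int(lengths.max()) if not lengths.empty else 0
        report[name] = stats
    return report

# Hypothetical sample standing in for a legacy attribute table
sample = pd.DataFrame({
    "PARCEL_ID": ["0012", "0013", None, "0013"],
    "AREA": [10.5, 20.25, 3.0, None],
})
report = profile_columns(sample)
```

A string column such as `PARCEL_ID` with leading zeros is a typical case where the profile should inform the mapping: casting it to an integer would silently discard the padding.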

2. Type Alignment & Precision Control

Map legacy types to strict Arrow/Parquet equivalents. Avoid implicit casting. Enforce int32/int64 for identifiers, float64 for measurements, and string for categorical attributes. Control decimal precision explicitly to prevent storage bloat and ensure deterministic query results.

Refer to Preserving Attribute Data Types During Conversion for detailed casting matrices and overflow prevention strategies. When mapping to the Apache Arrow type system, explicitly define a pa.schema() object rather than relying on automatic inference. This guarantees consistent column ordering and prevents downstream type coercion errors in query engines such as DuckDB or Trino.

python
import pyarrow as pa

target_schema = pa.schema([
    ("id", pa.int64()),
    ("measurement", pa.float64()),
    ("category", pa.string()),
    ("geometry", pa.large_binary())  # WKB encoding
])

3. Spatial Reference & Geometry Normalization

Legacy datasets frequently mix geometry types (e.g., MULTIPOLYGON and POLYGON in the same column) or use outdated CRS definitions. Normalize to a single geometry type where possible, or explicitly declare mixed-type support in the output format. Transform coordinates to a standardized CRS (typically EPSG:4326 for global data or a regional projected CRS for analytical workloads) using pyproj and shapely.

The GeoParquet specification records the CRS per geometry column in the file-level "geo" metadata; when the crs field is omitted, readers assume OGC:CRS84 (longitude/latitude on WGS 84). Always validate that the output CRS matches the target standard before serialization. Mixed geometry columns should be unified by promoting single-part geometries to their multi-part equivalents (MULTIPOLYGON/MULTILINESTRING) to maintain query compatibility.
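
Reprojection changes coordinate values; assigning a CRS label does not. The following minimal sketch shows a coordinate transformation with pyproj, assuming for illustration that the source data is in Web Mercator (EPSG:3857). In geopandas, the corresponding call is `to_crs()`, whereas `set_crs()` only attaches or overrides the label.

```python
from pyproj import Transformer

# always_xy=True keeps axis order as (x, y), i.e. (easting, northing) in and (lon, lat) out
to_wgs84 = Transformer.from_crs("EPSG:3857", "EPSG:4326", always_xy=True)

# Illustrative Web Mercator coordinate: one degree east of the prime meridian, on the equator
x_m, y_m = 111319.49079327357, 0.0
lon, lat = to_wgs84.transform(x_m, y_m)
```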

python
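from shapely.geometry import MultiPolygon

# Promote Polygon to MultiPolygon; leave nulls and other types untouched
gdf["geometry"] = gdf["geometry"].apply(
    lambda geom: MultiPolygon([geom])
    if geom is not None and geom.geom_type == "Polygon"
    else geom
)
# to_crs() reprojects coordinates; set_crs() would only relabel them.
# Assumes gdf.crs is already set (call set_crs() first if the source has none).
gdf = gdf.to_crs("EPSG:4326")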
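
The same promotion idea extends to points and lines. Below is a small sketch under the assumption that every single-part geometry should become its multi-part equivalent; `promote_to_multi` is a hypothetical helper, not a shapely or geopandas function.

```python
from shapely.geometry import (
    LineString, MultiLineString, MultiPoint, MultiPolygon, Point, Polygon,
)

# Single-part type name -> multi-part constructor
_MULTI = {"Point": MultiPoint, "LineString": MultiLineString, "Polygon": MultiPolygon}

def promote_to_multi(geom):
    """Return the multi-part equivalent of a single-part geometry; pass others through."""
    if geom is None or geom.is_empty:
        return geom
    wrapper = _MULTI.get(geom.geom_type)
    return wrapper([geom]) if wrapper else geom

square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
promoted = promote_to_multi(square)
```

It can be applied column-wide with `gdf["geometry"].apply(promote_to_multi)`. Geometries that are already multi-part, null, or empty are returned unchanged.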

4. Null Handling & Validation Logic

Missing geometries and incomplete attribute records are common in legacy exports. Rather than dropping rows or allowing silent failures, implement explicit null routing. Define fallback behaviors: convert None geometries to POINT EMPTY or GEOMETRYCOLLECTION EMPTY, and replace string nulls with explicit "" or None depending on downstream query requirements.
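
As a rough sketch of explicit null routing, the helper below replaces missing geometries with an empty geometry and records which positions were patched, so the substitution is auditable rather than silent. `route_null_geometries` is a hypothetical name, and the choice of GEOMETRYCOLLECTION EMPTY as the default fallback is an assumption for illustration.

```python
from shapely.geometry import GeometryCollection, Point

def route_null_geometries(geoms, fallback=None):
    """Replace None with an empty geometry and report which positions were patched."""
    fallback = GeometryCollection() if fallback is None else fallback
    cleaned, patched = [], []
    for i, geom in enumerate(geoms):
        if geom is None:
            cleaned.append(fallback)
            patched.append(i)
        else:
            cleaned.append(geom)
    return cleaned, patched

cleaned, patched = route_null_geometries([Point(0, 0), None, Point(1, 1)])
```

Logging or persisting the `patched` indexes gives downstream consumers a way to distinguish records that were empty at the source from records that were filled in during migration.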

For comprehensive strategies on managing incomplete spatial records, review Handling Null Values in Spatial Schema Mapping. Validation should occur post-transformation using assertion checks that verify row counts, geometry validity, and schema conformity before writing to disk.

python
# Validate geometry integrity
assert gdf["geometry"].is_valid.all(), "Invalid geometries detected post-normalization"
assert gdf.crs.to_epsg() == 4326, "CRS mismatch in normalized dataset"

5. Serialization & Metadata Preservation

When writing to cloud-native formats, schema mapping extends beyond column types. You must preserve dataset-level metadata, including field descriptions, provenance tags, and coordinate precision notes. GeoParquet and FlatGeobuf support custom metadata dictionaries, but they require explicit attachment during the write phase.

Consult Preserving Metadata During GeoParquet Conversion for implementation patterns that attach JSON-encoded metadata to the Parquet footer. Always serialize geometries as WKB (Well-Known Binary) rather than WKT to minimize storage footprint and accelerate spatial index construction.

python
import json

import pyarrow as pa
import pyarrow.parquet as pq

# `gdf` and `target_schema` come from the prior snippets in this section.
# to_wkb() returns a plain DataFrame with geometry encoded as WKB bytes.
df = gdf.to_wkb()

table = pa.Table.from_pandas(df, schema=target_schema, preserve_index=False)
geo_meta = {
    "version": "1.0.0",  # required by the GeoParquet spec
    "primary_column": "geometry",
    "columns": {"geometry": {"encoding": "WKB", "geometry_types": ["MultiPolygon"]}},
}
# json.dumps() guarantees valid JSON in the Parquet footer
table = table.replace_schema_metadata({"geo": json.dumps(geo_meta)})
pq.write_table(table, "output_geoparquet.parquet")

6. Pipeline Integration & Evolution

Schema mapping is rarely a one-time operation. Source systems evolve, legacy vendors deprecate formats, and downstream consumers request new attributes. Implement schema versioning, drift detection, and backward-compatible evolution strategies to prevent pipeline breakage.

Adopt Schema Evolution Best Practices for GIS Pipelines to manage additive changes safely. Use pyarrow’s unify_schemas() for merging incremental loads, and maintain a manifest file tracking schema versions, CRS baselines, and transformation rules. This enables reproducible migrations and simplifies emergency rollbacks when upstream data contracts shift unexpectedly.

Production Code Example

The following script consolidates the workflow into a reusable, production-grade function. It handles discovery, type alignment, spatial normalization, validation, and Parquet serialization with explicit error routing.

python
import json
import logging
from typing import Optional

import geopandas as gpd
import pyarrow as pa
import pyarrow.parquet as pq
from pyproj import CRS
from shapely.geometry import MultiPolygon

logging.basicConfig(level=logging.INFO)

def migrate_legacy_to_geoparquet(
    input_path: str,
    output_path: str,
    target_crs: str = "EPSG:4326",
    id_col: str = "id",
    numeric_cols: Optional[list[str]] = None,
    string_cols: Optional[list[str]] = None,
) -> None:
    logging.info(f"Loading legacy dataset: {input_path}")
    gdf = gpd.read_file(input_path)
    if gdf.crs is None:
        raise ValueError("Source has no CRS; assign one with set_crs() before migrating")

    # 1. Normalize geometries: promote Polygon to MultiPolygon, then reproject
    gdf["geometry"] = gdf["geometry"].apply(
        lambda geom: MultiPolygon([geom])
        if geom is not None and geom.geom_type == "Polygon"
        else geom
    )
    gdf = gdf.to_crs(target_crs)  # transforms coordinates, unlike set_crs()

    # 2. Enforce type alignment and collect the matching Arrow fields
    fields = []
    if id_col in gdf.columns:
        gdf[id_col] = gdf[id_col].astype("int64")
        fields.append(pa.field(id_col, pa.int64()))
    for col in numeric_cols or []:
        if col in gdf.columns:
            gdf[col] = gdf[col].astype("float64")
            fields.append(pa.field(col, pa.float64()))
    for col in string_cols or []:
        if col in gdf.columns:
            gdf[col] = gdf[col].astype("string")
            fields.append(pa.field(col, pa.string()))
    fields.append(pa.field("geometry", pa.large_binary()))
    arrow_schema = pa.schema(fields)

    # 3. Validate while the geometry column is still a GeoSeries
    if gdf.crs != CRS.from_user_input(target_crs):
        raise ValueError("CRS validation failed")
    geometry_types = sorted(gdf.geometry.geom_type.dropna().unique())

    # 4. Serialize geometry to WKB (to_wkb() returns a plain DataFrame)
    df = gdf.to_wkb()
    table = pa.Table.from_pandas(df, schema=arrow_schema, preserve_index=False)

    # 5. Attach GeoParquet metadata as valid JSON, then write
    geo_meta = {
        "version": "1.0.0",
        "primary_column": "geometry",
        "columns": {
            "geometry": {
                "encoding": "WKB",
                "geometry_types": geometry_types,
                "crs": json.loads(gdf.crs.to_json()),  # PROJJSON
            }
        },
    }
    table = table.replace_schema_metadata({"geo": json.dumps(geo_meta)})

    pq.write_table(table, output_path, compression="snappy")
    logging.info(f"Successfully migrated to {output_path}")

# Example execution
# migrate_legacy_to_geoparquet("legacy.shp", "output.parquet", numeric_cols=["area_km2"], string_cols=["region_name"])

Key Takeaways

  • Explicit schemas prevent silent data corruption. Never rely on automatic type inference when migrating legacy GIS data.
  • Normalize geometry types and CRS early. Mixed geometries and outdated projections break downstream spatial indexes and query engines.
  • Validate before writing. Row counts, geometry validity, and schema conformity must pass pre-flight checks.
  • Version your mappings. Treat schema definitions as code. Track changes, document fallback routes, and maintain backward compatibility for incremental loads.

By treating schema translation as a deterministic, testable pipeline stage rather than an ad-hoc conversion step, engineering teams can scale migrations safely, reduce cloud storage costs, and maintain analytical consistency across modern data platforms.
