Handling Null Values in Spatial Schema Mapping

Handling null values in spatial schema mapping requires explicit null-aware type coercion during ETL. Legacy formats lack true null support, while modern columnar formats enforce strict nullable semantics. The reliable approach is to standardize on None/NaN at the Python layer, enforce nullable geometry and attribute columns via PyArrow schemas, and apply sentinel fallbacks only when targeting legacy sinks. Never rely on implicit type coercion; always declare nullability upfront in your schema definition before writing to compressed or modern storage formats.

Null Semantics in Geospatial Data

Geospatial nulls fall into three distinct categories, each requiring different handling during Data Conversion & Migration Pipelines:

  1. Missing Geometry: The feature exists in the attribute table but has no coordinate representation. Modern formats store this as a true NULL geometry. Legacy formats often drop the row, write invalid coordinates, or silently coerce to (0, 0).
  2. Missing Attributes: Standard relational nulls (NULL, NaN, None). These map cleanly to Arrow/Parquet but break Shapefile DBF encoding, which requires placeholder values like 0 or empty strings.
  3. Empty Geometry: Valid geometry objects that contain no coordinates (POINT EMPTY, POLYGON EMPTY). These are not nulls. Confusing empty geometries with nulls causes topology validation failures and breaks spatial indexing downstream.

When migrating from shapefiles, GML, or KML to compressed formats, you must explicitly separate these states. Implicit conversions corrupt spatial joins, bounding box calculations, and downstream analytics.
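As a minimal sketch of keeping these states apart, the helper below labels a single geometry value as missing, empty, or present. The function name classify_geometry and the sample values are illustrative assumptions, not part of Shapely or GeoPandas.

```python
from typing import Optional

from shapely.geometry import Point
from shapely.geometry.base import BaseGeometry


def classify_geometry(geom: Optional[BaseGeometry]) -> str:
    """Return 'missing', 'empty', or 'present' for a single geometry value."""
    if geom is None:
        return "missing"  # no geometry object at all: a true null
    if geom.is_empty:
        return "empty"    # a geometry object that holds no coordinates
    return "present"


# Hypothetical sample: one null, one empty point, one ordinary point
samples = [None, Point(), Point(1.0, 2.0)]
states = [classify_geometry(geom) for geom in samples]
print(states)
```

Checking for None before is_empty matters here: calling is_empty on a null would raise, and treating the two as one state is exactly the confusion described above.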

Core Rules for Null-Aware Schema Mapping

  • Declare nullability explicitly: Every column in your target schema that can hold missing values must specify nullable=True. Relying on auto-inference can drop nulls or force unsafe type widening, such as an integer column silently becoming float.
  • Preserve None over sentinels: Use None for missing geometries and NaN for missing numeric attributes. Reserve sentinel values (e.g., -9999, "UNKNOWN") only for legacy sinks that cannot represent true nulls.
  • Separate geometry from attributes: Null handling differs between spatial and tabular data. Process geometry WKB serialization independently from attribute casting to avoid cross-contamination.
  • Validate before write: Run null-count assertions and geometry validity checks after schema application. Catch coercion errors before they hit object storage or data warehouses.

Working Implementation: Null-Aware Schema Mapping

The following Python workflow uses geopandas, shapely, and pyarrow to map a legacy dataset to a modern nullable schema. It preserves null semantics without silent coercion and produces the nullable WKB geometry column that the OGC GeoParquet specification builds on.

python
import geopandas as gpd
import pandas as pd
import pyarrow as pa
import shapely

def map_to_nullable_spatial_schema(gdf: gpd.GeoDataFrame) -> pa.Table:
    """
    Converts a GeoDataFrame to a PyArrow Table with explicit null handling.
    Preserves missing geometries and attributes without coercion.
    Expects the columns id, name, elevation, status, and geometry.
    """
    # 1. Convert geometry to WKB binary, preserving nulls.
    # GeoPandas yields None for a missing geometry; GeoParquet expects a null value.
    wkb_values = [
        shapely.to_wkb(geom) if geom is not None else None
        for geom in gdf["geometry"]
    ]

    # 2. Define explicit nullable Arrow fields: pa.field(name, type, nullable)
    # Note: Use pa.binary() or pa.large_binary() depending on expected WKB size
    attribute_schema = pa.schema([
        pa.field("id", pa.int64(), nullable=True),
        pa.field("name", pa.string(), nullable=True),
        pa.field("elevation", pa.float64(), nullable=True),
        pa.field("status", pa.string(), nullable=True),
    ])
    geometry_field = pa.field("geometry", pa.binary(), nullable=True)  # WKB with null support

    # 3. Build the attribute table with null preservation.
    # drop() returns a new object, so the source GeoDataFrame is not mutated.
    table = pa.Table.from_pandas(
        pd.DataFrame(gdf.drop(columns=["geometry"])),
        schema=attribute_schema,
        preserve_index=False,
    )

    # 4. Attach geometry column separately to ensure exact null alignment
    geom_array = pa.array(wkb_values, type=geometry_field.type)
    table = table.append_column(geometry_field, geom_array)

    # 5. Validate null counts match source
    expected_nulls = int(gdf["geometry"].isna().sum())
    assert table.column("geometry").null_count == expected_nulls, \
        "Geometry null count mismatch during conversion"

    return table

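Why this works: The function isolates geometry serialization, declares every field with an explicit nullable=True, and validates null preservation before returning the table. This pattern guards against the silent row dropping that some GDAL-based conversions exhibit and is consistent with Apache Arrow's null semantics. Writing a fully compliant GeoParquet file additionally requires the file-level geo metadata, which is outside the scope of this function.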

Format-Specific Considerations

Shapefile (ESRI Shapefile)

Shapefiles cannot represent true nulls in numeric or string fields. DBF encoding forces 0 or empty strings. When mapping from shapefiles, treat 0 in numeric fields and "" in string fields as potential nulls only if documented. When mapping to shapefiles, replace None with format-safe sentinels and log the transformation.
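A minimal sketch of the write-side fallback, assuming pandas for the attribute table. The sentinel values, the logger name, and the helper apply_sentinels are hypothetical choices for illustration; pick sentinels that cannot collide with real values in your data.

```python
import logging

import pandas as pd

logger = logging.getLogger("null_mapping")  # hypothetical logger name

# Hypothetical sentinel per column, used only for sinks without true nulls
SENTINELS = {"elevation": -9999.0, "status": "UNKNOWN"}


def apply_sentinels(df: pd.DataFrame, sentinels: dict) -> tuple:
    """Replace missing values with sentinels and report how many were replaced."""
    out = df.copy()
    replaced = {}
    for column, sentinel in sentinels.items():
        mask = out[column].isna()
        replaced[column] = int(mask.sum())
        # Keep existing values, substitute the sentinel where the value is missing
        out[column] = out[column].where(~mask, sentinel)
        logger.info("Replaced %d nulls in %r with %r", replaced[column], column, sentinel)
    return out, replaced


attributes = pd.DataFrame({"elevation": [10.0, None], "status": ["ok", None]})
legacy_ready, replaced_counts = apply_sentinels(attributes, SENTINELS)
print(legacy_ready)
print(replaced_counts)
```

Returning the replacement counts alongside the frame gives the pipeline something concrete to log or store, so the sentinels can be mapped back to nulls on a later read.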

GeoParquet & FlatGeobuf

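Both formats support true null geometries and nullable attributes. GeoParquet stores geometry as WKB in a nullable binary column, so a missing geometry is recorded as a Parquet null and not as a placeholder value. FlatGeobuf uses a similarly null-aware binary layout. Ensure your PyArrow schema marks geometry as nullable=True, and avoid coercing None to POINT EMPTY: downstream checks such as ST_IsEmpty() and IS NULL answer different questions, and the substitution makes a missing geometry look like a deliberately empty one.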

PostGIS / Cloud Data Warehouses

When loading into PostGIS, Snowflake, or BigQuery, map None to a NULL geometry and convert NaN in numeric columns to SQL NULL. These engines treat a floating-point NaN as an ordinary value, so a NaN left in place will not match IS NULL. Avoid string placeholders like "NULL" or "MISSING", as they bypass spatial index optimizations and break IS NULL queries. As with any Schema Mapping for Legacy to Modern Formats, always verify that target database drivers respect Arrow null bitmaps during bulk inserts.

Validation & Testing Checklist

Run these checks after schema application to guarantee null integrity:

  1. Null Count Parity: For every column, source_df[column].isna().sum() must equal target_table.column(column).null_count.
  2. Geometry Validity: Filter out None geometries and run shapely.is_valid() on remaining features. Invalid geometries should be logged, not silently dropped.
  3. Spatial Join Test: Perform a point-in-polygon or nearest-neighbor join using the mapped table. Verify that null geometries do not trigger IndexError or return false positives.
  4. Format Round-Trip: Write to target format, read back, and compare null masks. Any discrepancy indicates driver-level coercion.
  5. Downstream Query Test: Run SELECT COUNT(*) queries filtered on geometry IS NULL and on elevation IS NULL in the target system. Results must match source expectations.

Common Pitfalls to Avoid

  • Coercing NaN to 0 in numeric attributes: Breaks statistical aggregations and machine learning pipelines. Keep NaN for floats, and use explicit nulls for integers (for example, pandas' nullable Int64 dtype or an Arrow int64 column).
  • Using POINT EMPTY as a null substitute: Empty geometries pass topology checks but fail IS NULL predicates. Reserve them for features that are genuinely empty, not for features whose geometry is unknown.
  • Relying on gdf.to_parquet() without schema control: Type inference happens before the write, so an integer column with missing values may already have been widened to float64, and null handling can differ across GeoPandas versions. For production pipelines, cast to an explicit schema first or build the table with pyarrow directly.
  • Ignoring CRS null behavior: Some projections drop features with missing coordinates during transformation. Validate CRS consistency before spatial operations.

By enforcing explicit null semantics, validating schema alignment, and respecting format-specific constraints, you eliminate silent data corruption and ensure reliable spatial analytics across modern data platforms.