What is the difference between a null geometry and an empty geometry?

A null geometry means the feature has no coordinate representation at all: the cell in the geometry column is absent. An empty geometry (POINT EMPTY, POLYGON EMPTY) is a valid geometry object that contains no coordinates, so its area and length are zero. The two need different tests: in PostGIS, geom IS NULL catches nulls and ST_IsEmpty() catches empties (there is no ST_IsNull() function, and ST_IsEmpty() returns NULL rather than true for a null input). Confusing the two breaks spatial joins and bounding-box calculations.

Why does gdf.to_parquet() sometimes drop null bitmaps?

Older versions of GeoPandas (before 0.12) used auto-inferred PyArrow schemas that could widen integer types and discard null bitmaps. Always pass an explicit pa.schema() with nullable=True on every field when writing production data, or use pyarrow directly instead of the GeoPandas convenience method.

How should I handle Shapefile numeric fields that use 0 as a null substitute?

Only treat 0 as a null substitute if your data dictionary explicitly documents it. Blind coercion of 0 to None corrupts legitimate zero-value measurements. Instead, add a companion boolean null-flag column during migration and document the sentinel mapping in your schema metadata.
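The companion-flag approach can be sketched as follows, assuming pandas is available. The function name `add_null_flags`, the `<column>_is_null` suffix, and the shape of the `documented_sentinels` mapping are illustrative choices, not part of any library or of the original pipeline.

```python
import pandas as pd


def add_null_flags(df: pd.DataFrame, documented_sentinels: dict) -> pd.DataFrame:
    """Add a boolean '<column>_is_null' flag for each documented sentinel.

    documented_sentinels maps a column name to the sentinel value recorded
    in the data dictionary (hypothetical structure). Source values are left
    untouched, so legitimate zeros in other columns are never rewritten.
    """
    out = df.copy()
    for column, sentinel in documented_sentinels.items():
        out[f"{column}_is_null"] = out[column].eq(sentinel)
    return out


parcels = pd.DataFrame({
    "elevation": [0.0, 12.5, 0.0],  # 0 is a real measurement in this column
    "floors": [0, 3, 0],            # 0 is documented as "unknown" here
})
flagged = add_null_flags(parcels, {"floors": 0})
```

Only `floors` gains a flag column because only `floors` has a documented sentinel; `elevation` keeps its zeros as real measurements.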

Can PostGIS bulk-load operations respect Arrow null bitmaps?

Yes, when using COPY or ogr2ogr with the Arrow IPC format. However, JDBC/ODBC bulk inserts sometimes flatten null bitmaps to empty strings. After loading, always run SELECT COUNT(*) FROM <table> WHERE geometry IS NULL and compare the result against the expected null count from your migration log.


Handling Null Values in Spatial Schema Mapping

Standardise on None at the Python layer, enforce nullable geometry and attribute columns via explicit PyArrow schemas, and apply sentinel fallbacks only when targeting legacy sinks that cannot represent true nulls. Never rely on implicit type coercion: declare nullability upfront in your schema before writing to GeoParquet or FlatGeobuf, or you risk silent row-dropping and corrupted spatial analytics downstream. Null handling is one piece of the broader schema mapping for legacy to modern formats workflow, where it sits alongside type alignment and CRS normalisation.

Quick-Reference: Null Strategy by Format

Source / Target	Null Representation	Migration Action	Primary Use Case
Shapefile → GeoParquet	DBF uses `0` / `""` as sentinels	Treat only documented sentinels as `None`; validate with null-count assertions	Bulk archive migration to cloud object storage
GML / KML → GeoParquet	Missing elements map to `None`	Preserve `None`; do not substitute `POINT EMPTY`	Municipal and agency data ingest pipelines
GeoParquet → PostGIS	Arrow null bitmaps → SQL `NULL`	Verify with `SELECT COUNT(*) FROM <table> WHERE geom IS NULL` post-load	Analytical warehouse loading with spatial queries
Any → Shapefile	True `NULL` cannot be stored in DBF	Replace `None` with documented sentinels; log every replacement	Downstream delivery to legacy GIS desktop tools

Null Semantics in Geospatial Data

Geospatial nulls fall into three distinct categories that require separate handling during schema migration — conflating them causes topology validation failures and breaks spatial indexing.

Missing geometry means the feature exists in the attribute table but has no coordinate representation. Modern formats store this as a true NULL geometry; Shapefile toolchains may silently drop the row, write invalid coordinates, or coerce to (0, 0). Missing attributes are standard relational nulls (NULL, NaN, None) that map cleanly to columnar Parquet storage but break Shapefile DBF encoding, which requires placeholder values like 0 or empty strings. Empty geometries (POINT EMPTY, POLYGON EMPTY) are valid geometry objects that contain no coordinates, so their area and length are zero; these are not nulls. In PostGIS, ST_IsEmpty() and an IS NULL test are not interchangeable (there is no ST_IsNull() function), and confusing them breaks spatial joins and bounding-box calculations.
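The difference between a missing and an empty geometry can be checked at the Python layer. This is a minimal sketch assuming shapely 2.0 or later; the three sample geometries are made up for illustration.

```python
import shapely
from shapely import Point

# One valid point, one empty point, one missing geometry.
geoms = [Point(1.0, 2.0), Point(), None]

# is_missing flags only the None entry; is_empty flags only the empty point.
is_missing = shapely.is_missing(geoms)
is_empty = shapely.is_empty(geoms)

# Serialising keeps the distinction: the empty point becomes WKB bytes,
# while the missing geometry stays None.
wkb = shapely.to_wkb(geoms)
```

An empty geometry therefore still occupies a non-null slot in a WKB column, which is why substituting POINT EMPTY for None changes the null count.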

When migrating from Shapefiles, GML, or KML to compressed formats, you must apply three concrete rules: declare nullable=True on every column rather than relying on auto-inference; preserve None for missing geometries rather than substituting POINT EMPTY; and serialise geometry WKB independently from attribute casting to avoid cross-contamination between the spatial and tabular null handling paths.

Three geospatial null types that require separate handling during schema migration. Only missing geometry and missing attribute produce SQL IS NULL results; empty geometries are valid WKB objects.

When a Source Value Should Become `None`

The single most error-prone decision in spatial null migration is whether a given source cell represents a genuine null, a legitimate value, or a sentinel that was standing in for a null. Get it wrong in either direction and you either corrupt real measurements or fabricate data. The routing below encodes the safe default: only documented sentinels collapse to None; everything else passes through untouched, and unrepresentable nulls on legacy write-back become logged sentinels rather than silent zeros.

Safe routing for each source cell during migration. Only data-dictionary-documented sentinels collapse to None; on write-back to formats that cannot store nulls, the reverse substitution is logged rather than applied silently.
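The routing described above can be sketched in plain Python. The function names `to_python_null` and `to_legacy_value`, the logger name, and the way sentinels are passed in are all hypothetical; the sketch only illustrates the rule that documented sentinels collapse to None and that write-back substitutions are logged.

```python
import logging
import math
from typing import Any, Optional

logger = logging.getLogger("null_routing")  # hypothetical logger name


def to_python_null(value: Any, documented_sentinels: frozenset) -> Optional[Any]:
    """Route one source cell: genuine nulls and documented sentinels become None."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None  # NaN is a genuine missing attribute
    if value in documented_sentinels:
        return None  # sentinel listed in the data dictionary
    return value     # everything else passes through untouched


def to_legacy_value(value: Any, sentinel: Any, column: str, row_id: Any) -> Any:
    """Reverse routing for sinks that cannot store nulls; every swap is logged."""
    if value is None:
        logger.warning("row %s: %s is None, writing sentinel %r", row_id, column, sentinel)
        return sentinel
    return value


undocumented = to_python_null(0, frozenset())     # stays 0: no sentinel documented
documented = to_python_null(0, frozenset({0}))    # becomes None
written = to_legacy_value(None, -9999, "elevation", row_id=17)
```

Passing an empty `frozenset()` is the safe default: without a data-dictionary entry, a zero is treated as a real measurement.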

Production Implementation

The following workflow uses geopandas (≥ 0.14) and pyarrow (≥ 14.0) to map a legacy dataset to a modern nullable schema. It preserves null semantics without silent coercion and aligns with the GeoParquet specification.

python

# Requirements: geopandas>=0.14, pyarrow>=14.0, shapely>=2.0
import geopandas as gpd
import pyarrow as pa
import shapely
from typing import Optional


def map_to_nullable_spatial_schema(
    gdf: gpd.GeoDataFrame,
    attribute_schema: Optional[pa.Schema] = None,
) -> pa.Table:
    """
    Convert a GeoDataFrame to a PyArrow Table with explicit null handling.

    Preserves missing geometries and attribute nulls without coercion.
    Geometry is stored as nullable binary (WKB); attribute columns use
    the caller-supplied schema or a safe default.

    Args:
        gdf: Source GeoDataFrame, any CRS.
        attribute_schema: Optional explicit pa.Schema for non-geometry columns.
            Every field MUST use nullable=True to prevent type widening.

    Returns:
        pa.Table with a 'geometry' column (binary, nullable) appended last.

    Raises:
        AssertionError: if null counts in the output do not match the source.
    """
    df = gdf.copy()
    geom_col = gdf.geometry.name

    # 1. Standardise geometry nulls: replace any NaN/None sentinel with None.
    #    shapely treats Python None as missing geometry; GeoParquet expects
    #    a null WKB entry (null bitmap set in the Arrow column chunk).
    df[geom_col] = df[geom_col].where(df[geom_col].notna(), other=None)

    # 2. Serialise geometry to WKB, preserving None for missing features.
    wkb_series = df[geom_col].apply(
        lambda geom: shapely.to_wkb(geom) if geom is not None else None
    )

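    # 3. Build the attribute schema if not supplied.
    #    Caller MUST set nullable=True on every field; never rely on
    #    auto-inference from pandas dtypes, which can drop null bitmaps.
    #    The default below is only an example: it assumes the source has
    #    id, name, elevation and status columns, so pass your own schema
    #    for any other layout.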
    if attribute_schema is None:
        attribute_schema = pa.schema([
            pa.field("id",        pa.int64(),   nullable=True),
            pa.field("name",      pa.string(),  nullable=True),
            pa.field("elevation", pa.float64(), nullable=True),
            pa.field("status",    pa.string(),  nullable=True),
        ])

    # 4. Convert non-geometry columns to an Arrow Table.
    non_geom_df = df.drop(columns=[geom_col])
    table = pa.Table.from_pandas(
        non_geom_df,
        schema=attribute_schema,
        preserve_index=False,
    )

    # 5. Append geometry as a separate nullable binary column.
    #    Attaching it last keeps the schema aligned with GeoParquet metadata.
    geom_array = pa.array(list(wkb_series), type=pa.binary())
    table = table.append_column(
        pa.field("geometry", pa.binary(), nullable=True),
        geom_array,
    )

    # 6. Validate null counts match the source to catch silent coercion.
    src_geom_nulls = int(df[geom_col].isna().sum())
    out_geom_nulls = table.column("geometry").null_count
    if out_geom_nulls != src_geom_nulls:
        raise AssertionError(
            f"Geometry null count mismatch: "
            f"expected {src_geom_nulls}, got {out_geom_nulls}. "
            "Check for implicit None→POINT EMPTY coercion in the geometry column."
        )

    for col in attribute_schema.names:
        src_nulls = int(non_geom_df[col].isna().sum())
        out_nulls = table.column(col).null_count
        if out_nulls != src_nulls:
            raise AssertionError(
                f"Null count mismatch on column '{col}': "
                f"expected {src_nulls}, got {out_nulls}."
            )

    return table

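The function isolates geometry serialisation, declares nullable=True on every field of its default schema via pa.field(name, type, nullable=True), and validates null preservation before returning. It does not inspect a caller-supplied schema, so on that path the nullable=True requirement rests with the caller. The null-count assertions guard against the silent row-dropping behaviour common in GDAL-based converters.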

Validation and Verification

After calling map_to_nullable_spatial_schema, run a round-trip check before committing data to object storage. This catches driver-level coercion that occurs at write time rather than at schema construction time:

python

import geopandas as gpd
import pyarrow as pa
import pyarrow.parquet as pq

def verify_null_integrity(
    table: pa.Table,
    source_gdf: gpd.GeoDataFrame,
    output_path: str,
) -> None:
    """Write to Parquet, read back, and compare null counts column-by-column."""
    pq.write_table(table, output_path, compression="zstd")
    roundtrip = pq.read_table(output_path)

    for col in table.schema.names:
        written_nulls = table.column(col).null_count
        read_nulls    = roundtrip.column(col).null_count
        assert written_nulls == read_nulls, (
            f"Round-trip null mismatch on '{col}': "
            f"written={written_nulls}, read={read_nulls}. "
            "Check your Parquet writer version or compression codec."
        )
    total = sum(table.column(c).null_count for c in table.schema.names)
    print(f"Round-trip OK — {output_path}: {table.num_rows} rows, "
          f"{total} total nulls preserved.")

Expected output for a clean migration:

markup

Round-trip OK — output/parcels.parquet: 142803 rows, 317 total nulls preserved.

If null counts diverge after the round-trip, the most common cause is writing with an older pyarrow version (< 12.0) that did not honour the null bitmap for binary columns. Pin pyarrow>=14.0 in your requirements.txt. When loading into PostGIS, additionally run SELECT COUNT(*) FROM <table> WHERE geometry IS NULL and compare the result against the expected null count from your migration log, because JDBC connectors sometimes flatten Arrow null bitmaps to empty strings during bulk inserts.
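The post-load check can be scripted against any DB-API 2.0 connection. The sketch below is an assumption-laden illustration: `count_null_geometries` is a hypothetical helper, the `parcels` table is made up, and sqlite3 stands in for PostGIS only so the example runs without a database server. With PostGIS you would pass a connection from your PostgreSQL driver instead.

```python
import sqlite3


def count_null_geometries(conn, table: str, geom_column: str = "geometry") -> int:
    """Count rows whose geometry column is SQL NULL."""
    # Identifiers cannot be bound as query parameters, so only plain names
    # are accepted before they are interpolated into the statement.
    for ident in (table, geom_column):
        if not ident.isidentifier():
            raise ValueError(f"Unsafe identifier: {ident!r}")
    cur = conn.cursor()
    cur.execute(f'SELECT COUNT(*) FROM "{table}" WHERE "{geom_column}" IS NULL')
    return cur.fetchone()[0]


# Stand-in database: one real value, one NULL, one flattened empty value.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parcels (id INTEGER, geometry BLOB)")
conn.executemany(
    "INSERT INTO parcels VALUES (?, ?)",
    [(1, b"\x01"), (2, None), (3, b"")],
)

expected_nulls = 2  # value that would come from the migration log
actual_nulls = count_null_geometries(conn, "parcels")
nulls_preserved = actual_nulls == expected_nulls
```

Here the check fails (`nulls_preserved` is False) because row 3 arrived as an empty value rather than NULL, which is the flattening symptom the comparison is meant to expose.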

Edge Cases and Caveats

Large mixed-CRS datasets. Some projections silently drop features with missing coordinates during CRS transformation. Run gdf[gdf.geometry.notna()].to_crs(target_crs) and re-merge with the null-geometry rows afterwards, rather than transforming the full GeoDataFrame and hoping the driver propagates nulls correctly through the reprojection.
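The split-and-re-merge pattern might look like the sketch below. It assumes geopandas with pyproj installed and a unique index on the source frame; `reproject_preserving_nulls` is a hypothetical helper name and the sample data is made up.

```python
import geopandas as gpd
import pandas as pd
from shapely import Point


def reproject_preserving_nulls(gdf: gpd.GeoDataFrame, target_crs) -> gpd.GeoDataFrame:
    """Reproject rows that have a geometry, then re-attach null-geometry rows."""
    has_geom = gdf.geometry.notna()
    projected = gdf[has_geom].to_crs(target_crs)
    # Null geometries have nothing to transform, so only their CRS label changes.
    missing = gdf[~has_geom].set_crs(target_crs, allow_override=True)
    # Concatenate and restore the source row order (requires a unique index).
    merged = pd.concat([projected, missing]).loc[gdf.index]
    result = gpd.GeoDataFrame(merged, geometry=gdf.geometry.name, crs=target_crs)
    # Fail loudly if any null geometry was lost or invented along the way.
    if int(result.geometry.isna().sum()) != int((~has_geom).sum()):
        raise AssertionError("Null geometry count changed during reprojection.")
    return result


source = gpd.GeoDataFrame(
    {"id": [1, 2, 3]},
    geometry=[Point(0, 0), None, Point(1, 1)],
    crs="EPSG:4326",
)
reprojected = reproject_preserving_nulls(source, "EPSG:3857")
```

The explicit count check at the end mirrors the null-count assertions used elsewhere in this workflow, so a driver that drops rows is caught at the reprojection step rather than downstream.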

Integer columns in Parquet. Arrow’s int64 type has no NaN value; missing integers are recorded in the null bitmap instead. Pandas represents integer nulls using pd.NA in the Int64 (capitalised) nullable-integer dtype. If a plain numpy int64 column receives a missing value, Pandas silently casts it to float64. Declare pa.field("count", pa.int64(), nullable=True) explicitly. pa.Table.from_pandas() has no pandas_metadata argument; if you do not want Pandas dtype round-trip metadata embedded in the Parquet footer, remove the pandas key from the schema metadata with table.replace_schema_metadata() before writing.

Streaming ingestion to Kafka or object storage. Null bitmaps are row-group-scoped in Parquet; in streaming contexts the null count is not known until the row group is finalised. For incremental writes, use fallback routing patterns to quarantine rows where geometry serialisation fails, rather than letting a single null coercion corrupt an entire partition.
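One way to quarantine failing rows per batch is sketched below in plain Python. `split_batch` and `fake_to_wkb` are hypothetical names, and the string-based serialiser is only a stand-in so the example runs without a geometry library; in practice you would pass a real serialiser such as shapely.to_wkb.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


def split_batch(
    rows: List[Row],
    serialise: Callable[[Any], bytes],
) -> Tuple[List[Row], List[Row]]:
    """Return (writable_rows, quarantined_rows) for one incoming batch."""
    writable: List[Row] = []
    quarantined: List[Row] = []
    for row in rows:
        geom = row.get("geometry")
        if geom is None:
            # A true null is valid data: pass it through unchanged.
            writable.append(dict(row))
            continue
        try:
            writable.append({**row, "geometry": serialise(geom)})
        except Exception as exc:  # broad on purpose: any failure is quarantined
            quarantined.append({**row, "error": repr(exc)})
    return writable, quarantined


def fake_to_wkb(geom: Any) -> bytes:
    """Stand-in serialiser for this example only."""
    if not isinstance(geom, str):
        raise TypeError(f"cannot serialise {type(geom).__name__}")
    return geom.encode("utf-8")


batch = [
    {"id": 1, "geometry": "POINT (1 2)"},
    {"id": 2, "geometry": None},
    {"id": 3, "geometry": 42},  # malformed input
]
writable, quarantined = split_batch(batch, fake_to_wkb)
```

Null geometries stay in the writable set, so the quarantine only receives rows that genuinely failed to serialise and the partition's null count remains meaningful.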

Frequently Asked Questions

Schema Mapping for Legacy to Modern Formats — type alignment, CRS normalisation, and schema evolution strategies
Shapefile Limitations in Modern Data Stacks — why DBF encoding forces sentinel-based nulls
Building Batch Conversion Pipelines with Python — orchestrating large-scale format migrations
Fallback Routing for Failed Migration Jobs — quarantining null-coercion errors without halting the pipeline
Understanding Parquet Columnar Storage for GIS — how null bitmaps work inside Parquet column chunks
