Retry Logic for Cloud Migration Pipelines
Retry logic for cloud migration pipelines is a deterministic, idempotent error-handling strategy that automatically re-executes transiently failed data transfer, compression, or format conversion tasks. By combining exponential backoff, jitter, and state-aware validation, it prevents pipeline collapse during API throttling, network partitions, or temporary compute stalls. For GIS data engineers and cloud architects, this ensures large raster and vector datasets migrate reliably to modern formats like Cloud Optimized GeoTIFF (COG), GeoParquet, or Zarr without manual intervention or data corruption.
Why Geospatial Workloads Require Specialized Retries
Generic HTTP retry loops break down in spatial contexts. Geospatial pipelines involve chunked I/O, coordinate reference system (CRS) transformations, and heavy compression algorithms (ZSTD, DEFLATE). Cloud object storage APIs frequently return 503 SlowDown or 429 Too Many Requests during bulk multipart uploads. GDAL-based conversions stall on malformed geometries, projection mismatches, or memory spikes when processing high-resolution satellite imagery.
A naive retry strategy triggers thundering herd problems, duplicates partially written Parquet row groups, or corrupts Zarr chunk manifests. Effective retry logic must be:
- Idempotent: Safe to repeat without duplicating data or overwriting valid outputs
- Format-aware: Understands partial writes in columnar (GeoParquet) or chunked (COG/Zarr) layouts
- Backoff-calibrated: Respects cloud provider rate limits and storage gateway concurrency caps
- State-tracked: Skips successfully migrated tiles/chunks using ETags, content hashes, or transaction logs
Core Principles for Production Systems
1. Transient vs. Permanent Error Classification
Not all failures are equal. Infrastructure timeouts, 5xx API responses, and temporary disk I/O contention are transient and warrant retries. Invalid CRS definitions, unsupported GDAL drivers, and corrupted source files are permanent and should fail fast. Implementing a Fallback Routing for Failed Migration Jobs strategy ensures that when retries exhaust, the pipeline gracefully diverts problematic assets to a quarantine bucket rather than halting the entire batch.
2. Idempotency via Content Hashing
Before attempting a conversion or upload, verify whether the destination already exists and matches the source. Store a SHA-256 hash of the input file in object metadata. If the hash matches on a subsequent run, skip the operation entirely. This prevents duplicate row groups in GeoParquet and avoids unnecessary compute spend.
3. Jitter to Prevent Thundering Herds
Exponential backoff alone causes synchronized retry spikes when multiple workers fail simultaneously. Adding randomized jitter spreads retries across a time window, keeping storage gateways and conversion clusters stable. AWS explicitly recommends full jitter for distributed systems handling bulk object operations (AWS S3 Error Best Practices).
Production Implementation
The following Python pattern uses tenacity for declarative retry control, integrates with boto3 and rasterio, and enforces idempotency before attempting conversion or upload. It isolates permanent geospatial errors from transient infrastructure failures.
import os
import logging
import hashlib
from pathlib import Path
import boto3
import rasterio
from tenacity import (
retry, stop_after_attempt, wait_exponential,
retry_if_exception_type, before_sleep_log, wait_random
)
from botocore.exceptions import ClientError, BotoCoreError
from rasterio.errors import RasterioError
logger = logging.getLogger("geo_migration.retry")
class PermanentGeoError(Exception):
"""Non-retryable geospatial conversion failures."""
pass
def _is_idempotent(dest_bucket: str, dest_key: str, src_hash: str) -> bool:
"""Check if destination exists and matches source hash."""
s3 = boto3.client("s3")
try:
head = s3.head_object(Bucket=dest_bucket, Key=dest_key)
return head.get("Metadata", {}).get("src_hash") == src_hash
except ClientError as e:
if e.response["Error"]["Code"] == "404":
return False
raise
@retry(
stop=stop_after_attempt(5),
wait=wait_exponential(multiplier=2, min=4, max=60) + wait_random(0, 2),
retry=retry_if_exception_type((ClientError, BotoCoreError, RasterioError, ConnectionError)),
before_sleep=before_sleep_log(logger, logging.WARNING),
reraise=True
)
def migrate_raster_to_cog(src_path: str, dest_bucket: str, dest_key: str, region: str = "us-east-1") -> dict:
"""Migrate a local raster to a Cloud Optimized GeoTIFF with idempotent retries."""
src_hash = hashlib.sha256(Path(src_path).read_bytes()).hexdigest()
if _is_idempotent(dest_bucket, dest_key, src_hash):
logger.info("Skipping %s: already migrated and verified.", dest_key)
return {"status": "skipped", "key": dest_key}
s3 = boto3.client("s3", region_name=region)
tmp_cog = f"/tmp/{Path(dest_key).name}"
try:
with rasterio.open(src_path) as src:
# Validate CRS upfront to fail fast on permanent errors
if not src.crs or not src.crs.is_valid:
raise PermanentGeoError(f"Invalid CRS in {src_path}")
# Convert to COG (requires rasterio >= 1.3.0)
with rasterio.open(
tmp_cog, "w",
driver="COG",
width=src.width, height=src.height,
count=src.count, dtype=src.dtypes[0],
crs=src.crs, transform=src.transform,
nodata=src.nodata, compress="deflate",
tiled=True, blockxsize=256, blockysize=256
) as dst:
for i in range(1, src.count + 1):
dst.write(src.read(i), i)
# Upload with metadata for idempotency tracking
s3.upload_file(
tmp_cog, dest_bucket, dest_key,
ExtraArgs={"Metadata": {"src_hash": src_hash}}
)
logger.info("Successfully migrated %s to s3://%s/%s", src_path, dest_bucket, dest_key)
return {"status": "success", "key": dest_key}
except ClientError as e:
code = e.response["Error"]["Code"]
if code in ("InvalidParameter", "AccessDenied"):
raise PermanentGeoError(f"Permanent S3 error {code}: {e}") from e
raise
except RasterioError as e:
if "CRS" in str(e) or "Driver" in str(e):
raise PermanentGeoError(str(e)) from e
raise
finally:
if os.path.exists(tmp_cog):
os.remove(tmp_cog)
Tuning Backoff, Jitter, and Fallbacks
The tenacity decorator above applies a base wait of 2^n seconds, capped at 60 seconds, with an additional 0–2 second random jitter. This configuration aligns with Tenacity’s official documentation for distributed workloads. Adjust stop_after_attempt based on your SLA: 3–5 attempts typically resolve transient cloud throttling, while higher counts increase queue depth and cost.
When integrating this logic into broader Data Conversion & Migration Pipelines, ensure your orchestrator (Airflow, Prefect, Dagster) treats PermanentGeoError as a terminal failure. Configure dead-letter queues to capture metadata like src_path, error_code, and attempt_count. This enables automated ticketing or manual triage without blocking downstream consumers.
Observability & Validation
Retry logic is only as reliable as its telemetry. Track these metrics at the pipeline level:
- Retry Rate: Percentage of tasks requiring ≥1 retry (target: <5%)
- Success After N Attempts: Distribution of retries before success (identifies systemic throttling)
- Permanent Failure Volume: Assets routed to quarantine (indicates upstream data quality issues)
- Idempotency Hit Rate: Percentage of runs skipped due to existing valid outputs (optimizes cost)
Validate your implementation using chaos engineering: inject 503 responses via local proxy tools (e.g., Toxiproxy), simulate partial network drops, and feed intentionally malformed GeoTIFFs to verify permanent error routing. Combine these tests with automated schema validation (e.g., rio info or geoparquet readers) to guarantee output integrity before downstream consumption.