
Scaling Down: Building a Robust Micro-ETL for $5/Mo

In a world dominated by massive data stacks—Airflow, dbt, Snowflake—it’s easy to forget that not every data problem requires a Kubernetes cluster. Recently, I found myself needing to automate a daily data pull for a few clients. The data volume was moderate ("small data"), the budget was tight, and reliability was non-negotiable.

Enter qry-micro-etl: a lightweight, containerized ETL framework designed to run comfortably on a $5/month DigitalOcean droplet (2GB RAM).

The Problem: "Enterprise" Is Heavy

Tools like Airflow are fantastic, but they introduce significant operational overhead. For a simple pipeline that fetches emails, parses CSVs, transforms data, and syncs to Google Sheets, I didn't want to manage a scheduler, a web server, and a metadata database. I just needed a script that runs, logs what happened, and exits.

The Solution: Python + DuckDB + Docker

I built qry-micro-etl with a philosophy of radical simplicity and resource efficiency.

1. Memory-First Design

Running on 2GB RAM means you can't just pd.read_csv() a 500MB file without sweating.

  • Streaming: I used Python generators to stream records through the transformation layer instead of materializing full datasets in memory.
  • DuckDB: For data merging (upserts), I leveraged DuckDB. It executes complex SQL joins over DataFrames with a fraction of pandas' memory footprint, handling larger-than-RAM datasets gracefully. A sketch of both ideas follows this list.
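
Here is a rough sketch of how those two ideas fit together. The table name, the id/value/updated_at columns, and the transform() helper are illustrative placeholders, not the actual qry-micro-etl schema:

import duckdb
import pandas as pd

def transform(chunk: pd.DataFrame) -> pd.DataFrame:
    # Placeholder cleanup step; the real transformations live elsewhere.
    chunk.columns = [c.strip().lower() for c in chunk.columns]
    return chunk

def stream_chunks(path: str, chunksize: int = 50_000):
    # Generator: yield transformed chunks instead of loading the whole file,
    # keeping peak memory roughly proportional to chunksize.
    for chunk in pd.read_csv(path, chunksize=chunksize):
        yield transform(chunk)

def upsert_file(con: duckdb.DuckDBPyConnection, path: str) -> None:
    # The target table needs a primary key for ON CONFLICT to resolve against.
    con.execute("""
        CREATE TABLE IF NOT EXISTS records (
            id INTEGER PRIMARY KEY, value DOUBLE, updated_at TIMESTAMP
        )
    """)
    for batch in stream_chunks(path):
        # DuckDB scans the registered DataFrame in place, so the merge runs
        # in SQL without a second full copy of the data living in Python.
        con.register("batch", batch)
        con.execute("""
            INSERT INTO records
            SELECT id, value, updated_at FROM batch
            ON CONFLICT (id) DO UPDATE SET
                value = excluded.value,
                updated_at = excluded.updated_at
        """)
        con.unregister("batch")

The larger-than-RAM point comes from DuckDB itself: it can spill intermediate results to disk rather than holding everything in memory, which is what keeps the merge step inside the droplet's 2GB.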

2. Containerized Reliability

To solve the "it works on my machine" problem, I containerized the application using a minimal python:3.13-slim image. This ensures that the exact same environment runs in development and on the production Linux server.

  • uv for Speed: I used uv for lightning-fast dependency resolution and installation.
  • Security: Credentials are injected via volume mounts rather than baked into the image, keeping the image clean and secrets safe; a sketch of how the app reads them follows this list.
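
The application side of that is just reading a file from a path the container mounts. A minimal sketch, assuming a hypothetical GOOGLE_CREDS_PATH variable and /secrets mount point (the real paths in qry-micro-etl may differ):

import json
import os
from pathlib import Path

# Hypothetical path and variable name: assumes the service-account key is
# bind-mounted read-only at runtime, e.g.
#   docker run -v ./secrets:/secrets:ro qry-micro-etl
CREDS_PATH = Path(os.environ.get("GOOGLE_CREDS_PATH", "/secrets/google_credentials.json"))

def load_google_credentials() -> dict:
    # The key file never exists inside the image; it lives on the host and
    # is only visible to the running container.
    return json.loads(CREDS_PATH.read_text())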

3. Observability

When a cron job fails in the forest, does it make a sound? Only if you log it. I implemented structured JSON logging. This allows logs to be easily ingested by observability tools or simply filtered with jq. Every run generates a correlation ID, making it trivial to trace a specific execution flow across different modules.
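
A minimal sketch of such a logger using only the standard library; the logger name and the field layout are assumptions rather than the project's exact format:

import json
import logging
import sys
import uuid

class JsonFormatter(logging.Formatter):
    # Emits one JSON object per log line, tagged with the run's correlation ID.
    def __init__(self, correlation_id: str):
        super().__init__()
        self.correlation_id = correlation_id

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "module": record.module,
            "correlation_id": self.correlation_id,
            "message": record.getMessage(),
        })

def build_logger() -> logging.Logger:
    correlation_id = uuid.uuid4().hex  # one ID per run, shared by every module
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter(correlation_id))
    logger = logging.getLogger("qry_micro_etl")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger

logger = build_logger()
logger.info("fetch complete")

Because every line is a single JSON object carrying the same correlation_id, something like jq 'select(.correlation_id == "<id>")' pulls one run's trail out of a mixed log stream.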

Key Code Snippet

Here is how I prevented OOM kills using a sequential execution strategy in the shell runner:

# Force sequential execution to respect low-memory environments.
# CLIENTS is a bash array of client identifiers defined earlier in the runner.

echo "Starting ETL jobs in Sequential Mode (Memory Safe)"
for CLIENT in "${CLIENTS[@]}"; do
    python simple_etl.py --client "$CLIENT"
    # Brief pause so the previous process fully exits and frees its memory
    # before the next client starts.
    sleep 1
done

Conclusion

This project reminded me that software engineering isn't always about using the newest, biggest tool. It's about choosing the right tool. By stripping away the bloat, I built a system that is easy to understand, cheap to run, and easier to maintain than a sprawling Airflow deployment.

Tech Stack: Python 3.13, DuckDB, Pandas, Google Workspace APIs, Docker.