Scaling Down: Building a Robust Micro-ETL for $5/Mo
In a world dominated by massive data stacks—Airflow, dbt, Snowflake—it’s easy to forget that not every data problem requires a Kubernetes cluster. Recently, I found myself needing to automate a daily data pull for a few clients. The data volume was moderate ("small data"), the budget was tight, and reliability was non-negotiable.
Enter qry-micro-etl: a lightweight, containerized ETL framework designed to run comfortably on a $5/month DigitalOcean droplet (2GB RAM).
The Problem: "Enterprise" Is Heavy
Tools like Airflow are fantastic, but they introduce significant operational overhead. For a simple pipeline that fetches emails, parses CSVs, transforms data, and syncs to Google Sheets, I didn't want to manage a scheduler, a web server, and a metadata database. I just needed a script that runs, logs what happened, and exits.
The Solution: Python + DuckDB + Docker
I built qry-micro-etl with a philosophy of radical simplicity and resource efficiency.
1. Memory-First Design
Running on 2GB RAM means you can't just pd.read_csv() a 500MB file without sweating.
- Streaming: I used Python generators to stream records through the transformation layer instead of materializing entire files in memory.
- DuckDB: For data merging (upserts), I leaned on DuckDB. It can run complex SQL joins directly against pandas DataFrames with a fraction of pandas' memory footprint, and it handles larger-than-RAM datasets gracefully.
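To make that concrete, here is a minimal sketch of the pattern; the file name, table schema, and chunk size are illustrative, not the project's actual code. A generator feeds bounded pandas chunks into DuckDB, which performs the upsert in SQL by resolving the local DataFrame through a replacement scan.

import duckdb
import pandas as pd

con = duckdb.connect("state.duckdb")
con.execute("""
    CREATE TABLE IF NOT EXISTS records (
        id BIGINT PRIMARY KEY,
        value DOUBLE,
        updated_at TIMESTAMP
    )
""")

def stream_chunks(path: str, chunksize: int = 50_000):
    # Yield bounded DataFrame chunks so peak memory stays flat.
    yield from pd.read_csv(path, chunksize=chunksize, parse_dates=["updated_at"])

for batch in stream_chunks("daily_export.csv"):
    # DuckDB finds the local variable `batch` via a replacement scan,
    # so the upsert runs in SQL without copying the chunk again.
    con.execute("INSERT OR REPLACE INTO records SELECT id, value, updated_at FROM batch")

Because DuckDB can spill intermediate results to disk when a query exceeds available memory, the same pattern keeps working when a day's export is larger than the droplet's RAM.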
2. Containerized Reliability
To solve the "it works on my machine" problem, I containerized the application using a minimal python:3.13-slim image. This ensures that the exact same environment runs in development and on the production Linux server.
- uv for Speed: I used uv for lightning-fast dependency resolution and installation.
- Security: Credentials are injected via volume mounts, keeping the image clean and secrets safe.
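As an illustration of the volume-mount approach (the paths, environment variable, and file name below are assumptions, not the project's actual layout), the application reads its Google service-account key from a directory that Docker mounts read-only at runtime, e.g. docker run -v ./secrets:/app/secrets:ro qry-micro-etl:

import json
import os
from pathlib import Path

# Directory where Docker mounts the secrets; overridable for local runs.
# (Path and variable name are illustrative.)
SECRETS_DIR = Path(os.environ.get("SECRETS_DIR", "/app/secrets"))

def load_service_account() -> dict:
    # Fail fast with a clear message if the mount is missing.
    key_path = SECRETS_DIR / "service_account.json"
    if not key_path.exists():
        raise FileNotFoundError(f"Expected mounted credentials at {key_path}")
    return json.loads(key_path.read_text())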
3. Observability
When a cron job fails in the forest, does it make a sound? Only if you log it. I implemented structured JSON logging, so logs can be ingested by observability tools or simply filtered on the command line with jq. Every run generates a correlation ID, making it trivial to trace a specific execution flow across different modules.
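A stdlib-only sketch of that setup follows; the field names and the run_id key are illustrative rather than the project's exact schema.

import json
import logging
import sys
import uuid

# One correlation ID per run, attached to every log record so a single
# execution can be traced with: jq 'select(.run_id == "...")' etl.log
RUN_ID = uuid.uuid4().hex

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        # Emit one JSON object per line for easy machine ingestion.
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "run_id": RUN_ID,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("extract").info("Fetched 3 attachments")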
Key Code Snippet
Here is how I prevented OOM kills using a sequential execution strategy in the shell runner:
#!/usr/bin/env bash
set -euo pipefail

# Force sequential execution to respect low-memory environments
CLIENTS=("client_a" "client_b")  # illustrative client identifiers

echo "Starting ETL jobs in Sequential Mode (Memory Safe)"
for CLIENT in "${CLIENTS[@]}"; do
    python simple_etl.py --client "$CLIENT"
    # Brief pause so the previous run fully exits and frees its memory
    sleep 1
done
Conclusion
This project reminded me that software engineering isn't always about using the newest, biggest tool. It's about choosing the right tool. By stripping away the bloat, I built a system that is easy to understand, cheap to run, and far easier to maintain than a sprawling Airflow deployment.
Tech Stack: Python 3.13, DuckDB, Pandas, Google Workspace APIs, Docker.