Guide 4
Where DuckDB Fits (and Where It Doesn't)
Problem Framing
Engineers encounter S3-stored data constantly — Parquet files in data lakes, Iceberg tables in lakehouses, ad-hoc exports. Historically, exploring this data required setting up Spark clusters or Trino coordinators. DuckDB changes the equation by bringing fast columnar analytics to a single machine, reading directly from S3. But knowing when DuckDB is the right tool — and when it is not — prevents both over-engineering and under-performing.
Relevant Nodes
- Topics: S3, Lakehouse
- Technologies: DuckDB, Trino, Apache Spark, ClickHouse, StarRocks
- Standards: Apache Parquet, Apache Arrow
- Pain Points: Small Files Problem, Object Listing Performance, Cold Scan Latency
Decision Path
Use DuckDB when:
- You need ad-hoc exploration of S3 data (quick SELECT against a few Parquet files; see the sketch after this list)
- You are developing and testing queries before deploying them to Spark or Trino
- You need embedded analytics in an application (DuckDB runs in-process, no server needed)
- Your data fits a single machine's processing capacity (roughly up to ~100GB of materialized result sets; streaming scans can process far more, since scanned data is never fully held in memory)
- You want to query Iceberg tables on S3 without deploying a cluster
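Both the quick-SELECT case and the Iceberg case reduce to a few lines. A minimal sketch in Python, assuming the duckdb package is installed and S3 credentials are already configured (see the configuration sketch further down); the bucket and table paths are hypothetical placeholders:

```python
import duckdb

con = duckdb.connect()  # in-process: no server or cluster to deploy

# httpfs adds s3:// support; iceberg adds table-format scans.
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
con.execute("INSTALL iceberg;")
con.execute("LOAD iceberg;")

# Ad-hoc exploration: a quick SELECT over a glob of Parquet files.
con.sql("""
    SELECT status, count(*) AS n
    FROM read_parquet('s3://my-bucket/exports/2024-*.parquet')
    GROUP BY status
    ORDER BY n DESC
""").show()

# An Iceberg table on S3, queried without deploying a cluster.
con.sql("""
    SELECT count(*) AS row_count
    FROM iceberg_scan('s3://my-bucket/warehouse/db/events')
""").show()
```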
Do not use DuckDB when:
- Data volume requires distributed processing (petabyte-scale joins, multi-TB shuffles)
- You need concurrent multi-user access (DuckDB is single-process)
- You need to write to table formats on S3 in production pipelines (use Spark/Flink)
- You are querying millions of small files on S3 (DuckDB is constrained by S3 listing performance)
DuckDB + S3 configuration:
- Use the httpfs extension for S3 access with credential configuration (as sketched after this list)
- DuckDB supports reading Parquet, CSV, JSON, and Iceberg directly from S3 URIs
- Arrow integration enables zero-copy data exchange with Python analytics libraries
- Parallel S3 reads improve throughput for larger datasets
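The configuration itself, sketched in Python. This assumes a recent DuckDB release with the CREATE SECRET syntax (older versions use the SET s3_region, s3_access_key_id, and s3_secret_access_key settings instead); region, key values, bucket, and the events table are placeholders:

```python
import duckdb
import pyarrow as pa

con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")

# Credential configuration (all values are placeholders).
con.execute("""
    CREATE SECRET s3_creds (
        TYPE S3,
        KEY_ID 'AKIA_PLACEHOLDER',
        SECRET 'SECRET_PLACEHOLDER',
        REGION 'us-east-1'
    );
""")

# Parquet, CSV, and JSON read straight from S3 URIs; DuckDB
# parallelizes the range reads across its worker threads.
rel = con.sql("SELECT * FROM read_parquet('s3://my-bucket/lake/*.parquet')")

# Zero-copy exchange with the Python Arrow ecosystem.
arrow_table = rel.arrow()  # result as a pyarrow.Table

# The reverse direction works too: replacement scans let SQL reference
# an in-scope Arrow table (or pandas DataFrame) by variable name.
events = pa.table({"x": [1, 2, 3]})
con.sql("SELECT sum(x) AS total FROM events").show()
```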
DuckDB vs. alternatives (quick reference):
- DuckDB vs. Spark: DuckDB for single-machine, interactive; Spark for distributed, production pipelines
- DuckDB vs. Trino: DuckDB for local exploration; Trino for multi-user, multi-source, federated queries
- DuckDB vs. ClickHouse: DuckDB for embedded/serverless; ClickHouse for persistent, low-latency dashboards
- DuckDB vs. StarRocks: DuckDB for development; StarRocks for production analytics with caching
What Changed Over Time
- DuckDB started as an academic project (CWI Amsterdam) focused on in-process OLAP — the "SQLite for analytics."
- S3 support came via the httpfs extension, making DuckDB immediately useful for data lake exploration.
- Iceberg support expanded DuckDB from "Parquet file reader" to "lakehouse query tool": querying table format metadata, not just raw files.
- The "DuckDB for everything" trend has led to engineers using it beyond its design envelope. Single-machine performance is excellent but has a ceiling.
- Integration with Python (pandas, Polars, Arrow) has made DuckDB the default local analytics tool for data engineers.