DuckDB
Summary
What it is
An in-process analytical database engine (like SQLite for analytics) that reads Parquet, Iceberg, and other formats directly from S3 without requiring a server or cluster.
Where it fits
DuckDB fills the gap between "I need to explore this S3 data" and "I need to deploy a Spark cluster." It brings fast columnar analytics to a single machine, reading S3 data directly — ideal for development, ad-hoc analysis, and embedded analytics.
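A minimal sketch of that workflow from Python, assuming the httpfs extension is available; the region, bucket, and object paths are placeholders:

```python
import duckdb

# In-process engine: the database lives inside this Python process, no server.
con = duckdb.connect()

# httpfs provides the s3:// scheme; recent versions can autoload it, but
# installing and loading explicitly is harmless.
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("SET s3_region = 'us-east-1'")  # placeholder region

# Query Parquet files directly on S3: no local copy, no cluster.
rows = con.execute("""
    SELECT event_type, count(*) AS n
    FROM read_parquet('s3://my-bucket/events/2024/*.parquet')  -- placeholder path
    GROUP BY event_type
    ORDER BY n DESC
""").fetchall()
print(rows)
```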
Misconceptions / Traps
- DuckDB is single-node. It does not scale horizontally. For petabyte-scale queries, you still need Spark, Trino, or StarRocks.
- DuckDB reads from S3 over HTTP. Performance is bottlenecked by network throughput and S3 request latency, especially with many small files; one common mitigation, sketched after this list, is compacting them into fewer, larger objects.
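A hedged sketch of that compaction step, assuming httpfs is loaded, S3 write credentials are configured, and the bucket layout shown is purely illustrative:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
# S3 credentials/region configuration omitted here; see the secrets sketch under Resources.

# Scanning thousands of tiny objects is dominated by request latency: every
# matched file adds listing entries plus HTTP range reads. Rewriting them as
# one larger Parquet file lets later queries do the same work in far fewer
# round trips.
con.execute("""
    COPY (
        SELECT *
        FROM read_parquet('s3://my-bucket/raw/events/*/*.parquet')   -- placeholder layout
    )
    TO 's3://my-bucket/compacted/events.parquet' (FORMAT PARQUET)    -- placeholder target
""")
```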
Key Connections
- depends_on: Apache Parquet, Apache Arrow — reads Parquet, processes in Arrow format
- constrained_by: Small Files Problem, Object Listing Performance — performance degrades with too many small S3 objects
- Natural Language Querying augments DuckDB — LLMs can generate SQL for DuckDB
- scoped_to: S3, Lakehouse
Definition
What it is
An in-process analytical database engine (similar to SQLite for analytics) that can directly read Parquet, Iceberg, and other formats from S3 without requiring a server or cluster.
Why it exists
Not every analytical query requires a distributed cluster. DuckDB brings fast columnar analytics to a single machine, reading directly from S3 — eliminating the need to copy data to a local database or set up distributed infrastructure.
Primary use cases
Local S3 data exploration, ad-hoc analytics over Parquet files on S3, development and testing of queries before deploying to distributed engines, embedded analytics.
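For the development-and-testing use case, a rough sketch of iterating on a query locally before porting the same SQL to a distributed engine; the bucket path and column names are illustrative only:

```python
import duckdb

# Iterate on a query locally against the real S3 data, then move the same SQL
# to a distributed engine once it is worth scaling.
duckdb.execute("INSTALL httpfs")
duckdb.execute("LOAD httpfs")

duckdb.sql("""
    SELECT date_trunc('day', order_ts) AS day,
           sum(amount)                 AS revenue
    FROM read_parquet('s3://my-bucket/orders/2024-*.parquet')
    GROUP BY day
    ORDER BY day
""").show()
```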
Relationships
Outbound Relationships
- depends_on: Apache Parquet, Apache Arrow
- constrained_by: Small Files Problem, Object Listing Performance
Resources
- Official DuckDB documentation covering SQL dialect, extensions, and embedded analytics engine capabilities.
- Primary DuckDB repository with the full C++ source, extension framework, and build system.
- DuckDB's dedicated S3 support documentation covering direct S3 reads/writes via the httpfs extension, credentials configuration, and Parquet-on-S3 queries.
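As a rough illustration of that credentials setup (secret-based configuration as supported on recent DuckDB versions; the secret name, key values, and paths are placeholders):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Recent DuckDB releases configure S3 access through secrets; every credential
# value and path below is a placeholder.
con.execute("""
    CREATE SECRET s3_creds (
        TYPE S3,
        KEY_ID 'AKIA_PLACEHOLDER',
        SECRET 'SECRET_PLACEHOLDER',
        REGION 'us-east-1'
    )
""")
# Older releases use session settings instead, e.g.
#   SET s3_access_key_id = '...'; SET s3_secret_access_key = '...'; SET s3_region = '...';

print(con.execute(
    "SELECT count(*) FROM read_parquet('s3://my-bucket/data/*.parquet')"  # placeholder path
).fetchone())
```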