Engineer Guides
Eight cross-cutting guides anchored to the S3 node graph
How S3 Shapes Lakehouse Design
Every lakehouse architecture sits on object storage — almost always S3 or an S3-compatible store. But S3 is not a database, and its constraints fundamentally shape how lakehouses are designed. Enginee...
Small Files Problem — Why It Exists and the Common Mitigations
A dataset of 10 million 10 KB files performs worse on S3 than the same 100 GB stored as 100 files of 1 GB each. The small files problem is the most common performance issue in S3-based systems, and it is cause...
Why Iceberg Exists (and What It Replaces)
Before Iceberg, querying data on S3 meant pointing a Hive Metastore at a directory of Parquet files and hoping for the best. There were no transactions, schema changes required rewriting data, partiti...
Where DuckDB Fits (and Where It Doesn't)
Engineers encounter S3-stored data constantly — Parquet files in data lakes, Iceberg tables in lakehouses, ad-hoc exports. Historically, exploring this data required setting up Spark clusters or Trino...
Vector Indexing on Object Storage — What's Real vs. Hype
Vector databases and semantic search are heavily marketed features in the AI ecosystem. For engineers building on S3, the question is practical: can you build production vector search over S3-stored d...
LLMs over S3 Data — Embeddings, Metadata, and Local Inference Constraints
LLMs can extract value from S3-stored data — generating embeddings, extracting metadata, classifying documents, inferring schemas, and translating natural language to SQL. But every one of these opera...
Choosing a Table Format — Iceberg vs. Delta vs. Hudi
The three major open table formats — Apache Iceberg, Delta Lake, and Apache Hudi — all solve the same fundamental problem: adding transactional table semantics to files on S3. But they solve it differ...
Egress, Lock-In, and the Case for S3-Compatible Alternatives
AWS S3 egress pricing and proprietary feature creep create a gravitational well: data flows in cheaply but flows out expensively. For organizations with multi-cloud strategies, data sovereignty requir...