Hybrid S3 + Vector Index
Summary
What it is
A pattern that stores raw data on S3 and maintains a vector index over embeddings that points back to S3 objects.
Where it fits
This pattern bridges object storage (S3) with semantic retrieval (vector search). It is a common architecture behind RAG systems that ground LLM responses in S3-stored documents.
Misconceptions / Traps
- The vector index and the raw data can drift. If S3 objects are updated or deleted without updating the index, search results return stale or broken references.
- Hybrid does not mean "query both simultaneously." Typically, vector search retrieves references first, then the application fetches the raw data from S3 in a second step.
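Both traps can be sketched together. The example below is a minimal illustration, not a reference implementation: `object_store` is an in-memory dictionary standing in for an S3 bucket, `index` is a plain list standing in for a vector index, and the keys, ETag values, and vectors are all hypothetical. Step 1 returns references only; step 2 fetches the raw data and checks each reference for drift.

```python
import math

# In-memory stand-in for an S3 bucket: key -> (etag, body). Hypothetical data.
object_store = {
    "docs/a.txt": ("v1", "cats and kittens"),
    "docs/b.txt": ("v2", "stock market update"),  # rewritten after indexing
}

# Stand-in for the vector index: each row holds a vector and an S3 pointer.
index = [
    {"key": "docs/a.txt", "etag": "v1", "vector": (1.0, 0.0)},
    {"key": "docs/b.txt", "etag": "v1", "vector": (0.0, 1.0)},  # stale ETag
    {"key": "docs/c.txt", "etag": "v1", "vector": (0.7, 0.7)},  # object deleted
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vector, k):
    """Step 1: vector search returns references only, best match first."""
    ranked = sorted(
        index,
        key=lambda row: cosine(query_vector, row["vector"]),
        reverse=True,
    )
    return ranked[:k]


def fetch(rows):
    """Step 2: fetch raw data per reference and flag drift."""
    results = []
    for row in rows:
        stored = object_store.get(row["key"])
        if stored is None:
            results.append((row["key"], "missing", None))  # broken reference
        elif stored[0] != row["etag"]:
            results.append((row["key"], "stale", stored[1]))  # changed since indexing
        else:
            results.append((row["key"], "ok", stored[1]))
    return results


results = fetch(search((1.0, 0.1), k=3))
for key, status, body in results:
    print(key, status, body)
```

Against a real bucket, the same check could compare the ETag that S3 returns for the object with the one recorded at indexing time. Entries flagged as missing or stale are candidates for removal or re-embedding.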
Key Connections
- depends_on: S3 API (raw data stored in S3)
- solves: Cold Scan Latency (pre-computed embeddings avoid scanning raw content)
- constrained_by: High Cloud Inference Cost (generating embeddings is expensive)
- LanceDB implements Hybrid S3 + Vector Index
- Embedding Generation, Semantic Search: enables Hybrid S3 + Vector Index
- scoped_to: Vector Indexing on Object Storage, S3
Definition
What it is
A pattern that stores raw data (documents, media, logs) on S3 and maintains a vector index (embeddings + similarity search) that points back to the S3 objects.
Why it exists
S3 is excellent for durable, cheap storage of unstructured content, but it has no concept of semantic similarity. A vector index adds a semantic retrieval layer without duplicating the raw data.
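A minimal sketch of what "without duplicating the raw data" can look like in practice. The `IndexEntry` record, the `make_entry` helper, and the bucket and key names are hypothetical, and `embed` is a toy stand-in (normalized letter counts) for a real embedding model. The point is that each index row carries only a vector and a pointer to the S3 object, never the object body.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class IndexEntry:
    """One row in a hypothetical vector index: an embedding plus an S3 pointer."""
    bucket: str
    key: str
    etag: str  # version marker captured at indexing time
    vector: Tuple[float, ...]


def embed(text: str) -> Tuple[float, ...]:
    """Toy stand-in for an embedding model: unit-length letter counts (a-z)."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return tuple(c / norm for c in counts)


def make_entry(bucket: str, key: str, etag: str, body: str) -> IndexEntry:
    """Embed the object body, then keep only the vector and the pointer."""
    return IndexEntry(bucket=bucket, key=key, etag=etag, vector=embed(body))


entry = make_entry("docs-bucket", "reports/q1.txt", "etag-1", "quarterly revenue report")
print(entry.bucket, entry.key, len(entry.vector))
```

Storing the ETag alongside the key is one possible design choice: it gives the application a cheap way to notice later that the object has changed since it was embedded.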
Primary use cases
Retrieval-augmented generation (RAG) over S3-stored corpora, semantic document search, content recommendation systems backed by S3 data.
Relationships
Outbound Relationships
scoped_to, depends_on, solves, constrained_by
Resources
- AWS Architecture Blog post describing a production-grade 1B+ vector search solution built on LanceDB with S3 as the storage layer, demonstrating the hybrid pattern at scale.
- Official Milvus documentation for configuring S3 as the object storage backend for vector data and index persistence.
- LanceDB's official example of running a serverless vector database directly on S3 with AWS Lambda.