Guide 5
Vector Indexing on Object Storage — What's Real vs. Hype
Problem Framing
Vector databases and semantic search are heavily marketed features in the AI ecosystem. For engineers building on S3, the question is practical: can you build production vector search over S3-stored data, and what are the real trade-offs? The answer depends on data volume, latency requirements, and whether you are willing to operate a separate infrastructure layer.
Relevant Nodes
- Topics: Vector Indexing on Object Storage, LLM-Assisted Data Systems, S3
- Technologies: LanceDB, AWS S3
- Standards: S3 API
- Architectures: Hybrid S3 + Vector Index, Offline Embedding Pipeline, Local Inference Stack
- Model Classes: Embedding Model, Small / Distilled Model
- LLM Capabilities: Embedding Generation, Semantic Search
- Pain Points: High Cloud Inference Cost, Cold Scan Latency
Decision Path
Decide if you need vector search at all:
- Yes if your data is unstructured (documents, images, logs) and users need to find content by meaning.
- Yes if you are building RAG systems grounded in S3-stored corpora.
- No if your queries are structured (SQL filters, exact matches, aggregations). Table formats and SQL engines are the right tool.
- Maybe if you want to combine semantic and structured search (hybrid search) — this is real but adds complexity.
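To make hybrid search concrete, the sketch below combines a structured metadata filter with a vector similarity query using LanceDB's Python client. The bucket, table, column name, and query vector are illustrative assumptions, not a prescribed schema.
```python
# Hedged sketch: one query that mixes a structured predicate with
# semantic similarity. Bucket, table, and column names are hypothetical.
import lancedb

db = lancedb.connect("s3://my-bucket/vector-index")  # hypothetical location
table = db.open_table("docs")

# The metadata filter and the nearest-neighbor search are expressed
# together in a single query against the same table.
results = (
    table.search([0.1, 0.2, 0.3, 0.4])  # query embedding
    .where("year >= 2023")              # structured filter on a metadata column
    .limit(10)
    .to_list()
)
```
The added complexity shows up here: the table must carry both well-typed metadata columns and embeddings, and both need to stay in sync with the source data.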
Choose your vector index architecture:
- S3-native (LanceDB): Vector indexes stored as files on S3. Serverless, no separate infrastructure, lowest operational overhead. Trade-off: higher query latency, because every query reads index data from S3 (sketched after this list).
- Dedicated vector database (Milvus, Weaviate): Separate infrastructure with in-memory indexes. Lower latency, higher throughput. Trade-off: another system to operate, and you store data in two places (S3 + vector DB).
- Managed service (OpenSearch, S3 Vectors): Cloud-managed vector search. Trade-off: vendor lock-in and cost at scale.
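For the S3-native option, here is a minimal sketch using LanceDB's Python client pointed directly at an S3 prefix. The bucket name, table name, and tiny four-dimensional vectors are illustrative assumptions; real embeddings would come from an embedding model and have hundreds of dimensions.
```python
# Minimal sketch of an S3-native index: LanceDB stores its table and
# index files under an S3 prefix, so there is no database server to run.
# Assumes AWS credentials and region are configured in the environment.
import lancedb

db = lancedb.connect("s3://my-bucket/vector-index")  # hypothetical bucket

# Each row pairs the source text with its embedding vector.
table = db.create_table(
    "docs",
    data=[
        {"text": "first document", "vector": [0.10, 0.20, 0.30, 0.40]},
        {"text": "second document", "vector": [0.90, 0.80, 0.70, 0.60]},
    ],
)

# Queries read index data from S3, which is the source of the extra
# latency relative to an in-memory vector database.
hits = table.search([0.10, 0.20, 0.30, 0.35]).limit(5).to_list()
```
The design choice is visible in the connect call: the "database" is just a prefix of objects, so provisioning, scaling, and durability are S3's problem rather than yours.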
Plan your embedding pipeline:
- Source data lives in S3 → embedding model processes it → vectors are stored in the index
- Batch (Offline Embedding Pipeline): Process S3 data on a schedule; see the pipeline sketch after this list. Cost-predictable. Stale by design.
- Stream: Embed on ingest. Fresh but expensive and operationally complex.
- Embedding model choice: Commercial APIs (OpenAI) for quality, open-source (sentence-transformers) for cost/privacy, small/distilled models for local inference.
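A minimal sketch of the batch pattern, assuming boto3 to list and read source objects, sentence-transformers for local embedding, and LanceDB as the S3-resident index. The bucket, prefix, model name, and overwrite-on-each-run strategy are assumptions for illustration.
```python
# Hedged sketch of an offline (batch) embedding pipeline over S3 data.
# Run on a schedule; results are stale between runs by design.
import boto3
import lancedb
from sentence_transformers import SentenceTransformer

BUCKET = "my-bucket"   # hypothetical source bucket
PREFIX = "docs/"       # hypothetical prefix holding text documents

s3 = boto3.client("s3")
model = SentenceTransformer("all-MiniLM-L6-v2")  # small open-source model
db = lancedb.connect(f"s3://{BUCKET}/vector-index")

# List source documents and embed them in one pass.
rows = []
listing = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
for obj in listing.get("Contents", []):
    body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
    text = body.decode("utf-8", errors="ignore")
    rows.append({
        "id": obj["Key"],
        "text": text,
        "vector": model.encode(text).tolist(),
    })

# Rebuild the index each run; incremental upserts are a later refinement.
db.create_table("docs", data=rows, mode="overwrite")
```
A production version would paginate the listing, chunk long documents before embedding, and batch calls to the model, but the shape stays the same: read from S3, embed, write the index back to S3.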
Understand what's real vs. hype:
- Real: Vector search over thousands to millions of documents on S3. LanceDB handles 1B+ vectors on S3. RAG with S3-backed corpora works in production.
- Real: Embedding costs dominate the total cost. The index itself is cheap; generating embeddings is not (see the back-of-the-envelope comparison after this list).
- Hype: "Just add vector search to your data lake." Integration requires embedding pipelines, index maintenance, sync mechanisms, and relevance tuning.
- Hype: "Vector search replaces SQL." It does not. It answers a different question (semantic similarity vs. predicate matching).
What Changed Over Time
- Early vector databases (2020-2022) were standalone systems with no S3 story. Data had to be copied in.
- S3-native vector search emerged (LanceDB and the Lance format) to align with the principle of separating storage and compute.
- AWS announced S3 Vectors — native vector storage in S3 itself — signaling that vector search is moving into the storage layer.
- Embedding model costs dropped significantly (open-source models, quantized models, distillation). This makes the embedding pipeline more viable at S3 data scale.
- The "RAG over S3 data" pattern has become a standard architecture, with AWS, Databricks, and LangChain providing reference implementations.
Sources
- aws.amazon.com/blogs/architecture/a-scalable-elastic-database-and-sear...
- lancedb.github.io/lancedb/
- milvus.io/docs/overview.md
- milvus.io/docs/deploy_s3.md
- aws.amazon.com/blogs/aws/introducing-amazon-s3-vectors-first-cloud-sto...
- sbert.net/
- platform.openai.com/docs/guides/embeddings
- github.com/aws-samples/text-embeddings-pipeline-for-rag