Guide 6

LLMs over S3 Data — Embeddings, Metadata, and Local Inference Constraints

Problem Framing

LLMs can extract value from S3-stored data — generating embeddings, extracting metadata, classifying documents, inferring schemas, and translating natural language to SQL. But every one of these operations has a cost, and at S3 data volumes (terabytes to petabytes), the cost question dominates. Engineers need to understand which LLM capabilities are viable at their scale, how to control costs, and when local inference is the right answer.

Relevant Nodes

  • Topics: LLM-Assisted Data Systems, S3, Vector Indexing on Object Storage, Metadata Management
  • Technologies: LanceDB, AWS S3
  • Architectures: Offline Embedding Pipeline, Local Inference Stack, Hybrid S3 + Vector Index
  • Model Classes: Embedding Model, General-Purpose LLM, Code-Focused LLM, Small / Distilled Model
  • LLM Capabilities: Embedding Generation, Semantic Search, Metadata Extraction, Schema Inference, Data Classification, Natural Language Querying
  • Pain Points: High Cloud Inference Cost, Egress Cost

Decision Path

  1. Assess your LLM use case against S3 data volume (a back-of-envelope cost sketch follows this list):

    • Embedding generation at 1M documents: ~$50-500 via cloud API, ~$5-50 on local GPU. Viable at most scales.
    • Metadata extraction on 10M objects: ~$5,000-50,000 via cloud API. Only viable with prioritization (extract from high-value objects only) or local inference.
    • Schema inference is low-volume (run once per new dataset). Cloud API cost is negligible.
    • Natural language querying is priced per query: low volume, high value per query. A cloud API is usually fine.
    • Data classification at petabyte scale: requires local inference (or AWS Macie for PII-specific detection). Cloud LLM APIs are cost-prohibitive at this volume.
  2. Choose your inference strategy:

    • Cloud API (OpenAI, Amazon Bedrock): Highest quality, highest cost, zero infrastructure to manage. Use for low-volume, high-value tasks (schema inference, NL querying).
    • Managed hosting (SageMaker endpoints): Medium cost, auto-scaling, AWS-managed infrastructure around a model you choose. Use for medium-volume batch processing.
    • Self-hosted local (vLLM, llama.cpp): Lowest per-token cost at high volume, highest operational overhead. Use for high-volume embedding and classification (see the vLLM sketch after this list).
    • Small/distilled models: Run on commodity hardware. Quality trade-off. Use when 90% accuracy is acceptable and volume makes cloud APIs prohibitive.
  3. Account for data movement costs:

    • Cloud inference often requires moving S3 data to inference endpoints → egress charges.
    • Local inference against MinIO (on-premises, S3-compatible storage) eliminates egress entirely.
    • Hybrid: keep models near the data. Deploy inference in the same region/VPC as your S3 buckets (a region-check sketch follows this list).
  4. Structure your pipeline:

    • Use the Offline Embedding Pipeline pattern for batch processing: schedule it daily or weekly and make it idempotent and resumable (a LanceDB pipeline sketch follows this list).
    • Store embeddings back to S3 (Lance format, Parquet with vector columns, or dedicated vector store).
    • Use the Hybrid S3 + Vector Index pattern to make embedded data searchable.
    • Metadata extraction results → enrich table format metadata (Iceberg custom properties, Glue Data Catalog tags).
  5. Set quality expectations:

    • LLM outputs are probabilistic. Schema inference suggestions need human review. Classification needs confidence thresholds. NL-to-SQL needs query validation.
    • Build validation into the pipeline, not as an afterthought (a threshold-routing sketch follows this list).
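
A back-of-envelope cost sketch for step 1. The per-million-token prices are illustrative placeholders, not current vendor pricing; substitute your provider's rates and your own corpus statistics.

```python
# Rough cost model for one LLM pass over an S3 corpus.
# All prices are illustrative placeholders -- plug in real rates.

def estimate_cost(num_objects: int,
                  avg_tokens_per_object: int,
                  price_per_million_tokens: float) -> float:
    """Dollar cost of one full pass over the corpus."""
    total_tokens = num_objects * avg_tokens_per_object
    return total_tokens / 1_000_000 * price_per_million_tokens

# Embedding 1M documents (~800 tokens each) at a hypothetical $0.10 / 1M tokens:
print(estimate_cost(1_000_000, 800, 0.10))      # ~$80 -- within the cloud-API range above

# Metadata extraction over 10M objects (~1,000 tokens each) at a hypothetical
# $2.00 / 1M tokens for a general-purpose LLM:
print(estimate_cost(10_000_000, 1_000, 2.00))   # ~$20,000 -- prioritize or go local
```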
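
For the self-hosted option in step 2, a minimal vLLM batch-classification sketch. It assumes a GPU host; the model name, label set, and document texts are placeholders, and fetching text from S3 is left out.

```python
# Offline (batch) classification with vLLM; prompts are built from object text
# already pulled from S3. Model name and labels are placeholders.
from vllm import LLM, SamplingParams

labels = ["invoice", "contract", "log", "other"]
docs = ["...text of object 1...", "...text of object 2..."]

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")      # any locally hosted model
params = SamplingParams(temperature=0.0, max_tokens=4)   # short, deterministic answers

prompts = [
    f"Classify this document as one of {labels}. Answer with the label only.\n\n{d}"
    for d in docs
]
for out in llm.generate(prompts, params):
    print(out.outputs[0].text.strip())
```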
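
For step 3, a small boto3 sketch of the two data-movement checks: pointing the client at an on-premises MinIO endpoint, and confirming the bucket and the inference fleet share a region before a batch run. The endpoint, bucket name, and region are placeholders.

```python
import boto3

# On-premises MinIO: reads go to the local endpoint, so nothing leaves the
# network and no egress is billed (endpoint and credentials are placeholders).
minio = boto3.client(
    "s3",
    endpoint_url="http://minio.internal:9000",
    aws_access_key_id="PLACEHOLDER",
    aws_secret_access_key="PLACEHOLDER",
)

# AWS: keep inference in the bucket's region to avoid cross-region transfer fees.
INFERENCE_REGION = "us-west-2"                 # where the GPU hosts run
s3 = boto3.client("s3")
bucket_region = (
    s3.get_bucket_location(Bucket="my-data-lake")["LocationConstraint"] or "us-east-1"
)
assert bucket_region == INFERENCE_REGION, (
    f"bucket is in {bucket_region}; deploy inference there instead"
)
```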
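
A sketch of the Offline Embedding Pipeline in step 4: read objects, embed locally, write Lance-format vectors back to the same bucket. The bucket name, prefix, and embedding model are illustrative; chunking, batching, and error handling are omitted.

```python
import boto3
import lancedb
from sentence_transformers import SentenceTransformer

BUCKET, PREFIX = "my-data-lake", "docs/"                 # hypothetical locations
model = SentenceTransformer("intfloat/e5-small-v2")      # open-source embedding model
s3 = boto3.client("s3")
db = lancedb.connect(f"s3://{BUCKET}/indexes/")          # Lance-format index on S3

def read_text(key: str) -> str:
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    return body.decode("utf-8", errors="ignore")

rows = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        text = read_text(obj["Key"])
        rows.append({
            "key": obj["Key"],
            "etag": obj["ETag"],          # lets a later run skip unchanged objects
            "text": text[:2000],
            "vector": model.encode(text).tolist(),
        })

# mode="overwrite" keeps the scheduled run idempotent; an incremental variant
# would compare ETags against the existing table and upsert only new objects.
db.create_table("doc_embeddings", data=rows, mode="overwrite")
```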
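
For step 5, one minimal shape for confidence-threshold validation. The threshold value and field names are illustrative; the same accept-or-review pattern applies to schema-inference suggestions and NL-to-SQL outputs.

```python
from dataclasses import dataclass

@dataclass
class Classification:
    key: str             # S3 object key the label applies to
    label: str           # model-assigned class, e.g. "contains_pii"
    confidence: float    # score in [0, 1] from the model or a calibration step

ACCEPT_THRESHOLD = 0.90  # tune against a labeled sample, not by guesswork

def route(results: list[Classification]) -> tuple[list[Classification], list[Classification]]:
    """Split results into auto-accepted labels and items queued for human review."""
    accepted = [r for r in results if r.confidence >= ACCEPT_THRESHOLD]
    needs_review = [r for r in results if r.confidence < ACCEPT_THRESHOLD]
    return accepted, needs_review
```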

What Changed Over Time

  • Early LLM-over-data workloads (2022-2023) used cloud APIs exclusively. Costs were high and scale was limited.
  • Open-source embedding models (sentence-transformers, E5) made local embedding generation viable.
  • Quantized inference (llama.cpp, GGML/GGUF) brought LLM inference to commodity hardware.
  • vLLM and model streaming from S3 (Run:ai Model Streamer) reduced cold-start latency for self-hosted inference.
  • AWS introduced S3 Vectors and S3 Metadata features, signaling that LLM-derived data enrichment is moving into the storage platform itself.
  • The cost-per-token of both cloud and local inference has dropped steadily, but S3 data volumes grow faster. The economic tension persists.

Sources