High Cloud Inference Cost
Summary
What it is
The expense of running LLM/ML inference via cloud APIs (per-token or per-request pricing) against S3 data at scale.
Where it fits
This is the economic constraint that limits LLM adoption over S3 data. Embedding generation, metadata extraction, and classification are useful but only viable if inference costs do not exceed the value of the results.
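A quick way to sanity-check that viability is a back-of-envelope estimate: corpus size times tokens per object times an assumed per-token price. The sketch below does exactly that; every number in it is a placeholder, not a quoted rate.

```python
# Back-of-envelope cloud inference cost for embedding an S3 corpus.
# All figures are illustrative assumptions; substitute your own corpus
# statistics and your provider's current pricing.

NUM_OBJECTS = 10_000_000           # objects under the S3 prefix
AVG_TOKENS_PER_OBJECT = 800        # after chunking/truncation
PRICE_PER_1K_TOKENS = 0.0001       # assumed embedding API price, USD

total_tokens = NUM_OBJECTS * AVG_TOKENS_PER_OBJECT
api_cost = total_tokens / 1000 * PRICE_PER_1K_TOKENS

print(f"Tokens to embed: {total_tokens:,}")
print(f"Estimated API cost: ${api_cost:,.0f}")
# With these placeholder numbers: 8 billion tokens and roughly $800 in API
# fees alone, before egress, retries, or embedding storage.
```

If that figure approaches the value the embeddings are expected to unlock, the project fails the viability test described above.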
Misconceptions / Traps
- Cost is not just API pricing. Egress charges for moving S3 data to inference endpoints, and storage costs for embeddings, add to the total.
- "Run it locally" is not free either. Local inference has GPU hardware, power, and maintenance costs. The break-even volume depends on model size and throughput.
Key Connections
- Local Inference Stack: solves High Cloud Inference Cost (runs models on local hardware)
- Offline Embedding Pipeline: constrained_by High Cloud Inference Cost (batch processing amortizes cost; see the sketch below)
- Embedding Generation, Metadata Extraction, Data Classification: constrained_by High Cloud Inference Cost
- Hybrid S3 + Vector Index: constrained_by High Cloud Inference Cost (embedding generation is expensive)
- Scoped to: LLM-Assisted Data Systems, S3
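To make the amortization point concrete, here is a sketch of an offline pipeline that groups S3 objects into larger embedding requests, so per-request overhead is paid once per batch instead of once per object. The S3 listing uses standard boto3 calls; `call_embedding_api` is a hypothetical stand-in for whatever provider client you use, and chunking, retries, and rate limiting are omitted.

```python
import boto3  # AWS SDK; assumes credentials are already configured

def call_embedding_api(texts: list[str]) -> list[list[float]]:
    # Hypothetical stand-in: most embedding APIs accept a list of inputs,
    # so request overhead is amortized across the whole batch.
    raise NotImplementedError("replace with your provider's batch embedding call")

def iter_object_texts(bucket: str, prefix: str):
    """Yield (key, decoded text) for each object under an S3 prefix."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            yield obj["Key"], body.decode("utf-8", errors="replace")

def embed_in_batches(bucket: str, prefix: str, batch_size: int = 64):
    """Yield (key, embedding) pairs, one provider call per batch of objects."""
    keys, texts = [], []
    for key, text in iter_object_texts(bucket, prefix):
        keys.append(key)
        texts.append(text)
        if len(texts) == batch_size:
            yield from zip(keys, call_embedding_api(texts))
            keys, texts = [], []
    if texts:  # flush the final partial batch
        yield from zip(keys, call_embedding_api(texts))
```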
Definition
What it is
The expense of running LLM/ML inference via cloud APIs (per-token or per-request pricing) against S3-stored data at scale.
Relationships
Outbound Relationships
- scoped_to
Inbound Relationships
- constrained_by
- solves
Resources
- Official AWS SageMaker documentation on inference cost optimization best practices, covering multi-model endpoints, autoscaling, instance selection, and S3 model streaming to reduce costs.
- NVIDIA technical blog showing how streaming models from S3 reduces cold-start latency and infrastructure costs compared to pre-loading models to local storage.
- Detailed breakdown of inference unit economics including GPU costs, model storage on S3, and hidden expenses that make up 60-80% of total spend.