Small / Distilled Model
Summary
What it is
A compact model (typically under 10B parameters) suitable for local or edge deployment, often distilled from a larger model to retain key capabilities at lower cost.
Where it fits
Small models make LLM-over-S3 workloads economically viable at scale. They can run on commodity hardware for embedding generation, classification, and metadata extraction — avoiding cloud API costs and egress charges.
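As a rough illustration, here is a minimal sketch of local embedding generation over S3-stored text with a small model. It assumes boto3 credentials are already configured and the sentence-transformers package is installed; the bucket, prefix, and model name are placeholders, not part of any specific pipeline.

```python
# Minimal sketch: embed S3-stored text locally with a small model.
# Assumes boto3 credentials are configured and sentence-transformers is
# installed; bucket, prefix, and model name are illustrative placeholders.
import boto3
from sentence_transformers import SentenceTransformer

s3 = boto3.client("s3")
model = SentenceTransformer("all-MiniLM-L6-v2")  # ~22M params, runs on CPU

def embed_s3_objects(bucket: str, prefix: str) -> dict[str, list[float]]:
    """Download text objects under a prefix and embed them locally."""
    embeddings = {}
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            text = body.decode("utf-8", errors="replace")
            # encode() returns a numpy array; store it as a plain list
            embeddings[obj["Key"]] = model.encode(text).tolist()
    return embeddings
```

No per-token API fees are involved; the only cost is the local compute and whatever data transfer applies between the bucket and the machine running the model.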
Misconceptions / Traps
- "Small" does not mean "bad." Distilled models retain 90%+ of the teacher model's capability for specific tasks. But they are less versatile than full-size models.
- Quantized models (4-bit, 8-bit) trade precision for throughput. Test on your specific data before assuming quality is acceptable.
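A minimal sketch of the kind of spot-check this implies: load a small model in 4-bit and inspect its output on samples of your own data before committing to it. It assumes transformers, bitsandbytes, and accelerate are installed and a CUDA GPU is available; the model name and prompt are placeholders.

```python
# Minimal sketch: load a small model in 4-bit and spot-check its output
# on your own data. Model name and prompt are placeholders; requires
# transformers, bitsandbytes, accelerate, and a CUDA-capable GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder small model

tokenizer = AutoTokenizer.from_pretrained(model_id)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)

def generate(model, prompt: str, max_new_tokens: int = 64) -> str:
    """Generate a short completion so outputs can be compared by eye or metric."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# Compare this against the full-precision model on representative samples
print(generate(model_4bit, "Classify this log line: 'GET /index.html 200'"))
```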
Key Connections
- enables: Embedding Generation — can generate embeddings locally
- scoped to: LLM-Assisted Data Systems
Definition
What it is
A compact model (typically under 10B parameters) suitable for local or edge deployment, often distilled from a larger model to retain key capabilities at lower computational cost.
Why it exists
Processing large volumes of S3-stored data through cloud LLM APIs is expensive. Small models can run on local hardware, enabling cost-effective embedding generation, classification, and metadata extraction at scale without egress or per-token charges.
Primary use cases
Local embedding generation for S3-stored content, on-premise data classification, edge inference for IoT data stored in S3.
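For the classification use case, a minimal sketch of on-premise labeling with a small zero-shot model via the transformers pipeline; the label set, model choice, and sample text are illustrative assumptions rather than recommendations.

```python
# Minimal sketch: on-premise classification of text snippets pulled from
# S3 objects, using a small zero-shot model. Labels, model choice, and
# the sample snippet are illustrative assumptions.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",  # placeholder; any small NLI model works
)

labels = ["invoice", "log file", "source code", "personal data"]

def classify_snippet(text: str) -> str:
    """Return the most likely label for a text snippet."""
    result = classifier(text, candidate_labels=labels)
    return result["labels"][0]  # labels come back sorted by score

print(classify_snippet("2024-05-01 12:00:01 ERROR connection timed out"))
```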
Relationships
Resources
Official Hugging Face documentation for DistilBERT, the landmark distilled model retaining 97% of BERT's performance at 40% smaller size and 60% faster inference.
The original DistilBERT paper by Sanh et al. from Hugging Face, establishing the triple-loss knowledge distillation approach widely adopted for creating smaller models.
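For orientation, a minimal sketch of the triple-loss idea described in the paper: a soft-target distillation term (KL divergence on temperature-scaled logits), the standard masked-LM cross-entropy, and a cosine term aligning student and teacher hidden states. The weights and temperature below are illustrative, not the paper's exact training configuration.

```python
# Minimal sketch of the triple-loss idea from the DistilBERT paper:
# soft-target distillation, masked-LM cross-entropy, and a cosine loss
# on hidden states. Weights and temperature are illustrative only.
import torch
import torch.nn.functional as F

def triple_loss(student_logits, teacher_logits, labels,
                student_hidden, teacher_hidden,
                temperature=2.0, alpha_ce=5.0, alpha_mlm=2.0, alpha_cos=1.0):
    vocab = student_logits.size(-1)

    # 1) KL divergence between softened teacher and student distributions
    soft_student = F.log_softmax(student_logits.view(-1, vocab) / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits.view(-1, vocab) / temperature, dim=-1)
    loss_ce = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

    # 2) Standard masked-LM cross-entropy against the true tokens
    loss_mlm = F.cross_entropy(
        student_logits.view(-1, vocab), labels.view(-1), ignore_index=-100
    )

    # 3) Cosine loss pulling student hidden states toward the teacher's
    target = torch.ones(
        student_hidden.size(0) * student_hidden.size(1), device=student_hidden.device
    )
    loss_cos = F.cosine_embedding_loss(
        student_hidden.view(-1, student_hidden.size(-1)),
        teacher_hidden.view(-1, teacher_hidden.size(-1)),
        target,
    )

    return alpha_ce * loss_ce + alpha_mlm * loss_mlm + alpha_cos * loss_cos
```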