Data Classification
Summary
What it is
Using LLMs to categorize, tag, or label S3-stored objects based on content analysis — by topic, sensitivity level, or compliance category.
Where it fits
Data classification enables governance over S3 data lakes. It identifies PII, classifies documents by sensitivity, and routes data to appropriate processing pipelines. These capabilities matter most at scale, where manual review of every object is impractical.
Misconceptions / Traps
- Classification accuracy varies by data type and domain. General-purpose LLMs may misclassify domain-specific content. Fine-tuned or domain-adapted models typically improve accuracy on such content.
- Classification is not a substitute for proper access controls. Tagging data as "sensitive" does not protect it — IAM policies and encryption must enforce the classification.
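The second trap above can be illustrated with a minimal sketch: one helper shapes classification labels as S3 object tags, and another builds a bucket policy that denies reads on tagged objects. The tag key `sensitivity`, the value `pii`, the principal tag `clearance`, and the bucket name are all hypothetical, and the policy is a starting point under those assumptions rather than a reviewed security control.

```python
import json


def build_tag_set(labels: dict[str, str]) -> dict:
    """Shape labels as the Tagging argument of S3 PutObjectTagging.

    S3 allows at most 10 tags per object, so keep the label set small.
    """
    return {"TagSet": [{"Key": k, "Value": v} for k, v in sorted(labels.items())]}


def build_deny_policy(bucket: str, tag_key: str, tag_value: str) -> dict:
    """Bucket policy denying GetObject on objects carrying the given tag,
    unless the caller has the (hypothetical) principal tag clearance=<tag_value>.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyTaggedObjectsUnlessCleared",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                # Both conditions must hold for the Deny to apply.
                "Condition": {
                    "StringEquals": {f"s3:ExistingObjectTag/{tag_key}": tag_value},
                    "StringNotEquals": {"aws:PrincipalTag/clearance": tag_value},
                },
            }
        ],
    }


if __name__ == "__main__":
    tagging = build_tag_set({"sensitivity": "pii"})
    policy = build_deny_policy("example-bucket", "sensitivity", "pii")
    print(json.dumps(policy, indent=2))
    # With boto3 installed and credentials configured, these would be applied as:
    #   s3 = boto3.client("s3")
    #   s3.put_object_tagging(Bucket="example-bucket", Key="docs/a.txt", Tagging=tagging)
    #   s3.put_bucket_policy(Bucket="example-bucket", Policy=json.dumps(policy))
    # Note that put_object_tagging replaces the object's entire existing tag set.
```

The tag only records the classifier's decision; the policy is what actually blocks access. Encryption requirements would need to be enforced separately.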
Key Connections
- depends_on General-Purpose LLM: requires content understanding
- augments Apache Iceberg: enriches table metadata with classification tags
- constrained_by High Cloud Inference Cost: per-object processing is expensive
- scoped_to LLM-Assisted Data Systems, Metadata Management
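The cost constraint can be made concrete with back-of-envelope arithmetic. The sketch below assumes token-based pricing; the object count, average tokens per object, and price per 1,000 tokens are hypothetical placeholders, not quotes from any provider.

```python
def estimate_cost_usd(
    num_objects: int,
    avg_tokens_per_object: int,
    usd_per_1k_tokens: float,
) -> float:
    """Rough cost of classifying every object once: objects x tokens x unit price."""
    total_tokens = num_objects * avg_tokens_per_object
    return total_tokens / 1000 * usd_per_1k_tokens


if __name__ == "__main__":
    # Hypothetical inputs: 10 million objects, 1,000 tokens each, 0.001 USD per 1k tokens.
    cost = estimate_cost_usd(10_000_000, 1_000, 0.001)
    print(f"Estimated cost: {cost:,.0f} USD")
```

Because cost grows linearly with both object count and tokens per object, truncating inputs or classifying only a sample are the obvious levers for reducing it.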
Definition
What it is
Using LLMs to categorize, tag, or label S3-stored objects based on content analysis — classifying documents by topic, sensitivity level, or compliance category.
Why it exists
S3 buckets accumulate vast quantities of unlabeled data. Classification enables governance (identifying PII), organization (routing data to correct processing pipelines), and discovery (finding relevant data across a large lake).
Primary use cases
PII detection in S3-stored documents, automated data governance tagging, content-based routing in data lake ingestion.
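A minimal sketch of content-based routing, assuming the LLM is reachable through some callable that takes a prompt and returns text. The label taxonomy, the destination prefixes, the prompt wording, and the `llm` callable are hypothetical; a real deployment would substitute its own model client and categories.

```python
from typing import Callable

# Hypothetical taxonomy and destination prefixes.
ALLOWED_LABELS = ("public", "internal", "pii")
ROUTES = {
    "public": "processed/public/",
    "internal": "processed/internal/",
    "pii": "quarantine/pii/",
}

PROMPT = (
    "Classify the document below. Answer with exactly one word from: {labels}.\n\n"
    "{text}"
)


def classify(text: str, llm: Callable[[str], str], max_chars: int = 4000) -> str:
    """Ask the model for a label, truncating the input to bound per-object cost."""
    prompt = PROMPT.format(labels=", ".join(ALLOWED_LABELS), text=text[:max_chars])
    answer = llm(prompt).strip().lower()
    # Fail closed: an unexpected answer is treated as the most restrictive label.
    return answer if answer in ALLOWED_LABELS else "pii"


def route(key: str, label: str) -> str:
    """Map an object key to its destination key under the label's prefix."""
    filename = key.rsplit("/", 1)[-1]
    return ROUTES[label] + filename


if __name__ == "__main__":
    # Stand-in for a real model call, used only to exercise the flow.
    def fake_llm(prompt: str) -> str:
        return "PII" if "passport" in prompt else "internal"

    label = classify("Scan of passport no. X123", fake_llm)
    print(label, route("incoming/2024/scan.pdf", label))
```

Falling back to the most restrictive label on unparseable output is a design choice that trades extra quarantine volume for a lower risk of leaking misclassified data.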
Relationships
Outbound Relationships
- depends_on General-Purpose LLM
- augments Apache Iceberg
- constrained_by High Cloud Inference Cost
Inbound Relationships
enables
Resources
Grab's engineering blog on deploying LLM-powered data classification at petabyte scale across their entire data lake, using GPT-3.5 for PII tagging and sensitivity tiering.
Official Amazon Macie documentation for automated sensitive data discovery and PII classification in S3 buckets using machine learning and pattern matching.