LLM Capability

Data Classification

Summary

What it is

Using LLMs to categorize, tag, or label S3-stored objects based on content analysis — by topic, sensitivity level, or compliance category.

Where it fits

Data classification underpins governance of S3 data lakes. It identifies PII, classifies documents by sensitivity, and routes data to the appropriate processing pipelines. These tasks matter most at scale, where manual review is impractical.
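As an illustration of the classify-then-tag step, here is a minimal Python sketch. Several things in it are assumptions, not part of any specific product: the four-level taxonomy, the prompt wording, the `classify_text` and `to_s3_tag_set` helper names, and the `llm` callable (any function that takes a prompt string and returns the model's text reply). The output dict is shaped like the `Tagging` argument that boto3's `put_object_tagging` accepts, but the sketch makes no AWS call.

```python
import json
from typing import Callable

# Hypothetical sensitivity taxonomy; substitute your organization's own levels.
SENSITIVITY_LEVELS = ("public", "internal", "confidential", "restricted")

PROMPT_TEMPLATE = (
    "Classify the document below. Reply with only a JSON object that has two keys: "
    '"sensitivity" (one of: {levels}) and "contains_pii" (true or false).\n\n'
    "Document:\n{content}"
)


def classify_text(text: str, llm: Callable[[str], str], max_chars: int = 4000) -> dict:
    """Ask an LLM for a sensitivity label and PII flag, failing closed on bad output."""
    prompt = PROMPT_TEMPLATE.format(
        levels=", ".join(SENSITIVITY_LEVELS),
        content=text[:max_chars],  # truncate to bound per-object token cost
    )
    try:
        parsed = json.loads(llm(prompt))
    except json.JSONDecodeError:
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}

    sensitivity = parsed.get("sensitivity")
    if sensitivity not in SENSITIVITY_LEVELS:
        sensitivity = "restricted"  # unknown or missing label: assume the strictest level
    # Treat anything other than an explicit JSON false as "may contain PII".
    contains_pii = parsed.get("contains_pii") is not False
    return {"sensitivity": sensitivity, "contains_pii": contains_pii}


def to_s3_tag_set(classification: dict) -> dict:
    """Shape the result like the Tagging argument of boto3's put_object_tagging."""
    return {
        "TagSet": [
            {"Key": "sensitivity", "Value": classification["sensitivity"]},
            {"Key": "contains_pii", "Value": str(classification["contains_pii"]).lower()},
        ]
    }
```

Injecting the model as a callable keeps the sketch independent of any particular provider. Defaulting to the strictest label on unparseable output is one possible design choice: a model error then leads to over-protection, not under-protection.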

Misconceptions / Traps

Classification accuracy varies by data type and domain. General-purpose LLMs may misclassify domain-specific content; fine-tuned or domain-adapted models can improve accuracy on such content.
Classification is not a substitute for proper access controls. Tagging data as "sensitive" does not protect it — IAM policies and encryption must enforce the classification.
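To make the second point concrete, the sketch below builds a bucket policy that denies reads of objects tagged as restricted unless the caller carries a matching principal tag. It relies on the `s3:ExistingObjectTag/<key>` and `aws:PrincipalTag/<key>` condition keys. The tag names (`sensitivity`, `clearance`), the tag value, the bucket name, and the function name are all hypothetical. Treat this as a starting point under those assumptions, not a complete access-control design.

```python
import json


def deny_restricted_reads_policy(
    bucket: str,
    tag_key: str = "sensitivity",      # hypothetical object tag written by the classifier
    tag_value: str = "restricted",     # hypothetical label for the strictest tier
    clearance_tag: str = "clearance",  # hypothetical tag on the IAM principal
) -> dict:
    """Build a bucket policy that ties read access to the classification tag."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyRestrictedReadsWithoutClearance",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                # Both conditions must hold for the Deny to apply.
                "Condition": {
                    "StringEquals": {f"s3:ExistingObjectTag/{tag_key}": tag_value},
                    "StringNotEquals": {f"aws:PrincipalTag/{clearance_tag}": tag_value},
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(deny_restricted_reads_policy("example-data-lake"), indent=2))
```

A policy like this is only as trustworthy as the tags themselves, so permission to change object tags (for example `s3:PutObjectTagging`) would need to be restricted as well. Encryption remains a separate control.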

Key Connections

depends_on General-Purpose LLM — requires content understanding
augments Apache Iceberg — enriches table metadata with classification tags
constrained_by High Cloud Inference Cost — per-object processing is expensive
scoped_to LLM-Assisted Data Systems, Metadata Management
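The cost constraint above can be estimated with simple arithmetic before committing to per-object classification. The following sketch assumes token-based pricing quoted per million tokens. The function name, its parameters, and every number in the usage note are placeholders, not real prices or measurements.

```python
def estimate_classification_cost(
    num_objects: int,
    avg_input_tokens: float,
    avg_output_tokens: float,
    input_price_per_million: float,
    output_price_per_million: float,
    sample_rate: float = 1.0,
) -> float:
    """Rough inference cost for classifying a share of the objects in a bucket."""
    if not 0.0 < sample_rate <= 1.0:
        raise ValueError("sample_rate must be in (0, 1]")
    classified = num_objects * sample_rate  # objects actually sent to the model
    input_cost = classified * avg_input_tokens / 1_000_000 * input_price_per_million
    output_cost = classified * avg_output_tokens / 1_000_000 * output_price_per_million
    return input_cost + output_cost
```

For example, with placeholder inputs of 10 million objects, 1,000 input tokens and 20 output tokens per object, and prices of 1.0 and 2.0 currency units per million tokens, the formula gives 10,000 + 400 = 10,400 units. Lowering `sample_rate` or truncating the text sent to the model reduces the total proportionally.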

Definition

What it is

Using LLMs to categorize, tag, or label S3-stored objects based on content analysis — classifying documents by topic, sensitivity level, or compliance category.

Why it exists

S3 buckets accumulate vast quantities of unlabeled data. Classification enables governance (identifying PII), organization (routing data to correct processing pipelines), and discovery (finding relevant data across a large lake).

Primary use cases

PII detection in S3-stored documents, automated data governance tagging, content-based routing in data lake ingestion.
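Content-based routing during ingestion can be sketched as a lookup from classification result to destination prefix. The example assumes a classification dict with hypothetical keys `sensitivity` and `contains_pii`. The prefixes and the `route_object` helper are likewise illustrative.

```python
# Hypothetical destination prefixes per sensitivity level.
ROUTES = {
    "public": "curated/open/",
    "internal": "curated/internal/",
    "confidential": "restricted/confidential/",
}
PII_REVIEW_PREFIX = "pii-review/"             # hypothetical holding area for PII
QUARANTINE_PREFIX = "restricted/quarantine/"  # fallback for unknown labels


def route_object(key: str, classification: dict) -> str:
    """Return the destination key for an ingested object based on its labels."""
    # A missing PII flag is treated as "may contain PII" to stay on the safe side.
    if classification.get("contains_pii", True):
        return PII_REVIEW_PREFIX + key
    prefix = ROUTES.get(classification.get("sensitivity"), QUARANTINE_PREFIX)
    return prefix + key
```

Checking the PII flag first means a document labeled "public" that still contains personal data goes to review and is not published. Unknown labels fall through to quarantine, so nothing is silently dropped.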

Relationships

Outbound Relationships

scoped_to

LLM-Assisted Data Systems, Metadata Management

depends_on

General-Purpose LLM

augments

Apache Iceberg

constrained_by

High Cloud Inference Cost

Inbound Relationships

enables

General-Purpose LLM

Resources

Blog, High

engineering.grab.com/llm-powered-data-classification

Grab's engineering blog on deploying LLM-powered data classification at petabyte scale across their entire data lake, using GPT-3.5 for PII tagging and sensitivity tiering.

Docs, High

docs.aws.amazon.com/macie/latest/user/data-classification.ht...

Official AWS Macie documentation for automated sensitive data discovery and PII classification in S3 buckets using machine learning and pattern matching.