ORC
Summary
What it is
Optimized Row Columnar file format specification — a columnar format with built-in indexing, compression, and predicate pushdown support, originally developed for the Hive ecosystem.
Where it fits
ORC is the legacy columnar format in the Hadoop/Hive ecosystem. On S3, it serves the same role as Parquet — efficient columnar storage for analytical queries — but is primarily used in organizations with existing Hive investments.
Misconceptions / Traps
- ORC and Parquet are functionally similar for most workloads. The choice is usually driven by ecosystem (Hive → ORC, everything else → Parquet) rather than technical superiority.

- ORC's built-in ACID support (for Hive) operates differently from table format ACID (Iceberg, Delta). They are not the same concept.
Key Connections
- used_by: Apache Spark, Trino — supported as a data file format
- solves: Cold Scan Latency — columnar format enables predicate pushdown
- scoped_to: S3, Table Formats
Definition
What it is
Optimized Row Columnar file format specification. A columnar format with built-in indexing, compression, and predicate pushdown support, originally developed for the Hive ecosystem.
Why it exists
ORC predates Parquet in the Hadoop ecosystem and remains in use in organizations with significant Hive and Spark-on-YARN investments. It provides similar benefits to Parquet (columnar storage, efficient analytics) with different trade-offs, such as ORC's built-in row-group indexes and optional bloom filters versus Parquet's broader ecosystem support.
Primary use cases
Analytical data storage in Hive-centric S3 environments, legacy Hadoop data lake compatibility.
Relationships
Resources
The authoritative ORC file format specification defining the stripe structure, type system, encoding schemes, compression, indexes, and file footer layout.
Official Apache ORC documentation covering configuration, Hive/Spark integration, ACID support, and performance tuning.
Canonical repository containing the C++ and Java implementations of the ORC format, plus the specification source and test files.