ORC
Summary
What it is
Optimized Row Columnar file format specification — a columnar format with built-in indexing, compression, and predicate pushdown support, originally developed for the Hive ecosystem.
Where it fits
ORC is the legacy columnar format in the Hadoop/Hive ecosystem. On S3, it serves the same role as Parquet — efficient columnar storage for analytical queries — but is primarily used in organizations with existing Hive investments.
Misconceptions / Traps
- ORC and Parquet are functionally similar for most workloads. The choice is usually driven by ecosystem (Hive → ORC, everything else → Parquet) rather than technical superiority.

- ORC's built-in ACID support (for Hive) operates differently from table format ACID (Iceberg, Delta). They are not the same concept.
Key Connections
- used_by: Apache Spark, Trino — supported as a data file format
- solves: Cold Scan Latency — columnar format enables predicate pushdown
- scoped_to: S3, Table Formats
Definition
What it is
Optimized Row Columnar file format specification. A columnar format with built-in indexing, compression, and predicate pushdown support, originally developed for the Hive ecosystem.
Why it exists
ORC predates Parquet in the Hadoop ecosystem and remains in use in organizations with significant Hive and Spark-on-YARN investments. It provides similar benefits to Parquet (columnar storage, efficient analytics) with different trade-offs, such as ORC's built-in row-group indexes and optional bloom filters versus Parquet's broader ecosystem support.
Primary use cases
Analytical data storage in Hive-centric S3 environments, legacy Hadoop data lake compatibility.
Relationships
Resources
The authoritative ORC file format specification defining the stripe structure, type system, encoding schemes, compression, indexes, and file footer layout.
Official Apache ORC documentation covering configuration, Hive/Spark integration, ACID support, and performance tuning.
Canonical repository containing the C++ and Java implementations of the ORC format, plus the specification source and test files.