Add tech_docs/columnar_data.md

### **What is Columnar Data?**
Columnar data refers to a way of storing and organizing data where values are stored by **column** rather than by **row** (as in traditional row-based storage). This format is optimized for analytical queries that read large datasets but only access a subset of columns.
#### **Key Characteristics:**
- Data is stored column-wise (e.g., all values for `column1` are stored together, then `column2`, etc.).
- Highly efficient for **read-heavy** operations (e.g., aggregations, filtering on specific columns).
- Typically used in **OLAP (Online Analytical Processing)** systems (e.g., data warehouses, big data analytics).
#### **Example: Row vs. Column Storage**
| **Row-Based Storage** (e.g., CSV, traditional databases) | **Columnar Storage** (e.g., Parquet, ORC) |
|----------------------------------------------------------|------------------------------------------|
| `[Row1: 1, "Alice", 25], [Row2: 2, "Bob", 30], ...` | `IDs: [1, 2, ...]`, `Names: ["Alice", "Bob", ...]`, `Ages: [25, 30, ...]` |
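
To make the layout difference concrete, here is a minimal Python sketch (plain lists and dicts, no external libraries) contrasting the two orientations for the table above:

```python
# Row-based layout: each record is kept together.
rows = [
    {"id": 1, "name": "Alice", "age": 25},
    {"id": 2, "name": "Bob", "age": 30},
]

# Columnar layout: each column's values are kept together.
columns = {
    "id":   [1, 2],
    "name": ["Alice", "Bob"],
    "age":  [25, 30],
}

# Averaging "age" only touches one contiguous list in the columnar layout...
avg_age = sum(columns["age"]) / len(columns["age"])
# ...while the row layout has to walk every record to reach the same field.
avg_age_rows = sum(r["age"] for r in rows) / len(rows)
```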
---
### **How to Identify Columnar Data?**
You can determine if data is stored in a columnar format by checking:
1. **File Format:** Common columnar formats include:
- **Apache Parquet** (`.parquet`)
- **ORC** (Optimized Row Columnar, `.orc`)
- **Arrow** (in-memory columnar format)
- **Columnar databases** (e.g., Snowflake, BigQuery, Redshift)
2. **Metadata:** Columnar files often carry per-column metadata such as min/max statistics and null counts (see the PyArrow sketch after this list).
3. **Query Performance:** If filtering or aggregating a single column is extremely fast, it's likely columnar.
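
If PyArrow is available, you can confirm a Parquet file is columnar by opening its footer and reading the per-column statistics mentioned in point 2. A minimal sketch (`data.parquet` is a placeholder path):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")   # placeholder file path
print(pf.metadata)                    # row groups, column count, total rows
print(pf.schema_arrow)                # column names and types

# Per-column statistics (min/max, null count) stored in the file footer
stats = pf.metadata.row_group(0).column(0).statistics
print(stats.min, stats.max, stats.null_count)
```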
---
### **How to Efficiently Process Columnar Data?**
To maximize performance when working with columnar data:
1. **Use Columnar-Optimized Tools:**
- **Query Engines:** Apache Spark, Presto, DuckDB, ClickHouse.
- **Libraries:** PyArrow (Python), `pandas` (with `engine='pyarrow'`).
- **Databases:** Snowflake, BigQuery, Amazon Redshift.
2. **Push Down Predicates:**
- Filter columns early in the query (columnar storage skips unneeded data).
```sql
-- Good: reads only the "name" and "age" columns, and the predicate on
-- "age" lets the engine skip data that cannot match
SELECT name FROM users WHERE age > 30;
```
3. **Use Column Pruning:**
- Only read the columns you need (avoid `SELECT *`).
4. **Partitioning:**
- Partition files on the values of a column (e.g., date) so queries scan only the relevant partitions and further reduce I/O (see the sketch after this list).
5. **Compression:**
- Columnar formats (like Parquet) use efficient compression (e.g., run-length encoding, dictionary encoding).
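
Points 4 and 5 can be illustrated with PyArrow's writer; a rough sketch in which the `date` partition column, the `zstd` codec, and the output paths are illustrative choices, not requirements:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "name": ["Alice", "Bob", "Carol"],
    "age":  [25, 30, 41],
})

# Partitioning: one directory per distinct "date" value, so a query for
# a single day only opens that day's files.
pq.write_to_dataset(table, root_path="events/", partition_cols=["date"])

# Compression and dictionary encoding are applied per column when writing.
pq.write_table(table, "events.parquet", compression="zstd", use_dictionary=True)
```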
#### **Example in Code:**
```python
# Using PyArrow (columnar processing)
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Column pruning: read only the columns you need
table = pq.read_table("data.parquet", columns=["name", "age"])

# Predicate pushdown: the reader skips row groups whose statistics rule out matches
pushed = pq.read_table("data.parquet", filters=[("age", ">", 30)])

# In-memory filter on the already-loaded columnar table
filtered = table.filter(pc.field("age") > 30)
```
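
The same read can be done through pandas using the `pyarrow` engine mentioned above; a short sketch assuming pandas 2.x with PyArrow installed:

```python
import pandas as pd

# Column pruning via the `columns` argument; PyArrow handles the Parquet I/O
df = pd.read_parquet("data.parquet", columns=["name", "age"], engine="pyarrow")
over_30 = df[df["age"] > 30]
```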
---
### **When to Use Columnar Data?**
**Best for:**
- Analytical workloads (aggregations, scans on few columns).
- Large datasets where I/O efficiency matters.
- Cloud data warehouses (BigQuery, Snowflake).
**Not ideal for:**
- Row-by-row transactional workloads (OLTP).
- Frequent single-row updates (a single update touches many separate column segments, making writes expensive).
---
### **Summary**
- **Columnar data** stores values by column (not row), optimizing read performance.
- **Identify it** by file formats (Parquet, ORC) or fast column-specific queries.
- **Process efficiently** by using columnar tools, predicate pushdown, and column pruning.