From eb10a88ce1dc73b010f547999ea1f665b6318775 Mon Sep 17 00:00:00 2001
From: medusa
Date: Tue, 1 Jul 2025 13:01:32 +0000
Subject: [PATCH] Add tech_docs/columnar_data.md

---
 tech_docs/columnar_data.md | 78 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 78 insertions(+)
 create mode 100644 tech_docs/columnar_data.md

diff --git a/tech_docs/columnar_data.md b/tech_docs/columnar_data.md
new file mode 100644
index 0000000..47f335c
--- /dev/null
+++ b/tech_docs/columnar_data.md
@@ -0,0 +1,78 @@
+### **What is Columnar Data?**
+Columnar data is data stored and organized by **column** rather than by **row** (as in traditional row-based storage). This layout is optimized for analytical queries that scan large datasets but touch only a subset of columns.
+
+#### **Key Characteristics:**
+- Data is stored column-wise (e.g., all values for `column1` are stored together, then `column2`, and so on).
+- Highly efficient for **read-heavy** operations (e.g., aggregations, filtering on specific columns).
+- Typically used in **OLAP (Online Analytical Processing)** systems (e.g., data warehouses, big data analytics).
+
+#### **Example: Row vs. Column Storage**
+| **Row-Based Storage** (e.g., CSV, traditional databases) | **Columnar Storage** (e.g., Parquet, ORC) |
+|----------------------------------------------------------|------------------------------------------|
+| `[Row1: 1, "Alice", 25], [Row2: 2, "Bob", 30], ...` | `IDs: [1, 2, ...]`, `Names: ["Alice", "Bob", ...]`, `Ages: [25, 30, ...]` |
+
+---
+
+### **How to Identify Columnar Data?**
+You can determine whether data is stored in a columnar format by checking:
+1. **File Format:** Common columnar formats include:
+   - **Apache Parquet** (`.parquet`)
+   - **ORC** (Optimized Row Columnar, `.orc`)
+   - **Arrow** (in-memory columnar format)
+   - **Columnar databases** store tables this way internally (e.g., Snowflake, BigQuery, Redshift).
+2. **Metadata:** Columnar files carry per-column metadata such as min/max statistics (see the metadata-inspection sketch after the code example below).
+3. **Query Performance:** If filtering or aggregating a single column is disproportionately fast compared to full-row reads, the data is likely columnar.
+
+---
+
+### **How to Efficiently Process Columnar Data?**
+To maximize performance when working with columnar data:
+1. **Use Columnar-Optimized Tools:**
+   - **Query Engines:** Apache Spark, Presto, DuckDB, ClickHouse.
+   - **Libraries:** PyArrow (Python), `pandas` (with `engine='pyarrow'`).
+   - **Databases:** Snowflake, BigQuery, Amazon Redshift.
+2. **Push Down Predicates:**
+   - Apply filters as early as possible; the engine uses per-column statistics to skip data that cannot match.
+   ```sql
+   -- Reads only the "name" and "age" columns; row groups whose
+   -- "age" statistics rule out a match are skipped entirely
+   SELECT name FROM table WHERE age > 30;
+   ```
+3. **Use Column Pruning:**
+   - Read only the columns you need (avoid `SELECT *`).
+4. **Partitioning:**
+   - Partition data on a frequently filtered column (e.g., a date) so queries can skip entire files or directories (see the partitioning sketch below).
+5. **Compression:**
+   - Columnar formats (like Parquet) compress well because each column holds values of a single type (e.g., run-length encoding, dictionary encoding).
+
+#### **Example in Code:**
+```python
+# Using PyArrow (columnar processing)
+import pyarrow.parquet as pq
+
+# Column pruning: read only the columns you need
+table = pq.read_table("data.parquet", columns=["name", "age"])
+
+# Predicate pushdown: row groups whose "age" statistics cannot
+# satisfy the filter are skipped without being read from disk
+filtered = pq.read_table(
+    "data.parquet",
+    columns=["name", "age"],
+    filters=[("age", ">", 30)],
+)
+```
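+
+To make the metadata point from the identification checklist concrete, here is a minimal sketch of inspecting per-column metadata with PyArrow's `ParquetFile` API (the path `data.parquet` is a placeholder):
+
+```python
+# Inspect Parquet metadata: schema, row groups, per-column min/max stats
+import pyarrow.parquet as pq
+
+pf = pq.ParquetFile("data.parquet")  # placeholder path
+print(pf.schema_arrow)  # column names and types
+print(pf.metadata)      # row groups, total rows, format version
+
+# Per-column statistics for the first row group; these are the
+# min/max values that predicate pushdown consults to skip data
+rg = pf.metadata.row_group(0)
+for i in range(rg.num_columns):
+    col = rg.column(i)
+    if col.statistics is not None and col.statistics.has_min_max:
+        print(col.path_in_schema, col.statistics.min, col.statistics.max)
+```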
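+
+And a minimal sketch of the partitioning and compression tips (points 4 and 5 in the processing list); the table contents, column names, and output paths are illustrative only:
+
+```python
+# Partitioning and compression with PyArrow (illustrative data and paths)
+import pyarrow as pa
+import pyarrow.parquet as pq
+
+table = pa.table({
+    "event_date": ["2025-01-01", "2025-01-01", "2025-01-02"],
+    "name": ["Alice", "Bob", "Carol"],
+    "age": [25, 30, 45],
+})
+
+# Compression: dictionary encoding plus zstd shrinks repetitive columns
+pq.write_table(table, "events.parquet", compression="zstd", use_dictionary=True)
+
+# Partitioning: one directory per event_date value, so a filter on the
+# partition column skips entire directories without opening any files
+pq.write_to_dataset(table, root_path="events/", partition_cols=["event_date"])
+subset = pq.read_table("events/", filters=[("event_date", "=", "2025-01-02")])
+```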
+
+---
+
+### **When to Use Columnar Data?**
+✅ **Best for:**
+- Analytical workloads (aggregations and scans over a few columns).
+- Large datasets where I/O efficiency matters.
+- Cloud data warehouses (BigQuery, Snowflake).
+
+❌ **Not ideal for:**
+- Row-by-row transactional workloads (OLTP).
+- Frequent single-row updates (updating one row touches every column's storage).
+
+---
+
+### **Summary**
+- **Columnar data** stores values by column (not row), optimizing read performance.
+- **Identify it** by file format (Parquet, ORC), per-column metadata, or disproportionately fast column-specific queries.
+- **Process it efficiently** with columnar-aware tools, predicate pushdown, and column pruning.