Add tech_docs/database/sql_getting_started.md

2025-06-18 04:34:22 +00:00
parent 2961b6a216
commit 65adc021aa
1 changed files with 454 additions and 0 deletions
--- a/tech_docs/database/sql_getting_started.md
+++ b/tech_docs/database/sql_getting_started.md
@@ -0,0 +1,454 @@
+# **The Ultimate SQL Getting Started Guide**
+
+This guide takes you from absolute beginner to SQL proficiency, with a focus on practical data analysis and exploratory data analysis (EDA).
+
+## **1. SQL Fundamentals**
+
+### **What is SQL?**
+SQL (Structured Query Language) is the standard language for interacting with relational databases. It allows you to:
+- Retrieve data
+- Insert, update, and delete records
+- Create and modify database structures
+- Perform complex calculations on data
+
+### **Core Concepts**
+1. **Databases**: Collections of structured data
+2. **Tables**: Data organized in rows and columns
+3. **Queries**: Commands to interact with data
+4. **Schemas**: Blueprints defining database structure
+
+## **2. Setting Up Your SQL Environment**
+
+### **Choose a Database System**
+| Option | Best For | Installation |
+|--------|----------|--------------|
+| **SQLite** | Beginners, small projects | Bundled with Python (`sqlite3` module) |
+| **PostgreSQL** | Production, complex queries | [Download here](https://www.postgresql.org/download/) |
+| **MySQL** | Web applications | [Download here](https://dev.mysql.com/downloads/) |
+| **DuckDB** | Analytical workloads | `pip install duckdb` |
+
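Because SQLite ships with Python's standard library, the first option in the table needs no separate installation. The following is a minimal sketch of creating an in-memory database and running a filtered query with the `sqlite3` module. The `employees` table and its rows are hypothetical and exist only for illustration.

```python
import sqlite3

# An in-memory database disappears when the connection closes;
# pass a file path instead of ":memory:" to persist it.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, used only for illustration.
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ana", "Sales", 52000), ("Ben", "Sales", 48000), ("Cy", "IT", 61000)],
)

# Parameterized query: the ? placeholder keeps values out of the SQL string.
high_earners = conn.execute(
    "SELECT name, salary FROM employees WHERE salary > ? ORDER BY salary DESC",
    (50000,),
).fetchall()

print(high_earners)
conn.close()
```

The same pattern should carry over to the other systems in the table with their own Python drivers, although placeholder syntax and connection arguments differ between drivers.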
+### **Install a SQL Client**
+- **DBeaver** (Free, multi-platform)
+- **TablePlus** (Paid, excellent UI)
+- **VS Code + SQLTools** (For developers)
+
+## **3. Basic SQL Syntax**
+
+### **SELECT Statements**
+```sql
+-- Basic selection
+SELECT column1, column2 FROM table_name;
+
+-- Select all columns
+SELECT * FROM table_name;
+
+-- Filtering with WHERE
+SELECT * FROM table_name WHERE condition;
+
+-- Sorting with ORDER BY
+SELECT * FROM table_name ORDER BY column1 DESC;
+```
+
+### **Common Data Types**
+- `INTEGER`: Whole numbers
+- `FLOAT/REAL`: Approximate floating-point numbers (use `DECIMAL`/`NUMERIC` for exact values such as money)
+- `VARCHAR(n)`: Text (n characters max)
+- `BOOLEAN`: True/False
+- `DATE/TIMESTAMP`: Date and time values
+
+## **4. Essential SQL Operations**
+
+### **Filtering Data**
+```sql
+-- Basic conditions
+SELECT * FROM employees WHERE salary > 50000;
+
+-- Multiple conditions
+SELECT * FROM products 
+WHERE price BETWEEN 10 AND 100 
+AND category = 'Electronics';
+
+-- Pattern matching
+SELECT * FROM customers 
+WHERE name LIKE 'J%'; -- Starts with J
+```
+
+### **Sorting and Limiting**
+```sql
+-- Sort by multiple columns
+SELECT * FROM orders 
+ORDER BY order_date DESC, total_amount DESC;
+
+-- Limit results
+SELECT * FROM large_table LIMIT 100;
+```
+
+### **Aggregation Functions**
+```sql
+-- Basic aggregations
+SELECT 
+    COUNT(*) AS total_orders,
+    AVG(amount) AS avg_order,
+    MAX(amount) AS largest_order
+FROM orders;
+
+-- GROUP BY
+SELECT 
+    department, 
+    AVG(salary) AS avg_salary
+FROM employees
+GROUP BY department;
+```
+
+## **5. Joining Tables**
+
+### **Join Types**
+| Join Type | Description | Example |
+|-----------|-------------|---------|
+| **INNER JOIN** | Only matching rows | `SELECT * FROM A INNER JOIN B ON A.id = B.id` |
+| **LEFT JOIN** | All from left table, matches from right | `SELECT * FROM A LEFT JOIN B ON A.id = B.id` |
+| **RIGHT JOIN** | All from right table, matches from left | `SELECT * FROM A RIGHT JOIN B ON A.id = B.id` |
+| **FULL JOIN** | All rows from both tables, with NULLs where there is no match (not available in MySQL) | `SELECT * FROM A FULL JOIN B ON A.id = B.id` |
+
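The difference between the first two join types is easiest to see on a tiny dataset. This is a sketch only, assuming two hypothetical tables (`customers` and `orders`) in which one customer has no orders; it runs the same query with `INNER JOIN` and `LEFT JOIN` through Python's `sqlite3` module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical tables: customer 3 ('Cy') has no orders.
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total_amount REAL);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# INNER JOIN keeps only customers that have at least one matching order.
inner_rows = conn.execute("""
    SELECT c.customer_name, o.order_id
    FROM customers c
    INNER JOIN orders o ON o.customer_id = c.customer_id
    ORDER BY c.customer_id, o.order_id
""").fetchall()

# LEFT JOIN keeps every customer; unmatched ones get NULL (None in Python).
left_rows = conn.execute("""
    SELECT c.customer_name, o.order_id
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.customer_id
    ORDER BY c.customer_id, o.order_id
""").fetchall()

print(len(inner_rows), "rows from INNER JOIN")
print(len(left_rows), "rows from LEFT JOIN; last row:", left_rows[-1])
conn.close()
```

The extra row in the `LEFT JOIN` result is the customer without orders, which is often exactly the row an analysis needs (for example, to count inactive customers).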
+### **Practical Example**
+```sql
+SELECT 
+    o.order_id,
+    c.customer_name,
+    o.order_date,
+    o.total_amount
+FROM orders o
+JOIN customers c ON o.customer_id = c.customer_id
+WHERE o.order_date > '2023-01-01'
+ORDER BY o.total_amount DESC;
+```
+
+## **6. Advanced SQL Features**
+
+### **Common Table Expressions (CTEs)**
+```sql
+WITH high_value_customers AS (
+    SELECT customer_id, SUM(amount) AS total_spent
+    FROM orders
+    GROUP BY customer_id
+    HAVING SUM(amount) > 1000
+)
+SELECT * FROM high_value_customers;
+```
+
+### **Window Functions**
+```sql
+-- Running total
+SELECT 
+    date,
+    revenue,
+    SUM(revenue) OVER (ORDER BY date) AS running_total
+FROM daily_sales;
+
+-- Rank products by category
+SELECT 
+    product_name,
+    category,
+    price,
+    RANK() OVER (PARTITION BY category ORDER BY price DESC) AS price_rank
+FROM products;
+```
+
+## **7. SQL for Data Analysis**
+
+### **Time Series Analysis**
+```sql
+-- Daily aggregates (DATE_TRUNC is PostgreSQL/DuckDB syntax; SQLite and MySQL use other date functions)
+SELECT 
+    DATE_TRUNC('day', transaction_time) AS day,
+    COUNT(*) AS transactions,
+    SUM(amount) AS total_amount
+FROM transactions
+GROUP BY 1
+ORDER BY 1;
+
+-- Month-over-month growth
+WITH monthly_sales AS (
+    SELECT 
+        DATE_TRUNC('month', order_date) AS month,
+        SUM(amount) AS total_sales
+    FROM orders
+    GROUP BY 1
+)
+SELECT 
+    month,
+    total_sales,
+    -- 1.0 forces decimal division; NULLIF avoids division by zero
+    (total_sales - LAG(total_sales) OVER (ORDER BY month)) * 1.0 / 
+        NULLIF(LAG(total_sales) OVER (ORDER BY month), 0) AS growth_rate
+FROM monthly_sales
+ORDER BY month;
+```
+
+### **Pivot Tables in SQL**
+```sql
+-- Using CASE expressions
+SELECT 
+    product_category,
+    SUM(CASE WHEN EXTRACT(YEAR FROM order_date) = 2022 THEN amount ELSE 0 END) AS sales_2022,
+    SUM(CASE WHEN EXTRACT(YEAR FROM order_date) = 2023 THEN amount ELSE 0 END) AS sales_2023
+FROM orders
+GROUP BY product_category;
+```
+
+## **8. Performance Optimization**
+
+### **Indexing Strategies**
+```sql
+-- Create indexes
+CREATE INDEX idx_customer_name ON customers(name);
+CREATE INDEX idx_order_date ON orders(order_date);
+
+-- Composite index
+CREATE INDEX idx_category_price ON products(category, price);
+```
+
+### **Query Optimization Tips**
+1. Use `EXPLAIN ANALYZE` to understand query plans (in PostgreSQL it runs the query and reports actual timings)
+2. Limit columns in `SELECT` (avoid `SELECT *`)
+3. Filter early with `WHERE` clauses
+4. Use appropriate join types
+
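The first tip can be tried without a PostgreSQL server. The sketch below uses SQLite's `EXPLAIN QUERY PLAN`, which is a different command from PostgreSQL's `EXPLAIN ANALYZE` (it describes the plan without executing the query), to compare the plan for one filter before and after adding an index. The `orders` rows are made up, and the exact plan wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT, amount REAL)"
)
# Hypothetical data: one order per day for four weeks.
conn.executemany(
    "INSERT INTO orders (order_date, amount) VALUES (?, ?)",
    [(f"2023-01-{day:02d}", day * 10.0) for day in range(1, 29)],
)

query = "SELECT order_id, amount FROM orders WHERE order_date = '2023-01-15'"


def plan(sql):
    # Each plan row has four columns; the last one is the readable description.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


plan_before = plan(query)  # no index yet, so the table has to be scanned
conn.execute("CREATE INDEX idx_order_date ON orders(order_date)")
plan_after = plan(query)   # the planner can now look rows up via the index

print("before:", plan_before)
print("after: ", plan_after)
```

Comparing plans this way is a quick check that a new index is actually picked up by the query it was meant to speed up.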
+## **9. Learning Resources**
+
+### **Free Interactive Tutorials**
+1. [SQLZoo](https://sqlzoo.net/)
+2. [Mode Analytics SQL Tutorial](https://mode.com/sql-tutorial/)
+3. [PostgreSQL Exercises](https://pgexercises.com/)
+
+### **Books**
+- "SQL for Data Analysis" by Cathy Tanimura
+- "SQL Cookbook" by Anthony Molinaro
+
+### **Practice Platforms**
+- [LeetCode SQL Problems](https://leetcode.com/problemset/database/)
+- [HackerRank SQL](https://www.hackerrank.com/domains/sql)
+
+## **10. Next Steps**
+
+1. **Install a database system** and practice daily
+2. **Work with real datasets** (try [Kaggle datasets](https://www.kaggle.com/datasets))
+3. **Build a portfolio project** (e.g., analyze sales data)
+4. **Learn database design** (normalization, relationships)
+
+Remember: SQL is a skill best learned by doing. Start writing queries today!
+
+---
+
+### **Technical Overview: SQL for EDA (Exploratory Data Analysis)**
+Below is a **structured roadmap** for SQL-first EDA, covering key SQL concepts, EDA-specific queries, and tips for efficient analysis.
+
+---
+
+## **1. Core SQL Concepts for EDA**
+### **A. Foundational Operations**
+| Concept          | Purpose                          | Example Query                          |
+|------------------|----------------------------------|----------------------------------------|
+| **Filtering**    | Subset data (`WHERE`, `HAVING`)  | `SELECT * FROM prices WHERE asset = 'EUR_USD'` |
+| **Aggregation**  | Summarize data (`GROUP BY`)      | `SELECT asset, AVG(close) FROM prices GROUP BY asset` |
+| **Joins**        | Combine tables (`INNER JOIN`)    | `SELECT * FROM trades JOIN assets ON trades.asset_id = assets.id` |
+| **Sorting**      | Order results (`ORDER BY`)       | `SELECT * FROM prices ORDER BY time DESC` |
+
+### **B. Advanced EDA Tools**
+| Concept               | Purpose                                      | Example Query                          |
+|-----------------------|----------------------------------------------|----------------------------------------|
+| **Window Functions**  | Calculate rolling stats, ranks               | `SELECT time, AVG(close) OVER (ORDER BY time ROWS 29 PRECEDING) FROM prices` |
+| **CTEs (WITH)**       | Break complex queries into steps             | `WITH filtered AS (SELECT * FROM prices WHERE volume > 1000) SELECT * FROM filtered` |
+| **Statistical Aggregates** | Built-in stats (`STDDEV`, `CORR`, `PERCENTILE_CONT`) | `SELECT CORR(open, close) FROM prices` |
+| **Time-Series Handling** | Extract dates, resample                    | `SELECT DATE_TRUNC('hour', time) AS hour, AVG(close) FROM prices GROUP BY 1` |
+
+---
+
+## **2. Essential EDA Queries**
+### **A. Data Profiling**
+```sql
+-- 1. Basic stats
+SELECT 
+    COUNT(*) AS row_count,
+    COUNT(DISTINCT asset) AS unique_assets,
+    MIN(close) AS min_price,
+    MAX(close) AS max_price,
+    AVG(close) AS mean_price,
+    STDDEV(close) AS volatility
+FROM prices;
+
+-- 2. Missing values
+SELECT 
+    COUNT(*) - COUNT(close) AS missing_prices
+FROM prices;
+
+-- 3. Value distribution (histogram)
+SELECT 
+    FLOOR(close / 10) * 10 AS price_bin,
+    COUNT(*) AS frequency
+FROM prices
+GROUP BY 1
+ORDER BY 1;
+```
+
+### **B. Correlation Analysis**
+```sql
+-- 1. Pairwise correlations
+SELECT 
+    CORR(EUR_USD, GBP_USD) AS eur_gbp,
+    CORR(EUR_USD, USD_JPY) AS eur_jpy,
+    CORR(GBP_USD, USD_JPY) AS gbp_jpy
+FROM hourly_rates;
+
+-- 2. Rolling correlation over the last 30 rows (30 hours on hourly data;
+--    use ROWS 719 PRECEDING for roughly 30 days). PostgreSQL lets any
+--    aggregate, including CORR, run as a window function.
+SELECT 
+    time,
+    CORR(EUR_USD, GBP_USD) OVER (
+        ORDER BY time ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
+    ) AS rolling_corr
+FROM hourly_rates;
+```
+
+### **C. Time-Series EDA**
+```sql
+-- 1. Hourly volatility patterns
+SELECT 
+    EXTRACT(HOUR FROM time) AS hour,
+    AVG(ABS(close - open)) AS avg_volatility
+FROM prices
+GROUP BY 1
+ORDER BY 1;
+
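+-- 2. Daily returns distribution (FIRST()/LAST() aggregates are not standard SQL;
+--    FIRST_VALUE/LAST_VALUE over a full-day window are portable)
+SELECT DISTINCT
+    DATE_TRUNC('day', time) AS day,
+    (LAST_VALUE(close) OVER w - FIRST_VALUE(open) OVER w)
+        / FIRST_VALUE(open) OVER w AS daily_return
+FROM prices
+WINDOW w AS (PARTITION BY DATE_TRUNC('day', time) ORDER BY time
+             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
+ORDER BY 1;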
+```
+
+### **D. Outlier Detection**
+```sql
+-- Z-score outliers (|Z| > 3)
+WITH stats AS (
+    SELECT 
+        AVG(close) AS mean,
+        STDDEV(close) AS stddev
+    FROM prices
+)
+SELECT 
+    time,
+    close,
+    (close - mean) / stddev AS z_score
+FROM prices, stats
+WHERE ABS((close - mean) / stddev) > 3;
+```
+
+---
+
+## **3. Key Optimizations**
+### **A. Indexing for EDA**
+```sql
+-- Speed up time-series queries
+CREATE INDEX idx_prices_time ON prices(time);
+
+-- Speed up asset-specific filters
+CREATE INDEX idx_prices_asset ON prices(asset);
+```
+
+### **B. Partitioning Large Tables**
+```sql
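+-- Partition by time range (PostgreSQL 10+)
+CREATE TABLE prices (
+    time TIMESTAMP NOT NULL,
+    asset TEXT,
+    close FLOAT
+) PARTITION BY RANGE (time);
+
+-- The parent table holds no rows itself; each range needs its own partition
+CREATE TABLE prices_2023 PARTITION OF prices
+    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');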
+```
+
+### **C. Materialized Views**
+```sql
+-- Pre-compute frequent aggregates
+CREATE MATERIALIZED VIEW hourly_stats AS
+SELECT 
+    DATE_TRUNC('hour', time) AS hour,
+    AVG(close) AS avg_price,
+    STDDEV(close) AS volatility
+FROM prices
+GROUP BY 1;
+
+-- Refresh periodically
+REFRESH MATERIALIZED VIEW hourly_stats;
+```
+
+---
+
+## **4. Pro Tips**
+### **A. Use the Right Database**
+| Database      | Best For                          |
+|--------------|----------------------------------|
+| **PostgreSQL** | Complex EDA, extensions (MADlib) |
+| **DuckDB**   | Embedded analytics, Parquet/CSV  |
+| **SQLite**   | Lightweight, local prototyping   |
+
+### **B. Learn These Functions**
+| Function              | Purpose                          |
+|-----------------------|----------------------------------|
+| `DATE_TRUNC()`        | Group by time intervals          |
+| `FIRST_VALUE()` / `LAST_VALUE()` | First/last value in a window |
+| `PERCENTILE_CONT()`   | Median, quantiles                |
+| `ROLLUP`/`CUBE`       | Hierarchical aggregations        |
+
+### **C. Avoid Anti-Patterns**
+- **Don’t** fetch all rows to Python for filtering (`SELECT *` → `pd.read_sql()`).
+- **Do** push computations to SQL (`WHERE`, `GROUP BY` in-database).
+- **Don’t** use Python loops for row-wise operations.
+- **Do** use window functions for rolling calculations.
+
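The first pair of rules above can be illustrated side by side. This sketch assumes a hypothetical four-row `prices` table and uses plain `sqlite3` rather than `pd.read_sql()` so that it runs with the standard library alone. It computes the same average twice: once by fetching every row and filtering in Python, and once by pushing the filter and aggregation into SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (asset TEXT, close REAL)")
# Hypothetical quotes, for illustration only.
conn.executemany(
    "INSERT INTO prices VALUES (?, ?)",
    [("EUR_USD", 1.10), ("EUR_USD", 1.12), ("GBP_USD", 1.27), ("GBP_USD", 1.25)],
)

# Anti-pattern: pull every row into Python, then filter and average there.
all_rows = conn.execute("SELECT * FROM prices").fetchall()
eur_closes = [close for asset, close in all_rows if asset == "EUR_USD"]
avg_in_python = sum(eur_closes) / len(eur_closes)

# Preferred: the database filters and aggregates, and a single row comes back.
(avg_in_sql,) = conn.execute(
    "SELECT AVG(close) FROM prices WHERE asset = ?", ("EUR_USD",)
).fetchone()

print(len(all_rows), "rows transferred for the Python version, 1 for the SQL version")
print(round(avg_in_python, 4), round(avg_in_sql, 4))
conn.close()
```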
+---
+
+## **5. Study Roadmap**
+1. **Beginner**:  
+   - SQL syntax (`SELECT`, `JOIN`, `GROUP BY`)  
+   - [SQLZoo](https://sqlzoo.net/) (free exercises)  
+
+2. **Intermediate**:  
+   - Window functions (`OVER`, `PARTITION BY`)  
+   - [PostgreSQL Window Functions Guide](https://www.postgresqltutorial.com/postgresql-window-function/)  
+
+3. **Advanced**:  
+   - Query optimization (EXPLAIN ANALYZE, indexing)  
+   - Time-series SQL ([TimescaleDB docs](https://docs.timescale.com/))  
+
+4. **EDA-Specific**:  
+   - Statistical SQL ([MADlib documentation](https://madlib.apache.org/))  
+   - Correlation patterns (rolling, cross-asset)  
+
+---
+
+## **6. Sample Workflow**
+```mermaid
+flowchart TB
+    A[Load Data] --> B[Profile Data]
+    B --> C[Clean Data]
+    C --> D[Explore Relationships]
+    D --> E[Validate Hypotheses]
+    E --> F[Visualize in Python]
+```
+
+**Example**:  
+1. Profile data → find missing values.  
+2. Clean → filter outliers.  
+3. Explore → calculate rolling correlations.  
+4. Validate → test "volatility clusters on Mondays".  
+5. Visualize → plot results in Python.  
+
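The example steps can be sketched end to end on a toy table. Everything here is an assumption made for illustration: the five `prices` rows are invented, cleaning is reduced to dropping rows with a missing `close`, and SQLite's `strftime('%w', ...)` (0 = Sunday, 1 = Monday) stands in for `EXTRACT` because the sketch runs on Python's built-in `sqlite3`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (time TEXT, open REAL, close REAL)")
# Hypothetical bars; 2024-01-01 is a Monday. One close is missing.
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [
        ("2024-01-01 09:00", 100.0, 104.0),
        ("2024-01-01 10:00", 104.0, 101.0),
        ("2024-01-02 09:00", 101.0, 102.0),
        ("2024-01-02 10:00", 102.0, None),
        ("2024-01-03 09:00", 102.0, 101.5),
    ],
)

# 1. Profile: COUNT(close) skips NULLs, so the difference is the missing count.
(missing,) = conn.execute("SELECT COUNT(*) - COUNT(close) FROM prices").fetchone()

# 2. Clean: expose only complete rows through a view.
conn.execute(
    "CREATE VIEW clean_prices AS SELECT * FROM prices WHERE close IS NOT NULL"
)

# 3-4. Explore and validate: average absolute move per weekday.
by_weekday = conn.execute("""
    SELECT strftime('%w', time) AS weekday,
           AVG(ABS(close - open)) AS avg_move
    FROM clean_prices
    GROUP BY weekday
    ORDER BY weekday
""").fetchall()

# 5. Visualize: by_weekday is small enough to hand to any plotting library.
print("missing closes:", missing)
print(by_weekday)
conn.close()
```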
+---
+
+### **Final Thought**
+SQL is **the** tool for structured EDA: for filtering, aggregating, and joining large tables it is typically faster and more scalable than pulling raw rows into Python. Master these concepts, and you’ll have an edge over analysts who do everything in pandas.