diff --git a/tech_docs/database/sql_getting_started.md b/tech_docs/database/sql_getting_started.md
new file mode 100644
index 0000000..ece0974
--- /dev/null
+++ b/tech_docs/database/sql_getting_started.md
@@ -0,0 +1,454 @@
+# **The Ultimate SQL Getting Started Guide**
+
+This guide takes you from absolute beginner to working SQL proficiency, with a focus on practical data analysis and exploratory data analysis (EDA).
+
+## **1. SQL Fundamentals**
+
+### **What is SQL?**
+SQL (Structured Query Language) is the standard language for interacting with relational databases. It allows you to:
+- Retrieve data
+- Insert, update, and delete records
+- Create and modify database structures
+- Perform complex calculations on data
+
+### **Core Concepts**
+1. **Databases**: Collections of structured data
+2. **Tables**: Data organized in rows and columns
+3. **Queries**: Commands to interact with data
+4. **Schemas**: Blueprints defining database structure
+
+## **2. Setting Up Your SQL Environment**
+
+### **Choose a Database System**
+| Option | Best For | Installation |
+|--------|----------|--------------|
+| **SQLite** | Beginners, small projects | Ships with Python (`sqlite3` module) |
+| **PostgreSQL** | Production, complex queries | [Download here](https://www.postgresql.org/download/) |
+| **MySQL** | Web applications | [Download here](https://dev.mysql.com/downloads/) |
+| **DuckDB** | Analytical workloads | `pip install duckdb` |
+
+### **Install a SQL Client**
+- **DBeaver** (Free, multi-platform)
+- **TablePlus** (Paid, excellent UI)
+- **VS Code + SQL Tools** (For developers)
+
+## **3. Basic SQL Syntax**
+
+### **SELECT Statements**
+```sql
+-- Basic selection
+SELECT column1, column2 FROM table_name;
+
+-- Select all columns
+SELECT * FROM table_name;
+
+-- Filtering with WHERE
+SELECT * FROM table_name WHERE condition;
+
+-- Sorting with ORDER BY
+SELECT * FROM table_name ORDER BY column1 DESC;
+```
+
+### **Common Data Types**
+- `INTEGER`: Whole numbers
+- `FLOAT/REAL`: Approximate (floating-point) decimal numbers
+- `VARCHAR(n)`: Text (n characters max)
+- `BOOLEAN`: True/False
+- `DATE/TIMESTAMP`: Date and time values
+
+## **4. Essential SQL Operations**
+
+### **Filtering Data**
+```sql
+-- Basic conditions
+SELECT * FROM employees WHERE salary > 50000;
+
+-- Multiple conditions
+SELECT * FROM products
+WHERE price BETWEEN 10 AND 100
+AND category = 'Electronics';
+
+-- Pattern matching
+SELECT * FROM customers
+WHERE name LIKE 'J%'; -- Starts with J
+```
+
+### **Sorting and Limiting**
+```sql
+-- Sort by multiple columns
+SELECT * FROM orders
+ORDER BY order_date DESC, total_amount DESC;
+
+-- Limit results
+SELECT * FROM large_table LIMIT 100;
+```
+
+### **Aggregation Functions**
+```sql
+-- Basic aggregations
+SELECT
+    COUNT(*) AS total_orders,
+    AVG(amount) AS avg_order,
+    MAX(amount) AS largest_order
+FROM orders;
+
+-- GROUP BY
+SELECT
+    department,
+    AVG(salary) AS avg_salary
+FROM employees
+GROUP BY department;
+```
+
+## **5. Joining Tables**
+
+### **Join Types**
+| Join Type | Description | Example |
+|-----------|-------------|---------|
+| **INNER JOIN** | Only matching rows | `SELECT * FROM A INNER JOIN B ON A.id = B.id` |
+| **LEFT JOIN** | All rows from the left table, matches from the right | `SELECT * FROM A LEFT JOIN B ON A.id = B.id` |
+| **RIGHT JOIN** | All rows from the right table, matches from the left | `SELECT * FROM A RIGHT JOIN B ON A.id = B.id` |
+| **FULL JOIN** | All rows from both tables (not supported in MySQL) | `SELECT * FROM A FULL JOIN B ON A.id = B.id` |
+
+### **Practical Example**
+```sql
+SELECT
+    o.order_id,
+    c.customer_name,
+    o.order_date,
+    o.total_amount
+FROM orders o
+JOIN customers c ON o.customer_id = c.customer_id
+WHERE o.order_date > '2023-01-01'
+ORDER BY o.total_amount DESC;
+```
+
+## **6. Advanced SQL Features**
+
+### **Common Table Expressions (CTEs)**
+```sql
+WITH high_value_customers AS (
+    SELECT customer_id, SUM(amount) AS total_spent
+    FROM orders
+    GROUP BY customer_id
+    HAVING SUM(amount) > 1000
+)
+SELECT * FROM high_value_customers;
+```
+
+### **Window Functions**
+```sql
+-- Running total
+SELECT
+    date,
+    revenue,
+    SUM(revenue) OVER (ORDER BY date) AS running_total
+FROM daily_sales;
+
+-- Rank products by category
+SELECT
+    product_name,
+    category,
+    price,
+    RANK() OVER (PARTITION BY category ORDER BY price DESC) AS price_rank
+FROM products;
+```
+
+## **7. SQL for Data Analysis**
+
+### **Time Series Analysis**
+```sql
+-- Daily aggregates
+-- DATE_TRUNC is available in PostgreSQL and DuckDB; MySQL and SQLite need a different date function
+SELECT
+    DATE_TRUNC('day', transaction_time) AS day,
+    COUNT(*) AS transactions,
+    SUM(amount) AS total_amount
+FROM transactions
+GROUP BY 1
+ORDER BY 1;
+
+-- Month-over-month growth
+WITH monthly_sales AS (
+    SELECT
+        DATE_TRUNC('month', order_date) AS month,
+        SUM(amount) AS total_sales
+    FROM orders
+    GROUP BY 1
+)
+SELECT
+    month,
+    total_sales,
+    -- NULLIF avoids a division-by-zero error when the previous month had no sales;
+    -- the first month has no predecessor, so its growth_rate is NULL
+    (total_sales - LAG(total_sales) OVER (ORDER BY month)) /
+    NULLIF(LAG(total_sales) OVER (ORDER BY month), 0) AS growth_rate
+FROM monthly_sales;
+```
+
+### **Pivot Tables in SQL**
+```sql
+-- Using CASE expressions
+SELECT
+    product_category,
+    SUM(CASE WHEN EXTRACT(YEAR FROM order_date) = 2022 THEN amount ELSE 0 END) AS sales_2022,
+    SUM(CASE WHEN EXTRACT(YEAR FROM order_date) = 2023 THEN amount ELSE 0 END) AS sales_2023
+FROM orders
+GROUP BY product_category;
+```
+
+## **8. Performance Optimization**
+
+### **Indexing Strategies**
+```sql
+-- Create indexes
+CREATE INDEX idx_customer_name ON customers(name);
+CREATE INDEX idx_order_date ON orders(order_date);
+
+-- Composite index
+CREATE INDEX idx_category_price ON products(category, price);
+```
+
+### **Query Optimization Tips**
+1. Use `EXPLAIN ANALYZE` to inspect query plans (in PostgreSQL it actually runs the query to report real timings)
+2. Limit columns in `SELECT` (avoid `SELECT *`)
+3. Filter early with `WHERE` clauses
+4. Use appropriate join types
+
+## **9. Learning Resources**
+
+### **Free Interactive Tutorials**
+1. [SQLZoo](https://sqlzoo.net/)
+2. [Mode Analytics SQL Tutorial](https://mode.com/sql-tutorial/)
+3. [PostgreSQL Exercises](https://pgexercises.com/)
+
+### **Books**
+- "SQL for Data Analysis" by Cathy Tanimura
+- "SQL Cookbook" by Anthony Molinaro
+
+### **Practice Platforms**
+- [LeetCode SQL Problems](https://leetcode.com/problemset/database/)
+- [HackerRank SQL](https://www.hackerrank.com/domains/sql)
+
+## **10. Next Steps**
+
+1. **Install a database system** and practice daily
+2. **Work with real datasets** (try [Kaggle datasets](https://www.kaggle.com/datasets))
+3. **Build a portfolio project** (e.g., analyze sales data)
+4. **Learn database design** (normalization, relationships)
+
+Remember: SQL is a skill best learned by doing. Start writing queries today!
+
+---
+
+### **Technical Overview: SQL for EDA (Exploratory Data Analysis)**
+Below is a **structured roadmap** covering key SQL concepts, EDA-specific queries, and tips for working efficiently. The examples use PostgreSQL syntax unless noted otherwise.
+
+---
+
+## **1. Core SQL Concepts for EDA**
+### **A. Foundational Operations**
+| Concept | Purpose | Example Query |
+|---------|---------|---------------|
+| **Filtering** | Subset data (`WHERE`, `HAVING`) | `SELECT * FROM prices WHERE asset = 'EUR_USD'` |
+| **Aggregation** | Summarize data (`GROUP BY`) | `SELECT asset, AVG(close) FROM prices GROUP BY asset` |
+| **Joins** | Combine tables (`INNER JOIN`) | `SELECT * FROM trades JOIN assets ON trades.id = assets.id` |
+| **Sorting** | Order results (`ORDER BY`) | `SELECT * FROM prices ORDER BY time DESC` |
+
+### **B. Advanced EDA Tools**
+| Concept | Purpose | Example Query |
+|---------|---------|---------------|
+| **Window Functions** | Calculate rolling stats, ranks | `SELECT time, AVG(close) OVER (ORDER BY time ROWS 29 PRECEDING) FROM prices` |
+| **CTEs (WITH)** | Break complex queries into steps | `WITH filtered AS (SELECT * FROM prices WHERE volume > 1000) SELECT * FROM filtered` |
+| **Statistical Aggregates** | Built-in stats (`STDDEV`, `CORR`, `PERCENTILE_CONT`) | `SELECT CORR(open, close) FROM prices` |
+| **Time-Series Handling** | Extract dates, resample | `SELECT DATE_TRUNC('hour', time) AS hour, AVG(close) FROM prices GROUP BY 1` |
+
+---
+
+## **2. Essential EDA Queries**
+### **A. Data Profiling**
+```sql
+-- 1. Basic stats
+SELECT
+    COUNT(*) AS row_count,
+    COUNT(DISTINCT asset) AS unique_assets,
+    MIN(close) AS min_price,
+    MAX(close) AS max_price,
+    AVG(close) AS mean_price,
+    STDDEV(close) AS volatility
+FROM prices;
+
+-- 2. Missing values (COUNT(close) skips NULLs, COUNT(*) does not)
+SELECT
+    COUNT(*) - COUNT(close) AS missing_prices
+FROM prices;
+
+-- 3. Value distribution (histogram with bins of width 10)
+SELECT
+    FLOOR(close / 10) * 10 AS price_bin,
+    COUNT(*) AS frequency
+FROM prices
+GROUP BY 1
+ORDER BY 1;
+```
+
+### **B. Correlation Analysis**
+```sql
+-- 1. Pairwise correlations
+SELECT
+    CORR(EUR_USD, GBP_USD) AS eur_gbp,
+    CORR(EUR_USD, USD_JPY) AS eur_jpy,
+    CORR(GBP_USD, USD_JPY) AS gbp_jpy
+FROM hourly_rates;
+
+-- 2. Rolling correlation over the last 30 rows
+--    (30 hours for hourly data; widen the frame for a 30-day window).
+--    CORR is used as a window aggregate so the mean and standard deviation
+--    are recomputed per window rather than taken over the whole table.
+SELECT
+    time,
+    CORR(EUR_USD, GBP_USD) OVER (ORDER BY time ROWS 29 PRECEDING) AS rolling_corr
+FROM hourly_rates;
+```
+
+### **C. Time-Series EDA**
+```sql
+-- 1. Hourly volatility patterns
+SELECT
+    EXTRACT(HOUR FROM time) AS hour,
+    AVG(ABS(close - open)) AS avg_volatility
+FROM prices
+GROUP BY 1
+ORDER BY 1;
+
+-- 2. Daily returns distribution
+--    FIRST()/LAST() are not built-in PostgreSQL aggregates, so the first open
+--    and last close of each day are taken with window functions instead.
+SELECT DISTINCT
+    DATE_TRUNC('day', time) AS day,
+    (LAST_VALUE(close) OVER w - FIRST_VALUE(open) OVER w)
+        / FIRST_VALUE(open) OVER w AS daily_return
+FROM prices
+WINDOW w AS (
+    PARTITION BY DATE_TRUNC('day', time)
+    ORDER BY time
+    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
+);
+```
+
+### **D. Outlier Detection**
+```sql
+-- Z-score outliers (|Z| > 3)
+WITH stats AS (
+    SELECT
+        AVG(close) AS mean,
+        STDDEV(close) AS stddev
+    FROM prices
+)
+SELECT
+    time,
+    close,
+    (close - mean) / stddev AS z_score
+FROM prices
+CROSS JOIN stats  -- stats has exactly one row, so every price row is kept
+WHERE ABS((close - mean) / stddev) > 3;
+```
+
+---
+
+## **3. Key Optimizations**
+### **A. Indexing for EDA**
+```sql
+-- Speed up time-series queries
+CREATE INDEX idx_prices_time ON prices(time);
+
+-- Speed up asset-specific filters
+CREATE INDEX idx_prices_asset ON prices(asset);
+```
+
+### **B. Partitioning Large Tables**
+```sql
+-- Partition by time range (PostgreSQL)
+CREATE TABLE prices (
+    time TIMESTAMP,
+    asset TEXT,
+    close FLOAT
+) PARTITION BY RANGE (time);
+
+-- Each range needs its own partition before rows can be inserted
+CREATE TABLE prices_2023 PARTITION OF prices
+    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
+```
+
+### **C. Materialized Views**
+```sql
+-- Pre-compute frequent aggregates
+CREATE MATERIALIZED VIEW hourly_stats AS
+SELECT
+    DATE_TRUNC('hour', time) AS hour,
+    AVG(close) AS avg_price,
+    STDDEV(close) AS volatility
+FROM prices
+GROUP BY 1;
+
+-- Refresh periodically
+REFRESH MATERIALIZED VIEW hourly_stats;
+```
+
+---
+
+## **4. Pro Tips**
+### **A. Use the Right Database**
+| Database | Best For |
+|----------|----------|
+| **PostgreSQL** | Complex EDA, extensions (MADlib) |
+| **DuckDB** | Embedded analytics, Parquet/CSV |
+| **SQLite** | Lightweight, local prototyping |
+
+### **B. Learn These Functions**
+| Function | Purpose |
+|----------|---------|
+| `DATE_TRUNC()` | Group by time intervals |
+| `FIRST_VALUE()` / `LAST_VALUE()` | First/last value in a window |
+| `PERCENTILE_CONT()` | Median, quantiles |
+| `ROLLUP`/`CUBE` | Hierarchical aggregations |
+
+### **C. Avoid Anti-Patterns**
+- **Don’t** fetch all rows to Python for filtering (`SELECT *` → `pd.read_sql()`).
+- **Do** push computations to SQL (`WHERE`, `GROUP BY` in-database).
+- **Don’t** use Python loops for row-wise operations.
+- **Do** use window functions for rolling calculations.
+
+---
+
+## **5. Study Roadmap**
+1. **Beginner**:
+   - SQL syntax (`SELECT`, `JOIN`, `GROUP BY`)
+   - [SQLZoo](https://sqlzoo.net/) (free exercises)
+
+2. **Intermediate**:
+   - Window functions (`OVER`, `PARTITION BY`)
+   - [PostgreSQL Window Functions Guide](https://www.postgresqltutorial.com/postgresql-window-function/)
+
+3. **Advanced**:
+   - Query optimization (`EXPLAIN ANALYZE`, indexing)
+   - Time-series SQL ([TimescaleDB docs](https://docs.timescale.com/))
+
+4. **EDA-Specific**:
+   - Statistical SQL ([MADlib documentation](https://madlib.apache.org/))
+   - Correlation patterns (rolling, cross-asset)
+
+---
+
+## **6. Sample Workflow**
+```mermaid
+flowchart TB
+    A[Load Data] --> B[Profile Data]
+    B --> C[Clean Data]
+    C --> D[Explore Relationships]
+    D --> E[Validate Hypotheses]
+    E --> F[Visualize in Python]
+```
+
+**Example**:
+1. Profile data → find missing values.
+2. Clean → filter outliers.
+3. Explore → calculate rolling correlations.
+4. Validate → test "volatility clusters on Mondays".
+5. Visualize → plot results in Python.
+
+---
+
+### **Final Thought**
+For EDA on structured data that already lives in a database, SQL is often faster and more scalable than pulling every row into Python, because filtering and aggregation run next to the data. Master these concepts and keep pandas for the final visualization step.
\ No newline at end of file
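
The first steps of the sample workflow above (load, profile, clean, explore) can be tried without installing a database server. What follows is a minimal sketch using Python's built-in `sqlite3` module. The sample rows, the `profile_prices` helper, and the three-column `prices` schema are illustrative assumptions, not data from the guide. Core SQLite has no built-in `STDDEV` or `CORR`, so the sketch sticks to aggregates that SQLite does provide.

```python
import sqlite3

# Hypothetical sample rows (time, asset, close); None marks a missing price.
ROWS = [
    ("2024-01-01 00:00", "EUR_USD", 1.10),
    ("2024-01-01 01:00", "EUR_USD", 1.12),
    ("2024-01-01 02:00", "EUR_USD", None),
    ("2024-01-01 00:00", "GBP_USD", 1.26),
    ("2024-01-01 01:00", "GBP_USD", 1.28),
]


def profile_prices(rows):
    """Load rows into an in-memory table and return (profile, per_asset)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE prices (time TEXT, asset TEXT, close REAL)")
    conn.executemany("INSERT INTO prices VALUES (?, ?, ?)", rows)

    # Profile: row count, distinct assets, and missing prices in one query.
    profile = conn.execute(
        """
        SELECT COUNT(*),
               COUNT(DISTINCT asset),
               COUNT(*) - COUNT(close)
        FROM prices
        """
    ).fetchone()

    # Clean + explore: drop missing prices, then aggregate per asset in SQL
    # instead of fetching every row and filtering in Python.
    per_asset = conn.execute(
        """
        SELECT asset, COUNT(*), MIN(close), MAX(close)
        FROM prices
        WHERE close IS NOT NULL
        GROUP BY asset
        ORDER BY asset
        """
    ).fetchall()
    conn.close()
    return profile, per_asset


profile, per_asset = profile_prices(ROWS)
print(profile)    # (row_count, unique_assets, missing_prices)
print(per_asset)  # one (asset, n, min_close, max_close) tuple per asset
```

Only the aggregated results cross into Python, which is the "push computations to SQL" pattern recommended in the guide. The same queries should carry over to PostgreSQL or DuckDB, where `STDDEV` and `CORR` become available.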