# **SQL for Forex Data Analysis: The 20% That Delivers 80% Results**

## **Focused Learning Roadmap**
*Master these core skills to handle most forex data analysis tasks*

### **Phase 1: Core Skills (Weeks 1-2)**

| **What to Learn** | **Why It Matters** | **Key Syntax Examples** |
|-------------------|--------------------|-------------------------|
| **Filtering Data** | Isolate specific currency pairs/timeframes | `SELECT * FROM ticks WHERE symbol='EUR/USD' AND timestamp > '2023-01-01'` |
| **Time Bucketing** | Convert raw ticks into candlesticks (1min/5min/1H) | `DATE_TRUNC('hour', timestamp) AS hour` |
| **Basic Aggregates** | Calculate spreads, highs/lows, averages | `AVG(ask-bid) AS avg_spread`, `MAX(ask) AS high` |
| **Grouping** | Summarize data by pair/time period | `GROUP BY symbol, DATE_TRUNC('day', timestamp)` |

---

### **Phase 2: Essential Techniques (Weeks 3-4)**

| **Skill** | **Forex Application** | **Example** |
|-----------|-----------------------|-------------|
| **Joins** | Combine tick data with economic calendars | `JOIN economic_events ON ticks.date = events.date` |
| **Rolling Windows** | Calculate moving averages & volatility | `AVG(price) OVER (ORDER BY timestamp ROWS 30 PRECEDING)` |
| **Correlations** | Compare currency pairs (e.g., EUR/USD vs. USD/JPY) | `CORR(eurusd_mid, usdjpy_mid)` |
| **Session Analysis** | Compare volatility across trading sessions | `WHERE EXTRACT(HOUR FROM timestamp) IN (7,13,21)` *(London/NY/Asia hours)* |

---

### **Phase 3: Optimization (Week 5)**

| **Skill** | **Impact** | **Implementation** |
|-----------|------------|--------------------|
| **Indexing** | Speed up time/symbol queries | `CREATE INDEX idx_symbol_time ON ticks(symbol, timestamp)` |
| **CTEs** | Break complex queries into steps | `WITH filtered AS (...) SELECT * FROM filtered` |
| **Partitioning** | Faster queries on large datasets | `PARTITION BY RANGE (timestamp)` |

---

## **Four Essential Forex Queries You'll Use Daily**

1. **Current Spread Analysis**
```sql
SELECT symbol, AVG(ask - bid) AS spread
FROM ticks
WHERE timestamp > NOW() - INTERVAL '1 hour'
GROUP BY symbol;
```

2. **5-Minute Candlesticks**
```sql
-- DATE_TRUNC has no '5 minutes' unit; bucket via epoch arithmetic instead
SELECT to_timestamp(floor(extract(epoch FROM timestamp) / 300) * 300) AS time,
       MIN(bid) AS low,
       MAX(ask) AS high
FROM ticks
WHERE symbol = 'GBP/USD'
GROUP BY time;
```

3. **Rolling Volatility**
```sql
SELECT timestamp,
       STDDEV(ask) OVER (ORDER BY timestamp ROWS 100 PRECEDING) AS vol
FROM ticks
WHERE symbol = 'EUR/USD';
```

4. **Session Volume Comparison**
```sql
SELECT CASE WHEN EXTRACT(HOUR FROM timestamp) BETWEEN 7 AND 15
            THEN 'London' ELSE 'Other' END AS session,
       SUM(volume) AS total_volume
FROM ticks
GROUP BY session;
```

---

## **Study Plan**

- **Week 1**: Master `SELECT`, `WHERE`, `GROUP BY`, `DATE_TRUNC` → *Goal: Generate hourly OHLC data for one currency pair (a worked sketch follows the survival kit below)*
- **Week 2**: Learn `JOIN`, `AVG() OVER()`, `CORR()` → *Goal: Compare two pairs' correlation over different timeframes*
- **Week 3**: Optimize with indexes + CTEs → *Goal: Run a 1M-row query in <1 second*

---

## **Tools to Start With**

- **Data**: Free forex ticks from [Dukascopy](https://www.dukascopy.com/)
- **Database**: DuckDB (lightweight, no setup required)
- **Visualization**: Metabase (free) or Python with Matplotlib

---

## **What to Avoid (For Now)**

- Stored procedures
- Advanced indexing strategies
- Machine learning in SQL
- Recursive queries

---

### **Bare Minimum Survival Kit**

*Key principle: focus on time-based analysis first (the core of forex data); everything else builds on it.*

1. `WHERE` + `DATE_TRUNC` *(filter and bucket time data)*
2. `GROUP BY` *(summarize data efficiently)*
3. `AVG() OVER()` *(rolling calculations)*
4. `CORR()` *(measure pair relationships)*
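The candlestick query above captures only high/low, while the Week 1 goal calls for full OHLC, which needs ordered picks within each bucket. A minimal sketch using window functions, assuming the same `ticks(symbol, timestamp, bid, ask)` layout used throughout:

```sql
-- Hourly OHLC per bucket: FIRST_VALUE/LAST_VALUE give open/close,
-- MIN/MAX give high/low; DISTINCT collapses per-tick rows to one per hour.
WITH priced AS (
    SELECT DATE_TRUNC('hour', timestamp) AS bucket,
           timestamp,
           (bid + ask) / 2 AS mid
    FROM ticks
    WHERE symbol = 'EUR/USD'
)
SELECT DISTINCT
    bucket,
    FIRST_VALUE(mid) OVER w AS open,
    MAX(mid)         OVER w AS high,
    MIN(mid)         OVER w AS low,
    LAST_VALUE(mid)  OVER w AS close
FROM priced
WINDOW w AS (PARTITION BY bucket ORDER BY timestamp
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
```

The full-frame `ROWS BETWEEN ... UNBOUNDED FOLLOWING` matters: without it, `LAST_VALUE` would return the current row rather than the bucket's final tick.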
**Next Step**: Want a 5-question quiz to test your core knowledge? Or shall we dive deeper into one technique?

---

Understood! You're seeking **maximum SQL mastery with minimum time investment** by focusing on the critical 20% that delivers 80% of results in forex data analysis. Here's your **laser-focused SQL SME blueprint**:

---

### **The SQL SME 20% Framework for Forex**

#### **1. Core Principles to Internalize**

1. **Forex Data is Time-Series First**
   - Every query must handle timestamp filtering efficiently
   - Key skill: `DATE_TRUNC()` + `WHERE` on time ranges
2. **Bid/Ask Dynamics are Fundamental**
   - Spread analysis unlocks liquidity insights
   - Key skill: `(ask - bid)` calculations with proper grouping
3. **Rolling Windows > Static Aggregates**
   - Market conditions change constantly: analyze trends, not snapshots
   - Key skill: `AVG() OVER (ROWS N PRECEDING)`

---

### **2. The 10 Essential Patterns (Memorize These)**

| # | Pattern | Forex Application | Example |
|---|---------|-------------------|---------|
| 1 | **Time Bucketing** | Convert ticks → candles | `DATE_TRUNC('minute', timestamp)` |
| 2 | **Rolling Volatility** | Measure risk | `STDDEV(price) OVER (ORDER BY ts ROWS 99 PRECEDING)` |
| 3 | **Session Comparison** | London vs. NY activity | `WHERE EXTRACT(HOUR FROM timestamp) IN (7,13)` |
| 4 | **Pair Correlation** | Hedge ratios | `CORR(eurusd, usdjpy)` |
| 5 | **Spread Analysis** | Liquidity monitoring | `AVG(ask - bid) ... GROUP BY symbol` |
| 6 | **Event Impact** | NFP/CPI reactions | `WHERE ts BETWEEN event_time - INTERVAL '15 min' AND event_time + INTERVAL '1 hour'` |
| 7 | **Liquidity Zones** | Volume clusters | `NTILE(4) OVER (ORDER BY volume)` |
| 8 | **Outlier Detection** | Data quality checks | `ABS((price - mean) / stddev) > 3` |
| 9 | **Gap Analysis** | Weekend openings | `open - LAG(close) OVER (ORDER BY day)` |
| 10 | **Rolling Sharpe** | Strategy performance | `AVG(ret) OVER w / STDDEV(ret) OVER w` |

---

### **3. SME-Level Documentation Template**

**For each pattern**, document:

1. **Business Purpose**: *"Identify optimal trading hours by comparing volatility across sessions"*
2. **Technical Implementation**:
```sql
SELECT EXTRACT(HOUR FROM timestamp) AS hour,
       STDDEV((bid + ask) / 2) AS volatility
FROM ticks
WHERE symbol = 'EUR/USD'
GROUP BY hour
ORDER BY volatility DESC;
```
3. **Performance Considerations**: *"Add a composite index on (symbol, timestamp) for up to 100x speedup"*
4. **Edge Cases**: *"Exclude holidays where volatility is artificially low"*

---

### **4. Drills to Achieve Mastery**

#### **Daily Challenge (15 mins/day)**

- **Day 1**: Generate 1H candles with OHLC + volume
- **Day 2**: Calculate 30-period rolling correlation between EUR/USD and GBP/USD
- **Day 3**: Find days with spread > 2x the 30-day average
- **Day 4**: Compare pre/post-FOMC volatility
- **Day 5**: Optimize a slow query using `EXPLAIN ANALYZE`

#### **Weekly Project**

Build a **volatility surface** showing:
```sql
SELECT symbol,
       DATE_TRUNC('hour', timestamp) AS hour,
       STDDEV((bid + ask) / 2) AS vol,
       AVG(ask - bid) AS spread
FROM ticks
GROUP BY symbol, hour;
```

---

### **5. Forensic Analysis Checklist**

When reviewing any forex query, ask:

1. **Time Handling**:
   - ✅ Timestamps in UTC?
   - ✅ Correct timezone conversions? *(see the sketch below)*
2. **Spread Awareness**:
   - ✅ Using `(bid + ask) / 2` for mid-price?
   - ✅ Calculating raw spread metrics?
3. **Rolling vs. Static**:
   - ✅ Using window functions where appropriate?
4. **Performance**:
   - ✅ Indexes on (symbol, timestamp)?
   - ✅ Avoiding full table scans?
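A minimal sketch for the timezone checklist items, assuming ticks are stored as UTC in a plain `timestamp` column. The first `AT TIME ZONE 'UTC'` marks the value as UTC; the second shifts it to local session time, which stays correct across daylight-saving changes (unlike hard-coded hour lists). The 8-16 local session hours are an illustrative convention, not a market standard:

```sql
-- Tag each tick with its trading session using IANA timezone names.
-- The CASE falls through in order, so hours where London and New York
-- overlap resolve to 'London' by convention here.
SELECT timestamp,
       CASE
           WHEN EXTRACT(HOUR FROM (timestamp AT TIME ZONE 'UTC')
                        AT TIME ZONE 'Europe/London') BETWEEN 8 AND 16
               THEN 'London'
           WHEN EXTRACT(HOUR FROM (timestamp AT TIME ZONE 'UTC')
                        AT TIME ZONE 'America/New_York') BETWEEN 8 AND 16
               THEN 'New York'
           ELSE 'Asia/Other'
       END AS session
FROM ticks
WHERE symbol = 'EUR/USD';
```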
### **6. Pro-Grade Resources**

1. **Dukascopy Tick Data**: Real-world practice dataset
2. **PostgreSQL Window Functions Docs**: Master the `OVER()` clause
3. **Forex Economic Calendar**: For event impact studies

---

### **Key Mindset Shifts**

1. **From**: "I need to learn all SQL functions"
   **To**: "I need to master time-series patterns for bid/ask data"
2. **From**: "Queries must return perfect results"
   **To**: "Queries must be fast enough for real-time decisions"
3. **From**: "Document after writing"
   **To**: "Document patterns before implementation"

---

**Next Step**: Pick **one** pattern from the 10 above. I'll provide:

1. Detailed breakdown
2. Common mistakes
3. Optimization techniques
4. Forensic analysis questions

Which pattern would you like to deep-dive into first? (Recommend starting with **Time Bucketing** or **Rolling Volatility**)
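If you start with Time Bucketing, one gotcha is worth a sketch up front: PostgreSQL's `DATE_TRUNC` only accepts fixed units like `'minute'` or `'hour'`, so arbitrary intervals need epoch arithmetic (TimescaleDB and DuckDB provide `time_bucket` for the same job). Assuming the usual `ticks` layout:

```sql
-- 15-minute buckets: floor the epoch to a 900-second grid, convert back.
-- TimescaleDB/DuckDB alternative: time_bucket(INTERVAL '15 minutes', timestamp)
SELECT to_timestamp(floor(extract(epoch FROM timestamp) / 900) * 900) AS bucket_15m,
       MIN(bid) AS low,
       MAX(ask) AS high
FROM ticks
WHERE symbol = 'EUR/USD'
GROUP BY bucket_15m;
```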
---

Here's a structured **Technical Guide & Roadmap for Forex Tick Data Analysis with SQL**, designed as a progressive learning path with clear milestones and reference examples:

---

# **Forex Tick Data Analysis: SQL Learning Roadmap**
*A step-by-step guide from beginner to advanced techniques*

## **Phase 1: Foundations**

### **1.1 Understanding Your Data**

- **Structure**: Forex ticks typically contain:
```sql
CREATE TABLE forex_ticks (
    symbol    TEXT,              -- e.g., 'EUR/USD'
    timestamp TIMESTAMP,         -- millisecond precision
    bid       DOUBLE PRECISION,  -- bid price
    ask       DOUBLE PRECISION,  -- ask price
    volume    DOUBLE PRECISION
);
```
- **Key Metrics**:
  - **Spread**: `ask - bid` (liquidity measure)
  - **Mid-price**: `(bid + ask) / 2` (reference price)

### **1.2 Basic SQL Operations**

```sql
-- Sample data inspection
SELECT * FROM forex_ticks
WHERE symbol = 'EUR/USD'
LIMIT 100;

-- Count ticks per pair
SELECT symbol, COUNT(*)
FROM forex_ticks
GROUP BY symbol;

-- Time range filtering
SELECT MIN(timestamp), MAX(timestamp)
FROM forex_ticks;
```

---

## **Phase 2: Core Analysis**

### **2.1 Spread Analysis**

```sql
-- Basic spread stats
SELECT symbol,
       AVG(ask - bid) AS avg_spread,
       MAX(ask - bid) AS max_spread
FROM forex_ticks
GROUP BY symbol;
```

### **2.2 Time Bucketing**

```sql
-- 5-minute candlesticks (epoch arithmetic: DATE_TRUNC has no '5 minutes' unit)
SELECT symbol,
       to_timestamp(floor(extract(epoch FROM timestamp) / 300) * 300) AS time_bucket,
       MIN(bid) AS low,
       MAX(ask) AS high,
       AVG((bid + ask) / 2) AS avg_mid  -- an average, not the closing price
FROM forex_ticks
GROUP BY symbol, time_bucket;
```

### **2.3 Session Analysis**

```sql
-- Volume by hour (GMT)
SELECT EXTRACT(HOUR FROM timestamp) AS hour,
       AVG(volume) AS avg_volume
FROM forex_ticks
WHERE symbol = 'GBP/USD'
GROUP BY hour
ORDER BY hour;
```

---

## **Phase 3: Intermediate Techniques**

### **3.1 Rolling Calculations**

```sql
-- 30-tick moving average (for a true 30-minute window, aggregate into
-- 1-minute buckets first, then apply the same window)
SELECT timestamp,
       symbol,
       AVG((bid + ask) / 2) OVER (
           PARTITION BY symbol
           ORDER BY timestamp
           ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
       ) AS ma_30
FROM forex_ticks;
```

### **3.2 Pair Correlation**

```sql
WITH hourly_prices AS (
    SELECT DATE_TRUNC('hour', timestamp) AS hour,
           symbol,
           AVG((bid + ask) / 2) AS mid_price
    FROM forex_ticks
    GROUP BY hour, symbol
)
SELECT a.symbol AS pair1,
       b.symbol AS pair2,
       CORR(a.mid_price, b.mid_price) AS correlation
FROM hourly_prices a
JOIN hourly_prices b ON a.hour = b.hour
WHERE a.symbol < b.symbol
GROUP BY pair1, pair2;
```

---

## **Phase 4: Advanced Topics**

### **4.1 Volatility Measurement**

```sql
WITH returns AS (
    SELECT symbol,
           timestamp,
           (ask - LAG(ask) OVER (PARTITION BY symbol ORDER BY timestamp))
               / LAG(ask) OVER (PARTITION BY symbol ORDER BY timestamp) AS tick_return
    FROM forex_ticks
)
SELECT symbol,
       STDDEV(tick_return) AS return_volatility  -- per-tick; bucket by hour first for hourly volatility
FROM returns
GROUP BY symbol;
```

### **4.2 Event Impact Analysis**

```sql
-- Compare 15-min pre/post NFP release
SELECT AVG(CASE WHEN timestamp BETWEEN '2023-12-01 13:15' AND '2023-12-01 13:30'
                THEN (bid + ask) / 2 END) AS pre_NFP,
       AVG(CASE WHEN timestamp BETWEEN '2023-12-01 13:30' AND '2023-12-01 13:45'
                THEN (bid + ask) / 2 END) AS post_NFP
FROM forex_ticks
WHERE symbol = 'EUR/USD';
```
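The hard-coded window above works for a one-off study; for repeated event studies, joining against an economic calendar generalizes it. A minimal sketch assuming a hypothetical `economic_events(event_time, event_name)` table alongside `forex_ticks`:

```sql
-- Pre/post mid-price around each calendar event.
-- economic_events is an assumed table; adjust names to your schema.
SELECT e.event_name,
       e.event_time,
       AVG(CASE WHEN t.timestamp BETWEEN e.event_time - INTERVAL '15 minutes'
                                     AND e.event_time
                THEN (t.bid + t.ask) / 2 END) AS pre_event_mid,
       AVG(CASE WHEN t.timestamp BETWEEN e.event_time
                                     AND e.event_time + INTERVAL '15 minutes'
                THEN (t.bid + t.ask) / 2 END) AS post_event_mid
FROM economic_events e
JOIN forex_ticks t
  ON t.timestamp BETWEEN e.event_time - INTERVAL '15 minutes'
                     AND e.event_time + INTERVAL '15 minutes'
WHERE t.symbol = 'EUR/USD'
GROUP BY e.event_name, e.event_time;
```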
---

## **Study Roadmap**

### **Weekly Learning Plan**

| Week | Focus Area | Key Skills |
|------|------------|------------|
| 1 | SQL Basics | `SELECT`, `WHERE`, `GROUP BY` |
| 2 | Time Handling | `DATE_TRUNC`, `EXTRACT`, timezones |
| 3 | Aggregations | `AVG`, `STDDEV`, `CORR` |
| 4 | Window Functions | `OVER`, `PARTITION BY`, rolling calcs |
| 5 | Optimization | Indexes, query planning |
| 6 | Advanced Patterns | Volatility modeling, microstructure |

---

## **Cheat Sheet**

### **Essential Functions**

| Function | Purpose | Example |
|----------|---------|---------|
| `DATE_TRUNC` | Bucket timestamps | `DATE_TRUNC('hour', timestamp)` |
| `EXTRACT` | Get time parts | `EXTRACT(HOUR FROM timestamp)` |
| `CORR` | Correlation | `CORR(price1, price2)` |
| `AVG() OVER` | Moving average | `AVG(price) OVER (ORDER BY time ROWS 30 PRECEDING)` |

### **Common Patterns**

```sql
-- Get latest price per pair (DISTINCT ON is PostgreSQL-specific)
SELECT DISTINCT ON (symbol) symbol, bid, ask
FROM forex_ticks
ORDER BY symbol, timestamp DESC;

-- Detect stale data
SELECT symbol, MAX(timestamp) AS last_update
FROM forex_ticks
GROUP BY symbol
HAVING MAX(timestamp) < NOW() - INTERVAL '5 minutes';
```

---

## **Next Steps**

1. **Practice Dataset**: Download free forex tick data from [Dukascopy Bank](https://www.dukascopy.com/)
2. **Sandbox Setup**: Install PostgreSQL + TimescaleDB for time-series optimizations
3. **Projects**:
   - Build a volatility dashboard
   - Analyze London vs. NY session spreads
   - Track correlation breakdowns during crises

---

**Pro Tip**: Bookmark this guide and revisit each phase as your skills progress. Start with Phase 1 queries, then gradually incorporate more complex techniques.

---

# **The Ultimate SQL Getting Started Guide**

This guide will take you from absolute beginner to SQL proficiency, with a focus on practical data analysis and EDA applications.

## **1. SQL Fundamentals**

### **What is SQL?**

SQL (Structured Query Language) is the standard language for interacting with relational databases. It allows you to:

- Retrieve data
- Insert, update, and delete records
- Create and modify database structures
- Perform complex calculations on data

### **Core Concepts**

1. **Databases**: Collections of structured data
2. **Tables**: Data organized in rows and columns
3. **Queries**: Commands to interact with data
4. **Schemas**: Blueprints defining database structure

## **2. Setting Up Your SQL Environment**

### **Choose a Database System**

| Option | Best For | Installation |
|--------|----------|--------------|
| **SQLite** | Beginners, small projects | Built into Python |
| **PostgreSQL** | Production, complex queries | [Download here](https://www.postgresql.org/download/) |
| **MySQL** | Web applications | [Download here](https://dev.mysql.com/downloads/) |
| **DuckDB** | Analytical workloads | `pip install duckdb` |

### **Install a SQL Client**

- **DBeaver** (free, multi-platform)
- **TablePlus** (paid, excellent UI)
- **VS Code + SQLTools** (for developers)
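Since DuckDB is the suggested analytical engine, a minimal sketch of why it needs no setup: it can query flat files in place. Here `ticks.csv` is a hypothetical file with `symbol,timestamp,bid,ask` columns:

```sql
-- DuckDB reads the CSV directly: no CREATE TABLE or load step required.
SELECT symbol, COUNT(*) AS tick_count
FROM read_csv_auto('ticks.csv')
GROUP BY symbol;
```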
## **3. Basic SQL Syntax**

### **SELECT Statements**

```sql
-- Basic selection
SELECT column1, column2 FROM table_name;

-- Select all columns
SELECT * FROM table_name;

-- Filtering with WHERE
SELECT * FROM table_name WHERE condition;

-- Sorting with ORDER BY
SELECT * FROM table_name ORDER BY column1 DESC;
```

### **Common Data Types**

- `INTEGER`: Whole numbers
- `FLOAT`/`REAL`: Decimal numbers
- `VARCHAR(n)`: Text (n characters max)
- `BOOLEAN`: True/False
- `DATE`/`TIMESTAMP`: Date and time values

## **4. Essential SQL Operations**

### **Filtering Data**

```sql
-- Basic conditions
SELECT * FROM employees WHERE salary > 50000;

-- Multiple conditions
SELECT * FROM products
WHERE price BETWEEN 10 AND 100
  AND category = 'Electronics';

-- Pattern matching
SELECT * FROM customers WHERE name LIKE 'J%';  -- Starts with J
```

### **Sorting and Limiting**

```sql
-- Sort by multiple columns
SELECT * FROM orders
ORDER BY order_date DESC, total_amount DESC;

-- Limit results
SELECT * FROM large_table LIMIT 100;
```

### **Aggregation Functions**

```sql
-- Basic aggregations
SELECT COUNT(*) AS total_orders,
       AVG(amount) AS avg_order,
       MAX(amount) AS largest_order
FROM orders;

-- GROUP BY
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;
```

## **5. Joining Tables**

### **Join Types**

| Join Type | Description | Example |
|-----------|-------------|---------|
| **INNER JOIN** | Only matching rows | `SELECT * FROM A INNER JOIN B ON A.id = B.id` |
| **LEFT JOIN** | All from left table, matches from right | `SELECT * FROM A LEFT JOIN B ON A.id = B.id` |
| **RIGHT JOIN** | All from right table, matches from left | `SELECT * FROM A RIGHT JOIN B ON A.id = B.id` |
| **FULL JOIN** | All rows from both tables | `SELECT * FROM A FULL JOIN B ON A.id = B.id` |

### **Practical Example**

```sql
SELECT o.order_id,
       c.customer_name,
       o.order_date,
       o.total_amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date > '2023-01-01'
ORDER BY o.total_amount DESC;
```

## **6. Advanced SQL Features**

### **Common Table Expressions (CTEs)**

```sql
WITH high_value_customers AS (
    SELECT customer_id, SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer_id
    HAVING SUM(amount) > 1000
)
SELECT * FROM high_value_customers;
```

### **Window Functions**

```sql
-- Running total
SELECT date,
       revenue,
       SUM(revenue) OVER (ORDER BY date) AS running_total
FROM daily_sales;

-- Rank products by category
SELECT product_name,
       category,
       price,
       RANK() OVER (PARTITION BY category ORDER BY price DESC) AS price_rank
FROM products;
```

## **7. SQL for Data Analysis**

### **Time Series Analysis**

```sql
-- Daily aggregates
SELECT DATE_TRUNC('day', transaction_time) AS day,
       COUNT(*) AS transactions,
       SUM(amount) AS total_amount
FROM transactions
GROUP BY 1
ORDER BY 1;

-- Month-over-month growth
WITH monthly_sales AS (
    SELECT DATE_TRUNC('month', order_date) AS month,
           SUM(amount) AS total_sales
    FROM orders
    GROUP BY 1
)
SELECT month,
       total_sales,
       (total_sales - LAG(total_sales) OVER (ORDER BY month))
           / LAG(total_sales) OVER (ORDER BY month) AS growth_rate
FROM monthly_sales;
```

### **Pivot Tables in SQL**

```sql
-- Using CASE statements
SELECT product_category,
       SUM(CASE WHEN EXTRACT(YEAR FROM order_date) = 2022 THEN amount ELSE 0 END) AS sales_2022,
       SUM(CASE WHEN EXTRACT(YEAR FROM order_date) = 2023 THEN amount ELSE 0 END) AS sales_2023
FROM orders
GROUP BY product_category;
```
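PostgreSQL (and DuckDB) also support the aggregate `FILTER` clause, a more concise way to write the same pivot. A sketch on the same `orders` table; note that `FILTER` yields `NULL` rather than `0` when no rows match, hence the `COALESCE`:

```sql
-- Same yearly pivot via FILTER instead of CASE.
SELECT product_category,
       COALESCE(SUM(amount) FILTER (WHERE EXTRACT(YEAR FROM order_date) = 2022), 0) AS sales_2022,
       COALESCE(SUM(amount) FILTER (WHERE EXTRACT(YEAR FROM order_date) = 2023), 0) AS sales_2023
FROM orders
GROUP BY product_category;
```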
## **8. Performance Optimization**

### **Indexing Strategies**

```sql
-- Create indexes
CREATE INDEX idx_customer_name ON customers(name);
CREATE INDEX idx_order_date ON orders(order_date);

-- Composite index
CREATE INDEX idx_category_price ON products(category, price);
```

### **Query Optimization Tips**

1. Use `EXPLAIN ANALYZE` to understand query plans
2. Limit columns in `SELECT` (avoid `SELECT *`)
3. Filter early with `WHERE` clauses
4. Use appropriate join types

## **9. Learning Resources**

### **Free Interactive Tutorials**

1. [SQLZoo](https://sqlzoo.net/)
2. [Mode Analytics SQL Tutorial](https://mode.com/sql-tutorial/)
3. [PostgreSQL Exercises](https://pgexercises.com/)

### **Books**

- "SQL for Data Analysis" by Cathy Tanimura
- "SQL Cookbook" by Anthony Molinaro

### **Practice Platforms**

- [LeetCode SQL Problems](https://leetcode.com/problemset/database/)
- [HackerRank SQL](https://www.hackerrank.com/domains/sql)

## **10. Next Steps**

1. **Install a database system** and practice daily
2. **Work with real datasets** (try [Kaggle datasets](https://www.kaggle.com/datasets))
3. **Build a portfolio project** (e.g., analyze sales data)
4. **Learn database design** (normalization, relationships)

Remember: SQL is a skill best learned by doing. Start writing queries today!

---

### **Technical Overview: SQL for EDA (Structured Data Analysis)**

You're diving into SQL-first EDA, an excellent choice. Below is a **structured roadmap** covering key SQL concepts, EDA-specific queries, and pro tips to maximize efficiency.

---

## **1. Core SQL Concepts for EDA**

### **A. Foundational Operations**

| Concept | Purpose | Example Query |
|---------|---------|---------------|
| **Filtering** | Subset data (`WHERE`, `HAVING`) | `SELECT * FROM prices WHERE asset = 'EUR_USD'` |
| **Aggregation** | Summarize data (`GROUP BY`) | `SELECT asset, AVG(close) FROM prices GROUP BY asset` |
| **Joins** | Combine tables (`INNER JOIN`) | `SELECT * FROM trades JOIN assets ON trades.id = assets.id` |
| **Sorting** | Order results (`ORDER BY`) | `SELECT * FROM prices ORDER BY time DESC` |

### **B. Advanced EDA Tools**

| Concept | Purpose | Example Query |
|---------|---------|---------------|
| **Window Functions** | Calculate rolling stats, ranks | `SELECT time, AVG(close) OVER (ORDER BY time ROWS 29 PRECEDING) FROM prices` |
| **CTEs (`WITH`)** | Break complex queries into steps | `WITH filtered AS (SELECT * FROM prices WHERE volume > 1000) SELECT * FROM filtered` |
| **Statistical Aggregates** | Built-in stats (`STDDEV`, `CORR`, `PERCENTILE_CONT`) | `SELECT CORR(open, close) FROM prices` |
| **Time-Series Handling** | Extract dates, resample | `SELECT DATE_TRUNC('hour', time) AS hour, AVG(close) FROM prices GROUP BY 1` |

---

## **2. Essential EDA Queries**

### **A. Data Profiling**

```sql
-- 1. Basic stats
SELECT COUNT(*) AS row_count,
       COUNT(DISTINCT asset) AS unique_assets,
       MIN(close) AS min_price,
       MAX(close) AS max_price,
       AVG(close) AS mean_price,
       STDDEV(close) AS volatility
FROM prices;

-- 2. Missing values
SELECT COUNT(*) - COUNT(close) AS missing_prices
FROM prices;

-- 3. Value distribution (histogram; pick a bin width that matches the
--    instrument's scale, e.g. 0.01 for major FX pairs)
SELECT FLOOR(close / 0.01) * 0.01 AS price_bin,
       COUNT(*) AS frequency
FROM prices
GROUP BY 1
ORDER BY 1;
```
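Quantiles round out the profile above; `PERCENTILE_CONT` (listed among the statistical aggregates) is the standard tool. A minimal sketch against the same assumed `prices` table:

```sql
-- Interpolated quartiles of the close price.
SELECT PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY close) AS p25,
       PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY close) AS median,
       PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY close) AS p75
FROM prices;
```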
### **B. Correlation Analysis**

```sql
-- 1. Pairwise correlations
SELECT CORR(EUR_USD, GBP_USD) AS eur_gbp,
       CORR(EUR_USD, USD_JPY) AS eur_jpy,
       CORR(GBP_USD, USD_JPY) AS gbp_jpy
FROM hourly_rates;

-- 2. Rolling correlation (30-period, i.e. 30 hours on hourly data)
-- Note: the z-scores below use whole-sample means/stddevs, so this is an
-- approximation; an exact rolling correlation needs windowed moments.
WITH normalized AS (
    SELECT time,
           (EUR_USD - AVG(EUR_USD) OVER ()) / STDDEV(EUR_USD) OVER () AS eur_norm,
           (GBP_USD - AVG(GBP_USD) OVER ()) / STDDEV(GBP_USD) OVER () AS gbp_norm
    FROM hourly_rates
)
SELECT time,
       AVG(eur_norm * gbp_norm) OVER (ORDER BY time ROWS 29 PRECEDING) AS rolling_corr
FROM normalized;
```

### **C. Time-Series EDA**

```sql
-- 1. Hourly volatility patterns
SELECT EXTRACT(HOUR FROM time) AS hour,
       AVG(ABS(close - open)) AS avg_volatility
FROM prices
GROUP BY 1
ORDER BY 1;

-- 2. Daily returns distribution
-- first()/last() are DuckDB aggregates; their result order is not guaranteed
-- without explicit ordering, and PostgreSQL needs a window-function workaround.
SELECT DATE_TRUNC('day', time) AS day,
       (last(close) - first(open)) / first(open) AS daily_return
FROM prices
GROUP BY 1;
```

### **D. Outlier Detection**

```sql
-- Z-score outliers (|Z| > 3)
WITH stats AS (
    SELECT AVG(close) AS mean, STDDEV(close) AS stddev
    FROM prices
)
SELECT time,
       close,
       (close - mean) / stddev AS z_score
FROM prices, stats
WHERE ABS((close - mean) / stddev) > 3;
```

---

## **3. Key Optimizations**

### **A. Indexing for EDA**

```sql
-- Speed up time-series queries
CREATE INDEX idx_prices_time ON prices(time);

-- Speed up asset-specific filters
CREATE INDEX idx_prices_asset ON prices(asset);
```

### **B. Partitioning Large Tables**

```sql
-- Partition by time range (PostgreSQL; child partitions must then be created)
CREATE TABLE prices (
    time  TIMESTAMP,
    asset TEXT,
    close FLOAT
) PARTITION BY RANGE (time);
```

### **C. Materialized Views**

```sql
-- Pre-compute frequent aggregates
CREATE MATERIALIZED VIEW hourly_stats AS
SELECT DATE_TRUNC('hour', time) AS hour,
       AVG(close) AS avg_price,
       STDDEV(close) AS volatility
FROM prices
GROUP BY 1;

-- Refresh periodically
REFRESH MATERIALIZED VIEW hourly_stats;
```

---

## **4. Pro Tips**

### **A. Use the Right Database**

| Database | Best For |
|----------|----------|
| **PostgreSQL** | Complex EDA, extensions (MADlib) |
| **DuckDB** | Embedded analytics, Parquet/CSV |
| **SQLite** | Lightweight, local prototyping |

### **B. Learn These Functions**

| Function | Purpose |
|----------|---------|
| `DATE_TRUNC()` | Group by time intervals |
| `FIRST_VALUE()` | First/last value in a window |
| `PERCENTILE_CONT()` | Median, quantiles |
| `ROLLUP`/`CUBE` | Hierarchical aggregations |

### **C. Avoid Anti-Patterns**

- **Don't** fetch all rows into Python just to filter them (`SELECT *` → `pd.read_sql()`).
- **Do** push computations to SQL (`WHERE`, `GROUP BY` in-database).
- **Don't** use Python loops for row-wise operations.
- **Do** use window functions for rolling calculations.

---

## **5. Study Roadmap**

1. **Beginner**:
   - SQL syntax (`SELECT`, `JOIN`, `GROUP BY`)
   - [SQLZoo](https://sqlzoo.net/) (free exercises)
2. **Intermediate**:
   - Window functions (`OVER`, `PARTITION BY`)
   - [PostgreSQL Window Functions Guide](https://www.postgresqltutorial.com/postgresql-window-function/)
3. **Advanced**:
   - Query optimization (`EXPLAIN ANALYZE`, indexing)
   - Time-series SQL ([TimescaleDB docs](https://docs.timescale.com/))
4. **EDA-Specific**:
   - Statistical SQL ([MADlib documentation](https://madlib.apache.org/))
   - Correlation patterns (rolling, cross-asset)

---

## **6. Sample Workflow**

```mermaid
flowchart TB
    A[Load Data] --> B[Profile Data]
    B --> C[Clean Data]
    C --> D[Explore Relationships]
    D --> E[Validate Hypotheses]
    E --> F[Visualize in Python]
```

**Example**:

1. Profile data → find missing values.
2. Clean → filter outliers.
3. Explore → calculate rolling correlations.
4. Validate → test "volatility clusters on Mondays" (see the sketch below).
5. Visualize → plot results in Python.
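Step 4's hypothesis is directly testable in SQL. A minimal sketch against the assumed `prices` table, using `EXTRACT(DOW ...)` (0 = Sunday in PostgreSQL):

```sql
-- Average absolute open-to-close move per weekday; a Monday figure that
-- stands out would support the clustering hypothesis.
SELECT EXTRACT(DOW FROM time) AS day_of_week,  -- 0 = Sunday ... 6 = Saturday
       AVG(ABS(close - open)) AS avg_abs_move
FROM prices
GROUP BY 1
ORDER BY 1;
```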
---

### **Final Thought**

SQL is **the** tool for structured EDA: it's faster, more scalable, and more maintainable than Python for these tasks. Master these concepts, and you'll outperform most analysts stuck in pandas.

Want a **ready-to-run Docker container** with PostgreSQL + sample forex data for practice? Let me know!

---

Here's how the SQL concepts you've presented fit into the EDA (Exploratory Data Analysis) world, organized by their relevance and application:

---

### **1. SQL Fundamentals in EDA**

#### **Data Manipulation Language (DML)**

- **SELECT**: Core to EDA for retrieving and filtering data (e.g., `SELECT * FROM sales WHERE date > '2023-01-01'`).
- **INSERT/UPDATE/DELETE**: Less common in pure EDA (used more in data preparation pipelines).

#### **Data Definition Language (DDL)**

- **CREATE/ALTER**: Used to set up analysis environments (e.g., creating temp tables for intermediate results).
- **TRUNCATE/DROP**: Rare in EDA unless resetting sandbox environments.

#### **Data Control Language (DCL)**

- **GRANT/REVOKE**: Relevant for team-based EDA to manage access to datasets.

#### **Transaction Control Language (TCL)**

- **COMMIT/ROLLBACK**: Critical for reproducible EDA to ensure query consistency.

---

### **2. Advanced SQL for Deeper EDA**

#### **Window Functions**

- **Ranking**: `RANK() OVER (PARTITION BY region ORDER BY revenue DESC)` to identify top performers.
- **Rolling Metrics**: `AVG(revenue) OVER (ORDER BY date ROWS 6 PRECEDING)` for 7-day moving averages (current row plus six prior).

#### **Common Table Expressions (CTEs)**

- Break complex EDA logic into readable steps:

```sql
WITH filtered_data AS (
    SELECT * FROM sales WHERE region = 'West'
)
SELECT product, SUM(revenue)
FROM filtered_data
GROUP BY product;
```

#### **JSON Handling**

- Analyze semi-structured data (e.g., API responses stored in JSON columns):

```sql
-- json_extract is SQLite/MySQL syntax; PostgreSQL uses the -> / ->> operators
SELECT json_extract(user_data, '$.demographics.age') FROM users;
```

---

### **3. Performance Optimization for Large-Scale EDA**

#### **Indexes**

- Speed up filtering on large tables:

```sql
CREATE INDEX idx_sales_date ON sales(date);
```

#### **Query Planning**

- Use `EXPLAIN ANALYZE` to identify bottlenecks in EDA queries.

#### **Partitioning**

- Improve performance on time-series EDA:

```sql
-- A partitioned table needs its column list declared up front
CREATE TABLE sales (
    date    DATE,
    revenue NUMERIC
) PARTITION BY RANGE (date);
```

---

### **4. SQL for Specific EDA Tasks**

#### **Data Profiling**

```sql
SELECT COUNT(*) AS row_count,
       COUNT(DISTINCT product_id) AS unique_products,
       AVG(price) AS avg_price,
       MIN(price) AS min_price,
       MAX(price) AS max_price
FROM products;
```

#### **Correlation Analysis**

```sql
SELECT CORR(price, units_sold) AS price_elasticity
FROM sales;
```

#### **Time-Series Analysis**

```sql
SELECT DATE_TRUNC('month', order_date) AS month,
       SUM(revenue) AS monthly_revenue,
       (SUM(revenue) - LAG(SUM(revenue)) OVER (ORDER BY DATE_TRUNC('month', order_date)))
           / LAG(SUM(revenue)) OVER (ORDER BY DATE_TRUNC('month', order_date)) AS mom_growth
FROM orders
GROUP BY 1;
```

#### **Outlier Detection**

```sql
WITH stats AS (
    SELECT AVG(price) AS mean, STDDEV(price) AS stddev
    FROM products
)
SELECT *
FROM products, stats
WHERE ABS((price - mean) / stddev) > 3;  -- Z-score > 3
```
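One more aggregation task worth knowing at this stage: `ROLLUP` (flagged earlier among the functions to learn) adds subtotal and grand-total rows to a plain `GROUP BY`. A minimal sketch assuming the same `orders(order_date, revenue)` shape as the time-series query above:

```sql
-- Monthly revenue plus per-year subtotals and a grand total in one pass;
-- NULLs in year/month mark the subtotal and total rows.
SELECT DATE_TRUNC('year', order_date)  AS year,
       DATE_TRUNC('month', order_date) AS month,
       SUM(revenue) AS revenue
FROM orders
GROUP BY ROLLUP (DATE_TRUNC('year', order_date), DATE_TRUNC('month', order_date))
ORDER BY year NULLS LAST, month NULLS LAST;
```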
### **5. Visualization Integration**

While SQL handles the analysis, visualization is delegated to tools like:

- **Metabase**: Connects directly to SQL databases for visualization
- **Python + SQLAlchemy**: Run SQL queries and visualize with Matplotlib/Seaborn
- **Tableau**: Direct SQL connections for dashboards

Example workflow:

```python
# Python snippet for SQL-powered EDA
import pandas as pd
import seaborn as sns
from sqlalchemy import create_engine

# Placeholder connection string; replace with your own database URL
engine = create_engine("postgresql://user:password@localhost:5432/analytics")

df = pd.read_sql("""
    SELECT date, SUM(revenue) AS daily_revenue
    FROM sales
    GROUP BY date
""", engine)

sns.lineplot(data=df, x='date', y='daily_revenue')
```

---

### **6. EDA Workflow with SQL**

1. **Data Discovery**:
```sql
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'sales';
```
2. **Initial Profiling**: Basic stats, missing values
3. **Hypothesis Testing**: Use SQL to validate assumptions
4. **Feature Engineering**: Create derived columns for analysis
5. **Visualization Prep**: Aggregate data for plotting

---

### **Key Tools for SQL-Based EDA**

| Tool | Best For | Open Source |
|------|----------|-------------|
| **DBeaver** | Multi-DB exploration | ✓ |
| **PostgreSQL** | Advanced analytics | ✓ |
| **DuckDB** | Embedded analytical SQL | ✓ |
| **Jupyter + SQL Magic** | Interactive analysis | ✓ |

---

### **When to Use SQL vs. Other Tools in EDA**

| Task | Best Tool |
|------|-----------|
| Initial data profiling | SQL |
| Complex aggregations | SQL |
| Statistical testing | Python/R |
| Advanced visualization | Python/R/Tableau |
| Machine learning prep | SQL + Python |

---

### **Conclusion**

SQL is foundational for EDA because:

1. **Efficiency**: Processes large datasets faster than pulling them into in-memory tools
2. **Reproducibility**: Queries document the analysis steps
3. **Precision**: Exact calculations without sampling
4. **Scalability**: Handles TB-scale data with proper indexing

For modern EDA:

- Start with SQL for data exploration/aggregation
- Switch to Python/R for advanced statistics/ML
- Use visualization tools that connect directly to SQL databases

Would you like me to develop a specific EDA workflow for your particular dataset or industry?