SQL for Forex Data Analysis: The 20% That Delivers 80% Results
Focused Learning Roadmap
Master these core skills to handle most forex data analysis tasks
Phase 1: Core Skills (Week 1-2)
| What to Learn | Why It Matters | Key Syntax Examples |
|---|---|---|
| Filtering Data | Isolate specific currency pairs/timeframes | SELECT * FROM ticks WHERE symbol='EUR/USD' AND timestamp > '2023-01-01' |
| Time Bucketing | Convert raw ticks into candlesticks (1min/5min/1H) | DATE_TRUNC('hour', timestamp) AS hour |
| Basic Aggregates | Calculate spreads, highs/lows, averages | AVG(ask-bid) AS avg_spread, MAX(ask) AS high |
| Grouping | Summarize data by pair/time period | GROUP BY symbol, DATE_TRUNC('day', timestamp) |
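The four Phase 1 skills can be exercised end to end without a database server. The sketch below uses Python's stdlib `sqlite3` as a stand-in for PostgreSQL (SQLite has no `DATE_TRUNC`, so `strftime` does the hour bucketing); the table layout and prices are invented for illustration.

```python
import sqlite3

# In-memory SQLite database with a toy tick table (schema is illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticks (symbol TEXT, ts TEXT, bid REAL, ask REAL)")
conn.executemany(
    "INSERT INTO ticks VALUES (?, ?, ?, ?)",
    [
        ("EUR/USD", "2023-01-02 09:15:00", 1.0700, 1.0702),
        ("EUR/USD", "2023-01-02 09:45:00", 1.0704, 1.0707),
        ("EUR/USD", "2023-01-02 10:05:00", 1.0710, 1.0712),
        ("GBP/USD", "2023-01-02 09:20:00", 1.2000, 1.2003),
    ],
)

# Filter one pair, bucket ticks by hour, and aggregate the spread.
rows = conn.execute(
    """
    SELECT strftime('%Y-%m-%d %H', ts) AS hour,
           COUNT(*) AS n_ticks,
           ROUND(AVG(ask - bid), 6) AS avg_spread
    FROM ticks
    WHERE symbol = 'EUR/USD'
    GROUP BY hour
    ORDER BY hour
    """
).fetchall()
for row in rows:
    print(row)
```

The same query shape ports to PostgreSQL by swapping `strftime` for `DATE_TRUNC('hour', ts)`.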
Phase 2: Essential Techniques (Week 3-4)
| Skill | Forex Application | Example |
|---|---|---|
| Joins | Combine tick data with economic calendars | JOIN economic_events ON ticks.date = events.date |
| Rolling Windows | Calculate moving averages & volatility | AVG(price) OVER (ORDER BY timestamp ROWS 30 PRECEDING) |
| Correlations | Compare currency pairs (e.g., EUR/USD vs. USD/JPY) | CORR(eurusd_mid, usdjpy_mid) |
| Session Analysis | Compare volatility across trading sessions | WHERE EXTRACT(HOUR FROM timestamp) IN (7,13,21) (London/NY/Asia hours) |
Phase 3: Optimization (Week 5)
| Skill | Impact | Implementation |
|---|---|---|
| Indexing | Speed up time/symbol queries | CREATE INDEX idx_symbol_time ON ticks(symbol, timestamp) |
| CTEs | Break complex queries into steps | WITH filtered AS (...) SELECT * FROM filtered |
| Partitioning | Faster queries on large datasets | PARTITION BY RANGE (timestamp) |
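To see the indexing row in action without installing PostgreSQL, here is a small SQLite sketch: a composite index on (symbol, timestamp) shows up in the query plan as an index search rather than a full table scan. The table and index names follow the example above; the schema is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticks (symbol TEXT, ts TEXT, bid REAL, ask REAL)")
# Composite index matching the filter columns (symbol equality, ts range).
conn.execute("CREATE INDEX idx_symbol_time ON ticks(symbol, ts)")

# EXPLAIN QUERY PLAN is SQLite's equivalent of PostgreSQL's EXPLAIN.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM ticks WHERE symbol = 'EUR/USD' AND ts > '2023-01-01'"
).fetchall()
for step in plan:
    print(step)  # the plan detail mentions idx_symbol_time
```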
Essential Forex Queries You'll Use Daily
1. Current Spread Analysis
SELECT symbol, AVG(ask-bid) AS spread FROM ticks WHERE timestamp > NOW() - INTERVAL '1 hour' GROUP BY symbol;
2. 5-Minute Candlesticks (PostgreSQL's DATE_TRUNC only accepts single units like 'minute'; use date_bin, available in PostgreSQL 14+, for 5-minute buckets)
SELECT date_bin('5 minutes', timestamp, TIMESTAMP '2000-01-01') AS time, MIN(bid) AS low, MAX(ask) AS high FROM ticks WHERE symbol = 'GBP/USD' GROUP BY 1;
3. Rolling Volatility
SELECT timestamp, STDDEV(ask) OVER (ORDER BY timestamp ROWS 100 PRECEDING) AS vol FROM ticks WHERE symbol = 'EUR/USD';
4. Session Volume Comparison
SELECT CASE WHEN EXTRACT(HOUR FROM timestamp) BETWEEN 7 AND 15 THEN 'London' ELSE 'Other' END AS session, SUM(volume) AS total_volume FROM ticks GROUP BY session;
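The rolling-window queries above rely on the OVER clause, which SQLite (3.25+) also supports, so they can be tried locally. A minimal sketch with invented prices (SQLite lacks STDDEV, so this uses a rolling AVG; the windowing mechanics are identical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticks (ts INTEGER, price REAL)")
conn.executemany("INSERT INTO ticks VALUES (?, ?)",
                 [(1, 1.0), (2, 2.0), (3, 3.0), (4, 4.0), (5, 5.0)])

# Rolling mean over the current row and the 2 preceding rows.
rows = conn.execute(
    """
    SELECT ts,
           AVG(price) OVER (
               ORDER BY ts
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS ma_3
    FROM ticks
    ORDER BY ts
    """
).fetchall()
print(rows)  # ma_3 at ts=3 is (1+2+3)/3 = 2.0
```

Swapping `AVG` for `STDDEV` gives the rolling-volatility query on engines that provide it (PostgreSQL does).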
Study Plan
- Week 1: Master SELECT, WHERE, GROUP BY, DATE_TRUNC
  → Goal: Generate hourly OHLC data for one currency pair
- Week 2: Learn JOIN, AVG() OVER(), CORR()
  → Goal: Compare two pairs' correlation over different timeframes
- Week 3: Optimize with indexes + CTEs
  → Goal: Run a 1M-row query in <1 second
Tools to Start With
- Data: Free forex ticks from Dukascopy
- Database: DuckDB (lightweight, no setup required)
- Visualization: Metabase (free) or Python with Matplotlib
What to Avoid (For Now)
- Stored procedures
- Advanced indexing strategies
- Machine learning in SQL
- Recursive queries
Bare Minimum Survival Kit
- WHERE + DATE_TRUNC (filter and bucket time data)
- GROUP BY (summarize data efficiently)
- AVG() OVER() (rolling calculations)
- CORR() (measure pair relationships)
Next Step: Want a 5-question quiz to test your core knowledge? Or shall we dive deeper into one technique?
Understood! You're seeking maximum SQL mastery with minimum time investment by focusing on the critical 20% that delivers 80% of results in forex data analysis. Here's your laser-focused SQL SME blueprint:
The SQL SME 20% Framework for Forex
1. Core Principles to Internalize
- Forex Data is Time-Series First
  - Every query must handle timestamp filtering efficiently
  - Key skill: DATE_TRUNC() + WHERE on time ranges
- Bid/Ask Dynamics are Fundamental
  - Spread analysis unlocks liquidity insights
  - Key skill: (ask - bid) calculations with proper grouping
- Rolling Windows > Static Aggregates
  - Market conditions change constantly; analyze trends, not snapshots
  - Key skill: AVG() OVER(ROWS N PRECEDING)
2. The 10 Essential Patterns (Memorize These)
| # | Pattern | Forex Application | Example |
|---|---|---|---|
| 1 | Time Bucketing | Convert ticks → candles | date_bin('15 minutes', timestamp, TIMESTAMP '2000-01-01') |
| 2 | Rolling Volatility | Measure risk | STDDEV(price) OVER(ROWS 99 PRECEDING) |
| 3 | Session Comparison | London vs. NY activity | WHERE EXTRACT(HOUR FROM timestamp) IN (7,13) |
| 4 | Pair Correlation | Hedge ratios | CORR(eurusd, usdjpy) |
| 5 | Spread Analysis | Liquidity monitoring | AVG(ask - bid) GROUP BY symbol |
| 6 | Event Impact | NFP/CPI reactions | WHERE timestamp BETWEEN event-15min AND event+1H |
| 7 | Liquidity Zones | Volume clusters | NTILE(4) OVER(ORDER BY volume) |
| 8 | Outlier Detection | Data quality checks | ABS(price - mean) > 3 * stddev (stats from a CTE) |
| 9 | Gap Analysis | Weekend openings | LAG(close) OVER() - open |
| 10 | Rolling Sharpe | Strategy performance | AVG(return)/STDDEV(return) OVER() |
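Pattern 9 (gap analysis) is easy to verify on toy data. Here is a sketch using SQLite's LAG(); the daily table and its prices are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (day TEXT, open REAL, close REAL)")
conn.executemany("INSERT INTO daily VALUES (?, ?, ?)",
                 [("2023-01-06", 1.05, 1.06),    # Friday close
                  ("2023-01-09", 1.08, 1.07)])   # Monday opens above Friday's close

# Gap = today's open minus the previous row's close.
rows = conn.execute(
    """
    SELECT day,
           ROUND(open - LAG(close) OVER (ORDER BY day), 4) AS gap
    FROM daily
    ORDER BY day
    """
).fetchall()
print(rows)  # gap on 2023-01-09 is 1.08 - 1.06 = 0.02
```

The first row has no predecessor, so LAG() yields NULL there; real weekend-gap queries would also filter to Mondays.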
3. SME-Level Documentation Template
For each pattern, document:
- Business Purpose: "Identify optimal trading hours by comparing volatility across sessions"
- Technical Implementation:
  SELECT EXTRACT(HOUR FROM timestamp) AS hour, STDDEV((bid+ask)/2) AS volatility FROM ticks WHERE symbol = 'EUR/USD' GROUP BY hour ORDER BY volatility DESC
- Performance Considerations: "Add composite index on (symbol, timestamp) for 100x speedup"
- Edge Cases: "Exclude holidays where volatility is artificially low"
4. Drills to Achieve Mastery
Daily Challenge (15 mins/day)
- Day 1: Generate 1H candles with OHLC + volume
- Day 2: Calculate 30-period rolling correlation between EUR/USD and GBP/USD
- Day 3: Find days with spread > 2x 30-day average
- Day 4: Compare pre/post-FOMC volatility
- Day 5: Optimize a slow query using EXPLAIN ANALYZE
Weekly Project
- Build a volatility surface showing:
SELECT symbol, DATE_TRUNC('hour', timestamp) AS hour, STDDEV((bid+ask)/2) AS vol, AVG(ask-bid) AS spread FROM ticks GROUP BY symbol, hour
5. Forensic Analysis Checklist
When reviewing any forex query, ask:
- Time Handling:
- ✅ Timestamps in UTC?
- ✅ Correct timezone conversions?
- Spread Awareness:
- ✅ Using (bid+ask)/2 for mid-price?
- ✅ Calculating raw spread metrics?
- Rolling vs Static:
- ✅ Using window functions where appropriate?
- Performance:
- ✅ Indexes on (symbol, timestamp)?
- ✅ Avoiding full table scans?
6. Pro-Grade Resources
- Dukascopy Tick Data: Real-world practice dataset
- PostgreSQL Window Functions Docs: Master the OVER() clause
- Forex Economic Calendar: For event impact studies
Key Mindset Shifts
- From: "I need to learn all SQL functions"
  To: "I need to master time-series patterns for bid/ask data"
- From: "Queries must return perfect results"
  To: "Queries must be fast enough for real-time decisions"
- From: "Document after writing"
  To: "Document patterns before implementation"
Next Step: Pick one pattern from the 10 above. I'll provide:
- Detailed breakdown
- Common mistakes
- Optimization techniques
- Forensic analysis questions
Which pattern would you like to deep-dive into first? (Recommend starting with Time Bucketing or Rolling Volatility)
Here's a structured Technical Guide & Roadmap for Forex Tick Data Analysis with SQL, designed as a progressive learning path with clear milestones and reference examples:
Forex Tick Data Analysis: SQL Learning Roadmap
A step-by-step guide from beginner to advanced techniques
Phase 1: Foundations
1.1 Understanding Your Data
- Structure: forex ticks typically contain symbol (e.g., 'EUR/USD'), timestamp (millisecond precision), bid price, ask price, volume
- Key Metrics:
  - Spread: ask - bid (liquidity measure)
  - Mid-price: (bid + ask) / 2 (reference price)
1.2 Basic SQL Operations
-- Sample data inspection
SELECT * FROM forex_ticks
WHERE symbol = 'EUR/USD'
LIMIT 100;
-- Count ticks per pair
SELECT symbol, COUNT(*)
FROM forex_ticks
GROUP BY symbol;
-- Time range filtering
SELECT MIN(timestamp), MAX(timestamp)
FROM forex_ticks;
Phase 2: Core Analysis
2.1 Spread Analysis
-- Basic spread stats
SELECT
symbol,
AVG(ask - bid) AS avg_spread,
MAX(ask - bid) AS max_spread
FROM forex_ticks
GROUP BY symbol;
2.2 Time Bucketing
-- 5-minute candlesticks (DATE_TRUNC only accepts single units;
-- date_bin, PostgreSQL 14+, handles arbitrary intervals)
SELECT
symbol,
date_bin('5 minutes', timestamp, TIMESTAMP '2000-01-01') AS time_bucket,
MIN(bid) AS low,
MAX(ask) AS high,
AVG((bid+ask)/2) AS avg_mid  -- a true close needs LAST_VALUE() over the bucket
FROM forex_ticks
GROUP BY symbol, time_bucket;
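When the engine has no date_bin equivalent, arbitrary intervals can be bucketed by flooring the Unix epoch to a multiple of the bucket size. A sketch in SQLite (timestamps invented); the same integer-division idea ports to any dialect:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticks (ts TEXT, bid REAL, ask REAL)")
conn.executemany("INSERT INTO ticks VALUES (?, ?, ?)",
                 [("2023-01-02 09:01:10", 1.2000, 1.2002),
                  ("2023-01-02 09:03:55", 1.2004, 1.2006),
                  ("2023-01-02 09:07:20", 1.2010, 1.2012)])

# Floor the epoch to a 300-second (5-minute) boundary, then aggregate.
rows = conn.execute(
    """
    SELECT datetime((strftime('%s', ts) / 300) * 300, 'unixepoch') AS bucket,
           MIN(bid) AS low,
           MAX(ask) AS high
    FROM ticks
    GROUP BY bucket
    ORDER BY bucket
    """
).fetchall()
print(rows)
```

The first two ticks land in the 09:00:00 bucket and the third in 09:05:00, mirroring what date_bin would produce.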
2.3 Session Analysis
-- Volume by hour (GMT)
SELECT
EXTRACT(HOUR FROM timestamp) AS hour,
AVG(volume) AS avg_volume
FROM forex_ticks
WHERE symbol = 'GBP/USD'
GROUP BY hour
ORDER BY hour;
Phase 3: Intermediate Techniques
3.1 Rolling Calculations
-- 30-tick moving average (for a true 30-minute window,
-- bucket to 1-minute bars first or use RANGE with an interval)
SELECT
timestamp,
symbol,
AVG((bid+ask)/2) OVER (
PARTITION BY symbol
ORDER BY timestamp
ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
) AS ma_30  -- identifiers may not start with a digit
FROM forex_ticks;
3.2 Pair Correlation
WITH hourly_prices AS (
SELECT
DATE_TRUNC('hour', timestamp) AS hour,
symbol,
AVG((bid+ask)/2) AS mid_price
FROM forex_ticks
GROUP BY hour, symbol
)
SELECT
a.symbol AS pair1,
b.symbol AS pair2,
CORR(a.mid_price, b.mid_price) AS correlation
FROM hourly_prices a
JOIN hourly_prices b ON a.hour = b.hour
WHERE a.symbol < b.symbol
GROUP BY pair1, pair2;
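Not every engine ships CORR(). Pearson correlation can be recovered from plain AVG() aggregates as cov(x, y) / (std(x) * std(y)). A sketch in SQLite with a perfectly correlated toy series, doing the square roots in Python since stock SQLite may lack SQRT():

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hourly (hour INTEGER, eurusd REAL, gbpusd REAL)")
conn.executemany("INSERT INTO hourly VALUES (?, ?, ?)",
                 [(1, 1.0, 2.0), (2, 2.0, 4.0), (3, 3.0, 6.0)])

# Population covariance and variances from AVG() alone:
# cov(x,y) = E[xy] - E[x]E[y];  var(x) = E[x^2] - E[x]^2
cov, var_x, var_y = conn.execute(
    """
    SELECT AVG(eurusd * gbpusd) - AVG(eurusd) * AVG(gbpusd),
           AVG(eurusd * eurusd) - AVG(eurusd) * AVG(eurusd),
           AVG(gbpusd * gbpusd) - AVG(gbpusd) * AVG(gbpusd)
    FROM hourly
    """
).fetchone()
corr = cov / math.sqrt(var_x * var_y)
print(round(corr, 6))  # a perfectly correlated series gives 1.0
```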
Phase 4: Advanced Topics
4.1 Volatility Measurement
WITH returns AS (
SELECT
symbol,
timestamp,
(ask - LAG(ask) OVER (PARTITION BY symbol ORDER BY timestamp)) /
LAG(ask) OVER (PARTITION BY symbol ORDER BY timestamp) AS ret
FROM forex_ticks
)
SELECT
symbol,
STDDEV(ret) AS tick_return_volatility  -- per-tick; bucket to hours first for an hourly figure
FROM returns
GROUP BY symbol;
4.2 Event Impact Analysis
-- Compare 15-min pre/post NFP release
SELECT
AVG(CASE WHEN timestamp BETWEEN '2023-12-01 13:30' AND '2023-12-01 13:45'
THEN (bid+ask)/2 END) AS post_NFP,
AVG(CASE WHEN timestamp BETWEEN '2023-12-01 13:15' AND '2023-12-01 13:30'
THEN (bid+ask)/2 END) AS pre_NFP
FROM forex_ticks
WHERE symbol = 'EUR/USD';
Study Roadmap
Weekly Learning Plan
| Week | Focus Area | Key Skills |
|---|---|---|
| 1 | SQL Basics | SELECT, WHERE, GROUP BY |
| 2 | Time Handling | DATE_TRUNC, EXTRACT, timezones |
| 3 | Aggregations | AVG, STDDEV, CORR |
| 4 | Window Functions | OVER, PARTITION BY, rolling calcs |
| 5 | Optimization | Indexes, query planning |
| 6 | Advanced Patterns | Volatility modeling, microstructure |
Cheat Sheet
Essential Functions
| Function | Purpose | Example |
|---|---|---|
| DATE_TRUNC | Bucket timestamps | DATE_TRUNC('hour', timestamp) |
| EXTRACT | Get time parts | EXTRACT(HOUR FROM timestamp) |
| CORR | Correlation | CORR(price1, price2) |
| AVG() OVER | Moving average | AVG(price) OVER (ORDER BY time ROWS 30 PRECEDING) |
Common Patterns
-- Get latest price per pair
SELECT DISTINCT ON (symbol) symbol, bid, ask
FROM forex_ticks
ORDER BY symbol, timestamp DESC;
-- Detect stale data
SELECT symbol, MAX(timestamp) AS last_update
FROM forex_ticks
GROUP BY symbol
HAVING MAX(timestamp) < NOW() - INTERVAL '5 minutes';
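DISTINCT ON is PostgreSQL-specific; a portable version of "latest price per pair" ranks ticks per symbol with ROW_NUMBER() in a CTE. A sketch on invented data in SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE forex_ticks (symbol TEXT, ts TEXT, bid REAL, ask REAL)")
conn.executemany("INSERT INTO forex_ticks VALUES (?, ?, ?, ?)",
                 [("EUR/USD", "2023-01-02 09:00:00", 1.070, 1.0702),
                  ("EUR/USD", "2023-01-02 09:01:00", 1.071, 1.0712),
                  ("GBP/USD", "2023-01-02 09:00:30", 1.200, 1.2003)])

# Rank ticks newest-first within each symbol, then keep rank 1.
rows = conn.execute(
    """
    WITH ranked AS (
        SELECT symbol, ts, bid, ask,
               ROW_NUMBER() OVER (PARTITION BY symbol ORDER BY ts DESC) AS rn
        FROM forex_ticks
    )
    SELECT symbol, bid, ask FROM ranked WHERE rn = 1
    ORDER BY symbol
    """
).fetchall()
print(rows)
```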
Next Steps
- Practice Dataset: Download free forex tick data from Dukascopy Bank
- Sandbox Setup: Install PostgreSQL + TimescaleDB for time-series optimizations
- Projects:
- Build a volatility dashboard
- Analyze London vs. NY session spreads
- Track correlation breakdowns during crises
Pro Tip: Bookmark this guide and revisit each phase as your skills progress. Start with Phase 1 queries, then gradually incorporate more complex techniques.
The Ultimate SQL Getting Started Guide
This guide will take you from absolute beginner to SQL proficiency, with a focus on practical data analysis and EDA applications.
1. SQL Fundamentals
What is SQL?
SQL (Structured Query Language) is the standard language for interacting with relational databases. It allows you to:
- Retrieve data
- Insert, update, and delete records
- Create and modify database structures
- Perform complex calculations on data
Core Concepts
- Databases: Collections of structured data
- Tables: Data organized in rows and columns
- Queries: Commands to interact with data
- Schemas: Blueprints defining database structure
2. Setting Up Your SQL Environment
Choose a Database System
| Option | Best For | Installation |
|---|---|---|
| SQLite | Beginners, small projects | Built into Python |
| PostgreSQL | Production, complex queries | Download here |
| MySQL | Web applications | Download here |
| DuckDB | Analytical workloads | pip install duckdb |
Install a SQL Client
- DBeaver (Free, multi-platform)
- TablePlus (Paid, excellent UI)
- VS Code + SQL Tools (For developers)
3. Basic SQL Syntax
SELECT Statements
-- Basic selection
SELECT column1, column2 FROM table_name;
-- Select all columns
SELECT * FROM table_name;
-- Filtering with WHERE
SELECT * FROM table_name WHERE condition;
-- Sorting with ORDER BY
SELECT * FROM table_name ORDER BY column1 DESC;
Common Data Types
- INTEGER: Whole numbers
- FLOAT/REAL: Decimal numbers
- VARCHAR(n): Text (n characters max)
- BOOLEAN: True/False
- DATE/TIMESTAMP: Date and time values
4. Essential SQL Operations
Filtering Data
-- Basic conditions
SELECT * FROM employees WHERE salary > 50000;
-- Multiple conditions
SELECT * FROM products
WHERE price BETWEEN 10 AND 100
AND category = 'Electronics';
-- Pattern matching
SELECT * FROM customers
WHERE name LIKE 'J%'; -- Starts with J
Sorting and Limiting
-- Sort by multiple columns
SELECT * FROM orders
ORDER BY order_date DESC, total_amount DESC;
-- Limit results
SELECT * FROM large_table LIMIT 100;
Aggregation Functions
-- Basic aggregations
SELECT
COUNT(*) AS total_orders,
AVG(amount) AS avg_order,
MAX(amount) AS largest_order
FROM orders;
-- GROUP BY
SELECT
department,
AVG(salary) AS avg_salary
FROM employees
GROUP BY department;
5. Joining Tables
Join Types
| Join Type | Description | Example |
|---|---|---|
| INNER JOIN | Only matching rows | SELECT * FROM A INNER JOIN B ON A.id = B.id |
| LEFT JOIN | All from left table, matches from right | SELECT * FROM A LEFT JOIN B ON A.id = B.id |
| RIGHT JOIN | All from right table, matches from left | SELECT * FROM A RIGHT JOIN B ON A.id = B.id |
| FULL JOIN | All rows from both tables | SELECT * FROM A FULL JOIN B ON A.id = B.id |
Practical Example
SELECT
o.order_id,
c.customer_name,
o.order_date,
o.total_amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date > '2023-01-01'
ORDER BY o.total_amount DESC;
6. Advanced SQL Features
Common Table Expressions (CTEs)
WITH high_value_customers AS (
SELECT customer_id, SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id
HAVING SUM(amount) > 1000
)
SELECT * FROM high_value_customers;
Window Functions
-- Running total
SELECT
date,
revenue,
SUM(revenue) OVER (ORDER BY date) AS running_total
FROM daily_sales;
-- Rank products by category
SELECT
product_name,
category,
price,
RANK() OVER (PARTITION BY category ORDER BY price DESC) AS price_rank
FROM products;
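The running-total query above can be checked on a toy table; here it is run via Python's stdlib `sqlite3` with invented revenue figures:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (date TEXT, revenue REAL)")
conn.executemany("INSERT INTO daily_sales VALUES (?, ?)",
                 [("2023-01-01", 100.0), ("2023-01-02", 150.0), ("2023-01-03", 50.0)])

# Cumulative sum: the window grows one row at a time as dates advance.
rows = conn.execute(
    """
    SELECT date, revenue,
           SUM(revenue) OVER (ORDER BY date) AS running_total
    FROM daily_sales
    ORDER BY date
    """
).fetchall()
print(rows)  # running totals: 100.0, 250.0, 300.0
```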
7. SQL for Data Analysis
Time Series Analysis
-- Daily aggregates
SELECT
DATE_TRUNC('day', transaction_time) AS day,
COUNT(*) AS transactions,
SUM(amount) AS total_amount
FROM transactions
GROUP BY 1
ORDER BY 1;
-- Month-over-month growth
WITH monthly_sales AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS total_sales
FROM orders
GROUP BY 1
)
SELECT
month,
total_sales,
(total_sales - LAG(total_sales) OVER (ORDER BY month)) /
LAG(total_sales) OVER (ORDER BY month) AS growth_rate
FROM monthly_sales;
Pivot Tables in SQL
-- Using CASE statements
SELECT
product_category,
SUM(CASE WHEN EXTRACT(YEAR FROM order_date) = 2022 THEN amount ELSE 0 END) AS sales_2022,
SUM(CASE WHEN EXTRACT(YEAR FROM order_date) = 2023 THEN amount ELSE 0 END) AS sales_2023
FROM orders
GROUP BY product_category;
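The CASE-based pivot runs on any engine with conditional aggregation; here it is against SQLite, where `strftime('%Y', ...)` replaces `EXTRACT(YEAR FROM ...)`. The orders are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (product_category TEXT, order_date TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [("Books", "2022-03-01", 10.0),
                  ("Books", "2023-05-01", 20.0),
                  ("Games", "2023-07-01", 30.0)])

# One output column per year: the CASE routes each row's amount
# into the matching year's SUM, contributing 0 elsewhere.
rows = conn.execute(
    """
    SELECT product_category,
           SUM(CASE WHEN strftime('%Y', order_date) = '2022' THEN amount ELSE 0 END) AS sales_2022,
           SUM(CASE WHEN strftime('%Y', order_date) = '2023' THEN amount ELSE 0 END) AS sales_2023
    FROM orders
    GROUP BY product_category
    ORDER BY product_category
    """
).fetchall()
print(rows)
```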
8. Performance Optimization
Indexing Strategies
-- Create indexes
CREATE INDEX idx_customer_name ON customers(name);
CREATE INDEX idx_order_date ON orders(order_date);
-- Composite index
CREATE INDEX idx_category_price ON products(category, price);
Query Optimization Tips
- Use EXPLAIN ANALYZE to understand query plans
- Limit columns in SELECT (avoid SELECT *)
- Filter early with WHERE clauses
- Use appropriate join types
9. Learning Resources
Books
- "SQL for Data Analysis" by Cathy Tanimura
- "SQL Cookbook" by Anthony Molinaro
10. Next Steps
- Install a database system and practice daily
- Work with real datasets (try Kaggle datasets)
- Build a portfolio project (e.g., analyze sales data)
- Learn database design (normalization, relationships)
Remember: SQL is a skill best learned by doing. Start writing queries today!
Technical Overview: SQL for EDA (Structured Data Analysis)
You're diving into SQL-first EDA—excellent choice. Below is a structured roadmap covering key SQL concepts, EDA-specific queries, and pro tips to maximize efficiency.
1. Core SQL Concepts for EDA
A. Foundational Operations
| Concept | Purpose | Example Query |
|---|---|---|
| Filtering | Subset data (WHERE, HAVING) | SELECT * FROM prices WHERE asset = 'EUR_USD' |
| Aggregation | Summarize data (GROUP BY) | SELECT asset, AVG(close) FROM prices GROUP BY asset |
| Joins | Combine tables (INNER JOIN) | SELECT * FROM trades JOIN assets ON trades.id = assets.id |
| Sorting | Order results (ORDER BY) | SELECT * FROM prices ORDER BY time DESC |
B. Advanced EDA Tools
| Concept | Purpose | Example Query |
|---|---|---|
| Window Functions | Calculate rolling stats, ranks | SELECT time, AVG(close) OVER (ORDER BY time ROWS 29 PRECEDING) FROM prices |
| CTEs (WITH) | Break complex queries into steps | WITH filtered AS (SELECT * FROM prices WHERE volume > 1000) SELECT * FROM filtered |
| Statistical Aggregates | Built-in stats (STDDEV, CORR, PERCENTILE_CONT) | SELECT CORR(open, close) FROM prices |
| Time-Series Handling | Extract dates, resample | SELECT DATE_TRUNC('hour', time) AS hour, AVG(close) FROM prices GROUP BY 1 |
2. Essential EDA Queries
A. Data Profiling
-- 1. Basic stats
SELECT
COUNT(*) AS row_count,
COUNT(DISTINCT asset) AS unique_assets,
MIN(close) AS min_price,
MAX(close) AS max_price,
AVG(close) AS mean_price,
STDDEV(close) AS volatility
FROM prices;
-- 2. Missing values
SELECT
COUNT(*) - COUNT(close) AS missing_prices
FROM prices;
-- 3. Value distribution (histogram)
SELECT
FLOOR(close / 10) * 10 AS price_bin,
COUNT(*) AS frequency
FROM prices
GROUP BY 1
ORDER BY 1;
B. Correlation Analysis
-- 1. Pairwise correlations
SELECT
CORR(EUR_USD, GBP_USD) AS eur_gbp,
CORR(EUR_USD, USD_JPY) AS eur_jpy,
CORR(GBP_USD, USD_JPY) AS gbp_jpy
FROM hourly_rates;
-- 2. Rolling correlation (30-day)
WITH normalized AS (
SELECT
time,
(EUR_USD - AVG(EUR_USD) OVER()) / STDDEV(EUR_USD) OVER() AS eur_norm,
(GBP_USD - AVG(GBP_USD) OVER()) / STDDEV(GBP_USD) OVER() AS gbp_norm
FROM hourly_rates
)
SELECT
time,
AVG(eur_norm * gbp_norm) OVER(ORDER BY time ROWS 29 PRECEDING) AS rolling_corr
FROM normalized;
C. Time-Series EDA
-- 1. Hourly volatility patterns
SELECT
EXTRACT(HOUR FROM time) AS hour,
AVG(ABS(close - open)) AS avg_volatility
FROM prices
GROUP BY 1
ORDER BY 1;
-- 2. Daily returns distribution
SELECT
DATE_TRUNC('day', time) AS day,
(last(close, time) - first(open, time)) / first(open, time) AS daily_return  -- first()/last() are TimescaleDB; plain PostgreSQL needs window functions
FROM prices
GROUP BY 1;
D. Outlier Detection
-- Z-score outliers (|Z| > 3)
WITH stats AS (
SELECT
AVG(close) AS mean,
STDDEV(close) AS stddev
FROM prices
)
SELECT
time,
close,
(close - mean) / stddev AS z_score
FROM prices, stats
WHERE ABS((close - mean) / stddev) > 3;
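The Z-score pattern above assumes STDDEV() exists. Where it doesn't (e.g., stock SQLite), population variance can be derived from AVG(x*x) - AVG(x)^2, and squaring both sides of |z| > 3 avoids needing SQRT(). A sketch with one planted outlier:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (ts INTEGER, close REAL)")
# Twenty ordinary ticks, one mildly off, one gross outlier.
conn.executemany("INSERT INTO prices VALUES (?, ?)",
                 [(i, 100.0) for i in range(20)] + [(20, 100.5), (21, 150.0)])

rows = conn.execute(
    """
    WITH stats AS (
        SELECT AVG(close) AS mean,
               AVG(close * close) - AVG(close) * AVG(close) AS var
        FROM prices
    )
    SELECT ts, close
    FROM prices, stats
    WHERE (close - mean) * (close - mean) > 9 * var  -- |z| > 3, squared
    """
).fetchall()
print(rows)  # only the 150.0 tick survives the filter
```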
3. Key Optimizations
A. Indexing for EDA
-- Speed up time-series queries
CREATE INDEX idx_prices_time ON prices(time);
-- Speed up asset-specific filters
CREATE INDEX idx_prices_asset ON prices(asset);
B. Partitioning Large Tables
-- Partition by time range (PostgreSQL)
CREATE TABLE prices (
time TIMESTAMP,
asset TEXT,
close FLOAT
) PARTITION BY RANGE (time);
C. Materialized Views
-- Pre-compute frequent aggregates
CREATE MATERIALIZED VIEW hourly_stats AS
SELECT
DATE_TRUNC('hour', time) AS hour,
AVG(close) AS avg_price,
STDDEV(close) AS volatility
FROM prices
GROUP BY 1;
-- Refresh periodically
REFRESH MATERIALIZED VIEW hourly_stats;
4. Pro Tips
A. Use the Right Database
| Database | Best For |
|---|---|
| PostgreSQL | Complex EDA, extensions (MADlib) |
| DuckDB | Embedded analytics, Parquet/CSV |
| SQLite | Lightweight, local prototyping |
B. Learn These Functions
| Function | Purpose |
|---|---|
| DATE_TRUNC() | Group by time intervals |
| FIRST_VALUE() | First/last in a window |
| PERCENTILE_CONT() | Median, quantiles |
| ROLLUP/CUBE | Hierarchical aggregations |
C. Avoid Anti-Patterns
- Don’t fetch all rows into Python just to filter them (SELECT * → pd.read_sql()).
- Do push computations to SQL (WHERE, GROUP BY in-database).
- Don’t use Python loops for row-wise operations.
- Do use window functions for rolling calculations.
5. Study Roadmap
- Beginner:
  - SQL syntax (SELECT, JOIN, GROUP BY)
  - SQLZoo (free exercises)
- Intermediate:
  - Window functions (OVER, PARTITION BY)
  - PostgreSQL Window Functions Guide
- Advanced:
  - Query optimization (EXPLAIN ANALYZE, indexing)
  - Time-series SQL (TimescaleDB docs)
- EDA-Specific:
  - Statistical SQL (MADlib documentation)
  - Correlation patterns (rolling, cross-asset)
6. Sample Workflow
flowchart TB
A[Load Data] --> B[Profile Data]
B --> C[Clean Data]
C --> D[Explore Relationships]
D --> E[Validate Hypotheses]
E --> F[Visualize in Python]
Example:
- Profile data → find missing values.
- Clean → filter outliers.
- Explore → calculate rolling correlations.
- Validate → test "volatility clusters on Mondays".
- Visualize → plot results in Python.
Final Thought
SQL is the tool for structured EDA—it’s faster, more scalable, and more maintainable than Python for these tasks. Master these concepts, and you’ll outperform 90% of analysts stuck in pandas.
Want a ready-to-run Docker container with PostgreSQL + sample forex data for practice? Let me know!
Here's how the SQL concepts you've presented fit into the EDA (Exploratory Data Analysis) world, organized by their relevance and application:
1. SQL Fundamentals in EDA
Data Manipulation Language (DML)
- SELECT: Core to EDA for retrieving and filtering data (e.g., SELECT * FROM sales WHERE date > '2023-01-01').
- INSERT/UPDATE/DELETE: Less common in pure EDA (used more in data preparation pipelines).
Data Definition Language (DDL)
- CREATE/ALTER: Used to set up analysis environments (e.g., creating temp tables for intermediate results).
- TRUNCATE/DROP: Rare in EDA unless resetting sandbox environments.
Data Control Language (DCL)
- GRANT/REVOKE: Relevant for team-based EDA to manage access to datasets.
Transaction Control Language (TCL)
- COMMIT/ROLLBACK: Critical for reproducible EDA to ensure query consistency.
2. Advanced SQL for Deeper EDA
Window Functions
- Ranking: RANK() OVER (PARTITION BY region ORDER BY revenue DESC) to identify top performers.
- Rolling Metrics: AVG(revenue) OVER (ORDER BY date ROWS 7 PRECEDING) for 7-day moving averages.
Common Table Expressions (CTEs)
- Break complex EDA logic into readable steps:
WITH filtered_data AS ( SELECT * FROM sales WHERE region = 'West' ) SELECT product, SUM(revenue) FROM filtered_data GROUP BY product;
JSON Handling
- Analyze semi-structured data (e.g., API responses stored in JSON columns):
SELECT json_extract(user_data, '$.demographics.age') FROM users;  -- SQLite/MySQL syntax; PostgreSQL uses user_data->'demographics'->>'age'
3. Performance Optimization for Large-Scale EDA
Indexes
- Speed up filtering on large tables:
CREATE INDEX idx_sales_date ON sales(date);
Query Planning
- Use
EXPLAIN ANALYZEto identify bottlenecks in EDA queries.
Partitioning
- Improve performance on time-series EDA:
CREATE TABLE sales (/* columns */) PARTITION BY RANGE (date);
4. SQL for Specific EDA Tasks
Data Profiling
SELECT
COUNT(*) AS row_count,
COUNT(DISTINCT product_id) AS unique_products,
AVG(price) AS avg_price,
MIN(price) AS min_price,
MAX(price) AS max_price
FROM products;
Correlation Analysis
SELECT CORR(price, units_sold) AS price_elasticity FROM sales;
Time-Series Analysis
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(revenue) AS monthly_revenue,
(SUM(revenue) - LAG(SUM(revenue)) OVER (ORDER BY DATE_TRUNC('month', order_date))) /
LAG(SUM(revenue)) OVER (ORDER BY DATE_TRUNC('month', order_date)) AS mom_growth
FROM orders
GROUP BY 1;
Outlier Detection
WITH stats AS (
SELECT
AVG(price) AS mean,
STDDEV(price) AS stddev
FROM products
)
SELECT * FROM products, stats
WHERE ABS((price - mean) / stddev) > 3; -- Z-score > 3
5. Visualization Integration
While SQL handles the analysis, tools like:
- Metabase: Connect directly to SQL databases for visualization
- Python + SQLAlchemy: Run SQL queries and visualize with Matplotlib/Seaborn
- Tableau: Direct SQL connections for dashboards
Example workflow:
# Python snippet for SQL-powered EDA
import pandas as pd
import seaborn as sns
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@localhost/mydb")  # placeholder connection string
df = pd.read_sql("""
SELECT date, SUM(revenue) AS daily_revenue
FROM sales
GROUP BY date
""", engine)
sns.lineplot(data=df, x='date', y='daily_revenue')
6. EDA Workflow with SQL
- Data Discovery: SELECT column_name, data_type FROM information_schema.columns WHERE table_name = 'sales';
- Initial Profiling: Basic stats, missing values
- Hypothesis Testing: Use SQL to validate assumptions
- Feature Engineering: Create derived columns for analysis
- Visualization Prep: Aggregate data for plotting
Key Tools for SQL-Based EDA
| Tool | Best For | Open Source |
|---|---|---|
| DBeaver | Multi-DB exploration | ✓ |
| PostgreSQL | Advanced analytics | ✓ |
| DuckDB | Embedded analytical SQL | ✓ |
| Jupyter + SQL Magic | Interactive analysis | ✓ |
When to Use SQL vs. Other Tools in EDA
| Task | Best Tool |
|---|---|
| Initial data profiling | SQL |
| Complex aggregations | SQL |
| Statistical testing | Python/R |
| Advanced visualization | Python/R/Tableau |
| Machine learning prep | SQL + Python |
Conclusion
SQL is foundational for EDA because:
- Efficiency: Processes large datasets faster than in-memory tools
- Reproducibility: Queries document the analysis steps
- Precision: Exact calculations without sampling
- Scalability: Handles TB-scale data with proper indexing
For modern EDA:
- Start with SQL for data exploration/aggregation
- Switch to Python/R for advanced statistics/ML
- Use visualization tools that connect directly to SQL databases
Would you like me to develop a specific EDA workflow for your particular dataset or industry?