medusa 65adc021aa Add tech_docs/database/sql_getting_started.md

2025-06-18 04:34:22 +00:00

The Ultimate SQL Getting Started Guide

This guide takes you from absolute beginner to working SQL proficiency, with a focus on practical data analysis and exploratory data analysis (EDA).

1. SQL Fundamentals

What is SQL?

SQL (Structured Query Language) is the standard language for interacting with relational databases. It allows you to:

Retrieve data
Insert, update, and delete records
Create and modify database structures
Perform complex calculations on data

Core Concepts

Databases: Collections of structured data
Tables: Data organized in rows and columns
Queries: Commands to interact with data
Schemas: Blueprints defining database structure

2. Setting Up Your SQL Environment

Choose a Database System

Option	Best For	Installation
SQLite	Beginners, small projects	Bundled with Python (`sqlite3` module)
PostgreSQL	Production, complex queries	Download from the official PostgreSQL site
MySQL	Web applications	Download from the official MySQL site
DuckDB	Analytical workloads	`pip install duckdb`
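
Because SQLite ships with Python's standard library, a first query needs no installation at all. The following is a minimal sketch, not a recommended project layout: the `employees` table and its rows are invented for illustration, and the database lives in memory so nothing is written to disk.

```python
import sqlite3

# In-memory database: nothing is written to disk
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, invented for this example
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ana", "Sales", 48000), ("Ben", "Engineering", 72000), ("Cy", "Sales", 55000)],
)

# Filter with WHERE and sort with ORDER BY
rows = conn.execute(
    "SELECT name, salary FROM employees WHERE salary > 50000 ORDER BY salary DESC"
).fetchall()
print(rows)
```

The same pattern (connect, execute, fetch) works for experimenting with the other queries in this guide, as long as they avoid syntax that SQLite does not support.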

Install a SQL Client

DBeaver (Free, multi-platform)
TablePlus (Paid, excellent UI)
VS Code + SQL Tools (For developers)

3. Basic SQL Syntax

SELECT Statements

-- Basic selection
SELECT column1, column2 FROM table_name;

-- Select all columns
SELECT * FROM table_name;

-- Filtering with WHERE
SELECT * FROM table_name WHERE condition;

-- Sorting with ORDER BY
SELECT * FROM table_name ORDER BY column1 DESC;

Common Data Types

INTEGER: Whole numbers
FLOAT/REAL: Approximate floating-point numbers (use NUMERIC/DECIMAL for exact values such as money)
VARCHAR(n): Text (n characters max)
BOOLEAN: True/False
DATE/TIMESTAMP: Date and time values

4. Essential SQL Operations

Filtering Data

-- Basic conditions
SELECT * FROM employees WHERE salary > 50000;

-- Multiple conditions
SELECT * FROM products 
WHERE price BETWEEN 10 AND 100 
AND category = 'Electronics';

-- Pattern matching
SELECT * FROM customers 
WHERE name LIKE 'J%'; -- Starts with J

Sorting and Limiting

-- Sort by multiple columns
SELECT * FROM orders 
ORDER BY order_date DESC, total_amount DESC;

-- Limit results (add ORDER BY if you need a deterministic subset)
SELECT * FROM large_table LIMIT 100;

Aggregation Functions

-- Basic aggregations
SELECT 
    COUNT(*) AS total_orders,
    AVG(amount) AS avg_order,
    MAX(amount) AS largest_order
FROM orders;

-- GROUP BY
SELECT 
    department, 
    AVG(salary) AS avg_salary
FROM employees
GROUP BY department;
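
To see what GROUP BY returns in practice, the sketch below runs a grouped aggregate against a tiny in-memory SQLite table. The rows are made up for illustration; only the query shape mirrors the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Ana", "Sales", 40000),
        ("Ben", "Sales", 60000),
        ("Cy", "Engineering", 90000),
    ],
)

# One output row per department; aggregates are computed within each group
avg_by_dept = conn.execute(
    """
    SELECT department, COUNT(*) AS headcount, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department
    ORDER BY department
    """
).fetchall()
print(avg_by_dept)
```

Three input rows collapse into two output rows, one per distinct `department` value.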

5. Joining Tables

Join Types

Join Type	Description	Example
INNER JOIN	Only matching rows	`SELECT * FROM A INNER JOIN B ON A.id = B.id`
LEFT JOIN	All from left table, matches from right	`SELECT * FROM A LEFT JOIN B ON A.id = B.id`
RIGHT JOIN	All from right table, matches from left	`SELECT * FROM A RIGHT JOIN B ON A.id = B.id`
FULL JOIN	All rows from both tables (not supported by MySQL)	`SELECT * FROM A FULL JOIN B ON A.id = B.id`
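
The difference between INNER JOIN and LEFT JOIN is easiest to see when one row has no match. In this sketch (hypothetical data, in-memory SQLite), the customer Ben has no orders, so he appears only in the LEFT JOIN result, with NULL in the order column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER, customer_name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, total_amount REAL);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO orders VALUES (10, 1, 25.0);
    """
)

# Same query twice; only the join keyword changes
query = """
    SELECT c.customer_name, o.order_id
    FROM customers c
    {join} orders o ON o.customer_id = c.customer_id
    ORDER BY c.customer_id
"""
inner_rows = conn.execute(query.format(join="INNER JOIN")).fetchall()
left_rows = conn.execute(query.format(join="LEFT JOIN")).fetchall()
print(inner_rows)  # only customers with at least one order
print(left_rows)   # every customer; unmatched order columns come back as None
```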

Practical Example

SELECT 
    o.order_id,
    c.customer_name,
    o.order_date,
    o.total_amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date > '2023-01-01'
ORDER BY o.total_amount DESC;

6. Advanced SQL Features

Common Table Expressions (CTEs)

WITH high_value_customers AS (
    SELECT customer_id, SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer_id
    HAVING SUM(amount) > 1000
)
SELECT * FROM high_value_customers;
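
As a quick check of how the CTE and HAVING interact, the sketch below runs the same query shape on invented order data in SQLite. HAVING filters after grouping, so customer 2 (total 300) is dropped while customer 1 qualifies only because of the sum of two orders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 700.0), (1, 600.0), (2, 300.0), (3, 1500.0)],
)

high_value = conn.execute(
    """
    WITH high_value_customers AS (
        SELECT customer_id, SUM(amount) AS total_spent
        FROM orders
        GROUP BY customer_id
        HAVING SUM(amount) > 1000   -- applied to each group's total
    )
    SELECT customer_id, total_spent
    FROM high_value_customers
    ORDER BY total_spent DESC
    """
).fetchall()
print(high_value)
```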

Window Functions

-- Running total
SELECT 
    date,
    revenue,
    SUM(revenue) OVER (ORDER BY date) AS running_total
FROM daily_sales;

-- Rank products by category
SELECT 
    product_name,
    category,
    price,
    RANK() OVER (PARTITION BY category ORDER BY price DESC) AS price_rank
FROM products;
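
The two window queries above can be tried on small made-up tables. This sketch assumes a Python build bundling SQLite 3.25 or later, which is when SQLite gained window functions. Note how RANK leaves a gap after ties: two products sharing the top price both get rank 1, and the next gets rank 3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (date TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [("2023-01-01", 100.0), ("2023-01-02", 50.0), ("2023-01-03", 25.0)],
)
conn.execute("CREATE TABLE products (product_name TEXT, category TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("Phone", "Electronics", 500.0),
        ("Tablet", "Electronics", 500.0),
        ("Cable", "Electronics", 10.0),
        ("Mug", "Home", 8.0),
    ],
)

# Running total: each row sees the sum of itself and all earlier dates
running = conn.execute(
    """
    SELECT date, revenue, SUM(revenue) OVER (ORDER BY date) AS running_total
    FROM daily_sales
    ORDER BY date
    """
).fetchall()

# Rank within each category; ties share a rank and the next rank is skipped
ranks = conn.execute(
    """
    SELECT product_name,
           RANK() OVER (PARTITION BY category ORDER BY price DESC) AS price_rank
    FROM products
    ORDER BY category, price_rank, product_name
    """
).fetchall()
print(running)
print(ranks)
```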

7. SQL for Data Analysis

Time Series Analysis

-- Daily aggregates (DATE_TRUNC is PostgreSQL/DuckDB syntax; other systems use different date functions)
SELECT 
    DATE_TRUNC('day', transaction_time) AS day,
    COUNT(*) AS transactions,
    SUM(amount) AS total_amount
FROM transactions
GROUP BY 1
ORDER BY 1;

-- Month-over-month growth
WITH monthly_sales AS (
    SELECT 
        DATE_TRUNC('month', order_date) AS month,
        SUM(amount) AS total_sales
    FROM orders
    GROUP BY 1
)
SELECT 
    month,
    total_sales,
    -- 1.0 forces non-integer division; NULLIF avoids dividing by zero
    (total_sales - LAG(total_sales) OVER (ORDER BY month)) * 1.0 / 
        NULLIF(LAG(total_sales) OVER (ORDER BY month), 0) AS growth_rate
FROM monthly_sales
ORDER BY month;
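
For local experimentation, a month-over-month calculation can be sketched in SQLite, with two assumptions: SQLite has no `DATE_TRUNC`, so `strftime('%Y-%m', ...)` stands in for it, and the bundled SQLite must be 3.25 or later for `LAG`. The order rows are invented. The first month has no predecessor, so its growth rate is NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("2023-01-05", 100.0),
        ("2023-01-20", 100.0),
        ("2023-02-10", 300.0),
        ("2023-03-01", 150.0),
    ],
)

growth = conn.execute(
    """
    WITH monthly_sales AS (
        -- strftime replaces DATE_TRUNC('month', ...) in SQLite
        SELECT strftime('%Y-%m', order_date) AS month,
               SUM(amount) AS total_sales
        FROM orders
        GROUP BY month
    )
    SELECT month,
           total_sales,
           -- NULLIF guards against dividing by a zero-sales month
           (total_sales - LAG(total_sales) OVER (ORDER BY month))
               / NULLIF(LAG(total_sales) OVER (ORDER BY month), 0) AS growth_rate
    FROM monthly_sales
    ORDER BY month
    """
).fetchall()
print(growth)
```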

Pivot Tables in SQL

-- Using CASE statements
SELECT 
    product_category,
    SUM(CASE WHEN EXTRACT(YEAR FROM order_date) = 2022 THEN amount ELSE 0 END) AS sales_2022,
    SUM(CASE WHEN EXTRACT(YEAR FROM order_date) = 2023 THEN amount ELSE 0 END) AS sales_2023
FROM orders
GROUP BY product_category;
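
The CASE-based pivot turns year values into columns. The sketch below reproduces it on invented rows in SQLite, where `strftime('%Y', ...)` is used in place of `EXTRACT(YEAR FROM ...)` because SQLite lacks EXTRACT. A category with no sales in a year gets 0 rather than NULL thanks to the `ELSE 0` branch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (product_category TEXT, order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("Books", "2022-03-01", 10.0),
        ("Books", "2023-04-01", 20.0),
        ("Toys", "2023-05-01", 5.0),
    ],
)

pivot = conn.execute(
    """
    SELECT product_category,
           SUM(CASE WHEN strftime('%Y', order_date) = '2022' THEN amount ELSE 0 END) AS sales_2022,
           SUM(CASE WHEN strftime('%Y', order_date) = '2023' THEN amount ELSE 0 END) AS sales_2023
    FROM orders
    GROUP BY product_category
    ORDER BY product_category
    """
).fetchall()
print(pivot)
```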

8. Performance Optimization

Indexing Strategies

-- Create indexes
CREATE INDEX idx_customer_name ON customers(name);
CREATE INDEX idx_order_date ON orders(order_date);

-- Composite index
CREATE INDEX idx_category_price ON products(category, price);

Query Optimization Tips

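Use EXPLAIN to inspect query plans; in PostgreSQL, EXPLAIN ANALYZE also executes the query and reports actual timings
Select only the columns you need (avoid SELECT *)
Filter early with WHERE clauses so later steps process fewer rows
Use the narrowest join type that returns the rows you need (for example, INNER JOIN when unmatched rows are not required)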
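
To watch an index change a query plan without installing anything, the sketch below uses SQLite's `EXPLAIN QUERY PLAN`, which is SQLite's counterpart to inspecting plans and not the same command as PostgreSQL's `EXPLAIN ANALYZE`. The table contents are generated for illustration, and the exact wording of the plan text varies between SQLite versions, so the code only looks at the last column of each plan row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(i, f"name{i}", "X") for i in range(1000)],
)

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable description
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT customer_id, city FROM customers WHERE name = 'name42'"

plan_before = plan(query)  # no index yet: expect a full table scan
conn.execute("CREATE INDEX idx_customer_name ON customers(name)")
plan_after = plan(query)   # the equality filter can now use the index

print(plan_before)
print(plan_after)
```

Comparing the two strings shows the plan switching from a scan of `customers` to a search that names `idx_customer_name`.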

9. Learning Resources

Free Interactive Tutorials

Books

"SQL for Data Analysis" by Cathy Tanimura
"SQL Cookbook" by Anthony Molinaro

Practice Platforms

10. Next Steps

Install a database system and practice daily
Work with real datasets (try Kaggle datasets)
Build a portfolio project (e.g., analyze sales data)
Learn database design (normalization, relationships)

Remember: SQL is a skill best learned by doing. Start writing queries today!

Technical Overview: SQL for Exploratory Data Analysis (EDA) on Structured Data

This section is a structured roadmap for SQL-first EDA, covering key SQL concepts, EDA-specific queries, and tips to maximize efficiency.

1. Core SQL Concepts for EDA

A. Foundational Operations

Concept	Purpose	Example Query
Filtering	Subset data (`WHERE`, `HAVING`)	`SELECT * FROM prices WHERE asset = 'EUR_USD'`
Aggregation	Summarize data (`GROUP BY`)	`SELECT asset, AVG(close) FROM prices GROUP BY asset`
Joins	Combine tables (`INNER JOIN`)	`SELECT * FROM trades JOIN assets ON trades.asset_id = assets.id`
Sorting	Order results (`ORDER BY`)	`SELECT * FROM prices ORDER BY time DESC`

B. Advanced EDA Tools

Concept	Purpose	Example Query
Window Functions	Calculate rolling stats, ranks	`SELECT time, AVG(close) OVER (ORDER BY time ROWS 29 PRECEDING) FROM prices`
CTEs (WITH)	Break complex queries into steps	`WITH filtered AS (SELECT * FROM prices WHERE volume > 1000) SELECT * FROM filtered`
Statistical Aggregates	Built-in stats (`STDDEV`, `CORR`, `PERCENTILE_CONT`)	`SELECT CORR(open, close) FROM prices`
Time-Series Handling	Extract dates, resample	`SELECT DATE_TRUNC('hour', time) AS hour, AVG(close) FROM prices GROUP BY 1`
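
A rolling statistic is a window aggregate with an explicit frame. The sketch below computes a 3-row rolling mean on invented prices in SQLite (3.25 or later assumed); the frame `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW` is the long form of the `ROWS n PRECEDING` shorthand in the table above. The first rows average over fewer than three values because the window is still filling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (time INTEGER, close REAL)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?)",
    [(1, 1.0), (2, 2.0), (3, 3.0), (4, 4.0), (5, 5.0)],
)

rolling = conn.execute(
    """
    SELECT time,
           AVG(close) OVER (
               ORDER BY time
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW  -- current row plus two before it
           ) AS rolling_mean
    FROM prices
    ORDER BY time
    """
).fetchall()
print(rolling)
```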

2. Essential EDA Queries

A. Data Profiling

-- 1. Basic stats
SELECT 
    COUNT(*) AS row_count,
    COUNT(DISTINCT asset) AS unique_assets,
    MIN(close) AS min_price,
    MAX(close) AS max_price,
    AVG(close) AS mean_price,
    STDDEV(close) AS volatility
FROM prices;

-- 2. Missing values
SELECT 
    COUNT(*) - COUNT(close) AS missing_prices
FROM prices;

-- 3. Value distribution (histogram)
SELECT 
    FLOOR(close / 10) * 10 AS price_bin,
    COUNT(*) AS frequency
FROM prices
GROUP BY 1
ORDER BY 1;
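
The missing-value and histogram queries can be tried on a handful of invented rows. This sketch targets SQLite, where `FLOOR` is not available in every build, so `CAST(... AS INTEGER)` is substituted; that truncates toward zero and matches FLOOR only for non-negative prices.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (asset TEXT, close REAL)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?)",
    [("A", 12.0), ("A", 18.0), ("A", 25.0), ("A", None), ("A", 31.0)],
)

# COUNT(*) counts rows; COUNT(close) skips NULLs, so the difference is the gap count
(missing_prices,) = conn.execute(
    "SELECT COUNT(*) - COUNT(close) FROM prices"
).fetchone()

# Bucket prices into bins of width 10 (assumes non-negative prices)
histogram = conn.execute(
    """
    SELECT CAST(close / 10 AS INTEGER) * 10 AS price_bin,
           COUNT(*) AS frequency
    FROM prices
    WHERE close IS NOT NULL
    GROUP BY price_bin
    ORDER BY price_bin
    """
).fetchall()
print(missing_prices, histogram)
```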

B. Correlation Analysis

-- 1. Pairwise correlations
SELECT 
    CORR(EUR_USD, GBP_USD) AS eur_gbp,
    CORR(EUR_USD, USD_JPY) AS eur_jpy,
    CORR(GBP_USD, USD_JPY) AS gbp_jpy
FROM hourly_rates;
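
Where `CORR` is unavailable (SQLite has no built-in version), the same Pearson coefficient can be assembled from plain averages: covariance divided by the product of the standard deviations. The sketch below is one way to do that, with the square root taken in Python; the data is invented so that the second series is an exact linear function of the first. This moment-based formula can lose precision on large values with small variance, so treat it as a learning aid.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hourly_rates (time INTEGER, EUR_USD REAL, GBP_USD REAL)")
conn.executemany(
    "INSERT INTO hourly_rates VALUES (?, ?, ?)",
    [(1, 1.0, 3.0), (2, 2.0, 5.0), (3, 3.0, 7.0), (4, 4.0, 9.0)],
)

# Covariance and variances from first and second moments, all computed in SQL
cov, var_x, var_y = conn.execute(
    """
    SELECT AVG(EUR_USD * GBP_USD) - AVG(EUR_USD) * AVG(GBP_USD),
           AVG(EUR_USD * EUR_USD) - AVG(EUR_USD) * AVG(EUR_USD),
           AVG(GBP_USD * GBP_USD) - AVG(GBP_USD) * AVG(GBP_USD)
    FROM hourly_rates
    """
).fetchone()

corr = cov / math.sqrt(var_x * var_y)
print(corr)
```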

-- 2. Rolling correlation over the last 30 rows (30 hours of hourly data)
-- PostgreSQL allows aggregates such as CORR to be used as window functions,
-- so mean and standard deviation are recomputed inside each window
SELECT 
    time,
    CORR(EUR_USD, GBP_USD) OVER (
        ORDER BY time ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
    ) AS rolling_corr
FROM hourly_rates;

C. Time-Series EDA

-- 1. Hourly volatility patterns
SELECT 
    EXTRACT(HOUR FROM time) AS hour,
    AVG(ABS(close - open)) AS avg_volatility
FROM prices
GROUP BY 1
ORDER BY 1;

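-- 2. Daily returns distribution
-- FIRST/LAST are not standard aggregates, so window functions pick each day's open and close
SELECT DISTINCT
    DATE_TRUNC('day', time) AS day,
    (LAST_VALUE(close) OVER w - FIRST_VALUE(open) OVER w)
        / FIRST_VALUE(open) OVER w AS daily_return
FROM prices
WINDOW w AS (PARTITION BY DATE_TRUNC('day', time) ORDER BY time
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);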

D. Outlier Detection

-- Z-score outliers (|Z| > 3)
WITH stats AS (
    SELECT 
        AVG(close) AS mean_close,
        STDDEV(close) AS stddev_close
    FROM prices
)
SELECT 
    time,
    close,
    (close - mean_close) / NULLIF(stddev_close, 0) AS z_score
FROM prices
CROSS JOIN stats  -- stats has exactly one row
WHERE ABS((close - mean_close) / NULLIF(stddev_close, 0)) > 3;
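
On a database without `STDDEV` or a square-root function (some SQLite builds), the |Z| > 3 test can be rewritten without either: squaring both sides gives (close - mean)^2 > 9 * variance. The sketch below applies that rewrite to invented data with one obvious spike. It uses the population variance, which is an assumption; PostgreSQL's `STDDEV` is the sample version, so borderline rows could differ slightly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (time INTEGER, close REAL)")
# Twenty ordinary observations and one spike at time 20
conn.executemany(
    "INSERT INTO prices VALUES (?, ?)",
    [(t, 100.0) for t in range(20)] + [(20, 200.0)],
)

outliers = conn.execute(
    """
    WITH stats AS (
        SELECT AVG(close) AS mean_close,
               AVG(close * close) - AVG(close) * AVG(close) AS var_close
        FROM prices
    )
    SELECT time, close
    FROM prices
    CROSS JOIN stats
    -- Squared form of |z| > 3, so no square root is needed
    WHERE (close - mean_close) * (close - mean_close) > 9 * var_close
    """
).fetchall()
print(outliers)
```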

3. Key Optimizations

A. Indexing for EDA

-- Speed up time-series queries
CREATE INDEX idx_prices_time ON prices(time);

-- Speed up asset-specific filters
CREATE INDEX idx_prices_asset ON prices(asset);

B. Partitioning Large Tables

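-- Partition by time range (PostgreSQL)
CREATE TABLE prices (
    time TIMESTAMP,
    asset TEXT,
    close FLOAT
) PARTITION BY RANGE (time);

-- The parent stores no rows itself; create a partition per range before inserting
-- (prices_2025 and its date range are example values)
CREATE TABLE prices_2025 PARTITION OF prices
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');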

C. Materialized Views

-- Pre-compute frequent aggregates
CREATE MATERIALIZED VIEW hourly_stats AS
SELECT 
    DATE_TRUNC('hour', time) AS hour,
    AVG(close) AS avg_price,
    STDDEV(close) AS volatility
FROM prices
GROUP BY 1;

-- Refresh periodically
REFRESH MATERIALIZED VIEW hourly_stats;

4. Pro Tips

A. Use the Right Database

Database	Best For
PostgreSQL	Complex EDA, extensions (MADlib)
DuckDB	Embedded analytics, Parquet/CSV
SQLite	Lightweight, local prototyping

B. Learn These Functions

Function	Purpose
`DATE_TRUNC()`	Group by time intervals
`FIRST_VALUE()`/`LAST_VALUE()`	First/last value in a window
`PERCENTILE_CONT()`	Median, quantiles
`ROLLUP`/`CUBE`	Hierarchical aggregations

C. Avoid Anti-Patterns

Don’t fetch all rows to Python for filtering (SELECT * → pd.read_sql()).
Do push computations to SQL (WHERE, GROUP BY in-database).
Don’t use Python loops for row-wise operations.
Do use window functions for rolling calculations.
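
The first pair of rules can be made concrete with a small comparison. In this sketch (invented data, in-memory SQLite), both approaches produce the same average, but the first transfers every row to Python while the second transfers one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (asset TEXT, close REAL)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?)",
    [("EUR_USD", 1.0 + i / 1000) for i in range(500)]
    + [("GBP_USD", 1.2 + i / 1000) for i in range(500)],
)

# Anti-pattern: pull every row, then filter and average in Python
all_rows = conn.execute("SELECT asset, close FROM prices").fetchall()
eur = [close for asset, close in all_rows if asset == "EUR_USD"]
avg_in_python = sum(eur) / len(eur)

# Preferred: the database filters and aggregates, returning a single row
(avg_in_sql,) = conn.execute(
    "SELECT AVG(close) FROM prices WHERE asset = 'EUR_USD'"
).fetchone()

print(len(all_rows), avg_in_python, avg_in_sql)
```

With a local in-memory database the cost difference is negligible; it grows with table size and with network distance to the database server.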

5. Study Roadmap

Beginner:
- SQL syntax (SELECT, JOIN, GROUP BY)
- SQLZoo (free exercises)
Intermediate:
- Window functions (OVER, PARTITION BY)
- PostgreSQL Window Functions Guide
Advanced:
- Query optimization (EXPLAIN ANALYZE, indexing)
- Time-series SQL (TimescaleDB docs)
EDA-Specific:
- Statistical SQL (MADlib documentation)
- Correlation patterns (rolling, cross-asset)

6. Sample Workflow

flowchart TB
    A[Load Data] --> B[Profile Data]
    B --> C[Clean Data]
    C --> D[Explore Relationships]
    D --> E[Validate Hypotheses]
    E --> F[Visualize in Python]

Example:

Profile data → find missing values.
Clean → filter outliers.
Explore → calculate rolling correlations.
Validate → test "volatility clusters on Mondays".
Visualize → plot results in Python.

Final Thought

SQL is a strong fit for EDA on structured data: filtering, aggregating, and windowing inside the database is often faster and more scalable than pulling raw rows into pandas first. Master these concepts, and most exploratory questions can be answered before any data leaves the database.

