Add tech_docs/database/sql_getting_started.md
# **The Ultimate SQL Getting Started Guide**

This guide takes you from absolute beginner to SQL proficiency, with a focus on practical data analysis and exploratory data analysis (EDA).

## **1. SQL Fundamentals**

### **What is SQL?**
SQL (Structured Query Language) is the standard language for interacting with relational databases. It allows you to:
- Retrieve data
- Insert, update, and delete records
- Create and modify database structures
- Perform complex calculations on data

### **Core Concepts**
1. **Databases**: Collections of structured data
2. **Tables**: Data organized in rows and columns
3. **Queries**: Commands to interact with data
4. **Schemas**: Blueprints defining database structure

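The four concepts above can be seen together in a few lines. This is a minimal sketch using Python's standard `sqlite3` module and an in-memory database; the `employees` table and its rows are made up for illustration.

```python
import sqlite3

# Database: an in-memory SQLite database (nothing is written to disk).
conn = sqlite3.connect(":memory:")

# Schema: the CREATE TABLE statement names the columns and their types.
conn.execute("""
    CREATE TABLE employees (
        id         INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        department TEXT,
        salary     REAL
    )
""")

# Table: rows are the records stored under that schema (hypothetical data).
conn.executemany(
    "INSERT INTO employees (name, department, salary) VALUES (?, ?, ?)",
    [
        ("Ada", "Engineering", 72000.0),
        ("Grace", "Engineering", 68000.0),
        ("Linus", "Support", 45000.0),
    ],
)

# Query: ask a question about the data.
high_earners = conn.execute(
    "SELECT name, salary FROM employees WHERE salary > 50000 ORDER BY salary DESC"
).fetchall()

# SQLite keeps the stored schema text in its sqlite_master catalog table.
schema_sql = conn.execute(
    "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = 'employees'"
).fetchone()[0]

print(high_earners)
print(schema_sql)
```

Other database systems expose the schema differently (for example through `information_schema`), so treat the `sqlite_master` lookup as SQLite-specific.
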
## **2. Setting Up Your SQL Environment**

### **Choose a Database System**
| Option | Best For | Installation |
|--------|----------|--------------|
| **SQLite** | Beginners, small projects | Bundled with Python (`sqlite3` module) |
| **PostgreSQL** | Production, complex queries | [Download here](https://www.postgresql.org/download/) |
| **MySQL** | Web applications | [Download here](https://dev.mysql.com/downloads/) |
| **DuckDB** | Analytical workloads | `pip install duckdb` |

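If you start with SQLite, a quick check confirms that your environment is ready. The sketch below assumes only a standard Python installation; the version threshold reflects the fact that window functions, used later in this guide, arrived in SQLite 3.25.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Ask the SQLite library itself which version Python is linked against.
version = conn.execute("SELECT sqlite_version()").fetchone()[0]

# Window functions need SQLite 3.25 or newer.
major, minor = (int(part) for part in version.split(".")[:2])
supports_window_functions = (major, minor) >= (3, 25)

print(f"SQLite {version}; window functions available: {supports_window_functions}")
```
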
### **Install a SQL Client**
- **DBeaver** (Free, multi-platform)
- **TablePlus** (Paid, excellent UI)
- **VS Code + SQLTools** (For developers)

## **3. Basic SQL Syntax**

### **SELECT Statements**
```sql
-- Basic selection
SELECT column1, column2 FROM table_name;

-- Select all columns
SELECT * FROM table_name;

-- Filtering with WHERE
SELECT * FROM table_name WHERE condition;

-- Sorting with ORDER BY
SELECT * FROM table_name ORDER BY column1 DESC;
```

### **Common Data Types**
- `INTEGER`: Whole numbers
- `FLOAT`/`REAL`: Approximate (floating-point) decimal numbers
- `VARCHAR(n)`: Text (up to n characters)
- `BOOLEAN`: True/False
- `DATE`/`TIMESTAMP`: Date and time values

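How strictly these types are enforced depends on the database. As a hedged illustration, the sketch below declares one column of each type in SQLite and asks which storage class each value actually received; the `products` table is hypothetical. SQLite uses flexible "type affinity", so the result differs from what PostgreSQL or MySQL would report.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE products (
        id        INTEGER,
        price     FLOAT,
        name      VARCHAR(50),
        in_stock  BOOLEAN,
        added_on  DATE
    )
""")
conn.execute(
    "INSERT INTO products VALUES (?, ?, ?, ?, ?)",
    (1, 19.99, "Widget", True, "2023-01-15"),
)

# typeof() reports the storage class SQLite chose for each stored value.
storage_types = conn.execute("""
    SELECT typeof(id), typeof(price), typeof(name), typeof(in_stock), typeof(added_on)
    FROM products
""").fetchone()

print(storage_types)
```

In SQLite the boolean is stored as an integer and the date as text, which is why date handling there relies on functions such as `strftime()` rather than a native date type.
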
## **4. Essential SQL Operations**

### **Filtering Data**
```sql
-- Basic conditions
SELECT * FROM employees WHERE salary > 50000;

-- Multiple conditions
SELECT * FROM products
WHERE price BETWEEN 10 AND 100
AND category = 'Electronics';

-- Pattern matching
SELECT * FROM customers
WHERE name LIKE 'J%';  -- Starts with J
```

### **Sorting and Limiting**
```sql
-- Sort by multiple columns
SELECT * FROM orders
ORDER BY order_date DESC, total_amount DESC;

-- Limit results
SELECT * FROM large_table LIMIT 100;
```

### **Aggregation Functions**
```sql
-- Basic aggregations
SELECT
    COUNT(*) AS total_orders,
    AVG(amount) AS avg_order,
    MAX(amount) AS largest_order
FROM orders;

-- GROUP BY
SELECT
    department,
    AVG(salary) AS avg_salary
FROM employees
GROUP BY department;
```

## **5. Joining Tables**

### **Join Types**
| Join Type | Description | Example |
|-----------|-------------|---------|
| **INNER JOIN** | Only matching rows | `SELECT * FROM A INNER JOIN B ON A.id = B.id` |
| **LEFT JOIN** | All from left table, matches from right | `SELECT * FROM A LEFT JOIN B ON A.id = B.id` |
| **RIGHT JOIN** | All from right table, matches from left | `SELECT * FROM A RIGHT JOIN B ON A.id = B.id` |
| **FULL JOIN** | All rows from both tables | `SELECT * FROM A FULL JOIN B ON A.id = B.id` |

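The difference between the join types is easiest to see on a tiny dataset. The following sketch, which assumes two hypothetical tables in an in-memory SQLite database, compares `INNER JOIN` and `LEFT JOIN` when one customer has no orders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total_amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 40.0), (12, 2, 99.0);
""")

# INNER JOIN: only customers that have at least one matching order.
inner_rows = conn.execute("""
    SELECT c.customer_name, o.order_id
    FROM customers c
    INNER JOIN orders o ON o.customer_id = c.customer_id
    ORDER BY c.customer_id, o.order_id
""").fetchall()

# LEFT JOIN: every customer; order columns are NULL (None) when nothing matches.
left_rows = conn.execute("""
    SELECT c.customer_name, o.order_id
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.customer_id
    ORDER BY c.customer_id, o.order_id
""").fetchall()

print(inner_rows)
print(left_rows)
```

Linus has no orders, so he is absent from the inner join and appears with `None` in the left join. A `RIGHT JOIN` is the same idea with the table roles swapped.
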
### **Practical Example**
```sql
SELECT
    o.order_id,
    c.customer_name,
    o.order_date,
    o.total_amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date > '2023-01-01'
ORDER BY o.total_amount DESC;
```

## **6. Advanced SQL Features**

### **Common Table Expressions (CTEs)**
```sql
WITH high_value_customers AS (
    SELECT customer_id, SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer_id
    HAVING SUM(amount) > 1000
)
SELECT * FROM high_value_customers;
```

### **Window Functions**
```sql
-- Running total
SELECT
    date,
    revenue,
    SUM(revenue) OVER (ORDER BY date) AS running_total
FROM daily_sales;

-- Rank products by category
SELECT
    product_name,
    category,
    price,
    RANK() OVER (PARTITION BY category ORDER BY price DESC) AS price_rank
FROM products;
```

## **7. SQL for Data Analysis**

### **Time Series Analysis**
```sql
-- Daily aggregates (DATE_TRUNC is available in PostgreSQL and DuckDB)
SELECT
    DATE_TRUNC('day', transaction_time) AS day,
    COUNT(*) AS transactions,
    SUM(amount) AS total_amount
FROM transactions
GROUP BY 1
ORDER BY 1;

-- Month-over-month growth
WITH monthly_sales AS (
    SELECT
        DATE_TRUNC('month', order_date) AS month,
        SUM(amount) AS total_sales
    FROM orders
    GROUP BY 1
)
SELECT
    month,
    total_sales,
    -- NULLIF avoids a division-by-zero error when the previous month had no sales
    (total_sales - LAG(total_sales) OVER (ORDER BY month)) /
        NULLIF(LAG(total_sales) OVER (ORDER BY month), 0) AS growth_rate
FROM monthly_sales
ORDER BY month;
```

### **Pivot Tables in SQL**
```sql
-- Using CASE statements
SELECT
    product_category,
    SUM(CASE WHEN EXTRACT(YEAR FROM order_date) = 2022 THEN amount ELSE 0 END) AS sales_2022,
    SUM(CASE WHEN EXTRACT(YEAR FROM order_date) = 2023 THEN amount ELSE 0 END) AS sales_2023
FROM orders
GROUP BY product_category;
```

## **8. Performance Optimization**

### **Indexing Strategies**
```sql
-- Create indexes
CREATE INDEX idx_customer_name ON customers(name);
CREATE INDEX idx_order_date ON orders(order_date);

-- Composite index
CREATE INDEX idx_category_price ON products(category, price);
```

### **Query Optimization Tips**
1. Use `EXPLAIN ANALYZE` to inspect query plans (in PostgreSQL it runs the query and reports actual timings)
2. Limit columns in `SELECT` (avoid `SELECT *`)
3. Filter early with `WHERE` clauses so later steps process fewer rows
4. Use appropriate join types

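To see what reading a query plan looks like without installing PostgreSQL, the sketch below uses SQLite's `EXPLAIN QUERY PLAN`, which describes the plan without running the query. It is an analogue of `EXPLAIN ANALYZE`, not the same command. The `orders` table, its rows, and the `plan_details` helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (order_date, amount) VALUES (?, ?)",
    [(f"2023-01-{day:02d}", day * 10.0) for day in range(1, 29)],
)

query = "SELECT order_id, amount FROM orders WHERE order_date = '2023-01-15'"


def plan_details(sql):
    """Return the human-readable step descriptions of SQLite's query plan."""
    # Each plan row has four columns; the last one is the description text.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


# Without an index, SQLite has to scan the whole table.
plan_before = plan_details(query)

# With an index on the filtered column, it can search the index instead.
conn.execute("CREATE INDEX idx_order_date ON orders(order_date)")
plan_after = plan_details(query)

print(plan_before)
print(plan_after)
```

The exact wording of the plan text varies between SQLite versions, but the first plan should mention a scan and the second should name `idx_order_date`.
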
## **9. Learning Resources**

### **Free Interactive Tutorials**
1. [SQLZoo](https://sqlzoo.net/)
2. [Mode Analytics SQL Tutorial](https://mode.com/sql-tutorial/)
3. [PostgreSQL Exercises](https://pgexercises.com/)

### **Books**
- "SQL for Data Analysis" by Cathy Tanimura
- "SQL Cookbook" by Anthony Molinaro

### **Practice Platforms**
- [LeetCode SQL Problems](https://leetcode.com/problemset/database/)
- [HackerRank SQL](https://www.hackerrank.com/domains/sql)

## **10. Next Steps**

1. **Install a database system** and practice daily
2. **Work with real datasets** (try [Kaggle datasets](https://www.kaggle.com/datasets))
3. **Build a portfolio project** (e.g., analyze sales data)
4. **Learn database design** (normalization, relationships)

Remember: SQL is a skill best learned by doing. Start writing queries today!

---

### **Technical Overview: SQL for EDA (Exploratory Analysis of Structured Data)**
Below is a **structured roadmap** for SQL-first EDA, covering key SQL concepts, EDA-specific queries, and tips for working efficiently.

---

## **1. Core SQL Concepts for EDA**
### **A. Foundational Operations**
| Concept | Purpose | Example Query |
|------------------|----------------------------------|----------------------------------------|
| **Filtering** | Subset data (`WHERE`, `HAVING`) | `SELECT * FROM prices WHERE asset = 'EUR_USD'` |
| **Aggregation** | Summarize data (`GROUP BY`) | `SELECT asset, AVG(close) FROM prices GROUP BY asset` |
| **Joins** | Combine tables (`INNER JOIN`) | `SELECT * FROM trades JOIN assets ON trades.asset_id = assets.id` |
| **Sorting** | Order results (`ORDER BY`) | `SELECT * FROM prices ORDER BY time DESC` |

### **B. Advanced EDA Tools**
| Concept | Purpose | Example Query |
|-----------------------|----------------------------------------------|----------------------------------------|
| **Window Functions** | Calculate rolling stats, ranks | `SELECT time, AVG(close) OVER (ORDER BY time ROWS 29 PRECEDING) FROM prices` |
| **CTEs (WITH)** | Break complex queries into steps | `WITH filtered AS (SELECT * FROM prices WHERE volume > 1000) SELECT * FROM filtered` |
| **Statistical Aggregates** | Built-in stats (`STDDEV`, `CORR`, `PERCENTILE_CONT`) | `SELECT CORR(open, close) FROM prices` |
| **Time-Series Handling** | Extract dates, resample | `SELECT DATE_TRUNC('hour', time) AS hour, AVG(close) FROM prices GROUP BY 1` |

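The window-function row above is worth trying on numbers small enough to check by hand. This sketch assumes a hypothetical four-row `prices` table in SQLite and a 3-row window instead of 30; `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW` is the explicit spelling of the `ROWS n PRECEDING` shorthand. Early rows simply average over however many rows exist so far.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (time TEXT, close REAL)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?)",
    [
        ("2023-01-01", 10.0),
        ("2023-01-02", 20.0),
        ("2023-01-03", 30.0),
        ("2023-01-04", 40.0),
    ],
)

# 3-row moving average: the current row plus the 2 rows before it.
rolling = conn.execute("""
    SELECT
        time,
        AVG(close) OVER (
            ORDER BY time
            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
        ) AS avg_3
    FROM prices
    ORDER BY time
""").fetchall()

for row in rolling:
    print(row)
```
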
---

## **2. Essential EDA Queries**
### **A. Data Profiling**
```sql
-- 1. Basic stats
SELECT
    COUNT(*) AS row_count,
    COUNT(DISTINCT asset) AS unique_assets,
    MIN(close) AS min_price,
    MAX(close) AS max_price,
    AVG(close) AS mean_price,
    STDDEV(close) AS volatility
FROM prices;

-- 2. Missing values
SELECT
    COUNT(*) - COUNT(close) AS missing_prices
FROM prices;

-- 3. Value distribution (histogram)
SELECT
    FLOOR(close / 10) * 10 AS price_bin,
    COUNT(*) AS frequency
FROM prices
GROUP BY 1
ORDER BY 1;
```

### **B. Correlation Analysis**
```sql
-- 1. Pairwise correlations
SELECT
    CORR(EUR_USD, GBP_USD) AS eur_gbp,
    CORR(EUR_USD, USD_JPY) AS eur_jpy,
    CORR(GBP_USD, USD_JPY) AS gbp_jpy
FROM hourly_rates;

-- 2. Rolling correlation over the last 30 rows
--    (30 hours of hourly data; use 719 PRECEDING for a 30-day window)
-- CORR is used as a window aggregate so the mean and standard deviation
-- are recomputed inside each window rather than over the whole table.
SELECT
    time,
    CORR(EUR_USD, GBP_USD) OVER (
        ORDER BY time
        ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
    ) AS rolling_corr
FROM hourly_rates;
```

### **C. Time-Series EDA**
```sql
-- 1. Hourly volatility patterns
SELECT
    EXTRACT(HOUR FROM time) AS hour,
    AVG(ABS(close - open)) AS avg_volatility
FROM prices
GROUP BY 1
ORDER BY 1;

-- 2. Daily returns distribution
-- FIRST()/LAST() are not standard aggregates; in PostgreSQL an ordered
-- ARRAY_AGG picks the day's first open and last close.
SELECT
    DATE_TRUNC('day', time) AS day,
    ((ARRAY_AGG(close ORDER BY time DESC))[1] - (ARRAY_AGG(open ORDER BY time))[1])
        / (ARRAY_AGG(open ORDER BY time))[1] AS daily_return
FROM prices
GROUP BY 1
ORDER BY 1;
```

### **D. Outlier Detection**
```sql
-- Z-score outliers (|Z| > 3)
WITH stats AS (
    SELECT
        AVG(close) AS mean_close,
        STDDEV(close) AS stddev_close
    FROM prices
)
SELECT
    time,
    close,
    (close - mean_close) / stddev_close AS z_score
FROM prices
CROSS JOIN stats  -- stats has exactly one row, so every price row is kept
WHERE ABS((close - mean_close) / stddev_close) > 3;
```

---

## **3. Key Optimizations**
### **A. Indexing for EDA**
```sql
-- Speed up time-series queries
CREATE INDEX idx_prices_time ON prices(time);

-- Speed up asset-specific filters
CREATE INDEX idx_prices_asset ON prices(asset);
```

### **B. Partitioning Large Tables**
```sql
-- Partition by time range (PostgreSQL)
CREATE TABLE prices (
    time TIMESTAMP,
    asset TEXT,
    close FLOAT
) PARTITION BY RANGE (time);

-- Rows can only be inserted once a matching partition exists
CREATE TABLE prices_2023 PARTITION OF prices
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
```

### **C. Materialized Views**
```sql
-- Pre-compute frequent aggregates
CREATE MATERIALIZED VIEW hourly_stats AS
SELECT
    DATE_TRUNC('hour', time) AS hour,
    AVG(close) AS avg_price,
    STDDEV(close) AS volatility
FROM prices
GROUP BY 1;

-- Refresh periodically
REFRESH MATERIALIZED VIEW hourly_stats;
```

---

## **4. Pro Tips**
### **A. Use the Right Database**
| Database | Best For |
|--------------|----------------------------------|
| **PostgreSQL** | Complex EDA, extensions (MADlib) |
| **DuckDB** | Embedded analytics, Parquet/CSV |
| **SQLite** | Lightweight, local prototyping |

### **B. Learn These Functions**
| Function | Purpose |
|-----------------------|----------------------------------|
| `DATE_TRUNC()` | Group by time intervals |
| `FIRST_VALUE()` / `LAST_VALUE()` | First/last value in a window |
| `PERCENTILE_CONT()` | Median, quantiles |
| `ROLLUP`/`CUBE` | Hierarchical aggregations |

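`FIRST_VALUE()` and `LAST_VALUE()` have one well-known trap: with an `ORDER BY` in the window, the default frame ends at the current row, so `LAST_VALUE()` just returns the current row's value unless the frame is widened. The sketch below shows the explicit frame on a hypothetical three-row `prices` table in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (asset TEXT, time TEXT, close REAL)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [
        ("EUR_USD", "2023-01-02 09:00", 1.07),
        ("EUR_USD", "2023-01-02 10:00", 1.08),
        ("EUR_USD", "2023-01-02 11:00", 1.06),
    ],
)

# The explicit frame makes both functions look at the whole partition.
rows = conn.execute("""
    SELECT
        time,
        FIRST_VALUE(close) OVER (
            PARTITION BY asset ORDER BY time
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        ) AS first_close,
        LAST_VALUE(close) OVER (
            PARTITION BY asset ORDER BY time
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        ) AS last_close
    FROM prices
    ORDER BY time
""").fetchall()

for row in rows:
    print(row)
```

Every row reports the same first close (1.07) and last close (1.06) for the asset, which is the building block for open-to-close style calculations.
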
### **C. Avoid Anti-Patterns**
- **Don’t** fetch all rows to Python for filtering (`SELECT *` → `pd.read_sql()`).
- **Do** push computations to SQL (`WHERE`, `GROUP BY` in-database).
- **Don’t** use Python loops for row-wise operations.
- **Do** use window functions for rolling calculations.

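The contrast between the first two points can be sketched side by side. The example below assumes a hypothetical `prices` table in SQLite and computes the same per-asset average twice: once by pulling every row into Python, and once by letting the database aggregate. On four rows the difference is invisible; on millions of rows the second form transfers one row per asset instead of the whole table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (asset TEXT, close REAL)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?)",
    [("EUR_USD", 1.0), ("EUR_USD", 3.0), ("GBP_USD", 2.0), ("GBP_USD", 4.0)],
)

# Anti-pattern: fetch every row, then aggregate in a Python loop.
totals = {}
for asset, close in conn.execute("SELECT asset, close FROM prices"):
    count, total = totals.get(asset, (0, 0.0))
    totals[asset] = (count + 1, total + close)
avg_in_python = {asset: total / count for asset, (count, total) in totals.items()}

# Preferred: the database aggregates and returns one row per asset.
avg_in_sql = dict(conn.execute("SELECT asset, AVG(close) FROM prices GROUP BY asset"))

print(avg_in_python)
print(avg_in_sql)
```
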
---

## **5. Study Roadmap**
1. **Beginner**:
   - SQL syntax (`SELECT`, `JOIN`, `GROUP BY`)
   - [SQLZoo](https://sqlzoo.net/) (free exercises)

2. **Intermediate**:
   - Window functions (`OVER`, `PARTITION BY`)
   - [PostgreSQL Window Functions Guide](https://www.postgresqltutorial.com/postgresql-window-function/)

3. **Advanced**:
   - Query optimization (`EXPLAIN ANALYZE`, indexing)
   - Time-series SQL ([TimescaleDB docs](https://docs.timescale.com/))

4. **EDA-Specific**:
   - Statistical SQL ([MADlib documentation](https://madlib.apache.org/))
   - Correlation patterns (rolling, cross-asset)

---

## **6. Sample Workflow**
```mermaid
flowchart TB
    A[Load Data] --> B[Profile Data]
    B --> C[Clean Data]
    C --> D[Explore Relationships]
    D --> E[Validate Hypotheses]
    E --> F[Visualize in Python]
```

**Example**:
1. Profile data → find missing values.
2. Clean → filter outliers.
3. Explore → calculate rolling correlations.
4. Validate → test "volatility clusters on Mondays".
5. Visualize → plot results in Python.

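A compressed version of the profile, clean, and validate steps might look like the sketch below. It assumes a hypothetical four-row `prices` table in SQLite, uses a simple "drop rows with a missing close" rule as the cleaning step, and checks the Monday hypothesis with `strftime('%w', ...)`, which is SQLite-specific. The sample is far too small to support a real conclusion; it only shows the shape of the queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (time TEXT, open REAL, close REAL)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [
        ("2023-01-02 09:00", 1.00, 1.04),  # Monday
        ("2023-01-03 09:00", 1.04, 1.05),  # Tuesday
        ("2023-01-04 09:00", 1.05, None),  # Wednesday, close is missing
        ("2023-01-09 09:00", 1.05, 1.01),  # Monday
    ],
)

# Profile: COUNT(column) skips NULLs, so the difference is the missing count.
missing = conn.execute("SELECT COUNT(*) - COUNT(close) FROM prices").fetchone()[0]

# Clean and validate: drop incomplete rows, then compare the average absolute
# move per weekday. strftime('%w') returns '0' (Sunday) through '6' (Saturday).
by_weekday = conn.execute("""
    SELECT
        strftime('%w', time) AS weekday,
        AVG(ABS(close - open)) AS avg_move
    FROM prices
    WHERE close IS NOT NULL
    GROUP BY weekday
    ORDER BY weekday
""").fetchall()

print(missing)
print(by_weekday)
```
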
---

### **Final Thought**
SQL is a strong fit for structured EDA: filtering, aggregating, and joining inside the database is often faster and scales further than pulling every row into pandas first. Master these concepts and you can do much of your exploratory work before the data ever leaves the database.
