Update tech_docs/database/sql_getting_started.md
@@ -452,3 +452,179 @@ flowchart TB

SQL is **the** tool for structured EDA: it is faster, more scalable, and more maintainable than Python for these tasks. Master these concepts and you'll be ahead of most analysts who never move beyond pandas.

Want a **ready-to-run Docker container** with PostgreSQL + sample forex data for practice? Let me know!

---

Here's how the SQL concepts you've presented fit into the EDA (Exploratory Data Analysis) world, organized by their relevance and application:

---

### **1. SQL Fundamentals in EDA**

#### **Data Manipulation Language (DML)**

- **SELECT**: Core to EDA for retrieving and filtering data (e.g., `SELECT * FROM sales WHERE date > '2023-01-01'`); see the sketch below.
- **INSERT/UPDATE/DELETE**: Less common in pure EDA (used more in data preparation pipelines).
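
A minimal sketch of the kind of exploratory `SELECT` meant here; the `sales` table and its columns are assumed for illustration:

```sql
-- Peek at recent rows before aggregating anything (hypothetical sales table)
SELECT date, region, product, revenue
FROM sales
WHERE date > '2023-01-01'
ORDER BY date DESC
LIMIT 100;
```

Keeping the first pass bounded with `LIMIT` makes the initial look cheap even on large tables.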

#### **Data Definition Language (DDL)**

- **CREATE/ALTER**: Used to set up analysis environments (e.g., creating temp tables for intermediate results); see the sketch below.
- **TRUNCATE/DROP**: Rare in EDA unless resetting sandbox environments.
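
A sketch of the temp-table pattern mentioned above (table and column names are assumptions for illustration):

```sql
-- Stage a filtered subset once, then run repeated EDA queries against the smaller temp table
CREATE TEMP TABLE west_sales AS
SELECT *
FROM sales
WHERE region = 'West';

SELECT product, SUM(revenue) AS total_revenue
FROM west_sales
GROUP BY product;
```

In PostgreSQL, temp tables are dropped automatically at the end of the session, so the sandbox cleans itself up without explicit `DROP` statements.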

#### **Data Control Language (DCL)**

- **GRANT/REVOKE**: Relevant for team-based EDA to manage access to datasets (see the sketch below).
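
A minimal sketch; the `analyst_ro` role and `sales` table are placeholders:

```sql
-- Give a read-only analyst role access to the dataset, and take it away again later
GRANT SELECT ON sales TO analyst_ro;
REVOKE SELECT ON sales FROM analyst_ro;
```

Read-only grants let teammates explore the data without any risk of modifying it.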

#### **Transaction Control Language (TCL)**

- **COMMIT/ROLLBACK**: Critical for reproducible EDA to ensure query consistency (see the sketch below).
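
A sketch of using a transaction as a safety net while exploring a data-cleaning idea (the `UPDATE` is purely illustrative):

```sql
BEGIN;
UPDATE sales SET revenue = 0 WHERE revenue < 0;   -- trial cleaning step
SELECT COUNT(*) FROM sales WHERE revenue = 0;     -- inspect the effect
ROLLBACK;                                          -- discard it; COMMIT would keep it
```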

---

### **2. Advanced SQL for Deeper EDA**

#### **Window Functions**

- **Ranking**: `RANK() OVER (PARTITION BY region ORDER BY revenue DESC)` to identify top performers.
- **Rolling Metrics**: `AVG(revenue) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)` for 7-day moving averages (the current row plus the six before it); both patterns are combined in the sketch below.
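
A minimal sketch combining both, assuming a `sales` table with one row per `region` and `date`:

```sql
SELECT
    region,
    date,
    revenue,
    RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS revenue_rank,
    AVG(revenue) OVER (
        PARTITION BY region
        ORDER BY date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS revenue_7d_avg
FROM sales;
```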

#### **Common Table Expressions (CTEs)**

- Break complex EDA logic into readable steps:

```sql
WITH filtered_data AS (
    SELECT * FROM sales WHERE region = 'West'
)
SELECT product, SUM(revenue) FROM filtered_data GROUP BY product;
```

#### **JSON Handling**

- Analyze semi-structured data (e.g., API responses stored in JSON columns):

```sql
-- json_extract() is MySQL/SQLite syntax; PostgreSQL uses user_data -> 'demographics' ->> 'age'
SELECT json_extract(user_data, '$.demographics.age') FROM users;
```

---

### **3. Performance Optimization for Large-Scale EDA**

#### **Indexes**

- Speed up filtering on large tables:

```sql
CREATE INDEX idx_sales_date ON sales(date);
```

#### **Query Planning**

- Use `EXPLAIN ANALYZE` to identify bottlenecks in EDA queries (see the sketch below).
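
For example, to check whether the index created above is actually used (the query itself is illustrative):

```sql
EXPLAIN ANALYZE
SELECT SUM(revenue)
FROM sales
WHERE date >= '2023-01-01';
```

In PostgreSQL output, a `Seq Scan` over a large table is the usual red flag, while an `Index Scan` (or `Bitmap Index Scan`) node shows the index is actually being used.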

#### **Partitioning**

- Improve performance on time-series EDA:

```sql
-- The partitioned parent declares the columns; data lives in range partitions attached to it
CREATE TABLE sales (date date NOT NULL, revenue numeric) PARTITION BY RANGE (date);
CREATE TABLE sales_2023 PARTITION OF sales FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
```

---

### **4. SQL for Specific EDA Tasks**

#### **Data Profiling**

```sql
SELECT
    COUNT(*) AS row_count,
    COUNT(DISTINCT product_id) AS unique_products,
    AVG(price) AS avg_price,
    MIN(price) AS min_price,
    MAX(price) AS max_price
FROM products;
```

#### **Correlation Analysis**

```sql
-- Pearson correlation between price and units sold (a quick proxy for price sensitivity)
SELECT CORR(price, units_sold) AS price_units_corr FROM sales;
```
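
A natural next step, sketched here, is computing the same coefficient per segment to see where the relationship is strongest (column names assumed):

```sql
SELECT region, CORR(price, units_sold) AS price_units_corr
FROM sales
GROUP BY region
ORDER BY price_units_corr;
```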

#### **Time-Series Analysis**

```sql
SELECT
    DATE_TRUNC('month', order_date) AS month,
    SUM(revenue) AS monthly_revenue,
    (SUM(revenue) - LAG(SUM(revenue)) OVER (ORDER BY DATE_TRUNC('month', order_date)))
        / NULLIF(LAG(SUM(revenue)) OVER (ORDER BY DATE_TRUNC('month', order_date)), 0) AS mom_growth
FROM orders
GROUP BY 1;
```

#### **Outlier Detection**

```sql
WITH stats AS (
    SELECT
        AVG(price) AS mean,
        STDDEV(price) AS stddev
    FROM products
)
SELECT * FROM products, stats
WHERE ABS((price - mean) / stddev) > 3; -- Z-score > 3
```

---

### **5. Visualization Integration**

While SQL handles the analysis, visualization typically happens in a connected tool:

- **Metabase**: Connects directly to SQL databases for visualization
- **Python + SQLAlchemy**: Run SQL queries and visualize with Matplotlib/Seaborn
- **Tableau**: Direct SQL connections for dashboards

Example workflow:

```python
# Python snippet for SQL-powered EDA
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

# Placeholder connection string; point it at your own database
engine = create_engine("postgresql://user:password@localhost:5432/analytics")

df = pd.read_sql("""
    SELECT date, SUM(revenue) AS daily_revenue
    FROM sales
    GROUP BY date
""", engine)

sns.lineplot(data=df, x='date', y='daily_revenue')
plt.show()
```

---

### **6. EDA Workflow with SQL**

1. **Data Discovery**:

   ```sql
   SELECT column_name, data_type
   FROM information_schema.columns
   WHERE table_name = 'sales';
   ```

2. **Initial Profiling**: Basic stats and missing values (see the sketch after this list)
3. **Hypothesis Testing**: Use SQL to validate assumptions
4. **Feature Engineering**: Create derived columns for analysis
5. **Visualization Prep**: Aggregate data for plotting
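
A minimal sketch for step 2, assuming the `sales` table has `customer_id` and `revenue` columns (names are illustrative):

```sql
-- COUNT(col) ignores NULLs, so the difference from COUNT(*) is the number of missing values
SELECT
    COUNT(*)                      AS row_count,
    COUNT(*) - COUNT(customer_id) AS missing_customer_id,
    COUNT(*) - COUNT(revenue)     AS missing_revenue,
    MIN(date)                     AS first_date,
    MAX(date)                     AS last_date
FROM sales;
```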

---

### **Key Tools for SQL-Based EDA**

| Tool | Best For | Open Source |
|------|----------|-------------|
| **DBeaver** | Multi-DB exploration | ✓ |
| **PostgreSQL** | Advanced analytics | ✓ |
| **DuckDB** | Embedded analytical SQL | ✓ |
| **Jupyter + SQL Magic** | Interactive analysis | ✓ |

---

### **When to Use SQL vs. Other Tools in EDA**

| Task | Best Tool |
|------|----------|
| Initial data profiling | SQL |
| Complex aggregations | SQL |
| Statistical testing | Python/R |
| Advanced visualization | Python/R/Tableau |
| Machine learning prep | SQL + Python |

---

### **Conclusion**

SQL is foundational for EDA because:

1. **Efficiency**: Pushes computation to the database instead of pulling full datasets into local memory
2. **Reproducibility**: Queries document the analysis steps
3. **Precision**: Exact calculations without sampling
4. **Scalability**: Handles TB-scale data with proper indexing

For modern EDA:

- Start with SQL for data exploration/aggregation
- Switch to Python/R for advanced statistics/ML
- Use visualization tools that connect directly to SQL databases

Would you like me to develop a specific EDA workflow for your particular dataset or industry?