Update tech_docs/database/sql_getting_started.md

2025-06-18 04:37:56 +00:00
parent 65adc021aa
commit ae21e7227f

### **Final Thought**
SQL is **the** tool for structured EDA: it's faster, more scalable, and more maintainable than Python for these tasks. Master these concepts, and you'll outperform 90% of analysts stuck in pandas.
Want a **ready-to-run Docker container** with PostgreSQL + sample forex data for practice? Let me know!
---
Here's how the SQL concepts you've presented fit into the EDA (Exploratory Data Analysis) world, organized by their relevance and application:
---
### **1. SQL Fundamentals in EDA**
#### **Data Manipulation Language (DML)**
- **SELECT**: Core to EDA for retrieving and filtering data (e.g., `SELECT * FROM sales WHERE date > '2023-01-01'`).
- **INSERT/UPDATE/DELETE**: Less common in pure EDA (used more in data preparation pipelines).
#### **Data Definition Language (DDL)**
- **CREATE/ALTER**: Used to set up analysis environments (e.g., creating temp tables for intermediate results).
- **TRUNCATE/DROP**: Rare in EDA unless resetting sandbox environments.
#### **Data Control Language (DCL)**
- **GRANT/REVOKE**: Relevant for team-based EDA to manage access to datasets.
#### **Transaction Control Language (TCL)**
- **COMMIT/ROLLBACK**: Critical for reproducible EDA to ensure query consistency.
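The four statement families above can be exercised end-to-end in a few lines. A minimal sketch, using Python's built-in `sqlite3` as a stand-in for a real analytics database (the table and values are invented for illustration):

```python
import sqlite3

# In-memory database as a stand-in for a real analytics warehouse
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: set up a sandbox table for analysis
cur.execute("CREATE TABLE sales (date TEXT, region TEXT, revenue REAL)")

# DML: load a few rows, then filter with SELECT
cur.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("2023-01-02", "West", 100.0), ("2022-12-30", "East", 80.0)],
)
conn.commit()  # TCL: make the inserts durable

rows = cur.execute(
    "SELECT region, revenue FROM sales WHERE date > '2023-01-01'"
).fetchall()
print(rows)  # only the 2023 row survives the filter
```

DCL (`GRANT`/`REVOKE`) is omitted here because SQLite has no user model; in PostgreSQL the same session would typically start from a role with read-only access.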
---
### **2. Advanced SQL for Deeper EDA**
#### **Window Functions**
- **Ranking**: `RANK() OVER (PARTITION BY region ORDER BY revenue DESC)` to identify top performers.
- **Rolling Metrics**: `AVG(revenue) OVER (ORDER BY date ROWS 7 PRECEDING)` for 7-day moving averages.
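The ranking pattern above runs anywhere window functions are available (SQLite ≥ 3.25, PostgreSQL, etc.). A small sketch with invented sample data, showing `RANK()` restarting within each region:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, rep TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("West", "a", 300), ("West", "b", 500),
     ("East", "c", 200), ("East", "d", 400)],
)

# Rank reps by revenue within each region (rank restarts per partition)
rows = conn.execute("""
    SELECT region, rep,
           RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS rnk
    FROM sales
    ORDER BY region, rnk
""").fetchall()
print(rows)  # d and b are the top performers of their regions
```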
#### **Common Table Expressions (CTEs)**
- Break complex EDA logic into readable steps:
```sql
WITH filtered_data AS (
    SELECT * FROM sales WHERE region = 'West'
)
SELECT product, SUM(revenue) FROM filtered_data GROUP BY product;
```
#### **JSON Handling**
- Analyze semi-structured data (e.g., API responses stored in JSON columns):
```sql
SELECT json_extract(user_data, '$.demographics.age') FROM users;
```
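Note that JSON function names vary by engine: `json_extract` is the SQLite/MySQL spelling, while PostgreSQL uses the `->`/`->>` operators or `jsonb_extract_path`. A runnable SQLite sketch of the query above (sample document invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_data TEXT)")
conn.execute(
    "INSERT INTO users VALUES ('{\"demographics\": {\"age\": 34}}')"
)

# json_extract navigates the document with a JSONPath-style expression
age = conn.execute(
    "SELECT json_extract(user_data, '$.demographics.age') FROM users"
).fetchone()[0]
print(age)  # 34
```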
---
### **3. Performance Optimization for Large-Scale EDA**
#### **Indexes**
- Speed up filtering on large tables:
```sql
CREATE INDEX idx_sales_date ON sales(date);
```
#### **Query Planning**
- Use `EXPLAIN ANALYZE` to identify bottlenecks in EDA queries.
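`EXPLAIN ANALYZE` is PostgreSQL's form (it executes the query and reports actual timings); SQLite's closest analog is `EXPLAIN QUERY PLAN`. A quick sketch checking that the index created above is actually picked up by the planner:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (date TEXT, revenue REAL)")
conn.execute("CREATE INDEX idx_sales_date ON sales(date)")

# Each plan row's last column is a human-readable step description
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM sales WHERE date = '2023-01-01'"
).fetchall()
for step in plan:
    print(step[-1])  # expect a SEARCH step that names idx_sales_date
```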
#### **Partitioning**
- Improve performance on time-series EDA:
```sql
CREATE TABLE sales (
    date DATE NOT NULL,
    revenue NUMERIC
) PARTITION BY RANGE (date);
```
---
### **4. SQL for Specific EDA Tasks**
#### **Data Profiling**
```sql
SELECT
COUNT(*) AS row_count,
COUNT(DISTINCT product_id) AS unique_products,
AVG(price) AS avg_price,
MIN(price) AS min_price,
MAX(price) AS max_price
FROM products;
```
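The profiling query above is portable enough to test directly. A sketch against a throwaway SQLite table with invented rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [(1, 10.0), (2, 20.0), (1, 30.0)],  # note product_id 1 repeats
)

# One pass over the table yields the basic profile
row = conn.execute("""
    SELECT COUNT(*), COUNT(DISTINCT product_id),
           AVG(price), MIN(price), MAX(price)
    FROM products
""").fetchone()
print(row)  # (3, 2, 20.0, 10.0, 30.0)
```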
#### **Correlation Analysis**
```sql
SELECT CORR(price, units_sold) AS price_elasticity FROM sales;
```
#### **Time-Series Analysis**
```sql
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(revenue) AS monthly_revenue,
    (SUM(revenue) - LAG(SUM(revenue)) OVER (ORDER BY DATE_TRUNC('month', order_date))) /
        LAG(SUM(revenue)) OVER (ORDER BY DATE_TRUNC('month', order_date)) AS mom_growth
FROM orders
GROUP BY 1;
```
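`DATE_TRUNC` is PostgreSQL-specific, but the `LAG`-based growth calculation itself is portable. A sketch that pre-aggregates months into a small SQLite table (values invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (month TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("2023-01", 100.0), ("2023-02", 120.0), ("2023-03", 90.0)],
)

# LAG pulls the previous month's revenue; the first month has no prior
rows = conn.execute("""
    SELECT month,
           (revenue - LAG(revenue) OVER (ORDER BY month))
             / LAG(revenue) OVER (ORDER BY month) AS mom_growth
    FROM orders
""").fetchall()
print(rows)  # growth is NULL, then +20%, then -25%
```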
#### **Outlier Detection**
```sql
WITH stats AS (
SELECT
AVG(price) AS mean,
STDDEV(price) AS stddev
FROM products
)
SELECT * FROM products, stats
WHERE ABS((price - mean) / stddev) > 3; -- Z-score > 3
```
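SQLite lacks `STDDEV`, so a quick way to sanity-check the z-score logic above is to pull the prices into Python. (Note: PostgreSQL's `STDDEV` is the *sample* deviation; `pstdev` below is the population version, matching `STDDEV_POP` — the choice rarely changes which points clear a z > 3 cutoff.)

```python
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (price REAL)")
prices = [10.0] * 20 + [500.0]  # one obvious outlier
conn.executemany("INSERT INTO products VALUES (?)", [(p,) for p in prices])

values = [r[0] for r in conn.execute("SELECT price FROM products")]
mean = statistics.mean(values)
stddev = statistics.pstdev(values)  # population stddev

# Same z-score filter as the SQL version
outliers = [p for p in values if abs((p - mean) / stddev) > 3]
print(outliers)  # [500.0]
```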
---
### **5. Visualization Integration**
While SQL handles the analysis, visualization typically happens in tools that connect to the database directly:
- **Metabase**: Connect directly to SQL databases for visualization
- **Python + SQLAlchemy**: Run SQL queries and visualize with Matplotlib/Seaborn
- **Tableau**: Direct SQL connections for dashboards
Example workflow:
```python
# Python snippet for SQL-powered EDA
import pandas as pd
import seaborn as sns
from sqlalchemy import create_engine

# Connection string is a placeholder; point it at your own database
engine = create_engine("postgresql://user:password@localhost:5432/analytics")
df = pd.read_sql("""
    SELECT date, SUM(revenue) AS daily_revenue
    FROM sales
    GROUP BY date
""", engine)
sns.lineplot(data=df, x='date', y='daily_revenue')
```
---
### **6. EDA Workflow with SQL**
1. **Data Discovery**:
```sql
SELECT column_name, data_type FROM information_schema.columns
WHERE table_name = 'sales';
```
2. **Initial Profiling**: Basic stats, missing values
3. **Hypothesis Testing**: Use SQL to validate assumptions
4. **Feature Engineering**: Create derived columns for analysis
5. **Visualization Prep**: Aggregate data for plotting
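Step 1 relies on `information_schema`, which SQLite doesn't implement; its analog is `PRAGMA table_info`. A sketch of the discovery step (table invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (date TEXT, region TEXT, revenue REAL)")

# SQLite's analog of information_schema.columns:
# each row is (cid, name, type, notnull, default, pk)
cols = conn.execute("PRAGMA table_info(sales)").fetchall()
schema = [(c[1], c[2]) for c in cols]  # (name, declared type) pairs
print(schema)
```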
---
### **Key Tools for SQL-Based EDA**
| Tool | Best For | Open Source |
|------|----------|-------------|
| **DBeaver** | Multi-DB exploration | ✓ |
| **PostgreSQL** | Advanced analytics | ✓ |
| **DuckDB** | Embedded analytical SQL | ✓ |
| **Jupyter + SQL Magic** | Interactive analysis | ✓ |
---
### **When to Use SQL vs. Other Tools in EDA**
| Task | Best Tool |
|------|----------|
| Initial data profiling | SQL |
| Complex aggregations | SQL |
| Statistical testing | Python/R |
| Advanced visualization | Python/R/Tableau |
| Machine learning prep | SQL + Python |
---
### **Conclusion**
SQL is foundational for EDA because:
1. **Efficiency**: Processes large datasets faster than in-memory tools
2. **Reproducibility**: Queries document the analysis steps
3. **Precision**: Exact calculations without sampling
4. **Scalability**: Handles TB-scale data with proper indexing
For modern EDA:
- Start with SQL for data exploration/aggregation
- Switch to Python/R for advanced statistics/ML
- Use visualization tools that connect directly to SQL databases
Would you like me to develop a specific EDA workflow for your particular dataset or industry?