Update tech_docs/database/sql_getting_started.md
@@ -451,4 +451,180 @@ flowchart TB
### **Final Thought**
SQL is a strong default for structured EDA: for filtering, joining, and aggregating data that already lives in a database, it is typically faster, more scalable, and easier to maintain than doing the same work in pandas.

Want a **ready-to-run Docker container** with PostgreSQL + sample forex data for practice? Let me know!

---

Here's how these SQL concepts fit into exploratory data analysis (EDA), organized by relevance and application:

---

### **1. SQL Fundamentals in EDA**
#### **Data Manipulation Language (DML)**
- **SELECT**: Core to EDA for retrieving and filtering data (e.g., `SELECT * FROM sales WHERE date > '2023-01-01'`).
- **INSERT/UPDATE/DELETE**: Less common in pure EDA (used more in data preparation pipelines).

#### **Data Definition Language (DDL)**
- **CREATE/ALTER**: Used to set up analysis environments (e.g., creating temp tables for intermediate results).
- **TRUNCATE/DROP**: Rare in EDA unless resetting sandbox environments.

#### **Data Control Language (DCL)**
- **GRANT/REVOKE**: Relevant for team-based EDA to manage access to datasets.

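#### **Transaction Control Language (TCL)**
- **COMMIT/ROLLBACK**: Useful when EDA involves writes (e.g., cleaning rows in a sandbox table). Wrapping the changes in a transaction lets you `ROLLBACK` to a known state, so later queries run against consistent data.
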
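
A minimal sketch of the rollback pattern, using Python's built-in `sqlite3` with an in-memory database as a stand-in sandbox. The `sales` table and its values are hypothetical; the same `BEGIN` / `ROLLBACK` flow applies in PostgreSQL.

```python
import sqlite3

# In-memory SQLite database as a throwaway sandbox (hypothetical data).
# isolation_level=None puts the connection in autocommit mode, so
# transactions are controlled explicitly with BEGIN / ROLLBACK.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, revenue REAL)")
conn.executemany(
    "INSERT INTO sales (revenue) VALUES (?)",
    [(100.0,), (250.0,), (-40.0,)],
)


def total_revenue(connection):
    """Return SUM(revenue) over the whole sales table."""
    return connection.execute("SELECT SUM(revenue) FROM sales").fetchone()[0]


before = total_revenue(conn)

# Try an exploratory cleanup inside a transaction...
conn.execute("BEGIN")
conn.execute("DELETE FROM sales WHERE revenue < 0")
inside = total_revenue(conn)

# ...then undo it, restoring the table to its original state.
conn.execute("ROLLBACK")
after = total_revenue(conn)

print(before, inside, after)
```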

---

### **2. Advanced SQL for Deeper EDA**
#### **Window Functions**
- **Ranking**: `RANK() OVER (PARTITION BY region ORDER BY revenue DESC)` to identify top performers.
- **Rolling Metrics**: `AVG(revenue) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)` for 7-day moving averages (the current row plus the six before it, assuming one row per day).

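
The two window-function patterns above can be tried end to end with Python's built-in `sqlite3` (window functions need SQLite 3.25 or newer). This is a sketch on made-up data: the `sales` table, its columns, and the values are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, date TEXT, revenue REAL)")

# Hypothetical data: ten days per region, one row per day.
rows = [("West", f"2023-01-{d:02d}", float(d * 10)) for d in range(1, 11)]
rows += [("East", f"2023-01-{d:02d}", float(100 - d)) for d in range(1, 11)]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Ranking: best revenue day within each region gets rank 1.
ranked = conn.execute("""
    SELECT region, date, revenue,
           RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS revenue_rank
    FROM sales
""").fetchall()
top_per_region = {region: date for region, date, _, rank in ranked if rank == 1}

# Rolling metric: 7-row moving average (current row + 6 preceding rows).
rolling = conn.execute("""
    SELECT date, revenue,
           AVG(revenue) OVER (
               ORDER BY date
               ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
           ) AS avg_7d
    FROM sales
    WHERE region = 'West'
    ORDER BY date
""").fetchall()

print(top_per_region)
print(rolling[6])  # first row whose window holds a full 7 days
```

Note that the first six rows of `rolling` average over fewer than seven days, because the frame simply shrinks at the start of the series.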
#### **Common Table Expressions (CTEs)**
- Break complex EDA logic into readable steps:
  ```sql
  WITH filtered_data AS (
    SELECT * FROM sales WHERE region = 'West'
  )
  SELECT product, SUM(revenue) FROM filtered_data GROUP BY product;
  ```

#### **JSON Handling**
- Analyze semi-structured data (e.g., API responses stored in JSON columns):
  ```sql
  -- SQLite/MySQL syntax; in PostgreSQL use: user_data -> 'demographics' ->> 'age'
  SELECT json_extract(user_data, '$.demographics.age') FROM users;
  ```


---

### **3. Performance Optimization for Large-Scale EDA**
#### **Indexes**
- Speed up filtering on large tables:
  ```sql
  CREATE INDEX idx_sales_date ON sales(date);
  ```

#### **Query Planning**
- Use `EXPLAIN ANALYZE` to identify bottlenecks in EDA queries. In PostgreSQL it executes the query and reports the actual plan and timings, which shows, for example, a full table scan where an index was expected.

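
To see the effect of an index on a query plan without a PostgreSQL server, here is a small sketch using Python's built-in `sqlite3`. SQLite's counterpart is `EXPLAIN QUERY PLAN`, which only shows the plan and does not time the query. The table contents and the `plan` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, date TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales (date, revenue) VALUES (?, ?)",
    [(f"2023-{m:02d}-15", float(m)) for m in range(1, 13)],  # made-up rows
)

QUERY = """
    SELECT SUM(revenue) FROM sales
    WHERE date >= '2023-06-01' AND date < '2023-07-01'
"""


def plan(connection, sql):
    """Return the human-readable 'detail' column of SQLite's query plan."""
    return [row[3] for row in connection.execute("EXPLAIN QUERY PLAN " + sql)]


plan_before = plan(conn, QUERY)  # no usable index yet: a full table scan
conn.execute("CREATE INDEX idx_sales_date ON sales(date)")
plan_after = plan(conn, QUERY)   # the planner can now search via the index

print(plan_before)
print(plan_after)
```

The exact wording of the plan text varies between SQLite versions, so compare the two outputs rather than relying on a specific string.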
#### **Partitioning**
- Improve performance on time-series EDA:
  ```sql
  -- PostgreSQL syntax; the column list is illustrative
  CREATE TABLE sales (date date NOT NULL, revenue numeric) PARTITION BY RANGE (date);
  ```


---

### **4. SQL for Specific EDA Tasks**
#### **Data Profiling**
```sql
SELECT
  COUNT(*) AS row_count,
  COUNT(DISTINCT product_id) AS unique_products,
  AVG(price) AS avg_price,
  MIN(price) AS min_price,
  MAX(price) AS max_price
FROM products;
```

#### **Correlation Analysis**
```sql
-- Pearson correlation between price and units sold (not an elasticity estimate)
SELECT CORR(price, units_sold) AS price_units_corr FROM sales;
```

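
`CORR` is not available in every engine (SQLite, for one, lacks it). As a sketch of what it computes, the Pearson coefficient can be assembled from plain aggregates: r = (n·Σxy − Σx·Σy) / sqrt((n·Σx² − (Σx)²) · (n·Σy² − (Σy)²)). The `pearson_r` helper and the sample rows below are hypothetical.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (price REAL, units_sold REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    # Made-up data with a perfectly linear negative relationship.
    [(1.0, 10.0), (2.0, 8.0), (3.0, 6.0), (4.0, 4.0), (5.0, 2.0)],
)


def pearson_r(connection):
    """Pearson correlation of price vs. units_sold from SQL aggregates."""
    n, sx, sy, sxy, sxx, syy = connection.execute("""
        SELECT COUNT(*),
               SUM(price),
               SUM(units_sold),
               SUM(price * units_sold),
               SUM(price * price),
               SUM(units_sold * units_sold)
        FROM sales
    """).fetchone()
    numerator = n * sxy - sx * sy
    # Square root is taken in Python, since SQLite's math functions are optional.
    denominator = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return numerator / denominator


r = pearson_r(conn)
print(r)
```

The sketch does not guard against a zero denominator, which occurs when either column is constant.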
#### **Time-Series Analysis**
```sql
SELECT
  DATE_TRUNC('month', order_date) AS month,
  SUM(revenue) AS monthly_revenue,
  (SUM(revenue) - LAG(SUM(revenue)) OVER (ORDER BY DATE_TRUNC('month', order_date))) /
    NULLIF(LAG(SUM(revenue)) OVER (ORDER BY DATE_TRUNC('month', order_date)), 0) AS mom_growth
FROM orders
GROUP BY 1;
```

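
The same month-over-month calculation can be checked on a toy table with Python's built-in `sqlite3`. This is a sketch under assumptions: SQLite has no `DATE_TRUNC`, so `strftime('%Y-%m', ...)` stands in for it, and the `orders` rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("2023-01-05", 100.0),
        ("2023-01-20", 100.0),
        ("2023-02-10", 300.0),
        ("2023-03-01", 150.0),
    ],
)

# Aggregate to months first, then compare each month with the previous one.
mom = conn.execute("""
    WITH monthly AS (
        SELECT strftime('%Y-%m', order_date) AS month,
               SUM(revenue) AS monthly_revenue
        FROM orders
        GROUP BY month
    )
    SELECT month,
           monthly_revenue,
           (monthly_revenue - LAG(monthly_revenue) OVER (ORDER BY month))
               / NULLIF(LAG(monthly_revenue) OVER (ORDER BY month), 0) AS mom_growth
    FROM monthly
    ORDER BY month
""").fetchall()

for row in mom:
    print(row)
```

The first month has no predecessor, so its `mom_growth` is `NULL` (`None` in Python); `NULLIF` likewise turns a zero-revenue previous month into `NULL` instead of a division error.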
#### **Outlier Detection**
```sql
WITH stats AS (
  SELECT
    AVG(price) AS mean,
    STDDEV(price) AS stddev
  FROM products
)
SELECT * FROM products, stats
WHERE ABS((price - mean) / stddev) > 3; -- |z-score| > 3
```

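
Where `STDDEV` is unavailable (SQLite, for instance), the same z-score filter can be sketched by comparing squared deviations against 9 × variance, which avoids the square root entirely. Note the assumption: this uses the population variance, E[x²] − (E[x])², whereas PostgreSQL's `STDDEV` is the sample standard deviation, so borderline rows may differ. The `products` data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INTEGER PRIMARY KEY, price REAL)")

# Hypothetical prices: twenty ordinary items and one extreme value.
prices = [10.0] * 20 + [1000.0]
conn.executemany("INSERT INTO products (price) VALUES (?)", [(p,) for p in prices])

# |z| > 3 is equivalent to (price - mean)^2 > 9 * variance.
outliers = conn.execute("""
    WITH stats AS (
        SELECT AVG(price) AS mean,
               AVG(price * price) - AVG(price) * AVG(price) AS variance
        FROM products
    )
    SELECT product_id, price
    FROM products, stats
    WHERE (price - mean) * (price - mean) > 9 * variance
""").fetchall()

print(outliers)
```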

---

### **5. Visualization Integration**
SQL handles the analysis; visualization is usually handed off to tools such as:
- **Metabase**: Connect directly to SQL databases for visualization
- **Python + SQLAlchemy**: Run SQL queries and visualize with Matplotlib/Seaborn
- **Tableau**: Direct SQL connections for dashboards

Example workflow:
```python
# Python snippet for SQL-powered EDA
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sqlalchemy import create_engine

# Placeholder connection URL; replace with your own database credentials
engine = create_engine("postgresql://user:password@localhost:5432/analytics")

df = pd.read_sql("""
    SELECT date, SUM(revenue) AS daily_revenue
    FROM sales
    GROUP BY date
    ORDER BY date
""", engine)

sns.lineplot(data=df, x='date', y='daily_revenue')
plt.show()
```


---

### **6. EDA Workflow with SQL**
1. **Data Discovery**:
   ```sql
   SELECT column_name, data_type FROM information_schema.columns
   WHERE table_name = 'sales';
   ```
2. **Initial Profiling**: Basic stats, missing values
3. **Hypothesis Testing**: Use SQL to validate assumptions
4. **Feature Engineering**: Create derived columns for analysis
5. **Visualization Prep**: Aggregate data for plotting

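
Steps 1 and 2 of the workflow above can be sketched together: discover the columns, then count missing values per column. The example uses Python's built-in `sqlite3`, where `PRAGMA table_info` plays the role of `information_schema.columns`; the `sales` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (date TEXT, region TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2023-01-01", "West", 100.0),
        ("2023-01-02", None, 250.0),
        ("2023-01-03", "East", None),
        ("2023-01-04", None, None),
    ],
)

# Step 1 (data discovery): column names come from the schema itself.
# Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
columns = [row[1] for row in conn.execute("PRAGMA table_info(sales)")]

# Step 2 (initial profiling): COUNT(*) counts all rows, COUNT(col) skips
# NULLs, so the difference is the number of missing values in that column.
null_counts = {}
for col in columns:
    # Interpolating identifiers is acceptable here only because the names
    # were read from the schema, not from user input.
    query = f'SELECT COUNT(*) - COUNT("{col}") FROM sales'
    null_counts[col] = conn.execute(query).fetchone()[0]

print(null_counts)
```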

---

### **Key Tools for SQL-Based EDA**
| Tool | Best For | Open Source |
|------|----------|-------------|
| **DBeaver** | Multi-DB exploration | ✓ |
| **PostgreSQL** | Advanced analytics | ✓ |
| **DuckDB** | Embedded analytical SQL | ✓ |
| **Jupyter + SQL Magic** | Interactive analysis | ✓ |

---

### **When to Use SQL vs. Other Tools in EDA**
| Task | Best Tool |
|------|----------|
| Initial data profiling | SQL |
| Complex aggregations | SQL |
| Statistical testing | Python/R |
| Advanced visualization | Python/R/Tableau |
| Machine learning prep | SQL + Python |

---

### **Conclusion**
SQL is foundational for EDA because:
1. **Efficiency**: Filters and aggregates inside the database, so large tables need not be loaded into memory first
2. **Reproducibility**: Queries document the analysis steps
3. **Precision**: Exact calculations over the full dataset, without sampling
4. **Scalability**: Handles TB-scale data with proper indexing and partitioning

For modern EDA:
- Start with SQL for data exploration/aggregation
- Switch to Python/R for advanced statistics/ML
- Use visualization tools that connect directly to SQL databases

Would you like me to develop a specific EDA workflow for your particular dataset or industry?