Update projects/getting_started_ml.md

2024-06-07 12:42:55 +00:00
parent 34b7d5e9a5
commit ac951648cd
1 changed files with 303 additions and 1 deletions
--- a/projects/getting_started_ml.md
+++ b/projects/getting_started_ml.md
@@ -570,3 +570,305 @@ def split_data(df, target, test_size=0.2, val_size=0.1):
 - **Function: split_data:** A structured approach to splitting the data into training, validation, and test sets, which is essential for reliable machine learning model development.

 By focusing on these streamlined and well-defined steps, organizations can efficiently handle the train-test split process, ensuring that their models are well-trained and evaluated without the need for overly complex procedures. This approach balances practicality with the need for robust model development.
+
+---
+
+### Model Selection and Training: Functions Overview
+
+#### 1. Define the Problem
+
+- **Objective:** Clearly state the goal of the prediction.
+- **Target Variable:** Identify the dependent variable to be predicted.
+- **Features:** List the independent variables to be used for prediction.
+
+#### 2. Data Collection and Preparation
+
+- **Data Collection:** Continuously gather sensor data via MQTT and store it in TimescaleDB.
+
+- **Preprocess Data:**
+  - **Handle Missing Values:** Impute missing values (e.g., with the column mean or median, or by forward-filling short gaps in the time series).
+  - **Remove Outliers:** Identify outliers (e.g., readings far outside the interquartile range) and remove or clip them.
+
+- **Feature Engineering:**
+  - **Lag Features:** Create lagged versions of features to capture temporal dependencies.
+  - **Rolling Statistics:** Calculate rolling means, standard deviations, and other statistics over a specified window.
+  - **Time-Based Features:** Extract time-related features such as hour of the day and day of the week.
+  - **Interaction Terms:** Generate interaction terms between features to capture combined effects.
+
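The missing-value and outlier steps above can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing: the helper name `clip_outliers_iqr`, the 1.5 × IQR rule, and the sample readings are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd


def clip_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Hypothetical temperature readings with one gap (NaN) and one sensor spike (95.0)
temperature = pd.Series([21.0, 21.5, np.nan, 22.0, 95.0, 21.8, 22.1])

# Impute with the median, which is less sensitive to the spike than the mean
filled = temperature.fillna(temperature.median())

# Clip the spike back into the plausible range instead of dropping the row
cleaned = clip_outliers_iqr(filled)
print(cleaned.tolist())
```

Clipping keeps the row count unchanged, which matters when lag and rolling features are computed afterwards; dropping rows would be an equally valid choice if gaps in the series are acceptable.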
+#### 3. Exploratory Data Analysis (EDA)
+
+- **Visualize Data:**
+  - **Scatter Plots:** Visualize relationships between features and the target variable.
+  - **Histograms:** Understand the distribution of individual features.
+  - **Correlation Matrix:** Identify correlations between features and the target variable.
+
+- **Summary Statistics:**
+  - **Mean, Median, Mode:** Calculate central tendency measures.
+  - **Standard Deviation, Variance:** Measure the spread of the data.
+
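As a rough sketch of the summary-statistics and correlation steps above, the snippet below uses pandas on synthetic data. The column names `temperature`, `humidity`, and `people_count` follow the room-occupancy example used in this guide; the generated values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for sensor data (values are invented for this sketch)
rng = np.random.default_rng(0)
n = 200
temperature = rng.normal(22, 2, n)
humidity = rng.normal(45, 5, n)
people_count = np.round(0.8 * (temperature - 18) + rng.normal(0, 1, n)).clip(min=0)
df = pd.DataFrame(
    {"temperature": temperature, "humidity": humidity, "people_count": people_count}
)

# Central tendency and spread per column
summary = df.describe().T[["mean", "50%", "std"]].rename(columns={"50%": "median"})
summary["variance"] = df.var()
print(summary)

# Mode is most meaningful for the discrete target
print("Most common people_count:", df["people_count"].mode().iloc[0])

# Correlation of each feature with the target, strongest first
target_corr = df.corr()["people_count"].drop("people_count").sort_values(ascending=False)
print(target_corr)
```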
+#### 4. Model Selection
+
+- **Choose Baseline Models:**
+  - **Linear Regression:** Simple and interpretable model for continuous target variables.
+  - **Logistic Regression:** Basic model for binary classification tasks.
+  - **Decision Trees:** Model that captures non-linear relationships and interactions.
+  - **Random Forests:** Ensemble of decision trees that is less prone to overfitting than a single tree and captures complex patterns.
+
+#### 5. Train-Test Split
+
+- **Data Splitting:**
+  - **Training Set:** Used to train the model.
+  - **Validation Set:** Used to tune hyperparameters and to detect overfitting during development.
+  - **Test Set:** Used to evaluate the final model's performance.
+
+#### 6. Model Training
+
+- **Train Model:**
+  - **Fit:** Train the model on the training dataset.
+  - **Hyperparameter Tuning:** Optimize model parameters using grid search or random search.
+
+#### 7. Model Evaluation
+
+- **Evaluate Model:**
+  - **Validation Metrics:** Assess model performance on the validation set using metrics such as Mean Squared Error (MSE) and R-squared.
+  - **Test Metrics:** Evaluate the final model on the test set using the same metrics to ensure generalization.
+
+#### 8. Advanced Techniques
+
+- **Feature Selection:** Identify and retain the most important features to reduce dimensionality.
+- **Ensemble Methods:** Combine predictions from multiple models to improve accuracy.
+- **Cross-Validation:** Use cross-validation techniques to ensure the model generalizes well to unseen data.
+
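One possible way to approach the feature selection step is to rank features by the impurity-based importances of a random forest. The sketch below assumes scikit-learn's `RandomForestRegressor`; the synthetic data, the `noise` column, and the 0.05 threshold are illustrative assumptions, not values from this project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features: the target depends strongly on temperature, weakly on
# humidity, and not at all on the 'noise' column (all values are invented)
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame(
    {
        "temperature": rng.normal(22, 2, n),
        "humidity": rng.normal(45, 5, n),
        "noise": rng.normal(0, 1, n),
    }
)
y = 3.0 * X["temperature"] + 0.2 * X["humidity"] + rng.normal(0, 0.5, n)

# Fit a forest and rank features by importance (importances sum to 1)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)

# Keep features above an arbitrary, illustrative threshold
selected = importances[importances > 0.05].index.tolist()
print("Selected features:", selected)
```

Impurity-based importances can be biased toward high-cardinality features, so the ranking is best treated as a starting point rather than a final answer.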
+### Function Descriptions
+
+#### Data Collection and Preparation
+
+1. **collect_data:**
+   - **Purpose:** Gather time series data from MQTT sensors and store it in TimescaleDB.
+
+2. **preprocess_data:**
+   - **Purpose:** Clean and preprocess the collected data, handle missing values, and remove outliers.
+
+3. **feature_engineering:**
+   - **Purpose:** Create new features such as lag features, rolling statistics, time-based features, and interaction terms.
+
+#### Exploratory Data Analysis (EDA)
+
+4. **visualize_data:**
+   - **Purpose:** Use visualizations like scatter plots, histograms, and correlation matrices to explore relationships and distributions in the data.
+
+5. **calculate_summary_statistics:**
+   - **Purpose:** Calculate summary statistics (mean, median, mode, standard deviation, variance) to understand the central tendency and spread of the data.
+
+#### Model Selection
+
+6. **select_model:**
+   - **Purpose:** Choose a baseline machine learning model appropriate for the prediction task (e.g., Linear Regression, Logistic Regression, Decision Trees, Random Forests).
+
+#### Train-Test Split
+
+7. **split_data:**
+   - **Purpose:** Split the dataset into training, validation, and test sets so that hyperparameters can be tuned and the final model evaluated on data it has not seen.
+
+#### Model Training
+
+8. **train_model:**
+   - **Purpose:** Train the chosen model on the training dataset and tune hyperparameters using the validation set.
+
+#### Model Evaluation
+
+9. **evaluate_model:**
+   - **Purpose:** Assess the model’s performance on the validation and test sets using appropriate evaluation metrics.
+
+#### Advanced Techniques
+
+10. **feature_selection:**
+    - **Purpose:** Identify and retain the most important features to improve model performance and reduce complexity.
+
+11. **ensemble_methods:**
+    - **Purpose:** Combine predictions from multiple models to enhance accuracy and robustness.
+
+12. **cross_validation:**
+    - **Purpose:** Use cross-validation techniques to ensure the model generalizes well to new, unseen data.
+
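To make the `ensemble_methods` entry above more concrete, here is a small sketch that averages a linear model and a random forest with scikit-learn's `VotingRegressor`. The choice of these two base models and the synthetic data are assumptions for illustration; the guide does not prescribe a particular ensemble.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data with one linear and one non-linear effect (invented values)
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 2))
y = 2.0 * X[:, 0] + np.sin(3 * X[:, 1]) + rng.normal(scale=0.1, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# VotingRegressor fits each base model and averages their predictions
ensemble = VotingRegressor(
    [
        ("lin", LinearRegression()),
        ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
    ]
)
ensemble.fit(X_tr, y_tr)

mse = mean_squared_error(y_te, ensemble.predict(X_te))
print(f"Ensemble test MSE: {mse:.3f}")
```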
+### Summary
+
+Defining the problem, collecting and preparing data, exploring it, and then selecting, training, and evaluating models in that order gives you a reproducible baseline for predicting target variables from MQTT sensor data. Each function covers one step of this workflow, which makes it easier to check that the resulting model is accurate and generalizes to new data.
+
+---
+
+The following sections walk through selecting and training a baseline machine learning model on sensor data collected via MQTT, with example code for the splitting, training, and evaluation steps.
+
+### Model Selection and Training
+
+#### 1. Define the Problem
+
+- **Objective:** Clearly define what you want to predict. For example, predicting the number of people in a room, temperature variations, or equipment failure.
+- **Target Variable:** Identify the target variable (dependent variable) you want to predict.
+- **Features:** Identify the features (independent variables) you will use for prediction.
+
+#### 2. Data Collection and Preparation
+
+- **Data Collection:** Ensure that data is continuously collected and stored in a structured format.
+- **Data Preparation:** Clean and preprocess the data, handle missing values, and engineer features.
+
+#### 3. Exploratory Data Analysis (EDA)
+
+- **Visualizations:** Use visualizations to understand the relationships between features and the target variable.
+- **Statistics:** Calculate summary statistics to understand the data distribution.
+
+#### 4. Model Selection
+
+- **Baseline Models:** Start with simple models to establish a baseline. Common choices include:
+  - **Linear Regression:** For continuous target variables.
+  - **Logistic Regression:** For binary classification.
+  - **Decision Trees:** For both regression and classification.
+  - **Random Forests:** For more complex patterns and interactions.
+
+- **Advanced Models:** Consider more advanced models if the baseline models leave too much error, such as:
+  - **Gradient Boosting Machines (GBM)**
+  - **Support Vector Machines (SVM)**
+  - **Neural Networks**
+
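The baseline choices listed above could be wrapped in a small helper along the following lines. The `select_model` name comes from the function overview in this guide, but its signature, the `BASELINES` table, and the hyperparameter defaults shown here are assumptions made for this sketch.

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Hypothetical lookup table: (task, family) -> factory for an unfitted estimator
BASELINES = {
    ("regression", "linear"): lambda: LinearRegression(),
    ("regression", "tree"): lambda: DecisionTreeRegressor(max_depth=5, random_state=42),
    ("regression", "forest"): lambda: RandomForestRegressor(n_estimators=100, random_state=42),
    ("classification", "linear"): lambda: LogisticRegression(max_iter=1000),
    ("classification", "tree"): lambda: DecisionTreeClassifier(max_depth=5, random_state=42),
    ("classification", "forest"): lambda: RandomForestClassifier(n_estimators=100, random_state=42),
}


def select_model(task: str, family: str = "linear"):
    """Return a fresh, unfitted baseline estimator for the given task and family."""
    try:
        return BASELINES[(task, family)]()
    except KeyError:
        raise ValueError(f"No baseline for task={task!r}, family={family!r}") from None


# Example: a random forest for a continuous target such as people_count
model = select_model("regression", "forest")
print(type(model).__name__)
```

Returning a new estimator on every call avoids accidentally reusing a model that was already fitted on different data.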
+#### 5. Train-Test Split
+
+- **Data Splitting:** Split the data into training, validation, and test sets to evaluate model performance.
+
+**Python Code for Train-Test Split:**
+
+```python
+from sklearn.model_selection import train_test_split
+
+# Assuming df is your preprocessed DataFrame and 'target' is your target variable
+X = df.drop(columns=['target'])
+y = df['target']
+
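+# Split the data: 70% training, then halve the remaining 30% into 15% validation and 15% test.
+# Note: train_test_split shuffles rows by default. For time-ordered sensor data with lag
+# features, consider shuffle=False so that future readings do not leak into training.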
+X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
+X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
+```
+
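Because the sensor data is a time series, a chronological split is often preferable to the shuffled split shown above. The sketch below is one way to do it; `chronological_split` is a hypothetical helper, and the `time` column name and the 70/15/15 proportions are assumptions carried over from the examples in this guide.

```python
import numpy as np
import pandas as pd


def chronological_split(df: pd.DataFrame, time_col: str,
                        val_size: float = 0.15, test_size: float = 0.15):
    """Split into train/val/test by time order, oldest rows first, no shuffling."""
    ordered = df.sort_values(time_col).reset_index(drop=True)
    n = len(ordered)
    n_test = int(n * test_size)
    n_val = int(n * val_size)
    n_train = n - n_val - n_test
    train = ordered.iloc[:n_train]
    val = ordered.iloc[n_train:n_train + n_val]
    test = ordered.iloc[n_train + n_val:]
    return train, val, test


# Hypothetical data: 100 readings at one-minute intervals
df = pd.DataFrame(
    {
        "time": pd.date_range("2024-01-01", periods=100, freq="min"),
        "temperature": np.linspace(20.0, 25.0, 100),
    }
)
train, val, test = chronological_split(df, "time")
print(len(train), len(val), len(test))
```

With this layout the validation and test sets always lie in the future relative to the training set, which mirrors how the model would be used once deployed.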
+#### 6. Model Training
+
+- **Training:** Fit the model to the training data.
+- **Hyperparameter Tuning:** Use grid search or random search to optimize hyperparameters.
+
+**Python Code for Model Training:**
+
+```python
+from sklearn.linear_model import LinearRegression
+from sklearn.metrics import mean_squared_error
+
+# Example with Linear Regression, using X_train/X_val from the split above.
+# Plain LinearRegression has no regularization strength to tune, so no search is run here.
+model = LinearRegression()
+
+# Fit the model on the training set
+model.fit(X_train, y_train)
+
+# Predict on the validation set
+y_val_pred = model.predict(X_val)
+
+# Evaluate on the validation set
+mse_val = mean_squared_error(y_val, y_val_pred)
+print(f"Validation Mean Squared Error: {mse_val:.2f}")
+```
+
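For the hyperparameter tuning step, a grid search might look like the sketch below. It swaps in `Ridge` regression, which is not one of the models named in this guide, simply because it has a tunable regularization strength (`alpha`); the synthetic training data and the grid values are likewise assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic training data standing in for X_train / y_train (invented values)
rng = np.random.default_rng(42)
X_train = rng.normal(size=(120, 4))
y_train = X_train @ np.array([1.5, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=120)

# Candidate values for the regularization strength
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validated search; scikit-learn maximizes the score, hence negative MSE
search = GridSearchCV(Ridge(), param_grid, scoring="neg_mean_squared_error", cv=5)
search.fit(X_train, y_train)

best_model = search.best_estimator_  # Refit on all of X_train with the best alpha
print("Best parameters:", search.best_params_)
print(f"Cross-validated MSE: {-search.best_score_:.4f}")
```

`RandomizedSearchCV` follows the same pattern and is usually the better fit when the grid has many dimensions.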
+#### 7. Model Evaluation
+
+- **Evaluation Metrics:** Use appropriate metrics to evaluate model performance.
+  - **Regression:** Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared.
+  - **Classification:** Accuracy, Precision, Recall, F1-Score, ROC-AUC.
+
+**Python Code for Model Evaluation:**
+
+```python
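+from sklearn.metrics import mean_squared_error, r2_score
+
+# Evaluate on test data (run once, after model selection is finished)
+y_test_pred = model.predict(X_test)
+mse_test = mean_squared_error(y_test, y_test_pred)
+rmse_test = mse_test ** 0.5  # RMSE is in the same units as the target
+r2_test = r2_score(y_test, y_test_pred)
+
+print(f"Test Mean Squared Error: {mse_test:.2f}")
+print(f"Test Root Mean Squared Error: {rmse_test:.2f}")
+print(f"Test R-squared: {r2_test:.2f}")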
+```
+
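The snippet above covers the regression metrics. For the classification metrics in the list, a comparable sketch is shown below; it assumes a binary target (for instance "room occupied or not"), a logistic regression baseline, and synthetic data invented for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (invented values)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

clf = LogisticRegression().fit(X_tr, y_tr)
y_pred = clf.predict(X_te)               # Hard class labels
y_score = clf.predict_proba(X_te)[:, 1]  # Probability of the positive class

metrics = {
    "accuracy": accuracy_score(y_te, y_pred),
    "precision": precision_score(y_te, y_pred),
    "recall": recall_score(y_te, y_pred),
    "f1": f1_score(y_te, y_pred),
    "roc_auc": roc_auc_score(y_te, y_score),  # Uses scores, not hard labels
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```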
+#### 8. Advanced Techniques
+
+- **Feature Selection:** Identify the most important features and consider reducing dimensionality.
+- **Ensemble Methods:** Combine predictions from multiple models to improve accuracy.
+- **Cross-Validation:** Use cross-validation to ensure the model generalizes well to unseen data.
+
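For time-ordered sensor data, cross-validation can be done with scikit-learn's `TimeSeriesSplit`, which always trains on earlier rows and validates on later ones. The following is a minimal sketch on synthetic data; the number of splits and the linear regression baseline are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic data; rows are assumed to be in chronological order (invented values)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.2, size=200)

# Each fold trains on an expanding window and validates on the block that follows it
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(
    LinearRegression(), X, y, cv=tscv, scoring="neg_mean_squared_error"
)
mse_per_fold = -scores  # Flip the sign back to a positive MSE

print("MSE per fold:", np.round(mse_per_fold, 4))
print(f"Mean MSE: {mse_per_fold.mean():.4f}")
```

A large spread between folds can indicate that the relationship between sensors and target drifts over time, which a single random split would hide.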
+### Example: Predicting Room Occupancy
+
+Here's an example of how you might structure the code to predict room occupancy based on sensor data:
+
+**Step-by-Step Code:**
+
+```python
+import pandas as pd
+from sklearn.model_selection import train_test_split
+from sklearn.linear_model import LinearRegression
+from sklearn.metrics import mean_squared_error, r2_score
+import matplotlib.pyplot as plt
+import seaborn as sns
+
+# Load and preprocess data
+# Assuming 'df' is the DataFrame loaded from TimescaleDB, with a 'time' column,
+# and 'people_count' is the target
+def preprocess_data(df):
+    df = df.copy()  # Avoid modifying the caller's DataFrame in place
+    df['time'] = pd.to_datetime(df['time'])  # Required for the .dt accessors below
+    df = df.sort_values('time')  # Lag and rolling features assume chronological order
+    # Impute missing sensor readings with the column mean
+    for col in ['temperature', 'humidity', 'fan_rpm']:
+        df[col] = df[col].fillna(df[col].mean())
+    # Lag and rolling-window features
+    df['lag_temperature'] = df['temperature'].shift(1)
+    df['rolling_mean_temperature'] = df['temperature'].rolling(window=3).mean()
+    df = df.dropna()  # Drop rows left with NaN by shift() and rolling()
+    return df
+
+df = preprocess_data(df)
+
+# Feature Engineering
+df['hour'] = df['time'].dt.hour
+df['day_of_week'] = df['time'].dt.dayofweek
+df['interaction_term'] = df['temperature'] * df['humidity']
+
+# Define features and target
+X = df[['temperature', 'humidity', 'fan_rpm', 'lag_temperature', 'rolling_mean_temperature', 'hour', 'day_of_week', 'interaction_term']]
+y = df['people_count']
+
+# Split the data (70% training, 15% validation, 15% test)
+X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
+X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
+
+# Train the model
+model = LinearRegression()
+model.fit(X_train, y_train)
+
+# Predict on validation set
+y_val_pred = model.predict(X_val)
+mse_val = mean_squared_error(y_val, y_val_pred)
+print(f"Validation Mean Squared Error: {mse_val:.2f}")
+
+# Evaluate on test data
+y_test_pred = model.predict(X_test)
+mse_test = mean_squared_error(y_test, y_test_pred)
+r2_test = r2_score(y_test, y_test_pred)
+print(f"Test Mean Squared Error: {mse_test:.2f}")
+print(f"Test R-squared: {r2_test:.2f}")
+
+# Visualize results
+sns.scatterplot(x=y_test, y=y_test_pred)
+plt.xlabel('Actual People Count')
+plt.ylabel('Predicted People Count')
+plt.title('Actual vs Predicted People Count')
+plt.show()
+```
+
+### Summary
+- **Define the Problem:** Clearly define the prediction objective and target variable.
+- **Data Collection and Preparation:** Collect, clean, and preprocess the data.
+- **Feature Engineering:** Create relevant features to enhance the model’s predictive power.
+- **Model Selection and Training:** Select a baseline model and train it on the data.
+- **Model Evaluation:** Evaluate the model’s performance using appropriate metrics.
+- **Advanced Techniques:** Use feature selection, ensemble methods, and cross-validation to improve the model.
+
+By following this structured approach, you can effectively select and train a baseline machine learning model to predict target variables using MQTT sensor data.