diff --git a/projects/getting_started_ml.md b/projects/getting_started_ml.md
index 4e32f3c..a4127f5 100644
--- a/projects/getting_started_ml.md
+++ b/projects/getting_started_ml.md
@@ -569,4 +569,306 @@ def split_data(df, target, test_size=0.2, val_size=0.1):
 - **Key Considerations:** Balance between training, validation, and test sets, ensuring representativeness, and avoiding data leakage.
 - **Function: split_data:** A structured approach to splitting the data into training, validation, and test sets, which is essential for reliable machine learning model development.
 
-By focusing on these streamlined and well-defined steps, organizations can efficiently handle the train-test split process, ensuring that their models are well-trained and evaluated without the need for overly complex procedures. This approach balances practicality with the need for robust model development.
\ No newline at end of file
+By focusing on these streamlined and well-defined steps, organizations can efficiently handle the train-test split process, ensuring that their models are well-trained and evaluated without the need for overly complex procedures. This approach balances practicality with the need for robust model development.
+
+---
+
+### Model Selection and Training: Functions Overview
+
+#### 1. Define the Problem
+
+- **Objective:** Clearly state the goal of the prediction.
+- **Target Variable:** Identify the dependent variable to be predicted.
+- **Features:** List the independent variables to be used for prediction.
+
+#### 2. Data Collection and Preparation
+
+- **Data Collection:** Continuously gather sensor data via MQTT and store it in TimescaleDB.
+
+- **Preprocess Data:**
+  - **Handle Missing Values:** Replace missing values with appropriate substitutes (e.g., mean, median).
+  - **Remove Outliers:** Identify and handle outliers in the dataset.
+
+- **Feature Engineering:**
+  - **Lag Features:** Create lagged versions of features to capture temporal dependencies.
+  - **Rolling Statistics:** Calculate rolling means, standard deviations, and other statistics over a specified window.
+  - **Time-Based Features:** Extract time-related features such as hour of the day and day of the week.
+  - **Interaction Terms:** Generate interaction terms between features to capture combined effects.
+
+#### 3. Exploratory Data Analysis (EDA)
+
+- **Visualize Data:**
+  - **Scatter Plots:** Visualize relationships between features and the target variable.
+  - **Histograms:** Understand the distribution of individual features.
+  - **Correlation Matrix:** Identify correlations between features and the target variable.
+
+- **Summary Statistics:**
+  - **Mean, Median, Mode:** Calculate central tendency measures.
+  - **Standard Deviation, Variance:** Measure the spread of the data.
+
+#### 4. Model Selection
+
+- **Choose Baseline Models:**
+  - **Linear Regression:** Simple and interpretable model for continuous target variables.
+  - **Logistic Regression:** Basic model for binary classification tasks.
+  - **Decision Trees:** Model that captures non-linear relationships and interactions.
+  - **Random Forests:** Ensemble model that reduces overfitting and captures complex patterns.
+
+#### 5. Train-Test Split
+
+- **Data Splitting:**
+  - **Training Set:** Used to train the model.
+  - **Validation Set:** Used to tune hyperparameters and avoid overfitting.
+  - **Test Set:** Used to evaluate the final model's performance.
+
+#### 6. Model Training
+
+- **Train Model:**
+  - **Fit:** Train the model on the training dataset.
+  - **Hyperparameter Tuning:** Optimize model parameters using grid search or random search.
+
+#### 7. Model Evaluation
+
+- **Evaluate Model:**
+  - **Validation Metrics:** Assess model performance on the validation set using metrics such as Mean Squared Error (MSE) and R-squared.
+  - **Test Metrics:** Evaluate the final model on the test set using the same metrics to ensure generalization.
+
+#### 8. Advanced Techniques
+
+- **Feature Selection:** Identify and retain the most important features to reduce dimensionality.
+- **Ensemble Methods:** Combine predictions from multiple models to improve accuracy.
+- **Cross-Validation:** Use cross-validation techniques to ensure the model generalizes well to unseen data.
+
+### Function Descriptions
+
+#### Data Collection and Preparation
+
+1. **collect_data:**
+   - **Purpose:** Gather time series data from MQTT sensors and store it in TimescaleDB.
+
+2. **preprocess_data:**
+   - **Purpose:** Clean and preprocess the collected data, handle missing values, and remove outliers.
+
+3. **feature_engineering:**
+   - **Purpose:** Create new features such as lag features, rolling statistics, time-based features, and interaction terms.
+
+#### Exploratory Data Analysis (EDA)
+
+4. **visualize_data:**
+   - **Purpose:** Use visualizations like scatter plots, histograms, and correlation matrices to explore relationships and distributions in the data.
+
+5. **calculate_summary_statistics:**
+   - **Purpose:** Calculate summary statistics (mean, median, mode, standard deviation, variance) to understand the central tendency and spread of the data.
+
+#### Model Selection
+
+6. **select_model:**
+   - **Purpose:** Choose a baseline machine learning model appropriate for the prediction task (e.g., Linear Regression, Logistic Regression, Decision Trees, Random Forests).
+
+#### Train-Test Split
+
+7. **split_data:**
+   - **Purpose:** Split the dataset into training, validation, and test sets to evaluate model performance.
+
+#### Model Training
+
+8. **train_model:**
+   - **Purpose:** Train the chosen model on the training dataset and tune hyperparameters using the validation set.
+
+#### Model Evaluation
+
+9. **evaluate_model:**
+   - **Purpose:** Assess the model’s performance on the validation and test sets using appropriate evaluation metrics.
+
+#### Advanced Techniques
+
+10. **feature_selection:**
+   - **Purpose:** Identify and retain the most important features to improve model performance and reduce complexity.
+
+11. **ensemble_methods:**
+   - **Purpose:** Combine predictions from multiple models to enhance accuracy and robustness.
+
+12. **cross_validation:**
+   - **Purpose:** Use cross-validation techniques to ensure the model generalizes well to new, unseen data.
+
+### Summary
+By systematically defining the problem, collecting and preparing data, conducting exploratory data analysis, selecting and training models, and evaluating their performance, you can establish a robust baseline for predicting target variables using MQTT sensor data. Each function plays a critical role in this structured approach, ensuring that the resulting model is accurate, reliable, and generalizable.
+
+---
+
+When selecting and training a baseline machine learning model using sensor data collected via MQTT, it is important to consider several factors to ensure the model is appropriate for the given use case. Here's a structured approach to selecting and training a baseline model:
+
+### Model Selection and Training
+
+#### 1. Define the Problem
+
+- **Objective:** Clearly define what you want to predict. For example, predicting the number of people in a room, temperature variations, or equipment failure.
+- **Target Variable:** Identify the target variable (dependent variable) you want to predict.
+- **Features:** Identify the features (independent variables) you will use for prediction.
+
+#### 2. Data Collection and Preparation
+
+- **Data Collection:** Ensure that data is continuously collected and stored in a structured format.
+- **Data Preparation:** Clean and preprocess the data, handle missing values, and engineer features.
+
+#### 3. Exploratory Data Analysis (EDA)
+
+- **Visualizations:** Use visualizations to understand the relationships between features and the target variable.
+- **Statistics:** Calculate summary statistics to understand the data distribution.
+
+#### 4. Model Selection
+
+- **Baseline Models:** Start with simple models to establish a baseline. Common choices include:
+  - **Linear Regression:** For continuous target variables.
+  - **Logistic Regression:** For binary classification.
+  - **Decision Trees:** For both regression and classification.
+  - **Random Forests:** For more complex patterns and interactions.
+
+- **Advanced Models:** Consider more advanced models if needed, such as:
+  - **Gradient Boosting Machines (GBM)**
+  - **Support Vector Machines (SVM)**
+  - **Neural Networks**
+
+#### 5. Train-Test Split
+
+- **Data Splitting:** Split the data into training, validation, and test sets to evaluate model performance.
+
+**Python Code for Train-Test Split:**
+
+```python
+from sklearn.model_selection import train_test_split
+
+# Assuming df is your preprocessed DataFrame and 'target' is your target variable
+X = df.drop(columns=['target'])
+y = df['target']
+
+# Split the data
+X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
+X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
+```
+
+#### 6. Model Training
+
+- **Training:** Fit the model to the training data.
+- **Hyperparameter Tuning:** Use grid search or random search to optimize hyperparameters.
+
+**Python Code for Model Training and Hyperparameter Tuning:**
+
+```python
+from sklearn.linear_model import LinearRegression
+from sklearn.metrics import mean_squared_error
+from sklearn.model_selection import GridSearchCV
+
+# Example with Linear Regression
+model = LinearRegression()
+
+# Fit the model
+model.fit(X_train, y_train)
+
+# Predict on validation set
+y_val_pred = model.predict(X_val)
+
+# Evaluate the model
+mse_val = mean_squared_error(y_val, y_val_pred)
+print(f"Validation Mean Squared Error: {mse_val:.2f}")
+```
+
+#### 7. Model Evaluation
+
+- **Evaluation Metrics:** Use appropriate metrics to evaluate model performance.
+  - **Regression:** Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared.
+  - **Classification:** Accuracy, Precision, Recall, F1-Score, ROC-AUC.
+
+**Python Code for Model Evaluation:**
+
+```python
+from sklearn.metrics import mean_squared_error, r2_score
+
+# Evaluate on test data
+y_test_pred = model.predict(X_test)
+mse_test = mean_squared_error(y_test, y_test_pred)
+r2_test = r2_score(y_test, y_test_pred)
+
+print(f"Test Mean Squared Error: {mse_test:.2f}")
+print(f"Test R-squared: {r2_test:.2f}")
+```
+
+#### 8. Advanced Techniques
+
+- **Feature Selection:** Identify the most important features and consider reducing dimensionality.
+- **Ensemble Methods:** Combine predictions from multiple models to improve accuracy.
+- **Cross-Validation:** Use cross-validation to ensure the model generalizes well to unseen data.
+
+### Example: Predicting Room Occupancy
+
+Here's an example of how you might structure the code to predict room occupancy based on sensor data:
+
+**Step-by-Step Code:**
+
+```python
+import pandas as pd
+from sklearn.model_selection import train_test_split
+from sklearn.linear_model import LinearRegression
+from sklearn.metrics import mean_squared_error, r2_score
+import matplotlib.pyplot as plt
+import seaborn as sns
+
+# Load and preprocess data
+# Assuming 'df' is the DataFrame loaded from TimescaleDB and 'people_count' is the target
+def preprocess_data(df):
+    df['temperature'] = df['temperature'].fillna(df['temperature'].mean())
+    df['humidity'] = df['humidity'].fillna(df['humidity'].mean())
+    df['fan_rpm'] = df['fan_rpm'].fillna(df['fan_rpm'].mean())
+    df['lag_temperature'] = df['temperature'].shift(1)
+    df['rolling_mean_temperature'] = df['temperature'].rolling(window=3).mean()
+    df = df.dropna()  # Drop rows with NaN values after shifting
+    return df
+
+df = preprocess_data(df)
+
+# Feature Engineering
+df['hour'] = df['time'].dt.hour
+df['day_of_week'] = df['time'].dt.dayofweek
+df['interaction_term'] = df['temperature'] * df['humidity']
+
+# Define features and target
+X = df[['temperature', 'humidity', 'fan_rpm', 'lag_temperature', 'rolling_mean_temperature', 'hour', 'day_of_week', 'interaction_term']]
+y = df['people_count']
+
+# Split the data
+X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
+X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
+
+# Train the model
+model = LinearRegression()
+model.fit(X_train, y_train)
+
+# Predict on validation set
+y_val_pred = model.predict(X_val)
+mse_val = mean_squared_error(y_val, y_val_pred)
+print(f"Validation Mean Squared Error: {mse_val:.2f}")
+
+# Evaluate on test data
+y_test_pred = model.predict(X_test)
+mse_test = mean_squared_error(y_test, y_test_pred)
+r2_test = r2_score(y_test, y_test_pred)
+print(f"Test Mean Squared Error: {mse_test:.2f}")
+print(f"Test R-squared: {r2_test:.2f}")
+
+# Visualize results
+sns.scatterplot(x=y_test, y=y_test_pred)
+plt.xlabel('Actual People Count')
+plt.ylabel('Predicted People Count')
+plt.title('Actual vs Predicted People Count')
+plt.show()
+```
+
+### Summary
+- **Define the Problem:** Clearly define the prediction objective and target variable.
+- **Data Collection and Preparation:** Collect, clean, and preprocess the data.
+- **Feature Engineering:** Create relevant features to enhance the model’s predictive power.
+- **Model Selection and Training:** Select a baseline model and train it on the data.
+- **Model Evaluation:** Evaluate the model’s performance using appropriate metrics.
+- **Advanced Techniques:** Use feature selection, ensemble methods, and cross-validation to improve the model.
+
+By following this structured approach, you can effectively select and train a baseline machine learning model to predict target variables using MQTT sensor data.
\ No newline at end of file
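
The steps above name grid search for hyperparameter tuning and cross-validation for checking generalization. A minimal sketch of how the two could be combined follows, under stated assumptions: the data is a synthetic stand-in rather than the MQTT sensor readings, `DecisionTreeRegressor` is swapped in for the linear baseline because it has hyperparameters worth searching, and the `param_grid` values are hypothetical. `TimeSeriesSplit` is used so that each validation fold comes after its training fold in time, which may suit time-ordered sensor data better than a shuffled split.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the sensor features (assumption: three numeric columns)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

# Hypothetical search space; adjust to the model and data at hand
param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]}

# Time-ordered folds: every validation fold follows its training fold
cv = TimeSeriesSplit(n_splits=5)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    cv=cv,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

best_params = search.best_params_
best_mse = -search.best_score_  # scoring is negated MSE, so flip the sign
print(f"Best parameters: {best_params}")
print(f"Cross-validated MSE: {best_mse:.3f}")
```

After fitting, `search.best_estimator_` holds the model refit on all of `X` with the selected parameters, so it can be evaluated once on a held-out test set in the same way as the baseline.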