Update projects/getting_started_ml.md
@@ -570,3 +570,305 @@ def split_data(df, target, test_size=0.2, val_size=0.1):
- **Function: split_data:** A structured approach to splitting the data into training, validation, and test sets, which is essential for reliable machine learning model development.

These streamlined, well-defined steps keep the train-test split simple: models are trained and evaluated on separate data without overly complex procedures, which balances practicality with robust model development.

---

### Model Selection and Training: Functions Overview

#### 1. Define the Problem

- **Objective:** Clearly state the goal of the prediction.
- **Target Variable:** Identify the dependent variable to be predicted.
- **Features:** List the independent variables to be used for prediction.

#### 2. Data Collection and Preparation

- **Data Collection:** Continuously gather sensor data via MQTT and store it in TimescaleDB.

- **Preprocess Data:**
  - **Handle Missing Values:** Replace missing values with appropriate substitutes (e.g., mean, median).
  - **Remove Outliers:** Identify and handle outliers in the dataset.

- **Feature Engineering:**
  - **Lag Features:** Create lagged versions of features to capture temporal dependencies.
  - **Rolling Statistics:** Calculate rolling means, standard deviations, and other statistics over a specified window.
  - **Time-Based Features:** Extract time-related features such as hour of the day and day of the week.
  - **Interaction Terms:** Generate interaction terms between features to capture combined effects.

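
The preprocessing and feature-engineering steps above can be sketched with pandas. This is a minimal illustration, not the project's actual `preprocess_data` or `feature_engineering` functions: the helper name `clean_and_engineer`, the column names (`time`, `temperature`), the median fill, and the 1.5 × IQR clipping rule are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd


def clean_and_engineer(df: pd.DataFrame, column: str = "temperature") -> pd.DataFrame:
    """Hypothetical helper: impute, clip outliers, and add temporal features."""
    out = df.sort_values("time").copy()

    # Handle missing values: fill gaps with the column median.
    out[column] = out[column].fillna(out[column].median())

    # Handle outliers: clip to 1.5 * IQR beyond the quartiles (an assumed rule).
    q1, q3 = out[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    out[column] = out[column].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

    # Lag feature: the previous reading.
    out[f"{column}_lag1"] = out[column].shift(1)
    # Rolling statistic: mean of the last three readings.
    out[f"{column}_roll3"] = out[column].rolling(window=3).mean()
    # Time-based features.
    out["hour"] = out["time"].dt.hour
    out["day_of_week"] = out["time"].dt.dayofweek
    return out


# Toy readings with one gap and one implausible spike.
readings = pd.DataFrame({
    "time": pd.date_range("2024-01-01", periods=8, freq="60min"),
    "temperature": [20.0, 21.0, np.nan, 22.0, 95.0, 21.5, 20.5, 21.0],
})
features = clean_and_engineer(readings)
print(features)
```

Clipping keeps every row, which preserves the regular spacing that lag and rolling features rely on. Dropping outlier rows instead is equally defensible when gaps in the series are acceptable.
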
#### 3. Exploratory Data Analysis (EDA)

- **Visualize Data:**
  - **Scatter Plots:** Visualize relationships between features and the target variable.
  - **Histograms:** Understand the distribution of individual features.
  - **Correlation Matrix:** Identify correlations between features and the target variable.

- **Summary Statistics:**
  - **Mean, Median, Mode:** Calculate central tendency measures.
  - **Standard Deviation, Variance:** Measure the spread of the data.

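
The summary statistics and the correlation check listed above could look roughly like this in pandas. The DataFrame contents and column names are invented for illustration, and plotting is left out so the sketch runs without a display.

```python
import pandas as pd

# Toy sensor readings; the column names and values are illustrative assumptions.
df = pd.DataFrame({
    "temperature": [20.0, 21.0, 22.0, 23.0, 24.0],
    "humidity": [50.0, 48.0, 46.0, 44.0, 42.0],
    "people_count": [0, 1, 2, 3, 4],
})

# Central tendency and spread for every numeric column.
summary = df.agg(["mean", "median", "std", "var"])

# mode() can return several values per column, so it is computed separately.
modes = df.mode()

# Pearson correlation of each feature with the target.
target_corr = df.corr()["people_count"].drop("people_count")

print(summary)
print(target_corr)
```

The full matrix from `df.corr()` can be passed to a heatmap plot when a visual overview is preferred.
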
#### 4. Model Selection

- **Choose Baseline Models:**
  - **Linear Regression:** Simple and interpretable model for continuous target variables.
  - **Logistic Regression:** Basic model for binary classification tasks.
  - **Decision Trees:** Model that captures non-linear relationships and interactions.
  - **Random Forests:** Ensemble of decision trees that reduces overfitting relative to a single tree and captures complex patterns.

#### 5. Train-Test Split

- **Data Splitting:**
  - **Training Set:** Used to train the model.
  - **Validation Set:** Used to tune hyperparameters and detect overfitting.
  - **Test Set:** Used to evaluate the final model's performance.

#### 6. Model Training

- **Train Model:**
  - **Fit:** Train the model on the training dataset.
  - **Hyperparameter Tuning:** Optimize model parameters using grid search or random search.

#### 7. Model Evaluation

- **Evaluate Model:**
  - **Validation Metrics:** Assess model performance on the validation set using metrics such as Mean Squared Error (MSE) and R-squared.
  - **Test Metrics:** Evaluate the final model on the test set using the same metrics to check that it generalizes.

#### 8. Advanced Techniques

- **Feature Selection:** Identify and retain the most important features to reduce dimensionality.
- **Ensemble Methods:** Combine predictions from multiple models to improve accuracy.
- **Cross-Validation:** Use cross-validation techniques to check that the model generalizes well to unseen data.

### Function Descriptions

#### Data Collection and Preparation

1. **collect_data:**
   - **Purpose:** Gather time series data from MQTT sensors and store it in TimescaleDB.

2. **preprocess_data:**
   - **Purpose:** Clean and preprocess the collected data, handle missing values, and remove outliers.

3. **feature_engineering:**
   - **Purpose:** Create new features such as lag features, rolling statistics, time-based features, and interaction terms.

#### Exploratory Data Analysis (EDA)

4. **visualize_data:**
   - **Purpose:** Use visualizations like scatter plots, histograms, and correlation matrices to explore relationships and distributions in the data.

5. **calculate_summary_statistics:**
   - **Purpose:** Calculate summary statistics (mean, median, mode, standard deviation, variance) to understand the central tendency and spread of the data.

#### Model Selection

6. **select_model:**
   - **Purpose:** Choose a baseline machine learning model appropriate for the prediction task (e.g., Linear Regression, Logistic Regression, Decision Trees, Random Forests).

#### Train-Test Split

7. **split_data:**
   - **Purpose:** Split the dataset into training, validation, and test sets to evaluate model performance.

#### Model Training

8. **train_model:**
   - **Purpose:** Train the chosen model on the training dataset and tune hyperparameters using the validation set.

#### Model Evaluation

9. **evaluate_model:**
   - **Purpose:** Assess the model’s performance on the validation and test sets using appropriate evaluation metrics.

#### Advanced Techniques

10. **feature_selection:**
    - **Purpose:** Identify and retain the most important features to improve model performance and reduce complexity.

11. **ensemble_methods:**
    - **Purpose:** Combine predictions from multiple models to enhance accuracy and robustness.

12. **cross_validation:**
    - **Purpose:** Use cross-validation techniques to ensure the model generalizes well to new, unseen data.

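
As one possible shape for the `feature_selection` step described above, the sketch below uses scikit-learn's `SelectKBest` with the `f_regression` score. The document does not say which selection method the project uses, so the method, the synthetic data, and the feature names here are assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: only the first two columns influence the target (an assumption).
rng = np.random.default_rng(7)
feature_names = ["temperature", "humidity", "fan_rpm", "noise"]
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

# Keep the two features with the highest univariate F-scores.
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)

kept = [name for name, keep in zip(feature_names, selector.get_support()) if keep]
print("Selected features:", kept)
```

Univariate scores ignore interactions between features, so a model-based ranking (for example, random forest feature importances) may be a better fit when interaction terms matter.
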
### Summary

By systematically defining the problem, collecting and preparing data, conducting exploratory data analysis, selecting and training models, and evaluating their performance, you can establish a robust baseline for predicting target variables using MQTT sensor data. Each function covers one step of this structured approach, which helps make the resulting model accurate, reliable, and generalizable.

---

When selecting and training a baseline machine learning model on sensor data collected via MQTT, several factors determine whether the model is appropriate for the use case. The following structured approach walks through them:

### Model Selection and Training

#### 1. Define the Problem

- **Objective:** Clearly define what you want to predict. For example, predicting the number of people in a room, temperature variations, or equipment failure.
- **Target Variable:** Identify the target variable (dependent variable) you want to predict.
- **Features:** Identify the features (independent variables) you will use for prediction.

#### 2. Data Collection and Preparation

- **Data Collection:** Ensure that data is continuously collected and stored in a structured format.
- **Data Preparation:** Clean and preprocess the data, handle missing values, and engineer features.

#### 3. Exploratory Data Analysis (EDA)

- **Visualizations:** Use visualizations to understand the relationships between features and the target variable.
- **Statistics:** Calculate summary statistics to understand the data distribution.

#### 4. Model Selection

- **Baseline Models:** Start with simple models to establish a baseline. Common choices include:
  - **Linear Regression:** For continuous target variables.
  - **Logistic Regression:** For binary classification.
  - **Decision Trees:** For both regression and classification.
  - **Random Forests:** For more complex patterns and interactions.

- **Advanced Models:** Consider more advanced models if needed, such as:
  - **Gradient Boosting Machines (GBM)**
  - **Support Vector Machines (SVM)**
  - **Neural Networks**

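
One way to see whether a more complex model is worth it is to score several candidates on the same validation set, including a trivial predictor as the floor. The sketch below does this on synthetic data; the data, the candidate list, and the names used are assumptions for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for sensor features and a continuous target (an assumption).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=42)

candidates = {
    "mean_baseline": DummyRegressor(strategy="mean"),  # always predicts the training mean
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Fit every candidate on the training set and score it on the validation set.
val_mse = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_mse[name] = mean_squared_error(y_val, model.predict(X_val))

for name, mse in sorted(val_mse.items(), key=lambda item: item[1]):
    print(f"{name}: validation MSE = {mse:.3f}")
```

A model that cannot clearly beat `mean_baseline` is not yet learning anything useful from the features.
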
#### 5. Train-Test Split

- **Data Splitting:** Split the data into training, validation, and test sets to evaluate model performance.

**Python Code for Train-Test Split:**

```python
from sklearn.model_selection import train_test_split

# Assuming df is your preprocessed DataFrame and 'target' is your target variable
X = df.drop(columns=['target'])
y = df['target']

# Split the data: 70% training, then the remaining 30% in half (15% validation, 15% test)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
```
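
`train_test_split` shuffles rows by default. With time-ordered sensor data and lag features, shuffling can place readings adjacent to a test row into the training set, which may make the test score look better than it should. A chronological split is one alternative; the helper below is a hypothetical sketch (the name `chronological_split` and the `time` column are assumptions) that keeps the same 70/15/15 proportions.

```python
import pandas as pd


def chronological_split(df, time_col="time", train_frac=0.7, val_frac=0.15):
    """Hypothetical helper: split rows by time order instead of at random."""
    ordered = df.sort_values(time_col).reset_index(drop=True)
    train_end = round(len(ordered) * train_frac)
    val_end = round(len(ordered) * (train_frac + val_frac))
    train = ordered.iloc[:train_end]
    val = ordered.iloc[train_end:val_end]
    test = ordered.iloc[val_end:]
    return train, val, test


data = pd.DataFrame({
    "time": pd.date_range("2024-01-01", periods=20, freq="60min"),
    "temperature": range(20),
})
train_df, val_df, test_df = chronological_split(data)
print(len(train_df), len(val_df), len(test_df))
```

Every validation row is later than every training row, and every test row is later than every validation row, which mirrors how the model would be used on new readings.
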
#### 6. Model Training

- **Training:** Fit the model to the training data.
- **Hyperparameter Tuning:** Use grid search or random search to optimize hyperparameters.

**Python Code for Model Training and Validation:**

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Example with Linear Regression
# (plain LinearRegression has no regularization strength to tune, so it is fit directly)
model = LinearRegression()

# Fit the model
model.fit(X_train, y_train)

# Predict on validation set
y_val_pred = model.predict(X_val)

# Evaluate the model
mse_val = mean_squared_error(y_val, y_val_pred)
print(f"Validation Mean Squared Error: {mse_val:.2f}")
```
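
For the hyperparameter tuning step, a grid search might look like the sketch below. Plain linear regression has no regularization strength to tune, so `Ridge` is swapped in here purely for illustration; the synthetic data and the `alpha` grid are assumptions as well.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic training data standing in for X_train / y_train (an assumption).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(120, 4))
y_train = X_train @ np.array([1.5, -2.0, 0.0, 0.5]) + rng.normal(scale=0.2, size=120)

# Candidate regularization strengths to try.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    estimator=Ridge(),
    param_grid=param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, hence the negation
    cv=5,
)
search.fit(X_train, y_train)

print("Best alpha:", search.best_params_["alpha"])
print("Best cross-validated MSE:", -search.best_score_)
```

`GridSearchCV` runs its own cross-validation inside the training data, so the separate validation set can be kept for a final sanity check before touching the test set. `RandomizedSearchCV` follows the same pattern when the grid is too large to enumerate.
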
#### 7. Model Evaluation

- **Evaluation Metrics:** Use appropriate metrics to evaluate model performance.
  - **Regression:** Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared.
  - **Classification:** Accuracy, Precision, Recall, F1-Score, ROC-AUC.

**Python Code for Model Evaluation:**

```python
from sklearn.metrics import mean_squared_error, r2_score

# Evaluate on test data (uses the model fitted in the training step)
y_test_pred = model.predict(X_test)
mse_test = mean_squared_error(y_test, y_test_pred)
r2_test = r2_score(y_test, y_test_pred)

print(f"Test Mean Squared Error: {mse_test:.2f}")
print(f"Test R-squared: {r2_test:.2f}")
```
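
The remaining metrics from the list above (RMSE and the classification metrics) can be computed along the same lines. The arrays below are small made-up examples, and the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_squared_error,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Regression: RMSE is the square root of MSE, expressed in the target's own units.
y_true_reg = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_reg = np.array([2.0, 5.0, 4.0, 7.0])
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))
print(f"RMSE: {rmse:.3f}")

# Binary classification: hard labels for most metrics, raw scores for ROC-AUC.
y_true_clf = np.array([0, 0, 1, 1, 1, 0])
y_score = np.array([0.1, 0.6, 0.8, 0.4, 0.9, 0.2])
y_pred_clf = (y_score >= 0.5).astype(int)  # assumed decision threshold of 0.5

accuracy = accuracy_score(y_true_clf, y_pred_clf)
precision = precision_score(y_true_clf, y_pred_clf)
recall = recall_score(y_true_clf, y_pred_clf)
f1 = f1_score(y_true_clf, y_pred_clf)
roc_auc = roc_auc_score(y_true_clf, y_score)

print(f"Accuracy: {accuracy:.3f}, Precision: {precision:.3f}, Recall: {recall:.3f}")
print(f"F1: {f1:.3f}, ROC-AUC: {roc_auc:.3f}")
```

ROC-AUC is computed from the scores rather than the thresholded labels, so it does not depend on the chosen threshold.
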
#### 8. Advanced Techniques

- **Feature Selection:** Identify the most important features and consider reducing dimensionality.
- **Ensemble Methods:** Combine predictions from multiple models to improve accuracy.
- **Cross-Validation:** Use cross-validation to ensure the model generalizes well to unseen data.

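
For time-ordered sensor data, cross-validation can respect the time order by using scikit-learn's `TimeSeriesSplit`, where each fold trains on earlier rows and tests on later ones. The sketch below assumes the rows are already sorted by time and uses synthetic data in place of real sensor features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic, time-ordered stand-in for the feature matrix and target (an assumption).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(scale=0.1, size=100)

# Each split trains on an expanding window of past rows and tests on the next block.
cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="neg_mean_squared_error")
fold_mse = -scores

print("MSE per fold:", np.round(fold_mse, 4))
print("Mean MSE:", fold_mse.mean())
```

A large spread between folds suggests the relationship between the sensors and the target drifts over time, which a single train-test split would not reveal.
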
### Example: Predicting Room Occupancy

Here's an example of how you might structure the code to predict room occupancy based on sensor data:

**Step-by-Step Code:**

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

# Load and preprocess data
# Assuming 'df' is the DataFrame loaded from TimescaleDB and 'people_count' is the target
def preprocess_data(df):
    # Work on a time-ordered copy so lag and rolling features follow the order of the readings
    df = df.sort_values('time').copy()
    # Fill missing sensor readings with the column mean
    for column in ['temperature', 'humidity', 'fan_rpm']:
        df[column] = df[column].fillna(df[column].mean())
    df['lag_temperature'] = df['temperature'].shift(1)
    df['rolling_mean_temperature'] = df['temperature'].rolling(window=3).mean()
    df = df.dropna()  # Drop rows with NaN values after shifting
    return df

df = preprocess_data(df)

# Feature Engineering (assumes 'time' is a datetime column)
df['hour'] = df['time'].dt.hour
df['day_of_week'] = df['time'].dt.dayofweek
df['interaction_term'] = df['temperature'] * df['humidity']

# Define features and target
X = df[['temperature', 'humidity', 'fan_rpm', 'lag_temperature', 'rolling_mean_temperature', 'hour', 'day_of_week', 'interaction_term']]
y = df['people_count']

# Split the data (70% training, 15% validation, 15% test)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on validation set
y_val_pred = model.predict(X_val)
mse_val = mean_squared_error(y_val, y_val_pred)
print(f"Validation Mean Squared Error: {mse_val:.2f}")

# Evaluate on test data
y_test_pred = model.predict(X_test)
mse_test = mean_squared_error(y_test, y_test_pred)
r2_test = r2_score(y_test, y_test_pred)
print(f"Test Mean Squared Error: {mse_test:.2f}")
print(f"Test R-squared: {r2_test:.2f}")

# Visualize results
sns.scatterplot(x=y_test, y=y_test_pred)
plt.xlabel('Actual People Count')
plt.ylabel('Predicted People Count')
plt.title('Actual vs Predicted People Count')
plt.show()
```

### Summary

- **Define the Problem:** Clearly define the prediction objective and target variable.
- **Data Collection and Preparation:** Collect, clean, and preprocess the data.
- **Feature Engineering:** Create relevant features to enhance the model’s predictive power.
- **Model Selection and Training:** Select a baseline model and train it on the data.
- **Model Evaluation:** Evaluate the model’s performance using appropriate metrics.
- **Advanced Techniques:** Use feature selection, ensemble methods, and cross-validation to improve the model.

By following this structured approach, you can effectively select and train a baseline machine learning model to predict target variables using MQTT sensor data.