In addition to predictive analysis, several other types of analysis can be performed using Meraki sensor data and network telemetry. These analyses can provide valuable insights into the environment, operations, and overall system performance. Here are some key types of analysis: ### 1. **Descriptive Analysis** **Objective:** - Summarize and describe the main features of the dataset. **Techniques:** - **Summary Statistics:** Calculate mean, median, mode, standard deviation, and range for different sensor readings (e.g., temperature, humidity, fan RPM). - **Visualizations:** Use histograms, bar charts, box plots, and heatmaps to visualize the distribution and relationships between different variables. **Example:** ```python import matplotlib.pyplot as plt import seaborn as sns # Summary statistics summary_stats = sensor_data.describe() # Visualization sns.histplot(sensor_data['temperature'], kde=True) plt.title('Temperature Distribution') plt.xlabel('Temperature') plt.ylabel('Frequency') plt.show() sns.boxplot(x='sensor_serial', y='humidity', data=sensor_data) plt.title('Humidity by Sensor') plt.xlabel('Sensor') plt.ylabel('Humidity') plt.show() ``` ### 2. **Diagnostic Analysis** **Objective:** - Understand the underlying causes of trends, patterns, or anomalies in the data. **Techniques:** - **Correlation Analysis:** Examine the relationships between different variables. - **Anomaly Detection:** Identify and investigate unusual patterns or outliers in the data. **Example:** ```python # Correlation matrix corr_matrix = sensor_data[['temperature', 'humidity', 'fan_rpm']].corr() # Visualization sns.heatmap(corr_matrix, annot=True, cmap='coolwarm') plt.title('Correlation Matrix') plt.show() # Anomaly detection using Z-score from scipy.stats import zscore sensor_data['zscore_temp'] = zscore(sensor_data['temperature']) anomalies = sensor_data[sensor_data['zscore_temp'].abs() > 3] plt.plot(sensor_data['time'], sensor_data['temperature'], label='Temperature') plt.scatter(anomalies['time'], anomalies['temperature'], color='red', label='Anomalies') plt.title('Temperature Anomalies') plt.xlabel('Time') plt.ylabel('Temperature') plt.legend() plt.show() ``` ### 3. **Prescriptive Analysis** **Objective:** - Provide recommendations or actions based on the analysis. **Techniques:** - **Optimization:** Use mathematical models to find the best solution for a given problem (e.g., optimizing fan speeds for energy efficiency). - **Decision Trees:** Develop decision rules based on historical data to guide future actions. **Example:** ```python from sklearn.tree import DecisionTreeClassifier, plot_tree # Decision tree to recommend fan speed based on temperature and humidity X = sensor_data[['temperature', 'humidity']] y = sensor_data['fan_rpm_category'] # Assume fan RPM is categorized for simplicity model = DecisionTreeClassifier(max_depth=3) model.fit(X, y) plt.figure(figsize=(12,8)) plot_tree(model, feature_names=['temperature', 'humidity'], class_names=['Low', 'Medium', 'High'], filled=True) plt.title('Decision Tree for Fan RPM Recommendations') plt.show() ``` ### 4. **Predictive Maintenance Analysis** **Objective:** - Predict when maintenance should be performed to prevent unexpected equipment failures. **Techniques:** - **Survival Analysis:** Estimate the time until an event (e.g., failure) occurs. - **Time-to-Failure Models:** Predict the remaining useful life of equipment. 
**Example:** ```python from lifelines import KaplanMeierFitter # Simulated data for time to failure sensor_data['time_to_failure'] = ... # Time until sensor indicates failure sensor_data['event_observed'] = ... # 1 if failure observed, 0 otherwise kmf = KaplanMeierFitter() kmf.fit(durations=sensor_data['time_to_failure'], event_observed=sensor_data['event_observed']) kmf.plot_survival_function() plt.title('Kaplan-Meier Survival Curve') plt.xlabel('Time') plt.ylabel('Survival Probability') plt.show() ``` ### 5. **Real-time Monitoring and Alerts** **Objective:** - Continuously monitor sensor data and generate alerts for specific conditions. **Techniques:** - **Threshold-based Alerts:** Trigger alerts when sensor readings exceed predefined thresholds. - **Real-time Dashboards:** Use visualization tools to create live dashboards showing current sensor statuses. **Example:** ```python import dash from dash import dcc, html from dash.dependencies import Input, Output # Sample real-time data setup app = dash.Dash(__name__) app.layout = html.Div([ dcc.Graph(id='live-update-graph'), dcc.Interval( id='interval-component', interval=1*1000, # Update every second n_intervals=0 ) ]) @app.callback(Output('live-update-graph', 'figure'), Input('interval-component', 'n_intervals')) def update_graph_live(n): # Fetch the latest data latest_data = collect_sensor_data() # Assuming this function fetches the latest data fig = { 'data': [ {'x': latest_data['time'], 'y': latest_data['temperature'], 'type': 'line', 'name': 'Temperature'}, {'x': latest_data['time'], 'y': latest_data['humidity'], 'type': 'line', 'name': 'Humidity'}, {'x': latest_data['time'], 'y': latest_data['fan_rpm'], 'type': 'line', 'name': 'Fan RPM'} ], 'layout': { 'title': 'Live Sensor Data' } } return fig if __name__ == '__main__': app.run_server(debug=True) ``` ### 6. **Root Cause Analysis** **Objective:** - Identify the root cause of specific issues or anomalies in the data. **Techniques:** - **Causal Analysis:** Use causal inference methods to determine cause-and-effect relationships. - **Fishbone Diagrams:** Visualize potential causes of a problem. **Example:** ```python # Simple example of causal analysis using correlation import statsmodels.api as sm X = sensor_data[['temperature', 'humidity']] y = sensor_data['fan_rpm'] X = sm.add_constant(X) model = sm.OLS(y, X).fit() print(model.summary()) ``` ### Summary By leveraging these various types of analysis, you can gain a comprehensive understanding of your Meraki sensor data and network telemetry. Each type of analysis offers unique insights and value, ranging from summarizing current conditions to making data-driven decisions and predicting future events. This multifaceted approach ensures that you can optimize operations, maintain equipment effectively, and respond proactively to changes and anomalies in your environment. --- Combining MQTT data with the Meraki Dashboard API can unlock advanced capabilities for real-time monitoring, predictive maintenance, and comprehensive environmental and network management. Here's how you can leverage the Meraki Dashboard API alongside MQTT data to create a powerful, integrated solution. ### Advanced Capabilities with MQTT Data and Meraki Dashboard API 1. **Real-Time Monitoring and Alerts** 2. **Predictive Maintenance** 3. **Comprehensive Data Analytics** 4. **Optimized Environmental Control** 5. **Network Performance Management** ### 1. 
Real-Time Monitoring and Alerts

**Objective:**
- Continuously monitor environmental conditions and network performance, triggering alerts when predefined thresholds are exceeded.

**Implementation:**
- **Collect MQTT Data:** Subscribe to MQTT topics to collect sensor data in real time.
- **Fetch Meraki Data:** Use the Meraki Dashboard API to fetch real-time data from Meraki devices.
- **Set Up Alerts:** Configure thresholds for key metrics and send alerts when these thresholds are breached.

**Example Python Code:**
```python
import requests
import paho.mqtt.client as mqtt
import json

# Meraki API credentials
API_KEY = 'your_meraki_api_key'
NETWORK_ID = 'your_network_id'
MX_SERIAL = 'your_mx_serial'

# MQTT broker configuration
MQTT_BROKER = 'mqtt_broker_address'
MQTT_TOPIC = 'your/topic/#'

# Fetch data from Meraki API
def fetch_meraki_data():
    # Illustrative endpoint and fields: verify the exact device-performance route
    # and response schema against the current Meraki Dashboard API documentation.
    url = f"https://api.meraki.com/api/v1/networks/{NETWORK_ID}/devices/{MX_SERIAL}/performance"
    headers = {
        'X-Cisco-Meraki-API-Key': API_KEY,
        'Content-Type': 'application/json'
    }
    response = requests.get(url, headers=headers)
    return response.json()

# MQTT callback for message reception
def on_message(client, userdata, message):
    payload = json.loads(message.payload.decode('utf-8'))
    temperature = payload.get('temperature')
    humidity = payload.get('humidity')

    meraki_data = fetch_meraki_data()
    fan_rpm = meraki_data['fan_speed']  # 'fan_speed' is a placeholder response field

    # Check thresholds and send alerts
    if temperature > 30:
        print("Alert: High temperature!")
    if humidity > 70:
        print("Alert: High humidity!")
    if fan_rpm > 5000:
        print("Alert: High fan RPM!")

# MQTT client setup
client = mqtt.Client()
client.on_message = on_message
client.connect(MQTT_BROKER)
client.subscribe(MQTT_TOPIC)
client.loop_forever()
```

### 2. Predictive Maintenance

**Objective:**
- Predict when maintenance should be performed to prevent unexpected equipment failures.

**Implementation:**
- **Collect Historical Data:** Collect historical MQTT and Meraki data for training predictive models.
- **Train Predictive Models:** Use machine learning algorithms to predict equipment failure based on historical data.
- **Deploy Predictive Models:** Integrate predictive models into the monitoring system to trigger maintenance alerts.

**Example Python Code:**
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Fetch historical data from MQTT and Meraki API (assume data collection code exists)
historical_data = collect_historical_data()

# Data preprocessing
X = historical_data[['temperature', 'humidity', 'fan_rpm']]
y = historical_data['time_to_failure']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train predictive model
model = RandomForestRegressor()
model.fit(X_train, y_train)

# Evaluate the model
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

# Predict maintenance needs
def predict_maintenance(temperature, humidity, fan_rpm):
    features = pd.DataFrame({'temperature': [temperature], 'humidity': [humidity], 'fan_rpm': [fan_rpm]})
    time_to_failure = model.predict(features)[0]  # predict() returns an array; take the scalar
    if time_to_failure < 7:
        print("Alert: Maintenance needed soon!")

# Example usage
predict_maintenance(32, 65, 4800)
```

### 3. Comprehensive Data Analytics

**Objective:**
- Perform advanced analytics on the combined MQTT and Meraki data to derive insights.
**Implementation:** - **Data Integration:** Combine MQTT and Meraki data into a unified dataset. - **Data Analytics:** Use data analytics tools and techniques to extract insights from the integrated dataset. **Example Python Code:** ```python import pandas as pd # Integrate MQTT and Meraki data (assume data collection code exists) mqtt_data = collect_mqtt_data() meraki_data = collect_meraki_data() # Combine datasets combined_data = pd.merge(mqtt_data, meraki_data, on='timestamp') # Data analytics correlation_matrix = combined_data.corr() print(correlation_matrix) # Visualization import seaborn as sns import matplotlib.pyplot as plt sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm') plt.title('Correlation Matrix') plt.show() ``` ### 4. Optimized Environmental Control **Objective:** - Optimize the environmental conditions (e.g., temperature, humidity) for the best equipment performance. **Implementation:** - **Real-Time Adjustments:** Use real-time data to adjust environmental controls dynamically. - **Feedback Loops:** Implement feedback loops to continuously optimize environmental settings. **Example Python Code:** ```python # Real-time environmental control def adjust_environment(temperature, humidity, fan_rpm): if temperature > 30: print("Adjusting cooling system to lower temperature.") if humidity > 70: print("Adjusting dehumidifier to lower humidity.") if fan_rpm > 5000: print("Adjusting fan speed to optimal level.") # Example usage adjust_environment(32, 75, 5200) ``` ### 5. Network Performance Management **Objective:** - Monitor and optimize network performance based on environmental conditions and equipment status. **Implementation:** - **Network Telemetry:** Collect network performance data using Meraki Dashboard API. - **Performance Optimization:** Use the collected data to optimize network performance dynamically. **Example Python Code:** ```python # Fetch network telemetry data def fetch_network_telemetry(): url = f"https://api.meraki.com/api/v1/networks/{NETWORK_ID}/devices/performance" headers = { 'X-Cisco-Meraki-API-Key': API_KEY, 'Content-Type': 'application/json' } response = requests.get(url, headers=headers) return response.json() # Performance optimization def optimize_network_performance(temperature, humidity, network_data): if temperature > 30: print("Adjusting network settings to optimize performance under high temperature.") if humidity > 70: print("Adjusting network settings to optimize performance under high humidity.") # Example: Prioritize critical traffic critical_traffic = [d for d in network_data if d['traffic_type'] == 'critical'] print(f"Optimizing {len(critical_traffic)} critical traffic flows.") # Example usage network_data = fetch_network_telemetry() optimize_network_performance(32, 75, network_data) ``` ### Summary By combining MQTT data with Meraki Dashboard API data, you can implement advanced capabilities such as: 1. **Real-Time Monitoring and Alerts:** - Continuously monitor environmental and network conditions, triggering alerts when thresholds are exceeded. 2. **Predictive Maintenance:** - Predict equipment maintenance needs based on historical data and machine learning models. 3. **Comprehensive Data Analytics:** - Perform advanced analytics on integrated datasets to derive actionable insights. 4. **Optimized Environmental Control:** - Dynamically adjust environmental controls for optimal equipment performance. 5. 
**Network Performance Management:** - Monitor and optimize network performance based on environmental conditions and equipment status. These advanced capabilities can significantly enhance the efficiency, reliability, and performance of your operations, providing a comprehensive solution for managing both environmental and network parameters. --- ### Focus on Train-Test Split for MQTT and Sensor Data Given the constraints and practical considerations within an organization, it is essential to streamline the approach to ensure it's feasible while maintaining robust predictive modeling. Here's a detailed focus on the train-test split process: #### Train-Test Split **Objective:** - To divide the data into distinct sets that serve different purposes in the model development process: training, validation, and testing. **Rationale:** - Proper data splitting ensures that the model can generalize well to new data and helps in evaluating the model’s performance effectively. ### Steps and Functions #### 1. Data Splitting **Function: split_data** **Purpose:** - To split the dataset into training, validation, and test sets. **Steps:** 1. **Identify Features and Target:** - **Features (X):** Independent variables that will be used for prediction. - **Target (y):** Dependent variable that needs to be predicted. 2. **Split the Data:** - **Training Set:** Typically 60-70% of the data. Used to train the model. - **Validation Set:** Typically 15-20% of the data. Used to tune hyperparameters and avoid overfitting. - **Test Set:** Typically 15-20% of the data. Used to evaluate the final model’s performance. **Considerations:** - Ensure the splits are representative of the overall dataset. - Use stratified sampling if dealing with classification problems to maintain the distribution of target classes. ### Detailed Description of the Functions #### Data Splitting 1. **Function: split_data** - **Purpose:** Split the dataset into training, validation, and test sets. - **Parameters:** - `df`: The preprocessed DataFrame. - `target`: The name of the target column. - `test_size`: Proportion of the data to include in the test split (default 0.2). - `val_size`: Proportion of the data to include in the validation split from the remaining training data (default 0.1). - **Returns:** - `X_train`, `X_val`, `X_test`: Features for training, validation, and test sets. - `y_train`, `y_val`, `y_test`: Target variable for training, validation, and test sets. ### Example Implementation of split_data Function ```python from sklearn.model_selection import train_test_split def split_data(df, target, test_size=0.2, val_size=0.1): """ Split the dataset into training, validation, and test sets. Parameters: df (DataFrame): The preprocessed DataFrame. target (str): The name of the target column. test_size (float): Proportion of the data to include in the test split. val_size (float): Proportion of the data to include in the validation split from the remaining training data. Returns: X_train, X_val, X_test: Features for training, validation, and test sets. y_train, y_val, y_test: Target variable for training, validation, and test sets. 
""" X = df.drop(columns=[target]) y = df[target] # Split into initial train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42, shuffle=False) # Calculate validation size relative to the training set val_size_adjusted = val_size / (1 - test_size) # Split the training set into new training and validation sets X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=val_size_adjusted, random_state=42, shuffle=False) return X_train, X_val, X_test, y_train, y_val, y_test ``` ### Explanation of the Example Implementation 1. **Data Preparation:** - Drop the target column from the DataFrame to get the feature set (X). - Separate the target column to get the target variable (y). 2. **Initial Split:** - Use `train_test_split` to split the data into training and test sets. Set `test_size` to the desired proportion (default 0.2). 3. **Validation Split:** - Calculate the adjusted validation size relative to the remaining training data. - Split the initial training set into a new training set and a validation set using `train_test_split`. 4. **Return Values:** - Return the features and target variables for the training, validation, and test sets. ### Summary - **Objective:** To ensure that data splitting is done effectively to facilitate robust model training and evaluation. - **Key Considerations:** Balance between training, validation, and test sets, ensuring representativeness, and avoiding data leakage. - **Function: split_data:** A structured approach to splitting the data into training, validation, and test sets, which is essential for reliable machine learning model development. By focusing on these streamlined and well-defined steps, organizations can efficiently handle the train-test split process, ensuring that their models are well-trained and evaluated without the need for overly complex procedures. This approach balances practicality with the need for robust model development. --- ### Model Selection and Training: Functions Overview #### 1. Define the Problem - **Objective:** Clearly state the goal of the prediction. - **Target Variable:** Identify the dependent variable to be predicted. - **Features:** List the independent variables to be used for prediction. #### 2. Data Collection and Preparation - **Data Collection:** Continuously gather sensor data via MQTT and store it in TimescaleDB. - **Preprocess Data:** - **Handle Missing Values:** Replace missing values with appropriate substitutes (e.g., mean, median). - **Remove Outliers:** Identify and handle outliers in the dataset. - **Feature Engineering:** - **Lag Features:** Create lagged versions of features to capture temporal dependencies. - **Rolling Statistics:** Calculate rolling means, standard deviations, and other statistics over a specified window. - **Time-Based Features:** Extract time-related features such as hour of the day and day of the week. - **Interaction Terms:** Generate interaction terms between features to capture combined effects. #### 3. Exploratory Data Analysis (EDA) - **Visualize Data:** - **Scatter Plots:** Visualize relationships between features and the target variable. - **Histograms:** Understand the distribution of individual features. - **Correlation Matrix:** Identify correlations between features and the target variable. - **Summary Statistics:** - **Mean, Median, Mode:** Calculate central tendency measures. - **Standard Deviation, Variance:** Measure the spread of the data. #### 4. 
Model Selection - **Choose Baseline Models:** - **Linear Regression:** Simple and interpretable model for continuous target variables. - **Logistic Regression:** Basic model for binary classification tasks. - **Decision Trees:** Model that captures non-linear relationships and interactions. - **Random Forests:** Ensemble model that reduces overfitting and captures complex patterns. #### 5. Train-Test Split - **Data Splitting:** - **Training Set:** Used to train the model. - **Validation Set:** Used to tune hyperparameters and avoid overfitting. - **Test Set:** Used to evaluate the final model's performance. #### 6. Model Training - **Train Model:** - **Fit:** Train the model on the training dataset. - **Hyperparameter Tuning:** Optimize model parameters using grid search or random search. #### 7. Model Evaluation - **Evaluate Model:** - **Validation Metrics:** Assess model performance on the validation set using metrics such as Mean Squared Error (MSE) and R-squared. - **Test Metrics:** Evaluate the final model on the test set using the same metrics to ensure generalization. #### 8. Advanced Techniques - **Feature Selection:** Identify and retain the most important features to reduce dimensionality. - **Ensemble Methods:** Combine predictions from multiple models to improve accuracy. - **Cross-Validation:** Use cross-validation techniques to ensure the model generalizes well to unseen data. ### Function Descriptions #### Data Collection and Preparation 1. **collect_data:** - **Purpose:** Gather time series data from MQTT sensors and store it in TimescaleDB. 2. **preprocess_data:** - **Purpose:** Clean and preprocess the collected data, handle missing values, and remove outliers. 3. **feature_engineering:** - **Purpose:** Create new features such as lag features, rolling statistics, time-based features, and interaction terms. #### Exploratory Data Analysis (EDA) 4. **visualize_data:** - **Purpose:** Use visualizations like scatter plots, histograms, and correlation matrices to explore relationships and distributions in the data. 5. **calculate_summary_statistics:** - **Purpose:** Calculate summary statistics (mean, median, mode, standard deviation, variance) to understand the central tendency and spread of the data. #### Model Selection 6. **select_model:** - **Purpose:** Choose a baseline machine learning model appropriate for the prediction task (e.g., Linear Regression, Logistic Regression, Decision Trees, Random Forests). #### Train-Test Split 7. **split_data:** - **Purpose:** Split the dataset into training, validation, and test sets to evaluate model performance. #### Model Training 8. **train_model:** - **Purpose:** Train the chosen model on the training dataset and tune hyperparameters using the validation set. #### Model Evaluation 9. **evaluate_model:** - **Purpose:** Assess the model’s performance on the validation and test sets using appropriate evaluation metrics. #### Advanced Techniques 10. **feature_selection:** - **Purpose:** Identify and retain the most important features to improve model performance and reduce complexity. 11. **ensemble_methods:** - **Purpose:** Combine predictions from multiple models to enhance accuracy and robustness. 12. **cross_validation:** - **Purpose:** Use cross-validation techniques to ensure the model generalizes well to new, unseen data. 
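The function overview above lists `feature_selection`, `ensemble_methods`, and `cross_validation` by purpose only. The sketch below shows one way these three helpers might look; it is a minimal illustration that assumes a scikit-learn regression setting, a feature DataFrame `X`, and a target series `y` (the function bodies themselves are not defined elsewhere in this document).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def feature_selection(X, y, top_n=5):
    """Keep the top_n features ranked by random forest importance."""
    forest = RandomForestRegressor(n_estimators=100, random_state=42)
    forest.fit(X, y)
    importances = pd.Series(forest.feature_importances_, index=X.columns)
    selected = importances.sort_values(ascending=False).head(top_n).index.tolist()
    return X[selected], selected


def ensemble_methods(X_train, y_train, X_test):
    """Average the predictions of a linear model and a random forest."""
    models = [LinearRegression(), RandomForestRegressor(n_estimators=100, random_state=42)]
    predictions = [m.fit(X_train, y_train).predict(X_test) for m in models]
    return np.mean(predictions, axis=0)


def cross_validation(model, X, y, cv=5):
    """Return the mean squared error averaged across cv folds."""
    scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_squared_error')
    return -scores.mean()
```

These are only starting points: `feature_selection` could equally use `SelectKBest`, and for time-ordered sensor data a `TimeSeriesSplit` avoids training on future observations and is usually a safer choice of `cv` splitter.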
### Summary

By systematically defining the problem, collecting and preparing data, conducting exploratory data analysis, selecting and training models, and evaluating their performance, you can establish a robust baseline for predicting target variables using MQTT sensor data. Each function plays a critical role in this structured approach, ensuring that the resulting model is accurate, reliable, and generalizable.

---

When selecting and training a baseline machine learning model using sensor data collected via MQTT, it is important to consider several factors to ensure the model is appropriate for the given use case. Here's a structured approach to selecting and training a baseline model:

### Model Selection and Training

#### 1. Define the Problem
- **Objective:** Clearly define what you want to predict. For example, predicting the number of people in a room, temperature variations, or equipment failure.
- **Target Variable:** Identify the target variable (dependent variable) you want to predict.
- **Features:** Identify the features (independent variables) you will use for prediction.

#### 2. Data Collection and Preparation
- **Data Collection:** Ensure that data is continuously collected and stored in a structured format.
- **Data Preparation:** Clean and preprocess the data, handle missing values, and engineer features.

#### 3. Exploratory Data Analysis (EDA)
- **Visualizations:** Use visualizations to understand the relationships between features and the target variable.
- **Statistics:** Calculate summary statistics to understand the data distribution.

#### 4. Model Selection
- **Baseline Models:** Start with simple models to establish a baseline. Common choices include:
  - **Linear Regression:** For continuous target variables.
  - **Logistic Regression:** For binary classification.
  - **Decision Trees:** For both regression and classification.
  - **Random Forests:** For more complex patterns and interactions.
- **Advanced Models:** Consider more advanced models if needed, such as:
  - **Gradient Boosting Machines (GBM)**
  - **Support Vector Machines (SVM)**
  - **Neural Networks**

#### 5. Train-Test Split
- **Data Splitting:** Split the data into training, validation, and test sets to evaluate model performance.

**Python Code for Train-Test Split:**
```python
from sklearn.model_selection import train_test_split

# Assuming df is your preprocessed DataFrame and 'target' is your target variable
X = df.drop(columns=['target'])
y = df['target']

# Split the data
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
```

#### 6. Model Training
- **Training:** Fit the model to the training data.
- **Hyperparameter Tuning:** Use grid search or random search to optimize hyperparameters.

**Python Code for Model Training and Hyperparameter Tuning:**
```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Example with Linear Regression
model = LinearRegression()

# Fit the model
model.fit(X_train, y_train)

# Predict on validation set
y_val_pred = model.predict(X_val)

# Evaluate the model
mse_val = mean_squared_error(y_val, y_val_pred)
print(f"Validation Mean Squared Error: {mse_val:.2f}")

# Hyperparameter tuning: grid search over a decision tree's depth
# (plain LinearRegression has no key hyperparameters, so a tree illustrates the idea)
param_grid = {'max_depth': [3, 5, 10]}
grid_search = GridSearchCV(DecisionTreeRegressor(random_state=42), param_grid,
                           cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
```

#### 7. Model Evaluation
- **Evaluation Metrics:** Use appropriate metrics to evaluate model performance.
- **Regression:** Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared. - **Classification:** Accuracy, Precision, Recall, F1-Score, ROC-AUC. **Python Code for Model Evaluation:** ```python from sklearn.metrics import r2_score # Evaluate on test data y_test_pred = model.predict(X_test) mse_test = mean_squared_error(y_test, y_test_pred) r2_test = r2_score(y_test, y_test_pred) print(f"Test Mean Squared Error: {mse_test:.2f}") print(f"Test R-squared: {r2_test:.2f}") ``` #### 8. Advanced Techniques - **Feature Selection:** Identify the most important features and consider reducing dimensionality. - **Ensemble Methods:** Combine predictions from multiple models to improve accuracy. - **Cross-Validation:** Use cross-validation to ensure the model generalizes well to unseen data. ### Example: Predicting Room Occupancy Here's an example of how you might structure the code to predict room occupancy based on sensor data: **Step-by-Step Code:** ```python import pandas as pd from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression from sklearn.metrics import mean_squared_error, r2_score import matplotlib.pyplot as plt import seaborn as sns # Load and preprocess data # Assuming 'df' is the DataFrame loaded from TimescaleDB and 'people_count' is the target def preprocess_data(df): df['temperature'] = df['temperature'].fillna(df['temperature'].mean()) df['humidity'] = df['humidity'].fillna(df['humidity'].mean()) df['fan_rpm'] = df['fan_rpm'].fillna(df['fan_rpm'].mean()) df['lag_temperature'] = df['temperature'].shift(1) df['rolling_mean_temperature'] = df['temperature'].rolling(window=3).mean() df = df.dropna() # Drop rows with NaN values after shifting return df df = preprocess_data(df) # Feature Engineering df['hour'] = df['time'].dt.hour df['day_of_week'] = df['time'].dt.dayofweek df['interaction_term'] = df['temperature'] * df['humidity'] # Define features and target X = df[['temperature', 'humidity', 'fan_rpm', 'lag_temperature', 'rolling_mean_temperature', 'hour', 'day_of_week', 'interaction_term']] y = df['people_count'] # Split the data X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42) X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42) # Train the model model = LinearRegression() model.fit(X_train, y_train) # Predict on validation set y_val_pred = model.predict(X_val) mse_val = mean_squared_error(y_val, y_val_pred) print(f"Validation Mean Squared Error: {mse_val:.2f}") # Evaluate on test data y_test_pred = model.predict(X_test) mse_test = mean_squared_error(y_test, y_test_pred) r2_test = r2_score(y_test, y_test_pred) print(f"Test Mean Squared Error: {mse_test:.2f}") print(f"Test R-squared: {r2_test:.2f}") # Visualize results sns.scatterplot(x=y_test, y=y_test_pred) plt.xlabel('Actual People Count') plt.ylabel('Predicted People Count') plt.title('Actual vs Predicted People Count') plt.show() ``` ### Summary - **Define the Problem:** Clearly define the prediction objective and target variable. - **Data Collection and Preparation:** Collect, clean, and preprocess the data. - **Feature Engineering:** Create relevant features to enhance the model’s predictive power. - **Model Selection and Training:** Select a baseline model and train it on the data. - **Model Evaluation:** Evaluate the model’s performance using appropriate metrics. - **Advanced Techniques:** Use feature selection, ensemble methods, and cross-validation to improve the model. 
By following this structured approach, you can effectively select and train a baseline machine learning model to predict target variables using MQTT sensor data.
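Because the occupancy example above builds lag and rolling-window features, any resampling that mixes future rows into the training folds can leak information. Below is a minimal sketch of a time-aware cross-validation check, assuming the `X` and `y` defined in the example and that the rows are already sorted by time:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Each fold trains on earlier observations and validates on later ones
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(LinearRegression(), X, y, cv=tscv, scoring='neg_mean_squared_error')

print(f"Time-series CV MSE per fold: {(-scores).round(2)}")
print(f"Mean MSE: {-scores.mean():.2f}")
```

This complements the train/validation/test split rather than replacing it; it simply gives a more honest estimate of how the baseline generalizes to future data.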