### Detailed Orchestration with Airflow
Orchestration with Airflow involves setting up Directed Acyclic Graphs (DAGs) that define a sequence of tasks to be executed in a specific order. This ensures that each step in the workflow is completed before the next one begins, and it allows for scheduling, monitoring, and managing the data pipeline efficiently.
Here's a more detailed explanation of the orchestration portion, including setting up Airflow, defining tasks, and managing dependencies.
#### Setting Up Airflow
1. **Install Airflow**:
- You can install Airflow using pip. It's recommended to use a virtual environment.
```bash
pip install apache-airflow
```
2. **Initialize Airflow Database**:
- Initialize the Airflow metadata database.
```bash
airflow db init
```
3. **Start Airflow Web Server and Scheduler**:
- Start the web server and scheduler in separate terminal windows.
```bash
airflow webserver
airflow scheduler
```
4. **Create Airflow Directory Structure**:
- Create the necessary directory structure for your Airflow project.
```bash
mkdir -p ~/airflow/dags
mkdir -p ~/airflow/plugins
mkdir -p ~/airflow/logs
```
5. **Set Up Airflow Configuration**:
- Ensure your Airflow configuration file (`airflow.cfg`) is correctly set up to point to these directories.
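For a quick check that the configuration actually points at these directories, the snippet below reads `airflow.cfg` and prints the relevant settings. This is a minimal sketch, assuming the default config location `~/airflow/airflow.cfg` and the Airflow 2.x key names (`dags_folder` and `plugins_folder` under `[core]`, `base_log_folder` under `[logging]`):
```python
# Sanity check: print the folder settings from airflow.cfg so they can be
# compared against the directories created above. The config path and key
# names assume a default Airflow 2.x installation.
import configparser
import os

config = configparser.ConfigParser()
config.read(os.path.expanduser("~/airflow/airflow.cfg"))

for section, option in [("core", "dags_folder"),
                        ("core", "plugins_folder"),
                        ("logging", "base_log_folder")]:
    value = config.get(section, option, fallback="<not set>")
    print(f"[{section}] {option} = {value}")
```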
#### Defining the Airflow DAG
Create a DAG that orchestrates the entire workflow from data ingestion to ML inference.
##### Example Airflow DAG: `sensor_data_pipeline.py`
1. **Import Necessary Libraries**:
```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago
from datetime import timedelta
import os
```
2. **Set Default Arguments**:
```python
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
}
```
3. **Define the DAG**:
```python
dag = DAG(
'sensor_data_pipeline',
default_args=default_args,
description='A DAG for processing sensor data',
schedule_interval=timedelta(minutes=10),
start_date=days_ago(1),
catchup=False,
)
```
4. **Define Tasks**:
- **Ingest MQTT Data**: Run the MQTT subscriber script to collect sensor data.
```python
def subscribe_to_mqtt():
    import paho.mqtt.client as mqtt
    import json
    import time
    import pandas as pd
    from datetime import datetime
    import sqlite3

    def on_message(client, userdata, message):
        # Each message is a JSON payload; append it to the raw table.
        payload = json.loads(message.payload.decode())
        df = pd.DataFrame([payload])
        df['timestamp'] = datetime.now()
        conn = sqlite3.connect('/path/to/sensor_data.db')
        df.to_sql('raw_sensor_data', conn, if_exists='append', index=False)
        conn.close()

    client = mqtt.Client()
    client.on_message = on_message
    client.connect("mqtt_broker_host", 1883, 60)
    client.subscribe("sensors/data")
    # Collect messages for a bounded window rather than loop_forever(),
    # which would block indefinitely and keep downstream tasks from running.
    client.loop_start()
    time.sleep(300)  # listen for 5 minutes per DAG run
    client.loop_stop()
    client.disconnect()

ingest_mqtt_data = PythonOperator(
    task_id='ingest_mqtt_data',
    python_callable=subscribe_to_mqtt,
    dag=dag,
)
```
- **Transform Data with dbt**: Run dbt models to clean and transform the data.
```python
transform_data = BashOperator(
task_id='transform_data',
bash_command='dbt run --profiles-dir /path/to/your/dbt/project',
dag=dag,
)
```
- **Run ML Inference**: Execute the ML inference script to make predictions.
```python
def run_inference():
    import pandas as pd
    import sqlite3
    import joblib

    def load_transformed_data():
        # Read the dbt-produced aggregate table.
        conn = sqlite3.connect('/path/to/sensor_data.db')
        query = "SELECT * FROM aggregated_sensor_data"
        df = pd.read_sql_query(query, conn)
        conn.close()
        return df

    def make_predictions(data):
        # Score the aggregated features with the trained model.
        model = joblib.load('/path/to/your_model.pkl')
        predictions = model.predict(data[['avg_temperature', 'avg_humidity']])
        data['predictions'] = predictions
        return data

    def save_predictions(data):
        # Persist predictions alongside the source data.
        conn = sqlite3.connect('/path/to/sensor_data.db')
        data.to_sql('sensor_predictions', conn, if_exists='append', index=False)
        conn.close()

    data = load_transformed_data()
    predictions = make_predictions(data)
    save_predictions(predictions)

ml_inference = PythonOperator(
    task_id='run_inference',
    python_callable=run_inference,
    dag=dag,
)
```
5. **Set Task Dependencies**:
```python
ingest_mqtt_data >> transform_data >> ml_inference
```
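With the tasks wired together, it can help to confirm that the DAG file parses cleanly before the scheduler picks it up. A minimal sketch using Airflow's `DagBag`; the dags folder path is an assumption matching the directory layout above:
```python
# Sketch: check that sensor_data_pipeline.py imports cleanly and list its tasks.
import os
from airflow.models import DagBag

dag_bag = DagBag(dag_folder=os.path.expanduser("~/airflow/dags"),
                 include_examples=False)

print("Import errors:", dag_bag.import_errors)  # an empty dict means all files parsed
pipeline = dag_bag.get_dag("sensor_data_pipeline")
if pipeline is not None:
    print("Tasks:", [task.task_id for task in pipeline.tasks])
```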
#### Directory Structure
Ensure your project is structured correctly to support the workflow.
```
sensor_data_project/
├── dags/
│ └── sensor_data_pipeline.py
├── dbt_project.yml
├── models/
│ ├── cleaned_sensor_data.sql
│ └── aggregated_sensor_data.sql
├── profiles.yml
├── scripts/
│ ├── mqtt_subscriber.py
│ └── ml_inference.py
└── Dockerfile
```
#### Docker Integration (Optional)
For better scalability and reproducibility, consider containerizing your Airflow setup with Docker.
##### Dockerfile Example
```Dockerfile
FROM apache/airflow:2.1.2
# Copy DAGs and scripts
COPY dags/ /opt/airflow/dags/
COPY scripts/ /opt/airflow/scripts/
# Install additional Python packages
# (sqlite3 is part of the Python standard library, so it is not installed via pip;
#  dbt is distributed as dbt-core plus a warehouse adapter)
RUN pip install paho-mqtt pandas joblib dbt-core
# Set environment variables
ENV AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=False
# Entry point
ENTRYPOINT ["/usr/bin/dumb-init", "--"]
CMD ["bash", "-c", "airflow webserver & airflow scheduler"]
```
### Summary
Using Airflow for orchestration allows you to:
1. **Schedule and Automate**: Regularly schedule data ingestion, transformation, and ML inference tasks.
2. **Manage Dependencies**: Ensure tasks are executed in the correct order.
3. **Monitor and Alert**: Monitor the status of your workflows and get alerts on failures (see the alerting sketch after this summary).
4. **Scalability**: Easily scale your workflows by distributing tasks across multiple workers.
By structuring your project with these components, you can create a robust, end-to-end data pipeline that ingests MQTT sensor data, processes it, runs ML inference, and provides actionable insights.
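For the monitoring point above, failure alerting can be wired into the DAG itself rather than handled externally. A minimal sketch; the callback function and recipient address are illustrative, and `email_on_failure` assumes an SMTP connection has been configured for Airflow:
```python
from datetime import timedelta
from airflow import DAG
from airflow.utils.dates import days_ago

def notify_on_failure(context):
    # Illustrative callback: log the failing task; swap in Slack, PagerDuty, etc.
    ti = context["task_instance"]
    print(f"Task {ti.task_id} failed in DAG {context['dag'].dag_id}")

default_args = {
    'owner': 'airflow',
    'email': ['alerts@example.com'],      # illustrative recipient
    'email_on_failure': True,             # requires SMTP to be configured
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'on_failure_callback': notify_on_failure,
}

dag = DAG(
    'sensor_data_pipeline',
    default_args=default_args,
    description='A DAG for processing sensor data',
    schedule_interval=timedelta(minutes=10),
    start_date=days_ago(1),
    catchup=False,
)
```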
---
MQTT shares some similarities with multicast in that both involve a publisher/subscriber model where data is distributed to multiple recipients. However, there are distinct differences in how they operate and in their use cases:
### Similarities
1. **Publisher/Subscriber Model**:
- Both MQTT and multicast use a publisher/subscriber model where one entity publishes data and multiple entities can subscribe to receive that data.
2. **Efficient Data Distribution**:
- Both are designed for efficient data distribution to multiple recipients.
### Differences
1. **Transport Protocol**:
- **MQTT**: MQTT operates over TCP/IP and includes mechanisms for reliable message delivery, including Quality of Service (QoS) levels.
- **Multicast**: Multicast typically operates over UDP/IP, which does not guarantee delivery, ordering, or duplicate protection.
2. **Broker vs. Network Layer**:
- **MQTT**: Uses a broker (server) to manage message routing between publishers and subscribers. The broker handles message distribution, connection management, and QoS.
- **Multicast**: Operates at the network layer, where data packets are delivered to multiple recipients based on IP multicast group addresses. There is no central server; the network infrastructure handles data distribution.
3. **Message Reliability**:
- **MQTT**: Provides different QoS levels to ensure message delivery (a paho-mqtt sketch follows this comparison):
- QoS 0: At most once (fire and forget)
- QoS 1: At least once (acknowledged delivery)
- QoS 2: Exactly once (guaranteed delivery)
- **Multicast**: UDP multicast does not inherently provide reliable message delivery, although application-level protocols can be built on top of it to add reliability.
4. **Use Cases**:
- **MQTT**: Commonly used in IoT, where devices publish sensor data to a broker, and applications subscribe to this data. Ideal for scenarios requiring reliable communication and complex routing.
- **Multicast**: Often used in applications like streaming media, live broadcasts, and other scenarios where low-latency, one-to-many data distribution is needed, and reliability can be managed at the application level.
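To make the QoS levels concrete, the sketch below shows how a publisher and subscriber request a QoS level per publish and per subscription with paho-mqtt. The broker host, topic, and payload are placeholders, and the 1.x-style `mqtt.Client()` constructor used elsewhere in this document is assumed:
```python
import json
import paho.mqtt.client as mqtt

client = mqtt.Client()
client.connect("localhost", 1883, 60)  # placeholder broker host/port
client.loop_start()                     # background network loop handles acknowledgements

# QoS is requested per subscription and per publish:
#   qos=0 -> at most once, qos=1 -> at least once, qos=2 -> exactly once
client.subscribe("sensors/data", qos=1)

payload = json.dumps({"sensor_id": "sensor_1", "temperature": 23.5})
info = client.publish("sensors/data", payload, qos=1)
info.wait_for_publish()                 # returns once the broker has acknowledged delivery

client.loop_stop()
client.disconnect()
```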
### Example: Using MQTT for Real-Time Data Streams
Let's consider an example where we use MQTT to subscribe to a stream of sensor data from IoT devices and process it using Airflow and dbt.
#### Step 1: Set Up MQTT Broker and Clients
1. **MQTT Broker**:
- Use an MQTT broker like Mosquitto to handle message routing.
```bash
mosquitto -v
```
2. **MQTT Publisher (Sensor)**:
- Simulate an IoT device publishing sensor data.
```python
import paho.mqtt.client as mqtt
import time
import json
import random
def publish_sensor_data():
client = mqtt.Client()
client.connect("localhost", 1883, 60)
while True:
sensor_data = {
"sensor_id": "sensor_1",
"timestamp": time.time(),
"temperature": random.uniform(20.0, 30.0),
"humidity": random.uniform(30.0, 50.0)
}
client.publish("sensors/data", json.dumps(sensor_data))
time.sleep(5)
if __name__ == "__main__":
publish_sensor_data()
```
3. **MQTT Subscriber (Airflow Task)**:
- Subscribe to the MQTT topic and process incoming messages.
```python
import paho.mqtt.client as mqtt
import json
import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
from datetime import timedelta
def on_message(client, userdata, message):
payload = json.loads(message.payload.decode())
process_payload(payload)
def process_payload(payload):
df = pd.DataFrame([payload])
df.to_csv('/tmp/sensor_data.csv', mode='a', header=False, index=False)
def subscribe_to_mqtt():
    import time
    client = mqtt.Client()
    client.on_message = on_message
    client.connect("localhost", 1883, 60)
    client.subscribe("sensors/data")
    # Listen for a bounded window; loop_start() alone returns immediately,
    # so the task would finish before any messages were processed.
    client.loop_start()
    time.sleep(300)
    client.loop_stop()
    client.disconnect()

def ingest_mqtt_data():
    subscribe_to_mqtt()
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
}
with DAG(
'mqtt_ingestion_dag',
default_args=default_args,
description='A DAG to ingest MQTT data',
schedule_interval=timedelta(minutes=10),
start_date=days_ago(1),
catchup=False,
) as dag:
ingest_mqtt_task = PythonOperator(
task_id='ingest_mqtt_data',
python_callable=ingest_mqtt_data,
)
ingest_mqtt_task
```
#### Step 2: Transform Data Using dbt
1. **Set Up dbt Models**:
- Define models to process the ingested sensor data.
```sql
-- models/sensor_data.sql
-- Note: raw_sensor_data is loaded by the MQTT ingestion task rather than by dbt,
-- so in practice it would be declared in a sources .yml file and referenced with
-- the source() function; ref() is kept here for simplicity.
WITH raw_data AS (
    SELECT
        *
    FROM
        {{ ref('raw_sensor_data') }}
)
SELECT
    sensor_id,
    timestamp,
    temperature,
    humidity
FROM
    raw_data
```
2. **Run dbt Models in Airflow**:
- Schedule dbt runs to transform the data after ingestion.
```python
from airflow.operators.bash import BashOperator

# Place this inside the same `with DAG(...)` block as ingest_mqtt_task so the
# operator is attached to the DAG and the dependency below is valid.
dbt_run = BashOperator(
    task_id='dbt_run',
    bash_command='dbt run --profiles-dir /path/to/your/dbt/project',
)

ingest_mqtt_task >> dbt_run
```
### Summary
While MQTT and multicast both enable efficient data distribution to multiple recipients, MQTT provides additional features such as message reliability, quality of service, and broker-based routing, making it well-suited for IoT and other applications requiring reliable, real-time data streams. By integrating MQTT with tools like Airflow and dbt, you can build robust data pipelines that handle real-time data ingestion, transformation, and analysis, providing valuable business insights.