Add tech_docs/concurrency_parallelism.md

2024-06-18 22:16:20 +00:00
parent 4782c309d7
commit b6685bed13

@@ -0,0 +1,123 @@
## Concurrency and parallelism: a technical primer
1. Concurrency:
- Definition: Concurrency is the ability of a system to make progress on multiple tasks or processes during overlapping time periods, without necessarily executing them at the same instant.
- Key Points:
- Tasks can progress independently and interleave their execution.
- Concurrency improves responsiveness and resource utilization because one task can run while another is waiting, for example on I/O.
- Can be achieved on a single processing unit by switching between tasks.
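- Example: Multithreading in Python (in standard CPython builds, the global interpreter lock lets only one thread execute Python bytecode at a time, so these threads interleave rather than run in parallel)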
```python
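import threading


def task1():
    print("Task 1 started")
    # Perform task 1
    print("Task 1 completed")


def task2():
    print("Task 2 started")
    # Perform task 2
    print("Task 2 completed")


# Create one thread per task; both threads share the same process and memory.
thread1 = threading.Thread(target=task1)
thread2 = threading.Thread(target=task2)

# Start both threads so their execution can interleave.
thread1.start()
thread2.start()

# Wait for both threads to finish before continuing.
thread1.join()
thread2.join()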
```
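The point that concurrency can be achieved on a single processing unit by switching between tasks can be sketched with `asyncio`, which interleaves coroutines on one thread. This is only an illustrative sketch: the names `run_interleaved` and `worker` are hypothetical, and `asyncio.sleep(0)` stands in for any point where a task would wait.

```python
import asyncio


def run_interleaved():
    """Run two coroutines on a single thread and record the order of their steps."""
    events = []

    async def worker(name, steps):
        for i in range(steps):
            events.append(f"{name}:{i}")
            # Yield control to the event loop so the other task can run.
            await asyncio.sleep(0)

    async def main():
        # Both workers are scheduled on the same event loop (one thread).
        await asyncio.gather(worker("A", 3), worker("B", 3))

    asyncio.run(main())
    return events


if __name__ == "__main__":
    print(run_interleaved())
```

The recorded steps alternate between the two workers even though only one of them is executing at any instant, which is concurrency without parallelism.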
2. Parallelism:
- Definition: Parallelism refers to the actual simultaneous execution of multiple tasks or processes on different processing units or cores.
- Key Points:
- Tasks are executed simultaneously on different processing units.
- Parallelism requires hardware support, such as multiple processors or cores.
- Aims to reduce wall-clock time, especially for CPU-bound work, by dividing the work across multiple processors or cores.
- Example: Multiprocessing in Python
```python
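import multiprocessing


def task1():
    print("Task 1 started")
    # Perform task 1
    print("Task 1 completed")


def task2():
    print("Task 2 started")
    # Perform task 2
    print("Task 2 completed")


# The guard is required on platforms that start child processes by
# re-importing this module (for example Windows and macOS).
if __name__ == "__main__":
    process1 = multiprocessing.Process(target=task1)
    process2 = multiprocessing.Process(target=task2)

    # Each process has its own interpreter, so the tasks can run on separate cores.
    process1.start()
    process2.start()

    # Wait for both processes to finish.
    process1.join()
    process2.join()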
```
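To show how parallelism can speed up a single CPU-bound job rather than just run two unrelated tasks, the following sketch splits a range into chunks and sums each chunk in a separate process. The helper names `sum_of_squares` and `split_range`, the chunk count, and the workload itself are assumptions chosen for illustration; `split_range` assumes `stop` and `parts` are positive.

```python
from concurrent.futures import ProcessPoolExecutor


def sum_of_squares(bounds):
    """CPU-bound work on one chunk: sum n*n for n in [start, stop)."""
    start, stop = bounds
    return sum(n * n for n in range(start, stop))


def split_range(stop, parts):
    """Split [0, stop) into at most `parts` contiguous (start, stop) chunks."""
    step = -(-stop // parts)  # ceiling division
    return [(lo, min(lo + step, stop)) for lo in range(0, stop, step)]


if __name__ == "__main__":
    chunks = split_range(1_000_000, 4)
    # Each chunk is handed to a worker process; with enough cores,
    # the chunks are computed at the same time.
    with ProcessPoolExecutor(max_workers=4) as pool:
        total = sum(pool.map(sum_of_squares, chunks))
    print(total)
```

Whether this is actually faster than a single loop depends on the number of available cores and on the cost of starting processes and passing data between them, so small workloads may see no benefit.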
3. Concurrency in Pipelines:
- Definition: Concurrency in pipelines allows multiple tasks or processes to progress independently through different stages of execution.
- Examples:
- CI/CD Pipeline: Stages like code compilation, unit testing, and packaging can operate concurrently on different builds.
- ETL Pipeline: Stages like data extraction, transformation, and loading can process different batches of data concurrently.
- Key Points:
- Stages in a pipeline can operate concurrently on different data or tasks.
- Concurrency improves throughput because a stage can start on the next item while later stages are still processing earlier ones.
- Stages may or may not execute in parallel, depending on available resources and dependencies.
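The idea of pipeline stages working concurrently on different batches can be sketched with one thread per stage and queues between them. This is a minimal illustration, not a production ETL design: the stage functions `transform` and `load`, the `SENTINEL` end-of-stream marker, and the sample batches are all hypothetical.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream


def stage(func, inbox, outbox):
    """Apply func to each item from inbox and pass the result to outbox."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown to the next stage
            return
        outbox.put(func(item))


def transform(batch):
    return [x * 2 for x in batch]


def load(batch):
    return sum(batch)


def run_pipeline(batches):
    extracted, transformed, loaded = queue.Queue(), queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=stage, args=(transform, extracted, transformed)),
        threading.Thread(target=stage, args=(load, transformed, loaded)),
    ]
    for worker in workers:
        worker.start()

    # "Extract": feed batches in; later stages begin before all batches are queued.
    for batch in batches:
        extracted.put(batch)
    extracted.put(SENTINEL)

    results = []
    while True:
        item = loaded.get()
        if item is SENTINEL:
            break
        results.append(item)

    for worker in workers:
        worker.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([[1, 2], [3, 4], [5]]))
```

Because each stage has a single worker and the queues are first-in, first-out, batches leave the pipeline in the order they entered, even though the stages overlap in time.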
4. DAGs (Directed Acyclic Graphs) in Apache Airflow:
- Definition: A DAG represents a collection of tasks organized based on their dependencies and relationships.
- Key Points:
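- Tasks in a DAG are units of work; tasks with no dependency path between them can be executed concurrently.
- Airflow's scheduler hands ready tasks to an executor, which can run multiple tasks in parallel depending on the executor type and available resources.
- Airflow limits concurrency at several levels: across the whole installation (task parallelism), per DAG (task concurrency), and per number of simultaneously active runs of a DAG (DAG concurrency).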
- Example: Simple DAG in Apache Airflow
```python
from datetime import datetime, timedelta

from airflow import DAG
# Airflow 2.x import path; the older airflow.operators.python_operator module is deprecated.
from airflow.operators.python import PythonOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2023, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# Newer Airflow releases use `schedule` in place of `schedule_interval`.
dag = DAG(
    'example_dag',
    default_args=default_args,
    description='A simple DAG example',
    schedule_interval=timedelta(days=1),
)


def task1():
    print("Task 1 executed")


def task2():
    print("Task 2 executed")


task1_operator = PythonOperator(
    task_id='task1',
    python_callable=task1,
    dag=dag,
)

task2_operator = PythonOperator(
    task_id='task2',
    python_callable=task2,
    dag=dag,
)

# task2 runs only after task1 has succeeded.
task1_operator >> task2_operator
```
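To make concrete which tasks in a DAG are eligible to run concurrently, the following sketch groups tasks into "waves" using the standard library's `graphlib`. It is not how Airflow schedules tasks; it only illustrates the dependency rule. The function name `concurrent_waves` and the task names are hypothetical.

```python
from graphlib import TopologicalSorter


def concurrent_waves(dependencies):
    """Group tasks into waves; tasks in the same wave have no unmet dependencies.

    `dependencies` maps each task to the set of tasks it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished, unlocking dependents
    return waves


if __name__ == "__main__":
    deps = {
        "transform_a": {"extract"},
        "transform_b": {"extract"},
        "load": {"transform_a", "transform_b"},
    }
    for wave in concurrent_waves(deps):
        print(wave)
```

In this example `transform_a` and `transform_b` land in the same wave because neither depends on the other, so an executor with enough capacity could run them in parallel, while `load` must wait for both.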
This guide provides a concise overview of concurrency and parallelism, along with technical examples in Python for multithreading, multiprocessing, and a simple DAG in Apache Airflow. It highlights the key differences between concurrency and parallelism and illustrates how concurrency is utilized in pipelines and DAGs to enable efficient execution of tasks.