## Concurrency and Parallelism: A Technical Primer

1. Concurrency:
   - Definition: Concurrency is the ability of a system to make progress on multiple tasks during overlapping time periods, without necessarily executing them at the same instant.
   - Key Points:
     - Tasks can progress independently and interleave their execution.
     - Concurrency improves responsiveness and resource utilization: while one task waits (for example, on I/O), another can run.
     - Can be achieved on a single processing unit by switching between tasks.
   - Example: Multithreading in Python

```python
import threading


def task1():
    print("Task 1 started")
    # Perform task 1
    print("Task 1 completed")


def task2():
    print("Task 2 started")
    # Perform task 2
    print("Task 2 completed")


# Create one thread per task; nothing runs until start() is called.
thread1 = threading.Thread(target=task1)
thread2 = threading.Thread(target=task2)

# Start both threads; their output may interleave.
thread1.start()
thread2.start()

# Wait for both threads to finish before continuing.
thread1.join()
thread2.join()
```
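
The key point that concurrency can be achieved on a single processing unit by switching between tasks can be illustrated without threads at all. The following is a minimal sketch using `asyncio`, in which two coroutines take turns on one thread. The names `worker` and `run_interleaved` are illustrative and are not part of the example above.

```python
import asyncio


async def worker(name, steps, log):
    """Record each step, yielding control to the event loop after every one."""
    for step in range(steps):
        log.append(f"{name}{step}")
        # sleep(0) suspends this coroutine so another ready task can run.
        await asyncio.sleep(0)


async def _main(log):
    # Both coroutines are scheduled on the same event loop (one thread).
    await asyncio.gather(worker("A", 3, log), worker("B", 3, log))


def run_interleaved():
    """Run both workers and return the order in which their steps executed."""
    log = []
    asyncio.run(_main(log))
    return log


if __name__ == "__main__":
    print(run_interleaved())
```

The returned log alternates between the two workers: neither finishes before the other starts, yet only one runs at any instant. That is concurrency without parallelism.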

2. Parallelism:
   - Definition: Parallelism refers to the actual simultaneous execution of multiple tasks or processes on different processing units or cores.
   - Key Points:
     - Tasks are executed simultaneously on different processing units.
     - Parallelism requires hardware support, such as multiple processors or cores.
     - Aims to improve performance and solve problems faster by leveraging multiple computing resources.
   - Example: Multiprocessing in Python

```python
import multiprocessing


def task1():
    print("Task 1 started")
    # Perform task 1
    print("Task 1 completed")


def task2():
    print("Task 2 started")
    # Perform task 2
    print("Task 2 completed")


# The guard is required on platforms that start child processes by
# re-importing this module (the "spawn" start method, e.g. on Windows).
if __name__ == "__main__":
    process1 = multiprocessing.Process(target=task1)
    process2 = multiprocessing.Process(target=task2)

    # Each process has its own interpreter, so the tasks can run on separate cores.
    process1.start()
    process2.start()

    # Wait for both processes to exit.
    process1.join()
    process2.join()
```
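
For CPU-bound work, the benefit of parallelism comes from splitting one problem across cores. The sketch below shows one way to do that with `concurrent.futures.ProcessPoolExecutor`. The prime-counting workload and the helper names (`count_primes`, `split_range`, `parallel_count_primes`) are assumptions chosen for illustration, and any actual speedup depends on the number of cores and on process start-up overhead.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def count_primes(lo, hi):
    """Count primes in [lo, hi) by trial division (deliberately CPU-bound)."""
    count = 0
    for n in range(max(lo, 2), hi):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def split_range(limit, parts):
    """Split [0, limit) into at most `parts` contiguous (lo, hi) chunks."""
    step = -(-limit // parts)  # ceiling division
    return [(lo, min(lo + step, limit)) for lo in range(0, limit, step)]


def parallel_count_primes(limit, workers=None):
    """Count primes below `limit` (assumed >= 1), one chunk per worker process."""
    workers = workers or os.cpu_count() or 1
    chunks = split_range(limit, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Each chunk is counted in a separate process; results are summed here.
        return sum(pool.map(count_primes, *zip(*chunks)))


if __name__ == "__main__":
    # The guard keeps child processes from re-running this on import.
    print(parallel_count_primes(50_000))
```

Unlike the thread example, each chunk here runs in its own process, so on a multi-core machine the chunks can execute at the same instant.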

3. Concurrency in Pipelines:
   - Definition: In a pipeline, concurrency means that different stages work on different items at the same time, so a new item can enter the first stage before earlier items have left the last one.
   - Examples:
     - CI/CD Pipeline: Stages like code compilation, unit testing, and packaging can operate concurrently on different builds.
     - ETL Pipeline: Stages like data extraction, transformation, and loading can process different batches of data concurrently.
   - Key Points:
     - Stages in a pipeline can operate concurrently on different data or tasks.
     - Concurrency improves throughput by keeping every stage busy instead of leaving it idle while other stages run.
     - Stages may or may not execute in parallel, depending on available resources and dependencies.

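
The ETL case can be sketched with one thread per stage connected by queues, so that while one batch is being loaded the next can already be transformed. This is a minimal illustration rather than a production pipeline: the stage functions, the `SENTINEL` marker, and the sample batches are all hypothetical.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream


def extract(batches, out_q):
    """Stage 1: emit raw batches one at a time."""
    for batch in batches:
        out_q.put(batch)
    out_q.put(SENTINEL)


def transform(in_q, out_q):
    """Stage 2: square every value in each batch."""
    while (batch := in_q.get()) is not SENTINEL:
        out_q.put([x * x for x in batch])
    out_q.put(SENTINEL)


def load(in_q, sink):
    """Stage 3: append each transformed batch to the sink."""
    while (batch := in_q.get()) is not SENTINEL:
        sink.append(batch)


def run_pipeline(batches):
    """Run the three stages concurrently and return the loaded batches."""
    # Bounded queues make a fast stage wait for a slow one (backpressure).
    raw_q, clean_q, sink = queue.Queue(maxsize=2), queue.Queue(maxsize=2), []
    stages = [
        threading.Thread(target=extract, args=(batches, raw_q)),
        threading.Thread(target=transform, args=(raw_q, clean_q)),
        threading.Thread(target=load, args=(clean_q, sink)),
    ]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return sink


if __name__ == "__main__":
    print(run_pipeline([[1, 2], [3, 4], [5, 6]]))
```

Because each stage is a single thread reading from a FIFO queue, batches leave the pipeline in the order they entered. Whether the stages truly run in parallel or merely interleave depends on the runtime and the hardware, which matches the last key point above.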
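4. DAGs (Directed Acyclic Graphs) in Apache Airflow:
   - Definition: A DAG is a collection of tasks whose dependencies form a directed graph with no cycles, so every task has a well-defined set of upstream tasks that must finish before it can start.
   - Key Points:
     - Tasks are units of work; tasks with no dependency path between them can be executed concurrently.
     - Airflow's scheduler hands runnable tasks to an executor, which can run several of them in parallel, depending on the executor and available resources.
     - This allows parallelism at several levels: among independent tasks within one DAG run, among runs of the same DAG, and among different DAGs.
   - Example: Simple DAG in Apache Airflow (two tasks in sequence: `task1` must finish before `task2` starts)
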
```python
from datetime import datetime, timedelta

from airflow import DAG
# In Airflow 2.x this module path is deprecated in favor of `airflow.operators.python`.
from airflow.operators.python_operator import PythonOperator

# Arguments applied to every task in the DAG unless a task overrides them.
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2023, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'example_dag',
    default_args=default_args,
    description='A simple DAG example',
    schedule_interval=timedelta(days=1),
)


def task1():
    print("Task 1 executed")


def task2():
    print("Task 2 executed")


task1_operator = PythonOperator(
    task_id='task1',
    python_callable=task1,
    dag=dag,
)

task2_operator = PythonOperator(
    task_id='task2',
    python_callable=task2,
    dag=dag,
)

# task1 must succeed before task2 starts.
task1_operator >> task2_operator
```
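
Which tasks may run concurrently follows directly from the dependency graph. The sketch below uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+) rather than Airflow itself to group tasks into waves whose members have no unmet dependencies. The task names are hypothetical, and this only illustrates the idea; it does not describe how Airflow's scheduler is implemented.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies):
    """Group tasks into waves; tasks in the same wave have no unmet dependencies.

    `dependencies` maps each task to the set of tasks it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task runnable right now
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished, unblocking downstream tasks
    return waves


if __name__ == "__main__":
    # Hypothetical DAG: extract fans out to two transforms, which join at load.
    dag = {
        "transform_a": {"extract"},
        "transform_b": {"extract"},
        "load": {"transform_a", "transform_b"},
    }
    print(execution_waves(dag))
```

For a two-task chain such as `task1 >> task2`, this yields two waves of one task each, so nothing can run in parallel. Parallelism only becomes possible when tasks share no dependency path, as with the two transforms in the hypothetical graph.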

This guide gives a concise overview of concurrency and parallelism, with Python examples of multithreading and multiprocessing and a simple DAG in Apache Airflow. It highlights the key difference between the two (interleaved progress versus simultaneous execution) and shows how pipelines and DAGs use concurrency to execute tasks efficiently.