Concurrency and Parallelism: A Technical Primer
-
Concurrency:
- Definition: Concurrency is the ability of a system to make progress on multiple tasks or processes over overlapping time periods, without necessarily executing them at the same instant.
- Key Points:
  - Tasks can progress independently and interleave their execution.
  - Concurrency improves responsiveness and resource utilization: while one task waits (for example, on I/O), another can run.
  - Can be achieved on a single processing unit by switching between tasks.
- Example: Multithreading in Python
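
    import threading

    def task1():
        print("Task 1 started")
        # Perform task 1
        print("Task 1 completed")

    def task2():
        print("Task 2 started")
        # Perform task 2
        print("Task 2 completed")

    # One thread per task. In standard CPython builds the GIL lets only one
    # thread run Python bytecode at a time, so the tasks interleave
    # (concurrency) rather than run in parallel.
    thread1 = threading.Thread(target=task1)
    thread2 = threading.Thread(target=task2)

    # Start both threads, then wait for both to finish.
    thread1.start()
    thread2.start()
    thread1.join()
    thread2.join()
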
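
The point about switching between tasks on a single processing unit can also be sketched without threads. The following is a minimal illustration using `asyncio`, where two coroutines take turns on one thread; the names `worker`, `main`, and `run_interleaved` are made up for this example and are not part of the original notes.

```python
import asyncio

async def worker(name, steps, log):
    """Record each step, then hand control back to the event loop."""
    for i in range(steps):
        log.append(f"{name} step {i}")
        await asyncio.sleep(0)  # cooperative switch point: lets the other task run

async def main():
    log = []
    # Both workers are scheduled on the same thread and take turns.
    await asyncio.gather(worker("A", 2, log), worker("B", 2, log))
    return log

def run_interleaved():
    return asyncio.run(main())

if __name__ == "__main__":
    for entry in run_interleaved():
        print(entry)
```

The steps of A and B alternate in the log even though only one thread is ever running, which is concurrency without parallelism.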
-
Parallelism:
- Definition: Parallelism refers to the actual simultaneous execution of multiple tasks or processes on different processing units or cores.
- Key Points:
  - Tasks are executed simultaneously on different processing units.
  - Parallelism requires hardware support, such as multiple processors or cores.
  - Aims to improve performance and solve problems faster by leveraging multiple computing resources.
- Example: Multiprocessing in Python
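
    import multiprocessing

    def task1():
        print("Task 1 started")
        # Perform task 1
        print("Task 1 completed")

    def task2():
        print("Task 2 started")
        # Perform task 2
        print("Task 2 completed")

    # The guard is needed on platforms that start child processes by spawning
    # a fresh interpreter (e.g. Windows and macOS), where this module is
    # re-imported in every child.
    if __name__ == "__main__":
        # Each Process runs in its own interpreter, so the OS can schedule
        # the two tasks on different cores.
        process1 = multiprocessing.Process(target=task1)
        process2 = multiprocessing.Process(target=task2)

        # Start both processes, then wait for both to finish.
        process1.start()
        process2.start()
        process1.join()
        process2.join()
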
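
To make the performance goal above more concrete, here is a rough sketch that splits a CPU-bound job across worker processes with `concurrent.futures.ProcessPoolExecutor`. The prime-counting workload and the helper names (`count_primes`, `split_range`, `parallel_count`) are assumptions chosen for illustration only; any independent, CPU-heavy chunks of work would do.

```python
from concurrent.futures import ProcessPoolExecutor

def count_primes(bounds):
    """Count primes in [start, stop) by trial division (deliberately CPU-bound)."""
    start, stop = bounds
    count = 0
    for n in range(max(start, 2), stop):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

def split_range(limit, parts):
    """Split [0, limit) into at most `parts` contiguous (start, stop) chunks."""
    step = -(-limit // parts)  # ceiling division
    return [(lo, min(lo + step, limit)) for lo in range(0, limit, step)]

def parallel_count(limit, workers=4):
    """Count primes below `limit`, one chunk per worker process."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_primes, split_range(limit, workers)))

if __name__ == "__main__":
    print(parallel_count(50_000))
```

Whether this actually runs faster than a single loop depends on the number of available cores and on the cost of starting processes and passing data between them, so it is worth measuring rather than assuming.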
-
Concurrency in Pipelines:
- Definition: Concurrency in pipelines allows multiple tasks or processes to progress independently through different stages of execution.
- Examples:
  - CI/CD Pipeline: Stages like code compilation, unit testing, and packaging can operate concurrently on different builds.
  - ETL Pipeline: Stages like data extraction, transformation, and loading can process different batches of data concurrently.
- Key Points:
  - Stages in a pipeline can operate concurrently on different data or tasks.
  - Concurrency improves throughput by keeping every stage busy: while one batch is being loaded, the next can be transformed and a third extracted.
  - Stages may or may not execute in parallel, depending on available resources and dependencies.
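
The ETL case above can be sketched with one thread per stage and a bounded queue between stages, so that each stage works on a different batch at the same time. This is a minimal sketch under assumed names (`extract`, `transform`, `load`, `run_pipeline`) with a placeholder transformation (doubling each value), not a production pipeline.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream

def extract(batches, out_q):
    """Stage 1: push each batch downstream, then signal completion."""
    for batch in batches:
        out_q.put(batch)
    out_q.put(SENTINEL)

def transform(in_q, out_q):
    """Stage 2: double every value (placeholder transformation)."""
    while (batch := in_q.get()) is not SENTINEL:
        out_q.put([x * 2 for x in batch])
    out_q.put(SENTINEL)

def load(in_q, sink):
    """Stage 3: append each transformed batch to the sink."""
    while (batch := in_q.get()) is not SENTINEL:
        sink.append(batch)

def run_pipeline(batches):
    # Bounded queues apply back-pressure if a downstream stage falls behind.
    q1, q2, sink = queue.Queue(maxsize=2), queue.Queue(maxsize=2), []
    stages = [
        threading.Thread(target=extract, args=(batches, q1)),
        threading.Thread(target=transform, args=(q1, q2)),
        threading.Thread(target=load, args=(q2, sink)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return sink

if __name__ == "__main__":
    print(run_pipeline([[1, 2], [3, 4], [5]]))
```

Because each stage is a single thread reading from a FIFO queue, batches leave the pipeline in the order they entered it. Swapping the threads for processes would turn the same structure into a parallel pipeline.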
-
DAGs (Directed Acyclic Graphs) in Apache Airflow:
- Definition: A DAG is a collection of tasks connected by directed dependency edges with no cycles, so each task has a well-defined set of upstream tasks that must finish before it can start.
- Key Points:
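  - Tasks in a DAG are units of work; tasks with no dependency path between them can be executed concurrently.
  - Airflow's scheduler can queue several ready tasks at once, and the executor runs them in parallel when resources allow.
  - Concurrency can be limited at several levels: total running tasks across the installation (parallelism), running tasks within one DAG (task concurrency), and simultaneous runs of the same DAG (DAG concurrency).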
- Example: Simple DAG in Apache Airflow

    from datetime import datetime, timedelta

    from airflow import DAG
    # Airflow 2.x import path; the older airflow.operators.python_operator
    # module is deprecated.
    from airflow.operators.python import PythonOperator

    default_args = {
        'owner': 'airflow',
        'depends_on_past': False,
        'start_date': datetime(2023, 1, 1),
        'email_on_failure': False,
        'email_on_retry': False,
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
    }

    dag = DAG(
        'example_dag',
        default_args=default_args,
        description='A simple DAG example',
        # Newer Airflow releases name this parameter `schedule`.
        schedule_interval=timedelta(days=1),
    )

    def task1():
        print("Task 1 executed")

    def task2():
        print("Task 2 executed")

    task1_operator = PythonOperator(
        task_id='task1',
        python_callable=task1,
        dag=dag,
    )

    task2_operator = PythonOperator(
        task_id='task2',
        python_callable=task2,
        dag=dag,
    )

    # task2 depends on task1, so within a single DAG run these two tasks
    # execute one after the other, not concurrently.
    task1_operator >> task2_operator

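
To see which tasks in a dependency graph could run at the same time, the graph can be grouped into "waves" of tasks whose upstream tasks have all finished. The sketch below uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+); it is an illustration of the idea only and is not how Airflow's scheduler is implemented. The function name `execution_waves` and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

def execution_waves(dependencies):
    """Group tasks into waves; every task in a wave has all upstream tasks done.

    `dependencies` maps each task name to the set of tasks it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks that could run concurrently now
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave finished, unlocking the next one
    return waves

if __name__ == "__main__":
    # Hypothetical DAG: one extract, two independent transforms, one load.
    deps = {
        "transform_a": {"extract"},
        "transform_b": {"extract"},
        "load": {"transform_a", "transform_b"},
    }
    for wave in execution_waves(deps):
        print(wave)
```

In this hypothetical graph, `transform_a` and `transform_b` land in the same wave because neither depends on the other, while a strictly chained pair such as `task1 >> task2` would always fall into separate waves.
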
This guide provides a concise overview of concurrency and parallelism, along with technical examples in Python for multithreading, multiprocessing, and a simple DAG in Apache Airflow. It highlights the key differences between concurrency and parallelism and illustrates how concurrency is utilized in pipelines and DAGs to enable efficient execution of tasks.