the_information_nexus/concurrency_parallelism.md at cfce9581dfae1b0180fa0198196702556feb6e6a

medusa b6685bed13 Add tech_docs/concurrency_parallelism.md

2024-06-18 22:16:20 +00:00

concurrency and parallelism, a technical primer

Concurrency:
- Definition: Concurrency is the ability of a system to make progress on multiple tasks during overlapping time periods, though not necessarily at the same instant.
- Key Points:
  - Tasks can progress independently and interleave their execution.
  - Concurrency improves responsiveness and resource utilization: while one task waits (for example, on I/O), another can run.
  - Can be achieved on a single processing unit by switching between tasks.
- Example: Multithreading in Python
```
import threading

def task1():
    print("Task 1 started")
    # Perform task 1
    print("Task 1 completed")

def task2():
    print("Task 2 started")
    # Perform task 2
    print("Task 2 completed")

thread1 = threading.Thread(target=task1)
thread2 = threading.Thread(target=task2)

thread1.start()
thread2.start()

thread1.join()
thread2.join()
```
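The benefit described above is easiest to see with work that mostly waits. The following is a minimal sketch, not a benchmark: `fetch` is a hypothetical stand-in for an I/O-bound call, and `time.sleep` simulates the wait. In standard CPython builds the global interpreter lock generally keeps threads from running Python bytecode in parallel, but blocking calls such as `time.sleep` release it, so the waits overlap.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(name, delay=0.2):
    """Hypothetical I/O-bound call; time.sleep stands in for a network or disk wait."""
    time.sleep(delay)
    return f"{name} done"


def run_concurrently(names):
    """Run fetch for every name on its own thread and time the whole batch."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        results = list(pool.map(fetch, names))  # map preserves input order
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = run_concurrently(["a", "b", "c", "d"])
    print(results, f"{elapsed:.2f}s")
```

Run sequentially, four calls would wait about 0.8 seconds in total; with one thread per call the batch should finish in roughly the time of a single wait.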
Parallelism:
- Definition: Parallelism refers to the actual simultaneous execution of multiple tasks or processes on different processing units or cores.
- Key Points:
  - Tasks are executed simultaneously on different processing units.
  - Parallelism requires hardware support, such as multiple processors or cores.
  - Aims to improve performance and solve problems faster by leveraging multiple computing resources.
- Example: Multiprocessing in Python
```
import multiprocessing

def task1():
    print("Task 1 started")
    # Perform task 1
    print("Task 1 completed")

def task2():
    print("Task 2 started")
    # Perform task 2
    print("Task 2 completed")

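if __name__ == "__main__":
    # The guard is required on platforms that spawn child processes
    # (e.g. Windows, macOS), because each child re-imports this module.
    process1 = multiprocessing.Process(target=task1)
    process2 = multiprocessing.Process(target=task2)

    process1.start()
    process2.start()

    # Wait for both processes to finish
    process1.join()
    process2.join()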
```
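For CPU-bound work, a pool of worker processes is a common way to spread tasks across cores. The sketch below assumes a hypothetical workload, `count_primes`, chosen only because it keeps a core busy; the helper name `parallel_count` and the worker count are also illustrative. Whether the tasks really run in parallel depends on how many cores the machine has.

```python
from concurrent.futures import ProcessPoolExecutor


def count_primes(limit):
    """CPU-bound work: count the primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def parallel_count(limits, workers=2):
    """Run count_primes for each limit in a pool of worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_primes, limits))


if __name__ == "__main__":
    print(parallel_count([10_000, 20_000, 30_000, 40_000]))
```

Each worker is a separate Python process with its own interpreter, so the tasks are not serialized by the interpreter lock the way CPU-bound threads are.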
Concurrency in Pipelines:
- Definition: Concurrency in pipelines allows multiple tasks or processes to progress independently through different stages of execution.
- Examples:
  - CI/CD Pipeline: Stages like code compilation, unit testing, and packaging can operate concurrently on different builds.
  - ETL Pipeline: Stages like data extraction, transformation, and loading can process different batches of data concurrently.
- Key Points:
  - Stages in a pipeline can operate concurrently on different data or tasks.
  - Concurrency improves throughput because stages do not sit idle: while one batch is being loaded, the next can already be transformed.
  - Stages may or may not execute in parallel, depending on available resources and dependencies.
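The pipeline idea above can be sketched with one thread per stage and a queue between stages. This is a minimal illustration under assumptions: the stage functions (doubling each value, then summing a batch) are hypothetical placeholders for real transform and load steps, and `SENTINEL` is just a marker object used here to signal the end of the stream.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream


def stage(func, inbox, outbox):
    """Apply func to each item from inbox and pass the result downstream."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown to the next stage
            break
        outbox.put(func(item))


def run_pipeline(batches):
    """Feed batches through a transform stage and a load stage that run concurrently."""
    raw, transformed, loaded = queue.Queue(), queue.Queue(), queue.Queue()
    workers = [
        # Hypothetical transform: double every value in the batch
        threading.Thread(target=stage, args=(lambda b: [x * 2 for x in b], raw, transformed)),
        # Hypothetical load: reduce each batch to its sum
        threading.Thread(target=stage, args=(sum, transformed, loaded)),
    ]
    for worker in workers:
        worker.start()
    for batch in batches:  # the "extract" step: feed batches in one by one
        raw.put(batch)
    raw.put(SENTINEL)
    for worker in workers:
        worker.join()

    results = []
    while (item := loaded.get()) is not SENTINEL:
        results.append(item)
    return results


if __name__ == "__main__":
    print(run_pipeline([[1, 2], [3, 4], [5]]))  # [6, 14, 10]
```

While the load stage is summing the first batch, the transform stage can already work on the second. With a single thread per stage and FIFO queues, results come out in input order.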

DAGs (Directed Acyclic Graphs) in Apache Airflow:

- Definition: A DAG represents a collection of tasks organized by their dependencies: each edge means one task must finish before another starts, and the graph contains no cycles.
- Key Points:
  - Tasks are units of work; tasks with no dependency path between them can be executed concurrently.
  - Airflow's scheduler can execute multiple tasks in parallel, depending on the executor and available resources.
  - Concurrency applies at several levels: between tasks within one DAG run, between runs of the same DAG, and between different DAGs.
- Example: Simple DAG in Apache Airflow

```
from datetime import datetime, timedelta

from airflow import DAG
# Airflow 2.x import path; the older airflow.operators.python_operator module is deprecated.
from airflow.operators.python import PythonOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2023, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'example_dag',
    default_args=default_args,
    description='A simple DAG example',
    # Airflow 2.4+ prefers the `schedule` argument over `schedule_interval`.
    schedule_interval=timedelta(days=1),
)

def task1():
    print("Task 1 executed")

def task2():
    print("Task 2 executed")

task1_operator = PythonOperator(
    task_id='task1',
    python_callable=task1,
    dag=dag,
)

task2_operator = PythonOperator(
    task_id='task2',
    python_callable=task2,
    dag=dag,
)

# task2 depends on task1, so these two tasks run sequentially, not concurrently
task1_operator >> task2_operator
```
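To see how dependencies determine which tasks may run at the same time, the sketch below groups a small DAG into "waves" using the standard library's `graphlib.TopologicalSorter` (Python 3.9+). This is not how Airflow's scheduler is implemented; it only illustrates the idea, and the task names and the `execution_waves` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def execution_waves(dependencies):
    """Group tasks into waves; tasks in one wave have all their dependencies met."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the graph contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorter.get_ready()   # every task whose dependencies are done
        waves.append(sorted(ready))  # sorted only to make the output stable
        sorter.done(*ready)
    return waves


# Hypothetical DAG: each key maps a task to the set of tasks it depends on.
deps = {
    "transform_a": {"extract"},
    "transform_b": {"extract"},
    "load": {"transform_a", "transform_b"},
}

if __name__ == "__main__":
    print(execution_waves(deps))
    # [['extract'], ['transform_a', 'transform_b'], ['load']]
```

Here `transform_a` and `transform_b` land in the same wave because neither depends on the other, so a scheduler would be free to run them concurrently, or in parallel if resources allow.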

This guide provides a concise overview of concurrency and parallelism, along with technical examples in Python for multithreading, multiprocessing, and a simple DAG in Apache Airflow. It highlights the key differences between concurrency and parallelism and illustrates how concurrency is utilized in pipelines and DAGs to enable efficient execution of tasks.
