## Concurrency and Parallelism: A Technical Primer

1. Concurrency:
   - Definition: Concurrency is the ability of a system to make progress on multiple tasks over overlapping time periods, without necessarily executing them at the same instant.
   - Key Points:
     - Tasks can progress independently and interleave their execution.
     - Concurrency improves responsiveness and resource utilization by letting one task run while another is waiting, for example on I/O.
     - Can be achieved on a single processing unit by switching between tasks.
   - Example: Multithreading in Python

   ```python
   import threading

   def task1():
       print("Task 1 started")
       # Perform task 1
       print("Task 1 completed")

   def task2():
       print("Task 2 started")
       # Perform task 2
       print("Task 2 completed")

   # Create one thread per task
   thread1 = threading.Thread(target=task1)
   thread2 = threading.Thread(target=task2)

   # Start both threads; their output may interleave
   thread1.start()
   thread2.start()

   # Wait for both threads to finish
   thread1.join()
   thread2.join()
   ```

2. Parallelism:
   - Definition: Parallelism is the actual simultaneous execution of multiple tasks or processes on different processing units or cores.
   - Key Points:
     - Tasks are executed at the same instant on different processing units.
     - Parallelism requires hardware support, such as multiple processors or cores.
     - Aims to improve performance and solve problems faster by leveraging multiple computing resources.
   - Example: Multiprocessing in Python

   ```python
   import multiprocessing

   def task1():
       print("Task 1 started")
       # Perform task 1
       print("Task 1 completed")

   def task2():
       print("Task 2 started")
       # Perform task 2
       print("Task 2 completed")

   # The guard is required on platforms that start child processes with
   # "spawn" (Windows, macOS), where the module is re-imported in each child.
   if __name__ == "__main__":
       process1 = multiprocessing.Process(target=task1)
       process2 = multiprocessing.Process(target=task2)

       # Each process can be scheduled on a different core
       process1.start()
       process2.start()

       # Wait for both processes to finish
       process1.join()
       process2.join()
   ```

3. Concurrency in Pipelines:
   - Definition: Concurrency in pipelines allows multiple tasks or processes to progress independently through different stages of execution.
   - Examples:
     - CI/CD Pipeline: Stages like code compilation, unit testing, and packaging can operate concurrently on different builds.
     - ETL Pipeline: Stages like data extraction, transformation, and loading can process different batches of data concurrently.
   - Key Points:
     - Stages in a pipeline can operate concurrently on different data or tasks.
     - Concurrency improves throughput because a stage can start on the next item while later stages are still working on earlier ones.
     - Stages may or may not execute in parallel, depending on available resources and dependencies.

4. DAGs (Directed Acyclic Graphs) in Apache Airflow:
   - Definition: A DAG represents a collection of tasks organized based on their dependencies and relationships.
   - Key Points:
     - Tasks in a DAG are units of work; tasks with no dependency between them can be executed concurrently, while dependent tasks run in order.
     - Airflow can execute multiple tasks in parallel, depending on the configured executor and available resources.
     - DAGs enable task parallelism, task concurrency, and DAG concurrency.
   - Example: Simple DAG in Apache Airflow

   ```python
   from datetime import datetime, timedelta

   from airflow import DAG
   # Airflow 2.x import path; the older airflow.operators.python_operator
   # module is deprecated.
   from airflow.operators.python import PythonOperator

   default_args = {
       'owner': 'airflow',
       'depends_on_past': False,
       'start_date': datetime(2023, 1, 1),
       'email_on_failure': False,
       'email_on_retry': False,
       'retries': 1,
       'retry_delay': timedelta(minutes=5),
   }

   dag = DAG(
       'example_dag',
       default_args=default_args,
       description='A simple DAG example',
       # Run once per day (newer Airflow releases name this argument `schedule`)
       schedule_interval=timedelta(days=1),
   )

   def task1():
       print("Task 1 executed")

   def task2():
       print("Task 2 executed")

   task1_operator = PythonOperator(
       task_id='task1',
       python_callable=task1,
       dag=dag,
   )

   task2_operator = PythonOperator(
       task_id='task2',
       python_callable=task2,
       dag=dag,
   )

   # task2 depends on task1, so these two tasks run sequentially
   task1_operator >> task2_operator
   ```

This guide gives a concise overview of concurrency and parallelism, with Python examples for multithreading, multiprocessing, and a simple DAG in Apache Airflow. It highlights the key difference between the two (overlapping progress versus simultaneous execution) and shows how concurrency is used in pipelines and DAGs to execute tasks efficiently.
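The pipeline concurrency described in section 3 has no code example above, so here is a minimal sketch of an ETL-style pipeline in Python. It assumes three stages connected by bounded queues, with one thread per stage. The names `extract`, `transform`, `load`, `run_pipeline`, and the `_DONE` marker are hypothetical and chosen for illustration, and the "transformation" (doubling each value) is a stand-in for real work.

```python
import queue
import threading

# Hypothetical end-of-stream marker passed from stage to stage.
_DONE = object()

def extract(batches, out_q):
    """Stage 1: hand each input batch to the next stage."""
    for batch in batches:
        out_q.put(batch)
    out_q.put(_DONE)

def transform(in_q, out_q):
    """Stage 2: double every value (placeholder transformation)."""
    while True:
        batch = in_q.get()
        if batch is _DONE:
            out_q.put(_DONE)  # propagate shutdown downstream
            break
        out_q.put([value * 2 for value in batch])

def load(in_q, sink):
    """Stage 3: append each transformed batch to the sink list."""
    while True:
        batch = in_q.get()
        if batch is _DONE:
            break
        sink.append(batch)

def run_pipeline(batches):
    # Bounded queues let a fast stage block instead of piling up work.
    q1 = queue.Queue(maxsize=2)
    q2 = queue.Queue(maxsize=2)
    sink = []
    stages = [
        threading.Thread(target=extract, args=(batches, q1)),
        threading.Thread(target=transform, args=(q1, q2)),
        threading.Thread(target=load, args=(q2, sink)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return sink

if __name__ == "__main__":
    print(run_pipeline([[1, 2], [3, 4], [5, 6]]))
    # → [[2, 4], [6, 8], [10, 12]]
```

While `load` is storing the first batch, `transform` can already be working on the second and `extract` on the third, which is the stage-level concurrency the key points describe. Because each stage has a single worker and the queues are FIFO, batch order is preserved. Whether the stages also run in parallel depends on the runtime and hardware; in CPython, threads running pure-Python code take turns rather than executing simultaneously.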