concurrency and parallelism, a technical primer

  1. Concurrency:

    • Definition: Concurrency is the ability of a system to make progress on multiple tasks over overlapping time periods, not necessarily by executing them at the same instant.
    • Key Points:
      • Tasks can progress independently and interleave their execution.
      • Concurrency can improve responsiveness and resource utilization, because one task can run while another waits (for example, on I/O).
      • Can be achieved on a single processing unit by switching between tasks.
    • Example: Multithreading in Python
      import threading
      
      def task1():
          print("Task 1 started")
          # Perform task 1
          print("Task 1 completed")
      
      def task2():
          print("Task 2 started")
          # Perform task 2
          print("Task 2 completed")
      
      thread1 = threading.Thread(target=task1)
      thread2 = threading.Thread(target=task2)
      
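      # Start both threads; the interpreter interleaves their execution
      thread1.start()
      thread2.start()
      
      # Block until both threads have finished
      thread1.join()
      thread2.join()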
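
The point that concurrency can be achieved on a single processing unit can be sketched with `asyncio`, which runs every coroutine on one thread and switches between them at each `await`. This is a minimal illustration rather than part of the original examples; the names `worker`, `main`, and `run_interleaved` are hypothetical.

```python
import asyncio


async def worker(name, steps, log):
    """Record each step, then hand control back to the event loop."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        # Yield to the event loop so the other coroutine can take a turn
        await asyncio.sleep(0)


async def main():
    log = []
    # Both workers run on the same thread; only one executes at any instant
    await asyncio.gather(worker("A", 3, log), worker("B", 3, log))
    return log


def run_interleaved():
    """Run both workers to completion and return the order of their steps."""
    return asyncio.run(main())


if __name__ == "__main__":
    print(run_interleaved())
```

The returned log alternates between the two workers, which shows tasks interleaving without any of them running in parallel.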
      
  2. Parallelism:

    • Definition: Parallelism refers to the actual simultaneous execution of multiple tasks or processes on different processing units or cores.
    • Key Points:
      • Tasks are executed simultaneously on different processing units.
      • Parallelism requires hardware support, such as multiple processors or cores.
      • Aims to improve performance and solve problems faster by leveraging multiple computing resources.
    • Example: Multiprocessing in Python
      import multiprocessing
      
      def task1():
          print("Task 1 started")
          # Perform task 1
          print("Task 1 completed")
      
      def task2():
          print("Task 2 started")
          # Perform task 2
          print("Task 2 completed")
      
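      # The guard keeps child processes from re-running this block on
      # platforms that start workers by re-importing the module
      if __name__ == "__main__":
          process1 = multiprocessing.Process(target=task1)
          process2 = multiprocessing.Process(target=task2)
      
          # Each process has its own interpreter and can run on its own core
          process1.start()
          process2.start()
      
          # Block until both processes have exited
          process1.join()
          process2.join()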
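
To make the performance motivation concrete, the following sketch splits a CPU-bound job across worker processes with `multiprocessing.Pool`. It assumes a toy workload (counting primes by trial division); the helper names `count_primes`, `split_range`, and `parallel_count_primes` are hypothetical and chosen only for illustration.

```python
import multiprocessing


def count_primes(bounds):
    """Count primes in the half-open range [lo, hi) by trial division."""
    lo, hi = bounds
    count = 0
    for n in range(max(lo, 2), hi):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def split_range(lo, hi, parts):
    """Split [lo, hi) into at most `parts` contiguous sub-ranges."""
    step = max(1, (hi - lo + parts - 1) // parts)
    return [(start, min(start + step, hi)) for start in range(lo, hi, step)]


def parallel_count_primes(lo, hi, workers=2):
    """Count primes in [lo, hi) with one chunk per worker process."""
    chunks = split_range(lo, hi, workers)
    with multiprocessing.Pool(processes=workers) as pool:
        # Each chunk is handled by a separate process, possibly on its own core
        return sum(pool.map(count_primes, chunks))


if __name__ == "__main__":
    print(parallel_count_primes(0, 100_000, workers=4))
```

Whether this is actually faster than a single process depends on the number of available cores and on the cost of starting processes and passing data between them.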
      
  3. Concurrency in Pipelines:

    • Definition: Concurrency in pipelines allows multiple tasks or processes to progress independently through different stages of execution.
    • Examples:
      • CI/CD Pipeline: Stages like code compilation, unit testing, and packaging can operate concurrently on different builds.
      • ETL Pipeline: Stages like data extraction, transformation, and loading can process different batches of data concurrently.
    • Key Points:
      • Stages in a pipeline can operate concurrently on different data or tasks.
      • Overlapping stages raises throughput, because no stage has to sit idle while another stage works on a different batch.
      • Stages may or may not execute in parallel, depending on available resources and dependencies.
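
The ETL case above can be sketched as three stages connected by queues, each stage running in its own thread so that different batches are in different stages at the same time. This is a minimal sketch under stated assumptions: the stage functions, the `SENTINEL` marker, and the doubling "transformation" are all hypothetical placeholders.

```python
import queue
import threading

SENTINEL = object()  # Marks the end of the stream for downstream stages


def extract(batches, out_q):
    """Stage 1: emit each input batch, then signal completion."""
    for batch in batches:
        out_q.put(batch)
    out_q.put(SENTINEL)


def transform(in_q, out_q):
    """Stage 2: double every value in each batch (placeholder logic)."""
    while (batch := in_q.get()) is not SENTINEL:
        out_q.put([x * 2 for x in batch])
    out_q.put(SENTINEL)


def load(in_q, sink):
    """Stage 3: append transformed values to the destination list."""
    while (batch := in_q.get()) is not SENTINEL:
        sink.extend(batch)


def run_pipeline(batches):
    # Bounded queues apply backpressure if a downstream stage falls behind
    q1, q2 = queue.Queue(maxsize=2), queue.Queue(maxsize=2)
    sink = []
    stages = [
        threading.Thread(target=extract, args=(batches, q1)),
        threading.Thread(target=transform, args=(q1, q2)),
        threading.Thread(target=load, args=(q2, sink)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return sink


if __name__ == "__main__":
    print(run_pipeline([[1, 2], [3], [4, 5, 6]]))
```

Because each stage is a single thread reading from a FIFO queue, batch order is preserved from input to output even though the stages overlap in time.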
  4. DAGs (Directed Acyclic Graphs) in Apache Airflow:

    • Definition: A DAG is a collection of tasks connected by directed dependency edges with no cycles, so every task has a well-defined set of upstream tasks.
    • Key Points:
      • Tasks with no dependency path between them can be executed concurrently.
      • Airflow's scheduler queues tasks whose upstream dependencies are met, and the configured executor runs them, in parallel when resources allow.
      • Airflow limits concurrency at several levels: tasks running across the whole installation, tasks running within one DAG, and simultaneous runs of the same DAG.
    • Example: Simple DAG in Apache Airflow
      from airflow import DAG
      # Airflow 2.x import path; airflow.operators.python_operator is deprecated
      from airflow.operators.python import PythonOperator
      from datetime import datetime, timedelta
      
      default_args = {
          'owner': 'airflow',
          'depends_on_past': False,
          'start_date': datetime(2023, 1, 1),
          'email_on_failure': False,
          'email_on_retry': False,
          'retries': 1,
          'retry_delay': timedelta(minutes=5),
      }
      
      dag = DAG(
          'example_dag',
          default_args=default_args,
          description='A simple DAG example',
          schedule_interval=timedelta(days=1),
      )
      
      def task1():
          print("Task 1 executed")
      
      def task2():
          print("Task 2 executed")
      
      task1_operator = PythonOperator(
          task_id='task1',
          python_callable=task1,
          dag=dag,
      )
      
      task2_operator = PythonOperator(
          task_id='task2',
          python_callable=task2,
          dag=dag,
      )
      
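      # task2 runs only after task1 succeeds; tasks with no dependency
      # between them would be eligible to run concurrently
      task1_operator >> task2_operator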
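
How a dependency graph determines which tasks may run concurrently can be sketched without Airflow, using the standard-library `graphlib` module (Python 3.9+). This is not Airflow's API or its actual scheduling algorithm; the function `execution_waves` and the task names are hypothetical.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies):
    """Group tasks into waves; tasks in the same wave have no unmet
    dependencies and could therefore run concurrently.

    `dependencies` maps each task name to the set of tasks it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # Raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        # All tasks whose upstream tasks are already marked done
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    dag = {
        "transform_a": {"extract"},
        "transform_b": {"extract"},
        "load": {"transform_a", "transform_b"},
    }
    for number, wave in enumerate(execution_waves(dag), start=1):
        print(number, wave)
```

In this sketch `transform_a` and `transform_b` land in the same wave because neither depends on the other, while `load` waits for both.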
      

This guide provides a concise overview of concurrency and parallelism, along with technical examples in Python for multithreading, multiprocessing, and a simple DAG in Apache Airflow. It highlights the key differences between concurrency and parallelism and illustrates how concurrency is utilized in pipelines and DAGs to enable efficient execution of tasks.