From b6685bed13402b92ec4f528855270621c81afe59 Mon Sep 17 00:00:00 2001
From: medusa
Date: Tue, 18 Jun 2024 22:16:20 +0000
Subject: [PATCH] Add tech_docs/concurrency_parallelism.md

---
 tech_docs/concurrency_parallelism.md | 123 +++++++++++++++++++++++++++
 1 file changed, 123 insertions(+)
 create mode 100644 tech_docs/concurrency_parallelism.md

diff --git a/tech_docs/concurrency_parallelism.md b/tech_docs/concurrency_parallelism.md
new file mode 100644
index 0000000..b905a08
--- /dev/null
+++ b/tech_docs/concurrency_parallelism.md
@@ -0,0 +1,123 @@
+## concurrency and parallelism, a technical primer
+
+1. Concurrency:
+   - Definition: Concurrency is the ability of a system to make progress on multiple tasks during overlapping time periods, without necessarily executing them at the same instant.
+   - Key Points:
+     - Tasks can progress independently and interleave their execution.
+     - Concurrency improves responsiveness and resource utilization: while one task waits (for example, on I/O), another can run.
+     - It can be achieved on a single processing unit by switching between tasks.
+   - Example: Multithreading in Python (in standard CPython builds, the GIL lets threads interleave but not run Python bytecode simultaneously)
+     ```python
+     import threading
+
+     def task1():
+         print("Task 1 started")
+         # Perform task 1 (for example, an I/O-bound operation)
+         print("Task 1 completed")
+
+     def task2():
+         print("Task 2 started")
+         # Perform task 2 (for example, an I/O-bound operation)
+         print("Task 2 completed")
+
+     thread1 = threading.Thread(target=task1)
+     thread2 = threading.Thread(target=task2)
+
+     thread1.start()  # both threads now run concurrently with the main thread
+     thread2.start()
+
+     thread1.join()  # wait for both threads to finish
+     thread2.join()
+     ```
+
+2. Parallelism:
+   - Definition: Parallelism is the simultaneous execution of multiple tasks or processes on different processing units or cores.
+   - Key Points:
+     - Tasks are executed at the same instant on different processing units.
+     - Parallelism requires hardware support, such as multiple processors or cores.
+     - It aims to improve performance and solve problems faster by using multiple computing resources.
+   - Example: Multiprocessing in Python
+     ```python
+     import multiprocessing
+
+     def task1():
+         print("Task 1 started")
+         # Perform task 1 (for example, a CPU-bound computation)
+         print("Task 1 completed")
+
+     def task2():
+         print("Task 2 started")
+         # Perform task 2 (for example, a CPU-bound computation)
+         print("Task 2 completed")
+
+     if __name__ == "__main__":  # guard needed under the "spawn" start method (default on Windows and macOS)
+         process1 = multiprocessing.Process(target=task1)
+         process2 = multiprocessing.Process(target=task2)
+
+         process1.start()  # each process can run on its own core
+         process2.start()
+         process1.join()  # wait for both processes to finish
+         process2.join()
+     ```
+
+3. Concurrency in Pipelines:
+   - Definition: Concurrency in pipelines lets multiple work items progress independently through different stages of execution at the same time.
+   - Examples:
+     - CI/CD Pipeline: Stages like code compilation, unit testing, and packaging can operate concurrently on different builds.
+     - ETL Pipeline: Stages like data extraction, transformation, and loading can process different batches of data concurrently.
+   - Key Points:
+     - Stages in a pipeline can operate concurrently on different data or tasks.
+     - Concurrency improves throughput by keeping every stage busy instead of waiting for one item to clear the whole pipeline.
+     - Stages may or may not execute in parallel, depending on available resources and dependencies.
+
+4. DAGs (Directed Acyclic Graphs) in Apache Airflow:
+   - Definition: A DAG is a collection of tasks organized by their dependencies, with directed edges and no cycles.
+   - Key Points:
+     - Tasks in a DAG are units of work; tasks with no dependency path between them can be executed concurrently.
+     - Airflow's scheduler can queue multiple ready tasks at once, and the executor runs them in parallel when resources allow.
+     - Concurrency applies at several levels: between tasks in one DAG run, between runs of the same DAG, and between different DAGs.
+   - Example: Simple DAG in Apache Airflow
+     ```python
+     from airflow import DAG
+     from airflow.operators.python import PythonOperator  # Airflow 2.x path; airflow.operators.python_operator is the deprecated 1.x location
+     from datetime import datetime, timedelta
+
+     default_args = {
+         'owner': 'airflow',
+         'depends_on_past': False,
+         'start_date': datetime(2023, 1, 1),
+         'email_on_failure': False,
+         'email_on_retry': False,
+         'retries': 1,
+         'retry_delay': timedelta(minutes=5),
+     }
+
+     dag = DAG(
+         'example_dag',
+         default_args=default_args,
+         description='A simple DAG example',
+         schedule_interval=timedelta(days=1),  # named `schedule` in Airflow 2.4 and later
+     )
+
+     def task1():
+         print("Task 1 executed")
+
+     def task2():
+         print("Task 2 executed")
+
+     task1_operator = PythonOperator(
+         task_id='task1',
+         python_callable=task1,
+         dag=dag,
+     )
+
+     task2_operator = PythonOperator(
+         task_id='task2',
+         python_callable=task2,
+         dag=dag,
+     )
+
+     task1_operator >> task2_operator  # task2 runs only after task1 succeeds
+     ```
+
+This guide provides a concise overview of concurrency and parallelism, with Python examples of multithreading, multiprocessing, and a simple DAG in Apache Airflow. It highlights the key difference between the two (concurrency interleaves progress on several tasks, parallelism executes them at the same instant) and illustrates how pipelines and DAGs use concurrency to execute tasks efficiently.
\ No newline at end of file