diff --git a/docs/tech_docs/python/pandas.md b/docs/tech_docs/python/pandas.md
new file mode 100644
index 0000000..27e2b71
--- /dev/null
+++ b/docs/tech_docs/python/pandas.md
@@ -0,0 +1,139 @@
+The `pandas` library is a staple for data scientists, analysts, and anyone working with tabular data in Python. It provides high-performance, easy-to-use data structures and data analysis tools. Below is a concise reference guide for common `pandas` use cases.
+
+# `pandas` Reference Guide
+
+## Installation
+```
+pip install pandas
+```
+
+## Basic Concepts
+
+### Importing pandas
+```python
+import pandas as pd
+```
+
+### Data Structures
+- **Series**: One-dimensional labeled array.
+- **DataFrame**: Two-dimensional, size-mutable, potentially heterogeneous tabular data with labeled axes.
+
+## Creating DataFrames
+```python
+# From a dictionary (keys become column names)
+df = pd.DataFrame({
+    'A': [1, 2, 3],
+    'B': ['a', 'b', 'c']
+})
+
+# From a list of lists (each inner list is one row)
+df = pd.DataFrame([
+    [1, 'a'],
+    [2, 'b'],
+    [3, 'c']
+], columns=['A', 'B'])
+```
+
+## Reading Data
+```python
+# Read from CSV
+df = pd.read_csv('filename.csv')
+
+# Read from Excel (.xlsx files need an engine such as openpyxl installed)
+df = pd.read_excel('filename.xlsx')
+
+# Other formats include: read_sql, read_json, read_html, read_clipboard, read_pickle, etc.
+```
+
+## Data Inspection
+```python
+# View the first n rows (default 5)
+df.head()
+
+# View the last n rows (default 5)
+df.tail()
+
+# Data summary: column dtypes, non-null counts, memory usage
+df.info()
+
+# Statistical summary for numerical columns
+df.describe()
+```
+
+## Data Selection
+```python
+# Select a column (returns a Series)
+df['A']
+
+# Select multiple columns (returns a DataFrame)
+df[['A', 'B']]
+
+# Select rows by position
+df.iloc[0]    # First row
+df.iloc[0:5]  # First five rows (end position is exclusive)
+
+# Select rows by label
+df.loc[0]    # Row with index label 0
+df.loc[0:5]  # Rows with index labels from 0 to 5, inclusive
+```
+
+## Data Manipulation
+```python
+# Add a new column (the list length must match the number of rows)
+df['C'] = [10, 20, 30]
+
+# Drop a column
+df.drop('C', axis=1, inplace=True)
+
+# Rename columns
+df.rename(columns={'A': 'Alpha', 'B': 'Beta'}, inplace=True)
+
+# Filter rows with a boolean mask
+filtered_df = df[df['Alpha'] > 1]
+
+# Apply a function to a column
+df['Alpha'] = df['Alpha'].apply(lambda x: x * 2)
+```
+
+## Handling Missing Data
+```python
+# Drop rows with any missing values (returns a new DataFrame; df is unchanged)
+clean_df = df.dropna()
+
+# Fill missing values (also returns a new DataFrame)
+filled_df = df.fillna(value=0)
+```
+
+## Grouping and Aggregating
+```python
+# Group by a column and calculate mean ('Beta' is column 'B' after the rename above)
+grouped_df = df.groupby('Beta').mean()
+
+# Multiple aggregation functions
+grouped_df = df.groupby('Beta').agg(['mean', 'sum'])
+```
+
+## Merging, Joining, and Concatenating
+```python
+# Concatenate DataFrames row-wise (df1 and df2 are assumed to be existing DataFrames)
+pd.concat([df1, df2])
+
+# Merge DataFrames on a 'key' column present in both
+pd.merge(df1, df2, on='key')
+
+# Join DataFrames: matches df1's 'key' column against df2's index
+df1.join(df2, on='key')
+```
+
+## Saving Data
+```python
+# Write to CSV (pass index=False to omit the index column)
+df.to_csv('filename.csv')
+
+# Write to Excel (.xlsx files need an engine such as openpyxl installed)
+df.to_excel('filename.xlsx')
+
+# Other formats include: to_sql, to_json, to_html, to_clipboard, to_pickle, etc.
+```
+
+`pandas` supports data cleaning, transformation, analysis, and basic plotting. This guide covers only the basics; the library offers far more for complex data manipulation and analysis tasks.
\ No newline at end of file