structure updates
297
tech_docs/python/pandas.md
Normal file
@@ -0,0 +1,297 @@
The `pandas` library is indispensable for data scientists, analysts, and anyone working with data in Python. It provides high-performance, easy-to-use data structures and data analysis tools. Below is a concise reference guide for common `pandas` use cases.

# `pandas` Reference Guide

## Installation

```
pip install pandas
```

## Basic Concepts

### Importing pandas

```python
import pandas as pd
```

### Data Structures

- **Series**: One-dimensional array with labels.
- **DataFrame**: Two-dimensional, size-mutable, potentially heterogeneous tabular data with labeled axes.

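The relationship between the two structures can be sketched as follows. The values, index labels, and the name `score` are made up for illustration.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels
s = pd.Series([10, 20, 30], index=['x', 'y', 'z'], name='score')

# A DataFrame: columns of possibly different dtypes sharing one row index
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})

# Selecting a single column of a DataFrame returns a Series
col = df['A']

print(type(col).__name__)  # Series
print(s['y'])              # 20
print(df.shape)            # (3, 2)
```
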
## Creating DataFrames

```python
# From a dictionary
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['a', 'b', 'c']
})

# From a list of lists
df = pd.DataFrame([
    [1, 'a'],
    [2, 'b'],
    [3, 'c']
], columns=['A', 'B'])
```

## Reading Data

```python
# Read from CSV
df = pd.read_csv('filename.csv')

# Read from Excel
df = pd.read_excel('filename.xlsx')

# Other formats include: read_sql, read_json, read_html, read_clipboard, read_pickle, etc.
```

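To try `read_csv` without a file on disk, one option is to pass an in-memory text buffer. The sketch below assumes made-up CSV content; the column names `id`, `name`, and `joined` are purely illustrative.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file
csv_text = "id,name,joined\n1,Ada,2024-01-15\n2,Grace,2024-02-20\n"

# read_csv accepts a path or any file-like object
df = pd.read_csv(
    io.StringIO(csv_text),
    index_col='id',          # use the 'id' column as the row index
    parse_dates=['joined'],  # parse this column as datetimes
)

print(df.dtypes)
```

`index_col` and `parse_dates` are two of the more commonly used options; the same call works with a file path in place of the buffer.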
## Data Inspection

```python
# View the first n rows (default 5)
df.head()

# View the last n rows (default 5)
df.tail()

# Data summary
df.info()

# Statistical summary for numerical columns
df.describe()
```

## Data Selection

```python
# Select a column
df['A']

# Select multiple columns
df[['A', 'B']]

# Select rows by position
df.iloc[0]  # First row
df.iloc[0:5]  # First five rows

# Select rows by label
df.loc[0]  # Row with index label 0
df.loc[0:5]  # Rows with index labels from 0 to 5, inclusive
```

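The difference between positional and label-based slicing is easy to miss when the index is the default `0..n-1`. A minimal sketch, assuming a small made-up frame with a non-default integer index:

```python
import pandas as pd

# Index labels deliberately differ from row positions
df = pd.DataFrame({'A': [1, 2, 3, 4]}, index=[10, 20, 30, 40])

by_position = df.iloc[0:2]  # positions 0 and 1; the end is exclusive
by_label = df.loc[10:30]    # labels 10, 20 and 30; the end is inclusive

# .loc also takes a row label and a column label together
cell = df.loc[20, 'A']

# ...or a boolean mask for rows plus a column label
masked = df.loc[df['A'] > 2, 'A']

print(len(by_position), len(by_label))  # 2 3
```
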
## Data Manipulation

```python
# Add a new column
df['C'] = [10, 20, 30]

# Drop a column
df.drop('C', axis=1, inplace=True)

# Rename columns
df.rename(columns={'A': 'Alpha', 'B': 'Beta'}, inplace=True)

# Filter rows
filtered_df = df[df['Alpha'] > 1]

# Apply a function to a column
df['Alpha'] = df['Alpha'].apply(lambda x: x * 2)
```

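The same steps can also be written without `inplace=True`, as a chain in which each call returns a new DataFrame. This is a sketch of one common style rather than the only approach; the data mirrors the small example frame used earlier.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})

result = (
    df.assign(C=[10, 20, 30])                       # add a column
      .rename(columns={'A': 'Alpha', 'B': 'Beta'})  # rename columns
      .drop(columns='C')                            # drop a column by name
      .query('Alpha > 1')                           # filter rows
)

print(result)
print(list(df.columns))  # ['A', 'B']: the original frame is unchanged
```
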
## Handling Missing Data

```python
# Drop rows with any missing values (returns a new DataFrame)
df.dropna()

# Fill missing values (returns a new DataFrame)
df.fillna(value=0)
```

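Because `dropna` and `fillna` return new objects by default, the result has to be assigned to be kept. A small sketch with made-up data containing gaps:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1.0, np.nan, 3.0],
    'B': ['x', 'y', None],
})

# Count missing values per column
n_missing = df.isna().sum()

# Drop rows with a missing value in any column
dropped = df.dropna()

# Drop rows only when column 'A' is missing
dropped_a = df.dropna(subset=['A'])

# Fill each column with its own replacement value
filled = df.fillna({'A': 0, 'B': 'missing'})

print(n_missing)
print(filled)
```
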
## Grouping and Aggregating

```python
# Group by a column and calculate the mean of the numeric columns
# ('B' was renamed to 'Beta' above)
grouped_df = df.groupby('Beta').mean(numeric_only=True)

# Multiple aggregation functions
grouped_df = df.groupby('Beta').agg(['mean', 'sum'])
```

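When a frame mixes numeric and text columns, it often helps to choose the column to aggregate explicitly. The sketch below uses named aggregation; the column names `team`, `points`, and `label` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'team': ['a', 'a', 'b'],
    'points': [10, 20, 5],
    'label': ['x', 'y', 'z'],  # text column, left out of the aggregation
})

# Mean of one column per group
means = df.groupby('team')['points'].mean()

# Named aggregation: output column = (input column, function)
summary = df.groupby('team').agg(
    total=('points', 'sum'),
    average=('points', 'mean'),
)

print(summary)
```
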
## Merging, Joining, and Concatenating

```python
# Concatenate DataFrames (stacks rows by default)
pd.concat([df1, df2])

# Merge DataFrames on a shared column
pd.merge(df1, df2, on='key')

# Join DataFrames: `on` names a column of df1 that is matched against df2's index
df1.join(df2.set_index('key'), on='key')
```

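The three operations differ mainly in how rows are matched. A minimal sketch, assuming two made-up frames that share a `key` column (`left_val` and `right_val` are illustrative names):

```python
import pandas as pd

df1 = pd.DataFrame({'key': [1, 2, 3], 'left_val': ['a', 'b', 'c']})
df2 = pd.DataFrame({'key': [2, 3, 4], 'right_val': ['x', 'y', 'z']})

# merge defaults to an inner join: only keys present in both frames
inner = pd.merge(df1, df2, on='key')

# how='left' keeps every row of df1, filling gaps with NaN
left = pd.merge(df1, df2, on='key', how='left')

# join matches df1's 'key' column against df2's index (left join by default)
joined = df1.join(df2.set_index('key'), on='key')

# concat stacks rows; ignore_index rebuilds a 0..n-1 index
stacked = pd.concat([df1, df1], ignore_index=True)

print(inner)
```
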
## Saving Data

```python
# Write to CSV
df.to_csv('filename.csv')

# Write to Excel
df.to_excel('filename.xlsx')

# Other formats include: to_sql, to_json, to_html, to_clipboard, to_pickle, etc.
```

`pandas` is well suited to data cleaning, transformation, analysis, and visualization. This guide covers the basics; the library offers far more for complex data manipulation and analysis tasks.

---

`pandas` offers high-performance, flexible, and expressive data structures designed to make working with "relational" or "labeled" data easy and intuitive, with data manipulation capabilities akin to those of R. Here is a concise reference guide for common `pandas` use cases, focused on data manipulation and cleaning tasks:

# `pandas` Reference Guide

## Installation

```
pip install pandas
```

## Basic Concepts

### Importing pandas

```python
import pandas as pd
```

### Series and DataFrame

- **Series**: One-dimensional labeled array capable of holding data of any type.
- **DataFrame**: Two-dimensional labeled data structure with columns of potentially different types.

## Creating DataFrames

```python
# From a dictionary
df = pd.DataFrame({
    'Column1': [1, 2, 3],
    'Column2': ['a', 'b', 'c']
})

# From a list of dictionaries
df = pd.DataFrame([
    {'Column1': 1, 'Column2': 'a'},
    {'Column1': 2, 'Column2': 'b'}
])

# From a CSV file
df = pd.read_csv('filename.csv')

# From an Excel file
df = pd.read_excel('filename.xlsx')
```

## Basic DataFrame Operations

### Viewing Data

```python
# View the first 5 rows
df.head()

# View the last 5 rows
df.tail()

# Summarize the index, column dtypes, non-null counts, and memory usage
df.info()
```

### Data Selection

```python
# Select a single column
df['Column1']

# Select multiple columns
df[['Column1', 'Column2']]

# Select rows by position
df.iloc[0]  # First row

# Select rows by label
df.loc[0]  # Row with index label 0
```

### Data Filtering

```python
# Rows where Column1 is greater than 1
df[df['Column1'] > 1]
```

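Filters can be combined with `&`, `|`, and `~`. Each condition needs its own parentheses because these operators bind more tightly than comparisons. A short sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'Column1': [1, 2, 3, 4],
    'Column2': ['a', 'b', 'c', 'a'],
})

# AND: both conditions must hold
both = df[(df['Column1'] > 1) & (df['Column2'] != 'c')]

# OR: at least one condition must hold
either = df[(df['Column1'] == 1) | (df['Column2'] == 'c')]

# Membership test against a list of allowed values
members = df[df['Column2'].isin(['a', 'b'])]

# NOT: invert a boolean mask with ~
negated = df[~df['Column2'].isin(['a'])]

print(both)
```
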
### Adding and Dropping Columns

```python
# Adding a new column
df['Column3'] = [4, 5, 6]

# Dropping a column
df.drop('Column3', axis=1, inplace=True)
```

### Renaming Columns

```python
df.rename(columns={'Column1': 'NewName1'}, inplace=True)
```

### Handling Missing Data

```python
# Drop rows with any missing values (returns a new DataFrame)
df.dropna()

# Fill missing values (returns a new DataFrame)
df.fillna(value=0)
```

## Data Manipulation

### Applying Functions

```python
# Apply a function to each item
df['Column1'] = df['Column1'].apply(lambda x: x * 2)
```

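For simple arithmetic, a vectorized expression gives the same result as `apply` and is usually faster on large columns. The sketch below compares the two and also shows a row-wise `apply`; the `Doubled` column is a hypothetical name.

```python
import pandas as pd

df = pd.DataFrame({'Column1': [1, 2, 3]})

# Element-wise function call, one Python call per value
via_apply = df['Column1'].apply(lambda x: x * 2)

# Vectorized arithmetic on the whole column at once
vectorized = df['Column1'] * 2

# apply with axis=1 passes each row to the function as a Series
row_sum = df.assign(Doubled=vectorized).apply(
    lambda row: row['Column1'] + row['Doubled'], axis=1
)

print(via_apply.equals(vectorized))  # True
```
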
### Grouping Data

```python
# Group by one column and calculate the mean of a numeric column
# (calling mean() on a text column raises a TypeError in pandas 2.x)
df.groupby('Column2')['Column1'].mean()
```

### Merging and Concatenating

```python
# Concatenate DataFrames
pd.concat([df1, df2])

# Merge DataFrames
pd.merge(df1, df2, on='key_column')
```

### Aggregating Data

```python
df.agg({
    'Column1': ['min', 'max', 'mean'],
    'Column2': ['sum']
})
```

## Working with Time Series

```python
# Convert column to datetime
df['DateColumn'] = pd.to_datetime(df['DateColumn'])

# Set the DateTime column as the index
df.set_index('DateColumn', inplace=True)

# Resample and aggregate by month
# ('ME' means month end in pandas 2.2+; older versions use 'M')
df.resample('ME').mean()
```

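The steps above can be tried end to end on a few made-up rows. This sketch uses the month-start alias `'MS'`, an assumption made here because that alias is spelled the same way in older and newer pandas versions; the `value` column and dates are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'DateColumn': ['2024-01-05', '2024-01-20', '2024-02-10'],
    'value': [10.0, 20.0, 30.0],
})

# Parse the strings into datetimes and use them as the index
df['DateColumn'] = pd.to_datetime(df['DateColumn'])
df = df.set_index('DateColumn')

# One row per calendar month, labeled by the first day of the month
monthly = df.resample('MS').mean()

print(monthly)
```
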
## Saving Data

```python
# Write to a CSV file
df.to_csv('new_file.csv')

# Write to an Excel file
df.to_excel('new_file.xlsx')
```

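By default `to_csv` also writes the row index as an extra first column, which then comes back as an unnamed column when the file is read again. A round-trip sketch using an in-memory buffer in place of a file, on made-up data:

```python
import io

import pandas as pd

df = pd.DataFrame({'Column1': [1, 2, 3], 'Column2': ['a', 'b', 'c']})

# index=False leaves the row index out of the output
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read the same text back into a new DataFrame
buffer.seek(0)
restored = pd.read_csv(buffer)

print(csv_text)
```
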
`pandas` is an indispensable tool for data munging and wrangling. Its high-level abstractions simplify tasks such as data filtering, transformation, and aggregation. This guide covers foundational operations only; the library extends well beyond them to support more complex data manipulation and analysis tasks.

Given its powerful and flexible data manipulation capabilities, `pandas` is a cornerstone library for anyone working with data in Python.