2024-05-01 12:28:44 -06:00

The `pandas` library is indispensable for data scientists, analysts, and anyone working with data in Python. It provides high-performance, easy-to-use data structures and data analysis tools. Below is a concise reference guide for common use cases with `pandas`:
# `pandas` Reference Guide
## Installation
```
pip install pandas
```
## Basic Concepts
### Importing pandas
```python
import pandas as pd
```
### Data Structures
- **Series**: One-dimensional array with labels.
- **DataFrame**: Two-dimensional, size-mutable, potentially heterogeneous tabular data with labeled axes.
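The two structures above can be sketched as follows. The values, labels, and the name `score` are made up for illustration.

```python
import pandas as pd

# A Series is a one-dimensional array of values plus an index of labels.
s = pd.Series([10, 20, 30], index=['x', 'y', 'z'], name='score')

# A DataFrame is a table whose columns each behave like a Series
# and all share the same row index.
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})

print(s['y'])         # label-based lookup on the Series
print(type(df['A']))  # selecting one column yields a Series
print(df.dtypes)      # each column keeps its own dtype
```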
## Creating DataFrames
```python
# From a dictionary
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['a', 'b', 'c']
})
# From a list of lists
df = pd.DataFrame([
    [1, 'a'],
    [2, 'b'],
    [3, 'c']
], columns=['A', 'B'])
```
## Reading Data
```python
# Read from CSV
df = pd.read_csv('filename.csv')
# Read from Excel
df = pd.read_excel('filename.xlsx')
# Other formats include: read_sql, read_json, read_html, read_clipboard, read_pickle, etc.
```
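As a self-contained sketch of `read_csv`, the example below reads from an in-memory string instead of a file so it runs without any data on disk. The column names and values are hypothetical; `parse_dates` and `dtype` are optional arguments shown here only to illustrate controlling column types at load time.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a path such as 'filename.csv'.
csv_text = "id,name,joined\n1,Ann,2024-01-15\n2,Bob,2024-02-01\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'id': 'int64'},    # force the dtype of a column
    parse_dates=['joined'],   # parse this column as datetimes
)

print(df.dtypes)
```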
## Data Inspection
```python
# View the first n rows (default 5)
df.head()
# View the last n rows (default 5)
df.tail()
# Data summary
df.info()
# Statistical summary for numerical columns
df.describe()
```
## Data Selection
```python
# Select a column
df['A']
# Select multiple columns
df[['A', 'B']]
# Select rows by position
df.iloc[0] # First row
df.iloc[0:5] # First five rows
# Select rows by label
df.loc[0] # Row with index label 0
df.loc[0:5] # Rows with index labels from 0 to 5, inclusive
```
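The difference between position-based and label-based selection is easier to see when the index is not the default `0, 1, 2, ...`. The sketch below assumes an illustrative index of `10, 20, 30`.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [1, 2, 3], 'B': ['a', 'b', 'c']},
    index=[10, 20, 30],  # labels chosen to differ from positions
)

first_by_position = df.iloc[0]  # row at position 0, whose label is 10
row_by_label = df.loc[20]       # row whose index label is 20

by_position = df.iloc[0:2]      # positions 0 and 1; the end is excluded
by_label = df.loc[10:20]        # labels 10 through 20; the end is included

cell = df.loc[30, 'B']          # single value by row label and column name
```

With this index, `df.loc[0]` would raise a `KeyError`, because no row carries the label `0`.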
## Data Manipulation
```python
# Add a new column
df['C'] = [10, 20, 30]
# Drop a column
df.drop('C', axis=1, inplace=True)
# Rename columns
df.rename(columns={'A': 'Alpha', 'B': 'Beta'}, inplace=True)
# Filter rows
filtered_df = df[df['Alpha'] > 1]
# Apply a function to a column
df['Alpha'] = df['Alpha'].apply(lambda x: x * 2)
```
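A few variations on the operations above, as a sketch: element-wise arithmetic usually gives the same result as `apply` with a simple lambda, filters can be combined with `&` and `|`, and `assign` and `rename` return new DataFrames when `inplace=True` is not used. The column name `C` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})

# Both lines double the column; the arithmetic form avoids a Python-level loop.
doubled_apply = df['A'].apply(lambda x: x * 2)
doubled_vectorized = df['A'] * 2

# Combine conditions with & (and) or | (or); each condition needs parentheses.
mask = (df['A'] > 1) & (df['B'] != 'c')
filtered = df[mask]

# Without inplace=True, these methods return a new DataFrame and leave df unchanged.
result = df.assign(C=df['A'] * 10).rename(columns={'A': 'Alpha'})
```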
## Handling Missing Data
```python
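# Drop rows with any missing values (returns a new DataFrame; df is unchanged)
df_dropped = df.dropna()
# Fill missing values with 0 (also returns a new DataFrame)
df_filled = df.fillna(value=0)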
```
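A small worked example may help here. It assumes a made-up DataFrame with one missing value per column and shows counting missing values, dropping rows based on a subset of columns, and filling each column with a different value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': ['a', 'b', None]})

# Count missing values per column.
n_missing = df.isna().sum()

# Drop rows that have a missing value in any column.
dropped = df.dropna()

# Drop rows only when column 'A' is missing.
dropped_subset = df.dropna(subset=['A'])

# Fill each column with its own replacement value.
filled = df.fillna({'A': 0, 'B': 'unknown'})
```

None of these calls modify `df`; each returns a new object.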
## Grouping and Aggregating
```python
# Group by a column and calculate the mean of the numeric columns
grouped_df = df.groupby('B').mean(numeric_only=True)
# Multiple aggregation functions on one numeric column
grouped_df = df.groupby('B')['A'].agg(['mean', 'sum'])
```
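As a sketch of grouping on data with repeated keys, the example below uses a made-up DataFrame and also shows named aggregation, which lets you choose the output column names. The names `label`, `a_mean`, `a_sum`, and `n` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'B': ['a', 'b', 'a', 'b'],
    'A': [1, 2, 3, 4],
    'label': ['w', 'x', 'y', 'z'],  # non-numeric column, skipped by numeric_only
})

# Mean of the numeric columns within each group.
means = df.groupby('B').mean(numeric_only=True)

# Named aggregation: output_column=(input_column, function).
summary = df.groupby('B').agg(
    a_mean=('A', 'mean'),
    a_sum=('A', 'sum'),
    n=('A', 'count'),
)

print(summary)
```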
## Merging, Joining, and Concatenating
```python
# Concatenate DataFrames
pd.concat([df1, df2])
# Merge DataFrames
pd.merge(df1, df2, on='key')
# Join DataFrames: match df1's 'key' column against df2's index
df1.join(df2.set_index('key'), on='key')
```
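The three operations behave differently, which a small example can make concrete. The frames `df1` and `df2` and their column names below are assumptions made for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'key': [1, 2, 3], 'left_val': ['a', 'b', 'c']})
df2 = pd.DataFrame({'key': [2, 3, 4], 'right_val': ['x', 'y', 'z']})

# concat stacks rows; ignore_index=True renumbers the result from 0.
stacked = pd.concat([df1, df1], ignore_index=True)

# merge defaults to an inner join: only keys present in both frames survive.
inner = pd.merge(df1, df2, on='key')

# how='left' keeps every row of df1 and fills unmatched columns with NaN.
left = pd.merge(df1, df2, on='key', how='left')

# join matches a column of the caller against the index of the other frame,
# and it is a left join by default.
joined = df1.join(df2.set_index('key'), on='key')
```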
## Saving Data
```python
# Write to CSV
df.to_csv('filename.csv')
# Write to Excel
df.to_excel('filename.xlsx')
# Other formats include: to_sql, to_json, to_html, to_clipboard, to_pickle, etc.
```
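By default `to_csv` also writes the row index as an extra first column. The sketch below writes to an in-memory buffer (standing in for a file path) with `index=False` and reads the result back to check the round trip.

```python
import io

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})

# Write to an in-memory buffer; a path such as 'filename.csv' works the same way.
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False omits the row index column
csv_text = buffer.getvalue()

# Read the text back into a DataFrame.
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
```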
`pandas` is well suited to data cleaning, transformation, analysis, and basic visualization. This guide covers the basics; the library offers far more for complex data manipulation and analysis tasks.
---
For high-performance, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive, `pandas` stands out as a crucial tool among Python data science libraries. It provides essential data manipulation capabilities akin to those found in programming languages like R. Here's a concise reference guide for common use cases with `pandas`, tailored to data manipulation and cleaning tasks:
# `pandas` Reference Guide
## Installation
```
pip install pandas
```
## Basic Concepts
### Importing pandas
```python
import pandas as pd
```
### Series and DataFrame
- **Series**: One-dimensional labeled array capable of holding data of any type.
- **DataFrame**: Two-dimensional labeled data structure with columns of potentially different types.
## Creating DataFrames
```python
# From a dictionary
df = pd.DataFrame({
    'Column1': [1, 2, 3],
    'Column2': ['a', 'b', 'c']
})
# From a list of dictionaries
df = pd.DataFrame([
    {'Column1': 1, 'Column2': 'a'},
    {'Column1': 2, 'Column2': 'b'}
])
# From a CSV file
df = pd.read_csv('filename.csv')
# From an Excel file
df = pd.read_excel('filename.xlsx')
```
## Basic DataFrame Operations
### Viewing Data
```python
# View the first 5 rows
df.head()
# View the last 5 rows
df.tail()
# Print a summary: index range, column names, non-null counts, dtypes, and memory usage
df.info()
```
### Data Selection
```python
# Select a single column
df['Column1']
# Select multiple columns
df[['Column1', 'Column2']]
# Select rows by position
df.iloc[0] # First row
# Select rows by label
df.loc[0] # Row with index label 0
```
### Data Filtering
```python
# Rows where Column1 is greater than 1
df[df['Column1'] > 1]
```
### Adding and Dropping Columns
```python
# Adding a new column
df['Column3'] = [4, 5, 6]
# Dropping a column
df.drop('Column3', axis=1, inplace=True)
```
### Renaming Columns
```python
df.rename(columns={'Column1': 'NewName1'}, inplace=True)
```
### Handling Missing Data
```python
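# Drop rows with any missing values (returns a new DataFrame; df is unchanged)
df_dropped = df.dropna()
# Fill missing values with 0 (also returns a new DataFrame)
df_filled = df.fillna(value=0)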
```
## Data Manipulation
### Applying Functions
```python
# Apply a function to each item
df['Column1'] = df['Column1'].apply(lambda x: x * 2)
```
### Grouping Data
```python
# Group by a column and calculate the mean of the numeric columns
df.groupby('Column1').mean(numeric_only=True)
```
### Merging and Concatenating
```python
# Concatenate DataFrames
pd.concat([df1, df2])
# Merge DataFrames
pd.merge(df1, df2, on='key_column')
```
### Aggregating Data
```python
df.agg({
    'Column1': ['min', 'max', 'mean'],
    'Column2': ['sum']
})
```
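The shape of the result is worth seeing once. In the sketch below, which assumes both columns are numeric, the output has one row per function name and one column per input column, with `NaN` wherever a function was not requested for that column.

```python
import pandas as pd

# Illustrative data; Column2 is numeric here so that 'sum' is a numeric total.
df = pd.DataFrame({'Column1': [1, 2, 3], 'Column2': [10, 20, 30]})

result = df.agg({
    'Column1': ['min', 'max', 'mean'],
    'Column2': ['sum'],
})

print(result)

col1_mean = result.loc['mean', 'Column1']
col2_sum = result.loc['sum', 'Column2']
```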
## Working with Time Series
```python
# Convert column to datetime
df['DateColumn'] = pd.to_datetime(df['DateColumn'])
# Set the DateTime column as the index
df.set_index('DateColumn', inplace=True)
# Resample and aggregate by month ('M' = month end; pandas 2.2+ prefers the alias 'ME')
df.resample('M').mean(numeric_only=True)
```
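The steps above can be sketched end to end on a small made-up dataset. The `value` column and the dates are illustrative, and the example uses the month-start alias `'MS'` rather than month end, which is a choice made for this sketch and not something the guide prescribes.

```python
import pandas as pd

df = pd.DataFrame({
    'DateColumn': ['2024-01-10', '2024-01-20', '2024-02-05', '2024-02-25'],
    'value': [10, 20, 30, 50],
})

# Convert the strings to datetimes and use them as the index.
df['DateColumn'] = pd.to_datetime(df['DateColumn'])
df = df.set_index('DateColumn')

# Resample into calendar months, labelled by the first day of each month.
monthly = df.resample('MS').mean()

# A DatetimeIndex also allows selecting a whole month with a partial date string.
february = df.loc['2024-02']

print(monthly)
```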
## Saving Data
```python
# Write to a CSV file
df.to_csv('new_file.csv')
# Write to an Excel file
df.to_excel('new_file.xlsx')
```
`pandas` is an indispensable tool for data wrangling. Its high-level abstractions simplify tasks like data filtering, transformation, and aggregation. This guide covers foundational operations only; the library extends well beyond them to support more complex data manipulation and analysis tasks.
Given its powerful and flexible data manipulation capabilities, `pandas` is a cornerstone library for anyone working with data in Python, offering a depth of functionality that covers nearly every aspect of data analysis and manipulation.