medusa 8a90f7562b Add docs/tech_docs/python/pandas.md

2024-03-28 18:17:47 +00:00

The pandas library is a core tool for data scientists, analysts, and anyone working with tabular data in Python. It provides fast, easy-to-use data structures and data analysis tools. Below is a concise reference guide for common pandas use cases.

`pandas` Reference Guide

Installation

pip install pandas

Basic Concepts

Importing pandas

import pandas as pd

Data Structures

Series: One-dimensional array with labels.
DataFrame: Two-dimensional, size-mutable, potentially heterogeneous tabular data with labeled axes.
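
To make the two definitions above concrete, here is a minimal sketch; the column names `score`, `grade`, and `passed` and the index labels are illustrative assumptions, not part of the original guide. It shows that a DataFrame column is itself a Series and that columns can be added after creation.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels.
s = pd.Series([10, 20, 30], index=['x', 'y', 'z'], name='score')

# A DataFrame: labeled rows and columns; columns may have different dtypes.
df = pd.DataFrame({'score': s, 'grade': ['C', 'B', 'A']})

# "Size-mutable": a column can be added after the DataFrame exists.
df['passed'] = df['score'] >= 20

print(s['y'])             # label-based access into the Series
print(type(df['score']))  # a single column is a Series
print(df.shape)           # (rows, columns)
```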

Creating DataFrames

# From a dictionary
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['a', 'b', 'c']
})

# From a list of lists
df = pd.DataFrame([
    [1, 'a'],
    [2, 'b'],
    [3, 'c']
], columns=['A', 'B'])

Reading Data

# Read from CSV
df = pd.read_csv('filename.csv')

# Read from Excel (.xlsx files require an engine such as openpyxl to be installed)
df = pd.read_excel('filename.xlsx')

# Other formats include: read_sql, read_json, read_html, read_clipboard, read_pickle, etc.

Data Inspection

# View the first n rows (default 5)
df.head()

# View the last n rows (default 5)
df.tail()

# Data summary
df.info()

# Statistical summary for numerical columns
df.describe()

Data Selection

# Select a column
df['A']

# Select multiple columns
df[['A', 'B']]

# Select rows by position
df.iloc[0]  # First row
df.iloc[0:5]  # First five rows

# Select rows by label
df.loc[0]  # Row with index label 0
df.loc[0:5]  # Rows with index labels from 0 to 5, inclusive
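
The difference between `iloc` and `loc` is easiest to see when the index labels are not 0, 1, 2, and so on. The sketch below assumes a small DataFrame with made-up labels 10 to 40: positional slices exclude the end position, while label slices include the end label.

```python
import pandas as pd

# Index labels deliberately differ from the row positions.
df = pd.DataFrame({'A': [1, 2, 3, 4]}, index=[10, 20, 30, 40])

by_position = df.iloc[0:2]  # positions 0 and 1; the end position is excluded
by_label = df.loc[10:30]    # labels 10, 20 and 30; the end label is included

print(by_position)
print(by_label)
```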

Data Manipulation

# Add a new column
df['C'] = [10, 20, 30]

# Drop a column
df.drop('C', axis=1, inplace=True)

# Rename columns
df.rename(columns={'A': 'Alpha', 'B': 'Beta'}, inplace=True)

# Filter rows
filtered_df = df[df['Alpha'] > 1]

# Apply a function to a column
df['Alpha'] = df['Alpha'].apply(lambda x: x * 2)
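
For simple arithmetic, a vectorized expression gives the same result as `apply` with a lambda and is usually faster. The following sketch rebuilds the renamed DataFrame from the snippets above so that it runs on its own, and also shows one way to combine two filter conditions.

```python
import pandas as pd

df = pd.DataFrame({'Alpha': [1, 2, 3], 'Beta': ['a', 'b', 'c']})

# Element-wise function call versus a vectorized operation on the whole column.
doubled_apply = df['Alpha'].apply(lambda x: x * 2)
doubled_vectorized = df['Alpha'] * 2

# Combine conditions with & (and) or | (or); each condition needs parentheses.
subset = df[(df['Alpha'] > 1) & (df['Beta'] != 'c')]

print(doubled_vectorized.tolist())
print(subset)
```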

Handling Missing Data

# Drop rows with any missing values
df.dropna()  # returns a new DataFrame; df itself is unchanged

# Fill missing values
df.fillna(value=0)  # also returns a new DataFrame; assign the result to keep it
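
As a small worked example of the two calls above, the sketch below uses a made-up DataFrame with one missing value per column. It counts missing values first, then shows that `dropna` and `fillna` return new objects and leave the original untouched. The per-column fill values are illustrative assumptions.

```python
import pandas as pd

# None in a numeric column is stored as NaN.
df = pd.DataFrame({'A': [1.0, None, 3.0], 'B': ['x', 'y', None]})

missing_per_column = df.isna().sum()          # count of missing values per column
dropped = df.dropna()                         # keeps only fully populated rows
filled = df.fillna({'A': 0, 'B': 'unknown'})  # a different fill value per column

print(missing_per_column)
print(dropped)
print(filled)
```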

Grouping and Aggregating

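# Group by a column and calculate the mean of the numeric columns
# (the columns were renamed to 'Alpha' and 'Beta' above)
grouped_df = df.groupby('Beta').mean(numeric_only=True)

# Multiple aggregation functions
grouped_df = df.groupby('Beta').agg(['mean', 'sum'])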
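
A self-contained sketch of grouping on a hypothetical `sales` table (the column names are assumptions for illustration). In pandas 2.0 and later, calling `.mean()` on a group that still contains text columns raises an error, so the example selects the numeric column explicitly. It also uses named aggregation, which yields flat, readable column names.

```python
import pandas as pd

sales = pd.DataFrame({
    'region': ['east', 'east', 'west', 'west'],
    'units': [10, 20, 5, 15],
    'note': ['a', 'b', 'c', 'd'],  # text column; not meaningful to average
})

# Select the numeric column before aggregating.
mean_units = sales.groupby('region')['units'].mean()

# Named aggregation: output_column=(input_column, function).
summary = sales.groupby('region').agg(
    mean_units=('units', 'mean'),
    total_units=('units', 'sum'),
)

print(mean_units)
print(summary)
```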

Merging, Joining, and Concatenating

# Concatenate DataFrames
pd.concat([df1, df2])

# Merge DataFrames
pd.merge(df1, df2, on='key')

# Join DataFrames (join matches df1's 'key' column against df2's index)
df1.join(df2.set_index('key'), on='key')
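
The snippets above assume `df1` and `df2` already exist. The sketch below defines two small frames (the `left_val` and `right_val` columns are made up for illustration) and shows how the three operations differ: `concat` stacks rows, `merge` matches on a column, and `join` matches a column against the other frame's index.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'left_val': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['b', 'c', 'd'], 'right_val': [20, 30, 40]})

# Stack rows; columns missing from one frame are filled with NaN.
stacked = pd.concat([df1, df2], ignore_index=True)

# Inner merge (the default): only keys present in both frames.
inner = pd.merge(df1, df2, on='key')

# Left merge: every row of df1, with NaN where df2 has no match.
left = pd.merge(df1, df2, on='key', how='left')

# join is a left join by default and looks up df1['key'] in df2's index.
joined = df1.join(df2.set_index('key'), on='key')

print(stacked)
print(inner)
print(left)
print(joined)
```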

Saving Data

# Write to CSV
df.to_csv('filename.csv')

# Write to Excel (.xlsx output requires an engine such as openpyxl to be installed)
df.to_excel('filename.xlsx')

# Other formats include: to_sql, to_json, to_html, to_clipboard, to_pickle, etc.
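
One detail worth knowing when writing CSV: by default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column on the next read. The sketch below demonstrates this with an in-memory buffer instead of a real file (an assumption made so the example leaves nothing on disk) and shows `index=False` as one way to avoid it.

```python
import io

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reread_with_index = pd.read_csv(with_index)

# index=False writes only the data columns, so the round trip is clean.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
restored = pd.read_csv(without_index)

print(list(reread_with_index.columns))
print(list(restored.columns))
```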

pandas covers data cleaning, transformation, analysis, and visualization. This guide covers only the basics; the library offers many more options for complex data manipulation and analysis tasks.