2024-05-01 12:28:44 -06:00

`scikit-learn` provides simple and efficient tools for predictive data analysis and is built on NumPy, SciPy, and matplotlib. It includes a wide range of supervised and unsupervised learning algorithms. Below is a concise reference guide for common use cases.
# `scikit-learn` Reference Guide
## Installation
```bash
pip install scikit-learn
```
## Basic Concepts
### Importing scikit-learn
```python
# The package is installed as scikit-learn but imported as sklearn
import sklearn
```
## Preprocessing Data
```python
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# Standardize features
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
# One-hot encode categorical variables (the result is a SciPy sparse matrix by default)
encoder = OneHotEncoder().fit(X_categorical)
X_encoded = encoder.transform(X_categorical)
```
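To make the two transformers concrete, here is a minimal sketch on small, made-up arrays (`X_numeric` and `X_categorical` are hypothetical names, not part of the guide above). It shows what the fitted objects return and how to inspect the learned categories; `get_feature_names_out` is assumed to be available, which is the case in reasonably recent releases.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric features and one categorical feature
X_numeric = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_categorical = np.array([["red"], ["green"], ["red"]])

# After scaling, each column has zero mean and unit variance
scaler = StandardScaler().fit(X_numeric)
X_scaled = scaler.transform(X_numeric)

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
encoder = OneHotEncoder(handle_unknown="ignore").fit(X_categorical)
X_encoded = encoder.transform(X_categorical)  # sparse matrix by default

print(encoder.categories_[0])  # categories are sorted: 'green', then 'red'
print(X_encoded.toarray())     # rows: [0, 1], [1, 0], [0, 1]
```

In practice the transformers would usually be fit on the training split only and then applied to the test split with `transform`.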
## Splitting Data
```python
from sklearn.model_selection import train_test_split
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```
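For classification problems with imbalanced classes, a reproducible, stratified split is often preferable. The sketch below uses made-up data and assumes an 80/20 class ratio purely for illustration; `random_state` and `stratify` are standard `train_test_split` parameters.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 samples of class 0, 20 of class 1
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# random_state makes the split repeatable; stratify keeps the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (80, 2) (20, 2)
print(int(y_test.sum()))            # 4 of the 20 test samples are class 1
```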
## Supervised Learning Algorithms
### Linear Regression
```python
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```
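A fitted `LinearRegression` exposes the learned weights through `coef_` and `intercept_`. As a rough sanity check, the sketch below fits noise-free synthetic data (the generating coefficients 3, -2 and the intercept 1 are arbitrary assumptions) and recovers them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data: y = 3*x0 - 2*x1 + 1
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0

model = LinearRegression().fit(X, y)

# Without noise, ordinary least squares recovers the generating parameters
print(model.coef_)       # close to [3, -2]
print(model.intercept_)  # close to 1
```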
### Classification (Logistic Regression)
```python
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```
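Besides hard class labels, `LogisticRegression` can return class probabilities via `predict_proba`. The following is a minimal sketch on a made-up one-dimensional dataset; raising `max_iter` is an optional choice here that can help when the solver reports convergence warnings on larger data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: negative values are class 0, positive values class 1
X = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression(max_iter=1000).fit(X, y)

X_new = np.array([[-1.8], [1.8]])
labels = clf.predict(X_new)       # hard labels
proba = clf.predict_proba(X_new)  # one column per class, rows sum to 1

print(labels)  # [0 1]
print(proba)
```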
### Decision Trees
```python
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```
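Unconstrained trees tend to overfit, so limiting `max_depth` is a common first step. The sketch below is one way to inspect a small tree, using the bundled iris dataset as a stand-in for real data; the depth of 2 is an arbitrary choice for readability.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# random_state fixes tie-breaking between equally good splits
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Text rendering of the learned decision rules
print(export_text(tree, feature_names=iris.feature_names))

# Relative contribution of each feature to the splits (sums to 1)
importances = tree.feature_importances_
print(importances)
```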
## Unsupervised Learning Algorithms
### K-Means Clustering
```python
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
model.fit(X)
labels = model.predict(X)
```
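After fitting, `KMeans` stores the centroids in `cluster_centers_`, the training assignments in `labels_`, and the within-cluster sum of squares in `inertia_`. Here is a small sketch on synthetic, well-separated blobs; the blob centers are made up, and `n_init=10` and `random_state=0` are set explicitly as an assumption to keep results stable across library versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three hypothetical, well-separated blobs of 30 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

labels = kmeans.labels_          # cluster index for each training sample
print(kmeans.cluster_centers_)   # one row per centroid
print(kmeans.inertia_)           # lower means tighter clusters
```

Cluster indices are arbitrary, so the same blob may receive a different index from one run or configuration to another.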
### Principal Component Analysis (PCA)
```python
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
```
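To judge how much information the projection keeps, inspect `explained_variance_ratio_`. The sketch below standardizes the bundled iris data first, which is a common but optional choice since PCA is sensitive to feature scale.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Put all features on the same scale before projecting
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(X_std)
X_pca = pca.transform(X_std)

# Fraction of total variance captured by each retained component
ratio = pca.explained_variance_ratio_
print(X_pca.shape)  # (150, 2)
print(ratio, ratio.sum())
```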
## Model Evaluation
### Cross-Validation
```python
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
```
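When preprocessing is involved, wrapping the steps in a pipeline keeps the scaler from seeing the validation fold during fitting. The following is a sketch using the bundled iris data and a logistic regression, both chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The scaler is re-fit on the training part of every fold
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(pipeline, X, y, cv=5)  # one accuracy value per fold
print(scores)
print(scores.mean(), scores.std())
```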
### Classification Metrics
```python
from sklearn.metrics import accuracy_score, confusion_matrix
accuracy = accuracy_score(y_test, predictions)
conf_matrix = confusion_matrix(y_test, predictions)
```
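A small worked example helps in reading the confusion matrix, whose rows are true classes and whose columns are predicted classes. The labels below are made up; `precision_score` and `recall_score` are included as commonly used companions to accuracy.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels: 4 samples per class, 2 mistakes in total
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)      # 6 correct out of 8
conf_matrix = confusion_matrix(y_true, y_pred)
precision = precision_score(y_true, y_pred)    # TP / (TP + FP)
recall = recall_score(y_true, y_pred)          # TP / (TP + FN)

print(accuracy)     # 0.75
print(conf_matrix)  # [[3 1]
                    #  [1 3]]
print(precision, recall)  # 0.75 0.75
```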
### Regression Metrics
```python
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
```
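The same metrics on a tiny hand-checkable example, with the root mean squared error derived via `np.sqrt` so that the result is in the units of the target. The values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and predictions; the errors are 1, 0, -1, -2
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]

mse = mean_squared_error(y_true, y_pred)  # (1 + 0 + 1 + 4) / 4
rmse = np.sqrt(mse)                       # same units as the target
r2 = r2_score(y_true, y_pred)             # 1 - SS_res / SS_tot = 1 - 6 / 20

print(mse)   # 1.5
print(rmse)
print(r2)    # approximately 0.7
```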
## Tuning Hyperparameters
```python
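from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# n_estimators and max_features are parameters of tree ensembles such as RandomForestClassifier.
# 'auto' is no longer accepted for max_features in recent releases, so it is omitted here.
param_grid = {'n_estimators': [10, 50, 100], 'max_features': ['sqrt', 'log2']}
forest = RandomForestClassifier(random_state=0)
grid_search = GridSearchCV(estimator=forest, param_grid=param_grid, cv=5)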
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
```
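The keys of `param_grid` must be valid parameters of the estimator being searched. As an end-to-end sketch, the example below tunes a decision tree on the bundled iris data (the dataset and grid values are illustrative assumptions) and then reads back the cross-validated score and the refit model.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Every key is a constructor parameter of DecisionTreeClassifier
param_grid = {"max_depth": [2, 3, 4], "min_samples_leaf": [1, 5]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)  # best combination found on the training folds
print(search.best_score_)   # mean cross-validated accuracy of that combination

# By default the best model is refit on all training data
test_accuracy = search.best_estimator_.score(X_test, y_test)
print(test_accuracy)
```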
## Saving and Loading Models
```python
from joblib import dump, load
# Save a model
dump(model, 'model.joblib')
# Load a model
model = load('model.joblib')
```
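A quick round-trip check can confirm that a restored model behaves like the original. The sketch below writes to a temporary directory so it leaves no files behind; the decision tree and iris data are stand-ins for any fitted estimator.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Save and reload inside a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    dump(model, path)
    restored = load(path)

# The restored model should reproduce the original predictions exactly
same = np.array_equal(model.predict(X), restored.predict(X))
print(same)  # True
```

Because joblib files are pickle-based, they should only be loaded from trusted sources, ideally with the same scikit-learn version that wrote them.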
This guide covers the core workflow: preprocessing, splitting data, training supervised and unsupervised models, evaluation, hyperparameter tuning, and model persistence. `scikit-learn`'s functionality extends well beyond these basics, which is why it is a foundational tool for machine learning practitioners.