2024-05-01 12:28:44 -06:00

scikit-learn provides simple and efficient tools for predictive data analysis and is built on NumPy, SciPy, and matplotlib. It includes a wide range of supervised and unsupervised learning algorithms. Below is a concise reference guide for common use cases.

scikit-learn Reference Guide

Installation

pip install scikit-learn

Basic Concepts

Importing scikit-learn

import sklearn

Preprocessing Data

from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Standardize features
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# One-hot encode categorical variables
encoder = OneHotEncoder().fit(X_categorical)
X_encoded = encoder.transform(X_categorical)
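
The snippet above assumes that `X` and `X_categorical` already exist. A minimal self-contained sketch, using small hypothetical arrays (`X_num` and `X_cat` are illustrative names, not part of the guide), shows what the two transformers produce:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical toy data: two numeric features and one categorical column
X_num = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_cat = np.array([["red"], ["green"], ["red"]])

# After scaling, each column has mean 0 and unit variance
X_scaled = StandardScaler().fit_transform(X_num)

# OneHotEncoder returns a SciPy sparse matrix by default; toarray() makes it dense
encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(X_cat).toarray()

print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
print(encoder.categories_)  # categories are sorted: 'green' before 'red'
print(X_encoded)
```

In a real project the scaler and encoder would normally be fitted on the training split only and then applied to the test split, so that test data does not influence the learned statistics.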

Splitting Data

from sklearn.model_selection import train_test_split

# Split the dataset into a training set (80%) and a test set (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
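
Two optional arguments are often useful here: `random_state` makes the split reproducible, and `stratify` keeps the class proportions the same in both parts. A short sketch on hypothetical balanced data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 4 features, balanced binary labels
X = np.arange(400).reshape(100, 4)
y = np.array([0, 1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (80, 4) (20, 4)
print(np.bincount(y_test))          # [10 10], same class balance as y
```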

Supervised Learning Algorithms

Linear Regression

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
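
As a sanity check, a fitted model exposes the learned parameters through `coef_` and `intercept_`. The sketch below uses hypothetical noiseless data generated from known coefficients, so the fit should recover them up to floating-point error:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noiseless data: y = 3*x0 - 2*x1 + 1
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0

model = LinearRegression().fit(X, y)

# The learned slope and intercept should match the generating values
print(model.coef_, model.intercept_)
```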

Classification (Logistic Regression)

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
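
Besides hard labels, a logistic regression model can return class probabilities through `predict_proba`. A minimal sketch on synthetic data (the dataset sizes and `max_iter=1000` are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# A higher max_iter than the default gives the solver more room to converge
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X[:3])  # one row per sample, one column per class
labels = clf.predict(X[:3])       # the class with the highest probability

print(proba)
print(labels)
```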

Decision Trees

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
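
An unconstrained tree can grow until it memorizes the training data, so limiting its depth is a common first adjustment. The sketch below uses the bundled iris dataset; `max_depth=3` is an arbitrary illustrative value:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Cap the depth to keep the tree small; random_state fixes tie-breaking
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

print(tree.get_depth())           # never exceeds max_depth
print(tree.feature_importances_)  # relative importance per feature, sums to 1
```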

Unsupervised Learning Algorithms

K-Means Clustering

from sklearn.cluster import KMeans

model = KMeans(n_clusters=3)
model.fit(X)
labels = model.predict(X)
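
After fitting, the cluster assignments of the training data are also available as `labels_`, and the centroids as `cluster_centers_`. A small sketch on synthetic blobs (the blob parameters are assumptions chosen so that three clusters are easy to find):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three tight, synthetic clusters in two dimensions
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

# n_init and random_state are set explicitly so the run is reproducible
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

labels = kmeans.labels_            # cluster index for each training sample
centers = kmeans.cluster_centers_  # one centroid per cluster

print(centers)
```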

Principal Component Analysis (PCA)

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
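
To judge how much information the projection keeps, the fitted object reports the share of variance captured by each component in `explained_variance_ratio_`. A sketch using the bundled iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # 150 samples, 4 features

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print(X_pca.shape)  # (150, 2)
# Fraction of total variance captured by each component, in decreasing order
print(pca.explained_variance_ratio_)
```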

Model Evaluation

Cross-Validation

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5)
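
When preprocessing is involved, one common approach is to wrap the steps in a pipeline so that the scaler is refitted on each training fold rather than on the full dataset. The following is a sketch under that assumption, using `make_pipeline` and the iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling happens inside each fold, so held-out data never influences the scaler
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, X, y, cv=5)

# One accuracy value per fold; the mean and spread summarize the estimate
print(scores.mean(), scores.std())
```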

Classification Metrics

from sklearn.metrics import accuracy_score, confusion_matrix

accuracy = accuracy_score(y_test, predictions)
conf_matrix = confusion_matrix(y_test, predictions)
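
A tiny hand-checkable example makes the layout of the confusion matrix concrete: rows are true classes and columns are predicted classes. The labels below are hypothetical.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 4 of the 6 predictions are correct
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

accuracy = accuracy_score(y_true, y_pred)
conf_matrix = confusion_matrix(y_true, y_pred)

print(accuracy)     # 4 correct out of 6
print(conf_matrix)  # [[2 1]
                    #  [1 2]]
```

Here the top row says that two true-0 samples were predicted as 0 and one as 1; the bottom row says that one true-1 sample was predicted as 0 and two as 1.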

Regression Metrics

from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
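
Both metrics are simple to verify by hand. The sketch below compares the library results against the defining formulas, MSE = mean((y - ŷ)²) and R² = 1 - SS_res / SS_tot, on hypothetical values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# The same quantities computed directly from their definitions
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
mse_manual = ss_res / len(y_true)
r2_manual = 1.0 - ss_res / ss_tot

print(mse, mse_manual)  # 0.375 0.375
print(r2, r2_manual)
```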

Tuning Hyperparameters

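from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# n_estimators and max_features are random forest parameters, so tune a random forest
model = RandomForestClassifier()
# 'auto' is no longer accepted for max_features in recent releases; None uses all features
param_grid = {'n_estimators': [10, 50, 100], 'max_features': [None, 'sqrt', 'log2']}
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_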
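
The keys of a parameter grid must match constructor arguments of the estimator being tuned. A small end-to-end sketch, using a decision tree on the iris dataset (the grid values are arbitrary illustrative choices), shows the search and the attributes it exposes:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 3 x 2 = 6 parameter combinations, each evaluated with 5-fold cross-validation
param_grid = {"max_depth": [2, 3, 4], "min_samples_leaf": [1, 5]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # best combination found
print(search.best_score_)   # its mean cross-validated accuracy

# By default the search refits the best combination on all of X, y
best_model = search.best_estimator_
```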

Saving and Loading Models

from joblib import dump, load

# Save a model
dump(model, 'model.joblib')

# Load a model
model = load('model.joblib')
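
A quick round trip confirms that a restored model behaves like the original. The sketch below writes to a temporary directory so it leaves no files behind; the toy data is hypothetical.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression

# Hypothetical toy data following y = 2x + 1
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    dump(model, path)      # serialize the fitted estimator
    restored = load(path)  # read it back into a new object

same = np.allclose(model.predict(X), restored.predict(X))
print(same)  # True
```

joblib files are based on pickle, so they should only be loaded from trusted sources, and ideally with the same scikit-learn version that saved them.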

scikit-learn simplifies the implementation of many machine learning algorithms for data analysis projects. This guide covers the key steps of data preprocessing, model selection, training, and evaluation; the library's functionality extends well beyond these basics.