`scikit-learn` provides simple and efficient tools for predictive data analysis and is built on NumPy, SciPy, and matplotlib. It includes a wide range of supervised and unsupervised learning algorithms. Below is a concise reference guide for common use cases.

# `scikit-learn` Reference Guide

## Installation

```
pip install scikit-learn
```

## Basic Concepts

In the snippets below, `X` is the feature matrix and `y` is the target vector.

### Importing scikit-learn

```python
# The package installs as scikit-learn but is imported as sklearn
import sklearn
```

## Preprocessing Data

```python
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Standardize features
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# One-hot encode categorical variables
encoder = OneHotEncoder().fit(X_categorical)
X_encoded = encoder.transform(X_categorical)
```

## Splitting Data

```python
from sklearn.model_selection import train_test_split

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```

## Supervised Learning Algorithms

### Linear Regression

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

### Classification (Logistic Regression)

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

### Decision Trees

```python
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

## Unsupervised Learning Algorithms

### K-Means Clustering

```python
from sklearn.cluster import KMeans

model = KMeans(n_clusters=3)
model.fit(X)
labels = model.predict(X)
```

### Principal Component Analysis (PCA)

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
```

## Model Evaluation

### Cross-Validation

```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5)
```

### Classification Metrics

```python
from sklearn.metrics import accuracy_score, confusion_matrix

accuracy = accuracy_score(y_test, predictions)
conf_matrix = confusion_matrix(y_test, predictions)
```

### Regression Metrics

```python
from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
```

## Tuning Hyperparameters

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# n_estimators and max_features are random forest parameters,
# so the grid is searched over a RandomForestClassifier
model = RandomForestClassifier()
param_grid = {'n_estimators': [10, 50, 100], 'max_features': ['sqrt', 'log2', None]}
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
```

## Saving and Loading Models

```python
from joblib import dump, load

# Save a model
dump(model, 'model.joblib')

# Load a model
model = load('model.joblib')
```

`scikit-learn` is a versatile and comprehensive library that simplifies the implementation of many machine learning algorithms for data analysis projects. This guide covers key features such as data preprocessing, model selection, training, and evaluation, but `scikit-learn`'s functionality extends far beyond these basics, making it a foundational tool in the machine learning practitioner's toolkit.
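
As a minimal end-to-end sketch, the preprocessing, splitting, training, and evaluation snippets from this guide can be combined into one runnable script. The bundled iris dataset, the `random_state=0` seed, the `stratify=y` option, and `max_iter=1000` are assumptions chosen for illustration, not part of the guide itself.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled iris dataset stands in for X and y.
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fit the scaler on the training rows only, then reuse it on the test rows.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a classifier and predict on the held-out rows.
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)
predictions = model.predict(X_test_scaled)

# Evaluate on the test set.
accuracy = accuracy_score(y_test, predictions)
conf_matrix = confusion_matrix(y_test, predictions)

# 5-fold cross-validation on the scaled training rows
# (the model is cloned and refit for each fold).
scores = cross_val_score(model, X_train_scaled, y_train, cv=5)

print(f"test accuracy: {accuracy:.3f}")
print(conf_matrix)
print(f"mean CV accuracy: {scores.mean():.3f}")
```

Fitting the scaler on the training rows alone keeps information from the test set out of the preprocessing step, which is one reasonable way to order the steps; other orderings may suit other projects.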