Scikit-learn Cheat Sheet
A concise cheat sheet for the scikit-learn library, covering essential functionalities for machine learning in Python. This guide includes key concepts, model selection, preprocessing techniques, and evaluation metrics with practical examples.
Core Concepts & Model Selection
Supervised Learning Estimators
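Scikit-learn offers supervised learning estimators for both classification and regression.
Linear Models: for example LinearRegression, LogisticRegression, and Ridge.
Support Vector Machines (SVM): for example SVC and SVR.
Ensemble Methods: for example RandomForestClassifier and GradientBoostingClassifier.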
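As a minimal sketch of the three estimator families above, the snippet below fits one classifier from each family on the bundled iris dataset. The specific classes (LogisticRegression, SVC, RandomForestClassifier), the hyperparameters, and the dataset are choices made here for illustration, not a recommendation from the original sheet.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Small built-in dataset: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# One illustrative estimator per family (choices are assumptions).
estimators = {
    "linear": LogisticRegression(max_iter=1000),
    "svm": SVC(kernel="rbf"),
    "ensemble": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Every estimator exposes the same fit/score interface.
train_accuracy = {}
for name, estimator in estimators.items():
    estimator.fit(X, y)
    train_accuracy[name] = estimator.score(X, y)

print(train_accuracy)
```

Scores computed on the training data are optimistic; they are used here only to show that the three families share one interface.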
Unsupervised Learning Estimators
Scikit-learn provides unsupervised learning estimators for tasks like clustering and dimensionality reduction.
Clustering: for example KMeans, DBSCAN, and AgglomerativeClustering.
Dimensionality Reduction: for example PCA and TruncatedSVD.
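A short sketch of both unsupervised tasks, assuming KMeans for clustering and PCA for dimensionality reduction (both picked here for illustration). The number of clusters and components are arbitrary example values.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Labels are discarded: unsupervised estimators only see X.
X, _ = load_iris(return_X_y=True)

# Clustering: assign each sample to one of 3 clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)

# Dimensionality reduction: project 4 features onto 2 principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(cluster_labels[:10])
print(X_2d.shape, pca.explained_variance_ratio_)
```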
Model Fitting and Prediction
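fit(X, y): Fit the model to the training data (y is omitted for unsupervised estimators).
predict(X): Predict class labels or values for data.
transform(X): Apply dimensionality reduction or feature extraction to data.
fit_transform(X): Fit the model and then transform the same data in one call.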
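In scikit-learn's estimator API these four operations correspond to the fit, predict, transform, and fit_transform methods. The sketch below shows them together; the dataset, split ratio, and choice of StandardScaler plus LogisticRegression are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# fit_transform: learn the scaling statistics and apply them in one call.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# transform: reuse the statistics learned on the training data.
X_test_scaled = scaler.transform(X_test)

# fit: learn model parameters from the training data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# predict: produce class labels for unseen data.
y_pred = model.predict(X_test_scaled)
print(y_pred[:10])
```

Calling fit_transform only on the training split and transform on the test split keeps test-set statistics from leaking into preprocessing.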
Preprocessing and Feature Engineering
Scaling and Normalization
Scaling and normalization techniques bring features onto a comparable scale.
StandardScaler: Standardizes features by removing the mean and scaling to unit variance.
MinMaxScaler: Scales each feature to a given range, by default between zero and one.
RobustScaler: Scales features using statistics that are robust to outliers (the median and the interquartile range).
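To make the difference between the three scalers concrete, the sketch below applies each one to a single feature containing one outlier. The toy values are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature with an obvious outlier (100.0); values are made up.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Zero mean, unit variance.
X_std = StandardScaler().fit_transform(X)

# Rescaled to the default [0, 1] range; the outlier squeezes the rest near 0.
X_minmax = MinMaxScaler().fit_transform(X)

# Centered on the median (3.0) and divided by the interquartile range (4.0 - 2.0).
X_robust = RobustScaler().fit_transform(X)

print(X_std.ravel())
print(X_minmax.ravel())
print(X_robust.ravel())
```

With RobustScaler the four ordinary points keep a usable spread, whereas with MinMaxScaler they are compressed into a narrow band near zero.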
Encoding Categorical Variables
Encoding transforms categorical data into numerical format.
OneHotEncoder: Encodes categorical features as a one-hot numeric array.
LabelEncoder: Encodes target labels with values between 0 and n_classes-1. It is intended for the target y, not for input features.
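A small sketch of both encoders on made-up data: OneHotEncoder for an input feature and LabelEncoder for a target column. The category names are invented for the example.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# A single categorical feature, shaped (n_samples, 1) as OneHotEncoder expects.
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = OneHotEncoder(handle_unknown="ignore")
# The default output is a sparse matrix; convert to dense for display.
colors_onehot = encoder.fit_transform(colors).toarray()

# Target labels are encoded as integers in sorted class order.
labels = ["cat", "dog", "cat", "bird"]
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(labels)

print(encoder.categories_)      # categories are sorted: blue, green, red
print(colors_onehot)
print(label_encoder.classes_)   # bird, cat, dog
print(y)
```

Setting handle_unknown="ignore" makes the encoder emit an all-zero row for categories it did not see during fit instead of raising an error.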
SimpleImputer: Fills missing values with a specified strategy (e.g., mean, median, most_frequent).
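The imputation described above is available as SimpleImputer in sklearn.impute. The sketch below fills NaN entries with the column mean; the toy matrix is invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Two features, each with one missing value (np.nan).
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

# strategy may also be "median", "most_frequent", or "constant".
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)

print(imputer.statistics_)  # per-column fill values learned during fit
print(X_filled)
```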
Model Evaluation and Validation
Metrics for Classification
Evaluation metrics quantify the performance of classification models.
Accuracy: Ratio of correctly predicted instances to total instances.
Precision: Ratio of true positives to the sum of true positives and false positives.
Recall: Ratio of true positives to the sum of true positives and false negatives.
F1-score: Harmonic mean of precision and recall, 2 * (precision * recall) / (precision + recall).
Confusion Matrix: Table showing the correct and incorrect predictions, broken down by class.
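The metrics above can be checked by hand on a small binary example. The labels below are invented so that precision and recall differ (2 true positives, 1 false positive, 2 false negatives, 3 true negatives).

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 5 / 8
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean = 4 / 7

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)

print(accuracy, precision, recall, f1)
print(cm)
```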
Metrics for Regression
Evaluation metrics quantify the performance of regression models.
Mean Squared Error (MSE): Average of the squares of the errors.
Root Mean Squared Error (RMSE): Square root of the MSE.
R-squared (Coefficient of Determination): Proportion of variance in the dependent variable that can be predicted from the independent variables.
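A worked example of the three regression metrics on made-up values. RMSE is computed here as the square root of the MSE with NumPy, which is one portable way to obtain it across scikit-learn versions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

# Squared errors are 1, 0, 1, 4, so MSE = 6 / 4.
mse = mean_squared_error(y_true, y_pred)

# RMSE is in the same units as the target.
rmse = np.sqrt(mse)

# R^2 = 1 - SS_res / SS_tot = 1 - 6 / 20.
r2 = r2_score(y_true, y_pred)

print(mse, rmse, r2)
```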
cross_val_score: Evaluates a model by cross-validation and returns one score per fold.
KFold: Provides train/test indices to split data into train/test sets.
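The two descriptions above match cross_val_score and KFold from sklearn.model_selection. A minimal sketch that combines them follows; the model, dataset, and number of folds are assumptions chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# KFold yields (train_indices, test_indices) pairs; shuffling avoids
# folds that follow the dataset's class ordering.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# cross_val_score fits a fresh copy of the model on each training fold
# and scores it on the matching test fold.
scores = cross_val_score(model, X, y, cv=cv)

# Inspect the split sizes directly.
fold_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in cv.split(X)]

print(scores, scores.mean())
print(fold_sizes)
```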
Pipeline and Grid Search
Pipelines streamline the sequence of data transformations and model fitting.
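A minimal sketch of a two-step pipeline, assuming a StandardScaler followed by a LogisticRegression; the step names "scaler" and "clf" are arbitrary labels chosen for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Each step is a (name, estimator) pair; all but the last must be transformers.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit() runs fit_transform on each transformer, then fit on the final estimator.
pipe.fit(X_train, y_train)

# score() applies transform on each transformer, then scores the final estimator.
test_accuracy = pipe.score(X_test, y_test)
print(test_accuracy)
```

Because the scaler is fitted inside the pipeline, it only ever sees the data passed to fit, which keeps preprocessing consistent between training and prediction.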
Grid Search
Grid search evaluates every combination of values in a specified hyperparameter grid, scoring each with cross-validation, and keeps the best-performing model.
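A short sketch using GridSearchCV with an SVC; the parameter grid and the 5-fold setting are example values, not tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 3 values of C x 2 kernels = 6 candidate settings.
param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf"],
}

# Each candidate is scored with 5-fold cross-validation.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_  # mean cross-validated score of the best candidate

print(best_params, best_score)
```

After fitting, search.best_estimator_ holds a model refitted on all of the data with the best parameters, so search.predict can be used directly.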
Column Transformer
Apply different transformers to different columns of an array or pandas DataFrame.
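A minimal sketch of ColumnTransformer on a small pandas DataFrame, scaling the numeric columns and one-hot encoding the categorical one. The column names and values are hypothetical, and pandas is assumed to be installed.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: two numeric columns and one categorical column.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40000.0, 52000.0, 88000.0, 61000.0],
    "city": ["Paris", "Lyon", "Paris", "Nice"],
})

# Each entry is (name, transformer, columns to apply it to).
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# Output columns: 2 scaled numeric features + 3 one-hot city features.
X_prepared = preprocessor.fit_transform(df)
print(X_prepared.shape)
```

Columns not listed in any transformer are dropped by default; passing remainder="passthrough" keeps them unchanged.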