Scikit-learn Cheat Sheet
A concise cheat sheet for the scikit-learn library, covering essential functionalities for machine learning in Python. This guide includes key concepts, model selection, preprocessing techniques, and evaluation metrics with practical examples.
Core Concepts & Model Selection
Supervised Learning Estimators
Scikit-learn offers various supervised learning estimators for different tasks.
Linear Models:
Example:
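A minimal sketch of a linear model, assuming LinearRegression and a small made-up dataset that follows y = 2x + 1 (neither comes from the original sheet):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data lying exactly on the line y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

reg = LinearRegression()
reg.fit(X, y)

slope = reg.coef_[0]          # learned weight for the single feature
intercept = reg.intercept_    # learned bias term
prediction = reg.predict(np.array([[4.0]]))[0]
```

For classification, LogisticRegression from the same module follows the same fit/predict pattern.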
Support Vector Machines (SVM):
Example:
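A minimal sketch of an SVM classifier, assuming SVC with an RBF kernel on the bundled iris dataset (the dataset, split, and parameter values are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# C controls regularization strength; "rbf" is the default kernel
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)
```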
Ensemble Methods:
Example:
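A minimal sketch of an ensemble method, assuming RandomForestClassifier on the bundled iris dataset (an illustrative choice; other ensembles such as gradient boosting follow the same interface):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 100 trees; random_state fixes the bootstrap samples for reproducibility
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

accuracy = forest.score(X_test, y_test)
importances = forest.feature_importances_  # one value per input feature
```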
Unsupervised Learning Estimators
Scikit-learn provides unsupervised learning estimators for tasks like clustering and dimensionality reduction.
Clustering:
Example:
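A minimal clustering sketch, assuming KMeans and six made-up points that form two well-separated groups:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: three points near x=1 and three near x=10
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)      # cluster index for each sample
centers = kmeans.cluster_centers_   # one row per cluster
```

Cluster indices are arbitrary, so compare which samples share a label rather than the label values themselves.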
Dimensionality Reduction:
Example:
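A minimal dimensionality reduction sketch, assuming PCA applied to the four-feature iris dataset (an illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project the 4 original features onto the 2 directions of highest variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Fraction of the total variance retained by the two components
explained = pca.explained_variance_ratio_.sum()
```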
Model Fitting and Prediction
fit(X, y): Fit the model using the training data.
predict(X): Predict class labels or values for data.
transform(X): Apply dimensionality reduction or feature extraction to the data.
fit_transform(X): Fit the model and then transform the data.
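The four methods above can be sketched together as follows, assuming a StandardScaler as the transformer and a KNeighborsClassifier as the predictor (both are illustrative choices):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# fit + transform: learn scaling statistics on the training data only
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# fit_transform: the same two steps in one call
X_train_scaled_alt = StandardScaler().fit_transform(X_train)

# fit + predict on a supervised estimator
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
```

Fitting the scaler on the training split only, then reusing it on the test split, avoids leaking test statistics into training.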
Preprocessing and Feature Engineering
Scaling and Normalization
Scaling and normalization techniques bring feature values onto a comparable scale.
StandardScaler: Standardizes features by removing the mean and scaling to unit variance.
MinMaxScaler: Scales features to a given range, by default between zero and one.
RobustScaler: Scales features using statistics that are robust to outliers (by default the median and interquartile range).
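A minimal sketch comparing the three scalers on one made-up feature that contains an outlier (the data values are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Single feature with one large outlier (100.0)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

X_std = StandardScaler().fit_transform(X)     # zero mean, unit variance
X_minmax = MinMaxScaler().fit_transform(X)    # rescaled to [0, 1]
X_robust = RobustScaler().fit_transform(X)    # (x - median) / IQR
```

With the outlier present, MinMaxScaler squeezes the first four values close to zero, while RobustScaler keeps them spread around the median.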
Encoding Categorical Variables
Encoding transforms categorical data into a numerical format.
OneHotEncoder: Encodes categorical features as a one-hot numeric array.
LabelEncoder: Encodes target labels with values between 0 and n_classes-1. It is meant for the target y, not for input features.
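A minimal sketch of both encoders on made-up categories. Calling .toarray() on the default sparse output is one way to get a dense array without relying on version-specific constructor parameters:

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical categorical feature (2D: one column) and target labels (1D)
colors = [["red"], ["green"], ["blue"], ["green"]]
animals = ["cat", "dog", "cat", "bird"]

encoder = OneHotEncoder()
colors_onehot = encoder.fit_transform(colors).toarray()
# Columns follow the sorted categories: blue, green, red

label_encoder = LabelEncoder()
animals_encoded = label_encoder.fit_transform(animals)
# Classes are sorted: bird -> 0, cat -> 1, dog -> 2
```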
Imputation
SimpleImputer: Fills missing values with a specified strategy (e.g., mean, median, most_frequent).
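A minimal imputation sketch, assuming SimpleImputer with the mean strategy and a small made-up array with missing entries:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value per column
X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, np.nan]])

# Each NaN is replaced by the mean of the observed values in its column
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
```

Swapping strategy for "median" or "most_frequent" changes only the statistic used to fill each column.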
Model Evaluation and Validation
Metrics for Classification
Evaluation metrics quantify the performance of classification models.
Accuracy: Ratio of correctly predicted instances to total instances.
Precision: Ratio of true positives to the sum of true positives and false positives.
Recall: Ratio of true positives to the sum of true positives and false negatives.
F1-score: Harmonic mean of precision and recall.
Confusion Matrix: Table of correct and incorrect predictions, broken down by class.
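A minimal sketch of these metrics on made-up binary labels (2 true positives, 2 false positives, 1 false negative, 1 true negative):

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical ground truth and predictions
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1]

accuracy = accuracy_score(y_true, y_pred)     # 3 correct out of 6
precision = precision_score(y_true, y_pred)   # TP / (TP + FP) = 2 / 4
recall = recall_score(y_true, y_pred)         # TP / (TP + FN) = 2 / 3
f1 = f1_score(y_true, y_pred)                 # 2 * P * R / (P + R) = 4 / 7

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
```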
Metrics for Regression
Evaluation metrics quantify the performance of regression models.
Mean Squared Error (MSE): Average of the squares of the errors.
Root Mean Squared Error (RMSE): Square root of the MSE.
R-squared (Coefficient of Determination): Proportion of variance in the dependent variable that can be predicted from the independent variables.
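A minimal sketch of the regression metrics on made-up values. RMSE is computed here with NumPy from the MSE, which is one way to avoid depending on version-specific helpers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and predictions
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mse = mean_squared_error(y_true, y_pred)   # (0.25 + 0.25 + 0 + 1) / 4
rmse = np.sqrt(mse)                        # same units as the target
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot
```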
Cross-Validation
Evaluates a model by cross-validation.
Provides train/test indices to split data into training and test sets.
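A minimal sketch of both ideas, assuming the entries above refer to cross_val_score and KFold (the original names are missing, so this is a guess) and using the bundled iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# One score per fold; with an integer cv and a classifier,
# scikit-learn uses stratified folds
scores = cross_val_score(model, X, y, cv=5)
mean_score = scores.mean()

# KFold yields index arrays; shuffle because iris is sorted by class
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in kfold.split(X)]
```

A KFold instance can also be passed directly as the cv argument of cross_val_score.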
Pipeline and Grid Search
Pipeline
Pipelines streamline the sequence of data transformations and model fitting.
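A minimal pipeline sketch, assuming a StandardScaler followed by LogisticRegression (the step names "scaler" and "clf" are arbitrary labels chosen for this example):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),                 # fitted on training data only
    ("clf", LogisticRegression(max_iter=1000)),   # final estimator
])

pipe.fit(X_train, y_train)            # fits each step in order
accuracy = pipe.score(X_test, y_test) # applies scaler.transform, then scores
```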
Grid Search
Grid search systematically searches hyperparameter space for the best model.
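A minimal grid search sketch, assuming GridSearchCV over an SVC with a small illustrative parameter grid:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 3 values of C x 2 kernels = 6 candidate settings, each scored with 5-fold CV
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

best_params = search.best_params_       # best combination found
best_score = search.best_score_         # its mean cross-validated accuracy
best_model = search.best_estimator_     # refit on all of X, y
```

When tuning a Pipeline, parameter names take the form step__parameter, for example "clf__C".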
Column Transformer
Apply different transformers to different columns of an array or pandas DataFrame.
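A minimal sketch, assuming ColumnTransformer on a small made-up pandas DataFrame with one numeric and one categorical column (the column names and values are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [25, 32, 47],
    "city": ["Paris", "Rome", "Paris"],
})

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age"]),   # scale the numeric column
        ("cat", OneHotEncoder(), ["city"]),   # one-hot encode the categorical column
    ],
    sparse_threshold=0,  # always return a dense array
)

# Output columns: scaled age, then one column per city category
X_out = preprocessor.fit_transform(df)
```

A ColumnTransformer is commonly used as the first step of a Pipeline so that preprocessing and model fitting stay together.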