Core Concepts & Model Selection
Scikit-learn provides supervised learning estimators for regression and classification, grouped here by model family.
Linear Models:
LinearRegression: For regression tasks (ordinary least squares).
LogisticRegression: For classification tasks (a linear classifier, despite its name).
Ridge: Linear least squares with L2 regularization.
Example:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
|
Support Vector Machines (SVM):
SVC: Support Vector Classification.
SVR: Support Vector Regression.
Example:
from sklearn.svm import SVC
model = SVC()
|
Ensemble Methods:
RandomForestClassifier: For classification tasks.
RandomForestRegressor: For regression tasks.
GradientBoostingClassifier: For classification tasks.
Example:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
|
Scikit-learn provides unsupervised learning estimators for tasks like clustering and dimensionality reduction.
Clustering:
KMeans: K-Means clustering.
AgglomerativeClustering: Agglomerative clustering.
Example:
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
|
Dimensionality Reduction:
PCA: Principal Component Analysis.
TruncatedSVD: Truncated Singular Value Decomposition.
Example:
from sklearn.decomposition import PCA
model = PCA(n_components=2)
|
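A minimal sketch of the clustering and dimensionality-reduction estimators listed above, run on synthetic data. The dataset from `make_blobs`, the sample sizes, and the `random_state` and `n_init` values are assumptions chosen for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic data (an assumption for this sketch): 150 points, 5 features, 3 blobs.
X, _ = make_blobs(n_samples=150, n_features=5, centers=3, random_state=0)

# Clustering: fit_predict returns one cluster index per sample.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Dimensionality reduction: project the 5 features onto 2 principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(labels.shape)                   # one label per sample
print(X_2d.shape)                     # same rows, 2 columns
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```

Clustering estimators expose `fit_predict`, while decomposition estimators expose `fit_transform`; neither needs a target `y`.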
|
Fit the model using the training data X and target y (unsupervised estimators take only X).
model.fit(X_train, y_train)
|
|
Predict class labels or values for data X.
y_pred = model.predict(X_test)
|
|
Apply a fitted transformation, such as dimensionality reduction or feature extraction, to X.
X_transformed = model.transform(X)
|
|
Fit the model and then transform X in a single step.
X_transformed = model.fit_transform(X)
|
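The four methods above can be combined into one workflow, sketched below. The synthetic dataset from `make_classification`, the 75/25 split, and the choice of `LogisticRegression` with `StandardScaler` are assumptions for illustration, not a recommended configuration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Hold out 25% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# fit_transform learns the scaling statistics on the training data only;
# transform reuses those statistics on the test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# fit learns the model parameters; predict applies them to unseen data.
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

print(y_pred.shape)  # one prediction per test sample
```

Calling `fit_transform` on the training set and plain `transform` on the test set keeps test-set statistics out of the fitted transformer.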
Preprocessing and Feature Engineering
Scaling and normalization techniques put features on a comparable scale, which matters for estimators that are sensitive to feature magnitudes.
StandardScaler: Standardize features by removing the mean and scaling to unit variance.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
|
MinMaxScaler: Scale each feature to a given range (by default between zero and one).
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
|
RobustScaler: Scale features using statistics that are robust to outliers (by default the median and the interquartile range).
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)
|
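To make the difference between the three scalers concrete, the sketch below applies each one to the same small column containing one outlier. The input values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature with an obvious outlier (illustrative values).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

X_std = StandardScaler().fit_transform(X)    # zero mean, unit variance
X_minmax = MinMaxScaler().fit_transform(X)   # minimum -> 0, maximum -> 1
X_robust = RobustScaler().fit_transform(X)   # (x - median) / IQR

print(X_std.ravel())
print(X_minmax.ravel())
# Median is 3 and IQR is 4 - 2 = 2, so the values become -1, -0.5, 0, 0.5, 48.5.
print(X_robust.ravel())
```

With `MinMaxScaler` the outlier squeezes the other four values close to zero, whereas `RobustScaler` keeps them evenly spread because the median and interquartile range are barely affected by the outlier.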
Encoding transforms categorical data into numerical format.
OneHotEncoder: Encodes categorical features as a one-hot numeric array.
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(handle_unknown='ignore')
X_encoded = encoder.fit_transform(X)
|
LabelEncoder: Encodes target labels with values between 0 and n_classes-1. Intended for the target y, not for input features.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_encoded = le.fit_transform(y)
|
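A small worked example of both encoders, as a sketch. The category names are invented for illustration; `.toarray()` is used because `OneHotEncoder` returns a sparse matrix by default.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Illustrative categorical feature; "purple" is not seen during fitting.
X_train = np.array([["red"], ["green"], ["blue"]])
X_new = np.array([["green"], ["purple"]])

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(X_train)
encoded = encoder.transform(X_new).toarray()

print(encoder.categories_)  # categories are sorted: blue, green, red
print(encoded)              # "green" -> [0, 1, 0]; unknown "purple" -> [0, 0, 0]

# LabelEncoder maps sorted class names to integers 0..n_classes-1.
le = LabelEncoder()
y_encoded = le.fit_transform(["cat", "dog", "cat", "bird"])

print(le.classes_)                       # bird, cat, dog
print(y_encoded)                         # bird=0, cat=1, dog=2
print(le.inverse_transform(y_encoded))   # back to the original labels
```

With `handle_unknown='ignore'`, a category that was absent at fit time is encoded as an all-zero row instead of raising an error.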
|
Fills missing values with a specified strategy (e.g., mean, median, most_frequent).
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
|
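As a quick check of what mean imputation does, the sketch below fills two missing entries in a tiny array. The values are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Two features, each with one missing value (illustrative data).
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)

# Column means ignore NaN: (1 + 3) / 2 = 2 and (10 + 20) / 2 = 15.
print(imputer.statistics_)
print(X_imputed)
```

The learned fill values are stored in `statistics_`, so calling `transform` on new data reuses the training means rather than recomputing them.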
Model Evaluation and Validation
Evaluation metrics quantify the performance of classification models.
Accuracy: Ratio of correctly predicted instances to total instances.
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
|
Precision: Ratio of true positives to the sum of true positives and false positives.
from sklearn.metrics import precision_score
precision = precision_score(y_test, y_pred)
|
Recall: Ratio of true positives to the sum of true positives and false negatives.
from sklearn.metrics import recall_score
recall = recall_score(y_test, y_pred)
|
F1-score: Harmonic mean of precision and recall, 2 * (precision * recall) / (precision + recall).
from sklearn.metrics import f1_score
f1 = f1_score(y_test, y_pred)
|
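The definitions above can be checked by hand on a small example. The labels below are invented so that precision and recall differ, which makes it visible that F1 is the harmonic mean rather than the arithmetic mean.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Illustrative binary labels: TP = 2, FN = 2, FP = 1, TN = 5.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 7 / 10
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean = 4 / 7

# Recompute F1 from precision and recall to confirm the formula.
f1_manual = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall)
print(f1, f1_manual)
```

These functions default to `average='binary'`; for multiclass targets an `average` argument such as `'macro'` or `'weighted'` has to be passed explicitly.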
Confusion Matrix: Table showing the correct and incorrect predictions, broken down by class.
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
|
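A short sketch of how to read the matrix for a binary problem, using made-up labels. In scikit-learn the rows are true classes and the columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Illustrative binary labels.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# Layout for labels [0, 1]:
# [[TN, FP],
#  [FN, TP]]

# ravel() flattens the 2x2 matrix row by row.
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)
```

The unpacking with `ravel()` only makes sense for binary problems; with more classes the matrix has one row and one column per class.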
Evaluation metrics quantify the performance of regression models.
Mean Squared Error (MSE): Average of the squares of the errors.
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
|
Root Mean Squared Error (RMSE): Square root of the MSE.
import numpy as np
from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
|
R-squared (Coefficient of Determination): Proportion of variance in the dependent variable that can be predicted from the independent variables.
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
|
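The three regression metrics can be verified by hand on a few points. The values below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative targets and predictions; errors are 1, 0, -1, -2.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

mse = mean_squared_error(y_true, y_pred)  # (1 + 0 + 1 + 4) / 4 = 1.5
rmse = np.sqrt(mse)                       # sqrt(1.5), about 1.22
r2 = r2_score(y_true, y_pred)             # 1 - SS_res / SS_tot = 1 - 6 / 20 = 0.7

# Recompute R-squared from its definition.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse, rmse)
print(r2, r2_manual)
```

RMSE is in the same units as the target, which usually makes it easier to interpret than MSE.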
|
Evaluate a model by cross-validation.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
|
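A self-contained sketch of the call above, including how the per-fold scores are typically summarized. The synthetic data, the `LogisticRegression` estimator, and the explicit `scoring="accuracy"` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

model = LogisticRegression()

# cv=5 trains and scores the model five times, each time on a different split.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

print(scores)                        # one accuracy value per fold
print(scores.mean(), scores.std())   # summary across folds
```

`cross_val_score` clones the estimator for each fold, so the `model` object passed in is left unfitted.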
|
Provides train/test indices to split data in train/test sets.
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_index, test_index in kf.split(X):
    # Assumes X and y are NumPy arrays; for pandas objects use .iloc[...] instead
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
|
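The split loop above is usually combined with fitting and scoring inside each fold. The sketch below shows one way to do that; the synthetic regression data, the `Ridge` estimator, and the R-squared metric are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

# Synthetic regression data as NumPy arrays (an assumption for this sketch).
X, y = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
fold_sizes = []

for train_index, test_index in kf.split(X):
    # A fresh model per fold, trained only on that fold's training indices.
    model = Ridge(alpha=1.0)
    model.fit(X[train_index], y[train_index])

    # Score on the held-out indices of the same fold.
    y_pred = model.predict(X[test_index])
    fold_scores.append(r2_score(y[test_index], y_pred))
    fold_sizes.append(len(test_index))

print(fold_sizes)            # each sample lands in exactly one test fold
print(np.mean(fold_scores))  # average R-squared across folds
```

Writing the loop by hand gives access to each fold's model and predictions, which `cross_val_score` does not return.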