Data Science Cheatsheet
A comprehensive cheat sheet covering essential concepts, tools, and techniques in Data Science. It provides a quick reference for machine learning algorithms, data manipulation, statistical methods, and more.
Fundamentals
Key Concepts
Supervised Learning: Learning from labeled data to predict outcomes.
Unsupervised Learning: Discovering patterns in unlabeled data.
Reinforcement Learning: Training an agent to make decisions in an environment so as to maximize cumulative reward.
Bias-Variance Tradeoff: Choosing model complexity to balance bias (underfitting) against variance (overfitting), since reducing one tends to increase the other.
Cross-Validation: Evaluating a model on multiple held-out subsets of the data to estimate how well it generalizes.
Feature Engineering: Creating new features or transforming existing ones to improve model performance.
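As a rough illustration of cross-validation, the sketch below scores one model on five held-out folds. The dataset (iris), the model, and the fold count are assumptions picked for the example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Small built-in dataset, chosen only for illustration.
X, y = load_iris(return_X_y=True)

# max_iter is raised so the solver has room to converge on unscaled features.
model = LogisticRegression(max_iter=1000)

# cv=5 splits the data into 5 folds; each fold is held out once for scoring.
scores = cross_val_score(model, X, y, cv=5)

print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```

A large gap between folds suggests the estimate is unstable, which is often a sign of too little data or a high-variance model.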
Common Algorithms
Linear Regression: Predicts a continuous outcome as a weighted sum of the input features.
Logistic Regression: Predicts the probability of a binary outcome using a logistic (sigmoid) function.
Decision Trees: Recursively partitions data into subsets based on feature values to make predictions.
Random Forest: An ensemble of decision trees that averages their predictions (regression) or takes a majority vote (classification) to reduce variance.
Support Vector Machines (SVM): Finds the maximum-margin hyperplane that separates data into classes.
K-Nearest Neighbors (KNN): Classifies a point by the majority class among its k nearest neighbors.
K-Means Clustering: Partitions data into k clusters by assigning each point to its nearest cluster centroid.
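To make one of the algorithms above concrete, here is a minimal from-scratch sketch of K-Nearest Neighbors with NumPy. The function name `knn_predict` and the toy data are hypothetical, and Euclidean distance is an assumption; in practice a library implementation would normally be used.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two well-separated groups.
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
y_train = np.array(["a", "a", "a", "b", "b", "b"])

print(knn_predict(X_train, y_train, np.array([1.2, 1.4])))
print(knn_predict(X_train, y_train, np.array([6.4, 6.2])))
```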
Python for Data Science
Data Manipulation with Pandas
Creating a DataFrame
Selecting Columns
Filtering Rows
Grouping and Aggregation
Handling Missing Data
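The five operations listed above could look roughly like this in pandas. The column names and values are hypothetical sample data, and the choice to fill missing scores with the column mean is just one possible strategy.

```python
import numpy as np
import pandas as pd

# Creating a DataFrame from a dict of columns (hypothetical sample data).
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy", "Dee"],
    "team": ["x", "y", "x", "y"],
    "score": [90.0, np.nan, 70.0, 80.0],
})

# Selecting columns: a single label gives a Series, a list gives a DataFrame.
scores = df["score"]
subset = df[["name", "score"]]

# Filtering rows with a boolean mask (rows with a missing score are excluded).
high = df[df["score"] > 75]

# Grouping and aggregation: mean score per team (NaN is skipped by default).
team_means = df.groupby("team")["score"].mean()

# Handling missing data: count, drop, or fill.
n_missing = df["score"].isna().sum()
dropped = df.dropna(subset=["score"])
filled = df.fillna({"score": df["score"].mean()})

print(high)
print(team_means)
print(filled)
```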
Data Visualization with Matplotlib and Seaborn
Basic Plotting with Matplotlib
Scatter Plot with Seaborn
Histogram with Seaborn
Box Plot with Seaborn
Scikit-learn for Machine Learning
Training a Model
Making Predictions
Model Evaluation
Data Preprocessing
Train-Test Split
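The steps listed above can be sketched as one short workflow. Note that in practice the split and preprocessing come before training, so the code follows that order. The synthetic dataset, the logistic regression model, and the 25% test size are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Train-test split: hold out 25% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Data preprocessing: fit the scaler on the training set only, then apply
# the same transformation to the test set to avoid leaking test statistics.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Training a model.
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Making predictions.
y_pred = model.predict(X_test_scaled)

# Model evaluation.
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.2f}")
```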
Statistical Methods
Descriptive Statistics
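Mean: Arithmetic average of a dataset (sum of the values divided by their count).
Median: Middle value of a sorted dataset (the average of the two middle values when the count is even).
Mode: Most frequent value in a dataset.
Standard Deviation: Measure of the spread of data around the mean; the square root of the variance.
Variance: Average squared deviation from the mean; the square of the standard deviation.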
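These statistics can be computed with Python's standard `statistics` module. The sample values below are made up, and the population forms (`pstdev`, `pvariance`) are used so the numbers come out round.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample

mean = statistics.mean(data)          # 5
median = statistics.median(data)      # 4.5
mode = statistics.mode(data)          # 4

# Population versions divide by n.
pop_std = statistics.pstdev(data)     # 2.0
pop_var = statistics.pvariance(data)  # 4

# Sample versions divide by n - 1 and are the usual choice when the data
# is a sample drawn from a larger population.
sample_var = statistics.variance(data)
sample_std = statistics.stdev(data)

print(mean, median, mode, pop_std, pop_var)
```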
Inferential Statistics
Hypothesis Testing: A method for using sample data to test a claim about a population parameter, by comparing a null hypothesis against an alternative.
P-value: Probability of obtaining results at least as extreme as the observed results, assuming the null hypothesis is true.
Confidence Interval: Range of values computed from a sample by a procedure that captures the true population parameter in a stated fraction of repeated samples (e.g. 95%).
T-test: Used to compare the means of two groups.
ANOVA: Used to compare the means of three or more groups.
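A short sketch of these tests using SciPy. The measurements are invented, the 0.05 significance level is a conventional assumption, and Welch's variant of the t-test is chosen here because it does not assume equal variances.

```python
from scipy import stats

# Made-up measurements for three groups.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3]
group_b = [6.2, 6.8, 6.5, 7.1, 6.9, 6.4]
group_c = [5.0, 5.2, 4.8, 5.5, 5.1, 4.9]

# Two-sample t-test: do groups A and B have the same mean?
# equal_var=False selects Welch's t-test.
t_stat, t_p = stats.ttest_ind(group_a, group_b, equal_var=False)

# One-way ANOVA: do all three groups share the same mean?
f_stat, anova_p = stats.f_oneway(group_a, group_b, group_c)

# 95% confidence interval for the mean of group A, using the t distribution.
n = len(group_a)
mean_a = sum(group_a) / n
sem_a = stats.sem(group_a)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=n - 1, loc=mean_a, scale=sem_a)

alpha = 0.05  # conventional significance level, chosen for illustration
print(f"t-test: t = {t_stat:.2f}, p = {t_p:.4f}, reject H0: {t_p < alpha}")
print(f"ANOVA:  F = {f_stat:.2f}, p = {anova_p:.4f}, reject H0: {anova_p < alpha}")
print(f"95% CI for mean of A: ({ci_low:.2f}, {ci_high:.2f})")
```

A small p-value from ANOVA says only that at least one group mean differs; it does not say which one, so a follow-up comparison is usually needed.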
Model Evaluation and Tuning
Evaluation Metrics
Accuracy: Fraction of correctly classified instances; can be misleading when classes are imbalanced.
Precision: Fraction of predicted positives that are true positives, TP / (TP + FP).
Recall: Fraction of actual positives that are correctly identified, TP / (TP + FN).
F1-Score: Harmonic mean of precision and recall.
AUC-ROC: Area under the Receiver Operating Characteristic curve; measures how well a classifier ranks positives above negatives across all thresholds (0.5 is chance level, 1.0 is perfect).
Mean Squared Error (MSE): Average squared difference between predicted and actual values.
R-squared: Proportion of variance in the dependent variable that is explained by the independent variables.
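To show how most of these metrics are computed, the sketch below derives them by hand from small made-up label and value lists. AUC-ROC is left out because it needs predicted scores rather than hard labels. In practice `sklearn.metrics` provides equivalents.

```python
# Classification metrics from confusion-matrix counts (made-up labels).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Regression metrics (made-up values).
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 7.5, 10.0]

n = len(actual)
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual sum of squares
mean_actual = sum(actual) / n
ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total sum of squares

mse = ss_res / n
r_squared = 1 - ss_res / ss_tot

print(accuracy, precision, recall, round(f1, 4))
print(mse, r_squared)
```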
Hyperparameter Tuning
Grid Search: Exhaustively evaluates every combination in a specified grid of hyperparameter values.
Randomized Search: Samples a fixed number of candidates from a hyperparameter search space.
Bayesian Optimization: Builds a probabilistic model of the metric as a function of the hyperparameters and uses it to choose which candidates to evaluate next.
Cross-Validation: Scores each candidate on multiple train/validation splits, so the chosen hyperparameters do not depend on a single split.
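Grid search and randomized search, each combined with cross-validation, might be sketched with scikit-learn as follows. The dataset, the KNN model, and the parameter ranges are assumptions for illustration. Bayesian optimization is omitted because scikit-learn does not ship it; it typically needs a third-party library.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Grid search: all 4 x 2 = 8 combinations are scored with 5-fold CV.
param_grid = {"n_neighbors": [1, 3, 5, 7], "weights": ["uniform", "distance"]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X, y)

# Randomized search: only n_iter candidates are sampled from the space.
param_space = {
    "n_neighbors": list(range(1, 21)),
    "weights": ["uniform", "distance"],
}
rand = RandomizedSearchCV(
    KNeighborsClassifier(), param_space, n_iter=5, cv=5, random_state=0
)
rand.fit(X, y)

print("Grid search best:", grid.best_params_, grid.best_score_)
print("Randomized search best:", rand.best_params_, rand.best_score_)
```

Randomized search usually scales better when the search space is large, since its cost is set by `n_iter` rather than by the number of possible combinations.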