Binary:
garv-aggarwal / DATA 301
DATA 301
A comprehensive cheat sheet covering essential data analysis tools and techniques, including data representation, Excel, SQL, Python, R, visualization, open data, APIs, and geospatial concepts.
Data Representation & Excel Basics
Data Representation
|
Base-2 numeral system (0, 1). Fundamental for digital data. |
Sizes: |
MB (Megabyte), GB (Gigabyte), TB (Terabyte). 1 TB = 1024 GB, 1 GB = 1024 MB. |
Data Types: |
Strings (text), Dates (YYYY-MM-DD), Floats (decimal numbers). |
File Encodings: |
UTF-8 (Unicode Transformation Format - 8-bit) is the most common. ASCII, UTF-16 are others. |
Character Encoding |
A character encoding tells the computer how to interpret raw zeroes and ones into actual characters. It determines what code belongs to which character. Common encodings: ASCII, UTF-8, UTF-16 |
Endianness |
Describes the order in which bytes of a multi-byte data type (like integers) are stored in computer memory. Big-Endian: Most significant byte first. Little-Endian: Least significant byte first. |
Essential Excel Formulas
|
Returns one value if a condition is TRUE and another value if it’s FALSE. |
|
Counts the number of cells within a range that meet the given criteria. |
|
Adds all the numbers in a range of cells. |
|
Returns the largest value in a set of numbers. |
|
Joins several text strings into one text string. Use |
|
Calculates the average (arithmetic mean) of the numbers in a range. |
Excel - Pivot Tables
Pivot tables are used to summarize and analyze data. They allow you to rearrange and group data in different ways to see patterns and trends. Key Components:
|
Creating a Pivot Table:
|
Example Scenario:
|
SQL Fundamentals
Basic SQL Commands
|
Creates a new table in the database. Define column names and their data types. |
|
Inserts a new row into a table. |
|
Modifies existing data in a table. Use |
|
Deletes rows from a table. Always use a |
|
Retrieves data from one or more tables. |
Filtering and Aggregations
|
Filters rows based on a specified condition. |
|
Groups rows that have the same values in specified columns. Often used with aggregate functions. |
|
Sorts the result-set based on one or more columns. |
|
Limits the number of rows returned. |
|
Filters results after grouping, used with |
Joins
JOIN: Returns rows when there is a match in both tables based on the join condition.
|
LEFT JOIN (or LEFT OUTER JOIN): Returns all rows from the left table (TableA), and the matched rows from the right table (TableB). If there is no match in TableB, it returns NULL values for columns from TableB.
|
RIGHT JOIN (or RIGHT OUTER JOIN): Returns all rows from the right table (TableB), and the matched rows from the left table (TableA). If there is no match in TableA, it returns NULL values for columns from TableA.
|
FULL OUTER JOIN: Returns all rows when there is a match in one of the tables. If there are rows in TableA that do not match TableB, or rows in TableB that do not match TableA, the FULL OUTER JOIN will include these rows in the result set. Columns that do not have matching values will contain NULL.
|
INNER JOIN: Same as JOIN. Returns rows only when there’s a match in both tables.
|
Python and R Programming
Python Fundamentals
Input Handling: |
Using
|
Loops: |
|
List/Dictionary Logic: |
Creating, accessing, and manipulating lists and dictionaries.
|
File Reading & String Parsing: |
Reading data from files and manipulating strings.
|
Function Definitions: |
Defining reusable blocks of code.
|
R Programming Fundamentals
Vectors: |
Creating and manipulating vectors.
|
Subsetting: |
Using
|
Data Frames: |
Working with data frames.
|
Statistical Tests: |
Performing t-tests and other statistical analyses.
|
Summary Statistics: |
Calculating descriptive statistics.
|
Data Processing - Python & R
Python - Sum, Max, Filter: |
Simple data processing operations.
|
R - Filtering & Summary Stats: |
Data manipulation and summary statistics.
|
Visualization, Open Data, and Geospatial
Data Visualization Tools
Python: |
Using
|
R: |
Using
|
Tableau: |
A data visualization tool for creating interactive dashboards and reports. Dimensions (blue pills) are categorical, and Measures (green pills) are numerical. Drag dimensions and measures to rows, columns, and marks cards to create visualizations. |
Open Data & APIs
CSV/JSON Data Loading (Python): |
Loading data from CSV and JSON files.
|
Accessing APIs (Python): |
Accessing data from APIs, e.g., Google Maps.
|
CSV/JSON Data Loading (R): |
Loading data from CSV and JSON files.
|
Accessing APIs (R): |
Accessing data from APIs
|
Geospatial (GIS) Fundamentals
Coordinate Data: |
Geospatial File Types:
|
Simple Mapping (Python):
|
Simple Mapping (R):
|