Unit-1 : Fundamentals of Data Analytics
INDEX
1.1 Exploratory Data Analysis (EDA)
- 1.1.1 Types of Exploratory Data Analysis
- 1.1.2 Univariate Analysis
- 1.1.3 Bivariate Analysis
- 1.1.4 Multivariate Analysis
- 1.1.5 Handling Missing Data and Outliers
1.2 Understanding the Data
- 1.2.1 Quantitative Data: Discrete and Continuous
- 1.2.2 Qualitative Data: Non-numerical (Nominal and Ordinal)
1.3 Spread of Data
- 1.3.1 Normal Distribution
- 1.3.2 Skewed Distribution
- 1.3.3 Skewness and Kurtosis
1.1 Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is the first and most crucial step in the data analysis process. Before building any model or making decisions, analysts must understand what the data looks like, how it behaves, and whether it is complete and consistent.
EDA focuses on discovering patterns, spotting anomalies, checking assumptions, and summarizing data using numerical methods and visual techniques.
Why is EDA Important?
Helps you understand the story behind the data
Identifies missing values, duplicates, and noisy data
Detects outliers and unusual patterns
Helps decide which statistical methods or models are suitable
Provides insights that guide feature selection
Ensures data reliability and quality
Simple Example
Suppose you have a dataset of students’ marks: [45, 78, 90, 32, 88, 90, 100, 12]
From EDA, you may find:
Mean ≈ 66.9 (535 / 8 = 66.875)
The student who scored 12 is far below the rest and is a potential outlier
Marks are not evenly distributed
High variation in performance
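The observations above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative, and the z-score (distance from the mean measured in standard deviations) is just one of several ways to judge how unusual a value is.

```python
import statistics

marks = [45, 78, 90, 32, 88, 90, 100, 12]

mean_mark = statistics.mean(marks)     # arithmetic average
spread = statistics.pstdev(marks)      # population standard deviation
lowest = min(marks)

# z-score: how many standard deviations the lowest mark lies from the mean
z_lowest = (lowest - mean_mark) / spread

print(f"Mean   = {mean_mark:.3f}")
print(f"SD     = {spread:.1f}")
print(f"Lowest = {lowest} (z-score {z_lowest:.2f})")
```

A large standard deviation relative to the mean is what "high variation in performance" looks like numerically.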
1.1.1 Types of Exploratory Data Analysis
EDA can be classified into multiple types based on purpose and technique.
1. Descriptive EDA
Descriptive EDA provides a statistical summary of data.
It answers basic questions like:
What is the average value?
How spread out are the values?
What is the most frequent category?
Tools Used:
Mean, Median, Mode
Range, Variance, Standard Deviation
Frequency Tables
Proportion and percentages
Example:
Dataset: Heights of 5 students = [150, 152, 160, 165, 170]
Mean height = 159.4
Standard Deviation shows how spread out the heights are
This gives a quick understanding of the variable.
2. Graphical EDA
Graphical EDA uses visual representations of data to identify patterns, trends, and relationships.
Common Graphs
Histogram – distribution of numerical data
Bar Chart – comparison of categories
Pie Chart – proportion of categories
Box Plot – median, spread, outliers
Scatter Plot – relationship between two numerical variables
Heatmap – correlation between multiple variables
Example:
A histogram of students’ marks may reveal:
Most students scored between 60–80
Whether the marks are left-skewed or right-skewed
Some outliers exist
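A histogram is simply a count of values per range (bin). The sketch below draws a text-only histogram so it runs without any plotting library; it reuses the marks from the simple example earlier in this unit, and the bin width of 20 is an assumption chosen for illustration.

```python
from collections import Counter

marks = [45, 78, 90, 32, 88, 90, 100, 12]
bin_width = 20   # assumed bin size; other widths give a different picture

# Map each mark to the lower edge of its bin, then count marks per bin
counts = Counter((m // bin_width) * bin_width for m in marks)

for start in sorted(counts):
    label = f"{start}-{start + bin_width - 1}"
    print(f"{label:>7} | {'#' * counts[start]}")
```

Each `#` stands for one student, so the longest bar marks the range where most scores fall.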
3. Univariate, Bivariate, and Multivariate EDA
Univariate EDA
Explores one variable at a time.
Example: Only analyzing “Salary”.
Bivariate EDA
Explores the relationship between two variables.
Example: “Salary vs Experience”.
Multivariate EDA
Explores relationships among three or more variables.
Example: “Salary vs Experience vs Education Level”.
4. Quantitative vs Qualitative EDA
Quantitative (Numerical) Data
Includes numbers like:
Age, Height, Salary, Temperature
Qualitative (Categorical) Data
Includes categories like:
Gender, City, Product Type
1.1.2 Univariate Analysis (Detailed Explanation)
Univariate Analysis is a type of exploratory data analysis where only one variable is examined at a time.
The goal is to understand the basic characteristics of that variable—its distribution, central value, spread, and overall behavior.
“Uni” means “one,” so univariate analysis answers questions about a single column in the dataset.
Why Do We Use Univariate Analysis?
Univariate analysis helps you understand:
What values the variable takes
How those values are distributed
Whether the data is skewed or normal
Whether there are outliers
What the typical (central) value is
How spread out the data is
It is the foundation of data understanding before moving to more advanced steps like bivariate or multivariate analysis.
Types of Variables in Univariate Analysis
Univariate analysis differs depending on the type of variable:
1. Numerical (Quantitative) Variables
These variables contain numbers.
Examples: marks, height, weight, salary, age.
Two types:
Continuous data: decimals possible (e.g., height = 172.5 cm)
Discrete data: whole numbers (e.g., number of students = 45)
2. Categorical (Qualitative) Variables
These variables contain text/labels instead of numbers.
Examples: gender, city, preferred product, blood group.
Two types:
Nominal: no order (e.g., Red, Blue, Green)
Ordinal: ordered categories (e.g., Grade A, B, C)
Univariate Analysis Techniques for Numerical Data
A. Measures of Central Tendency
These describe the “center” of the data.
1. Mean (Average)
Sum of values ÷ number of values
Example:
Ages = [10, 12, 14]
Mean = (10 + 12 + 14) / 3 = 12
2. Median (Middle Value)
The center value of the sorted data (when the count is even, the average of the two middle values).
Example:
Data = [5, 7, 9]
Median = 7
3. Mode (Most Frequent Value)
Example:
Data = [2, 2, 3, 4]
Mode = 2
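The three measures above can be reproduced with Python's built-in `statistics` module. This is a small sketch on the same example data; the variable names are illustrative.

```python
import statistics

ages = [10, 12, 14]
mean_age = statistics.mean(ages)              # (10 + 12 + 14) / 3 = 12

middle = statistics.median([5, 7, 9])         # 7, the center of the sorted values
even_case = statistics.median([1, 2, 3, 4])   # 2.5, average of the two middle values

most_common = statistics.mode([2, 2, 3, 4])   # 2, the most frequent value

print(mean_age, middle, even_case, most_common)
```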
B. Measures of Dispersion (Spread of Data)
1. Range
Max – Min
Example: 90 – 20 = 70
2. Variance
Average squared difference from mean.
3. Standard Deviation
Square root of variance.
Higher SD = more spread out data.
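The definitions of variance and standard deviation can be worked through step by step. The sketch below uses the ages from the mean example; it computes the population variance (divide by n) and, for comparison, the sample variance (divide by n - 1), which is the version commonly used when the data is a sample from a larger group.

```python
import math

ages = [10, 12, 14]
n = len(ages)
mean_age = sum(ages) / n

# Squared difference of each value from the mean
squared_devs = [(x - mean_age) ** 2 for x in ages]

pop_variance = sum(squared_devs) / n            # average squared difference
pop_sd = math.sqrt(pop_variance)                # square root of variance
sample_variance = sum(squared_devs) / (n - 1)   # divides by n - 1 instead of n

print(f"Population variance = {pop_variance:.3f}")
print(f"Population SD       = {pop_sd:.3f}")
print(f"Sample variance     = {sample_variance:.3f}")
```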
C. Shape of the Distribution
Univariate analysis checks:
Skewness (left/right skew)
Kurtosis (heaviness of the tails, often described as peakedness/flatness)
Normal distribution
Example:
Salary data is usually right-skewed because few people earn very high salaries.
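Skewness and kurtosis can be computed from the central moments of the data. The sketch below uses the simple moment-based formulas (statistical packages often apply small-sample corrections, so their results can differ slightly); the salary figures are hypothetical values in thousands.

```python
def central_moment(values, k):
    """Average of (x - mean)^k over all values."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)

def skewness(values):
    """Positive = right-skewed, negative = left-skewed, 0 = symmetric."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Kurtosis relative to a normal distribution (which scores 0)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

salaries = [20, 22, 25, 27, 40, 90]   # hypothetical, one very high earner
symmetric = [1, 2, 3, 4, 5]

print(f"Salary skewness    = {skewness(salaries):.2f}")
print(f"Symmetric skewness = {skewness(symmetric):.2f}")
print(f"Salary kurtosis    = {excess_kurtosis(salaries):.2f}")
```

The single high salary stretches the right tail, which shows up as a positive skewness value.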
D. Visual Tools for Numerical Univariate Analysis
1. Histogram
Shows how frequently values appear in different ranges.
2. Box Plot
Shows median, quartiles, and outliers.
3. Density Plot
Smooth version of a histogram.
Example:
Data: Marks = [40, 50, 60, 70, 100]
Histogram shows distribution
Box plot shows 100 standing apart from the other marks as a possible outlier
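A box plot is drawn from five numbers: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. The sketch below computes them for the marks above with the standard library. Quartiles can be defined in several ways; `method="inclusive"` is one convention, chosen here as an assumption, and other conventions can give slightly different Q1 and Q3.

```python
import statistics

marks = [40, 50, 60, 70, 100]

# n=4 splits the data into quarters and returns the three cut points
q1, median, q3 = statistics.quantiles(marks, n=4, method="inclusive")

five_number_summary = (min(marks), q1, median, q3, max(marks))
print("Min, Q1, Median, Q3, Max =", five_number_summary)
print("Interquartile range (IQR) =", q3 - q1)
```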
Univariate Analysis Techniques for Categorical Data
A. Frequency Distribution
Count how many times each category appears.
Example:
Fruits Bought:
Apple: 30
Banana: 20
Mango: 10
B. Percentage Distribution
Percentage for each class.
Apple → 30 / (30+20+10) = 50%
C. Mode
Most frequent category.
Here: Apple
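The frequency, percentage, and mode for the fruit example can be computed in a few lines. This is a minimal sketch using `collections.Counter`; the list of purchases is rebuilt from the counts given above.

```python
from collections import Counter

# Rebuild the raw observations from the counts in the example
purchases = ["Apple"] * 30 + ["Banana"] * 20 + ["Mango"] * 10

frequency = Counter(purchases)                 # count per category
total = sum(frequency.values())
percentages = {fruit: 100 * count / total for fruit, count in frequency.items()}
mode_category = frequency.most_common(1)[0][0]  # most frequent category

for fruit, count in frequency.most_common():
    print(f"{fruit:<7} {count:>3}  {percentages[fruit]:5.1f}%")
print("Mode:", mode_category)
```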
Visual Tools for Categorical Data
1. Bar Chart
Best for comparing category frequencies.
2. Pie Chart
Displays proportions.
3. Donut Chart
Alternative to a pie chart.
Real-life Example of Univariate Analysis
Dataset: Monthly Salaries (in ₹)
[20,000, 22,000, 25,000, 27,000, 40,000, 90,000]
Univariate Analysis Gives:
Mean = 37,333 (224,000 / 6)
Median = 26,000
Mode = None
Range = 90,000 – 20,000 = 70,000
SD = Large (high variation)
Outlier = 90,000
Distribution = Right-skewed
This shows that most salaries lie between 20k and 40k, with one very high value pulling the mean well above the median.
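The whole salary summary can be reproduced in code. The sketch below also applies the common 1.5 × IQR rule to flag outliers; that rule and the "inclusive" quartile convention are assumptions made for illustration, not the only valid choices.

```python
import statistics

salaries = [20_000, 22_000, 25_000, 27_000, 40_000, 90_000]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)
salary_range = max(salaries) - min(salaries)
sd_salary = statistics.pstdev(salaries)   # population standard deviation

# 1.5 x IQR rule: values outside the fences are flagged as outliers
q1, _, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [s for s in salaries if s < lower_fence or s > upper_fence]

print(f"Mean     = {mean_salary:,.0f}")
print(f"Median   = {median_salary:,.0f}")
print(f"Range    = {salary_range:,}")
print(f"SD       = {sd_salary:,.0f}")
print(f"Outliers = {outliers}")
```

Because the mean is sensitive to the extreme value while the median is not, the median is usually the better "typical salary" for data like this.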
Importance of Univariate Analysis
Helps understand each variable individually
Supports decision-making
Helps in cleaning data (detect missing values/outliers)
Essential for selecting modeling techniques
Helps decide binning, transformation, or normalization
1.1.3 Bivariate Analysis
Bivariate Analysis is a method of exploring and analyzing two variables together to understand the relationship, association, influence, or comparison between them.
“Bi” means two, so bivariate analysis studies how one variable changes with respect to another.
It answers questions like:
Does one variable affect the other?
Are two variables related?
How strong is the relationship?
What type of relationship exists (linear, non-linear, categorical)?
Purpose of Bivariate Analysis
Bivariate analysis helps to:
Identify patterns between two variables
Look for possible cause-and-effect relationships (an association on its own shows correlation, not causation)
Compare groups
Understand the strength and direction of relationships
Prepare data for predictive modeling
Example:
Does height increase with age?
Does study time affect marks?
Does gender influence purchasing behavior?
Types of Bivariate Analysis Based on Data Types
There are three major combinations possible between two variables:
Numerical vs Numerical
Numerical vs Categorical
Categorical vs Categorical
Each combination has its own methods, charts, and interpretation.
1. Numerical vs Numerical
This type studies the relationship between two numerical (quantitative) variables.
Examples:
Hours Studied vs Marks
Age vs Height
Salary vs Experience
Temperature vs Ice-cream Sales
Techniques Used
A. Scatter Plot
Points plotted on X and Y axes
Shows pattern or direction
Example:
Hours studied (X) vs Marks scored (Y)
If points rise upward → positive relationship.
B. Correlation Coefficient
Measures:
Strength (weak/strong)
Direction (positive/negative)
Values range from:
+1 → perfect positive correlation
–1 → perfect negative correlation
0 → no linear correlation
Example:
Study hours and marks may have correlation of +0.85 (strong positive).
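The correlation coefficient described here is usually Pearson's r. The sketch below computes it directly from its definition; the study-hours and marks figures are hypothetical and were made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both spreads, in [-1, +1]."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(spread_x * spread_y)

hours = [1, 2, 3, 4, 5, 6]          # hypothetical study hours
marks = [35, 50, 48, 65, 70, 85]    # hypothetical marks

r = pearson_r(hours, marks)
print(f"r = {r:.2f}")
```

A value close to +1 means marks tend to rise in a nearly straight line as study hours rise; it does not by itself prove that studying causes the higher marks.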
C. Line Graphs
Used when the data is ordered in time or sequence.
D. Regression Analysis (Basic Level)
Simple linear regression predicts one variable based on another.
Example:
Predicting marks based on study time.
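Simple linear regression fits a straight line, marks = intercept + slope × hours, by least squares. The sketch below implements the textbook formulas by hand; the data and the 7-hour prediction are hypothetical and only illustrate the idea.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

hours = [1, 2, 3, 4, 5, 6]          # hypothetical study hours
marks = [35, 50, 48, 65, 70, 85]    # hypothetical marks

slope, intercept = fit_line(hours, marks)
predicted = intercept + slope * 7   # predicted marks for 7 hours of study

print(f"marks = {intercept:.1f} + {slope:.1f} * hours")
print(f"Predicted marks for 7 hours: {predicted:.1f}")
```

The slope reads as "extra marks per additional hour of study" within the range of the observed data.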
2. Numerical vs Categorical
This type compares a numerical variable across different categories.
Examples:
Salary (numerical) vs Gender (categorical)
Height (numerical) vs Sports Category
Marks (numerical) vs School Type (Private/Government)
Techniques Used
A. Box Plot
Shows:
Median
Quartiles
Spread
Outliers
Example:
Comparing salary of male vs female employees.
B. Bar Charts / Error Bars
Shows mean/median value for each category.
C. Grouped Descriptive Statistics
Mean, median, SD for each category.
Example:
Average marks of boys vs girls.
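Grouped descriptive statistics mean splitting the numerical values by category and summarizing each group separately. The sketch below does this with a dictionary of lists; the records are hypothetical.

```python
from collections import defaultdict
import statistics

# Hypothetical (group, marks) records
records = [("Boys", 62), ("Girls", 71), ("Boys", 55),
           ("Girls", 80), ("Boys", 78), ("Girls", 65)]

# Collect the marks belonging to each category
groups = defaultdict(list)
for group, mark in records:
    groups[group].append(mark)

# Summarize each category on its own
summary = {
    group: {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.pstdev(values),
    }
    for group, values in groups.items()
}

for group, stats in summary.items():
    print(f"{group:<6} mean={stats['mean']:.1f} "
          f"median={stats['median']:.1f} sd={stats['sd']:.1f}")
```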
3. Categorical vs Categorical
This type studies the relationship between two categorical variables.
Examples:
Gender vs Purchase Decision
Education Level vs Job Type
City vs Internet Usage Category
Techniques Used
A. Cross-Tabulation (Contingency Table)
A table showing the frequencies of categories.
Example:
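One way to build a contingency table is to count each pair of categories. The sketch below uses `collections.Counter` on hypothetical survey responses (gender and purchase decision); the data is invented purely to show the layout.

```python
from collections import Counter

# Hypothetical survey responses: (gender, purchased?)
responses = [("Male", "Yes"), ("Female", "Yes"), ("Male", "No"),
             ("Female", "Yes"), ("Female", "No"), ("Male", "Yes"),
             ("Male", "No"), ("Female", "Yes")]

table = Counter(responses)                       # frequency of each pair
row_labels = sorted({gender for gender, _ in responses})
col_labels = sorted({decision for _, decision in responses})

# Print the table: one row per gender, one column per decision
print(f"{'':<8}" + "".join(f"{col:>6}" for col in col_labels))
for row in row_labels:
    cells = "".join(f"{table[(row, col)]:>6}" for col in col_labels)
    print(f"{row:<8}{cells}")
```

Each cell holds the number of respondents with that combination, so comparing rows shows whether the purchase decision differs between the groups.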