Unit-1 : Fundamentals of Data Analytics

INDEX

1.1 Exploratory Data Analysis (EDA)

1.1.1 Types of Exploratory Data Analysis:
1.1.2 Univariate Analysis
1.1.3 Bivariate Analysis
1.1.4 Multivariate Analysis
1.1.5 Handling Missing Data and Outliers

1.2 Understanding the Data:

1.2.1 Quantitative Data : Discrete and Continuous
1.2.2 Qualitative Data : Non-numerical (Nominal and Ordinal)

1.3 Spread of Data

1.3.1 Normal Distribution
1.3.2 Skewed Distribution
1.3.3 Skewness and Kurtosis


Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is the first and most crucial step in the data analysis process. Before building any model or making decisions, analysts must understand what the data looks like, how it behaves, and whether it is complete and consistent.

EDA focuses on discovering patterns, spotting anomalies, checking assumptions, and summarizing data using numerical methods and visual techniques.

Why is EDA Important?

Helps you understand the story behind the data
Identifies missing values, duplicates, and noisy data
Detects outliers and unusual patterns
Helps decide which statistical methods or models are suitable
Provides insights that guide feature selection
Ensures data reliability and quality

Simple Example

Suppose you have a dataset of students’ marks:
[45, 78, 90, 32, 88, 90, 100, 12]

From EDA, you may find:

Mean ≈ 66.9 (535 ÷ 8 = 66.875)
The score of 12 is far below the others and may be an outlier
Marks are not evenly distributed
High variation in performance
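
The figures in this example can be reproduced with Python's standard `statistics` module. The following is a minimal sketch; the variable names are illustrative and not part of the original notes.

```python
import statistics

marks = [45, 78, 90, 32, 88, 90, 100, 12]

mean_mark = statistics.mean(marks)      # arithmetic average
median_mark = statistics.median(marks)  # middle of the sorted values
mark_range = max(marks) - min(marks)    # gap between highest and lowest mark

print(f"Mean   = {mean_mark:.1f}")
print(f"Median = {median_mark}")
print(f"Range  = {mark_range}")
```

A median well above the mean hints that a few low scores are pulling the average down.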

1.1.1 Types of Exploratory Data Analysis

EDA can be classified into multiple types based on purpose and technique.

1. Descriptive EDA

Descriptive EDA provides a statistical summary of data.
It answers basic questions like:

What is the average value?
How spread out are the values?
What is the most frequent category?

Tools Used:

Mean, Median, Mode
Range, Variance, Standard Deviation
Frequency Tables
Proportion and percentages

Example:

Dataset: Heights of 5 students = [150, 152, 160, 165, 170]

Mean height = 159.4
Standard Deviation shows how spread out the heights are

This gives a quick understanding of the variable.

2. Graphical EDA

Graphical EDA uses visual representations of data to identify patterns, trends, and relationships.

Common Graphs

Histogram – distribution of numerical data
Bar Chart – comparison of categories
Pie Chart – proportion of categories
Box Plot – median, spread, outliers
Scatter Plot – relationship between two numerical variables
Heatmap – correlation between multiple variables

Example:

A histogram of students’ marks may reveal:

Most students scored between 60–80
Whether the marks are left-skewed or right-skewed
Some outliers exist
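
A histogram is built by counting how many values fall into each range (bin). The sketch below does that counting by hand and prints a text histogram, reusing the marks from the earlier simple example. The bin width of 20 and the helper name `bin_start` are assumptions made for illustration.

```python
from collections import Counter

marks = [45, 78, 90, 32, 88, 90, 100, 12]  # marks from the earlier example
bin_width = 20                              # assumed bin width

def bin_start(mark, width=bin_width, top=100):
    """Return the lower edge of the bin a mark falls into.

    A mark equal to the top value (100) is placed in the last bin.
    """
    return min(mark // width * width, top - width)

counts = Counter(bin_start(m) for m in marks)

for start in range(0, 100, bin_width):
    end = start + bin_width - 1 if start + bin_width < 100 else 100
    print(f"{start:3d}-{end:3d} | {'#' * counts[start]}")
```

In practice a plotting library would draw the bars, but the counting step is the same idea.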

3. Univariate, Bivariate, and Multivariate EDA

Univariate EDA

Explores one variable at a time.
Example: Only analyzing “Salary”.

Bivariate EDA

Explores the relationship between two variables.
Example: “Salary vs Experience”.

Multivariate EDA

Explores relationships among three or more variables.
Example: “Salary vs Experience vs Education Level”.

4. Quantitative vs Qualitative EDA

Quantitative (Numerical) Data

Includes numbers like:
Age, Height, Salary, Temperature

Qualitative (Categorical) Data

Includes categories like:
Gender, City, Product Type

1.1.2 Univariate Analysis (Detailed Explanation)

Univariate Analysis is a type of exploratory data analysis where only one variable is examined at a time.
The goal is to understand the basic characteristics of that variable—its distribution, central value, spread, and overall behavior.

“Uni” means “one,” so univariate analysis answers questions about a single column in the dataset.

Why Do We Use Univariate Analysis?

Univariate analysis helps you understand:

What values the variable takes
How those values are distributed
Whether the data is skewed or normal
Whether there are outliers
What the typical (central) value is
How spread out the data is

It is the foundation of data understanding before moving to more advanced steps like bivariate or multivariate analysis.

Types of Variables in Univariate Analysis

Univariate analysis differs depending on the type of variable:

1. Numerical (Quantitative) Variables

These variables contain numbers.
Examples: marks, height, weight, salary, age.

Two types:

Continuous data: decimals possible (e.g., height = 172.5 cm)
Discrete data: whole numbers (e.g., number of students = 45)

2. Categorical (Qualitative) Variables

These variables contain text/labels instead of numbers.
Examples: gender, city, preferred product, blood group.

Two types:

Nominal: no order (e.g., Red, Blue, Green)
Ordinal: ordered categories (e.g., Grade A, B, C)

Univariate Analysis Techniques for Numerical Data

A. Measures of Central Tendency

These describe the “center” of the data.

1. Mean (Average)

Sum of values ÷ number of values
Example:
Ages = [10, 12, 14]
Mean = (10 + 12 + 14) / 3 = 12

2. Median (Middle Value)

The center value of the sorted dataset. If the number of values is even, the median is the average of the two middle values.
Example:
Data = [5, 7, 9]
Median = 7

3. Mode (Most Frequent Value)

Example:
Data = [2, 2, 3, 4]
Mode = 2
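
The three worked examples above can be checked with the standard `statistics` module. This is a small sketch; the variable names are illustrative.

```python
import statistics

mean_age = statistics.mean([10, 12, 14])      # expected: 12
median_value = statistics.median([5, 7, 9])   # expected: 7
mode_value = statistics.mode([2, 2, 3, 4])    # expected: 2

print(mean_age, median_value, mode_value)
```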

B. Measures of Dispersion (Spread of Data)

1. Range

Max – Min
Example: 90 – 20 = 70

2. Variance

Average squared difference from the mean.

3. Standard Deviation

Square root of variance.
Higher SD = more spread out data.
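
To make the definitions concrete, the sketch below computes the variance and standard deviation step by step for the ages used in the mean example. It shows both the population form (divide by n, matching the definition above) and the sample form (divide by n - 1), which is commonly used when the data is a sample from a larger group.

```python
import math

data = [10, 12, 14]  # ages from the mean example

mean = sum(data) / len(data)
squared_diffs = [(x - mean) ** 2 for x in data]  # squared distance from the mean

population_variance = sum(squared_diffs) / len(data)    # divide by n
sample_variance = sum(squared_diffs) / (len(data) - 1)  # divide by n - 1

population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(f"Population variance = {population_variance:.3f}, SD = {population_sd:.3f}")
print(f"Sample variance     = {sample_variance:.3f}, SD = {sample_sd:.3f}")
```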

C. Shape of the Distribution

Univariate analysis checks:

Skewness (left/right skew)
Kurtosis (heaviness of the tails, often described as peakedness/flatness)
Normal distribution

Example:
Salary data is usually right-skewed because few people earn very high salaries.
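
Skewness and kurtosis can be estimated from the central moments of the data. The sketch below uses the simple population-moment formulas (other, bias-corrected formulas exist); the function name `moment_shape` and both sample datasets are hypothetical, chosen only to illustrate a symmetric and a right-skewed case.

```python
def moment_shape(values):
    """Return (skewness, excess kurtosis) using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # variance
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3  # 0 for a normal distribution
    return skewness, excess_kurtosis

symmetric = [1, 2, 3, 4, 5]               # hypothetical symmetric data
right_skewed = [20, 22, 25, 27, 40, 90]   # hypothetical salaries (thousands)

sym_skew, sym_kurt = moment_shape(symmetric)
right_skew, right_kurt = moment_shape(right_skewed)

print(f"Symmetric data:    skewness = {sym_skew:.2f}")
print(f"Right-skewed data: skewness = {right_skew:.2f}")
```

A skewness near 0 suggests symmetry, a positive value a longer right tail, and a negative value a longer left tail.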

D. Visual Tools for Numerical Univariate Analysis

1. Histogram

Shows how frequently values appear in different ranges.

2. Box Plot

Shows median, quartiles, and outliers.

3. Density Plot

Smooth version of a histogram.

Example:

Data: Marks = [40, 50, 60, 70, 100]

Histogram shows distribution
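Box plot shows 100 standing well apart from the other marks, making it a candidate outlier (with the common 1.5 × IQR rule it sits exactly on the upper fence: 70 + 1.5 × 20 = 100)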
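
A box plot is drawn from the quartiles, and points beyond the "fences" at 1.5 × IQR from the box are usually marked as outliers. The sketch below computes those numbers for the marks above. The 1.5 × IQR rule and the `inclusive` quartile method are assumptions here; other quartile methods give slightly different fences.

```python
import statistics

marks = [40, 50, 60, 70, 100]

# "inclusive" interpolates between data points, treating the data
# as the full population rather than a sample.
q1, median, q3 = statistics.quantiles(marks, n=4, method="inclusive")

iqr = q3 - q1                   # interquartile range (height of the box)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values strictly outside the fences are flagged as outliers.
outliers = [m for m in marks if m < lower_fence or m > upper_fence]

print(f"Q1 = {q1}, Median = {median}, Q3 = {q3}, IQR = {iqr}")
print(f"Fences = [{lower_fence}, {upper_fence}], outliers = {outliers}")
```

For these five marks the upper fence works out to exactly 100, so 100 sits on the boundary: it stands apart visually, but a strict application of the rule does not flag it.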

Univariate Analysis Techniques for Categorical Data

A. Frequency Distribution

Count how many times each category appears.

Example:
Fruits Bought:

Apple: 30
Banana: 20
Mango: 10

B. Percentage Distribution

Percentage for each class.

Apple → 30 / (30+20+10) = 50%

C. Mode

Most frequent category.
Here: Apple
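
The frequency, percentage, and mode for the fruit example can be computed with `collections.Counter`. In this sketch the raw list of purchases is rebuilt from the counts given above; the variable names are illustrative.

```python
from collections import Counter

# Rebuild individual purchases from the counts in the example.
purchases = ["Apple"] * 30 + ["Banana"] * 20 + ["Mango"] * 10

frequency = Counter(purchases)   # category -> count
total = sum(frequency.values())
percentage = {fruit: 100 * count / total for fruit, count in frequency.items()}
mode_category = frequency.most_common(1)[0][0]  # most frequent category

for fruit, count in frequency.items():
    print(f"{fruit:<7} {count:>3}  {percentage[fruit]:5.1f}%")
print("Mode:", mode_category)
```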

Visual Tools for Categorical Data

1. Bar Chart

Best for comparing category frequencies.

2. Pie Chart

Displays proportions.

3. Donut Chart

Alternative to a pie chart.

Real-life Example of Univariate Analysis

Dataset: Monthly Salaries (in ₹)

[20,000, 22,000, 25,000, 27,000, 40,000, 90,000]

Univariate Analysis Gives:

Mean ≈ 37,333 (224,000 ÷ 6)
Median = 26,000 (average of the two middle values, 25,000 and 27,000)
Mode = None (every value occurs only once)
Range = 90,000 – 20,000 = 70,000
SD = Large (high variation)
Outlier = 90,000
Distribution = Right-skewed

This shows that most salaries lie between ₹20,000 and ₹40,000, with one very high value pulling the mean well above the median.
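
The salary summary can be recomputed as follows. This is a sketch using the standard `statistics` module; the population standard deviation is used here as an assumption, since the notes do not say which form is intended.

```python
import statistics

salaries = [20_000, 22_000, 25_000, 27_000, 40_000, 90_000]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)
modes = statistics.multimode(salaries)       # all values tie, so no single mode
salary_range = max(salaries) - min(salaries)
sd = statistics.pstdev(salaries)             # population standard deviation

print(f"Mean   = {mean_salary:,.0f}")
print(f"Median = {median_salary:,.0f}")
print(f"Range  = {salary_range:,}")
print(f"SD     = {sd:,.0f}")
print("Single mode exists:", len(modes) == 1)
print("Mean above median (suggests right skew):", mean_salary > median_salary)
```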

Importance of Univariate Analysis

Helps understand each variable individually
Supports decision-making
Helps in cleaning data (detect missing values/outliers)
Essential for selecting modeling techniques
Helps decide binning, transformation, or normalization

1.1.3 Bivariate Analysis

Bivariate Analysis is a method of exploring and analyzing two variables together to understand the relationship, association, influence, or comparison between them.

“Bi” means two, so bivariate analysis studies how one variable changes with respect to another.

It answers questions like:

Does one variable affect the other?
Are two variables related?
How strong is the relationship?
What type of relationship exists (linear, non-linear, categorical)?

Purpose of Bivariate Analysis

Bivariate analysis helps to:

Identify patterns between two variables
Explore possible cause-and-effect relationships (a correlation alone does not prove causation)
Compare groups
Understand the strength and direction of relationships
Prepare data for predictive modeling

Example:
Does height increase with age?
Does study time affect marks?
Does gender influence purchasing behavior?

Types of Bivariate Analysis Based on Data Types

There are three major combinations possible between two variables:

Numerical vs Numerical
Numerical vs Categorical
Categorical vs Categorical

Each combination has its own methods, charts, and interpretation.

1. Numerical vs Numerical

This type studies the relationship between two numerical (quantitative) variables.

Examples:

Hours Studied vs Marks
Age vs Height
Salary vs Experience
Temperature vs Ice-cream Sales

Techniques Used

A. Scatter Plot

Points plotted on X and Y axes
Shows pattern or direction

Example:
Hours studied (X) vs Marks scored (Y)
If the points rise from left to right → positive relationship.

B. Correlation Coefficient

Measures:

Strength (weak/strong)
Direction (positive/negative)

Values range from:

+1 → perfect positive correlation
–1 → perfect negative correlation
0 → no linear correlation

Example:
Study hours and marks may have correlation of +0.85 (strong positive).
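
The most common correlation coefficient is Pearson's r, which divides the co-variation of the two variables by the product of their individual variations. The sketch below implements it directly; the function name `pearson_r` and the study-hours data are hypothetical and used only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(variation_x * variation_y)

hours = [1, 2, 3, 4, 5]        # hypothetical study hours
marks = [35, 50, 48, 70, 82]   # hypothetical marks

r = pearson_r(hours, marks)
print(f"r = {r:.2f}")  # close to +1: strong positive relationship
```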

C. Line Graphs

Used when the data is ordered by time or by sequence.

D. Regression Analysis (Basic Level)

Simple linear regression predicts one variable based on another.

Example:
Predicting marks based on study time.
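
Simple linear regression fits a straight line, marks = slope × hours + intercept, by least squares. The sketch below is one minimal way to do it by hand; the function name `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    slope = numerator / denominator
    intercept = mean_y - slope * mean_x
    return slope, intercept

hours = [1, 2, 3, 4, 5]        # hypothetical study hours
marks = [35, 50, 48, 70, 82]   # hypothetical marks

slope, intercept = fit_line(hours, marks)
predicted = slope * 6 + intercept  # predicted marks for 6 hours of study

print(f"marks = {slope:.1f} * hours + {intercept:.1f}")
print(f"Predicted marks for 6 hours: {predicted:.1f}")
```

A prediction like this is only a rough guide, especially outside the range of the observed data.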

2. Numerical vs Categorical

This type compares a numerical variable across different categories.

Examples:

Salary (numerical) vs Gender (categorical)
Height (numerical) vs Sports Category
Marks (numerical) vs School Type (Private/Government)

Techniques Used

A. Box Plot

Shows:

Median
Quartiles
Spread
Outliers

Example:
Comparing salary of male vs female employees.

B. Bar Charts / Error Bars

Shows mean/median value for each category.

C. Grouped Descriptive Statistics

Mean, median, SD for each category.

Example:
Average marks of boys vs girls.
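
Grouped statistics are computed by splitting the numerical values by category and summarizing each group separately. The sketch below uses only the standard library; the records are hypothetical.

```python
import statistics
from collections import defaultdict

# Hypothetical (group, marks) records.
records = [
    ("Boys", 62), ("Girls", 75), ("Boys", 70),
    ("Girls", 68), ("Boys", 54), ("Girls", 81),
]

# Collect the marks belonging to each category.
groups = defaultdict(list)
for group, mark in records:
    groups[group].append(mark)

# Summarize each category separately.
summary = {
    group: {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),  # sample standard deviation
    }
    for group, values in groups.items()
}

for group, stats in summary.items():
    print(f"{group}: mean={stats['mean']:.1f}, "
          f"median={stats['median']}, sd={stats['sd']:.1f}")
```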

3. Categorical vs Categorical

This type studies the relationship between two categorical variables.

Examples:

Gender vs Purchase Decision
Education Level vs Job Type
City vs Internet Usage Category

Techniques Used

A. Cross-Tabulation (Contingency Table)

A table showing the frequencies of categories.

Example:
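
A contingency table can be sketched by counting each pair of categories. The observations below are hypothetical, invented only to illustrate Gender vs Purchase Decision.

```python
from collections import Counter

# Hypothetical (gender, purchased) observations.
observations = [
    ("Male", "Yes"), ("Male", "No"), ("Female", "Yes"),
    ("Female", "Yes"), ("Male", "No"), ("Female", "No"),
    ("Male", "Yes"), ("Female", "Yes"),
]

table = Counter(observations)  # (row, column) -> frequency
rows = sorted({gender for gender, _ in observations})
cols = sorted({decision for _, decision in observations})

# Print the table: one row per gender, one column per decision.
print(f"{'':8}" + "".join(f"{c:>5}" for c in cols))
for r in rows:
    print(f"{r:8}" + "".join(f"{table[(r, c)]:>5}" for c in cols))
```

Each cell shows how many observations fall into that combination of the two categories.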


602 – Data Analytics using Python Unit-1 : Fundamentals of Data Analytics
