Unit-1 : Fundamentals of Data Analytics

INDEX

1.1 Exploratory Data Analysis (EDA)

  • 1.1.1 Types of Exploratory Data Analysis
  • 1.1.2 Univariate Analysis
  • 1.1.3 Bivariate Analysis
  • 1.1.4 Multivariate Analysis
  • 1.1.5 Handling Missing Data and Outliers

1.2 Understanding the Data:

  • 1.2.1 Quantitative Data : Discrete and Continuous
  • 1.2.2 Qualitative Data : Non-numerical (Nominal and Ordinal)

1.3 Spread of Data

  • 1.3.1 Normal Distribution
  • 1.3.2 Skewed Distribution
  • 1.3.3 Skewness and Kurtosis

1.1 Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is the first and most crucial step in the data analysis process. Before building any model or making decisions, analysts must understand what the data looks like, how it behaves, and whether it is complete and consistent.

EDA focuses on discovering patterns, spotting anomalies, checking assumptions, and summarizing data using numerical methods and visual techniques.

Why Is EDA Important?

  • Helps you understand the story behind the data

  • Identifies missing values, duplicates, and noisy data

  • Detects outliers and unusual patterns

  • Helps decide which statistical methods or models are suitable

  • Provides insights that guide feature selection

  • Ensures data reliability and quality

Simple Example

Suppose you have a dataset of students’ marks:
[45, 78, 90, 32, 88, 90, 100, 12]

From EDA, you may find:

  • Mean ≈ 66.9

  • The score of 12 is far below the others and is a potential outlier

  • Marks are not evenly distributed

  • High variation in performance
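These observations can be checked with a few lines of Python. The following is a minimal sketch that uses only the standard-library `statistics` module; the variable names are illustrative, and the population standard deviation is just one reasonable way to quantify the variation.

```python
import statistics

marks = [45, 78, 90, 32, 88, 90, 100, 12]

mean_marks = statistics.mean(marks)      # arithmetic mean of the marks
spread = statistics.pstdev(marks)        # population standard deviation
lowest, highest = min(marks), max(marks)

print(f"Mean: {mean_marks:.1f}")
print(f"Standard deviation: {spread:.1f}")
print(f"Lowest: {lowest}, Highest: {highest}")
```

A standard deviation that is large relative to the mean, together with a lowest score far from the rest, is what signals the high variation and the possible outlier.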

1.1.1 Types of Exploratory Data Analysis

EDA can be classified into multiple types based on purpose and technique.

1. Descriptive EDA

Descriptive EDA provides a statistical summary of data.
It answers basic questions like:

  • What is the average value?

  • How spread out are the values?

  • What is the most frequent category?

Tools Used:

  • Mean, Median, Mode

  • Range, Variance, Standard Deviation

  • Frequency Tables

  • Proportions and percentages

Example:

Dataset: Heights of 5 students = [150, 152, 160, 165, 170]

  • Mean height = 159.4

  • Standard Deviation shows how spread out the heights are

This gives a quick understanding of the variable.

2. Graphical EDA

Graphical EDA uses visual representations of data to identify patterns, trends, and relationships.

Common Graphs

  • Histogram – distribution of numerical data

  • Bar Chart – comparison of categories

  • Pie Chart – proportion of categories

  • Box Plot – median, spread, outliers

  • Scatter Plot – relationship between two numerical variables

  • Heatmap – correlation between multiple variables

Example:

A histogram of students’ marks may reveal:

  • Most students scored between 60–80

  • Whether the marks are left-skewed or right-skewed

  • Some outliers exist
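The counting step behind a histogram can be sketched without any plotting library. The example below reuses the marks from the simple example earlier and prints a text histogram; the bin width of 20 and the maximum mark of 100 are assumptions chosen for illustration.

```python
from collections import Counter

marks = [45, 78, 90, 32, 88, 90, 100, 12]  # sample marks, reused for illustration
bin_width = 20                              # assumed bin width
max_mark = 100                              # assumed maximum possible mark

def bin_start(value):
    """Lower edge of the bin holding value; the top mark is placed in the last bin."""
    last_bin = (max_mark // bin_width - 1) * bin_width
    return min((value // bin_width) * bin_width, last_bin)

# Count how many marks fall into each bin.
bin_counts = Counter(bin_start(m) for m in marks)

# Print one row per bin, with one '#' per mark in that bin.
for start in range(0, max_mark, bin_width):
    bar = "#" * bin_counts[start]
    print(f"{start:3d}-{start + bin_width:<3d} | {bar}")
```

A plotting library would draw the same counts as bars; the tallest bar shows where most values lie, and isolated short bars far from the rest hint at outliers.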

3. Univariate, Bivariate, and Multivariate EDA

Univariate EDA

Explores one variable at a time.
Example: Only analyzing “Salary”.

Bivariate EDA

Explores the relationship between two variables.
Example: “Salary vs Experience”.

Multivariate EDA

Explores relationships among three or more variables.
Example: “Salary vs Experience vs Education Level”.

4. Quantitative vs Qualitative EDA

Quantitative (Numerical) Data

Includes numbers like:
Age, Height, Salary, Temperature

Qualitative (Categorical) Data

Includes categories like:
Gender, City, Product Type

1.1.2 Univariate Analysis (Detailed Explanation)

Univariate Analysis is a type of exploratory data analysis where only one variable is examined at a time.
The goal is to understand the basic characteristics of that variable—its distribution, central value, spread, and overall behavior.

“Uni” means “one,” so univariate analysis answers questions about a single column in the dataset.

Why Do We Use Univariate Analysis?

Univariate analysis helps you understand:

  • What values the variable takes

  • How those values are distributed

  • Whether the data is skewed or approximately normal

  • Whether there are outliers

  • What the typical (central) value is

  • How spread out the data is

It is the foundation of data understanding before moving to more advanced steps like bivariate or multivariate analysis.

Types of Variables in Univariate Analysis

Univariate analysis differs depending on the type of variable:

1. Numerical (Quantitative) Variables

These variables contain numbers.
Examples: marks, height, weight, salary, age.

Two types:

  • Continuous data: can take any value within a range, including decimals (e.g., height = 172.5 cm)

  • Discrete data: countable values, usually whole numbers (e.g., number of students = 45)

2. Categorical (Qualitative) Variables

These variables contain text/labels instead of numbers.
Examples: gender, city, preferred product, blood group.

Two types:

  • Nominal: no order (e.g., Red, Blue, Green)

  • Ordinal: ordered categories (e.g., Grade A, B, C)

Univariate Analysis Techniques for Numerical Data

A. Measures of Central Tendency

These describe the “center” of the data.

1. Mean (Average)

Sum of values ÷ number of values
Example:
Ages = [10, 12, 14]
Mean = (10 + 12 + 14) / 3 = 12

2. Median (Middle Value)

The middle value of the sorted dataset. When the number of values is even, the median is the average of the two middle values.
Example:
Data = [5, 7, 9]
Median = 7

3. Mode (Most Frequent Value)

Example:
Data = [2, 2, 3, 4]
Mode = 2
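The three measures above can be reproduced with Python's standard-library `statistics` module. This is a small sketch using the same example data; the variable names and the extra even-length list are illustrative additions.

```python
import statistics

ages = [10, 12, 14]
odd_data = [5, 7, 9]
even_data = [5, 7, 9, 11]      # illustrative list with an even number of values
repeated = [2, 2, 3, 4]

mean_age = statistics.mean(ages)              # sum of values / number of values
median_odd = statistics.median(odd_data)      # middle value of the sorted data
median_even = statistics.median(even_data)    # average of the two middle values
mode_value = statistics.mode(repeated)        # most frequent value

print(mean_age, median_odd, median_even, mode_value)
```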

B. Measures of Dispersion (Spread of Data)

1. Range

Max – Min
Example: 90 – 20 = 70

2. Variance

The average of the squared differences from the mean.

3. Standard Deviation

Square root of variance.
Higher SD = more spread out data.
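The three measures of dispersion can be computed step by step as follows. This is a sketch on a hypothetical list of scores (chosen so that the range matches the 90 – 20 example); it uses the population formulas, which divide by the number of values, whereas the sample versions divide by one less.

```python
import math
import statistics

scores = [20, 45, 60, 75, 90]   # hypothetical scores, not from the original notes

# Range: largest value minus smallest value.
data_range = max(scores) - min(scores)

# Variance: average of the squared differences from the mean (population form).
mean_score = sum(scores) / len(scores)
variance = sum((s - mean_score) ** 2 for s in scores) / len(scores)

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(f"Range: {data_range}")
print(f"Variance: {variance:.1f}")
print(f"Standard deviation: {std_dev:.2f}")

# The standard library gives the same population variance.
library_variance = statistics.pvariance(scores)
```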

C. Shape of the Distribution

Univariate analysis checks:

  • Skewness (left/right skew)

  • Kurtosis (heaviness of the tails, often described as peakedness/flatness)

  • Normal distribution

Example:
Salary data is usually right-skewed because a few people earn very high salaries.
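Skewness and kurtosis can be estimated from the central moments of the data. The sketch below uses the simple moment-based (population) formulas; the helper names and both data lists are hypothetical, and statistical packages may apply small-sample corrections that give slightly different numbers.

```python
def central_moment(values, order):
    """Average of (value - mean) raised to the given power."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)

def skewness(values):
    """Third central moment scaled by the standard deviation cubed."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth central moment scaled by the variance squared, minus 3 (the normal-distribution value)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

salaries_in_thousands = [20, 22, 25, 27, 40, 90]   # hypothetical right-skewed data
symmetric_values = [1, 2, 3, 4, 5]                  # hypothetical symmetric data

print(f"Skewness (salaries):  {skewness(salaries_in_thousands):.2f}")
print(f"Skewness (symmetric): {skewness(symmetric_values):.2f}")
print(f"Excess kurtosis (symmetric): {excess_kurtosis(symmetric_values):.2f}")
```

A positive skewness indicates a longer right tail, a negative value a longer left tail, and a value near zero a roughly symmetric distribution.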

D. Visual Tools for Numerical Univariate Analysis

1. Histogram

Shows how frequently values appear in different ranges.

2. Box Plot

Shows median, quartiles, and outliers.

3. Density Plot

Smooth version of a histogram.

Example:

Data: Marks = [40, 50, 60, 70, 100]

  • Histogram shows distribution

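  • Box plot shows 100 standing well apart from the other marks; with quartiles of 50 and 70, the common 1.5 × IQR rule puts the upper fence at 70 + 1.5 × 20 = 100, so 100 is a borderline value rather than a clear outlier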

Univariate Analysis Techniques for Categorical Data

A. Frequency Distribution

Count how many times each category appears.

Example:
Fruits Bought:

  • Apple: 30

  • Banana: 20

  • Mango: 10

B. Percentage Distribution

Percentage for each class.

Apple → 30 / (30+20+10) = 50%

C. Mode

Most frequent category.
Here: Apple
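The frequency, percentage, and mode calculations for the fruit example can be sketched with `collections.Counter` from the standard library. The variable names are illustrative.

```python
from collections import Counter

# Frequency distribution: how many times each category appears.
fruit_counts = Counter({"Apple": 30, "Banana": 20, "Mango": 10})

# Percentage distribution: each count divided by the total.
total = sum(fruit_counts.values())
percentages = {fruit: 100 * count / total for fruit, count in fruit_counts.items()}

# Mode: the most frequent category.
mode_fruit, mode_count = fruit_counts.most_common(1)[0]

for fruit, count in fruit_counts.items():
    print(f"{fruit:7} {count:3d}  {percentages[fruit]:5.1f}%")
print(f"Mode: {mode_fruit}")
```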

Visual Tools for Categorical Data

1. Bar Chart

Best for comparing category frequencies.

2. Pie Chart

Displays proportions.

3. Donut Chart

Alternative to a pie chart.

Real-life Example of Univariate Analysis

Dataset: Monthly Salaries (in ₹)

[20,000, 22,000, 25,000, 27,000, 40,000, 90,000]

Univariate Analysis Gives:

  • Mean ≈ 37,333

  • Median = 26,000

  • Mode = None (no value repeats)

  • Range = 90,000 – 20,000 = 70,000

  • SD = Large (high variation)

  • Outlier = 90,000

  • Distribution = Right-skewed

This shows that most salaries fall between ₹20,000 and ₹40,000, with one very high value pulling the mean well above the median.
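The summary for the salary data can be reproduced in Python. This sketch uses the standard-library `statistics` module and flags outliers with the 1.5 × IQR rule; that rule and the "inclusive" quartile method are assumptions, since the notes do not say how the outlier was identified.

```python
import statistics

salaries = [20_000, 22_000, 25_000, 27_000, 40_000, 90_000]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)
salary_range = max(salaries) - min(salaries)

# Quartiles and interquartile range (IQR).
q1, _, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [s for s in salaries if s < lower_fence or s > upper_fence]

print(f"Mean: {mean_salary:,.0f}")
print(f"Median: {median_salary:,.0f}")
print(f"Range: {salary_range:,}")
print(f"Outliers: {outliers}")
```

A mean well above the median is a quick numerical hint that the distribution is right-skewed.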

Importance of Univariate Analysis

  • Helps understand each variable individually

  • Supports decision-making

  • Helps in cleaning data (detect missing values/outliers)

  • Essential for selecting modeling techniques

  • Helps decide binning, transformation, or normalization

1.1.3 Bivariate Analysis 

Bivariate Analysis is a method of exploring and analyzing two variables together to understand the relationship, association, influence, or comparison between them.

“Bi” means two, so bivariate analysis studies how one variable changes with respect to another.

It answers questions like:

  • Does one variable affect the other?

  • Are two variables related?

  • How strong is the relationship?

  • What type of relationship exists (linear, non-linear, categorical)?


Purpose of Bivariate Analysis

Bivariate analysis helps to:

  • Identify patterns between two variables

  • Explore possible cause-and-effect relationships (an observed association shows correlation, not necessarily causation)

  • Compare groups

  • Understand the strength and direction of relationships

  • Prepare data for predictive modeling

Example:
Does height increase with age?
Does study time affect marks?
Does gender influence purchasing behavior?


Types of Bivariate Analysis Based on Data Types

There are three major combinations possible between two variables:

  1. Numerical vs Numerical

  2. Numerical vs Categorical

  3. Categorical vs Categorical

Each combination has its own methods, charts, and interpretation.


1. Numerical vs Numerical

This type studies the relationship between two numerical (quantitative) variables.

Examples:

  • Hours Studied vs Marks

  • Age vs Height

  • Salary vs Experience

  • Temperature vs Ice-cream Sales

Techniques Used

A. Scatter Plot

  • Points plotted on X and Y axes

  • Shows pattern or direction

Example:
Hours studied (X) vs Marks scored (Y)
If points rise upward → positive relationship.

B. Correlation Coefficient

Measures:

  • Strength (weak/strong)

  • Direction (positive/negative)

Values range from:

  • +1 → perfect positive correlation

  • –1 → perfect negative correlation

  • 0 → no linear correlation

Example:
Study hours and marks may have a correlation of +0.85 (strong positive).
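The Pearson correlation coefficient, the most common choice for two numerical variables, can be computed directly from its definition. The sketch below is written from scratch for illustration; the function name and the study-hours data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled by their individual spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(spread_x * spread_y)

hours_studied = [1, 2, 3, 4, 5]        # hypothetical data
marks_scored = [35, 50, 55, 70, 80]    # hypothetical data

r = pearson_r(hours_studied, marks_scored)
print(f"Correlation: {r:+.2f}")
```

A value close to +1 means marks rise almost in a straight line with study hours; reversing the sign of one variable would flip the result to a value close to –1.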

C. Line Graphs

Used when the data is ordered in time or otherwise sequential.

D. Regression Analysis (Basic Level)

Simple linear regression predicts one variable based on another.

Example:
Predicting marks based on study time.
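A simple linear regression fits a straight line, marks = slope × hours + intercept, by least squares. The following sketch implements the textbook formulas by hand; the function name and the data are hypothetical, and a real analysis would normally use a statistics library.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

hours_studied = [1, 2, 3, 4, 5]        # hypothetical data
marks_scored = [35, 50, 55, 70, 80]    # hypothetical data

slope, intercept = fit_line(hours_studied, marks_scored)

# Predict marks for a student who studies 6 hours.
predicted_marks = slope * 6 + intercept
print(f"marks = {slope:.1f} * hours + {intercept:.1f}")
print(f"Predicted marks for 6 hours: {predicted_marks:.1f}")
```

The slope is read as the expected change in marks for each additional hour of study.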


2. Numerical vs Categorical

This type compares a numerical variable across different categories.

Examples:

  • Salary (numerical) vs Gender (categorical)

  • Height (numerical) vs Sports Category

  • Marks (numerical) vs School Type (Private/Government)

Techniques Used

A. Box Plot

Shows:

  • Median

  • Quartiles

  • Spread

  • Outliers

Example:
Comparing salary of male vs female employees.

B. Bar Charts / Error Bars

Show the mean or median value for each category; error bars indicate the variability around it.

C. Grouped Descriptive Statistics

Mean, median, SD for each category.

Example:
Average marks of boys vs girls.
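Grouped statistics amount to splitting the numerical values by category and summarizing each group separately. The sketch below does this with the standard library for marks by school type; the records are hypothetical and the population standard deviation is an arbitrary choice.

```python
import statistics
from collections import defaultdict

# Hypothetical (school type, marks) records.
records = [
    ("Private", 72), ("Government", 65), ("Private", 80),
    ("Government", 58), ("Private", 64), ("Government", 75),
]

# Collect the marks belonging to each category.
groups = defaultdict(list)
for school_type, mark in records:
    groups[school_type].append(mark)

# Summarize each category on its own.
summary = {
    name: {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.pstdev(values),
    }
    for name, values in groups.items()
}

for name, stats in summary.items():
    print(f"{name:11} mean={stats['mean']:.1f} median={stats['median']} sd={stats['sd']:.1f}")
```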


3. Categorical vs Categorical

This type studies the relationship between two categorical variables.

Examples:

  • Gender vs Purchase Decision

  • Education Level vs Job Type

  • City vs Internet Usage Category

Techniques Used

A. Cross-Tabulation (Contingency Table)

A table showing how often each combination of categories from the two variables occurs.

Example:
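One way to build such a table is to count each pair of categories. The sketch below does this for Gender vs Purchase Decision using `collections.Counter`; the survey responses are hypothetical and serve only to show the layout of a contingency table.

```python
from collections import Counter

# Hypothetical (gender, purchased) survey responses.
responses = [
    ("Male", "Yes"), ("Male", "No"), ("Female", "Yes"),
    ("Female", "Yes"), ("Male", "Yes"), ("Female", "No"),
]

# Count how often each (gender, decision) pair occurs.
pair_counts = Counter(responses)
genders = sorted({gender for gender, _ in responses})
decisions = sorted({decision for _, decision in responses})

# Rows are genders, columns are purchase decisions.
table = {g: {d: pair_counts[(g, d)] for d in decisions} for g in genders}

print(f"{'':8}" + "".join(f"{d:>6}" for d in decisions))
for g in genders:
    print(f"{g:8}" + "".join(f"{table[g][d]:>6}" for d in decisions))
```

Each cell holds the number of respondents with that combination, so comparing rows shows whether the purchase decision differs between the groups.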

 
