Unit-4: Data Filtering and Cleaning
4.1 Subsetting and Filtering Data
4.2 Adding, Removing, and Renaming Variables/Columns
4.3 Data Cleaning and Transformation
4.4 Identifying and Handling Missing Values
4.5 Data Type Conversion and Recoding Variables
NOTES
Unit 4.1: Subsetting and Filtering Data
🔹 Brief Explanation:
Subsetting and filtering data means selecting specific parts of a dataset based on certain conditions. It helps focus on relevant information by removing unwanted rows or columns. This process is important for data analysis and cleaning.
🔹 Detailed Explanation:
When working with large datasets, it is often necessary to extract only the data that is important or relevant for analysis. This is where subsetting and filtering come into play.
✅ 1. Subsetting Data
Subsetting refers to selecting specific rows, columns, or both from a dataset.
A. Subsetting Rows:
This means selecting specific records based on index numbers or a condition.
Example in R:
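A minimal sketch (the data frame `students` and its columns are assumed purely for illustration):

```r
# A small hypothetical data frame
students <- data.frame(
  Name  = c("Asha", "Ravi", "Meena", "John"),
  Age   = c(17, 19, 21, 16),
  Marks = c(82, 74, 91, 68)
)

# Subset rows by index: the first two records
students[1:2, ]

# Subset rows by a condition: students aged 18 or above
students[students$Age >= 18, ]
```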
B. Subsetting Columns:
This involves selecting only the needed columns from a dataset.
Example in R:
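Continuing the same assumed `students` data frame:

```r
# Select a single column by name
students$Name

# Select several columns by name
students[, c("Name", "Marks")]
```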
C. Subsetting Rows and Columns Together:
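A sketch combining both on the same assumed data frame:

```r
# Rows where Marks exceed 75, keeping only the Name and Marks columns
students[students$Marks > 75, c("Name", "Marks")]
```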
✅ 2. Filtering Data
Filtering means selecting rows based on specific logical conditions.
Examples in R:
Filter rows where Age > 18
Filter rows where Gender is “Male”
Using dplyr package:
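A sketch of these filters; the `students` data frame and its Gender column are assumptions, and the dplyr package must be installed:

```r
library(dplyr)

students <- data.frame(
  Name   = c("Asha", "Ravi", "Meena", "John"),
  Age    = c(17, 19, 21, 16),
  Gender = c("Female", "Male", "Female", "Male")
)

# Base R: rows where Age > 18
students[students$Age > 18, ]

# Base R: rows where Gender is "Male"
students[students$Gender == "Male", ]

# dplyr: the same conditions using filter()
filter(students, Age > 18)
filter(students, Gender == "Male")
```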
✅ Why Subsetting and Filtering Are Important:
Helps focus only on meaningful data
Reduces memory and processing time
Makes data analysis more efficient
Helps remove errors and outliers
✅ Use Cases:
| Task | Purpose |
|---|---|
| Selecting specific columns | To focus on important variables |
| Filtering rows | To analyze specific categories or conditions |
| Removing missing or invalid data | To clean the dataset |
✅ Real-Life Example:
Suppose you have a dataset of student records with names, marks, and grades.
You want to:
View only students who scored above 75%
Focus only on “Name” and “Marks” columns
You will:
Filter rows where Marks > 75
Subset columns “Name” and “Marks”
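A sketch of this workflow (the column names and marks are assumed):

```r
students <- data.frame(
  Name  = c("Asha", "Ravi", "Meena"),
  Marks = c(82, 70, 91),
  Grade = c("A", "B", "A")
)

# Filter rows where Marks > 75 and subset the Name and Marks columns
subset(students, Marks > 75, select = c(Name, Marks))
```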
Unit 4.2: Adding, Removing, and Renaming Variables/Columns
🔹 Brief Explanation:
Adding, removing, and renaming columns (also called variables) in a dataset is an important part of data cleaning. It helps you modify the structure of your data according to your needs, such as including new information, getting rid of unwanted data, or making column names more readable and meaningful.
🔹 Detailed Explanation:
When working with data frames in R (or other data tools), it’s common to manage variables (columns) to prepare the dataset for analysis. Here’s how:
✅ 1. Adding Variables/Columns
You can add a new column to a data frame by assigning a new value or calculation to it.
Example in R:
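A minimal sketch, assuming a data frame `students` with marks in three hypothetical subjects:

```r
students <- data.frame(
  Name    = c("Asha", "Ravi"),
  Maths   = c(80, 65),
  Science = c(75, 70),
  English = c(88, 60)
)

# Assigning a calculation to a new name creates the column
students$TotalMarks <- students$Maths + students$Science + students$English
```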
This creates a new column TotalMarks by adding marks of three subjects.
You can also add columns using the mutate() function from the dplyr package:
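Continuing the sketch above with the same assumed columns:

```r
library(dplyr)

# mutate() adds (or overwrites) a column in a pipeline-friendly call
students <- mutate(students, TotalMarks = Maths + Science + English)
```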
✅ 2. Removing Variables/Columns
To remove a column, you can set it to NULL or use a function that excludes it.
Example in R (using NULL):
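For example, dropping the TotalMarks column created in the sketch above:

```r
# Assigning NULL removes the column from the data frame
students$TotalMarks <- NULL
```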
Using dplyr’s select() to remove a column:
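An equivalent sketch with dplyr (an alternative to the NULL assignment above):

```r
library(dplyr)

# select() keeps columns; a minus sign drops the named one
students <- select(students, -TotalMarks)
```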
The minus sign (-) tells R to exclude that column.
✅ 3. Renaming Variables/Columns
Renaming helps make column names more understandable and meaningful.
Example in R (base method):
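A sketch of the base-R approach, using an assumed column StdName:

```r
students <- data.frame(StdName = c("Asha", "Ravi"), Marks = c(82, 65))

# Replace the matching entry in names() with the new name
names(students)[names(students) == "StdName"] <- "StudentName"
```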
Using dplyr’s rename() function:
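A sketch with placeholder names so it matches the explanation that follows:

```r
library(dplyr)

# General pattern: rename(data, NewName = OldName)
df <- data.frame(OldName = c(1, 2, 3))
df <- rename(df, NewName = OldName)
```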
Here, NewName is the new column name, and OldName is the existing column name.
✅ Use Cases:
| Operation | Purpose |
|---|---|
| Add Columns | To calculate new values, such as totals, averages, or categories |
| Remove Columns | To eliminate irrelevant or redundant data |
| Rename Columns | To improve clarity, consistency, and understanding of the dataset |
✅ Real-Life Example:
Suppose you’re working with a student performance dataset:
Add a column Percentage based on total marks.
Remove the column RollNo if it is no longer required.
Rename the column StdName to StudentName for better readability.
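A sketch that combines all three operations on a hypothetical student performance data frame (the column names and the assumed maximum of 300 marks are illustrative):

```r
library(dplyr)

performance <- data.frame(
  RollNo  = 1:3,
  StdName = c("Asha", "Ravi", "Meena"),
  Total   = c(255, 210, 280)   # assumed to be out of 300
)

performance <- performance %>%
  mutate(Percentage = Total / 300 * 100) %>%  # add a derived column
  select(-RollNo) %>%                         # remove an unneeded column
  rename(StudentName = StdName)               # rename for readability
```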
Unit 4.3: Data Cleaning and Transformation
🔹 Brief Explanation:
Data cleaning and transformation are essential steps in preparing data for analysis. Data cleaning involves fixing or removing incorrect, incomplete, or duplicated data. Data transformation means converting data into the right format or structure so that it can be analyzed efficiently.
🔹 Detailed Explanation:
✅ 1. What is Data Cleaning?
Data cleaning (also called data cleansing) is the process of detecting and correcting (or removing) errors and inconsistencies in a dataset. The goal is to improve the quality and reliability of the data before using it for analysis or modeling.
Common Data Cleaning Tasks:
| Task | Description |
|---|---|
| Handling missing values | Replacing or removing missing (NA or NULL) values |
| Removing duplicates | Eliminating repeated rows |
| Correcting errors | Fixing typos or incorrect entries |
| Standardizing formats | Ensuring consistent format (e.g., date as YYYY-MM-DD) |
| Trimming whitespaces | Removing extra spaces from text |
Example in R:
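A minimal sketch of some of these tasks on a hypothetical data frame:

```r
df <- data.frame(
  Name  = c(" Asha ", "Ravi", "Ravi", "Meena"),
  Marks = c(82, 74, 74, NA)
)

clean <- na.omit(df)               # drop rows with missing values
clean <- unique(clean)             # remove duplicate rows
clean$Name <- trimws(clean$Name)   # trim extra whitespace from text
```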
✅ 2. What is Data Transformation?
Data transformation involves changing the structure, format, or values of data to make it suitable for analysis. This process helps organize data, enhance its quality, and derive meaningful insights.
Common Data Transformation Tasks:
| Task | Description |
|---|---|
| Changing data types | Converting strings to numbers, or dates to a proper date format |
| Scaling data | Normalizing or standardizing values for comparison |
| Creating new columns | Deriving new values from existing columns (e.g., calculating BMI) |
| Aggregating data | Summarizing data using totals, averages, counts, etc. |
| Encoding categorical variables | Converting text values (like "Male", "Female") to numeric codes |
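A sketch of a few of these transformations; the columns, the BMI formula, and the codes are illustrative assumptions:

```r
df <- data.frame(
  Age    = c("25", "31", "19"),      # ages stored as text
  Weight = c(70, 82, 60),            # kilograms
  Height = c(1.75, 1.80, 1.62),      # metres
  Gender = c("Male", "Female", "Male")
)

df$Age        <- as.numeric(df$Age)             # change data type
df$WeightStd  <- as.numeric(scale(df$Weight))   # standardize for comparison
df$BMI        <- df$Weight / df$Height^2        # create a new derived column
df$GenderCode <- as.integer(factor(df$Gender))  # encode categories as numbers
mean(df$BMI)                                    # simple aggregation
```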
✅ Why Data Cleaning and Transformation are Important:
Ensures data accuracy and completeness
Helps avoid misleading or wrong analysis results
Makes data compatible with statistical models and tools
Saves time and effort during the analysis process
✅ Real-Life Example:
Suppose you receive a dataset from a school with the following problems:
Some marks fields are empty.
Names are written with extra spaces.
Date of Birth is in different formats.
Duplicate entries exist.
You must:
Fill missing marks with the average.
Trim spaces from names.
Convert all dates to a common format.
Remove duplicates.
Only after cleaning and transforming, you can confidently analyze the data.
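A sketch of those four steps on a hypothetical version of such a dataset (the column names and the two date formats are assumptions):

```r
library(dplyr)

school <- data.frame(
  Name  = c(" Asha ", "Ravi ", "Ravi ", "Meena"),
  Marks = c(82, NA, NA, 91),
  DOB   = c("2008-05-01", "01/06/2008", "01/06/2008", "2008-07-15")
)

# 1. Fill missing marks with the average of the available marks
school$Marks[is.na(school$Marks)] <- mean(school$Marks, na.rm = TRUE)

# 2. Trim spaces from names
school$Name <- trimws(school$Name)

# 3. Convert both assumed date formats to a common Date type
school$DOB <- coalesce(as.Date(school$DOB, format = "%Y-%m-%d"),
                       as.Date(school$DOB, format = "%d/%m/%Y"))

# 4. Remove duplicate entries
school <- unique(school)
```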
Unit 4.4: Identifying and Handling Missing Values
🔹 Brief Explanation:
Missing values are gaps or empty fields in a dataset where data should be present. Identifying and handling these missing values is a crucial step in data cleaning. If not treated properly, missing values can lead to incorrect or misleading analysis results.
🔹 Detailed Explanation:
✅ 1. What are Missing Values?
Missing values occur when no data is recorded for a particular cell in a dataset. In R, missing values are represented as NA. These can arise due to various reasons such as:
Data entry errors
Survey questions left unanswered
Technical issues during data collection
Unavailable information
✅ 2. Identifying Missing Values
Before handling missing data, we must first detect where the missing values are.
In R:
Check for missing values in a dataset:
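A sketch, assuming a small data frame `df` that contains NA values:

```r
df <- data.frame(Name  = c("Asha", "Ravi", "Meena"),
                 Marks = c(82, NA, 91))

is.na(df)           # TRUE/FALSE map of missing cells
sum(is.na(df))      # total count of missing values
colSums(is.na(df))  # missing values per column
```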
Identify rows with missing values:
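Using the same assumed `df`:

```r
df[!complete.cases(df), ]    # rows containing at least one NA
which(!complete.cases(df))   # their row numbers
```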
✅ 3. Handling Missing Values
There are multiple ways to deal with missing data, depending on the type of analysis and the amount of missing data.
A. Removing Missing Data
Remove rows with missing values:
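For example, on the assumed `df` above:

```r
clean <- na.omit(df)               # drops every row that contains an NA
clean <- df[complete.cases(df), ]  # equivalent base-R alternative
```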
Remove rows only when few values are missing; deleting too many rows can significantly affect results through loss of useful data.
B. Replacing (Imputing) Missing Data
Instead of removing, we can fill in the missing values with appropriate substitutes:
| Method | Description |
|---|---|
| Mean/Median/Mode | Replace with the average or most common value |
| Constant value | Replace with 0, "Unknown", or another custom value |
| Prediction model | Use regression or machine learning to estimate the missing value |
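A sketch of the first two methods (the prediction-model approach needs a fitted model and is not shown):

```r
df <- data.frame(Marks = c(82, NA, 91),
                 City  = c("Pune", NA, "Delhi"))

# Mean imputation for a numeric column
df$Marks[is.na(df$Marks)] <- mean(df$Marks, na.rm = TRUE)

# Constant-value imputation for a text column
df$City[is.na(df$City)] <- "Unknown"
```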
C. Using Forward or Backward Fill (Time Series)
In time-series data:
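One common option is tidyr's fill(); zoo::na.locf is another. A sketch with an assumed daily series:

```r
library(tidyr)

sales <- data.frame(Day   = 1:5,
                    Value = c(100, NA, NA, 130, NA))

fill(sales, Value, .direction = "down")  # forward fill: carry the last value onward
fill(sales, Value, .direction = "up")    # backward fill: carry the next value back
```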
✅ Risks of Ignoring Missing Values:
| Risk | Impact |
|---|---|
| Skewed analysis | Can distort average or trend calculations |
| Reduced accuracy | Affects prediction or classification models |
| Data loss | Too much deletion can remove useful information |
✅ Best Practices:
Always explore the reason for missing data before deciding how to handle it.
Use visual tools (like heatmaps) to locate missing values easily.
Choose a handling method that fits the nature of data and analysis goals.
✅ Real-Life Example:
You have a student dataset where some students’ “Math Marks” are missing. If the number is small, you may delete those rows. But if many are missing, you might fill them with the class average for Math. This ensures that your analysis remains balanced and useful.
Unit 4.5: Data Type Conversion and Recoding Variables

🔹 Brief Explanation:
Data type conversion is the process of changing a variable from one data type to another (e.g., from character to numeric).
Recoding variables means modifying the values of variables — for example, changing categories or grouping values for better analysis. Both tasks are essential in preparing data for analysis and visualization.

🔹 Detailed Explanation:

✅ 1. What is Data Type Conversion?
In any dataset, each column or variable has a specific data type such as:
Numeric (e.g., 25, 99.5)
Character (e.g., “Yes”, “No”)
Factor (for categorical data)
Date/Time
Sometimes, we need to change a column’s type for proper analysis or visualization. This is known as data type conversion or type casting.

🔸 Why Convert Data Types?
To perform calculations (e.g., converting strings to numbers)
To format data correctly (e.g., treating text as factors for categories)
To avoid errors during analysis or visualization
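A sketch of common conversions (the values are illustrative):

```r
as.numeric("25")                # character -> numeric
as.character(99.5)              # numeric -> character
factor(c("Yes", "No", "Yes"))   # character -> factor (categorical)
as.Date("2024-01-15")           # character -> Date
class(as.Date("2024-01-15"))    # confirm the resulting type
```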
✅ 2. What is Recoding Variables?
Recoding is the process of changing values within a variable for simplification or categorization.
🔸 Why Recode Variables?
To group values into categories (e.g., “1”, “2”, “3” → “Low”, “Medium”, “High”)
To correct or standardize values (e.g., spelling mistakes)
To convert text into numerical codes for modeling
Common Conversion and Recoding Tasks:
| Task | Purpose |
|---|---|
| Convert "25" (string) to 25 (number) | For numerical analysis |
| Convert "Male", "Female" to a factor | For classification models |
| Recode age 0–17 as "Child", 18–60 as "Adult" | For demographic studies |
| Replace "Y", "N" with "Yes", "No" | For consistency and clarity |
✅ Best Practices:
Always check the current data types using str() or class() functions.
Avoid unintended conversions that may lead to loss of information.
Use meaningful labels when recoding categories.
Use meaningful labels when recoding categories.
✅ Conclusion:
- Subsetting and filtering are essential skills for data cleaning and preparation. They help in narrowing down large datasets to only the relevant and useful data, making analysis easier and more effective.
- Adding, removing, and renaming columns are core tasks in data cleaning and transformation. They help you organize data effectively, make it more readable, and ensure it’s suitable for analysis or visualization.
- Data cleaning ensures the dataset is correct, consistent, and usable, while data transformation reshapes it to fit the needs of analysis or visualization. These steps are crucial for making data reliable, readable, and ready for further use in decision-making or modeling.
- Identifying and managing missing values is a vital part of data preprocessing. Whether you remove or replace them, your choice should ensure that the dataset remains accurate and meaningful for analysis. Proper handling improves data quality and the reliability of results.
- Data type conversion ensures that variables are in the correct format for processing, while recoding helps simplify and organize values for better interpretation. Both steps play a crucial role in cleaning and preparing data for analysis, machine learning, or reporting.