- Introduction to Data Quality
- Common Data Quality Issues
- Detecting and Handling Missing Data
- Identifying and Removing Duplicates
- Handling Inconsistent Data
- Detecting Outliers
Data quality is crucial for accurate analysis and decision-making. Poor data quality can lead to incorrect conclusions.
Key aspects of data quality:
- Completeness: No missing data
- Accuracy: Correct and reliable data
- Consistency: Uniform format and representation
- Uniqueness: No duplicate entries
Download the dataset for this practice (right-click the link and open it in a new tab).
Move the file into your BI-Lab project folder and import the data into RStudio.
# Import the data with the "Import" button in RStudio [RECOMMENDED], OR
# import it with the following code (after replacing the file name with
# your own path to the data)
df <- read.csv("df_DQ.csv")
# View first few rows
head(df)
##   ID          Name Age Gender Income
## 1  1      John Doe  28   Male  55000
## 2  2    Jane Smith  34 Female  62000
## 3  3   Emily Davis  NA Female  58000
## 4  4 Michael Brown  45      M  72000
## 5  5  Sarah Wilson  28      F  62000
## 6  6      John Doe  28   Male  55000
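Missing values can only be counted if R actually read them as NA. The sketch below uses a hypothetical inline CSV (not the lab file) in which one Age cell is empty and one Income cell holds the placeholder text "n/a". It assumes such placeholders might appear in your own data; `na.strings` is a standard argument of `read.csv()`.

```r
# Hypothetical inline CSV standing in for a file on disk
csv_text <- "ID,Name,Age,Income
1,John Doe,28,55000
2,Emily Davis,,58000
3,Sarah Wilson,28,n/a"

# Treat empty cells, "NA", and "n/a" as missing values on import
toy_raw <- read.csv(text = csv_text, na.strings = c("", "NA", "n/a"))

# Inspect the column types and count missing values per column
str(toy_raw)
colSums(is.na(toy_raw))
```

Without the `na.strings` argument, the "n/a" entry would force the whole Income column to be read as text, so checking `str()` right after import is a cheap safeguard.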
# is.na() checks each value and returns TRUE where it is missing (NA)
# Count the missing values in the whole data frame
sum(is.na(df))
## [1] 2
# Check missing values by column
colSums(is.na(df))
##     ID   Name    Age Gender Income
##      0      0      1      0      1
# If you want to remove records/rows with missing values:
df_no_na <- na.omit(df)
# Or, you can impute missing values with the mean or other values that make sense
df$Age[is.na(df$Age)] <- mean(df$Age, na.rm = TRUE)
# Q: What does na.rm = TRUE mean? What if it is set to FALSE?
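To see the effect of `na.rm`, here is a minimal sketch on a toy vector built from the Age values shown by `head(df)`. The vector name `age` and the choice of the median as an alternative fill value are illustrative assumptions, not part of the lab dataset.

```r
# Toy ages taken from the rows shown by head(df); one value is missing
age <- c(28, 34, NA, 45, 28, 28)

mean(age)                 # NA: a single missing value makes the result NA
mean(age, na.rm = TRUE)   # 32.6: NAs are dropped before averaging

# Impute into a copy so the original vector stays available for comparison
age_imputed <- age
age_imputed[is.na(age_imputed)] <- median(age, na.rm = TRUE)
age_imputed
```

The median is often preferred over the mean for skewed columns such as income, because a few extreme values pull it far less.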
# Check for duplicate rows
sum(duplicated(df))
# View duplicate rows
df[duplicated(df), ]
## [1] 0
# Remove duplicate rows
# [row#, column#] is an important syntax to remember for locating specific values; we will learn more about this in subsequent labs
df_unique <- df[!duplicated(df), ]
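`duplicated(df)` reports 0 even though rows 1 and 6 of the `head(df)` output describe the same person, because the two rows differ in ID. The sketch below compares rows on every column except ID. It rebuilds the six printed rows as a toy data frame, and the names `toy`, `key_cols`, `is_dup`, and `toy_unique` are illustrative. Whether ID should be ignored in a real dataset is a judgment call that depends on how the IDs were assigned.

```r
# Toy copy of the six rows shown by head(df)
toy <- data.frame(
  ID     = 1:6,
  Name   = c("John Doe", "Jane Smith", "Emily Davis",
             "Michael Brown", "Sarah Wilson", "John Doe"),
  Age    = c(28, 34, NA, 45, 28, 28),
  Gender = c("Male", "Female", "Female", "M", "F", "Male"),
  Income = c(55000, 62000, 58000, 72000, 62000, 55000)
)

sum(duplicated(toy))                 # 0: the ID column makes every row unique

# Compare rows on every column except ID
key_cols <- setdiff(names(toy), "ID")
is_dup   <- duplicated(toy[, key_cols])
sum(is_dup)                          # 1: row 6 repeats row 1

toy_unique <- toy[!is_dup, ]         # keeps the first occurrence of each record
```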
# Example: Standardize categorical data
df$Gender <- tolower(df$Gender)
df$Gender[df$Gender == "m"] <- "male"
df$Gender[df$Gender == "f"] <- "female"
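When a column has more than a couple of variants, a lookup vector keeps the recoding in one place, and `table()` confirms the result. This is a sketch on a toy vector: the first five values come from the `head(df)` output, while the last two (extra whitespace, upper case) are hypothetical variants added for illustration.

```r
# Gender values as they appear in head(df), plus two hypothetical variants
gender <- c("Male", "Female", "Female", "M", "F", " male", "FEMALE")

# Normalise case and surrounding whitespace first
gender_clean <- tolower(trimws(gender))

# Named lookup vector: abbreviation -> full label
lookup <- c(m = "male", f = "female")
short  <- gender_clean %in% names(lookup)
gender_clean[short] <- unname(lookup[gender_clean[short]])

# Verify that only the expected categories remain
table(gender_clean, useNA = "ifany")
```

Running `table()` before and after cleaning is a quick way to spot categories that the lookup did not anticipate.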
# Using boxplot to visualize outliers
boxplot(df$Income, main = "Income Distribution", ylab = "Income")
# Identify outliers
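# Remove outliers
# which() skips rows where Income is NA; a bare logical index would return them as all-NA rows
df_no_outliers <- df[which(df$Income < 100000), ]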
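Instead of a hand-picked cutoff, outliers can be flagged with the 1.5 × IQR rule, the same idea that `boxplot()` uses to draw points beyond the whiskers. This is one common convention rather than the only valid definition of an outlier. The sketch uses the incomes shown by `head(df)` plus one hypothetical extreme value (250000) and one NA, both added for illustration.

```r
# Incomes from head(df), plus a hypothetical extreme value and an NA
income <- c(55000, 62000, 58000, 72000, 62000, 55000, 250000, NA)

# Quartiles and fences, ignoring missing values
q     <- quantile(income, probs = c(0.25, 0.75), na.rm = TRUE)
fence <- 1.5 * IQR(income, na.rm = TRUE)
lower <- q[[1]] - fence
upper <- q[[2]] + fence

# TRUE for values outside the fences; missing values are never flagged
is_outlier <- !is.na(income) & (income < lower | income > upper)
income[is_outlier]                   # 250000

# Keep non-outliers; missing incomes stay in place for separate handling
income_kept <- income[!is_outlier]
```

Flagging first and removing second makes it easy to inspect the suspicious values before deciding whether they are errors or genuine extremes.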
Detect and handle missing values in the dataset you have chosen for your group project.
Identify and remove duplicate entries.
Standardize inconsistent categorical data.
Detect and remove outliers from numeric columns.
Use the following functions where appropriate:
- is.na()
- duplicated()
- boxplot()
Ensuring data quality is essential for accurate analysis.
Learn to detect and handle missing data, duplicates, inconsistencies, and outliers.
Apply these techniques in your future data analysis projects to improve the quality and reliability of your insights.