Agenda

  1. Introduction to Data Quality
  2. Common Data Quality Issues
  3. Detecting and Handling Missing Data
  4. Identifying and Removing Duplicates
  5. Handling Inconsistent Data
  6. Detecting Outliers

Introduction to Data Quality

Data quality is crucial for accurate analysis and decision-making. Poor data quality can lead to incorrect conclusions.

Key aspects of data quality:

  • Completeness: No missing data

  • Accuracy: Correct and reliable data

  • Consistency: Uniform format and representation

  • Uniqueness: No duplicate entries

Example Dataset

Download the dataset for this practice (right-click the link and open it in a new tab):

Data Quality Sample

Move the file into your BI-Lab project folder and import the data into RStudio.

Load Data into R

# Import the data with the "Import" button in RStudio [RECOMMENDED], OR
# import it with the following code (after replacing the file name with
# your own path to the data file)
df <- read.csv("df_DQ.csv")  
# View first few rows 
head(df) 
##   ID          Name Age Gender Income
## 1  1      John Doe  28   Male  55000
## 2  2    Jane Smith  34 Female  62000
## 3  3   Emily Davis  NA Female  58000
## 4  4 Michael Brown  45      M  72000
## 5  5  Sarah Wilson  28      F  62000
## 6  6      John Doe  28   Male  55000

Detecting Missing Values

# is.na() checks whether each value is missing. Count all missing values in the data frame
sum(is.na(df))  
## [1] 2
# Check missing values by column 
colSums(is.na(df)) 
##     ID   Name    Age Gender Income 
##      0      0      1      0      1
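
The counts above say how many values are missing, but not which records are affected. The following is a minimal sketch of one way to locate them, using a toy copy of the six rows printed by head(df) earlier. The name `toy` is hypothetical and is used only so the example runs on its own.

```r
# Toy copy of the six rows shown by head(df); `toy` is a hypothetical name
toy <- data.frame(
  ID     = 1:6,
  Name   = c("John Doe", "Jane Smith", "Emily Davis",
             "Michael Brown", "Sarah Wilson", "John Doe"),
  Age    = c(28, 34, NA, 45, 28, 28),
  Gender = c("Male", "Female", "Female", "M", "F", "Male"),
  Income = c(55000, 62000, 58000, 72000, 62000, 55000)
)

# complete.cases() is TRUE for rows without any NA; negate it to keep incomplete rows
rows_with_na <- toy[!complete.cases(toy), ]
rows_with_na

# which(..., arr.ind = TRUE) reports the row and column position of each NA
na_positions <- which(is.na(toy), arr.ind = TRUE)
na_positions
```

On your own data, replace `toy` with `df`. Looking at the incomplete rows first helps you decide whether removing or imputing them makes more sense.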

Handling Missing Values

# If you want to remove records/rows with missing values: 
df_no_na <- na.omit(df)  

# Or, you can impute missing values with the mean or other values that make sense 
df$Age[is.na(df$Age)] <- mean(df$Age, na.rm = TRUE) 
## Q: What does na.rm = TRUE mean? What happens if it is set to FALSE?
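
The mean is only one possible replacement value. As a sketch of an alternative, the example below imputes with the median, which is less affected by extreme values. The vector `income` and its numbers are made up for illustration and are not taken from the lab dataset.

```r
# Hypothetical income values with one missing entry and one very large value
income <- c(55000, 62000, NA, 72000, 62000, 250000)

# The median is less sensitive to extreme values than the mean
income_median <- median(income, na.rm = TRUE)

# Replace only the missing entries, leaving observed values untouched
income_imputed <- income
income_imputed[is.na(income_imputed)] <- income_median
income_imputed
```

Which replacement value makes sense depends on the column. For a skewed variable such as income, the median is often a safer default than the mean.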

Identifying Duplicates

# Check for duplicate rows 
sum(duplicated(df))
## [1] 0
# View duplicate rows
df[duplicated(df), ]
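
A count of 0 does not necessarily mean the data has no repeated records. duplicated() compares entire rows, so two rows that differ only in ID are not flagged. In the head(df) output earlier, rows 1 and 6 are identical apart from ID. The sketch below assumes that ID is only a row identifier (an assumption about this dataset) and uses a toy copy of those six rows under the hypothetical name `toy`.

```r
# Toy copy of the six rows shown by head(df); `toy` is a hypothetical name
toy <- data.frame(
  ID     = 1:6,
  Name   = c("John Doe", "Jane Smith", "Emily Davis",
             "Michael Brown", "Sarah Wilson", "John Doe"),
  Age    = c(28, 34, NA, 45, 28, 28),
  Gender = c("Male", "Female", "Female", "M", "F", "Male"),
  Income = c(55000, 62000, 58000, 72000, 62000, 55000)
)

# Whole-row comparison: the IDs differ, so nothing is flagged
sum(duplicated(toy))

# Compare on every column except ID
dup_ignoring_id <- duplicated(toy[, setdiff(names(toy), "ID")])
toy[dup_ignoring_id, ]

# Keep the first occurrence of each record
toy_unique <- toy[!dup_ignoring_id, ]
```

Whether two rows with different IDs really describe the same person is a judgment call about the data, so inspect the flagged rows before deleting them.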

Removing Duplicates

# Remove duplicate rows 
# [row#, column#] is an important syntax to remember for locating specific values; we will learn more about this in subsequent labs
df_unique <- df[!duplicated(df), ] 

Handling Inconsistent Data

# Example: Standardize categorical data 
df$Gender <- tolower(df$Gender) 
df$Gender[df$Gender == "m"] <- "male" 
df$Gender[df$Gender == "f"] <- "female" 
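
It helps to tabulate the distinct values before and after standardizing, so you can confirm that no variant was missed. The sketch below uses the six Gender values printed by head(df) earlier. The trimws() call is an added precaution against stray spaces and is an assumption, not something this dataset is known to need.

```r
# Gender values as they appear in the head(df) output
gender <- c("Male", "Female", "Female", "M", "F", "Male")

# Before: four different spellings for two categories
table(gender)

# Standardize: trim whitespace, lower-case, then map abbreviations
gender_std <- tolower(trimws(gender))
gender_std[gender_std == "m"] <- "male"
gender_std[gender_std == "f"] <- "female"

# After: only the two intended categories should remain
gender_counts <- table(gender_std)
gender_counts
```

If the second table still shows unexpected categories, add a mapping line for each one and run the check again.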

Detecting Outliers

# Using boxplot to visualize outliers 
boxplot(df$Income, main="Income Distribution", ylab="Income")  

# Identify outliers 
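
The boxplot shows outliers visually but does not list them. One possible way to get the actual values is sketched below with boxplot.stats() and, equivalently for this example, the 1.5 * IQR rule computed by hand. The vector `income` and the value 250000 are made up for illustration and are not taken from the lab dataset.

```r
# Hypothetical incomes; 250000 is a made-up extreme value
income <- c(55000, 62000, 58000, 72000, 62000, 55000, 250000)

# boxplot.stats() returns the points that boxplot() would draw as outliers
outliers_boxplot <- boxplot.stats(income)$out
outliers_boxplot

# The same idea by hand: flag values more than 1.5 * IQR outside the quartiles
q <- unname(quantile(income, c(0.25, 0.75), na.rm = TRUE))
fence <- 1.5 * (q[2] - q[1])
is_outlier <- income < q[1] - fence | income > q[2] + fence
income[is_outlier]
```

boxplot.stats() uses hinges rather than quantile(), so on other data the two approaches can differ slightly near the cutoff. A flagged value is a candidate for review, not automatically an error.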

Handling Outliers

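# Remove outliers (here: keep only rows with Income below 100000)
# which() skips rows where Income is NA; a bare logical index would return them as all-NA rows
df_no_outliers <- df[which(df$Income < 100000), ]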

Your Turn

  1. Detect and handle missing values in the dataset you have chosen for your group project.

  2. Identify and remove duplicate entries.

  3. Standardize inconsistent categorical data.

  4. Detect and remove outliers from numeric columns.

Use the following functions where appropriate:

  • is.na()

  • duplicated()

  • boxplot()

Summary

  • Ensuring data quality is essential for accurate analysis.

  • Learn to detect and handle missing data, duplicates, inconsistencies, and outliers.

  • Apply these techniques in your future data analysis projects to improve the quality and reliability of your insights.