This notebook focuses on data cleaning and preparation, a critical stage in any data analytics workflow. Students will learn how to identify and handle missing values, detect inconsistencies and outliers, transform variables, and produce clean, analysis-ready datasets.
This notebook is self-contained and designed to be followed independently.
By the end of this notebook, students should be able to:
We will primarily use built-in R datasets to focus on skills rather than data sourcing.
data(mtcars)
data(airquality)
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
Interpretation: - str() shows structure and data types -
summary() reveals ranges, central tendency, and missing
values
colSums(is.na(airquality))
## Ozone Solar.R Wind Temp Month Day
## 37 7 0 0 0 0
Option 1: Remove rows with missing values
clean_air_omit <- na.omit(airquality)
Option 2: Simple imputation
airquality$Ozone[is.na(airquality$Ozone)] <-
mean(airquality$Ozone, na.rm = TRUE)
filtered_mtcars <- subset(mtcars, mpg > 20)
summary(filtered_mtcars)
## mpg cyl disp hp
## Min. :21.00 Min. :4.000 Min. : 71.10 Min. : 52.0
## 1st Qu.:21.43 1st Qu.:4.000 1st Qu.: 83.03 1st Qu.: 66.0
## Median :23.60 Median :4.000 Median :120.20 Median : 94.0
## Mean :25.48 Mean :4.429 Mean :123.89 Mean : 88.5
## 3rd Qu.:29.62 3rd Qu.:4.000 3rd Qu.:145.22 3rd Qu.:109.8
## Max. :33.90 Max. :6.000 Max. :258.00 Max. :113.0
## drat wt qsec vs
## Min. :3.080 Min. :1.513 Min. :16.46 Min. :0.0000
## 1st Qu.:3.790 1st Qu.:1.986 1st Qu.:17.39 1st Qu.:1.0000
## Median :3.910 Median :2.393 Median :18.75 Median :1.0000
## Mean :3.976 Mean :2.418 Mean :18.82 Mean :0.7857
## 3rd Qu.:4.103 3rd Qu.:2.851 3rd Qu.:19.79 3rd Qu.:1.0000
## Max. :4.930 Max. :3.215 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3 Min. :1.000
## 1st Qu.:0.2500 1st Qu.:4 1st Qu.:1.000
## Median :1.0000 Median :4 Median :2.000
## Mean :0.7143 Mean :4 Mean :1.857
## 3rd Qu.:1.0000 3rd Qu.:4 3rd Qu.:2.000
## Max. :1.0000 Max. :5 Max. :4.000
boxplot(mtcars$mpg, main = "MPG Distribution")
Identify: - Extreme values - Possible data entry errors vs real observations
mtcars$efficiency <- ifelse(mtcars$mpg > 25, "High", "Low")
table(mtcars$efficiency)
##
## High Low
## 6 26
mtcars$cyl <- factor(mtcars$cyl)
str(mtcars)
## 'data.frame': 32 obs. of 12 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
## $ disp : num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat : num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec : num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear : num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb : num 4 4 1 1 2 1 4 2 2 4 ...
## $ efficiency: chr "Low" "Low" "Low" "Low" ...
final_data <- mtcars[, c("mpg", "cyl", "hp", "efficiency")]
str(final_data)
## 'data.frame': 32 obs. of 4 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ efficiency: chr "Low" "Low" "Low" "Low" ...
This dataset: - Has no missing values - Uses appropriate data types - Contains only relevant variables
You are required to clean, document, and prepare a dataset for analysis.
# Load required libraries
library(datasets) # for getting the datasets
library(dplyr) # for data manipulation
## Warning: package 'dplyr' was built under R version 4.5.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2) # for visualization
## Warning: package 'ggplot2' was built under R version 4.5.3
Choose one of the following: - airquality -
mtcars - iris
# Load the dataset
data("airquality")
# View basic information
str(airquality)
## 'data.frame': 153 obs. of 6 variables:
## $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
# Summary statistics
summary(airquality)
## Ozone Solar.R Wind Temp
## Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
## 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
## Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
## Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
## 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
## NA's :37 NA's :7
## Month Day
## Min. :5.000 Min. : 1.0
## 1st Qu.:6.000 1st Qu.: 8.0
## Median :7.000 Median :16.0
## Mean :6.993 Mean :15.8
## 3rd Qu.:8.000 3rd Qu.:23.0
## Max. :9.000 Max. :31.0
##
# Display first few rows
head(airquality)
missing_counts <- colSums(is.na(airquality))
print("Missing values per column:")
## [1] "Missing values per column:"
print(missing_counts)
## Ozone Solar.R Wind Temp Month Day
## 37 7 0 0 0 0
# library(VIM) # optional - for visualization
# To avoid losing the wanted data let's create a copy for cleaning
airquality_clean <- airquality
# Impute missing Ozone with median Ozone per Month
airquality_clean <- airquality_clean %>%
group_by(Month) %>%
mutate(Ozone = ifelse(is.na(Ozone), median(Ozone, na.rm = TRUE), Ozone)) %>%
ungroup()
# Impute missing Solar.R with median Solar.R per Month
airquality_clean <- airquality_clean %>%
group_by(Month) %>%
mutate(Solar.R = ifelse(is.na(Solar.R), median(Solar.R, na.rm = TRUE), Solar.R)) %>%
ungroup()
# Verify no missing values remain
print("Missing values after impution:")
## [1] "Missing values after impution:"
print(colSums(is.na(airquality_clean)))
## Ozone Solar.R Wind Temp Month Day
## 0 0 0 0 0 0
# Detect outliers in Wind using IQR method
Q1_wind <- quantile(airquality_clean$Wind, 0.25)
Q3_wind <- quantile(airquality_clean$Wind, 0.75)
IQR_wind <- Q3_wind - Q1_wind
lower_bound_wind <- Q1_wind - 1.5 * IQR_wind
upper_bound_wind <- Q3_wind + 1.5 * IQR_wind
outliers_wind <- airquality_clean %>%
filter(Wind < lower_bound_wind | Wind > upper_bound_wind)
print("Potential outliers in Wind:")
## [1] "Potential outliers in Wind:"
print(outliers_wind)
## # A tibble: 3 × 6
## Ozone Solar.R Wind Temp Month Day
## <dbl> <dbl> <dbl> <int> <int> <int>
## 1 8 19 20.1 61 5 9
## 2 6 78 18.4 57 5 18
## 3 37 284 20.7 72 6 17
# Boxplot for visualization
boxplot(airquality_clean,
main = "Boxplot of Wind Speed",
ylab = "Wind (mph)",
col = "#3f7")
# Detect outliers in Ozone
Q1_ozone <- quantile(airquality_clean$Ozone, 0.25)
Q3_ozone <- quantile(airquality_clean$Ozone, 0.75)
IQR_ozone <- Q3_ozone - Q1_ozone
upper_bound_ozone <- Q3_ozone + 1.5 * IQR_ozone
outliers_ozone <- airquality_clean %>%
filter(Ozone >upper_bound_ozone)
print("Potential outliers in Ozone:")
## [1] "Potential outliers in Ozone:"
print(outliers_ozone$Ozone)
## [1] 115 135 108 122 110 168 118
# Derived Variable 1: Temperature Category
airquality_clean <- airquality_clean %>%
mutate(Temp_Category = case_when(
Temp < 70 ~ "Cool",
Temp >= 70 & Temp < 80 ~ "Mild",
Temp >= 80 & Temp < 90 ~ "Warm",
Temp >= 90 ~ "Hot"
))
# Derived Variable 2: Wind chill index (simplified)
# Using Temp in Fahrenheit and Wind in mph
airquality_clean <- airquality_clean %>%
mutate(Wind_Chill = 35.74 + 0.6215 * Temp -
35.75 * (Wind^0.16) + 0.4275 * Temp * (Wind^0.16))
head(airquality_clean[, c("Temp", "Temp_Category", "Wind", "Wind_Chill")])
# Reorder columns for logical grouping
airquality_final <- airquality_clean %>%
select(Month, Day, Ozone, Solar.R, Wind, Temp, Wind_Chill,
Temp_Category, everything())
# save the clean dataset in R format
save(airquality_final, file = "clean_airquality.RData")
# Export as CSV for external use
write.csv(airquality_final, "clean_airquality.csv", row.names = FALSE)
# Display final structure and summary
str(airquality_final)
## tibble [153 × 8] (S3: tbl_df/tbl/data.frame)
## $ Month : int [1:153] 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int [1:153] 1 2 3 4 5 6 7 8 9 10 ...
## $ Ozone : num [1:153] 41 36 12 18 18 28 23 19 8 18 ...
## $ Solar.R : num [1:153] 190 118 149 313 194 194 299 99 19 194 ...
## $ Wind : num [1:153] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int [1:153] 67 72 74 62 56 66 65 59 61 69 ...
## $ Wind_Chill : num [1:153] 67.6 73.6 75.6 60.6 52.5 ...
## $ Temp_Category: chr [1:153] "Cool" "Mild" "Mild" "Cool" ...
summary(airquality_final)
## Month Day Ozone Solar.R
## Min. :5.000 Min. : 1.0 Min. : 1.0 Min. : 7.0
## 1st Qu.:6.000 1st Qu.: 8.0 1st Qu.: 20.0 1st Qu.:120.0
## Median :7.000 Median :16.0 Median : 24.0 Median :197.5
## Mean :6.993 Mean :15.8 Mean : 39.5 Mean :186.4
## 3rd Qu.:8.000 3rd Qu.:23.0 3rd Qu.: 52.0 3rd Qu.:256.0
## Max. :9.000 Max. :31.0 Max. :168.0 Max. :334.0
## Wind Temp Wind_Chill Temp_Category
## Min. : 1.700 Min. :56.00 Min. : 52.47 Length:153
## 1st Qu.: 7.400 1st Qu.:72.00 1st Qu.: 73.56 Class :character
## Median : 9.700 Median :79.00 Median : 81.92 Mode :character
## Mean : 9.958 Mean :77.88 Mean : 80.47
## 3rd Qu.:11.500 3rd Qu.:85.00 3rd Qu.: 89.38
## Max. :20.700 Max. :97.00 Max. :104.25
airquality and count them
per columncolSums(is.na(airquality))
## Ozone Solar.R Wind Temp Month Day
## 37 7 0 0 0 0
Solar.R valuesSolar.r_Clean <- airquality[!is.na(airquality$Solar.R),]
print(paste("Rows removed with missing Solar.R:",
nrow(airquality) - nrow(Solar.r_Clean)))
## [1] "Rows removed with missing Solar.R: 7"
Temp_levels <- airquality %>%
mutate(Temp_Category = case_when(
Temp < 70 ~ "Cool",
Temp >= 70 & Temp < 80 ~ "Mild",
Temp >= 80 & Temp < 90 ~ "Warm",
Temp >= 90 ~ "Hot"
))
head(Temp_levels)
Wind and comment on outliersboxplot(airquality$Wind,
main = "Wind Speed Boxplot",
ylab = "Wind (mph)",
col = "#fe1")
5. Produce a clean subset suitable for regression analysis
# Remove outliers for regression (optional)
airquality_regression <- airquality_clean %>%
filter(Wind <= upper_bound_wind) %>% # remove wind outliers
filter(Ozone <= upper_bound_ozone) # remove Ozone outliers
# Alternative :Use all data with robust regression methods
airquality_regression_complete <- airquality_clean # keep all data
# Save the regression-ready dataset
save(airquality_regression_complete, file = "airquality_regression_ready.RData")
write.csv(airquality_regression_complete, "airquality_regression_ready.csv", row.names = FALSE)
Data cleaning is not optional — it is the foundation of reliable analytics. Well-prepared data leads to trustworthy insights and models.