# Step 1: Missing Data Visualization
What and Why: Visualizing missing data reveals the extent and pattern of missingness. It guides decisions about removal vs. imputation and helps catch structure, e.g., whether values are missing completely at random (MCAR) or missing at random (MAR); a quick probe is sketched after the map below.
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(Amelia))
# Load dataset (adjust the working directory to your local project folder)
setwd("C:/My Project Week 1")
titanic <- read.csv("train.csv", stringsAsFactors = FALSE)
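Note: read.csv() leaves empty fields in character columns as "" rather than NA, so columns such as Cabin and Embarked can look complete to is.na(). A hedged alternative load, left commented out here because it would change the missing percentages printed below, treats blanks as missing:
# Alternative (assumption: blank strings should count as missing):
# titanic <- read.csv("train.csv", stringsAsFactors = FALSE,
#                     na.strings = c("", "NA"))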
# Visualize missing data
missmap(titanic, main = "Initial Missing-Data Map",
        col = c("yellow", "black"), legend = TRUE)
# Step 2: Identify Rows and Columns with ≥ 20% Missing
What and Why: Removing rows or columns with substantial missingness avoids unreliable imputations. The 20% threshold is a common heuristic that balances data quality against data preservation.
# Identify columns with ≥ 20% missing
col_missing_pct <- colMeans(is.na(titanic)) * 100
cat("Columns with ≥ 20 % missing:\n")
## Columns with ≥ 20 % missing:
print(col_missing_pct[col_missing_pct >= 20])
## named numeric(0)
# Drop columns ≥ 20% missing
cols_to_drop <- names(col_missing_pct[col_missing_pct >= 20])
titanic_reduced <- titanic %>% select(-all_of(cols_to_drop))
# Identify rows with ≥ 20% missing (after dropping columns)
row_missing_pct <- rowMeans(is.na(titanic_reduced)) * 100
high_missing_rows <- which(row_missing_pct >= 20)
cat("Number of rows with ≥ 20 % missing:", length(high_missing_rows), "\n")
## Number of rows with ≥ 20 % missing: 0
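If any rows had crossed the threshold, it would be worth inspecting them before deletion. A minimal sketch, which here simply shows the rows with the most missing values:
# Peek at the rows with the highest missing percentage before removal
head(titanic_reduced[order(row_missing_pct, decreasing = TRUE), ])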
# Step 3: Remove Rows and Columns with ≥ 20% Missing
What and Why: Dropping highly incomplete data reduces the risk of introducing bias or spurious noise via poor imputation.
# Remove rows with ≥ 20% missing
titanic_cleaned <- titanic_reduced[row_missing_pct < 20, ]
# Remove columns with ≥ 20% missing (already dropped in Step 2; repeated as a safeguard)
cols_to_drop <- names(col_missing_pct[col_missing_pct >= 20])
titanic_cleaned <- titanic_cleaned[, !(names(titanic_cleaned) %in% cols_to_drop)]
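As a quick sanity check (a minimal sketch; the printed dimensions will depend on your data), compare shapes before and after removal:
# Confirm how many rows and columns were actually removed
cat("Before:", dim(titanic), " After:", dim(titanic_cleaned), "\n")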
# Step 4: Impute Remaining Missing Values
What and Why: Choose an imputation strategy based on variable type and distribution:
- Mean: for symmetric numeric distributions.
- Median: for skewed numeric data (e.g., Age).
- Mode: for categorical data (e.g., Embarked).
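A quick way to decide between mean and median (a hedged sketch using the Age column) is to compare them directly; a gap between the two indicates skew, favoring the median:
# summary() reports both mean and median, plus the remaining NA count
summary(titanic_cleaned$Age)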
# Median imputation for Age
median_age <- median(titanic_cleaned$Age, na.rm = TRUE)
titanic_cleaned$Age[is.na(titanic_cleaned$Age)] <- median_age
# Mode function for Embarked
get_mode <- function(v) {
  ux <- na.omit(unique(v))
  ux[which.max(tabulate(match(v, ux)))]
}
mode_embarked <- get_mode(titanic_cleaned$Embarked)
titanic_cleaned$Embarked[is.na(titanic_cleaned$Embarked)] <- mode_embarked
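To confirm both imputations took effect, a minimal check (recall that empty strings "" in Embarked are not NA unless na.strings was set at load time):
# Remaining NA counts; both should now be zero
colSums(is.na(titanic_cleaned[, c("Age", "Embarked")]))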
# Step 5: Post-Cleaning Verification
What and Why: Re-visualize the data to confirm that all missing values have been addressed. This acts as quality control.
if (nrow(titanic_cleaned) > 0) {
  missmap(titanic_cleaned,
          main = "Post-Cleaning Missing-Data Map",
          col = c("yellow", "black"), legend = TRUE)
} else {
  message("No rows left after filtering; skipping final missmap().")
}
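For an automated guard to complement the visual check (a minimal sketch), stop with an error if any NA survived the cleaning:
# Fails loudly if any NA remains anywhere in the cleaned data frame
stopifnot(!anyNA(titanic_cleaned))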
Explanation: Data cleaning is essential for accurate Exploratory Data Analysis (EDA). Missing or inconsistent data can distort summary statistics, introduce bias, and reduce model performance.
In EDA, data quality has a direct impact on the validity and reliability of insights. Missing values can create false trends or obscure real ones, leading to incorrect conclusions. Carelessly chosen imputation methods may conceal or introduce outliers that distort the true distribution of the data. Furthermore, statistical models built on noisy or incomplete datasets generalize poorly, reducing their predictive accuracy and usefulness. Proper data cleaning therefore establishes a trustworthy foundation for analysis and decision-making.
Please refer to the report with the Executive Summary for this discussion.