This document details the exploratory data analysis (EDA) and preprocessing steps performed on a customer dataset to prepare it for subsequent modeling. The primary objectives were to handle missing data, understand the distribution of key variables, explore correlations, and appropriately scale the features. This process is crucial for ensuring the quality and reliability of any predictive models built on the data.
Missing Data: Identified missing values in the ‘Age’ and ‘Income’ columns (151 values, roughly 14.8% relative to the 1,020 rows) and filled them with mean imputation rather than dropping the affected rows.
Exploratory Data Analysis (EDA):
Visualized the distribution of ‘Age’, ‘Credit Score’, and ‘Income’ using histograms, revealing patterns such as varied age group representation, credit score clustering, and a positively skewed income distribution.
Examined the gender distribution across regions, finding an approximate balance.
Analyzed correlations between numerical variables, observing weak relationships.
Data Splitting: Divided the dataset into training (80%) and testing (20%) sets using createDataPartition() from the caret package, ensuring reproducibility with a set seed.
Feature Scaling: Applied standardization (Z-score scaling) to the numerical features, chosen because the variables are continuous, measured on different scales, and standardization is less distorted by extreme values than min-max normalization.
Details:
The following sections provide a detailed walkthrough of the data loading, missing data handling, EDA, data splitting, and feature scaling processes. Each step is accompanied by R code and interpretations of the results.
library(readr)
dataset <- read_csv("preprocessing_dataset.csv")
## Rows: 1020 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Gender, Region
## dbl (3): Age, Income, Credit_Score
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
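As the message above suggests, this note can be suppressed on future loads by setting show_col_types = FALSE (an optional variation, not part of the original run):
dataset <- read_csv("preprocessing_dataset.csv", show_col_types = FALSE)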
str(dataset)
## spc_tbl_ [1,020 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Age : num [1:1020] 20.3 69.2 65.6 43.1 27.4 ...
## $ Income : num [1:1020] 132075 44869 203902 237118 61801 ...
## $ Credit_Score: num [1:1020] 427 704 598 585 717 ...
## $ Gender : chr [1:1020] "Female" "Male" "Other" "Male" ...
## $ Region : chr [1:1020] "East" "West" "East" "South" ...
## - attr(*, "spec")=
## .. cols(
## .. Age = col_double(),
## .. Income = col_double(),
## .. Credit_Score = col_double(),
## .. Gender = col_character(),
## .. Region = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
missing_values <- sapply(dataset, function(x) sum(is.na(x)))
print(missing_values)
## Age Income Credit_Score Gender Region
## 51 100 0 0 0
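To express these counts as the percentage quoted in the overview (a quick check added here for illustration), the missing cells can be related to the 1,020 rows:
sum(missing_values) / nrow(dataset) * 100   # (51 + 100) / 1020 ≈ 14.8%
mean(!complete.cases(dataset)) * 100        # share of rows with at least one NA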
Imputing Missing Values with the Column Mean
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
# Copy the data, then replace NAs in the first three (numeric) columns with the column mean
dataset_imputed <- dataset
indices_to_impute <- 1:3
cols_to_impute <- names(dataset)[indices_to_impute]   # Age, Income, Credit_Score
for (col_name in cols_to_impute){
  dataset_imputed[is.na(dataset_imputed[[col_name]]), col_name] <-
    mean(dataset[[col_name]], na.rm = TRUE)
}
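An equivalent tidyverse formulation of the same mean imputation, shown only as an alternative sketch (dataset_imputed_tidy is a name introduced here and not used elsewhere):
dataset_imputed_tidy <- dataset %>%
  mutate(across(all_of(cols_to_impute),
                ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))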
missing_values <- sapply(dataset_imputed, function(x) sum(is.na(x)))
print(missing_values)
## Age Income Credit_Score Gender Region
## 0 0 0 0 0
# Backup copy, kept for the before/after comparison in Section 7.1
cleaned_dataset <- dataset_imputed
4.1. Visualize Data Distributions
Distribution of Age
library(ggplot2)
ggplot(data = dataset_imputed, aes(x = Age)) +
geom_histogram(binwidth = 7.45, fill = 'orange', color = 'black') +
labs(title = 'Distribution of Age', x = 'Age', y = 'Frequency') +
theme(plot.title = element_text(hjust = 0.5))
Interpretation: Frequencies are low at the very young and very old ends of the range, with multiple peaks and valleys in between, suggesting varying representation across age groups.
Distribution of Credit Score
ggplot(data = dataset_imputed, aes(x = Credit_Score)) +
geom_histogram(binwidth = 50, fill = 'lightblue', color = 'black') +
labs(title = 'Distribution of Credit Score', x = 'Credit Score', y = 'Frequency') +
theme(plot.title = element_text(hjust = 0.5))
Interpretation: Credit scores range from roughly 300 to 850, with fewer individuals at the extremes; distinct peaks suggest clustering within specific ranges, particularly 600-700 and 700-800.
Distribution of Income
ggplot(data = dataset_imputed, aes(x = Income)) +
geom_histogram(binwidth = 30000, fill = 'lightgreen', color = 'black') +
labs(title = 'Distribution of Income', x = 'Income', y = 'Frequency') +
scale_x_continuous(labels = scales::comma) +
theme(plot.title = element_text(hjust = 0.5))
Interpretation: The income distribution is positively skewed, with a long tail to the right and most individuals concentrated in the lower income brackets.
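Because the distribution is right-skewed, a log-scaled x-axis can make the tail easier to inspect (an optional variation, not part of the original analysis):
ggplot(data = dataset_imputed, aes(x = Income)) +
  geom_histogram(bins = 30, fill = 'lightgreen', color = 'black') +
  scale_x_log10(labels = scales::comma) +
  labs(title = 'Distribution of Income (log scale)', x = 'Income', y = 'Frequency') +
  theme(plot.title = element_text(hjust = 0.5))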
Distribution of Gender Across Regions
countGender <- dataset_imputed %>%
group_by(Region, Gender) %>%
summarise(n = n())
## `summarise()` has grouped output by 'Region'. You can override using the
## `.groups` argument.
ggplot(countGender, aes(x = Region, y = n, fill = Gender)) +
geom_col(position = "dodge") + # Use dodge for side-by-side bars
labs(title = "Gender Distribution by Region",
x = "Region",
y = "Count",
fill = "Gender") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
Interpretation: The gender counts appear approximately balanced across all regions.
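The visual impression of balance can also be checked numerically by computing each gender's share within a region (an additional check, not part of the original code):
countGender %>%
  group_by(Region) %>%
  mutate(share = round(n / sum(n), 2))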
4.2. Explore Correlations
library(corrplot)
## corrplot 0.95 loaded
Correlation_matrix <- cor(dataset_imputed[,1:3])
corrplot(Correlation_matrix, method = 'color')
Interpretation: The correlation plot shows only weak positive and negative relationships among Age, Income, and Credit_Score.
While a correlation plot (corrplot) visually highlighted variable correlations, the cor() function was used to quantitatively validate these relationships.
Using the cor() function to quantify the correlations between the variables above
dataset_imputed %>%
with(cor(Age, Income))
## [1] 0.03620266
Interpretation: The coefficient is close to zero and slightly positive, indicating weak or no linear correlation between Age and Income.
dataset_imputed %>%
with(cor(Age, Credit_Score))
## [1] -0.06539502
Interpretation: The coefficient is close to zero and slightly negative, indicating weak or no linear correlation between Age and Credit_Score.
dataset_imputed %>%
with(cor(Income, Credit_Score))
## [1] -0.004583165
Interpretation: The coefficient is close to zero and slightly negative, indicating weak or no linear correlation between Income and Credit_Score.
library(caret)
## Loading required package: lattice
set.seed(123)
training_index <- createDataPartition(dataset_imputed$Credit_Score, p = .80, list = FALSE)
training_Set <- dataset_imputed[training_index,]
testing_Set <- dataset_imputed[-training_index,]
training_Set[1:5,]
testing_Set[1:5,]
I used credit score as the dependent variable. The dataset was split into training (80%) and testing (20%) sets using the createDataPartition() function.
Confirming the Reproducibility of the Data Split
set.seed(123)
training_index <- createDataPartition(dataset_imputed$Credit_Score, p = .80, list = FALSE)
training_Set <- dataset_imputed[training_index,]
testing_Set <- dataset_imputed[-training_index,]
training_Set[1:5,]
testing_Set[1:5,]
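Instead of comparing the printed rows by eye, the two partitions can be compared programmatically; a minimal sketch (idx_first and idx_second are names introduced only for this check):
set.seed(123)
idx_first <- createDataPartition(dataset_imputed$Credit_Score, p = .80, list = FALSE)
set.seed(123)
idx_second <- createDataPartition(dataset_imputed$Credit_Score, p = .80, list = FALSE)
identical(idx_first, idx_second)  # TRUE: the same seed reproduces the same partition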
I chose the standardization method for the following reasons:
1. Standardization is well-suited for continuous data, especially when the variables have different units or scales (e.g., age in years, income in dollars, credit score on its own scale).
2. When extreme outliers are present, min-max normalization squeezes every value into the 0-to-1 range, so a single extreme value compresses the bulk of the data into a narrow band, which can make it harder for a model to learn meaningful patterns.
3. Standardization is less affected by this because it centers the data on the mean and scales by the standard deviation rather than by the observed minimum and maximum.
4. Normalization is mostly used when the data is bounded within a specific range.
Summary: Considering the above points, standardization appears to be the safer option for this dataset.
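For reference (standard definitions, not specific to this dataset): standardization transforms each value as z = (x − mean) / sd, while min-max normalization uses x' = (x − min) / (max − min); an extreme maximum or minimum inflates the latter's denominator and pushes most normalized values toward one end of the [0, 1] range.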
standardized_params <- preProcess(training_Set[,1:3],
method = c('center','scale'))
standardized_training_set <- predict(standardized_params, training_Set[,1:3])
standardized_testing_set <- predict(standardized_params, testing_Set[,1:3])
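A quick way to confirm the effect of standardization (added here for illustration) is to check that the scaled training columns have mean roughly 0 and standard deviation roughly 1; the testing columns will be close to, but not exactly, these values because they were scaled with the training-set parameters:
round(sapply(standardized_training_set, mean), 3)
round(sapply(standardized_training_set, sd), 3)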
7.1 The original and modified datasets (Before and After cleaning)
Before Cleaning
data.frame(dataset[25:30,])
After Cleaning
data.frame(cleaned_dataset[25:30,])
Summary:
Initially, I loaded the data using the read_csv() function. Then, using the View() function, I inspected the data to identify any missing values (NAs). I used the sapply() function to determine the number of missing values in each column. To preserve the original dataset, I created a copy and assigned it to another variable, where I would make modifications.
After running sapply(), I assessed whether the missing values constituted less than 5% of the data. Finding that approximately 14.8% of the values were missing, I decided against deleting the rows with NAs and instead opted for an imputation technique. Replacing the missing values with the mean of their respective columns seemed most appropriate for this dataset.
After imputation, I verified that all missing values had been replaced by running sapply() again and visually inspecting the modified data with the View() function.
7.2 The original and modified datasets (Before and After scaling)
Before Scaling
data.frame(training_Set[1:5,])
After Scaling
data.frame(standardized_training_set[1:5,])
Summary: The scaling parameters (column means and standard deviations) were estimated from the training set with preProcess() and applied to both the training and testing sets via predict(), so the standardized training features are centered at zero with unit variance and the two sets remain on a comparable scale.
8.1 What insights did you gain about the dataset from your EDA?
I began by exploring the dataset’s columns to understand their meaning and overall context. I then examined the distribution of each variable, followed by an analysis of correlations. These findings are detailed below.
a) Variable Distributions: Histograms for Age, Income, and Credit Score, along with a grouped bar chart showing gender distribution across regions, revealed the following:
Age: The age distribution is characterized by low frequencies at the extremes (very young and very old ages) and multiple peaks and valleys, suggesting varying representation across age groups.
Credit Score: Credit scores range from 300 to 850, with fewer individuals at the extremes. Distinct peaks indicate potential clustering within specific ranges, particularly in the 600-700 and 700-800 ranges.
Income: The income distribution is positively skewed (long tail to the right), with a concentration of individuals in lower income brackets and a prominent peak around 200K.
Gender in Region: Gender is approximately balanced across all regions.
b) Correlation: A correlation analysis, visualized using a correlation plot (corrplot), revealed weak positive and negative correlations among the numerical variables. This observation was confirmed quantitatively using the cor() function, which yielded correlation coefficients close to zero, further indicating the weak relationships.
8.2 Why did you choose the specific method for handling missing data?
Because roughly 14.8% of the values (in the ‘Age’ and ‘Income’ columns) were missing, deleting the affected rows would have discarded a substantial portion of the dataset. I therefore used mean imputation, replacing each missing value with the mean of its column, which preserves the dataset's size and the columns' overall means.
8.3 Why did you select the chosen scaling method, and how did it affect the data?
I selected standardization (Z-score scaling) because the numerical features are continuous and measured on very different scales (age in years, income in dollars, credit score on its own scale), and because it is less distorted by extreme values than min-max normalization. After scaling, each numerical feature in the training set has a mean of approximately zero and a standard deviation of one; the same training-set parameters were applied to the testing set so the two sets remain comparable.