This document details the exploratory data analysis (EDA) and preprocessing steps performed on a customer dataset to prepare it for subsequent modeling. The primary objectives were to handle missing data, understand the distribution of key variables, explore correlations, and appropriately scale the features. This process is crucial for ensuring the quality and reliability of any predictive models built on the data.
Missing Data: Identified missing values in the ‘Age’ and ‘Income’ columns (151 values, roughly 14.8% relative to the 1,020 rows) and filled them with mean imputation rather than dropping the affected rows.
Exploratory Data Analysis (EDA):
Visualized the distribution of ‘Age’, ‘Credit Score’, and ‘Income’ using histograms, revealing patterns such as varied age group representation, credit score clustering, and a positively skewed income distribution.
Examined the gender distribution across regions, finding an approximate balance.
Analyzed correlations between numerical variables, observing weak relationships.
Data Splitting: Divided the dataset into training (80%) and testing (20%) sets using createDataPartition() from the caret package, ensuring reproducibility with a set seed.
Feature Scaling: Applied standardization (Z-score scaling) to the numerical features, chosen because the variables are continuous, measured on different scales, and standardization is less distorted by extreme values than min-max normalization.
Details:
The following sections provide a detailed walkthrough of the data loading, missing data handling, EDA, data splitting, and feature scaling processes. Each step is accompanied by R code and interpretations of the results.
library(readr)
dataset <- read_csv("preprocessing_dataset.csv")
## Rows: 1020 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Gender, Region
## dbl (3): Age, Income, Credit_Score
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
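As the message above suggests, this note can be suppressed on future loads by setting show_col_types = FALSE (an optional variation, not part of the original run):
dataset <- read_csv("preprocessing_dataset.csv", show_col_types = FALSE)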
str(dataset)
## spc_tbl_ [1,020 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Age : num [1:1020] 20.3 69.2 65.6 43.1 27.4 ...
## $ Income : num [1:1020] 132075 44869 203902 237118 61801 ...
## $ Credit_Score: num [1:1020] 427 704 598 585 717 ...
## $ Gender : chr [1:1020] "Female" "Male" "Other" "Male" ...
## $ Region : chr [1:1020] "East" "West" "East" "South" ...
## - attr(*, "spec")=
## .. cols(
## .. Age = col_double(),
## .. Income = col_double(),
## .. Credit_Score = col_double(),
## .. Gender = col_character(),
## .. Region = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
missing_values <- sapply(dataset, function(x) sum(is.na(x)))
print(missing_values)
## Age Income Credit_Score Gender Region
## 51 100 0 0 0
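To express these counts as the percentage quoted in the overview (a quick check added here for illustration), the missing cells can be related to the 1,020 rows:
sum(missing_values) / nrow(dataset) * 100   # (51 + 100) / 1020 ≈ 14.8%
mean(!complete.cases(dataset)) * 100        # share of rows with at least one NA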
Imputing Missing Values with the Column Mean
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
# Copy the data, then replace NAs in the first three (numeric) columns with the column mean
dataset_imputed <- dataset
indices_to_impute <- 1:3
cols_to_impute <- names(dataset)[indices_to_impute]   # Age, Income, Credit_Score
for (col_name in cols_to_impute){
  dataset_imputed[is.na(dataset_imputed[[col_name]]), col_name] <-
    mean(dataset[[col_name]], na.rm = TRUE)
}
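An equivalent tidyverse formulation of the same mean imputation, shown only as an alternative sketch (dataset_imputed_tidy is a name introduced here and not used elsewhere):
dataset_imputed_tidy <- dataset %>%
  mutate(across(all_of(cols_to_impute),
                ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))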
missing_values <- sapply(dataset_imputed, function(x) sum(is.na(x)))
print(missing_values)
## Age Income Credit_Score Gender Region
## 0 0 0 0 0
# Backup copy, kept for the before/after comparison in Section 7.1
cleaned_dataset <- dataset_imputed
4.1. Visualize Data Distributions
Distribution of Age
library(ggplot2)
ggplot(data = dataset_imputed, aes(x = Age)) +
geom_histogram(binwidth = 7.45, fill = 'orange', color = 'black') +
labs(title = 'Distribution of Age', x = 'Age', y = 'Frequency') +
theme(plot.title = element_text(hjust = 0.5))
Interpretation: Frequencies are low at the very young and very old ends of the range, with multiple peaks and valleys in between, suggesting varying representation across age groups.
Distribution of Credit Score
ggplot(data = dataset_imputed, aes(x = Credit_Score)) +
geom_histogram(binwidth = 50, fill = 'lightblue', color = 'black') +
labs(title = 'Distribution of Credit Score', x = 'Credit Score', y = 'Frequency') +
theme(plot.title = element_text(hjust = 0.5))
Interpretation: Credit scores range from roughly 300 to 850, with fewer individuals at the extremes; distinct peaks suggest clustering within specific ranges, particularly 600-700 and 700-800.
Distribution of Income
ggplot(data = dataset_imputed, aes(x = Income)) +
geom_histogram(binwidth = 30000, fill = 'lightgreen', color = 'black') +
labs(title = 'Distribution of Income', x = 'Income', y = 'Frequency') +
scale_x_continuous(labels = scales::comma) +
theme(plot.title = element_text(hjust = 0.5))
Interpretation: The income distribution is positively skewed, with a long tail to the right and most individuals concentrated in the lower income brackets.
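Because the distribution is right-skewed, a log-scaled x-axis can make the tail easier to inspect (an optional variation, not part of the original analysis):
ggplot(data = dataset_imputed, aes(x = Income)) +
  geom_histogram(bins = 30, fill = 'lightgreen', color = 'black') +
  scale_x_log10(labels = scales::comma) +
  labs(title = 'Distribution of Income (log scale)', x = 'Income', y = 'Frequency') +
  theme(plot.title = element_text(hjust = 0.5))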
Distribution of Gender Across Regions
countGender <- dataset_imputed %>%
group_by(Region, Gender) %>%
summarise(n = n())
## `summarise()` has grouped output by 'Region'. You can override using the
## `.groups` argument.
ggplot(countGender, aes(x = Region, y = n, fill = Gender)) +
geom_col(position = "dodge") + # Use dodge for side-by-side bars
labs(title = "Gender Distribution by Region",
x = "Region",
y = "Count",
fill = "Gender") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
Interpretation: The gender counts appear approximately balanced across all regions.
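The visual impression of balance can also be checked numerically by computing each gender's share within a region (an additional check, not part of the original code):
countGender %>%
  group_by(Region) %>%
  mutate(share = round(n / sum(n), 2))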
4.2. Explore Correlations
library(corrplot)
## corrplot 0.95 loaded
Correlation_matrix <- cor(dataset_imputed[,1:3])
corrplot(Correlation_matrix, method = 'color')
Interpretation: The correlation plot shows only weak positive and negative relationships among Age, Income, and Credit_Score.
While a correlation plot (corrplot) visually highlighted variable correlations, the cor() function was used to quantitatively validate these relationships.
Using the cor() function to quantify the correlations between the variables above
dataset_imputed %>%
with(cor(Age, Income))
## [1] 0.03620266
Interpretation: The coefficient is close to zero and slightly positive, indicating weak or no linear correlation between Age and Income.
dataset_imputed %>%
with(cor(Age, Credit_Score))
## [1] -0.06539502
Interpretation: The coefficient is close to zero and slightly negative, indicating weak or no linear correlation between Age and Credit_Score.
dataset_imputed %>%
with(cor(Income, Credit_Score))
## [1] -0.004583165
Interpretation: The coefficient is close to zero and slightly negative, indicating weak or no linear correlation between Income and Credit_Score.
library(caret)
## Loading required package: lattice
set.seed(123)
training_index <- createDataPartition(dataset_imputed$Credit_Score, p = .80, list = FALSE)
training_Set <- dataset_imputed[training_index,]
testing_Set <- dataset_imputed[-training_index,]
training_Set[1:5,]
testing_Set[1:5,]
I used credit score as the dependent variable. The dataset was split into training (80%) and testing (20%) sets using the createDataPartition() function.
Confirming the Reproducibility of the Data Split
set.seed(123)
training_index <- createDataPartition(dataset_imputed$Credit_Score, p = .80, list = FALSE)
training_Set <- dataset_imputed[training_index,]
testing_Set <- dataset_imputed[-training_index,]
training_Set[1:5,]
testing_Set[1:5,]
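Instead of comparing the printed rows by eye, the two partitions can be compared programmatically; a minimal sketch (idx_first and idx_second are names introduced only for this check):
set.seed(123)
idx_first <- createDataPartition(dataset_imputed$Credit_Score, p = .80, list = FALSE)
set.seed(123)
idx_second <- createDataPartition(dataset_imputed$Credit_Score, p = .80, list = FALSE)
identical(idx_first, idx_second)  # TRUE: the same seed reproduces the same partition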
I chose the standardization method for the following reasons:
1. Standardization is well-suited for continuous data, especially when the variables have different units or scales (e.g., age in years, income in dollars, credit score on its own scale).
2. When extreme outliers are present, min-max normalization squeezes every value into the 0-to-1 range, so a single extreme value compresses the bulk of the data into a narrow band, which can make it harder for a model to learn meaningful patterns.
3. Standardization is less affected by this because it centers the data on the mean and scales by the standard deviation rather than by the observed minimum and maximum.
4. Normalization is mostly used when the data is bounded within a specific range.
Summary: Considering the above points, standardization appears to be the safer option for this dataset.
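For reference (standard definitions, not specific to this dataset): standardization transforms each value as z = (x − mean) / sd, while min-max normalization uses x' = (x − min) / (max − min); an extreme maximum or minimum inflates the latter's denominator and pushes most normalized values toward one end of the [0, 1] range.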
standardized_params <- preProcess(training_Set[,1:3],
method = c('center','scale'))
standardized_training_set <- predict(standardized_params, training_Set[,1:3])
standardized_testing_set <- predict(standardized_params, testing_Set[,1:3])
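A quick way to confirm the effect of standardization (added here for illustration) is to check that the scaled training columns have mean roughly 0 and standard deviation roughly 1; the testing columns will be close to, but not exactly, these values because they were scaled with the training-set parameters:
round(sapply(standardized_training_set, mean), 3)
round(sapply(standardized_training_set, sd), 3)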
7.1 The original and modified datasets (Before and After cleaning)
Before Cleaning
data.frame(dataset[25:30,])
After Cleaning
data.frame(cleaned_dataset[25:30,])
Summary:
Initially, I loaded the data using the read_csv() function. Then, using the View() function, I inspected the data to identify any missing values (NAs). I used the sapply() function to determine the number of missing values in each column. To preserve the original dataset, I created a copy and assigned it to another variable, where I would make modifications.
After running sapply(), I assessed whether the missing values constituted less than 5% of the data. Finding that approximately 14.8% of the values were missing, I decided against deleting the rows with NAs and instead opted for an imputation technique. Replacing the missing values with the mean of their respective columns seemed most appropriate for this dataset.
After imputation, I verified that all missing values had been replaced by running sapply() again and visually inspecting the modified data with the View() function.
7.2 The original and modified datasets (Before and After scaling)
Before Scaling
data.frame(training_Set[1:5,])
After Scaling
data.frame(standardized_training_set[1:5,])
Summary: The scaling parameters (column means and standard deviations) were estimated from the training set with preProcess() and applied to both the training and testing sets via predict(), so the standardized training features are centered at zero with unit variance and the two sets remain on a comparable scale.
8.1 What insights did you gain about the dataset from your EDA?
I began by exploring the dataset’s columns to understand their meaning and overall context. I then examined the distribution of each variable, followed by an analysis of correlations. These findings are detailed below.
a) Variable Distributions: Histograms for Age, Income, and Credit Score, along with a grouped bar chart showing gender distribution across regions, revealed the following:
Age: The age distribution is characterized by low frequencies at the extremes (very young and very old ages) and multiple peaks and valleys, suggesting varying representation across age groups.
Credit Score: Credit scores range from 300 to 850, with fewer individuals at the extremes. Distinct peaks indicate potential clustering within specific ranges, particularly in the 600-700 and 700-800 ranges.
Income: The income distribution is positively skewed (long tail to the right), with a concentration of individuals in lower income brackets and a prominent peak around 200K.
Gender in Region: Gender is approximately balanced across all regions.
b) Correlation: A correlation analysis, visualized using a correlation plot (corrplot), revealed weak positive and negative correlations among the numerical variables. This observation was confirmed quantitatively using the cor() function, which yielded correlation coefficients close to zero, further indicating the weak relationships.
8.2 Why did you choose the specific method for handling missing data?
Because roughly 14.8% of the values (in the ‘Age’ and ‘Income’ columns) were missing, deleting the affected rows would have discarded a substantial portion of the dataset. I therefore used mean imputation, replacing each missing value with the mean of its column, which preserves the dataset's size and the columns' overall means.
8.3 Why did you select the chosen scaling method, and how did it affect the data?
I selected standardization (Z-score scaling) because the numerical features are continuous and measured on very different scales (age in years, income in dollars, credit score on its own scale), and because it is less distorted by extreme values than min-max normalization. After scaling, each numerical feature in the training set has a mean of approximately zero and a standard deviation of one; the same training-set parameters were applied to the testing set so the two sets remain comparable.