Introduction to EDA and Preprocessing in R

This document details the exploratory data analysis (EDA) and preprocessing steps performed on a customer dataset to prepare it for subsequent modeling. The primary objectives were to handle missing data, understand the distribution of key variables, explore correlations, and appropriately scale the features. This process is crucial for ensuring the quality and reliability of any predictive models built on the data.

The following sections provide a detailed walkthrough of the data loading, missing data handling, EDA, data splitting, and feature scaling processes. Each step is accompanied by R code and interpretations of the results.

Code and Analysis:

1. Load the Dataset

library(readr)
dataset <- read_csv("preprocessing_dataset.csv")
## Rows: 1020 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Gender, Region
## dbl (3): Age, Income, Credit_Score
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(dataset)
## spc_tbl_ [1,020 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Age         : num [1:1020] 20.3 69.2 65.6 43.1 27.4 ...
##  $ Income      : num [1:1020] 132075 44869 203902 237118 61801 ...
##  $ Credit_Score: num [1:1020] 427 704 598 585 717 ...
##  $ Gender      : chr [1:1020] "Female" "Male" "Other" "Male" ...
##  $ Region      : chr [1:1020] "East" "West" "East" "South" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Age = col_double(),
##   ..   Income = col_double(),
##   ..   Credit_Score = col_double(),
##   ..   Gender = col_character(),
##   ..   Region = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

2. Identify Missing Data

# Count missing values in each column
missing_values <- sapply(dataset, function(x) sum(is.na(x)))
print(missing_values)
##          Age       Income Credit_Score       Gender       Region 
##           51          100            0            0            0
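
To put these counts in context (a quick sketch), they can be expressed as a share of the 1,020 rows; roughly 5% of Age values and 10% of Income values are missing:

# Missing counts as a percentage of all rows
round(missing_values / nrow(dataset) * 100, 1)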

3. Handle Missing Data

Missing values were handled with mean imputation: each NA in a numeric column is replaced by that column's mean.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)

dataset_imputed <- dataset
indices_to_impute <- 1:3                       # the three numeric columns
cols_to_impute <- names(dataset)[indices_to_impute]

# Replace each missing value with the column mean from the observed values
for (col_name in cols_to_impute){
  dataset_imputed[is.na(dataset_imputed[[col_name]]), 
                  col_name] <- mean(dataset[[col_name]], na.rm = TRUE)
}

missing_values <- sapply(dataset_imputed, function(x) sum(is.na(x)))
print(missing_values)
##          Age       Income Credit_Score       Gender       Region 
##            0            0            0            0            0
# Backup copy, kept for the before/after comparison in Section 7
cleaned_dataset <- dataset_imputed 
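
As a sanity check on the imputation (a sketch, not part of the original pipeline), mean imputation leaves each column's mean unchanged while slightly shrinking its standard deviation, because every imputed value sits exactly at the mean:

# Means are preserved by mean imputation; standard deviations shrink slightly
sapply(cols_to_impute, function(col) {
  c(mean_before = mean(dataset[[col]], na.rm = TRUE),
    mean_after  = mean(dataset_imputed[[col]]),
    sd_before   = sd(dataset[[col]], na.rm = TRUE),
    sd_after    = sd(dataset_imputed[[col]]))
})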

4. Exploratory Data Analysis (EDA)

4.1. Visualize Data Distributions

Distributions of Age

library(ggplot2)
ggplot(data = dataset_imputed, aes(x = Age)) +
  geom_histogram(binwidth = 7.45, fill = 'orange', color = 'black') +
  labs(title = 'Distribution of Age', x = 'Age', y = 'Frequency') +
  theme(plot.title = element_text(hjust = 0.5))

Interpretation:

Distribution of Credit Score

ggplot(data = dataset_imputed, aes(x = Credit_Score)) +
  geom_histogram(binwidth = 50, fill = 'lightblue', color = 'black') +
  labs(title = 'Distribution of Credit Score', x = 'Credit Score', y = 'Frequency') +
  theme(plot.title = element_text(hjust = 0.5))

Interpretation:

Distribution of Income

ggplot(data = dataset_imputed, aes(x = Income)) +
  geom_histogram(binwidth = 30000, fill = 'lightgreen', color = 'black') +
  labs(title = 'Distribution of Income', x = 'Income', y = 'Frequency') +
  scale_x_continuous(labels = scales::comma) +
  theme(plot.title = element_text(hjust = 0.5))

Interpretation:
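
Because the interpretation depends on the plotted shape, a numeric complement (a sketch) is to compare the mean and median; a mean well above the median would indicate the right skew that income data often shows:

# Mean vs. median as a quick skewness check
summary(dataset_imputed$Income)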

Distribution of Gender Across Region

countGender <- dataset_imputed %>% 
  group_by(Region, Gender) %>% 
  summarise(n = n())
## `summarise()` has grouped output by 'Region'. You can override using the
## `.groups` argument.
ggplot(countGender, aes(x = Region, y = n, fill = Gender)) +
  geom_col(position = "dodge") +  # Use dodge for side-by-side bars
  labs(title = "Gender Distribution by Region",
       x = "Region",
       y = "Count",
       fill = "Gender") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

Interpretation: Gender counts appear approximately balanced across all regions.
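
To test this visual impression formally, a chi-square test of independence between Gender and Region could be run (a sketch; a large p-value would be consistent with the apparent balance):

# Test whether Gender counts are independent of Region
chisq.test(table(dataset_imputed$Gender, dataset_imputed$Region))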

4.2. Explore Correlations

library(corrplot)
## corrplot 0.95 loaded
Correlation_matrix <- cor(dataset_imputed[,1:3])
corrplot(Correlation_matrix, method = 'color')

Interpretation: The correlation plot highlights the pairwise relationships visually; the cor() function was then used to validate them quantitatively.

Using the cor() function to quantify the correlations between the above variables

dataset_imputed %>% 
  with(cor(Age, Income))
## [1] 0.03620266

Interpretation: The coefficient (0.036) is close to zero and slightly positive, indicating little to no linear relationship between Age and Income.

dataset_imputed %>% 
  with(cor(Age, Credit_Score))
## [1] -0.06539502

Interpretation: The coefficient (-0.065) is close to zero and slightly negative, indicating little to no linear relationship between Age and Credit_Score.

dataset_imputed %>% 
  with(cor(Income, Credit_Score))
## [1] -0.004583165

Interpretation: The coefficient (-0.005) is essentially zero, indicating no meaningful linear relationship between Income and Credit_Score.
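
These point estimates can be backed up with significance tests; cor.test() returns a p-value and confidence interval for each pair (a sketch, shown for Age and Income):

# Test whether the Age-Income correlation differs from zero
cor.test(dataset_imputed$Age, dataset_imputed$Income)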

5. Split the Dataset

library(caret)
## Loading required package: lattice
set.seed(123)
training_index <- createDataPartition(dataset_imputed$Credit_Score, p = .80, list = FALSE)
training_Set <- dataset_imputed[training_index,]
testing_Set <- dataset_imputed[-training_index,]
training_Set[1:5,]
testing_Set[1:5,]

I used Credit_Score as the dependent variable. The dataset was split into training (80%) and testing (20%) sets using createDataPartition(), which stratifies the sampling on the outcome.
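
A quick check (a sketch) confirms the row counts match the intended proportions:

# Verify the 80/20 split
nrow(training_Set) / nrow(dataset_imputed)   # ~0.80
nrow(testing_Set) / nrow(dataset_imputed)    # ~0.20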

Confirming the Reproducibility of the Data Split

set.seed(123)
training_index <- createDataPartition(dataset_imputed$Credit_Score, p = .80, list = FALSE)
training_Set <- dataset_imputed[training_index,]
testing_Set <- dataset_imputed[-training_index,]
training_Set[1:5,]
testing_Set[1:5,]
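
Rather than comparing printed rows by eye, the two index vectors can be compared directly. The sketch below introduces a second object, training_index2 (a name used here only for illustration); with the same seed, createDataPartition() should return identical indices:

set.seed(123)
training_index2 <- createDataPartition(dataset_imputed$Credit_Score, p = .80, list = FALSE)
identical(training_index, training_index2)   # should be TRUE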

6. Scale the Features

I chose the standardization method (z-score scaling) for the following reasons:

1. Standardization is well-suited to continuous variables measured in different units or on different scales, e.g., age in years, income in dollars, and credit score on its own scale.

2. With extreme outliers, normalization (min-max scaling) compresses the remaining values into a narrow slice of the [0, 1] range, which can make it harder for a model to learn meaningful patterns (illustrated in the sketch after this list).

3. Standardization is less affected by such outliers because it centers the data on the mean and scales by the standard deviation rather than forcing values into a fixed range.

4. Normalization is most appropriate when the data are naturally bounded within a known range, which is not the case here.

Summary: Considering the above points, standardization appears to be the safer option for this dataset.
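
To illustrate point 2, the toy vector below (hypothetical values, not drawn from the dataset) shows how a single extreme outlier pins the min-max-normalized bulk near 0, while z-scores are not forced into a fixed range:

x <- c(10, 12, 11, 13, 500)                     # one extreme outlier
normalized <- (x - min(x)) / (max(x) - min(x))  # min-max: bulk squeezed near 0
standardized <- (x - mean(x)) / sd(x)           # z-scores: no fixed bounds
round(normalized, 3)
round(standardized, 3)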

# Learn centering and scaling parameters from the training set only,
# then apply the same transformation to both sets (no test-set leakage)
standardized_params <- preProcess(training_Set[,1:3], 
                                  method = c('center','scale'))
standardized_training_set <- predict(standardized_params, training_Set[,1:3])
standardized_testing_set <- predict(standardized_params, testing_Set[,1:3])
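
As a quick check (a sketch), the standardized training columns should now have mean approximately 0 and standard deviation 1; the testing set, scaled with the training-set parameters, will land near but not exactly at those values:

round(sapply(standardized_training_set, mean), 3)  # ~0 for each column
round(sapply(standardized_training_set, sd), 3)    # 1 for each column
round(sapply(standardized_testing_set, mean), 3)   # near 0, but not exactly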

7. Output Results

7.1 The original and modified datasets (Before and After cleaning)

Before Cleaning

data.frame(dataset[25:30,])

After Cleaning

data.frame(cleaned_dataset[25:30,])

Summary: Rows that previously contained NA values in Age or Income now hold the respective column means; all other values are unchanged.

7.2 The original and modified datasets (Before and After scaling)

Before Scaling

data.frame(training_Set[1:5,])

After Scaling

data.frame(standardized_training_set[1:5,])

Summary: Age, Income, and Credit_Score are now z-scores centered on the training-set means, so the three columns sit on a common, unitless scale.

8. Reflection

8.1 What insights did you gain about the dataset from your EDA?

I began by exploring the dataset’s columns to understand their meaning and overall context. I then examined the distribution of each variable, followed by an analysis of correlations. These findings are detailed below.

a) Variable Distributions: Histograms of Age, Income, and Credit Score, together with a grouped bar chart of gender by region, showed the overall shape of each variable and indicated that gender counts are approximately balanced across regions.

b) Correlation: The correlation plot (corrplot) and the cor() function both indicated weak relationships among the numeric variables, with all pairwise coefficients close to zero.

8.2 Why did you choose the specific method for handling missing data?

Mean imputation was chosen because it is simple, retains all 1,020 rows (dropping incomplete rows would have discarded up to 151 observations), and leaves each column's mean unchanged. The trade-off is a slight reduction in the variance of the imputed columns.

8.3 Why did you select the chosen scaling method, and how did it affect the data?

Standardization was selected for the reasons detailed in Section 6: the numeric variables are continuous, unbounded, and on very different scales. Applying the training-set parameters converted Age, Income, and Credit_Score into z-scores, centering each training column at 0 with a standard deviation of 1 and placing the test set on the same scale.