Libraries needed:
library(ggplot2)
library(corrplot)
## corrplot 0.95 loaded
library(caret)
## Loading required package: lattice
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
POINT 1: I downloaded the dataset “preprocessing_dataset.csv”
POINT 2: Load the dataset
dataset <- read.csv("preprocessing_dataset.csv", header = T, sep = ",")
head(dataset)
## Age Income Credit_Score Gender Region
## 1 20.28857 132075.04 426.9625 Female East
## 2 69.21595 44869.27 704.3927 Male West
## 3 65.62840 203902.12 597.6344 Other East
## 4 43.11777 237117.77 585.4740 Male South
## 5 27.40355 61800.95 717.4642 Female North
## 6 92.94005 266309.56 395.4117 Female East
POINT 3: Exploratory Data Analysis (EDA)
Visualize Data Distributions:
AGE
ggplot(data = dataset, aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "yellow", color = "red") +
  labs(title = "Distribution of Age",
       x = "Age",
       y = "Frequency")

Interpretation: The distribution is not normal. The most populated age
group is around 90 years old, and there appear to be no outliers.
INCOME
ggplot(data = dataset, aes(x = Income)) +
  geom_histogram(binwidth = 20000, fill = "yellow", color = "red") +
  labs(title = "Distribution of Income",
       x = "Income",
       y = "Frequency")

Interpretation: The distribution is not normal. The most populated income
group is around 100,000, and there appear to be no outliers.
CREDIT SCORE
ggplot(data = dataset, aes(x = Credit_Score)) +
  geom_histogram(binwidth = 50, fill = "yellow", color = "red") +
  labs(title = "Distribution of Credit Score",
       x = "Credit Score",
       y = "Frequency")

Interpretation: The distribution is not normal. The most populated group
has a credit score of around 700, and there appear to be no outliers.
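Since these "no outliers" observations come only from the histograms, boxplots of the
three numeric variables would make the check more explicit. This is a minimal sketch of
such an extra check, not part of the original assignment output:
boxplot(dataset$Age, main = "Age", horizontal = TRUE)          # points beyond the whiskers would flag potential outliers
boxplot(dataset$Income, main = "Income", horizontal = TRUE)
boxplot(dataset$Credit_Score, main = "Credit Score", horizontal = TRUE)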
GENDER
Gender_count <- dataset %>%
  group_by(Gender) %>%
  summarise(Count = n())

ggplot(Gender_count) +
  geom_col(aes(x = Gender,
               y = Count,
               fill = Gender)) +
  labs(title = "Count of People by Gender")

Interpretation: Most people are female, and there are no misspelled
category labels.
REGION
Region_count <- dataset %>%
  group_by(Region) %>%
  summarise(Count = n())

ggplot(Region_count) +
  geom_col(aes(x = Region,
               y = Count,
               fill = Region)) +
  labs(title = "Count of People by Region")

Interpretation: Most people are from the South, and there are no
misspelled category labels.
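A quick way to confirm that there are no misspelled levels in the categorical columns
is to tabulate them directly. This is a minimal sketch of such a check (the expected
levels are taken from the head() output above), not part of the original output:
table(dataset$Gender)   # expect only Female / Male / Other
table(dataset$Region)   # expect only East / North / South / West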
(UNOFFICIAL) Explore Correlations
Since there are NAs in the dataset, I will postpone this step until
after point 4, where I decide how to handle missing values.
Alternatively, just to get a quick look at the correlations at this
point, I could have followed these steps:
dataset_no_na <- dataset[complete.cases(dataset[, 1:3]), ]
Compute and visualize the correlation matrix
Correlation_matrix <- cor(dataset_no_na[, 1:3])
corrplot(Correlation_matrix, method = "color")

(UNOFFICIAL) Summarize the relationships between the variables based
on the correlation results:
There is no strong correlation between the variables, either positive
or negative. Age and Income have a slight positive correlation, Age and
Credit Score a slight negative correlation, and Income and Credit Score
a correlation close to 0.
More specifically, looking at the results from cor(dataset_no_na[,
1:3]):
Income vs Age = 0.039031189
Income vs Credit Score = -0.002737487
Age vs Credit Score = -0.064956235
Identify Missing Data
missing_values <- sapply(dataset, function(x) sum(is.na(x)))
print(missing_values)
## Age Income Credit_Score Gender Region
## 51 100 0 0 0
POINT 4: Handle Missing Data
Impute missing data
dataset_imputed <- dataset
for (col in 1:3) {
  dataset_imputed[is.na(dataset_imputed[, col]), col] <- median(dataset[, col], na.rm = TRUE)
}
Choice and rationale for why I selected this method:
Since the missing values affect more than 10% of the rows in the
dataset, I decided to opt for imputation rather than removing the rows
with missing values. Given that the distributions are not normal, I
imputed with the median instead of the mean, since the median is more
robust to skewed data. This way I keep more data in case I want to
create a predictive model. A quick check supporting both points is
sketched below.
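This is a minimal sketch of how the two claims above (the share of incomplete rows and
the skew that motivates the median) could be verified, not part of the original output:
# Share of rows with at least one NA (motivates imputation over row removal)
mean(!complete.cases(dataset))
# Mean vs. median of each numeric column; a clear gap between them suggests skew,
# which is why the median was preferred for imputation
sapply(dataset[, 1:3], function(x) c(mean = mean(x, na.rm = TRUE),
                                     median = median(x, na.rm = TRUE)))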
Verify the results and ensure the dataset is complete
missing_values_check2 <- sapply(dataset_imputed, function(x) sum(is.na(x)))
print(missing_values_check2)
## Age Income Credit_Score Gender Region
## 0 0 0 0 0
(OFFICIAL) Explore Correlations (Part of POINT 3):
Now that I have decided how to handle missing data, I can compute and
visualize the correlation matrix with the imputed dataset:
Correlation_matrix <- cor(dataset_imputed[, 1:3])
corrplot(Correlation_matrix, method = "color")

(OFFICIAL) Summarize the relationships between the variables based
on the correlation results
(Spoiler: Noticed no big changes in correlation compared to the
“unofficial” version)
There is no strong correlation between the variables either positive
or negative. Age and Income have a slight positive correlation, Age and
Credit Score a slight negative correlation and Income and Credit Score a
correlation close to 0.
More specifically, looking at the results from cor(dataset_imputed[,
1:3]):
Income vs Age = 0.036893475
Income vs Credit Score = -0.004748209
Age vs Credit Score = -0.065352681
These results suggest that:
1) As age increases, income tends to increase slightly.
2) There is no meaningful relationship between income and credit
score.
3) As age increases, the credit score tends to decrease slightly.
All three correlations are very close to zero, so none of these
relationships is strong.
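Given how close to zero these correlations are, a formal test would show whether even
the weak associations are distinguishable from noise. A minimal sketch of such a check
with cor.test(), not part of the original assignment output:
cor.test(dataset_imputed$Age, dataset_imputed$Income)          # Age vs. Income
cor.test(dataset_imputed$Income, dataset_imputed$Credit_Score) # Income vs. Credit Score
cor.test(dataset_imputed$Age, dataset_imputed$Credit_Score)    # Age vs. Credit Score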
POINT 5: Split the Dataset
To ensure reproducibility:
set.seed(123)
Splitting the dataset into an 80% training set and a 20% test set,
keeping the same Region distribution in both sets:
training_index <- createDataPartition(dataset_imputed$Region, p = .80, list = FALSE)
training_set <- dataset_imputed[ training_index,] # Training data (80%)
test_set <- dataset_imputed[-training_index,] # Test data (20%)
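Since createDataPartition() stratifies on Region, comparing the Region proportions in
the two sets would confirm that the split preserved the distribution. A minimal sketch
of this check, not part of the original output:
round(prop.table(table(training_set$Region)), 3) # Region shares in the training set
round(prop.table(table(test_set$Region)), 3)     # Region shares in the test set (should be similar)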
POINT 6: Scale the Features
standardized_params <- preProcess(training_set[1:3], method = c('center', 'scale'))
standardized_training_set <- predict(standardized_params, training_set[1:3])
standardized_test_set <- predict(standardized_params, test_set[1:3])
Why Standardization:
Since the variables have different units and widely different scales,
I opted for standardization. The expected outcome is a dataset in which
each variable has a mean of 0 and a standard deviation of 1, making the
variables comparable even though they are measured in different units.
Standardization helps to ensure that all features contribute equally to
the model, preventing variables with larger scales from dominating the
analysis.
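A minimal sanity check (not part of the original output) is to confirm that the
standardized training features now have mean ~0 and standard deviation ~1; the test set
will be close to, but not exactly, those values because it is scaled with the
training-set parameters:
round(sapply(standardized_training_set, mean), 3) # should all be ~0
round(sapply(standardized_training_set, sd), 3)   # should all be ~1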
POINT 7: Summarize the steps taken and display the first few rows of
the datasets
Summary of the steps taken so far:
1) Loaded the dataset and performed an EDA, visualizing the data
distributions and checking for outliers.
2) Identified and handled missing values by imputing them with the
median.
3) Explored the correlations between the numeric variables.
4) Split the dataset into an 80/20 training and test set.
5) Scaled the features using standardization.
The original dataset
print("Original Dataset:")
## [1] "Original Dataset:"
print(head(dataset))
## Age Income Credit_Score Gender Region
## 1 20.28857 132075.04 426.9625 Female East
## 2 69.21595 44869.27 704.3927 Male West
## 3 65.62840 203902.12 597.6344 Other East
## 4 43.11777 237117.77 585.4740 Male South
## 5 27.40355 61800.95 717.4642 Female North
## 6 92.94005 266309.56 395.4117 Female East
The cleaned dataset
print("Cleaned Dataset:")
## [1] "Cleaned Dataset:"
print(head(dataset_imputed))
## Age Income Credit_Score Gender Region
## 1 20.28857 132075.04 426.9625 Female East
## 2 69.21595 44869.27 704.3927 Male West
## 3 65.62840 203902.12 597.6344 Other East
## 4 43.11777 237117.77 585.4740 Male South
## 5 27.40355 61800.95 717.4642 Female North
## 6 92.94005 266309.56 395.4117 Female East
The scaled dataset
print("Standardized Dataset:")
## [1] "Standardized Dataset:"
print(head(standardized_training_set))
## Age Income Credit_Score
## 1 -1.7175914 -0.5427698 -0.978588280
## 4 -0.7014016 0.6091776 0.004947866
## 7 -0.2897269 -1.3195599 0.754787564
## 8 -1.3765529 -0.1752177 -1.345034619
## 9 1.4870895 0.9147030 -1.517213952
## 10 0.5461071 -1.2247669 -1.653468270
print(head(standardized_test_set))
## Age Income Credit_Score
## 2 0.4602995 -1.4991088 0.74281803
## 3 0.3006080 0.2449193 0.08040119
## 5 -1.4008843 -1.3134281 0.82392448
## 6 1.5163238 0.9293083 -1.17435567
## 16 -0.6629195 1.3418338 1.13404082
## 31 0.7990745 -1.4196464 0.14312943
POINT 8: Reflection
From my exploratory data analysis I gained several insights. The
dataset was free from outliers and the numerical variables were not
normally distributed. I also noticed that females were the most
frequent gender, although not by much, and that the South was the most
frequent region.
Looking at the missing data, I found 51 missing values in the Age
column and 100 in the Income column. Removing the rows with these
missing values would have meant losing more than 10% of the data, a
significant amount, so I decided to replace the NAs with the median,
which is a better option than the mean for skewed data since it is more
robust to it.
Computing and visualizing the correlation matrix of the numerical
variables, I noticed that there was no strong correlation between them,
neither positive nor negative.
For the scaling method I opted for standardization because of the
different units and ranges of the variables. This way, if I want to
build a predictive model, each feature will contribute equally,
preventing features with larger ranges from dominating.
Report of the answers given previously, repeated here to facilitate
the evaluation:
1) Why I handled missing values with imputation:
Since the missing values affect more than 10% of the rows in the
dataset, I decided to opt for imputation rather than removing the rows
with missing values. Given that the distributions are not normal, I
imputed with the median instead of the mean, since the median is more
robust to skewed data. This way I keep more data in case I want to
create a predictive model.
2) Summary of the relationships between the variables based on the
correlation results:
There is no strong correlation between the variables either positive
or negative. Age and Income have a slight positive correlation, Age and
Credit Score a slight negative correlation and Income and Credit Score a
correlation close to 0.
More specifically, looking at the results from cor(dataset_imputed[,
1:3]):
Income vs Age = 0.036893475
Income vs Credit Score = -0.004748209
Age vs Credit Score = -0.065352681
These results suggest that:
1) As age increases, income tends to increase slightly.
2) There is no meaningful relationship between income and credit
score.
3) As age increases, the credit score tends to decrease slightly.
3) Why Standardization:
Since the variables have different units and widely different scales,
I opted for standardization. The expected outcome is a dataset in which
each variable has a mean of 0 and a standard deviation of 1, making the
variables comparable even though they are measured in different units.
Standardization helps to ensure that all features contribute equally to
the model, preventing variables with larger scales from dominating the
analysis.