Libraries needed:
library(ggplot2)
library(corrplot)
## corrplot 0.95 loaded
library(caret)
## Loading required package: lattice
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
POINT 1: I downloaded the dataset “preprocessing_dataset.csv”
POINT 2: Load the dataset
dataset <- read.csv("preprocessing_dataset.csv", header = T, sep = ",")
head(dataset)
## Age Income Credit_Score Gender Region
## 1 20.28857 132075.04 426.9625 Female East
## 2 69.21595 44869.27 704.3927 Male West
## 3 65.62840 203902.12 597.6344 Other East
## 4 43.11777 237117.77 585.4740 Male South
## 5 27.40355 61800.95 717.4642 Female North
## 6 92.94005 266309.56 395.4117 Female East
POINT 3: Exploratory Data Analysis (EDA)
Visualize Data Distributions:
AGE
ggplot(data = dataset, aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "yellow", color = "red") +
  labs(title = "Distribution of Age",
       x = "Age",
       y = "Frequency")

Interpretation: The distribution is not normal. The most populated age
group is around 90 years old, and there appear to be no outliers.
INCOME
ggplot(data = dataset, aes(x = Income)) +
  geom_histogram(binwidth = 20000, fill = "yellow", color = "red") +
  labs(title = "Distribution of Income",
       x = "Income",
       y = "Frequency")

Interpretation: The distribution is not normal. The most populated income
group is around 100,000, and there appear to be no outliers.
CREDIT SCORE
ggplot(data = dataset, aes(x = Credit_Score)) +
  geom_histogram(binwidth = 50, fill = "yellow", color = "red") +
  labs(title = "Distribution of Credit Score",
       x = "Credit Score",
       y = "Frequency")

Interpretation: The distribution is not normal. The most populated group
has a credit score of around 700, and there appear to be no outliers.
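Since these "no outliers" observations come only from the histograms, boxplots of the
three numeric variables would make the check more explicit. This is a minimal sketch of
such an extra check, not part of the original assignment output:
boxplot(dataset$Age, main = "Age", horizontal = TRUE)          # points beyond the whiskers would flag potential outliers
boxplot(dataset$Income, main = "Income", horizontal = TRUE)
boxplot(dataset$Credit_Score, main = "Credit Score", horizontal = TRUE)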
GENDER
Gender_count <- dataset %>%
  group_by(Gender) %>%
  summarise(Count = n())

ggplot(Gender_count) +
  geom_col(aes(x = Gender,
               y = Count,
               fill = Gender)) +
  labs(title = "Count of People by Gender")

Interpretation: Most people are female, and there are no misspelled
category labels.
REGION
Region_count <- dataset %>%
  group_by(Region) %>%
  summarise(Count = n())

ggplot(Region_count) +
  geom_col(aes(x = Region,
               y = Count,
               fill = Region)) +
  labs(title = "Count of People by Region")

Interpretation: Most people are from the South, and there are no
misspelled category labels.
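A quick way to confirm that there are no misspelled levels in the categorical columns
is to tabulate them directly. This is a minimal sketch of such a check (the expected
levels are taken from the head() output above), not part of the original output:
table(dataset$Gender)   # expect only Female / Male / Other
table(dataset$Region)   # expect only East / North / South / West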
(UNOFFICIAL) Explore Correlations
Since there are NAs in the dataset, I will postpone this step until
after point 4, where I decide how to handle missing values.
Alternatively, just to get a quick look at the correlations at this
point, I could have followed these steps:
dataset_no_na <- dataset[complete.cases(dataset[, 1:3]), ]
Compute and visualize the correlation matrix
Correlation_matrix <- cor(dataset_no_na[, 1:3])
corrplot(Correlation_matrix, method = "color")

(UNOFFICIAL) Summarize the relationships between the variables based
on the correlation results:
There is no strong correlation between the variables, either positive
or negative. Age and Income have a slight positive correlation, Age and
Credit Score a slight negative correlation, and Income and Credit Score
a correlation close to 0.
More specifically, looking at the results from cor(dataset_no_na[,
1:3]):
Income vs Age = 0.039031189
Income vs Credit Score = -0.002737487
Age vs Credit Score = -0.064956235
Identify Missing Data
missing_values <- sapply(dataset, function(x) sum(is.na(x)))
print(missing_values)
## Age Income Credit_Score Gender Region
## 51 100 0 0 0
POINT 4: Handle Missing Data
Impute missing data
dataset_imputed <- dataset
for (col in 1:3) {
  dataset_imputed[is.na(dataset_imputed[, col]), col] <- median(dataset[, col], na.rm = TRUE)
}
Choice and rationale for why I selected this method:
Since the missing values affect more than 10% of the rows in the
dataset, I decided to opt for imputation rather than removing the rows
with missing values. Given that the distributions are not normal, I
imputed with the median instead of the mean, since the median is more
robust to skewed data. This way I keep more data in case I want to
create a predictive model. A quick check supporting both points is
sketched below.
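This is a minimal sketch of how the two claims above (the share of incomplete rows and
the skew that motivates the median) could be verified, not part of the original output:
# Share of rows with at least one NA (motivates imputation over row removal)
mean(!complete.cases(dataset))
# Mean vs. median of each numeric column; a clear gap between them suggests skew,
# which is why the median was preferred for imputation
sapply(dataset[, 1:3], function(x) c(mean = mean(x, na.rm = TRUE),
                                     median = median(x, na.rm = TRUE)))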
Verify the results and ensure the dataset is complete
missing_values_check2 <- sapply(dataset_imputed, function(x) sum(is.na(x)))
print(missing_values_check2)
## Age Income Credit_Score Gender Region
## 0 0 0 0 0
(OFFICIAL) Explore Correlations (Part of POINT 3):
Now that I have decided how to handle missing data, I can compute and
visualize the correlation matrix with the imputed dataset:
Correlation_matrix <- cor(dataset_imputed[, 1:3])
corrplot(Correlation_matrix, method = "color")

(OFFICIAL) Summarize the relationships between the variables based
on the correlation results
(Spoiler: Noticed no big changes in correlation compared to the
“unofficial” version)
There is no strong correlation between the variables either positive
or negative. Age and Income have a slight positive correlation, Age and
Credit Score a slight negative correlation and Income and Credit Score a
correlation close to 0.
More specifically, looking at the results from cor(dataset_imputed[,
1:3]):
Income vs Age = 0.036893475
Income vs Credit Score = -0.004748209
Age vs Credit Score = -0.065352681
These results suggest that:
1) As age increases, income tends to increase slightly.
2) There is no meaningful relationship between income and credit
score.
3) As age increases, the credit score tends to decrease slightly.
All three correlations are very close to zero, so none of these
relationships is strong.
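Given how close to zero these correlations are, a formal test would show whether even
the weak associations are distinguishable from noise. A minimal sketch of such a check
with cor.test(), not part of the original assignment output:
cor.test(dataset_imputed$Age, dataset_imputed$Income)          # Age vs. Income
cor.test(dataset_imputed$Income, dataset_imputed$Credit_Score) # Income vs. Credit Score
cor.test(dataset_imputed$Age, dataset_imputed$Credit_Score)    # Age vs. Credit Score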
POINT 5: Split the Dataset
To ensure reproducibility:
set.seed(123)
Splitting the dataset into an 80% training set and a 20% test set,
keeping the same Region distribution in both sets:
training_index <- createDataPartition(dataset_imputed$Region, p = .80, list = FALSE)
training_set <- dataset_imputed[ training_index,] # Training data (80%)
test_set <- dataset_imputed[-training_index,] # Test data (20%)
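Since createDataPartition() stratifies on Region, comparing the Region proportions in
the two sets would confirm that the split preserved the distribution. A minimal sketch
of this check, not part of the original output:
round(prop.table(table(training_set$Region)), 3) # Region shares in the training set
round(prop.table(table(test_set$Region)), 3)     # Region shares in the test set (should be similar)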
POINT 6: Scale the Features
standardized_params <- preProcess(training_set[1:3], method = c('center', 'scale'))
standardized_training_set <- predict(standardized_params, training_set[1:3])
standardized_test_set <- predict(standardized_params, test_set[1:3])
Why Standardization:
Since the variables have different units and widely different scales,
I opted for standardization. The expected outcome is a dataset in which
each variable has a mean of 0 and a standard deviation of 1, making the
variables comparable even though they are measured in different units.
Standardization helps to ensure that all features contribute equally to
the model, preventing variables with larger scales from dominating the
analysis.
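A minimal sanity check (not part of the original output) is to confirm that the
standardized training features now have mean ~0 and standard deviation ~1; the test set
will be close to, but not exactly, those values because it is scaled with the
training-set parameters:
round(sapply(standardized_training_set, mean), 3) # should all be ~0
round(sapply(standardized_training_set, sd), 3)   # should all be ~1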
POINT 7: Summarize the steps taken and display the first few rows of
the datasets
Summary of the steps taken so far:
1) Loaded the dataset and performed an EDA, visualizing the data
distributions and checking for outliers.
2) Identified and handled missing values by imputing them with the
median.
3) Explored the correlations between the numeric variables.
4) Split the dataset into an 80/20 training and test set.
5) Scaled the features using standardization.
The original dataset
print("Original Dataset:")
## [1] "Original Dataset:"
print(head(dataset))
## Age Income Credit_Score Gender Region
## 1 20.28857 132075.04 426.9625 Female East
## 2 69.21595 44869.27 704.3927 Male West
## 3 65.62840 203902.12 597.6344 Other East
## 4 43.11777 237117.77 585.4740 Male South
## 5 27.40355 61800.95 717.4642 Female North
## 6 92.94005 266309.56 395.4117 Female East
The cleaned dataset
print("Cleaned Dataset:")
## [1] "Cleaned Dataset:"
print(head(dataset_imputed))
## Age Income Credit_Score Gender Region
## 1 20.28857 132075.04 426.9625 Female East
## 2 69.21595 44869.27 704.3927 Male West
## 3 65.62840 203902.12 597.6344 Other East
## 4 43.11777 237117.77 585.4740 Male South
## 5 27.40355 61800.95 717.4642 Female North
## 6 92.94005 266309.56 395.4117 Female East
The scaled dataset
print("Standardized Dataset:")
## [1] "Standardized Dataset:"
print(head(standardized_training_set))
## Age Income Credit_Score
## 1 -1.7175914 -0.5427698 -0.978588280
## 4 -0.7014016 0.6091776 0.004947866
## 7 -0.2897269 -1.3195599 0.754787564
## 8 -1.3765529 -0.1752177 -1.345034619
## 9 1.4870895 0.9147030 -1.517213952
## 10 0.5461071 -1.2247669 -1.653468270
print(head(standardized_test_set))
## Age Income Credit_Score
## 2 0.4602995 -1.4991088 0.74281803
## 3 0.3006080 0.2449193 0.08040119
## 5 -1.4008843 -1.3134281 0.82392448
## 6 1.5163238 0.9293083 -1.17435567
## 16 -0.6629195 1.3418338 1.13404082
## 31 0.7990745 -1.4196464 0.14312943
POINT 8: Reflection
From my exploratory data analysis I gained several insights. The
dataset was free from outliers and the numerical variables were not
normally distributed. I also noticed that females were the most
frequent gender, although not by much, and that the South was the most
frequent region.
Looking at the missing data, I found 51 missing values in the Age
column and 100 in the Income column. Removing the rows with these
missing values would have meant losing more than 10% of the data, a
significant amount, so I decided to replace the NAs with the median,
which is a better option than the mean for skewed data since it is more
robust to it.
Computing and visualizing the correlation matrix of the numerical
variables, I noticed that there was no strong correlation between them,
neither positive nor negative.
For the scaling method I opted for standardization because of the
different units and ranges of the variables. This way, if I want to
build a predictive model, each feature will contribute equally,
preventing features with larger ranges from dominating.
Report of the answers given previously, repeated here to facilitate
the evaluation:
1) Why I handled missing values with imputation:
Since the missing values affect more than 10% of the rows in the
dataset, I decided to opt for imputation rather than removing the rows
with missing values. Given that the distributions are not normal, I
imputed with the median instead of the mean, since the median is more
robust to skewed data. This way I keep more data in case I want to
create a predictive model.
2) Summary of the relationships between the variables based on the
correlation results:
There is no strong correlation between the variables either positive
or negative. Age and Income have a slight positive correlation, Age and
Credit Score a slight negative correlation and Income and Credit Score a
correlation close to 0.
More specifically, looking at the results from cor(dataset_imputed[,
1:3]):
Income vs Age = 0.036893475
Income vs Credit Score = -0.004748209
Age vs Credit Score = -0.065352681
These results suggest that:
1) As age increases, income tends to increase slightly.
2) There is no meaningful relationship between income and credit
score.
3) As age increases, the credit score tends to decrease slightly.
3) Why Standardization:
Since the variables have different units and widely different scales,
I opted for standardization. The expected outcome is a dataset in which
each variable has a mean of 0 and a standard deviation of 1, making the
variables comparable even though they are measured in different units.
Standardization helps to ensure that all features contribute equally to
the model, preventing variables with larger scales from dominating the
analysis.