library(ggplot2)
library(dplyr)
library(tidyverse)
library(statsr)
library(skimr)
library(kableExtra)
library(grid)
library(gridExtra)
library(scales)
library(GGally)
library(corrplot)
library(PerformanceAnalytics)
Credit scores with causes and consequences
Introduction
This analysis report is the additional report in basis of the assignment from the Business Analysis course by ESSEC Business School. The analysis is implemented according to the phases, Ask, Prepare, Process, Share, and Action.
1.Ask
1.1. Data
Data of credit scoring data is provided by Essec Business School for the purpose of the causes and consequences analysis. Through out this analysis, we will check out the correlations in the data set of credit score and find out causes and consequences with visualizations.
1.2. Questions
- What are the three strong relations of variables in the dataset?
2.Prepare
2.1. Set up necessary packages for the analysis.
2.2.Load the data
#To clean up the memory of your current R session run the following line
rm(list=ls(all=TRUE))
#Set up working directory
setwd("C:/Users/satos/Documents/project/Quarto/04.business-analytics/causes-consequences")
#Load credit socoring data with a csv file.
<- read_csv("DATA_3.01_CREDIT.csv") df1
2.3.Statistic summary
str(df1) # The str() function shows the structure of your dataset and details the type of variables that it contains
spc_tbl_ [300 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Income : num [1:300] 14.9 106 104.6 148.9 55.9 ...
$ Rating : num [1:300] 283 483 514 681 357 569 259 512 266 491 ...
$ Cards : num [1:300] 2 3 4 3 2 4 2 2 5 3 ...
$ Age : num [1:300] 34 82 71 36 68 77 37 87 66 41 ...
$ Education: num [1:300] 11 15 11 11 16 10 12 9 13 19 ...
$ Gender : chr [1:300] "Male" "Female" "Male" "Female" ...
$ Student : chr [1:300] "No" "Yes" "No" "No" ...
$ Married : chr [1:300] "Yes" "Yes" "No" "No" ...
$ Ethnicity: chr [1:300] "Caucasian" "Asian" "Asian" "Asian" ...
$ Balance : num [1:300] 333 903 580 964 331 ...
- attr(*, "spec")=
.. cols(
.. Income = col_double(),
.. Rating = col_double(),
.. Cards = col_double(),
.. Age = col_double(),
.. Education = col_double(),
.. Gender = col_character(),
.. Student = col_character(),
.. Married = col_character(),
.. Ethnicity = col_character(),
.. Balance = col_double()
.. )
- attr(*, "problems")=<externalptr>
summary(df1) # The summary() function provides for each variable in your dataset the minimum, mean, maximum and quartiles
Income Rating Cards Age
Min. : 10.35 Min. : 93.0 Min. :1.000 Min. :24.00
1st Qu.: 21.03 1st Qu.:235.0 1st Qu.:2.000 1st Qu.:41.00
Median : 33.12 Median :339.0 Median :3.000 Median :55.00
Mean : 44.05 Mean :348.1 Mean :3.027 Mean :54.98
3rd Qu.: 55.98 3rd Qu.:433.0 3rd Qu.:4.000 3rd Qu.:69.00
Max. :186.63 Max. :949.0 Max. :8.000 Max. :91.00
Education Gender Student Married
Min. : 5.00 Length:300 Length:300 Length:300
1st Qu.:11.00 Class :character Class :character Class :character
Median :14.00 Mode :character Mode :character Mode :character
Mean :13.39
3rd Qu.:16.00
Max. :20.00
Ethnicity Balance
Length:300 Min. : 0.00
Class :character 1st Qu.: 15.75
Mode :character Median : 433.50
Mean : 502.69
3rd Qu.: 857.75
Max. :1809.00
The credit score dataset is made up of 300 observations and 10 variables.
2.4. Check NA values.
table(is.na(df1))
FALSE
3000
Zero NA value is confirmed.
2.5.Exploratory data analysis
Let’s see the correlation between all the numerical variables
chart.Correlation(df1[c(1:5,10)], histogram = TRUE, method = "pearson")
Warning in par(usr): argument 1 does not name a graphical parameter
Warning in par(usr): argument 1 does not name a graphical parameter
Warning in par(usr): argument 1 does not name a graphical parameter
Warning in par(usr): argument 1 does not name a graphical parameter
Warning in par(usr): argument 1 does not name a graphical parameter
Warning in par(usr): argument 1 does not name a graphical parameter
Warning in par(usr): argument 1 does not name a graphical parameter
Warning in par(usr): argument 1 does not name a graphical parameter
Warning in par(usr): argument 1 does not name a graphical parameter
Warning in par(usr): argument 1 does not name a graphical parameter
Warning in par(usr): argument 1 does not name a graphical parameter
Warning in par(usr): argument 1 does not name a graphical parameter
Warning in par(usr): argument 1 does not name a graphical parameter
Warning in par(usr): argument 1 does not name a graphical parameter
Warning in par(usr): argument 1 does not name a graphical parameter
Accordingly, “Rating” and “Balance” has the strongest relations. Then, “Income” and “Rating” is at the second. Lastly, “Income” and “Balance” has the third biggest correlations. In order to see the correlation results more clearly, I am going to plot a correlation matrix plot too. #### 2.5.1 Corre matrix
<- cor(df1[ ,sapply(df1,is.numeric)], use= "complete.obs")
cor_matrix
corrplot(cor_matrix, method="shade", shade.col=NA, cl.pos="n", tl.col="black",
tl.srt=30, addCoef.col="black")
Finally, let us look closer into just three variables mentioned earlier.
<- df1 %>%
scores select(Income,Rating, Balance)
<- function(data, mapping, method = "lm", ...) {
lowerFn <- ggplot(data = data, mapping = mapping) +
p geom_point(colour = "black") +
geom_smooth(method = method, color = "red", ...)
p#A function to help add a regression line to the scatter plots to the ggpairs function.
}
ggpairs(scores, lower = list(continuous = wrap(lowerFn, method = "lm")),
diag = list(continuous = wrap("barDiag", colour = "white")),
upper = list(continuous = wrap("cor", size = 8)))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`geom_smooth()` using formula = 'y ~ x'
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Out of these three variables, we have to choose a suitable response variable and an explanatory variable. The remaining variable has to be left out of the model as two explanatory variables having strong collinearity with each other might complicate the model and violates the conditions for linear regression modeling.
Since Rating has a higher correlation with both of Balance and Income, I have decided to choose it as the explanatory variable. For the response variable, I chose the Balance variable as the response variable. Since the Income variable have lower collinearity with other variables, this will be excluded from the model.
2.5.2 Response variable and categorical variables
<- ggplot(data = df1, aes(y = Balance, fill = Ethnicity)) +
bp1 geom_boxplot() + scale_fill_brewer(palette="Paired") +
labs(title = "Balance by Ethnicity", y = "Balance",
x = "Ethnicity") + theme(plot.title = element_text(hjust = 0.5))
<- ggplot(data = df1, aes(y = Balance, fill = Gender)) +
bp2 geom_boxplot() + scale_fill_brewer(palette="Dark2") +
labs(title = "Balance by Gender", y = "Balance",
x = "Gender") + theme(plot.title = element_text(hjust = 0.5))
<- ggplot(data = df1, aes(y = Balance, fill = Student)) +
bp3 geom_boxplot() + scale_fill_brewer(palette="Dark2") +
labs(title = "Balance by Student", y = "Balance",
x = "Student") + theme(plot.title = element_text(hjust = 0.5))
<- ggplot(data = df1, aes(y = Balance, fill = Married)) +
bp4 geom_boxplot() + scale_fill_brewer(palette="Dark2") +
labs(title = "Balance by Married", y = "Balance",
x = "Married") + theme(plot.title = element_text(hjust = 0.5))
grid.arrange(bp1, bp2, bp3, bp4, ncol = 1, top = "Balance and Categorical Variables")
In the first plot, the all three ethnicity have the similar average balance.
In the second plot, the differences of balance by gender are not confirmed.
In the third plot, the average balance by students higher than non-student.
In the fourth plot, the differences of balace by married are not confirmed.
<- ggplot(data = df1, aes(y = Income, fill = Ethnicity)) +
bp5 geom_boxplot() + scale_fill_brewer(palette="Paired") +
labs(title = "Income by Ethnicity", y = "Income",
x = "Ethnicity") + theme(plot.title = element_text(hjust = 0.5))
<- ggplot(data = df1, aes(y = Income, fill = Gender)) +
bp6 geom_boxplot() + scale_fill_brewer(palette="Dark2") +
labs(title = "Income by Gender", y = "Income",
x = "Gender") + theme(plot.title = element_text(hjust = 0.5))
<- ggplot(data = df1, aes(y = Income, fill = Student)) +
bp7 geom_boxplot() + scale_fill_brewer(palette="Dark2") +
labs(title = "Income by Student", y = "Income",
x = "Student") + theme(plot.title = element_text(hjust = 0.5))
<- ggplot(data = df1, aes(y = Income, fill = Married)) +
bp8 geom_boxplot() + scale_fill_brewer(palette="Dark2") +
labs(title = "Income by Married", y = "Income",
x = "Married") + theme(plot.title = element_text(hjust = 0.5))
grid.arrange(bp5, bp6, bp7, bp8, ncol = 1, top = "Income and Categorical Variables")
For the reference, we will check rating and income as response variables.
<- ggplot(data = df1, aes(y = Rating, fill = Ethnicity)) +
bp9 geom_boxplot() + scale_fill_brewer(palette="Paired") +
labs(title = "Rating by Ethnicity", y = "Rating",
x = "Ethnicity") + theme(plot.title = element_text(hjust = 0.5))
<- ggplot(data = df1, aes(y = Rating, fill = Gender)) +
bp10 geom_boxplot() + scale_fill_brewer(palette="Dark2") +
labs(title = "Rating by Gender", y = "Rating",
x = "Gender") + theme(plot.title = element_text(hjust = 0.5))
<- ggplot(data = df1, aes(y = Rating, fill = Student)) +
bp11 geom_boxplot() + scale_fill_brewer(palette="Dark2") +
labs(title = "Rating by Student", y = "Rating",
x = "Student") + theme(plot.title = element_text(hjust = 0.5))
<- ggplot(data = df1, aes(y = Rating, fill = Married)) +
bp12 geom_boxplot() + scale_fill_brewer(palette="Dark2") +
labs(title = "Rating by Married", y = "Rating",
x = "Married") + theme(plot.title = element_text(hjust = 0.5))
grid.arrange(bp9, bp10, bp11, bp12, ncol = 1, top = "Rating and Categorical Variables")
Accordingly, Rating and Income does not influence largely to other categorical values.
2.5.3. Exploratory data analysis (EDA)
<- ggplot(data = df1, aes(x = Ethnicity, fill = Ethnicity)) +
p1 geom_bar(aes(y=100*(..count..)/sum(..count..))) +
scale_fill_brewer(palette="Paired") + xlab("Ethnicity") + ylab("Percentage (%)") +
coord_flip()
<- ggplot(data = df1, aes(x = Gender, fill = Gender)) +
p2 geom_bar(aes(y=100*(..count..)/sum(..count..))) + scale_fill_brewer(palette="Paired") +
xlab("Gender") + ylab("Percentage (%)") + coord_flip()
grid.arrange(p1, p2, nrow = 1, top = "Ethnicity & Gender")
Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(count)` instead.
<- ggplot(data = df1, aes(x = Cards)) +
p3 geom_histogram(binwidth = 1, color="black", fill="grey",
aes(y=..density..), alpha=1) + geom_density(lwd = 0.8) +
xlab("Number of holding cards") + ylab("Density")
<- ggplot(data = df1, aes(x = Age)) +
p4 geom_histogram(binwidth = 1, color="darkblue", fill="lightblue",
aes(y=..density..), alpha=1) + geom_density(lwd = 0.8) +
xlab("Ages of card holders") + ylab("Density")
grid.arrange(p3, p4, nrow = 2, top = "Numbers of holding Cards and Age")
- In the data base, holding 2 and 3 cards are quite popular.
- The ages of card holders look distributed evenly by 80 years old.
<- ggplot(data = df1, aes(x = Rating)) +
p5 geom_histogram(binwidth = 50, color="black", fill="yellow",
aes(y=..density..), alpha=1) + geom_density(lwd = 0.8) +
xlab("Rating") + ylab("Density")
<- ggplot(data = df1, aes(x = Education)) +
p6 geom_histogram(binwidth = 1, color="black", fill="yellow",
aes(y=..density..), alpha=1) + geom_density(lwd = 0.8) +
xlab("Education years") + ylab("Density") + scale_y_continuous(labels = comma)
grid.arrange(p5, p6, nrow = 1, top = "Ratings and Education")
- Lower than 500 rates look most of the card holders.
- People who are received for 15 years of education are the majority of cardholders.
<- ggplot(data = df1, aes(x = Income , fill = Income )) +
p9 geom_histogram(aes(y=100*(..count..)/sum(..count..))) + scale_fill_brewer(palette="Dark2") + xlab("Income (thousand USD)") +
ylab("Percentage %")
<- ggplot(data = df1, aes(x = Balance)) +
p10 geom_histogram(binwidth = 5, color="darkgreen", fill="lightgreen",
aes(y=..density..), alpha=1) + geom_density(lwd = 0.8) +
xlab("Balance") + ylab("Density")
grid.arrange(p9, p10, nrow = 1, top = "Income & Balance")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: The following aesthetics were dropped during statistical transformation: fill
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
- Less than 40.000 USD as annual income are the majority of the card holders.
- 0 balance is the majority of the group.
<- ggplot(data = df1, aes(x = Student, fill = Student)) +
p11 geom_bar(aes(y=100*(..count..)/sum(..count..))) +
scale_fill_brewer(palette="Paired") + xlab("Student") + ylab("Percentage (%)") +
coord_flip()
<- ggplot(data = df1, aes(x = Married, fill = Married)) +
p12 geom_bar(aes(y=100*(..count..)/sum(..count..))) + scale_fill_brewer(palette="Paired") +
xlab("Married") + ylab("Percentage (%)") + coord_flip()
grid.arrange(p11, p12, nrow = 1, top = "Student & Married")
- Over 80% of holders are not students.
- More than 60% of holders are married.
4.Act
According to the Correlation analysis, we found Balance is the most impactful factor for Rating of credit scores for the future card-holders.
In addition, We have created a linear regression model that can predict the credit score of card holders based on certain conditions. Therefore, we have almost narrowed down the required attributes of card holders in order to predict credit scores.
But, the model is far from perfect. More can be done to improve the model’s predictive ability. If test data has more observations, the model could be helped to have more variability in the data.