Credit scores with causes and consequences

Author

Satoshi Matsumoto

Introduction

This analysis report is the additional report in basis of the assignment from the Business Analysis course by ESSEC Business School. The analysis is implemented according to the phases, Ask, Prepare, Process, Share, and Action.

1.Ask

1.1. Data

Data of credit scoring data is provided by Essec Business School for the purpose of the causes and consequences analysis. Through out this analysis, we will check out the correlations in the data set of credit score and find out causes and consequences with visualizations.

1.2. Questions

What are the three strong relations of variables in the dataset?

2.Prepare

2.1. Set up necessary packages for the analysis.

library(ggplot2)
library(dplyr)
library(tidyverse)
library(statsr)
library(skimr)
library(kableExtra)
library(grid)
library(gridExtra)
library(scales)
library(GGally)
library(corrplot)
library(PerformanceAnalytics)

2.2.Load the data

#To clean up the memory of your current R session run the following line
rm(list=ls(all=TRUE))
#Set up working directory
setwd("C:/Users/satos/Documents/project/Quarto/04.business-analytics/causes-consequences")

#Load credit socoring data with a csv file. 
df1 <- read_csv("DATA_3.01_CREDIT.csv")

2.3.Statistic summary

str(df1) # The str() function shows the structure of your dataset and details the type of variables that it contains

spc_tbl_ [300 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Income   : num [1:300] 14.9 106 104.6 148.9 55.9 ...
 $ Rating   : num [1:300] 283 483 514 681 357 569 259 512 266 491 ...
 $ Cards    : num [1:300] 2 3 4 3 2 4 2 2 5 3 ...
 $ Age      : num [1:300] 34 82 71 36 68 77 37 87 66 41 ...
 $ Education: num [1:300] 11 15 11 11 16 10 12 9 13 19 ...
 $ Gender   : chr [1:300] "Male" "Female" "Male" "Female" ...
 $ Student  : chr [1:300] "No" "Yes" "No" "No" ...
 $ Married  : chr [1:300] "Yes" "Yes" "No" "No" ...
 $ Ethnicity: chr [1:300] "Caucasian" "Asian" "Asian" "Asian" ...
 $ Balance  : num [1:300] 333 903 580 964 331 ...
 - attr(*, "spec")=
  .. cols(
  ..   Income = col_double(),
  ..   Rating = col_double(),
  ..   Cards = col_double(),
  ..   Age = col_double(),
  ..   Education = col_double(),
  ..   Gender = col_character(),
  ..   Student = col_character(),
  ..   Married = col_character(),
  ..   Ethnicity = col_character(),
  ..   Balance = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

summary(df1) # The summary() function provides for each variable in your dataset the minimum, mean, maximum and quartiles

     Income           Rating          Cards            Age       
 Min.   : 10.35   Min.   : 93.0   Min.   :1.000   Min.   :24.00  
 1st Qu.: 21.03   1st Qu.:235.0   1st Qu.:2.000   1st Qu.:41.00  
 Median : 33.12   Median :339.0   Median :3.000   Median :55.00  
 Mean   : 44.05   Mean   :348.1   Mean   :3.027   Mean   :54.98  
 3rd Qu.: 55.98   3rd Qu.:433.0   3rd Qu.:4.000   3rd Qu.:69.00  
 Max.   :186.63   Max.   :949.0   Max.   :8.000   Max.   :91.00  
   Education        Gender            Student            Married         
 Min.   : 5.00   Length:300         Length:300         Length:300        
 1st Qu.:11.00   Class :character   Class :character   Class :character  
 Median :14.00   Mode  :character   Mode  :character   Mode  :character  
 Mean   :13.39                                                           
 3rd Qu.:16.00                                                           
 Max.   :20.00                                                           
  Ethnicity            Balance       
 Length:300         Min.   :   0.00  
 Class :character   1st Qu.:  15.75  
 Mode  :character   Median : 433.50  
                    Mean   : 502.69  
                    3rd Qu.: 857.75  
                    Max.   :1809.00

The credit score dataset is made up of 300 observations and 10 variables.

2.4. Check NA values.

table(is.na(df1))


FALSE 
 3000

Zero NA value is confirmed.

2.5.Exploratory data analysis

Let’s see the correlation between all the numerical variables

chart.Correlation(df1[c(1:5,10)], histogram = TRUE, method = "pearson")

Warning in par(usr): argument 1 does not name a graphical parameter

Warning in par(usr): argument 1 does not name a graphical parameter

Warning in par(usr): argument 1 does not name a graphical parameter

Warning in par(usr): argument 1 does not name a graphical parameter

Warning in par(usr): argument 1 does not name a graphical parameter

Warning in par(usr): argument 1 does not name a graphical parameter

Warning in par(usr): argument 1 does not name a graphical parameter

Warning in par(usr): argument 1 does not name a graphical parameter

Warning in par(usr): argument 1 does not name a graphical parameter

Warning in par(usr): argument 1 does not name a graphical parameter

Warning in par(usr): argument 1 does not name a graphical parameter

Warning in par(usr): argument 1 does not name a graphical parameter

Warning in par(usr): argument 1 does not name a graphical parameter

Warning in par(usr): argument 1 does not name a graphical parameter

Warning in par(usr): argument 1 does not name a graphical parameter

Accordingly, “Rating” and “Balance” has the strongest relations. Then, “Income” and “Rating” is at the second. Lastly, “Income” and “Balance” has the third biggest correlations. In order to see the correlation results more clearly, I am going to plot a correlation matrix plot too. #### 2.5.1 Corre matrix

cor_matrix <- cor(df1[ ,sapply(df1,is.numeric)], use= "complete.obs")

corrplot(cor_matrix, method="shade", shade.col=NA, cl.pos="n", tl.col="black", 
         tl.srt=30, addCoef.col="black")

Finally, let us look closer into just three variables mentioned earlier.

scores <- df1 %>%
  select(Income,Rating, Balance)

lowerFn <- function(data, mapping, method = "lm", ...) {
  p <- ggplot(data = data, mapping = mapping) +
    geom_point(colour = "black") +
    geom_smooth(method = method, color = "red", ...)
  p
} #A function to help add a regression line to the scatter plots to the ggpairs function.


ggpairs(scores, lower = list(continuous = wrap(lowerFn, method = "lm")),
  diag = list(continuous = wrap("barDiag", colour = "white")),
  upper = list(continuous = wrap("cor", size = 8)))

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`geom_smooth()` using formula = 'y ~ x'
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Out of these three variables, we have to choose a suitable response variable and an explanatory variable. The remaining variable has to be left out of the model as two explanatory variables having strong collinearity with each other might complicate the model and violates the conditions for linear regression modeling.

Since Rating has a higher correlation with both of Balance and Income, I have decided to choose it as the explanatory variable. For the response variable, I chose the Balance variable as the response variable. Since the Income variable have lower collinearity with other variables, this will be excluded from the model.

2.5.2 Response variable and categorical variables

bp1 <- ggplot(data = df1, aes(y = Balance, fill = Ethnicity)) + 
  geom_boxplot() + scale_fill_brewer(palette="Paired") + 
  labs(title = "Balance by Ethnicity", y = "Balance", 
       x = "Ethnicity") + theme(plot.title = element_text(hjust = 0.5))

bp2 <- ggplot(data = df1, aes(y = Balance, fill = Gender)) + 
  geom_boxplot() + scale_fill_brewer(palette="Dark2") + 
  labs(title = "Balance by Gender", y = "Balance", 
       x = "Gender") + theme(plot.title = element_text(hjust = 0.5))

bp3 <- ggplot(data = df1, aes(y = Balance, fill = Student)) + 
  geom_boxplot() + scale_fill_brewer(palette="Dark2") + 
  labs(title = "Balance by Student", y = "Balance", 
       x = "Student") + theme(plot.title = element_text(hjust = 0.5))

bp4 <- ggplot(data = df1, aes(y = Balance, fill = Married)) + 
  geom_boxplot() + scale_fill_brewer(palette="Dark2") + 
  labs(title = "Balance by Married", y = "Balance", 
       x = "Married") + theme(plot.title = element_text(hjust = 0.5))

grid.arrange(bp1, bp2, bp3, bp4, ncol = 1, top = "Balance and Categorical Variables")

In the first plot, the all three ethnicity have the similar average balance.

In the second plot, the differences of balance by gender are not confirmed.

In the third plot, the average balance by students higher than non-student.

In the fourth plot, the differences of balace by married are not confirmed.

bp5 <- ggplot(data = df1, aes(y = Income, fill = Ethnicity)) + 
  geom_boxplot() + scale_fill_brewer(palette="Paired") + 
  labs(title = "Income by Ethnicity", y = "Income", 
       x = "Ethnicity") + theme(plot.title = element_text(hjust = 0.5))

bp6 <- ggplot(data = df1, aes(y = Income, fill = Gender)) + 
  geom_boxplot() + scale_fill_brewer(palette="Dark2") + 
  labs(title = "Income by Gender", y = "Income", 
       x = "Gender") + theme(plot.title = element_text(hjust = 0.5))

bp7 <- ggplot(data = df1, aes(y = Income, fill = Student)) + 
  geom_boxplot() + scale_fill_brewer(palette="Dark2") + 
  labs(title = "Income by Student", y = "Income", 
       x = "Student") + theme(plot.title = element_text(hjust = 0.5))

bp8 <- ggplot(data = df1, aes(y = Income, fill = Married)) + 
  geom_boxplot() + scale_fill_brewer(palette="Dark2") + 
  labs(title = "Income by Married", y = "Income", 
       x = "Married") + theme(plot.title = element_text(hjust = 0.5))

grid.arrange(bp5, bp6, bp7, bp8, ncol = 1, top = "Income and Categorical Variables")

For the reference, we will check rating and income as response variables.

bp9 <- ggplot(data = df1, aes(y = Rating, fill = Ethnicity)) + 
  geom_boxplot() + scale_fill_brewer(palette="Paired") + 
  labs(title = "Rating by Ethnicity", y = "Rating", 
       x = "Ethnicity") + theme(plot.title = element_text(hjust = 0.5))

bp10 <- ggplot(data = df1, aes(y = Rating, fill = Gender)) + 
  geom_boxplot() + scale_fill_brewer(palette="Dark2") + 
  labs(title = "Rating by Gender", y = "Rating", 
       x = "Gender") + theme(plot.title = element_text(hjust = 0.5))

bp11 <- ggplot(data = df1, aes(y = Rating, fill = Student)) + 
  geom_boxplot() + scale_fill_brewer(palette="Dark2") + 
  labs(title = "Rating by Student", y = "Rating", 
       x = "Student") + theme(plot.title = element_text(hjust = 0.5))

bp12 <- ggplot(data = df1, aes(y = Rating, fill = Married)) + 
  geom_boxplot() + scale_fill_brewer(palette="Dark2") + 
  labs(title = "Rating by Married", y = "Rating", 
       x = "Married") + theme(plot.title = element_text(hjust = 0.5))

grid.arrange(bp9, bp10, bp11, bp12, ncol = 1, top = "Rating and Categorical Variables")

Accordingly, Rating and Income does not influence largely to other categorical values.

2.5.3. Exploratory data analysis (EDA)

p1 <- ggplot(data = df1, aes(x = Ethnicity, fill = Ethnicity)) + 
  geom_bar(aes(y=100*(..count..)/sum(..count..))) + 
  scale_fill_brewer(palette="Paired") + xlab("Ethnicity") + ylab("Percentage (%)") + 
  coord_flip() 

p2 <- ggplot(data = df1, aes(x = Gender, fill = Gender)) + 
  geom_bar(aes(y=100*(..count..)/sum(..count..))) + scale_fill_brewer(palette="Paired") + 
  xlab("Gender") + ylab("Percentage (%)") + coord_flip()

grid.arrange(p1, p2, nrow = 1, top = "Ethnicity & Gender")

Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(count)` instead.

p3 <- ggplot(data = df1, aes(x = Cards)) + 
  geom_histogram(binwidth = 1, color="black", fill="grey", 
                 aes(y=..density..), alpha=1) + geom_density(lwd = 0.8) + 
  xlab("Number of holding cards") + ylab("Density")

p4 <- ggplot(data = df1, aes(x = Age)) + 
  geom_histogram(binwidth = 1, color="darkblue", fill="lightblue", 
                 aes(y=..density..), alpha=1) + geom_density(lwd = 0.8) + 
  xlab("Ages of card holders") + ylab("Density")

grid.arrange(p3, p4, nrow = 2, top = "Numbers of holding Cards and Age")

In the data base, holding 2 and 3 cards are quite popular.
The ages of card holders look distributed evenly by 80 years old.

p5 <- ggplot(data = df1, aes(x = Rating)) + 
  geom_histogram(binwidth = 50, color="black", fill="yellow", 
                 aes(y=..density..), alpha=1) + geom_density(lwd = 0.8) + 
  xlab("Rating") + ylab("Density")


p6 <- ggplot(data = df1, aes(x = Education)) + 
  geom_histogram(binwidth = 1, color="black", fill="yellow", 
                 aes(y=..density..), alpha=1) + geom_density(lwd = 0.8) + 
  xlab("Education years") + ylab("Density") + scale_y_continuous(labels = comma)

grid.arrange(p5, p6, nrow = 1, top = "Ratings and Education")

Lower than 500 rates look most of the card holders.
People who are received for 15 years of education are the majority of cardholders.

p9 <- ggplot(data = df1, aes(x = Income , fill = Income )) + 
  geom_histogram(aes(y=100*(..count..)/sum(..count..))) + scale_fill_brewer(palette="Dark2") + xlab("Income (thousand USD)") + 
  ylab("Percentage %")


p10 <- ggplot(data = df1, aes(x = Balance)) + 
  geom_histogram(binwidth = 5, color="darkgreen", fill="lightgreen", 
                 aes(y=..density..), alpha=1) + geom_density(lwd = 0.8) + 
  xlab("Balance") + ylab("Density")

grid.arrange(p9, p10, nrow = 1, top = "Income & Balance")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Warning: The following aesthetics were dropped during statistical transformation: fill
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

Less than 40.000 USD as annual income are the majority of the card holders.
0 balance is the majority of the group.

p11 <- ggplot(data = df1, aes(x = Student, fill = Student)) + 
  geom_bar(aes(y=100*(..count..)/sum(..count..))) + 
  scale_fill_brewer(palette="Paired") + xlab("Student") + ylab("Percentage (%)") + 
  coord_flip() 

p12 <- ggplot(data = df1, aes(x = Married, fill = Married)) + 
  geom_bar(aes(y=100*(..count..)/sum(..count..))) + scale_fill_brewer(palette="Paired") + 
  xlab("Married") + ylab("Percentage (%)") + coord_flip()

grid.arrange(p11, p12, nrow = 1, top = "Student & Married")

Over 80% of holders are not students.
More than 60% of holders are married.

3.Process & Share

3.1.Modeling

#Remove the income variable
df3<- df1 %>%
  select(-Income)
str(df3)

tibble [300 × 9] (S3: tbl_df/tbl/data.frame)
 $ Rating   : num [1:300] 283 483 514 681 357 569 259 512 266 491 ...
 $ Cards    : num [1:300] 2 3 4 3 2 4 2 2 5 3 ...
 $ Age      : num [1:300] 34 82 71 36 68 77 37 87 66 41 ...
 $ Education: num [1:300] 11 15 11 11 16 10 12 9 13 19 ...
 $ Gender   : chr [1:300] "Male" "Female" "Male" "Female" ...
 $ Student  : chr [1:300] "No" "Yes" "No" "No" ...
 $ Married  : chr [1:300] "Yes" "Yes" "No" "No" ...
 $ Ethnicity: chr [1:300] "Caucasian" "Asian" "Asian" "Asian" ...
 $ Balance  : num [1:300] 333 903 580 964 331 ...

We have 8 explanatory variables and a response variable. Now, we will create a linear regression model with all the variables and assess the results.

full_model <- lm(Rating~., data = df3) #~. represents use the rest of the explanatory variables.
summary(full_model)


Call:
lm(formula = Rating ~ ., data = df3)

Residuals:
    Min      1Q  Median      3Q     Max 
-145.06  -42.98  -10.81   26.38  267.93 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         1.788e+02  2.635e+01   6.786 6.49e-11 ***
Cards              -2.690e+00  2.961e+00  -0.908  0.36448    
Age                 7.634e-01  2.324e-01   3.285  0.00115 ** 
Education          -4.511e-01  1.300e+00  -0.347  0.72875    
GenderMale          3.124e+00  8.002e+00   0.390  0.69651    
StudentYes         -1.147e+02  1.357e+01  -8.449 1.44e-15 ***
MarriedYes          7.726e+00  8.243e+00   0.937  0.34942    
EthnicityAsian     -5.921e+00  1.100e+01  -0.538  0.59066    
EthnicityCaucasian -4.840e+00  9.696e+00  -0.499  0.61802    
Balance             3.015e-01  9.023e-03  33.417  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 68.38 on 290 degrees of freedom
Multiple R-squared:  0.8008,    Adjusted R-squared:  0.7946 
F-statistic: 129.5 on 9 and 290 DF,  p-value: < 2.2e-16

At present, we have a R-squared value of 79.46%, meaning around 80% of the variability of the response variables is explained by our model. The adjusted R-squared is a modified version of R-squared that adjusts for predictors that are not significant in a regression model. In this case, we achieved an adjusted R-squared of 79.46%, which is not too less from the R-squared values.

3.2. Significant variables by ANOVA function

anova(full_model)

Analysis of Variance Table

Response: Rating
           Df  Sum Sq Mean Sq   F value    Pr(>F)    
Cards       1   62533   62533   13.3756 0.0003027 ***
Age         1    9415    9415    2.0138 0.1569475    
Education   1   61751   61751   13.2083 0.0003296 ***
Gender      1       8       8    0.0017 0.9671049    
Student     1    8557    8557    1.8302 0.1771551    
Married     1   39947   39947    8.5445 0.0037387 ** 
Ethnicity   2   47115   23557    5.0388 0.0070605 ** 
Balance     1 5220783 5220783 1116.7065 < 2.2e-16 ***
Residuals 290 1355797    4675                        
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We found some variables are not significant.

Next, we will use the step function use “both” direction for both backward and front step-wise elimination to remove model features with low predictive value.

lr_model <- step(full_model, direction = "both", scope = full_model)

Start:  AIC=2544.84
Rating ~ Cards + Age + Education + Gender + Student + Married + 
    Ethnicity + Balance

            Df Sum of Sq     RSS    AIC
- Ethnicity  2      1607 1357404 2541.2
- Education  1       563 1356360 2543.0
- Gender     1       713 1356510 2543.0
- Cards      1      3857 1359654 2543.7
- Married    1      4107 1359904 2543.7
<none>                   1355797 2544.8
- Age        1     50438 1406235 2553.8
- Student    1    333763 1689560 2608.9
- Balance    1   5220783 6576580 3016.6

Step:  AIC=2541.19
Rating ~ Cards + Age + Education + Gender + Student + Married + 
    Balance

            Df Sum of Sq     RSS    AIC
- Education  1       491 1357896 2539.3
- Gender     1       668 1358072 2539.3
- Married    1      3648 1361052 2540.0
- Cards      1      3953 1361357 2540.1
<none>                   1357404 2541.2
- Age        1     51566 1408971 2550.4
- Student    1    338262 1695666 2605.9
- Balance    1   5266290 6623695 3014.7

Step:  AIC=2539.3
Rating ~ Cards + Age + Gender + Student + Married + Balance

          Df Sum of Sq     RSS    AIC
- Gender   1       663 1358559 2537.4
- Married  1      3590 1361486 2538.1
- Cards    1      4044 1361940 2538.2
<none>                 1357896 2539.3
- Age      1     52176 1410071 2548.6
- Student  1    343239 1701135 2604.9
- Balance  1   5331769 6689664 3015.7

Step:  AIC=2537.45
Rating ~ Cards + Age + Student + Married + Balance

          Df Sum of Sq     RSS    AIC
- Married  1      3577 1362135 2536.2
- Cards    1      3932 1362490 2536.3
<none>                 1358559 2537.4
- Age      1     51715 1410273 2546.7
- Student  1    348229 1706788 2603.9
- Balance  1   5331133 6689692 3013.7

Step:  AIC=2536.23
Rating ~ Cards + Age + Student + Balance

          Df Sum of Sq     RSS    AIC
- Cards    1      4022 1366157 2535.1
<none>                 1362135 2536.2
- Age      1     49624 1411759 2545.0
- Student  1    358157 1720292 2604.3
- Balance  1   5365843 6727978 3013.4

Step:  AIC=2535.12
Rating ~ Age + Student + Balance

          Df Sum of Sq     RSS    AIC
<none>                 1366157 2535.1
- Age      1     48090 1414247 2543.5
- Student  1    355910 1722067 2602.6
- Balance  1   5421441 6787598 3014.1

We have achieved a parsimonious model with just six predictors from 8 predictors. Initially, the AIC value was 2544.84 with 8 predictors. This has been reduced to a value of 2535.12 after removing 5 non significant variables with low predictive power.

The predictors chosen by the model are Rating, Age, Student, and Balance.

summary(lr_model)


Call:
lm(formula = Rating ~ Age + Student + Balance, data = df3)

Residuals:
     Min       1Q   Median       3Q      Max 
-144.485  -43.828   -9.934   26.566  271.190 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.684e+02  1.403e+01  12.002  < 2e-16 ***
Age          7.377e-01  2.285e-01   3.228  0.00139 ** 
StudentYes  -1.166e+02  1.328e+01  -8.781  < 2e-16 ***
Balance      3.015e-01  8.798e-03  34.273  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 67.94 on 296 degrees of freedom
Multiple R-squared:  0.7993,    Adjusted R-squared:  0.7972 
F-statistic: 392.9 on 3 and 296 DF,  p-value: < 2.2e-16

After the elimination step, we haven’t lost much in term both R-squared and adjust R-squared value. The current R-squared value is 79.93%, just slightly lower than 80.08% previously. Meanwhile, the adjust R-squared has slightly higher at 79.46%.

Finally, the p-value based on the F-statistics test is below 0.05, meaning all of the current predictors are statistically significant.

3.3 Model Diagnostics

Now, we will check whether our model follows the four assumptions of linear regression.

Linear relationship: There exists a linear relationship between the independent variable, x, and the dependent variable, y.
Independence: The residuals are independent. In particular, there is no correlation between consecutive residuals in time series data.
Homoscedasticity: The residuals have constant variance at every level of x.
Normality: The residuals of the model are normally distributed.

We have already checked for linearity in the previous section.

We have also taken care for multicollinearity in the previous section by dropping the critics score from our model.

Let us now visualize our model for checking the remaining conditions.

g1 <- ggplot(data = lr_model, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals") + theme(plot.title = element_text(hjust = 0.5)) + 
  labs(title = "a) Variability Condition")

g2 <- ggplot(data = lr_model, aes(x = .resid)) +
  geom_histogram(binwidth = 1, fill='white', color='black', aes(y=..density..)) +
  xlab("Residuals") + geom_density(lwd = 0.8) + theme(plot.title = element_text(hjust = 0.5)) + 
  labs(title = "b) Normality Condition 1")

g3 <- ggplot(data = lr_model, aes(sample = .resid)) +
  stat_qq() + stat_qq_line(col = "red") + theme(plot.title = element_text(hjust = 0.5)) + 
  labs(title = "c) Normality Condition 2")

grid.arrange(g1,g2,g3, nrow = 1)

From above plot a, the residuals is randomly scattered around the zero axis. This signifies that there is a constant variance of residuals and there is no fan shape in residuals plot.

From plot b, we can see a nearly normal distribution of residuals around zero. Similarly from plot c, majority of the points stays on the line, with just slight skewness around the tail. Overall, the normal distribution of residuals condition is met.

3.4 Prediction

3.4.1 Test dataframe

Building a test data frame I am going to create a data frame with randomly created.

test_df <- data.frame(Ethnicity = c("Caucasian", "Asian", "Asian", "African American", 
                                "Caucasian"), 
                      Income = c(40,100,15,80,35), 
                      Age = c(34,50,20,60,30), 
                      Cards = c(1,2,1,2,3), 
                      Education = c(11, 15, 9,16, 12),
                      Married = c("Yes", "Yes", "No","Yes", "Yes"),
                      Student = c("No", "No", "Yes", "No", "No"),
                      Balance = c(333,903, 10, 580, 331)
)

test_df

         Ethnicity Income Age Cards Education Married Student Balance
1        Caucasian     40  34     1        11     Yes      No     333
2            Asian    100  50     2        15     Yes      No     903
3            Asian     15  20     1         9      No     Yes      10
4 African American     80  60     2        16     Yes      No     580
5        Caucasian     35  30     3        12     Yes      No     331

Next, I will predict the rating and the predict the intervals of the rating using my model. Subsequently, I will save the results to a new data frame and add the actual rating and then analyze the result.

predict_df <- predict(lr_model, test_df, interval = "predict")

mydf <- cbind(test_df, round(predict_df))

mydf <- mydf %>%
  mutate(Actual_rating = c(300,480,68,400,293), 
         estimate = ifelse(abs(Actual_rating - fit) <= 5, "Good Prediction", 
                           ifelse(Actual_rating >= lwr & Actual_rating <= upr, 
                                  "Not great but within the range", "Bad prediction")))

3.4.2. Predict scores

mydf %>%
  kbl() %>%
  column_spec(c(1,9,13), bold = T) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), 
                full_width = F, position = "center", font_size = 11)

Ethnicity	Income	Age	Cards	Education	Married	Student	Balance	fit	lwr	upr	Actual_rating	estimate
Caucasian	40	34	1	11	Yes	No	333	294	160	428	300	Not great but within the range
Asian	100	50	2	15	Yes	No	903	478	343	612	480	Good Prediction
Asian	15	20	1	9	No	Yes	10	70	-68	207	68	Good Prediction
African American	80	60	2	16	Yes	No	580	388	254	522	400	Not great but within the range
Caucasian	35	30	3	12	Yes	No	331	290	156	425	293	Good Prediction

Our model succeeded in predicting almost accurately the audience score of half of the movies in the table with an absolute difference of 5 points. Predictions for the rest of the movies were not great but still was within the range of the predicted interval. It is clear that the model improves its prediction accuracy for credit scores.

4.Act

According to the Correlation analysis, we found Balance is the most impactful factor for Rating of credit scores for the future card-holders.

In addition, We have created a linear regression model that can predict the credit score of card holders based on certain conditions. Therefore, we have almost narrowed down the required attributes of card holders in order to predict credit scores.

But, the model is far from perfect. More can be done to improve the model’s predictive ability. If test data has more observations, the model could be helped to have more variability in the data.