A Bank Manager was tasked to identify the indicators that affect and thus predict the loan amount given to the Bank customers whose application are already approved. The dataset was obtained from https://www.kaggle.com/datasets/vipin20/loan-application-data.
Before we do the modelling, let’s load all the libraries that we will use.
# load all libraries for further usage
library(dplyr) # data transformation
library(GGally) # EDA, correlation check
library(performance) # model comparison
library(MLmetrics) # error value calculation
library(lmtest) # homoscedasticity assumption checking
library(car) # multicollinearity assumption checkingloan_application <- read.csv("datainput/df1_loan.csv", stringsAsFactors = T)
head(loan_application)Since we only want to use the data for those loan applications which have been approved, we would need to drop the rejected application.
approved_loan <- loan_application[!grepl("N", loan_application$Loan_Status),]
head(approved_loan)Let’s check whether the structure of the data is already correct.
str(approved_loan)#> 'data.frame': 345 obs. of 15 variables:
#> $ X : int 0 2 3 4 5 6 8 10 11 12 ...
#> $ Loan_ID : Factor w/ 500 levels "LP001002","LP001003",..: 1 3 4 5 6 7 9 11 12 13 ...
#> $ Gender : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
#> $ Married : Factor w/ 3 levels "","No","Yes": 2 3 3 2 3 3 3 3 3 3 ...
#> $ Dependents : Factor w/ 5 levels "","0","1","2",..: 2 2 2 2 4 2 4 4 4 4 ...
#> $ Education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 2 1 1 2 1 1 1 1 ...
#> $ Self_Employed : Factor w/ 3 levels "","No","Yes": 2 3 2 2 3 2 2 2 1 2 ...
#> $ ApplicantIncome : int 5849 3000 2583 6000 5417 2333 4006 3200 2500 3073 ...
#> $ CoapplicantIncome: num 0 0 2358 0 4196 ...
#> $ LoanAmount : num NA 66 120 141 267 95 168 70 109 200 ...
#> $ Loan_Amount_Term : num 360 360 360 360 360 360 360 360 360 360 ...
#> $ Credit_History : num 1 1 1 1 1 1 1 1 1 1 ...
#> $ Property_Area : Factor w/ 3 levels "Rural","Semiurban",..: 3 3 3 3 3 3 3 3 3 3 ...
#> $ Loan_Status : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
#> $ Total_Income : Factor w/ 457 levels "$10000.0","$10047.0",..: 342 99 268 347 452 162 316 167 199 20 ...
It seems that the Total_Income variable is not
considered as numerical values due to it using dollar sign ($). Let’s
change it by first deleting the column and creating a new one by summing
up ApplicantIncome and CoapplicantIncome
variables so that it has numerical values. We should also remove
unnecessary columns such as X, Loan_ID and
loan_status columns.
approved_loan2 <- approved_loan[-c(1:2,14:15)]
approved_loan2$Total_Income <- rowSums(approved_loan2[,c("ApplicantIncome", "CoapplicantIncome")])
str(approved_loan2)#> 'data.frame': 345 obs. of 12 variables:
#> $ Gender : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
#> $ Married : Factor w/ 3 levels "","No","Yes": 2 3 3 2 3 3 3 3 3 3 ...
#> $ Dependents : Factor w/ 5 levels "","0","1","2",..: 2 2 2 2 4 2 4 4 4 4 ...
#> $ Education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 2 1 1 2 1 1 1 1 ...
#> $ Self_Employed : Factor w/ 3 levels "","No","Yes": 2 3 2 2 3 2 2 2 1 2 ...
#> $ ApplicantIncome : int 5849 3000 2583 6000 5417 2333 4006 3200 2500 3073 ...
#> $ CoapplicantIncome: num 0 0 2358 0 4196 ...
#> $ LoanAmount : num NA 66 120 141 267 95 168 70 109 200 ...
#> $ Loan_Amount_Term : num 360 360 360 360 360 360 360 360 360 360 ...
#> $ Credit_History : num 1 1 1 1 1 1 1 1 1 1 ...
#> $ Property_Area : Factor w/ 3 levels "Rural","Semiurban",..: 3 3 3 3 3 3 3 3 3 3 ...
#> $ Total_Income : num 5849 3000 4941 6000 9613 ...
To simplify the dataset, let’s also drop ApplicantIncome
and CoapplicantIncome as this is already represented by
Total_Income column as a proxy.
approved_loan3 <- approved_loan2[-c(6:7)]
str(approved_loan3)#> 'data.frame': 345 obs. of 10 variables:
#> $ Gender : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
#> $ Married : Factor w/ 3 levels "","No","Yes": 2 3 3 2 3 3 3 3 3 3 ...
#> $ Dependents : Factor w/ 5 levels "","0","1","2",..: 2 2 2 2 4 2 4 4 4 4 ...
#> $ Education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 2 1 1 2 1 1 1 1 ...
#> $ Self_Employed : Factor w/ 3 levels "","No","Yes": 2 3 2 2 3 2 2 2 1 2 ...
#> $ LoanAmount : num NA 66 120 141 267 95 168 70 109 200 ...
#> $ Loan_Amount_Term: num 360 360 360 360 360 360 360 360 360 360 ...
#> $ Credit_History : num 1 1 1 1 1 1 1 1 1 1 ...
#> $ Property_Area : Factor w/ 3 levels "Rural","Semiurban",..: 3 3 3 3 3 3 3 3 3 3 ...
#> $ Total_Income : num 5849 3000 4941 6000 9613 ...
Based on the information that we got from the dataset, the following are the variables description:
Gender: Gender of applicant male or femaleMarried: Married Status! Yes or noDependents: Dependents of applicantEducation: Education, Graduate or Not GraduateSelf_Employed: Self_Employed! Yes or NoLoanAmount: Loan Amount for loanLoan_Amount_Term: Loan Amount Term (the length of time
it will take for a loan to be completely paid off when the borrower is
making regular payments)Credit_History: Credit HistoryProperty_Area: Property AreaTotal_Income: Total Income in a household of
ApplicantNext, we need to check whether our dataset has any missing value or not.
anyNA(approved_loan3)#> [1] TRUE
colSums(is.na(approved_loan3))#> Gender Married Dependents Education
#> 0 0 0 0
#> Self_Employed LoanAmount Loan_Amount_Term Credit_History
#> 0 10 8 32
#> Property_Area Total_Income
#> 0 0
Since the data has several missing values, let’s exclude those data that have missing values.
approved_loan_clean <- na.omit(approved_loan3)
anyNA(approved_loan_clean)#> [1] FALSE
colSums(is.na(approved_loan_clean))#> Gender Married Dependents Education
#> 0 0 0 0
#> Self_Employed LoanAmount Loan_Amount_Term Credit_History
#> 0 0 0 0
#> Property_Area Total_Income
#> 0 0
GREAT! Now, we can start with our analysis.
Next, we will need to perform EDA to see whether there are any patterns of correlation between variables.
First, we can start with a correlation checking.
# correlation checking
ggcorr(approved_loan_clean, label = T, hjust = 1, layout.exp = 3)From the above data, TotalIncome seems to have a strong
positive correlation with LoanAmount.
Next, let’s see whether there are any outliers in the dataset by
using summary and boxplot parameters.
#Checking the summary of the data
summary(approved_loan_clean)#> Gender Married Dependents Education Self_Employed
#> : 5 : 2 : 6 Graduate :242 : 17
#> Female: 48 No : 98 0 :169 Not Graduate: 53 No :243
#> Male :242 Yes:195 1 : 42 Yes: 35
#> 2 : 54
#> 3+: 24
#>
#> LoanAmount Loan_Amount_Term Credit_History Property_Area
#> Min. : 17.0 Min. : 60.0 Min. :0.0000 Rural : 73
#> 1st Qu.:100.0 1st Qu.:360.0 1st Qu.:1.0000 Semiurban:133
#> Median :125.0 Median :360.0 Median :1.0000 Urban : 89
#> Mean :139.4 Mean :339.9 Mean :0.9797
#> 3rd Qu.:158.0 3rd Qu.:360.0 3rd Qu.:1.0000
#> Max. :700.0 Max. :480.0 Max. :1.0000
#> Total_Income
#> Min. : 1963
#> 1st Qu.: 4190
#> Median : 5300
#> Mean : 6975
#> 3rd Qu.: 7404
#> Max. :63337
From the above summary, we note that Total_Income and
LoanAmount variables have outliers. Let’s confirm it by
using the boxplot parameter.
# outlier checking
boxplot(approved_loan_clean)It can be seen from the above boxplot, that the variable of
Total_Income has the most outliers.
Let’s confirm it by using another boxplot and histogram graphic.
boxplot(approved_loan_clean$Total_Income)hist(approved_loan_clean$Total_Income, breaks = 50)
To make sure that our data is not biased from the outliers, let’s clean
this up.
# filter
appdata_no_outlier <- approved_loan_clean[approved_loan_clean$Total_Income < 9500,]
summary(appdata_no_outlier)#> Gender Married Dependents Education Self_Employed
#> : 1 : 2 : 6 Graduate :195 : 15
#> Female: 43 No : 86 0 :147 Not Graduate: 51 No :205
#> Male :202 Yes:158 1 : 33 Yes: 26
#> 2 : 42
#> 3+: 18
#>
#> LoanAmount Loan_Amount_Term Credit_History Property_Area
#> Min. : 17.0 Min. : 60.0 Min. :0.0000 Rural : 60
#> 1st Qu.: 98.0 1st Qu.:360.0 1st Qu.:1.0000 Semiurban:113
#> Median :120.0 Median :360.0 Median :1.0000 Urban : 73
#> Mean :122.0 Mean :341.6 Mean :0.9837
#> 3rd Qu.:140.8 3rd Qu.:360.0 3rd Qu.:1.0000
#> Max. :280.0 Max. :480.0 Max. :1.0000
#> Total_Income
#> Min. :1963
#> 1st Qu.:3881
#> Median :4915
#> Mean :5126
#> 3rd Qu.:6000
#> Max. :9328
Just to be safe, let’s see again whether we have really removed the outliers on our dataset.
boxplot(appdata_no_outlier$Total_Income)hist(appdata_no_outlier$Total_Income, breaks = 50)Based on the boxplot and histogram above, it’s safe to say that our
dataset of appdata_no_outlier is considerably free from
outliers.
After done with the EDA, let’s find the linear regression model that
best fit in predicting the loan amount. We will try 2 datasets, one with
outlier and the other without outliers. Each dataset will have three
models, - linear regression model with Total_Income as the
only predictor - linear regression model with all predictors - linear
regression model produced from the step-wise regression process.In our
case, we will be using the backward step.
Total_Income as the only predictormodel_totalincome <- lm(LoanAmount ~ Total_Income, data = approved_loan_clean)
model_totalincome_nooutlier <- lm(LoanAmount ~ Total_Income, data = appdata_no_outlier)model_all <- lm(LoanAmount ~ ., data = approved_loan_clean)
model_all_nooutlier <- lm(LoanAmount ~ ., data = appdata_no_outlier)model_step <- step(object = model_all, direction = "backward", trace = FALSE)
model_step_nooutlier <- step(object = model_all_nooutlier, direction = "backward", trace = FALSE)Based on the three models above, we will compare their performance,
particularly looking at the AIC, R-squared,
and RMSE scores.
# model performance comparison
comparison <- compare_performance(model_totalincome, model_totalincome_nooutlier, model_all, model_all_nooutlier, model_step, model_step_nooutlier)
as.data.frame(comparison)Kindly note the following: - the smaller the AIC, the lower the observation values that are not captured, hence the better the model. - for predictors > 1, better use adjusted R-squared as it is more representative. This is because the more predictors used, the Multiple R-squared will increase, regardless of whether the predictor is significant or not. Adjusted R-square will only increase if the predictors can minimize the error production of the model. Hence, the bigger the Adjusted R-squared, the better. - Root Mean Square or RMSE is a standard and popular way to measure the error of a model in predicting quantitative data showing how spread out the data is around the best-fit line. If the RMSE value is smaller, it means that the predicted value is close to the observed or observed value, and vice versa.
From the above comparison, we note that - AIC:
model_step_nooutlier has the lowest AIC score -
R2_adjusted: model_all_nooutlier has the highest
R2_adjusted score - RMSE: model_all_nooutlier has the
lowest RMSE score
This shows that all data without outliers fare better than the data with outliers.
Since model_all_nooutlier and
model_step_nooutlier have similar scores, let’s see what
sets them apart by doing model interpretation.
First, let’s see the differences of the model by looking at their summary:
summary(model_all_nooutlier)#>
#> Call:
#> lm(formula = LoanAmount ~ ., data = appdata_no_outlier)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -94.898 -13.954 1.226 15.987 151.869
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 76.928879 42.951475 1.791 0.074603 .
#> GenderFemale -26.901162 30.883850 -0.871 0.384643
#> GenderMale -35.645748 30.382171 -1.173 0.241916
#> MarriedNo -35.929035 25.980570 -1.383 0.168037
#> MarriedYes -31.778713 25.909939 -1.227 0.221268
#> Dependents0 -9.897133 15.020940 -0.659 0.510629
#> Dependents1 0.130389 15.762678 0.008 0.993407
#> Dependents2 0.098077 15.660335 0.006 0.995009
#> Dependents3+ 3.180343 16.518995 0.193 0.847501
#> EducationNot Graduate -5.115139 4.822618 -1.061 0.289964
#> Self_EmployedNo 2.888578 8.034176 0.360 0.719525
#> Self_EmployedYes -3.449699 9.849817 -0.350 0.726488
#> Loan_Amount_Term 0.113893 0.031456 3.621 0.000362 ***
#> Credit_History -14.884879 15.192641 -0.980 0.328247
#> Property_AreaSemiurban 1.877377 4.878094 0.385 0.700699
#> Property_AreaUrban -10.118956 5.454082 -1.855 0.064839 .
#> Total_Income 0.018439 0.001251 14.736 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 29.49 on 229 degrees of freedom
#> Multiple R-squared: 0.5546, Adjusted R-squared: 0.5235
#> F-statistic: 17.82 on 16 and 229 DF, p-value: < 2.2e-16
summary(model_step_nooutlier)#>
#> Call:
#> lm(formula = LoanAmount ~ Married + Loan_Amount_Term + Property_Area +
#> Total_Income, data = appdata_no_outlier)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -98.422 -15.293 0.719 15.390 170.751
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 26.697870 26.313999 1.015 0.311328
#> MarriedNo -41.401075 21.386349 -1.936 0.054064 .
#> MarriedYes -36.116161 21.287282 -1.697 0.091073 .
#> Loan_Amount_Term 0.119002 0.031202 3.814 0.000174 ***
#> Property_AreaSemiurban 3.132239 4.771095 0.657 0.512132
#> Property_AreaUrban -8.262080 5.304775 -1.557 0.120680
#> Total_Income 0.018200 0.001199 15.184 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 29.63 on 239 degrees of freedom
#> Multiple R-squared: 0.5309, Adjusted R-squared: 0.5192
#> F-statistic: 45.09 on 6 and 239 DF, p-value: < 2.2e-16
model_step_nooutlier only uses four variables, namely
Married, Loan_Amount_Term,
Property_Area, and Total_Income.Total_Amount_Term and
Total_Income, are predictors that significantly affect the
Loan Amount linearly in both models.model_all_nooutlier can explain
the variation in Loan Amount around 52,35%, slightly better than
model_step_outlier with adjusted R-square score of 51.9%.
This means that around 48% variation in Loan Amount can be explained by
other factors outside the model.model_step_nooutlier) or
0.114 points (model_all_outlier) given that the other
predictors’ value are fixed.Let’s do assumption checking on our selected model
model_step_nooutlier!
Linearity hypothesis test:
cor.test(appdata_no_outlier$LoanAmount, appdata_no_outlier$Total_Income)$p.value#> [1] 2.684471e-35
cor.test(appdata_no_outlier$LoanAmount, appdata_no_outlier$Loan_Amount_Term)$p.value#> [1] 0.009353887
Conclusion: All predictors have p-value < 0.05, therefore reject H0 (correlation is significant). Assumption is fulfilled.
Shapiro-Wilk hypothesis test:
shapiro.test(model_all_nooutlier$residuals)#>
#> Shapiro-Wilk normality test
#>
#> data: model_all_nooutlier$residuals
#> W = 0.95332, p-value = 4.054e-07
shapiro.test(model_step_nooutlier$residuals)#>
#> Shapiro-Wilk normality test
#>
#> data: model_step_nooutlier$residuals
#> W = 0.94001, p-value = 1.735e-08
Conclusion: The p-value of the two models is < 0.05. therefore reject H0 (error no normal distribution). Assumption is fulfilled.
Breusch-Pagan hypothesis test:
bptest(model_all_nooutlier)#>
#> studentized Breusch-Pagan test
#>
#> data: model_all_nooutlier
#> BP = 23.547, df = 16, p-value = 0.09988
bptest(model_step_nooutlier)#>
#> studentized Breusch-Pagan test
#>
#> data: model_step_nooutlier
#> BP = 9.2058, df = 6, p-value = 0.1623
Conclusion: The p-value of both models is > 0.05.
therefore fail to reject H0 (error constantly spread out or
homoscedascticity). However model_all_nooutlier is better
as the figure is closer to p-value of 0.05.
plot(appdata_no_outlier$Total_Income, model_step_nooutlier$residuals)
abline(h = 0, col = "red")plot(appdata_no_outlier$Total_Income, model_all_nooutlier$residuals)
abline(h = 0, col = "red")VIF test:
vif(model_all_nooutlier)#> GVIF Df GVIF^(1/(2*Df))
#> Gender 1.300536 2 1.067900
#> Married 2.018874 2 1.192003
#> Dependents 1.934231 4 1.085959
#> Education 1.080959 1 1.039692
#> Self_Employed 1.160563 2 1.037928
#> Loan_Amount_Term 1.093106 1 1.045517
#> Credit_History 1.044193 1 1.021858
#> Property_Area 1.177679 2 1.041734
#> Total_Income 1.133656 1 1.064733
vif(model_step_nooutlier)#> GVIF Df GVIF^(1/(2*Df))
#> Married 1.039137 2 1.009644
#> Loan_Amount_Term 1.065876 1 1.032413
#> Property_Area 1.073699 2 1.017936
#> Total_Income 1.030879 1 1.015322
Conclusion: All VIF < 10, therefore there is no multicollinearity in both models. Assumption is fulfilled.
Based on the above assumption tests, only on Homoscedasticity test where the assumption is not fulfilled. Meanwhile for the remaining tests, we have fulfilled all assumptions for both models
For the modelling we will choose the model_all_nooutlier
to predict Loan Amount for approved application. This is because the
model has better results in terms of Adjusted R-square and RMSE, as well
as better result for Homoscedascticity test compared to
model_step_nooutlier.
Based on this model, Total Income and Loan Amount Term are the predictors that statistically significant and correlate positively to Loan Amount. Therefore, the Bank customer who is interested in applying for Loan, should pay attention to these two predictors.