Project Description

A Bank Manager was tasked to identify the indicators that affect and thus predict the loan amount given to the Bank customers whose application are already approved. The dataset was obtained from https://www.kaggle.com/datasets/vipin20/loan-application-data.

Before we do the modelling, let’s load all the libraries that we will use.

# load all libraries for further usage

library(dplyr) # data transformation
library(GGally) # EDA, correlation check
library(performance) # model comparison
library(MLmetrics) # error value calculation
library(lmtest) # homoscedasticity assumption checking
library(car) # multicollinearity assumption checking

Reading and Cleaning the Dataset

loan_application <- read.csv("datainput/df1_loan.csv", stringsAsFactors = T)

head(loan_application)

Since we only want to use the data for those loan applications which have been approved, we would need to drop the rejected application.

approved_loan <- loan_application[!grepl("N", loan_application$Loan_Status),]
head(approved_loan)

Let’s check whether the structure of the data is already correct.

str(approved_loan)
#> 'data.frame':    345 obs. of  15 variables:
#>  $ X                : int  0 2 3 4 5 6 8 10 11 12 ...
#>  $ Loan_ID          : Factor w/ 500 levels "LP001002","LP001003",..: 1 3 4 5 6 7 9 11 12 13 ...
#>  $ Gender           : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
#>  $ Married          : Factor w/ 3 levels "","No","Yes": 2 3 3 2 3 3 3 3 3 3 ...
#>  $ Dependents       : Factor w/ 5 levels "","0","1","2",..: 2 2 2 2 4 2 4 4 4 4 ...
#>  $ Education        : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 2 1 1 2 1 1 1 1 ...
#>  $ Self_Employed    : Factor w/ 3 levels "","No","Yes": 2 3 2 2 3 2 2 2 1 2 ...
#>  $ ApplicantIncome  : int  5849 3000 2583 6000 5417 2333 4006 3200 2500 3073 ...
#>  $ CoapplicantIncome: num  0 0 2358 0 4196 ...
#>  $ LoanAmount       : num  NA 66 120 141 267 95 168 70 109 200 ...
#>  $ Loan_Amount_Term : num  360 360 360 360 360 360 360 360 360 360 ...
#>  $ Credit_History   : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ Property_Area    : Factor w/ 3 levels "Rural","Semiurban",..: 3 3 3 3 3 3 3 3 3 3 ...
#>  $ Loan_Status      : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ Total_Income     : Factor w/ 457 levels "$10000.0","$10047.0",..: 342 99 268 347 452 162 316 167 199 20 ...

It seems that the Total_Income variable is not considered as numerical values due to it using dollar sign ($). Let’s change it by first deleting the column and creating a new one by summing up ApplicantIncome and CoapplicantIncome variables so that it has numerical values. We should also remove unnecessary columns such as X, Loan_ID and loan_status columns.

approved_loan2 <- approved_loan[-c(1:2,14:15)]
approved_loan2$Total_Income <- rowSums(approved_loan2[,c("ApplicantIncome", "CoapplicantIncome")])
str(approved_loan2)
#> 'data.frame':    345 obs. of  12 variables:
#>  $ Gender           : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
#>  $ Married          : Factor w/ 3 levels "","No","Yes": 2 3 3 2 3 3 3 3 3 3 ...
#>  $ Dependents       : Factor w/ 5 levels "","0","1","2",..: 2 2 2 2 4 2 4 4 4 4 ...
#>  $ Education        : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 2 1 1 2 1 1 1 1 ...
#>  $ Self_Employed    : Factor w/ 3 levels "","No","Yes": 2 3 2 2 3 2 2 2 1 2 ...
#>  $ ApplicantIncome  : int  5849 3000 2583 6000 5417 2333 4006 3200 2500 3073 ...
#>  $ CoapplicantIncome: num  0 0 2358 0 4196 ...
#>  $ LoanAmount       : num  NA 66 120 141 267 95 168 70 109 200 ...
#>  $ Loan_Amount_Term : num  360 360 360 360 360 360 360 360 360 360 ...
#>  $ Credit_History   : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ Property_Area    : Factor w/ 3 levels "Rural","Semiurban",..: 3 3 3 3 3 3 3 3 3 3 ...
#>  $ Total_Income     : num  5849 3000 4941 6000 9613 ...

To simplify the dataset, let’s also drop ApplicantIncome and CoapplicantIncome as this is already represented by Total_Income column as a proxy.

approved_loan3 <- approved_loan2[-c(6:7)]
str(approved_loan3)
#> 'data.frame':    345 obs. of  10 variables:
#>  $ Gender          : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
#>  $ Married         : Factor w/ 3 levels "","No","Yes": 2 3 3 2 3 3 3 3 3 3 ...
#>  $ Dependents      : Factor w/ 5 levels "","0","1","2",..: 2 2 2 2 4 2 4 4 4 4 ...
#>  $ Education       : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 2 1 1 2 1 1 1 1 ...
#>  $ Self_Employed   : Factor w/ 3 levels "","No","Yes": 2 3 2 2 3 2 2 2 1 2 ...
#>  $ LoanAmount      : num  NA 66 120 141 267 95 168 70 109 200 ...
#>  $ Loan_Amount_Term: num  360 360 360 360 360 360 360 360 360 360 ...
#>  $ Credit_History  : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ Property_Area   : Factor w/ 3 levels "Rural","Semiurban",..: 3 3 3 3 3 3 3 3 3 3 ...
#>  $ Total_Income    : num  5849 3000 4941 6000 9613 ...

Based on the information that we got from the dataset, the following are the variables description:

  • Gender: Gender of applicant male or female
  • Married: Married Status! Yes or no
  • Dependents: Dependents of applicant
  • Education: Education, Graduate or Not Graduate
  • Self_Employed: Self_Employed! Yes or No
  • LoanAmount: Loan Amount for loan
  • Loan_Amount_Term: Loan Amount Term (the length of time it will take for a loan to be completely paid off when the borrower is making regular payments)
  • Credit_History: Credit History
  • Property_Area: Property Area
  • Total_Income: Total Income in a household of Applicant

Next, we need to check whether our dataset has any missing value or not.

anyNA(approved_loan3)
#> [1] TRUE
colSums(is.na(approved_loan3))
#>           Gender          Married       Dependents        Education 
#>                0                0                0                0 
#>    Self_Employed       LoanAmount Loan_Amount_Term   Credit_History 
#>                0               10                8               32 
#>    Property_Area     Total_Income 
#>                0                0

Since the data has several missing values, let’s exclude those data that have missing values.

approved_loan_clean <- na.omit(approved_loan3)
anyNA(approved_loan_clean)
#> [1] FALSE
colSums(is.na(approved_loan_clean))
#>           Gender          Married       Dependents        Education 
#>                0                0                0                0 
#>    Self_Employed       LoanAmount Loan_Amount_Term   Credit_History 
#>                0                0                0                0 
#>    Property_Area     Total_Income 
#>                0                0

GREAT! Now, we can start with our analysis.

Exploratory Data Analysis (EDA)

Next, we will need to perform EDA to see whether there are any patterns of correlation between variables.

First, we can start with a correlation checking.

# correlation checking
ggcorr(approved_loan_clean, label = T, hjust = 1, layout.exp = 3)

From the above data, TotalIncome seems to have a strong positive correlation with LoanAmount.

Next, let’s see whether there are any outliers in the dataset by using summary and boxplot parameters.

#Checking the summary of the data
summary(approved_loan_clean)
#>     Gender    Married   Dependents        Education   Self_Employed
#>        :  5      :  2     :  6     Graduate    :242      : 17      
#>  Female: 48   No : 98   0 :169     Not Graduate: 53   No :243      
#>  Male  :242   Yes:195   1 : 42                        Yes: 35      
#>                         2 : 54                                     
#>                         3+: 24                                     
#>                                                                    
#>    LoanAmount    Loan_Amount_Term Credit_History     Property_Area
#>  Min.   : 17.0   Min.   : 60.0    Min.   :0.0000   Rural    : 73  
#>  1st Qu.:100.0   1st Qu.:360.0    1st Qu.:1.0000   Semiurban:133  
#>  Median :125.0   Median :360.0    Median :1.0000   Urban    : 89  
#>  Mean   :139.4   Mean   :339.9    Mean   :0.9797                  
#>  3rd Qu.:158.0   3rd Qu.:360.0    3rd Qu.:1.0000                  
#>  Max.   :700.0   Max.   :480.0    Max.   :1.0000                  
#>   Total_Income  
#>  Min.   : 1963  
#>  1st Qu.: 4190  
#>  Median : 5300  
#>  Mean   : 6975  
#>  3rd Qu.: 7404  
#>  Max.   :63337

From the above summary, we note that Total_Income and LoanAmount variables have outliers. Let’s confirm it by using the boxplot parameter.

# outlier checking
boxplot(approved_loan_clean)

It can be seen from the above boxplot, that the variable of Total_Income has the most outliers.

Let’s confirm it by using another boxplot and histogram graphic.

boxplot(approved_loan_clean$Total_Income)

hist(approved_loan_clean$Total_Income, breaks = 50)

To make sure that our data is not biased from the outliers, let’s clean this up.

# filter
appdata_no_outlier <- approved_loan_clean[approved_loan_clean$Total_Income < 9500,]

summary(appdata_no_outlier)
#>     Gender    Married   Dependents        Education   Self_Employed
#>        :  1      :  2     :  6     Graduate    :195      : 15      
#>  Female: 43   No : 86   0 :147     Not Graduate: 51   No :205      
#>  Male  :202   Yes:158   1 : 33                        Yes: 26      
#>                         2 : 42                                     
#>                         3+: 18                                     
#>                                                                    
#>    LoanAmount    Loan_Amount_Term Credit_History     Property_Area
#>  Min.   : 17.0   Min.   : 60.0    Min.   :0.0000   Rural    : 60  
#>  1st Qu.: 98.0   1st Qu.:360.0    1st Qu.:1.0000   Semiurban:113  
#>  Median :120.0   Median :360.0    Median :1.0000   Urban    : 73  
#>  Mean   :122.0   Mean   :341.6    Mean   :0.9837                  
#>  3rd Qu.:140.8   3rd Qu.:360.0    3rd Qu.:1.0000                  
#>  Max.   :280.0   Max.   :480.0    Max.   :1.0000                  
#>   Total_Income 
#>  Min.   :1963  
#>  1st Qu.:3881  
#>  Median :4915  
#>  Mean   :5126  
#>  3rd Qu.:6000  
#>  Max.   :9328

Just to be safe, let’s see again whether we have really removed the outliers on our dataset.

boxplot(appdata_no_outlier$Total_Income)

hist(appdata_no_outlier$Total_Income, breaks = 50)

Based on the boxplot and histogram above, it’s safe to say that our dataset of appdata_no_outlier is considerably free from outliers.

Model Fitting

After done with the EDA, let’s find the linear regression model that best fit in predicting the loan amount. We will try 2 datasets, one with outlier and the other without outliers. Each dataset will have three models, - linear regression model with Total_Income as the only predictor - linear regression model with all predictors - linear regression model produced from the step-wise regression process.In our case, we will be using the backward step.

Model with Total_Income as the only predictor

model_totalincome <- lm(LoanAmount ~ Total_Income, data = approved_loan_clean)
model_totalincome_nooutlier <- lm(LoanAmount ~ Total_Income, data = appdata_no_outlier)

Model With all Predictors

model_all <- lm(LoanAmount ~ ., data = approved_loan_clean)
model_all_nooutlier <- lm(LoanAmount ~ ., data = appdata_no_outlier)

Step-wise regression model

model_step <- step(object = model_all, direction = "backward", trace = FALSE)
model_step_nooutlier <- step(object = model_all_nooutlier, direction = "backward", trace = FALSE)

Model Evaluation

Based on the three models above, we will compare their performance, particularly looking at the AIC, R-squared, and RMSE scores.

# model performance comparison

comparison <- compare_performance(model_totalincome, model_totalincome_nooutlier, model_all, model_all_nooutlier, model_step, model_step_nooutlier)

as.data.frame(comparison)

Kindly note the following: - the smaller the AIC, the lower the observation values that are not captured, hence the better the model. - for predictors > 1, better use adjusted R-squared as it is more representative. This is because the more predictors used, the Multiple R-squared will increase, regardless of whether the predictor is significant or not. Adjusted R-square will only increase if the predictors can minimize the error production of the model. Hence, the bigger the Adjusted R-squared, the better. - Root Mean Square or RMSE is a standard and popular way to measure the error of a model in predicting quantitative data showing how spread out the data is around the best-fit line. If the RMSE value is smaller, it means that the predicted value is close to the observed or observed value, and vice versa.

From the above comparison, we note that - AIC: model_step_nooutlier has the lowest AIC score - R2_adjusted: model_all_nooutlier has the highest R2_adjusted score - RMSE: model_all_nooutlier has the lowest RMSE score

This shows that all data without outliers fare better than the data with outliers.

Since model_all_nooutlier and model_step_nooutlier have similar scores, let’s see what sets them apart by doing model interpretation.

Model Interpretation

First, let’s see the differences of the model by looking at their summary:

summary(model_all_nooutlier)
#> 
#> Call:
#> lm(formula = LoanAmount ~ ., data = appdata_no_outlier)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -94.898 -13.954   1.226  15.987 151.869 
#> 
#> Coefficients:
#>                          Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)             76.928879  42.951475   1.791 0.074603 .  
#> GenderFemale           -26.901162  30.883850  -0.871 0.384643    
#> GenderMale             -35.645748  30.382171  -1.173 0.241916    
#> MarriedNo              -35.929035  25.980570  -1.383 0.168037    
#> MarriedYes             -31.778713  25.909939  -1.227 0.221268    
#> Dependents0             -9.897133  15.020940  -0.659 0.510629    
#> Dependents1              0.130389  15.762678   0.008 0.993407    
#> Dependents2              0.098077  15.660335   0.006 0.995009    
#> Dependents3+             3.180343  16.518995   0.193 0.847501    
#> EducationNot Graduate   -5.115139   4.822618  -1.061 0.289964    
#> Self_EmployedNo          2.888578   8.034176   0.360 0.719525    
#> Self_EmployedYes        -3.449699   9.849817  -0.350 0.726488    
#> Loan_Amount_Term         0.113893   0.031456   3.621 0.000362 ***
#> Credit_History         -14.884879  15.192641  -0.980 0.328247    
#> Property_AreaSemiurban   1.877377   4.878094   0.385 0.700699    
#> Property_AreaUrban     -10.118956   5.454082  -1.855 0.064839 .  
#> Total_Income             0.018439   0.001251  14.736  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 29.49 on 229 degrees of freedom
#> Multiple R-squared:  0.5546, Adjusted R-squared:  0.5235 
#> F-statistic: 17.82 on 16 and 229 DF,  p-value: < 2.2e-16
summary(model_step_nooutlier)
#> 
#> Call:
#> lm(formula = LoanAmount ~ Married + Loan_Amount_Term + Property_Area + 
#>     Total_Income, data = appdata_no_outlier)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -98.422 -15.293   0.719  15.390 170.751 
#> 
#> Coefficients:
#>                          Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)             26.697870  26.313999   1.015 0.311328    
#> MarriedNo              -41.401075  21.386349  -1.936 0.054064 .  
#> MarriedYes             -36.116161  21.287282  -1.697 0.091073 .  
#> Loan_Amount_Term         0.119002   0.031202   3.814 0.000174 ***
#> Property_AreaSemiurban   3.132239   4.771095   0.657 0.512132    
#> Property_AreaUrban      -8.262080   5.304775  -1.557 0.120680    
#> Total_Income             0.018200   0.001199  15.184  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 29.63 on 239 degrees of freedom
#> Multiple R-squared:  0.5309, Adjusted R-squared:  0.5192 
#> F-statistic: 45.09 on 6 and 239 DF,  p-value: < 2.2e-16
  • Based on the above summary, we can see that model_step_nooutlier only uses four variables, namely Married, Loan_Amount_Term, Property_Area, and Total_Income.
  • From the significance level, Total_Amount_Term and Total_Income, are predictors that significantly affect the Loan Amount linearly in both models.
  • From R-squared score: model_all_nooutlier can explain the variation in Loan Amount around 52,35%, slightly better than model_step_outlier with adjusted R-square score of 51.9%. This means that around 48% variation in Loan Amount can be explained by other factors outside the model.
  • If we try to interpret the model:
    • The higher the Total Income and the longer the Total Amount Term, the higher the approved Loan Amount (Both coefficients are positive).
    • For every increase of 1 point in Total Income, the Loan Amount will increase by 0.018 points given that the other predictors’ values are fixed.
    • Similarly, the longer the Loan Amount Term applied, the Loan Amount will increase by 0.119 points (model_step_nooutlier) or 0.114 points (model_all_outlier) given that the other predictors’ value are fixed.

Model Assumption

Let’s do assumption checking on our selected model model_step_nooutlier!

Linearity

Linearity hypothesis test:

  • H0: correlation is not significant
  • H1: correlation is significant
cor.test(appdata_no_outlier$LoanAmount, appdata_no_outlier$Total_Income)$p.value
#> [1] 2.684471e-35
cor.test(appdata_no_outlier$LoanAmount, appdata_no_outlier$Loan_Amount_Term)$p.value
#> [1] 0.009353887

Conclusion: All predictors have p-value < 0.05, therefore reject H0 (correlation is significant). Assumption is fulfilled.

Normality of Residuals

Shapiro-Wilk hypothesis test:

  • H0: error normal distribution
  • H1: error no normal distribution
shapiro.test(model_all_nooutlier$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_all_nooutlier$residuals
#> W = 0.95332, p-value = 4.054e-07
shapiro.test(model_step_nooutlier$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_step_nooutlier$residuals
#> W = 0.94001, p-value = 1.735e-08

Conclusion: The p-value of the two models is < 0.05. therefore reject H0 (error no normal distribution). Assumption is fulfilled.

Homoscedasticity of Residuals

Breusch-Pagan hypothesis test:

  • H0: error constantly spread out or homoscedasticity
  • H1: error inconstantly spread out or heteroscedasticity
bptest(model_all_nooutlier)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_all_nooutlier
#> BP = 23.547, df = 16, p-value = 0.09988
bptest(model_step_nooutlier)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_step_nooutlier
#> BP = 9.2058, df = 6, p-value = 0.1623

Conclusion: The p-value of both models is > 0.05. therefore fail to reject H0 (error constantly spread out or homoscedascticity). However model_all_nooutlier is better as the figure is closer to p-value of 0.05.

plot(appdata_no_outlier$Total_Income, model_step_nooutlier$residuals)
abline(h = 0, col = "red")

plot(appdata_no_outlier$Total_Income, model_all_nooutlier$residuals)
abline(h = 0, col = "red")

No Multicollinearity

VIF test:

  • VIF value > 10: there is a multicollinearity in the model
  • VIF value < 10: there is no multicollinearity in the model
vif(model_all_nooutlier)
#>                      GVIF Df GVIF^(1/(2*Df))
#> Gender           1.300536  2        1.067900
#> Married          2.018874  2        1.192003
#> Dependents       1.934231  4        1.085959
#> Education        1.080959  1        1.039692
#> Self_Employed    1.160563  2        1.037928
#> Loan_Amount_Term 1.093106  1        1.045517
#> Credit_History   1.044193  1        1.021858
#> Property_Area    1.177679  2        1.041734
#> Total_Income     1.133656  1        1.064733
vif(model_step_nooutlier)
#>                      GVIF Df GVIF^(1/(2*Df))
#> Married          1.039137  2        1.009644
#> Loan_Amount_Term 1.065876  1        1.032413
#> Property_Area    1.073699  2        1.017936
#> Total_Income     1.030879  1        1.015322

Conclusion: All VIF < 10, therefore there is no multicollinearity in both models. Assumption is fulfilled.

Recap

Based on the above assumption tests, only on Homoscedasticity test where the assumption is not fulfilled. Meanwhile for the remaining tests, we have fulfilled all assumptions for both models

Conclusion

For the modelling we will choose the model_all_nooutlier to predict Loan Amount for approved application. This is because the model has better results in terms of Adjusted R-square and RMSE, as well as better result for Homoscedascticity test compared to model_step_nooutlier.

Based on this model, Total Income and Loan Amount Term are the predictors that statistically significant and correlate positively to Loan Amount. Therefore, the Bank customer who is interested in applying for Loan, should pay attention to these two predictors.