Data 607 Final Project - Loans

ZIXIAN LIANG

2024-04-28

Introduction

In our daily lives, our basic needs and activities, such as appliances, transportation, and housing, are more or less closely related to loans. From a macro perspective, the existence of loans can stimulate the socio-economic landscape; while from a micro perspective, judicious use of loans can alleviate financial pressure and meet essential life needs, such as purchasing a car or buying a home. Recognizing this, I have decided to focus on analyzing data related to loans. On a macro level, the ideal focus is mortgage loans. I will collect data on mortgage loan applications and interest rates from recent years and visualize this data to observe trends. Typically, a higher interest rate corresponds to fewer loan applications, while a lower interest rate leads to more applications. However, with the impact of the COVID-19 pandemic and the post-pandemic era, where the Federal Reserve has significantly raised interest rates to curb inflation, will the situation remain the same? Under persistently high interest rates, what factors contribute to loan approvals when needed? I will gather data on loan approvals to observe approval trends from a micro perspective. Which variables influence approval? Can machine learning predict loan approvals? I will utilize a portion of the data for training and another portion for testing to evaluate whether the simulation results align with real-world outcomes.

Citation:

https://www.lendingtree.com/home/mortgage/u-s-mortgage-market-statistics/

https://www.kaggle.com/datasets/pravinmaurya69/loan-approval-prediction-dataset

Data 1

content <- read_html("https://www.lendingtree.com/home/mortgage/u-s-mortgage-market-statistics/")

tables <- content %>% html_table(fill = TRUE)

application_table <- tables[[1]]

application_table
## # A tibble: 12 × 4
##    ``      Mortgage accounts* (m…¹ Mortgage balances ($…² Average mortgage siz…³
##    <chr>                     <dbl> <chr>                  <chr>                 
##  1 Q4 2012                    83.2 $8.03                  $96,516               
##  2 Q4 2013                    81.6 $8.05                  $98,640               
##  3 Q4 2014                    81.4 $8.17                  $100,332              
##  4 Q4 2015                    80.6 $8.25                  $102,332              
##  5 Q4 2016                    79.9 $8.48                  $106,133              
##  6 Q4 2017                    80.0 $8.88                  $111,039              
##  7 Q4 2018                    79.4 $9.12                  $114,984              
##  8 Q4 2019                    80.9 $9.56                  $118,075              
##  9 Q4 2020                    80.6 $10.04                 $124,603              
## 10 Q4 2021                    81.0 $10.93                 $135,005              
## 11 Q4 2022                    83.4 $11.92                 $142,927              
## 12 Q3 2023                    84.0 $12.14                 $144,593              
## # ℹ abbreviated names: ¹​`Mortgage accounts* (millions)`,
## #   ²​`Mortgage balances ($ trillions)`, ³​`Average mortgage size per account`

rate_table <- tables[[3]]
rate_table
## # A tibble: 52 × 4
##     Year `Annual weekly average` `Annual high` `Annual low`
##    <int> <chr>                   <chr>         <chr>       
##  1  1972 7.38%                   7.46%         7.23%       
##  2  1973 8.04%                   8.85%         7.43%       
##  3  1974 9.19%                   10.03%        8.40%       
##  4  1975 9.05%                   9.60%         8.80%       
##  5  1976 8.87%                   9.10%         8.70%       
##  6  1977 8.85%                   9.00%         8.65%       
##  7  1978 9.64%                   10.38%        8.98%       
##  8  1979 11.20%                  12.90%        10.38%      
##  9  1980 13.74%                  16.35%        12.18%      
## 10  1981 16.64%                  18.63%        14.80%      
## # ℹ 42 more rows

Data transformation

## # A tibble: 10 × 4
##     Year `Mortgage applications (millions)` `Mortgage rate` Mortgage_rate_num
##    <int>                              <dbl> <chr>                       <dbl>
##  1  2014                               81.4 4.17%                        4.17
##  2  2015                               80.6 3.85%                        3.85
##  3  2016                               79.9 3.65%                        3.65
##  4  2017                               80.0 3.99%                        3.99
##  5  2018                               79.4 4.54%                        4.54
##  6  2019                               80.9 3.94%                        3.94
##  7  2020                               80.6 3.11%                        3.11
##  8  2021                               81.0 2.96%                        2.96
##  9  2022                               83.4 5.34%                        5.34
## 10  2023                               84.0 6.79%                        6.79

Plot Draft

Improved Plot

Data 2

## Rows: 598
## Columns: 13
## $ Loan_ID           <chr> "LP001002", "LP001003", "LP001005", "LP001006", "LP0…
## $ Gender            <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Mal…
## $ Married           <chr> "No", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes"…
## $ Dependents        <int> 0, 1, 0, 0, 0, 2, 0, 3, 2, 1, 2, 2, 2, 0, 2, 0, 1, 0…
## $ Education         <chr> "Graduate", "Graduate", "Graduate", "Not Graduate", …
## $ Self_Employed     <chr> "No", "No", "Yes", "No", "No", "Yes", "No", "No", "N…
## $ ApplicantIncome   <int> 5849, 4583, 3000, 2583, 6000, 5417, 2333, 3036, 4006…
## $ CoapplicantIncome <dbl> 0, 1508, 0, 2358, 0, 4196, 1516, 2504, 1526, 10968, …
## $ LoanAmount        <int> NA, 128, 66, 120, 141, 267, 95, 158, 168, 349, 70, 1…
## $ Loan_Amount_Term  <int> 360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 36…
## $ Credit_History    <int> 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, NA, …
## $ Property_Area     <chr> "Urban", "Rural", "Urban", "Urban", "Urban", "Urban"…
## $ Loan_Status       <chr> "Y", "N", "Y", "Y", "Y", "Y", "Y", "N", "Y", "N", "Y…

Exploratory Data Analysis

colSums(is.na(loan_data))
##           Loan_ID            Gender           Married        Dependents 
##                 0                 0                 0                12 
##         Education     Self_Employed   ApplicantIncome CoapplicantIncome 
##                 0                 0                 0                 0 
##        LoanAmount  Loan_Amount_Term    Credit_History     Property_Area 
##                21                14                49                 0 
##       Loan_Status 
##                 0
##           Loan_ID            Gender           Married        Dependents 
##                 0                 0                 0                 0 
##         Education     Self_Employed   ApplicantIncome CoapplicantIncome 
##                 0                 0                 0                 0 
##        LoanAmount  Loan_Amount_Term    Credit_History     Property_Area 
##                 0                 0                 0                 0 
##       Loan_Status 
##                 0
##  Loan_Status   Loan_ID             Gender    Married     Dependents    
##  N:187       Length:598         Female:111   No :210   Min.   :0.0000  
##  Y:411       Class :character   Male  :487   Yes:388   1st Qu.:0.0000  
##              Mode  :character                          Median :0.0000  
##                                                        Mean   :0.7408  
##                                                        3rd Qu.:1.0000  
##                                                        Max.   :3.0000  
##         Education   Self_Employed ApplicantIncome CoapplicantIncome
##  Graduate    :465   No :488       Min.   :  150   Min.   :    0    
##  Not Graduate:133   Yes:110       1st Qu.: 2878   1st Qu.:    0    
##                                   Median : 3806   Median : 1212    
##                                   Mean   : 5292   Mean   : 1632    
##                                   3rd Qu.: 5746   3rd Qu.: 2324    
##                                   Max.   :81000   Max.   :41667    
##    LoanAmount    Loan_Amount_Term Credit_History   Property_Area
##  Min.   :  9.0   Min.   : 12.0    0:135          Rural    :175  
##  1st Qu.:100.0   1st Qu.:360.0    1:463          Semiurban:225  
##  Median :127.0   Median :360.0                   Urban    :198  
##  Mean   :144.3   Mean   :342.3                                  
##  3rd Qu.:163.5   3rd Qu.:360.0                                  
##  Max.   :650.0   Max.   :480.0
## Rows: 598
## Columns: 13
## $ Loan_Status       <fct> Y, N, Y, Y, Y, Y, Y, N, Y, N, Y, Y, Y, N, Y, Y, Y, N…
## $ Loan_ID           <chr> "LP001002", "LP001003", "LP001005", "LP001006", "LP0…
## $ Gender            <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male…
## $ Married           <fct> No, Yes, Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes,…
## $ Dependents        <dbl> 0, 1, 0, 0, 0, 2, 0, 3, 2, 1, 2, 2, 2, 0, 2, 0, 1, 0…
## $ Education         <fct> Graduate, Graduate, Graduate, Not Graduate, Graduate…
## $ Self_Employed     <fct> No, No, Yes, No, No, Yes, No, No, No, No, No, Yes, N…
## $ ApplicantIncome   <int> 5849, 4583, 3000, 2583, 6000, 5417, 2333, 3036, 4006…
## $ CoapplicantIncome <dbl> 0, 1508, 0, 2358, 0, 4196, 1516, 2504, 1526, 10968, …
## $ LoanAmount        <int> 127, 128, 66, 120, 141, 267, 95, 158, 168, 349, 70, …
## $ Loan_Amount_Term  <dbl> 360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 36…
## $ Credit_History    <fct> 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0…
## $ Property_Area     <fct> Urban, Rural, Urban, Urban, Urban, Urban, Urban, Sem…

Decision Tree 1 - Non_numeric_loan_data

##          Prediction
## From Data   N   Y
##         N   2  55
##         Y   5 117
## [1] 0.6648045

Decision Tree 2 - Numeric_loan_data

##          Prediction
## From Data   N   Y
##         N   9  48
##         Y  13 109
## [1] 0.6592179

Random Forest 3

## rf variable importance
## 
##                    Overall
## Credit_History    100.0000
## ApplicantIncome    94.4343
## LoanAmount         86.8831
## CoapplicantIncome  58.1373
## Loan_Amount_Term   21.1374
## Property_Area      20.4248
## Dependents         17.4167
## Education           3.6522
## Married             2.8995
## Self_Employed       0.6865
## Gender              0.0000
## 
## Call:
##  randomForest(formula = Loan_Status ~ ., data = train_data3) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 26.01%
## Confusion matrix:
##    N   Y class.error
## N 56  74   0.5692308
## Y 35 254   0.1211073

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N  24  14
##          Y  33 108
##                                           
##                Accuracy : 0.7374          
##                  95% CI : (0.6666, 0.8003)
##     No Information Rate : 0.6816          
##     P-Value [Acc > NIR] : 0.06193         
##                                           
##                   Kappa : 0.3361          
##                                           
##  Mcnemar's Test P-Value : 0.00865         
##                                           
##             Sensitivity : 0.4211          
##             Specificity : 0.8852          
##          Pos Pred Value : 0.6316          
##          Neg Pred Value : 0.7660          
##              Prevalence : 0.3184          
##          Detection Rate : 0.1341          
##    Detection Prevalence : 0.2123          
##       Balanced Accuracy : 0.6531          
##                                           
##        'Positive' Class : N               
## 

Random Forest Tuning

## mtry = 1  OOB error = 29.12% 
## Searching left ...
## Searching right ...
## mtry = 2     OOB error = 26.01% 
## 0.1065574 0.05 
## mtry = 4     OOB error = 26.73% 
## -0.02752294 0.05

##       mtry  OOBError
## 1.OOB    1 0.2911695
## 2.OOB    2 0.2601432
## 4.OOB    4 0.2673031
tuning_rf_model <- randomForest(Loan_Status ~ ., data = train_data3, ntree = 300, mtry = 2,
                                do.trace = 100)
## ntree      OOB      1      2
##   100:  26.49% 60.77% 11.07%
##   200:  25.78% 60.00% 10.38%
##   300:  26.49% 61.54% 10.73%
random_forest_mirror2 <- predict(tuning_rf_model,test_data3, type = "class")
confusionMatrix(random_forest_mirror2, test_data3$Loan_Status)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N  24  12
##          Y  33 110
##                                           
##                Accuracy : 0.7486          
##                  95% CI : (0.6784, 0.8103)
##     No Information Rate : 0.6816          
##     P-Value [Acc > NIR] : 0.030594        
##                                           
##                   Kappa : 0.3578          
##                                           
##  Mcnemar's Test P-Value : 0.002869        
##                                           
##             Sensitivity : 0.4211          
##             Specificity : 0.9016          
##          Pos Pred Value : 0.6667          
##          Neg Pred Value : 0.7692          
##              Prevalence : 0.3184          
##          Detection Rate : 0.1341          
##    Detection Prevalence : 0.2011          
##       Balanced Accuracy : 0.6613          
##                                           
##        'Positive' Class : N               
## 

Conclusion

In general, fluctuations in interest rates do indeed tend to correspond with decreases or increases in loan application volumes. However, as I speculated, this relationship is not absolute. According to the ggplot geomline chart spanning the past decade, we can observe that from 2014 to 2016, despite a decline in interest rates, loan application volumes were also decreasing. Conversely, starting in 2021, as the post-COVID market began to recover, despite interest rates rising by as much as 4%, loan application volumes continued to steadily increase. Therefore, it is evident that interest rates are not the sole determinant of loan application volumes; market demand plays a more crucial role. On the other hand, through training and testing loan approvals using decision trees and random forests, we can conclude that factors such as individual qualifications and income standards can only predict accuracy to a certain extent, with an accuracy score of 0.7486. This suggests that approximately 25% of the dataset remains unexplained. Therefore, it is evident that besides meeting explicit criteria, there are additional factors influencing loan approval. It could be that when there is an overwhelming number of loan applicants at the same time, banks may adopt a selective approach in approving loans, prioritizing applicants with stronger financial profiles or meeting specific criteria.