ZIXIAN LIANG
2024-04-28
In our daily lives, our basic needs and activities, such as appliances, transportation, and housing, are more or less closely related to loans. From a macro perspective, the existence of loans can stimulate the socio-economic landscape; while from a micro perspective, judicious use of loans can alleviate financial pressure and meet essential life needs, such as purchasing a car or buying a home. Recognizing this, I have decided to focus on analyzing data related to loans. On a macro level, the ideal focus is mortgage loans. I will collect data on mortgage loan applications and interest rates from recent years and visualize this data to observe trends. Typically, a higher interest rate corresponds to fewer loan applications, while a lower interest rate leads to more applications. However, with the impact of the COVID-19 pandemic and the post-pandemic era, where the Federal Reserve has significantly raised interest rates to curb inflation, will the situation remain the same? Under persistently high interest rates, what factors contribute to loan approvals when needed? I will gather data on loan approvals to observe approval trends from a micro perspective. Which variables influence approval? Can machine learning predict loan approvals? I will utilize a portion of the data for training and another portion for testing to evaluate whether the simulation results align with real-world outcomes.
Citation:
https://www.lendingtree.com/home/mortgage/u-s-mortgage-market-statistics/
https://www.kaggle.com/datasets/pravinmaurya69/loan-approval-prediction-dataset
content <- read_html("https://www.lendingtree.com/home/mortgage/u-s-mortgage-market-statistics/")
tables <- content %>% html_table(fill = TRUE)
application_table <- tables[[1]]
application_table## # A tibble: 12 × 4
## `` Mortgage accounts* (m…¹ Mortgage balances ($…² Average mortgage siz…³
## <chr> <dbl> <chr> <chr>
## 1 Q4 2012 83.2 $8.03 $96,516
## 2 Q4 2013 81.6 $8.05 $98,640
## 3 Q4 2014 81.4 $8.17 $100,332
## 4 Q4 2015 80.6 $8.25 $102,332
## 5 Q4 2016 79.9 $8.48 $106,133
## 6 Q4 2017 80.0 $8.88 $111,039
## 7 Q4 2018 79.4 $9.12 $114,984
## 8 Q4 2019 80.9 $9.56 $118,075
## 9 Q4 2020 80.6 $10.04 $124,603
## 10 Q4 2021 81.0 $10.93 $135,005
## 11 Q4 2022 83.4 $11.92 $142,927
## 12 Q3 2023 84.0 $12.14 $144,593
## # ℹ abbreviated names: ¹​`Mortgage accounts* (millions)`,
## # ²​`Mortgage balances ($ trillions)`, ³​`Average mortgage size per account`
## # A tibble: 52 × 4
## Year `Annual weekly average` `Annual high` `Annual low`
## <int> <chr> <chr> <chr>
## 1 1972 7.38% 7.46% 7.23%
## 2 1973 8.04% 8.85% 7.43%
## 3 1974 9.19% 10.03% 8.40%
## 4 1975 9.05% 9.60% 8.80%
## 5 1976 8.87% 9.10% 8.70%
## 6 1977 8.85% 9.00% 8.65%
## 7 1978 9.64% 10.38% 8.98%
## 8 1979 11.20% 12.90% 10.38%
## 9 1980 13.74% 16.35% 12.18%
## 10 1981 16.64% 18.63% 14.80%
## # ℹ 42 more rows
## # A tibble: 10 × 4
## Year `Mortgage applications (millions)` `Mortgage rate` Mortgage_rate_num
## <int> <dbl> <chr> <dbl>
## 1 2014 81.4 4.17% 4.17
## 2 2015 80.6 3.85% 3.85
## 3 2016 79.9 3.65% 3.65
## 4 2017 80.0 3.99% 3.99
## 5 2018 79.4 4.54% 4.54
## 6 2019 80.9 3.94% 3.94
## 7 2020 80.6 3.11% 3.11
## 8 2021 81.0 2.96% 2.96
## 9 2022 83.4 5.34% 5.34
## 10 2023 84.0 6.79% 6.79
## Rows: 598
## Columns: 13
## $ Loan_ID <chr> "LP001002", "LP001003", "LP001005", "LP001006", "LP0…
## $ Gender <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Mal…
## $ Married <chr> "No", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes"…
## $ Dependents <int> 0, 1, 0, 0, 0, 2, 0, 3, 2, 1, 2, 2, 2, 0, 2, 0, 1, 0…
## $ Education <chr> "Graduate", "Graduate", "Graduate", "Not Graduate", …
## $ Self_Employed <chr> "No", "No", "Yes", "No", "No", "Yes", "No", "No", "N…
## $ ApplicantIncome <int> 5849, 4583, 3000, 2583, 6000, 5417, 2333, 3036, 4006…
## $ CoapplicantIncome <dbl> 0, 1508, 0, 2358, 0, 4196, 1516, 2504, 1526, 10968, …
## $ LoanAmount <int> NA, 128, 66, 120, 141, 267, 95, 158, 168, 349, 70, 1…
## $ Loan_Amount_Term <int> 360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 36…
## $ Credit_History <int> 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, NA, …
## $ Property_Area <chr> "Urban", "Rural", "Urban", "Urban", "Urban", "Urban"…
## $ Loan_Status <chr> "Y", "N", "Y", "Y", "Y", "Y", "Y", "N", "Y", "N", "Y…
## Loan_ID Gender Married Dependents
## 0 0 0 12
## Education Self_Employed ApplicantIncome CoapplicantIncome
## 0 0 0 0
## LoanAmount Loan_Amount_Term Credit_History Property_Area
## 21 14 49 0
## Loan_Status
## 0
## Loan_ID Gender Married Dependents
## 0 0 0 0
## Education Self_Employed ApplicantIncome CoapplicantIncome
## 0 0 0 0
## LoanAmount Loan_Amount_Term Credit_History Property_Area
## 0 0 0 0
## Loan_Status
## 0
## Loan_Status Loan_ID Gender Married Dependents
## N:187 Length:598 Female:111 No :210 Min. :0.0000
## Y:411 Class :character Male :487 Yes:388 1st Qu.:0.0000
## Mode :character Median :0.0000
## Mean :0.7408
## 3rd Qu.:1.0000
## Max. :3.0000
## Education Self_Employed ApplicantIncome CoapplicantIncome
## Graduate :465 No :488 Min. : 150 Min. : 0
## Not Graduate:133 Yes:110 1st Qu.: 2878 1st Qu.: 0
## Median : 3806 Median : 1212
## Mean : 5292 Mean : 1632
## 3rd Qu.: 5746 3rd Qu.: 2324
## Max. :81000 Max. :41667
## LoanAmount Loan_Amount_Term Credit_History Property_Area
## Min. : 9.0 Min. : 12.0 0:135 Rural :175
## 1st Qu.:100.0 1st Qu.:360.0 1:463 Semiurban:225
## Median :127.0 Median :360.0 Urban :198
## Mean :144.3 Mean :342.3
## 3rd Qu.:163.5 3rd Qu.:360.0
## Max. :650.0 Max. :480.0
## Rows: 598
## Columns: 13
## $ Loan_Status <fct> Y, N, Y, Y, Y, Y, Y, N, Y, N, Y, Y, Y, N, Y, Y, Y, N…
## $ Loan_ID <chr> "LP001002", "LP001003", "LP001005", "LP001006", "LP0…
## $ Gender <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male…
## $ Married <fct> No, Yes, Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes,…
## $ Dependents <dbl> 0, 1, 0, 0, 0, 2, 0, 3, 2, 1, 2, 2, 2, 0, 2, 0, 1, 0…
## $ Education <fct> Graduate, Graduate, Graduate, Not Graduate, Graduate…
## $ Self_Employed <fct> No, No, Yes, No, No, Yes, No, No, No, No, No, Yes, N…
## $ ApplicantIncome <int> 5849, 4583, 3000, 2583, 6000, 5417, 2333, 3036, 4006…
## $ CoapplicantIncome <dbl> 0, 1508, 0, 2358, 0, 4196, 1516, 2504, 1526, 10968, …
## $ LoanAmount <int> 127, 128, 66, 120, 141, 267, 95, 158, 168, 349, 70, …
## $ Loan_Amount_Term <dbl> 360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 36…
## $ Credit_History <fct> 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0…
## $ Property_Area <fct> Urban, Rural, Urban, Urban, Urban, Urban, Urban, Sem…
## Prediction
## From Data N Y
## N 2 55
## Y 5 117
## [1] 0.6648045
## Prediction
## From Data N Y
## N 9 48
## Y 13 109
## [1] 0.6592179
## rf variable importance
##
## Overall
## Credit_History 100.0000
## ApplicantIncome 94.4343
## LoanAmount 86.8831
## CoapplicantIncome 58.1373
## Loan_Amount_Term 21.1374
## Property_Area 20.4248
## Dependents 17.4167
## Education 3.6522
## Married 2.8995
## Self_Employed 0.6865
## Gender 0.0000
##
## Call:
## randomForest(formula = Loan_Status ~ ., data = train_data3)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 26.01%
## Confusion matrix:
## N Y class.error
## N 56 74 0.5692308
## Y 35 254 0.1211073
## Confusion Matrix and Statistics
##
## Reference
## Prediction N Y
## N 24 14
## Y 33 108
##
## Accuracy : 0.7374
## 95% CI : (0.6666, 0.8003)
## No Information Rate : 0.6816
## P-Value [Acc > NIR] : 0.06193
##
## Kappa : 0.3361
##
## Mcnemar's Test P-Value : 0.00865
##
## Sensitivity : 0.4211
## Specificity : 0.8852
## Pos Pred Value : 0.6316
## Neg Pred Value : 0.7660
## Prevalence : 0.3184
## Detection Rate : 0.1341
## Detection Prevalence : 0.2123
## Balanced Accuracy : 0.6531
##
## 'Positive' Class : N
##
## mtry = 1 OOB error = 29.12%
## Searching left ...
## Searching right ...
## mtry = 2 OOB error = 26.01%
## 0.1065574 0.05
## mtry = 4 OOB error = 26.73%
## -0.02752294 0.05
## mtry OOBError
## 1.OOB 1 0.2911695
## 2.OOB 2 0.2601432
## 4.OOB 4 0.2673031
tuning_rf_model <- randomForest(Loan_Status ~ ., data = train_data3, ntree = 300, mtry = 2,
do.trace = 100)## ntree OOB 1 2
## 100: 26.49% 60.77% 11.07%
## 200: 25.78% 60.00% 10.38%
## 300: 26.49% 61.54% 10.73%
random_forest_mirror2 <- predict(tuning_rf_model,test_data3, type = "class")
confusionMatrix(random_forest_mirror2, test_data3$Loan_Status)## Confusion Matrix and Statistics
##
## Reference
## Prediction N Y
## N 24 12
## Y 33 110
##
## Accuracy : 0.7486
## 95% CI : (0.6784, 0.8103)
## No Information Rate : 0.6816
## P-Value [Acc > NIR] : 0.030594
##
## Kappa : 0.3578
##
## Mcnemar's Test P-Value : 0.002869
##
## Sensitivity : 0.4211
## Specificity : 0.9016
## Pos Pred Value : 0.6667
## Neg Pred Value : 0.7692
## Prevalence : 0.3184
## Detection Rate : 0.1341
## Detection Prevalence : 0.2011
## Balanced Accuracy : 0.6613
##
## 'Positive' Class : N
##
In general, fluctuations in interest rates do indeed tend to correspond with decreases or increases in loan application volumes. However, as I speculated, this relationship is not absolute. According to the ggplot geomline chart spanning the past decade, we can observe that from 2014 to 2016, despite a decline in interest rates, loan application volumes were also decreasing. Conversely, starting in 2021, as the post-COVID market began to recover, despite interest rates rising by as much as 4%, loan application volumes continued to steadily increase. Therefore, it is evident that interest rates are not the sole determinant of loan application volumes; market demand plays a more crucial role. On the other hand, through training and testing loan approvals using decision trees and random forests, we can conclude that factors such as individual qualifications and income standards can only predict accuracy to a certain extent, with an accuracy score of 0.7486. This suggests that approximately 25% of the dataset remains unexplained. Therefore, it is evident that besides meeting explicit criteria, there are additional factors influencing loan approval. It could be that when there is an overwhelming number of loan applicants at the same time, banks may adopt a selective approach in approving loans, prioritizing applicants with stronger financial profiles or meeting specific criteria.