Applied Data Science HW 2

Step 1: Data Selection and Preparation

Data Source Selection

Source: Kaggle Category: Finance

Data Origin Description

Data Uploaded: 10/28/2024
Data Author: Tawei Lo
Data Url: https://www.kaggle.com/datasets/taweilo/loan-approval-classification-data/data

Data Import

loan_df <- read.csv("loan_data.csv")

Step 2: Parametric Statistics

Regression Choice

The main purpose of this data set is to determine whether a loan will be awarded based on factors like income, education, credit score, etc. loan_status indicates the award (1) or lack thereof (0) of the loan. Logistic regression is the clear choice to apply here given that this is a binary classification problem.

head(loan_df)

##   person_age person_gender person_education person_income person_emp_exp
## 1         22        female           Master         71948              0
## 2         21        female      High School         12282              0
## 3         25        female      High School         12438              3
## 4         23        female         Bachelor         79753              0
## 5         24          male           Master         66135              1
## 6         21        female      High School         12951              0
##   person_home_ownership loan_amnt loan_intent loan_int_rate loan_percent_income
## 1                  RENT     35000    PERSONAL         16.02                0.49
## 2                   OWN      1000   EDUCATION         11.14                0.08
## 3              MORTGAGE      5500     MEDICAL         12.87                0.44
## 4                  RENT     35000     MEDICAL         15.23                0.44
## 5                  RENT     35000     MEDICAL         14.27                0.53
## 6                   OWN      2500     VENTURE          7.14                0.19
##   cb_person_cred_hist_length credit_score previous_loan_defaults_on_file
## 1                          3          561                             No
## 2                          2          504                            Yes
## 3                          3          635                             No
## 4                          2          675                             No
## 5                          4          586                             No
## 6                          2          532                             No
##   loan_status
## 1           1
## 2           0
## 3           1
## 4           1
## 5           1
## 6           1

There are 45,000 rows in the data. Loans were awarded 10,000 times (22%) and rejected 35,000 times (78%).

table(loan_df$loan_status)

## 
##     0     1 
## 35000 10000

Data Assessment

dim(loan_df)

## [1] 45000    14

str(loan_df)

## 'data.frame':    45000 obs. of  14 variables:
##  $ person_age                    : num  22 21 25 23 24 21 26 24 24 21 ...
##  $ person_gender                 : chr  "female" "female" "female" "female" ...
##  $ person_education              : chr  "Master" "High School" "High School" "Bachelor" ...
##  $ person_income                 : num  71948 12282 12438 79753 66135 ...
##  $ person_emp_exp                : int  0 0 3 0 1 0 1 5 3 0 ...
##  $ person_home_ownership         : chr  "RENT" "OWN" "MORTGAGE" "RENT" ...
##  $ loan_amnt                     : num  35000 1000 5500 35000 35000 2500 35000 35000 35000 1600 ...
##  $ loan_intent                   : chr  "PERSONAL" "EDUCATION" "MEDICAL" "MEDICAL" ...
##  $ loan_int_rate                 : num  16 11.1 12.9 15.2 14.3 ...
##  $ loan_percent_income           : num  0.49 0.08 0.44 0.44 0.53 0.19 0.37 0.37 0.35 0.13 ...
##  $ cb_person_cred_hist_length    : num  3 2 3 2 4 2 3 4 2 3 ...
##  $ credit_score                  : int  561 504 635 675 586 532 701 585 544 640 ...
##  $ previous_loan_defaults_on_file: chr  "No" "Yes" "No" "No" ...
##  $ loan_status                   : int  1 0 1 1 1 1 1 1 1 1 ...

names(loan_df)

##  [1] "person_age"                     "person_gender"                 
##  [3] "person_education"               "person_income"                 
##  [5] "person_emp_exp"                 "person_home_ownership"         
##  [7] "loan_amnt"                      "loan_intent"                   
##  [9] "loan_int_rate"                  "loan_percent_income"           
## [11] "cb_person_cred_hist_length"     "credit_score"                  
## [13] "previous_loan_defaults_on_file" "loan_status"

describe(loan_df)

## loan_df 
## 
##  14  Variables      45000  Observations
## --------------------------------------------------------------------------------
## person_age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    45000        0       60    0.994    27.76    6.015       22       22 
##      .25      .50      .75      .90      .95 
##       24       26       30       35       39 
## 
## lowest :  20  21  22  23  24, highest:  94 109 116 123 144
## --------------------------------------------------------------------------------
## person_gender 
##        n  missing distinct 
##    45000        0        2 
##                         
## Value      female   male
## Frequency   20159  24841
## Proportion  0.448  0.552
## --------------------------------------------------------------------------------
## person_education 
##        n  missing distinct 
##    45000        0        5 
##                                                                       
## Value        Associate    Bachelor   Doctorate High School      Master
## Frequency        12028       13399         621       11972        6980
## Proportion       0.267       0.298       0.014       0.266       0.155
## --------------------------------------------------------------------------------
## person_income 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    45000        0    33989        1    80319    51015    28367    35119 
##      .25      .50      .75      .90      .95 
##    47204    67048    95789   133276   166755 
## 
## lowest :    8000    8037    8104    8186    8248
## highest: 2280980 2448661 5545545 5556399 7200766
## --------------------------------------------------------------------------------
## person_emp_exp 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    45000        0       63    0.987     5.41    5.996        0        0 
##      .25      .50      .75      .90      .95 
##        1        4        8       13       17 
## 
## lowest :   0   1   2   3   4, highest: 100 101 121 124 125
## --------------------------------------------------------------------------------
## person_home_ownership 
##        n  missing distinct 
##    45000        0        4 
##                                               
## Value      MORTGAGE    OTHER      OWN     RENT
## Frequency     18489      117     2951    23443
## Proportion    0.411    0.003    0.066    0.521
## --------------------------------------------------------------------------------
## loan_amnt 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    45000        0     4483    0.999     9583     6790     2000     3000 
##      .25      .50      .75      .90      .95 
##     5000     8000    12237    19200    24000 
## 
## lowest :   500   563   700   725   750, highest: 34375 34664 34800 34826 35000
## --------------------------------------------------------------------------------
## loan_intent 
##        n  missing distinct 
##    45000        0        6 
##                                                                 
## Value      DEBTCONSOLIDATION         EDUCATION   HOMEIMPROVEMENT
## Frequency               7145              9153              4783
## Proportion             0.159             0.203             0.106
##                                                                 
## Value                MEDICAL          PERSONAL           VENTURE
## Frequency               8548              7552              7819
## Proportion             0.190             0.168             0.174
## --------------------------------------------------------------------------------
## loan_int_rate 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    45000        0     1302        1    11.01    3.381     6.17     6.99 
##      .25      .50      .75      .90      .95 
##     8.59    11.01    12.99    14.97    16.00 
## 
## lowest : 5.42  5.43  5.44  5.46  5.47 , highest: 19.8  19.82 19.9  19.91 20   
## --------------------------------------------------------------------------------
## loan_percent_income 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    45000        0       64    0.998   0.1397  0.09517     0.03     0.04 
##      .25      .50      .75      .90      .95 
##     0.07     0.12     0.19     0.26     0.31 
## 
## lowest : 0    0.01 0.02 0.03 0.04, highest: 0.59 0.61 0.62 0.63 0.66
## --------------------------------------------------------------------------------
## cb_person_cred_hist_length 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    45000        0       29    0.982    5.867     3.99        2        2 
##      .25      .50      .75      .90      .95 
##        3        4        8       11       14 
## 
## lowest :  2  3  4  5  6, highest: 26 27 28 29 30
## --------------------------------------------------------------------------------
## credit_score 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    45000        0      340        1    632.6    56.39      539      563 
##      .25      .50      .75      .90      .95 
##      601      640      670      692      703 
## 
## lowest : 390 418 419 420 421, highest: 789 792 805 807 850
## --------------------------------------------------------------------------------
## previous_loan_defaults_on_file 
##        n  missing distinct 
##    45000        0        2 
##                       
## Value         No   Yes
## Frequency  22142 22858
## Proportion 0.492 0.508
## --------------------------------------------------------------------------------
## loan_status 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##    45000        0        2    0.519    10000   0.2222   0.3457 
## 
## --------------------------------------------------------------------------------

No missing data in this dataset. There’s a mix of numerical and character field types.

vis_dat(loan_df)

numeric_features <- c("person_emp_exp","credit_score","loan_status","person_age","person_income","loan_amnt","loan_int_rate","loan_percent_income","cb_person_cred_hist_length")
categorical_features <- c("person_gender","person_education","person_home_ownership","loan_intent","previous_loan_defaults_on_file")

loan_df %>% 
  dplyr::select(-all_of(categorical_features)) %>%
  cor() %>%
  corrplot(order="hclust")

loan_df %>% ggplot(aes(x=person_age)) +
  geom_boxplot() +
  theme_bw()

loan_df %>% ggplot(aes(x=person_emp_exp)) +
  geom_boxplot() +
  theme_bw()

Removes rows with extreme outliers.

clean_loan_df <- loan_df %>% filter(person_age < 100, person_emp_exp < 50)

Data Split

Splits the data into train and test sets.

set.seed(42)
train_indices <- createDataPartition(clean_loan_df$loan_status, p = 0.8, list = FALSE) 

train <- clean_loan_df[train_indices,]
test <- clean_loan_df[-train_indices,]

Model Selection

Model Iterations

Performs backward selection process on model variables.

model1 = glm(loan_status~., family = binomial(link="logit"),train)
summary(model1)

## 
## Call:
## glm(formula = loan_status ~ ., family = binomial(link = "logit"), 
##     data = train)
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -4.625e-01  4.068e-01  -1.137   0.2555    
## person_age                         2.091e-02  1.232e-02   1.697   0.0897 .  
## person_gendermale                  3.514e-02  3.964e-02   0.887   0.3753    
## person_educationBachelor          -1.465e-02  5.271e-02  -0.278   0.7810    
## person_educationDoctorate         -5.711e-04  1.617e-01  -0.004   0.9972    
## person_educationHigh School        4.909e-02  5.505e-02   0.892   0.3725    
## person_educationMaster             2.504e-02  6.308e-02   0.397   0.6914    
## person_income                      1.945e-06  4.862e-07   4.000 6.34e-05 ***
## person_emp_exp                    -1.509e-02  1.090e-02  -1.385   0.1660    
## person_home_ownershipOTHER         2.033e-01  3.575e-01   0.569   0.5695    
## person_home_ownershipOWN          -1.417e+00  1.129e-01 -12.549  < 2e-16 ***
## person_home_ownershipRENT          7.378e-01  4.487e-02  16.443  < 2e-16 ***
## loan_amnt                         -1.079e-04  5.272e-06 -20.464  < 2e-16 ***
## loan_intentEDUCATION              -8.901e-01  6.532e-02 -13.627  < 2e-16 ***
## loan_intentHOMEIMPROVEMENT         7.888e-04  7.365e-02   0.011   0.9915    
## loan_intentMEDICAL                -2.862e-01  6.311e-02  -4.535 5.76e-06 ***
## loan_intentPERSONAL               -7.251e-01  6.687e-02 -10.844  < 2e-16 ***
## loan_intentVENTURE                -1.234e+00  7.124e-02 -17.323  < 2e-16 ***
## loan_int_rate                      3.346e-01  7.357e-03  45.482  < 2e-16 ***
## loan_percent_income                1.637e+01  3.998e-01  40.942  < 2e-16 ***
## cb_person_cred_hist_length        -1.223e-02  1.057e-02  -1.157   0.2474    
## credit_score                      -8.779e-03  4.582e-04 -19.157  < 2e-16 ***
## previous_loan_defaults_on_fileYes -2.038e+01  1.148e+02  -0.178   0.8590    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 38030  on 35988  degrees of freedom
## Residual deviance: 15888  on 35966  degrees of freedom
## AIC: 15934
## 
## Number of Fisher Scoring iterations: 19

Removes fields from model with p-value > 0.8

train2 <- train %>% dplyr::select(-c(person_education,previous_loan_defaults_on_file))
model2 = glm(loan_status~., family = binomial(link="logit"),train2)
summary(model2)

## 
## Call:
## glm(formula = loan_status ~ ., family = binomial(link = "logit"), 
##     data = train2)
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -6.826e+00  3.060e-01 -22.306  < 2e-16 ***
## person_age                  7.024e-03  9.876e-03   0.711  0.47696    
## person_gendermale          -1.326e-03  3.159e-02  -0.042  0.96651    
## person_income               1.821e-06  3.470e-07   5.249 1.53e-07 ***
## person_emp_exp             -6.347e-03  8.709e-03  -0.729  0.46617    
## person_home_ownershipOTHER  7.321e-01  2.839e-01   2.578  0.00993 ** 
## person_home_ownershipOWN   -1.404e+00  9.675e-02 -14.509  < 2e-16 ***
## person_home_ownershipRENT   8.731e-01  3.671e-02  23.783  < 2e-16 ***
## loan_amnt                  -1.072e-04  4.084e-06 -26.250  < 2e-16 ***
## loan_intentEDUCATION       -9.381e-01  5.209e-02 -18.008  < 2e-16 ***
## loan_intentHOMEIMPROVEMENT -3.852e-02  5.816e-02  -0.662  0.50781    
## loan_intentMEDICAL         -2.971e-01  4.914e-02  -6.044 1.50e-09 ***
## loan_intentPERSONAL        -6.851e-01  5.325e-02 -12.865  < 2e-16 ***
## loan_intentVENTURE         -1.194e+00  5.702e-02 -20.942  < 2e-16 ***
## loan_int_rate               3.390e-01  6.066e-03  55.885  < 2e-16 ***
## loan_percent_income         1.642e+01  2.942e-01  55.810  < 2e-16 ***
## cb_person_cred_hist_length -7.521e-03  8.579e-03  -0.877  0.38068    
## credit_score               -2.058e-04  3.140e-04  -0.655  0.51223    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 38030  on 35988  degrees of freedom
## Residual deviance: 25901  on 35971  degrees of freedom
## AIC: 25937
## 
## Number of Fisher Scoring iterations: 6

Removes fields from model with p-value > 0.6

train3 <- train2 %>% dplyr::select(-c(person_gender))
model3 = glm(loan_status~., family = binomial(link="logit"),train3)
summary(model3)

## 
## Call:
## glm(formula = loan_status ~ ., family = binomial(link = "logit"), 
##     data = train3)
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -6.826e+00  3.056e-01 -22.337  < 2e-16 ***
## person_age                  7.020e-03  9.875e-03   0.711  0.47716    
## person_income               1.821e-06  3.470e-07   5.249 1.53e-07 ***
## person_emp_exp             -6.348e-03  8.709e-03  -0.729  0.46609    
## person_home_ownershipOTHER  7.321e-01  2.839e-01   2.578  0.00993 ** 
## person_home_ownershipOWN   -1.404e+00  9.675e-02 -14.510  < 2e-16 ***
## person_home_ownershipRENT   8.731e-01  3.671e-02  23.783  < 2e-16 ***
## loan_amnt                  -1.072e-04  4.084e-06 -26.250  < 2e-16 ***
## loan_intentEDUCATION       -9.381e-01  5.209e-02 -18.008  < 2e-16 ***
## loan_intentHOMEIMPROVEMENT -3.850e-02  5.816e-02  -0.662  0.50799    
## loan_intentMEDICAL         -2.970e-01  4.914e-02  -6.044 1.50e-09 ***
## loan_intentPERSONAL        -6.851e-01  5.325e-02 -12.865  < 2e-16 ***
## loan_intentVENTURE         -1.194e+00  5.702e-02 -20.943  < 2e-16 ***
## loan_int_rate               3.390e-01  6.066e-03  55.885  < 2e-16 ***
## loan_percent_income         1.642e+01  2.942e-01  55.811  < 2e-16 ***
## cb_person_cred_hist_length -7.516e-03  8.578e-03  -0.876  0.38092    
## credit_score               -2.057e-04  3.140e-04  -0.655  0.51225    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 38030  on 35988  degrees of freedom
## Residual deviance: 25901  on 35972  degrees of freedom
## AIC: 25935
## 
## Number of Fisher Scoring iterations: 6

Removes fields from model with p-value > 0.4

train4 <- train3 %>% dplyr::select(-c(person_age,person_emp_exp,credit_score))
model4 = glm(loan_status~., family = binomial(link="logit"),train4)
summary(model4)

## 
## Call:
## glm(formula = loan_status ~ ., family = binomial(link = "logit"), 
##     data = train4)
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -6.799e+00  1.000e-01 -67.985  < 2e-16 ***
## person_income               1.821e-06  3.468e-07   5.251 1.51e-07 ***
## person_home_ownershipOTHER  7.294e-01  2.838e-01   2.570   0.0102 *  
## person_home_ownershipOWN   -1.404e+00  9.673e-02 -14.519  < 2e-16 ***
## person_home_ownershipRENT   8.730e-01  3.671e-02  23.785  < 2e-16 ***
## loan_amnt                  -1.071e-04  4.081e-06 -26.256  < 2e-16 ***
## loan_intentEDUCATION       -9.396e-01  5.204e-02 -18.055  < 2e-16 ***
## loan_intentHOMEIMPROVEMENT -3.818e-02  5.813e-02  -0.657   0.5113    
## loan_intentMEDICAL         -2.974e-01  4.914e-02  -6.052 1.43e-09 ***
## loan_intentPERSONAL        -6.857e-01  5.325e-02 -12.877  < 2e-16 ***
## loan_intentVENTURE         -1.195e+00  5.701e-02 -20.960  < 2e-16 ***
## loan_int_rate               3.388e-01  6.063e-03  55.883  < 2e-16 ***
## loan_percent_income         1.641e+01  2.940e-01  55.831  < 2e-16 ***
## cb_person_cred_hist_length -6.702e-03  4.145e-03  -1.617   0.1059    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 38030  on 35988  degrees of freedom
## Residual deviance: 25902  on 35975  degrees of freedom
## AIC: 25930
## 
## Number of Fisher Scoring iterations: 6

Removes fields from model with p-value > 0.1

train5 <- train4 %>% dplyr::select(-c(cb_person_cred_hist_length))
model5 = glm(loan_status~., family = binomial(link="logit"),train5)
summary(model5)

## 
## Call:
## glm(formula = loan_status ~ ., family = binomial(link = "logit"), 
##     data = train5)
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -6.832e+00  9.801e-02 -69.706  < 2e-16 ***
## person_income               1.770e-06  3.454e-07   5.125 2.98e-07 ***
## person_home_ownershipOTHER  7.343e-01  2.836e-01   2.589  0.00962 ** 
## person_home_ownershipOWN   -1.405e+00  9.672e-02 -14.526  < 2e-16 ***
## person_home_ownershipRENT   8.728e-01  3.670e-02  23.781  < 2e-16 ***
## loan_amnt                  -1.071e-04  4.081e-06 -26.251  < 2e-16 ***
## loan_intentEDUCATION       -9.368e-01  5.201e-02 -18.011  < 2e-16 ***
## loan_intentHOMEIMPROVEMENT -4.236e-02  5.807e-02  -0.729  0.46573    
## loan_intentMEDICAL         -2.992e-01  4.913e-02  -6.090 1.13e-09 ***
## loan_intentPERSONAL        -6.871e-01  5.324e-02 -12.905  < 2e-16 ***
## loan_intentVENTURE         -1.195e+00  5.700e-02 -20.962  < 2e-16 ***
## loan_int_rate               3.387e-01  6.063e-03  55.871  < 2e-16 ***
## loan_percent_income         1.641e+01  2.940e-01  55.820  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 38030  on 35988  degrees of freedom
## Residual deviance: 25904  on 35976  degrees of freedom
## AIC: 25930
## 
## Number of Fisher Scoring iterations: 6

VIF

Test for multicollinearity. No exceptionally large VIF values, so no further action required in this step.

vif(model5)

##                           GVIF Df GVIF^(1/(2*Df))
## person_income         1.528758  1        1.236430
## person_home_ownership 1.138824  3        1.021902
## loan_amnt             2.666317  1        1.632886
## loan_intent           1.052181  5        1.005100
## loan_int_rate         1.100726  1        1.049155
## loan_percent_income   2.647676  1        1.627168

Normality Review

Review normality of relevant variables.

ggplot(train5,aes(x=person_income))+
  geom_histogram(fill="lightblue",color="black")+
  theme_bw()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(train5,aes(x=person_home_ownership))+
  geom_histogram(fill="lightblue",color="black",stat = "count")+
  theme_bw()

## Warning in geom_histogram(fill = "lightblue", color = "black", stat = "count"):
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`

ggplot(train5,aes(x=loan_amnt))+
  geom_histogram(fill="lightblue",color="black")+
  theme_bw()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(train5,aes(x=loan_intent))+
  geom_histogram(fill="lightblue",color="black", stat="count")+
  theme_bw()

## Warning in geom_histogram(fill = "lightblue", color = "black", stat = "count"):
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`

ggplot(train5,aes(x=loan_int_rate))+
  geom_histogram(fill="lightblue",color="black")+
  theme_bw()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(train5,aes(x=loan_percent_income))+
  geom_histogram(fill="lightblue",color="black")+
  theme_bw()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Feature Scaling

Scales features used in model.

numeric_model_features <- c("person_income","loan_amnt","loan_int_rate","loan_percent_income")
categorical_model_features <- c("person_home_ownership","loan_intent")
target <- "loan_status"

train_scaled <- train %>% dplyr::select(all_of(numeric_model_features)) %>%
  scale() 

test_scaled <- test %>% dplyr::select(all_of(numeric_model_features)) %>% #uses train data to perform scaling to avoid leakage
  scale(center=attr(train_scaled, "scaled:center"), scale=attr(train_scaled, "scaled:scale")) 

train_scaled <- train %>% #adds back in target and categorical features
  dplyr::select(all_of(c(categorical_model_features,target))) %>%
  bind_cols(train_scaled)

test_scaled <- test %>% dplyr::select(all_of(c(categorical_model_features,target))) %>%
  bind_cols(test_scaled)

model6 = glm(loan_status~., family = binomial(link="logit"),train_scaled)
summary(model6)

## 
## Call:
## glm(formula = loan_status ~ ., family = binomial(link = "logit"), 
##     data = train_scaled)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -1.69779    0.04398 -38.605  < 2e-16 ***
## person_home_ownershipOTHER  0.73426    0.28357   2.589  0.00962 ** 
## person_home_ownershipOWN   -1.40495    0.09672 -14.526  < 2e-16 ***
## person_home_ownershipRENT   0.87281    0.03670  23.781  < 2e-16 ***
## loan_intentEDUCATION       -0.93681    0.05201 -18.011  < 2e-16 ***
## loan_intentHOMEIMPROVEMENT -0.04236    0.05807  -0.729  0.46573    
## loan_intentMEDICAL         -0.29919    0.04913  -6.090 1.13e-09 ***
## loan_intentPERSONAL        -0.68706    0.05324 -12.905  < 2e-16 ***
## loan_intentVENTURE         -1.19493    0.05700 -20.962  < 2e-16 ***
## person_income               0.11270    0.02199   5.125 2.98e-07 ***
## loan_amnt                  -0.67864    0.02585 -26.251  < 2e-16 ***
## loan_int_rate               1.00773    0.01804  55.871  < 2e-16 ***
## loan_percent_income         1.43417    0.02569  55.820  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 38030  on 35988  degrees of freedom
## Residual deviance: 25904  on 35976  degrees of freedom
## AIC: 25930
## 
## Number of Fisher Scoring iterations: 6

Review Results and Select Best Model

fit_results <- predict(model6,newdata=test_scaled,type='response')
fit_results <- ifelse(fit_results > 0.5,1,0)

class_error <- mean(fit_results != test_scaled$loan_status)
print(paste0('Accuracy: ',round((1-class_error)*100,2),"%"))

## [1] "Accuracy: 84.19%"

fit_results <- predict(model1,newdata=test,type='response')
fit_results <- ifelse(fit_results > 0.5,1,0)

class_error <- mean(fit_results != test$loan_status)
print(paste0('Accuracy: ',round((1-class_error)*100,2),"%"))

## [1] "Accuracy: 89.53%"

fit_results <- predict(model2,newdata=test,type='response')
fit_results <- ifelse(fit_results > 0.5,1,0)

class_error <- mean(fit_results != test$loan_status)
print(paste0('Accuracy: ',round((1-class_error)*100,2),"%"))

## [1] "Accuracy: 84.18%"

fit_results <- predict(model3,newdata=test,type='response')
fit_results <- ifelse(fit_results > 0.5,1,0)

class_error <- mean(fit_results != test$loan_status)
print(paste0('Accuracy: ',round((1-class_error)*100,2),"%"))

## [1] "Accuracy: 84.18%"

fit_results <- predict(model4,newdata=test,type='response')
fit_results <- ifelse(fit_results > 0.5,1,0)

class_error <- mean(fit_results != test$loan_status)
print(paste0('Accuracy: ',round((1-class_error)*100,2),"%"))

## [1] "Accuracy: 84.21%"

mae(test_scaled$loan_status,fit_results)

## [1] 0.1579415

mae(test$loan_status,predict(model1,newdata = test,type='response'))

## [1] 0.1434256

mae(test$loan_status,predict(model2,newdata = test,type='response'))

## [1] 0.2251504

mae(test$loan_status,predict(model3,newdata = test,type='response'))

## [1] 0.2251488

mae(test$loan_status,predict(model4,newdata = test,type='response'))

## [1] 0.225173

mae(test$loan_status,predict(model5,newdata = test,type='response'))

## [1] 0.225191

Model 1 displayed the lowest MAE and highest accuracy score despite the attempts to improve the model through successive iterations.

Step 3: Non-parametric Statistics

Decision Tree Analysis

Model Creation

model <- rpart(loan_status ~ ., method="class",data=train)

Decision Tree Visualized

The decision tree uses previous_loan_defaults_on_file, loan_percent_income, loan_int_rate, person_income, and person_home_ownership as the key predictors for loan_status.

rpart.plot(model, box.palette="RdBu", shadow.col="gray", nn=TRUE)

Model Results

p <- predict(model, test, type = "class")
table(p, test$loan_status)

##    
## p      0    1
##   0 6698  559
##   1  258 1482

fit_results <- predict(model,newdata=test,type='class')

class_error <- mean(fit_results != test$loan_status)
print(paste0('Accuracy: ',round((1-class_error)*100,2),"%"))

## [1] "Accuracy: 90.92%"

The decision tree displays a higher accuracy than all logistic regression models attempted in step 2. Decision tree may therefore be more appopriate for modeling this particular dataset.

Factor Analysis

Scree plot

The scree plot indicates 4 is the appopriate number of factors for this dataset.

ev <- clean_loan_df %>% dplyr::select(all_of(numeric_features)) %>%
  cor() %>%
  eigen()

ev$values

## [1] 2.87238630 1.88788443 1.27971812 1.01420870 0.95300076 0.60556782 0.17560649
## [8] 0.16758966 0.04403771

#Plot a Scree plot using base plot:
Factor = c(1,2,3,4,5,6,7,8,9)
Eigen_Values <- ev$values
Scree <- data.frame(Factor, Eigen_Values)
plot(Scree, main = "Scree Plot", col= "Blue",ylim=c(0,4))
lines(Scree,col='Red')
abline(h = 1, col="Green")

Varimax Rotation

The results of the hypothesis test indicate that 4 factors are indeed sufficient.

nfactors <- 4
fit1 <- clean_loan_df %>% dplyr::select(all_of(numeric_features)) %>%
  factanal(nfactors,scores = c("regression"),rotation = "varimax")
print(fit1)

## 
## Call:
## factanal(x = ., factors = nfactors, scores = c("regression"),     rotation = "varimax")
## 
## Uniquenesses:
##             person_emp_exp               credit_score 
##                      0.089                      0.969 
##                loan_status                 person_age 
##                      0.005                      0.005 
##              person_income                  loan_amnt 
##                      0.546                      0.005 
##              loan_int_rate        loan_percent_income 
##                      0.869                      0.005 
## cb_person_cred_hist_length 
##                      0.223 
## 
## Loadings:
##                            Factor1 Factor2 Factor3 Factor4
## person_emp_exp              0.953                         
## credit_score                0.174                         
## loan_status                                 0.971  -0.210 
## person_age                  0.995                         
## person_income               0.102                   0.666 
## loan_amnt                           0.866   0.140   0.475 
## loan_int_rate                               0.348         
## loan_percent_income                 0.876   0.231  -0.417 
## cb_person_cred_hist_length  0.880                         
## 
##                Factor1 Factor2 Factor3 Factor4
## SS loadings      2.713   1.530   1.139   0.900
## Proportion Var   0.301   0.170   0.127   0.100
## Cumulative Var   0.301   0.471   0.598   0.698
## 
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 259.53 on 6 degrees of freedom.
## The p-value is 3.77e-53

Diagram

fa_var <- clean_loan_df %>% dplyr::select(all_of(numeric_features)) %>%
  fa(r=., nfactors = 4, rotate="varimax",fm="pa")

## maximum iteration exceeded

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

fa.diagram(fa_var)

head(fa_var$scores)

##             PA1         PA2        PA3        PA4
## [1,] -1.0301953  4.17880191  0.9179240  1.6139383
## [2,] -1.0841345 -0.94971305 -0.9130053 -0.2926295
## [3,] -0.1405244  2.95782622 -4.5132567  0.8775261
## [4,] -0.9153672  3.62953972  1.5436974  1.5596118
## [5,] -0.6655125  4.68131468  0.3925582  1.3579715
## [6,] -1.0232959  0.02801921 -1.7955365  0.3864672

The groups indicated by the factor analysis make logical sense. The features in group 1 are all tied to age/time. Loan percentage and loan amount would logically be connected. Loan interest rate and loan status is an interesting pair, but that aligns with the results of the decision tree.

References

Lo, T. (2024, October 28). Loan Approval Classification Dataset. Retrieved from Kaggle: https://www.kaggle.com/datasets/taweilo/loan-approval-classification-data/data