Source: Kaggle | Category: Finance
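The chunks below call functions from several packages beyond base R; these are presumably loaded in a setup chunk that is not shown. A minimal sketch of the library calls they imply (the mae() usage matches the Metrics package, though other packages export a similar function):
library(dplyr)       #pipes and dplyr::select()
library(ggplot2)     #histograms and boxplots
library(Hmisc)       #describe()
library(visdat)      #vis_dat()
library(corrplot)    #corrplot()
library(caret)       #createDataPartition()
library(car)         #vif()
library(Metrics)     #mae()
library(rpart)       #decision tree
library(rpart.plot)  #rpart.plot()
library(psych)       #fa(), fa.diagram()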
loan_df <- read.csv("loan_data.csv")
The main purpose of this dataset is to determine whether a loan will be approved based on factors such as income, education, and credit score. loan_status indicates whether the loan was approved (1) or not (0). Because this is a binary classification problem, logistic regression is a natural first choice.
head(loan_df)
## person_age person_gender person_education person_income person_emp_exp
## 1 22 female Master 71948 0
## 2 21 female High School 12282 0
## 3 25 female High School 12438 3
## 4 23 female Bachelor 79753 0
## 5 24 male Master 66135 1
## 6 21 female High School 12951 0
## person_home_ownership loan_amnt loan_intent loan_int_rate loan_percent_income
## 1 RENT 35000 PERSONAL 16.02 0.49
## 2 OWN 1000 EDUCATION 11.14 0.08
## 3 MORTGAGE 5500 MEDICAL 12.87 0.44
## 4 RENT 35000 MEDICAL 15.23 0.44
## 5 RENT 35000 MEDICAL 14.27 0.53
## 6 OWN 2500 VENTURE 7.14 0.19
## cb_person_cred_hist_length credit_score previous_loan_defaults_on_file
## 1 3 561 No
## 2 2 504 Yes
## 3 3 635 No
## 4 2 675 No
## 5 4 586 No
## 6 2 532 No
## loan_status
## 1 1
## 2 0
## 3 1
## 4 1
## 5 1
## 6 1
There are 45,000 rows in the data. Loans were approved 10,000 times (22%) and rejected 35,000 times (78%).
table(loan_df$loan_status)
##
## 0 1
## 35000 10000
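The same split expressed as proportions (a quick check, not part of the original output):
round(prop.table(table(loan_df$loan_status)), 2)  #0.78 rejected, 0.22 approved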
dim(loan_df)
## [1] 45000 14
str(loan_df)
## 'data.frame': 45000 obs. of 14 variables:
## $ person_age : num 22 21 25 23 24 21 26 24 24 21 ...
## $ person_gender : chr "female" "female" "female" "female" ...
## $ person_education : chr "Master" "High School" "High School" "Bachelor" ...
## $ person_income : num 71948 12282 12438 79753 66135 ...
## $ person_emp_exp : int 0 0 3 0 1 0 1 5 3 0 ...
## $ person_home_ownership : chr "RENT" "OWN" "MORTGAGE" "RENT" ...
## $ loan_amnt : num 35000 1000 5500 35000 35000 2500 35000 35000 35000 1600 ...
## $ loan_intent : chr "PERSONAL" "EDUCATION" "MEDICAL" "MEDICAL" ...
## $ loan_int_rate : num 16 11.1 12.9 15.2 14.3 ...
## $ loan_percent_income : num 0.49 0.08 0.44 0.44 0.53 0.19 0.37 0.37 0.35 0.13 ...
## $ cb_person_cred_hist_length : num 3 2 3 2 4 2 3 4 2 3 ...
## $ credit_score : int 561 504 635 675 586 532 701 585 544 640 ...
## $ previous_loan_defaults_on_file: chr "No" "Yes" "No" "No" ...
## $ loan_status : int 1 0 1 1 1 1 1 1 1 1 ...
names(loan_df)
## [1] "person_age" "person_gender"
## [3] "person_education" "person_income"
## [5] "person_emp_exp" "person_home_ownership"
## [7] "loan_amnt" "loan_intent"
## [9] "loan_int_rate" "loan_percent_income"
## [11] "cb_person_cred_hist_length" "credit_score"
## [13] "previous_loan_defaults_on_file" "loan_status"
describe(loan_df)
## loan_df
##
## 14 Variables 45000 Observations
## --------------------------------------------------------------------------------
## person_age
## n missing distinct Info Mean Gmd .05 .10
## 45000 0 60 0.994 27.76 6.015 22 22
## .25 .50 .75 .90 .95
## 24 26 30 35 39
##
## lowest : 20 21 22 23 24, highest: 94 109 116 123 144
## --------------------------------------------------------------------------------
## person_gender
## n missing distinct
## 45000 0 2
##
## Value female male
## Frequency 20159 24841
## Proportion 0.448 0.552
## --------------------------------------------------------------------------------
## person_education
## n missing distinct
## 45000 0 5
##
## Value Associate Bachelor Doctorate High School Master
## Frequency 12028 13399 621 11972 6980
## Proportion 0.267 0.298 0.014 0.266 0.155
## --------------------------------------------------------------------------------
## person_income
## n missing distinct Info Mean Gmd .05 .10
## 45000 0 33989 1 80319 51015 28367 35119
## .25 .50 .75 .90 .95
## 47204 67048 95789 133276 166755
##
## lowest : 8000 8037 8104 8186 8248
## highest: 2280980 2448661 5545545 5556399 7200766
## --------------------------------------------------------------------------------
## person_emp_exp
## n missing distinct Info Mean Gmd .05 .10
## 45000 0 63 0.987 5.41 5.996 0 0
## .25 .50 .75 .90 .95
## 1 4 8 13 17
##
## lowest : 0 1 2 3 4, highest: 100 101 121 124 125
## --------------------------------------------------------------------------------
## person_home_ownership
## n missing distinct
## 45000 0 4
##
## Value MORTGAGE OTHER OWN RENT
## Frequency 18489 117 2951 23443
## Proportion 0.411 0.003 0.066 0.521
## --------------------------------------------------------------------------------
## loan_amnt
## n missing distinct Info Mean Gmd .05 .10
## 45000 0 4483 0.999 9583 6790 2000 3000
## .25 .50 .75 .90 .95
## 5000 8000 12237 19200 24000
##
## lowest : 500 563 700 725 750, highest: 34375 34664 34800 34826 35000
## --------------------------------------------------------------------------------
## loan_intent
## n missing distinct
## 45000 0 6
##
## Value DEBTCONSOLIDATION EDUCATION HOMEIMPROVEMENT
## Frequency 7145 9153 4783
## Proportion 0.159 0.203 0.106
##
## Value MEDICAL PERSONAL VENTURE
## Frequency 8548 7552 7819
## Proportion 0.190 0.168 0.174
## --------------------------------------------------------------------------------
## loan_int_rate
## n missing distinct Info Mean Gmd .05 .10
## 45000 0 1302 1 11.01 3.381 6.17 6.99
## .25 .50 .75 .90 .95
## 8.59 11.01 12.99 14.97 16.00
##
## lowest : 5.42 5.43 5.44 5.46 5.47 , highest: 19.8 19.82 19.9 19.91 20
## --------------------------------------------------------------------------------
## loan_percent_income
## n missing distinct Info Mean Gmd .05 .10
## 45000 0 64 0.998 0.1397 0.09517 0.03 0.04
## .25 .50 .75 .90 .95
## 0.07 0.12 0.19 0.26 0.31
##
## lowest : 0 0.01 0.02 0.03 0.04, highest: 0.59 0.61 0.62 0.63 0.66
## --------------------------------------------------------------------------------
## cb_person_cred_hist_length
## n missing distinct Info Mean Gmd .05 .10
## 45000 0 29 0.982 5.867 3.99 2 2
## .25 .50 .75 .90 .95
## 3 4 8 11 14
##
## lowest : 2 3 4 5 6, highest: 26 27 28 29 30
## --------------------------------------------------------------------------------
## credit_score
## n missing distinct Info Mean Gmd .05 .10
## 45000 0 340 1 632.6 56.39 539 563
## .25 .50 .75 .90 .95
## 601 640 670 692 703
##
## lowest : 390 418 419 420 421, highest: 789 792 805 807 850
## --------------------------------------------------------------------------------
## previous_loan_defaults_on_file
## n missing distinct
## 45000 0 2
##
## Value No Yes
## Frequency 22142 22858
## Proportion 0.492 0.508
## --------------------------------------------------------------------------------
## loan_status
## n missing distinct Info Sum Mean Gmd
## 45000 0 2 0.519 10000 0.2222 0.3457
##
## --------------------------------------------------------------------------------
There is no missing data in this dataset, and the fields are a mix of numeric and character types.
vis_dat(loan_df)
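The absence of missing values can also be confirmed numerically (a quick check, not part of the original output):
sum(is.na(loan_df))      #total count of missing values; 0 here
colSums(is.na(loan_df))  #missing count per column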
numeric_features <- c("person_emp_exp","credit_score","loan_status","person_age","person_income","loan_amnt","loan_int_rate","loan_percent_income","cb_person_cred_hist_length")
categorical_features <- c("person_gender","person_education","person_home_ownership","loan_intent","previous_loan_defaults_on_file")
loan_df %>%
dplyr::select(-all_of(categorical_features)) %>%
cor() %>%
corrplot(order="hclust")
loan_df %>% ggplot(aes(x=person_age)) +
geom_boxplot() +
theme_bw()
loan_df %>% ggplot(aes(x=person_emp_exp)) +
geom_boxplot() +
theme_bw()
Removes rows with extreme outliers: ages of 100 or more and employment experience of 50 or more years are implausible and are dropped.
clean_loan_df <- loan_df %>% filter(person_age < 100, person_emp_exp < 50)
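A quick check (counts not shown in the original) of how many rows the outlier filter drops:
nrow(loan_df) - nrow(clean_loan_df)  #rows removed by the age and experience filters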
Splits the data into train and test sets.
set.seed(42)
train_indices <- createDataPartition(clean_loan_df$loan_status, p = 0.8, list = FALSE)
train <- clean_loan_df[train_indices,]
test <- clean_loan_df[-train_indices,]
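A quick check (not part of the original output) that the partition roughly preserves the 22/78 class balance in both sets:
round(prop.table(table(train$loan_status)), 2)
round(prop.table(table(test$loan_status)), 2)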
Performs a manual backward-selection process on the model variables, removing the least significant predictors at each step.
model1 = glm(loan_status~., family = binomial(link="logit"),train)
summary(model1)
##
## Call:
## glm(formula = loan_status ~ ., family = binomial(link = "logit"),
## data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.625e-01 4.068e-01 -1.137 0.2555
## person_age 2.091e-02 1.232e-02 1.697 0.0897 .
## person_gendermale 3.514e-02 3.964e-02 0.887 0.3753
## person_educationBachelor -1.465e-02 5.271e-02 -0.278 0.7810
## person_educationDoctorate -5.711e-04 1.617e-01 -0.004 0.9972
## person_educationHigh School 4.909e-02 5.505e-02 0.892 0.3725
## person_educationMaster 2.504e-02 6.308e-02 0.397 0.6914
## person_income 1.945e-06 4.862e-07 4.000 6.34e-05 ***
## person_emp_exp -1.509e-02 1.090e-02 -1.385 0.1660
## person_home_ownershipOTHER 2.033e-01 3.575e-01 0.569 0.5695
## person_home_ownershipOWN -1.417e+00 1.129e-01 -12.549 < 2e-16 ***
## person_home_ownershipRENT 7.378e-01 4.487e-02 16.443 < 2e-16 ***
## loan_amnt -1.079e-04 5.272e-06 -20.464 < 2e-16 ***
## loan_intentEDUCATION -8.901e-01 6.532e-02 -13.627 < 2e-16 ***
## loan_intentHOMEIMPROVEMENT 7.888e-04 7.365e-02 0.011 0.9915
## loan_intentMEDICAL -2.862e-01 6.311e-02 -4.535 5.76e-06 ***
## loan_intentPERSONAL -7.251e-01 6.687e-02 -10.844 < 2e-16 ***
## loan_intentVENTURE -1.234e+00 7.124e-02 -17.323 < 2e-16 ***
## loan_int_rate 3.346e-01 7.357e-03 45.482 < 2e-16 ***
## loan_percent_income 1.637e+01 3.998e-01 40.942 < 2e-16 ***
## cb_person_cred_hist_length -1.223e-02 1.057e-02 -1.157 0.2474
## credit_score -8.779e-03 4.582e-04 -19.157 < 2e-16 ***
## previous_loan_defaults_on_fileYes -2.038e+01 1.148e+02 -0.178 0.8590
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 38030 on 35988 degrees of freedom
## Residual deviance: 15888 on 35966 degrees of freedom
## AIC: 15934
##
## Number of Fisher Scoring iterations: 19
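Before any variables are dropped, note that the coefficient for previous_loan_defaults_on_fileYes is about -20 with a standard error of about 115, and the fit needed 19 Fisher scoring iterations; both are classic signs of quasi-complete separation, so its Wald p-value of 0.859 should not be trusted. A quick cross-tab (output not shown here) would confirm whether any approved loans in the training data have a previous default on file:
table(train$previous_loan_defaults_on_file, train$loan_status)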
Removes fields from the model with p-values above 0.8 (person_education and previous_loan_defaults_on_file). Because the p-value for previous_loan_defaults_on_file is an artifact of the separation noted above, dropping it sacrifices a great deal of fit: the residual deviance rises from roughly 15,888 to 25,901 in the next model.
train2 <- train %>% dplyr::select(-c(person_education,previous_loan_defaults_on_file))
model2 = glm(loan_status~., family = binomial(link="logit"),train2)
summary(model2)
##
## Call:
## glm(formula = loan_status ~ ., family = binomial(link = "logit"),
## data = train2)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.826e+00 3.060e-01 -22.306 < 2e-16 ***
## person_age 7.024e-03 9.876e-03 0.711 0.47696
## person_gendermale -1.326e-03 3.159e-02 -0.042 0.96651
## person_income 1.821e-06 3.470e-07 5.249 1.53e-07 ***
## person_emp_exp -6.347e-03 8.709e-03 -0.729 0.46617
## person_home_ownershipOTHER 7.321e-01 2.839e-01 2.578 0.00993 **
## person_home_ownershipOWN -1.404e+00 9.675e-02 -14.509 < 2e-16 ***
## person_home_ownershipRENT 8.731e-01 3.671e-02 23.783 < 2e-16 ***
## loan_amnt -1.072e-04 4.084e-06 -26.250 < 2e-16 ***
## loan_intentEDUCATION -9.381e-01 5.209e-02 -18.008 < 2e-16 ***
## loan_intentHOMEIMPROVEMENT -3.852e-02 5.816e-02 -0.662 0.50781
## loan_intentMEDICAL -2.971e-01 4.914e-02 -6.044 1.50e-09 ***
## loan_intentPERSONAL -6.851e-01 5.325e-02 -12.865 < 2e-16 ***
## loan_intentVENTURE -1.194e+00 5.702e-02 -20.942 < 2e-16 ***
## loan_int_rate 3.390e-01 6.066e-03 55.885 < 2e-16 ***
## loan_percent_income 1.642e+01 2.942e-01 55.810 < 2e-16 ***
## cb_person_cred_hist_length -7.521e-03 8.579e-03 -0.877 0.38068
## credit_score -2.058e-04 3.140e-04 -0.655 0.51223
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 38030 on 35988 degrees of freedom
## Residual deviance: 25901 on 35971 degrees of freedom
## AIC: 25937
##
## Number of Fisher Scoring iterations: 6
Removes fields from the model with p-values above 0.6 (person_gender).
train3 <- train2 %>% dplyr::select(-c(person_gender))
model3 = glm(loan_status~., family = binomial(link="logit"),train3)
summary(model3)
##
## Call:
## glm(formula = loan_status ~ ., family = binomial(link = "logit"),
## data = train3)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.826e+00 3.056e-01 -22.337 < 2e-16 ***
## person_age 7.020e-03 9.875e-03 0.711 0.47716
## person_income 1.821e-06 3.470e-07 5.249 1.53e-07 ***
## person_emp_exp -6.348e-03 8.709e-03 -0.729 0.46609
## person_home_ownershipOTHER 7.321e-01 2.839e-01 2.578 0.00993 **
## person_home_ownershipOWN -1.404e+00 9.675e-02 -14.510 < 2e-16 ***
## person_home_ownershipRENT 8.731e-01 3.671e-02 23.783 < 2e-16 ***
## loan_amnt -1.072e-04 4.084e-06 -26.250 < 2e-16 ***
## loan_intentEDUCATION -9.381e-01 5.209e-02 -18.008 < 2e-16 ***
## loan_intentHOMEIMPROVEMENT -3.850e-02 5.816e-02 -0.662 0.50799
## loan_intentMEDICAL -2.970e-01 4.914e-02 -6.044 1.50e-09 ***
## loan_intentPERSONAL -6.851e-01 5.325e-02 -12.865 < 2e-16 ***
## loan_intentVENTURE -1.194e+00 5.702e-02 -20.943 < 2e-16 ***
## loan_int_rate 3.390e-01 6.066e-03 55.885 < 2e-16 ***
## loan_percent_income 1.642e+01 2.942e-01 55.811 < 2e-16 ***
## cb_person_cred_hist_length -7.516e-03 8.578e-03 -0.876 0.38092
## credit_score -2.057e-04 3.140e-04 -0.655 0.51225
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 38030 on 35988 degrees of freedom
## Residual deviance: 25901 on 35972 degrees of freedom
## AIC: 25935
##
## Number of Fisher Scoring iterations: 6
Removes fields from the model with p-values above 0.4 (person_age, person_emp_exp, and credit_score).
train4 <- train3 %>% dplyr::select(-c(person_age,person_emp_exp,credit_score))
model4 = glm(loan_status~., family = binomial(link="logit"),train4)
summary(model4)
##
## Call:
## glm(formula = loan_status ~ ., family = binomial(link = "logit"),
## data = train4)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.799e+00 1.000e-01 -67.985 < 2e-16 ***
## person_income 1.821e-06 3.468e-07 5.251 1.51e-07 ***
## person_home_ownershipOTHER 7.294e-01 2.838e-01 2.570 0.0102 *
## person_home_ownershipOWN -1.404e+00 9.673e-02 -14.519 < 2e-16 ***
## person_home_ownershipRENT 8.730e-01 3.671e-02 23.785 < 2e-16 ***
## loan_amnt -1.071e-04 4.081e-06 -26.256 < 2e-16 ***
## loan_intentEDUCATION -9.396e-01 5.204e-02 -18.055 < 2e-16 ***
## loan_intentHOMEIMPROVEMENT -3.818e-02 5.813e-02 -0.657 0.5113
## loan_intentMEDICAL -2.974e-01 4.914e-02 -6.052 1.43e-09 ***
## loan_intentPERSONAL -6.857e-01 5.325e-02 -12.877 < 2e-16 ***
## loan_intentVENTURE -1.195e+00 5.701e-02 -20.960 < 2e-16 ***
## loan_int_rate 3.388e-01 6.063e-03 55.883 < 2e-16 ***
## loan_percent_income 1.641e+01 2.940e-01 55.831 < 2e-16 ***
## cb_person_cred_hist_length -6.702e-03 4.145e-03 -1.617 0.1059
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 38030 on 35988 degrees of freedom
## Residual deviance: 25902 on 35975 degrees of freedom
## AIC: 25930
##
## Number of Fisher Scoring iterations: 6
Removes fields from the model with p-values above 0.1 (cb_person_cred_hist_length).
train5 <- train4 %>% dplyr::select(-c(cb_person_cred_hist_length))
model5 = glm(loan_status~., family = binomial(link="logit"),train5)
summary(model5)
##
## Call:
## glm(formula = loan_status ~ ., family = binomial(link = "logit"),
## data = train5)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.832e+00 9.801e-02 -69.706 < 2e-16 ***
## person_income 1.770e-06 3.454e-07 5.125 2.98e-07 ***
## person_home_ownershipOTHER 7.343e-01 2.836e-01 2.589 0.00962 **
## person_home_ownershipOWN -1.405e+00 9.672e-02 -14.526 < 2e-16 ***
## person_home_ownershipRENT 8.728e-01 3.670e-02 23.781 < 2e-16 ***
## loan_amnt -1.071e-04 4.081e-06 -26.251 < 2e-16 ***
## loan_intentEDUCATION -9.368e-01 5.201e-02 -18.011 < 2e-16 ***
## loan_intentHOMEIMPROVEMENT -4.236e-02 5.807e-02 -0.729 0.46573
## loan_intentMEDICAL -2.992e-01 4.913e-02 -6.090 1.13e-09 ***
## loan_intentPERSONAL -6.871e-01 5.324e-02 -12.905 < 2e-16 ***
## loan_intentVENTURE -1.195e+00 5.700e-02 -20.962 < 2e-16 ***
## loan_int_rate 3.387e-01 6.063e-03 55.871 < 2e-16 ***
## loan_percent_income 1.641e+01 2.940e-01 55.820 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 38030 on 35988 degrees of freedom
## Residual deviance: 25904 on 35976 degrees of freedom
## AIC: 25930
##
## Number of Fisher Scoring iterations: 6
Tests for multicollinearity. None of the generalized VIF values is exceptionally large, so no further action is required in this step.
vif(model5)
## GVIF Df GVIF^(1/(2*Df))
## person_income 1.528758 1 1.236430
## person_home_ownership 1.138824 3 1.021902
## loan_amnt 2.666317 1 1.632886
## loan_intent 1.052181 5 1.005100
## loan_int_rate 1.100726 1 1.049155
## loan_percent_income 2.647676 1 1.627168
Reviews the distributions of the variables used in the model. Logistic regression does not require normally distributed predictors, but heavy skew (as in person_income) and sparse categories (such as the OTHER home-ownership level) are still worth noting.
ggplot(train5,aes(x=person_income))+
geom_histogram(fill="lightblue",color="black")+
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(train5,aes(x=person_home_ownership))+
geom_bar(fill="lightblue",color="black")+
theme_bw()
ggplot(train5,aes(x=loan_amnt))+
geom_histogram(fill="lightblue",color="black")+
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(train5,aes(x=loan_intent))+
geom_bar(fill="lightblue",color="black")+
theme_bw()
ggplot(train5,aes(x=loan_int_rate))+
geom_histogram(fill="lightblue",color="black")+
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(train5,aes(x=loan_percent_income))+
geom_histogram(fill="lightblue",color="black")+
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Scales the numeric features used in the model; the test set is scaled with the training-set statistics to avoid leakage.
numeric_model_features <- c("person_income","loan_amnt","loan_int_rate","loan_percent_income")
categorical_model_features <- c("person_home_ownership","loan_intent")
target <- "loan_status"
train_scaled <- train %>% dplyr::select(all_of(numeric_model_features)) %>%
scale()
test_scaled <- test %>% dplyr::select(all_of(numeric_model_features)) %>% #uses train data to perform scaling to avoid leakage
scale(center=attr(train_scaled, "scaled:center"), scale=attr(train_scaled, "scaled:scale"))
train_scaled <- train %>% #adds back in target and categorical features
dplyr::select(all_of(c(categorical_model_features,target))) %>%
bind_cols(train_scaled)
test_scaled <- test %>% dplyr::select(all_of(c(categorical_model_features,target))) %>%
bind_cols(test_scaled)
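A quick sanity check (not part of the original output) that the training features are centered and scaled, and that the test set reuses the training statistics rather than its own:
round(colMeans(train_scaled[numeric_model_features]), 3)  #approximately 0 on the training set
round(colMeans(test_scaled[numeric_model_features]), 3)   #close to, but not exactly, 0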
model6 = glm(loan_status~., family = binomial(link="logit"),train_scaled)
summary(model6)
##
## Call:
## glm(formula = loan_status ~ ., family = binomial(link = "logit"),
## data = train_scaled)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.69779 0.04398 -38.605 < 2e-16 ***
## person_home_ownershipOTHER 0.73426 0.28357 2.589 0.00962 **
## person_home_ownershipOWN -1.40495 0.09672 -14.526 < 2e-16 ***
## person_home_ownershipRENT 0.87281 0.03670 23.781 < 2e-16 ***
## loan_intentEDUCATION -0.93681 0.05201 -18.011 < 2e-16 ***
## loan_intentHOMEIMPROVEMENT -0.04236 0.05807 -0.729 0.46573
## loan_intentMEDICAL -0.29919 0.04913 -6.090 1.13e-09 ***
## loan_intentPERSONAL -0.68706 0.05324 -12.905 < 2e-16 ***
## loan_intentVENTURE -1.19493 0.05700 -20.962 < 2e-16 ***
## person_income 0.11270 0.02199 5.125 2.98e-07 ***
## loan_amnt -0.67864 0.02585 -26.251 < 2e-16 ***
## loan_int_rate 1.00773 0.01804 55.871 < 2e-16 ***
## loan_percent_income 1.43417 0.02569 55.820 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 38030 on 35988 degrees of freedom
## Residual deviance: 25904 on 35976 degrees of freedom
## AIC: 25930
##
## Number of Fisher Scoring iterations: 6
fit_results <- predict(model6,newdata=test_scaled,type='response')
fit_results <- ifelse(fit_results > 0.5,1,0)
class_error <- mean(fit_results != test_scaled$loan_status)
print(paste0('Accuracy: ',round((1-class_error)*100,2),"%"))
## [1] "Accuracy: 84.19%"
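Accuracy alone hides the balance between false approvals and false rejections; a confusion matrix for this model is worth inspecting as well (output not shown here):
table(Predicted = fit_results, Actual = test_scaled$loan_status)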
fit_results <- predict(model1,newdata=test,type='response')
fit_results <- ifelse(fit_results > 0.5,1,0)
class_error <- mean(fit_results != test$loan_status)
print(paste0('Accuracy: ',round((1-class_error)*100,2),"%"))
## [1] "Accuracy: 89.53%"
fit_results <- predict(model2,newdata=test,type='response')
fit_results <- ifelse(fit_results > 0.5,1,0)
class_error <- mean(fit_results != test$loan_status)
print(paste0('Accuracy: ',round((1-class_error)*100,2),"%"))
## [1] "Accuracy: 84.18%"
fit_results <- predict(model3,newdata=test,type='response')
fit_results <- ifelse(fit_results > 0.5,1,0)
class_error <- mean(fit_results != test$loan_status)
print(paste0('Accuracy: ',round((1-class_error)*100,2),"%"))
## [1] "Accuracy: 84.18%"
fit_results <- predict(model4,newdata=test,type='response')
fit_results <- ifelse(fit_results > 0.5,1,0)
class_error <- mean(fit_results != test$loan_status)
print(paste0('Accuracy: ',round((1-class_error)*100,2),"%"))
## [1] "Accuracy: 84.21%"
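The predict/threshold/score pattern above is repeated for every model; a small helper function (hypothetical, not part of the original analysis) would remove the duplication:
accuracy_for <- function(model, newdata, actual, threshold = 0.5) {
  probs <- predict(model, newdata = newdata, type = "response")
  mean(ifelse(probs > threshold, 1, 0) == actual)
}
#e.g. accuracy_for(model1, test, test$loan_status)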
mae(test_scaled$loan_status,fit_results) #note: fit_results here are the thresholded class predictions from the chunk above, while the lines below use raw predicted probabilities
## [1] 0.1579415
mae(test$loan_status,predict(model1,newdata = test,type='response'))
## [1] 0.1434256
mae(test$loan_status,predict(model2,newdata = test,type='response'))
## [1] 0.2251504
mae(test$loan_status,predict(model3,newdata = test,type='response'))
## [1] 0.2251488
mae(test$loan_status,predict(model4,newdata = test,type='response'))
## [1] 0.225173
mae(test$loan_status,predict(model5,newdata = test,type='response'))
## [1] 0.225191
Model 1 displayed the lowest MAE and the highest accuracy despite the attempts to improve the model through successive reductions. This is largely because previous_loan_defaults_on_file, removed in the first reduction step on the strength of an unreliable p-value, is by far the strongest predictor: every model fit without it gives up roughly five percentage points of accuracy.
model <- rpart(loan_status ~ ., method="class",data=train)
The decision tree uses previous_loan_defaults_on_file, loan_percent_income, loan_int_rate, person_income, and person_home_ownership as the key predictors for loan_status.
rpart.plot(model, box.palette="RdBu", shadow.col="gray", nn=TRUE)
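The tree's variable importance scores (output not shown here) provide a quick way to verify which predictors drive the splits:
model$variable.importance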
p <- predict(model, test, type = "class")
table(p, test$loan_status)
##
## p 0 1
## 0 6698 559
## 1 258 1482
fit_results <- predict(model,newdata=test,type='class')
class_error <- mean(fit_results != test$loan_status)
print(paste0('Accuracy: ',round((1-class_error)*100,2),"%"))
## [1] "Accuracy: 90.92%"
The decision tree achieves higher accuracy than every logistic regression model attempted in step 2, so a decision tree may be more appropriate for modeling this particular dataset.
The scree plot indicates that 4 is the appropriate number of factors for this dataset.
ev <- clean_loan_df %>% dplyr::select(all_of(numeric_features)) %>%
cor() %>%
eigen()
ev$values
## [1] 2.87238630 1.88788443 1.27971812 1.01420870 0.95300076 0.60556782 0.17560649
## [8] 0.16758966 0.04403771
#Plot a scree plot using base graphics:
Factor <- 1:9
Eigen_Values <- ev$values
Scree <- data.frame(Factor, Eigen_Values)
plot(Scree, main = "Scree Plot", col= "Blue",ylim=c(0,4))
lines(Scree,col='Red')
abline(h = 1, col="Green")
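Equivalently, the Kaiser criterion (retain factors with eigenvalues greater than 1) points to the same choice:
sum(ev$values > 1)  #4, given the eigenvalues printed above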
The chi-square test reported by factanal actually rejects the hypothesis that 4 factors are sufficient (p ≈ 3.8e-53). With 45,000 observations even minor misfit is highly significant, so the 4-factor solution suggested by the scree plot is retained regardless.
nfactors <- 4
fit1 <- clean_loan_df %>% dplyr::select(all_of(numeric_features)) %>%
factanal(nfactors,scores = c("regression"),rotation = "varimax")
print(fit1)
##
## Call:
## factanal(x = ., factors = nfactors, scores = c("regression"), rotation = "varimax")
##
## Uniquenesses:
## person_emp_exp credit_score
## 0.089 0.969
## loan_status person_age
## 0.005 0.005
## person_income loan_amnt
## 0.546 0.005
## loan_int_rate loan_percent_income
## 0.869 0.005
## cb_person_cred_hist_length
## 0.223
##
## Loadings:
## Factor1 Factor2 Factor3 Factor4
## person_emp_exp 0.953
## credit_score 0.174
## loan_status 0.971 -0.210
## person_age 0.995
## person_income 0.102 0.666
## loan_amnt 0.866 0.140 0.475
## loan_int_rate 0.348
## loan_percent_income 0.876 0.231 -0.417
## cb_person_cred_hist_length 0.880
##
## Factor1 Factor2 Factor3 Factor4
## SS loadings 2.713 1.530 1.139 0.900
## Proportion Var 0.301 0.170 0.127 0.100
## Cumulative Var 0.301 0.471 0.598 0.698
##
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 259.53 on 6 degrees of freedom.
## The p-value is 3.77e-53
fa_var <- clean_loan_df %>% dplyr::select(all_of(numeric_features)) %>%
fa(r=., nfactors = 4, rotate="varimax",fm="pa")
## maximum iteration exceeded
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect. Try a
## different factor score estimation method.
## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected. Examine the results carefully
fa.diagram(fa_var)
head(fa_var$scores)
## PA1 PA2 PA3 PA4
## [1,] -1.0301953 4.17880191 0.9179240 1.6139383
## [2,] -1.0841345 -0.94971305 -0.9130053 -0.2926295
## [3,] -0.1405244 2.95782622 -4.5132567 0.8775261
## [4,] -0.9153672 3.62953972 1.5436974 1.5596118
## [5,] -0.6655125 4.68131468 0.3925582 1.3579715
## [6,] -1.0232959 0.02801921 -1.7955365 0.3864672
Despite the ultra-Heywood warning (a communality estimate of 1 or more, so the factor scores should be read with some caution), the groupings indicated by the factor analysis make logical sense. One factor collects the age- and tenure-related features (person_age, person_emp_exp, cb_person_cred_hist_length); loan_percent_income and loan_amnt would logically move together; and the pairing of loan_int_rate with loan_status is interesting, but it aligns with the decision tree, which also leaned heavily on the interest rate.
Lo, T. (2024, October 28). Loan Approval Classification Dataset. Retrieved from Kaggle: https://www.kaggle.com/datasets/taweilo/loan-approval-classification-data/data