This dataset is titled “Bank Loan Default Dataset” and shows various explanatory variables and one target/response variable (Default). The purpose of collecting this dataset is to see what factors may cause peoples loans to be in default or not. This dataset was collected from your Course Project Data Repository. This dataset contains 1000 observations and 16 variables (15 feature variables and 1 target variable). The Checking_amount is a numerical variable that shows the amount of money in ones checking account. The Saving_amount is a numerical variable that shows the amount of money in ones saving account. The term (numerical) is the duration of the loan term. The credit_score (numerical) shows ones credit score. The gender (categorical) consists of male and female. The marital_status (categorical) consists of married or single. The car_loan, personal_loan, home_loan, and education_loan (all categorical/binary) shows if people have loans in any of those areas. The emp_status (categorical) consists of unemployed or employed. The amount (numerical) shows the amount of the loan. The emp_duration (num) shows length of employment in months. Age (num) shows age and no_of_credit_account (num) shows number of credit accounts. The overall goal of this project is to see how significant the feature variables are in predicting if ones loan will be in default (1 if default, 0 if not).
The original dataset did not have any missing values so I needed to manually create them. For this dataset, the variables Gender, Marital_status, Emp_status, credit_score, amount, emp_duration, checking_amount, and age have missing values. Missing numerical values can be resolved by imputing the mean. Missing categorical values can be resolved by imputing the mode.
Based on this dataset, we can form two practical questions. We can use both linear and logistic regression. One question we can form is what variables are the most important in predicting if ones loan will be in default (1 if default, 0 if not). Since Default is a binary variable, we can use logistic regression. However, binary variables cannot be used for linear regression so we would have to adjust our question to use a continuous dependent variable. In theory, credit score can be a useful indicator of whether ones loan will be in default or not because a loan default can damage credit scores. Therefore, we can adjust our question to what variables are the most important in predicting credit score.
The first part of the project is doing EDA. Then we impute the missing values using the MICE procedure. Then we do feature engineering and use the modified variables to do linear/logistic regression.
Read in dataset.
bank <- read.csv("BankLoanDefaultDataset.csv")
bank$Car_loan <- as.factor(bank$Car_loan)
bank$Personal_loan <- as.factor(bank$Personal_loan)
bank$Home_loan <- as.factor(bank$Home_loan)
bank$Education_loan <- as.factor(bank$Education_loan)
bank$Gender <- as.factor(bank$Gender)
bank$Marital_status <- as.factor(bank$Marital_status)
bank$Emp_status <- as.factor(bank$Emp_status)
bank$Default <- as.factor(bank$Default)
The original dataset does not have any missing values so I needed to manually create them.
gender.missing.id <- sample(1:1000, 20 , replace = FALSE)
Marital.missing.id <- sample(1:1000, 20, replace = FALSE)
emp.status.missing.id <- sample(1:1000, 20, replace = FALSE)
credit.missing.id <- sample(1:1000, 20, replace = FALSE)
amount.missing.id <- sample(1:1000, 20, replace = FALSE)
emp.duration.missing.id <- sample(1:1000, 20, replace = FALSE)
check.amt.missing.id <- sample(1:1000, 20, replace = FALSE)
age.missing.id <- sample(1:1000, 20, replace = FALSE)
bank$Gender[gender.missing.id] <- NA
bank$Marital_status[Marital.missing.id] <- NA
bank$Emp_status[emp.status.missing.id] <- NA
bank$Credit_score[credit.missing.id] <- NA
bank$Amount[amount.missing.id] <- NA
bank$Emp_duration[emp.duration.missing.id] <- NA
bank$Checking_amount[check.amt.missing.id] <- NA
bank$Age[age.missing.id] <- NA
This section will show the distribution of each individual feature and relationship between features. Some features will have missing values and can be resolved by imputing the mean or mode.
The below figure shows the distribution of the Gender variable. There are significantly more males than females.
ggplot(bank, aes(x = Gender)) +
geom_bar() +
labs(title = "Gender")
The below figure shows the distribution of the Marital_status variable. Married and single people have similar counts.
ggplot(bank, aes(x = Marital_status)) +
geom_bar() +
labs(title = "Marital_status")
The below figure shows the distribution of the Emp_status variable. There are significantly more unemployed than employed.
ggplot(bank, aes(x = Emp_status)) +
geom_bar() +
labs(title = "Emp_status")
The below figure shows the distribution of the car loan variable. There are significantly more that do not have a car loan than those that do.
ggplot(bank, aes(x = Car_loan)) +
geom_bar() +
labs(title = "Car_loan")
The below figure shows the distribution of the personal loan variable. Those that have a personal loan and those that do not have similar counts.
ggplot(bank, aes(x = Personal_loan)) +
geom_bar() +
labs(title = "Personal_loan")
The below figure shows the distribution of the education loan variable. There are significantly more that do not have an education loan than those that do.
ggplot(bank, aes(x = Education_loan)) +
geom_bar() +
labs(title = "Education_loan")
The below figure shows the distribution of the home loan variable. There are significantly more that do not have a home loan than those that do.
ggplot(bank, aes(x = Home_loan)) +
geom_bar() +
labs(title = "Home_loan")
The below figure shows the distribution of the credit score variable. There are no alarming trends outside of a couple anomalies.
ggplot(data = bank, aes(x = Credit_score)) +
geom_boxplot() +
labs(title = "Credit_score")
The below figure shows the distribution of the checking amount variable. There are no alarming trends.
ggplot(data = bank, aes(x = Checking_amount)) +
geom_boxplot() +
labs(title = "Checking_amount")
The below figure shows the distribution of the term variable. There are no alarming trends.
ggplot(data = bank, aes(x = Term)) +
geom_boxplot() +
labs(title = "Term")
The below figure shows the distribution of the amount variable. There are no alarming trends.
ggplot(data = bank, aes(x = Amount)) +
geom_boxplot() +
labs(title = "Amount")
The below figure shows the distribution of the Saving amount variable. There are no alarming trends.
ggplot(data = bank, aes(x = Saving_amount)) +
geom_boxplot() +
labs(title = "Saving amount")
The below figure shows the distribution of the Emp_duration variable. There are no alarming trends.
ggplot(data = bank, aes(x = Emp_duration)) +
geom_boxplot() +
labs(title = "Emp duration")
The below figure shows the distribution of the age variable. There are no alarming trends.
ggplot(data = bank, aes(x = Age)) +
geom_boxplot() +
labs(title = "Age")
The below figure shows the distribution of the No_of_credit_acc variable. This variable seems to be heavily skewed.
ggplot(data = bank, aes(x = No_of_credit_acc)) +
geom_boxplot() +
labs(title = "No_of_credit_acc")
In this section, we will show the relationship between one categorical feature and one numerical feature.
The below figure shows the relationship between credit score and gender. Based on the graph, the credit score ranges look to be similar across both genders. There are some anomalies but I do not believe that they will have a significant effect on the analysis. There are missing values but they can be resolved using imputation.
ggplot(bank, aes(x=Credit_score, y=Gender, fill=Gender)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("Credit Score by Gender")
The below figure shows the relationship between loan amount and gender. Based on the graph, the loan amount ranges look to be similar across both genders. There are some anomalies but I do not believe that they will have a significant effect on the analysis. There are missing values but they can be resolved using imputation.
ggplot(bank, aes(x=Amount, y=Gender, fill=Gender)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("Amount by Gender")
The below figure shows the relationship between Employment duration and gender. Based on the graph, it seems like males have a longer employment duration than females. There are missing values but they can be resolved using imputation.
ggplot(bank, aes(x=Emp_duration, y=Gender, fill=Gender)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("Employment duration by Gender")
The below figure shows the relationship between Age and gender. Based on the graph, the Age ranges look to be similar across both genders. There are some anomalies but I do not believe that they will have a significant effect on the analysis. There are missing values but they can be resolved using imputation.
ggplot(bank, aes(x=Age, y=Gender, fill=Gender)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("Age by Gender")
The below figure shows the relationship between marital status and credit score. Based on the graph, the credit score ranges look to be similar across both single and married people. There are some anomalies but I do not believe that they will have a significant effect on the analysis. There are missing values but they can be resolved using imputation.
ggplot(bank, aes(x=Credit_score, y=Marital_status, fill=Marital_status)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("Credit Score by Marital status")
The below figure shows the relationship between loan amount and marital status. Based on the graph, the loan amount ranges look to be similar across both marital statuses. There are some anomalies but I do not believe that they will have a significant effect on the analysis. There are missing values but they can be resolved using imputation.
ggplot(bank, aes(x=Amount, y=Marital_status, fill=Marital_status)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("Amount by Marital status")
The below figure shows the relationship between marital status and employment duration. Based on the graph, it seems like married people have longer employment duration than single people. There are missing values but they can be resolved using imputation.
ggplot(bank, aes(x=Emp_duration, y=Marital_status, fill=Marital_status)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("Employment duration by Marital status")
The below figure shows the relationship between Age and marital status. Based on the graph, the Age ranges look to be similar across both marital statuses. There are some anomalies but I do not believe that they will have a significant effect on the analysis. There are missing values but they can be resolved using imputation.
ggplot(bank, aes(x=Age, y=Marital_status, fill=Marital_status)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("age by Marital status")
The below figure shows the relationship between credit score and employment status. Based on the graph, the credit score ranges look to be similar across both employed and unemployed people. There seems to be more anomalies for unemployed people but I do not believe that they will have a significant effect on the analysis. There are missing values but they can be resolved using imputation.
ggplot(bank, aes(x=Credit_score, y=Emp_status, fill=Emp_status)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("Credit Score by Employment status")
The below figure shows the relationship between loan amount and employment status. Based on the graph, the loan amount ranges look to be similar across both employed/unemployed. Again there seems to be more anomalies for unemployed people than employed. There are missing values but they can be resolved using imputation.
ggplot(bank, aes(x=Amount, y=Emp_status, fill=Emp_status)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("Amount by Employment status")
The below figure shows the relationship between employment duration and employment status. Based on the graph, it seems like unemployed people have longer and more varied employment duration than employed people. There are missing values but they can be resolved using imputation.
ggplot(bank, aes(x=Emp_duration, y=Emp_status, fill=Emp_status)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("Employment duration by employment status")
The below figure shows the relationship between Age and employment status. Based on the graph, the Age ranges look to be similar across both statuses. There are some anomalies but I do not believe that they will have a significant effect on the analysis. There are missing values but they can be resolved using imputation.
ggplot(bank, aes(x=Age, y=Emp_status, fill=Emp_status)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("age by employment status")
The below figure shows the relationship between Age and car loan. Based on the graph, the Age ranges look to be similar across both those who own a car loan and those who do not.
# convert car, home, personal, and education loans into categorical to use for EDA purposes
# bank$Car_loan <- as.factor(bank$Car_loan)
# bank$Personal_loan <- as.factor(bank$Personal_loan)
# bank$Home_loan <- as.factor(bank$Home_loan)
# bank$Education_loan <- as.factor(bank$Education_loan)
ggplot(bank, aes(x=Age, y=Car_loan, fill=Car_loan)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("age by car loan")
The below figure shows the relationship between Age and personal loan. Based on the graph, the Age ranges look to be similar across all those that have a personal loan or not.
ggplot(bank, aes(x=Age, y=Personal_loan, fill=Personal_loan)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("age by personal_loan")
The below figure shows the relationship between Age and home loan. Based on the graph, the Age ranges look to be similar across all those that have a home loan or not.
ggplot(bank, aes(x=Age, y=Home_loan, fill=Home_loan)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("age by home_loan")
The below figure shows the relationship between Age and education loan. Based on the graph, it seems like younger people have an education loan and older people do not.
ggplot(bank, aes(x=Age, y=Education_loan, fill=Education_loan)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("age by education loan")
Next, we will examine the relationship between two numerical features.
The below graph shows employment duration and credit score. Based on the graph, it seems like regardless of the employment duration, the majority of the credit scores seem to fall in the 700-900 range.
ggplot(data = bank, aes(x = Credit_score, y = Emp_duration)) +
geom_point() +
ggtitle("Employment Duration vs Credit Score")
The below graph shows employment duration and loan amount. There does not seem to be any patterns based on this graph.
ggplot(data = bank, aes(x = Amount, y = Emp_duration)) +
geom_point() +
ggtitle("Employment Duration vs Loan Amount")
The below graph shows employment duration and age. There does not seem to be any patterns based on this graph.
ggplot(data = bank, aes(x = Age, y = Emp_duration)) +
geom_point() +
ggtitle("Employment Duration vs Age")
The below graph shows age and Credit score. Again, it seems like regardless of age, most of the credit scores seem to fall in that 700-900 range.
ggplot(data = bank, aes(x = Age, y = Credit_score)) +
geom_point() +
ggtitle("Age vs credit score")
The below graph shows checking amount and saving amount. There does not seem to be any correlation between these two variables.
ggplot(data = bank, aes(x = Checking_amount, y = Saving_amount)) +
geom_point() +
ggtitle("Checking amount vs savings Amount")
The below graph shows age and loan amount. It seems like regardless of age, most of the loan amounts seem to fall in a certain range.
ggplot(data = bank, aes(x = Age, y = Amount)) +
geom_point() +
ggtitle("age vs Loan Amount")
The below graph shows credit score and loan amount. It seems like most of the points are clumped together.
ggplot(data = bank, aes(x = Credit_score, y = Amount)) +
geom_point() +
ggtitle("credit score vs Loan Amount")
This section shows the relationships between two categorical features.
The below graph shows Gender and Employment status. It seems like the amount of employed and unemployed females are similar but there are significantly more unemployed males than employed.
ggplot(bank, aes(Gender, ..count..)) + geom_bar(aes(fill = Emp_status), position = "dodge")+ggtitle("gender vs employment status")
The below graph shows Gender and marital status. It seems like all females are single and there are significantly more married males than single.
ggplot(bank, aes(Gender, ..count..)) + geom_bar(aes(fill = Marital_status), position = "dodge")+ggtitle("gender vs marital status")
The below graph shows employment status and marital status. It seems like the amount of employed and unemployed people are relatively similar but for married people, there are significantly more who are unemployed.
ggplot(bank, aes(Marital_status, ..count..)) + geom_bar(aes(fill = Emp_status), position = "dodge")+ggtitle("marital status vs enployment status")
The below graph shows gender and car loan. It seems like for both males and females, there are significantly more that do not have car loans than those that do.
ggplot(bank, aes(Gender, ..count..)) + geom_bar(aes(fill = Car_loan), position = "dodge")+ggtitle("gender vs car loan")
The below graph shows gender and education loan. It seems like for both males and females, there are significantly more that do not have education loans than those that do.
ggplot(bank, aes(Gender, ..count..)) + geom_bar(aes(fill = Education_loan), position = "dodge")+ggtitle("Gender vs education Loan")
The below graph shows gender and personal loan. It seems like for both males and females, the amounts that who have personal loans and those that do not are relatively similar.
ggplot(bank, aes(Gender, ..count..)) + geom_bar(aes(fill = Personal_loan), position = "dodge")+ggtitle("Gender vs personal Loan")
The below graph shows gender and home loan. It seems like for both males and females, there are significantly more that do not have home loans than those that do.
ggplot(bank, aes(Gender, ..count..)) + geom_bar(aes(fill = Home_loan), position = "dodge")+ggtitle("Gender vs home Loan")
We use the MICE imputation method in this section.
complete_bank_data is the complete dataset.
init <- mice(bank, maxit = 0)
init$method
Default Checking_amount Term Credit_score
"" "pmm" "" "pmm"
Gender Marital_status Car_loan Personal_loan
"logreg" "logreg" "" ""
Home_loan Education_loan Emp_status Amount
"" "" "logreg" "pmm"
Saving_amount Emp_duration Age No_of_credit_acc
"" "pmm" "pmm" ""
imp <- mice(bank, method = c("","pmm", "", "pmm", "logreg", "logreg", "", "", "", "", "logreg", "pmm", "", "pmm", "pmm", ""),
maxit = 10,
m = 5,
seed=123,
print=F)
complete_bank_data <- complete(imp)
This section focuses on feature engineering. Box-Cox transformation and standardization are used.
The only variable that is skewed is No_of_credit_acc so we use a log transformation.
boxcox_result <- boxcox(lm(complete_bank_data$No_of_credit_acc ~ 1), lambda = seq(0, 1, by = 0.1))
title("Box-Cox Transformation")
optimal_lambda <- boxcox_result$x[which.max(boxcox_result$y)]
optimal_lambda
[1] 0
complete_bank_data$Num_credit_acc_log_value <- log(complete_bank_data$No_of_credit_acc)
head(complete_bank_data)
Default Checking_amount Term Credit_score Gender Marital_status Car_loan
1 0 988 15 796 Female Single 1
2 0 458 15 813 Female Single 1
3 0 158 14 756 Female Single 0
4 1 300 25 737 Female Single 0
5 1 63 24 662 Female Single 0
6 0 1071 20 828 Male Married 1
Personal_loan Home_loan Education_loan Emp_status Amount Saving_amount
1 0 0 0 employed 1536 3455
2 0 0 0 employed 947 3600
3 1 0 0 employed 1678 3093
4 0 0 1 employed 1804 2449
5 0 0 1 unemployed 1184 2867
6 0 0 0 employed 475 3282
Emp_duration Age No_of_credit_acc Num_credit_acc_log_value
1 12 38 1 0.0000000
2 25 36 1 0.0000000
3 43 34 1 0.0000000
4 0 29 1 0.0000000
5 4 30 1 0.0000000
6 76 32 2 0.6931472
We use standardization for all the numerical variables since these variables have different units. This will help with our modeling later on.
standardize <- function(x) {
return((x - mean(x)) / sd(x))
}
complete_bank_data$Checking_amount_standardized <- standardize(complete_bank_data$Checking_amount)
complete_bank_data$term_standardized <- standardize(complete_bank_data$Term)
complete_bank_data$Credit_score_standardized <- standardize(complete_bank_data$Credit_score)
complete_bank_data$amount_standardized <- standardize(complete_bank_data$Amount)
complete_bank_data$Saving_amount_standardized <- standardize(complete_bank_data$Saving_amount)
complete_bank_data$Emp_duration_standardized <- standardize(complete_bank_data$Emp_duration)
complete_bank_data$Age_standardized <- standardize(complete_bank_data$Age)
complete_bank_data$No_of_credit_acc_standardized <- standardize(complete_bank_data$Num_credit_acc_log_value)
In this section, we do linear regression, addressing the question about which variables are the most important in predicting credit score. As mentioned earlier, we use credit score as the dependent variable, under the assumption that credit score is related to loan default and that the dependent variable has to be continuous.
First, we use stepwise selection. This method both adds and removes significant predictors. This method returns all the predictors that result in the lowest AIC value. When this method is implemented, we see that Education_loan, Checking_amount_standardized, Saving_amount_standardized, term_standardized, and Age_standardized are the most significant predictors.
linear_regression <- lm(Credit_score_standardized ~ Checking_amount_standardized + term_standardized + amount_standardized + Saving_amount_standardized + Emp_duration_standardized + Age_standardized + No_of_credit_acc_standardized + Gender + Marital_status + Car_loan + Personal_loan + Home_loan + Education_loan + Emp_status, data = complete_bank_data)
step_model_linear <- step(linear_regression, direction = "both")
Start: AIC=-119.44
Credit_score_standardized ~ Checking_amount_standardized + term_standardized +
amount_standardized + Saving_amount_standardized + Emp_duration_standardized +
Age_standardized + No_of_credit_acc_standardized + Gender +
Marital_status + Car_loan + Personal_loan + Home_loan + Education_loan +
Emp_status
Df Sum of Sq RSS AIC
- Home_loan 1 0.006 861.20 -121.430
- Car_loan 1 0.006 861.20 -121.430
- Personal_loan 1 0.023 861.22 -121.410
- Education_loan 1 0.076 861.27 -121.350
- Emp_status 1 0.114 861.31 -121.306
- No_of_credit_acc_standardized 1 0.198 861.39 -121.208
- Marital_status 1 0.434 861.63 -120.934
- Emp_duration_standardized 1 0.453 861.65 -120.912
- amount_standardized 1 1.347 862.54 -119.875
<none> 861.19 -119.438
- Gender 1 2.026 863.22 -119.087
- Checking_amount_standardized 1 4.432 865.62 -116.304
- Saving_amount_standardized 1 6.367 867.56 -114.072
- term_standardized 1 9.204 870.40 -110.807
- Age_standardized 1 36.939 898.13 -79.439
Step: AIC=-121.43
Credit_score_standardized ~ Checking_amount_standardized + term_standardized +
amount_standardized + Saving_amount_standardized + Emp_duration_standardized +
Age_standardized + No_of_credit_acc_standardized + Gender +
Marital_status + Car_loan + Personal_loan + Education_loan +
Emp_status
Df Sum of Sq RSS AIC
- Car_loan 1 0.000 861.20 -123.430
- Personal_loan 1 0.065 861.26 -123.354
- Emp_status 1 0.113 861.31 -123.299
- No_of_credit_acc_standardized 1 0.200 861.40 -123.198
- Marital_status 1 0.438 861.64 -122.922
- Emp_duration_standardized 1 0.454 861.65 -122.903
- Education_loan 1 0.971 862.17 -122.304
- amount_standardized 1 1.344 862.54 -121.871
<none> 861.20 -121.430
- Gender 1 2.034 863.23 -121.072
+ Home_loan 1 0.006 861.19 -119.438
- Checking_amount_standardized 1 4.431 865.63 -118.298
- Saving_amount_standardized 1 6.371 867.57 -116.060
- term_standardized 1 9.207 870.41 -112.796
- Age_standardized 1 37.117 898.32 -81.234
Step: AIC=-123.43
Credit_score_standardized ~ Checking_amount_standardized + term_standardized +
amount_standardized + Saving_amount_standardized + Emp_duration_standardized +
Age_standardized + No_of_credit_acc_standardized + Gender +
Marital_status + Personal_loan + Education_loan + Emp_status
Df Sum of Sq RSS AIC
- Emp_status 1 0.113 861.31 -125.299
- No_of_credit_acc_standardized 1 0.203 861.40 -125.194
- Personal_loan 1 0.241 861.44 -125.150
- Marital_status 1 0.438 861.64 -124.922
- Emp_duration_standardized 1 0.457 861.66 -124.900
- amount_standardized 1 1.347 862.55 -123.867
<none> 861.20 -123.430
- Gender 1 2.036 863.23 -123.069
- Education_loan 1 2.202 863.40 -122.877
+ Car_loan 1 0.000 861.20 -121.430
+ Home_loan 1 0.000 861.20 -121.430
- Checking_amount_standardized 1 4.431 865.63 -120.298
- Saving_amount_standardized 1 6.372 867.57 -118.058
- term_standardized 1 9.217 870.42 -114.784
- Age_standardized 1 37.126 898.32 -83.224
Step: AIC=-125.3
Credit_score_standardized ~ Checking_amount_standardized + term_standardized +
amount_standardized + Saving_amount_standardized + Emp_duration_standardized +
Age_standardized + No_of_credit_acc_standardized + Gender +
Marital_status + Personal_loan + Education_loan
Df Sum of Sq RSS AIC
- No_of_credit_acc_standardized 1 0.178 861.49 -127.092
- Personal_loan 1 0.235 861.55 -127.026
- Marital_status 1 0.382 861.69 -126.855
- Emp_duration_standardized 1 0.526 861.84 -126.688
- amount_standardized 1 1.344 862.66 -125.739
<none> 861.31 -125.299
- Gender 1 2.064 863.38 -124.905
- Education_loan 1 2.288 863.60 -124.645
+ Emp_status 1 0.113 861.20 -123.430
+ Car_loan 1 0.000 861.31 -123.299
+ Home_loan 1 0.000 861.31 -123.299
- Checking_amount_standardized 1 4.415 865.73 -122.186
- Saving_amount_standardized 1 6.300 867.61 -120.011
- term_standardized 1 9.177 870.49 -116.700
- Age_standardized 1 37.366 898.68 -84.831
Step: AIC=-127.09
Credit_score_standardized ~ Checking_amount_standardized + term_standardized +
amount_standardized + Saving_amount_standardized + Emp_duration_standardized +
Age_standardized + Gender + Marital_status + Personal_loan +
Education_loan
Df Sum of Sq RSS AIC
- Personal_loan 1 0.260 861.75 -128.79
- Marital_status 1 0.425 861.92 -128.60
- Emp_duration_standardized 1 0.471 861.96 -128.54
- amount_standardized 1 1.394 862.88 -127.47
<none> 861.49 -127.09
- Gender 1 2.073 863.56 -126.69
- Education_loan 1 2.262 863.75 -126.47
+ No_of_credit_acc_standardized 1 0.178 861.31 -125.30
+ Emp_status 1 0.088 861.40 -125.19
+ Car_loan 1 0.003 861.49 -125.09
+ Home_loan 1 0.000 861.49 -125.09
- Checking_amount_standardized 1 4.409 865.90 -123.99
- Saving_amount_standardized 1 6.286 867.78 -121.82
- term_standardized 1 9.183 870.67 -118.49
- Age_standardized 1 37.359 898.85 -86.64
Step: AIC=-128.79
Credit_score_standardized ~ Checking_amount_standardized + term_standardized +
amount_standardized + Saving_amount_standardized + Emp_duration_standardized +
Age_standardized + Gender + Marital_status + Education_loan
Df Sum of Sq RSS AIC
- Marital_status 1 0.481 862.23 -130.233
- Emp_duration_standardized 1 0.502 862.25 -130.209
- amount_standardized 1 1.389 863.14 -129.181
<none> 861.75 -128.791
- Gender 1 2.088 863.84 -128.371
- Education_loan 1 3.083 864.83 -127.219
+ Personal_loan 1 0.260 861.49 -127.092
+ No_of_credit_acc_standardized 1 0.203 861.55 -127.026
+ Car_loan 1 0.172 861.58 -126.991
+ Emp_status 1 0.081 861.67 -126.884
+ Home_loan 1 0.027 861.72 -126.822
- Checking_amount_standardized 1 4.624 866.37 -125.440
- Saving_amount_standardized 1 6.225 867.97 -123.593
- term_standardized 1 9.298 871.05 -120.059
- Age_standardized 1 37.913 899.66 -87.735
Step: AIC=-130.23
Credit_score_standardized ~ Checking_amount_standardized + term_standardized +
amount_standardized + Saving_amount_standardized + Emp_duration_standardized +
Age_standardized + Gender + Education_loan
Df Sum of Sq RSS AIC
- Emp_duration_standardized 1 0.367 862.60 -131.807
- amount_standardized 1 1.447 863.68 -130.556
<none> 862.23 -130.233
- Gender 1 1.865 864.09 -130.073
+ Marital_status 1 0.481 861.75 -128.791
- Education_loan 1 3.079 865.31 -128.668
+ Personal_loan 1 0.315 861.92 -128.598
+ No_of_credit_acc_standardized 1 0.255 861.98 -128.529
+ Car_loan 1 0.216 862.01 -128.484
+ Emp_status 1 0.029 862.20 -128.267
+ Home_loan 1 0.026 862.20 -128.263
- Checking_amount_standardized 1 4.690 866.92 -126.808
- Saving_amount_standardized 1 6.247 868.48 -125.014
- term_standardized 1 9.116 871.35 -121.716
- Age_standardized 1 37.904 900.13 -89.212
Step: AIC=-131.81
Credit_score_standardized ~ Checking_amount_standardized + term_standardized +
amount_standardized + Saving_amount_standardized + Age_standardized +
Gender + Education_loan
Df Sum of Sq RSS AIC
- amount_standardized 1 1.386 863.98 -132.202
<none> 862.60 -131.807
- Gender 1 2.233 864.83 -131.221
+ Emp_duration_standardized 1 0.367 862.23 -130.233
- Education_loan 1 3.109 865.71 -130.210
+ Marital_status 1 0.346 862.25 -130.209
+ Personal_loan 1 0.335 862.26 -130.196
+ Car_loan 1 0.217 862.38 -130.058
+ No_of_credit_acc_standardized 1 0.189 862.41 -130.027
+ Emp_status 1 0.071 862.53 -129.889
+ Home_loan 1 0.038 862.56 -129.851
- Checking_amount_standardized 1 4.828 867.43 -128.226
- Saving_amount_standardized 1 6.348 868.95 -126.475
- term_standardized 1 9.283 871.88 -123.103
- Age_standardized 1 38.174 900.77 -90.504
Step: AIC=-132.2
Credit_score_standardized ~ Checking_amount_standardized + term_standardized +
Saving_amount_standardized + Age_standardized + Gender +
Education_loan
Df Sum of Sq RSS AIC
<none> 863.98 -132.202
+ amount_standardized 1 1.386 862.60 -131.807
- Gender 1 2.355 866.34 -131.480
+ Marital_status 1 0.404 863.58 -130.670
+ Personal_loan 1 0.330 863.65 -130.584
+ Emp_duration_standardized 1 0.306 863.68 -130.556
+ No_of_credit_acc_standardized 1 0.249 863.73 -130.490
+ Car_loan 1 0.193 863.79 -130.425
- Education_loan 1 3.304 867.29 -130.384
+ Emp_status 1 0.058 863.92 -130.269
+ Home_loan 1 0.057 863.93 -130.268
- Checking_amount_standardized 1 5.446 869.43 -127.918
- Saving_amount_standardized 1 6.081 870.06 -127.189
- term_standardized 1 9.370 873.35 -123.415
- Age_standardized 1 39.497 903.48 -89.501
The second method of selecting the most important predictors is lasso regression, used below. This method shrinks insignificant variables to zero. Using this method, we get Checking_amount_standardized, term_standardized, amount_standardized, Saving_amount_standardized, Age_standardized, Gender, Personal_loan, and Education_loan as the most important predictors.
linear <- complete_bank_data
x_linear <- model.matrix(~ Gender + Marital_status + Emp_status+ Car_loan+ Personal_loan+Education_loan+Home_loan,
data = linear)[,-1]
x_linear <- cbind(linear$Checking_amount_standardized,linear$term_standardized, linear$amount_standardized,
linear$Saving_amount_standardized,linear$Emp_duration_standardized,
linear$Age_standardized, linear$No_of_credit_acc_standardized, x_linear)
y_linear <- linear$Credit_score_standardized
colnames(x_linear) <- c('Checking_amount_standardized', 'term_standardized', 'amount_standardized',
'Saving_amount_standardized', 'Emp_duration_standardized' , 'Age_standardized',
'No_of_credit_acc_standardized', 'GenderMale','Marital_statusSingle',
'Emp_statusUnemployed','Car_loan','Personal_loan','Education_loan','Home_loan' )
set.seed(1)
cv_test <- cv.glmnet(x_linear, y_linear, alpha = 1, family = "gaussian")
test_lambda <- cv_test$lambda.min
test_features <- rownames(coef(cv_test, s = test_lambda))[coef(cv_test, s = test_lambda)[,1] != 0]
print(test_features)
[1] "(Intercept)" "Checking_amount_standardized"
[3] "term_standardized" "amount_standardized"
[5] "Saving_amount_standardized" "Emp_duration_standardized"
[7] "Age_standardized" "GenderMale"
[9] "Personal_loan" "Education_loan"
The results from stepwise and lasso are used as the two candidate models. Below, we split the dataset using a 80/20 split. Since the original dataset contains 1000 observations, we use 800 obs for the training and the rest for validation and testing. We used 5 for the number of folds. Then we run k-fold cross validation for the two candidate models to get the MSE for both.
x_linear <- as.data.frame(x_linear)
x_linear <- cbind(complete_bank_data$Credit_score_standardized ,x_linear)
colnames(x_linear) <- c('Credit_score_standardized','Checking_amount_standardized', 'term_standardized',
'amount_standardized',
'Saving_amount_standardized', 'Emp_duration_standardized' , 'Age_standardized',
'No_of_credit_acc_standardized', 'GenderMale','Marital_statusSingle',
'Emp_statusUnemployed','Car_loan','Personal_loan','Education_loan','Home_loan' )
train.ID = sample(1:dim(x_linear)[1], 800, replace = FALSE) # without replacement
train = x_linear[train.ID,]
test = x_linear[-train.ID,]
N = dim(train)[1]
k = 5
fld.n = ceiling(N/k)
MSE.m1 = NULL
MSE.m2 = NULL
for (i in 1:k){
valid.ID = ((i-1)*fld.n +1):(i*fld.n)
valid.set = train[valid.ID, ]
train.set = train[-valid.ID,]
M01 = lm(Credit_score_standardized ~ Education_loan + Checking_amount_standardized + Saving_amount_standardized
+ term_standardized + Age_standardized, data = train.set)
M02 = lm(Credit_score_standardized ~ Checking_amount_standardized + term_standardized +
amount_standardized + Saving_amount_standardized + Age_standardized + GenderMale + Personal_loan +
Education_loan , data = train.set)
predM01 = predict(M01, newdata = valid.set)
predM02 = predict(M02, newdata = valid.set)
MSE.m1[i] = mean((predM01 - valid.set$Credit_score_standardized)^2)
MSE.m2[i] = mean((predM02 - valid.set$Credit_score_standardized)^2)
}
Below we get an MSE of 0.895 for model 1 and 0.905 for model 2. Since lower MSE is better, model 1 (stepwise) is better.
MSE = data.frame(fold = rep(1:k,2), MSE = c(MSE.m1, MSE.m2), type=c(rep("Model 1",k), rep("Model 2", k)))
cvplot = ggplot(data = MSE, aes(x=fold, y=MSE, color = type)) +
geom_line() +
geom_point() +
coord_cartesian(xlim = c(0, 6),
ylim = c(0,2)) +
geom_text(mapping = aes(x=1.0, y=0.25,
label=paste("Model 1 Mean MSE: = ", round(mean(MSE.m1),3), "")),
hjust=0) +
geom_text(mapping = aes(x=1.0, y=0.15,
label=paste("Model 2 Mean MSE: = ", round(mean(MSE.m2),3), "")),
hjust=0) +
ggtitle("Line plots of MSE candidate Models across folds") +
theme(plot.title = element_text(hjust = 0.5),
plot.margin = unit(c(1,1,1,1), "cm"))
ggplotly(cvplot)
Below we see that the MSE of the final model is 0.735
if (mean(MSE.m1) < mean(MSE.m2)){
model.test = lm(Credit_score_standardized ~ Education_loan + Checking_amount_standardized + Saving_amount_standardized
+ term_standardized + Age_standardized, data = train)
pred.model.test = predict(model.test, newdata = test)
test.MSE = round(mean((pred.model.test-test$Credit_score_standardized)^2),3)
final.test.plot = cvplot +
geom_text(mapping = aes(x=5.0, y=0.2,
label=paste("Final Model Test MSE: = ", round(mean(test.MSE),3), "")),
hjust=0, )
ggplotly(final.test.plot)
} else{
model.test = lm(Credit_score_standardized ~ Checking_amount_standardized + term_standardized + amount_standardized+
Saving_amount_standardized + Age_standardized + GenderMale + Personal_loan +
Education_loan, data = train)
pred.model.test = predict(model.test, newdata = test)
test.MSE = round(mean((pred.model.test-test$Credit_score_standardized)^2),3)
final.test.plot = cvplot +
geom_text(mapping = aes(x=5.0, y=0.2,
label=paste("Final Model Test MSE: = ", round(mean(test.MSE),3), "")),
hjust=0)
ggplotly(final.test.plot)
}
Based on the summary below, the final model is
Credit_score_standardized = 0.02870 - 0.18052(Education_loan) + 0.07620(Checking_amount_standardized) + 0.09390(Saving_amount_standardized) - 0.10278(term_standardized) + 0.24915(Age_standardized)
summary(model.test)
Call:
lm(formula = Credit_score_standardized ~ Education_loan + Checking_amount_standardized +
Saving_amount_standardized + term_standardized + Age_standardized,
data = train)
Residuals:
Min 1Q Median 3Q Max
-4.5470 -0.4991 0.0562 0.5602 3.7240
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.02602 0.03565 0.730 0.46562
Education_loan -0.20386 0.11098 -1.837 0.06659 .
Checking_amount_standardized 0.08040 0.03517 2.286 0.02250 *
Saving_amount_standardized 0.09275 0.03623 2.560 0.01065 *
term_standardized -0.09716 0.03521 -2.759 0.00592 **
Age_standardized 0.22862 0.03754 6.090 1.76e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9471 on 794 degrees of freedom
Multiple R-squared: 0.135, Adjusted R-squared: 0.1296
F-statistic: 24.79 on 5 and 794 DF, p-value: < 2.2e-16
Next, we use logistic regression, addressing the question about which variables are the most important in predicting Default. As mentioned earlier, we use Default as the dependent variable. This is the original target variable and is binary.
First, we use stepwise selection. This method both adds and removes significant predictors. This method returns all the predictors that result in the lowest AIC value. When this method is implemented, we see that Education_loan, Checking_amount_standardized, Saving_amount_standardized, term_standardized, Age_standardized, credit_score_standradized, home_loan, personal_loan, and emp_status are the most significant predictors.
logistic_regression <- glm(Default ~ Credit_score_standardized + Checking_amount_standardized + term_standardized + amount_standardized + Saving_amount_standardized + Emp_duration_standardized + Age_standardized + No_of_credit_acc_standardized + Gender + Marital_status + Car_loan + Personal_loan + Home_loan + Education_loan + Emp_status, data = complete_bank_data, family = binomial)
step_model_logistic <- step(logistic_regression, direction = "both")
Start: AIC=328.03
Default ~ Credit_score_standardized + Checking_amount_standardized +
term_standardized + amount_standardized + Saving_amount_standardized +
Emp_duration_standardized + Age_standardized + No_of_credit_acc_standardized +
Gender + Marital_status + Car_loan + Personal_loan + Home_loan +
Education_loan + Emp_status
Df Deviance AIC
- Car_loan 1 296.07 326.07
- Gender 1 296.08 326.08
- Education_loan 1 296.18 326.18
- Personal_loan 1 296.34 326.34
- Marital_status 1 296.42 326.42
- Emp_duration_standardized 1 296.64 326.64
- No_of_credit_acc_standardized 1 296.79 326.79
- amount_standardized 1 297.62 327.62
<none> 296.03 328.03
- Home_loan 1 298.25 328.25
- Emp_status 1 300.66 330.66
- term_standardized 1 307.23 337.23
- Credit_score_standardized 1 332.98 362.98
- Checking_amount_standardized 1 374.74 404.74
- Saving_amount_standardized 1 382.82 412.82
- Age_standardized 1 507.68 537.68
Step: AIC=326.07
Default ~ Credit_score_standardized + Checking_amount_standardized +
term_standardized + amount_standardized + Saving_amount_standardized +
Emp_duration_standardized + Age_standardized + No_of_credit_acc_standardized +
Gender + Marital_status + Personal_loan + Home_loan + Education_loan +
Emp_status
Df Deviance AIC
- Gender 1 296.12 324.12
- Marital_status 1 296.45 324.45
- Emp_duration_standardized 1 296.66 324.66
- No_of_credit_acc_standardized 1 296.81 324.81
- amount_standardized 1 297.66 325.66
<none> 296.07 326.07
+ Car_loan 1 296.03 328.03
- Emp_status 1 300.73 328.73
- Personal_loan 1 302.78 330.78
- Education_loan 1 302.94 330.94
- term_standardized 1 307.25 335.25
- Home_loan 1 312.65 340.65
- Credit_score_standardized 1 333.03 361.03
- Checking_amount_standardized 1 374.74 402.74
- Saving_amount_standardized 1 383.38 411.38
- Age_standardized 1 508.20 536.20
Step: AIC=324.12
Default ~ Credit_score_standardized + Checking_amount_standardized +
term_standardized + amount_standardized + Saving_amount_standardized +
Emp_duration_standardized + Age_standardized + No_of_credit_acc_standardized +
Marital_status + Personal_loan + Home_loan + Education_loan +
Emp_status
Df Deviance AIC
- Marital_status 1 296.58 322.58
- Emp_duration_standardized 1 296.69 322.69
- No_of_credit_acc_standardized 1 296.86 322.86
- amount_standardized 1 297.71 323.71
<none> 296.12 324.12
+ Gender 1 296.07 326.07
+ Car_loan 1 296.08 326.08
- Emp_status 1 300.89 326.89
- Personal_loan 1 302.85 328.85
- Education_loan 1 302.95 328.95
- term_standardized 1 307.56 333.56
- Home_loan 1 312.65 338.65
- Credit_score_standardized 1 333.08 359.08
- Checking_amount_standardized 1 374.86 400.86
- Saving_amount_standardized 1 383.38 409.38
- Age_standardized 1 508.28 534.28
Step: AIC=322.58
Default ~ Credit_score_standardized + Checking_amount_standardized +
term_standardized + amount_standardized + Saving_amount_standardized +
Emp_duration_standardized + Age_standardized + No_of_credit_acc_standardized +
Personal_loan + Home_loan + Education_loan + Emp_status
Df Deviance AIC
- Emp_duration_standardized 1 296.97 320.97
- No_of_credit_acc_standardized 1 297.49 321.49
- amount_standardized 1 298.12 322.12
<none> 296.58 322.58
+ Marital_status 1 296.12 324.12
+ Gender 1 296.45 324.45
+ Car_loan 1 296.55 324.55
- Emp_status 1 301.00 325.00
- Personal_loan 1 303.23 327.23
- Education_loan 1 303.53 327.53
- term_standardized 1 308.69 332.69
- Home_loan 1 313.05 337.05
- Credit_score_standardized 1 333.69 357.69
- Checking_amount_standardized 1 375.45 399.45
- Saving_amount_standardized 1 384.43 408.43
- Age_standardized 1 508.89 532.89
Step: AIC=320.97
Default ~ Credit_score_standardized + Checking_amount_standardized +
term_standardized + amount_standardized + Saving_amount_standardized +
Age_standardized + No_of_credit_acc_standardized + Personal_loan +
Home_loan + Education_loan + Emp_status
Df Deviance AIC
- No_of_credit_acc_standardized 1 297.76 319.76
- amount_standardized 1 298.53 320.53
<none> 296.97 320.97
+ Emp_duration_standardized 1 296.58 322.58
+ Marital_status 1 296.69 322.69
+ Gender 1 296.89 322.89
+ Car_loan 1 296.96 322.96
- Emp_status 1 301.89 323.89
- Personal_loan 1 303.63 325.63
- Education_loan 1 303.78 325.78
- term_standardized 1 309.15 331.15
- Home_loan 1 313.72 335.72
- Credit_score_standardized 1 333.70 355.70
- Checking_amount_standardized 1 375.45 397.45
- Saving_amount_standardized 1 384.49 406.49
- Age_standardized 1 509.87 531.87
Step: AIC=319.76
Default ~ Credit_score_standardized + Checking_amount_standardized +
term_standardized + amount_standardized + Saving_amount_standardized +
Age_standardized + Personal_loan + Home_loan + Education_loan +
Emp_status
Df Deviance AIC
- amount_standardized 1 299.27 319.27
<none> 297.76 319.76
+ No_of_credit_acc_standardized 1 296.97 320.97
+ Marital_status 1 297.33 321.33
+ Emp_duration_standardized 1 297.49 321.49
+ Gender 1 297.63 321.63
+ Car_loan 1 297.76 321.76
- Emp_status 1 302.27 322.27
- Education_loan 1 304.49 324.49
- Personal_loan 1 304.63 324.63
- term_standardized 1 310.05 330.05
- Home_loan 1 315.24 335.24
- Credit_score_standardized 1 334.40 354.40
- Checking_amount_standardized 1 377.57 397.57
- Saving_amount_standardized 1 387.53 407.53
- Age_standardized 1 511.83 531.83
Step: AIC=319.27
Default ~ Credit_score_standardized + Checking_amount_standardized +
term_standardized + Saving_amount_standardized + Age_standardized +
Personal_loan + Home_loan + Education_loan + Emp_status
Df Deviance AIC
<none> 299.27 319.27
+ amount_standardized 1 297.76 319.76
+ No_of_credit_acc_standardized 1 298.53 320.53
+ Marital_status 1 298.89 320.89
+ Emp_duration_standardized 1 298.98 320.98
+ Gender 1 299.16 321.16
+ Car_loan 1 299.26 321.26
- Emp_status 1 304.27 322.27
- Personal_loan 1 305.66 323.66
- Education_loan 1 306.43 324.43
- term_standardized 1 311.10 329.10
- Home_loan 1 315.60 333.60
- Credit_score_standardized 1 335.74 353.74
- Checking_amount_standardized 1 380.84 398.84
- Saving_amount_standardized 1 387.80 405.80
- Age_standardized 1 522.31 540.31
The second method of selecting the most important predictors is lasso regression, used below. This method shrinks insignificant variables to zero. Using this method, we get Checking_amount_standardized, term_standardized, amount_standardized, Credit_score_standardized, Saving_amount_standardized, Emp_duration_standardized, Age_standardized, No_of_credit_acc_standardized, Marital_status, Emp_status, Personal_loan, Education_loan, and Home_loan as the most important predictors.
logistic_data <- complete_bank_data
x_logistic <- model.matrix(~ Gender + Marital_status + Emp_status+ Car_loan+ Personal_loan+Education_loan+Home_loan,
data = logistic_data)[,-1]
x_logistic <- cbind(logistic_data$Checking_amount_standardized,logistic_data$term_standardized, logistic_data$amount_standardized, logistic_data$Credit_score_standardized,
logistic_data$Saving_amount_standardized, logistic_data$Emp_duration_standardized,
logistic_data$Age_standardized, logistic_data$No_of_credit_acc_standardized, x_logistic)
y_logistic <- logistic_data$Default
colnames(x_logistic) <- c('Checking_amount_standardized', 'term_standardized', 'amount_standardized',
'Credit_score_standardized',
'Saving_amount_standardized', 'Emp_duration_standardized' , 'Age_standardized',
'No_of_credit_acc_standardized', 'GenderMale','Marital_statusSingle',
'Emp_statusUnemployed','Car_loan','Personal_loan','Education_loan','Home_loan' )
set.seed(1)
cv_test2 <- cv.glmnet(x_logistic, y_logistic, alpha = 1, family = "binomial")
test_lambda2 <- cv_test2$lambda.min
test_features2 <- rownames(coef(cv_test2, s = test_lambda2))[coef(cv_test2, s = test_lambda2)[,1] != 0]
print(test_features2)
[1] "(Intercept)" "Checking_amount_standardized"
[3] "term_standardized" "amount_standardized"
[5] "Credit_score_standardized" "Saving_amount_standardized"
[7] "Emp_duration_standardized" "Age_standardized"
[9] "No_of_credit_acc_standardized" "Marital_statusSingle"
[11] "Emp_statusUnemployed" "Personal_loan"
[13] "Education_loan" "Home_loan"
The results from stepwise and lasso are used as the two candidate models. First, we use the results from the stepwise selection. Below, we split the dataset using a 80/20 split. Since the original dataset contains 1000 observations, we use 800 obs for the training and the rest for validation and testing. We used 5 for the number of folds. Then we run k-fold cross validation for the two candidate models to get the confusion matrices and the ROC/AUC. Based on the below ROC, we get an AUC value of 0.988
x_logistic <- as.data.frame(x_logistic)
x_logistic <- cbind(complete_bank_data$Default ,x_logistic)
colnames(x_logistic) <- c('Default', 'Checking_amount_standardized', 'term_standardized', 'amount_standardized',
'Credit_score_standardized',
'Saving_amount_standardized', 'Emp_duration_standardized' , 'Age_standardized',
'No_of_credit_acc_standardized', 'GenderMale','Marital_statusSingle',
'Emp_statusUnemployed','Car_loan','Personal_loan','Education_loan','Home_loan' )
x_logistic$Default <- as.numeric(x_logistic$Default)
x_logistic$Default[x_logistic$Default == '1'] <- '0'
x_logistic$Default[x_logistic$Default == '2'] <- '1'
x_logistic$Default <- as.numeric(x_logistic$Default)
logtrain.ID = sample(1:dim(x_logistic)[1], 800, replace = FALSE)
logtrain = x_logistic[logtrain.ID,]
logtest = x_logistic[-logtrain.ID,]
train_control <- trainControl(method = "cv", number = 5)
set.seed(1)
logistic_model <- train(Default ~ Emp_statusUnemployed + Personal_loan + Education_loan + term_standardized + Home_loan +
Credit_score_standardized + Checking_amount_standardized +
Saving_amount_standardized + Age_standardized,
data = logtrain,
trControl = train_control,
method = "glm",
family=binomial())
probabilities <- predict(logistic_model, logtest, type="raw")
thresholds <- c(0.0, 0.25, 0.5, 0.75, 1.0)
for (threshold in thresholds) {
cat("\nConfusion Matrix for Threshold =", threshold, "\n")
predictions <- ifelse(probabilities > threshold, "1", "0")
cm <- confusionMatrix(as.factor(predictions), as.factor(logtest$Default), positive = "1")
print(cm$table)
}
Confusion Matrix for Threshold = 0
Reference
Prediction 0 1
0 0 0
1 136 64
Confusion Matrix for Threshold = 0.25
Reference
Prediction 0 1
0 118 3
1 18 61
Confusion Matrix for Threshold = 0.5
Reference
Prediction 0 1
0 129 5
1 7 59
Confusion Matrix for Threshold = 0.75
Reference
Prediction 0 1
0 133 10
1 3 54
Confusion Matrix for Threshold = 1
Reference
Prediction 0 1
0 136 64
1 0 0
TPR = c(1,64/(64+0), 62/(62+2), 58/(58+6), 53/(53+11), 0/(64+0))
FPR = c(1,136/(136+0), 16/(16+120), 5/(5+131), 2/(2+134), 0/(0+136))
plot(FPR, TPR, type = "b", main = "An Illustrative ROC Curve", col ="blue",
xlab="1 - Specifity (FPR)", ylab = "Sensitivity (TPR)")
abline(0,1, lty = 2, col = "red")
legend("bottomright", c("Logistic Model", "Random Guess"),
col=c("blue", "red"), lty = 1:2, bty="n", cex = 0.9)
ROCobj <- roc(logtest$Default, probabilities)
Sen <- ROCobj$sensitivities
Spe <- ROCobj$specificities
pROCdata <- data.frame(TPR=Sen, FPR = (1 - Spe))
AUC <- ROCobj$auc
print(AUC)
Area under the curve: 0.986
Next, we use the results from the lasso regression. Based on the below ROC, we get an AUC value of 0.99
logtrain.ID2 = sample(1:dim(x_logistic)[1], 800, replace = FALSE)
logtrain2 = x_logistic[logtrain.ID2,]
logtest2 = x_logistic[-logtrain.ID2,]
set.seed(1)
logistic_model2 <- train(Default ~ Checking_amount_standardized + term_standardized + amount_standardized + Credit_score_standardized + Saving_amount_standardized + Emp_duration_standardized + Age_standardized + No_of_credit_acc_standardized + Marital_statusSingle + Emp_statusUnemployed + Personal_loan + Education_loan + Home_loan,
data = logtrain2,
trControl = train_control,
method = "glm",
family=binomial())
probabilities2 <- predict(logistic_model2, logtest2, type="raw")
thresholds2 <- c(0.0, 0.25, 0.5, 0.75, 1.0)
for (threshold in thresholds2) {
cat("\nConfusion Matrix for Threshold =", threshold, "\n")
predictions2 <- ifelse(probabilities2 > threshold, "1", "0")
cm2 <- confusionMatrix(as.factor(predictions2), as.factor(logtest2$Default), positive = "1")
print(cm2$table)
}
Confusion Matrix for Threshold = 0
Reference
Prediction 0 1
0 0 0
1 141 59
Confusion Matrix for Threshold = 0.25
Reference
Prediction 0 1
0 131 3
1 10 56
Confusion Matrix for Threshold = 0.5
Reference
Prediction 0 1
0 137 5
1 4 54
Confusion Matrix for Threshold = 0.75
Reference
Prediction 0 1
0 138 12
1 3 47
Confusion Matrix for Threshold = 1
Reference
Prediction 0 1
0 141 59
1 0 0
TPR2 = c(1,59/(59+0), 56/(56+3), 54/(54+5), 47/(47+12), 0/(59+0))
FPR2 = c(1,141/(141+0), 11/(11+130), 3/(3+138), 2/(2+139), 0/(0+141))
plot(FPR2, TPR2, type = "b", main = "An Illustrative ROC Curve", col ="blue",
xlab="1 - Specifity (FPR)", ylab = "Sensitivity (TPR)")
abline(0,1, lty = 2, col = "red")
legend("bottomright", c("Logistic Model", "Random Guess"),
col=c("blue", "red"), lty = 1:2, bty="n", cex = 0.9)
ROCobj2 <- roc(logtest2$Default, probabilities2)
Sen2 <- ROCobj2$sensitivities
Spe2 <- ROCobj2$specificities
pROCdata2 <- data.frame(TPR=Sen2, FPR = (1 - Spe2))
AUC2 <- ROCobj2$auc
print(AUC2)
Area under the curve: 0.9891
For linear regression, the MSE for candidate model 1 is 0.895 and 0.905 for model 2. The MSE of the final model is 0.735
Since model 1 did slightly better, the final model equation is
Credit_score_standardized = 0.02870 - 0.18052(Education_loan) + 0.07620(Checking_amount_standardized) + 0.09390(Saving_amount_standardized) - 0.10278(term_standardized) + 0.24915(Age_standardized)
Ideally, a model should have a lower MSE as it would mean more accurate predictions. Since the MSE of the final model (0.735) is lower than that of the candidate models, we can assume that this is a good model. Therefore, we can assume that Education_loan, Checking_amount_standardized, Saving_amount_standardized, term_standardized, and Age_standardized are the most important variables in predicting credit score.
For logistic regression, the AUC for candidate model 1 is 0.988 and 0.99 for model 2.
Ideally, a model should have a higher AUC as it would mean more accurate predictions. However, the AUC values mentioned are too large, which may indicate overfitting. In this case, I would say that candidate model 1 is better despite having a smaller AUC and uses less features to predict loan default than model 2. We can assume that Education_loan, Checking_amount_standardized, Saving_amount_standardized, term_standardized, Age_standardized, credit_score_standradized, home_loan, personal_loan, and emp_status are the most significant features in predicting loan default.
To resolve overfitting, we may need to do additional feature engineering or adjust cross validation parameters.