Project: Credit Card Default Prediction
I. Introduction
The dataset used in this project, named Default, contains credit card default data. It comes from the ISLR2 package in R and is already processed and cleaned.
The objective of this project is to explore the Default data and examine the relationships among its variables. First, I describe each variable. Then, I explore their patterns through descriptive statistics and plots. Finally, I fit models to assess how the predictors affect the dependent variable.
II. Exploratory Data Analysis
1. About the dataset
| Order | Variable Name | Type | Description |
|---|---|---|---|
| 1 | Default | Binary | A factor with levels No and Yes indicating whether the customer defaulted on their debt |
| 2 | Student | Binary | A factor with levels No and Yes indicating whether the customer is a student |
| 3 | Balance | Numerical | The average balance that the customer has remaining on their credit card after making their monthly payment |
| 4 | Income | Numerical | Income of the customer |
The dataset includes 4 variables, Default, Student, Balance, and Income, described in the table above. The dependent variable is Default, with Student, Balance, and Income as the 3 predictors. In other words, we use an account's balance, the customer's income, and whether the customer is a student to predict the probability of default on the credit card.
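A minimal sketch of loading and inspecting the data, assuming the ISLR2 package is installed:

library(ISLR2)  # provides the Default dataset
data("Default")
str(Default)    # 10000 obs. of 4 variables: default, student, balance, income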
2. Descriptive Statistics
| Variable | n | mean | sd | median | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|
| Balance | 10000 | 835.3749 | 483.715 | 823.637 | 0.0000 | 2654.323 | 2654.323 | 0.2459914 | -0.3556736 | 4.83715 |
| Income | 10000 | 33516.9819 | 13336.640 | 34552.645 | 771.9677 | 73554.233 | 72782.266 | 0.0733187 | -0.9000681 | 133.36640 |
For the 2 numerical variables, Balance and Income, the descriptive statistics table above summarizes the features of the data.
In terms of Balance, the mean and median balance of the 10000 accounts are 835.37 and 823.64, respectively, with values ranging from 0 to 2654.32. The standard deviation of 483.72 measures how spread out the data is, and the standard error of 4.84 estimates how far the sample mean may differ from the population mean. The skewness of 0.25 is close to 0, meaning the distribution is nearly symmetric. The kurtosis of -0.36 is excess kurtosis (0 for a normal distribution), indicating tails slightly lighter than normal.
Regarding Income, customer income varies from 772 to 73554, with an average of 33517 and a standard deviation of 13337. Like Balance, the Income distribution is nearly symmetric with lighter-than-normal tails.
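The table above follows the output format of psych::describe; a sketch, assuming that function was applied to the two numerical columns:

library(psych)
describe(Default[, c("balance", "income")])  # reports n, mean, sd, median, skew, kurtosis, se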
As the bar charts show, 333 accounts defaulted, which is about 3.3% of the total. Among all customers, 2944 are students and 7056 are not.
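A sketch of the underlying counts:

table(Default$default)  # No: 9667, Yes: 333
table(Default$student)  # No: 7056, Yes: 2944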
3. Relationship among variables
As the boxplot shows, the difference in balance between default and non-default accounts is substantial. Non-default customers have a median balance of around 700-800, while for default customers it is nearly 1800. The interquartile range of non-default balances runs from about 500 to 1200, versus roughly 1500-2000 for default accounts. In short, default customers carry much larger balances than non-default customers.
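A sketch reproducing such a boxplot in base R:

boxplot(balance ~ default, data = Default,
        xlab = "Default", ylab = "Balance",
        main = "Balance by default status")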
The histograms illustrate how customer income is distributed. Both charts share a similar bimodal pattern, with peaks near 20000 and 40000. In the default-account histogram the frequency around 30000 is only about one-fourth of that around 20000, while for non-default accounts it is about two-thirds. Overall, there is no clear difference between the two distributions, so we rely on the models below to test the effect of income.
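A sketch of one way to draw the two income histograms, assuming ggplot2:

library(ggplot2)
ggplot(Default, aes(x = income)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ default, scales = "free_y") +
  labs(x = "Income", y = "Count", title = "Income by default status")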
Among the 333 default customers, 127 (38.14%) are students. Among the 9667 non-default customers, 2817 (29.14%) are students.
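These proportions come from the student-by-default cross-tabulation; a sketch:

tab <- table(Default$student, Default$default)
tab
prop.table(tab, margin = 2)  # share of students within each default group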
III. Model
1. Model Construction
First, to build the models, we split the data 80/20: 80% as training data and 20% as testing data. Splitting data is a fundamental step in machine learning. The training set is used to fit each model, letting it learn the patterns and relationships in the data; the testing set is held out to assess the model's performance and generalization on unseen data. This simulates how the model would behave on new, real-world examples and helps reveal issues such as overfitting or underfitting.
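A minimal sketch of the split; the seed and the sampling method are assumptions, since the report does not state them:

set.seed(1)  # hypothetical seed for reproducibility
train_index <- sample(seq_len(nrow(Default)), size = 0.8 * nrow(Default))
data_training <- Default[train_index, ]
data_testing <- Default[-train_index, ]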
Based on this exploration of the data, logistic regression is the natural first model because the dependent variable default is binary.
2. Logistic Regression
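The summary below comes from fitting the logistic regression on the training data; the call echoed in the output confirms the formula and family. A minimal sketch (the object name model_glm is my assumption):

model_glm <- glm(default ~ income + balance + student,
                 family = "binomial", data = data_training)
summary(model_glm)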
##
## Call:
## glm(formula = default ~ income + balance + student, family = "binomial",
## data = data_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4850 -0.1383 -0.0538 -0.0195 3.7573
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.097e+01 5.583e-01 -19.642 <2e-16 ***
## income 2.645e-06 9.262e-06 0.286 0.7752
## balance 5.802e-03 2.623e-04 22.121 <2e-16 ***
## studentYes -6.611e-01 2.655e-01 -2.490 0.0128 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2320.3 on 7999 degrees of freedom
## Residual deviance: 1229.4 on 7996 degrees of freedom
## AIC: 1237.4
##
## Number of Fisher Scoring iterations: 8
Pseudo R-squared of the model:
## [1] 0.4701481
Adjusted pseudo R-squared of the model:
## [1] 0.4699
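For a logistic regression these are pseudo R-squared values: 1 - 1229.4/2320.3 ≈ 0.4701 matches McFadden's measure, and the second figure matches a linear-model-style adjustment for the 3 predictors. A sketch of the computation, assuming the model_glm object above:

r2 <- 1 - model_glm$deviance / model_glm$null.deviance  # McFadden's pseudo R-squared, ~0.4701
n <- nrow(data_training); k <- 3                        # k = number of predictors
1 - (1 - r2) * (n - 1) / (n - k - 1)                    # adjusted analogue, ~0.4699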
Confusion Matrix:
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1923 50
## Yes 8 19
##
## Accuracy : 0.971
## 95% CI : (0.9627, 0.9779)
## No Information Rate : 0.9655
## P-Value [Acc > NIR] : 0.09672
##
## Kappa : 0.3839
##
## Mcnemar's Test P-Value : 7.303e-08
##
## Sensitivity : 0.9959
## Specificity : 0.2754
## Pos Pred Value : 0.9747
## Neg Pred Value : 0.7037
## Prevalence : 0.9655
## Detection Rate : 0.9615
## Detection Prevalence : 0.9865
## Balanced Accuracy : 0.6356
##
## 'Positive' Class : No
##
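For reference, a sketch of how the confusion matrix above can be produced with caret at the standard 0.5 cutoff (Section IV reuses the same pattern with 0.4 and 0.3):

library(caret)
prob_glm <- predict(model_glm, type = "response", newdata = data_testing)
pred_glm <- factor(ifelse(prob_glm >= 0.5, "Yes", "No"), levels = c("No", "Yes"))
confusionMatrix(pred_glm, data_testing$default)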
3. Linear Discriminant Analysis
model_lda <- lda(default ~ income + balance + student, data = data_training)  # renamed so MASS::lda is not masked
predict_lda <- predict(model_lda, newdata = data_testing)  # predict.lda returns class, posterior, and x
result_lda <- predict_lda %>% bind_cols(data_testing %>% dplyr::select(default))
result_lda <- result_lda %>% dplyr::select(class, default)
colnames(result_lda) <- c("predicted_value", "actual_value")
confusion_lda <- confusionMatrix(result_lda$predicted_value, result_lda$actual_value)
confusion_lda
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1927 53
## Yes 4 16
##
## Accuracy : 0.9715
## 95% CI : (0.9632, 0.9783)
## No Information Rate : 0.9655
## P-Value [Acc > NIR] : 0.07637
##
## Kappa : 0.3495
##
## Mcnemar's Test P-Value : 2.047e-10
##
## Sensitivity : 0.9979
## Specificity : 0.2319
## Pos Pred Value : 0.9732
## Neg Pred Value : 0.8000
## Prevalence : 0.9655
## Detection Rate : 0.9635
## Detection Prevalence : 0.9900
## Balanced Accuracy : 0.6149
##
## 'Positive' Class : No
##
4. Naive Bayes
naivebayes <- naiveBayes(default ~ income + balance + student, data = data_training)
naivebayes
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## No Yes
## 0.967 0.033
##
## Conditional probabilities:
## income
## Y [,1] [,2]
## No 33580.79 13301.68
## Yes 31831.14 13800.33
##
## balance
## Y [,1] [,2]
## No 801.9643 455.4274
## Yes 1756.4756 337.6413
##
## student
## Y No Yes
## No 0.7110910 0.2889090
## Yes 0.6136364 0.3863636
predict_nb <- predict(naivebayes, type = "raw", newdata = data_testing)  # class posterior probabilities
nb.pred.adj <- ifelse(predict_nb[, 2] >= .5, "Yes", "No")  # 0.5 cutoff on P(default = Yes)
result_nb <- nb.pred.adj %>% bind_cols(data_testing %>% dplyr::select(default))
## New names:
## • `` -> `...1`
colnames(result_nb) <- c("predicted_value", "actual_value")
confusion_nb <- confusionMatrix(as.factor(result_nb$predicted_value), as.factor(result_nb$actual_value))
confusion_nb
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1922 53
## Yes 9 16
##
## Accuracy : 0.969
## 95% CI : (0.9604, 0.9762)
## No Information Rate : 0.9655
## P-Value [Acc > NIR] : 0.2149
##
## Kappa : 0.3281
##
## Mcnemar's Test P-Value : 4.734e-08
##
## Sensitivity : 0.9953
## Specificity : 0.2319
## Pos Pred Value : 0.9732
## Neg Pred Value : 0.6400
## Prevalence : 0.9655
## Detection Rate : 0.9610
## Detection Prevalence : 0.9875
## Balanced Accuracy : 0.6136
##
## 'Positive' Class : No
##
IV. Robustness
1. Threshold Modification
Because defaults are rare (about 3.3% of accounts), lowering the standard 0.5 probability cutoff trades a little overall accuracy for better detection of defaulters. We re-evaluate the three classifiers at cutoffs of 0.4 and 0.3.
Threshold 0.4
Logistic Regression:
# fit on the training data and predict default probabilities for the test set
model_2 <- glm(default ~ income + balance + student, data = data_training, family = "binomial")
model_fit_2 <- predict(model_2, type = "response", newdata = data_testing)
predict_binary_2 <- ifelse(model_fit_2 >= 0.4, "Yes", "No")  # lowered 0.4 cutoff
test_results_2 <- predict_binary_2 %>%
  dplyr::bind_cols(data_testing$default)
## New names:
## • `` -> `...1`
## • `` -> `...2`
colnames(test_results_2) <- c("Prediction","Actual data")
#confusion matrix
test_results_2$Prediction <- as.factor(test_results_2$Prediction)
test_results_2$`Actual data`<- as.factor(test_results_2$`Actual data`)
confusion_glm_2 <- confusionMatrix(test_results_2$Prediction, test_results_2$`Actual data`)
confusion_glm_2
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1918 42
## Yes 13 27
##
## Accuracy : 0.9725
## 95% CI : (0.9644, 0.9792)
## No Information Rate : 0.9655
## P-Value [Acc > NIR] : 0.0454165
##
## Kappa : 0.4823
##
## Mcnemar's Test P-Value : 0.0001597
##
## Sensitivity : 0.9933
## Specificity : 0.3913
## Pos Pred Value : 0.9786
## Neg Pred Value : 0.6750
## Prevalence : 0.9655
## Detection Rate : 0.9590
## Detection Prevalence : 0.9800
## Balanced Accuracy : 0.6923
##
## 'Positive' Class : No
##
LDA:
lda.pred.adj <- ifelse(predict_lda$posterior[, 2] >= .4, "Yes", "No")  # 0.4 cutoff on the LDA posterior
result_lda_2 <- lda.pred.adj %>% bind_cols(data_testing %>% dplyr::select(default))
## New names:
## • `` -> `...1`
colnames(result_lda_2) <- c("predicted_value", "actual_value")
confusion_lda_2 <- confusionMatrix(as.factor(result_lda_2$predicted_value),result_lda_2$actual_value)
confusion_lda_2
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1923 49
## Yes 8 20
##
## Accuracy : 0.9715
## 95% CI : (0.9632, 0.9783)
## No Information Rate : 0.9655
## P-Value [Acc > NIR] : 0.07637
##
## Kappa : 0.4004
##
## Mcnemar's Test P-Value : 1.17e-07
##
## Sensitivity : 0.9959
## Specificity : 0.2899
## Pos Pred Value : 0.9752
## Neg Pred Value : 0.7143
## Prevalence : 0.9655
## Detection Rate : 0.9615
## Detection Prevalence : 0.9860
## Balanced Accuracy : 0.6429
##
## 'Positive' Class : No
##
Naive Bayes:
# the Naive Bayes model is unchanged from Section III.4; only the cutoff differs
predict_nb <- predict(naivebayes, type= "raw", newdata=data_testing)
nb.pred.adj <- ifelse(predict_nb[, 2] >= .4, "Yes", "No")
result_nb <- nb.pred.adj %>% bind_cols(data_testing %>% dplyr::select(default))
## New names:
## • `` -> `...1`
colnames(result_nb) <- c("predicted_value", "actual_value")
confusion_nb <- confusionMatrix(as.factor(result_nb$predicted_value), as.factor(result_nb$actual_value))
confusion_nb
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1910 47
## Yes 21 22
##
## Accuracy : 0.966
## 95% CI : (0.9571, 0.9735)
## No Information Rate : 0.9655
## P-Value [Acc > NIR] : 0.483145
##
## Kappa : 0.3763
##
## Mcnemar's Test P-Value : 0.002432
##
## Sensitivity : 0.9891
## Specificity : 0.3188
## Pos Pred Value : 0.9760
## Neg Pred Value : 0.5116
## Prevalence : 0.9655
## Detection Rate : 0.9550
## Detection Prevalence : 0.9785
## Balanced Accuracy : 0.6540
##
## 'Positive' Class : No
##
Threshold 0.3
Logistic Regression:
# same fitted model as above; only the cutoff is lowered to 0.3
model_fit_3 <- predict(model_2, type = "response", newdata = data_testing)
predict_binary_3 <- ifelse(model_fit_3 >= 0.3, "Yes", "No")
test_results_3 <- predict_binary_3 %>%
  dplyr::bind_cols(data_testing$default)
## New names:
## • `` -> `...1`
## • `` -> `...2`
colnames(test_results_3) <- c("Prediction","Actual data")
#confusion matrix
test_results_3$Prediction <- as.factor(test_results_3$Prediction)
test_results_3$`Actual data`<- as.factor(test_results_3$`Actual data`)
confusion_glm_3 <- confusionMatrix(test_results_3$Prediction, test_results_3$`Actual data`)
confusion_glm_3
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1906 37
## Yes 25 32
##
## Accuracy : 0.969
## 95% CI : (0.9604, 0.9762)
## No Information Rate : 0.9655
## P-Value [Acc > NIR] : 0.2149
##
## Kappa : 0.4921
##
## Mcnemar's Test P-Value : 0.1624
##
## Sensitivity : 0.9871
## Specificity : 0.4638
## Pos Pred Value : 0.9810
## Neg Pred Value : 0.5614
## Prevalence : 0.9655
## Detection Rate : 0.9530
## Detection Prevalence : 0.9715
## Balanced Accuracy : 0.7254
##
## 'Positive' Class : No
##
LDA:
lda.pred.adj <- ifelse(predict_lda$posterior[, 2] > .3, "Yes", "No")
result_lda_2 <- lda.pred.adj %>% bind_cols(data_testing %>% dplyr::select(default))
## New names:
## • `` -> `...1`
colnames(result_lda_2) <- c("predicted_value", "actual_value")
confusion_lda_2 <- confusionMatrix(as.factor(result_lda_2$predicted_value),result_lda_2$actual_value)
confusion_lda_2
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1913 39
## Yes 18 30
##
## Accuracy : 0.9715
## 95% CI : (0.9632, 0.9783)
## No Information Rate : 0.9655
## P-Value [Acc > NIR] : 0.076366
##
## Kappa : 0.4986
##
## Mcnemar's Test P-Value : 0.008071
##
## Sensitivity : 0.9907
## Specificity : 0.4348
## Pos Pred Value : 0.9800
## Neg Pred Value : 0.6250
## Prevalence : 0.9655
## Detection Rate : 0.9565
## Detection Prevalence : 0.9760
## Balanced Accuracy : 0.7127
##
## 'Positive' Class : No
##
Naive Bayes:
# the Naive Bayes model is unchanged from Section III.4; only the cutoff differs
predict_nb <- predict(naivebayes, type= "raw", newdata=data_testing)
nb.pred.adj <- ifelse(predict_nb[, 2] >= .3, "Yes", "No")
result_nb <- nb.pred.adj %>% bind_cols(data_testing %>% dplyr::select(default))
## New names:
## • `` -> `...1`
colnames(result_nb) <- c("predicted_value", "actual_value")
confusion_nb <- confusionMatrix(as.factor(result_nb$predicted_value), as.factor(result_nb$actual_value))
confusion_nb
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1891 37
## Yes 40 32
##
## Accuracy : 0.9615
## 95% CI : (0.9521, 0.9695)
## No Information Rate : 0.9655
## P-Value [Acc > NIR] : 0.8509
##
## Kappa : 0.434
##
## Mcnemar's Test P-Value : 0.8197
##
## Sensitivity : 0.9793
## Specificity : 0.4638
## Pos Pred Value : 0.9808
## Neg Pred Value : 0.4444
## Prevalence : 0.9655
## Detection Rate : 0.9455
## Detection Prevalence : 0.9640
## Balanced Accuracy : 0.7215
##
## 'Positive' Class : No
##
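To compare the cutoffs at a glance, accuracy, sensitivity, and specificity can be tabulated over a threshold grid; a sketch for the logistic model, reusing the test-set probabilities model_fit_2 from above:

sapply(c(0.5, 0.4, 0.3), function(t) {
  pred <- factor(ifelse(model_fit_2 >= t, "Yes", "No"), levels = c("No", "Yes"))
  cm <- caret::confusionMatrix(pred, data_testing$default)
  c(Threshold = t, cm$overall["Accuracy"], cm$byClass[c("Sensitivity", "Specificity")])
})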
2. SMOTE
SMOTE (Synthetic Minority Over-sampling Technique) rebalances the training data by synthesizing new minority-class ("Yes") observations and undersampling the majority class, so the classifiers are no longer dominated by the roughly 96.7% of accounts that do not default. Assuming the DMwR implementation used below, perc.over = 100 doubles the minority class and perc.under = 200 keeps two majority cases per synthetic minority case.
Logistic Regression:
data_oversample <- SMOTE(default ~ ., data_training, perc.over = 100, perc.under = 200)  # rebalanced training set
model_glm_os <- glm(default ~ ., data = data_oversample, family = "binomial")
model_fit_os <- predict(model_glm_os, type = "response", newdata = data_testing)
predict_binary <- ifelse(model_fit_os > 0.5, "Yes", "No")
test_results <- predict_binary %>%
  dplyr::bind_cols(data_testing$default)
## New names:
## • `` -> `...1`
## • `` -> `...2`
colnames(test_results) <- c("Prediction","Actual data")
test_results$Prediction <- as.factor(test_results$Prediction)
test_results$`Actual data`<- as.factor(test_results$`Actual data`)
confusion_glm <- confusionMatrix(test_results$Prediction, test_results$`Actual data`)
confusion_glm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1672 11
## Yes 259 58
##
## Accuracy : 0.865
## 95% CI : (0.8492, 0.8797)
## No Information Rate : 0.9655
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2585
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8659
## Specificity : 0.8406
## Pos Pred Value : 0.9935
## Neg Pred Value : 0.1830
## Prevalence : 0.9655
## Detection Rate : 0.8360
## Detection Prevalence : 0.8415
## Balanced Accuracy : 0.8532
##
## 'Positive' Class : No
##
LDA:
model_lda_os <- lda(default ~ ., data = data_oversample)  # lda() takes no family argument
model_fit_os <- predict(model_lda_os, newdata = data_testing)$class  # predicted classes
result_lda <- model_fit_os %>% bind_cols(data_testing %>% dplyr::select(default))
## New names:
## • `` -> `...1`
colnames(result_lda) <- c("predicted_value", "actual_value")
confusion_lda <- confusionMatrix(as.factor(result_lda$predicted_value),as.factor(result_lda$actual_value))
confusion_lda
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1610 8
## Yes 321 61
##
## Accuracy : 0.8355
## 95% CI : (0.8185, 0.8515)
## No Information Rate : 0.9655
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2252
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8338
## Specificity : 0.8841
## Pos Pred Value : 0.9951
## Neg Pred Value : 0.1597
## Prevalence : 0.9655
## Detection Rate : 0.8050
## Detection Prevalence : 0.8090
## Balanced Accuracy : 0.8589
##
## 'Positive' Class : No
##
Naive Bayes:
model_nb_os <- naiveBayes(default ~ ., data = data_oversample)  # naiveBayes() takes no family argument
model_fit_os <- predict(model_nb_os, type = "class", newdata = data_testing)
result_nb <- model_fit_os %>% bind_cols(data_testing %>% dplyr::select(default))
## New names:
## • `` -> `...1`
colnames(result_nb) <- c("predicted_value", "actual_value")
confusion_nb <- confusionMatrix(as.factor(result_nb$predicted_value),as.factor(result_nb$actual_value))
confusion_nb
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1624 10
## Yes 307 59
##
## Accuracy : 0.8415
## 95% CI : (0.8247, 0.8572)
## No Information Rate : 0.9655
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2263
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8410
## Specificity : 0.8551
## Pos Pred Value : 0.9939
## Neg Pred Value : 0.1612
## Prevalence : 0.9655
## Detection Rate : 0.8120
## Detection Prevalence : 0.8170
## Balanced Accuracy : 0.8480
##
## 'Positive' Class : No
##