Project: Credit Card Default Prediction
I. Introduction
The dataset used in this project, named Default, contains credit card default data. It comes from the ISLR2 package in R and is already processed and cleaned.
The objective of this project is to explore the Default data and examine the relationships among its variables. First, I describe each variable. Then, I explore their patterns through descriptive statistics and plots. Finally, I fit models to assess how the predictors affect the dependent variable.
II. Exploratory Data Analysis
1. About the dataset
| Order | Variable Name | Type | Description |
|---|---|---|---|
| 1 | Default | Binary | A factor with levels No and Yes indicating whether the customer defaulted on their debt |
| 2 | Student | Binary | A factor with levels No and Yes indicating whether the customer is a student |
| 3 | Balance | Numerical | The average balance that the customer has remaining on their credit card after making their monthly payment |
| 4 | Income | Numerical | Income of the customer |
The dataset includes 4 variables, Default, Student, Balance, and Income, described in the table above. The dependent variable is Default, with Student, Balance, and Income as the 3 predictors. In other words, we use an account's balance, the customer's income, and whether the customer is a student to predict the probability of default on the credit card.
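A minimal sketch of loading and inspecting the data, assuming the ISLR2 package is installed:

library(ISLR2)  # provides the Default dataset
data("Default")
str(Default)    # 10000 obs. of 4 variables: default, student, balance, income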
2. Descriptive Statistics
| Variable | n | mean | sd | median | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|
| Balance | 10000 | 835.3749 | 483.715 | 823.637 | 0.0000 | 2654.323 | 2654.323 | 0.2459914 | -0.3556736 | 4.83715 |
| Income | 10000 | 33516.9819 | 13336.640 | 34552.645 | 771.9677 | 73554.233 | 72782.266 | 0.0733187 | -0.9000681 | 133.36640 |
For the 2 numerical variables, Balance and Income, the descriptive statistics table above summarizes the features of the data.
In terms of Balance, the mean and median balance of the 10000 accounts are 835.37 and 823.64, respectively, with values ranging from 0 to 2654.32. The standard deviation of 483.72 measures how spread out the data is, and the standard error of 4.84 estimates how far the sample mean may differ from the population mean. The skewness of 0.25 is close to 0, meaning the distribution is nearly symmetric. The kurtosis of -0.36 is excess kurtosis (0 for a normal distribution), indicating tails slightly lighter than normal.
Regarding Income, customer income varies from 772 to 73554, with an average of 33517 and a standard deviation of 13337. Like Balance, the Income distribution is nearly symmetric with lighter-than-normal tails.
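The table above follows the output format of psych::describe; a sketch, assuming that function was applied to the two numerical columns:

library(psych)
describe(Default[, c("balance", "income")])  # reports n, mean, sd, median, skew, kurtosis, se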
As the bar charts show, 333 accounts defaulted, which is about 3.3% of the total. Among all customers, 2944 are students and 7056 are not.
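A sketch of the underlying counts:

table(Default$default)  # No: 9667, Yes: 333
table(Default$student)  # No: 7056, Yes: 2944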
3. Relationship among variables
As the boxplot shows, the difference in balance between default and non-default accounts is substantial. Non-default customers have a median balance of around 700-800, while for default customers it is nearly 1800. The interquartile range of non-default balances runs from about 500 to 1200, versus roughly 1500-2000 for default accounts. In short, default customers carry much larger balances than non-default customers.
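A sketch reproducing such a boxplot in base R:

boxplot(balance ~ default, data = Default,
        xlab = "Default", ylab = "Balance",
        main = "Balance by default status")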
The histograms illustrate how customer income is distributed. Both charts share a similar bimodal pattern, with peaks near 20000 and 40000. In the default-account histogram the frequency around 30000 is only about one-fourth of that around 20000, while for non-default accounts it is about two-thirds. Overall, there is no clear difference between the two distributions, so we rely on the models below to test the effect of income.
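A sketch of one way to draw the two income histograms, assuming ggplot2:

library(ggplot2)
ggplot(Default, aes(x = income)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ default, scales = "free_y") +
  labs(x = "Income", y = "Count", title = "Income by default status")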
Among the 333 default customers, 127 (38.14%) are students. Among the 9667 non-default customers, 2817 (29.14%) are students.
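These proportions come from the student-by-default cross-tabulation; a sketch:

tab <- table(Default$student, Default$default)
tab
prop.table(tab, margin = 2)  # share of students within each default group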
III. Model
1. Model Construction
First, to build the models, we split the data 80/20: 80% as training data and 20% as testing data. Splitting data is a fundamental step in machine learning. The training set is used to fit each model, letting it learn the patterns and relationships in the data; the testing set is held out to assess the model's performance and generalization on unseen data. This simulates how the model would behave on new, real-world examples and helps reveal issues such as overfitting or underfitting.
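A minimal sketch of the split; the seed and the sampling method are assumptions, since the report does not state them:

set.seed(1)  # hypothetical seed for reproducibility
train_index <- sample(seq_len(nrow(Default)), size = 0.8 * nrow(Default))
data_training <- Default[train_index, ]
data_testing <- Default[-train_index, ]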
Based on this exploration of the data, logistic regression is the natural first model because the dependent variable default is binary.
2. Logistic Regression
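The summary below comes from fitting the logistic regression on the training data; the call echoed in the output confirms the formula and family. A minimal sketch (the object name model_glm is my assumption):

model_glm <- glm(default ~ income + balance + student,
                 family = "binomial", data = data_training)
summary(model_glm)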
##
## Call:
## glm(formula = default ~ income + balance + student, family = "binomial",
## data = data_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4850 -0.1383 -0.0538 -0.0195 3.7573
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.097e+01 5.583e-01 -19.642 <2e-16 ***
## income 2.645e-06 9.262e-06 0.286 0.7752
## balance 5.802e-03 2.623e-04 22.121 <2e-16 ***
## studentYes -6.611e-01 2.655e-01 -2.490 0.0128 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2320.3 on 7999 degrees of freedom
## Residual deviance: 1229.4 on 7996 degrees of freedom
## AIC: 1237.4
##
## Number of Fisher Scoring iterations: 8
Pseudo R-squared of the model:
## [1] 0.4701481
Adjusted pseudo R-squared of the model:
## [1] 0.4699
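For a logistic regression these are pseudo R-squared values: 1 - 1229.4/2320.3 ≈ 0.4701 matches McFadden's measure, and the second figure matches a linear-model-style adjustment for the 3 predictors. A sketch of the computation, assuming the model_glm object above:

r2 <- 1 - model_glm$deviance / model_glm$null.deviance  # McFadden's pseudo R-squared, ~0.4701
n <- nrow(data_training); k <- 3                        # k = number of predictors
1 - (1 - r2) * (n - 1) / (n - k - 1)                    # adjusted analogue, ~0.4699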
Confusion Matrix:
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1923 50
## Yes 8 19
##
## Accuracy : 0.971
## 95% CI : (0.9627, 0.9779)
## No Information Rate : 0.9655
## P-Value [Acc > NIR] : 0.09672
##
## Kappa : 0.3839
##
## Mcnemar's Test P-Value : 7.303e-08
##
## Sensitivity : 0.9959
## Specificity : 0.2754
## Pos Pred Value : 0.9747
## Neg Pred Value : 0.7037
## Prevalence : 0.9655
## Detection Rate : 0.9615
## Detection Prevalence : 0.9865
## Balanced Accuracy : 0.6356
##
## 'Positive' Class : No
##
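For reference, a sketch of how the confusion matrix above can be produced with caret at the standard 0.5 cutoff (Section IV reuses the same pattern with 0.4 and 0.3):

library(caret)
prob_glm <- predict(model_glm, type = "response", newdata = data_testing)
pred_glm <- factor(ifelse(prob_glm >= 0.5, "Yes", "No"), levels = c("No", "Yes"))
confusionMatrix(pred_glm, data_testing$default)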
3. Linear Discriminant Analysis
model_lda <- lda(default ~ income + balance + student, data = data_training)  # renamed so MASS::lda is not masked
predict_lda <- predict(model_lda, newdata = data_testing)  # predict.lda returns class, posterior, and x
result_lda <- predict_lda %>% bind_cols(data_testing %>% dplyr::select(default))
result_lda <- result_lda %>% dplyr::select(class, default)
colnames(result_lda) <- c("predicted_value", "actual_value")
confusion_lda <- confusionMatrix(result_lda$predicted_value, result_lda$actual_value)
confusion_lda
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1927 53
## Yes 4 16
##
## Accuracy : 0.9715
## 95% CI : (0.9632, 0.9783)
## No Information Rate : 0.9655
## P-Value [Acc > NIR] : 0.07637
##
## Kappa : 0.3495
##
## Mcnemar's Test P-Value : 2.047e-10
##
## Sensitivity : 0.9979
## Specificity : 0.2319
## Pos Pred Value : 0.9732
## Neg Pred Value : 0.8000
## Prevalence : 0.9655
## Detection Rate : 0.9635
## Detection Prevalence : 0.9900
## Balanced Accuracy : 0.6149
##
## 'Positive' Class : No
##
4. Naive Bayes
naivebayes <- naiveBayes(default ~ income + balance + student, data = data_training)
naivebayes
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## No Yes
## 0.967 0.033
##
## Conditional probabilities:
## income
## Y [,1] [,2]
## No 33580.79 13301.68
## Yes 31831.14 13800.33
##
## balance
## Y [,1] [,2]
## No 801.9643 455.4274
## Yes 1756.4756 337.6413
##
## student
## Y No Yes
## No 0.7110910 0.2889090
## Yes 0.6136364 0.3863636
predict_nb <- predict(naivebayes, type = "raw", newdata = data_testing)  # class posterior probabilities
nb.pred.adj <- ifelse(predict_nb[, 2] >= .5, "Yes", "No")  # 0.5 cutoff on P(default = Yes)
result_nb <- nb.pred.adj %>% bind_cols(data_testing %>% dplyr::select(default))
## New names:
## • `` -> `...1`
colnames(result_nb) <- c("predicted_value", "actual_value")
confusion_nb <- confusionMatrix(as.factor(result_nb$predicted_value), as.factor(result_nb$actual_value))
confusion_nb
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1922 53
## Yes 9 16
##
## Accuracy : 0.969
## 95% CI : (0.9604, 0.9762)
## No Information Rate : 0.9655
## P-Value [Acc > NIR] : 0.2149
##
## Kappa : 0.3281
##
## Mcnemar's Test P-Value : 4.734e-08
##
## Sensitivity : 0.9953
## Specificity : 0.2319
## Pos Pred Value : 0.9732
## Neg Pred Value : 0.6400
## Prevalence : 0.9655
## Detection Rate : 0.9610
## Detection Prevalence : 0.9875
## Balanced Accuracy : 0.6136
##
## 'Positive' Class : No
##
IV. Robustness
1. Threshold Modification
Because defaults are rare (about 3.3% of accounts), lowering the standard 0.5 probability cutoff trades a little overall accuracy for better detection of defaulters. We re-evaluate the three classifiers at cutoffs of 0.4 and 0.3.
Threshold 0.4
Logistic Regression:
# fit on the training data and predict default probabilities for the test set
model_2 <- glm(default ~ income + balance + student, data = data_training, family = "binomial")
model_fit_2 <- predict(model_2, type = "response", newdata = data_testing)
predict_binary_2 <- ifelse(model_fit_2 >= 0.4, "Yes", "No")  # lowered 0.4 cutoff
test_results_2 <- predict_binary_2 %>%
  dplyr::bind_cols(data_testing$default)
## New names:
## • `` -> `...1`
## • `` -> `...2`
colnames(test_results_2) <- c("Prediction","Actual data")
#confusion matrix
test_results_2$Prediction <- as.factor(test_results_2$Prediction)
test_results_2$`Actual data`<- as.factor(test_results_2$`Actual data`)
confusion_glm_2 <- confusionMatrix(test_results_2$Prediction, test_results_2$`Actual data`)
confusion_glm_2
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1918 42
## Yes 13 27
##
## Accuracy : 0.9725
## 95% CI : (0.9644, 0.9792)
## No Information Rate : 0.9655
## P-Value [Acc > NIR] : 0.0454165
##
## Kappa : 0.4823
##
## Mcnemar's Test P-Value : 0.0001597
##
## Sensitivity : 0.9933
## Specificity : 0.3913
## Pos Pred Value : 0.9786
## Neg Pred Value : 0.6750
## Prevalence : 0.9655
## Detection Rate : 0.9590
## Detection Prevalence : 0.9800
## Balanced Accuracy : 0.6923
##
## 'Positive' Class : No
##
LDA:
lda.pred.adj <- ifelse(predict_lda$posterior[, 2] >= .4, "Yes", "No")  # 0.4 cutoff on the LDA posterior
result_lda_2 <- lda.pred.adj %>% bind_cols(data_testing %>% dplyr::select(default))
## New names:
## • `` -> `...1`
colnames(result_lda_2) <- c("predicted_value", "actual_value")
confusion_lda_2 <- confusionMatrix(as.factor(result_lda_2$predicted_value),result_lda_2$actual_value)
confusion_lda_2
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1923 49
## Yes 8 20
##
## Accuracy : 0.9715
## 95% CI : (0.9632, 0.9783)
## No Information Rate : 0.9655
## P-Value [Acc > NIR] : 0.07637
##
## Kappa : 0.4004
##
## Mcnemar's Test P-Value : 1.17e-07
##
## Sensitivity : 0.9959
## Specificity : 0.2899
## Pos Pred Value : 0.9752
## Neg Pred Value : 0.7143
## Prevalence : 0.9655
## Detection Rate : 0.9615
## Detection Prevalence : 0.9860
## Balanced Accuracy : 0.6429
##
## 'Positive' Class : No
##
Naive Bayes:
# the Naive Bayes model is unchanged from Section III.4; only the cutoff differs
predict_nb <- predict(naivebayes, type= "raw", newdata=data_testing)
nb.pred.adj <- ifelse(predict_nb[, 2] >= .4, "Yes", "No")
result_nb <- nb.pred.adj %>% bind_cols(data_testing %>% dplyr::select(default))
## New names:
## • `` -> `...1`
colnames(result_nb) <- c("predicted_value", "actual_value")
confusion_nb <- confusionMatrix(as.factor(result_nb$predicted_value), as.factor(result_nb$actual_value))
confusion_nb
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1910 47
## Yes 21 22
##
## Accuracy : 0.966
## 95% CI : (0.9571, 0.9735)
## No Information Rate : 0.9655
## P-Value [Acc > NIR] : 0.483145
##
## Kappa : 0.3763
##
## Mcnemar's Test P-Value : 0.002432
##
## Sensitivity : 0.9891
## Specificity : 0.3188
## Pos Pred Value : 0.9760
## Neg Pred Value : 0.5116
## Prevalence : 0.9655
## Detection Rate : 0.9550
## Detection Prevalence : 0.9785
## Balanced Accuracy : 0.6540
##
## 'Positive' Class : No
##
Threshold 0.3
Logistic Regression:
# same fitted model as above; only the cutoff is lowered to 0.3
model_fit_3 <- predict(model_2, type = "response", newdata = data_testing)
predict_binary_3 <- ifelse(model_fit_3 >= 0.3, "Yes", "No")
test_results_3 <- predict_binary_3 %>%
  dplyr::bind_cols(data_testing$default)
## New names:
## • `` -> `...1`
## • `` -> `...2`
colnames(test_results_3) <- c("Prediction","Actual data")
#confusion matrix
test_results_3$Prediction <- as.factor(test_results_3$Prediction)
test_results_3$`Actual data`<- as.factor(test_results_3$`Actual data`)
confusion_glm_3 <- confusionMatrix(test_results_3$Prediction, test_results_3$`Actual data`)
confusion_glm_3
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1906 37
## Yes 25 32
##
## Accuracy : 0.969
## 95% CI : (0.9604, 0.9762)
## No Information Rate : 0.9655
## P-Value [Acc > NIR] : 0.2149
##
## Kappa : 0.4921
##
## Mcnemar's Test P-Value : 0.1624
##
## Sensitivity : 0.9871
## Specificity : 0.4638
## Pos Pred Value : 0.9810
## Neg Pred Value : 0.5614
## Prevalence : 0.9655
## Detection Rate : 0.9530
## Detection Prevalence : 0.9715
## Balanced Accuracy : 0.7254
##
## 'Positive' Class : No
##
LDA:
lda.pred.adj <- ifelse(predict_lda$posterior[, 2] > .3, "Yes", "No")
result_lda_2 <- lda.pred.adj %>% bind_cols(data_testing %>% dplyr::select(default))
## New names:
## • `` -> `...1`
colnames(result_lda_2) <- c("predicted_value", "actual_value")
confusion_lda_2 <- confusionMatrix(as.factor(result_lda_2$predicted_value),result_lda_2$actual_value)
confusion_lda_2
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1913 39
## Yes 18 30
##
## Accuracy : 0.9715
## 95% CI : (0.9632, 0.9783)
## No Information Rate : 0.9655
## P-Value [Acc > NIR] : 0.076366
##
## Kappa : 0.4986
##
## Mcnemar's Test P-Value : 0.008071
##
## Sensitivity : 0.9907
## Specificity : 0.4348
## Pos Pred Value : 0.9800
## Neg Pred Value : 0.6250
## Prevalence : 0.9655
## Detection Rate : 0.9565
## Detection Prevalence : 0.9760
## Balanced Accuracy : 0.7127
##
## 'Positive' Class : No
##
Naive Bayes:
# the Naive Bayes model is unchanged from Section III.4; only the cutoff differs
predict_nb <- predict(naivebayes, type= "raw", newdata=data_testing)
nb.pred.adj <- ifelse(predict_nb[, 2] >= .3, "Yes", "No")
result_nb <- nb.pred.adj %>% bind_cols(data_testing %>% dplyr::select(default))
## New names:
## • `` -> `...1`
colnames(result_nb) <- c("predicted_value", "actual_value")
confusion_nb <- confusionMatrix(as.factor(result_nb$predicted_value), as.factor(result_nb$actual_value))
confusion_nb
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1891 37
## Yes 40 32
##
## Accuracy : 0.9615
## 95% CI : (0.9521, 0.9695)
## No Information Rate : 0.9655
## P-Value [Acc > NIR] : 0.8509
##
## Kappa : 0.434
##
## Mcnemar's Test P-Value : 0.8197
##
## Sensitivity : 0.9793
## Specificity : 0.4638
## Pos Pred Value : 0.9808
## Neg Pred Value : 0.4444
## Prevalence : 0.9655
## Detection Rate : 0.9455
## Detection Prevalence : 0.9640
## Balanced Accuracy : 0.7215
##
## 'Positive' Class : No
##
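To compare the cutoffs at a glance, accuracy, sensitivity, and specificity can be tabulated over a threshold grid; a sketch for the logistic model, reusing the test-set probabilities model_fit_2 from above:

sapply(c(0.5, 0.4, 0.3), function(t) {
  pred <- factor(ifelse(model_fit_2 >= t, "Yes", "No"), levels = c("No", "Yes"))
  cm <- caret::confusionMatrix(pred, data_testing$default)
  c(Threshold = t, cm$overall["Accuracy"], cm$byClass[c("Sensitivity", "Specificity")])
})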
2. SMOTE
SMOTE (Synthetic Minority Over-sampling Technique) rebalances the training data by synthesizing new minority-class ("Yes") observations and undersampling the majority class, so the classifiers are no longer dominated by the roughly 96.7% of accounts that do not default. Assuming the DMwR implementation used below, perc.over = 100 doubles the minority class and perc.under = 200 keeps two majority cases per synthetic minority case.
Logistic Regression:
data_oversample <- SMOTE(default ~ ., data_training, perc.over = 100, perc.under = 200)  # rebalanced training set
model_glm_os <- glm(default ~ ., data = data_oversample, family = "binomial")
model_fit_os <- predict(model_glm_os, type = "response", newdata = data_testing)
predict_binary <- ifelse(model_fit_os > 0.5, "Yes", "No")
test_results <- predict_binary %>%
  dplyr::bind_cols(data_testing$default)
## New names:
## • `` -> `...1`
## • `` -> `...2`
colnames(test_results) <- c("Prediction","Actual data")
test_results$Prediction <- as.factor(test_results$Prediction)
test_results$`Actual data`<- as.factor(test_results$`Actual data`)
confusion_glm <- confusionMatrix(test_results$Prediction, test_results$`Actual data`)
confusion_glm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1672 11
## Yes 259 58
##
## Accuracy : 0.865
## 95% CI : (0.8492, 0.8797)
## No Information Rate : 0.9655
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2585
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8659
## Specificity : 0.8406
## Pos Pred Value : 0.9935
## Neg Pred Value : 0.1830
## Prevalence : 0.9655
## Detection Rate : 0.8360
## Detection Prevalence : 0.8415
## Balanced Accuracy : 0.8532
##
## 'Positive' Class : No
##
LDA:
model_lda_os <- lda(default ~ ., data = data_oversample)  # lda() takes no family argument
model_fit_os <- predict(model_lda_os, newdata = data_testing)$class  # predicted classes
result_lda <- model_fit_os %>% bind_cols(data_testing %>% dplyr::select(default))
## New names:
## • `` -> `...1`
colnames(result_lda) <- c("predicted_value", "actual_value")
confusion_lda <- confusionMatrix(as.factor(result_lda$predicted_value),as.factor(result_lda$actual_value))
confusion_lda
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1610 8
## Yes 321 61
##
## Accuracy : 0.8355
## 95% CI : (0.8185, 0.8515)
## No Information Rate : 0.9655
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2252
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8338
## Specificity : 0.8841
## Pos Pred Value : 0.9951
## Neg Pred Value : 0.1597
## Prevalence : 0.9655
## Detection Rate : 0.8050
## Detection Prevalence : 0.8090
## Balanced Accuracy : 0.8589
##
## 'Positive' Class : No
##
Naive Bayes:
model_nb_os <- naiveBayes(default ~ ., data = data_oversample)  # naiveBayes() takes no family argument
model_fit_os <- predict(model_nb_os, type = "class", newdata = data_testing)
result_nb <- model_fit_os %>% bind_cols(data_testing %>% dplyr::select(default))
## New names:
## • `` -> `...1`
colnames(result_nb) <- c("predicted_value", "actual_value")
confusion_nb <- confusionMatrix(as.factor(result_nb$predicted_value),as.factor(result_nb$actual_value))
confusion_nb
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1624 10
## Yes 307 59
##
## Accuracy : 0.8415
## 95% CI : (0.8247, 0.8572)
## No Information Rate : 0.9655
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2263
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8410
## Specificity : 0.8551
## Pos Pred Value : 0.9939
## Neg Pred Value : 0.1612
## Prevalence : 0.9655
## Detection Rate : 0.8120
## Detection Prevalence : 0.8170
## Balanced Accuracy : 0.8480
##
## 'Positive' Class : No
##