Final Project: International Financial Management

I. Introduction

The dataset used in this project is Default, which contains credit card default data. It comes from the ISLR2 package in R and is already processed and cleaned.

The objective of this project is to explore the Default data and examine the relationships among its variables. First, I describe each variable. Then, I explore their patterns through descriptive statistics and plots. Finally, I build a model to assess how the predictors affect the dependent variable.

II. Exploratory Data Analysis

1. About the dataset

Data "Default" (credit card default) variable labels and descriptions

Order  Variable_Name  Type       Label_Description
1      Default        Binary     A factor with levels No and Yes indicating whether the customer defaulted on their debt
2      Student        Binary     A factor with levels No and Yes indicating whether the customer is a student
3      Balance        Numerical  The average balance that the customer has remaining on their credit card after making their monthly payment
4      Income         Numerical  Income of the customer

The dataset includes 4 variables, Default, Student, Balance, and Income, which are described in the table above. The dependent variable is Default, and the 3 predictors are Student, Balance, and Income. In other words, we use the account balance, the customer's income, and whether the customer is a student to predict the probability of credit card default.

2. Descriptive Statistics

Descriptive statistics for numerical variables

Variable      n       mean        sd    median      min       max     range      skew   kurtosis        se
Balance   10000   835.3749   483.715   823.637   0.0000  2654.323  2654.323 0.2459914 -0.3556736   4.83715
Income    10000 33516.9819 13336.640 34552.645 771.9677 73554.233 72782.266 0.0733187 -0.9000681 133.36640

For the two numerical variables, Balance and Income, the descriptive statistics table above summarizes the main features of the data.
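The column layout of the table closely resembles the output of describe() from the psych package; a sketch of how such a table could be produced (an assumption, since the original code is not shown):

library(ISLR2)   # Default data
library(psych)   # describe()

# Descriptive statistics for the two numerical variables
describe(Default[, c("balance", "income")])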

In terms of Balance, the mean and median of the 10,000 account balances are 835.37 and 823.64, respectively, with values ranging from 0 to 2654.32. The standard deviation is 483.72, which measures how spread out the data are. The standard error of 4.84 indicates how far the sample mean is likely to differ from the population mean. The skewness of 0.25 is close to 0, meaning the distribution is nearly symmetrical. A kurtosis of -0.36 indicates that the tails of the distribution are lighter than those of a normal distribution.

Regarding Income, customers' incomes vary from 772 to 73,554, with an average of 33,517. The spread of the data is reflected in a standard deviation of 13,337. Like Balance, the Income distribution is nearly symmetrical and has lighter tails than a normal distribution.

As can be seen in the graph, the number of defaulted accounts is 333, which accounts for about 3.3% of all accounts. In addition, among all customers, 2944 are students and 7056 are not.
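These counts can be obtained directly from the data (continuing with the Default data loaded above), for example:

table(Default$default)                        # No: 9667, Yes: 333
table(Default$student)                        # No: 7056, Yes: 2944
round(prop.table(table(Default$default)), 3)  # Yes = 0.033, about 3.3%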

3. Relationship among variables

As can be seen from the boxplot, there is a large difference in balance between defaulted and non-defaulted accounts. Non-defaulting customers have a mean balance of around 700-800, while that of defaulting customers is nearly 1800. The interquartile range of non-defaulted accounts' balances runs from about 500 to 1200, while that of defaulted accounts is roughly 1500-2000. In short, defaulting customers carry larger balances than non-defaulting customers.
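A sketch of how such a boxplot could be drawn with ggplot2 (the plotting code used in the report is not shown, so this is only an assumption):

library(ggplot2)

# Balance by default status
ggplot(Default, aes(x = default, y = balance)) +
  geom_boxplot()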

The histograms illustrate how customers' incomes are distributed. Both charts share a similar pattern, with two peaks at around 20,000 and 40,000. In the histogram for defaulted accounts, the frequency around 30,000 is only about one-fourth of that around 20,000, whereas for non-defaulted accounts it is about two-thirds. Overall, there is no clear difference between the two histograms, so we need the model in the next section to test whether income matters.
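Similarly, a possible sketch of the income histograms by default status, again assuming ggplot2 is used (loaded above):

# Income distribution for non-defaulted vs. defaulted accounts
ggplot(Default, aes(x = income)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ default, scales = "free_y")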

Among defaulting customers, 127 are students, which accounts for 38.14% of that group. Among non-defaulting customers, 2817 out of 9667 are students, or 29.14% of that group.
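These percentages come from a simple cross-tabulation of default status against student status, for example:

# Student counts within each default group, and the corresponding row shares
with(Default, table(default, student))
prop.table(with(Default, table(default, student)), margin = 1)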

III. Model

1. Model Construction

First, to build the model, we split the data into two parts in an 80/20 proportion, with 80% of the data used as training data and 20% as testing data. Splitting the data is a fundamental step in machine learning and data analysis that allows us to train and evaluate models effectively. The training set is used to fit the model so it can learn the patterns and relationships within the data, while the testing set is used to assess the model's performance and generalization on unseen data. By splitting the data, we simulate real-world scenarios, evaluate how well the model performs on new examples, and can detect issues such as overfitting or underfitting.
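A minimal sketch of such an 80/20 split, assuming the Default data from ISLR2; the seed and the index name train_index are chosen here only for illustration, as the actual split code is not shown in the report:

library(ISLR2)

set.seed(123)                                   # hypothetical seed
train_index   <- sample(nrow(Default), size = 0.8 * nrow(Default))
data_training <- Default[train_index, ]         # 8,000 observations
data_testing  <- Default[-train_index, ]        # 2,000 observations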

Based on the exploration of the data, logistic regression is used because the dependent variable, default, is binary.
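A minimal sketch of fitting this model with glm(), assuming the training split above; model_1 is the object name referred to in the evaluation section:

# Logistic regression of default on income, balance, and student status
model_1 <- glm(default ~ income + balance + student,
               family = "binomial", data = data_training)
summary(model_1)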

The first model is default ~ income + balance + student:

## 
## Call:
## glm(formula = default ~ income + balance + student, family = "binomial", 
##     data = data_training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4850  -0.1383  -0.0538  -0.0195   3.7573  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.097e+01  5.583e-01 -19.642   <2e-16 ***
## income       2.645e-06  9.262e-06   0.286   0.7752    
## balance      5.802e-03  2.623e-04  22.121   <2e-16 ***
## studentYes  -6.611e-01  2.655e-01  -2.490   0.0128 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2320.3  on 7999  degrees of freedom
## Residual deviance: 1229.4  on 7996  degrees of freedom
## AIC: 1237.4
## 
## Number of Fisher Scoring iterations: 8

R-Square of the model:

## [1] 0.4701481

Adjusted R-Square of the model:

## [1] 0.4699

Looking at the result, we can see that:

Income has a p-value of 0.775, which is greater than 0.05, so we fail to reject H0. Hence, at the 5% significance level, income has no predictive power for default.

Both balance and student have p-values small enough to reject H0, indicating that these two variables are significant predictors of the probability of default.

The R-squared for this model is 47.01% and the adjusted R-squared is 46.99%. This means that the predictors explain roughly 47% of the variation in the probability of default.
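These values are consistent with McFadden's pseudo R-squared computed from the deviances in the model summary above; a sketch of how they could be obtained (an assumption, since the calculation is not shown in the report):

# McFadden's pseudo R-squared: 1 - residual deviance / null deviance
r2 <- 1 - model_1$deviance / model_1$null.deviance        # ≈ 0.4701
# Adjusted version using the usual (n - 1) / (n - k - 1) correction,
# with n = 8000 training observations and k = 3 predictors
n <- nrow(data_training)
k <- 3
r2_adj <- 1 - (1 - r2) * (n - 1) / (n - k - 1)            # ≈ 0.4699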

2. Model evaluation

In this part, we first refer to the model fitted on the training data (data_training) as model_1: default ~ income + balance + student, i.e., default is predicted from balance, student, and income. The next step is to apply the model to the testing data with the predict function: after being trained on the training data, model_1 is used to generate predictions for the new, unseen testing data. The predicted probabilities are then converted into binary form with the ifelse function: if the predicted probability of default is larger than 0.5, the observation is labeled "Yes" (default); otherwise it is labeled "No" (non-default). Finally, the predicted values and the actual data are put together in a table to assess the accuracy of the model.
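A minimal sketch of these steps, assuming the objects defined earlier; the names pred_prob, pred_class, and result_glm are chosen here for illustration, as the exact code is not shown in the report:

library(tibble)  # tibble()
library(caret)   # confusionMatrix()

# Predicted default probabilities on the testing data, converted to Yes / No
pred_prob  <- predict(model_1, newdata = data_testing, type = "response")
pred_class <- ifelse(pred_prob > 0.5, "Yes", "No")

# Side-by-side comparison of predictions and actual values, then the confusion matrix
result_glm <- tibble(Prediction = pred_class, `Actual data` = data_testing$default)
result_glm
confusionMatrix(factor(pred_class, levels = c("No", "Yes")), data_testing$default)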

## # A tibble: 2,000 × 2
##    Prediction `Actual data`
##    <chr>      <fct>        
##  1 No         No           
##  2 No         No           
##  3 No         No           
##  4 No         No           
##  5 No         No           
##  6 No         No           
##  7 No         No           
##  8 No         No           
##  9 No         No           
## 10 No         No           
## # ℹ 1,990 more rows
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1923   50
##        Yes    8   19
##                                           
##                Accuracy : 0.971           
##                  95% CI : (0.9627, 0.9779)
##     No Information Rate : 0.9655          
##     P-Value [Acc > NIR] : 0.09672         
##                                           
##                   Kappa : 0.3839          
##                                           
##  Mcnemar's Test P-Value : 7.303e-08       
##                                           
##             Sensitivity : 0.9959          
##             Specificity : 0.2754          
##          Pos Pred Value : 0.9747          
##          Neg Pred Value : 0.7037          
##              Prevalence : 0.9655          
##          Detection Rate : 0.9615          
##    Detection Prevalence : 0.9865          
##       Balanced Accuracy : 0.6356          
##                                           
##        'Positive' Class : No              
## 

Looking at the confusion matrix above, we can see that out of the 2000 observations in the testing data, 1923 are correctly predicted as "No", while only 19 are correctly predicted as "Yes". The false predictions are 8 actual "No" accounts predicted as "Yes" and 50 actual "Yes" accounts predicted as "No". The accuracy of the model is therefore 97.1%, which looks very good. The sensitivity of the model measures how well it predicts the "No" observations, i.e., non-defaulting accounts: 99.59% of the "No" values are correctly predicted (1923 out of 1931). This strong sensitivity, combined with the imbalanced distribution of the data ("No" values far outnumber "Yes" values), exaggerates the model's accuracy. If we look at the specificity, which measures how well the model detects defaulting accounts, we can see the model's weakness. Among the 2000 observations, there are only 69 defaulted accounts, or 3.45% of the sample, and only 19 of them are detected, meaning that just 27.54% of defaults are correctly identified.
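For reference, a quick sketch recomputing these headline metrics directly from the confusion matrix above:

# Cells of the confusion matrix (rows = prediction, columns = reference)
no_no   <- 1923   # predicted No,  actual No
no_yes  <- 50     # predicted No,  actual Yes
yes_no  <- 8      # predicted Yes, actual No
yes_yes <- 19     # predicted Yes, actual Yes

(no_no + yes_yes) / 2000        # accuracy    = 0.971
no_no   / (no_no + yes_no)      # sensitivity = 1923 / 1931 ≈ 0.9959 ("No" is the positive class)
yes_yes / (no_yes + yes_yes)    # specificity = 19 / 69 ≈ 0.2754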

For comparison, we also fit a linear discriminant analysis (LDA) model with the same predictors on the training data and evaluate it on the testing data in the same way:

library(MASS)    # for lda()
# Fit an LDA model on the same predictors as the logistic regression
lda_fit <- lda(default ~ income + balance + student, data = data_training)
lda_fit
## Call:
## lda(default ~ income + balance + student, data = data_training)
## 
## Prior probabilities of groups:
##    No   Yes 
## 0.967 0.033 
## 
## Group means:
##       income   balance studentYes
## No  33580.79  801.9643  0.2889090
## Yes 31831.14 1756.4756  0.3863636
## 
## Coefficients of linear discriminants:
##                      LD1
## income      3.005265e-06
## balance     2.248463e-03
## studentYes -1.716001e-01
# Predicted classes on the testing data, combined with the actual defaults
predict_lda <- predict(lda_fit, newdata = data_testing)$class
result_lda <- predict_lda %>% bind_cols(data_testing %>% dplyr::select(default))
## New names:
## • `` -> `...1`
colnames(result_lda) <- c("predicted_value", "actual_value")
confusion_lda <- confusionMatrix(result_lda$predicted_value, result_lda$actual_value)
confusion_lda
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1927   53
##        Yes    4   16
##                                           
##                Accuracy : 0.9715          
##                  95% CI : (0.9632, 0.9783)
##     No Information Rate : 0.9655          
##     P-Value [Acc > NIR] : 0.07637         
##                                           
##                   Kappa : 0.3495          
##                                           
##  Mcnemar's Test P-Value : 2.047e-10       
##                                           
##             Sensitivity : 0.9979          
##             Specificity : 0.2319          
##          Pos Pred Value : 0.9732          
##          Neg Pred Value : 0.8000          
##              Prevalence : 0.9655          
##          Detection Rate : 0.9635          
##    Detection Prevalence : 0.9900          
##       Balanced Accuracy : 0.6149          
##                                           
##        'Positive' Class : No              
##