First, install (via install.packages()) and load all the required libraries.

# install (if needed) and load the required packages
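# a one-time install covers everything used below (uncomment on first use):
# install.packages(c("readxl", "dplyr", "cdata", "ggplot2", "rsample", "reshape2"))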
library(readxl)    # load excel files
library(dplyr)     # tidy up the data (data wrangling)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(cdata)     # data wrangling
## Loading required package: wrapr
## 
## Attaching package: 'wrapr'
## The following object is masked from 'package:dplyr':
## 
##     coalesce
library(ggplot2)   # beautiful plot
library(rsample)   # stratified sampling
library(reshape2)  # data wrangling

Data acquisition

Set the working directory to where the file is located, then import the data from the Excel file. Examine each predictor’s summary statistics and class.

# load the data with headers as column names
dat = read_excel('dsc_interview_assignment.xlsx', sheet="Data", range = "B2:Y30002")
# str() shows that all the features are stored as numeric
str(dat)
## tibble [30,000 x 24] (S3: tbl_df/tbl/data.frame)
##  $ LIMIT_BAL                 : num [1:30000] 20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
##  $ SEX                       : num [1:30000] 2 2 2 2 1 1 1 2 2 1 ...
##  $ EDUCATION                 : num [1:30000] 2 2 2 2 2 1 1 2 3 3 ...
##  $ MARRIAGE                  : num [1:30000] 1 2 2 1 1 2 2 2 1 2 ...
##  $ AGE                       : num [1:30000] 24 26 34 37 57 37 29 23 28 35 ...
##  $ PAY_0                     : num [1:30000] 2 -1 0 0 -1 0 0 0 0 -2 ...
##  $ PAY_2                     : num [1:30000] 2 2 0 0 0 0 0 -1 0 -2 ...
##  $ PAY_3                     : num [1:30000] -1 0 0 0 -1 0 0 -1 2 -2 ...
##  $ PAY_4                     : num [1:30000] -1 0 0 0 0 0 0 0 0 -2 ...
##  $ PAY_5                     : num [1:30000] -2 0 0 0 0 0 0 0 0 -1 ...
##  $ PAY_6                     : num [1:30000] -2 2 0 0 0 0 0 -1 0 -1 ...
##  $ BILL_AMT1                 : num [1:30000] 3913 2682 29239 46990 8617 ...
##  $ BILL_AMT2                 : num [1:30000] 3102 1725 14027 48233 5670 ...
##  $ BILL_AMT3                 : num [1:30000] 689 2682 13559 49291 35835 ...
##  $ BILL_AMT4                 : num [1:30000] 0 3272 14331 28314 20940 ...
##  $ BILL_AMT5                 : num [1:30000] 0 3455 14948 28959 19146 ...
##  $ BILL_AMT6                 : num [1:30000] 0 3261 15549 29547 19131 ...
##  $ PAY_AMT1                  : num [1:30000] 0 0 1518 2000 2000 ...
##  $ PAY_AMT2                  : num [1:30000] 689 1000 1500 2019 36681 ...
##  $ PAY_AMT3                  : num [1:30000] 0 1000 1000 1200 10000 657 38000 0 432 0 ...
##  $ PAY_AMT4                  : num [1:30000] 0 1000 1000 1100 9000 ...
##  $ PAY_AMT5                  : num [1:30000] 0 0 1000 1069 689 ...
##  $ PAY_AMT6                  : num [1:30000] 0 2000 5000 1000 679 ...
##  $ default payment next month: num [1:30000] 1 1 0 0 0 0 0 0 0 0 ...
summary(dat)
##    LIMIT_BAL            SEX          EDUCATION        MARRIAGE    
##  Min.   :  10000   Min.   :1.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:  50000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median : 140000   Median :2.000   Median :2.000   Median :2.000  
##  Mean   : 167484   Mean   :1.604   Mean   :1.853   Mean   :1.552  
##  3rd Qu.: 240000   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :1000000   Max.   :2.000   Max.   :6.000   Max.   :3.000  
##       AGE            PAY_0             PAY_2             PAY_3        
##  Min.   :21.00   Min.   :-2.0000   Min.   :-2.0000   Min.   :-2.0000  
##  1st Qu.:28.00   1st Qu.:-1.0000   1st Qu.:-1.0000   1st Qu.:-1.0000  
##  Median :34.00   Median : 0.0000   Median : 0.0000   Median : 0.0000  
##  Mean   :35.49   Mean   :-0.0167   Mean   :-0.1338   Mean   :-0.1662  
##  3rd Qu.:41.00   3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000  
##  Max.   :79.00   Max.   : 8.0000   Max.   : 8.0000   Max.   : 8.0000  
##      PAY_4             PAY_5             PAY_6           BILL_AMT1      
##  Min.   :-2.0000   Min.   :-2.0000   Min.   :-2.0000   Min.   :-165580  
##  1st Qu.:-1.0000   1st Qu.:-1.0000   1st Qu.:-1.0000   1st Qu.:   3559  
##  Median : 0.0000   Median : 0.0000   Median : 0.0000   Median :  22382  
##  Mean   :-0.2207   Mean   :-0.2662   Mean   :-0.2911   Mean   :  51223  
##  3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.:  67091  
##  Max.   : 8.0000   Max.   : 8.0000   Max.   : 8.0000   Max.   : 964511  
##    BILL_AMT2        BILL_AMT3         BILL_AMT4         BILL_AMT5     
##  Min.   :-69777   Min.   :-157264   Min.   :-170000   Min.   :-81334  
##  1st Qu.:  2985   1st Qu.:   2666   1st Qu.:   2327   1st Qu.:  1763  
##  Median : 21200   Median :  20089   Median :  19052   Median : 18105  
##  Mean   : 49179   Mean   :  47013   Mean   :  43263   Mean   : 40311  
##  3rd Qu.: 64006   3rd Qu.:  60165   3rd Qu.:  54506   3rd Qu.: 50191  
##  Max.   :983931   Max.   :1664089   Max.   : 891586   Max.   :927171  
##    BILL_AMT6          PAY_AMT1         PAY_AMT2          PAY_AMT3     
##  Min.   :-339603   Min.   :     0   Min.   :      0   Min.   :     0  
##  1st Qu.:   1256   1st Qu.:  1000   1st Qu.:    833   1st Qu.:   390  
##  Median :  17071   Median :  2100   Median :   2009   Median :  1800  
##  Mean   :  38872   Mean   :  5664   Mean   :   5921   Mean   :  5226  
##  3rd Qu.:  49198   3rd Qu.:  5006   3rd Qu.:   5000   3rd Qu.:  4505  
##  Max.   : 961664   Max.   :873552   Max.   :1684259   Max.   :896040  
##     PAY_AMT4         PAY_AMT5           PAY_AMT6       
##  Min.   :     0   Min.   :     0.0   Min.   :     0.0  
##  1st Qu.:   296   1st Qu.:   252.5   1st Qu.:   117.8  
##  Median :  1500   Median :  1500.0   Median :  1500.0  
##  Mean   :  4826   Mean   :  4799.4   Mean   :  5215.5  
##  3rd Qu.:  4013   3rd Qu.:  4031.5   3rd Qu.:  4000.0  
##  Max.   :621000   Max.   :426529.0   Max.   :528666.0  
##  default payment next month
##  Min.   :0.0000            
##  1st Qu.:0.0000            
##  Median :0.0000            
##  Mean   :0.2212            
##  3rd Qu.:0.0000            
##  Max.   :1.0000
print(paste("Is there any missing value?  ",sum(apply(dat,2, function(x) any(is.na(x))))))
## [1] "Is there any missing value?   0"
# No missing value
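
Had the check reported any missing values, a per-column count would locate them; a quick sketch:

colSums(is.na(dat))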

Data preprocessing

The loaded data stores categorical features in numeric form, so we recode them with meaningful labels and convert them to factors.

t = dat
# rename the long response column to something shorter
colnames(t)[ncol(t)] = "default"

# Change categorical features to factor
t$default = factor(t$default)
# encode sex
t$SEX = ifelse(dat$SEX==2, 'F', 'M')
# encode marriage
t$MARRIAGE = ifelse(dat$MARRIAGE==1,'married',ifelse(dat$MARRIAGE==2,'single','others'))
# encode education
t$EDUCATION = ifelse(dat$EDUCATION==1, 'grad_school',
                     ifelse(dat$EDUCATION==2, 'university', 
                            ifelse(dat$EDUCATION==3, 'high_school',
                                   ifelse(dat$EDUCATION==4, 'others', 'unknown'
                                                                   ))))
# Categorical features: SEX, EDUCATION, MARRIAGE and the repayment statuses PAY_0, PAY_2, ..., PAY_6
features_cat = c(colnames(t)[2:4], colnames(t)[6:11])

# convert each categorical feature to a factor
for (i in features_cat){
  t[i] = factor(pull(t[i]))
}
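
Equivalently, the loop can be replaced by a single vectorised call (a sketch, assuming dplyr >= 1.0 for across()):

# same result as the loop above, in one step
t = t %>% mutate(across(all_of(features_cat), factor))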

Data partitioning

Perform stratified sampling: 80% training, 20% test, then check the class proportions in both sets. The class distribution is skewed: roughly 78% of the observations are non-default cases, while the remaining 22% are defaults.

response = "default"
set.seed(100)    # for reproducibility
split = initial_split(t, prop = 0.8, strata = response)
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(response)` instead of `response` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
train = training(split)
test = testing(split)

summary(train$default) / nrow(train)
##         0         1 
## 0.7788241 0.2211759
summary(test$default) / nrow(test)
##         0         1 
## 0.7787035 0.2212965
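
As a sanity check, both splits closely reproduce the class proportions of the full data, which can be computed the same way:

summary(t$default) / nrow(t)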

Data visualization

From this point onward, we assume that only the training data is at our disposal; the test data is reserved for the performance evaluation stage.

Distribution of continuous features

Histogram

# Continuous features
features_cont = setdiff(colnames(train), c(features_cat, response))

# Change from wide to tall format
data_long = unpivot_to_blocks(train, nameForNewKeyColumn = "variables",
                              nameForNewValueColumn = "values", 
                              columnsToTakeFrom = features_cont)

data_long %>% ggplot(aes(x=values)) +
  geom_histogram(bins = 10, fill="gray") +
  facet_wrap(~variables, ncol = 4, scales = "free")

The above plot shows all the predictors’ distributions in a single figure. For a clearer view of a few features at a time, we can select a subset of the continuous features with features_cont[p:q], where p <= q. From the plots, all the continuous predictors are positively skewed.
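
For example, to plot only the first two continuous features (LIMIT_BAL and AGE):

data_long_sub = unpivot_to_blocks(train, nameForNewKeyColumn = "variables",
                                  nameForNewValueColumn = "values",
                                  columnsToTakeFrom = features_cont[1:2])
data_long_sub %>% ggplot(aes(x = values)) +
  geom_histogram(bins = 10, fill = "gray") +
  facet_wrap(~variables, ncol = 2, scales = "free")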

Boxplots

data_long = unpivot_to_blocks(train, nameForNewKeyColumn = "variables",
                              nameForNewValueColumn = "values", 
                              columnsToTakeFrom = features_cont[1:2])  
# Limit balance and age
ggplot(data_long,aes(x=default,y=values)) +
  geom_boxplot(color="blue",fill="blue",alpha=0.2,notch=TRUE,
               outlier.color="red",outlier.fill = "red",outlier.size = 2) +
  facet_wrap(~variables,ncol = 2,scales = "free")

Correlation matrix

corr_mat = cor(train[,features_cont])
corr_mat[upper.tri(corr_mat)]=NA
melted_cormat = melt(corr_mat, na.rm = TRUE)

ggplot(data = melted_cormat, aes(x=Var1, y = Var2, fill=value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                      midpoint = 0, limit = c(-1,1), space = "Lab",
                      name = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust=1, size =10, hjust=1)) +
  coord_fixed() +
  geom_text(aes(Var1, Var2, label = round(value,2)), color = "black", size = 3)

The bill statement amounts from April to September 2005 (BILL_AMT1 to BILL_AMT6) are highly linearly correlated.
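
The strongly correlated pairs can also be listed programmatically from the lower-triangular matrix computed above; a small sketch (0.9 is an arbitrary cutoff):

# feature pairs with |correlation| above 0.9
high_idx = which(abs(corr_mat) > 0.9 & lower.tri(corr_mat), arr.ind = TRUE)
data.frame(var1 = rownames(corr_mat)[high_idx[, 1]],
           var2 = colnames(corr_mat)[high_idx[, 2]],
           corr = round(corr_mat[high_idx], 2))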

Distribution of categorical features

Contingency table

table_cont = train %>% count(EDUCATION, default)   # cell counts

table_sum = train %>% count(EDUCATION, default) %>% group_by(EDUCATION) %>% 
  summarise(sum_col=sum(n))   # total per education level
row_sum = train %>% count(EDUCATION, default) %>% group_by(default) %>% 
  summarise(sum_row=sum(n))   # total per class label

# colormap (heatmap) of within-level default proportions
table_cont %>% left_join(table_sum, by = "EDUCATION") %>% 
  mutate(percent = n/sum_col) %>% 
  ggplot(aes(x = EDUCATION, y = default)) +
  geom_tile(aes(fill = percent)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  # constants belong outside aes(); mapped inside aes() they would be
  # recoded through a scale instead of used literally
  geom_text(aes(label = round(percent, 2)), hjust = 1, vjust = 1, colour = "white") +
  geom_text(aes(label = n), hjust = 1, vjust = 0, colour = "red")

# contingency table
table(train$EDUCATION, train$default)
##              
##                  0    1
##   grad_school 6808 1621
##   high_school 2919 1000
##   others        94    7
##   university  8602 2661
##   unknown      268   19
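
The same counts expressed as row-wise proportions make the default rate per education level explicit; a one-liner:

# proportion of each class within each education level
round(prop.table(table(train$EDUCATION, train$default), margin = 1), 2)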

Chi-squared test

The plot below compares the expected counts (under independence) with the actual counts for each class label within each category.
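
Under the independence hypothesis, the expected count of a cell is

$$E_{ij} = \frac{R_i \times C_j}{m},$$

where $R_i$ and $C_j$ are the corresponding row and column totals and $m$ is the number of training observations; this is exactly what the mutate() step below computes as expected_n.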

m = nrow(train)   # number of training observations
table_cont %>% left_join(table_sum, by = "EDUCATION") %>% 
  left_join(row_sum, by = "default") %>% 
  mutate(expected_n = (sum_col*sum_row)/m) %>% 
  unpivot_to_blocks(nameForNewKeyColumn = "variables",
                    nameForNewValueColumn = "values",
                    columnsToTakeFrom = c("n", "expected_n")) %>% 
  ggplot(aes(x = default, y = values, group = variables, fill = variables)) +
  geom_bar(stat = "identity", width = 0.5, position = "dodge") +
  facet_wrap(.~EDUCATION,ncol = 3, scales = "free_y") +
  geom_text(aes(label = round(values,2)), vjust = 1.6, size = 3) +
  theme_bw()

The chi-squared statistic can be used to test the dependence between a categorical feature and the target variable. Below are the predictors found to be associated with the outcome (default or non-default).

# categorical predictors together with the response
train_cat_res = train[, !(colnames(train) %in% features_cont)]

pvals = c()   # chi-squared p-value for each categorical predictor
idx = c()     # indices whose chi-squared approximation may be unreliable
for (i in (1:(ncol(train_cat_res)-1))) {
  pvals[i] = chisq.test(train_cat_res[,i], train_cat_res$default)$p.value
  # flag predictors that raise the "approximation may be incorrect" warning
  if (!is.list(tryCatch( { result <- chisq.test(train_cat_res[,i], train_cat_res$default) }
                         , warning = function(w) { print("TRUE") }))) {
    idx = c(idx, i)
  }
}
## Warning in chisq.test(train_cat_res[, i], train_cat_res$default): Chi-squared
## approximation may be incorrect
## [1] "TRUE"
## Warning in chisq.test(train_cat_res[, i], train_cat_res$default): Chi-squared
## approximation may be incorrect
## [1] "TRUE"
## Warning in chisq.test(train_cat_res[, i], train_cat_res$default): Chi-squared
## approximation may be incorrect
## [1] "TRUE"
## Warning in chisq.test(train_cat_res[, i], train_cat_res$default): Chi-squared
## approximation may be incorrect
## [1] "TRUE"
## Warning in chisq.test(train_cat_res[, i], train_cat_res$default): Chi-squared
## approximation may be incorrect
## [1] "TRUE"
## Warning in chisq.test(train_cat_res[, i], train_cat_res$default): Chi-squared
## approximation may be incorrect
## [1] "TRUE"
idx_sig = which(pvals <= 0.05)   # significant at the 5% level

idx_int = !(idx_sig %in% idx)   # drop predictors flagged with approximation warnings
colnames(train_cat_res)[idx_sig[idx_int]]
## [1] "SEX"       "EDUCATION" "MARRIAGE"
print(paste("Variables not independent with response: ", 
            colnames(train_cat_res)[idx_sig[idx_int]]))
## [1] "Variables not independent with response:  SEX"      
## [2] "Variables not independent with response:  EDUCATION"
## [3] "Variables not independent with response:  MARRIAGE"

Summary of findings

The following are some takeaways from the data visualization:
1. According to the boxplots, the notches of the two boxes in the ‘LIMIT_BAL’ panel do not overlap, which is strong evidence that the class medians differ significantly. For the other predictors, there is no significant difference in medians.
2. Based on the chi-squared test, SEX, MARRIAGE and EDUCATION are categorical predictors associated with the response variable (default or non-default).

Before stepping into modeling, I would like to highlight some pitfalls that might cause issues later on:
1. All the continuous features are positively skewed (a possible remedy is sketched below).
2. The bill statement amounts for the previous months (BILL_AMT1, ..., BILL_AMT6) are strongly linearly correlated.
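
A minimal sketch of one remedy for the skew, assuming a signed log transform is acceptable (the BILL_AMT* columns contain negative values, so plain log() would fail; signed_log and train_tf are names introduced here for illustration):

# compress the long right tails while preserving sign
signed_log = function(x) sign(x) * log1p(abs(x))
train_tf = train %>% mutate(across(all_of(features_cont), signed_log))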

Computing environment

##                _                           
## platform       x86_64-w64-mingw32          
## arch           x86_64                      
## os             mingw32                     
## system         x86_64, mingw32             
## status                                     
## major          4                           
## minor          0.2                         
## year           2020                        
## month          06                          
## day            22                          
## svn rev        78730                       
## language       R                           
## version.string R version 4.0.2 (2020-06-22)
## nickname       Taking Off Again

Reference

  1. Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.