First and foremost, install (use install.packages()) and load all the required libraries.
# install the required packages
library(readxl) # load excel files
library(dplyr) # tidy up the data (data wrangling)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(cdata) # data wrangling
## Loading required package: wrapr
##
## Attaching package: 'wrapr'
## The following object is masked from 'package:dplyr':
##
## coalesce
library(ggplot2) # beautiful plot
library(rsample) # stratified sampling
library(reshape2) # data wrangling
Import data in excel file from working directory. Examine each predictor’s statistical measures and their classes. Set the working directory to where the file is located.
# load the data with headers as column names
dat = read_excel('dsc_interview_assignment.xlsx', sheet="Data", range = "B2:Y30002")
# From the str() output, we get to know that the features are all in numeric format.
str(dat)
## tibble [30,000 x 24] (S3: tbl_df/tbl/data.frame)
## $ LIMIT_BAL : num [1:30000] 20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
## $ SEX : num [1:30000] 2 2 2 2 1 1 1 2 2 1 ...
## $ EDUCATION : num [1:30000] 2 2 2 2 2 1 1 2 3 3 ...
## $ MARRIAGE : num [1:30000] 1 2 2 1 1 2 2 2 1 2 ...
## $ AGE : num [1:30000] 24 26 34 37 57 37 29 23 28 35 ...
## $ PAY_0 : num [1:30000] 2 -1 0 0 -1 0 0 0 0 -2 ...
## $ PAY_2 : num [1:30000] 2 2 0 0 0 0 0 -1 0 -2 ...
## $ PAY_3 : num [1:30000] -1 0 0 0 -1 0 0 -1 2 -2 ...
## $ PAY_4 : num [1:30000] -1 0 0 0 0 0 0 0 0 -2 ...
## $ PAY_5 : num [1:30000] -2 0 0 0 0 0 0 0 0 -1 ...
## $ PAY_6 : num [1:30000] -2 2 0 0 0 0 0 -1 0 -1 ...
## $ BILL_AMT1 : num [1:30000] 3913 2682 29239 46990 8617 ...
## $ BILL_AMT2 : num [1:30000] 3102 1725 14027 48233 5670 ...
## $ BILL_AMT3 : num [1:30000] 689 2682 13559 49291 35835 ...
## $ BILL_AMT4 : num [1:30000] 0 3272 14331 28314 20940 ...
## $ BILL_AMT5 : num [1:30000] 0 3455 14948 28959 19146 ...
## $ BILL_AMT6 : num [1:30000] 0 3261 15549 29547 19131 ...
## $ PAY_AMT1 : num [1:30000] 0 0 1518 2000 2000 ...
## $ PAY_AMT2 : num [1:30000] 689 1000 1500 2019 36681 ...
## $ PAY_AMT3 : num [1:30000] 0 1000 1000 1200 10000 657 38000 0 432 0 ...
## $ PAY_AMT4 : num [1:30000] 0 1000 1000 1100 9000 ...
## $ PAY_AMT5 : num [1:30000] 0 0 1000 1069 689 ...
## $ PAY_AMT6 : num [1:30000] 0 2000 5000 1000 679 ...
## $ default payment next month: num [1:30000] 1 1 0 0 0 0 0 0 0 0 ...
summary(dat)
## LIMIT_BAL SEX EDUCATION MARRIAGE
## Min. : 10000 Min. :1.000 Min. :0.000 Min. :0.000
## 1st Qu.: 50000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median : 140000 Median :2.000 Median :2.000 Median :2.000
## Mean : 167484 Mean :1.604 Mean :1.853 Mean :1.552
## 3rd Qu.: 240000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :1000000 Max. :2.000 Max. :6.000 Max. :3.000
## AGE PAY_0 PAY_2 PAY_3
## Min. :21.00 Min. :-2.0000 Min. :-2.0000 Min. :-2.0000
## 1st Qu.:28.00 1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.:-1.0000
## Median :34.00 Median : 0.0000 Median : 0.0000 Median : 0.0000
## Mean :35.49 Mean :-0.0167 Mean :-0.1338 Mean :-0.1662
## 3rd Qu.:41.00 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :79.00 Max. : 8.0000 Max. : 8.0000 Max. : 8.0000
## PAY_4 PAY_5 PAY_6 BILL_AMT1
## Min. :-2.0000 Min. :-2.0000 Min. :-2.0000 Min. :-165580
## 1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.: 3559
## Median : 0.0000 Median : 0.0000 Median : 0.0000 Median : 22382
## Mean :-0.2207 Mean :-0.2662 Mean :-0.2911 Mean : 51223
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 67091
## Max. : 8.0000 Max. : 8.0000 Max. : 8.0000 Max. : 964511
## BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5
## Min. :-69777 Min. :-157264 Min. :-170000 Min. :-81334
## 1st Qu.: 2985 1st Qu.: 2666 1st Qu.: 2327 1st Qu.: 1763
## Median : 21200 Median : 20089 Median : 19052 Median : 18105
## Mean : 49179 Mean : 47013 Mean : 43263 Mean : 40311
## 3rd Qu.: 64006 3rd Qu.: 60165 3rd Qu.: 54506 3rd Qu.: 50191
## Max. :983931 Max. :1664089 Max. : 891586 Max. :927171
## BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3
## Min. :-339603 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 1256 1st Qu.: 1000 1st Qu.: 833 1st Qu.: 390
## Median : 17071 Median : 2100 Median : 2009 Median : 1800
## Mean : 38872 Mean : 5664 Mean : 5921 Mean : 5226
## 3rd Qu.: 49198 3rd Qu.: 5006 3rd Qu.: 5000 3rd Qu.: 4505
## Max. : 961664 Max. :873552 Max. :1684259 Max. :896040
## PAY_AMT4 PAY_AMT5 PAY_AMT6
## Min. : 0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 296 1st Qu.: 252.5 1st Qu.: 117.8
## Median : 1500 Median : 1500.0 Median : 1500.0
## Mean : 4826 Mean : 4799.4 Mean : 5215.5
## 3rd Qu.: 4013 3rd Qu.: 4031.5 3rd Qu.: 4000.0
## Max. :621000 Max. :426529.0 Max. :528666.0
## default payment next month
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.2212
## 3rd Qu.:0.0000
## Max. :1.0000
print(paste("Is there any missing value? ",sum(apply(dat,2, function(x) any(is.na(x))))))
## [1] "Is there any missing value? 0"
# No missing value
The data loaded contains categorical features stored in numeric form.
t = dat
colnames(t)[dim(t)[2]]= c("default")
# Change categorical features to factor
t$default = factor(t$default)
# encode sex
t$SEX = ifelse(dat$SEX==2, 'F', 'M')
# encode marriage
t$MARRIAGE = ifelse(dat$MARRIAGE==1,'married',ifelse(dat$MARRIAGE==2,'single','others'))
# encode education
t$EDUCATION = ifelse(dat$EDUCATION==1, 'grad_school',
ifelse(dat$EDUCATION==2, 'university',
ifelse(dat$EDUCATION==3, 'high_school',
ifelse(dat$EDUCATION==4, 'others', 'unknown'
))))
# Convert to factor
features_cat = c(colnames(t)[2:4], colnames(t)[6:11])
library(dplyr)
for (i in features_cat){
t[i] = factor(pull(t[i]))
}
Perform stratified sampling: 80% training, 20% test. Check both training and test data class proportion. The data is of skewed class distribution, roughly 78% of the data consist of non-default cases, while the remaining 22% belongs to default instances.
response = "default"
set.seed(100) # for reproducibility
split = initial_split(t, prop = 0.8, strata = response)
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(response)` instead of `response` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
train = training(split)
test = testing(split)
summary(train$default)/dim(train)[1]
## 0 1
## 0.7788241 0.2211759
summary(test$default)/dim(test)[1]
## 0 1
## 0.7787035 0.2212965
From this point onward, we will assume that we only have training data at our disposal to work with. Test data is only available for performance evaluation stage.
# Continuos features
features_cont = setdiff(colnames(train), c(features_cat, response))
# Change from wide to tall format
data_long = unpivot_to_blocks(train, nameForNewKeyColumn = "variables",
nameForNewValueColumn = "values",
columnsToTakeFrom = features_cont)
data_long %>% ggplot(aes(x=values)) +
geom_histogram(bins = 10, fill="gray") +
facet_wrap(~variables, ncol = 4, scales = "free")
The above plot shows all the predictors’ distributions in one single figure. For a clearer depiction of a few features, we can specified the specific continuous features by features_cont[p:q], where p<=q. From the above plots, all the continuous predictors are positively skewed.
data_long = unpivot_to_blocks(train, nameForNewKeyColumn = "variables",
nameForNewValueColumn = "values",
columnsToTakeFrom = features_cont[1:2])
# Limit balance and age
ggplot(data_long,aes(x=default,y=values)) +
geom_boxplot(color="blue",fill="blue",alpha=0.2,notch=TRUE,
outlier.color="red",outlier.fill = "red",outlier.size = 2) +
facet_wrap(~variables,ncol = 2,scales = "free")
corr_mat = cor(train[,features_cont])
corr_mat[upper.tri(corr_mat)]=NA
melted_cormat = melt(corr_mat, na.rm = TRUE)
ggplot(data = melted_cormat, aes(x=Var1, y = Var2, fill=value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust=1, size =10, hjust=1)) +
coord_fixed() +
geom_text(aes(Var1, Var2, label = round(value,2)), color = "black", size = 3)
Amount of bill statement from April to September in the year 2005 are highly correlated (linear).
table_cont = train %>% count(EDUCATION, default)
table_sum = train %>% count(EDUCATION, default) %>% group_by(EDUCATION) %>%
summarise(sum_col=sum(n))
row_sum = train %>% count(EDUCATION, default) %>% group_by(default) %>%
summarise(sum_row=sum(n))
# colormap (heatmap)
table_cont %>% left_join(table_sum, by = "EDUCATION") %>%
mutate(percent = n/sum_col) %>%
ggplot(aes(x = EDUCATION, y = default)) +
geom_tile(aes(fill = percent)) +
theme(axis.text.x = element_text(angle = 45,hjust=1)) +
geom_text(aes(label = round(percent,2), hjust=1, vjust=1, colour = "white"),
show.legend = FALSE) +
geom_text(aes(label = n, hjust=1, vjust=0, colour = "red"), show.legend = FALSE)
# contingency table
table(train$EDUCATION, train$default)
##
## 0 1
## grad_school 6808 1621
## high_school 2919 1000
## others 94 7
## university 8602 2661
## unknown 268 19
Plot of expected count and actual count for each class label under different categories.
m = dim(train)[1]
table_cont %>% left_join(table_sum, by = "EDUCATION") %>%
left_join(row_sum, by = "default") %>%
mutate(expected_n = (sum_col*sum_row)/m) %>%
unpivot_to_blocks(nameForNewKeyColumn = "variables",
nameForNewValueColumn = "values",
columnsToTakeFrom = c("n", "expected_n")) %>%
ggplot(aes(x = default, y = values, group = variables, fill = variables)) +
geom_bar(stat = "identity", width = 0.5, position = "dodge") +
facet_wrap(.~EDUCATION,ncol = 3, scales = "free_y") +
geom_text(aes(label = round(values,2)), vjust = 1.6, size = 3) +
theme_bw()
Chi-squared statistics can be used to test the dependency of categorical feature and target variable. Below are some predictors that are associated with the outcomes (eg. default or non-default).
train_cat_res = train[, !(colnames(train) %in% features_cont)]
#chisq.test(train_cat_res[,2], train_cat_res$default)
t=c()
idx=c()
for (i in (1:(ncol(train_cat_res)-1))) {
t[i]=chisq.test(train_cat_res[,i],train_cat_res$default)$p.value
# u[i]=fisher.test(train[,i],train$credit_risk)$p.value
if (!is.list(tryCatch( { result <- chisq.test(train_cat_res[,i],train_cat_res$default) }
, warning = function(w) { print("TRUE") }))) {
idx=c(idx,i)
}
}
## Warning in chisq.test(train_cat_res[, i], train_cat_res$default): Chi-squared
## approximation may be incorrect
## [1] "TRUE"
## Warning in chisq.test(train_cat_res[, i], train_cat_res$default): Chi-squared
## approximation may be incorrect
## [1] "TRUE"
## Warning in chisq.test(train_cat_res[, i], train_cat_res$default): Chi-squared
## approximation may be incorrect
## [1] "TRUE"
## Warning in chisq.test(train_cat_res[, i], train_cat_res$default): Chi-squared
## approximation may be incorrect
## [1] "TRUE"
## Warning in chisq.test(train_cat_res[, i], train_cat_res$default): Chi-squared
## approximation may be incorrect
## [1] "TRUE"
## Warning in chisq.test(train_cat_res[, i], train_cat_res$default): Chi-squared
## approximation may be incorrect
## [1] "TRUE"
idx_sig=which(t<=0.05)
idx_int=!(idx_sig %in% idx)
colnames(train_cat_res)[idx_sig[idx_int]]
## [1] "SEX" "EDUCATION" "MARRIAGE"
print(paste("Variables not independent with response: ",
colnames(train_cat_res)[idx_sig[idx_int]]))
## [1] "Variables not independent with response: SEX"
## [2] "Variables not independent with response: EDUCATION"
## [3] "Variables not independent with response: MARRIAGE"
The following are some takeaways from the data visualization:
1. According to the above boxplots, under the ‘LIMIT_BAL’ boxplot column, the notches of two boxes do not overlap, indicating strong evidence that their medians “significantly differ”. As for other predictors, there are no significant difference in medians. 2. Based on the chi-squared test, SEX, MARRIAGE and EDUCATION are some categorical predictors which are associated with the response variable (default or non-default).
Before we step into the modeling, I would like to summarize/highlights some important pitfalls that might cause issues in modeling:
1. All the continuous features are positively skewed.
2. Amount of bill statements for previous months (BILL_AMT1,…,BILL_AMT6) are strongly correlated (linear).
Computing environment
## _
## platform x86_64-w64-mingw32
## arch x86_64
## os mingw32
## system x86_64, mingw32
## status
## major 4
## minor 0.2
## year 2020
## month 06
## day 22
## svn rev 78730
## language R
## version.string R version 4.0.2 (2020-06-22)
## nickname Taking Off Again