The provided data relates to clients taking up a new product, indicated by the SS column (the target variable). More specifically, SS = 1 indicates a client who took up the product and SS = 0 a client who did not.
sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.5.2 magrittr_1.5 tools_3.5.2 htmltools_0.3.6
## [5] yaml_2.2.0 Rcpp_1.0.0 stringi_1.2.4 rmarkdown_1.11
## [9] knitr_1.21 stringr_1.3.1 xfun_0.5 digest_0.6.18
## [13] evaluate_0.13
setwd("C:/Users/Edzai/Documents/Edzai_assessment_documents")
getwd()
## [1] "C:/Users/Edzai/Documents/Edzai_assessment_documents"
The dataset is provided in the feather format, so the feather package is needed to read it into R.
library(feather)
df <- read_feather("C:/Users/Edzai/Documents/Edzai_assessment_documents/train_data")
Sense-check the loaded data, check column names
names(df)
## [1] "LoanClient" "Inflows_Total"
## [3] "Outflows_Total_L3M" "Inflows_Above_1k"
## [5] "Max_Dep_Bal_L3M" "InactiveLoanClient"
## [7] "Other_Perc_L3M" "Outflow_Max_L3M"
## [9] "Inflows_Max_AVG_Day" "DormantLoanClient"
## [11] "Ave_Days_Above_100_L3M" "CW_Perc_L3M"
## [13] "DO_Perc_L3M" "DODispute_L3M"
## [15] "Val_POS_L3M" "Val_DO_L3M"
## [17] "Avg_Dep_Bal_L3M" "CSWEEP_P90_L3M"
## [19] "CW_Util_L3M" "SS"
## [21] "LoanEver"
Analyse the structure of the data
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 140000 obs. of 21 variables:
## $ LoanClient : int 1 1 0 0 0 0 0 1 0 0 ...
## $ Inflows_Total : num 44214 NA 42156 97643 300 ...
## $ Outflows_Total_L3M : num -43560 NA -38444 -157110 -224 ...
## $ Inflows_Above_1k : int 2 NA 0 3 0 0 3 0 1 6 ...
## $ Max_Dep_Bal_L3M : num 2954 NA 30800 51622 318 ...
## $ InactiveLoanClient : int 0 0 0 0 0 1 0 0 0 0 ...
## $ Other_Perc_L3M : num 0.253 NA 0.027 0.062 0 ...
## $ Outflow_Max_L3M : num -2000 NA -1500 -6500 -224 ...
## $ Inflows_Max_AVG_Day : int 14 NA 23 24 2 24 24 25 24 19 ...
## $ DormantLoanClient : int 0 0 1 1 0 0 0 0 1 1 ...
## $ Ave_Days_Above_100_L3M: num 0.424 NA 0.141 1 0.13 ...
## $ CW_Perc_L3M : num 0.312 NA 0.742 0.239 0 ...
## $ DO_Perc_L3M : num 0.0417 NA 0 0.0708 1 ...
## $ DODispute_L3M : int 2 NA 0 2 0 0 6 0 0 0 ...
## $ Val_POS_L3M : num -9112 NA -9713 -52805 0 ...
## $ Val_DO_L3M : num -1682 NA 0 -11227 -224 ...
## $ Avg_Dep_Bal_L3M : num 307.2 NA 1543.9 9005.7 95.2 ...
## $ CSWEEP_P90_L3M : int 0 NA 0 0 0 0 5 1 3 1 ...
## $ CW_Util_L3M : num 0.307 NA 0.67 0.202 0 ...
## $ SS : int 1 0 0 0 0 1 0 1 0 0 ...
## $ LoanEver : num 1 1 1 1 0 1 0 1 1 1 ...
Check the summary statistics for each of the variables
summary(df)
## LoanClient Inflows_Total Outflows_Total_L3M Inflows_Above_1k
## Min. :0.0000 Min. : 0 Min. :-2696387.0 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.: 5640 1st Qu.: -33424.3 1st Qu.: 1.000
## Median :0.0000 Median : 15250 Median : -14912.5 Median : 2.000
## Mean :0.2269 Mean : 26613 Mean : -25697.0 Mean : 2.928
## 3rd Qu.:0.0000 3rd Qu.: 34047 3rd Qu.: -5624.8 3rd Qu.: 4.000
## Max. :1.0000 Max. :3506638 Max. : -2.5 Max. :700.000
## NA's :21242 NA's :21325 NA's :21242
## Max_Dep_Bal_L3M InactiveLoanClient Other_Perc_L3M
## Min. : -5599.2 Min. :0.00000 Min. :0.000
## 1st Qu.: 999.8 1st Qu.:0.00000 1st Qu.:0.000
## Median : 3327.9 Median :0.00000 Median :0.015
## Mean : 8375.3 Mean :0.03246 Mean :0.089
## 3rd Qu.: 8010.8 3rd Qu.:0.00000 3rd Qu.:0.126
## Max. :2118570.0 Max. :1.00000 Max. :1.000
## NA's :5244 NA's :18884
## Outflow_Max_L3M Inflows_Max_AVG_Day DormantLoanClient
## Min. :-550000.0 Min. : 1.00 Min. :0.0000
## 1st Qu.: -2000.0 1st Qu.:15.00 1st Qu.:0.0000
## Median : -1000.0 Median :22.00 Median :0.0000
## Mean : -1637.2 Mean :19.95 Mean :0.1512
## 3rd Qu.: -471.5 3rd Qu.:25.00 3rd Qu.:0.0000
## Max. : -1.5 Max. :31.00 Max. :1.0000
## NA's :21325 NA's :21242
## Ave_Days_Above_100_L3M CW_Perc_L3M DO_Perc_L3M DODispute_L3M
## Min. :0.000 Min. :0.000 Min. :0.000 Min. : 0.000
## 1st Qu.:0.130 1st Qu.:0.216 1st Qu.:0.000 1st Qu.: 0.000
## Median :0.393 Median :0.443 Median :0.000 Median : 0.000
## Mean :0.445 Mean :0.463 Mean :0.079 Mean : 0.353
## 3rd Qu.:0.750 3rd Qu.:0.691 3rd Qu.:0.098 3rd Qu.: 0.000
## Max. :1.000 Max. :1.000 Max. :1.000 Max. :54.000
## NA's :5244 NA's :18884 NA's :18884 NA's :18884
## Val_POS_L3M Val_DO_L3M Avg_Dep_Bal_L3M CSWEEP_P90_L3M
## Min. :-410127 Min. :-253761 Min. : -6827.4 Min. : 0.000
## 1st Qu.: -6748 1st Qu.: -2140 1st Qu.: 109.9 1st Qu.: 0.000
## Median : -2007 Median : 0 Median : 405.3 Median : 1.000
## Mean : -5416 Mean : -2639 Mean : 2110.3 Mean : 1.333
## 3rd Qu.: -239 3rd Qu.: 0 3rd Qu.: 1239.5 3rd Qu.: 2.000
## Max. : 0 Max. : 0 Max. :1308562.6 Max. :31.000
## NA's :18884 NA's :18884 NA's :5244 NA's :18884
## CW_Util_L3M SS LoanEver
## Min. : 0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 0.152 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 0.362 Median :0.0000 Median :0.0000
## Mean : 0.436 Mean :0.1984 Mean :0.4106
## 3rd Qu.: 0.616 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :128.693 Max. :1.0000 Max. :1.0000
## NA's :18884
Check the shape of the data
dim(df)
## [1] 140000 21
Explore missing values in the dataset using the VIM package
library(VIM)
a <- aggr(df)
summary(a)
##
## Missings per variable:
## Variable Count
## LoanClient 0
## Inflows_Total 21242
## Outflows_Total_L3M 21325
## Inflows_Above_1k 21242
## Max_Dep_Bal_L3M 5244
## InactiveLoanClient 0
## Other_Perc_L3M 18884
## Outflow_Max_L3M 21325
## Inflows_Max_AVG_Day 21242
## DormantLoanClient 0
## Ave_Days_Above_100_L3M 5244
## CW_Perc_L3M 18884
## DO_Perc_L3M 18884
## DODispute_L3M 18884
## Val_POS_L3M 18884
## Val_DO_L3M 18884
## Avg_Dep_Bal_L3M 5244
## CSWEEP_P90_L3M 18884
## CW_Util_L3M 18884
## SS 0
## LoanEver 0
##
## Missings in combinations of variables:
## Combinations Count Percent
## 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 116382 83.130000000
## 0:0:1:0:0:0:0:1:0:0:0:0:0:0:0:0:0:0:0:0:0 2371 1.693571429
## 0:0:1:0:1:0:0:1:0:0:1:0:0:0:0:0:1:0:0:0:0 5 0.003571429
## 0:1:0:1:0:0:0:0:1:0:0:0:0:0:0:0:0:0:0:0:0 2288 1.634285714
## 0:1:0:1:1:0:0:0:1:0:1:0:0:0:0:0:1:0:0:0:0 5 0.003571429
## 0:1:1:1:0:0:0:1:1:0:0:0:0:0:0:0:0:0:0:0:0 65 0.046428571
## 0:1:1:1:0:0:1:1:1:0:0:1:1:1:1:1:0:1:1:0:0 13650 9.750000000
## 0:1:1:1:1:0:1:1:1:0:1:1:1:1:1:1:1:1:1:0:0 5234 3.738571429
Since the dataset still contains well over 100 000 complete cases, discarding the rows with missing values should have a minimal effect on the logistic regression model.
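As a sanity check before dropping anything, the number and share of complete cases can be computed directly (a quick sketch; the figures agree with the 116 382 complete rows, about 83%, reported by aggr above).
sum(complete.cases(df))    # number of rows with no missing values
mean(complete.cases(df))   # proportion of complete cases (about 0.83)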
dfna <- na.omit(df)
Check the dimension and structure of the new dataset
str(dfna)
## Classes 'tbl_df', 'tbl' and 'data.frame': 116382 obs. of 21 variables:
## $ LoanClient : int 1 0 0 0 0 0 1 0 0 1 ...
## $ Inflows_Total : num 44214 42156 97643 300 16836 ...
## $ Outflows_Total_L3M : num -43560 -38444 -157110 -224 -13428 ...
## $ Inflows_Above_1k : int 2 0 3 0 0 3 0 1 6 9 ...
## $ Max_Dep_Bal_L3M : num 2954 30800 51622 318 5826 ...
## $ InactiveLoanClient : int 0 0 0 0 1 0 0 0 0 0 ...
## $ Other_Perc_L3M : num 0.253 0.027 0.062 0 0.256 ...
## $ Outflow_Max_L3M : num -2000 -1500 -6500 -224 -1234 ...
## $ Inflows_Max_AVG_Day : int 14 23 24 2 24 24 25 24 19 21 ...
## $ DormantLoanClient : int 0 1 1 0 0 0 0 1 1 0 ...
## $ Ave_Days_Above_100_L3M: num 0.424 0.141 1 0.13 0.876 ...
## $ CW_Perc_L3M : num 0.312 0.742 0.239 0 0.403 ...
## $ DO_Perc_L3M : num 0.0417 0 0.0708 1 0 ...
## $ DODispute_L3M : int 2 0 2 0 0 6 0 0 0 0 ...
## $ Val_POS_L3M : num -9112 -9713 -52805 0 -2772 ...
## $ Val_DO_L3M : num -1682 0 -11227 -224 0 ...
## $ Avg_Dep_Bal_L3M : num 307.2 1543.9 9005.7 95.2 1187.4 ...
## $ CSWEEP_P90_L3M : int 0 0 0 0 0 5 1 3 1 0 ...
## $ CW_Util_L3M : num 0.307 0.67 0.202 0 0.373 ...
## $ SS : int 1 0 0 0 1 0 1 0 0 1 ...
## $ LoanEver : num 1 1 1 0 1 0 1 1 1 1 ...
## - attr(*, "na.action")= 'omit' Named int 2 11 12 15 33 40 44 46 49 50 ...
## ..- attr(*, "names")= chr "2" "11" "12" "15" ...
dim(dfna)
## [1] 116382 21
summary(dfna)
## LoanClient Inflows_Total Outflows_Total_L3M Inflows_Above_1k
## Min. :0.0000 Min. : 0 Min. :-2696387.0 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.: 6139 1st Qu.: -33916.1 1st Qu.: 1.000
## Median :0.0000 Median : 15675 Median : -15306.8 Median : 2.000
## Mean :0.2485 Mean : 27104 Mean : -26115.3 Mean : 2.981
## 3rd Qu.:0.0000 3rd Qu.: 34674 3rd Qu.: -6023.2 3rd Qu.: 4.000
## Max. :1.0000 Max. :3506638 Max. : -4.9 Max. :700.000
## Max_Dep_Bal_L3M InactiveLoanClient Other_Perc_L3M
## Min. : -55 Min. :0.00000 Min. :0.00000
## 1st Qu.: 1686 1st Qu.:0.00000 1st Qu.:0.00000
## Median : 4064 Median :0.00000 Median :0.01750
## Mean : 9405 Mean :0.03498 Mean :0.09042
## 3rd Qu.: 9063 3rd Qu.:0.00000 3rd Qu.:0.13180
## Max. :2118570 Max. :1.00000 Max. :1.00000
## Outflow_Max_L3M Inflows_Max_AVG_Day DormantLoanClient
## Min. :-550000.0 Min. : 1.00 Min. :0.0000
## 1st Qu.: -2047.8 1st Qu.:15.00 1st Qu.:0.0000
## Median : -1000.0 Median :22.00 Median :0.0000
## Mean : -1654.2 Mean :19.99 Mean :0.1493
## 3rd Qu.: -500.0 3rd Qu.:25.00 3rd Qu.:0.0000
## Max. : -1.5 Max. :31.00 Max. :1.0000
## Ave_Days_Above_100_L3M CW_Perc_L3M DO_Perc_L3M
## Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.1847 1st Qu.:0.2339 1st Qu.:0.00000
## Median :0.4347 Median :0.4526 Median :0.00200
## Mean :0.4742 Mean :0.4728 Mean :0.08191
## 3rd Qu.:0.7500 3rd Qu.:0.6950 3rd Qu.:0.10380
## Max. :1.0000 Max. :1.0000 Max. :1.00000
## DODispute_L3M Val_POS_L3M Val_DO_L3M
## Min. : 0.0000 Min. :-410127.0 Min. :-253760.76
## 1st Qu.: 0.0000 1st Qu.: -7031.0 1st Qu.: -2313.16
## Median : 0.0000 Median : -2211.9 Median : -69.73
## Mean : 0.3673 Mean : -5608.5 Mean : -2745.00
## 3rd Qu.: 0.0000 3rd Qu.: -332.8 3rd Qu.: 0.00
## Max. :54.0000 Max. : 0.0 Max. : 0.00
## Avg_Dep_Bal_L3M CSWEEP_P90_L3M CW_Util_L3M SS
## Min. : -6827.4 Min. : 0.000 Min. : 0.0000 Min. :0.0000
## 1st Qu.: 168.4 1st Qu.: 0.000 1st Qu.: 0.1751 1st Qu.:0.0000
## Median : 493.9 Median : 1.000 Median : 0.3775 Median :0.0000
## Mean : 2227.8 Mean : 1.383 Mean : 0.4517 Mean :0.2252
## 3rd Qu.: 1384.5 3rd Qu.: 2.000 3rd Qu.: 0.6269 3rd Qu.:0.0000
## Max. :1308562.6 Max. :31.000 Max. :128.6934 Max. :1.0000
## LoanEver
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4328
## 3rd Qu.:1.0000
## Max. :1.0000
Split out a validation (testing) dataset. Given that only one dataset is provided, we randomly split it into a training set and a testing set.
For reproducibility, we set a seed to initialize the random number generator. The caTools package provides the sample.split function; with a split ratio of 0.75, 75% of the data goes into the training set, which is used to build the model, and the remaining 25% into the testing set.
library(caTools)
# Randomly split data
set.seed(1)
split = sample.split(dfna$SS, SplitRatio = 0.75)
The subset function is used to create the training and testing sets: the training set is called dfnaTrain and the testing set dfnaTest.
dfnaTrain = subset(dfna, split == TRUE)
dfnaTest = subset(dfna, split == FALSE)
Check the dimensions of both the training and testing sets
dim(dfnaTrain)
## [1] 87287 21
dim(dfnaTest)
## [1] 29095 21
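Because sample.split stratifies on the outcome variable, the proportion of SS = 1 should be essentially identical in the two sets; a quick check (not part of the original output):
prop.table(table(dfnaTrain$SS))   # share of takeups in the training set
prop.table(table(dfnaTest$SS))    # share of takeups in the testing set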
Check the bivariate correlations between the variables to see whether linear relationships exist and how strong they are.
set.seed(71)
library(mlbench)
library(caret)
# calculate correlation matrix
correlationMatrix <- cor(dfnaTrain)
#correlogram
require(corrplot)
corrplot(correlationMatrix,
method = 'shade',
type = "lower")
# find attributes that are highly correlated (absolute correlation above the 0.7 cutoff)
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff = 0.7)
print(highlyCorrelated)
## [1] 3 5
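The flagged indices can be mapped back to variable names, and the corresponding columns dropped if a reduced feature set is preferred (a sketch only; the analysis below keeps all columns in the data frames):
names(dfnaTrain)[highlyCorrelated]                      # which variables were flagged
# dfnaTrainReduced <- dfnaTrain[ , -highlyCorrelated]   # optional: drop them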
Convert the binary indicator variables and the target SS from numeric codes to factors
dfnaTrain$LoanEver<-factor(dfnaTrain$LoanEver)
dfnaTrain$LoanClient<-factor(dfnaTrain$LoanClient)
dfnaTrain$InactiveLoanClient<-factor(dfnaTrain$InactiveLoanClient)
dfnaTrain$DormantLoanClient<-factor(dfnaTrain$DormantLoanClient)
dfnaTrain$SS <-factor(dfnaTrain$SS )
dfnaTest$LoanEver<-factor(dfnaTest$LoanEver)
dfnaTest$LoanClient<-factor(dfnaTest$LoanClient)
dfnaTest$InactiveLoanClient<-factor(dfnaTest$InactiveLoanClient)
dfnaTest$DormantLoanClient<-factor(dfnaTest$DormantLoanClient)
dfnaTest$SS <-factor(dfnaTest$SS )
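The same conversion can be written more compactly by looping over the indicator columns; an equivalent sketch:
factor_cols <- c("LoanClient", "InactiveLoanClient", "DormantLoanClient", "SS", "LoanEver")
dfnaTrain[factor_cols] <- lapply(dfnaTrain[factor_cols], factor)
dfnaTest[factor_cols]  <- lapply(dfnaTest[factor_cols], factor)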
Structure of the training dataset, followed by cross-tabulations of SS against each of the factor predictors
str(dfnaTrain)
xtabs(~SS + LoanClient, data=dfnaTrain)
## LoanClient
## SS 0 1
## 0 57567 10061
## 1 7985 11674
xtabs(~SS + LoanEver, data=dfnaTrain)
## LoanEver
## SS 0 1
## 0 45042 22586
## 1 4359 15300
xtabs(~SS + DormantLoanClient, data=dfnaTrain)
## DormantLoanClient
## SS 0 1
## 0 56919 10709
## 1 17264 2395
xtabs(~SS +InactiveLoanClient, data=dfnaTrain)
## InactiveLoanClient
## SS 0 1
## 0 65812 1816
## 1 18428 1231
About 22.5% of the clients in the training set took up the product, so a baseline model that always predicts non-takeup has an accuracy of roughly 77.5%. This is what we'll try to beat with the logistic regression model.
table(dfnaTrain$SS)
##
## 0 1
## 67628 19659
19659/87287
## [1] 0.2252225
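The accuracy of the baseline model that always predicts non-takeup is therefore the majority-class share:
67628/87287   # about 0.775, the benchmark the model needs to beat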
set.seed(7)
# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
model <- train(SS~., data=dfnaTrain, method="glm", preProcess="scale", trControl=control)
# estimate variable importance
importance <- varImp(model, scale=FALSE)
# summarize importance
print(importance)
## glm variable importance
##
## Overall
## LoanClient1 89.3681
## InactiveLoanClient1 43.4456
## DormantLoanClient1 27.6578
## Inflows_Above_1k 26.7331
## Ave_Days_Above_100_L3M 26.1940
## Inflows_Max_AVG_Day 25.6627
## Max_Dep_Bal_L3M 14.7194
## Outflows_Total_L3M 13.4118
## DODispute_L3M 12.3461
## CW_Perc_L3M 11.2464
## Val_POS_L3M 9.7256
## Other_Perc_L3M 6.0744
## DO_Perc_L3M 4.5116
## Val_DO_L3M 4.3199
## Inflows_Total 2.8254
## CSWEEP_P90_L3M 1.7254
## Outflow_Max_L3M 1.2228
## CW_Util_L3M 0.9523
## Avg_Dep_Bal_L3M 0.6633
# plot importance
plot(importance)
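The cross-validated performance of the caret model itself is not shown above, but it can be inspected alongside the importance scores; a brief sketch:
print(model)      # summary of the repeated 10-fold cross-validation
model$results     # resampled Accuracy and Kappa as a data frame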
The logistic regression method assumes a binary outcome, independent observations, little or no multicollinearity among the predictors, and an approximately linear relationship between the continuous predictors and the log-odds of the outcome; multicollinearity and influential observations are checked after fitting. The model is fitted on a subset of the predictors, with Outflows_Total_L3M (which is always negative) entered on a log10 scale of its absolute value.
logistic = glm(SS ~ LoanClient + Inflows_Above_1k + log10(abs(Outflows_Total_L3M)) + InactiveLoanClient + Other_Perc_L3M + Inflows_Max_AVG_Day + DormantLoanClient + Ave_Days_Above_100_L3M + CW_Perc_L3M +DO_Perc_L3M + DODispute_L3M, data=dfnaTrain, family="binomial")
summary(logistic)
##
## Call:
## glm(formula = SS ~ LoanClient + Inflows_Above_1k + log10(abs(Outflows_Total_L3M)) +
## InactiveLoanClient + Other_Perc_L3M + Inflows_Max_AVG_Day +
## DormantLoanClient + Ave_Days_Above_100_L3M + CW_Perc_L3M +
## DO_Perc_L3M + DODispute_L3M, family = "binomial", data = dfnaTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2280 -0.5737 -0.3527 -0.1686 3.8523
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.111585 0.100847 -70.518 < 2e-16 ***
## LoanClient1 2.057989 0.026294 78.268 < 2e-16 ***
## Inflows_Above_1k -0.094828 0.003137 -30.226 < 2e-16 ***
## log10(abs(Outflows_Total_L3M)) 1.084983 0.023683 45.812 < 2e-16 ***
## InactiveLoanClient1 1.710479 0.043358 39.450 < 2e-16 ***
## Other_Perc_L3M 0.370036 0.080766 4.582 4.61e-06 ***
## Inflows_Max_AVG_Day 0.033207 0.001564 21.239 < 2e-16 ***
## DormantLoanClient1 0.712351 0.029414 24.218 < 2e-16 ***
## Ave_Days_Above_100_L3M 0.655681 0.038433 17.061 < 2e-16 ***
## CW_Perc_L3M -0.847232 0.043863 -19.315 < 2e-16 ***
## DO_Perc_L3M -0.333294 0.076442 -4.360 1.30e-05 ***
## DODispute_L3M -0.108984 0.007641 -14.262 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 93125 on 87286 degrees of freedom
## Residual deviance: 68956 on 87275 degrees of freedom
## AIC: 68980
##
## Number of Fisher Scoring iterations: 5
None of the predictor variables showed evidence of multicollinearity, as all of the VIF values were well below 5.
car::vif(logistic)
## LoanClient Inflows_Above_1k
## 1.856679 1.093900
## log10(abs(Outflows_Total_L3M)) InactiveLoanClient
## 1.407744 1.131941
## Other_Perc_L3M Inflows_Max_AVG_Day
## 1.572125 1.032053
## DormantLoanClient Ave_Days_Above_100_L3M
## 1.287433 1.482674
## CW_Perc_L3M DO_Perc_L3M
## 1.349119 1.291934
## DODispute_L3M
## 1.076992
Check for influential observations using Cook's distance and standardised residuals.
library(tidyverse)
library(broom)
plot(logistic, which = 4, id.n = 3)   # three largest Cook's distances
# Extract model results
logistic.data <- augment(logistic) %>%
mutate(index = 1:n())
logistic.data %>% top_n(3, .cooksd)
## # A tibble: 3 x 20
## SS LoanClient Inflows_Above_1k log10.abs.Outfl~ InactiveLoanCli~
## <fct> <fct> <int> <dbl> <fct>
## 1 1 1 52 5.56 0
## 2 1 1 97 5.31 0
## 3 1 1 2 4.57 0
## # ... with 15 more variables: Other_Perc_L3M <dbl>,
## # Inflows_Max_AVG_Day <int>, DormantLoanClient <fct>,
## # Ave_Days_Above_100_L3M <dbl>, CW_Perc_L3M <dbl>, DO_Perc_L3M <dbl>,
## # DODispute_L3M <int>, .fitted <dbl>, .se.fit <dbl>, .resid <dbl>,
## # .hat <dbl>, .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>, index <int>
ggplot(logistic.data, aes(index, .std.resid)) +
geom_point(aes(color = SS), alpha = .5) +
theme_bw()
logistic.data %>%
filter(abs(.std.resid) > 3)
## # A tibble: 29 x 20
## SS LoanClient Inflows_Above_1k log10.abs.Outfl~ InactiveLoanCli~
## <fct> <fct> <int> <dbl> <fct>
## 1 1 0 0 2.21 0
## 2 1 0 0 2.70 0
## 3 1 0 0 1.08 0
## 4 1 0 0 1.38 0
## 5 1 0 1 2.95 0
## 6 1 0 0 2.53 0
## 7 1 0 3 3.82 0
## 8 1 0 0 2.20 0
## 9 1 0 0 1.08 0
## 10 1 0 0 2.60 0
## # ... with 19 more rows, and 15 more variables: Other_Perc_L3M <dbl>,
## # Inflows_Max_AVG_Day <int>, DormantLoanClient <fct>,
## # Ave_Days_Above_100_L3M <dbl>, CW_Perc_L3M <dbl>, DO_Perc_L3M <dbl>,
## # DODispute_L3M <int>, .fitted <dbl>, .se.fit <dbl>, .resid <dbl>,
## # .hat <dbl>, .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>, index <int>
predicted = data.frame(probability.of.SS= logistic$fitted.value, SS=dfnaTrain$SS)
predicted = predicted[order(predicted$probability.of.SS, decreasing = FALSE),]
predicted$rank = 1:nrow(predicted)
library(ggplot2)
library(cowplot)
ggplot(data=predicted, aes(x=rank, y=probability.of.SS)) +
geom_point(aes(colour=SS), alpha=1, shape=4, stroke=2)+
xlab("Index") +
ylab("Predicted Probability of Taking Up SS")
predictTrain = predict(logistic, type="response")
summary(predictTrain)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0546 0.1216 0.2252 0.3543 0.9164
tapply(predictTrain, dfnaTrain$SS, mean)
## 0 1
## 0.1621618 0.4421549
A threshold of 0.5 was chosen because it gave the highest accuracy compared with thresholds of 0.2 and 0.7.
table(dfnaTrain$SS, predictTrain > 0.5)
##
## FALSE TRUE
## 0 62224 5404
## 1 10201 9458
a = 62224   # true negatives  (SS = 0 predicted as 0)
b = 9458    # true positives  (SS = 1 predicted as 1)
c = 5404    # false positives (SS = 0 predicted as 1)
d = 10201   # false negatives (SS = 1 predicted as 0)
(a+b)/(a+b+c+d)   # accuracy
## [1] 0.8212219
b/(b+d)           # sensitivity (true positive rate)
## [1] 0.4811028
a/(a+c)           # specificity (true negative rate)
## [1] 0.9200923
b/(b+c)           # precision
## [1] 0.6363881
table(dfnaTrain$SS, predictTrain > 0.7)
##
## FALSE TRUE
## 0 66400 1228
## 1 17067 2592
e = 66400   # true negatives
f = 2592    # true positives
g = 1228    # false positives
h = 17067   # false negatives
(e+f)/(e+f+g+h)
## [1] 0.7904041
f/(f+h)
## [1] 0.131848
e/(e+g)
## [1] 0.9818418
f/(f+g)
## [1] 0.678534
table(dfnaTrain$SS, predictTrain > 0.2)
##
## FALSE TRUE
## 0 50331 17297
## 1 4184 15475
i = 50331   # true negatives
j = 15475   # true positives
k = 4184    # false negatives
l = 17297   # false positives
(i+j)/(i+j+k+l)
## [1] 0.7539038
j/(j+l)
## [1] 0.4722019
i/(i+k)
## [1] 0.9232505
j/(j+k)
## [1] 0.7871713
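The same threshold comparison can be wrapped in a small helper so that the confusion-matrix counts do not have to be re-typed by hand. A sketch (threshold_metrics is a hypothetical helper, not part of the original analysis):
# accuracy, sensitivity and specificity of the logistic model for a given cutoff
threshold_metrics <- function(actual, prob, cutoff) {
  tab <- table(actual, prob > cutoff)
  tn <- tab["0", "FALSE"]; fp <- tab["0", "TRUE"]
  fn <- tab["1", "FALSE"]; tp <- tab["1", "TRUE"]
  c(accuracy    = (tn + tp) / sum(tab),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}
threshold_metrics(dfnaTrain$SS, predictTrain, 0.5)   # should match the 0.5 results above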
library(ROCR)
ROCRpred = prediction(predictTrain, dfnaTrain$SS)
ROCRperf = performance(ROCRpred, "tpr", "fpr")
plot(ROCRperf, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))
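The area under the ROC curve can be extracted from the same ROCR objects (the AUC value is not reported in the original output):
auc <- performance(ROCRpred, "auc")
as.numeric(auc@y.values)   # AUC of the logistic model on the training set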
predictTest = predict(logistic, type = "response", newdata = dfnaTest)
table(dfnaTest$SS,predictTest >= 0.5)
##
## FALSE TRUE
## 0 20677 1865
## 1 3509 3044
m = 20677   # true negatives
n = 3044    # true positives
o = 1865    # false positives
p = 3509    # false negatives
(m+n)/(m+n+o+p)   # accuracy on the test set
## [1] 0.8152947
n/(n+p)           # sensitivity
## [1] 0.4645201
m/(m+o)           # specificity
## [1] 0.9172655
n/(n+o)           # precision
## [1] 0.6200856
library(rpart, warn.conflicts = FALSE)
library(rpart.plot, warn.conflicts = FALSE)
library(rattle, warn.conflicts = FALSE)
library(RColorBrewer, warn.conflicts = FALSE)
# Changing the prior probabilities of SS product takeup and non-takeup
tree_prior = rpart(SS ~ LoanClient + Inflows_Above_1k + log10(abs(Outflows_Total_L3M)) + InactiveLoanClient + Other_Perc_L3M + Inflows_Max_AVG_Day + DormantLoanClient + Ave_Days_Above_100_L3M + CW_Perc_L3M +DO_Perc_L3M + DODispute_L3M, method = "class",
data = dfnaTrain,
parms = list(prior = c(0.7,0.3)), # changing the probabilities of non product takeup to 0.7
control = rpart.control(cp = 0.001)) # and product takeup to 0.3
prp(tree_prior)
# cp and xerror values.
printcp(tree_prior)
##
## Classification tree:
## rpart(formula = SS ~ LoanClient + Inflows_Above_1k + log10(abs(Outflows_Total_L3M)) +
## InactiveLoanClient + Other_Perc_L3M + Inflows_Max_AVG_Day +
## DormantLoanClient + Ave_Days_Above_100_L3M + CW_Perc_L3M +
## DO_Perc_L3M + DODispute_L3M, data = dfnaTrain, method = "class",
## parms = list(prior = c(0.7, 0.3)), control = rpart.control(cp = 0.001))
##
## Variables actually used in tree construction:
## [1] Ave_Days_Above_100_L3M DODispute_L3M
## [3] InactiveLoanClient Inflows_Above_1k
## [5] Inflows_Max_AVG_Day LoanClient
## [7] log10(abs(Outflows_Total_L3M)) Other_Perc_L3M
##
## Root node error: 26186/87287 = 0.3
##
## n= 87287
##
## CP nsplit rel error xerror xstd
## 1 0.2466953 0 1.00000 1.00000 0.0062778
## 2 0.0082954 1 0.75330 0.75330 0.0051124
## 3 0.0079886 5 0.72012 0.73726 0.0051670
## 4 0.0051988 7 0.70415 0.70604 0.0051171
## 5 0.0029382 8 0.69895 0.70249 0.0050549
## 6 0.0024183 9 0.69601 0.69978 0.0050350
## 7 0.0021667 10 0.69359 0.69708 0.0050330
## 8 0.0019707 11 0.69142 0.69615 0.0050312
## 9 0.0016691 12 0.68945 0.69546 0.0050313
## 10 0.0014668 13 0.68778 0.69367 0.0050244
## 11 0.0014306 14 0.68632 0.69262 0.0050236
## 12 0.0010427 15 0.68489 0.68985 0.0050147
## 13 0.0010000 16 0.68384 0.68993 0.0050102
plotcp(tree_prior)
# cp value with the lowest cross-validated error
tree_min = tree_prior$cptable[which.min(tree_prior$cptable[,"xerror"]),"CP"]
# prune the tree at that cp value for better generalisation
ptree_prior = prune(tree_prior, cp = tree_min)
prp(ptree_prior)
pred_prior = predict(ptree_prior, newdata = dfnaTest, type = "class") # making predictions.
confmat_prior = table(dfnaTest$SS,pred_prior) # making the confusion matrix.
confmat_prior
## pred_prior
## 0 1
## 0 19925 2617
## 1 2868 3685
acc_prior = sum(diag(confmat_prior)) / sum(confmat_prior)
acc_prior
## [1] 0.8114796
prp(ptree_prior)
# Including a loss matrix
tree_loss_matrix = rpart(SS ~LoanClient + Inflows_Above_1k + log10(abs(Outflows_Total_L3M)) + InactiveLoanClient + Other_Perc_L3M + Inflows_Max_AVG_Day + DormantLoanClient + Ave_Days_Above_100_L3M + CW_Perc_L3M +DO_Perc_L3M + DODispute_L3M, method = "class",
data = dfnaTrain,
parms = list(loss = matrix(c(0,10,1,0),ncol = 2)),
control = rpart.control(cp = 0.001))
#penalizing classifying a product takeup as a non-takeup 10 times more.
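In rpart the loss matrix is read with rows as the actual class and columns as the predicted class; printing it with labels makes the penalty explicit (the dimnames are added here purely for illustration):
loss <- matrix(c(0, 10, 1, 0), ncol = 2,
               dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))
loss   # misclassifying an actual takeup (1) as a non-takeup (0) costs 10; the reverse costs 1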
printcp(tree_loss_matrix)
##
## Classification tree:
## rpart(formula = SS ~ LoanClient + Inflows_Above_1k + log10(abs(Outflows_Total_L3M)) +
## InactiveLoanClient + Other_Perc_L3M + Inflows_Max_AVG_Day +
## DormantLoanClient + Ave_Days_Above_100_L3M + CW_Perc_L3M +
## DO_Perc_L3M + DODispute_L3M, data = dfnaTrain, method = "class",
## parms = list(loss = matrix(c(0, 10, 1, 0), ncol = 2)), control = rpart.control(cp = 0.001))
##
## Variables actually used in tree construction:
## [1] Ave_Days_Above_100_L3M CW_Perc_L3M
## [3] DO_Perc_L3M DODispute_L3M
## [5] DormantLoanClient InactiveLoanClient
## [7] Inflows_Above_1k Inflows_Max_AVG_Day
## [9] LoanClient log10(abs(Outflows_Total_L3M))
##
## Root node error: 67628/87287 = 0.77478
##
## n= 87287
##
## CP nsplit rel error xerror xstd
## 1 0.1282975 0 1.00000 10.0000 0.018249
## 2 0.0222541 2 0.74341 4.8453 0.021071
## 3 0.0060034 3 0.72115 4.9508 0.021171
## 4 0.0057816 5 0.70914 5.0145 0.021227
## 5 0.0040664 7 0.69758 4.8936 0.021128
## 6 0.0038446 10 0.68374 4.7382 0.020988
## 7 0.0038076 11 0.67990 4.6968 0.020948
## 8 0.0031274 13 0.67228 4.6902 0.020943
## 9 0.0030609 15 0.66603 4.5601 0.020809
## 10 0.0016413 16 0.66297 4.5074 0.020754
## 11 0.0016118 18 0.65968 4.5702 0.020823
## 12 0.0012125 19 0.65807 4.5719 0.020826
## 13 0.0010000 21 0.65565 4.5694 0.020823
plotcp(tree_loss_matrix)
# prune the loss-matrix tree and plot the pruned version
ptree_loss_matrix = prune(tree_loss_matrix, cp = 0.0011)
prp(ptree_loss_matrix)
# confusion matrix for the pruned tree on the test set
pred_loss_matrix = predict(ptree_loss_matrix,newdata = dfnaTest, type = "class")
confmat_loss_matrix = table(dfnaTest$SS,pred_loss_matrix)
confmat_loss_matrix
## pred_loss_matrix
## 0 1
## 0 12375 10167
## 1 497 6056
acc_loss_matrix = sum(diag(confmat_loss_matrix)) / sum(confmat_loss_matrix)
acc_loss_matrix
## [1] 0.6334765
# weight takeup cases (SS = 1) three times as heavily as non-takeup cases
case_weights = ifelse(dfnaTrain$SS == 0,1,3)
tree_weights = rpart(SS ~ ., method = "class",
data = dfnaTrain,
control = rpart.control(cp = 0.001,minsplit = 5,minbucket = 2),
weights = case_weights)
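As a quick check, the weight vector should assign weight 3 to every takeup case and weight 1 to the rest, mirroring the class counts seen earlier:
table(case_weights)   # expect 67628 cases with weight 1 and 19659 with weight 3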
plotcp(tree_weights)
ptree_weights = prune(tree_weights, cp = 0.00183101)
prp(ptree_weights,extra = 1)
pred_weights = predict(ptree_weights, newdata = dfnaTest,type = "class")
confmat_weights = table(dfnaTest$SS,pred_weights)
confmat_weights
## pred_weights
## 0 1
## 0 17343 5199
## 1 1520 5033
acc_weights = sum(diag(confmat_weights)) / sum(confmat_weights)
acc_weights
## [1] 0.7690668
It can be seen that most of the cases in the data did not take up product SS. A logistic regression and a decision tree algorithm were each applied to the data in an effort to optimally predict whether a client will take up product SS.
From the results of the analysis it can be seen that the logistic regression model produced better results than the decision trees with respect to accuracy, achieving roughly 82% correct predictions on the test set.
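For reference, the test-set accuracies obtained above can be collected in one place (a sketch using the objects created earlier; the logistic value is recomputed from its confusion-matrix counts):
c(logistic     = (m + n) / (m + n + o + p),   # about 0.815
  tree_prior   = acc_prior,                   # about 0.811
  tree_loss    = acc_loss_matrix,             # about 0.633
  tree_weights = acc_weights)                 # about 0.769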