Problem Statement

The provided data relates to clients taking up a new product, indicated by the SS column (the target variable): a value of 1 means the client took up the product, and a value of 0 means the client did not.

Session Information

sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.5.2  magrittr_1.5    tools_3.5.2     htmltools_0.3.6
##  [5] yaml_2.2.0      Rcpp_1.0.0      stringi_1.2.4   rmarkdown_1.11 
##  [9] knitr_1.21      stringr_1.3.1   xfun_0.5        digest_0.6.18  
## [13] evaluate_0.13
setwd("C:/Users/Edzai/Documents/Edzai_assessment_documents")

getwd()
## [1] "C:/Users/Edzai/Documents/Edzai_assessment_documents"

The dataset is provided in the feather format, so the feather package must be loaded in R.
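
(If the feather package is not yet installed, it can be obtained from CRAN first.)

install.packages("feather")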

library(feather)

Load Dataset

df <- read_feather("C:/Users/Edzai/Documents/Edzai_assessment_documents/train_data")

Sense-check the loaded data by inspecting the column names

names(df)
##  [1] "LoanClient"             "Inflows_Total"         
##  [3] "Outflows_Total_L3M"     "Inflows_Above_1k"      
##  [5] "Max_Dep_Bal_L3M"        "InactiveLoanClient"    
##  [7] "Other_Perc_L3M"         "Outflow_Max_L3M"       
##  [9] "Inflows_Max_AVG_Day"    "DormantLoanClient"     
## [11] "Ave_Days_Above_100_L3M" "CW_Perc_L3M"           
## [13] "DO_Perc_L3M"            "DODispute_L3M"         
## [15] "Val_POS_L3M"            "Val_DO_L3M"            
## [17] "Avg_Dep_Bal_L3M"        "CSWEEP_P90_L3M"        
## [19] "CW_Util_L3M"            "SS"                    
## [21] "LoanEver"

Analyse the structure of the data

str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame':    140000 obs. of  21 variables:
##  $ LoanClient            : int  1 1 0 0 0 0 0 1 0 0 ...
##  $ Inflows_Total         : num  44214 NA 42156 97643 300 ...
##  $ Outflows_Total_L3M    : num  -43560 NA -38444 -157110 -224 ...
##  $ Inflows_Above_1k      : int  2 NA 0 3 0 0 3 0 1 6 ...
##  $ Max_Dep_Bal_L3M       : num  2954 NA 30800 51622 318 ...
##  $ InactiveLoanClient    : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ Other_Perc_L3M        : num  0.253 NA 0.027 0.062 0 ...
##  $ Outflow_Max_L3M       : num  -2000 NA -1500 -6500 -224 ...
##  $ Inflows_Max_AVG_Day   : int  14 NA 23 24 2 24 24 25 24 19 ...
##  $ DormantLoanClient     : int  0 0 1 1 0 0 0 0 1 1 ...
##  $ Ave_Days_Above_100_L3M: num  0.424 NA 0.141 1 0.13 ...
##  $ CW_Perc_L3M           : num  0.312 NA 0.742 0.239 0 ...
##  $ DO_Perc_L3M           : num  0.0417 NA 0 0.0708 1 ...
##  $ DODispute_L3M         : int  2 NA 0 2 0 0 6 0 0 0 ...
##  $ Val_POS_L3M           : num  -9112 NA -9713 -52805 0 ...
##  $ Val_DO_L3M            : num  -1682 NA 0 -11227 -224 ...
##  $ Avg_Dep_Bal_L3M       : num  307.2 NA 1543.9 9005.7 95.2 ...
##  $ CSWEEP_P90_L3M        : int  0 NA 0 0 0 0 5 1 3 1 ...
##  $ CW_Util_L3M           : num  0.307 NA 0.67 0.202 0 ...
##  $ SS                    : int  1 0 0 0 0 1 0 1 0 0 ...
##  $ LoanEver              : num  1 1 1 1 0 1 0 1 1 1 ...

Check the summary statistics for each of the variables

summary(df)
##    LoanClient     Inflows_Total     Outflows_Total_L3M   Inflows_Above_1k 
##  Min.   :0.0000   Min.   :      0   Min.   :-2696387.0   Min.   :  0.000  
##  1st Qu.:0.0000   1st Qu.:   5640   1st Qu.:  -33424.3   1st Qu.:  1.000  
##  Median :0.0000   Median :  15250   Median :  -14912.5   Median :  2.000  
##  Mean   :0.2269   Mean   :  26613   Mean   :  -25697.0   Mean   :  2.928  
##  3rd Qu.:0.0000   3rd Qu.:  34047   3rd Qu.:   -5624.8   3rd Qu.:  4.000  
##  Max.   :1.0000   Max.   :3506638   Max.   :      -2.5   Max.   :700.000  
##                   NA's   :21242     NA's   :21325        NA's   :21242    
##  Max_Dep_Bal_L3M     InactiveLoanClient Other_Perc_L3M 
##  Min.   :  -5599.2   Min.   :0.00000    Min.   :0.000  
##  1st Qu.:    999.8   1st Qu.:0.00000    1st Qu.:0.000  
##  Median :   3327.9   Median :0.00000    Median :0.015  
##  Mean   :   8375.3   Mean   :0.03246    Mean   :0.089  
##  3rd Qu.:   8010.8   3rd Qu.:0.00000    3rd Qu.:0.126  
##  Max.   :2118570.0   Max.   :1.00000    Max.   :1.000  
##  NA's   :5244                           NA's   :18884  
##  Outflow_Max_L3M     Inflows_Max_AVG_Day DormantLoanClient
##  Min.   :-550000.0   Min.   : 1.00       Min.   :0.0000   
##  1st Qu.:  -2000.0   1st Qu.:15.00       1st Qu.:0.0000   
##  Median :  -1000.0   Median :22.00       Median :0.0000   
##  Mean   :  -1637.2   Mean   :19.95       Mean   :0.1512   
##  3rd Qu.:   -471.5   3rd Qu.:25.00       3rd Qu.:0.0000   
##  Max.   :     -1.5   Max.   :31.00       Max.   :1.0000   
##  NA's   :21325       NA's   :21242                        
##  Ave_Days_Above_100_L3M  CW_Perc_L3M     DO_Perc_L3M    DODispute_L3M   
##  Min.   :0.000          Min.   :0.000   Min.   :0.000   Min.   : 0.000  
##  1st Qu.:0.130          1st Qu.:0.216   1st Qu.:0.000   1st Qu.: 0.000  
##  Median :0.393          Median :0.443   Median :0.000   Median : 0.000  
##  Mean   :0.445          Mean   :0.463   Mean   :0.079   Mean   : 0.353  
##  3rd Qu.:0.750          3rd Qu.:0.691   3rd Qu.:0.098   3rd Qu.: 0.000  
##  Max.   :1.000          Max.   :1.000   Max.   :1.000   Max.   :54.000  
##  NA's   :5244           NA's   :18884   NA's   :18884   NA's   :18884   
##   Val_POS_L3M        Val_DO_L3M      Avg_Dep_Bal_L3M     CSWEEP_P90_L3M  
##  Min.   :-410127   Min.   :-253761   Min.   :  -6827.4   Min.   : 0.000  
##  1st Qu.:  -6748   1st Qu.:  -2140   1st Qu.:    109.9   1st Qu.: 0.000  
##  Median :  -2007   Median :      0   Median :    405.3   Median : 1.000  
##  Mean   :  -5416   Mean   :  -2639   Mean   :   2110.3   Mean   : 1.333  
##  3rd Qu.:   -239   3rd Qu.:      0   3rd Qu.:   1239.5   3rd Qu.: 2.000  
##  Max.   :      0   Max.   :      0   Max.   :1308562.6   Max.   :31.000  
##  NA's   :18884     NA's   :18884     NA's   :5244        NA's   :18884   
##   CW_Util_L3M            SS            LoanEver     
##  Min.   :  0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:  0.152   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :  0.362   Median :0.0000   Median :0.0000  
##  Mean   :  0.436   Mean   :0.1984   Mean   :0.4106  
##  3rd Qu.:  0.616   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :128.693   Max.   :1.0000   Max.   :1.0000  
##  NA's   :18884

Check the shape of the data

dim(df)
## [1] 140000     21

Explore missing values in the dataset using the VIM package

library(VIM)
a <- aggr(df)

summary(a)
## 
##  Missings per variable: 
##                Variable Count
##              LoanClient     0
##           Inflows_Total 21242
##      Outflows_Total_L3M 21325
##        Inflows_Above_1k 21242
##         Max_Dep_Bal_L3M  5244
##      InactiveLoanClient     0
##          Other_Perc_L3M 18884
##         Outflow_Max_L3M 21325
##     Inflows_Max_AVG_Day 21242
##       DormantLoanClient     0
##  Ave_Days_Above_100_L3M  5244
##             CW_Perc_L3M 18884
##             DO_Perc_L3M 18884
##           DODispute_L3M 18884
##             Val_POS_L3M 18884
##              Val_DO_L3M 18884
##         Avg_Dep_Bal_L3M  5244
##          CSWEEP_P90_L3M 18884
##             CW_Util_L3M 18884
##                      SS     0
##                LoanEver     0
## 
##  Missings in combinations of variables: 
##                               Combinations  Count      Percent
##  0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 116382 83.130000000
##  0:0:1:0:0:0:0:1:0:0:0:0:0:0:0:0:0:0:0:0:0   2371  1.693571429
##  0:0:1:0:1:0:0:1:0:0:1:0:0:0:0:0:1:0:0:0:0      5  0.003571429
##  0:1:0:1:0:0:0:0:1:0:0:0:0:0:0:0:0:0:0:0:0   2288  1.634285714
##  0:1:0:1:1:0:0:0:1:0:1:0:0:0:0:0:1:0:0:0:0      5  0.003571429
##  0:1:1:1:0:0:0:1:1:0:0:0:0:0:0:0:0:0:0:0:0     65  0.046428571
##  0:1:1:1:0:0:1:1:1:0:0:1:1:1:1:1:0:1:1:0:0  13650  9.750000000
##  0:1:1:1:1:0:1:1:1:0:1:1:1:1:1:1:1:1:1:0:0   5234  3.738571429

Since the dataset contains well over 100 000 observations, discarding the cases with missing values should have a minimal effect on the logistic regression model.

dfna <- na.omit(df)

Check the dimension and structure of the new dataset

str(dfna)
## Classes 'tbl_df', 'tbl' and 'data.frame':    116382 obs. of  21 variables:
##  $ LoanClient            : int  1 0 0 0 0 0 1 0 0 1 ...
##  $ Inflows_Total         : num  44214 42156 97643 300 16836 ...
##  $ Outflows_Total_L3M    : num  -43560 -38444 -157110 -224 -13428 ...
##  $ Inflows_Above_1k      : int  2 0 3 0 0 3 0 1 6 9 ...
##  $ Max_Dep_Bal_L3M       : num  2954 30800 51622 318 5826 ...
##  $ InactiveLoanClient    : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Other_Perc_L3M        : num  0.253 0.027 0.062 0 0.256 ...
##  $ Outflow_Max_L3M       : num  -2000 -1500 -6500 -224 -1234 ...
##  $ Inflows_Max_AVG_Day   : int  14 23 24 2 24 24 25 24 19 21 ...
##  $ DormantLoanClient     : int  0 1 1 0 0 0 0 1 1 0 ...
##  $ Ave_Days_Above_100_L3M: num  0.424 0.141 1 0.13 0.876 ...
##  $ CW_Perc_L3M           : num  0.312 0.742 0.239 0 0.403 ...
##  $ DO_Perc_L3M           : num  0.0417 0 0.0708 1 0 ...
##  $ DODispute_L3M         : int  2 0 2 0 0 6 0 0 0 0 ...
##  $ Val_POS_L3M           : num  -9112 -9713 -52805 0 -2772 ...
##  $ Val_DO_L3M            : num  -1682 0 -11227 -224 0 ...
##  $ Avg_Dep_Bal_L3M       : num  307.2 1543.9 9005.7 95.2 1187.4 ...
##  $ CSWEEP_P90_L3M        : int  0 0 0 0 0 5 1 3 1 0 ...
##  $ CW_Util_L3M           : num  0.307 0.67 0.202 0 0.373 ...
##  $ SS                    : int  1 0 0 0 1 0 1 0 0 1 ...
##  $ LoanEver              : num  1 1 1 0 1 0 1 1 1 1 ...
##  - attr(*, "na.action")= 'omit' Named int  2 11 12 15 33 40 44 46 49 50 ...
##   ..- attr(*, "names")= chr  "2" "11" "12" "15" ...
dim(dfna)
## [1] 116382     21
summary(dfna)
##    LoanClient     Inflows_Total     Outflows_Total_L3M   Inflows_Above_1k 
##  Min.   :0.0000   Min.   :      0   Min.   :-2696387.0   Min.   :  0.000  
##  1st Qu.:0.0000   1st Qu.:   6139   1st Qu.:  -33916.1   1st Qu.:  1.000  
##  Median :0.0000   Median :  15675   Median :  -15306.8   Median :  2.000  
##  Mean   :0.2485   Mean   :  27104   Mean   :  -26115.3   Mean   :  2.981  
##  3rd Qu.:0.0000   3rd Qu.:  34674   3rd Qu.:   -6023.2   3rd Qu.:  4.000  
##  Max.   :1.0000   Max.   :3506638   Max.   :      -4.9   Max.   :700.000  
##  Max_Dep_Bal_L3M   InactiveLoanClient Other_Perc_L3M   
##  Min.   :    -55   Min.   :0.00000    Min.   :0.00000  
##  1st Qu.:   1686   1st Qu.:0.00000    1st Qu.:0.00000  
##  Median :   4064   Median :0.00000    Median :0.01750  
##  Mean   :   9405   Mean   :0.03498    Mean   :0.09042  
##  3rd Qu.:   9063   3rd Qu.:0.00000    3rd Qu.:0.13180  
##  Max.   :2118570   Max.   :1.00000    Max.   :1.00000  
##  Outflow_Max_L3M     Inflows_Max_AVG_Day DormantLoanClient
##  Min.   :-550000.0   Min.   : 1.00       Min.   :0.0000   
##  1st Qu.:  -2047.8   1st Qu.:15.00       1st Qu.:0.0000   
##  Median :  -1000.0   Median :22.00       Median :0.0000   
##  Mean   :  -1654.2   Mean   :19.99       Mean   :0.1493   
##  3rd Qu.:   -500.0   3rd Qu.:25.00       3rd Qu.:0.0000   
##  Max.   :     -1.5   Max.   :31.00       Max.   :1.0000   
##  Ave_Days_Above_100_L3M  CW_Perc_L3M      DO_Perc_L3M     
##  Min.   :0.0000         Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.1847         1st Qu.:0.2339   1st Qu.:0.00000  
##  Median :0.4347         Median :0.4526   Median :0.00200  
##  Mean   :0.4742         Mean   :0.4728   Mean   :0.08191  
##  3rd Qu.:0.7500         3rd Qu.:0.6950   3rd Qu.:0.10380  
##  Max.   :1.0000         Max.   :1.0000   Max.   :1.00000  
##  DODispute_L3M      Val_POS_L3M          Val_DO_L3M        
##  Min.   : 0.0000   Min.   :-410127.0   Min.   :-253760.76  
##  1st Qu.: 0.0000   1st Qu.:  -7031.0   1st Qu.:  -2313.16  
##  Median : 0.0000   Median :  -2211.9   Median :    -69.73  
##  Mean   : 0.3673   Mean   :  -5608.5   Mean   :  -2745.00  
##  3rd Qu.: 0.0000   3rd Qu.:   -332.8   3rd Qu.:      0.00  
##  Max.   :54.0000   Max.   :      0.0   Max.   :      0.00  
##  Avg_Dep_Bal_L3M     CSWEEP_P90_L3M    CW_Util_L3M             SS        
##  Min.   :  -6827.4   Min.   : 0.000   Min.   :  0.0000   Min.   :0.0000  
##  1st Qu.:    168.4   1st Qu.: 0.000   1st Qu.:  0.1751   1st Qu.:0.0000  
##  Median :    493.9   Median : 1.000   Median :  0.3775   Median :0.0000  
##  Mean   :   2227.8   Mean   : 1.383   Mean   :  0.4517   Mean   :0.2252  
##  3rd Qu.:   1384.5   3rd Qu.: 2.000   3rd Qu.:  0.6269   3rd Qu.:0.0000  
##  Max.   :1308562.6   Max.   :31.000   Max.   :128.6934   Max.   :1.0000  
##     LoanEver     
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.4328  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

Splitting Training & Testing Data

Split out a validation dataset. Given that there is only one dataset, it is paramount that we randomly split it into a training set and a testing set.

For reproducibility, we’ll set the seed to initialize the random number generator. The caTools package contains the sample.split command, used here with a split ratio of 0.75: we’ll put 75% of the data in the training set, which we’ll use to build the model, and 25% in the testing set.

library(caTools)
# Randomly split data

set.seed(1)

split = sample.split(dfna$SS, SplitRatio = 0.75)

The subset function is used to create the training and testing sets; the training set will be called dfnaTrain and the testing set dfnaTest.

dfnaTrain = subset(dfna, split == TRUE)
dfnaTest = subset(dfna, split == FALSE)

Check the dimensions of both the training and testing sets

dim(dfnaTrain)
## [1] 87287    21
dim(dfnaTest)
## [1] 29095    21
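
sample.split stratifies on the outcome, so the takeup rate (about 22.5%) should be nearly identical in the two sets. A quick check:

prop.table(table(dfnaTrain$SS))
prop.table(table(dfnaTest$SS))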

Summarize Data

Check the bivariate correlations between the variables to assess whether linear relationships exist between them and how strong those relationships are.

Correlations

set.seed(71)
library(mlbench)
library(caret)
# calculate correlation matrix
correlationMatrix <- cor(dfnaTrain)

#correlogram
require(corrplot)
corrplot(correlationMatrix,
         method = 'shade',
         type = "lower")

# find attributes that are highly correlated (using a cutoff of 0.7)
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff = 0.7)
print(highlyCorrelated)
## [1] 3 5
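
The returned values are column indices; mapping them back to names shows that the flagged attributes are Outflows_Total_L3M (column 3) and Max_Dep_Bal_L3M (column 5):

names(dfnaTrain)[highlyCorrelated]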

Convert the indicator variables (LoanEver, LoanClient, InactiveLoanClient, DormantLoanClient) and the target SS from integer to factor in both the training and testing sets.

factor_vars <- c("LoanEver", "LoanClient", "InactiveLoanClient",
                 "DormantLoanClient", "SS")

for (v in factor_vars) {
  dfnaTrain[[v]] <- factor(dfnaTrain[[v]])
  dfnaTest[[v]]  <- factor(dfnaTest[[v]])
}

Structure of the Training Dataset

str(dfnaTrain)

SS Vs LoanClient Crosstab

xtabs(~SS + LoanClient, data=dfnaTrain)
##    LoanClient
## SS      0     1
##   0 57567 10061
##   1  7985 11674

SS Vs LoanEver Crosstab

xtabs(~SS + LoanEver, data=dfnaTrain)
##    LoanEver
## SS      0     1
##   0 45042 22586
##   1  4359 15300

SS Vs DormantLoanClient

xtabs(~SS + DormantLoanClient, data=dfnaTrain)
##    DormantLoanClient
## SS      0     1
##   0 56919 10709
##   1 17264  2395

SS Vs InactiveLoanClient

xtabs(~SS +InactiveLoanClient, data=dfnaTrain)
##    InactiveLoanClient
## SS      0     1
##   0 65812  1816
##   1 18428  1231

Baseline Model

About 22.5% of the clients in the training set took up the product, so a baseline model that always predicts non-takeup achieves roughly 77.5% accuracy. This is the benchmark the logistic regression model will try to beat.

table(dfnaTrain$SS)
## 
##     0     1 
## 67628 19659
19659/87287
## [1] 0.2252225
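
For reference, the accuracy of that majority-class baseline can be computed directly:

# proportion of the most frequent class (non-takeup) in the training set
max(table(dfnaTrain$SS)) / nrow(dfnaTrain)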

Feature Selection

Rank the predictors by importance using a cross-validated glm fit in caret; for a glm, varImp is based on the absolute value of each coefficient’s z statistic.

set.seed(7)
# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
model <- train(SS~., data=dfnaTrain, method="glm", preProcess="scale", trControl=control)
# estimate variable importance
importance <- varImp(model, scale=FALSE)
# summarize importance
print(importance)
## glm variable importance
## 
##                        Overall
## LoanClient1            89.3681
## InactiveLoanClient1    43.4456
## DormantLoanClient1     27.6578
## Inflows_Above_1k       26.7331
## Ave_Days_Above_100_L3M 26.1940
## Inflows_Max_AVG_Day    25.6627
## Max_Dep_Bal_L3M        14.7194
## Outflows_Total_L3M     13.4118
## DODispute_L3M          12.3461
## CW_Perc_L3M            11.2464
## Val_POS_L3M             9.7256
## Other_Perc_L3M          6.0744
## DO_Perc_L3M             4.5116
## Val_DO_L3M              4.3199
## Inflows_Total           2.8254
## CSWEEP_P90_L3M          1.7254
## Outflow_Max_L3M         1.2228
## CW_Util_L3M             0.9523
## Avg_Dep_Bal_L3M         0.6633
# plot importance
plot(importance)

Logistic Regression Model

The logistic regression method assumes that:

  • The outcome is a binary or dichotomous variable.
  • There is a linear relationship between the logit of the outcome and each continuous predictor variable.
  • There are no influential values (extreme values or outliers) in the continuous predictors.
  • There are no high intercorrelations (i.e. multicollinearity) among the predictors.

Linearity assumption

To help satisfy the linearity assumption, Outflows_Total_L3M, which is strictly negative and heavily skewed, enters the model as the log10 of its absolute value.

logistic = glm(SS ~ LoanClient + Inflows_Above_1k + log10(abs(Outflows_Total_L3M)) + InactiveLoanClient + Other_Perc_L3M + Inflows_Max_AVG_Day + DormantLoanClient + Ave_Days_Above_100_L3M + CW_Perc_L3M + DO_Perc_L3M + DODispute_L3M, data=dfnaTrain, family="binomial")


summary(logistic)
## 
## Call:
## glm(formula = SS ~ LoanClient + Inflows_Above_1k + log10(abs(Outflows_Total_L3M)) + 
##     InactiveLoanClient + Other_Perc_L3M + Inflows_Max_AVG_Day + 
##     DormantLoanClient + Ave_Days_Above_100_L3M + CW_Perc_L3M + 
##     DO_Perc_L3M + DODispute_L3M, family = "binomial", data = dfnaTrain)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2280  -0.5737  -0.3527  -0.1686   3.8523  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    -7.111585   0.100847 -70.518  < 2e-16 ***
## LoanClient1                     2.057989   0.026294  78.268  < 2e-16 ***
## Inflows_Above_1k               -0.094828   0.003137 -30.226  < 2e-16 ***
## log10(abs(Outflows_Total_L3M))  1.084983   0.023683  45.812  < 2e-16 ***
## InactiveLoanClient1             1.710479   0.043358  39.450  < 2e-16 ***
## Other_Perc_L3M                  0.370036   0.080766   4.582 4.61e-06 ***
## Inflows_Max_AVG_Day             0.033207   0.001564  21.239  < 2e-16 ***
## DormantLoanClient1              0.712351   0.029414  24.218  < 2e-16 ***
## Ave_Days_Above_100_L3M          0.655681   0.038433  17.061  < 2e-16 ***
## CW_Perc_L3M                    -0.847232   0.043863 -19.315  < 2e-16 ***
## DO_Perc_L3M                    -0.333294   0.076442  -4.360 1.30e-05 ***
## DODispute_L3M                  -0.108984   0.007641 -14.262  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 93125  on 87286  degrees of freedom
## Residual deviance: 68956  on 87275  degrees of freedom
## AIC: 68980
## 
## Number of Fisher Scoring iterations: 5
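
Exponentiating the coefficients converts them from log-odds to odds ratios, which are easier to interpret; for example, the LoanClient1 estimate of about 2.06 corresponds to an odds ratio of exp(2.06) ≈ 7.8.

# odds ratios for each predictor
exp(coef(logistic))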

Multicollinearity

None of the predictor variables showed multicollinearity, as all had VIF values below 5.

car::vif(logistic)
##                     LoanClient               Inflows_Above_1k 
##                       1.856679                       1.093900 
## log10(abs(Outflows_Total_L3M))             InactiveLoanClient 
##                       1.407744                       1.131941 
##                 Other_Perc_L3M            Inflows_Max_AVG_Day 
##                       1.572125                       1.032053 
##              DormantLoanClient         Ave_Days_Above_100_L3M 
##                       1.287433                       1.482674 
##                    CW_Perc_L3M                    DO_Perc_L3M 
##                       1.349119                       1.291934 
##                  DODispute_L3M 
##                       1.076992

Check for Influential Outliers

library(tidyverse)
library(broom)
plot(logistic, which = 4, id.n = 3)

# Extract model results
logistic.data <- augment(logistic) %>% 
  mutate(index = 1:n()) 

logistic.data %>% top_n(3, .cooksd)
## # A tibble: 3 x 20
##   SS    LoanClient Inflows_Above_1k log10.abs.Outfl~ InactiveLoanCli~
##   <fct> <fct>                 <int>            <dbl> <fct>           
## 1 1     1                        52             5.56 0               
## 2 1     1                        97             5.31 0               
## 3 1     1                         2             4.57 0               
## # ... with 15 more variables: Other_Perc_L3M <dbl>,
## #   Inflows_Max_AVG_Day <int>, DormantLoanClient <fct>,
## #   Ave_Days_Above_100_L3M <dbl>, CW_Perc_L3M <dbl>, DO_Perc_L3M <dbl>,
## #   DODispute_L3M <int>, .fitted <dbl>, .se.fit <dbl>, .resid <dbl>,
## #   .hat <dbl>, .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>, index <int>
ggplot(logistic.data, aes(index, .std.resid)) + 
  geom_point(aes(color = SS), alpha = .5) +
  theme_bw()

logistic.data %>% 
  filter(abs(.std.resid) > 3)
## # A tibble: 29 x 20
##    SS    LoanClient Inflows_Above_1k log10.abs.Outfl~ InactiveLoanCli~
##    <fct> <fct>                 <int>            <dbl> <fct>           
##  1 1     0                         0             2.21 0               
##  2 1     0                         0             2.70 0               
##  3 1     0                         0             1.08 0               
##  4 1     0                         0             1.38 0               
##  5 1     0                         1             2.95 0               
##  6 1     0                         0             2.53 0               
##  7 1     0                         3             3.82 0               
##  8 1     0                         0             2.20 0               
##  9 1     0                         0             1.08 0               
## 10 1     0                         0             2.60 0               
## # ... with 19 more rows, and 15 more variables: Other_Perc_L3M <dbl>,
## #   Inflows_Max_AVG_Day <int>, DormantLoanClient <fct>,
## #   Ave_Days_Above_100_L3M <dbl>, CW_Perc_L3M <dbl>, DO_Perc_L3M <dbl>,
## #   DODispute_L3M <int>, .fitted <dbl>, .se.fit <dbl>, .resid <dbl>,
## #   .hat <dbl>, .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>, index <int>

Making Predictions on the Training Set

predicted = data.frame(probability.of.SS = logistic$fitted.values, SS = dfnaTrain$SS)

predicted = predicted[order(predicted$probability.of.SS, decreasing = FALSE),]

predicted$rank = 1:nrow(predicted)

library(ggplot2)

library(cowplot)

ggplot(data=predicted, aes(x=rank, y=probability.of.SS)) + 
  geom_point(aes(colour=SS), alpha=1, shape=4, stroke=2)+
  xlab("Index") +
  ylab("Predicted Probability of Taking Up SS")

Summary of the Predicted Probabilities

predictTrain = predict(logistic, type="response")
summary(predictTrain)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0546  0.1216  0.2252  0.3543  0.9164
tapply(predictTrain, dfnaTrain$SS, mean)
##         0         1 
## 0.1621618 0.4421549

Thresholding - choosing the threshold based on the error trade-off

Confusion matrix

Confusion matrix for threshold of 0.5

The threshold of 0.5 was chosen as it gave the highest accuracy compared with thresholds of 0.2 and 0.7 (both evaluated below).

table(dfnaTrain$SS, predictTrain > 0.5)
##    
##     FALSE  TRUE
##   0 62224  5404
##   1 10201  9458

Accuracy - (TP+TN)/TOTAL

a = 62224
b = 9458
c = 5404 
d = 10201

(a+b)/(a+b+c+d)
## [1] 0.8212219

Sensitivity - TP/(TP + FN)

b/(b+d)
## [1] 0.4811028

Specificity - TN/(TN+FP)

a/(a+c)
## [1] 0.9200923

Precision - TP/(PREDICTED YES)

b/(b+c)
## [1] 0.6363881
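
The same four metrics are recomputed below for thresholds of 0.7 and 0.2. To avoid retyping the cell counts each time, a small helper (a sketch, assuming SS = 1 is the positive class) computes them directly from a threshold:

classMetrics <- function(actual, predicted_prob, threshold) {
  cm <- table(actual, predicted_prob > threshold)
  TN <- cm["0", "FALSE"]; FP <- cm["0", "TRUE"]
  FN <- cm["1", "FALSE"]; TP <- cm["1", "TRUE"]
  c(accuracy    = (TP + TN) / sum(cm),
    sensitivity = TP / (TP + FN),
    specificity = TN / (TN + FP),
    precision   = TP / (TP + FP))
}

# reproduces the threshold-0.5 figures above
classMetrics(dfnaTrain$SS, predictTrain, 0.5)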

Confusion matrix for threshold of 0.7

table(dfnaTrain$SS, predictTrain > 0.7)
##    
##     FALSE  TRUE
##   0 66400  1228
##   1 17067  2592

Accuracy - (TP+TN)/TOTAL

e = 66400
f = 2592
g = 1228
h = 17067

(e+f)/(e+f+g+h)
## [1] 0.7904041

Sensitivity - TP/(TP + FN)

f/(f+h)
## [1] 0.131848

Specificity - TN/(TN+FP)

e/(e+g)
## [1] 0.9818418

Precision - TP/(PREDICTED YES)

f/(f+g)
## [1] 0.678534

Confusion matrix for threshold of 0.2

table(dfnaTrain$SS, predictTrain > 0.2)
##    
##     FALSE  TRUE
##   0 50331 17297
##   1  4184 15475

Accuracy - (TP+TN)/TOTAL

i = 50331
j = 15475
k = 4184
l = 17297

(i+j)/(i+j+k+l)
## [1] 0.7539038

Sensitivity - TP/(TP + FN)

j/(j+l)
## [1] 0.4722019

Specificity - TN/(TN+FP)

i/(i+k)
## [1] 0.9232505

Precision - TP/(PREDICTED YES)

j/(j+k)
## [1] 0.7871713

A Receiver Operator Characteristic curve, or ROC curve, can help us decide which value of the threshold is best.

Load the ROCR package

library(ROCR)


ROCRpred = prediction(predictTrain, dfnaTrain$SS)

Performance function - defines what we’d like to plot on the x and y-axes of our ROC curve.

ROCRperf = performance(ROCRpred, "tpr", "fpr")

plot(ROCRperf, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))
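
The ROC curve can be summarized by the area under it (AUC), which ROCR exposes through the same performance function:

# area under the ROC curve
auc <- performance(ROCRpred, "auc")
as.numeric(auc@y.values)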

Prediction on Test Set

predictTest = predict(logistic, type = "response", newdata = dfnaTest)
table(dfnaTest$SS, predictTest >= 0.5)
##    
##     FALSE  TRUE
##   0 20677  1865
##   1  3509  3044

Accuracy - (TP+TN)/TOTAL

m = 20677
n = 3044
o = 1865
p = 3509

(m+n)/(m+n+o+p)
## [1] 0.8152947

Sensitivity - TP/(TP + FN)

n/(n+p)
## [1] 0.4645201

Specificity - TN/(TN+FP)

m/(m+o)
## [1] 0.9172655

Precision - TP/(PREDICTED YES)

n/(n+o)
## [1] 0.6200856
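
The classMetrics helper defined earlier reproduces these test-set figures in a single call (up to ties at exactly 0.5, since it uses a strict > comparison):

classMetrics(dfnaTest$SS, predictTest, 0.5)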

Decision Trees

As an alternative to logistic regression, classification trees are fitted with rpart: first with adjusted class priors, then with a loss matrix, and finally with case weights.

library(rpart, warn.conflicts = FALSE)
library(rpart.plot, warn.conflicts = FALSE)
library(rattle, warn.conflicts = FALSE)
library(RColorBrewer, warn.conflicts = FALSE)
 
# Changing the prior probabilities of SS product takeup (0.3) and non-takeup (0.7)

tree_prior = rpart(SS ~ LoanClient + Inflows_Above_1k + log10(abs(Outflows_Total_L3M)) + InactiveLoanClient + Other_Perc_L3M + Inflows_Max_AVG_Day + DormantLoanClient + Ave_Days_Above_100_L3M + CW_Perc_L3M + DO_Perc_L3M + DODispute_L3M, method = "class",
                    data = dfnaTrain,
                    parms = list(prior = c(0.7, 0.3)), # P(non-takeup) = 0.7, P(takeup) = 0.3
                    control = rpart.control(cp = 0.001))

prp(tree_prior)

# cp and xerror values

printcp(tree_prior)
## 
## Classification tree:
## rpart(formula = SS ~ LoanClient + Inflows_Above_1k + log10(abs(Outflows_Total_L3M)) + 
##     InactiveLoanClient + Other_Perc_L3M + Inflows_Max_AVG_Day + 
##     DormantLoanClient + Ave_Days_Above_100_L3M + CW_Perc_L3M + 
##     DO_Perc_L3M + DODispute_L3M, data = dfnaTrain, method = "class", 
##     parms = list(prior = c(0.7, 0.3)), control = rpart.control(cp = 0.001))
## 
## Variables actually used in tree construction:
## [1] Ave_Days_Above_100_L3M         DODispute_L3M                 
## [3] InactiveLoanClient             Inflows_Above_1k              
## [5] Inflows_Max_AVG_Day            LoanClient                    
## [7] log10(abs(Outflows_Total_L3M)) Other_Perc_L3M                
## 
## Root node error: 26186/87287 = 0.3
## 
## n= 87287 
## 
##           CP nsplit rel error  xerror      xstd
## 1  0.2466953      0   1.00000 1.00000 0.0062778
## 2  0.0082954      1   0.75330 0.75330 0.0051124
## 3  0.0079886      5   0.72012 0.73726 0.0051670
## 4  0.0051988      7   0.70415 0.70604 0.0051171
## 5  0.0029382      8   0.69895 0.70249 0.0050549
## 6  0.0024183      9   0.69601 0.69978 0.0050350
## 7  0.0021667     10   0.69359 0.69708 0.0050330
## 8  0.0019707     11   0.69142 0.69615 0.0050312
## 9  0.0016691     12   0.68945 0.69546 0.0050313
## 10 0.0014668     13   0.68778 0.69367 0.0050244
## 11 0.0014306     14   0.68632 0.69262 0.0050236
## 12 0.0010427     15   0.68489 0.68985 0.0050147
## 13 0.0010000     16   0.68384 0.68993 0.0050102

cp vs xerror plot.

plotcp(tree_prior) 

# cp value with the lowest cross-validated error
tree_min = tree_prior$cptable[which.min(tree_prior$cptable[,"xerror"]),"CP"]

# pruning the tree for increased model performance.

ptree_prior = prune(tree_prior, cp = tree_min) # pruning the tree.

prp(ptree_prior)

pred_prior = predict(ptree_prior, newdata = dfnaTest, type = "class") # making predictions.

confmat_prior = table(dfnaTest$SS,pred_prior) # making the confusion matrix.
confmat_prior
##    pred_prior
##         0     1
##   0 19925  2617
##   1  2868  3685
acc_prior = sum(diag(confmat_prior)) / sum(confmat_prior)
acc_prior
## [1] 0.8114796

Loss Matrix

# Including a loss matrix

tree_loss_matrix = rpart(SS ~ LoanClient + Inflows_Above_1k + log10(abs(Outflows_Total_L3M)) + InactiveLoanClient + Other_Perc_L3M + Inflows_Max_AVG_Day + DormantLoanClient + Ave_Days_Above_100_L3M + CW_Perc_L3M + DO_Perc_L3M + DODispute_L3M, method = "class",
                          data = dfnaTrain,
                          parms = list(loss = matrix(c(0,10,1,0),ncol = 2)), # penalizing classifying a product takeup as a non-takeup 10 times more
                          control = rpart.control(cp = 0.001))

printcp(tree_loss_matrix)
## 
## Classification tree:
## rpart(formula = SS ~ LoanClient + Inflows_Above_1k + log10(abs(Outflows_Total_L3M)) + 
##     InactiveLoanClient + Other_Perc_L3M + Inflows_Max_AVG_Day + 
##     DormantLoanClient + Ave_Days_Above_100_L3M + CW_Perc_L3M + 
##     DO_Perc_L3M + DODispute_L3M, data = dfnaTrain, method = "class", 
##     parms = list(loss = matrix(c(0, 10, 1, 0), ncol = 2)), control = rpart.control(cp = 0.001))
## 
## Variables actually used in tree construction:
##  [1] Ave_Days_Above_100_L3M         CW_Perc_L3M                   
##  [3] DO_Perc_L3M                    DODispute_L3M                 
##  [5] DormantLoanClient              InactiveLoanClient            
##  [7] Inflows_Above_1k               Inflows_Max_AVG_Day           
##  [9] LoanClient                     log10(abs(Outflows_Total_L3M))
## 
## Root node error: 67628/87287 = 0.77478
## 
## n= 87287 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.1282975      0   1.00000 10.0000 0.018249
## 2  0.0222541      2   0.74341  4.8453 0.021071
## 3  0.0060034      3   0.72115  4.9508 0.021171
## 4  0.0057816      5   0.70914  5.0145 0.021227
## 5  0.0040664      7   0.69758  4.8936 0.021128
## 6  0.0038446     10   0.68374  4.7382 0.020988
## 7  0.0038076     11   0.67990  4.6968 0.020948
## 8  0.0031274     13   0.67228  4.6902 0.020943
## 9  0.0030609     15   0.66603  4.5601 0.020809
## 10 0.0016413     16   0.66297  4.5074 0.020754
## 11 0.0016118     18   0.65968  4.5702 0.020823
## 12 0.0012125     19   0.65807  4.5719 0.020826
## 13 0.0010000     21   0.65565  4.5694 0.020823

cp vs xerror plot for the loss-matrix tree

plotcp(tree_loss_matrix)

Pruning the loss-matrix tree

ptree_loss_matrix = prune(tree_loss_matrix, cp = 0.0011) # cp value chosen from the cp plot above

prp(ptree_loss_matrix)

Confusion matrix for the loss-matrix tree

pred_loss_matrix = predict(ptree_loss_matrix,newdata = dfnaTest, type = "class")

confmat_loss_matrix = table(dfnaTest$SS,pred_loss_matrix)
confmat_loss_matrix
##    pred_loss_matrix
##         0     1
##   0 12375 10167
##   1   497  6056

Accuracy of the loss-matrix tree

acc_loss_matrix = sum(diag(confmat_loss_matrix)) / sum(confmat_loss_matrix)
acc_loss_matrix
## [1] 0.6334765
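
The lower accuracy hides what the loss matrix buys: far more takeups are detected. Its sensitivity, read off the confusion matrix above, is 6056/6553 ≈ 0.92, versus about 0.56 for the prior-adjusted tree:

# takeup detection rate (sensitivity) of the loss-matrix tree
confmat_loss_matrix["1", "1"] / sum(confmat_loss_matrix["1", ])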

Including case weights for the training dataset

case_weights = ifelse(dfnaTrain$SS == 0, 1, 3) # takeup cases are weighted 3 times more heavily than non-takeups

tree_weights = rpart(SS ~ ., method = "class", 
                      data = dfnaTrain, 
                      control = rpart.control(cp = 0.001,minsplit = 5,minbucket = 2),
                      weights = case_weights)

plotcp(tree_weights)

ptree_weights = prune(tree_weights, cp = 0.00183101) # cp value taken from the cp table/plot above

prp(ptree_weights,extra = 1)

pred_weights = predict(ptree_weights, newdata = dfnaTest,type = "class")

confmat_weights = table(dfnaTest$SS,pred_weights)
confmat_weights
##    pred_weights
##         0     1
##   0 17343  5199
##   1  1520  5033

Accuracy of the weighted tree

acc_weights = sum(diag(confmat_weights)) / sum(confmat_weights)
acc_weights
## [1] 0.7690668
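
Putting the three trees side by side makes the accuracy/sensitivity trade-off explicit; a quick comparison using the confusion matrices computed above:

# sensitivity (takeup detection rate) for each tree
sens <- function(cm) cm["1", "1"] / sum(cm["1", ])
c(prior = sens(confmat_prior),
  loss_matrix = sens(confmat_loss_matrix),
  weights = sens(confmat_weights))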

Conclusion

It can be seen that most of the cases in the data did not take up product SS. A logistic regression model and several decision tree variants were applied to the data in an effort to optimally predict whether a client will take up product SS.

The results show that the logistic regression model produced better results than the decision trees with respect to accuracy, achieving about 82% on the training set and 81.5% on the test set, against 81.1% for the best-performing tree (the one with adjusted priors).