library(rpart)
library(faraway)
library(maptree)    #Loading packages
library(tree)

1 Introduction

The dataset used in this analysis contains a real account of data on customers of a credit card company in Sioux Falls. It has six variables labeled Closed which is the target variable and the predictors labeled Months_on_Book which is the number of months an individual has been a customer, Net_Purchases_During_Cycle which is an amount in dollars of purchases a customer makes in a month, Net_Cash_Advances_During_Cycle which is the amount of cash advances of a customer for each month, Utility which is the balance on card/Credit limit of a customer in each month and DebtDimId which is the identification number for each customer. Closed is in 2 levels (0 & 1) with 0 representing a customer that did not close their account and 1 denoting a customer that the company lost within 6 months of the data collected. The predictors were collected at time (t) and the dependent variable was collected at time t+6.

Two models namely, Logistic regression and Tree regression were used to predict whether a customer will close his/her account within a time space given the months that individual has been a customer, the amount of purchases the customer makes for every month, the balance that remains on the customers card for every month and the amount of cash advances for each month.

#setwd('D:/R-STUDIO/RPubs_Project')
#dir()
Raw_dat<-read.csv("customerretention.csv")[-1]
if(require(dplyr)){glimpse(Raw_dat)} #Importing and loading the data set

## Rows: 6,237
## Columns: 5
## $ Months_On_Book                 <chr> "98", "86", "91", "164", "164", "161", ~
## $ Net_Purchases_During_Cycle     <int> 0, 81, 0, 15, 74, 67, 99, 29, 150, 14, ~
## $ Net_Cash_Advances_During_Cycle <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Utility                        <dbl> 0.017228571, 1.120111111, -0.053333333,~
## $ Closed                         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~

2 Data Preprocessing

Summary statistics on the data set shows the present of missing observations in two of the predictors namely, Month_On_Book and Utility. There is also the presence of negative observations in Net_Purchases_During_Cycle (This variable was renamed) and Utility. The rows with the missing observations were removed from the data set.

names(Raw_dat)[c(2:3)]<-c("Net_Purchases","Net_Cash_Advances")#Renaming the third and fourth variables since names are lengthy
Raw_dat$Months_On_Book<-as.numeric(Raw_dat$Months_On_Book)#Converting the variable Months_on_Book to numeric
summary(Raw_dat) #Performing summary statistics on the data set.

##  Months_On_Book   Net_Purchases     Net_Cash_Advances    Utility       
##  Min.   :  6.00   Min.   :-176.00   Min.   :  0.000   Min.   :-2.2376  
##  1st Qu.: 12.00   1st Qu.:   0.00   1st Qu.:  0.000   1st Qu.: 0.5456  
##  Median : 22.00   Median :  15.00   Median :  0.000   Median : 0.8364  
##  Mean   : 32.06   Mean   :  47.94   Mean   :  1.159   Mean   : 0.7139  
##  3rd Qu.: 43.00   3rd Qu.:  59.00   3rd Qu.:  0.000   3rd Qu.: 0.9524  
##  Max.   :164.00   Max.   : 775.00   Max.   :411.000   Max.   : 1.6396  
##  NA's   :2                                            NA's   :8        
##      Closed       
##  Min.   :0.00000  
##  1st Qu.:0.00000  
##  Median :0.00000  
##  Mean   :0.04714  
##  3rd Qu.:0.00000  
##  Max.   :1.00000  
##

Row.has.na<-apply(Raw_dat,1, function(x){any(is.na(x))})
final_dat<-Raw_dat[!Row.has.na,]#Removing ten missing values from the data set and performing summary statistics on the data
summary(final_dat)

##  Months_On_Book   Net_Purchases     Net_Cash_Advances    Utility       
##  Min.   :  6.00   Min.   :-176.00   Min.   :  0.000   Min.   :-2.2376  
##  1st Qu.: 12.00   1st Qu.:   0.00   1st Qu.:  0.000   1st Qu.: 0.5456  
##  Median : 22.00   Median :  15.00   Median :  0.000   Median : 0.8364  
##  Mean   : 32.08   Mean   :  47.92   Mean   :  1.161   Mean   : 0.7139  
##  3rd Qu.: 43.00   3rd Qu.:  59.00   3rd Qu.:  0.000   3rd Qu.: 0.9525  
##  Max.   :164.00   Max.   : 775.00   Max.   :411.000   Max.   : 1.6396  
##      Closed       
##  Min.   :0.00000  
##  1st Qu.:0.00000  
##  Median :0.00000  
##  Mean   :0.04721  
##  3rd Qu.:0.00000  
##  Max.   :1.00000

3 Exploratory Data Analysis

The scatterplot matrix Figure 3.1 shows a weak negative correlation between Closed and all the ipredictors in the dataset. However, Utility is positively correlated with Net Purchases and Net Cash Advances even though the correlations are weak. Net Purchases also has weak positive correlation with Net Cash Advances.

if (require(GGally)){#Scatterplot matrix to visualize the correlation between the variables
  final_dat%>%
    ggpairs() + ggtitle("Scatterplot matrix on the variables in the dataset") + theme(plot.title = element_text(hjust = 0.5))
}

Figure 3.1: Scatterplot matrix on the variables

Figure 3.2 is a histogram to visually inspect the distribution of the observations in each variable in the dataset. From the plots, the observations in all the variables appear not normally distributed. Also, the distribution of the two levels in the target variable are not equal as the number of customers who did not close their account with the company exceed those who closed accounts.

plotHist <- function(columns,bin,colours){#Exploratory analysis on the data set
  par(mfrow = c(3,2))   #Histogram plots to visualize the distribution of the six variables in the data set.
  for (i in columns) {
    hist(final_dat[,i], main = paste("Histogram of ", names(final_dat)[i]),
         nclass = bin, las = 1, col = colours, 
         xlab = paste(names(final_dat)[i]))
  }
}
plotHist(c(1:5), c(40,50,55,60), "brown")

Figure 3.2: Histogram of the predictors

From Figure 3.3 it appears there are outliers in all the predcitors.

if(require(ggplot2)){ #Box plots to visualize the distribution of the variables in the data set
  A<-ggplot(final_dat, aes(as.factor(Closed), Net_Cash_Advances))+ geom_boxplot(fill = c("green","skyblue"))+ stat_boxplot(geom='errorbar', linetype=1, width=0.5)+ theme_bw() + xlab("Closed")
  B<-ggplot(final_dat, aes(as.factor(Closed), Months_On_Book)) + geom_boxplot(fill = c("green","skyblue"))+ stat_boxplot(geom='errorbar', linetype=1, width=0.5)+ theme_bw() + xlab("Closed")
  C<-ggplot(final_dat, aes(as.factor(Closed), Net_Purchases))+ geom_boxplot(fill = c("green","skyblue"))+ stat_boxplot(geom='errorbar', linetype=1, width=0.5)+ theme_bw() + xlab("Closed")
  D<-ggplot(final_dat, aes(as.factor(Closed), Utility)) + geom_boxplot(fill = c("green","skyblue"))+ stat_boxplot(geom='errorbar', linetype=1, width=0.5)+ theme_bw() + xlab("Closed")
gridExtra::grid.arrange(A,B,C,D,nrow = 2, ncol = 2)
}

Figure 3.3: Boxplot of the predictors

From Figure 3.4, there appears to be negative correlation between the independent variables and the target. Also, negative observations are present in Net Purchase and Utility.

if (require(ggplot2)){#Scatterplots to visualize the relationship between the variables as well as their distributions
p1<-ggplot(final_dat, aes(Net_Cash_Advances, Closed)) + geom_point() + geom_smooth(method = lm, se = FALSE, color = "red") +  theme_bw() + xlab("Net_Cash_Advances") + ylab("Closed")
p2<-ggplot(final_dat, aes(Months_On_Book, Closed)) + geom_point() + geom_smooth(method = lm, se = FALSE, color = "red") + theme_bw() + xlab("Values of Months_On_Book") + ylab("Values of Closed")
p3<-ggplot(final_dat, aes(Net_Purchases, Closed)) + geom_point() + geom_smooth(method = lm, se = FALSE, color = "red") + theme_bw() + xlab("Values of Net_Purchases") + ylab("Values of Closed")
p4<-ggplot(final_dat, aes(Utility, Closed)) + geom_point() + geom_smooth(method = lm, se = FALSE, color = "red") + theme_bw() + xlab("Values of Utility") + ylab("Values of Closed")
gridExtra::grid.arrange(p1, p2, p3, p4, ncol = 2)
}

Figure 3.4: Scatterplots to visualize the relationship between the variables as well as their distributions

4 Feature Selection

In building the logistic model, backward and forward stepwise selection was used to select the predictors significant in making the predictions. The table below shows the final variables selected as significant at α=0.05. Three predictors; (Months_On_Book, Net_Purchases and Utility) were selected as significant from the stepwise selection. Net_Cash_Advances appeared insignificant at α=0.05 with p-value of 0.351789.

Full_model<-lm(Closed ~ Months_On_Book + Net_Purchases + Net_Cash_Advances + Utility, data= final_dat) 
Stepwise_model<-step(Full_model,direction="both",test="F")

## Start:  AIC=-19397.62
## Closed ~ Months_On_Book + Net_Purchases + Net_Cash_Advances + 
##     Utility
## 
##                     Df Sum of Sq    RSS    AIC F value    Pr(>F)    
## - Net_Cash_Advances  1    0.0384 275.92 -19399  0.8671  0.351789    
## <none>                           275.88 -19398                      
## - Months_On_Book     1    0.2312 276.11 -19394  5.2136  0.022444 *  
## - Net_Purchases      1    0.3139 276.19 -19393  7.0798  0.007816 ** 
## - Utility            1    3.6363 279.51 -19318 82.0098 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=-19398.75
## Closed ~ Months_On_Book + Net_Purchases + Utility
## 
##                     Df Sum of Sq    RSS    AIC F value  Pr(>F)    
## <none>                           275.92 -19399                    
## + Net_Cash_Advances  1    0.0384 275.88 -19398  0.8671 0.35179    
## - Months_On_Book     1    0.2303 276.15 -19396  5.1936 0.02270 *  
## - Net_Purchases      1    0.3243 276.24 -19393  7.3141 0.00686 ** 
## - Utility            1    3.6647 279.58 -19319 82.6532 < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#Stepwise selection was performed on the variables to select those that appear significant under 0.5 significant level
#Only the variable Net_Cash_Advances appeared insignificant for the analysis

The predictors selected as significant with the stepwise selection also appeared significant when a linear model was built with them.

Select_model<-lm(Closed ~ Months_On_Book + Net_Purchases + Utility, data= final_dat)
summary(Select_model)#Final model was built with the significant variables to confirm if indeed they are significant at 0.5 significant level for the analysis

## 
## Call:
## lm(formula = Closed ~ Months_On_Book + Net_Purchases + Utility, 
##     data = final_dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.25746 -0.05715 -0.03904 -0.02820  1.02012 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.104e-01  7.370e-03  14.975  < 2e-16 ***
## Months_On_Book -2.214e-04  9.714e-05  -2.279  0.02270 *  
## Net_Purchases  -8.846e-05  3.271e-05  -2.704  0.00686 ** 
## Utility        -7.258e-02  7.983e-03  -9.091  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2106 on 6223 degrees of freedom
## Multiple R-squared:  0.015,  Adjusted R-squared:  0.01453 
## F-statistic: 31.59 on 3 and 6223 DF,  p-value: < 2.2e-16

5 Data Splitting

final_df<-final_dat[,-3]#Removing the insignificant variable (Net_Cash_Advances) from the dataset
target<-final_df[,"Closed"]
predictors<-final_df[,-4]

The logistic regression model was built with the new data set containing the 3 significant predictors and the target variable (Closed). The data was split randomly at 70% (4359 observations) as the train data and 30% (1868) as the test data.

set.seed(702)
#The data set was split randomly into 30% test data and 70% train data
trainRows<- caret::createDataPartition(target,p=.70,list = FALSE)
trainPredictors<-predictors[trainRows,]
trainTarget<-target[trainRows]
testPredictors<-predictors[-trainRows,]
testTarget<-target[-trainRows]

6 Model Building

Logistic regression model was built on the train data and the model performance evaluated on the same train data as well as the test data

glm_Fit<-glm(trainTarget ~.,data=trainPredictors,family = "binomial")  #Logistic model with the function glm performed on the train data
summary(glm_Fit)

## 
## Call:
## glm(formula = trainTarget ~ ., family = "binomial", data = trainPredictors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3531  -0.3337  -0.2895  -0.2573   3.0559  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -1.878196   0.156860 -11.974  < 2e-16 ***
## Months_On_Book -0.005045   0.002689  -1.876   0.0606 .  
## Net_Purchases  -0.002837   0.001126  -2.520   0.0117 *  
## Utility        -1.230162   0.175095  -7.026 2.13e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1736.8  on 4358  degrees of freedom
## Residual deviance: 1679.5  on 4355  degrees of freedom
## AIC: 1687.5
## 
## Number of Fisher Scoring iterations: 6

The Mean Absolute Error associated with model on the train data is approximately 0.09. This implies that the difference between the actual values and the model’s predicted values is pretty small. Hence, the model was good at the prediction. The values of other parameters for assessing the model’s performance such as RMSE and Rsquared are also low. This is expected as the same data used in building the model was used for the prediction.

train_pred<-predict(glm_Fit,trainPredictors,type = "response")#validating the model with the train data
glm_values<-data.frame(obs=trainTarget,
                       pred=train_pred)
caret::defaultSummary(glm_values)

##       RMSE   Rsquared        MAE 
## 0.21692841 0.01391616 0.09398750

#Id=seq(1:4359)
train_results<-tibble(trainTarget,train_pred)#Generating the scoring data for the logistic model
sort_dat<-(train_results[order(train_results[,2],decreasing = T),])
DT::datatable(sort_dat,filter = "top",caption="Table.1: Predicted verses Actual values in the train Data (Logistic Model)")

#write.csv(Don,"Logistics_Scoring.csv")

if(require(ROCR)){trainpred<-prediction(sort_dat$train_pred,sort_dat$trainTarget)}
if(require('ROCR')){trainperf<-performance(trainpred,"tpr","fpr")}

train_auc<-performance(trainpred, measure = "auc")   #Computes the area under the ROC curve     
print(train_auc@y.values)

## [[1]]
## [1] 0.6520592

test_pred<-predict(glm_Fit,testPredictors,type = "response")#Validating the model on the test data
testTarget_testPred<-data.frame(obs=testTarget,
                       pred=test_pred)
caret::defaultSummary(testTarget_testPred)

##       RMSE   Rsquared        MAE 
## 0.19453278 0.02064137 0.08460465

 #The Root Means Square Error associated with the model is 20.76%

#Id=seq(1:1868)
test_results<-tibble(testTarget,test_pred)             #Generating the scoring data for the logistic model
sort_test_results<-(test_results[order(test_results[,2],decreasing = T),])
DT::datatable(sort_test_results,filter = "top",caption="Table.2: Predicted verses Actual values in the test Data (Logistic Model)")

#write.csv(Don,"Logistics_Scoring.csv")

if(require(ROCR)){testpred<-prediction(sort_test_results$test_pred,sort_test_results$testTarget)}
if(require('ROCR')){testperf<-performance(testpred,"tpr","fpr")}                    #Generate a plot of True Positive Rate vs False Positive Rate from the logistic model
glm_test_auc<-performance(testpred, measure = "auc")   #Computes the area under the ROC curve     
print(glm_test_auc@y.values)

## [[1]]
## [1] 0.6440974

plot(unlist(slot(trainperf,"x.values")),unlist(slot(trainperf,"y.values")),ylab="True Positive Rate",xlab="False Positive Rate",type="l",col="red")
points(unlist(slot(testperf, 'x.values')),unlist(slot(testperf,'y.values')),type="l",col="blue")
abline(a=0,b=1,col="green")
legend("topleft", legend=c("Train_AUC", "Test_AUC","Baseline_AUC"),
       col=c("red", "blue","green"), lty=1:3, cex=0.6)
legend("bottomright", legend=c("Train_AUC=0.65", "Test_AUC=0.64"),
       col=c("red", "blue"), lty=1:2, cex=0.6)

From the above results, the model performance on the train data and the test data is comparable. The Area under ROC (AUROC) curve for the train data (0.65) and the test data (0.64) is comparable likewise their MAE, RMSE and Rsquared values.

if(require(gains)){Gains<-gains(testTarget,test_pred)}#Computing gains for the model prediction on the test data
Gains

## Depth                            Cume   Cume Pct                     Mean
##  of           Cume     Mean      Mean   of Total    Lift   Cume     Model
## File     N      N      Resp      Resp      Resp    Index   Lift     Score
## -------------------------------------------------------------------------
##   10   186    186      0.12      0.12      29.3%     295    295      0.11
##   20   187    373      0.06      0.09      44.0%     147    220      0.08
##   30   187    560      0.03      0.07      52.0%      80    173      0.06
##   40   187    747      0.03      0.06      60.0%      80    150      0.05
##   50   187    934      0.04      0.06      69.3%      93    139      0.04
##   60   186   1120      0.01      0.05      72.0%      27    120      0.04
##   70   187   1307      0.03      0.05      78.7%      67    112      0.04
##   80   187   1494      0.04      0.04      89.3%     107    112      0.03
##   90   187   1681      0.01      0.04      92.0%      27    102      0.03
##  100   187   1868      0.03      0.04     100.0%      80    100      0.02

par(mfrow=c(1,3))#A plot of the Mean Response and the Deciles 
plot(Gains,ylim = c(0,0.1),xlim=c(0,100))

barplot(Gains$mean.resp, names.arg = Gains$depth, xlab = "Percentile", #A barplot of the Means Response and the Deciles
        ylab = "Mean Response", main = "Decile-wise lift chart")
barplot(Gains$cume.pct.of.total, names.arg = Gains$depth, xlab = "Percentile", #A barplot of the cummulative percentage of the total response vs the deciles
        ylab = "Cum(%_Total_Response)", main = "Decile-wise lift chart")

Figure 6.1: Plots on the logistic Gains table

From Figure 6.1 and the Gains table, it is observed that Depth of file of 80 which has cumulative observation of 1494 out of the 1868 has significant cumulative percentage of total response (89.3%). This implies that in case of predicting the customers with the probability of closing their account, if we target 80% of the customers (1494) from the predictions made by the model, approximately 89% percent of them will close their account. Thus; approximately 1330 out of the 1494 customers will close their account.

if(require('ROCR')){Chart=performance(testpred,"tpr","rpp")} 
#A plot of the gain chart (A plot of the True positive rate vs the Rate of positive predictions)
plot(x=c(0, 1), y=c(0, 1), type="l", col="red", lwd=2,
     ylab="True Positive Rate", 
     xlab="Rate of Positive Predictions")
lines(x=c(0, 0.5, 1), y=c(0, 1, 1), col="darkgreen", lwd=2)

gain.x = unlist(slot(Chart, 'x.values'))
gain.y = unlist(slot(Chart, 'y.values'))
lines(x=gain.x, y=gain.y, col="orange", lwd=2)
legend("bottomright", legend=c("Logistic Model", "Optimal Model","Random Model"),
       col=c("yellow", "green","red"), lty=1:3, cex=0.6)

Figure 6.2: A plot of the True positive rate vs the Rate of positive predictions

The tree regression model was built with all four predictors and the target variable (Closed). The dataset was split randomly at 70% (4359 observations) as the train data and 30% (1868) as the test data.

set.seed(551)
treeTarget<-final_dat$Closed
treePredictors<-final_dat[,-5]
trainingRows<-caret::createDataPartition(treeTarget,p=.7,list=F)
trainX<-treePredictors[trainingRows, ]
trainY<-treeTarget[trainingRows]
testX<-treePredictors[-trainingRows, ]
testY<-treeTarget[-trainingRows]#Randomly splitting the full dataset into 70% train data and 30% test data for the tree model

The model was built on the train data and the model performance evaluated on the same train data as well as the test data. The performance of the model on the train and the test data was compared by computing their ROC curves and generating the area under each curve (auc). As shown in Figure 6.4, the performance of the model on the train data and the test data is comparable. Also, the root means square error (RMSE) and mean absolute error associated with the model on the train and test data is comparable.

tree_model<-rpart(trainY ~ .,data= trainX,minbucket=50,cp=0.002)
summary(tree_model)#Building the tree model on the train data and computing summary statistics on the model. The MSE associated with the model on the train data is 21.4%

## Call:
## rpart(formula = trainY ~ ., data = trainX, minbucket = 50, cp = 0.002)
##   n= 4359 
## 
##            CP nsplit rel error    xerror       xstd
## 1 0.015647683      0 1.0000000 1.0006663 0.06449341
## 2 0.005557676      1 0.9843523 0.9917148 0.06317128
## 3 0.003084378      2 0.9787946 0.9918413 0.06297281
## 4 0.002000000      3 0.9757103 0.9964770 0.06309603
## 
## Variable importance
##        Utility Months_On_Book  Net_Purchases 
##             64             35              1 
## 
## Node number 1: 4359 observations,    complexity param=0.01564768
##   mean=0.04748796, MSE=0.04523285 
##   left son=2 (3929 obs) right son=3 (430 obs)
##   Primary splits:
##       Utility           < 0.09031333 to the right, improve=0.0156476800, (0 missing)
##       Net_Purchases     < 31.5       to the right, improve=0.0041701390, (0 missing)
##       Months_On_Book    < 72.5       to the right, improve=0.0018573750, (0 missing)
##       Net_Cash_Advances < 11         to the right, improve=0.0002727452, (0 missing)
##   Surrogate splits:
##       Net_Purchases < -110       to the right, agree=0.902, adj=0.005, (0 split)
## 
## Node number 2: 3929 observations
##   mean=0.03868669, MSE=0.03719003 
## 
## Node number 3: 430 observations,    complexity param=0.005557676
##   mean=0.127907, MSE=0.1115468 
##   left son=6 (114 obs) right son=7 (316 obs)
##   Primary splits:
##       Months_On_Book < 50.5       to the right, improve=0.022845910, (0 missing)
##       Net_Purchases  < 98.5       to the right, improve=0.014984480, (0 missing)
##       Utility        < 0.02384    to the right, improve=0.005026909, (0 missing)
## 
## Node number 6: 114 observations
##   mean=0.04385965, MSE=0.04193598 
## 
## Node number 7: 316 observations,    complexity param=0.003084378
##   mean=0.1582278, MSE=0.1331918 
##   left son=14 (125 obs) right son=15 (191 obs)
##   Primary splits:
##       Months_On_Book < 14.5       to the left,  improve=0.01444920, (0 missing)
##       Net_Purchases  < 77         to the right, improve=0.01361696, (0 missing)
##       Utility        < 0.02556364 to the right, improve=0.00963202, (0 missing)
##   Surrogate splits:
##       Utility           < 0.02671333 to the right, agree=0.630, adj=0.064, (0 split)
##       Net_Purchases     < 211.5      to the right, agree=0.617, adj=0.032, (0 split)
##       Net_Cash_Advances < 11         to the right, agree=0.611, adj=0.016, (0 split)
## 
## Node number 14: 125 observations
##   mean=0.104, MSE=0.093184 
## 
## Node number 15: 191 observations
##   mean=0.1937173, MSE=0.1561909

if(require(rattle)){fancyRpartPlot(tree_model,main="TREE MODEL ON DATASET")} #A plot of a tree on the tree model

Figure 6.3: Tree model on the dataset

Figure 6.3 shows that the predictors used by the tree model in predicting whether a customer will close the account with the company or not is Utility and Months_On_Book.

treeTrain_pred<-predict(tree_model, newdata =trainX) 
trainX_pred<-data.frame(obs=trainY,
                       pred=treeTrain_pred)
caret::defaultSummary(trainX_pred)#Validating the performance of the tree model on the test data

##       RMSE   Rsquared        MAE 
## 0.21008131 0.02428974 0.08826831

treeTrain_results<-tibble(trainY,treeTrain_pred)             #Generating the scoring data for the Tree model
sort_treeTrain_results<-(treeTrain_results[order(treeTrain_results[,2],decreasing = T),])
DT::datatable(sort_treeTrain_results,filter = "top",caption="Table.3: Predicted verses Actual values in the train Data (Tree Model)")

if(require(ROCR)){tree_train_pred<-prediction(sort_treeTrain_results$treeTrain_pred,sort_treeTrain_results$trainY)}
if(require('ROCR')){tree_train_perf<-performance(tree_train_pred,"tpr","fpr")}                     #Generate a plot of True Positive Rate vs False Positive Rate from the tree model

tree_Train_auc<-performance(tree_train_pred, measure = "auc")   #Computes the area under the ROC curve     
print(tree_Train_auc@y.values)

## [[1]]
## [1] 0.5913343

Treetest_pred<-predict(tree_model,newdata=testX)#Validating the model on the test data
treeTarget_pred<-data.frame(obs=testY,
                       pred=Treetest_pred)
caret::defaultSummary(treeTarget_pred)

##        RMSE    Rsquared         MAE 
## 0.210551063 0.007297368 0.088529872

treeTest_results<-tibble(testY,Treetest_pred)             #Generating the scoring data on the test data for the Tree model
sort_treeTest_Results<-(treeTest_results[order(treeTest_results[,2],decreasing = T),])
DT::datatable(sort_treeTest_Results,filter = "top",caption="Table.4: Predicted verses Actual values in the test Data (Tree Model)")

#write.csv(Kutus,"Tree_Scoring.csv")

if(require(ROCR)){tree_test_pred<-prediction(sort_treeTest_Results$Treetest_pred,sort_treeTest_Results$testY)}
if(require('ROCR')){tree_test_perf<-performance(tree_test_pred,"tpr","fpr")}

tree_Test_auc<-performance(tree_test_pred, measure = "auc")   #Computes the area under the ROC curve     
print(tree_Test_auc@y.values)

## [[1]]
## [1] 0.5819345

plot(unlist(slot(tree_train_perf,"x.values")),unlist(slot(tree_train_perf,"y.values")),ylab="True Positive Rate",xlab="False Positive Rate",type="l",col="red")
points(unlist(slot(tree_test_perf, 'x.values')),unlist(slot(tree_test_perf,'y.values')),type="l",col="blue")
abline(a=0,b=1,col="green")
legend("topleft", legend=c("Train_ROC", "Test_ROC","Baseline_ROC"),
       col=c("red", "blue","green"), lty=1:3, cex=0.6)
legend("bottomright", legend=c("Train_AUC=0.59", "Test_AUC=0.58"),
       col=c("red", "blue"), lty=1:2, cex=0.6)

Figure 6.4: Plots on the tree Gains table

if(require(gains)){treegains<-gains(testY,Treetest_pred)}
treegains

## Depth                            Cume   Cume Pct                     Mean
##  of           Cume     Mean      Mean   of Total    Lift   Cume     Model
## File     N      N      Resp      Resp      Resp    Index   Lift     Score
## -------------------------------------------------------------------------
##    5    87     87      0.11      0.11      11.5%     247    247      0.19
##    7    45    132      0.11      0.11      17.2%     239    244      0.10
##   10    46    178      0.15      0.12      25.3%     327    265      0.04
##  100  1690   1868      0.04      0.05     100.0%      83    100      0.04

par(mfrow=c(1,3))                               #A plot of the Mean Response and the Deciles 
plot(Gains,ylim = c(0,0.1),xlim=c(0,100))
barplot(treegains$mean.resp, names.arg = treegains$depth, xlab = "Percentile", #A barplot of the Means Response and the Deciles
        ylab = "Mean Response", main = "Decile-wise lift chart")
barplot(treegains$cume.pct.of.total, names.arg = treegains$depth, xlab = "Percentile", #A barplot of the cummulative percentage of the total response vs the deciles
        ylab = "Cum(%_Total_Response)", main = "Decile-wise lift chart")

Figure 6.5: Plots on the Tree Gains table

From Figure 6.5 and the Gains table, it is observed that Depth of file of 10 which has cumulative observation of 178 out of the 1868 has significant cumulative percentage of total response (25.3%). This implies that in case of predicting the customers with the probability of closing their account, if we target 10% of the customers (178) from the predictions made by the tree model, approximately 25% percent of them will close their account. Thus; approximately 45 out of the 178 customers will close their account.

plot(unlist(slot(testperf, 'x.values')),unlist(slot(testperf,'y.values')),ylab="True Positive Rate",xlab="False Positive Rate",type="l",col="blue")
points(unlist(slot(tree_test_perf,"x.values")),unlist(slot(tree_test_perf,"y.values")),type="l",col="red")
abline(a=0,b=1,col="brown")
legend("topleft", legend=c("Logistics Model", "Tree Model","Random Model"),
       col=c("blue", "red","brown"), lty=1:3, cex=0.6)
legend("bottomright", legend=c("Logistics Model=0.64", "Tree Model=0.58","Random Model"),
       col=c("blue", "red"), lty=1:3, cex=0.6)

Figure 6.6: AUROC curves for the Logistics and Tree model

As shown in Figure 6.6, the logistics regression model with AUC value of 0.64 performed better in predicting whether a customer will close the account with the company or not given the predictor variables in the data set.

ANALYSIS ON CUSTOMER RETENTION DATASET

Samuel Adjei