DA6813 Case Study 4

Tasks:

1) Use acquisitionRetention data set to predict which customers will be acquired and for how long (duration) based on a feature set using a random forest.

The first step is exploratory data analysis.

What is the structure of the

str(df1)

## 'data.frame':    500 obs. of  15 variables:
##  $ customer   : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ acquisition: num  1 1 1 0 1 1 1 1 0 0 ...
##  $ duration   : num  1635 1039 1288 0 1631 ...
##  $ profit     : num  6134 3524 4081 -638 5446 ...
##  $ acq_exp    : num  694 460 249 638 589 ...
##  $ ret_exp    : num  972 450 805 0 920 ...
##  $ acq_exp_sq : num  480998 211628 62016 407644 346897 ...
##  $ ret_exp_sq : num  943929 202077 648089 0 846106 ...
##  $ freq       : num  6 11 21 0 2 7 15 13 0 0 ...
##  $ freq_sq    : num  36 121 441 0 4 49 225 169 0 0 ...
##  $ crossbuy   : num  5 6 6 0 9 4 5 5 0 0 ...
##  $ sow        : num  95 22 90 0 80 48 51 23 0 0 ...
##  $ industry   : num  1 0 0 0 0 1 0 1 0 1 ...
##  $ revenue    : num  47.2 45.1 29.1 40.6 48.7 ...
##  $ employees  : num  898 686 1423 181 631 ...

We have 500 observations across 15 variables. We will be working with two target variables. First, acquisition, and industry which will be factor variables. Second, duration, which will remain numerical. Customer is not useful to us, so I see no reason to keep it.

#Eliminate the variable customer, it is not useful in this analysis.
df1$customer <- NULL

Let’s get a summary of the data.

summary(df1)

##   acquisition       duration          profit           acq_exp       
##  Min.   :0.000   Min.   :   0.0   Min.   :-1027.0   Min.   :   1.21  
##  1st Qu.:0.000   1st Qu.:   0.0   1st Qu.: -316.3   1st Qu.: 384.14  
##  Median :1.000   Median : 957.5   Median : 3369.9   Median : 491.66  
##  Mean   :0.676   Mean   : 742.5   Mean   : 2403.8   Mean   : 493.35  
##  3rd Qu.:1.000   3rd Qu.:1146.2   3rd Qu.: 3931.6   3rd Qu.: 600.21  
##  Max.   :1.000   Max.   :1673.0   Max.   : 6134.3   Max.   :1027.04  
##     ret_exp         acq_exp_sq          ret_exp_sq           freq      
##  Min.   :   0.0   Min.   :      1.5   Min.   :      0   Min.   : 0.00  
##  1st Qu.:   0.0   1st Qu.: 147562.0   1st Qu.:      0   1st Qu.: 0.00  
##  Median : 398.1   Median : 241729.7   Median : 158480   Median : 6.00  
##  Mean   : 336.3   Mean   : 271211.1   Mean   : 184000   Mean   : 6.22  
##  3rd Qu.: 514.3   3rd Qu.: 360246.0   3rd Qu.: 264466   3rd Qu.:11.00  
##  Max.   :1095.0   Max.   :1054811.2   Max.   :1198937   Max.   :21.00  
##     freq_sq          crossbuy           sow            industry    
##  Min.   :  0.00   Min.   : 0.000   Min.   :  0.00   Min.   :0.000  
##  1st Qu.:  0.00   1st Qu.: 0.000   1st Qu.:  0.00   1st Qu.:0.000  
##  Median : 36.00   Median : 5.000   Median : 44.00   Median :1.000  
##  Mean   : 69.25   Mean   : 4.052   Mean   : 38.88   Mean   :0.522  
##  3rd Qu.:121.00   3rd Qu.: 7.000   3rd Qu.: 66.00   3rd Qu.:1.000  
##  Max.   :441.00   Max.   :11.000   Max.   :116.00   Max.   :1.000  
##     revenue        employees     
##  Min.   :14.49   Min.   :  18.0  
##  1st Qu.:33.53   1st Qu.: 503.0  
##  Median :41.43   Median : 657.5  
##  Mean   :40.54   Mean   : 671.5  
##  3rd Qu.:47.52   3rd Qu.: 826.0  
##  Max.   :65.10   Max.   :1461.0

I see a lot of 0 values in these variables. This will cause problems if we do something like log transformation.

Are there any NA values?

sum(is.na(df1))

## [1] 0

There are no NA values in the dataset.

Let’s get a look at whether any variables are correlated with one another, and if that could cause problems for our classification of acquisition.

chart.Correlation(df1, histogram = TRUE, pch=19)

Wow. A lot of highly correlated variables. Here’s the list:

duration
profit
ret_exp
acq_exp_sq
ret_exp_sq
freq
freq_sq
crossbuy
sow

Can these really throw off the value for acquisition? Box plots may be of use here.

par(mfrow=c(3,3))
boxplot(duration ~ acquisition, data=df1, ylab='duration', xlab='acquisition', col='#FF0000')
boxplot(profit ~ acquisition, data=df1, ylab='profit', xlab='acquisition', col='#FF3300')
boxplot(ret_exp ~ acquisition, data=df1, ylab='ret_exp', xlab='acquisition', col='#CC9933')
boxplot(acq_exp_sq ~ acquisition, data=df1, ylab='acq_exp_sq', xlab='acquisition', col='#33CC00')
boxplot(ret_exp_sq ~ acquisition, data=df1, ylab='ret_exp_sq', xlab='acquisition', col='#99CCFF')
boxplot(freq ~ acquisition, data=df1, ylab='freq', xlab='acquisition', col='#0000CC')
boxplot(freq_sq ~ acquisition, data=df1, ylab='freq_sq', xlab='acquisition', col='#9933CC')
boxplot(crossbuy ~ acquisition, data=df1, ylab='crossbuy', xlab='acquisition', col='#9900FF')
boxplot(sow ~ acquisition, data=df1, ylab='sow', xlab='acquisition', col='#6600FF')

par(mfrow=c(1,1))
df1$acquisition <- as.factor(df1$acquisition)
df1$industry <- as.factor(df1$industry)

We see that every one of these variables, with the exception of acq_exp_sq, will cause acquisition to flip to zero if that particular variable is equal to zero, or is negative (as in the case of profit). None of these will be in the final model. I’m going to exclude acq_exp_sq because it is merely a square of acq_exp.

summary(df1)

##  acquisition    duration          profit           acq_exp       
##  0:162       Min.   :   0.0   Min.   :-1027.0   Min.   :   1.21  
##  1:338       1st Qu.:   0.0   1st Qu.: -316.3   1st Qu.: 384.14  
##              Median : 957.5   Median : 3369.9   Median : 491.66  
##              Mean   : 742.5   Mean   : 2403.8   Mean   : 493.35  
##              3rd Qu.:1146.2   3rd Qu.: 3931.6   3rd Qu.: 600.21  
##              Max.   :1673.0   Max.   : 6134.3   Max.   :1027.04  
##     ret_exp         acq_exp_sq          ret_exp_sq           freq      
##  Min.   :   0.0   Min.   :      1.5   Min.   :      0   Min.   : 0.00  
##  1st Qu.:   0.0   1st Qu.: 147562.0   1st Qu.:      0   1st Qu.: 0.00  
##  Median : 398.1   Median : 241729.7   Median : 158480   Median : 6.00  
##  Mean   : 336.3   Mean   : 271211.1   Mean   : 184000   Mean   : 6.22  
##  3rd Qu.: 514.3   3rd Qu.: 360246.0   3rd Qu.: 264466   3rd Qu.:11.00  
##  Max.   :1095.0   Max.   :1054811.2   Max.   :1198937   Max.   :21.00  
##     freq_sq          crossbuy           sow         industry    revenue     
##  Min.   :  0.00   Min.   : 0.000   Min.   :  0.00   0:239    Min.   :14.49  
##  1st Qu.:  0.00   1st Qu.: 0.000   1st Qu.:  0.00   1:261    1st Qu.:33.53  
##  Median : 36.00   Median : 5.000   Median : 44.00            Median :41.43  
##  Mean   : 69.25   Mean   : 4.052   Mean   : 38.88            Mean   :40.54  
##  3rd Qu.:121.00   3rd Qu.: 7.000   3rd Qu.: 66.00            3rd Qu.:47.52  
##  Max.   :441.00   Max.   :11.000   Max.   :116.00            Max.   :65.10  
##    employees     
##  Min.   :  18.0  
##  1st Qu.: 503.0  
##  Median : 657.5  
##  Mean   : 671.5  
##  3rd Qu.: 826.0  
##  Max.   :1461.0

The final model is quite simple:

acquisition ~ acq_exp + industry + revenue + employees

Let’s create a random forest. Since I’m using CARET, I need a method of adjusting the trainControl.

set.seed(1)
#Create a traincontrol for model validation
TrnCtrl <- trainControl(method = 'repeatedcv', number = 10, repeats = 2)

I’m going to use an 80/20 split on the data for this case study.

set.seed(1)
inTrain <- createDataPartition(y = df1$acquisition, p=0.8, list=FALSE)
df1.train <- df1[inTrain, ]
df1.test <- df1[-inTrain, ]

set.seed(1)
RF1 <- train(acquisition ~ acq_exp + industry + revenue + employees, data = df1.train, method='rf', trControl = TrnCtrl, importance=TRUE)
#RF1$finalModel

RF1acq.preds <- predict(RF1, df1.test)
confusionMatrix(RF1acq.preds, df1.test$acquisition)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 19 10
##          1 13 57
##                                           
##                Accuracy : 0.7677          
##                  95% CI : (0.6721, 0.8467)
##     No Information Rate : 0.6768          
##     P-Value [Acc > NIR] : 0.03125         
##                                           
##                   Kappa : 0.4557          
##                                           
##  Mcnemar's Test P-Value : 0.67666         
##                                           
##             Sensitivity : 0.5938          
##             Specificity : 0.8507          
##          Pos Pred Value : 0.6552          
##          Neg Pred Value : 0.8143          
##              Prevalence : 0.3232          
##          Detection Rate : 0.1919          
##    Detection Prevalence : 0.2929          
##       Balanced Accuracy : 0.7222          
##                                           
##        'Positive' Class : 0               
##

This random forest has an overall predictive accuracy of 76.8% on the small hold-out test sample.

We now apply this model upon the full df1 dataset in preparation for answering the 2nd half of the question.

RF1acq.preds.full <- predict(RF1, df1)

We now have to address the 2nd half of the question:

…for how long (duration) based on a feature set using a random forest.

OK, my understanding is that we need to build a regression model on the data where the RF1 model had predicted acquisition of 1. Since we have a small test set of only 99 observations, and from there an even smaller dataset where RF1 had predicted acquisition of 1. I’m going to take the RF1 predictions and add them into the dataframe of df1.test, and call it df2.

#df2 includes the original acquisition, but also the predicted acquisitions from the CARET RF1 model
df2 <- cbind(df1, RF1acq.preds.full)

Now I’m going to create a third dataframe that includes only the predicted acquisitions from RF1 as well as the actual acquisitions from the dataframe.

#Need to do this with both acq.preds = 1 and acq = 1
df3 <- subset(df2, RF1acq.preds.full == "1" & acquisition == "1")
#Let's take a look at the new dataset
summary(df3)

##  acquisition    duration        profit        acq_exp         ret_exp      
##  0:  0       Min.   : 654   Min.   :1839   Min.   :151.0   Min.   : 229.9  
##  1:328       1st Qu.: 956   1st Qu.:3365   1st Qu.:400.9   1st Qu.: 396.8  
##              Median :1073   Median :3718   Median :493.4   Median : 466.2  
##              Mean   :1101   Mean   :3805   Mean   :492.7   Mean   : 499.8  
##              3rd Qu.:1238   3rd Qu.:4160   3rd Qu.:579.6   3rd Qu.: 572.3  
##              Max.   :1673   Max.   :6134   Max.   :864.1   Max.   :1095.0  
##    acq_exp_sq       ret_exp_sq           freq           freq_sq     
##  Min.   : 22813   Min.   :  52840   Min.   : 1.000   Min.   :  1.0  
##  1st Qu.:160755   1st Qu.: 157439   1st Qu.: 6.000   1st Qu.: 36.0  
##  Median :243424   Median : 217329   Median : 9.000   Median : 81.0  
##  Mean   :261326   Mean   : 274822   Mean   : 9.186   Mean   :102.1  
##  3rd Qu.:335971   3rd Qu.: 327513   3rd Qu.:12.000   3rd Qu.:144.0  
##  Max.   :746669   Max.   :1198937   Max.   :21.000   Max.   :441.0  
##     crossbuy           sow         industry    revenue        employees     
##  Min.   : 1.000   Min.   :  1.00   0:126    Min.   :15.86   Min.   :  24.0  
##  1st Qu.: 5.000   1st Qu.: 44.00   1:202    1st Qu.:35.74   1st Qu.: 606.0  
##  Median : 6.000   Median : 58.50            Median :43.26   Median : 751.5  
##  Mean   : 6.006   Mean   : 57.67            Mean   :42.26   Mean   : 757.4  
##  3rd Qu.: 7.000   3rd Qu.: 71.00            3rd Qu.:48.12   3rd Qu.: 902.5  
##  Max.   :11.000   Max.   :116.00            Max.   :65.10   Max.   :1461.0  
##  RF1acq.preds.full
##  0:  0            
##  1:328            
##                   
##                   
##                   
##

We’re now down to 328 observations.

Since we’re now going to be doing RF-based regression on the variable duration, let’s take a look at a correlation plot and evaluate the variance inflation factors of the variables we’re going to use (the numeric variables, basically).

chart.Correlation(df3[ ,c(2:5,8,10,11,13,14)], histogram=TRUE, pch=19)

Variables I’m removing ahead of time: acquisition, acq_exp_sq, ret_exp_sq, acq.preds, freq_sq.

Let’s take a look at a generic GLM and evaluate the VIF.

set.seed(1)
glm.viftest <- glm(duration ~ profit + acq_exp + ret_exp + freq + crossbuy + sow + industry + revenue + employees, data = df3)
vif(glm.viftest)

##    profit   acq_exp   ret_exp      freq  crossbuy       sow  industry   revenue 
## 34.904075  7.688180 26.856120  1.185955  1.114966  1.144093  1.159454  1.104746 
## employees 
##  1.270974

Removing the variable with highest VIF, profit.

set.seed(1)
glm.viftest <- glm(duration ~ acq_exp + ret_exp + freq + crossbuy + sow + industry + revenue + employees, data = df3)
vif(glm.viftest)

##   acq_exp   ret_exp      freq  crossbuy       sow  industry   revenue employees 
##  1.020728  1.024403  1.024951  1.017323  1.015630  1.033686  1.030283  1.038938

We’ve got our final model: duration ~ acq_exp + ret_exp + freq + crossbuy + sow + industry + revenue + employees

Now we run the random forest again for regression, but this time using the rfsrc package, which gives us more tuning abilities than CARET.

set.seed(1)

tuner <- expand.grid(mtry = c(1:10))

RF2 <- rfsrc(duration ~ acq_exp + ret_exp + freq + crossbuy + sow + industry + revenue + employees, data = df3, tuneGrid = tuner, importance = TRUE, ntree = 1000)
#gather up the predicted durations based on the RF2 regression model
duration.preds <- predict(RF2, df3)$predicted
#roll the duration.preds into ANOTHER dataframe
df4 <- cbind(df3, duration.preds)

Here’s a comparison between the actual duration and predicted durations:

#Get totals for "actual" durations and predicted durations

mean.actual.duration <- mean(df4$duration)
mean.predicted.duration <- mean(df4$duration.preds)

median.actual.duration <- median(df4$duration)
median.predicted.duration <- median(df4$duration.preds)

#sum.actual.duration <- sum(df4$duration) #Probably not necessary, commenting out
#sum.predicted.duration <- sum(df4$duration.preds) #Probably not necessary, commenting out

cat('Mean of actual duration: ', mean.actual.duration)

## Mean of actual duration:  1101.381

cat('\nMean of predicted duration: ', mean.predicted.duration)

## 
## Mean of predicted duration:  1103.555

cat('\n\nMedian of actual duration: ', median.actual.duration)

## 
## 
## Median of actual duration:  1073

cat('\nMedian of predicted duration: ', median.predicted.duration)

## 
## Median of predicted duration:  1073.649

The untuned random forest predictions are a little high on both mean and median, but not by much.

2) Compute variable importance to detect interactions and optimize hyperparameters for acquired customers.

var.imp <- RF2$importance
var.imp

##     acq_exp     ret_exp        freq    crossbuy         sow    industry 
##  -137.53478 53425.06073  1515.00650    17.84416   -64.14785    69.38474 
##     revenue   employees 
##    13.73018    96.43545

Let’s use that awesome graphing script the instructor demonstrated and see what these look like:

data.frame(importance = RF2$importance) %>%
  tibble::rownames_to_column(var = "variable") %>%
  ggplot(aes(x = reorder(variable,importance), y = importance)) +
    geom_bar(stat = "identity", fill = "orange", color = "black")+
    coord_flip() +
     labs(x = "Variables", y = "Variable importance")+
     theme_nice

We saw that some of the importance variables are negative, so I’m going to add a large constant to them and then take the log, and plot the results.

log.var.imp <- log(var.imp + 200)
log.var.imp

##   acq_exp   ret_exp      freq  crossbuy       sow  industry   revenue employees 
##  4.134610 10.889772  7.447172  5.383780  4.911567  5.596141  5.364714  5.691829

data.frame(importance = log.var.imp) %>%
  tibble::rownames_to_column(var = "variable") %>%
  ggplot(aes(x = reorder(variable,importance), y = importance)) +
    geom_bar(stat = "identity", fill = "orange", color = "black")+
    coord_flip() +
     labs(x = "Variables", y = "Variable importance")+
     theme_nice

For the top four, we see ret_exp is most important, followed by freq, then employees and industry almost at the same level.

Doing some minimal depth evaluation, like in the instructor’s demonstration.

mindepth <- max.subtree(RF2, sub.order = TRUE)
print(round(mindepth$order, 3)[,1])

##   acq_exp   ret_exp      freq  crossbuy       sow  industry   revenue employees 
##     3.062     1.009     1.784     2.937     3.289     6.609     2.958     2.751

data.frame(md = round(mindepth$order, 3)[,1]) %>%
  tibble::rownames_to_column(var = "variable") %>%
  ggplot(aes(x = reorder(variable,desc(md)), y = md)) +
    geom_bar(stat = "identity", fill = "orange", color = "black", width = 0.2)+
    coord_flip() +
     labs(x = "Variables", y = "Minimal Depth")+
     theme_nice

The variable industry has the greatest minimal depth, followed by sow, then acq_exp, revenue, crossbuy, employees, freq and finally ret_exp.

Checking interactions now, again leveraging the instructor’s cool scripts.

mindepth$sub.order

##             acq_exp    ret_exp      freq  crossbuy       sow  industry
## acq_exp   0.2954469 0.38732288 0.5344080 0.7037542 0.6125871 0.8906572
## ret_exp   0.3245762 0.09474076 0.2487924 0.3847783 0.3428941 0.7556862
## freq      0.4091906 0.21599293 0.1728422 0.4876632 0.4146803 0.7738898
## crossbuy  0.5829393 0.38685813 0.5368186 0.2847095 0.5871855 0.8699448
## sow       0.6446719 0.40558923 0.5774949 0.7361772 0.3192488 0.8988987
## industry  0.8465499 0.75987618 0.8455613 0.8831286 0.8534124 0.6300179
## revenue   0.5979911 0.37545273 0.5350845 0.6777104 0.6092026 0.8735475
## employees 0.5519694 0.34273571 0.5126302 0.6606603 0.5810575 0.8705586
##             revenue employees
## acq_exp   0.6154524 0.5940329
## ret_exp   0.3391213 0.3101321
## freq      0.4207874 0.4183681
## crossbuy  0.5956271 0.5758397
## sow       0.6330410 0.6363552
## industry  0.8527474 0.8617443
## revenue   0.2875672 0.5975773
## employees 0.5754441 0.2646985

as.matrix(mindepth$sub.order) %>%
  reshape2::melt() %>%
  data.frame() %>%
  ggplot(aes(x = Var1, y = Var2, fill = value)) +
    scale_x_discrete(position = "top") +
    geom_tile(color = "white") +
    viridis::scale_fill_viridis("Relative min. depth") +
    labs(x = "Model Variables", y = "Model Variables") +
    theme_bw()

The variable industry clearly has the greatest relative minimum depth across all the other variables. The variable that has the smallest relative minimum depth is ret_exp.

And running a cross-check with variable importance.

find.interaction(RF2, method = "vimp", importance = "permute")

## Pairing ret_exp with freq 
## Pairing ret_exp with employees 
## Pairing ret_exp with industry 
## Pairing ret_exp with crossbuy 
## Pairing ret_exp with revenue 
## Pairing ret_exp with sow 
## Pairing ret_exp with acq_exp 
## Pairing freq with employees 
## Pairing freq with industry 
## Pairing freq with crossbuy 
## Pairing freq with revenue 
## Pairing freq with sow 
## Pairing freq with acq_exp 
## Pairing employees with industry 
## Pairing employees with crossbuy 
## Pairing employees with revenue 
## Pairing employees with sow 
## Pairing employees with acq_exp 
## Pairing industry with crossbuy 
## Pairing industry with revenue 
## Pairing industry with sow 
## Pairing industry with acq_exp 
## Pairing crossbuy with revenue 
## Pairing crossbuy with sow 
## Pairing crossbuy with acq_exp 
## Pairing revenue with sow 
## Pairing revenue with acq_exp 
## Pairing sow with acq_exp 
## 
##                               Method: vimp
##                     No. of variables: 8
##            Variables sorted by VIMP?: TRUE
##    No. of variables used for pairing: 8
##     Total no. of paired interactions: 28
##             Monte Carlo replications: 1
##     Type of noising up used for VIMP: permute
## 
##                         Var 1     Var 2     Paired   Additive Difference
## ret_exp:freq       52879.4208 1471.4388 53929.7375 54350.8596  -421.1221
## ret_exp:employees  52879.4208   61.2513 53113.0540 52940.6720   172.3820
## ret_exp:industry   52879.4208   72.6574 53085.5331 52952.0782   133.4549
## ret_exp:crossbuy   52879.4208    4.9911 52980.4107 52884.4118    95.9989
## ret_exp:revenue    52879.4208  -82.5588 52964.7531 52796.8620   167.8911
## ret_exp:sow        52879.4208   -3.8879 52541.0854 52875.5328  -334.4474
## ret_exp:acq_exp    52879.4208 -169.1507 52690.1046 52710.2701   -20.1655
## freq:employees      1496.5574   60.2929  1610.9132  1556.8503    54.0629
## freq:industry       1496.5574   57.5234  1564.2299  1554.0809    10.1491
## freq:crossbuy       1496.5574   46.9774  1604.8496  1543.5349    61.3148
## freq:revenue        1496.5574  -59.2246  1566.0292  1437.3328   128.6964
## freq:sow            1496.5574   26.0465  1368.0277  1522.6040  -154.5763
## freq:acq_exp        1496.5574 -168.2621  1533.6821  1328.2954   205.3867
## employees:industry   125.5373   45.3814    46.3160   170.9187  -124.6028
## employees:crossbuy   125.5373    4.5083   197.2177   130.0456    67.1720
## employees:revenue    125.5373  -36.5848    61.3581    88.9525   -27.5945
## employees:sow        125.5373  -11.2075    39.7536   114.3298   -74.5763
## employees:acq_exp    125.5373 -156.4618   -50.0289   -30.9245   -19.1044
## industry:crossbuy     37.1442   48.1755    46.9237    85.3197   -38.3959
## industry:revenue      37.1442   22.3808   -68.3742    59.5250  -127.8991
## industry:sow          37.1442   -3.9797   -16.7353    33.1645   -49.8998
## industry:acq_exp      37.1442 -141.8307  -103.4745  -104.6865     1.2120
## crossbuy:revenue      22.9968   -5.8232   -27.4921    17.1736   -44.6657
## crossbuy:sow          22.9968   -0.0798    35.2598    22.9171    12.3428
## crossbuy:acq_exp      22.9968 -148.5177   -68.3083  -125.5208    57.2125
## revenue:sow            4.0900  -81.8618   104.5420   -77.7718   182.3137
## revenue:acq_exp        4.0900 -172.5393  -102.8654  -168.4493    65.5839
## sow:acq_exp           -3.3074 -129.6146  -249.5175  -132.9220  -116.5955

Finally, tuning the hyperparameters on the RFSRC model predicting duration for customers already acquired. This also leverages the script demonstrated in class.

set.seed(1)
# Establish a list of possible values for hyper-parameters
mtry.values <- seq(2,6,1)
nodesize.values <- seq(2,8,2)
ntree.values <- seq(1e3,6e3,500)

# Create a data frame containing all combinations 
hyper_grid <- expand.grid(mtry = mtry.values, nodesize = nodesize.values, ntree = ntree.values)

# Create an empty vector to store OOB error values
oob_err <- c()

# Write a loop over the rows of hyper_grid to train the grid of models
for (i in 1:nrow(hyper_grid)) {

    # Train a Random Forest model
   model <- rfsrc(duration ~ acq_exp +
                    ret_exp +
                    freq +
                    crossbuy +
                    sow +
                    industry +
                    revenue +
                    employees,
                    data = df3,
                    mtry = hyper_grid$mtry[i],
                    nodesize = hyper_grid$nodesize[i],
                    ntree = hyper_grid$ntree[i])  
  
                          
    # Store OOB error for the model                      
    oob_err[i] <- model$err.rate[length(model$err.rate)]
}

# Identify optimal set of hyperparmeters based on OOB error
opt_i <- which.min(oob_err)
print(hyper_grid[opt_i,])

##    mtry nodesize ntree
## 65    6        2  2500

OK, that’s well and good, so let’s see if this tuned RF has improved performance.

set.seed(1)

RF2.tuned <- rfsrc(duration ~ acq_exp + ret_exp + freq + crossbuy + sow + industry + revenue + employees, data = df4, mtry = 6, nodesize = 2, ntree = 2500, importance = TRUE)
#gather up the predicted durations based on the RF2.tuned regression model
duration.preds.tuned <- predict(RF2.tuned, df4)$predicted
#roll the duration.preds into yet ANOTHER dataframe, df5 this time, for the tuned model
df5 <- cbind(df4, duration.preds.tuned)

#Now let us compare the performance of the standard vs. tuned random forest regression predictions

#compare the mean values
mean.actual.duration <- mean(df5$duration)
mean.predicted.duration <- mean(df5$duration.preds)
mean.predicted.duration.tuned <- mean(df5$duration.preds.tuned)

#compare the median values
median.actual.duration <- median(df5$duration)
median.predicted.duration <- median(df5$duration.preds)
median.predicted.duration.tuned <- median(df5$duration.preds.tuned)

#compare the MSE values
MSE.predicted.duration <- mean((df5$duration.preds - df5$duration)^2)
MSE.predicted.duration.tuned <- mean((df5$duration.preds.tuned - df5$duration)^2)

cat('Mean of actual duration: ', mean.actual.duration)

## Mean of actual duration:  1101.381

cat('\nMean of predicted duration: ', mean.predicted.duration)

## 
## Mean of predicted duration:  1103.555

cat('\nMean of tuned predicted duration: ', mean.predicted.duration.tuned)

## 
## Mean of tuned predicted duration:  1103.09

cat('\n\nMedian of actual duration: ', median.actual.duration)

## 
## 
## Median of actual duration:  1073

cat('\nMedian of predicted duration: ', median.predicted.duration)

## 
## Median of predicted duration:  1073.649

cat('\nMedian of tuned predicted duration: ', median.predicted.duration.tuned)

## 
## Median of tuned predicted duration:  1074.55

cat('\n\nMSE un-tuned random forest: ', MSE.predicted.duration)

## 
## 
## MSE un-tuned random forest:  1398.016

cat('\nMSE tuned random forest: ', MSE.predicted.duration.tuned)

## 
## MSE tuned random forest:  185.4799

The tuned random forest model has slightly better performance than the un-tuned random forest on calculating the mean duration, but slightly worse performance when calculating median duration. The MSE of the tuned random forest model is significantly lower than the MSE of the un-tuned model.

3) Compare the accuracy of model with a decision trees and logistic regression model for acquiring customers.

First we optimize the random forest for acquisition prediction.

set.seed(1)
# Establish a list of possible values for hyper-parameters
mtry.values <- seq(2,6,1)
nodesize.values <- seq(2,8,2)
ntree.values <- seq(1e3,6e3,500)

# Create a data frame containing all combinations 
hyper_grid <- expand.grid(mtry = mtry.values, nodesize = nodesize.values, ntree = ntree.values)

# Create an empty vector to store OOB error values
oob_err <- c()

# Write a loop over the rows of hyper_grid to train the grid of models
for (i in 1:nrow(hyper_grid)) {

    # Train a Random Forest model
   model <- rfsrc(acquisition ~ acq_exp +
                    industry +
                    revenue +
                    employees,
                    data = df1.train,
                    mtry = hyper_grid$mtry[i],
                    nodesize = hyper_grid$nodesize[i],
                    ntree = hyper_grid$ntree[i])  
  
                          
    # Store OOB error for the model                      
    oob_err[i] <- model$err.rate[length(model$err.rate)]
}

# Identify optimal set of hyperparmeters based on OOB error
opt_i <- which.min(oob_err)
print(hyper_grid[opt_i,])

##   mtry nodesize ntree
## 6    2        4  1000

RF1.tuned <- rfsrc(acquisition ~ acq_exp + industry + revenue + employees, data = df1.train, mtry = 2, nodesize = 4, ntree = 1000, importance = TRUE)
#gather up the predicted durations based on the RF1.tuned regression model
RF1acq.preds.tuned <- predict(RF1.tuned, df1.test)$class

#Create a basic confusion matrix for the RF1 model
RF1.confusion <- table(RF1acq.preds, df1.test$acquisition)
#Create a basic confusion matrix for the tuned RF1 model
RF1.confusion.tuned <- table(RF1acq.preds.tuned, df1.test$acquisition)

Now that we have a tuned random forest classifier, we can examine how it compares to the original, un-tuned random forest, and also a standard decision tree, as well as LOGIT model.

set.seed(1)

#Decision tree is first, and I am doing all of this in CARET using the same trainControl argument throughout
DT <- train(acquisition ~ acq_exp + industry + revenue + employees, data = df1.train, method='rpart', trControl = TrnCtrl)
DT.preds <- predict(DT, df1.test)
DT.confusion <- table(DT.preds, df1.test$acquisition)

#LOGIT model is next, also done in CARET
GLM <- train(acquisition ~ acq_exp + industry + revenue + employees, data = df1.train, method='glm', family='binomial', trControl = TrnCtrl)
GLM.preds <- predict(GLM, df1.test)
GLM.confusion <- table(GLM.preds, df1.test$acquisition)

#I gather up the information from the various confusion matrixes
RF1.accuracy.tuned <- sum(diag(RF1.confusion.tuned))/sum(RF1.confusion.tuned)
RF1.accuracy <- sum(diag(RF1.confusion))/sum(RF1.confusion)
DT.accuracy <- sum(diag(DT.confusion))/sum(DT.confusion)
GLM.accuracy <- sum(diag(GLM.confusion))/sum(GLM.confusion)

#Print out the accuracies
cat('Accuracy of Original Classification Random Forest: ', RF1.accuracy)

## Accuracy of Original Classification Random Forest:  0.7676768

cat('\nAccuracy of Tuned Classification Random Forest: ', RF1.accuracy.tuned)

## 
## Accuracy of Tuned Classification Random Forest:  0.8282828

cat('\nAccuacy of Decision Tree Model: ', DT.accuracy)

## 
## Accuacy of Decision Tree Model:  0.7878788

cat('\nAccuracy of LOGIT Classification Model: ', GLM.accuracy)

## 
## Accuracy of LOGIT Classification Model:  0.8181818

In this case, we see that the tuned classification random forest model has the best performance, followed by the logistic regression model, then the decision tree, with the un-tuned random forest model coming in at the rear.

Let’s take a look at the GLM confusion matrix.

GLM.confusion

##          
## GLM.preds  0  1
##         0 19  5
##         1 13 62

And here’s the DT confusion matrix.

DT.confusion

##         
## DT.preds  0  1
##        0 17  6
##        1 15 61

And here is the RF1 confusion matrix.

RF1.confusion

##             
## RF1acq.preds  0  1
##            0 19 10
##            1 13 57

Here is what the tuned RF1 confusion matrix looks like.

RF1.confusion.tuned

##                   
## RF1acq.preds.tuned  0  1
##                  0 21  6
##                  1 11 61

Here is what the decision tree looks like:

rattle::fancyRpartPlot(DT$finalModel, sub = "Decision Tree for Predicting Acquisition")

And this is what the GLM can tell us for interpretation purposes.

summary(GLM)

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3234  -0.6393   0.3005   0.6428   2.4427  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.7553200  0.9295442  -7.267 3.67e-13 ***
## acq_exp     -0.0004578  0.0007815  -0.586    0.558    
## industry1    1.6146095  0.2907506   5.553 2.80e-08 ***
## revenue      0.0675192  0.0147713   4.571 4.85e-06 ***
## employees    0.0069523  0.0008210   8.469  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 505.25  on 400  degrees of freedom
## Residual deviance: 337.43  on 396  degrees of freedom
## AIC: 347.43
## 
## Number of Fisher Scoring iterations: 5

This is a logistic regression model, therefore the interpretation is as follows:

The p-value of acq_exp is above the selection threshold, so we fail to reject the null hypothesis. There is no significant statistical evidence that changes in acq_exp increase the odds of gaining an acquisition, holding all other variables constant.

The p-value for industry is below the threshold, so we reject the null hypothesis that the coefficient is equal to zero. The odds of an acquisition are e^1.614 (5.02) times higher for a customer who is in the B2B industry compared to one who is not, when all other variables are held constant.

The p-value for revenue is below the threshold, so we reject the null hypothesis that its coefficient is equal to zero. The odds of an acquisition increase by a factor of e^0.07 (1.07) for each million in annual sales revenue, holding all other variables constant.

The p-value for employees is below the threshold, so we reject the null hypothesis that its coefficient is equal to zero. The odds of an acquisition increase by a factor of e^0.007 (1.007) for each additional employee in the prospect’s firm, holding all other variables constant.

#?acquisitionRetention

Generate PDP plots for all variables.

We generate partial dependence plots using the plot.variable function of RFSRC.

plot.variable(RF1.tuned, partial=TRUE)

Partial dependence plots illustrate how each variable affects the model’s prediction. The idea behind how they are interpreted is similar to the interpretation of coefficients in a linear regression model.

For predicted acquisition, we see that there is an inverse relationship between the probability of no acquisition and number of employees and annual sales revenue of the prospect’s firm in millions of dollars. Another way to look at it would be: holding all other variables constant, the more employees in a prospect’s firm, the higher the likelihood of acquisition. The same holds true for annual sales revenue of the prospect’s firm. We can also see that there is a “sweet spot” for acq_exp that maximizes the probability of acquisition: somewhere between 400 and 600 dollars. We also see that the probability of acquisition is highest when the customer is in the B2B industry.

plot.variable(RF2.tuned, partial = TRUE)

Here we have the partial dependence plots for predicting duration. There is a direct relationship between predicted duration and ret_exp, acq_exp and sow. There is a generally inverse relation between duration and freq and employees. Predicted duration generally decreases as revenue increases, until about the 50-million dollar mark, at which point it begins to increase again. We also see an interesting dip between predicted duration and crossbuy, where duration generally increases until a crossbuy of about 10 product categories, at which point the predicted duration begins to drop off. It is difficult to determine whether there is a significant difference between industry and predicted duration, as there is a lot of overlap in predicted duration when the customer is in the B2B industry, or when the customer is not in the industry.

DA6813 Case Study 4

Customer Acquisition & Retention

Melissa Haenke, Joe Schulze, Steven Sutter, Torey Tonche

2021-04-18

Background:

Data:

To access the data in R: load the SMCRM package and execute data(acquisitionRetention).

Therefore, in this case-study, please address the following tasks and write the report using the Case-study guidelines.