data(acquisitionRetention)
#?acquisitionRetention
#I am splitting off a df1 as a working dataframe
df1 <- acquisitionRetention
The first step is exploratory data analysis.
What is the structure of the
str(df1)
## 'data.frame': 500 obs. of 15 variables:
## $ customer : num 1 2 3 4 5 6 7 8 9 10 ...
## $ acquisition: num 1 1 1 0 1 1 1 1 0 0 ...
## $ duration : num 1635 1039 1288 0 1631 ...
## $ profit : num 6134 3524 4081 -638 5446 ...
## $ acq_exp : num 694 460 249 638 589 ...
## $ ret_exp : num 972 450 805 0 920 ...
## $ acq_exp_sq : num 480998 211628 62016 407644 346897 ...
## $ ret_exp_sq : num 943929 202077 648089 0 846106 ...
## $ freq : num 6 11 21 0 2 7 15 13 0 0 ...
## $ freq_sq : num 36 121 441 0 4 49 225 169 0 0 ...
## $ crossbuy : num 5 6 6 0 9 4 5 5 0 0 ...
## $ sow : num 95 22 90 0 80 48 51 23 0 0 ...
## $ industry : num 1 0 0 0 0 1 0 1 0 1 ...
## $ revenue : num 47.2 45.1 29.1 40.6 48.7 ...
## $ employees : num 898 686 1423 181 631 ...
We have 500 observations across 15 variables. We will be working with two target variables. First, acquisition, and industry which will be factor variables. Second, duration, which will remain numerical. Customer is not useful to us, so I see no reason to keep it.
#Eliminate the variable customer, it is not useful in this analysis.
df1$customer <- NULL
Let’s get a summary of the data.
summary(df1)
## acquisition duration profit acq_exp
## Min. :0.000 Min. : 0.0 Min. :-1027.0 Min. : 1.21
## 1st Qu.:0.000 1st Qu.: 0.0 1st Qu.: -316.3 1st Qu.: 384.14
## Median :1.000 Median : 957.5 Median : 3369.9 Median : 491.66
## Mean :0.676 Mean : 742.5 Mean : 2403.8 Mean : 493.35
## 3rd Qu.:1.000 3rd Qu.:1146.2 3rd Qu.: 3931.6 3rd Qu.: 600.21
## Max. :1.000 Max. :1673.0 Max. : 6134.3 Max. :1027.04
## ret_exp acq_exp_sq ret_exp_sq freq
## Min. : 0.0 Min. : 1.5 Min. : 0 Min. : 0.00
## 1st Qu.: 0.0 1st Qu.: 147562.0 1st Qu.: 0 1st Qu.: 0.00
## Median : 398.1 Median : 241729.7 Median : 158480 Median : 6.00
## Mean : 336.3 Mean : 271211.1 Mean : 184000 Mean : 6.22
## 3rd Qu.: 514.3 3rd Qu.: 360246.0 3rd Qu.: 264466 3rd Qu.:11.00
## Max. :1095.0 Max. :1054811.2 Max. :1198937 Max. :21.00
## freq_sq crossbuy sow industry
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. :0.000
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.:0.000
## Median : 36.00 Median : 5.000 Median : 44.00 Median :1.000
## Mean : 69.25 Mean : 4.052 Mean : 38.88 Mean :0.522
## 3rd Qu.:121.00 3rd Qu.: 7.000 3rd Qu.: 66.00 3rd Qu.:1.000
## Max. :441.00 Max. :11.000 Max. :116.00 Max. :1.000
## revenue employees
## Min. :14.49 Min. : 18.0
## 1st Qu.:33.53 1st Qu.: 503.0
## Median :41.43 Median : 657.5
## Mean :40.54 Mean : 671.5
## 3rd Qu.:47.52 3rd Qu.: 826.0
## Max. :65.10 Max. :1461.0
I see a lot of 0 values in these variables. This will cause problems if we do something like log transformation.
Are there any NA values?
sum(is.na(df1))
## [1] 0
There are no NA values in the dataset.
Let’s get a look at whether any variables are correlated with one another, and if that could cause problems for our classification of acquisition.
chart.Correlation(df1, histogram = TRUE, pch=19)
Wow. A lot of highly correlated variables. Here’s the list:
Can these really throw off the value for acquisition? Box plots may be of use here.
par(mfrow=c(3,3))
boxplot(duration ~ acquisition, data=df1, ylab='duration', xlab='acquisition', col='#FF0000')
boxplot(profit ~ acquisition, data=df1, ylab='profit', xlab='acquisition', col='#FF3300')
boxplot(ret_exp ~ acquisition, data=df1, ylab='ret_exp', xlab='acquisition', col='#CC9933')
boxplot(acq_exp_sq ~ acquisition, data=df1, ylab='acq_exp_sq', xlab='acquisition', col='#33CC00')
boxplot(ret_exp_sq ~ acquisition, data=df1, ylab='ret_exp_sq', xlab='acquisition', col='#99CCFF')
boxplot(freq ~ acquisition, data=df1, ylab='freq', xlab='acquisition', col='#0000CC')
boxplot(freq_sq ~ acquisition, data=df1, ylab='freq_sq', xlab='acquisition', col='#9933CC')
boxplot(crossbuy ~ acquisition, data=df1, ylab='crossbuy', xlab='acquisition', col='#9900FF')
boxplot(sow ~ acquisition, data=df1, ylab='sow', xlab='acquisition', col='#6600FF')
par(mfrow=c(1,1))
df1$acquisition <- as.factor(df1$acquisition)
df1$industry <- as.factor(df1$industry)
We see that every one of these variables, with the exception of acq_exp_sq, will cause acquisition to flip to zero if that particular variable is equal to zero, or is negative (as in the case of profit). None of these will be in the final model. I’m going to exclude acq_exp_sq because it is merely a square of acq_exp.
summary(df1)
## acquisition duration profit acq_exp
## 0:162 Min. : 0.0 Min. :-1027.0 Min. : 1.21
## 1:338 1st Qu.: 0.0 1st Qu.: -316.3 1st Qu.: 384.14
## Median : 957.5 Median : 3369.9 Median : 491.66
## Mean : 742.5 Mean : 2403.8 Mean : 493.35
## 3rd Qu.:1146.2 3rd Qu.: 3931.6 3rd Qu.: 600.21
## Max. :1673.0 Max. : 6134.3 Max. :1027.04
## ret_exp acq_exp_sq ret_exp_sq freq
## Min. : 0.0 Min. : 1.5 Min. : 0 Min. : 0.00
## 1st Qu.: 0.0 1st Qu.: 147562.0 1st Qu.: 0 1st Qu.: 0.00
## Median : 398.1 Median : 241729.7 Median : 158480 Median : 6.00
## Mean : 336.3 Mean : 271211.1 Mean : 184000 Mean : 6.22
## 3rd Qu.: 514.3 3rd Qu.: 360246.0 3rd Qu.: 264466 3rd Qu.:11.00
## Max. :1095.0 Max. :1054811.2 Max. :1198937 Max. :21.00
## freq_sq crossbuy sow industry revenue
## Min. : 0.00 Min. : 0.000 Min. : 0.00 0:239 Min. :14.49
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00 1:261 1st Qu.:33.53
## Median : 36.00 Median : 5.000 Median : 44.00 Median :41.43
## Mean : 69.25 Mean : 4.052 Mean : 38.88 Mean :40.54
## 3rd Qu.:121.00 3rd Qu.: 7.000 3rd Qu.: 66.00 3rd Qu.:47.52
## Max. :441.00 Max. :11.000 Max. :116.00 Max. :65.10
## employees
## Min. : 18.0
## 1st Qu.: 503.0
## Median : 657.5
## Mean : 671.5
## 3rd Qu.: 826.0
## Max. :1461.0
The final model is quite simple:
acquisition ~ acq_exp + industry + revenue + employees
Let’s create a random forest. Since I’m using CARET, I need a method of adjusting the trainControl.
set.seed(1)
#Create a traincontrol for model validation
TrnCtrl <- trainControl(method = 'repeatedcv', number = 10, repeats = 2)
I’m going to use an 80/20 split on the data for this case study.
set.seed(1)
inTrain <- createDataPartition(y = df1$acquisition, p=0.8, list=FALSE)
df1.train <- df1[inTrain, ]
df1.test <- df1[-inTrain, ]
set.seed(1)
RF1 <- train(acquisition ~ acq_exp + industry + revenue + employees, data = df1.train, method='rf', trControl = TrnCtrl, importance=TRUE)
#RF1$finalModel
RF1acq.preds <- predict(RF1, df1.test)
confusionMatrix(RF1acq.preds, df1.test$acquisition)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 19 10
## 1 13 57
##
## Accuracy : 0.7677
## 95% CI : (0.6721, 0.8467)
## No Information Rate : 0.6768
## P-Value [Acc > NIR] : 0.03125
##
## Kappa : 0.4557
##
## Mcnemar's Test P-Value : 0.67666
##
## Sensitivity : 0.5938
## Specificity : 0.8507
## Pos Pred Value : 0.6552
## Neg Pred Value : 0.8143
## Prevalence : 0.3232
## Detection Rate : 0.1919
## Detection Prevalence : 0.2929
## Balanced Accuracy : 0.7222
##
## 'Positive' Class : 0
##
This random forest has an overall predictive accuracy of 76.8% on the small hold-out test sample.
We now apply this model upon the full df1 dataset in preparation for answering the 2nd half of the question.
RF1acq.preds.full <- predict(RF1, df1)
We now have to address the 2nd half of the question:
OK, my understanding is that we need to build a regression model on the data where the RF1 model had predicted acquisition of 1. Since we have a small test set of only 99 observations, and from there an even smaller dataset where RF1 had predicted acquisition of 1. I’m going to take the RF1 predictions and add them into the dataframe of df1.test, and call it df2.
#df2 includes the original acquisition, but also the predicted acquisitions from the CARET RF1 model
df2 <- cbind(df1, RF1acq.preds.full)
Now I’m going to create a third dataframe that includes only the predicted acquisitions from RF1 as well as the actual acquisitions from the dataframe.
#Need to do this with both acq.preds = 1 and acq = 1
df3 <- subset(df2, RF1acq.preds.full == "1" & acquisition == "1")
#Let's take a look at the new dataset
summary(df3)
## acquisition duration profit acq_exp ret_exp
## 0: 0 Min. : 654 Min. :1839 Min. :151.0 Min. : 229.9
## 1:328 1st Qu.: 956 1st Qu.:3365 1st Qu.:400.9 1st Qu.: 396.8
## Median :1073 Median :3718 Median :493.4 Median : 466.2
## Mean :1101 Mean :3805 Mean :492.7 Mean : 499.8
## 3rd Qu.:1238 3rd Qu.:4160 3rd Qu.:579.6 3rd Qu.: 572.3
## Max. :1673 Max. :6134 Max. :864.1 Max. :1095.0
## acq_exp_sq ret_exp_sq freq freq_sq
## Min. : 22813 Min. : 52840 Min. : 1.000 Min. : 1.0
## 1st Qu.:160755 1st Qu.: 157439 1st Qu.: 6.000 1st Qu.: 36.0
## Median :243424 Median : 217329 Median : 9.000 Median : 81.0
## Mean :261326 Mean : 274822 Mean : 9.186 Mean :102.1
## 3rd Qu.:335971 3rd Qu.: 327513 3rd Qu.:12.000 3rd Qu.:144.0
## Max. :746669 Max. :1198937 Max. :21.000 Max. :441.0
## crossbuy sow industry revenue employees
## Min. : 1.000 Min. : 1.00 0:126 Min. :15.86 Min. : 24.0
## 1st Qu.: 5.000 1st Qu.: 44.00 1:202 1st Qu.:35.74 1st Qu.: 606.0
## Median : 6.000 Median : 58.50 Median :43.26 Median : 751.5
## Mean : 6.006 Mean : 57.67 Mean :42.26 Mean : 757.4
## 3rd Qu.: 7.000 3rd Qu.: 71.00 3rd Qu.:48.12 3rd Qu.: 902.5
## Max. :11.000 Max. :116.00 Max. :65.10 Max. :1461.0
## RF1acq.preds.full
## 0: 0
## 1:328
##
##
##
##
We’re now down to 328 observations.
Since we’re now going to be doing RF-based regression on the variable duration, let’s take a look at a correlation plot and evaluate the variance inflation factors of the variables we’re going to use (the numeric variables, basically).
chart.Correlation(df3[ ,c(2:5,8,10,11,13,14)], histogram=TRUE, pch=19)
Variables I’m removing ahead of time: acquisition, acq_exp_sq, ret_exp_sq, acq.preds, freq_sq.
Let’s take a look at a generic GLM and evaluate the VIF.
set.seed(1)
glm.viftest <- glm(duration ~ profit + acq_exp + ret_exp + freq + crossbuy + sow + industry + revenue + employees, data = df3)
vif(glm.viftest)
## profit acq_exp ret_exp freq crossbuy sow industry revenue
## 34.904075 7.688180 26.856120 1.185955 1.114966 1.144093 1.159454 1.104746
## employees
## 1.270974
Removing the variable with highest VIF, profit.
set.seed(1)
glm.viftest <- glm(duration ~ acq_exp + ret_exp + freq + crossbuy + sow + industry + revenue + employees, data = df3)
vif(glm.viftest)
## acq_exp ret_exp freq crossbuy sow industry revenue employees
## 1.020728 1.024403 1.024951 1.017323 1.015630 1.033686 1.030283 1.038938
We’ve got our final model: duration ~ acq_exp + ret_exp + freq + crossbuy + sow + industry + revenue + employees
Now we run the random forest again for regression, but this time using the rfsrc package, which gives us more tuning abilities than CARET.
set.seed(1)
tuner <- expand.grid(mtry = c(1:10))
RF2 <- rfsrc(duration ~ acq_exp + ret_exp + freq + crossbuy + sow + industry + revenue + employees, data = df3, tuneGrid = tuner, importance = TRUE, ntree = 1000)
#gather up the predicted durations based on the RF2 regression model
duration.preds <- predict(RF2, df3)$predicted
#roll the duration.preds into ANOTHER dataframe
df4 <- cbind(df3, duration.preds)
Here’s a comparison between the actual duration and predicted durations:
#Get totals for "actual" durations and predicted durations
mean.actual.duration <- mean(df4$duration)
mean.predicted.duration <- mean(df4$duration.preds)
median.actual.duration <- median(df4$duration)
median.predicted.duration <- median(df4$duration.preds)
#sum.actual.duration <- sum(df4$duration) #Probably not necessary, commenting out
#sum.predicted.duration <- sum(df4$duration.preds) #Probably not necessary, commenting out
cat('Mean of actual duration: ', mean.actual.duration)
## Mean of actual duration: 1101.381
cat('\nMean of predicted duration: ', mean.predicted.duration)
##
## Mean of predicted duration: 1103.555
cat('\n\nMedian of actual duration: ', median.actual.duration)
##
##
## Median of actual duration: 1073
cat('\nMedian of predicted duration: ', median.predicted.duration)
##
## Median of predicted duration: 1073.649
The untuned random forest predictions are a little high on both mean and median, but not by much.
var.imp <- RF2$importance
var.imp
## acq_exp ret_exp freq crossbuy sow industry
## -137.53478 53425.06073 1515.00650 17.84416 -64.14785 69.38474
## revenue employees
## 13.73018 96.43545
Let’s use that awesome graphing script the instructor demonstrated and see what these look like:
data.frame(importance = RF2$importance) %>%
tibble::rownames_to_column(var = "variable") %>%
ggplot(aes(x = reorder(variable,importance), y = importance)) +
geom_bar(stat = "identity", fill = "orange", color = "black")+
coord_flip() +
labs(x = "Variables", y = "Variable importance")+
theme_nice
We saw that some of the importance variables are negative, so I’m going to add a large constant to them and then take the log, and plot the results.
log.var.imp <- log(var.imp + 200)
log.var.imp
## acq_exp ret_exp freq crossbuy sow industry revenue employees
## 4.134610 10.889772 7.447172 5.383780 4.911567 5.596141 5.364714 5.691829
data.frame(importance = log.var.imp) %>%
tibble::rownames_to_column(var = "variable") %>%
ggplot(aes(x = reorder(variable,importance), y = importance)) +
geom_bar(stat = "identity", fill = "orange", color = "black")+
coord_flip() +
labs(x = "Variables", y = "Variable importance")+
theme_nice
For the top four, we see ret_exp is most important, followed by freq, then employees and industry almost at the same level.
Doing some minimal depth evaluation, like in the instructor’s demonstration.
mindepth <- max.subtree(RF2, sub.order = TRUE)
print(round(mindepth$order, 3)[,1])
## acq_exp ret_exp freq crossbuy sow industry revenue employees
## 3.062 1.009 1.784 2.937 3.289 6.609 2.958 2.751
data.frame(md = round(mindepth$order, 3)[,1]) %>%
tibble::rownames_to_column(var = "variable") %>%
ggplot(aes(x = reorder(variable,desc(md)), y = md)) +
geom_bar(stat = "identity", fill = "orange", color = "black", width = 0.2)+
coord_flip() +
labs(x = "Variables", y = "Minimal Depth")+
theme_nice
The variable industry has the greatest minimal depth, followed by sow, then acq_exp, revenue, crossbuy, employees, freq and finally ret_exp.
Checking interactions now, again leveraging the instructor’s cool scripts.
mindepth$sub.order
## acq_exp ret_exp freq crossbuy sow industry
## acq_exp 0.2954469 0.38732288 0.5344080 0.7037542 0.6125871 0.8906572
## ret_exp 0.3245762 0.09474076 0.2487924 0.3847783 0.3428941 0.7556862
## freq 0.4091906 0.21599293 0.1728422 0.4876632 0.4146803 0.7738898
## crossbuy 0.5829393 0.38685813 0.5368186 0.2847095 0.5871855 0.8699448
## sow 0.6446719 0.40558923 0.5774949 0.7361772 0.3192488 0.8988987
## industry 0.8465499 0.75987618 0.8455613 0.8831286 0.8534124 0.6300179
## revenue 0.5979911 0.37545273 0.5350845 0.6777104 0.6092026 0.8735475
## employees 0.5519694 0.34273571 0.5126302 0.6606603 0.5810575 0.8705586
## revenue employees
## acq_exp 0.6154524 0.5940329
## ret_exp 0.3391213 0.3101321
## freq 0.4207874 0.4183681
## crossbuy 0.5956271 0.5758397
## sow 0.6330410 0.6363552
## industry 0.8527474 0.8617443
## revenue 0.2875672 0.5975773
## employees 0.5754441 0.2646985
as.matrix(mindepth$sub.order) %>%
reshape2::melt() %>%
data.frame() %>%
ggplot(aes(x = Var1, y = Var2, fill = value)) +
scale_x_discrete(position = "top") +
geom_tile(color = "white") +
viridis::scale_fill_viridis("Relative min. depth") +
labs(x = "Model Variables", y = "Model Variables") +
theme_bw()
The variable industry clearly has the greatest relative minimum depth across all the other variables. The variable that has the smallest relative minimum depth is ret_exp.
And running a cross-check with variable importance.
find.interaction(RF2, method = "vimp", importance = "permute")
## Pairing ret_exp with freq
## Pairing ret_exp with employees
## Pairing ret_exp with industry
## Pairing ret_exp with crossbuy
## Pairing ret_exp with revenue
## Pairing ret_exp with sow
## Pairing ret_exp with acq_exp
## Pairing freq with employees
## Pairing freq with industry
## Pairing freq with crossbuy
## Pairing freq with revenue
## Pairing freq with sow
## Pairing freq with acq_exp
## Pairing employees with industry
## Pairing employees with crossbuy
## Pairing employees with revenue
## Pairing employees with sow
## Pairing employees with acq_exp
## Pairing industry with crossbuy
## Pairing industry with revenue
## Pairing industry with sow
## Pairing industry with acq_exp
## Pairing crossbuy with revenue
## Pairing crossbuy with sow
## Pairing crossbuy with acq_exp
## Pairing revenue with sow
## Pairing revenue with acq_exp
## Pairing sow with acq_exp
##
## Method: vimp
## No. of variables: 8
## Variables sorted by VIMP?: TRUE
## No. of variables used for pairing: 8
## Total no. of paired interactions: 28
## Monte Carlo replications: 1
## Type of noising up used for VIMP: permute
##
## Var 1 Var 2 Paired Additive Difference
## ret_exp:freq 52879.4208 1471.4388 53929.7375 54350.8596 -421.1221
## ret_exp:employees 52879.4208 61.2513 53113.0540 52940.6720 172.3820
## ret_exp:industry 52879.4208 72.6574 53085.5331 52952.0782 133.4549
## ret_exp:crossbuy 52879.4208 4.9911 52980.4107 52884.4118 95.9989
## ret_exp:revenue 52879.4208 -82.5588 52964.7531 52796.8620 167.8911
## ret_exp:sow 52879.4208 -3.8879 52541.0854 52875.5328 -334.4474
## ret_exp:acq_exp 52879.4208 -169.1507 52690.1046 52710.2701 -20.1655
## freq:employees 1496.5574 60.2929 1610.9132 1556.8503 54.0629
## freq:industry 1496.5574 57.5234 1564.2299 1554.0809 10.1491
## freq:crossbuy 1496.5574 46.9774 1604.8496 1543.5349 61.3148
## freq:revenue 1496.5574 -59.2246 1566.0292 1437.3328 128.6964
## freq:sow 1496.5574 26.0465 1368.0277 1522.6040 -154.5763
## freq:acq_exp 1496.5574 -168.2621 1533.6821 1328.2954 205.3867
## employees:industry 125.5373 45.3814 46.3160 170.9187 -124.6028
## employees:crossbuy 125.5373 4.5083 197.2177 130.0456 67.1720
## employees:revenue 125.5373 -36.5848 61.3581 88.9525 -27.5945
## employees:sow 125.5373 -11.2075 39.7536 114.3298 -74.5763
## employees:acq_exp 125.5373 -156.4618 -50.0289 -30.9245 -19.1044
## industry:crossbuy 37.1442 48.1755 46.9237 85.3197 -38.3959
## industry:revenue 37.1442 22.3808 -68.3742 59.5250 -127.8991
## industry:sow 37.1442 -3.9797 -16.7353 33.1645 -49.8998
## industry:acq_exp 37.1442 -141.8307 -103.4745 -104.6865 1.2120
## crossbuy:revenue 22.9968 -5.8232 -27.4921 17.1736 -44.6657
## crossbuy:sow 22.9968 -0.0798 35.2598 22.9171 12.3428
## crossbuy:acq_exp 22.9968 -148.5177 -68.3083 -125.5208 57.2125
## revenue:sow 4.0900 -81.8618 104.5420 -77.7718 182.3137
## revenue:acq_exp 4.0900 -172.5393 -102.8654 -168.4493 65.5839
## sow:acq_exp -3.3074 -129.6146 -249.5175 -132.9220 -116.5955
Finally, tuning the hyperparameters on the RFSRC model predicting duration for customers already acquired. This also leverages the script demonstrated in class.
set.seed(1)
# Establish a list of possible values for hyper-parameters
mtry.values <- seq(2,6,1)
nodesize.values <- seq(2,8,2)
ntree.values <- seq(1e3,6e3,500)
# Create a data frame containing all combinations
hyper_grid <- expand.grid(mtry = mtry.values, nodesize = nodesize.values, ntree = ntree.values)
# Create an empty vector to store OOB error values
oob_err <- c()
# Write a loop over the rows of hyper_grid to train the grid of models
for (i in 1:nrow(hyper_grid)) {
# Train a Random Forest model
model <- rfsrc(duration ~ acq_exp +
ret_exp +
freq +
crossbuy +
sow +
industry +
revenue +
employees,
data = df3,
mtry = hyper_grid$mtry[i],
nodesize = hyper_grid$nodesize[i],
ntree = hyper_grid$ntree[i])
# Store OOB error for the model
oob_err[i] <- model$err.rate[length(model$err.rate)]
}
# Identify optimal set of hyperparmeters based on OOB error
opt_i <- which.min(oob_err)
print(hyper_grid[opt_i,])
## mtry nodesize ntree
## 65 6 2 2500
OK, that’s well and good, so let’s see if this tuned RF has improved performance.
set.seed(1)
RF2.tuned <- rfsrc(duration ~ acq_exp + ret_exp + freq + crossbuy + sow + industry + revenue + employees, data = df4, mtry = 6, nodesize = 2, ntree = 2500, importance = TRUE)
#gather up the predicted durations based on the RF2.tuned regression model
duration.preds.tuned <- predict(RF2.tuned, df4)$predicted
#roll the duration.preds into yet ANOTHER dataframe, df5 this time, for the tuned model
df5 <- cbind(df4, duration.preds.tuned)
#Now let us compare the performance of the standard vs. tuned random forest regression predictions
#compare the mean values
mean.actual.duration <- mean(df5$duration)
mean.predicted.duration <- mean(df5$duration.preds)
mean.predicted.duration.tuned <- mean(df5$duration.preds.tuned)
#compare the median values
median.actual.duration <- median(df5$duration)
median.predicted.duration <- median(df5$duration.preds)
median.predicted.duration.tuned <- median(df5$duration.preds.tuned)
#compare the MSE values
MSE.predicted.duration <- mean((df5$duration.preds - df5$duration)^2)
MSE.predicted.duration.tuned <- mean((df5$duration.preds.tuned - df5$duration)^2)
cat('Mean of actual duration: ', mean.actual.duration)
## Mean of actual duration: 1101.381
cat('\nMean of predicted duration: ', mean.predicted.duration)
##
## Mean of predicted duration: 1103.555
cat('\nMean of tuned predicted duration: ', mean.predicted.duration.tuned)
##
## Mean of tuned predicted duration: 1103.09
cat('\n\nMedian of actual duration: ', median.actual.duration)
##
##
## Median of actual duration: 1073
cat('\nMedian of predicted duration: ', median.predicted.duration)
##
## Median of predicted duration: 1073.649
cat('\nMedian of tuned predicted duration: ', median.predicted.duration.tuned)
##
## Median of tuned predicted duration: 1074.55
cat('\n\nMSE un-tuned random forest: ', MSE.predicted.duration)
##
##
## MSE un-tuned random forest: 1398.016
cat('\nMSE tuned random forest: ', MSE.predicted.duration.tuned)
##
## MSE tuned random forest: 185.4799
The tuned random forest model has slightly better performance than the un-tuned random forest on calculating the mean duration, but slightly worse performance when calculating median duration. The MSE of the tuned random forest model is significantly lower than the MSE of the un-tuned model.
First we optimize the random forest for acquisition prediction.
set.seed(1)
# Establish a list of possible values for hyper-parameters
mtry.values <- seq(2,6,1)
nodesize.values <- seq(2,8,2)
ntree.values <- seq(1e3,6e3,500)
# Create a data frame containing all combinations
hyper_grid <- expand.grid(mtry = mtry.values, nodesize = nodesize.values, ntree = ntree.values)
# Create an empty vector to store OOB error values
oob_err <- c()
# Write a loop over the rows of hyper_grid to train the grid of models
for (i in 1:nrow(hyper_grid)) {
# Train a Random Forest model
model <- rfsrc(acquisition ~ acq_exp +
industry +
revenue +
employees,
data = df1.train,
mtry = hyper_grid$mtry[i],
nodesize = hyper_grid$nodesize[i],
ntree = hyper_grid$ntree[i])
# Store OOB error for the model
oob_err[i] <- model$err.rate[length(model$err.rate)]
}
# Identify optimal set of hyperparmeters based on OOB error
opt_i <- which.min(oob_err)
print(hyper_grid[opt_i,])
## mtry nodesize ntree
## 6 2 4 1000
RF1.tuned <- rfsrc(acquisition ~ acq_exp + industry + revenue + employees, data = df1.train, mtry = 2, nodesize = 4, ntree = 1000, importance = TRUE)
#gather up the predicted durations based on the RF1.tuned regression model
RF1acq.preds.tuned <- predict(RF1.tuned, df1.test)$class
#Create a basic confusion matrix for the RF1 model
RF1.confusion <- table(RF1acq.preds, df1.test$acquisition)
#Create a basic confusion matrix for the tuned RF1 model
RF1.confusion.tuned <- table(RF1acq.preds.tuned, df1.test$acquisition)
Now that we have a tuned random forest classifier, we can examine how it compares to the original, un-tuned random forest, and also a standard decision tree, as well as LOGIT model.
set.seed(1)
#Decision tree is first, and I am doing all of this in CARET using the same trainControl argument throughout
DT <- train(acquisition ~ acq_exp + industry + revenue + employees, data = df1.train, method='rpart', trControl = TrnCtrl)
DT.preds <- predict(DT, df1.test)
DT.confusion <- table(DT.preds, df1.test$acquisition)
#LOGIT model is next, also done in CARET
GLM <- train(acquisition ~ acq_exp + industry + revenue + employees, data = df1.train, method='glm', family='binomial', trControl = TrnCtrl)
GLM.preds <- predict(GLM, df1.test)
GLM.confusion <- table(GLM.preds, df1.test$acquisition)
#I gather up the information from the various confusion matrixes
RF1.accuracy.tuned <- sum(diag(RF1.confusion.tuned))/sum(RF1.confusion.tuned)
RF1.accuracy <- sum(diag(RF1.confusion))/sum(RF1.confusion)
DT.accuracy <- sum(diag(DT.confusion))/sum(DT.confusion)
GLM.accuracy <- sum(diag(GLM.confusion))/sum(GLM.confusion)
#Print out the accuracies
cat('Accuracy of Original Classification Random Forest: ', RF1.accuracy)
## Accuracy of Original Classification Random Forest: 0.7676768
cat('\nAccuracy of Tuned Classification Random Forest: ', RF1.accuracy.tuned)
##
## Accuracy of Tuned Classification Random Forest: 0.8282828
cat('\nAccuacy of Decision Tree Model: ', DT.accuracy)
##
## Accuacy of Decision Tree Model: 0.7878788
cat('\nAccuracy of LOGIT Classification Model: ', GLM.accuracy)
##
## Accuracy of LOGIT Classification Model: 0.8181818
In this case, we see that the tuned classification random forest model has the best performance, followed by the logistic regression model, then the decision tree, with the un-tuned random forest model coming in at the rear.
Let’s take a look at the GLM confusion matrix.
GLM.confusion
##
## GLM.preds 0 1
## 0 19 5
## 1 13 62
And here’s the DT confusion matrix.
DT.confusion
##
## DT.preds 0 1
## 0 17 6
## 1 15 61
And here is the RF1 confusion matrix.
RF1.confusion
##
## RF1acq.preds 0 1
## 0 19 10
## 1 13 57
Here is what the tuned RF1 confusion matrix looks like.
RF1.confusion.tuned
##
## RF1acq.preds.tuned 0 1
## 0 21 6
## 1 11 61
Here is what the decision tree looks like:
rattle::fancyRpartPlot(DT$finalModel, sub = "Decision Tree for Predicting Acquisition")
And this is what the GLM can tell us for interpretation purposes.
summary(GLM)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3234 -0.6393 0.3005 0.6428 2.4427
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.7553200 0.9295442 -7.267 3.67e-13 ***
## acq_exp -0.0004578 0.0007815 -0.586 0.558
## industry1 1.6146095 0.2907506 5.553 2.80e-08 ***
## revenue 0.0675192 0.0147713 4.571 4.85e-06 ***
## employees 0.0069523 0.0008210 8.469 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 505.25 on 400 degrees of freedom
## Residual deviance: 337.43 on 396 degrees of freedom
## AIC: 347.43
##
## Number of Fisher Scoring iterations: 5
This is a logistic regression model, therefore the interpretation is as follows:
The p-value of acq_exp is above the selection threshold, so we fail to reject the null hypothesis. There is no significant statistical evidence that changes in acq_exp increase the odds of gaining an acquisition, holding all other variables constant.
The p-value for industry is below the threshold, so we reject the null hypothesis that the coefficient is equal to zero. The odds of an acquisition are e^1.614 (5.02) times higher for a customer who is in the B2B industry compared to one who is not, when all other variables are held constant.
The p-value for revenue is below the threshold, so we reject the null hypothesis that its coefficient is equal to zero. The odds of an acquisition increase by a factor of e^0.07 (1.07) for each million in annual sales revenue, holding all other variables constant.
The p-value for employees is below the threshold, so we reject the null hypothesis that its coefficient is equal to zero. The odds of an acquisition increase by a factor of e^0.007 (1.007) for each additional employee in the prospect’s firm, holding all other variables constant.
#?acquisitionRetention
We generate partial dependence plots using the plot.variable function of RFSRC.
plot.variable(RF1.tuned, partial=TRUE)
Partial dependence plots illustrate how each variable affects the model’s prediction. The idea behind how they are interpreted is similar to the interpretation of coefficients in a linear regression model.
For predicted acquisition, we see that there is an inverse relationship between the probability of no acquisition and number of employees and annual sales revenue of the prospect’s firm in millions of dollars. Another way to look at it would be: holding all other variables constant, the more employees in a prospect’s firm, the higher the likelihood of acquisition. The same holds true for annual sales revenue of the prospect’s firm. We can also see that there is a “sweet spot” for acq_exp that maximizes the probability of acquisition: somewhere between 400 and 600 dollars. We also see that the probability of acquisition is highest when the customer is in the B2B industry.
plot.variable(RF2.tuned, partial = TRUE)
Here we have the partial dependence plots for predicting duration. There is a direct relationship between predicted duration and ret_exp, acq_exp and sow. There is a generally inverse relation between duration and freq and employees. Predicted duration generally decreases as revenue increases, until about the 50-million dollar mark, at which point it begins to increase again. We also see an interesting dip between predicted duration and crossbuy, where duration generally increases until a crossbuy of about 10 product categories, at which point the predicted duration begins to drop off. It is difficult to determine whether there is a significant difference between industry and predicted duration, as there is a lot of overlap in predicted duration when the customer is in the B2B industry, or when the customer is not in the industry.