Imagine this is your business: you want to predict whether your clients will subscribe to a new service or product your team has been developing for a while. You have launched it to the market, but it is not generating enough sales. What can you do so that all those hours and money invested in product development don’t go to waste?

I am choosing a bank marketing dataset for this example, but this could be your business too.

We start by loading the necessary packages

library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.

Then we load our dataset

BankMarketing <- read.csv("C:/Users/nunre/OneDrive/Documents/Business Consulting/Master Data Analytics/IT 697 R/Module 6/termCrosssell/bank-additional/bankmarketing.csv", sep = ";", header = TRUE)
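
The path above is specific to my machine, so adjust it to wherever you saved the semicolon-delimited file. Also, on R 4.0 and later, read.csv no longer converts strings to factors by default, so to reproduce the factor columns shown below you would add stringsAsFactors = TRUE (a sketch, keeping the same file name):

BankMarketing <- read.csv("bankmarketing.csv", sep = ";", header = TRUE,
                          stringsAsFactors = TRUE)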

Explore the dataset

The purpose of exploration is to give us an understanding of our dataset: the variables it contains, whether there are any missing values or outliers, how the values are distributed, and so on. We also gather information by consulting with you, our client, to understand the ultimate goal of the analysis. In this case, the goal is to predict which clients will subscribe to the new service.
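
As a quick sketch of that missing-value question, we can count missing values directly; note that in this file, missingness mostly shows up as the factor level "unknown" rather than as NA:

colSums(is.na(BankMarketing))                            # NA counts per column
sapply(BankMarketing, function(x) sum(x == "unknown"))   # "unknown" counts per column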

With that in mind, we start by looking at the variable (column) names to get a better understanding of our dataset.

names(BankMarketing)
##  [1] "age"            "job"            "marital"        "education"     
##  [5] "default"        "housing"        "loan"           "contact"       
##  [9] "month"          "day_of_week"    "duration"       "campaign"      
## [13] "pdays"          "previous"       "poutcome"       "emp.var.rate"  
## [17] "cons.price.idx" "cons.conf.idx"  "euribor3m"      "nr.employed"   
## [21] "y"

We use summary() to see the distribution of values within each variable.

summary(BankMarketing)
##       age                 job           marital    
##  Min.   :18.00   admin.     :1012   divorced: 446  
##  1st Qu.:32.00   blue-collar: 884   married :2509  
##  Median :38.00   technician : 691   single  :1153  
##  Mean   :40.11   services   : 393   unknown :  11  
##  3rd Qu.:47.00   management : 324                  
##  Max.   :88.00   retired    : 166                  
##                  (Other)    : 649                  
##                education       default        housing          loan     
##  university.degree  :1264   no     :3315   no     :1839   no     :3349  
##  high.school        : 921   unknown: 803   unknown: 105   unknown: 105  
##  basic.9y           : 574   yes    :   1   yes    :2175   yes    : 665  
##  professional.course: 535                                               
##  basic.4y           : 429                                               
##  basic.6y           : 228                                               
##  (Other)            : 168                                               
##       contact         month      day_of_week    duration     
##  cellular :2652   may    :1378   fri:768     Min.   :   0.0  
##  telephone:1467   jul    : 711   mon:855     1st Qu.: 103.0  
##                   aug    : 636   thu:860     Median : 181.0  
##                   jun    : 530   tue:841     Mean   : 256.8  
##                   nov    : 446   wed:795     3rd Qu.: 317.0  
##                   apr    : 215               Max.   :3643.0  
##                   (Other): 203                               
##     campaign          pdays          previous             poutcome   
##  Min.   : 1.000   Min.   :  0.0   Min.   :0.0000   failure    : 454  
##  1st Qu.: 1.000   1st Qu.:999.0   1st Qu.:0.0000   nonexistent:3523  
##  Median : 2.000   Median :999.0   Median :0.0000   success    : 142  
##  Mean   : 2.537   Mean   :960.4   Mean   :0.1903                     
##  3rd Qu.: 3.000   3rd Qu.:999.0   3rd Qu.:0.0000                     
##  Max.   :35.000   Max.   :999.0   Max.   :6.0000                     
##                                                                      
##   emp.var.rate      cons.price.idx  cons.conf.idx     euribor3m    
##  Min.   :-3.40000   Min.   :92.20   Min.   :-50.8   Min.   :0.635  
##  1st Qu.:-1.80000   1st Qu.:93.08   1st Qu.:-42.7   1st Qu.:1.334  
##  Median : 1.10000   Median :93.75   Median :-41.8   Median :4.857  
##  Mean   : 0.08497   Mean   :93.58   Mean   :-40.5   Mean   :3.621  
##  3rd Qu.: 1.40000   3rd Qu.:93.99   3rd Qu.:-36.4   3rd Qu.:4.961  
##  Max.   : 1.40000   Max.   :94.77   Max.   :-26.9   Max.   :5.045  
##                                                                    
##   nr.employed     y       
##  Min.   :4964   no :3668  
##  1st Qu.:5099   yes: 451  
##  Median :5191             
##  Mean   :5166             
##  3rd Qu.:5228             
##  Max.   :5228             
## 

One thing worth noting above: pdays has a median of 999. In this dataset, 999 is a placeholder meaning the client was not previously contacted, not an actual number of days. We can also see the data in a different way, with data type information

str(BankMarketing)
## 'data.frame':    4119 obs. of  21 variables:
##  $ age           : int  30 39 25 38 47 32 32 41 31 35 ...
##  $ job           : Factor w/ 12 levels "admin.","blue-collar",..: 2 8 8 8 1 8 1 3 8 2 ...
##  $ marital       : Factor w/ 4 levels "divorced","married",..: 2 3 2 2 2 3 3 2 1 2 ...
##  $ education     : Factor w/ 8 levels "basic.4y","basic.6y",..: 3 4 4 3 7 7 7 7 6 3 ...
##  $ default       : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 1 1 1 2 1 2 ...
##  $ housing       : Factor w/ 3 levels "no","unknown",..: 3 1 3 2 3 1 3 3 1 1 ...
##  $ loan          : Factor w/ 3 levels "no","unknown",..: 1 1 1 2 1 1 1 1 1 1 ...
##  $ contact       : Factor w/ 2 levels "cellular","telephone": 1 2 2 2 1 1 1 1 1 2 ...
##  $ month         : Factor w/ 10 levels "apr","aug","dec",..: 7 7 5 5 8 10 10 8 8 7 ...
##  $ day_of_week   : Factor w/ 5 levels "fri","mon","thu",..: 1 1 5 1 2 3 2 2 4 3 ...
##  $ duration      : int  487 346 227 17 58 128 290 44 68 170 ...
##  $ campaign      : int  2 4 1 3 1 3 4 2 1 1 ...
##  $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
##  $ previous      : int  0 0 0 0 0 2 0 0 1 0 ...
##  $ poutcome      : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ emp.var.rate  : num  -1.8 1.1 1.4 1.4 -0.1 -1.1 -1.1 -0.1 -0.1 1.1 ...
##  $ cons.price.idx: num  92.9 94 94.5 94.5 93.2 ...
##  $ cons.conf.idx : num  -46.2 -36.4 -41.8 -41.8 -42 -37.5 -37.5 -42 -42 -36.4 ...
##  $ euribor3m     : num  1.31 4.86 4.96 4.96 4.19 ...
##  $ nr.employed   : num  5099 5191 5228 5228 5196 ...
##  $ y             : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Our target variable is the “y” variable at the end. It has two levels, “yes” and “no”, for whether they subscribed or not.

We can also create a table to inspect the proportion of yes/no answers.

table(BankMarketing$y)/nrow(BankMarketing)
## 
##        no       yes 
## 0.8905074 0.1094926

Now, we split the dataset into a development (60%) and a validation (40%) sample set

sample.ind <- sample(2,
                     nrow(BankMarketing),
                     replace = T,
                     prob = c(0.6, 0.4))
BankMarketing.dev <- BankMarketing[sample.ind==1,]
BankMarketing.val <- BankMarketing[sample.ind==2,]
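
One caveat: sample() draws randomly, so your split (and every number downstream) will differ between runs. A minimal sketch to make it reproducible, assuming an arbitrary seed of 123:

set.seed(123)   # any fixed seed makes the 60/40 split repeatable
sample.ind <- sample(2,
                     nrow(BankMarketing),
                     replace = T,
                     prob = c(0.6, 0.4))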

We continue by exploring the development dataset and inspecting its yes/no distribution

table(BankMarketing.dev$y)/nrow(BankMarketing.dev)
## 
##        no       yes 
## 0.8932079 0.1067921

We do the same for the validation dataset

table(BankMarketing.val$y)/nrow(BankMarketing.val)
## 
##        no       yes 
## 0.8865672 0.1134328

Make sure the type of the response variable is factor

class(BankMarketing.dev$y)
## [1] "factor"

Now, we build a formula to pass the independent variables to randomForest

varNames <- names(BankMarketing.dev)

Exclude the ID or response variable (using !)

varNames <- varNames[!varNames %in% c("y")]

Add a + sign between the explanatory variables

varNames1 <- paste(varNames, collapse = "+")

Add the response variable and convert to a formula object

rf.form <- as.formula(paste("y", varNames1, sep = " ~ "))
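
As an aside, since we kept every column except y, the same model could be specified with the dot shorthand y ~ ., which means "y against all other variables"; the explicit construction above is handy when you want to drop specific predictors:

rf.form <- y ~ .   # equivalent here, because no other columns were excluded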

Build Random Forest

BankMarketing.rf <- randomForest(rf.form,
                              BankMarketing.dev,
                              ntree=500,
                              importance=T)
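
Printing the fitted object shows the out-of-bag (OOB) error estimate and the confusion matrix the forest computed during training:

print(BankMarketing.rf)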

Plot random forest error

plot(BankMarketing.rf)
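
The plot shows how the error rate changes as trees are added: the black line is the OOB error and the other lines are the per-class errors. The curves are unlabeled by default, so a legend helps (a sketch that relies on the default matplot colors and line types):

legend("topright",
       legend = colnames(BankMarketing.rf$err.rate),   # "OOB", "no", "yes"
       col = 1:3, lty = 1:3)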

Plot Variable Importance. Variable importance tells us which variables were the main drivers in the model.

varImpPlot(BankMarketing.rf,
           sort = T,
           main = "Variable Importance",
           n.var = 5)

Create a variable importance table

var.imp <- data.frame(importance(BankMarketing.rf,
                                 type = 2))

Inspect it

var.imp

Make the row names a column and sort by importance

var.imp$Variables <- row.names(var.imp)
var.imp[order(var.imp$MeanDecreaseGini, decreasing = T),]
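
Note that type = 2 is the mean decrease in Gini impurity. Because we grew the forest with importance = T, the permutation-based measure (type = 1, mean decrease in accuracy) is also available if you prefer it:

var.imp.acc <- data.frame(importance(BankMarketing.rf,
                                     type = 1))
var.imp.acc[order(var.imp.acc$MeanDecreaseAccuracy, decreasing = T), , drop = FALSE]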

Predict the response variable for the development sample

BankMarketing.dev$predicted.response <- predict(BankMarketing.rf, BankMarketing.dev)
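
A caveat before we score this: these are predictions on the same rows the model was trained on, so they will look near-perfect. Calling predict() on a randomForest object with no new data instead returns the out-of-bag predictions, where each tree only votes on rows it did not train on, giving a more honest in-sample estimate:

oob.response <- predict(BankMarketing.rf)   # OOB predictions for the development set
table(oob.response, BankMarketing.dev$y)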

Confusion Matrix

We build a confusion matrix to help us determine the accuracy of our model.

We start by loading the necessary packages

library(e1071)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin

Build the confusion matrix

confusionMatrix(data = BankMarketing.dev$predicted.response,
                reference = BankMarketing.dev$y,
                positive = 'yes')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  2183    1
##        yes    0  260
##                                      
##                Accuracy : 0.9996     
##                  95% CI : (0.9977, 1)
##     No Information Rate : 0.8932     
##     P-Value [Acc > NIR] : <2e-16     
##                                      
##                   Kappa : 0.9979     
##  Mcnemar's Test P-Value : 1          
##                                      
##             Sensitivity : 0.9962     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 0.9995     
##              Prevalence : 0.1068     
##          Detection Rate : 0.1064     
##    Detection Prevalence : 0.1064     
##       Balanced Accuracy : 0.9981     
##                                      
##        'Positive' Class : yes        
## 
BankMarketing.val$predicted.response <- predict(BankMarketing.rf, BankMarketing.val)
confusionMatrix(data = BankMarketing.val$predicted.response,
                reference = BankMarketing.val$y,
                positive = 'yes')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  1438  112
##        yes   47   78
##                                         
##                Accuracy : 0.9051        
##                  95% CI : (0.89, 0.9187)
##     No Information Rate : 0.8866        
##     P-Value [Acc > NIR] : 0.008226      
##                                         
##                   Kappa : 0.4453        
##  Mcnemar's Test P-Value : 3.864e-07     
##                                         
##             Sensitivity : 0.41053       
##             Specificity : 0.96835       
##          Pos Pred Value : 0.62400       
##          Neg Pred Value : 0.92774       
##              Prevalence : 0.11343       
##          Detection Rate : 0.04657       
##    Detection Prevalence : 0.07463       
##       Balanced Accuracy : 0.68944       
##                                         
##        'Positive' Class : yes           
## 

What does all this mean? Our first confusion matrix is telling us that we have a 99.96 percent accuracy at predicting who would get the new service, but keep in mind that these are the same records the model was trained on. The second matrix is telling us that when we apply the model to a new set of data, the accuracy is about 90.51 percent.
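
For reference, accuracy is simply the share of correct predictions, which we can verify by hand from the validation matrix above:

(1438 + 78) / (1438 + 112 + 47 + 78)   # (correct no + correct yes) / total = 0.9051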

90.51 percent accuracy is still really good. So, if you were to target your clients based on this information, you can count on being right about 90.51 percent of the time overall; and among the clients the model flags as likely subscribers, roughly 62 percent actually subscribe (the positive predictive value above). That is better than just guessing who would get your new services or products, or going by a gut feeling alone.