Imagine this is your business and you want to predict whether your clients will subscribe to a new service or product your team has been developing for a while. You have launched it to the market, but it is not getting enough sales. What can you do so all those hours and money invested in product development don’t go to waste?
I am choosing a bank marketing dataset for this example, but this could be your business too.
We start by loading the necessary packages
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
Then we load our dataset
BankMarketing <- read.csv("C:/Users/nunre/OneDrive/Documents/Business Consulting/Master Data Analytics/IT 697 R/Module 6/termCrosssell/bank-additional/bankmarketing.csv", sep = ";", header = TRUE, stringsAsFactors = TRUE)
Explore the dataset
The purpose of exploration is to give us an understanding of our dataset: the variables it contains, whether there are any missing values or outliers, how the values are distributed, and so on. We also obtain information from consulting with you, our client, to help us understand the ultimate goal of the analysis. In this case, the goal is to predict which clients will subscribe to the new service.
We start by looking at the variables’ or columns’ names to have a better understanding of our dataset.
names(BankMarketing)
## [1] "age" "job" "marital" "education"
## [5] "default" "housing" "loan" "contact"
## [9] "month" "day_of_week" "duration" "campaign"
## [13] "pdays" "previous" "poutcome" "emp.var.rate"
## [17] "cons.price.idx" "cons.conf.idx" "euribor3m" "nr.employed"
## [21] "y"
We use summary to get the distribution of values within variables.
summary(BankMarketing)
## age job marital
## Min. :18.00 admin. :1012 divorced: 446
## 1st Qu.:32.00 blue-collar: 884 married :2509
## Median :38.00 technician : 691 single :1153
## Mean :40.11 services : 393 unknown : 11
## 3rd Qu.:47.00 management : 324
## Max. :88.00 retired : 166
## (Other) : 649
## education default housing loan
## university.degree :1264 no :3315 no :1839 no :3349
## high.school : 921 unknown: 803 unknown: 105 unknown: 105
## basic.9y : 574 yes : 1 yes :2175 yes : 665
## professional.course: 535
## basic.4y : 429
## basic.6y : 228
## (Other) : 168
## contact month day_of_week duration
## cellular :2652 may :1378 fri:768 Min. : 0.0
## telephone:1467 jul : 711 mon:855 1st Qu.: 103.0
## aug : 636 thu:860 Median : 181.0
## jun : 530 tue:841 Mean : 256.8
## nov : 446 wed:795 3rd Qu.: 317.0
## apr : 215 Max. :3643.0
## (Other): 203
## campaign pdays previous poutcome
## Min. : 1.000 Min. : 0.0 Min. :0.0000 failure : 454
## 1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.0000 nonexistent:3523
## Median : 2.000 Median :999.0 Median :0.0000 success : 142
## Mean : 2.537 Mean :960.4 Mean :0.1903
## 3rd Qu.: 3.000 3rd Qu.:999.0 3rd Qu.:0.0000
## Max. :35.000 Max. :999.0 Max. :6.0000
##
## emp.var.rate cons.price.idx cons.conf.idx euribor3m
## Min. :-3.40000 Min. :92.20 Min. :-50.8 Min. :0.635
## 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7 1st Qu.:1.334
## Median : 1.10000 Median :93.75 Median :-41.8 Median :4.857
## Mean : 0.08497 Mean :93.58 Mean :-40.5 Mean :3.621
## 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4 3rd Qu.:4.961
## Max. : 1.40000 Max. :94.77 Max. :-26.9 Max. :5.045
##
## nr.employed y
## Min. :4964 no :3668
## 1st Qu.:5099 yes: 451
## Median :5191
## Mean :5166
## 3rd Qu.:5228
## Max. :5228
##
We can also view the structure of the dataset, which shows the data type of each variable
str(BankMarketing)
## 'data.frame': 4119 obs. of 21 variables:
## $ age : int 30 39 25 38 47 32 32 41 31 35 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 2 8 8 8 1 8 1 3 8 2 ...
## $ marital : Factor w/ 4 levels "divorced","married",..: 2 3 2 2 2 3 3 2 1 2 ...
## $ education : Factor w/ 8 levels "basic.4y","basic.6y",..: 3 4 4 3 7 7 7 7 6 3 ...
## $ default : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 1 1 1 2 1 2 ...
## $ housing : Factor w/ 3 levels "no","unknown",..: 3 1 3 2 3 1 3 3 1 1 ...
## $ loan : Factor w/ 3 levels "no","unknown",..: 1 1 1 2 1 1 1 1 1 1 ...
## $ contact : Factor w/ 2 levels "cellular","telephone": 1 2 2 2 1 1 1 1 1 2 ...
## $ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 5 5 8 10 10 8 8 7 ...
## $ day_of_week : Factor w/ 5 levels "fri","mon","thu",..: 1 1 5 1 2 3 2 2 4 3 ...
## $ duration : int 487 346 227 17 58 128 290 44 68 170 ...
## $ campaign : int 2 4 1 3 1 3 4 2 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 2 0 0 1 0 ...
## $ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ emp.var.rate : num -1.8 1.1 1.4 1.4 -0.1 -1.1 -1.1 -0.1 -0.1 1.1 ...
## $ cons.price.idx: num 92.9 94 94.5 94.5 93.2 ...
## $ cons.conf.idx : num -46.2 -36.4 -41.8 -41.8 -42 -37.5 -37.5 -42 -42 -36.4 ...
## $ euribor3m : num 1.31 4.86 4.96 4.96 4.19 ...
## $ nr.employed : num 5099 5191 5228 5228 5196 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
Our target variable is the “y” variable at the end. It has two levels, “yes” and “no”, for whether the client subscribed or not.
We can create a table to inspect the proportion of yes/no answers too.
table(BankMarketing$y)/nrow(BankMarketing)
##
## no yes
## 0.8905074 0.1094926
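The same proportions can also be obtained with prop.table(), which divides each count by the table total. The sketch below is a minimal illustration on a made-up vector (toy.y is hypothetical, not the bank data), built from the yes/no counts reported by summary() above.

```r
# Toy response vector standing in for BankMarketing$y (hypothetical data,
# built from the counts shown in the summary output: 3668 "no", 451 "yes")
toy.y <- factor(rep(c("no", "yes"), times = c(3668, 451)))

# Counts per class
class.counts <- table(toy.y)

# Proportions per class; equivalent to table(toy.y) / length(toy.y)
class.props <- prop.table(class.counts)
print(round(class.props, 4))
```

With only about 11 percent of clients in the “yes” class, the dataset is imbalanced, which is worth keeping in mind when reading accuracy figures later on.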
Now, we split the dataset at random into a development sample (about 60 percent of the rows) and a validation sample (about 40 percent)
sample.ind <- sample(2,
                     nrow(BankMarketing),
                     replace = TRUE,
                     prob = c(0.6, 0.4))
BankMarketing.dev <- BankMarketing[sample.ind==1,]
BankMarketing.val <- BankMarketing[sample.ind==2,]
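Because sample() draws at random, every run produces a different split, so the proportions and accuracy figures will vary slightly from run to run. A minimal sketch of making the split repeatable with set.seed() follows; the seed value 42 and the toy data frame are arbitrary assumptions for illustration, not part of the original analysis.

```r
# Hypothetical toy data frame standing in for BankMarketing
toy.df <- data.frame(id = 1:1000,
                     y  = factor(rep(c("no", "yes"), times = c(890, 110))))

# Fix the random number generator state so the split can be repeated
# (42 is an arbitrary choice)
set.seed(42)
split.a <- sample(2, nrow(toy.df), replace = TRUE, prob = c(0.6, 0.4))

# Resetting the same seed reproduces exactly the same assignment
set.seed(42)
split.b <- sample(2, nrow(toy.df), replace = TRUE, prob = c(0.6, 0.4))

toy.dev <- toy.df[split.a == 1, ]
toy.val <- toy.df[split.a == 2, ]

identical(split.a, split.b)    # TRUE
nrow(toy.dev) + nrow(toy.val)  # 1000
```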
We continue by exploring the development dataset and inspecting its yes/no distribution
table(BankMarketing.dev$y)/nrow(BankMarketing.dev)
##
## no yes
## 0.8932079 0.1067921
We do the same for the validation dataset
table(BankMarketing.val$y)/nrow(BankMarketing.val)
##
## no yes
## 0.8865672 0.1134328
Make sure the response variable is a factor, so that randomForest performs classification rather than regression
class(BankMarketing.dev$y)
## [1] "factor"
Now, we build a formula that tells randomForest which independent variables to use
varNames <- names(BankMarketing.dev)
Exclude the ID or response variable (using !)
varNames <- varNames[!varNames %in% c("y")]
Add a + sign between the explanatory variables
varNames1 <- paste(varNames, collapse = "+")
Add the response variable and convert to a formula object
rf.form <- as.formula(paste("y", varNames1, sep = " ~ "))
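The same formula can be built in one step. The sketch below, on a small illustrative subset of the column names, uses base R's reformulate() and checks that it yields the same terms as the paste-and-convert approach above. The names toy.names, form.paste and form.reform are hypothetical.

```r
# A few of the dataset's column names, used here only for illustration
toy.names <- c("age", "job", "marital", "y")

# Approach from the text: drop the response, join with "+", convert
predictors <- toy.names[!toy.names %in% "y"]
form.paste <- as.formula(paste("y", paste(predictors, collapse = "+"), sep = " ~ "))

# Base R helper: builds the formula from term labels and a response name
form.reform <- reformulate(termlabels = predictors, response = "y")
print(form.reform)

# Both formulas contain the same predictor terms
same.terms <- identical(attr(terms(form.paste), "term.labels"),
                        attr(terms(form.reform), "term.labels"))
print(same.terms)
```

When every remaining column is a predictor, as here, the shorthand y ~ . passed together with the data frame is another option.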
Build Random Forest
BankMarketing.rf <- randomForest(rf.form,
                                 data = BankMarketing.dev,
                                 ntree = 500,
                                 importance = TRUE)
Plot random forest error
plot(BankMarketing.rf)
Plot Variable Importance. Variable importance tells us which variables were the main drivers in the model.
varImpPlot(BankMarketing.rf,
sort = T,
main = "Variable Importance",
n.var = 5)
Create a variable importance table (type = 2 returns the mean decrease in node impurity, measured by the Gini index)
var.imp <- data.frame(importance(BankMarketing.rf,
                                 type = 2))
Inspect it
var.imp
Add the row names as a column, then sort the table by importance
var.imp$Variables <- row.names(var.imp)
var.imp[order(var.imp$MeanDecreaseGini, decreasing = TRUE),]
Predict the response variable for the development sample
BankMarketing.dev$predicted.response <- predict(BankMarketing.rf, BankMarketing.dev)
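Predictions made on the same rows the forest was trained on tend to look better than the model really is. randomForest also keeps out-of-bag (OOB) predictions, where each row is predicted only by trees that did not see it during training, and predict() returns these when no new data is supplied. A minimal sketch on R's built-in iris data (an assumption for illustration, not the bank data) shows the two side by side:

```r
library(randomForest)

set.seed(1)  # arbitrary seed so the sketch is repeatable
toy.rf <- randomForest(Species ~ ., data = iris, ntree = 100)

# In-sample predictions: the rows were also used to grow the trees
pred.insample <- predict(toy.rf, iris)

# Out-of-bag predictions: returned when newdata is omitted
pred.oob <- predict(toy.rf)

# Compare the two accuracy estimates
acc.insample <- mean(pred.insample == iris$Species)
acc.oob      <- mean(pred.oob == iris$Species)
print(c(in.sample = acc.insample, out.of.bag = acc.oob))
```

The out-of-bag figure is usually the more realistic of the two, and it comes at no extra cost because it is computed while the forest is built.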
Confusion Matrix
We build a confusion matrix to help us determine the accuracy of our model.
We start by loading the necessary packages
library(e1071)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
Build the confusion matrix
confusionMatrix(data = BankMarketing.dev$predicted.response,
reference = BankMarketing.dev$y,
positive = 'yes')
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 2183 1
## yes 0 260
##
## Accuracy : 0.9996
## 95% CI : (0.9977, 1)
## No Information Rate : 0.8932
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9979
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9962
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9995
## Prevalence : 0.1068
## Detection Rate : 0.1064
## Detection Prevalence : 0.1064
## Balanced Accuracy : 0.9981
##
## 'Positive' Class : yes
##
We then predict the response for the validation sample and build its confusion matrix
BankMarketing.val$predicted.response <- predict(BankMarketing.rf, BankMarketing.val)
confusionMatrix(data = BankMarketing.val$predicted.response,
reference = BankMarketing.val$y,
positive = 'yes')
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 1438 112
## yes 47 78
##
## Accuracy : 0.9051
## 95% CI : (0.89, 0.9187)
## No Information Rate : 0.8866
## P-Value [Acc > NIR] : 0.008226
##
## Kappa : 0.4453
## Mcnemar's Test P-Value : 3.864e-07
##
## Sensitivity : 0.41053
## Specificity : 0.96835
## Pos Pred Value : 0.62400
## Neg Pred Value : 0.92774
## Prevalence : 0.11343
## Detection Rate : 0.04657
## Detection Prevalence : 0.07463
## Balanced Accuracy : 0.68944
##
## 'Positive' Class : yes
##
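The headline statistics in the validation output can be recomputed by hand from the four cell counts, which makes it easier to see what each one measures. The sketch below copies the counts from the validation confusion matrix above; the variable names are illustrative only.

```r
# Counts copied from the validation confusion matrix above
tp <- 78    # predicted yes, actually yes
fp <- 47    # predicted yes, actually no
fn <- 112   # predicted no, actually yes
tn <- 1438  # predicted no, actually no
total <- tp + fp + fn + tn

accuracy    <- (tp + tn) / total  # share of all predictions that are correct
sensitivity <- tp / (tp + fn)     # share of actual subscribers the model finds
specificity <- tn / (tn + fp)     # share of non-subscribers correctly rejected
ppv         <- tp / (tp + fp)     # share of predicted subscribers who subscribed
nir         <- (tn + fp) / total  # accuracy of always predicting "no"

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv, nir = nir), 4))
```

These values match the Accuracy, Sensitivity, Specificity, Pos Pred Value and No Information Rate lines reported by confusionMatrix.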
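What does all this mean? Our first confusion matrix tells us that the model is 99.96 percent accurate at predicting who would get the new service in the development sample. Because the model was trained on that same data, this figure is optimistic. The second matrix tells us that when we apply the model to data it has not seen, the accuracy is about 90.51 percent.
An accuracy of 90.51 percent is still good, although it should be read alongside the no-information rate of 88.66 percent, which is what we would get by predicting “no” for every client. The more useful numbers for targeting are these: of the clients the model flagged as likely to subscribe, 62.4 percent actually did, compared with about 11 percent of clients overall, and the model found about 41 percent of all subscribers. So, if you were to target your clients based on this information, you would reach far more likely buyers per contact. That is better than just guessing who would get your new services or products, or going by a gut feeling alone.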