We want to build an ANN model where we can put in any customer’s information and predict whether that customer is likely to buy a bank subscription or not. We had previously run a logistic regression model, but we are hoping that we can make the tool even more accurate by fitting an ANN model. This will make it so that our representatives are only calling good leads which will reduce wasted calls and improve their morale.
tele <- read.csv("tele.csv")
str(tele)
## 'data.frame': 41188 obs. of 22 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ age : int 56 57 37 40 56 45 59 41 24 25 ...
## $ job : chr "housemaid" "services" "services" "admin." ...
## $ marital : chr "married" "married" "married" "married" ...
## $ education : chr "basic.4y" "high.school" "high.school" "basic.6y" ...
## $ default : chr "no" "unknown" "no" "no" ...
## $ housing : chr "no" "no" "yes" "no" ...
## $ loan : chr "no" "no" "no" "no" ...
## $ contact : chr "telephone" "telephone" "telephone" "telephone" ...
## $ month : chr "may" "may" "may" "may" ...
## $ day_of_week : chr "mon" "mon" "mon" "mon" ...
## $ duration : int 261 149 226 151 307 198 139 217 380 50 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
## $ emp.var.rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons.price.idx: num 94 94 94 94 94 ...
## $ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
## $ nr.employed : num 5191 5191 5191 5191 5191 ...
## $ y : chr "no" "no" "no" "no" ...
summary(tele)
## X age job marital
## Min. : 1 Min. :17.00 Length:41188 Length:41188
## 1st Qu.:10298 1st Qu.:32.00 Class :character Class :character
## Median :20595 Median :38.00 Mode :character Mode :character
## Mean :20595 Mean :40.02
## 3rd Qu.:30891 3rd Qu.:47.00
## Max. :41188 Max. :98.00
## education default housing loan
## Length:41188 Length:41188 Length:41188 Length:41188
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## contact month day_of_week duration
## Length:41188 Length:41188 Length:41188 Min. : 0.0
## Class :character Class :character Class :character 1st Qu.: 102.0
## Mode :character Mode :character Mode :character Median : 180.0
## Mean : 258.3
## 3rd Qu.: 319.0
## Max. :4918.0
## campaign pdays previous poutcome
## Min. : 1.000 Min. : 0.0 Min. :0.000 Length:41188
## 1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.000 Class :character
## Median : 2.000 Median :999.0 Median :0.000 Mode :character
## Mean : 2.568 Mean :962.5 Mean :0.173
## 3rd Qu.: 3.000 3rd Qu.:999.0 3rd Qu.:0.000
## Max. :56.000 Max. :999.0 Max. :7.000
## emp.var.rate cons.price.idx cons.conf.idx euribor3m
## Min. :-3.40000 Min. :92.20 Min. :-50.8 Min. :0.634
## 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7 1st Qu.:1.344
## Median : 1.10000 Median :93.75 Median :-41.8 Median :4.857
## Mean : 0.08189 Mean :93.58 Mean :-40.5 Mean :3.621
## 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4 3rd Qu.:4.961
## Max. : 1.40000 Max. :94.77 Max. :-26.9 Max. :5.045
## nr.employed y
## Min. :4964 Length:41188
## 1st Qu.:5099 Class :character
## Median :5191 Mode :character
## Mean :5167
## 3rd Qu.:5228
## Max. :5228
We should get rid of unnecessary columns, factorize qualitative columns, and make sure our response variable only has values of 0 or 1.
# Get rid of unnecessary columns
tele$X <- NULL
tele$duration <- NULL
tele$pdays <- NULL
The duration of the call should not be included because we are using this model to target people. We will not have called them yet. Additionally pdays which calculates how many days since the the previous contact is completely dependent on date which is the last contact date, so pdays should be removed. It has a value of 999 if the person hasn’t been contacted before anyway, so these outliers would screw up the model.
# Columns to factor: job, marital, education, default, housing, loan, contact, month, day_of_week, poutcome
tele$job <- as.factor(tele$job)
tele$marital <- as.factor(tele$marital)
tele$education <- as.factor(tele$education)
tele$default <- as.factor(tele$default)
tele$housing <- as.factor(tele$housing)
tele$loan <- as.factor(tele$loan)
tele$contact <- as.factor(tele$contact)
tele$month <- as.factor(tele$month)
tele$day_of_week <- as.factor(tele$day_of_week)
tele$poutcome <- as.factor(tele$poutcome)
# Convert the response variable into 0 and 1
tele$y <- ifelse(tele$y == "yes", 1, 0)
You can’t run the ANN algorithm with factors. Every variable that is a factor needs to be converted into a one-hot encoding setup.
# library which makes turning factors into a one-hot encoding easy
library(fastDummies)
# remove_first_dummy = TRUE makes it so that each factor with n levels has n-1 variables
# remove_selected_columns = TRUE makes it so that the original factor columns are removed
tele_encoded <- dummy_cols(tele, remove_first_dummy = TRUE, remove_selected_columns = TRUE)
For the ANN algorithm, each variable needs to be scaled and centered.
# UDF which allows us to scale and center each variable using the minmax method
minmax <- function(x){
(x - min(x)) / (max(x)-min(x))
}
# applies the scaling function to each column
tele_scaled <- as.data.frame(lapply(tele_encoded, minmax))
summary(tele_scaled)
## age campaign previous emp.var.rate
## Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.1852 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.3333
## Median :0.2593 Median :0.01818 Median :0.00000 Median :0.9375
## Mean :0.2842 Mean :0.02850 Mean :0.02471 Mean :0.7254
## 3rd Qu.:0.3704 3rd Qu.:0.03636 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.0000
## cons.price.idx cons.conf.idx euribor3m nr.employed
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.3406 1st Qu.:0.3389 1st Qu.:0.1610 1st Qu.:0.5123
## Median :0.6033 Median :0.3766 Median :0.9574 Median :0.8597
## Mean :0.5357 Mean :0.4309 Mean :0.6772 Mean :0.7691
## 3rd Qu.:0.6988 3rd Qu.:0.6025 3rd Qu.:0.9810 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## y job_blue.collar job_entrepreneur job_housemaid
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.00000
## Mean :0.1127 Mean :0.2247 Mean :0.03535 Mean :0.02574
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.00000
## job_management job_retired job_self.employed job_services
## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.0000 Median :0.00000
## Mean :0.07099 Mean :0.04176 Mean :0.0345 Mean :0.09636
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.00000
## job_student job_technician job_unemployed job_unknown
## Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.00000 Median :0.0000 Median :0.00000 Median :0.000000
## Mean :0.02124 Mean :0.1637 Mean :0.02462 Mean :0.008012
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.0000 Max. :1.00000 Max. :1.000000
## marital_married marital_single marital_unknown education_basic.6y
## Min. :0.0000 Min. :0.0000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000000 1st Qu.:0.00000
## Median :1.0000 Median :0.0000 Median :0.000000 Median :0.00000
## Mean :0.6052 Mean :0.2809 Mean :0.001942 Mean :0.05565
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.000000 Max. :1.00000
## education_basic.9y education_high.school education_illiterate
## Min. :0.0000 Min. :0.000 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.000000
## Median :0.0000 Median :0.000 Median :0.000000
## Mean :0.1468 Mean :0.231 Mean :0.000437
## 3rd Qu.:0.0000 3rd Qu.:0.000 3rd Qu.:0.000000
## Max. :1.0000 Max. :1.000 Max. :1.000000
## education_professional.course education_university.degree education_unknown
## Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.00000
## Mean :0.1273 Mean :0.2954 Mean :0.04203
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.00000
## default_unknown default_yes housing_unknown housing_yes
## Min. :0.0000 Min. :0.00e+00 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00e+00 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.00e+00 Median :0.00000 Median :1.0000
## Mean :0.2087 Mean :7.28e-05 Mean :0.02404 Mean :0.5238
## 3rd Qu.:0.0000 3rd Qu.:0.00e+00 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.00e+00 Max. :1.00000 Max. :1.0000
## loan_unknown loan_yes contact_telephone month_aug
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.00
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00
## Median :0.00000 Median :0.0000 Median :0.0000 Median :0.00
## Mean :0.02404 Mean :0.1517 Mean :0.3653 Mean :0.15
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.00
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.00
## month_dec month_jul month_jun month_mar
## Min. :0.000000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.000000 Median :0.0000 Median :0.0000 Median :0.00000
## Mean :0.004419 Mean :0.1742 Mean :0.1291 Mean :0.01326
## 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## month_may month_nov month_oct month_sep
## Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.3343 Mean :0.09957 Mean :0.01743 Mean :0.01384
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.00000
## day_of_week_mon day_of_week_thu day_of_week_tue day_of_week_wed
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.2067 Mean :0.2094 Mean :0.1964 Mean :0.1975
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## poutcome_nonexistent poutcome_success
## Min. :0.0000 Min. :0.00000
## 1st Qu.:1.0000 1st Qu.:0.00000
## Median :1.0000 Median :0.00000
## Mean :0.8634 Mean :0.03333
## 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000
We’ll use an 80/20 split for our training and testing datasets.
set.seed(12345)
train_rows <-sample(1:nrow(tele), .8*nrow(tele))
tele_scaled_train <- tele_scaled[train_rows, ]
tele_scaled_test <- tele_scaled[-train_rows, ]
# Check to make sure train and test sets are similar
mean(tele_scaled_train$y)
## [1] 0.1128073
mean(tele_scaled_test$y)
## [1] 0.1120418
We will start by building a multi-layer ANN model with 6 nodes in the first layer and 3 nodes in the second layer
library(neuralnet)
## Warning: package 'neuralnet' was built under R version 4.4.3
# the ANN fitting process is random
set.seed(12345)
# hidden: refers to the hidden layers structure
# threshold: threshhold at which the algorithm says the weights are good enough
# stepmax: how many iterations model will do before it gives up
# lifesign: gives you progress as algorithm is running
#m1 <- neuralnet(y ~ ., data = tele_scaled_train, hidden = c(6,3), threshold = 0.05, stepmax = 1e8, lifesign = "full")
# saves the ANN model into an object so we don't have to run the above code every time
#saveRDS(m1, "ANNModelM1.rds")
m1 <- readRDS("ANNModelM1.rds")
#blue numbers are biases
plot(m1)
We will only target people whom the model predicts as a 1 to greatly improve our chances of turning a call into a success.
MK, the Finance Manager, voiced that he wants us to capture as much of the market (potential buyers) as possible while also staying profitable. We can represent market coverage through the model’s sensitivity where sensitivity = TP / (TP + FN), and we can represent the profitability through the model’s success rate which we are defining as success rate = TP / (TP + FP).
Both of these things are a function of the discrimination threshold. As the discrimination threshold increases, success rate will increase, but unfortunately, sensitivity will decrease. Essentially, we can’t have both of these metrics be high at the same time (see the table in the Profitability section below).
To address this, we will create a weighted index which will value both of these metrics by setting weighted index = (0.5 x success rate) + (0.5 x sensitivity), and select the discrimination threshold which maximizes this sum.
Previously Defined Functions
cost_per_1000 <- function(success_rate){
1000 * (1+1000/(1000+100*(success_rate-11)))
}
cost <- function(success_rate, num_calls){
cost_per_1000(success_rate) * (num_calls/1000)
}
revenue <- function(success_rate, num_calls){
num_calls * (success_rate/100) * 10
}
net_income <- function(success_rate, num_calls){
revenue(success_rate, num_calls) - cost(success_rate, num_calls)
}
Discrimination Threshold Profitability Schedule
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
results <- data.frame(Threshold = numeric(), Rev_Per_1000 = numeric(), Cost_Per_1000 = numeric(), Profit_Per_1000 = numeric(), SuccessRate = numeric(), Sensitivity = numeric(), Weighted_idx = numeric())
threshold <- 0.1
for (i in 1:8){
tele_predict <- predict(m1, tele_scaled_test, type = "response")
predict_bin <- ifelse(tele_predict < threshold, 0, 1)
tele_matrix <- confusionMatrix(factor(predict_bin), factor(tele_scaled_test$y), positive = "1")
amount00 <- tele_matrix$table[1,1]
amount01 <- tele_matrix$table[1,2]
amount10 <- tele_matrix$table[2,1]
amount11 <- tele_matrix$table[2,2]
successRate <- (amount11/(amount11 + amount10)) * 100
sensitivity <- (amount11/(amount11 + amount01)) * 100
widx = (0.5 * successRate) + (0.5 * sensitivity)
results <- rbind(results, data.frame(Threshold = threshold, Rev_Per_1000 = revenue(sensitivity,1000), Cost_Per_1000 = cost(sensitivity,1000), Profit_Per_1000 = net_income(sensitivity,1000), SuccessRate = successRate, Sensitivity = sensitivity, Weighted_idx = widx))
threshold <- threshold + .1
}
results
## Threshold Rev_Per_1000 Cost_Per_1000 Profit_Per_1000 SuccessRate Sensitivity
## 1 0.1 5861.322 1173.571 4687.7505 36.70285 58.61322
## 2 0.2 5276.273 1193.189 4083.0838 41.91050 52.76273
## 3 0.3 4637.053 1220.407 3416.6457 44.30642 46.37053
## 4 0.4 3672.806 1279.892 2392.9140 47.81382 36.72806
## 5 0.5 2578.548 1403.462 1175.0862 58.33333 25.78548
## 6 0.6 1787.649 1592.540 195.1087 66.53226 17.87649
## 7 0.7 1560.130 1684.871 -124.7405 68.89952 15.60130
## 8 0.8 1332.611 1811.286 -478.6749 71.92982 13.32611
## Weighted_idx
## 1 47.65803
## 2 47.33661
## 3 45.33847
## 4 42.27094
## 5 42.05941
## 6 42.20437
## 7 42.25041
## 8 42.62797
As threshold increases, success rate goes up and sensitivity goes down. We will choose a discrimination threshold of ****, since this maximizes the weighted index metric, thereby accounting for both success rate and sensitivity.
Profitability for ___ Discrimination Threshold
print(results[2, ])
## Threshold Rev_Per_1000 Cost_Per_1000 Profit_Per_1000 SuccessRate Sensitivity
## 2 0.2 5276.273 1193.189 4083.084 41.9105 52.76273
## Weighted_idx
## 2 47.33661
With a discrimination threshold of ___, the targeted calling method has a profitability rate of **** per thousand, will capture % of people who want to purchase a bank subscription, and will increase the call success rate to %.