Part C: ANN

Step 0: Define the Problem

We want to build an ANN model where we can put in any customer’s information and predict whether that customer is likely to buy a bank subscription or not. We had previously run a logistic regression model, but we are hoping that we can make the tool even more accurate by fitting an ANN model. This will make it so that our representatives are only calling good leads which will reduce wasted calls and improve their morale.

Step 1: Get Data

tele <- read.csv("tele.csv")
str(tele)
## 'data.frame':    41188 obs. of  22 variables:
##  $ X             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age           : int  56 57 37 40 56 45 59 41 24 25 ...
##  $ job           : chr  "housemaid" "services" "services" "admin." ...
##  $ marital       : chr  "married" "married" "married" "married" ...
##  $ education     : chr  "basic.4y" "high.school" "high.school" "basic.6y" ...
##  $ default       : chr  "no" "unknown" "no" "no" ...
##  $ housing       : chr  "no" "no" "yes" "no" ...
##  $ loan          : chr  "no" "no" "no" "no" ...
##  $ contact       : chr  "telephone" "telephone" "telephone" "telephone" ...
##  $ month         : chr  "may" "may" "may" "may" ...
##  $ day_of_week   : chr  "mon" "mon" "mon" "mon" ...
##  $ duration      : int  261 149 226 151 307 198 139 217 380 50 ...
##  $ campaign      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
##  $ previous      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome      : chr  "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
##  $ emp.var.rate  : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
##  $ cons.price.idx: num  94 94 94 94 94 ...
##  $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
##  $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ...
##  $ nr.employed   : num  5191 5191 5191 5191 5191 ...
##  $ y             : chr  "no" "no" "no" "no" ...
summary(tele)
##        X              age            job              marital         
##  Min.   :    1   Min.   :17.00   Length:41188       Length:41188      
##  1st Qu.:10298   1st Qu.:32.00   Class :character   Class :character  
##  Median :20595   Median :38.00   Mode  :character   Mode  :character  
##  Mean   :20595   Mean   :40.02                                        
##  3rd Qu.:30891   3rd Qu.:47.00                                        
##  Max.   :41188   Max.   :98.00                                        
##   education           default            housing              loan          
##  Length:41188       Length:41188       Length:41188       Length:41188      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    contact             month           day_of_week           duration     
##  Length:41188       Length:41188       Length:41188       Min.   :   0.0  
##  Class :character   Class :character   Class :character   1st Qu.: 102.0  
##  Mode  :character   Mode  :character   Mode  :character   Median : 180.0  
##                                                           Mean   : 258.3  
##                                                           3rd Qu.: 319.0  
##                                                           Max.   :4918.0  
##     campaign          pdays          previous       poutcome        
##  Min.   : 1.000   Min.   :  0.0   Min.   :0.000   Length:41188      
##  1st Qu.: 1.000   1st Qu.:999.0   1st Qu.:0.000   Class :character  
##  Median : 2.000   Median :999.0   Median :0.000   Mode  :character  
##  Mean   : 2.568   Mean   :962.5   Mean   :0.173                     
##  3rd Qu.: 3.000   3rd Qu.:999.0   3rd Qu.:0.000                     
##  Max.   :56.000   Max.   :999.0   Max.   :7.000                     
##   emp.var.rate      cons.price.idx  cons.conf.idx     euribor3m    
##  Min.   :-3.40000   Min.   :92.20   Min.   :-50.8   Min.   :0.634  
##  1st Qu.:-1.80000   1st Qu.:93.08   1st Qu.:-42.7   1st Qu.:1.344  
##  Median : 1.10000   Median :93.75   Median :-41.8   Median :4.857  
##  Mean   : 0.08189   Mean   :93.58   Mean   :-40.5   Mean   :3.621  
##  3rd Qu.: 1.40000   3rd Qu.:93.99   3rd Qu.:-36.4   3rd Qu.:4.961  
##  Max.   : 1.40000   Max.   :94.77   Max.   :-26.9   Max.   :5.045  
##   nr.employed        y            
##  Min.   :4964   Length:41188      
##  1st Qu.:5099   Class :character  
##  Median :5191   Mode  :character  
##  Mean   :5167                     
##  3rd Qu.:5228                     
##  Max.   :5228

Step 2: Clean the Data

We should get rid of unnecessary columns, factorize qualitative columns, and make sure our response variable only has values of 0 or 1.

# Get rid of unnecessary columns

tele$X <- NULL
tele$duration <- NULL
tele$pdays <- NULL

The duration of the call should not be included because we are using this model to target people. We will not have called them yet. Additionally pdays which calculates how many days since the the previous contact is completely dependent on date which is the last contact date, so pdays should be removed. It has a value of 999 if the person hasn’t been contacted before anyway, so these outliers would screw up the model.

# Columns to factor: job, marital, education, default, housing, loan, contact, month, day_of_week, poutcome

tele$job <- as.factor(tele$job)
tele$marital <- as.factor(tele$marital)
tele$education <- as.factor(tele$education)
tele$default <- as.factor(tele$default)
tele$housing <- as.factor(tele$housing)
tele$loan <- as.factor(tele$loan)
tele$contact <- as.factor(tele$contact)
tele$month <- as.factor(tele$month)
tele$day_of_week <- as.factor(tele$day_of_week)
tele$poutcome <- as.factor(tele$poutcome)

# Convert the response variable into 0 and 1
tele$y <- ifelse(tele$y == "yes", 1, 0)

You can’t run the ANN algorithm with factors. Every variable that is a factor needs to be converted into a one-hot encoding setup.

# library which makes turning factors into a one-hot encoding easy
library(fastDummies)

# remove_first_dummy = TRUE makes it so that each factor with n levels has n-1 variables
# remove_selected_columns = TRUE makes it so that the original factor columns are removed
tele_encoded <- dummy_cols(tele, remove_first_dummy = TRUE, remove_selected_columns = TRUE)

For the ANN algorithm, each variable needs to be scaled and centered.

# UDF which allows us to scale and center each variable using the minmax method
minmax <- function(x){
  (x - min(x)) / (max(x)-min(x))
}

# applies the scaling function to each column
tele_scaled <- as.data.frame(lapply(tele_encoded, minmax))

summary(tele_scaled)
##       age            campaign          previous        emp.var.rate   
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.1852   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.3333  
##  Median :0.2593   Median :0.01818   Median :0.00000   Median :0.9375  
##  Mean   :0.2842   Mean   :0.02850   Mean   :0.02471   Mean   :0.7254  
##  3rd Qu.:0.3704   3rd Qu.:0.03636   3rd Qu.:0.00000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.00000   Max.   :1.0000  
##  cons.price.idx   cons.conf.idx      euribor3m       nr.employed    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.3406   1st Qu.:0.3389   1st Qu.:0.1610   1st Qu.:0.5123  
##  Median :0.6033   Median :0.3766   Median :0.9574   Median :0.8597  
##  Mean   :0.5357   Mean   :0.4309   Mean   :0.6772   Mean   :0.7691  
##  3rd Qu.:0.6988   3rd Qu.:0.6025   3rd Qu.:0.9810   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##        y          job_blue.collar  job_entrepreneur  job_housemaid    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.0000   Median :0.00000   Median :0.00000  
##  Mean   :0.1127   Mean   :0.2247   Mean   :0.03535   Mean   :0.02574  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.00000  
##  job_management     job_retired      job_self.employed  job_services    
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.0000    Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000    1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.0000    Median :0.00000  
##  Mean   :0.07099   Mean   :0.04176   Mean   :0.0345    Mean   :0.09636  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.0000    3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.0000    Max.   :1.00000  
##   job_student      job_technician   job_unemployed     job_unknown      
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.00000   Median :0.0000   Median :0.00000   Median :0.000000  
##  Mean   :0.02124   Mean   :0.1637   Mean   :0.02462   Mean   :0.008012  
##  3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.00000   Max.   :1.000000  
##  marital_married  marital_single   marital_unknown    education_basic.6y
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.000000   Min.   :0.00000   
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000000   1st Qu.:0.00000   
##  Median :1.0000   Median :0.0000   Median :0.000000   Median :0.00000   
##  Mean   :0.6052   Mean   :0.2809   Mean   :0.001942   Mean   :0.05565   
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.000000   3rd Qu.:0.00000   
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.000000   Max.   :1.00000   
##  education_basic.9y education_high.school education_illiterate
##  Min.   :0.0000     Min.   :0.000         Min.   :0.000000    
##  1st Qu.:0.0000     1st Qu.:0.000         1st Qu.:0.000000    
##  Median :0.0000     Median :0.000         Median :0.000000    
##  Mean   :0.1468     Mean   :0.231         Mean   :0.000437    
##  3rd Qu.:0.0000     3rd Qu.:0.000         3rd Qu.:0.000000    
##  Max.   :1.0000     Max.   :1.000         Max.   :1.000000    
##  education_professional.course education_university.degree education_unknown
##  Min.   :0.0000                Min.   :0.0000              Min.   :0.00000  
##  1st Qu.:0.0000                1st Qu.:0.0000              1st Qu.:0.00000  
##  Median :0.0000                Median :0.0000              Median :0.00000  
##  Mean   :0.1273                Mean   :0.2954              Mean   :0.04203  
##  3rd Qu.:0.0000                3rd Qu.:1.0000              3rd Qu.:0.00000  
##  Max.   :1.0000                Max.   :1.0000              Max.   :1.00000  
##  default_unknown   default_yes       housing_unknown    housing_yes    
##  Min.   :0.0000   Min.   :0.00e+00   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00e+00   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.00e+00   Median :0.00000   Median :1.0000  
##  Mean   :0.2087   Mean   :7.28e-05   Mean   :0.02404   Mean   :0.5238  
##  3rd Qu.:0.0000   3rd Qu.:0.00e+00   3rd Qu.:0.00000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.00e+00   Max.   :1.00000   Max.   :1.0000  
##   loan_unknown        loan_yes      contact_telephone   month_aug   
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000    Min.   :0.00  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000    1st Qu.:0.00  
##  Median :0.00000   Median :0.0000   Median :0.0000    Median :0.00  
##  Mean   :0.02404   Mean   :0.1517   Mean   :0.3653    Mean   :0.15  
##  3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:1.0000    3rd Qu.:0.00  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000    Max.   :1.00  
##    month_dec          month_jul        month_jun        month_mar      
##  Min.   :0.000000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.000000   Median :0.0000   Median :0.0000   Median :0.00000  
##  Mean   :0.004419   Mean   :0.1742   Mean   :0.1291   Mean   :0.01326  
##  3rd Qu.:0.000000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :1.000000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##    month_may        month_nov         month_oct         month_sep      
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.3343   Mean   :0.09957   Mean   :0.01743   Mean   :0.01384  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##  day_of_week_mon  day_of_week_thu  day_of_week_tue  day_of_week_wed 
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.2067   Mean   :0.2094   Mean   :0.1964   Mean   :0.1975  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  poutcome_nonexistent poutcome_success 
##  Min.   :0.0000       Min.   :0.00000  
##  1st Qu.:1.0000       1st Qu.:0.00000  
##  Median :1.0000       Median :0.00000  
##  Mean   :0.8634       Mean   :0.03333  
##  3rd Qu.:1.0000       3rd Qu.:0.00000  
##  Max.   :1.0000       Max.   :1.00000

Step 3: Split the Data

We’ll use an 80/20 split for our training and testing datasets.

set.seed(12345)
train_rows <-sample(1:nrow(tele), .8*nrow(tele))

tele_scaled_train <- tele_scaled[train_rows, ]
tele_scaled_test <- tele_scaled[-train_rows, ]

# Check to make sure train and test sets are similar
mean(tele_scaled_train$y)
## [1] 0.1128073
mean(tele_scaled_test$y)
## [1] 0.1120418

Step 4: Build the Model

We will start by building a multi-layer ANN model with 6 nodes in the first layer and 3 nodes in the second layer

library(neuralnet)
## Warning: package 'neuralnet' was built under R version 4.4.3
# the ANN fitting process is random
set.seed(12345)

# hidden: refers to the hidden layers structure
# threshold: threshhold at which the algorithm says the weights are good enough
# stepmax: how many iterations model will do before it gives up
# lifesign: gives you progress as algorithm is running

#m1 <- neuralnet(y ~ ., data = tele_scaled_train, hidden = c(6,3), threshold = 0.05, stepmax = 1e8, lifesign = "full")

# saves the ANN model into an object so we don't have to run the above code every time
#saveRDS(m1, "ANNModelM1.rds")
m1 <- readRDS("ANNModelM1.rds")

#blue numbers are biases
plot(m1)

Strategy

We will only target people whom the model predicts as a 1 to greatly improve our chances of turning a call into a success.

Tradeoffs

MK, the Finance Manager, voiced that he wants us to capture as much of the market (potential buyers) as possible while also staying profitable. We can represent market coverage through the model’s sensitivity where sensitivity = TP / (TP + FN), and we can represent the profitability through the model’s success rate which we are defining as success rate = TP / (TP + FP).

Both of these things are a function of the discrimination threshold. As the discrimination threshold increases, success rate will increase, but unfortunately, sensitivity will decrease. Essentially, we can’t have both of these metrics be high at the same time (see the table in the Profitability section below).

To address this, we will create a weighted index which will value both of these metrics by setting weighted index = (0.5 x success rate) + (0.5 x sensitivity), and select the discrimination threshold which maximizes this sum.

Profitability

Previously Defined Functions

cost_per_1000 <- function(success_rate){
  1000 * (1+1000/(1000+100*(success_rate-11)))
}

cost <- function(success_rate, num_calls){
  cost_per_1000(success_rate) * (num_calls/1000)
}

revenue <- function(success_rate, num_calls){
  num_calls * (success_rate/100) * 10
}

net_income <- function(success_rate, num_calls){
  revenue(success_rate, num_calls) - cost(success_rate, num_calls)
}

Discrimination Threshold Profitability Schedule

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
results <- data.frame(Threshold = numeric(), Rev_Per_1000 = numeric(), Cost_Per_1000 = numeric(), Profit_Per_1000 = numeric(), SuccessRate = numeric(), Sensitivity = numeric(), Weighted_idx = numeric())

threshold <- 0.1

for (i in 1:8){
  tele_predict <- predict(m1, tele_scaled_test, type = "response")
  predict_bin <- ifelse(tele_predict < threshold, 0, 1)
  
  tele_matrix <- confusionMatrix(factor(predict_bin), factor(tele_scaled_test$y), positive =     "1")
  
  amount00 <- tele_matrix$table[1,1]
  amount01 <- tele_matrix$table[1,2]
  amount10 <- tele_matrix$table[2,1]
  amount11 <- tele_matrix$table[2,2]
  
  successRate <- (amount11/(amount11 + amount10)) * 100
  sensitivity <- (amount11/(amount11 + amount01)) * 100
  widx = (0.5 * successRate) + (0.5 * sensitivity)
  
  results <- rbind(results, data.frame(Threshold = threshold, Rev_Per_1000 =               revenue(sensitivity,1000), Cost_Per_1000 = cost(sensitivity,1000), Profit_Per_1000 =   net_income(sensitivity,1000), SuccessRate = successRate, Sensitivity = sensitivity, Weighted_idx = widx))
  
  threshold <- threshold + .1
}

results
##   Threshold Rev_Per_1000 Cost_Per_1000 Profit_Per_1000 SuccessRate Sensitivity
## 1       0.1     5861.322      1173.571       4687.7505    36.70285    58.61322
## 2       0.2     5276.273      1193.189       4083.0838    41.91050    52.76273
## 3       0.3     4637.053      1220.407       3416.6457    44.30642    46.37053
## 4       0.4     3672.806      1279.892       2392.9140    47.81382    36.72806
## 5       0.5     2578.548      1403.462       1175.0862    58.33333    25.78548
## 6       0.6     1787.649      1592.540        195.1087    66.53226    17.87649
## 7       0.7     1560.130      1684.871       -124.7405    68.89952    15.60130
## 8       0.8     1332.611      1811.286       -478.6749    71.92982    13.32611
##   Weighted_idx
## 1     47.65803
## 2     47.33661
## 3     45.33847
## 4     42.27094
## 5     42.05941
## 6     42.20437
## 7     42.25041
## 8     42.62797

As threshold increases, success rate goes up and sensitivity goes down. We will choose a discrimination threshold of ****, since this maximizes the weighted index metric, thereby accounting for both success rate and sensitivity.

Profitability for ___ Discrimination Threshold

print(results[2, ])
##   Threshold Rev_Per_1000 Cost_Per_1000 Profit_Per_1000 SuccessRate Sensitivity
## 2       0.2     5276.273      1193.189        4083.084     41.9105    52.76273
##   Weighted_idx
## 2     47.33661

With a discrimination threshold of ___, the targeted calling method has a profitability rate of **** per thousand, will capture % of people who want to purchase a bank subscription, and will increase the call success rate to %.