Introduction

A Portuguese bank seeks to increase its subscriber base. In pursuit of this goal, the institution conducts campaigns in which potential clients are contacted and their information is collected. The bank aims to predict whether an individual will subscribe based on the person’s demographic information and data from previous contacts with that client. The goal of this analysis is to produce a statistical classification model that predicts whether or not a client will subscribe to a term deposit.

# Loading in the necessary libraries

library("ggplot2")
library("data.table")
library("plotly")
library("stats")
library("gridExtra")
library("caret")
library("cvAUC")
library("caTools")
library("LiblineaR")
library("party")
library("randomForest")
library("e1071")
library("class")
library("xgboost")
library("mlr")
library("scales")
library("ROCR")
library("doParallel")
library("openxlsx")
library("dplyr")
theme_set(theme_gray())
registerDoParallel(4)
# Loading in the data sets

setwd('/Users/mareksalamon/Desktop/School/Hunter/Fall Semester 2018/Data Analysis/Bank Marketing Data Analysis')

train <- read.csv('train_data.csv', na.strings=c("","NA"))
test <- read.csv('test_data.csv', na.strings=c("","NA"))

Data Exploration

Now that we have loaded our data, let’s explore it. Ideally, we want to explore only our training dataset, in order to prevent the analyst from introducing any subjective bias. However, if the training set requires cleaning, the test set will likely require the same cleaning, in which case it will need to be explored as well.

The test dataset serves mainly as a collection of data to test the ability of our final model to predict/classify previously unseen, ‘future’ data.

# First few rows of the dataframe
head(train)
##   age           job marital education default balance housing loan
## 1  39      services married secondary      no       0     yes   no
## 2  52    management married  tertiary      no    1646      no  yes
## 3  40 self-employed  single   unknown      no     109     yes   no
## 4  48    management married   primary      no     427      no   no
## 5  46    technician married  tertiary      no    2207     yes   no
## 6  33   blue-collar married secondary      no    2254     yes   no
##    contact day month duration campaign pdays previous poutcome  y
## 1  unknown  20   may      793        2    -1        0  unknown no
## 2 cellular   9   feb       92        3    -1        0  unknown no
## 3 cellular   4   feb      258        1    -1        0  unknown no
## 4 cellular  17   nov       62        1    -1        0  unknown no
## 5  unknown  16   may      243        3    -1        0  unknown no
## 6 cellular  17   apr      130        1   148        1  failure no

Let us make sure that the test dataset at least possesses the same features as our train set.

colnames(test)
##  [1] "age"       "job"       "marital"   "education" "default"  
##  [6] "balance"   "housing"   "loan"      "contact"   "day"      
## [11] "month"     "duration"  "campaign"  "pdays"     "previous" 
## [16] "poutcome"  "id"
# dimensions of dataframe
dim(train)
## [1] 31647    17

The training dataset contains a total of 31,647 observations and 17 features. Each feature and its corresponding meaning is listed below:

Feature Glossary

  • Client Information
    • age - age of client
    • job - type of job held by client
    • marital - marital status of client
    • education - highest level of education completed by client
    • default - has the client ever defaulted on previous debts?
    • balance - client’s average yearly balance, in euros
    • housing - does client possess a housing loan?
    • loan - does client possess a personal loan?
  • Information related to the last contact of the client during the current campaign
    • contact - communication type
    • month - month of year
    • day - day (of the month) that the client was contacted
    • duration - contact duration in seconds.
  • Miscellaneous Attributes
    • campaign - number of contacts performed during this campaign and for this client
    • pdays - number of days that passed by after the client was last contacted from a previous campaign
    • previous - number of contacts performed before this campaign and for this client
    • poutcome - outcome of the previous marketing campaign for this client
    • y - has the client subscribed to a term deposit? (dependent variable)
# Percent of total data that is made up of missing/NA values:
(sum(is.na(train))/(nrow(train)*ncol(train)))*100
## [1] 0

It seems there are no ‘NA’ values in our dataset, signaling that no observation is missing information. It is still possible that values other than ‘NA’ (such as ‘unknown’) are being used to signify missing information, in which case it is best to include them as a separate categorical level of the feature. Even missing information may prove valuable in and of itself.
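
As a quick sanity check, we can count how often the placeholder level 'unknown' appears in each column; this is a minimal sketch which assumes 'unknown' is the only non-NA sentinel used for missing information in this dataset (its output is not reproduced here).

# Sketch: percent of 'unknown' entries per feature (assumes 'unknown' is the
# only non-NA placeholder for missing information)
unknown.counts <- sapply(train, function(col) sum(col == "unknown"))
round(100 * unknown.counts[unknown.counts > 0] / nrow(train), 2)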

# Get list of features with only one unique value (including NAs)

static.list <- character(dim(train)[2])
i <- 1

for(col in colnames(train)){
  if(length(unique(train[[col]])) == 1){
    static.list[i] <- col
    i <- i + 1
  }
}

static.list <- static.list[static.list != ""]
static.list
## character(0)

Static columns have only a single value for all observations in the data, providing little to no additional information or predictability to the final model. For this reason, such variables should be excluded from the final model. However, it appears we do not have any static features in our dataset.

# Get data type of each column
str(train)
## 'data.frame':    31647 obs. of  17 variables:
##  $ age      : int  39 52 40 48 46 33 49 34 36 28 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 8 5 7 5 10 2 2 2 10 2 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 2 2 2 3 2 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 2 3 4 1 3 2 2 2 2 2 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ balance  : int  0 1646 109 427 2207 2254 171 103 5436 373 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 2 1 2 1 2 2 1 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 2 1 1 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 1 1 1 3 1 1 3 3 1 ...
##  $ day      : int  20 9 4 17 16 17 4 8 20 14 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 4 4 10 9 1 4 9 9 9 ...
##  $ duration : int  793 92 258 62 243 130 19 956 443 10 ...
##  $ campaign : int  2 3 1 1 3 1 3 2 1 9 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 148 175 -1 -1 344 ...
##  $ previous : int  0 0 0 0 0 1 2 0 0 1 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 1 2 4 4 1 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Based on the structure of the data, all of the features take on reasonable data types and possess valid values. Later on, we will take a closer look at the features and determine whether any should be changed from numeric to categorical or vice versa.

# Summary statistics of each column
summary(train)
##       age                 job           marital          education    
##  Min.   :18.00   blue-collar:6860   divorced: 3616   primary  : 4788  
##  1st Qu.:33.00   management :6617   married :19032   secondary:16216  
##  Median :39.00   technician :5358   single  : 8999   tertiary : 9308  
##  Mean   :41.01   admin.     :3619                    unknown  : 1335  
##  3rd Qu.:49.00   services   :2856                                     
##  Max.   :94.00   retired    :1592                                     
##                  (Other)    :4745                                     
##  default        balance       housing      loan            contact     
##  no :31079   Min.   : -6847   no :14079   no :26629   cellular :20444  
##  yes:  568   1st Qu.:    72   yes:17568   yes: 5018   telephone: 2035  
##              Median :   450                           unknown  : 9168  
##              Mean   :  1351                                            
##              3rd Qu.:  1420                                            
##              Max.   :102127                                            
##                                                                        
##       day            month         duration         campaign     
##  Min.   : 1.00   may    :9678   Min.   :   0.0   Min.   : 1.000  
##  1st Qu.: 8.00   jul    :4827   1st Qu.: 103.0   1st Qu.: 1.000  
##  Median :16.00   aug    :4379   Median : 179.0   Median : 2.000  
##  Mean   :15.81   jun    :3782   Mean   : 258.3   Mean   : 2.776  
##  3rd Qu.:21.00   nov    :2745   3rd Qu.: 317.0   3rd Qu.: 3.000  
##  Max.   :31.00   apr    :2047   Max.   :4918.0   Max.   :63.000  
##                  (Other):4189                                    
##      pdays           previous          poutcome       y        
##  Min.   : -1.00   Min.   : 0.0000   failure: 3443   no :27961  
##  1st Qu.: -1.00   1st Qu.: 0.0000   other  : 1281   yes: 3686  
##  Median : -1.00   Median : 0.0000   success: 1039              
##  Mean   : 40.06   Mean   : 0.5809   unknown:25884              
##  3rd Qu.: -1.00   3rd Qu.: 0.0000                              
##  Max.   :871.00   Max.   :58.0000                              
## 

Above, we have a summary of each feature’s min/median/mean/max values, if it is a numeric variable, and unique values with their corresponding counts, if it is categorical. This will serve as a good, holistic view of the data which will prove invaluable once each feature is analyzed in more granular detail.

Let us now visually explore the features available to us.

# Let's create custom, professional looking color palettes

# colors
corpo_colors <- c(
  `red`        = "#d11141",
  `green`      = "#00b159",
  `blue`       = "#00aedb",
  `orange`     = "#f37735",
  `yellow`     = "#ffc425",
  `light grey` = "#cccccc",
  `dark grey`  = "#8c8c8c")

# Function to extract corpo colors as hex codes
# '...' - Character names of corpo_colors 

corpo_cols <- function(...) {
  cols <- c(...)

  if (is.null(cols))
    return (corpo_colors)

  corpo_colors[cols]
}

# We may now create custom color palettes using the colors we have selected

corpo_palettes <- list(
  `main`  = corpo_cols("red", "green", "blue", "orange", "yellow", "light grey", "dark grey"),

  `cool`  = corpo_cols("blue", "green"),

  `hot`   = corpo_cols("yellow", "orange", "red"),

  `mixed` = corpo_cols("blue", "green", "yellow", "orange", "red"),

  `grey`  = corpo_cols("light grey", "dark grey")
)

# Returns a function to interpolate a corpo color palette; allows for shades/new color palettes
# "palette" - Character name of palette in corpo_palettes
# "reverse" - Boolean indicating whether the palette should be reversed
# "..." - Additional arguments to pass to colorRampPalette()

corpo_pal <- function(palette = "main", reverse = FALSE, ...) {
  pal <- corpo_palettes[[palette]]

  if (reverse) pal <- rev(pal)

  colorRampPalette(pal, ...)
}

# Color scale constructor for corpo colors

# "palette" - Character name of palette in corpo_palettes
# "discrete" - Boolean indicating whether color aesthetic is discrete or not
# "reverse" - Boolean indicating whether the palette should be reversed
# "..." - Additional arguments passed to discrete_scale() or
# scale_color_gradientn(), used respectively when discrete is TRUE or FALSE

scale_color_corpo <- function(palette = "main", discrete = TRUE, reverse = FALSE, ...) {
  pal <- corpo_pal(palette = palette, reverse = reverse)
  if (discrete) {
    discrete_scale("colour", paste0("corpo_", palette), palette = pal, ...)
  } else {
    scale_color_gradientn(colours = pal(256), ...)
  }
}

# Fill scale constructor for corpo colors

# "palette" - Character name of palette in corpo_palettes
# "discrete" - Boolean indicating whether color aesthetic is discrete or not
# "reverse" - Boolean indicating whether the palette should be reversed
# "..." - Additional arguments passed to discrete_scale() or
# scale_fill_gradientn(), used respectively when discrete is TRUE or FALSE

scale_fill_corpo <- function(palette = "main", discrete = TRUE, reverse = FALSE, ...) {
  pal <- corpo_pal(palette = palette, reverse = reverse)

  if (discrete) {
    discrete_scale("fill", paste0("corpo_", palette), palette = pal, ...)
  } else {
    scale_fill_gradientn(colours = pal(256), ...)
  }
}
# Function that creates histograms.

histogram.plot <- function(df, feat, title, x_label, bw = 1, breaks = NULL, limits = NULL){

  z <- ggplot(df, aes(x = df[[feat]], fill = ..count..))
  h.plot <- z + geom_histogram(aes(y = (..count..)/sum(..count..)), binwidth = bw,
                               color = 'white', show.legend = FALSE) +
    labs(title = title) + xlab(x_label) + ylab("Percent of Clients") +
    scale_x_continuous(breaks = breaks, limits = limits) +
    scale_y_continuous(labels = scales::percent) +
    scale_fill_gradient(low = corpo_cols("dark grey"), high = corpo_cols("blue")) +
    theme(plot.title = element_text(hjust = 0.5),
          axis.title.x = element_text(margin = ggplot2::margin(t = 15, r = 0, b = 0, l = 0)),
          axis.title.y = element_text(margin = ggplot2::margin(t = 0, r = 15, b = 0, l = 0)))
  h.plot

}

# Function that creates vertical barplots.

vertical.barplot <- function(df, feat, title, x_label){

  z <- ggplot(df, aes(x = df[[feat]]))
  h.plot <- z + geom_bar(aes(y = (..count..)/sum(..count..), fill = factor(..x..)),
                         stat = "count") +
    labs(title = title) + xlab(x_label) + ylab("Percent of Clients") +
    geom_text(aes(label = ..count.., y = (..count..)/sum(..count..)), stat = "count",
              vjust = -0.5, size = 3) +
    scale_y_continuous(labels = percent) + guides(fill = FALSE) +
    scale_fill_corpo(palette = "mixed", guide = "none") +
    theme(plot.title = element_text(hjust = 0.5),
          axis.title.x = element_text(margin = ggplot2::margin(t = 15, r = 0, b = 0, l = 0)),
          axis.title.y = element_text(margin = ggplot2::margin(t = 0, r = 15, b = 0, l = 0)))
  return(h.plot)

}

# Function that creates horizontal barplots.

horizontal.barplot <- function(df, feat, title, y_label){

  z <- ggplot(df, aes(x = df[[feat]]))
  h.plot <- z + geom_bar(aes(y = (..count..)/sum(..count..), fill = factor(..x..)),
                         stat = "count") +
    labs(title = title) + xlab(y_label) + ylab("Percent of Clients") +
    geom_text(aes(label = ..count.., y = (..count..)/sum(..count..)), stat = "count",
              hjust = -0.5, size = 3) +
    scale_y_continuous(labels = percent, oob = rescale_none, limits = c(0, 0.25)) +
    coord_flip() +
    guides(fill = FALSE) +
    scale_fill_corpo(palette = "mixed") +
    theme(plot.title = element_text(hjust = 0.5),
          axis.title.x = element_text(margin = ggplot2::margin(t = 15, r = 0, b = 0, l = 0)),
          axis.title.y = element_text(margin = ggplot2::margin(t = 0, r = 15, b = 0, l = 0)))
  return(h.plot)

}
# Histogram of 'age'
age.plot <- histogram.plot(train, "age", title = "Distribution of Client Ages", x_label = "Age", breaks = seq(0,100,10), bw = 2) 
age.plot

It seems most of our potential clients are between the ages of 30 and 60, with a majority between 30 and 40. Persons in their 20s and past their 60s make up a minority of our clients. The age distribution of the clients is clearly skewed to the right.
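
As a quick numeric check of that visual impression, we can compute the sample skewness of 'age' with the skewness() function from the e1071 package loaded above; a positive value indicates right skew. This is a sketch only, and its value is not part of the original output.

# Sketch: sample skewness of 'age' (positive value => right-skewed)
e1071::skewness(train$age)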

# Barplot of 'job'
job.plot <- horizontal.barplot(train, "job", "Distribution of Client Employment Data", "Job Title")
job.plot

According to the graph above, a majority of the clients that are contacted hold either blue-collar, management, or technician positions. These jobs make up about 21%, 20.5%, and 17%, respectively, of all our clients in this dataset. Of note is the fact that this information is missing from only ~0.25% of our clients. This provides us with additional confidence in the inferences we make about the types of jobs that our clients have.

# Barplot of 'marital'
marital.plot <- vertical.barplot(train, "marital", "Distribution of Client Marital Status", "Marital Status")
marital.plot

Here, we see that 60%, the majority, of the clients in our dataset are married.

# Barplot of 'education'
education.plot <- vertical.barplot(train, "education", "Distribution of Highest Education Completed by Client", "Education")
education.plot

Above, we see that most of our clients seem to have, at most, completed secondary or tertiary education. In the Portuguese educational system, these roughly equate to completing high school and a college/vocational program, respectively. This tells us that most of our clients are adequately educated and relatively skilled. In addition, we are missing this information for only ~5% of our clients.

# Barplot of 'default'
default.plot <- vertical.barplot(train, "default", "Distribution of Credit Defaults Among Clients", "Does Client Have Credit In Default?")
default.plot

A vast majority, nearly 100%, of the clients that are contacted have never defaulted on a debt. This seems like great news, but it raises the question of whether these individuals have never defaulted because they successfully repaid all of their loans, or because they never took out a loan and were therefore never given the chance to default in the first place.

# Histogram of 'balance'

balance.plot <- histogram.plot(train, "balance", title = "Distribution of Average Yearly Balance in Euros", x_label = "Balance", bw = 1000, breaks = seq(0,15000,1000), limits = c(-500,15000)) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
balance.plot

The graph above shows the average yearly balance held by the clients, in euros. It seems that over 50% of clients have less than 1,000 euros deposited in their bank accounts and over 75% have less than 2,000. As employees of the bank, we can read this in several ways. It may be concerning: a lack of savings may signal poor financial standing and/or an inability of the client to manage their finances, which may correlate with an increased chance of defaulting on debt and borrowed loans. On the other hand, it may work in our favor: the fact that individuals are not saving their money can be used as a sales tactic to convince them to begin saving now, and to do so with our bank.
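
The percentages quoted above can be verified directly from the empirical distribution of 'balance'; a minimal sketch (values not reproduced here):

# Sketch: share of clients with an average yearly balance below 1,000 and 2,000 euros
round(100 * ecdf(train$balance)(c(1000, 2000)), 1)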

Being from the United States, we have a biased perception of how much money is needed to live comfortably; the cost of daily living in Portugal should be taken into account before drawing conclusions about a client’s financial status from their average yearly balance. For perspective, the average yearly salary of an individual in Lisbon, Portugal is around 10,000 euros.

# Barplot of 'housing'

housing.plot <- vertical.barplot(train, "housing", "Distribution of Presence of Housing Loans Among Clients", "Does Client Possess Housing Loan?")
housing.plot

The graph above shows us that over 55% of clients currently possess a housing loan; in other words, a mortgage. Our bank’s sales team may use this information to their advantage, highlighting any attractive refinancing options the bank may offer new subscribers.

# Barplot of 'loan'

loan.plot <- vertical.barplot(train, "loan", "Distribution of Presence of Personal Loans Among Clients", "Does Client Possess Personal Loan?")
loan.plot

Over 80% of clients are not currently paying off a personal loan.

# Barplot of 'contact'

contact.plot <- vertical.barplot(train, "contact", "Distribution of Communication Type Among Clients", "How Was Client Reached?")
contact.plot

We see that over 60% of our clients were reached via cellphone, while this information is missing for 30% of our respondents. These days, few individuals in the developed world maintain a landline as their primary form of communication, so it would not be unreasonable to believe that most of the ‘unknown’ observations are in fact ‘cellular’. For a definitive conclusion, additional data would need to be collected and, perhaps, process changes made to ensure that sales representatives record each client’s primary communication type.

# Histogram of 'day'

day.plot <- histogram.plot(train, "day", title = "Days on Which Clients Were Reached", x_label = "Day of Month", breaks = seq(0,31,5))
day.plot

There seems to be no pattern as to the day of the month that clients are being reached. The graph above points to a largely uniform distribution of the days of the month on which clients are reached.

# Barplot of 'month'

train$month <- factor(train$month, levels=c("jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"))

month.plot <- vertical.barplot(train, "month", title = "Months on Which Clients Were Reached", x_label = "Month")
month.plot

In contrast to the days of the month, clients seem to be more willing/able to speak to bank representatives during the late spring and summer months, with a peak in May. More data should be collected to confirm that the pattern seen here is not a statistical anomaly, because there is no obvious explanation as to why individuals would be more reachable in May specifically. It may be that the change in seasons produces psychological/emotional/social changes that lead to an increase in client response, or it may simply be that the bank conducts its large-scale campaigns primarily during May, leading to an increase in client contacts.

# Histogram of 'duration'

duration.plot <- histogram.plot(train, "duration", title = "Distribution of Contact Time With Clients", x_label = "Duration of Contact (seconds)", bw = 100, limits = c(0,2500), breaks = seq(0,2500,100)) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
duration.plot

Above, we see that over 30% of communications with clients last less than 1.6 minutes and over 70% last less than 5 minutes. It would be interesting to see what the distribution of duration times looks like for only those clients that ended up subscribing to the bank; this would provide additional insight into the effectiveness of the bank’s sales tactics.

# Histogram of 'duration' (y = 'yes')

duration.plot <- histogram.plot(train[train$y == 'yes',], "duration", title = "Distribution of Contact Time with Eventual Subscribers", x_label = "Duration of Contact (seconds)", bw = 100, limits = c(0,2500), breaks = seq(0,2500,100)) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
duration.plot

Now, this is interesting. It seems that a majority of the subscribers acquired by the bank result from communications lasting 1.6 to 3.3 minutes, and almost a third of subscribers are involved in communications lasting 3.3 to 5 minutes. It would be wise to closely analyze these client calls to determine the tactics used by the sales representatives to sell the bank’s subscription. However, it is important to note that the remaining subscribers exhibit a wide spread of contact durations. It may be that those individuals who subscribed within the first few minutes of contact were already interested in joining a bank. Nonetheless, these are interesting results.

# Histogram of 'campaign'

campaign.plot <- histogram.plot(train, "campaign", title = "Number of Times Client Was Contacted", x_label = "Number of Contacts", bw = 1, limits = c(0,20), breaks = seq(0,20,4))
campaign.plot

A majority of the clients are contacted only once, and nearly 80% are contacted three times or fewer. Restricting the same histogram to only those clients that became subscribers (sketched below) suggests that persistence has some influence on a client’s choice to subscribe. With this information, the bank may consider reallocating time spent re-establishing contact toward contacting new clients.
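
A sketch of that subscribers-only view, reusing the histogram.plot() helper defined earlier (the resulting plot is not reproduced here):

# Sketch: histogram of 'campaign' restricted to clients who subscribed (y == 'yes')
campaign.sub.plot <- histogram.plot(train[train$y == 'yes',], "campaign",
                                    title = "Number of Times Eventual Subscribers Were Contacted",
                                    x_label = "Number of Contacts",
                                    bw = 1, limits = c(0,20), breaks = seq(0,20,4))
campaign.sub.plot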

# Histogram of 'pdays'

pdays.plot <- histogram.plot(train, "pdays", title = "Number of Days After Previous Campaign Client Was Contacted", x_label = "Number of Days", bw = 10, limits = c(-10,500), breaks = seq(0,500,100))
pdays.plot

Here, we discover that over 80% of the values for the pdays feature are effectively missing, coded with the sentinel -1 (meaning the client was not contacted in a previous campaign). Because we have so little information in this feature, it is likely that we will remove it from the dataset at some point during the analysis.
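
A minimal sketch confirming that share directly (value not reproduced here):

# Sketch: percent of observations with pdays == -1 (no previous contact)
round(100 * mean(train$pdays == -1), 1)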

# Histogram of 'previous'

previous.plot <- histogram.plot(train, "previous", title = "Number of Times Client was Contacted Before Current Campaign", x_label = "Number of Times", bw = 1, limits = c(-1,10), breaks = seq(0,10,1)) 
previous.plot

It looks like most of the clients contacted during the campaign are new. Similarly, most of the subscribers obtained during the campaign have not been contacted before; 85% of subscribers were contacted twice or fewer times before the current campaign. It seems that if a client has not subscribed in the past three campaigns, including the current one, they are unlikely to subscribe in subsequent campaigns. However, we must keep in mind that our data is biased towards new clients: it might be that most subscribers have only been contacted once simply because most of the clients are new. Still, our conclusion may be sound given the steady decrease in the percentage of clients that subscribe as the number of contacts before the current campaign increases.
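
The downward trend described above can be checked with a quick tabulation of the subscription rate against the number of prior contacts; a minimal sketch (output not reproduced here):

# Sketch: subscription rate (%) by number of contacts before the current campaign
round(100 * tapply(train$y == "yes", train$previous, mean), 1)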

# Barplot of 'poutcome'
poutcome.plot <- vertical.barplot(train, "poutcome", "Outcome of Previous Marketing Campaign", "Result of Previous Campaign on Client")
poutcome.plot

For over 80% of clients, it is unclear whether or not the previous marketing campaign was successful. This makes sense, as most clients are not contacted more than once. This may be one feature that we eventually leave out of our model for lack of additional, useful information.

# Barplot of 'y'
y.plot <- vertical.barplot(train, "y", "Outcome of Latest Marketing Campaign on Client", "Did the Client Subscribe?")
y.plot

It looks like a large majority of the time, the outcome variable produces a “no”. In other words, over 80% of clients failed to subscribe to the bank during the current campaign. Exactly how much of our data is made up of “no”s?

perc.no <- nrow(train[train$y == 'no',])/nrow(train)*100
print(paste0(as.character(round(perc.no)), "%"))
## [1] "88%"

88% of the time, the response variable comes up as “no”. Clearly, our dependent variable is heavily imbalanced towards “no”. This means that our measure of the final model’s performance will need to be relatively high in order to bear any fruit in real-world applications. Suppose we introduced a model that simply predicted “no” for every observation in the dataset; we would obtain an accuracy of 88%. This may seem like a well-performing model when in fact it is practically useless, providing the same prediction every time. We will need to take this into account when evaluating the models that we create, by examining metrics such as specificity and sensitivity along with accuracy.

Based on the analysis of the data so far, the dependent variable being predicted is categorical: ‘yes’ or ‘no’. Due to this fact, the end product of this project will be a classification model. We can use the same tactic, of predicting ‘no’ for all observations, as an initial benchmark against which we will compare other potential models for predicting whether or not a client will subscribe.
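
As a concrete starting point, the benchmark accuracy of such an ‘always no’ classifier can be computed directly; a minimal sketch:

# Sketch: accuracy of a naive benchmark that predicts 'no' for every client
baseline.pred <- factor(rep("no", nrow(train)), levels = levels(train$y))
mean(baseline.pred == train$y)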

Model Selection

We have completed an exploration of the dataset, looking at each feature individually as well as the dataset as a whole. Now, it is time to begin building the model that will allow us to predict which clients are likely to subscribe to our bank.

Let us not forget that our dependent variable here is binary: “yes” or “no”. This constitutes a classification problem, which narrows down the set of potential models. Often, we want, and need, to consider the investment of time and resources that building a model will require. In addition, it is best to begin with simple models because they are quick and easy to build.

Upon closer inspection of the data, there are several numeric variables, such as day, that should be treated as categorical variables, because their values are discrete and function primarily as labels. We’ll convert the appropriate numeric columns into categorical ones; this prevents the model from inferring spurious numeric relationships between the features and the dependent variable. Specifically, “day”, “campaign”, and “previous” will be turned into factor variables.

In addition, it was concluded that “pdays”, because it is missing over 80% of its data, should not be included in the model; it will be removed. “poutcome” will also be excluded because it too is missing ~80% of its data and seems to provide no additional information that would add to the predictive capability of the model. “poutcome” describes whether the previous marketing campaign succeeded in gaining a given subscriber. A large majority of our clients have never been contacted before, which is why roughly 80% of this feature is missing; certain clients were also labeled as “other”, and it is unclear what a campaign could have done other than succeed or fail to obtain a client. This feature is largely useful for exploratory analysis and, as a result, will be excluded from the model. Finally, “duration” will be excluded due to its high correlation with whether or not a client subscribes. “duration” describes the length of the last contact made with a client; if the last contact lasted zero seconds, then there was in fact no contact, which inevitably means a failure to subscribe. Since the goal is to predict whether or not a client will subscribe before contact is made, “duration” will be excluded from the model.

train.new <- train[ , !(names(train) %in% c("pdays", "poutcome", "duration"))]
test.new <- test[ , !(names(test) %in% c("pdays", "poutcome", "duration"))]

train.new[c("day", "campaign", "previous")] <- lapply(train.new[c("day", "campaign", "previous")], factor)
test.new[c("day", "campaign", "previous")] <- lapply(test.new[c("day", "campaign", "previous")], factor)
print("Train Columns:")
## [1] "Train Columns:"
sapply(train.new, class)
##       age       job   marital education   default   balance   housing 
## "integer"  "factor"  "factor"  "factor"  "factor" "integer"  "factor" 
##      loan   contact       day     month  campaign  previous         y 
##  "factor"  "factor"  "factor"  "factor"  "factor"  "factor"  "factor"
print("Test Columns:")
## [1] "Test Columns:"
sapply(test.new, class)
##       age       job   marital education   default   balance   housing 
## "integer"  "factor"  "factor"  "factor"  "factor" "integer"  "factor" 
##      loan   contact       day     month  campaign  previous        id 
##  "factor"  "factor"  "factor"  "factor"  "factor"  "factor" "integer"

Let’s perform a train-test split on our training data. This will simulate the prediction of previously unforeseen data and allow us to measure the performance of our model along the way.

set.seed(2018)

split <- sample.split(train.new$y, SplitRatio = 0.8)
train.train <- subset(train.new, split == TRUE)
train.test <- subset(train.new, split == FALSE)

Not all of the models are guaranteed to handle categorical variables directly in their raw form, so let us encode each categorical feature in our dataset as dummy variables. Also, some models, such as support vector machines, require the data to be normalized across all of the features; for this reason, we will create a separate, normalized training set as well.

# Normalizing data

# A function that normalizes numeric data
data_norm <- function(x){return((x - min(x))/(max(x) - min(x)))}

train.train.norm <- train.train
train.test.norm <- train.test

for(col in colnames(train.train.norm)){
  if(is.numeric(train.train.norm[[col]])){
    train.train.norm[[col]] <- data_norm(train.train.norm[[col]])
  }
}

for(col in colnames(train.test.norm)){
  if(is.numeric(train.test.norm[[col]])){
    train.test.norm[[col]] <- data_norm(train.test.norm[[col]])
  }
}
# Codifying categorical variables

train.train.new <- data.frame(model.matrix(y ~ .-1, data = train.train.norm))
train.train.new <- cbind(train.train[["y"]], train.train.new)
colnames(train.train.new)[1] <- c("y")

train.test.new <- data.frame(model.matrix(y ~ .-1, data = train.test.norm))
train.test.new <- cbind(train.test[["y"]], train.test.new)
colnames(train.test.new)[1] <- c("y")
# Configuring the train function to use cross-validation

train_control <- trainControl(method = "repeatedcv", number = 5, repeats = 3, 
                      classProbs = TRUE, summaryFunction = twoClassSummary, 
                      savePredictions = TRUE, verboseIter = TRUE)

Four different models will be fit to determine their performance in predicting whether or not a client will subscribe to the bank: Logistic Regression, Support Vector Machine (SVM), Random Forest, and K-Nearest Neighbors (KNN).

Because the dataset is severely unbalanced, we’ll use the area under the ROC curve (AUC) as our primary measure of model performance. Whereas accuracy measures the ratio of correct predictions to total predictions, the AUC provides a holistic measure that quantifies the model’s ability to correctly predict both the positive and negative cases, when dealing with a binary classification problem such as this one. A higher AUC signifies a better, all-around classification model.
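
For reference, this is how an AUC can be computed from a vector of predicted class probabilities and the corresponding true labels using the ROCR package loaded above; the inputs 'pred.probs' and 'true.labels' are hypothetical placeholders, not objects defined elsewhere in this analysis.

# Sketch: helper that computes the AUC from predicted probabilities and true labels
# ('pred.probs' and 'true.labels' are hypothetical placeholders)
auc.of <- function(pred.probs, true.labels){
  pred.obj <- ROCR::prediction(pred.probs, true.labels)
  ROCR::performance(pred.obj, "auc")@y.values[[1]]
}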

Logistic Regression

start_time <- Sys.time()

set.seed(2018)

# Making the model
lr.model <- caret::train(y ~., data=train.train.new, trControl=train_control, method="glm", family = "binomial", metric = "AUC", tuneLength = 1)
## Aggregating results
## Fitting final model on full training set
pred1 = predict(lr.model, newdata=train.test.new)

end_time <- Sys.time()
time.elapse <- (end_time - start_time)
print(time.elapse)
## Time difference of 1.883505 mins
# Printing the average AUC of the model
cv.preds <- as.numeric(lr.model$pred[,1])
cv.labels <- as.numeric(lr.model$pred[,2])
print(paste("Cross-Validation AUC:", round(AUC(cv.preds, cv.labels),3)))   
## [1] "Cross-Validation AUC: 0.548"
cat("", sep = "\n")
# Printing specificity and sensitivity
cm1 <- confusionMatrix(data=pred1, train.test$y)
print(cm1)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  5509  663
##        yes   83   74
##                                         
##                Accuracy : 0.8821        
##                  95% CI : (0.8739, 0.89)
##     No Information Rate : 0.8836        
##     P-Value [Acc > NIR] : 0.6468        
##                                         
##                   Kappa : 0.13          
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 0.9852        
##             Specificity : 0.1004        
##          Pos Pred Value : 0.8926        
##          Neg Pred Value : 0.4713        
##              Prevalence : 0.8836        
##          Detection Rate : 0.8704        
##    Detection Prevalence : 0.9752        
##       Balanced Accuracy : 0.5428        
##                                         
##        'Positive' Class : no            
## 

Support Vector Machine

start_time <- Sys.time()

set.seed(2018)

# Making the model
svm.model <- caret::train(y ~., data=train.train.new, trControl=train_control, method="svmLinear", metric = "Accuracy", tuneLength = 1)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was
## not in the result set. ROC will be used instead.
## Aggregating results
## Fitting final model on full training set
## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.
pred2 = predict(svm.model, newdata=train.test.new)

end_time <- Sys.time()
time.elapse <- (end_time - start_time)
print(time.elapse)
## Time difference of 5.167223 mins
# Printing the cross-validation AUC of the model
cv.preds <- as.numeric(svm.model$pred[,1])
cv.labels <- as.numeric(svm.model$pred[,2])
print(paste("Cross-Validation AUC:", round(AUC(cv.preds, cv.labels),3)))  
## [1] "Cross-Validation AUC: 0.515"
cat("", sep = "\n")
# Printing specificity and sensitivity
cm2 <- confusionMatrix(data=pred2, train.test$y)
print(cm2)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  5592  737
##        yes    0    0
##                                           
##                Accuracy : 0.8836          
##                  95% CI : (0.8754, 0.8914)
##     No Information Rate : 0.8836          
##     P-Value [Acc > NIR] : 0.5098          
##                                           
##                   Kappa : 0               
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.8836          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.8836          
##          Detection Rate : 0.8836          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : no              
## 

Random Forest

start_time <- Sys.time()

set.seed(2018)

# Making the model

rf.model <- randomForest(y~., data = train.train)

# Predicting test set results:

pred3 <- predict(rf.model, newdata = train.test[, !(colnames(train.test) %in% c("y"))])

end_time <- Sys.time()
time.elapse <- (end_time - start_time)
print(time.elapse)
## Time difference of 34.96428 secs
# Printing the out-of-bag (OOB) AUC of the model
auc.predictions <- as.vector(rf.model$votes[,2])
auc.pred <- ROCR::prediction(auc.predictions, train.train$y)
perf.auc <- ROCR::performance(auc.pred,"auc") 
rf.auc <- perf.auc@y.values[[1]]
print(paste("AUC:", round(rf.auc,3)))  
## [1] "AUC: 0.775"
cat("", sep = "\n")
cm3 <- confusionMatrix(pred3, train.test$y) 
print(cm3)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  5401  534
##        yes  191  203
##                                           
##                Accuracy : 0.8854          
##                  95% CI : (0.8773, 0.8932)
##     No Information Rate : 0.8836          
##     P-Value [Acc > NIR] : 0.3276          
##                                           
##                   Kappa : 0.3024          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9658          
##             Specificity : 0.2754          
##          Pos Pred Value : 0.9100          
##          Neg Pred Value : 0.5152          
##              Prevalence : 0.8836          
##          Detection Rate : 0.8534          
##    Detection Prevalence : 0.9377          
##       Balanced Accuracy : 0.6206          
##                                           
##        'Positive' Class : no              
## 

Attempts to train the random forest in conjunction with cross-validation proved unsuccessful: training could not be completed in any reasonable amount of time. Luckily, during training, the random forest samples a subset of the observations to build each tree in the forest, so many decision trees are built on “different” datasets. For classification problems, the final prediction is made by a majority vote of all the decision trees, essentially averaging the individual trees’ predictions, and each tree can be evaluated on the observations left out of its sample. In this way, random forests are said to provide “free” cross-validation through their out-of-bag (OOB) estimates.
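
The out-of-bag (OOB) estimate referred to above is reported automatically by randomForest; a minimal sketch for inspecting it on the fitted model (output not reproduced here):

# Sketch: out-of-bag performance estimates stored in the fitted forest
rf.model$confusion                   # OOB confusion matrix with per-class error
tail(rf.model$err.rate[, "OOB"], 1)  # OOB error rate after all trees are grown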

KNN

start_time <- Sys.time()

set.seed(2018)

# Making the model

knn.model <- caret::train(y ~., data=train.train.new, method="knn", trControl=train_control, tuneLength = 1, metric = "Accuracy")
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was
## not in the result set. ROC will be used instead.
## Aggregating results
## Fitting final model on full training set
# Predicting test set results:

pred4 = predict(knn.model, newdata=train.test.new[, !(colnames(train.test.new) %in% c("y"))])

end_time <- Sys.time()
time.elapse <- (end_time - start_time)
print(time.elapse)
## Time difference of 13.73167 mins
# Printing the cross-validation AUC of the model
cv.preds <- as.numeric(knn.model$pred[,1])
cv.labels <- as.numeric(knn.model$pred[,2])
print(paste("Cross-Validation AUC:", round(AUC(cv.preds, cv.labels),3))) 
## [1] "Cross-Validation AUC: 0.542"
cat("", sep = "\n")
# Printing specificity and sensitivity
cm4 <- confusionMatrix(data=pred4, train.test.new$y)
print(cm4)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  5499  665
##        yes   93   72
##                                          
##                Accuracy : 0.8802         
##                  95% CI : (0.872, 0.8881)
##     No Information Rate : 0.8836         
##     P-Value [Acc > NIR] : 0.8007         
##                                          
##                   Kappa : 0.1223         
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.98337        
##             Specificity : 0.09769        
##          Pos Pred Value : 0.89212        
##          Neg Pred Value : 0.43636        
##              Prevalence : 0.88355        
##          Detection Rate : 0.86886        
##    Detection Prevalence : 0.97393        
##       Balanced Accuracy : 0.54053        
##                                          
##        'Positive' Class : no             
## 

Model Selection Conclusions

Interestingly enough, despite nearly identical overall accuracies, the Logistic Regression, SVM, Random Forest, and KNN exhibit average AUCs of approximately 0.55, 0.52, 0.78, and 0.54, respectively. The Random Forest has clearly proven to be the superior of the four base models, as it possesses the highest AUC score. The effect of this relatively high score can be seen in the corresponding specificities of each model. Throughout this project, the specificity of a model will refer to the proportion of ‘yes’ observations in the test set that it correctly predicts, while sensitivity refers to the proportion of ‘no’ observations. While the Logistic Regression and KNN achieve specificities of roughly 10%, and the SVM predicts no positives at all, the Random Forest achieves a specificity of nearly 30% while maintaining a sensitivity of 97%. In other words, the Random Forest captures roughly 20 percentage points more of the eventual subscribers than the other models. We emphasize specificity because it is more important to us to capture those individuals who will eventually subscribe than those who will not. As a result, we will select the Random Forest as the model that we will continue with and improve upon throughout the rest of our analysis of the bank marketing data.
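
For reference, these sensitivity and specificity figures can be pulled directly from the caret confusionMatrix objects created above; a minimal sketch (output not reproduced here):

# Sketch: side-by-side sensitivity/specificity of the four base models
sapply(list(logistic = cm1, svm = cm2, rf = cm3, knn = cm4),
       function(cm) cm$byClass[c("Sensitivity", "Specificity")])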

One may wonder why the base package was used to fit the random forest while ‘caret’ was used to fit the remaining models. The main reason the caret package is used at all is to perform cross-validation during training and to produce confusion matrices. In addition, the train function of caret performs behind-the-scenes optimizations and enforces its own parameter values. Due to its bagging procedure, the random forest essentially provides a “free” cross-validated estimate, removing the need for manual cross-validation. As a byproduct of not training the random forest with the train function, we retain greater freedom during parameter tuning, because the base fitting function allows specific assignment of more of the random forest’s parameters than the caret train function does. Finally, the random forest trains faster using the base package because there is no additional overhead from background optimization processes.

Model Tuning/Optimization

We have determined that the random forest returned the most favorable performance, with an average AUC of 0.78 and the ability to capture ~30% of the positive cases in our data. As a result, this model was deemed the best of the four base models that were fit and selected for continued use and improvement.

In this section, we will continue working on our final model by improving the random forest model. Improvement in the performance of a model can be achieved via three main avenues:

  1. Model Boosting
  2. Feature Optimization
  3. Hyperparameter Tuning

Model Boosting

The random forest is, by default, itself an ensemble model. It is an amalgamation of decision tree models that utilizes ‘bagging’ to produce an accuracy superior to that of any single decision tree. During bagging, each tree is trained on a new, random sample of the training data, and the final predictions of all the trees are averaged, in the regression case, or tallied, in the classification case, to determine the most prevalent prediction. In this way, the random forest improves upon the performance of the decision tree by slightly increasing bias while greatly decreasing variance. However, there is another way to combine the predictions of many decision trees: by introducing weights. This is called ‘boosting’.

The process of boosting mirrors that of bagging, with the addition of weights. During boosting, each decision tree is also fit on data randomly sampled from the training set. At first, each data point is selected with equal probability, but as each successive tree is trained, the probability of a data point being selected changes based on whether it belongs to a class that the previous tree had difficulty predicting. Points belonging to the class that the previous decision tree had the most difficulty predicting have a higher probability of being chosen for the sample on which the next decision tree is trained. In this way, a boosted ensemble seeks to decrease bias while slightly increasing variance.

Here, we will use the XGBoost package, which implements a boosted ensemble of decision trees known as the XGBoost model.

# Preparing data for the XGBoost model; finding the best parameters for the XGBoost model

# Coding the dependent variable as 0/1
train.train.new$y <- as.numeric(train.train.new$y)
train.train.new$y <- (train.train.new$y - 1)
train.test.new$y <- as.numeric(train.test.new$y)
train.test.new$y <- (train.test.new$y - 1)

x.train.val = data.matrix(train.train.new[, !(colnames(train.train.new) %in% c("y"))])
y.train.val = data.matrix(train.train.new$y)
x.test.val = data.matrix(train.test.new[, !(colnames(train.test.new) %in% c("y"))])
y.test.val = data.matrix(train.test.new$y)

xgb.train <- xgb.DMatrix(data = x.train.val, label = y.train.val)
xgb.test <- xgb.DMatrix(data = x.test.val, label = y.test.val)

xgb.params <- list(booster = "gbtree", objective = "binary:logistic", eta=0.3, gamma=0, max_depth=6, min_child_weight=1, subsample=1, colsample_bytree=1, eval_metric = "auc")

xgbcv <- xgb.cv(params = xgb.params, data = xgb.train, nrounds = 200, nfold = 5, showsd = T, stratified = T, print_every_n = 50, early.stop.round = 20, maximize = F)
## [1]  train-auc:0.687491+0.002274 test-auc:0.684159+0.007512 
## Multiple eval metrics are present. Will use test_auc for early stopping.
## Will train until test_auc hasn't improved in 20 rounds.
## 
## Stopping. Best iteration:
## [1]  train-auc:0.687491+0.002274 test-auc:0.684159+0.007512
xgbcv$best_iteration
## [1] 1
# Building the XGBoost model

#first default - model training
xgb1 <- xgb.train(params = xgb.params, data = xgb.train, nrounds = 1, watchlist = list(val=xgb.test,train=xgb.train))
## [1]  val-auc:0.695645    train-auc:0.687578
#model prediction
xgbpred <- predict (xgb1,xgb.test)
xgbpred <- ifelse (xgbpred > 0.5,1,0)

xgbpred <- as.factor(xgbpred)
y.test.val <- as.factor(y.test.val)

print(paste("AUC:",max(xgb1$evaluation_log$train_auc)))
## [1] "AUC: 0.687578"
cat("", sep = "\n")
confusionMatrix (xgbpred, y.test.val)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5519  684
##          1   73   53
##                                           
##                Accuracy : 0.8804          
##                  95% CI : (0.8721, 0.8883)
##     No Information Rate : 0.8836          
##     P-Value [Acc > NIR] : 0.7896          
##                                           
##                   Kappa : 0.092           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.98695         
##             Specificity : 0.07191         
##          Pos Pred Value : 0.88973         
##          Neg Pred Value : 0.42063         
##              Prevalence : 0.88355         
##          Detection Rate : 0.87202         
##    Detection Prevalence : 0.98009         
##       Balanced Accuracy : 0.52943         
##                                           
##        'Positive' Class : 0               
## 

Thus far, the boosted model possesses both an AUC and a specificity below those of the base random forest: the XGBoost model’s AUC is about 8 percentage points lower and its specificity roughly 20 percentage points lower. Let’s see if optimizing the parameters increases the boosted model’s performance at all.

# Optimizing XGBoost

train.train.new$y <- as.factor(train.train.new$y)
train.test.new$y <- as.factor(train.test.new$y)

#create tasks
traintask <- makeClassifTask (data = data.frame(train.train.new), target = "y")
testtask <- makeClassifTask (data = data.frame(train.test.new), target = "y")

#create learner
lrn <- makeLearner("classif.xgboost",predict.type = "prob")
## Warning in makeParam(id = id, type = "numeric", learner.param = TRUE, lower = lower, : NA used as a default value for learner parameter missing.
## ParamHelpers uses NA as a special value for dependent parameters.
lrn$par.vals <- list( objective="binary:logistic", eval_metric="auc", nrounds=100L, eta=0.1)

#set parameter space
params <- makeParamSet( makeDiscreteParam("booster",values = c("gbtree","gblinear")), makeIntegerParam("max_depth",lower = 3L,upper = 10L), makeNumericParam("min_child_weight",lower = 1L,upper = 10L), makeNumericParam("subsample",lower = 0.5,upper = 1), makeNumericParam("colsample_bytree",lower = 0.5,upper = 1))

#set resampling strategy
rdesc <- makeResampleDesc("CV",stratify = T,iters=5L)

#search strategy
ctrl <- makeTuneControlRandom(maxit = 10L)

#set parallel backend
library(parallel)
library(parallelMap) 
parallelStartSocket(cpus = detectCores())
## Starting parallelization in mode=socket with cpus=4.
#parameter tuning
mytune <- tuneParams(learner = lrn, task = traintask, resampling = rdesc, measures = auc, par.set = params, control = ctrl, show.info = T)
## [Tune] Started tuning learner classif.xgboost for parameter set:
##                      Type len Def          Constr Req Tunable Trafo
## booster          discrete   -   - gbtree,gblinear   -    TRUE     -
## max_depth         integer   -   -         3 to 10   -    TRUE     -
## min_child_weight  numeric   -   -         1 to 10   -    TRUE     -
## subsample         numeric   -   -        0.5 to 1   -    TRUE     -
## colsample_bytree  numeric   -   -        0.5 to 1   -    TRUE     -
## With control class: TuneControlRandom
## Imputation value: -0
## Exporting objects to slaves for mode socket: .mlr.slave.options
## Mapping in parallel: mode = socket; cpus = 4; elements = 10.
## [Tune] Result: booster=gbtree; max_depth=8; min_child_weight=6.22; subsample=0.933; colsample_bytree=0.859 : auc.test.mean=0.7771248
mytune$y 
## auc.test.mean 
##     0.7771248

Now to refit the model using the optimized parameters.

# Set hyperparameters
lrn_tune <- setHyperPars(lrn, par.vals = mytune$x)

# Train model
xgb.opt <- train(learner = lrn_tune, task = traintask)
## [1]  train-auc:0.700427 
## [2]  train-auc:0.702538 
## [3]  train-auc:0.723898 
## [4]  train-auc:0.736402 
## [5]  train-auc:0.758814 
## [6]  train-auc:0.762149 
## [7]  train-auc:0.764780 
## [8]  train-auc:0.765248 
## [9]  train-auc:0.772859 
## [10] train-auc:0.777354 
## [11] train-auc:0.781078 
## [12] train-auc:0.781726 
## [13] train-auc:0.783862 
## [14] train-auc:0.785770 
## [15] train-auc:0.787942 
## [16] train-auc:0.791965 
## [17] train-auc:0.793388 
## [18] train-auc:0.796854 
## [19] train-auc:0.798507 
## [20] train-auc:0.800867 
## [21] train-auc:0.801869 
## [22] train-auc:0.804150 
## [23] train-auc:0.805390 
## [24] train-auc:0.806590 
## [25] train-auc:0.808270 
## [26] train-auc:0.809962 
## [27] train-auc:0.811820 
## [28] train-auc:0.812916 
## [29] train-auc:0.814839 
## [30] train-auc:0.815986 
## [31] train-auc:0.817335 
## [32] train-auc:0.818026 
## [33] train-auc:0.818845 
## [34] train-auc:0.821237 
## [35] train-auc:0.822118 
## [36] train-auc:0.822916 
## [37] train-auc:0.824225 
## [38] train-auc:0.825186 
## [39] train-auc:0.826278 
## [40] train-auc:0.826929 
## [41] train-auc:0.827876 
## [42] train-auc:0.828508 
## [43] train-auc:0.829420 
## [44] train-auc:0.830686 
## [45] train-auc:0.831064 
## [46] train-auc:0.831988 
## [47] train-auc:0.832819 
## [48] train-auc:0.833659 
## [49] train-auc:0.835578 
## [50] train-auc:0.836638 
## [51] train-auc:0.837855 
## [52] train-auc:0.838994 
## [53] train-auc:0.840369 
## [54] train-auc:0.840663 
## [55] train-auc:0.841940 
## [56] train-auc:0.842787 
## [57] train-auc:0.842978 
## [58] train-auc:0.843480 
## [59] train-auc:0.844914 
## [60] train-auc:0.846092 
## [61] train-auc:0.847337 
## [62] train-auc:0.848234 
## [63] train-auc:0.849291 
## [64] train-auc:0.849488 
## [65] train-auc:0.850931 
## [66] train-auc:0.851220 
## [67] train-auc:0.851649 
## [68] train-auc:0.852316 
## [69] train-auc:0.853062 
## [70] train-auc:0.853611 
## [71] train-auc:0.854041 
## [72] train-auc:0.855311 
## [73] train-auc:0.855811 
## [74] train-auc:0.856390 
## [75] train-auc:0.856821 
## [76] train-auc:0.857099 
## [77] train-auc:0.857765 
## [78] train-auc:0.858415 
## [79] train-auc:0.858650 
## [80] train-auc:0.859197 
## [81] train-auc:0.859825 
## [82] train-auc:0.860200 
## [83] train-auc:0.860268 
## [84] train-auc:0.860598 
## [85] train-auc:0.861281 
## [86] train-auc:0.861791 
## [87] train-auc:0.862732 
## [88] train-auc:0.863846 
## [89] train-auc:0.864195 
## [90] train-auc:0.864843 
## [91] train-auc:0.865505 
## [92] train-auc:0.865877 
## [93] train-auc:0.866662 
## [94] train-auc:0.867167 
## [95] train-auc:0.867520 
## [96] train-auc:0.867736 
## [97] train-auc:0.868299 
## [98] train-auc:0.868985 
## [99] train-auc:0.869217 
## [100]    train-auc:0.869394
# Predict model
xgpred.opt <- predict(xgb.opt, testtask)

# Assess performance
confusionMatrix(xgpred.opt$data$response, xgpred.opt$data$truth)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5495  634
##          1   97  103
##                                           
##                Accuracy : 0.8845          
##                  95% CI : (0.8764, 0.8923)
##     No Information Rate : 0.8836          
##     P-Value [Acc > NIR] : 0.4165          
##                                           
##                   Kappa : 0.179           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9827          
##             Specificity : 0.1398          
##          Pos Pred Value : 0.8966          
##          Neg Pred Value : 0.5150          
##              Prevalence : 0.8836          
##          Detection Rate : 0.8682          
##    Detection Prevalence : 0.9684          
##       Balanced Accuracy : 0.5612          
##                                           
##        'Positive' Class : 0               
## 

It doesn’t seem as if a boosted tree model will help us better predict whether or not a client will subscribe to our bank. Our original random forest achieved roughly 30% specificity, whereas the optimized XGBoost model captures only ~14% of the subscribers. For this reason, we will stick with our bagged random forest for the rest of the project.

Feature Optimization

A popular way for a data scientist to increase the performance of a model is feature engineering: using the features already at hand to create new ones that may improve predictive performance. The data scientist may also seek out additional features describing the observations that have not yet been accounted for. Either way, the aim is to increase the amount of variance in the data that the model can explain, further improving its performance.
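As a small, purely illustrative sketch of the first idea (the copy 'train.fe' and the derived column 'previously.contacted' are hypothetical names that are not used elsewhere in this analysis), we could derive a binary flag from the existing pdays column, where a value of -1 indicates the client was never contacted in a previous campaign:

# Illustration only: engineer a 'previously contacted' flag from pdays
# (pdays == -1 means the client was not reached in any prior campaign)
train.fe <- train
train.fe$previously.contacted <- factor(ifelse(train.fe$pdays == -1, "no", "yes"))
table(train.fe$previously.contacted, train.fe$y)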

It is no secret that industrial and consumer behavior can be predicted, in part, by certain economic indicators. Such indicators are used to make generalized diagnoses about the state of an economy, which can have sweeping implications for the general population of a country. Indicators of this kind have been added to our dataset. These additional features include macroeconomic indicators such as net disposable income, interest rates, inflation rates, and unemployment rates, all specific to Portugal and to the time frame, May 2008 to November 2010, in which each observation in the dataset was recorded. All economic data was gathered from https://tradingeconomics.com and joined with the original data in Microsoft Excel. Let’s read in these new features and join them to our existing dataset.

The main goal of this project is to role-play as a team of data analysts/data scientists tasked with creating a model using the train and test data we have been given. When training a model, one can augment the available data in countless ways; however, this project assumes the bank has collected only a fixed set of features for the test set. For this reason, the final model will use only the original features included in the test set. A second model, trained on the dataset that includes the new features, will be built alongside the original one in order to highlight how much a model can improve when given additional, quality data to train on.

# Reading in the expanded bank data

bank_additional <- read.xlsx('new_bank_data.xlsx', na.strings=c("","NA"))

# Convert the character columns to factors and the numeric columns to integers,
# so the column types match those of the original train set

fact.cols <- c("job", "marital", "education", "default", "housing", "loan", "contact", "month", "y", "poutcome")
bank_additional[fact.cols] <- lapply(bank_additional[fact.cols], factor)

int.cols <- c("age", "balance", "duration", "pdays", "day", "campaign", "previous")
bank_additional[int.cols] <- lapply(bank_additional[int.cols], as.integer)

# Use dplyr to join the economic indicators onto the training data

train.newfeat <- inner_join(train, bank_additional)
## Joining, by = c("age", "job", "marital", "education", "default", "balance", "housing", "loan", "contact", "day", "month", "duration", "campaign", "pdays", "previous", "poutcome", "y")
## Warning: Column `month` joining factors with different levels, coercing to
## character vector
head(train.newfeat)
##   age           job marital education default balance housing loan
## 1  39      services married secondary      no       0     yes   no
## 2  52    management married  tertiary      no    1646      no  yes
## 3  40 self-employed  single   unknown      no     109     yes   no
## 4  48    management married   primary      no     427      no   no
## 5  46    technician married  tertiary      no    2207     yes   no
## 6  33   blue-collar married secondary      no    2254     yes   no
##    contact day month duration campaign pdays previous poutcome  y
## 1  unknown  20   may      793        2    -1        0  unknown no
## 2 cellular   9   feb       92        3    -1        0  unknown no
## 3 cellular   4   feb      258        1    -1        0  unknown no
## 4 cellular  17   nov       62        1    -1        0  unknown no
## 5  unknown  16   may      243        3    -1        0  unknown no
## 6 cellular  17   apr      130        1   148        1  failure no
##   net.disposable.income euribor3m consumer.spending loan.growth
## 1               36455.2     4.856           29219.0         9.7
## 2               34302.8     2.180           28485.0         5.7
## 3               34302.8     2.700           28485.0         5.7
## 4               35283.5     4.191           29106.3         7.8
## 5               36455.2     4.859           29219.0         9.7
## 6               35112.0     1.405           28281.4         4.5
##   consumer.confidence nr.employed unemployment.rate inflation.rate
## 1               -30.8      5191.0               7.2            2.8
## 2               -43.6      5176.3               8.8            0.2
## 3               -43.6      5176.3               8.8            0.2
## 4               -31.1      5195.8               7.8            1.3
## 5               -30.8      5191.0               7.2            2.8
## 6               -43.0      5099.1               9.0           -0.5
##   emp.var.rate
## 1          1.1
## 2         -0.2
## 3         -0.2
## 4         -0.1
## 5          1.1
## 6         -1.8

Much of the added economic data seems to mimic noise and may not contribute to the accuracy of the model. This is likely because the time span of our original data is not long enough: the short-term, volatile behavior of the Portuguese economy dominates, whereas a longer time frame might show a more pronounced, structured pattern. For example, nearly one-third of the data originates from May 2008 to August 2008, so any indicators will be biased toward this majority. Nevertheless, we will see how the random forest behaves with this data.
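As a rough, purely illustrative check of that concentration, we can look at how the observations are spread across contact months in the joined train.newfeat frame; the original data carry no year column, so this is only a coarse proxy:

# Share of observations per contact month (a coarse proxy for how the
# campaign is concentrated in time, since no year column is available)
round(sort(prop.table(table(train.newfeat$month)), decreasing = TRUE), 3)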

# Removing unnecessary columns

train.newfeat <- train.newfeat[ , !(names(train.newfeat) %in% c("pdays", "poutcome", "duration"))]

# Converting certain numeric columns to factor columns

train.newfeat[c("day", "month", "campaign", "previous")] <- lapply(train.newfeat[c("day",  "month", "campaign", "previous")], factor)

# Train-test split for the train data with new features

set.seed(2018)

split <- sample.split(train.newfeat$y, SplitRatio = 0.8)
train.train.nf <- subset(train.newfeat, split == FALSE)
train.test.nf <- subset(train.newfeat, split == TRUE)

# Test x and y for original features

x.train.test <- train.test[, !(colnames(train.test) %in% c("y"))]
y.train.test <- train.test$y

# Test x and y for new features

x.train.test.nf <- train.test.nf[, !(colnames(train.test.nf) %in% c("y"))]
y.train.test.nf <- train.test.nf$y

# Test x for final test set 

x.test = test.new[, !(colnames(test.new) %in% c("y","id"))]

Now, let’s assess the base performance of both our random forests.

# Fitting the individual models

set.seed(2018)

# Random Forest (Original Features)

rf <- randomForest(y~., data = train.train)
pred.rf <- predict(rf, newdata = x.train.test)

# rf$votes holds the out-of-bag vote fractions for each training observation,
# so the AUC computed below is an OOB estimate on the training split
auc.predictions <- as.vector(rf$votes[,2])
auc.pred <- ROCR::prediction(auc.predictions, train.train$y)
perf.auc <- ROCR::performance(auc.pred,"auc") 
rf.auc <- perf.auc@y.values[[1]]

cm.rf = confusionMatrix(pred.rf, y.train.test)

set.seed(2018)

# Random Forest (New Features)

rf.nf <- randomForest(y~., data = train.train.nf)
pred.rf.nf <- predict(rf.nf, newdata = x.train.test.nf)

auc.predictions <- as.vector(rf.nf$votes[,2])
auc.pred <- ROCR::prediction(auc.predictions, train.train.nf$y)
perf.auc <- ROCR::performance(auc.pred,"auc") 
rf.nf.auc <- perf.auc@y.values[[1]]

cm.rf.nf <- confusionMatrix(pred.rf.nf, y.train.test.nf)

print("Random Forest (Original Features)")
## [1] "Random Forest (Original Features)"
print(paste("AUC:", round(rf.auc,3)))
## [1] "AUC: 0.775"
cm.rf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  5401  534
##        yes  191  203
##                                           
##                Accuracy : 0.8854          
##                  95% CI : (0.8773, 0.8932)
##     No Information Rate : 0.8836          
##     P-Value [Acc > NIR] : 0.3276          
##                                           
##                   Kappa : 0.3024          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9658          
##             Specificity : 0.2754          
##          Pos Pred Value : 0.9100          
##          Neg Pred Value : 0.5152          
##              Prevalence : 0.8836          
##          Detection Rate : 0.8534          
##    Detection Prevalence : 0.9377          
##       Balanced Accuracy : 0.6206          
##                                           
##        'Positive' Class : no              
## 
print("Random Forest (New Features)")
## [1] "Random Forest (New Features)"
print(paste("AUC:", round(rf.nf.auc,3)))
## [1] "AUC: 0.787"
cm.rf.nf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  21258  1857
##        yes  1111  1092
##                                           
##                Accuracy : 0.8828          
##                  95% CI : (0.8787, 0.8867)
##     No Information Rate : 0.8835          
##     P-Value [Acc > NIR] : 0.6496          
##                                           
##                   Kappa : 0.3602          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9503          
##             Specificity : 0.3703          
##          Pos Pred Value : 0.9197          
##          Neg Pred Value : 0.4957          
##              Prevalence : 0.8835          
##          Detection Rate : 0.8396          
##    Detection Prevalence : 0.9130          
##       Balanced Accuracy : 0.6603          
##                                           
##        'Positive' Class : no              
## 

It looks like the model utilizing the dataset with the new features performs better on the test set than the one without. Our original model has an AUC of 0.78 and a specificity of 28%, while the model using the new features has an AUC of 0.79 and a specificity of 37%. The new-feature model suffers only a slight decrease in sensitivity yet captures roughly 9 percentage points more of the subscribers in the test set. The new features seem promising.

Hyperparameter Tuning

Now, let’s determine the optimal parameter values for our random forests.

# Function that refits a random forest over a range of values for a single
# hyperparameter ("ntree", "mtry", or "nodesize") and plots the resulting
# accuracy and specificity on the hold-out set

rf.paramopt.plot <- function(dataset = NULL, xtest = NULL, ytest = NULL, param = NULL, range = NULL){
  
  paramopt.acc <- c()
  paramopt.specif <- c()
  
  for(i in seq_along(range)){
    
    # Build the argument list so the tuned parameter can be passed to randomForest by name
    rf.args <- list(y ~ ., data = dataset)
    rf.args[[param]] <- range[i]
    
    model <- do.call(randomForest, rf.args)
    prediction <- predict(model, newdata = xtest)
    cm <- confusionMatrix(prediction, ytest)
    
    paramopt.acc[i] <- as.numeric(cm$overall["Accuracy"])
    paramopt.specif[i] <- as.numeric(cm$byClass["Specificity"])
  }
  
  # Accuracy vs. parameter value
  plot(x = range, y = paramopt.acc,
       xlab = param, ylab = "Accuracy",
       type = "l", col = "blue")
  points(x = range, y = paramopt.acc, col = "blue", pch = 19)
  grid(NULL, NULL, lty = 6, col = "cornsilk2")
  
  # Specificity vs. parameter value
  plot(x = range, y = paramopt.specif,
       xlab = param, ylab = "Specificity",
       type = "l", col = "blue")
  points(x = range, y = paramopt.specif, col = "blue", pch = 19)
  grid(NULL, NULL, lty = 6, col = "cornsilk2")
}

Parameter Optimization (original features)

rf.paramopt.plot(dataset = train.train, xtest = x.train.test, ytest = y.train.test, param = "ntree", range = seq(100, 3000, 200))

An ntree value of 900 seems to maximize both accuracy and specificity while keeping ntree to a minimum. We do not want to over-train the model, and we want to keep the amount of time needed to train it to a minimum.

rf.paramopt.plot(dataset = train.train, xtest = x.train.test, ytest = y.train.test, param = "mtry", range = seq(1,10,1))

An mtry of 5 likewise seems to maximize both accuracy and specificity while keeping mtry to a minimum.

rf.paramopt.plot(dataset = train.train, xtest = x.train.test, ytest = y.train.test, param = "nodesize", range = seq(1,10,1))

It looks like any nodesize other than 1 only decreases specificity. For this reason, we will stick with 1 as our nodesize value.

# Tuning 'cutoff' by refitting the forest over a grid of vote thresholds

cutoffs <- seq(0.50, 0.95, 0.05)
paramopt.acc <- c()
paramopt.specif <- c()

for(i in seq_along(cutoffs)){
  
  model <- randomForest(y~., data = train.train, cutoff = c(cutoffs[i], 1 - cutoffs[i]))
  prediction <- predict(model, newdata = x.train.test)
  cm <- confusionMatrix(prediction, y.train.test)
  
  paramopt.acc[i] <- as.numeric(cm$overall["Accuracy"])
  paramopt.specif[i] <- as.numeric(cm$byClass["Specificity"])
}

# Accuracy vs. cutoff
plot(x = cutoffs, y = paramopt.acc,
     xlab = "cutoff", ylab = "Accuracy",
     type = "l", col = "blue")
points(x = cutoffs, y = paramopt.acc, col = "blue", pch = 19)
grid(NULL, NULL, lty = 6, col = "cornsilk2")

# Specificity vs. cutoff
plot(x = cutoffs, y = paramopt.specif,
     xlab = "cutoff", ylab = "Specificity",
     type = "l", col = "blue")
points(x = cutoffs, y = paramopt.specif, col = "blue", pch = 19)
grid(NULL, NULL, lty = 6, col = "cornsilk2")

It seems like a cutoff at (0.6, 0.4) does the best job of maximizing specificity while causing a minimal decrease in overall accuracy.
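For intuition on what the cutoff argument does: randomForest assigns each observation to the class with the largest ratio of vote share to cutoff, so a cutoff of c(0.6, 0.4) labels a client 'yes' whenever more than 40% of the trees vote 'yes' rather than the default 50%. The sketch below applies that rule by hand to the base forest fit earlier; it is illustrative only, and 'votes' and 'manual.pred' are hypothetical names.

# Vote shares from the base random forest on the hold-out split
votes <- predict(rf, newdata = x.train.test, type = "prob")

# The winning class maximizes (vote share / cutoff); with cutoff = c(0.6, 0.4)
# this reduces to predicting "yes" whenever the "yes" vote share exceeds 0.4
manual.pred <- factor(ifelse(votes[, "yes"] / 0.4 > votes[, "no"] / 0.6, "yes", "no"))
table(manual.pred, y.train.test)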

Parameter Optimization (new features added)

Now, let’s determine the optimal parameter values for our random forest using the dataset that includes our new features.

rf.paramopt.plot(dataset = train.train.nf, xtest = x.train.test.nf, ytest = y.train.test.nf, param = "ntree", range = seq(100, 3000, 200))

Using the dataset with the new features, an ntree of 500 seems to give us the best accuracy and specificity.

rf.paramopt.plot(dataset = train.train.nf, xtest = x.train.test.nf, ytest = y.train.test.nf, param = "mtry", range = seq(3,15,3))

For this collection of features, mtry of 6 seems best.

rf.paramopt.plot(dataset = train.train.nf, xtest = x.train.test.nf, ytest = y.train.test.nf, param = "nodesize", range = seq(1,10,1))

Once again, changing nodesize to anything other than 1 decreases specificity, so we will leave it at 1.

# Tuning 'cutoff' for the model with the new features

cutoffs <- seq(0.50, 0.95, 0.05)
paramopt.acc <- c()
paramopt.specif <- c()

for(i in seq_along(cutoffs)){
  
  model <- randomForest(y~., data = train.train.nf, cutoff = c(cutoffs[i], 1 - cutoffs[i]))
  prediction <- predict(model, newdata = x.train.test.nf)
  cm <- confusionMatrix(prediction, y.train.test.nf)
  
  paramopt.acc[i] <- as.numeric(cm$overall["Accuracy"])
  paramopt.specif[i] <- as.numeric(cm$byClass["Specificity"])
}

# Accuracy vs. cutoff
plot(x = cutoffs, y = paramopt.acc,
     xlab = "cutoff", ylab = "Accuracy",
     type = "l", col = "blue")
points(x = cutoffs, y = paramopt.acc, col = "blue", pch = 19)
grid(NULL, NULL, lty = 6, col = "cornsilk2")

# Specificity vs. cutoff
plot(x = cutoffs, y = paramopt.specif,
     xlab = "cutoff", ylab = "Specificity",
     type = "l", col = "blue")
points(x = cutoffs, y = paramopt.specif, col = "blue", pch = 19)
grid(NULL, NULL, lty = 6, col = "cornsilk2")

Based on the graph above, we will once again go with a cutoff of (0.6, 0.4). Although it may be a conservative selection, this cutoff increases our specificity by roughly 20 percentage points while minimizing the loss in accuracy.

Now, let’s compare the performance of the tuned models with and without our new features.

# Assessing performance of both random forests with optimized parameters

set.seed(2018)

rf <- randomForest(y~., data = train.train, ntree = 900, mtry = 5, nodesize = 1, cutoff = c(0.6,0.4))
pred.rf <- predict(rf, newdata = x.train.test)
cm.rf = confusionMatrix(pred.rf, y.train.test)

auc.predictions <- as.vector(rf$votes[,2])
auc.pred <- ROCR::prediction(auc.predictions, train.train$y)
perf.auc <- ROCR::performance(auc.pred,"auc") 
rf.auc <- perf.auc@y.values[[1]]

set.seed(2018)

rf.nf <- randomForest(y~., data = train.train.nf, ntree = 500, mtry = 6, nodesize = 1, cutoff = c(0.6,0.4))
pred.rf.nf <- predict(rf.nf, newdata = x.train.test.nf)
cm.rf.nf <- confusionMatrix(pred.rf.nf, y.train.test.nf)

auc.predictions <- as.vector(rf.nf$votes[,2])
auc.pred <- ROCR::prediction(auc.predictions, train.train.nf$y)
perf.auc <- ROCR::performance(auc.pred,"auc") 
rf.nf.auc <- perf.auc@y.values[[1]]

print("Random Forest (original features)")
## [1] "Random Forest (original features)"
print(paste("AUC:", round(rf.auc,3)))
## [1] "AUC: 0.777"
cm.rf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  5223  446
##        yes  369  291
##                                           
##                Accuracy : 0.8712          
##                  95% CI : (0.8627, 0.8794)
##     No Information Rate : 0.8836          
##     P-Value [Acc > NIR] : 0.998801        
##                                           
##                   Kappa : 0.3445          
##  Mcnemar's Test P-Value : 0.007764        
##                                           
##             Sensitivity : 0.9340          
##             Specificity : 0.3948          
##          Pos Pred Value : 0.9213          
##          Neg Pred Value : 0.4409          
##              Prevalence : 0.8836          
##          Detection Rate : 0.8252          
##    Detection Prevalence : 0.8957          
##       Balanced Accuracy : 0.6644          
##                                           
##        'Positive' Class : no              
## 
print("Random Forest (new features)")
## [1] "Random Forest (new features)"
print(paste("AUC:", round(rf.nf.auc,3)))
## [1] "AUC: 0.781"
cm.rf.nf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  20639  1492
##        yes  1730  1457
##                                           
##                Accuracy : 0.8727          
##                  95% CI : (0.8686, 0.8768)
##     No Information Rate : 0.8835          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4026          
##  Mcnemar's Test P-Value : 2.976e-05       
##                                           
##             Sensitivity : 0.9227          
##             Specificity : 0.4941          
##          Pos Pred Value : 0.9326          
##          Neg Pred Value : 0.4572          
##              Prevalence : 0.8835          
##          Detection Rate : 0.8152          
##    Detection Prevalence : 0.8741          
##       Balanced Accuracy : 0.7084          
##                                           
##        'Positive' Class : no              
## 

Once again, the model using the new features comes out on top, with a higher training (OOB) AUC and a higher specificity on the hold-out split.

Now, let’s determine feature importances according to our models.

# Original Features

print("Original Features")
## [1] "Original Features"
rf.imp <- data.frame(rf$importance)
rf.imp
##           MeanDecreaseGini
## age              723.83149
## job              485.03658
## marital          148.64156
## education        195.16187
## default           16.48290
## balance          906.49455
## housing          142.30559
## loan              62.63238
## contact          100.11618
## day              984.61484
## month            698.91178
## campaign         362.51512
## previous         307.99483
# New Features

print("New Features")
## [1] "New Features"
rf.nf.imp <- data.frame(rf.nf$importance)
rf.nf.imp
##                       MeanDecreaseGini
## age                         128.209855
## job                         119.604751
## marital                      35.309518
## education                    45.044240
## default                       2.168825
## balance                     158.042471
## housing                      20.699186
## loan                         14.166906
## contact                      17.769829
## day                         214.949905
## month                        47.835212
## campaign                     78.523172
## previous                     57.369282
## net.disposable.income         8.355864
## euribor3m                    88.940370
## consumer.spending            10.034231
## loan.growth                  61.063818
## consumer.confidence          27.523423
## nr.employed                  60.950152
## unemployment.rate            29.118750
## inflation.rate               19.162105
## emp.var.rate                 19.341580

‘Mean Decrease in Gini’ is a measure of variable importance: it totals the decrease in node impurity (the Gini index) attributable to splits on a given variable, averaged over all trees in the forest. The higher the value, the more important the variable is to the model.
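For a quicker read of these tables, the importances can be sorted or plotted directly. The short sketch below does this for both fitted forests, assuming the rf, rf.nf, rf.imp, and rf.nf.imp objects from the chunks above are still in memory.

# Rank features by Mean Decrease in Gini for each model
rf.imp[order(-rf.imp$MeanDecreaseGini), , drop = FALSE]
rf.nf.imp[order(-rf.nf.imp$MeanDecreaseGini), , drop = FALSE]

# randomForest also provides a built-in dot chart of variable importance
varImpPlot(rf, main = "Variable Importance (original features)")
varImpPlot(rf.nf, main = "Variable Importance (new features)")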

Based on our results, ‘day’, ‘balance’, and ‘age’ are the three most important features for the model restricted to the original feature set. The model with the additional features ranks the same three features highest, but the magnitudes of their importances differ considerably, so the added features are presumably interacting with the original ones in some way. Regardless, it is clear from the performance comparison that at least a few of the new features improve the predictive power of the model.

It should be noted that random forests perform a sort of implicit feature selection. At each split of each tree in the forest, a random subset of features is considered, and the split chosen is the one that best partitions the training data into pure subsets. The features that end up highest in a tree can therefore be thought of as the ‘most important’, since they partition the largest number of data points in the fewest steps.
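To see this intuition on a single tree (illustrative only; ctree from the party package loaded earlier uses conditional inference rather than the CART-style splits inside randomForest, but the idea is the same), we could grow a shallow tree and look at which features occupy the top splits:

# A single, depth-limited conditional inference tree: the variables chosen
# for the first splits are the ones that best partition the training data
shallow.tree <- ctree(y ~ ., data = train.train, controls = ctree_control(maxdepth = 2))
plot(shallow.tree)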

Now, let’s retrain the model on all of our training data and obtain a final set of predictions to be evaluated by the professor.

# Making sure the levels of each factor variable in the test set match those in the train set

factor.cols <- c("job", "marital", "education", "default", "housing", "loan",
                 "contact", "day", "month", "campaign", "previous")

for(col in factor.cols){
  x.test[[col]] <- factor(x.test[[col]], levels = levels(train.new[[col]]))
}
set.seed(2018)

rf.final <- randomForest(y~., data = train.new, ntree = 900, mtry = 5, nodesize = 1, cutoff = c(0.6,0.4))

pred.rf.final <- predict(rf.final, newdata = x.test)
pred.rf.final <- as.character(pred.rf.final)

# Re-attach the id to the predictions

final.predictions <- cbind(id = test$id, test.predictions = pred.rf.final)

# Write predictions to a csv file

write.csv(final.predictions,'group_5_test_predictions.csv')
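
As an optional final sanity check, we can confirm that there is one prediction per test observation and inspect the predicted class balance before handing the file off:

# Quick sanity check of the submission
nrow(final.predictions) == nrow(test)
head(final.predictions)
table(pred.rf.final)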