Our task is to generate two decision tree models and a random forest model to make classification predictions, and then, using DeciZone's sales article and examples of real cases where decision trees went wrong, discuss how we can improve the perception of our final decision tree model.
We extended the dataset used in the first assignment: information collected by a Portuguese banking institution and later enriched by researchers with economic data, such as interest rates. We're trying to predict whether a customer of the bank, contacted in a phone marketing campaign, elects to open a savings product.
We’re approaching that problem by creating two decision trees with different initial splits and a random forest model.
In comparing the models we rely on Kappa instead of Accuracy. Accuracy is the percentage of predictions made by the model that were correct. Kappa is accuracy adjusted for chance agreement. When the target classes are perfectly balanced (50% yes and 50% no) accuracy is already informative, but our data is moderately imbalanced (roughly 11% yes and 89% no), so Kappa is the better guide.
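As a quick illustration of the difference, here is a minimal sketch, using a hypothetical confusion matrix rather than our models' actual predictions, of how Kappa discounts the accuracy a model could reach by chance on an imbalanced target:
# Illustrative 2x2 confusion matrix (hypothetical counts, roughly 12% "yes" overall)
cm <- matrix(c(820, 70,    # predicted no:  820 true no, 70 true yes
               60,  50),   # predicted yes: 60 true no,  50 true yes
             nrow = 2, byrow = TRUE,
             dimnames = list(pred = c("no", "yes"), truth = c("no", "yes")))
n  <- sum(cm)
po <- sum(diag(cm)) / n                      # observed accuracy
pe <- sum(rowSums(cm) * colSums(cm)) / n^2   # accuracy expected by chance
kappa <- (po - pe) / (1 - pe)
c(accuracy = po, kappa = kappa)              # about 0.87 accuracy vs about 0.36 Kappa
A model that always predicted "no" on this mix would score about 88% accuracy but a Kappa of zero, which is why Kappa is the fairer comparison for our imbalanced target.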
Another variable at play is cp, a hyperparameter for the decision tree algorithm that stands for Complexity Parameter. It is a penalty applied to the decision tree to keep the model parsimonious. In our case, the rpart package determined the three cp values to try in each model's tuning process, and our models always had the best results with the lower cp values. If cp is too high the model may be overly simplified, but a little cp can help prevent overfitting.
Our first decision tree determined that the number of employees, nr.employed, was the predictor driving the initial split. We weren't able to force a different predictor into that first split with either a y ~ otherPredictor + . or a y ~ . | otherPredictor formula definition; however, we were able to run the decision tree on a subset of predictors that excluded nr.employed to help surface other relationships in the data.
When we ran the random forest model we had a performance issue. My laptop has an Apple M2 chip with 8GB of memory, so it is not equipped for heavy-duty machine learning. In the first assignment we detected overfitting in our decision tree model when we used the dataset downsampled to 10% (4,119 records), so we went with the full dataset of 41,188 records for this assignment; however, that decision made fitting the random forest model complicated. Instead of opting for the downsampled dataset for the random forest model, we used parallel processing with the doParallel library. We also initially ran our model with fewer trees, via ntree, while tuning the mtry and maxnodes parameters.
We read DeciZone's article, "The Good, the Bad, the Ugly of Using Decision Trees," and noted that it is largely a sales pitch for DeciZone's software, which presumably solves all of the bad and ugly aspects of working with decision trees noted in the article.
Focusing on the “bad” list from the article we have the following observations:
Changes: If we're adopting decision trees for business decisions we should be able to change the decision trees as we get new information. Our task is slightly different: we're trying to uncover the relationships in the current data, whereas the article describes leveraging decision trees prescriptively for future actions. That would be the next step for a project like the one we are completing today.
Subjectivity: Do we understand our variables and the thought process behind each of them? In our case most of the variables are self-explanatory; however, it isn't clear how the number of employees, the first branching predictor in our base decision tree model, could be leveraged for future marketing campaigns. It could be a confounding predictor, so it was serendipitous that we removed it for the second decision tree to help surface other relationships in the data.
Evolve: How can we continuously improve the decision tree? In our case we can tune parameters, try variations on a decision tree model, or use statistical measures instead of Gini scores to identify splitting points in the data. Our task, though, is to describe the relationships in the data we have without modeling noise, and we're not really able to modify or drill down into the decision tree using the R tools that we have. That would be a secondary approach after this project, to memorialize the learnings from this exercise as we apply them to future marketing campaigns, possibly using a tool like DeciZone's.
Repeat: How do we check that the same logical pattern doesn't repeat in multiple branches? We can use variable importance to see roughly which predictors had the biggest impact on determining outcomes.
Complexity and Familiarity: Presumably DeciZone's software has a clean, simple interface that makes it easy for users to engage with. We have charts and lists of variable importance.
First we confirm that our models were meaningfully accurate. Balance-adjusted accuracies (Kappa scores) of roughly 0.30 are meaningful enough for exploring variables that have predictive value on something as complicated as human decision making, so we can use our models to surface insights.
We observed in the two decision tree models a winner-takes-all effect, where the first variables selected to build the tree overshadowed the importance of other variables. In the base tree the first variable was the number of employees at our client's banking institution, likely a confounding variable. When we ran the decision tree again on a subset of the predictors that excluded the number of employees, we started to be able to tell a story about how consumer confidence, the interest rate environment, and the absence of any shocks in the employment market made it more likely for customers to agree to the savings product.
In the random forest model we were able to see a more nuanced picture of the predictive variables. For instance, the customer's age, job, and education level had an impact on a 'yes' outcome, so we could advise our client to produce different marketing materials and call center scripts specific to different age groups, or to clusters of job and education levels. We then drilled down on month using partial dependence plots and observed that our client should likely focus on a larger campaign in the summer and a smaller campaign in November.
As a subsequent project for our client we could go through all of the partial dependence plots to build out a series of recommendations for future marketing campaigns.
We are continuing our first homework task with the selection of the “Bank Marketing” data set from the UC Irvine Machine Learning repository.
A Portuguese banking institution collected this data in a direct marketing campaign from May 2008 to November 2010. The data include information about the customers called, their demographics, banking history, and the timing, length and number of interactions they had with the bank, and ultimately whether the customer subscribed to a term deposit.
Later, the paper authors enriched the data with five Portuguese economic indicators like interest rates that could affect whether a customer subscribed.
Moro, S., Rita, P., & Cortez, P. (2014). Bank Marketing [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306.
Our understanding was that the originally intended data sets from excelbianalytics.com were synthetic and unlikely to have meaningful interactions or predictive relationships.
When the data options were opened to kaggle.com we knew we had to find a very large data set so we could subset out a random fraction and contrast strategies for larger and smaller data sets; however, the first ten large data sets (30k+ rows) we found were either fractured with significant missingness, had too few features, or lent themselves to projects outside of our scope, like Natural Language Processing (NLP).
We turned to the UC Irvine Machine Learning Repository looking for large, nonsynthetic sets that were used in academic papers. The Bank Marketing data set seemed like it would be interesting to work with and met the size and complexity requirements.
The difficulty looks medium in that we can tune some sophisticated models but there will be minimal data wrangling. After an initial review of the data, we're unlikely to explore interaction terms without direction from a subject matter expert. Nor will we have to transform our data significantly, since we are working with tree algorithms, which, unlike regression algorithms, can handle predictors on different scales. There are no missing values; however, there are cell values of "unknown" which could be kept as "unknown", imputed, or deleted.
Here we load just our large data set, instead of both our large and small data sets. We found in the first homework that a decision tree was capable of overfitting the small data set, even with its 4,119 observations.
There are 41,188 observations in our data set with a structure of 20 features or prediction variables and one output attribute or target variable.
# Check out packages
library(readr) # data importation
library(tidyverse) # has dplyr
library(corrplot) # correlation matrix plots
library(caret) # model structure
library(rpart) # decision trees
library(rpart.plot) # decision trees
library(partykit) # Specialty decision tree options
library(randomForest) # Random forest
library(doParallel) # speed up random forest
library(pdp) # partial dependence plots
library(party) # Conditional Inference Trees
library(gbm) # Boosted Trees
# Load data
loc_df <- "~/Documents/D622/HW1/bank-additional-full.csv"
df0 <- read_delim(loc_df, delim = ";", escape_double = FALSE, trim_ws = TRUE)
Here we show the data set structure with the first few examples in each variable.
# Data Preview
glimpse(df0)
## Rows: 41,188
## Columns: 21
## $ age <dbl> 56, 57, 37, 40, 56, 45, 59, 41, 24, 25, 41, 25, 29, 57,…
## $ job <chr> "housemaid", "services", "services", "admin.", "service…
## $ marital <chr> "married", "married", "married", "married", "married", …
## $ education <chr> "basic.4y", "high.school", "high.school", "basic.6y", "…
## $ default <chr> "no", "unknown", "no", "no", "no", "unknown", "no", "un…
## $ housing <chr> "no", "no", "yes", "no", "no", "no", "no", "no", "yes",…
## $ loan <chr> "no", "no", "no", "no", "yes", "no", "no", "no", "no", …
## $ contact <chr> "telephone", "telephone", "telephone", "telephone", "te…
## $ month <chr> "may", "may", "may", "may", "may", "may", "may", "may",…
## $ day_of_week <chr> "mon", "mon", "mon", "mon", "mon", "mon", "mon", "mon",…
## $ duration <dbl> 261, 149, 226, 151, 307, 198, 139, 217, 380, 50, 55, 22…
## $ campaign <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ pdays <dbl> 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, …
## $ previous <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome <chr> "nonexistent", "nonexistent", "nonexistent", "nonexiste…
## $ emp.var.rate <dbl> 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, …
## $ cons.price.idx <dbl> 93.994, 93.994, 93.994, 93.994, 93.994, 93.994, 93.994,…
## $ cons.conf.idx <dbl> -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36.4,…
## $ euribor3m <dbl> 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.857,…
## $ nr.employed <dbl> 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5…
## $ y <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "…
Here we borrow heavily from the documentation provided through the citation website.
Bank Customer Features
age - customer's age
job - type of job
marital - marital status (note, widowed is counted under "divorced")
education - educational attainment
default - does the customer have credit in default?
housing - does the customer have a housing loan?
loan - does the customer have a personal loan?
Campaign Features
contact - contacted by "cellular" or "telephone"
month - month of year they were last contacted
day_of_week - day of week they were last contacted (Monday - Friday)
duration - last contact duration in seconds
Other Features
campaign - number of contacts made to this customer for this campaign
pdays - number of days since the customer was last contacted in the previous campaign
previous - number of contacts made to this customer for the previous campaign
poutcome - outcome of the previous marketing campaign
Economic Features
emp.var.rate - quarterly employment variation rate
cons.price.idx - monthly consumer price index
cons.conf.idx - monthly consumer confidence index
euribor3m - daily Euribor 3-month rate
nr.employed - quarterly number of employees
Here is the column representing the outcome, which makes this a classification problem.
Target Variable
y - outcome of the current marketing campaign: "has the customer subscribed a term deposit?"
Here we provide summary statistics for our data set's numeric values in order to look for outliers and irregularities.
pdays, the number of days since the customer was last contacted in the previous campaign, represents "never contacted" as 999 days. This may be sufficient for modeling purposes.
nr.employed looks like it may be the number of employees at the bank. It's not clear how that could benefit the analysis; maybe fewer employees means less effective campaign calls.
# Show summary statistics for numeric columns
df0 %>%
select(where(is.numeric)) %>%
summary()
## age duration campaign pdays
## Min. :17.00 Min. : 0.0 Min. : 1.000 Min. : 0.0
## 1st Qu.:32.00 1st Qu.: 102.0 1st Qu.: 1.000 1st Qu.:999.0
## Median :38.00 Median : 180.0 Median : 2.000 Median :999.0
## Mean :40.02 Mean : 258.3 Mean : 2.568 Mean :962.5
## 3rd Qu.:47.00 3rd Qu.: 319.0 3rd Qu.: 3.000 3rd Qu.:999.0
## Max. :98.00 Max. :4918.0 Max. :56.000 Max. :999.0
## previous emp.var.rate cons.price.idx cons.conf.idx
## Min. :0.000 Min. :-3.40000 Min. :92.20 Min. :-50.8
## 1st Qu.:0.000 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7
## Median :0.000 Median : 1.10000 Median :93.75 Median :-41.8
## Mean :0.173 Mean : 0.08189 Mean :93.58 Mean :-40.5
## 3rd Qu.:0.000 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4
## Max. :7.000 Max. : 1.40000 Max. :94.77 Max. :-26.9
## euribor3m nr.employed
## Min. :0.634 Min. :4964
## 1st Qu.:1.344 1st Qu.:5099
## Median :4.857 Median :5191
## Mean :3.621 Mean :5167
## 3rd Qu.:4.961 3rd Qu.:5228
## Max. :5.045 Max. :5228
Here we use code to generate the counts of each type of value in the categorical variables.
For our target variable, about 11.3% of the records (4,640 "yes" versus 36,548 "no") were yes for the current campaign. This means we have a mild imbalance in our data that should be addressed either through resampling or model techniques.
Notably, there were only three "yes" values for default in our dataset, so we also have an issue with degenerate variables with low categorical frequency.
# Get counts of every factor in each categorical column
categorical_counts <- df0 %>%
select(where(~ is.character(.) || is.factor(.))) %>%
pivot_longer(everything(), names_to = "Column", values_to = "Value") %>%
group_by(Column, Value) %>%
summarise(Count = n(), .groups = "drop")
# View the result
categorical_counts
## # A tibble: 55 × 3
## Column Value Count
## <chr> <chr> <int>
## 1 contact cellular 26144
## 2 contact telephone 15044
## 3 day_of_week fri 7827
## 4 day_of_week mon 8514
## 5 day_of_week thu 8623
## 6 day_of_week tue 8090
## 7 day_of_week wed 8134
## 8 default no 32588
## 9 default unknown 8597
## 10 default yes 3
## # ℹ 45 more rows
There are no NA values in our data set; however, in six of the eleven categorical variables there are missing values coded as "unknown". We're going to treat "unknown" as a separate class, but we could instead try imputing the true values and compare.
# There are zero NA values in our data set
#df0 %>% summarise(total_missing = sum(is.na(.)))
# Show categorical values of "unknown"
unknown_counts <- df0 %>%
select(where(~ is.character(.) || is.factor(.))) %>%
summarise(across(everything(), ~ sum(. == "unknown"), .names = "unknown_{.col}")) %>%
t()
unknown_counts <- as.data.frame(unknown_counts)
unknown_counts
## V1
## unknown_job 330
## unknown_marital 80
## unknown_education 1731
## unknown_default 8597
## unknown_housing 990
## unknown_loan 990
## unknown_contact 0
## unknown_month 0
## unknown_day_of_week 0
## unknown_poutcome 0
## unknown_y 0
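As a hedged sketch of the imputation alternative mentioned above (an approach we did not apply), one could recode "unknown" to NA and fill each categorical column with its most common level:
# Sketch only: treat "unknown" as missing and impute with each column's modal level
impute_mode <- function(x) {
  x[x == "unknown"] <- NA
  mode_level <- names(sort(table(x), decreasing = TRUE))[1]  # most frequent level
  x[is.na(x)] <- mode_level
  x
}
df0_imputed <- df0 %>%
  mutate(across(where(is.character), impute_mode))
A fairer comparison would use model-based imputation (for example with the mice package), but even this simple version would let us compare performance against keeping "unknown" as its own class.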
Here is the correlation matrix plot for the numeric columns. While we can't see the correlation between these and the non-numeric target variable y, we note some strong correlations.
pdays and previous have a strong negative correlation. This is because customers who were not contacted in the previous campaign have 999 days since last contact and 0 previous contacts, while customers with at least one previous contact have a much smaller number of days since last contact, for example 15 or 30 days.
euribor3m, the daily Euribor 3-month rate, is an average of the rates at which European banks lend Euros to each other, so it makes sense that it is positively correlated with inflation (cons.price.idx) and with the quarterly employment variation rate (emp.var.rate): when borrowing costs are higher, companies are less likely to hire. This includes our Portuguese banking institution, whose total number of employees, nr.employed, is highly correlated with the daily Euribor 3-month rate, euribor3m.
# Correlation matrix plot
correlations <- cor(df0[, sapply(df0, is.numeric)])
corrplot::corrplot(correlations, method="square", type = "upper", order = 'original')
We're going to remove duration because, if this is to be a useful prediction model for the bank, we can't know in advance how long the call with the customer is going to be.
Separate from this analysis, the bank could use this information to produce guidelines for how long it's reasonable to spend on a phone call with a customer in future campaigns, but for our purposes we will remove it.
# Remove duration column
df1 <- df0 %>% select(-duration)
Here we checked our dataset for multicollinearity by fitting a logistic regression model to assess the Variance Inflation Factor (VIF) for each predictor.
A VIF close to one is ideal, indicating no correlation between the variable and the others. A VIF above 10 is a warning of high multicollinearity that could impair our modeling efforts.
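As a reminder of where VIF comes from, here is a minimal sketch for a single numeric predictor: regress it on the other numeric predictors and compute 1 / (1 - R^2). This won't exactly match the generalized VIF values car::vif() reports for the full model below, since that model also includes the categorical predictors.
# Sketch: VIF of euribor3m from first principles, using only the numeric predictors
num_df <- df1 %>% select(where(is.numeric))
r2 <- summary(lm(euribor3m ~ ., data = num_df))$r.squared
1 / (1 - r2)   # values far above 10 signal troublesome multicollinearity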
Note, one of our variables, loan, was found to have at least one of its one-hot encoded subvariables perfectly collinear with the other categorical variables. This means we have to remove the variable to move forward.
Here we show three variables with VIF scores above 10, establishing that we have high multicollinearity. This means we may see higher variability in the types of decision trees fit to our data, because highly correlated variables compete to be chosen for splitting but only one of them is selected; we could get several roughly similar-performing decision trees depending on which of the multicollinear variables wins. Random forest-type algorithms, because they build many small trees that each consider only a limited subset of the features, tend to be less sensitive to multicollinearity.
# Convert "yes" to 1 and "no" to 0 in the target column `y`
df2 <- df1
df2$y <- ifelse(df2$y == "yes", 1, 0)
# loan was perfectly collinear and had to be removed
#alias(glm_model)
df2 <- df2 %>% select(-loan)
# Logistic regression model and VIF for small
#glm_model <- glm(y ~ ., data = small2, family = binomial)
#vif_values <- car::vif(glm_model)
#vif_values
# Logistic regression model and VIF for large
glm_model <- glm(y ~ ., data = df2, family = binomial)
vif_values <- car::vif(glm_model)
vif_values
## GVIF Df GVIF^(1/(2*Df))
## age 2.203093 1 1.484282
## job 5.655303 11 1.081938
## marital 1.440082 3 1.062669
## education 3.214727 7 1.086988
## default 1.138725 2 1.033010
## housing 1.011423 2 1.002844
## contact 2.411083 1 1.552766
## month 65.363374 9 1.261397
## day_of_week 1.060082 4 1.007320
## campaign 1.043997 1 1.021761
## pdays 10.759483 1 3.280165
## previous 4.665959 1 2.160083
## poutcome 25.193842 2 2.240390
## emp.var.rate 144.876834 1 12.036479
## cons.price.idx 65.824565 1 8.113234
## cons.conf.idx 5.335383 1 2.309845
## euribor3m 142.091066 1 11.920196
## nr.employed 172.521063 1 13.134727
We're going to create two decision tree models and one random forest model to meet the requirements of the task; however, we'll also try as many variants of the decision tree as we can to form a sort of lookbook for future modeling efforts.
Besides producing predictions and using 10-fold cross validation to get a sense of how well our models generalize to new data, we'll also see if we can surface insights about which features are important in predicting success in the subsequent Insights section.
Here we split up our data into inputs dfx and output dfy and remove records with sparse categorical data. Some folds may fail during cross validation because of sparse categories in education (18 records of "illiterate") and default (3 records of "yes"), so we've decided to exclude these records for modeling purposes. An alternative would be to keep these records but lump them into the next lowest educational attainment category, or into the unknown default category, similar to how "widowed" was counted as "divorced" in the marital variable to reduce rare categorical levels. A sketch of that alternative follows the code below.
# See all education attainment levels
#df2 %>%
# count(education)
# There are 3 yes and 18 illiterate in our dataset
#sum(df2$default == "yes", na.rm = TRUE)
#sum(df2$education == "illiterate", na.rm = TRUE)
# Remove records with the two rare level categories
df3 <- df2 %>%
filter(!(education == "illiterate" | default == "yes"))
# Resplit data
dfx <- df3 |> select(-y)
dfy <- df3$y
dfx <- as.data.frame(dfx)
dfx <- dfx %>% mutate(across(where(is.character), as.factor))
dfy <- as.factor(dfy)
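Here is the sketch of the lumping alternative described above, which we did not apply in our models. Treating basic.4y as the "next lowest" attainment category for the 18 "illiterate" records is our own assumption; the 3 "yes" defaults are folded into "unknown".
# Sketch only: collapse the rare levels instead of dropping the records
df3_alt <- df2 %>%
  mutate(education = ifelse(education == "illiterate", "basic.4y", education),
         default   = ifelse(default == "yes", "unknown", default))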
Here we train our decision tree models with the same data used in the logistic regression models. There were no complications or need for additional preparation beyond the grooming already done for those models.
There was no onerous increase in time to train on the larger data set; however, we observed an extra-high accuracy on the smaller data set that diminished when training the same decision tree model on the larger data set. This may be evidence that decision trees tend to overfit smaller data sets.
We have an accuracy of 89.99% and a Kappa of 0.2976. Note that without setting the same seed we’d get slightly different results each time we run this.
Note that our cp of 0.0031 is one of three options determined by the rpart package. What we can conclude is that our model doesn't benefit from heavily penalized trees with a high cp; however, a little cp helps reduce the likelihood of overfitting.
It would be interesting to see whether a higher cp would actually increase the benefit of our variable importance analysis, hurting the model's accuracy but helping to simplify the discussion about which variables are influencing the outcome.
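If we wanted to test that idea, a sketch like the one below (with hypothetical cp values of our own choosing) would have caret evaluate a custom cp grid rather than the three values proposed automatically. We did not run this as part of the assignment.
# Sketch only: supply our own cp grid, including more heavily penalized trees
cp_grid <- expand.grid(cp = c(0.001, 0.003, 0.01, 0.03, 0.1))
set.seed(175328)
decision_tree_cp <- train(
  x = dfx,
  y = dfy,
  method = "rpart",
  tuneGrid = cp_grid,   # overrides the automatic three-value cp grid
  trControl = trainControl(method = "cv", number = 10)
)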
# Train Decision Tree Model
set.seed(175328)
decision_tree_base <- train(
x = dfx,
y = dfy,
method = "rpart",
trControl = trainControl(method = "cv", number = 10)
)
# View the Decision Tree Model Summary
print(decision_tree_base)
## CART
##
## 41167 samples
## 18 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 37050, 37049, 37050, 37050, 37051, 37051, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.003091746 0.8999198 0.2975951
## 0.004817371 0.8991912 0.2831523
## 0.053817947 0.8911506 0.0976546
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.003091746.
pruned_tree <- prune(decision_tree_base$finalModel, cp = 0.003091746)
# Plot the pruned tree
rpart.plot(pruned_tree, main = "Pruned Decision Tree (cp = 0.00309)")
We're including variable importance to showcase how rerunning after removing nr.employed makes room for new predictors to be used for splitting.
# Calculate variable importance
importance <- varImp(decision_tree_base)
# Plot variable importance
plot(importance, main = "Variable Importance - Base Decision Tree")
We saw that our model first branched using the nr.employed feature. Some of this could depend on our seed value, and a different seed value could result in a different initial split. It looks like one unseeded run actually split first on euribor3m; however, I wasn't able to duplicate that after 20 seeded attempts. It may be that some of my experiments with trying to force an initial split on euribor3m worked, but in a way I hadn't controlled for in my code order.
I had tried code like putting euribor3m before the . as in the example below:
tree <- ctree(y ~ euribor3m + ., data = df3)
Or defining the model formula as normal, but with a pipe and then the name of the predictor we want to prioritize, as in this example:
y ~ . | euribor3m
However, I wasn't able to make either of them work independently. (And the second may only apply to model-based recursive partitioning rather than to a standard decision tree.)
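Another route we did not end up using, sketched below under our own assumptions, is rpart's cost argument. It divides each variable's split improvement by a per-variable cost, so it can discourage, without outright forbidding, splits on nr.employed.
# Sketch only: make splits on nr.employed expensive so other predictors surface
library(rpart)
costs <- rep(1, ncol(dfx))
costs[names(dfx) == "nr.employed"] <- 100   # heavily penalize this one variable
tree_cost <- rpart(y ~ ., data = cbind(dfx, y = dfy),
                   method = "class",
                   cost = costs,
                   control = rpart.control(cp = 0.003))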
What we did instead was run a decision tree model on a subset of the features, excluding nr.employed, so the tree is forced to build without splitting on it. This should help surface other relationships in the data.
Running the decision tree on the subset of features excluding nr.employed results in a tree whose first split is on euribor3m.
The Kappa is 0.2934, also with a cp of 0.0031. Remember, Kappa is accuracy adjusted for the imbalanced target variable: accuracy is higher than Kappa because the "yes" outcomes, which represent only about 11% of the records, need to be weighted more heavily.
# Train Decision Tree Model
set.seed(175328)
decision_tree_subset <- train(
x = dfx[,-18],   # column 18 is nr.employed
y = dfy,
method = "rpart",
trControl = trainControl(method = "cv", number = 10)
)
# View the Decision Tree Model Summary
print(decision_tree_subset)
## CART
##
## 41167 samples
## 17 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 37050, 37049, 37050, 37050, 37051, 37051, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.003091746 0.8994826 0.29341261
## 0.004817371 0.8985596 0.26712814
## 0.052200173 0.8911020 0.09242677
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.003091746.
pruned_tree3 <- prune(decision_tree_subset$finalModel, cp = 0.003091746)
# Plot the pruned tree
rpart.plot(pruned_tree3, main = "Pruned Decision Tree (cp = 0.0031)")
We can see here how removing nr.employed brought cons.conf.idx into use.
Because the random forest uses many smaller trees, each built with a subset of the predictors available, it will produce a more balanced set of variable importances.
# Calculate variable importance
importance <- varImp(decision_tree_subset)
# Plot variable importance
plot(importance, main = "Variable Importance - Subset Decision Tree")
We have a slight decrease in our Kappa score when we remove our number one predictor, with the balance-adjusted accuracy dropping from 0.2976 to 0.2934. This makes sense from a data perspective: nr.employed was chosen as the first predictor to split on in the base decision tree, so removing it lowers accuracy slightly.
A note on balance-adjusted accuracy: in physics or manufacturing we would expect models to have much higher accuracy to be meaningful. With human decisions we don't expect nearly the same level of accuracy for a model to be meaningful, so while these scores are low, they are significant in the nebulous endeavor of modeling human decisions.
By removing nr.employed, which is split on twice in the visible portion of the base decision tree plot, we allow other variables to be split on in the second decision tree. The first split is now based on euribor3m, and the second nr.employed split is now performed with emp.var.rate.
Both offer a compelling guess as to why a person would agree to the savings product. If the Euribor 3-month rate is higher, the savings product will have a higher rate of return, leading to more yeses. If emp.var.rate is greater than -2.3 there is a chance for more yeses, which can be interpreted as: if there is no significant decrease in employment figures, people won't be as concerned about losing their jobs and not having access to the amount committed to the savings product.
When we compare the variable importance charts between the two decision trees we see no significant changes when nr.employed is excluded from the tree building, except that cons.conf.idx increases from minimal to medium importance. This also adds a compelling guess that when consumer confidence is high, people are more likely to "buy" a savings product with their extra money. It may be that the institution should adjust its KPIs based on interest rates, swings in employment, and consumer confidence indexes.
We'll see a more informative variable importance chart with the random forest since it is compiled from many smaller trees, so the earlier splitting variables won't hide as much of the influence of subsequent variables as decision trees do.
The random forest model was slow to the point of not completing when run with default settings. Instead of training on a downsampled dataset we opted for other techniques. First, we set the number of trees, ntree, low (around 50, when the default is 500) for speed while setting up our final model. Second, we tuned mtry and maxnodes and then trained our final model using a fixed mtry of 2 and the default (unlimited) maxnodes. Instead of ten folds we used three for cross validation and set the final number of trees to 250. Lastly, we added parallel processing through the doParallel package to make it run faster.
We have a Kappa of 0.3198, beating out both decision trees.
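Below is a minimal sketch of the kind of preliminary search described above, with a deliberately small ntree and scoring by out-of-bag (OOB) error; the specific mtry and maxnodes values in the grid are illustrative rather than the exact ones we tried.
# Sketch only: cheap grid search over mtry and maxnodes with a small forest
library(randomForest)
set.seed(175328)
tune_grid <- expand.grid(mtry = c(2, 4, 6), maxnodes = c(16, 64, 256))
tune_grid$oob_error <- apply(tune_grid, 1, function(g) {
  rf <- randomForest(x = dfx, y = dfy,
                     ntree = 50,                 # small forest for speed
                     mtry = g[["mtry"]],
                     maxnodes = g[["maxnodes"]])
  tail(rf$err.rate[, "OOB"], 1)                  # OOB error after 50 trees
})
tune_grid[order(tune_grid$oob_error), ]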
# Train the random forest model
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
tuneGrid <- expand.grid(
mtry = 2
)
set.seed(175328)
random_forest <- train(x = dfx,
y = dfy,
method = "rf",
tuneGrid = tuneGrid,
ntree = 250,
trControl = trainControl(method = "cv", number = 3)
)
stopCluster(cl)
random_forest
## Random Forest
##
## 41167 samples
## 18 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 27444, 27445, 27445
## Resampling results:
##
## Accuracy Kappa
## 0.9005757 0.3197851
##
## Tuning parameter 'mtry' was held constant at a value of 2
Here we show the variables by importance from the random forest model. Note that the variables aren't as split into winners and losers compared to the decision trees, with a more even gradation from most to least important. This is the effect of compiling the importances from many smaller trees that were built on random subsets of the variables.
age, job, month, and education showed greater importance when the importance scores aren't dominated by the early-splitting variables most correlated with the outcome, as they are in the decision trees. This insight could lead to segmenting campaign materials based on customer demographics like age.
Next, we'll show an example of looking at a particular variable using partial dependence to surface more insights.
# Calculate variable importance
importance <- varImp(random_forest)
# Plot variable importance
plot(importance, main = "Variable Importance - Random Forest")
Here we have a partial dependence plot. It's harder to read because the months aren't ordered as calendar factor levels, and January and February are missing. We notice the most yeses from May through August, with another bump in November.
With this insight we could advise the institution to make sure to have a large campaign in the summer, particularly May, and a second smaller campaign in November every year.
Similarly, we could use partial dependence plots to review all of the categorical variables of importance.
# Partial dependence plot for a specific variable
partialPlot(random_forest$finalModel, pred.data = dfx, x.var = "month", main = "Partial Dependence")
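As one way to review the remaining categorical variables of importance, here is a hedged sketch that loops pdp::partial over a few of them. The variable list and the 2,000-row subsample are our own illustrative choices; pdp supports caret train objects, and which.class = 2 focuses on the "1" (yes) class.
# Sketch only: partial dependence for several categorical predictors
library(pdp)
set.seed(175328)
pdp_train <- dfx[sample(nrow(dfx), 2000), ]      # subsample to keep this fast
for (v in c("job", "education", "poutcome")) {
  pd <- pdp::partial(random_forest,
                     pred.var    = v,
                     which.class = 2,            # probability of the "yes" class
                     prob        = TRUE,
                     train       = pdp_train)
  print(plotPartial(pd))
}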
We did not see high Kappa scores (balance-adjusted accuracies) like we would expect in physical experiments, where the relationships between inputs and outputs are stronger. But because this analysis looked at trends in human decision making, Kappa scores around 0.30 are meaningful and allow us to be confident in finding insights in our models.
We saw how the first decision tree, because it relies heavily on the first one or two predictors, suppresses the importance of less determinant predictors. Unfortunately, the top predictor was the number of employees at the institution, nr.employed. It may be that this is a confounding variable with a spurious correlation: maybe when consumer confidence was low the company was able to hire more people on the same salary budget, but because consumer confidence was low, customers were less likely to agree to lock up their money for the savings product's time period.
Removing the top predictor of the base decision tree, by running a decision tree on the subset of predictors that excluded nr.employed, allowed us, in what was visible of the plotted tree branches, to surface additional variables: euribor3m, a benchmark European interest rate, and emp.var.rate, the change in employment figures for the period.
Looking at the variable importance graph confirms those two variables' predictive power, but also shows how the consumer confidence index, cons.conf.idx, went from low importance in the base decision tree to significant importance in the second. This highlights again how decision trees have a "winner-takes-all" effect in determining the predictive power of the variables; however, these three variables don't appear confounding. You can tell a story about how high interest rates, high consumer confidence, and no crash in employment figures contribute to customers being more likely to lock up their money in a savings product. We can verify these by reviewing the direction each variable would have to move to travel down the tree towards a greater probability of a yes outcome.
While decision trees may be valuable for memorializing business processes, learning from past marketing campaigns, and approaching future ones, the random forest algorithm does a significantly better job of surfacing variables that have an impact on a successful marketing campaign. The reason is that a random forest is made up of many smaller trees, each grown with only a random subset of the predictors considered, so they're less likely to be dominated by the variable with the highest correlation to the target variable.
We observe this in our random forest's variable importance chart, which shows more variables of greater importance and a more even gradation of importance as you go down the chart. Three of them, the customer's age, job, and educational level, stood out in that we could recommend the client segment their call center scripts by customer demographics to appeal to different age groups or clusters of job and education level.
We then went on to show how you can take any of these variables and look at its partial dependence plot to help surface additional insights into how future marketing campaigns can be successful. In the case of month we found that the campaign was highly successful in the summer, from May through August, with a bump again in November. While there is a chance these months are confounded with other factors, we have enough data to suggest the institution, our client, might benefit from a larger campaign every summer and a smaller campaign every November. Similarly, we could group age into clusters and view the partial dependence plots for age group, job, and educational level.
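As a small sketch of that age-clustering idea (the break points below are hypothetical choices of our own, not something we tuned), we could bin age and look at the raw yes rate per group before committing to a segmentation:
# Sketch only: bin customers by age and compare raw subscription rates
df3 %>%
  mutate(age_group = cut(age, breaks = c(16, 30, 45, 60, 100),
                         labels = c("17-30", "31-45", "46-60", "61+"))) %>%
  group_by(age_group) %>%
  summarise(yes_rate = mean(y), customers = n())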
Ultimately the best use of this exercise is in looking at variable importance for all of our models to determine which variables have the strongest impact on a successful marketing campaign and how we would direct our client to investigate their procedures to improve results of future marketing campaigns.