Our task is to generate two decision tree models and a random forest model to make classification predictions, and then, using DeciZone's sales article and examples of real cases where decision trees went wrong, discuss how we can improve the perception of our final decision tree model.
We extended the dataset used in the first assignment: information collected by a Portuguese banking institution and later enriched by researchers with economic data, such as interest rates. We're trying to predict whether a customer of the bank, contacted in a phone marketing campaign, elects to open a savings product.
We’re approaching that problem by creating two decision trees with different initial splits and a random forest model.
In comparing the models we rely on Kappa instead of Accuracy. Accuracy is the percentage of predictions made by the model that were correct. Kappa is accuracy adjusted for chance agreement. When the target classes are perfectly balanced (50% yes and 50% no) accuracy is already informative, but our data is moderately imbalanced (roughly 11% yes and 89% no), so Kappa is the better guide.
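As a quick illustration of the difference, here is a minimal sketch, using a hypothetical confusion matrix rather than our models' actual predictions, of how Kappa discounts the accuracy a model could reach by chance on an imbalanced target:
# Illustrative 2x2 confusion matrix (hypothetical counts, roughly 12% "yes" overall)
cm <- matrix(c(820, 70,    # predicted no:  820 true no, 70 true yes
               60,  50),   # predicted yes: 60 true no,  50 true yes
             nrow = 2, byrow = TRUE,
             dimnames = list(pred = c("no", "yes"), truth = c("no", "yes")))
n  <- sum(cm)
po <- sum(diag(cm)) / n                      # observed accuracy
pe <- sum(rowSums(cm) * colSums(cm)) / n^2   # accuracy expected by chance
kappa <- (po - pe) / (1 - pe)
c(accuracy = po, kappa = kappa)              # about 0.87 accuracy vs about 0.36 Kappa
A model that always predicted "no" on this mix would score about 88% accuracy but a Kappa of zero, which is why Kappa is the fairer comparison for our imbalanced target.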
Another variable at play is cp, a hyperparameter for the decision tree algorithm that stands for Complexity Parameter. It is a penalty applied to the decision tree to keep the model parsimonious. In our case, the rpart package determined the three cp values to try in each model's tuning process, and our models always had the best results with the lower cp values. If cp is too high the model may be overly simplified, but a little cp can help prevent overfitting.
Our first decision tree determined that the number of employees, nr.employed, was the predictor driving the initial split. We weren't able to force a different predictor into that first split with either a y ~ otherPredictor + . or a y ~ . | otherPredictor formula definition; however, we were able to run the decision tree on a subset of predictors that excluded nr.employed to help surface other relationships in the data.
When we ran the random forest model we had a performance issue. My laptop has an Apple M2 chip with 8GB of memory, so it is not equipped for heavy-duty machine learning. In the first assignment we detected overfitting in our decision tree model when we used the dataset downsampled to 10% (4,119 records), so we went with the full dataset of 41,188 records for this assignment; however, that decision made fitting the random forest model complicated. Instead of opting for the downsampled dataset for the random forest model, we used parallel processing with the doParallel library. We also initially ran our model with fewer trees, via ntree, while tuning the mtry and maxnodes parameters.
We read DeciZone's article, "The Good, the Bad, the Ugly of Using Decision Trees," and noted that it is largely a sales pitch for DeciZone's software, which presumably solves all of the bad and ugly aspects of working with decision trees noted in the article.
Focusing on the “bad” list from the article we have the following observations:
Changes: If we're adopting decision trees for business decisions we should be able to change the decision trees as we get new information. Our task is slightly different: we're trying to uncover the relationships in the current data, whereas the article describes leveraging decision trees prescriptively for future actions. That would be the next step for a project like the one we are completing today.
Subjectivity: Do we understand our variables and the thought process behind each of them? In our case most of the variables are self-explanatory; however, it isn't clear how the number of employees, the first branching predictor in our base decision tree model, could be leveraged for future marketing campaigns. It could be a confounding predictor, so it was serendipitous that we removed it for the second decision tree to help surface other relationships in the data.
Evolve: How can we continuously improve the decision tree? In our case we can tune parameters, try variations on a decision tree model, or use statistical measures instead of Gini scores to identify splitting points in the data. Our task, though, is to describe the relationships in the data we have without modeling noise, and we're not really able to modify or drill down into the decision tree using the R tools that we have. That would be a secondary approach after this project, to memorialize the learnings from this exercise as we apply them to future marketing campaigns, possibly using a tool like DeciZone's.
Repeat: How do we check that the same logical pattern doesn't repeat in multiple branches? We can use variable importance to see roughly which predictors had the biggest impact on determining outcomes.
Complexity and Familiarity: Presumably DeciZone's software has a clean, simple interface that makes it easy for users to engage with. We have charts and lists of variable importance.
First we confirm that our models were meaningfully accurate. Balance-adjusted accuracies (Kappa scores) of roughly 0.30 are meaningful enough for exploring variables that have predictive value on something as complicated as human decision making, so we can use our models to surface insights.
We observed in the two decision tree models a winner-takes-all effect, where the first variables selected to build the tree overshadowed the importance of other variables. In the base tree the first variable was the number of employees at our client's banking institution, likely a confounding variable. When we ran the decision tree again on a subset of the predictors that excluded the number of employees, we started to be able to tell a story about how consumer confidence, the interest rate environment, and the absence of any shocks in the employment market made it more likely for customers to agree to the savings product.
In the random forest model we were able to see a more nuanced picture of the predictive variables. For instance, the customer's age, job, and education level had an impact on a 'yes' outcome, so we could advise our client to produce different marketing materials and call center scripts specific to different age groups, or to clusters of job and education levels. We then drilled down on month using partial dependence plots and observed that our client should likely focus on a larger campaign in the summer and a smaller campaign in November.
As a subsequent project for our client we could go through all of the partial dependence plots to build out a series of recommendations for future marketing campaigns.
We are continuing our first homework task with the selection of the “Bank Marketing” data set from the UC Irvine Machine Learning repository.
A Portuguese banking institution collected this data in a direct marketing campaign from May 2008 to November 2010. The data include information about the customers called, their demographics, banking history, and the timing, length and number of interactions they had with the bank, and ultimately whether the customer subscribed to a term deposit.
Later, the paper authors enriched the data with five Portuguese economic indicators like interest rates that could affect whether a customer subscribed.
Moro, S., Rita, P., & Cortez, P. (2014). Bank Marketing [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306.
Our understanding was that the originally intended data sets from excelbianalytics.com were synthetic and unlikely to have meaningful interactions or predictive relationships.
When the data options were opened to kaggle.com we knew we had to find a very large data set so we could subset out a random fraction and contrast strategies for larger and smaller data sets; however, the first ten large data sets (30k+ rows) we found were either fractured with significant missingness, had too few features, or lent themselves to projects outside of our scope, like Natural Language Processing (NLP).
We turned to the UC Irvine Machine Learning Repository looking for large, nonsynthetic sets that were used in academic papers. The Bank Marketing data set seemed like it would be interesting to work with and met the size and complexity requirements.
The difficulty looks medium in that we can tune some sophisticated models but there will be minimal data wrangling. After an initial review of the data, we're unlikely to explore interaction terms without direction from a subject matter expert. Nor will we have to transform our data significantly, since we are working with tree algorithms, which, unlike regression algorithms, can handle predictors on different scales. There are no missing values; however, there are cell values of "unknown" which could be kept as "unknown", imputed, or deleted.
Here we load just our large data set, instead of both our large and small data sets. We found in the first homework that a decision tree was capable of overfitting the small data set, even with its 4,119 observations.
There are 41,188 observations in our data set with a structure of 20 features or prediction variables and one output attribute or target variable.
# Check out packages
library(readr) # data importation
library(tidyverse) # has dplyr
library(corrplot) # correlation matrix plots
library(caret) # model structure
library(rpart) # decision trees
library(rpart.plot) # decision trees
library(partykit) # Specialty decision tree options
library(randomForest) # Random forest
library(doParallel) # speed up random forest
library(pdp) # partial dependence plots
library(party) # Conditional Inference Trees
library(gbm) # Boosted Trees
# Load data
loc_df <- "~/Documents/D622/HW1/bank-additional-full.csv"
df0 <- read_delim(loc_df, delim = ";", escape_double = FALSE, trim_ws = TRUE)
Here we show the data set structure with the first few examples in each variable.
# Data Preview
glimpse(df0)
## Rows: 41,188
## Columns: 21
## $ age <dbl> 56, 57, 37, 40, 56, 45, 59, 41, 24, 25, 41, 25, 29, 57,…
## $ job <chr> "housemaid", "services", "services", "admin.", "service…
## $ marital <chr> "married", "married", "married", "married", "married", …
## $ education <chr> "basic.4y", "high.school", "high.school", "basic.6y", "…
## $ default <chr> "no", "unknown", "no", "no", "no", "unknown", "no", "un…
## $ housing <chr> "no", "no", "yes", "no", "no", "no", "no", "no", "yes",…
## $ loan <chr> "no", "no", "no", "no", "yes", "no", "no", "no", "no", …
## $ contact <chr> "telephone", "telephone", "telephone", "telephone", "te…
## $ month <chr> "may", "may", "may", "may", "may", "may", "may", "may",…
## $ day_of_week <chr> "mon", "mon", "mon", "mon", "mon", "mon", "mon", "mon",…
## $ duration <dbl> 261, 149, 226, 151, 307, 198, 139, 217, 380, 50, 55, 22…
## $ campaign <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ pdays <dbl> 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, …
## $ previous <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome <chr> "nonexistent", "nonexistent", "nonexistent", "nonexiste…
## $ emp.var.rate <dbl> 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, …
## $ cons.price.idx <dbl> 93.994, 93.994, 93.994, 93.994, 93.994, 93.994, 93.994,…
## $ cons.conf.idx <dbl> -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36.4,…
## $ euribor3m <dbl> 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.857,…
## $ nr.employed <dbl> 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5…
## $ y <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "…
Here we borrow heavily from the documentation provided through the citation website.
Bank Customer Features
age - customer's age
job - type of job
marital - marital status (note, widowed is counted under "divorced")
education - educational attainment
default - does the customer have credit in default?
housing - does the customer have a housing loan?
loan - does the customer have a personal loan?
Campaign Features
contact - contacted by "cellular" or "telephone"
month - month of year they were last contacted
day_of_week - day of week they were last contacted (Monday - Friday)
duration - last contact duration in seconds
Other Features
campaign - number of contacts made to this customer for this campaign
pdays - number of days since the customer was last contacted in the previous campaign
previous - number of contacts made to this customer for the previous campaign
poutcome - outcome of the previous marketing campaign
Economic Features
emp.var.rate - quarterly employment variation rate
cons.price.idx - monthly consumer price index
cons.conf.idx - monthly consumer confidence index
euribor3m - daily Euribor 3-month rate
nr.employed - quarterly number of employees
Here is the column representing the outcome, which makes this a classification problem.
Target Variable
y - outcome of the current marketing campaign: "has the customer subscribed a term deposit?"
Here we provide summary statistics for our data set's numeric values in order to look for outliers and irregularities.
pdays, the number of days since the customer was last contacted in the previous campaign, represents "never contacted" as 999 days. This may be sufficient for modeling purposes.
nr.employed looks like it may be the number of employees at the bank. It's not clear how that could benefit the analysis; maybe fewer employees means less effective campaign calls.
# Show summary statistics for numeric columns
df0 %>%
select(where(is.numeric)) %>%
summary()
## age duration campaign pdays
## Min. :17.00 Min. : 0.0 Min. : 1.000 Min. : 0.0
## 1st Qu.:32.00 1st Qu.: 102.0 1st Qu.: 1.000 1st Qu.:999.0
## Median :38.00 Median : 180.0 Median : 2.000 Median :999.0
## Mean :40.02 Mean : 258.3 Mean : 2.568 Mean :962.5
## 3rd Qu.:47.00 3rd Qu.: 319.0 3rd Qu.: 3.000 3rd Qu.:999.0
## Max. :98.00 Max. :4918.0 Max. :56.000 Max. :999.0
## previous emp.var.rate cons.price.idx cons.conf.idx
## Min. :0.000 Min. :-3.40000 Min. :92.20 Min. :-50.8
## 1st Qu.:0.000 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7
## Median :0.000 Median : 1.10000 Median :93.75 Median :-41.8
## Mean :0.173 Mean : 0.08189 Mean :93.58 Mean :-40.5
## 3rd Qu.:0.000 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4
## Max. :7.000 Max. : 1.40000 Max. :94.77 Max. :-26.9
## euribor3m nr.employed
## Min. :0.634 Min. :4964
## 1st Qu.:1.344 1st Qu.:5099
## Median :4.857 Median :5191
## Mean :3.621 Mean :5167
## 3rd Qu.:4.961 3rd Qu.:5228
## Max. :5.045 Max. :5228
Here we use code to generate the counts of each type of value in the categorical variables.
For our target variable, about 11.3% of the records (4,640 "yes" versus 36,548 "no") were yes for the current campaign. This means we have a mild imbalance in our data that should be addressed either through resampling or model techniques.
Notably, there were only three "yes" values for default in our dataset, so we also have an issue with degenerate variables with low categorical frequency.
# Get counts of every factor in each categorical column
categorical_counts <- df0 %>%
select(where(~ is.character(.) || is.factor(.))) %>%
pivot_longer(everything(), names_to = "Column", values_to = "Value") %>%
group_by(Column, Value) %>%
summarise(Count = n(), .groups = "drop")
# View the result
categorical_counts
## # A tibble: 55 × 3
## Column Value Count
## <chr> <chr> <int>
## 1 contact cellular 26144
## 2 contact telephone 15044
## 3 day_of_week fri 7827
## 4 day_of_week mon 8514
## 5 day_of_week thu 8623
## 6 day_of_week tue 8090
## 7 day_of_week wed 8134
## 8 default no 32588
## 9 default unknown 8597
## 10 default yes 3
## # ℹ 45 more rows
There are no NA values in our data set; however, in six of the eleven categorical variables there are missing values coded as "unknown". We're going to treat "unknown" as a separate class, but we could instead try imputing the true values and compare.
# There are zero NA values in our data set
#df0 %>% summarise(total_missing = sum(is.na(.)))
# Show categorical values of "unknown"
unknown_counts <- df0 %>%
select(where(~ is.character(.) || is.factor(.))) %>%
summarise(across(everything(), ~ sum(. == "unknown"), .names = "unknown_{.col}")) %>%
t()
unknown_counts <- as.data.frame(unknown_counts)
unknown_counts
## V1
## unknown_job 330
## unknown_marital 80
## unknown_education 1731
## unknown_default 8597
## unknown_housing 990
## unknown_loan 990
## unknown_contact 0
## unknown_month 0
## unknown_day_of_week 0
## unknown_poutcome 0
## unknown_y 0
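As a hedged sketch of the imputation alternative mentioned above (an approach we did not apply), one could recode "unknown" to NA and fill each categorical column with its most common level:
# Sketch only: treat "unknown" as missing and impute with each column's modal level
impute_mode <- function(x) {
  x[x == "unknown"] <- NA
  mode_level <- names(sort(table(x), decreasing = TRUE))[1]  # most frequent level
  x[is.na(x)] <- mode_level
  x
}
df0_imputed <- df0 %>%
  mutate(across(where(is.character), impute_mode))
A fairer comparison would use model-based imputation (for example with the mice package), but even this simple version would let us compare performance against keeping "unknown" as its own class.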
Here is the correlation matrix plot for the numeric columns. While we can't see the correlation between these and the non-numeric target variable y, we note some strong correlations.
pdays and previous have a strong negative correlation. This is because customers who were not contacted in the previous campaign have 999 days since last contact and 0 previous contacts, while customers with at least one previous contact have a much smaller number of days since last contact, for example 15 or 30 days.
euribor3m, the daily Euribor 3-month rate, is an average of the rates at which European banks lend Euros to each other, so it makes sense that it is positively correlated with inflation (cons.price.idx) and with the quarterly employment variation rate (emp.var.rate): when borrowing costs are higher, companies are less likely to hire. This includes our Portuguese banking institution, whose total number of employees, nr.employed, is highly correlated with the daily Euribor 3-month rate, euribor3m.
# Correlation matrix plot
correlations <- cor(df0[, sapply(df0, is.numeric)])
corrplot::corrplot(correlations, method="square", type = "upper", order = 'original')
We're going to remove duration because, if this is to be a useful prediction model for the bank, we can't know in advance how long the call with the customer is going to be.
Separate from this analysis, the bank could use this information to produce guidelines for how long it's reasonable to spend on a phone call with a customer in future campaigns, but for our purposes we will remove it.
# Remove duration column
df1 <- df0 %>% select(-duration)
Here we checked our dataset for multicollinearity by fitting a logistic regression model to assess the Variance Inflation Factor (VIF) for each predictor.
A VIF close to one is ideal, indicating no correlation between the variable and the others. A VIF above 10 is a warning of high multicollinearity that could impair our modeling efforts.
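As a reminder of where VIF comes from, here is a minimal sketch for a single numeric predictor: regress it on the other numeric predictors and compute 1 / (1 - R^2). This won't exactly match the generalized VIF values car::vif() reports for the full model below, since that model also includes the categorical predictors.
# Sketch: VIF of euribor3m from first principles, using only the numeric predictors
num_df <- df1 %>% select(where(is.numeric))
r2 <- summary(lm(euribor3m ~ ., data = num_df))$r.squared
1 / (1 - r2)   # values far above 10 signal troublesome multicollinearity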
Note, one of our variables, loan, was found to have at least one of its one-hot encoded subvariables perfectly collinear with the other categorical variables. This means we have to remove the variable to move forward.
Here we show three variables with VIF scores above 10, establishing that we have high multicollinearity. This means we may see higher variability in the types of decision trees fit to our data, because highly correlated variables compete to be chosen for splitting but only one of them is selected; we could get several roughly similar-performing decision trees depending on which of the multicollinear variables wins. Random forest-type algorithms, because they build many small trees that each consider only a limited subset of the features, tend to be less sensitive to multicollinearity.
# Convert "yes" to 1 and "no" to 0 in the target column `y`
df2 <- df1
df2$y <- ifelse(df2$y == "yes", 1, 0)
# loan was perfectly collinear and had to be removed
#alias(glm_model)
df2 <- df2 %>% select(-loan)
# Logistic regression model and VIF for small
#glm_model <- glm(y ~ ., data = small2, family = binomial)
#vif_values <- car::vif(glm_model)
#vif_values
# Logistic regression model and VIF for large
glm_model <- glm(y ~ ., data = df2, family = binomial)
vif_values <- car::vif(glm_model)
vif_values
## GVIF Df GVIF^(1/(2*Df))
## age 2.203093 1 1.484282
## job 5.655303 11 1.081938
## marital 1.440082 3 1.062669
## education 3.214727 7 1.086988
## default 1.138725 2 1.033010
## housing 1.011423 2 1.002844
## contact 2.411083 1 1.552766
## month 65.363374 9 1.261397
## day_of_week 1.060082 4 1.007320
## campaign 1.043997 1 1.021761
## pdays 10.759483 1 3.280165
## previous 4.665959 1 2.160083
## poutcome 25.193842 2 2.240390
## emp.var.rate 144.876834 1 12.036479
## cons.price.idx 65.824565 1 8.113234
## cons.conf.idx 5.335383 1 2.309845
## euribor3m 142.091066 1 11.920196
## nr.employed 172.521063 1 13.134727
We're going to create two decision tree models and one random forest model to meet the requirements of the task; however, we'll also try as many variants of the decision tree as we can to form a sort of lookbook for future modeling efforts.
Besides producing predictions and using 10-fold cross validation to get a sense of how well our models generalize to new data, we'll also see if we can surface insights about which features are important in predicting success in the subsequent Insights section.
Here we split up our data into inputs dfx and output dfy and remove records with sparse categorical data. Some folds may fail during cross validation because of sparse categories in education (18 records of "illiterate") and default (3 records of "yes"), so we've decided to exclude these records for modeling purposes. An alternative would be to keep these records but lump them into the next lowest educational attainment category, or into the unknown default category, similar to how "widowed" was counted as "divorced" in the marital variable to reduce rare categorical levels. A sketch of that alternative follows the code below.
# See all education attainment levels
#df2 %>%
# count(education)
# There are 3 yes and 18 illiterate in our dataset
#sum(df2$default == "yes", na.rm = TRUE)
#sum(df2$education == "illiterate", na.rm = TRUE)
# Remove records with the two rare level categories
df3 <- df2 %>%
filter(!(education == "illiterate" | default == "yes"))
# Resplit data
dfx <- df3 |> select(-y)
dfy <- df3$y
dfx <- as.data.frame(dfx)
dfx <- dfx %>% mutate(across(where(is.character), as.factor))
dfy <- as.factor(dfy)
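Here is the sketch of the lumping alternative described above, which we did not apply in our models. Treating basic.4y as the "next lowest" attainment category for the 18 "illiterate" records is our own assumption; the 3 "yes" defaults are folded into "unknown".
# Sketch only: collapse the rare levels instead of dropping the records
df3_alt <- df2 %>%
  mutate(education = ifelse(education == "illiterate", "basic.4y", education),
         default   = ifelse(default == "yes", "unknown", default))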
Here we train our decision tree models with the same data used in the logistic regression models. There were no complications or need for additional preparation beyond the grooming already done for those models.
There was no onerous increase in time to train on the larger data set; however, we observed an extra-high accuracy on the smaller data set that diminished when training the same decision tree model on the larger data set. This may be evidence that decision trees tend to overfit smaller data sets.
We have an accuracy of 89.99% and a Kappa of 0.2976. Note that without setting the same seed we’d get slightly different results each time we run this.
Note that our cp of 0.0031 is one of three options determined by the rpart package. What we can conclude is that our model doesn't benefit from heavily penalized trees with a high cp; however, a little cp helps reduce the likelihood of overfitting.
It would be interesting to see whether a higher cp would actually increase the benefit of our variable importance analysis, hurting the model's accuracy but helping to simplify the discussion about which variables are influencing the outcome.
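If we wanted to test that idea, a sketch like the one below (with hypothetical cp values of our own choosing) would have caret evaluate a custom cp grid rather than the three values proposed automatically. We did not run this as part of the assignment.
# Sketch only: supply our own cp grid, including more heavily penalized trees
cp_grid <- expand.grid(cp = c(0.001, 0.003, 0.01, 0.03, 0.1))
set.seed(175328)
decision_tree_cp <- train(
  x = dfx,
  y = dfy,
  method = "rpart",
  tuneGrid = cp_grid,   # overrides the automatic three-value cp grid
  trControl = trainControl(method = "cv", number = 10)
)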
# Train Decision Tree Model
set.seed(175328)
decision_tree_base <- train(
x = dfx,
y = dfy,
method = "rpart",
trControl = trainControl(method = "cv", number = 10)
)
# View the Decision Tree Model Summary
print(decision_tree_base)
## CART
##
## 41167 samples
## 18 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 37050, 37049, 37050, 37050, 37051, 37051, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.003091746 0.8999198 0.2975951
## 0.004817371 0.8991912 0.2831523
## 0.053817947 0.8911506 0.0976546
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.003091746.
pruned_tree <- prune(decision_tree_base$finalModel, cp = 0.003091746)
# Plot the pruned tree
rpart.plot(pruned_tree, main = "Pruned Decision Tree (cp = 0.00309)")
We're including variable importance to showcase how rerunning after removing nr.employed makes room for new predictors to be used for splitting.
# Calculate variable importance
importance <- varImp(decision_tree_base)
# Plot variable importance
plot(importance, main = "Variable Importance - Base Decision Tree")
We saw that our model first branched using the nr.employed feature. Some of this could depend on our seed value, and a different seed value could result in a different initial split. It looks like one unseeded run actually split first on euribor3m; however, I wasn't able to duplicate that after 20 seeded attempts. It may be that some of my experiments with trying to force an initial split on euribor3m worked, but in a way I hadn't controlled for in my code order.
I had tried code like putting euribor3m before the . as in the example below:
tree <- ctree(y ~ euribor3m + ., data = df3)
Or defining the model formula as normal, but with a pipe and then the name of the predictor we want to prioritize, as in this example:
y ~ . | euribor3m
However, I wasn't able to make either of them work independently. (And the second may only apply to model-based recursive partitioning rather than to a standard decision tree.)
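Another route we did not end up using, sketched below under our own assumptions, is rpart's cost argument. It divides each variable's split improvement by a per-variable cost, so it can discourage, without outright forbidding, splits on nr.employed.
# Sketch only: make splits on nr.employed expensive so other predictors surface
library(rpart)
costs <- rep(1, ncol(dfx))
costs[names(dfx) == "nr.employed"] <- 100   # heavily penalize this one variable
tree_cost <- rpart(y ~ ., data = cbind(dfx, y = dfy),
                   method = "class",
                   cost = costs,
                   control = rpart.control(cp = 0.003))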
What we did instead was run a decision tree model on a subset of the features, excluding nr.employed, so the tree is forced to build without splitting on it. This should help surface other relationships in the data.
Running the decision tree on the subset of features excluding nr.employed results in a tree whose first split is on euribor3m.
The Kappa is 0.2934, also with a cp of 0.0031. Remember, Kappa is accuracy adjusted for the imbalanced target variable: accuracy is higher than Kappa because the "yes" outcomes, which represent only about 11% of the records, need to be weighted more heavily.
# Train Decision Tree Model
set.seed(175328)
decision_tree_subset <- train(
x = dfx[,-18],   # column 18 is nr.employed
y = dfy,
method = "rpart",
trControl = trainControl(method = "cv", number = 10)
)
# View the Decision Tree Model Summary
print(decision_tree_subset)
## CART
##
## 41167 samples
## 17 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 37050, 37049, 37050, 37050, 37051, 37051, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.003091746 0.8994826 0.29341261
## 0.004817371 0.8985596 0.26712814
## 0.052200173 0.8911020 0.09242677
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.003091746.
pruned_tree3 <- prune(decision_tree_subset$finalModel, cp = 0.003091746)
# Plot the pruned tree
rpart.plot(pruned_tree3, main = "Pruned Decision Tree (cp = 0.0031)")
We can see here how removing nr.employed brought cons.conf.idx into use.
Because the random forest uses many smaller trees, each built with a subset of the predictors available, it will produce a more balanced set of variable importances.
# Calculate variable importance
importance <- varImp(decision_tree_subset)
# Plot variable importance
plot(importance, main = "Variable Importance - Subset Decision Tree")
We have a slight decrease in our Kappa score when we remove our number one predictor, with the balance-adjusted accuracy dropping from 0.2976 to 0.2934. This makes sense from a data perspective: nr.employed was chosen as the first predictor to split on in the base decision tree, so removing it lowers accuracy slightly.
A note on balance-adjusted accuracy: in physics or manufacturing we would expect models to have much higher accuracy to be meaningful. With human decisions we don't expect nearly the same level of accuracy for a model to be meaningful, so while these scores are low, they are significant in the nebulous endeavor of modeling human decisions.
By removing nr.employed, which is split on twice in the visible portion of the base decision tree plot, we allow other variables to be split on in the second decision tree. The first split is now based on euribor3m, and the second nr.employed split is now performed with emp.var.rate.
Both offer a compelling guess as to why a person would agree to the savings product. If the Euribor 3-month rate is higher, the savings product will have a higher rate of return, leading to more yeses. If emp.var.rate is greater than -2.3 there is a chance for more yeses, which can be interpreted as: if there is no significant decrease in employment figures, people won't be as concerned about losing their jobs and not having access to the amount committed to the savings product.
When we compare the variable importance charts between the two decision trees we see no significant changes when nr.employed is excluded from the tree building, except that cons.conf.idx increases from minimal to medium importance. This also adds a compelling guess that when consumer confidence is high, people are more likely to "buy" a savings product with their extra money. It may be that the institution should adjust its KPIs based on interest rates, swings in employment, and consumer confidence indexes.
We'll see a more informative variable importance chart with the random forest since it is compiled from many smaller trees, so the earlier splitting variables won't hide as much of the influence of subsequent variables as decision trees do.
The random forest model was slow to the point of not completing when run with default settings. Instead of training on a downsampled dataset we opted for other techniques. First, we set the number of trees, ntree, low (around 50, when the default is 500) for speed while setting up our final model. Second, we tuned mtry and maxnodes and then trained our final model using a fixed mtry of 2 and the default (unlimited) maxnodes. Instead of ten folds we used three for cross validation and set the final number of trees to 250. Lastly, we added parallel processing through the doParallel package to make it run faster.
We have a Kappa of 0.3198, beating out both decision trees.
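Below is a minimal sketch of the kind of preliminary search described above, with a deliberately small ntree and scoring by out-of-bag (OOB) error; the specific mtry and maxnodes values in the grid are illustrative rather than the exact ones we tried.
# Sketch only: cheap grid search over mtry and maxnodes with a small forest
library(randomForest)
set.seed(175328)
tune_grid <- expand.grid(mtry = c(2, 4, 6), maxnodes = c(16, 64, 256))
tune_grid$oob_error <- apply(tune_grid, 1, function(g) {
  rf <- randomForest(x = dfx, y = dfy,
                     ntree = 50,                 # small forest for speed
                     mtry = g[["mtry"]],
                     maxnodes = g[["maxnodes"]])
  tail(rf$err.rate[, "OOB"], 1)                  # OOB error after 50 trees
})
tune_grid[order(tune_grid$oob_error), ]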
# Train the random forest model
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
tuneGrid <- expand.grid(
mtry = 2
)
set.seed(175328)
random_forest <- train(x = dfx,
y = dfy,
method = "rf",
tuneGrid = tuneGrid,
ntree = 250,
trControl = trainControl(method = "cv", number = 3)
)
stopCluster(cl)
random_forest
## Random Forest
##
## 41167 samples
## 18 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 27444, 27445, 27445
## Resampling results:
##
## Accuracy Kappa
## 0.9005757 0.3197851
##
## Tuning parameter 'mtry' was held constant at a value of 2
Here we show the variables by importance from the random forest model. Note that the variables aren't as split into winners and losers compared to the decision trees, with a more even gradation from most to least important. This is the effect of compiling the importances from many smaller trees that were built on random subsets of the variables.
age, job, month, and education showed greater importance when the importance scores aren't dominated by the early-splitting variables most correlated with the outcome, as they are in the decision trees. This insight could lead to segmenting campaign materials based on customer demographics like age.
Next, we'll show an example of looking at a particular variable using partial dependence to surface more insights.
# Calculate variable importance
importance <- varImp(random_forest)
# Plot variable importance
plot(importance, main = "Variable Importance - Random Forest")
Here we have a partial dependence plot. It's harder to read because the months aren't ordered as calendar factor levels, and January and February are missing. We notice the most yeses from May through August, with another bump in November.
With this insight we could advise the institution to make sure to have a large campaign in the summer, particularly May, and a second smaller campaign in November every year.
Similarly, we could use partial dependence plots to review all of the categorical variables of importance.
# Partial dependence plot for a specific variable
partialPlot(random_forest$finalModel, pred.data = dfx, x.var = "month", main = "Partial Dependence")
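As one way to review the remaining categorical variables of importance, here is a hedged sketch that loops pdp::partial over a few of them. The variable list and the 2,000-row subsample are our own illustrative choices; pdp supports caret train objects, and which.class = 2 focuses on the "1" (yes) class.
# Sketch only: partial dependence for several categorical predictors
library(pdp)
set.seed(175328)
pdp_train <- dfx[sample(nrow(dfx), 2000), ]      # subsample to keep this fast
for (v in c("job", "education", "poutcome")) {
  pd <- pdp::partial(random_forest,
                     pred.var    = v,
                     which.class = 2,            # probability of the "yes" class
                     prob        = TRUE,
                     train       = pdp_train)
  print(plotPartial(pd))
}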
We did not see high Kappa scores (balance-adjusted accuracies) like we would expect in physical experiments, where the relationships between inputs and outputs are stronger. But because this analysis looked at trends in human decision making, Kappa scores around 0.30 are meaningful and allow us to be confident in finding insights in our models.
We saw how the first decision tree, because it relies heavily on the first one or two predictors, suppresses the importance of less determinant predictors. Unfortunately, the top predictor was the number of employees at the institution, nr.employed. It may be that this is a confounding variable with a spurious correlation: maybe when consumer confidence was low the company was able to hire more people on the same salary budget, but because consumer confidence was low, customers were less likely to agree to lock up their money for the savings product's time period.
Removing the top predictor of the base decision tree, by running a decision tree on the subset of predictors that excluded nr.employed, allowed us, in what was visible of the plotted tree branches, to surface additional variables: euribor3m, a benchmark European interest rate, and emp.var.rate, the change in employment figures for the period.
Looking at the variable importance graph confirms those two variables' predictive power, but also shows how the consumer confidence index, cons.conf.idx, went from low importance in the base decision tree to significant importance in the second. This highlights again how decision trees have a "winner-takes-all" effect in determining the predictive power of the variables; however, these three variables don't appear confounding. You can tell a story about how high interest rates, high consumer confidence, and no crash in employment figures contribute to customers being more likely to lock up their money in a savings product. We can verify these by reviewing the direction each variable would have to move to travel down the tree towards a greater probability of a yes outcome.
While decision trees may be valuable for memorializing business processes, learning from past marketing campaigns, and approaching future ones, the random forest algorithm does a significantly better job of surfacing variables that have an impact on a successful marketing campaign. The reason is that a random forest is made up of many smaller trees, each grown with only a random subset of the predictors considered, so they're less likely to be dominated by the variable with the highest correlation to the target variable.
We observe this in our random forest's variable importance chart, which shows more variables of greater importance and a more even gradation of importance as you go down the chart. Three of them, the customer's age, job, and educational level, stood out in that we could recommend the client segment their call center scripts by customer demographics to appeal to different age groups or clusters of job and education level.
We then went on to show how you can take any of these variables and look at its partial dependence plot to help surface additional insights into how future marketing campaigns can be successful. In the case of month we found that the campaign was highly successful in the summer, from May through August, with a bump again in November. While there is a chance these months are confounded with other factors, we have enough data to suggest the institution, our client, might benefit from a larger campaign every summer and a smaller campaign every November. Similarly, we could group age into clusters and view the partial dependence plots for age group, job, and educational level.
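As a small sketch of that age-clustering idea (the break points below are hypothetical choices of our own, not something we tuned), we could bin age and look at the raw yes rate per group before committing to a segmentation:
# Sketch only: bin customers by age and compare raw subscription rates
df3 %>%
  mutate(age_group = cut(age, breaks = c(16, 30, 45, 60, 100),
                         labels = c("17-30", "31-45", "46-60", "61+"))) %>%
  group_by(age_group) %>%
  summarise(yes_rate = mean(y), customers = n())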
Ultimately the best use of this exercise is in looking at variable importance for all of our models to determine which variables have the strongest impact on a successful marketing campaign and how we would direct our client to investigate their procedures to improve results of future marketing campaigns.