Our task is to conduct an exploratory analysis to predict an outcome using two data sets, one large and one small, and to compare the results. We also need a short essay explaining our selection of algorithms, how they relate to the data, and what we are trying to accomplish.
After first identifying that we were presented with a classification problem rather than a regression problem, we determined that our data had a multicollinearity problem and a moderate imbalance in the classification target.
We compiled the chart reproduced below of machine learning algorithms covered in this course and assessed their ability to handle regression, classification, multicollinearity, and imbalance. We also set aside computationally intensive algorithms like SVM and Neural Networks, since we were training on both small and large data sets.
Ultimately we selected logistic regression and decision trees. While both achieved high accuracy, once we adjusted that accuracy for random chance we were left with low-performing models that, although they help surface some insight into which features are important in predicting success, should be refined with adjustments or new algorithms.
Learnings -
Our data came from a Portuguese banking institution’s marketing campaign that a team of researchers combined with economic indicators to predict the success of a given client signing up for a term deposit.
Roughly 11% of the clients ended up agreeing to the product, which introduced the concept of imbalance: the accuracy of our models is inflated, because a model that simply predicts “no” for every client already achieves a pretty high accuracy rate of about 89%. Because of this we relied on the Kappa score, a measure of accuracy after chance agreement has been accounted for. Returning to the model that predicts “no” for every client, its 89% accuracy is no better than the agreement expected by chance from the class frequencies, so its Kappa is 0, which is no predictive advantage at all. In our case we attempted to address the imbalance through the algorithms chosen. Logistic regression and decision trees were reasonable, simpler models to start with; Random Forest and XGBoost would have been able to account for the imbalance more powerfully, but we’re saving them for a future homework. An alternative would have been to train on a resampled training set that over-represents the minority class of “yes” responses to eliminate the imbalance during training.
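To make the Kappa adjustment concrete, here is a minimal sketch (ours, not part of the original analysis) that computes Cohen’s Kappa by hand for a baseline that predicts “no” for every client, assuming an 89/11 split like our data:
# Minimal sketch: Cohen's Kappa for an all-"no" baseline on an assumed 89/11 split.
# The 100 illustrative records below are not taken from the actual data files.
actual    <- factor(c(rep("no", 89), rep("yes", 11)), levels = c("no", "yes"))
predicted <- factor(rep("no", 100), levels = c("no", "yes"))

p_o <- mean(predicted == actual)            # observed accuracy: 0.89
p_e <- sum(prop.table(table(predicted)) *   # agreement expected by chance,
           prop.table(table(actual)))       # from the two marginal distributions
(p_o - p_e) / (1 - p_e)                     # Kappa = 0: no skill beyond chance
caret reports this same Kappa statistic in its resampling summaries, which is why we lean on that column in the results below.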
Through both the correlation matrix of numerical predictors and the Variance Inflation Factor (VIF) of an initially fit logistic regression model, we were able to establish that we had a high degree of multicollinearity in our data. Modeling works best when the predictors have low collinearity with each other: when two predictors capture roughly the same information, it becomes easier, particularly for regression models, to fit noise in the data. Multicollinearity can be at least partially addressed by adding regularization to the model (lasso, ridge, or both together as elastic-net), or by combining collinear terms in a fixed way. We chose to address multicollinearity by comparing the performance of logistic regression and a decision tree. Ideally we would have used Random Forest, and we will see that model in the next assignment.
While fitting the models we noticed that the decision tree was able to overfit the small data set and required relatively more data than the logistic regression model to become properly fit, even with 10-fold cross-validation. We know computationally intensive models like SVM and Neural Networks would be onerous to train on the larger data set, but we also need to be mindful that different algorithms require different amounts of data, depending on the dimensionality, to work properly, especially since we can always scale our data down but cannot easily, if at all, scale it up.
Model | Regression | Classification | Multicollinearity | Imbalance |
---|---|---|---|---|
Linear Regression | Yes | No | No | |
Logistic Regression | | Yes | No | Yes (with weighting) |
kNN | Yes | Yes | No | No |
Linear Discriminant Analysis | | Yes | No | No |
Support Vector Machines | Yes | Yes | Yes | Yes (with weighting or kernel adjustments) |
Random Forest | Yes | Yes | Yes | Yes |
AdaBoost | Yes | Yes | No | Yes (with reweighting) |
XGBoost | Yes | Yes | Yes | Yes (with weighting) |
Neural Networks | Yes | Yes | Yes (if properly regularized) | Yes (with sampling or custom loss functions) |
We selected the “Bank Marketing” data set from the UC Irvine Machine Learning repository.
A Portuguese banking institution collected this data in a direct marketing campaign from May 2008 to November 2010. The data include information about the clients called, their demographics, banking history, and the timing, length and number of interactions they had with the bank, and ultimately whether the client subscribed to a term deposit.
Later, the paper authors enriched the data with five Portuguese economic indicators like interest rates that could affect whether a client subscribed.
Moro, S., Rita, P., & Cortez, P. (2014). Bank Marketing [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306.
Our understanding was that the originally intended data set from excelbianalytics.com was synthetic and unlikely to contain meaningful interactions or predictive relationships.
When the data options were opened up to kaggle.com, we knew we had to find a very large data set so we could subset out a random fraction and contrast strategies for larger and smaller data sets. However, the first ten large data sets (30k+ observations) we found were either fractured with significant missingness, had too few features, or lent themselves to projects outside our scope, like Natural Language Processing (NLP).
We turned to the UC Irvine Machine Learning Repository looking for large, nonsynthetic sets that were used in academic papers. The Bank Marketing data set seemed like it would be interesting to work with and met the size and complexity requirements.
The difficulty looks medium in that we can tune some sophisticated models but there will be minimal data wrangling. We are unlikely to explore interaction terms or to have to transform our data significantly. There are no missing values; however, there are cell values of “unknown” which could be kept as their own level, imputed, or deleted.
Here we load our two data sets, `small` and `large`.
There are 41,188 observations in the large data set. The small data set is a random 10% of the large data set with 4,119 observations.
They have identical structure with 20 features or prediction variables and one output attribute or target variable.
# Check out packages
library(readr) # data importation
library(tidyverse) # has dplyr
library(corrplot) # correlation matrix plots
library(caret) # model structure
library(rpart) # decision trees
library(rpart.plot) # decision trees
# Load data
loc_small <- "~/Documents/D622/HW1/bank-additional.csv"
loc_large <- "~/Documents/D622/HW1/bank-additional-full.csv"
small <- read_delim(loc_small, delim = ";", escape_double = FALSE, trim_ws = TRUE)
large <- read_delim(loc_large, delim = ";", escape_double = FALSE, trim_ws = TRUE)
Here we show the small data, which is identical in structure to the large data.
# Data Preview
glimpse(small)
## Rows: 4,119
## Columns: 21
## $ age <dbl> 30, 39, 25, 38, 47, 32, 32, 41, 31, 35, 25, 36, 36, 47,…
## $ job <chr> "blue-collar", "services", "services", "services", "adm…
## $ marital <chr> "married", "single", "married", "married", "married", "…
## $ education <chr> "basic.9y", "high.school", "high.school", "basic.9y", "…
## $ default <chr> "no", "no", "no", "no", "no", "no", "no", "unknown", "n…
## $ housing <chr> "yes", "no", "yes", "unknown", "yes", "no", "yes", "yes…
## $ loan <chr> "no", "no", "no", "unknown", "no", "no", "no", "no", "n…
## $ contact <chr> "cellular", "telephone", "telephone", "telephone", "cel…
## $ month <chr> "may", "may", "jun", "jun", "nov", "sep", "sep", "nov",…
## $ day_of_week <chr> "fri", "fri", "wed", "fri", "mon", "thu", "mon", "mon",…
## $ duration <dbl> 487, 346, 227, 17, 58, 128, 290, 44, 68, 170, 301, 148,…
## $ campaign <dbl> 2, 4, 1, 3, 1, 3, 4, 2, 1, 1, 1, 1, 2, 2, 2, 2, 6, 4, 2…
## $ pdays <dbl> 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, …
## $ previous <dbl> 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome <chr> "nonexistent", "nonexistent", "nonexistent", "nonexiste…
## $ emp.var.rate <dbl> -1.8, 1.1, 1.4, 1.4, -0.1, -1.1, -1.1, -0.1, -0.1, 1.1,…
## $ cons.price.idx <dbl> 92.893, 93.994, 94.465, 94.465, 93.200, 94.199, 94.199,…
## $ cons.conf.idx <dbl> -46.2, -36.4, -41.8, -41.8, -42.0, -37.5, -37.5, -42.0,…
## $ euribor3m <dbl> 1.313, 4.855, 4.962, 4.959, 4.191, 0.884, 0.879, 4.191,…
## $ nr.employed <dbl> 5099.1, 5191.0, 5228.1, 5228.1, 5195.8, 4963.6, 4963.6,…
## $ y <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "…
Here we borrow heavily from the documentation provided through the citation website.
Bank Client Features

- `age` - client’s age
- `job` - type of job
- `marital` - marital status (note, widowed is counted under “divorced”)
- `education` - educational attainment
- `default` - does the client have credit in default?
- `housing` - does the client have a housing loan?

Campaign Features

- `contact` - contacted by “cellular” or “telephone”
- `month` - month of year the client was last contacted
- `day_of_week` - day of week the client was last contacted (Monday - Friday)
- `duration` - last contact duration in seconds

Other Features

- `campaign` - number of contacts made to this client for this campaign
- `pdays` - number of days since the client was last contacted in the previous campaign
- `previous` - number of contacts made to this client for the previous campaign
- `poutcome` - outcome of the previous marketing campaign

Economic Features

- `emp.var.rate` - quarterly employment variation rate
- `cons.price.idx` - monthly consumer price index
- `cons.conf.idx` - monthly consumer confidence index
- `euribor3m` - daily euribor 3 month rate
- `nr.employed` - quarterly number of employees

Here is the column representing the outcome, which makes this a classification problem.

Target Variable

- `y` - outcome of the current marketing campaign: “has the client subscribed a term deposit?”
Here we provide summary statistics for the large set’s numeric values in order to look for outliers and anomalies. Since the small set is a random subset of the large, there is no need to repeat this for the small data set.
`pdays`, the number of days since the client was last contacted in the previous campaign, represents “never contacted” as 999 days. It may be that this is sufficient for modeling purposes.
`nr.employed` looks like it may be the number of employees at the bank. It’s not clear how that could benefit the analysis; maybe fewer employees means less effective campaign calls.
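If the 999 sentinel proved problematic, one refinement (sketched here, not applied in this analysis) would be to split `pdays` into an explicit “never contacted” indicator plus the real day count:
# Sketch only: recode the 999 sentinel in pdays; never_contacted and pdays_days
# are hypothetical columns of our own, not part of the original data set
large_alt <- large %>%
  mutate(never_contacted = pdays == 999,                     # TRUE when no prior contact
         pdays_days      = ifelse(never_contacted, NA_real_, pdays))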
# Show summary statistics for numeric columns
large %>%
select(where(is.numeric)) %>%
summary()
## age duration campaign pdays
## Min. :17.00 Min. : 0.0 Min. : 1.000 Min. : 0.0
## 1st Qu.:32.00 1st Qu.: 102.0 1st Qu.: 1.000 1st Qu.:999.0
## Median :38.00 Median : 180.0 Median : 2.000 Median :999.0
## Mean :40.02 Mean : 258.3 Mean : 2.568 Mean :962.5
## 3rd Qu.:47.00 3rd Qu.: 319.0 3rd Qu.: 3.000 3rd Qu.:999.0
## Max. :98.00 Max. :4918.0 Max. :56.000 Max. :999.0
## previous emp.var.rate cons.price.idx cons.conf.idx
## Min. :0.000 Min. :-3.40000 Min. :92.20 Min. :-50.8
## 1st Qu.:0.000 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7
## Median :0.000 Median : 1.10000 Median :93.75 Median :-41.8
## Mean :0.173 Mean : 0.08189 Mean :93.58 Mean :-40.5
## 3rd Qu.:0.000 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4
## Max. :7.000 Max. : 1.40000 Max. :94.77 Max. :-26.9
## euribor3m nr.employed
## Min. :0.634 Min. :4964
## 1st Qu.:1.344 1st Qu.:5099
## Median :4.857 Median :5191
## Mean :3.621 Mean :5167
## 3rd Qu.:4.961 3rd Qu.:5228
## Max. :5.045 Max. :5228
Here we generate the counts of each level of the categorical variables.
For our target variable, 4,640 of the 41,188 records (11.3%) in the large data set and 451 of the 4,119 records (11.0%) in the small data set were “yes” for the current campaign. This means we have a mild imbalance in our data that can be addressed through the algorithm selected for our model.
Notably, there were three “yes” values for `default` in the large data set and one in the small data set, which suggests the researchers may have taken pains to keep the statistics of the two data sets similar.
# Get counts of every factor in each categorical column
categorical_counts <- large %>%
select(where(~ is.character(.) || is.factor(.))) %>%
pivot_longer(everything(), names_to = "Column", values_to = "Value") %>%
group_by(Column, Value) %>%
summarise(Count = n(), .groups = "drop")
# View the result
categorical_counts
## # A tibble: 55 × 3
## Column Value Count
## <chr> <chr> <int>
## 1 contact cellular 26144
## 2 contact telephone 15044
## 3 day_of_week fri 7827
## 4 day_of_week mon 8514
## 5 day_of_week thu 8623
## 6 day_of_week tue 8090
## 7 day_of_week wed 8134
## 8 default no 32588
## 9 default unknown 8597
## 10 default yes 3
## # ℹ 45 more rows
There are no NA values in either data set; however, in six of the eleven categorical variables there are missing values coded as “unknown”. We’re going to treat “unknown” as its own level, but we could instead try an imputation technique and compare.
# There are zero NA values in either data set
#small %>% summarise(total_missing = sum(is.na(.)))
#large %>% summarise(total_missing = sum(is.na(.)))
# Show categorical values of "unknown"
unknown_counts <- large %>%
select(where(~ is.character(.) || is.factor(.))) %>%
summarise(across(everything(), ~ sum(. == "unknown"), .names = "unknown_{.col}")) %>%
t()
unknown_counts <- as.data.frame(unknown_counts)
unknown_counts
## V1
## unknown_job 330
## unknown_marital 80
## unknown_education 1731
## unknown_default 8597
## unknown_housing 990
## unknown_loan 990
## unknown_contact 0
## unknown_month 0
## unknown_day_of_week 0
## unknown_poutcome 0
## unknown_y 0
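If we did want to impute instead, a first step (a sketch only, not run here) would be to recode “unknown” to NA so standard imputation tools can see the missingness:
# Sketch only: convert the "unknown" codes to NA ahead of a possible imputation step
large_na <- large %>%
  mutate(across(where(is.character), ~ na_if(., "unknown")))
colSums(is.na(large_na))   # NA counts now mirror the "unknown" counts shown above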
Here is the correlation matrix plot for the numeric columns. While we can’t see the correlation between these and the non-numeric target variable `y`, we note some strong correlations.
`pdays` and `previous` have a strong negative correlation. This is because clients who were not contacted in the previous campaign have 999 days since last contact and 0 previous contacts, while clients with at least one previous contact have a much smaller number of days since last contact, for example 15 or 30 days.
`euribor3m`, the daily Euribor 3-month rate, is an average of the rates at which European banks lend euros to each other, so it makes sense that it would be positively correlated with inflation, `cons.price.idx`, and with the change in quarterly employment, `emp.var.rate`: if borrowing costs are higher, companies are less likely to hire. This includes our Portuguese banking institution, whose total number of employees, `nr.employed`, is highly correlated with the daily Euribor 3-month rate, `euribor3m`.
# Correlation matrix plot
correlations <- cor(small[, sapply(small, is.numeric)])
corrplot::corrplot(correlations, method="square", type = "upper", order = 'original')
We’re going to remove `duration` because, if this is to be a useful prediction model for the bank, we can’t know in advance how long the call with the customer is going to be.
Separately from this analysis, the bank could use this information to produce guidelines for how long it’s reasonable to spend on a phone call with a client in future campaigns, but for our purposes we will remove it.
# Remove duration column
small <- small %>% select(-duration)
large <- large %>% select(-duration)
Here we checked both `large` and `small` for multicollinearity by fitting a logistic regression model to obtain the Variance Inflation Factor (VIF) for each predictor.
A VIF close to one is ideal, indicating no correlation between that variable and the others. A VIF above 10 is a warning of high multicollinearity that could impair our modeling efforts.
Here we show three variables with VIF scores above 10, confirming that we have high multicollinearity, which will influence our algorithm selections. Only the large data set results are displayed, but the small data set had similar scores.
# Convert "yes" to 1 and "no" to 0 in the target column `y`
small2 <- small
large2 <- large
small2$y <- ifelse(small2$y == "yes", 1, 0)
large2$y <- ifelse(large2$y == "yes", 1, 0)
# loan was perfectly collinear and had to be removed
#alias(glm_model)
small2 <- small2 %>% select(-loan)
large2 <- large2 %>% select(-loan)
# Logistic regression model and VIF for small
#glm_model <- glm(y ~ ., data = small2, family = binomial)
#vif_values <- car::vif(glm_model)
#vif_values
# Logistic regression model and VIF for large
glm_model <- glm(y ~ ., data = large2, family = binomial)
vif_values <- car::vif(glm_model)
vif_values
## GVIF Df GVIF^(1/(2*Df))
## age 2.203093 1 1.484282
## job 5.655303 11 1.081938
## marital 1.440082 3 1.062669
## education 3.214727 7 1.086988
## default 1.138725 2 1.033010
## housing 1.011423 2 1.002844
## contact 2.411083 1 1.552766
## month 65.363374 9 1.261397
## day_of_week 1.060082 4 1.007320
## campaign 1.043997 1 1.021761
## pdays 10.759483 1 3.280165
## previous 4.665959 1 2.160083
## poutcome 25.193842 2 2.240390
## emp.var.rate 144.876834 1 12.036479
## cons.price.idx 65.824565 1 8.113234
## cons.conf.idx 5.335383 1 2.309845
## euribor3m 142.091066 1 11.920196
## nr.employed 172.521063 1 13.134727
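As a sanity check on what the table above is measuring: the basic VIF for a numeric predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the others. A rough sketch of the idea (numeric predictors only, so it will not reproduce the GVIF values above exactly):
# Sketch: the idea behind VIF, illustrated for euribor3m against the other
# numeric economic indicators (approximate; car::vif() above also handles factors)
aux <- lm(euribor3m ~ emp.var.rate + cons.price.idx + cons.conf.idx + nr.employed,
          data = large2)
1 / (1 - summary(aux)$r.squared)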
Since we have labeled data we are selecting one of the classification models below.
Computationally demanding machine learning algorithms like Support Vector Machines (SVM) and Neural Networks would be onerous to run on the large data set, so they are out of consideration.
We’re going to start with logistic regression since it is a strong introductory classification algorithm.
We’ve also demonstrated that our data are highly multicollinear, so we are also going to try the decision tree algorithm. A random forest might be more robust and offer more insight into which features are important in predicting success; however, we will save that for a future exercise.
Model | Regression | Classification | Multicollinearity | Imbalance |
---|---|---|---|---|
Linear Regression | Yes | No | No | |
Logistic Regression | | Yes | No | Yes (with weighting) |
kNN | Yes | Yes | No | No |
Linear Discriminant Analysis | | Yes | No | No |
Support Vector Machines | Yes | Yes | Yes | Yes (with weighting or kernel adjustments) |
Random Forest | Yes | Yes | Yes | Yes |
AdaBoost | Yes | Yes | No | Yes (with reweighting) |
XGBoost | Yes | Yes | Yes | Yes (with weighting) |
Neural Networks | Yes | Yes | Yes (if properly regularized) | Yes (with sampling or custom loss functions) |
Here we split up our data into inputs and output for both the small and large data sets.
# Split data
lx <- large2 |> select(-y)
ly <- large2$y
sx <- small2 |> select(-y)
sy <- small2$y
lx <- as.data.frame(lx)
ly <- as.factor(ly)
sx <- as.data.frame(sx)
sy <- as.factor(sy)
We started with the small data set and ran into issues where two of the folds failed during cross-validation because of sparse categories in `education` (“illiterate”) and `default` (“yes”). Each is represented by only one record in the small data set, and by three “yes” and 18 “illiterate” records in the large set.
We’ve made the decision to exclude these records for modeling purposes. An alternative, at least for the illiterate records, would have been to lump them into the next lowest educational attainment category, “basic.4y”, or into “unknown” (a sketch of this is shown below). I don’t believe we could meaningfully do the same for `default` unless we assumed that a number of the “unknown” values were actually also “yes” for default.
Note, this is probably why “widowed” was counted as “divorced” in the `marital` variable: to reduce rare categorical levels.
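A minimal sketch of that lumping alternative, using forcats from the tidyverse (illustrative only; it is not applied to the modeling data):
# Sketch only: collapse the rare "illiterate" level into "basic.4y" instead of
# dropping those records
small_lumped <- small %>%
  mutate(education = forcats::fct_collapse(as.factor(education),
                                           basic.4y = c("basic.4y", "illiterate")))
#table(small_lumped$education)   # "illiterate" no longer appears as its own level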
# See all education attainment levels
#small %>%
# count(education)
# There is one each of yes and illiterate in the small data set
#sum(small$default == "yes", na.rm = TRUE)
#sum(small$education == "illiterate", na.rm = TRUE)
# There are 3 yes and 18 illiterate in the large data set
#sum(large$default == "yes", na.rm = TRUE)
#sum(large$education == "illiterate", na.rm = TRUE)
# Remove records with the two rare level categories
large3 <- large2 %>%
filter(!(education == "illiterate" | default == "yes"))
small3 <- small2 %>%
filter(!(education == "illiterate" | default == "yes"))
# Resplit data
lx <- large3 |> select(-y)
ly <- large3$y
sx <- small3 |> select(-y)
sy <- small3$y
lx <- as.data.frame(lx)
lx <- lx %>% mutate(across(where(is.character), as.factor))
ly <- as.factor(ly)
sx <- as.data.frame(sx)
sx <- sx %>% mutate(across(where(is.character), as.factor))
sy <- as.factor(sy)
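As mentioned in the learnings, a resampling alternative to leaning on the algorithm would be to oversample the minority class in the training data. A hedged sketch using caret’s `upSample()`, which we did not apply to the models below:
# Sketch only: balance the small training data by oversampling the "yes" class
sx_up <- upSample(x = sx, y = sy, yname = "y")   # returns the predictors plus a y column
#table(sx_up$y)                                  # both classes now have equal counts
The balanced frame could then be passed to train() in place of sx and sy.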
We fit the logistic regression model with 10-fold cross-validation on the small data set and arrived at an accuracy of 89.75%.
# Train Logistic Regression model using caret for small
glm_model_s <- train(
x = sx,
y = sy,
method = "glm",
family = binomial(),
trControl = trainControl(method = "cv", number = 10)
)
# Summary of the model
print(glm_model_s)
## Generalized Linear Model
##
## 4117 samples
## 18 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 3704, 3706, 3705, 3706, 3706, 3706, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8974965 0.2762803
We fit the logistic regression model with 10-fold cross-validation on the large data set and arrived at an accuracy of 89.99%. Additional data improved our accuracy and did not come with an onerous increase in computational time.
# Train Logistic Regression model using caret for large
glm_model_l <- train(
x = lx,
y = ly,
method = "glm",
family = binomial(),
trControl = trainControl(method = "cv", number = 10)
)
# Summary of the model
print(glm_model_l)
## Generalized Linear Model
##
## 41167 samples
## 18 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 37051, 37050, 37050, 37051, 37050, 37051, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8998712 0.2991016
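As noted earlier, regularization is one way to address the multicollinearity within the logistic regression itself. A minimal sketch using caret’s glmnet method (this assumes the glmnet package is installed and is not one of the models evaluated in this report):
# Sketch only: elastic-net regularized logistic regression on the large data;
# the formula interface expands the factor predictors into dummy variables for glmnet
enet_data    <- large3 %>% mutate(y = factor(y, levels = c(0, 1)))
enet_model_l <- train(
  y ~ .,
  data       = enet_data,
  method     = "glmnet",
  trControl  = trainControl(method = "cv", number = 10),
  tuneLength = 5          # small grid over alpha (mixing) and lambda (penalty strength)
)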
Here we train our decision tree models with the same data used in the logistic regression models. There were no complications or need for additional steps after the grooming done for the logistic regression models.
There was no onerous increase in time to train on the larger data set; however, we observed a notably higher accuracy on the smaller data set that diminished when training the same decision tree model on the larger data set. This may be evidence that decision trees tend to overfit smaller data sets.
We have an accuracy of 90.16% on the small data set.
# Train Decision Tree Model
tree_model_s <- train(
x = sx,
y = sy,
method = "rpart",
trControl = trainControl(method = "cv", number = 10)
)
# View the Decision Tree Model Summary
print(tree_model_s)
## CART
##
## 4117 samples
## 18 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 3705, 3705, 3705, 3705, 3706, 3705, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.007760532 0.8987161 0.2951341
## 0.009977827 0.9016299 0.2620359
## 0.058758315 0.8955531 0.1385242
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.009977827.
We have an accuracy of 90.05% on the large data set.
# Train Decision Tree Model
tree_model_l <- train(
x = lx,
y = ly,
method = "rpart",
trControl = trainControl(method = "cv", number = 10)
)
# View the Decision Tree Model Summary
print(tree_model_l)
## CART
##
## 41167 samples
## 18 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 37051, 37051, 37050, 37050, 37049, 37051, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.003091746 0.9004785 0.31045353
## 0.004817371 0.8992155 0.27535468
## 0.053817947 0.8913210 0.09338481
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.003091746.
Accuracy is the percentage of predictions made by the model that were correct. Kappa is the accuracy adjusted for chance agreement; with a perfectly balanced 50/50 target it works out to twice the accuracy minus one, but because our data was moderately imbalanced (roughly 11% yes, 89% no) a naive model can reach high accuracy by chance, so Kappa is the better guide.
Between the logistic regression models there is only a marginal improvement in raw accuracy when the model was trained on the large data; however, there was a meaningful increase in accuracy after adjusting for chance.
The decision tree model trained on the small data set had the highest accuracy of all, however this may have been due to the decision tree’s tendency to overfit the smaller data, an advantage in accuracy that disappears when the decision tree is trained on the larger data set. Interestingly, the decision tree increases its chance-adjusted accuracy (Kappa) when trained on the larger data set even though the raw accuracy went down.
Model | Accuracy | Kappa |
---|---|---|
Logistic Regression Small | 0.89750 | 0.27628 |
Logistic Regression Large | 0.89987 | 0.29910 |
Decision Tree Small | 0.90163 | 0.26204 |
Decision Tree Large | 0.90048 | 0.31045 |
Neither model shows a high enough prediction accuracy once chance is accounted for, so we would need to do additional modeling and analysis.
In this exercise we gained an overview of common models for both regression and classification problems, learned to weigh computational efficiency against the size of the data in scope, and began addressing imbalance in the prediction target and rare levels in categorical prediction features.
We also acquainted ourselves with the role of data exploration in developing leads on which types of models to pursue and how to preprocess our data so the algorithms can be applied.
Our analysis showed how the amount of data used to train the model needs to be weighed against the algorithms used. In our case the decision tree displayed potential overfitting on the smaller data set, even with 10-fold cross-validation, and required the larger data set to achieve its best fit. Developing this intuition becomes more important when you aren’t able to scale your training data upward and need to pursue alternative methods to reduce overfitting.
From a business standpoint we could still surface insights from the current analysis, using the coefficients of the logistic regression model and the branching of the decision tree as starting points. For example, the number of employees at the bank had large predictive value. Whether that was because morale was high or people were less overworked we don’t know, but we have a starting place to look.
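A short sketch of how those starting points could be pulled out of the fitted objects, using caret’s variable importance (illustrative; not run as part of the comparison above):
# Sketch: rank predictors by importance in each fitted model
varImp(glm_model_l)                  # for glm, scaled absolute z-statistics of the coefficients
varImp(tree_model_l)                 # for rpart, importance accumulated over the tree's splits
rpart.plot(tree_model_l$finalModel)  # visualize the branching referenced above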
In conclusion, we learned a lot about using data exploration to drive the approach to a problem and the choice of models that address the specific needs of the data.