Introduction

Our task is to conduct exploratory analysis to predict an outcome from two data sets, one large and one small, and to compare the results. We also need a short essay explaining our algorithm selections, how they relate to the data, and what we are trying to accomplish.




Essay

After first identifying that we were presented with a classification rather than a regression problem, we determined that our data had a multicollinearity problem and a moderate imbalance in the classification target.

We compiled a chart, reproduced below, of the machine learning algorithms covered in this course and assessed their ability to handle regression, classification, multicollinearity and class imbalance. We also set aside computationally intensive algorithms like SVM and neural networks, since we were training on both small and large data sets.

Ultimately we selected logistic regression and decision trees. While both achieved high raw accuracy, once we adjusted for random chance we were left with low-performing models that, although they can surface some insight into which features matter for predicting success, should be refined with further adjustments or new algorithms.

Learnings

Our data came from a Portuguese banking institution's marketing campaign, which a team of researchers combined with economic indicators to predict whether a given client would sign up for a term deposit product.

Only about 11% of the clients agreed to the product, which introduces class imbalance: raw accuracy is inflated, because a model that simply predicts "no" for every client is already right roughly 89% of the time. Because of this we relied on the Kappa score, which measures accuracy after the agreement expected by random chance has been removed: Kappa = (observed accuracy - expected accuracy) / (1 - expected accuracy). For the all-"no" model the accuracy expected by chance is that same 89%, so its Kappa is 0, meaning no predictive advantage at all. In our case we attempted to manage the imbalance through algorithm choice. Logistic regression and decision trees were reasonable, simpler models to start with; Random Forest and XGBoost would account for the imbalance more powerfully, but we are saving them for a future homework. An alternative would have been to train on a resampled training set that over-represents the minority class of "yes" responses, eliminating the imbalance during training.
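As a quick numeric illustration (our own sketch with rounded class shares, not part of the modeling code below), the all-"no" strategy earns a Kappa of zero:

# Cohen's Kappa for a model that predicts "no" for every client (illustrative numbers)
p_yes <- 0.11                        # approximate share of "yes" clients
p_no  <- 1 - p_yes
p_observed <- p_no                   # accuracy of the all-"no" model
p_expected <- 1 * p_no + 0 * p_yes   # agreement expected by chance for that strategy
(p_observed - p_expected) / (1 - p_expected)  # Kappa = 0: no skill beyond the imbalance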

Through both the correlation matrix of the numerical predictors and the Variance Inflation Factor (VIF) of an initially fit logistic regression model, we established that we had a high degree of multicollinearity. Modeling works best when the predictors have low collinearity with each other; when two predictors carry roughly the same information, models (regression models in particular) can more easily fit noise in the data. Multicollinearity can be at least partially addressed by adding regularization to the model (lasso, ridge, or both combined as elastic-net), or by combining collinear terms in a fixed way. We chose instead to address multicollinearity by comparing how well logistic regression and a decision tree handled it. Ideally we would have used Random Forest, and we will see that model in the next assignment.
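To make the regularization alternative concrete, here is a minimal sketch (not run in this report) using the glmnet package, which we do not otherwise use, applied to the cleaned data frame large2 that is prepared later in this document:

# Elastic-net logistic regression as a multicollinearity remedy (illustrative only)
library(glmnet)
x <- model.matrix(y ~ ., data = large2)[, -1]   # dummy-code predictors, drop intercept
fit <- cv.glmnet(x, as.factor(large2$y), family = "binomial", alpha = 0.5)  # alpha = 0.5: elastic-net
coef(fit, s = "lambda.min")                     # collinear predictors are shrunk toward zero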

While fitting the models we noticed that the decision tree was able to overfit the small data set and required relatively more data than logistic regression to fit properly, even with 10-fold cross-validation. We already knew that computationally intensive models like SVM and neural networks would be onerous to train on the larger data set, but we also need to be mindful that different algorithms require different amounts of data, depending on dimensionality, to work properly; we can always scale our data down, but we cannot easily (or at all) scale it up.


Model | Regression | Classification | Multicollinearity | Imbalance
----- | ---------- | -------------- | ----------------- | ---------
Linear Regression | Yes | No | No | –
Logistic Regression | – | Yes | No | Yes (with weighting)
kNN | Yes | Yes | No | No
Linear Discriminant Analysis | – | Yes | No | No
Support Vector Machines | Yes | Yes | Yes | Yes (with weighting or kernel adjustments)
Random Forest | Yes | Yes | Yes | Yes
AdaBoost | Yes | Yes | No | Yes (with reweighting)
XGBoost | Yes | Yes | Yes | Yes (with weighting)
Neural Networks | Yes | Yes | Yes (if properly regularized) | Yes (with sampling or custom loss functions)



Data


Chosen Data

We selected the “Bank Marketing” data set from the UC Irvine Machine Learning repository.

A Portuguese banking institution collected this data in a direct marketing campaign from May 2008 to November 2010. The data include information about the clients called, their demographics, banking history, and the timing, length and number of interactions they had with the bank, and ultimately whether the client subscribed to a term deposit.

Later, the paper authors enriched the data with five Portuguese economic indicators like interest rates that could affect whether a client subscribed.


Citation

Moro, S., Rita, P., & Cortez, P. (2014). Bank Marketing [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306.


Selection Rationale

Our understanding was that the originally intended data sets from excelbianalytics.com were synthetic and unlikely to contain meaningful interactions or predictive relationships.

When the data options were opened up to kaggle.com, we knew we had to find a very large data set so we could subset out a random fraction and contrast strategies for larger and smaller data sets. However, the first ten large data sets (30k+ observations) we found were either fractured with significant missingness, had too few features, or lent themselves to projects outside our scope, such as Natural Language Processing (NLP).

We turned to the UC Irvine Machine Learning Repository looking for large, nonsynthetic sets that were used in academic papers. The Bank Marketing data set seemed like it would be interesting to work with and met the size and complexity requirements.


Difficulty

The difficulty looks medium: we can tune some sophisticated models, but there will be minimal data wrangling. We are unlikely to explore interaction terms or to have to transform our data significantly. There are no missing (NA) values; however, there are cell values of "unknown" which could be kept as their own level, imputed, or deleted.


Loading Data

Here we load our two data sets, small and large.

There are 41,188 observations in the large data set. The small data set is a random 10% of the large data set with 4,119 observations.

They have identical structure with 20 features or prediction variables and one output attribute or target variable.

# Load packages
library(readr)      # data importation
library(tidyverse)  # has dplyr
library(corrplot)   # correlation matrix plots
library(caret)      # model structure
library(rpart)      # decision trees
library(rpart.plot) # decision tree plots
# Load data
loc_small <- "~/Documents/D622/HW1/bank-additional.csv"
loc_large <- "~/Documents/D622/HW1/bank-additional-full.csv"
small <- read_delim(loc_small, delim = ";", escape_double = FALSE, trim_ws = TRUE)
large <- read_delim(loc_large, delim = ";", escape_double = FALSE, trim_ws = TRUE)
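For reference, the repository already provides the small file, but an equivalent 10% subset could be drawn from the large set directly; a sketch (the seed and proportion here are our own choices, not the original authors'):

# Draw a random 10% subset of the large data (illustrative only; we use the provided file)
set.seed(622)
small_alt <- large %>% slice_sample(prop = 0.10)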


View data

Here we show the small data, which is identical in structure to the large data.

# Data Preview
glimpse(small)
## Rows: 4,119
## Columns: 21
## $ age            <dbl> 30, 39, 25, 38, 47, 32, 32, 41, 31, 35, 25, 36, 36, 47,…
## $ job            <chr> "blue-collar", "services", "services", "services", "adm…
## $ marital        <chr> "married", "single", "married", "married", "married", "…
## $ education      <chr> "basic.9y", "high.school", "high.school", "basic.9y", "…
## $ default        <chr> "no", "no", "no", "no", "no", "no", "no", "unknown", "n…
## $ housing        <chr> "yes", "no", "yes", "unknown", "yes", "no", "yes", "yes…
## $ loan           <chr> "no", "no", "no", "unknown", "no", "no", "no", "no", "n…
## $ contact        <chr> "cellular", "telephone", "telephone", "telephone", "cel…
## $ month          <chr> "may", "may", "jun", "jun", "nov", "sep", "sep", "nov",…
## $ day_of_week    <chr> "fri", "fri", "wed", "fri", "mon", "thu", "mon", "mon",…
## $ duration       <dbl> 487, 346, 227, 17, 58, 128, 290, 44, 68, 170, 301, 148,…
## $ campaign       <dbl> 2, 4, 1, 3, 1, 3, 4, 2, 1, 1, 1, 1, 2, 2, 2, 2, 6, 4, 2…
## $ pdays          <dbl> 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, …
## $ previous       <dbl> 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome       <chr> "nonexistent", "nonexistent", "nonexistent", "nonexiste…
## $ emp.var.rate   <dbl> -1.8, 1.1, 1.4, 1.4, -0.1, -1.1, -1.1, -0.1, -0.1, 1.1,…
## $ cons.price.idx <dbl> 92.893, 93.994, 94.465, 94.465, 93.200, 94.199, 94.199,…
## $ cons.conf.idx  <dbl> -46.2, -36.4, -41.8, -41.8, -42.0, -37.5, -37.5, -42.0,…
## $ euribor3m      <dbl> 1.313, 4.855, 4.962, 4.959, 4.191, 0.884, 0.879, 4.191,…
## $ nr.employed    <dbl> 5099.1, 5191.0, 5228.1, 5228.1, 5195.8, 4963.6, 4963.6,…
## $ y              <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "…


Variable Descriptions

Here we borrow heavily from the documentation provided through the citation website.

Bank Client Features
age - client’s age
job - type of job
marital - marital status (note, widowed is counted under “divorced”)
education - educational attainment
default - does the client have credit in default?
housing - have a housing loan?
loan - have a personal loan?

Campaign Features
contact - contacted by “cellular” or “telephone”
month - month of year they were last contacted
day_of_week - day of week they were last contacted (Monday - Friday)
duration - last contact duration in seconds

Other Features
campaign - number of contacts made to this client for this campaign
pdays - number of days since the client was last contacted in the previous campaign
previous - number of contacts made to this client for the previous campaign
poutcome - outcome of the previous marketing campaign

Economic Features
emp.var.rate - quarterly employment variation rate
cons.price.idx - monthly consumer price index
cons.conf.idx - monthly consumer confidence index
euribor3m - daily euribor 3 month rate
nr.employed - quarterly number of employees


Label Identification

Here is the column representing the outcome, which makes this a classification problem.

Target Variable
y - outcome of the current marketing campaign, “has the client subscribed a term deposit?”




Exploratory Data Analysis


Summary Statistics - Numeric

Here we provide summary statistics for the large set's numeric values in order to look for outliers and other anomalies. Since the small set is a subset of the large set, there is no need to repeat this for the small data set.

pdays, the number of days since the client was last contacted in the previous campaign, represents "never contacted" as 999 days. This coding may be sufficient for modeling purposes.
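If the 999 coding turned out to distort the models, one alternative (not applied in this report) is to split pdays into a "never contacted" flag plus the true day count; a sketch:

# Illustrative recoding of pdays: flag "never contacted" and keep real day counts separately
large_alt <- large %>%
  mutate(never_contacted = as.integer(pdays == 999),
         pdays_real = if_else(pdays == 999, NA_real_, pdays))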

nr.employed is listed among the quarterly economic indicators, so it likely reflects overall employment rather than the bank's own headcount. It is not obvious how it benefits the analysis; perhaps periods of lower employment coincide with less effective campaign calls.

# Show summary statistics for numeric columns
large %>%
    select(where(is.numeric)) %>%
    summary()
##       age           duration         campaign          pdays      
##  Min.   :17.00   Min.   :   0.0   Min.   : 1.000   Min.   :  0.0  
##  1st Qu.:32.00   1st Qu.: 102.0   1st Qu.: 1.000   1st Qu.:999.0  
##  Median :38.00   Median : 180.0   Median : 2.000   Median :999.0  
##  Mean   :40.02   Mean   : 258.3   Mean   : 2.568   Mean   :962.5  
##  3rd Qu.:47.00   3rd Qu.: 319.0   3rd Qu.: 3.000   3rd Qu.:999.0  
##  Max.   :98.00   Max.   :4918.0   Max.   :56.000   Max.   :999.0  
##     previous      emp.var.rate      cons.price.idx  cons.conf.idx  
##  Min.   :0.000   Min.   :-3.40000   Min.   :92.20   Min.   :-50.8  
##  1st Qu.:0.000   1st Qu.:-1.80000   1st Qu.:93.08   1st Qu.:-42.7  
##  Median :0.000   Median : 1.10000   Median :93.75   Median :-41.8  
##  Mean   :0.173   Mean   : 0.08189   Mean   :93.58   Mean   :-40.5  
##  3rd Qu.:0.000   3rd Qu.: 1.40000   3rd Qu.:93.99   3rd Qu.:-36.4  
##  Max.   :7.000   Max.   : 1.40000   Max.   :94.77   Max.   :-26.9  
##    euribor3m      nr.employed  
##  Min.   :0.634   Min.   :4964  
##  1st Qu.:1.344   1st Qu.:5099  
##  Median :4.857   Median :5191  
##  Mean   :3.621   Mean   :5167  
##  3rd Qu.:4.961   3rd Qu.:5228  
##  Max.   :5.045   Max.   :5228


Summary Statistics Categorical

Here we use code to generate the counts of each type of value in the categorical variables.

For our target variable, 11.27% (4,640/41,188) of clients in the large data set said yes to the current campaign, as did 10.95% (451/4,119) in the small data set. This means we have a mild imbalance in our data that can be addressed through the algorithm selection for our model.

Notably there were only three "yes" values for default in the large data set and one in the small data set, which suggests the small sample may have been drawn with some care to keep the two data sets' distributions similar.

# Get counts of every factor in each categorical column
categorical_counts <- large %>%
  select(where(~ is.character(.) || is.factor(.))) %>%
  pivot_longer(everything(), names_to = "Column", values_to = "Value") %>%
  group_by(Column, Value) %>%
  summarise(Count = n(), .groups = "drop")

# View the result
categorical_counts
## # A tibble: 55 × 3
##    Column      Value     Count
##    <chr>       <chr>     <int>
##  1 contact     cellular  26144
##  2 contact     telephone 15044
##  3 day_of_week fri        7827
##  4 day_of_week mon        8514
##  5 day_of_week thu        8623
##  6 day_of_week tue        8090
##  7 day_of_week wed        8134
##  8 default     no        32588
##  9 default     unknown    8597
## 10 default     yes           3
## # ℹ 45 more rows


Missing Values

There are no NA values in either data set; however, in six of the eleven categorical variables there are missing values coded as "unknown". We are going to treat "unknown" as a separate level, but we could also try an imputation technique and compare (a sketch of that alternative follows the counts below).

# There are zero NA values in either data set
#small %>% summarise(total_missing = sum(is.na(.)))
#large %>% summarise(total_missing = sum(is.na(.)))

# Show categorical values of "unknown"
unknown_counts <- large %>%
  select(where(~ is.character(.) || is.factor(.))) %>%
  summarise(across(everything(), ~ sum(. == "unknown"), .names = "unknown_{.col}")) %>%
  t()
unknown_counts <- as.data.frame(unknown_counts)
unknown_counts
##                       V1
## unknown_job          330
## unknown_marital       80
## unknown_education   1731
## unknown_default     8597
## unknown_housing      990
## unknown_loan         990
## unknown_contact        0
## unknown_month          0
## unknown_day_of_week    0
## unknown_poutcome       0
## unknown_y              0
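If we later wanted to try the imputation route mentioned above, a minimal sketch (our own illustration, not applied in this report) is to recode "unknown" to NA and then fill each categorical column with its most common level:

# Illustrative only: recode "unknown" to NA, then impute with each column's modal value
impute_mode <- function(x) {
  x[is.na(x)] <- names(which.max(table(x)))
  x
}
large_imputed <- large %>%
  mutate(across(where(is.character), ~ na_if(., "unknown"))) %>%
  mutate(across(where(is.character), impute_mode))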


Correlation Matrix

Here is the correlation matrix plot for the numeric columns. While we cannot see correlations between these predictors and the non-numeric target variable y, we do note some strong correlations among the predictors themselves.

pdays and previous have a strong negative correlation. This is because clients who were not contacted in the previous campaign have 999 days since last contact and 0 previous contacts, while clients with at least one previous contact have a much smaller pdays value, for example 15 or 30 days.

euribor3m, the daily Euribor 3-month rate, is an average of the rates at which European banks lend euros to one another. It is strongly positively correlated with the consumer price index (cons.price.idx), the quarterly employment variation rate (emp.var.rate) and the quarterly number of employees (nr.employed). Over the May 2008 to November 2010 window these indicators moved together, with rates, prices and employment all falling during the downturn, which likely explains why they are so collinear.

# Correlation matrix plot
correlations <- cor(small[, sapply(small, is.numeric)])
corrplot::corrplot(correlations, method="square", type = "upper", order = 'original')


Variable Elimination

We’re going to remove duration because if this is to be a useful prediction model for the bank we can’t know in advance how long the call with the customer is going to be.

Separate to this analysis, the bank could use this information to produce guidelines for how long it’s reasonable to spend on a phone call with a client for future campaigns but for our purposes we will remove it.

# Remove duration column
small <- small %>% select(-duration)
large <- large %>% select(-duration)


Multicollinearity

Here we checked both the large and small data sets for multicollinearity by fitting a logistic regression model and computing the Variance Inflation Factor (VIF) for each predictor.

A VIF close to one is ideal, indicating essentially no correlation between that variable and the others. A VIF above 10 is a warning of high multicollinearity that could impair our modeling efforts.

Here three variables (emp.var.rate, euribor3m and nr.employed) have degrees-of-freedom-adjusted GVIF values (the GVIF^(1/(2*Df)) column) above 10, confirming that we have high multicollinearity, which will influence our algorithm selections. Only the large data set results are displayed, but the small data set had similar scores.

# Convert "yes" to 1 and "no" to 0 in the target column `y`
small2 <- small
large2 <- large
small2$y <- ifelse(small2$y == "yes", 1, 0)
large2$y <- ifelse(large2$y == "yes", 1, 0)

# loan was perfectly collinear and had to be removed
#alias(glm_model)
small2 <- small2 %>% select(-loan)
large2 <- large2 %>% select(-loan)

# Logistic regression model and VIF for small
#glm_model <- glm(y ~ ., data = small2, family = binomial)
#vif_values <- car::vif(glm_model)
#vif_values

# Logistic regression model and VIF for large
glm_model <- glm(y ~ ., data = large2, family = binomial)
vif_values <- car::vif(glm_model)
vif_values
##                      GVIF Df GVIF^(1/(2*Df))
## age              2.203093  1        1.484282
## job              5.655303 11        1.081938
## marital          1.440082  3        1.062669
## education        3.214727  7        1.086988
## default          1.138725  2        1.033010
## housing          1.011423  2        1.002844
## contact          2.411083  1        1.552766
## month           65.363374  9        1.261397
## day_of_week      1.060082  4        1.007320
## campaign         1.043997  1        1.021761
## pdays           10.759483  1        3.280165
## previous         4.665959  1        2.160083
## poutcome        25.193842  2        2.240390
## emp.var.rate   144.876834  1       12.036479
## cons.price.idx  65.824565  1        8.113234
## cons.conf.idx    5.335383  1        2.309845
## euribor3m      142.091066  1       11.920196
## nr.employed    172.521063  1       13.134727



Algorithms


Algorithm Selection Process

Since our target variable is a categorical label (yes/no rather than a numeric quantity), we are selecting one of the classification models below.

If we were using a computationally demanding machine learning algorithm like Support Vector Machines (SVM) or Neural Networks, it would be onerous to run models on the large data set, so they are out of consideration.

We’re going to start with logistic regression since it is a strong introductory classification algorithm.

We’ve also demonstrated that our data is highly multicollinear so we are also going to try the decision tree algorithm. A random forest might be more robust and offer more insights about which features are important in predicting success however we will save that for a future exercise.


Model | Regression | Classification | Multicollinearity | Imbalance
----- | ---------- | -------------- | ----------------- | ---------
Linear Regression | Yes | No | No | –
Logistic Regression | – | Yes | No | Yes (with weighting)
kNN | Yes | Yes | No | No
Linear Discriminant Analysis | – | Yes | No | No
Support Vector Machines | Yes | Yes | Yes | Yes (with weighting or kernel adjustments)
Random Forest | Yes | Yes | Yes | Yes
AdaBoost | Yes | Yes | No | Yes (with reweighting)
XGBoost | Yes | Yes | Yes | Yes (with weighting)
Neural Networks | Yes | Yes | Yes (if properly regularized) | Yes (with sampling or custom loss functions)
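As the table notes, several of these algorithms handle imbalance "with weighting" or resampling. In caret, resampling can be requested directly through trainControl; a sketch (not used in the fits below) that up-samples the minority class inside each cross-validation fold:

# Illustrative only: ask caret to up-sample the minority class within each CV fold
ctrl_up <- trainControl(method = "cv", number = 10, sampling = "up")
# glm_up <- train(x = sx, y = sy, method = "glm", family = binomial(), trControl = ctrl_up)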


Split Data

Here we split up our data into inputs and output for both the small and large data sets.

# Split data
lx <- large2 |> select(-y)
ly <- large2$y
sx <- small2 |> select(-y)
sy <- small2$y
lx <- as.data.frame(lx)
ly <- as.factor(ly)
sx <- as.data.frame(sx)
sy <- as.factor(sy)


Logistic Regression

We started with the small data set and ran into issues where two of the folds failed during cross-validation because of sparse categories in education ("illiterate") and default ("yes"). Each is represented by only one record in the small data set, and by three "yes" and 18 "illiterate" records in the large set.

We’ve made the decision to exclude these records for modeling purposes. An alternative for at least the illiterate records would have been to lump them into the next lowest educational attainment category, “basic.4y” or “unknown”. I don’t believe we could meaningfully do the same for default unless maybe we assumed a number of the “unknown” were actually also “yes” for default.

Note, this is probably why "widowed" was counted under "divorced" in the marital variable: to reduce rare categorical levels.
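Had we chosen the lumping alternative instead of dropping the records, a sketch (our own illustration, not applied here) for the small data set might look like:

# Illustrative only: fold the rare "illiterate" level into the lowest attainment level
small_lumped <- small2 %>%
  mutate(education = if_else(education == "illiterate", "basic.4y", education))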

# See all education attainment levels
#small %>%
#  count(education)

# There is one each of yes and illiterate in the small data set
#sum(small$default == "yes", na.rm = TRUE)
#sum(small$education == "illiterate", na.rm = TRUE)

# There are 3 yes and 18 illiterate in the large data set
#sum(large$default == "yes", na.rm = TRUE)
#sum(large$education == "illiterate", na.rm = TRUE)

# Remove records with the two rare level categories
large3 <- large2 %>%
  filter(!(education == "illiterate" | default == "yes"))
small3 <- small2 %>%
  filter(!(education == "illiterate" | default == "yes"))

# Resplit data
lx <- large3 |> select(-y)
ly <- large3$y
sx <- small3 |> select(-y)
sy <- small3$y
lx <- as.data.frame(lx)
lx <- lx %>% mutate(across(where(is.character), as.factor))
ly <- as.factor(ly)
sx <- as.data.frame(sx)
sx <- sx %>% mutate(across(where(is.character), as.factor))
sy <- as.factor(sy)


Small Logistic Regression

We fit the logistic regression model with 10-fold cross-validation on the small data set and arrived at an accuracy of approximately 89.75%.

# Train Logistic Regression model using caret for small
glm_model_s <- train(
  x = sx, 
  y = sy, 
  method = "glm",
  family = binomial(),
  trControl = trainControl(method = "cv", number = 10)
)

# Summary of the model
print(glm_model_s)
## Generalized Linear Model 
## 
## 4117 samples
##   18 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 3704, 3706, 3705, 3706, 3706, 3706, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8974965  0.2762803


Large Logistic Regression

We fit the logistic regression model with 10-fold cross-validation on the large data set and arrived at an accuracy of 89.99%. Additional data improved our accuracy and did not come with an onerous increase in computational time.

# Train Logistic Regression model using caret for large
glm_model_l <- train(
  x = lx, 
  y = ly, 
  method = "glm",
  family = binomial(),
  trControl = trainControl(method = "cv", number = 10)
)

# Summary of the model
print(glm_model_l)
## Generalized Linear Model 
## 
## 41167 samples
##    18 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 37051, 37050, 37050, 37051, 37050, 37051, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8998712  0.2991016


Decision Tree

Here we train our decision tree models with the same data used in the logistic regression models. There were no complications or need for additional steps after the grooming done for the logistic regression models.

There was no onerous increase in time to train on the larger data set; however, we observed a notably higher accuracy on the smaller data set that diminished when training the same decision tree model on the larger data set. This may be evidence that decision trees tend to overfit smaller data sets.


Small Decision Tree

We have an accuracy of approximately 90.16% on the small data set.

# Train Decision Tree Model
tree_model_s <- train(
  x = sx, 
  y = sy, 
  method = "rpart",
  trControl = trainControl(method = "cv", number = 10)
)

# View the Decision Tree Model Summary
print(tree_model_s)
## CART 
## 
## 4117 samples
##   18 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 3705, 3705, 3705, 3705, 3706, 3705, ... 
## Resampling results across tuning parameters:
## 
##   cp           Accuracy   Kappa    
##   0.007760532  0.8987161  0.2951341
##   0.009977827  0.9016299  0.2620359
##   0.058758315  0.8955531  0.1385242
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.009977827.


Large Decision Tree

We have an accuracy of approximately 90.05% on the large data set.

# Train Decision Tree Model
tree_model_l <- train(
  x = lx, 
  y = ly, 
  method = "rpart",
  trControl = trainControl(method = "cv", number = 10)
)

# View the Decision Tree Model Summary
print(tree_model_l)
## CART 
## 
## 41167 samples
##    18 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 37051, 37051, 37050, 37050, 37049, 37051, ... 
## Resampling results across tuning parameters:
## 
##   cp           Accuracy   Kappa     
##   0.003091746  0.9004785  0.31045353
##   0.004817371  0.8992155  0.27535468
##   0.053817947  0.8913210  0.09338481
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.003091746.
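Although we do not display it in this report, the final fitted tree can be visualized with the rpart.plot package loaded earlier; for example, for the large-data model:

# Plot the final large-data tree (caret stores the underlying rpart fit in finalModel)
rpart.plot::rpart.plot(tree_model_l$finalModel)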


Comparison

Accuracy is the percentage of the model's predictions that were correct. Kappa adjusts accuracy for the agreement expected by chance. With perfectly balanced classes (50% yes and 50% no), plain accuracy is already informative, but our data is moderately imbalanced (roughly 11% yes, 89% no), so Kappa is the better guide.

Between the logistic regression models there is only a slight improvement in accuracy when the model is trained on the large data set; however, there is a more meaningful increase once accuracy is adjusted for chance (Kappa).

The decision tree model trained on the small data set had the highest accuracy of all; however, that may be due to the decision tree's tendency to overfit the smaller data, an advantage in raw accuracy that shrinks when the tree is trained on the larger data set. Interestingly, the decision tree's chance-adjusted accuracy (Kappa) increases when it is trained on the larger data set even though its raw accuracy goes down.

Model | Accuracy | Kappa
----- | -------- | -----
Logistic Regression Small | 0.89750 | 0.27628
Logistic Regression Large | 0.89987 | 0.29910
Decision Tree Small | 0.90163 | 0.26204
Decision Tree Large | 0.90048 | 0.31045
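Beyond overall accuracy and Kappa, a confusion matrix breaks the errors out by class (sensitivity and specificity). A sketch for the large logistic regression model, evaluated here on the training data for illustration (a held-out test set would be more appropriate):

# Confusion matrix for the large logistic regression model (training data; illustrative)
confusionMatrix(predict(glm_model_l, newdata = lx), ly, positive = "1")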



Conclusions

Neither model achieves a high enough prediction rate once chance is accounted for (Kappa), so we would need to do additional modeling and analysis.

What we learned in this exercise: an overview of the common models we can use for both regression and classification problems, how to weigh computational efficiency against the size of the data, and how to start addressing imbalance in the prediction target and rare levels in categorical prediction features.

We also acquainted ourselves with the role of data exploration in developing leads on which types of models to pursue and in deciding how to preprocess our data so the algorithms can be applied.

Our analysis showed how the amount of data used to train a model needs to be weighed against the algorithm used. In our case the decision tree displayed potential overfitting on the smaller data set, even with 10-fold cross-validation, and required the larger data set to achieve its best chance-adjusted fit. Developing this intuition becomes more important when you cannot scale your training data upward and need to pursue alternative methods to reduce overfitting.

From a business standpoint we could still surface insights from the current analysis, using the coefficients of the logistic regression model and the splits of the decision tree as starting points. For example, the quarterly number of employees (nr.employed) had a large predictive benefit. Whether that reflects broader economic conditions or something about how the campaign was staffed, we don't know, but it gives us a starting place to look.

In conclusion, we learned a lot about using data exploration to drive the approach to a problem and about which types of models address the specific needs of the data.