# Load the bank-full data.
url <- "C:/Users/zengo/OneDrive/Magic Briefcase/School of Professional Studies/DATA 622 Machine Learning/Data 622 Asgmt 1/bank-full.csv"
bankdata <- read.csv(url)
# The Bank Marketing repository at https://archive.ics.uci.edu/dataset/222/bank+marketing states that there are four datasets. I considered using bank-additional-full.csv, with all 41,188 examples and 20 inputs, ordered by date (from May 2008 to November 2010), which is very close to the data analyzed in [Moro et al., 2014]. However, the website lacks a description for four of its variables, so I used the bank-full data, which has full variable descriptions.
# View the first 6 rows of data.
head(bankdata)
## age.job.marital.education.default.balance.housing.loan.contact.day.month.duration.campaign.pdays.previous.poutcome.y
## 1 58;management;married;tertiary;no;2143;yes;no;unknown;5;may;261;1;-1;0;unknown;no
## 2 44;technician;single;secondary;no;29;yes;no;unknown;5;may;151;1;-1;0;unknown;no
## 3 33;entrepreneur;married;secondary;no;2;yes;yes;unknown;5;may;76;1;-1;0;unknown;no
## 4 47;blue-collar;married;unknown;no;1506;yes;no;unknown;5;may;92;1;-1;0;unknown;no
## 5 33;unknown;single;unknown;no;1;no;no;unknown;5;may;198;1;-1;0;unknown;no
## 6 35;management;married;tertiary;no;231;yes;no;unknown;5;may;139;1;-1;0;unknown;no
Upon inspection, the values in each row are separated by semicolons and need to be split into individual columns. This is accomplished by rereading the file with read.csv() and specifying sep = ";" (the readr function read_delim() would work as well).
# Reload the file to separate the semicolon delineated data into separate columns.
bankdata <- read.csv(url, sep = ";", header = TRUE)
# View the data
head(bankdata)
## age job marital education default balance housing loan contact day
## 1 58 management married tertiary no 2143 yes no unknown 5
## 2 44 technician single secondary no 29 yes no unknown 5
## 3 33 entrepreneur married secondary no 2 yes yes unknown 5
## 4 47 blue-collar married unknown no 1506 yes no unknown 5
## 5 33 unknown single unknown no 1 no no unknown 5
## 6 35 management married tertiary no 231 yes no unknown 5
## month duration campaign pdays previous poutcome y
## 1 may 261 1 -1 0 unknown no
## 2 may 151 1 -1 0 unknown no
## 3 may 76 1 -1 0 unknown no
## 4 may 92 1 -1 0 unknown no
## 5 may 198 1 -1 0 unknown no
## 6 may 139 1 -1 0 unknown no
The data are now separated into their own columns.
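For reference, the readr approach mentioned above might look like the following sketch, assuming the same local path stored in url; read_delim() infers column types and returns a tibble (bankdata_tbl is a hypothetical name, and the chunk is not run here).
library(readr)
# Semicolon-delimited read with readr; equivalent result to read.csv(url, sep = ";").
bankdata_tbl <- read_delim(url, delim = ";", show_col_types = FALSE)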
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# I used the glimpse() command to show the variable names, data types, and some sample data. I loaded the tidyverse library to enable the glimpse command.
glimpse(bankdata)
## Rows: 45,211
## Columns: 17
## $ age <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ job <chr> "management", "technician", "entrepreneur", "blue-collar", "…
## $ marital <chr> "married", "single", "married", "married", "single", "marrie…
## $ education <chr> "tertiary", "secondary", "secondary", "unknown", "unknown", …
## $ default <chr> "no", "no", "no", "no", "no", "no", "no", "yes", "no", "no",…
## $ balance <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ housing <chr> "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes"…
## $ loan <chr> "no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no"…
## $ contact <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ day <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ month <chr> "may", "may", "may", "may", "may", "may", "may", "may", "may…
## $ duration <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ poutcome <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ y <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
The data contain 45,211 rows and 17 columns: one dependent variable (y) and sixteen independent variables (features). Seven columns are continuous (integer) variables and ten are categorical (character) variables.
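A quick way to confirm that column-type split programmatically (output not shown):
# Tally the classes of all 17 columns; expect 7 integer and 10 character.
table(sapply(bankdata, class))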
# To clearly see the explanations of all the variables, I inserted an interactive table. I copied and pasted the Variables Table from the Bank Marketing homepage into an Excel CSV file, then loaded it into this report as follows:
names_url <- "C:/Users/zengo/OneDrive/Magic Briefcase/School of Professional Studies/DATA 622 Machine Learning/Data 622 Asgmt 1/bank-names.csv"
bank_names <- read.csv(names_url, stringsAsFactors = FALSE) # I used the stringsAsFactors = FALSE to make sure that the narrative remains as character data instead of being converted into factor variables.
# Load the DT library to make an interactive data table:
library(DT)
# I ran the following command to display the interactive data table:
datatable(bank_names, options = list(pageLength = 20, scrollX = TRUE)) # The options control how many rows to display per page, and scrollX adds a horizontal scroll bar. A nice feature of the DT library is the built-in search box for larger data tables.
Upon inspection, some of the variables should be stored as a different vector type. The character variables (job, marital, education, default, housing, loan, contact, month, poutcome, and y) are converted to factors so that their specific levels are represented explicitly.
# From the tidyverse library I used the mutate_if() command to identify only <chr> vectors with the is.character function and convert them to factors with as.factor() in the bankdata object. Doing this retains the integer variables as <int>.
bankdata <- bankdata %>%
mutate_if(is.character, as.factor)
# I ran the glimpse() function again to verify the <chr> to <fct> changes, while retaining the <int> atomic vectors.
glimpse(bankdata)
## Rows: 45,211
## Columns: 17
## $ age <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ job <fct> management, technician, entrepreneur, blue-collar, unknown, …
## $ marital <fct> married, single, married, married, single, married, single, …
## $ education <fct> tertiary, secondary, secondary, unknown, unknown, tertiary, …
## $ default <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no, no,…
## $ balance <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ housing <fct> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, y…
## $ loan <fct> no, no, yes, no, no, no, yes, no, no, no, no, no, no, no, no…
## $ contact <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
## $ day <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ month <fct> may, may, may, may, may, may, may, may, may, may, may, may, …
## $ duration <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ poutcome <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
## $ y <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
# To get a statistical summary of all the variables in bankdata, I ran the summary() function:
summary(bankdata)
## age job marital education
## Min. :18.00 blue-collar:9732 divorced: 5207 primary : 6851
## 1st Qu.:33.00 management :9458 married :27214 secondary:23202
## Median :39.00 technician :7597 single :12790 tertiary :13301
## Mean :40.94 admin. :5171 unknown : 1857
## 3rd Qu.:48.00 services :4154
## Max. :95.00 retired :2264
## (Other) :6835
## default balance housing loan contact
## no :44396 Min. : -8019 no :20081 no :37967 cellular :29285
## yes: 815 1st Qu.: 72 yes:25130 yes: 7244 telephone: 2906
## Median : 448 unknown :13020
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
##
## day month duration campaign
## Min. : 1.00 may :13766 Min. : 0.0 Min. : 1.000
## 1st Qu.: 8.00 jul : 6895 1st Qu.: 103.0 1st Qu.: 1.000
## Median :16.00 aug : 6247 Median : 180.0 Median : 2.000
## Mean :15.81 jun : 5341 Mean : 258.2 Mean : 2.764
## 3rd Qu.:21.00 nov : 3970 3rd Qu.: 319.0 3rd Qu.: 3.000
## Max. :31.00 apr : 2932 Max. :4918.0 Max. :63.000
## (Other): 6060
## pdays previous poutcome y
## Min. : -1.0 Min. : 0.0000 failure: 4901 no :39922
## 1st Qu.: -1.0 1st Qu.: 0.0000 other : 1840 yes: 5289
## Median : -1.0 Median : 0.0000 success: 1511
## Mean : 40.2 Mean : 0.5803 unknown:36959
## 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :871.0 Max. :275.0000
##
Upon inspection of the variables, the dependent variable, y, is categorical, which points to the use of a classification algorithm. The response is binary, which is well suited for binomial logistic regression.
# Following the Nwanganga & Chapple (2020) Binomial Logistic Regression Model analysis, I separated the bankdata by categorical and continuous variables to parse the analysis. Since all the <chr> vectors were converted to <fct> vectors, I call just the is.factor function for the categorical variables:
bankdata %>%
keep(is.factor) %>%
summary()
## job marital education default housing
## blue-collar:9732 divorced: 5207 primary : 6851 no :44396 no :20081
## management :9458 married :27214 secondary:23202 yes: 815 yes:25130
## technician :7597 single :12790 tertiary :13301
## admin. :5171 unknown : 1857
## services :4154
## retired :2264
## (Other) :6835
## loan contact month poutcome y
## no :37967 cellular :29285 may :13766 failure: 4901 no :39922
## yes: 7244 telephone: 2906 jul : 6895 other : 1840 yes: 5289
## unknown :13020 aug : 6247 success: 1511
## jun : 5341 unknown:36959
## nov : 3970
## apr : 2932
## (Other): 6060
Note that there is no missing categorical data. The Bank Marketing webpage reported that there was no missing data, and that is now verified for the categorical variables: none of them report missing values (NAs).
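The no-missing-data claim can also be checked directly. Note that "unknown" is a legitimate category in several columns, not an NA, so this sketch counts it separately (output not shown):
# Count NA values per column (expected to be all zero).
colSums(is.na(bankdata))
# "unknown" is a valid level in columns such as job, education, contact, and poutcome, not a missing value.
sapply(bankdata, function(x) sum(x == "unknown"))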
# For the continuous variables, I call the is.numeric function for the continuous variables:
bankdata %>%
keep(is.numeric) %>%
summary()
## age balance day duration
## Min. :18.00 Min. : -8019 Min. : 1.00 Min. : 0.0
## 1st Qu.:33.00 1st Qu.: 72 1st Qu.: 8.00 1st Qu.: 103.0
## Median :39.00 Median : 448 Median :16.00 Median : 180.0
## Mean :40.94 Mean : 1362 Mean :15.81 Mean : 258.2
## 3rd Qu.:48.00 3rd Qu.: 1428 3rd Qu.:21.00 3rd Qu.: 319.0
## Max. :95.00 Max. :102127 Max. :31.00 Max. :4918.0
## campaign pdays previous
## Min. : 1.000 Min. : -1.0 Min. : 0.0000
## 1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.: 0.0000
## Median : 2.000 Median : -1.0 Median : 0.0000
## Mean : 2.764 Mean : 40.2 Mean : 0.5803
## 3rd Qu.: 3.000 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :63.000 Max. :871.0 Max. :275.0000
Note that there is no missing continuous data either: the Bank Marketing webpage reported no missing data, and that is now verified for the continuous variables as well. I then reviewed the data for outliers by visually comparing minimum and maximum values to the mean and median of each continuous variable. Balance, duration, campaign, pdays, and previous all contain minimum and/or maximum figures that are far afield from their mean and median values.
One variable sticks out. According to the Variable Names legend, “pdays” is a numeric variable representing the number of days that passed after the client was last contacted in a previous campaign; a value of -1 means the client was not previously contacted. Logically, the -1 values should be removed, since those clients were never contacted, and they cannot simply be set to zero, as that would mean a client was recontacted the same day. The -1 values therefore need further analysis to decide whether pdays should remain in the analysis.
# To determine the appropriateness of the pdays variable, I computed the proportion of -1 values relative to the overall count of pdays data. I created a new object, "neg1_proportion": using the summarise() function, I counted the -1 values as neg1, counted the total number of rows in bankdata with n() (there is no missing data), and computed the proportion of -1 values to the total row count.
neg1_proportion <- bankdata %>%
summarise(
neg1 = sum(pdays == -1),
total_count = n(),
proportion = neg1 / total_count
)
print(neg1_proportion)
## neg1 total_count proportion
## 1 36954 45211 0.8173675
The outcome is telling. Approximately 8 out of 10 clients were never contacted in a previous campaign, so pdays carries a meaningful value for only about 18% of clients. There is a choice to be made. I could drop the variable, because pdays is meant to measure how much time passed since the previous campaign and roughly 82% of its values are uninformative. The alternative is to convert the continuous variable into a categorical variable measuring whether a client was or was not previously contacted, but that would introduce its own problem: I would lose the effect of elapsed time for those who were previously contacted. For simplicity, I dropped the pdays variable.
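For reference, the categorical alternative described above could be sketched as follows; prev_contacted is a hypothetical variable name, and the code is left commented out because I chose to drop pdays instead:
# Alternative (not used): replace pdays with a previously-contacted yes/no flag.
# bankdata <- bankdata %>%
#   mutate(prev_contacted = factor(ifelse(pdays == -1, "no", "yes"))) %>%
#   select(-pdays)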
# Drop the pdays variable due to unsuitable data.
bankdata <- bankdata %>%
select(-pdays)
The “previous” variable also sticks out. The mean is very low and the median is zero. It has a maximum value of 275 but a mean of 0.5803, which indicates there must be a large number of zero values. I computed the proportion of zero values in the data.
# To determine the appropriateness of the "previous" variable, I determined the proportion of 0 values to the overall count of "previous" data. I created a new object, "previous0_proportion" and took the bankdata and with the summarise function, piped the values with 0 as an object labeled "previous0". Then, I called the total count of all the rows in the bankdata file (since there is no missing data) and created a proportion formula of the 0 values to the overall number of bankdata rows.
previous0_proportion <- bankdata %>%
summarise(
previous0 = sum(previous == 0),
total_count = n(),
proportion = previous0 / total_count
)
print(previous0_proportion)
## previous0 total_count proportion
## 1 36954 45211 0.8173675
About 82% of the values are zero, the same proportion as the -1 values in pdays, which makes sense: a client who was never previously contacted has no prior contact count. The “previous” variable measures the number of contacts performed before this campaign for each client. Since it is zero for the large majority of clients, there is little value in continuing to analyze it, and for simplicity I dropped the “previous” variable as well.
# Drop the previous variable due to unsuitable data.
bankdata <- bankdata %>%
select(-previous)
Although a visual inspection of the summary statistics indicates outlier activity, it is better to visualize the distributions with histograms.
# Load required libraries
library(ggplot2)
library(dplyr)
library(rlang)
##
## Attaching package: 'rlang'
## The following objects are masked from 'package:purrr':
##
## %@%, flatten, flatten_chr, flatten_dbl, flatten_int, flatten_lgl,
## flatten_raw, invoke, splice
# To avoid repeating commands to generate histograms for the numeric variables, I created a list of variables of interest to plot. Then I created a function called "histograms", which takes the data set, the variables to plot, and a binwidth (set to 0.5). A for loop iterates over each variable and creates a histogram with ggplot(). I can call the function with histograms(bankdata, variables, binwidth = .5). Because the variable names are stored as strings inside the function, the rlang package (sym() and !!) is used to reference column names dynamically in ggplot2 (R Documentation).
# Define numeric columns to plot
variables <- c("age", "balance", "day", "duration", "campaign")
histograms <- function(bankdata, variables, binwidth = .5) {
for (var in variables) {
plot_histograms <- ggplot(bankdata, aes(x = !!sym(var))) +
geom_histogram(binwidth = binwidth, fill = "blue", color = "black", alpha = 0.7) +
labs(title = paste("Histogram of", var), x = var, y = "Frequency") +
theme_minimal()
print(plot_histograms) # Print each plot one by one
}
}
# Call the function to generate histograms
histograms(bankdata, variables, binwidth = .5)
The “day” variable histogram appears somewhat normally distributed. However, all other variables appear right-skewed. Per Nwanganga & Chapple (2020), “there are several approaches to dealing with outlier data. One approach is to use a simple rule of thumb based on the statistical properties of the data. The principle behind the rule is that any value that is larger or less than 1.5 times the interquartile range (IQR) is labeled as an outlier and should be removed from the data.”
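Before applying that rule below, a boxplot gives a complementary view, since ggplot2 draws points beyond 1.5 times the IQR as outliers by default; a minimal sketch for the "balance" variable (plot not shown):
# Boxplot of balance; points beyond 1.5 * IQR from the box are flagged as outliers.
ggplot(bankdata, aes(y = balance)) +
  geom_boxplot(outlier.colour = "red", outlier.alpha = 0.3) +
  theme_minimal()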
# Using the mutate() function, I created boundary values for the remaining continuous variables and removed any value more than 1.5 times the interquartile range beyond the quartiles. For each variable, I found the upper boundary with the quantile() function at the third quartile (Q3) plus 1.5 times the interquartile range from IQR() (which equals Q3 - Q1). Note that the "balance" variable has significant negative outliers: all the variables were right skewed with upper-boundary outliers, but "balance" also showed extreme lower-boundary values that needed to be addressed. I adapted the code provided by Nwanganga & Chapple (2020) to also remove values more than 1.5 times the IQR below the first quartile, the mirror image of the upper-boundary treatment, as seen in the mutate(min2 = quantile(balance, .25) - (1.5 * IQR(balance))) line.
bankdata <- bankdata %>%
mutate(max1 = quantile(age, .75) + (1.5 * IQR(age))) %>%
mutate(max2 = quantile(balance, .75) + (1.5 * IQR(balance))) %>%
mutate(min2 = quantile(balance, .25) - (1.5 * IQR(balance))) %>%
mutate(max3 = quantile(day, .75) + (1.5 * IQR(day))) %>%
mutate(max4 = quantile(duration, .75) + (1.5 * IQR(duration))) %>%
mutate(max5 = quantile(campaign, .75) + (1.5 * IQR(campaign))) %>%
filter(age <= max1) %>%
filter(balance <= max2) %>%
filter(balance >= min2) %>%
filter(day <= max3) %>%
filter(duration <= max4) %>%
filter(campaign <= max5) %>%
select(-max1,-max2,-min2,-max3,-max4,-max5)
# I removed any value that is larger or less than 1.5 times the interquartile range:
bankdata %>%
keep(is.numeric) %>%
summary()
## age balance day duration
## Min. :18.00 Min. :-1944.0 Min. : 1.00 Min. : 0
## 1st Qu.:32.00 1st Qu.: 46.0 1st Qu.: 8.00 1st Qu.:102
## Median :39.00 Median : 344.0 Median :15.00 Median :171
## Mean :40.28 Mean : 633.8 Mean :15.47 Mean :207
## 3rd Qu.:48.00 3rd Qu.: 967.0 3rd Qu.:21.00 3rd Qu.:280
## Max. :70.00 Max. : 3462.0 Max. :31.00 Max. :643
## campaign
## Min. :1.000
## 1st Qu.:1.000
## Median :2.000
## Mean :2.132
## 3rd Qu.:3.000
## Max. :6.000
The results of removing the outliers are noticeable. The minimum and maximum values are all closer to the mean and median values. Although the balance minimum and maximum are still noticeably extreme, they now fall within the 1.5 × IQR boundaries. It is plausible that these values are legitimate; the negative values may indicate that some clients have overdrawn their accounts.
# Regenerate histograms for numeric variables:
variables <- c("age", "balance", "day", "duration", "campaign")
histograms <- function(bankdata, variables, binwidth = .5) {
for (var in variables) {
plot_histograms <- ggplot(bankdata, aes(x = !!sym(var))) +
geom_histogram(binwidth = binwidth, fill = "blue", color = "black", alpha = 0.7) +
labs(title = paste("Histogram of", var), x = var, y = "Frequency") +
theme_minimal()
print(plot_histograms) # Print each plot one by one
}
}
# Call the function to generate histograms
histograms(bankdata, variables, binwidth = .5)
After removing the outliers and retaining the data points within the 1.5 × IQR boundaries, a review of the histograms still reveals skewed data. The “age” variable appears moderately right skewed. The “balance” variable appears slightly right skewed and heavily peaked (leptokurtic), but otherwise fairly normal. The “day” variable appears normal. The “campaign” variable appears right skewed. The “duration” variable is right skewed. Since the “day” and “balance” variables appear fairly normal, no further adjustments are needed to normalize them. However, I can attempt to normalize the “age”, “campaign”, and “duration” variables.
# The "age", "campaign" and "duration" variables are all right skewed and have positive values, which lend themselves to a log or square root transformation. I chose to do a log transformation and added one to the log so the zero values will not be undefined.
bankdata$age <- log(bankdata$age + 1)
bankdata$campaign <- log(bankdata$campaign + 1)
bankdata$duration <- log(bankdata$duration + 1)
# Rerun the histograms.
variables <- c("age", "duration", "campaign")
histograms <- function(bankdata, variables, binwidth = .5) {
for (var in variables) {
plot_histograms <- ggplot(bankdata, aes(x = !!sym(var))) +
geom_histogram(binwidth = binwidth, fill = "blue", color = "black", alpha = 0.7) +
labs(title = paste("Histogram of", var), x = var, y = "Frequency") +
theme_minimal()
print(plot_histograms) # Print each plot one by one
}
}
# Call the function to generate histograms
histograms(bankdata, variables, binwidth = .5)
The “age” and “duration” variables appear more normal. The “campaign” variable still appears right skewed, but less skewed.
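To put a number on the remaining skew, the sample skewness can be computed directly; a sketch using a simple hand-rolled estimator so no extra package is needed (output not shown):
# Approximate sample skewness: third standardized moment (values near 0 indicate symmetry).
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
sapply(bankdata[, c("age", "duration", "campaign")], skewness)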
Even after the log transformation to reduce skewness, the continuous variables are not on the same scale. For example, the “duration” variable is measured in seconds, originally ranging from zero to 643, while the “balance” variable is measured in dollars, from -1,944 to 3,462. To overcome the scaling imbalance, I applied the min-max normalization technique discussed in Nwanganga & Chapple (2020). Whether or not a variable required a transformation, all continuous variables must be put on the same scale for comparability.
# Min-max normalization function
min_max_norm <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
# Apply min-max normalization to all continuous variables
bankdata$age <- min_max_norm(bankdata$age)
bankdata$balance <- min_max_norm(bankdata$balance)
bankdata$day <- min_max_norm(bankdata$day)
bankdata$duration <- min_max_norm(bankdata$duration)
bankdata$campaign <- min_max_norm(bankdata$campaign)
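For reference, the same normalization can be applied to every numeric column in one step with dplyr's across(); this sketch is equivalent to the five assignments above and is left commented out so the pipeline runs as written:
# One-step equivalent: min-max normalize every numeric column at once.
# bankdata <- bankdata %>%
#   mutate(across(where(is.numeric), min_max_norm))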
# To verify the new range of continuous variables, I call the is.numeric function for the continuous variables:
bankdata %>%
keep(is.numeric) %>%
summary()
## age balance day duration
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.4188 1st Qu.:0.3681 1st Qu.:0.2333 1st Qu.:0.7166
## Median :0.5647 Median :0.4232 Median :0.4667 Median :0.7959
## Mean :0.5668 Mean :0.4768 Mean :0.4825 Mean :0.7860
## 3rd Qu.:0.7187 3rd Qu.:0.5385 3rd Qu.:0.6667 3rd Qu.:0.8718
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## campaign
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.3237
## Mean :0.2973
## 3rd Qu.:0.5533
## Max. :1.0000
All continuous variables now range between zero and one, which makes the transformed values easier to compare. One potential concern with min-max normalization is that it could leave the data skewed; from the summary above, it does not appear to have adversely affected the variables. To make sure the scaling did not skew the “campaign” data, I created another set of histograms for a visual check, this time (with a little help from ChatGPT) overlaying a normal density curve on each histogram.
# Rerun the histograms.
variables <- c("age", "balance", "day", "duration", "campaign")
# Adjusted function to overlay a normal distribution curve
histograms <- function(bankdata, variables, binwidth = .5) {
for (var in variables) {
# Calculate mean and standard deviation for the variable
mean_val <- mean(bankdata[[var]], na.rm = TRUE)
sd_val <- sd(bankdata[[var]], na.rm = TRUE)
plot_histograms <- ggplot(bankdata, aes(x = !!sym(var))) +
geom_histogram(aes(y = ..density..), binwidth = binwidth, fill = "blue", color = "black", alpha = 0.7) +
stat_function(fun = dnorm, args = list(mean = mean_val, sd = sd_val), color = "red", size = 1) +
labs(title = paste("Histogram of", var), x = var, y = "Density") +
theme_minimal()
print(plot_histograms) # Print each plot one by one
}
}
# Call the function to generate histograms with normal distribution overlay
histograms(bankdata, variables, binwidth = .5)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Below I created vertical bar charts to visualize the distribution of the categorical variables.
bankdata %>%
select(job, marital, education, default, housing, loan, contact, month, poutcome) %>%
gather(key = "variable", value = "category") %>% # Creates a new column named "variable", which stores the original column names, and "value" stores the actual values from these columns in a single column called "category". The gather() function reshapes data from wide format to long format, making it easier to work with ggplot2 for faceted plots.
ggplot(aes(x = category)) +
geom_bar(fill="light blue") +
facet_wrap(~variable, scales = "free") + # the "~variable" tells ggplot2 to create separate facets (subplots) for each unique value in the variable column. scales = "free" allows the x-axis and y-axis to be independently scaled for each facet.
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 8), # Increase text size
strip.text = element_text(size = 10), # Increase facet label size
plot.title = element_text(size = 10, face = "bold"), # Increase title size
axis.title = element_text(size = 12) # Increase axis label size
) +
ggtitle("Categorical Variable Distributions") +
xlab("Categories") +
ylab("Count")
## Warning: attributes are not identical across measure variables; they will be
## dropped
The categorical variables give some key insights. One thing I noticed is that far more client contacts occur in the warmer months than in the colder months (for the northern hemisphere). Therefore, I created a new variable, “season”, with two levels, “warmmonths” and “coldmonths”, to see the potential effect. I labeled the warm months as April, May, June, July, August, September, and October, and the cold months as November, December, January, February, and March.
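The monthly bars reflect how often clients were contacted; whether the subscription rate itself differs by month could be checked with a sketch like the following (not run here, so no output is claimed):
# Share of "yes" responses by month, sorted from highest to lowest rate.
# bankdata %>%
#   group_by(month) %>%
#   summarise(contacts = n(), subscription_rate = mean(y == "yes")) %>%
#   arrange(desc(subscription_rate))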
# I checked the unique values in the month variable to make sure I have the correct names.
unique(bankdata$month)
## [1] may jun jul aug oct nov dec jan feb mar apr sep
## Levels: apr aug dec feb jan jul jun mar may nov oct sep
# With the correct names verified, I categorized the "month" variable into two unique values: "warmmonths" and "coldmonths". Since the "month" variable was converted to a factor earlier in the process, I had to convert it back to a character vector to work with the mutate function. It also made sense to rename the "month" variable to "season":
bankdata <- bankdata %>%
mutate(month = as.character(month)) %>%
mutate(season = ifelse(month %in% c("apr", "may", "jun", "jul", "aug", "sep", "oct"),
"warmmonths", "coldmonths"))
# To make sure that the "month" variable collapsed to warmmonths and coldmonths, and get their counts, I created a table:
table(bankdata$season)
##
## coldmonths warmmonths
## 6583 28136
Since the “month” information is now captured by “season”, the “month” variable is dropped from further analysis.
bankdata <- subset(bankdata, select = -c(month))
# Since the "season" variable is a character vector, I need to change it a factor to create levels, 0 and 1:
bankdata <- bankdata %>%
mutate(season = as.factor(ifelse(season %in% c("warmmonths"), 1, 0))) # Telling R to look at warmmonths and give it a code of 1, if not warmmonths, give it a zero code.
# Check the structure of the data
str(bankdata$season)
## Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
I created a bar chart of the new season variable.
bankdata %>%
select(season) %>%
gather(key = "variable", value = "category") %>% # Creates a new column named "variable", which stores the original column names, and "value" stores the actual values from these columns in a single column called "category". The gather() function reshapes data from wide format to long format, making it easier to work with ggplot2 for faceted plots.
ggplot(aes(x = category)) +
geom_bar(fill="light blue") +
facet_wrap(~variable, scales = "free") + # the "~variable" tells ggplot2 to create separate facets (subplots) for each unique value in the variable column. scales = "free" allows the x-axis and y-axis to be independently scaled for each facet.
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 8), # Increase text size
strip.text = element_text(size = 10), # Increase facet label size
plot.title = element_text(size = 10, face = "bold"), # Increase title size
axis.title = element_text(size = 12) # Increase axis label size
) +
ggtitle("Categorical Variable Distribution of Season: Cold Months (left) versus Warm Months (right)") +
xlab("Categories") +
ylab("Count")
The job variable has interesting categories. To try to tell a better story and possibly create a new variable, I did a frequency count of all the levels of the job variable.
# Count occurrences of each job category
job_counts <- bankdata %>%
group_by(job) %>%
summarise(count = n()) %>%
arrange(desc(count))
# Print the job count table
print(job_counts)
## # A tibble: 12 × 2
## job count
## <fct> <int>
## 1 blue-collar 7705
## 2 management 7003
## 3 technician 5890
## 4 admin. 4187
## 5 services 3332
## 6 retired 1398
## 7 self-employed 1180
## 8 entrepreneur 1147
## 9 unemployed 984
## 10 housemaid 933
## 11 student 762
## 12 unknown 198
The frequency count reveals that the highest-count categories are jobs with more stable income. Therefore, I created a new variable, “jobstability”, with two levels, “stablejobs” and “unstablejobs”, to see the potential effect. I labeled blue-collar, management, technician, admin., and services as stable jobs, and the remaining categories (retired, self-employed, entrepreneur, unemployed, housemaid, student, and unknown) as unstable jobs.
# I checked the unique values in the job variable to make sure I have the correct names.
unique(bankdata$job)
## [1] management technician entrepreneur blue-collar unknown
## [6] retired admin. services self-employed unemployed
## [11] housemaid student
## 12 Levels: admin. blue-collar entrepreneur housemaid management ... unknown
# With the correct names, I categorized the "job" variable into two unique values: stablejobs and unstablejobs. Since the "job" variable was converted to a factor earlier in the process, I had to convert it back to a character vector to work with the mutate function. It also made sense to rename the "job" variable to "jobstability":
bankdata <- bankdata %>%
mutate(job = as.character(job)) %>%# Categorize jobs into stablejobs and unstablejobs
mutate(jobstability = ifelse(job %in% c("blue-collar", "management", "technician", "admin.", "services"),
"stablejobs", "unstablejobs"))
# To make sure that the "month" variable collapsed to stablejobs and unstablejobs, and get their counts, I created a table:
table(bankdata$jobstability)
##
## stablejobs unstablejobs
## 28117 6602
Since the “job” information is now captured by “jobstability”, the “job” variable is dropped from further analysis.
# Drop "job" varible from further analysis:
bankdata <- subset(bankdata, select = -c(job))
# Since the "season" variable is a character vector, I need to change it a factor to create levels, 0 and 1:
bankdata <- bankdata %>%
mutate(jobstability = factor(ifelse(jobstability == "stablejobs", 1, 0), levels = c(0, 1))) # Code stablejobs as 1 and everything else (unstablejobs) as 0.
# Check the structure of the data
str(bankdata$jobstability)
## Factor w/ 2 levels "0","1": 2 2 1 2 1 2 2 1 1 2 ...
glimpse(bankdata)
## Rows: 34,719
## Columns: 15
## $ age <dbl> 0.8595534, 0.6540713, 0.4414379, 0.7030293, 0.4414379, 0.…
## $ marital <fct> married, single, married, married, single, married, singl…
## $ education <fct> tertiary, secondary, secondary, unknown, unknown, tertiar…
## $ default <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no, …
## $ balance <dbl> 0.7560118, 0.3649649, 0.3599704, 0.6381798, 0.3597854, 0.…
## $ housing <fct> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes…
## $ loan <fct> no, no, yes, no, no, no, yes, no, no, no, no, no, no, no,…
## $ contact <fct> unknown, unknown, unknown, unknown, unknown, unknown, unk…
## $ day <dbl> 0.1333333, 0.1333333, 0.1333333, 0.1333333, 0.1333333, 0.…
## $ duration <dbl> 0.8609468, 0.7767648, 0.6716153, 0.7008056, 0.8184217, 0.…
## $ campaign <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ poutcome <fct> unknown, unknown, unknown, unknown, unknown, unknown, unk…
## $ y <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, n…
## $ season <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ jobstability <fct> 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, …
bankdata %>%
select(jobstability) %>%
gather(key = "variable", value = "category") %>% # Creates a new column named "variable", which stores the original column names, and "value" stores the actual values from these columns in a single column called "category". The gather() function reshapes data from wide format to long format, making it easier to work with ggplot2 for faceted plots.
ggplot(aes(x = category)) +
geom_bar(fill="gold") +
facet_wrap(~variable, scales = "free") + # the "~variable" tells ggplot2 to create separate facets (subplots) for each unique value in the variable column. scales = "free" allows the x-axis and y-axis to be independently scaled for each facet.
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 8), # Increase text size
strip.text = element_text(size = 10), # Increase facet label size
plot.title = element_text(size = 10, face = "bold"), # Increase title size
axis.title = element_text(size = 12) # Increase axis label size
) +
ggtitle("Categorical Variable Distribution of Job Stability: Unstable Jobs (left) versus Stable Jobs (right)") +
xlab("Categories") +
ylab("Count")
Now that the data has been processed, the next step is to split the data with the goal of seeing how well a model generalizes to new data. I used the set.seed() function to reproduce results. I split the data 75% into a training set to fit a logistic regression model, and 25% into a testing set to evaluate the model’s performance.
set.seed(1234)
sample_set <- sample(nrow(bankdata), round(nrow(bankdata)*.75), replace = FALSE)
bankdata_train <- bankdata[sample_set, ]
bankdata_test <- bankdata[-sample_set, ]
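An alternative worth noting is a stratified split with caret's createDataPartition(), which preserves the class proportions of "y" in both partitions by construction; a sketch (not used here, since the random split above is checked for proportionality below):
# Stratified 75/25 split that preserves the yes/no proportions of y.
# library(caret)
# set.seed(1234)
# train_idx <- createDataPartition(bankdata$y, p = 0.75, list = FALSE)
# bankdata_train <- bankdata[train_idx, ]
# bankdata_test <- bankdata[-train_idx, ]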
I ran a check on the proportionality of the split data. I needed to make sure that the training and testing sets are proportional to the original data set with respect to the dependent variable, “y”.
# Note that the "y" variable of the bankdata is isolated in the following code:
round(prop.table(table(select(bankdata, y), exclude = NULL)), 4) * 100
## y
## no yes
## 91.64 8.36
round(prop.table(table(select(bankdata_train, y), exclude = NULL)), 4) * 100
## y
## no yes
## 91.66 8.34
round(prop.table(table(select(bankdata_test, y), exclude = NULL)), 4) * 100
## y
## no yes
## 91.6 8.4
Across all data sets, the “y” dependent variable is highly imbalanced. Since the “no” outcome is so prevalent, a model trained on these data may simply favor the majority class, making predictions of the “yes” outcome unreliable. Synthetically balancing the training set allows the model to learn the patterns of both classes. Note that the test set is intentionally left unbalanced: balancing it would no longer reflect the real-world class distribution and would risk an overly optimistic view of the model’s ability to generalize to other data sets.
Per Nwanganga & Chapple (2020), there are several approaches to solving class imbalance problems; one of them is by using a synthetic minority oversampling technique (SMOTE). This technique works by creating new synthetic samples from the minority class to resolve the imbalance.
# The DMwR package has been removed from CRAN, so it cannot be installed the usual way. I downloaded an archived version from https://cran.r-project.org/src/contrib/Archive/DMwR/ and installed DMwR_0.4.1.tar.gz, which contains the SMOTE function.
# When I first ran the code provided by Nwanganga & Chapple (2020), it did not work. AI tools suggested that SMOTE might conflict with same-named functions from other packages, so I namespaced the call with DMwR::SMOTE. The following code worked.
# bankdata_train <- DMwR::SMOTE(y ~ ., bankdata_train, perc.over = 100, perc.under = 200)
# However, when I restarted R and ran all chunks, the code below functioned, with warnings.
library(DMwR)
## Loading required package: lattice
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
set.seed(1234)
bankdata_train <- SMOTE(y ~ ., data = bankdata_train, perc.over = 100, perc.under = 200)
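Because DMwR is archived, a maintained alternative worth noting is the ROSE package on CRAN. Its ovun.sample() function balances classes by random over- and under-sampling rather than by generating synthetic SMOTE examples, so results would differ; a sketch (not run):
# Alternative balancing via ROSE: combined random over- and under-sampling to a 50/50 split.
# library(ROSE)
# set.seed(1234)
# bankdata_train <- ovun.sample(y ~ ., data = bankdata_train, method = "both", p = 0.5)$data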
With SMOTE applied, the bankdata_train set should now be evenly balanced, so I reran the proportions to confirm.
round(prop.table(table(select(bankdata, y), exclude = NULL)), 4) * 100
## y
## no yes
## 91.64 8.36
round(prop.table(table(select(bankdata_train, y), exclude = NULL)), 4) * 100
## y
## no yes
## 50 50
round(prop.table(table(select(bankdata_test, y), exclude = NULL)), 4) * 100
## y
## no yes
## 91.6 8.4
I changed the dependent variable from yes/no to 0/1 so the algorithm can work better with numerical values.
bankdata <- bankdata %>%
mutate(y = as.factor(ifelse(y=="yes", 1, 0)))
bankdata_train <- bankdata_train %>%
mutate(y = as.factor(ifelse(y=="yes", 1, 0)))
bankdata_test <- bankdata_test %>%
mutate(y = as.factor(ifelse(y=="yes", 1, 0)))
Although I already tested the proportions in the test data, for my own understanding, I wanted to report the raw totals.
bankdata_test %>% count(y)
## y n
## 1 0 7951
## 2 1 729
In this section, I am building a binomial logistic regression model using the glm() function from the stats package.
bankdata_mod1 <- glm(data=bankdata_train, family=binomial, formula=y ~ .)
summary(bankdata_mod1)
##
## Call:
## glm(formula = y ~ ., family = binomial, data = bankdata_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.86434 0.35286 -22.288 < 2e-16 ***
## age -0.49885 0.17099 -2.917 0.003529 **
## maritalmarried -0.49858 0.08951 -5.570 2.55e-08 ***
## maritalsingle -0.03246 0.10194 -0.318 0.750194
## educationsecondary 0.18013 0.09260 1.945 0.051736 .
## educationtertiary 0.69338 0.09820 7.061 1.66e-12 ***
## educationunknown 0.57477 0.15905 3.614 0.000302 ***
## defaultyes 0.16363 0.19678 0.832 0.405660
## balance 1.21005 0.18785 6.442 1.18e-10 ***
## housingyes -0.60979 0.05941 -10.264 < 2e-16 ***
## loanyes 0.09928 0.07353 1.350 0.176945
## contacttelephone 0.67555 0.10960 6.164 7.10e-10 ***
## contactunknown -0.87541 0.08027 -10.905 < 2e-16 ***
## day -0.37768 0.10456 -3.612 0.000304 ***
## duration 11.39891 0.33946 33.580 < 2e-16 ***
## campaign -1.01199 0.10514 -9.625 < 2e-16 ***
## poutcomeother 0.36701 0.13232 2.774 0.005542 **
## poutcomesuccess 2.43608 0.16492 14.771 < 2e-16 ***
## poutcomeunknown -0.69101 0.08138 -8.491 < 2e-16 ***
## season1 -0.70347 0.06569 -10.709 < 2e-16 ***
## jobstability1 -0.59018 0.06746 -8.748 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12044 on 8687 degrees of freedom
## Residual deviance: 7603 on 8667 degrees of freedom
## AIC: 7645
##
## Number of Fisher Scoring iterations: 6
I noted that all the variables were significant except for “default” and “loan”. Before dealing with the insignificant variables, I first tested for multicollinearity.
# Test for Dealing with Multicollinearity
library(stats)
library(corrplot)
## corrplot 0.95 loaded
# Compute the table of correlation coefficients for the numeric variables using the cor() function.
bankdatanumeric <- bankdata %>%
select(where(is.numeric))
bankdata_correlations <- cor(bankdatanumeric)
# Finally, we visualize these correlations using the corrplot.mixed function.
# Generate correlation plot with same colors but more pronounced
corrplot.mixed(bankdata_correlations,
lower = "number", # Lower triangle: Numbers
upper = "circle", # Upper triangle: Circles
number.cex = 1.5, # Increase number size
tl.col = "black", # Make text labels black
col.lim = c(-1, 1)) # Ensure full contrast
No multicollinearity is present among the continuous variables. Just to verify, I also tested the variance inflation factor (VIF).
# Second approach to identify multicollinearity:
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
vif(bankdata_mod1)
## GVIF Df GVIF^(1/(2*Df))
## age 1.325316 1 1.151224
## marital 1.286949 2 1.065100
## education 1.080888 3 1.013048
## default 1.027298 1 1.013557
## balance 1.047139 1 1.023298
## housing 1.093070 1 1.045500
## loan 1.016340 1 1.008137
## contact 1.117826 2 1.028238
## day 1.004976 1 1.002485
## duration 1.074582 1 1.036620
## campaign 1.025593 1 1.012716
## poutcome 1.084021 3 1.013537
## season 1.063876 1 1.031444
## jobstability 1.038080 1 1.018862
No VIF (GVIF) is greater than five, so there is no problematic multicollinearity among the continuous or categorical variables.
I dropped the “default” and “loan” variables and reran the model.
# Create a new model "bankdata_mod2" with all of the variables from bankdata_train except for the "default" and "loan" variables.
bankdata_mod2 <- glm(
y ~ . - default - loan, # Use all variables except "default" and "loan".
data = bankdata_train,
family = binomial
)
summary(bankdata_mod2)
##
## Call:
## glm(formula = y ~ . - default - loan, family = binomial, data = bankdata_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.83639 0.35215 -22.253 < 2e-16 ***
## age -0.49203 0.17090 -2.879 0.003989 **
## maritalmarried -0.50038 0.08948 -5.592 2.24e-08 ***
## maritalsingle -0.03677 0.10183 -0.361 0.718040
## educationsecondary 0.18312 0.09256 1.978 0.047882 *
## educationtertiary 0.69432 0.09818 7.072 1.53e-12 ***
## educationunknown 0.56990 0.15895 3.585 0.000337 ***
## balance 1.17573 0.18609 6.318 2.65e-10 ***
## housingyes -0.60606 0.05932 -10.216 < 2e-16 ***
## contacttelephone 0.67131 0.10953 6.129 8.83e-10 ***
## contactunknown -0.87350 0.08020 -10.891 < 2e-16 ***
## day -0.37780 0.10453 -3.614 0.000301 ***
## duration 11.40570 0.33900 33.645 < 2e-16 ***
## campaign -1.00649 0.10501 -9.585 < 2e-16 ***
## poutcomeother 0.36916 0.13214 2.794 0.005211 **
## poutcomesuccess 2.43145 0.16487 14.748 < 2e-16 ***
## poutcomeunknown -0.69073 0.08133 -8.493 < 2e-16 ***
## season1 -0.70696 0.06567 -10.766 < 2e-16 ***
## jobstability1 -0.58824 0.06743 -8.724 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12044.1 on 8687 degrees of freedom
## Residual deviance: 7605.6 on 8669 degrees of freedom
## AIC: 7643.6
##
## Number of Fisher Scoring iterations: 6
The AIC for model one was 7645; for model two it is 7643.6, a marginal improvement but not a meaningful difference. The main benefit of model two is reduced complexity, so for learning purposes I dropped “default” and “loan” and advanced with model two.
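The two AIC values can also be pulled side by side from the fitted objects (output not shown):
# Compare both models' degrees of freedom and AIC in one call.
AIC(bankdata_mod1, bankdata_mod2)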
The coefficients in the output above report the change in the log-odds of the dependent variable “y” being “1” (yes, a subscription), since glm() models the probability of the second factor level. Taking the example from Nwanganga & Chapple (2020) and modifying it for this data set, a value of 1.17573 for the coefficient of “balance” means that, for every one-unit increase in “balance”, the log-odds of “y” being “yes” increase by 1.17573. Since log-odds are hard to interpret directly, I exponentiated the coefficients into odds ratios using the exp() and coef() functions below.
exp(coef(bankdata_mod2)["age"])
## age
## 0.6113824
exp(coef(bankdata_mod2)["balance"])
## balance
## 3.240513
exp(coef(bankdata_mod2)["day"])
## day
## 0.6853662
exp(coef(bankdata_mod2)["duration"])
## duration
## 89832.47
exp(coef(bankdata_mod2)["campaign"])
## campaign
## 0.3654987
exp(coef(bankdata_mod2)["jobstability1"])
## jobstability1
## 0.555302
exp(coef(bankdata_mod2)["maritalmarried"])
## maritalmarried
## 0.606302
exp(coef(bankdata_mod2)["maritalsingle"])
## maritalsingle
## 0.9638975
exp(coef(bankdata_mod2)["educationsecondary"])
## educationsecondary
## 1.200955
exp(coef(bankdata_mod2)["educationtertiary"])
## educationtertiary
## 2.002339
exp(coef(bankdata_mod2)["educationunknown"])
## educationunknown
## 1.768089
exp(coef(bankdata_mod2)["housingyes"])
## housingyes
## 0.5454937
exp(coef(bankdata_mod2)["contacttelephone"])
## contacttelephone
## 1.956794
exp(coef(bankdata_mod2)["contactunknown"])
## contactunknown
## 0.4174861
exp(coef(bankdata_mod2)["season1"])
## season1
## 0.4931426
exp(coef(bankdata_mod2)["poutcomeother"])
## poutcomeother
## 1.446517
exp(coef(bankdata_mod2)["poutcomesuccess"])
## poutcomesuccess
## 11.37537
exp(coef(bankdata_mod2)["poutcomeunknown"])
## poutcomeunknown
## 0.5012081
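The same odds ratios can be produced in a single call, and 95% profile-likelihood confidence intervals can be attached with confint(); a sketch (output not shown; the profiling step can take a few seconds):
# All odds ratios at once, with 95% confidence intervals on the odds scale.
exp(cbind(odds_ratio = coef(bankdata_mod2), confint(bankdata_mod2)))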
To interpret the outcome, holding all other predictors constant, a one-unit increase in balance multiplies the odds of a client responding “yes” by 3.24. Since balance was min-max normalized, a one-unit increase corresponds to moving across its full observed range; the practical reading is that clients with higher average yearly balances have substantially higher odds of subscribing.
Model two has a null deviance of 12,044, which measures how well a model with no predictors (an intercept only) fits the data; this high value indicates a poor fit. The residual deviance of 7,606 measures how well the model with predictors fits the data; the lower value indicates a better fit. The difference of roughly 4,438 suggests that the predictors explain a substantial portion of the variation. To see whether the difference is statistically significant, I ran a Chi-square test.
# I assigned values to the Null and Residual deviances and calculated the difference between them:
null_deviance <- 12044.1
residual_deviance <- 7605.6
deviance_diff <- null_deviance - residual_deviance
# The Chi-Square test requires computing the difference in the degrees of freedom from the Null to the Residual deviance.
degrees_freedom_null <- 8687
degrees_freedom_residual <- 8669
degrees_freedom_differential <- degrees_freedom_null - degrees_freedom_residual # Degrees of freedom lost
# Chi-square test
p_value <- pchisq(deviance_diff, df=degrees_freedom_differential, lower.tail=FALSE)
# Print results
cat("Chi-square statistic:", deviance_diff, "\n")
## Chi-square statistic: 4438.5
cat("Degrees of freedom:", degrees_freedom_differential, "\n")
## Degrees of freedom: 18
cat("p-value:", p_value, "\n")
## p-value: 0
The Chi-square test is significant, therefore, the predictors significantly improve the model.
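The same test can also be driven directly from values stored in the fitted model object, which avoids typing the deviances and degrees of freedom by hand; a sketch (output not shown):
# Likelihood-ratio test of model two against the null model, using components of the glm fit.
with(bankdata_mod2, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))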
The next step is to assess the model’s performance in predicting the response for out-of-sample observations. This involves using the model to predict “y” in the test data (bankdata_test) with the predict() function from the stats package. Taking the example from Nwanganga & Chapple (2020) and modifying it for this data set, I get:
bankdata_pred1 <- predict(bankdata_mod2, bankdata_test, type = 'response')
head(bankdata_pred1)
## 5 8 9 11 12 17
## 0.47892670 0.62918425 0.01332772 0.19027464 0.10159381 0.06785555
Using the testing data, this output shows six clients and their predicted probability of answering “yes” to subscribing to a term deposit. For example, person 5 has a 47.9% chance and person 9 has a 1.3% chance of subscribing.
Treating predicted probabilities below 0.5 as “no” (0) and 0.5 or above as “yes” (1), we can quickly assess which clients are likely to subscribe to a term deposit.
bankdata_pred1 <- ifelse(bankdata_pred1>= 0.5, 1, 0)
head(bankdata_pred1)
## 5 8 9 11 12 17
## 0 1 0 0 0 0
The outcome is that person 8 is likely to subscribe to a term deposit and the others are not.
To assess how well model two actually performed, I compared its predicted values for “y” with the actual values in the test data set. Earlier, I recoded the dependent variable from yes/no to 0/1 in both the training and test data, so both are factors with levels 0 and 1. To double-check, I examined the structure of, and tabulated, the dependent variable in the test data.
str(bankdata_test$y) # View structure of the y column
## Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
table(bankdata_test$y) # Check class distribution
##
## 0 1
## 7951 729
I created the confusion matrix with the caret package, which gives a more comprehensive output.
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
bankdata_pred1 <- as.factor(bankdata_pred1)
conf_matrix <- confusionMatrix(bankdata_pred1, bankdata_test$y)
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6272 155
## 1 1679 574
##
## Accuracy : 0.7887
## 95% CI : (0.78, 0.7973)
## No Information Rate : 0.916
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2956
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7888
## Specificity : 0.7874
## Pos Pred Value : 0.9759
## Neg Pred Value : 0.2548
## Prevalence : 0.9160
## Detection Rate : 0.7226
## Detection Prevalence : 0.7404
## Balanced Accuracy : 0.7881
##
## 'Positive' Class : 0
##
The output shows that model two correctly classified 78.87% of all test cases, with a 95% confidence interval for accuracy of 78.00% to 79.73%. Note that the reported sensitivity and positive predictive value refer to the “0” (no) class, which caret selected as the positive class by default.
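To view the same statistics from the perspective of the subscribers (“1”), the positive class can be set explicitly; a sketch (output not shown):
# Same confusion matrix, but treating "1" (subscribed) as the positive class.
confusionMatrix(bankdata_pred1, bankdata_test$y, positive = "1")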