# Load the bank-full data.
url <- "C:/Users/zengo/OneDrive/Magic Briefcase/School of Professional Studies/DATA 622 Machine Learning/Data 622 Asgmt 1/bank-full.csv"
bankdata <- read.csv(url)
# The Bank Marketing repository at https://archive.ics.uci.edu/dataset/222/bank+marketing states that there are four datasets. I considered using bank-additional-full.csv, with all 41,188 examples and 20 inputs, ordered by date (from May 2008 to November 2010), which is very close to the data analyzed in [Moro et al., 2014]. However, the website lacks a description for four of its variables, so I used the bank-full data, which has full variable descriptions.
# View the first 6 rows of data.
head(bankdata)
## age.job.marital.education.default.balance.housing.loan.contact.day.month.duration.campaign.pdays.previous.poutcome.y
## 1 58;management;married;tertiary;no;2143;yes;no;unknown;5;may;261;1;-1;0;unknown;no
## 2 44;technician;single;secondary;no;29;yes;no;unknown;5;may;151;1;-1;0;unknown;no
## 3 33;entrepreneur;married;secondary;no;2;yes;yes;unknown;5;may;76;1;-1;0;unknown;no
## 4 47;blue-collar;married;unknown;no;1506;yes;no;unknown;5;may;92;1;-1;0;unknown;no
## 5 33;unknown;single;unknown;no;1;no;no;unknown;5;may;198;1;-1;0;unknown;no
## 6 35;management;married;tertiary;no;231;yes;no;unknown;5;may;139;1;-1;0;unknown;no
Upon inspection, the values in each row are separated by semicolons and need to be split into individual columns. This is accomplished by rereading the file with read.csv() and specifying sep = ";" (the readr function read_delim() would work as well).
# Reload the file to separate the semicolon delineated data into separate columns.
bankdata <- read.csv(url, sep = ";", header = TRUE)
# View the data
head(bankdata)
## age job marital education default balance housing loan contact day
## 1 58 management married tertiary no 2143 yes no unknown 5
## 2 44 technician single secondary no 29 yes no unknown 5
## 3 33 entrepreneur married secondary no 2 yes yes unknown 5
## 4 47 blue-collar married unknown no 1506 yes no unknown 5
## 5 33 unknown single unknown no 1 no no unknown 5
## 6 35 management married tertiary no 231 yes no unknown 5
## month duration campaign pdays previous poutcome y
## 1 may 261 1 -1 0 unknown no
## 2 may 151 1 -1 0 unknown no
## 3 may 76 1 -1 0 unknown no
## 4 may 92 1 -1 0 unknown no
## 5 may 198 1 -1 0 unknown no
## 6 may 139 1 -1 0 unknown no
The data are now separated into their own columns.
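For reference, the readr approach mentioned above might look like the following sketch, assuming the same local path stored in url; read_delim() infers column types and returns a tibble (bankdata_tbl is a hypothetical name, and the chunk is not run here).
library(readr)
# Semicolon-delimited read with readr; equivalent result to read.csv(url, sep = ";").
bankdata_tbl <- read_delim(url, delim = ";", show_col_types = FALSE)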
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# I used the glimpse() command to show the variable names, data types, and some sample data. I loaded the tidyverse library to enable the glimpse command.
glimpse(bankdata)
## Rows: 45,211
## Columns: 17
## $ age <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ job <chr> "management", "technician", "entrepreneur", "blue-collar", "…
## $ marital <chr> "married", "single", "married", "married", "single", "marrie…
## $ education <chr> "tertiary", "secondary", "secondary", "unknown", "unknown", …
## $ default <chr> "no", "no", "no", "no", "no", "no", "no", "yes", "no", "no",…
## $ balance <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ housing <chr> "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes"…
## $ loan <chr> "no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no"…
## $ contact <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ day <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ month <chr> "may", "may", "may", "may", "may", "may", "may", "may", "may…
## $ duration <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ poutcome <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ y <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
The data contain 45,211 rows and 17 columns: one dependent variable (y) and sixteen independent variables (features). Seven columns are continuous (integer) variables and ten are categorical (character) variables.
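A quick way to confirm that column-type split programmatically (output not shown):
# Tally the classes of all 17 columns; expect 7 integer and 10 character.
table(sapply(bankdata, class))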
# To clearly see the explanations of all the variables, I inserted an interactive table. I copied and pasted the Variables Table from the Bank Marketing homepage into an Excel CSV file, then loaded it into this report as follows:
names_url <- "C:/Users/zengo/OneDrive/Magic Briefcase/School of Professional Studies/DATA 622 Machine Learning/Data 622 Asgmt 1/bank-names.csv"
bank_names <- read.csv(names_url, stringsAsFactors = FALSE) # I used the stringsAsFactors = FALSE to make sure that the narrative remains as character data instead of being converted into factor variables.
# Load the DT library to make an interactive data table:
library(DT)
# I ran the following command to display the interactive data table:
datatable(bank_names, options = list(pageLength = 20, scrollX = TRUE)) # The options control how many rows to display per page, and scrollX adds a horizontal scroll bar. A nice feature of the DT library is the built-in search box for larger data tables.
Upon inspection, some of the variables should be stored as a different vector type. The character variables (job, marital, education, default, housing, loan, contact, month, poutcome, and y) are converted to factors so that their specific levels are represented explicitly.
# From the tidyverse library I used the mutate_if() command to identify only <chr> vectors with the is.character function and convert them to factors with as.factor() in the bankdata object. Doing this retains the integer variables as <int>.
bankdata <- bankdata %>%
mutate_if(is.character, as.factor)
# I ran the glimpse() function again to verify the <chr> to <fct> changes, while retaining the <int> atomic vectors.
glimpse(bankdata)
## Rows: 45,211
## Columns: 17
## $ age <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ job <fct> management, technician, entrepreneur, blue-collar, unknown, …
## $ marital <fct> married, single, married, married, single, married, single, …
## $ education <fct> tertiary, secondary, secondary, unknown, unknown, tertiary, …
## $ default <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no, no,…
## $ balance <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ housing <fct> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, y…
## $ loan <fct> no, no, yes, no, no, no, yes, no, no, no, no, no, no, no, no…
## $ contact <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
## $ day <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ month <fct> may, may, may, may, may, may, may, may, may, may, may, may, …
## $ duration <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ poutcome <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
## $ y <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
# To get a statistical summary of all the variables in bankdata, I ran the summary() function:
summary(bankdata)
## age job marital education
## Min. :18.00 blue-collar:9732 divorced: 5207 primary : 6851
## 1st Qu.:33.00 management :9458 married :27214 secondary:23202
## Median :39.00 technician :7597 single :12790 tertiary :13301
## Mean :40.94 admin. :5171 unknown : 1857
## 3rd Qu.:48.00 services :4154
## Max. :95.00 retired :2264
## (Other) :6835
## default balance housing loan contact
## no :44396 Min. : -8019 no :20081 no :37967 cellular :29285
## yes: 815 1st Qu.: 72 yes:25130 yes: 7244 telephone: 2906
## Median : 448 unknown :13020
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
##
## day month duration campaign
## Min. : 1.00 may :13766 Min. : 0.0 Min. : 1.000
## 1st Qu.: 8.00 jul : 6895 1st Qu.: 103.0 1st Qu.: 1.000
## Median :16.00 aug : 6247 Median : 180.0 Median : 2.000
## Mean :15.81 jun : 5341 Mean : 258.2 Mean : 2.764
## 3rd Qu.:21.00 nov : 3970 3rd Qu.: 319.0 3rd Qu.: 3.000
## Max. :31.00 apr : 2932 Max. :4918.0 Max. :63.000
## (Other): 6060
## pdays previous poutcome y
## Min. : -1.0 Min. : 0.0000 failure: 4901 no :39922
## 1st Qu.: -1.0 1st Qu.: 0.0000 other : 1840 yes: 5289
## Median : -1.0 Median : 0.0000 success: 1511
## Mean : 40.2 Mean : 0.5803 unknown:36959
## 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :871.0 Max. :275.0000
##
Upon inspection of the variables, the dependent variable, y, is categorical, which points to the use of a classification algorithm. The response is binary, which is well suited for binomial logistic regression.
# Following the Nwanganga & Chapple (2020) Binomial Logistic Regression Model analysis, I separated the bankdata by categorical and continuous variables to parse the analysis. Since all the <chr> vectors were converted to <fct> vectors, I call just the is.factor function for the categorical variables:
bankdata %>%
keep(is.factor) %>%
summary()
## job marital education default housing
## blue-collar:9732 divorced: 5207 primary : 6851 no :44396 no :20081
## management :9458 married :27214 secondary:23202 yes: 815 yes:25130
## technician :7597 single :12790 tertiary :13301
## admin. :5171 unknown : 1857
## services :4154
## retired :2264
## (Other) :6835
## loan contact month poutcome y
## no :37967 cellular :29285 may :13766 failure: 4901 no :39922
## yes: 7244 telephone: 2906 jul : 6895 other : 1840 yes: 5289
## unknown :13020 aug : 6247 success: 1511
## jun : 5341 unknown:36959
## nov : 3970
## apr : 2932
## (Other): 6060
Note that there is no missing categorical data. The Bank Marketing webpage reported that there was no missing data, and that is now verified for the categorical variables: none of them report missing values (NAs).
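The no-missing-data claim can also be checked directly. Note that "unknown" is a legitimate category in several columns, not an NA, so this sketch counts it separately (output not shown):
# Count NA values per column (expected to be all zero).
colSums(is.na(bankdata))
# "unknown" is a valid level in columns such as job, education, contact, and poutcome, not a missing value.
sapply(bankdata, function(x) sum(x == "unknown"))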
# For the continuous variables, I call the is.numeric function for the continuous variables:
bankdata %>%
keep(is.numeric) %>%
summary()
## age balance day duration
## Min. :18.00 Min. : -8019 Min. : 1.00 Min. : 0.0
## 1st Qu.:33.00 1st Qu.: 72 1st Qu.: 8.00 1st Qu.: 103.0
## Median :39.00 Median : 448 Median :16.00 Median : 180.0
## Mean :40.94 Mean : 1362 Mean :15.81 Mean : 258.2
## 3rd Qu.:48.00 3rd Qu.: 1428 3rd Qu.:21.00 3rd Qu.: 319.0
## Max. :95.00 Max. :102127 Max. :31.00 Max. :4918.0
## campaign pdays previous
## Min. : 1.000 Min. : -1.0 Min. : 0.0000
## 1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.: 0.0000
## Median : 2.000 Median : -1.0 Median : 0.0000
## Mean : 2.764 Mean : 40.2 Mean : 0.5803
## 3rd Qu.: 3.000 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :63.000 Max. :871.0 Max. :275.0000
Note that there is no missing continuous data either: the Bank Marketing webpage reported no missing data, and that is now verified for the continuous variables as well. I then reviewed the data for outliers by visually comparing minimum and maximum values to the mean and median of each continuous variable. Balance, duration, campaign, pdays, and previous all contain minimum and/or maximum figures that are far afield from their mean and median values.
One variable sticks out. According to the Variable Names legend, “pdays” is a numeric variable representing the number of days that passed after the client was last contacted in a previous campaign; a value of -1 means the client was not previously contacted. Logically, the -1 values should be removed, since those clients were never contacted, and they cannot simply be set to zero, as that would mean a client was recontacted the same day. The -1 values therefore need further analysis to decide whether pdays should remain in the analysis.
# To determine the appropriateness of the pdays variable, I computed the proportion of -1 values relative to the overall count of pdays data. I created a new object, "neg1_proportion": using the summarise() function, I counted the -1 values as neg1, counted the total number of rows in bankdata with n() (there is no missing data), and computed the proportion of -1 values to the total row count.
neg1_proportion <- bankdata %>%
summarise(
neg1 = sum(pdays == -1),
total_count = n(),
proportion = neg1 / total_count
)
print(neg1_proportion)
## neg1 total_count proportion
## 1 36954 45211 0.8173675
The outcome is telling. Approximately 8 out of 10 clients were never contacted in a previous campaign, so pdays carries a meaningful value for only about 18% of clients. There is a choice to be made. I could drop the variable, because pdays is meant to measure how much time passed since the previous campaign and roughly 82% of its values are uninformative. The alternative is to convert the continuous variable into a categorical variable measuring whether a client was or was not previously contacted, but that would introduce its own problem: I would lose the effect of elapsed time for those who were previously contacted. For simplicity, I dropped the pdays variable.
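For reference, the categorical alternative described above could be sketched as follows; prev_contacted is a hypothetical variable name, and the code is left commented out because I chose to drop pdays instead:
# Alternative (not used): replace pdays with a previously-contacted yes/no flag.
# bankdata <- bankdata %>%
#   mutate(prev_contacted = factor(ifelse(pdays == -1, "no", "yes"))) %>%
#   select(-pdays)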
# Drop the pdays variable due to unsuitable data.
bankdata <- bankdata %>%
select(-pdays)
The “previous” variable also sticks out. The mean is very low and the median is zero. It has a maximum value of 275 but a mean of 0.5803, which indicates there must be a large number of zero values. I computed the proportion of zero values in the data.
# To determine the appropriateness of the "previous" variable, I determined the proportion of 0 values to the overall count of "previous" data. I created a new object, "previous0_proportion" and took the bankdata and with the summarise function, piped the values with 0 as an object labeled "previous0". Then, I called the total count of all the rows in the bankdata file (since there is no missing data) and created a proportion formula of the 0 values to the overall number of bankdata rows.
previous0_proportion <- bankdata %>%
summarise(
previous0 = sum(previous == 0),
total_count = n(),
proportion = previous0 / total_count
)
print(previous0_proportion)
## previous0 total_count proportion
## 1 36954 45211 0.8173675
About 82% of the values are zero, the same proportion as the -1 values in pdays, which makes sense: a client who was never previously contacted has no prior contact count. The “previous” variable measures the number of contacts performed before this campaign for each client. Since it is zero for the large majority of clients, there is little value in continuing to analyze it, and for simplicity I dropped the “previous” variable as well.
# Drop the previous variable due to unsuitable data.
bankdata <- bankdata %>%
select(-previous)
Although a visual inspection of the summary statistics indicates outlier activity, it is better to visualize the distributions with histograms.
# Load required libraries
library(ggplot2)
library(dplyr)
library(rlang)
##
## Attaching package: 'rlang'
## The following objects are masked from 'package:purrr':
##
## %@%, flatten, flatten_chr, flatten_dbl, flatten_int, flatten_lgl,
## flatten_raw, invoke, splice
# To avoid repeating commands to generate histograms for the numeric variables, I created a list of variables of interest to plot. Then I created a function called "histograms", which takes the data set, the variables to plot, and a binwidth (set to 0.5). A for loop iterates over each variable and creates a histogram with ggplot(). I can call the function with histograms(bankdata, variables, binwidth = .5). Because the variable names are stored as strings inside the function, the rlang package (sym() and !!) is used to reference column names dynamically in ggplot2 (R Documentation).
# Define numeric columns to plot
variables <- c("age", "balance", "day", "duration", "campaign")
histograms <- function(bankdata, variables, binwidth = .5) {
for (var in variables) {
plot_histograms <- ggplot(bankdata, aes(x = !!sym(var))) +
geom_histogram(binwidth = binwidth, fill = "blue", color = "black", alpha = 0.7) +
labs(title = paste("Histogram of", var), x = var, y = "Frequency") +
theme_minimal()
print(plot_histograms) # Print each plot one by one
}
}
# Call the function to generate histograms
histograms(bankdata, variables, binwidth = .5)
The “day” variable histogram appears somewhat normally distributed. However, all other variables appear right-skewed. Per Nwanganga & Chapple (2020), “there are several approaches to dealing with outlier data. One approach is to use a simple rule of thumb based on the statistical properties of the data. The principle behind the rule is that any value that is larger or less than 1.5 times the interquartile range (IQR) is labeled as an outlier and should be removed from the data.”
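Before applying that rule below, a boxplot gives a complementary view, since ggplot2 draws points beyond 1.5 times the IQR as outliers by default; a minimal sketch for the "balance" variable (plot not shown):
# Boxplot of balance; points beyond 1.5 * IQR from the box are flagged as outliers.
ggplot(bankdata, aes(y = balance)) +
  geom_boxplot(outlier.colour = "red", outlier.alpha = 0.3) +
  theme_minimal()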
# Using the mutate() function, I created boundary values for the remaining continuous variables and removed any value more than 1.5 times the interquartile range beyond the quartiles. For each variable, I found the upper boundary with the quantile() function at the third quartile (Q3) plus 1.5 times the interquartile range from IQR() (which equals Q3 - Q1). Note that the "balance" variable has significant negative outliers: all the variables were right skewed with upper-boundary outliers, but "balance" also showed extreme lower-boundary values that needed to be addressed. I adapted the code provided by Nwanganga & Chapple (2020) to also remove values more than 1.5 times the IQR below the first quartile, the mirror image of the upper-boundary treatment, as seen in the mutate(min2 = quantile(balance, .25) - (1.5 * IQR(balance))) line.
bankdata <- bankdata %>%
mutate(max1 = quantile(age, .75) + (1.5 * IQR(age))) %>%
mutate(max2 = quantile(balance, .75) + (1.5 * IQR(balance))) %>%
mutate(min2 = quantile(balance, .25) - (1.5 * IQR(balance))) %>%
mutate(max3 = quantile(day, .75) + (1.5 * IQR(day))) %>%
mutate(max4 = quantile(duration, .75) + (1.5 * IQR(duration))) %>%
mutate(max5 = quantile(campaign, .75) + (1.5 * IQR(campaign))) %>%
filter(age <= max1) %>%
filter(balance <= max2) %>%
filter(balance >= min2) %>%
filter(day <= max3) %>%
filter(duration <= max4) %>%
filter(campaign <= max5) %>%
select(-max1,-max2,-min2,-max3,-max4,-max5)
# I removed any value that is larger or less than 1.5 times the interquartile range:
bankdata %>%
keep(is.numeric) %>%
summary()
## age balance day duration
## Min. :18.00 Min. :-1944.0 Min. : 1.00 Min. : 0
## 1st Qu.:32.00 1st Qu.: 46.0 1st Qu.: 8.00 1st Qu.:102
## Median :39.00 Median : 344.0 Median :15.00 Median :171
## Mean :40.28 Mean : 633.8 Mean :15.47 Mean :207
## 3rd Qu.:48.00 3rd Qu.: 967.0 3rd Qu.:21.00 3rd Qu.:280
## Max. :70.00 Max. : 3462.0 Max. :31.00 Max. :643
## campaign
## Min. :1.000
## 1st Qu.:1.000
## Median :2.000
## Mean :2.132
## 3rd Qu.:3.000
## Max. :6.000
The results of removing the outliers are noticeable. The minimum and maximum values are all closer to the mean and median values. Although the balance minimum and maximum are still noticeably extreme, they now fall within the 1.5 × IQR boundaries. It is plausible that these values are legitimate; the negative values may indicate that some clients have overdrawn their accounts.
# Regenerate histograms for numeric variables:
variables <- c("age", "balance", "day", "duration", "campaign")
histograms <- function(bankdata, variables, binwidth = .5) {
for (var in variables) {
plot_histograms <- ggplot(bankdata, aes(x = !!sym(var))) +
geom_histogram(binwidth = binwidth, fill = "blue", color = "black", alpha = 0.7) +
labs(title = paste("Histogram of", var), x = var, y = "Frequency") +
theme_minimal()
print(plot_histograms) # Print each plot one by one
}
}
# Call the function to generate histograms
histograms(bankdata, variables, binwidth = .5)
After removing the outliers and retaining the data points within the 1.5 × IQR boundaries, a review of the histograms still reveals skewed data. The “age” variable appears moderately right skewed. The “balance” variable appears slightly right skewed and heavily peaked (leptokurtic), but otherwise fairly normal. The “day” variable appears normal. The “campaign” variable appears right skewed. The “duration” variable is right skewed. Since the “day” and “balance” variables appear fairly normal, no further adjustments are needed to normalize them. However, I can attempt to normalize the “age”, “campaign”, and “duration” variables.
# The "age", "campaign" and "duration" variables are all right skewed and have positive values, which lend themselves to a log or square root transformation. I chose to do a log transformation and added one to the log so the zero values will not be undefined.
bankdata$age <- log(bankdata$age + 1)
bankdata$campaign <- log(bankdata$campaign + 1)
bankdata$duration <- log(bankdata$duration + 1)
# Rerun the histograms.
variables <- c("age", "duration", "campaign")
histograms <- function(bankdata, variables, binwidth = .5) {
for (var in variables) {
plot_histograms <- ggplot(bankdata, aes(x = !!sym(var))) +
geom_histogram(binwidth = binwidth, fill = "blue", color = "black", alpha = 0.7) +
labs(title = paste("Histogram of", var), x = var, y = "Frequency") +
theme_minimal()
print(plot_histograms) # Print each plot one by one
}
}
# Call the function to generate histograms
histograms(bankdata, variables, binwidth = .5)
The “age” and “duration” variables appear more normal. The “campaign” variable still appears right skewed, but less skewed.
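To put a number on the remaining skew, the sample skewness can be computed directly; a sketch using a simple hand-rolled estimator so no extra package is needed (output not shown):
# Approximate sample skewness: third standardized moment (values near 0 indicate symmetry).
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
sapply(bankdata[, c("age", "duration", "campaign")], skewness)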
Even after the log transformation to reduce skewness, the continuous variables are not on the same scale. For example, the “duration” variable is measured in seconds, originally ranging from zero to 643, while the “balance” variable is measured in dollars, from -1,944 to 3,462. To overcome the scaling imbalance, I applied the min-max normalization technique discussed in Nwanganga & Chapple (2020). Whether or not a variable required a transformation, all continuous variables must be put on the same scale for comparability.
# Min-max normalization function
min_max_norm <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
# Apply min-max normalization to all continuous variables
bankdata$age <- min_max_norm(bankdata$age)
bankdata$balance <- min_max_norm(bankdata$balance)
bankdata$day <- min_max_norm(bankdata$day)
bankdata$duration <- min_max_norm(bankdata$duration)
bankdata$campaign <- min_max_norm(bankdata$campaign)
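For reference, the same normalization can be applied to every numeric column in one step with dplyr's across(); this sketch is equivalent to the five assignments above and is left commented out so the pipeline runs as written:
# One-step equivalent: min-max normalize every numeric column at once.
# bankdata <- bankdata %>%
#   mutate(across(where(is.numeric), min_max_norm))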
# To verify the new range of continuous variables, I call the is.numeric function for the continuous variables:
bankdata %>%
keep(is.numeric) %>%
summary()
## age balance day duration
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.4188 1st Qu.:0.3681 1st Qu.:0.2333 1st Qu.:0.7166
## Median :0.5647 Median :0.4232 Median :0.4667 Median :0.7959
## Mean :0.5668 Mean :0.4768 Mean :0.4825 Mean :0.7860
## 3rd Qu.:0.7187 3rd Qu.:0.5385 3rd Qu.:0.6667 3rd Qu.:0.8718
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## campaign
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.3237
## Mean :0.2973
## 3rd Qu.:0.5533
## Max. :1.0000
All continuous variables now range between zero and one, which makes the transformed values easier to compare. One potential concern with min-max normalization is that it could leave the data skewed; from the summary above, it does not appear to have adversely affected the variables. To make sure the scaling did not skew the “campaign” data, I created another set of histograms for a visual check, this time (with a little help from ChatGPT) overlaying a normal density curve on each histogram.
# Rerun the histograms.
variables <- c("age", "balance", "day", "duration", "campaign")
# Adjusted function to overlay a normal distribution curve
histograms <- function(bankdata, variables, binwidth = .5) {
for (var in variables) {
# Calculate mean and standard deviation for the variable
mean_val <- mean(bankdata[[var]], na.rm = TRUE)
sd_val <- sd(bankdata[[var]], na.rm = TRUE)
plot_histograms <- ggplot(bankdata, aes(x = !!sym(var))) +
geom_histogram(aes(y = ..density..), binwidth = binwidth, fill = "blue", color = "black", alpha = 0.7) +
stat_function(fun = dnorm, args = list(mean = mean_val, sd = sd_val), color = "red", size = 1) +
labs(title = paste("Histogram of", var), x = var, y = "Density") +
theme_minimal()
print(plot_histograms) # Print each plot one by one
}
}
# Call the function to generate histograms with normal distribution overlay
histograms(bankdata, variables, binwidth = .5)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Below I created vertical bar charts to visualize the distribution of the categorical variables.
bankdata %>%
select(job, marital, education, default, housing, loan, contact, month, poutcome) %>%
gather(key = "variable", value = "category") %>% # Creates a new column named "variable", which stores the original column names, and "value" stores the actual values from these columns in a single column called "category". The gather() function reshapes data from wide format to long format, making it easier to work with ggplot2 for faceted plots.
ggplot(aes(x = category)) +
geom_bar(fill="light blue") +
facet_wrap(~variable, scales = "free") + # the "~variable" tells ggplot2 to create separate facets (subplots) for each unique value in the variable column. scales = "free" allows the x-axis and y-axis to be independently scaled for each facet.
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 8), # Increase text size
strip.text = element_text(size = 10), # Increase facet label size
plot.title = element_text(size = 10, face = "bold"), # Increase title size
axis.title = element_text(size = 12) # Increase axis label size
) +
ggtitle("Categorical Variable Distributions") +
xlab("Categories") +
ylab("Count")
## Warning: attributes are not identical across measure variables; they will be
## dropped
The categorical variables give some key insights. One thing I noticed is that far more client contacts occur in the warmer months than in the colder months (for the northern hemisphere). Therefore, I created a new variable, “season”, with two levels, “warmmonths” and “coldmonths”, to see the potential effect. I labeled the warm months as April, May, June, July, August, September, and October, and the cold months as November, December, January, February, and March.
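The monthly bars reflect how often clients were contacted; whether the subscription rate itself differs by month could be checked with a sketch like the following (not run here, so no output is claimed):
# Share of "yes" responses by month, sorted from highest to lowest rate.
# bankdata %>%
#   group_by(month) %>%
#   summarise(contacts = n(), subscription_rate = mean(y == "yes")) %>%
#   arrange(desc(subscription_rate))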
# I checked the unique values in the month variable to make sure I have the correct names.
unique(bankdata$month)
## [1] may jun jul aug oct nov dec jan feb mar apr sep
## Levels: apr aug dec feb jan jul jun mar may nov oct sep
# With the correct names verified, I categorized the "month" variable into two unique values: "warmmonths" and "coldmonths". Since the "month" variable was converted to a factor earlier in the process, I had to convert it back to a character vector to work with the mutate function. It also made sense to rename the "month" variable to "season":
bankdata <- bankdata %>%
mutate(month = as.character(month)) %>%
mutate(season = ifelse(month %in% c("apr", "may", "jun", "jul", "aug", "sep", "oct"),
"warmmonths", "coldmonths"))
# To make sure that the "month" variable collapsed to warmmonths and coldmonths, and get their counts, I created a table:
table(bankdata$season)
##
## coldmonths warmmonths
## 6583 28136
Since the “month” information is now captured by “season”, the “month” variable is dropped from further analysis.
bankdata <- subset(bankdata, select = -c(month))
# Since the "season" variable is a character vector, I need to change it a factor to create levels, 0 and 1:
bankdata <- bankdata %>%
mutate(season = as.factor(ifelse(season %in% c("warmmonths"), 1, 0))) # Telling R to look at warmmonths and give it a code of 1, if not warmmonths, give it a zero code.
# Check the structure of the data
str(bankdata$season)
## Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
I created a bar chart of the new season variable.
bankdata %>%
select(season) %>%
gather(key = "variable", value = "category") %>% # Creates a new column named "variable", which stores the original column names, and "value" stores the actual values from these columns in a single column called "category". The gather() function reshapes data from wide format to long format, making it easier to work with ggplot2 for faceted plots.
ggplot(aes(x = category)) +
geom_bar(fill="light blue") +
facet_wrap(~variable, scales = "free") + # the "~variable" tells ggplot2 to create separate facets (subplots) for each unique value in the variable column. scales = "free" allows the x-axis and y-axis to be independently scaled for each facet.
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 8), # Increase text size
strip.text = element_text(size = 10), # Increase facet label size
plot.title = element_text(size = 10, face = "bold"), # Increase title size
axis.title = element_text(size = 12) # Increase axis label size
) +
ggtitle("Categorical Variable Distribution of Season: Cold Months (left) versus Warm Months (right)") +
xlab("Categories") +
ylab("Count")
The job variable has interesting categories. To try to tell a better story and possibly create a new variable, I did a frequency count of all the levels of the job variable.
# Count occurrences of each job category
job_counts <- bankdata %>%
group_by(job) %>%
summarise(count = n()) %>%
arrange(desc(count))
# Print the job count table
print(job_counts)
## # A tibble: 12 × 2
## job count
## <fct> <int>
## 1 blue-collar 7705
## 2 management 7003
## 3 technician 5890
## 4 admin. 4187
## 5 services 3332
## 6 retired 1398
## 7 self-employed 1180
## 8 entrepreneur 1147
## 9 unemployed 984
## 10 housemaid 933
## 11 student 762
## 12 unknown 198
The frequency count reveals that the highest-count categories are jobs with more stable income. Therefore, I created a new variable, “jobstability”, with two levels, “stablejobs” and “unstablejobs”, to see the potential effect. I labeled blue-collar, management, technician, admin., and services as stable jobs, and the remaining categories (retired, self-employed, entrepreneur, unemployed, housemaid, student, and unknown) as unstable jobs.
# I checked the unique values in the job variable to make sure I have the correct names.
unique(bankdata$job)
## [1] management technician entrepreneur blue-collar unknown
## [6] retired admin. services self-employed unemployed
## [11] housemaid student
## 12 Levels: admin. blue-collar entrepreneur housemaid management ... unknown
# With the correct names, I categorized the "job" variable into two unique values: stablejobs and unstablejobs. Since the "job" variable was converted to a factor earlier in the process, I had to convert it back to a character vector to work with the mutate function. It also made sense to rename the "job" variable to "jobstability":
bankdata <- bankdata %>%
mutate(job = as.character(job)) %>%# Categorize jobs into stablejobs and unstablejobs
mutate(jobstability = ifelse(job %in% c("blue-collar", "management", "technician", "admin.", "services"),
"stablejobs", "unstablejobs"))
# To make sure that the "month" variable collapsed to stablejobs and unstablejobs, and get their counts, I created a table:
table(bankdata$jobstability)
##
## stablejobs unstablejobs
## 28117 6602
Since the “job” information is now captured by “jobstability”, the “job” variable is dropped from further analysis.
# Drop "job" varible from further analysis:
bankdata <- subset(bankdata, select = -c(job))
# Since the "season" variable is a character vector, I need to change it a factor to create levels, 0 and 1:
bankdata <- bankdata %>%
mutate(jobstability = factor(ifelse(jobstability == "stablejobs", 1, 0), levels = c(0, 1))) # Code stablejobs as 1 and everything else (unstablejobs) as 0.
# Check the structure of the data
str(bankdata$jobstability)
## Factor w/ 2 levels "0","1": 2 2 1 2 1 2 2 1 1 2 ...
glimpse(bankdata)
## Rows: 34,719
## Columns: 15
## $ age <dbl> 0.8595534, 0.6540713, 0.4414379, 0.7030293, 0.4414379, 0.…
## $ marital <fct> married, single, married, married, single, married, singl…
## $ education <fct> tertiary, secondary, secondary, unknown, unknown, tertiar…
## $ default <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no, …
## $ balance <dbl> 0.7560118, 0.3649649, 0.3599704, 0.6381798, 0.3597854, 0.…
## $ housing <fct> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes…
## $ loan <fct> no, no, yes, no, no, no, yes, no, no, no, no, no, no, no,…
## $ contact <fct> unknown, unknown, unknown, unknown, unknown, unknown, unk…
## $ day <dbl> 0.1333333, 0.1333333, 0.1333333, 0.1333333, 0.1333333, 0.…
## $ duration <dbl> 0.8609468, 0.7767648, 0.6716153, 0.7008056, 0.8184217, 0.…
## $ campaign <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ poutcome <fct> unknown, unknown, unknown, unknown, unknown, unknown, unk…
## $ y <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, n…
## $ season <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ jobstability <fct> 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, …
bankdata %>%
select(jobstability) %>%
gather(key = "variable", value = "category") %>% # Creates a new column named "variable", which stores the original column names, and "value" stores the actual values from these columns in a single column called "category". The gather() function reshapes data from wide format to long format, making it easier to work with ggplot2 for faceted plots.
ggplot(aes(x = category)) +
geom_bar(fill="gold") +
facet_wrap(~variable, scales = "free") + # the "~variable" tells ggplot2 to create separate facets (subplots) for each unique value in the variable column. scales = "free" allows the x-axis and y-axis to be independently scaled for each facet.
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 8), # Increase text size
strip.text = element_text(size = 10), # Increase facet label size
plot.title = element_text(size = 10, face = "bold"), # Increase title size
axis.title = element_text(size = 12) # Increase axis label size
) +
ggtitle("Categorical Variable Distribution of Job Stability: Unstable Jobs (left) versus Stable Jobs (right)") +
xlab("Categories") +
ylab("Count")
Now that the data has been processed, the next step is to split the data with the goal of seeing how well a model generalizes to new data. I used the set.seed() function to reproduce results. I split the data 75% into a training set to fit a logistic regression model, and 25% into a testing set to evaluate the model’s performance.
set.seed(1234)
sample_set <- sample(nrow(bankdata), round(nrow(bankdata)*.75), replace = FALSE)
bankdata_train <- bankdata[sample_set, ]
bankdata_test <- bankdata[-sample_set, ]
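An alternative worth noting is a stratified split with caret's createDataPartition(), which preserves the class proportions of "y" in both partitions by construction; a sketch (not used here, since the random split above is checked for proportionality below):
# Stratified 75/25 split that preserves the yes/no proportions of y.
# library(caret)
# set.seed(1234)
# train_idx <- createDataPartition(bankdata$y, p = 0.75, list = FALSE)
# bankdata_train <- bankdata[train_idx, ]
# bankdata_test <- bankdata[-train_idx, ]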
I ran a check on the proportionality of the split data. I needed to make sure that the training and testing sets are proportional to the original data set with respect to the dependent variable, “y”.
# Note that the "y" variable of the bankdata is isolated in the following code:
round(prop.table(table(select(bankdata, y), exclude = NULL)), 4) * 100
## y
## no yes
## 91.64 8.36
round(prop.table(table(select(bankdata_train, y), exclude = NULL)), 4) * 100
## y
## no yes
## 91.66 8.34
round(prop.table(table(select(bankdata_test, y), exclude = NULL)), 4) * 100
## y
## no yes
## 91.6 8.4
Across all data sets, the “y” dependent variable is highly imbalanced. Since the “no” outcome is so prevalent, a model trained on these data may simply favor the majority class, making predictions of the “yes” outcome unreliable. Synthetically balancing the training set allows the model to learn the patterns of both classes. Note that the test set is intentionally left unbalanced: balancing it would no longer reflect the real-world class distribution and would risk an overly optimistic view of the model’s ability to generalize to other data sets.
Per Nwanganga & Chapple (2020), there are several approaches to solving class imbalance problems; one of them is by using a synthetic minority oversampling technique (SMOTE). This technique works by creating new synthetic samples from the minority class to resolve the imbalance.
# The DMwR package has been removed from CRAN, so it cannot be installed the usual way. I downloaded an archived version from https://cran.r-project.org/src/contrib/Archive/DMwR/ and installed DMwR_0.4.1.tar.gz, which contains the SMOTE function.
# When I first ran the code provided by Nwanganga & Chapple (2020), it did not work. AI tools suggested that SMOTE might conflict with same-named functions from other packages, so I namespaced the call with DMwR::SMOTE. The following code worked.
# bankdata_train <- DMwR::SMOTE(y ~ ., bankdata_train, perc.over = 100, perc.under = 200)
# However, when I restarted R and ran all chunks, the code below functioned, with warnings.
library(DMwR)
## Loading required package: lattice
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
set.seed(1234)
bankdata_train <- SMOTE(y ~ ., data = bankdata_train, perc.over = 100, perc.under = 200)
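Because DMwR is archived, a maintained alternative worth noting is the ROSE package on CRAN. Its ovun.sample() function balances classes by random over- and under-sampling rather than by generating synthetic SMOTE examples, so results would differ; a sketch (not run):
# Alternative balancing via ROSE: combined random over- and under-sampling to a 50/50 split.
# library(ROSE)
# set.seed(1234)
# bankdata_train <- ovun.sample(y ~ ., data = bankdata_train, method = "both", p = 0.5)$data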
With SMOTE applied, the bankdata_train set should now be evenly balanced, so I reran the proportions to confirm.
round(prop.table(table(select(bankdata, y), exclude = NULL)), 4) * 100
## y
## no yes
## 91.64 8.36
round(prop.table(table(select(bankdata_train, y), exclude = NULL)), 4) * 100
## y
## no yes
## 50 50
round(prop.table(table(select(bankdata_test, y), exclude = NULL)), 4) * 100
## y
## no yes
## 91.6 8.4
I changed the dependent variable from yes/no to 0/1 so the algorithm can work better with numerical values.
bankdata <- bankdata %>%
mutate(y = as.factor(ifelse(y=="yes", 1, 0)))
bankdata_train <- bankdata_train %>%
mutate(y = as.factor(ifelse(y=="yes", 1, 0)))
bankdata_test <- bankdata_test %>%
mutate(y = as.factor(ifelse(y=="yes", 1, 0)))
Although I already tested the proportions in the test data, for my own understanding, I wanted to report the raw totals.
bankdata_test %>% count(y)
## y n
## 1 0 7951
## 2 1 729
In this section, I am building a binomial logistic regression model using the glm() function from the stats package.
bankdata_mod1 <- glm(data=bankdata_train, family=binomial, formula=y ~ .)
summary(bankdata_mod1)
##
## Call:
## glm(formula = y ~ ., family = binomial, data = bankdata_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.86434 0.35286 -22.288 < 2e-16 ***
## age -0.49885 0.17099 -2.917 0.003529 **
## maritalmarried -0.49858 0.08951 -5.570 2.55e-08 ***
## maritalsingle -0.03246 0.10194 -0.318 0.750194
## educationsecondary 0.18013 0.09260 1.945 0.051736 .
## educationtertiary 0.69338 0.09820 7.061 1.66e-12 ***
## educationunknown 0.57477 0.15905 3.614 0.000302 ***
## defaultyes 0.16363 0.19678 0.832 0.405660
## balance 1.21005 0.18785 6.442 1.18e-10 ***
## housingyes -0.60979 0.05941 -10.264 < 2e-16 ***
## loanyes 0.09928 0.07353 1.350 0.176945
## contacttelephone 0.67555 0.10960 6.164 7.10e-10 ***
## contactunknown -0.87541 0.08027 -10.905 < 2e-16 ***
## day -0.37768 0.10456 -3.612 0.000304 ***
## duration 11.39891 0.33946 33.580 < 2e-16 ***
## campaign -1.01199 0.10514 -9.625 < 2e-16 ***
## poutcomeother 0.36701 0.13232 2.774 0.005542 **
## poutcomesuccess 2.43608 0.16492 14.771 < 2e-16 ***
## poutcomeunknown -0.69101 0.08138 -8.491 < 2e-16 ***
## season1 -0.70347 0.06569 -10.709 < 2e-16 ***
## jobstability1 -0.59018 0.06746 -8.748 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12044 on 8687 degrees of freedom
## Residual deviance: 7603 on 8667 degrees of freedom
## AIC: 7645
##
## Number of Fisher Scoring iterations: 6
I noted that all the variables were significant except for “default” and “loan”. Before dealing with the insignificant variables, I first tested for multicollinearity.
# Test for Dealing with Multicollinearity
library(stats)
library(corrplot)
## corrplot 0.95 loaded
# Compute the table of correlation coefficients for the numeric variables using the cor() function.
bankdatanumeric <- bankdata %>%
select(where(is.numeric))
bankdata_correlations <- cor(bankdatanumeric)
# Finally, we visualize these correlations using the corrplot.mixed function.
# Generate correlation plot with same colors but more pronounced
corrplot.mixed(bankdata_correlations,
lower = "number", # Lower triangle: Numbers
upper = "circle", # Upper triangle: Circles
number.cex = 1.5, # Increase number size
tl.col = "black", # Make text labels black
col.lim = c(-1, 1)) # Ensure full contrast
No multicollinearity is present among the continuous variables. Just to verify, I also tested the variance inflation factor (VIF).
# Second approach to identify multicollinearity:
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
vif(bankdata_mod1)
## GVIF Df GVIF^(1/(2*Df))
## age 1.325316 1 1.151224
## marital 1.286949 2 1.065100
## education 1.080888 3 1.013048
## default 1.027298 1 1.013557
## balance 1.047139 1 1.023298
## housing 1.093070 1 1.045500
## loan 1.016340 1 1.008137
## contact 1.117826 2 1.028238
## day 1.004976 1 1.002485
## duration 1.074582 1 1.036620
## campaign 1.025593 1 1.012716
## poutcome 1.084021 3 1.013537
## season 1.063876 1 1.031444
## jobstability 1.038080 1 1.018862
No VIF (GVIF) is greater than five, so there is no problematic multicollinearity among the continuous or categorical variables.
I dropped the “default” and “loan” variables and reran the model.
# Create a new model "bankdata_mod2" with all of the variables from bankdata_train except for the "default" and "loan" variables.
bankdata_mod2 <- glm(
y ~ . - default - loan, # Use all variables except "default" and "loan".
data = bankdata_train,
family = binomial
)
summary(bankdata_mod2)
##
## Call:
## glm(formula = y ~ . - default - loan, family = binomial, data = bankdata_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.83639 0.35215 -22.253 < 2e-16 ***
## age -0.49203 0.17090 -2.879 0.003989 **
## maritalmarried -0.50038 0.08948 -5.592 2.24e-08 ***
## maritalsingle -0.03677 0.10183 -0.361 0.718040
## educationsecondary 0.18312 0.09256 1.978 0.047882 *
## educationtertiary 0.69432 0.09818 7.072 1.53e-12 ***
## educationunknown 0.56990 0.15895 3.585 0.000337 ***
## balance 1.17573 0.18609 6.318 2.65e-10 ***
## housingyes -0.60606 0.05932 -10.216 < 2e-16 ***
## contacttelephone 0.67131 0.10953 6.129 8.83e-10 ***
## contactunknown -0.87350 0.08020 -10.891 < 2e-16 ***
## day -0.37780 0.10453 -3.614 0.000301 ***
## duration 11.40570 0.33900 33.645 < 2e-16 ***
## campaign -1.00649 0.10501 -9.585 < 2e-16 ***
## poutcomeother 0.36916 0.13214 2.794 0.005211 **
## poutcomesuccess 2.43145 0.16487 14.748 < 2e-16 ***
## poutcomeunknown -0.69073 0.08133 -8.493 < 2e-16 ***
## season1 -0.70696 0.06567 -10.766 < 2e-16 ***
## jobstability1 -0.58824 0.06743 -8.724 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12044.1 on 8687 degrees of freedom
## Residual deviance: 7605.6 on 8669 degrees of freedom
## AIC: 7643.6
##
## Number of Fisher Scoring iterations: 6
The AIC for model one was 7645; for model two it is 7643.6, a marginal improvement but not a meaningful difference. The main benefit of model two is reduced complexity, so for learning purposes I dropped “default” and “loan” and advanced with model two.
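The two AIC values can also be pulled side by side from the fitted objects (output not shown):
# Compare both models' degrees of freedom and AIC in one call.
AIC(bankdata_mod1, bankdata_mod2)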
The coefficients in the output above report the change in the log-odds of the dependent variable “y” being “1” (yes, a subscription), since glm() models the probability of the second factor level. Taking the example from Nwanganga & Chapple (2020) and modifying it for this data set, a value of 1.17573 for the coefficient of “balance” means that, for every one-unit increase in “balance”, the log-odds of “y” being “yes” increase by 1.17573. Since log-odds are hard to interpret directly, I exponentiated the coefficients into odds ratios using the exp() and coef() functions below.
exp(coef(bankdata_mod2)["age"])
## age
## 0.6113824
exp(coef(bankdata_mod2)["balance"])
## balance
## 3.240513
exp(coef(bankdata_mod2)["day"])
## day
## 0.6853662
exp(coef(bankdata_mod2)["duration"])
## duration
## 89832.47
exp(coef(bankdata_mod2)["campaign"])
## campaign
## 0.3654987
exp(coef(bankdata_mod2)["jobstability1"])
## jobstability1
## 0.555302
exp(coef(bankdata_mod2)["maritalmarried"])
## maritalmarried
## 0.606302
exp(coef(bankdata_mod2)["maritalsingle"])
## maritalsingle
## 0.9638975
exp(coef(bankdata_mod2)["educationsecondary"])
## educationsecondary
## 1.200955
exp(coef(bankdata_mod2)["educationtertiary"])
## educationtertiary
## 2.002339
exp(coef(bankdata_mod2)["educationunknown"])
## educationunknown
## 1.768089
exp(coef(bankdata_mod2)["housingyes"])
## housingyes
## 0.5454937
exp(coef(bankdata_mod2)["contacttelephone"])
## contacttelephone
## 1.956794
exp(coef(bankdata_mod2)["contactunknown"])
## contactunknown
## 0.4174861
exp(coef(bankdata_mod2)["season1"])
## season1
## 0.4931426
exp(coef(bankdata_mod2)["poutcomeother"])
## poutcomeother
## 1.446517
exp(coef(bankdata_mod2)["poutcomesuccess"])
## poutcomesuccess
## 11.37537
exp(coef(bankdata_mod2)["poutcomeunknown"])
## poutcomeunknown
## 0.5012081
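The same odds ratios can be produced in a single call, and 95% profile-likelihood confidence intervals can be attached with confint(); a sketch (output not shown; the profiling step can take a few seconds):
# All odds ratios at once, with 95% confidence intervals on the odds scale.
exp(cbind(odds_ratio = coef(bankdata_mod2), confint(bankdata_mod2)))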
To interpret the outcome, holding all other predictors constant, a one-unit increase in balance multiplies the odds of a client responding “yes” by 3.24. Since balance was min-max normalized, a one-unit increase corresponds to moving across its full observed range; the practical reading is that clients with higher average yearly balances have substantially higher odds of subscribing.
Model two has a null deviance of 12,044, which measures how well a model with no predictors (an intercept only) fits the data; this high value indicates a poor fit. The residual deviance of 7,606 measures how well the model with predictors fits the data; the lower value indicates a better fit. The difference of roughly 4,438 suggests that the predictors explain a substantial portion of the variation. To see whether the difference is statistically significant, I ran a Chi-square test.
# I assigned values to the Null and Residual deviances and calculated the difference between them:
null_deviance <- 12044.1
residual_deviance <- 7605.6
deviance_diff <- null_deviance - residual_deviance
# The Chi-Square test requires computing the difference in the degrees of freedom from the Null to the Residual deviance.
degrees_freedom_null <- 8687
degrees_freedom_residual <- 8669
degrees_freedom_differential <- degrees_freedom_null - degrees_freedom_residual # Degrees of freedom lost
# Chi-square test
p_value <- pchisq(deviance_diff, df=degrees_freedom_differential, lower.tail=FALSE)
# Print results
cat("Chi-square statistic:", deviance_diff, "\n")
## Chi-square statistic: 4438.5
cat("Degrees of freedom:", degrees_freedom_differential, "\n")
## Degrees of freedom: 18
cat("p-value:", p_value, "\n")
## p-value: 0
The Chi-square test is significant, therefore, the predictors significantly improve the model.
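The same test can also be driven directly from values stored in the fitted model object, which avoids typing the deviances and degrees of freedom by hand; a sketch (output not shown):
# Likelihood-ratio test of model two against the null model, using components of the glm fit.
with(bankdata_mod2, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))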
The next step is to assess the model’s performance in predicting the response for out-of-sample observations. This involves using the model to predict “y” in the test data (bankdata_test) with the predict() function from the stats package. Taking the example from Nwanganga & Chapple (2020) and modifying it for this data set, I get:
bankdata_pred1 <- predict(bankdata_mod2, bankdata_test, type = 'response')
head(bankdata_pred1)
## 5 8 9 11 12 17
## 0.47892670 0.62918425 0.01332772 0.19027464 0.10159381 0.06785555
Using the testing data, this output shows six clients and their predicted probability of answering “yes” to subscribing to a term deposit. For example, person 5 has a 47.9% chance and person 9 has a 1.3% chance of subscribing.
Treating predicted probabilities below 0.5 as “no” (0) and 0.5 or above as “yes” (1), we can quickly assess which clients are likely to subscribe to a term deposit.
bankdata_pred1 <- ifelse(bankdata_pred1>= 0.5, 1, 0)
head(bankdata_pred1)
## 5 8 9 11 12 17
## 0 1 0 0 0 0
The outcome is that person 8 is likely to subscribe to a term deposit and the others are not.
To assess how well model two actually performed, I compared its predicted values for “y” with the actual values in the test data set. Earlier, I recoded the dependent variable from yes/no to 0/1 in both the training and test data, so both are factors with levels 0 and 1. To double-check, I examined the structure of, and tabulated, the dependent variable in the test data.
str(bankdata_test$y) # View structure of the y column
## Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
table(bankdata_test$y) # Check class distribution
##
## 0 1
## 7951 729
I created the confusion matrix with the caret package, which gives a more comprehensive output.
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
bankdata_pred1 <- as.factor(bankdata_pred1)
conf_matrix <- confusionMatrix(bankdata_pred1, bankdata_test$y)
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6272 155
## 1 1679 574
##
## Accuracy : 0.7887
## 95% CI : (0.78, 0.7973)
## No Information Rate : 0.916
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2956
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7888
## Specificity : 0.7874
## Pos Pred Value : 0.9759
## Neg Pred Value : 0.2548
## Prevalence : 0.9160
## Detection Rate : 0.7226
## Detection Prevalence : 0.7404
## Balanced Accuracy : 0.7881
##
## 'Positive' Class : 0
##
The output shows that model two correctly classified 78.87% of all test cases, with a 95% confidence interval for accuracy of 78.00% to 79.73%. Note that the reported sensitivity and positive predictive value refer to the “0” (no) class, which caret selected as the positive class by default.
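To view the same statistics from the perspective of the subscribers (“1”), the positive class can be set explicitly; a sketch (output not shown):
# Same confusion matrix, but treating "1" (subscribed) as the positive class.
confusionMatrix(bankdata_pred1, bankdata_test$y, positive = "1")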