The data is related with direct marketing campaigns (phone calls) of
a Portuguese banking institution. The classification goal is to predict
if the client will subscribe a term deposit (variable
y).
Input variables:
# bank client data:
1 - age (numeric)
2 - job: type of job (categorical:
“admin.”,“unknown”,“unemployed”,“management”,“housemaid”,“entrepreneur”,“student”,
“blue-collar”,“self-employed”,“retired”,“technician”,“services”) 3 -
marital: marital status (categorical:
“married”,“divorced”,“single”; note: “divorced” means divorced or
widowed)
4 - education (categorical:
1=“primary”,2=“secondary”,2=“tertiary”,9=“unknown”,)
5 - default: has credit in default? (binary:
“yes”,“no”)
6 - balance: average yearly balance, in euros
(numeric)
7 - housing: has housing loan? (binary:
“yes”,“no”)
8 - loan: has personal loan? (binary: “yes”,“no”)
# related with the last contact of the current
campaign:
9 - contact: contact communication type
(categorical: “unknown”,“telephone”,“cellular”) 10 -
day: last contact day of the month (numeric)
11 - month: last contact month of year (categorical: “jan”, “feb”,
“mar”, …, “nov”, “dec”)
12 - duration: last contact duration, in seconds
(numeric)
# other attributes:
13 - campaign: number of contacts performed during
this campaign and for this client (numeric, includes last contact)
14 - pdays: number of days that passed by after the
client was last contacted from a previous campaign (numeric, -1 means
client was not previously contacted)
15 - previous: number of contacts performed before this
campaign and for this client (numeric)
16 - poutcome: outcome of the previous marketing
campaign (categorical: “unknown”,“other”,“failure”,“success”)
Output variable (desired target):
17 - y - has the client subscribed a term deposit?
(binary: “yes”,“no”)
##
## Attaching package: 'neuralnet'
## The following object is masked from 'package:dplyr':
##
## compute
library(nnet) # for dummy variable coding
library(gmodels) # for confusion matrix
# Load the dataset
bank.df <- read.csv("bank.csv", na.strings = "", sep = ";")
# Display the first few rows and structure
head(bank.df)## age job marital education default balance housing loan contact day
## 1 30 unemployed married primary no 1787 no no cellular 19
## 2 33 services married secondary no 4789 yes yes cellular 11
## 3 35 management single tertiary no 1350 yes no cellular 16
## 4 30 management married tertiary no 1476 yes yes unknown 3
## 5 59 blue-collar married secondary no 0 yes no unknown 5
## 6 35 management single tertiary no 747 no no cellular 23
## month duration campaign pdays previous poutcome y
## 1 oct 79 1 -1 0 unknown no
## 2 may 220 1 339 4 failure no
## 3 apr 185 1 330 1 failure no
## 4 jun 199 4 -1 0 unknown no
## 5 may 226 1 -1 0 unknown no
## 6 feb 141 2 176 3 failure no
## 'data.frame': 4521 obs. of 17 variables:
## $ age : int 30 33 35 30 59 35 36 39 41 43 ...
## $ job : chr "unemployed" "services" "management" "management" ...
## $ marital : chr "married" "married" "single" "married" ...
## $ education: chr "primary" "secondary" "tertiary" "tertiary" ...
## $ default : chr "no" "no" "no" "no" ...
## $ balance : int 1787 4789 1350 1476 0 747 307 147 221 -88 ...
## $ housing : chr "no" "yes" "yes" "yes" ...
## $ loan : chr "no" "yes" "no" "yes" ...
## $ contact : chr "cellular" "cellular" "cellular" "unknown" ...
## $ day : int 19 11 16 3 5 23 14 6 14 17 ...
## $ month : chr "oct" "may" "apr" "jun" ...
## $ duration : int 79 220 185 199 226 141 341 151 57 313 ...
## $ campaign : int 1 1 1 4 1 2 1 2 2 1 ...
## $ pdays : int -1 339 330 -1 -1 176 330 -1 -1 147 ...
## $ previous : int 0 4 1 0 0 3 2 0 0 2 ...
## $ poutcome : chr "unknown" "failure" "failure" "unknown" ...
## $ y : chr "no" "no" "no" "no" ...
**Complete the code below by filling spaces (“___“).**
Instructions: - Normalize age,
balance, campaign, and previous
to the range [0, 1]. - Use the formula:
(x - min(x)) / (max(x) - min(x)). - Report the first 6
scaled records.
# Define the normalization function
normalize <- function(x) {
return((x - min(x)) / (max(x) - min(x)))
}
# Normalize numeric variables to [0,1]
bank.df$age <- normalize(bank.df$age)
bank.df$balance <- normalize(bank.df$balance)
bank.df$campaign <- normalize(bank.df$campaign)
bank.df$duration <- normalize(bank.df$duration)
# Display first 6 records
head(bank.df, n = 6)## age job marital education default balance housing loan
## 1 0.1617647 unemployed married primary no 0.06845546 no no
## 2 0.2058824 services married secondary no 0.10875022 yes yes
## 3 0.2352941 management single tertiary no 0.06258976 yes no
## 4 0.1617647 management married tertiary no 0.06428102 yes yes
## 5 0.5882353 blue-collar married secondary no 0.04446920 yes no
## 6 0.2352941 management single tertiary no 0.05449591 no no
## contact day month duration campaign pdays previous poutcome y
## 1 cellular 19 oct 0.02482622 0.00000000 -1 0 unknown no
## 2 cellular 11 may 0.07149950 0.00000000 339 4 failure no
## 3 cellular 16 apr 0.05991394 0.00000000 330 1 failure no
## 4 unknown 3 jun 0.06454816 0.06122449 -1 0 unknown no
## 5 unknown 5 may 0.07348560 0.00000000 -1 0 unknown no
## 6 cellular 23 feb 0.04534922 0.02040816 176 3 failure no
Instructions: - Create dummy variables for
education and y using class.ind
from the nnet package. - Combine them with the original
dataframe to create bank.df1. - Rename the new columns
appropriately.
# Create dummy variables
bank.df1 <- cbind(bank.df, class.ind(bank.df$education), class.ind(bank.df$y))
# Rename columns
names(bank.df1)[18:23] <- c(
paste("edu_", c(1, 2, 3, 9), sep = ""),
paste("term_d_", c("yes", "no"), sep = "")
)
# Display column names
names(bank.df1)## [1] "age" "job" "marital" "education" "default"
## [6] "balance" "housing" "loan" "contact" "day"
## [11] "month" "duration" "campaign" "pdays" "previous"
## [16] "poutcome" "y" "edu_1" "edu_2" "edu_3"
## [21] "edu_9" "term_d_yes" "term_d_no"
Instructions: - Select the following attributes:
age, balance, campaign,
previous, edu_1, edu_2,
edu_3, edu_9, term_d_yes,
term_d_no. - Create a new dataframe bank.df2
with only these attributes.
# Select specific attributes
vars <- c("age", "balance", "campaign", "previous", "edu_1", "edu_2", "edu_3", "edu_9", "term_d_yes", "term_d_no")
bank.df2 <- bank.df1[, vars]
# Display column names
names(bank.df2)## [1] "age" "balance" "campaign" "previous" "edu_1"
## [6] "edu_2" "edu_3" "edu_9" "term_d_yes" "term_d_no"
Instructions: - Partition the data into 60% training and 40% test sets. - Use a random seed to ensure reproducibility.
# Partition the data
set.seed(2)
train.index <- sample(rownames(bank.df2), dim(bank.df2)[1] * 0.6)
test.index <- setdiff(rownames(bank.df2), train.index)
train.df <- bank.df2[train.index, ]
test.df <- bank.df2[test.index, ]
# Display dimensions of training and test sets
dim(train.df)## [1] 2712 10
## [1] 1809 10
Instructions: - Fit a neural network with hidden layers (3, 2). - Plot the resulting network.
# Fit neural network
nn <- neuralnet(term_d_yes + term_d_no ~ ., data = train.df, hidden = c(3, 2))
# Plot the neural network
plot(nn, rep = "best")Instructions: - Predict classes for training and
test datasets. - Use CrossTable from the
gmodels package to create a confusion matrix. - Calculate
the accuracy for both training and test predictions using
mean.
# (a) Predictions on training set
training.prediction <- compute(nn, train.df[, -c(9:10)])
training.class <- apply(training.prediction$net.result, 1, which.max) - 1
# Confusion matrix and accuracy for training set
CrossTable(factor(training.class), factor(train.df$term_d_no),
prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE)##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 2712
##
##
## | factor(train.df$term_d_no)
## factor(training.class) | 0 | 1 | Row Total |
## -----------------------|-----------|-----------|-----------|
## 0 | 2367 | 311 | 2678 |
## | 0.873 | 0.115 | |
## -----------------------|-----------|-----------|-----------|
## 1 | 12 | 22 | 34 |
## | 0.004 | 0.008 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 2379 | 333 | 2712 |
## -----------------------|-----------|-----------|-----------|
##
##
## [1] 0.8808997
# (b) Predictions on test set
test.prediction <- compute(nn, test.df[, -c(9:10)])
test.class <- apply(test.prediction$net.result, 1, which.max) - 1
# Confusion matrix and accuracy for test set
CrossTable(factor(test.class), factor(test.df$term_d_no),
prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE)##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 1809
##
##
## | factor(test.df$term_d_no)
## factor(test.class) | 0 | 1 | Row Total |
## -------------------|-----------|-----------|-----------|
## 0 | 1600 | 182 | 1782 |
## | 0.884 | 0.101 | |
## -------------------|-----------|-----------|-----------|
## 1 | 21 | 6 | 27 |
## | 0.012 | 0.003 | |
## -------------------|-----------|-----------|-----------|
## Column Total | 1621 | 188 | 1809 |
## -------------------|-----------|-----------|-----------|
##
##
## Warning in test.class == train.df$term_d_no: longer object length is not a
## multiple of shorter object length
## [1] 0.8646755
Answer the following questions:
Answer: The ANN performs pretty well predicting both the positives and negatives due to a high accuracy on both the training and testing df.
Answer: There is a 3 percent decrease in accruacy, however there also is a decrease in the length of df. This means that although the score wasn’t as high in the testing data set as it was for the training set, it was still above average and was able to accurately predict most of the time.
Instructions: - Fit a logistic regression model to
classify the outcome term_d_no. - Use the same predictors
and the same training and test datasets, except edu_9 and
term_d_yes - Display the coefficients using
coef. - Add a summary of the model.
# need to drop columns 8 and 9 (edu_9 and term_d_yes)
train.df.log <- train.df[, -c(8,9)]
test.df.log <- test.df[, -c(8,9)]
# Fit logistic regression model
logistic.model <- glm(term_d_no ~ ., data = train.df.log, family = "binomial")
# Display coefficients
coef(logistic.model)## (Intercept) age balance campaign previous edu_1
## -2.433914133 1.211885563 1.023242015 -3.716195024 0.119418065 -0.367928506
## edu_2 edu_3
## 0.003841825 0.261760907
##
## Call:
## glm(formula = term_d_no ~ ., family = "binomial", data = train.df.log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.433914 0.340480 -7.148 8.77e-13 ***
## age 1.211886 0.376046 3.223 0.00127 **
## balance 1.023242 1.363276 0.751 0.45291
## campaign -3.716195 1.346734 -2.759 0.00579 **
## previous 0.119418 0.028596 4.176 2.97e-05 ***
## edu_1 -0.367929 0.336362 -1.094 0.27402
## edu_2 0.003842 0.302848 0.013 0.98988
## edu_3 0.261761 0.307831 0.850 0.39514
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2020.1 on 2711 degrees of freedom
## Residual deviance: 1973.2 on 2704 degrees of freedom
## AIC: 1989.2
##
## Number of Fisher Scoring iterations: 5
Instructions: - Predict the outcomes for the test
set using the logistic regression model. - Use the pROC
library to plot the ROC curve. - Display the AUC value. - Create a
confusion matrix for the predictions with a threshold of 0.50, assuming
“yes” for predicted probabilities greater than 0.50.
## Warning: package 'pROC' was built under R version 4.3.3
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following object is masked from 'package:gmodels':
##
## ci
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Predict probabilities for test set
test.prob <- predict(logistic.model, test.df.log, type = "response")
# Plot ROC curve
roc_curve <- roc(test.df.log$term_d_no, test.prob)## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.6534
# Create confusion matrix with threshold 0.50
test.pred.class <- ifelse(test.prob > 0.50, 1, 0)
CrossTable(factor(test.pred.class), factor(test.df.log$term_d_no),
prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
dnn = c("Predicted", "Actual"))##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 1809
##
##
## | Actual
## Predicted | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 1617 | 188 | 1805 |
## | 0.894 | 0.104 | |
## -------------|-----------|-----------|-----------|
## 1 | 4 | 0 | 4 |
## | 0.002 | 0.000 | |
## -------------|-----------|-----------|-----------|
## Column Total | 1621 | 188 | 1809 |
## -------------|-----------|-----------|-----------|
##
##
## [1] 0.893864
Answer the following questions:
Answer: The logistic model performs well in predicting “no” outcomes, because of a relatively high accuracy on the test set. The confusion matrix shows that the model predicts “no” with good accuracy. The ROC curve and AUC further support the model’s effectiveness in distinguishing between “yes” and “no” outcomes. However, there is always room for improvement through model adjustments and threshold tuning.
Answer:
The significant predictors are determined by their p-values in the summary. Variables with low p-values which are usually less than 0.05 are considered significant. Previous contacts shows the strongest associations with the likelihood of not subscribing.
Answer:
The coefficient of edu_3=tertiary represents the log-odds of a client subscribing to a term deposit with relation to the reference category. A positive coefficient means that tertiary education increases the likelihood of a “no” outcome. This relates back to the use of two variables not necessarily increasing the price of homes as found in our previous labs. The coefficient is the ultimate factor of the impact of two variables creating an effect.
Answer:
The ROC curve tells us that there is trade-off between true positive rate and false positive rate across different thresholds. The AUC value quantifies this, with a higher AUC which would indicate a better model performance. An AUC of 0.8 or higher is considered strong, therefore meaning that the model is good at distinguishing between yes and no. In this case, the AUC suggests that the logistic model has good performance.
Answer: When using a 0.50 threshold for accuracy in this case it worked out well, however if the dataset is imbalanced it can also backfire. For example, if “no” outcomes dominate, the model might predict “no” most of the time, resulting in high accuracy but poor model performance in predicting “yes” outcomes. A change in threshold is something that must be determined after understanding the data.