The data is related with direct marketing campaigns (phone calls) of
a Portuguese banking institution. The classification goal is to predict
if the client will subscribe a term deposit (variable
y).
Input variables:
# bank client data:
1 - age (numeric)
2 - job: type of job (categorical:
“admin.”,“unknown”,“unemployed”,“management”,“housemaid”,“entrepreneur”,“student”,
“blue-collar”,“self-employed”,“retired”,“technician”,“services”) 3 -
marital: marital status (categorical:
“married”,“divorced”,“single”; note: “divorced” means divorced or
widowed)
4 - education (categorical:
1=“primary”,2=“secondary”,2=“tertiary”,9=“unknown”,)
5 - default: has credit in default? (binary:
“yes”,“no”)
6 - balance: average yearly balance, in euros
(numeric)
7 - housing: has housing loan? (binary:
“yes”,“no”)
8 - loan: has personal loan? (binary: “yes”,“no”)
# related with the last contact of the current
campaign:
9 - contact: contact communication type
(categorical: “unknown”,“telephone”,“cellular”) 10 -
day: last contact day of the month (numeric)
11 - month: last contact month of year (categorical: “jan”, “feb”,
“mar”, …, “nov”, “dec”)
12 - duration: last contact duration, in seconds
(numeric)
# other attributes:
13 - campaign: number of contacts performed during
this campaign and for this client (numeric, includes last contact)
14 - pdays: number of days that passed by after the
client was last contacted from a previous campaign (numeric, -1 means
client was not previously contacted)
15 - previous: number of contacts performed before this
campaign and for this client (numeric)
16 - poutcome: outcome of the previous marketing
campaign (categorical: “unknown”,“other”,“failure”,“success”)
Output variable (desired target):
17 - y - has the client subscribed a term deposit?
(binary: “yes”,“no”)
##
## Attaching package: 'neuralnet'
## The following object is masked from 'package:dplyr':
##
## compute
library(nnet) # for dummy variable coding
library(gmodels) # for confusion matrix
# Load the dataset
bank.df <- read.csv("bank.csv", na.strings = "", sep = ";")
# Display the first few rows and structure
head(bank.df)## age job marital education default balance housing loan contact day
## 1 30 unemployed married primary no 1787 no no cellular 19
## 2 33 services married secondary no 4789 yes yes cellular 11
## 3 35 management single tertiary no 1350 yes no cellular 16
## 4 30 management married tertiary no 1476 yes yes unknown 3
## 5 59 blue-collar married secondary no 0 yes no unknown 5
## 6 35 management single tertiary no 747 no no cellular 23
## month duration campaign pdays previous poutcome y
## 1 oct 79 1 -1 0 unknown no
## 2 may 220 1 339 4 failure no
## 3 apr 185 1 330 1 failure no
## 4 jun 199 4 -1 0 unknown no
## 5 may 226 1 -1 0 unknown no
## 6 feb 141 2 176 3 failure no
## 'data.frame': 4521 obs. of 17 variables:
## $ age : int 30 33 35 30 59 35 36 39 41 43 ...
## $ job : chr "unemployed" "services" "management" "management" ...
## $ marital : chr "married" "married" "single" "married" ...
## $ education: chr "primary" "secondary" "tertiary" "tertiary" ...
## $ default : chr "no" "no" "no" "no" ...
## $ balance : int 1787 4789 1350 1476 0 747 307 147 221 -88 ...
## $ housing : chr "no" "yes" "yes" "yes" ...
## $ loan : chr "no" "yes" "no" "yes" ...
## $ contact : chr "cellular" "cellular" "cellular" "unknown" ...
## $ day : int 19 11 16 3 5 23 14 6 14 17 ...
## $ month : chr "oct" "may" "apr" "jun" ...
## $ duration : int 79 220 185 199 226 141 341 151 57 313 ...
## $ campaign : int 1 1 1 4 1 2 1 2 2 1 ...
## $ pdays : int -1 339 330 -1 -1 176 330 -1 -1 147 ...
## $ previous : int 0 4 1 0 0 3 2 0 0 2 ...
## $ poutcome : chr "unknown" "failure" "failure" "unknown" ...
## $ y : chr "no" "no" "no" "no" ...
**Complete the code below by filling spaces (“___”).**
Instructions: - Normalize age,
balance, campaign, and previous
to the range [0, 1]. - Use the formula:
(x - min(x)) / (max(x) - min(x)). - Report the first 6
scaled records.
# Define the normalization function
normalize <- function(x) {
return((x - min(x)) / (max(x) - min(x)))
}
# Normalize numeric variables to [0,1]
bank.df$age <- normalize(bank.df$age)
bank.df$balance <- normalize(bank.df$balance)
bank.df$campaign <- normalize(bank.df$campaign)
bank.df$previous <- normalize(bank.df$previous)
# Display first 6 records
head(bank.df, n = 6)## age job marital education default balance housing loan
## 1 0.1617647 unemployed married primary no 0.06845546 no no
## 2 0.2058824 services married secondary no 0.10875022 yes yes
## 3 0.2352941 management single tertiary no 0.06258976 yes no
## 4 0.1617647 management married tertiary no 0.06428102 yes yes
## 5 0.5882353 blue-collar married secondary no 0.04446920 yes no
## 6 0.2352941 management single tertiary no 0.05449591 no no
## contact day month duration campaign pdays previous poutcome y
## 1 cellular 19 oct 79 0.00000000 -1 0.00 unknown no
## 2 cellular 11 may 220 0.00000000 339 0.16 failure no
## 3 cellular 16 apr 185 0.00000000 330 0.04 failure no
## 4 unknown 3 jun 199 0.06122449 -1 0.00 unknown no
## 5 unknown 5 may 226 0.00000000 -1 0.00 unknown no
## 6 cellular 23 feb 141 0.02040816 176 0.12 failure no
Instructions: - Create dummy variables for
education and y using class.ind
from the nnet package. - Combine them with the original
dataframe to create bank.df1. - Rename the new columns
appropriately.
# Create dummy variables
bank.df1 <- cbind(bank.df, class.ind(bank.df$education), class.ind(bank.df$y))
# Rename columns
names(bank.df1)[18:23] <- c(
paste("edu_", c(1, 2, 3, 9), sep = ""),
paste("term_d_", c("yes", "no"), sep = "")
)
# Display column names
names(bank.df1)## [1] "age" "job" "marital" "education" "default"
## [6] "balance" "housing" "loan" "contact" "day"
## [11] "month" "duration" "campaign" "pdays" "previous"
## [16] "poutcome" "y" "edu_1" "edu_2" "edu_3"
## [21] "edu_9" "term_d_yes" "term_d_no"
Instructions: - Select the following attributes:
age, balance, campaign,
previous, edu_1, edu_2,
edu_3, edu_9, term_d_yes,
term_d_no. - Create a new dataframe bank.df2
with only these attributes.
# Select specific attributes
vars <- c("age", "balance", "campaign", "previous", "edu_1", "edu_2", "edu_3", "edu_9", "term_d_yes", "term_d_no")
bank.df2 <- bank.df1[, vars]
# Display column names
names(bank.df2)## [1] "age" "balance" "campaign" "previous" "edu_1"
## [6] "edu_2" "edu_3" "edu_9" "term_d_yes" "term_d_no"
Instructions: - Partition the data into 60% training and 40% test sets. - Use a random seed to ensure reproducibility.
# Partition the data
set.seed(2)
train.index <- sample(rownames(bank.df2), dim(bank.df2)[1] * 0.6)
test.index <- setdiff(rownames(bank.df2), train.index)
train.df <- bank.df2[train.index, ]
test.df <- bank.df2[test.index, ]
# Display dimensions of training and test sets
dim(train.df)## [1] 2712 10
## [1] 1809 10
Instructions: - Fit a neural network with hidden layers (3, 2). - Plot the resulting network.
# Fit neural network
nn <- neuralnet(term_d_yes + term_d_no ~ ., data = train.df, hidden = c(3, 2))
# Plot the neural network
plot(nn, rep = "best")Instructions: - Predict classes for training and
test datasets. - Use CrossTable from the
gmodels package to create a confusion matrix. - Calculate
the accuracy for both training and test predictions using
mean.
# (a) Predictions on training set
training.prediction <- compute(nn, train.df[, -c(9:10)])
training.class <- apply(training.prediction$net.result, 1, which.max) - 1
# Confusion matrix and accuracy for training set
CrossTable(factor(training.class), factor(train.df$term_d_no),
prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE)##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 2712
##
##
## | factor(train.df$term_d_no)
## factor(training.class) | 0 | 1 | Row Total |
## -----------------------|-----------|-----------|-----------|
## 0 | 2369 | 304 | 2673 |
## | 0.874 | 0.112 | |
## -----------------------|-----------|-----------|-----------|
## 1 | 10 | 29 | 39 |
## | 0.004 | 0.011 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 2379 | 333 | 2712 |
## -----------------------|-----------|-----------|-----------|
##
##
## [1] 0.8842183
# (b) Predictions on test set
test.prediction <- compute(nn, test.df[, -c(9:10)])
test.class <- apply(test.prediction$net.result, 1, which.max) - 1
# Confusion matrix and accuracy for test set
CrossTable(factor(test.class), factor(test.df$term_d_no),
prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE)##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 1809
##
##
## | factor(test.df$term_d_no)
## factor(test.class) | 0 | 1 | Row Total |
## -------------------|-----------|-----------|-----------|
## 0 | 1594 | 178 | 1772 |
## | 0.881 | 0.098 | |
## -------------------|-----------|-----------|-----------|
## 1 | 27 | 10 | 37 |
## | 0.015 | 0.006 | |
## -------------------|-----------|-----------|-----------|
## Column Total | 1621 | 188 | 1809 |
## -------------------|-----------|-----------|-----------|
##
##
## [1] 0.8866777
Answer the following questions:
Answer:
The ANN model is very good at predicting “no” (negative) outcomes, but it struggles to predict “yes” (positive) outcomes. It has a high number of true negatives and a low number of true positives, indicating it is biased toward predicting “no”.
The recall for “yes” (TP rate) is quite high in the training set (~74.4%), which means the model is fairly good at identifying most of the positive cases when they are present. However, the precision for “yes” is low (~8.7%), which means when the model predicts “yes”, it is wrong a significant proportion of the time (304 false positives).
In the training set:True Negatives (TN): 2369, True Positives (TP): 29 , False Positives (FP): 304, False Negatives (FN): 10
The recall for “yes” (TP rate) is quite high in the training set (~74.4%), which means the model is fairly good at identifying most of the positive cases when they are present. However, the precision for “yes” is low (~8.7%), which means when the model predicts “yes”, it is wrong a significant proportion of the time (304 false positives).
In the test set: True Negatives (TN): 1594, True Positives (TP): 10, False Positives (FP): 178, False Negatives (FN): 27
In summary, the model seems biased towards predicting “no” due to the class imbalance, but when it does predict “yes”, it is often incorrect. This indicates that the model’s sensitivity to the positive class (“yes”) is low, and it struggles to correctly identify the minority class.
Answer:
The model has high accuracy (~88%) in both training and test sets, but this is mainly due to correctly predicting the dominant “no” class. The low recall and precision for “yes” indicate that the model performs poorly for the minority class. The high accuracy is misleading due to the class imbalance.
Accuracy doesn’t reflect model’s true performance:
While 88.4% accuracy in the training set and 88.7% in the test set might seem good at first glance, it is important to note that these accuracy values are driven largely by the model’s success in predicting the dominant class (“no”). Since the “no” class is more frequent, the model is likely predicting “no” most of the time, which leads to a high accuracy even though it fails to properly identify the minority class (“yes”). Class Imbalance:
The model is heavily biased toward predicting “no”, as demonstrated by the very high number of true negatives (TN) and the low number of true positives (TP) for the “yes” class. This results in low precision and low recall for the “yes” class, which means that although the model correctly identifies “no” outcomes most of the time, it misses many “yes” cases and often misclassifies them as “no”.
Potential Overfitting:
Given the relatively small difference in accuracy between the training (88.4%) and test (88.7%) datasets, there is a chance the model is not overfitting significantly. However, it is still important to consider other performance metrics, such as precision, recall, and F1-score, which are more informative when dealing with imbalanced datasets.
Instructions: - Fit a logistic regression model to
classify the outcome term_d_no. - Use the same predictors
and the same training and test datasets, except edu_9 and
term_d_yes - Display the coefficients using
coef. - Add a summary of the model.
# need to drop columns 8 and 9 (edu_9 and term_d_yes)
train.df.log <- train.df[, -c(8,9)]
test.df.log <- test.df[, -c(8,9)]
# Fit logistic regression model
logistic.model <- glm(term_d_no ~ ., data = train.df.log, family = binomial())
# Display coefficients
coef(logistic.model)## (Intercept) age balance campaign previous edu_1
## -2.433914133 1.211885563 1.023242015 -3.716195024 2.985451634 -0.367928506
## edu_2 edu_3
## 0.003841825 0.261760907
##
## Call:
## glm(formula = term_d_no ~ ., family = binomial(), data = train.df.log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.433914 0.340480 -7.148 8.77e-13 ***
## age 1.211886 0.376046 3.223 0.00127 **
## balance 1.023242 1.363276 0.751 0.45291
## campaign -3.716195 1.346734 -2.759 0.00579 **
## previous 2.985452 0.714903 4.176 2.97e-05 ***
## edu_1 -0.367929 0.336362 -1.094 0.27402
## edu_2 0.003842 0.302848 0.013 0.98988
## edu_3 0.261761 0.307831 0.850 0.39514
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2020.1 on 2711 degrees of freedom
## Residual deviance: 1973.2 on 2704 degrees of freedom
## AIC: 1989.2
##
## Number of Fisher Scoring iterations: 5
Instructions: - Predict the outcomes for the test
set using the logistic regression model. - Use the pROC
library to plot the ROC curve. - Display the AUC value. - Create a
confusion matrix for the predictions with a threshold of 0.50, assuming
“yes” for predicted probabilities greater than 0.50.
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following object is masked from 'package:gmodels':
##
## ci
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Predict probabilities for test set
test.prob <- predict(logistic.model, test.df.log, type = "response")
# Plot ROC curve
roc_curve <- roc(test.df.log$term_d_no,test.prob )## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.6534
# Create confusion matrix with threshold 0.50
test.pred.class <- ifelse(test.prob > 0.50, 1, 0)
CrossTable(factor(test.pred.class), factor(test.df.log$term_d_no),
prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
dnn = c("Predicted", "Actual"))##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 1809
##
##
## | Actual
## Predicted | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 1617 | 188 | 1805 |
## | 0.894 | 0.104 | |
## -------------|-----------|-----------|-----------|
## 1 | 4 | 0 | 4 |
## | 0.002 | 0.000 | |
## -------------|-----------|-----------|-----------|
## Column Total | 1621 | 188 | 1809 |
## -------------|-----------|-----------|-----------|
##
##
## [1] 0.893864
Answer the following questions:
Answer: The logistic model performs well in predicting “no” (negative) outcomes. It correctly predicts 1617 out of 1621”no” cases, giving a high accuracy for the “no” class (around 89%). However, it fails to predict any positive (“yes”) outcomes in the test set, as indicated by the 0 predicted positives.
Answer: The significant predictors in the model are:age (p-value = 0.00127): Significant because the p-value is below the 0.05 threshold.campaign (p-value = 0.00579): Significant due to its p-value being below 0.05.previous (p-value = 2.97e-05): Highly significant with a very low p-value.
The other predictors like balance, edu_1, edu_2, and edu_3 are not significant (p-values > 0.05), suggesting that they do not contribute meaningfully to predicting the outcome.
Answer:
The coefficient for edu_3 (tertiary) is 0.261761, but it is not statistically significant (p-value = 0.39514). This means that having a tertiary education does not significantly affect the likelihood of subscribing to a term deposit compared to the reference category (likely “primary” education). Since it’s not significant, it doesn’t provide useful information for predicting the target variable.
Answer:
The ROC curve is a graphical representation of the trade-off between the true positive rate (sensitivity) and false positive rate (1 - specificity). The AUC (Area Under the Curve) quantifies the overall ability of the model to distinguish between the two classes.
AUC ranges from 0 to 1, where AUC = 1 represents perfect classification and AUC = 0.5represents a random classifier.
In this case, the AUC of 0.893864 suggests that the model does a good job at distinguishing between the “yes” and “no” outcomes, though not perfectly.
Answer:
Yes, there is a significant problem with using accuracy based on a 0.50 threshold for the logistic model. The model has failed to predict any positive (“yes”) cases in the test set, so the accuracy estimate (~89%) is misleading.
The high accuracy is primarily due to the model correctly predicting “no” (the majority class), but it fails to identify the minority class (“yes”).
This issue is a result of class imbalance, where the majority class dominates the predictions, leading to inflated accuracy despite poor performance on the minority class.