title: '
`r Sys.Date()`

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, comment = " ", error = FALSE, fig.align = 'center')
```
**Full Name:** Lina Maslovaite 14461366
Online Assignment: There is a new online assignment on DataCamp named “Chapter 1: k-Nearest Neighbors (kNN)”, which is part of the online course “Supervised Learning in R: Classification”. The online assignments on DataCamp are not mandatory.
Your task is to answer the following questions in Part 1 and Part 2 in this R-markdown file. Please upload both your R-markdown (.Rmd file) and the HTML files separately on Canvas. Note that your R-markdown (.Rmd file) and the HTML files have to be in the right format.
Here, we are going to use the following R packages:

- liver: we use the `partition()` and `kNN()` functions in this package.
- pROC: we use the `roc()` and `auc()` functions in this package.
- ggplot2: we use it to customize the ROC plots.

If needed, install these packages on your computer. Here we load them:

```{r}
library(liver)
library(ggplot2)
library(pROC)
```
We aim to identify customer segments through the analysis of data from customers who have subscribed to a term deposit. This will enable us to determine the characteristics of customers who are more inclined to purchase the product.
The goal is to find the best strategies to improve the next marketing campaign. How can the financial institution be more effective in future marketing campaigns? To make a data-driven decision, we need to analyze the last marketing campaign the bank performed and identify the patterns that will help us draw conclusions and develop future strategies.
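Two main approaches for enterprises to promote products/services are mass campaigns, which target the general public, and direct marketing, which targets a specific set of contacts.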
In general, positive responses to mass campaigns are typically very low (less than 1%). On the other hand, direct marketing focuses on targets that are expected to be keener on that specific product/service, making this kind of campaign more effective. However, direct marketing has some drawbacks; for instance, it may trigger a negative attitude towards banks due to the intrusion of privacy.
Banks are interested in increasing their financial assets. One strategy is to offer attractive long-term deposit applications with good interest rates, in particular by using directed marketing campaigns. Also, the same drivers are pressing for a reduction in costs and time. Thus, there is a need for an improvement in efficiency: fewer contacts should be made while keeping approximately the same number of successes (clients subscribing to the deposit).
A Term Deposit is a deposit that a bank or a financial institution offers with a fixed rate (often better than just opening a deposit account), in which your money will be returned at a specific maturity time. For more information with regards to Term Deposits please check here.
The bank dataset is related to direct marketing campaigns of a Portuguese banking institution. You can find more information related to this dataset at: https://rdrr.io/cran/liver/man/bank.html
The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required to assess whether the product (bank term deposit) would be subscribed to or not. The classification goal is to predict whether the client will subscribe to a term deposit (variable deposit).
We import the bank dataset:

```{r}
data(bank)
```

We can see the structure of the dataset by using the `str()` function:

```{r}
str(bank)
```
It shows that the bank dataset is a data.frame with `r ncol(bank)` variables and `r nrow(bank)` observations. The dataset has `r ncol(bank) - 1` predictors along with the target variable deposit, which is a binary variable with 2 levels, "yes" and "no". The variables in this dataset are:
- age: numeric.
- job: type of job; categorical: "admin.", "unknown", "unemployed", "management", "housemaid", "entrepreneur", "student", "blue-collar", "self-employed", "retired", "technician", "services".
- marital: marital status; categorical: "married", "divorced", "single"; note: "divorced" means divorced or widowed.
- education: categorical: "secondary", "primary", "tertiary", "unknown".
- default: has credit in default?; binary: "yes", "no".
- balance: average yearly balance, in euros; numeric.
- housing: has housing loan?; binary: "yes", "no".
- loan: has personal loan?; binary: "yes", "no".

Related with the last contact of the current campaign:

- contact: contact communication type; categorical: "unknown", "telephone", "cellular".
- day: last contact day of the month; numeric.
- month: last contact month of year; categorical: "jan", "feb", "mar", ..., "nov", "dec".
- duration: last contact duration, in seconds; numeric.

Other attributes:

- campaign: number of contacts performed during this campaign and for this client; numeric, includes last contact.
- pdays: number of days that passed by after the client was last contacted from a previous campaign; numeric, -1 means client was not previously contacted.
- previous: number of contacts performed before this campaign and for this client; numeric.
- poutcome: outcome of the previous marketing campaign; categorical: "success", "failure", "unknown", "other".

Target variable:

- deposit: indicator of whether the client subscribed to a term deposit; binary: "yes" or "no".

Here we report the summary of the dataset:

```{r}
summary(bank)
```
We partition the bank dataset randomly into two groups: a train set (80%) and a test set (20%). Here, we use the `partition()` function from the liver package:

```{r}
set.seed(5)

data_sets = partition(data = bank, ratio = c(0.8, 0.2))

train_set = data_sets$part1
test_set  = data_sets$part2

actual_test = test_set$deposit
```

Note that here we are using the `set.seed()` function to create reproducible results.
We want to validate the partition by testing whether the proportion of the target variable deposit differs between the two data sets. We use a Two-Sample Z-Test for the difference in proportions. To run the test, we use the `prop.test()` function in R:

```{r}
x1 = sum(train_set$deposit == "yes")
x2 = sum(test_set$deposit == "yes")

n1 = nrow(train_set)
n2 = nrow(test_set)

prop.test(x = c(x1, x2), n = c(n1, n2))
```
Based on the output, answer the following questions:
a. Why is the above hypothesis test suitable for the above research question? Provide your reasons.
b. Specify the null and alternative hypotheses.
c. Explain whether you reject or do not reject the null hypothesis at $\alpha=0.05$. What would be your statistical conclusion?

From the output of the `prop.test()` function we see that p-value = 0.09533. Since the p-value is greater than $\alpha=0.05$, we cannot reject the null hypothesis. Thus, there is no statistically significant difference in the proportion of deposit = "yes" between the training and test sets.

d. What would be a non-statistical interpretation of your findings in c?
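As a rough illustration of what the two-sample test for proportions computes, the sketch below works the z statistic out by hand and compares it with `prop.test()`. The counts are hypothetical and are not taken from the bank data. Note that `correct = FALSE` is set here so that the chi-squared statistic equals z squared exactly; the call in this assignment uses the default continuity correction, so its numbers would differ slightly.

```r
# Hypothetical counts, NOT taken from the bank data
x <- c(420, 95)     # number of "yes" in each group
n <- c(3600, 900)   # group sizes

p_hat  <- x / n                                   # sample proportions
p_pool <- sum(x) / sum(n)                         # pooled proportion under H0
se     <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))
z      <- (p_hat[1] - p_hat[2]) / se              # z statistic
p_val  <- 2 * pnorm(-abs(z))                      # two-sided p-value

# Without continuity correction, prop.test() reports X-squared = z^2
pt <- prop.test(x = x, n = n, correct = FALSE)
c(z_squared = z^2, X_squared = unname(pt$statistic))
c(p_by_hand = p_val, p_prop_test = pt$p.value)
```

If the p-value is above the chosen significance level, the data give no evidence that the two proportions differ, which is the outcome one hopes for when validating a partition.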
The results from the "Exploratory Data Analysis (EDA)" (from last week) indicate that, among the `r ncol(bank) - 1` predictors in the bank dataset, the following are important to predict deposit: age, default, balance, housing, loan, duration, campaign, pdays, and previous.
Thus, based on the training dataset, we want to apply the kNN algorithm using the above predictors in our model. We use the following formula:

```{r}
formula = deposit ~ age + default + balance + housing + loan + duration + campaign + pdays + previous
```

NOTE: The above formula means that deposit is the target variable and the variables on the right side of the tilde ("~") are the independent variables.
Based on the training dataset, we want to find the k-nearest neighbors for the test data set. Here we use two different values for k (k = 3 and k = 10). We use the `kNN()` function from the R package liver:

```{r}
predict_knn_3 = kNN(formula, train = train_set, test = test_set, k = 3)

predict_knn_10 = kNN(formula, train = train_set, test = test_set, k = 10)
```
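To make the idea behind kNN concrete, here is a minimal base-R sketch of the classification rule: compute the distance from a test point to every training point, take the k closest, and predict by majority vote. This is only an illustration on hypothetical toy data; the helper `knn_one()` is made up for this example and is not how `liver::kNN()` is implemented.

```r
# Toy data (hypothetical): two numeric predictors and a binary label
train_x <- matrix(c(1, 1,
                    1, 2,
                    2, 1,
                    6, 6,
                    6, 7,
                    7, 6), ncol = 2, byrow = TRUE)
train_y <- factor(c("no", "no", "no", "yes", "yes", "yes"))
test_x  <- matrix(c(1.5, 1.5,
                    6.5, 6.5), ncol = 2, byrow = TRUE)

# Classify one point by majority vote among its k nearest training points
# (hypothetical helper; ties go to the first factor level)
knn_one <- function(x, train_x, train_y, k) {
  d <- sqrt(rowSums(sweep(train_x, 2, x)^2))   # Euclidean distances
  nearest <- order(d)[seq_len(k)]              # indices of the k closest points
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

toy_pred <- apply(test_x, 1, knn_one, train_x = train_x, train_y = train_y, k = 3)
toy_pred
```

The first test point sits among the "no" points and the second among the "yes" points, so the three nearest neighbors vote unanimously in each case.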
To have an overview of the prediction results, we report the confusion matrix for the two values of k by using the `conf.mat()` function:

```{r}
(conf_knn_3 = conf.mat(predict_knn_3, actual_test))

(conf_knn_10 = conf.mat(predict_knn_10, actual_test))
```

We could also report the confusion matrix by using the `conf.mat.plot()` command:

```{r fig.show = "hold", out.width = "50%", fig.align = 'default'}
conf.mat.plot(predict_knn_3, actual_test, main = "kNN with k = 3")

conf.mat.plot(predict_knn_10, actual_test, main = "kNN with k = 10")
```
What do these values mean? Explain what conclusion you will draw.
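When interpreting a confusion matrix, it can help to turn the four cells into rates. The sketch below does this with base R's `table()` on hypothetical labels (not results from the bank data), treating "yes" as the positive class.

```r
# Hypothetical labels, NOT results from the bank data
actual_toy <- factor(c("yes", "yes", "yes", "no", "no", "no", "no", "no", "no", "no"),
                     levels = c("yes", "no"))
pred_toy   <- factor(c("yes", "no", "no", "no", "no", "no", "no", "no", "no", "yes"),
                     levels = c("yes", "no"))

cm <- table(Predicted = pred_toy, Actual = actual_toy)

TP <- cm["yes", "yes"]; FN <- cm["no", "yes"]   # actual "yes" cases
FP <- cm["yes", "no"];  TN <- cm["no", "no"]    # actual "no" cases

accuracy    <- (TP + TN) / sum(cm)   # share of all cases classified correctly
sensitivity <- TP / (TP + FN)        # share of actual "yes" that were found
specificity <- TN / (TN + FP)        # share of actual "no" that were found
```

In this toy case the accuracy is 0.7 although only one of the three actual "yes" cases is detected, which shows why accuracy alone can be misleading when the classes are imbalanced.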
To evaluate the accuracy of the predictions, we calculate the Mean Square Error (MSE) by using the `mse()` function from the liver package:

```{r}
MSE_3 = mse(predict_knn_3, actual_test)
MSE_3

MSE_10 = mse(predict_knn_10, actual_test)
MSE_10
```

For the case k = 3, the MSE = `r round(MSE_3, 3)` and for the case k = 10, the MSE = `r round(MSE_10, 3)`. What do these values mean? Explain what conclusion you will draw.
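One way to read an MSE for class predictions is sketched below: if both classes are coded as 0/1, each squared error is either 0 or 1, so the MSE coincides with the misclassification rate. The labels are hypothetical, and the 0/1 coding is an assumption made for this illustration; it is not a statement about how `liver::mse()` handles factors internally.

```r
# Hypothetical labels, NOT results from the bank data
actual_toy <- c("yes", "no", "no", "yes", "no")
pred_toy   <- c("yes", "no", "yes", "no", "no")

# Assumed coding: "yes" = 1, "no" = 0, so a squared error is either 0 or 1
a <- as.numeric(actual_toy == "yes")
p <- as.numeric(pred_toy == "yes")

mse_toy    <- mean((p - a)^2)               # mean squared error on the 0/1 coding
error_rate <- mean(pred_toy != actual_toy)  # share of misclassified cases
c(mse = mse_toy, error_rate = error_rate)
```

Under this coding, a smaller MSE simply means fewer misclassified test cases.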
To report the ROC curve, we need the predicted class probabilities. We can obtain them by using:

```{r}
prob_knn_3 = kNN(formula, train = train_set, test = test_set, k = 3, type = "prob")[, 1]

prob_knn_10 = kNN(formula, train = train_set, test = test_set, k = 10, type = "prob")[, 1]
```
To visualize the model performance, we could report the ROC curve plot by using the `roc()` and `ggroc()` functions from the pROC package:

```{r}
roc_knn_3 = roc(actual_test, prob_knn_3)
roc_knn_10 = roc(actual_test, prob_knn_10)

ggroc(list(roc_knn_3, roc_knn_10), size = 0.8) +
    theme_minimal() + ggtitle("ROC plots with AUC for kNN") +
    scale_color_manual(values = c("red", "blue"),
                       labels = c(paste("k=3 ; AUC=", round(auc(roc_knn_3), 3)),
                                  paste("k=10; AUC=", round(auc(roc_knn_10), 3)))) +
    theme(legend.title = element_blank()) +
    theme(legend.position = c(.7, .3), text = element_text(size = 17)) +
    geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color = "grey", linetype = "dashed")
```
In the above plot, 'red' curve is for the case k = 3 and the 'blue' curve is for the case k = 10.
Explain what conclusion you will draw. Do we need to report AUC (Area Under the Curve) as well?
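As background for interpreting the AUC, the sketch below computes it by hand on hypothetical scores, using the fact that the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties counted as one half). This is an illustration of the definition, not the routine that `pROC::auc()` uses internally.

```r
# Hypothetical scores: predicted probability of the positive class
score <- c(0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1)
label <- c(1,   1,   1,   0,   0,   0,   0)    # 1 = positive, 0 = negative

pos <- score[label == 1]
neg <- score[label == 0]

# Compare every positive with every negative; ties count as 1/2
cmp <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
auc_toy <- mean(cmp)
auc_toy
```

Here 11 of the 12 positive-negative pairs are ranked correctly, so the AUC is 11/12. A value of 0.5 corresponds to the dashed diagonal in the ROC plot (random ranking), and 1 to a perfect ranking.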
The predictors that we used in the previous question do not have the same scale. For example, the variable duration ranges between `r min(bank$duration)` and `r max(bank$duration)`, whereas the variable loan is binary. In this case, the values of duration will overwhelm the contribution of loan in the distance computation. To avoid this situation, we apply min-max normalization to transform the predictors.
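Min-max normalization maps a variable to the interval [0, 1] via (x - min) / (max - min). The sketch below shows this on hypothetical duration values; the helper `min_max()` is made up for this example. Scaling the test set with the training minimum and maximum is one common choice and is an assumption here, not a description of what `scaler = "minmax"` does inside `liver::kNN()`.

```r
# Hypothetical helper: min-max scaling with optional reference bounds
min_max <- function(x, lo = min(x), hi = max(x)) (x - lo) / (hi - lo)

# Hypothetical duration values in seconds, NOT taken from the bank data
duration_train <- c(10, 250, 3000)
duration_test  <- c(130, 1505)

scaled_train <- min_max(duration_train)
# Reuse the training bounds so both sets live on the same scale
scaled_test  <- min_max(duration_test,
                        lo = min(duration_train),
                        hi = max(duration_train))
scaled_train
scaled_test
```

After scaling, duration lies on the same 0 to 1 range as a binary 0/1 predictor, so neither dominates the Euclidean distance purely because of its units.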
Now, we find the k-nearest neighbors for the test set, based on the training dataset, for k = 10:

```{r}
predict_knn_10_trans = kNN(formula, train = train_set, test = test_set, scaler = "minmax", k = 10)

conf.mat.plot(predict_knn_10_trans, actual_test)
```

To report the ROC curve, we need the predicted class probabilities. We can obtain them by using:

```{r}
prob_knn_10 = kNN(formula, train = train_set, test = test_set, k = 10, type = "prob")[, 1]

prob_knn_10_trans = kNN(formula, train = train_set, test = test_set, scaler = "minmax", k = 10, type = "prob")[, 1]
```
To compare the model performance on the raw data and the transformed data, we could report the ROC curve plot as well as the AUC (Area Under the Curve) by using the `roc()`, `auc()`, and `ggroc()` functions from the pROC package:

```{r}
roc_knn_10 = roc(actual_test, prob_knn_10)
roc_knn_10_trans = roc(actual_test, prob_knn_10_trans)

ggroc(list(roc_knn_10, roc_knn_10_trans), size = 0.8) +
    theme_minimal() + ggtitle("ROC plots with AUC for kNN") +
    scale_color_manual(values = c("red", "blue"),
                       labels = c(paste("Raw data ; AUC=", round(auc(roc_knn_10), 3)),
                                  paste("Transformed data; AUC=", round(auc(roc_knn_10_trans), 3)))) +
    theme(legend.title = element_blank()) +
    theme(legend.position = c(.7, .3), text = element_text(size = 17)) +
    geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color = "grey", linetype = "dashed")
```
In the above plot, the red curve is for the raw dataset and the blue curve is for the transformed dataset.
Explain what conclusion you will draw. Based on the values of AUC (Area Under the Curve), explain what conclusion you will draw.
In the previous questions, for finding the k-nearest neighbor for the test set, we set k = 10. But why 10? Here, we want to find out the optimal value of k based on our dataset.
To find the optimal value of k based on the Error Rate, we run the k-nearest neighbor algorithm on the test set for values of k from 1 to 30 and compute the Error Rate of each model by running the `kNN.plot()` command:

```{r}
kNN.plot(formula, train = train_set, test = test_set, scaler = "minmax", k.max = 30, set.seed = 7)
```
Based on the plot, what value of k is optimal? Provide your reasons.
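The kind of search that such a plot summarizes can be sketched in base R: fit kNN for each candidate k, record the test error rate, and pick the k with the smallest error. The example below uses the built-in `iris` data and a small hand-written `knn_predict()` helper, both chosen only for illustration; it is not the code behind `kNN.plot()`, and the number of candidate values (15) is arbitrary.

```r
# Minimal base-R kNN (hypothetical helper, not liver's kNN())
knn_predict <- function(train_x, train_y, test_x, k) {
  apply(test_x, 1, function(x) {
    d <- rowSums(sweep(train_x, 2, x)^2)            # squared Euclidean distances
    votes <- table(train_y[order(d)[seq_len(k)]])   # labels of the k nearest points
    names(votes)[which.max(votes)]                  # majority vote
  })
}

set.seed(7)
x   <- as.matrix(iris[, 1:4])
y   <- iris$Species
idx <- sample(nrow(x), size = 105)                  # 70% train, 30% test

k_values   <- 1:15
error_rate <- sapply(k_values, function(k) {
  pred <- knn_predict(x[idx, ], y[idx], x[-idx, ], k)
  mean(pred != y[-idx])                             # test error rate for this k
})

best_k <- k_values[which.min(error_rate)]           # smallest k with minimal error
plot(k_values, error_rate, type = "b", xlab = "k", ylab = "Error rate")
```

Because the error rate is estimated on a single test set, several values of k often perform almost equally well; in that case the choice among them is a judgment call rather than a sharp optimum.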
Apply the kNN algorithm to analyze the churn dataset which is available in the R package liver.
Import the churn dataset in R and report the structure and summary of the dataset, using the `str()` and `summary()` functions.

```{r}
library(liver)

data(churn)

str(churn)

summary(churn)
```
First, partition the churn dataset randomly into two groups as a train set (70%) and test set (30%). Then, validate the partition for a couple of variables; for example, you could validate the partition by testing whether the proportion of churners differs between the two datasets. Or you could validate the partition by testing whether the population means for the number of customer service calls differs between the two datasets.
```{r}
set.seed(123)

data_sets <- partition(data = churn, ratio = c(0.7, 0.3))

train_set <- data_sets$part1
test_set  <- data_sets$part2
```

```{r}
x1 <- sum(train_set$churn == "yes")
x2 <- sum(test_set$churn == "yes")

n1 <- nrow(train_set)
n2 <- nrow(test_set)

prop.test(x = c(x1, x2), n = c(n1, n2))
```
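The task also suggests validating the partition on a numeric variable, such as the number of customer service calls. A minimal sketch of that check with base R's `t.test()` follows. The two vectors are simulated stand-ins, not the churn data; with the real data one would pass the customer.calls columns of the train and test sets instead.

```r
set.seed(1)
# Simulated stand-ins for customer.calls in the train and test sets (hypothetical)
calls_train <- rpois(700, lambda = 1.5)
calls_test  <- rpois(300, lambda = 1.5)

# Welch two-sample t-test; H0: the two population means are equal
tt <- t.test(calls_train, calls_test)
tt$method
tt$p.value
```

A p-value above the chosen significance level would give no evidence that the mean differs between the two sets, which supports the partition.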
Based on the training dataset, find the k-nearest neighbors for the test data set. For this, use all 19 predictors of the churn dataset. You should use min-max normalization to transform the predictors. Note that you should first find the optimal value of k. Finally, evaluate the accuracy of the predictions by reporting the confusion matrix and the ROC curve with AUC.
Formula

```{r}
formula <- churn ~ .
```

Optimal value of k

```{r}
kNN.plot(formula, train = train_set, test = test_set, scaler = "minmax", k.max = 30, set.seed = 123)
```
The optimal value of k = 10
```{r}
pred <- kNN(formula, train = train_set, test = test_set, scaler = "minmax", k = 10)

actual <- test_set$churn
```

```{r}
conf.mat.plot(pred, actual, main = "Confusion Matrix for kNN")
```

```{r}
prob_knn <- kNN(formula, train = train_set, test = test_set, scaler = "minmax", k = 10, type = "prob")[, 1]

roc_knn <- roc(actual, prob_knn)

ggroc(roc_knn, size = 1) +
    theme_minimal() +
    ggtitle(paste("ROC Curve for kNN (k = 10) - AUC =", round(auc(roc_knn), 3))) +
    geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color = "grey", linetype = "dashed")
```
In the previous exercises for the churn dataset, you were supposed to use all 19 predictors for the analysis. But we know from the lecture of week 2 that we should use only those predictors that have a relationship with the target variable. So, here we use the following predictors:
account.length, voice.plan,
voice.messages, intl.plan,
intl.mins, day.mins, eve.mins,
night.mins, and customer.calls.
First, based on the above predictors, find the k-nearest neighbors for the test set. You should use min-max normalization to transform the predictors. Note that you should first find the optimal value of k. Finally, evaluate the accuracy of the predictions by reporting the confusion matrix and the ROC curve with AUC.
Formula

```{r}
formula3 <- churn ~ account.length + voice.plan + voice.messages + intl.plan + intl.mins + day.mins + eve.mins + night.mins + customer.calls
```

Optimal value of k

```{r}
kNN.plot(formula3, train = train_set, test = test_set, scaler = "minmax", k.max = 30, set.seed = 123)
```

The optimal value of k = 7

```{r}
predict_knn <- kNN(formula3, train = train_set, test = test_set, scaler = "minmax", k = 7)
```

Confusion Matrix

```{r}
conf.mat.plot(predict_knn, actual)
```
ROC curve and AUC

```{r}
prob_knn <- kNN(formula3, train = train_set, test = test_set, scaler = "minmax", k = 7, type = "prob")[, 1]

roc_knn <- roc(actual, prob_knn)

ggroc(roc_knn, size = 1) +
    theme_minimal() +
    ggtitle(paste("ROC Curve for kNN - AUC =", round(auc(roc_knn), 3))) +
    geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color = "grey", linetype = "dashed")
```
Compare the results with the previous section. What would be your conclusion?
To conclude, comparing the results of the two sections suggests that using a carefully selected subset of predictors (those with a known relationship to churn) can yield a model that is both accurate and more interpretable than the model based on all 19 predictors.