---
title: "The Exercises of Week 4 - Classification and Model Evaluation"
author: "Reza Mohammadi"
date: "`r Sys.Date()`"
output: 
  html_document:
    number_sections: true
    fig_caption: true
    toc: true
    fig_width: 7
    fig_height: 5
    theme: cosmo
    highlight: tango
    code_folding: show
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, 
                      comment = " ", error = FALSE, fig.align = 'center')
```

**Full Name:** Lina Maslovaite 14461366

Online Assignment: There is a new online assignment on DataCamp named "Chapter 1: k-Nearest Neighbors (kNN)", which is part of the online course "Supervised Learning in R: Classification". The online assignments on DataCamp are not mandatory.

Your task is to answer the following questions in Part 1 and Part 2 in this R Markdown file. Please upload both your R Markdown (.Rmd) file and the HTML file separately on Canvas. Note that both files have to be in the right format.

Here, we are going to use the following R packages:

  • liver: the bank and churn datasets are in this package. We also use the partition() and kNN() functions from this package.
  • ggplot2: we use this package for visualization.
  • pROC: to create ROC curves and compute AUC values, we use the roc() and auc() functions from this package.

If needed, install these packages on your computer. Here we load them:

```{r}
library(liver)    # bank and churn datasets, partition() and kNN()
library(ggplot2)  # visualization
library(pROC)     # roc() and auc() for ROC curves and AUC values
```

# Predicting Term Deposit Subscriptions using 'bank' dataset (40 points)

We aim to identify customer segments through the analysis of data from customers who have subscribed to a term deposit. This will enable us to determine the characteristics of customers who are more inclined to purchase the product.

## Problem Understanding

Find the best strategies to improve the next marketing campaign. How can the financial institution achieve greater effectiveness in future marketing campaigns? To make a data-driven decision, we need to analyze the last marketing campaign the bank performed and identify the patterns that will help us draw conclusions for developing future strategies.

### Bank direct marketing info

Two main approaches for enterprises to promote products/services are:

  • mass campaigns: targeting general indiscriminate public,
  • directed marketing, targeting a specific set of contacts.

Positive responses to mass campaigns are typically very low (less than 1%). Direct marketing, on the other hand, focuses on targets that are keener on that specific product/service, making this kind of campaign more effective. However, direct marketing has some drawbacks; for instance, it may trigger a negative attitude towards banks due to the intrusion of privacy.

Banks are interested in increasing their financial assets. One strategy is to offer attractive long-term deposit applications with good interest rates, in particular by using directed marketing campaigns. At the same time, the same drivers are pressing for a reduction in costs and time. Thus, there is a need for improved efficiency: fewer contacts should be made, while keeping approximately the same number of successes (clients subscribing to the deposit).

### What is a Term Deposit?

A Term Deposit is a deposit that a bank or a financial institution offers with a fixed rate (often better than just opening a deposit account), in which your money will be returned at a specific maturity time. For more information regarding Term Deposits, please check here.

## Data Understanding

The bank dataset is related to direct marketing campaigns of a Portuguese banking institution. You can find more information related to this dataset at: https://rdrr.io/cran/liver/man/bank.html

The marketing campaigns were based on phone calls. Often, more than one contact (to the same client) was required, to assess whether or not the product (bank term deposit) would be subscribed. The classification goal is to predict if the client will subscribe to a term deposit (variable deposit).

We import the bank dataset:

```{r}
data(bank)
```

We can see the structure of the dataset by using the str() function:

```{r}
str(bank)
```

It shows that the bank dataset, as a data.frame, has `r ncol(bank)` variables and `r nrow(bank)` observations. The dataset has `r ncol(bank) - 1` predictors along with the target variable deposit, which is a binary variable with 2 levels, "yes" and "no". The variables in this dataset are:

  • age: numeric.
  • job: type of job; categorical: "admin.", "unknown", "unemployed", "management", "housemaid", "entrepreneur", "student", "blue-collar", "self-employed", "retired", "technician", "services".
  • marital: marital status; categorical: "married", "divorced", "single"; note: "divorced" means divorced or widowed.
  • education: categorical: "secondary", "primary", "tertiary", "unknown".
  • default: has credit in default?; binary: "yes","no".
  • balance: average yearly balance, in euros; numeric.
  • housing: has housing loan? binary: "yes", "no".
  • loan: has personal loan? binary: "yes", "no".

Related with the last contact of the current campaign:

  • contact: contact communication type; categorical: "unknown", "telephone", "cellular".
  • day: last contact day of the month; numeric.
  • month: last contact month of year; categorical: "jan", "feb", "mar", ..., "nov", "dec".
  • duration: last contact duration, in seconds; numeric.

Other attributes:

  • campaign: number of contacts performed during this campaign and for this client; numeric, includes last contact.
  • pdays: number of days that passed by after the client was last contacted from a previous campaign; numeric, -1 means client was not previously contacted.
  • previous: number of contacts performed before this campaign and for this client; numeric.
  • poutcome: outcome of the previous marketing campaign; categorical: "success", "failure", "unknown", "other".

Target variable:

  • deposit: indicator of whether the client subscribed to a term deposit; binary: "yes" or "no".

Here we report the summary of the dataset:

```{r}
summary(bank)
```

## Data Setup to Model

We partition the bank dataset randomly into two groups: train set (80%) and test set (20%). Here, we use the partition() function from the liver package:

```{r}
set.seed(5)

data_sets = partition(data = bank, ratio = c(0.8, 0.2))

train_set = data_sets$part1
test_set  = data_sets$part2

actual_test = test_set$deposit
```

Note that here we are using the set.seed() function to create reproducible results.
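As a quick illustration (our own sketch, not part of the assignment), rerunning partition() with the same seed reproduces exactly the same split:

```{r}
# Illustrative check: the same seed yields an identical partition
set.seed(5)
split_a = partition(data = bank, ratio = c(0.8, 0.2))

set.seed(5)
split_b = partition(data = bank, ratio = c(0.8, 0.2))

identical(split_a$part1, split_b$part1)  # expected: TRUE
```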

We want to validate the partition by testing whether the proportion of the target variable deposit differs between the two data sets. We use a Two-Sample Z-Test for the difference in proportions. To run the test, we use the prop.test() function in R:

```{r}
x1 = sum(train_set$deposit == "yes")
x2 = sum(test_set$deposit == "yes")

n1 = nrow(train_set)
n2 = nrow(test_set)

prop.test(x = c(x1, x2), n = c(n1, n2))
```

Based on the output, answer the following questions:

a. Why is the above hypothesis test suitable for the above research question? Provide your reasons.

  • First of all, we are comparing proportions between two independent samples (train and test sets) and the sample sizes are large enough to justify the use of a Z-test. Moreover, the outcome variable (deposit) is binary and our goal is to ensure that partitioning did not introduce bias in the distribution of the target variable.
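As a cross-check (our own sketch, not required by the question), the z-statistic can also be computed by hand from the pooled proportion; note that prop.test() applies a continuity correction by default, so its chi-squared statistic equals $z^2$ only when called with correct = FALSE:

```{r}
# Hand computation of the two-proportion z-test (illustrative sketch)
p_pool = (x1 + x2) / (n1 + n2)   # pooled proportion under the null hypothesis
z = (x1 / n1 - x2 / n2) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

z                     # z-statistic
2 * pnorm(-abs(z))    # two-sided p-value; matches prop.test(..., correct = FALSE)
```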

b. Specify the null and alternative hypotheses.

  • Let $p_1$ be the proportion of "yes" responses in the training set and $p_2$ the proportion of "yes" responses in the test set. Null hypothesis ($H_0: p_1 = p_2$): the proportion of term deposit subscriptions is the same in both sets. Alternative hypothesis ($H_a: p_1 \neq p_2$): the proportions are different between the training and test datasets.

c. Explain whether you reject or do not reject the null hypothesis, at $\alpha=0.05$. What would be your statistical conclusion?

  • From the output of the prop.test() function we see that the p-value = 0.09533. Since the p-value is greater than $\alpha=0.05$, we cannot reject the null hypothesis. Thus, there is no statistically significant difference in the proportion of subscribers between the training and test sets.

d. What would be a non-statistical interpretation of your findings in c?

  • The way the dataset was split into training and test sets seems fair, because the proportion of customers who subscribed to a term deposit is roughly the same in both groups. This means that model training and evaluation will be based on representative samples, and the results should generalize well.

## Classification using the kNN algorithm

The results from the "Exploratory Data Analysis (EDA)" (from last week) indicate that, out of the `r ncol(bank) - 1` predictors in the bank dataset, the following are important for predicting deposit:

age, default, balance, housing, loan, duration, campaign, pdays, and previous.

Thus, based on the training dataset, we want to apply the kNN algorithm, using the above predictors in our model. We use the following formula:

```{r}
formula = deposit ~ age + default + balance + housing + loan + duration + campaign + pdays + previous
```

NOTE: The above formula means that deposit is the target variable and the variables on the right side of the tilde ("~") are the predictors.
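For instance (illustrative only), we can list the variables involved in the formula with the base R function all.vars(), which returns the target first:

```{r}
all.vars(formula)  # "deposit" followed by the nine predictors
```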

Based on the training dataset, we want to find the k-nearest neighbors for the test data set. Here we use two different values for k (k = 3 and k = 10). We use the kNN() function from the R package liver:

```{r}
predict_knn_3  = kNN(formula, train = train_set, test = test_set, k = 3)

predict_knn_10 = kNN(formula, train = train_set, test = test_set, k = 10)
```

To get an overview of the prediction results, we report the confusion matrix for the two values of k by using the conf.mat() function:

```{r}
(conf_knn_3  = conf.mat(predict_knn_3,  actual_test))

(conf_knn_10 = conf.mat(predict_knn_10, actual_test))
```

We could also report the confusion matrix by using the conf.mat.plot() command:

```{r fig.show = "hold", out.width = "50%", fig.align = 'default'}
conf.mat.plot(predict_knn_3,  actual_test, main = "kNN with k = 3")

conf.mat.plot(predict_knn_10, actual_test, main = "kNN with k = 10")
```

What do these values mean? Explain what conclusion you will draw.

  • k = 3 correctly identifies more "yes" clients (28 vs. 12), i.e., it has fewer false negatives, while k = 10 produces fewer false positives but misses more "yes" clients. Therefore, based on the confusion matrices alone, k = 3 seems to reduce the number of missed "yes" predictions, which could be more valuable for a marketing campaign aiming to identify potential subscribers.
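To make the comparison concrete (our own sketch; we assume conf.mat() returns a 2x2 table with correct predictions on the diagonal, so check the liver documentation for the exact row/column orientation), the overall error rates can be derived directly from the confusion-matrix counts:

```{r}
# Illustrative sketch: overall accuracy and error rate from the 2x2 confusion
# matrices; assumes correct predictions lie on the diagonal of conf.mat() output
accuracy_3  = sum(diag(conf_knn_3))  / sum(conf_knn_3)
accuracy_10 = sum(diag(conf_knn_10)) / sum(conf_knn_10)

c(error_k3 = 1 - accuracy_3, error_k10 = 1 - accuracy_10)
```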

## Model evaluation by MSE

To evaluate the accuracy of the predictions, we calculate the Mean Square Error (MSE) by using the mse() function from the liver package:

```{r}
MSE_3 = mse(predict_knn_3, actual_test)
MSE_3

MSE_10 = mse(predict_knn_10, actual_test)
MSE_10
```

For the case k = 3, the MSE = `r round(MSE_3, 3)` and for the case k = 10, the MSE = `r round(MSE_10, 3)`. What do these values mean? Explain what conclusion you will draw.

  • A lower MSE indicates better predictive accuracy; therefore, the model with k = 10 performs slightly better, because its MSE is lower than that of k = 3. This suggests that using more neighbors (k = 10) helps the model generalize better and reduces the classification error, which can lead to more efficient targeting in future campaigns.
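As a sanity check (our own sketch): for a binary target, the MSE of class predictions reduces to the misclassification rate, so it can be recomputed directly:

```{r}
# Illustrative check: for binary class labels, MSE = misclassification rate
mean(predict_knn_3  != actual_test)   # should match MSE_3
mean(predict_knn_10 != actual_test)   # should match MSE_10
```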

## Visualizing Model Performance by ROC curve

To report the ROC curve, we need the predicted probabilities of our classification. We can obtain them by using:

```{r}
prob_knn_3  = kNN(formula, train = train_set, test = test_set, k = 3,  type = "prob")[, 1]

prob_knn_10 = kNN(formula, train = train_set, test = test_set, k = 10, type = "prob")[, 1]
```

To visualize the model performance, we report the ROC curve plot by using the roc() and ggroc() functions from the pROC package:

```{r}
roc_knn_3  = roc(actual_test, prob_knn_3)

roc_knn_10 = roc(actual_test, prob_knn_10)

ggroc(list(roc_knn_3, roc_knn_10), size = 0.8) + 
    theme_minimal() + 
    ggtitle("ROC plots with AUC for kNN") + 
    scale_color_manual(values = c("red", "blue"), 
                       labels = c(paste("k=3 ; AUC=", round(auc(roc_knn_3),  3)), 
                                  paste("k=10; AUC=", round(auc(roc_knn_10), 3)))) + 
    theme(legend.title = element_blank()) + 
    theme(legend.position = c(.7, .3), text = element_text(size = 17)) + 
    geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color = "grey", linetype = "dashed")
```

In the above plot, the 'red' curve is for the case k = 3 and the 'blue' curve is for the case k = 10.

Explain what conclusion you will draw. Do we need to report AUC (Area Under the Curve) as well?

  • Overall, a ROC curve closer to the top-left corner indicates better performance. Thus, seeing that the blue curve (k = 10) lies consistently above the red one (k = 3), we can say that k = 10 is the better model. Additionally, it is important to report the AUC because it provides a clear, quantitative measure of model performance that supports the visual interpretation.

## kNN algorithm with data transformation

The predictors that we used in the previous question do not have the same scale. For example, the variable duration ranges between `r min(bank$duration)` and `r max(bank$duration)`, whereas the variable loan is binary. In this case, the values of the variable duration will overwhelm the contribution of the variable loan. To avoid this situation, we use normalization. So, we apply min-max normalization to transform the predictors.
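Min-max normalization rescales each predictor to the $[0, 1]$ interval via $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$. As a minimal sketch (the helper minmax below is our own, for illustration only; the scaler = "minmax" option of kNN() performs an equivalent rescaling internally):

```{r}
# Hypothetical helper illustrating min-max scaling (for illustration only)
minmax = function(x) (x - min(x)) / (max(x) - min(x))

summary(minmax(bank$duration))  # all values now lie in [0, 1]
```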

Now, we find the k-nearest neighbor for the test set, based on the training dataset, for the k = 10:

```{r}
predict_knn_10_trans = kNN(formula, train = train_set, test = test_set, 
                           scaler = "minmax", k = 10)

conf.mat.plot(predict_knn_10_trans, actual_test)
```

## ROC curve and AUC for transformed data

To report the ROC curve, we need the predicted probabilities of our classification. We can obtain them by using:

```{r}
prob_knn_10 = kNN(formula, train = train_set, test = test_set, k = 10, type = "prob")[, 1]

prob_knn_10_trans = kNN(formula, train = train_set, test = test_set, 
                        scaler = "minmax", k = 10, type = "prob")[, 1]
```

To compare the model performance between the raw data and the transformed data, we report the ROC curve plot as well as the AUC (Area Under the Curve) by using the roc() and ggroc() functions from the pROC package:

```{r}
roc_knn_10 = roc(actual_test, prob_knn_10)

roc_knn_10_trans = roc(actual_test, prob_knn_10_trans)

ggroc(list(roc_knn_10, roc_knn_10_trans), size = 0.8) + 
    theme_minimal() + 
    ggtitle("ROC plots with AUC for kNN") + 
    scale_color_manual(values = c("red", "blue"), 
                       labels = c(paste("Raw data ; AUC=", round(auc(roc_knn_10), 3)), 
                                  paste("Transformed data; AUC=", round(auc(roc_knn_10_trans), 3)))) + 
    theme(legend.title = element_blank()) + 
    theme(legend.position = c(.7, .3), text = element_text(size = 17)) + 
    geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color = "grey", linetype = "dashed")
```

In the above plot, the red curve is for the raw dataset and the blue curve is for the transformed dataset.

Based on the values of AUC (Area Under the Curve), explain what conclusion you will draw.

  • We can see that the raw-data model has a lower AUC, while the transformed-data model has a higher AUC. Thus, based on the AUC values, the kNN model trained on normalized data performs better than the one trained on raw data. This means that scaling the predictors improves the model's ability to correctly classify clients who will subscribe to a term deposit.

## Optimal value of k for the kNN algorithm

In the previous questions, for finding the k-nearest neighbor for the test set, we set k = 10. But why 10? Here, we want to find out the optimal value of k based on our dataset.

To find the optimal value of k based on the Error Rate, for values of k from 1 to 30, we run the k-nearest neighbor algorithm on the test set and compute the Error Rate of each model, by running the kNN.plot() command:

```{r}
kNN.plot(formula, train = train_set, test = test_set, scaler = "minmax", 
         k.max = 30, set.seed = 7)
```
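The same search can also be done by hand (our own sketch of what kNN.plot() automates; the loop below recomputes the test-set error rate for each k):

```{r}
# Manual alternative to kNN.plot (illustrative sketch): error rate for k = 1..30
error_rate = numeric(30)

for (k in 1:30) {
    pred_k = kNN(formula, train = train_set, test = test_set, scaler = "minmax", k = k)
    error_rate[k] = mean(pred_k != actual_test)
}

which.min(error_rate)  # k with the lowest test-set error rate
```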

Based on the plot, what value of k is optimal? Provide your reasons.

  • The optimal value of k is 10, because it yields the lowest Error Rate, while still being large enough to avoid overfitting to individual noisy observations.

# Applying kNN algorithm for churn dataset (60 points)

Apply the kNN algorithm to analyze the churn dataset which is available in the R package liver.

## Import the churn dataset

Import the churn dataset in R and report the structure and summary of the dataset, using the str() and summary() functions.

```{r}
library(liver)

data(churn)

str(churn)

summary(churn)
```

## Data Setup to Model

First, partition the churn dataset randomly into two groups: a train set (70%) and a test set (30%). Then, validate the partition for a couple of variables; for example, you could validate the partition by testing whether the proportion of churners differs between the two datasets, or by testing whether the population mean of the number of customer service calls differs between the two datasets.

```{r}
set.seed(123)

data_sets <- partition(data = churn, ratio = c(0.7, 0.3))

train_set <- data_sets$part1
test_set  <- data_sets$part2
```

```{r}
x1 <- sum(train_set$churn == "yes")
x2 <- sum(test_set$churn == "yes")

n1 <- nrow(train_set)
n2 <- nrow(test_set)

prop.test(x = c(x1, x2), n = c(n1, n2))
```

  • The p-value = 0.6199 is greater than $\alpha=0.05$, so we do not reject the null hypothesis; meaning, there is no statistically significant difference in the proportion of churners between the training and test sets. Thus, we can say that the dataset was split fairly.
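As an additional validation (a sketch of the second option suggested above), a Welch two-sample t-test can compare the mean number of customer service calls between the two sets:

```{r}
# Optional second validation check (illustrative sketch):
# test whether the mean number of customer service calls differs between sets
t.test(train_set$customer.calls, test_set$customer.calls)
```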

## Applying the kNN algorithm using all predictors

Based on the training dataset, find the k-nearest neighbors for the test data set. For this, use all 19 predictors of the churn dataset. You should use min-max normalization to transform the predictors. Note that you should first find the optimal value of k. Finally, evaluate the accuracy of the predictions by reporting:

  • Formula

```{r}
formula <- churn ~ .
```

  • Optimal value of k

```{r}
kNN.plot(formula, train = train_set, test = test_set, scaler = "minmax", 
         k.max = 30, set.seed = 123)
```

The optimal value of k is 10.

```{r}
pred <- kNN(formula, train = train_set, test = test_set, scaler = "minmax", k = 10)

actual <- test_set$churn
```

  • Confusion Matrix

```{r}
conf.mat.plot(pred, actual, main = "Confusion Matrix for kNN")
```

  • ROC curve and AUC

```{r}
prob_knn <- kNN(formula, train = train_set, test = test_set, 
                scaler = "minmax", k = 10, type = "prob")[, 1]

roc_knn <- roc(actual, prob_knn)

ggroc(roc_knn, size = 1) + 
    theme_minimal() + 
    ggtitle(paste("ROC Curve for kNN (k = 10); AUC =", round(auc(roc_knn), 3))) + 
    geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color = "grey", linetype = "dashed")
```

## Applying the kNN algorithm using part of the predictors

In the previous exercise for the churn dataset, you were supposed to use all 19 predictors for the analysis. But, based on the lecture of week 2, we know that we should use only those predictors that have a relationship with the target variable. So, here we use the following predictors:

account.length, voice.plan, voice.messages, intl.plan, intl.mins, day.mins, eve.mins, night.mins, and customer.calls.

First, based on the above predictors, find the k-nearest neighbors for the test set. You should use min-max normalization to transform the predictors. Note that you should first find the optimal value of k. Finally, evaluate the accuracy of the predictions by reporting:

  • Formula

```{r}
formula3 <- churn ~ account.length + voice.plan + voice.messages + intl.plan + 
            intl.mins + day.mins + eve.mins + night.mins + customer.calls
```

  • Optimal value of k

```{r}
kNN.plot(formula3, train = train_set, test = test_set, scaler = "minmax", 
         k.max = 30, set.seed = 123)
```

The optimal value of k is 7.

```{r}
predict_knn <- kNN(formula3, train = train_set, test = test_set, scaler = "minmax", k = 7)
```

  • Confusion Matrix

```{r}
conf.mat.plot(predict_knn, actual)
```

  • ROC curve and AUC

```{r}
prob_knn <- kNN(formula3, train = train_set, test = test_set, 
                scaler = "minmax", k = 7, type = "prob")[, 1]

roc_knn <- roc(actual, prob_knn)

# Plot ROC curve
ggroc(roc_knn, size = 1) + 
    theme_minimal() + 
    ggtitle(paste("ROC Curve for kNN; AUC =", round(auc(roc_knn), 3))) + 
    geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color = "grey", linetype = "dashed")
```

Compare the results with the previous section. What would be your conclusion?

  • To conclude, comparing the results of the two models shows that using a carefully selected subset of predictors (those with a known relationship to churn) can yield a model that is both accurate and more interpretable.
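To make this comparison explicit (our own sketch; the probabilities are recomputed here because prob_knn and roc_knn were overwritten above), the two AUC values can be reported side by side:

```{r}
# Recompute and compare the AUCs of both models (illustrative sketch)
prob_full <- kNN(formula,  train = train_set, test = test_set, 
                 scaler = "minmax", k = 10, type = "prob")[, 1]
prob_part <- kNN(formula3, train = train_set, test = test_set, 
                 scaler = "minmax", k = 7,  type = "prob")[, 1]

c(AUC_all_predictors  = auc(roc(actual, prob_full)),
  AUC_part_predictors = auc(roc(actual, prob_part)))
```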