Full Name: Lina Maslovaite 14461366

Online Assignment: There is a new online assignment at the DataCamp with the name “Chapter 1: k-Nearest Neighbors (kNN)” which is a part of the online course “Supervised Learning in R: Classification” at the DataCamp. The online assignments at the DataCamp are not mandatory.

Your task is to answer the following questions in Part 1 and Part 2 in this R-markdown file. Please upload both your R-markdown (.Rmd file) and the HTML files separately on Canvas. Note that your R-markdown (.Rmd file) and the HTML files have to be in the right format.

Here, we are going to use the following R packages: liver, ggplot2, and pROC.

If needed, install these packages on your computer. Here we load them:

library(liver)  
library(ggplot2)   
library(pROC)   

1 Predicting Term Deposit Subscriptions using ‘bank’ dataset (40 points)

We aim to identify customer segments through the analysis of data from customers who have subscribed to a term deposit. This will enable us to determine the characteristics of customers who are more inclined to purchase the product.

1.1 Problem Understanding

The goal is to find the best strategies to improve the next marketing campaign. How can the financial institution make future marketing campaigns more effective? To make a data-driven decision, we need to analyze the last marketing campaign the bank performed and identify the patterns that will help us draw conclusions and develop future strategies.

1.1.1 Bank direct marketing info

Two main approaches for enterprises to promote products/services are:

  • mass campaigns: targeting general indiscriminate public,
  • directed marketing, targeting a specific set of contacts.

In general, positive responses to mass campaigns are typically very low (less than 1%). Direct marketing, on the other hand, focuses on targets that are keener on that specific product/service, which makes this kind of campaign more effective. However, direct marketing has some drawbacks; for instance, it may trigger a negative attitude towards banks due to the intrusion of privacy.

Banks are interested in increasing their financial assets. One strategy is to offer attractive long-term deposit applications with good interest rates, in particular through directed marketing campaigns. At the same time, there is pressure to reduce costs and time. Thus, efficiency needs to improve: fewer contacts should be made, while keeping approximately the same number of successes (clients subscribing to the deposit).

1.1.2 What is a Term Deposit?

A Term Deposit is a deposit that a bank or a financial institution offers with a fixed rate (often better than just opening a deposit account), in which your money will be returned at a specific maturity time. For more information with regards to Term Deposits please check here.

1.2 Data Understanding

The bank dataset is related to direct marketing campaigns of a Portuguese banking institution. You can find more information related to this dataset at: https://rdrr.io/cran/liver/man/bank.html

The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required to assess whether the product (bank term deposit) would be subscribed or not. The classification goal is to predict whether the client will subscribe to a term deposit (variable deposit).

We import the bank dataset:

data(bank)      

We can see the structure of the dataset by using the str() function:

str(bank)
  'data.frame': 4521 obs. of  17 variables:
   $ age      : int  30 33 35 30 59 35 36 39 41 43 ...
   $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
   $ marital  : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
   $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
   $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
   $ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ...
   $ housing  : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
   $ loan     : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
   $ contact  : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
   $ day      : int  19 11 16 3 5 23 14 6 14 17 ...
   $ month    : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
   $ duration : int  79 220 185 199 226 141 341 151 57 313 ...
   $ campaign : int  1 1 1 4 1 2 1 2 2 1 ...
   $ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ...
   $ previous : int  0 4 1 0 0 3 2 0 0 2 ...
   $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
   $ deposit  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

It shows that the bank dataset as a data.frame has 17 variables and 4521 observations. The dataset has 16 predictors along with the target variable deposit which is a binary variable with 2 levels “yes” and “no”. The variables in this dataset are:

  • age: numeric.
  • job: type of job; categorical: “admin.”, “unknown”, “unemployed”, “management”, “housemaid”, “entrepreneur”, “student”, “blue-collar”, “self-employed”, “retired”, “technician”, “services”.
  • marital: marital status; categorical: “married”, “divorced”, “single”; note: “divorced” means divorced or widowed.
  • education: categorical: “secondary”, “primary”, “tertiary”, “unknown”.
  • default: has credit in default?; binary: “yes”,“no”.
  • balance: average yearly balance, in euros; numeric.
  • housing: has housing loan? binary: “yes”, “no”.
  • loan: has personal loan? binary: “yes”, “no”.

Related with the last contact of the current campaign:

  • contact: contact communication type; categorical: “unknown”, “telephone”, “cellular”.
  • day: last contact day of the month; numeric.
  • month: last contact month of year; categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”.
  • duration: last contact duration, in seconds; numeric.

Other attributes:

  • campaign: number of contacts performed during this campaign and for this client; numeric, includes last contact.
  • pdays: number of days that passed by after the client was last contacted from a previous campaign; numeric, -1 means client was not previously contacted.
  • previous: number of contacts performed before this campaign and for this client; numeric.
  • poutcome: outcome of the previous marketing campaign; categorical: “success”, “failure”, “unknown”, “other”.

Target variable:

  • deposit: Indicator of whether the client subscribed a term deposit; binary: “yes” or “no”.

Here we report the summary of the dataset:

summary(bank)
        age                 job          marital         education    default   
   Min.   :19.00   management :969   divorced: 528   primary  : 678   no :4445  
   1st Qu.:33.00   blue-collar:946   married :2797   secondary:2306   yes:  76  
   Median :39.00   technician :768   single  :1196   tertiary :1350             
   Mean   :41.17   admin.     :478                   unknown  : 187             
   3rd Qu.:49.00   services   :417                                              
   Max.   :87.00   retired    :230                                              
                   (Other)    :713                                              
      balance      housing     loan           contact          day       
   Min.   :-3313   no :1962   no :3830   cellular :2896   Min.   : 1.00  
   1st Qu.:   69   yes:2559   yes: 691   telephone: 301   1st Qu.: 9.00  
   Median :  444                         unknown  :1324   Median :16.00  
   Mean   : 1423                                          Mean   :15.92  
   3rd Qu.: 1480                                          3rd Qu.:21.00  
   Max.   :71188                                          Max.   :31.00  
                                                                         
       month         duration       campaign          pdays       
   may    :1398   Min.   :   4   Min.   : 1.000   Min.   : -1.00  
   jul    : 706   1st Qu.: 104   1st Qu.: 1.000   1st Qu.: -1.00  
   aug    : 633   Median : 185   Median : 2.000   Median : -1.00  
   jun    : 531   Mean   : 264   Mean   : 2.794   Mean   : 39.77  
   nov    : 389   3rd Qu.: 329   3rd Qu.: 3.000   3rd Qu.: -1.00  
   apr    : 293   Max.   :3025   Max.   :50.000   Max.   :871.00  
   (Other): 571                                                   
      previous          poutcome    deposit   
   Min.   : 0.0000   failure: 490   no :4000  
   1st Qu.: 0.0000   other  : 197   yes: 521  
   Median : 0.0000   success: 129             
   Mean   : 0.5426   unknown:3705             
   3rd Qu.: 0.0000                            
   Max.   :25.0000                            
  

1.3 Data Setup to Model

We partition the bank dataset randomly into two groups: train set (80%) and test set (20%). Here, we use the partition() function from the liver package:

set.seed(5)

data_sets = partition(data = bank, ratio = c(0.8, 0.2))

train_set = data_sets$part1
test_set  = data_sets$part2

actual_test  = test_set$deposit

Note that here we are using the set.seed() function to create reproducible results.
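As a rough illustration of what a random 80/20 partition does, the following base-R sketch shuffles row indices and splits them. It is not the implementation of partition() from liver, whose internals are not shown here; the names train_idx and test_idx and the rounding rule are assumptions, and the selected rows will generally differ from those chosen by partition(), even with the same seed.

```r
set.seed(5)                      # fix the random number generator for reproducibility
n <- 4521                        # number of rows in the bank dataset (from str(bank))

# Draw 80% of the row indices at random, without replacement, for training
train_idx <- sample(n, size = round(0.8 * n))
# The remaining row indices form the test set
test_idx  <- setdiff(seq_len(n), train_idx)

length(train_idx)                # size of the training part
length(test_idx)                 # size of the test part
```

Calling set.seed() before sample() is what makes the same indices come out on every run.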

We want to validate the partition by testing whether the proportion of the target variable deposit differs between the two data sets. We use a Two-Sample Z-Test for the difference in proportions. To run the test, we use the prop.test() function in R:

x1 = sum(train_set$deposit == "yes")
x2 = sum(test_set $deposit == "yes")

n1 = nrow(train_set)
n2 = nrow(test_set)

prop.test(x = c(x1, x2), n = c(n1, n2))
  
    2-sample test for equality of proportions with continuity correction
  
  data:  c(x1, x2) out of c(n1, n2)
  X-squared = 2.782, df = 1, p-value = 0.09533
  alternative hypothesis: two.sided
  95 percent confidence interval:
   -0.045490250  0.004499574
  sample estimates:
     prop 1    prop 2 
  0.1111418 0.1316372

Based on the output, answer the following questions:

  a. Why is the above hypothesis test suitable for the above research question? Provide your reasons.
  • We are comparing proportions between two independent samples (the train and test sets), and both samples are large enough to justify the use of a Z-test. Moreover, the outcome variable (deposit) is binary, and our goal is to check that the partitioning did not introduce bias in the distribution of the target variable.
  b. Specify the null and alternative hypotheses.
  • Let p1 be the proportion of “yes” responses in the training set and p2 the proportion of “yes” responses in the test set. Null hypothesis (p1 = p2): the proportion of term deposit subscriptions is the same in both sets. Alternative hypothesis (p1 ≠ p2): the proportions differ between the training and test sets.
  c. Explain whether you reject or do not reject the null hypothesis at \(\alpha=0.05\). What would be your statistical conclusion?
  • The output of the prop.test() function gives p-value = 0.09533. Since the p-value is greater than \(\alpha=0.05\), we cannot reject the null hypothesis. Thus, there is no statistically significant difference between the proportions in the training and test sets.
  d. What would be a non-statistical interpretation of your findings in c?
  • The way the dataset was split into training and test sets seems fair, because the proportion of customers who subscribed to a term deposit is roughly the same in both groups (about 11.1% and 13.2%). This means that model training and evaluation will be based on comparable samples, so the test-set results should be a reasonable guide to how well the model generalizes.
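To make the link between the two-sample Z-test and the prop.test() output concrete, the sketch below computes the pooled z-statistic by hand. The counts 402 out of 3617 and 119 out of 904 are not printed in the output; they are inferred here from the reported sample proportions and an 80/20 split of 4521 rows, so treat them as an assumption.

```r
# Counts implied by the reported proportions: 402/3617 and 119/904 (inferred)
x <- c(402, 119)                 # "yes" counts in train and test
n <- c(3617, 904)                # sizes of train and test

p_hat  <- x / n                  # sample proportions
p_pool <- sum(x) / sum(n)        # pooled proportion under H0: p1 = p2
se     <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))
z      <- (p_hat[1] - p_hat[2]) / se
p_val  <- 2 * pnorm(-abs(z))     # two-sided p-value

# Without the continuity correction, prop.test() reports z^2 as X-squared
pt <- prop.test(x = x, n = n, correct = FALSE)
c(z = z, z_squared = z^2, X_squared = unname(pt$statistic), p_value = p_val)
```

By default prop.test() applies a continuity correction, which is why the X-squared value in the output above is slightly smaller than the uncorrected z squared.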

1.4 Classification using the kNN algorithm

The results from the “Exploratory Data Analysis (EDA)” (from last week) indicate that the following predictors, out of the 16 predictors in the bank dataset, are important for predicting deposit:

age, default, balance, housing, loan, duration, campaign, pdays, and previous.

Thus, based on the training dataset, we want to apply the kNN algorithm using the above predictors in our model. We use the following formula:

formula = deposit ~ age + default + balance + housing + loan + duration + campaign + pdays + previous

NOTE: The above formula means that deposit is the target variable and the variables on the right-hand side of the tilde (“~”) are the independent variables.
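A small base-R sketch of how the two sides of such a formula can be inspected. The variable names f, target, and predictors are illustrative only; all.vars() is a base-R function that lists the variables of a formula in order of appearance.

```r
f <- deposit ~ age + default + balance + housing + loan +
  duration + campaign + pdays + previous

vars       <- all.vars(f)        # every variable named in the formula
target     <- vars[1]            # left-hand side of the tilde
predictors <- vars[-1]           # right-hand side of the tilde

target
predictors
```

No data is needed for this step: a formula is only a symbolic description, and the variables are looked up in the data when a modeling function is called.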

Based on the training dataset, we want to find the k-nearest neighbor for the test data set. Here we use two different values for k (k = 3 and k = 10). We use the kNN() function from the R package liver:

predict_knn_3  = kNN(formula, train = train_set, test = test_set, k = 3)

predict_knn_10 = kNN(formula, train = train_set, test = test_set, k = 10)
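To show the idea behind the algorithm, here is a minimal from-scratch sketch of kNN classification in base R: compute the Euclidean distance from a new point to every training point, keep the k closest, and take a majority vote. The function knn_predict() and the toy data are hypothetical and are not how kNN() in liver is implemented.

```r
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from the new point to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Indices of the k closest training rows
  nearest <- order(d)[seq_len(k)]
  # Majority vote among their labels (first label wins a tie)
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Toy training data: three "no" points near the origin, three "yes" points near (5, 5)
train_x <- matrix(c(0, 0,  0, 1,  1, 0,
                    5, 5,  5, 6,  6, 5), ncol = 2, byrow = TRUE)
train_y <- factor(c("no", "no", "no", "yes", "yes", "yes"))

knn_predict(train_x, train_y, new_x = c(0.5, 0.5), k = 3)
knn_predict(train_x, train_y, new_x = c(5.5, 5.5), k = 3)
```

A larger k averages over more neighbors, which smooths the decision but can drown out a minority class.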

To get an overview of the prediction results, we report the confusion matrix for each of the two values of k by using the conf.mat() function:

(conf_knn_3 = conf.mat(predict_knn_3, actual_test))
  Setting levels: reference = "no", case = "yes"
         Actual
  Predict  no yes
      no  749  91
      yes  36  28
(conf_knn_10 = conf.mat(predict_knn_10, actual_test))
  Setting levels: reference = "no", case = "yes"
         Actual
  Predict  no yes
      no  771 107
      yes  14  12

We could also report the confusion matrix by using the conf.mat.plot() command:

conf.mat.plot(predict_knn_3, actual_test, main = "kNN with k = 3")
  Setting levels: reference = "no", case = "yes"
conf.mat.plot(predict_knn_10, actual_test, main = "kNN with k = 10")
  Setting levels: reference = "no", case = "yes"

What do these values mean? Explain what conclusion you will draw.

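  • Each matrix cross-tabulates the predicted class (rows) against the actual class (columns) for the 904 test observations. With k = 3, the model correctly identifies 28 of the 119 actual subscribers and misses 91 (false negatives), while wrongly flagging 36 non-subscribers (false positives). With k = 10, it correctly identifies only 12 subscribers and misses 107, but produces just 14 false positives. So k = 10 is slightly more accurate overall (783 vs. 777 correct predictions) because it predicts “no” more often, whereas k = 3 finds more than twice as many actual subscribers. For a marketing campaign that aims to identify potential subscribers, missed “yes” clients are costly, so based on the confusion matrices alone k = 3 looks more useful, although both models miss most subscribers.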
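The counts in the two confusion matrices can be turned into the usual summary measures. The sketch below re-enters the reported counts by hand and uses a hypothetical helper metrics(); it does not call conf.mat() from liver.

```r
# Confusion-matrix counts as reported above (rows = predicted, columns = actual)
conf_3  <- matrix(c(749, 36, 91, 28), nrow = 2,
                  dimnames = list(Predict = c("no", "yes"), Actual = c("no", "yes")))
conf_10 <- matrix(c(771, 14, 107, 12), nrow = 2,
                  dimnames = list(Predict = c("no", "yes"), Actual = c("no", "yes")))

metrics <- function(cm) {
  tp <- cm["yes", "yes"]; tn <- cm["no", "no"]
  fp <- cm["yes", "no"];  fn <- cm["no", "yes"]
  c(accuracy    = (tp + tn) / sum(cm),
    sensitivity = tp / (tp + fn),   # share of actual "yes" that were found
    specificity = tn / (tn + fp),   # share of actual "no" that were found
    precision   = tp / (tp + fp))   # share of predicted "yes" that are correct
}

round(rbind(k3 = metrics(conf_3), k10 = metrics(conf_10)), 3)
```

Looking at sensitivity and specificity separately matters here because only about one in nine test clients subscribed, so accuracy alone is dominated by the “no” class.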

1.5 Model evaluation by MSE

To evaluate the accuracy of the predictions, we calculate the Mean Square Error (MSE) by using the mse() function from the liver package:

MSE_3 = mse(predict_knn_3, actual_test)
MSE_3 
  [1] 0.1404867
MSE_10 = mse(predict_knn_10, actual_test)
MSE_10
  [1] 0.1338496

For the case k=3, the MSE = 0.14 and for the case k = 10, the MSE = 0.134. What do these values mean? Explain what conclusion you will draw.

  • A lower MSE indicates better predictive accuracy. Because the predictions are class labels, the MSE here equals the proportion of misclassified test observations: 127 out of 904 (about 14.0%) for k = 3 and 121 out of 904 (about 13.4%) for k = 10. The model with k = 10 therefore performs slightly better in terms of overall prediction accuracy. The difference is small, so this measure alone gives only modest support for preferring k = 10 when targeting future campaigns.
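For two-class labels coded as 0 and 1, the squared error of a prediction is 1 when it is wrong and 0 when it is right, so the MSE reduces to the misclassification rate. The sketch below rebuilds 0/1 vectors from the confusion-matrix counts reported earlier and checks this. The helpers rebuild() and mse01() are hypothetical, and the 0/1 coding is an assumption; the match with the reported values suggests, but does not prove, that mse() in liver does something equivalent.

```r
# Rebuild 0/1 vectors from the reported counts (0 = "no", 1 = "yes")
rebuild <- function(tn, fp, fn, tp) {
  list(actual  = rep(c(0, 0, 1, 1), times = c(tn, fp, fn, tp)),
       predict = rep(c(0, 1, 0, 1), times = c(tn, fp, fn, tp)))
}

# Mean squared error between predicted and actual 0/1 labels
mse01 <- function(v) mean((v$predict - v$actual)^2)

mse_3  <- mse01(rebuild(tn = 749, fp = 36, fn = 91,  tp = 28))
mse_10 <- mse01(rebuild(tn = 771, fp = 14, fn = 107, tp = 12))
c(k3 = mse_3, k10 = mse_10)
```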

1.6 Visualizing Model Performance by ROC curve

To report the ROC curve, we need the probability of our classification prediction. We can have it by using:

prob_knn_3  = kNN(formula, train = train_set, test = test_set, k = 3 , type = "prob")[, 1]

prob_knn_10 = kNN(formula, train = train_set, test = test_set, k = 10, type = "prob")[, 1]

To visualize the model performance, we can plot the ROC curves by using the roc() and ggroc() functions from the pROC package:

roc_knn_3 = roc(actual_test, prob_knn_3)

roc_knn_10 = roc(actual_test, prob_knn_10)

ggroc(list(roc_knn_3, roc_knn_10), size = 0.8) + 
    theme_minimal() + ggtitle("ROC plots with AUC for kNN") +
    scale_color_manual(values = c("red", "blue"), 
    labels = c(paste("k=3 ; AUC=", round(auc(roc_knn_3), 3)),
                paste("k=10; AUC=", round(auc(roc_knn_10), 3))
             )) +
    theme(legend.title = element_blank()) +
    theme(legend.position = c(.7, .3), text = element_text(size = 17)) +  
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color = "grey", linetype = "dashed") 

In the above plot, ‘red’ curve is for the case k = 3 and the ‘blue’ curve is for the case k = 10.

Explain what conclusion you will draw. Do we need to report AUC (Area Under the Curve) as well?

  • Overall, a ROC curve closer to the top-left corner indicates better performance. Since the blue curve (k = 10) lies consistently above the red one (k = 3), we can say that k = 10 is the better model. Additionally, it is important to report the AUC because it provides a clear, quantitative measure of model performance that supports the visual interpretation.
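The AUC has a simple interpretation: it is the probability that a randomly chosen “yes” case receives a higher score than a randomly chosen “no” case. The base-R sketch below computes it two equivalent ways on a small set of hypothetical scores and labels; it does not use pROC, and the numbers are made up for illustration.

```r
# Hypothetical scores (predicted probability of "yes") and true labels (1 = "yes")
score <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
label <- c(1,   1,   0,   1,   0,    0,   1,   0)

pos <- score[label == 1]
neg <- score[label == 0]

# AUC as the share of (positive, negative) pairs ranked correctly; ties count 1/2
cmp <- outer(pos, neg, "-")
auc_pairs <- mean((cmp > 0) + 0.5 * (cmp == 0))

# Equivalent rank-sum (Mann-Whitney) form
r <- rank(score)
auc_rank <- (sum(r[label == 1]) - length(pos) * (length(pos) + 1) / 2) /
  (length(pos) * length(neg))

c(auc_pairs = auc_pairs, auc_rank = auc_rank)
```

An AUC of 0.5 corresponds to the dashed diagonal in the ROC plot (random ranking), and 1 corresponds to a perfect ranking.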

1.7 kNN algorithm with data transformation

The predictors that we used in the previous question do not have the same scale. For example, the variable duration ranges between 4 and 3025, whereas the variable loan is binary. In this case, the values of duration will overwhelm the contribution of loan in the distance calculation. To avoid this, we apply min-max normalization to transform the predictors to a common scale.
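Min-max normalization maps a value x to (x - min) / (max - min), so every predictor ends up in the range 0 to 1. The sketch below shows the effect on the Euclidean distance between two hypothetical clients; the helper minmax() is illustrative and is not the code used inside kNN() when scaler = "minmax" is set.

```r
minmax <- function(x, lo = min(x), hi = max(x)) (x - lo) / (hi - lo)

# Two hypothetical clients who differ in loan status (0 vs 1)
# and in call duration (100 vs 400 seconds)
a <- c(duration = 100, loan = 0)
b <- c(duration = 400, loan = 1)

# Raw Euclidean distance is driven almost entirely by duration
d_raw <- sqrt(sum((a - b)^2))

# Rescale duration with the observed range 4..3025 from summary(bank);
# loan is already on a 0/1 scale
a_s <- c(minmax(a["duration"], 4, 3025), a["loan"])
b_s <- c(minmax(b["duration"], 4, 3025), b["loan"])
d_scaled <- sqrt(sum((a_s - b_s)^2))

c(raw = d_raw, scaled = d_scaled)
```

After scaling, the difference in loan status contributes far more to the distance than the 300-second difference in duration, which is the opposite of the raw case.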

Now, we find the k-nearest neighbors for the test set, based on the training dataset, for k = 10:

predict_knn_10_trans = kNN(formula, train = train_set, test = test_set, scaler = "minmax", k = 10)

conf.mat.plot(predict_knn_10_trans, actual_test)
  Setting levels: reference = "no", case = "yes"

1.8 ROC curve and AUC for transformed data

To report the ROC curve, we need the probability of our classification prediction. We can have it by using:

prob_knn_10 = kNN(formula, train = train_set, test = test_set, k = 10, type = "prob")[, 1]

prob_knn_10_trans = kNN(formula, train = train_set, test = test_set, scaler = "minmax", k = 10, type = "prob")[, 1]

To compare the model performance on the raw data and on the transformed data, we can plot the ROC curves and report the AUC (Area Under the Curve) by using the roc(), auc(), and ggroc() functions from the pROC package:

roc_knn_10 = roc(actual_test, prob_knn_10)

roc_knn_10_trans = roc(actual_test, prob_knn_10_trans)

ggroc(list(roc_knn_10, roc_knn_10_trans), size = 0.8) + 
    theme_minimal() + ggtitle("ROC plots with AUC for kNN") +
    scale_color_manual(values = c("red", "blue"), 
      labels = c(paste("Raw data             ; AUC=", round(auc(roc_knn_10), 3)), 
                  paste("Transformed data; AUC=", round(auc(roc_knn_10_trans), 3)))) +
  theme(legend.title = element_blank()) +
  theme(legend.position = c(.7, .3), text = element_text(size = 17)) +  
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color = "grey", linetype = "dashed") 

In the above plot, the red curve is for the raw dataset and the blue curve is for the transformed dataset.

Explain what conclusion you will draw. Based on the values of AUC (Area Under the Curve), explain what conclusion you will draw.

  • We can see that the model on the raw data has the lower AUC, while the model on the transformed data has the higher AUC. Thus, based on the AUC values, the kNN model with normalized predictors performs better than the one with raw predictors. This means that scaling the predictors improves the model’s ability to distinguish clients who will subscribe to a term deposit from those who will not.

1.9 Optimal value of k for the kNN algorithm

In the previous questions, for finding the k-nearest neighbor for the test set, we set k = 10. But why 10? Here, we want to find out the optimal value of k based on our dataset.

To find the optimal value of k based on the error rate, we run the kNN algorithm on the test set for each value of k from 1 to 30 and compute the error rate of each model, by running the kNN.plot() command:

kNN.plot(formula, train = train_set, test = test_set, scaler = "minmax", 
          k.max = 30, set.seed = 7)
  Setting levels: reference = "no", case = "yes"

Based on the plot, what value of k is optimal? Provide your reasons.

  • The optimal value of k is 10: the plot marks it as the value with the highest accuracy, that is, the lowest error rate, and it achieves this without overfitting.
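The search over k can be sketched in base R as a loop that fits kNN for each k and records the share of misclassified test rows. This is a stand-in under stated assumptions, not the internals of kNN.plot(): it uses the built-in iris data instead of bank, a hypothetical helper knn_one(), and a simple majority vote in which the first label wins a tie.

```r
set.seed(7)
idx     <- sample(nrow(iris), size = 0.8 * nrow(iris))
train_x <- as.matrix(iris[idx, 1:4]);  train_y <- iris$Species[idx]
test_x  <- as.matrix(iris[-idx, 1:4]); test_y  <- iris$Species[-idx]

# Min-max scale with the training minimum and range, applied to both sets
lo  <- apply(train_x, 2, min)
rng <- apply(train_x, 2, max) - lo
scale_mm <- function(x) sweep(sweep(x, 2, lo), 2, rng, "/")
train_s <- scale_mm(train_x)
test_s  <- scale_mm(test_x)

# Classify one scaled test row by majority vote among its k nearest training rows
knn_one <- function(new_x, k) {
  d <- sqrt(rowSums(sweep(train_s, 2, new_x)^2))
  votes <- table(train_y[order(d)[seq_len(k)]])
  names(votes)[which.max(votes)]
}

# Test error rate for k = 1, ..., 30
error_rate <- sapply(1:30, function(k) {
  pred <- apply(test_s, 1, knn_one, k = k)
  mean(pred != as.character(test_y))
})

best_k <- which.min(error_rate)  # smallest k that attains the minimum error
best_k
```

Note that choosing k on the test set, as done here and in the plot above, makes the reported error rate somewhat optimistic; a separate validation set or cross-validation would avoid that.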

2 Applying kNN algorithm for churn dataset (60 points)

Apply the kNN algorithm to analyze the churn dataset which is available in the R package liver.

2.1 Import the churn dataset

Import the churn dataset in R and report the structure and summary of the dataset, using str() and summary() function.

library(liver)

data(churn)

str(churn)
  'data.frame': 5000 obs. of  20 variables:
   $ state         : Factor w/ 51 levels "AK","AL","AR",..: 17 36 32 36 37 2 20 25 19 50 ...
   $ area.code     : Factor w/ 3 levels "area_code_408",..: 2 2 2 1 2 3 3 2 1 2 ...
   $ account.length: int  128 107 137 84 75 118 121 147 117 141 ...
   $ voice.plan    : Factor w/ 2 levels "yes","no": 1 1 2 2 2 2 1 2 2 1 ...
   $ voice.messages: int  25 26 0 0 0 0 24 0 0 37 ...
   $ intl.plan     : Factor w/ 2 levels "yes","no": 2 2 2 1 1 1 2 1 2 1 ...
   $ intl.mins     : num  10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
   $ intl.calls    : int  3 3 5 7 3 6 7 6 4 5 ...
   $ intl.charge   : num  2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 2.35 3.02 ...
   $ day.mins      : num  265 162 243 299 167 ...
   $ day.calls     : int  110 123 114 71 113 98 88 79 97 84 ...
   $ day.charge    : num  45.1 27.5 41.4 50.9 28.3 ...
   $ eve.mins      : num  197.4 195.5 121.2 61.9 148.3 ...
   $ eve.calls     : int  99 103 110 88 122 101 108 94 80 111 ...
   $ eve.charge    : num  16.78 16.62 10.3 5.26 12.61 ...
   $ night.mins    : num  245 254 163 197 187 ...
   $ night.calls   : int  91 103 104 89 121 118 118 96 90 97 ...
   $ night.charge  : num  11.01 11.45 7.32 8.86 8.41 ...
   $ customer.calls: int  1 1 0 2 3 0 3 0 1 0 ...
   $ churn         : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
summary(churn)
       state              area.code    account.length  voice.plan
   WV     : 158   area_code_408:1259   Min.   :  1.0   yes:1323  
   MN     : 125   area_code_415:2495   1st Qu.: 73.0   no :3677  
   AL     : 124   area_code_510:1246   Median :100.0             
   ID     : 119                        Mean   :100.3             
   VA     : 118                        3rd Qu.:127.0             
   OH     : 116                        Max.   :243.0             
   (Other):4240                                                  
   voice.messages   intl.plan    intl.mins       intl.calls      intl.charge   
   Min.   : 0.000   yes: 473   Min.   : 0.00   Min.   : 0.000   Min.   :0.000  
   1st Qu.: 0.000   no :4527   1st Qu.: 8.50   1st Qu.: 3.000   1st Qu.:2.300  
   Median : 0.000              Median :10.30   Median : 4.000   Median :2.780  
   Mean   : 7.755              Mean   :10.26   Mean   : 4.435   Mean   :2.771  
   3rd Qu.:17.000              3rd Qu.:12.00   3rd Qu.: 6.000   3rd Qu.:3.240  
   Max.   :52.000              Max.   :20.00   Max.   :20.000   Max.   :5.400  
                                                                               
      day.mins       day.calls     day.charge       eve.mins       eve.calls    
   Min.   :  0.0   Min.   :  0   Min.   : 0.00   Min.   :  0.0   Min.   :  0.0  
   1st Qu.:143.7   1st Qu.: 87   1st Qu.:24.43   1st Qu.:166.4   1st Qu.: 87.0  
   Median :180.1   Median :100   Median :30.62   Median :201.0   Median :100.0  
   Mean   :180.3   Mean   :100   Mean   :30.65   Mean   :200.6   Mean   :100.2  
   3rd Qu.:216.2   3rd Qu.:113   3rd Qu.:36.75   3rd Qu.:234.1   3rd Qu.:114.0  
   Max.   :351.5   Max.   :165   Max.   :59.76   Max.   :363.7   Max.   :170.0  
                                                                                
     eve.charge      night.mins     night.calls      night.charge   
   Min.   : 0.00   Min.   :  0.0   Min.   :  0.00   Min.   : 0.000  
   1st Qu.:14.14   1st Qu.:166.9   1st Qu.: 87.00   1st Qu.: 7.510  
   Median :17.09   Median :200.4   Median :100.00   Median : 9.020  
   Mean   :17.05   Mean   :200.4   Mean   : 99.92   Mean   : 9.018  
   3rd Qu.:19.90   3rd Qu.:234.7   3rd Qu.:113.00   3rd Qu.:10.560  
   Max.   :30.91   Max.   :395.0   Max.   :175.00   Max.   :17.770  
                                                                    
   customer.calls churn     
   Min.   :0.00   yes: 707  
   1st Qu.:1.00   no :4293  
   Median :1.00             
   Mean   :1.57             
   3rd Qu.:2.00             
   Max.   :9.00             
  

2.2 Data Setup to Model

First, partition the churn dataset randomly into two groups: a train set (70%) and a test set (30%). Then, validate the partition for a couple of variables; for example, you could validate the partition by testing whether the proportion of churners differs between the two datasets, or by testing whether the population mean of the number of customer service calls differs between the two datasets.

set.seed(123)
data_sets <- partition(data = churn, ratio = c(0.7, 0.3))
train_set <- data_sets$part1
test_set <- data_sets$part2
x1 <- sum(train_set$churn == "yes")
x2 <- sum(test_set$churn == "yes")
n1 <- nrow(train_set)
n2 <- nrow(test_set)

prop.test(x = c(x1, x2), n = c(n1, n2))
  
    2-sample test for equality of proportions with continuity correction
  
  data:  c(x1, x2) out of c(n1, n2)
  X-squared = 0.24601, df = 1, p-value = 0.6199
  alternative hypothesis: two.sided
  95 percent confidence interval:
   -0.01559571  0.02721476
  sample estimates:
     prop 1    prop 2 
  0.1431429 0.1373333
  • The p-value = 0.6199 is greater than \(\alpha=0.05\), so we do not reject the null hypothesis; that is, there is no statistically significant difference in the proportion of churners between the training and test sets. We can therefore say that the dataset was split fairly.
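The second validation mentioned in the task, comparing the mean number of customer service calls between the two parts, can be done with a two-sample t-test. The sketch below only illustrates the call: it uses simulated Poisson counts as stand-ins (with the overall mean 1.57 from summary(churn) and assumed part sizes of 3500 and 1500), not the real train_set and test_set, so its p-value says nothing about the actual partition.

```r
set.seed(123)
# Simulated stand-ins for customer.calls in the two parts (not the real churn data)
calls_train <- rpois(3500, lambda = 1.57)
calls_test  <- rpois(1500, lambda = 1.57)

# Welch two-sample t-test of H0: the two population means are equal
tt <- t.test(calls_train, calls_test)
tt$p.value
```

With the real data, the same test would be run as t.test(train_set$customer.calls, test_set$customer.calls), and a p-value above 0.05 would again support the partition.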

2.3 Applying the kNN algorithm using all predictors

Based on the training dataset, find the k-nearest neighbors for the test dataset, using all 19 predictors of the churn dataset. You should use min-max normalization to transform the predictors. Note that you should first find the optimal value of k. Finally, evaluate the accuracy of the predictions by reporting:

  • Formula
formula <- churn ~ .
  • Optimal value of k
kNN.plot(formula, train = train_set, test = test_set, scaler = "minmax", k.max = 30, set.seed = 123)
  Setting levels: reference = "yes", case = "no"

The optimal value of k = 10

pred <- kNN(formula, train = train_set, test = test_set, scaler = "minmax", k = 10)

actual <- test_set$churn
  • Confusion Matrix
conf.mat.plot(pred, actual, main = "Confusion Matrix for kNN")
  Setting levels: reference = "yes", case = "no"

  • ROC curve and AUC
prob_knn <- kNN(formula, train = train_set, test = test_set, scaler = "minmax", k = 10, type = "prob")[, 1]

roc_knn <- roc(actual, prob_knn)

ggroc(roc_knn, size = 1) +
  theme_minimal() +
  ggtitle(paste("ROC Curve for kNN (k = 10) — AUC =", round(auc(roc_knn), 3))) +
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color = "grey", linetype = "dashed")

2.4 Applying the kNN algorithm using a subset of the predictors

In the previous exercise for the churn dataset, you were supposed to use all 19 predictors for the analysis. However, based on the lecture of week 2, we know that we should use only those predictors that have a relationship with the target variable. So, here we use the following predictors:

account.length, voice.plan, voice.messages, intl.plan, intl.mins, day.mins, eve.mins, night.mins, and customer.calls.

First, based on the above predictors, find the k-nearest neighbors for the test set. You should use min-max normalization to transform the predictors. Note that you should first find the optimal value of k. Finally, evaluate the accuracy of the predictions by reporting:

  • Formula
formula3 <- churn ~ account.length + voice.plan + voice.messages + intl.plan +
           intl.mins + day.mins + eve.mins + night.mins + customer.calls
  • Optimal value of k
kNN.plot(formula3, train = train_set, test = test_set, scaler = "minmax", k.max = 30, set.seed = 123)
  Setting levels: reference = "yes", case = "no"

The optimal value of k = 7

predict_knn <- kNN(formula3, train = train_set, test = test_set, scaler = "minmax", k = 7)
  • Confusion Matrix
conf.mat.plot(predict_knn, actual)
  Setting levels: reference = "yes", case = "no"

  • ROC curve and AUC
prob_knn <- kNN(formula3, train = train_set, test = test_set, scaler = "minmax", k = 7, type = "prob")[, 1]

roc_knn <- roc(actual, prob_knn)

# Plot ROC curve
ggroc(roc_knn, size = 1) +
  theme_minimal() +
  ggtitle(paste("ROC Curve for kNN — AUC =", round(auc(roc_knn), 3))) +
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color = "grey", linetype = "dashed")

Compare the results with the previous section. What would be your conclusion?