Full Name: Lina Maslovaite 14461366
Online Assignment: There is a new online assignment at the DataCamp with the name “Chapter 1: k-Nearest Neighbors (kNN)” which is a part of the online course “Supervised Learning in R: Classification” at the DataCamp. The online assignments at the DataCamp are not mandatory.
Your task is to answer the following questions in Part 1 and Part 2 in this R-markdown file. Please upload both your R-markdown (.Rmd file) and the HTML files separately on Canvas. Note that your R-markdown (.Rmd file) and the HTML files have to be in the right format.
Here, we are going to use the following R packages:
liver: we use the partition() and kNN() functions in this package.
pROC: we use the roc() and auc() functions in this package.
If needed, install these packages on your computer. Here we load them:
We aim to identify customer segments through the analysis of data from customers who have subscribed to a term deposit. This will enable us to determine the characteristics of customers who are more inclined to purchase the product.
The goal is to find the best strategies for improving the next marketing campaign. How can the financial institution make future marketing campaigns more effective? To make a data-driven decision, we need to analyze the last marketing campaign the bank performed and identify patterns that will help us draw conclusions and develop future strategies.
Two main approaches for enterprises to promote products/services are mass campaigns, which target the general public, and direct marketing, which targets a specific set of contacts.
In general, positive responses to mass campaigns are typically very low (less than 1%). On the other hand, direct marketing focuses on targets that are keener on that specific product/service, making this kind of campaign more effective. However, direct marketing has some drawbacks; for instance, it may trigger a negative attitude towards banks due to the intrusion of privacy.
Banks are interested in increasing their financial assets. One strategy is to offer attractive long-term deposit applications with good interest rates, in particular by using direct marketing campaigns. The same drivers are also pressing for a reduction in costs and time. Thus, there is a need for an improvement in efficiency: fewer contacts should be made, while approximately the same number of successes (clients subscribing to the deposit) should be kept.
A Term Deposit is a deposit that a bank or a financial institution offers with a fixed rate (often better than just opening a deposit account), in which your money will be returned at a specific maturity time. For more information with regards to Term Deposits please check here.
The bank dataset is related to direct marketing campaigns of a Portuguese banking institution. You can find more information related to this dataset at: https://rdrr.io/cran/liver/man/bank.html
The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required to assess whether the product (bank term deposit) would be subscribed to or not. The classification goal is to predict whether the client will subscribe to a term deposit (variable deposit).
We import the bank dataset:
We can see the structure of the dataset by using the
str() function:
'data.frame': 4521 obs. of 17 variables:
$ age : int 30 33 35 30 59 35 36 39 41 43 ...
$ job : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
$ marital : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
$ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
$ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
$ balance : int 1787 4789 1350 1476 0 747 307 147 221 -88 ...
$ housing : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
$ loan : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
$ contact : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
$ day : int 19 11 16 3 5 23 14 6 14 17 ...
$ month : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
$ duration : int 79 220 185 199 226 141 341 151 57 313 ...
$ campaign : int 1 1 1 4 1 2 1 2 2 1 ...
$ pdays : int -1 339 330 -1 -1 176 330 -1 -1 147 ...
$ previous : int 0 4 1 0 0 3 2 0 0 2 ...
$ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
$ deposit : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
It shows that the bank dataset is a data.frame with 17 variables and 4521 observations. The dataset has 16 predictors along with the target variable deposit, which is a binary variable with 2 levels, “yes” and “no”. The variables in this dataset are:
age: numeric.
job: type of job; categorical: “admin.”, “unknown”, “unemployed”, “management”, “housemaid”, “entrepreneur”, “student”, “blue-collar”, “self-employed”, “retired”, “technician”, “services”.
marital: marital status; categorical: “married”, “divorced”, “single”; note: “divorced” means divorced or widowed.
education: categorical: “secondary”, “primary”, “tertiary”, “unknown”.
default: has credit in default?; binary: “yes”, “no”.
balance: average yearly balance, in euros; numeric.
housing: has housing loan?; binary: “yes”, “no”.
loan: has personal loan?; binary: “yes”, “no”.
Related with the last contact of the current campaign:
contact: contact communication type; categorical: “unknown”, “telephone”, “cellular”.
day: last contact day of the month; numeric.
month: last contact month of year; categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”.
duration: last contact duration, in seconds; numeric.
Other attributes:
campaign: number of contacts performed during this campaign and for this client; numeric, includes last contact.
pdays: number of days that passed by after the client was last contacted from a previous campaign; numeric, -1 means client was not previously contacted.
previous: number of contacts performed before this campaign and for this client; numeric.
poutcome: outcome of the previous marketing campaign; categorical: “success”, “failure”, “unknown”, “other”.
Target variable:
deposit: indicator of whether the client subscribed to a term deposit; binary: “yes” or “no”.
Here we report the summary of the dataset:
age job marital education default
Min. :19.00 management :969 divorced: 528 primary : 678 no :4445
1st Qu.:33.00 blue-collar:946 married :2797 secondary:2306 yes: 76
Median :39.00 technician :768 single :1196 tertiary :1350
Mean :41.17 admin. :478 unknown : 187
3rd Qu.:49.00 services :417
Max. :87.00 retired :230
(Other) :713
balance housing loan contact day
Min. :-3313 no :1962 no :3830 cellular :2896 Min. : 1.00
1st Qu.: 69 yes:2559 yes: 691 telephone: 301 1st Qu.: 9.00
Median : 444 unknown :1324 Median :16.00
Mean : 1423 Mean :15.92
3rd Qu.: 1480 3rd Qu.:21.00
Max. :71188 Max. :31.00
month duration campaign pdays
may :1398 Min. : 4 Min. : 1.000 Min. : -1.00
jul : 706 1st Qu.: 104 1st Qu.: 1.000 1st Qu.: -1.00
aug : 633 Median : 185 Median : 2.000 Median : -1.00
jun : 531 Mean : 264 Mean : 2.794 Mean : 39.77
nov : 389 3rd Qu.: 329 3rd Qu.: 3.000 3rd Qu.: -1.00
apr : 293 Max. :3025 Max. :50.000 Max. :871.00
(Other): 571
previous poutcome deposit
Min. : 0.0000 failure: 490 no :4000
1st Qu.: 0.0000 other : 197 yes: 521
Median : 0.0000 success: 129
Mean : 0.5426 unknown:3705
3rd Qu.: 0.0000
Max. :25.0000
We partition the bank dataset randomly into two groups:
train set (80%) and test set (20%). Here, we use the
partition() function from the liver
package:
set.seed(5)
data_sets = partition(data = bank, ratio = c(0.8, 0.2))
train_set = data_sets$part1
test_set = data_sets$part2
actual_test = test_set$deposit
Note that here we are using the set.seed() function to create reproducible results.
We want to validate the partition by testing whether the proportion
of the target variable deposit differs between the two data
sets. We use a Two-Sample Z-Test for the difference in proportions. To
run the test, we use the prop.test() function in
R:
x1 = sum(train_set$deposit == "yes")
x2 = sum(test_set$deposit == "yes")
n1 = nrow(train_set)
n2 = nrow(test_set)
prop.test(x = c(x1, x2), n = c(n1, n2))
2-sample test for equality of proportions with continuity correction
data: c(x1, x2) out of c(n1, n2)
X-squared = 2.782, df = 1, p-value = 0.09533
alternative hypothesis: two.sided
95 percent confidence interval:
-0.045490250 0.004499574
sample estimates:
prop 1 prop 2
0.1111418 0.1316372
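To make the test statistic less of a black box, the sketch below recomputes it by hand in base R and compares it with prop.test(). The counts 402 out of 3617 and 119 out of 904 are not printed in the output; they are inferred here from the reported proportions and the test-set size, so treat them as an assumption.

```r
# Counts implied by the reported proportions (an inference, not printed output):
# 402 / 3617 = 0.1111418 and 119 / 904 = 0.1316372.
x <- c(402, 119)    # number of deposit == "yes" in train and test
n <- c(3617, 904)   # number of rows in train and test

p_hat  <- x / n               # sample proportions
p_pool <- sum(x) / sum(n)     # pooled proportion under the null hypothesis
s      <- sum(1 / n)          # 1/n1 + 1/n2

# Squared z statistic with the continuity correction
z2      <- (abs(diff(p_hat)) - 0.5 * s)^2 / (p_pool * (1 - p_pool) * s)
p_value <- pchisq(z2, df = 1, lower.tail = FALSE)

# Same test via prop.test(), for comparison
fit <- prop.test(x = x, n = n)
print(c(manual = z2, prop.test = unname(fit$statistic), p_value = p_value))
```

The squared z statistic is what prop.test() reports as X-squared, which is why the output shows df = 1.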
Based on the output, answer the following questions:
From the output of the prop.test() function we see that p-value = 0.09533. Since the p-value is greater than \(\alpha=0.05\), we cannot reject the null hypothesis that the proportion of deposit = “yes” is the same in both sets. Thus, there is no statistically significant difference between the training and test sets, so we treat the partition as valid.
The results from the “Exploratory Data Analysis (EDA)” (from last week) indicate that, of the 16 predictors in the bank dataset, the following are important for predicting deposit:
age, default, balance,
housing, loan, duration,
campaign, pdays, and
previous.
Thus, based on the training dataset, we apply the kNN algorithm using the above predictors in our model. We use the following formula:
formula = deposit ~ age + default + balance + housing + loan + duration + campaign + pdays + previous
NOTE: The above formula means that deposit is the target variable and the variables on the right side of the tilde (“~”) are the predictors.
Based on the training dataset, we want to find the k-nearest neighbor
for the test data set. Here we use two different values for k (k = 3 and
k = 10). We use the kNN() function from the R
package liver:
predict_knn_3 = kNN(formula, train = train_set, test = test_set, k = 3)
predict_knn_10 = kNN(formula, train = train_set, test = test_set, k = 10)
To get an overview of the prediction results, we report the confusion matrix for the two values of k by using the conf.mat() function:
Setting levels: reference = "no", case = "yes"
Actual
Predict no yes
no 749 91
yes 36 28
Setting levels: reference = "no", case = "yes"
Actual
Predict no yes
no 771 107
yes 14 12
We also could report Confusion Matrix by using the
conf.mat.plot() command:
Setting levels: reference = "no", case = "yes"
Setting levels: reference = "no", case = "yes"
What do these values mean? Explain what conclusion you will draw.
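One way to read the two confusion matrices is to turn the counts into accuracy, sensitivity and specificity. The sketch below does this in base R from the counts printed earlier; the helper name cm_metrics is ours and is not part of the liver package.

```r
# Counts copied from the confusion matrices reported earlier
# (rows = predicted class, columns = actual class; matrix() fills by column).
labels <- list(Predict = c("no", "yes"), Actual = c("no", "yes"))
cm_k3  <- matrix(c(749, 36,  91, 28), nrow = 2, dimnames = labels)
cm_k10 <- matrix(c(771, 14, 107, 12), nrow = 2, dimnames = labels)

# Hypothetical helper: summary measures with "yes" as the class of interest.
cm_metrics <- function(cm) {
  c(accuracy    = sum(diag(cm)) / sum(cm),
    sensitivity = cm["yes", "yes"] / sum(cm[, "yes"]),  # share of actual "yes" found
    specificity = cm["no", "no"]   / sum(cm[, "no"]))   # share of actual "no" found
}

m3  <- cm_metrics(cm_k3)
m10 <- cm_metrics(cm_k10)
print(round(rbind(k3 = m3, k10 = m10), 3))
```

With these counts, k = 10 has the higher overall accuracy but identifies fewer of the actual subscribers (12 of 119, against 28 of 119 for k = 3), which matters because “yes” is the class the campaign cares about.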
To evaluate the accuracy of the predictions, we calculate the Mean
Square Error (MSE) by using the mse() function from the
liver package:
[1] 0.1404867
[1] 0.1338496
For the case k=3, the MSE = 0.14 and for the case k = 10, the MSE = 0.134. What do these values mean? Explain what conclusion you will draw.
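For class labels coded as 0 and 1, each squared error is either 0 or 1, so the MSE reduces to the misclassification rate. The sketch below checks this against the confusion-matrix counts reported earlier. It assumes that mse() in liver codes the two classes as 0 and 1, which is consistent with the printed values but is not stated in the original.

```r
# Misclassified cases (off-diagonal counts) and test-set size, taken from the
# confusion matrices reported earlier.
n_test     <- 749 + 91 + 36 + 28   # 904 test observations
errors_k3  <- 91 + 36
errors_k10 <- 107 + 14

# Rebuild 0/1 vectors with the same number of disagreements and compute the MSE.
mse_01 <- function(n_errors, n) {
  actual    <- rep(0, n)
  predicted <- c(rep(1, n_errors), rep(0, n - n_errors))
  mean((predicted - actual)^2)
}

mse_k3  <- mse_01(errors_k3,  n_test)
mse_k10 <- mse_01(errors_k10, n_test)
print(c(k3 = mse_k3, k10 = mse_k10))
```

Read this way, an MSE of about 0.14 means that roughly 14% of the test customers are misclassified, and the difference between the two values of k amounts to six customers out of 904.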
To report the ROC curve, we need the probability of our classification prediction. We can have it by using:
prob_knn_3 = kNN(formula, train = train_set, test = test_set, k = 3 , type = "prob")[, 1]
prob_knn_10 = kNN(formula, train = train_set, test = test_set, k = 10, type = "prob")[, 1]
To visualize the model performance, we report the ROC curve plot by using the roc() and ggroc() functions from the pROC package:
roc_knn_3 = roc(actual_test, prob_knn_3)
roc_knn_10 = roc(actual_test, prob_knn_10)
ggroc(list(roc_knn_3, roc_knn_10), size = 0.8) +
theme_minimal() + ggtitle("ROC plots with AUC for kNN") +
scale_color_manual(values = c("red", "blue"),
labels = c(paste("k=3 ; AUC=", round(auc(roc_knn_3), 3)),
paste("k=10; AUC=", round(auc(roc_knn_10), 3))
)) +
theme(legend.title = element_blank()) +
theme(legend.position = c(.7, .3), text = element_text(size = 17)) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color = "grey", linetype = "dashed")
In the above plot, the red curve is for the case k = 3 and the blue curve is for the case k = 10.
Explain what conclusion you will draw. Do we need to report AUC (Area Under the Curve) as well?
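The AUC summarizes a ROC curve in a single number: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it from ranks in base R on four made-up scores; the function name auc_rank and the toy data are illustrative assumptions and do not come from the bank analysis.

```r
# Rank-based AUC (equivalent to the Mann-Whitney U statistic divided by
# n_pos * n_neg). `labels` is 1 for the positive class and 0 otherwise.
auc_rank <- function(labels, scores) {
  pos   <- labels == 1
  r     <- rank(scores)        # average ranks are used for tied scores
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data: two negatives and two positives, one pair ordered incorrectly.
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)

auc_toy <- auc_rank(labels, scores)
print(auc_toy)
```

An AUC of 0.5 corresponds to the dashed diagonal in the plot and an AUC of 1 to a perfect ranking, so reporting the AUC next to the curves gives a direct numerical comparison between the two values of k.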
The predictors that we used in the previous question do not have the same scale. For example, the variable duration ranges between 4 and 3025, whereas the variable loan is binary. In this case, the values of duration will overwhelm the contribution of loan in the distance calculation. To avoid this, we apply min-max normalization to transform the predictors.
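To illustrate why scaling matters for a distance-based method, the sketch below compares the Euclidean distance between two customers before and after min-max scaling. The two customers are hypothetical and the helper minmax is ours; only the range of duration (4 to 3025 seconds) is taken from the summary of the dataset.

```r
# Min-max normalization: maps lo to 0 and hi to 1.
minmax <- function(x, lo = min(x), hi = max(x)) (x - lo) / (hi - lo)

# Two hypothetical customers: calls of 200 s and 300 s, without and with a loan.
duration <- c(200, 300)
loan     <- c(0, 1)

# On the raw scale the distance is driven almost entirely by duration.
raw_dist <- sqrt(diff(duration)^2 + diff(loan)^2)

# Scale duration with the range reported in the summary; loan is already 0/1.
duration_scaled <- minmax(duration, lo = 4, hi = 3025)
scaled_dist     <- sqrt(diff(duration_scaled)^2 + diff(loan)^2)

print(c(raw = raw_dist, scaled = scaled_dist))
```

After scaling, the difference in loan status dominates the distance, whereas on the raw scale it is almost invisible next to a 100-second difference in call duration.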
Now, we find the k-nearest neighbors for the test set, based on the training dataset, for k = 10:
predict_knn_10_trans = kNN(formula, train = train_set, test = test_set, scaler = "minmax", k = 10)
conf.mat.plot(predict_knn_10_trans, actual_test)
Setting levels: reference = "no", case = "yes"
To report the ROC curve, we need the probability of our classification prediction. We can have it by using:
prob_knn_10 = kNN(formula, train = train_set, test = test_set, k = 10, type = "prob")[, 1]
prob_knn_10_trans = kNN(formula, train = train_set, test = test_set, scaler = "minmax", k = 10, type = "prob")[, 1]
To compare the model performance between the raw data and the transformed data, we report the ROC curve plot as well as the AUC (Area Under the Curve) by using the roc(), auc() and ggroc() functions from the pROC package:
roc_knn_10 = roc(actual_test, prob_knn_10)
roc_knn_10_trans = roc(actual_test, prob_knn_10_trans)
ggroc(list(roc_knn_10, roc_knn_10_trans), size = 0.8) +
theme_minimal() + ggtitle("ROC plots with AUC for kNN") +
scale_color_manual(values = c("red", "blue"),
labels = c(paste("Raw data ; AUC=", round(auc(roc_knn_10), 3)),
paste("Transformed data; AUC=", round(auc(roc_knn_10_trans), 3)))) +
theme(legend.title = element_blank()) +
theme(legend.position = c(.7, .3), text = element_text(size = 17)) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color = "grey", linetype = "dashed")
In the above plot, the red curve is for the raw dataset and the blue curve is for the transformed dataset.
Explain what conclusion you will draw. Based on the values of AUC (Area Under the Curve), explain what conclusion you will draw.
In the previous questions, for finding the k-nearest neighbor for the test set, we set k = 10. But why 10? Here, we want to find out the optimal value of k based on our dataset.
To find the optimal value of k based on the Error Rate, we run the kNN algorithm on the test set for values of k from 1 to 30 and compute the Error Rate of each model, by running the kNN.plot() command:
Setting levels: reference = "no", case = "yes"
Based on the plot, what value of k is optimal? Provide your reasons.
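The search over k can also be written out explicitly. The sketch below is a minimal base-R version of the idea, not the implementation used by kNN.plot(): it uses the built-in iris data instead of the bank data and a hand-written classifier knn_predict, both of which are assumptions made only to keep the example self-contained.

```r
# Minimal kNN classifier: Euclidean distance and majority vote.
knn_predict <- function(train_x, train_y, test_x, k) {
  train_x <- as.matrix(train_x)
  test_x  <- as.matrix(test_x)
  apply(test_x, 1, function(row) {
    d       <- sqrt(colSums((t(train_x) - row)^2))  # distance to each training row
    nearest <- order(d)[seq_len(k)]                 # indices of the k closest rows
    votes   <- table(train_y[nearest])
    names(votes)[which.max(votes)]                  # ties go to the first level
  })
}

# Illustrative split of the built-in iris data (not the bank data).
set.seed(1)
idx     <- sample(nrow(iris), 100)
train_x <- iris[idx, 1:4];  train_y <- iris$Species[idx]
test_x  <- iris[-idx, 1:4]; test_y  <- iris$Species[-idx]

# Error rate on the test set for k = 1, ..., 30.
k_values   <- 1:30
error_rate <- sapply(k_values, function(k) {
  mean(knn_predict(train_x, train_y, test_x, k) != test_y)
})

best_k <- k_values[which.min(error_rate)]  # smallest k with the lowest error rate
print(best_k)
```

Because which.min() returns the first minimum, ties are resolved in favour of the smallest k. When several values of k give nearly the same error rate, the choice between them is a judgment call and should be justified from the plot.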
Apply the kNN algorithm to analyze the churn dataset which is available in the R package liver.
Import the churn dataset in R and report the structure and summary of the dataset, using the str() and summary() functions.
'data.frame': 5000 obs. of 20 variables:
$ state : Factor w/ 51 levels "AK","AL","AR",..: 17 36 32 36 37 2 20 25 19 50 ...
$ area.code : Factor w/ 3 levels "area_code_408",..: 2 2 2 1 2 3 3 2 1 2 ...
$ account.length: int 128 107 137 84 75 118 121 147 117 141 ...
$ voice.plan : Factor w/ 2 levels "yes","no": 1 1 2 2 2 2 1 2 2 1 ...
$ voice.messages: int 25 26 0 0 0 0 24 0 0 37 ...
$ intl.plan : Factor w/ 2 levels "yes","no": 2 2 2 1 1 1 2 1 2 1 ...
$ intl.mins : num 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
$ intl.calls : int 3 3 5 7 3 6 7 6 4 5 ...
$ intl.charge : num 2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 2.35 3.02 ...
$ day.mins : num 265 162 243 299 167 ...
$ day.calls : int 110 123 114 71 113 98 88 79 97 84 ...
$ day.charge : num 45.1 27.5 41.4 50.9 28.3 ...
$ eve.mins : num 197.4 195.5 121.2 61.9 148.3 ...
$ eve.calls : int 99 103 110 88 122 101 108 94 80 111 ...
$ eve.charge : num 16.78 16.62 10.3 5.26 12.61 ...
$ night.mins : num 245 254 163 197 187 ...
$ night.calls : int 91 103 104 89 121 118 118 96 90 97 ...
$ night.charge : num 11.01 11.45 7.32 8.86 8.41 ...
$ customer.calls: int 1 1 0 2 3 0 3 0 1 0 ...
$ churn : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
state area.code account.length voice.plan
WV : 158 area_code_408:1259 Min. : 1.0 yes:1323
MN : 125 area_code_415:2495 1st Qu.: 73.0 no :3677
AL : 124 area_code_510:1246 Median :100.0
ID : 119 Mean :100.3
VA : 118 3rd Qu.:127.0
OH : 116 Max. :243.0
(Other):4240
voice.messages intl.plan intl.mins intl.calls intl.charge
Min. : 0.000 yes: 473 Min. : 0.00 Min. : 0.000 Min. :0.000
1st Qu.: 0.000 no :4527 1st Qu.: 8.50 1st Qu.: 3.000 1st Qu.:2.300
Median : 0.000 Median :10.30 Median : 4.000 Median :2.780
Mean : 7.755 Mean :10.26 Mean : 4.435 Mean :2.771
3rd Qu.:17.000 3rd Qu.:12.00 3rd Qu.: 6.000 3rd Qu.:3.240
Max. :52.000 Max. :20.00 Max. :20.000 Max. :5.400
day.mins day.calls day.charge eve.mins eve.calls
Min. : 0.0 Min. : 0 Min. : 0.00 Min. : 0.0 Min. : 0.0
1st Qu.:143.7 1st Qu.: 87 1st Qu.:24.43 1st Qu.:166.4 1st Qu.: 87.0
Median :180.1 Median :100 Median :30.62 Median :201.0 Median :100.0
Mean :180.3 Mean :100 Mean :30.65 Mean :200.6 Mean :100.2
3rd Qu.:216.2 3rd Qu.:113 3rd Qu.:36.75 3rd Qu.:234.1 3rd Qu.:114.0
Max. :351.5 Max. :165 Max. :59.76 Max. :363.7 Max. :170.0
eve.charge night.mins night.calls night.charge
Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.000
1st Qu.:14.14 1st Qu.:166.9 1st Qu.: 87.00 1st Qu.: 7.510
Median :17.09 Median :200.4 Median :100.00 Median : 9.020
Mean :17.05 Mean :200.4 Mean : 99.92 Mean : 9.018
3rd Qu.:19.90 3rd Qu.:234.7 3rd Qu.:113.00 3rd Qu.:10.560
Max. :30.91 Max. :395.0 Max. :175.00 Max. :17.770
customer.calls churn
Min. :0.00 yes: 707
1st Qu.:1.00 no :4293
Median :1.00
Mean :1.57
3rd Qu.:2.00
Max. :9.00
First, partition the churn dataset randomly into two groups as a train set (70%) and test set (30%). Then, validate the partition for a couple of variables; for example, you could validate the partition by testing whether the proportion of churners differs between the two datasets. Or you could validate the partition by testing whether the population means for the number of customer service calls differs between the two datasets.
set.seed(123)
data_sets <- partition(data = churn, ratio = c(0.7, 0.3))
train_set <- data_sets$part1
test_set <- data_sets$part2
x1 <- sum(train_set$churn == "yes")
x2 <- sum(test_set$churn == "yes")
n1 <- nrow(train_set)
n2 <- nrow(test_set)
prop.test(x = c(x1, x2), n = c(n1, n2))
2-sample test for equality of proportions with continuity correction
data: c(x1, x2) out of c(n1, n2)
X-squared = 0.24601, df = 1, p-value = 0.6199
alternative hypothesis: two.sided
95 percent confidence interval:
-0.01559571 0.02721476
sample estimates:
prop 1 prop 2
0.1431429 0.1373333
Based on the training dataset, find the k-nearest neighbor for the test data set. For this, use all the 19 predictors of the churn dataset for the analysis. You should use min-max normalization and transfer the predictors. Note that you should first find the optimal value of k. Finally, evaluate the accuracy of the predictions by reporting
formula <- churn ~ .  # assumed definition (not shown in the original output): all 19 predictors
kNN.plot(formula, train = train_set, test = test_set, scaler = "minmax", k.max = 30, set.seed = 123)
Setting levels: reference = "yes", case = "no"
The optimal value of k = 10
pred <- kNN(formula, train = train_set, test = test_set, scaler = "minmax", k = 10)
actual <- test_set$churn
Setting levels: reference = "yes", case = "no"
prob_knn <- kNN(formula, train = train_set, test = test_set, scaler = "minmax", k = 10, type = "prob")[, 1]
roc_knn <- roc(actual, prob_knn)
ggroc(roc_knn, size = 1) +
theme_minimal() +
ggtitle(paste("ROC Curve for kNN (k = 10) — AUC =", round(auc(roc_knn), 3))) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color = "grey", linetype = "dashed")
In the previous exercises for the churn dataset, you were supposed to use all 19 predictors for the analysis. However, as we know from the lecture of week 2, we should use only those predictors that have a relationship with the target variable. So, here we use the following predictors:
account.length, voice.plan,
voice.messages, intl.plan,
intl.mins, day.mins, eve.mins,
night.mins, and customer.calls.
First, based on the above predictors, find the k-nearest neighbor for the test set. You should use min-max normalization and transfer the predictors. Note that you should first find the optimal value of k. Finally, evaluate the accuracy of the predictions by reporting
formula3 <- churn ~ account.length + voice.plan + voice.messages + intl.plan +
  intl.mins + day.mins + eve.mins + night.mins + customer.calls
kNN.plot(formula3, train = train_set, test = test_set, scaler = "minmax", k.max = 30, set.seed = 123)
Setting levels: reference = "yes", case = "no"
The optimal value of k = 7
Setting levels: reference = "yes", case = "no"
prob_knn <- kNN(formula3, train = train_set, test = test_set, scaler = "minmax", k = 7, type = "prob")[, 1]
roc_knn <- roc(actual, prob_knn)
# Plot ROC curve
ggroc(roc_knn, size = 1) +
theme_minimal() +
ggtitle(paste("ROC Curve for kNN — AUC =", round(auc(roc_knn), 3))) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color = "grey", linetype = "dashed")
Compare the results with the previous section. What would be your conclusion?