The purpose of writing graded lab reports is to help students stay on track and to provide summative feedback. Each lab report is worth just 1% of the total course mark. Please do not cheat - it is not worth it!
Solve the practical question, knit your document into a PDF, and submit it to NTULearn before the deadline. The deadline is tight because the task is simple: we are sure that everyone is capable of doing it by themselves, and we want to discourage taking someone else's report and rewriting it in your own words.
You will get “excellent”, or 100%, for this lab report if everything is perfect. You will get “good”, or 75%, if there are minor issues. For example, you will get “good” if you do data normalization manually instead of using the built-in R functions learned in the lab session, or if you use loops instead of functions from the apply family.
You will get “average”, or 50%, if there are serious issues in your report, such as failing to do normalization for KNN at all. You will get “poor”, or 25%, if you barely attempt this report. You will get “not done”, or 0%, if you do not attempt this report.
Deadline: 17 Jul 2020, midnight
Here, we load libraries and set the random seed. Replace the number “1729” with the numeric part of your matric number.
library(tidyverse) # for manipulation with data
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ISLR) # for datasets from 'Introduction to Statistical Learning'
library(caret) # for machine learning, including KNN
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
set.seed(1729) # Replace the number "1729" with the numeric part of your matric no
Explain why it is wrong to normalize all the data first and then split it into a training set and a test set.
Answer: the central idea in machine learning is that the test set is set aside at the very beginning and is not involved in training models at all - we may only use it for the final validation, to compare different models. If we normalize all the data first, the normalization parameters (such as the mean and the standard deviation of each variable) are computed from the test observations as well, so information about the test set leaks into the training process.
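A minimal sketch of the leak, using a toy vector rather than the lab data: the scaling parameters must be estimated on the training part only and then reused for the test part.
x <- c(1, 2, 3, 100) # pretend the last element is the test set
train_x <- x[1:3]
test_x <- x[4]
(test_x - mean(train_x)) / sd(train_x) # correct: parameters estimated on training data only
(test_x - mean(x)) / sd(x) # wrong: mean(x) and sd(x) have already "seen" the test point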
Start with the dataset Carseats from the library ISLR. Split it into 75% training and 25% test data. Keep only the Sales, Income, Advertising, and Price variables. Print the dimensions of the training dataset and the test dataset.
ind <- runif(nrow(Carseats)) <= 0.75 # TRUE for roughly 75% of the rows, drawn at random
train_data <- Carseats %>% filter(ind) %>%
  select(Sales, Income, Advertising, Price)
test_data <- Carseats %>% filter(!ind) %>%
  select(Sales, Income, Advertising, Price)
dim(train_data)
## [1] 316 4
dim(test_data)
## [1] 84 4
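As a quick sanity check (not part of the task), the two pieces should add up to the full dataset:
stopifnot(nrow(train_data) + nrow(test_data) == nrow(Carseats))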
First, we will create a function that calculates the mean absolute error of a vector of predicted values vs a vector of reference values:
mae <- function(predicted_values, reference_values) {
  (predicted_values - reference_values) %>%
    abs() %>% mean()
}
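A quick check that the function behaves as expected: the absolute errors below are 0, 1, and 2, so their mean is 1.
mae(c(1, 2, 3), c(1, 3, 5))
## [1] 1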
Now train a KNN regression model with \(K=15\) on the training dataset and find its error on the test set. Modify the code below:
knn_mod <- train(Sales ~ ., data = train_data, method = "knn",
trControl = trainControl("none"),
tuneGrid = expand.grid(k = 15),
preProcess = c("scale"))
preds <- predict(knn_mod, test_data)
mae(preds, test_data$Sales)
## [1] 1.903077
Train a set of KNN models with \(K=3,5,7,\dots,25\) on the training data, report the mean absolute error of every one of them on the test data.
# Make the table of errors here and print it
# model_error(K): trains a KNN model with the given K on the training data
# and returns its mean absolute error on the test set
model_error <- function(K) {
knn_mod <- train(Sales ~ ., data = train_data, method = "knn",
trControl = trainControl("none"),
tuneGrid = expand.grid(k = K),
preProcess = c("scale"))
pred <- predict(knn_mod, test_data)
mae(pred, test_data$Sales)
}
values_of_k <- seq(from = 3, to = 25, by = 2)
error_table <- values_of_k %>%
sapply(model_error) %>%
set_names(paste("K =", values_of_k))
error_table
## K = 3 K = 5 K = 7 K = 9 K = 11 K = 13 K = 15 K = 17
## 2.229206 2.186294 2.080901 2.029640 1.998120 1.939834 1.903077 1.884849
## K = 19 K = 21 K = 23 K = 25
## 1.914388 1.909703 1.923410 1.931399
Write a single R command that prints the value of \(K\) that minimizes the mean absolute error (your R command should work with any input data, i.e., you cannot just look at the table above, find the smallest error, and print the corresponding value of \(K\)):
which.min(error_table) %>% names
## [1] "K = 17"
Remarks: alternatively, one can store the errors in a data frame and slice out the row with the smallest error:
error_table <- data.frame(
  K = values_of_k,
  error = sapply(values_of_k, model_error))
error_table %>% slice(which.min(error))
The train function can take care of choosing the best \(K\) too, just like it takes care of data normalization.
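For instance, 10-fold cross-validation on the training data lets train pick \(K\) without ever touching the test set (a sketch, not required by this lab; note that by default caret selects the model by RMSE, so we request MAE explicitly):
knn_cv <- train(Sales ~ ., data = train_data, method = "knn",
                trControl = trainControl(method = "cv", number = 10),
                tuneGrid = expand.grid(k = seq(from = 3, to = 25, by = 2)),
                preProcess = c("scale"),
                metric = "MAE")
knn_cv$bestTune # the value of K selected by cross-validation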