Info

Objective

The purpose of the graded lab reports is to help students stay on track and to provide summative feedback. Each lab report is worth just 1% of the total course mark. Please do not cheat: it is not worth it!

Your task

Solve the practical questions, knit your document into a PDF, and submit it to NTULearn before the deadline. The deadline is tight because the task is simple: we are sure that everyone is capable of doing it on their own, and we want to discourage taking someone else’s report and rewriting it in your own words.

Marking scheme

You will get “excellent”, or 100%, for this lab report if everything is perfect. You will get “good”, or 75%, if there are minor issues: for example, if you normalize the data manually instead of using the built-in R functions covered in the lab session, or if you use loops instead of functions from the apply family. You will get “average”, or 50%, if there are serious issues in your report, such as failing to normalize the data for KNN at all. You will get “poor”, or 25%, if you barely attempt this report, and “not done”, or 0%, if you do not attempt it at all.

Deadline

17 Jul 2020, midnight

Libraries

Here we load the libraries and set the random seed. Replace the number “1729” with the numeric part of your matriculation number.

library(tidyverse) # for data manipulation
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ISLR) # for datasets from 'Introduction to Statistical Learning'
library(caret) # for machine learning, including KNN
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
set.seed(1729) # Replace the number "1729" with the numeric part of your matriculation number

Question 1

Explain why it is wrong to normalize all the data first and then split it into a training set and a test set.

Answer: a central principle of machine learning is that the test set is set aside at the very beginning and not touched again. It must not be involved in training the models at all; we may only use it for final validation, to compare different models. Normalization parameters, such as each variable’s mean and standard deviation, are estimated from the data. If we normalize all the data before splitting, those parameters are computed partly from the future test observations, so information from the test set leaks into training, and the test error is no longer an honest estimate of performance on unseen data.
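To make the correct order concrete, here is a minimal sketch on the built-in mtcars data (a stand-in dataset; the variable hp plays the role of a feature to be normalized, and all object names are illustrative):

ind_demo <- runif(nrow(mtcars)) <= 0.75    # split FIRST
train_demo <- mtcars[ind_demo, ]
test_demo  <- mtcars[!ind_demo, ]

mu    <- mean(train_demo$hp)   # normalization parameters come from training data only
sigma <- sd(train_demo$hp)

train_demo$hp_std <- (train_demo$hp - mu) / sigma
test_demo$hp_std  <- (test_demo$hp - mu) / sigma   # test set reuses the SAME parameters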

Question 2

Start with the dataset Carseats from the library ISLR. Split it into 75% training and 25% test data. Keep only Sales, Income, Advertising, and Price variables. Print dimensions of the training dataset and the test dataset.

ind <- runif(nrow(Carseats)) <= 0.75  # TRUE with probability 0.75 for each row

train_data <- Carseats %>% filter(ind) %>%
  select(Sales, Income, Advertising, Price)

test_data <- Carseats %>% filter(!ind) %>%
  select(Sales, Income, Advertising, Price)

dim(train_data)
## [1] 316   4
dim(test_data)
## [1] 84  4
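Note that the runif() split above gives only an approximately 75/25 division (here 316 and 84 rows). If an exact, stratified split is preferred, caret’s createDataPartition() can be used instead; a sketch for reference (not used in the rest of this report, hence the different object names):

idx_exact <- createDataPartition(Carseats$Sales, p = 0.75, list = FALSE)
train_exact <- Carseats[idx_exact, ]  %>% select(Sales, Income, Advertising, Price)
test_exact  <- Carseats[-idx_exact, ] %>% select(Sales, Income, Advertising, Price)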

Question 3

First, we will create a function that calculates the mean absolute error of a vector of predicted values against a vector of reference values:

mae <- function(predicted_values, reference_values) {
  (predicted_values - reference_values) %>%
    abs() %>%
    mean()
}
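As a quick sanity check on made-up numbers (the absolute errors are 0, 1, and 2, so the result should be 1):

mae(c(1, 2, 3), c(1, 1, 1))
## [1] 1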

Now train a KNN regression model with \(K=15\) on the training dataset and find its error on the test set (note that preProcess estimates the scaling parameters on the training data inside train(), and predict() applies the same transformation to the test data, which follows the correct order from Question 1). Modify the code below:

knn_mod <- train(Sales ~ ., data = train_data, method = "knn",
                 trControl = trainControl("none"),   # no resampling: fit one model
                 tuneGrid = expand.grid(k = 15),     # fixed K = 15
                 preProcess = c("scale"))            # scale predictors

preds <- predict(knn_mod, test_data)
mae(preds, test_data$Sales)
## [1] 1.903077

Question 4

Train a set of KNN models with \(K=3,5,7,\dots,25\) on the training data and report the mean absolute error of each of them on the test data.

# Build a table of test-set errors, one per candidate value of K
model_error <- function(K) {
  # Fit a KNN model with the given K on the training data
  # and return its mean absolute error on the test data
  knn_mod <- train(Sales ~ ., data = train_data, method = "knn", 
                   trControl = trainControl("none"),
                   tuneGrid = expand.grid(k = K),
                   preProcess = c("scale"))
  pred <- predict(knn_mod, test_data)
  mae(pred, test_data$Sales)
}

values_of_k <- seq(from = 3, to = 25, by = 2)
error_table <- values_of_k %>%
  sapply(model_error) %>%
  set_names(paste("K =", values_of_k))

error_table
##    K = 3    K = 5    K = 7    K = 9   K = 11   K = 13   K = 15   K = 17 
## 2.229206 2.186294 2.080901 2.029640 1.998120 1.939834 1.903077 1.884849 
##   K = 19   K = 21   K = 23   K = 25 
## 1.914388 1.909703 1.923410 1.931399

Write a single R command that prints the value of \(K\) that minimizes the mean absolute error (your R command should work with any input data, i.e., you cannot just look at the table above, find the smallest error, and print the corresponding value of \(K\)):

which.min(error_table) %>% names
## [1] "K = 17"

Remarks:

  1. There are many ways to create such a table of model errors. Instead of making a named vector, we can make a data frame:
error_table <- data.frame(
  K = values_of_k,
  error = sapply(values_of_k, model_error))

error_table %>% slice(which.min(error))
  2. Later we will learn a better method of trying different values of \(K\) when training a KNN. The function train can take care of it too, just like it takes care of data normalization; a preview sketch follows.
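For reference, here is a minimal preview sketch (an anticipation of later material, not required for this report): train() can search over a grid of \(K\) values itself, evaluating each candidate by 10-fold cross-validation on the training data, so the test set stays untouched until the final comparison.

# Sketch: let train() choose K by 10-fold cross-validation on the training data
knn_tuned <- train(Sales ~ ., data = train_data, method = "knn",
                   trControl = trainControl(method = "cv", number = 10),
                   tuneGrid = expand.grid(k = seq(from = 3, to = 25, by = 2)),
                   preProcess = c("scale"))
knn_tuned$bestTune  # the value of K selected by cross-validation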