By the end of this activity, you should be able to:
Manipulate data with the tidyverse package.
Split the data into training and test sets.
Train a KNN model.
Explain the importance of data normalization for KNN.
You should be familiar with the basics of R: installing packages, variables, atomic types, lists, data frames and matrices. These topics were covered in Lab 0.
Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. Some questions require a short answer - type your answer right in this document; others require modifying R code - modify the code here. You can discuss your work with other students and with the instructors.
Please fill in the survey at the end - it’ll only take a couple of minutes.
library(tidyverse) # for manipulation with data
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ISLR) # for datasets from 'Introduction to Statistical Learning'
library(caret) # for machine learning, including KNN
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
R is a functional programming language (google “functional programming” if you are interested). The syntax to define a function is very similar to the syntax to define a numeric variable or a data frame:
square_function <- function(x) {
x^2
}
square_function(14)
## [1] 196
Thus a function is, in fact, a variable of class “function”:
class(square_function)
## [1] "function"
If we need to calculate the value of some numeric expression, say, \(\frac{2020}{42} - 2^3\), we do not need to define variables and assign constant values to them - we can simply write the expression in R:
2020 / 42 - 2^3
## [1] 40.09524
In the same way, in order to use a function in an expression, we do not have to create a variable of class “function”:
(function(x) x^2)(14)
## [1] 196
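Since R 4.1, there is also a shorthand backslash syntax for anonymous functions; it is exactly equivalent to the function(x) form used above:
(\(x) x^2)(14)
## [1] 196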
In R, we very seldom use explicit loops: they tend to be slow and are often hard to read, especially when nested. Instead, we use functions of the apply family from base R or the map family from tidyverse.
The most basic of them are lapply and map. We use them when we have a list or a vector \[
a_1, a_2, \cdots, a_n
\] and a function \(f\) that we want to apply to every element to create the new list \[
f(a_1), f(a_2), \dots, f(a_n)
\]
Below is an example:
v <- 1:10
cat("Our vector v is (", v, ")\n")
## Our vector v is ( 1 2 3 4 5 6 7 8 9 10 )
cat("And now we will apply the square function to every its entry\n")
## And now we will apply the square function to every its entry
map(v, square_function)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 9
##
## [[4]]
## [1] 16
##
## [[5]]
## [1] 25
##
## [[6]]
## [1] 36
##
## [[7]]
## [1] 49
##
## [[8]]
## [1] 64
##
## [[9]]
## [1] 81
##
## [[10]]
## [1] 100
Note that lapply and map return a list. However, it is usually more convenient to work with vectors or matrices, and for that we have sapply and map_vec. They are a bit smarter in that they return the result in vector form:
map_vec(v, square_function)
## [1] 1 4 9 16 25 36 49 64 81 100
or
sapply(v, square_function)
## [1] 1 4 9 16 25 36 49 64 81 100
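For comparison, here is the same computation written as an explicit loop; the vectorized versions above are both shorter and clearer:
squares <- numeric(length(v))
for (i in seq_along(v)) {
  squares[i] <- square_function(v[i])
}
squares
## [1] 1 4 9 16 25 36 49 64 81 100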
Here, we look at the dataset Carseats from the library ISLR:
head(Carseats)
Use the functions map_vec and class to produce a vector with the type of each variable in the dataset Carseats. Note that this will be a named vector.
# Write your code below
map_vec(Carseats, class)
## Sales CompPrice Income Advertising Population Price
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## ShelveLoc Age Education Urban US
## "factor" "numeric" "numeric" "factor" "factor"
An English text is read from left to right, but functions in mathematics are applied from right to left, i.e., \[ \ln(\cos(\exp\sqrt{7})) \] is calculated by starting with the number \(7\), then applying the square root function, then the exponential function, etc. In R (as in most programming languages), the order is also from right to left:
log(cos(exp(sqrt(7))))
## [1] -3.143688
It is, however, often more convenient to apply functions from left to right. This is done with the pipe operator %>% from the libraries tidyverse and magrittr (the former is a big collection of tools, the latter contains just the pipe operator):
7 %>% sqrt %>% exp %>% cos %>% log
## [1] -3.143688
If a function used in such a pipe chain needs several arguments, then whatever comes from the left of the pipe chain is inserted as the first argument. For example,
max(exp(-3), 2)
## [1] 2
is the same as
-3 %>% exp %>% max(2)
## [1] 2
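If the piped value should land somewhere other than the first argument, magrittr provides the dot placeholder . (a small extra illustration; the placeholder is not needed in the rest of this lab):
3 %>% seq(1, 10, by = .)
## [1] 1 4 7 10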
Let’s say we want to apply the following transformations to the dataset Carseats:
Keep the variables Sales, Income, Advertising, Price, Age, ShelveLoc, Urban and drop the rest of the variables.
Extract only records with Sales at least 10.
Convert ShelveLoc from factor to character (factor is a format for categorical data in R).
Convert Urban from a yes/no factor to logical true/false.
We could do it as follows:
A <- Carseats[ , c("Sales", "Income", "Advertising",
"Price", "Age", "ShelveLoc", "Urban")]
A <- A[A$Sales >= 10 , ]
A$ShelveLoc <- as.character(A$ShelveLoc)
A$Urban <- A$Urban == "Yes"
head(A)
With the pipe operator and the data transformation functions select, filter and mutate (these functions are in the library dplyr, which is part of tidyverse), this sequence of transformations is a bit easier to read:
A <- Carseats %>%
select(`Sales`, `Income`, `Advertising`,
`Price`, `Age`, `ShelveLoc`, `Urban`) %>%
filter(Sales >= 10) %>%
mutate(ShelveLoc = as.character(ShelveLoc)) %>%
mutate(Urban = Urban == "Yes")
head(A)
Write a pipe chain that transforms the dataset Carseats as follows:
Extracts records with the price between 90 and 140.
Keeps the variables Sales, Population, Price, Education, US.
Creates the new variable Price_SGD that is equal to 1.38 * Price.
Sorts all records by Population in decreasing order (for that, you will need the function arrange).
# Write your code below
Carseats %>%
filter(Price <= 140 & Price >= 90) %>%
select(`Sales`, `Population`, `Price`, `Education`, `US`) %>%
mutate(`Price_SGD` = 1.38 * `Price`) %>%
arrange(-Population)
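Note that arrange(-Population) works here because Population is numeric. The more general dplyr helper is desc(), which also works for character and factor columns:
Carseats %>% arrange(desc(Population)) %>% head(3)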
To illustrate the importance of normalization, let us look at the dataset iris, available in R by default. Below is a sample:
set.seed(42)
iris %>% sample_n(6)
The dataset contains measurements of 150 flowers of three different species, 50 each. Below we plot them as a 2D scatterplot, with the axes representing petal measurements in cm and colour representing the species:
ggplot(data = iris, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
geom_point() +
coord_fixed()
Now we will change the units of measurement. We will express petal length in feet instead of centimetres, i.e., we will just divide Petal.Length by 30.
iris_transformed <- iris %>%
select(Petal.Length, Petal.Width, Species) %>%
mutate(Petal.Length = Petal.Length / 30)
ggplot(data = iris_transformed,
aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
geom_point() +
coord_fixed()
The problem with this transformation is that points that were far apart from each other may become close, and hence the very notion of a neighbourhood of a point changes. The neighbourhood \(N(x)\) containing the \(K\) nearest neighbours of a point \(x\) in the original dataset and the corresponding neighbourhood in the transformed dataset will, in general, contain different sets of \(K\) data points.
By looking at the plot above, we see that proximity between points in the transformed dataset is essentially determined by petal width alone, since petal length measured in feet is always small compared to petal width measured in cm.
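A quick numeric check of this claim, comparing the ranges of the two variables after the transformation:
iris_transformed %>%
  summarise(across(where(is.numeric), function(x) max(x) - min(x)))
# Petal.Length range: about 0.2; Petal.Width range: 2.4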
The simplest method to split our data into training and test sets is by creating a logical vector with TRUE at positions corresponding to training set observations and FALSE at positions corresponding to test set observations.
Since this is a random procedure, it is a good idea to set a random seed so that every time you run your code, it will generate the same random numbers.
set.seed(42)
ind <- runif(150) <= 0.7
ind[1:10]
## [1] FALSE FALSE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE
The number of entries in the training set is
sum(ind)
## [1] 96
The number of entries in the test set is
sum(!ind)
## [1] 54
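Note that this method gives only an approximate 70/30 split: each observation lands in the training set independently with probability 0.7. You can check the realized fraction:
mean(ind)
## [1] 0.64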
Now we will create two pairs (training set, test set): one from the original iris dataset, the other from the transformed dataset where petal length is measured in feet.
train_data_1 <- iris %>% filter(ind) %>%
select(Petal.Length, Petal.Width, Species)
test_data_1 <- iris %>% filter(!ind) %>%
select(Petal.Length, Petal.Width, Species)
train_data_2 <- iris_transformed %>% filter(ind) %>%
select(Petal.Length, Petal.Width, Species)
test_data_2 <- iris_transformed %>% filter(!ind) %>%
select(Petal.Length, Petal.Width, Species)
cat("Dimensions of training set 1 are", dim(train_data_1), "\n")
## Dimensions of training set 1 are 96 3
cat("Dimensions of test set 2 are", dim(test_data_2), "\n")
## Dimensions of test set 2 are 54 3
An alternative method of splitting the data into training and test sets is based on slice - use it if you want to specify the exact number of data points in the training and test sets:
set.seed(42)
s <- sample(1:150, 105)
train_data_3 <- iris %>% slice(s) %>%
select(Petal.Length, Petal.Width, Species)
test_data_3 <- iris %>% slice(-s) %>%
select(Petal.Length, Petal.Width, Species)
cat("Dimensions of training set 3 are", dim(train_data_3), "\n")
## Dimensions of training set 3 are 105 3
cat("Dimensions of test set 3 are", dim(test_data_3), "\n")
## Dimensions of test set 3 are 45 3
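The library caret also provides createDataPartition, which produces a stratified split, i.e., it preserves the class proportions in both parts. A small sketch (the variable names here are ours):
set.seed(42)
idx <- as.vector(createDataPartition(iris$Species, p = 0.7, list = FALSE))
train_strat <- iris %>% slice(idx)
test_strat <- iris %>% slice(-idx)
table(train_strat$Species) # roughly 35 flowers of each species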
First, we train our KNN model with \(K=3\) on the original iris dataset (well, on the training set extracted from the original dataset):
# Training the model:
mod_knn_1 <- knn3(Species ~ . , data = train_data_1, k = 3)
# Making predictions:
preds_1 <- predict(mod_knn_1, test_data_1, type = "class")
# Note that without the option type = "class" this function
# will return probability vectors rather than predicted labels
# Very often, it is better to have probability vectors,
# but now, just to make a point, we will work with labels.
head(preds_1)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
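To see the probability form mentioned in the comment above, call predict without type = "class" (for knn3, the default is type = "prob"); each row then contains the class proportions among the \(K=3\) nearest neighbours:
head(predict(mod_knn_1, test_data_1))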
Now we will train another KNN model, also with \(K=3\), but this time on the transformed dataset:
mod_knn_2 <- knn3(Species ~ . , data = train_data_2, k = 3)
preds_2 <- predict(mod_knn_2, test_data_2, type = "class")
head(preds_2)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
Below is the confusion matrix between predictions of the two models:
confusionMatrix(preds_1, preds_2)$table
## Reference
## Prediction setosa versicolor virginica
## setosa 22 0 0
## versicolor 0 15 2
## virginica 0 2 13
# The function confusionMatrix returns a lot of information
# We just need the matrix itself, that's why we added '$table'
# You can try deleting '$table' to see what else you can get from it
The two models agree on predicting “setosa”, but disagree on predicting “versicolor” and “virginica”. The point of this exercise is to see that changing the scale of variables affects KNN predictions.
To normalize the data means to bring all variables to the same scale. The simplest approach is to apply the transformation \[ x_i\mapsto \frac{x_i - \min(x)}{\max(x) - \min(x)} \] to every entry \(x_i\) of each vector of observations \(x\).
We could start with our original dataset and apply this function to every numeric column as follows:
normalize <- function(x, m = min(x), M = max(x)) {
(x - m) / (M - m)
}
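# A quick check of the helper on toy values:
normalize(c(2, 4, 6)) # 0.0 0.5 1.0
# With explicit bounds (as we will need later for test data),
# results can fall outside [0, 1]:
normalize(c(5, 7), m = 2, M = 6) # 0.75 1.25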
# Here we normalize all numeric variables in the data.
# To do it, we need the special function "across" from tidyverse;
# where(is.numeric) selects all numeric columns.
iris_normalized <- iris %>%
  mutate(across(where(is.numeric), normalize))
iris_normalized %>% sample_n(5)
Why is this method of normalizing the data incorrect?
Answer: Finding the minimum and maximum values of each variable is part of training, and we can’t use the test set for it.
The right method is to normalize the training set and then apply the same transformation to the test set. This is a bit more complicated because we will have to use the minimum and maximum from the training set on the test set. There are different ways to do it. For example, we can save the largest and the smallest entry in each column of the training set (see the sketch below). Alternatively, instead of creating two different datasets for training and testing, we can label records in the original dataset.
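Here is a minimal sketch of the first approach on the first (training, test) pair, reusing the normalize helper defined above; the names train_norm_v1 and test_norm_v1 are ours:
train_norm_v1 <- train_data_1 %>%
  mutate(Petal.Length = normalize(Petal.Length),
         Petal.Width = normalize(Petal.Width))
test_norm_v1 <- test_data_1 %>%
  mutate(Petal.Length = normalize(Petal.Length,
                                  min(train_data_1$Petal.Length),
                                  max(train_data_1$Petal.Length)),
         Petal.Width = normalize(Petal.Width,
                                 min(train_data_1$Petal.Width),
                                 max(train_data_1$Petal.Width)))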
Here is how we do it with the labelling approach:
iris %>%
mutate(label = ifelse(ind, "train", "test")) %>%
select(Petal.Length, Petal.Width, Species, label)
Now we will create the normalized combined training and test sets by applying the transformation \[ x_i\mapsto \frac{x_i - \min(x_{train})}{\max(x_{train}) - \min(x_{train})} \] across all numeric variables:
norm_data_1 <- iris %>%
mutate(label = ifelse(ind, "train", "test")) %>%
select(Petal.Length, Petal.Width, Species, label) %>%
mutate(
across(where(is.numeric),
function(x) normalize(x, min(x[ind]), max(x[ind]))
)
)
norm_data_1
Now we will define our training and test sets:
train_data_1_norm <- norm_data_1 %>%
filter(label == "train") %>%
select(-label)
test_data_1_norm <- norm_data_1 %>%
filter(label == "test") %>%
select(-label)
And finally, we will train a KNN model with \(K=3\) on the normalized training data and apply it to the normalized test data. Below is the confusion matrix of the model on the test data:
mod_knn_1_norm <- knn3(Species ~ . , data = train_data_1_norm, k = 3)
preds_1 <- predict(mod_knn_1_norm, test_data_1_norm, type = "class")
confusionMatrix(preds_1, test_data_1_norm$Species)$table
## Reference
## Prediction setosa versicolor virginica
## setosa 22 0 0
## versicolor 0 12 4
## virginica 0 0 16
Apply the same method of normalization to the second pair (training data, test data), the one where petal length is measured in feet. Check that the two KNN models constructed on datasets of different scales yield the same predictions.
# Write your code below
norm_data_2 <- iris_transformed %>%
mutate(label = ifelse(ind, "train", "test")) %>%
select(Petal.Length, Petal.Width, Species, label) %>%
mutate(
across(where(is.numeric),
function(x) normalize(x, min(x[ind]), max(x[ind]))
)
)
train_data_2_norm <- norm_data_2 %>%
filter(label == "train") %>%
select(-label)
test_data_2_norm <- norm_data_2 %>%
filter(label == "test") %>%
select(-label)
mod_knn_2_norm <- knn3(Species ~ . , data = train_data_2_norm, k = 3)
preds_2 <- predict(mod_knn_2_norm, test_data_2_norm, type = "class")
confusionMatrix(preds_1, preds_2)$table
## Reference
## Prediction setosa versicolor virginica
## setosa 22 0 0
## versicolor 0 16 0
## virginica 0 0 16
We implemented data normalization by hand above to better understand the pitfalls of the KNN method and to get some extra practice with data transformation and functional programming. In practice, we just use the built-in methods for automatic normalization that R packages provide. Here is how we do it in caret:
knn_mod_1 <- train(Species ~ ., data = train_data_1, method = "knn",
trControl = trainControl("none"),
tuneGrid = expand.grid(k = 3),
preProcess = c("range"))
preds_1 <- predict(knn_mod_1, test_data_1, type = "raw")
head(preds_1)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
Note that the function train from the library caret is a universal wrapper for training a lot of different models. We will understand better how it works later; for now, let’s just accept this syntax. Besides, note that it is the option preProcess = c("range") that enables automatic data normalization - "range" performs exactly the min-max scaling we implemented by hand.
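The preprocessing step can also be built as a standalone object with caret’s preProcess function, which is useful if you want to inspect or reuse the learned transformation; a small sketch:
pp <- preProcess(train_data_1[, c("Petal.Length", "Petal.Width")],
                 method = "range")
head(predict(pp, test_data_1[, c("Petal.Length", "Petal.Width")]))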
Now we will train another KNN model with the same settings but using the second training set (with petal length measured in feet). Then we will report the confusion matrix to see that the two models produce the same results.
knn_mod_2 <- train(Species ~ ., data = train_data_2, method = "knn",
trControl = trainControl("none"),
tuneGrid = expand.grid(k = 3),
preProcess = c("scale"))
preds_2 <- predict(knn_mod_2, test_data_2, type = "raw")
confusionMatrix(preds_1, preds_2)$table
## Reference
## Prediction setosa versicolor virginica
## setosa 22 0 0
## versicolor 0 16 0
## virginica 0 0 16