By the end of this activity, you should be able to:
Manipulate data with the tidyverse package.
Split the data into training and test sets.
Train a KNN model.
Explain the importance of data normalization for KNN.
You should be familiar with the basics of R: installing packages, variables, atomic types, lists, data frames and matrices. These topics were covered in Lab 0.
Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. Some questions require a short answer - type your answer right in this document; others require modifying R code - modify the code here. You can discuss your work with other students and with the instructors.
Please fill in the survey at the end - it’ll only take a couple of minutes.
library(tidyverse) # for manipulation with data
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ISLR) # for datasets from 'Introduction to Statistical Learning'
library(caret) # for machine learning, including KNN
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
R is a functional programming language (google “functional programming” if you are interested). The syntax to define a function is very similar to the syntax to define a numeric variable or a data frame:
square_function <- function(x) {
x^2
}
square_function(14)
## [1] 196
Thus a function is, in fact, a variable of class “function”:
class(square_function)
## [1] "function"
If we need to calculate the value of some numeric expression, say, \(\frac{2020}{42} - 2^3\), we do not need to define variables and assign constant values to them - we can simply write the expression in R:
2020 / 42 - 2^3
## [1] 40.09524
In the same way, in order to use a function in an expression, we do not have to create a variable of class “function”:
(function(x) x^2)(14)
## [1] 196
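Since R 4.1, there is also a shorthand backslash syntax for anonymous functions; it is exactly equivalent to the function(x) form used above:
(\(x) x^2)(14)
## [1] 196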
In R, we very seldom use explicit loops: they tend to be slow and are often hard to read, especially when nested. Instead, we use functions of the apply family from base R or the map family from tidyverse.
The most basic of them are lapply and map. We use them when we have a list or a vector \[
a_1, a_2, \cdots, a_n
\] and a function \(f\) that we want to apply to every element to create the new list \[
f(a_1), f(a_2), \dots, f(a_n)
\]
Below is an example:
v <- 1:10
cat("Our vector v is (", v, ")\n")
## Our vector v is ( 1 2 3 4 5 6 7 8 9 10 )
cat("And now we will apply the square function to every its entry\n")
## And now we will apply the square function to every its entry
map(v, square_function)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 9
##
## [[4]]
## [1] 16
##
## [[5]]
## [1] 25
##
## [[6]]
## [1] 36
##
## [[7]]
## [1] 49
##
## [[8]]
## [1] 64
##
## [[9]]
## [1] 81
##
## [[10]]
## [1] 100
Note that lapply and map return a list. However, it is usually more convenient to work with vectors or matrices, and for that we have sapply and map_vec. They are a bit smarter in that they return the result in vector form:
map_vec(v, square_function)
## [1] 1 4 9 16 25 36 49 64 81 100
or
sapply(v, square_function)
## [1] 1 4 9 16 25 36 49 64 81 100
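For comparison, here is the same computation written as an explicit loop; the vectorized versions above are both shorter and clearer:
squares <- numeric(length(v))
for (i in seq_along(v)) {
  squares[i] <- square_function(v[i])
}
squares
## [1] 1 4 9 16 25 36 49 64 81 100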
Here, we look at the dataset Carseats from the library ISLR:
head(Carseats)
Use the functions map_vec and class to produce a vector with the type of each variable in the dataset Carseats. Note that this will be a named vector.
# Write your code below
map_vec(Carseats, class)
## Sales CompPrice Income Advertising Population Price
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## ShelveLoc Age Education Urban US
## "factor" "numeric" "numeric" "factor" "factor"
An English text is read from left to right, but functions in mathematics are applied from right to left, i.e., \[ \ln(\cos(\exp\sqrt{7})) \] is calculated by starting with the number \(7\), then applying the square root function, then the exponential function, etc. In R (as in most programming languages), the order is also from right to left:
log(cos(exp(sqrt(7))))
## [1] -3.143688
It is, however, often more convenient to apply functions from left to right. This is done with the pipe operator %>% from the libraries tidyverse and magrittr (the former is a big collection of tools, the latter contains just the pipe operator):
7 %>% sqrt %>% exp %>% cos %>% log
## [1] -3.143688
If a function used in such a pipe chain needs several arguments, then whatever comes from the left of the pipe chain is inserted as the first argument. For example,
max(exp(-3), 2)
## [1] 2
is the same as
-3 %>% exp %>% max(2)
## [1] 2
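If the piped value should land somewhere other than the first argument, magrittr provides the dot placeholder . (a small extra illustration; the placeholder is not needed in the rest of this lab):
3 %>% seq(1, 10, by = .)
## [1] 1 4 7 10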
Let’s say we want to apply the following transformations to the dataset Carseats:
Keep the variables Sales, Income, Advertising, Price, Age, ShelveLoc, Urban and drop the rest of the variables.
Extract only records with Sales at least 10.
Convert ShelveLoc from factor to character (factor is a format for categorical data in R).
Convert Urban from a yes/no factor to logical true/false.
We could do it as follows:
A <- Carseats[ , c("Sales", "Income", "Advertising",
"Price", "Age", "ShelveLoc", "Urban")]
A <- A[A$Sales >= 10 , ]
A$ShelveLoc <- as.character(A$ShelveLoc)
A$Urban <- A$Urban == "Yes"
head(A)
With the pipe operator and the data transformation functions select, filter and mutate (these functions are in the library dplyr, which is part of tidyverse), this sequence of transformations is a bit easier to read:
A <- Carseats %>%
select(`Sales`, `Income`, `Advertising`,
`Price`, `Age`, `ShelveLoc`, `Urban`) %>%
filter(Sales >= 10) %>%
mutate(ShelveLoc = as.character(ShelveLoc)) %>%
mutate(Urban = Urban == "Yes")
head(A)
Write a pipe chain that transforms the dataset Carseats as follows:
Extracts records with the price between 90 and 140.
Keeps the variables Sales, Population, Price, Education, US.
Creates the new variable Price_SGD that is equal to 1.38 * Price.
Sorts all records by Population in decreasing order (for that, you will need the function arrange).
# Write your code below
Carseats %>%
filter(Price <= 140 & Price >= 90) %>%
select(`Sales`, `Population`, `Price`, `Education`, `US`) %>%
mutate(`Price_SGD` = 1.38 * `Price`) %>%
arrange(-Population)
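Note that arrange(-Population) works here because Population is numeric. The more general dplyr helper is desc(), which also works for character and factor columns:
Carseats %>% arrange(desc(Population)) %>% head(3)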
To illustrate the importance of normalization, let us look at the dataset iris, available in R by default. Below is a sample:
set.seed(42)
iris %>% sample_n(6)
The dataset contains measurements of 150 flowers of three different species, 50 each. Below we plot them as a 2D scatterplot, with the axes representing petal measurements in cm and colour representing the species:
ggplot(data = iris, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
geom_point() +
coord_fixed()
Now we will change the units of measurement. We will express petal length in feet instead of centimetres, i.e., we will just divide Petal.Length by 30.
iris_transformed <- iris %>%
select(Petal.Length, Petal.Width, Species) %>%
mutate(Petal.Length = Petal.Length / 30)
ggplot(data = iris_transformed,
aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
geom_point() +
coord_fixed()
The problem with this transformation is that points that were far apart from each other may become close, and hence the very notion of a neighbourhood of a point changes. The neighbourhood \(N(x)\) containing the \(K\) nearest neighbours of a point \(x\) in the original dataset and the corresponding neighbourhood in the transformed dataset will, in general, contain different sets of \(K\) data points.
By looking at the plot above, we see that proximity between points in the transformed dataset is essentially determined by petal width alone, since petal length measured in feet is always small compared to petal width measured in cm.
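A quick numeric check of this claim, comparing the ranges of the two variables after the transformation:
iris_transformed %>%
  summarise(across(where(is.numeric), function(x) max(x) - min(x)))
# Petal.Length range: about 0.2; Petal.Width range: 2.4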
The simplest method to split our data into training and test sets is by creating a logical vector with TRUE at positions corresponding to training set observations and FALSE at positions corresponding to test set observations.
Since this is a random procedure, it is a good idea to set a random seed so that every time you run your code, it will generate the same random numbers.
set.seed(42)
ind <- runif(150) <= 0.7
ind[1:10]
## [1] FALSE FALSE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE
The number of entries in the training set is
sum(ind)
## [1] 96
The number of entries in the test set is
sum(!ind)
## [1] 54
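Note that this method gives only an approximate 70/30 split: each observation lands in the training set independently with probability 0.7. You can check the realized fraction:
mean(ind)
## [1] 0.64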
Now we will create two pairs (training set, test set): one from the original iris dataset, the other from the transformed dataset where petal length is measured in feet.
train_data_1 <- iris %>% filter(ind) %>%
select(Petal.Length, Petal.Width, Species)
test_data_1 <- iris %>% filter(!ind) %>%
select(Petal.Length, Petal.Width, Species)
train_data_2 <- iris_transformed %>% filter(ind) %>%
select(Petal.Length, Petal.Width, Species)
test_data_2 <- iris_transformed %>% filter(!ind) %>%
select(Petal.Length, Petal.Width, Species)
cat("Dimensions of training set 1 are", dim(train_data_1), "\n")
## Dimensions of training set 1 are 96 3
cat("Dimensions of test set 2 are", dim(test_data_2), "\n")
## Dimensions of test set 2 are 54 3
An alternative method of splitting the data into training and test sets is based on slice - use it if you want to specify the exact number of data points in the training and test sets:
set.seed(42)
s <- sample(1:150, 105)
train_data_3 <- iris %>% slice(s) %>%
select(Petal.Length, Petal.Width, Species)
test_data_3 <- iris %>% slice(-s) %>%
select(Petal.Length, Petal.Width, Species)
cat("Dimensions of training set 3 are", dim(train_data_3), "\n")
## Dimensions of training set 3 are 105 3
cat("Dimensions of test set 3 are", dim(test_data_3), "\n")
## Dimensions of test set 3 are 45 3
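The library caret also provides createDataPartition, which produces a stratified split, i.e., it preserves the class proportions in both parts. A small sketch (the variable names here are ours):
set.seed(42)
idx <- as.vector(createDataPartition(iris$Species, p = 0.7, list = FALSE))
train_strat <- iris %>% slice(idx)
test_strat <- iris %>% slice(-idx)
table(train_strat$Species) # roughly 35 flowers of each species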
First, we train our KNN model with \(K=3\) on the original iris dataset (well, on the training set extracted from the original dataset):
# Training the model:
mod_knn_1 <- knn3(Species ~ . , data = train_data_1, k = 3)
# Making predictions:
preds_1 <- predict(mod_knn_1, test_data_1, type = "class")
# Note that without the option type = "class" this function
# will return probability vectors rather than predicted labels
# Very often, it is better to have probability vectors,
# but now, just to make a point, we will work with labels.
head(preds_1)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
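To see the probability form mentioned in the comment above, call predict without type = "class" (for knn3, the default is type = "prob"); each row then contains the class proportions among the \(K=3\) nearest neighbours:
head(predict(mod_knn_1, test_data_1))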
Now we will train another KNN model, also with \(K=3\), but this time on the transformed dataset:
mod_knn_2 <- knn3(Species ~ . , data = train_data_2, k = 3)
preds_2 <- predict(mod_knn_2, test_data_2, type = "class")
head(preds_2)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
Below is the confusion matrix between predictions of the two models:
confusionMatrix(preds_1, preds_2)$table
## Reference
## Prediction setosa versicolor virginica
## setosa 22 0 0
## versicolor 0 15 2
## virginica 0 2 13
# The function confusionMatrix returns a lot of information
# We just need the matrix itself, that's why we added '$table'
# You can try deleting '$table' to see what else you can get from it
The two models agree on predicting “setosa”, but disagree on predicting “versicolor” and “virginica”. The point of this exercise is to see that changing the scale of variables affects KNN predictions.
To normalize the data means to bring all variables to the same scale. The simplest approach is to apply the transformation \[ x_i\mapsto \frac{x_i - \min(x)}{\max(x) - \min(x)} \] to every entry \(x_i\) of each vector of observations \(x\).
We could start with our original dataset and apply this function to every numeric column as follows:
normalize <- function(x, m = min(x), M = max(x)) {
(x - m) / (M - m)
}
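# A quick check of the helper on toy values:
normalize(c(2, 4, 6)) # 0.0 0.5 1.0
# With explicit bounds (as we will need later for test data),
# results can fall outside [0, 1]:
normalize(c(5, 7), m = 2, M = 6) # 0.75 1.25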
# Here we normalize all numeric variables in the data.
# To do it, we need the special function "across" from tidyverse;
# where(is.numeric) selects all numeric columns.
iris_normalized <- iris %>%
  mutate(across(where(is.numeric), normalize))
iris_normalized %>% sample_n(5)
Why is this method of normalizing the data incorrect?
Answer: Finding the minimum and maximum values of each variable is part of training, and we can’t use the test set for it.
The right method is to normalize the training set and then apply the same transformation to the test set. This is a bit more complicated because we will have to use the minimum and maximum from the training set on the test set. There are different ways to do it. For example, we can save the largest and the smallest entry in each column of the training set (see the sketch below). Alternatively, instead of creating two different datasets for training and testing, we can label records in the original dataset.
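Here is a minimal sketch of the first approach on the first (training, test) pair, reusing the normalize helper defined above; the names train_norm_v1 and test_norm_v1 are ours:
train_norm_v1 <- train_data_1 %>%
  mutate(Petal.Length = normalize(Petal.Length),
         Petal.Width = normalize(Petal.Width))
test_norm_v1 <- test_data_1 %>%
  mutate(Petal.Length = normalize(Petal.Length,
                                  min(train_data_1$Petal.Length),
                                  max(train_data_1$Petal.Length)),
         Petal.Width = normalize(Petal.Width,
                                 min(train_data_1$Petal.Width),
                                 max(train_data_1$Petal.Width)))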
Here is how we do it with the labelling approach:
iris %>%
mutate(label = ifelse(ind, "train", "test")) %>%
select(Petal.Length, Petal.Width, Species, label)
Now we will create the normalized combined training and test sets by applying the transformation \[ x_i\mapsto \frac{x_i - \min(x_{train})}{\max(x_{train}) - \min(x_{train})} \] across all numeric variables:
norm_data_1 <- iris %>%
mutate(label = ifelse(ind, "train", "test")) %>%
select(Petal.Length, Petal.Width, Species, label) %>%
mutate(
across(where(is.numeric),
function(x) normalize(x, min(x[ind]), max(x[ind]))
)
)
norm_data_1
Now we will define our training and test sets:
train_data_1_norm <- norm_data_1 %>%
filter(label == "train") %>%
select(-label)
test_data_1_norm <- norm_data_1 %>%
filter(label == "test") %>%
select(-label)
And finally, we will train a KNN model with \(K=3\) on the normalized training data and apply it to the normalized test data. Below is the confusion matrix of the model on the test data:
mod_knn_1_norm <- knn3(Species ~ . , data = train_data_1_norm, k = 3)
preds_1 <- predict(mod_knn_1_norm, test_data_1_norm, type = "class")
confusionMatrix(preds_1, test_data_1_norm$Species)$table
## Reference
## Prediction setosa versicolor virginica
## setosa 22 0 0
## versicolor 0 12 4
## virginica 0 0 16
Apply the same method of normalization to the second pair (training data, test data), the one where petal length is measured in feet. Check that the two KNN models constructed on datasets of different scales yield the same predictions.
# Write your code below
norm_data_2 <- iris_transformed %>%
mutate(label = ifelse(ind, "train", "test")) %>%
select(Petal.Length, Petal.Width, Species, label) %>%
mutate(
across(where(is.numeric),
function(x) normalize(x, min(x[ind]), max(x[ind]))
)
)
train_data_2_norm <- norm_data_2 %>%
filter(label == "train") %>%
select(-label)
test_data_2_norm <- norm_data_2 %>%
filter(label == "test") %>%
select(-label)
mod_knn_2_norm <- knn3(Species ~ . , data = train_data_2_norm, k = 3)
preds_2 <- predict(mod_knn_2_norm, test_data_2_norm, type = "class")
confusionMatrix(preds_1, preds_2)$table
## Reference
## Prediction setosa versicolor virginica
## setosa 22 0 0
## versicolor 0 16 0
## virginica 0 0 16
We implemented data normalization by hand above to better understand the pitfalls of the KNN method and to get some extra practice with data transformation and functional programming. In practice, we just use the built-in methods for automatic normalization that R packages provide. Here is how we do it in caret:
knn_mod_1 <- train(Species ~ ., data = train_data_1, method = "knn",
trControl = trainControl("none"),
tuneGrid = expand.grid(k = 3),
preProcess = c("range"))
preds_1 <- predict(knn_mod_1, test_data_1, type = "raw")
head(preds_1)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
Note that the function train from the library caret is a universal wrapper for training a lot of different models. We will understand better how it works later; for now, let’s just accept this syntax. Besides, note that it is the option preProcess = c("range") that enables automatic data normalization - "range" performs exactly the min-max scaling we implemented by hand.
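The preprocessing step can also be built as a standalone object with caret’s preProcess function, which is useful if you want to inspect or reuse the learned transformation; a small sketch:
pp <- preProcess(train_data_1[, c("Petal.Length", "Petal.Width")],
                 method = "range")
head(predict(pp, test_data_1[, c("Petal.Length", "Petal.Width")]))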
Now we will train another KNN model with the same settings but using the second training set (with petal length measured in feet). Then we will report the confusion matrix to see that the two models produce the same results.
knn_mod_2 <- train(Species ~ ., data = train_data_2, method = "knn",
trControl = trainControl("none"),
tuneGrid = expand.grid(k = 3),
preProcess = c("scale"))
preds_2 <- predict(knn_mod_2, test_data_2, type = "raw")
confusionMatrix(preds_1, preds_2)$table
## Reference
## Prediction setosa versicolor virginica
## setosa 22 0 0
## versicolor 0 16 0
## virginica 0 0 16