Submitted by: Eldad Aviv
ID: 206836165
Load Libraries:
First, load the libraries needed to build a predictive model with the KNN algorithm.
The purpose of each library is described in the comment that precedes it.
# Load ISLR library to get wage data
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.3.3
# Load rsample to split data
library(rsample)
## Warning: package 'rsample' was built under R version 4.3.3
# Load tidyverse for data manipulations
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
# Load recipes for model configuration
library(recipes)
## Warning: package 'recipes' was built under R version 4.3.3
# Load caret for model fitting and prediction
library(caret)
## Warning: package 'caret' was built under R version 4.3.3
Functions:
set.seed() - Makes the random split reproducible: running the same split code multiple times yields the same result.
initial_split() - From the rsample library; splits the data into training and test sets (arguments: data frame, training-set proportion).
Implementation:
# Define Wage as data
data <- Wage
# Use set.seed() to make sure split is reproducible
set.seed(1)
# Apply split on wage using prop of 0.7
splits <- initial_split(data, prop = 0.7)
# Define training and testing sets
train_data <- training(splits)
test_data <- testing(splits)
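As a rough illustration of what a seeded split does, the sketch below reproduces a 70/30 split in base R with sample(). The data frame toy_df and the helper split_rows() are hypothetical names, and this is not a description of how initial_split() works internally; it only shows the idea of reproducible row sampling.

```r
# Hypothetical toy data frame standing in for Wage
toy_df <- data.frame(id = 1:10, value = (1:10)^2)

# Hypothetical helper: draw a random subset of row indices for training
split_rows <- function(df, prop, seed) {
  set.seed(seed)                        # fix the random number stream
  n_train <- floor(prop * nrow(df))     # number of training rows
  train_idx <- sample(nrow(df), n_train)
  list(train = df[train_idx, ], test = df[-train_idx, ])
}

first  <- split_rows(toy_df, prop = 0.7, seed = 1)
second <- split_rows(toy_df, prop = 0.7, seed = 1)

# The same seed gives the same training rows
identical(first$train$id, second$train$id)
nrow(first$train)
nrow(first$test)
```

Changing the seed would generally give a different, but equally valid, split.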
Functions:
recipe() - From recipes; defines the model (outcome and predictors) together with a fixed set of preprocessing steps.
step_center(all_predictors()) - Centers all predictors.
step_scale(all_predictors()) - Scales all predictors.
prep() - Prepares the recipe by estimating the required parameters (e.g., mean, standard deviation).
bake() - Applies the prepared recipe to data.
expand.grid() - Defines a single value or a grid of values for the hyperparameter (k).
train() - From caret; trains a model with the chosen algorithm.
# Define a model to predict wage with year, age and logwage
rec <- recipe(wage ~ year + age + logwage,
data = train_data)
rec
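One thing worth checking before fitting is how closely the logwage predictor tracks the outcome. The ISLR documentation describes logwage as the log of the worker's wage, so the sketch below compares log(wage) with logwage directly. It assumes the ISLR package is installed and is skipped otherwise; wage_df and leak_cor are hypothetical names.

```r
# Sketch: check whether logwage is a direct transform of the outcome.
# Assumes the ISLR package is installed; the check is skipped otherwise.
if (requireNamespace("ISLR", quietly = TRUE)) {
  wage_df <- ISLR::Wage
  # Correlation between log(wage) and the logwage column
  leak_cor <- cor(log(wage_df$wage), wage_df$logwage)
  print(leak_cor)
  # Largest absolute gap between the two columns
  print(max(abs(log(wage_df$wage) - wage_df$logwage)))
}
```

If the two columns agree almost exactly, a model that uses logwage as a predictor is effectively given the outcome, and its test metrics should be read with that in mind.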
We centered and scaled our predictors to ensure that all features contribute equally to the distance calculations in the KNN algorithm.
# Add centering and scaling steps to the recipe (rec)
rec <- rec |>
step_center(all_predictors()) |>
step_scale(all_predictors())
rec
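To see why scaling matters for distance-based methods, the following sketch compares how much each predictor contributes to a squared Euclidean distance before and after standardization. The two observations and the means and standard deviations are made-up illustrative values, not estimates from the Wage data.

```r
# Two hypothetical workers described by (year, age, logwage)
a <- c(year = 2003, age = 30, logwage = 4.2)
b <- c(year = 2009, age = 31, logwage = 5.0)

# Illustrative (assumed) means and standard deviations, not real estimates
mu    <- c(year = 2006, age = 42,   logwage = 4.65)
sigma <- c(year = 2,    age = 11.5, logwage = 0.35)

# Standardize both observations with the same parameters
a_std <- (a - mu) / sigma
b_std <- (b - mu) / sigma

# Share of the squared distance contributed by each predictor
raw_share <- (a - b)^2 / sum((a - b)^2)
std_share <- (a_std - b_std)^2 / sum((a_std - b_std)^2)

round(raw_share, 3)  # raw units: year dominates because its gap is largest
round(std_share, 3)  # standardized: logwage now carries a visible share
```

In raw units the variable with the largest numeric spread dominates the distance, regardless of how informative it is.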
# Prepare the recipe(rec)
prep(rec)
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome: 1
## predictor: 3
##
## ── Training information
## Training data contained 2100 data points and no incomplete rows.
##
## ── Operations
## • Centering for: year, age, logwage | Trained
## • Scaling for: year, age, logwage | Trained
# Apply the prepared recipe to the training data and show the first few rows
bake(prep(rec), new_data = NULL) |> head() # new_data = NULL returns the processed training set
## # A tibble: 6 × 4
## year age logwage wage
## <dbl> <dbl> <dbl> <dbl>
## 1 -0.893 1.42 1.61 182.
## 2 -1.39 0.213 -0.143 99.7
## 3 0.590 -1.86 -0.248 96.1
## 4 0.0955 0.559 1.33 165.
## 5 -0.893 1.42 -0.468 89.2
## 6 0.0955 -0.133 -0.143 99.7
# Apply the prepared recipe to the test data and show the first few rows
bake(prep(rec), new_data = test_data) |> head()
## # A tibble: 6 × 4
## year age logwage wage
## <dbl> <dbl> <dbl> <dbl>
## 1 0.0955 -2.12 -0.970 75.0
## 2 -0.893 -1.60 -1.15 70.5
## 3 1.08 0.991 0.565 127.
## 4 0.0955 0.645 2.07 213.
## 5 -1.39 -0.479 -0.175 98.6
## 6 -1.39 -0.392 1.89 201.
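The division of labor between prep() and bake() can be sketched in base R: the parameters are estimated once on the training data and then reused unchanged on any other data. The data frames and variable names below are hypothetical, and the sketch mimics the idea only, not the recipes implementation.

```r
# Hypothetical training and test predictors
train_x <- data.frame(age = c(25, 35, 45, 55))
test_x  <- data.frame(age = c(30, 60))

# "prep" step: estimate the parameters on the training data only
age_mean <- mean(train_x$age)
age_sd   <- sd(train_x$age)

# "bake" step: apply those same parameters to any data set
train_baked <- (train_x$age - age_mean) / age_sd
test_baked  <- (test_x$age  - age_mean) / age_sd

mean(train_baked)  # zero by construction (up to floating-point error)
mean(test_baked)   # generally not zero: the test set reuses the training mean
```

Reusing the training parameters keeps information from the test set out of the preprocessing.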
# Define K as 5
tg <- expand.grid(
k = 5 # [1, N] neighbors
)
tc <- trainControl(method = "none")
# Train the model
knn.fit5 <- train(
x = rec, # the recipe does two things:
# (1) informs the model what the y and what the Xs are
# (2) what pre-proc steps should be taken.
data = train_data, # the training data (a MUST argument)
method = "knn", # method used for fitting the model (now - knn)
tuneGrid = tg, # define hyperparameter k
trControl = tc
)
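For intuition about what a fitted KNN regression model does at prediction time, here is a minimal from-scratch sketch: find the k training rows closest to the query point and average their outcomes. The helper knn_predict_one() and the toy data are hypothetical; caret's own implementation may differ in details such as tie handling.

```r
# Hypothetical helper: predict one query point with k-nearest-neighbour regression
knn_predict_one <- function(train_x, train_y, query, k) {
  # Subtract the query from every training row (column by column)
  diffs <- sweep(train_x, 2, query)
  # Euclidean distance from the query to each training row
  dists <- sqrt(rowSums(diffs^2))
  # Indices of the k closest rows
  nearest <- order(dists)[seq_len(k)]
  # The prediction is the mean outcome of those neighbours
  mean(train_y[nearest])
}

# Toy predictors (rows = observations) and outcomes
train_x <- matrix(c(0, 0,
                    1, 0,
                    0, 1,
                    5, 5), ncol = 2, byrow = TRUE)
train_y <- c(10, 20, 30, 100)

# The three nearest rows have outcomes 10, 20 and 30, so the result is 20
knn_predict_one(train_x, train_y, query = c(0.1, 0.1), k = 3)
```

With k = 1 the same call would return the outcome of the single closest row.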
predict() - Makes predictions from a fitted model.
test_data$wage_hat <- predict(knn.fit5, # fitted model used for prediction
newdata = test_data) # the data to predict from
# Values for the PREDICTORS will be taken from the TEST data. Note that
# predict() *also* processes the newdata according to the trained recipe!
plot(wage ~ wage_hat, data = test_data) # true vs predicted values
c(
Rsq = cor(test_data$wage, test_data$wage_hat)^2,
RMSE = sqrt(mean((test_data$wage - test_data$wage_hat)^2)),
MAE = mean(abs(test_data$wage - test_data$wage_hat))
)
## Rsq RMSE MAE
## 0.9918558 4.2925924 2.6261591
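Because the same three metrics are computed for every value of k, they could be wrapped in a small helper. The function name eval_metrics() and the toy vectors are hypothetical; the formulas are the ones used in this report.

```r
# Hypothetical helper returning the three metrics used in this report
eval_metrics <- function(truth, estimate) {
  c(
    Rsq  = cor(truth, estimate)^2,             # squared correlation
    RMSE = sqrt(mean((truth - estimate)^2)),   # root mean squared error
    MAE  = mean(abs(truth - estimate))         # mean absolute error
  )
}

# Toy true and predicted values
truth    <- c(100, 120, 80, 150)
estimate <- c(110, 115, 85, 140)
eval_metrics(truth, estimate)
```

RMSE penalizes large errors more heavily than MAE, so RMSE is never smaller than MAE on the same predictions.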
Decreasing 𝑘 makes the model more flexible: each prediction depends on fewer neighbors, so the fit follows finer local patterns in the data, at the risk of also following noise.
We therefore decrease 𝑘 to 3 and re-test the model:
# Redefine K as 3
tg2 <- expand.grid(k = 3)
# Train the model
knn.fit3 <- train(
x = rec, # Apply recipe
data = train_data, # Define data
method = "knn", # method used for fitting the model (now - knn)
tuneGrid = tg2,
trControl = tc
)
# Assess performance
test_data$wage_hat2 <- predict(knn.fit3, # fitted model used for prediction
newdata = test_data) # the data to predict from
# Values for the PREDICTORS will be taken from the TEST data. Note that
# predict() *also* processes the newdata according to the trained recipe!
plot(wage ~ wage_hat2, data = test_data) # true vs predicted values
c(
Rsq = cor(test_data$wage, test_data$wage_hat2)^2,
RMSE = sqrt(mean((test_data$wage - test_data$wage_hat2)^2)),
MAE = mean(abs(test_data$wage - test_data$wage_hat2))
)
## Rsq RMSE MAE
## 0.9906159 4.4967065 2.5549004
# Redefine K as 10
tg3 <- expand.grid(k = 10)
# Train the model
knn.fit10 <- train(
x = rec,
data = train_data,
method = "knn",
tuneGrid = tg3,
trControl = tc
)
test_data$wage_hat3 <- predict(knn.fit10, newdata = test_data)
plot(wage ~ wage_hat3, data = test_data) # true vs predicted values
c(
Rsq = cor(test_data$wage, test_data$wage_hat3)^2,
RMSE = sqrt(mean((test_data$wage - test_data$wage_hat3)^2)),
MAE = mean(abs(test_data$wage - test_data$wage_hat3))
)
## Rsq RMSE MAE
## 0.990068 5.151366 3.048927
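With smaller 𝑘 (e.g., 𝑘 = 3): the model is more flexible. Relative to 𝑘 = 5, test RMSE rose slightly (4.29 to 4.50) while MAE fell slightly (2.63 to 2.55).
With larger 𝑘 (e.g., 𝑘 = 10): each prediction averages over more neighbors, giving a smoother, less flexible fit. Test RMSE (5.15) and MAE (3.05) were both worse than with 𝑘 = 5.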
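The link between 𝑘 and flexibility can be illustrated on synthetic data, sketched below in base R. With 𝑘 = 1 every training point is its own nearest neighbor, so the training error is zero, and the error grows as the fit is smoothed over more neighbors. The data and the helper knn_fit() are assumptions for illustration and are unrelated to the Wage results.

```r
set.seed(1)
# Synthetic one-predictor data (assumed for illustration)
x <- sort(runif(60))
y <- sin(2 * pi * x) + rnorm(60, sd = 0.3)

# Hypothetical helper: k-NN regression predictions for a vector of query points
knn_fit <- function(x_train, y_train, x_query, k) {
  sapply(x_query, function(q) {
    nearest <- order(abs(x_train - q))[seq_len(k)]  # k closest training points
    mean(y_train[nearest])                          # average their outcomes
  })
}

# Training RMSE for several values of k
ks <- c(1, 3, 10, 30)
train_rmse <- sapply(ks, function(k) {
  sqrt(mean((y - knn_fit(x, y, x, k))^2))
})
names(train_rmse) <- paste0("k=", ks)
train_rmse
```

A training error of zero at 𝑘 = 1 does not mean the model predicts new data well, which is why the comparisons in this report are made on the held-out test set.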