(1) Split the data into train and test sets (use p=0.7).
(2) Predict wage from 3 of the other variables with a KNN model (k=5). That is, fit and predict.
(3) Assess performance using the metrics you’ve learned.
(4) To improve flexibility, try a different k. Will you use a bigger or a smaller k?
(5) If you tried a smaller k, now try a bigger k (or vice versa).
- What will you gain from this and what will you lose? (in terms of performance indices)

Submitted by: Eldad Aviv

ID: 206836165

Load Libraries:

First, load the libraries needed to build a predictive model with the KNN algorithm.

The specific use of each library is described in the comment that precedes it.

# Load ISLR library to get wage data
library(ISLR)

# Load rsample to split data
library(rsample)

# Load tidyverse for data manipulations 
library(tidyverse)

# Load recipes for model configuration
library(recipes)

# Load caret for model fitting and prediction
library(caret)

(1) Split the data into train and test sets (use p=0.7).

Functions:

set.seed() - Makes the random split reproducible: running the same split code again yields the same result.
initial_split() - From the rsample library; splits the data into training and test sets (arguments: the data frame and the training-set proportion).

Implementation:

# Define Wage as data
data <- Wage

# Use set.seed() to make sure split is reproducible
set.seed(1)

# Apply split on wage using prop of 0.7
splits <- initial_split(data, prop = 0.7)

# Define training and testing sets
train_data <- training(splits) 
test_data <- testing(splits)
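
To make the mechanics of the split concrete, here is a minimal base R sketch of the same idea on a small made-up data frame. The names `toy`, `train_idx`, `toy_train` and `toy_test` are hypothetical, and `sample()` is used as a stand-in: this is not a claim about how `initial_split()` is implemented internally.

```r
# Hypothetical toy data (not the Wage data)
toy <- data.frame(id = 1:10, x = (1:10)^2)

# Fix the random number generator so the split is reproducible
set.seed(1)
n_train   <- round(0.7 * nrow(toy))        # 70% of the rows go to training
train_idx <- sample(nrow(toy), n_train)    # row indices, drawn without replacement

toy_train <- toy[train_idx, ]              # training rows
toy_test  <- toy[-train_idx, ]             # all remaining rows

# Resetting the same seed reproduces exactly the same indices
set.seed(1)
same_idx <- sample(nrow(toy), n_train)
identical(train_idx, same_idx)             # TRUE
```

Every row ends up in exactly one of the two sets, which is the property the test set relies on when it is used to estimate out-of-sample performance.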

(2) Predict wage from 3 of the other variables with a KNN model (k=5). That is, fit and predict.

Functions:

recipe() - From the recipes library; defines the model formula (outcome and predictors) and a fixed set of preprocessing steps.
step_center(all_predictors()) - Centers all predictors.
step_scale(all_predictors()) - Scales all predictors.
prep() - Prepares the recipe by estimating the required parameters (e.g., mean, standard deviation) from the training data.
bake() - Applies the prepared recipe to data.
expand.grid() - Defines a value, or a grid of values, for the hyperparameters (here, k).
train() - From the caret library; fits a model with the chosen method.

# Define a model that predicts wage from year, age and logwage
# (note: logwage is the log of wage, so this predictor almost determines the outcome)
rec <- recipe(wage ~ year + age + logwage, 
              data = train_data)
rec

We center and scale the predictors so that all features contribute equally to the distance calculations in the KNN algorithm.

# Add centering and scaling steps to the recipe (rec)
rec <- rec |> 
  step_center(all_predictors()) |> 
  step_scale(all_predictors())
rec
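
The effect of scaling on nearest-neighbor search can be seen in a small base R sketch. The five rows below are made up for illustration, and `nearest_to_first()` is a hypothetical helper; base R's `scale()` stands in for the centering and scaling steps of the recipe.

```r
# Hypothetical toy data: age spans tens of units, logwage only about 1.5
toy <- data.frame(
  age     = c(25, 27, 60, 40, 55),
  logwage = c(4.0, 5.5, 4.1, 4.6, 5.0)
)

# Return the row number of the nearest neighbor of row 1 (Euclidean distance)
nearest_to_first <- function(m) {
  m <- as.matrix(m)
  diffs <- sweep(m[-1, , drop = FALSE], 2, m[1, ])  # subtract row 1 from every other row
  d <- sqrt(rowSums(diffs^2))
  unname(which.min(d)) + 1L                         # +1 because row 1 was dropped
}

nearest_to_first(toy)         # 2: on the raw scale, age differences dominate
nearest_to_first(scale(toy))  # 4: after standardizing, logwage counts as much as age
```

On the raw scale, row 2 looks closest only because its age is similar, even though its logwage is very different. After standardizing, both features are measured in standard deviations and the nearest neighbor changes.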

# Prepare the recipe(rec) 
prep(rec)

## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome:   1
## predictor: 3
##
## ── Training information
## Training data contained 2100 data points and no incomplete rows.
##
## ── Operations
## • Centering for: year, age, logwage | Trained
## • Scaling for: year, age, logwage | Trained

# Apply the prepared recipe to the training data and show the first few rows
bake(prep(rec), new_data = NULL) |> head() # new_data = NULL returns the processed training set

## # A tibble: 6 × 4
##      year    age logwage  wage
##     <dbl>  <dbl>   <dbl> <dbl>
## 1 -0.893   1.42    1.61  182. 
## 2 -1.39    0.213  -0.143  99.7
## 3  0.590  -1.86   -0.248  96.1
## 4  0.0955  0.559   1.33  165. 
## 5 -0.893   1.42   -0.468  89.2
## 6  0.0955 -0.133  -0.143  99.7

# Apply the prepared recipe to the test data and show the first few rows
bake(prep(rec), new_data = test_data) |> head()

## # A tibble: 6 × 4
##      year    age logwage  wage
##     <dbl>  <dbl>   <dbl> <dbl>
## 1  0.0955 -2.12   -0.970  75.0
## 2 -0.893  -1.60   -1.15   70.5
## 3  1.08    0.991   0.565 127. 
## 4  0.0955  0.645   2.07  213. 
## 5 -1.39   -0.479  -0.175  98.6
## 6 -1.39   -0.392   1.89  201.
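
The baked test values above are not centered at zero because the centering and scaling parameters are estimated on the training set and then reused unchanged on the test set. A minimal base R sketch of that idea follows, with made-up numbers and hypothetical names (`train_toy`, `test_toy`, `mu`, `sigma`); it assumes scaling by the sample standard deviation.

```r
# Hypothetical toy data
train_toy <- data.frame(age = c(20, 30, 40, 50))
test_toy  <- data.frame(age = c(35, 60))

# "prep": estimate the parameters from the training data only
mu    <- mean(train_toy$age)
sigma <- sd(train_toy$age)

# "bake": apply the same parameters to any data set
train_scaled <- (train_toy$age - mu) / sigma
test_scaled  <- (test_toy$age - mu) / sigma

mean(train_scaled)  # 0 (training data is centered by construction)
sd(train_scaled)    # 1
test_scaled         # expressed in training-set units, not re-centered on the test set
```

Reusing the training parameters keeps information from the test set out of the preprocessing, so the test metrics remain an honest estimate.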

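# Define k as 5
tg <- expand.grid(
  k = 5 # number of neighbors, between 1 and N (the training set size)
)
# method = "none" fits a single model with the given k, without resampling
tc <- trainControl(method = "none")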

# Train the model
knn.fit5 <- train(
  x = rec, # the recipe does two things:
  # (1) tells the model which variable is y and which are the Xs
  # (2) defines which preprocessing steps to apply
  data = train_data, # the training data (required)
  method = "knn", # method used for fitting the model (here, KNN)
  tuneGrid = tg, # the hyperparameter value (k)
  trControl = tc
)
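
For intuition about what the fitted model does at prediction time, here is a minimal from-scratch sketch of KNN regression in base R. `knn_predict()` and the toy data are hypothetical and only illustrate the idea (average the outcomes of the k closest training rows); caret's implementation may differ in details such as tie handling.

```r
# Minimal KNN regression: average the outcome of the k nearest training rows
knn_predict <- function(train_x, train_y, new_x, k) {
  train_x <- as.matrix(train_x)
  new_x   <- as.matrix(new_x)
  apply(new_x, 1, function(pt) {
    # Euclidean distance from the new point to every training row
    d <- sqrt(colSums((t(train_x) - pt)^2))
    nearest <- order(d)[seq_len(k)]   # indices of the k closest rows
    mean(train_y[nearest])            # prediction = mean of their outcomes
  })
}

# Hypothetical toy data with a single predictor
toy_x <- data.frame(x = c(1, 2, 3, 10))
toy_y <- c(10, 20, 30, 100)

knn_predict(toy_x, toy_y, data.frame(x = 2.2), k = 2)  # 25: mean of 20 and 30
knn_predict(toy_x, toy_y, data.frame(x = 2.2), k = 4)  # 40: mean of all four outcomes
```

With k equal to the number of training rows, every prediction collapses to the overall mean of the outcome, which is the least flexible model KNN can produce.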

(3) Assess performance using the metrics you’ve learned.

predict() - Makes predictions from a fitted model.

test_data$wage_hat <- predict(knn.fit5,  # fitted model used for prediction
                             newdata = test_data)  # the data to predict from
# Values for the PREDICTORS will be taken from the TEST data. Note that
# predict() *also* processes the newdata according to the trained recipe!

plot(wage ~ wage_hat, data = test_data) # true vs predicted values

c(
  Rsq = cor(test_data$wage, test_data$wage_hat)^2,
  RMSE = sqrt(mean((test_data$wage - test_data$wage_hat)^2)),
  MAE = mean(abs(test_data$wage - test_data$wage_hat))
)

##       Rsq      RMSE       MAE 
## 0.9918558 4.2925924 2.6261591
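
The three metrics are computed the same way for every model in this report, so they could be wrapped in a small helper. The sketch below uses a hypothetical function, `regression_metrics()`, and also reports a second R² definition (1 - SSE/SST) as an assumption-labeled addition, because the squared correlation used above does not penalize systematically biased predictions.

```r
# Hypothetical helper: test-set metrics for a regression model
regression_metrics <- function(y, y_hat) {
  c(
    Rsq_cor = cor(y, y_hat)^2,                                 # squared correlation (as used above)
    Rsq_sse = 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2),   # 1 - SSE/SST
    RMSE    = sqrt(mean((y - y_hat)^2)),
    MAE     = mean(abs(y - y_hat))
  )
}

# Toy example: predictions that are perfectly correlated but shifted by 10
y     <- c(1, 2, 3, 4)
y_hat <- y + 10
regression_metrics(y, y_hat)
# Rsq_cor is 1 despite the shift, while Rsq_sse (-79), RMSE (10) and MAE (10) expose it
```

RMSE squares the errors before averaging, so it reacts more strongly than MAE to a few large misses; comparing the two gives a rough sense of how heavy the error tail is.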

(4) To improve flexibility, try a different k. Will you use a bigger or a smaller k?

Decreasing 𝑘 improves flexibility: each prediction averages over fewer neighbors, so the model becomes more sensitive to finer details and patterns in the data.

Thus, we decrease 𝑘 to 3 and test the model:

# Redefine K as 3
tg2 <- expand.grid(k = 3)

# Train the model
knn.fit3 <- train(
  x = rec, # Apply recipe
  data = train_data, # Define data
  method = "knn", # method used for fitting the model (now - knn)
  tuneGrid = tg2,
  trControl = tc
)

# Assess performance
test_data$wage_hat2 <- predict(knn.fit3,  # fitted model used for prediction
                             newdata = test_data)  # the data to predict from
# Values for the PREDICTORS will be taken from the TEST data. Note that
# predict() *also* processes the newdata according to the trained recipe!

plot(wage ~ wage_hat2, data = test_data) # true vs predicted values

c(
  Rsq = cor(test_data$wage, test_data$wage_hat2)^2,
  RMSE = sqrt(mean((test_data$wage - test_data$wage_hat2)^2)),
  MAE = mean(abs(test_data$wage - test_data$wage_hat2))
)

##       Rsq      RMSE       MAE 
## 0.9906159 4.4967065 2.5549004

(5) If you tried a smaller k, now try a bigger k (or vice versa).

# Redefine K as 10
tg3 <- expand.grid(k = 10)

# Train the model
knn.fit10 <- train(
  x = rec,
  data = train_data,
  method = "knn",
  tuneGrid = tg3,
  trControl = tc
)

test_data$wage_hat3 <- predict(knn.fit10, newdata = test_data)

plot(wage ~ wage_hat3, data = test_data) # true vs predicted values

c(
  Rsq = cor(test_data$wage, test_data$wage_hat3)^2,
  RMSE = sqrt(mean((test_data$wage - test_data$wage_hat3)^2)),
  MAE = mean(abs(test_data$wage - test_data$wage_hat3))
)

##      Rsq     RMSE      MAE 
## 0.990068 5.151366 3.048927

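What will you gain from this and what will you lose? (in terms of performance indices)

With smaller 𝑘 (e.g., 𝑘=3):

Gains: The model is more flexible and more sensitive to fine details and local patterns, so it fits the training data more closely. On the test set, MAE improved slightly compared with 𝑘=5 (2.63 to 2.55).
Losses: Higher variance and greater sensitivity to noise, which can lead to overfitting and worse generalization to new data. On the test set, RMSE rose (4.29 to 4.50) and R² fell slightly (0.9919 to 0.9906), which suggests a few larger errors.

With larger 𝑘 (e.g., 𝑘=10):

Gains: Predictions average over more neighbors, so the model is more stable and less sensitive to noise (lower variance), which often helps generalization.
Losses: Higher bias, as the model becomes less flexible and may miss finer details, potentially leading to underfitting. That is what happened here: all three test metrics were worse than with 𝑘=5 (R² 0.9901, RMSE 5.15, MAE 3.05).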
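
The trade-off described above can be explored with a small simulation. The sketch below is self-contained base R on synthetic data (a noisy sine curve); the helpers `knn_1d()` and `rmse()` and all numbers are hypothetical and unrelated to the Wage results. It compares training and test RMSE across several values of k.

```r
set.seed(1)

# Synthetic data: one predictor, nonlinear signal plus noise
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)
train_idx <- sample(n, 140)   # 70/30 split

# One-dimensional KNN regression: average the k nearest training outcomes
knn_1d <- function(x_train, y_train, x_new, k) {
  sapply(x_new, function(x0) {
    nearest <- order(abs(x_train - x0))[seq_len(k)]
    mean(y_train[nearest])
  })
}

rmse <- function(y, y_hat) sqrt(mean((y - y_hat)^2))

ks <- c(1, 3, 5, 10, 50, 140)
results <- data.frame(
  k = ks,
  train_rmse = sapply(ks, function(k)
    rmse(y[train_idx], knn_1d(x[train_idx], y[train_idx], x[train_idx], k))),
  test_rmse = sapply(ks, function(k)
    rmse(y[-train_idx], knn_1d(x[train_idx], y[train_idx], x[-train_idx], k)))
)
print(results)
```

With k = 1 the training RMSE is exactly zero, because each training point is its own nearest neighbor, which is the extreme of flexibility and variance. With k equal to the training-set size, every prediction is the training mean, which is the extreme of bias. The test RMSE column shows where, between those extremes, the model generalizes best on this particular simulated data.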

ML - Assignment 1

Eldad Aviv

2024-06-06