# Load necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# dplyr and ggplot2 are already attached above as part of the tidyverse,
# so separate library() calls for them are not needed.

Question 1: Flexible vs. Inflexible Methods

(a) The sample size n is extremely large and the number of predictors p is small: a flexible method is expected to perform better. With many observations and few predictors, a flexible fit can capture the true relationship without overfitting.

(b) The number of predictors p is extremely large and the number of observations n is small: an inflexible method is expected to perform better, because a flexible method would overfit the small sample.

(c) The relationship between the predictors and the response is highly non-linear: a flexible method is expected to perform better, since an inflexible method would be too biased to capture the non-linearity.

(d) The variance of the error terms, σ² = Var(ε), is extremely high: an inflexible method is expected to perform better. A flexible method would chase the noise in the training data and overfit; the short simulation sketched below illustrates this.
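A minimal simulation sketch for scenario (d) (our own illustration, not part of the exercise): data are generated from a linear truth with very noisy errors, and an inflexible linear model is compared with a deliberately over-flexible smoothing spline on fresh test data.

# Illustration for (d): very noisy errors favour the inflexible method
set.seed(1)
n <- 100
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n, sd = 15)             # linear truth, high error variance

x_test <- runif(1000, 0, 10)
y_test <- 2 + 3 * x_test + rnorm(1000, sd = 15)

inflexible <- lm(y ~ x)                        # inflexible: simple linear model
flexible   <- smooth.spline(x, y, df = 25)     # flexible: high-df smoothing spline

mse_linear <- mean((y_test - predict(inflexible, data.frame(x = x_test)))^2)
mse_spline <- mean((y_test - predict(flexible, x_test)$y)^2)
c(Linear = mse_linear, Spline = mse_spline)    # the flexible fit typically has the higher test MSE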

Question 2: Classification or Regression Problems

(a) CEO Salary Analysis: a regression problem, since salary is a quantitative response; the main goal is inference (understanding how the predictors affect CEO salary).

(b) Product Success Prediction: a classification problem, since the response is success or failure; the goal is prediction for a new product.

(c) USD/Euro Exchange Rate Prediction: a regression problem, since the weekly percentage change is quantitative; the goal is prediction.

Question 3: Bias-Variance Trade-off

(a) Sketch of Bias, Variance, and Error Curves

# Simulate flexibility levels
flexibility <- seq(1, 10, length.out = 100)

# Define components of the bias-variance trade-off
bias_sq <- 1 / flexibility            # Bias^2 decreases with flexibility
variance <- flexibility / 10          # Variance increases with flexibility
irreducible_error <- rep(0.5, 100)    # Irreducible error is constant
test_error <- bias_sq + variance + irreducible_error # Test error combines all terms
training_error <- 1.2 / flexibility   # Training error decreases monotonically with flexibility

# Plot the components
plot(flexibility, test_error, type = "l", col = "red", lwd = 2,
     ylim = c(0, max(test_error)), ylab = "Error", xlab = "Model Flexibility",
     main = "Bias-Variance Trade-off")
lines(flexibility, bias_sq, col = "blue", lwd = 2)
lines(flexibility, variance, col = "green", lwd = 2)
lines(flexibility, irreducible_error, col = "purple", lwd = 2)
lines(flexibility, training_error, col = "orange", lwd = 2)

# Add a legend
legend("topright", legend = c("Test Error", "Bias^2", "Variance",
                              "Irreducible Error", "Training Error"),
       col = c("red", "blue", "green", "purple", "orange"), lwd = 2)

(b) Explanation:

  1. Bias Curve: Decreases as flexibility increases, because more flexible models can approximate the true relationship more closely.

  2. Variance Curve: Increases with flexibility, because highly flexible models change substantially when the training data change slightly.

  3. Training Error: Decreases monotonically as flexibility increases, since a more flexible model can follow the training observations ever more closely (eventually overfitting them).

  4. Test Error: Forms a U-shape. It falls at first while the reduction in bias dominates, then rises once the increase in variance outweighs any further reduction in bias; the empirical sketch after this list reproduces the same pattern.

  5. Irreducible Error: Constant, since it reflects noise inherent in the data; it acts as a lower bound on the expected test error.
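As a supplementary illustration (our own sketch, not required by the exercise), the U-shaped test error and the decreasing training error can be reproduced empirically by repeatedly simulating training sets and fitting polynomials of increasing degree:

# Empirical version: average training/test MSE for polynomials of degree 1-10
set.seed(42)
true_f  <- function(x) sin(2 * x)
degrees <- 1:10
x_test  <- seq(0, 3, length.out = 200)

results <- sapply(degrees, function(d) {
    errs <- replicate(200, {
        x <- runif(50, 0, 3)
        y <- true_f(x) + rnorm(50, sd = 0.5)
        fit <- lm(y ~ poly(x, d))
        y_test <- true_f(x_test) + rnorm(200, sd = 0.5)
        c(train = mean(residuals(fit)^2),
          test  = mean((y_test - predict(fit, data.frame(x = x_test)))^2))
    })
    rowMeans(errs)                     # average over 200 simulated training sets
})

plot(degrees, results["test", ], type = "b", col = "red", ylim = range(results),
     xlab = "Polynomial degree (flexibility)", ylab = "Average MSE",
     main = "Empirical training vs. test error")
lines(degrees, results["train", ], type = "b", col = "orange")
legend("topright", legend = c("Test MSE", "Training MSE"),
       col = c("red", "orange"), lty = 1)

The exact location of the test-error minimum depends on the noise level and sample size chosen here.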

Question 4: Real-Life Applications

(a) Classification Applications:

  1. Email Spam Detection:

    • Response: Spam or Not Spam.

    • Predictors: Email content features.

    • Goal: Prediction.

  2. Disease Diagnosis:

    • Response: Disease type.

    • Predictors: Patient symptoms and test results.

    • Goal: Prediction.

  3. Customer Churn:

    • Response: Churn or Not Churn.

    • Predictors: Customer behavior metrics.

    • Goal: Prediction.
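As a concrete illustration of the churn example above, here is a minimal sketch using simulated data and logistic regression via glm(); all variable names are hypothetical:

# Toy churn model: logistic regression on simulated customer behaviour
set.seed(3)
churn_df <- data.frame(
    monthly_visits = rpois(200, lambda = 8),    # hypothetical behaviour metrics
    support_calls  = rpois(200, lambda = 2)
)
churn_df$churn <- rbinom(200, size = 1,
                         prob = plogis(-1 - 0.2 * churn_df$monthly_visits +
                                       0.8 * churn_df$support_calls))
churn_fit <- glm(churn ~ monthly_visits + support_calls,
                 data = churn_df, family = binomial)
head(predict(churn_fit, type = "response"))     # predicted churn probabilities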

(b) Regression Applications:

  1. House Price Prediction (see the sketch after this list):

    • Response: House price.

    • Predictors: Size, location, number of rooms.

    • Goal: Prediction.

  2. Sales Forecasting:

    • Response: Sales revenue.

    • Predictors: Advertising spend on TV/radio/newspaper.

    • Goal: Prediction.

  3. Stock Price Prediction:

    • Response: Stock price change.

    • Predictors: Market indicators.

    • Goal: Prediction.
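To make the house price example above concrete (as flagged in that item), here is a minimal sketch with simulated data and lm(); variable names are hypothetical:

# Toy house price model: linear regression on simulated data
set.seed(4)
houses <- data.frame(
    size_sqft = runif(150, 600, 3000),
    rooms     = sample(2:6, 150, replace = TRUE)
)
houses$price <- 50000 + 150 * houses$size_sqft + 10000 * houses$rooms +
    rnorm(150, sd = 20000)
coef(lm(price ~ size_sqft + rooms, data = houses))   # estimated effects of size and rooms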

(c) Clustering Applications:

  1. Customer Segmentation (see the sketch after this list):

    • Group customers based on purchasing behavior.

  2. Document Clustering:

    • Group articles based on topics using text features.

  3. Image Segmentation:

    • Cluster pixels in an image for object detection.
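A minimal customer segmentation sketch (our own illustration) with simulated spending data and base R's kmeans():

# Toy customer segmentation with k-means on simulated spending data
set.seed(7)
customers <- data.frame(
    annual_spend    = c(rnorm(60, mean = 200, sd = 30), rnorm(60, mean = 800, sd = 60)),
    visits_per_year = c(rnorm(60, mean = 5,   sd = 1),  rnorm(60, mean = 25,  sd = 3))
)
segments <- kmeans(scale(customers), centers = 2, nstart = 20)
table(segments$cluster)                     # sizes of the two recovered segments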

Question 5:

Question 7: K-Nearest Neighbors (KNN)

(a) Compute the Euclidean distance between each observation and the test point X1 = X2 = X3 = 0.

The Euclidean distance between two points (x1, x2, x3) and (y1, y2, y3) in three-dimensional space is

d = √((x1 - y1)² + (x2 - y2)² + (x3 - y3)²)

Applying this with the test point (0, 0, 0) gives:

  1. Distance between (0, 0, 0) and (0, 3, 0): √(0 + 9 + 0) = 3 (Observation 1, Red).

  2. Distance between (0, 0, 0) and (2, 0, 0): √(4 + 0 + 0) = 2 (Observation 2, Red).

  3. Distance between (0, 0, 0) and (0, 1, 3): √(0 + 1 + 9) = √10 ≈ 3.162 (Observation 3, Red).

  4. Distance between (0, 0, 0) and (0, 1, 2): √(0 + 1 + 4) = √5 ≈ 2.236 (Observation 4, Green).

  5. Distance between (0, 0, 0) and (-1, 0, 1): √(1 + 0 + 1) = √2 ≈ 1.414 (Observation 5, Green).

  6. Distance between (0, 0, 0) and (1, 1, 1): √(1 + 1 + 1) = √3 ≈ 1.732 (Observation 6, Red).

(b) What is our prediction with K=1? Why?

With K = 1, we choose the single observation closest to the test point. From the distances above, Observation 5 is nearest (d5 = 1.414). Since Observation 5 is Green, the prediction with K = 1 is Green.

(c) What is our prediction with K=3? Why?

With K=3, we consider the three nearest neighbors. The three smallest distances are:

  • d5 = 1.414 (Observation 5, Green)

  • d6 = 1.732 (Observation 6, Red)

  • d2 = 2 (Observation 2, Red)

So, we have two Red observations and one Green observation among the three nearest neighbors. Since Red occurs more frequently, the prediction with K=3 is Red.

(d) If the Bayes decision boundary in this problem is highly nonlinear, then would we expect the best value for K to be large or small? Why?

When the Bayes decision boundary is highly nonlinear, we would expect the best value of K to be small. A small K lets KNN respond to local patterns in the data, which is exactly what is needed to trace an irregular boundary. A large K averages over many neighbors, smoothing out these local variations and pushing the estimated boundary toward something close to linear, which increases bias even though it lowers variance. Larger K values can generalize well when the true boundary is simple, but they are poorly suited to a highly nonlinear one.
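A small simulation can make this concrete (our own sketch, assuming the class package is installed): the true boundary is a circle, and KNN with a small K typically achieves a lower test error than KNN with a large K.

# Small vs. large K on a highly nonlinear (circular) class boundary
library(class)
set.seed(123)
make_data <- function(n) {
    x1 <- runif(n, -2, 2)
    x2 <- runif(n, -2, 2)
    data.frame(x1, x2, y = factor(ifelse(x1^2 + x2^2 < 1.5, "Red", "Green")))
}
train <- make_data(200)
test  <- make_data(1000)

knn_error <- function(k) {
    pred <- knn(train[, c("x1", "x2")], test[, c("x1", "x2")], cl = train$y, k = k)
    mean(pred != test$y)                       # test misclassification rate
}
c(K_1 = knn_error(1), K_50 = knn_error(50))    # the small K usually has the lower error rate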

Parts (a) through (c) can also be verified by implementing KNN with Euclidean distances directly in R:

# Data
data <- data.frame(
    X1 = c(0, 2, 0, 0, -1, 1),
    X2 = c(3, 0, 1, 2, 0, 1),
    X3 = c(0, 0, 3, 2, 1, 1),
    Y = c("Red", "Red", "Red", "Green", "Green", "Red")
)

# Test point
test_point <- c(0, 0, 0)

# Compute Euclidean distances
distances <- apply(data[, c("X1", "X2", "X3")], MARGIN = 1,
                   FUN = function(x) sqrt(sum((x - test_point)^2)))
data$Distance <- distances

# Sort by distance
data_sorted <- data[order(data$Distance), ]

# K=1 prediction
k1_prediction <- data_sorted$Y[1]

# K=3 prediction
k3_prediction <- names(sort(table(data_sorted$Y[1:3]), decreasing = TRUE))[1]

list(K_1_Prediction = k1_prediction,
     K_3_Prediction = k3_prediction)
## $K_1_Prediction
## [1] "Green"
## 
## $K_3_Prediction
## [1] "Red"