# Load necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
(a) Answer: A flexible method would likely perform better. With an extremely large sample size (n), we have enough data to estimate complex relationships between the predictors and the response without overfitting, and the small number of predictors (p) further reduces the risk of overfitting.
Justification: Flexible methods can capture non-linear relationships in the data that inflexible methods would miss.
(b) Answer: An inflexible method is preferred in this scenario because flexible methods are prone to overfitting when there are many predictors relative to observations.
Justification: With p ≫ n, flexible methods tend to fit the noise in the data because of their complexity, whereas simpler models are more robust.
(c) Answer: A flexible method would perform better because it can capture the highly non-linear relationship between the predictors and the response.
Justification: Inflexible methods such as linear regression assume a specific functional form (e.g., linearity), which fits the data poorly when the true relationship is non-linear.
(d) Answer: An inflexible method is preferred because flexible methods have higher variance, which exacerbates the impact of the noisy error terms.
Justification: Inflexible methods provide more stable estimates when the irreducible error is high.
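As an optional, rough illustration of answers (b) and (d), the simulation below is a minimal sketch (the sine signal, the sample size of 30, the noise level, and the df = 20 spline are all arbitrary assumptions, not part of the exercise): a deliberately flexible smoothing spline is fit to a small, noisy sample and compared with a plain linear fit on held-out data.
# Minimal sketch: flexible vs. inflexible fit on a small, noisy sample
# (all settings below are illustrative assumptions)
set.seed(1)
n <- 30
x_train <- runif(n, 0, 10)
y_train <- sin(x_train) + rnorm(n, sd = 1)               # true signal plus heavy noise
x_test <- runif(1000, 0, 10)
y_test <- sin(x_test) + rnorm(1000, sd = 1)
fit_lm <- lm(y_train ~ x_train)                          # inflexible model
fit_spline <- smooth.spline(x_train, y_train, df = 20)   # deliberately very flexible model
mse_lm <- mean((y_test - predict(fit_lm, data.frame(x_train = x_test)))^2)
mse_spline <- mean((y_test - predict(fit_spline, x_test)$y)^2)
mse_train_spline <- mean((y_train - predict(fit_spline, x_train)$y)^2)
c(test_linear = mse_lm, test_spline = mse_spline, train_spline = mse_train_spline)
The flexible spline's training error is typically far below its test error, which is the signature of overfitting that motivates answers (b) and (d).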
(a) Problem Type: Regression (the response, CEO salary, is continuous).
Goal: Inference (understanding which factors affect CEO salary).
n: 500 firms.
p: 3 predictors (profit, number of employees, industry).
(b) Problem Type: Classification (the response is binary: success or failure).
Goal: Prediction (predicting success or failure of a new product).
n: 20 products.
p: 13 predictors (price, marketing budget, etc.).
(c) Problem Type: Regression (the response, a weekly percentage change in the exchange rate, is continuous).
Goal: Prediction (predicting future exchange rate changes).
n: weekly data for 2012 (about 52 observations).
p: 3 predictors (the % changes in the US, British, and German markets).
# Simulate flexibility levels
flexibility <- seq(1, 10, length.out = 100)
# Define components of the bias-variance trade-off
bias_sq <- 1 / flexibility # Bias^2 decreases with flexibility
variance <- flexibility / 10 # Variance increases with flexibility
irreducible_error <- rep(0.5, 100) # Irreducible error is constant
test_error <- bias_sq + variance + irreducible_error # Test error combines all terms
training_error <- 0.8 / flexibility # Training error decreases with flexibility
# Plot the components
plot(flexibility, test_error, type = "l", col = "red", lwd = 2,
ylim = c(0, max(test_error)), ylab = "Error", xlab = "Model Flexibility",
main = "Bias-Variance Trade-off")
lines(flexibility, bias_sq, col = "blue", lwd = 2)
lines(flexibility, variance, col = "green", lwd = 2)
lines(flexibility, irreducible_error, col = "purple", lwd = 2)
lines(flexibility, training_error, col = "orange", lwd = 2)
# Add a legend
legend("topright", legend = c("Test Error", "Bias^2", "Variance",
"Irreducible Error", "Training Error"),
col = c("red", "blue", "green", "purple", "orange"), lwd = 2)
Bias^2 Curve: Decreases as flexibility increases because more flexible models can approximate the true relationship more closely.
Variance Curve: Increases with flexibility because complex models are sensitive to small changes in the training data.
Training Error: Decreases monotonically as flexibility increases, since more flexible models can fit (and eventually overfit) the training data.
Test Error: Forms a U-shape due to the bias-variance trade-off.
Irreducible Error: Constant, since it represents noise inherent in the data that no model can remove.
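As an optional empirical companion to the schematic curves above (a sketch added here; the cubic signal, sample sizes, and degree range are assumptions), polynomial fits of increasing degree can trace training and test MSE directly; the test curve typically shows the U-shape while the training curve keeps falling.
# Empirical training vs. test error for polynomials of increasing degree
# (simulated data; all settings are illustrative assumptions)
set.seed(42)
n <- 50
x <- runif(n, -2, 2)
y <- x^3 - 2 * x + rnorm(n, sd = 1)              # nonlinear truth plus noise
x_new <- runif(1000, -2, 2)
y_new <- x_new^3 - 2 * x_new + rnorm(1000, sd = 1)
degrees <- 1:10
errs <- sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d))
  c(train = mean(residuals(fit)^2),
    test = mean((y_new - predict(fit, data.frame(x = x_new)))^2))
})
plot(degrees, errs["test", ], type = "b", col = "red", ylim = range(errs),
     xlab = "Polynomial Degree (Flexibility)", ylab = "MSE",
     main = "Empirical Training vs. Test Error")
lines(degrees, errs["train", ], type = "b", col = "orange")
legend("topright", legend = c("Test MSE", "Training MSE"),
       col = c("red", "orange"), lty = 1)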
Email Spam Detection:
Response: Spam or Not Spam.
Predictors: Email content features.
Goal: Prediction.
Disease Diagnosis:
Response: Disease type.
Predictors: Patient symptoms and test results.
Goal: Prediction.
Customer Churn:
Response: Churn or Not Churn.
Predictors: Customer behavior metrics.
Goal: Prediction.
House Price Prediction:
Response: House price.
Predictors: Size, location, number of rooms.
Goal: Prediction.
Sales Forecasting:
Response: Sales revenue.
Predictors: Advertising spend on TV/radio/newspaper.
Goal: Prediction.
Stock Price Prediction:
Response: Stock price change.
Predictors: Market indicators.
Goal: Prediction.
Customer Segmentation:
Inputs: Purchase history and demographic features (no response variable; unsupervised).
Goal: Group similar customers, e.g., for targeted marketing.
Document Clustering:
Inputs: Word-frequency features of documents.
Goal: Group documents by topic.
Image Segmentation:
Inputs: Pixel colors and positions.
Goal: Group pixels into coherent regions or objects.
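As a minimal sketch of the customer-segmentation example above (the simulated spending/visit features and the choice of three clusters are assumptions for illustration only), k-means can group observations without any response variable:
# k-means sketch for customer segmentation (simulated, illustrative data)
set.seed(7)
customers <- data.frame(
  annual_spend = c(rnorm(50, 200, 30), rnorm(50, 800, 80), rnorm(50, 1500, 120)),
  monthly_visits = c(rnorm(50, 2, 0.5), rnorm(50, 6, 1), rnorm(50, 12, 2))
)
km <- kmeans(scale(customers), centers = 3, nstart = 25)  # scale so both features count equally
table(km$cluster)   # how many customers fall in each segment
km$centers          # standardized profile of each segment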
A more flexible approach captures complex patterns well, reducing bias and improving predictive accuracy, especially with large datasets. However, it risks overfitting, is computationally expensive, and is harder to interpret.
A less flexible approach is simpler, more interpretable, and generalizes better with small datasets, reducing the risk of overfitting. However, it may fail to capture complex relationships, leading to higher bias.
A more flexible model is preferred for large, complex datasets where prediction is the priority (e.g., deep learning, random forests). A less flexible model is preferred when interpretability, computational efficiency, or a small sample size matters (e.g., linear/logistic regression, simple decision trees).
Question 7:
(a) Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0.
The Euclidean distance between two points (x1, x2, x3) and (y1, y2, y3) in three-dimensional space is given by:
$$d = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2}$$
Observation 1, (0, 3, 0): distance = 3.
Observation 2, (2, 0, 0): distance = 2.
Observation 3, (0, 1, 3): distance = √10 ≈ 3.162.
Observation 4, (0, 1, 2): distance = √5 ≈ 2.236.
Observation 5, (-1, 0, 1): distance = √2 ≈ 1.414.
Observation 6, (1, 1, 1): distance = √3 ≈ 1.732.
(b) What is our prediction with K=1? Why?
With K = 1, we choose the single observation closest to the test point. From the distances calculated above, Observation 5 has the smallest distance (d5 = 1.414). Since Observation 5 has Y = Green, the prediction with K = 1 is Green.
(c) What is our prediction with K=3? Why?
With K = 3, we consider the three nearest neighbors. The three smallest distances are:
d5=1.414 (Green)
d6=1.732 (Red)
d2=2 (Red)
So, we have two Red observations and one Green observation among the three nearest neighbors. Since Red occurs more frequently, the prediction with K=3 is Red.
(d) If the Bayes decision boundary in this problem is highly nonlinear, then would we expect the best value for K to be large or small? Why?
When the Bayes decision boundary is highly nonlinear, a smaller value of K in K-nearest neighbors (KNN) is generally preferred. Smaller K values allow the model to follow local patterns in the data, which is essential for capturing the shape of a nonlinear boundary. Larger K values average over many neighbors, smoothing out these local variations and producing a more nearly linear, and therefore more biased, decision boundary. While larger K values generalize better when the true boundary is simple, they are less effective when it exhibits significant nonlinearity.
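To make this concrete, here is a small optional sketch (the circular boundary, the sample sizes, and the K values of 1 and 101 are assumptions, not part of the exercise) using knn() from the class package; with a highly nonlinear boundary, the small-K fit typically achieves a much lower test error than the heavily smoothed large-K fit.
# KNN with small vs. large K on a simulated nonlinear (circular) boundary
# (all settings are illustrative assumptions)
library(class)  # provides knn()
set.seed(123)
n <- 400
train_x <- matrix(runif(2 * n, -1, 1), ncol = 2)
train_y <- factor(ifelse(train_x[, 1]^2 + train_x[, 2]^2 < 0.5, "Red", "Green"))
test_x <- matrix(runif(2 * 1000, -1, 1), ncol = 2)
test_y <- factor(ifelse(test_x[, 1]^2 + test_x[, 2]^2 < 0.5, "Red", "Green"))
err_k1 <- mean(knn(train_x, test_x, train_y, k = 1) != test_y)
err_k101 <- mean(knn(train_x, test_x, train_y, k = 101) != test_y)
c(K_1_error = err_k1, K_101_error = err_k101)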
The calculations in parts (a) through (c) can be reproduced in R by implementing KNN with Euclidean distances:
# Data
data <- data.frame(
X1 = c(0, 2, 0, 0, -1, 1),
X2 = c(3, 0, 1, 2, 0, 1),
X3 = c(0, 0, 3, 2, 1, 1),
Y = c("Red", "Red", "Red", "Green", "Green", "Red")
)
# Test point
test_point <- c(0, 0, 0)
# Compute Euclidean distances
distances <- apply(data[, c("X1", "X2", "X3")], MARGIN = 1,
FUN = function(x) sqrt(sum((x - test_point)^2)))
data$Distance <- distances
# Sort by distance
data_sorted <- data[order(data$Distance), ]
# K=1 prediction
k1_prediction <- data_sorted$Y[1]
# K=3 prediction
k3_prediction <- names(sort(table(data_sorted$Y[1:3]), decreasing = TRUE))[1]
list(K_1_Prediction = k1_prediction,
K_3_Prediction = k3_prediction)
## $K_1_Prediction
## [1] "Green"
##
## $K_3_Prediction
## [1] "Red"