Q2: Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

For each scenario, we determine whether the problem is classification or regression, whether inference or prediction is the focus, and provide values for \(n\) (number of observations) and \(p\) (number of predictors).

(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry, and the CEO salary. We are interested in understanding which factors affect CEO salary.

  • Type: Regression (CEO salary is a continuous variable).
  • Goal: Inference (we want to understand factors affecting CEO salary).
  • \(n\): 500 firms.
  • \(p\): 3 predictors (profit, number of employees, industry); CEO salary is the response.

(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

  • Type: Classification (success or failure is categorical).
  • Goal: Prediction (we want to predict success or failure of a new product).
  • \(n\): 20 previous products.
  • \(p\): 13 predictors (price, marketing budget, competition price, and the ten other variables); success/failure is the response.

(c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.

  • Type: Regression (percentage change in exchange rate is continuous).
  • Goal: Prediction (we want to forecast exchange rate changes).
  • \(n\): 52 weeks in 2012.
  • \(p\): 3 predictors (% changes in US, British, and German stock markets).

Q3: We now revisit the bias-variance decomposition.

(a) Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.

library(ggplot2)

# Stylized curves over a grid of model flexibility values
flexibility <- seq(1, 10, length.out = 100)

bias_squared <- exp(-0.5 * flexibility) * 4         # decreases with flexibility
variance <- log(flexibility + 1)                    # increases with flexibility
irreducible_error <- rep(0.5, length(flexibility))  # constant noise floor
training_error <- 1 / (flexibility + 1)             # decreases monotonically
test_error <- bias_squared + variance + irreducible_error  # U-shaped sum

data <- data.frame(
  Flexibility = rep(flexibility, 5),
  Error = c(bias_squared, variance, irreducible_error, training_error, test_error),
  Type = rep(c("Bias^2", "Variance", "Irreducible Error", "Training Error", "Test Error"),
             each = length(flexibility))
)

ggplot(data, aes(x = Flexibility, y = Error, color = Type)) +
  geom_line(linewidth = 1) +  # `size` for lines was deprecated in ggplot2 3.4.0
  labs(title = "Bias-Variance Tradeoff", x = "Model Flexibility", y = "Error") +
  theme_minimal()

(b) Explain why each of the five curves has the shape displayed in part (a).

  1. Bias Squared: Decreases as flexibility increases, because a more flexible model can approximate the true \(f\) more closely.
  2. Variance: Increases with flexibility, because a flexible fit changes substantially when the training data change slightly.
  3. Irreducible Error: Constant; it is the noise variance \(\text{Var}(\epsilon)\), which no model can eliminate.
  4. Training Error: Decreases monotonically, since more flexible models fit the training data ever more closely.
  5. Test Error: U-shaped; it first falls as bias drops, then rises once the increase in variance outweighs the further reduction in bias (overfitting).
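
These shapes follow from the bias-variance decomposition of the expected test MSE at a point \(x_0\):

\[
E\left[\big(y_0 - \hat{f}(x_0)\big)^2\right]
  = \mathrm{Bias}\big(\hat{f}(x_0)\big)^2
  + \mathrm{Var}\big(\hat{f}(x_0)\big)
  + \mathrm{Var}(\epsilon).
\]

Because the first two terms are nonnegative, the expected test error can never fall below \(\mathrm{Var}(\epsilon)\), which is why the test error curve always sits above the irreducible error line.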

Q5: What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

Advantages of a More Flexible Approach

  • Captures complex relationships.
  • Reduces bias.
  • Higher predictive accuracy in large datasets.

Disadvantages of a More Flexible Approach

  • High variance, leading to overfitting (illustrated in the sketch after these lists).
  • Requires more data.
  • Difficult to interpret.
  • Computationally expensive.

When to Prefer a More Flexible Approach

  • If prediction accuracy is the main goal.
  • If the true relationship is complex and nonlinear.
  • If large training data is available.

When to Prefer a Less Flexible Approach

  • When interpretability is important.
  • When training data is small.
  • When the relationship is simple.
  • When computational resources are limited.
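
A minimal simulation (hypothetical data, not part of the question) makes the tradeoff concrete: on a small, noisy sample whose true relationship is linear, a very flexible smoother achieves a lower training error than a linear fit, yet it is the linear fit that would generalize better.

# Sketch under simulated data: the true f is linear, the sample is small and noisy.
set.seed(1)
x <- runif(30, 0, 10)
y <- 2 + 0.5 * x + rnorm(30, sd = 1)

fit_linear   <- lm(y ~ x)                      # less flexible
fit_flexible <- smooth.spline(x, y, df = 20)   # very flexible

# The flexible fit chases the noise, so its training MSE is lower --
# but on fresh draws from the same linear f it would do worse.
mean((y - fitted(fit_linear))^2)
mean((y - predict(fit_flexible, x)$y)^2)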

Q7: The table below provides a training data set containing six observations, three predictors, and one qualitative response variable.

  Obs   X1   X2   X3       Y
    1    0    3    0     Red
    2    2    0    0     Red
    3    0    1    3     Red
    4    0    1    2   Green
    5   -1    0    1   Green
    6    1    1    1     Red

(a) Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0.
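
Because the test point is the origin, the Euclidean distance for observation \(i\) reduces to

\[
d_i = \sqrt{X_{i1}^2 + X_{i2}^2 + X_{i3}^2}.
\]

The code below evaluates this for all six observations and sorts them nearest-first.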

suppressPackageStartupMessages(library(dplyr))  # dplyr for mutate()/arrange()
# Training data from the table above
data <- data.frame(
  Obs = 1:6,
  X1 = c(0, 2, 0, 0, -1, 1),
  X2 = c(3, 0, 1, 1, 0, 1),
  X3 = c(0, 0, 3, 2, 1, 1),
  Y = c("Red", "Red", "Red", "Green", "Green", "Red")
)

test_point <- c(0, 0, 0)

# Euclidean distance from each observation to the test point
data <- data %>%
  mutate(Distance = sqrt((X1 - test_point[1])^2 + (X2 - test_point[2])^2 + (X3 - test_point[3])^2))

data <- data %>% arrange(Distance)  # sort nearest-first
print(data)
##   Obs X1 X2 X3     Y Distance
## 1   5 -1  0  1 Green 1.414214
## 2   6  1  1  1   Red 1.732051
## 3   2  2  0  0   Red 2.000000
## 4   4  0  1  2 Green 2.236068
## 5   1  0  3  0   Red 3.000000
## 6   3  0  1  3   Red 3.162278

(b) What is our prediction with K = 1? Why?

  • The single nearest neighbor is Observation 5 (distance 1.41), whose class is Green.
  • Prediction: Green.

(c) What is our prediction with K = 3? Why?

  • Three closest observations: 5 (Green), 6 (Red), 2 (Red).
  • Majority vote: two Red versus one Green.
  • Prediction: Red (verified with class::knn below).
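
As a check, both predictions can be reproduced with knn() from the class package (assumed to be installed), reusing the data and test_point objects defined above.

library(class)

# KNN predictions for the test point (0, 0, 0)
train_X <- data[, c("X1", "X2", "X3")]
knn(train_X, test_point, cl = factor(data$Y), k = 1)  # expected: Green
knn(train_X, test_point, cl = factor(data$Y), k = 3)  # expected: Red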

(d) If the Bayes decision boundary in this problem is highly nonlinear, then would we expect the best value for K to be large or small? Why?

  • If the Bayes decision boundary is highly nonlinear, a small K is preferred.
  • A small K yields a flexible, wiggly decision boundary that can track local, nonlinear structure in the data.
  • A large K averages over many neighbors, smoothing the boundary toward something closer to linear, which cannot follow a highly nonlinear Bayes boundary.