(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.
This is a regression problem because CEO salary is a continuous numerical variable.
We are interested in inference since the goal is to understand which factors affect CEO salary, rather than just predicting salaries.
n = 500 firms.
p = 3 (profit, number of employees, industry).
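As a rough sketch of what "inference" means here, the snippet below fits a linear model and reads off n, p, and the coefficient table. Everything in it is an assumption made for illustration: the data are simulated (not real firm data), and the column names, industry levels, and data-generating coefficients are hypothetical.

```r
set.seed(42)  # arbitrary seed; all values below are simulated, not real firm data

n <- 500
firms <- data.frame(
  profit    = rnorm(n, mean = 100, sd = 30),
  employees = rpois(n, lambda = 5000),
  industry  = factor(sample(c("tech", "finance", "retail"), n, replace = TRUE))
)
# Hypothetical data-generating process, used only so that lm() has something to fit
firms$ceo_salary <- 1 + 0.02 * firms$profit + 0.0001 * firms$employees + rnorm(n)

fit <- lm(ceo_salary ~ profit + employees + industry, data = firms)

n_obs      <- nrow(firms)                              # number of observations
p_vars     <- length(attr(terms(fit), "term.labels"))  # number of predictors
coef_table <- summary(fit)$coefficients                # estimates, std. errors, t- and p-values
```

Note that industry counts as one predictor (p = 3), even though a categorical variable with three levels contributes two dummy coefficients to the fitted model.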
(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.
This is a classification problem because the outcome variable (success or failure) is categorical.
We are more interested in prediction, as we want to determine whether a new product will be a success or failure.
n = 20 products.
p = 13 (price, marketing budget, competition price, and 10 other variables).
(c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.
This is a regression problem because the outcome variable (% change in USD/Euro exchange rate) is continuous.
We are focused on prediction, as we want to predict future changes based on past data.
n = 52 weeks (one year of weekly data).
p = 3 (% change in US market, British market, and German market).
(a) Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.
library(ggplot2)
# Stylized curves for illustration only; the functional forms are arbitrary
flexibility <- seq(1, 10, length.out = 100)
bias <- 10 / flexibility # Decreasing squared bias
variance <- (flexibility - 1)^2 / 20 # Increasing variance
training_error <- exp(-flexibility / 2) # Decreasing training error
test_error <- bias + variance + 0.5 # U-shaped test error
bayes_error <- rep(0.5, length(flexibility)) # Constant irreducible error
# Create a long-format data frame: one row per (flexibility, curve) pair
df <- data.frame(
  Flexibility = rep(flexibility, 5),
  Value = c(bias, variance, training_error, test_error, bayes_error),
  Curve = rep(c("Bias^2", "Variance", "Training Error", "Test Error", "Bayes Error"), each = length(flexibility))
)
# Plot using ggplot2 (linewidth replaces the size argument deprecated in ggplot2 3.4.0)
ggplot(df, aes(x = Flexibility, y = Value, color = Curve)) +
  geom_line(linewidth = 1) +
  labs(title = "Bias-Variance Tradeoff",
       x = "Model Flexibility",
       y = "Error / Value") +
  theme_minimal() +
  scale_color_manual(values = c("red", "blue", "green", "purple", "black")) +
  theme(legend.position = "bottom")
(b) Explain why each of the five curves has the shape displayed in part (a)
Squared Bias (Decreasing Curve)
Bias is the error from approximating a real-world problem with a simplified model.
Less flexible models (like linear regression) have high bias because they make strong assumptions.
As flexibility increases, bias decreases since the model captures more details.
Variance (Increasing Curve)
Variance measures how much a model’s predictions change with different training data.
Less flexible models have low variance because their strong assumptions leave little room for the fit to change from one training set to another.
More flexible models fit training data closely, causing high variance and over-fitting.
Training Error (Decreasing Curve)
More flexible models can better fit the training data, reducing training error.
However, a low training error does not necessarily mean good generalization.
Test Error (U-Shaped Curve)
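Initially, increasing flexibility lowers test error because bias falls faster than variance rises.
After a certain point, the increase in variance outweighs the reduction in bias (the model over-fits), causing test error to rise again.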
Bayes (Irreducible) Error (Flat Line)
Represents the error due to inherent randomness in the data.
No model can reduce this error, so it remains constant and acts as a lower bound on the expected test error.
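The shapes above are tied together by the standard decomposition of the expected test MSE at a point. The notation (\(\hat{f}\) for the fitted model, \(x_0\) for a test point, \(\varepsilon\) for the noise term) is not used elsewhere in this write-up and is assumed here:
\[E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\varepsilon)\]
The first two terms correspond to the variance and squared-bias curves, and \(\mathrm{Var}(\varepsilon)\) is the flat Bayes error line. This is why the plotting code in part (a) builds test_error as bias + variance + 0.5.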
A more flexible approach captures complex patterns well, reducing bias and improving predictive accuracy, especially with large data sets. However, it risks over-fitting, is computationally expensive, and is harder to interpret.
A less flexible approach is simpler, interpretable, and generalizes better with small datasets, reducing over-fitting. However, it may fail to capture complex relationships, leading to higher bias.
A more flexible model is preferred for large, complex datasets where prediction is the priority (e.g., deep learning, random forests). A less flexible model is preferred when interpretability, efficiency, or a small sample size is the main concern (e.g., linear/logistic regression, simple decision trees).
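The trade-off described above can be explored with a small simulation. This is only a sketch under stated assumptions: the true function f_true, the noise level, the sample sizes, and the helper compare_fits are all hypothetical choices made for illustration. A degree-1 polynomial stands in for the "less flexible" method and a degree-10 polynomial for the "more flexible" one.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical nonlinear truth, chosen only for illustration
f_true <- function(x) sin(2 * x)

# Fit polynomials of the given degrees to one simulated training set and
# report training and test MSE for each degree (one column per degree).
compare_fits <- function(n_train, degrees = c(1, 10), n_test = 1000, sd_noise = 0.5) {
  train <- data.frame(x = runif(n_train, -2, 2))
  train$y <- f_true(train$x) + rnorm(n_train, sd = sd_noise)
  test <- data.frame(x = runif(n_test, -2, 2))
  test$y <- f_true(test$x) + rnorm(n_test, sd = sd_noise)

  sapply(degrees, function(d) {
    fit <- lm(y ~ poly(x, d), data = train)
    c(degree    = d,
      train_mse = mean((train$y - predict(fit, newdata = train))^2),
      test_mse  = mean((test$y - predict(fit, newdata = test))^2))
  })
}

small <- compare_fits(n_train = 20)    # small sample
large <- compare_fits(n_train = 2000)  # large sample
```

Because the degree-1 model is nested in the degree-10 model, the flexible fit always has training MSE at least as low as the linear fit. Comparing the test_mse rows of small and large shows how the ranking on unseen data can depend on the sample size.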
(a) Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0.
The Euclidean distance between two points (x1, x2, x3) and (y1, y2, y3) in 3-dimensional space is given by:
\[d = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2}\]
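A minimal sketch of the computation follows. The observation table is not reproduced in this section, so the six rows below are an assumption taken from the usual textbook version of this exercise; they agree with the distances quoted in the answers to the later parts (d5 = 1.414, d6 = 1.732, d2 = 2).

```r
# ASSUMPTION: these six rows follow the usual textbook table for this exercise;
# the table itself is not reproduced in this section.
obs <- data.frame(
  X1 = c(0, 2, 0, 0, -1, 1),
  X2 = c(3, 0, 1, 1, 0, 1),
  X3 = c(0, 0, 3, 2, 1, 1),
  Y  = c("Red", "Red", "Red", "Green", "Green", "Red")
)

test_point <- c(0, 0, 0)

# Subtract the test point from every row, then take the row-wise Euclidean norm
diffs <- sweep(as.matrix(obs[, c("X1", "X2", "X3")]), 2, test_point)
obs$distance <- sqrt(rowSums(diffs^2))

round(obs$distance, 3)
```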
(b) What is our prediction with K=1? Why?
With K=1, we choose the single observation that is closest to the test point. From the distances calculated, Observation 5 has the smallest distance (d5=1.414). Since Observation 5 has Y=Green, our prediction with K=1 is Green.
(c) What is our prediction with K=3? Why?
With K=3, we consider the three nearest neighbors. The three smallest distances are:
d5=1.414 (Green)
d6=1.732 (Red)
d2=2 (Red)
So, we have two Red observations and one Green observation among the three nearest neighbors. Since Red occurs more frequently, the prediction with K=3 is Red.
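The majority vote for K=1 and K=3 can be sketched as below, using only the three distances and labels quoted above. The helper name knn_vote is hypothetical, and it breaks ties by taking the first label in alphabetical order, which is an arbitrary choice.

```r
# Distances and labels as quoted in the answers above (observations 5, 6, 2)
distances <- c(obs5 = 1.414, obs6 = 1.732, obs2 = 2)
labels    <- c("Green", "Red", "Red")

# Hypothetical helper: majority class among the k nearest observations.
# Ties go to the first label in alphabetical order (arbitrary).
knn_vote <- function(distances, labels, k) {
  nearest <- order(distances)[seq_len(k)]
  votes <- table(labels[nearest])
  names(votes)[which.max(votes)]
}

pred_k1 <- knn_vote(distances, labels, k = 1)  # nearest neighbor only
pred_k3 <- knn_vote(distances, labels, k = 3)  # vote among the three nearest
```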
(d) If the Bayes decision boundary in this problem is highly nonlinear, then would we expect the best value for K to be large or small? Why?
If the Bayes decision boundary is highly nonlinear, we would expect the best value for K to be small. With a small K, each prediction depends only on the nearest few observations, so the KNN boundary is flexible enough to follow local structure in the data. A large K averages over a wide neighborhood, which smooths the boundary and introduces bias when the true boundary is nonlinear. The smallest values (such as K=1) also have high variance, so the best K is small but not necessarily 1.
END