Chapter 2: Questions 2, 3, 5, and 7

Question 2: Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

a. We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

This is a regression problem where the response variable is CEO salary and the p = 3 predictors are profit, number of employees, and industry. Since we want to understand which factors affect CEO salary, rather than to predict a CEO's salary from the predictors, we are most interested in inference, i.e. in understanding the relationship between the predictors and the response. The data cover the top 500 firms in the US, so n = 500.

b. We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

This is a classification problem because we are assigning products to one of two categories: success or failure. Since we want to know whether the new product will be a success or a failure, we are most interested in prediction. Here n = 20 for the 20 similar products that were previously launched, and there are p = 13 predictors: price charged for the product, marketing budget, competition price, and the ten other variables collected.

c. We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.

This is a regression problem where the response variable is the % change in the USD/Euro exchange rate and the p = 3 predictors are the % change in the US market, the % change in the British market, and the % change in the German market. Here we are most interested in prediction. Lastly, since we collected weekly data for all of 2012, n = 52.

Question 3: We now revisit the bias-variance decomposition.

a. Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.

[Figure Q3a: sketch of (squared) bias, variance, training error, test error, and Bayes (irreducible) error curves against model flexibility.]

b. Explain why each of the five curves has the shape displayed in part (a).

  • (squared) bias: decreases with increasing flexibility; a more flexible method can approximate the true relationship more closely, so it systematically misses it less.
  • variance: increases with increasing flexibility; a more flexible method changes more from one training set to another.
  • training error: decreases monotonically with flexibility, since a more complex model can fit the training data ever more closely (eventually even dropping below the Bayes error).
  • test error: U-shaped: it decreases at first while the fall in bias outweighs the rise in variance, then increases once variance dominates and the model overfits (see the simulation sketch below).
  • Bayes (irreducible) error: constant; it does not depend on the model, and it lower-bounds the expected test error.
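As a minimal simulation sketch (an addition, not part of the original answer; it assumes a sin() truth, Gaussian noise, and polynomial degree as the measure of flexibility), the following R code reproduces the monotone training error curve and the U-shaped test error curve:

# Fit polynomials of increasing degree to noisy draws from a non-linear
# truth; training MSE falls monotonically while test MSE is U-shaped.
set.seed(42)
f <- function(x) sin(2 * x)  # assumed "true" regression function
n <- 100
x_train <- runif(n, 0, 3); y_train <- f(x_train) + rnorm(n, sd = 0.3)
x_test  <- runif(n, 0, 3); y_test  <- f(x_test)  + rnorm(n, sd = 0.3)

degrees <- 1:10
errs <- t(sapply(degrees, function(d) {
  fit <- lm(y_train ~ poly(x_train, d))
  c(train = mean((y_train - fitted(fit))^2),
    test  = mean((y_test - predict(fit, data.frame(x_train = x_test)))^2))
}))

matplot(degrees, errs, type = "l", lty = 1, col = c("blue", "red"),
        xlab = "Flexibility (polynomial degree)", ylab = "MSE")
abline(h = 0.3^2, lty = 2)  # Bayes (irreducible) error = noise variance
legend("topright", legend = c("training error", "test error", "irreducible error"),
       col = c("blue", "red", "black"), lty = c(1, 1, 2))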

Question 5: What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

A very flexible approach has the potential to reduce bias and to fit non-linear relationships that a rigid method would miss.

Its drawbacks are higher variance, a greater risk of overfitting (following the noise too closely), and the need to estimate more parameters.

When we care mainly about prediction accuracy rather than the interpretability of the results, a more flexible approach is generally preferred over a less flexible one.

When we are interested in inference and the interpretability of the results, a less flexible approach is preferred, since the resulting model is easier to understand. The short demo below contrasts the two.
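As a small illustration (an addition, not part of the original answer; it assumes a sin() truth and uses lm() as the inflexible method and smooth.spline() as the flexible one):

# A straight line underfits the wavy truth (high bias), while a smoothing
# spline follows it closely (lower bias, at the cost of higher variance).
set.seed(1)
x <- runif(80, 0, 3)
y <- sin(2 * x) + rnorm(80, sd = 0.3)
rigid    <- lm(y ~ x)            # less flexible: linear regression
flexible <- smooth.spline(x, y)  # more flexible: smoothing spline
plot(x, y, col = "grey")
abline(rigid, col = "blue", lwd = 2)
lines(flexible, col = "red", lwd = 2)
legend("topleft", legend = c("linear (rigid)", "smoothing spline (flexible)"),
       col = c("blue", "red"), lwd = 2)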

Question 7: The table below provides a training data set containing six observations, three predictors, and one qualitative response variable.

Obs.   X1   X2   X3   Y
  1     0    3    0   Red
  2     2    0    0   Red
  3     0    1    3   Red
  4     0    1    2   Green
  5    -1    0    1   Green
  6     1    1    1   Red

Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors.

(a) Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0.

dat <- data.frame(
  "x1" = c(0, 2, 0, 0, -1, 1),
  "x2" = c(3, 0, 1, 1, 0, 1),
  "x3" = c(0, 0, 3, 2, 1, 1),
  "y" = c("Red", "Red", "Red", "Green", "Green", "Red")
)

# Euclidean distance between points and c(0, 0, 0)
dist <- sqrt(dat[["x1"]]^2 + dat[["x2"]]^2 + dat[["x3"]]^2)
signif(dist, 3)
## [1] 3.00 2.00 3.16 2.24 1.41 1.73
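As a cross-check (an addition, not part of the original answer), base R's stats::dist() gives the same values. Note that the vector named dist above does not break the call: when R looks up a function, it skips non-function bindings.

# Row 1 of the distance matrix is the test point c(0, 0, 0)
signif(as.numeric(as.matrix(dist(rbind(c(0, 0, 0), dat[, 1:3])))[1, -1]), 3)
## [1] 3.00 2.00 3.16 2.24 1.41 1.73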

(b) What is our prediction with K = 1? Why?

knn <- function(k) {
  names(which.max(table(dat[["y"]][order(dist)[1:k]])))
}
knn(1)
## [1] "Green"

With K = 1, the single nearest neighbor is observation 5 (distance √2 ≈ 1.41), which is Green, so we predict Green.

(c) What is our prediction with K = 3? Why?

 cat("Using  𝐾=3 , we predict the color of the test point using the three closest neighbors. They are  𝑋5  (distance  2rt ),  𝑋6 (distance  3rt ), and  𝑋4  (distance  5rt ). Since  𝑋4  and  𝑋5  are both green, while  𝑋6  is red, we predict that the test point will be green.")
## Using  𝐾=3 , we predict the color of the test point using the three closest neighbors. They are  𝑋5  (distance  2rt ),  𝑋6 (distance  3rt ), and  𝑋4  (distance  5rt ). Since  𝑋4  and  𝑋5  are both green, while  𝑋6  is red, we predict that the test point will be green.
knn(3)
## [1] "Red"

(d) If the Bayes decision boundary in this problem is highly non-linear, then would we expect the best value for K to be large or small? Why?

If the Bayes decision boundary in this problem is highly non-linear, then we would expect the best value for K to be small. As K increases, K-nearest neighbors becomes less flexible and produces a decision boundary that is closer to linear. A small value of K, on the other hand, gives a more flexible fit and a more non-linear decision boundary, which in this situation would be closer to the gold-standard Bayes decision boundary. The sketch below illustrates the effect of K on the boundary.
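As a final sketch (an addition, not part of the original answer; it simulates 2-D data with a wavy true boundary and uses class::knn()), small K traces a jagged, non-linear boundary while large K flattens it toward a smooth, nearly linear one:

library(class)
set.seed(1)
# 200 training points whose true class boundary is a sine wave
train_x <- matrix(runif(400), ncol = 2)
train_y <- factor(ifelse(train_x[, 2] > 0.5 + 0.3 * sin(6 * train_x[, 1]),
                         "Green", "Red"))
# Classify a fine grid; the predicted regions reveal the decision boundary
grid <- expand.grid(x1 = seq(0, 1, length.out = 100),
                    x2 = seq(0, 1, length.out = 100))
par(mfrow = c(1, 2))
for (k in c(1, 51)) {
  pred <- knn(train_x, grid, train_y, k = k)
  plot(grid, col = ifelse(pred == "Green", "palegreen", "pink"),
       pch = 15, cex = 0.4, main = paste("K =", k))
  points(train_x, col = ifelse(train_y == "Green", "darkgreen", "red"))
}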