Question 1 (concept)[10p]

For each of parts 1 through 4, indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than that of an inflexible method. Justify your answer.

  1. The sample size \(n\) is extremely large, and the number of predictors \(p\) is small.
  2. The number of predictors \(p\) is extremely large, and the number of observations \(n\) is small.
  3. The relationship between the predictors and response is highly non-linear.
  4. The variance of the error terms, \(Var(\epsilon)\), is extremely high.

Answer 1

  1. Better. With \(n\) extremely large and \(p\) small, a flexible method can capture the underlying signal closely, and the large sample size keeps its variance in check, so it will generally obtain a better fit than an inflexible method.
  2. Worse. With \(p\) extremely large and \(n\) small, a more flexible approach could easily overfit the few available observations.
  3. Better. An inflexible method cannot capture a highly non-linear relationship and would suffer from high bias, whereas a flexible method with more degrees of freedom can fit the non-linearity. (The simulation sketch after this list illustrates cases 1 to 3.)
  4. Worse. When the error variance \(Var(\epsilon)\) is extremely high, a more flexible model will fit the noise in the training data and its variance will increase.
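To make this concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration (the non-linear truth \(f(x) = \sin(2x)\), the noise level, and the choice of a degree-10 polynomial as the "flexible" method), not part of the exercise:

```python
# Compare an inflexible linear fit with a flexible degree-10 polynomial
# on an assumed non-linear truth, for a small and a large training set.
import numpy as np

rng = np.random.default_rng(0)

def test_mse(n_train, degree, noise_sd=0.5, n_test=10_000):
    # Assumed non-linear relationship: f(x) = sin(2x).
    x = rng.uniform(-2, 2, n_train)
    y = np.sin(2 * x) + rng.normal(0, noise_sd, n_train)
    coefs = np.polyfit(x, y, degree)          # least-squares polynomial fit
    x_new = rng.uniform(-2, 2, n_test)
    y_new = np.sin(2 * x_new) + rng.normal(0, noise_sd, n_test)
    return np.mean((np.polyval(coefs, x_new) - y_new) ** 2)

for n in (20, 10_000):
    print(f"n={n:>6}: linear MSE {test_mse(n, 1):.3f}, flexible MSE {test_mse(n, 10):.3f}")
```

With the large training set the flexible fit typically wins (cases 1 and 3); with the small one it can overfit and lose, which is analogous to case 2.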

Question 2 (concept)[10p]

We now revisit the bias-variance decomposition.

  1. Provide a sketch of typical (squared) bias, variance, training mean squared error, test mean squared error, and Bayes (or irreducible) error rate curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The \(x\)-axis should represent the amount of flexibility in the method, and the \(y\)-axis should represent the values for each curve. There should be five curves. Make sure to label each one.

  2. Explain why each of the five curves has the shape displayed in part 1.

Answer 2

  1. First, note that all five curves take values greater than or equal to 0.
    Squared bias: as the flexibility of the model increases, it can match the true \(f\) more closely, so the squared bias decreases monotonically.
    Variance: as the flexibility of the model increases, the fit depends more and more on the particular training set (overfitting), so the variance increases monotonically.
    Training error: as the flexibility of the model increases, it matches the training data more closely, so the training error decreases monotonically.
    Test error: as flexibility first increases, the falling bias dominates and the test error decreases; once flexibility passes the point where overfitting sets in, the rising variance dominates and the test error increases again, giving the characteristic U-shape.
    Bayes error: by definition, the Bayes (irreducible) error is a lower bound on the expected test error and does not vary with model flexibility. (A stylized version of this plot is sketched in the code below.)
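A minimal matplotlib sketch of the plot requested in part 1. The formulas for the curves are stylized assumptions chosen only to reproduce the qualitative shapes argued above; they are not derived from any real model or data set:

```python
# Stylized sketch of the five curves from the bias-variance decomposition.
import numpy as np
import matplotlib.pyplot as plt

flex = np.linspace(0.5, 10, 200)        # model flexibility (arbitrary units)
bias_sq = 4.0 / flex                    # squared bias: monotonically decreasing
variance = 0.05 * flex**2               # variance: monotonically increasing
bayes = np.full_like(flex, 1.0)         # Bayes (irreducible) error: constant
test_mse = bias_sq + variance + bayes   # expected test MSE = bias^2 + variance + Bayes error
train_mse = 4.5 / flex                  # training MSE: keeps falling, can drop below the Bayes error

for curve, label in [(bias_sq, "squared bias"), (variance, "variance"),
                     (train_mse, "training MSE"), (test_mse, "test MSE"),
                     (bayes, "Bayes error")]:
    plt.plot(flex, curve, label=label)
plt.xlabel("flexibility")
plt.ylabel("error")
plt.ylim(0, 10)
plt.legend()
plt.show()
```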

Question 3 (concept)[10p]

The table below provides a training data set containing six observations, three predictors, and one qualitative response variable.

| Obs. | \(X_1\) | \(X_2\) | \(X_3\) | \(Y\) |
| --- | --- | --- | --- | --- |
| 1 | 0 | 3 | 0 | Red |
| 2 | 2 | 0 | 0 | Red |
| 3 | 0 | 1 | 3 | Red |
| 4 | 0 | 1 | 2 | Green |
| 5 | -1 | 0 | 1 | Green |
| 6 | 1 | 1 | 1 | Red |

Suppose we wish to use this data set to make a prediction for \(Y\) when \(X_1 = X_2 = X_3 = 0\) using K-nearest neighbors.

  1. Compute the Euclidean distance between each observation and the test point, \(X_1 = X_2 = X_3 = 0\).

(Note: the Euclidean distance of two vectors \(a = (a_1,a_2,a_3)\) and \(b = (b_1,b_2,b_3)\) is given by \(d(a,b) = \sqrt{(a_1-b_1)^2 + (a_2-b_2)^2 + (a_3-b_3)^2}\). The same idea extends to vectors with \(n\) coordinates.)

  2. What is our prediction with K = 1? Why?
  3. What is our prediction with K = 3? Why?
  4. If the Bayes decision boundary in this problem is highly non-linear, then would we expect the best value for K to be large or small? Why?

Answer 3

  1. Here it would be recommended to create a table (or present your answer in any way you like):

| Obs. | \(X_1\) | \(X_2\) | \(X_3\) | \(d\big(\text{obs}, (0,0,0)\big)\) |
| --- | --- | --- | --- | --- |
| 1 | 0 | 3 | 0 | \(\sqrt{9} = 3.0\) |
| 2 | 2 | 0 | 0 | \(\sqrt{4} = 2.0\) |
| 3 | 0 | 1 | 3 | \(\sqrt{10} \approx 3.2\) |
| 4 | 0 | 1 | 2 | \(\sqrt{5} \approx 2.2\) |
| 5 | -1 | 0 | 1 | \(\sqrt{2} \approx 1.4\) |
| 6 | 1 | 1 | 1 | \(\sqrt{3} \approx 1.7\) |
  2. Red or Green? Green. When K = 1, Obs. 5 is the single nearest neighbor, and its label is Green.
  3. Red or Green? Red. When K = 3, Obs. 5, 6, and 2 are the three nearest neighbors. Their labels are (Green, Red, Red), so Red is the majority.
  4. Small. A smaller K gives a more flexible classifier, and a more flexible classifier can better trace a highly non-linear decision boundary. (The NumPy sketch below checks the distances and both predictions.)
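A minimal NumPy sketch verifying parts 1 to 3; the arrays simply transcribe the training table above:

```python
# Check the Euclidean distances and the KNN predictions for K = 1 and K = 3.
import numpy as np

X = np.array([[0, 3, 0],
              [2, 0, 0],
              [0, 1, 3],
              [0, 1, 2],
              [-1, 0, 1],
              [1, 1, 1]])
y = np.array(["Red", "Red", "Red", "Green", "Green", "Red"])
x0 = np.zeros(3)                          # test point (0, 0, 0)

dist = np.linalg.norm(X - x0, axis=1)     # distance from each observation to x0
print(np.round(dist, 1))                  # [3.  2.  3.2 2.2 1.4 1.7]

for k in (1, 3):
    nearest = np.argsort(dist)[:k]        # indices of the k closest observations
    labels, counts = np.unique(y[nearest], return_counts=True)
    print(f"K={k}: {labels[np.argmax(counts)]}")   # K=1 -> Green, K=3 -> Red
```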

Question 4 (concept)[10p]

What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

Answer 4

The advantage of very flexible models is that they can follow the data more closely and thereby reduce bias, which is especially valuable for non-linear problems. The disadvantages are that more parameters must be estimated, that they tend to overfit the data and so increase the variance, and that they are harder to interpret.
A more flexible approach is preferable when our goal is prediction rather than interpretation and enough observations are available to keep the variance under control; a less flexible approach is preferable when our goal is interpretation rather than prediction.

Question 5 (concept)[10p]

Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?

Answer 5

Parametric methods make an explicit prior assumption about the functional form of \(f\), which reduces the problem of estimating \(f\) to the much simpler problem of estimating a fixed set of parameters; non-parametric methods make no prior assumption about the form of \(f\) and instead seek an estimate that gets as close to the data points as possible without being too wiggly.
The advantage of parametric methods is therefore that they require relatively few observations and are easier to fit and interpret.
The disadvantage is that the form of \(f\) must be chosen well: if the assumed form is far from the true \(f\), the resulting model will fit poorly no matter how much data is available. (The sketch below contrasts the two approaches on a toy example.)
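A small illustrative sketch of the contrast, on assumed toy data (the linear truth, the noise level, and \(k = 5\) are all made up for illustration): the parametric approach estimates two parameters under an assumed linear form, while the non-parametric KNN average assumes no form at all:

```python
# Parametric (linear fit) vs. non-parametric (KNN regression) on toy data.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, 100)
y = 2.0 + 3.0 * x + rng.normal(0, 1, 100)   # toy data where the linear assumption happens to hold

# Parametric: assume f(x) = b0 + b1*x and estimate just two parameters.
b1, b0 = np.polyfit(x, y, 1)

# Non-parametric: KNN regression, which assumes no functional form for f.
def knn_predict(x_new, k=5):
    nearest = np.argsort(np.abs(x - x_new))[:k]   # indices of the k closest x values
    return y[nearest].mean()                      # average their responses

print(f"parametric estimate of f(2.5):     {b0 + b1 * 2.5:.2f}")
print(f"non-parametric estimate of f(2.5): {knn_predict(2.5):.2f}")
```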