Questions

  1. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.
  1. We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry, and the CEO salary. We are interested in understanding which factors affect CEO salary.

Answer :
- Since the response (CEO salary) is quantitative, this is a regression problem.
- We are most interested in identifying the factors that influence CEO salary rather than predicting the salary itself, so our goal is inference.
- n = 500
- p = 3 (profit, number of employees, industry).

  1. We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

Answer :
- Since the response is binary (success/failure), this is a classification problem.
- We are most interested in predicting whether the new product's launch will be a success or a failure, so our goal is prediction.
- n = 20
- p = 13 (price charged, marketing budget, competition price, and the ten other variables).

  1. We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.

Answer :
- Since the response (% change in the USD/Euro exchange rate) is quantitative, this is a regression problem.
- We are most interested in predicting the % change in the USD/Euro exchange rate, so our goal is prediction.
- n = 52 (weeks in 2012)
- p = 3 (% change in the US market, % change in the British market, % change in the German market).

  1. We now revisit the bias-variance decomposition.
  1. Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.

Answer :

knitr::include_graphics("graph.png")
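
If the static image is unavailable, the five curves can also be sketched directly in R. The code below is purely schematic: the curve shapes are hand-chosen exponentials meant only to illustrate the qualitative behaviour explained in the next part, not quantities computed from any data.

# Schematic curves: shapes are illustrative only, not computed from data
flex      <- seq(0, 10, length.out = 200)
sq_bias   <- 6 * exp(-0.6 * flex)        # decreases with flexibility
variance  <- 0.05 * exp(0.55 * flex)     # increases with flexibility
train_err <- 5 * exp(-0.45 * flex)       # decreases monotonically
bayes     <- rep(1, length(flex))        # constant irreducible error
test_err  <- sq_bias + variance + bayes  # U-shaped expected test error

matplot(flex, cbind(sq_bias, variance, train_err, test_err, bayes),
        type = "l", lty = 1, lwd = 2, col = 1:5,
        xlab = "Flexibility", ylab = "Error")
legend("top", legend = c("Squared bias", "Variance", "Training error",
                         "Test error", "Bayes error"),
       col = 1:5, lty = 1, lwd = 2, bty = "n")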

  1. Explain why each of the five curves has the shape displayed in part (a).

Answer :
- Squared bias. This is the error introduced by approximating the true underlying function with our model. A more flexible model can follow the true function more closely, so the squared bias decreases as the flexibility increases.
- Variance. In the limit of a model with no flexibility the variance is zero, since the fit is essentially independent of the training data. As the flexibility increases, the variance increases as well, because the noise in a particular training set is increasingly captured by the model. The variance is therefore a monotonically increasing function of the flexibility.
- Training error. The training error is the average (squared) difference between the model's predictions and the training observations. For a very inflexible model it can be quite high, but it decreases as the flexibility increases. With polynomials, for example, increasing the flexibility means increasing the degree of the fitted polynomial; the additional degrees of freedom reduce the average difference and hence the training error (see the simulation sketch after this list).
- Test error. The expected test error decomposes as squared bias + variance + Bayes (irreducible) error, all of which are non-negative. The Bayes error is constant and is a lower bound for the test error. The test error has a minimum at an intermediate level of flexibility: not so flexible that the variance dominates, and not so inflexible that the squared bias is too high. The test error curve is therefore U-shaped: high for inflexible models, decreasing as flexibility increases until it reaches a minimum, and then increasing again once the variance starts to dominate. The gap between this minimum and the Bayes error indicates how well the best model in the hypothesis space can fit.
- Bayes error. This term is constant because, by definition, it does not depend on the chosen model and therefore not on its flexibility.
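
The training and test error behaviour can be checked with a small simulation. This is only a sketch under assumed settings (a sinusoidal true function with Gaussian noise, and polynomial degree as the flexibility knob); the exact numbers will vary, but the training MSE should fall monotonically while the test MSE levels off or rises once the degree exceeds what the signal needs, with the noise variance acting as the irreducible error.

set.seed(1)
f <- function(x) sin(2 * x)            # assumed true function
n <- 100
train <- data.frame(x = runif(n, 0, 3))
train$y <- f(train$x) + rnorm(n, sd = 0.3)
test <- data.frame(x = runif(n, 0, 3))
test$y <- f(test$x) + rnorm(n, sd = 0.3)

degrees <- 1:10                        # flexibility = polynomial degree
errs <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(train = mean((train$y - fitted(fit))^2),
    test  = mean((test$y - predict(fit, newdata = test))^2))
}))

matplot(degrees, errs, type = "b", pch = 1:2, lty = 1, col = c("blue", "red"),
        xlab = "Polynomial degree (flexibility)", ylab = "MSE")
abline(h = 0.3^2, lty = 2)             # Bayes error = noise variance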

 

  1. What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

Answer :

- Advantages of a very flexible approach: it can represent a much wider range of shapes for the underlying function, so it has lower bias and can give better predictions when the true relationship is highly non-linear.
- Disadvantages: it requires estimating more parameters, tends to overfit by following the noise in the training data (higher variance), and is usually harder to interpret.
- A more flexible approach is preferred when prediction accuracy is the main goal, the true relationship is non-linear, and we have many observations relative to the number of predictors.
- A less flexible approach is preferred when inference and interpretability are the main goals, when the number of observations is small relative to the number of predictors, or when the true relationship is approximately linear.

  1. The table below provides a training data set containing six observations, three predictors, and one qualitative response variable.
knitr::include_graphics("q7.png")

Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors.

  1. Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0.

dat <- data.frame(
  "x1" = c(0, 2, 0, 0, -1, 1),
  "x2" = c(3, 0, 1, 1, 0, 1),
  "x3" = c(0, 0, 3, 2, 1, 1),
  "y" = c("Red", "Red", "Red", "Green", "Green", "Red")
)

# Euclidean distance between points and c(0, 0, 0)
dist <- sqrt(dat[["x1"]]^2 + dat[["x2"]]^2 + dat[["x3"]]^2)
dat[["distance"]] <- c(signif(dist, 3))

dat
##   x1 x2 x3     y distance
## 1  0  3  0   Red     3.00
## 2  2  0  0   Red     2.00
## 3  0  1  3   Red     3.16
## 4  0  1  2 Green     2.24
## 5 -1  0  1 Green     1.41
## 6  1  1  1   Red     1.73
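
As a quick cross-check (not part of the original answer), the same distances can be obtained with stats::dist() by stacking the test point on top of the predictor matrix; the first row of the resulting distance matrix should match the table above.

# Row 1 is the test point (0, 0, 0); rows 2-7 are the six observations.
# stats::dist() is written with the :: prefix to avoid confusion with the
# numeric vector `dist` created above.
pts <- rbind(c(0, 0, 0), as.matrix(dat[, c("x1", "x2", "x3")]))
round(as.matrix(stats::dist(pts))[1, -1], 3)  # should give 3, 2, 3.162, 2.236, 1.414, 1.732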
  1. What is our prediction with K = 1? Why?
# Majority vote among the k observations closest to the test point
# (uses the `dist` vector and `dat` defined above)
knn <- function(k) {
  names(which.max(table(dat[["y"]][order(dist)[1:k]])))
}
knn(1)
## [1] "Green"
  1. What is our prediction with K = 3? Why?
knn(3)
## [1] "Red"

Red (the three nearest neighbours are observations 5, 6, and 2; two of the three are Red).
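
The same predictions can be reproduced with knn() from the class package (assuming it is installed); this is only a cross-check of the hand-rolled majority vote above, and since there are no ties in distance the result is deterministic.

# The class:: prefix is needed because the `knn` helper defined above masks class::knn
train_X <- dat[, c("x1", "x2", "x3")]
test_X  <- data.frame(x1 = 0, x2 = 0, x3 = 0)
class::knn(train_X, test_X, cl = factor(dat$y), k = 1)  # expected: Green
class::knn(train_X, test_X, cl = factor(dat$y), k = 3)  # expected: Red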

  1. If the Bayes decision boundary in this problem is highly nonlinear, then would we expect the best value for K to be large or small? Why?