This is a regression problem where the response variable is the CEO salary and the 𝑝=3 predictors are profit, number of employees, and industry. Since we want to understand which factors affect CEO salary, rather than to predict a CEO’s salary from the predictors, we are most interested in inference, i.e., understanding the relationship between the predictors and the response. The data cover the top 500 firms in the US, so 𝑛=500.
This is a classification problem because we are putting products into one of two categories: success or failure. Since we want to know whether the new product will be a success or a failure, we are most interested in prediction. In this situation, 𝑛=20 for the 20 similar products that were previously launched. There are 𝑝=13 predictors: the price charged for the product, the marketing budget, the competition price, and the ten other variables collected.
This is a regression problem where the response variable is the % change in the USD/Euro exchange rate and the 𝑝=3 predictors are the % change in the US market, the % change in the British market, and the % change in the German market. In this situation we are most interested in prediction. Lastly, since we collected weekly data for all of 2012, 𝑛=52.
- (squared) bias: Decreases with increasing flexibility (Generally, more flexible methods result in less bias).
- variance: Increases with increasing flexibility (In general, more flexible statistical methods have higher variance).
- training error: Decreases with increasing flexibility (More complex models fit the training data more closely).
- test error: Decreases initially, then increases due to overfitting, once the reduction in bias is outweighed by the growth in variance; the simulation sketch after this list illustrates the resulting U-shape.
- Bayes (irreducible) error: fixed (does not change with model).
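These curves are easy to reproduce in a small simulation. The sketch below is illustrative only, and all of its specifics are assumptions (the true function sin(2x), the noise level sd = 0.3, and polynomial degree standing in for flexibility): training MSE falls monotonically with degree, while test MSE traces the U-shape above the irreducible-error floor.

set.seed(1)
f <- function(x) sin(2 * x)                 # assumed "true" regression function
n <- 100
x_train <- runif(n, 0, 3)
y_train <- f(x_train) + rnorm(n, sd = 0.3)  # irreducible error: 0.3^2 = 0.09
x_test <- runif(n, 0, 3)
y_test <- f(x_test) + rnorm(n, sd = 0.3)
degrees <- 1:10                             # polynomial degree = flexibility
mse <- sapply(degrees, function(d) {
  fit <- lm(y_train ~ poly(x_train, d))
  c(train = mean((y_train - fitted(fit))^2),
    test = mean((y_test - predict(fit, data.frame(x_train = x_test)))^2))
})
matplot(degrees, t(mse), type = "b", pch = c(1, 2), col = c("blue", "red"),
        xlab = "Flexibility (polynomial degree)", ylab = "MSE")
abline(h = 0.3^2, lty = 2)                  # Bayes (irreducible) error floor
legend("topright", c("training", "test"), pch = c(1, 2), col = c("blue", "red"))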
A highly flexible approach has the potential to reduce bias and to fit non-linear relationships more closely.
A very flexible technique has the drawbacks of increasing variance, overfitting (following the noise too closely), and requiring the estimation of more parameters.
When we are interested in prediction rather than in the interpretability of the results, a more flexible approach would be preferred over a less flexible one.
When we are interested in inference and the interpretability of the results, a less flexible strategy would be chosen over a more flexible approach.
| Obs. | X_1 | X_2 | X_3 | Y |
|---|---|---|---|---|
| 1 | 0 | 3 | 0 | Red |
| 2 | 2 | 0 | 0 | Red |
| 3 | 0 | 1 | 3 | Red |
| 4 | 0 | 1 | 2 | Green |
| 5 | -1 | 0 | 1 | Green |
| 6 | 1 | 1 | 1 | Red |
Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors.
# Recreate the table above as a data frame
dat <- data.frame(
"x1" = c(0, 2, 0, 0, -1, 1),
"x2" = c(3, 0, 1, 1, 0, 1),
"x3" = c(0, 0, 3, 2, 1, 1),
"y" = c("Red", "Red", "Red", "Green", "Green", "Red")
)
# Euclidean distance from each observation to the test point c(0, 0, 0)
dist <- sqrt(dat[["x1"]]^2 + dat[["x2"]]^2 + dat[["x3"]]^2)
signif(dist, 3)
## [1] 3.00 2.00 3.16 2.24 1.41 1.73
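As a quick cross-check (and note that the vector above shadows base R’s built-in stats::dist() function), the same values can be obtained from the built-in pairwise-distance routine by prepending the test point as a first row:

d <- dist(rbind(c(0, 0, 0), as.matrix(dat[, c("x1", "x2", "x3")])))
# Row 1 of the distance matrix holds the test point's distance to each obs.
signif(as.matrix(d)[1, -1], 3)
# 3.00 2.00 3.16 2.24 1.41 1.73  -- matches the manual computation above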
knn <- function(k) {
  # Tally the labels of the k nearest observations and return the most
  # common one (ties break toward the first label alphabetically)
  names(which.max(table(dat[["y"]][order(dist)[1:k]])))
}
knn(1)
## [1] "Green"
cat("Using 𝐾=3 , we predict the color of the test point using the three closest neighbors. They are 𝑋5 (distance 2rt ), 𝑋6 (distance 3rt ), and 𝑋4 (distance 5rt ). Since 𝑋4 and 𝑋5 are both green, while 𝑋6 is red, we predict that the test point will be green.")
## Using 𝐾=3 , we predict the color of the test point using the three closest neighbors. They are 𝑋5 (distance 2rt ), 𝑋6 (distance 3rt ), and 𝑋4 (distance 5rt ). Since 𝑋4 and 𝑋5 are both green, while 𝑋6 is red, we predict that the test point will be green.
knn(3)
## [1] "Red"
If the Bayes decision boundary in this problem is highly non-linear, then we would expect the best value for 𝐾 to be small. This is because as 𝐾 increases, the method of 𝐾 -nearest neighbors becomes less flexible and produces a decision boundary which is more linear. A small value for 𝐾 , on the other hand, results in increased flexibility and a decision boundary which is more non-linear, which in this situation would be closer to the gold standard Bayes decision boundary.
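A small simulation sketch makes this concrete. Everything below is an assumption chosen for illustration (a non-linear Bayes boundary at x2 = sin(3·x1), 10% label noise, and an arbitrary grid of 𝐾 values), and it uses knn() from the class package rather than the toy knn() defined above:

library(class)
set.seed(1)
n <- 200
train <- data.frame(x1 = runif(n, -1, 1), x2 = runif(n, -1, 1))
test <- data.frame(x1 = runif(n, -1, 1), x2 = runif(n, -1, 1))
bayes_class <- function(d) ifelse(d$x2 > sin(3 * d$x1), "Red", "Green")
noisy <- function(cl) {  # flip 10% of labels so the classes are not separable
  flip <- runif(length(cl)) < 0.1
  cl[flip] <- ifelse(cl[flip] == "Red", "Green", "Red")
  factor(cl)
}
y_train <- noisy(bayes_class(train))
y_test <- noisy(bayes_class(test))
ks <- c(1, 5, 25, 100)
err <- sapply(ks, function(k) mean(class::knn(train, test, y_train, k = k) != y_test))
setNames(signif(err, 2), paste0("K = ", ks))
# Small K can trace the wiggly boundary; very large K oversmooths it toward
# a nearly linear boundary, so the test error rises again.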