EXERCISE - 1

2. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

(a). We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

Problem Type: Regression - The response, CEO salary, is quantitative, so this is a regression problem (even though one predictor, industry, is categorical).

Focus: Inference - We are interested in understanding the effect of various factors on CEO salary rather than simply predicting CEO salary.

  • n and p:

    • n=500 (the number of firms)

    • p=3 (number of predictors: profit, number of employees, and industry; CEO salary is the response variable).

(b). We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

Problem Type: Classification - The goal is to classify whether a new product launch will result in success or failure, a categorical outcome.

Focus: Prediction - We are interested in predicting the success or failure of a product based on other features.

  • n and p:

    • n=20 (the number of products in the dataset)

    • p=13 (number of predictors: price charged, marketing budget, competition price, and other variables)
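The counts for n and p can be read directly off a data frame: n is the number of rows and p is the number of columns minus the response. The sketch below builds a placeholder data set with the same layout as scenario (b); all values and column names (`products`, `other_1` to `other_10`) are made up for illustration and are not the actual product data.

```r
set.seed(1)
n_products <- 20

# Hypothetical stand-in for the product data: one response plus 13 predictors
products <- data.frame(
  success = factor(sample(c("success", "failure"), n_products, replace = TRUE)),
  price = runif(n_products, 5, 50),
  marketing_budget = runif(n_products, 1e4, 1e5),
  competition_price = runif(n_products, 5, 50)
)

# Ten further placeholder predictors
other <- as.data.frame(matrix(rnorm(n_products * 10), nrow = n_products))
names(other) <- paste0("other_", 1:10)
products <- cbind(products, other)

n <- nrow(products)      # number of observations
p <- ncol(products) - 1  # number of predictors (all columns except the response)
print(c(n = n, p = p))
```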

(c). We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.

Problem Type: Regression - The output variable is numerical (percentage change in USD/Euro exchange rate), so this is a regression problem.

Focus: Prediction - The aim is to predict the percentage change in exchange rates based on changes in other markets.

  • n and p:

    • n=52 (weekly data for one year)

    • p=3 (number of predictors: percentage changes in the US, British, and German markets).

3. We now revisit the bias-variance decomposition.

(a). Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.

# Load the plotting library
library(ggplot2)

# Stylized curves for illustration only (not estimated from data)
flexibility <- seq(0, 10, length.out = 100)
bias <- (10 - flexibility)^2 / 100               # squared bias: falls as flexibility grows
variance <- flexibility^2 / 50                   # variance: rises as flexibility grows
bayes_error <- rep(0.5, 100)                     # irreducible error: constant
training_error <- 1.4 * exp(-0.4 * flexibility)  # falls monotonically, eventually below the Bayes error
test_error <- bias + variance + bayes_error      # U-shaped, always above the Bayes error

# Combine the curves into a long-format data frame
data <- data.frame(
  Flexibility = rep(flexibility, 5),
  Error = c(bias, variance, training_error, test_error, bayes_error),
  Type = rep(c("Squared Bias", "Variance", "Training Error", "Test Error", "Bayes Error"), each = 100)
)

# Plot the five curves (linewidth replaces the deprecated size argument for lines)
ggplot(data, aes(x = Flexibility, y = Error, color = Type)) +
  geom_line(linewidth = 1) +
  labs(
    title = "Bias-Variance Decomposition",
    x = "Flexibility of Model",
    y = "Error",
    color = "Curve"
  ) +
  theme_minimal()

(b). Explain why each of the five curves has the shape displayed in part (a).

  1. Squared bias curve: Inflexible models (such as linear models) oversimplify the underlying relationship and therefore have high bias. As flexibility increases, the model can capture more of the data's true structure, so the squared bias decreases.
  2. Variance curve: Variance increases with flexibility. Simple models change little from one training set to another, so their variance is low. As flexibility rises, the model begins to fit the noise in the training set and becomes more sensitive to small changes in the data, which increases variance.
  3. Training error curve: More flexible models fit the training data more closely, so the training error decreases steadily as flexibility rises. For very flexible models it can approach zero and fall below the Bayes error.
  4. Test error curve: The bias-variance trade-off gives the test error curve a U shape. At first, added flexibility reduces bias faster than it increases variance, which lowers the test error. Beyond a certain level of flexibility, the increase in variance outweighs the reduction in bias and the test error rises again.
  5. Bayes error curve: The Bayes error is the irreducible error caused by noise in the data. It does not depend on the model's flexibility, so it is a horizontal line, and the expected test error never falls below it.

These curve shapes reflect the fundamental relationship between model complexity and prediction accuracy, emphasizing the importance of balancing bias and variance for optimal model performance.
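The explanation above can be checked with a small simulation. The sketch below is only an illustration under assumed settings: the true function `true_f`, the noise level `sigma`, the fixed training grid, the test point `x0`, and the polynomial degrees are all arbitrary choices, not part of the exercise. It refits polynomials of increasing degree on many simulated training sets and estimates the squared bias and variance of the prediction at `x0`.

```r
set.seed(1)

true_f <- function(x) sin(2 * x)        # assumed true regression function
sigma <- 0.3                            # assumed noise standard deviation
x_train <- seq(0, 3, length.out = 30)   # fixed training inputs
x0 <- 1                                 # test point at which we predict
degrees <- c(1, 3, 9)                   # increasing flexibility
n_sims <- 500                           # number of simulated training sets

# One column of predictions at x0 per polynomial degree
preds <- sapply(degrees, function(d) {
  replicate(n_sims, {
    train <- data.frame(
      x = x_train,
      y = true_f(x_train) + rnorm(length(x_train), sd = sigma)
    )
    fit <- lm(y ~ poly(x, d), data = train)
    predict(fit, newdata = data.frame(x = x0))
  })
})

sq_bias <- (colMeans(preds) - true_f(x0))^2  # squared bias at x0
variance <- apply(preds, 2, var)             # variance of the prediction at x0

result <- data.frame(
  degree = degrees,
  sq_bias = sq_bias,
  variance = variance,
  expected_test_mse = sq_bias + variance + sigma^2
)
print(result)
```

Under these assumptions the squared bias should shrink and the variance should grow as the degree increases, while `sigma^2` plays the role of the constant irreducible error.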

5. What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

The main advantage of flexible methods is that they can capture complex, non-linear relationships in the data. Because they adapt to intricate patterns, they have lower bias and fit the training data more closely. However, they tend to overfit, particularly when the data set is noisy or small. Their higher variance also makes them more sensitive to changes in the training data. In addition, flexible models are often hard to interpret, which makes it difficult to describe how the predictors relate to the response.

Conversely, less flexible methods are more stable and tend to generalize well to new data. They are more dependable with small or noisy data sets because they have lower variance and are less prone to overfitting. They are also easier to interpret and give clearer insight into how each predictor relates to the response. However, they may fail to capture complex patterns, which results in underfitting and higher bias.

A more flexible approach is preferred when prediction is the main objective, the sample size is large, and the underlying relationship is complex. A less flexible approach is preferred when the data are noisy, the sample size is small, or interpretability is crucial. Choosing between them means weighing the trade-offs between interpretability, flexibility, and the risk of overfitting.
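A rough way to see this trade-off is to fit an inflexible and a flexible model to the same simulated data in two contrasting settings. Everything in the sketch below is an assumption made for illustration: the helper `compare_fits`, the two true functions, the sample sizes, the noise levels, and the choice of a degree-8 polynomial as the "flexible" model.

```r
set.seed(2)

# Fit a linear and a degree-8 polynomial model to the same training set
# and return their mean squared errors on a common test set
compare_fits <- function(true_f, n_train, sigma, n_test = 1000) {
  simulate <- function(n) {
    x <- runif(n, 0, 3)
    data.frame(x = x, y = true_f(x) + rnorm(n, sd = sigma))
  }
  train <- simulate(n_train)
  test <- simulate(n_test)
  mse <- function(degree) {
    fit <- lm(y ~ poly(x, degree), data = train)
    mean((test$y - predict(fit, newdata = test))^2)
  }
  c(linear = mse(1), degree_8 = mse(8))
}

# Small, noisy sample with a truly linear relationship
small_noisy <- compare_fits(function(x) 1 + 2 * x, n_train = 20, sigma = 1)

# Large sample with a clearly non-linear relationship and little noise
large_complex <- compare_fits(function(x) sin(2 * x), n_train = 2000, sigma = 0.3)

print(rbind(small_noisy, large_complex))
```

In the first setting the linear fit would typically have the lower test error, and in the second the flexible fit would, which mirrors the guidance above.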

7. Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors.

(a). Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0.

The Euclidean distance between an observation (X_1, X_2, X_3) and the test point (0, 0, 0) is:

$$\text{Distance} = \sqrt{(X_1 - 0)^2 + (X_2 - 0)^2 + (X_3 - 0)^2}$$

Distances are calculated as follows:

  1. Observation 1:

$$\sqrt{(0 - 0)^2 + (3 - 0)^2 + (0 - 0)^2} = \sqrt{0 + 9 + 0} = 3.00$$

  2. Observation 2:

$$\sqrt{(2 - 0)^2 + (0 - 0)^2 + (0 - 0)^2} = \sqrt{4 + 0 + 0} = 2.00$$

  3. Observation 3:

$$\sqrt{(0 - 0)^2 + (1 - 0)^2 + (3 - 0)^2} = \sqrt{0 + 1 + 9} = \sqrt{10} \approx 3.16$$

  4. Observation 4:

$$\sqrt{(0 - 0)^2 + (1 - 0)^2 + (2 - 0)^2} = \sqrt{0 + 1 + 4} = \sqrt{5} \approx 2.24$$

  5. Observation 5:

$$\sqrt{(-1 - 0)^2 + (0 - 0)^2 + (1 - 0)^2} = \sqrt{1 + 0 + 1} = \sqrt{2} \approx 1.41$$

  6. Observation 6:

$$\sqrt{(1 - 0)^2 + (1 - 0)^2 + (1 - 0)^2} = \sqrt{1 + 1 + 1} = \sqrt{3} \approx 1.73$$

(b). What is our prediction with K = 1? Why?

For K = 1, the nearest neighbor is Observation 5 (Y = Green), as it has the smallest distance (1.41).

  • Prediction is Green

  • Reason is the closest point determines the prediction when K = 1.

(c). What is our prediction with K = 3? Why?

For K = 3, we consider the 3 closest observations:

  • Observation 5 (Y = Green): Distance 1.41

  • Observation 6 (Y = Red): Distance 1.73

  • Observation 2 (Y = Red): Distance 2.00

Votes:

  • Red: 2

  • Green: 1

  • Prediction is Red

  • Reason is majority vote among the 3 nearest neighbors.
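The procedure used in parts (a) to (c), computing distances to the test point and then taking a majority vote among the K nearest neighbors, can be sketched in base R. The functions `euclidean_distances` and `knn_vote` are illustrative helpers written for this sketch, and the four training points with labels "Blue" and "Orange" are hypothetical, chosen only so that the prediction changes between K = 1 and K = 3. They are not the observations from the exercise.

```r
# Euclidean distance from each row of train_x to the query point
euclidean_distances <- function(train_x, query) {
  sqrt(rowSums(sweep(train_x, 2, query)^2))
}

# Majority vote among the k nearest training observations
knn_vote <- function(train_x, labels, query, k) {
  d <- euclidean_distances(train_x, query)
  nearest <- order(d)[seq_len(k)]
  votes <- table(labels[nearest])
  names(votes)[which.max(votes)]
}

# Hypothetical training data (not the exercise's table)
train_x <- matrix(c(1, 0, 0,
                    0, 2, 0,
                    0, 0, 3,
                    1, 1, 0), ncol = 3, byrow = TRUE)
labels <- c("Blue", "Orange", "Orange", "Orange")
query <- c(0, 0, 0)

distances <- euclidean_distances(train_x, query)
pred_k1 <- knn_vote(train_x, labels, query, k = 1)
pred_k3 <- knn_vote(train_x, labels, query, k = 3)
print(round(distances, 2))
print(c(K1 = pred_k1, K3 = pred_k3))
```

With an odd K and two classes the vote cannot tie; for an even K, `which.max` would silently pick the first class in alphabetical order, so a real implementation would need an explicit tie-breaking rule.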

(d). If the Bayes decision boundary in this problem is highly nonlinear, then would we expect the best value for K to be large or small? Why?

If the Bayes decision boundary is highly non-linear, we would expect the best value of K to be small. A small K makes the classifier more flexible, so it can follow local patterns and adapt to the non-linear boundary. A larger K averages over more neighbors and smooths out these local patterns, which produces a boundary closer to linear and can underfit when the true boundary is highly non-linear.
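The effect of K on a non-linear boundary can be illustrated with simulated data. The sketch below assumes a noise-free, wiggly boundary `x2 = 0.8 * sin(4 * x1)`, a training set of 200 points, and a hand-written `knn_classify` helper; all of these are illustrative choices and are unrelated to the six observations in this exercise.

```r
set.seed(42)

# Simulate two classes separated by an assumed highly non-linear boundary
simulate_classes <- function(n) {
  x1 <- runif(n, -1, 1)
  x2 <- runif(n, -1, 1)
  data.frame(
    x1 = x1,
    x2 = x2,
    y = ifelse(x2 > 0.8 * sin(4 * x1), "A", "B"),
    stringsAsFactors = FALSE
  )
}
train <- simulate_classes(200)
test <- simulate_classes(500)

# Classify each test point by majority vote among its k nearest training points
knn_classify <- function(train_x, train_y, test_x, k) {
  apply(test_x, 1, function(point) {
    d <- sqrt(rowSums(sweep(train_x, 2, point)^2))
    votes <- table(train_y[order(d)[seq_len(k)]])
    names(votes)[which.max(votes)]
  })
}

train_x <- as.matrix(train[, c("x1", "x2")])
test_x <- as.matrix(test[, c("x1", "x2")])

# Odd values of K avoid ties between the two classes
k_values <- c(1, 5, 151)
test_error <- sapply(k_values, function(k) {
  mean(knn_classify(train_x, train$y, test_x, k) != test$y)
})
names(test_error) <- paste0("K=", k_values)
print(test_error)
```

Under these assumptions the very large K cannot follow the wiggles of the boundary and should show a clearly higher test error than the small values of K. With noisier labels, K = 1 would itself start to overfit, so the best K would be small but not necessarily 1.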