(a) If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? On the test set?
Answer:
Training set: QDA will likely achieve a lower training error because its extra flexibility lets it fit the training data more closely, even though that flexibility is unnecessary when the true boundary is linear.
Test set: LDA is expected to perform better since it assumes a linear boundary and avoids overfitting. QDA might overfit due to its flexibility, leading to worse generalization on new data.
(b) If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the training set? On the test set?
Answer:
Training set: QDA should perform better because it can capture non-linear relationships in the data.
Test set: QDA will also perform better as long as there’s enough data to support its complexity. However, with small sample sizes, QDA can overfit, making LDA the safer choice, as it generalizes better with limited data.
(c) As the sample size (n) increases, do we expect the test prediction accuracy of QDA relative to LDA to improve, decline, or stay the same?
Answer: As the sample size increases, the test prediction accuracy of QDA relative to LDA should improve. QDA's flexibility comes at the cost of higher variance, and that variance shrinks as more data become available. With ample data, QDA can model the decision boundary accurately and will outperform LDA when the true boundary is non-linear.
(d) True or False: If the Bayes decision boundary is linear, we will likely achieve a better test error rate using QDA rather than LDA because QDA is flexible enough to model linear decision boundaries.
Answer: False. Although QDA is flexible and can model a linear decision boundary, it adds unnecessary complexity when the boundary is truly linear. This complexity can lead to overfitting, meaning LDA, with its simpler model, will generally achieve a lower test error rate for linear problems.
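To make parts (a) and (d) concrete, here is a minimal simulation sketch in Python with scikit-learn. The data-generating process (two Gaussian classes with a shared identity covariance, so the Bayes boundary is linear) and the sample size are illustrative assumptions, not part of the exercise; shrinking n makes QDA's overfitting more visible.

    import numpy as np
    from sklearn.discriminant_analysis import (
        LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)

    rng = np.random.default_rng(0)
    n = 200  # observations per class (assumed; try smaller n)

    # Shared covariance, different means: the Bayes boundary is linear.
    X_train = np.vstack([rng.normal(0, 1, (n, 2)), rng.normal(1, 1, (n, 2))])
    y_train = np.repeat([0, 1], n)
    X_test = np.vstack([rng.normal(0, 1, (n, 2)), rng.normal(1, 1, (n, 2))])
    y_test = np.repeat([0, 1], n)

    for model in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
        model.fit(X_train, y_train)
        print(type(model).__name__,
              "train acc:", round(model.score(X_train, y_train), 3),
              "test acc:", round(model.score(X_test, y_test), 3))

QDA typically matches or beats LDA on the training set here, while LDA holds up as well or better on the test set, consistent with the answers above.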
We fit a logistic regression to predict whether a student receives an A in a class, with \(X_1\) = hours studied and \(X_2\) = undergraduate GPA, and obtain the estimated coefficients:
\[ \hat{\beta}_0 = -6, \quad \hat{\beta}_1 = 0.05, \quad \hat{\beta}_2 = 1 \]
The logistic regression model is given by:
\[ \hat{p} = \frac{e^{\left( \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 \right)}}{1 + e^{\left( \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 \right)}} \]
(a) Estimate the probability that a student who studies for 40 hours and has an undergrad GPA of 3.5 will receive an A in the class.
Substituting \(X_1 = 40\) and \(X_2 = 3.5\) into the model:
\[ \hat{p} = \frac{e^{\left( -6 + 0.05 \cdot 40 + 1 \cdot 3.5 \right)}}{1 + e^{\left( -6 + 0.05 \cdot 40 + 1 \cdot 3.5 \right)}} \]
Simplifying the expression:
\[ \hat{p} = \frac{e^{-0.5}}{1 + e^{-0.5}} \approx \frac{0.6065}{1 + 0.6065} \approx 0.378 \]
Thus, the probability that the student will receive an A is approximately 37.8%.
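As a quick numerical check (a sketch, not part of the exercise), the same calculation in Python:

    import math

    b0, b1, b2 = -6, 0.05, 1
    x1, x2 = 40, 3.5            # hours studied, undergrad GPA
    z = b0 + b1 * x1 + b2 * x2  # log-odds = -0.5
    p = math.exp(z) / (1 + math.exp(z))
    print(round(p, 4))          # 0.3775, i.e. roughly 37.8%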
(b) How many hours would the student in part (a) need to study to have a 50% chance of getting an A?
We need to find \(X_1\) such that the probability \(\hat{p} = 0.5\). Setting the model equal to 0.5:
\[ 0.5 = \frac{e^{\left( -6 + 0.05 X_1 + 1 \cdot 3.5 \right)}}{1 + e^{\left( -6 + 0.05 X_1 + 1 \cdot 3.5 \right)}} \]
A probability of \(\hat{p} = 0.5\) corresponds to odds of 1, so this simplifies to:
\[ e^{\left( -6 + 0.05 X_1 + 3.5 \right)} = 1 \]
Taking the natural logarithm of both sides:
\[ -6 + 0.05 X_1 + 3.5 = 0 \]
Solving for \(X_1\):
\[ 0.05 X_1 = 2.5 \quad \Rightarrow \quad X_1 = \frac{2.5}{0.05} = 50 \]
Thus, the student would need to study 50 hours to have a 50% chance of getting an A in the class.
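Again as a sketch, the same algebra in Python: a 50% probability means the log-odds are zero, so we solve \( \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 = 0 \) for \(X_1\).

    b0, b1, b2 = -6, 0.05, 1
    x2 = 3.5
    x1 = -(b0 + b2 * x2) / b1
    print(x1)  # 50.0 hours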
We divide a dataset into equally sized training and test sets and use two different classification methods:
Logistic Regression:
Training error rate: 20%
Test error rate: 30%
1-Nearest Neighbors (K=1):
Average error rate (averaged over the training and test sets): 18%
Which method should we prefer?
We should prefer logistic regression over 1-nearest neighbors (K=1) for classifying new observations. Here’s why:
Logistic Regression:
Logistic regression has a 20% training error and a 30% test error. The gap is moderate, indicating that the model is not severely overfitting and that its test error honestly reflects how it will perform on unseen data.
1-Nearest Neighbors (K=1):
With K=1, each training observation is its own nearest neighbor, so the training error rate is 0%. For the average of the training and test error rates to equal 18%, the test error rate must be 2 × 18% − 0% = 36%, which is worse than logistic regression's 30%. In other words, K=1 memorizes the training data but does not perform well on new, unseen data.
Why is Logistic Regression a better choice?
Overfitting: 1-NN tends to memorize the training data, making it prone to overfitting, which results in poor performance on new observations.
Generalization: Logistic regression has a more stable error distribution across training and test sets, indicating better generalization.
Test Error is Key: The test error matters most because it reflects the model's performance on unseen data. Even though K=1 has a lower average error rate, its implied 36% test error makes logistic regression the more dependable choice for predicting new observations.
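The claim that 1-NN's training error is 0% is easy to verify; the sketch below uses synthetic placeholder data (the features and labels are illustrative, not from the exercise).

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(1)
    X_train = rng.normal(size=(100, 2))   # placeholder features
    y_train = rng.integers(0, 2, 100)     # placeholder labels

    # Each distinct training point is its own nearest neighbor,
    # so 1-NN classifies the training set perfectly.
    knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
    print("1-NN training error:", 1 - knn.score(X_train, y_train))  # 0.0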
Odds and Probabilities
Odds are related to probability through the formula:
\[ \text{Odds} = \frac{P(\text{Event Happening})}{P(\text{Event Not Happening})} = \frac{p}{1-p} \]
If the odds are given, we can compute the probability as:
\[ p = \frac{\text{Odds}}{1 + \text{Odds}} \]
(a) On average, what fraction of people with odds of 0.37 of defaulting on their credit card payment will in fact default?
Given that the odds are 0.37, the probability can be calculated as:
\[ p = \frac{0.37}{1 + 0.37} = \frac{0.37}{1.37} \approx 0.27 \]
Thus, approximately 27% of people with these odds will default on their credit card payments.
(b) Suppose an individual has a 16% chance of defaulting on her credit card payment. What are the odds that she will default?
Given that the probability of default is \(p = 0.16\), the odds can be calculated using the formula:
\[ \text{Odds} = \frac{0.16}{1 - 0.16} = \frac{0.16}{0.84} \approx 0.19 \]
So, the odds that the individual will default are approximately 0.19.
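A minimal sketch of the two conversions used in parts (a) and (b); the function names are mine, chosen for clarity:

    def odds_to_prob(odds):
        """Convert odds p/(1-p) back to a probability p."""
        return odds / (1 + odds)

    def prob_to_odds(p):
        """Convert a probability p to odds p/(1-p)."""
        return p / (1 - p)

    print(round(odds_to_prob(0.37), 2))  # 0.27 -> about 27% default
    print(round(prob_to_odds(0.16), 2))  # 0.19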