Statistical Learning Exercise-3 (Chapter 4)

Q5. We now examine the differences between LDA and QDA.

(a) If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? On the test set?

If the Bayes decision boundary is truly linear, QDA (Quadratic Discriminant Analysis) is expected to perform better on the training set, because its extra flexibility lets it fit the training data more closely.
On the test set, LDA (Linear Discriminant Analysis) is expected to perform better: QDA's extra flexibility adds variance without reducing bias, so it tends to over-fit, especially with small data sets.
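
The training/test contrast can be explored with a small simulation. This is only a sketch under assumed settings: the class means, the shared covariance matrix, the sample sizes, and the seed are all illustrative choices, and the helper names simulate_classes and error_rate are hypothetical. A single run illustrates the idea but does not prove it, and the numbers change with the seed.

```r
library(MASS)  # provides lda(), qda(), and mvrnorm()

set.seed(1)  # arbitrary seed, used only for reproducibility

# Simulate two Gaussian classes that share one covariance matrix,
# so the Bayes decision boundary is linear (assumed setup).
simulate_classes <- function(n_per_class) {
  sigma  <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)
  class0 <- mvrnorm(n_per_class, mu = c(0, 0), Sigma = sigma)
  class1 <- mvrnorm(n_per_class, mu = c(1, 1), Sigma = sigma)
  data.frame(
    x1 = c(class0[, 1], class1[, 1]),
    x2 = c(class0[, 2], class1[, 2]),
    y  = factor(rep(c(0, 1), each = n_per_class))
  )
}

# Misclassification rate of a fitted lda/qda model on a data frame.
error_rate <- function(fit, data) {
  mean(predict(fit, newdata = data)$class != data$y)
}

train <- simulate_classes(25)    # small training set
test  <- simulate_classes(5000)  # large test set to approximate the true error

lda_fit <- lda(y ~ x1 + x2, data = train)
qda_fit <- qda(y ~ x1 + x2, data = train)

errors <- data.frame(
  method = c("LDA", "QDA"),
  train  = c(error_rate(lda_fit, train), error_rate(qda_fit, train)),
  test   = c(error_rate(lda_fit, test),  error_rate(qda_fit, test))
)
errors
```

Repeating the run over many seeds and averaging the errors would give a more reliable comparison than any single draw.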

(b) If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the training set? On the test set?

If the Bayes decision boundary is non-linear, QDA is expected to perform better on both the training and test sets.
On the training set, QDA will almost always fit better because it is more flexible.
On the test set, QDA should also do better provided there is enough data to estimate its extra parameters reliably, since LDA's linear boundary is biased in this case. With very small samples, LDA's lower variance might still let it generalize better.

(c) In general, as the sample size n increases, do we expect the test prediction accuracy of QDA relative to LDA to improve, decline, or be unchanged? Why?

As the sample size increases, QDA’s accuracy improves relative to LDA.
With small sample sizes, QDA suffers from high variance because it estimates more parameters.
With large sample sizes, QDA’s estimation becomes more stable, allowing it to leverage its flexibility.
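
To make "more parameters" concrete, the sketch below counts the quantities each method estimates for K classes and p predictors. The function name count_params is hypothetical, and the count assumes the usual Gaussian setup: K mean vectors, K - 1 free prior probabilities, and either one pooled covariance matrix (LDA) or one per class (QDA).

```r
# Number of estimated parameters for K classes and p predictors,
# counting class means, free prior probabilities, and covariance entries.
count_params <- function(p, K, method = c("lda", "qda")) {
  method <- match.arg(method)
  means  <- K * p
  priors <- K - 1
  cov_entries <- p * (p + 1) / 2          # free entries in one symmetric p x p matrix
  n_cov <- if (method == "lda") 1 else K  # LDA pools one matrix; QDA fits one per class
  means + priors + n_cov * cov_entries
}

p_values <- c(2, 10, 50)
param_table <- data.frame(
  p   = p_values,
  lda = sapply(p_values, count_params, K = 2, method = "lda"),
  qda = sapply(p_values, count_params, K = 2, method = "qda")
)
param_table
```

With p = 10 and K = 2, this counts 76 parameters for LDA and 131 for QDA, and the gap widens quickly as p grows, which is why QDA needs more data before its variance settles down.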

(d) True or False: Even if the Bayes decision boundary for a given problem is linear, we will probably achieve a superior test error rate using QDA rather than LDA because QDA is flexible enough to model a linear decision boundary. Justify your answer.

False. If the true boundary is linear, then LDA is preferable: its stronger assumption of a common covariance matrix across classes matches a linear boundary, and it requires estimating fewer parameters.
QDA, despite being flexible enough to model a linear decision boundary, estimates a separate covariance matrix for each class. This adds variance with no offsetting reduction in bias, leading to higher test error rates, particularly when sample sizes are small.

Q6. Suppose we collect data for a group of students in a statistics class with variables X1 = hours studied, X2 = undergrad GPA, and Y = receive an A. We fit a logistic regression and produce estimated coefficients, βˆ0 = −6, βˆ1 = 0.05, βˆ2 = 1.

(a) Estimate the probability that a student who studies for 40 h and has an undergrad GPA of 3.5 gets an A in the class.

# Define the coefficients
beta_0 <- -6
beta_1 <- 0.05
beta_2 <- 1

# (a) Probability estimation
X1 <- 40  # hours studied
X2 <- 3.5  # undergrad GPA
eta <- beta_0 + beta_1 * X1 + beta_2 * X2  # linear predictor (log-odds)
logit_prob <- exp(eta) / (1 + exp(eta))    # inverse logit
logit_prob

## [1] 0.3775407

A student who studies 40 hours and has a GPA of 3.5 has an estimated 37.75% probability of getting an A in the class.
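
As a cross-check, the same probability can be obtained with R's built-in inverse logit, plogis(), from the stats package. The variable names below are illustrative; the coefficients and inputs are those given in the question.

```r
# Same coefficients and inputs as in part (a)
eta <- -6 + 0.05 * 40 + 1 * 3.5  # linear predictor (log-odds); equals -0.5

p_manual <- exp(eta) / (1 + exp(eta))  # formula written out by hand
p_plogis <- plogis(eta)                # built-in logistic distribution function

all.equal(p_manual, p_plogis)  # checks that the two approaches agree
```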

(b) How many hours would the student in part (a) need to study to have a 50 % chance of getting an A in the class?

# (b) Solving for X1 when probability = 0.5
X2_fixed <- 3.5
X1_needed <- (log(0.5 / (1 - 0.5)) - beta_0 - beta_2 * X2_fixed) / beta_1
X1_needed

## [1] 50

The student would need to study 50 hours to have a 50% probability of getting an A.
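
The algebra works because a probability of 0.5 corresponds to log-odds of 0, so the hours must satisfy −6 + 0.05 × X1 + 3.5 = 0. A numerical check, sketched here with uniroot(), reaches the same value without rearranging the equation by hand. The helper name prob_of_A and the search interval of 0 to 200 hours are assumptions made for illustration.

```r
beta <- c(-6, 0.05, 1)  # intercept, hours coefficient, GPA coefficient
gpa  <- 3.5

# Estimated probability of an A as a function of hours studied.
prob_of_A <- function(hours) plogis(beta[1] + beta[2] * hours + beta[3] * gpa)

# Find the hours at which the probability crosses 0.5.
# The interval c(0, 200) is an assumed search range that brackets the root.
sol <- uniroot(function(h) prob_of_A(h) - 0.5, interval = c(0, 200), tol = 1e-9)
hours_needed <- sol$root
hours_needed
```

The root-finding approach is more than this problem needs, but it carries over to target probabilities or models where a closed-form solution is less convenient.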

Q8. Suppose that we take a data set, divide it into equally-sized training and test sets, and then try out two different classification procedures. First we use logistic regression and get an error rate of 20 % on the training data and 30 % on the test data. Next we use 1-nearest neighbors (i.e. K = 1) and get an average error rate (averaged over both test and training data sets) of 18 %. Based on these results, which method should we prefer to use for classification of new observations? Why?

Logistic Regression: Training Error = 20%, Test Error = 30%
1-Nearest Neighbor (K=1): Average Error = 18%

Which method should we prefer?

- With K=1, each training observation is its own nearest neighbor, so the KNN training error is 0%.
- Because the training and test sets are equal in size, an average error of 18% implies a KNN test error of 2 × 18% − 0% = 36%.
- Logistic regression has a test error of 30%, which is lower than 36%, so we should prefer logistic regression for classifying new observations.
- KNN (K=1) is highly flexible and over-fits the training data here; increasing K (e.g., K=3 or 5) could reduce over-fitting, but that would have to be checked on test data.
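
The comparison depends on how the 18% average splits between training and test error. A minimal sketch of that arithmetic follows, assuming the 1-NN training error is 0 (each training point is its own nearest neighbor, with no duplicated points carrying different labels). The variable names are illustrative.

```r
logistic_test_error <- 0.30
knn_avg_error       <- 0.18
knn_train_error     <- 0  # assumption: 1-NN classifies every training point correctly

# Equal-sized sets: average = (train + test) / 2, so test = 2 * average - train
knn_test_error <- 2 * knn_avg_error - knn_train_error
knn_test_error

# Prefer the method with the lower test error
preferred <- if (logistic_test_error < knn_test_error) "logistic regression" else "1-NN"
preferred
```

Under this assumption the implied 1-NN test error is 36%, which is higher than the 30% for logistic regression.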

Q9. This problem has to do with odds.

(a) On average, what fraction of people with an odds of 0.37 of defaulting on their credit card payment will in fact default?

odds <- 0.37
default_prob <- odds / (1 + odds)
default_prob

## [1] 0.270073

On average, about 27% of people with odds of 0.37 of defaulting will in fact default.

(b) Suppose that an individual has a 16 % chance of defaulting on her credit card payment. What are the odds that she will default?

prob <- 0.16
default_odds <- prob / (1 - prob)
default_odds

## [1] 0.1904762

If a person has a 16% probability of defaulting, the corresponding odds are 0.19.
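
Both parts use the same relationship, odds = p / (1 − p), read in opposite directions. The two conversions can be wrapped in small helpers, sketched below; the function names odds_to_prob and prob_to_odds are hypothetical.

```r
# Convert odds to a probability: p = odds / (1 + odds)
odds_to_prob <- function(odds) odds / (1 + odds)

# Convert a probability to odds: odds = p / (1 - p)
prob_to_odds <- function(prob) prob / (1 - prob)

odds_to_prob(0.37)  # part (a)
prob_to_odds(0.16)  # part (b)

# Round trip: converting odds to a probability and back recovers the odds
all.equal(prob_to_odds(odds_to_prob(0.37)), 0.37)
```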