1. We now examine the differences between LDA and QDA.
  1. If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? On the test set?

A: Because of its greater flexibility, we would expect QDA to perform (slightly) better on the training set, and LDA to perform better on the test set. QDA's edge on the training data comes from overfitting to spurious non-linearity in the training sample, which is unlikely to be present in the test set. A quick simulation (sketched below) illustrates this pattern.
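
A minimal simulation sketch of this pattern (assuming scikit-learn and a synthetic two-class problem with a shared covariance matrix, so the Bayes boundary is linear; the data generator and sample sizes are purely illustrative). It typically shows QDA with a slightly lower training error and a slightly higher test error than LDA:

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)

def make_data(n):
    # Two Gaussian classes sharing the identity covariance, means at (-1, -1)
    # and (1, 1): the Bayes decision boundary is linear.
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 2)) + np.where(y[:, None] == 1, 1.0, -1.0)
    return X, y

X_train, y_train = make_data(100)
X_test, y_test = make_data(10_000)

for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                  ("QDA", QuadraticDiscriminantAnalysis())]:
    clf.fit(X_train, y_train)
    print(name,
          f"train error: {1 - clf.score(X_train, y_train):.3f}",
          f"test error: {1 - clf.score(X_test, y_test):.3f}")
```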

  1. If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the training set? On the test set?

A: This depends on the form of the non-linearity in the Bayes decision boundary. Because of its greater flexibility, we would expect QDA to perform better on the training data, and noticeably better if the non-linearity is quadratic or nearly so. On the test data we would also generally expect QDA to do better, but again this depends on the non-linearity: some non-linear relationships that QDA describes poorly may be approximated as well, or better, by LDA.

  1. In general, as the sample size n increases, do we expect the test prediction accuracy of QDA relative to LDA to improve, decline, or be unchanged? Why?

A: Generally speaking, we would expect the test prediction accuracy of the more flexible model (QDA) to improve relative to the less flexible model (LDA): as n grows, the variance of the more flexible fit shrinks, and any non-linear pattern it picks up in the training data is less likely to be spurious. The sketch below illustrates this.
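
A hedged sketch of the sample-size effect (again assuming scikit-learn; here the two classes have different covariances, so the Bayes boundary is genuinely non-linear, and QDA's test error relative to LDA's usually improves as n grows):

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(1)

def make_data(n):
    # Class 1 has twice the spread of class 0: class-specific covariances
    # give a quadratic (non-linear) Bayes decision boundary.
    y = rng.integers(0, 2, size=n)
    spread = np.where(y == 1, 2.0, 1.0)[:, None]
    X = rng.normal(size=(n, 2)) * spread + np.where(y[:, None] == 1, 1.0, -1.0)
    return X, y

X_test, y_test = make_data(20_000)
for n in (20, 50, 200, 1_000, 5_000):
    X_train, y_train = make_data(n)
    errors = {}
    for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                      ("QDA", QuadraticDiscriminantAnalysis())]:
        errors[name] = 1 - clf.fit(X_train, y_train).score(X_test, y_test)
    print(f"n = {n:5d}   LDA test error = {errors['LDA']:.3f}   "
          f"QDA test error = {errors['QDA']:.3f}")
```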

  1. True or False: Even if the Bayes decision boundary for a given problem is linear, we will probably achieve a superior test error rate using QDA rather than LDA because QDA is flexible enough to model a linear decision boundary. Justify your answer.

A: False. Particularly with a small sample size, the extra variance from the more flexible method (QDA) will lead to overfitting and a higher test error than LDA. Even with a large sample size, there is no reason to favour QDA when we already know that the Bayes decision boundary is linear; if that logic were correct, we would simply always favour the most flexible method.

  1. Suppose we collect data for a group of students in a statistics class with variables X1 = hours studied, X2 = undergrad GPA, and Y = receive an A. We fit a logistic regression and produce estimated coefficients β̂0 = −6, β̂1 = 0.05, β̂2 = 1.
  1. Estimate the probability that a student who studies for 40 h and has an undergrad GPA of 3.5 gets an A in the class.

A: 0.378
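
A quick arithmetic check (a minimal Python sketch, not part of the original solution):

```python
import math

# Estimated coefficients from the fitted logistic regression.
b0, b1, b2 = -6, 0.05, 1
hours, gpa = 40, 3.5

log_odds = b0 + b1 * hours + b2 * gpa   # -6 + 2 + 3.5 = -0.5
p = 1 / (1 + math.exp(-log_odds))       # logistic function
print(f"{p:.3f}")                       # 0.378
```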

  1. How many hours would the student in part (a) need to study to have a 50 % chance of getting an A in the class?

A: 50 hours
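
The same coefficients, rearranged to find the study time that makes the log-odds zero (i.e. a 50% chance); again just an illustrative check:

```python
# 50% chance  <=>  log-odds = 0:
#   -6 + 0.05 * hours + 1 * 3.5 = 0   =>   hours = 2.5 / 0.05 = 50
b0, b1, b2 = -6, 0.05, 1
gpa = 3.5

hours = -(b0 + b2 * gpa) / b1
print(hours)  # 50.0
```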

  1. Suppose that we take a data set, divide it into equally-sized training and test sets, and then try out two different classification procedures. First we use logistic regression and get an error rate of 20 % on the training data and 30 % on the test data. Next we use 1-nearest neighbors (i.e. K = 1) and get an average error rate (averaged over both test and training data sets) of 18 %. Based on these results, which method should we prefer to use for classification of new observations? Why?

A: We would prefer logistic regression. The training error for KNN can be thought of as the error that occurs when the training data is fed back in as the test set. When K = 1, each prediction looks for the single closest observation in the training data, which for a training observation is the observation itself; that observation's response value is then returned as the prediction.

This will always give zero training error, irrespective of the dataset or whether classification or regression is being used. (Let's assume the observations are unique, i.e. there are no cases of two or more observations with identical predictors but different response values. That is the only problematic case: if both training observations are at zero distance from the test observation, what should the predicted response be?)

This means that, if KNN (with K = 1) averages an 18% error across train and test, its training error will be 0, so its test error must be 2 × 18% = 36%, which is worse than the 30% test error of logistic regression. For this reason we would prefer logistic regression: that classifier generalizes better to new data, which is all we care about.
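
A small sketch of the key fact (assuming scikit-learn and any toy dataset with unique predictor values): a 1-nearest-neighbour classifier reproduces its own training labels, so its training error is zero and the implied test error follows by arithmetic.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 2))      # unique points (almost surely)
y_train = rng.integers(0, 2, size=50)   # even arbitrary labels are memorised

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
train_error = 1 - knn.score(X_train, y_train)
print(train_error)                      # 0.0: each point is its own nearest neighbour

# An 18% error averaged over train and test therefore implies
# test error = 2 * 0.18 - train_error = 0.36, worse than logistic regression's 0.30.
print(2 * 0.18 - train_error)
```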

  1. This problem has to do with odds.
  1. On average, what fraction of people with an odds of 0.37 of defaulting on their credit card payment will in fact default?

A: p = odds / (1 + odds) = 0.37 / 1.37 ≈ 0.27

  1. Suppose that an individual has a 16 % chance of defaulting on her credit card payment. What are the odds that she will default?

A: odds = p / (1 − p) = 0.16 / 0.84 ≈ 0.19