NOTE: There are no official solutions for these questions. These are my solutions and could be incorrect. If you spot any mistakes/inconsistencies, please contact me at Liam95morgan@gmail.com, or via LinkedIn.

Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani

Loading packages:

library(tidyverse)
library(ISLR) # 'Weekly' data
library(caret) # train(), confusionMatrix()
library(MASS) # lda(), qda(), `Boston` data
select <- dplyr::select # MASS 'select' clashing with dplyr
library(class) # knn()
library(gridExtra)

theme_set(theme_light())




1. Logistic Regression - Equivalent Representations

Q: Using a little bit of algebra, show that (4.2) is equivalent to (4.3). In other words, the logistic function representation and the logit representation for the logistic regression model are equivalent.

A:

I also include the (4.4) representation, as it’s the form I think of when I think of logistic regression.

\[ \begin{align*} & p(X) = \frac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}} && (4.2)\\ \implies & p(X) + p(X) \cdot e^{\beta_0 + \beta_1X} = e^{\beta_0 + \beta_1X} \\ \implies & p(X) = (1 - p(X)) \cdot e^{\beta_0 + \beta_1X} \\ \implies & \frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1X} && (4.3)\\ \implies & log \left( \frac{p(X)}{1 - p(X)} \right) = \beta_0 + \beta_1X && (4.4) \end{align*} \]
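
As a quick numerical sanity check of this equivalence (with arbitrary made-up values for \(\beta_0\), \(\beta_1\) and \(X\)):

b0 <- -1; b1 <- 2; x <- 0.7

p <- exp(b0 + b1 * x) / (1 + exp(b0 + b1 * x))   # (4.2)
all.equal(log(p / (1 - p)), b0 + b1 * x)         # (4.4) recovers the linear predictor: TRUE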




2. Linear Discriminant Analysis - Discriminant Function Proof (\(p\) = 1)

Q: It was stated in the text that classifying an observation to the class for which (4.12) is largest is equivalent to classifying an observation to the class for which (4.13) is largest. Prove that this is the case. In other words, under the assumption that the observations in the kth class are drawn from a \(\mathcal{N}(\mu_k, \sigma^2)\) distribution, the Bayes’ classifier assigns an observation to the class for which the discriminant function is maximized.

A:

First, as I think it will be a good piece of reference material, I want to show how we get from Bayes Theorem (with some assumptions) to an LDA classifier.

We know that with LDA we want to find the class \(k\) for which \(p_k(x)\) is maximized, as this will be our class prediction.

\[\begin{align*} p_k(x) & = P_r(Y = k | X = x) && \text{(want the } k \text{ that maximizes this)} \\ & = \frac{P_r(X = x | Y = k) \cdot P_r(Y = k)}{P_r(X = x)} && \text{(Bayes Theorem)} \\ & = \frac{P_r(Y = k) \cdot P_r(X = x | Y = k)}{\sum_{l=1}^{K} P_r(X = x | Y = l) \cdot P_r(Y = l)} && \text{(more Bayes Theorem)} \end{align*}\]

From here we simply change notation a little:

\[\implies p_k(x) = \frac{\pi_k \cdot f_k(x)}{\sum_{l=1}^{K} \pi_l \cdot f_l(x)}\]

We also make two assumptions:

  • \(f_k(x)\) is normal (Gaussian): \(f_k(x) = \frac{1}{\sqrt{2\pi}\sigma_k} exp(-\frac{1}{2\sigma_k^2} (x-\mu_k)^2)\)
  • The variance is shared across all \(K\) classes: \(\sigma_1^2 = \sigma_2^2 = \dots = \sigma_K^2 = \sigma^2\)

Plugging in this simplified density function, we get:

\[p_k(x) = \frac{\pi_k \cdot f_k(x)}{\sum_{l=1}^{K} \pi_l \cdot f_l(x)} = \frac {\pi_k \cdot \frac{1}{\sqrt{2\pi}\sigma} exp(-\frac{1}{2\sigma^2} (x-\mu_k)^2)} {\sum_{l=1}^{K} \pi_l \cdot \frac{1}{\sqrt{2\pi}\sigma} exp(-\frac{1}{2\sigma^2} (x-\mu_l)^2)} \quad \text{(4.12)}\]


Now, more specifically to the question, we have:

\[\begin{align*} p_k(x) & = \frac{\pi_k \cdot \frac{1}{\sqrt{2\pi}\sigma} exp(-\frac{1}{2\sigma^2} (x-\mu_k)^2)} {\sum_{l=1}^{K} \pi_l \cdot \frac{1}{\sqrt{2\pi}\sigma} exp(-\frac{1}{2\sigma^2} (x-\mu_l)^2)} \\ & = \frac{\pi_k \cdot e^{\gamma_k}}{\sum_{l=1}^{K} \pi_l \cdot e^{\gamma_l}} && \text{(where: } \gamma_j = -\frac{1}{2\sigma^2} (x-\mu_j)^2 \text{)} \end{align*}\]

We classify an observation to the class \(k\) for which \(p_k(x)\) is maximized.

Finding the \(k\) that maximizes \(p_k(x)\) is equivalent to finding the \(k\) that maximizes \(log(p_k(x))\), since \(log\) is monotonically increasing: \(log(\alpha) > log(\beta)\) for all \(\alpha > \beta > 0\).

Hence:

\[\begin{align*} log(p_k(x)) & = log \left( \frac{\pi_k \cdot e^{\gamma_k}}{\sum_{l=1}^{K} \pi_l \cdot e^{\gamma_l}} \right) \\ & = log(\pi_k \cdot e^{\gamma_k}) - log \left( \sum_{l=1}^{K} \pi_l \cdot e^{\gamma_l} \right) && \text{(Log laws)}\\ & = log(\pi_k) + \gamma_k - log \left( \sum_{l=1}^{K} \pi_l \cdot e^{\gamma_l} \right) && \text{(more Log laws)} \end{align*}\]

Now, notice that \(log \left( \sum_{l=1}^{K} \pi_l \cdot e^{\gamma_l} \right)\) is a summation over \(K\) (notice it’s a big \(K\)) classes, and so it will take the same value across every value of \(k\) (little \(k\)).

Choosing a \(k\) to maximize \(log(\pi_k) + \gamma_k - log \left( \sum_{l=1}^{K} \pi_l \cdot e^{\gamma_l} \right)\) is therefore equivalent to choosing a \(k\) to maximize \(log(\pi_k) + \gamma_k\).

\[\begin{align*} & log(\pi_k) + \gamma_k \\ = \; & log(\pi_k) - \frac{1}{2\sigma^2} (x-\mu_k)^2 && \text{(re-subbing: } \gamma_j = -\frac{1}{2\sigma^2} (x-\mu_j)^2 \text{)} \\ = \; & log(\pi_k) - \frac{(x^2 + \mu_k^2 - 2\mu_kx)}{2\sigma^2} \\ = \; & log(\pi_k) - \frac{\mu_k^2}{2\sigma^2} + \frac{\mu_kx}{\sigma^2} - \frac{x^2}{2\sigma^2} \end{align*}\]

Similar to the previous step, we can note that \(\frac{x^2}{2\sigma^2}\) is constant for all values of \(k\), and so choosing the \(k\) to maximize \(log(\pi_k) - \frac{\mu_k^2}{2\sigma^2} + \frac{\mu_kx}{\sigma^2} - \frac{x^2}{2\sigma^2}\) is the same as choosing the \(k\) to maximize \(\delta_k(x) = \left( \frac{\mu_k}{\sigma^2} \right) \cdot x + \left( log(\pi_k) - \frac{\mu_k^2}{2\sigma^2} \right) \quad \text{(4.13)}\).

Notice that \(\delta_k(x)\) is linear in \(x\) (hence, linear discriminant analysis!).

I have therefore shown that choosing the \(k\) that maximizes \(p_k(x)\) is equivalent to choosing the \(k\) that maximizes \(\delta_k(x)\), as required.
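
A tiny numerical illustration of this (a sketch with made-up parameters): computing \(\pi_k f_k(x)\) (proportional to \(p_k(x)\)) and \(\delta_k(x)\) for three classes, and checking that they select the same class.

pi_k  <- c(0.2, 0.5, 0.3)   # priors
mu    <- c(-1, 0, 2)        # class means
sigma <- 1                  # shared standard deviation
x     <- 0.8

p_k     <- pi_k * dnorm(x, mean = mu, sd = sigma)                # numerator of (4.12)
delta_k <- mu * x / sigma^2 + log(pi_k) - mu^2 / (2 * sigma^2)   # (4.13)

which.max(p_k) == which.max(delta_k)   # TRUE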




3. Quadratic Discriminant Analysis - Discriminant Function Proof (\(p\) = 1)

Q: This problem relates to the QDA model, in which the observations within each class are drawn from a normal distribution with a class-specific mean vector and a class specific covariance matrix. We consider the simple case where \(p\) = 1; i.e. there is only one feature. Suppose that we have \(K\) classes, and that if an observation belongs to the \(k\)th class then \(X\) comes from a one-dimensional normal distribution, \(X \sim \mathcal{N}(\mu_k, \sigma_k^2)\). Recall that the density function for the one-dimensional normal distribution is given in (4.11). Prove that in this case, the Bayes’ classifier is not linear. Argue that it is in fact quadratic. Hint: For this problem, you should follow the arguments laid out in Section 4.4.2, but without making the assumption that \(\sigma_1^2 = \sigma_2^2 = \dots = \sigma_k^2\).

A:

The difference with QDA is that we don’t make the assumption of equal covariance matrices (equal variances here, as \(p\) = 1) across the classes. Therefore \(\sigma_1^2 = \sigma_2^2 = \dots = \sigma_K^2\) is not assumed, so we have:

\[f_k(x) = \frac{1}{\sqrt{2\pi}\sigma_k} exp(-\frac{1}{2\sigma_k^2} (x-\mu_k)^2)\]

Hence:

\[\begin{align*} p_k(x) & = \frac{\pi_k \cdot f_k(x)} {\sum_{l=1}^{K} \pi_l \cdot f_l(x)} \\ & = \frac{\pi_k \cdot \frac{1}{\sqrt{2\pi}\sigma_k} exp \left( -\frac{1}{2\sigma_k^2} (x-\mu_k)^2 \right)} {\sum_{l=1}^{K} \pi_l \cdot \frac{1}{\sqrt{2\pi}\sigma_l} exp \left( -\frac{1}{2\sigma_l^2} (x-\mu_l)^2 \right)} \\ & = \frac{\frac{\pi_k}{\sigma_k} \cdot e^{\gamma_k}} {\sum_{l=1}^{K} \frac{\pi_l}{\sigma_l} \cdot e^{\gamma_l}} && \text{(cancel out } \frac{1}{\sqrt{2\pi}} \text{, sub in: } \gamma_j = -\frac{1}{2\sigma_j^2} (x-\mu_j)^2 \text{)} \end{align*}\]

Following the same steps as the last question, we want to find the class \(k\) for which the following is maximized:

\[\begin{align*} log(p_k(x)) & = log \left( \frac{\frac{\pi_k}{\sigma_k} \cdot e^{\gamma_k}} {\sum_{l=1}^{K} \frac{\pi_l}{\sigma_l} \cdot e^{\gamma_l}} \right) \\ & = log \left( \frac{\pi_k}{\sigma_k} \cdot e^{\gamma_k} \right) - log \left( \sum_{l=1}^{K} \frac{\pi_l}{\sigma_l} \cdot e^{\gamma_l} \right) \\ & = log(\pi_k) + \gamma_k - log(\sigma_k) - log \left( \sum_{l=1}^{K} \frac{\pi_l}{\sigma_l} \cdot e^{\gamma_l} \right) \end{align*}\]

As in the previous question, we eliminate \(log \left( \sum_{l=1}^{K} \frac{\pi_l}{\sigma_l} \cdot e^{\gamma_l} \right)\), since it is constant \(\forall k\).

This is therefore equivalent to finding the \(k\) that maximizes:

\[\begin{align*} & log(\pi_k) + \gamma_k - log(\sigma_k) \\ = \; & log(\pi_k) -\frac{1}{2\sigma_k^2} (x-\mu_k)^2 - log(\sigma_k) && \text{(re-subbing: } \gamma_j = -\frac{1}{2\sigma_j^2} (x-\mu_j)^2 \text{)} \\ = \; & log(\pi_k) - \frac{(x^2 + \mu_k^2 - 2\mu_kx)}{2\sigma_k^2} - log(\sigma_k)\\ = \; & log(\pi_k) - \frac{\mu_k^2}{2\sigma_k^2} + \frac{\mu_kx}{\sigma_k^2} - \frac{x^2}{2\sigma_k^2} - log(\sigma_k) \\ = \; & \left( - \frac{1}{2\sigma_k^2} \right) \cdot x^2 + \left( \frac{\mu_k}{\sigma_k^2} \right) \cdot x + \left( log(\pi_k) - log(\sigma_k) - \frac{\mu_k^2}{2\sigma_k^2} \right) \end{align*}\]

By dropping the equal-variance assumption, we have obtained a different discriminant function, one that is clearly not linear in \(x\) (it’s quadratic in \(x\), hence quadratic discriminant analysis!).
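
As with LDA above, a quick made-up-numbers check that maximizing this quadratic discriminant selects the same class as maximizing \(\pi_k f_k(x)\):

pi_k  <- c(0.3, 0.7)   # priors
mu    <- c(0, 2)       # class means
sigma <- c(1, 3)       # class-specific standard deviations
x     <- 1.5

p_k     <- pi_k * dnorm(x, mean = mu, sd = sigma)
delta_k <- -x^2 / (2 * sigma^2) + mu * x / sigma^2 + log(pi_k) - log(sigma) - mu^2 / (2 * sigma^2)

which.max(p_k) == which.max(delta_k)   # TRUE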




4. Local Approaches (e.g. KNN) - High Dimensionality

When the number of features \(p\) is large, there tends to be a deterioration in the performance of KNN and other local approaches that perform prediction using only observations that are near the test observation for which a prediction must be made. This phenomenon is known as the curse of dimensionality, and it ties into the fact that non-parametric approaches often perform poorly when \(p\) is large. We will now investigate this curse.


(a) \(p\) = 1

Q: Suppose that we have a set of observations, each with measurements on \(p\) = 1 feature, \(X\). We assume that \(X\) is uniformly (evenly) distributed on \([0,1]\). Associated with each observation is a response value. Suppose that we wish to predict a test observation’s response using only observations that are within 10% of the range of \(X\) closest to that test observation. For instance, in order to predict the response for a test observation with \(X\) = 0.6, we will use observations in the range \([0.55,0.65]\). On average, what fraction of the available observations will we use to make the prediction?

A:

For a test observation \(X \in [0.05, 0.95]\), it is trivial (from the example given) that training observations in the range \([X - 0.05, X + 0.05]\) will be used for its prediction, and since \(X \sim \mathcal{U}(0, 1)\), in these cases this will equate to ~10% of the observations. For the remaining ‘edge cases’ where \(X \in [0, 0.05) \cup (0.95, 1]\), I think the question is a bit unclear. I interpret the wording to mean that:

  • if \(X \in [0, 0.05)\), training observations in the range \([0, 0.1]\) will be used
  • if \(X \in (0.95, 1]\), training observations in the range \([0.9, 1]\) will be used

The only difference here is that some of that 0.1 fraction of points are farther from the test observation, because there aren’t any training observations outside of \([0,1]\) that can be used.

Again, since \(X \sim \mathcal{U}(0, 1)\), these cases will also use ~10% of the observations. Across all cases we can therefore see that the fraction of observations used to make each prediction will be 10%.
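
A quick simulation sketch of this interpretation (shifting the 10% window inwards at the edges), which should give roughly 0.10:

set.seed(1)
x_train <- runif(100000)   # training observations

frac_used <- sapply(runif(1000), function(x0) {
  lo <- min(max(x0 - 0.05, 0), 0.9)            # shift the window inwards at the edges
  mean(x_train >= lo & x_train <= lo + 0.1)    # fraction of training obs in the window
})

mean(frac_used)   # ~0.10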


(b) \(p\) = 2

Q: Now suppose that we have a set of observations, each with measurements on \(p\) = 2 features, \(X_1\) and \(X_2\). We assume that \((X_1, X_2)\) are uniformly distributed on \([0,1] \times [0,1]\). We wish to predict a test observation’s response using only observations that are within 10% of the range of \(X_1\) and within 10% of the range of \(X_2\) closest to that test observation. For instance, in order to predict the response for a test observation with \(X_1\) = 0.6 and \(X_2\) = 0.35, we will use observations in the range \([0.55,0.65]\) for \(X_1\) and in the range \([0.3,0.4]\) for \(X_2\). On average, what fraction of the available observations will we use to make the prediction?

A:

This is the same as the previous question, but now the problem is in two dimensions. The question is basically asking: for prediction, if we use observations within 10% of the range of \(X_1\) (as in part (a)), and also within 10% of the range of \(X_2\), what fraction of the training observations are used to make a prediction?

We can apply the same logic as before here. 10% of observations will satisfy the first criterion, and 10% will satisfy the second, but we want the fraction of observations that satisfy both. Since \(X_1\) and \(X_2\) are independent (uniformly distributed on \([0,1] \times [0,1]\)), we can multiply the probabilities of these two events, so the fraction used to make each prediction will be \(0.1^2\) = 1%.
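
A quick simulation for the example test point \((X_1, X_2)\) = (0.6, 0.35) (an interior point, so no edge effects), which should give roughly 0.01:

set.seed(1)
x1 <- runif(100000)
x2 <- runif(100000)

mean(abs(x1 - 0.6) <= 0.05 & abs(x2 - 0.35) <= 0.05)   # ~0.01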


(c) \(p\) = 100

Q: Now suppose that we have a set of observations on \(p\) = 100 features. Again the observations are uniformly distributed on each feature, and again each feature ranges in value from 0 to 1. We wish to predict a test observation’s response using observations within the 10% of each feature’s range that is closest to that test observation. What fraction of the available observations will we use to make the prediction?

A:

Using identical reasoning to part (b), the fraction of observations within 10% of the range of all \(p\) = 100 features is \(0.1^{100}\), which is vanishingly small (essentially zero).


(d) KNN Drawbacks for Large \(p\)

Q: Using your answers to parts (a)-(c), argue that a drawback of KNN when \(p\) is large is that there are very few training observations “near” any given test observation.

A:

This is shown in parts (a)-(c), where we use ‘near’ to mean ‘within 10% of the range’. We see very quickly that, as dimensionality increases, the fraction of training observations that are ‘near’ the test observation \(X\) across all \(p\) dimensions approaches zero: \(\lim_{p\rightarrow\infty}(0.1)^p = 0\)

This means that in wide datasets (where \(p\) is large), the \(K\) nearest neighbours chosen will often not actually be very close, because there simply won’t be training observations that are ‘close’ to the test observation across all \(p\) dimensions.
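
A rough simulation of this effect (my own sketch): the Euclidean distance from a random test observation to its nearest training neighbour, with \(n\) = 1000 uniformly-distributed training observations, for increasing \(p\).

set.seed(1)
nn_dist <- function(p, n = 1000) {
  X  <- matrix(runif(n * p), ncol = p)   # n training observations, p features
  x0 <- runif(p)                         # one test observation
  min(sqrt(colSums((t(X) - x0)^2)))      # distance to its nearest neighbour
}

sapply(c(1, 2, 10, 100), function(p) mean(replicate(20, nn_dist(p))))

The average nearest-neighbour distance grows rapidly with \(p\), even though \(n\) is held fixed.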


(e) \(p\)-Dimensional Hypercube for Prediction (small & large \(p\))

Q: Now suppose that we wish to make a prediction for a test observation by creating a \(p\)-dimensional hypercube centered around the test observation that contains, on average, 10% of the training observations. For \(p\) = 1, 2, and 100, what is the length of each side of the hypercube? Comment on your answer. Note: A hypercube is a generalization of a cube to an arbitrary number of dimensions. When \(p\) = 1, a hypercube is simply a line segment, when \(p\) = 2 it is a square, and when \(p\) = 100 it is a 100-dimensional cube.

A:

For \(p\) = 1, this is the same as part (a): 0.1

For \(p\) = 2, the length must be \(0.1^\frac{1}{2}\) = 0.316. The ‘hyper-cube’ (here, a square) must extend further into each dimension in the training space and use observations further away in order that 10% of the training observations are used for prediction. We saw in part (b) that, if the sides stay of length 0.1 (only going within 10% of the range for both dimensions), only 1% of training observations will be used for prediction.

For \(p\) = 100, the length of the sides must be \(0.1^\frac{1}{p} = 0.1^\frac{1}{100} = 0.977\). In order for 10% of the training observations to fall within the cube, each side of the cube must cover almost the entire range (97.7%) of each of the \(p\) dimensions. These 10% of training observations selected aren’t really ‘close’ at all!

The image below shows a simulated example where \(p\) = 2.

set.seed(1)
test_obs <- data.frame(x1 = 0.25, x2 = 0.58)
unif_sim <- data.frame(x1 = runif(100000, 0, 1), x2 = runif(100000, 0, 1))

unif_sim %>%
  sample_n(10000) %>%
  ggplot(aes(x = x1, y = x2)) + 
  geom_point(alpha = 0.5) + 
  geom_rect(xmin = test_obs$x1 - (0.1^0.5)/2, 
            xmax = test_obs$x1 + (0.1^0.5)/2, 
            ymin = test_obs$x2 - (0.1^0.5)/2, 
            ymax = test_obs$x2 + (0.1^0.5)/2, 
            alpha = 0.01, fill = "deepskyblue3") + 
  geom_point(data = test_obs, col = "red", size = 3) +
  theme_light() + 
  labs(title = "p = 2 hyper-cube, side length = 0.316", 
       subtitle = "10,000 simulated obs: (x1, x2) ~ unif[0,1] x unif[0,1]")

Here, the test observation \(X\) for which we must predict is shown in red.

test_obs
##     x1   x2
## 1 0.25 0.58

We can see that, if a hyper-cube with sides of length \(0.1^\frac{1}{2}\) = 0.316 is created around \(X\), this contains approx 10% of the training observations as required:

unif_sim %>%
  filter(x1 > test_obs$x1 - (0.1^0.5)/2, 
         x1 < test_obs$x1 + (0.1^0.5)/2, 
         x2 > test_obs$x2 - (0.1^0.5)/2, 
         x2 < test_obs$x2 + (0.1^0.5)/2) %>%
  summarize(perc_in_square = n() / 100000)
##   perc_in_square
## 1        0.10019




5. Differences Between LDA & QDA

We now examine the differences between LDA and QDA.


(a) Linear Bayes Decision Boundary

Q: If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? On the test set?

A:

Here we would expect QDA to perform (slightly) better on the training set due to its flexibility, but LDA to perform better on the test set. QDA’s edge on the training data would come from overfitting: fitting spurious non-linearity in the training data that is unlikely to be present in the test set.


(b) Non-Linear Bayes Decision Boundary

Q: If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the training set? On the test set?

A:

This depends somewhat on the nature of the nonlinearity of the Bayes decision boundary.

On the training data, we would expect QDA to out-perform due to its higher flexibility. If the nonlinearity is quadratic (or close to that) then we would expect QDA to perform significantly better.

On the test data, we would probably expect QDA to perform better, although this does depend on the nature of the nonlinearity. Some nonlinear relationships will be poorly approximated by QDA and well approximated by LDA, so this again depends on how well QDA can model the nonlinearity.


(c) Large Sample Sizes

Q: In general, as the sample size \(n\) increases, do we expect the test prediction accuracy of QDA relative to LDA to improve, decline, or be unchanged? Why?

A:

Generally speaking, we would expect the test prediction accuracy of the more flexible model (QDA) to improve relative to the less flexible model (LDA), because with a larger \(n\) the variance of the more flexible classifier decreases, so overfitting becomes less of a concern.


(d) Linear Bayes Decision Boundary

Q: True or False: Even if the Bayes decision boundary for a given problem is linear, we will probably achieve a superior test error rate using QDA rather than LDA because QDA is flexible enough to model a linear decision boundary. Justify your answer.

A:

False. Particularly with a smaller sample size, the variance from using a more flexible method (QDA) will lead to overfitting, yielding a higher test error than LDA. Even with a large sample size I can’t see how QDA would be favourable when we already know that the Bayes decision boundary is linear; at best its test error would approach that of LDA. If this logic were correct, we would simply always favour the most flexible method.




6. Logistic Regression (Manual Estimates)

Suppose we collect data for a group of students in a statistics class with variables:

  • \(X_1\) = hours studied
  • \(X_2\) = undergrad GPA
  • \(Y\) = receive an A

We fit a logistic regression and produce estimated coefficients \(\hat{\beta_0} = -6\), \(\hat{\beta_1} = 0.05\), \(\hat{\beta_2} = 1\).


(a) Estimate \(p\)

Q: Estimate the probability that a student who studies for 40h and has an undergrad GPA of 3.5 gets an A in the class.

A:

From the estimate of equation (4.2): \[ \begin{align*} p(X) &= \frac{e^{\hat{\beta_0} + \hat{\beta_1}X_1 + \hat{\beta_2}X_2}}{1 + e^{\hat{\beta_0} + \hat{\beta_1}X_1 + \hat{\beta_2}X_2}} \\ &= \frac{e^{-6 + (0.05)(40) + (1)(3.5)}}{1 + e^{-6 + (0.05)(40) + (1)(3.5)}} \\ &= \frac{e^{-0.5}}{1 + e^{-0.5}} \end{align*} \]

round(exp(-0.5) / (1 + exp(-0.5)), 3)
## [1] 0.378


(b) Hours for \(p\) = 0.5

Q: How many hours would the student in part (a) need to study to have a 50% chance of getting an A in the class?

A:

\[\begin{align*} & p(X) = \frac{e^{-6 + (0.05)(X_1) + (1)(3.5)}}{1 + e^{-6 + (0.05)(X_1) + (1)(3.5)}} = \frac{e^{0.05 \cdot X_1 - 2.5}}{1 + e^{0.05 \cdot X_1 - 2.5}} = 0.5\\ \implies & e^{0.05 \cdot X_1 - 2.5} = 0.5 + 0.5 \cdot e^{0.05 \cdot X_1 - 2.5} \\ \implies & e^{0.05 \cdot X_1 - 2.5} = 1 \\ \implies & X_1 = \frac{log(1) + 2.5}{0.05} = 50 \end{align*}\]
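
A quick sanity check of this in R, using plogis() (the logistic function \(\frac{e^z}{1 + e^z}\)):

plogis(-6 + 0.05 * 50 + 1 * 3.5)   # 0.5, as required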




7. Bayes Theorem/LDA (Manual Estimates)

Q: Suppose that we wish to predict whether a given stock will issue a dividend this year (“yes” or “no”) based on \(X\), last year’s percent profit. We examine a large number of companies and discover that the mean value of \(X\) for companies that issued a dividend was \(\bar{X}\) = 10, while the mean for those that didn’t was \(\bar{X}\) = 0. In addition, the variance of \(X\) for these two sets of companies was \(\hat{\sigma^2}\) = 36. Finally, 80% of companies issued dividends. Assuming that \(X\) follows a normal distribution, predict the probability that a company will issue a dividend this year given that its percentage profit was \(X\) = 4 last year.

Hint: Recall that the density function for a normal random variable is \(f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \cdot e^{\frac{-(x - \mu)^2}{2\sigma^2}}\). You will need to use Bayes’ theorem.

A:

I’m a little confused by this question, because it seems to be asking us to make calculations using the Bayes classifier, but with all the assumptions we would make for LDA (normality, equal variances), so I’m not sure why it doesn’t name what we are doing here as LDA.

Recall the formula derived above from Bayes theorem:

\[p_k(x) = P_r(Y = k | X = x) = \frac{\pi_k \cdot f_k(x)}{\sum_{l=1}^{K} \pi_l \cdot f_l(x)}\]

We therefore have:

\[f_{yes}(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \cdot e^{\frac{-(x - \mu)^2}{2\sigma^2}} \\ = \frac{1}{\sqrt{(2\pi)(36)}} \cdot e^{\frac{-(x - 10)^2}{(2)(36)}} \\ \implies f_{yes}(4) = \frac{e^{-\frac{1}{2}}}{\sqrt{72\pi}}\]

\[f_{no}(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \cdot e^{\frac{-(x - \mu)^2}{2\sigma^2}} \\ = \frac{1}{\sqrt{(2\pi)(36)}} \cdot e^{\frac{-(x - 0)^2}{(2)(36)}} \\ \implies f_{no}(4) = \frac{e^{-\frac{2}{9}}}{\sqrt{72\pi}}\]

Hence:

\[P_r(Y = yes | X = 4) = \frac{\pi_{yes} \cdot f_{yes}(4)}{\pi_{no} \cdot f_{no}(4) + \pi_{yes} \cdot f_{yes}(4)} \\ = \frac{(0.8)(\frac{e^{-\frac{1}{2}}}{\sqrt{72\pi}})}{(0.2)(\frac{e^{-\frac{2}{9}}}{\sqrt{72\pi}}) + (0.8)(\frac{e^{-\frac{1}{2}}}{\sqrt{72\pi}})}\]

round((0.8 * exp(-0.5) / sqrt(72*pi)) / ((0.2 * exp(-2/9) / sqrt(72*pi)) + (0.8 * exp(-0.5) / sqrt(72*pi))), 3)
## [1] 0.752
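
Equivalently, the same value can be obtained with dnorm() directly (sd = \(\sqrt{36}\) = 6), which avoids writing out the density by hand:

round((0.8 * dnorm(4, mean = 10, sd = 6)) / 
        (0.8 * dnorm(4, mean = 10, sd = 6) + 0.2 * dnorm(4, mean = 0, sd = 6)), 3)   # same result: 0.752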




8. KNN Training Error

Q: Suppose that we take a data set, divide it into equally-sized training and test sets, and then try out two different classification procedures. First we use logistic regression and get an error rate of 20% on the training data and 30% on the test data. Next we use 1-nearest neighbors (i.e. \(K\) = 1) and get an average error rate (averaged over both test and training data sets) of 18%. Based on these results, which method should we prefer to use for classification of new observations? Why?

A:

We would prefer Logistic Regression.

The training error for KNN can be thought of as the error that occurs when the training data is input as the test set. When \(K\) = 1, this means that when KNN makes a prediction on a test observation, it will look for the single closest observation available in the training data (which will be itself). It will then assign that training observation’s response value as the prediction for the test observation.

This will always have zero error, irrespective of the dataset or whether classification/regression is being used. (Note: a necessary assumption has been made, namely that the observations are unique, i.e. there are no cases of \(\ge\) 2 observations with identical predictors but different response values. This is the only problematic case - if both training observations are 0 distance from the test observation, what should the predicted response be?)

This means that, if KNN (with \(K\) = 1) averages an 18% error across train & test, its training error will be 0%, so its test error must be 2 * 18% = 36%, which is worse than the 30% test error of logistic regression. For this reason I would prefer to use logistic regression - the classifier generalizes better to new data, which is all we care about.
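
As a quick illustration of the zero-training-error claim, here is a toy sketch with made-up data (the predictor values are unique, so there are no distance ties):

set.seed(1)
toy <- data.frame(x = runif(200))                         # unique predictor values
toy$y <- factor(sample(c("Down", "Up"), 200, replace = TRUE))

pred_train <- knn(train = toy["x"], test = toy["x"], cl = toy$y, k = 1)
mean(pred_train == toy$y)   # 1, i.e. a training error of 0%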




9. Odds vs Probability

This problem has to do with odds.

Here’s a quick bonus graph showing the relationship between odds (\(\frac{p(X)}{1 - p(X)}\)) and probability (\(p(X)\)), where we can see:

  • odds = 1 when \(p(X)\) = 0.5 (odds < 1 for \(p(X)\) < 0.5, odds > 1 for \(p(X)\) > 0.5)
  • the odds grow without bound as \(p(X) \rightarrow 1\)

The graph has been truncated at y = 20, since \(\lim_{p\to 1^{-}} \frac{p}{1-p} = \infty\)

data.frame(prob = seq(0, 0.99, 0.01)) %>%
  mutate(odds = prob / (1 - prob)) %>%
  ggplot(aes(x = prob, y = odds)) + 
  geom_point() + 
  geom_line() + 
  geom_vline(xintercept = 0.5, col = "grey30") +
  geom_hline(yintercept = 1, col = "grey30") + 
  coord_cartesian(ylim = c(0, 20)) + 
  labs(x = "p", 
       y = "Odds: p / (1 - p)", 
       title = "Odds vs Probability Relationship")


(a) Odds \(\rightarrow\) Probability

Q: On average, what fraction of people with an odds of 0.37 of defaulting on their credit card payment will in fact default?

A:

\[\begin{align*} & \frac{p(X)}{1 - p(X)} = 0.37 \\ \implies & p(X) = 0.37 - 0.37 \cdot p(X) \\ \implies & p(X) = \frac{0.37}{1.37} \end{align*}\]

round(0.37/1.37, 3)
## [1] 0.27


(b) Probability \(\rightarrow\) Odds

Q: Suppose that an individual has a 16% chance of defaulting on her credit card payment. What are the odds that she will default?

A:

\[p(X) = 0.16 \implies \frac{p(X)}{1 - p(X)} = \frac{0.16}{1 - 0.16}\]

round(0.16/0.84, 3)
## [1] 0.19




10. APPLIED: The Weekly Dataset (Logistic, LDA, QDA, KNN)

This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

glimpse(Weekly)
## Rows: 1,089
## Columns: 9
## $ Year      <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 199...
## $ Lag1      <dbl> 0.816, -0.270, -2.576, 3.514, 0.712, 1.178, -1.372, 0.807...
## $ Lag2      <dbl> 1.572, 0.816, -0.270, -2.576, 3.514, 0.712, 1.178, -1.372...
## $ Lag3      <dbl> -3.936, 1.572, 0.816, -0.270, -2.576, 3.514, 0.712, 1.178...
## $ Lag4      <dbl> -0.229, -3.936, 1.572, 0.816, -0.270, -2.576, 3.514, 0.71...
## $ Lag5      <dbl> -3.484, -0.229, -3.936, 1.572, 0.816, -0.270, -2.576, 3.5...
## $ Volume    <dbl> 0.1549760, 0.1485740, 0.1598375, 0.1616300, 0.1537280, 0....
## $ Today     <dbl> -0.270, -2.576, 3.514, 0.712, 1.178, -1.372, 0.807, 0.041...
## $ Direction <fct> Down, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up,...

The variables are:

  • Year - the year the observation was recorded
  • Lag1 - Lag5 - percentage return for 1 - 5 weeks previous
  • Volume - average number of daily shares traded, in billions
  • Today - percentage return for the current week
  • Direction - factor (Down/Up): whether the market had a negative or positive return this week


(a) Data Summary

Q: Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

A:

An initial glance at the numeric variables of the dataset:

pairs(Weekly[ ,-9])

abs(cor(Weekly[ ,-9]))
##              Year        Lag1       Lag2       Lag3        Lag4        Lag5
## Year   1.00000000 0.032289274 0.03339001 0.03000649 0.031127923 0.030519101
## Lag1   0.03228927 1.000000000 0.07485305 0.05863568 0.071273876 0.008183096
## Lag2   0.03339001 0.074853051 1.00000000 0.07572091 0.058381535 0.072499482
## Lag3   0.03000649 0.058635682 0.07572091 1.00000000 0.075395865 0.060657175
## Lag4   0.03112792 0.071273876 0.05838153 0.07539587 1.000000000 0.075675027
## Lag5   0.03051910 0.008183096 0.07249948 0.06065717 0.075675027 1.000000000
## Volume 0.84194162 0.064951313 0.08551314 0.06928771 0.061074617 0.058517414
## Today  0.03245989 0.075031842 0.05916672 0.07124364 0.007825873 0.011012698
##            Volume       Today
## Year   0.84194162 0.032459894
## Lag1   0.06495131 0.075031842
## Lag2   0.08551314 0.059166717
## Lag3   0.06928771 0.071243639
## Lag4   0.06107462 0.007825873
## Lag5   0.05851741 0.011012698
## Volume 1.00000000 0.033077783
## Today  0.03307778 1.000000000

As we would expect with stock market data, there are no obvious strong relationships between the Lag variables. However, there do appear to be some interesting trends over time. I create the Week variable below, allowing for easier plotting of trends, since there is a chronology to the rows that is not shown fully through the Year variable.

I first do a quick sense-check that the rows are in the correct order, based on the definition of Today and Lag1:

Weekly %>%
  filter(lead(Lag1) != Today) %>%
  nrow()
## [1] 0

Since there are no rows out of order, the dataset appears to be correctly ordered in ascending weeks, so I create Week (basically a row counter):

Weekly$Week <- 1:nrow(Weekly)

Looking at Volume over time, there has been a significant increase in the volume of shares traded since the 1990s. This appears to have peaked around 2009, starting to decrease in 2010. It would be interesting to see the S&P 500 stats since then.

year_breaks <- Weekly %>%
  group_by(Year) %>%
  summarize(Week = min(Week))

ggplot(Weekly, aes(x = Week, y = Volume)) + 
  geom_line() + 
  geom_smooth() + 
  scale_x_continuous(breaks = year_breaks$Week, minor_breaks = NULL, labels = year_breaks$Year) + 
  labs(title = "Average Daily Shares Traded vs Time", 
       x = "Time") + 
  theme_light()

Here is Direction over time, which is less interesting. There appear to only be 4 years in which >= 50% of the weeks didn’t see a positive return (2000, 2001, 2002, 2008).

ggplot(Weekly, aes(x = Year, fill = Direction)) + 
  geom_bar(position = "fill") +
  geom_hline(yintercept = 0.5, col = "grey45") +
  scale_x_continuous(breaks = seq(1990, 2010), minor_breaks = NULL) +
  scale_y_continuous(labels = scales::percent_format()) +
  theme_light() + 
  theme(axis.title.y = element_blank(), 
        legend.position = "bottom") + 
  ggtitle("% of Up/Down Weeks vs Time")

The split of the weeks into Down & Up can be seen in the table below. We could get a classifier with 55.56% accuracy simply by predicting the S&P 500 return will be positive every week.

prop.table(table(Weekly$Direction))
## 
##      Down        Up 
## 0.4444444 0.5555556

We can also see that the market seems to go through periods of higher variance/instability. Crashes (e.g. Sept. 2008) stand out here.

ggplot(Weekly, aes(x = Week, y = Today / 100)) + 
  geom_line() + 
  scale_x_continuous(breaks = year_breaks$Week, minor_breaks = NULL, labels = year_breaks$Year) + 
  scale_y_continuous(labels = scales::percent_format(), breaks = seq(-0.2, 0.2, 0.05)) + 
  geom_hline(yintercept = 0, col = "grey55") +
  theme_light() + 
  labs(title = "Weekly Percentage Return vs Time", 
       x = "Time", 
       y = "Percentage Return")


(b) Logistic Regression (predict market Direction)

Q: Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

A:

Lag2 appears to be the only statistically significant predictor:

glm_dir <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, 
               data = Weekly, 
               family = "binomial")

summary(glm_dir)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = "binomial", data = Weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Note that the z-statistic (z value) is calculated in the same way as the t-statistic in linear regression: \(Z = \frac{\hat{\beta_j}}{SE(\hat{\beta_j})}\), and (again, as in linear regression) a large absolute value indicates evidence against the null hypothesis \(H_{0}: \beta_{j} = 0\).
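
As a quick check of this (a small verification sketch on the model fitted above):

coefs <- summary(glm_dir)$coefficients
all.equal(coefs[, "z value"], coefs[, "Estimate"] / coefs[, "Std. Error"])   # TRUE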


(c) Confusion Matrix

Q: Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

A:

The confusion matrix is shown below. I use caret::confusionMatrix(), as it has various confusion matrix statistics built in to its output.

predicted <- factor(ifelse(predict(glm_dir, type = "response") < 0.5, "Down", "Up"))

confusionMatrix(predicted, Weekly$Direction, positive = "Up")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down  Up
##       Down   54  48
##       Up    430 557
##                                          
##                Accuracy : 0.5611         
##                  95% CI : (0.531, 0.5908)
##     No Information Rate : 0.5556         
##     P-Value [Acc > NIR] : 0.369          
##                                          
##                   Kappa : 0.035          
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.9207         
##             Specificity : 0.1116         
##          Pos Pred Value : 0.5643         
##          Neg Pred Value : 0.5294         
##              Prevalence : 0.5556         
##          Detection Rate : 0.5115         
##    Detection Prevalence : 0.9063         
##       Balanced Accuracy : 0.5161         
##                                          
##        'Positive' Class : Up             
## 

This is just over the baseline accuracy (55.56%) achieved by a naive classifier (that predicts Up every time). In fact, this is almost the strategy of the logistic regression model:

prop.table(table(predicted))
## predicted
##       Down         Up 
## 0.09366391 0.90633609

This is reflected in the very poor specificity (it does not predict the negative class well).

Note also that we are dealing with training accuracy here, so this marginal accuracy improvement over the baseline is not interesting.


(d) train and test - Logistic Regression

Q: Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

A:

I create train and test, corresponding to the two time periods given in the question.

The confusion matrix is for the test predictions this time.

train <- Weekly[Weekly$Year <= 2008, ]
test <- Weekly[Weekly$Year > 2008, ]

glm_dir <- glm(Direction ~ Lag2, 
               data = train, 
               family = "binomial")

predicted <- factor(ifelse(predict(glm_dir, newdata = test, type = "response") < 0.5, "Down", "Up"))

confusionMatrix(predicted, test$Direction, positive = "Up")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    9  5
##       Up     34 56
##                                          
##                Accuracy : 0.625          
##                  95% CI : (0.5247, 0.718)
##     No Information Rate : 0.5865         
##     P-Value [Acc > NIR] : 0.2439         
##                                          
##                   Kappa : 0.1414         
##                                          
##  Mcnemar's Test P-Value : 7.34e-06       
##                                          
##             Sensitivity : 0.9180         
##             Specificity : 0.2093         
##          Pos Pred Value : 0.6222         
##          Neg Pred Value : 0.6429         
##              Prevalence : 0.5865         
##          Detection Rate : 0.5385         
##    Detection Prevalence : 0.8654         
##       Balanced Accuracy : 0.5637         
##                                          
##        'Positive' Class : Up             
## 

Here we get an Accuracy of 0.625.

The confusionMatrix() function provides a lot of other useful statistics. For example, No Information Rate : 0.5865 tells us that the largest class (Up) is 58.65% of the test observations, and hence this is our baseline for a classifier.

Clearly Accuracy : 0.625 > 0.5865, which is positive. However, our test dataset is relatively small so this might not be a meaningful improvement.

We are provided with a p-value for a one-sided test to see whether the accuracy is better than the “no information rate”. P-Value [Acc > NIR] : 0.2439 > 0.05 \(\implies\) no significant evidence that our classifier is better than the baseline strategy. Predicting stock movements is hard - who would’ve thought?
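
For reference, I believe this p-value comes from a one-sided exact binomial test of the accuracy against the no-information rate, which can (assuming caret uses binom.test() internally) be reproduced manually:

binom.test(x = sum(predicted == test$Direction),   # 65 correct test predictions
           n = nrow(test),                         # 104 test observations
           p = mean(test$Direction == "Up"),       # no information rate = 0.5865
           alternative = "greater")$p.value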


(e) train and test - LDA

Q: Repeat (d) using LDA.

A:

lda_dir <- lda(Direction ~ Lag2, data = train)


predicted_lda <- predict(lda_dir, newdata = test)

confusionMatrix(data = predicted_lda$class,
                reference = test$Direction, 
                positive = "Up")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    9  5
##       Up     34 56
##                                          
##                Accuracy : 0.625          
##                  95% CI : (0.5247, 0.718)
##     No Information Rate : 0.5865         
##     P-Value [Acc > NIR] : 0.2439         
##                                          
##                   Kappa : 0.1414         
##                                          
##  Mcnemar's Test P-Value : 7.34e-06       
##                                          
##             Sensitivity : 0.9180         
##             Specificity : 0.2093         
##          Pos Pred Value : 0.6222         
##          Neg Pred Value : 0.6429         
##              Prevalence : 0.5865         
##          Detection Rate : 0.5385         
##    Detection Prevalence : 0.8654         
##       Balanced Accuracy : 0.5637         
##                                          
##        'Positive' Class : Up             
## 

Here we get an Accuracy of 0.625. Note that, as before, we have P-Value [Acc > NIR] : 0.2439 > 0.05, so whilst the accuracy of the classifier is 0.625 > 0.5865, the test sample size is not large enough for this increase over the baseline to be meaningful.

predict.lda()$posterior gives a matrix of posterior probability predictions, with one column per response class. predict.lda()$class gives the class prediction for each observation (the class with the greatest probability). I check this below:

identical(as.character(predicted_lda$class), 
          as.character(ifelse(predicted_lda$posterior[ ,2] < 0.5, "Down", "Up")))
## [1] TRUE


(f) train and test - QDA

Q: Repeat (d) using QDA.

A:

qda_dir <- qda(Direction ~ Lag2, data = train)


predicted_qda <- predict(qda_dir, newdata = test)

confusionMatrix(data = predicted_qda$class, 
                reference = test$Direction, 
                positive = "Up")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    0  0
##       Up     43 61
##                                           
##                Accuracy : 0.5865          
##                  95% CI : (0.4858, 0.6823)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.5419          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 1.504e-10       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.5865          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.5865          
##          Detection Rate : 0.5865          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : Up              
## 

Here we get an Accuracy of 0.5865.

Note that the QDA classifier just predicts Up for every test observation - it behaves identically to the naive classifier on this dataset, with a sensitivity of 1 and a specificity of 0.


(g) train and test - KNN (K = 1)

Q: Repeat (d) using KNN with K = 1.

A:

Usually, as a pre-processing step for KNN (with multiple predictors on different scales), we would want to standardize the predictors (\(x_{new} = \frac{x - \mu}{\sigma}\)) so that each \(x_{new}\) has a mean of 0 and standard deviation of 1. In this case, however, there is only 1 predictor (Lag2), so the nearest neighbour would not be affected by this.
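
For reference, here is a minimal sketch of what that standardization step could look like with multiple predictors (scaling both sets using the training means/SDs); it isn’t actually needed or used here:

train_X <- scale(train[, c("Lag1", "Lag2")])   # centre/scale using training data only
test_X  <- scale(test[, c("Lag1", "Lag2")],
                 center = attr(train_X, "scaled:center"), 
                 scale = attr(train_X, "scaled:scale"))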

Note also that there is some element of randomness with the KNN classifier. Take the following test observation that requires prediction:

test[100, "Lag2"]
## [1] 0.043
# test[75, "Lag2"] # another one here

Notice that there are two train observations at identical distance (w.r.t. Lag2), but with different Direction values:

train[c(10, 808), c("Lag2", "Direction")]
##      Lag2 Direction
## 10  0.041      Down
## 808 0.041        Up

In this case, the KNN probability will be 0.5:

set.seed(1)
predicted_knn <- knn(train = data.frame(Lag2 = train$Lag2), 
                  test = data.frame(Lag2 = test$Lag2), 
                  cl = train$Direction, 
                  k = 1, 
                  prob = T)

attr(predicted_knn, "prob")[100]
## [1] 0.5

However, the classifier must make a prediction, which will just be chosen at random:

predicted_knn[100]
## [1] Down
## Levels: Down Up

See the confusion matrix below:

confusionMatrix(data = predicted_knn, 
                reference = test$Direction, 
                positive = "Up")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down   21 30
##       Up     22 31
##                                           
##                Accuracy : 0.5             
##                  95% CI : (0.4003, 0.5997)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.9700          
##                                           
##                   Kappa : -0.0033         
##                                           
##  Mcnemar's Test P-Value : 0.3317          
##                                           
##             Sensitivity : 0.5082          
##             Specificity : 0.4884          
##          Pos Pred Value : 0.5849          
##          Neg Pred Value : 0.4118          
##              Prevalence : 0.5865          
##          Detection Rate : 0.2981          
##    Detection Prevalence : 0.5096          
##       Balanced Accuracy : 0.4983          
##                                           
##        'Positive' Class : Up              
## 

Here we get an accuracy of 0.5, which is again worse than the baseline.


(h) Best Performing Classifier?

Q: Which of these methods appears to provide the best results on this data?

A:

Taking the target metric as the accuracy of the classifier: LDA & Logistic Regression get the same test accuracy of 0.625, so these two are tied.


(i) Experimenting (combined predictors, interactions, transformations, etc.)

Q: Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

A:

KNN - selecting best K (using cross-validation):

train$Today <- NULL

ctrl <- trainControl(method = "repeatedcv",
                     number = 5,
                     repeats = 5)

set.seed(111)

knn_train <- train(y = train$Direction,
                   x = train[ ,-8],
                   method = "knn",
                   metric = "Accuracy",
                   preProcess = c("center", "scale"),
                   tuneGrid = expand.grid(k = seq(1, 50, 2)),
                   trControl = ctrl)

caret::varImp(knn_train)
## ROC curve variable importance
## 
##        Importance
## Lag1      100.000
## Lag2       77.256
## Lag5       64.309
## Year       45.659
## Volume     43.735
## Week       42.513
## Lag4        4.578
## Lag3        0.000
knn_train
## k-Nearest Neighbors 
## 
## 985 samples
##   8 predictor
##   2 classes: 'Down', 'Up' 
## 
## Pre-processing: centered (8), scaled (8) 
## Resampling: Cross-Validated (5 fold, repeated 5 times) 
## Summary of sample sizes: 788, 788, 788, 788, 788, 787, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa       
##    1  0.4947996  -0.020756148
##    3  0.5232477   0.031373990
##    5  0.5230292   0.028769372
##    7  0.5372435   0.053353378
##    9  0.5372415   0.049612758
##   11  0.5343834   0.040819116
##   13  0.5346081   0.037598994
##   15  0.5368437   0.040363712
##   17  0.5317675   0.026161027
##   19  0.5301555   0.021608137
##   21  0.5370622   0.033488872
##   23  0.5330064   0.024861594
##   25  0.5401182   0.036317004
##   27  0.5376693   0.029998090
##   29  0.5321890   0.015868622
##   31  0.5338310   0.018687256
##   33  0.5360563   0.021431639
##   35  0.5322138   0.012669022
##   37  0.5283538   0.002535516
##   39  0.5330239   0.011432961
##   41  0.5289589   0.002400751
##   43  0.5295618   0.003081904
##   45  0.5299700   0.003587628
##   47  0.5303854   0.003745293
##   49  0.5320056   0.005552184
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 25.
ggplot(knn_train) +
  geom_smooth() +
  theme_light() +
  scale_y_continuous(labels = scales::percent_format()) +
  ggtitle("KNN - 'K' Selection (5-repeated 5-fold cross-validation)")

caret automatically chooses the best value for k. Evaluating the performance of this new model on test:

knn_pred <- predict(knn_train, newdata = test)

confusionMatrix(data = knn_pred, 
                reference = test$Direction, 
                positive = "Up")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down   19 20
##       Up     24 41
##                                           
##                Accuracy : 0.5769          
##                  95% CI : (0.4761, 0.6732)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.6193          
##                                           
##                   Kappa : 0.1156          
##                                           
##  Mcnemar's Test P-Value : 0.6511          
##                                           
##             Sensitivity : 0.6721          
##             Specificity : 0.4419          
##          Pos Pred Value : 0.6308          
##          Neg Pred Value : 0.4872          
##              Prevalence : 0.5865          
##          Detection Rate : 0.3942          
##    Detection Prevalence : 0.6250          
##       Balanced Accuracy : 0.5570          
##                                           
##        'Positive' Class : Up              
## 

It appears that performance increased compared to the K = 1 version. However, we are still below the performance of the baseline approach. Similar (worse) results can be seen by adding other predictors, so I leave KNN for now in favour of other methods.


New Features:

Since logistic regression produced one of the best classifiers so far, it could be appropriate to call the best_predictor() function I wrote in previous chapters (which applies logistic regression for a categorical response variable), in the hopes that a potentially useful transformation might be identified.

best_predictor <- function(dataframe, response) {
  
  if (sum(sapply(dataframe, function(x) {is.numeric(x) | is.factor(x)})) < ncol(dataframe)) {
    stop("Make sure that all variables are of class numeric/factor!")
  }
  
  # pre-allocate vectors
  varname <- c()
  vartype <- c()
  R2 <- c()
  R2_log <- c()
  R2_quad <- c()
  AIC <- c()
  AIC_log <- c()
  AIC_quad <- c()
  y <- dataframe[ ,response]
  
  # # # # # NUMERIC RESPONSE # # # # #
  if (is.numeric(y)) {
    
    for (i in 1:ncol(dataframe)) {
      
      x <- dataframe[ ,i]
      varname[i] <- names(dataframe)[i]
      
      if (class(x) %in% c("numeric", "integer")) {
        vartype[i] <- "numeric"
      } else {
        vartype[i] <- "categorical"
      }
      
      if (!identical(y, x)) {
        
        # linear: y ~ x
        R2[i] <- summary(lm(y ~ x))$r.squared 
        
        # log-transform: y ~ log(x)
        if (is.numeric(x)) { 
          if (min(x) <= 0) { # if y ~ log(x) for min(x) <= 0, do y ~ log(x + abs(min(x)) + 1)
            R2_log[i] <- summary(lm(y ~ log(x + abs(min(x)) + 1)))$r.squared
          } else {
            R2_log[i] <- summary(lm(y ~ log(x)))$r.squared
          }
        } else {
          R2_log[i] <- NA
        }
        
        # quadratic: y ~ x + x^2
        if (is.numeric(x)) { 
          R2_quad[i] <- summary(lm(y ~ x + I(x^2)))$r.squared
        } else {
          R2_quad[i] <- NA
        }
        
      } else {
        R2[i] <- NA
        R2_log[i] <- NA
        R2_quad[i] <- NA
      }
    }
    
    print(paste("Response variable:", response))
    
    data.frame(varname, 
               vartype, 
               R2 = round(R2, 3), 
               R2_log = round(R2_log, 3), 
               R2_quad = round(R2_quad, 3)) %>%
      mutate(max_R2 = pmax(R2, R2_log, R2_quad, na.rm = T)) %>%
      arrange(desc(max_R2))
    
    
    # # # # # CATEGORICAL RESPONSE # # # # #
  } else {
    
    for (i in 1:ncol(dataframe)) {
      
      x <- dataframe[ ,i]
      varname[i] <- names(dataframe)[i]
      
      if (class(x) %in% c("numeric", "integer")) {
        vartype[i] <- "numeric"
      } else {
        vartype[i] <- "categorical"
      }
      
      if (!identical(y, x)) {
        
        # linear: y ~ x
        AIC[i] <- summary(glm(y ~ x, family = "binomial"))$aic 
        
        # log-transform: y ~ log(x)
        if (is.numeric(x)) { 
          if (min(x) <= 0) { # if y ~ log(x) for min(x) <= 0, do y ~ log(x + abs(min(x)) + 1)
            AIC_log[i] <- summary(glm(y ~ log(x + abs(min(x)) + 1), family = "binomial"))$aic
          } else {
            AIC_log[i] <- summary(glm(y ~ log(x), family = "binomial"))$aic
          }
        } else {
          AIC_log[i] <- NA
        }
        
        # quadratic: y ~ x + x^2
        if (is.numeric(x)) { 
          AIC_quad[i] <- summary(glm(y ~ x + I(x^2), family = "binomial"))$aic
        } else {
          AIC_quad[i] <- NA
        }
        
      } else {
        AIC[i] <- NA
        AIC_log[i] <- NA
        AIC_quad[i] <- NA
      }
    }
    
    print(paste("Response variable:", response))
    
    data.frame(varname, 
               vartype, 
               AIC = round(AIC, 3), 
               AIC_log = round(AIC_log, 3), 
               AIC_quad = round(AIC_quad, 3)) %>%
      mutate(min_AIC = pmin(AIC, AIC_log, AIC_quad, na.rm = T)) %>%
      arrange(min_AIC)
  } 
}

However, due to the nature of what we are trying to predict, along with the relatively small sample size, it seems highly likely that the apparent usefulness of any transformations will be due to overfitting to the training data.

To illustrate this, I call best_predictor(train, "Direction"), but I also add 10 noise variables (junk_1, …, junk_10). We can see that it’s not at all clear that any of the variables are genuinely useful: the AIC differences are tiny, and several of the junk variables rank above the real Lag and Volume predictors:

train$junk_1 <- rnorm(nrow(train))
train$junk_2 <- runif(nrow(train))
train$junk_3 <- factor(as.numeric(rnorm(nrow(train)) > 0))
train$junk_4 <- rnorm(nrow(train))
train$junk_5 <- runif(nrow(train))
train$junk_6 <- factor(as.numeric(rnorm(nrow(train)) > 0))
train$junk_7 <- rnorm(nrow(train))
train$junk_8 <- runif(nrow(train))
train$junk_9 <- factor(as.numeric(rnorm(nrow(train)) > 0))
train$junk_10 <- rnorm(nrow(train))

best_predictor(train, "Direction")
## [1] "Response variable: Direction"
##      varname     vartype      AIC  AIC_log AIC_quad  min_AIC
## 1       Lag1     numeric 1354.446 1354.199 1356.442 1354.199
## 2       Lag2     numeric 1354.543 1355.148 1355.435 1354.543
## 3     junk_6 categorical 1355.795       NA       NA 1355.795
## 4     junk_9 categorical 1356.462       NA       NA 1356.462
## 5     Volume     numeric 1356.838 1356.751 1358.833 1356.751
## 6     junk_4     numeric 1357.707 1358.426 1356.864 1356.864
## 7       Year     numeric 1357.111 1357.112 1358.772 1357.111
## 8       Week     numeric 1357.260 1358.273 1359.076 1357.260
## 9    junk_10     numeric 1357.319 1357.314 1359.128 1357.314
## 10    junk_2     numeric 1357.358 1357.721 1359.071 1357.358
## 11      Lag5     numeric 1357.365 1358.527 1358.188 1357.365
## 12    junk_7     numeric 1357.460 1357.617 1359.344 1357.460
## 13    junk_5     numeric 1358.566 1357.927 1359.136 1357.927
## 14    junk_1     numeric 1358.008 1357.938 1359.327 1357.938
## 15    junk_8     numeric 1357.945 1358.487 1359.603 1357.945
## 16      Lag3     numeric 1358.354 1358.038 1360.286 1358.038
## 17      Lag4     numeric 1358.497 1358.685 1359.007 1358.497
## 18    junk_3 categorical 1358.700       NA       NA 1358.700
## 19 Direction categorical       NA       NA       NA       NA

While I don’t think standard transformations of the current features will be fruitful, I have a few ideas for features/modifications that could be useful:

  • Interaction between Lag1 & Lag2 (the two predictors most likely to be useful)
  • Lag_avg_abs - The total absolute value of Lag1 to Lag5 (proportional to the average; a measure of the absolute size of recent market fluctuations - high values mean high-variance periods)
  • Lag_pos_cnt - A count of how many of Lag1 - Lag5 were positive days (has the market trended upwards recently?)
  • Quarter - A factor variable, created by binning the Week variable into the four quarters of the year
  • Removing all current predictors aside from Lag1 & Lag2 (these are likely more useful than the older Lag variables)

I make these changes below to both train and test. Here is the resulting train dataset:

train <- train %>%
  mutate(Lag_avg_abs = abs(Lag1) + abs(Lag2) + abs(Lag3) + abs(Lag4) + abs(Lag5), 
         Lag_pos_cnt = (Lag1 > 0) + (Lag2 > 0) + (Lag3 > 0) + (Lag4 > 0) + (Lag5 > 0)) %>%
  group_by(Year) %>%
  mutate(Week_of_year = row_number()) %>%
  ungroup() %>%
  mutate(Week_of_year = case_when(Year == 1990 ~ as.numeric(Week_of_year + 5), 
                                  TRUE ~ as.numeric(Week_of_year))) %>% # data appears to start 5wks into 1990
  mutate(Quarter = factor(case_when(Week_of_year <= 13 ~ "Q1", 
                                    Week_of_year <= 26 ~ "Q2", 
                                    Week_of_year <= 39 ~ "Q3", 
                                    TRUE ~ "Q4"))) %>%
  select(Direction, Lag1, Lag2, Lag_avg_abs, Lag_pos_cnt, Quarter)


test <- test %>%
  mutate(Lag_avg_abs = abs(Lag1) + abs(Lag2) + abs(Lag3) + abs(Lag4) + abs(Lag5), 
         Lag_pos_cnt = (Lag1 > 0) + (Lag2 > 0) + (Lag3 > 0) + (Lag4 > 0) + (Lag5 > 0)) %>%
  group_by(Year) %>%
  mutate(Week_of_year = row_number()) %>%
  ungroup() %>%
  mutate(Quarter = factor(case_when(Week_of_year <= 13 ~ "Q1", 
                                    Week_of_year <= 26 ~ "Q2", 
                                    Week_of_year <= 39 ~ "Q3", 
                                    TRUE ~ "Q4"))) %>%
  select(Direction, Lag1, Lag2, Lag_avg_abs, Lag_pos_cnt, Quarter)


glimpse(train)
## Rows: 985
## Columns: 6
## $ Direction   <fct> Down, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, U...
## $ Lag1        <dbl> 0.816, -0.270, -2.576, 3.514, 0.712, 1.178, -1.372, 0.8...
## $ Lag2        <dbl> 1.572, 0.816, -0.270, -2.576, 3.514, 0.712, 1.178, -1.3...
## $ Lag_avg_abs <dbl> 10.037, 6.823, 9.170, 8.748, 7.888, 8.250, 9.352, 7.583...
## $ Lag_pos_cnt <int> 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3...
## $ Quarter     <fct> Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q2, Q2, Q2, Q2, Q2, Q2,...

Here I test the model containing all of these predictors, along with an interaction term between Lag1 and Lag2:

glm_dir_2 <- glm(Direction ~ . + Lag1:Lag2, 
               data = train, 
               family = "binomial")


predicted <- factor(ifelse(predict(glm_dir_2, newdata = test, type = "response") < 0.5, "Down", "Up"))

confusionMatrix(predicted, test$Direction, positive = "Up")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down   18 16
##       Up     25 45
##                                           
##                Accuracy : 0.6058          
##                  95% CI : (0.5051, 0.7002)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.3847          
##                                           
##                   Kappa : 0.1613          
##                                           
##  Mcnemar's Test P-Value : 0.2115          
##                                           
##             Sensitivity : 0.7377          
##             Specificity : 0.4186          
##          Pos Pred Value : 0.6429          
##          Neg Pred Value : 0.5294          
##              Prevalence : 0.5865          
##          Detection Rate : 0.4327          
##    Detection Prevalence : 0.6731          
##       Balanced Accuracy : 0.5782          
##                                           
##        'Positive' Class : Up              
## 

The classifier performed better on the test data than the baseline no-information rate (60.58% accuracy vs. 58.65%), but the p-value of 0.3847 suggests this improvement could easily be due to chance, and it is slightly worse than the simple LDA & Logistic classifiers, which scored 62.5%. Given the small test sample, I am skeptical as to whether any of these classifiers would continue to perform above the baseline for data beyond 2010.
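
As a quick sanity-check of that skepticism, the ‘P-Value [Acc > NIR]’ reported above can (to my understanding of caret’s implementation) be reproduced with a one-sided binomial test against the no-information rate. A minimal sketch using the objects created above:

nir <- max(prop.table(table(test$Direction))) # no-information rate (~0.5865)
n_correct <- sum(predicted == test$Direction) # 18 + 45 = 63 correct predictions

binom.test(n_correct, n = nrow(test), p = nir, alternative = "greater")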




11. APPLIED: The Auto Dataset (LDA, QDA, Logistic, KNN)

In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

glimpse(Auto)
## Rows: 392
## Columns: 9
## $ mpg          <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14...
## $ cylinders    <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, ...
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383,...
## $ horsepower   <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170,...
## $ weight       <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, ...
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8....
## $ year         <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70...
## $ origin       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, ...
## $ name         <fct> chevrolet chevelle malibu, buick skylark 320, plymouth...


(a) New Variable: mpg01

Q: Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.

A:

I’ll just add the mpg01 variable to the existing Auto dataset.

Auto$mpg01 <- factor(as.numeric(Auto$mpg > median(Auto$mpg)))

table(Auto$mpg01)
## 
##   0   1 
## 196 196


(b) Visualizing Relationships with mpg01

Q: Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

A:

Excluding name (categorical with 301 unique values) and mpg (used to create mpg01), and converting origin to a factor:

Auto$name <- NULL
Auto$mpg <- NULL

Auto$origin <- factor(Auto$origin, labels = c("American", "European", "Japanese"))

glimpse(Auto)
## Rows: 392
## Columns: 8
## $ cylinders    <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, ...
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383,...
## $ horsepower   <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170,...
## $ weight       <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, ...
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8....
## $ year         <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70...
## $ origin       <fct> American, American, American, American, American, Amer...
## $ mpg01        <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...

I chose a jitter plot (scatter plot with some random noise added) for the relationship between mpg01 & cylinders, since cylinders is discrete with only a handful of unique values, so a box plot wasn’t very informative.

g1 <- ggplot(Auto, aes(x = mpg01, y = cylinders, col = mpg01)) + 
  geom_jitter() + 
  theme(legend.position = "none") + 
  ggtitle("Cylinders vs mpg01 - Jitter Plot") 

g2 <- ggplot(Auto, aes(x = mpg01, y = displacement, fill = mpg01)) + 
  geom_boxplot() + 
  theme(legend.position = "none") + 
  ggtitle("Displacement vs mpg01 - Boxplot")

g3 <- ggplot(Auto, aes(x = mpg01, y = horsepower, fill = mpg01)) + 
  geom_boxplot() + 
  theme(legend.position = "none") + 
  ggtitle("Horsepower vs mpg01 - Boxplot")

g4 <- ggplot(Auto, aes(x = mpg01, y = weight, fill = mpg01)) + 
  geom_boxplot() + 
  theme(legend.position = "none") + 
  ggtitle("Weight vs mpg01 - Boxplot")

g5 <- ggplot(Auto, aes(x = mpg01, y = acceleration, fill = mpg01)) + 
  geom_boxplot() + 
  theme(legend.position = "none") + 
  ggtitle("Acceleration vs mpg01 - Boxplot")

g6 <- ggplot(Auto, aes(x = mpg01, y = year, fill = mpg01)) + 
  geom_boxplot() + 
  theme(legend.position = "none") + 
  ggtitle("Year vs mpg01 - Boxplot")

g7 <- ggplot(Auto, aes(x = origin, fill = mpg01)) + 
  geom_bar(position = "fill") + 
  scale_y_continuous(labels = scales::percent_format()) + 
  theme(axis.title.y = element_blank()) + 
  ggtitle("Origin vs mpg01 - Bar Plot") 


grid.arrange(g1, g2, g3, g4, g5, g6, g7, 
             ncol = 2, 
             nrow = 4)

It’s tricky to compare predictive power between different graphs (e.g. scatter plots & box plots), but if I had to guess I would say that cylinders, displacement & weight are the three strongest predictors of mpg01.

horsepower looks like it might be a little weaker, with year & acceleration weaker still (more significant overlap of the distributions).

It is harder to judge how strong a predictor origin is visually when compared with the other variables, since we are looking at a relationship between two nominal categorical variables, but it certainly appears to be on the weaker side.
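
As a rough numeric cross-check of this visual ranking (my own addition, not required by the question), caret’s filterVarImp() scores each numeric predictor by the area under its ROC curve against mpg01 - values closer to 1 indicate stronger separation. I exclude origin here, since a per-predictor AUC doesn’t apply directly to a factor:

filterVarImp(x = Auto %>% select(-mpg01, -origin), y = Auto$mpg01)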


(c) train & test

Q: Split the data into a training set and a test set.

A:

I create train & test from the Auto dataset. I leave name & mpg excluded for now. train makes up 50% of the data:

set.seed(444)
index <- createDataPartition(y = Auto$mpg01, p = 0.5, list = F)

train <- Auto[index, ]
test <- Auto[-index, ]

nrow(train) / nrow(Auto)
## [1] 0.5
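
createDataPartition() samples within the levels of the outcome, so both halves should keep the (roughly) 50/50 class balance - a quick check:

prop.table(table(train$mpg01))
prop.table(table(test$mpg01))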


(d) Predicting mpg01 - LDA

Q: Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

A:

For (d), (e), (f) & (g), I’ll use the top 4 strongest predictors I identified visually (cylinders, displacement, weight & horsepower).

Furthermore, as these questions are very similar to question 10 (b) - (g), I thought I would investigate how to train all of these models using train() from the caret package instead. This means we can easily report two error metrics too (the train cross-validation error, and the test error).

Unlike in the last question, all of the models below perform far better than the baseline classifier accuracy of 50% (the classes of mpg01 are perfectly balanced).

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

set.seed(1)

lda_mpg <- train(mpg01 ~ cylinders + displacement + weight + horsepower, 
                 data = train, 
                 method = "lda", 
                 trControl = ctrl)

train (cross-validation) error:

1 - lda_mpg$results$Accuracy # cv error
## [1] 0.09125731

test error:

predicted_lda <- predict(lda_mpg, newdata = test, type = "raw") # as opposed to type = "prob"

mean(predicted_lda != test$mpg01)
## [1] 0.122449
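
Since parts (e) - (g) repeat this exact pattern, a small helper could return both error metrics in one call. This is just a convenience sketch of my own (the function name and default argument are mine); I continue to show the two calculations explicitly below:

report_errors <- function(fit, test_data, response = "mpg01") {
  c(cv_error   = 1 - max(fit$results$Accuracy), 
    test_error = mean(predict(fit, newdata = test_data, type = "raw") != test_data[[response]]))
}

report_errors(lda_mpg, test)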


(e) Predicting mpg01 - QDA

Q: Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

A:

set.seed(2)

qda_mpg <- train(mpg01 ~ cylinders + displacement + weight + horsepower, 
                 data = train, 
                 method = "qda", 
                 trControl = ctrl)

train (cross-validation) error:

1 - qda_mpg$results$Accuracy
## [1] 0.08991228

test error:

predicted_qda <- predict(qda_mpg, newdata = test, type = "raw")

mean(predicted_qda != test$mpg01)
## [1] 0.1173469

In this case, QDA appears to perform better than LDA with respect to test error, and slightly better in terms of CV error.


(f) Predicting mpg01 - Logistic Regression

Q: Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

A:

set.seed(3)

log_mpg <- train(mpg01 ~ cylinders + displacement + weight + horsepower, 
                 data = train, 
                 method = "glm", 
                 family = "binomial", 
                 trControl = ctrl)

train (cross-validation) error:

1 - log_mpg$results$Accuracy
## [1] 0.1084405

test error:

predicted_log <- predict(log_mpg, newdata = test, type = "raw")

mean(predicted_log != test$mpg01)
## [1] 0.1173469

Logistic Regression has the highest CV error of the three models fitted so far (10.84%), but its test error (11.73%) matches QDA’s and is slightly lower than LDA’s.


(g) Predicting mpg01 - KNN

Q: Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

A:

I pass a grid of hyperparameter values (values for K) to caret::train(). caret tests each value of K using the cross-validation scheme defined by ctrl (10-fold cross-validation, repeated 3 times), automatically selects the best K with respect to the target metric (out-of-sample accuracy), then retrains the KNN model on the full train dataset using this optimal K.

Specifying preProcess = c("center", "scale") centers and scales (standardizes) the 4 predictors. This matters for distance-based methods such as KNN when the predictors are measured on very different scales; otherwise weight (which is in the thousands) would dominate the distance calculation.
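
For intuition, here is a sketch of what that preprocessing does under the hood: the means & standard deviations are estimated on train, then applied to any new data (the num_preds vector below is just my shorthand for the 4 predictors):

num_preds <- c("cylinders", "displacement", "weight", "horsepower")

pp <- preProcess(train[, num_preds], method = c("center", "scale"))
head(predict(pp, test[, num_preds])) # test predictors standardized using the train means/SDs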

set.seed(4)

knn_mpg <- train(mpg01 ~ cylinders + displacement + weight + horsepower, 
                 data = train, 
                 method = "knn", 
                 preProcess = c("center", "scale"), 
                 trControl = ctrl, 
                 tuneGrid = expand.grid(k = seq(1, 85, 3)))

knn_mpg
## k-Nearest Neighbors 
## 
## 196 samples
##   4 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (4), scaled (4) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 176, 177, 176, 176, 176, 176, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.8779532  0.7554422
##    4  0.8999805  0.7995964
##    7  0.9033138  0.8062631
##   10  0.9033138  0.8062631
##   13  0.9033138  0.8062631
##   16  0.9033138  0.8062631
##   19  0.9033138  0.8062631
##   22  0.9033138  0.8062631
##   25  0.9033138  0.8062631
##   28  0.9033138  0.8062631
##   31  0.9033138  0.8062631
##   34  0.9033138  0.8062631
##   37  0.9033138  0.8062631
##   40  0.9033138  0.8062631
##   43  0.9033138  0.8062631
##   46  0.9033138  0.8062631
##   49  0.9033138  0.8062631
##   52  0.9033138  0.8062631
##   55  0.9033138  0.8062631
##   58  0.9033138  0.8062631
##   61  0.9033138  0.8062631
##   64  0.9049805  0.8095964
##   67  0.9049805  0.8095964
##   70  0.9066472  0.8129297
##   73  0.9100682  0.8197621
##   76  0.9083138  0.8162631
##   79  0.9084016  0.8164288
##   82  0.9067349  0.8130955
##   85  0.9034016  0.8065072
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 73.

train (cross-validation) error:

1 - max(knn_mpg$results$Accuracy)
## [1] 0.08993177

test error:

predicted_knn <- predict(knn_mpg, newdata = test, type = "raw")

mean(predicted_knn != test$mpg01)
## [1] 0.1122449

The optimal value chosen was K = 73. This model’s CV error (~9%) is essentially tied with QDA for the best of the four approaches, and its test error (11.22%) is the lowest of the four.

There appears to be a quirk in the results: every value of K from 7 to 61 produces exactly the same cross-validation accuracy, presumably because these classifiers are making identical predictions.

Identifying exactly why would require some investigation, but it is most likely a consequence of testing large values of K relative to the sample size (196 training observations, so roughly 176 per fold), causing these classifiers to behave effectively identically, rather than a mistake in the code or the package.
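
One quick (and only indicative) way to check this would be to standardize the training predictors and compare the in-sample predictions of two K values on the plateau directly - if they agree everywhere, identical CV accuracies are unsurprising. A sketch, assuming the plateau behaviour also holds in-sample:

num_preds <- c("cylinders", "displacement", "weight", "horsepower")
train_std <- scale(train[, num_preds]) # standardize, as preProcess would

pred_k10 <- class::knn(train_std, train_std, cl = train$mpg01, k = 10)
pred_k40 <- class::knn(train_std, train_std, cl = train$mpg01, k = 40)

mean(pred_k10 == pred_k40) # proportion of matching in-sample predictions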




12. APPLIED: Writing Functions

This problem involves writing functions.


(a) Power() Function

Q: Write a function, Power(), that prints out the result of raising 2 to the 3rd power. In other words, your function should compute \(2^3\) and print out the results.

Hint: Recall that x^a raises x to the power a. Use the print() function to output the result.

A:

I’ve omitted print() as it is not needed:

Power <- function() {
  2^3
}

Power()
## [1] 8


(b) Power2() Function

Q: Create a new function, Power2(), that allows you to pass any two numbers, x and a, and prints out the value of x^a. You can do this by beginning your function with the line Power2=function (x,a){. You should be able to call your function by entering, for instance, Power2(3,8) on the command line. This should output the value of \(3^8\), namely, 6,561.

A:

Power2 <- function(x, a) {
  x^a
}

Power2(3, 8)
## [1] 6561


(c) Power2() Applied

Q: Using the Power2() function that you just wrote, compute \(10^3\), \(8^{17}\), and \(131^3\).

A:

\(10^3\):

Power2(10, 3)
## [1] 1000

\(8^{17}\):

Power2(8, 17)
## [1] 2.2518e+15

\(131^3\):

Power2(131, 3)
## [1] 2248091


(d) Power3() Function

Q: Now create a new function, Power3(), that actually returns the result x^a as an R object, rather than simply printing it to the screen. That is, if you store the value x^a in an object called result within your function, then you can simply return() this result, using the following line: return(result). This should be the last line in your function, before the } symbol.

A:

We don’t need the return() function here, since an R function returns the value of its last evaluated expression. I test with Power3(3, 8):

Power3 <- function(x, a) {
  result <- x^a
  result
}

Power3(3, 8)
## [1] 6561
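
A quick check that the result really is returned as an R object (rather than only printed) - it can be assigned and reused in later calculations:

result_38 <- Power3(3, 8)
result_38 / 3 # 6561 / 3 = 2187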


(e) Power3() Applied

Q: Now using the Power3() function, create a plot of \(f(x) = x^2\). The x-axis should display a range of integers from 1 to 10, and the y-axis should display \(x^2\). Label the axes appropriately, and use an appropriate title for the figure. Consider displaying either the x-axis, the y-axis, or both on the log-scale.

A:

data.frame(x = 1:10, y = Power3(1:10, 2)) %>%
  ggplot(aes(x = x, y = log(y))) + 
  geom_point() + 
  geom_line() + 
  scale_x_continuous(breaks = 1:10, minor_breaks = NULL) + 
  labs(x = "x", y = "log(x^2)", 
       title = "f(x) = x^2, with the y-axis on the log scale")


(f) PlotPower() Function

Q: Create a function, PlotPower(), that allows you to create a plot of x against x^a for a fixed a and for a range of values of x. For instance, if you call PlotPower(1:10, 3) then a plot should be created with an x-axis taking on values \(1\), \(2\), …, \(10\), and a y-axis taking on values \(1^3\), \(2^3\),…, \(10^3\).

A:

PlotPower <- function(x, a, col = "black") {
  ggplot(mapping = aes(x = x, y = x^a)) + 
    geom_point(col = col) + 
    geom_line(col = col) + 
    scale_x_continuous(labels = scales::comma_format()) +
    scale_y_continuous(labels = scales::comma_format()) +
    theme_light() + 
    labs(title = paste0("Plot of f(x) = x^", as.character(a), " (for x between ", min(x), " and ", max(x), ")"), 
         subtitle = "Created using the 'PlotPower()' function")
}

I test the PlotPower() function on two examples below. I also added a col argument to change the colour, and made the title auto-generate to display the function and the range of x.

PlotPower(1:20, 3)

PlotPower(x = seq(1, 500, 10), 
          a = 2, 
          col = "deepskyblue3")




13. APPLIED: The Boston Dataset

Q: Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.

A:

First, I’ll overwrite the crime rate variable (crim) with a binary factor variable, which takes the value 1 when crim is above the median, else 0.

The resulting dataset is shown below:

Boston$crim <- factor(ifelse(Boston$crim > median(Boston$crim), 1, 0))

glimpse(Boston)
## Rows: 506
## Columns: 14
## $ crim    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87,...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.5...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.6...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4,...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2,...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9,...

I start by fitting all 4 model types, using all predictors each time. I won’t use a train/test split, but will instead evaluate performance using cross-validation accuracy (10-fold, repeated 5 times):

ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 5)


Logistic Regression:

set.seed(1)
log_crim <- train(crim ~ ., 
                 data = Boston, 
                 method = "glm", 
                 family = "binomial", 
                 trControl = ctrl)

log_crim
## Generalized Linear Model 
## 
## 506 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 456, 455, 456, 454, 456, 456, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9028549  0.8057423


LDA:

set.seed(2)
lda_crim <- train(crim ~ ., 
                 data = Boston, 
                 method = "lda", 
                 trControl = ctrl)

lda_crim
## Linear Discriminant Analysis 
## 
## 506 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 456, 455, 456, 456, 455, 454, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8498998  0.6996644


QDA:

set.seed(3)
qda_crim <- train(crim ~ ., 
                 data = Boston, 
                 method = "qda", 
                 trControl = ctrl)

qda_crim
## Quadratic Discriminant Analysis 
## 
## 506 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 456, 456, 455, 455, 455, 456, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8894561  0.7788975


KNN:

set.seed(4)
knn_crim <- train(crim ~ ., 
                 data = Boston, 
                 method = "knn", 
                 preProcess = c("center", "scale"), 
                 trControl = ctrl, 
                 tuneGrid = expand.grid(k = seq(1, 20, 2)))

knn_crim
## k-Nearest Neighbors 
## 
## 506 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (13), scaled (13) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 455, 456, 456, 455, 455, 456, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.9145970  0.8292052
##    3  0.9181104  0.8362439
##    5  0.9153345  0.8306748
##    7  0.9047128  0.8094000
##    9  0.8779164  0.7557756
##   11  0.8664404  0.7327928
##   13  0.8664323  0.7327720
##   15  0.8703940  0.7407136
##   17  0.8684169  0.7367859
##   19  0.8648404  0.7296460
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 3.


KNN actually performed the best, which is surprising since I’ve performed no variable selection, and KNN can be quite negatively impacted by weak predictors.
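
Since all four models were evaluated with the same resampling scheme, caret’s resamples() gives a convenient side-by-side summary of their resampled accuracies. Note this is indicative rather than a strictly paired comparison, as each model was trained with a different seed and so the fold assignments differ:

summary(resamples(list(Logistic = log_crim, LDA = lda_crim, QDA = qda_crim, KNN = knn_crim)))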

To combat this and try to improve the model further, I apply the following approach: rank the predictors by their variable importance, then fit & tune a KNN model on the top 1 predictor, the top 2 predictors, and so on up to all 13, comparing the resulting cross-validation accuracies.

Below are the variable importance measures provided by caret (scaled so that the most important variable scores 100):

varImp(knn_crim)
## ROC curve variable importance
## 
##         Importance
## nox         100.00
## dis          86.05
## age          82.64
## indus        80.85
## tax          78.00
## rad          73.48
## lstat        59.43
## medv         50.63
## zn           47.35
## ptratio      45.59
## black        42.68
## rm           20.66
## chas          0.00

By ordering the variables in the dataset based on their importance, I can fit, tune & evaluate all of these models in a single small loop. The code and resulting comparison are below:

# order the predictor names from most to least important (per varImp() above)
knn_best_to_worst <- as.data.frame(varImp(knn_crim)$importance) %>%
  rownames_to_column("var") %>%
  arrange(desc(X0)) %>% # 'X0' is the importance score column after as.data.frame()
  pull(var)

# reorder the columns of Boston accordingly, keeping the response (crim) last
Boston <- Boston %>%
  dplyr::select(c(knn_best_to_worst, "crim"))



cv_accuracy <- c()
chosen_k <- c()

set.seed(5)

for (i in 1:13) {
  # subset to the top-i most important predictors, plus the response (column 14 = crim)
  Boston_top_i <- as.data.frame(Boston[ ,c(1:i, 14)])
  
  # fit & tune KNN (k = 1, 3, ..., 19) on this subset
  knn_crim_temp <- train(crim ~ ., 
                         data = Boston_top_i,
                         method = "knn", 
                         preProcess = c("center", "scale"), 
                         trControl = ctrl, 
                         tuneGrid = expand.grid(k = seq(1, 20, 2)))
  
  # store the best cross-validation accuracy and the corresponding k
  cv_accuracy[i] <- max(knn_crim_temp$results$Accuracy)
  chosen_k[i] <- as.numeric(knn_crim_temp$bestTune)
}

data.frame(pred_num = 1:13, cv_accuracy = cv_accuracy, chosen_k = chosen_k) %>%
  arrange(desc(cv_accuracy))
##    pred_num cv_accuracy chosen_k
## 1         1   0.9549303        1
## 2        10   0.9407704        5
## 3        11   0.9407192        3
## 4         6   0.9378703        1
## 5         5   0.9356350        5
## 6         4   0.9340434        3
## 7         7   0.9340048        1
## 8         9   0.9324929        1
## 9        12   0.9320217        5
## 10        8   0.9316078        3
## 11       13   0.9170664        5
## 12        2   0.8960546        1
## 13        3   0.8877502        7

Interestingly, the most effective model (by some margin) is also probably the simplest model we could possibly create - a KNN model with \(p\) = 1 and K = 1. This model is so simple that its train/test implementation can be described in a single sentence:

“For each new test suburb in which we want to predict the binarized crime rate (crim), look in train, find the suburb closest to it in nitrogen oxide concentration (nox), and use that suburb’s binarized crime rate as the prediction”.
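
A literal sketch of that rule using class::knn(), with a single predictor (nox) and K = 1. The boston_train & boston_test objects are hypothetical here, since I didn’t actually create a train/test split for this question:

pred_crim <- class::knn(train = data.frame(nox = boston_train$nox), 
                        test  = data.frame(nox = boston_test$nox), 
                        cl    = boston_train$crim, 
                        k     = 1)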