NOTE: There are no official solutions for these questions. These are my solutions and could be incorrect. If you spot any mistakes/inconsistencies, please contact me on Liam95morgan@gmail.com, or via LinkedIn.
Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani
Q: Describe the null hypotheses to which the p-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model.
A:
Table 3.4 shows the regression statistics from predicting \(y\) = `sales` using `TV`, `radio` & `newspaper`:
data <- data.frame(coeff = c('$\\hat{\\beta_{0}}$ = 2.939', '$\\hat{\\beta_{1}}$ = 0.046', '$\\hat{\\beta_{2}}$ = 0.189', '$\\hat{\\beta_{3}}$ = -0.001'),
std_err = c(0.3119, 0.0014, 0.0086, 0.0059),
t_stat = c(9.42, 32.81, 21.89, -0.18),
p_value = c('< 0.0001', '< 0.0001', '< 0.0001', '0.8599'),
stringsAsFactors = F)
colnames(data) <- c('coefficient', 'std. error', 't-statistic', 'p-value')
rownames(data) <- c('Intercept', '`TV`', '`radio`', '`newspaper`')
library(knitr)
library(kableExtra)
kable(data, row.names = T) %>%
kable_styling(bootstrap_options = "bordered", full_width = F)
| | coefficient | std. error | t-statistic | p-value |
|---|---|---|---|---|
| Intercept | \(\hat{\beta_{0}}\) = 2.939 | 0.3119 | 9.42 | < 0.0001 |
| `TV` | \(\hat{\beta_{1}}\) = 0.046 | 0.0014 | 32.81 | < 0.0001 |
| `radio` | \(\hat{\beta_{2}}\) = 0.189 | 0.0086 | 21.89 | < 0.0001 |
| `newspaper` | \(\hat{\beta_{3}}\) = -0.001 | 0.0059 | -0.18 | 0.8599 |
For the intercept (\(\beta_{0}\)), the corresponding test (with null and alternative hypotheses):
\[H_{0}: \beta_{0} = 0 \\H_{a}: \beta_{0} \neq 0 \]
\(\beta_{0}\) is the value that \(y\) takes when \(x_{1} = x_{2} = x_{3} = 0\). That is, it is the `sales` we expect when there is no `TV`, `radio` or `newspaper` advertising.
Since p < 0.0001, we have significant evidence to reject \(H_{0}\) in favour of the alternative hypothesis. In other words, we have strong evidence that `sales` would not be zero in the absence of `TV`, `radio` & `newspaper` advertising.
`TV` & `radio`
The corresponding test (with null and alternative hypotheses) for `TV`:
\[H_{0}: \beta_{1} = 0 \\H_{a}: \beta_{1} \neq 0 \]
The same for `radio`:
\[H_{0}: \beta_{2} = 0 \\H_{a}: \beta_{2} \neq 0 \]
In both cases, the null hypothesis means that the corresponding variable has no effect on \(y\), when holding all other predictors fixed.
In both cases, p < 0.0001, so we reject \(H_{0}\) and conclude that a change in the `TV` or `radio` budget does have some impact on `sales`.
`newspaper`
The corresponding test (with null and alternative hypotheses) for `newspaper`:
\[H_{0}: \beta_{3} = 0 \\H_{a}: \beta_{3} \neq 0 \]
The null hypothesis means that, when holding the other predictors (`TV` & `radio`) fixed, `newspaper` has no effect on \(y\) = `sales`.
Since p = 0.8599, there is not sufficient evidence to reject this hypothesis. In other words, it appears that changing the `newspaper` budget has no impact on `sales`.
Q: Carefully explain the differences between the KNN classifier and KNN regression methods.
A:
The main difference is the goal: the KNN classifier is used in situations where the response variable is categorical, whereas the KNN regressor is only appropriate when it is numeric.
For both the classifier & regressor, the first step is the same: identify the \(K\) training observations closest to the prediction point \(x_{0}\); these form the neighbourhood \(\mathcal{N}_{0}\).
KNN Classifier:
First, the conditional probability for class \(j\) is estimated, which is the fraction of the points in \(\mathcal{N}_{0}\) where \(y_{i}\) = \(j\).
\[ P_{r}(Y = j | x = x_{0}) = \frac{1}{K} \sum_{i \in \mathcal{N}_{0}} I(y_{i} = j)\]
After working out the conditional probability for each \(j\) class, the class prediction \(\hat{f}(x_{0})\) is then simply the class \(j\) with the highest conditional probability:
\[\hat{f}(x_{0}) = \underset{j}{\operatorname{argmax}}(P_{r}(Y = j | x = x_{0}))\]
KNN Regressor:
The KNN regressor identifies the \(K\) training observations that are closest to the observation it is predicting (\(x_{0}\)), then estimates \(f(x_{0})\) by taking the mean \(y\) value of these training observations.
\[\hat{f}(x_{0}) = \frac{1}{K} \sum_{i \in \mathcal{N}_{0}} y_{i}\]
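To make the mechanics concrete, here is a minimal base-R sketch of the KNN regressor above (the function name `knn_reg_predict` and the simulated data are my own, not from the book); swapping the mean for the most common class in \(\mathcal{N}_{0}\) would give the classifier.
knn_reg_predict <- function(x_train, y_train, x0, K) {
  nbrs <- order(abs(x_train - x0))[1:K]  # indices of the K nearest neighbours, i.e. N_0
  mean(y_train[nbrs])                    # f_hat(x0) = average response over N_0
}
set.seed(1)
x_train <- rnorm(50)
y_train <- 2 * x_train + rnorm(50)
knn_reg_predict(x_train, y_train, x0 = 0.5, K = 5)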
Suppose we have a data set with five predictors: \(X_{1}\) = GPA, \(X_{2}\) = IQ, \(X_{3}\) = Gender (1 for Female, 0 for Male), \(X_{4}\) = the GPA/IQ interaction, and \(X_{5}\) = the GPA/Gender interaction.
The response \(Y\) is starting salary after graduation (in thousands of dollars).
Suppose we use least squares to fit the model, and get: \(\hat{\beta_{0}} = 50\), \(\hat{\beta_{1}} = 20\), \(\hat{\beta_{2}} = 0.07\), \(\hat{\beta_{3}} = 35\), \(\hat{\beta_{4}} = 0.01\), \(\hat{\beta_{5}} = -10\).
Q: Which answer is correct, and why?
i. For a fixed value of IQ and GPA, males earn more on average than females.
ii. For a fixed value of IQ and GPA, females earn more on average than males.
iii. For a fixed value of IQ and GPA, males earn more on average than females provided that the GPA is high enough.
iv. For a fixed value of IQ and GPA, females earn more on average than males provided that the GPA is high enough.
A:
The regression equation is given by:
\[ \begin{align*} \hat{salary} & = \hat{\beta_{0}} + \hat{\beta_{1}}\cdot gpa + \hat{\beta_{2}}\cdot iq + \hat{\beta_{3}}\cdot gender + \hat{\beta_{4}}\cdot gpa\cdot iq + \hat{\beta_{5}}\cdot gpa\cdot gender \\ & = 50 + 20\cdot gpa + 0.07\cdot iq + 35\cdot gender + 0.01\cdot gpa\cdot iq - 10\cdot gpa\cdot gender \end{align*} \]
Lets look at the \(\hat{y}\) for \(gender = 1\) (female) and \(gender = 0\) (male):
\[ \begin{align*} \mathbb{E}[y \mid gender = 1] & = 50 + 20\cdot gpa + 0.07\cdot iq + 35\cdot 1 + 0.01\cdot gpa\cdot iq - 10\cdot gpa\cdot 1 \\ & = 85 + 10\cdot gpa + 0.07\cdot iq + 0.01\cdot gpa\cdot iq \end{align*} \]
\[ \begin{align*} \mathbb{E}[y \mid gender = 0] & = 50 + 20\cdot gpa + 0.07\cdot iq + 35\cdot 0 + 0.01\cdot gpa\cdot iq - 10\cdot gpa\cdot 0 \\ & = 50 + 20\cdot gpa + 0.07\cdot iq + 0.01\cdot gpa\cdot iq \end{align*} \]
Starting salary is higher for females than males when:
\[ \begin{align*} & \mathbb{E}[y \mid gender = 1] > \mathbb{E}[y \mid gender = 0] \\ \Longleftrightarrow \enspace & 85 + 10\cdot gpa + 0.07\cdot iq + 0.01\cdot gpa\cdot iq > 50 + 20\cdot gpa + 0.07\cdot iq + 0.01\cdot gpa\cdot iq \\ \Longleftrightarrow \enspace & 85 + 10\cdot gpa > 50 + 20\cdot gpa \\ \Longleftrightarrow \enspace & gpa < 3.5 \end{align*} \] So for fixed IQ & GPA, females earn more than males except when \(gpa > 3.5\). This is equivalent to answer iii.
Q: Predict the salary of a female with IQ of 110 and a GPA of 4.0.
A:
Plugging the values into the regression equation in (a):
\[ \begin{align*} \hat{salary} & = 50 + 20\cdot gpa + 0.07\cdot iq + 35\cdot gender + 0.01\cdot gpa\cdot iq - 10\cdot gpa\cdot gender \\ & = 50 + 20\cdot 4 + 0.07\cdot 110 + 35\cdot 1 + 0.01\cdot 4\cdot 110 - 10\cdot 4\cdot 1 \\ & = 137.1 \end{align*} \]
So her predicted salary would be: $137,100.
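As a quick arithmetic check in R (variable names are my own):
gpa <- 4; iq <- 110; gender <- 1  # female
50 + 20*gpa + 0.07*iq + 35*gender + 0.01*gpa*iq - 10*gpa*gender
## [1] 137.1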
Q: True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.
A:
False. The coefficient value for an interaction term between \(x_{1}\) and \(x_{2}\) will depend on the scales of \(x_{1}\), \(x_{2}\) & \(y\).
We could change this coefficient simply by changing the scale of particular variables (e.g. units \(\rightarrow\) thousands \(\rightarrow\) millions) and this won’t be in any way reflective of whether there is an interaction effect present: the value of \(\hat{\beta_{4}}\) is not informative for this purpose.
We should instead perform a t-test for the parameter of the GPA/IQ interaction term if we are interested in whether this interaction effect exists:
\[H_{0}: \beta_{4} = 0 \\H_{a}: \beta_{4} \neq 0 \]
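As a sketch of how this t-test could be run in practice (the data below is simulated purely for illustration, since the question's data set isn't available):
set.seed(1)
n <- 200
sim <- data.frame(gpa = runif(n, 2, 4),
                  iq = rnorm(n, mean = 100, sd = 15),
                  gender = rbinom(n, 1, 0.5))
sim$salary <- 50 + 20*sim$gpa + 0.07*sim$iq + 35*sim$gender +
  0.01*sim$gpa*sim$iq - 10*sim$gpa*sim$gender + rnorm(n, sd = 5)
interaction_fit <- lm(salary ~ gpa*iq + gpa*gender, data = sim)
summary(interaction_fit)$coefficients["gpa:iq", ]  # t-test row for the GPA/IQ interaction term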
I collect a set of data (n = 100 observations) containing a single predictor and a quantitative response.
I then fit a linear regression model to the data: \(Y = \beta_{0} + \beta_{1}X + \epsilon\),
as well as a separate cubic regression: \(Y = \beta_{0} + \beta_{1}X + \beta_{2}X^2 + \beta_{3}X^3 + \epsilon\).
Q: Suppose that the true relationship between X and Y is linear. Consider the training residual sum of squares (RSS) for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.
A:
I would expect similar results, but with the cubic regression having a slightly lower training RSS. The cubic model contains the linear model as a special case, so its least squares fit can never have a higher training RSS; with a fairly small sample, I'd expect the extra flexibility to overfit the noise slightly and reduce the training RSS a little further.
Q: Answer (a) using test rather than training RSS.
A:
Here I would expect the linear regression to perform slightly better than the cubic regression. I say slightly because the true relationship is linear and we are working from a sample of this data, so the cubic regression would likely have fit something close to a linear relationship (it’s not going to fit a wild cubic relationship if the data mostly follow a straight line).
Q: Suppose that the true relationship between X and Y is not linear, but we don’t know how far it is from linear. Consider the training RSS for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.
A:
As in part (a), I expect the cubic regression to have a lower training RSS.
If the relationship is very non-linear and can be well approximated by a cubic fit, I would expect the training RSS to be much lower.
If, however, the relationship is very close to linear or cannot be well approximated by a cubic fit, we might expect more similar results.
Either way, the cubic fit would match or outperform the linear fit on training RSS, as increasing model flexibility can never increase the training error of a least squares fit.
Q: Answer (c) using test rather than training RSS.
A:
It depends on the nature of the non-linear relationship. If the relationship is close to linear, the linear model may give a very good approximation and could have the lower test RSS; for a relationship that is strongly non-linear (and reasonably well approximated by a cubic), we would expect the cubic regression to have the lower test RSS. There is not really enough information to say definitively.
Q: Consider the fitted values that result from performing linear regression without an intercept. In this setting, the ith fitted value takes the form:
\[\hat{y_i}=x_i\hat{\beta}\]
where:
\[\hat{\beta} = (\sum_{j=1}^{n}x_jy_j)/(\sum_{k=1}^{n}x_k^2)\]
Show that we can write
\[\hat{y_i} = \sum_{j=1}^{n}a_jy_j\]
What is \(a_j\)?
A:
\[ \begin{align*} \hat{y_i} & = x_i\hat{\beta} \\\\ & = x_i \frac{\sum_{j=1}^{n}x_jy_j}{\sum_{k=1}^{n}x_k^2} \\\\ & = \frac{\sum_{j=1}^{n}x_ix_jy_j}{\sum_{k=1}^{n}x_k^2} && \text{(since } x_i \text{ doesn't depend on } j \text{)}\\\\ & = \sum_{j=1}^{n}\frac{x_ix_jy_j}{\sum_{k=1}^{n}x_k^2} && \text{(similarly, the summation over } k \text{ is basically a constant w.r.t } j \text{)}\\\\ & = \sum_{j=1}^{n}\frac{x_ix_j}{\sum_{k=1}^{n}x_k^2} y_j\\\\ & = \sum_{j=1}^{n}a_jy_j \\\\ \end{align*} \] Where \(a_j = \frac{x_ix_j}{\sum_{k=1}^{n}x_k^2}\)
We can see that the fitted values from linear regression (with the intercept fixed at 0) are linear combinations of the response values.
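A quick numerical check of this result (the simulated data and object names here are my own):
set.seed(1)
x_sim <- rnorm(20)
y_sim <- 3 * x_sim + rnorm(20)
fit_no_int <- lm(y_sim ~ x_sim + 0)
i <- 5
a <- (x_sim[i] * x_sim) / sum(x_sim^2)   # a_j for j = 1, ..., n
all.equal(unname(fitted(fit_no_int)[i]), sum(a * y_sim))
## [1] TRUE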
Q: Using (3.4), argue that in the case of simple linear regression, the least squares line always passes through the point (\(\bar{x}\), \(\bar{y}\))
A:
(3.4) give the parameter estimates for a simple linear regression line when minimizing the RSS:
\[\hat{\beta_{1}} = \frac{\sum_{i = 1}^n(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i = 1}^n(x_i - \bar{x})^2} \\ \hat{\beta_{0}} = \bar{y} - \hat{\beta_{1}}\bar{x}\]
The least squares line:
\[\hat{y}_i = \hat{\beta_{0}} + \hat{\beta_{1}}x_i\]
By substituting \(x_i = \bar{x}\), and using the relationship for \(\hat{\beta_{0}}\):
\[ \begin{align*} \hat{y}_i & = \hat{\beta_{0}} + \hat{\beta_{1}}\bar{x} \\ & = (\bar{y} - \hat{\beta_{1}}\bar{x}) + \hat{\beta_{1}}\bar{x} \\ & = \bar{y} \end{align*} \]
So the least squares line always passes through (\(\bar{x}\), \(\bar{y}\)).
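This is easy to confirm numerically with any simple regression fit (a sketch on simulated data; names are my own):
set.seed(1)
x_sim <- rnorm(50)
y_sim <- 1 + 2 * x_sim + rnorm(50)
slr_fit <- lm(y_sim ~ x_sim)
all.equal(unname(predict(slr_fit, newdata = data.frame(x_sim = mean(x_sim)))), mean(y_sim))
## [1] TRUE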
Q: It is claimed in the text that in the case of simple linear regression of Y onto X, the R2 statistic (3.17) is equal to the square of the correlation between X and Y (3.18). Prove that this is the case. For simplicity, you may assume that \(\bar{x} = \bar{y} = 0\)
A:
I’m going to drop some notation to keep things clearer, but all sums will be from 1 to n, e.g:
\[\sum_{i = 1}^n(x_i - \bar{x}) = \sum(x - \bar{x})\]
Using the parameter estimates for \(\hat{\beta_{0}}\) & \(\hat{\beta_{1}}\) in (6), we can simplify (since \(\bar{x} = \bar{y} = 0\)):
\[\hat{\beta_{0}} = 0 \\ \hat{\beta_{1}} = \frac{\sum{xy}}{\sum{x^2}} \] We can also simplify the correlation coefficient:
\[Cor(X, Y) = \frac{\sum{(x - \bar{x})(y - \bar{y})}}{\sqrt{\sum{(x - \bar{x})^2}}\sqrt{\sum{(y - \bar{y})^2}}} = \frac{\sum{xy}}{\sqrt{\sum{y^2}\sum{x^2}}}\]
\[R^2 = \frac{TSS - RSS}{TSS}\]
\[TSS = \sum{(y - \bar{y})^2} = \sum{y^2}\]
Lots of algebra to simplify \(RSS\):
\[ \begin{align*} RSS & = \sum{(y - \hat{y})^2} \\ & = \sum{y^2 - 2y\hat{y} + \hat{y}^2}\\ & = \sum{y^2 - 2y(\hat{\beta_{0}} + \hat{\beta_{1}}x) + (\hat{\beta_{0}} + \hat{\beta_{1}}x)^2} && \text{(subbing } \hat{y} \text{)}\\ & = \sum{y^2 - 2\hat{\beta_{1}}xy + \hat{\beta_{1}}^2x^2} && \text{(since } \hat{\beta_{0}} = 0 \text{)}\\ & = \sum{y^2} - 2\hat{\beta_{1}}\sum{xy} + \hat{\beta_{1}}^2\sum{x^2} && \text{(properties of summations)} \\ & = \sum{y^2} - 2\left(\frac{\sum{xy}}{\sum{x^2}}\right)\sum{xy} + \left(\frac{\sum{xy}}{\sum{x^2}}\right)^2\sum{x^2} && \text{(subbing } \hat{\beta_{1}} \text{)}\\ & = \sum{y^2} - 2\frac{(\sum{xy})^2}{\sum{x^2}} + \frac{(\sum{xy})^2}{\sum{x^2}} \\ & = \frac{\sum{y^2}\sum{x^2} - (\sum{xy})^2}{\sum{x^2}} \end{align*} \]
Plugging this into the \(R^2\) formula:
\[ \begin{align*} R^2 & = \frac{TSS - RSS}{TSS} \\ & = \frac{\sum{y^2} - \left[\frac{\sum{y^2}\sum{x^2} - (\sum{xy})^2}{\sum{x^2}}\right]}{\sum{y^2}} \\ & = \frac{\sum{y^2}\sum{x^2} - \sum{y^2}\sum{x^2} + (\sum{xy})^2}{\sum{y^2}\sum{x^2}} \\ & = \frac{(\sum{xy})^2}{\sum{y^2}\sum{x^2}} \\ & = \left(\frac{\sum{xy}}{\sqrt{\sum{y^2}\sum{x^2}}}\right)^2 \\ & = Cor(X, Y)^2 \end{align*} \]
So we get \(R^2 = Cor(X, Y)^2\), as required.
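The identity also holds without centering the data, as long as the model includes an intercept; a quick numerical confirmation on simulated data (names are my own):
set.seed(1)
x_sim <- rnorm(100)
y_sim <- 2 * x_sim + rnorm(100)
all.equal(summary(lm(y_sim ~ x_sim))$r.squared, cor(x_sim, y_sim)^2)
## [1] TRUE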
Auto Dataset (Simple Linear Regression)
This question involves the use of simple linear regression on the `Auto` data set.
library(ISLR)
library(tidyverse)
glimpse(Auto)
## Rows: 392
## Columns: 9
## $ mpg <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14...
## $ cylinders <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, ...
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383,...
## $ horsepower <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170,...
## $ weight <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, ...
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8....
## $ year <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70...
## $ origin <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, ...
## $ name <fct> chevrolet chevelle malibu, buick skylark 320, plymouth...
`mpg` with `horsepower`
Q: Use the `lm()` function to perform a simple linear regression with `mpg` as the response and `horsepower` as the predictor. Use the `summary()` function to print the results. Comment on the output. For example:
i. Is there a relationship between the predictor and the response?
ii. How strong is the relationship between the predictor and the response?
iii. Is the relationship between the predictor and the response positive or negative?
iv. What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?
A:
I follow the methodology on p.102 - p.104 quite closely here.
mpg_horsepower <- lm(mpg ~ horsepower, data = Auto)
summary(mpg_horsepower)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
i: The p-value for the `horsepower` variable is very small (<< 0.05), so there is strong evidence to believe that `horsepower` is associated with `mpg`. Therefore, there is a relationship between the predictor and response.
ii:
We have the \(RSE\), an estimate of the standard deviation of \(\epsilon\), given by:
\[RSE = \sqrt{\frac{1}{n-p-1} RSS}\] Think of it as “the average amount that the response will deviate from the true regression line… another way to think about this is that, even if the model were correct and the true values of the unknown coefficients were known exactly, any prediction of \(y\) on the basis of the predictors would still be off by about \(RSE\) units, on average.”
The \(RSE\) for this model:
summary(mpg_horsepower)$sigma
## [1] 4.905757
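As a sanity check on the formula above, the same number can be computed directly from the residuals (using base R's `residuals()` and `df.residual()`):
sqrt(sum(residuals(mpg_horsepower)^2) / df.residual(mpg_horsepower))
## [1] 4.905757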
Whether this is 'good' or 'bad' is hard to judge in isolation, since the \(RSE\) is measured in the units of \(y\); dividing it by \(\bar{y}\) gives the percentage error:
summary(mpg_horsepower)$sigma / mean(Auto$mpg)
## [1] 0.2092371
So the percentage error = 20.92%.
The \(R^2\) of the linear model, which can be thought of as “the percentage of variability in the response that is explained by the predictor”, is given by:
summary(mpg_horsepower)$r.squared
## [1] 0.6059483
So `horsepower` explains 60.59% of the variance in `mpg`.
iii:
The relationship is negative, as we can see from the parameter estimate for `horsepower` (\(\hat{\beta_1}\)):
coefficients(mpg_horsepower)[2]
## horsepower
## -0.1578447
This means that, if a vehicle has higher `horsepower`, it will generally have a lower value for `mpg`.
iv:
If `horsepower` = 98, we can get the prediction for `mpg` (fitted value), and the 95% confidence & prediction intervals for `mpg` (when `horsepower` = 98).
The confidence interval:
predict(mpg_horsepower, data.frame(horsepower = 98), interval = "confidence", level = 0.95)
## fit lwr upr
## 1 24.46708 23.97308 24.96108
The prediction interval:
predict(mpg_horsepower, data.frame(horsepower = 98), interval = "prediction", level = 0.95)
## fit lwr upr
## 1 24.46708 14.8094 34.12476
The prediction interval is wider than the confidence interval as we would expect.
Response (`mpg`) vs Predictor (`horsepower`)
theme_set(theme_light())
ggplot(Auto, aes(x = horsepower, y = mpg)) +
geom_point() +
geom_abline(intercept = coef(mpg_horsepower)[1], slope = coef(mpg_horsepower)[2],
col = "deepskyblue3",
size = 1) +
geom_smooth(se = F)
Q: Use the `plot()` function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.
A:
This is one of the cases where I think using `plot()` instead of `ggplot()` is fine, although these plots can be obtained as normal ggplot graphs using `lindia::gg_diagnose(lm)`, which offers more flexibility & customization.
par(mfrow=c(2,2))
plot(mpg_horsepower)
Looking at the smoothing line of the residuals (\(e_i = y_i - \hat{y_i}\)) vs the fitted values (\(\hat{y_i}\)), there is a strong pattern in the residuals, indicating non-linearity. You can see evidence of this in the scatter plot in part (b) too.
There also appears to be non-constant variance in the error terms (heteroscedasticity), but this could be corrected to an extent when trying a quadratic fit. If not, transformations such as \(log(y)\) or \(\sqrt{y}\) can shrink larger responses by a greater amount and reduce this issue.
There are some observations with large standardized residuals & high leverage (hence, high Cook's distance) that may warrant review.
p.92 - p.98 were helpful here.
Auto Dataset (Multiple Linear Regression)
This question involves the use of multiple linear regression on the `Auto` data set.
Q: Produce a scatterplot matrix which includes all of the variables in the data set.
A:
Note that `name` & `origin` are both nominal variables (`origin` is stored as a numeric code), but the `pairs()` function treats everything as numeric automatically (as a result, the plots involving `name` in particular aren't really meaningful).
pairs(Auto)
Q: Compute the matrix of correlations between the variables using the function `cor()`. You will need to exclude the `name` variable, which is qualitative.
A:
`origin` is still included, although it's just a categorical variable with numeric encoding:
cor(Auto[ ,-9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
`mpg` with all variables (except `name`)
Q: Use the `lm()` function to perform a multiple linear regression with `mpg` as the response and all other variables except `name` as the predictors. Use the `summary()` function to print the results. Comment on the output. For instance:
i. Is there a relationship between the predictors and the response?
ii. Which predictors appear to have a statistically significant relationship to the response?
iii. What does the coefficient for the year variable suggest?
A:
I first encode `origin` as a factor, adding labels for clarity.
Auto$origin <- factor(Auto$origin, labels = c("American", "European", "Japanese"))
mpg_lm <- lm(mpg ~ . - name, data = Auto)
summary(mpg_lm)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0095 -2.0785 -0.0982 1.9856 13.3608
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.795e+01 4.677e+00 -3.839 0.000145 ***
## cylinders -4.897e-01 3.212e-01 -1.524 0.128215
## displacement 2.398e-02 7.653e-03 3.133 0.001863 **
## horsepower -1.818e-02 1.371e-02 -1.326 0.185488
## weight -6.710e-03 6.551e-04 -10.243 < 2e-16 ***
## acceleration 7.910e-02 9.822e-02 0.805 0.421101
## year 7.770e-01 5.178e-02 15.005 < 2e-16 ***
## originEuropean 2.630e+00 5.664e-01 4.643 4.72e-06 ***
## originJapanese 2.853e+00 5.527e-01 5.162 3.93e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.307 on 383 degrees of freedom
## Multiple R-squared: 0.8242, Adjusted R-squared: 0.8205
## F-statistic: 224.5 on 8 and 383 DF, p-value: < 2.2e-16
i: We answer this question by performing an F-test, where we test the null hypothesis that all of the regression coefficients are zero:
\[H_{0}: \beta_{1} = \beta_{2} = \dots = \beta_{p} = 0 \\H_{a}: \text{at least one } \beta_{j} \ne 0 \]
The p-value is given at the bottom of the model summary (p-value: < 2.2e-16), so there is overwhelming evidence against the null hypothesis. We reject \(H_{0}\) and conclude that there is a relationship between the predictors and `mpg`.
ii: Using the coefficient p-values in the model output, and p = 0.05 as my threshold for significance, all variables except `cylinders`, `horsepower` & `acceleration` have a statistically significant relationship with the response.
iii:
Extracting the coefficient for `year` (\(\hat{\beta_{6}}\)):
coef(mpg_lm)[7]
## year
## 0.7770269
A coefficient of 0.777 means that an increase of 1 year (with all other predictors held constant) is associated with an increase of 0.777 in the response (`mpg`). This could perhaps be thought of as the rate of the cars' year-on-year increase in efficiency.
Q: Use the `plot()` function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow=c(2,2))
plot(mpg_lm)
As in question 8, some pattern in the residuals might suggest some non-linearity in the relationships between the predictors and response that the model is not currently capturing.
We can also see some evidence of observations (e.g. 14) with both high leverage and high residual statistics, that may be disproportionately influencing the regression predictions.
mpg_lm_stats <- broom::augment(mpg_lm) %>%
select(.rownames, .fitted, .hat, .resid, .std.resid, .cooksd) %>%
arrange(desc(.cooksd))
mpg_lm_stats
## # A tibble: 392 x 6
## .rownames .fitted .hat .resid .std.resid .cooksd
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 14 19.4 0.194 -5.42 -1.83 0.0895
## 2 394 35.5 0.0646 8.53 2.67 0.0547
## 3 327 32.4 0.0406 11.0 3.41 0.0546
## 4 326 33.9 0.0332 10.4 3.20 0.0391
## 5 245 33.0 0.0323 10.1 3.11 0.0358
## 6 387 28.7 0.0342 9.33 2.87 0.0324
## 7 323 33.2 0.0151 13.4 4.07 0.0282
## 8 112 27.0 0.0294 -9.01 -2.77 0.0257
## 9 328 27.9 0.0297 8.55 2.62 0.0234
## 10 330 34.7 0.0217 9.86 3.01 0.0224
## # ... with 382 more rows
mpg_lm_stats$cooksd_cutoff <- factor(ifelse(mpg_lm_stats$.cooksd >= 4 / nrow(Auto), "True", "False"))
I extract these stats (leverage, residual stats, Cook's distance, etc.) for each observation using the broom package, and in the plots below I highlight those observations with \(CooksD \ge \frac{4}{n}\), a common threshold for identifying influential observations.
ggplot(mpg_lm_stats, aes(x = .hat, y = .std.resid, col = cooksd_cutoff)) +
geom_point(alpha = 0.8) +
scale_color_manual(values = c("mediumseagreen", "#BB8FCE")) +
theme_light() +
theme(legend.position = "bottom") +
labs(title = "Residuals vs Leverage",
x = "Leverage",
y = "Standardized Residuals",
col = "Cooks Distance >= 4/n")
ggplot(mpg_lm_stats, aes(x = .hat, y = .cooksd, col = cooksd_cutoff)) +
geom_point(alpha = 0.8) +
geom_hline(yintercept = 4 / nrow(Auto), col = "grey30") +
scale_color_manual(values = c("mediumseagreen", "#BB8FCE")) +
theme_light() +
theme(legend.position = "bottom") +
labs(title = "Cooks Distance vs Leverage",
x = "Leverage",
y = "Cooks Distance",
col = "Cooks Distance >= 4/n")
Q: Use the `*` and `:` symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
A:
Since we have relatively few predictors, we can test all interactions with `mpg ~ . * .` in the call to `lm()`:
summary(lm(formula = mpg ~ . * ., data = Auto[, -9]))
##
## Call:
## lm(formula = mpg ~ . * ., data = Auto[, -9])
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6008 -1.2863 0.0813 1.2082 12.0382
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.401e+01 5.147e+01 0.855 0.393048
## cylinders 3.302e+00 8.187e+00 0.403 0.686976
## displacement -3.529e-01 1.974e-01 -1.788 0.074638 .
## horsepower 5.312e-01 3.390e-01 1.567 0.117970
## weight -3.259e-03 1.820e-02 -0.179 0.857980
## acceleration -6.048e+00 2.147e+00 -2.818 0.005109 **
## year 4.833e-01 5.923e-01 0.816 0.415119
## originEuropean -3.517e+01 1.260e+01 -2.790 0.005547 **
## originJapanese -3.765e+01 1.426e+01 -2.640 0.008661 **
## cylinders:displacement -6.316e-03 7.106e-03 -0.889 0.374707
## cylinders:horsepower 1.452e-02 2.457e-02 0.591 0.555109
## cylinders:weight 5.703e-04 9.044e-04 0.631 0.528709
## cylinders:acceleration 3.658e-01 1.671e-01 2.189 0.029261 *
## cylinders:year -1.447e-01 9.652e-02 -1.499 0.134846
## cylinders:originEuropean -7.210e-01 1.088e+00 -0.662 0.508100
## cylinders:originJapanese 1.226e+00 1.007e+00 1.217 0.224379
## displacement:horsepower -5.407e-05 2.861e-04 -0.189 0.850212
## displacement:weight 2.659e-05 1.455e-05 1.828 0.068435 .
## displacement:acceleration -2.547e-03 3.356e-03 -0.759 0.448415
## displacement:year 4.547e-03 2.446e-03 1.859 0.063842 .
## displacement:originEuropean -3.364e-02 4.220e-02 -0.797 0.425902
## displacement:originJapanese 5.375e-02 4.145e-02 1.297 0.195527
## horsepower:weight -3.407e-05 2.955e-05 -1.153 0.249743
## horsepower:acceleration -3.445e-03 3.937e-03 -0.875 0.382122
## horsepower:year -6.427e-03 3.891e-03 -1.652 0.099487 .
## horsepower:originEuropean -4.869e-03 5.061e-02 -0.096 0.923408
## horsepower:originJapanese 2.289e-02 6.252e-02 0.366 0.714533
## weight:acceleration -6.851e-05 2.385e-04 -0.287 0.774061
## weight:year -8.065e-05 2.184e-04 -0.369 0.712223
## weight:originEuropean 2.277e-03 2.685e-03 0.848 0.397037
## weight:originJapanese -4.498e-03 3.481e-03 -1.292 0.197101
## acceleration:year 6.141e-02 2.547e-02 2.412 0.016390 *
## acceleration:originEuropean 9.234e-01 2.641e-01 3.496 0.000531 ***
## acceleration:originJapanese 7.159e-01 3.258e-01 2.198 0.028614 *
## year:originEuropean 2.932e-01 1.444e-01 2.031 0.043005 *
## year:originJapanese 3.139e-01 1.483e-01 2.116 0.035034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.628 on 356 degrees of freedom
## Multiple R-squared: 0.8967, Adjusted R-squared: 0.8866
## F-statistic: 88.34 on 35 and 356 DF, p-value: < 2.2e-16
We can see the significant terms (at the 0.05 level) are those with at least one asterisk (`*`). It is probably unreasonable to use a significance level of 0.05 here, as we are testing such a large number of hypotheses; a lower threshold for significance (or a p-value correction using the `p.adjust()` function) would perhaps be more appropriate (a sketch of such a correction follows the list below).
Using the standard threshold of 0.05, the significant interaction terms are given by:
cylinders:acceleration
acceleration:year
acceleration:originEuropean
acceleration:originJapanese
year:originEuropean
year:originJapanese
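A sketch of the correction mentioned above, pulling the interaction p-values out with `broom::tidy()` and adjusting them (the choice of the Benjamini-Hochberg method here is mine):
lm(mpg ~ . * ., data = Auto[, -9]) %>%
  broom::tidy() %>%
  filter(grepl(":", term)) %>%                        # keep only the interaction terms
  mutate(p_adjusted = p.adjust(p.value, method = "BH")) %>%
  arrange(p_adjusted) %>%
  head()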
Q: Try a few different transformations of the variables, such as \(log(X)\), \(\sqrt{X}\), \(X^2\). Comment on your findings.
A:
I use my `best_predictor()` function here, as defined in my chapter 2 solutions.
There is no testing for \(\sqrt{X}\) transformations here, but these tend to perform similarly to transformations of the form \(log(X)\).
best_predictor <- function(dataframe, response) {
if (sum(sapply(dataframe, function(x) {is.numeric(x) | is.factor(x)})) < ncol(dataframe)) {
stop("Make sure that all variables are of class numeric/factor!")
}
# pre-allocate vectors
varname <- c()
vartype <- c()
R2 <- c()
R2_log <- c()
R2_quad <- c()
AIC <- c()
AIC_log <- c()
AIC_quad <- c()
y <- dataframe[ ,response]
# # # # # NUMERIC RESPONSE # # # # #
if (is.numeric(y)) {
for (i in 1:ncol(dataframe)) {
x <- dataframe[ ,i]
varname[i] <- names(dataframe)[i]
if (class(x) %in% c("numeric", "integer")) {
vartype[i] <- "numeric"
} else {
vartype[i] <- "categorical"
}
if (!identical(y, x)) {
# linear: y ~ x
R2[i] <- summary(lm(y ~ x))$r.squared
# log-transform: y ~ log(x)
if (is.numeric(x)) {
if (min(x) <= 0) { # if y ~ log(x) for min(x) <= 0, do y ~ log(x + abs(min(x)) + 1)
R2_log[i] <- summary(lm(y ~ log(x + abs(min(x)) + 1)))$r.squared
} else {
R2_log[i] <- summary(lm(y ~ log(x)))$r.squared
}
} else {
R2_log[i] <- NA
}
# quadratic: y ~ x + x^2
if (is.numeric(x)) {
R2_quad[i] <- summary(lm(y ~ x + I(x^2)))$r.squared
} else {
R2_quad[i] <- NA
}
} else {
R2[i] <- NA
R2_log[i] <- NA
R2_quad[i] <- NA
}
}
print(paste("Response variable:", response))
data.frame(varname,
vartype,
R2 = round(R2, 3),
R2_log = round(R2_log, 3),
R2_quad = round(R2_quad, 3)) %>%
mutate(max_R2 = pmax(R2, R2_log, R2_quad, na.rm = T)) %>%
arrange(desc(max_R2))
# # # # # CATEGORICAL RESPONSE # # # # #
} else {
for (i in 1:ncol(dataframe)) {
x <- dataframe[ ,i]
varname[i] <- names(dataframe)[i]
if (class(x) %in% c("numeric", "integer")) {
vartype[i] <- "numeric"
} else {
vartype[i] <- "categorical"
}
if (!identical(y, x)) {
# linear: y ~ x
AIC[i] <- summary(glm(y ~ x, family = "binomial"))$aic
# log-transform: y ~ log(x)
if (is.numeric(x)) {
if (min(x) <= 0) { # if y ~ log(x) for min(x) <= 0, do y ~ log(x + abs(min(x)) + 1)
AIC_log[i] <- summary(glm(y ~ log(x + abs(min(x)) + 1), family = "binomial"))$aic
} else {
AIC_log[i] <- summary(glm(y ~ log(x), family = "binomial"))$aic
}
} else {
AIC_log[i] <- NA
}
# quadratic: y ~ x + x^2
if (is.numeric(x)) {
AIC_quad[i] <- summary(glm(y ~ x + I(x^2), family = "binomial"))$aic
} else {
AIC_quad[i] <- NA
}
} else {
AIC[i] <- NA
AIC_log[i] <- NA
AIC_quad[i] <- NA
}
}
print(paste("Response variable:", response))
data.frame(varname,
vartype,
AIC = round(AIC, 3),
AIC_log = round(AIC_log, 3),
AIC_quad = round(AIC_quad, 3)) %>%
mutate(min_AIC = pmin(AIC, AIC_log, AIC_quad, na.rm = T)) %>%
arrange(min_AIC)
}
}
best_predictor(Auto[ ,-9], "mpg")
## [1] "Response variable: mpg"
## varname vartype R2 R2_log R2_quad max_R2
## 1 weight numeric 0.693 0.713 0.715 0.715
## 2 displacement numeric 0.648 0.686 0.689 0.689
## 3 horsepower numeric 0.606 0.668 0.688 0.688
## 4 cylinders numeric 0.605 0.603 0.607 0.607
## 5 year numeric 0.337 0.332 0.368 0.368
## 6 origin categorical 0.332 NA NA 0.332
## 7 acceleration numeric 0.179 0.190 0.194 0.194
## 8 mpg numeric NA NA NA NA
Let's visualize:
mpg_predictors <- best_predictor(Auto[ ,-9], "mpg") %>%
select(-c(vartype, max_R2)) %>%
gather(key = "key", value = "R2", -varname) %>%
filter(!is.na(R2))
mpg_predictors_order <- best_predictor(Auto[ ,-9], "mpg") %>%
select(varname, max_R2) %>%
filter(!is.na(max_R2)) %>%
arrange(desc(max_R2)) %>%
pull(varname)
mpg_predictors$varname <- factor(mpg_predictors$varname,
ordered = T,
levels = mpg_predictors_order)
ggplot(mpg_predictors, aes(x = R2, y = varname, col = key, group = varname)) +
geom_line(col = "grey15") +
geom_point(size = 2) +
theme_light() +
theme(legend.position = "bottom") +
labs(title = "Best predictors (& transformations) for mpg",
col = "Predictor Transformation",
y = "Predictor") +
scale_color_discrete(labels = c("None", "Log-Transformed", "Order 2 Polynomial"))
We might say from this that we should test quadratic terms for `weight`, `displacement`, `horsepower` & `year`.
Recall, for the standard linear model, we get an adjusted \(R^2\) of 0.8205.
mpg_lm_2 <- lm(mpg ~ . - name + I(weight^2) + I(displacement^2) + I(horsepower^2) + I(year^2), data = Auto)
summary(mpg_lm_2)
##
## Call:
## lm(formula = mpg ~ . - name + I(weight^2) + I(displacement^2) +
## I(horsepower^2) + I(year^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4816 -1.5384 0.0735 1.3671 12.0213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.185e+02 6.966e+01 6.008 4.40e-09 ***
## cylinders 5.073e-01 3.191e-01 1.590 0.112692
## displacement -3.328e-02 2.045e-02 -1.627 0.104480
## horsepower -1.781e-01 3.953e-02 -4.506 8.81e-06 ***
## weight -1.114e-02 2.587e-03 -4.306 2.12e-05 ***
## acceleration -1.700e-01 9.652e-02 -1.762 0.078960 .
## year -1.019e+01 1.837e+00 -5.546 5.49e-08 ***
## originEuropean 1.323e+00 5.304e-01 2.494 0.013068 *
## originJapanese 1.258e+00 5.129e-01 2.452 0.014637 *
## I(weight^2) 1.182e-06 3.438e-07 3.439 0.000649 ***
## I(displacement^2) 5.839e-05 3.435e-05 1.700 0.089967 .
## I(horsepower^2) 4.388e-04 1.336e-04 3.284 0.001118 **
## I(year^2) 7.210e-02 1.207e-02 5.974 5.35e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.776 on 379 degrees of freedom
## Multiple R-squared: 0.8773, Adjusted R-squared: 0.8735
## F-statistic: 225.9 on 12 and 379 DF, p-value: < 2.2e-16
The adjusted \(R^2\) has improved quite significantly: 0.8735
Notice how it seems like there is less non-linearity in the residuals.
par(mfrow=c(2,2))
plot(mpg_lm_2)
However, there is still some fan-shape to the residual plot, and the Q-Q plot is showing that the residuals are not as normal as we would hope. Perhaps a log-transformation of the response (`mpg`) will help this:
mpg_lm_3 <- lm(log(mpg) ~ . - name, data = Auto)
summary(mpg_lm_3)
##
## Call:
## lm(formula = log(mpg) ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40380 -0.06679 0.00493 0.06913 0.33036
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.712e+00 1.673e-01 10.230 < 2e-16 ***
## cylinders -2.781e-02 1.149e-02 -2.420 0.01598 *
## displacement 7.874e-04 2.738e-04 2.876 0.00425 **
## horsepower -1.520e-03 4.904e-04 -3.100 0.00208 **
## weight -2.639e-04 2.344e-05 -11.260 < 2e-16 ***
## acceleration -1.403e-03 3.513e-03 -0.399 0.68996
## year 3.055e-02 1.852e-03 16.491 < 2e-16 ***
## originEuropean 8.531e-02 2.026e-02 4.210 3.18e-05 ***
## originJapanese 8.145e-02 1.977e-02 4.119 4.66e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1183 on 383 degrees of freedom
## Multiple R-squared: 0.8815, Adjusted R-squared: 0.879
## F-statistic: 356.1 on 8 and 383 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mpg_lm_3)
Pushing for even higher performance from the linear model, I make the following changes.
Note that I have tried to avoid adding too many complex interactions & transformations to the model, but have instead focused on extracting the relationships that I think are most likely to exist and add performance on new data:
Log-transforming the response (`mpg` \(\rightarrow\) `log(mpg)`)
Adding the `brand` variable that I created in chapter 2, through text-mining the `name` variable
Auto_2 <- Auto %>%
mutate(mpg = log(mpg))
# new variable
Auto_2$brand <- sapply(strsplit(as.character(Auto_2$name), split = " "),
function(x) x[1]) # extract the first item from each list element
Auto_2$brand <- factor(ifelse(Auto_2$brand %in% c("vokswagen", "vw"), "volkswagen",
ifelse(Auto_2$brand == "toyouta", "toyota",
ifelse(Auto_2$brand %in% c("chevroelt", "chevy"), "chevrolet",
ifelse(Auto_2$brand == "maxda", "mazda",
Auto_2$brand))))) # fixing typo's
Auto_2$brand <- fct_lump(Auto_2$brand,
n = 9,
other_level = "uncommon") # collapse into 10 categories
Auto_2$name <- NULL
# gives ideas for transformations:
# best_predictor(Auto_2, "mpg")
# gives ideas for interactions:
# summary(lm(mpg ~ . * ., data = Auto_2[ ,-9]))
mpg_lm_4 <- lm(mpg ~ . + I(horsepower^2) + I(year^2) + acceleration:year + acceleration:origin, data = Auto_2)
summary(mpg_lm_4)
##
## Call:
## lm(formula = mpg ~ . + I(horsepower^2) + I(year^2) + acceleration:year +
## acceleration:origin, data = Auto_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.36578 -0.05924 0.00281 0.05853 0.32154
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.407e+01 2.668e+00 5.274 2.28e-07 ***
## cylinders -6.017e-03 1.101e-02 -0.547 0.584993
## displacement -1.286e-04 2.835e-04 -0.454 0.650361
## horsepower -8.124e-03 1.344e-03 -6.046 3.63e-09 ***
## weight -1.541e-04 2.578e-05 -5.978 5.34e-09 ***
## acceleration -1.506e-01 4.536e-02 -3.320 0.000991 ***
## year -2.541e-01 7.136e-02 -3.560 0.000419 ***
## originEuropean -2.801e-01 9.866e-02 -2.839 0.004772 **
## originJapanese -2.721e-01 1.229e-01 -2.214 0.027430 *
## brandbuick 6.683e-02 3.351e-02 1.994 0.046840 *
## brandchevrolet 3.528e-02 2.565e-02 1.375 0.169816
## branddatsun 1.896e-01 3.943e-02 4.809 2.21e-06 ***
## branddodge 5.767e-02 2.936e-02 1.965 0.050209 .
## brandford -3.803e-03 2.573e-02 -0.148 0.882547
## brandplymouth 6.476e-02 2.834e-02 2.285 0.022871 *
## brandtoyota 1.106e-01 3.842e-02 2.880 0.004207 **
## brandvolkswagen 2.639e-02 4.022e-02 0.656 0.512122
## branduncommon 6.968e-02 2.657e-02 2.622 0.009100 **
## I(horsepower^2) 1.789e-05 4.230e-06 4.229 2.96e-05 ***
## I(year^2) 1.685e-03 4.829e-04 3.490 0.000540 ***
## acceleration:year 1.690e-03 5.954e-04 2.839 0.004782 **
## acceleration:originEuropean 1.912e-02 5.590e-03 3.420 0.000697 ***
## acceleration:originJapanese 1.544e-02 7.407e-03 2.084 0.037829 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1043 on 369 degrees of freedom
## Multiple R-squared: 0.9112, Adjusted R-squared: 0.9059
## F-statistic: 172.2 on 22 and 369 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mpg_lm_4)
The heteroscedasticity of the residuals has definitely decreased significantly since the first linear model. The adjusted \(R^2\) has improved from 0.8205 to 0.9059, which is pretty impressive given that no additional data has been collected.
Carseats Dataset (Multiple Linear Regression)
This question should be answered using the `Carseats` data set.
`Sales`
Q: Fit a multiple regression model to predict `Sales` using `Price`, `Urban`, and `US`.
A:
`Carseats` is in the `ISLR` package, so the data has already been loaded.
glimpse(Carseats)
## Rows: 400
## Columns: 11
## $ Sales <dbl> 9.50, 11.22, 10.06, 7.40, 4.15, 10.81, 6.63, 11.85, 6.5...
## $ CompPrice <dbl> 138, 111, 113, 117, 141, 124, 115, 136, 132, 132, 121, ...
## $ Income <dbl> 73, 48, 35, 100, 64, 113, 105, 81, 110, 113, 78, 94, 35...
## $ Advertising <dbl> 11, 16, 10, 4, 3, 13, 0, 15, 0, 0, 9, 4, 2, 11, 11, 5, ...
## $ Population <dbl> 276, 260, 269, 466, 340, 501, 45, 425, 108, 131, 150, 5...
## $ Price <dbl> 120, 83, 80, 97, 128, 72, 108, 120, 124, 124, 100, 94, ...
## $ ShelveLoc <fct> Bad, Good, Medium, Medium, Bad, Bad, Medium, Good, Medi...
## $ Age <dbl> 42, 65, 59, 55, 38, 78, 71, 67, 76, 76, 26, 50, 62, 53,...
## $ Education <dbl> 17, 10, 12, 14, 13, 16, 15, 10, 10, 17, 10, 13, 18, 18,...
## $ Urban <fct> Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, No, No, No, Yes,...
## $ US <fct> Yes, Yes, Yes, Yes, No, Yes, No, Yes, No, Yes, Yes, Yes...
sales_lm <- lm(Sales ~ Price + Urban + US, data = Carseats)
Q: Provide an interpretation of each coefficient in the model. Be careful-some of the variables in the model are qualitative!
A:
summary(sales_lm)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
`Price` = -0.054. Interpretation: the effect of a 1-unit increase in `Price` (for fixed values of `Urban` & `US`) is a change in `Sales` of -0.054 units (54 sales).
`Urban` = -0.022. Interpretation: the effect of a store being in an urban area (for fixed values of `Price` & `US`) is a change in `Sales` of -0.022 units (22 fewer sales). However, in this case, since the p-value for this variable's t-test is so high, we can say that there is no evidence for a relationship between the car seat `Sales` at a store and whether the store was `Urban` (or rural).
`US` = 1.200. Interpretation: the effect of a store being in the US (for fixed values of `Price` & `Urban`) is a change in `Sales` of 1.2 units (1200 sales).
Q: Write out the model in equation form, being careful to handle the qualitative variables properly.
A:
\[\hat{Sales} = 13.043469 - 0.054459\cdot Price - 0.021916\cdot Urban + 1.200573\cdot US\]
Where \(Urban = 1\) if the store is in an urban location (0 otherwise), and \(US = 1\) if the store is in the US (0 otherwise).
Q: For which of the predictors can you reject the null hypothesis \(H_{0}: \beta_{j} = 0\)
A:
This question is asking about the results from the parameter t-tests. Based on the output & comments from part (b), we can reject the null hypothesis for the `Price` and `US` predictors, but there is insufficient evidence to reject the null hypothesis that the coefficient for `Urban` is zero.
Q: On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
A:
Removing the `Urban` variable:
sales_lm_2 <- lm(Sales ~ Price + US, data = Carseats)
summary(sales_lm_2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Q: How well do the models in (a) and (e) fit the data?
A:
For model (a), we have: \(R^2\) = 0.2393 and \(\bar{R}^2\) = 0.2335.
For model (e), we have: \(R^2\) = 0.2393 and \(\bar{R}^2\) = 0.2354.
In this case, both models explain ~23.9% of the variance in Sales
.
This is an interesting case to talk about using training \(R^2\) vs \(\bar{R}^2\) (adjusted \(R^2\)) in model selection, since model (a) has a higher \(R^2\) but lower \(\bar{R}^2\).
Recall:
\[R^2 = 1 - \frac{RSS}{TSS}\]
\[\bar{R}^2 = 1 - \frac{RSS/(n - p - 1)}{TSS / (n - 1)}\]
\[RSS = \sum_{i = 1}^n{(y_i - \hat{y})^2}\]
\[TSS = \sum_{i = 1}^n{(y_i - \bar{y})^2}\]
\(TSS\) will be constant regardless of the model being fit, and so maximizing \(\bar{R}^2\) is equivalent to minimizing \(RSS/(n - p - 1)\).
The fact that \(RSS\) will always decrease as variables are added to the model explains why \(R^2\) always increases with additional variables.
However, with \(\bar{R}^2\), adding new variables means that the decrease in \(RSS\) is now ‘competing’ with the decrease in \(n - p - 1\) (caused by the increase in \(p\)), so it is less clear whether \(\bar{R}^2\) will increase or not.
In this particular case, we can see that model (a) adds a useless variable (`Urban`) relative to model (e) and sees a (marginal) increase in \(R^2\) but a decrease in \(\bar{R}^2\), because the reduction in \(RSS\) is not enough to outweigh the penalty from the increase in \(p\).
We should really choose (e) as the better model in this case, as it is more parsimonious, and this is a good example of why the training \(R^2\) or \(RSS\) can be very limited for model selection.
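To tie this back to the formulas above, \(\bar{R}^2\) for model (e) can be reproduced by hand from the \(RSS\) and \(TSS\) (a quick check, not new analysis):
n <- nrow(Carseats)
p <- 2                                    # Price + US
RSS <- sum(residuals(sales_lm_2)^2)
TSS <- sum((Carseats$Sales - mean(Carseats$Sales))^2)
c(manual = 1 - (RSS / (n - p - 1)) / (TSS / (n - 1)),
  from_summary = summary(sales_lm_2)$adj.r.squared)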
Q: Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
A:
The confidence intervals are calculated using:
\[\hat{\beta_i} \pm t_{n-p-1}(0.975) \cdot SE(\hat{\beta_i})\]
Where \(t_{n-p-1}(0.975)\) is the 97.5th percentile of the t-distribution with \(n - p - 1\) = 397 residual degrees of freedom (`qt(0.975, 397)`).
confint(sales_lm_2, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
Strictly, the interval (-0.0648, -0.0442) was produced by a procedure that captures the true `Price` parameter (\(\beta_1\)) in 95% of repeated samples; loosely speaking, we are 95% confident that \(\beta_1\) lies within this interval.
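These intervals can be reproduced by hand from the coefficient table, using the formula above; the result should match the `confint()` output.
coefs <- coef(summary(sales_lm_2))
t_crit <- qt(0.975, df = df.residual(sales_lm_2))   # 397 residual degrees of freedom
cbind(lower = coefs[, "Estimate"] - t_crit * coefs[, "Std. Error"],
      upper = coefs[, "Estimate"] + t_crit * coefs[, "Std. Error"])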
Q: Is there evidence of outliers or high leverage observations in the model from (e)?
A:
Since the average leverage statistic is always equal to \((p + 1)/n\), we can identify observations with a leverage far higher than \((p + 1)/n\) as ‘high-leverage’.
# NOTE: broom::augment(lm)$.hat is the same as hatvalues(lm)
# (p + 1) / n = average leverage?
round(((2 + 1) / nrow(Carseats)), 10) == round(mean(hatvalues(sales_lm_2)), 10)
## [1] TRUE
Similarly, we can identify potential outliers as those observations with a standardized residual outside of \([-2, 2]\).
broom::augment(sales_lm_2) %>%
select(.hat, .std.resid) %>%
ggplot(aes(x = .hat, y = .std.resid)) +
geom_point() +
geom_hline(yintercept = -2, col = "deepskyblue3", size = 1) +
geom_hline(yintercept = 2, col = "deepskyblue3", size = 1) +
geom_vline(xintercept = 3 / nrow(Carseats), col = "mediumseagreen", size = 1) +
theme_light() +
labs(x = "Leverage",
y = "Standardized Residual",
title = "Residuals vs Leverage")
Looking at these thresholds on a graph, it is hard to identify when an observation is large enough in residual & leverage to be influential and worth investigating.
This is why I prefer using a metric that combines the two (e.g. Cook's distance):
broom::augment(sales_lm_2) %>%
rownames_to_column("rowid") %>%
arrange(desc(.cooksd)) %>%
select(Sales, Price, US, .std.resid, .hat, .cooksd)
## # A tibble: 400 x 6
## Sales Price US .std.resid .hat .cooksd
## <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
## 1 14.9 82 No 2.58 0.0116 0.0261
## 2 14.4 53 No 1.73 0.0237 0.0243
## 3 10.6 149 No 2.32 0.0126 0.0228
## 4 15.6 72 Yes 2.17 0.0129 0.0205
## 5 0.37 191 Yes -1.42 0.0286 0.0198
## 6 16.3 92 Yes 2.87 0.00664 0.0183
## 7 3.47 81 No -2.10 0.0119 0.0177
## 8 0.53 159 Yes -2.05 0.0119 0.0169
## 9 13.6 89 No 2.18 0.00983 0.0158
## 10 3.02 90 Yes -2.56 0.00710 0.0157
## # ... with 390 more rows
In this case it’s unlikely that you could justify removing any other observations or class them as problematic. The model is very simple with \(p\) = 2, and many of these observations would not be an issue if more strong predictors were added.
In this problem we will investigate the t-statistic for the null hypothesis \(H_{0}: \beta = 0\) in simple linear regression without an intercept. To begin, we generate a predictor `x` and a response `y` as follows.
set.seed(1)
x = rnorm(100)
y = 2*x + rnorm(100)
`y` using `x` (no intercept)
Q: Perform a simple linear regression of `y` onto `x`, without an intercept. Report the coefficient estimate \(\hat{\beta_a}\), the standard error of this coefficient estimate, and the t-statistic and p-value associated with the null hypothesis \(H_{0}: \beta_a = 0\). Comment on these results.
A:
lm_1 <- lm(y ~ x + 0)
summary(lm_1)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9154 -0.6472 -0.1771 0.5056 2.3109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.9939 0.1065 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
We have: \(\hat{\beta_a}\) = 1.9939, \(SE(\hat{\beta_a})\) = 0.1065, a t-statistic of 18.73 and a p-value of < 2e-16.
The model p-value strongly suggests that there is a relationship between `y` & `x`.
`lm_1` describes a relationship of the form:
\[\hat{y} = \hat{\beta_a}\cdot x = 1.9939\cdot x\]
`x` using `y` (no intercept)
Q: Perform a simple linear regression of `x` onto `y`, without an intercept. Report the coefficient estimate \(\hat{\beta_b}\), the standard error of this coefficient estimate, and the t-statistic and p-value associated with the null hypothesis \(H_{0}: \beta_b = 0\). Comment on these results.
A:
lm_2 <- lm(x ~ y + 0)
summary(lm_2)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8699 -0.2368 0.1030 0.2858 0.8938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.39111 0.02089 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4246 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
We have: \(\hat{\beta_b}\) = 0.3911, \(SE(\hat{\beta_b})\) = 0.0209, a t-statistic of 18.73 and a p-value of < 2e-16.
The model p-value strongly suggests that there is a relationship between `x` & `y`.
`lm_2` describes a relationship of the form:
\[\hat{x} = \hat{\beta_b}\cdot y = 0.3911\cdot y\]
Q: What is the relationship between the results obtained in (a) and (b)?
A:
In both models (`lm_1` & `lm_2`) we have the same t-statistic (and hence, p-value), but the parameter estimates (\(\hat{\beta_a}\) & \(\hat{\beta_b}\)) and their standard error estimates are different, which we would expect.
Since the underlying relationship is given by \(y = 2x + \epsilon\), and reversing this for \(x\) gives \(x = \frac{y - \epsilon}{2}\), I would have expected to get: \(\hat{\beta_a} \approx 2\) and \(\hat{\beta_b} \approx \frac{1}{2} = 0.5\).
However, we get \(\hat{\beta_a}\) = 1.9939 (which is close), and \(\hat{\beta_b}\) = 0.3911 (which is not close)!
It’s interesting that \(\hat{\beta_b} \ne \frac{1}{\hat{\beta_a}}\), and this is not just some peculiarity with performing linear regression without an intercept! The same results can be observed when doing regression with an intercept:
lm_3 <- lm(y ~ x)
summary(lm_3)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8768 -0.6138 -0.1395 0.5394 2.3462
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03769 0.09699 -0.389 0.698
## x 1.99894 0.10773 18.556 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9628 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
\[\hat{y} = -0.0377 + 1.9989\cdot x\]
lm_4 <- lm(x ~ y)
summary(lm_4)
##
## Call:
## lm(formula = x ~ y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.90848 -0.28101 0.06274 0.24570 0.85736
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.03880 0.04266 0.91 0.365
## y 0.38942 0.02099 18.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4249 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
\[\hat{x} = 0.0388 + 0.3894\cdot y\]
In the case of `lm_1`, we are minimizing: \(RSS = \sum_{i = 1}^n{(y_i - \hat{y}_i)^2}\). Minimizing this vertical distance yields a slightly flatter line (with a smaller slope).
data <- data.frame(x, y)
ggplot(data, aes(x, y)) +
geom_point(color="red") +
geom_segment(aes(x = x, y = y,
xend = x, yend = lm_1$fitted.values), col = "grey35") +
geom_abline(intercept = 0, slope = coef(lm_1), size = 1, col = "deepskyblue3") +
labs(title = "lm_1: predicting y using x")
Whereas, in the case of `lm_2`, we are minimizing \(RSS = \sum_{i = 1}^n{(x_i - \hat{x}_i)^2}\). Minimizing this horizontal distance yields a slightly steeper line (with a larger slope).
ggplot(data, aes(x, y)) +
geom_point(color="red") +
geom_segment(aes(x = x, y = y,
xend = lm_2$fitted.values, yend = y), col = "grey35") +
geom_abline(intercept = 0, slope = 1 / coef(lm_2), size = 1, col = "mediumseagreen") +
labs(title = "lm_2: predicting x using y")
Here, I visualize the two regression lines (with no intercept). Note that, if a human were to draw what they expect the ‘line of best fit’ to be, it would likely be between these two lines!
ggplot(data, aes(x, y)) +
geom_point(color="red") +
geom_abline(intercept = 0, slope = coef(lm_1), size = 1, col = "deepskyblue3") +
geom_abline(intercept = 0, slope = 1 / coef(lm_2), size = 1, col = "mediumseagreen")
Fundamentally, I think this comes down to the fact that we are minimizing different things when fitting the two regressions. The term ‘line of best fit’ is deceptive, really, because the quantity we choose to minimize changes depending on what we choose as our response.
Following this graphical logic, it would make sense for these two lines (`y ~ x`, and `x ~ y`) to approach one another as the noise in the relationship decreased, and the relationship approached a straight line.
Increased Noise:
I increase the \(\epsilon\) term in the underlying relationship, so now \(\epsilon\) has `sd = 1.75` instead of `sd = 1`. This produces a noisier relationship between `x` and `y`.
set.seed(1)
x = rnorm(100)
y = 2*x + rnorm(100, sd = 1.75) # previously: y = 2*x + rnorm(100, sd = 1)
data <- data.frame(x,y)
lm_5 <- lm(y ~ x + 0)
lm_6 <- lm(x ~ y + 0)
ggplot(data, aes(x, y)) +
geom_point(color="red") +
geom_abline(intercept = 0, slope = coef(lm_5), size = 1, col = "deepskyblue3") +
geom_abline(intercept = 0, slope = 1 / coef(lm_6), size = 1, col = "mediumseagreen") +
labs(title = "Regression lines (lm_5 & lm_6)",
subtitle = "sd(epsilon) increased to 1.75")
For lm_5
we get:
For lm_6
we get:
Decreased Noise:
I now reduce the \(\epsilon\) term in the underlying relationship to simulate this, so now \(\epsilon\) has `sd = 0.25` instead of `sd = 1`:
set.seed(1)
x = rnorm(100)
y = 2*x + rnorm(100, sd = 0.25) # previously: y = 2*x + rnorm(100, sd = 1)
data <- data.frame(x,y)
lm_7 <- lm(y ~ x + 0)
lm_8 <- lm(x ~ y + 0)
ggplot(data, aes(x, y)) +
geom_point(color="red") +
geom_abline(intercept = 0, slope = coef(lm_7), size = 1, col = "deepskyblue3") +
geom_abline(intercept = 0, slope = 1 / coef(lm_8), size = 1, col = "mediumseagreen") +
labs(title = "Regression lines (lm_7 & lm_8)",
subtitle = "sd(epsilon) reduced to 0.25")
For lm_7
we get:
For lm_8
we get:
Notice how, as \(R^2 \rightarrow\) 1, \(\hat{\beta_b} \rightarrow \frac{1}{\hat{\beta_a}}\), and it is possible to show that these parameter estimates will have a perfect reciprocal relationship only when \(R^2 = 1\).
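A quick numerical check of this claim (a sketch, assuming lm_7 and lm_8 from the low-noise simulation are still in the workspace): for no-intercept fits, the product of the two slope estimates is \(\frac{(\sum{xy})^2}{\sum{x^2}\sum{y^2}}\), which is exactly the (uncentred) \(R^2\) that R reports for such models, so the product approaches 1 as \(R^2 \rightarrow 1\).
coef(lm_7) * coef(lm_8)  # product of the two slope estimates
summary(lm_7)$r.squared  # should equal the product above for these no-intercept fits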
I’m now reaching the end of my knowledge so I will leave this question here, but hopefully that has made clearer the mechanics for why the regression coefficients are different, even though (for me) it’s a confusing concept to wrap my head around.
Q: For the regression of \(Y\) onto \(X\) without an intercept… show algebraically, and confirm numerically in R, that the t-statistic can be written as: \(T = \frac{\hat{\beta}}{SE(\hat{\beta})} = \frac {(\sqrt{n-1}) \sum{x y}} {\sqrt{(\sum{x^2})(\sum{y^2}) - (\sum{x y})^2}}\)
A:
For regression without an intercept, we have the following results:
\[\hat{\beta} = \frac{\sum{x y}}{\sum{x^2}} \\ SE(\hat{\beta}) = \sqrt{\frac{\sum{(y - x \hat{\beta})^2}}{(n-1) \sum{x^2}}}\]
Hence:
\[T = \frac{\hat{\beta}}{SE(\hat{\beta})} = \frac{\frac{\sum{x y}}{\sum{x^2}}} {\sqrt{\frac{\sum{(y - x \hat{\beta})^2}}{(n-1) \sum{x^2}}}}\]
Using the same logic as in the \(RSS\) algebra in question 7 (since \(\hat{\beta_{0}}\) = 0 and \(\hat{y} = x \hat{\beta}\)):
\[\begin{align*} RSS & = \sum{(y - x \hat{\beta})^2} \\ & = \sum{(y - \hat{y})^2} \\ & = \sum{y^2} - 2\hat{\beta}\sum{xy} + \hat{\beta}^2\sum{x^2} &&\text{(steps shown in q7)} \\ & = \frac{\sum{y^2}\sum{x^2} - (\sum{xy})^2}{\sum{x^2}} &&\text{(steps shown in q7)} \end{align*} \]
We can use this result to simplify the denominator:
\[\begin{align*} SE(\hat{\beta}) & = \sqrt{\frac{\sum{(y - x \hat{\beta})^2}}{(n-1) \sum{x^2}}} \\ & = \sqrt{\frac{\left(\frac{\sum{y^2}\sum{x^2} - (\sum{xy})^2}{\sum{x^2}} \right)}{(n-1) \sum{x^2}}} && \text{(substituting the result for } \sum{(y - x \hat{\beta})^2} \text{)}\\ & = \sqrt{\frac{\sum{y^2}\sum{x^2} - (\sum{xy})^2}{(n-1)(\sum{x^2})^2}} \\ & = \frac{\sqrt{\sum{y^2}\sum{x^2} - (\sum{xy})^2}} {\sqrt{n-1}\sum{x^2}} \end{align*} \]
Subbing this in, we get to the required equality:
\[\begin{align*} T & = \frac{\hat{\beta}}{SE(\hat{\beta})} \\ & = \frac{ \left(\frac{\sum{x y}}{\sum{x^2}}\right) } {\left(\frac{\sqrt{\sum{y^2}\sum{x^2} - (\sum{xy})^2}} {\sqrt{n-1}\sum{x^2}}\right)} \\ & = \frac{\sum{x y}}{\sum{x^2}} \cdot \frac{\sqrt{n-1}\sum{x^2}}{\sqrt{\sum{y^2}\sum{x^2} - (\sum{xy})^2}} \\ & = \frac {\sqrt{n-1} \sum{x y}} {\sqrt{(\sum{x^2})(\sum{y^2}) - (\sum{x y})^2}} \end{align*} \]
I now verify this is the case in R:
Recall lm_1
from earlier:
set.seed(1)
x = rnorm(100)
y = 2*x + rnorm(100)
lm_1 <- lm(y ~ x + 0)
summary(lm_1)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9154 -0.6472 -0.1771 0.5056 2.3109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.9939 0.1065 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
From the model output we obtain the following T-statistic: 18.7259319
Now to verify this, by using the formula from the proof:
\[T = \frac {\sqrt{n-1} \sum{x y}} {\sqrt{(\sum{x^2})(\sum{y^2}) - (\sum{x y})^2}}\]
T_statistic <- (sqrt(length(x) - 1) * sum(x*y)) / sqrt(sum(x^2) * sum(y^2) - sum(x*y)^2)
T_statistic
## [1] 18.72593
Checking whether the T-statistic from the model summary is the same as my manually calculated T-statistic, to 10dp.:
round(as.numeric(T_statistic), 10) == round(as.numeric(summary(lm_1)$coefficients[3]), 10)
## [1] TRUE
Q: Using the results from (d), argue that the t-statistic for the regression of \(y\) onto \(x\) is the same as the t-statistic for the regression of \(x\) onto \(y\).
A:
This is immediate from swapping \(x\) and \(y\) in the formula above:
\[T_{y \sim x} = \frac{\sqrt{n-1} \sum{x y}} {\sqrt{(\sum{x^2})(\sum{y^2}) - (\sum{x y})^2}} = \frac{\sqrt{n-1} \sum{y x}} {\sqrt{(\sum{y^2})(\sum{x^2}) - (\sum{y x})^2}} = T_{x \sim y}\]
Q: In R, show that when regression is performed with an intercept, the t-statistic for \(H_{0}: \beta_{1} = 0\) is the same for the regression of y
onto x
as it is for the regression of x
onto y
.
A:
As before, since we are asked to show this in R rather than prove it, I take it that the authors expect a numerical example (not a proof). Using the example provided earlier:
set.seed(1)
x = rnorm(100)
y = 2*x + rnorm(100)
lm_3 <- lm(y ~ x)
summary(lm_3)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03769261 0.09698729 -0.3886346 6.983896e-01
## x 1.99893961 0.10772703 18.5555993 7.723851e-34
lm_4 <- lm(x ~ y)
summary(lm_4)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.03880394 0.04266144 0.9095787 3.652764e-01
## y 0.38942451 0.02098690 18.5555993 7.723851e-34
They look the same, but just to check, I extract the T-values and check their equality (to 10dp.):
round(summary(lm_3)$coefficients[2, 3], 10) == round(summary(lm_4)$coefficients[2, 3], 10)
## [1] TRUE
This problem involves simple linear regression without an intercept.
Q: Recall that the coefficient estimate \(\hat{\beta_a}\) for the linear regression of \(y\) onto \(x\) without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of \(x\) onto \(y\) the same as the coefficient estimate for the regression of \(y\) onto \(x\)?
A:
Using the same notation as previously, where:
\[\hat{y} = \hat{\beta_a}x \\ \hat{x} = \hat{\beta_b}y\]
We have:
\[\hat{\beta_a} = \frac{\sum{xy}}{\sum{x^2}} \\ \hat{\beta_b} = \frac{\sum{yx}}{\sum{y^2}} \\ \hat{\beta_a} = \hat{\beta_b} \iff \frac{\sum{xy}}{\sum{x^2}} = \frac{\sum{yx}}{\sum{y^2}} \iff \sum_{i = 1}^n{x_i^2} = \sum_{i = 1}^n{y_i^2}\]
Q: Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of \(x\) onto \(y\) is different from the coefficient estimate for the regression of \(y\) onto \(x\)
A:
All that is required here is that the above equality doesn’t hold, which will be the case for almost any choice of x and y (the exceptions are a few very special cases).
I generate 100 observations with the following properties:
\[y = 5\cdot x + \epsilon \\ \epsilon \sim \mathcal{N}(0, 2^2)\]
set.seed(2)
x <- rnorm(100)
y <- 5*x + rnorm(100, sd = 2)
data <- data.frame(x, y)
lm_9 <- lm(y ~ x + 0)
lm_10 <- lm(x ~ y + 0)
We have:
It’s quite clear that \(\hat{\beta_a} \ne \hat{\beta_b}\):
ggplot(data, aes(x, y)) +
geom_point(color="red") +
geom_abline(intercept = 0, slope = coef(lm_9), size = 1, col = "deepskyblue3") +
geom_abline(intercept = 0, slope = 1 / coef(lm_10), size = 1, col = "mediumseagreen") +
labs(title = "Regression lines (lm_9 & lm_10)")
The actual coefficients:
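They can be pulled straight from the fitted objects (shown as code only; the specific values are whatever this seed produces):
coef(lm_9)   # slope of y ~ x + 0
coef(lm_10)  # slope of x ~ y + 0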
Q: Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of \(x\) onto \(y\) is the same as the coefficient estimate for the regression of \(y\) onto \(x\).
A:
One such example where \(\sum_{i = 1}^n{x_i^2} = \sum_{i = 1}^n{y_i^2}\) would hold is obviously where the two vectors are equal (a relationship where \(R^2\) = 1, and \(\hat{\beta_b} = \frac{1}{\hat{\beta_a}} = \frac{1}{1} = 1\)):
set.seed(3)
x <- rnorm(100)
y <- x
data <- data.frame(x, y)
lm_11 <- lm(y ~ x + 0)
lm_12 <- lm(x ~ y + 0)
ggplot(data, aes(x, y)) +
geom_point(color="red") +
geom_abline(intercept = 0, slope = coef(lm_11), size = 1, col = "deepskyblue3") +
geom_abline(intercept = 0, slope = 1 / coef(lm_12), size = 1, col = "mediumseagreen") +
labs(title = "Regression lines (lm_11 & lm_12)")
We have:
Another example would be where vectors \(x\) and \(y\) are permutations of one another. Clearly \(\sum_{i = 1}^n{x_i^2} = \sum_{i = 1}^n{y_i^2}\) would hold here, so the parameters should also be equal:
set.seed(4)
x <- 1:100
y <- x[order(rnorm(100))]
data <- data.frame(x, y)
lm_13 <- lm(y ~ x + 0)
lm_14 <- lm(x ~ y + 0)
ggplot(data, aes(x, y)) +
geom_point(color="red") +
geom_abline(intercept = 0, slope = coef(lm_13), size = 1, col = "deepskyblue3") +
geom_abline(intercept = 0, slope = 1 / coef(lm_14), size = 1, col = "mediumseagreen") +
labs(title = "Regression lines (lm_13 & lm_14)")
We have:
Note that the previous example showed a perfect relationship where \(\hat{\beta_a} = \hat{\beta_b}\) held true, and this example shows a random relationship where it also holds. You could create a relationship as strong or as weak as desired, simply by permuting fewer (or more) of the elements of \(x\), and still we would see that \(\hat{\beta_a} = \hat{\beta_b}\) holds.
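Here is a sketch of that partial-permutation idea (the seed, the starting vectors and the choice of 20 elements are my own illustrative choices): shuffling only a subset of the elements of y leaves \(\sum{y_i^2}\), and hence the equality \(\sum{x_i^2} = \sum{y_i^2}\), unchanged, so the two no-intercept slope estimates should still agree.
set.seed(6)               # illustrative seed
x <- rnorm(100)
y <- x                    # start from a perfect relationship
idx <- sample(1:100, 20)  # pick 20 positions...
y[idx] <- y[sample(idx)]  # ...and shuffle y among them (sum(y^2) is unchanged)
coef(lm(y ~ x + 0))       # slope of y on x
coef(lm(x ~ y + 0))       # slope of x on y -- should be identical to the line above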
Another obvious example would be where \(y = |x|\):
set.seed(5)
x <- rnorm(100)
y <- abs(x)
data <- data.frame(x, y)
lm_15 <- lm(y ~ x + 0)
lm_16 <- lm(x ~ y + 0)
ggplot(data, aes(x, y)) +
geom_point(color="red") +
geom_abline(intercept = 0, slope = coef(lm_15), size = 1, col = "deepskyblue3") +
geom_abline(intercept = 0, slope = 1 / coef(lm_16), size = 1, col = "mediumseagreen") +
labs(title = "Regression lines (lm_15 & lm_16)")
We have:
In this exercise you will create some simulated data and will fit simple linear regression models to it. Make sure to use set.seed(1)
prior to starting part (a) to ensure consistent results.
set.seed(1)
Q: Using the rnorm()
function, create a vector, x
, containing 100 observations drawn from a \(\mathcal{N}(0, 1)\) distribution. This represents a feature, \(x\)
A:
x <- rnorm(100)
Q: Using the rnorm()
function, create a vector, eps
, containing 100 observations drawn from a \(\mathcal{N}(0, 0.25)\) distribution i.e. a normal distribution with mean zero and variance 0.25.
A:
eps <- rnorm(100, sd = 0.5) # var = sd^2
Q: Using x
and eps, generate a vector y according to the model: \(y = -1 + 0.5x + \epsilon\) What is the length of the vector \(y\)? What are the values of \(\beta_0\) and \(\beta_1\) in this linear model?
A:
y <- -1 + 0.5*x + eps
The length of y is obviously 100, because it must be the same length as x and eps (each element of y is generated from the corresponding elements of x and eps):
length(y)
## [1] 100
We also know the true parameters, which come straight from the data-generating model: \(\beta_0 = -1\) and \(\beta_1 = 0.5\).
Q: Create a scatterplot displaying the relationship between x
and y
. Comment on what you observe.
A:
data <- data.frame(x, y)
g1 <- ggplot(data, aes(x = x, y = y)) +
geom_point()
g1
Comments: there is a positive, approximately linear relationship between x and y, consistent with the population line \(y = -1 + 0.5x\), with a moderate amount of scatter around it coming from the \(\epsilon\) term.
Q: Fit a least squares linear model to predict y
using x
. Comment on the model obtained. How do \(\hat{\beta_{0}}\) and \(\hat{\beta_{1}}\) compare to \(\beta_{0}\) and \(\beta_{1}\)?
A:
lm_17 <- lm(y ~ x)
coef(lm_17)
## (Intercept) x
## -1.0188463 0.4994698
Hence \(\hat{\beta_0} \approx -1.019\) and \(\hat{\beta_1} \approx 0.499\): both estimates are very close to the true values \(\beta_0 = -1\) and \(\beta_1 = 0.5\).
Q: Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different color. Use the legend()
command to create an appropriate legend.
A:
Note: It appears that putting the intercept, slope & col arguments for geom_abline()
within the aes()
call automatically generates a legend.
g1 +
geom_abline(aes(intercept = -1, slope = 0.5, col = "Population")) +
geom_abline(aes(intercept = coef(lm_17)[1], slope = coef(lm_17)[2], col = "Least Squares")) +
scale_colour_manual(name = "Regression Line:", values = c("red", "blue")) +
theme(legend.position = "bottom") +
labs(title = "sd(eps) = 0.5")
Q: Now fit a polynomial regression model that predicts y
using x
and x^2
. Is there evidence that the quadratic term improves the model fit? Explain your answer.
A:
First, we can view the summary of the previous linear model (lm_17 <- lm(y ~ x)
):
summary(lm_17)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.93842 -0.30688 -0.06975 0.26970 1.17309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.01885 0.04849 -21.010 < 2e-16 ***
## x 0.49947 0.05386 9.273 4.58e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4814 on 98 degrees of freedom
## Multiple R-squared: 0.4674, Adjusted R-squared: 0.4619
## F-statistic: 85.99 on 1 and 98 DF, p-value: 4.583e-15
Comparing this to the fit of a new model, lm_18
, which includes a quadratic term (x^2
):
lm_18 <- lm(y ~ x + I(x^2))
summary(lm_18)
##
## Call:
## lm(formula = y ~ x + I(x^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.98252 -0.31270 -0.06441 0.29014 1.13500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.97164 0.05883 -16.517 < 2e-16 ***
## x 0.50858 0.05399 9.420 2.4e-15 ***
## I(x^2) -0.05946 0.04238 -1.403 0.164
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.479 on 97 degrees of freedom
## Multiple R-squared: 0.4779, Adjusted R-squared: 0.4672
## F-statistic: 44.4 on 2 and 97 DF, p-value: 2.038e-14
The coefficient for the x^2
variable is not significant, so there is no statistical evidence to reject the null hypothesis that the coefficient for this quadratic term is zero.
It is interesting that the adjusted \(R^2\) increased (slightly) with the inclusion of the quadratic term, which raises questions about the suitability of blindly using such a metric for variable selection. The improvement is small enough that simply varying the random seed will often make it disappear.
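An equivalent check (a sketch, assuming lm_17 and lm_18 are still in the workspace) is a partial F-test comparing the two nested models; with only a single added term, its p-value should reproduce the 0.164 from the t-test above.
anova(lm_17, lm_18)  # partial F-test for adding the quadratic term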
Q: Repeat (a)-(f) after modifying the data generation process in such a way that there is less noise in the data. The model (3.39) should remain the same. You can do this by decreasing the variance of the normal distribution used to generate the error term \(\epsilon\) in (b). Describe your results.
A:
Here I repeat the process, this time reducing \(Var(\epsilon)\) from \(0.5^2\) to \(0.1^2\).
Note that the parameter estimates are ever-so-slightly closer to the true population parameters, but in both cases they are incredibly close to approximating the underlying function:
set.seed(1)
x <- rnorm(100)
eps <- rnorm(100, sd = 0.1)
y <- -1 + 0.5*x + eps
data <- data.frame(x, y)
lm_19 <- lm(y ~ x)
ggplot(data, aes(x = x, y = y)) +
geom_point() +
geom_abline(aes(intercept = -1, slope = 0.5, col = "Population")) +
geom_abline(aes(intercept = coef(lm_19)[1], slope = coef(lm_19)[2], col = "Least Squares")) +
scale_colour_manual(name = "Regression Line:", values = c("red", "blue")) +
theme(legend.position = "bottom") +
labs(title = "sd(eps) = 0.1")
The standard linear model fit:
summary(lm_19)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.18768 -0.06138 -0.01395 0.05394 0.23462
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.003769 0.009699 -103.5 <2e-16 ***
## x 0.499894 0.010773 46.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09628 on 98 degrees of freedom
## Multiple R-squared: 0.9565, Adjusted R-squared: 0.956
## F-statistic: 2153 on 1 and 98 DF, p-value: < 2.2e-16
Comparing to a quadratic fit:
lm_20 <- lm(y ~ x + I(x^2))
summary(lm_20)
##
## Call:
## lm(formula = y ~ x + I(x^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.19650 -0.06254 -0.01288 0.05803 0.22700
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.994328 0.011766 -84.512 <2e-16 ***
## x 0.501716 0.010798 46.463 <2e-16 ***
## I(x^2) -0.011892 0.008477 -1.403 0.164
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0958 on 97 degrees of freedom
## Multiple R-squared: 0.9573, Adjusted R-squared: 0.9565
## F-statistic: 1088 on 2 and 97 DF, p-value: < 2.2e-16
As we’d expect, \(RSE\) decreases considerably & \(R^2\)/adjusted \(R^2\) increases considerably for both models.
It is interesting that, when using the same random seed, the p-value of the quadratic term does not change at all, despite the noise term (eps = \(\epsilon\)) having a much smaller variance than before. This makes sense: with the same seed, eps is just a rescaled version of the same draws, and rescaling the noise scales both the estimated quadratic coefficient and its standard error by the same factor, leaving the t-statistic (and hence the p-value) unchanged.
The findings are identical to before: the coefficient for the x^2 variable is not significant, so there is no statistical evidence to reject the null hypothesis that the coefficient for this quadratic term is zero.
Q: Repeat (a)-(f) after modifying the data generation process in such a way that there is more noise in the data. The model (3.39) should remain the same. You can do this by increasing the variance of the normal distribution used to generate the error term \(\epsilon\) in (b). Describe your results.
A:
Here I repeat the process, this time increasing \(Var(\epsilon)\) from \(0.5^2\) to \(1^2 = 1\).
set.seed(1)
x <- rnorm(100)
eps <- rnorm(100, sd = 1)
y <- -1 + 0.5*x + eps
data <- data.frame(x, y)
lm_21 <- lm(y ~ x)
ggplot(data, aes(x = x, y = y)) +
geom_point() +
geom_abline(aes(intercept = -1, slope = 0.5, col = "Population")) +
geom_abline(aes(intercept = coef(lm_21)[1], slope = coef(lm_21)[2], col = "Least Squares")) +
scale_colour_manual(name = "Regression Line:", values = c("red", "blue")) +
theme(legend.position = "bottom") +
labs(title = "sd(eps) = 1")
The standard linear model fit:
summary(lm_21)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8768 -0.6138 -0.1395 0.5394 2.3462
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.03769 0.09699 -10.699 < 2e-16 ***
## x 0.49894 0.10773 4.632 1.12e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9628 on 98 degrees of freedom
## Multiple R-squared: 0.1796, Adjusted R-squared: 0.1712
## F-statistic: 21.45 on 1 and 98 DF, p-value: 1.117e-05
Comparing to a quadratic fit:
lm_22 <- lm(y ~ x + I(x^2))
summary(lm_22)
##
## Call:
## lm(formula = y ~ x + I(x^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9650 -0.6254 -0.1288 0.5803 2.2700
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.94328 0.11766 -8.017 2.47e-12 ***
## x 0.51716 0.10798 4.789 6.01e-06 ***
## I(x^2) -0.11892 0.08477 -1.403 0.164
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.958 on 97 degrees of freedom
## Multiple R-squared: 0.1959, Adjusted R-squared: 0.1793
## F-statistic: 11.82 on 2 and 97 DF, p-value: 2.557e-05
As we’d expect, \(RSE\) increases considerably & \(R^2\)/adjusted \(R^2\) decreases considerably for both models.
In this iteration I was surprised at how well the least squares fit still matched the population regression line, as the relationship appears far noisier than before.
Again, the results of the t-test on the quadratic term conclude that there is no evidence for supporting a quadratic relationship.
Q: What are the confidence intervals for \(\beta_0\) and \(\beta_1\) based on the original data set, the noisier data set, and the less noisy data set? Comment on your results.
A:
Recall the 95% confidence interval for \(\beta_i\) is given by:
\[\hat{\beta_i} \pm t_{n-2}(0.975) \cdot SE(\hat{\beta_i})\]
Original Data: \(Var(\epsilon) = 0.5^2\)
confint(lm_17)
## 2.5 % 97.5 %
## (Intercept) -1.1150804 -0.9226122
## x 0.3925794 0.6063602
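As a sanity check against the formula above (a sketch using lm_17; df.residual() returns the \(n - 2 = 98\) residual degrees of freedom):
est <- coef(summary(lm_17))[, "Estimate"]
se  <- coef(summary(lm_17))[, "Std. Error"]
t_crit <- qt(0.975, df = df.residual(lm_17))
cbind(lower = est - t_crit * se, upper = est + t_crit * se)  # should match confint(lm_17) above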
Noisier Data: \(Var(\epsilon) = 1\)
confint(lm_21)
## 2.5 % 97.5 %
## (Intercept) -1.2301607 -0.8452245
## x 0.2851588 0.7127204
Less-Noisy Data: \(Var(\epsilon) = 0.1^2\)
confint(lm_19)
## 2.5 % 97.5 %
## (Intercept) -1.0230161 -0.9845224
## x 0.4785159 0.5212720
None of the confidence intervals for \(\beta_1\) contain zero.
As the relationship becomes noisier (i.e. as \(Var(\epsilon)\) increases), the 95% confidence interval for each parameter becomes wider. Since we saw the actual parameter estimates \(\hat{\beta_i}\) barely change, this is being driven by an increase in \(SE(\hat{\beta_i})\).
All of the confidence intervals are centered around -1 for \(\beta_0\) and 0.5 for \(\beta_1\), which is to be expected (in all 3 scenarios, the parameter estimates \(\hat{\beta_i}\) were very close to the population parameters).
A stronger relationship with less noise means that the true population parameter \(\beta_i\) can be said to fall within a smaller region (with 95% certainty).
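To make the widening concrete, the width of the 95% interval for \(\beta_1\) can be compared directly across the three fits (a sketch, assuming lm_17, lm_19 and lm_21 are all still in the workspace):
sapply(list(less_noisy = lm_19, original = lm_17, noisier = lm_21),
       function(m) unname(diff(confint(m)["x", ])))  # CI width for the slope in each scenario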
This problem focuses on the collinearity problem.
Q: Perform the following commands in R:
set.seed(1)
x1=runif (100)
x2=0.5*x1+rnorm (100)/10
y=2+2*x1+0.3*x2+rnorm (100)
The last line corresponds to creating a linear model in which y
is a function of x1
and x2
. Write out the form of the linear model. What are the regression coefficients?
A:
set.seed(1)
x1 = runif(100)
x2 = 0.5*x1 + rnorm(100)/10
y = 2 + 2*x1 + 0.3*x2 + rnorm(100)
The form of the regression model is given by:
\[y = \beta_0 + \beta_1\cdot x_1 + \beta_2\cdot x_2 + \epsilon\]
The regression coefficients used to generate the data are \(\beta_0 = 2\), \(\beta_1 = 2\) and \(\beta_2 = 0.3\).
Q: Do the results obtained in (c)-(e) contradict each other? Explain your answer
A:
The supposedly contradictory results stem from the fact that, in a model with x1 and x2 as predictors, x2 was not significant, whereas in a model with x2 as the only predictor, it was.
These results are not contradictory: x2 simply does not offer enough ‘new information’ once the model already contains x1. Because x1 and x2 are highly correlated, much of the information carried by one predictor is redundant when both are included, so x2 can be significant on its own yet not significant in the presence of x1.
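The degree of collinearity is easy to quantify (a sketch; cor() is base R, while the commented-out line assumes the car package is installed for variance inflation factors):
cor(x1, x2)                 # correlation between the two predictors
# car::vif(lm(y ~ x1 + x2)) # variance inflation factors, if the car package is available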
Q: Now suppose we obtain one additional observation, which was unfortunately mismeasured.
x1=c(x1, 0.1)
x2=c(x2, 0.8)
y=c(y,6)
Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation have on the each of the models? In each model, is this observation an outlier? A high-leverage point? Both? Explain your answers.
A:
x1 = c(x1, 0.1)
x2 = c(x2, 0.8)
y = c(y, 6)
data <- data.frame(y, x1, x2, new = 0)
data$new[length(data$new)] <- 1
ggplot(data, aes(x = x1, y = x2, col = factor(new))) +
geom_point() +
scale_color_manual(values = c("black", "red")) +
theme(legend.position = "none")
ggplot(data, aes(x = x1, y = y, col = factor(new))) +
geom_point() +
scale_color_manual(values = c("black", "red")) +
theme(legend.position = "none")
ggplot(data, aes(x = x2, y = y, col = factor(new))) +
geom_point() +
scale_color_manual(values = c("black", "red")) +
theme(legend.position = "none")
Model: y ~ x1 + x2
Previously, see part (c).
With the new observation included:
summary(lm(y ~ x1 + x2))
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73348 -0.69318 -0.05263 0.66385 2.30619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2267 0.2314 9.624 7.91e-16 ***
## x1 0.5394 0.5922 0.911 0.36458
## x2 2.5146 0.8977 2.801 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared: 0.2188, Adjusted R-squared: 0.2029
## F-statistic: 13.72 on 2 and 98 DF, p-value: 5.564e-06
Previously, we had that in a model containing both x1
and x2
, x1
was significant and x2
wasn’t.
With the addition of this new observation, this has reversed (x2
is now significant, x1
isn’t).
broom::augment(lm(y ~ x1 + x2)) %>%
cbind(new = data$new) %>%
ggplot(aes(x = .hat, y = .std.resid, col = factor(new))) +
geom_point() +
geom_hline(yintercept = -2, col = "deepskyblue3", size = 1, alpha = 0.5) +
geom_hline(yintercept = 2, col = "deepskyblue3", size = 1, alpha = 0.5) +
geom_vline(xintercept = 3 / nrow(data), col = "mediumseagreen", size = 1, alpha = 0.5) +
scale_color_manual(values = c("black", "red")) +
theme_light() +
theme(legend.position = "none") +
labs(x = "Leverage",
y = "Standardized Residual",
title = "Residuals vs Leverage: y ~ x1 + x2")
Recall that the leverage of a point measures how unusual (how ‘far away’) its predictor values \(x_i\) are relative to the other observations. The average leverage is \((p + 1)/n\).
Judging by the graph above, in the case of lm(y ~ x1 + x2), the new red observation has (by far) the highest leverage, and this can be visually confirmed by looking at its position on the scatter plot of x1 vs x2. It also has a large residual, and hence could be considered an outlier.
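The leverage values in the plot can also be checked numerically (a sketch; hatvalues() is the base-R extractor, the refit below includes the mismeasured 101st observation, and fit is just a throwaway name):
fit <- lm(y ~ x1 + x2)
max(hatvalues(fit))   # leverage of the most extreme point (the new observation, per the plot)
mean(hatvalues(fit))  # average leverage, equal to (p + 1)/n = 3/101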
Model: y ~ x1
Previously, see part (d).
With the new observation included:
summary(lm(y ~ x1))
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8897 -0.6556 -0.0909 0.5682 3.5665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2569 0.2390 9.445 1.78e-15 ***
## x1 1.7657 0.4124 4.282 4.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.1477
## F-statistic: 18.33 on 1 and 99 DF, p-value: 4.295e-05
Previously, in the model containing just x1
as a predictor, x1
was significant. This is still the case.
broom::augment(lm(y ~ x1)) %>%
cbind(new = data$new) %>%
ggplot(aes(x = .hat, y = .std.resid, col = factor(new))) +
geom_point() +
geom_hline(yintercept = -2, col = "deepskyblue3", size = 1, alpha = 0.5) +
geom_hline(yintercept = 2, col = "deepskyblue3", size = 1, alpha = 0.5) +
geom_vline(xintercept = 2 / nrow(data), col = "mediumseagreen", size = 1, alpha = 0.5) +
scale_color_manual(values = c("black", "red")) +
theme_light() +
theme(legend.position = "none") +
labs(x = "Leverage",
y = "Standardized Residual",
title = "Residuals vs Leverage: y ~ x1")
For lm(y ~ x1), the new observation is still fairly high-leverage, but it is also an outlier, with a very large standardized residual (> 3). Looking at the graph of y vs x1, we can visually confirm this (the point is far from the mean of x1 and would be the regression line’s biggest outlier).
Model: y ~ x2
Previously, see part (e).
With the new observation included:
summary(lm(y ~ x2))
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64729 -0.71021 -0.06899 0.72699 2.38074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3451 0.1912 12.264 < 2e-16 ***
## x2 3.1190 0.6040 5.164 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared: 0.2122, Adjusted R-squared: 0.2042
## F-statistic: 26.66 on 1 and 99 DF, p-value: 1.253e-06
Previously, in the model containing just x2
as a predictor, x2
was significant. This is still the case.
broom::augment(lm(y ~ x2)) %>%
cbind(new = data$new) %>%
ggplot(aes(x = .hat, y = .std.resid, col = factor(new))) +
geom_point() +
geom_hline(yintercept = -2, col = "deepskyblue3", size = 1, alpha = 0.5) +
geom_hline(yintercept = 2, col = "deepskyblue3", size = 1, alpha = 0.5) +
geom_vline(xintercept = 2 / nrow(data), col = "mediumseagreen", size = 1, alpha = 0.5) +
scale_color_manual(values = c("black", "red")) +
theme_light() +
theme(legend.position = "none") +
labs(x = "Leverage",
y = "Standardized Residual",
title = "Residuals vs Leverage: y ~ x2")
For lm(y ~ x2), the observation is (once again) the highest-leverage point, which agrees with the graph of y vs x2 (where it is furthest from the mean of x2).
In this case, the observation is mostly in line with the fit of a regression line, so its residual is comparatively low. For this model, the new observation might not be considered influential.
This problem involves the Boston
data set, which we saw in the lab for this chapter. We will now try to predict per capita crime rate using the other variables in this data set. In other words, per capita crime rate is the response, and the other variables are the predictors.
Q: For each predictor, fit a simple linear regression model to predict the response. Describe your results. In which of the models is there a statistically significant association between the predictor and the response? Create some plots to back up your assertions.
A:
Boston <- MASS::Boston
Boston$chas <- factor(Boston$chas)
I wrote the following loop (based on extractions from my best_predictor() function), which fits a simple linear model predicting crim for each predictor in Boston, extracts each model’s \(R^2\) and slope t-test p-value, then ranks the predictors in descending \(R^2\).
I also extract the model slope (\(\hat{\beta_1}\)), for use in (c).
R2 <- c()
P_value <- c()
Slope <- c()
y <- Boston$crim
for (i in 1:ncol(Boston)) {
x <- Boston[ ,i]
if (!identical(y, x)) {
# linear: y ~ x
R2[i] <- summary(lm(y ~ x))$r.squared
P_value[i] <- summary(lm(y ~ x))$coefficients[2, 4]
Slope[i] <- summary(lm(y ~ x))$coefficients[2, 1]
} else {
R2[i] <- NA
P_value[i] <- NA
Slope[i] <- NA
}
}
crime_preds <- data.frame(varname = names(Boston),
R2 = round(R2, 5),
P_value = round(P_value, 10),
Slope = round(Slope, 5)) %>%
arrange(desc(R2))
crime_preds
## varname R2 P_value Slope
## 1 rad 0.39126 0.0000000000 0.61791
## 2 tax 0.33961 0.0000000000 0.02974
## 3 lstat 0.20759 0.0000000000 0.54880
## 4 nox 0.17722 0.0000000000 31.24853
## 5 indus 0.16531 0.0000000000 0.50978
## 6 medv 0.15078 0.0000000000 -0.36316
## 7 black 0.14827 0.0000000000 -0.03628
## 8 dis 0.14415 0.0000000000 -1.55090
## 9 age 0.12442 0.0000000000 0.10779
## 10 ptratio 0.08407 0.0000000000 1.15198
## 11 rm 0.04807 0.0000006347 -2.68405
## 12 zn 0.04019 0.0000055065 -0.07393
## 13 chas 0.00312 0.2094345015 -1.89278
## 14 crim NA NA NA
It appears that, when tested individually, all predictors are significant (with the exception of chas
, which has p = 0.2094 > 0.05).
I skip producing plots here, as this was already done and discussed in some detail in Ch.2 - Q10 - (b), (c), etc.
Q: Fit a multiple regression model to predict the response using all of the predictors. Describe your results. For which predictors can we reject the null hypothesis \(H_{0}: \beta_{j} = 0\)
A:
Fitting the linear model crim_lm
, predicting crim
using all predictors:
crim_lm <- lm(crim ~ ., data = Boston)
summary(crim_lm)
##
## Call:
## lm(formula = crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.924 -2.120 -0.353 1.019 75.051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.033228 7.234903 2.354 0.018949 *
## zn 0.044855 0.018734 2.394 0.017025 *
## indus -0.063855 0.083407 -0.766 0.444294
## chas1 -0.749134 1.180147 -0.635 0.525867
## nox -10.313535 5.275536 -1.955 0.051152 .
## rm 0.430131 0.612830 0.702 0.483089
## age 0.001452 0.017925 0.081 0.935488
## dis -0.987176 0.281817 -3.503 0.000502 ***
## rad 0.588209 0.088049 6.680 6.46e-11 ***
## tax -0.003780 0.005156 -0.733 0.463793
## ptratio -0.271081 0.186450 -1.454 0.146611
## black -0.007538 0.003673 -2.052 0.040702 *
## lstat 0.126211 0.075725 1.667 0.096208 .
## medv -0.198887 0.060516 -3.287 0.001087 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.439 on 492 degrees of freedom
## Multiple R-squared: 0.454, Adjusted R-squared: 0.4396
## F-statistic: 31.47 on 13 and 492 DF, p-value: < 2.2e-16
We can reject the null hypothesis \(H_{0}: \beta_{j} = 0\) (at the 5% level) for the predictors:
zn
dis
rad
black
medv
For all other predictors, there was insufficient evidence to reject the null hypothesis that their coefficient was zero.
Q: How do your results from (a) compare to your results from (b)? Create a plot displaying the univariate regression coefficients from (a) on the x-axis, and the multiple regression coefficients from (b) on the y-axis. That is, each predictor is displayed as a single point in the plot. Its coefficient in a simple linear regression model is shown on the x-axis, and its coefficient estimate in the multiple linear regression model is shown on the y-axis
A:
See the table below, where Slope
appears as the x-axis variable, and is \(\hat{\beta_1}\) in the simple linear regression case. Estimate
is the y-axis variable, \(\hat{\beta_j}\), the coefficient estimate for the variable in the multiple regression equation.
multiple_reg <- summary(crim_lm)$coefficients %>%
as.data.frame() %>%
rownames_to_column("varname") %>%
dplyr::select(varname, Estimate)
simple_reg <- crime_preds %>%
select(varname, Slope) %>%
mutate(varname = as.character(varname))
simple_reg$varname[simple_reg$varname == "chas"] <- "chas1"
simple_multiple_reg <- inner_join(simple_reg, multiple_reg, by = "varname")
simple_multiple_reg
## varname Slope Estimate
## 1 rad 0.61791 0.588208591
## 2 tax 0.02974 -0.003780016
## 3 lstat 0.54880 0.126211376
## 4 nox 31.24853 -10.313534912
## 5 indus 0.50978 -0.063854824
## 6 medv -0.36316 -0.198886821
## 7 black -0.03628 -0.007537505
## 8 dis -1.55090 -0.987175726
## 9 age 0.10779 0.001451643
## 10 ptratio 1.15198 -0.271080558
## 11 rm -2.68405 0.430130506
## 12 zn -0.07393 0.044855215
## 13 chas1 -1.89278 -0.749133611
ggplot(simple_multiple_reg, aes(x = Slope, y = Estimate, col = factor(varname))) +
geom_jitter() +
labs(title = "Simple vs Multiple Regression Coefficients",
subtitle = "For linear models predicting 'crim'",
x = "Simple Coefficient",
y = "Multiple Coefficient") +
theme(legend.position = "none")
Q: Is there evidence of non-linear association between any of the predictors and the response? To answer this question, for each predictor X, fit a model of the form: \(Y = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \epsilon\)
A:
I create another loop, similar to part (a), but this time I fit a cubic model of the form \(Y = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \epsilon\), where \(Y\) = crim
, for all \(X\) features (except crim
& chas
).
I extract the \(R^2\) for each model as before, but now I also extract the p-values of the t-tests for the \(X\), \(X^2\) & \(X^3\) terms, which can be used to identify where non-linear relationships exist. The relationship column then records the highest-order term that is significant.
P_value_x <- c()
P_value_x2 <- c()
P_value_x3 <- c()
R2 <- c()
y <- Boston$crim
for (i in 1:ncol(Boston)) {
x <- Boston[ ,i]
if (is.numeric(x)) {
model <- lm(y ~ x + I(x^2) + I(x^3))
if (!identical(y, x)) {
P_value_x[i] <- summary(model)$coefficients[2, 4]
P_value_x2[i] <- summary(model)$coefficients[3, 4]
P_value_x3[i] <- summary(model)$coefficients[4, 4]
R2[i] <- summary(model)$r.squared
}
} else {
P_value_x[i] <- NA
P_value_x2[i] <- NA
P_value_x3[i] <- NA
R2[i] <- NA
}
}
data.frame(varname = names(Boston),
R2 = round(R2, 5),
P_value_x = round(P_value_x, 10),
P_value_x2 = round(P_value_x2, 10),
P_value_x3 = round(P_value_x3, 10)) %>%
filter(!varname %in% c("crim", "chas")) %>%
arrange(desc(R2)) %>%
mutate(relationship = case_when(P_value_x3 < 0.05 ~ "Cubic",
P_value_x2 < 0.05 ~ "Quadratic",
P_value_x < 0.05 ~ "Linear",
TRUE ~ "No Relationship"))
## varname R2 P_value_x P_value_x2 P_value_x3 relationship
## 1 medv 0.42020 0.0000000000 0.0000000000 0.0000000000 Cubic
## 2 rad 0.40004 0.6234175212 0.6130098773 0.4823137740 No Relationship
## 3 tax 0.36888 0.1097075249 0.1374681578 0.2438506811 No Relationship
## 4 nox 0.29698 0.0000000000 0.0000000000 0.0000000000 Cubic
## 5 dis 0.27782 0.0000000000 0.0000000000 0.0000000109 Cubic
## 6 indus 0.25966 0.0000529706 0.0000000003 0.0000000000 Cubic
## 7 lstat 0.21793 0.3345299858 0.0645873561 0.1298905873 No Relationship
## 8 age 0.17423 0.1426608270 0.0473773275 0.0066799154 Cubic
## 9 black 0.14984 0.1385871340 0.4741750826 0.5436171817 No Relationship
## 10 ptratio 0.11378 0.0030286627 0.0041195521 0.0063005136 Cubic
## 11 rm 0.06779 0.2117564139 0.3641093853 0.5085751094 No Relationship
## 12 zn 0.05824 0.0026122963 0.0937504996 0.2295386205 Linear
Below are a couple of examples of predictors where the cubic term is significant. Using the formula argument of geom_smooth(method = "lm"), we can overlay the cubic fit directly onto the scatter plot:
ggplot(Boston, aes(x = medv, y = crim)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", formula = "y ~ x + I(x^2) + I(x^3)") +
labs(title = "Cubic Relationship between 'medv' & 'crim'")
ggplot(Boston, aes(x = nox, y = crim)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", formula = "y ~ x + I(x^2) + I(x^3)") +
labs(title = "Cubic Relationship between 'nox' & 'crim'")