NOTE: There are no official solutions for these questions. These are my solutions and could be incorrect. If you spot any mistakes/inconsistencies, please contact me on Liam95morgan@gmail.com, or via LinkedIn.
Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani
Q: Describe the null hypotheses to which the p-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model.
A:
Table 3.4 shows the regression statistics from predicting \(y\) = `sales` using `TV`, `radio` & `newspaper`:
data <- data.frame(coeff = c('$\\hat{\\beta_{0}}$ = 2.939', '$\\hat{\\beta_{1}}$ = 0.046', '$\\hat{\\beta_{2}}$ = 0.189', '$\\hat{\\beta_{3}}$ = -0.001'),
std_err = c(0.3119, 0.0014, 0.0086, 0.0059),
t_stat = c(9.42, 32.81, 21.89, -0.18),
p_value = c('< 0.0001', '< 0.0001', '< 0.0001', '0.8599'),
stringsAsFactors = F)
colnames(data) <- c('coefficient', 'std. error', 't-statistic', 'p-value')
rownames(data) <- c('Intercept', '`TV`', '`radio`', '`newspaper`')
library(knitr)
library(kableExtra)
kable(data, row.names = T) %>%
kable_styling(bootstrap_options = "bordered", full_width = F)
| | coefficient | std. error | t-statistic | p-value |
|---|---|---|---|---|
| Intercept | \(\hat{\beta_{0}}\) = 2.939 | 0.3119 | 9.42 | < 0.0001 |
| `TV` | \(\hat{\beta_{1}}\) = 0.046 | 0.0014 | 32.81 | < 0.0001 |
| `radio` | \(\hat{\beta_{2}}\) = 0.189 | 0.0086 | 21.89 | < 0.0001 |
| `newspaper` | \(\hat{\beta_{3}}\) = -0.001 | 0.0059 | -0.18 | 0.8599 |
For the intercept (\(\beta_{0}\)), the corresponding test (with null and alternative hypotheses):
\[H_{0}: \beta_{0} = 0 \\H_{a}: \beta_{0} \neq 0 \]
\(\beta_{0}\) is the value that \(y\) takes when \(x_{1} = x_{2} = x_{3} = 0\). That is, it is the `sales` we expect when there is no `TV`, `radio` or `newspaper` advertising.
Since p < 0.0001, we have significant evidence to reject \(H_{0}\) in favour of the alternative hypothesis. In other words, we have strong evidence that `sales` would not be zero in the absence of `TV`, `radio` & `newspaper` advertising.
`TV` & `radio`
The corresponding test (with null and alternative hypotheses) for `TV`:
\[H_{0}: \beta_{1} = 0 \\H_{a}: \beta_{1} \neq 0 \]
The same for `radio`:
\[H_{0}: \beta_{2} = 0 \\H_{a}: \beta_{2} \neq 0 \]
In both cases, the null hypothesis means that the corresponding variable has no effect on \(y\), when holding all other predictors fixed.
In both cases, p < 0.0001, so we reject \(H_{0}\) and conclude that a change in the `TV` or `radio` budget does have some impact on `sales`.
`newspaper`
The corresponding test (with null and alternative hypotheses) for `newspaper`:
\[H_{0}: \beta_{3} = 0 \\H_{a}: \beta_{3} \neq 0 \]
The null hypothesis means that, when holding the other predictors (`TV` & `radio`) fixed, `newspaper` has no effect on \(y\) = `sales`.
Since p = 0.8599, there is not sufficient evidence to reject this hypothesis. In other words, it appears that changing the `newspaper` budget has no impact on `sales`.
Q: Carefully explain the differences between the KNN classifier and KNN regression methods.
A:
The main difference is the goal: the KNN classifier is used in situations where the response variable is categorical, whereas the KNN regressor is only appropriate when it is numeric.
For both the classifier & regressor, the first step is the same: identify the \(K\) training observations closest to the prediction point \(x_{0}\); these form the neighbourhood \(\mathcal{N}_{0}\).
KNN Classifier:
First, the conditional probability for class \(j\) is estimated, which is the fraction of the points in \(\mathcal{N}_{0}\) where \(y_{i}\) = \(j\).
\[ P_{r}(Y = j | x = x_{0}) = \frac{1}{K} \sum_{i \in \mathcal{N}_{0}} I(y_{i} = j)\]
After working out the conditional probability for each \(j\) class, the class prediction \(\hat{f}(x_{0})\) is then simply the class \(j\) with the highest conditional probability:
\[\hat{f}(x_{0}) = \underset{j}{\operatorname{argmax}}(P_{r}(Y = j | x = x_{0}))\]
KNN Regressor:
The KNN regressor identifies the \(K\) training observations that are closest to the observation it is predicting (\(x_{0}\)), then estimates \(f(x_{0})\) by taking the mean \(y\) value of these training observations.
\[\hat{f}(x_{0}) = \frac{1}{K} \sum_{i \in \mathcal{N}_{0}} y_{i}\]
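To make the mechanics concrete, here is a minimal base-R sketch of the KNN regressor above (the function name `knn_reg_predict` and the simulated data are my own, not from the book); swapping the mean for the most common class in \(\mathcal{N}_{0}\) would give the classifier.
knn_reg_predict <- function(x_train, y_train, x0, K) {
  nbrs <- order(abs(x_train - x0))[1:K]  # indices of the K nearest neighbours, i.e. N_0
  mean(y_train[nbrs])                    # f_hat(x0) = average response over N_0
}
set.seed(1)
x_train <- rnorm(50)
y_train <- 2 * x_train + rnorm(50)
knn_reg_predict(x_train, y_train, x0 = 0.5, K = 5)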
Suppose we have a data set with five predictors: \(X_{1}\) = GPA, \(X_{2}\) = IQ, \(X_{3}\) = Gender (1 for Female, 0 for Male), \(X_{4}\) = the GPA/IQ interaction, and \(X_{5}\) = the GPA/Gender interaction.
The response \(Y\) is starting salary after graduation (in thousands of dollars).
Suppose we use least squares to fit the model, and get: \(\hat{\beta_{0}} = 50\), \(\hat{\beta_{1}} = 20\), \(\hat{\beta_{2}} = 0.07\), \(\hat{\beta_{3}} = 35\), \(\hat{\beta_{4}} = 0.01\), \(\hat{\beta_{5}} = -10\).
Q: Which answer is correct, and why?
i. For a fixed value of IQ and GPA, males earn more on average than females.
ii. For a fixed value of IQ and GPA, females earn more on average than males.
iii. For a fixed value of IQ and GPA, males earn more on average than females provided that the GPA is high enough.
iv. For a fixed value of IQ and GPA, females earn more on average than males provided that the GPA is high enough.
A:
The regression equation is given by:
\[ \begin{align*} \hat{salary} & = \hat{\beta_{0}} + \hat{\beta_{1}}\cdot gpa + \hat{\beta_{2}}\cdot iq + \hat{\beta_{3}}\cdot gender + \hat{\beta_{4}}\cdot gpa\cdot iq + \hat{\beta_{5}}\cdot gpa\cdot gender \\ & = 50 + 20\cdot gpa + 0.07\cdot iq + 35\cdot gender + 0.01\cdot gpa\cdot iq - 10\cdot gpa\cdot gender \end{align*} \]
Lets look at the \(\hat{y}\) for \(gender = 1\) (female) and \(gender = 0\) (male):
\[ \begin{align*} \mathbb{E}[y \mid gender = 1] & = 50 + 20\cdot gpa + 0.07\cdot iq + 35\cdot 1 + 0.01\cdot gpa\cdot iq - 10\cdot gpa\cdot 1 \\ & = 85 + 10\cdot gpa + 0.07\cdot iq + 0.01\cdot gpa\cdot iq \end{align*} \]
\[ \begin{align*} \mathbb{E}[y \mid gender = 0] & = 50 + 20\cdot gpa + 0.07\cdot iq + 35\cdot 0 + 0.01\cdot gpa\cdot iq - 10\cdot gpa\cdot 0 \\ & = 50 + 20\cdot gpa + 0.07\cdot iq + 0.01\cdot gpa\cdot iq \end{align*} \]
Starting salary is higher for females than males when:
\[ \begin{align*} & \mathbb{E}[y \mid gender = 1] > \mathbb{E}[y \mid gender = 0] \\ \Longleftrightarrow \enspace & 85 + 10\cdot gpa + 0.07\cdot iq + 0.01\cdot gpa\cdot iq > 50 + 20\cdot gpa + 0.07\cdot iq + 0.01\cdot gpa\cdot iq \\ \Longleftrightarrow \enspace & 85 + 10\cdot gpa > 50 + 20\cdot gpa \\ \Longleftrightarrow \enspace & gpa < 3.5 \end{align*} \] So for fixed IQ & GPA, females earn more than males except when \(gpa > 3.5\). This is equivalent to answer iii.
Q: Predict the salary of a female with IQ of 110 and a GPA of 4.0.
A:
Plugging the values into the regression equation in (a):
\[ \begin{align*} \hat{salary} & = 50 + 20\cdot gpa + 0.07\cdot iq + 35\cdot gender + 0.01\cdot gpa\cdot iq - 10\cdot gpa\cdot gender \\ & = 50 + 20\cdot 4 + 0.07\cdot 110 + 35\cdot 1 + 0.01\cdot 4\cdot 110 - 10\cdot 4\cdot 1 \\ & = 137.1 \end{align*} \]
So her predicted salary would be: $137,100.
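As a quick arithmetic check in R (variable names are my own):
gpa <- 4; iq <- 110; gender <- 1  # female
50 + 20*gpa + 0.07*iq + 35*gender + 0.01*gpa*iq - 10*gpa*gender
## [1] 137.1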
Q: True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.
A:
False. The coefficient value for an interaction term between \(x_{1}\) and \(x_{2}\) will depend on the scales of \(x_{1}\), \(x_{2}\) & \(y\).
We could change this coefficient simply by changing the scale of particular variables (e.g. units \(\rightarrow\) thousands \(\rightarrow\) millions) and this won’t be in any way reflective of whether there is an interaction effect present: the value of \(\hat{\beta_{4}}\) is not informative for this purpose.
We should instead perform a t-test for the parameter of the GPA/IQ interaction term if we are interested in whether this interaction effect exists:
\[H_{0}: \beta_{4} = 0 \\H_{a}: \beta_{4} \neq 0 \]
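As a sketch of how this t-test could be run in practice (the data below is simulated purely for illustration, since the question's data set isn't available):
set.seed(1)
n <- 200
sim <- data.frame(gpa = runif(n, 2, 4),
                  iq = rnorm(n, mean = 100, sd = 15),
                  gender = rbinom(n, 1, 0.5))
sim$salary <- 50 + 20*sim$gpa + 0.07*sim$iq + 35*sim$gender +
  0.01*sim$gpa*sim$iq - 10*sim$gpa*sim$gender + rnorm(n, sd = 5)
interaction_fit <- lm(salary ~ gpa*iq + gpa*gender, data = sim)
summary(interaction_fit)$coefficients["gpa:iq", ]  # t-test row for the GPA/IQ interaction term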
I collect a set of data (n = 100 observations) containing a single predictor and a quantitative response.
I then fit a linear regression model to the data: \(Y = \beta_{0} + \beta_{1}X + \epsilon\),
as well as a separate cubic regression: \(Y = \beta_{0} + \beta_{1}X + \beta_{2}X^2 + \beta_{3}X^3 + \epsilon\).
Q: Suppose that the true relationship between X and Y is linear. Consider the training residual sum of squares (RSS) for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.
A:
I would expect similar results, but with the cubic regression having a slightly lower training RSS. The cubic model contains the linear model as a special case, so its least squares fit can never have a higher training RSS; with a fairly small sample, I'd expect the extra flexibility to overfit the noise slightly and reduce the training RSS a little further.
Q: Answer (a) using test rather than training RSS.
A:
Here I would expect the linear regression to perform slightly better than the cubic regression. I say slightly because the true relationship is linear and we are working from a sample of this data, so the cubic regression would likely have fit something close to a linear relationship (it’s not going to fit a wild cubic relationship if the data mostly follow a straight line).
Q: Suppose that the true relationship between X and Y is not linear, but we don’t know how far it is from linear. Consider the training RSS for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.
A:
As in part (a), I expect the cubic regression to have a lower training RSS.
If the relationship is very non-linear and can be well approximated by a cubic fit, I would expect the training RSS to be much lower.
If, however, the relationship is very close to linear or cannot be well approximated by a cubic fit, we might expect more similar results.
Either way, the cubic fit would match or outperform the linear fit on training RSS, as increasing model flexibility can never increase the training error of a least squares fit.
Q: Answer (c) using test rather than training RSS.
A:
It depends on the nature of the non-linear relationship. If the relationship is close to linear, the linear model may give a very good approximation and could have the lower test RSS; for a relationship that is strongly non-linear (and reasonably well approximated by a cubic), we would expect the cubic regression to have the lower test RSS. There is not really enough information to say definitively.
Q: Consider the fitted values that result from performing linear regression without an intercept. In this setting, the ith fitted value takes the form:
\[\hat{y_i}=x_i\hat{\beta}\]
where:
\[\hat{\beta} = (\sum_{j=1}^{n}x_jy_j)/(\sum_{k=1}^{n}x_k^2)\]
Show that we can write
\[\hat{y_i} = \sum_{j=1}^{n}a_jy_j\]
What is \(a_j\)?
A:
\[ \begin{align*} \hat{y_i} & = x_i\hat{\beta} \\\\ & = x_i \frac{\sum_{j=1}^{n}x_jy_j}{\sum_{k=1}^{n}x_k^2} \\\\ & = \frac{\sum_{j=1}^{n}x_ix_jy_j}{\sum_{k=1}^{n}x_k^2} && \text{(since } x_i \text{ doesn't depend on } j \text{)}\\\\ & = \sum_{j=1}^{n}\frac{x_ix_jy_j}{\sum_{k=1}^{n}x_k^2} && \text{(similarly, the summation over } k \text{ is basically a constant w.r.t } j \text{)}\\\\ & = \sum_{j=1}^{n}\frac{x_ix_j}{\sum_{k=1}^{n}x_k^2} y_j\\\\ & = \sum_{j=1}^{n}a_jy_j \\\\ \end{align*} \] Where \(a_j = \frac{x_ix_j}{\sum_{k=1}^{n}x_k^2}\)
We can see that the fitted values from linear regression (with the intercept fixed at 0) are linear combinations of the response values.
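A quick numerical check of this result (the simulated data and object names here are my own):
set.seed(1)
x_sim <- rnorm(20)
y_sim <- 3 * x_sim + rnorm(20)
fit_no_int <- lm(y_sim ~ x_sim + 0)
i <- 5
a <- (x_sim[i] * x_sim) / sum(x_sim^2)   # a_j for j = 1, ..., n
all.equal(unname(fitted(fit_no_int)[i]), sum(a * y_sim))
## [1] TRUE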
Q: Using (3.4), argue that in the case of simple linear regression, the least squares line always passes through the point (\(\bar{x}\), \(\bar{y}\))
A:
(3.4) give the parameter estimates for a simple linear regression line when minimizing the RSS:
\[\hat{\beta_{1}} = \frac{\sum_{i = 1}^n(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i = 1}^n(x_i - \bar{x})^2} \\ \hat{\beta_{0}} = \bar{y} - \hat{\beta_{1}}\bar{x}\]
The least squares line:
\[\hat{y}_i = \hat{\beta_{0}} + \hat{\beta_{1}}x_i\]
By substituting \(x_i = \bar{x}\), and using the relationship for \(\hat{\beta_{0}}\):
\[ \begin{align*} \hat{y}_i & = \hat{\beta_{0}} + \hat{\beta_{1}}\bar{x} \\ & = (\bar{y} - \hat{\beta_{1}}\bar{x}) + \hat{\beta_{1}}\bar{x} \\ & = \bar{y} \end{align*} \]
So the least squares line always passes through (\(\bar{x}\), \(\bar{y}\)).
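This is easy to confirm numerically with any simple regression fit (a sketch on simulated data; names are my own):
set.seed(1)
x_sim <- rnorm(50)
y_sim <- 1 + 2 * x_sim + rnorm(50)
slr_fit <- lm(y_sim ~ x_sim)
all.equal(unname(predict(slr_fit, newdata = data.frame(x_sim = mean(x_sim)))), mean(y_sim))
## [1] TRUE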
Q: It is claimed in the text that in the case of simple linear regression of Y onto X, the R2 statistic (3.17) is equal to the square of the correlation between X and Y (3.18). Prove that this is the case. For simplicity, you may assume that \(\bar{x} = \bar{y} = 0\)
A:
I’m going to drop some notation to keep things clearer, but all sums will be from 1 to n, e.g:
\[\sum_{i = 1}^n(x_i - \bar{x}) = \sum(x - \bar{x})\]
Using the parameter estimates for \(\hat{\beta_{0}}\) & \(\hat{\beta_{1}}\) in (6), we can simplify (since \(\bar{x} = \bar{y} = 0\)):
\[\hat{\beta_{0}} = 0 \\ \hat{\beta_{1}} = \frac{\sum{xy}}{\sum{x^2}} \] We can also simplify the correlation coefficient:
\[Cor(X, Y) = \frac{\sum{(x - \bar{x})(y - \bar{y})}}{\sqrt{\sum{(x - \bar{x})^2}}\sqrt{\sum{(y - \bar{y})^2}}} = \frac{\sum{xy}}{\sqrt{\sum{y^2}\sum{x^2}}}\]
\[R^2 = \frac{TSS - RSS}{TSS}\]
\[TSS = \sum{(y - \bar{y})^2} = \sum{y^2}\]
Lots of algebra to simplify \(RSS\):
\[ \begin{align*} RSS & = \sum{(y - \hat{y})^2} \\ & = \sum{y^2 - 2y\hat{y} + \hat{y}^2}\\ & = \sum{y^2 - 2y(\hat{\beta_{0}} + \hat{\beta_{1}}x) + (\hat{\beta_{0}} + \hat{\beta_{1}}x)^2} && \text{(subbing } \hat{y} \text{)}\\ & = \sum{y^2 - 2\hat{\beta_{1}}xy + \hat{\beta_{1}}^2x^2} && \text{(since } \hat{\beta_{0}} = 0 \text{)}\\ & = \sum{y^2} - 2\hat{\beta_{1}}\sum{xy} + \hat{\beta_{1}}^2\sum{x^2} && \text{(properties of summations)} \\ & = \sum{y^2} - 2\left(\frac{\sum{xy}}{\sum{x^2}}\right)\sum{xy} + \left(\frac{\sum{xy}}{\sum{x^2}}\right)^2\sum{x^2} && \text{(subbing } \hat{\beta_{1}} \text{)}\\ & = \sum{y^2} - 2\frac{(\sum{xy})^2}{\sum{x^2}} + \frac{(\sum{xy})^2}{\sum{x^2}} \\ & = \frac{\sum{y^2}\sum{x^2} - (\sum{xy})^2}{\sum{x^2}} \end{align*} \]
Plugging this into the \(R^2\) formula:
\[ \begin{align*} R^2 & = \frac{TSS - RSS}{TSS} \\ & = \frac{\sum{y^2} - \left[\frac{\sum{y^2}\sum{x^2} - (\sum{xy})^2}{\sum{x^2}}\right]}{\sum{y^2}} \\ & = \frac{\sum{y^2}\sum{x^2} - \sum{y^2}\sum{x^2} + (\sum{xy})^2}{\sum{y^2}\sum{x^2}} \\ & = \frac{(\sum{xy})^2}{\sum{y^2}\sum{x^2}} \\ & = \left(\frac{\sum{xy}}{\sqrt{\sum{y^2}\sum{x^2}}}\right)^2 \\ & = Cor(X, Y)^2 \end{align*} \]
So we get \(R^2 = Cor(X, Y)^2\), as required.
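The identity also holds without centering the data, as long as the model includes an intercept; a quick numerical confirmation on simulated data (names are my own):
set.seed(1)
x_sim <- rnorm(100)
y_sim <- 2 * x_sim + rnorm(100)
all.equal(summary(lm(y_sim ~ x_sim))$r.squared, cor(x_sim, y_sim)^2)
## [1] TRUE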
Auto Dataset (Simple Linear Regression)
This question involves the use of simple linear regression on the `Auto` data set.
library(ISLR)
library(tidyverse)
glimpse(Auto)
## Rows: 392
## Columns: 9
## $ mpg <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14...
## $ cylinders <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, ...
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383,...
## $ horsepower <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170,...
## $ weight <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, ...
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8....
## $ year <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70...
## $ origin <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, ...
## $ name <fct> chevrolet chevelle malibu, buick skylark 320, plymouth...
`mpg` with `horsepower`
Q: Use the `lm()` function to perform a simple linear regression with `mpg` as the response and `horsepower` as the predictor. Use the `summary()` function to print the results. Comment on the output. For example:
i. Is there a relationship between the predictor and the response?
ii. How strong is the relationship between the predictor and the response?
iii. Is the relationship between the predictor and the response positive or negative?
iv. What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?
A:
I follow the methodology on p.102 - p.104 quite closely here.
mpg_horsepower <- lm(mpg ~ horsepower, data = Auto)
summary(mpg_horsepower)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
i: The p-value for the `horsepower` variable is very small (<< 0.05), so there is strong evidence to believe that `horsepower` is associated with `mpg`. Therefore, there is a relationship between the predictor and response.
ii:
We have the \(RSE\), an estimate of the standard deviation of \(\epsilon\), given by:
\[RSE = \sqrt{\frac{1}{n-p-1} RSS}\] Think of it as “the average amount that the response will deviate from the true regression line… another way to think about this is that, even if the model were correct and the true values of the unknown coefficients were known exactly, any prediction of \(y\) on the basis of the predictors would still be off by about \(RSE\) units, on average.”
The \(RSE\) for this model:
summary(mpg_horsepower)$sigma
## [1] 4.905757
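As a sanity check on the formula above, the same number can be computed directly from the residuals (using base R's `residuals()` and `df.residual()`):
sqrt(sum(residuals(mpg_horsepower)^2) / df.residual(mpg_horsepower))
## [1] 4.905757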
Whether this is 'good' or 'bad' is hard to judge in isolation, since the \(RSE\) is measured in the units of \(y\); dividing it by \(\bar{y}\) gives the percentage error:
summary(mpg_horsepower)$sigma / mean(Auto$mpg)
## [1] 0.2092371
So the percentage error = 20.92%.
The \(R^2\) of the linear model, which can be thought of as “the percentage of variability in the response that is explained by the predictor”, is given by:
summary(mpg_horsepower)$r.squared
## [1] 0.6059483
So `horsepower` explains 60.59% of the variance in `mpg`.
iii:
The relationship is negative, as we can see from the parameter estimate for `horsepower` (\(\hat{\beta_1}\)):
coefficients(mpg_horsepower)[2]
## horsepower
## -0.1578447
This means that, if a vehicle has higher `horsepower`, it will generally have a lower value for `mpg`.
iv:
If `horsepower` = 98, we can get the prediction for `mpg` (fitted value), and the 95% confidence & prediction intervals for `mpg` (when `horsepower` = 98).
The confidence interval:
predict(mpg_horsepower, data.frame(horsepower = 98), interval = "confidence", level = 0.95)
## fit lwr upr
## 1 24.46708 23.97308 24.96108
The prediction interval:
predict(mpg_horsepower, data.frame(horsepower = 98), interval = "prediction", level = 0.95)
## fit lwr upr
## 1 24.46708 14.8094 34.12476
The prediction interval is wider than the confidence interval as we would expect.
Response (`mpg`) vs Predictor (`horsepower`)
theme_set(theme_light())
ggplot(Auto, aes(x = horsepower, y = mpg)) +
geom_point() +
geom_abline(intercept = coef(mpg_horsepower)[1], slope = coef(mpg_horsepower)[2],
col = "deepskyblue3",
size = 1) +
geom_smooth(se = F)
Q: Use the `plot()` function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.
A:
This is one of the cases where I think using `plot()` instead of `ggplot()` is fine, although these plots can be obtained as normal ggplot graphs using `lindia::gg_diagnose(lm)`, which offers more flexibility & customization.
par(mfrow=c(2,2))
plot(mpg_horsepower)
Looking at the smoothing line of the residuals (\(e_i = y_i - \hat{y_i}\)) vs the fitted values (\(\hat{y_i}\)), there is a strong pattern in the residuals, indicating non-linearity. You can see evidence of this in the scatter plot in part (b) too.
There also appears to be non-constant variance in the error terms (heteroscedasticity), but this could be corrected to an extent when trying a quadratic fit. If not, transformations such as \(log(y)\) or \(\sqrt{y}\) can shrink larger responses by a greater amount and reduce this issue.
There are some observations with large standardized residuals & high leverage (hence, high Cook's distance) that may warrant review.
p.92 - p.98 were helpful here.
Auto Dataset (Multiple Linear Regression)
This question involves the use of multiple linear regression on the `Auto` data set.
Q: Produce a scatterplot matrix which includes all of the variables in the data set.
A:
Note that `name` & `origin` are both nominal variables (`origin` is stored as a numeric code), but the `pairs()` function treats everything as numeric automatically (as a result, the plots involving `name` in particular aren't really meaningful).
pairs(Auto)
Q: Compute the matrix of correlations between the variables using the function `cor()`. You will need to exclude the `name` variable, which is qualitative.
A:
`origin` is still included, although it's just a categorical variable with numeric encoding:
cor(Auto[ ,-9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
`mpg` with all variables (except `name`)
Q: Use the `lm()` function to perform a multiple linear regression with `mpg` as the response and all other variables except `name` as the predictors. Use the `summary()` function to print the results. Comment on the output. For instance:
i. Is there a relationship between the predictors and the response?
ii. Which predictors appear to have a statistically significant relationship to the response?
iii. What does the coefficient for the year variable suggest?
A:
I first encode `origin` as a factor, adding labels for clarity.
Auto$origin <- factor(Auto$origin, labels = c("American", "European", "Japanese"))
mpg_lm <- lm(mpg ~ . - name, data = Auto)
summary(mpg_lm)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0095 -2.0785 -0.0982 1.9856 13.3608
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.795e+01 4.677e+00 -3.839 0.000145 ***
## cylinders -4.897e-01 3.212e-01 -1.524 0.128215
## displacement 2.398e-02 7.653e-03 3.133 0.001863 **
## horsepower -1.818e-02 1.371e-02 -1.326 0.185488
## weight -6.710e-03 6.551e-04 -10.243 < 2e-16 ***
## acceleration 7.910e-02 9.822e-02 0.805 0.421101
## year 7.770e-01 5.178e-02 15.005 < 2e-16 ***
## originEuropean 2.630e+00 5.664e-01 4.643 4.72e-06 ***
## originJapanese 2.853e+00 5.527e-01 5.162 3.93e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.307 on 383 degrees of freedom
## Multiple R-squared: 0.8242, Adjusted R-squared: 0.8205
## F-statistic: 224.5 on 8 and 383 DF, p-value: < 2.2e-16
i: We answer this question by performing an F-test, where we test the null hypothesis that all of the regression coefficients are zero:
\[H_{0}: \beta_{1} = \beta_{2} = \dots = \beta_{p} = 0 \\H_{a}: \text{at least one } \beta_{j} \ne 0 \]
The p-value is given at the bottom of the model summary (p-value: < 2.2e-16), so there is overwhelming evidence against the null hypothesis. We reject \(H_{0}\) and conclude that there is a relationship between the predictors and `mpg`.
ii: Using the coefficient p-values in the model output, and p = 0.05 as my threshold for significance, all variables except `cylinders`, `horsepower` & `acceleration` have a statistically significant relationship with the response.
iii:
Extracting the coefficient for `year` (\(\hat{\beta_{6}}\)):
coef(mpg_lm)[7]
## year
## 0.7770269
A coefficient of 0.777 means that an increase of 1 year (with all other predictors held constant) is associated with an increase of 0.777 in the response (`mpg`). This could perhaps be thought of as the rate of the cars' year-on-year increase in efficiency.
Q: Use the `plot()` function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow=c(2,2))
plot(mpg_lm)
As in question 8, some pattern in the residuals might suggest some non-linearity in the relationships between the predictors and response that the model is not currently capturing.
We can also see some evidence of observations (e.g. 14) with both high leverage and high residual statistics, that may be disproportionately influencing the regression predictions.
mpg_lm_stats <- broom::augment(mpg_lm) %>%
select(.rownames, .fitted, .hat, .resid, .std.resid, .cooksd) %>%
arrange(desc(.cooksd))
mpg_lm_stats
## # A tibble: 392 x 6
## .rownames .fitted .hat .resid .std.resid .cooksd
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 14 19.4 0.194 -5.42 -1.83 0.0895
## 2 394 35.5 0.0646 8.53 2.67 0.0547
## 3 327 32.4 0.0406 11.0 3.41 0.0546
## 4 326 33.9 0.0332 10.4 3.20 0.0391
## 5 245 33.0 0.0323 10.1 3.11 0.0358
## 6 387 28.7 0.0342 9.33 2.87 0.0324
## 7 323 33.2 0.0151 13.4 4.07 0.0282
## 8 112 27.0 0.0294 -9.01 -2.77 0.0257
## 9 328 27.9 0.0297 8.55 2.62 0.0234
## 10 330 34.7 0.0217 9.86 3.01 0.0224
## # ... with 382 more rows
mpg_lm_stats$cooksd_cutoff <- factor(ifelse(mpg_lm_stats$.cooksd >= 4 / nrow(Auto), "True", "False"))
I extract these stats (leverage, residual stats, Cook's distance, etc.) for each observation using the broom package, and in the plots below I highlight those observations with \(CooksD \ge \frac{4}{n}\), a common threshold for identifying influential observations.
ggplot(mpg_lm_stats, aes(x = .hat, y = .std.resid, col = cooksd_cutoff)) +
geom_point(alpha = 0.8) +
scale_color_manual(values = c("mediumseagreen", "#BB8FCE")) +
theme_light() +
theme(legend.position = "bottom") +
labs(title = "Residuals vs Leverage",
x = "Leverage",
y = "Standardized Residuals",
col = "Cooks Distance >= 4/n")
ggplot(mpg_lm_stats, aes(x = .hat, y = .cooksd, col = cooksd_cutoff)) +
geom_point(alpha = 0.8) +
geom_hline(yintercept = 4 / nrow(Auto), col = "grey30") +
scale_color_manual(values = c("mediumseagreen", "#BB8FCE")) +
theme_light() +
theme(legend.position = "bottom") +
labs(title = "Cooks Distance vs Leverage",
x = "Leverage",
y = "Cooks Distance",
col = "Cooks Distance >= 4/n")
Q: Use the `*` and `:` symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
A:
Since we have relatively few predictors, we can test all interactions with `mpg ~ . * .` in the call to `lm()`:
summary(lm(formula = mpg ~ . * ., data = Auto[, -9]))
##
## Call:
## lm(formula = mpg ~ . * ., data = Auto[, -9])
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6008 -1.2863 0.0813 1.2082 12.0382
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.401e+01 5.147e+01 0.855 0.393048
## cylinders 3.302e+00 8.187e+00 0.403 0.686976
## displacement -3.529e-01 1.974e-01 -1.788 0.074638 .
## horsepower 5.312e-01 3.390e-01 1.567 0.117970
## weight -3.259e-03 1.820e-02 -0.179 0.857980
## acceleration -6.048e+00 2.147e+00 -2.818 0.005109 **
## year 4.833e-01 5.923e-01 0.816 0.415119
## originEuropean -3.517e+01 1.260e+01 -2.790 0.005547 **
## originJapanese -3.765e+01 1.426e+01 -2.640 0.008661 **
## cylinders:displacement -6.316e-03 7.106e-03 -0.889 0.374707
## cylinders:horsepower 1.452e-02 2.457e-02 0.591 0.555109
## cylinders:weight 5.703e-04 9.044e-04 0.631 0.528709
## cylinders:acceleration 3.658e-01 1.671e-01 2.189 0.029261 *
## cylinders:year -1.447e-01 9.652e-02 -1.499 0.134846
## cylinders:originEuropean -7.210e-01 1.088e+00 -0.662 0.508100
## cylinders:originJapanese 1.226e+00 1.007e+00 1.217 0.224379
## displacement:horsepower -5.407e-05 2.861e-04 -0.189 0.850212
## displacement:weight 2.659e-05 1.455e-05 1.828 0.068435 .
## displacement:acceleration -2.547e-03 3.356e-03 -0.759 0.448415
## displacement:year 4.547e-03 2.446e-03 1.859 0.063842 .
## displacement:originEuropean -3.364e-02 4.220e-02 -0.797 0.425902
## displacement:originJapanese 5.375e-02 4.145e-02 1.297 0.195527
## horsepower:weight -3.407e-05 2.955e-05 -1.153 0.249743
## horsepower:acceleration -3.445e-03 3.937e-03 -0.875 0.382122
## horsepower:year -6.427e-03 3.891e-03 -1.652 0.099487 .
## horsepower:originEuropean -4.869e-03 5.061e-02 -0.096 0.923408
## horsepower:originJapanese 2.289e-02 6.252e-02 0.366 0.714533
## weight:acceleration -6.851e-05 2.385e-04 -0.287 0.774061
## weight:year -8.065e-05 2.184e-04 -0.369 0.712223
## weight:originEuropean 2.277e-03 2.685e-03 0.848 0.397037
## weight:originJapanese -4.498e-03 3.481e-03 -1.292 0.197101
## acceleration:year 6.141e-02 2.547e-02 2.412 0.016390 *
## acceleration:originEuropean 9.234e-01 2.641e-01 3.496 0.000531 ***
## acceleration:originJapanese 7.159e-01 3.258e-01 2.198 0.028614 *
## year:originEuropean 2.932e-01 1.444e-01 2.031 0.043005 *
## year:originJapanese 3.139e-01 1.483e-01 2.116 0.035034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.628 on 356 degrees of freedom
## Multiple R-squared: 0.8967, Adjusted R-squared: 0.8866
## F-statistic: 88.34 on 35 and 356 DF, p-value: < 2.2e-16
We can see the significant terms (at the 0.05 level) are those with at least one asterisk (`*`). It is probably unreasonable to use a significance level of 0.05 here, as we are testing such a large number of hypotheses; a lower threshold for significance (or a p-value correction using the `p.adjust()` function) would perhaps be more appropriate (a sketch of such a correction follows the list below).
Using the standard threshold of 0.05, the significant interaction terms are given by:
cylinders:acceleration
acceleration:year
acceleration:originEuropean
acceleration:originJapanese
year:originEuropean
year:originJapanese
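A sketch of the correction mentioned above, pulling the interaction p-values out with `broom::tidy()` and adjusting them (the choice of the Benjamini-Hochberg method here is mine):
lm(mpg ~ . * ., data = Auto[, -9]) %>%
  broom::tidy() %>%
  filter(grepl(":", term)) %>%                        # keep only the interaction terms
  mutate(p_adjusted = p.adjust(p.value, method = "BH")) %>%
  arrange(p_adjusted) %>%
  head()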
Q: Try a few different transformations of the variables, such as \(log(X)\), \(\sqrt{X}\), \(X^2\). Comment on your findings.
A:
I use my `best_predictor()` function here, as defined in my chapter 2 solutions.
There is no testing for \(\sqrt{X}\) transformations here, but these tend to perform similarly to transformations of the form \(log(X)\).
best_predictor <- function(dataframe, response) {
if (sum(sapply(dataframe, function(x) {is.numeric(x) | is.factor(x)})) < ncol(dataframe)) {
stop("Make sure that all variables are of class numeric/factor!")
}
# pre-allocate vectors
varname <- c()
vartype <- c()
R2 <- c()
R2_log <- c()
R2_quad <- c()
AIC <- c()
AIC_log <- c()
AIC_quad <- c()
y <- dataframe[ ,response]
# # # # # NUMERIC RESPONSE # # # # #
if (is.numeric(y)) {
for (i in 1:ncol(dataframe)) {
x <- dataframe[ ,i]
varname[i] <- names(dataframe)[i]
if (class(x) %in% c("numeric", "integer")) {
vartype[i] <- "numeric"
} else {
vartype[i] <- "categorical"
}
if (!identical(y, x)) {
# linear: y ~ x
R2[i] <- summary(lm(y ~ x))$r.squared
# log-transform: y ~ log(x)
if (is.numeric(x)) {
if (min(x) <= 0) { # if y ~ log(x) for min(x) <= 0, do y ~ log(x + abs(min(x)) + 1)
R2_log[i] <- summary(lm(y ~ log(x + abs(min(x)) + 1)))$r.squared
} else {
R2_log[i] <- summary(lm(y ~ log(x)))$r.squared
}
} else {
R2_log[i] <- NA
}
# quadratic: y ~ x + x^2
if (is.numeric(x)) {
R2_quad[i] <- summary(lm(y ~ x + I(x^2)))$r.squared
} else {
R2_quad[i] <- NA
}
} else {
R2[i] <- NA
R2_log[i] <- NA
R2_quad[i] <- NA
}
}
print(paste("Response variable:", response))
data.frame(varname,
vartype,
R2 = round(R2, 3),
R2_log = round(R2_log, 3),
R2_quad = round(R2_quad, 3)) %>%
mutate(max_R2 = pmax(R2, R2_log, R2_quad, na.rm = T)) %>%
arrange(desc(max_R2))
# # # # # CATEGORICAL RESPONSE # # # # #
} else {
for (i in 1:ncol(dataframe)) {
x <- dataframe[ ,i]
varname[i] <- names(dataframe)[i]
if (class(x) %in% c("numeric", "integer")) {
vartype[i] <- "numeric"
} else {
vartype[i] <- "categorical"
}
if (!identical(y, x)) {
# linear: y ~ x
AIC[i] <- summary(glm(y ~ x, family = "binomial"))$aic
# log-transform: y ~ log(x)
if (is.numeric(x)) {
if (min(x) <= 0) { # if y ~ log(x) for min(x) <= 0, do y ~ log(x + abs(min(x)) + 1)
AIC_log[i] <- summary(glm(y ~ log(x + abs(min(x)) + 1), family = "binomial"))$aic
} else {
AIC_log[i] <- summary(glm(y ~ log(x), family = "binomial"))$aic
}
} else {
AIC_log[i] <- NA
}
# quadratic: y ~ x + x^2
if (is.numeric(x)) {
AIC_quad[i] <- summary(glm(y ~ x + I(x^2), family = "binomial"))$aic
} else {
AIC_quad[i] <- NA
}
} else {
AIC[i] <- NA
AIC_log[i] <- NA
AIC_quad[i] <- NA
}
}
print(paste("Response variable:", response))
data.frame(varname,
vartype,
AIC = round(AIC, 3),
AIC_log = round(AIC_log, 3),
AIC_quad = round(AIC_quad, 3)) %>%
mutate(min_AIC = pmin(AIC, AIC_log, AIC_quad, na.rm = T)) %>%
arrange(min_AIC)
}
}
best_predictor(Auto[ ,-9], "mpg")
## [1] "Response variable: mpg"
## varname vartype R2 R2_log R2_quad max_R2
## 1 weight numeric 0.693 0.713 0.715 0.715
## 2 displacement numeric 0.648 0.686 0.689 0.689
## 3 horsepower numeric 0.606 0.668 0.688 0.688
## 4 cylinders numeric 0.605 0.603 0.607 0.607
## 5 year numeric 0.337 0.332 0.368 0.368
## 6 origin categorical 0.332 NA NA 0.332
## 7 acceleration numeric 0.179 0.190 0.194 0.194
## 8 mpg numeric NA NA NA NA
Let's visualize:
mpg_predictors <- best_predictor(Auto[ ,-9], "mpg") %>%
select(-c(vartype, max_R2)) %>%
gather(key = "key", value = "R2", -varname) %>%
filter(!is.na(R2))
mpg_predictors_order <- best_predictor(Auto[ ,-9], "mpg") %>%
select(varname, max_R2) %>%
filter(!is.na(max_R2)) %>%
arrange(desc(max_R2)) %>%
pull(varname)
mpg_predictors$varname <- factor(mpg_predictors$varname,
ordered = T,
levels = mpg_predictors_order)
ggplot(mpg_predictors, aes(x = R2, y = varname, col = key, group = varname)) +
geom_line(col = "grey15") +
geom_point(size = 2) +
theme_light() +
theme(legend.position = "bottom") +
labs(title = "Best predictors (& transformations) for mpg",
col = "Predictor Transformation",
y = "Predictor") +
scale_color_discrete(labels = c("None", "Log-Transformed", "Order 2 Polynomial"))
We might say from this that we should test quadratic terms for `weight`, `displacement`, `horsepower` & `year`.
Recall, for the standard linear model, we get an adjusted \(R^2\) of 0.8205.
mpg_lm_2 <- lm(mpg ~ . - name + I(weight^2) + I(displacement^2) + I(horsepower^2) + I(year^2), data = Auto)
summary(mpg_lm_2)
##
## Call:
## lm(formula = mpg ~ . - name + I(weight^2) + I(displacement^2) +
## I(horsepower^2) + I(year^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4816 -1.5384 0.0735 1.3671 12.0213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.185e+02 6.966e+01 6.008 4.40e-09 ***
## cylinders 5.073e-01 3.191e-01 1.590 0.112692
## displacement -3.328e-02 2.045e-02 -1.627 0.104480
## horsepower -1.781e-01 3.953e-02 -4.506 8.81e-06 ***
## weight -1.114e-02 2.587e-03 -4.306 2.12e-05 ***
## acceleration -1.700e-01 9.652e-02 -1.762 0.078960 .
## year -1.019e+01 1.837e+00 -5.546 5.49e-08 ***
## originEuropean 1.323e+00 5.304e-01 2.494 0.013068 *
## originJapanese 1.258e+00 5.129e-01 2.452 0.014637 *
## I(weight^2) 1.182e-06 3.438e-07 3.439 0.000649 ***
## I(displacement^2) 5.839e-05 3.435e-05 1.700 0.089967 .
## I(horsepower^2) 4.388e-04 1.336e-04 3.284 0.001118 **
## I(year^2) 7.210e-02 1.207e-02 5.974 5.35e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.776 on 379 degrees of freedom
## Multiple R-squared: 0.8773, Adjusted R-squared: 0.8735
## F-statistic: 225.9 on 12 and 379 DF, p-value: < 2.2e-16
The adjusted \(R^2\) has improved quite significantly: 0.8735
Notice how it seems like there is less non-linearity in the residuals.
par(mfrow=c(2,2))
plot(mpg_lm_2)
However, there is still some fan-shape to the residual plot, and the Q-Q plot is showing that the residuals are not as normal as we would hope. Perhaps a log-transformation of the response (`mpg`) will help this:
mpg_lm_3 <- lm(log(mpg) ~ . - name, data = Auto)
summary(mpg_lm_3)
##
## Call:
## lm(formula = log(mpg) ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40380 -0.06679 0.00493 0.06913 0.33036
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.712e+00 1.673e-01 10.230 < 2e-16 ***
## cylinders -2.781e-02 1.149e-02 -2.420 0.01598 *
## displacement 7.874e-04 2.738e-04 2.876 0.00425 **
## horsepower -1.520e-03 4.904e-04 -3.100 0.00208 **
## weight -2.639e-04 2.344e-05 -11.260 < 2e-16 ***
## acceleration -1.403e-03 3.513e-03 -0.399 0.68996
## year 3.055e-02 1.852e-03 16.491 < 2e-16 ***
## originEuropean 8.531e-02 2.026e-02 4.210 3.18e-05 ***
## originJapanese 8.145e-02 1.977e-02 4.119 4.66e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1183 on 383 degrees of freedom
## Multiple R-squared: 0.8815, Adjusted R-squared: 0.879
## F-statistic: 356.1 on 8 and 383 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mpg_lm_3)
Pushing for even higher performance from the linear model, I make the following changes.
Note that I have tried to avoid adding too many complex interactions & transformations to the model, but have instead focused on extracting the relationships that I think are most likely to exist and add performance on new data:
Log-transforming the response (`mpg` \(\rightarrow\) `log(mpg)`)
Adding the `brand` variable that I created in chapter 2, through text-mining the `name` variable
Auto_2 <- Auto %>%
mutate(mpg = log(mpg))
# new variable
Auto_2$brand <- sapply(strsplit(as.character(Auto_2$name), split = " "),
function(x) x[1]) # extract the first item from each list element
Auto_2$brand <- factor(ifelse(Auto_2$brand %in% c("vokswagen", "vw"), "volkswagen",
ifelse(Auto_2$brand == "toyouta", "toyota",
ifelse(Auto_2$brand %in% c("chevroelt", "chevy"), "chevrolet",
ifelse(Auto_2$brand == "maxda", "mazda",
Auto_2$brand))))) # fixing typo's
Auto_2$brand <- fct_lump(Auto_2$brand,
n = 9,
other_level = "uncommon") # collapse into 10 categories
Auto_2$name <- NULL
# gives ideas for transformations:
# best_predictor(Auto_2, "mpg")
# gives ideas for interactions:
# summary(lm(mpg ~ . * ., data = Auto_2[ ,-9]))
mpg_lm_4 <- lm(mpg ~ . + I(horsepower^2) + I(year^2) + acceleration:year + acceleration:origin, data = Auto_2)
summary(mpg_lm_4)
##
## Call:
## lm(formula = mpg ~ . + I(horsepower^2) + I(year^2) + acceleration:year +
## acceleration:origin, data = Auto_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.36578 -0.05924 0.00281 0.05853 0.32154
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.407e+01 2.668e+00 5.274 2.28e-07 ***
## cylinders -6.017e-03 1.101e-02 -0.547 0.584993
## displacement -1.286e-04 2.835e-04 -0.454 0.650361
## horsepower -8.124e-03 1.344e-03 -6.046 3.63e-09 ***
## weight -1.541e-04 2.578e-05 -5.978 5.34e-09 ***
## acceleration -1.506e-01 4.536e-02 -3.320 0.000991 ***
## year -2.541e-01 7.136e-02 -3.560 0.000419 ***
## originEuropean -2.801e-01 9.866e-02 -2.839 0.004772 **
## originJapanese -2.721e-01 1.229e-01 -2.214 0.027430 *
## brandbuick 6.683e-02 3.351e-02 1.994 0.046840 *
## brandchevrolet 3.528e-02 2.565e-02 1.375 0.169816
## branddatsun 1.896e-01 3.943e-02 4.809 2.21e-06 ***
## branddodge 5.767e-02 2.936e-02 1.965 0.050209 .
## brandford -3.803e-03 2.573e-02 -0.148 0.882547
## brandplymouth 6.476e-02 2.834e-02 2.285 0.022871 *
## brandtoyota 1.106e-01 3.842e-02 2.880 0.004207 **
## brandvolkswagen 2.639e-02 4.022e-02 0.656 0.512122
## branduncommon 6.968e-02 2.657e-02 2.622 0.009100 **
## I(horsepower^2) 1.789e-05 4.230e-06 4.229 2.96e-05 ***
## I(year^2) 1.685e-03 4.829e-04 3.490 0.000540 ***
## acceleration:year 1.690e-03 5.954e-04 2.839 0.004782 **
## acceleration:originEuropean 1.912e-02 5.590e-03 3.420 0.000697 ***
## acceleration:originJapanese 1.544e-02 7.407e-03 2.084 0.037829 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1043 on 369 degrees of freedom
## Multiple R-squared: 0.9112, Adjusted R-squared: 0.9059
## F-statistic: 172.2 on 22 and 369 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mpg_lm_4)
The heteroscedasticity of the residuals has definitely decreased significantly since the first linear model. The adjusted \(R^2\) has improved from 0.8205 to 0.9059, which is pretty impressive given that no additional data has been collected.
Carseats Dataset (Multiple Linear Regression)
This question should be answered using the `Carseats` data set.
`Sales`
Q: Fit a multiple regression model to predict `Sales` using `Price`, `Urban`, and `US`.
A:
`Carseats` is in the `ISLR` package, so the data has already been loaded.
glimpse(Carseats)
## Rows: 400
## Columns: 11
## $ Sales <dbl> 9.50, 11.22, 10.06, 7.40, 4.15, 10.81, 6.63, 11.85, 6.5...
## $ CompPrice <dbl> 138, 111, 113, 117, 141, 124, 115, 136, 132, 132, 121, ...
## $ Income <dbl> 73, 48, 35, 100, 64, 113, 105, 81, 110, 113, 78, 94, 35...
## $ Advertising <dbl> 11, 16, 10, 4, 3, 13, 0, 15, 0, 0, 9, 4, 2, 11, 11, 5, ...
## $ Population <dbl> 276, 260, 269, 466, 340, 501, 45, 425, 108, 131, 150, 5...
## $ Price <dbl> 120, 83, 80, 97, 128, 72, 108, 120, 124, 124, 100, 94, ...
## $ ShelveLoc <fct> Bad, Good, Medium, Medium, Bad, Bad, Medium, Good, Medi...
## $ Age <dbl> 42, 65, 59, 55, 38, 78, 71, 67, 76, 76, 26, 50, 62, 53,...
## $ Education <dbl> 17, 10, 12, 14, 13, 16, 15, 10, 10, 17, 10, 13, 18, 18,...
## $ Urban <fct> Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, No, No, No, Yes,...
## $ US <fct> Yes, Yes, Yes, Yes, No, Yes, No, Yes, No, Yes, Yes, Yes...
sales_lm <- lm(Sales ~ Price + Urban + US, data = Carseats)
Q: Provide an interpretation of each coefficient in the model. Be careful-some of the variables in the model are qualitative!
A:
summary(sales_lm)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
`Price` = -0.054. Interpretation: the effect of a 1-unit increase in `Price` (for fixed values of `Urban` & `US`) is a change in `Sales` of -0.054 units (54 sales).
`Urban` = -0.022. Interpretation: the effect of a store being in an urban area (for fixed values of `Price` & `US`) is a change in `Sales` of -0.022 units (22 fewer sales). However, in this case, since the p-value for this variable's t-test is so high, we can say that there is no evidence for a relationship between the car seat `Sales` at a store and whether the store was `Urban` (or rural).
`US` = 1.200. Interpretation: the effect of a store being in the US (for fixed values of `Price` & `Urban`) is a change in `Sales` of 1.2 units (1200 sales).
Q: Write out the model in equation form, being careful to handle the qualitative variables properly.
A:
\[\hat{Sales} = 13.043469 - 0.054459\cdot Price - 0.021916\cdot Urban + 1.200573\cdot US\]
Where \(Urban = 1\) if the store is in an urban location (0 otherwise), and \(US = 1\) if the store is in the US (0 otherwise).
Q: For which of the predictors can you reject the null hypothesis \(H_{0}: \beta_{j} = 0\)
A:
This question is asking about the results from the parameter t-tests. Based on the output & comments from part (b), we can reject the null hypothesis for the `Price` and `US` predictors, but there is insufficient evidence to reject the null hypothesis that the coefficient for `Urban` is zero.
Q: On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
A:
Removing the `Urban` variable:
sales_lm_2 <- lm(Sales ~ Price + US, data = Carseats)
summary(sales_lm_2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Q: How well do the models in (a) and (e) fit the data?
A:
For model (a), we have: \(R^2\) = 0.2393 and \(\bar{R}^2\) = 0.2335.
For model (e), we have: \(R^2\) = 0.2393 and \(\bar{R}^2\) = 0.2354.
In this case, both models explain ~23.9% of the variance in Sales
.
This is an interesting case to talk about using training \(R^2\) vs \(\bar{R}^2\) (adjusted \(R^2\)) in model selection, since model (a) has a higher \(R^2\) but lower \(\bar{R}^2\).
Recall:
\[R^2 = 1 - \frac{RSS}{TSS}\]
\[\bar{R}^2 = 1 - \frac{RSS/(n - p - 1)}{TSS / (n - 1)}\]
\[RSS = \sum_{i = 1}^n{(y_i - \hat{y})^2}\]
\[TSS = \sum_{i = 1}^n{(y_i - \bar{y})^2}\]
\(TSS\) will be constant regardless of the model being fit, and so maximizing \(\bar{R}^2\) is equivalent to minimizing \(RSS/(n - p - 1)\).
The fact that \(RSS\) will always decrease as variables are added to the model explains why \(R^2\) always increases with additional variables.
However, with \(\bar{R}^2\), adding new variables means that the decrease in \(RSS\) is now ‘competing’ with the decrease in \(n - p - 1\) (caused by the increase in \(p\)), so it is less clear whether \(\bar{R}^2\) will increase or not.
In this particular case, we can see that model (a) adds a useless variable (`Urban`) relative to model (e) and sees a (marginal) increase in \(R^2\) but a decrease in \(\bar{R}^2\), because the reduction in \(RSS\) is not enough to outweigh the penalty from the increase in \(p\).
We should really choose (e) as the better model in this case, as it is more parsimonious, and this is a good example of why the training \(R^2\) or \(RSS\) can be very limited for model selection.
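To tie this back to the formulas above, \(\bar{R}^2\) for model (e) can be reproduced by hand from the \(RSS\) and \(TSS\) (a quick check, not new analysis):
n <- nrow(Carseats)
p <- 2                                    # Price + US
RSS <- sum(residuals(sales_lm_2)^2)
TSS <- sum((Carseats$Sales - mean(Carseats$Sales))^2)
c(manual = 1 - (RSS / (n - p - 1)) / (TSS / (n - 1)),
  from_summary = summary(sales_lm_2)$adj.r.squared)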
Q: Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
A:
The confidence intervals are calculated using:
\[\hat{\beta_i} \pm t_{n-p-1}(0.975) \cdot SE(\hat{\beta_i})\]
Where \(t_{n-p-1}(0.975)\) is the 97.5th percentile of the t-distribution with \(n - p - 1\) = 397 residual degrees of freedom (`qt(0.975, 397)`).
confint(sales_lm_2, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
Strictly, the interval (-0.0648, -0.0442) was produced by a procedure that captures the true `Price` parameter (\(\beta_1\)) in 95% of repeated samples; loosely speaking, we are 95% confident that \(\beta_1\) lies within this interval.
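These intervals can be reproduced by hand from the coefficient table, using the formula above; the result should match the `confint()` output.
coefs <- coef(summary(sales_lm_2))
t_crit <- qt(0.975, df = df.residual(sales_lm_2))   # 397 residual degrees of freedom
cbind(lower = coefs[, "Estimate"] - t_crit * coefs[, "Std. Error"],
      upper = coefs[, "Estimate"] + t_crit * coefs[, "Std. Error"])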
Q: Is there evidence of outliers or high leverage observations in the model from (e)?
A:
Since the average leverage statistic is always equal to \((p + 1)/n\), we can identify observations with a leverage far higher than \((p + 1)/n\) as ‘high-leverage’.
# NOTE: broom::augment(lm)$.hat is the same as hatvalues(lm)
# (p + 1) / n = average leverage?
round(((2 + 1) / nrow(Carseats)), 10) == round(mean(hatvalues(sales_lm_2)), 10)
## [1] TRUE
Similarly, we can identify potential outliers as those observations with a standardized residual outside of \([-2, 2]\).
broom::augment(sales_lm_2) %>%
select(.hat, .std.resid) %>%
ggplot(aes(x = .hat, y = .std.resid)) +
geom_point() +
geom_hline(yintercept = -2, col = "deepskyblue3", size = 1) +
geom_hline(yintercept = 2, col = "deepskyblue3", size = 1) +
geom_vline(xintercept = 3 / nrow(Carseats), col = "mediumseagreen", size = 1) +
theme_light() +
labs(x = "Leverage",
y = "Standardized Residual",
title = "Residuals vs Leverage")
Looking at these thresholds on a graph, it is hard to identify when an observation is large enough in residual & leverage to be influential and worth investigating.
This is why I prefer using a metric that combines the two (e.g. Cook's distance):
broom::augment(sales_lm_2) %>%
rownames_to_column("rowid") %>%
arrange(desc(.cooksd)) %>%
select(Sales, Price, US, .std.resid, .hat, .cooksd)
## # A tibble: 400 x 6
## Sales Price US .std.resid .hat .cooksd
## <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
## 1 14.9 82 No 2.58 0.0116 0.0261
## 2 14.4 53 No 1.73 0.0237 0.0243
## 3 10.6 149 No 2.32 0.0126 0.0228
## 4 15.6 72 Yes 2.17 0.0129 0.0205
## 5 0.37 191 Yes -1.42 0.0286 0.0198
## 6 16.3 92 Yes 2.87 0.00664 0.0183
## 7 3.47 81 No -2.10 0.0119 0.0177
## 8 0.53 159 Yes -2.05 0.0119 0.0169
## 9 13.6 89 No 2.18 0.00983 0.0158
## 10 3.02 90 Yes -2.56 0.00710 0.0157
## # ... with 390 more rows
In this case it’s unlikely that you could justify removing any other observations or class them as problematic. The model is very simple with \(p\) = 2, and many of these observations would not be an issue if more strong predictors were added.
In this problem we will investigate the t-statistic for the null hypothesis \(H_{0}: \beta = 0\) in simple linear regression without an intercept. To begin, we generate a predictor `x` and a response `y` as follows.
set.seed(1)
x = rnorm(100)
y = 2*x + rnorm(100)
`y` using `x` (no intercept)
Q: Perform a simple linear regression of `y` onto `x`, without an intercept. Report the coefficient estimate \(\hat{\beta_a}\), the standard error of this coefficient estimate, and the t-statistic and p-value associated with the null hypothesis \(H_{0}: \beta_a = 0\). Comment on these results.
A:
lm_1 <- lm(y ~ x + 0)
summary(lm_1)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9154 -0.6472 -0.1771 0.5056 2.3109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.9939 0.1065 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
We have: \(\hat{\beta_a}\) = 1.9939, \(SE(\hat{\beta_a})\) = 0.1065, a t-statistic of 18.73 and a p-value of < 2e-16.
The model p-value strongly suggests that there is a relationship between `y` & `x`.
`lm_1` describes a relationship of the form:
\[\hat{y} = \hat{\beta_a}\cdot x = 1.9939\cdot x\]
`x` using `y` (no intercept)
Q: Perform a simple linear regression of `x` onto `y`, without an intercept. Report the coefficient estimate \(\hat{\beta_b}\), the standard error of this coefficient estimate, and the t-statistic and p-value associated with the null hypothesis \(H_{0}: \beta_b = 0\). Comment on these results.
A:
lm_2 <- lm(x ~ y + 0)
summary(lm_2)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8699 -0.2368 0.1030 0.2858 0.8938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.39111 0.02089 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4246 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
We have: \(\hat{\beta_b}\) = 0.3911, \(SE(\hat{\beta_b})\) = 0.0209, a t-statistic of 18.73 and a p-value of < 2e-16.
The model p-value strongly suggests that there is a relationship between `x` & `y`.
`lm_2` describes a relationship of the form:
\[\hat{x} = \hat{\beta_b}\cdot y = 0.3911\cdot y\]
Q: What is the relationship between the results obtained in (a) and (b)?
A:
In both models (`lm_1` & `lm_2`) we have the same t-statistic (and hence, p-value), but the parameter estimates (\(\hat{\beta_a}\) & \(\hat{\beta_b}\)) and their standard error estimates are different, which we would expect.
Since the underlying relationship is given by \(y = 2x + \epsilon\), and reversing this for \(x\) gives \(x = \frac{y - \epsilon}{2}\), I would have expected to get: \(\hat{\beta_a} \approx 2\) and \(\hat{\beta_b} \approx \frac{1}{2} = 0.5\).
However, we get \(\hat{\beta_a}\) = 1.9939 (which is close), and \(\hat{\beta_b}\) = 0.3911 (which is not close)!
It’s interesting that \(\hat{\beta_b} \ne \frac{1}{\hat{\beta_a}}\), and this is not just some peculiarity with performing linear regression without an intercept! The same results can be observed when doing regression with an intercept:
lm_3 <- lm(y ~ x)
summary(lm_3)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8768 -0.6138 -0.1395 0.5394 2.3462
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03769 0.09699 -0.389 0.698
## x 1.99894 0.10773 18.556 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9628 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
\[\hat{y} = -0.0377 + 1.9989\cdot x\]
lm_4 <- lm(x ~ y)
summary(lm_4)
##
## Call:
## lm(formula = x ~ y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.90848 -0.28101 0.06274 0.24570 0.85736
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.03880 0.04266 0.91 0.365
## y 0.38942 0.02099 18.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4249 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
\[\hat{x} = 0.0388 + 0.3894\cdot y\]
In the case of `lm_1`, we are minimizing: \(RSS = \sum_{i = 1}^n{(y_i - \hat{y}_i)^2}\). Minimizing this vertical distance yields a slightly flatter line (with a smaller slope).
data <- data.frame(x, y)
ggplot(data, aes(x, y)) +
geom_point(color="red") +
geom_segment(aes(x = x, y = y,
xend = x, yend = lm_1$fitted.values), col = "grey35") +
geom_abline(intercept = 0, slope = coef(lm_1), size = 1, col = "deepskyblue3") +
labs(title = "lm_1: predicting y using x")
Whereas, in the case of `lm_2`, we are minimizing \(RSS = \sum_{i = 1}^n{(x_i - \hat{x}_i)^2}\). Minimizing this horizontal distance yields a slightly steeper line (with a larger slope).
ggplot(data, aes(x, y)) +
geom_point(color="red") +
geom_segment(aes(x = x, y = y,
xend = lm_2$fitted.values, yend = y), col = "grey35") +
geom_abline(intercept = 0, slope = 1 / coef(lm_2), size = 1, col = "mediumseagreen") +
labs(title = "lm_2: predicting x using y")
Here, I visualize the two regression lines (with no intercept). Note that, if a human were to draw what they expect the ‘line of best fit’ to be, it would likely be between these two lines!
ggplot(data, aes(x, y)) +
geom_point(color="red") +
geom_abline(intercept = 0, slope = coef(lm_1), size = 1, col = "deepskyblue3") +
geom_abline(intercept = 0, slope = 1 / coef(lm_2), size = 1, col = "mediumseagreen")
Fundamentally, I think this comes down to the fact that we are minimizing different things when fitting the two regressions. The term ‘line of best fit’ is deceptive, really, because the quantity we choose to minimize changes depending on what we choose as our response.
Following this graphical logic, it would make sense for these two lines (`y ~ x`, and `x ~ y`) to approach one another as the noise in the relationship decreased, and the relationship approached a straight line.
Increased Noise:
I increase the \(\epsilon\) term in the underlying relationship, so now \(\epsilon\) has `sd = 1.75` instead of `sd = 1`. This produces a noisier relationship between `x` and `y`.
set.seed(1)
x = rnorm(100)
y = 2*x + rnorm(100, sd = 1.75) # previously: y = 2*x + rnorm(100, sd = 1)
data <- data.frame(x,y)
lm_5 <- lm(y ~ x + 0)
lm_6 <- lm(x ~ y + 0)
ggplot(data, aes(x, y)) +
geom_point(color="red") +
geom_abline(intercept = 0, slope = coef(lm_5), size = 1, col = "deepskyblue3") +
geom_abline(intercept = 0, slope = 1 / coef(lm_6), size = 1, col = "mediumseagreen") +
labs(title = "Regression lines (lm_5 & lm_6)",
subtitle = "sd(epsilon) increased to 1.75")
For lm_5
we get:
For lm_6
we get:
Decreased Noise:
I now reduce the \(\epsilon\) term in the underlying relationship to simulate this, so now \(\epsilon\) has `sd = 0.25` instead of `sd = 1`:
set.seed(1)
x = rnorm(100)
y = 2*x + rnorm(100, sd = 0.25) # previously: y = 2*x + rnorm(100, sd = 1)
data <- data.frame(x,y)
lm_7 <- lm(y ~ x + 0)
lm_8 <- lm(x ~ y + 0)
ggplot(data, aes(x, y)) +
geom_point(color="red") +
geom_abline(intercept = 0, slope = coef(lm_7), size = 1, col = "deepskyblue3") +
geom_abline(intercept = 0, slope = 1 / coef(lm_8), size = 1, col = "mediumseagreen") +
labs(title = "Regression lines (lm_7 & lm_8)",
subtitle = "sd(epsilon) reduced to 0.25")
For lm_7
we get:
For lm_8
we get:
Notice how, as \(R^2 \rightarrow\) 1, \(\hat{\beta_b} \rightarrow \frac{1}{\hat{\beta_a}}\), and it is possible to show that these parameter estimates will have a perfect reciprocal relationship only when \(R^2 = 1\).
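A quick numerical check of this claim (a sketch, assuming lm_7 and lm_8 from the low-noise simulation are still in the workspace): for no-intercept fits, the product of the two slope estimates is \(\frac{(\sum{xy})^2}{\sum{x^2}\sum{y^2}}\), which is exactly the (uncentred) \(R^2\) that R reports for such models, so the product approaches 1 as \(R^2 \rightarrow 1\).
coef(lm_7) * coef(lm_8)  # product of the two slope estimates
summary(lm_7)$r.squared  # should equal the product above for these no-intercept fits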
I’m now reaching the end of my knowledge so I will leave this question here, but hopefully that has made clearer the mechanics for why the regression coefficients are different, even though (for me) it’s a confusing concept to wrap my head around.
Q: For the regression of \(Y\) onto \(X\) without an intercept… show algebraically, and confirm numerically in R, that the t-statistic can be written as: \(T = \frac{\hat{\beta}}{SE(\hat{\beta})} = \frac {(\sqrt{n-1}) \sum{x y}} {\sqrt{(\sum{x^2})(\sum{y^2}) - (\sum{x y})^2}}\)
A:
For regression without an intercept, we have the following results:
\[\hat{\beta} = \frac{\sum{x y}}{\sum{x^2}} \\ SE(\hat{\beta}) = \sqrt{\frac{\sum{(y - x \hat{\beta})^2}}{(n-1) \sum{x^2}}}\]
Hence:
\[T = \frac{\hat{\beta}}{SE(\hat{\beta})} = \frac{\frac{\sum{x y}}{\sum{x^2}}} {\sqrt{\frac{\sum{(y - x \hat{\beta})^2}}{(n-1) \sum{x^2}}}}\]
Using the same logic as in the \(RSS\) algebra in question 7 (since \(\hat{\beta_{0}}\) = 0 and \(\hat{y} = x \hat{\beta}\)):
\[\begin{align*} RSS & = \sum{(y - x \hat{\beta})^2} \\ & = \sum{(y - \hat{y})^2} \\ & = \sum{y^2} - 2\hat{\beta}\sum{xy} + \hat{\beta}^2\sum{x^2} &&\text{(steps shown in q7)} \\ & = \frac{\sum{y^2}\sum{x^2} - (\sum{xy})^2}{\sum{x^2}} &&\text{(steps shown in q7)} \end{align*} \]
We can use this result to simplify the denominator:
\[\begin{align*} SE(\hat{\beta}) & = \sqrt{\frac{\sum{(y - x \hat{\beta})^2}}{(n-1) \sum{x^2}}} \\ & = \sqrt{\frac{\left(\frac{\sum{y^2}\sum{x^2} - (\sum{xy})^2}{\sum{x^2}} \right)}{(n-1) \sum{x^2}}} && \text{(substituting the result for } \sum{(y - x \hat{\beta})^2} \text{)}\\ & = \sqrt{\frac{\sum{y^2}\sum{x^2} - (\sum{xy})^2}{(n-1)(\sum{x^2})^2}} \\ & = \frac{\sqrt{\sum{y^2}\sum{x^2} - (\sum{xy})^2}} {\sqrt{n-1}\sum{x^2}} \end{align*} \]
Subbing this in, we get to the required equality:
\[\begin{align*} T & = \frac{\hat{\beta}}{SE(\hat{\beta})} \\ & = \frac{ \left(\frac{\sum{x y}}{\sum{x^2}}\right) } {\left(\frac{\sqrt{\sum{y^2}\sum{x^2} - (\sum{xy})^2}} {\sqrt{n-1}\sum{x^2}}\right)} \\ & = \frac{\sum{x y}}{\sum{x^2}} \cdot \frac{\sqrt{n-1}\sum{x^2}}{\sqrt{\sum{y^2}\sum{x^2} - (\sum{xy})^2}} \\ & = \frac {\sqrt{n-1} \sum{x y}} {\sqrt{(\sum{x^2})(\sum{y^2}) - (\sum{x y})^2}} \end{align*} \]
I now verify this is the case in R:
Recall lm_1
from earlier:
set.seed(1)
x = rnorm(100)
y = 2*x + rnorm(100)
lm_1 <- lm(y ~ x + 0)
summary(lm_1)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9154 -0.6472 -0.1771 0.5056 2.3109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.9939 0.1065 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
From the model output we obtain the following T-statistic: 18.7259319
Now to verify this, by using the formula from the proof:
\[T = \frac {\sqrt{n-1} \sum{x y}} {\sqrt{(\sum{x^2})(\sum{y^2}) - (\sum{x y})^2}}\]
T_statistic <- (sqrt(length(x) - 1) * sum(x*y)) / sqrt(sum(x^2) * sum(y^2) - sum(x*y)^2)
T_statistic
## [1] 18.72593
Checking whether the T-statistic from the model summary is the same as my manually calculated T-statistic, to 10dp.:
round(as.numeric(T_statistic), 10) == round(as.numeric(summary(lm_1)$coefficients[3]), 10)
## [1] TRUE
Q: Using the results from (d), argue that the t-statistic for the regression of \(y\) onto \(x\) is the same as the t-statistic for the regression of \(x\) onto \(y\).
A:
This is immediate from swapping \(x\) and \(y\) in the formula above:
\[T_{y \sim x} = \frac{\sqrt{n-1} \sum{x y}} {\sqrt{(\sum{x^2})(\sum{y^2}) - (\sum{x y})^2}} = \frac{\sqrt{n-1} \sum{y x}} {\sqrt{(\sum{y^2})(\sum{x^2}) - (\sum{y x})^2}} = T_{x \sim y}\]
Q: In R, show that when regression is performed with an intercept, the t-statistic for \(H_{0}: \beta_{1} = 0\) is the same for the regression of y
onto x
as it is for the regression of x
onto y
.
A:
As before, since we are asked to show this in R rather than prove it, I take it that the authors expect a numerical example (not a proof). Using the example provided earlier:
set.seed(1)
x = rnorm(100)
y = 2*x + rnorm(100)
lm_3 <- lm(y ~ x)
summary(lm_3)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03769261 0.09698729 -0.3886346 6.983896e-01
## x 1.99893961 0.10772703 18.5555993 7.723851e-34
lm_4 <- lm(x ~ y)
summary(lm_4)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.03880394 0.04266144 0.9095787 3.652764e-01
## y 0.38942451 0.02098690 18.5555993 7.723851e-34
They look the same, but just to check, I extract the T-values and check their equality (to 10dp.):
round(summary(lm_3)$coefficients[2, 3], 10) == round(summary(lm_4)$coefficients[2, 3], 10)
## [1] TRUE
This problem involves simple linear regression without an intercept.
Q: Recall that the coefficient estimate \(\hat{\beta_a}\) for the linear regression of \(y\) onto \(x\) without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of \(x\) onto \(y\) the same as the coefficient estimate for the regression of \(y\) onto \(x\)?
A:
Using the same notation as previously, where:
\[\hat{y} = \hat{\beta_a}x \\ \hat{x} = \hat{\beta_b}y\]
We have:
\[\hat{\beta_a} = \frac{\sum{xy}}{\sum{x^2}} \\ \hat{\beta_b} = \frac{\sum{yx}}{\sum{y^2}} \\ \hat{\beta_a} = \hat{\beta_b} \iff \frac{\sum{xy}}{\sum{x^2}} = \frac{\sum{yx}}{\sum{y^2}} \iff \sum_{i = 1}^n{x_i^2} = \sum_{i = 1}^n{y_i^2}\]
Q: Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of \(x\) onto \(y\) is different from the coefficient estimate for the regression of \(y\) onto \(x\)
A:
All that is required here is that the above equality doesn’t hold, which will be the case for almost any choice of x and y (the exceptions are a few very special cases).
I generate 100 observations with the following properties:
\[y = 5\cdot x + \epsilon \\ \epsilon \sim \mathcal{N}(0, 2^2)\]
set.seed(2)
x <- rnorm(100)
y <- 5*x + rnorm(100, sd = 2)
data <- data.frame(x, y)
lm_9 <- lm(y ~ x + 0)
lm_10 <- lm(x ~ y + 0)
We have:
It’s quite clear that \(\hat{\beta_a} \ne \hat{\beta_b}\):
ggplot(data, aes(x, y)) +
geom_point(color="red") +
geom_abline(intercept = 0, slope = coef(lm_9), size = 1, col = "deepskyblue3") +
geom_abline(intercept = 0, slope = 1 / coef(lm_10), size = 1, col = "mediumseagreen") +
labs(title = "Regression lines (lm_9 & lm_10)")
The actual coefficients:
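They can be pulled straight from the fitted objects (shown as code only; the specific values are whatever this seed produces):
coef(lm_9)   # slope of y ~ x + 0
coef(lm_10)  # slope of x ~ y + 0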
Q: Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of \(x\) onto \(y\) is the same as the coefficient estimate for the regression of \(y\) onto \(x\).
A:
One such example where \(\sum_{i = 1}^n{x_i^2} = \sum_{i = 1}^n{y_i^2}\) would hold is obviously where the two vectors are equal (a relationship where \(R^2\) = 1, and \(\hat{\beta_b} = \frac{1}{\hat{\beta_a}} = \frac{1}{1} = 1\)):
set.seed(3)
x <- rnorm(100)
y <- x
data <- data.frame(x, y)
lm_11 <- lm(y ~ x + 0)
lm_12 <- lm(x ~ y + 0)
ggplot(data, aes(x, y)) +
geom_point(color="red") +
geom_abline(intercept = 0, slope = coef(lm_11), size = 1, col = "deepskyblue3") +
geom_abline(intercept = 0, slope = 1 / coef(lm_12), size = 1, col = "mediumseagreen") +
labs(title = "Regression lines (lm_11 & lm_12)")
We have:
Another example would be where vectors \(x\) and \(y\) are permutations of one another. Clearly \(\sum_{i = 1}^n{x_i^2} = \sum_{i = 1}^n{y_i^2}\) would hold here, so the parameters should also be equal:
set.seed(4)
x <- 1:100
y <- x[order(rnorm(100))]
data <- data.frame(x, y)
lm_13 <- lm(y ~ x + 0)
lm_14 <- lm(x ~ y + 0)
ggplot(data, aes(x, y)) +
geom_point(color="red") +
geom_abline(intercept = 0, slope = coef(lm_13), size = 1, col = "deepskyblue3") +
geom_abline(intercept = 0, slope = 1 / coef(lm_14), size = 1, col = "mediumseagreen") +
labs(title = "Regression lines (lm_13 & lm_14)")
We have:
Note that the previous example showed a perfect relationship where \(\hat{\beta_a} = \hat{\beta_b}\) held true, and this example shows a random relationship where it also holds. You could create a relationship as strong or as weak as desired, simply by permuting fewer (or more) of the elements of \(x\), and still we would see that \(\hat{\beta_a} = \hat{\beta_b}\) holds.
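Here is a sketch of that partial-permutation idea (the seed, the starting vectors and the choice of 20 elements are my own illustrative choices): shuffling only a subset of the elements of y leaves \(\sum{y_i^2}\), and hence the equality \(\sum{x_i^2} = \sum{y_i^2}\), unchanged, so the two no-intercept slope estimates should still agree.
set.seed(6)               # illustrative seed
x <- rnorm(100)
y <- x                    # start from a perfect relationship
idx <- sample(1:100, 20)  # pick 20 positions...
y[idx] <- y[sample(idx)]  # ...and shuffle y among them (sum(y^2) is unchanged)
coef(lm(y ~ x + 0))       # slope of y on x
coef(lm(x ~ y + 0))       # slope of x on y -- should be identical to the line above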
Another obvious example would be where \(y = |x|\):
set.seed(5)
x <- rnorm(100)
y <- abs(x)
data <- data.frame(x, y)
lm_15 <- lm(y ~ x + 0)
lm_16 <- lm(x ~ y + 0)
ggplot(data, aes(x, y)) +
geom_point(color="red") +
geom_abline(intercept = 0, slope = coef(lm_15), size = 1, col = "deepskyblue3") +
geom_abline(intercept = 0, slope = 1 / coef(lm_16), size = 1, col = "mediumseagreen") +
labs(title = "Regression lines (lm_15 & lm_16)")
We have:
In this exercise you will create some simulated data and will fit simple linear regression models to it. Make sure to use set.seed(1)
prior to starting part (a) to ensure consistent results.
set.seed(1)
Q: Using the rnorm()
function, create a vector, x
, containing 100 observations drawn from a \(\mathcal{N}(0, 1)\) distribution. This represents a feature, \(x\)
A:
x <- rnorm(100)
Q: Using the rnorm()
function, create a vector, eps
, containing 100 observations drawn from a \(\mathcal{N}(0, 0.25)\) distribution i.e. a normal distribution with mean zero and variance 0.25.
A:
eps <- rnorm(100, sd = 0.5) # var = sd^2
Q: Using x
and eps, generate a vector y according to the model: \(y = -1 + 0.5x + \epsilon\) What is the length of the vector \(y\)? What are the values of \(\beta_0\) and \(\beta_1\) in this linear model?
A:
y <- -1 + 0.5*x + eps
The length of y is obviously 100, because it must be the same length as x and eps (each element of y is generated from the corresponding elements of x and eps):
length(y)
## [1] 100
We also know the true parameters, which come straight from the data-generating model: \(\beta_0 = -1\) and \(\beta_1 = 0.5\).
Q: Create a scatterplot displaying the relationship between x
and y
. Comment on what you observe.
A:
data <- data.frame(x, y)
g1 <- ggplot(data, aes(x = x, y = y)) +
geom_point()
g1
Comments: there is a positive, approximately linear relationship between x and y, consistent with the population line \(y = -1 + 0.5x\), with a moderate amount of scatter around it coming from the \(\epsilon\) term.
Q: Fit a least squares linear model to predict y
using x
. Comment on the model obtained. How do \(\hat{\beta_{0}}\) and \(\hat{\beta_{1}}\) compare to \(\beta_{0}\) and \(\beta_{1}\)?
A:
lm_17 <- lm(y ~ x)
coef(lm_17)
## (Intercept) x
## -1.0188463 0.4994698
Hence \(\hat{\beta_0} \approx -1.019\) and \(\hat{\beta_1} \approx 0.499\): both estimates are very close to the true values \(\beta_0 = -1\) and \(\beta_1 = 0.5\).
Q: Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different color. Use the legend()
command to create an appropriate legend.
A:
Note: It appears that putting the intercept, slope & col arguments for geom_abline()
within the aes()
call automatically generates a legend.
g1 +
geom_abline(aes(intercept = -1, slope = 0.5, col = "Population")) +
geom_abline(aes(intercept = coef(lm_17)[1], slope = coef(lm_17)[2], col = "Least Squares")) +
scale_colour_manual(name = "Regression Line:", values = c("red", "blue")) +
theme(legend.position = "bottom") +
labs(title = "sd(eps) = 0.5")
Q: Now fit a polynomial regression model that predicts y
using x
and x^2
. Is there evidence that the quadratic term improves the model fit? Explain your answer.
A:
First, we can view the summary of the previous linear model (lm_17 <- lm(y ~ x)
):
summary(lm_17)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.93842 -0.30688 -0.06975 0.26970 1.17309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.01885 0.04849 -21.010 < 2e-16 ***
## x 0.49947 0.05386 9.273 4.58e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4814 on 98 degrees of freedom
## Multiple R-squared: 0.4674, Adjusted R-squared: 0.4619
## F-statistic: 85.99 on 1 and 98 DF, p-value: 4.583e-15
Comparing this to the fit of a new model, lm_18
, which includes a quadratic term (x^2
):
lm_18 <- lm(y ~ x + I(x^2))
summary(lm_18)
##
## Call:
## lm(formula = y ~ x + I(x^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.98252 -0.31270 -0.06441 0.29014 1.13500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.97164 0.05883 -16.517 < 2e-16 ***
## x 0.50858 0.05399 9.420 2.4e-15 ***
## I(x^2) -0.05946 0.04238 -1.403 0.164
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.479 on 97 degrees of freedom
## Multiple R-squared: 0.4779, Adjusted R-squared: 0.4672
## F-statistic: 44.4 on 2 and 97 DF, p-value: 2.038e-14
The coefficient for the x^2
variable is not significant, so there is no statistical evidence to reject the null hypothesis that the coefficient for this quadratic term is zero.
It is interesting that the adjusted \(R^2\) increased (slightly) with the inclusion of the quadratic term, which raises questions about the suitability of blindly using such a metric for variable selection. The improvement is small enough that simply varying the random seed will often make it disappear.
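An equivalent check (a sketch, assuming lm_17 and lm_18 are still in the workspace) is a partial F-test comparing the two nested models; with only a single added term, its p-value should reproduce the 0.164 from the t-test above.
anova(lm_17, lm_18)  # partial F-test for adding the quadratic term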
Q: Repeat (a)-(f) after modifying the data generation process in such a way that there is less noise in the data. The model (3.39) should remain the same. You can do this by decreasing the variance of the normal distribution used to generate the error term \(\epsilon\) in (b). Describe your results.
A:
Here I repeat the process, this time reducing \(Var(\epsilon)\) from \(0.5^2\) to \(0.1^2\).
Note that the parameter estimates are ever-so-slightly closer to the true population parameters, but in both cases they are incredibly close to approximating the underlying function:
set.seed(1)
x <- rnorm(100)
eps <- rnorm(100, sd = 0.1)
y <- -1 + 0.5*x + eps
data <- data.frame(x, y)
lm_19 <- lm(y ~ x)
ggplot(data, aes(x = x, y = y)) +
geom_point() +
geom_abline(aes(intercept = -1, slope = 0.5, col = "Population")) +
geom_abline(aes(intercept = coef(lm_19)[1], slope = coef(lm_19)[2], col = "Least Squares")) +
scale_colour_manual(name = "Regression Line:", values = c("red", "blue")) +
theme(legend.position = "bottom") +
labs(title = "sd(eps) = 0.1")
The standard linear model fit:
summary(lm_19)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.18768 -0.06138 -0.01395 0.05394 0.23462
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.003769 0.009699 -103.5 <2e-16 ***
## x 0.499894 0.010773 46.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09628 on 98 degrees of freedom
## Multiple R-squared: 0.9565, Adjusted R-squared: 0.956
## F-statistic: 2153 on 1 and 98 DF, p-value: < 2.2e-16
Comparing to a quadratic fit:
lm_20 <- lm(y ~ x + I(x^2))
summary(lm_20)
##
## Call:
## lm(formula = y ~ x + I(x^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.19650 -0.06254 -0.01288 0.05803 0.22700
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.994328 0.011766 -84.512 <2e-16 ***
## x 0.501716 0.010798 46.463 <2e-16 ***
## I(x^2) -0.011892 0.008477 -1.403 0.164
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0958 on 97 degrees of freedom
## Multiple R-squared: 0.9573, Adjusted R-squared: 0.9565
## F-statistic: 1088 on 2 and 97 DF, p-value: < 2.2e-16
As we’d expect, \(RSE\) decreases considerably & \(R^2\)/adjusted \(R^2\) increases considerably for both models.
It is interesting that, when using the same random seed, the p-value of the quadratic term does not change at all, despite the noise term (eps = \(\epsilon\)) having a much smaller variance than before. This makes sense: with the same seed, eps is just a rescaled version of the same draws, and rescaling the noise scales both the estimated quadratic coefficient and its standard error by the same factor, leaving the t-statistic (and hence the p-value) unchanged.
The findings are identical to before: the coefficient for the x^2 variable is not significant, so there is no statistical evidence to reject the null hypothesis that the coefficient for this quadratic term is zero.
Q: Repeat (a)-(f) after modifying the data generation process in such a way that there is more noise in the data. The model (3.39) should remain the same. You can do this by increasing the variance of the normal distribution used to generate the error term \(\epsilon\) in (b). Describe your results.
A:
Here I repeat the process, this time increasing \(Var(\epsilon)\) from \(0.5^2\) to \(1^2 = 1\).
set.seed(1)
x <- rnorm(100)
eps <- rnorm(100, sd = 1)
y <- -1 + 0.5*x + eps
data <- data.frame(x, y)
lm_21 <- lm(y ~ x)
ggplot(data, aes(x = x, y = y)) +
geom_point() +
geom_abline(aes(intercept = -1, slope = 0.5, col = "Population")) +
geom_abline(aes(intercept = coef(lm_21)[1], slope = coef(lm_21)[2], col = "Least Squares")) +
scale_colour_manual(name = "Regression Line:", values = c("red", "blue")) +
theme(legend.position = "bottom") +
labs(title = "sd(eps) = 1")
The standard linear model fit:
summary(lm_21)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8768 -0.6138 -0.1395 0.5394 2.3462
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.03769 0.09699 -10.699 < 2e-16 ***
## x 0.49894 0.10773 4.632 1.12e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9628 on 98 degrees of freedom
## Multiple R-squared: 0.1796, Adjusted R-squared: 0.1712
## F-statistic: 21.45 on 1 and 98 DF, p-value: 1.117e-05
Comparing to a quadratic fit:
lm_22 <- lm(y ~ x + I(x^2))
summary(lm_22)
##
## Call:
## lm(formula = y ~ x + I(x^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9650 -0.6254 -0.1288 0.5803 2.2700
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.94328 0.11766 -8.017 2.47e-12 ***
## x 0.51716 0.10798 4.789 6.01e-06 ***
## I(x^2) -0.11892 0.08477 -1.403 0.164
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.958 on 97 degrees of freedom
## Multiple R-squared: 0.1959, Adjusted R-squared: 0.1793
## F-statistic: 11.82 on 2 and 97 DF, p-value: 2.557e-05
As we’d expect, \(RSE\) increases considerably & \(R^2\)/adjusted \(R^2\) decreases considerably for both models.
In this iteration I was surprised at how well the least squares fit still matched the population regression line, as the relationship appears far noisier than before.
Again, the results of the t-test on the quadratic term conclude that there is no evidence for supporting a quadratic relationship.
Q: What are the confidence intervals for \(\beta_0\) and \(\beta_1\) based on the original data set, the noisier data set, and the less noisy data set? Comment on your results.
A:
Recall the 95% confidence interval for \(\beta_i\) is given by:
\[\hat{\beta_i} \pm t_{n-2}(0.975) \cdot SE(\hat{\beta_i})\]
Original Data: \(Var(\epsilon) = 0.5^2\)
confint(lm_17)
## 2.5 % 97.5 %
## (Intercept) -1.1150804 -0.9226122
## x 0.3925794 0.6063602
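As a sanity check against the formula above (a sketch using lm_17; df.residual() returns the \(n - 2 = 98\) residual degrees of freedom):
est <- coef(summary(lm_17))[, "Estimate"]
se  <- coef(summary(lm_17))[, "Std. Error"]
t_crit <- qt(0.975, df = df.residual(lm_17))
cbind(lower = est - t_crit * se, upper = est + t_crit * se)  # should match confint(lm_17) above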
Noisier Data: \(Var(\epsilon) = 1\)
confint(lm_21)
## 2.5 % 97.5 %
## (Intercept) -1.2301607 -0.8452245
## x 0.2851588 0.7127204
Less-Noisy Data: \(Var(\epsilon) = 0.1^2\)
confint(lm_19)
## 2.5 % 97.5 %
## (Intercept) -1.0230161 -0.9845224
## x 0.4785159 0.5212720
None of the confidence intervals for \(\beta_1\) contain zero.
As the relationship becomes noisier (i.e. as \(Var(\epsilon)\) increases), the 95% confidence interval for each parameter becomes wider. Since we saw the actual parameter estimates \(\hat{\beta_i}\) barely change, this is being driven by an increase in \(SE(\hat{\beta_i})\).
All of the confidence intervals are centered around -1 for \(\beta_0\) and 0.5 for \(\beta_1\), which is to be expected (in all 3 scenarios, the parameter estimates \(\hat{\beta_i}\) were very close to the population parameters).
A stronger relationship with less noise means that the true population parameter \(\beta_i\) can be said to fall within a smaller region (with 95% certainty).
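To make the widening concrete, the width of the 95% interval for \(\beta_1\) can be compared directly across the three fits (a sketch, assuming lm_17, lm_19 and lm_21 are all still in the workspace):
sapply(list(less_noisy = lm_19, original = lm_17, noisier = lm_21),
       function(m) unname(diff(confint(m)["x", ])))  # CI width for the slope in each scenario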
This problem focuses on the collinearity problem.
Q: Perform the following commands in R:
set.seed(1)
x1=runif (100)
x2=0.5*x1+rnorm (100)/10
y=2+2*x1+0.3*x2+rnorm (100)
The last line corresponds to creating a linear model in which y
is a function of x1
and x2
. Write out the form of the linear model. What are the regression coefficients?
A:
set.seed(1)
x1 = runif(100)
x2 = 0.5*x1 + rnorm(100)/10
y = 2 + 2*x1 + 0.3*x2 + rnorm(100)
The form of the regression model is given by:
\[y = \beta_0 + \beta_1\cdot x_1 + \beta_2\cdot x_2 + \epsilon\]
The regression coefficients used to generate the data are \(\beta_0 = 2\), \(\beta_1 = 2\) and \(\beta_2 = 0.3\).
Q: Do the results obtained in (c)-(e) contradict each other? Explain your answer
A:
The supposedly contradictory results stem from the fact that, in a model with x1 and x2 as predictors, x2 was not significant, whereas in a model with x2 as the only predictor, it was.
These results are not contradictory: x2 simply does not offer enough ‘new information’ once the model already contains x1. Because x1 and x2 are highly correlated, much of the information carried by one predictor is redundant when both are included, so x2 can be significant on its own yet not significant in the presence of x1.
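The degree of collinearity is easy to quantify (a sketch; cor() is base R, while the commented-out line assumes the car package is installed for variance inflation factors):
cor(x1, x2)                 # correlation between the two predictors
# car::vif(lm(y ~ x1 + x2)) # variance inflation factors, if the car package is available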
Q: Now suppose we obtain one additional observation, which was unfortunately mismeasured.
x1=c(x1, 0.1)
x2=c(x2, 0.8)
y=c(y,6)
Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation have on the each of the models? In each model, is this observation an outlier? A high-leverage point? Both? Explain your answers.
A:
x1 = c(x1, 0.1)
x2 = c(x2, 0.8)
y = c(y, 6)
data <- data.frame(y, x1, x2, new = 0)
data$new[length(data$new)] <- 1
ggplot(data, aes(x = x1, y = x2, col = factor(new))) +
geom_point() +
scale_color_manual(values = c("black", "red")) +
theme(legend.position = "none")
ggplot(data, aes(x = x1, y = y, col = factor(new))) +
geom_point() +
scale_color_manual(values = c("black", "red")) +
theme(legend.position = "none")
ggplot(data, aes(x = x2, y = y, col = factor(new))) +
geom_point() +
scale_color_manual(values = c("black", "red")) +
theme(legend.position = "none")
Model: y ~ x1 + x2
Previously, see part (c).
With the new observation included:
summary(lm(y ~ x1 + x2))
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73348 -0.69318 -0.05263 0.66385 2.30619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2267 0.2314 9.624 7.91e-16 ***
## x1 0.5394 0.5922 0.911 0.36458
## x2 2.5146 0.8977 2.801 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared: 0.2188, Adjusted R-squared: 0.2029
## F-statistic: 13.72 on 2 and 98 DF, p-value: 5.564e-06
Previously, we had that in a model containing both x1
and x2
, x1
was significant and x2
wasn’t.
With the addition of this new observation, this has reversed (x2
is now significant, x1
isn’t).
broom::augment(lm(y ~ x1 + x2)) %>%
cbind(new = data$new) %>%
ggplot(aes(x = .hat, y = .std.resid, col = factor(new))) +
geom_point() +
geom_hline(yintercept = -2, col = "deepskyblue3", size = 1, alpha = 0.5) +
geom_hline(yintercept = 2, col = "deepskyblue3", size = 1, alpha = 0.5) +
geom_vline(xintercept = 3 / nrow(data), col = "mediumseagreen", size = 1, alpha = 0.5) +
scale_color_manual(values = c("black", "red")) +
theme_light() +
theme(legend.position = "none") +
labs(x = "Leverage",
y = "Standardized Residual",
title = "Residuals vs Leverage: y ~ x1 + x2")
Recall that the leverage of a point measures how unusual (how ‘far away’) its predictor values \(x_i\) are relative to the other observations. The average leverage is \((p + 1)/n\).
Judging by the graph above, in the case of lm(y ~ x1 + x2), the new red observation has (by far) the highest leverage, and this can be visually confirmed by looking at its position on the scatter plot of x1 vs x2. It also has a large residual, and hence could be considered an outlier.
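The leverage values in the plot can also be checked numerically (a sketch; hatvalues() is the base-R extractor, the refit below includes the mismeasured 101st observation, and fit is just a throwaway name):
fit <- lm(y ~ x1 + x2)
max(hatvalues(fit))   # leverage of the most extreme point (the new observation, per the plot)
mean(hatvalues(fit))  # average leverage, equal to (p + 1)/n = 3/101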
Model: y ~ x1
Previously, see part (d).
With the new observation included:
summary(lm(y ~ x1))
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8897 -0.6556 -0.0909 0.5682 3.5665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2569 0.2390 9.445 1.78e-15 ***
## x1 1.7657 0.4124 4.282 4.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.1477
## F-statistic: 18.33 on 1 and 99 DF, p-value: 4.295e-05
Previously, in the model containing just x1
as a predictor, x1
was significant. This is still the case.
broom::augment(lm(y ~ x1)) %>%
cbind(new = data$new) %>%
ggplot(aes(x = .hat, y = .std.resid, col = factor(new))) +
geom_point() +
geom_hline(yintercept = -2, col = "deepskyblue3", size = 1, alpha = 0.5) +
geom_hline(yintercept = 2, col = "deepskyblue3", size = 1, alpha = 0.5) +
geom_vline(xintercept = 2 / nrow(data), col = "mediumseagreen", size = 1, alpha = 0.5) +
scale_color_manual(values = c("black", "red")) +
theme_light() +
theme(legend.position = "none") +
labs(x = "Leverage",
y = "Standardized Residual",
title = "Residuals vs Leverage: y ~ x1")
For lm(y ~ x1), the new observation is still fairly high-leverage, but it is also an outlier, with a very large standardized residual (> 3). Looking at the graph of y vs x1, we can visually confirm this (the point is far from the mean of x1 and would be the regression line’s biggest outlier).
Model: y ~ x2
Previously, see part (e).
With the new observation included:
summary(lm(y ~ x2))
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64729 -0.71021 -0.06899 0.72699 2.38074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3451 0.1912 12.264 < 2e-16 ***
## x2 3.1190 0.6040 5.164 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared: 0.2122, Adjusted R-squared: 0.2042
## F-statistic: 26.66 on 1 and 99 DF, p-value: 1.253e-06
Previously, in the model containing just x2
as a predictor, x2
was significant. This is still the case.
broom::augment(lm(y ~ x2)) %>%
cbind(new = data$new) %>%
ggplot(aes(x = .hat, y = .std.resid, col = factor(new))) +
geom_point() +
geom_hline(yintercept = -2, col = "deepskyblue3", size = 1, alpha = 0.5) +
geom_hline(yintercept = 2, col = "deepskyblue3", size = 1, alpha = 0.5) +
geom_vline(xintercept = 2 / nrow(data), col = "mediumseagreen", size = 1, alpha = 0.5) +
scale_color_manual(values = c("black", "red")) +
theme_light() +
theme(legend.position = "none") +
labs(x = "Leverage",
y = "Standardized Residual",
title = "Residuals vs Leverage: y ~ x2")
For lm(y ~ x2), the observation is (once again) the highest-leverage point, which agrees with the graph of y vs x2 (where it is furthest from the mean of x2).
In this case, the observation is mostly in line with the fit of a regression line, so its residual is comparatively low. For this model, the new observation might not be considered influential.
This problem involves the Boston
data set, which we saw in the lab for this chapter. We will now try to predict per capita crime rate using the other variables in this data set. In other words, per capita crime rate is the response, and the other variables are the predictors.
Q: For each predictor, fit a simple linear regression model to predict the response. Describe your results. In which of the models is there a statistically significant association between the predictor and the response? Create some plots to back up your assertions.
A:
Boston <- MASS::Boston
Boston$chas <- factor(Boston$chas)
I wrote the following loop (based on extractions from my best_predictor() function), which fits a simple linear model predicting crim for each predictor in Boston, extracts each model’s \(R^2\) and slope t-test p-value, then ranks the predictors in descending \(R^2\).
I also extract the model slope (\(\hat{\beta_1}\)), for use in (c).
R2 <- c()
P_value <- c()
Slope <- c()
y <- Boston$crim
for (i in 1:ncol(Boston)) {
x <- Boston[ ,i]
if (!identical(y, x)) {
# linear: y ~ x
R2[i] <- summary(lm(y ~ x))$r.squared
P_value[i] <- summary(lm(y ~ x))$coefficients[2, 4]
Slope[i] <- summary(lm(y ~ x))$coefficients[2, 1]
} else {
R2[i] <- NA
P_value[i] <- NA
Slope[i] <- NA
}
}
crime_preds <- data.frame(varname = names(Boston),
R2 = round(R2, 5),
P_value = round(P_value, 10),
Slope = round(Slope, 5)) %>%
arrange(desc(R2))
crime_preds
## varname R2 P_value Slope
## 1 rad 0.39126 0.0000000000 0.61791
## 2 tax 0.33961 0.0000000000 0.02974
## 3 lstat 0.20759 0.0000000000 0.54880
## 4 nox 0.17722 0.0000000000 31.24853
## 5 indus 0.16531 0.0000000000 0.50978
## 6 medv 0.15078 0.0000000000 -0.36316
## 7 black 0.14827 0.0000000000 -0.03628
## 8 dis 0.14415 0.0000000000 -1.55090
## 9 age 0.12442 0.0000000000 0.10779
## 10 ptratio 0.08407 0.0000000000 1.15198
## 11 rm 0.04807 0.0000006347 -2.68405
## 12 zn 0.04019 0.0000055065 -0.07393
## 13 chas 0.00312 0.2094345015 -1.89278
## 14 crim NA NA NA
It appears that, when tested individually, all predictors are significant (with the exception of chas
, which has p = 0.2094 > 0.05).
I skip producing plots here, as this was already done and discussed in some detail in Ch.2 - Q10 - (b), (c), etc.
Q: Fit a multiple regression model to predict the response using all of the predictors. Describe your results. For which predictors can we reject the null hypothesis \(H_{0}: \beta_{j} = 0\)
A:
Fitting the linear model crim_lm
, predicting crim
using all predictors:
crim_lm <- lm(crim ~ ., data = Boston)
summary(crim_lm)
##
## Call:
## lm(formula = crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.924 -2.120 -0.353 1.019 75.051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.033228 7.234903 2.354 0.018949 *
## zn 0.044855 0.018734 2.394 0.017025 *
## indus -0.063855 0.083407 -0.766 0.444294
## chas1 -0.749134 1.180147 -0.635 0.525867
## nox -10.313535 5.275536 -1.955 0.051152 .
## rm 0.430131 0.612830 0.702 0.483089
## age 0.001452 0.017925 0.081 0.935488
## dis -0.987176 0.281817 -3.503 0.000502 ***
## rad 0.588209 0.088049 6.680 6.46e-11 ***
## tax -0.003780 0.005156 -0.733 0.463793
## ptratio -0.271081 0.186450 -1.454 0.146611
## black -0.007538 0.003673 -2.052 0.040702 *
## lstat 0.126211 0.075725 1.667 0.096208 .
## medv -0.198887 0.060516 -3.287 0.001087 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.439 on 492 degrees of freedom
## Multiple R-squared: 0.454, Adjusted R-squared: 0.4396
## F-statistic: 31.47 on 13 and 492 DF, p-value: < 2.2e-16
We can reject the null hypothesis \(H_{0}: \beta_{j} = 0\) (at the 5% level) for the predictors:
zn
dis
rad
black
medv
For all other predictors, there was insufficient evidence to reject the null hypothesis that their coefficient was zero.
Q: How do your results from (a) compare to your results from (b)? Create a plot displaying the univariate regression coefficients from (a) on the x-axis, and the multiple regression coefficients from (b) on the y-axis. That is, each predictor is displayed as a single point in the plot. Its coefficient in a simple linear regression model is shown on the x-axis, and its coefficient estimate in the multiple linear regression model is shown on the y-axis
A:
See the table below, where Slope
appears as the x-axis variable, and is \(\hat{\beta_1}\) in the simple linear regression case. Estimate
is the y-axis variable, \(\hat{\beta_j}\), the coefficient estimate for the variable in the multiple regression equation.
multiple_reg <- summary(crim_lm)$coefficients %>%
as.data.frame() %>%
rownames_to_column("varname") %>%
dplyr::select(varname, Estimate)
simple_reg <- crime_preds %>%
select(varname, Slope) %>%
mutate(varname = as.character(varname))
simple_reg$varname[simple_reg$varname == "chas"] <- "chas1"
simple_multiple_reg <- inner_join(simple_reg, multiple_reg, by = "varname")
simple_multiple_reg
## varname Slope Estimate
## 1 rad 0.61791 0.588208591
## 2 tax 0.02974 -0.003780016
## 3 lstat 0.54880 0.126211376
## 4 nox 31.24853 -10.313534912
## 5 indus 0.50978 -0.063854824
## 6 medv -0.36316 -0.198886821
## 7 black -0.03628 -0.007537505
## 8 dis -1.55090 -0.987175726
## 9 age 0.10779 0.001451643
## 10 ptratio 1.15198 -0.271080558
## 11 rm -2.68405 0.430130506
## 12 zn -0.07393 0.044855215
## 13 chas1 -1.89278 -0.749133611
ggplot(simple_multiple_reg, aes(x = Slope, y = Estimate, col = factor(varname))) +
geom_jitter() +
labs(title = "Simple vs Multiple Regression Coefficients",
subtitle = "For linear models predicting 'crim'",
x = "Simple Coefficient",
y = "Multiple Coefficient") +
theme(legend.position = "none")
Q: Is there evidence of non-linear association between any of the predictors and the response? To answer this question, for each predictor X, fit a model of the form: \(Y = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \epsilon\)
A:
I create another loop, similar to part (a), but this time I fit a cubic model of the form \(Y = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \epsilon\), where \(Y\) = crim
, for all \(X\) features (except crim
& chas
).
I extract the \(R^2\) for each model as before, but now I also extract the p-values of the t-tests for the \(X\), \(X^2\) & \(X^3\) terms, which can be used to identify where non-linear relationships exist. The relationship column then records the highest-order term that is significant.
P_value_x <- c()
P_value_x2 <- c()
P_value_x3 <- c()
R2 <- c()
y <- Boston$crim
for (i in 1:ncol(Boston)) {
x <- Boston[ ,i]
if (is.numeric(x)) {
model <- lm(y ~ x + I(x^2) + I(x^3))
if (!identical(y, x)) {
P_value_x[i] <- summary(model)$coefficients[2, 4]
P_value_x2[i] <- summary(model)$coefficients[3, 4]
P_value_x3[i] <- summary(model)$coefficients[4, 4]
R2[i] <- summary(model)$r.squared
}
} else {
P_value_x[i] <- NA
P_value_x2[i] <- NA
P_value_x3[i] <- NA
R2[i] <- NA
}
}
data.frame(varname = names(Boston),
R2 = round(R2, 5),
P_value_x = round(P_value_x, 10),
P_value_x2 = round(P_value_x2, 10),
P_value_x3 = round(P_value_x3, 10)) %>%
filter(!varname %in% c("crim", "chas")) %>%
arrange(desc(R2)) %>%
mutate(relationship = case_when(P_value_x3 < 0.05 ~ "Cubic",
P_value_x2 < 0.05 ~ "Quadratic",
P_value_x < 0.05 ~ "Linear",
TRUE ~ "No Relationship"))
## varname R2 P_value_x P_value_x2 P_value_x3 relationship
## 1 medv 0.42020 0.0000000000 0.0000000000 0.0000000000 Cubic
## 2 rad 0.40004 0.6234175212 0.6130098773 0.4823137740 No Relationship
## 3 tax 0.36888 0.1097075249 0.1374681578 0.2438506811 No Relationship
## 4 nox 0.29698 0.0000000000 0.0000000000 0.0000000000 Cubic
## 5 dis 0.27782 0.0000000000 0.0000000000 0.0000000109 Cubic
## 6 indus 0.25966 0.0000529706 0.0000000003 0.0000000000 Cubic
## 7 lstat 0.21793 0.3345299858 0.0645873561 0.1298905873 No Relationship
## 8 age 0.17423 0.1426608270 0.0473773275 0.0066799154 Cubic
## 9 black 0.14984 0.1385871340 0.4741750826 0.5436171817 No Relationship
## 10 ptratio 0.11378 0.0030286627 0.0041195521 0.0063005136 Cubic
## 11 rm 0.06779 0.2117564139 0.3641093853 0.5085751094 No Relationship
## 12 zn 0.05824 0.0026122963 0.0937504996 0.2295386205 Linear
Below are a couple of examples of predictors where the cubic term is significant. Using the formula argument of geom_smooth(method = "lm"), we can overlay the cubic fit directly onto the scatter plot:
ggplot(Boston, aes(x = medv, y = crim)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", formula = "y ~ x + I(x^2) + I(x^3)") +
labs(title = "Cubic Relationship between 'medv' & 'crim'")
ggplot(Boston, aes(x = nox, y = crim)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", formula = "y ~ x + I(x^2) + I(x^3)") +
labs(title = "Cubic Relationship between 'nox' & 'crim'")