Libraries

library(MASS)
library(ISLR2)
## 
## Attaching package: 'ISLR2'
## The following object is masked from 'package:MASS':
## 
##     Boston

Exercise 8. This question should be answered using the Auto data set.

(a) Perform a simple linear regression with mpg as the response and horsepower as the predictor. Comment on the output.

data("Auto")
head(Auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500
lm.fit <- lm(mpg ~ horsepower, data = Auto)
summary(lm.fit)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16
  1. The p-value for the coefficient on horsepower is approximately 0 (<2e-16), which indicates that the variable is statistically significant in predicting mpg; this implies that there is a relationship between mpg and horsepower.
  2. The R-squared value of 0.6059 indicates that about 60.59% of the variability in mpg is explained by horsepower (a quick numerical check of this follows the list).
  3. The relationship between mpg and horsepower is negative, as indicated by the negative coefficient on horsepower (-0.157845).
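
A quick numerical check of point 2 (not part of the original output): in simple linear regression, R-squared equals the squared correlation between the response and the predictor.

cor(Auto$mpg, Auto$horsepower)^2 # should match the Multiple R-squared of 0.6059 reported above
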
predict(lm.fit, data.frame(horsepower = 98), interval = "prediction", level = 0.95)
##        fit     lwr      upr
## 1 24.46708 14.8094 34.12476
predict(lm.fit, data.frame(horsepower = 98), interval = "confidence", level = 0.95)
##        fit      lwr      upr
## 1 24.46708 23.97308 24.96108
  • We can infer that a horsepower value of 98 gives a predicted mpg of 24.47. The 95% confidence interval for this prediction is (23.97, 24.96) and the 95% prediction interval is (14.80, 34.12). As expected, the prediction interval is wider than the confidence interval (a small check of the point estimate follows).
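
As a small sketch (not part of the original output), the point estimate above can be reproduced directly from the fitted coefficients:

coef(lm.fit)[1] + coef(lm.fit)[2] * 98 # intercept + slope * 98, approximately 24.47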

(b) Plot the response and the predictor. Use the abline() function to display the least squares regression line.

plot(Auto$horsepower, Auto$mpg, main = "Scatterplot of mpg vs horsepower",
     xlab = "Horsepower", ylab = "Miles per gallon", col = "purple")
abline(lm.fit, lwd = 3, col = "black")

(c) Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.

par(mfrow = c(2,2))
plot(lm.fit)

> Looking at the Residuals vs. Fitted plot, there is a clear U-shape in the residuals, which is a strong indicator of non-linearity in the data. Combined with the scatterplot from 8(b), this suggests that the simple linear regression model is not a good fit. The Normal Q-Q plot shows that the residuals are approximately normally distributed. The Scale-Location plot suggests that the variance of the errors is roughly constant. Finally, the plot of standardized residuals versus leverage indicates the presence of a few outliers (standardized residuals above 2 or below -2) and a few high-leverage points.
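
One way to follow up on the non-linearity (a sketch, not required by the exercise; the name lm.fit.quad is chosen here for illustration) is to add a quadratic term in horsepower and test whether it improves on the linear fit:

lm.fit.quad <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto) # allow curvature in horsepower
anova(lm.fit, lm.fit.quad) # partial F-test: does the quadratic term add explanatory power?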

Exercise 10. This question should be answered using the Carseats data set.

(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.

data("Carseats")
lm.fit <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lm.fit)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

(b) Provide an interpretation of each coefficient in the model.

  • On average, a $1 increase in the price of a child car seat is associated with a decrease in sales of about 54 units (0.054 thousand units), ceteris paribus.
  • There is no statistically significant difference in sales between stores in urban locations and stores in rural locations, ceteris paribus (p-value = 0.936).
  • On average, a store based in the US sells about 1,200 more units (1.2 thousand units) than a store outside the US, ceteris paribus. (The dummy coding R uses for Urban and US is shown right after this list.)
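
The 0/1 dummy coding that R uses for the qualitative predictors can be inspected directly (a small sketch, not part of the original write-up):

contrasts(Carseats$Urban) # UrbanYes = 1 for urban stores, 0 otherwise
contrasts(Carseats$US)    # USYes = 1 for stores based in the US, 0 otherwise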

(c) Write out the model in equation form.

\(\widehat{Sales} = 13.04 - 0.054 \times Price - 0.022 \times Urban + 1.20 \times US\)

  • Urban = 1 if the store is in an urban location and 0 otherwise
  • US = 1 if the store is based in the US and 0 otherwise
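
As a small worked example of the equation (the predictor values are arbitrary and not part of the exercise), consider an urban US store charging a price of 120:

predict(lm.fit, data.frame(Price = 120, Urban = "Yes", US = "Yes"))
# equivalently 13.04 - 0.054 * 120 - 0.022 * 1 + 1.20 * 1, about 7.69 (thousand units)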

(d) For which of the predictors can you reject the null hypothesis \(H_0\) : \(β_j\) = 0?

Based on the p-value, we can reject the null hypothesis for the following predictors: Price and US

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

lm.fit2 <- lm(Sales ~ Price + US, data = Carseats)
summary(lm.fit2)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(f) How well do the models in (a) and (e) fit the data?

We can compare fits by comparing some diagnostics that are available at the bottom of the summary() output.

                          lm.fit           lm.fit2          Better fit
Residual standard error   2.472            2.469            lm.fit2
Multiple R-squared        0.2393           0.2393           tie
Adjusted R-squared        0.2335           0.2354           lm.fit2
F-statistic, p-value      41.52, < 2e-16   62.43, < 2e-16   lm.fit2

So the model from (e), lm.fit2, seems to be a better fit.
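
Because the model in (e) is nested inside the model from (a), a further check (not part of the original write-up) is a partial F-test comparing the two fits:

anova(lm.fit2, lm.fit) # tests whether Urban adds anything beyond Price and US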

(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

confint(lm.fit2, level = 0.95)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

(h) Is there evidence of outliers or high leverage observations in the model from (e)?

plot(lm.fit2, which = 5)

> From the plot, there is no strong evidence of outliers or high-leverage observations.
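
A numeric check to complement the plot (a sketch; output not reproduced here) uses studentized residuals and leverage values, with the usual rules of thumb of |studentized residual| > 3 for outliers and leverage above twice the average (p + 1)/n = 3/400 for high-leverage points:

which(abs(rstudent(lm.fit2)) > 3)                   # candidate outliers
which(hatvalues(lm.fit2) > 2 * 3 / nrow(Carseats))  # candidate high-leverage points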

Exercise 14. This problem focuses on the collinearity problem.

set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)

(a) The last line corresponds to creating a linear model in which y is a function of x1 and x2. Write out the form of the linear model. What are the regression coefficients?

\(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\)

where \(\beta_0 = 2\), \(\beta_1 = 2\), \(\beta_2 = 0.3\), and \(\epsilon \sim N(0,1)\)

(b) What is the correlation between \(x_1\) and \(x_2\)? Create a scatterplot displaying the relationship between the variables.

cor(x1, x2)
## [1] 0.8351212
plot(x1, x2)

(c) Using this data, fit a least squares regression to predict y using x1 and x2. Describe the results obtained. What are \(\hat{\beta_0}\), \(\hat{\beta_1}\), and \(\hat{\beta_2}\)? How do these relate to the true \(\beta_0\), \(\beta_1\), and \(\beta_2\)? Can you reject the null hypothesis \(H_0\) : \(\beta_1\) = 0? How about the null hypothesis \(H_0\) : \(\beta_2\) = 0?

lm.fit <- lm(y ~ x1 + x2)
summary(lm.fit)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8311 -0.7273 -0.0537  0.6338  2.3359 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1305     0.2319   9.188 7.61e-15 ***
## x1            1.4396     0.7212   1.996   0.0487 *  
## x2            1.0097     1.1337   0.891   0.3754    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared:  0.2088, Adjusted R-squared:  0.1925 
## F-statistic:  12.8 on 2 and 97 DF,  p-value: 1.164e-05
  • The estimates \(\hat{\beta_0}\), \(\hat{\beta_1}\) and \(\hat{\beta_2}\) are 2.1305, 1.4396 and 1.0097, respectively. Only \(\hat{\beta_0}\) is close to its true value; \(\hat{\beta_1}\) and \(\hat{\beta_2}\) are not very close to the true values of \(\beta_1\) and \(\beta_2\). As the p-value for \(\beta_1\) is less than 0.05, we may reject \(H_0 : \beta_1 = 0\); however, we may not reject \(H_0 : \beta_2 = 0\), as its p-value is greater than 0.05.

(d) Now fit a least squares regression to predict y using only x1. Comment on your results. Can you reject the null hypothesis \(H_0\) : \(\beta_1\) = 0?

lm.fit2 <- lm(y ~ x1)
summary(lm.fit2)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.89495 -0.66874 -0.07785  0.59221  2.45560 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1124     0.2307   9.155 8.27e-15 ***
## x1            1.9759     0.3963   4.986 2.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.1942 
## F-statistic: 24.86 on 1 and 98 DF,  p-value: 2.661e-06

The evidence for \(\beta_1\) is stronger now than in the two-predictor model (as shown by the smaller p-value), and \(\hat{\beta_1}\) = 1.9759 is much closer to the true value of 2.
Based on the p-value, we can reject the null hypothesis.
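
One additional check (a sketch; output not reproduced here) is whether the 95% confidence interval for the slope contains the true value of 2:

confint(lm.fit2)["x1", ] # interval for the coefficient on x1; it does contain 2 here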

(e) Now fit a least squares regression to predict y using only x2. Comment on your results. Can you reject the null hypothesis \(H_0\) : \(\beta_1\) = 0?

lm.fit3 <- lm(y ~ x2)
summary(lm.fit3)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.62687 -0.75156 -0.03598  0.72383  2.44890 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3899     0.1949   12.26  < 2e-16 ***
## x2            2.8996     0.6330    4.58 1.37e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared:  0.1763, Adjusted R-squared:  0.1679 
## F-statistic: 20.98 on 1 and 98 DF,  p-value: 1.366e-05

In this model, the estimate of the coefficient on \(x_2\) is very different from its known value (2.90 compared with 0.3); because \(x_2\) is highly correlated with \(x_1\), the single-predictor coefficient also absorbs part of the effect of \(x_1\). We can reject the null hypothesis since the p-value is very small.

(f) Do the results obtained in (c)–(e) contradict each other? Explain your answer.

No, the results do not contradict each other. Because the predictors \(x_1\) and \(x_2\) are highly correlated, we are in the presence of collinearity, and in this situation it can be difficult to determine how each predictor is separately associated with the response. Collinearity reduces the accuracy of the estimates of the regression coefficients and causes their standard errors to grow: the standard errors are 0.7212 and 1.1337 for \(x_1\) and \(x_2\) in the model with two predictors, but only 0.3963 and 0.6330 in the models with a single predictor. Consequently, we may fail to reject \(H_0\) in the presence of collinearity; here, the importance of the \(x_2\) variable has been masked.
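
A standard numeric diagnostic for collinearity (a sketch; it assumes the car package is installed, which is not loaded above) is the variance inflation factor:

library(car)
vif(lm.fit) # lm.fit is the two-predictor model from (c); large VIFs reflect the inflated standard errors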

(g) Now suppose we obtain one additional observation, which was unfortunately mismeasured. Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation have on each of the models? In each model, is this observation an outlier? A high-leverage point? Both? Explain your answers.

x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)
lm.fit_re <- lm(y ~ x1 + x2)
summary(lm.fit_re)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.73348 -0.69318 -0.05263  0.66385  2.30619 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2267     0.2314   9.624 7.91e-16 ***
## x1            0.5394     0.5922   0.911  0.36458    
## x2            2.5146     0.8977   2.801  0.00614 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared:  0.2188, Adjusted R-squared:  0.2029 
## F-statistic: 13.72 on 2 and 98 DF,  p-value: 5.564e-06
plot(lm.fit_re)

> This has changed the estimates fairly drastically. Looking at the first three diagnostic plots, the new observation does not necessarily appear to be an outlier. However, the fourth plot (Residuals vs Leverage) reveals that it is definitely exerting a lot of influence on the fit of the line.
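
The leverage of the added observation can also be checked directly (a sketch, not in the original). With 101 observations and two predictors, the average leverage is (p + 1)/n = 3/101, roughly 0.03:

hatvalues(lm.fit_re)[101] # leverage of the mismeasured point; compare with the average of about 0.03
rstudent(lm.fit_re)[101]  # its studentized residual, for the outlier check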

lm.fit2_re <- lm(y ~ x1)
summary(lm.fit2_re)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8897 -0.6556 -0.0909  0.5682  3.5665 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2569     0.2390   9.445 1.78e-15 ***
## x1            1.7657     0.4124   4.282 4.29e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared:  0.1562, Adjusted R-squared:  0.1477 
## F-statistic: 18.33 on 1 and 99 DF,  p-value: 4.295e-05
plot(lm.fit2_re)

> Again, the new data point has changed the estimates. The new point is revealed to be an outlier in plots 1-3. However, its influence over the fit in this model is reduced compared with the lm.fit_re model.

lm.fit3_re <- lm(y ~ x2)
summary(lm.fit3_re)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.64729 -0.71021 -0.06899  0.72699  2.38074 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3451     0.1912  12.264  < 2e-16 ***
## x2            3.1190     0.6040   5.164 1.25e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared:  0.2122, Adjusted R-squared:  0.2042 
## F-statistic: 26.66 on 1 and 99 DF,  p-value: 1.253e-06
plot(lm.fit3_re)

> In this final model, the new observation does not seem to be an outlier but does exert quite a bit of influence on the fit.