library(MASS)
library(ISLR2)
##
## Attaching package: 'ISLR2'
## The following object is masked from 'package:MASS':
##
## Boston
Fit a simple linear regression with mpg as the response and horsepower as the predictor. Comment on the output.
data("Auto")
head(Auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
lm.fit <- lm(mpg ~ horsepower, data = Auto)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
- The p-value for the horsepower coefficient is approximately 0 (<2e-16), indicating that the variable is statistically significant in predicting mpg; this implies there is a relationship between mpg and horsepower.
- The R-squared value of 0.6059 indicates that 60.59% of the variability in mpg can be explained by horsepower (see the quick check after this list).
- The relationship between mpg and horsepower is negative, as indicated by the negative coefficient on horsepower (-0.157845).
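As a quick check of the R-squared interpretation (a sketch, not part of the original output): in simple linear regression, the Multiple R-squared equals the squared correlation between the response and the predictor.
cor(Auto$mpg, Auto$horsepower)^2  # matches the Multiple R-squared of about 0.606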
predict(lm.fit, data.frame(horsepower = c(98)), interval = "prediction", level = 0.95)
## fit lwr upr
## 1 24.46708 14.8094 34.12476
predict(lm.fit, data.frame(horsepower = c(98)), interval = "confidence", level = 0.95)
## fit lwr upr
## 1 24.46708 23.97308 24.96108
- We can infer that a horsepower value of 98 gives a predicted mpg of 24.46. The 95% confidence interval for this prediction is (23.97, 24.96) and the 95% prediction interval is (14.80, 34.12). We see that the prediction interval is wider than the confidence interval.
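As a quick sanity check (not part of the original output), the point prediction can be reproduced directly from the fitted coefficients:
b <- coef(lm.fit)
# Intercept plus slope times 98: 39.936 - 0.158 * 98, about 24.467 as in the fit column above
unname(b["(Intercept)"] + b["horsepower"] * 98)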
plot(Auto$horsepower, Auto$mpg, main = "Scatterplot of mpg vs horsepower",
     xlab = "Horsepower", ylab = "Miles per gallon", col = "purple")
abline(lm.fit, lwd = 3, col = "black")
par(mfrow = c(2,2))
plot(lm.fit)
> Looking at the Residuals vs Fitted plot, there is a clear U-shape to the residuals, which is a strong indicator of non-linearity in the data. Combined with the scatterplot from 8(b), this suggests that the simple linear regression model is not a good fit. The second plot (Normal Q-Q) shows that the residuals are approximately normally distributed. The third plot (Scale-Location) suggests that the variance of the errors is roughly constant. Finally, the plot of standardized residuals versus leverage indicates a few outliers (standardized residuals above 2 or below -2) and a few high-leverage points.
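To back up the outlier and leverage comments with numbers, one possible check (a sketch, not part of the original answer) uses studentized residuals and hat values:
# Potential outliers: studentized residuals larger than 2 in absolute value
sum(abs(rstudent(lm.fit)) > 2)
# High-leverage points: hat values well above the average hat value of (p + 1)/n
which(hatvalues(lm.fit) > 3 * mean(hatvalues(lm.fit)))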
Fit a multiple regression model to predict Sales using Price, Urban, and US.
data("Carseats")
lm.fit <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lm.fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
\(Sales = 13.04 - 0.05 \times Price - 0.02 \times Urban + 1.2 \times US\), where:
+ Urban = 1 if the store is in an urban location and 0 otherwise
+ US = 1 if the store is based in the US and 0 otherwise
Based on the p-values, we can reject the null hypothesis \(H_0 : \beta_j = 0\) for the predictors Price and US, but not for Urban.
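The dummy coding assumed above can be confirmed from the factor contrasts (a quick check; Urban and US are stored as factors in the ISLR2 version of Carseats):
contrasts(Carseats$Urban)  # Yes coded as 1, No is the baseline
contrasts(Carseats$US)     # Yes coded as 1, No is the baseline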
lm.fit2 <- lm(Sales ~ Price + US, data = Carseats)
summary(lm.fit2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
We can compare the two fits using the diagnostics reported at the bottom of the summary() output.
|                         | lm.fit         | lm.fit2        | Better fit |
|-------------------------|----------------|----------------|------------|
| Residual standard error | 2.472          | 2.469          | lm.fit2    |
| Multiple R-squared      | 0.2393         | 0.2393         | tie        |
| Adjusted R-squared      | 0.2335         | 0.2354         | lm.fit2    |
| F-statistic, p-value    | 41.52, < 2e-16 | 62.43, < 2e-16 | lm.fit2    |
So the model from (e), lm.fit2, seems to be a better fit.
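Another way to compare these nested models (a sketch, not in the original write-up) is a partial F-test; a large p-value for the Urban term would confirm that dropping it does not hurt the fit:
# Compare the reduced model (Price + US) against the full model (Price + Urban + US)
anova(lm.fit2, lm.fit)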
confint(lm.fit2, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
plot(lm.fit2, which = 5)
> From the plot, there is no clear evidence of outliers or high-leverage observations in this model.
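A numerical version of the same check (added here as a sketch):
# Outliers: studentized residuals far beyond 2 or 3 in absolute value would be worrying
range(rstudent(lm.fit2))
# High leverage: compare the largest hat values to the average hat value (p + 1)/n = 3/400
sort(hatvalues(lm.fit2), decreasing = TRUE)[1:3]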
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
The last line above creates a linear model in which y is a function of x1 and x2. Write out the form of the linear model. What are the regression coefficients?

\(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\), where \(\beta_0 = 2\), \(\beta_1 = 2\), \(\beta_2 = 0.3\), and \(\epsilon \sim N(0, 1)\).
cor(x1, x2)
## [1] 0.8351212
plot(x1, x2)
Fit a least squares regression to predict y using x1 and x2. Describe the results obtained. What are \(\hat{\beta_0}\), \(\hat{\beta_1}\), and \(\hat{\beta_2}\)? How do these relate to the true \(\beta_0\), \(\beta_1\), and \(\beta_2\)? Can you reject the null hypothesis \(H_0 : \beta_1 = 0\)? How about the null hypothesis \(H_0 : \beta_2 = 0\)?
lm.fit <- lm(y ~ x1 + x2)
summary(lm.fit)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
The estimates are \(\hat{\beta_0} = 2.13\), \(\hat{\beta_1} = 1.44\), and \(\hat{\beta_2} = 1.01\). The intercept is close to the true \(\beta_0 = 2\), but \(\hat{\beta_1}\) underestimates the true \(\beta_1 = 2\) and \(\hat{\beta_2}\) is far from the true \(\beta_2 = 0.3\). We can reject \(H_0 : \beta_1 = 0\) at the 5% level, though only barely (p = 0.0487), and we cannot reject \(H_0 : \beta_2 = 0\) (p = 0.375).

Now fit a least squares regression to predict y using only x1. Comment on your results. Can you reject the null hypothesis \(H_0 : \beta_1 = 0\)?
lm.fit2 <- lm(y ~ x1)
summary(lm.fit2)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
The evidence for \(\beta_1\) is stronger here than in the model with both predictors (as evidenced by the p-values), and \(\hat{\beta_1} = 1.98\) is much closer to the true value of 2. Based on the p-value, we can reject the null hypothesis \(H_0 : \beta_1 = 0\).
Now fit a least squares regression to predict y using only x2. Comment on your results. Can you reject the null hypothesis \(H_0 : \beta_1 = 0\)?
lm.fit3 <- lm(y ~ x2)
summary(lm.fit3)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
In this model, the estimate of the coefficient on \(x_2\) is very different from the true \(\beta_2\) (2.90 compared to 0.3). We can reject the null hypothesis since the p-value is very small.
No, the results do not contradict each other. Because the predictors \(x_1\) and \(x_2\) are highly correlated, we are in the presence of collinearity, and in that situation it can be difficult to determine how each predictor is separately associated with the response. Collinearity reduces the accuracy of the coefficient estimates and inflates their standard errors: in the model with both predictors the standard errors are 0.72 for \(x_1\) and 1.13 for \(x_2\), compared with only 0.40 and 0.63 in the corresponding single-predictor models. Consequently, we may fail to reject \(H_0\) in the presence of collinearity; here the importance of \(x_2\) has been masked by its correlation with \(x_1\).
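One standard way to quantify this collinearity is the variance inflation factor; with two predictors it can be computed by hand (a sketch, since VIFs are not part of the original answer), and the same value is returned by car::vif() if the car package is available:
# VIF = 1 / (1 - R^2 from regressing one predictor on the other);
# with cor(x1, x2) of about 0.835 this is roughly 1 / (1 - 0.835^2), i.e. about 3.3
r2 <- summary(lm(x1 ~ x2))$r.squared
1 / (1 - r2)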
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)
lm.fit_re <- lm(y ~ x1 + x2)
summary(lm.fit_re)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73348 -0.69318 -0.05263 0.66385 2.30619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2267 0.2314 9.624 7.91e-16 ***
## x1 0.5394 0.5922 0.911 0.36458
## x2 2.5146 0.8977 2.801 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared: 0.2188, Adjusted R-squared: 0.2029
## F-statistic: 13.72 on 2 and 98 DF, p-value: 5.564e-06
plot(lm.fit_re)
> This has changed the estimates quite drastically. Looking at the first three plots, the new observation does not necessarily look like an outlier. However, the fourth plot (Residuals vs Leverage) reveals that it is exerting a lot of influence on the fitted line.
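The visual impression can be checked numerically (a sketch; these values were not reported in the original answer):
# Diagnostics for the added observation (number 101) in the two-predictor model
hatvalues(lm.fit_re)[101]       # leverage: compare to the average hat value (p + 1)/n = 3/101
cooks.distance(lm.fit_re)[101]  # overall influence on the fitted coefficients
rstudent(lm.fit_re)[101]        # a modest studentized residual supports "not an outlier"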
lm.fit2_re <- lm(y ~ x1)
summary(lm.fit2_re)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8897 -0.6556 -0.0909 0.5682 3.5665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2569 0.2390 9.445 1.78e-15 ***
## x1 1.7657 0.4124 4.282 4.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.1477
## F-statistic: 18.33 on 1 and 99 DF, p-value: 4.295e-05
plot(lm.fit2_re)
> Again, the new data point has changed the estimates. The new point shows up as an outlier in plots 1-3. However, its influence over the fit in this model is reduced compared to the lm.fit_re model.
lm.fit3_re <- lm(y ~ x2)
summary(lm.fit3_re)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64729 -0.71021 -0.06899 0.72699 2.38074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3451 0.1912 12.264 < 2e-16 ***
## x2 3.1190 0.6040 5.164 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared: 0.2122, Adjusted R-squared: 0.2042
## F-statistic: 26.66 on 1 and 99 DF, p-value: 1.253e-06
plot(lm.fit3_re)
> In this final model, the new observation does not seem to be an
outlier but does exert quite a bit of influence on the fit.
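To tie the three comments together, the diagnostics for the added observation can be placed side by side across the refitted models (a sketch; these numbers are not part of the original answer), making it easy to see in which model observation 101 carries the most leverage and influence:
fits <- list(both = lm.fit_re, x1_only = lm.fit2_re, x2_only = lm.fit3_re)
sapply(fits, function(f) c(leverage = unname(hatvalues(f)[101]),
                           cooks_d  = unname(cooks.distance(f)[101]),
                           rstudent = unname(rstudent(f)[101])))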