Problem 2 a)
library(ISLR)
data(Auto)
slrauto <- lm(mpg ~ horsepower, data = Auto)
summary(slrauto)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -13.5710  -3.2592  -0.3435   2.7630  16.9240
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
According to the model, the F-statistic is 599.7. An F-statistic this much larger than 1 is strong evidence against the null hypothesis of no relationship. The p-value associated with the F-statistic is < 2e-16. Since this is well below 0.05, we reject the null hypothesis and conclude that β1 ≠ 0. The results are statistically significant, and there is a relationship between the predictor and the response variable.
The R^2 value suggests that 60.59% of the variation in mpg (the response variable) is explained by horsepower (the predictor variable).
Since the coefficient of horsepower (the predictor variable) is negative, the relationship between the predictor and the response is negative: the more horsepower an automobile has, the lower its mpg.
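These headline numbers can also be pulled directly from the fitted model object rather than read off the printed summary; a minimal sketch:
# Extract R-squared and the coefficients from the fitted model
summary(slrauto)$r.squared   # ~0.6059, the proportion of variance explained
coef(slrauto)                # intercept ~39.94 and horsepower slope ~-0.158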
# Method 1: using predict()
slrauto <- lm(mpg ~ horsepower, data = Auto)
newdata <- data.frame(horsepower = 98)
predict(slrauto, newdata)
## 1
## 24.46708
# Method 2: plugging 98 into the fitted equation by hand
yhat <- -0.157845 * 98 + 39.935861  # coefficients rounded from the summary output
yhat
## [1] 24.46705
The predicted mpg associated with a horsepower of 98 is approximately 24.467. The small difference between the two methods comes from rounding the coefficients in Method 2.
Confidence & Prediction Intervals
confBand <- predict(slrauto, interval = "confidence")
head(confBand)
## fit lwr upr
## 1 19.416046 18.831250 20.000841
## 2 13.891480 12.982802 14.800158
## 3 16.259151 15.504025 17.014277
## 4 16.259151 15.504025 17.014277
## 5 17.837598 17.174242 18.500955
## 6 8.682604 7.401151 9.964056
predBand <- predict(slrauto, interval = "prediction")
## Warning in predict.lm(slrauto, interval = "prediction"): predictions on current data refer to _future_ responses
head(predBand)
## fit lwr upr
## 1 19.416046 9.753295 29.07880
## 2 13.891480 4.203732 23.57923
## 3 16.259151 6.584598 25.93370
## 4 16.259151 6.584598 25.93370
## 5 17.837598 8.169775 27.50542
## 6 8.682604 -1.047190 18.41240
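The same predict() call also returns intervals at a single new point; a minimal sketch reusing the newdata frame defined earlier for horsepower = 98:
# Confidence and prediction intervals at horsepower = 98
predict(slrauto, newdata, interval = "confidence")
predict(slrauto, newdata, interval = "prediction")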
# Scatterplot of mpg vs. horsepower with the fitted regression line
plot(mpg ~ horsepower, data = Auto, col = "blue")
abline(slrauto)
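The bands stored in confBand and predBand above can be overlaid on this scatterplot; a minimal sketch, sorting by horsepower so the band lines draw cleanly from left to right:
# Overlay confidence (red) and prediction (green) bands on the scatterplot
ord <- order(Auto$horsepower)
lines(Auto$horsepower[ord], confBand[ord, "lwr"], col = "red", lty = 2)
lines(Auto$horsepower[ord], confBand[ord, "upr"], col = "red", lty = 2)
lines(Auto$horsepower[ord], predBand[ord, "lwr"], col = "darkgreen", lty = 3)
lines(Auto$horsepower[ord], predBand[ord, "upr"], col = "darkgreen", lty = 3)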
# Method 1: calculating residuals by hand
res <- Auto$mpg - slrauto$fitted.values
head(res)
## 1 2 3 4 5 6
## -1.4160457 1.1085200 1.7408490 -0.2591510 -0.8375984 6.3173962
# Method 2: using the residuals stored in the fitted model object
head(slrauto$residuals)
## 1 2 3 4 5 6
## -1.4160457 1.1085200 1.7408490 -0.2591510 -0.8375984 6.3173962
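A quick sanity check that the two methods agree:
# Both approaches should return the same residual vector
all.equal(res, slrauto$residuals)   # expect TRUE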
# Creating a data frame to store the residuals and fitted values
auto_df <- data.frame(residuals = slrauto$residuals,
fitted = slrauto$fitted.values)
Residual Plot
library(tidyverse)
ggplot(auto_df, aes(fitted, residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "blue") +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
According to the residual plot, there is a pattern: as the fitted values increase, the spread of the residuals increases, creating a fan shape in the graph. This non-constant variance (heteroscedasticity) violates the constant-variance assumption of the linear model, and the curved loess line also hints that the relationship between mpg and horsepower may be non-linear.
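One common remedy for this kind of non-constant variance (beyond what the problem asks for) is a variance-stabilizing transformation of the response; a minimal sketch using log(mpg):
# Refit with a log-transformed response; the fan shape in the
# residuals typically shrinks under this transformation
slrauto_log <- lm(log(mpg) ~ horsepower, data = Auto)
plot(slrauto_log$fitted.values, slrauto_log$residuals)
abline(h = 0, lty = 2)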
QQ Plot
qqnorm(slrauto$residuals)
qqline(slrauto$residuals)
According to the QQ plot, the residuals in the upper tail lie above the qqline, suggesting that the residuals come from a right-skewed distribution rather than a normal one.
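A formal test can back up this visual impression; a sketch using the Shapiro-Wilk test (not required by the problem):
# Null hypothesis: the residuals are normally distributed;
# a small p-value is evidence against normality
shapiro.test(slrauto$residuals)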
Residual vs Leverage Plot
plot(slrauto, which = 5)  # show only the residuals vs. leverage panel
The residuals vs. leverage plot flags potential outliers (large standardized residuals) and high-leverage points (unusual predictor values); either kind of point can disproportionately influence the fitted line when performing a least squares regression.
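The leverage statistics behind this plot can also be computed directly; a minimal sketch using hatvalues():
# Leverage (hat values) for each observation; the largest one
# corresponds to the highest-leverage point flagged in the plot
lev <- hatvalues(slrauto)
which.max(lev)   # index of the highest-leverage observation
plot(lev, ylab = "Leverage")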