Auto data set from ISLR2 package.Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Based on your output, Comment on the following questions.
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.1.2
data("Auto")
head(Auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
y <- Auto$mpg
x <- Auto$horsepower
lm.fit <- lm(y~x)
summary(lm.fit)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## x -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
Yes, because the p-value is 2.2e-16
The R-squared value shows that almost 61% of the variation in mpg (response) is because of the horsepower (predictor).
Negative.
predict(lm.fit,data.frame(x=c(98)),interval="prediction")
## fit lwr upr
## 1 24.46708 14.8094 34.12476
predict(lm.fit,data.frame(x=c(98)),interval="confidence")
## fit lwr upr
## 1 24.46708 23.97308 24.96108
The predicted mpg associated with a horsepower of 98 is about 24.47.
We are 95% confident that the mpg of a car with horsepower of 98 is between 14.81 to 34.12.
We are 95% confident that the average mpg of a car with horsepower of 98 is between 23.97 to 24.96.
The prediction interval is wider than the confidence interval.
plot(x, y, main = "Scatterplot of mpg vs. horsepower", xlab = "horsepower", ylab = "mpg", col = "red")+ abline(lm.fit,lwd=5,col="blue")
## integer(0)
par(mfrow = c(2,2))
plot(lm.fit)
The plot of residuals versus fitted values indicates the presence of non linearity in the data. The plot of standardized residuals versus leverage indicates the presence of a few outliers (higher than 2 or lower than -2) and a few high leverage points.
lm(y ∼ x + 0).)y <- Auto$mpg
x <- Auto$horsepower
lm.fit1 <- lm(y~x +0)
summary(lm.fit1)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.451 -5.033 5.609 15.185 35.716
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.178840 0.006648 26.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.65 on 391 degrees of freedom
## Multiple R-squared: 0.6492, Adjusted R-squared: 0.6483
## F-statistic: 723.7 on 1 and 391 DF, p-value: < 2.2e-16
The coefficient estimate is 0.178840, the standard error is 0.006648, the t-value 26.9 and the p-value is less than 0.05. That is why we reject the null hypothesis that coefficient of x is zero.
lm.fit.2 <-lm(x~y+0)
summary(lm.fit.2)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -112.82 -28.91 13.21 63.40 181.44
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 3.6302 0.1349 26.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.01 on 391 degrees of freedom
## Multiple R-squared: 0.6492, Adjusted R-squared: 0.6483
## F-statistic: 723.7 on 1 and 391 DF, p-value: < 2.2e-16
The Coefficient estimate is 3.6302, standard error is 0.1349 t statistic is 26.9 and p-value associated with is less than 0.05 and hence we reject the null hypothesis that coefficient of x is zero.
We obtain the same value for the t-statistic and consequently the same value for the corresponding p-value. Both results in (a) and (b) reflect the same line created in (a).The intercept is changed and it is not the inverse, so we cannot say that y=mx+c is written as x=(1/m)(y-c)
\(SE(\hat{β}) =\) \(\sqrt{\frac{\sum^{n}_{i=1}(y_i - x_i\hat{β})^2}{(n-1)\sum^n_{i'=1}(x^2_{i'})}}\)
(These formulas are slightly different from those given in Sections 3.1.1 and 3.1.2, since here we are performing regression without an intercept.) confirm numerically in R.
yi.hat = lm.fit1$fitted.values
xi.hat = lm.fit.2$fitted.values
n = length(x)
t <- sqrt(n - 1)*(x %*% y)/sqrt(sum(x^2) * sum(y^2) - (x %*% y)^2)
as.numeric(t)
## [1] 26.90077
T above is exactly the t-statistic given before.
R, show that when regression is performed with an intercept, the t-statistic for \(H_0\): \(β_1\)= 0 is the same for the regression of \(y\) onto \(x\) as it is for the regression of \(x\) onto \(y\).model.final.1 = lm(y~x)
summary(model.final.1)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## x -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
model.final.2 = lm(x~y)
summary(model.final.2)
##
## Call:
## lm(formula = x ~ y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64.892 -15.716 -2.094 13.108 96.947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 194.4756 3.8732 50.21 <2e-16 ***
## y -3.8389 0.1568 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.19 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
Again, the t-statistic in both model.final.1 and model.final.2 is equal to -24.49