HW-Q 1: This question involves the use of simple linear regression on the Auto data set from ISLR2 package.

Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Based on your output, Comment on the following questions.

library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.1.2
data("Auto")
head(Auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500
y <- Auto$mpg
x <- Auto$horsepower

lm.fit <- lm(y~x)
summary(lm.fit)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## x           -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16
  1. Is there a relationship between the predictor and the response?

Yes, because the p-value is 2.2e-16

  1. How strong is the relationship between the predictor and the response?

The R-squared value shows that almost 61% of the variation in mpg (response) is because of the horsepower (predictor).

  1. Is the relationship between the predictor and the response positive or negative?

Negative.

  1. What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?
predict(lm.fit,data.frame(x=c(98)),interval="prediction")
##        fit     lwr      upr
## 1 24.46708 14.8094 34.12476
predict(lm.fit,data.frame(x=c(98)),interval="confidence")
##        fit      lwr      upr
## 1 24.46708 23.97308 24.96108

The predicted mpg associated with a horsepower of 98 is about 24.47.

We are 95% confident that the mpg of a car with horsepower of 98 is between 14.81 to 34.12.

We are 95% confident that the average mpg of a car with horsepower of 98 is between 23.97 to 24.96.

The prediction interval is wider than the confidence interval.

  1. Plot the response and the predictor. Use the abline() function to display the least squares regression line.
plot(x, y, main = "Scatterplot of mpg vs. horsepower", xlab = "horsepower", ylab = "mpg", col = "red")+ abline(lm.fit,lwd=5,col="blue")

## integer(0)
  1. Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.
par(mfrow = c(2,2))
plot(lm.fit)

The plot of residuals versus fitted values indicates the presence of non linearity in the data. The plot of standardized residuals versus leverage indicates the presence of a few outliers (higher than 2 or lower than -2) and a few high leverage points.

HW-Q 2: In this problem we will investigate the t-statistic for the null hypothesis H0 : β = 0 in simple linear regression without an intercept.

  1. Perform a simple linear regression on the last question dataset (mpg as the response y and horsepower as the predictor x) without an intercept. Report the coefficient estimate \(\hat{β}\), the standard error of this coefficient estimate, and the t-statistic and p-value associated with the null hypothesis H0 : β = 0. Comment on these results. (You can perform regression without an intercept using the command lm(y ∼ x + 0).)
y <- Auto$mpg
x <- Auto$horsepower

lm.fit1 <- lm(y~x +0)
summary(lm.fit1)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.451  -5.033   5.609  15.185  35.716 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x 0.178840   0.006648    26.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.65 on 391 degrees of freedom
## Multiple R-squared:  0.6492, Adjusted R-squared:  0.6483 
## F-statistic: 723.7 on 1 and 391 DF,  p-value: < 2.2e-16

The coefficient estimate is 0.178840, the standard error is 0.006648, the t-value 26.9 and the p-value is less than 0.05. That is why we reject the null hypothesis that coefficient of x is zero.

  1. Now perform a simple linear regression (mpg as the predictor x and horsepower as the response y) without an intercept, and report the coefficient estimate, its standard error, and the corresponding t-statistic and p-values associated with the null hypothesis \(H_{0}\) : β = 0. Comment on these results.
lm.fit.2 <-lm(x~y+0)
summary(lm.fit.2)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -112.82  -28.91   13.21   63.40  181.44 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y   3.6302     0.1349    26.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.01 on 391 degrees of freedom
## Multiple R-squared:  0.6492, Adjusted R-squared:  0.6483 
## F-statistic: 723.7 on 1 and 391 DF,  p-value: < 2.2e-16

The Coefficient estimate is 3.6302, standard error is 0.1349 t statistic is 26.9 and p-value associated with is less than 0.05 and hence we reject the null hypothesis that coefficient of x is zero.

  1. What is the relationship between the results obtained in (a) and (b)? (for this question, only interpretation)

We obtain the same value for the t-statistic and consequently the same value for the corresponding p-value. Both results in (a) and (b) reflect the same line created in (a).The intercept is changed and it is not the inverse, so we cannot say that y=mx+c is written as x=(1/m)(y-c)

  1. For the regression of \(Y\) onto \(X\) without an intercept, the t-statistic for \(H_{0}\) : β = 0 takes the form \(\frac{\hat{β}}{SE(\hat{β})}\), where \(\hat{β}\) is given by (3.38), and where

\(SE(\hat{β}) =\) \(\sqrt{\frac{\sum^{n}_{i=1}(y_i - x_i\hat{β})^2}{(n-1)\sum^n_{i'=1}(x^2_{i'})}}\)

(These formulas are slightly different from those given in Sections 3.1.1 and 3.1.2, since here we are performing regression without an intercept.) confirm numerically in R.

yi.hat = lm.fit1$fitted.values
xi.hat = lm.fit.2$fitted.values
n = length(x)
t <- sqrt(n - 1)*(x %*% y)/sqrt(sum(x^2) * sum(y^2) - (x %*% y)^2)
as.numeric(t)
## [1] 26.90077

T above is exactly the t-statistic given before.

  1. In R, show that when regression is performed with an intercept, the t-statistic for \(H_0\): \(β_1\)= 0 is the same for the regression of \(y\) onto \(x\) as it is for the regression of \(x\) onto \(y\).
model.final.1 = lm(y~x)
summary(model.final.1)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## x           -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16
model.final.2 = lm(x~y)
summary(model.final.2)
## 
## Call:
## lm(formula = x ~ y)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -64.892 -15.716  -2.094  13.108  96.947 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 194.4756     3.8732   50.21   <2e-16 ***
## y            -3.8389     0.1568  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.19 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

Again, the t-statistic in both model.final.1 and model.final.2 is equal to -24.49