Libraries

library(broom)
library(MASS)
library(ISLR2)
## 
## Attaching package: 'ISLR2'
## The following object is masked from 'package:MASS':
## 
##     Boston
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ dials        1.2.0     ✔ rsample      1.2.0
## ✔ dplyr        1.1.3     ✔ tibble       3.2.1
## ✔ ggplot2      3.5.0     ✔ tidyr        1.3.1
## ✔ infer        1.0.5     ✔ tune         1.1.2
## ✔ modeldata    1.2.0     ✔ workflows    1.1.3
## ✔ parsnip      1.1.1     ✔ workflowsets 1.0.1
## ✔ purrr        1.0.2     ✔ yardstick    1.2.0
## ✔ recipes      1.0.8
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ dplyr::select()  masks MASS::select()
## ✖ recipes::step()  masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
library(ggplot2)
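
As the final startup message itself suggests, the console noise above can be silenced; a minimal sketch of the quiet-loading variant:

# Optional: load the same packages quietly to suppress the startup messages above
suppressPackageStartupMessages({
  library(broom)
  library(MASS)
  library(ISLR2)
  library(tidymodels)
  library(ggplot2)
})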

(a) Perform a simple linear regression with mpg as the response and horsepower as the predictor. Comment on the output.

data("Auto")
head(Auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500
lm_model <- linear_reg() %>%   # specify a linear regression model
  set_engine('lm') %>%         # fit with base R's lm()
  set_mode('regression')       # numeric outcome, so regression mode
lm_fit <- lm_model %>%
  fit(mpg ~ horsepower, data = Auto)  # regress mpg on horsepower
tidy(lm_fit)
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)   39.9     0.717        55.7 1.22e-187
## 2 horsepower    -0.158   0.00645     -24.5 7.03e- 81
glance(lm_fit)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.606         0.605  4.91      600. 7.03e-81     1 -1179. 2363. 2375.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
  1. The p-value for the horsepower coefficient is effectively zero (7.03e-81 in the tidy() output), indicating that the variable is statistically significant in predicting mpg; there is strong evidence of a relationship between mpg and horsepower
  2. The relationship between mpg and horsepower is negative, indicated by the negative coefficient of horsepower (-0.157845): each additional unit of horsepower is associated with a drop of about 0.158 mpg (see the prediction sketch after this list)
  3. The R-squared value of 0.6059 indicates that 60.59% of the variability in mpg is explained by horsepower
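
As a quick illustration of the fitted equation (mpg = 39.94 - 0.158 * horsepower), one can ask the parsnip fit for predictions; horsepower = 98 is just an illustrative value, not part of the output above:

# Illustrative prediction at horsepower = 98 (an arbitrary value chosen here)
new_pt <- tibble(horsepower = 98)
predict(lm_fit, new_data = new_pt)                    # point estimate: 39.94 - 0.158 * 98, about 24.47 mpg
predict(lm_fit, new_data = new_pt, type = "conf_int") # 95% confidence interval for the mean response
predict(lm_fit, new_data = new_pt, type = "pred_int") # 95% prediction interval for an individual car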

(b) Plot the response and the predictor. Use the abline() function to display the least squares regression line.

ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point(color = "purple") +  # Add scatterplot points
  geom_smooth(method = "lm", se = FALSE, color = "black", linewidth = 1) +  # Add regression line
  labs(title = "Scatterplot of mpg vs horsepower with Regression Line",
       x = "Horsepower",
       y = "Miles per Gallon") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
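
The exercise statement mentions abline(); the geom_smooth(method = "lm") layer above is the ggplot2 equivalent. For reference, a base-R sketch that literally uses abline() on the underlying lm object:

# Base-R version: scatterplot plus the least squares line via abline()
lm_engine <- extract_fit_engine(lm_fit)  # pull the plain lm object out of the parsnip fit
plot(Auto$horsepower, Auto$mpg,
     xlab = "Horsepower", ylab = "Miles per Gallon",
     main = "mpg vs horsepower")
abline(lm_engine, lwd = 2)               # least squares regression line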

(c) Produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.

library(ggfortify)
## Registered S3 method overwritten by 'ggfortify':
##   method          from   
##   autoplot.glmnet parsnip
# autoplot() dispatches on the underlying lm object, so extract it from the parsnip fit;
# which = c(1:3, 5) selects the panels discussed below, including residuals vs leverage
autoplot(extract_fit_engine(lm_fit), which = c(1:3, 5)) +
  labs(title = "Diagnostic Plots of Least Squares Regression Fit")

Looking at the Residuals vs Fitted plot, there is a clear U-shape in the residuals, which is a strong indicator of non-linearity in the data. Combined with the scatterplot from 8(b), this suggests that the simple linear regression model is not a good fit. The Normal Q-Q plot shows that the residuals are approximately normally distributed. The Scale-Location plot suggests the variance of the errors is roughly constant. Finally, the plot of standardized residuals versus leverage indicates the presence of a few outliers (standardized residuals above 2 or below -2) and a few high-leverage points.
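
Given the U-shaped residuals, a natural follow-up (my addition, not required by the exercise) is to check whether a quadratic term absorbs the curvature:

# Refit with a quadratic term and compare R-squared values
quad_fit <- lm_model %>%
  fit(mpg ~ horsepower + I(horsepower^2), data = Auto)
glance(lm_fit)$r.squared   # about 0.606 for the linear fit
glance(quad_fit)$r.squared # a noticeably higher value would confirm the non-linearity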