This question involves the use of multiple linear regression on the Auto data set.

(a) Produce a scatterplot matrix which includes all of the variables in the data set.

library(ISLR2)
library(GGally)
data(Auto)

Create a scatterplot matrix excluding the “name” column

ggpairs(Auto[, -9])

(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

Computing the correlation matrix

cor_matrix <- cor(Auto[, -9])
print(cor_matrix)
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
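The matrix is easier to read for this exercise if the correlations with mpg are pulled out and ranked by strength. The following is a minimal sketch, assuming the ISLR2 package is installed; the names `num_vars`, `cor_with_mpg`, and `ranked` are illustrative and not part of the original solution.

```r
library(ISLR2)  # provides the Auto data set

# Keep every column except the qualitative name variable
num_vars <- Auto[, names(Auto) != "name"]

# Column of the correlation matrix that relates each variable to mpg
cor_with_mpg <- cor(num_vars)[, "mpg"]

# Drop mpg itself, then sort by absolute strength of the correlation
cor_with_mpg <- cor_with_mpg[names(cor_with_mpg) != "mpg"]
ranked <- cor_with_mpg[order(abs(cor_with_mpg), decreasing = TRUE)]
print(round(ranked, 3))
```

Weight comes first in this ranking, matching the value of -0.8322442 in the matrix above.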

(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

Fitting the multiple linear regression model:

lm_auto <- lm(mpg ~ . - name, data = Auto)
summary(lm_auto)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

(i). Is there a relationship between the predictors and the response? Yes. The F-statistic (252.4 on 7 and 384 degrees of freedom) is large and its p-value (< 2.2e-16) is very small, so we reject the null hypothesis that all coefficients are zero. The multiple R-squared (0.8215) is also high: the model explains about 82.15% of the variance in mpg.
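The overall F-test can also be read directly from the summary object instead of from the printed output. This is a small sketch, assuming ISLR2 is installed; `fstat` and `p_value` are illustrative names.

```r
library(ISLR2)

lm_auto <- lm(mpg ~ . - name, data = Auto)

# Named vector with the F value and its two degrees of freedom
fstat <- summary(lm_auto)$fstatistic

# The p-value is the upper-tail probability of the F distribution
p_value <- pf(fstat[["value"]], fstat[["numdf"]], fstat[["dendf"]],
              lower.tail = FALSE)
print(fstat)
print(p_value)
```

Using `lower.tail = FALSE` avoids computing `1 - pf(...)`, which would round to zero for a p-value this small.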

(ii). Which predictors appear to have a statistically significant relationship to the response?

The significance of each relationship is shown by the p-value of the corresponding coefficient; p-values below 0.05 are treated as statistically significant. Displacement (p = 0.00844), weight (p < 2e-16), year (p < 2e-16), and origin (p = 4.67e-07) are significant. Cylinders (p = 0.12780), horsepower (p = 0.21963), and acceleration (p = 0.41548) are not. The intercept (p = 0.00024) is also significant, but it is not a predictor.
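The same selection can be made programmatically from the coefficient table. This is a sketch assuming ISLR2 is installed and the conventional 0.05 threshold; `coefs`, `p_vals`, and `significant` are illustrative names.

```r
library(ISLR2)

lm_auto <- lm(mpg ~ . - name, data = Auto)

# Matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(lm_auto))

# Drop the intercept row and keep the p-values of the predictors
p_vals <- coefs[-1, "Pr(>|t|)"]

# Predictors whose p-value falls below the 0.05 threshold
significant <- names(p_vals)[p_vals < 0.05]
print(significant)
```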

(iii). What does the coefficient for the year variable suggest?

Holding the other predictors fixed, each additional model year is associated with an increase of about 0.75 mpg. The coefficient is highly significant (p-value < 2e-16), which suggests that newer cars tend to be more fuel-efficient than older cars.
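A confidence interval gives a sense of how precisely the year effect is estimated. The sketch below assumes ISLR2 is installed and uses a 95% level, which is a choice made here rather than something stated in the exercise; `ci_year` is an illustrative name.

```r
library(ISLR2)

lm_auto <- lm(mpg ~ . - name, data = Auto)

# 95% confidence interval for the year coefficient only
ci_year <- confint(lm_auto, parm = "year", level = 0.95)
print(ci_year)
```

The interval lies well above zero, which is consistent with the very small p-value reported for year.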

(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

Plotting

par(mfrow = c(2, 2))
plot(lm_auto)

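# Residuals vs Fitted: shows a non-linear pattern and a few large residuals (the largest is 13.06, close to four residual standard errors).

# Q-Q: the residuals look approximately normal.

# Scale-Location: checks for homoscedasticity (constant variance).

# Residuals vs Leverage: shows a few high-leverage points.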
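The visual impressions can be backed up with numbers. This is a rough sketch, assuming ISLR2 is installed: it flags observations whose studentized residual exceeds 3 in absolute value and observations whose leverage is well above the average of (p + 1) / n. The factor of 3 on the average leverage is an arbitrary cutoff chosen for illustration, and all variable names are hypothetical.

```r
library(ISLR2)

lm_auto <- lm(mpg ~ . - name, data = Auto)

stud_res <- rstudent(lm_auto)   # studentized residuals
lev <- hatvalues(lm_auto)       # leverage of each observation

# Possible outliers: |studentized residual| greater than 3
outliers <- which(abs(stud_res) > 3)

# Average leverage is (p + 1) / n; the 3x multiplier is an assumed cutoff
avg_lev <- length(coef(lm_auto)) / nobs(lm_auto)
high_lev <- which(lev > 3 * avg_lev)

print(stud_res[outliers])
print(lev[high_lev])
```

The printed names are the row labels of Auto, so they can be matched against the observations labelled in the diagnostic plots.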

(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

# Fit models with interactions

## (*): main effects plus interactions

lm_interaction <- lm(mpg ~ cylinders * displacement + horsepower * weight + year * origin, data = Auto)
summary(lm_interaction)
## 
## Call:
## lm(formula = mpg ~ cylinders * displacement + horsepower * weight + 
##     year * origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7440 -1.6277  0.0152  1.4696 11.8628 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             2.079e+01  8.110e+00   2.564  0.01073 *  
## cylinders              -9.591e-01  4.833e-01  -1.984  0.04795 *  
## displacement           -2.954e-02  1.605e-02  -1.841  0.06642 .  
## horsepower             -1.851e-01  2.406e-02  -7.691 1.25e-13 ***
## weight                 -9.688e-03  9.443e-04 -10.259  < 2e-16 ***
## year                    5.334e-01  1.015e-01   5.254 2.48e-07 ***
## origin                 -1.033e+01  4.243e+00  -2.434  0.01538 *  
## cylinders:displacement  5.059e-03  2.160e-03   2.341  0.01973 *  
## horsepower:weight       4.205e-05  6.864e-06   6.126 2.24e-09 ***
## year:origin             1.419e-01  5.441e-02   2.608  0.00947 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.899 on 382 degrees of freedom
## Multiple R-squared:  0.8652, Adjusted R-squared:  0.862 
## F-statistic: 272.4 on 9 and 382 DF,  p-value: < 2.2e-16

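# Yes. All three interaction terms are significant at the 0.05 level: cylinders:displacement (p = 0.01973), horsepower:weight (p = 2.24e-09), and year:origin (p = 0.00947). Among the main effects, only displacement (p = 0.06642) is not significant.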
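Beyond the individual t-tests, the three interaction terms can be tested jointly with a partial F-test against a model that has the same six main effects and no interactions. This is a sketch assuming ISLR2 is installed; `lm_main` and `cmp` are names introduced here for illustration.

```r
library(ISLR2)

# Main effects only, using the same six variables as the interaction model
lm_main <- lm(mpg ~ cylinders + displacement + horsepower + weight +
                year + origin, data = Auto)

# Main effects plus the three pairwise interactions
lm_interaction <- lm(mpg ~ cylinders * displacement + horsepower * weight +
                       year * origin, data = Auto)

# Partial F-test: do the three interaction terms improve the fit jointly?
cmp <- anova(lm_main, lm_interaction)
print(cmp)
```

A small p-value in the second row of the table indicates that the interaction terms, taken together, improve on the main-effects model.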

## (:) only

lm_interaction_only <- lm(mpg ~ cylinders:displacement + horsepower:weight + year:origin, data = Auto)
summary(lm_interaction_only)
## 
## Call:
## lm(formula = mpg ~ cylinders:displacement + horsepower:weight + 
##     year:origin, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4474  -3.0871  -0.4037   2.2405  14.6409 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             2.735e+01  8.195e-01  33.368  < 2e-16 ***
## cylinders:displacement -1.125e-03  6.952e-04  -1.618    0.106    
## horsepower:weight      -1.856e-05  2.980e-06  -6.226 1.24e-09 ***
## year:origin             3.144e-02  4.398e-03   7.149 4.36e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.441 on 388 degrees of freedom
## Multiple R-squared:  0.6787, Adjusted R-squared:  0.6762 
## F-statistic: 273.2 on 3 and 388 DF,  p-value: < 2.2e-16

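# With interaction terms only, horsepower:weight and year:origin are significant, but cylinders:displacement is not (p = 0.106). The fit is also worse than the model that includes the main effects (adjusted R-squared 0.6762 vs. 0.862).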
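The difference between the two formula operators can be checked directly: `a * b` is shorthand for `a + b + a:b`, while `a:b` on its own adds only the product term. The sketch below, assuming ISLR2 is installed, fits both spellings for one pair of predictors; `fit_star` and `fit_colon` are illustrative names.

```r
library(ISLR2)

# Shorthand: main effects and their interaction
fit_star <- lm(mpg ~ horsepower * weight, data = Auto)

# Same model written out term by term
fit_colon <- lm(mpg ~ horsepower + weight + horsepower:weight, data = Auto)

# The two fits have identical coefficients
print(names(coef(fit_star)))
print(all.equal(coef(fit_star), coef(fit_colon)))
```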

(f) Try a few different transformations of the variables, such as log(X), √X, X^2. Comment on your findings.

# Log transformation

lm_log <- lm(mpg ~ log(horsepower) + log(weight) + . - name, data = Auto)
summary(lm_log)
## 
## Call:
## lm(formula = mpg ~ log(horsepower) + log(weight) + . - name, 
##     data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9332 -1.5111 -0.1791  1.4776 12.1820 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.010e+02  3.285e+01   6.120 2.32e-09 ***
## log(horsepower) -1.826e+01  3.498e+00  -5.222 2.92e-07 ***
## log(weight)     -2.127e+01  5.767e+00  -3.688 0.000259 ***
## cylinders       -1.979e-01  2.887e-01  -0.686 0.493344    
## displacement    -4.646e-05  7.102e-03  -0.007 0.994784    
## horsepower       1.119e-01  2.837e-02   3.945 9.51e-05 ***
## weight           2.913e-03  1.821e-03   1.600 0.110467    
## acceleration    -2.150e-01  9.994e-02  -2.151 0.032081 *  
## year             7.691e-01  4.521e-02  17.012  < 2e-16 ***
## origin           7.030e-01  2.543e-01   2.765 0.005970 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.912 on 382 degrees of freedom
## Multiple R-squared:  0.864,  Adjusted R-squared:  0.8608 
## F-statistic: 269.7 on 9 and 382 DF,  p-value: < 2.2e-16

# Square root transformation

lm_sqrt <- lm(mpg ~ sqrt(horsepower) + sqrt(weight) + . - name, data = Auto)
summary(lm_sqrt)
## 
## Call:
## lm(formula = mpg ~ sqrt(horsepower) + sqrt(weight) + . - name, 
##     data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9012 -1.4862 -0.1416  1.4574 12.1265 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      66.4689933  9.3167486   7.134 4.92e-12 ***
## sqrt(horsepower) -7.0167976  1.3555888  -5.176 3.67e-07 ***
## sqrt(weight)     -1.6353191  0.4185399  -3.907  0.00011 ***
## cylinders        -0.0980047  0.2903418  -0.338  0.73589    
## displacement     -0.0006989  0.0071499  -0.098  0.92218    
## horsepower        0.2744676  0.0586846   4.677 4.05e-06 ***
## weight            0.0106482  0.0036246   2.938  0.00351 ** 
## acceleration     -0.2125734  0.1002474  -2.120  0.03461 *  
## year              0.7689674  0.0451634  17.026  < 2e-16 ***
## origin            0.7084719  0.2536032   2.794  0.00547 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.907 on 382 degrees of freedom
## Multiple R-squared:  0.8645, Adjusted R-squared:  0.8613 
## F-statistic: 270.7 on 9 and 382 DF,  p-value: < 2.2e-16

# Squaring some predictors

lm_sq <- lm(mpg ~ I(horsepower^2) + I(weight^2) + . - name, data = Auto)
summary(lm_sq)
## 
## Call:
## lm(formula = mpg ~ I(horsepower^2) + I(weight^2) + . - name, 
##     data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8713 -1.6140 -0.1788  1.4667 12.0738 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      6.110e+00  4.586e+00   1.332  0.18359    
## I(horsepower^2)  6.217e-04  1.286e-04   4.833 1.96e-06 ***
## I(weight^2)      1.420e-06  2.835e-07   5.010 8.35e-07 ***
## cylinders        1.600e-01  2.981e-01   0.537  0.59164    
## displacement    -9.982e-04  7.271e-03  -0.137  0.89087    
## horsepower      -2.086e-01  3.999e-02  -5.216 3.01e-07 ***
## weight          -1.339e-02  2.125e-03  -6.303 8.07e-10 ***
## acceleration    -1.830e-01  1.006e-01  -1.818  0.06979 .  
## year             7.724e-01  4.522e-02  17.081  < 2e-16 ***
## origin           7.372e-01  2.530e-01   2.914  0.00378 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.91 on 382 degrees of freedom
## Multiple R-squared:  0.8642, Adjusted R-squared:  0.861 
## F-statistic:   270 on 9 and 382 DF,  p-value: < 2.2e-16

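# All three transformed models have an adjusted R-squared of about 0.861, compared with 0.8182 for the original model, which indicates a better fit. The three transformations perform almost identically.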
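To compare the fits side by side, the adjusted R-squared of each model can be collected into one vector. This is a minimal sketch assuming ISLR2 is installed; the list `models` and its labels are introduced here for illustration.

```r
library(ISLR2)

# Refit the original model and the three transformed models
models <- list(
  base = lm(mpg ~ . - name, data = Auto),
  log  = lm(mpg ~ log(horsepower) + log(weight) + . - name, data = Auto),
  sqrt = lm(mpg ~ sqrt(horsepower) + sqrt(weight) + . - name, data = Auto),
  sq   = lm(mpg ~ I(horsepower^2) + I(weight^2) + . - name, data = Auto)
)

# Extract the adjusted R-squared from each summary
adj_r2 <- sapply(models, function(m) summary(m)$adj.r.squared)
print(round(adj_r2, 4))
```

Adjusted R-squared is used rather than plain R-squared because the transformed models have two more terms than the original one.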