1. Load the mpg Excel file into R Markdown, and convert the dataset into a data frame.
library(readxl)
mpg <- read_excel("C:/Users/Lynx/Documents/MSDA/621/mpg.xlsx")
mpg <- as.data.frame(mpg)
  2. Using all rows of the dataset as your training set, create a linear regression model to predict mpg based on acceleration. What is the adjusted R-squared for your model?
model <- lm(mpg ~ ., data = mpg)
summary(model)
## 
## Call:
## lm(formula = mpg ~ ., data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.007  -5.636  -1.242   4.758  23.192 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.9698     2.0432   2.432   0.0154 *  
## acceleration   1.1912     0.1292   9.217   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.101 on 396 degrees of freedom
## Multiple R-squared:  0.1766, Adjusted R-squared:  0.1746 
## F-statistic: 84.96 on 1 and 396 DF,  p-value: < 2.2e-16

The adjusted R-squared is 0.1746.
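
The adjusted R-squared can also be pulled out of the summary object directly rather than read off the printout. The following is a minimal sketch, assuming the built-in `mtcars` data as a stand-in (with `qsec` playing the role of acceleration), since the original mpg.xlsx file is not available here; the names `cars`, `fit`, and `adj_r2` are illustrative.

```r
# Stand-in data (assumption): built-in mtcars, with qsec used in place of
# the acceleration column from the original mpg.xlsx file.
cars <- data.frame(mpg = mtcars$mpg, acceleration = mtcars$qsec)

# Simple linear regression of mpg on acceleration
fit <- lm(mpg ~ acceleration, data = cars)

# summary.lm() stores the adjusted R-squared in the adj.r.squared component
adj_r2 <- summary(fit)$adj.r.squared
print(round(adj_r2, 4))
```

Naming the predictor explicitly (`mpg ~ acceleration`) instead of using `mpg ~ .` keeps the model unchanged if further columns are ever added to the data frame.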

  3. Determine if a Box-Cox transformation would be beneficial. If so, perform the transformation. What is your adjusted R-squared value for your model now? Did it improve after applying a Box-Cox transformation?
plot(model$fitted.values, model$residuals)
abline(h = 0)

The residuals are not evenly scattered around zero across the range of fitted values, so a Box-Cox transformation would be beneficial.

library(MASS)
boxcox(model)

Because the maximum of the log-likelihood curve is closest to lambda = 0, a log transformation will be applied.
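
Instead of reading lambda off the plot, the maximizing value and its approximate 95% interval can be computed from the object that `boxcox()` returns. This is a sketch under the same stand-in assumption (built-in `mtcars`, `qsec` in place of acceleration); `lambda_hat` and `ci` are illustrative names, and the numbers it prints will differ from those for the original dataset.

```r
library(MASS)

# Stand-in data (assumption): mtcars with qsec in place of acceleration
cars <- data.frame(mpg = mtcars$mpg, acceleration = mtcars$qsec)
fit <- lm(mpg ~ acceleration, data = cars)

# Evaluate the profile log-likelihood on a grid of lambda values, no plot
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

# Lambda with the highest log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas whose log-likelihood lies within
# qchisq(0.95, 1) / 2 of the maximum
ci <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, 1) / 2])

print(lambda_hat)
print(ci)
```

If the interval contains 0, choosing the log transformation is a conventional simplification of the exact maximizing lambda.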

model2 <- lm(I(log(mpg)) ~ ., data = mpg)
summary(model2)
## 
## Call:
## lm(formula = I(log(mpg)) ~ ., data = mpg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.06515 -0.23641 -0.00943  0.23576  0.79343 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.24656    0.08759  25.648   <2e-16 ***
## acceleration  0.05491    0.00554   9.911   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3044 on 396 degrees of freedom
## Multiple R-squared:  0.1987, Adjusted R-squared:  0.1967 
## F-statistic: 98.23 on 1 and 396 DF,  p-value: < 2.2e-16

The new adjusted R-squared is 0.1967, which is greater than the original 0.1746. This indicates that the model improved after applying the transformation.
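
One caveat is that the two adjusted R-squared values describe different responses (mpg versus log(mpg)). A supplementary check, sketched here on the stand-in `mtcars` data (an assumption, with `qsec` in place of acceleration), is to back-transform the log-model predictions and compare both models on the original mpg scale. The names `rss_raw` and `rss_log` are illustrative.

```r
# Stand-in data (assumption): mtcars with qsec in place of acceleration
cars <- data.frame(mpg = mtcars$mpg, acceleration = mtcars$qsec)

fit_raw <- lm(mpg ~ acceleration, data = cars)
fit_log <- lm(log(mpg) ~ acceleration, data = cars)

# Predictions on the mpg scale; the log model is back-transformed with exp()
pred_raw <- fitted(fit_raw)
pred_log <- exp(fitted(fit_log))

# Residual sum of squares on the original scale for each model
rss_raw <- sum((cars$mpg - pred_raw)^2)
rss_log <- sum((cars$mpg - pred_log)^2)
print(c(raw = rss_raw, log = rss_log))
```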

  4. Create a scatter plot comparing mpg and acceleration. Does the relationship between these two variables appear to be perfectly linear? Or is there perhaps some slight nonlinearity in the relationship?
plot(mpg ~ ., data = mpg)

There appears to be some slight nonlinearity in the relationship between the mpg and acceleration variables.
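
Overlaying a straight-line fit and a smoother on the scatter plot makes any curvature easier to judge by eye. This is a minimal sketch on the stand-in `mtcars` data (an assumption, with `qsec` in place of acceleration); `sm` is an illustrative name.

```r
# Stand-in data (assumption): mtcars with qsec in place of acceleration
cars <- data.frame(mpg = mtcars$mpg, acceleration = mtcars$qsec)

plot(mpg ~ acceleration, data = cars)

# Dashed line: ordinary least-squares fit
abline(lm(mpg ~ acceleration, data = cars), lty = 2)

# Solid red line: lowess smoother; departures from the dashed line
# suggest nonlinearity
sm <- lowess(cars$acceleration, cars$mpg)
lines(sm, col = "red")
```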

  5. Create a new regression model to predict the Box-Cox transformation of mpg using both acceleration and a transformation of acceleration. Look at the “Common Transformation on Covariates” document in the “Week 11” folder of our Google Drive to determine some appropriate transformations. Which of these 4 “common transformations” yields the highest adjusted R-squared?
model3 <- lm(I(log(mpg)) ~ acceleration + I(acceleration^2), data = mpg)
summary(model3)
## 
## Call:
## lm(formula = I(log(mpg)) ~ acceleration + I(acceleration^2), 
##     data = mpg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.07126 -0.22527 -0.00066  0.21838  0.77803 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.023320   0.331575   3.086 0.002170 ** 
## acceleration       0.213095   0.041764   5.102 5.22e-07 ***
## I(acceleration^2) -0.004959   0.001298  -3.820 0.000155 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2993 on 395 degrees of freedom
## Multiple R-squared:  0.2273, Adjusted R-squared:  0.2234 
## F-statistic:  58.1 on 2 and 395 DF,  p-value: < 2.2e-16
model4 <- lm(I(log(mpg)) ~ acceleration + I(1/acceleration), data = mpg)
summary(model4)
## 
## Call:
## lm(formula = I(log(mpg)) ~ acceleration + I(1/acceleration), 
##     data = mpg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.05749 -0.22920  0.00108  0.22127  0.76895 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.26294    0.58605   7.274 1.89e-12 ***
## acceleration       -0.01068    0.01963  -0.544  0.58682    
## I(1/acceleration) -14.99800    4.31148  -3.479  0.00056 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3002 on 395 degrees of freedom
## Multiple R-squared:  0.2226, Adjusted R-squared:  0.2186 
## F-statistic: 56.54 on 2 and 395 DF,  p-value: < 2.2e-16
model5 <- lm(I(log(mpg)) ~ acceleration + I(log(acceleration)), data = mpg)
summary(model5)
## 
## Call:
## lm(formula = I(log(mpg)) ~ acceleration + I(log(acceleration)), 
##     data = mpg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.06111 -0.22515  0.00151  0.21794  0.77069 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.59298    1.04365  -1.526 0.127724    
## acceleration         -0.09011    0.03966  -2.272 0.023624 *  
## I(log(acceleration))  2.23400    0.60516   3.692 0.000254 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2997 on 395 degrees of freedom
## Multiple R-squared:  0.2255, Adjusted R-squared:  0.2215 
## F-statistic: 57.49 on 2 and 395 DF,  p-value: < 2.2e-16
model6 <- lm(I(log(mpg)) ~ acceleration + I(sqrt(acceleration)), data = mpg)
summary(model6)
## 
## Call:
## lm(formula = I(log(mpg)) ~ acceleration + I(sqrt(acceleration)), 
##     data = mpg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.06339 -0.22731  0.00077  0.21655  0.77214 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -2.38791    1.23506  -1.933 0.053897 .  
## acceleration          -0.24561    0.08008  -3.067 0.002310 ** 
## I(sqrt(acceleration))  2.36968    0.62997   3.762 0.000194 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2995 on 395 degrees of freedom
## Multiple R-squared:  0.2265, Adjusted R-squared:  0.2225 
## F-statistic: 57.82 on 2 and 395 DF,  p-value: < 2.2e-16

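The squared transformation (acceleration^2, model3) yields the highest adjusted R-squared at 0.2234, ahead of the square root (0.2225), log (0.2215), and reciprocal (0.2186) transformations.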
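
The four candidate models can also be fit and ranked in one pass, which avoids copying values from four separate summaries. This is a sketch on the stand-in `mtcars` data (an assumption, with `qsec` in place of acceleration); the list name `candidates` and its labels are illustrative, and the ranking it prints need not match the one for the original dataset.

```r
# Stand-in data (assumption): mtcars with qsec in place of acceleration
cars <- data.frame(mpg = mtcars$mpg, acceleration = mtcars$qsec)

# One formula per candidate transformation of acceleration
candidates <- list(
  square     = log(mpg) ~ acceleration + I(acceleration^2),
  reciprocal = log(mpg) ~ acceleration + I(1 / acceleration),
  log        = log(mpg) ~ acceleration + log(acceleration),
  sqrt       = log(mpg) ~ acceleration + sqrt(acceleration)
)

# Fit each model and collect its adjusted R-squared
adj_r2 <- sapply(candidates, function(f) {
  summary(lm(f, data = cars))$adj.r.squared
})

print(sort(adj_r2, decreasing = TRUE))
best <- names(which.max(adj_r2))
print(best)
```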

  6. Using the “data.frame” function, create a new data frame with the following three variables:

    • The Box-Cox transformation of mpg

    • acceleration

    • The transformation of acceleration that yielded the best adjusted R-squared in the preceding question

mpg2 <- data.frame(boxcox = I(log(mpg$mpg)), acceleration = mpg$acceleration, accelsquared = (mpg$acceleration)^2)
  7. Apply unit normal scaling to the new data frame that you created in Question 6. Is acceleration or its transformation more influential in predicting the Box-Cox transformation of mpg?
mpg2_normal <- lm(boxcox ~ ., data = mpg2)
summary(mpg2_normal)
## 
## Call:
## lm(formula = boxcox ~ ., data = mpg2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.07126 -0.22527 -0.00066  0.21838  0.77803 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.023320   0.331575   3.086 0.002170 ** 
## acceleration  0.213095   0.041764   5.102 5.22e-07 ***
## accelsquared -0.004959   0.001298  -3.820 0.000155 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2993 on 395 degrees of freedom
## Multiple R-squared:  0.2273, Adjusted R-squared:  0.2234 
## F-statistic:  58.1 on 2 and 395 DF,  p-value: < 2.2e-16

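Comparing the absolute values of the estimates, acceleration (0.213095) appears more influential in predicting the Box-Cox transformation of mpg than its transformation (0.004959). However, the model above was fit to mpg2 without any scaling, which is why its estimates are identical to those of model3. Because accelsquared takes much larger values than acceleration, the two estimates are on different scales and their magnitudes are not directly comparable; the comparison should be made on coefficients from a model fit to the scaled data frame.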
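
Unit normal scaling subtracts each column's mean and divides by its standard deviation, which base R's `scale()` does by default. The following is a minimal sketch of that step, assuming the built-in `mtcars` data as a stand-in (with `qsec` in place of acceleration); the names `cars2`, `cars2_scaled`, and `fit_scaled` are illustrative, and the coefficients it prints are not those of the original dataset.

```r
# Stand-in data (assumption): mtcars with qsec in place of acceleration
cars2 <- data.frame(
  log_mpg      = log(mtcars$mpg),
  acceleration = mtcars$qsec,
  accelsquared = mtcars$qsec^2
)

# scale() centers every column to mean 0 and rescales it to standard
# deviation 1; it returns a matrix, so convert back to a data frame
cars2_scaled <- as.data.frame(scale(cars2))

# Refit on the scaled data; the coefficients are now unit-free
fit_scaled <- lm(log_mpg ~ ., data = cars2_scaled)
print(coef(fit_scaled))

# The predictor with the larger absolute standardized coefficient is the
# more influential one by this measure
std_coefs <- abs(coef(fit_scaled)[-1])
print(names(which.max(std_coefs)))
```

After scaling, the intercept is zero up to rounding error, and the slope magnitudes can be compared directly because both predictors are measured in standard deviations.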