Homework 7

Create an R Markdown file to answer the following questions, and then “knit” your file to create an HTML document. Your HTML document should contain both textual explanations of your answers, as well as all R code needed to support your work. Be sure to write your R code in the format for R Markdown code chunks learned in the first class: open the chunk with three backticks followed by {r}, place the R code on the lines that follow, and close the chunk with three backticks.

Submit both your HTML document and original R Markdown file. If you have trouble uploading your files on Brightspace, be sure to e-mail them to the instructor.

Use the mpg Excel file to answer all of the questions below.

  1. Load the mpg Excel file into R Markdown, and convert the dataset into a data frame.
# load the readxl package, which provides read_excel()
library(readxl)

mpg_data <- read_excel("/Users/kamriefoster/Downloads/mpg.xlsx")

mpg_data <- as.data.frame(mpg_data)
  2. Using all rows of the dataset as your training set, create a linear regression model to predict mpg based on acceleration. What is the adjusted R-squared for your model?
model1 <- lm(mpg~., data = mpg_data)
summary(model1)
## 
## Call:
## lm(formula = mpg ~ ., data = mpg_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.007  -5.636  -1.242   4.758  23.192 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.9698     2.0432   2.432   0.0154 *  
## acceleration   1.1912     0.1292   9.217   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.101 on 396 degrees of freedom
## Multiple R-squared:  0.1766, Adjusted R-squared:  0.1746 
## F-statistic: 84.96 on 1 and 396 DF,  p-value: < 2.2e-16

The adjusted R-squared value for this model is 0.1746.
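
The adjusted R-squared can also be pulled from the summary object instead of being read off the printed output. The following is a minimal sketch on simulated stand-in data: the `sim` data frame and the coefficients used to generate it are made up for illustration and are not the mpg file.

```r
# Simulated stand-in data; NOT the mpg dataset.
set.seed(1)
sim <- data.frame(acceleration = runif(100, 8, 25))
sim$mpg <- 5 + 1.2 * sim$acceleration + rnorm(100, sd = 7)

fit <- lm(mpg ~ acceleration, data = sim)
fit_summary <- summary(fit)

# Multiple and adjusted R-squared are stored as plain numbers.
r2 <- fit_summary$r.squared
adj_r2 <- fit_summary$adj.r.squared

# Adjusted R-squared penalizes the number of predictors p:
# 1 - (1 - R^2) * (n - 1) / (n - p - 1)
n <- nrow(sim)
p <- 1
adj_r2_manual <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(adj_r2 = adj_r2, manual = adj_r2_manual))
```

Storing the value in a variable makes it easier to compare several models later without copying numbers by hand.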

  3. Determine if a Box-Cox transformation would be beneficial. If so, perform the transformation. What is your adjusted R-squared value for your model now? Did it improve after applying a Box-Cox transformation?
plot(model1$fitted.values, model1$residuals)
abline(h = 0)

The residuals are not centered on the horizontal line at 0, so a Box-Cox transformation could benefit this model. The next step is to find the best lambda value for the transformation.

library(MASS)
boxcox(model1)

The Box-Cox plot puts the best lambda value at around 0, which corresponds to a log transformation of the mpg variable.
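
The lambda value can also be read numerically rather than estimated from the plot. The sketch below assumes simulated stand-in data (the `sim` data frame is hypothetical, not the mpg file) and uses the `plotit = FALSE` argument of `boxcox()`, which returns the lambda grid and the profile log-likelihood.

```r
library(MASS)

# Simulated stand-in data with a positive, right-skewed response;
# NOT the mpg dataset.
set.seed(2)
sim <- data.frame(x = runif(200, 8, 25))
sim$y <- exp(2 + 0.05 * sim$x + rnorm(200, sd = 0.3))

fit <- lm(y ~ x, data = sim)

# plotit = FALSE returns the lambda grid (x) and the profile log-likelihood (y)
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)
best_lambda <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas whose log-likelihood lies within
# qchisq(0.95, 1) / 2 of the maximum
in_interval <- bc$y > max(bc$y) - qchisq(0.95, 1) / 2
lambda_range <- range(bc$x[in_interval])

print(best_lambda)
print(lambda_range)
```

If the interval contains 0, a log transformation of the response is a reasonable choice; if it contains 1, no transformation may be needed.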

model2 <- lm(I(log(mpg))~., data = mpg_data)
summary(model2)
## 
## Call:
## lm(formula = I(log(mpg)) ~ ., data = mpg_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.06515 -0.23641 -0.00943  0.23576  0.79343 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.24656    0.08759  25.648   <2e-16 ***
## acceleration  0.05491    0.00554   9.911   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3044 on 396 degrees of freedom
## Multiple R-squared:  0.1987, Adjusted R-squared:  0.1967 
## F-statistic: 98.23 on 1 and 396 DF,  p-value: < 2.2e-16

The new adjusted R-squared value is 0.1967, an improvement over the first model's 0.1746. However, this is still not a strong model, and other transformations could be applied to improve the R-squared value.

  4. Create a scatter plot comparing mpg and acceleration. Does the relationship between these two variables appear to be perfectly linear? Or is there perhaps some slight nonlinearity in the relationship?
plot(mpg_data$acceleration, mpg_data$mpg, xlab="Acceleration", ylab="mpg")

The relationship between acceleration and mpg does not appear to be perfectly linear.

  5. Create a new regression model to predict the Box-Cox transformation of mpg using both acceleration and a transformation of acceleration. Look at the “Common Transformation on Covariates” document in the “Week 11” folder of our Google Drive to determine some appropriate transformations. Which of these 4 “common transformations” yields the highest adjusted R-squared?
#acceleration squared

model3 <- lm(I(log(mpg))~ acceleration + I(acceleration^2), data = mpg_data)
summary(model3)
## 
## Call:
## lm(formula = I(log(mpg)) ~ acceleration + I(acceleration^2), 
##     data = mpg_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.07126 -0.22527 -0.00066  0.21838  0.77803 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.023320   0.331575   3.086 0.002170 ** 
## acceleration       0.213095   0.041764   5.102 5.22e-07 ***
## I(acceleration^2) -0.004959   0.001298  -3.820 0.000155 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2993 on 395 degrees of freedom
## Multiple R-squared:  0.2273, Adjusted R-squared:  0.2234 
## F-statistic:  58.1 on 2 and 395 DF,  p-value: < 2.2e-16
#inverse of acceleration 

model4 <- lm(I(log(mpg))~ acceleration + I(1/acceleration), data = mpg_data)
summary(model4)
## 
## Call:
## lm(formula = I(log(mpg)) ~ acceleration + I(1/acceleration), 
##     data = mpg_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.05749 -0.22920  0.00108  0.22127  0.76895 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.26294    0.58605   7.274 1.89e-12 ***
## acceleration       -0.01068    0.01963  -0.544  0.58682    
## I(1/acceleration) -14.99800    4.31148  -3.479  0.00056 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3002 on 395 degrees of freedom
## Multiple R-squared:  0.2226, Adjusted R-squared:  0.2186 
## F-statistic: 56.54 on 2 and 395 DF,  p-value: < 2.2e-16
#log of acceleration

model5 <- lm(I(log(mpg))~ acceleration + I(log(acceleration)), data = mpg_data)
summary(model5)
## 
## Call:
## lm(formula = I(log(mpg)) ~ acceleration + I(log(acceleration)), 
##     data = mpg_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.06111 -0.22515  0.00151  0.21794  0.77069 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.59298    1.04365  -1.526 0.127724    
## acceleration         -0.09011    0.03966  -2.272 0.023624 *  
## I(log(acceleration))  2.23400    0.60516   3.692 0.000254 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2997 on 395 degrees of freedom
## Multiple R-squared:  0.2255, Adjusted R-squared:  0.2215 
## F-statistic: 57.49 on 2 and 395 DF,  p-value: < 2.2e-16
#square root of acceleration

model6 <- lm(I(log(mpg))~ acceleration + I(sqrt(acceleration)), data = mpg_data)
summary(model6)
## 
## Call:
## lm(formula = I(log(mpg)) ~ acceleration + I(sqrt(acceleration)), 
##     data = mpg_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.06339 -0.22731  0.00077  0.21655  0.77214 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -2.38791    1.23506  -1.933 0.053897 .  
## acceleration          -0.24561    0.08008  -3.067 0.002310 ** 
## I(sqrt(acceleration))  2.36968    0.62997   3.762 0.000194 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2995 on 395 degrees of freedom
## Multiple R-squared:  0.2265, Adjusted R-squared:  0.2225 
## F-statistic: 57.82 on 2 and 395 DF,  p-value: < 2.2e-16

Of the four covariate transformations, each fitted alongside acceleration with the Box-Cox (log) transformation of mpg as the response, acceleration squared yields the highest adjusted R-squared: 0.2234 (model3). The square root (0.2225), log (0.2215), and inverse (0.2186) transformations follow closely behind.
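
The same comparison can be made in one pass by fitting each candidate formula and collecting the adjusted R-squared values. This is a minimal sketch on simulated stand-in data; the `sim` data frame and the `candidates` list are hypothetical names introduced for illustration, not part of the mpg analysis.

```r
# Simulated stand-in data; NOT the mpg dataset.
set.seed(3)
sim <- data.frame(acceleration = runif(150, 8, 25))
sim$mpg <- exp(1 + 0.2 * sim$acceleration - 0.005 * sim$acceleration^2 +
               rnorm(150, sd = 0.3))

# One formula per candidate transformation of the covariate
candidates <- list(
  squared = log(mpg) ~ acceleration + I(acceleration^2),
  inverse = log(mpg) ~ acceleration + I(1 / acceleration),
  log     = log(mpg) ~ acceleration + log(acceleration),
  sqrt    = log(mpg) ~ acceleration + sqrt(acceleration)
)

# Fit each model and keep only its adjusted R-squared
adj_r2 <- sapply(candidates, function(f) {
  summary(lm(f, data = sim))$adj.r.squared
})

print(sort(adj_r2, decreasing = TRUE))
best <- names(which.max(adj_r2))
print(best)
```

All four models share the same response and the same number of predictors, so their adjusted R-squared values are directly comparable.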

  6. Using the “data.frame” function, create a new data frame with the following three variables:
    • The Box-Cox transformation of mpg
    • acceleration
    • The transformation of acceleration that yielded the best adjusted R-squared in the preceding question
bc.mpg <- mpg_data$mpg
bc.mpg <- log(bc.mpg)
#bc.mpg

transformation <- mpg_data$acceleration
transformation <- transformation^2
#transformation

acceleration <- mpg_data$acceleration
#acceleration

mpg2 <- data.frame(bc.mpg, acceleration, transformation)
#mpg2
  7. Apply unit normal scaling to the new data frame that you created in Question 6. Is acceleration or its transformation more influential in predicting the Box-Cox transformation of mpg?
mpg_unit_normal <- as.data.frame(apply(mpg2, 2, function(x){(x - mean(x))/sd(x)}))
mpg_reg_unit_normal <- lm(bc.mpg ~., data = mpg_unit_normal)
summary(mpg_reg_unit_normal)
## 
## Call:
## lm(formula = bc.mpg ~ ., data = mpg_unit_normal)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.15395 -0.66322 -0.00194  0.64295  2.29063 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     7.683e-16  4.417e-02   0.000 1.000000    
## acceleration    1.730e+00  3.391e-01   5.102 5.22e-07 ***
## transformation -1.295e+00  3.391e-01  -3.820 0.000155 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8813 on 395 degrees of freedom
## Multiple R-squared:  0.2273, Adjusted R-squared:  0.2234 
## F-statistic:  58.1 on 2 and 395 DF,  p-value: < 2.2e-16

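Acceleration appears to be more influential than its transformation in predicting the Box-Cox transformation of mpg: its standardized coefficient (1.730) is larger in absolute value than that of the transformation (-1.295). Both predictors remain statistically significant.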
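
Unit normal scaling can also be done with base R's `scale()`, and each standardized slope equals the raw slope multiplied by sd(predictor) / sd(response). The sketch below checks that relationship on simulated stand-in data; the `df` data frame is hypothetical and only mirrors the column names used above.

```r
# Simulated stand-in data; NOT the mpg dataset.
set.seed(4)
acceleration <- runif(120, 8, 25)
df <- data.frame(
  bc.mpg = 1 + 0.2 * acceleration - 0.005 * acceleration^2 +
    rnorm(120, sd = 0.3),
  acceleration = acceleration,
  transformation = acceleration^2
)

# scale() centers each column to mean 0 and rescales it to sd 1
df_scaled <- as.data.frame(scale(df))

raw_fit <- lm(bc.mpg ~ ., data = df)
scaled_fit <- lm(bc.mpg ~ ., data = df_scaled)

# Standardized slope = raw slope * sd(predictor) / sd(response)
sds <- sapply(df, sd)
manual <- coef(raw_fit)[-1] *
  sds[c("acceleration", "transformation")] / sds["bc.mpg"]

print(rbind(scaled = coef(scaled_fit)[-1], manual = manual))
```

Because every variable is on the same scale after standardizing, the absolute sizes of the slopes can be compared directly, which is what the answer above relies on.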