1. Load the mpg Excel file into R Markdown, and convert the dataset into a data frame.
library('xlsx')
mpg <- read.xlsx("/Users/jusimioni/Desktop/MSDA 2021-23/Fall 2022/data/mpg.xlsx", sheetIndex = 1)
mpg <- as.data.frame(mpg)
head(mpg)
##   mpg acceleration
## 1  18         12.0
## 2  15         11.5
## 3  18         11.0
## 4  16         12.0
## 5  17         10.5
## 6  15         10.0
2. Using all rows of the dataset as your training set, create a linear regression model to predict mpg based on acceleration. What is the adjusted R-squared for your model?
mpg_model <- lm(mpg~., data=mpg)
summary(mpg_model)
## 
## Call:
## lm(formula = mpg ~ ., data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.007  -5.636  -1.242   4.758  23.192 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.9698     2.0432   2.432   0.0154 *  
## acceleration   1.1912     0.1292   9.217   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.101 on 396 degrees of freedom
## Multiple R-squared:  0.1766, Adjusted R-squared:  0.1746 
## F-statistic: 84.96 on 1 and 396 DF,  p-value: < 2.2e-16

The adjusted R-squared of the model is 0.1746 (17.46%).
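
The adjusted R-squared can also be pulled out of the summary object rather than read off the printout. The sketch below uses simulated stand-in data (the `toy` data frame and its coefficients are assumptions, not the real mpg file) and checks the value against the formula 1 - (1 - R²)(n - 1)/(n - p - 1). Applied to the output above, with R² = 0.1766, n = 398 (396 residual degrees of freedom plus two coefficients) and p = 1, that formula gives roughly 0.175, consistent with the reported 0.1746.

```r
# Simulated stand-in for the mpg data (hypothetical values, not the real file)
set.seed(1)
toy <- data.frame(acceleration = runif(100, 8, 25))
toy$mpg <- 5 + 1.2 * toy$acceleration + rnorm(100, sd = 7)

toy_model   <- lm(mpg ~ acceleration, data = toy)
fit_summary <- summary(toy_model)

# Extract the statistics directly from the summary object
r2     <- fit_summary$r.squared
adj_r2 <- fit_summary$adj.r.squared

# Adjusted R-squared by hand: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
n <- nrow(toy)
p <- 1  # number of predictors
adj_r2_manual <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(r2 = r2, adj_r2 = adj_r2, adj_r2_manual = adj_r2_manual))
```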
3. Determine if a Box-Cox transformation would be beneficial. If so, perform the transformation. What is your adjusted R-squared value for your model now? Did it improve after applying a Box-Cox transformation?
The following plot should show if a transformation is needed.

plot(mpg_model$fitted.values, mpg_model$residuals)
abline(h = 0)

The plot of residuals against fitted values does not look like an even, patternless band around zero, so a Box-Cox transformation should be beneficial. The Box-Cox plot that follows is used to choose the type of transformation.

library(MASS)
boxcox(mpg_model)
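
Besides reading the Box-Cox plot by eye, the λ with the highest profile log-likelihood can be extracted numerically. This is a minimal sketch on simulated data (the `toy` data frame is an assumption, generated on the log scale so that the estimate would be expected to fall near 0); `boxcox()` with `plotit = FALSE` returns the λ grid in `x` and the log-likelihood in `y`.

```r
library(MASS)

# Simulated stand-in data; the response is generated on the log scale
set.seed(2)
toy <- data.frame(acceleration = runif(200, 8, 25))
toy$mpg <- exp(2.2 + 0.05 * toy$acceleration + rnorm(200, sd = 0.3))

toy_model <- lm(mpg ~ acceleration, data = toy)

# plotit = FALSE skips the plot and returns the grid (x) and log-likelihood (y)
bc <- boxcox(toy_model, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)
best_lambda <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas whose log-likelihood lies within
# qchisq(0.95, 1) / 2 of the maximum
in_ci     <- bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2
lambda_ci <- range(bc$x[in_ci])

print(best_lambda)
print(lambda_ci)
```

If the interval covers 0, a log transformation is a reasonable choice; if it covers 1, no transformation is indicated.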

Using a log transformation, which is the Box-Cox transformation for λ = 0.

model_log <- lm(I(log(mpg)) ~., data=mpg)
summary(model_log)
## 
## Call:
## lm(formula = I(log(mpg)) ~ ., data = mpg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.06515 -0.23641 -0.00943  0.23576  0.79343 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.24656    0.08759  25.648   <2e-16 ***
## acceleration  0.05491    0.00554   9.911   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3044 on 396 degrees of freedom
## Multiple R-squared:  0.1987, Adjusted R-squared:  0.1967 
## F-statistic: 98.23 on 1 and 396 DF,  p-value: < 2.2e-16

The new adjusted R-squared is 19.67%, which is better than the 17.46% of the previous linear regression.

4. Create a scatter plot comparing mpg and acceleration. Does the relationship between these two variables appear to be perfectly linear? Or is there perhaps some slight nonlinearity in the relationship?
plot(mpg$mpg, mpg$acceleration, ylab='Acceleration', xlab ='MPG')

The relationship does not appear to be perfectly linear: the points do not follow a well-defined line but scatter widely around it.
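
One way to make the visual judgement less subjective is to overlay a straight-line fit and a lowess smoother on the scatter plot and see where they diverge. The sketch below uses simulated data (the `toy` data frame is an assumption, not the real mpg file) and, unlike the plot above, puts the response on the vertical axis.

```r
# Simulated stand-in data with a mildly curved relationship (hypothetical)
set.seed(5)
toy <- data.frame(acceleration = runif(200, 8, 25))
toy$mpg <- exp(1 + 0.6 * sqrt(toy$acceleration) + rnorm(200, sd = 0.3))

plot(toy$acceleration, toy$mpg, xlab = "Acceleration", ylab = "MPG")

# Dashed line: ordinary least-squares fit
abline(lm(mpg ~ acceleration, data = toy), lty = 2)

# Red curve: lowess smoother; departures from the dashed line hint at nonlinearity
smooth <- lowess(toy$acceleration, toy$mpg)
lines(smooth, col = "red")
```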

5. Create a new regression model to predict the Box-Cox transformation of mpg using both acceleration and a transformation of acceleration. Look at the “Common Transformation on Covariates” document in the “Week 11” folder of our Google Drive to determine some appropriate transformations. Which of these 4 “common transformations” yields the highest adjusted R-squared?

Comparing the curves in that document with the scatter plot above, two functions have a similar shape: y = log(x) and y = sqrt(x).

#Transformation of acceleration - Using Log

model_log <- lm(I(log(mpg)) ~ acceleration+I(log(acceleration)), data=mpg)
summary(model_log)
## 
## Call:
## lm(formula = I(log(mpg)) ~ acceleration + I(log(acceleration)), 
##     data = mpg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.06111 -0.22515  0.00151  0.21794  0.77069 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.59298    1.04365  -1.526 0.127724    
## acceleration         -0.09011    0.03966  -2.272 0.023624 *  
## I(log(acceleration))  2.23400    0.60516   3.692 0.000254 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2997 on 395 degrees of freedom
## Multiple R-squared:  0.2255, Adjusted R-squared:  0.2215 
## F-statistic: 57.49 on 2 and 395 DF,  p-value: < 2.2e-16
#Transformation of acceleration - Using sqrt

model_sqrt <- lm(I(log(mpg)) ~ acceleration+I(sqrt(acceleration)), data=mpg)
summary(model_sqrt)
## 
## Call:
## lm(formula = I(log(mpg)) ~ acceleration + I(sqrt(acceleration)), 
##     data = mpg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.06339 -0.22731  0.00077  0.21655  0.77214 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -2.38791    1.23506  -1.933 0.053897 .  
## acceleration          -0.24561    0.08008  -3.067 0.002310 ** 
## I(sqrt(acceleration))  2.36968    0.62997   3.762 0.000194 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2995 on 395 degrees of freedom
## Multiple R-squared:  0.2265, Adjusted R-squared:  0.2225 
## F-statistic: 57.82 on 2 and 395 DF,  p-value: < 2.2e-16

The square root transformation gives the better adjusted R-squared, although the two models are very similar: 0.2225 for sqrt(acceleration) versus 0.2215 for log(acceleration), a difference of 0.1 percentage points. Both adjusted R-squared values are better than that of the previous model (0.1967).
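
The comparison can be automated by looping over the candidate transformations and collecting the adjusted R-squared of each fit. This is a sketch on simulated data (the `toy` data frame is an assumption); only the two transformations tried above are listed, and further candidates could be added to the list in the same way.

```r
# Simulated stand-in data (hypothetical values, not the real mpg file)
set.seed(3)
toy <- data.frame(acceleration = runif(200, 8, 25))
toy$mpg <- exp(1 + 0.6 * sqrt(toy$acceleration) + rnorm(200, sd = 0.3))

# Candidate transformations of the covariate
candidates <- list(
  log  = log,
  sqrt = sqrt
)

# Fit log(mpg) ~ acceleration + f(acceleration) for each f, keep adjusted R^2
adj_r2 <- sapply(candidates, function(f) {
  d <- data.frame(
    log_mpg      = log(toy$mpg),
    acceleration = toy$acceleration,
    transformed  = f(toy$acceleration)
  )
  summary(lm(log_mpg ~ acceleration + transformed, data = d))$adj.r.squared
})

print(round(adj_r2, 4))

# Name of the transformation with the highest adjusted R-squared
best <- names(which.max(adj_r2))
print(best)
```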

6. Using the “data.frame” function, create a new data frame with the following three variables:
• The Box-Cox transformation of mpg
mpg_transformation <- data.frame(log_mpg=log(mpg$mpg))

• acceleration

mpg_transformation <- data.frame(acceleration=mpg$acceleration, mpg_transformation)

• The transformation of acceleration that yielded the best adjusted R-squared in the preceding question

mpg_transformation <- data.frame(sqrt_acceleration=sqrt(mpg$acceleration), mpg_transformation)
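
The three steps above can also be written as a single `data.frame()` call, which yields the same columns in the same order. The sketch below uses the six rows shown by `head(mpg)` earlier as a small stand-in; the names `toy` and `toy_transformation` are illustrative only.

```r
# First six rows of the dataset, as printed by head(mpg) above
toy <- data.frame(
  mpg          = c(18, 15, 18, 16, 17, 15),
  acceleration = c(12.0, 11.5, 11.0, 12.0, 10.5, 10.0)
)

# Build all three variables in one call
toy_transformation <- data.frame(
  sqrt_acceleration = sqrt(toy$acceleration),
  acceleration      = toy$acceleration,
  log_mpg           = log(toy$mpg)
)

str(toy_transformation)
```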
7. Apply unit normal scaling to the new data frame that you created in Question 6. Is acceleration or its transformation more influential in predicting the Box-Cox transformation of mpg?
mpg_unit_normal = as.data.frame(apply(mpg_transformation, 2, function(x){(x - mean(x))/sd(x)}))
model2_unit_normal <- lm(log_mpg ~., data=mpg_unit_normal)
summary(model2_unit_normal)
## 
## Call:
## lm(formula = log_mpg ~ ., data = mpg_unit_normal)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.13079 -0.66924  0.00226  0.63756  2.27330 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.991e-15  4.420e-02   0.000 1.000000    
## sqrt_acceleration  2.446e+00  6.502e-01   3.762 0.000194 ***
## acceleration      -1.994e+00  6.502e-01  -3.067 0.002310 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8817 on 395 degrees of freedom
## Multiple R-squared:  0.2265, Adjusted R-squared:  0.2225 
## F-statistic: 57.82 on 2 and 395 DF,  p-value: < 2.2e-16

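The transformation of acceleration is the more influential predictor. After unit normal scaling the coefficients are directly comparable, and sqrt_acceleration has the larger absolute coefficient (2.446 versus 1.994 for acceleration). It also has the smaller p-value (0.000194 versus 0.002310).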
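
Unit normal scaling can also be done with base R's `scale()`, and the predictors can then be ranked by the absolute value of their standardized coefficients. The following is a minimal sketch on simulated data (the `toy` data frame is an assumption, not the real mpg file); it also checks that `scale()` matches the manual `(x - mean(x)) / sd(x)` calculation used above.

```r
# Simulated stand-in data (hypothetical values)
set.seed(4)
toy <- data.frame(acceleration = runif(200, 8, 25))
toy$sqrt_acceleration <- sqrt(toy$acceleration)
toy$log_mpg <- 1 + 0.6 * toy$sqrt_acceleration + rnorm(200, sd = 0.3)

# scale() centers each column and divides it by its standard deviation
toy_scaled <- as.data.frame(scale(toy))

# Same result as the manual (x - mean(x)) / sd(x) approach
manual <- as.data.frame(apply(toy, 2, function(x) (x - mean(x)) / sd(x)))
stopifnot(isTRUE(all.equal(toy_scaled, manual, check.attributes = FALSE)))

scaled_model <- lm(log_mpg ~ ., data = toy_scaled)

# Rank predictors by absolute standardized coefficient (intercept dropped)
std_coef  <- coef(scaled_model)[-1]
influence <- sort(abs(std_coef), decreasing = TRUE)
print(influence)
```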