Create an R Markdown file to answer the following questions, and then
“knit” your file to create an HTML document. Your HTML document should
contain both textual explanations of your answers, as well as all R code
needed to support your work. Be sure to write your R code in the format
for R Markdown code chunks learned in the first class:
{r} Place R code here
Submit both your HTML document and original R Markdown file. If you have trouble uploading your files on Brightspace, be sure to e-mail them to the instructor.
Use the mpg Excel file to answer all of the questions below.
# Load the readxl package, which provides read_excel()
library(readxl)
mpg_data <- read_excel("/Users/kamriefoster/Downloads/mpg.xlsx")
mpg_data <- as.data.frame(mpg_data)
model1 <- lm(mpg~., data = mpg_data)
summary(model1)
##
## Call:
## lm(formula = mpg ~ ., data = mpg_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -18.007  -5.636  -1.242   4.758  23.192
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)    4.9698     2.0432   2.432   0.0154 *
## acceleration   1.1912     0.1292   9.217   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.101 on 396 degrees of freedom
## Multiple R-squared: 0.1766, Adjusted R-squared: 0.1746
## F-statistic: 84.96 on 1 and 396 DF, p-value: < 2.2e-16
The adjusted R-squared value for this model is 0.1746.
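As a rough illustration of where the adjusted R-squared comes from, the sketch below recomputes it by hand as 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors (in the output above, n - p - 1 = 396). Since mpg.xlsx is not available here, the sketch assumes the built-in mtcars data as a stand-in, with qsec playing the role of acceleration; the names fit_demo and adj_manual are hypothetical.

```r
# Stand-in data (assumption): mtcars replaces mpg.xlsx, qsec replaces acceleration
fit_demo <- lm(mpg ~ qsec, data = mtcars)
s <- summary(fit_demo)

n <- nrow(mtcars)               # number of observations
p <- length(coef(fit_demo)) - 1 # number of predictors, excluding the intercept

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

# Should agree with the value reported by summary()
all.equal(adj_manual, s$adj.r.squared)
```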
plot(model1$fitted.values, model1$residuals)
abline(h = 0)
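The residuals are not spread evenly around the horizontal line at 0 (the summary above shows a median of -1.242 and a longer positive tail, with a maximum of 23.192 against a minimum of -18.007), so a Box-Cox transformation of the response could benefit the model. Next, the best value of lambda for the Box-Cox transformation is found.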
library(MASS)
boxcox(model1)
With a lambda value of around 0, the best Box-Cox transformation is a log transformation of the mpg variable, since the Box-Cox family reduces to log(y) when lambda = 0.
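Reading lambda off the plot can be supplemented with a numeric check. The sketch below is one possible way to do that, assuming the built-in mtcars data as a stand-in for mpg.xlsx (qsec in place of acceleration); fit_demo, lambda_hat and ci are hypothetical names. It relies on boxcox() returning the lambda grid and the profile log-likelihood when plotit = FALSE.

```r
library(MASS)

# Stand-in data (assumption): mtcars replaces mpg.xlsx, qsec replaces acceleration
fit_demo <- lm(mpg ~ qsec, data = mtcars)

# With plotit = FALSE, boxcox() returns the lambda grid (x)
# and the profile log-likelihood at each grid point (y)
bc <- boxcox(fit_demo, lambda = seq(-2, 2, by = 0.1), plotit = FALSE)

# Grid value of lambda with the highest log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat

# Approximate 95% interval: grid values whose log-likelihood lies
# within qchisq(0.95, 1) / 2 of the maximum
ci <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, 1) / 2])
ci
```

If the interval contains 0, a log transformation of the response is a reasonable and easily interpreted choice.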
model2 <- lm(I(log(mpg))~., data = mpg_data)
summary(model2)
##
## Call:
## lm(formula = I(log(mpg)) ~ ., data = mpg_data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.06515 -0.23641 -0.00943  0.23576  0.79343
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)   2.24656    0.08759  25.648   <2e-16 ***
## acceleration  0.05491    0.00554   9.911   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3044 on 396 degrees of freedom
## Multiple R-squared: 0.1987, Adjusted R-squared: 0.1967
## F-statistic: 98.23 on 1 and 396 DF, p-value: < 2.2e-16
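The new adjusted R-squared value is 0.1967, which is higher than the 0.1746 of the first model, although the two values are measured on different response scales (log(mpg) versus mpg) and so are not strictly comparable. This is still not a strong model, and other transformations could be applied to improve the R-squared value.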
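One caveat when comparing these two fits is that one model explains variation in mpg and the other explains variation in log(mpg). A hedged way to put them on a common footing is to back-transform the log model's fitted values and compute R-squared on the mpg scale for both. The sketch below assumes the built-in mtcars data as a stand-in for mpg.xlsx (qsec in place of acceleration); r2_on_mpg is a hypothetical helper. Note that exp() of a fitted log value is only a rough prediction of mpg, so this is an approximate comparison.

```r
# Stand-in data (assumption): mtcars replaces mpg.xlsx, qsec replaces acceleration
dat <- data.frame(mpg = mtcars$mpg, acceleration = mtcars$qsec)

fit_raw <- lm(mpg ~ acceleration, data = dat)
fit_log <- lm(log(mpg) ~ acceleration, data = dat)

# Hypothetical helper: R-squared on the original mpg scale
r2_on_mpg <- function(pred, y) {
  1 - sum((y - pred)^2) / sum((y - mean(y))^2)
}

r2_raw <- r2_on_mpg(fitted(fit_raw), dat$mpg)
# Back-transform the log-scale fitted values before comparing
r2_log <- r2_on_mpg(exp(fitted(fit_log)), dat$mpg)

c(raw = r2_raw, log_backtransformed = r2_log)
```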
plot(mpg_data$acceleration, mpg_data$mpg, xlab="Acceleration", ylab="mpg")
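The scatterplot suggests that the relationship between acceleration and mpg is not perfectly linear, so several transformations of acceleration are tried below as additional covariates.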
#acceleration squared
model3 <- lm(I(log(mpg))~ acceleration + I(acceleration^2), data = mpg_data)
summary(model3)
##
## Call:
## lm(formula = I(log(mpg)) ~ acceleration + I(acceleration^2),
##     data = mpg_data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.07126 -0.22527 -0.00066  0.21838  0.77803
##
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)        1.023320   0.331575   3.086 0.002170 **
## acceleration       0.213095   0.041764   5.102 5.22e-07 ***
## I(acceleration^2) -0.004959   0.001298  -3.820 0.000155 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2993 on 395 degrees of freedom
## Multiple R-squared: 0.2273, Adjusted R-squared: 0.2234
## F-statistic: 58.1 on 2 and 395 DF, p-value: < 2.2e-16
#inverse of acceleration
model4 <- lm(I(log(mpg))~ acceleration + I(1/acceleration), data = mpg_data)
summary(model4)
##
## Call:
## lm(formula = I(log(mpg)) ~ acceleration + I(1/acceleration),
##     data = mpg_data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.05749 -0.22920  0.00108  0.22127  0.76895
##
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)         4.26294    0.58605   7.274 1.89e-12 ***
## acceleration       -0.01068    0.01963  -0.544  0.58682
## I(1/acceleration) -14.99800    4.31148  -3.479  0.00056 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3002 on 395 degrees of freedom
## Multiple R-squared: 0.2226, Adjusted R-squared: 0.2186
## F-statistic: 56.54 on 2 and 395 DF, p-value: < 2.2e-16
#log of acceleration
model5 <- lm(I(log(mpg))~ acceleration + I(log(acceleration)), data = mpg_data)
summary(model5)
##
## Call:
## lm(formula = I(log(mpg)) ~ acceleration + I(log(acceleration)),
##     data = mpg_data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.06111 -0.22515  0.00151  0.21794  0.77069
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)          -1.59298    1.04365  -1.526 0.127724
## acceleration         -0.09011    0.03966  -2.272 0.023624 *
## I(log(acceleration))  2.23400    0.60516   3.692 0.000254 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2997 on 395 degrees of freedom
## Multiple R-squared: 0.2255, Adjusted R-squared: 0.2215
## F-statistic: 57.49 on 2 and 395 DF, p-value: < 2.2e-16
#square root of acceleration
model6 <- lm(I(log(mpg))~ acceleration + I(sqrt(acceleration)), data = mpg_data)
summary(model6)
##
## Call:
## lm(formula = I(log(mpg)) ~ acceleration + I(sqrt(acceleration)),
##     data = mpg_data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.06339 -0.22731  0.00077  0.21655  0.77214
##
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)           -2.38791    1.23506  -1.933 0.053897 .
## acceleration          -0.24561    0.08008  -3.067 0.002310 **
## I(sqrt(acceleration))  2.36968    0.62997   3.762 0.000194 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2995 on 395 degrees of freedom
## Multiple R-squared: 0.2265, Adjusted R-squared: 0.2225
## F-statistic: 57.82 on 2 and 395 DF, p-value: < 2.2e-16
Based on the adjusted R-squared values above, the best of the four covariate transformations, each combined with the Box-Cox (log) transformation of mpg, is acceleration squared (model 3), with an adjusted R-squared of 0.2234. The alternatives are slightly lower: 0.2225 for the square root, 0.2215 for the log, and 0.2186 for the inverse of acceleration.
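The comparison above can also be done programmatically rather than by reading each summary. The following is a minimal sketch, assuming the built-in mtcars data as a stand-in for mpg.xlsx (qsec in place of acceleration); the list fits and its element names are hypothetical labels for the four candidate models.

```r
# Stand-in data (assumption): mtcars replaces mpg.xlsx, qsec replaces acceleration
dat <- data.frame(mpg = mtcars$mpg, acceleration = mtcars$qsec)

# The four candidate transformations of acceleration, each with log(mpg) as response
fits <- list(
  squared = lm(log(mpg) ~ acceleration + I(acceleration^2), data = dat),
  inverse = lm(log(mpg) ~ acceleration + I(1 / acceleration), data = dat),
  log     = lm(log(mpg) ~ acceleration + log(acceleration), data = dat),
  sqrt    = lm(log(mpg) ~ acceleration + sqrt(acceleration), data = dat)
)

# Extract the adjusted R-squared from each fit and rank the models
adj_r2 <- sapply(fits, function(f) summary(f)$adj.r.squared)
sort(adj_r2, decreasing = TRUE)

# Name of the model with the highest adjusted R-squared
best <- names(which.max(adj_r2))
best
```

All four models share the same response and the same number of parameters, so ranking them by adjusted R-squared is a fair comparison.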
# Log-transformed response (Box-Cox transformation with lambda near 0)
bc.mpg <- log(mpg_data$mpg)
# Squared acceleration term, as used in model 3
transformation <- mpg_data$acceleration^2
acceleration <- mpg_data$acceleration
mpg2 <- data.frame(bc.mpg, acceleration, transformation)
# Standardize every column to mean 0 and standard deviation 1
mpg_unit_normal <- as.data.frame(apply(mpg2, 2, function(x){(x - mean(x))/sd(x)}))
mpg_reg_unit_normal <- lm(bc.mpg ~., data = mpg_unit_normal)
summary(mpg_reg_unit_normal)
##
## Call:
## lm(formula = bc.mpg ~ ., data = mpg_unit_normal)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -3.15395 -0.66322 -0.00194  0.64295  2.29063
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)     7.683e-16  4.417e-02   0.000 1.000000
## acceleration    1.730e+00  3.391e-01   5.102 5.22e-07 ***
## transformation -1.295e+00  3.391e-01  -3.820 0.000155 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8813 on 395 degrees of freedom
## Multiple R-squared: 0.2273, Adjusted R-squared: 0.2234
## F-statistic: 58.1 on 2 and 395 DF, p-value: < 2.2e-16
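Comparing the standardized coefficients, acceleration (1.730) has a larger magnitude than the squared term (-1.295), so acceleration appears to be more influential in predicting log(mpg) than the transformation. Both terms are still statistically significant (p < 0.001).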
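The unit normal scaling used above can also be sketched with base R's scale(), and the standardized coefficients can be cross-checked against the unscaled fit, since each one equals the raw coefficient multiplied by sd(x)/sd(y). The example below assumes the built-in mtcars data as a stand-in for mpg.xlsx (qsec in place of acceleration); dat, raw_fit, std_fit and implied are hypothetical names.

```r
# Stand-in data (assumption): mtcars replaces mpg.xlsx, qsec replaces acceleration
dat <- data.frame(
  bc.mpg         = log(mtcars$mpg),
  acceleration   = mtcars$qsec,
  transformation = mtcars$qsec^2
)

# Manual standardization, as in the report, versus scale()
manual <- as.data.frame(apply(dat, 2, function(x) (x - mean(x)) / sd(x)))
scaled <- as.data.frame(scale(dat))
all.equal(as.matrix(manual), as.matrix(scaled), check.attributes = FALSE)

# Fit the same model on the raw and on the standardized columns
raw_fit <- lm(bc.mpg ~ ., data = dat)
std_fit <- lm(bc.mpg ~ ., data = scaled)

# Standardized slope = raw slope * sd(predictor) / sd(response)
implied <- coef(raw_fit)[-1] * sapply(dat[-1], sd) / sd(dat$bc.mpg)
rbind(implied = implied, fitted = coef(std_fit)[-1])
```

Standardizing changes only the units of the coefficients; the t values, p-values and R-squared are the same as in the unscaled model, which matches the report's output for model 3.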