library('xlsx')
mpg <- read.xlsx("/Users/jusimioni/Desktop/MSDA 2021-23/Fall 2022/data/mpg.xlsx", sheetIndex = 1)
mpg <- as.data.frame(mpg)
head(mpg)
## mpg acceleration
## 1 18 12.0
## 2 15 11.5
## 3 18 11.0
## 4 16 12.0
## 5 17 10.5
## 6 15 10.0
mpg_model <- lm(mpg~., data=mpg)
summary(mpg_model)
##
## Call:
## lm(formula = mpg ~ ., data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.007 -5.636 -1.242 4.758 23.192
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.9698 2.0432 2.432 0.0154 *
## acceleration 1.1912 0.1292 9.217 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.101 on 396 degrees of freedom
## Multiple R-squared: 0.1766, Adjusted R-squared: 0.1746
## F-statistic: 84.96 on 1 and 396 DF, p-value: < 2.2e-16
The adjusted R-squared of the model is 0.1746 (17.46%).
3. Determine if a Box-Cox transformation would be beneficial. If so,
perform the transformation. What is your adjusted R-squared value for
your model now? Did it improve after applying a Box-Cox
transformation?
The following plot should show if a transformation is needed.
plot(mpg_model$fitted.values, mpg_model$residuals)
abline(h = 0)
In the residuals-versus-fitted plot, the relationship does not look linear, so a Box-Cox transformation should be beneficial. The Box-Cox plot produced next helps decide which transformation to use.
library(MASS)
boxcox(mpg_model)
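The Box-Cox plot can also be read numerically. The sketch below is only an illustration: it assumes the built-in `mtcars` data as a stand-in for the spreadsheet, with `qsec` playing the role of acceleration, and the names `fit`, `lambda_hat`, and `lambda_ci` are hypothetical. It pulls out the lambda with the highest profile log-likelihood and an approximate 95% interval around it.

```r
library(MASS)

# Assumption: mtcars stands in for the mpg spreadsheet,
# and qsec (quarter-mile time) plays the role of acceleration.
fit <- lm(mpg ~ qsec, data = mtcars)

# plotit = FALSE returns the lambda grid (x) and the profile log-likelihood (y)
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.1), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: grid values whose log-likelihood lies
# within qchisq(0.95, 1) / 2 of the maximum
in_ci <- bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2
lambda_ci <- range(bc$x[in_ci])

print(lambda_hat)
print(lambda_ci)
```

If the interval contains 0, a log transformation of the response is a reasonable choice; if it contains 1, the transformation is unlikely to help much.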
Using a log transformation of the response (the Box-Cox case λ = 0).
model_log <- lm(I(log(mpg)) ~., data=mpg)
summary(model_log)
##
## Call:
## lm(formula = I(log(mpg)) ~ ., data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.06515 -0.23641 -0.00943 0.23576 0.79343
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.24656 0.08759 25.648 <2e-16 ***
## acceleration 0.05491 0.00554 9.911 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3044 on 396 degrees of freedom
## Multiple R-squared: 0.1987, Adjusted R-squared: 0.1967
## F-statistic: 98.23 on 1 and 396 DF, p-value: < 2.2e-16
The new adjusted R-squared is 0.1967 (19.67%), which is better than the 0.1746 (17.46%) of the untransformed linear regression.
plot(mpg$mpg, mpg$acceleration, ylab='Acceleration', xlab ='MPG')
The relationship is non-linear: the plot shows no well-defined straight line, only many points scattered around.
Among the common transformation curves, two have a shape similar to the chart above: y = log(x) and y = sqrt(x).
#Transformation of acceleration - Using Log
model_log <- lm(I(log(mpg)) ~ acceleration+I(log(acceleration)), data=mpg)
summary(model_log)
##
## Call:
## lm(formula = I(log(mpg)) ~ acceleration + I(log(acceleration)),
## data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.06111 -0.22515 0.00151 0.21794 0.77069
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.59298 1.04365 -1.526 0.127724
## acceleration -0.09011 0.03966 -2.272 0.023624 *
## I(log(acceleration)) 2.23400 0.60516 3.692 0.000254 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2997 on 395 degrees of freedom
## Multiple R-squared: 0.2255, Adjusted R-squared: 0.2215
## F-statistic: 57.49 on 2 and 395 DF, p-value: < 2.2e-16
#Transformation of acceleration - Using sqrt
model_sqrt <- lm(I(log(mpg)) ~ acceleration+I(sqrt(acceleration)), data=mpg)
summary(model_sqrt)
##
## Call:
## lm(formula = I(log(mpg)) ~ acceleration + I(sqrt(acceleration)),
## data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.06339 -0.22731 0.00077 0.21655 0.77214
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.38791 1.23506 -1.933 0.053897 .
## acceleration -0.24561 0.08008 -3.067 0.002310 **
## I(sqrt(acceleration)) 2.36968 0.62997 3.762 0.000194 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2995 on 395 degrees of freedom
## Multiple R-squared: 0.2265, Adjusted R-squared: 0.2225
## F-statistic: 57.82 on 2 and 395 DF, p-value: < 2.2e-16
The square root transformation has the better adjusted R-squared. The two models are still very similar, but the square root of x gives an adjusted R-squared of 0.2225 versus 0.2215 for the log, a difference of 0.1 percentage points. Both adjusted R-squared values are better than that of the previous model (0.1967).
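The comparison above can be done programmatically instead of reading each summary by eye. This is a minimal sketch, again assuming the built-in `mtcars` data with `qsec` standing in for acceleration; the list name `candidates` and its labels are hypothetical.

```r
# Assumption: mtcars stands in for the mpg spreadsheet,
# and qsec plays the role of acceleration.
candidates <- list(
  linear_only = log(mpg) ~ qsec,
  with_log    = log(mpg) ~ qsec + log(qsec),
  with_sqrt   = log(mpg) ~ qsec + sqrt(qsec)
)

# Fit each candidate and extract the adjusted R-squared from its summary
adj_r2 <- sapply(candidates, function(f) {
  summary(lm(f, data = mtcars))$adj.r.squared
})

print(round(adj_r2, 4))
print(names(which.max(adj_r2)))
```

All candidates share the same response, log(mpg), so their adjusted R-squared values are directly comparable.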
mpg_transformation <- data.frame(log_mpg=log(mpg$mpg))
• acceleration
mpg_transformation <- data.frame(acceleration=mpg$acceleration, mpg_transformation)
• The transformation of acceleration that yielded the best adjusted R-squared in the preceding question
mpg_transformation <- data.frame(sqrt_acceleration=sqrt(mpg$acceleration), mpg_transformation)
# Unit normal scaling: center each column to mean 0 and scale to standard deviation 1
mpg_unit_normal <- as.data.frame(scale(mpg_transformation))
model2_unit_normal <- lm(log_mpg ~., data=mpg_unit_normal)
summary(model2_unit_normal)
##
## Call:
## lm(formula = log_mpg ~ ., data = mpg_unit_normal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.13079 -0.66924 0.00226 0.63756 2.27330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.991e-15 4.420e-02 0.000 1.000000
## sqrt_acceleration 2.446e+00 6.502e-01 3.762 0.000194 ***
## acceleration -1.994e+00 6.502e-01 -3.067 0.002310 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8817 on 395 degrees of freedom
## Multiple R-squared: 0.2265, Adjusted R-squared: 0.2225
## F-statistic: 57.82 on 2 and 395 DF, p-value: < 2.2e-16
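After unit normal scaling, the transformed variable sqrt_acceleration has the larger standardized coefficient in absolute value (2.446 versus -1.994 for acceleration) and the smaller p-value (0.000194 versus 0.002310), so it has the greater influence on the predicted log_mpg.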
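Ranking predictors by standardized coefficient can be sketched as follows. The example assumes the built-in `mtcars` data with `qsec` standing in for acceleration, and the names `df`, `df_std`, `fit_std`, and `std_coef` are hypothetical.

```r
# Assumption: mtcars stands in for the mpg spreadsheet,
# and qsec plays the role of acceleration.
df <- data.frame(
  log_mpg   = log(mtcars$mpg),
  qsec      = mtcars$qsec,
  sqrt_qsec = sqrt(mtcars$qsec)
)

# scale() centers each column to mean 0 and rescales it to standard deviation 1
df_std <- as.data.frame(scale(df))
fit_std <- lm(log_mpg ~ ., data = df_std)

# Drop the intercept and rank predictors by absolute standardized coefficient
std_coef <- coef(fit_std)[-1]
print(sort(abs(std_coef), decreasing = TRUE))
```

Because a variable and its square root are strongly correlated, the standardized coefficients can be large with opposite signs, so their magnitudes are best read together with the p-values.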