You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:
1.- "Is an automatic or manual transmission better for MPG?"
2.- "Quantify the MPG difference between automatic and manual transmissions."
data(mtcars)
mtcars2 <- mtcars #We will use this later
The following are the variables in the mtcars data set:
names(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
As we can see, the transmission type is stored in the variable am: 0 means that the car has an automatic transmission, whereas 1 means that it has a manual transmission. We convert it to a factor with descriptive labels:
mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("Automatic transmission", "Manual transmission")
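As a minimal sketch of the same conversion, the labels can also be attached in a single `factor()` call, which makes the 0/1 mapping explicit instead of relying on level order. The names `cars` and `am_counts` are illustrative and not part of the original analysis; the sketch works on a copy of the data.

```r
# Work on a copy so the original mtcars object is left untouched.
cars <- datasets::mtcars

# Map 0 -> automatic and 1 -> manual explicitly through levels/labels.
cars$am <- factor(cars$am,
                  levels = c(0, 1),
                  labels = c("Automatic transmission", "Manual transmission"))

# Count how many cars fall into each transmission group.
am_counts <- table(cars$am)
print(am_counts)
```

Checking the group sizes at this point is useful because every later comparison rests on these two groups.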
We will make a boxplot to compare mpg between automatic and manual transmissions. Looking at the boxplot, we can see that cars with a manual transmission tend to have higher mpg than cars with an automatic transmission.
boxplot(mpg ~ am, data = mtcars, col = "red", ylab = "Miles per gallon (mpg)")
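The boxplot shows the two distributions but not the numbers behind them. A short sketch of how the group means and medians could be computed is given here; the object names and the shortened labels "Automatic" and "Manual" are assumptions made for this example only.

```r
# Copy of the data with a labelled transmission factor (shortened labels).
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Mean and median mpg within each transmission group.
group_means   <- tapply(cars$mpg, cars$am, mean)
group_medians <- tapply(cars$mpg, cars$am, median)

# Raw (unadjusted) difference in mean mpg, manual minus automatic.
mean_diff <- unname(group_means["Manual"] - group_means["Automatic"])

print(round(group_means, 2))
print(round(group_medians, 2))
print(round(mean_diff, 2))
```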
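To check whether this difference is statistically significant, we take as the null hypothesis that the mean mpg is the same for both transmission types and perform a Welch two-sample t-test. The obtained p-value is 0.001374. As this value is smaller than our significance level (0.05), we reject the null hypothesis: manual cars have a higher mean mpg than automatic cars (24.39 versus 17.15, a difference of about 7.2 mpg, with a 95% confidence interval of 3.2 to 11.3 mpg). Note that this test compares the raw group means and does not hold the other variables equal; whether, for example, a manual car still has better mpg than an automatic car with the same horsepower has to be checked with a regression model.
t.test(mtcars$mpg ~ mtcars$am, conf.level = 0.95)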
##
## Welch Two Sample t-test
##
## data: mtcars$mpg by mtcars$am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group Automatic transmission mean in group Manual transmission
## 17.14737 24.39231
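Rather than reading the figures off the printed output, the p-value, confidence interval and group means can be taken directly from the object that `t.test()` returns. The following is a sketch under the assumption that the test is rerun on a copy of the data; `tt`, `p_value`, `ci` and `means` are illustrative names.

```r
# Copy of the data with the same factor labels as in the report.
cars <- datasets::mtcars
cars$am <- factor(cars$am,
                  levels = c(0, 1),
                  labels = c("Automatic transmission", "Manual transmission"))

# Welch two-sample t-test (the default, var.equal = FALSE).
tt <- t.test(mpg ~ am, data = cars, conf.level = 0.95)

p_value <- tt$p.value    # two-sided p-value
ci      <- tt$conf.int   # 95% CI for mean(automatic) - mean(manual)
means   <- tt$estimate   # the two group means

# Decision at the 5% significance level.
reject_null <- p_value < 0.05
print(p_value)
print(ci)
print(reject_null)
```

The interval is for automatic minus manual, which is why both of its bounds are negative.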
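To build a regression model, we first look at how strongly each variable is correlated with mpg. We then fit a model that contains am together with several of the more strongly correlated variables (vs, drat, wt, hp and cyl) and compare it with the model that uses am alone.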
cor(mtcars2$mpg,mtcars2[,-1])
## cyl disp hp drat wt qsec
## [1,] -0.852162 -0.8475514 -0.7761684 0.6811719 -0.8676594 0.418684
## vs am gear carb
## [1,] 0.6640389 0.5998324 0.4802848 -0.5509251
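To make the ranking easier to read, the correlations can be sorted by absolute value. This is only a sketch; `cors` and `ranked` are names introduced for the example, and the untouched numeric copy of the data is used so that am is still coded 0/1.

```r
# Numeric copy of the data (am still coded as 0/1).
cars <- datasets::mtcars

# Correlation of mpg with every other column, as a named vector.
cors <- cor(cars$mpg, cars[, -1])[1, ]

# Order by strength of association, ignoring the sign.
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 3))
```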
mod <- lm(mpg ~ am + vs + drat + wt + hp + cyl, data = mtcars)
m1 <- lm(mpg ~ am, data = mtcars)
anova(m1,mod)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + vs + drat + wt + hp + cyl
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 25 166.84 5 554.06 16.605 3.033e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(mod)
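The F statistic reported by `anova()` for two nested models can be reproduced by hand from the residual sums of squares, which may help in reading the table. The sketch below assumes the same two models as above, refitted on a copy of the data; `small`, `large` and the other names are illustrative.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am,
                  levels = c(0, 1),
                  labels = c("Automatic transmission", "Manual transmission"))

# Nested models: transmission only, and transmission plus five covariates.
small <- lm(mpg ~ am, data = cars)
large <- lm(mpg ~ am + vs + drat + wt + hp + cyl, data = cars)

# Residual sums of squares and the number of extra parameters.
rss_small <- sum(resid(small)^2)
rss_large <- sum(resid(large)^2)
df_extra  <- df.residual(small) - df.residual(large)

# F = (drop in RSS per extra parameter) / (residual variance of larger model).
f_stat <- ((rss_small - rss_large) / df_extra) /
  (rss_large / df.residual(large))
p_val <- pf(f_stat, df_extra, df.residual(large), lower.tail = FALSE)

print(c(F = f_stat, p = p_val))
```

A small p-value here means that the added variables jointly reduce the residual error by more than chance would explain; it says nothing about the am coefficient on its own.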
We can also create the model using all the variables:
mod2 <- lm(mpg ~ ., data = mtcars)
anova(m1,mod2)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 21 147.49 9 573.4 9.0711 1.779e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(mod2)
summary(mod)
##
## Call:
## lm(formula = mpg ~ am + vs + drat + wt + hp + cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5421 -1.5787 -0.4003 1.3326 5.4488
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.34852 9.01121 3.479 0.00186 **
## amManual transmission 1.83252 1.76168 1.040 0.30820
## vs 1.19317 1.84800 0.646 0.52438
## drat 0.40474 1.51180 0.268 0.79111
## wt -2.50419 0.96337 -2.599 0.01545 *
## hp -0.02660 0.01437 -1.850 0.07611 .
## cyl -0.32673 0.85544 -0.382 0.70573
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.583 on 25 degrees of freedom
## Multiple R-squared: 0.8518, Adjusted R-squared: 0.8163
## F-statistic: 23.96 on 6 and 25 DF, p-value: 3.139e-09
summary(mod2)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## amManual transmission 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
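It can be seen that the first model explains 85.18% of the variance in mpg and the second one 86.9%. The adjusted R-squared, however, is slightly higher for the first model (0.8163 versus 0.8066), so the additional variables in the full model add little. In both models the coefficient for manual transmission is positive (1.83 and 2.52 mpg, respectively) but not statistically significant (p = 0.308 and p = 0.234): once variables such as weight and horsepower are taken into account, the data do not show a clear mpg difference between the two transmission types.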
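The figures used in this comparison can be collected programmatically instead of being read from the two summaries. The following sketch refits both models on a copy of the data and also computes a 95% confidence interval for the transmission coefficient in the first model; `mod_sel`, `mod_all`, `fit_stats` and `am_ci` are names assumed for the example.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am,
                  levels = c(0, 1),
                  labels = c("Automatic transmission", "Manual transmission"))

mod_sel <- lm(mpg ~ am + vs + drat + wt + hp + cyl, data = cars)
mod_all <- lm(mpg ~ ., data = cars)

# R-squared and adjusted R-squared for both models, side by side.
fit_stats <- sapply(list(selected = mod_sel, all = mod_all), function(m) {
  s <- summary(m)
  c(r2 = s$r.squared, adj_r2 = s$adj.r.squared)
})
print(round(fit_stats, 4))

# Adjusted manual-vs-automatic difference with its 95% confidence interval.
am_est <- coef(mod_sel)[["amManual transmission"]]
am_ci  <- confint(mod_sel, "amManual transmission", level = 0.95)
print(am_est)
print(am_ci)
```

An interval that includes zero is another way of seeing that the adjusted transmission effect is not significant at the 5% level.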