This document examines a data set of cars and explores the relationship between a set of variables and miles per gallon (MPG), the outcome.
There is a particular interest in the following two questions:
Key takeaways:
- Regarding the first question: manual transmissions are better for MPG.
- Regarding the second question: with weight and quarter-mile time held constant, manual transmissions are better by about 2.94 MPG. Without any adjustment, the mean MPG of manual cars is higher by 7.24.
At first glance, manual transmission cars seem to have higher MPG than automatic cars. Let’s try to quantify it. For the code and graphics, go to the Appendix section, Plot 1.
fit <- lm(mpg ~ am, data = mtcars)
summary(fit)$coefficients
summary(fit)$r.squared
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## am 7.244939 1.764422 4.106127 2.850207e-04
## [1] 0.3597989
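Because `am` is a 0/1 indicator, the slope of this simple model should coincide with the difference between the two group means. The following minimal sketch checks that, working on a fresh copy of the built-in data; the names `cars`, `group_means`, `diff_means`, and `slope` are chosen for illustration and are not part of the original analysis.

```r
# Illustrative copy of the built-in data; `cars` is a name chosen for this sketch.
cars <- datasets::mtcars

# Mean MPG per transmission group (am: 0 = automatic, 1 = manual).
group_means <- tapply(cars$mpg, cars$am, mean)
diff_means <- unname(group_means["1"] - group_means["0"])

# Slope of the simple regression of mpg on the 0/1 indicator.
slope <- unname(coef(lm(mpg ~ am, data = cars))["am"])

# TRUE when the slope equals the difference in group means.
isTRUE(all.equal(diff_means, slope))
```

The intercept, in turn, is the mean MPG of the automatic group, which is why it matches the automatic mean reported by the t-test in the Appendix.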
fullModel <- lm(mpg ~ ., data=mtcars)
summary(fullModel)$adj.r.squared
sum(summary(fullModel)$coef[,4] < 0.05)
## [1] 0.8066423
## [1] 0
Although the adjusted R-squared is 0.81, no individual predictor stands out: none of the p-values is below 0.05. For more details on the coefficients, see Appendix: Full Model Coefficients.
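A high R-squared with no individually significant predictor is often a sign that the predictors are strongly correlated with one another. The analysis does not test this, so the following is only a sketch under that assumption: it computes variance inflation factors by hand, regressing each predictor on all the others. The names `cars`, `predictors`, and `vif` are illustrative.

```r
# Illustrative copy of the built-in data with all variables numeric.
cars <- datasets::mtcars
predictors <- setdiff(names(cars), "mpg")

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the remaining predictors.
vif <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = cars))$r.squared
  1 / (1 - r2)
})

# Largest values first; values well above 10 are commonly read as
# a symptom of strong collinearity.
round(sort(vif, decreasing = TRUE), 1)
```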
stepModel <- step(fullModel, k=log(nrow(mtcars)))
summary(stepModel)$coefficients
summary(stepModel)$r.squared
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.617781 6.9595930 1.381946 1.779152e-01
## wt -3.916504 0.7112016 -5.506882 6.952711e-06
## qsec 1.225886 0.2886696 4.246676 2.161737e-04
## am 2.935837 1.4109045 2.080819 4.671551e-02
## [1] 0.8496636
Much better! The variables (wt + qsec + am) explain 85% of the variability in MPG. All three predictor coefficients are statistically significant at the 0.05 level (the intercept is not), and wt has the smallest p-value, which suggests it is the strongest predictor.
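Passing `k = log(nrow(mtcars))` to `step()` replaces the default AIC penalty with the BIC penalty, which is stricter about extra terms. As a rough check of what the search optimizes, the sketch below compares that criterion for the full model and for the selected formula. It refits both models on a fresh copy of the data; `cars`, `full`, and `reduced` are names introduced here for illustration.

```r
# Illustrative copy of the built-in data with all variables numeric.
cars <- datasets::mtcars
n <- nrow(cars)

full <- lm(mpg ~ ., data = cars)
reduced <- lm(mpg ~ wt + qsec + am, data = cars)

# AIC() with k = log(n) is the BIC; lower is better.
bic_values <- c(full = AIC(full, k = log(n)),
                reduced = AIC(reduced, k = log(n)))
bic_values

# The same number is returned by BIC() directly.
isTRUE(all.equal(unname(bic_values["reduced"]), BIC(reduced)))
```

The values differ from those printed by `step()` by an additive constant, because `step()` relies on `extractAIC()`, but the ranking of models fitted to the same data is the same.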
For details on the plot, see Appendix: Residuals Plot.
Taking as the null hypothesis that cars with manual and automatic transmissions come from the same population (equal mean MPG), the p-value of 0.00137 leads us to reject it at the 0.05 level. The data therefore indicate that automatic and manual cars differ in mean MPG.
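The diagnostic plots can be complemented with a few numeric summaries. The sketch below is one possible set of checks, not part of the original analysis: it refits the selected model on a fresh copy of the data and looks at the residuals and the leverage values. The names `cars`, `reduced`, `res`, and `lev` are illustrative.

```r
# Illustrative copy of the built-in data and refit of the selected model.
cars <- datasets::mtcars
reduced <- lm(mpg ~ wt + qsec + am, data = cars)

# Residuals of a least-squares fit with an intercept average to zero.
res <- residuals(reduced)
mean(res)

# Shapiro-Wilk test as a rough normality check on the residuals.
shapiro.test(res)$p.value

# Leverage (hat values); they sum to the number of coefficients (4 here).
lev <- hatvalues(reduced)
head(sort(lev, decreasing = TRUE), 3)
```

Cars with the highest leverage are the ones worth inspecting first in the residuals-versus-leverage panel.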
data(mtcars)
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$am <- factor(mtcars$am)
levels(mtcars$am) <- c("Automatic", "Manual")
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
summary(mtcars$mpg)
table(mtcars$am)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 15.42 19.20 20.09 22.80 33.90
##
## Automatic Manual
## 19 13
library(ggplot2)
g <- ggplot(mtcars, aes(x=am,y=mpg, fill=am)) + geom_boxplot()
g <- g + xlab("Transmission") + ylab("Miles per Gallon")
g <- g + ggtitle("Miles Per Gallon by Transmission \n (Plot 1)")
g
summary(fullModel)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337416 18.71788443 0.6573058 0.51812440
## cyl -0.11144048 1.04502336 -0.1066392 0.91608738
## disp 0.01333524 0.01785750 0.7467585 0.46348865
## hp -0.02148212 0.02176858 -0.9868407 0.33495531
## drat 0.78711097 1.63537307 0.4813036 0.63527790
## wt -3.71530393 1.89441430 -1.9611887 0.06325215
## qsec 0.82104075 0.73084480 1.1234133 0.27394127
## vs 0.31776281 2.10450861 0.1509915 0.88142347
## am 2.52022689 2.05665055 1.2254035 0.23398971
## gear 0.65541302 1.49325996 0.4389142 0.66520643
## carb -0.19941925 0.82875250 -0.2406258 0.81217871
How do we interpret the p-values?
plot(stepModel)
result <- t.test(mtcars$mpg ~ mtcars$am)
result$p.value
## [1] 0.001373638
result$estimate
## mean in group Automatic mean in group Manual
## 17.14737 24.39231
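Besides the p-value, the object returned by `t.test()` carries a confidence interval for the difference in means, which states the size of the effect rather than only its significance. The following sketch extracts it on a fresh copy of the data; `cars` and `welch` are names chosen here for illustration.

```r
# Illustrative copy of the built-in data with a labelled transmission factor.
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Welch two-sample t-test (the default, unequal variances).
welch <- t.test(mpg ~ am, data = cars)

# 95% confidence interval for mean(Automatic) - mean(Manual);
# an interval entirely below zero favours manual cars.
welch$conf.int

# Difference in group means, Manual minus Automatic.
unname(diff(welch$estimate))
```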