Background: This analysis explores the relationship between miles per gallon (MPG) and a set of variables collected on cars. The data were extracted from the 1974 Motor Trend US magazine and comprise fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models). Two questions are of interest: (1) Is an automatic or manual transmission better for MPG? (2) What is the MPG difference between automatic and manual transmissions?
Findings: Manual transmission is better for MPG. After adjusting for weight and quarter-mile time (Model 3), manual transmission vehicles average 2.94 MPG more than automatic transmission vehicles. However, the 95% CI (0.046 to 5.826) is wide and its lower bound sits close to zero, so the size of the difference is uncertain.
# Load data
library(datasets)
data(mtcars)
#head(mtcars, 5)
Here we look at the distribution of our two primary variables of interest, miles per gallon (mpg) and transmission type (am). Linear regression requires a continuous outcome and assumes approximately normal residuals; a roughly normal distribution of the outcome itself is a useful first check. In addition, we want to know whether there is a statistically discernible difference in mean mpg between automatic and manual transmission types.
From the plots, we can see that mpg appears approximately normally distributed and that mean mpg differs visibly between automatic and manual transmissions. Specifically, manual transmission is associated with a higher mean mpg. A t-test (see Appendix) indicates that the difference in group means is statistically significant.
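The plotting code is not shown in this report, so the following is only a minimal sketch of how the two views described above could be produced. The `transmission` factor and the plot titles are assumptions introduced for illustration, not the original code.

```r
# Exploratory sketch (assumed; the report's own plotting code is not shown)
data(mtcars)

# Hypothetical helper: label the 0/1 indicator so the output is readable
transmission <- factor(mtcars$am, levels = c(0, 1),
                       labels = c("automatic", "manual"))

# Mean mpg within each transmission group
group_means <- tapply(mtcars$mpg, transmission, mean)
print(group_means)

# Distribution of mpg, and mpg split by transmission type, side by side
par(mfrow = c(1, 2))
hist(mtcars$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")
boxplot(mtcars$mpg ~ transmission, main = "mpg by transmission",
        xlab = "Transmission", ylab = "Miles per gallon")
```

The gap between the two group means is the same quantity that the am coefficient in Model 1 estimates.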
In modeling, our interest lies in parsimonious, interpretable representations of the data. Model selection therefore follows a nested model strategy: Step (1), fit a model with only transmission type (am). Step (2), adjust model (1) with additional variables of interest, entered one at a time in subsequent models. Step (3), retain a variable if the nested-model F-test (ANOVA) is significant (p-value < 0.05). The initial choice of candidate variables is based on their univariate distributions (normal or approximately normal) and their correlations with one another; variables that are highly correlated with predictors already in the model are poor candidates for inclusion. See Appendix.
Model1 <- lm(mpg ~ am, data=mtcars)
summary(Model1)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## am 7.244939 1.764422 4.106127 2.850207e-04
Model2 <- lm(mpg ~ am + wt, data=mtcars)
summary(Model2)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.32155131 3.0546385 12.21799285 5.843477e-13
## am -0.02361522 1.5456453 -0.01527855 9.879146e-01
## wt -5.35281145 0.7882438 -6.79080719 1.867415e-07
Model3 <- lm(mpg ~ am + wt + qsec, data=mtcars)
summary(Model3)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.617781 6.9595930 1.381946 1.779152e-01
## am 2.935837 1.4109045 2.080819 4.671551e-02
## wt -3.916504 0.7112016 -5.506882 6.952711e-06
## qsec 1.225886 0.2886696 4.246676 2.161737e-04
Model4 <- lm(mpg ~ am + wt + qsec + disp, data=mtcars)
summary(Model4)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.442378425 8.25723071 0.7802105 0.4420535035
## am 3.310153929 1.51241278 2.1886577 0.0374482129
## wt -4.588282099 1.16677426 -3.9324506 0.0005290227
## qsec 1.416958261 0.39148853 3.6194119 0.0012000578
## disp 0.007689836 0.01053478 0.7299473 0.4717085406
anova(Model1, Model2, Model3, Model4)
The results of model selection show that Model 3 is the best parsimonious explanatory model. Model 2 shows evidence of omitted variable bias in the coefficient of am (i.e., notice the change in magnitude and sign relative to Models 1 and 3). Model 4 illustrates that including a variable that is highly correlated with another predictor inflates the standard errors of the other coefficients. In addition, the Model 3 diagnostics (see Appendix) show that the residuals are approximately normally distributed with constant variance. Note that the number of variables the model can support is limited by the sample size, n = 32 in this case. Adding more variables would result in over-fitting, that is, having too many terms for the number of observations.
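The comparison described above can be checked numerically. The sketch below refits the four models, pulls the p-values out of the `anova()` table, and compares standard errors between Models 3 and 4. It assumes that the highly correlated pair referred to is wt and disp; the object names `comparison`, `p_values`, `rho_wt_disp`, and `se_table` are illustrative only.

```r
data(mtcars)

# Refit the nested sequence so the block is self-contained
Model1 <- lm(mpg ~ am, data = mtcars)
Model2 <- lm(mpg ~ am + wt, data = mtcars)
Model3 <- lm(mpg ~ am + wt + qsec, data = mtcars)
Model4 <- lm(mpg ~ am + wt + qsec + disp, data = mtcars)

# Sequential F-tests: each row compares a model with the one before it
comparison <- anova(Model1, Model2, Model3, Model4)
p_values <- comparison[["Pr(>F)"]]
names(p_values) <- paste0("Model", 1:4)
print(p_values)  # first entry is NA because Model1 has no predecessor

# Assumption: wt and disp are the correlated pair behind the Model 4 remark
rho_wt_disp <- cor(mtcars$wt, mtcars$disp, method = "spearman")
print(rho_wt_disp)

# Standard errors of the shared coefficients, before and after adding disp
shared <- c("am", "wt", "qsec")
se_table <- cbind(
  Model3 = summary(Model3)$coef[shared, "Std. Error"],
  Model4 = summary(Model4)$coef[shared, "Std. Error"]
)
print(se_table)
```

A row with a p-value above 0.05 means the added variable did not significantly improve on the preceding model, which is the retention rule used in Step (3).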
mat <- mtcars[, c("mpg", "am", "qsec", "wt", "disp", "hp", "drat", "gear", "carb", "cyl", "vs")]
library(PerformanceAnalytics)
chart.Correlation(mat, method = "spearman", histogram=TRUE, pch=20)
##
## Welch Two Sample t-test
##
## data: mtcars$mpg and mtcars$am
## t = 18.413, df = 31.425, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 17.50519 21.86356
## sample estimates:
## mean of x mean of y
## 20.09062 0.40625
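Note that the `data:` line in the output above reads `mtcars$mpg and mtcars$am`, meaning that the test compares the mpg values with the 0/1 transmission indicator itself (mean 0.40625) rather than comparing mpg between the two transmission groups. A minimal sketch of the grouped comparison, using the formula interface of `t.test()`, could look like the following; the object name `tt` is illustrative.

```r
data(mtcars)

# Welch two-sample t-test of mpg between the levels of am
# (0 = automatic, 1 = manual)
tt <- t.test(mpg ~ am, data = mtcars)
print(tt)

# Group means and p-value, extracted from the test object
print(tt$estimate)
print(tt$p.value)
```

With this form, the two sample estimates are the mean mpg of automatic and of manual vehicles, which is the comparison the main text refers to.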
par(mfrow = c(2, 2))
plot(Model3)
confint(Model3)[2, ]  # 95% CI for the am coefficient in Model 3
## 2.5 % 97.5 %
## 0.04573031 5.82594408
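As a cross-check, the interval above can be rebuilt by hand as estimate ± t × standard error, where t is the 97.5th percentile of the t distribution on the residual degrees of freedom of Model 3. This is a sketch for illustration; the names `coefs`, `t_crit`, and `manual_ci` are not part of the original analysis.

```r
data(mtcars)
Model3 <- lm(mpg ~ am + wt + qsec, data = mtcars)

# Estimate and standard error of the am coefficient
coefs <- summary(Model3)$coef
estimate <- coefs["am", "Estimate"]
std_error <- coefs["am", "Std. Error"]

# Critical value: 32 observations minus 4 estimated coefficients = 28 df
t_crit <- qt(0.975, df = df.residual(Model3))

# Two-sided 95% interval built by hand
manual_ci <- c(lower = estimate - t_crit * std_error,
               upper = estimate + t_crit * std_error)
print(manual_ci)

# Same interval from confint(), selected by name rather than by position
print(confint(Model3)["am", ])
```

Selecting the row by name (`"am"`) rather than by index keeps the call correct if the order of terms in the model formula changes.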