This report analyzes how miles per gallon (mpg) relates to other vehicle variables, in particular the effect of transmission type (automatic versus manual) on a vehicle's mpg. The data reviewed is the mtcars dataset, available in R's built-in datasets package.
The two main questions the analysis is attempting to answer are:
We conclude that transmission type has a statistically significant effect on the mpg of a vehicle. However, the p-value for the am coefficient, 0.0467, is very close to the preset 5% threshold, above which we would fail to reject the null hypothesis that no difference in mpg exists. The final model (fit_3) is also significant overall, with a p-value of 1.21e-11. The model has an adjusted R-squared of 0.8336 (multiple R-squared 0.8497), suggesting that roughly 83% of the variance in mpg is explained by the model.
Given that the type of transmission does materially affect the mpg of a car, we estimate that a manual transmission adds 2.94 (+/- 2.8) miles per gallon on average compared to an automatic transmission, holding weight and quarter-mile time constant, with a 95% confidence interval of roughly 0.12 to 5.74 mpg.
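The interval quoted above can be checked directly from the fitted model. The following is a minimal sketch, assuming that `data` in this report is the built-in mtcars data frame. Note that `confint` uses the exact t quantile for 28 residual degrees of freedom, so its bounds may differ slightly from the rounded figures in the text.

```r
# Assumption: `data` in this report is the built-in mtcars data frame.
data <- mtcars
fit_3 <- lm(mpg ~ factor(am) + wt + qsec, data = data)

# Point estimate for the manual-transmission coefficient
am_estimate <- coef(fit_3)["factor(am)1"]

# 95% confidence interval based on the t distribution of the estimate
am_ci <- confint(fit_3, parm = "factor(am)1", level = 0.95)

am_estimate
am_ci
```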
The analysis for this report used a multivariable linear regression model. The outcome term \(Y_i\) is the mpg value for vehicle \(i\), \(\beta_0\) is the intercept, and \(\beta_1, \ldots, \beta_p\) are the coefficients for the predictor terms \(X_{1i}, \ldots, X_{pi}\). Candidate sets of predictors were evaluated for best model fit (minimizing residual error and maximizing R-squared) while checking statistical significance through the p-values of the model and its terms.
\[ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_{p} X_{pi} + \epsilon_{i} \]
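Written out for the final model fit_3, whose formula is mpg ~ factor(am) + wt + qsec, the regression and its fitted version (using the coefficient estimates reported in the appendix) are as follows. Here \(\mathrm{am}_i\) is an indicator equal to 1 for a manual transmission and 0 for an automatic one.

\[ \mathrm{mpg}_i = \beta_0 + \beta_1 \, \mathrm{am}_i + \beta_2 \, \mathrm{wt}_i + \beta_3 \, \mathrm{qsec}_i + \epsilon_i \]

\[ \widehat{\mathrm{mpg}}_i = 9.6178 + 2.9358 \, \mathrm{am}_i - 3.9165 \, \mathrm{wt}_i + 1.2259 \, \mathrm{qsec}_i \]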
Review of the data set revealed some important characteristics to be aware of throughout the analysis. First, there exist several points that can be considered outliers from the main mass of data points. For automatic vehicles these outliers (7 of 32 data points) are very low mpg (~10 mpg) vehicles, and for manual cars they are very high mpg (~30+ mpg) vehicles. Next, manual cars are much lighter (by over 1000 lbs on average) than automatic vehicles. The automatic group also contains more 8-cylinder cars (12 of 19 versus 2 of 13). Before examining correlation, one must be cautious: the additional weight and higher cylinder counts in the automatic sample, rather than the transmission itself, may be a main driver of the mpg difference.
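The group differences described above can be reproduced with a few base R summaries. This is a sketch that assumes `data` is the built-in mtcars data frame, in which wt is recorded in units of 1000 lbs and am is 0 for automatic and 1 for manual.

```r
# Assumption: `data` in this report is the built-in mtcars data frame.
data <- mtcars

# Mean mpg and weight (wt is in 1000 lbs) by transmission type
group_means <- aggregate(cbind(mpg, wt) ~ am, data = data, FUN = mean)

# Cylinder counts by transmission type (rows: am, columns: cyl)
cyl_by_am <- table(am = data$am, cyl = data$cyl)

group_means
cyl_by_am
```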
Please see appendix for additional information.
Because correlation among variables in the data set can have a significant impact on the model, variable correlation was examined. Of the variables considered likely to drive mpg (wt, cyl, hp, qsec), most are highly correlated with one another (0.66 and greater). Please see the appendix for specifics.
Building the multivariate model began with a model including all the variables (fit_all) and a model with only transmission type (am) for reference (fit_1). Additional variables were then added systematically based on their theoretical effect on mpg, while attempting to minimize correlation among the added variables. For example, once wt was in the model, disp was not considered because of its strong correlation with wt (0.888). Variables were iteratively added and removed based on the amount of explained variance and on significance, among other factors.
Models were then compared using an ANOVA test. The model fit_3, containing am, wt, and qsec, was determined to be the best model based on statistical significance (p-value 0.000214 for the addition of qsec in the ANOVA table) and variance inflation factors (VIF). In earlier iterations of the model, am was not significant (p-value > 5%). Given the high correlation of the other variables with wt, we chose to add qsec, a variable that is negatively correlated with wt and has a plausible effect on mpg. Additionally, plots of the residuals were considered. Please see the Appendix for model diagnostics and residual plots.
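A correlation matrix for the candidate variables can be produced as sketched below, assuming `data` is the built-in mtcars data frame. The variable list (including disp, which is discussed in the model-building step) is chosen here for illustration.

```r
# Assumption: `data` in this report is the built-in mtcars data frame.
data <- mtcars

# Pairwise Pearson correlations among mpg and the candidate predictors
vars <- c("mpg", "wt", "cyl", "hp", "qsec", "disp")
cor_matrix <- round(cor(data[, vars]), 3)

cor_matrix
```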
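The report does not show how the variance inflation factors were obtained. As one possible approach, the sketch below computes them by hand in base R as 1 / (1 - R²), where R² comes from regressing each predictor on the others. It assumes `data` is the built-in mtcars data frame and treats am as a numeric 0/1 variable, which for a two-level factor should give the same VIF.

```r
# Assumption: `data` in this report is the built-in mtcars data frame.
data <- mtcars
predictors <- c("am", "wt", "qsec")  # predictors of fit_3

# VIF for each predictor: regress it on the remaining predictors
vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  aux_fit <- lm(reformulate(others, response = p), data = data)
  1 / (1 - summary(aux_fit)$r.squared)
})

vif_manual
```

Values close to 1 indicate little collinearity; a common rule of thumb treats values above 5 or 10 as a concern.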
The Residuals vs Fitted plot shows no noticeable pattern, so we can be less concerned that the residuals lack independence or violate our distributional assumptions.
The QQ plot shows a near-linear relationship between the standardized residuals and the theoretical quantiles, suggesting that the residuals are approximately normally distributed.
The Residuals vs Leverage plot gives us confidence that the previously identified outlier points are not exerting significant leverage or influence on the fit. The Cook's distances are all less than 1.
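The three diagnostic plots discussed above, and the Cook's distance check, can be generated as in the following sketch. It assumes `data` is the built-in mtcars data frame; the panel layout is an arbitrary choice.

```r
# Assumption: `data` in this report is the built-in mtcars data frame.
data <- mtcars
fit_3 <- lm(mpg ~ factor(am) + wt + qsec, data = data)

# which = 1: Residuals vs Fitted, 2: Normal Q-Q, 5: Residuals vs Leverage
op <- par(mfrow = c(1, 3))
plot(fit_3, which = c(1, 2, 5))
par(op)

# Largest Cook's distance across all observations
max_cook <- max(cooks.distance(fit_3))
max_cook
```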
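Assuming the mpg values are approximately normally distributed, we tested the hypothesis that the two transmission types have equal mean mpg.
H0: Mean(Manual) - Mean(Auto) = 0 (there is no difference in means)
H1: Mean(Manual) - Mean(Auto) > 0 (mpg is greater when the transmission type is manual)
We reject the null hypothesis that there is no difference in the means and accept the alternative hypothesis that mpg is greater for manual transmissions. Note that t.test with a formula computes the difference as group 0 (automatic) minus group 1 (manual), so the call below, which uses alternative = "greater", tests the opposite direction. Its p-value of 0.9993 is therefore the complement of the one-sided p-value relevant to H1, which is approximately 0.0007.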
test <- t.test(mpg ~ factor(am), data = data, alternative = "greater", conf.level = 0.95)
test
##
## Welch Two Sample t-test
##
## data: mpg by factor(am)
## t = -3.7671, df = 18.332, p-value = 0.9993
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -10.57662 Inf
## sample estimates:
## mean in group 0 mean in group 1
## 17.14737 24.39231
test$p.value
## [1] 0.9993132
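Because the formula interface takes the difference as group 0 (automatic) minus group 1 (manual), the hypothesis that manual cars have the higher mean corresponds to a negative difference. A sketch of the test in that direction, assuming `data` is the built-in mtcars data frame, is shown below. The object name `test_manual_greater` is introduced here for illustration only.

```r
# Assumption: `data` in this report is the built-in mtcars data frame.
data <- mtcars

# H1: mean(automatic) - mean(manual) < 0, i.e. manual mpg is greater
test_manual_greater <- t.test(mpg ~ factor(am), data = data,
                              alternative = "less", conf.level = 0.95)

test_manual_greater$p.value
```

The resulting p-value is the complement of the 0.9993132 reported above and falls well below the 5% threshold.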
fit_all <- lm(mpg ~ ., data)
fit_1 <- lm(mpg ~ factor(am), data)
fit_2 <- lm(mpg ~ factor(am) + wt, data)
fit_3 <- lm(mpg ~ factor(am) + wt + qsec, data)
fit_4 <- lm(mpg ~ factor(am) + wt + qsec + hp, data)
fit_5 <- lm(mpg ~ factor(am) + wt + qsec + hp + disp, data)
summary(fit_2)
##
## Call:
## lm(formula = mpg ~ factor(am) + wt, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.5295 -2.3619 -0.1317  1.4025  6.8782
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.32155    3.05464  12.218 5.84e-13 ***
## factor(am)1 -0.02362    1.54565  -0.015    0.988
## wt          -5.35281    0.78824  -6.791 1.87e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.098 on 29 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7358
## F-statistic: 44.17 on 2 and 29 DF, p-value: 1.579e-09
summary(fit_3)
##
## Call:
## lm(formula = mpg ~ factor(am) + wt + qsec, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.4811 -1.5555 -0.7257  1.4110  4.6610
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   9.6178     6.9596   1.382 0.177915
## factor(am)1   2.9358     1.4109   2.081 0.046716 *
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
anova(fit_1, fit_2, fit_3, fit_4, fit_5)
## Analysis of Variance Table
##
## Model 1: mpg ~ factor(am)
## Model 2: mpg ~ factor(am) + wt
## Model 3: mpg ~ factor(am) + wt + qsec
## Model 4: mpg ~ factor(am) + wt + qsec + hp
## Model 5: mpg ~ factor(am) + wt + qsec + hp + disp
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)
## 1     30 720.90
## 2     29 278.32  1    442.58 74.9946 3.877e-09 ***
## 3     28 169.29  1    109.03 18.4757  0.000214 ***
## 4     27 160.07  1      9.22  1.5622  0.222472
## 5     26 153.44  1      6.63  1.1232  0.298972
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1