The goal of this report is a study of the effect of transmission (automatic or manual) on car fuel consumptionn. The mtcars
dataset is used to answer questions like:
Conclusion: From the best-fit model, one can conclude that:
The analyzed sample contained only 32 observations, which may have additional impact on statistical inference.
The mtcars
is a data frame with 32 observations gathered for 11 variables:
mpg - miles/(US) gallon,
cyl - number of cylinders,
disp - displacement (cu.in.),
hp - gross horsepower,
drat - rear axle ratio, wt - weight (1000 lbs),
qsec - 1/4 mile time,
vs - V/S,
am - transmission (0 = automatic, 1 = manual),
gear - number of forward gears,
carb - number of carburetors.
Exploratory panel plot for these variables is presented in Appendix.
Some variables were converted to factors for further analysis.
mtcars$cyl <- factor(mtcars$cyl)
mtcars$gear <- factor(mtcars$gear)
mtcars$am <- factor(mtcars$am,labels=c('_automatic','_manual'))
mtcars$carb <- factor(mtcars$carb)
mtcars$vs <- factor(mtcars$vs)
From this test we clearly see that there is a significant (98% CL) difference in fuel consumption between automatic and manual cars, with better mileage for manual cars.
t.test(mpg ~ am, data=mtcars)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group _automatic mean in group _manual
## 17.14737 24.39231
We can compare models with different variable input. From the exploratory panel plot it looks like there are correlated variables.
Coefficients in all-variable model have p-values far from significant. The adjusted (for the numbers of predictors) R\(^{2}\) = 0.78 implying the model has fairly good explanatory power of the regression model.
On the other hand, the coefficients for transmission-only model are significantly different, but its R\(^{2}\) = 0.34 is fairly low.
fit_allvars <- lm(mpg ~ ., data = mtcars)
fit_am <- lm(mpg ~ am, data = mtcars)
summary(fit_allvars)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.87913244 20.06582026 1.19004018 0.25252548
## cyl6 -2.64869528 3.04089041 -0.87102622 0.39746642
## cyl8 -0.33616298 7.15953951 -0.04695316 0.96317000
## disp 0.03554632 0.03189920 1.11433290 0.28267339
## hp -0.07050683 0.03942556 -1.78835344 0.09393155
## drat 1.18283018 2.48348458 0.47627845 0.64073922
## wt -4.52977584 2.53874584 -1.78425732 0.09461859
## qsec 0.36784482 0.93539569 0.39325050 0.69966720
## vs1 1.93085054 2.87125777 0.67247551 0.51150791
## am_manual 1.21211570 3.21354514 0.37718957 0.71131573
## gear4 1.11435494 3.79951726 0.29328856 0.77332027
## gear5 2.52839599 3.73635801 0.67670068 0.50889747
## carb2 -0.97935432 2.31797446 -0.42250436 0.67865093
## carb3 2.99963875 4.29354611 0.69863900 0.49546781
## carb4 1.09142288 4.44961992 0.24528452 0.80956031
## carb6 4.47756921 6.38406242 0.70136677 0.49381268
## carb8 7.25041126 8.36056638 0.86721532 0.39948495
summary(fit_am)$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## am_manual 7.244939 1.764422 4.106127 2.850207e-04
To find the best model to fit the data (with best variable selection), a step
function is used. For details on this function, type ?step
in R.
The step function calculates Akaike’s ‘An Information Criterion’ (AIC) for one or several fitted model objects for which a log-likelihood value can be obtained, according to the formula \(-2*log-likelihood + k*npar\), where npar represents the number of parameters in the fitted model, and k=2 for the usual AIC, or *k=log(n)$ (n being the number of observations) for the so-called BIC or SBC (Schwarz’s Bayesian criterion).
When comparing models fitted by maximum likelihood to the same data, the smaller the AIC or BIC, the better the fit.
The theory of AIC requires that the log-likelihood has been maximized: whereas AIC can be computed for models not fitted by maximum likelihood, their AIC values should not be compared.
(Above information taken from https://stat.ethz.ch/R-manual/R-devel/library/stats/html/AIC.html)
#trace=FALSE supresses elaborate output
aic <- step(fit_allvars) ## model with the best fit
## Start: AIC=76.4
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - carb 5 13.5989 134.00 69.828
## - gear 2 3.9729 124.38 73.442
## - am 1 1.1420 121.55 74.705
## - qsec 1 1.2413 121.64 74.732
## - drat 1 1.8208 122.22 74.884
## - cyl 2 10.9314 131.33 75.184
## - vs 1 3.6299 124.03 75.354
## <none> 120.40 76.403
## - disp 1 9.9672 130.37 76.948
## - wt 1 25.5541 145.96 80.562
## - hp 1 25.6715 146.07 80.588
##
## Step: AIC=69.83
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear
##
## Df Sum of Sq RSS AIC
## - gear 2 5.0215 139.02 67.005
## - disp 1 0.9934 135.00 68.064
## - drat 1 1.1854 135.19 68.110
## - vs 1 3.6763 137.68 68.694
## - cyl 2 12.5642 146.57 68.696
## - qsec 1 5.2634 139.26 69.061
## <none> 134.00 69.828
## - am 1 11.9255 145.93 70.556
## - wt 1 19.7963 153.80 72.237
## - hp 1 22.7935 156.79 72.855
##
## Step: AIC=67
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am
##
## Df Sum of Sq RSS AIC
## - drat 1 0.9672 139.99 65.227
## - cyl 2 10.4247 149.45 65.319
## - disp 1 1.5483 140.57 65.359
## - vs 1 2.1829 141.21 65.503
## - qsec 1 3.6324 142.66 65.830
## <none> 139.02 67.005
## - am 1 16.5665 155.59 68.608
## - hp 1 18.1768 157.20 68.937
## - wt 1 31.1896 170.21 71.482
##
## Step: AIC=65.23
## mpg ~ cyl + disp + hp + wt + qsec + vs + am
##
## Df Sum of Sq RSS AIC
## - disp 1 1.2474 141.24 63.511
## - vs 1 2.3403 142.33 63.757
## - cyl 2 12.3267 152.32 63.927
## - qsec 1 3.1000 143.09 63.928
## <none> 139.99 65.227
## - hp 1 17.7382 157.73 67.044
## - am 1 19.4660 159.46 67.393
## - wt 1 30.7151 170.71 69.574
##
## Step: AIC=63.51
## mpg ~ cyl + hp + wt + qsec + vs + am
##
## Df Sum of Sq RSS AIC
## - qsec 1 2.442 143.68 62.059
## - vs 1 2.744 143.98 62.126
## - cyl 2 18.580 159.82 63.466
## <none> 141.24 63.511
## - hp 1 18.184 159.42 65.386
## - am 1 18.885 160.12 65.527
## - wt 1 39.645 180.88 69.428
##
## Step: AIC=62.06
## mpg ~ cyl + hp + wt + vs + am
##
## Df Sum of Sq RSS AIC
## - vs 1 7.346 151.03 61.655
## <none> 143.68 62.059
## - cyl 2 25.284 168.96 63.246
## - am 1 16.443 160.12 63.527
## - hp 1 36.344 180.02 67.275
## - wt 1 41.088 184.77 68.108
##
## Step: AIC=61.65
## mpg ~ cyl + hp + wt + am
##
## Df Sum of Sq RSS AIC
## <none> 151.03 61.655
## - am 1 9.752 160.78 61.657
## - cyl 2 29.265 180.29 63.323
## - hp 1 31.943 182.97 65.794
## - wt 1 46.173 197.20 68.191
summary(aic)
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9387 -1.2560 -0.4013 1.1253 5.0513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832 2.60489 12.940 7.73e-13 ***
## cyl6 -3.03134 1.40728 -2.154 0.04068 *
## cyl8 -2.16368 2.28425 -0.947 0.35225
## hp -0.03211 0.01369 -2.345 0.02693 *
## wt -2.49683 0.88559 -2.819 0.00908 **
## am_manual 1.80921 1.39630 1.296 0.20646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
## F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
The step function returns formula of mpg ~ cyl + hp + wt + am. Let’s call it best-fit model.
The best-fit model coefficients p-values suggests significance of the model with cylinders (cyl), horse power (hp) and weight (wt) significant at 95% level. However, the transmission (am) is not significant. The R\(^{2}\) = 0.84 (adjusted value) shows that the model explains well the variance.
By comparing the best-fit and the transmission-only model one can see that the added variables contribute significantly to the model.
anova(fit_am,aic)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ cyl + hp + wt + am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 26 151.03 4 569.87 24.527 1.688e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow = c(2, 2))
plot(aic)
boxplot(mpg ~ am, data=mtcars, col=(c("red","steelblue")),
xlab="Transmission", ylab="MPG", main="MPG vs. Transmission")
boxplot(mpg ~ cyl, data=mtcars, outpch = 19, col=(c("steelblue", "green", "yellow")),
ylab="MPG", xlab="number of cylinders", main="Mileage By Cylinder")