
Executive Summary

This report analyzes quantitatively how miles per gallon (MPG) depends on the other variables in the mtcars dataset, a collection of cars published in the 1974 US magazine Motor Trend. The question addressed here is whether an automatic or a manual transmission is better for MPG.

Using multivariate linear regression and hypothesis testing, we identify the variables that have a significant impact on MPG. The resulting regression model is then validated by ANOVA, and its residuals are analyzed.

Exploratory Data Analysis

The mtcars data analyzed here originates from the 1974 Motor Trend magazine. It contains information about 32 cars (the observations) described by 11 variables: miles per (US) gallon (mpg), number of cylinders (cyl), displacement in cubic inches (disp), gross horsepower (hp), rear axle ratio (drat), weight in 1000 lbs (wt), quarter-mile time (qsec), engine shape (vs: 0 = V-shaped, 1 = straight), transmission (am: 0 = automatic, 1 = manual), number of forward gears (gear), and number of carburetors (carb).

Now let us take a look at the data by sorting the cars by mpg in descending order. Among the 7 cars with the best MPG, only one, the Merc 240D, has an automatic transmission (am = 0) (see Table 1).
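As a quick check of the description above, the dimensions and column names can be inspected directly. This is a minimal sketch using only base R and the built-in mtcars data; the names n_obs and n_vars are illustrative and not part of the original analysis.

```r
# Inspect the built-in mtcars data set (base R only, no extra packages)
data(mtcars)
n_obs <- nrow(mtcars)    # number of cars (observations)
n_vars <- ncol(mtcars)   # number of variables
cat("Observations:", n_obs, " Variables:", n_vars, "\n")
print(names(mtcars))
# Number of cars per transmission type: 0 = automatic, 1 = manual
print(table(mtcars$am))
```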

require(pacman); p_load(pander, knitr)
# Sort cars by MPG in descending order (indexing mtcars directly, so attach() is not needed)
cars_by_mpg <- mtcars[order(-mtcars$mpg), ]
pander(head(cars_by_mpg, 7), caption = "A table of the first 7 cars with best MPG.")

Table 1: A table of the first 7 cars with best MPG.
                 mpg   cyl  disp   hp   drat  wt     qsec   vs  am  gear  carb
Toyota Corolla   33.9  4    71.1   65   4.22  1.835  19.9   1   1   4     1
Fiat 128         32.4  4    78.7   66   4.08  2.2    19.47  1   1   4     1
Honda Civic      30.4  4    75.7   52   4.93  1.615  18.52  1   1   4     2
Lotus Europa     30.4  4    95.1   113  3.77  1.513  16.9   1   1   5     2
Fiat X1-9        27.3  4    79     66   4.08  1.935  18.9   1   1   4     1
Porsche 914-2    26    4    120.3  91   4.43  2.14   16.7   0   1   5     2
Merc 240D        24.4  4    146.7  62   3.69  3.19   20     1   0   4     2

Statistical Inference

Two sample t-test

To test whether MPG differs between automatic and manual transmissions, we run a Welch two-sample t-test (see Table 2). The p-value of 0.001374 is below \(\alpha\) = 0.05, so we reject the null hypothesis that there is no difference in mean MPG between automatic and manual transmissions. The test indicates that cars with manual transmission have a higher mean MPG than cars with automatic transmission: the 95% confidence interval for the difference (manual minus automatic) runs from 3.21 MPG to 11.28 MPG, and the difference in sample means is about 7 MPG. See the violin plot of MPG vs am in Figure 1 in the Appendix.

ttest <- t.test(mpg ~ am, data= mtcars)
tt <- data.frame( "t-statistic"=ttest$statistic, "df" = ttest$parameter,
"p-value"  = ttest$p.value, "cl-min" = ttest$conf.int[1], "cl-max" = ttest$conf.int[2],
"autom.mean" = ttest$estimate[1],"manual.mean" = ttest$estimate[2],row.names = "")
pander(tt, caption = "Two Sample t-test: automatic vs manual transmission")
Table 2: Two Sample t-test: automatic vs manual transmission
t.statistic  df     p.value   cl.min  cl.max  autom.mean  manual.mean
-3.767       18.33  0.001374  -11.28  -3.21   17.15       24.39
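
To make the test above less of a black box, the Welch statistic, its degrees of freedom, the p-value and the confidence interval can be computed by hand. This is a sketch using base R only; the variable names (auto, manual, v_auto, v_manual, df_welch) are illustrative, and the result should agree with the t.test output in Table 2.

```r
# Welch two-sample t-test computed by hand, for comparison with t.test()
auto <- mtcars$mpg[mtcars$am == 0]     # MPG of automatic cars
manual <- mtcars$mpg[mtcars$am == 1]   # MPG of manual cars

# Variance of each group's sample mean
v_auto <- var(auto) / length(auto)
v_manual <- var(manual) / length(manual)

# Standard error of the difference and the t-statistic (automatic minus manual)
se <- sqrt(v_auto + v_manual)
t_stat <- (mean(auto) - mean(manual)) / se

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (v_auto + v_manual)^2 /
  (v_auto^2 / (length(auto) - 1) + v_manual^2 / (length(manual) - 1))

# Two-sided p-value and 95% confidence interval for the difference in means
p_value <- 2 * pt(-abs(t_stat), df_welch)
ci <- (mean(auto) - mean(manual)) + c(-1, 1) * qt(0.975, df_welch) * se

print(c(t = t_stat, df = df_welch, p = p_value, lower = ci[1], upper = ci[2]))
```

The sign convention follows t.test(mpg ~ am): the difference is automatic minus manual, which is why the interval in Table 2 is negative.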

Regression Analysis

Simple Linear Model

We start with the simplest linear regression model, which uses transmission type as the only predictor and MPG as the outcome (see Table 3). This analysis confirms the result of the previous section: the am coefficient, 7.245, equals the difference in mean MPG between manual and automatic cars. From Table 4 we see that the adjusted R-squared value is 0.3385, so this model is poor and explains only about 34% of the variation in MPG. We therefore need to consider other variables as predictors as well.

model1 <- lm(mpg~am,data = mtcars); su1 <- summary(model1) 
df <- data.frame("R-squared" = su1$r.squared, "Adjusted R-squared" = su1$adj.r.squared)
pander(model1); pander(df, caption = "R-squared values for single model")
Table 3: Fitting linear model: mpg ~ am
             Estimate  Std. Error  t value  Pr(>|t|)
am           7.245     1.764       4.106    0.000285
(Intercept)  17.15     1.125       15.25    1.134e-15

Table 4: R-squared values for single model
R.squared  Adjusted.R.squared
0.3598     0.3385
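
With a single 0/1 predictor, the regression coefficients have a direct interpretation: the intercept is the mean MPG of the automatic group and the slope is the difference between the group means. The following sketch checks this with base R; fit and group_means are illustrative names.

```r
# Simple regression on a binary predictor reproduces the group means
fit <- lm(mpg ~ am, data = mtcars)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)  # names "0" and "1"

intercept <- unname(coef(fit)["(Intercept)"])
slope <- unname(coef(fit)["am"])
mean_auto <- unname(group_means["0"])
mean_manual <- unname(group_means["1"])

# Intercept vs mean of automatic cars; slope vs difference in means
print(c(intercept = intercept, automatic_mean = mean_auto))
print(c(slope = slope, mean_difference = mean_manual - mean_auto))
```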

Step-wise and best subset Models

To choose which variables have the greatest impact on MPG, we use the MASS library and run a step-wise model selection. The results are shown in Table 5 and Table 6. In addition to transmission type, the procedure selects the car's weight (wt) and quarter-mile time (qsec, an inverse measure of acceleration) as important variables. The adjusted R-squared value is now 0.8336, so the chosen model explains about 83% of the variation in MPG. Another automatic way to choose variables is the best subset method. Here we run it with a maximum of 4 variables. It returns a model with an even higher adjusted R-squared than the previous model, but we do not use it, since horsepower (hp) is highly correlated with weight (wt) and acceleration, and including it risks overfitting (see Appendix, Figure 2).

require(pacman); p_load(leaps,pander,knitr, MASS)
model2 <- stepAIC(lm(mpg ~ . ,data=mtcars), trace = FALSE); su2 <- summary(model2) 
df1 <- data.frame("R-squared" = su2$r.squared, "Adjusted R-squared" =su2$adj.r.squared)
pander(model2); pander(df1, "R-squared values for step-wise model")
Table 5: Fitting linear model: mpg ~ wt + qsec + am
             Estimate  Std. Error  t value  Pr(>|t|)
wt           -3.917    0.7112      -5.507   6.953e-06
qsec         1.226     0.2887      4.247    0.0002162
am           2.936     1.411       2.081    0.04672
(Intercept)  9.618     6.96        1.382    0.1779

Table 6: R-squared values for step-wise model
R.squared  Adjusted.R.squared
0.8497     0.8336
regmodel1 <- regsubsets(mpg ~ ., data = mtcars, nvmax = 4)
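
The reason given above for leaving hp out can be examined numerically. The sketch below prints the pairwise correlations between hp, wt and qsec, and computes a variance inflation factor (VIF) for hp by hand. The VIF is a common collinearity diagnostic that the report itself does not use; it is included here as an assumption about how one might quantify the concern, and the names predictors, r2_hp and vif_hp are illustrative.

```r
# Pairwise correlations between horsepower, weight and quarter-mile time
predictors <- mtcars[, c("hp", "wt", "qsec")]
print(round(cor(predictors), 2))

# Variance inflation factor for hp if it were added to mpg ~ wt + qsec + am:
# regress hp on the other predictors and use VIF = 1 / (1 - R^2)
r2_hp <- summary(lm(hp ~ wt + qsec + am, data = mtcars))$r.squared
vif_hp <- 1 / (1 - r2_hp)
print(vif_hp)
```

A VIF well above 1 means that much of the information in hp is already carried by the other predictors.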

Interactions between variables

Here we investigate whether there are interactions between the variables included in our model. Including interactions can possibly improve the regression model further. We can add an interaction term to the previous model to capture the different slopes and intercepts for automatic and manual transmissions. The most significant interaction term, between weight and transmission, originates from the fact that cars with manual transmission tend to weigh less than cars with automatic transmission. See Figure 3 in the Appendix, where MPG versus car weight is plotted for the two transmission types.
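
The idea of different slopes per transmission type can be illustrated with a reduced model, mpg ~ wt * am, which leaves out qsec for simplicity and is therefore not the report's final model. In this sketch (inter_fit, slope_auto and slope_manual are illustrative names) the wt coefficient is the slope for automatic cars, and adding the wt:am coefficient gives the slope for manual cars.

```r
# Mean weight (in 1000 lbs) per transmission type: 0 = automatic, 1 = manual
mean_wt <- tapply(mtcars$wt, mtcars$am, mean)
print(mean_wt)

# Reduced interaction model: separate intercept and slope for each transmission
inter_fit <- lm(mpg ~ wt * am, data = mtcars)
b <- coef(inter_fit)

slope_auto <- unname(b["wt"])                 # MPG change per 1000 lbs, automatic
slope_manual <- unname(b["wt"] + b["wt:am"])  # MPG change per 1000 lbs, manual
print(c(automatic = slope_auto, manual = slope_manual))
```

These two slopes coincide with the slopes of separate mpg ~ wt regressions fitted to each transmission group, which are the lines drawn in Figure 3.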

Final Model

To select the final model, we run an ANOVA test comparing the nested models.

Our final model includes the interaction term wt:am between weight and the binary transmission variable. As we can see, this substantially improves on the previous model: the adjusted R-squared value is now 0.88. The results in Table 7 give \(\text{mpg} = 9.723 + (14.079 - 4.141\,\text{wt})\,\text{am} + 1.017\,\text{qsec} - 2.937\,\text{wt}\), with wt in units of 1000 lb. Holding weight and quarter-mile time fixed, changing from automatic to manual transmission therefore changes MPG by (14.079 - 4.141 wt) on average, which is a gain for cars weighing less than 14.079/4.141 = 3.4, i.e. about 3400 lb, and a loss for heavier cars.

model_final<-lm(mpg ~ wt + qsec + am + wt:am, data=mtcars); su <- summary(model_final) 
ant <- anova(model1,model2,model_final)
dff <- data.frame("R-squared" = su$r.squared, "Adjusted R-squared" = su$adj.r.squared)
pander(model_final); pander(dff, caption = "R-squared values for final model")
Table 7: Fitting linear model: mpg ~ wt + qsec + am + wt:am
             Estimate  Std. Error  t value  Pr(>|t|)
wt           -2.937    0.666       -4.409   0.0001489
qsec         1.017     0.252       4.035    0.000403
am           14.08     3.435       4.099    0.0003409
wt:am        -4.141    1.197       -3.46    0.001809
(Intercept)  9.723     5.899       1.648    0.1109

Table 8: R-squared values for final model
R.squared  Adjusted.R.squared
0.8959     0.8804
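
The weight-dependent effect of the transmission can be worked out directly from the fitted coefficients. The following sketch refits the final model with base R, evaluates the estimated manual-minus-automatic difference at a few example weights (2, 3 and 4 thousand lb, chosen only for illustration), computes the break-even weight, and prints the nested-model ANOVA that the report runs but does not display. The names final_fit, manual_effect and break_even are illustrative.

```r
# Refit the final model and extract its coefficients
final_fit <- lm(mpg ~ wt + qsec + am + wt:am, data = mtcars)
b <- coef(final_fit)

# Estimated MPG difference (manual minus automatic) at a given weight,
# holding weight and quarter-mile time fixed
manual_effect <- function(wt) unname(b["am"]) + unname(b["wt:am"]) * wt

weights <- c(2, 3, 4)  # example weights in 1000 lbs
print(data.frame(wt = weights, manual_minus_automatic = manual_effect(weights)))

# Weight at which the estimated difference is zero
break_even <- -unname(b["am"]) / unname(b["wt:am"])
print(break_even)

# Sequential F-tests for the three nested models used in the report
print(anova(lm(mpg ~ am, data = mtcars),
            lm(mpg ~ wt + qsec + am, data = mtcars),
            final_fit))
```

Below the break-even weight the estimated effect of a manual transmission is positive; above it the estimate turns negative.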

Residual Diagnostics

See Appendix, Figure 4. Looking at the residual plots, we can state the following:

  • The Residuals vs. Fitted plot shows a mainly random pattern, which is consistent with the independence assumption.
  • The Normal Q-Q plot shows approximately normally distributed residuals.
  • The Scale-Location plot shows randomly scattered points, which supports the assumption of constant variance.
  • The Residuals vs. Leverage plot shows no points outside the 0.5 Cook's distance bands.

Dfbetas measure how much the coefficients change when a point is deleted. The largest dfbeta and the largest hat value, computed below, give no indication of unduly influential points, so our model appears to meet the requirements of linear regression.

c(max(dfbetas(model_final)), max(hatvalues(model_final)))
## [1] 0.9769875 0.3742640
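
Note that max(dfbetas(...)) only looks at the largest positive value, while a large negative dfbeta is equally relevant. A slightly more careful check, sketched below with base R, uses absolute values and lists the cars whose hat value exceeds 2p/n. That threshold is a common rule of thumb and an assumption here, not something the report applies; final_fit, max_abs_dfbeta and high_leverage are illustrative names.

```r
# Influence measures for the final model
final_fit <- lm(mpg ~ wt + qsec + am + wt:am, data = mtcars)
db <- dfbetas(final_fit)    # standardized change in each coefficient per deleted car
h <- hatvalues(final_fit)   # leverage of each car

# Largest standardized coefficient change in absolute value
max_abs_dfbeta <- max(abs(db))
print(max_abs_dfbeta)

# Cars whose leverage exceeds the rule-of-thumb threshold 2 * p / n
p <- length(coef(final_fit))  # number of estimated coefficients
n <- nrow(mtcars)             # number of observations
high_leverage <- names(h)[h > 2 * p / n]
print(high_leverage)
```

Any car listed here is worth inspecting individually before concluding that no observation is influential.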

Appendix

library(pacman); p_load(ggplot2, GGally, scales, psych)
ggplot(mtcars, aes(y=mpg, x=factor(am, labels = c("Automatic","Manual")), 
fill=factor(am)))+geom_violin(size=0.5,trim=FALSE)+scale_fill_brewer(palette="Accent")+
xlab("Transmission Type") + ylab("Miles per Gallon (MPG)") + 
geom_boxplot(width=0.15, color="red", size=0.5) + 
geom_jitter(shape=16, position=position_jitter(0.4)) + theme(legend.position="none")
Figure 1: Violin plot of MPG vs transmission type.

my_fn <- function(data, mapping, method="lm", ...){
p <- ggplot(data = data, mapping = mapping) + geom_point()  +
geom_smooth(method=method, ...) 
p}
pairs.panels(mtcars[, c(1,4,6,7,9)], ellipses = FALSE, breaks = 8)
Figure 2: Pair Correlations and Histograms Panel Graph

# Fit separate mpg ~ wt lines for automatic (am == 0) and manual (am == 1) cars
auto.fit <- lm(mpg ~ wt, data = mtcars[mtcars$am == 0, ])
man.fit <- lm(mpg ~ wt, data = mtcars[mtcars$am == 1, ])
# Colour 4 (blue) marks automatic cars, colour 2 (red) marks manual cars
plot(mpg ~ wt, data = mtcars, pch = 19, col = (am == 0) * 2 + 2, xlab = "Weight (1000 lbs)",
ylab = "Miles per Gallon (MPG)"); abline(auto.fit, lwd = 3, col = "blue")
abline(man.fit, lwd = 3, col = "red")
legend(4.4, 34, c("Automatic", "Manual"), lty = c(1, 1), lwd = c(3, 3), col = c("blue", "red"))
Figure 3: MPG versus Weight by Transmission Type

par(mfrow = c(2, 2)); plot(model_final)
Figure 4: Residuals and Diagnostics Plot