Executive Summary

Working for Motor Trend, a magazine about the automobile industry, our task was to explore the relationship between a set of variables and miles per gallon (MPG) (outcome). Our client was particularly interested in the following two questions:

"Is an automatic or manual transmission better for MPG"
"Quantify the MPG difference between automatic and manual transmissions"

Preliminary analyses showed an effect of transmission type on MPG. The presence of a manual transmisson led to an average increase of roughly 7 miles per gallon. However, when we included further predictors in the model, the effect shrank to about 2 miles per gallon and, moreover, turned out to be non-significant. We propose to reproduce the calculations on a bigger data set in the future so as to investigate further on the topic.

Loading data, libraries and getting an overview of the variables

library(PerformanceAnalytics); library(ggplot2); library(gridExtra);library(car); library(stargazer); data(mtcars)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Looking at the dataset, it gets clear that we’d better treat the variables cylinders (cyl), gear (gear), and transmission (am) as factor variables, as they do not dispose of the characterstics of a numeric variable. However, we wait doing so until we make some plots, that call for this mutation.

Exploratory Analysis

mtcars$am <- as.factor(mtcars$am)
ggplot(mtcars, aes(x = am, y = mpg, fill=am)) + geom_boxplot()

stargazer(subset(mtcars[c("mpg")], mtcars$am==0), title="Automatic transmission", type="text", digits=1, out="table1.txt")
## 
## Automatic transmission
## ====================================
## Statistic N  Mean St. Dev. Min  Max 
## ------------------------------------
## mpg       19 17.1   3.8    10.4 24.4
## ------------------------------------
stargazer(subset(mtcars[c("mpg")], mtcars$am==1), title="Manual transmission", type="text", digits=1, out="table2.txt")
## 
## Manual transmission
## ====================================
## Statistic N  Mean St. Dev. Min  Max 
## ------------------------------------
## mpg       13 24.4   6.2    15.0 33.9
## ------------------------------------

Having a first look at the relationship between MPG and AM, we are intrigued to claim, that there is, indeed, an obvious relatonship between the two variables. A manual transmission seems to have a positive impact on MPG: The average MPG of the cars with a manual transmission is at approximately 24.39, whereas the average MPG for cars with an automatic transmission is at approximately 17.14 (note, that the boxplots show the median, and not the mean; hence the slight deviations between the plot and the table regarding the measure of central tendency). Moreover, the fact that the two boxes (that comprise 50% of the observations per group) do not overlap each other, is a strong indicator for little variation which adds to the preliminary conclusion that there is substantial difference between the two groups as regards MPG. However, we will have another look at the data in the following.

mtcars$am <- as.numeric(mtcars$am)
chart.Correlation(mtcars)

After getting an idea about the correlations between mpg and some further variables in the dataset, it gets clear that we should take these variables into account, as well, when specifying the model. Some more plots illustrate the relationship between mpg, am, and other variables quite well. In this case, we focus on the four variables that correlate the most with our outcome variable MPG, that is: horse power(r = -0.78), displacement(r = -0.85), weight(r = -0.87), and cylinders(r = -0.85). Note that AM appears in all of the plots as a third variable, hence giving us an idea how AM relates to other predictors we shall use in our analysis.

levels(mtcars$am) <- c("Automatic", "Manual")
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$am <- as.factor(mtcars$am)
a <- ggplot(mtcars, aes(x = hp, y = mpg, colour = factor(am))) + geom_point()+ stat_smooth(method="lm", formula= y ~ x, se=FALSE)
b <- ggplot(mtcars, aes(x = disp, y = mpg, colour = factor(am))) + geom_point()+ stat_smooth(method="lm", formula= y ~ x, se=FALSE)
c <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(am))) + geom_point()+ stat_smooth(method="lm", formula= y ~ x, se=FALSE)
d <- ggplot(mtcars, aes(x = cyl, y = mpg, colour = factor(am))) + geom_point()
grid.arrange(a, b, c, d, ncol=2)

The plots show clear relationships between the predictor variables on the x-axes and the outcome variable MPG on the y-axis. All of them are negative relationships, where high values of MPG correlate with low values in the predictor variables, and vice versa. Most interestingly, the plots show more or less pronounced relationship between AM and the other predictors. This is most obvious in the case of WT and DISP, but also in CYL and to a lesser extent in HP. It seems that an automatic transmission correlates with high numbers in displacement, weight and cylinders. In case of HP, there seem to be two observations (the two rightmost blue points) that might have some leverage effect on the blue regression line. Without these two observations, there would be a stronger relationship between an automatic transmission and high numbers of HP.

Specifiying the model

Instead of throwing all variables of the dataset into our model, at once, we propose a different approach by building our model in a stepwise fashion. We certainly do not want to underfit our model (by omitting relevant variables out of the model); however, by slowing down the process of model fitting, we want to make sure, that we do not put too many variables into our model. Out of 10 potential predictors in the dataset, we have already inspected 6 in the plots shown above; these variables will be added to the model one after the other. After having specified the models, these will be assessed via anova. In the end, we will check the assumptions that have to be met, so we can claim our final model a valuable one. As our client is especally interested in the relationship between fuel consumtion (mpg) and type of transmission (am), our first model will comprise only these two variables.

fit1 <- lm(mpg ~ am, data = mtcars) 
summary(fit1)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## am2            7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

Using this simple regression model, we come to the conclusion, that the effect of AM on MPG explains about 36% of the variation in the outcome variable (see R-squared). The F-statistic is significant (p=0.0002) which tells us that using AM as a predictor adds valuable information to the model. As regards, the coefficient of AM, we can see that there is an increase of 7.245 miles per gallon, if the car has a manual transmission. Note, that this value is the same value as the difference between the two means we calculated for the two AM groups via the aggregate function! Now, we move on to input one more variable. One possible approach is to look at the relevant variables and choosing them out of substantial considerations. Here, we claim, that the weight of the car should have a huge impact on MPG, as well, so WT is the next variable to be included in our model. This approach can be called a hierarchical regression. After checking how the coeffcients change once we have inserted another predictor into the model, we want to do nested model testing using ANOVA, which will tell us, if the models add up significantly to explain the variance of the outcome variable. As the ANOVA is based on the assumption that the model residuals are equal, we apply a Shapir-Wilk Test after each ANOVA.

fit2 <- lm(mpg ~ am + wt, data=mtcars)
fit3 <- lm(mpg ~ am + wt + hp, data=mtcars)
fit4 <- lm(mpg ~ am + wt + hp + disp, data=mtcars)
fit5 <- lm(mpg ~ am + wt + hp + disp + cyl, data=mtcars)
stargazer(fit2, fit3, fit4, fit5, type="text", intercept.bottom=TRUE, dep.var.labels=c("Miles per gallon"), covariate.labels=c("Transmisson type (manual=1)", "Weight", "Horse Power", "Displacement", "Cylinders(6)", "Cylinders(8)"), out="models.txt")
## 
## =======================================================================================================================
##                                                                 Dependent variable:                                    
##                             -------------------------------------------------------------------------------------------
##                                                                  Miles per gallon                                      
##                                      (1)                    (2)                    (3)                    (4)          
## -----------------------------------------------------------------------------------------------------------------------
## Transmisson type (manual=1)         -0.024                 2.084                  2.159                  1.806         
##                                    (1.546)                (1.376)                (1.435)                (1.421)        
##                                                                                                                        
## Weight                            -5.353***              -2.879***               -3.047**               -2.739**       
##                                    (0.788)                (0.905)                (1.157)                (1.176)        
##                                                                                                                        
## Horse Power                                              -0.037***              -0.039***               -0.032**       
##                                                           (0.010)                (0.012)                (0.014)        
##                                                                                                                        
## Displacement                                                                      0.002                  0.004         
##                                                                                  (0.010)                (0.013)        
##                                                                                                                        
## Cylinders(6)                                                                                            -3.136**       
##                                                                                                         (1.469)        
##                                                                                                                        
## Cylinders(8)                                                                                             -2.718        
##                                                                                                         (2.898)        
##                                                                                                                        
## Constant                          37.322***              34.003***              34.209***              33.864***       
##                                    (3.055)                (2.643)                (2.823)                (2.695)        
##                                                                                                                        
## -----------------------------------------------------------------------------------------------------------------------
## Observations                          32                     32                     32                     32          
## R2                                  0.753                  0.840                  0.840                  0.866         
## Adjusted R2                         0.736                  0.823                  0.817                  0.834         
## Residual Std. Error            3.098 (df = 29)        2.538 (df = 28)        2.581 (df = 27)        2.453 (df = 25)    
## F Statistic                 44.165*** (df = 2; 29) 48.960*** (df = 3; 28) 35.498*** (df = 4; 27) 27.027*** (df = 6; 25)
## =======================================================================================================================
## Note:                                                                                       *p<0.1; **p<0.05; ***p<0.01
fit2 <- update(fit1, mpg ~ am + wt, data=mtcars)
fit3 <- update(fit2, mpg ~ am + wt + hp, data=mtcars)
anova(fit1, fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt
## Model 3: mpg ~ am + wt + hp
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     30 720.90                                  
## 2     29 278.32  1    442.58 68.734 5.071e-09 ***
## 3     28 180.29  1     98.03 15.224 0.0005464 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
shapiro.test(fit3$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  fit3$residuals
## W = 0.9453, p-value = 0.1059

Interpreting the models

As we plug in WT, the model changes dramatically. R-squared, the variance explained, jumps up to 75%. However, the coefficient of AM is no longer significant, that is, almost the complete variance is explained now by the variable weight. Looking at the ANOVA analysis, we see what we have already known from looking at R-squared: the new model adds up significantly in explaining the variance of the outcome variable. The Shapiro-Wilk-Test for normality turns out to be non-significant. So we stay with the model and add one more predictor variable. Adding HP, we see an increase in R-squared to 84% explained variance. Both WT and HP have significant coeffcients. AM’s coeffcient is still non-significant, but to a much lesser degree compared to the second model. The ANOVA result turns out to be significant again. The Shapiro-Wilk-Test is non-significant. So we move on and add the variable DISP. Model 4 shows no improvement in R-squared; the coeffcient of DISP is not significant, at all. So we drop this variable and move on with model 5 where we add CYL as a predictor. Adding CYL improves R-squared by about 2%. The dummy variable CYL6 is significant, whereas the dummy variable CYL8 is not. That means, that with a change from 4 (baseline) to 6 cylinders, we see a significant change of 3 miles per gallon less. The decrease of 2 miles per gallon between 4 cylinders and 8 cylinders is not significant. But the ANOVA analysis turns out to be non-significant, assuming a 5% alpha-level, so we stick with model 3. One might argue, that this model disposes of only a few variables. However, given the high R-squared we achieved by including only three variables, we can be satisfied. Besides, having only a few predictors makes interpretation much easier. If we turn back to our question, if an automatic of manual transmission is better for MPG, and if so, what the difference in MPG is between these two types, we can conclude: Given our model, a manual transmission seems to have a positive impact on MPG. However, this effect is very much smaller than we assumed in the first place, when we only looked at the relationship between MPG and AM. It turns out, that WT and HP account for most of the variance in MPG; at the same time they correlate with AM. That means, that a manual transmission correlates with low numbers in HP and WT. If we control for WT and HP, there is only an effect of 2 miles per gallon more in case of a manual transmission, but this estimate is not significant, so chances are high that it could be 0 miles per gallon as well! So, if we wanted to predict MPG for a car given our data, we would prefer the following equation:

MPG_estimated = 34 - 2.87WT - 0.03HP Example (car with WT = 4 and HP = 150): MPG_estimated = 18

Residuals, diagnostics, variation

After having established our model, one final step is to check for some asssumptions that have to be met, before we can claim our model “trustworthy”.

Residuals

Diagnostic plots provide checks for heteroscedasticity, normality, and influential observerations.

par(mfrow=c(2,2))
plot(fit3)

The two plots in the first column show the fitted values (a.k.a the standardized predicted values) vs. the (standardized) residuals. They inform us about any violations against homoscedasticity and/or linearity. As one can see, the assumption of homoscdasticity is met; on the other hand, we have some slight curves which implies some violation of linearity, however, these curves are not pronounced. So we can check the boxes as regards homoscedasticity and linearity. The plot in the upper right corner informs us about the criterion, if the standardized residuals are normally distributed. The more the points match with the dotted line, the better. As we can see, there are some data points that deviate slightly from the line; however, overall the curve of the data points is not that wiggly, so we can check the box ‘normally distributed residuals’ as well. The fourth plot gives us an idea about potential outliers, that is data points that have a strong effect on the model fit. As a rough threshold (that is often criticized in the literature) we can use the value of 1. From the plot, we know that none of the observations is even above 0.5; so we can say, that there are no outliers that have a disturbing leverge effect on the model. However, we might think about redoing our analysis without the observations, that have been mentioned in the plots (Chrysler Imperial, Toyota Corolla, Fiat 128)

Autocorrelation

In order to test for non-autocorrelation (that is no correlation between the residual terms for any two observations), we use the Durbin-Watson-Test from the car package.

durbinWatsonTest(fit3)
##  lag Autocorrelation D-W Statistic p-value
##    1        0.232331       1.43292   0.054
##  Alternative hypothesis: rho != 0

The null-hypothesis that there is no autocorrelation must be rejected, if we assume alpha<0.05. So that is a problem, we should focus on in future analyses.

Variance inflation (Multicollinearity)

We do not want to throw a bunch of variables without a good reason, as the more variables you put into a model, the more it increases the standard errors of other regression variables. This is called variance inflation. On the other hand, we always have to be careful to not ignore any variables that might have a substantial impact on the model. In this case, we would underfit the model and that would lead to bias in the variance estimate.

vif(fit3)
##       am       wt       hp 
## 2.271082 3.774838 2.088124

Here, we can see, that the variable that may pose problem is WT. However, its tolerance (calculated by 1/VIF) is still above 0.2, which is alright. Still, the average of VIF is above 1, so multicollinearity might pose a problem in this model.

Conclusion

The analysis of residuals and other aspects with an impact on the quality of our model, shows that we can be quite satisfied with our model in that regard. By applying an ANOVA analysis in our regression, we were able to detect the improvement of R(squared) as the number of predictors increased. This leaves us with the final conclusion, that yes, indeed, the type of transmission has an impact on MPG. Cars with a manual transmission perform significantly better than cars with an automatic transmission. However, by including further variables in the model, we were able to correct for some distortions in our estimating the “true” MPG difference between cars with manual transmission and those with automatic transmission.