Analysis of the relationship between miles per gallon (MPG) and type of transmission for cars using the 'mtcars' dataset

Executive summary

This analysis presents the variability of miles per gallon (MPG) for the cars in the 'mtcars' dataset and its relationship to the type of transmission. It describes the context, addresses the question of whether automatic or manual transmission is better for MPG, and quantifies the MPG difference between the two.

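The analysis was performed using three models: a comparison of mean MPG between the two transmission groups, a simple regression of MPG on weight, and a regression of MPG on weight with the transmission type as a categorical regressor and an interaction term.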

The analysis showed a notable difference of 7.24 mpg in mean MPG between the two groups: cars with manual transmission average 24.39 mpg, while cars with automatic transmission average 17.15 mpg. Adding the type of transmission as a categorical regressor to a weight-only model is statistically significant (p = 0.004 in the model comparison), which justifies the third model. The discussion of diagnostic measures supports the analysis and the quality of the data set used.

Exploring the mtcars dataset

First, let's form a simple model by splitting the data into two groups by transmission type and comparing their mean MPG values.

There are 19 cars with automatic transmission and 13 cars with manual transmission, so the data set is slightly skewed towards automatic transmission. This is plausible, as this type is more prevalent in the US.

The mean MPG for cars with automatic transmission is 17.1474 and the mean for cars with manual transmission is 24.3923, a difference of 7.2449. The difference in medians between the groups points the same way at 5.5. This suggests that manual transmission cars are considerably better in terms of MPG; the further analysis will try to support and substantiate this claim. The conclusion is not unexpected and agrees with what is known about car design and manufacturing: an automatic transmission is inherently less efficient because it adds more mechanical elements to the car, and its control system is currently far less efficient than human control.
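The group sizes, means, and medians quoted above can be reproduced with a few base R calls. This is a minimal sketch: the copy named `cars`, the factor labels, and the result variable names are illustrative choices, not names taken from the original analysis.

```r
# Work on a copy so the built-in dataset is left untouched.
cars <- mtcars
# In mtcars, am is coded 0 = automatic, 1 = manual.
cars$amf <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

group_n      <- table(cars$amf)                     # number of cars per group
group_mean   <- tapply(cars$mpg, cars$amf, mean)    # mean MPG per group
group_median <- tapply(cars$mpg, cars$amf, median)  # median MPG per group

# Differences, manual minus automatic.
mean_diff   <- unname(group_mean["manual"] - group_mean["automatic"])
median_diff <- unname(group_median["manual"] - group_median["automatic"])

print(group_n)
print(round(group_mean, 4))
print(c(mean_diff = mean_diff, median_diff = median_diff))
```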

Below is the plot showing the relationship between the type of transmission and MPG with the MPG means added as horizontal lines.

Figure 1: MPG by type of transmission, with the group means shown as horizontal lines.

Now, let's consider a working hypothesis that, aside from transmission type, the factor that most strongly influences MPG is how 'big' and 'heavy' the car and its engine are. This, too, should be grounded in theory and reality, as bigger vehicles tend to be less efficient per unit of weight or per unit of any other measure of the 'heaviness' of the engine and its mechanical system.

Figure 2: MPG versus car weight.

As Figure 2 above shows, the heavier the car, the lower its MPG. A simple regression model with MPG as the outcome and weight as the only regressor tells us that every 1000 lb added to the car's weight (the unit of the wt variable) decreases MPG by 5.34 mpg on average (the beta1 coefficient of the model).
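The weight-only model described above can be fitted in one line. The sketch below assumes only the built-in `mtcars` data; the names `fit_wt` and `slope_wt` are illustrative.

```r
# Simple linear regression of MPG on weight (wt is measured in 1000 lb).
fit_wt <- lm(mpg ~ wt, data = mtcars)

# The slope (beta1) is the average change in MPG per additional 1000 lb.
slope_wt <- unname(coef(fit_wt)["wt"])

print(round(coef(fit_wt), 2))
```

Calling `summary(fit_wt)` additionally reports standard errors and p-values for both coefficients.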

A very similar relationship with MPG exists for other variables fitting the description above, i.e. horsepower, number of cylinders, and engine displacement (the volume swept by all the pistons inside the cylinders in a cycle). In models with a single regressor, each additional unit of any of those variables decreases MPG by a considerable amount on average. For conciseness the graphs and beta values for those models are omitted, but they are very similar in nature to predicting MPG from weight alone. This suggests that weight, horsepower, number of cylinders and engine displacement are all highly correlated. The code snippet below shows how strongly each of them (and the number of carburetors) correlates with MPG:

mtcars_subset <- mtcars[,(names(mtcars) %in% c("mpg", "wt", "hp", "cyl", "carb", "disp"))]
cor(mtcars_subset)[,1]
##     mpg     cyl    disp      hp      wt    carb 
##  1.0000 -0.8522 -0.8476 -0.7762 -0.8677 -0.5509
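The output above lists each variable's correlation with MPG only. To check the correlation among the 'heaviness' variables themselves, the full correlation matrix can be inspected. This is a small sketch; the names `size_vars`, `size_cor`, and `min_pairwise` are illustrative.

```r
# Pairwise correlations among the size-related regressors.
size_vars <- c("wt", "hp", "cyl", "disp")
size_cor  <- cor(mtcars[, size_vars])

print(round(size_cor, 2))

# Weakest correlation between any two distinct variables in the set.
min_pairwise <- min(size_cor[upper.tri(size_cor)])
print(round(min_pairwise, 2))
```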

Let's combine our knowledge so far and construct a model with one of the 'heaviness' regressors that we know has a big influence on MPG, together with the transmission type as a categorical regressor. The model also contains the interaction term between the two regressors, so the outcome is two separate regression lines, one for each group.

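# amf is the transmission type as a factor (am: 0 = automatic, 1 = manual)
mtcars$amf <- factor(mtcars$am)
# fit1: weight only; fit2: weight, transmission type, and their interaction
fit1 <- lm(mpg ~ wt, mtcars)
fit2 <- lm(mpg ~ wt + amf + wt*amf, mtcars)
# F-test of whether adding transmission type improves on the weight-only model
anova(fit1, fit2)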
## Analysis of Variance Table
## 
## Model 1: mpg ~ wt
## Model 2: mpg ~ wt + amf + wt * amf
##   Res.Df RSS Df Sum of Sq    F Pr(>F)   
## 1     30 278                            
## 2     28 188  2      90.3 6.73 0.0041 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model comparison using the anova function gives a p-value of 0.004 for the model with weight and transmission type regressors against the model with weight only. This is a highly significant value, indicating that the transmission type (with its interaction with weight) explains a meaningful additional share of the variation in MPG and should be included in the model.

The model tells us that, on average, every 1000 lb added to a car's weight lowers its MPG by 9.08 mpg for cars with manual transmission and by 3.79 mpg for cars with automatic transmission. The decrease per unit of weight is much steeper for manual transmission, but this does not by itself undermine the conclusion that manual transmission is better for MPG: the data cover only the weights of actually produced models, and the line for manual cars has a much higher intercept (46.29 mpg versus 31.41 mpg for automatic).
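The group-specific intercepts and slopes quoted above follow from the coefficients of the interaction model: the automatic group is the reference level, and the manual group's line is obtained by adding the transmission and interaction coefficients. The sketch below assumes a copy named `cars` and the factor labels "automatic" and "manual", which are illustrative and not the names used in the original code.

```r
cars <- mtcars
# am is coded 0 = automatic, 1 = manual; automatic becomes the reference level.
cars$amf <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# wt * amf expands to wt + amf + wt:amf, the same terms as in fit2.
fit_int <- lm(mpg ~ wt * amf, data = cars)
b <- coef(fit_int)

# Regression line for the reference group (automatic).
intercept_auto <- unname(b["(Intercept)"])
slope_auto     <- unname(b["wt"])

# Regression line for manual: reference values plus the manual offsets.
intercept_manual <- unname(b["(Intercept)"] + b["amfmanual"])
slope_manual     <- unname(b["wt"] + b["wt:amfmanual"])

# Weight (in 1000 lb) at which the two fitted lines intersect.
cross_wt <- unname(-b["amfmanual"] / b["wt:amfmanual"])

print(round(c(intercept_auto = intercept_auto, slope_auto = slope_auto,
              intercept_manual = intercept_manual, slope_manual = slope_manual), 2))
print(cross_wt)
print(range(cars$wt))
```

Comparing `cross_wt` with the observed weight range shows over which weights each fitted line lies higher, which is worth checking before generalizing the comparison of intercepts.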

Checking the diagnostic values of the model

Below (Figure 3) is a plot of MPG versus total car weight for all cars, with the two groups shown in different colors (red is manual and blue is automatic transmission). The red line is the model's regression line for manual cars and the blue line is the one for automatic cars. The dashed lines represent the 95% confidence intervals of the regression lines. Prediction intervals are not drawn, as this analysis does not address the prediction of unknown future models.

Figure 3: MPG versus car weight by type of transmission, with regression lines and 95% confidence intervals.
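Confidence bands of the kind described for Figure 3 can be computed with `predict()` on a grid of weights. The following is a sketch under assumptions, not the original plotting code: the names `cars`, `fit_int`, and `grid`, the factor labels, and the choice of 50 grid points per group are all illustrative.

```r
cars <- mtcars
cars$amf <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))
fit_int <- lm(mpg ~ wt * amf, data = cars)

# For each group, build a grid of weights spanning that group's observed range.
grid <- do.call(rbind, lapply(levels(cars$amf), function(g) {
  w <- range(cars$wt[cars$amf == g])
  data.frame(wt  = seq(w[1], w[2], length.out = 50),
             amf = factor(g, levels = levels(cars$amf)))
}))

# 95% confidence interval for the mean response (the regression line),
# as opposed to interval = "prediction" for individual new cars.
band <- predict(fit_int, newdata = grid, interval = "confidence", level = 0.95)
grid <- cbind(grid, band)  # adds the columns fit, lwr, upr

print(head(grid))
```

The columns `lwr` and `upr` can then be drawn as dashed lines around `fit` for each group.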

To check the adequacy of the model, some diagnostic measures will be computed. Figure 4 below shows the residual plot for the model. Judging from the plot, the residuals appear to be approximately normally distributed.

Figure 4: Residual plot for the model.
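The residual check can be sketched as follows. A normal Q-Q plot and the Shapiro-Wilk test are added here as common complements to a residual plot for judging normality; they are an addition for illustration and are not part of the original analysis. The names `cars`, `fit_int`, `res`, and `sw` are illustrative.

```r
cars <- mtcars
cars$amf <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))
fit_int <- lm(mpg ~ wt * amf, data = cars)

res <- resid(fit_int)

# Residuals against fitted values: look for curvature or changing spread.
plot(fitted(fit_int), res, xlab = "Fitted MPG", ylab = "Residual")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points close to the line are consistent with normality.
qqnorm(res)
qqline(res)

# Shapiro-Wilk test: a small p-value would indicate departure from normality.
sw <- shapiro.test(res)
print(sw)
```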

The last diagnostic measure used to check the model is the list of hat values (leverages), presented below for all data points. Some values are moderately high (for example Maserati Bora 0.37, Lincoln Continental 0.30, Chrysler Imperial 0.28, Lotus Europa 0.25). In this particular case this is still acceptable and no data points should be excluded from the analysis: each point represents a real car with official statistics rather than a possibly erroneous measurement, so even if it deviates from the linear model we chose, it should still be used.

##           Mazda RX4       Mazda RX4 Wag          Datsun 710 
##             0.08649             0.12405             0.07874 
##      Hornet 4 Drive   Hornet Sportabout             Valiant 
##             0.08083             0.06258             0.06140 
##          Duster 360           Merc 240D            Merc 230 
##             0.05627             0.08344             0.08784 
##            Merc 280           Merc 280C          Merc 450SE 
##             0.06258             0.06258             0.06097 
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood 
##             0.05277             0.05264             0.25429 
## Lincoln Continental   Chrysler Imperial            Fiat 128 
##             0.30445             0.28099             0.08667 
##         Honda Civic      Toyota Corolla       Toyota Corona 
##             0.21563             0.14955             0.20892 
##    Dodge Challenger         AMC Javelin          Camaro Z28 
##             0.05833             0.06288             0.05310 
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2 
##             0.05316             0.12652             0.09300 
##        Lotus Europa      Ford Pantera L        Ferrari Dino 
##             0.25346             0.20304             0.10514 
##       Maserati Bora          Volvo 142E 
##             0.37099             0.10673
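The hat values listed above can be obtained with `hatvalues()`. The sketch below also flags points above the common rule-of-thumb cutoff of 2p/n (p coefficients, n observations); that cutoff is an assumption added here for illustration and is not used in the original analysis. The names `cars`, `fit_int`, `h`, `cutoff`, and `high` are illustrative.

```r
cars <- mtcars
cars$amf <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))
fit_int <- lm(mpg ~ wt * amf, data = cars)

# Leverage of each car; the values sum to the number of model coefficients.
h <- hatvalues(fit_int)

# Rule-of-thumb cutoff (an assumption for this sketch): twice the mean leverage.
cutoff <- 2 * length(coef(fit_int)) / nrow(cars)

# Cars with leverage above the cutoff, highest first.
high <- sort(h[h > cutoff], decreasing = TRUE)

print(round(h, 5))
print(cutoff)
print(round(high, 5))
```

High leverage alone only says that a car has an unusual weight for its transmission group; whether it actually distorts the fit would need a measure such as Cook's distance (`cooks.distance()`).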