Overview

This analysis examines the mtcars data set, which describes a collection of cars, and analyzes the relationship between a set of variables and miles per gallon (MPG). In particular, it seeks to answer whether cars with automatic or manual transmissions get better MPG.

Exploratory Analysis

Looking at the head, tail and structure of the data set will show the number of observations, number/type of variables and the range of values a given variable takes on.

rbind(head(mtcars), tail(mtcars))
##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Porsche 914-2     26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa      30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L    15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino      19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora     15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E        21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

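One can see that the data set has 32 observations of 11 variables, but some of these variables are categorical and should be factors rather than numerics. So cyl, vs, am, gear and carb (columns 2 and 8 through 11) will be converted to factors, with am first relabeled as "Automatic" (0) or "Manual" (1).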

cars <- mtcars
cars$am <- ifelse(cars$am == 0, "Automatic", "Manual")
cars[,c(2,8:11)] <- lapply(cars[,c(2,8:11)], as.factor)
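
As a quick sanity check on the conversion, the following sketch rebuilds `cars` with the same steps (so that it runs on its own) and confirms which columns ended up as factors. The column positions are taken from the `str` output above; nothing else is assumed.

```r
# Rebuild `cars` exactly as above so this block is self-contained.
cars <- mtcars
cars$am <- ifelse(cars$am == 0, "Automatic", "Manual")
cars[, c(2, 8:11)] <- lapply(cars[, c(2, 8:11)], as.factor)

# Columns 2 and 8:11 are cyl, vs, am, gear and carb.
names(cars)[c(2, 8:11)]

# Class of every column: the five above should be "factor", the rest "numeric".
sapply(cars, class)

# Number of cars per transmission type.
table(cars$am)
```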


A boxplot of MPG by transmission type suggests that cars with manual transmissions get higher MPG than cars with automatic transmissions. However, confounding variables may be affecting the relationship between transmission type and MPG, so a more rigorous analysis will be used to quantify that relationship.

boxplot(mpg ~ am, cars, col = 3:4, xlab = "Transmission", ylab = "Miles Per Gallon (MPG)")
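
To put a number on what the boxplot shows, the group means can be computed directly. This is a minimal sketch that rebuilds only the `am` factor; the variable names `group_means` and `unadjusted_diff` are introduced here for illustration.

```r
cars <- mtcars
cars$am <- factor(ifelse(cars$am == 0, "Automatic", "Manual"))

# Mean MPG within each transmission group.
aggregate(mpg ~ am, data = cars, FUN = mean)

# Unadjusted difference in mean MPG (Manual minus Automatic).
group_means <- tapply(cars$mpg, cars$am, mean)
unadjusted_diff <- unname(group_means["Manual"] - group_means["Automatic"])
round(unadjusted_diff, 2)
```

This difference ignores every other variable, which is why it serves only as a baseline for the regression models.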

Regression Analysis

Now that the data have been explored and tidied up, I will look for a model that predicts MPG from a car’s values for a number of variables, including transmission type. I will first fit a linear regression model with MPG as the dependent variable and all 10 other variables as independent variables.

model.all <- lm(mpg ~ ., data = cars)
round(summary(model.all)$coef, 3)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   23.879     20.066   1.190    0.253
## cyl6          -2.649      3.041  -0.871    0.397
## cyl8          -0.336      7.160  -0.047    0.963
## disp           0.036      0.032   1.114    0.283
## hp            -0.071      0.039  -1.788    0.094
## drat           1.183      2.483   0.476    0.641
## wt            -4.530      2.539  -1.784    0.095
## qsec           0.368      0.935   0.393    0.700
## vs1            1.931      2.871   0.672    0.512
## amManual       1.212      3.214   0.377    0.711
## gear4          1.114      3.800   0.293    0.773
## gear5          2.528      3.736   0.677    0.509
## carb2         -0.979      2.318  -0.423    0.679
## carb3          3.000      4.294   0.699    0.495
## carb4          1.091      4.450   0.245    0.810
## carb6          4.478      6.384   0.701    0.494
## carb8          7.250      8.361   0.867    0.399

None of the coefficients in this model is significant at the 0.05 level. I’d like to compare this model against a few others. The chosen model will include the transmission variable regardless, even though its p-value in this model is quite high. For the smaller models I will add horsepower (hp) and weight (wt), because they have the lowest p-values in the model that included every variable. I will fit three other models: one with just AM, a second with AM + HP, and a third with AM + HP + WT. I will then compare all four nested models using analysis of variance (anova).

model1 <- lm(mpg ~ am, data = cars)
model2 <- lm(mpg ~ am + hp, data = cars)
model3 <- lm(mpg ~ am + hp + wt, data = cars)
anova(model1, model2, model3, model.all)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + hp
## Model 3: mpg ~ am + hp + wt
## Model 4: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     29 245.44  1    475.46 59.2334 1.382e-06 ***
## 3     28 180.29  1     65.15  8.1163   0.01219 *  
## 4     15 120.40 13     59.89  0.5739   0.83944    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
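
As a cross-check on the table above, the sketch below refits the four models, computes \(R^2\) for each, and reproduces the F statistic in the last row by hand. The list `models` and the helper names `rss`, `df_res` and `f_row4` are introduced here for illustration only.

```r
cars <- mtcars
cars$am <- ifelse(cars$am == 0, "Automatic", "Manual")
cars[, c(2, 8:11)] <- lapply(cars[, c(2, 8:11)], as.factor)

models <- list(
  model1    = lm(mpg ~ am, data = cars),
  model2    = lm(mpg ~ am + hp, data = cars),
  model3    = lm(mpg ~ am + hp + wt, data = cars),
  model.all = lm(mpg ~ ., data = cars)
)

# R-squared for each model.
r2 <- sapply(models, function(m) summary(m)$r.squared)
round(r2, 2)

# Residual sum of squares and residual degrees of freedom per model.
rss <- sapply(models, deviance)
df_res <- sapply(models, df.residual)

# F statistic for the Model 3 vs Model 4 row: the drop in RSS per added
# degree of freedom, divided by the residual mean square of the largest model.
f_row4 <- unname(
  ((rss["model3"] - rss["model.all"]) / (df_res["model3"] - df_res["model.all"])) /
    (rss["model.all"] / df_res["model.all"])
)
round(f_row4, 4)
```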

The anova output shows that adding HP to the model with AM alone, and then adding WT, each significantly reduces the residual sum of squares (p < 0.001 and p = 0.012, respectively). The model that includes AM, HP and WT has an R-squared (\(R^2\)) of 0.84. By comparison, adding the remaining seven variables does not significantly improve the fit: the p-value for that comparison is 0.84, and the \(R^2\) of the model with all 10 independent variables is 0.89, only slightly higher. One could reasonably argue for using the full model anyway, because it explains slightly more variance and may account for more potential confounders, but for ease of analysis I will proceed with the model that includes only AM, HP and WT as independent variables. I will center the HP and WT variables so that the intercept is more interpretable.

model3 <- lm(mpg ~ am + I(hp - mean(hp)) + I(wt - mean(wt)), data = cars)
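
Centering changes only the intercept, not the slopes or the fit. The following sketch illustrates this by comparing the centered model with its uncentered counterpart; the object names `uncentered` and `centered` are introduced here for illustration.

```r
cars <- mtcars
cars$am <- factor(ifelse(cars$am == 0, "Automatic", "Manual"))

uncentered <- lm(mpg ~ am + hp + wt, data = cars)
centered   <- lm(mpg ~ am + I(hp - mean(hp)) + I(wt - mean(wt)), data = cars)

# Slopes (everything except the intercept) agree between the two fits.
all.equal(unname(coef(uncentered)[-1]), unname(coef(centered)[-1]))

# Only the intercept differs: uncentered, it is the prediction at hp = 0 and
# wt = 0; centered, it is the prediction for an automatic car at average hp and wt.
round(c(uncentered = unname(coef(uncentered)[1]),
        centered   = unname(coef(centered)[1])), 3)

# Fitted values, and therefore residuals and R-squared, are identical.
all.equal(fitted(uncentered), fitted(centered))
```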

Below are the coefficients for this model. According to the model, a car with average weight and average horsepower and an automatic transmission is expected to get 19.2 MPG. In comparison, a car with average weight and average horsepower and a manual transmission is expected to get 19.2 + 2.1 = 21.3 MPG. The amManual coefficient of 2.1 indicates that a manual transmission is associated with 2.1 more MPG, holding HP and WT constant. As the boxplot above suggested, cars with manual transmissions appear to get better MPG than cars with automatic transmissions; however, accounting for HP and WT appears to shrink the difference in MPG between transmission types. The HP coefficient indicates that each additional unit of horsepower is associated with a decrease of 0.04 MPG, holding AM and WT constant. Likewise, the WT coefficient indicates that each additional 1,000 lbs of weight is associated with a decrease of 2.9 MPG, holding AM and HP constant.

round(summary(model3)$coef, 3)
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)        19.244      0.717  26.845    0.000
## amManual            2.084      1.376   1.514    0.141
## I(hp - mean(hp))   -0.037      0.010  -3.902    0.001
## I(wt - mean(wt))   -2.879      0.905  -3.181    0.004
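
Written out with the rounded coefficients from the table above, the fitted model is

\[ \widehat{\text{mpg}} = 19.244 + 2.084\,\text{Manual} - 0.037\,(\text{hp} - \overline{\text{hp}}) - 2.879\,(\text{wt} - \overline{\text{wt}}) \]

where Manual is 1 for a manual transmission and 0 for an automatic one, and wt is measured in units of 1,000 lbs. As a purely hypothetical illustration (not a car from the data set), a manual car with 50 more horsepower than average and 500 lbs (0.5 units) more weight than average would be predicted to get

\[ 19.244 + 2.084 - 0.037(50) - 2.879(0.5) = 19.244 + 2.084 - 1.850 - 1.4395 \approx 18.0 \text{ MPG}. \]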

To assess whether cars with manual transmissions really get better MPG than cars with automatic transmissions, I will compute 95% confidence intervals for the coefficients in the model. As can be seen below, the lower end of the confidence interval for the AM coefficient falls below 0. Since this confidence interval contains 0, one cannot rule out that the difference in mean MPG between automatic and manual transmissions, holding HP and WT constant, is 0. Thus, at the 95% confidence level, one cannot say that there is a significant difference in mean MPG between the two transmission types. The confidence intervals for the HP and WT coefficients do not contain 0, however, and thus one can say at the 95% confidence level that heavier cars get lower MPG than lighter cars, and cars with more horsepower get lower MPG than cars with less horsepower.

confint(model3)
##                        2.5 %      97.5 %
## (Intercept)      17.77569474 20.71254078
## amManual         -0.73575874  4.90317900
## I(hp - mean(hp)) -0.05715454 -0.01780291
## I(wt - mean(wt)) -4.73232353 -1.02482730
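
The intervals above are each estimate plus or minus a t critical value times its standard error. A minimal sketch that reproduces them by hand follows; the names `est`, `se`, `t_crit` and `manual_ci` are introduced here for illustration.

```r
cars <- mtcars
cars$am <- factor(ifelse(cars$am == 0, "Automatic", "Manual"))
model3 <- lm(mpg ~ am + I(hp - mean(hp)) + I(wt - mean(wt)), data = cars)

coefs <- summary(model3)$coefficients
est <- coefs[, "Estimate"]
se  <- coefs[, "Std. Error"]

# Critical value of the t distribution with the model's residual
# degrees of freedom (32 observations minus 4 coefficients = 28).
t_crit <- qt(0.975, df = df.residual(model3))

manual_ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
manual_ci

# Should agree with confint() up to floating-point error.
all.equal(unname(manual_ci), unname(confint(model3)))
```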

I will plot the fitted values versus residuals to see if anything about this model’s fit looks out of the ordinary.

plot(fitted(model3), resid(model3), col = "red", xlab = "Fitted Values", ylab = "Residuals", main = "")
abline(h = 0)

The plot above shows no obvious pattern in the spread of the residuals, which suggests homoscedasticity, but I will test this using the Goldfeld-Quandt test. The null hypothesis in this case is that homoscedasticity is present. With a p-value of 0.82, one cannot reject the null hypothesis that homoscedasticity is present.

library(lmtest)
gqtest(model3, order.by = ~ am + hp + wt, data = mtcars, fraction = 7)
## 
##  Goldfeld-Quandt test
## 
## data:  model3
## GQ = 0.52066, df1 = 9, df2 = 8, p-value = 0.8248
## alternative hypothesis: variance increases from segment 1 to 2
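
For intuition about what the test does, here is a simplified base-R sketch of a Goldfeld-Quandt style comparison. It is not the `gqtest` implementation. It assumes the observations are ordered by weight alone and that the lightest 12 and heaviest 13 cars are compared, with the 7 cars in between omitted, so its result may not match the output above exactly.

```r
cars <- mtcars
cars$am <- factor(ifelse(cars$am == 0, "Automatic", "Manual"))

# Assumption: order the cars by weight, lightest first.
sorted <- cars[order(cars$wt), ]
n <- nrow(sorted)

# Assumption: compare the lightest 12 and heaviest 13 cars, omitting the middle 7.
low  <- sorted[1:12, ]
high <- sorted[(n - 12):n, ]

# Fit the same regression separately in each segment.
fit_low  <- lm(mpg ~ am + hp + wt, data = low)
fit_high <- lm(mpg ~ am + hp + wt, data = high)

# Test statistic: ratio of residual variances, heavy segment over light segment.
gq <- (deviance(fit_high) / df.residual(fit_high)) /
  (deviance(fit_low) / df.residual(fit_low))

# One-sided p-value for the alternative that the variance increases with weight.
p_value <- pf(gq, df.residual(fit_high), df.residual(fit_low), lower.tail = FALSE)
c(GQ = gq, p.value = p_value)
```

A ratio well above 1 with a small p-value would point to residual variance growing with weight; a ratio near or below 1 gives no such evidence.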

Summary

The above analysis looked at the mtcars data set and analyzed the relationship between several variables and miles per gallon. For ease of analysis, a model was chosen that explained most of the variance in miles per gallon without including a large number of independent variables. The independent variables selected were transmission type, horsepower and weight. The associated model had an \(R^2\) of 0.84. The coefficients and their associated 95% confidence intervals led to the conclusion that increasing weight and increasing horsepower were associated with lower miles per gallon. While the coefficient for transmission type suggested that cars with manual transmissions get better miles per gallon than cars with automatic ones, the coefficient’s confidence interval included 0, and so one could not conclude at the 95% confidence level that cars with manual transmissions get better MPG than cars with automatic transmissions.