This analysis will look at the mtcars data set, a data set on a collection of cars, and analyze the relationship between a set of variables and miles per gallon (MPG). In particular, this analysis seeks to answer whether automatic or manual transmissions get better MPG.
Looking at the head, tail and structure of the data set will show the number of observations, number/type of variables and the range of values a given variable takes on.
rbind(head(mtcars), tail(mtcars))
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
The data set has 32 observations of 11 variables, all stored as numeric. Five of them (cyl, vs, am, gear and carb) take only a handful of discrete values and are better treated as factors, so they will be converted.
cars <- mtcars
cars$am <- ifelse(cars$am == 0, "Automatic", "Manual")
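# Convert the discrete-valued columns (positions 2 and 8 to 11) to factors
factor_cols <- c("cyl", "vs", "am", "gear", "carb")
cars[factor_cols] <- lapply(cars[factor_cols], as.factor)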
A boxplot of MPG by transmission type suggests that cars with manual transmissions get better (more) MPG than cars with automatic transmissions. However, it’s possible that there are confounding variables affecting the relationship between transmission type and miles per gallon, and thus a more rigorous analysis will be used to quantify the relationship between manual/automatic transmissions and miles per gallon.
boxplot(mpg ~ am, cars, col = 3:4, xlab = "Transmission", ylab = "Miles Per Gallon (MPG)")
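The gap visible in the boxplot can also be put in numbers. The following is a minimal sketch, not part of the original analysis, that computes the unadjusted mean MPG for each transmission type; the names `cars_sketch`, `raw_means` and `raw_diff` are hypothetical and chosen only for this illustration.

```r
# Rebuild the labelled transmission factor from the built-in mtcars data
cars_sketch <- mtcars
cars_sketch$am <- factor(ifelse(cars_sketch$am == 0, "Automatic", "Manual"))

# Unadjusted mean MPG for each transmission type (named numeric vector)
raw_means <- sapply(split(cars_sketch$mpg, cars_sketch$am), mean)
print(round(raw_means, 2))

# Raw difference in means, Manual minus Automatic, ignoring all other variables
raw_diff <- raw_means[["Manual"]] - raw_means[["Automatic"]]
print(round(raw_diff, 2))
```

This raw difference, roughly 7.2 MPG, does not adjust for any other variable, so it serves only as a baseline for the model-based estimates.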
Now that the data has been explored and tidied up, I will look for a model that predicts MPG from a number of variables, including transmission type. I will first fit a linear regression model with MPG as the dependent variable and all 10 other variables as the independent variables.
model.all <- lm(mpg ~ ., data = cars)
round(summary(model.all)$coef, 3)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.879 20.066 1.190 0.253
## cyl6 -2.649 3.041 -0.871 0.397
## cyl8 -0.336 7.160 -0.047 0.963
## disp 0.036 0.032 1.114 0.283
## hp -0.071 0.039 -1.788 0.094
## drat 1.183 2.483 0.476 0.641
## wt -4.530 2.539 -1.784 0.095
## qsec 0.368 0.935 0.393 0.700
## vs1 1.931 2.871 0.672 0.512
## amManual 1.212 3.214 0.377 0.711
## gear4 1.114 3.800 0.293 0.773
## gear5 2.528 3.736 0.677 0.509
## carb2 -0.979 2.318 -0.423 0.679
## carb3 3.000 4.294 0.699 0.495
## carb4 1.091 4.450 0.245 0.810
## carb6 4.478 6.384 0.701 0.494
## carb8 7.250 8.361 0.867 0.399
None of the coefficients in this model is significant at the 5% level. I would like to test this model against a few others. The chosen model will include the transmission variable, even though its p-value in this model is quite high. For the additional models I will use horsepower (hp) and weight (wt), because they have the lowest p-values in the model that includes every variable. I will create three other models: one with just AM, a second with AM + HP, and a third with AM + HP + WT. I will then compare all four models using analysis of variance (anova).
model1 <- lm(mpg ~ am, data = cars)
model2 <- lm(mpg ~ am + hp, data = cars)
model3 <- lm(mpg ~ am + hp + wt, data = cars)
anova(model1, model2, model3, model.all)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + hp
## Model 3: mpg ~ am + hp + wt
## Model 4: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 29 245.44 1 475.46 59.2334 1.382e-06 ***
## 3 28 180.29 1 65.15 8.1163 0.01219 *
## 4 15 120.40 13 59.89 0.5739 0.83944
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
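As a worked check on the table above, each F statistic divides the drop in residual sum of squares (per added degree of freedom) by the residual mean square of the largest model in the comparison, here Model 4. Using only the printed values, the step from Model 2 to Model 3 gives:

\[
F = \frac{(\mathrm{RSS}_2 - \mathrm{RSS}_3)/1}{\mathrm{RSS}_4/15} = \frac{65.15}{120.40/15} \approx \frac{65.15}{8.027} \approx 8.12
\]

which agrees with the reported 8.1163. The same calculation for the last step, \((59.89/13)/8.027 \approx 0.57\), reproduces the small F value for adding the remaining variables.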
The output from anova indicates that adding HP to the AM-only model, and then adding WT, each significantly reduces the residual sum of squares (p < 0.001 and p = 0.012, respectively). The R-squared (\(R{^2}\)) for the model with AM, HP and WT is 0.84. By comparison, adding the remaining seven variables does not significantly improve the fit: the p-value for that step is 0.84, and the \(R{^2}\) rises only to 0.89. Thus, the model with 10 independent variables explains very little more variance than the model with 3 independent variables. One could reasonably argue for the model that includes all variables, because it explains slightly more variance and potentially accounts for more confounding variables, but for ease of analysis I will use the model with AM, HP and WT as the independent variables moving forward. I will center the HP and WT variables so that the intercept is more interpretable.
model3 <- lm(mpg ~ am + I(hp - mean(hp)) + I(wt - mean(wt)), data = cars)
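Centering changes only the intercept, not the slopes or the overall fit. The following is a minimal sketch of that check, under the assumption that the data are prepared as above; `cars_sketch`, `fit_raw` and `fit_centered` are hypothetical names used here so that the original `model3` is left untouched.

```r
# Rebuild the labelled transmission factor from the built-in mtcars data
cars_sketch <- mtcars
cars_sketch$am <- factor(ifelse(cars_sketch$am == 0, "Automatic", "Manual"))

# Same predictors, with and without centering hp and wt
fit_raw      <- lm(mpg ~ am + hp + wt, data = cars_sketch)
fit_centered <- lm(mpg ~ am + I(hp - mean(hp)) + I(wt - mean(wt)), data = cars_sketch)

# Slopes (everything except the intercept) should agree
slopes_match <- isTRUE(all.equal(unname(coef(fit_raw)[-1]),
                                 unname(coef(fit_centered)[-1])))

# Fitted values, and hence R-squared, should agree as well
fit_match <- isTRUE(all.equal(fitted(fit_raw), fitted(fit_centered)))

# Only the intercept differs: after centering it is the predicted MPG of an
# automatic car with average horsepower and average weight
print(c(raw = unname(coef(fit_raw)[1]), centered = unname(coef(fit_centered)[1])))
print(c(slopes_match = slopes_match, fit_match = fit_match))
```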
Below are the coefficients for this model. According to the model, a car with average weight and average horsepower and an automatic transmission is predicted to get 19.2 MPG. In comparison, a car with average weight and average horsepower and a manual transmission is predicted to get 19.2 + 2.1 = 21.3 MPG. The amManual coefficient of 2.1 indicates that a car with a manual transmission gets an estimated 2.1 more MPG, holding HP and WT constant. As the boxplot above suggested, cars with manual transmissions appear to get better MPG than cars with automatic transmissions; however, taking other confounding variables into account seems to have decreased the absolute difference in MPG by transmission type. The HP coefficient indicates that an increase of 1 horsepower is associated with a decrease of about 0.04 MPG, holding AM and WT constant. Further, the WT coefficient indicates that an increase in weight of 1,000 lbs is associated with a decrease of 2.9 MPG, holding AM and HP constant.
round(summary(model3)$coef, 3)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.244 0.717 26.845 0.000
## amManual 2.084 1.376 1.514 0.141
## I(hp - mean(hp)) -0.037 0.010 -3.902 0.001
## I(wt - mean(wt)) -2.879 0.905 -3.181 0.004
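Written out with the rounded estimates from the table above, the fitted model is as follows, where \(\mathrm{Manual}\) is an indicator introduced here for illustration (1 for a manual transmission, 0 for an automatic) and the bars denote sample means:

\[
\widehat{\mathrm{mpg}} = 19.244 + 2.084\,\mathrm{Manual} - 0.037\,(\mathrm{hp} - \overline{\mathrm{hp}}) - 2.879\,(\mathrm{wt} - \overline{\mathrm{wt}})
\]

For a car with average horsepower and average weight the two centered terms vanish, leaving \(19.244\) MPG for an automatic and \(19.244 + 2.084 = 21.328\) MPG for a manual.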
To assess whether cars with manual transmissions really get better MPG than cars with automatic transmissions, I will compute 95% confidence intervals for the coefficients in the model. As can be seen below, the lower end of the confidence interval for the AM coefficient falls below 0. Since this confidence interval contains 0, one cannot rule out that the difference in mean MPG between automatic and manual transmissions, holding HP and WT constant, is 0. Thus, at the 95% confidence level, one cannot say that there is a significant difference in the mean MPG between the two transmission types. The confidence intervals for the HP and WT coefficients do not contain 0, however, and thus one can say at the 95% confidence level that heavier cars get lower MPG than lighter cars, and cars with more horsepower get lower MPG than cars with less horsepower.
confint(model3)
## 2.5 % 97.5 %
## (Intercept) 17.77569474 20.71254078
## amManual -0.73575874 4.90317900
## I(hp - mean(hp)) -0.05715454 -0.01780291
## I(wt - mean(wt)) -4.73232353 -1.02482730
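The intervals above follow the usual recipe of estimate plus or minus a t quantile times the standard error. The sketch below rebuilds them by hand as an illustration; `cars_sketch`, `fit`, `t_crit` and `manual_ci` are hypothetical names, and the model is refitted so that the block runs on its own.

```r
# Refit the centered model on the built-in mtcars data
cars_sketch <- mtcars
cars_sketch$am <- factor(ifelse(cars_sketch$am == 0, "Automatic", "Manual"))
fit <- lm(mpg ~ am + I(hp - mean(hp)) + I(wt - mean(wt)), data = cars_sketch)

# Point estimates and standard errors from the coefficient table
est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]

# Two-sided 95% interval: estimate +/- t quantile * standard error,
# with the residual degrees of freedom (32 observations - 4 coefficients = 28)
t_crit <- qt(0.975, df = df.residual(fit))
manual_ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(manual_ci, 3))

# Compare against confint(), ignoring row and column labels
ci_match <- isTRUE(all.equal(unname(manual_ci), unname(confint(fit))))
print(ci_match)
```

Because the standard error of the amManual estimate is large relative to the estimate itself, the interval spans 0 even though the point estimate is positive.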
I will plot the fitted values versus residuals to see if anything about this model’s fit looks out of the ordinary.
plot(fitted(model3), resid(model3), col = "red", xlab = "Fitted Values", ylab = "Residuals", main = "")
abline(h = 0)
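The plot above shows no obvious pattern in the spread of the residuals, which suggests homoscedasticity, but I will test this using the Goldfeld-Quandt test. The null hypothesis is that the error variance is constant (homoscedasticity); the alternative, as reported in the output, is that the variance increases from the first segment of the ordered data to the second. With a p-value of 0.82, one cannot reject the null hypothesis that homoscedasticity is present.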
library(lmtest)
gqtest(model3, order.by = ~ am + hp + wt, data = mtcars, fraction = 7)
##
## Goldfeld-Quandt test
##
## data: model3
## GQ = 0.52066, df1 = 9, df2 = 8, p-value = 0.8248
## alternative hypothesis: variance increases from segment 1 to 2
The above analysis looked at the mtcars data set and analyzed how several variables relate to miles per gallon. For ease of analysis, a model was chosen that explains most of the variance in miles per gallon without including a large number of independent variables. The independent variables selected were transmission type, horsepower and weight. The associated model had an \(R{^2}\) of 0.84. The coefficients and their associated 95% confidence intervals led to the conclusion that increasing weight and increasing horsepower were associated with lower miles per gallon. While the coefficient for transmission type suggested that cars with manual transmissions get better miles per gallon than cars with automatic ones, the coefficient's associated confidence interval included 0, and so one could not conclude at the 95% confidence level that cars with manual transmissions get better MPG than cars with automatic transmissions.