Thursday, January 22, 2015
Using the mtcars data set, we conducted exploratory analyses and identified 4 key covariates, in addition to transmission type, with which to predict fuel economy in miles per gallon (mpg). We fit a linear model to predict fuel economy from these 5 variables and concluded that, for the cars in this data set, manual transmissions offer better fuel economy by 4.32 +/- 1.57 mpg (estimate +/- standard error).
We performed our MPG analysis on the mtcars data set, taken from the 1974 Motor Trend US magazine. This data set consists of 11 variables for 32 distinct 1973-1974 model automobiles: fuel economy in miles per gallon (mpg) and 10 features of automobile design and performance. The 10 features are number of cylinders, engine displacement in cubic inches, gross horsepower, rear axle ratio, weight in 1000 lbs, ¼ mile time, engine type (V-shaped or straight), transmission type (automatic or manual), number of forward gears, and number of carburetors.
To understand the data, we began with some exploratory analysis, plotting mpg vs. transmission type for all cars in the data set (this and all figures referenced henceforth can be found in the appendix). We then fit a linear model for the outcome, mpg, with transmission type as the only predictor. The resulting coefficients are shown below.
fit1<-lm(mpg~factor(am),data=mtcars)
round(fit1$coef,2)
## (Intercept) factor(am)1
##       17.15        7.24
Based on this fit, the average fuel economy for cars with an automatic transmission is 17.15 mpg, and the estimated effect of having a manual transmission is an additional 7.24 mpg, which means that manual transmission cars have an average fuel economy of 24.39 mpg. The p-value for the transmission coefficient is of order 10^-4 (approximately 0.0003), suggesting that this difference is statistically significant. Box plots of the data also show that the lower quartile of mpg for cars with manual transmissions is greater than the upper quartile for cars with automatic transmissions.
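As a quick check on these numbers, the group means and the p-value for the transmission coefficient can be read directly from the data and the fitted model. The following is a minimal sketch rather than part of the original analysis; the names `group_means` and `am_row` are introduced here for illustration only.

```r
# Illustrative check; `group_means` and `am_row` are hypothetical names.
data(mtcars)
fit1 <- lm(mpg ~ factor(am), data = mtcars)

# Mean mpg by transmission type (0 = automatic, 1 = manual)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)
round(group_means, 2)

# Coefficient row for the manual-transmission indicator:
# estimate, standard error, t value, and p-value
am_row <- summary(fit1)$coefficients["factor(am)1", ]
signif(am_row, 3)
```

With a single binary predictor, the intercept equals the mean of the reference group (automatic) and the slope equals the difference between the two group means, so the two printed results should agree with each other.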
Of course, there are 9 other covariates in the data that may contribute to the difference in mpg. We examined these by looking at pairwise plots of all variables against one another. These showed clear correlations between mpg and the other variables, suggesting that transmission type should not be the only variable included in the model. As such, we fit a second model for the outcome mpg as a linear combination of all 10 covariates, treating engine type, transmission type, cylinder number, and forward gear number as discrete factor variables.
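We ran the anova() function on this fit. For a single model, anova() reports sequential (Type I) sums of squares: each row tests the variance in mpg explained by that term after the terms listed above it have been included, so the results depend on the order of the terms. The output is displayed below.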
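# Remaining covariates are listed explicitly rather than via `.`, which would re-add vs, am, cyl, and gear as redundant numeric terms
fit3<-lm(mpg~factor(vs)+factor(am)+factor(cyl)+factor(gear)+disp+hp+drat+wt+qsec+carb,data=mtcars)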
round(anova(fit3),2)
## Analysis of Variance Table
##
## Response: mpg
##              Df Sum Sq Mean Sq F value Pr(>F)
## factor(vs)    1 496.53  496.53   72.54 <2e-16 ***
## factor(am)    1 276.03  276.03   40.33 <2e-16 ***
## factor(cyl)   2  94.59   47.30    6.91   0.01 **
## factor(gear)  2   6.30    3.15    0.46   0.64
## disp          1  35.75   35.75    5.22   0.03 *
## hp            1  57.93   57.93    8.46   0.01 **
## drat          1   4.66    4.66    0.68   0.42
## wt            1  14.99   14.99    2.19   0.16
## qsec          1   5.26    5.26    0.77   0.39
## carb          1   3.95    3.95    0.58   0.46
## Residuals    19 130.05    6.84
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This suggests that the statistically significant (p<0.05) covariates to include in our model are engine type, cylinder number, displacement, and horsepower in addition to transmission type, our predictor of interest. Fitting a new model with only these 5 predictors yields the following coefficients.
fit4<-lm(mpg~factor(vs)+factor(am)+factor(cyl)+I(disp-mean(disp))+I(hp-mean(hp)),data=mtcars)
summ<-round(summary(fit4)$coef,2)
summ
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)             16.95       2.99    5.67     0.00
## factor(vs)1              2.38       1.92    1.24     0.23
## factor(am)1              4.32       1.57    2.76     0.01
## factor(cyl)6            -2.09       1.82   -1.15     0.26
## factor(cyl)8             1.84       3.78    0.49     0.63
## I(disp - mean(disp))    -0.01       0.01   -1.34     0.19
## I(hp - mean(hp))        -0.04       0.01   -2.78     0.01
The intercept, 16.95 +/- 2.99 (estimate +/- standard error), should be interpreted as the predicted fuel economy in mpg for a car with 4 cylinders, an automatic transmission, and a V-shaped engine (vs = 0, the reference level) that has the mean horsepower and displacement of all cars in the data set. Holding the other predictors fixed, our model estimates the difference between a car with a manual and an automatic transmission at 4.32 +/- 1.57 mpg, with the manual transmission delivering superior fuel economy.
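The +/- values quoted here are standard errors. One way to turn the transmission estimate into an interval is `confint()`, which uses the t distribution with the model's residual degrees of freedom. The sketch below refits the same model so that it runs on its own; `ci_am`, `est`, and `manual_ci` are names introduced here for illustration.

```r
# Illustrative sketch; `ci_am`, `est`, and `manual_ci` are hypothetical names.
data(mtcars)
fit4 <- lm(mpg ~ factor(vs) + factor(am) + factor(cyl) +
             I(disp - mean(disp)) + I(hp - mean(hp)), data = mtcars)

# 95% confidence interval for the manual-vs-automatic difference
ci_am <- confint(fit4, parm = "factor(am)1", level = 0.95)
round(ci_am, 2)

# The same interval by hand: estimate +/- t quantile * standard error
est <- coef(summary(fit4))["factor(am)1", c("Estimate", "Std. Error")]
manual_ci <- est[["Estimate"]] +
  c(-1, 1) * qt(0.975, df.residual(fit4)) * est[["Std. Error"]]
round(manual_ci, 2)
```

An interval that excludes zero is consistent with the small p-value reported for this coefficient in the table above.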
Now we look at the residuals to check model fit. Neither the plot of the residuals vs. the predictor of interest (transmission type) nor the plot of the residuals vs. the model-predicted outcome shows a pattern that would suggest poor model fit.
Residual diagnostics show which cars were most influential in the fit. The dfbetas values measure how much each coefficient changes, in units of its standard error, when a given car is left out of the fit, while the hat values measure each car's leverage, i.e. how strongly its observed mpg influences its own fitted value. The Maserati Bora had the highest hat value at 0.526, and the Toyota Corolla had the highest dfbetas value for the transmission type coefficient at 0.419. The Hornet 4 Drive had the largest absolute dfbetas value for the 6-cylinder coefficient at 0.347; for the intercept, the largest absolute value (0.325) also belongs to the Toyota Corolla. In all appendix plots, the Maserati, Toyota, and Hornet are highlighted in blue, red, and green respectively.
round(dfbetas(fit4),3)
## (Intercept) factor(vs)1 factor(am)1 factor(cyl)6
## Mazda RX4 -0.022 0.088 -0.045 -0.067
## Mazda RX4 Wag -0.022 0.088 -0.045 -0.067
## Datsun 710 0.110 -0.313 -0.383 0.124
## Hornet 4 Drive -0.109 0.349 0.093 0.347
## Hornet Sportabout -0.039 -0.014 0.025 0.039
## Valiant 0.046 -0.097 -0.008 -0.117
## Duster 360 0.094 -0.043 -0.104 -0.071
## Merc 240D 0.051 -0.023 -0.050 -0.050
## Merc 230 0.013 -0.006 -0.014 -0.012
## Merc 280 -0.004 0.008 -0.006 0.013
## Merc 280C 0.057 -0.115 0.080 -0.175
## Merc 450SE 0.012 0.007 0.022 -0.017
## Merc 450SL -0.017 -0.010 -0.031 0.024
## Merc 450SLC 0.051 0.030 0.094 -0.072
## Cadillac Fleetwood -0.163 0.006 -0.163 0.187
## Lincoln Continental -0.159 0.019 -0.083 0.169
## Chrysler Imperial 0.175 -0.037 -0.002 -0.170
## Fiat 128 -0.209 0.232 0.298 0.044
## Honda Civic -0.054 0.050 0.070 0.019
## Toyota Corolla -0.325 0.340 0.419 0.085
## Toyota Corona -0.266 0.120 0.309 0.239
## Dodge Challenger 0.108 0.002 -0.018 -0.109
## AMC Javelin 0.149 0.008 0.002 -0.154
## Camaro Z28 0.016 -0.008 -0.021 -0.012
## Pontiac Firebird -0.011 0.000 0.155 -0.010
## Fiat X1-9 0.112 -0.125 -0.161 -0.023
## Porsche 914-2 0.323 -0.355 -0.105 -0.261
## Lotus Europa -0.007 0.185 0.149 -0.134
## Ford Pantera L 0.074 -0.083 -0.140 -0.037
## Ferrari Dino 0.062 -0.083 -0.021 0.016
## Maserati Bora -0.016 0.072 0.043 -0.006
## Volvo 142E -0.003 -0.336 -0.383 0.262
## factor(cyl)8 I(disp - mean(disp)) I(hp - mean(hp))
## Mazda RX4 0.011 -0.012 0.053
## Mazda RX4 Wag 0.011 -0.012 0.053
## Datsun 710 -0.033 -0.146 0.075
## Hornet 4 Drive -0.020 0.384 -0.122
## Hornet Sportabout 0.120 0.058 -0.198
## Valiant -0.026 -0.052 0.031
## Duster 360 -0.060 -0.038 0.143
## Merc 240D -0.041 0.005 0.007
## Merc 230 -0.011 -0.002 0.008
## Merc 280 0.005 -0.007 0.005
## Merc 280C -0.066 0.093 -0.063
## Merc 450SE -0.038 0.051 0.007
## Merc 450SL 0.053 -0.071 -0.009
## Merc 450SLC -0.162 0.216 0.028
## Cadillac Fleetwood 0.241 -0.615 0.157
## Lincoln Continental 0.212 -0.444 0.051
## Chrysler Imperial -0.203 0.298 0.067
## Fiat 128 0.209 -0.014 -0.230
## Honda Civic 0.055 0.000 -0.069
## Toyota Corolla 0.343 -0.084 -0.334
## Toyota Corona 0.207 0.099 -0.197
## Dodge Challenger -0.207 0.072 0.223
## AMC Javelin -0.293 0.155 0.270
## Camaro Z28 -0.009 -0.011 0.027
## Pontiac Firebird 0.054 0.377 -0.373
## Fiat X1-9 -0.112 0.006 0.124
## Porsche 914-2 -0.270 0.039 0.032
## Lotus Europa -0.030 -0.027 0.146
## Ford Pantera L -0.049 -0.044 -0.025
## Ferrari Dino -0.052 -0.036 0.066
## Maserati Bora 0.005 -0.106 0.277
## Volvo 142E 0.133 -0.238 -0.110
round(hatvalues(fit4),3)
## Mazda RX4 Mazda RX4 Wag Datsun 710
## 0.293 0.293 0.127
## Hornet 4 Drive Hornet Sportabout Valiant
## 0.272 0.102 0.230
## Duster 360 Merc 240D Merc 230
## 0.132 0.203 0.253
## Merc 280 Merc 280C Merc 450SE
## 0.240 0.240 0.181
## Merc 450SL Merc 450SLC Cadillac Fleetwood
## 0.181 0.181 0.277
## Lincoln Continental Chrysler Imperial Fiat 128
## 0.228 0.177 0.149
## Honda Civic Toyota Corolla Toyota Corona
## 0.180 0.152 0.274
## Dodge Challenger AMC Javelin Camaro Z28
## 0.167 0.179 0.139
## Pontiac Firebird Fiat X1-9 Porsche 914-2
## 0.143 0.149 0.458
## Lotus Europa Ford Pantera L Ferrari Dino
## 0.131 0.273 0.333
## Maserati Bora Volvo 142E
## 0.526 0.136
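Rather than scanning these tables by eye, the most influential car for each coefficient can be extracted programmatically. This is a minimal sketch under the same model specification as fit4; the names `h`, `db`, and `top_cars` are introduced here for illustration.

```r
# Illustrative sketch; `h`, `db`, and `top_cars` are hypothetical names.
data(mtcars)
fit4 <- lm(mpg ~ factor(vs) + factor(am) + factor(cyl) +
             I(disp - mean(disp)) + I(hp - mean(hp)), data = mtcars)

h  <- hatvalues(fit4)  # leverage of each car
db <- dfbetas(fit4)    # standardized change in each coefficient per car

# Car with the highest leverage
names(which.max(h))

# For every coefficient, the car with the largest absolute dfbetas value
top_cars <- rownames(db)[apply(abs(db), 2, which.max)]
data.frame(coefficient = colnames(db), car = top_cars)
```

The hat values always sum to the number of estimated coefficients (7 in this model), which gives a quick sanity check on the output.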
data(mtcars)
plot(mtcars$am,mtcars$mpg, xaxt="n", ylab="MPG",xlim=c(-0.5,1.5),xlab="Transmission Type")
axis(1,at=c(0,1),labels=c("Automatic","Manual"))
abline(fit1)
points(mtcars[c("Maserati Bora","Toyota Corolla","Hornet 4 Drive"),]$am,mtcars[c("Maserati Bora","Toyota Corolla","Hornet 4 Drive"),]$mpg,col=c("blue","red","forestgreen"),cex=2)
MPG vs. transmission type with regression based on predicting mpg with transmission type only.
plot(factor(mtcars$am),mtcars$mpg,xaxt="n",ylab="MPG")
axis(1,at=c(1,2),labels=c("Automatic","Manual"))
Box plot of mpg vs. transmission type, illustrating difference in distribution.
pairs(mtcars)
Pairwise plot of all variables in the mtcars data set
plot(mtcars$am,resid(fit4),xaxt="n",xlim=c(-0.5,1.5),xlab="Transmission Type",ylab="Residuals")
axis(1,at=c(0,1),labels=c("Automatic","Manual"))
points(mtcars[c("Maserati Bora","Toyota Corolla","Hornet 4 Drive"),]$am,resid(fit4)[c("Maserati Bora","Toyota Corolla","Hornet 4 Drive")],col=c("blue","red","forestgreen"),cex=2)
Residuals vs. transmission type
plot(predict(fit4),resid(fit4),xlab="Predicted MPG",ylab="Residuals")
points(predict(fit4)[c("Maserati Bora","Toyota Corolla","Hornet 4 Drive")],resid(fit4)[c("Maserati Bora","Toyota Corolla","Hornet 4 Drive")],col=c("blue","red","forestgreen"),cex=2)
Residuals vs. predicted values