The purpose of this statistical analysis is to observe the effects of three different factors (drive of the car, fuel type, and number of cylinders) on a single response variable (city gas mileage) while blocking for two different nuisance factors (year and class of the car) using EPA’s fuel economy dataset.
library("fueleconomy")
str(y)
## Classes 'tbl_df', 'tbl' and 'data.frame': 9210 obs. of 12 variables:
## $ id : int 12674 13422 14164 16573 17489 18458 18459 15069 15871 16734 ...
## $ make : chr "Acura" "Acura" "Acura" "Acura" ...
## $ model: chr "2.5TL/3.2TL" "2.5TL/3.2TL" "2.5TL/3.2TL" "3.2CL" ...
## $ year : int 1996 1997 1998 2001 2002 2003 2003 1999 2000 2001 ...
## $ class: chr "Compact Cars" "Compact Cars" "Compact Cars" "Compact Cars" ...
## $ trans: chr "Automatic 4-spd" "Automatic 4-spd" "Automatic 4-spd" "Automatic (S5)" ...
## $ drive: chr "Front-Wheel" "Front-Wheel" "Front-Wheel" "Front-Wheel" ...
## $ cyl : int 6 6 6 6 6 6 6 6 6 6 ...
## $ displ: num 3.2 3.2 3.2 3.2 3.2 3.2 3.2 3.2 3.2 3.2 ...
## $ fuel : chr "Premium" "Premium" "Premium" "Premium" ...
## $ hwy : int 22 22 22 27 27 26 27 25 27 27 ...
## $ cty : int 17 17 17 17 17 17 17 17 17 17 ...
For this analysis we are interested in three explanatory variables: drive of the car (drive), number of cylinders (cyl), and fuel type (fuel).
summary(as.factor(drive))
## All-Wheel Front-Wheel Rear-Wheel
## 2781 4299 2130
summary(as.factor(cyl))
## 4 6 8
## 3530 3748 1932
summary(as.factor(fuel))
## Premium Regular
## 3020 6190
We will also be blocking on two variables: year and class because they are both known and controllable nuisance variables that are not of interest to us in the analysis. We will combine them into one blocking factor in the form of class.year.
summary(as.factor(year))
## 1995-2000 2000-2005 2005-2010 2010-2015
## 1373 2496 2841 2500
summary(as.factor(class))
## Compact Large Midsize SUV
## 2677 989 2225 3319
block <- paste(class, year, sep=".")
The response variale of this analysis is city gas milage in MPG (cty).
summary(cty)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 15.0 17.0 18.2 20.0 53.0
This is a completely randomized blocked designs with multiple factors. The purpose of the experiment is to monitor the effects of fuel type, car drive, and and number of cylinders on the response variable of city gas mileage. The blocking factor class/year is used to reduce noise in the fitted model.
boxplot(cty~drive)
boxplot(cty~cyl)
boxplot(cty~fuel)
boxplot(cty~year)
boxplot(cty~class)
interaction.plot(drive,cyl,cty)
interaction.plot(drive,fuel,cty)
interaction.plot(cyl,fuel,cty)
All of the factors appear to have a significant difference in means accross levels and there do not seem to be any major interactions between the three factors.
model <- aov(cty~drive*fuel*cyl+block) #include interaction terms just for illustration
anova(model)
## Analysis of Variance Table
##
## Response: cty
## Df Sum Sq Mean Sq F value Pr(>F)
## drive 2 52185 26092 3486.0 < 2e-16 ***
## fuel 1 1010 1010 135.0 < 2e-16 ***
## cyl 1 45243 45243 6044.6 < 2e-16 ***
## block 15 18803 1254 167.5 < 2e-16 ***
## drive:fuel 2 370 185 24.7 2.0e-11 ***
## drive:cyl 2 1105 553 73.8 < 2e-16 ***
## fuel:cyl 1 491 491 65.6 6.1e-16 ***
## drive:fuel:cyl 2 1162 581 77.6 < 2e-16 ***
## Residuals 9183 68734 7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the ANOVA results we see that each of our three factors are found to have a significant difference in means, thus we can reject the null hypothesis that randomization alone can account for the variation in city gas milage.
Drive of the car, fuel type, and number of cylinders are all explanatory factors in the variation of city gas milage.
TukeyHSD(aov(cty~as.factor(drive)))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = cty ~ as.factor(drive))
##
## $`as.factor(drive)`
## diff lwr upr p adj
## Front-Wheel-All-Wheel 4.4818 4.2619 4.7018 0
## Rear-Wheel-All-Wheel -0.6185 -0.8788 -0.3583 0
## Rear-Wheel-Front-Wheel -5.1004 -5.3399 -4.8609 0
aggregate(cty, by=list(drive), FUN=mean)
## Group.1 x
## 1 All-Wheel 16.22
## 2 Front-Wheel 20.70
## 3 Rear-Wheel 15.60
Front-wheel drive gets the best gas milage (mean 20.70 MPG) while Rear-wheel and All-Wheel get worse, more similar gas milage (mean 15.60 and 16.22 MPG)
TukeyHSD(aov(cty~as.factor(fuel)))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = cty ~ as.factor(fuel))
##
## $`as.factor(fuel)`
## diff lwr upr p adj
## Regular-Premium 2.334 2.142 2.525 0
aggregate(cty, by=list(fuel), FUN=mean)
## Group.1 x
## 1 Premium 16.60
## 2 Regular 18.93
Surprisingly, regular gas gets better gas milage than premium gas by a difference of 2.33 MPG.
TukeyHSD(aov(cty~as.factor(cyl)))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = cty ~ as.factor(cyl))
##
## $`as.factor(cyl)`
## diff lwr upr p adj
## 6-4 -5.254 -5.430 -5.078 0
## 8-4 -8.106 -8.318 -7.893 0
## 8-6 -2.852 -3.062 -2.641 0
aggregate(cty, by=list(cyl), FUN=mean)
## Group.1 x
## 1 4 22.01
## 2 6 16.75
## 3 8 13.90
Number of cylinders has the biggest differences in means with 4 cylinder cars getting over 5 MPG more than 6 cylinder cars, which get almost 3 MPG more than 8 cylinder cars.
qqnorm(residuals(model)); qqline(residuals(model))
Unfortunately, the residuals of the model are not normally distributed on the high tail because city gas milage is not normally distributed. This indicates that the model may not be the most appropriate fit.