as of August 28, 2014, superceding the version of August 24. Always use the most recent version.
In this project, we use the “2014 EPA fuel economy” dataset to construct a completely randomized block “pseudo-design” with three factors and two blocks. The purpose of the experiment to test whehter the number of cylinders or/and drive type or/tr/and transmission type contribute to the variation in vehicle highway fuel economy, under the control of two blockings: year and vehicle make.
install.packages("fueleconomy")
## Installing package into 'C:/Users/wei/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)
## Error: trying to use CRAN without setting a mirror
library("fueleconomy", lib.loc="C:/Users/wei/Documents/R/win-library/3.1")
data1<-vehicles
head(data1)
## id make model year class
## 1 27550 AM General DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 2 28426 AM General DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 3 27549 AM General FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 4 28425 AM General FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 5 1032 AM General Post Office DJ5 2WD 1985 Special Purpose Vehicle 2WD
## 6 1033 AM General Post Office DJ8 2WD 1985 Special Purpose Vehicle 2WD
## trans drive cyl displ fuel hwy cty
## 1 Automatic 3-spd 2-Wheel Drive 4 2.5 Regular 17 18
## 2 Automatic 3-spd 2-Wheel Drive 4 2.5 Regular 17 18
## 3 Automatic 3-spd 2-Wheel Drive 6 4.2 Regular 13 13
## 4 Automatic 3-spd 2-Wheel Drive 6 4.2 Regular 13 13
## 5 Automatic 3-spd Rear-Wheel Drive 4 2.5 Regular 17 16
## 6 Automatic 3-spd Rear-Wheel Drive 6 4.2 Regular 13 13
summary(data1)
## id make model year
## Min. : 1 Length:33442 Length:33442 Min. :1984
## 1st Qu.: 8361 Class :character Class :character 1st Qu.:1991
## Median :16724 Mode :character Mode :character Median :1999
## Mean :17038 Mean :1999
## 3rd Qu.:25265 3rd Qu.:2008
## Max. :34932 Max. :2015
##
## class trans drive cyl
## Length:33442 Length:33442 Length:33442 Min. : 2.00
## Class :character Class :character Class :character 1st Qu.: 4.00
## Mode :character Mode :character Mode :character Median : 6.00
## Mean : 5.77
## 3rd Qu.: 6.00
## Max. :16.00
## NA's :58
## displ fuel hwy cty
## Min. :0.00 Length:33442 Min. : 9.0 Min. : 6.0
## 1st Qu.:2.30 Class :character 1st Qu.: 19.0 1st Qu.: 15.0
## Median :3.00 Mode :character Median : 23.0 Median : 17.0
## Mean :3.35 Mean : 23.6 Mean : 17.5
## 3rd Qu.:4.30 3rd Qu.: 27.0 3rd Qu.: 20.0
## Max. :8.40 Max. :109.0 Max. :138.0
## NA's :57
As a pilot study, we select a subsample from the original database to conduct the analysis: vehicles made by Honda, Bentley, GMC from year 2007 and year 2010 are selected.
# Logical vector identifying all vehicles with from year 2007 and year 2014
data2<-subset(data1,data1$year==2007|data1$year==2010)
# Logical vector identifying all vehicles from Honda, Bentley and GMC
data3<-subset(data2, data2$make=="Honda"|data2$make=="Bentley"|data2$make=="GMC")
head(data3)
## id make model year class
## 2644 23556 Bentley Arnage 2007 Midsize Cars
## 2651 23601 Bentley Arnage LWB 2007 Large Cars
## 2659 23488 Bentley Azure 2007 Subcompact Cars
## 2662 29707 Bentley Azure 2010 Compact Cars
## 2664 29708 Bentley Brooklands 2010 Compact Cars
## 2669 23239 Bentley Continental Flying Spur 2007 Midsize Cars
## trans drive cyl displ fuel hwy cty
## 2644 Automatic (S6) Rear-Wheel Drive 8 6.7 Premium 15 10
## 2651 Automatic (S6) Rear-Wheel Drive 8 6.7 Premium 14 10
## 2659 Automatic (S6) Rear-Wheel Drive 8 6.7 Premium 15 10
## 2662 Automatic (A6) Rear-Wheel Drive 8 6.8 Premium 15 9
## 2664 Automatic (A6) Rear-Wheel Drive 8 6.8 Premium 15 9
## 2669 Automatic (S6) 4-Wheel or All-Wheel Drive 12 6.0 Premium 17 10
summary(data3)
## id make model year
## Min. :22860 Length:186 Length:186 Min. :2007
## 1st Qu.:23374 Class :character Class :character 1st Qu.:2007
## Median :23851 Mode :character Mode :character Median :2007
## Mean :26103 Mean :2008
## 3rd Qu.:29436 3rd Qu.:2010
## Max. :29944 Max. :2010
## class trans drive cyl
## Length:186 Length:186 Length:186 Min. : 4.00
## Class :character Class :character Class :character 1st Qu.: 4.00
## Mode :character Mode :character Mode :character Median : 6.00
## Mean : 6.47
## 3rd Qu.: 8.00
## Max. :12.00
## displ fuel hwy cty
## Min. :1.30 Length:186 Min. :14.0 Min. : 9.0
## 1st Qu.:2.90 Class :character 1st Qu.:18.0 1st Qu.:13.0
## Median :4.30 Mode :character Median :20.0 Median :15.0
## Mean :4.18 Mean :22.4 Mean :16.5
## 3rd Qu.:5.30 3rd Qu.:24.0 3rd Qu.:18.0
## Max. :6.80 Max. :45.0 Max. :40.0
The selected subsample dataset has three test factors: transmission type (with 10 levels), number of cylinders (with 5 levels), and drive type (with 5 levels); and two blocking factors: vehicle make (with 3 levels) and year (with two levels)
data3$trans<-as.factor(data3$trans)
nlevels(data3$trans)
## [1] 10
data3$cyl<-as.factor(data3$cyl)
nlevels(data3$cyl)
## [1] 5
data3$drive<-as.factor(data3$drive)
nlevels(data3$drive)
## [1] 5
data3$make<-as.factor(data3$make)
nlevels(data3$make)
## [1] 3
data3$year<-as.factor(data3$year)
nlevels(data3$year)
## [1] 2
In the selected subsample dataset, there's no strict continuous variable, however, the hwy (highway fuel economy) maybe considered as a continuous variable conceptually, although all of them are integers in this dataset.
The response variable is the vehicle highway fuel economy.
ThE original dataset summarizes the Envrionmental Protection Agency (EPA) fuel economy data from 1985 to 2014, it contains the highway and city fuel economy information and data on vehicle characteristics such as number of cylinders, make, model and so on.For this study,we select a subsample: vehicles made by Honda, Bentley, GMC from year 2007 and year 2010.
EPA cooperates with Natioanl Highway Traffic Safety Administration to test the fuel economy data for all new cars and light trucks, and update the fuel economy data for old vehicles every year. Since this is not a speically designed experiment conducted under control, the data are not completely random. However, sicne the dataset includes all the produced vehicle make and models, it is very representative.
The purpose of this project is to design a completely random blocked experiment. To do so, we select two nuisance variables (year and make) as the blocking variables and test the effect of three factors (transmission type, number of cylinders and drive type) on the vehicle highway fuel economy. The null hypothesis is that the variation in highway fuel economy is simply due to sample randomization (i.e., the effects of the three selected variables are zero). As a pilot study, we reduce the sample size by selecting vehicles made in year 2007 and 2010, from make “Bently”, “GMC” and “Honda”.
Since all the new vehicle make/models are required to take the fuel economy examination at EPA, the data are random and represent the population well.
The fuel economy exam is carried out for all new vehicles before they are sent to the market. Therefore, for each vehicle there is no replicate/repeated measures. However, repeated measures maybe taken for the same make/model in different years.
Two blocking variables are used in the design: vehicle make and year. They are selected based on three main criteria: 1. controllable 2. Known 3. Not the interest of the study. The boxplots shown below also prove that the above criteria has been satified: The within group variation is smaller than the between group variation, indicating that they are nuisance variables that we should control for.
# All Acura records in the dataframe
boxplot(hwy~year,data=data3, xlab="Year", ylab="Highway Fuel Economy (mpg)")
title("Boxplot of Highway Fuel Economy at different Years")
boxplot(hwy~make,data=data3, xlab="Vehicle Make", ylab="Highway Fuel Economy (mpg)")
title("Boxplot of Highway Fuel Economy for different Vehicle Makes")
We use boxplots to compare the highway fuel economy of vehicles with various transmission types, number of cylinders, and drive types. The boxplots show that the median highway fuel economy do vary across different groups, indicating that they may contribute to the variation in highway fuel economy, the analysis of variance conducted in the following section further confirm this finding.
boxplot(hwy~trans,data=data3, xlab="Transmission Type", ylab="Highway Fuel Economy (mpg)",names=c("AV-S7","A6","S5","S6","A(Variable)","A4","A5","Auto6","M5","M6"))
title("Boxplot of Highway Fuel Economy vs. Transmission Type")
boxplot(hwy~cyl,data=data3, xlab="Number of Cylinders", ylab="Highway Fuel Economy (mpg)")
title("Boxplot of Highway Fuel Economy vs. Number of Cylinders")
boxplot(hwy~drive,data=data3, xlab="Drive Type", ylab="Highway Fuel Economy (mpg)",names=c("4W","4W/AW","AW","FW","RW"))
title("Boxplot of Highway Fuel Economy vs. Drive Type")
The summary of the ANOVA shows below gives the p-values for each individual factor, along with the p-value for the interactions between the factors. The null hypothesis states that the variation in the response variable, highway fuel economy, cannot be explained by anything other than sample randomization. We estimate seven models to test our hypothesis: model1 includes all three testing factors and their interaction effect, model2 adds in one blocking factor year, model3 adds in the other blocking factormake, model4 includes both of the blocking factors, model5, model6 and model7 excludes the blocking factors and test each of the factors seperately, their results are mainly used for the following Tukey's HSD Test.
The difference in model1, model2, model3 and model4 shows that by adding in the blocking variables, the precision of the estimation results have increased. We have selected model4 as our final model based on conceptual validity and statistical significance, the p-value shown in the estimation result shows that it is highly possible that the variation in highway fuel economy is not due to sample randomization only, the transmission type, number of cylinders, drive type and the interaction between transmission type and number of cylinders may contribute to the such variation.
model1 <- aov(hwy~ trans+cyl+drive+trans*cyl*drive, data = data3)
summary(model1)
## Df Sum Sq Mean Sq F value Pr(>F)
## trans 9 3918 435 96.18 < 2e-16 ***
## cyl 4 1221 305 67.43 < 2e-16 ***
## drive 4 648 162 35.80 < 2e-16 ***
## trans:cyl 6 206 34 7.59 4.7e-07 ***
## trans:drive 12 46 4 0.85 0.60
## cyl:drive 9 15 2 0.36 0.95
## trans:cyl:drive 2 0 0 0.05 0.95
## Residuals 139 629 5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model2 <- aov(hwy~ trans+cyl+drive+trans* cyl*drive+year, data = data3)
summary(model2)
## Df Sum Sq Mean Sq F value Pr(>F)
## trans 9 3918 435 95.79 < 2e-16 ***
## cyl 4 1221 305 67.16 < 2e-16 ***
## drive 4 648 162 35.65 < 2e-16 ***
## year 1 0 0 0.00 1.00
## trans:cyl 6 208 35 7.61 4.5e-07 ***
## trans:drive 12 47 4 0.86 0.59
## cyl:drive 9 14 2 0.35 0.96
## trans:cyl:drive 2 1 0 0.06 0.94
## Residuals 138 627 5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model3 <- aov(hwy~ trans+cyl+drive+trans*cyl*drive+year+make, data = data3)
summary(model3)
## Df Sum Sq Mean Sq F value Pr(>F)
## trans 9 3918 435 95.18 < 2e-16 ***
## cyl 4 1221 305 66.73 < 2e-16 ***
## drive 4 648 162 35.42 < 2e-16 ***
## year 1 0 0 0.00 0.9960
## make 1 46 46 10.15 0.0018 **
## trans:cyl 6 170 28 6.20 8.9e-06 ***
## trans:drive 12 39 3 0.70 0.7476
## cyl:drive 9 14 2 0.35 0.9564
## trans:cyl:drive 2 0 0 0.03 0.9682
## Residuals 137 627 5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model4 <- aov(hwy~ trans+cyl+drive+trans*cyl+drive+year+make, data = data3)
summary(model4)
## Df Sum Sq Mean Sq F value Pr(>F)
## trans 9 3918 435 102.46 < 2e-16 ***
## cyl 4 1221 305 71.83 < 2e-16 ***
## drive 4 648 162 38.13 < 2e-16 ***
## year 1 0 0 0.00 0.9958
## make 1 46 46 10.92 0.0012 **
## trans:cyl 6 170 28 6.67 2.6e-06 ***
## Residuals 160 680 4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model5<-aov(hwy~trans,data=data3)
model6<-aov(hwy~cyl,data=data3)
model7<-aov(hwy~drive,data=data3)
Tukey's Honestly Significantly Difference is used after an ANOVA to determine which specific sample means are significantly different from the others. Tukey's HSD returns a matrix that contains statistical parameters for the interaction of an individual sample mean with every other sample mean being statistically analyzed. In this case, we can see that there are statistical significant difference between most pairs of the sampe mean, for each of the selected testing factors (with p-value <0.1)
#Tukey's HSD
tukey1<-TukeyHSD(model5, ordered = FALSE, conf.level = 0.9,names=c("AV-S7","A6","S5","S6","A(Variable)","A4","A5","Auto6","M5","M6"))
plot(tukey1)
tukey2<-TukeyHSD(model6, ordered = FALSE, conf.level = 0.9)
plot(tukey2)
tukey3<-TukeyHSD(model7, ordered = FALSE, conf.level = 0.9,names=c("4W","4W/AW","AW","FW","RW"))
plot(tukey3)
The QQ-plot shows that the sub-sample selected generally follows the normal distribution assuption and thus the application of anova is appropriate.
#qq-plot
qqnorm(residuals(model4), main="Normal Q-Q Plot for Highway Fuel Economy", ylab="Average Model Residuals")
qqline(residuals(model4))
The residual vs. Fit plot shows that the residuals are generally randomly distributed and thus the model estimation results are reliable.
#Residual vs Fit
plot(fitted(model4),residuals(model4), main="Residual vs Fitted Plot for Highway Fuel Economy")
Since the lines in the plots are not parallel to each other, the interaction between selected factors does exist and should be taken into consider.
#Interaction Plot
interaction.plot(data3$drive, data3$cyl, data3$hwy)
interaction.plot(data3$drive, data3$trans, data3$hwy)