Recipe 5: Blocked Designs with multiple explanatory and nuisance factors

Recipes for the Design of Experiments: Recipe Outline

as of August 28, 2014, superceding the version of August 24. Always use the most recent version.

Recipe 5: Blocked Designs with multiple explanatory and nuisance factors

Wei Zou

RPI

Oct 20, 2014 Version 1

1. Setting

System under test

In this project, we use the “2014 EPA fuel economy” dataset to construct a completely randomized block “pseudo-design” with three factors and two blocks. The purpose of the experiment to test whehter the number of cylinders or/and drive type or/tr/and transmission type contribute to the variation in vehicle highway fuel economy, under the control of two blockings: year and vehicle make.

install.packages("fueleconomy")
## Installing package into 'C:/Users/wei/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)
## Error: trying to use CRAN without setting a mirror
library("fueleconomy", lib.loc="C:/Users/wei/Documents/R/win-library/3.1")
data1<-vehicles
head(data1)
##      id       make               model year                       class
## 1 27550 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 2 28426 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 3 27549 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 4 28425 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 5  1032 AM General Post Office DJ5 2WD 1985 Special Purpose Vehicle 2WD
## 6  1033 AM General Post Office DJ8 2WD 1985 Special Purpose Vehicle 2WD
##             trans            drive cyl displ    fuel hwy cty
## 1 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 2 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 3 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 4 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 5 Automatic 3-spd Rear-Wheel Drive   4   2.5 Regular  17  16
## 6 Automatic 3-spd Rear-Wheel Drive   6   4.2 Regular  13  13
summary(data1)
##        id            make              model                year     
##  Min.   :    1   Length:33442       Length:33442       Min.   :1984  
##  1st Qu.: 8361   Class :character   Class :character   1st Qu.:1991  
##  Median :16724   Mode  :character   Mode  :character   Median :1999  
##  Mean   :17038                                         Mean   :1999  
##  3rd Qu.:25265                                         3rd Qu.:2008  
##  Max.   :34932                                         Max.   :2015  
##                                                                      
##     class              trans              drive                cyl       
##  Length:33442       Length:33442       Length:33442       Min.   : 2.00  
##  Class :character   Class :character   Class :character   1st Qu.: 4.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.00  
##                                                           Mean   : 5.77  
##                                                           3rd Qu.: 6.00  
##                                                           Max.   :16.00  
##                                                           NA's   :58     
##      displ          fuel                hwy             cty       
##  Min.   :0.00   Length:33442       Min.   :  9.0   Min.   :  6.0  
##  1st Qu.:2.30   Class :character   1st Qu.: 19.0   1st Qu.: 15.0  
##  Median :3.00   Mode  :character   Median : 23.0   Median : 17.0  
##  Mean   :3.35                      Mean   : 23.6   Mean   : 17.5  
##  3rd Qu.:4.30                      3rd Qu.: 27.0   3rd Qu.: 20.0  
##  Max.   :8.40                      Max.   :109.0   Max.   :138.0  
##  NA's   :57

Sub-sampling

As a pilot study, we select a subsample from the original database to conduct the analysis: vehicles made by Honda, Bentley, GMC from year 2007 and year 2010 are selected.

# Logical vector identifying all vehicles with from year 2007 and year 2014
data2<-subset(data1,data1$year==2007|data1$year==2010)
# Logical vector identifying all vehicles from Honda, Bentley and GMC
data3<-subset(data2, data2$make=="Honda"|data2$make=="Bentley"|data2$make=="GMC")
head(data3)
##         id    make                   model year           class
## 2644 23556 Bentley                  Arnage 2007    Midsize Cars
## 2651 23601 Bentley              Arnage LWB 2007      Large Cars
## 2659 23488 Bentley                   Azure 2007 Subcompact Cars
## 2662 29707 Bentley                   Azure 2010    Compact Cars
## 2664 29708 Bentley              Brooklands 2010    Compact Cars
## 2669 23239 Bentley Continental Flying Spur 2007    Midsize Cars
##               trans                      drive cyl displ    fuel hwy cty
## 2644 Automatic (S6)           Rear-Wheel Drive   8   6.7 Premium  15  10
## 2651 Automatic (S6)           Rear-Wheel Drive   8   6.7 Premium  14  10
## 2659 Automatic (S6)           Rear-Wheel Drive   8   6.7 Premium  15  10
## 2662 Automatic (A6)           Rear-Wheel Drive   8   6.8 Premium  15   9
## 2664 Automatic (A6)           Rear-Wheel Drive   8   6.8 Premium  15   9
## 2669 Automatic (S6) 4-Wheel or All-Wheel Drive  12   6.0 Premium  17  10
summary(data3)
##        id            make              model                year     
##  Min.   :22860   Length:186         Length:186         Min.   :2007  
##  1st Qu.:23374   Class :character   Class :character   1st Qu.:2007  
##  Median :23851   Mode  :character   Mode  :character   Median :2007  
##  Mean   :26103                                         Mean   :2008  
##  3rd Qu.:29436                                         3rd Qu.:2010  
##  Max.   :29944                                         Max.   :2010  
##     class              trans              drive                cyl       
##  Length:186         Length:186         Length:186         Min.   : 4.00  
##  Class :character   Class :character   Class :character   1st Qu.: 4.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.00  
##                                                           Mean   : 6.47  
##                                                           3rd Qu.: 8.00  
##                                                           Max.   :12.00  
##      displ          fuel                hwy            cty      
##  Min.   :1.30   Length:186         Min.   :14.0   Min.   : 9.0  
##  1st Qu.:2.90   Class :character   1st Qu.:18.0   1st Qu.:13.0  
##  Median :4.30   Mode  :character   Median :20.0   Median :15.0  
##  Mean   :4.18                      Mean   :22.4   Mean   :16.5  
##  3rd Qu.:5.30                      3rd Qu.:24.0   3rd Qu.:18.0  
##  Max.   :6.80                      Max.   :45.0   Max.   :40.0

Factors and Levels

The selected subsample dataset has three test factors: transmission type (with 10 levels), number of cylinders (with 5 levels), and drive type (with 5 levels); and two blocking factors: vehicle make (with 3 levels) and year (with two levels)

data3$trans<-as.factor(data3$trans)
nlevels(data3$trans)
## [1] 10
data3$cyl<-as.factor(data3$cyl)
nlevels(data3$cyl)
## [1] 5
data3$drive<-as.factor(data3$drive)
nlevels(data3$drive)
## [1] 5
data3$make<-as.factor(data3$make)
nlevels(data3$make)
## [1] 3
data3$year<-as.factor(data3$year)
nlevels(data3$year)
## [1] 2

Continuous variables (if any)

In the selected subsample dataset, there's no strict continuous variable, however, the hwy (highway fuel economy) maybe considered as a continuous variable conceptually, although all of them are integers in this dataset.

Response variables

The response variable is the vehicle highway fuel economy.

The Data: How is it organized and what does it look like?

ThE original dataset summarizes the Envrionmental Protection Agency (EPA) fuel economy data from 1985 to 2014, it contains the highway and city fuel economy information and data on vehicle characteristics such as number of cylinders, make, model and so on.For this study,we select a subsample: vehicles made by Honda, Bentley, GMC from year 2007 and year 2010.

Randomization

EPA cooperates with Natioanl Highway Traffic Safety Administration to test the fuel economy data for all new cars and light trucks, and update the fuel economy data for old vehicles every year. Since this is not a speically designed experiment conducted under control, the data are not completely random. However, sicne the dataset includes all the produced vehicle make and models, it is very representative.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

The purpose of this project is to design a completely random blocked experiment. To do so, we select two nuisance variables (year and make) as the blocking variables and test the effect of three factors (transmission type, number of cylinders and drive type) on the vehicle highway fuel economy. The null hypothesis is that the variation in highway fuel economy is simply due to sample randomization (i.e., the effects of the three selected variables are zero). As a pilot study, we reduce the sample size by selecting vehicles made in year 2007 and 2010, from make “Bently”, “GMC” and “Honda”.

What is the rationale for this design?

  1. For transmissions, normally we may understand that automative drive with more speed levels will tend to produce higher fuel economy, due to the fact that the transition between various speed levels is more smooth and continuous, we are interested in seeing whether this assumption is valid.
  2. For the number of cylinders, the more cylinders the more power, and thus the lower fuel economy (i.e., less mileage traveled per gallon of gasoline). However, this might not be true in the new vehicle models, with more advanced technologies, the design of engine might be able to increase fuel economy while maintaining the same level of power. Therefore the effect of the number of cylinders on fuel economy is not clear and this project is aiming to provide more insights to the subject.
  3. For the drive type, its effect on fuel economy may vary under different conditions, for example, on snow days, vehicles with all wheel drive may have higher fuel economy since it requires less engine power to travel on roads. Therefore it is interesting to see whehter this factor truly contribute to the variation in highway fuel economy.
  4. The blocking variables: it is clear that vehicles manuafactuered in different companies may have different fuel economy standard, for example, vehicles produced in Japan tend to have higher fuel economy standard than the luxury cars produced in the US, and the fuel economy technology improves every years, so these two factos seem to be the perfect nuisance variables to block on.

Randomize: What is the Randomization Scheme?

Since all the new vehicle make/models are required to take the fuel economy examination at EPA, the data are random and represent the population well.

Replicate: Are there replicates and/or repeated measures?

The fuel economy exam is carried out for all new vehicles before they are sent to the market. Therefore, for each vehicle there is no replicate/repeated measures. However, repeated measures maybe taken for the same make/model in different years.

Block: Did you use blocking in the design?

Two blocking variables are used in the design: vehicle make and year. They are selected based on three main criteria: 1. controllable 2. Known 3. Not the interest of the study. The boxplots shown below also prove that the above criteria has been satified: The within group variation is smaller than the between group variation, indicating that they are nuisance variables that we should control for.

# All Acura records in the dataframe
boxplot(hwy~year,data=data3, xlab="Year", ylab="Highway Fuel Economy (mpg)")
title("Boxplot of Highway Fuel Economy at different Years")

plot of chunk unnamed-chunk-4

boxplot(hwy~make,data=data3, xlab="Vehicle Make", ylab="Highway Fuel Economy (mpg)")
title("Boxplot of Highway Fuel Economy for different Vehicle Makes")

plot of chunk unnamed-chunk-4

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

We use boxplots to compare the highway fuel economy of vehicles with various transmission types, number of cylinders, and drive types. The boxplots show that the median highway fuel economy do vary across different groups, indicating that they may contribute to the variation in highway fuel economy, the analysis of variance conducted in the following section further confirm this finding.

boxplot(hwy~trans,data=data3, xlab="Transmission Type", ylab="Highway Fuel Economy (mpg)",names=c("AV-S7","A6","S5","S6","A(Variable)","A4","A5","Auto6","M5","M6"))
title("Boxplot of Highway Fuel Economy vs. Transmission Type")

plot of chunk unnamed-chunk-5

boxplot(hwy~cyl,data=data3, xlab="Number of Cylinders", ylab="Highway Fuel Economy (mpg)")
title("Boxplot of Highway Fuel Economy vs. Number of Cylinders")

plot of chunk unnamed-chunk-5

boxplot(hwy~drive,data=data3, xlab="Drive Type", ylab="Highway Fuel Economy (mpg)",names=c("4W","4W/AW","AW","FW","RW"))
title("Boxplot of Highway Fuel Economy vs. Drive Type")

plot of chunk unnamed-chunk-5

Analysis of Variance

The summary of the ANOVA shows below gives the p-values for each individual factor, along with the p-value for the interactions between the factors. The null hypothesis states that the variation in the response variable, highway fuel economy, cannot be explained by anything other than sample randomization. We estimate seven models to test our hypothesis: model1 includes all three testing factors and their interaction effect, model2 adds in one blocking factor year, model3 adds in the other blocking factormake, model4 includes both of the blocking factors, model5, model6 and model7 excludes the blocking factors and test each of the factors seperately, their results are mainly used for the following Tukey's HSD Test.

The difference in model1, model2, model3 and model4 shows that by adding in the blocking variables, the precision of the estimation results have increased. We have selected model4 as our final model based on conceptual validity and statistical significance, the p-value shown in the estimation result shows that it is highly possible that the variation in highway fuel economy is not due to sample randomization only, the transmission type, number of cylinders, drive type and the interaction between transmission type and number of cylinders may contribute to the such variation.

model1 <- aov(hwy~ trans+cyl+drive+trans*cyl*drive, data = data3)
summary(model1)
##                  Df Sum Sq Mean Sq F value  Pr(>F)    
## trans             9   3918     435   96.18 < 2e-16 ***
## cyl               4   1221     305   67.43 < 2e-16 ***
## drive             4    648     162   35.80 < 2e-16 ***
## trans:cyl         6    206      34    7.59 4.7e-07 ***
## trans:drive      12     46       4    0.85    0.60    
## cyl:drive         9     15       2    0.36    0.95    
## trans:cyl:drive   2      0       0    0.05    0.95    
## Residuals       139    629       5                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model2 <- aov(hwy~ trans+cyl+drive+trans* cyl*drive+year, data = data3)
summary(model2)
##                  Df Sum Sq Mean Sq F value  Pr(>F)    
## trans             9   3918     435   95.79 < 2e-16 ***
## cyl               4   1221     305   67.16 < 2e-16 ***
## drive             4    648     162   35.65 < 2e-16 ***
## year              1      0       0    0.00    1.00    
## trans:cyl         6    208      35    7.61 4.5e-07 ***
## trans:drive      12     47       4    0.86    0.59    
## cyl:drive         9     14       2    0.35    0.96    
## trans:cyl:drive   2      1       0    0.06    0.94    
## Residuals       138    627       5                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model3 <- aov(hwy~ trans+cyl+drive+trans*cyl*drive+year+make, data = data3)
summary(model3)
##                  Df Sum Sq Mean Sq F value  Pr(>F)    
## trans             9   3918     435   95.18 < 2e-16 ***
## cyl               4   1221     305   66.73 < 2e-16 ***
## drive             4    648     162   35.42 < 2e-16 ***
## year              1      0       0    0.00  0.9960    
## make              1     46      46   10.15  0.0018 ** 
## trans:cyl         6    170      28    6.20 8.9e-06 ***
## trans:drive      12     39       3    0.70  0.7476    
## cyl:drive         9     14       2    0.35  0.9564    
## trans:cyl:drive   2      0       0    0.03  0.9682    
## Residuals       137    627       5                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model4 <- aov(hwy~ trans+cyl+drive+trans*cyl+drive+year+make, data = data3)
summary(model4)
##              Df Sum Sq Mean Sq F value  Pr(>F)    
## trans         9   3918     435  102.46 < 2e-16 ***
## cyl           4   1221     305   71.83 < 2e-16 ***
## drive         4    648     162   38.13 < 2e-16 ***
## year          1      0       0    0.00  0.9958    
## make          1     46      46   10.92  0.0012 ** 
## trans:cyl     6    170      28    6.67 2.6e-06 ***
## Residuals   160    680       4                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model5<-aov(hwy~trans,data=data3)

model6<-aov(hwy~cyl,data=data3)

model7<-aov(hwy~drive,data=data3)

4. Diagnostics/Model Adequacy Checking

Tukey's HSD Test

Tukey's Honestly Significantly Difference is used after an ANOVA to determine which specific sample means are significantly different from the others. Tukey's HSD returns a matrix that contains statistical parameters for the interaction of an individual sample mean with every other sample mean being statistically analyzed. In this case, we can see that there are statistical significant difference between most pairs of the sampe mean, for each of the selected testing factors (with p-value <0.1)

#Tukey's HSD
tukey1<-TukeyHSD(model5, ordered = FALSE, conf.level = 0.9,names=c("AV-S7","A6","S5","S6","A(Variable)","A4","A5","Auto6","M5","M6"))
plot(tukey1)

plot of chunk unnamed-chunk-7

tukey2<-TukeyHSD(model6, ordered = FALSE, conf.level = 0.9)
plot(tukey2)

plot of chunk unnamed-chunk-7

tukey3<-TukeyHSD(model7, ordered = FALSE, conf.level = 0.9,names=c("4W","4W/AW","AW","FW","RW"))
plot(tukey3)

plot of chunk unnamed-chunk-7

QQ-Plot

The QQ-plot shows that the sub-sample selected generally follows the normal distribution assuption and thus the application of anova is appropriate.

#qq-plot
qqnorm(residuals(model4), main="Normal Q-Q Plot for Highway Fuel Economy", ylab="Average Model Residuals")
qqline(residuals(model4))

plot of chunk unnamed-chunk-8

Residual vs. Fit Plots

The residual vs. Fit plot shows that the residuals are generally randomly distributed and thus the model estimation results are reliable.

#Residual vs Fit 
plot(fitted(model4),residuals(model4), main="Residual vs Fitted Plot for Highway Fuel Economy")

plot of chunk unnamed-chunk-9

Interaction Plots

Since the lines in the plots are not parallel to each other, the interaction between selected factors does exist and should be taken into consider.

#Interaction Plot
interaction.plot(data3$drive, data3$cyl, data3$hwy)

plot of chunk unnamed-chunk-10

interaction.plot(data3$drive, data3$trans, data3$hwy)

plot of chunk unnamed-chunk-10

4. References to the literature

http://www.epa.gov/fueleconomy/basicinformation.htm