This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

Complete Randomized Block Design

Hongyu Chen

RPI, RIN:661405156

Oct.19 V1.0

1. Setting

System under test

Data of this test is obtained from the ‘fueleconomy’ dataset. Last time we analyzed effect of two factors: manufacture and year, on fuel comsumption. This time we are going to analyze other possible factors that might influence aim of this analysis is to test factors that may influence fuel comsumption using complete randomized block design. A quick view of the dataset is as below.

install.packages("fueleconomy", repos='http://cran.us.r-project.org')
## Installing package into 'C:/Users/acer1/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)
## package 'fueleconomy' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\acer1\AppData\Local\Temp\RtmpgJbLsd\downloaded_packages
library("fueleconomy")
data<-vehicles
head(data)
##      id       make               model year                       class
## 1 27550 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 2 28426 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 3 27549 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 4 28425 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 5  1032 AM General Post Office DJ5 2WD 1985 Special Purpose Vehicle 2WD
## 6  1033 AM General Post Office DJ8 2WD 1985 Special Purpose Vehicle 2WD
##             trans            drive cyl displ    fuel hwy cty
## 1 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 2 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 3 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 4 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 5 Automatic 3-spd Rear-Wheel Drive   4   2.5 Regular  17  16
## 6 Automatic 3-spd Rear-Wheel Drive   6   4.2 Regular  13  13
tail(data)
##          id  make                             model year       class
## 33437 31064 smart   fortwo electric drive cabriolet 2011 Two Seaters
## 33438 33305 smart fortwo electric drive convertible 2013 Two Seaters
## 33439 34393 smart fortwo electric drive convertible 2014 Two Seaters
## 33440 31065 smart       fortwo electric drive coupe 2011 Two Seaters
## 33441 33306 smart       fortwo electric drive coupe 2013 Two Seaters
## 33442 34394 smart       fortwo electric drive coupe 2014 Two Seaters
##                trans            drive cyl displ        fuel hwy cty
## 33437 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  79  94
## 33438 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  93 122
## 33439 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  93 122
## 33440 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  79  94
## 33441 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  93 122
## 33442 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  93 122
summary(data)
##        id            make              model                year     
##  Min.   :    1   Length:33442       Length:33442       Min.   :1984  
##  1st Qu.: 8361   Class :character   Class :character   1st Qu.:1991  
##  Median :16724   Mode  :character   Mode  :character   Median :1999  
##  Mean   :17038                                         Mean   :1999  
##  3rd Qu.:25265                                         3rd Qu.:2008  
##  Max.   :34932                                         Max.   :2015  
##                                                                      
##     class              trans              drive                cyl       
##  Length:33442       Length:33442       Length:33442       Min.   : 2.00  
##  Class :character   Class :character   Class :character   1st Qu.: 4.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.00  
##                                                           Mean   : 5.77  
##                                                           3rd Qu.: 6.00  
##                                                           Max.   :16.00  
##                                                           NA's   :58     
##      displ          fuel                hwy             cty       
##  Min.   :0.00   Length:33442       Min.   :  9.0   Min.   :  6.0  
##  1st Qu.:2.30   Class :character   1st Qu.: 19.0   1st Qu.: 15.0  
##  Median :3.00   Mode  :character   Median : 23.0   Median : 17.0  
##  Mean   :3.35                      Mean   : 23.6   Mean   : 17.5  
##  3rd Qu.:4.30                      3rd Qu.: 27.0   3rd Qu.: 20.0  
##  Max.   :8.40                      Max.   :109.0   Max.   :138.0  
##  NA's   :57

Factors and Levels

This time we set five factors that may influence vehicle fuel consumption, namely drive train, transmissio type, fuel type, numbber of cylinders and vehicle size, which are present as ‘drive’, ‘trans’, ‘fuel’, ‘cyl’ and ‘class’ in the dataset. Initially, we hope to test all data provided in order to get a more convincing conclusion, however in the factors of ‘trans’ and ‘class’, there are too many levels which R is unable to analyze. Therefore we select several most common types of transmission and vehicle size as levels for these factors as an example.

# factors and levels setting

M<-data$class=='Compact Cars'|data$class=='Midsize Cars'|data$class=='Sport Utility Vehicle - 4WD'|data$class=='Vans'|data$class=='Standard Pickup Trucks'
N<-data[M,]
P<-N$trans=='Auto (AV-S8)'|N$trans=='Automatic 6-spd'|N$trans=='Manual 6-spd'|N$trans=='Auto (AV)'|N$trans=='Manual 4-spd'|N$trans=='Automatic 4-spd'
sample<-N[P,]

#factors and levels

fuel<-factor(sample$fuel)
cyl<-factor(sample$cyl)
drive<-factor(sample$drive)
class<-factor(sample$class)
trans<-factor(sample$trans)
levels(fuel)
## [1] "CNG"                        "Diesel"                    
## [3] "Gasoline or E85"            "Midgrade"                  
## [5] "Premium"                    "Premium Gas or Electricity"
## [7] "Premium or E85"             "Regular"
levels(cyl)
## [1] "4"  "5"  "6"  "8"  "10" "12"
levels(drive)
## [1] "4-Wheel Drive"              "4-Wheel or All-Wheel Drive"
## [3] "All-Wheel Drive"            "Front-Wheel Drive"         
## [5] "Part-time 4-Wheel Drive"    "Rear-Wheel Drive"
levels(class)
## [1] "Compact Cars"                "Midsize Cars"               
## [3] "Sport Utility Vehicle - 4WD" "Standard Pickup Trucks"     
## [5] "Vans"
levels(trans)
## [1] "Auto (AV-S8)"    "Auto (AV)"       "Automatic 4-spd" "Automatic 6-spd"
## [5] "Manual 4-spd"    "Manual 6-spd"

Continuous variables (if any)

All factors are categorical variables, and the response variable ‘cty’, which means city fuel economy in mpg (miles per galon) is continuous variable which can be any integer between 6 to 138.

Response variables

Response variable in this test is ‘cty’, referring to city fuel economy in mpg (miles per gallon)

The Data: How is it organized and what does it look like?

Detailed information is provided in the ‘fueleconomy’ dataset, including almost everything to evaluate an auto.There are five factors selected in this analysis to investigate how these factors affect vehicle fuel economy.

str(sample)
## Classes 'tbl_df', 'tbl' and 'data.frame':    6225 obs. of  12 variables:
##  $ id   : int  3347 11789 12673 12674 13421 13422 14163 14164 18458 15069 ...
##  $ make : chr  "ASC Incorporated" "Acura" "Acura" "Acura" ...
##  $ model: chr  "GNX" "2.5TL" "2.5TL/3.2TL" "2.5TL/3.2TL" ...
##  $ year : int  1987 1995 1996 1996 1997 1997 1998 1998 2003 1999 ...
##  $ class: chr  "Midsize Cars" "Compact Cars" "Compact Cars" "Compact Cars" ...
##  $ trans: chr  "Automatic 4-spd" "Automatic 4-spd" "Automatic 4-spd" "Automatic 4-spd" ...
##  $ drive: chr  "Rear-Wheel Drive" "Front-Wheel Drive" "Front-Wheel Drive" "Front-Wheel Drive" ...
##  $ cyl  : int  6 5 5 6 5 6 5 6 6 6 ...
##  $ displ: num  3.8 2.5 2.5 3.2 2.5 3.2 2.5 3.2 3.2 3.2 ...
##  $ fuel : chr  "Premium" "Premium" "Premium" "Premium" ...
##  $ hwy  : int  21 23 23 22 23 22 23 22 26 25 ...
##  $ cty  : int  14 18 18 17 18 17 17 17 17 17 ...

Randomization

Fuel economy data are the result of vehicle testing done at the Environmental Protection Agency’s National Vehicle and Fuel Emissions Laboratory in Ann Arbor, Michigan, and by vehicle manufacturers with oversight by EPA. Therefore it is reasonalbe to assume these data are complete randomized. In this case, we will set two factors as blocks, so it is a complete randomized block design.

Replication/repeated measures

There is not replication or repeated measures in raw data or this analysis.

Block

There will be two blocks in this study which are selected depend on following exploratory data analysis. (Which turn out to be fuel type and numbers of cylinder)

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

Obtain data from ‘fueleconomy’ dataset and collect sample data based on most common conditions. Then through exploratory data analysis to decide two blocks (‘fuel’ and ‘cyl’). Conduct ANOVA to test effect of individual factor and interaction between factors on vehicle fuel economy. Following analysis include Tukey’s Honest test and interaction plot. Therefore, null hypothesis is: H0: the difference in vehicle city fuel economy can only be explained by randomization. Alternative hypothesis: Ha: the difference in vehicle city fuel economy can be explained by anything other than randomization.

What is the rationale for this design?

Fuel economy is a critical element when purchasing a car, and there are many facotrs that may influence car fuel economy other than manufacture and year a car was made, including vehicle type, transmission type, vehicle size and so on. Therefore through this analysis we are able to get a useful reference on how possible factors influence city fuel economy, and then get experience on choosing our own cars.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

#Histogram of response variable
hist(sample$cty, ylab="City Fuel Economy(mpg)")

plot of chunk unnamed-chunk-4

#Boxplot
boxplot(cty~trans, data=sample)

plot of chunk unnamed-chunk-4

boxplot(cty~drive, data=sample)

plot of chunk unnamed-chunk-4

boxplot(cty~class, data=sample)

plot of chunk unnamed-chunk-4

boxplot(cty~fuel, data=sample)

plot of chunk unnamed-chunk-4

boxplot(cty~cyl, data=sample)

plot of chunk unnamed-chunk-4

Boxplots revel that for factors of transmission, drive train and vehicle size class, variation between groups is mainly smaller than variation within groups, therefore they can be selected as factors under test. However for fuel type and number of cylinders, varation between groups is larger than variation within groups, therefore we select these two factors as block, and will not consider them in future tests.

Testing

ANOVA test is conducted as below:

#variance of drive train, vehicle size class, and transmission
model1=aov(cty ~ drive, data=sample)
anova(model1)
## Analysis of Variance Table
## 
## Response: cty
##             Df Sum Sq Mean Sq F value Pr(>F)    
## drive        5  42492    8498     990 <2e-16 ***
## Residuals 6219  53405       9                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model2=aov(cty ~ class, data=sample)
anova(model2)
## Analysis of Variance Table
## 
## Response: cty
##             Df Sum Sq Mean Sq F value Pr(>F)    
## class        4  36995    9249     977 <2e-16 ***
## Residuals 6220  58902       9                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model3=aov(cty ~ trans, data=sample)
anova(model3)
## Analysis of Variance Table
## 
## Response: cty
##             Df Sum Sq Mean Sq F value Pr(>F)    
## trans        5   8887    1777     127 <2e-16 ***
## Residuals 6219  87010      14                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#variance of interaction between factors
modeli=aov(cty ~ drive*class*trans, data=sample)
anova(modeli)
## Analysis of Variance Table
## 
## Response: cty
##                     Df Sum Sq Mean Sq F value  Pr(>F)    
## drive                5  42492    8498  1315.9 < 2e-16 ***
## class                4   5745    1436   222.4 < 2e-16 ***
## trans                5   3380     676   104.7 < 2e-16 ***
## drive:class          8    636      80    12.3 < 2e-16 ***
## drive:trans         10   2467     247    38.2 < 2e-16 ***
## class:trans          7    877     125    19.4 < 2e-16 ***
## drive:class:trans    4    381      95    14.7 5.5e-12 ***
## Residuals         6181  39919       6                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
modeli1=aov(cty ~ drive*class, data=sample)
anova(modeli1)
## Analysis of Variance Table
## 
## Response: cty
##               Df Sum Sq Mean Sq F value  Pr(>F)    
## drive          5  42492    8498  1122.4 < 2e-16 ***
## class          4   5745    1436   189.7 < 2e-16 ***
## drive:class    8    663      83    10.9 1.9e-15 ***
## Residuals   6207  46997       8                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
modeli2=aov(cty ~ drive*trans, data=sample)
anova(modeli2)
## Analysis of Variance Table
## 
## Response: cty
##               Df Sum Sq Mean Sq F value Pr(>F)    
## drive          5  42492    8498  1149.0 <2e-16 ***
## trans          5   4956     991   134.0 <2e-16 ***
## drive:trans   10   2563     256    34.6 <2e-16 ***
## Residuals   6204  45886       7                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
modeli3=aov(cty ~ class*trans, data=sample)
anova(modeli3)
## Analysis of Variance Table
## 
## Response: cty
##               Df Sum Sq Mean Sq F value Pr(>F)    
## class          4  36995    9249  1068.0 <2e-16 ***
## trans          5   3000     600    69.3 <2e-16 ***
## class:trans    7   2140     306    35.3 <2e-16 ***
## Residuals   6208  53762       9                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All ANOVA tests for each individual factor have a very small p-value, indicating variance of all three factors are able to explain variance of vehicle city fuel economy. ANOVA tests for interactions between factors also have small p-value, however p-value of some interactions is smaller than p-value of individual factor, therefore probably interaction will decrease relative importance of each factor, but interaction can still explain variance of response variable. Then we can reject the null hypothesis and accept alternative hypothesis, which is variance of vehicle city fuel economy can be explained by something other than randomization.

Estimation (of Parameters)

Summary of data sample and raw data is as below. Compared to raw data set, sample data change in the levels of ‘trans’ and ‘class’, therefore may lead to limitations. However, since sample data contain most common types of transmissions and vehicle class, it is believed to be enough in daily use.

summary(sample)
##        id            make              model                year     
##  Min.   :  242   Length:6225        Length:6225        Min.   :1985  
##  1st Qu.: 7108   Class :character   Class :character   1st Qu.:1990  
##  Median :12821   Mode  :character   Mode  :character   Median :1995  
##  Mean   :14338                                         Mean   :1997  
##  3rd Qu.:20682                                         3rd Qu.:2004  
##  Max.   :34909                                         Max.   :2015  
##     class              trans              drive                cyl    
##  Length:6225        Length:6225        Length:6225        Min.   : 4  
##  Class :character   Class :character   Class :character   1st Qu.: 4  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6  
##                                                           Mean   : 6  
##                                                           3rd Qu.: 8  
##                                                           Max.   :12  
##      displ          fuel                hwy            cty      
##  Min.   :0.00   Length:6225        Min.   :11.0   Min.   : 8.0  
##  1st Qu.:2.40   Class :character   1st Qu.:18.0   1st Qu.:14.0  
##  Median :3.40   Mode  :character   Median :22.0   Median :16.0  
##  Mean   :3.63                      Mean   :22.5   Mean   :16.5  
##  3rd Qu.:5.00                      3rd Qu.:26.0   3rd Qu.:18.0  
##  Max.   :7.40                      Max.   :43.0   Max.   :35.0
summary(data)
##        id            make              model                year     
##  Min.   :    1   Length:33442       Length:33442       Min.   :1984  
##  1st Qu.: 8361   Class :character   Class :character   1st Qu.:1991  
##  Median :16724   Mode  :character   Mode  :character   Median :1999  
##  Mean   :17038                                         Mean   :1999  
##  3rd Qu.:25265                                         3rd Qu.:2008  
##  Max.   :34932                                         Max.   :2015  
##                                                                      
##     class              trans              drive                cyl       
##  Length:33442       Length:33442       Length:33442       Min.   : 2.00  
##  Class :character   Class :character   Class :character   1st Qu.: 4.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.00  
##                                                           Mean   : 5.77  
##                                                           3rd Qu.: 6.00  
##                                                           Max.   :16.00  
##                                                           NA's   :58     
##      displ          fuel                hwy             cty       
##  Min.   :0.00   Length:33442       Min.   :  9.0   Min.   :  6.0  
##  1st Qu.:2.30   Class :character   1st Qu.: 19.0   1st Qu.: 15.0  
##  Median :3.00   Mode  :character   Median : 23.0   Median : 17.0  
##  Mean   :3.35                      Mean   : 23.6   Mean   : 17.5  
##  3rd Qu.:4.30                      3rd Qu.: 27.0   3rd Qu.: 20.0  
##  Max.   :8.40                      Max.   :109.0   Max.   :138.0  
##  NA's   :57

Diagnostics/Model Adequacy Checking

#shapiro test
#shapiro.test(sample$cty)

Shapiro test reveals a p-value much smaller than 0.1, which means a normally distribution.

#qqplot
qqnorm(residuals(modeli))
qqline(residuals(modeli))

plot of chunk unnamed-chunk-8 Q-Q plot and Q-Q line of residuals exhibit linear pattern of residuals, which means previous model used is valid.

interaction.plot(sample$trans,sample$class,sample$cty)

plot of chunk unnamed-chunk-9

interaction.plot(sample$trans,sample$drive,sample$cty)

plot of chunk unnamed-chunk-9

interaction.plot(sample$drive,sample$class,sample$cty)

plot of chunk unnamed-chunk-9 Interaction polt does not appear as parallel lines, which means interactions between factor are probably adequate to explain variance of cty.

plot(fitted(modeli),residuals(modeli))

plot of chunk unnamed-chunk-10

Plot of fitted model and residuals model shows residuals are mainly evenly distributed on each side of 0, indicating the model of interaction among those three factors is adequate enough to explian the variance of cty.

Tukey’s test

To further understand how interactions of factors affect response variable and avoid the false positives we conduct Tukey’s HSD test.

Tukey_modeli<-TukeyHSD(modeli, order=FALSE, conf.level=0.95)
plot(Tukey_modeli)

plot of chunk unnamed-chunk-11plot of chunk unnamed-chunk-11plot of chunk unnamed-chunk-11plot of chunk unnamed-chunk-11plot of chunk unnamed-chunk-11plot of chunk unnamed-chunk-11plot of chunk unnamed-chunk-11

Since there are so many combinations, Tukey’s plot is not appear as clear lines, howeve we can still know that not all levels have difference in means containing zero, so it can be assumed that there is difference in means to some extent.

4. References to the literature

5. Appendices

Note

For Shapiro-wilk normality test, we can run this function in R Markdown, but it cannot be knitted to HTML since there are more than 5000 pieces of data.

#Shapiro-Wilk normality test
#data:  sample$cty
#W = 0.818, p-value < 2.2e-16

A summary of, or pointer to, the raw data

http://www.fueleconomy.gov/feg/download.shtml