Recipe 5: Example of Descriptive Statistics

Recipes for the Design of Experiments: Recipe Outline

as of August 28, 2014, superceding the version of August 24. Always use the most recent version.

Design of Experiments

Trevor Manzanares

Rensselaer Polytechnic Institute

10/23/14

1. Setting

System under test

Dataset under analysis includes data on vehicles from the EPA. This experiment will seek to understand which factors have an effect on highway fuel efficiency (miles per gallon).

remove(list=ls())
library("fueleconomy", lib.loc="~/R/win-library/3.1")

#view first few lines
head(vehicles)
##      id       make               model year                       class
## 1 27550 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 2 28426 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 3 27549 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 4 28425 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 5  1032 AM General Post Office DJ5 2WD 1985 Special Purpose Vehicle 2WD
## 6  1033 AM General Post Office DJ8 2WD 1985 Special Purpose Vehicle 2WD
##             trans            drive cyl displ    fuel hwy cty
## 1 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 2 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 3 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 4 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 5 Automatic 3-spd Rear-Wheel Drive   4   2.5 Regular  17  16
## 6 Automatic 3-spd Rear-Wheel Drive   6   4.2 Regular  13  13
#observe the structure of the data, ie. how many variables
str(vehicles)
## Classes 'tbl_df', 'tbl' and 'data.frame':    33442 obs. of  12 variables:
##  $ id   : int  27550 28426 27549 28425 1032 1033 3347 13309 13310 13311 ...
##  $ make : chr  "AM General" "AM General" "AM General" "AM General" ...
##  $ model: chr  "DJ Po Vehicle 2WD" "DJ Po Vehicle 2WD" "FJ8c Post Office" "FJ8c Post Office" ...
##  $ year : int  1984 1984 1984 1984 1985 1985 1987 1997 1997 1997 ...
##  $ class: chr  "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" ...
##  $ trans: chr  "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" ...
##  $ drive: chr  "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" ...
##  $ cyl  : int  4 4 6 6 4 6 6 4 4 6 ...
##  $ displ: num  2.5 2.5 4.2 4.2 2.5 4.2 3.8 2.2 2.2 3 ...
##  $ fuel : chr  "Regular" "Regular" "Regular" "Regular" ...
##  $ hwy  : int  17 17 13 13 17 13 21 26 28 26 ...
##  $ cty  : int  18 18 13 13 16 13 14 20 22 18 ...
#55 observations of 26 variables
attach(vehicles)
#typecast variables as factors for use in ANOVA
vehicles$year=as.factor(vehicles$year)
vehicles$class=as.factor(vehicles$class)
vehicles$trans=as.factor(vehicles$trans)
vehicles$drive=as.factor(vehicles$drive)
vehicles$cyl=as.factor(vehicles$cyl)
vehicles$make=as.factor(vehicles$make)
vehicles$model=as.factor(vehicles$model)
vehicles$displ=as.factor(vehicles$displ)
vehicles$fuel=as.factor(vehicles$fuel)

Factors of interest and their levels

year(32 levels), class(34 levels), trans(48 levels), displ(64 levels), cyl(9 levels)

View summary statistics

summary(vehicles)
##        id               make                     model      
##  Min.   :    1   Chevrolet: 3461   F150 Pickup 2WD  :  202  
##  1st Qu.: 8361   Ford     : 2807   Truck 2WD        :  187  
##  Median :16724   Dodge    : 2350   F150 Pickup 4WD  :  180  
##  Mean   :17038   GMC      : 2265   Ranger Pickup 2WD:  169  
##  3rd Qu.:25265   Toyota   : 1727   Mustang          :  154  
##  Max.   :34932   BMW      : 1395   Jetta            :  150  
##                  (Other)  :19437   (Other)          :32400  
##       year                               class      
##  1985   : 1701   Compact Cars               : 4739  
##  1987   : 1247   Subcompact Cars            : 4185  
##  2014   : 1214   Midsize Cars               : 3621  
##  1986   : 1210   Standard Pickup Trucks     : 2354  
##  2008   : 1186   Sport Utility Vehicle - 4WD: 2091  
##  2009   : 1184   Sport Utility Vehicle - 2WD: 1626  
##  (Other):25700   (Other)                    :14826  
##              trans                              drive      
##  Automatic 4-spd:10771   2-Wheel Drive             :  507  
##  Manual 5-spd   : 7858   4-Wheel Drive             :  699  
##  Automatic 3-spd: 2720   4-Wheel or All-Wheel Drive: 6647  
##  Automatic 5-spd: 2158   All-Wheel Drive           : 1267  
##  Manual 6-spd   : 2067   Front-Wheel Drive         :12233  
##  Automatic (S6) : 1946   Part-time 4-Wheel Drive   :   96  
##  (Other)        : 5922   Rear-Wheel Drive          :11993  
##       cyl            displ                    fuel            hwy       
##  4      :12381   2      : 2806   Regular        :22622   Min.   :  9.0  
##  6      :11885   3      : 2573   Premium        : 8617   1st Qu.: 19.0  
##  8      : 7550   2.5    : 2153   Gasoline or E85: 1043   Median : 23.0  
##  5      :  718   2.4    : 1721   Diesel         :  874   Mean   : 23.6  
##  12     :  478   1.8    : 1353   Premium or E85 :   88   3rd Qu.: 27.0  
##  (Other):  372   (Other):22779   CNG            :   58   Max.   :109.0  
##  NA's   :   58   NA's   :   57   (Other)        :  140                  
##       cty       
##  Min.   :  6.0  
##  1st Qu.: 15.0  
##  Median : 17.0  
##  Mean   : 17.5  
##  3rd Qu.: 20.0  
##  Max.   :138.0  
## 

Continuous variables (if any)

Highway mileage is the continuous response variable.

The Data: How is it organized and what does it look like?

The data are tabluated into 12 columns, with detailed information on vehicle id, make, model, year, class, transmission, drive, cylinders, displacement, fuel type, and highway and city mileage.

Randomization

The data were collected by the Environmental Protection Agency (EPA) from 1985 to 2015, containing the above listed variables. Each vehicle has complete data.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

Analysis of Variance will be used to understand the effects of year, class, and transmission type on highway gas mileage. Blocking will also be used to remove any variation due to possible nuisance factors, namely engine displacement and number of cylinders (determined by boxplots below, small within group variation and large between group variation).

What is the rationale for this design?

This project seeks to discover if year, class, and transmission type may have an individual or interaction effect on highway gas mileage, while holding engine displacement and number of cylinders constant. A linear model will be designed and analysis of variance conducted to determine if the individual or joint effects are statistically significant. Blocking is necessary because it helps remove variation in the response due to possible confounding variables (displ and cyl).

Replicate: Are there replicates and/or repeated measures?

There are replicates but not repeated measures. There are multiple observations for the same make and model vehicles but one vehicle will only have one predetermined set of static attributes, therefore there are no repeated measures.

Block: Did you use blocking in the design?

Yes, the analysis of variance accounts for nuisance variables (displ and cyl) and therefore controls for them appropriately.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

Discover which factors should be used as blocking variables

par(mfrow=c(1,1))
boxplot(hwy~year,las=3,xlab="Year",ylab="Highway Gas Mileage")

plot of chunk unnamed-chunk-3

boxplot(hwy~class,las=3,xlab="Class",ylab="Highway Gas Mileage")

plot of chunk unnamed-chunk-3

boxplot(hwy~trans,las=3,xlab="Transmission Type",ylab="Highway Gas Mileage")

plot of chunk unnamed-chunk-3

boxplot(hwy~drive,las=3,xlab="Drive",ylab="Highway Gas Mileage")

plot of chunk unnamed-chunk-3

boxplot(hwy~cyl,las=3,xlab="Cylinders",ylab="Highway Gas Mileage")

plot of chunk unnamed-chunk-3

boxplot(hwy~displ,las=3,xlab="Displacement",ylab="Highway Gas Mileage")

plot of chunk unnamed-chunk-3

boxplot(hwy~fuel,las=3,xlab="Fuel Type",ylab="Highway Gas Mileage")

plot of chunk unnamed-chunk-3 displ and cyl appear to have small within group variation but large between group variation. Thus, these factors will be chosen as blocking variables.

Testing

set up factorial design

library("DoE.base", lib.loc="~/R/win-library/3.1")
## Loading required package: grid
## Loading required package: conf.design
## 
## Attaching package: 'DoE.base'
## 
## The following objects are masked from 'package:stats':
## 
##     aov, lm
## 
## The following object is masked from 'package:graphics':
## 
##     plot.design
library("agricolae", lib.loc="~/R/win-library/3.1")

design = fac.design(nlevels=c(32,34,48),factor.names=c("year","class","trans"),blocks=2)
## creating full factorial with  52224  runs ...
head(design)
##   Blocks year class trans
## 1      1   25    34    19
## 2      1   23    12    10
## 3      1   32    18    16
## 4      1   10     4     3
## 5      1   22    16    27
## 6      1   28    25    38
str(design)
## Classes 'design' and 'data.frame':   52224 obs. of  4 variables:
##  $ Blocks: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ year  : Factor w/ 32 levels "1","2","3","4",..: 25 23 32 10 22 28 15 4 2 31 ...
##   ..- attr(*, "contrasts")= num [1:32, 1:31] -0.297 -0.278 -0.258 -0.239 -0.22 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr  "1" "2" "3" "4" ...
##   .. .. ..$ : chr  ".L" ".Q" ".C" "^4" ...
##  $ class : Factor w/ 34 levels "1","2","3","4",..: 34 12 18 4 16 25 30 18 17 21 ...
##   ..- attr(*, "contrasts")= num [1:34, 1:33] -0.288 -0.271 -0.253 -0.236 -0.219 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr  "1" "2" "3" "4" ...
##   .. .. ..$ : chr  ".L" ".Q" ".C" "^4" ...
##  $ trans : Factor w/ 48 levels "1","2","3","4",..: 19 10 16 3 27 38 20 18 3 26 ...
##   ..- attr(*, "contrasts")= num [1:48, 1:47] -0.245 -0.234 -0.224 -0.214 -0.203 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr  "1" "2" "3" "4" ...
##   .. .. ..$ : chr  ".L" ".Q" ".C" "^4" ...
##  - attr(*, "desnum")= num [1:52224, 1:112] 0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr  "1" "2" "3" "4" ...
##   .. ..$ : chr  "Blocks2" "year.L" "year.Q" "year.C" ...
##  - attr(*, "run.order")='data.frame':    52224 obs. of  3 variables:
##   ..$ run.no.in.std.order: Factor w/ 52224 levels "1.1.1","4.1.2",..: 10333 5084 8448 1141 14395 20526 10808 9522 1345 13936 ...
##   ..$ run.no             : int  1 2 3 4 5 6 7 8 9 10 ...
##   ..$ run.no.std.rp      : Factor w/ 52224 levels "1.1.1","10.1.5",..: 11853 189 7664 14247 20880 34506 12909 10052 18780 19859 ...
##  - attr(*, "design.info")=List of 16
##   ..$ type        : chr "full factorial.blocked"
##   ..$ nruns       : num 52224
##   ..$ nfactors    : int 3
##   ..$ nlevels     : num  32 34 48
##   ..$ factor.names:List of 3
##   .. ..$ year : int  1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ class: int  1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ trans: int  1 2 3 4 5 6 7 8 9 10 ...
##   ..$ replications: num 1
##   ..$ repeat.only : logi FALSE
##   ..$ randomize   : logi TRUE
##   ..$ seed        : NULL
##   ..$ creator     : language fac.design(nlevels = c(32, 34, 48), factor.names = c("year", "class",      "trans"), blocks = 2)
##   ..$ block.name  : chr "Blocks"
##   ..$ bbreps      : num 1
##   ..$ wbreps      : num 1
##   ..$ nblocks     : num 2
##   ..$ blocksize   : num 26112
##   ..$ block.gen   : num [1, 1:12] 1 1 1 1 1 1 0 1 1 1 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : NULL
##   .. .. ..$ : Named chr  "year" "year" "year" "year" ...
##   .. .. .. ..- attr(*, "names")= chr  "year1" "year2" "year3" "year4" ...

Run analysis of variance for individual factor effects

model1=(aov(hwy~year+displ,data=vehicles))
anova(model1)
## Analysis of Variance Table
## 
## Response: hwy
##              Df Sum Sq Mean Sq F value Pr(>F)    
## year         31  81071    2615     273 <2e-16 ***
## displ        63 694098   11017    1150 <2e-16 ***
## Residuals 33290 319025      10                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model2=(aov(hwy~class+displ,data=vehicles))
anova(model2)
## Analysis of Variance Table
## 
## Response: hwy
##              Df Sum Sq Mean Sq F value Pr(>F)    
## class        33 455707   13809    1540 <2e-16 ***
## displ        63 339954    5396     602 <2e-16 ***
## Residuals 33288 298533       9                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model3=(aov(hwy~trans+displ,data=vehicles))
anova(model3)
## Analysis of Variance Table
## 
## Response: hwy
##              Df Sum Sq Mean Sq F value Pr(>F)    
## trans        46 160547    3490     358 <2e-16 ***
## displ        63 608935    9666     990 <2e-16 ***
## Residuals 33275 324712      10                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model4=(aov(hwy~year+cyl,data=vehicles))
anova(model4)
## Analysis of Variance Table
## 
## Response: hwy
##              Df Sum Sq Mean Sq F value Pr(>F)    
## year         31  81069    2615     187 <2e-16 ***
## cyl           8 547024   68378    4892 <2e-16 ***
## Residuals 33344 466098      14                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model5=(aov(hwy~class+cyl,data=vehicles))
anova(model5)
## Analysis of Variance Table
## 
## Response: hwy
##              Df Sum Sq Mean Sq F value Pr(>F)    
## class        33 455708   13809    1267 <2e-16 ***
## cyl           8 275026   34378    3154 <2e-16 ***
## Residuals 33342 363457      11                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model6=(aov(hwy~trans+cyl,data=vehicles))
anova(model6)
## Analysis of Variance Table
## 
## Response: hwy
##              Df Sum Sq Mean Sq F value Pr(>F)    
## trans        46 160557    3490     257 <2e-16 ***
## cyl           8 481345   60168    4434 <2e-16 ***
## Residuals 33329 452289      14                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Each of the above variables have an individual effect on the highway gas mileage when blocking for engine displacement and number of cylinders.For each, the p-value is essentially zero so we can reject the H0 that each has no effect on the response. Further, any variation in the response can be solely attributed to variation in each of the factor levels.

Run analysis of variance using interaction. Data must first be subset so there is ample computational memory to execute the ANOVA command to include interaction.We will seek to find if there are interaction effects for year, class, and transmission type just for the years 1985 and 1986.

subset=subset(vehicles,year==1985:1986)
interaction=aov(hwy~year*class*trans+displ+cyl, data=subset)
anova(interaction)
## Analysis of Variance Table
## 
## Response: hwy
##                    Df Sum Sq Mean Sq F value  Pr(>F)    
## year                1     21      21    2.56    0.11    
## class              13  22627    1741  211.13 < 2e-16 ***
## trans               5   5502    1100  133.47 < 2e-16 ***
## displ              37  14023     379   45.97 < 2e-16 ***
## cyl                 4    945     236   28.64 < 2e-16 ***
## year:class         12    113       9    1.14    0.33    
## year:trans          4     53      13    1.61    0.17    
## class:trans        39    712      18    2.21 3.1e-05 ***
## year:class:trans   32    209       7    0.79    0.79    
## Residuals        1312  10816       8                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this particular model, we fail to reject the H0 that year and class jointly have no effect on highway gas mileage in the years 1985 to 1986. Further, we fail to reject the H0 that year and transmission have no effect on highway gas mileage in the years 1985 to 1986. However, we reject the H0 that vehicle class and transmission type have no effect in the years 1985 to 1986. That is, variation in the response variable can be explained by the interaction effect between vehicle class and transmission. Since this model blocks on the nuisance variables displ and cyl, it is safe to assume the model is more accurate because variation due to erroneous factors has been accounted for.

Estimation (of Parameters)

# Shapiro-Wilk test of normality.  Adequate if p < 0.
# Cannot perform test on entire dataset, so sampling was used
sampling = vehicles[sample(1:nrow(vehicles), 500, replace=FALSE),]
shapiro.test(sampling$hwy)
## 
##  Shapiro-Wilk normality test
## 
## data:  sampling$hwy
## W = 0.9756, p-value = 2.085e-07
# We reject the H0 that the data are normally distributed.
qqnorm(sampling$hwy)
qqline(sampling$hwy)

plot of chunk unnamed-chunk-7

# Thus, we attempt to transform the data
shapiro.test(sampling$hwy^2) #still not normally distributed
## 
##  Shapiro-Wilk normality test
## 
## data:  sampling$hwy^2
## W = 0.9058, p-value < 2.2e-16
shapiro.test(sqrt(sampling$hwy)) #still not normally distributed
## 
##  Shapiro-Wilk normality test
## 
## data:  sqrt(sampling$hwy)
## W = 0.9911, p-value = 0.004213
shapiro.test(log(sampling$hwy)) #still not normallly distributed
## 
##  Shapiro-Wilk normality test
## 
## data:  log(sampling$hwy)
## W = 0.9922, p-value = 0.01008
shapiro.test(log10(sampling$hwy)) #still not normallly distributed
## 
##  Shapiro-Wilk normality test
## 
## data:  log10(sampling$hwy)
## W = 0.9922, p-value = 0.01008
shapiro.test(1/sampling$hwy) #still not normallly distributed
## 
##  Shapiro-Wilk normality test
## 
## data:  1/sampling$hwy
## W = 0.951, p-value = 8.094e-12
shapiro.test(1/sqrt(sampling$hwy)) #still not normallly distributed
## 
##  Shapiro-Wilk normality test
## 
## data:  1/sqrt(sampling$hwy)
## W = 0.9787, p-value = 1.077e-06
# It appears for this sampling subset the data cannot be readily transformed into a normal distribution.

One of the primary assumptions of the analysis of variance is that the data are normally distributed, and since they are (and cannot be readily transformed into such), the results from the analysis of variance can potentially be invalid.

Diagnostics/Model Adequacy Checking

Describe

qqnorm(residuals(interaction))
qqline(residuals(interaction))

plot of chunk unnamed-chunk-8

plot(fitted(interaction),residuals(interaction))

plot of chunk unnamed-chunk-8

#The model appears to have a good fit to the data since the points are evenly dispersed across the Cartesian plane of resiual error and the fitted model. 
interaction.plot(class,trans,year)

plot of chunk unnamed-chunk-8

#There appears to be significant interaction among the factors because the lines are not perfectly parallel.

4. References to the literature

These data are one of the original Hadley Wickham datasets compiled by the US EPA.

https://github.com/hadley/fueleconomy

5. Appendices

A summary of, or pointer to, the raw data

Please see above.

complete and documented R code

Please see above.