Recipes for the Design of Experiments: Recipe Outline

As of August 28, 2014, superseding the version of August 24. Always use the most recent version.

Design of Experiments: Recipe 2 - Two or More Factor Analysis

Kevin Toth

RPI

10/01/2014 V2.0

1. Setting

Fuel Economy of Vehicles

This analysis uses fuel economy data collected by the EPA for model years 1984 to 2015 to test the effect that different factors may have on fuel consumption.

Below is the installation and initial examination of the dataset:

#Installing data package
install.packages("fueleconomy", repos='http://cran.us.r-project.org')
## Installing package into 'C:/Users/tothk2/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)
## package 'fueleconomy' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\tothk2\AppData\Local\Temp\Rtmp088G8X\downloaded_packages
library(fueleconomy)
data<-vehicles
head(data)
##      id       make               model year                       class
## 1 27550 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 2 28426 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 3 27549 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 4 28425 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 5  1032 AM General Post Office DJ5 2WD 1985 Special Purpose Vehicle 2WD
## 6  1033 AM General Post Office DJ8 2WD 1985 Special Purpose Vehicle 2WD
##             trans            drive cyl displ    fuel hwy cty
## 1 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 2 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 3 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 4 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 5 Automatic 3-spd Rear-Wheel Drive   4   2.5 Regular  17  16
## 6 Automatic 3-spd Rear-Wheel Drive   6   4.2 Regular  13  13
attach(data)

Factors and Levels

For this two factor analysis we will be examining the factors “cyl” which represents the number of cylinders the engine has and “fuel” which represents the fuel type consumed by the car.

In the subset used here, “cyl” has nine levels: 2, 3, 4, 5, 6, 8, 10, 12 and 16. Of the 13 levels of “fuel”, this analysis examines only three: Regular, Premium and Diesel.

#Subset data for specific values of fuel type and provide summary statistics
datasub<-data[data$fuel %in% c("Regular","Premium","Diesel"), ]
head(datasub)
##      id       make               model year                       class
## 1 27550 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 2 28426 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 3 27549 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 4 28425 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 5  1032 AM General Post Office DJ5 2WD 1985 Special Purpose Vehicle 2WD
## 6  1033 AM General Post Office DJ8 2WD 1985 Special Purpose Vehicle 2WD
##             trans            drive cyl displ    fuel hwy cty
## 1 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 2 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 3 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 4 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 5 Automatic 3-spd Rear-Wheel Drive   4   2.5 Regular  17  16
## 6 Automatic 3-spd Rear-Wheel Drive   6   4.2 Regular  13  13
tail(datasub)
##          id  make        model year       class           trans
## 33431 26294 smart fortwo coupe 2009 Two Seaters Automatic (AM5)
## 33432 29812 smart fortwo coupe 2010 Two Seaters       Auto(AM5)
## 33433 30918 smart fortwo coupe 2011 Two Seaters       Auto(AM5)
## 33434 32172 smart fortwo coupe 2012 Two Seaters       Auto(AM5)
## 33435 32357 smart fortwo coupe 2013 Two Seaters       Auto(AM5)
## 33436 34460 smart fortwo coupe 2014 Two Seaters       Auto(AM5)
##                  drive cyl displ    fuel hwy cty
## 33431 Rear-Wheel Drive   3     1 Premium  41  33
## 33432 Rear-Wheel Drive   3     1 Premium  41  33
## 33433 Rear-Wheel Drive   3     1 Premium  41  33
## 33434 Rear-Wheel Drive   3     1 Premium  38  34
## 33435 Rear-Wheel Drive   3     1 Premium  38  34
## 33436 Rear-Wheel Drive   3     1 Premium  38  34
summary(datasub)
##        id            make              model                year     
##  Min.   :    1   Length:32113       Length:32113       Min.   :1984  
##  1st Qu.: 8029   Class :character   Class :character   1st Qu.:1990  
##  Median :16066   Mode  :character   Mode  :character   Median :1999  
##  Mean   :16577                                         Mean   :1999  
##  3rd Qu.:24620                                         3rd Qu.:2007  
##  Max.   :34931                                         Max.   :2015  
##                                                                      
##     class              trans              drive                cyl       
##  Length:32113       Length:32113       Length:32113       Min.   : 2.00  
##  Class :character   Class :character   Class :character   1st Qu.: 4.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.00  
##                                                           Mean   : 5.72  
##                                                           3rd Qu.: 6.00  
##                                                           Max.   :16.00  
##                                                           NA's   :3      
##      displ          fuel                hwy            cty      
##  Min.   :0.00   Length:32113       Min.   : 9.0   Min.   : 6.0  
##  1st Qu.:2.20   Class :character   1st Qu.:20.0   1st Qu.:15.0  
##  Median :3.00   Mode  :character   Median :23.0   Median :17.0  
##  Mean   :3.31                      Mean   :23.5   Mean   :17.4  
##  3rd Qu.:4.30                      3rd Qu.:27.0   3rd Qu.:20.0  
##  Max.   :8.40                      Max.   :61.0   Max.   :53.0  
##  NA's   :2
#Levels of Cylinders
unique(datasub$cyl)
##  [1]  4  6  5  8 12 10 16  3  2 NA
#Level of Fuel
unique(datasub$fuel)
## [1] "Regular" "Premium" "Diesel"
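
Before fitting a two-factor model it can help to count how many vehicles fall into each cylinder and fuel combination, since empty cells limit which interaction effects can be estimated. The interaction term in the ANOVA table later in this document has 11 degrees of freedom rather than 8 × 2 = 16, which suggests that some combinations have no observations. The sketch below uses a small made-up data frame (`toy` and its values are assumptions for illustration only); in this analysis the same call would be applied to `datasub$cyl` and `datasub$fuel`.

```r
# Toy stand-in for the subset; the values are invented for illustration
toy <- data.frame(
  cyl  = c(4, 4, 6, 6, 8, NA),
  fuel = c("Regular", "Premium", "Regular", "Diesel", "Premium", "Regular")
)

# Cross-tabulate the two factors, keeping a row for missing cylinder counts
cell_counts <- table(toy$cyl, toy$fuel, useNA = "ifany")
print(cell_counts)

# Number of rows with no cylinder value; aov() drops such rows by default
n_missing_cyl <- sum(is.na(toy$cyl))
print(n_missing_cyl)
```

Cells with a count of zero mark factor combinations that the interaction model cannot estimate.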

Continuous variables (if any)

The continuous variables in this data are the city (“cty”) and highway (“hwy”) gas mileage of each vehicle. Across the full dataset, highway mileage ranges from 9 to 109 and city mileage from 6 to 138. Within the subset analyzed here, the summary output above shows ranges of 9 to 61 for highway and 6 to 53 for city.

Response variables

For this experiment we will focus on city gas mileage as the response variable, and test whether it is affected by the number of cylinders and the fuel type.

The Data: How is it organized and what does it look like?

The data set is organized by the following variables: id, make, model, year, class, trans, drive, cyl, displ, fuel, hwy, cty

Make, model and class identify the manufacturer and type of the vehicle. Make is the manufacturer, such as Audi or Ford. Model is the specific model from that manufacturer, such as Passat or Grand Prix. Class indicates the vehicle type, such as compact car or van.

Year is simply the year that vehicle was manufactured.

Trans, drive, cyl and displ describe the setup of the car, mostly relating to the engine. Trans is the transmission, such as Automatic 9-spd or Manual 5-spd. Drive is the type of wheel drive, such as all-wheel or front-wheel. Cyl is the number of cylinders the engine has, and displ is the displacement of the engine in liters.

Fuel is the type of fuel the engine uses like Regular or Premium.

Cty and hwy are the gas mileage for city driving and gas mileage for highway driving.

Randomization

It is unknown whether or not the data collected for this study was collected by a randomly designed experiment.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

To perform the experiment the data will first be subset by fuel type. Then an analysis of variance will be performed on each factor individually and then on the combination of both factors. From these analyses we will be able to test the null hypothesis that city gas mileage is independent of fuel type and number of cylinders.
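
The hypotheses described above can be written against the usual two-factor effects model. The notation below is a standard textbook form and is not taken from the original analysis; every symbol is an assumption introduced for illustration. Here $y_{ijk}$ is the city mileage of the $k$-th vehicle with the $i$-th cylinder level and the $j$-th fuel type.

```latex
\begin{align}
  y_{ijk} &= \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk} \\
  H_0^{\text{cyl}} &: \alpha_i = 0 \ \text{for all } i \\
  H_0^{\text{fuel}} &: \beta_j = 0 \ \text{for all } j \\
  H_0^{\text{cyl:fuel}} &: (\alpha\beta)_{ij} = 0 \ \text{for all } i, j
\end{align}
```

Each null hypothesis corresponds to one row of the two-factor ANOVA table: `cyl`, `fuel` and `cyl:fuel`.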

What is the rationale for this design?

Analysis of variance is appropriate when a factor has more than two levels or when multiple factors are considered, because it checks in a single test whether the means of several groups are equal. The alternative would be to run multiple two-sample t-tests, but each additional test raises the chance of a false positive, that is, rejecting a null hypothesis that is actually true.
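
The cost of running many t-tests can be made concrete with a little arithmetic. The sketch below counts the pairwise comparisons among the nine cylinder levels and computes the chance of at least one false positive at a 5% significance level. It assumes the tests are independent, which pairwise t-tests on shared groups are not, so the result is a rough illustration rather than an exact rate.

```r
alpha <- 0.05   # significance level of each individual t-test (assumed)
k <- 9          # number of cylinder levels in the subset

# Number of distinct pairs of levels that would each need a t-test
n_tests <- choose(k, 2)

# Probability of at least one false positive across all tests,
# under the simplifying assumption that the tests are independent
familywise_error <- 1 - (1 - alpha)^n_tests

print(n_tests)
print(round(familywise_error, 2))
```

With 36 comparisons the chance of at least one spurious rejection is roughly 0.84 under this assumption, which is why a single F test is preferred.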

Randomize: What is the Randomization Scheme?

The data was collected in an unknown way so we do not know if there was any randomization to it.

Replicate: Are there replicates and/or repeated measures?

There are no replicates or repeated measures in the data. Each unique vehicle had its fuel economy statistics measured once.

Block: Did you use blocking in the design?

There was no blocking performed in the design of this experiment. All vehicles that contained the observed levels of each factor were analyzed together.

3. Statistical Analysis

Exploratory Data Analysis: Graphics and Descriptive summary

To start our statistical analysis we will make our variables factors for the analysis of variance and look at some boxplots of those factors.

#Defining "cyl" as a factor
datasub$cyl = as.factor(datasub$cyl)

#Defining "fuel" as a factor
datasub$fuel = as.factor(datasub$fuel)


#Boxplots of city gas mileage by each factor
boxplot(cty~cyl, data=datasub, xlab="Number of Cylinders", ylab="City Gas Mileage")

Figure: boxplot of city gas mileage by number of cylinders

boxplot(cty~fuel, data=datasub, xlab="Fuel Type", ylab="City Gas Mileage")

Figure: boxplot of city gas mileage by fuel type

The cylinder boxplot suggests an inverse relationship between city gas mileage and number of cylinders, while the fuel boxplot shows little apparent effect on city gas mileage. Both boxplots show a large number of outliers for certain levels of each variable, particularly 4 and 6 cylinders and the Premium and Regular fuel types.
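
The outliers seen in the boxplots can also be counted per group. The sketch below uses R's built-in `mtcars` data as a stand-in so that it runs without the fueleconomy package (an assumption made only for self-containment); in this analysis the same call would use `datasub$cty` and `datasub$cyl`. `boxplot.stats()` applies the 1.5 × IQR whisker rule that `boxplot()` uses by default.

```r
# Stand-in data: mtcars ships with R; mpg and cyl play the roles of cty and cyl

# Count the points that fall outside the boxplot whiskers for one group
count_outliers <- function(x) length(boxplot.stats(x)$out)

# Apply the count to each cylinder level
out_counts <- tapply(mtcars$mpg, mtcars$cyl, count_outliers)
print(out_counts)
```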

Testing

To test the hypotheses we perform an ANOVA test on the factors individually and then in combination.

The null hypothesis of the first two tests is that the single factor does not have an effect on the response variable of city gas mileage.

#Analysis of Variance for Factor "cyl"
cylmodel=aov(cty~cyl, data=datasub)
anova(cylmodel)
## Analysis of Variance Table
## 
## Response: cty
##              Df Sum Sq Mean Sq F value Pr(>F)    
## cyl           8 390755   48844    5574 <2e-16 ***
## Residuals 32101 281291       9                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Analysis of Variance for Factor "fuel"
fuelmodel=aov(cty~fuel, data=datasub)
anova(fuelmodel)
## Analysis of Variance Table
## 
## Response: cty
##              Df Sum Sq Mean Sq F value Pr(>F)    
## fuel          2  18986    9493     467 <2e-16 ***
## Residuals 32110 653099      20                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Analysis of Variance for Both Factors "cyl" and "fuel"
cylfuelmodel=aov(cty~cyl*fuel, data=datasub)
anova(cylfuelmodel)
## Analysis of Variance Table
## 
## Response: cty
##              Df Sum Sq Mean Sq F value Pr(>F)    
## cyl           8 390755   48844    6229 <2e-16 ***
## fuel          2  19485    9743    1243 <2e-16 ***
## cyl:fuel     11  10202     927     118 <2e-16 ***
## Residuals 32088 251604       8                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For each of our tests R reported a p-value below 2e-16. The p-value is the probability of obtaining an F value at least as large as the one observed if the null hypothesis were true. Since this probability is extremely close to 0, we reject the null hypothesis in each case and conclude that each factor, as well as their interaction, has an effect on city gas mileage.
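
The link between an F value and its p-value can be checked by hand. The sketch below takes the F value and degrees of freedom from the fuel-only table above and computes the upper tail of the F distribution with `pf()`. The variable names are chosen here for illustration.

```r
# Values copied from the fuel-only ANOVA table above
f_value  <- 467
df_fuel  <- 2
df_resid <- 32110

# Probability of an F value at least this large under the null hypothesis
p_value <- pf(f_value, df_fuel, df_resid, lower.tail = FALSE)

# ANOVA tables print p-values below machine epsilon (about 2.2e-16) as "<2e-16"
below_print_floor <- p_value < .Machine$double.eps
print(below_print_floor)
```

This is why the tables show `<2e-16` rather than an exact figure: the true tail probability is far smaller than the printing threshold.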

Diagnostics/Model Adequacy Checking

Next we draw Q-Q plots to check whether the residuals of each model are approximately normal. If the residuals are not normal, the results of the analysis may not be valid.

#QQ plots for residuals of cylinder only model
qqnorm(residuals(cylmodel))
qqline(residuals(cylmodel))

Figure: normal Q-Q plot of residuals, cylinder-only model

#QQ plots for residuals of fuel only model
qqnorm(residuals(fuelmodel))
qqline(residuals(fuelmodel))

Figure: normal Q-Q plot of residuals, fuel-only model

#QQ plots for residuals of combination fuel and cylinder model
qqnorm(residuals(cylfuelmodel))
qqline(residuals(cylfuelmodel))

Figure: normal Q-Q plot of residuals, combined cylinder and fuel model

We also use an interaction plot to visualize the interaction of the factors on the response variable. Some of the lines are roughly parallel, which would imply no interaction effect, but they also intersect in a few places, which leads us to believe there may be an interaction effect. This is consistent with the significant cyl:fuel term in the ANOVA table above.

# Interaction Plot of factors
interaction.plot(datasub$cyl,datasub$fuel,datasub$cty)

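Figure: interaction plot of city gas mileage by number of cylinders and fuel type

Lastly we plot the residuals of the combined model against its fitted values. The spread of the residuals does not appear to vary greatly across the fitted values.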

#Plot of Fitted vs Residuals of combination cylinder and fuel type model
plot(fitted(cylfuelmodel), residuals(cylfuelmodel))

Figure: residuals versus fitted values, combined cylinder and fuel model

Overall, the diagnostic results lead us to believe that our model is not adequate and does not explain the effect of the number of cylinders and fuel type on the variation in city gas mileage.

4. References to the literature

No literature was used

5. Appendices

Data can be found at https://github.com/hadley/fueleconomy

6. Contingencies

It is possible that the conclusions of our analysis are the result of chance. One concern is that the data for each car was collected only once. We do not know the conditions of the tests, and those conditions may have produced better or worse values for city gas mileage. Other factors are involved as well. Two vehicles could have exactly the same engine setup yet have different weights, thereby affecting their mileage.

It may be better to use a blocking factor to address some of these contingencies. By blocking the data by vehicle type (van, compact car, truck) we may be able to lower the variation in unknown factors such as weight and drag.
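
A blocked version of the model could be sketched as follows. The data frame `toy` is simulated purely for illustration (its class names, effect sizes and noise are assumptions, not results), and the formula shows one possible way to add vehicle class as a blocking term ahead of the two factors. With the real data the call would use `datasub` instead.

```r
set.seed(1)

# Simulated stand-in: 3 classes x 3 cylinder levels x 2 fuels x 5 vehicles each
toy <- expand.grid(
  class = c("Compact", "Van", "Truck"),
  cyl   = factor(c(4, 6, 8)),
  fuel  = c("Regular", "Premium"),
  rep   = 1:5
)

# Made-up response: mileage falls with cylinder count, plus random noise
toy$cty <- 30 - 2 * as.numeric(as.character(toy$cyl)) + rnorm(nrow(toy))

# Blocking term (class) enters first, then the two factors and their interaction
blocked_model <- aov(cty ~ class + cyl * fuel, data = toy)
blocked_table <- anova(blocked_model)
print(blocked_table)
```

Entering `class` first removes the variation between vehicle types from the residual before the effects of `cyl` and `fuel` are assessed.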