Recipe 2: Two factor multiple level design

Recipes for the Design of Experiments: Recipe Outline

as of August 28, 2014, superseding the version of August 24. Always use the most recent version.

Two factor multiple level design

Uzma Mushtaque

RPI

October 1, 2014; Version 2.1

1. Setting

System under test

This study designs a two-factor experiment, with three levels per factor, to examine how the ‘number of cylinders’ and the ‘type of transmission’ used in vehicles affect their fuel economy. The data contain two mileage measurements for each vehicle, one for city driving and one for highway driving, but only the ‘city’ values are used in this experiment. The analysis aims to find the effect of the number of cylinders (three levels: 4, 6 and 8) and the type of transmission (three levels: two automatic and one manual) on the mileage of a vehicle. For the analysis we take the subset of the data with make ‘Toyota’ and vehicles from the past 10 years only. We further subset the data by explicitly excluding certain transmission types, so that only the three transmission levels of interest remain.

install.packages("fueleconomy", repos='http://cran.us.r-project.org')
## Installing package into 'C:/Users/uzma/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)
## package 'fueleconomy' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\uzma\AppData\Local\Temp\RtmpUfwqTM\downloaded_packages
library(fueleconomy)
x<-vehicles
head(x)
##      id       make               model year                       class
## 1 27550 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 2 28426 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 3 27549 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 4 28425 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 5  1032 AM General Post Office DJ5 2WD 1985 Special Purpose Vehicle 2WD
## 6  1033 AM General Post Office DJ8 2WD 1985 Special Purpose Vehicle 2WD
##             trans            drive cyl displ    fuel hwy cty
## 1 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 2 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 3 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 4 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 5 Automatic 3-spd Rear-Wheel Drive   4   2.5 Regular  17  16
## 6 Automatic 3-spd Rear-Wheel Drive   6   4.2 Regular  13  13

Factors and Levels

This is a two-factor experiment with three levels per factor. The two factors are the ‘number of cylinders’ in a car (cyl; levels 4, 6 and 8) and its ‘transmission’ type (trans; levels Auto(AV-S6), Automatic 4-spd and Manual 6-spd). We use these levels to study the effect of each factor on the fuel economy of each vehicle.

tail(x)
##          id  make                             model year       class
## 33437 31064 smart   fortwo electric drive cabriolet 2011 Two Seaters
## 33438 33305 smart fortwo electric drive convertible 2013 Two Seaters
## 33439 34393 smart fortwo electric drive convertible 2014 Two Seaters
## 33440 31065 smart       fortwo electric drive coupe 2011 Two Seaters
## 33441 33306 smart       fortwo electric drive coupe 2013 Two Seaters
## 33442 34394 smart       fortwo electric drive coupe 2014 Two Seaters
##                trans            drive cyl displ        fuel hwy cty
## 33437 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  79  94
## 33438 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  93 122
## 33439 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  93 122
## 33440 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  79  94
## 33441 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  93 122
## 33442 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  93 122
summary(x)
##        id            make              model                year     
##  Min.   :    1   Length:33442       Length:33442       Min.   :1984  
##  1st Qu.: 8361   Class :character   Class :character   1st Qu.:1991  
##  Median :16724   Mode  :character   Mode  :character   Median :1999  
##  Mean   :17038                                         Mean   :1999  
##  3rd Qu.:25265                                         3rd Qu.:2008  
##  Max.   :34932                                         Max.   :2015  
##                                                                      
##     class              trans              drive                cyl       
##  Length:33442       Length:33442       Length:33442       Min.   : 2.00  
##  Class :character   Class :character   Class :character   1st Qu.: 4.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.00  
##                                                           Mean   : 5.77  
##                                                           3rd Qu.: 6.00  
##                                                           Max.   :16.00  
##                                                           NA's   :58     
##      displ          fuel                hwy             cty       
##  Min.   :0.00   Length:33442       Min.   :  9.0   Min.   :  6.0  
##  1st Qu.:2.30   Class :character   1st Qu.: 19.0   1st Qu.: 15.0  
##  Median :3.00   Mode  :character   Median : 23.0   Median : 17.0  
##  Mean   :3.35                      Mean   : 23.6   Mean   : 17.5  
##  3rd Qu.:4.30                      3rd Qu.: 27.0   3rd Qu.: 20.0  
##  Max.   :8.40                      Max.   :109.0   Max.   :138.0  
##  NA's   :57

Continuous variables (if any)

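Most of the numeric columns in the data set (year, cyl, hwy and cty) hold integer values, so they are recorded as discrete variables, although mileage is analyzed as a numeric response. Engine displacement (displ) is the only continuous variable, and it is not used in this analysis. The make and transmission columns are categorical variables.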

Response variables

The response variable is the mileage (in mpg) of each vehicle. However, there are two different values given in the data set for the mileage. One is for the highway (hwy) and the other for the city (cty). For analysis purposes we consider only the city mileage as the response variable in our experiment.

The Data: How is it organized and what does it look like?

The data set is the fuel economy data from the EPA. It covers the years 1984 to 2015 for various car models and contains 33,442 rows. Each row gives a detailed specification of one vehicle in 12 columns: id, make, model, year, class, trans, drive, cyl, displ, fuel, hwy and cty.

Randomization

We treat the data as effectively randomized because they come from vehicle testing carried out at the Environmental Protection Agency’s National Vehicle and Fuel Emissions Laboratory. Because almost every vehicle must pass this testing, the data are taken to be representative of the vehicle population as a whole. Note that the data are observational: vehicles were not randomly assigned to cylinder counts or transmission types.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

This is a factorial design experiment in which we consider two factors with multiple levels in order to analyze the main effects and the interaction effect. Our null hypothesis is that neither the number of cylinders, nor the type of transmission, nor their interaction has an effect on vehicle mileage in the city. The alternative hypothesis is that at least one of them does.
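
In the standard two-factor effects model, the null hypotheses state that the main effects and the interaction are all zero. The notation below is introduced here for illustration and does not appear in the original analysis: $y_{ijk}$ is the city mileage of the $k$-th vehicle with cylinder level $i$ and transmission level $j$, $\alpha_i$ and $\beta_j$ are the main effects, and $(\alpha\beta)_{ij}$ is the interaction.

$$
y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk},
\qquad \varepsilon_{ijk} \sim N(0, \sigma^2)
$$

$$
H_0^{\text{cyl}}: \alpha_i = 0 \ \text{for all } i, \qquad
H_0^{\text{trans}}: \beta_j = 0 \ \text{for all } j, \qquad
H_0^{\text{int}}: (\alpha\beta)_{ij} = 0 \ \text{for all } i, j
$$

Each null hypothesis is tested with its own F statistic, and the alternative in each case is that at least one of the corresponding terms is nonzero.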

What is the rationale for this design?

Increasing the number of cylinders in a vehicle may give it more power, which can lower its fuel economy. Similarly, the transmission type (manual or automatic) can affect the fuel economy of a vehicle. Because mileage is adversely affected in city traffic, we analyze the main effects and the interaction effect in this setting only.

Randomize: What is the Randomization Scheme?

No randomization scheme is applied in this study, since the data were collected beforehand. The data are assumed to be well randomized because essentially every vehicle is required to pass the fuel economy test at the EPA.

Replicate: Are there replicates and/or repeated measures?

Because each vehicle is tested once before it is sold, there are no repeated measures in the experiment. There are also no planned replicates, although several vehicles can share the same combination of cylinder count and transmission type.

3. (Statistical) Analysis

The statistical analysis uses analysis of variance (ANOVA), a test of whether the mean response differs among groups. It can be applied when there are more than two groups, and so generalizes the two-sample t-test to a more complex setting.
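
The link to the t-test can be checked numerically. The sketch below is illustrative only and does not use the fueleconomy data: it assumes R's built-in mtcars data set, with its two-level transmission indicator am as the grouping factor. With exactly two groups, the one-way ANOVA F statistic should equal the square of the pooled-variance t statistic.

```r
# Illustrative only: uses the built-in mtcars data, not the fueleconomy data.
# With two groups, one-way ANOVA and the pooled-variance t-test agree: F = t^2.
cars <- mtcars
cars$am <- factor(cars$am, labels = c("automatic", "manual"))

# Pooled-variance (equal-variance) two-sample t-test of mpg by transmission
tt <- t.test(mpg ~ am, data = cars, var.equal = TRUE)
t_stat <- unname(tt$statistic)

# One-way ANOVA on the same two groups
tab <- anova(aov(mpg ~ am, data = cars))
f_stat <- tab["am", "F value"]

# The two quantities should match up to floating-point error
print(c(t_squared = t_stat^2, F = f_stat))
```

The equality holds only for the equal-variance t-test; R's default Welch test (var.equal = FALSE) will generally give a slightly different statistic.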

(Exploratory Data Analysis) Graphics and descriptive summary

In the data analysis, we consider a subset of the data containing only vehicles from the past ten years (years after 2003). The analysis is carried out only for ‘Toyota’ vehicles, and some levels of the factor ‘transmission’ are explicitly removed in order to focus on three specific levels.

Y <- subset(x, year > 2003 & make == 'Toyota' &
  trans != 'Automatic (S6)' & trans != 'Automatic (S5)' &
  trans != 'Automatic (variable gear ratios)' & trans != 'Manual 5-spd' &
  trans != 'Auto(AV-S7)' & trans != 'Automatic 5-spd' & trans != 'Automatic (S4)')

Y$cyl=as.factor(Y$cyl) 
Y$trans=as.factor(Y$trans)  
summary(Y)
##        id            make              model                year     
##  Min.   :19390   Length:146         Length:146         Min.   :2004  
##  1st Qu.:21030   Class :character   Class :character   1st Qu.:2005  
##  Median :24498   Mode  :character   Mode  :character   Median :2008  
##  Mean   :25819                                         Mean   :2008  
##  3rd Qu.:30809                                         3rd Qu.:2011  
##  Max.   :34724                                         Max.   :2014  
##     class                       trans        drive           cyl    
##  Length:146         Auto(AV-S6)    :  4   Length:146         4:103  
##  Class :character   Automatic 4-spd:100   Class :character   6: 39  
##  Mode  :character   Manual 6-spd   : 42   Mode  :character   8:  4  
##                                                                     
##                                                                     
##                                                                     
##      displ          fuel                hwy          cty      
##  Min.   :1.50   Length:146         Min.   :16   Min.   :13.0  
##  1st Qu.:1.80   Class :character   1st Qu.:20   1st Qu.:16.0  
##  Median :2.45   Mode  :character   Median :26   Median :20.0  
##  Mean   :2.70                      Mean   :26   Mean   :20.5  
##  3rd Qu.:3.50                      3rd Qu.:31   3rd Qu.:25.0  
##  Max.   :4.70                      Max.   :39   Max.   :40.0
# Boxplots

boxplot(cty~cyl,data=Y)

[Figure: boxplots of city mileage (cty) by number of cylinders (cyl)]

boxplot(cty~trans,data=Y)

[Figure: boxplots of city mileage (cty) by transmission type (trans)]
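
The level counts in summary(Y) are very uneven (for example, only 4 vehicles have 8 cylinders), so it is worth checking how many observations fall in each cylinder-transmission combination before fitting a two-factor model. The sketch below uses a small made-up data frame (toy, with hypothetical factors A and B and a simulated response y), not the fueleconomy data. It shows how empty cells appear in a cross-tabulation and how they reduce the degrees of freedom available for the interaction term. On the real data the analogous check would be table(Y$cyl, Y$trans).

```r
# Hypothetical toy data (not the fueleconomy data): two 3-level factors,
# with level a3 observed only in combination with b2, so two cells are empty.
cells <- data.frame(
  A = c("a1", "a1", "a1", "a2", "a2", "a2", "a3"),
  B = c("b1", "b2", "b3", "b1", "b2", "b3", "b2")
)
toy <- cells[rep(seq_len(nrow(cells)), each = 2), ]  # two observations per cell
toy$A <- factor(toy$A)
toy$B <- factor(toy$B)

set.seed(1)                                  # reproducible simulated response
toy$y <- rnorm(nrow(toy), mean = 20, sd = 3)

# Cross-tabulate the two factors; zeros mark combinations with no data
cell_counts <- table(toy$A, toy$B)
print(cell_counts)

# A full 3 x 3 layout has (3 - 1) * (3 - 1) = 4 interaction degrees of freedom;
# with empty cells, fewer interaction contrasts can be estimated.
tab <- anova(aov(y ~ A * B, data = toy))
print(tab[, "Df", drop = FALSE])
```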

Testing

Analysis of variance for the factor cylinder

model=aov(cty~cyl,data=Y) 
anova(model)
## Analysis of Variance Table
## 
## Response: cty
##            Df Sum Sq Mean Sq F value Pr(>F)    
## cyl         2   1620     810    52.4 <2e-16 ***
## Residuals 143   2212      15                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Analysis of variance for the factor transmission

model=aov(cty~trans,data=Y) 
anova(model)
## Analysis of Variance Table
## 
## Response: cty
##            Df Sum Sq Mean Sq F value Pr(>F)    
## trans       2   1287     644    36.1  2e-13 ***
## Residuals 143   2545      18                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Analysis of variance for both factors: cylinder and transmission

model=aov(cty~cyl*trans,data=Y) 
anova(model)
## Analysis of Variance Table
## 
## Response: cty
##            Df Sum Sq Mean Sq F value Pr(>F)    
## cyl         2   1620     810   87.10 <2e-16 ***
## trans       2    891     445   47.90 <2e-16 ***
## cyl:trans   2     28      14    1.52   0.22    
## Residuals 139   1293       9                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

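The ANOVA for each factor taken separately returns a very small p-value (below 2e-16 for cylinders and 2e-13 for transmission), and both main effects remain highly significant in the two-factor model. In each of these cases there is therefore a very small probability that the observed variation in city gas mileage across the levels of the factor is due to chance alone. The interaction term, however, is not significant (F = 1.52, p = 0.22), so the data give no evidence that the effect of the number of cylinders depends on the transmission type. Two features of the two-factor table are worth noting. First, the interaction has 2 degrees of freedom rather than the (3 - 1)(3 - 1) = 4 expected for a full 3 x 3 layout, which indicates that some cylinder-transmission combinations contain no vehicles. Second, because the design is unbalanced, the sequential sums of squares reported by anova() depend on the order of the terms: transmission accounts for 1287 when fitted alone but 891 when fitted after cylinders.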
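
As a check on how the table entries relate, each F value is the term's mean square divided by the residual mean square, and the p-value is the upper tail of the corresponding F distribution. The sketch below recomputes these quantities from the rounded sums of squares printed in the two-factor table above, so small rounding differences from R's output are expected. The variable names are chosen here for illustration.

```r
# Sums of squares and degrees of freedom copied (rounded) from the
# two-factor ANOVA table for cty ~ cyl * trans
ss  <- c(cyl = 1620, trans = 891, "cyl:trans" = 28)
dof <- c(cyl = 2, trans = 2, "cyl:trans" = 2)
ss_resid  <- 1293
dof_resid <- 139

ms       <- ss / dof              # mean square of each term
ms_resid <- ss_resid / dof_resid  # residual mean square
f_value  <- ms / ms_resid         # F statistic of each term

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df1 = dof, df2 = dof_resid, lower.tail = FALSE)

print(round(cbind(F = f_value, p = p_value), 3))
```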

Diagnostics/Model Adequacy Checking

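In this section we check the adequacy of the ANOVA model fitted with both factors and their interaction. A normal Q-Q plot of the residuals checks the normality assumption, and a plot of the residuals against the fitted values checks for constant variance.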

qqnorm(residuals(model))
qqline(residuals(model))

[Figure: normal Q-Q plot of the model residuals, with reference line]

plot(fitted(model),residuals(model))

[Figure: model residuals plotted against fitted values]
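
Graphical checks of this kind can be complemented by formal tests. The following is a minimal sketch, assuming the built-in mtcars data as a stand-in for Y (with cyl and am playing the roles of the two factors). shapiro.test() and bartlett.test() are from base R's stats package; their use here is a suggestion and not part of the original analysis.

```r
# Stand-in example on built-in mtcars (not the fueleconomy subset Y)
cars <- mtcars
cars$cyl  <- factor(cars$cyl)
cars$am   <- factor(cars$am, labels = c("automatic", "manual"))
cars$cell <- interaction(cars$cyl, cars$am)  # one level per factor combination

fit <- aov(mpg ~ cyl * am, data = cars)

# Normality of residuals: a small p-value suggests departure from normality
norm_test <- shapiro.test(residuals(fit))

# Homogeneity of variance across the factor-level combinations
var_test <- bartlett.test(mpg ~ cell, data = cars)

print(norm_test$p.value)
print(var_test$p.value)
```

Bartlett's test is itself sensitive to non-normality, so the two results are best read together with the plots instead of in isolation.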

Interaction Plot

interaction.plot(Y$cyl,Y$trans,Y$cty)

[Figure: interaction plot of mean city mileage (cty) against number of cylinders (cyl), one line per transmission type (trans)]
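
An interaction plot draws the mean response for each combination of factor levels, so the same information can be read as a table of cell means. A small sketch follows, assuming the built-in mtcars data as a stand-in for Y. On the real data the analogous call would be tapply(Y$cty, list(Y$cyl, Y$trans), mean).

```r
# Stand-in example on built-in mtcars (not the fueleconomy subset Y)
cars <- mtcars
cars$cyl <- factor(cars$cyl)
cars$am  <- factor(cars$am, labels = c("automatic", "manual"))

# Mean mpg for every cyl x am combination; an empty cell would appear as NA
cell_means <- with(cars, tapply(mpg, list(cyl = cyl, am = am), mean))
print(round(cell_means, 2))
```

Roughly constant differences between the columns across rows correspond to near-parallel lines in the plot, which is what a non-significant interaction term would lead one to expect.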

5. Appendices

The data from the fueleconomy data set is available at https://github.com/hadley/fueleconomy.