Anthony D’Amato

RPI

11/30/2014

1. Setting

EPA Fuel Economy Data

This experimental data analysis looks at a data set which contains fuel economy data for a wide variety of vehicles. This data was obtained by the EPA. Below I load the CSV file containing the data, and construct a figure containing the first 6 lines of the data set followed by another figure representing the structure of the data.

CSV<-read.csv("C:\\Users\\Anthony\\Desktop\\School\\RPI Year 1\\DoE\\R10.csv",header=TRUE)
head(CSV)

##   cyl year displ drive fuelType hwy
## 1   4 1984   1.6     1        1  27
## 2   4 1984   1.6     1        1  32
## 3   4 1984   2.2     1        1  31
## 4   4 1984   2.2     1        1  32
## 5   4 1984   2.2     1        1  28
## 6   4 1984   2.2     1        1  31

str(CSV)

## 'data.frame':    33126 obs. of  6 variables:
##  $ cyl     : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ year    : int  1984 1984 1984 1984 1984 1984 1984 1984 1984 1984 ...
##  $ displ   : num  1.6 1.6 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 ...
##  $ drive   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ fuelType: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hwy     : int  27 32 31 32 28 31 32 24 35 33 ...

Factors and Levels

This data set consists of five factors: Cylinders which describes the number of cylinders in the cars engine. Year which describes the year that the car was fabricated in. Displacement (denoted as “displ”) which is a measurement of the engine displacement. Drive which describes the drive of the car. Fuel Type which describes the type of fuel that each car utilizes. The different factors have varying numbers of levels, and the data will be modified to create a data set in which the factors are categorical as opposed to continuous.

The response variable for this experimental data analysis is the highway gas mileage of each car which will be denoted as “hwy” in this analysis.

Below I modify the factor “cyl” by grouping vehicles with 2-6 cylinders into the factor level “1”, vehicles with 6-10 cylinders into the factor level “2”, and vehicles with more than 10 cylinders into the factor level “3”.

CSV$cyl[as.numeric(CSV$cyl)>= 2 & as.numeric(CSV$cyl) <6]="1"
CSV$cyl[as.numeric(CSV$cyl)>= 6 & as.numeric(CSV$cyl) <10]="2"
CSV$cyl[as.numeric(CSV$cyl)>= 10]="3"
unique(CSV$cyl)

## [1] "1" "2" "3"

Below I modify the factor “year” by grouping vehicles from each decade from the 1980s-2010s into categorical factor levels 1-4.

CSV$year[as.numeric(CSV$year)>= 1980 & as.numeric(CSV$year) <= 1990] = "1"
CSV$year[as.numeric(CSV$year)>= 1991 & as.numeric(CSV$year) <= 2000] = "2"
CSV$year[as.numeric(CSV$year)>= 2001 & as.numeric(CSV$year) <= 2010] = "3"
CSV$year[as.numeric(CSV$year)>= 2011 & as.numeric(CSV$year) <= 2015] = "4"
unique(CSV$year)

## [1] "1" "4" "3" "2"

Below I modify the factor “displacement” by grouping vehicles with engine displacements between 0 and 2.8 into the factor level “1”, vehicles with engine displacements between 2.9 and 5.6 into the factor level “2”, and vehicles with engine displacements between 5.7 and 8.4 into the factor level “3”.

CSV$displ[as.numeric(CSV$displ)>=0 & as.numeric(CSV$displ)<=2.8]="1"
CSV$displ[as.numeric(CSV$displ)>=2.9 & as.numeric(CSV$displ)<=5.6]="2"
CSV$displ[as.numeric(CSV$displ)>=5.7 & as.numeric(CSV$displ)<=8.4]="3"
unique(CSV$displ)

## [1] "1" "3" "2"

Prior to this experimental analysis I changed the data containing CSV file to modify the levels of the factor “drive” by grouping vehicles with 2-wheel, rear-wheel, and front-wheel drives into the factor level “1”, vehicles with 4-wheel drive into the factor level “2”, and vehicles with all-wheel or 4-wheel/all-wheel drives into the factor level “3”. I did the same with the factor “fuel type” by grouping vehicles that use diesel fuel into the factor level “1”, vehicles that use either regular, midgrade, or preium gasoline into the factor level “2”, and vehicles that use gasoline/E85, gasoline/natural gas, or gasoline/propane into the factor level “3”.

Below I set all five modified factors as factors for R to analyze.

cyl<-as.factor(CSV$cyl)
year<-as.factor(CSV$year)
displ<-as.factor(CSV$displ)
drive<-as.factor(CSV$drive)
fuel<-as.factor(CSV$fuelType)

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?
In this experimental data analysis I will use an analysis of variance (ANOVA) to determine if there are significant effects of each of these five factors on the response variable of highway gas mileage. The null hypothesis (H0) for each variable is that they do not have a significant effect on the highway gas mileage of the vehicle.

To begin this experimenal analysis I will create a linear model of the data and perform an ANOVA to determine the main effects of each factor on the response variable. After that I will use Taguchi design methods to create an orthogonal array of the data set and determine if this method has a significant effect on the outcome of the ANOVA.

What is the rationale for this design?
I have chosen to use this experimental design to demonstrate proper usage of Taguchi Deisgn methods to analyze the effects of five factors on a single response variable.

3. Statistical Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

Below are boxplots showing any evident trends in the response variable with respect to each of the five factors.

par(mfrow=c(1,1))
boxplot(CSV$hwy~cyl,xlab="Cylinders",ylab="Highway Gas Mileage (MPG)")

plot of chunk unnamed-chunk-6

boxplot(CSV$hwy~year,xlab="Year",ylab="Highway Gas Mileage (MPG)")

plot of chunk unnamed-chunk-6

boxplot(CSV$hwy~displ,xlab="Engine Displacement",ylab="Highway Gas Mileage (MPG)")

plot of chunk unnamed-chunk-6

boxplot(CSV$hwy~drive,xlab="Drive",ylab="Highway Gas Mileage (MPG)")

plot of chunk unnamed-chunk-6

boxplot(CSV$hwy~fuel,xlab="Fuel Type",ylab="Highway Gas Mileage (MPG)")

plot of chunk unnamed-chunk-6 Below I create a linear model of the data for use in the initial ANOVA.

model=lm(CSV$hwy~cyl+year+displ+drive+fuel,data=CSV)
anova(model)

## Analysis of Variance Table
## 
## Response: CSV$hwy
##              Df Sum Sq Mean Sq F value Pr(>F)    
## cyl           2 406059  203029   14901 <2e-16 ***
## year          3  85437   28479    2090 <2e-16 ***
## displ         2  46716   23358    1714 <2e-16 ***
## drive         1  53739   53739    3944 <2e-16 ***
## fuel          2  37633   18816    1381 <2e-16 ***
## Residuals 33115 451207      14                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As seen above in the results of the ANOVA we can reject the null hypothesis for each of the five factors. Therefore, we can say that each factor has an effect on the response variable that can be attributed to something other than randomization.

Below I construct a Taguchi design based on the number of levels of the factor “Year” because this factor had four levels as opposed to all other factors that only had three levels.

library(qualityTools)

## Warning: package 'qualityTools' was built under R version 3.1.2

library(DoE.base)

## Warning: package 'DoE.base' was built under R version 3.1.2

## Loading required package: grid
## Loading required package: conf.design

## Warning: package 'conf.design' was built under R version 3.1.2

## 
## Attaching package: 'DoE.base'
## 
## The following objects are masked from 'package:stats':
## 
##     aov, lm
## 
## The following object is masked from 'package:graphics':
## 
##     plot.design

Array=oa.design(factor.names=c("cyl","year","displ","drive","fuelType"),nlevels=c(3,4,3,3,3),columns="min3")
Array

##    cyl year displ drive fuelType
## 1    3    2     1     1        1
## 2    3    4     1     3        3
## 3    2    1     2     2        1
## 4    1    1     1     1        1
## 5    1    4     2     3        1
## 6    1    4     2     1        3
## 7    1    2     2     2        1
## 8    2    3     1     3        1
## 9    1    1     2     3        3
## 10   3    1     3     3        1
## 11   2    4     1     2        2
## 12   3    3     3     1        2
## 13   3    4     2     3        2
## 14   3    2     3     2        2
## 15   1    4     3     1        2
## 16   1    3     3     3        3
## 17   2    2     2     2        3
## 18   3    4     1     2        1
## 19   1    3     1     2        2
## 20   3    2     3     3        3
## 21   2    4     3     2        3
## 22   1    1     3     2        2
## 23   2    2     2     1        2
## 24   3    3     2     2        3
## 25   2    1     3     1        3
## 26   3    1     1     2        3
## 27   2    3     1     1        3
## 28   3    3     2     1        1
## 29   3    1     2     1        2
## 30   1    3     3     2        1
## 31   2    3     2     3        2
## 32   1    2     1     3        2
## 33   2    4     3     1        1
## 34   2    2     3     3        1
## 35   1    2     1     1        3
## 36   2    1     1     3        2
## class=design, type= oa

Below I use the orthogonal array seen above to select the experimental runs from the dataset that correspond to the factor assignments seen in the array.

NewData=merge(Array,CSV,by=c("cyl","year","displ","drive","fuelType"),all=FALSE)
head(NewData)

##   cyl year displ drive fuelType hwy
## 1   1    1     1     1        1  39
## 2   1    1     1     1        1  25
## 3   1    1     1     1        1  39
## 4   1    1     1     1        1  33
## 5   1    1     1     1        1  27
## 6   1    1     1     1        1  39

Below I remove rows that are duplicates of factor combinations. I then use the rownames() call to determine which rows meet this criteria.

RemDup=unique(NewData[,1:5])
RemDup

##      cyl year displ drive fuelType
## 1      1    1     1     1        1
## 123    1    2     1     1        3
## 124    1    2     1     3        2
## 677    1    3     1     2        2
## 696    2    1     1     3        2
## 829    2    2     2     1        2
## 3837   2    2     3     3        1
## 3910   2    3     1     1        3
## 3939   2    3     2     3        2
## 5884   2    4     3     2        3
## 5891   3    1     2     1        2
## 5918   3    3     3     1        2
## 6063   3    4     2     3        2

rownames(RemDup)

##  [1] "1"    "123"  "124"  "677"  "696"  "829"  "3837" "3910" "3939" "5884"
## [11] "5891" "5918" "6063"

Below I retrieve the response variable values for each of these experimental runs, and then create a new orthogonal array with the desired experimental runs and their corresponding response values.

NewHwy=NewData$hwy[index=c(1,123,124,677,696,829,3837,3910,3939,5884,5891,5918,6063)]
NewArray=cbind(RemDup,NewHwy)
NewArray

##      cyl year displ drive fuelType NewHwy
## 1      1    1     1     1        1     39
## 123    1    2     1     1        3     25
## 124    1    2     1     3        2     24
## 677    1    3     1     2        2     22
## 696    2    1     1     3        2     23
## 829    2    2     2     1        2     16
## 3837   2    2     3     3        1     18
## 3910   2    3     1     1        3     27
## 3939   2    3     2     3        2     23
## 5884   2    4     3     2        3     18
## 5891   3    1     2     1        2     14
## 5918   3    3     3     1        2     22
## 6063   3    4     2     3        2     19

Below I calcualate the signal to noise (S/N) ratio for all experimental runs.

SN=-10*log10(1/NewArray$NewHwy^2)
SN

##  [1] 31.82 27.96 27.60 26.85 27.23 24.08 25.11 28.63 27.23 25.11 22.92
## [12] 26.85 25.58

As seen in the results of the S/N calculations above, the first experimental run which corresponds to row number 1 had the highest signal to noise ratio. This indicates that the best way to achieve higher highway gas mileage for a vehicle is to operate a vehicle that has those corresponding factor assignments.

Below I create a new linear model for the data in our newly created orthogonal array.

Newmodel=lm(NewArray$NewHwy~NewArray$cyl+NewArray$year+NewArray$displ+NewArray$drive+NewArray$fuelType,data=NewArray)
anova(Newmodel)

## Analysis of Variance Table
## 
## Response: NewArray$NewHwy
##                   Df Sum Sq Mean Sq F value Pr(>F)   
## NewArray$cyl       2  168.3    84.1    7025 0.0084 **
## NewArray$year      3   96.0    32.0    2673 0.0142 * 
## NewArray$displ     2   13.0     6.5     544 0.0303 * 
## NewArray$drive     2  179.5    89.8    7496 0.0082 **
## NewArray$fuelType  2   11.9     5.9     495 0.0318 * 
## Residuals          1    0.0     0.0                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As seen above in the results of the ANOVA performed on the fractional design of the original dataset, we fail to reject the null hypothesis for all five factors. Thus we cannot attribute any changes in the response variable to anything other than randomization in this case. Because these results are so different from the results of the ANOVA performed on the full factorial design earlier it can be concluded that either this is too small of a fraction of the data to reach any significant conclusions, or that the manipulations of the factors into three or four levels may have had adverse effects on the outcome of this experimental data analysis.

Diagnostics/Model Adequacy Checking

Shapiro-Wilk Test

Below I perform a Shapiro-Wilk Test to determine if the data set is normally distributed.

NewArray$NewHwy=as.numeric(NewArray$NewHwy)
shapiro.test(NewArray$NewHwy)

## 
##  Shapiro-Wilk normality test
## 
## data:  NewArray$NewHwy
## W = 0.8756, p-value = 0.06225

The null hypothesis for a Shapiro-Wilk test is that the data does not deviate from normality. Thus the resulting P-value which is greater than the alpha level of 0.05 indicates that we cannot reject the null hypothesis and thus the data is considered to follow a normal distribution.

To further study the model adequacy of the experimental data analysis that I performed I create two plots below. The first plot is a quantile-quantile (QQ) plot that is used to determine if the residuals of the data set follow a normal distribution. Since the residuals are nearly linear it seems that this is an adequate model.

After the QQ plot I created a Residuals vs. Fits plot which is used to identify the linearity of the residual values and to determine if there are any outlying values. Because the residual values seem to be scattered around zero for this model it can be concluded that the model used in this analysis is accurate for determining the effect of the five factors on the response variable of highway gas mileage.

par(mfrow=c(1,2))
qqnorm(residuals(Newmodel))
qqline(residuals(Newmodel))
plot(fitted(Newmodel),residuals(Newmodel))

plot of chunk unnamed-chunk-16

4.Appendices

The data for this experimental analysis was collected by the United States EPA and can be found at https://github.com/hadley/fueleconomy .