This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

Recipes for the Design of Experiments: Recipe Outline

as of August 28, 2014, superceding the version of August 24. Always use the most recent version.

Design of Experiments: Recipe 10 - Taguchi Design of Experiments

Kevin Toth

RPI

12/08/2014 V2.0

1. Setting

Fuel Economy Data

This analysis uses fuel economy data to test the effect of 5 factors on the response variable of highway fuel economy us a Taguchi Design of experiment.

Below is the installation and initial examination of the dataset:

#Installing data package
install.packages("fueleconomy", repos='http://cran.us.r-project.org')
## Installing package into 'C:/Users/tothk2/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)
## package 'fueleconomy' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\tothk2\AppData\Local\Temp\RtmpAh9Ain\downloaded_packages
library("fueleconomy", lib.loc="~/R/win-library/3.1")
## Warning: package 'fueleconomy' was built under R version 3.1.2
data<-vehicles
attach(data)
## The following object is masked _by_ .GlobalEnv:
## 
##     model

Factors and Levels

For this experiment we will have 5 factors which are year, fuel type, displacement, cylinders and drive.

Year will have 16 levels, which will be each year from 2000 to 2016.

Fuel type has three levels which are Regular, Premium and Diesel.

Cylinders has three levels which are <6, 6-10 and >=10 cylinders.

Displacement has 4 levels which are <2, 2-4, 4-6 and >6

Lastly drive has 4 levels which are 4-wheel drive, 2-wheel drive, rear wheel drive, front wheel drive.

#Subsetting Cylinders into 3 levels
data$cyl[as.numeric(data$cyl)>= 2 & as.numeric(data$cyl) < 6 ] ="<6"
data$cyl[as.numeric(data$cyl)>= 6 & as.numeric(data$cyl) < 10 ] = "6-10"
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
data$cyl[as.numeric(data$cyl)>= 10] =">10"
## Warning: NAs introduced by coercion
#Subsetting Displacement into 4 levels
data$displ[data$displ >= 0 & data$displ < 3] = "<3"
data$displ[data$displ >= 3 & data$displ < 6] = "3-6"
data$displ[data$displ >= 6] = ">6"

#Subset from year 2000 and on
datasub <- subset(data, year >= 2000)
datasub$year[datasub$year >= 2000 & datasub$year < 2005] = "2000-2005"
datasub$year[datasub$year >= 2005 & datasub$year < 2010] = "2005-2010"
datasub$year[datasub$year >= 2010] = ">2010"

#Subset Fuel Types
datasub<-datasub[datasub$fuel %in% c("Regular","Premium","Diesel"), ]

#Subset Drive Types
datasub<-datasub[datasub$drive %in% c("Front-Wheel Drive", "Rear-Wheel Drive", "4-Wheel Drive","2-Wheel Drive"), ]


#Levels of Year
unique(datasub$year)
## [1] "2000-2005" ">2010"     "2005-2010"
#Levels of each Fuel Type
unique(datasub$fuel)
## [1] "Premium" "Regular" "Diesel"
# Levels of Drive Type
unique(datasub$drive)
## [1] "Front-Wheel Drive" "Rear-Wheel Drive"  "4-Wheel Drive"    
## [4] "2-Wheel Drive"
#Levels of Number of Cylinders
unique(datasub$cyl)
## [1] "6-10" "<6"   ">10"
#Levels of Displacement
unique(datasub$displ)
## [1] "3-6" "<3"  ">6"
head(datasub)
##       id  make model      year        class           trans
## 24 16573 Acura 3.2CL 2000-2005 Compact Cars  Automatic (S5)
## 25 17489 Acura 3.2CL 2000-2005 Compact Cars  Automatic (S5)
## 26 18458 Acura 3.2CL 2000-2005 Compact Cars    Manual 6-spd
## 27 18459 Acura 3.2CL 2000-2005 Compact Cars  Automatic (S5)
## 29 15871 Acura 3.2TL 2000-2005 Midsize Cars Automatic 5-spd
## 30 16734 Acura 3.2TL 2000-2005 Midsize Cars  Automatic (S5)
##                drive  cyl displ    fuel hwy cty
## 24 Front-Wheel Drive 6-10   3-6 Premium  27  17
## 25 Front-Wheel Drive 6-10   3-6 Premium  27  17
## 26 Front-Wheel Drive 6-10   3-6 Premium  26  17
## 27 Front-Wheel Drive 6-10   3-6 Premium  27  17
## 29 Front-Wheel Drive 6-10   3-6 Premium  27  17
## 30 Front-Wheel Drive 6-10   3-6 Premium  27  17
tail(datasub)
##          id  make        model      year       class           trans
## 33431 26294 smart fortwo coupe 2005-2010 Two Seaters Automatic (AM5)
## 33432 29812 smart fortwo coupe     >2010 Two Seaters       Auto(AM5)
## 33433 30918 smart fortwo coupe     >2010 Two Seaters       Auto(AM5)
## 33434 32172 smart fortwo coupe     >2010 Two Seaters       Auto(AM5)
## 33435 32357 smart fortwo coupe     >2010 Two Seaters       Auto(AM5)
## 33436 34460 smart fortwo coupe     >2010 Two Seaters       Auto(AM5)
##                  drive cyl displ    fuel hwy cty
## 33431 Rear-Wheel Drive  <6    <3 Premium  41  33
## 33432 Rear-Wheel Drive  <6    <3 Premium  41  33
## 33433 Rear-Wheel Drive  <6    <3 Premium  41  33
## 33434 Rear-Wheel Drive  <6    <3 Premium  38  34
## 33435 Rear-Wheel Drive  <6    <3 Premium  38  34
## 33436 Rear-Wheel Drive  <6    <3 Premium  38  34
summary(datasub)
##        id            make              model               year          
##  Min.   :15589   Length:11085       Length:11085       Length:11085      
##  1st Qu.:19432   Class :character   Class :character   Class :character  
##  Median :23901   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :24692                                                           
##  3rd Qu.:30656                                                           
##  Max.   :34920                                                           
##     class              trans              drive          
##  Length:11085       Length:11085       Length:11085      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##      cyl               displ               fuel                hwy      
##  Length:11085       Length:11085       Length:11085       Min.   :11.0  
##  Class :character   Class :character   Class :character   1st Qu.:22.0  
##  Mode  :character   Mode  :character   Mode  :character   Median :25.0  
##                                                           Mean   :25.6  
##                                                           3rd Qu.:29.0  
##                                                           Max.   :61.0  
##       cty      
##  Min.   : 7.0  
##  1st Qu.:15.0  
##  Median :18.0  
##  Mean   :18.5  
##  3rd Qu.:21.0  
##  Max.   :53.0

Continuous variables (if any)

The continuous variables in this data are the city (“cty”) and highway (“hwy”) gas mileage of each vehicle. Highway mileage ranges from 9 to 109 and City mileage ranges from 6 to 138.

Response variables

For this experiment we will be focusing on the highway gas mileage for the response variable being effected by the chosen factors.

The Data: How is it organized and what does it look like?

The data set is organized by the following variables: id, make, model, year, class, trans, drive, cyl, displ, fuel, hwy, cty

Make, model and class are indications of the manufacturer and type of the vehicle such as Audi and Ford and the model of the vehicle is model from that manufacturere such as Passat or Gran Prix. Class indicates vehicle type such as compact car or van.

Year is simply the year that vehicle was manufactured.

trans, drive, cyl, and displ all describe the type of set up the car has, mostly relating to the engine. The trans is the transmission which is thinks like Automatic 9-spd or Manual 5-spd. The drive describes the type of wheel drive like All-wheel or front-wheel. Cyl is the number of cylinders the engine has and displ shows the displacement in liters of the engine.

Fuel is the type of fuel the engine uses such as Regular or Premium.

Cty and hwy are the gas mileage for city driving and gas mileage for highway driving.

Randomization

It is unknown whether or not the data collected for this study was collected by a randomly designed experiment.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

To perform the experiment the data will first be altered so that our data has a reduced number of levels to decrease complexity. A linear model will be created and an analysis of variance will be performed on that model. We will then create an orthogonal taguchi design to test our anova. From these analysis we will be able to test the null hypothesis, that highway gas mileage is independant of drive type, year, cylinders, displacement and fuel type.

What is the rationale for this design?

The rationale for using an analysis of variance test is used when multiple factors are considered. It checks whether the means of several groups are equal. The alternative would be to use multiple two-sample t-tests however there is more likely chance of the test resulting in a false hypothesis.

Randomize: What is the Randomization Scheme?

The data was collected in an unknown way so we do not know if there was any randomization to it.

Replicate: Are there replicates and/or repeated measures?

There are no replicated or repeated measures in the data. Each unique vehicle had it’s fuel economy statistics measured once.

Block: Did you use blocking in the design?

There was no blocking.

3. Statistical Analysis

Exploratory Data Analysis: Graphics and Descriptive summary

To start our statistical analysis we will make our variables factors for the analysis of variance and look at some boxplots of those factors.

#Defining Fuel Type as a factor
datasub$fuel = as.factor(datasub$fuel)

#Defining Year as a factor
datasub$year = as.factor(datasub$year)

#Defining Drive Type as a factor
datasub$drive = as.factor(datasub$drive)

#Defining Cylinders as a factor
datasub$cyl = as.factor(datasub$cyl)

#Defining Displacement as a factor
datasub$displ = as.factor(datasub$displ)


#Boxplots of the means of each factor against response variable
boxplot(hwy~fuel, data=datasub, xlab="Fuel Type", ylab="Highway Fuel Economy")

plot of chunk unnamed-chunk-3

boxplot(hwy~year, data=datasub, xlab="Year", ylab="Highway Fuel Economy")

plot of chunk unnamed-chunk-3

boxplot(hwy~drive, data=datasub, xlab="Drive Type", ylab="Highway Fuel Economy")

plot of chunk unnamed-chunk-3

boxplot(hwy~cyl, data=datasub, xlab="Number of Cylinders", ylab="Highway Fuel Economy")

plot of chunk unnamed-chunk-3

boxplot(hwy~displ, data=datasub, xlab="Enginge Displacement", ylab="Highway Fuel Economy")

plot of chunk unnamed-chunk-3

We can see a positibve relationship between highway mileage and years and inverse relationships with cylinders and displacement. There are no discernible relationships with fuel type or number of cylinders.

Testing

To test the hypotheses we perform an ANOVA test on the linear model of the factors

The null hypothesis of the tests is that the factors do not have an effect on the response variable of highway gas mileage.

#Analysis of Variance
model=aov(hwy~year+drive+cyl+displ+fuel, data=datasub)
anova(model)
## Analysis of Variance Table
## 
## Response: hwy
##              Df Sum Sq Mean Sq F value Pr(>F)    
## year          2  21558   10779     941 <2e-16 ***
## drive         3 140501   46834    4089 <2e-16 ***
## cyl           2  47139   23569    2058 <2e-16 ***
## displ         2   5862    2931     256 <2e-16 ***
## fuel          2  15787    7894     689 <2e-16 ***
## Residuals 11073 126811      11                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For each of our individual tests we found resulting p-values of 2.2e-16. The p-value represents the probability that we can get the F value using the null hypothesis. Since our probability is extremely close to 0 we can assume that each factor demonstrates an effect on the response variable. We are lead to reject the null hypothesis.

Next we will load another R library in order to create an orthogonal array for a taguchi design of experiment.

library(DoE.base)
## Warning: package 'DoE.base' was built under R version 3.1.2
## Loading required package: grid
## Loading required package: conf.design
## Warning: package 'conf.design' was built under R version 3.1.2
## 
## Attaching package: 'DoE.base'
## 
## The following objects are masked from 'package:stats':
## 
##     aov, lm
## 
## The following object is masked from 'package:graphics':
## 
##     plot.design
#Defining factors as integers
datasub$fuel = as.integer(datasub$fuel)
datasub$year = as.integer(datasub$year)
datasub$drive = as.integer(datasub$drive)
datasub$cyl = as.integer(datasub$cyl)
datasub$displ = as.integer(datasub$displ)

fuelarray = oa.design(factor.names=c("year", "drive", "cyl","displ","fuel"), nlevels=c(3,4,3,3,3), columns="min3")

Then we combine our experimental array with the original dataset to get the results of the experimental runs.

dataarray = merge(fuelarray, datasub, by=c("year","drive","cyl","displ","fuel"), all=FALSE)
head(dataarray)
##   year drive cyl displ fuel    id       make               model
## 1    1     2   1     1    3 30777       Ford   Escape Hybrid 4WD
## 2    1     2   1     1    3 29946    Hyundai        Santa Fe 4WD
## 3    1     2   1     1    3 32461        Kia        Sportage 4WD
## 4    1     2   1     1    3 29464    Mercury  Mariner Hybrid 4WD
## 5    1     2   1     1    3 33129 Mitsubishi Outlander Sport 4WD
## 6    1     2   1     1    3 31969       Jeep         Compass 4WD
##                             class                            trans hwy cty
## 1     Sport Utility Vehicle - 4WD Automatic (variable gear ratios)  27  30
## 2     Sport Utility Vehicle - 4WD                  Automatic 6-spd  27  21
## 3 Small Sport Utility Vehicle 4WD                     Manual 6-spd  25  19
## 4     Sport Utility Vehicle - 4WD Automatic (variable gear ratios)  27  30
## 5 Small Sport Utility Vehicle 4WD                      Auto(AV-S6)  29  24
## 6     Sport Utility Vehicle - 4WD Automatic (variable gear ratios)  23  20

Then we need to eliminate all duplicate entries in our data and find unique rows for our data.

factors = unique(dataarray[ ,1:5])
factors
##      year drive cyl displ fuel
## 1       1     2   1     1    3
## 108     1     3   3     3    3
## 394     1     4   3     1    2
## 423     2     3   1     1    3
## 1266    2     4   3     2    3
## 1270    3     2   3     3    3
## 1272    3     3   3     1    2
## 1290    3     4   1     3    3
## 1328    3     4   2     3    2
rownames(factors)
## [1] "1"    "108"  "394"  "423"  "1266" "1270" "1272" "1290" "1328"

We then use those rows to pull in the response variable to our entries.

#Select Response Variables
responsevar = dataarray$hwy[index=c(1,108,394,423,1266,1270,1272,1290,1328)]
#Combine data
finaldata = cbind(factors,responsevar)
finaldata
##      year drive cyl displ fuel responsevar
## 1       1     2   1     1    3          27
## 108     1     3   3     3    3          22
## 394     1     4   3     1    2          30
## 423     2     3   1     1    3          27
## 1266    2     4   3     2    3          15
## 1270    3     2   3     3    3          18
## 1272    3     3   3     1    2          26
## 1290    3     4   1     3    3          23
## 1328    3     4   2     3    2          17

We then calculate the signal to noise ratio for our experimental runs.

SN=-10*log10(1/finaldata$responsevar^2)
SN
## [1] 28.63 26.85 29.54 28.63 23.52 25.11 28.30 27.23 24.61
bestsn = which(SN==max(SN))
bestsn
## [1] 3

We find that our best solution is the one with the highest signal to noise ratio. This for us is entry 3 with levels 1, 4, 3, 1, 2 for our factors year, drive, cyl, displ and fuel respectively.

Lastly we run another anova on the new experimental array.

TaguchiModel = lm(finaldata$responsevar~finaldata$year+finaldata$drive+finaldata$cyl+finaldata$displ+finaldata$fuel, data=finaldata)
anova(TaguchiModel)
## Warning: ANOVA F-tests on an essentially perfect fit are unreliable
## Analysis of Variance Table
## 
## Response: finaldata$responsevar
##                 Df Sum Sq Mean Sq F value Pr(>F)
## finaldata$year   2   56.9    28.4               
## finaldata$drive  2   23.3    11.6               
## finaldata$cyl    2   55.2    27.6               
## finaldata$displ  2   80.2    40.1               
## Residuals        0    0.0

Unfortunately this anova yields a perfect fit which is unreliable, therefore we fail to reject the null hypothesis using a Taguchi design.

Diagnostics/Model Adequacy Checking

Next we graph Q-Q plots to check our data in our models for normality. If the data is not normal the results of the analysis may not be valid.

#QQ plots for residuals of first model
qqnorm(residuals(model))
qqline(residuals(model))

plot of chunk unnamed-chunk-11

#QQ plots for residuals of second model
qqnorm(residuals(TaguchiModel))
qqline(residuals(TaguchiModel))

plot of chunk unnamed-chunk-11

We also perform the Tukey test which is used to determine which groups in the sample differ as opposed to the anova which tells whether or not there are differences in groups.

#Tukey HSD test for first model
TukeyHSD(model, ordered=FALSE, conf.level = .95)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = hwy ~ year + drive + cyl + displ + fuel, data = datasub)
## 
## $year
##                        diff    lwr     upr p adj
## 2000-2005->2010     -3.0641 -3.249 -2.8796 0e+00
## 2005-2010->2010     -2.7297 -2.911 -2.5480 0e+00
## 2005-2010-2000-2005  0.3344  0.146  0.5228 1e-04
## 
## $drive
##                                      diff     lwr    upr  p adj
## 4-Wheel Drive-2-Wheel Drive         2.486 -1.4196  6.392 0.3586
## Front-Wheel Drive-2-Wheel Drive    10.956  7.0657 14.847 0.0000
## Rear-Wheel Drive-2-Wheel Drive      4.050  0.1598  7.941 0.0376
## Front-Wheel Drive-4-Wheel Drive     8.470  8.0850  8.855 0.0000
## Rear-Wheel Drive-4-Wheel Drive      1.564  1.1777  1.951 0.0000
## Rear-Wheel Drive-Front-Wheel Drive -6.906 -7.0754 -6.736 0.0000
## 
## $cyl
##            diff    lwr    upr p adj
## >10-<6   -7.021 -7.476 -6.566     0
## 6-10-<6  -2.903 -3.057 -2.748     0
## 6-10->10  4.118  3.666  4.570     0
## 
## $displ
##           diff     lwr    upr p adj
## >6-<3  -2.2674 -2.6515 -1.883     0
## 3-6-<3 -0.3612 -0.5153 -0.207     0
## 3-6->6  1.9062  1.5237  2.289     0
## 
## $fuel
##                   diff     lwr     upr p adj
## Premium-Diesel  -8.628  -9.289 -7.9674     0
## Regular-Diesel  -9.678 -10.335 -9.0205     0
## Regular-Premium -1.050  -1.204 -0.8948     0

Lastly we plot a fitted model against the residuals. We do not see a very large degree of variation among the plot.

#Plot of Fitted vs Residuals of original linear model
plot(fitted(model), residuals(model))

plot of chunk unnamed-chunk-13

#Plot of Fitted vs Residuals of Taguchi models
plot(fitted(TaguchiModel), residuals(TaguchiModel))

plot of chunk unnamed-chunk-13

Overrall the results of our model lead us to believe our original linear model is adequate and does explain the effect of the cylinders, displacement, fuel type, year, and drive type on the variance in highway gas mileage. However we cannot use the Taguchi method anova as it’s perfect leaves us skeptical of the actual fit and any model adequacy measures.

4. References to the literature

No literature was used

5. Appendices

The R package in which this data was found can be located at https://github.com/hadley/fueleconomy

6. Contingencies

It is possible that the conclusions of our analysis are the results of chance. One concern is that the data for each car was collected only once. We don’t know the condition of the tests and it is possible that the conditions of each test may have resulted in better or worse outputs for highway gas mileage. There are other factors involved as well. Two vehicles could have the exact same engine set up yet it is possible that they have different weights therefore effecting their mileage output.

Also since our data appears to not be normal, and data being normal is an assumption for the ANOVA test we can instead perform a Kruskal-Wallis rank sum test. Kruskal-Wallis is a non-parametric to test whether the different groups all have the same distribution. We use this on each of our factors.

#Kruskal-Wallis test on Fuel Type factor
kruskal.test(hwy~fuel, data=datasub)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  hwy by fuel
## Kruskal-Wallis chi-squared = 447.3, df = 2, p-value < 2.2e-16
#Kruskal-Wallis test on Year factor
kruskal.test(hwy~year, data=datasub)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  hwy by year
## Kruskal-Wallis chi-squared = 614.1, df = 2, p-value < 2.2e-16
#Kruskal-Wallis test on Drive Type factor
kruskal.test(hwy~drive, data=datasub)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  hwy by drive
## Kruskal-Wallis chi-squared = 4711, df = 3, p-value < 2.2e-16
#Kruskal-Wallis test on Cylinder Number factor
kruskal.test(hwy~cyl, data=datasub)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  hwy by cyl
## Kruskal-Wallis chi-squared = 5382, df = 2, p-value < 2.2e-16
#Kruskal-Wallis test on Displacement factor
kruskal.test(hwy~displ, data=datasub)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  hwy by displ
## Kruskal-Wallis chi-squared = 5262, df = 2, p-value < 2.2e-16