
Recipe for Taguchi Designs

Ali Svoobda

RPI

12/8/14

1. Setting

System under test

For this recipe, one of Hadley Wickham's datasets, vehicles from the fueleconomy package, will be examined. Specifically, we will examine the effect of five multi-level factors (make, class, drive, trans, fuel) on highway fuel economy, measured in mpg.

Read in and subset data:

library("fueleconomy", lib.loc="C:/Users/svoboa/Documents/R/win-library/3.1")
data<-vehicles

For more on the dataset:

?vehicles
## starting httpd help server ... done

Factors and Levels

We will examine 5 factors in this experiment: make (4 levels), class (4 levels), drive (4 levels), trans (4 levels) and fuel (6 levels).

Set up make as a factor (Acura, Audi, Chevrolet and Dodge only):

x<-subset(data, data$make %in% c("Acura", "Audi", "Chevrolet", "Dodge"))
x$make<-as.factor(x$make)

Set up class as a factor (Compact Cars, Subcompact Cars, Midsize Cars and Two Seaters only):

xx<-subset(x, x$class %in% c("Compact Cars", "Subcompact Cars", "Midsize Cars", "Two Seaters"))
xx$class<-as.factor(xx$class)

Set up drive as factor (4 levels):

xx$drive<-as.factor(xx$drive)

Set up trans as factor (4 levels):

xxx<-subset(xx, xx$trans %in% c("Automatic (S5)", "Manual 5-spd", "Automatic 4-spd", "Automatic 5-spd"))
xxx$trans<-as.factor(xxx$trans)

Set up fuel as factor (6 levels):

xxx$fuel<-as.factor(xxx$fuel)

Continuous Variables

The only continuous variable under study in this experiment is hwy, the highway fuel economy in mpg.

Response Variables

Hwy will also serve as the response variable.

The Data: How is it organized and what does it look like?

After subsetting, the dataset under study has 1175 observations of 12 variables, although only 6 variables are of interest.

Structure and first/last observations of dataset:

str(xxx)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1175 obs. of  12 variables:
##  $ id   : int  13309 13310 13311 14038 14039 14040 14834 14835 14836 11789 ...
##  $ make : Factor w/ 4 levels "Acura","Audi",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ model: chr  "2.2CL/3.0CL" "2.2CL/3.0CL" "2.2CL/3.0CL" "2.3CL/3.0CL" ...
##  $ year : int  1997 1997 1997 1998 1998 1998 1999 1999 1999 1995 ...
##  $ class: Factor w/ 4 levels "Compact Cars",..: 3 3 3 3 3 3 3 3 3 1 ...
##  $ trans: Factor w/ 4 levels "Automatic (S5)",..: 2 4 2 2 4 2 2 4 2 2 ...
##  $ drive: Factor w/ 4 levels "4-Wheel or All-Wheel Drive",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ cyl  : int  4 4 6 4 4 6 4 4 6 5 ...
##  $ displ: num  2.2 2.2 3 2.3 2.3 3 2.3 2.3 3 2.5 ...
##  $ fuel : Factor w/ 6 levels "Diesel","Gasoline or E85",..: 6 6 6 6 6 6 6 6 6 5 ...
##  $ hwy  : int  26 28 26 27 29 26 27 29 26 23 ...
##  $ cty  : int  20 22 18 19 21 17 20 21 17 18 ...
head(xxx)
##       id  make       model year           class           trans
## 8  13309 Acura 2.2CL/3.0CL 1997 Subcompact Cars Automatic 4-spd
## 9  13310 Acura 2.2CL/3.0CL 1997 Subcompact Cars    Manual 5-spd
## 10 13311 Acura 2.2CL/3.0CL 1997 Subcompact Cars Automatic 4-spd
## 11 14038 Acura 2.3CL/3.0CL 1998 Subcompact Cars Automatic 4-spd
## 12 14039 Acura 2.3CL/3.0CL 1998 Subcompact Cars    Manual 5-spd
## 13 14040 Acura 2.3CL/3.0CL 1998 Subcompact Cars Automatic 4-spd
##                drive cyl displ    fuel hwy cty
## 8  Front-Wheel Drive   4   2.2 Regular  26  20
## 9  Front-Wheel Drive   4   2.2 Regular  28  22
## 10 Front-Wheel Drive   6   3.0 Regular  26  18
## 11 Front-Wheel Drive   4   2.3 Regular  27  19
## 12 Front-Wheel Drive   4   2.3 Regular  29  21
## 13 Front-Wheel Drive   6   3.0 Regular  26  17
tail(xxx)
##          id  make          model year        class           trans
## 10115 20864 Dodge Stratus 4 Door 2005 Midsize Cars Automatic 4-spd
## 10116 21499 Dodge Stratus 4 Door 2005 Midsize Cars Automatic 4-spd
## 10117 21500 Dodge Stratus 4 Door 2005 Midsize Cars Automatic 4-spd
## 10118 21954 Dodge Stratus 4 Door 2006 Midsize Cars Automatic 4-spd
## 10119 22615 Dodge Stratus 4 Door 2006 Midsize Cars Automatic 4-spd
## 10120 22616 Dodge Stratus 4 Door 2006 Midsize Cars Automatic 4-spd
##                   drive cyl displ            fuel hwy cty
## 10115 Front-Wheel Drive   4   2.4         Regular  27  20
## 10116 Front-Wheel Drive   6   2.7 Gasoline or E85  25  19
## 10117 Front-Wheel Drive   6   2.7 Gasoline or E85  25  19
## 10118 Front-Wheel Drive   4   2.4         Regular  27  20
## 10119 Front-Wheel Drive   6   2.7 Gasoline or E85  25  19
## 10120 Front-Wheel Drive   6   2.7 Gasoline or E85  25  19

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

In this recipe, a Taguchi design will be created to analyze whether any of the 5 factors listed above has an effect on highway fuel economy.

What is the Rationale for this design?

The goal is to determine whether any of these factors is associated with a change in the highway mileage of a car.

Randomize: What is the Randomization Scheme?

The dataset is a collection of survey data from the EPA on cars from 1985 to 2015. Because the data are observational, no randomization scheme was applied.

Replicate: Are there replicates and/or repeated measures?

There are no replicates or repeated measures.

Block: Did you use blocking in the design?

Blocking is not used in this design.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

Mean highway mileage (hwy) by each of the 5 factors:

tapply(xxx$hwy, xxx$make, mean)
##     Acura      Audi Chevrolet     Dodge 
##     25.75     23.55     28.67     26.98
tapply(xxx$hwy, xxx$class, mean)
##    Compact Cars    Midsize Cars Subcompact Cars     Two Seaters 
##           27.40           25.82           27.72           22.74
tapply(xxx$hwy, xxx$drive, mean)
## 4-Wheel or All-Wheel Drive            All-Wheel Drive 
##                      22.63                      24.00 
##          Front-Wheel Drive           Rear-Wheel Drive 
##                      28.07                      24.08
tapply(xxx$hwy, xxx$trans, mean)
##  Automatic (S5) Automatic 4-spd Automatic 5-spd    Manual 5-spd 
##           25.78           25.74           24.40           28.21
tapply(xxx$hwy, xxx$fuel, mean)
##                  Diesel         Gasoline or E85 Gasoline or natural gas 
##                   41.00                   26.88                   28.75 
##                Midgrade                 Premium                 Regular 
##                   25.40                   24.53                   28.05

Audis appear to have lower mileage than the other 3 makes. Two-seaters seem to have lower mileage than the other classes. Front-wheel drive and manual 5-speed cars have the highest mileage within their factors. Diesel appears to have considerably higher mileage than the other 5 fuel types. These differences are examined further in the boxplots below.

Boxplots:

boxplot(xxx$hwy~xxx$make, xlab="Make", ylab="hwy mpg")

[Figure: boxplot of hwy mpg by make]

boxplot(xxx$hwy~xxx$class, xlab="Class", ylab="hwy mpg")

[Figure: boxplot of hwy mpg by class]

boxplot(xxx$hwy~xxx$drive, xlab="Drive", ylab="hwy mpg")

[Figure: boxplot of hwy mpg by drive]

boxplot(xxx$hwy~xxx$trans, xlab="trans", ylab="hwy mpg")

[Figure: boxplot of hwy mpg by trans]

boxplot(xxx$hwy~xxx$fuel, xlab="fuel", ylab="hwy mpg")

[Figure: boxplot of hwy mpg by fuel]

From the boxplots, the trends observed above are reinforced. One important observation is that many of the factor levels have several outliers with high highway mileage. It is also important to note that, for fuel type, some levels have only a couple of observations (diesel and midgrade).
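
Because some levels are sparse, it can help to tabulate group sizes next to the group means. The following is a minimal sketch on made-up data; the toy data frame and its values are illustrative assumptions and are not rows from vehicles.

```r
# Toy stand-in for xxx; values are invented for illustration only.
toy <- data.frame(
  fuel = c("Regular", "Regular", "Regular", "Premium", "Premium", "Diesel"),
  hwy  = c(28, 27, 29, 24, 25, 41)
)

# Number of observations per level; small counts flag unreliable means.
n_per_level <- table(toy$fuel)

# Mean highway mileage per level, same call pattern as used earlier.
mean_per_level <- tapply(toy$hwy, toy$fuel, mean)

# Side-by-side summary; both results are ordered alphabetically by level.
summary_tab <- data.frame(
  n    = as.vector(n_per_level),
  mean = as.vector(mean_per_level),
  row.names = names(mean_per_level)
)
print(summary_tab)
```

A level with only one or two observations should be read with caution in both the table of means and the boxplots.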

Histogram of highway mileage:

hist(xxx$hwy, breaks=20)

[Figure: histogram of hwy mpg]

The most frequent mileage is between 25 and 30 mpg.

Testing

ANOVA

Null Hypothesis: The variation in highway fuel mileage cannot be explained by anything other than random variation.

First, assign integer levels to all factors:

xxx$make<-as.character(xxx$make)

xxx$make[xxx$make == "Acura"]<-1
xxx$make[xxx$make == "Audi"]<-2
xxx$make[xxx$make == "Chevrolet"]<-3
xxx$make[xxx$make == "Dodge"]<-4


xxx$class<-as.character(xxx$class)

xxx$class[xxx$class == "Compact Cars"]<-1
xxx$class[xxx$class == "Midsize Cars"]<-2
xxx$class[xxx$class == "Subcompact Cars"]<-3
xxx$class[xxx$class == "Two Seaters"]<-4


xxx$drive<-as.character(xxx$drive)

xxx$drive[xxx$drive == "4-Wheel or All-Wheel Drive"]<-1
xxx$drive[xxx$drive == "All-Wheel Drive"]<-2
xxx$drive[xxx$drive == "Front-Wheel Drive"]<-3
xxx$drive[xxx$drive == "Rear-Wheel Drive"]<-4


xxx$trans<-as.character(xxx$trans)

xxx$trans[xxx$trans == "Automatic (S5)"]<-1
xxx$trans[xxx$trans == "Automatic 4-spd"]<-2
xxx$trans[xxx$trans == "Automatic 5-spd"]<-3
xxx$trans[xxx$trans == "Manual 5-spd"]<-4


xxx$fuel<-as.character(xxx$fuel)

xxx$fuel[xxx$fuel == "Diesel"]<-1
xxx$fuel[xxx$fuel == "Gasoline or E85"]<-2
xxx$fuel[xxx$fuel == "Gasoline or natural gas"]<-3
xxx$fuel[xxx$fuel == "Midgrade"]<-4
xxx$fuel[xxx$fuel == "Premium"]<-5
xxx$fuel[xxx$fuel == "Regular"]<-6


xxx$make<-as.integer(xxx$make)
xxx$class<-as.integer(xxx$class)
xxx$drive<-as.integer(xxx$drive)
xxx$trans<-as.integer(xxx$trans)
xxx$fuel<-as.integer(xxx$fuel)
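
The level-by-level recoding above assigns codes in alphabetical order, which is also the default order R uses for factor levels. Under that assumption, the same integer codes can be obtained in one step. A sketch on a made-up vector (the values are not rows from the vehicles data):

```r
# Invented example values; not rows from the vehicles data.
make <- c("Dodge", "Acura", "Chevrolet", "Audi", "Acura")

# factor() sorts levels alphabetically by default, so the integer codes
# follow the same Acura=1, Audi=2, Chevrolet=3, Dodge=4 scheme.
make_f    <- factor(make)
make_code <- as.integer(make_f)

print(levels(make_f))
print(make_code)
```

Keeping the factor (make_f) alongside the codes preserves the level labels, which the pure integer column loses.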

Create ANOVA model:

model1=aov(xxx$hwy~xxx$make+xxx$class+xxx$drive+xxx$trans+xxx$fuel)
summary(model1)
##               Df Sum Sq Mean Sq F value  Pr(>F)    
## xxx$make       1    957     957   61.17 1.2e-14 ***
## xxx$class      1     45      45    2.87   0.091 .  
## xxx$drive      1    307     307   19.59 1.0e-05 ***
## xxx$trans      1   1243    1243   79.41 < 2e-16 ***
## xxx$fuel       1    491     491   31.35 2.7e-08 ***
## Residuals   1169  18293      16                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-values for make, drive, trans, and fuel are small, meaning there is a low probability that the observed variation is due to chance alone, we reject the null hypothesis for these factors. It is likely that make, drive, trans, and fuel type each explain some of the variation in highway mileage. At the 0.05 level, however, there is no significant evidence (p = 0.091) that the class of the car explains variation in highway mileage.
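
Note that each term in the ANOVA table above has 1 degree of freedom, because the integer-coded columns enter the model as numeric predictors rather than as categorical factors. The sketch below uses simulated data (all names and values are made up) to show how wrapping a coded column in factor() changes the degrees of freedom from 1 to the number of levels minus 1.

```r
set.seed(1)

# Simulated 4-level grouping variable stored as integer codes 1..4.
g <- rep(1:4, each = 10)
# Simulated response whose group means do not follow a straight line.
y <- c(25, 30, 24, 29)[g] + rnorm(length(g))

# Integer codes are treated as one numeric slope: 1 degree of freedom.
fit_numeric <- stats::aov(y ~ g)
# factor() gives one parameter per level beyond the first: 3 degrees of freedom.
fit_factor  <- stats::aov(y ~ factor(g))

df_numeric <- summary(fit_numeric)[[1]][["Df"]][1]
df_factor  <- summary(fit_factor)[[1]][["Df"]][1]
print(c(numeric = df_numeric, factor = df_factor))
```

With numeric coding the test only detects a linear trend across the code values, so differences between unordered levels can be missed.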

Taguchi Design

Before a Taguchi design is created, we must load the qualityTools and DoE.base packages:

library("qualityTools", lib.loc="~/R/win-library/3.1")
## Warning: package 'qualityTools' was built under R version 3.1.2
library("DoE.base", lib.loc="~/R/win-library/3.1")
## Warning: package 'DoE.base' was built under R version 3.1.2
## Loading required package: grid
## Loading required package: conf.design
## Warning: package 'conf.design' was built under R version 3.1.2
## 
## Attaching package: 'DoE.base'
## 
## The following objects are masked from 'package:stats':
## 
##     aov, lm
## 
## The following object is masked from 'package:graphics':
## 
##     plot.design

Now, we will create a Taguchi design to find the factor level combination that leads to the highest signal-to-noise (S/N) ratio.

First, create an orthogonal array:

array=oa.design(factor.names=c("make", "class", "drive", "trans", "fuel"), nlevels=c(4,4,4,4,6), columns="min3")
array
##    make class drive trans fuel
## 1     4     4     3     1    2
## 2     4     4     4     4    1
## 3     1     1     3     2    6
## 4     4     2     2     2    3
## 5     3     2     2     2    5
## 6     1     3     2     1    4
## 7     3     1     2     4    6
## 8     3     3     4     2    2
## 9     4     2     1     3    6
## 10    3     2     3     4    1
## 11    1     4     4     4    5
## 12    2     2     4     1    6
## 13    2     1     4     3    4
## 14    4     3     2     1    4
## 15    2     4     3     1    6
## 16    3     3     3     3    1
## 17    3     4     3     1    3
## 18    2     2     1     3    2
## 19    1     4     2     3    5
## 20    3     1     4     3    4
## 21    2     3     3     3    5
## 22    1     2     4     1    2
## 23    4     3     1     4    2
## 24    3     1     1     1    3
## 25    4     1     3     2    5
## 26    1     3     4     2    6
## 27    2     3     1     4    5
## 28    2     1     3     2    2
## 29    3     4     2     3    2
## 30    4     3     4     2    3
## 31    2     1     2     4    3
## 32    3     3     1     4    6
## 33    1     1     2     4    2
## 34    4     4     2     3    6
## 35    3     4     1     2    4
## 36    2     4     4     4    3
## 37    4     2     3     4    4
## 38    1     2     1     3    3
## 39    1     1     1     1    1
## 40    2     4     1     2    4
## 41    1     2     3     4    4
## 42    3     2     4     1    5
## 43    4     1     4     3    1
## 44    1     3     3     3    3
## 45    2     3     2     1    1
## 46    2     2     2     2    1
## 47    1     4     1     2    1
## 48    4     1     1     1    5
## class=design, type= oa
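
The defining property of an orthogonal array is balance: for any pair of columns, every combination of levels occurs equally often. A small base-R sketch of that check follows. It uses a hand-written L4 array for three 2-level factors rather than the 48-run design above, and the helper name is_balanced is hypothetical.

```r
# Hand-written L4 orthogonal array: 4 runs, three 2-level factors.
l4 <- data.frame(
  A = c(1, 1, 2, 2),
  B = c(1, 2, 1, 2),
  C = c(1, 2, 2, 1)
)

# Hypothetical helper: TRUE if every level combination of two columns
# appears the same number of times.
is_balanced <- function(design, col1, col2) {
  counts <- table(design[[col1]], design[[col2]])
  length(unique(as.vector(counts))) == 1
}

# Check all column pairs.
pairs <- combn(names(l4), 2)
balanced <- apply(pairs, 2, function(p) is_balanced(l4, p[1], p[2]))
print(balanced)
```

The same helper could be pointed at the columns of a larger design to confirm pairwise balance.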

Merge the array with the original data and identify the unique rows in the new set:

new=merge(array, xxx, by=c("make", "class", "drive", "trans", "fuel"), all=FALSE)

unique=unique(new[,1:5])
unique
##    make class drive trans fuel
## 1     1     1     3     2    6
## 7     1     4     4     4    5
## 13    2     3     1     4    5
## 15    4     1     3     2    5
rownames(unique)
## [1] "1"  "7"  "13" "15"
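
With all=FALSE, merge() keeps only the design runs that have at least one matching observation, so the number of unique rows shows how much of the array is covered by the data. A sketch with invented values, where design_toy and obs_toy are hypothetical stand-ins for array and xxx:

```r
# Hypothetical 3-run design over two coded factors.
design_toy <- data.frame(make = c(1, 2, 3), class = c(1, 2, 1))

# Hypothetical observations; no row matches the (3, 1) design run.
obs_toy <- data.frame(
  make  = c(1, 1, 2, 3),
  class = c(1, 1, 2, 2),
  hwy   = c(26, 28, 24, 30)
)

# Inner join: only factor combinations present in both tables survive.
merged <- merge(design_toy, obs_toy, by = c("make", "class"), all = FALSE)

# Design runs that were actually observed, and the share of the design covered.
covered  <- unique(merged[, c("make", "class")])
coverage <- nrow(covered) / nrow(design_toy)
print(covered)
print(coverage)
```

A low coverage value is a warning that later analysis of the matched runs will rest on very few rows.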

Find and save the highway mileage for the unique rows and use them to create a second array:

hwy2=new$hwy[c(1,7,13,15)]
hwy2
## [1] 20 22 22 25
array2 = cbind(unique,hwy2)
array2
##    make class drive trans fuel hwy2
## 1     1     1     3     2    6   20
## 7     1     4     4     4    5   22
## 13    2     3     1     4    5   22
## 15    4     1     3     2    5   25

Now, we can calculate the signal to noise ratio:

s.n = -10*log10(1/array2$hwy2^2)
s.n
## [1] 26.02 26.85 26.85 27.96

The highest S/N ratio is 27.95880, which corresponds to the fourth row of array2 (row name 15, hwy2 = 25). Among the matched rows, this combination of factor levels gives the best (highest) highway fuel mileage. Looking back at array2, the factor/level combination for this row is: make 4 (Dodge), class 1 (Compact Cars), drive 3 (Front-Wheel Drive), trans 2 (Automatic 4-spd), fuel 5 (Premium).
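
The S/N expression used above is the larger-the-better form, SN = -10 log10((1/n) * sum(1/y_i^2)), evaluated with a single observation per run (n = 1). The sketch below wraps the general form in a helper (the name sn_larger_better is hypothetical) and applies it to the four hwy2 values.

```r
# Hypothetical helper: larger-the-better signal-to-noise ratio for the
# replicate responses y of one run.
sn_larger_better <- function(y) {
  -10 * log10(mean(1 / y^2))
}

# The four highway mileages from array2, one observation per run.
hwy2 <- c(20, 22, 22, 25)

# With one observation per run the formula reduces to 20 * log10(y).
sn <- sapply(hwy2, sn_larger_better)
print(round(sn, 2))

# Position of the run with the highest S/N ratio.
best <- which.max(sn)
print(best)
```

Because the ratio is monotone in y when n = 1, the run with the largest hwy2 always has the largest S/N; the ratio only adds information when each run has replicates.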

Finally, we run a new ANOVA model with the unique row array (array2):

model2=aov(array2$hwy2~array2$make+array2$class+array2$drive+array2$trans+array2$fuel, data=array2)
summary(model2)
##              Df Sum Sq Mean Sq
## array2$make   2   10.8    5.38
## array2$class  1    2.0    2.00

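Since array2 has only 4 rows, there are not enough data to run the analysis: the model terms use all 3 available degrees of freedom, so none remain for the residuals and no F values or p-values can be computed.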
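
The degrees-of-freedom accounting can be checked directly. The sketch below rebuilds the make, class and hwy2 columns of array2 by hand and treats make and class as factors, which is an assumption about how the columns were stored.

```r
# array2 values copied from the output above; make and class are treated
# as factors here, which is an assumption.
d <- data.frame(
  make  = factor(c(1, 1, 2, 4)),
  class = factor(c(1, 4, 3, 1)),
  hwy2  = c(20, 22, 22, 25)
)

fit <- stats::aov(hwy2 ~ make + class, data = d)

# Total degrees of freedom (n - 1) and what is left for the residuals.
total_df <- nrow(d) - 1
resid_df <- df.residual(fit)
print(c(total = total_df, residual = resid_df))
```

With zero residual degrees of freedom there is no estimate of error variance, which is why the summary table shows sums of squares but no F statistics.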

Diagnostics/Model Adequacy Checking

Original Model 1

Visually inspect normality of original data:

qqnorm(residuals(model1))
qqline(residuals(model1))

[Figure: normal Q-Q plot of model1 residuals]

The residuals may not be normally distributed. A Shapiro-Wilk normality test could be used to check this formally (R's shapiro.test accepts between 3 and 5000 values, so the 1175 residuals are within range), but it was not run here. In case the residuals are in fact not normal, alternative tests are performed in the contingencies section below.
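
For reference, a formal normality check on model residuals could look like the sketch below. The data and model are simulated for illustration and are not model1.

```r
set.seed(42)

# Simulated response with a 4-level factor; stands in for the real model.
g <- factor(rep(1:4, each = 50))
y <- c(25, 27, 24, 29)[g] + rnorm(length(g), sd = 4)

fit <- stats::aov(y ~ g)

# shapiro.test() requires between 3 and 5000 non-missing values.
# Null hypothesis: the residuals come from a normal distribution.
sw <- shapiro.test(residuals(fit))
print(sw)
```

A small p-value would indicate a departure from normality. With large samples the test can flag even minor deviations, so it is best read alongside the Q-Q plot.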

Fitted vs Residuals Plot:

plot(fitted(model1),residuals(model1))

[Figure: residuals vs fitted values for model1]

The residuals should be symmetric about zero and spread evenly across the range of fitted values. We can assume the fit is adequate since both appear to be true.

4. Contingencies

Since the data may not have fulfilled the normality assumption of the model, a Kruskal-Wallis rank sum test (a non-parametric analysis of variance) should be performed for each factor:

kruskal.test(xxx$hwy~xxx$make, data=xxx)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  xxx$hwy by xxx$make
## Kruskal-Wallis chi-squared = 223.1, df = 3, p-value < 2.2e-16
kruskal.test(xxx$hwy~xxx$class, data=xxx)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  xxx$hwy by xxx$class
## Kruskal-Wallis chi-squared = 91.99, df = 3, p-value < 2.2e-16
kruskal.test(xxx$hwy~xxx$trans, data=xxx)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  xxx$hwy by xxx$trans
## Kruskal-Wallis chi-squared = 100.7, df = 3, p-value < 2.2e-16
kruskal.test(xxx$hwy~xxx$drive, data=xxx)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  xxx$hwy by xxx$drive
## Kruskal-Wallis chi-squared = 347.2, df = 3, p-value < 2.2e-16
kruskal.test(xxx$hwy~xxx$fuel, data=xxx)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  xxx$hwy by xxx$fuel
## Kruskal-Wallis chi-squared = 206.6, df = 5, p-value < 2.2e-16

The null hypothesis of the Kruskal-Wallis test is that the mean ranks of the samples from the populations are expected to be the same (this is not the same as saying the populations have identical means).
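
To make the role of mean ranks concrete, the H statistic can be computed by hand as H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1), where R_i is the rank sum and n_i the size of group i. The sketch below does this on invented data with no ties and compares the result with kruskal.test().

```r
# Invented data: three groups of three values, no ties.
y <- c(12.1, 14.3, 11.8, 15.2, 16.9, 15.7, 18.4, 19.1, 17.6)
g <- factor(rep(c("a", "b", "c"), each = 3))

# Rank all observations together, then sum the ranks within each group.
r        <- rank(y)
rank_sum <- tapply(r, g, sum)
n_group  <- tapply(r, g, length)
N        <- length(y)

# H statistic without tie correction (valid here because there are no ties).
H <- 12 / (N * (N + 1)) * sum(rank_sum^2 / n_group) - 3 * (N + 1)

# Compare with the built-in test.
kw <- kruskal.test(y ~ g)
print(c(by_hand = H, built_in = unname(kw$statistic)))
```

When ties are present, as with integer mpg values, kruskal.test() applies a tie correction, so the uncorrected formula would differ slightly from the reported statistic.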

Since the test results have a low p-value for each of the factors, it is likely that each factor can explain some of the variation in highway mileage.

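These results agree with the ANOVA run previously, except in the case of class. This inconsistency may be due to non-normality of the data. Another possible cause is that the ANOVA treated the integer-coded class as a single numeric predictor (1 degree of freedom), whereas the Kruskal-Wallis test compares all four groups.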

5. References to the Literature

None used.

6. Appendices

Complete R Code

All included above.