Recipe 10: Taguchi Design

Recipes for the Design of Experiments

Fuel Economy

Jane Braun

RPI

December 8th Version 1.0

1. Setting

System under test

The data analyzed in this recipe is from one of Hadley Wickham’s datasets. In this recipe, the fuel economy dataset was used.

This dataset is the result of vehicle testing done at the Environmental Protection Agency’s National Vehicle and Fuel Emmisions laboratory in Ann Arbor, Michigan.

# Read in dataset

library("fueleconomy", lib.loc="~/R/win-library/3.1")
x <- vehicles

Factors and Levels

The original data being examined has 33,442 observations of 12 variables.

head(x)
##      id       make               model year                       class
## 1 27550 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 2 28426 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 3 27549 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 4 28425 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 5  1032 AM General Post Office DJ5 2WD 1985 Special Purpose Vehicle 2WD
## 6  1033 AM General Post Office DJ8 2WD 1985 Special Purpose Vehicle 2WD
##             trans            drive cyl displ    fuel hwy cty
## 1 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 2 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 3 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 4 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 5 Automatic 3-spd Rear-Wheel Drive   4   2.5 Regular  17  16
## 6 Automatic 3-spd Rear-Wheel Drive   6   4.2 Regular  13  13
summary(x)
##        id            make              model                year     
##  Min.   :    1   Length:33442       Length:33442       Min.   :1984  
##  1st Qu.: 8361   Class :character   Class :character   1st Qu.:1991  
##  Median :16724   Mode  :character   Mode  :character   Median :1999  
##  Mean   :17038                                         Mean   :1999  
##  3rd Qu.:25265                                         3rd Qu.:2008  
##  Max.   :34932                                         Max.   :2015  
##                                                                      
##     class              trans              drive                cyl        
##  Length:33442       Length:33442       Length:33442       Min.   : 2.000  
##  Class :character   Class :character   Class :character   1st Qu.: 4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.000  
##                                                           Mean   : 5.772  
##                                                           3rd Qu.: 6.000  
##                                                           Max.   :16.000  
##                                                           NA's   :58      
##      displ           fuel                hwy              cty        
##  Min.   :0.000   Length:33442       Min.   :  9.00   Min.   :  6.00  
##  1st Qu.:2.300   Class :character   1st Qu.: 19.00   1st Qu.: 15.00  
##  Median :3.000   Mode  :character   Median : 23.00   Median : 17.00  
##  Mean   :3.353                      Mean   : 23.55   Mean   : 17.49  
##  3rd Qu.:4.300                      3rd Qu.: 27.00   3rd Qu.: 20.00  
##  Max.   :8.400                      Max.   :109.00   Max.   :138.00  
##  NA's   :57
# As seen below, there are 33,442 observations of 12 variables, including: 56 factors and 6 numeric variables.
str(x)
## Classes 'tbl_df', 'tbl' and 'data.frame':    33442 obs. of  12 variables:
##  $ id   : int  27550 28426 27549 28425 1032 1033 3347 13309 13310 13311 ...
##  $ make : chr  "AM General" "AM General" "AM General" "AM General" ...
##  $ model: chr  "DJ Po Vehicle 2WD" "DJ Po Vehicle 2WD" "FJ8c Post Office" "FJ8c Post Office" ...
##  $ year : int  1984 1984 1984 1984 1985 1985 1987 1997 1997 1997 ...
##  $ class: chr  "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" ...
##  $ trans: chr  "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" ...
##  $ drive: chr  "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" ...
##  $ cyl  : int  4 4 6 6 4 6 6 4 4 6 ...
##  $ displ: num  2.5 2.5 4.2 4.2 2.5 4.2 3.8 2.2 2.2 3 ...
##  $ fuel : chr  "Regular" "Regular" "Regular" "Regular" ...
##  $ hwy  : int  17 17 13 13 17 13 21 26 28 26 ...
##  $ cty  : int  18 18 13 13 16 13 14 20 22 18 ...
# Cleaning data:
x$make <- as.factor(x$make)
x$trans <- as.factor(x$trans)
x$drive <- as.factor(x$drive)
x$cyl <- as.factor(x$cyl)
x$fuel <- as.factor(x$fuel)
# Make 'year' into a factor by defining intervals
x$year <- cut(x$year, c(1980,1990,2000,2010,2020))
x$year <- as.character(x$year)
x$year[x$year == "(1.98e+03,1.99e+03]"] <- "1980s"
x$year[x$year == "(1.99e+03,2e+03]"] <- "1990s"
x$year[x$year == "(2e+03,2.01e+03]"] <- "2000s"
x$year[x$year == "(2.01e+03,2.02e+03]"] <- "2010s"
x$year <- as.factor(x$year)
# Subset only top 4 car makes
x <- x[x$make %in% c("Chevrolet","Ford", "Dodge", "GMC"), ]
x$make <- droplevels(x$make)
# Subset top 4 car drive types
x$drive <- as.factor(x$drive)
x <- x[x$drive %in% c("2-Wheel Drive","4-Wheel or All-Wheel Drive", "Front-Wheel Drive", "Rear-Wheel Drive"), ]
x$drive <- droplevels(x$drive)

# Subset top 4 fuel types
x$fuel <- as.factor(x$fuel)
x <- x[x$fuel %in% c("CNG","Diesel", "Premium", "Regular"), ]
x$fuel <- droplevels(x$fuel)

# Subset top 3 cylinders
x$cyl <- as.factor(x$cyl)
x <- x[x$cyl %in% c("4", "6",  "8"), ]
x$cyl <- droplevels(x$cyl)


x$cty <- as.numeric(x$cty)

Continuous variables

The continuous variables in this dataset are mileages of the cars - both city and highway.

Response variables

The response variable being analyzed in this dataset is the city mileage of the car in miles per gallon.

As seen below, the log of the hourly wage ranged from 6 to 138 mpg, with a mean of 17.49.

summary(x$cty)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00   13.00   15.00   15.51   17.00   47.00

The Data

The data is originally consisted of 12 variables, but was subseted down to 6 columns: 5 factors, and 1 continuous variable.

Randomization

There was no randomization in this recipe because the data points are based on observations of cars.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

A Taguchi design will be used to analyze which, if any, of the five factors are responsible for variation in the response variable, city mileage.

Randomize:

The experiment was based on observations, and was therefore not randomized.

Replicate:

The experiment had no replicates or repeated measures.

Block:

There was no blocking in this recipe.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

The boxplot below shows the the city mileage of the cars versus the make of the car. All of the cars had relatively similar medians, but the ranges were quite varied. Ford cars had the greatest range in mileage, and GMC had that smallest range.

par(mfrow=c(1,1))
attach(x)
## The following object is masked _by_ .GlobalEnv:
## 
##     model
plot(cty ~ make, xlab = "Make of Car", ylab = "City Mileage (mpg)", main="City Mileage by Make of Car" )

The boxplot below shows the city mileage of the cars versus the year the car was built. Cars in the 80s, 90s, and 00s, all had similar medians. The medians of cars of the 2010s had significantly higher city mileage. The ranges were fairly consistent among time intervals.

plot(cty ~ year, xlab = "Year of Car", ylab = "City Mileage (mpg)", main="City Mileage by Year of Car" )

The boxplot below shows city mileage versus the drive type of the car. Front-Wheel drive cars have the greatest median mileage, and tend to be positively skewed. The other three, 2-wheel drive, Rear-wheel drive and All-wheel drive have similarly lower medians and smaller ranges.

plot(cty ~ drive, xlab = "Drive Type", ylab = "City Mileage (mpg)", main="City Mileage by Drive Type")

The boxplot below shows the city mileage versus the number of cylinders in the car. 4-cylinder cars have the highest medians, and also the greatest range. 6 and 8-cylinder cars have much smaller ranges.

plot(cty ~ cyl, xlab = "Number of Cylinders", ylab = "City Mileage (mpg)", main="City Mileage by Cylinder Type" )

The boxplot below shows the relationship between fuel type and city mileage. Regular gas cars have the largest range, whereas compressed natural gas (CNG) cars has the smallest range. The medians of Diesel, Premium, and Regular are relatively similar, but the median of CNG is fairly lower.

plot(cty ~ fuel, xlab = "Fuel Type", ylab = "City Mileage (mpg)", main="City Mileage by Fuel Type" )

Testing

ANOVA

ANOVA Testing

Comparing the average city gas mileage to five factors: drive type, year, make, cylinders, and fuel type. H0: There is no difference between the means of the samples. The variation in city gas mileage is not caused by anything other than randomization. Ha: The difference in variation of city gas mileage can caused by something other than randomization.

# Assign all factors an integer value
x$year <- as.character(x$year)
x$year[x$year == "1980s"] <- 1
x$year[x$year == "1990s"] <- 2
x$year[x$year == "2000s"] <- 3
x$year[x$year == "2010s"] <- 4
x$year <- as.factor(x$year)
x$drive <- as.character(x$drive)
x$drive[x$drive=="2-Wheel Drive"] <- 1
x$drive[x$drive=="4-Wheel or All-Wheel Drive"] <- 2
x$drive[x$drive=="Front-Wheel Drive"] <-3
x$drive[x$drive=="Rear-Wheel Drive"] <-4
x$drive <- as.factor(x$drive)
x$fuel <- as.character(x$fuel)
x$fuel[x$fuel=="CNG"] <- 1
x$fuel[x$fuel=="Diesel"] <- 2
x$fuel[x$fuel=="Premium"] <- 3
x$fuel[x$fuel=="Regular"] <- 4
x$fuel <- as.factor(x$fuel)

x$make<-as.character(x$make)
x$make[x$make=="Chevrolet"] <- 1
x$make[x$make=="Ford"] <- 2
x$make[x$make=="Dodge"] <- 3
x$make[x$make=="GMC"] <- 4
x$make <- as.factor(x$make)

x$cyl <- as.character(x$cyl)
x$cyl[x$cyl == "4"] <- 1
x$cyl[x$cyl == "6"] <- 2
x$cyl[x$cyl == "8"] <- 3
x$cyl <- as.factor(x$cyl)

x <- na.omit(x)

x$make <- as.integer(x$make)
x$fuel <- as.integer(x$fuel)
x$cyl <- as.integer(x$cyl)
x$drive <- as.integer(x$drive)
x$year<- as.integer(x$year)



model <- aov(cty~make+fuel+cyl+drive+year,data=x)
summary(model)
##               Df Sum Sq Mean Sq  F value Pr(>F)    
## make           1   3756    3756   731.65 <2e-16 ***
## fuel           1    476     476    92.63 <2e-16 ***
## cyl            1  88294   88294 17199.04 <2e-16 ***
## drive          1    440     440    85.61 <2e-16 ***
## year           1   2530    2530   492.78 <2e-16 ***
## Residuals   9669  49637       5                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-values of all of the main effects are less than an alpha of 0.05. Therefore, we can fail to reject the null hypothesis, and instead support the alternative, that something other than randomization causes the variation in city mileage.

Taguchi Design

Next, a Taguchi Design was used to find which combination of factors and levels results in the highest signal-to-noise ratio.

#install.packages("qualityTools")
library(qualityTools)
## Warning: package 'qualityTools' was built under R version 3.1.2
#install.packages("DoE.base")
library("DoE.base")
## Warning: package 'DoE.base' was built under R version 3.1.2
## Loading required package: grid
## Loading required package: conf.design
## Warning: package 'conf.design' was built under R version 3.1.2
## 
## Attaching package: 'DoE.base'
## 
## The following objects are masked from 'package:stats':
## 
##     aov, lm
## 
## The following object is masked from 'package:graphics':
## 
##     plot.design
array = oa.design(factor.names=c("make", "fuel", "cyl", "drive", "year"), nlevels=c(4,4,3, 4, 4 ), columns="min3")
array
##    make fuel cyl drive year
## 1     2    3   1     1    4
## 2     3    1   2     1    1
## 3     2    4   1     3    1
## 4     3    3   3     4    2
## 5     1    2   2     4    1
## 6     3    3   1     3    3
## 7     2    4   3     1    2
## 8     2    1   2     3    2
## 9     3    1   1     2    4
## 10    3    3   2     1    4
## 11    2    1   3     2    4
## 12    1    1   2     3    2
## 13    4    2   1     1    3
## 14    4    4   3     3    1
## 15    4    1   1     3    2
## 16    4    2   2     2    2
## 17    1    3   1     4    2
## 18    1    1   1     1    1
## 19    2    1   1     4    3
## 20    4    3   1     2    1
## 21    3    4   1     1    2
## 22    2    2   3     1    3
## 23    3    2   3     2    2
## 24    3    2   1     4    1
## 25    3    2   2     3    4
## 26    4    4   1     4    4
## 27    4    1   3     1    1
## 28    4    4   2     2    3
## 29    1    1   3     2    4
## 30    3    4   2     2    3
## 31    2    2   1     2    2
## 32    4    2   3     3    4
## 33    1    4   3     4    4
## 34    4    1   2     4    3
## 35    4    3   2     1    4
## 36    1    3   2     3    3
## 37    1    3   3     2    1
## 38    1    2   1     3    4
## 39    1    4   2     1    2
## 40    4    3   3     4    2
## 41    2    3   3     3    3
## 42    2    3   2     2    1
## 43    1    2   3     1    3
## 44    2    4   2     4    4
## 45    3    4   3     3    1
## 46    3    1   3     4    3
## 47    1    4   1     2    3
## 48    2    2   2     4    1
## class=design, type= oa
#Merge the Taguchi Design with the response variable from the data
new_x <- merge(array, x, by=c("make", "fuel", "cyl", "drive", "year"), all=FALSE)

# Identify unique rows
unique_new = unique(new_x[,1:5])
unique_new
##     make fuel cyl drive year
## 1      1    2   1     3    4
## 2      1    3   2     3    3
## 6      1    4   1     2    3
## 42     1    4   3     4    4
## 74     2    4   1     3    1
## 199    2    4   2     4    4
## 221    3    1   3     4    3
## 226    3    3   1     3    3
## 230    3    4   2     2    3
## 272    4    4   1     4    4
## 280    4    4   2     2    3
rownames(unique_new)
##  [1] "1"   "2"   "6"   "42"  "74"  "199" "221" "226" "230" "272" "280"
cty = new_x$cty[index=c(1,2,6,42,74,199,221,226,230,272,280)]

cty
##  [1] 27 16 16 12 21 16 12 19 16 18 13
new_array = cbind(unique_new,cty)
new_array
##     make fuel cyl drive year cty
## 1      1    2   1     3    4  27
## 2      1    3   2     3    3  16
## 6      1    4   1     2    3  16
## 42     1    4   3     4    4  12
## 74     2    4   1     3    1  21
## 199    2    4   2     4    4  16
## 221    3    1   3     4    3  12
## 226    3    3   1     3    3  19
## 230    3    4   2     2    3  16
## 272    4    4   1     4    4  18
## 280    4    4   2     2    3  13
# Caluclate the Signal-to-Noise Ratio
s_n = -10*log10(1/new_array$cty^2)
s_n
##  [1] 28.62728 24.08240 24.08240 21.58362 26.44439 24.08240 21.58362
##  [8] 25.57507 24.08240 25.10545 22.27887
#Maximum S/N
max_s_n=which(s_n==max(s_n))
max_s_n
## [1] 1
new_array[max_s_n,]
##   make fuel cyl drive year cty
## 1    1    2   1     3    4  27
#make=1 - Chevrolet
#fuel=2 - Diesel
#cylinder=1 - 4 cylinders
#drive=3 - front-wheel drive
#year=4 - 2010s

new_model = aov(cty~make+fuel+cyl+drive+year, data=new_array)
summary(new_model)
##             Df Sum Sq Mean Sq
## make         3  16.49    5.50
## fuel         3 141.13   47.04
## cyl          2  27.16   13.58
## drive        2   6.12    3.06
# Unfortunately, there were not enough values to perform an ANOVA

Diagnostics/Model Adequacy Checking

Assumptions for ANOVA tests include that the data is normally distributed.

shapiro.test(new_x$cty)
## 
##  Shapiro-Wilk normality test
## 
## data:  new_x$cty
## W = 0.9535, p-value = 6.738e-09

The null of a Shapiro-Wilk test states that the data presented is normally distributed. Unfortunately, the p-value is less than an alpha of 0.05, and therefore, we must reject the null hypothesis. This suggests, instead, that the city gas mileage is not normally distributed.

par(mfrow = c(1,1))
qqnorm(residuals(model))
qqline(residuals(model))   

The Q-Q Normality Plots of the residuals for the model do not appear to be normal. When normal, the data points would lay linearly, and in this case, they do not seem to lie along the line. This suggests that the data may not be parametric.

plot(fitted(model),residuals(model))   

The plots of the fitted values versus the residual values are supposed to appear symmetric over a residual of 0. The datapoints of the model do appear to be symmetric over the length of the dynamic range, but do not seem to be symmetric over 0.

4. Contingencies

Based on the model adequacy testing, it appears that the data is not normally distributed, and that this may not be the best fit of the data.

A Kruskal-Wallis test can be used to evaluate whether groups’ distributions are identical, without the assumption that the data is normally distributed.

With the kruskal-wallis test, we could only analyze the main effect of one factor at a time.

kruskal.test(cty ~ year, data=x)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  cty by year
## Kruskal-Wallis chi-squared = 488.9773, df = 3, p-value < 2.2e-16

The p-value of this Kruskal-Wallis test is less than an alpha of 0.05. Therefore, you must reject the null - that variation in city mileage cannot be explained by anything other than randomization. This suggests that year of creation may explain variation in the city mileage.

kruskal.test(cty ~ make, data=x)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  cty by make
## Kruskal-Wallis chi-squared = 265.6952, df = 3, p-value < 2.2e-16

The p-value of this Kruskal-Wallis test is less than an alpha of 0.05. Therefore, you must reject the null - that variation in city mileage cannot be explained by anything other than randomization. This suggests that car make may explain variation in the city mileage.

kruskal.test(cty ~ fuel, data=x)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  cty by fuel
## Kruskal-Wallis chi-squared = 206.7565, df = 3, p-value < 2.2e-16

The p-value of this Kruskal-Wallis test is less than an alpha of 0.05. Therefore, you must reject the null - that variation in city mileage cannot be explained by anything other than randomization. This suggests that fuel type may explain variation in the city mileage.

kruskal.test(cty ~ cyl, data=x)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  cty by cyl
## Kruskal-Wallis chi-squared = 6070.349, df = 2, p-value < 2.2e-16

The p-value of this Kruskal-Wallis test is less than an alpha of 0.05. Therefore, you must reject the null - that variation in city mileage cannot be explained by anything other than randomization. This suggests that number of cylinders may explain variation in the city mileage.

kruskal.test(cty ~ drive, data=x)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  cty by drive
## Kruskal-Wallis chi-squared = 4076.988, df = 3, p-value < 2.2e-16

The p-value of this Kruskal-Wallis test is less than an alpha of 0.05. Therefore, you must reject the null - that variation in city mileage cannot be explained by anything other than randomization. This suggests that drive type may explain variation in the city mileage.