Recipes for the Design of Experiments

as of August 28, 2014, superceding the version of August 24. Always use the most recent version.

Analysis of Fuel Economy using Taguchi Design Methods

Brendan Howell

Renselaer Polytechnic Institute

12/08/14 - Version 1.0

1. Setting

EPA Fuel Economy Data (Vehicle Data)

Description: Fuel economy data from the EPA, 1985-2015. This dataset contains selected variables, and removes vehicles with incomplete data (e.g. no drive train data).

In this study that uses fuel economy data collected by the EPA from 1985-2015, a five-factor, multi-level experiment is performed (where ANOVA and Taguchi Design Methods are used) to see if the make, the drive type, the fuel type, the production year, or the transmission of a vehicle (which are the five factors being considered in this analysis) has a statistically significant effect on the city fuel economy (in miles per gallon, ‘mpg’) of that vehicle (which is the response variable in this analysis).

#Install and load the 'fueleconomy' dataset into R.
install.packages("fueleconomy", repos='http://cran.us.r-project.org')

## package 'fueleconomy' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\howelb\AppData\Local\Temp\RtmpszCIHe\downloaded_packages

library("fueleconomy",lib.loc="C:/Program Files/R/R-3.1.1/library")

## Warning: package 'fueleconomy' was built under R version 3.1.2

rm(list=ls())
#Assign a variable, cars, to the complete dataframe, 'vehicles'. Then, determine the "head" and "tail" of the dataset, cars.
cars<-vehicles
head(cars)

##      id       make               model year                       class
## 1 27550 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 2 28426 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 3 27549 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 4 28425 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 5  1032 AM General Post Office DJ5 2WD 1985 Special Purpose Vehicle 2WD
## 6  1033 AM General Post Office DJ8 2WD 1985 Special Purpose Vehicle 2WD
##             trans            drive cyl displ    fuel hwy cty
## 1 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 2 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 3 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 4 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 5 Automatic 3-spd Rear-Wheel Drive   4   2.5 Regular  17  16
## 6 Automatic 3-spd Rear-Wheel Drive   6   4.2 Regular  13  13

tail(cars)

##          id  make                             model year       class
## 33437 31064 smart   fortwo electric drive cabriolet 2011 Two Seaters
## 33438 33305 smart fortwo electric drive convertible 2013 Two Seaters
## 33439 34393 smart fortwo electric drive convertible 2014 Two Seaters
## 33440 31065 smart       fortwo electric drive coupe 2011 Two Seaters
## 33441 33306 smart       fortwo electric drive coupe 2013 Two Seaters
## 33442 34394 smart       fortwo electric drive coupe 2014 Two Seaters
##                trans            drive cyl displ        fuel hwy cty
## 33437 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  79  94
## 33438 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  93 122
## 33439 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  93 122
## 33440 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  79  94
## 33441 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  93 122
## 33442 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  93 122

Factors and Levels

This analysis includes five individual factors that each have multiple levels, including “make”, “drive”, “trans”, fuel“, and”year“. For the factor”make“, the levels ‘Chevrolet’, ‘Subaru’, and ‘Saab’ will be considered (these vehicle makes were selected randomly from the list of”makes" discovered upon performing descriptive statistics of the dataset, “x”). For the factor “drive”, the levels ‘2-Wheel Drive’, ‘Rear-Wheel Drive’, ‘Front-Wheel Drive’, ‘4-Wheel or All-Wheel Drive’, ‘All-Wheel Drive’, and ‘4-Wheel Drive’ will be considered. For the factor “trans”, the levels referring to the different transmission types that are found in ‘Chevrolet’, ‘Subaru’, and ‘Saab’ vehicles will be considered and broken down into two different categories, “Automatic” and “Manual”. For the factor “fuel”, the levels ‘CNG’, ‘Diesel’, ‘Electricity’, ‘Gasoline or E85’, ‘Gasoline or natural gas’, ‘Premium’, ‘Premium Gas or Electricity’, and ‘Regular’ will be considered and divided into two individual categories (“Normal-Grade” and “High-Grade”). For the factor “year”, the individual years included in this dataset will be split into 4 distinct decades: the 1980s, the 1990s, the 2000s, and the 2010s (“80s”, “90s”, “00s”, and “10s”, respectively).

#Display the summary statistics of "cars".
summary(cars)

##        id            make              model                year     
##  Min.   :    1   Length:33442       Length:33442       Min.   :1984  
##  1st Qu.: 8361   Class :character   Class :character   1st Qu.:1991  
##  Median :16724   Mode  :character   Mode  :character   Median :1999  
##  Mean   :17038                                         Mean   :1999  
##  3rd Qu.:25265                                         3rd Qu.:2008  
##  Max.   :34932                                         Max.   :2015  
##                                                                      
##     class              trans              drive                cyl        
##  Length:33442       Length:33442       Length:33442       Min.   : 2.000  
##  Class :character   Class :character   Class :character   1st Qu.: 4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.000  
##                                                           Mean   : 5.772  
##                                                           3rd Qu.: 6.000  
##                                                           Max.   :16.000  
##                                                           NA's   :58      
##      displ           fuel                hwy              cty        
##  Min.   :0.000   Length:33442       Min.   :  9.00   Min.   :  6.00  
##  1st Qu.:2.300   Class :character   1st Qu.: 19.00   1st Qu.: 15.00  
##  Median :3.000   Mode  :character   Median : 23.00   Median : 17.00  
##  Mean   :3.353                      Mean   : 23.55   Mean   : 17.49  
##  3rd Qu.:4.300                      3rd Qu.: 27.00   3rd Qu.: 20.00  
##  Max.   :8.400                      Max.   :109.00   Max.   :138.00  
##  NA's   :57

#Display the names found in "cars".
names(cars)

##  [1] "id"    "make"  "model" "year"  "class" "trans" "drive" "cyl"  
##  [9] "displ" "fuel"  "hwy"   "cty"

#Display the structure of "cars".
str(cars)

## Classes 'tbl_df', 'tbl' and 'data.frame':    33442 obs. of  12 variables:
##  $ id   : int  27550 28426 27549 28425 1032 1033 3347 13309 13310 13311 ...
##  $ make : chr  "AM General" "AM General" "AM General" "AM General" ...
##  $ model: chr  "DJ Po Vehicle 2WD" "DJ Po Vehicle 2WD" "FJ8c Post Office" "FJ8c Post Office" ...
##  $ year : int  1984 1984 1984 1984 1985 1985 1987 1997 1997 1997 ...
##  $ class: chr  "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" ...
##  $ trans: chr  "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" ...
##  $ drive: chr  "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" ...
##  $ cyl  : int  4 4 6 6 4 6 6 4 4 6 ...
##  $ displ: num  2.5 2.5 4.2 4.2 2.5 4.2 3.8 2.2 2.2 3 ...
##  $ fuel : chr  "Regular" "Regular" "Regular" "Regular" ...
##  $ hwy  : int  17 17 13 13 17 13 21 26 28 26 ...
##  $ cty  : int  18 18 13 13 16 13 14 20 22 18 ...

#Create subset of "cars" that contains all vehicles of the make 'Chevrolet', 'Subaru', and 'Saab'.
cars_new <- subset(cars, cars$make =="Chevrolet"|cars$make=="Subaru"|cars$make=="Saab", select = c(id,make,year,trans,drive,fuel,cty))
#Display the head and the tail of "cars_new".
head(cars_new)

##        id      make year           trans            drive    fuel cty
## 3706 1567 Chevrolet 1985    Manual 4-spd Rear-Wheel Drive Regular  19
## 3707 1568 Chevrolet 1985 Automatic 4-spd Rear-Wheel Drive Regular  15
## 3708 1569 Chevrolet 1985    Manual 4-spd Rear-Wheel Drive Regular  15
## 3709 1570 Chevrolet 1985    Manual 5-spd Rear-Wheel Drive Regular  15
## 3710  923 Chevrolet 1985 Automatic 4-spd Rear-Wheel Drive Regular  18
## 3711  924 Chevrolet 1985    Manual 4-spd Rear-Wheel Drive Regular  19

tail(cars_new)

##          id   make year                            trans             drive
## 29475  5279 Subaru 1989                     Manual 5-spd Front-Wheel Drive
## 29476 32640 Subaru 2013                     Manual 5-spd   All-Wheel Drive
## 29477 32641 Subaru 2013 Automatic (variable gear ratios)   All-Wheel Drive
## 29478 34233 Subaru 2014                     Manual 5-spd   All-Wheel Drive
## 29479 34234 Subaru 2014 Automatic (variable gear ratios)   All-Wheel Drive
## 29480 34547 Subaru 2014 Automatic (variable gear ratios)   All-Wheel Drive
##          fuel cty
## 29475 Regular  21
## 29476 Regular  23
## 29477 Regular  25
## 29478 Regular  23
## 29479 Regular  25
## 29480 Regular  29

#Display the summary statistics of "cars_new".
summary(cars_new)

##        id            make                year         trans          
##  Min.   :    3   Length:4649        Min.   :1984   Length:4649       
##  1st Qu.: 6653   Class :character   1st Qu.:1989   Class :character  
##  Median :14714   Mode  :character   Median :1996   Mode  :character  
##  Mean   :15578                      Mean   :1997                     
##  3rd Qu.:23634                      3rd Qu.:2006                     
##  Max.   :34892                      Max.   :2015                     
##     drive               fuel                cty        
##  Length:4649        Length:4649        Min.   :  8.00  
##  Class :character   Class :character   1st Qu.: 14.00  
##  Mode  :character   Mode  :character   Median : 16.00  
##                                        Mean   : 16.88  
##                                        3rd Qu.: 19.00  
##                                        Max.   :128.00

#Display the structure of "cars_new".
str(cars_new)

## Classes 'tbl_df', 'tbl' and 'data.frame':    4649 obs. of  7 variables:
##  $ id   : int  1567 1568 1569 1570 923 924 925 926 927 928 ...
##  $ make : chr  "Chevrolet" "Chevrolet" "Chevrolet" "Chevrolet" ...
##  $ year : int  1985 1985 1985 1985 1985 1985 1985 1985 1985 1985 ...
##  $ trans: chr  "Manual 4-spd" "Automatic 4-spd" "Manual 4-spd" "Manual 5-spd" ...
##  $ drive: chr  "Rear-Wheel Drive" "Rear-Wheel Drive" "Rear-Wheel Drive" "Rear-Wheel Drive" ...
##  $ fuel : chr  "Regular" "Regular" "Regular" "Regular" ...
##  $ cty  : int  19 15 15 15 18 19 19 16 15 16 ...

#Categorize 'make' as a factor and display its resulting levels.
cars_new$make = as.factor(cars_new$make)
levels(cars_new$make)

## [1] "Chevrolet" "Saab"      "Subaru"

#Categorize 'trans' as a factor and display its resulting levels.
cars_new$trans[cars_new$trans == "Auto (AV)"] = "Automatic"
cars_new$trans[cars_new$trans == "Auto(AV-S6)"] = "Automatic"
cars_new$trans[cars_new$trans == "Auto(AV-S8)"] = "Automatic"
cars_new$trans[cars_new$trans == "Automatic (A1)"] = "Automatic"
cars_new$trans[cars_new$trans == "Automatic (S4)"] = "Automatic"
cars_new$trans[cars_new$trans == "Automatic (S5)"] = "Automatic"
cars_new$trans[cars_new$trans == "Automatic (S6)"] = "Automatic"
cars_new$trans[cars_new$trans == "Automatic (variable gear ratios)"] = "Automatic"
cars_new$trans[cars_new$trans == "Automatic 3-spd"] = "Automatic"
cars_new$trans[cars_new$trans == "Automatic 4-spd"] = "Automatic"
cars_new$trans[cars_new$trans == "Automatic 5-spd"] = "Automatic"
cars_new$trans[cars_new$trans == "Automatic 6-spd"] = "Automatic"
cars_new$trans[cars_new$trans == "Manual 3-spd"] = "Manual"
cars_new$trans[cars_new$trans == "Manual 4-spd"] = "Manual"
cars_new$trans[cars_new$trans == "Manual 5-spd"] = "Manual"
cars_new$trans[cars_new$trans == "Manual 6-spd"] = "Manual"
cars_new$trans[cars_new$trans == "Manual 7-spd"] = "Manual"
cars_new$trans = as.factor(cars_new$trans)
levels(cars_new$trans)

## [1] "Automatic" "Manual"

#Categorize 'drive' as a factor and display its resulting levels.
cars_new$drive = as.factor(cars_new$drive)
levels(cars_new$drive)

## [1] "2-Wheel Drive"              "4-Wheel Drive"             
## [3] "4-Wheel or All-Wheel Drive" "All-Wheel Drive"           
## [5] "Front-Wheel Drive"          "Rear-Wheel Drive"

#Categorize 'fuel' as a factor and display its resulting levels.
cars_new$fuel[cars_new$fuel == "CNG"] = "Normal-Grade"
cars_new$fuel[cars_new$fuel == "Diesel"] = "Normal-Grade"
cars_new$fuel[cars_new$fuel == "Electricity"] = "High-Grade"
cars_new$fuel[cars_new$fuel == "Gasoline or E85"] = "Normal-Grade"
cars_new$fuel[cars_new$fuel == "Gasoline or natural gas"] = "Normal-Grade"
cars_new$fuel[cars_new$fuel == "Premium"] = "High-Grade"
cars_new$fuel[cars_new$fuel == "Premium Gas or Electricity"] = "High-Grade"
cars_new$fuel[cars_new$fuel == "Regular"] = "Normal-Grade"
cars_new$fuel = as.factor(cars_new$fuel)
levels(cars_new$fuel)

## [1] "High-Grade"   "Normal-Grade"

#Transform 'year' into a categorical variable (80s, 90s, 00s, and 10s), categorize them as factors, and display their resulting levels.
cars_new$year[cars_new$year > 1979 & cars_new$year <= 1989] = "80s"
cars_new$year[cars_new$year > 1989 & cars_new$year <= 1999] = "90s"
cars_new$year[cars_new$year > 1999 & cars_new$year <= 2009] = "00s"
cars_new$year[cars_new$year > 2009 & cars_new$year <= 2015] = "10s"
cars_new$year = as.factor(cars_new$year)
levels(cars_new$year)

## [1] "00s" "10s" "80s" "90s"

Continuous variables (if any)

In this dataset, only one variable is noted as being a continuous variable (‘displ’, which refers to the engine displacement, in litres, has a numeric data type). While it could be argued that the highway and city fuel economies [in mpg] (denoted, respectively, as ‘hwy’ and ‘cty’) are generally considered to be continuous variables, this dataset notes each of their data types as being integer values. The five different factors being considered in this analysis, which include “make”, “drive”, “trans”, “fuel”, and “year”, are all going to be treated as categorical variables in this analysis.

Response variables

This analysis will consider one response variable, ‘cty’, which denotes city fuel economy in this analysis.

The Data: How is it organized and what does it look like?

This dataset includes fuel economy data from the EPA during the years 1985-2015. This dataset contains selected variables, and removes vehicles with incomplete data (e.g. no drive train data). It contains 33,442 observations of 12 variables, which include “id”, “make”, “model”, “year”, “class”, “trans”, “drive”, “cyl”, “displ”, “fuel”, “hwy”, and “cty”. The variables that are made up of integer values include “id” (which is a unique EPA identifier), “year” (which refers to the model year), “cyl” (which to the number of cylinders), “hwy” (which refers to the highway fuel economy, in mpg), and “cty” (which refers to the city fuel economy, in mpg). The variables that are made up of character values include “make” (which refers to the manufacturer), “model” (which refers to the model name), “class” (which refers to the EPA vehicle size class), “trans” (which refers to the transmission), “drive” (which refers to the drive train), and “fuel” (which refers to the fuel type). Lastly, the variable “displ” is a numeric variable, and it refers to the engine displacement, in litres.

Randomization

According to the official U.S. government source for fuel economy information, this fuel economy data “are the result of vehicle testing done at the Environmental Protection Agency’s National Vehicle and Fuel Emissions Laboratory in Ann Arbor, Michigan, and by vehicle manufacturers with oversight by EPA.” [1] Therefore, we can make the assumption that this dataset exhibits randomization.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

In this experiment, we are trying to determine whether or not the variation that is observed in the response variable (which corresponds to ‘cty’ in this analysis) can be explained by the variation existent in the five different treatments of the experiment (which correspond to ‘make’, ‘drive’, ‘trans’, ‘fuel’, and ‘year’). Therefore, the null hypothesis that is being tested states that the make, the drive type, the transmission type, the fuel type, and the production year of a vehicle do not have have a statistically significant effect on city fuel economy. Alternately, the alternative hypothesis that is being tested states that the make, the drive type, the transmission type, the fuel type, and the production year of a vehicle do have have a statistically significant effect on city fuel economy. In carrying out this analysis, an analysis of variance (ANOVA) and Taguchi Methods are used in consideration of the response variable city fuel economy (‘cty’) to see if there is a significant difference in the means for this response variable when considering the make, the drive type, the transmission type, the fuel type, and the production year of a vehicle, which are contained in this dataset.

What is the rationale for this design?

The rationale for this design lies primarily in the fact that we’re trying to determine if the make, the drive type, the transmission type, the fuel type, and the production year of a vehicle have any effect on city fuel economy. So, since city fuel economy (in mpg) is a useful and relevant metric to consider when determining how well vehicles perform, this design of a five-factor, multi-level experiment (as it corresponds to an analysis of variance and the use of Taguchi Methods) was crafted to see if the make, the drive type, the transmission type, the fuel type, and the production year of a vehicle have a statistically significant effect on city fuel economy. By performing this analysis, we can hope to receive some insight regarding the different factors that come into play that typically result in high values of city fuel economy.

Randomize: What is the Randomization Scheme?

Since our original assumption claimed that the entire vehicle dataset exhibits randomization, and since our three specific “makes” of consideration were selected at random from the list of all available “makes” in the dataset (and include every vehicle present in the dataset that corresponds to either of the three selected “makes”), our randomization scheme stems from the existence of randomization in the original dataset collected by the EPA, which allows us to ensure that a completely randomized block design can be created here.

Replicate: Are there replicates and/or repeated measures?

In this experiment, there are no replicates or repeated measures present.

Block: Did you use blocking in the design?

By designing this experiment in such a way that allows for multiple levels to be considered (so as to reach the most accurate conclusions possible), blocking was not considered in this five-factor, multi-level design.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and Descriptive Summary

In beginning to display this data graphically, summary statistics were gathered for the dataset being considered here, “cars_new”. Additionally, histograms and boxplots were created to represent the different observations of city fuel economy.

#Display the summary statistics of "cars_new".
summary(cars_new)

##        id               make       year            trans     
##  Min.   :    3   Chevrolet:3461   00s:1394   Automatic:2941  
##  1st Qu.: 6653   Saab     : 424   10s: 595   Manual   :1708  
##  Median :14714   Subaru   : 764   80s:1310                   
##  Mean   :15578                    90s:1350                   
##  3rd Qu.:23634                                               
##  Max.   :34892                                               
##                         drive                fuel           cty        
##  2-Wheel Drive             : 115   High-Grade  : 490   Min.   :  8.00  
##  4-Wheel Drive             :  68   Normal-Grade:4159   1st Qu.: 14.00  
##  4-Wheel or All-Wheel Drive:1350                       Median : 16.00  
##  All-Wheel Drive           : 166                       Mean   : 16.88  
##  Front-Wheel Drive         :1348                       3rd Qu.: 19.00  
##  Rear-Wheel Drive          :1602                       Max.   :128.00

#Display the names found in "cars_new".
names(cars_new)

## [1] "id"    "make"  "year"  "trans" "drive" "fuel"  "cty"

#Display the structure of "cars_new".
str(cars_new)

## Classes 'tbl_df', 'tbl' and 'data.frame':    4649 obs. of  7 variables:
##  $ id   : int  1567 1568 1569 1570 923 924 925 926 927 928 ...
##  $ make : Factor w/ 3 levels "Chevrolet","Saab",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year : Factor w/ 4 levels "00s","10s","80s",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ trans: Factor w/ 2 levels "Automatic","Manual": 2 1 2 2 1 2 2 1 2 2 ...
##  $ drive: Factor w/ 6 levels "2-Wheel Drive",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ fuel : Factor w/ 2 levels "High-Grade","Normal-Grade": 2 2 2 2 2 2 2 2 2 2 ...
##  $ cty  : int  19 15 15 15 18 19 19 16 15 16 ...

#Display the head and tail of "cars_new".
head(cars_new)

##        id      make year     trans            drive         fuel cty
## 3706 1567 Chevrolet  80s    Manual Rear-Wheel Drive Normal-Grade  19
## 3707 1568 Chevrolet  80s Automatic Rear-Wheel Drive Normal-Grade  15
## 3708 1569 Chevrolet  80s    Manual Rear-Wheel Drive Normal-Grade  15
## 3709 1570 Chevrolet  80s    Manual Rear-Wheel Drive Normal-Grade  15
## 3710  923 Chevrolet  80s Automatic Rear-Wheel Drive Normal-Grade  18
## 3711  924 Chevrolet  80s    Manual Rear-Wheel Drive Normal-Grade  19

tail(cars_new)

##          id   make year     trans             drive         fuel cty
## 29475  5279 Subaru  80s    Manual Front-Wheel Drive Normal-Grade  21
## 29476 32640 Subaru  10s    Manual   All-Wheel Drive Normal-Grade  23
## 29477 32641 Subaru  10s Automatic   All-Wheel Drive Normal-Grade  25
## 29478 34233 Subaru  10s    Manual   All-Wheel Drive Normal-Grade  23
## 29479 34234 Subaru  10s Automatic   All-Wheel Drive Normal-Grade  25
## 29480 34547 Subaru  10s Automatic   All-Wheel Drive Normal-Grade  29

#Display the levels of 'make', 'drive', 'trans', 'year', and 'fuel' within "cars_new".
levels(cars_new$make)

## [1] "Chevrolet" "Saab"      "Subaru"

levels(cars_new$drive)

## [1] "2-Wheel Drive"              "4-Wheel Drive"             
## [3] "4-Wheel or All-Wheel Drive" "All-Wheel Drive"           
## [5] "Front-Wheel Drive"          "Rear-Wheel Drive"

levels(cars_new$trans)

## [1] "Automatic" "Manual"

levels(cars_new$year)

## [1] "00s" "10s" "80s" "90s"

levels(cars_new$fuel)

## [1] "High-Grade"   "Normal-Grade"

#Create a histogram of city fuel economy ('cty').
par(mfrow=c(1,1))
hist(cars_new$cty, xlim=c(8,60), main = "City Fuel Economy (in mpg)")

#Create a boxplot of 'cty' mileage for all vehicle makes.
par(mfrow=c(1,1))
boxplot(cty~make,cars_new, ylim = c(8,60), main = "City Fuel Economy", ylab = "MPG", xlab = "Vehicle Make")

#Create a boxplot of 'cty' mileage for all vehicle drive types.
par(mfrow=c(1,1))
boxplot(cty~drive,cars_new, ylim = c(8,60), main = "City Fuel Economy", ylab = "MPG", xlab = "Vehicle Drive Type")

#Create a boxplot of 'cty' mileage for all vehicle transmission types.
par(mfrow=c(1,1))
boxplot(cty~trans,cars_new, ylim = c(8,60), main = "City Fuel Economy", ylab = "MPG", xlab = "Vehicle Transmission Type")

#Create a boxplot of 'cty' mileage for all vehicle fuel types.
par(mfrow=c(1,1))
boxplot(cty~fuel,cars_new, ylim = c(8,60), main = "City Fuel Economy", ylab = "MPG", xlab = "Vehicle Fuel Type")

#Create a boxplot of 'cty' mileage for all vehicle years.
par(mfrow=c(1,1))
boxplot(cty~year,cars_new, ylim = c(8,60), main = "City Fuel Economy", ylab = "MPG", xlab = "Year")

Testing - ANOVA

In order to determine if the variation that is observed in the response variable (which corresponds to ‘cty’ in this analysis) can be explained by the variation existent in the treatments of the experiment (which correspond to ‘make’, ‘drive’, ‘trans’, ‘fuel’, and ‘year’ [and the way in which their categories were determined]), an analysis of variance (ANOVA) is performed as a means for analyzing the differences in city fuel economy values for each of the different drive types, vehicle makes, transmission types, fuel types, and production years being considered in this dataset.

For the ANOVA model that is designed in this experiment, the null hypothesis that is being tested (which we will either reject or fail to reject by the end of our analysis) states that the make, the drive type, the transmission type, the fuel type, and the production year of a vehicle do not have have a statistically significant effect on city fuel economy, implying that the differences in the mean values of city fuel economy were solely the result of randomization in this experiment. In other words, if we reject the null hypothesis, we would infer that the differences in mean values of city fuel economy for each of the corresponding drive types, transmission types, fuel types, production years, and vehicle makes being considered in this dataset is caused by something other than randomization, leading us to believe that the variation that is observed in the mean values of city fuel economy can be explained by the variation existent in the different drive types, transmission types, fuel types, production years, and vehicle makes being considered in this analysis. Alternately, if we fail to reject the null hypothesis, we would infer that the variation that is observed in the mean values of city fuel economy cannot be explained by the variation existent in the different drive types, transmission types, fuel types, production years, and vehicle makes being considered in this analysis and, as such, is likely caused by randomization.

#Perform an analysis of variance (ANOVA) for the different mean values observed for city fuel economy, given the factor 'make', 'year', 'trans', 'drive', and 'fuel'.
vehicles_anova <- aov(cty~make+year+trans+drive+fuel, data = cars_new)
anova(vehicles_anova)

## Analysis of Variance Table
## 
## Response: cty
##             Df Sum Sq Mean Sq F value Pr(>F)    
## make         2   8620  4310.1 404.567 <2e-16 ***
## year         3   2834   944.7  88.675 <2e-16 ***
## trans        1   2337  2336.9 219.357 <2e-16 ***
## drive        5  23537  4707.4 441.861 <2e-16 ***
## fuel         1     12    12.1   1.139 0.2859    
## Residuals 4636  49390    10.7                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For the analysis of variance (ANOVA) that is performed where the factor ‘make’ is analyzed against the response variable ‘cty’, a p-value < 2.2e-16 is returned, indicating that there is roughly a probability of < 2.2e-16 that the resulting associated F-value (404.567) is the result of solely randomization. Therefore, based on this result, we would reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of city fuel economy can be explained by the variation existent in the different vehicle makes being considered in this analysis and, as such, is likely not caused solely by randomization. (See above results for p-value and F-value.)

For the analysis of variance (ANOVA) that is performed where ‘year’ is analyzed against the response variable ‘cty’, a p-value < 2.2e-16 is returned, indicating that there is roughly a probability of < 2.2e-16 that the resulting associated F-value (88.675) is the result of solely randomization. Therefore, based on this result, we would reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of city fuel economy can be explained by the variation existent in the different production years being considered in this analysis and, as such, is likely not caused solely by randomization. (See above results for p-value and F-value.)

For the analysis of variance (ANOVA) that is performed where ‘trans’ is analyzed against the response variable ‘cty’, a p-value < 2.2e-16 is returned, indicating that there is roughly a probability of < 2.2e-16 that the resulting associated F-value (219.357) is the result of solely randomization. Therefore, based on this result, we would reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of city fuel economy can be explained by the variation existent in the different transmission types being considered in this analysis and, as such, is likely not caused solely by randomization. (See above results for p-value and F-value.)

For the analysis of variance (ANOVA) that is performed where ‘drive’ is analyzed against the response variable ‘cty’, a p-value < 2.2e-16 is returned, indicating that there is roughly a probability of < 2.2e-16 that the resulting associated F-value (441.861) is the result of solely randomization. Therefore, based on this result, we would reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of city fuel economy can be explained by the variation existent in the different drive types being considered in this analysis and, as such, is likely not caused solely by randomization. (See above results for p-value and F-value.)

For the analysis of variance (ANOVA) that is performed where ‘fuel’ is analyzed against the response variable ‘cty’, a p-value = 0.2859 is returned, indicating that there is roughly a probability of 0.2859 that the resulting associated F-value (41.139) is the result of solely randomization. Therefore, based on this result, we would fail to reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of city fuel economy cannot be explained by the variation existent in the different fuel types being considered in this analysis and, as such, is likely caused solely by randomization. (See above results for p-value and F-value.) [Note: With consideration being made for the practical use of Taguchi Methods in this experiment, the factor “fuel type” was reduced from having eight distinct levels to having two distinct levels. This type of categorization may have resulted in this ANOVA analysis’ determined insignificance, which is likely not completely accurate in reality.]

Tukey Honest Significant Differences

In further carrying out this analysis, we can compute Tukey Honest Significant Differences (via “TukeyHSD()”) as a means for determining the specific levels of each factor existent in this analysis that are truly independent from each other and that significantly affect the response variable, ‘cty’.

#Perform a TukeyHSD Test for "vehicles_anova" [cars_new$make].
Tukey_make = TukeyHSD(vehicles_anova, which = "make", ordered = FALSE, conf.level = 0.95)
Tukey_make

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = cty ~ make + year + trans + drive + fuel, data = cars_new)
## 
## $make
##                      diff       lwr      upr p adj
## Saab-Chevrolet   1.297767 0.9040352 1.691498     0
## Subaru-Chevrolet 3.677916 3.3720336 3.983798     0
## Subaru-Saab      2.380149 1.9167374 2.843561     0

par(mfrow=c(1,1))
plot(Tukey_make)

#Perform a TukeyHSD Test for "vehicles_anova" [cars_new$year].
Tukey_year = TukeyHSD(vehicles_anova, which = "year", ordered = FALSE, conf.level = 0.95)
Tukey_year

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = cty ~ make + year + trans + drive + fuel, data = cars_new)
## 
## $year
##               diff        lwr        upr     p adj
## 10s-00s  2.2918450  1.8810728  2.7026172 0.0000000
## 80s-00s  0.6368807  0.3140983  0.9596630 0.0000025
## 90s-00s -0.1443446 -0.4646523  0.1759630 0.6533456
## 80s-10s -1.6549643 -2.0696575 -1.2402712 0.0000000
## 90s-10s -2.4361896 -2.8489594 -2.0234198 0.0000000
## 90s-80s -0.7812253 -1.1065461 -0.4559045 0.0000000

par(mfrow=c(1,1))
plot(Tukey_year)

#Perform a TukeyHSD Test for "vehicles_anova" [cars_new$trans].
Tukey_trans = TukeyHSD(vehicles_anova, which = "trans", ordered = FALSE, conf.level = 0.95)
Tukey_trans

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = cty ~ make + year + trans + drive + fuel, data = cars_new)
## 
## $trans
##                      diff      lwr      upr p adj
## Manual-Automatic 1.428982 1.234313 1.623651     0

par(mfrow=c(1,1))
plot(Tukey_trans)

#Perform a TukeyHSD Test for "vehicles_anova" [cars_new$drive].
Tukey_drive = TukeyHSD(vehicles_anova, which = "drive", ordered = FALSE, conf.level = 0.95)
Tukey_drive

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = cty ~ make + year + trans + drive + fuel, data = cars_new)
## 
## $drive
##                                                     diff        lwr
## 4-Wheel Drive-2-Wheel Drive                  -1.87863015 -3.3021080
## 4-Wheel or All-Wheel Drive-2-Wheel Drive     -1.07751231 -1.9814355
## All-Wheel Drive-2-Wheel Drive                -1.14130396 -2.2702644
## Front-Wheel Drive-2-Wheel Drive               3.54981158  2.6458357
## Rear-Wheel Drive-2-Wheel Drive               -0.85632309 -1.7546481
## 4-Wheel or All-Wheel Drive-4-Wheel Drive      0.80111785 -0.3553813
## All-Wheel Drive-4-Wheel Drive                 0.73732620 -0.6024367
## Front-Wheel Drive-4-Wheel Drive               5.42844174  4.2719015
## Rear-Wheel Drive-4-Wheel Drive                1.02230706 -0.1298218
## All-Wheel Drive-4-Wheel or All-Wheel Drive   -0.06379165 -0.8291366
## Front-Wheel Drive-4-Wheel or All-Wheel Drive  4.62732389  4.2690314
## Rear-Wheel Drive-4-Wheel or All-Wheel Drive   0.22118922 -0.1225971
## Front-Wheel Drive-All-Wheel Drive             4.69111554  3.9257085
## Rear-Wheel Drive-All-Wheel Drive              0.28498087 -0.4737441
## Rear-Wheel Drive-Front-Wheel Drive           -4.40613467 -4.7500594
##                                                      upr     p adj
## 4-Wheel Drive-2-Wheel Drive                  -0.45515235 0.0023506
## 4-Wheel or All-Wheel Drive-2-Wheel Drive     -0.17358909 0.0089382
## All-Wheel Drive-2-Wheel Drive                -0.01234351 0.0457721
## Front-Wheel Drive-2-Wheel Drive               4.45378744 0.0000000
## Rear-Wheel Drive-2-Wheel Drive                0.04200195 0.0719630
## 4-Wheel or All-Wheel Drive-4-Wheel Drive      1.95761697 0.3569774
## All-Wheel Drive-4-Wheel Drive                 2.07708914 0.6191827
## Front-Wheel Drive-4-Wheel Drive               6.58498200 0.0000000
## Rear-Wheel Drive-4-Wheel Drive                2.17443592 0.1157128
## All-Wheel Drive-4-Wheel or All-Wheel Drive    0.70155327 0.9998971
## Front-Wheel Drive-4-Wheel or All-Wheel Drive  4.98561636 0.0000000
## Rear-Wheel Drive-4-Wheel or All-Wheel Drive   0.56497552 0.4436496
## Front-Wheel Drive-All-Wheel Drive             5.45652263 0.0000000
## Rear-Wheel Drive-All-Wheel Drive              1.04370581 0.8930407
## Rear-Wheel Drive-Front-Wheel Drive           -4.06221000 0.0000000

par(mfrow=c(1,1))
plot(Tukey_drive)

#Perform a TukeyHSD Test for "vehicles_anova" [cars_new$fuel].
Tukey_fuel = TukeyHSD(vehicles_anova, which = "fuel", ordered = FALSE, conf.level = 0.95)
Tukey_fuel

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = cty ~ make + year + trans + drive + fuel, data = cars_new)
## 
## $fuel
##                              diff        lwr       upr    p adj
## Normal-Grade-High-Grade 0.1539237 -0.1517058 0.4595532 0.323522

par(mfrow=c(1,1))
plot(Tukey_fuel)

After observing the results of these Tukey Honest Significant Differences for “vehicles_anova” [cars_new$make], it’s seemingly clear that each of the different level-comparisons suggest a significant difference in means among all of the different level-pairs, since their respective p-values are all less than 0.05 (see output of “Tukey_make”). This further suggests the significance of the results that were gathered in the above ANOVA model “vehicles_anova”, since the factor “make” was determined to be significant.

After observing the results of these Tukey Honest Significant Differences for “vehicles_anova” [cars_new$year], it’s seemingly clear that each of the different level-comparisons suggest a significant difference in means among a great majority of the different level-pairs (excluding “90s-00s” since, unlike the other level-pairs, their respective p-values are greater than 0.05) (see output of “Tukey_year”). This further suggests the significance of the results that were gathered in the above ANOVA model “vehicles_anova”, since the factor “year” was determined to be significant.

After observing the results of these Tukey Honest Significant Differences for “vehicles_anova” [cars_new$trans], it’s seemingly clear that the level comparison of “Manual” and “Automatic” suggests a significant difference in means [for the level-pair “Manual-Automatic”], since its respective p-value is less than 0.05 (see output of “Tukey_trans”). Therefore, this test suggests the significance of the results that were gathered in the above ANOVA model “vehicles_anova”, since the factor “trans” was determined to be significant.

After observing the results of these Tukey Honest Significant Differences for “vehicles_anova” [cars_new$drive], it’s seemingly clear that approximately more than half of the different level-comparisons suggest a significant difference in means of the different level-pairs, since their respective p-values are less than 0.05 (see output of “Tukey_drive”). Therefore, this test suggests the significance of the results that were gathered in the above ANOVA model “vehicles_anova”, since the factor “drive” was determined to be significant.

After observing the results of these Tukey Honest Significant Differences for “vehicles_anova” [cars_new$fuel], it’s seemingly clear that the level comparison of “Normal-Grade” and “High-Grade” suggests a non-significant difference in means [for the level-pair “Normal-Grade-High-Grade”], since its respective p-value is greater than 0.05 (see output of “Tukey_fuel”). Therefore, this test suggests the significance of the results that were gathered in the above ANOVA model “vehicles_anova”, since the factor “fuel” was determined to be insignificant.

Testing - Taguchi Design Methods

In order to determine if the variation that is observed in the response variable (which corresponds to ‘cty’ in this analysis) can be explained by the variation existent in the treatments of the experiment (which correspond to ‘make’, ‘drive’, ‘trans’, ‘fuel’, and ‘year’ [and the way in which their categories were determined]), Taguchi Design Methods via orthogonal array creation can be used as a means for analyzing the differences in city fuel economy values for each of the different drive types, vehicle makes, transmission types, fuel types, and production years being considered in this dataset and determining the most optimal factor-level combination of vehicle to select that maximizes city fuel economy.

#Create a new dataset to be used for the Taguchi Design Methods analysis.
vehicles_taguchi <- cars_new[,]
#Convert the factor 'make' into a factor variable designated by its levels "Chevrolet", "Saab" and "Subaru".
vehicles_taguchi$make <- as.character(vehicles_taguchi$make)
vehicles_taguchi$make[vehicles_taguchi$make == "Chevrolet"] <- 1
vehicles_taguchi$make[vehicles_taguchi$make == "Saab"] <- 2
vehicles_taguchi$make[vehicles_taguchi$make == "Subaru"] <- 3
#Categorize 'make' as a factor variable and display its resulting structure.
vehicles_taguchi$make = as.factor(vehicles_taguchi$make)
str(vehicles_taguchi$make)

##  Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...

#Convert the factor 'year' into a factor variable designated by its levels "80s", "90s", "00s", and "10s".
vehicles_taguchi$year <- as.character(vehicles_taguchi$year)
vehicles_taguchi$year[vehicles_taguchi$year == "80s"] <- 1
vehicles_taguchi$year[vehicles_taguchi$year == "90s"] <- 2
vehicles_taguchi$year[vehicles_taguchi$year == "00s"] <- 3
vehicles_taguchi$year[vehicles_taguchi$year == "10s"] <- 4
#Categorize 'year' as a factor variable and display its resulting structure.
vehicles_taguchi$year = as.factor(vehicles_taguchi$year)
str(vehicles_taguchi$year)

##  Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...

#Convert the factor 'trans' into a factor variable designated by its levels "Automatic" and "Manual".
vehicles_taguchi$trans <- as.character(vehicles_taguchi$trans)
vehicles_taguchi$trans[vehicles_taguchi$trans == "Automatic"] <- 1
vehicles_taguchi$trans[vehicles_taguchi$trans == "Manual"] <- 2
#Categorize 'trans' as a factor variable and display its resulting structure.
vehicles_taguchi$trans = as.factor(vehicles_taguchi$trans)
str(vehicles_taguchi$trans)

##  Factor w/ 2 levels "1","2": 2 1 2 2 1 2 2 1 2 2 ...

#Convert the factor 'drive' into a factor variable designated by its levels "2-Wheel Drive", "4-Wheel Drive", "4-Wheel or All-Wheel Drive", "All-Wheel Drive", "Front-Wheel Drive", and "Rear-Wheel Drive".
vehicles_taguchi$drive <- as.character(vehicles_taguchi$drive)
vehicles_taguchi$drive[vehicles_taguchi$drive == "2-Wheel Drive"] <- 1
vehicles_taguchi$drive[vehicles_taguchi$drive == "4-Wheel Drive"] <- 2
vehicles_taguchi$drive[vehicles_taguchi$drive == "4-Wheel or All-Wheel Drive"] <- 3
vehicles_taguchi$drive[vehicles_taguchi$drive == "All-Wheel Drive"] <- 4
vehicles_taguchi$drive[vehicles_taguchi$drive == "Front-Wheel Drive"] <- 5
vehicles_taguchi$drive[vehicles_taguchi$drive == "Rear-Wheel Drive"] <- 6
#Categorize 'drive' as a factor variable and display its resulting structure.
vehicles_taguchi$drive = as.factor(vehicles_taguchi$drive)
str(vehicles_taguchi$drive)

##  Factor w/ 6 levels "1","2","3","4",..: 6 6 6 6 6 6 6 6 6 6 ...

#Convert the factor 'fuel' into a factor variable designated by its levels "Chevrolet", "Saab" and "Subaru".
vehicles_taguchi$fuel <- as.character(vehicles_taguchi$fuel)
vehicles_taguchi$fuel[vehicles_taguchi$fuel == "Normal-Grade"] <- 1
vehicles_taguchi$fuel[vehicles_taguchi$fuel == "High-Grade"] <- 2
#Categorize 'fuel' as a factor variable and display its resulting structure.
vehicles_taguchi$fuel = as.factor(vehicles_taguchi$fuel)
str(vehicles_taguchi$fuel)

##  Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...

#Install and load the 'qualityTools' dataset into R.
install.packages("qualityTools", repos='http://cran.us.r-project.org')

## package 'qualityTools' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\howelb\AppData\Local\Temp\RtmpszCIHe\downloaded_packages

library("qualityTools",lib.loc="C:/Program Files/R/R-3.1.1/library")

## Warning: package 'qualityTools' was built under R version 3.1.2

#Install and load the 'DoE.base' dataset into R.
install.packages("DoE.base", repos='http://cran.us.r-project.org')

## package 'DoE.base' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\howelb\AppData\Local\Temp\RtmpszCIHe\downloaded_packages

library("DoE.base",lib.loc="C:/Program Files/R/R-3.1.1/library")

## Warning: package 'DoE.base' was built under R version 3.1.2

## Loading required package: grid
## Loading required package: conf.design

## Warning: package 'conf.design' was built under R version 3.1.2

## 
## Attaching package: 'DoE.base'
## 
## The following objects are masked from 'package:stats':
## 
##     aov, lm
## 
## The following object is masked from 'package:graphics':
## 
##     plot.design

#Create core orthogonal array, minimizing aliasing between main effects and two-factor interactions via 'columns = "min3".
core.array = oa.design(factor.names = c("make", "year", "trans", "drive", "fuel"), nlevels=c(3,4,2,6,2), columns = "min3")
print(core.array)

##    make year trans drive fuel
## 1     1    1     2     4    1
## 2     1    3     1     3    1
## 3     3    3     1     5    2
## 4     2    1     2     1    1
## 5     1    2     1     1    2
## 6     1    4     2     4    2
## 7     2    4     2     6    2
## 8     3    4     2     6    1
## 9     1    2     2     3    2
## 10    1    1     1     1    1
## 11    1    1     1     5    1
## 12    1    4     1     2    1
## 13    3    2     1     2    1
## 14    1    3     2     1    2
## 15    2    2     1     3    2
## 16    2    1     2     5    2
## 17    3    1     2     5    1
## 18    1    1     2     3    2
## 19    3    4     2     2    2
## 20    2    4     1     2    2
## 21    2    3     2     4    2
## 22    1    2     2     6    1
## 23    2    2     2     1    2
## 24    2    3     1     2    1
## 25    3    3     1     3    2
## 26    1    4     1     3    1
## 27    2    3     2     3    1
## 28    3    2     1     4    2
## 29    2    4     1     4    2
## 30    1    4     1     6    2
## 31    3    3     2     6    2
## 32    2    1     1     6    2
## 33    2    2     1     6    1
## 34    3    1     2     3    1
## 35    2    1     2     2    2
## 36    3    1     1     1    2
## 37    3    1     1     2    1
## 38    2    4     2     3    1
## 39    2    2     1     5    2
## 40    3    4     2     4    1
## 41    1    1     1     6    2
## 42    2    3     1     1    2
## 43    3    2     1     1    1
## 44    1    1     2     2    2
## 45    2    2     2     4    1
## 46    1    3     1     2    1
## 47    2    4     1     1    1
## 48    3    3     2     2    2
## 49    2    3     2     6    1
## 50    3    4     2     1    2
## 51    2    3     1     5    1
## 52    3    2     2     5    2
## 53    1    3     1     4    2
## 54    3    1     1     6    1
## 55    3    2     2     3    1
## 56    1    2     2     2    2
## 57    3    4     1     3    2
## 58    3    3     1     4    1
## 59    2    2     2     2    1
## 60    1    2     2     5    1
## 61    1    4     2     1    1
## 62    3    2     1     6    2
## 63    1    3     2     6    1
## 64    1    2     1     4    1
## 65    3    1     2     4    2
## 66    2    4     2     5    1
## 67    3    4     1     5    1
## 68    1    4     1     5    2
## 69    1    3     2     5    2
## 70    2    1     1     3    2
## 71    3    3     2     1    1
## 72    2    1     1     4    1
## class=design, type= oa

#Merge "core.array" with "vehicle_taguchi" dataset by main factors.
merged.array <- merge(core.array, vehicles_taguchi, by=c("make", "year", "trans", "drive", "fuel"), all=FALSE)
#Create array from unique rows of "merged.array".
unique.array = unique(merged.array[,1:5])
unique.array

##      make year trans drive fuel
## 1       1    1     1     1    1
## 57      1    1     1     5    1
## 164     1    1     1     6    2
## 172     1    2     1     1    2
## 174     1    2     2     3    2
## 177     1    2     2     5    1
## 237     1    2     2     6    1
## 388     1    3     1     3    1
## 651     1    3     2     5    2
## 661     1    3     2     6    1
## 719     1    4     1     2    1
## 782     1    4     1     5    2
## 787     1    4     1     6    2
## 790     2    1     2     5    2
## 793     2    3     1     5    1
## 822     2    3     2     3    1
## 828     2    4     1     4    2
## 830     2    4     2     5    1
## 841     3    1     2     3    1
## 895     3    1     2     5    1
## 940     3    2     2     3    1
## 1004    3    3     1     3    2
## 1061    3    4     2     4    1

#Create final array for further analysis.
final.cty = merged.array$cty[ind = c(1, 57, 164, 172, 174, 177, 237, 388, 651, 661, 719, 782, 787, 790, 793, 822, 828, 830, 841, 895, 940, 1004, 1061)]
final.array = cbind(unique.array,final.cty)
final.array

##      make year trans drive fuel final.cty
## 1       1    1     1     1    1        13
## 57      1    1     1     5    1        18
## 164     1    1     1     6    2        15
## 172     1    2     1     1    2        28
## 174     1    2     2     3    2        15
## 177     1    2     2     5    1        22
## 237     1    2     2     6    1        13
## 388     1    3     1     3    1        14
## 651     1    3     2     5    2        20
## 661     1    3     2     6    1        17
## 719     1    4     1     2    1        16
## 782     1    4     1     5    2       128
## 787     1    4     1     6    2        12
## 790     2    1     2     5    2        18
## 793     2    3     1     5    1        17
## 822     2    3     2     3    1        20
## 828     2    4     1     4    2        16
## 830     2    4     2     5    1        20
## 841     3    1     2     3    1        21
## 895     3    1     2     5    1        22
## 940     3    2     2     3    1        21
## 1004    3    3     1     3    2        17
## 1061    3    4     2     4    1        22

#Caluclate Signal-to-Noise Ratio for further analysis.
SNR = (-10)*(log10(1/final.array$final.cty^2))
SNR

##  [1] 22.27887 25.10545 23.52183 28.94316 23.52183 26.84845 22.27887
##  [8] 22.92256 26.02060 24.60898 24.08240 42.14420 21.58362 25.10545
## [15] 24.60898 26.02060 24.08240 26.02060 26.44439 26.84845 26.44439
## [22] 24.60898 26.84845

max(SNR)

## [1] 42.1442

#Determine the row that contains the maximum Signal-to-Noise Ratio value.
SNR_maxindex = which(SNR == max(SNR))
SNR_maxindex

## [1] 12

#Update "final.array" so that it solely contains the row of data with maximum value of Signal-to-Noise Ratio.
final.array[SNR_maxindex,]

##     make year trans drive fuel final.cty
## 782    1    4     1     5    2       128

final.array

##      make year trans drive fuel final.cty
## 1       1    1     1     1    1        13
## 57      1    1     1     5    1        18
## 164     1    1     1     6    2        15
## 172     1    2     1     1    2        28
## 174     1    2     2     3    2        15
## 177     1    2     2     5    1        22
## 237     1    2     2     6    1        13
## 388     1    3     1     3    1        14
## 651     1    3     2     5    2        20
## 661     1    3     2     6    1        17
## 719     1    4     1     2    1        16
## 782     1    4     1     5    2       128
## 787     1    4     1     6    2        12
## 790     2    1     2     5    2        18
## 793     2    3     1     5    1        17
## 822     2    3     2     3    1        20
## 828     2    4     1     4    2        16
## 830     2    4     2     5    1        20
## 841     3    1     2     3    1        21
## 895     3    1     2     5    1        22
## 940     3    2     2     3    1        21
## 1004    3    3     1     3    2        17
## 1061    3    4     2     4    1        22

str(final.array)

## 'data.frame':    23 obs. of  6 variables:
##  $ make     : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "contrasts")= num [1:3, 1:2] -7.07e-01 -7.85e-17 7.07e-01 4.08e-01 -8.16e-01 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr  "1" "2" "3"
##   .. .. ..$ : chr  ".L" ".Q"
##  $ year     : Factor w/ 4 levels "1","2","3","4": 1 1 1 2 2 2 2 3 3 3 ...
##   ..- attr(*, "contrasts")= num [1:4, 1:3] -0.671 -0.224 0.224 0.671 0.5 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr  "1" "2" "3" "4"
##   .. .. ..$ : chr  ".L" ".Q" ".C"
##  $ trans    : Factor w/ 2 levels "1","2": 1 1 1 1 2 2 2 1 2 2 ...
##   ..- attr(*, "contrasts")= num [1:2, 1] -1 1
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr  "1" "2"
##   .. .. ..$ : NULL
##  $ drive    : Factor w/ 6 levels "1","2","3","4",..: 1 5 6 1 3 5 6 3 5 6 ...
##   ..- attr(*, "contrasts")= num [1:6, 1:5] -0.598 -0.359 -0.12 0.12 0.359 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr  "1" "2" "3" "4" ...
##   .. .. ..$ : chr  ".L" ".Q" ".C" "^4" ...
##  $ fuel     : Factor w/ 2 levels "1","2": 1 1 2 2 2 1 1 1 2 1 ...
##   ..- attr(*, "contrasts")= num [1:2, 1] -1 1
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr  "1" "2"
##   .. .. ..$ : NULL
##  $ final.cty: int  13 18 15 28 15 22 13 14 20 17 ...

Based on the results of the determination of the row of the array that contains the maximum value of the Signal-to-Noise Ratio, it appears that the 12th unique combination of factor-levels across the entire experimental design corresponds to the most optimal Signal-to-Noise Ratio value (42.1442, which is found in the 782nd row of “final.array”). The resulting optimal unique factor-level combination suggests that an automatic, Chevrolet-made, front-wheel driving vehicle produced sometime in the 2010s with a “high-grade” fuel type is the most desired vehicle to drive, since it optimizes/maximizes city fuel economy at a value of 128 mpg (on the basis of our analysis’ factor-level categorization). In determining the significance of this result, an analysis of variance (ANOVA) on the final orthogonal array can be performed.

#Using the resulting "final.array", perform ANOVA analysis.
model.orthogarray <- aov(final.array$final.cty~final.array$make+final.array$year+final.array$trans+final.array$drive+final.array$fuel, data = final.array)
anova(model.orthogarray)

## Analysis of Variance Table
## 
## Response: final.array$final.cty
##                   Df Sum Sq Mean Sq F value Pr(>F)
## final.array$make   2  222.1  111.04  0.1750 0.8420
## final.array$year   3 1512.0  504.01  0.7941 0.5246
## final.array$trans  1   27.4   27.43  0.0432 0.8395
## final.array$drive  5 3605.6  721.11  1.1362 0.4021
## final.array$fuel   1  159.6  159.56  0.2514 0.6269
## Residuals         10 6346.6  634.66

model.orthogarray

## Call:
##    aov.default(formula = final.array$final.cty ~ final.array$make + 
##     final.array$year + final.array$trans + final.array$drive + 
##     final.array$fuel, data = final.array)
## 
## Terms:
##                 final.array$make final.array$year final.array$trans
## Sum of Squares           222.074         1512.021            27.434
## Deg. of Freedom                2                3                 1
##                 final.array$drive final.array$fuel Residuals
## Sum of Squares           3605.570          159.563  6346.643
## Deg. of Freedom                 5                1        10
## 
## Residual standard error: 25.19255
## Estimated effects may be unbalanced

Based on the above analysis of variance (ANOVA) for a Taguchi-Design-driven model, it appears that each factor’s main effect in the model is not statistically significant with consideration given to the analysis’ null hypothesis that is being tested, which states that the make, the drive type, the transmission type, the fuel type, and the production year of a vehicle do not have have a statistically significant effect on city fuel economy. Ultimately, with this result, we would fail to reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of city fuel economy cannot be explained by the variation existent in the the make, the drive type, the transmission type, the fuel type, or the production year of a vehicle being considered in this analysis and, as such, is likely caused solely by randomization. [Note: The results of this analysis do not fall in line with those discerned from this experiment’s first analysis of variance (ANOVA). This may have to do with the analysis’ original factor-level categorization methodology.]

Estimation (of Parameters)

In estimating the different parameters of the experiment, summary statistics were determined for the relevant data in the dataset pertaining to city fuel economy in “cars_new” (which include both the average city fuel economy values considered with respect to ‘make’, ‘year’, ‘trans’, ‘drive’, and ‘fuel’ contained within the dataset and the standard deviation of those city fuel economy values), the different vehicle makes contained in “cars_new” (which include both the quantities of vehicle makes classified as being ‘Chevrolet’, ‘Subaru’, and ‘Saab’, respectively, and the standard deviation of those distributed quantities), the different vehicle drive types contained in “cars_new” (which include both the quantities of drive types classified as being ‘2-Wheel Drive’, ‘Rear-Wheel Drive’, ‘Front-Wheel Drive’, ‘4-Wheel or All-Wheel Drive’, ‘All-Wheel Drive’, and ‘4-Wheel Drive’, respectively, and the standard deviation of those distributed quantities), the different vehicle transmission types contained in “cars_new” (which include both the quantities of the categorized transmission types [‘Automatic’ and ‘Manual’] that are found in ‘Chevrolet’, ‘Subaru’, and ‘Saab’ vehicles, respectively, and the standard deviation of those distributed quantities), the different production years of vehicles contained in “cars_new” (which include both the quantities of production years of vehicles classified as corresponding to a decade represented by ‘80s’, ‘90s’, ‘00s’, or ‘10s’, respectively, and the standard deviation of those distributed quantities), and the different vehicle fuel types contained in “cars_new” (which include both the quantities of fuel types classified as being either ‘High-Grade’ or ‘Normal-Grade’, respectively, and the standard deviation of those distributed quantities).

Tables of Parameter Values

#Display summary statistics of cars_new$cty.
summary(cars_new$cty)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00   14.00   16.00   16.88   19.00  128.00

#Display standard deviation of cars_new$cty.
sd(cars_new$cty, na.rm = FALSE)

## [1] 4.319677

#Display summary statistics of cars_new$make.
summary(cars_new$make)

## Chevrolet      Saab    Subaru 
##      3461       424       764

#Display standard deviation of cars_new$make.
sd(cars_new$make, na.rm = FALSE)

## [1] 0.7565553

#Display summary statistics of cars_new$year.
summary(cars_new$year)

##  00s  10s  80s  90s 
## 1394  595 1310 1350

#Display standard deviation of cars_new$year.
sd(cars_new$year, na.rm = FALSE)

## [1] 1.194506

#Display summary statistics of cars_new$trans.
summary(cars_new$trans)

## Automatic    Manual 
##      2941      1708

#Display standard deviation of cars_new$trans.
sd(cars_new$trans, na.rm = FALSE)

## [1] 0.482146

#Display summary statistics of cars_new$drive.
summary(cars_new$drive)

##              2-Wheel Drive              4-Wheel Drive 
##                        115                         68 
## 4-Wheel or All-Wheel Drive            All-Wheel Drive 
##                       1350                        166 
##          Front-Wheel Drive           Rear-Wheel Drive 
##                       1348                       1602

#Display standard deviation of cars_new$drive.
sd(cars_new$drive, na.rm = FALSE)

## [1] 1.377564

#Display summary statistics of cars_new$fuel.
summary(cars_new$fuel)

##   High-Grade Normal-Grade 
##          490         4159

#Display standard deviation of cars_new$fuel.
sd(cars_new$fuel, na.rm = FALSE)

## [1] 0.3070999

Diagnostics/Model Adequacy Checking

In verifying the results of this experiment, it’s important to ensure that the dataset itself meets all of the assumptions that correlate with the design approach that was carried out. In this way, we want to make sure that our dataset exhibits normality. Until we know that our dataset does, in fact, exhibit normality, we cannot yet say with confidence that our results are significant and representative of a properly carried-out modeling approach. In verifying our dataset for normality, we can create a Normal Quantile-Quantile (QQ) Plot for the experiment’s data, create a Normal Quantile-Quantile (QQ) Plot of Residuals for the generated ANOVA model, create a Normal Quantile-Quantile (QQ) Plot of Residuals for the generated Taguchi Design model, and perform a Shapiro-Wilk Test of Normality on the experiment’s data.

#Create a Normal Q-Q Plot for city fuel economy (in "cars_new").
qqnorm(cars_new[,"cty"], main = "Normal Q-Q Plot of City Fuel Economy")
qqline(cars_new[,"cty"])

#Create a Normal Q-Q Plot for city fuel economy (in "final.array").
qqnorm(final.array[,"final.cty"], main = "Normal Q-Q Plot of City Fuel Economy [Taguchi Design]")
qqline(final.array[,"final.cty"])

#Create a Normal Q-Q Plot of the residuals for "vehicles_anova".
qqnorm(residuals(vehicles_anova), main = "Normal Q-Q Plot of Residuals of 'vehicles_anova'")
qqline(residuals(vehicles_anova))

#Create a Normal Q-Q Plot of the residuals for "model.orthogarray".
qqnorm(residuals(model.orthogarray), main = "Normal Q-Q Plot of Residuals of 'model.orthogarray'")
qqline(residuals(model.orthogarray))

Shapiro-Wilk Test of Normality Results

#Perform Shapiro-Wilk Test of Normality on city fuel economy (normality is assummed if p > 0.1).
shapiro.test(cars_new[,"cty"])

## 
##  Shapiro-Wilk normality test
## 
## data:  cars_new[, "cty"]
## W = 0.8163, p-value < 2.2e-16

#Perform Shapiro-Wilk Test of Normality on city fuel economy (normality is assummed if p > 0.1) [generated from Taguchi Design's Final Orthogonal Array].
shapiro.test(final.array[,"final.cty"])

## 
##  Shapiro-Wilk normality test
## 
## data:  final.array[, "final.cty"]
## W = 0.3538, p-value = 4.479e-09

Upon both constructing Normal Q-Q Plots on the data in this analysis, it’s likely that we can readily assume that our data exhibits normality, since all of the constructed Normal Q-Q Plots pertaining to both the original ANOVA and the resulting Taguchi Design did seem to display a trend of data that aligned closely with the Normal Q-Q Line. However, upon performing a Shapiro-Wilk Test of Normality on the data in this analysis (for both the original ANOVA and the resulting Taguchi Design), it appears that we cannot readily assume that our data exhibits normality, since the resulting p-value of the Shapiro-Wilk Test of Normality for both “cars_new[,“cty”]” and “final.array[,“final.cty”]” was found to be < 0.1 (at the values of < 2.2e-16 and 4.338e-09, respectively). Nonetheless, research has shown that analyses of variance aren’t very sensitive to even moderate deviations from normality, as per the simulation studies that have been performed using many differenty non-normal distributions (Glass et al. 1972, Harwell et al. 1992, Lix et al. 1996) [2], [3], [4]. Furthermore, because the dataset in consideration (“cars_new”) is a large subset of the total amount data that was collected by the EPA from 1985-2015, we can argue that our dataset “cars_new” intuitively exhibits normality. Either way, we are likely still safe to carry out our analysis of variance and our Taguchi Design in this case.

In further backing up the confidence that we have with our results, we can generate a “quality of fit” model that plots residual error against each of the fitted models that were developed in our original analysis of variance (ANOVA).

#Create a "Quality of Fit Model" that plots the residuals of "vehicles_anova" against its fitted model (not enough data points exist for "model.orthogarray" to generate this kind of plot).
plot(fitted(vehicles_anova), residuals(vehicles_anova))

Since the resulting plot appears to exhibit a fairly symmetric distribution of residuals around the zero line, the ANOVA model developed in this analysis suggests good fit. Thus, we can confidently rely on both the modeling approach that we carried out and the dataset that we analyzed in justifying the significance of our results.

4. Contingencies to the Experimental Design/Analysis if Modeling Assumptions Failed

If our modeling assumptions of data normality and factor independence failed in our analysis, we can still err on the side of caution by performing the nonparametric Kruskal-Wallis rank sum test to back up our original results (which will help us to decide whether the population distributions are identical without necessarily exhibiting a normal distribution).

Kruskal-Wallis Rank Sum Test Results

#Perform Kruskal-Wallis Rank Sum Test on 'cty' within the "cars_new" dataset for 'make', 'year', 'trans', 'drive', and 'fuel' (identical populations is assummed if p > 0.05).
kruskal.test(cars_new[,"cty"],cars_new$make)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  cars_new[, "cty"] and cars_new$make
## Kruskal-Wallis chi-squared = 952.0894, df = 2, p-value < 2.2e-16

kruskal.test(cars_new[,"cty"],cars_new$year)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  cars_new[, "cty"] and cars_new$year
## Kruskal-Wallis chi-squared = 121.4982, df = 3, p-value < 2.2e-16

kruskal.test(cars_new[,"cty"],cars_new$trans)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  cars_new[, "cty"] and cars_new$trans
## Kruskal-Wallis chi-squared = 231.2584, df = 1, p-value < 2.2e-16

kruskal.test(cars_new[,"cty"],cars_new$drive)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  cars_new[, "cty"] and cars_new$drive
## Kruskal-Wallis chi-squared = 1630.978, df = 5, p-value < 2.2e-16

kruskal.test(cars_new[,"cty"],cars_new$fuel)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  cars_new[, "cty"] and cars_new$fuel
## Kruskal-Wallis chi-squared = 11.2784, df = 1, p-value = 0.0007842

Since the p-values for both of the resulting Kruskal-Wallis rank sum tests that consider the factors ‘make’, ‘year’, ‘trans’, ‘drive’, and ‘fuel’ against the response variable ‘cty’ are less than 0.05, we can assume that the mean values of city fuel economy compared to the different makes, drive types, production years, fuel types, and transmission types of a vehicle, (considered separately) are comparatively nonidentical populations. Therefore, this result suggests that we would reject the null hypothesis of our main experiment, we would lead us to infer that the differences in mean values of city fuel economy for each of the corresponding makes, drive types, production years, and transmission types of a vehicle being considered in this dataset is caused by something other than randomization, leading us to believe that the variation that is observed in the mean values of city fuel economy can be explained by the variation existent in the different different makes, drive types, production years, and transmission types of a vehicle being considered in this analysis (this cannot be said for the factor ‘fuel’, based on this experiment’s complete analysis of variance [ANOVA]). Furthermore, in addition to treating our data in such a way that uses a nonparametric analysis upon any realization that normality cannot be assumed, transformations such as the “Box-Cox Power Transformation” certainly could have been performed on the data to make it approximately normal. However, these transformations would not be necessary for this analysis, since the nonparametric significance results that we generated by using the Kruskal-Wallis rank sums test were suitable in giving us confidence in the results of our analysis.

5. References to the literature

[1] Online Data Source: http://www.fueleconomy.gov/feg/download.shtml

[2] Glass, G.V., P.D. Peckham, and J.R. Sanders. 1972. Consequences of failure to meet assumptions underlying fixed effects analyses of variance and covariance. Rev. Educ. Res. 42: 237-288.

[3] Harwell, M.R., E.N. Rubinstein, W.S. Hayes, and C.C. Olds. 1992. Summarizing Monte Carlo results in methodological research: the one- and two-factor fixed effects ANOVA cases. J. Educ. Stat. 17: 315-339.

[4] Lix, L.M., J.C. Keselman, and H.J. Keselman. 1996. Consequences of assumption violations revisited: A quantitative review of alternatives to the one-way analysis of variance F test. Rev. Educ. Res. 66: 579-619.

6. Appendices

A summary of, or pointer to, the raw data

complete and documented R code

The fuel economy data for 1984-2015 from the US EPA is conveniently packaged for R users. It can be installed from github with devtools::install_github(“hadley/fueleconomy”).

Recipe 10: Taguchi Designs