This is an R Markdown document. Markdown is a simple formatting syntax for authoring web pages (click the MD toolbar button for help on Markdown).
When you click the Knit HTML button a web page will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
as of August 28, 2014, superceding the version of August 24. Always use the most recent version.
In this study that uses fuel economy data collected by the EPA from 1985-2015, a five-factor, multi-level experiment is performed (where ANOVA and Taguchi Design Methods are used) to see if the make, the drive type, the fuel type, the production year, or the transmission of a vehicle (which are the five factors being considered in this analysis) has a statistically significant effect on the city fuel economy (in miles per gallon, ‘mpg’) of that vehicle (which is the response variable in this analysis).
#Install and load the 'fueleconomy' dataset into R.
install.packages("fueleconomy", repos='http://cran.us.r-project.org')
## package 'fueleconomy' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\howelb\AppData\Local\Temp\RtmpszCIHe\downloaded_packages
library("fueleconomy",lib.loc="C:/Program Files/R/R-3.1.1/library")
## Warning: package 'fueleconomy' was built under R version 3.1.2
rm(list=ls())
#Assign a variable, cars, to the complete dataframe, 'vehicles'. Then, determine the "head" and "tail" of the dataset, cars.
cars<-vehicles
head(cars)
## id make model year class
## 1 27550 AM General DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 2 28426 AM General DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 3 27549 AM General FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 4 28425 AM General FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 5 1032 AM General Post Office DJ5 2WD 1985 Special Purpose Vehicle 2WD
## 6 1033 AM General Post Office DJ8 2WD 1985 Special Purpose Vehicle 2WD
## trans drive cyl displ fuel hwy cty
## 1 Automatic 3-spd 2-Wheel Drive 4 2.5 Regular 17 18
## 2 Automatic 3-spd 2-Wheel Drive 4 2.5 Regular 17 18
## 3 Automatic 3-spd 2-Wheel Drive 6 4.2 Regular 13 13
## 4 Automatic 3-spd 2-Wheel Drive 6 4.2 Regular 13 13
## 5 Automatic 3-spd Rear-Wheel Drive 4 2.5 Regular 17 16
## 6 Automatic 3-spd Rear-Wheel Drive 6 4.2 Regular 13 13
tail(cars)
## id make model year class
## 33437 31064 smart fortwo electric drive cabriolet 2011 Two Seaters
## 33438 33305 smart fortwo electric drive convertible 2013 Two Seaters
## 33439 34393 smart fortwo electric drive convertible 2014 Two Seaters
## 33440 31065 smart fortwo electric drive coupe 2011 Two Seaters
## 33441 33306 smart fortwo electric drive coupe 2013 Two Seaters
## 33442 34394 smart fortwo electric drive coupe 2014 Two Seaters
## trans drive cyl displ fuel hwy cty
## 33437 Automatic (A1) Rear-Wheel Drive NA NA Electricity 79 94
## 33438 Automatic (A1) Rear-Wheel Drive NA NA Electricity 93 122
## 33439 Automatic (A1) Rear-Wheel Drive NA NA Electricity 93 122
## 33440 Automatic (A1) Rear-Wheel Drive NA NA Electricity 79 94
## 33441 Automatic (A1) Rear-Wheel Drive NA NA Electricity 93 122
## 33442 Automatic (A1) Rear-Wheel Drive NA NA Electricity 93 122
This analysis includes five individual factors that each have multiple levels, including “make”, “drive”, “trans”, fuel“, and”year“. For the factor”make“, the levels ‘Chevrolet’, ‘Subaru’, and ‘Saab’ will be considered (these vehicle makes were selected randomly from the list of”makes" discovered upon performing descriptive statistics of the dataset, “x”). For the factor “drive”, the levels ‘2-Wheel Drive’, ‘Rear-Wheel Drive’, ‘Front-Wheel Drive’, ‘4-Wheel or All-Wheel Drive’, ‘All-Wheel Drive’, and ‘4-Wheel Drive’ will be considered. For the factor “trans”, the levels referring to the different transmission types that are found in ‘Chevrolet’, ‘Subaru’, and ‘Saab’ vehicles will be considered and broken down into two different categories, “Automatic” and “Manual”. For the factor “fuel”, the levels ‘CNG’, ‘Diesel’, ‘Electricity’, ‘Gasoline or E85’, ‘Gasoline or natural gas’, ‘Premium’, ‘Premium Gas or Electricity’, and ‘Regular’ will be considered and divided into two individual categories (“Normal-Grade” and “High-Grade”). For the factor “year”, the individual years included in this dataset will be split into 4 distinct decades: the 1980s, the 1990s, the 2000s, and the 2010s (“80s”, “90s”, “00s”, and “10s”, respectively).
#Display the summary statistics of "cars".
summary(cars)
## id make model year
## Min. : 1 Length:33442 Length:33442 Min. :1984
## 1st Qu.: 8361 Class :character Class :character 1st Qu.:1991
## Median :16724 Mode :character Mode :character Median :1999
## Mean :17038 Mean :1999
## 3rd Qu.:25265 3rd Qu.:2008
## Max. :34932 Max. :2015
##
## class trans drive cyl
## Length:33442 Length:33442 Length:33442 Min. : 2.000
## Class :character Class :character Class :character 1st Qu.: 4.000
## Mode :character Mode :character Mode :character Median : 6.000
## Mean : 5.772
## 3rd Qu.: 6.000
## Max. :16.000
## NA's :58
## displ fuel hwy cty
## Min. :0.000 Length:33442 Min. : 9.00 Min. : 6.00
## 1st Qu.:2.300 Class :character 1st Qu.: 19.00 1st Qu.: 15.00
## Median :3.000 Mode :character Median : 23.00 Median : 17.00
## Mean :3.353 Mean : 23.55 Mean : 17.49
## 3rd Qu.:4.300 3rd Qu.: 27.00 3rd Qu.: 20.00
## Max. :8.400 Max. :109.00 Max. :138.00
## NA's :57
#Display the names found in "cars".
names(cars)
## [1] "id" "make" "model" "year" "class" "trans" "drive" "cyl"
## [9] "displ" "fuel" "hwy" "cty"
#Display the structure of "cars".
str(cars)
## Classes 'tbl_df', 'tbl' and 'data.frame': 33442 obs. of 12 variables:
## $ id : int 27550 28426 27549 28425 1032 1033 3347 13309 13310 13311 ...
## $ make : chr "AM General" "AM General" "AM General" "AM General" ...
## $ model: chr "DJ Po Vehicle 2WD" "DJ Po Vehicle 2WD" "FJ8c Post Office" "FJ8c Post Office" ...
## $ year : int 1984 1984 1984 1984 1985 1985 1987 1997 1997 1997 ...
## $ class: chr "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" ...
## $ trans: chr "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" ...
## $ drive: chr "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" ...
## $ cyl : int 4 4 6 6 4 6 6 4 4 6 ...
## $ displ: num 2.5 2.5 4.2 4.2 2.5 4.2 3.8 2.2 2.2 3 ...
## $ fuel : chr "Regular" "Regular" "Regular" "Regular" ...
## $ hwy : int 17 17 13 13 17 13 21 26 28 26 ...
## $ cty : int 18 18 13 13 16 13 14 20 22 18 ...
#Create subset of "cars" that contains all vehicles of the make 'Chevrolet', 'Subaru', and 'Saab'.
cars_new <- subset(cars, cars$make =="Chevrolet"|cars$make=="Subaru"|cars$make=="Saab", select = c(id,make,year,trans,drive,fuel,cty))
#Display the head and the tail of "cars_new".
head(cars_new)
## id make year trans drive fuel cty
## 3706 1567 Chevrolet 1985 Manual 4-spd Rear-Wheel Drive Regular 19
## 3707 1568 Chevrolet 1985 Automatic 4-spd Rear-Wheel Drive Regular 15
## 3708 1569 Chevrolet 1985 Manual 4-spd Rear-Wheel Drive Regular 15
## 3709 1570 Chevrolet 1985 Manual 5-spd Rear-Wheel Drive Regular 15
## 3710 923 Chevrolet 1985 Automatic 4-spd Rear-Wheel Drive Regular 18
## 3711 924 Chevrolet 1985 Manual 4-spd Rear-Wheel Drive Regular 19
tail(cars_new)
## id make year trans drive
## 29475 5279 Subaru 1989 Manual 5-spd Front-Wheel Drive
## 29476 32640 Subaru 2013 Manual 5-spd All-Wheel Drive
## 29477 32641 Subaru 2013 Automatic (variable gear ratios) All-Wheel Drive
## 29478 34233 Subaru 2014 Manual 5-spd All-Wheel Drive
## 29479 34234 Subaru 2014 Automatic (variable gear ratios) All-Wheel Drive
## 29480 34547 Subaru 2014 Automatic (variable gear ratios) All-Wheel Drive
## fuel cty
## 29475 Regular 21
## 29476 Regular 23
## 29477 Regular 25
## 29478 Regular 23
## 29479 Regular 25
## 29480 Regular 29
#Display the summary statistics of "cars_new".
summary(cars_new)
## id make year trans
## Min. : 3 Length:4649 Min. :1984 Length:4649
## 1st Qu.: 6653 Class :character 1st Qu.:1989 Class :character
## Median :14714 Mode :character Median :1996 Mode :character
## Mean :15578 Mean :1997
## 3rd Qu.:23634 3rd Qu.:2006
## Max. :34892 Max. :2015
## drive fuel cty
## Length:4649 Length:4649 Min. : 8.00
## Class :character Class :character 1st Qu.: 14.00
## Mode :character Mode :character Median : 16.00
## Mean : 16.88
## 3rd Qu.: 19.00
## Max. :128.00
#Display the structure of "cars_new".
str(cars_new)
## Classes 'tbl_df', 'tbl' and 'data.frame': 4649 obs. of 7 variables:
## $ id : int 1567 1568 1569 1570 923 924 925 926 927 928 ...
## $ make : chr "Chevrolet" "Chevrolet" "Chevrolet" "Chevrolet" ...
## $ year : int 1985 1985 1985 1985 1985 1985 1985 1985 1985 1985 ...
## $ trans: chr "Manual 4-spd" "Automatic 4-spd" "Manual 4-spd" "Manual 5-spd" ...
## $ drive: chr "Rear-Wheel Drive" "Rear-Wheel Drive" "Rear-Wheel Drive" "Rear-Wheel Drive" ...
## $ fuel : chr "Regular" "Regular" "Regular" "Regular" ...
## $ cty : int 19 15 15 15 18 19 19 16 15 16 ...
#Categorize 'make' as a factor and display its resulting levels.
cars_new$make = as.factor(cars_new$make)
levels(cars_new$make)
## [1] "Chevrolet" "Saab" "Subaru"
#Categorize 'trans' as a factor and display its resulting levels.
cars_new$trans[cars_new$trans == "Auto (AV)"] = "Automatic"
cars_new$trans[cars_new$trans == "Auto(AV-S6)"] = "Automatic"
cars_new$trans[cars_new$trans == "Auto(AV-S8)"] = "Automatic"
cars_new$trans[cars_new$trans == "Automatic (A1)"] = "Automatic"
cars_new$trans[cars_new$trans == "Automatic (S4)"] = "Automatic"
cars_new$trans[cars_new$trans == "Automatic (S5)"] = "Automatic"
cars_new$trans[cars_new$trans == "Automatic (S6)"] = "Automatic"
cars_new$trans[cars_new$trans == "Automatic (variable gear ratios)"] = "Automatic"
cars_new$trans[cars_new$trans == "Automatic 3-spd"] = "Automatic"
cars_new$trans[cars_new$trans == "Automatic 4-spd"] = "Automatic"
cars_new$trans[cars_new$trans == "Automatic 5-spd"] = "Automatic"
cars_new$trans[cars_new$trans == "Automatic 6-spd"] = "Automatic"
cars_new$trans[cars_new$trans == "Manual 3-spd"] = "Manual"
cars_new$trans[cars_new$trans == "Manual 4-spd"] = "Manual"
cars_new$trans[cars_new$trans == "Manual 5-spd"] = "Manual"
cars_new$trans[cars_new$trans == "Manual 6-spd"] = "Manual"
cars_new$trans[cars_new$trans == "Manual 7-spd"] = "Manual"
cars_new$trans = as.factor(cars_new$trans)
levels(cars_new$trans)
## [1] "Automatic" "Manual"
#Categorize 'drive' as a factor and display its resulting levels.
cars_new$drive = as.factor(cars_new$drive)
levels(cars_new$drive)
## [1] "2-Wheel Drive" "4-Wheel Drive"
## [3] "4-Wheel or All-Wheel Drive" "All-Wheel Drive"
## [5] "Front-Wheel Drive" "Rear-Wheel Drive"
#Categorize 'fuel' as a factor and display its resulting levels.
cars_new$fuel[cars_new$fuel == "CNG"] = "Normal-Grade"
cars_new$fuel[cars_new$fuel == "Diesel"] = "Normal-Grade"
cars_new$fuel[cars_new$fuel == "Electricity"] = "High-Grade"
cars_new$fuel[cars_new$fuel == "Gasoline or E85"] = "Normal-Grade"
cars_new$fuel[cars_new$fuel == "Gasoline or natural gas"] = "Normal-Grade"
cars_new$fuel[cars_new$fuel == "Premium"] = "High-Grade"
cars_new$fuel[cars_new$fuel == "Premium Gas or Electricity"] = "High-Grade"
cars_new$fuel[cars_new$fuel == "Regular"] = "Normal-Grade"
cars_new$fuel = as.factor(cars_new$fuel)
levels(cars_new$fuel)
## [1] "High-Grade" "Normal-Grade"
#Transform 'year' into a categorical variable (80s, 90s, 00s, and 10s), categorize them as factors, and display their resulting levels.
cars_new$year[cars_new$year > 1979 & cars_new$year <= 1989] = "80s"
cars_new$year[cars_new$year > 1989 & cars_new$year <= 1999] = "90s"
cars_new$year[cars_new$year > 1999 & cars_new$year <= 2009] = "00s"
cars_new$year[cars_new$year > 2009 & cars_new$year <= 2015] = "10s"
cars_new$year = as.factor(cars_new$year)
levels(cars_new$year)
## [1] "00s" "10s" "80s" "90s"
In this dataset, only one variable is noted as being a continuous variable (‘displ’, which refers to the engine displacement, in litres, has a numeric data type). While it could be argued that the highway and city fuel economies [in mpg] (denoted, respectively, as ‘hwy’ and ‘cty’) are generally considered to be continuous variables, this dataset notes each of their data types as being integer values. The five different factors being considered in this analysis, which include “make”, “drive”, “trans”, “fuel”, and “year”, are all going to be treated as categorical variables in this analysis.
This analysis will consider one response variable, ‘cty’, which denotes city fuel economy in this analysis.
This dataset includes fuel economy data from the EPA during the years 1985-2015. This dataset contains selected variables, and removes vehicles with incomplete data (e.g. no drive train data). It contains 33,442 observations of 12 variables, which include “id”, “make”, “model”, “year”, “class”, “trans”, “drive”, “cyl”, “displ”, “fuel”, “hwy”, and “cty”. The variables that are made up of integer values include “id” (which is a unique EPA identifier), “year” (which refers to the model year), “cyl” (which to the number of cylinders), “hwy” (which refers to the highway fuel economy, in mpg), and “cty” (which refers to the city fuel economy, in mpg). The variables that are made up of character values include “make” (which refers to the manufacturer), “model” (which refers to the model name), “class” (which refers to the EPA vehicle size class), “trans” (which refers to the transmission), “drive” (which refers to the drive train), and “fuel” (which refers to the fuel type). Lastly, the variable “displ” is a numeric variable, and it refers to the engine displacement, in litres.
According to the official U.S. government source for fuel economy information, this fuel economy data “are the result of vehicle testing done at the Environmental Protection Agency’s National Vehicle and Fuel Emissions Laboratory in Ann Arbor, Michigan, and by vehicle manufacturers with oversight by EPA.” [1] Therefore, we can make the assumption that this dataset exhibits randomization.
In this experiment, we are trying to determine whether or not the variation that is observed in the response variable (which corresponds to ‘cty’ in this analysis) can be explained by the variation existent in the five different treatments of the experiment (which correspond to ‘make’, ‘drive’, ‘trans’, ‘fuel’, and ‘year’). Therefore, the null hypothesis that is being tested states that the make, the drive type, the transmission type, the fuel type, and the production year of a vehicle do not have have a statistically significant effect on city fuel economy. Alternately, the alternative hypothesis that is being tested states that the make, the drive type, the transmission type, the fuel type, and the production year of a vehicle do have have a statistically significant effect on city fuel economy. In carrying out this analysis, an analysis of variance (ANOVA) and Taguchi Methods are used in consideration of the response variable city fuel economy (‘cty’) to see if there is a significant difference in the means for this response variable when considering the make, the drive type, the transmission type, the fuel type, and the production year of a vehicle, which are contained in this dataset.
The rationale for this design lies primarily in the fact that we’re trying to determine if the make, the drive type, the transmission type, the fuel type, and the production year of a vehicle have any effect on city fuel economy. So, since city fuel economy (in mpg) is a useful and relevant metric to consider when determining how well vehicles perform, this design of a five-factor, multi-level experiment (as it corresponds to an analysis of variance and the use of Taguchi Methods) was crafted to see if the make, the drive type, the transmission type, the fuel type, and the production year of a vehicle have a statistically significant effect on city fuel economy. By performing this analysis, we can hope to receive some insight regarding the different factors that come into play that typically result in high values of city fuel economy.
Since our original assumption claimed that the entire vehicle dataset exhibits randomization, and since our three specific “makes” of consideration were selected at random from the list of all available “makes” in the dataset (and include every vehicle present in the dataset that corresponds to either of the three selected “makes”), our randomization scheme stems from the existence of randomization in the original dataset collected by the EPA, which allows us to ensure that a completely randomized block design can be created here.
In this experiment, there are no replicates or repeated measures present.
By designing this experiment in such a way that allows for multiple levels to be considered (so as to reach the most accurate conclusions possible), blocking was not considered in this five-factor, multi-level design.
In beginning to display this data graphically, summary statistics were gathered for the dataset being considered here, “cars_new”. Additionally, histograms and boxplots were created to represent the different observations of city fuel economy.
#Display the summary statistics of "cars_new".
summary(cars_new)
## id make year trans
## Min. : 3 Chevrolet:3461 00s:1394 Automatic:2941
## 1st Qu.: 6653 Saab : 424 10s: 595 Manual :1708
## Median :14714 Subaru : 764 80s:1310
## Mean :15578 90s:1350
## 3rd Qu.:23634
## Max. :34892
## drive fuel cty
## 2-Wheel Drive : 115 High-Grade : 490 Min. : 8.00
## 4-Wheel Drive : 68 Normal-Grade:4159 1st Qu.: 14.00
## 4-Wheel or All-Wheel Drive:1350 Median : 16.00
## All-Wheel Drive : 166 Mean : 16.88
## Front-Wheel Drive :1348 3rd Qu.: 19.00
## Rear-Wheel Drive :1602 Max. :128.00
#Display the names found in "cars_new".
names(cars_new)
## [1] "id" "make" "year" "trans" "drive" "fuel" "cty"
#Display the structure of "cars_new".
str(cars_new)
## Classes 'tbl_df', 'tbl' and 'data.frame': 4649 obs. of 7 variables:
## $ id : int 1567 1568 1569 1570 923 924 925 926 927 928 ...
## $ make : Factor w/ 3 levels "Chevrolet","Saab",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : Factor w/ 4 levels "00s","10s","80s",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ trans: Factor w/ 2 levels "Automatic","Manual": 2 1 2 2 1 2 2 1 2 2 ...
## $ drive: Factor w/ 6 levels "2-Wheel Drive",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ fuel : Factor w/ 2 levels "High-Grade","Normal-Grade": 2 2 2 2 2 2 2 2 2 2 ...
## $ cty : int 19 15 15 15 18 19 19 16 15 16 ...
#Display the head and tail of "cars_new".
head(cars_new)
## id make year trans drive fuel cty
## 3706 1567 Chevrolet 80s Manual Rear-Wheel Drive Normal-Grade 19
## 3707 1568 Chevrolet 80s Automatic Rear-Wheel Drive Normal-Grade 15
## 3708 1569 Chevrolet 80s Manual Rear-Wheel Drive Normal-Grade 15
## 3709 1570 Chevrolet 80s Manual Rear-Wheel Drive Normal-Grade 15
## 3710 923 Chevrolet 80s Automatic Rear-Wheel Drive Normal-Grade 18
## 3711 924 Chevrolet 80s Manual Rear-Wheel Drive Normal-Grade 19
tail(cars_new)
## id make year trans drive fuel cty
## 29475 5279 Subaru 80s Manual Front-Wheel Drive Normal-Grade 21
## 29476 32640 Subaru 10s Manual All-Wheel Drive Normal-Grade 23
## 29477 32641 Subaru 10s Automatic All-Wheel Drive Normal-Grade 25
## 29478 34233 Subaru 10s Manual All-Wheel Drive Normal-Grade 23
## 29479 34234 Subaru 10s Automatic All-Wheel Drive Normal-Grade 25
## 29480 34547 Subaru 10s Automatic All-Wheel Drive Normal-Grade 29
#Display the levels of 'make', 'drive', 'trans', 'year', and 'fuel' within "cars_new".
levels(cars_new$make)
## [1] "Chevrolet" "Saab" "Subaru"
levels(cars_new$drive)
## [1] "2-Wheel Drive" "4-Wheel Drive"
## [3] "4-Wheel or All-Wheel Drive" "All-Wheel Drive"
## [5] "Front-Wheel Drive" "Rear-Wheel Drive"
levels(cars_new$trans)
## [1] "Automatic" "Manual"
levels(cars_new$year)
## [1] "00s" "10s" "80s" "90s"
levels(cars_new$fuel)
## [1] "High-Grade" "Normal-Grade"
#Create a histogram of city fuel economy ('cty').
par(mfrow=c(1,1))
hist(cars_new$cty, xlim=c(8,60), main = "City Fuel Economy (in mpg)")
#Create a boxplot of 'cty' mileage for all vehicle makes.
par(mfrow=c(1,1))
boxplot(cty~make,cars_new, ylim = c(8,60), main = "City Fuel Economy", ylab = "MPG", xlab = "Vehicle Make")
#Create a boxplot of 'cty' mileage for all vehicle drive types.
par(mfrow=c(1,1))
boxplot(cty~drive,cars_new, ylim = c(8,60), main = "City Fuel Economy", ylab = "MPG", xlab = "Vehicle Drive Type")
#Create a boxplot of 'cty' mileage for all vehicle transmission types.
par(mfrow=c(1,1))
boxplot(cty~trans,cars_new, ylim = c(8,60), main = "City Fuel Economy", ylab = "MPG", xlab = "Vehicle Transmission Type")
#Create a boxplot of 'cty' mileage for all vehicle fuel types.
par(mfrow=c(1,1))
boxplot(cty~fuel,cars_new, ylim = c(8,60), main = "City Fuel Economy", ylab = "MPG", xlab = "Vehicle Fuel Type")
#Create a boxplot of 'cty' mileage for all vehicle years.
par(mfrow=c(1,1))
boxplot(cty~year,cars_new, ylim = c(8,60), main = "City Fuel Economy", ylab = "MPG", xlab = "Year")
In order to determine if the variation that is observed in the response variable (which corresponds to ‘cty’ in this analysis) can be explained by the variation existent in the treatments of the experiment (which correspond to ‘make’, ‘drive’, ‘trans’, ‘fuel’, and ‘year’ [and the way in which their categories were determined]), an analysis of variance (ANOVA) is performed as a means for analyzing the differences in city fuel economy values for each of the different drive types, vehicle makes, transmission types, fuel types, and production years being considered in this dataset.
For the ANOVA model that is designed in this experiment, the null hypothesis that is being tested (which we will either reject or fail to reject by the end of our analysis) states that the make, the drive type, the transmission type, the fuel type, and the production year of a vehicle do not have have a statistically significant effect on city fuel economy, implying that the differences in the mean values of city fuel economy were solely the result of randomization in this experiment. In other words, if we reject the null hypothesis, we would infer that the differences in mean values of city fuel economy for each of the corresponding drive types, transmission types, fuel types, production years, and vehicle makes being considered in this dataset is caused by something other than randomization, leading us to believe that the variation that is observed in the mean values of city fuel economy can be explained by the variation existent in the different drive types, transmission types, fuel types, production years, and vehicle makes being considered in this analysis. Alternately, if we fail to reject the null hypothesis, we would infer that the variation that is observed in the mean values of city fuel economy cannot be explained by the variation existent in the different drive types, transmission types, fuel types, production years, and vehicle makes being considered in this analysis and, as such, is likely caused by randomization.
#Perform an analysis of variance (ANOVA) for the different mean values observed for city fuel economy, given the factor 'make', 'year', 'trans', 'drive', and 'fuel'.
vehicles_anova <- aov(cty~make+year+trans+drive+fuel, data = cars_new)
anova(vehicles_anova)
## Analysis of Variance Table
##
## Response: cty
## Df Sum Sq Mean Sq F value Pr(>F)
## make 2 8620 4310.1 404.567 <2e-16 ***
## year 3 2834 944.7 88.675 <2e-16 ***
## trans 1 2337 2336.9 219.357 <2e-16 ***
## drive 5 23537 4707.4 441.861 <2e-16 ***
## fuel 1 12 12.1 1.139 0.2859
## Residuals 4636 49390 10.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For the analysis of variance (ANOVA) that is performed where the factor ‘make’ is analyzed against the response variable ‘cty’, a p-value < 2.2e-16 is returned, indicating that there is roughly a probability of < 2.2e-16 that the resulting associated F-value (404.567) is the result of solely randomization. Therefore, based on this result, we would reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of city fuel economy can be explained by the variation existent in the different vehicle makes being considered in this analysis and, as such, is likely not caused solely by randomization. (See above results for p-value and F-value.)
For the analysis of variance (ANOVA) that is performed where ‘year’ is analyzed against the response variable ‘cty’, a p-value < 2.2e-16 is returned, indicating that there is roughly a probability of < 2.2e-16 that the resulting associated F-value (88.675) is the result of solely randomization. Therefore, based on this result, we would reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of city fuel economy can be explained by the variation existent in the different production years being considered in this analysis and, as such, is likely not caused solely by randomization. (See above results for p-value and F-value.)
For the analysis of variance (ANOVA) that is performed where ‘trans’ is analyzed against the response variable ‘cty’, a p-value < 2.2e-16 is returned, indicating that there is roughly a probability of < 2.2e-16 that the resulting associated F-value (219.357) is the result of solely randomization. Therefore, based on this result, we would reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of city fuel economy can be explained by the variation existent in the different transmission types being considered in this analysis and, as such, is likely not caused solely by randomization. (See above results for p-value and F-value.)
For the analysis of variance (ANOVA) that is performed where ‘drive’ is analyzed against the response variable ‘cty’, a p-value < 2.2e-16 is returned, indicating that there is roughly a probability of < 2.2e-16 that the resulting associated F-value (441.861) is the result of solely randomization. Therefore, based on this result, we would reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of city fuel economy can be explained by the variation existent in the different drive types being considered in this analysis and, as such, is likely not caused solely by randomization. (See above results for p-value and F-value.)
For the analysis of variance (ANOVA) that is performed where ‘fuel’ is analyzed against the response variable ‘cty’, a p-value = 0.2859 is returned, indicating that there is roughly a probability of 0.2859 that the resulting associated F-value (41.139) is the result of solely randomization. Therefore, based on this result, we would fail to reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of city fuel economy cannot be explained by the variation existent in the different fuel types being considered in this analysis and, as such, is likely caused solely by randomization. (See above results for p-value and F-value.) [Note: With consideration being made for the practical use of Taguchi Methods in this experiment, the factor “fuel type” was reduced from having eight distinct levels to having two distinct levels. This type of categorization may have resulted in this ANOVA analysis’ determined insignificance, which is likely not completely accurate in reality.]
In further carrying out this analysis, we can compute Tukey Honest Significant Differences (via “TukeyHSD()”) as a means for determining the specific levels of each factor existent in this analysis that are truly independent from each other and that significantly affect the response variable, ‘cty’.
#Perform a TukeyHSD Test for "vehicles_anova" [cars_new$make].
Tukey_make = TukeyHSD(vehicles_anova, which = "make", ordered = FALSE, conf.level = 0.95)
Tukey_make
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = cty ~ make + year + trans + drive + fuel, data = cars_new)
##
## $make
## diff lwr upr p adj
## Saab-Chevrolet 1.297767 0.9040352 1.691498 0
## Subaru-Chevrolet 3.677916 3.3720336 3.983798 0
## Subaru-Saab 2.380149 1.9167374 2.843561 0
par(mfrow=c(1,1))
plot(Tukey_make)
#Perform a TukeyHSD Test for "vehicles_anova" [cars_new$year].
Tukey_year = TukeyHSD(vehicles_anova, which = "year", ordered = FALSE, conf.level = 0.95)
Tukey_year
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = cty ~ make + year + trans + drive + fuel, data = cars_new)
##
## $year
## diff lwr upr p adj
## 10s-00s 2.2918450 1.8810728 2.7026172 0.0000000
## 80s-00s 0.6368807 0.3140983 0.9596630 0.0000025
## 90s-00s -0.1443446 -0.4646523 0.1759630 0.6533456
## 80s-10s -1.6549643 -2.0696575 -1.2402712 0.0000000
## 90s-10s -2.4361896 -2.8489594 -2.0234198 0.0000000
## 90s-80s -0.7812253 -1.1065461 -0.4559045 0.0000000
par(mfrow=c(1,1))
plot(Tukey_year)
#Perform a TukeyHSD Test for "vehicles_anova" [cars_new$trans].
Tukey_trans = TukeyHSD(vehicles_anova, which = "trans", ordered = FALSE, conf.level = 0.95)
Tukey_trans
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = cty ~ make + year + trans + drive + fuel, data = cars_new)
##
## $trans
## diff lwr upr p adj
## Manual-Automatic 1.428982 1.234313 1.623651 0
par(mfrow=c(1,1))
plot(Tukey_trans)
#Perform a TukeyHSD Test for "vehicles_anova" [cars_new$drive].
Tukey_drive = TukeyHSD(vehicles_anova, which = "drive", ordered = FALSE, conf.level = 0.95)
Tukey_drive
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = cty ~ make + year + trans + drive + fuel, data = cars_new)
##
## $drive
## diff lwr
## 4-Wheel Drive-2-Wheel Drive -1.87863015 -3.3021080
## 4-Wheel or All-Wheel Drive-2-Wheel Drive -1.07751231 -1.9814355
## All-Wheel Drive-2-Wheel Drive -1.14130396 -2.2702644
## Front-Wheel Drive-2-Wheel Drive 3.54981158 2.6458357
## Rear-Wheel Drive-2-Wheel Drive -0.85632309 -1.7546481
## 4-Wheel or All-Wheel Drive-4-Wheel Drive 0.80111785 -0.3553813
## All-Wheel Drive-4-Wheel Drive 0.73732620 -0.6024367
## Front-Wheel Drive-4-Wheel Drive 5.42844174 4.2719015
## Rear-Wheel Drive-4-Wheel Drive 1.02230706 -0.1298218
## All-Wheel Drive-4-Wheel or All-Wheel Drive -0.06379165 -0.8291366
## Front-Wheel Drive-4-Wheel or All-Wheel Drive 4.62732389 4.2690314
## Rear-Wheel Drive-4-Wheel or All-Wheel Drive 0.22118922 -0.1225971
## Front-Wheel Drive-All-Wheel Drive 4.69111554 3.9257085
## Rear-Wheel Drive-All-Wheel Drive 0.28498087 -0.4737441
## Rear-Wheel Drive-Front-Wheel Drive -4.40613467 -4.7500594
## upr p adj
## 4-Wheel Drive-2-Wheel Drive -0.45515235 0.0023506
## 4-Wheel or All-Wheel Drive-2-Wheel Drive -0.17358909 0.0089382
## All-Wheel Drive-2-Wheel Drive -0.01234351 0.0457721
## Front-Wheel Drive-2-Wheel Drive 4.45378744 0.0000000
## Rear-Wheel Drive-2-Wheel Drive 0.04200195 0.0719630
## 4-Wheel or All-Wheel Drive-4-Wheel Drive 1.95761697 0.3569774
## All-Wheel Drive-4-Wheel Drive 2.07708914 0.6191827
## Front-Wheel Drive-4-Wheel Drive 6.58498200 0.0000000
## Rear-Wheel Drive-4-Wheel Drive 2.17443592 0.1157128
## All-Wheel Drive-4-Wheel or All-Wheel Drive 0.70155327 0.9998971
## Front-Wheel Drive-4-Wheel or All-Wheel Drive 4.98561636 0.0000000
## Rear-Wheel Drive-4-Wheel or All-Wheel Drive 0.56497552 0.4436496
## Front-Wheel Drive-All-Wheel Drive 5.45652263 0.0000000
## Rear-Wheel Drive-All-Wheel Drive 1.04370581 0.8930407
## Rear-Wheel Drive-Front-Wheel Drive -4.06221000 0.0000000
par(mfrow=c(1,1))
plot(Tukey_drive)
#Perform a TukeyHSD Test for "vehicles_anova" [cars_new$fuel].
Tukey_fuel = TukeyHSD(vehicles_anova, which = "fuel", ordered = FALSE, conf.level = 0.95)
Tukey_fuel
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = cty ~ make + year + trans + drive + fuel, data = cars_new)
##
## $fuel
## diff lwr upr p adj
## Normal-Grade-High-Grade 0.1539237 -0.1517058 0.4595532 0.323522
par(mfrow=c(1,1))
plot(Tukey_fuel)
After observing the results of these Tukey Honest Significant Differences for “vehicles_anova” [cars_new$make], it’s seemingly clear that each of the different level-comparisons suggest a significant difference in means among all of the different level-pairs, since their respective p-values are all less than 0.05 (see output of “Tukey_make”). This further suggests the significance of the results that were gathered in the above ANOVA model “vehicles_anova”, since the factor “make” was determined to be significant.
After observing the results of these Tukey Honest Significant Differences for “vehicles_anova” [cars_new$year], it’s seemingly clear that each of the different level-comparisons suggest a significant difference in means among a great majority of the different level-pairs (excluding “90s-00s” since, unlike the other level-pairs, their respective p-values are greater than 0.05) (see output of “Tukey_year”). This further suggests the significance of the results that were gathered in the above ANOVA model “vehicles_anova”, since the factor “year” was determined to be significant.
After observing the results of these Tukey Honest Significant Differences for “vehicles_anova” [cars_new$trans], it’s seemingly clear that the level comparison of “Manual” and “Automatic” suggests a significant difference in means [for the level-pair “Manual-Automatic”], since its respective p-value is less than 0.05 (see output of “Tukey_trans”). Therefore, this test suggests the significance of the results that were gathered in the above ANOVA model “vehicles_anova”, since the factor “trans” was determined to be significant.
After observing the results of these Tukey Honest Significant Differences for “vehicles_anova” [cars_new$drive], it’s seemingly clear that approximately more than half of the different level-comparisons suggest a significant difference in means of the different level-pairs, since their respective p-values are less than 0.05 (see output of “Tukey_drive”). Therefore, this test suggests the significance of the results that were gathered in the above ANOVA model “vehicles_anova”, since the factor “drive” was determined to be significant.
After observing the results of these Tukey Honest Significant Differences for “vehicles_anova” [cars_new$fuel], it’s seemingly clear that the level comparison of “Normal-Grade” and “High-Grade” suggests a non-significant difference in means [for the level-pair “Normal-Grade-High-Grade”], since its respective p-value is greater than 0.05 (see output of “Tukey_fuel”). Therefore, this test suggests the significance of the results that were gathered in the above ANOVA model “vehicles_anova”, since the factor “fuel” was determined to be insignificant.
In order to determine if the variation that is observed in the response variable (which corresponds to ‘cty’ in this analysis) can be explained by the variation existent in the treatments of the experiment (which correspond to ‘make’, ‘drive’, ‘trans’, ‘fuel’, and ‘year’ [and the way in which their categories were determined]), Taguchi Design Methods via orthogonal array creation can be used as a means for analyzing the differences in city fuel economy values for each of the different drive types, vehicle makes, transmission types, fuel types, and production years being considered in this dataset and determining the most optimal factor-level combination of vehicle to select that maximizes city fuel economy.
#Create a new dataset to be used for the Taguchi Design Methods analysis.
vehicles_taguchi <- cars_new[,]
#Convert the factor 'make' into a factor variable designated by its levels "Chevrolet", "Saab" and "Subaru".
vehicles_taguchi$make <- as.character(vehicles_taguchi$make)
vehicles_taguchi$make[vehicles_taguchi$make == "Chevrolet"] <- 1
vehicles_taguchi$make[vehicles_taguchi$make == "Saab"] <- 2
vehicles_taguchi$make[vehicles_taguchi$make == "Subaru"] <- 3
#Categorize 'make' as a factor variable and display its resulting structure.
vehicles_taguchi$make = as.factor(vehicles_taguchi$make)
str(vehicles_taguchi$make)
## Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
#Convert the factor 'year' into a factor variable designated by its levels "80s", "90s", "00s", and "10s".
vehicles_taguchi$year <- as.character(vehicles_taguchi$year)
vehicles_taguchi$year[vehicles_taguchi$year == "80s"] <- 1
vehicles_taguchi$year[vehicles_taguchi$year == "90s"] <- 2
vehicles_taguchi$year[vehicles_taguchi$year == "00s"] <- 3
vehicles_taguchi$year[vehicles_taguchi$year == "10s"] <- 4
#Categorize 'year' as a factor variable and display its resulting structure.
vehicles_taguchi$year = as.factor(vehicles_taguchi$year)
str(vehicles_taguchi$year)
## Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
#Convert the factor 'trans' into a factor variable designated by its levels "Automatic" and "Manual".
vehicles_taguchi$trans <- as.character(vehicles_taguchi$trans)
vehicles_taguchi$trans[vehicles_taguchi$trans == "Automatic"] <- 1
vehicles_taguchi$trans[vehicles_taguchi$trans == "Manual"] <- 2
#Categorize 'trans' as a factor variable and display its resulting structure.
vehicles_taguchi$trans = as.factor(vehicles_taguchi$trans)
str(vehicles_taguchi$trans)
## Factor w/ 2 levels "1","2": 2 1 2 2 1 2 2 1 2 2 ...
#Convert the factor 'drive' into a factor variable designated by its levels "2-Wheel Drive", "4-Wheel Drive", "4-Wheel or All-Wheel Drive", "All-Wheel Drive", "Front-Wheel Drive", and "Rear-Wheel Drive".
vehicles_taguchi$drive <- as.character(vehicles_taguchi$drive)
vehicles_taguchi$drive[vehicles_taguchi$drive == "2-Wheel Drive"] <- 1
vehicles_taguchi$drive[vehicles_taguchi$drive == "4-Wheel Drive"] <- 2
vehicles_taguchi$drive[vehicles_taguchi$drive == "4-Wheel or All-Wheel Drive"] <- 3
vehicles_taguchi$drive[vehicles_taguchi$drive == "All-Wheel Drive"] <- 4
vehicles_taguchi$drive[vehicles_taguchi$drive == "Front-Wheel Drive"] <- 5
vehicles_taguchi$drive[vehicles_taguchi$drive == "Rear-Wheel Drive"] <- 6
#Categorize 'drive' as a factor variable and display its resulting structure.
vehicles_taguchi$drive = as.factor(vehicles_taguchi$drive)
str(vehicles_taguchi$drive)
## Factor w/ 6 levels "1","2","3","4",..: 6 6 6 6 6 6 6 6 6 6 ...
#Convert the factor 'fuel' into a factor variable designated by its levels "Chevrolet", "Saab" and "Subaru".
vehicles_taguchi$fuel <- as.character(vehicles_taguchi$fuel)
vehicles_taguchi$fuel[vehicles_taguchi$fuel == "Normal-Grade"] <- 1
vehicles_taguchi$fuel[vehicles_taguchi$fuel == "High-Grade"] <- 2
#Categorize 'fuel' as a factor variable and display its resulting structure.
vehicles_taguchi$fuel = as.factor(vehicles_taguchi$fuel)
str(vehicles_taguchi$fuel)
## Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
#Install and load the 'qualityTools' dataset into R.
install.packages("qualityTools", repos='http://cran.us.r-project.org')
## package 'qualityTools' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\howelb\AppData\Local\Temp\RtmpszCIHe\downloaded_packages
library("qualityTools",lib.loc="C:/Program Files/R/R-3.1.1/library")
## Warning: package 'qualityTools' was built under R version 3.1.2
#Install and load the 'DoE.base' dataset into R.
install.packages("DoE.base", repos='http://cran.us.r-project.org')
## package 'DoE.base' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\howelb\AppData\Local\Temp\RtmpszCIHe\downloaded_packages
library("DoE.base",lib.loc="C:/Program Files/R/R-3.1.1/library")
## Warning: package 'DoE.base' was built under R version 3.1.2
## Loading required package: grid
## Loading required package: conf.design
## Warning: package 'conf.design' was built under R version 3.1.2
##
## Attaching package: 'DoE.base'
##
## The following objects are masked from 'package:stats':
##
## aov, lm
##
## The following object is masked from 'package:graphics':
##
## plot.design
#Create core orthogonal array, minimizing aliasing between main effects and two-factor interactions via 'columns = "min3".
core.array = oa.design(factor.names = c("make", "year", "trans", "drive", "fuel"), nlevels=c(3,4,2,6,2), columns = "min3")
print(core.array)
## make year trans drive fuel
## 1 1 1 2 4 1
## 2 1 3 1 3 1
## 3 3 3 1 5 2
## 4 2 1 2 1 1
## 5 1 2 1 1 2
## 6 1 4 2 4 2
## 7 2 4 2 6 2
## 8 3 4 2 6 1
## 9 1 2 2 3 2
## 10 1 1 1 1 1
## 11 1 1 1 5 1
## 12 1 4 1 2 1
## 13 3 2 1 2 1
## 14 1 3 2 1 2
## 15 2 2 1 3 2
## 16 2 1 2 5 2
## 17 3 1 2 5 1
## 18 1 1 2 3 2
## 19 3 4 2 2 2
## 20 2 4 1 2 2
## 21 2 3 2 4 2
## 22 1 2 2 6 1
## 23 2 2 2 1 2
## 24 2 3 1 2 1
## 25 3 3 1 3 2
## 26 1 4 1 3 1
## 27 2 3 2 3 1
## 28 3 2 1 4 2
## 29 2 4 1 4 2
## 30 1 4 1 6 2
## 31 3 3 2 6 2
## 32 2 1 1 6 2
## 33 2 2 1 6 1
## 34 3 1 2 3 1
## 35 2 1 2 2 2
## 36 3 1 1 1 2
## 37 3 1 1 2 1
## 38 2 4 2 3 1
## 39 2 2 1 5 2
## 40 3 4 2 4 1
## 41 1 1 1 6 2
## 42 2 3 1 1 2
## 43 3 2 1 1 1
## 44 1 1 2 2 2
## 45 2 2 2 4 1
## 46 1 3 1 2 1
## 47 2 4 1 1 1
## 48 3 3 2 2 2
## 49 2 3 2 6 1
## 50 3 4 2 1 2
## 51 2 3 1 5 1
## 52 3 2 2 5 2
## 53 1 3 1 4 2
## 54 3 1 1 6 1
## 55 3 2 2 3 1
## 56 1 2 2 2 2
## 57 3 4 1 3 2
## 58 3 3 1 4 1
## 59 2 2 2 2 1
## 60 1 2 2 5 1
## 61 1 4 2 1 1
## 62 3 2 1 6 2
## 63 1 3 2 6 1
## 64 1 2 1 4 1
## 65 3 1 2 4 2
## 66 2 4 2 5 1
## 67 3 4 1 5 1
## 68 1 4 1 5 2
## 69 1 3 2 5 2
## 70 2 1 1 3 2
## 71 3 3 2 1 1
## 72 2 1 1 4 1
## class=design, type= oa
#Merge "core.array" with "vehicle_taguchi" dataset by main factors.
merged.array <- merge(core.array, vehicles_taguchi, by=c("make", "year", "trans", "drive", "fuel"), all=FALSE)
#Create array from unique rows of "merged.array".
unique.array = unique(merged.array[,1:5])
unique.array
## make year trans drive fuel
## 1 1 1 1 1 1
## 57 1 1 1 5 1
## 164 1 1 1 6 2
## 172 1 2 1 1 2
## 174 1 2 2 3 2
## 177 1 2 2 5 1
## 237 1 2 2 6 1
## 388 1 3 1 3 1
## 651 1 3 2 5 2
## 661 1 3 2 6 1
## 719 1 4 1 2 1
## 782 1 4 1 5 2
## 787 1 4 1 6 2
## 790 2 1 2 5 2
## 793 2 3 1 5 1
## 822 2 3 2 3 1
## 828 2 4 1 4 2
## 830 2 4 2 5 1
## 841 3 1 2 3 1
## 895 3 1 2 5 1
## 940 3 2 2 3 1
## 1004 3 3 1 3 2
## 1061 3 4 2 4 1
#Create final array for further analysis.
final.cty = merged.array$cty[ind = c(1, 57, 164, 172, 174, 177, 237, 388, 651, 661, 719, 782, 787, 790, 793, 822, 828, 830, 841, 895, 940, 1004, 1061)]
final.array = cbind(unique.array,final.cty)
final.array
## make year trans drive fuel final.cty
## 1 1 1 1 1 1 13
## 57 1 1 1 5 1 18
## 164 1 1 1 6 2 15
## 172 1 2 1 1 2 28
## 174 1 2 2 3 2 15
## 177 1 2 2 5 1 22
## 237 1 2 2 6 1 13
## 388 1 3 1 3 1 14
## 651 1 3 2 5 2 20
## 661 1 3 2 6 1 17
## 719 1 4 1 2 1 16
## 782 1 4 1 5 2 128
## 787 1 4 1 6 2 12
## 790 2 1 2 5 2 18
## 793 2 3 1 5 1 17
## 822 2 3 2 3 1 20
## 828 2 4 1 4 2 16
## 830 2 4 2 5 1 20
## 841 3 1 2 3 1 21
## 895 3 1 2 5 1 22
## 940 3 2 2 3 1 21
## 1004 3 3 1 3 2 17
## 1061 3 4 2 4 1 22
#Caluclate Signal-to-Noise Ratio for further analysis.
SNR = (-10)*(log10(1/final.array$final.cty^2))
SNR
## [1] 22.27887 25.10545 23.52183 28.94316 23.52183 26.84845 22.27887
## [8] 22.92256 26.02060 24.60898 24.08240 42.14420 21.58362 25.10545
## [15] 24.60898 26.02060 24.08240 26.02060 26.44439 26.84845 26.44439
## [22] 24.60898 26.84845
max(SNR)
## [1] 42.1442
#Determine the row that contains the maximum Signal-to-Noise Ratio value.
SNR_maxindex = which(SNR == max(SNR))
SNR_maxindex
## [1] 12
#Update "final.array" so that it solely contains the row of data with maximum value of Signal-to-Noise Ratio.
final.array[SNR_maxindex,]
## make year trans drive fuel final.cty
## 782 1 4 1 5 2 128
final.array
## make year trans drive fuel final.cty
## 1 1 1 1 1 1 13
## 57 1 1 1 5 1 18
## 164 1 1 1 6 2 15
## 172 1 2 1 1 2 28
## 174 1 2 2 3 2 15
## 177 1 2 2 5 1 22
## 237 1 2 2 6 1 13
## 388 1 3 1 3 1 14
## 651 1 3 2 5 2 20
## 661 1 3 2 6 1 17
## 719 1 4 1 2 1 16
## 782 1 4 1 5 2 128
## 787 1 4 1 6 2 12
## 790 2 1 2 5 2 18
## 793 2 3 1 5 1 17
## 822 2 3 2 3 1 20
## 828 2 4 1 4 2 16
## 830 2 4 2 5 1 20
## 841 3 1 2 3 1 21
## 895 3 1 2 5 1 22
## 940 3 2 2 3 1 21
## 1004 3 3 1 3 2 17
## 1061 3 4 2 4 1 22
str(final.array)
## 'data.frame': 23 obs. of 6 variables:
## $ make : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "contrasts")= num [1:3, 1:2] -7.07e-01 -7.85e-17 7.07e-01 4.08e-01 -8.16e-01 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr "1" "2" "3"
## .. .. ..$ : chr ".L" ".Q"
## $ year : Factor w/ 4 levels "1","2","3","4": 1 1 1 2 2 2 2 3 3 3 ...
## ..- attr(*, "contrasts")= num [1:4, 1:3] -0.671 -0.224 0.224 0.671 0.5 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr "1" "2" "3" "4"
## .. .. ..$ : chr ".L" ".Q" ".C"
## $ trans : Factor w/ 2 levels "1","2": 1 1 1 1 2 2 2 1 2 2 ...
## ..- attr(*, "contrasts")= num [1:2, 1] -1 1
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr "1" "2"
## .. .. ..$ : NULL
## $ drive : Factor w/ 6 levels "1","2","3","4",..: 1 5 6 1 3 5 6 3 5 6 ...
## ..- attr(*, "contrasts")= num [1:6, 1:5] -0.598 -0.359 -0.12 0.12 0.359 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr "1" "2" "3" "4" ...
## .. .. ..$ : chr ".L" ".Q" ".C" "^4" ...
## $ fuel : Factor w/ 2 levels "1","2": 1 1 2 2 2 1 1 1 2 1 ...
## ..- attr(*, "contrasts")= num [1:2, 1] -1 1
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr "1" "2"
## .. .. ..$ : NULL
## $ final.cty: int 13 18 15 28 15 22 13 14 20 17 ...
Based on the results of the determination of the row of the array that contains the maximum value of the Signal-to-Noise Ratio, it appears that the 12th unique combination of factor-levels across the entire experimental design corresponds to the most optimal Signal-to-Noise Ratio value (42.1442, which is found in the 782nd row of “final.array”). The resulting optimal unique factor-level combination suggests that an automatic, Chevrolet-made, front-wheel driving vehicle produced sometime in the 2010s with a “high-grade” fuel type is the most desired vehicle to drive, since it optimizes/maximizes city fuel economy at a value of 128 mpg (on the basis of our analysis’ factor-level categorization). In determining the significance of this result, an analysis of variance (ANOVA) on the final orthogonal array can be performed.
#Using the resulting "final.array", perform ANOVA analysis.
model.orthogarray <- aov(final.array$final.cty~final.array$make+final.array$year+final.array$trans+final.array$drive+final.array$fuel, data = final.array)
anova(model.orthogarray)
## Analysis of Variance Table
##
## Response: final.array$final.cty
## Df Sum Sq Mean Sq F value Pr(>F)
## final.array$make 2 222.1 111.04 0.1750 0.8420
## final.array$year 3 1512.0 504.01 0.7941 0.5246
## final.array$trans 1 27.4 27.43 0.0432 0.8395
## final.array$drive 5 3605.6 721.11 1.1362 0.4021
## final.array$fuel 1 159.6 159.56 0.2514 0.6269
## Residuals 10 6346.6 634.66
model.orthogarray
## Call:
## aov.default(formula = final.array$final.cty ~ final.array$make +
## final.array$year + final.array$trans + final.array$drive +
## final.array$fuel, data = final.array)
##
## Terms:
## final.array$make final.array$year final.array$trans
## Sum of Squares 222.074 1512.021 27.434
## Deg. of Freedom 2 3 1
## final.array$drive final.array$fuel Residuals
## Sum of Squares 3605.570 159.563 6346.643
## Deg. of Freedom 5 1 10
##
## Residual standard error: 25.19255
## Estimated effects may be unbalanced
Based on the above analysis of variance (ANOVA) for a Taguchi-Design-driven model, it appears that each factor’s main effect in the model is not statistically significant with consideration given to the analysis’ null hypothesis that is being tested, which states that the make, the drive type, the transmission type, the fuel type, and the production year of a vehicle do not have have a statistically significant effect on city fuel economy. Ultimately, with this result, we would fail to reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of city fuel economy cannot be explained by the variation existent in the the make, the drive type, the transmission type, the fuel type, or the production year of a vehicle being considered in this analysis and, as such, is likely caused solely by randomization. [Note: The results of this analysis do not fall in line with those discerned from this experiment’s first analysis of variance (ANOVA). This may have to do with the analysis’ original factor-level categorization methodology.]
In estimating the different parameters of the experiment, summary statistics were determined for the relevant data in the dataset pertaining to city fuel economy in “cars_new” (which include both the average city fuel economy values considered with respect to ‘make’, ‘year’, ‘trans’, ‘drive’, and ‘fuel’ contained within the dataset and the standard deviation of those city fuel economy values), the different vehicle makes contained in “cars_new” (which include both the quantities of vehicle makes classified as being ‘Chevrolet’, ‘Subaru’, and ‘Saab’, respectively, and the standard deviation of those distributed quantities), the different vehicle drive types contained in “cars_new” (which include both the quantities of drive types classified as being ‘2-Wheel Drive’, ‘Rear-Wheel Drive’, ‘Front-Wheel Drive’, ‘4-Wheel or All-Wheel Drive’, ‘All-Wheel Drive’, and ‘4-Wheel Drive’, respectively, and the standard deviation of those distributed quantities), the different vehicle transmission types contained in “cars_new” (which include both the quantities of the categorized transmission types [‘Automatic’ and ‘Manual’] that are found in ‘Chevrolet’, ‘Subaru’, and ‘Saab’ vehicles, respectively, and the standard deviation of those distributed quantities), the different production years of vehicles contained in “cars_new” (which include both the quantities of production years of vehicles classified as corresponding to a decade represented by ‘80s’, ‘90s’, ‘00s’, or ‘10s’, respectively, and the standard deviation of those distributed quantities), and the different vehicle fuel types contained in “cars_new” (which include both the quantities of fuel types classified as being either ‘High-Grade’ or ‘Normal-Grade’, respectively, and the standard deviation of those distributed quantities).
#Display summary statistics of cars_new$cty.
summary(cars_new$cty)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 14.00 16.00 16.88 19.00 128.00
#Display standard deviation of cars_new$cty.
sd(cars_new$cty, na.rm = FALSE)
## [1] 4.319677
#Display summary statistics of cars_new$make.
summary(cars_new$make)
## Chevrolet Saab Subaru
## 3461 424 764
#Display standard deviation of cars_new$make.
sd(cars_new$make, na.rm = FALSE)
## [1] 0.7565553
#Display summary statistics of cars_new$year.
summary(cars_new$year)
## 00s 10s 80s 90s
## 1394 595 1310 1350
#Display standard deviation of cars_new$year.
sd(cars_new$year, na.rm = FALSE)
## [1] 1.194506
#Display summary statistics of cars_new$trans.
summary(cars_new$trans)
## Automatic Manual
## 2941 1708
#Display standard deviation of cars_new$trans.
sd(cars_new$trans, na.rm = FALSE)
## [1] 0.482146
#Display summary statistics of cars_new$drive.
summary(cars_new$drive)
## 2-Wheel Drive 4-Wheel Drive
## 115 68
## 4-Wheel or All-Wheel Drive All-Wheel Drive
## 1350 166
## Front-Wheel Drive Rear-Wheel Drive
## 1348 1602
#Display standard deviation of cars_new$drive.
sd(cars_new$drive, na.rm = FALSE)
## [1] 1.377564
#Display summary statistics of cars_new$fuel.
summary(cars_new$fuel)
## High-Grade Normal-Grade
## 490 4159
#Display standard deviation of cars_new$fuel.
sd(cars_new$fuel, na.rm = FALSE)
## [1] 0.3070999
In verifying the results of this experiment, it’s important to ensure that the dataset itself meets all of the assumptions that correlate with the design approach that was carried out. In this way, we want to make sure that our dataset exhibits normality. Until we know that our dataset does, in fact, exhibit normality, we cannot yet say with confidence that our results are significant and representative of a properly carried-out modeling approach. In verifying our dataset for normality, we can create a Normal Quantile-Quantile (QQ) Plot for the experiment’s data, create a Normal Quantile-Quantile (QQ) Plot of Residuals for the generated ANOVA model, create a Normal Quantile-Quantile (QQ) Plot of Residuals for the generated Taguchi Design model, and perform a Shapiro-Wilk Test of Normality on the experiment’s data.
#Create a Normal Q-Q Plot for city fuel economy (in "cars_new").
qqnorm(cars_new[,"cty"], main = "Normal Q-Q Plot of City Fuel Economy")
qqline(cars_new[,"cty"])
#Create a Normal Q-Q Plot for city fuel economy (in "final.array").
qqnorm(final.array[,"final.cty"], main = "Normal Q-Q Plot of City Fuel Economy [Taguchi Design]")
qqline(final.array[,"final.cty"])
#Create a Normal Q-Q Plot of the residuals for "vehicles_anova".
qqnorm(residuals(vehicles_anova), main = "Normal Q-Q Plot of Residuals of 'vehicles_anova'")
qqline(residuals(vehicles_anova))
#Create a Normal Q-Q Plot of the residuals for "model.orthogarray".
qqnorm(residuals(model.orthogarray), main = "Normal Q-Q Plot of Residuals of 'model.orthogarray'")
qqline(residuals(model.orthogarray))
#Perform Shapiro-Wilk Test of Normality on city fuel economy (normality is assummed if p > 0.1).
shapiro.test(cars_new[,"cty"])
##
## Shapiro-Wilk normality test
##
## data: cars_new[, "cty"]
## W = 0.8163, p-value < 2.2e-16
#Perform Shapiro-Wilk Test of Normality on city fuel economy (normality is assummed if p > 0.1) [generated from Taguchi Design's Final Orthogonal Array].
shapiro.test(final.array[,"final.cty"])
##
## Shapiro-Wilk normality test
##
## data: final.array[, "final.cty"]
## W = 0.3538, p-value = 4.479e-09
Upon both constructing Normal Q-Q Plots on the data in this analysis, it’s likely that we can readily assume that our data exhibits normality, since all of the constructed Normal Q-Q Plots pertaining to both the original ANOVA and the resulting Taguchi Design did seem to display a trend of data that aligned closely with the Normal Q-Q Line. However, upon performing a Shapiro-Wilk Test of Normality on the data in this analysis (for both the original ANOVA and the resulting Taguchi Design), it appears that we cannot readily assume that our data exhibits normality, since the resulting p-value of the Shapiro-Wilk Test of Normality for both “cars_new[,“cty”]” and “final.array[,“final.cty”]” was found to be < 0.1 (at the values of < 2.2e-16 and 4.338e-09, respectively). Nonetheless, research has shown that analyses of variance aren’t very sensitive to even moderate deviations from normality, as per the simulation studies that have been performed using many differenty non-normal distributions (Glass et al. 1972, Harwell et al. 1992, Lix et al. 1996) [2], [3], [4]. Furthermore, because the dataset in consideration (“cars_new”) is a large subset of the total amount data that was collected by the EPA from 1985-2015, we can argue that our dataset “cars_new” intuitively exhibits normality. Either way, we are likely still safe to carry out our analysis of variance and our Taguchi Design in this case.
In further backing up the confidence that we have with our results, we can generate a “quality of fit” model that plots residual error against each of the fitted models that were developed in our original analysis of variance (ANOVA).
#Create a "Quality of Fit Model" that plots the residuals of "vehicles_anova" against its fitted model (not enough data points exist for "model.orthogarray" to generate this kind of plot).
plot(fitted(vehicles_anova), residuals(vehicles_anova))
Since the resulting plot appears to exhibit a fairly symmetric distribution of residuals around the zero line, the ANOVA model developed in this analysis suggests good fit. Thus, we can confidently rely on both the modeling approach that we carried out and the dataset that we analyzed in justifying the significance of our results.
If our modeling assumptions of data normality and factor independence failed in our analysis, we can still err on the side of caution by performing the nonparametric Kruskal-Wallis rank sum test to back up our original results (which will help us to decide whether the population distributions are identical without necessarily exhibiting a normal distribution).
#Perform Kruskal-Wallis Rank Sum Test on 'cty' within the "cars_new" dataset for 'make', 'year', 'trans', 'drive', and 'fuel' (identical populations is assummed if p > 0.05).
kruskal.test(cars_new[,"cty"],cars_new$make)
##
## Kruskal-Wallis rank sum test
##
## data: cars_new[, "cty"] and cars_new$make
## Kruskal-Wallis chi-squared = 952.0894, df = 2, p-value < 2.2e-16
kruskal.test(cars_new[,"cty"],cars_new$year)
##
## Kruskal-Wallis rank sum test
##
## data: cars_new[, "cty"] and cars_new$year
## Kruskal-Wallis chi-squared = 121.4982, df = 3, p-value < 2.2e-16
kruskal.test(cars_new[,"cty"],cars_new$trans)
##
## Kruskal-Wallis rank sum test
##
## data: cars_new[, "cty"] and cars_new$trans
## Kruskal-Wallis chi-squared = 231.2584, df = 1, p-value < 2.2e-16
kruskal.test(cars_new[,"cty"],cars_new$drive)
##
## Kruskal-Wallis rank sum test
##
## data: cars_new[, "cty"] and cars_new$drive
## Kruskal-Wallis chi-squared = 1630.978, df = 5, p-value < 2.2e-16
kruskal.test(cars_new[,"cty"],cars_new$fuel)
##
## Kruskal-Wallis rank sum test
##
## data: cars_new[, "cty"] and cars_new$fuel
## Kruskal-Wallis chi-squared = 11.2784, df = 1, p-value = 0.0007842
Since the p-values for both of the resulting Kruskal-Wallis rank sum tests that consider the factors ‘make’, ‘year’, ‘trans’, ‘drive’, and ‘fuel’ against the response variable ‘cty’ are less than 0.05, we can assume that the mean values of city fuel economy compared to the different makes, drive types, production years, fuel types, and transmission types of a vehicle, (considered separately) are comparatively nonidentical populations. Therefore, this result suggests that we would reject the null hypothesis of our main experiment, we would lead us to infer that the differences in mean values of city fuel economy for each of the corresponding makes, drive types, production years, and transmission types of a vehicle being considered in this dataset is caused by something other than randomization, leading us to believe that the variation that is observed in the mean values of city fuel economy can be explained by the variation existent in the different different makes, drive types, production years, and transmission types of a vehicle being considered in this analysis (this cannot be said for the factor ‘fuel’, based on this experiment’s complete analysis of variance [ANOVA]). Furthermore, in addition to treating our data in such a way that uses a nonparametric analysis upon any realization that normality cannot be assumed, transformations such as the “Box-Cox Power Transformation” certainly could have been performed on the data to make it approximately normal. However, these transformations would not be necessary for this analysis, since the nonparametric significance results that we generated by using the Kruskal-Wallis rank sums test were suitable in giving us confidence in the results of our analysis.
[1] Online Data Source: http://www.fueleconomy.gov/feg/download.shtml
[2] Glass, G.V., P.D. Peckham, and J.R. Sanders. 1972. Consequences of failure to meet assumptions underlying fixed effects analyses of variance and covariance. Rev. Educ. Res. 42: 237-288.
[3] Harwell, M.R., E.N. Rubinstein, W.S. Hayes, and C.C. Olds. 1992. Summarizing Monte Carlo results in methodological research: the one- and two-factor fixed effects ANOVA cases. J. Educ. Stat. 17: 315-339.
[4] Lix, L.M., J.C. Keselman, and H.J. Keselman. 1996. Consequences of assumption violations revisited: A quantitative review of alternatives to the one-way analysis of variance F test. Rev. Educ. Res. 66: 579-619.
The fuel economy data for 1984-2015 from the US EPA is conveniently packaged for R users. It can be installed from github with devtools::install_github(“hadley/fueleconomy”).