========================================================
For this recipe, we examine vehicles, one of Hadley Wickham's datasets from the fueleconomy package. Specifically, we examine the effect of five multi-level factors (make, class, drive, trans, fuel) on highway fuel economy, measured in mpg.
Read in and subset data:
library("fueleconomy")
data<-vehicles
For more on the dataset:
?vehicles
## starting httpd help server ... done
We will examine 5 factors in this experiment: make(4 levels), class(4 levels), drive(4 levels), trans(4 levels) and fuel(6 levels).
Set up make as a factor (Acura, Audi, Chevrolet, and Dodge only):
x<-subset(data,data$make=="Acura" | data$make=="Audi" | data$make=="Chevrolet" | data$make=="Dodge")
x$make<-as.factor(x$make)
Set up class as a factor (Compact Cars, Subcompact Cars, Midsize Cars, and Two Seaters only):
xx<-subset(x, x$class=="Compact Cars"| x$class=="Subcompact Cars"| x$class=="Midsize Cars"| x$class=="Two Seaters")
xx$class<-as.factor(xx$class)
Set up drive as factor (4 levels):
xx$drive<-as.factor(xx$drive)
Set up trans as factor (4 levels):
xxx<-subset(xx, xx$trans=="Automatic (S5)" |xx$trans=="Manual 5-spd" |xx$trans=="Automatic 4-spd"|xx$trans=="Automatic 5-spd")
xxx$trans<-as.factor(xxx$trans)
Set up fuel as factor (6 levels):
xxx$fuel<-as.factor(xxx$fuel)
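The subsetting steps above can also be written more compactly with %in%. The following is a minimal sketch on a small made-up data frame (toy, keep_makes, and toy_sub are hypothetical names, not part of the fueleconomy package), shown only to illustrate the pattern:

```r
# Hypothetical toy data standing in for the vehicles dataset
toy <- data.frame(
  make = c("Acura", "Audi", "Ford", "Dodge", "Chevrolet", "Ford"),
  hwy  = c(26, 24, 30, 27, 29, 31),
  stringsAsFactors = FALSE
)

# Keep only the makes of interest, then convert to a factor
keep_makes <- c("Acura", "Audi", "Chevrolet", "Dodge")
toy_sub <- toy[toy$make %in% keep_makes, ]
toy_sub$make <- factor(toy_sub$make)  # levels are only the makes that remain

nlevels(toy_sub$make)
levels(toy_sub$make)
```

Because the factor is created after subsetting, it carries no unused levels, which matches the level counts reported by str(xxx).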
The only continuous variable under study in this experiment is hwy, the highway fuel economy in mpg.
hwy also serves as the response variable.
The dataset under study has 1175 observations of 12 variables, although only 6 variables are of interest.
Structure and first/last observations of dataset:
str(xxx)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1175 obs. of 12 variables:
## $ id : int 13309 13310 13311 14038 14039 14040 14834 14835 14836 11789 ...
## $ make : Factor w/ 4 levels "Acura","Audi",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ model: chr "2.2CL/3.0CL" "2.2CL/3.0CL" "2.2CL/3.0CL" "2.3CL/3.0CL" ...
## $ year : int 1997 1997 1997 1998 1998 1998 1999 1999 1999 1995 ...
## $ class: Factor w/ 4 levels "Compact Cars",..: 3 3 3 3 3 3 3 3 3 1 ...
## $ trans: Factor w/ 4 levels "Automatic (S5)",..: 2 4 2 2 4 2 2 4 2 2 ...
## $ drive: Factor w/ 4 levels "4-Wheel or All-Wheel Drive",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ cyl : int 4 4 6 4 4 6 4 4 6 5 ...
## $ displ: num 2.2 2.2 3 2.3 2.3 3 2.3 2.3 3 2.5 ...
## $ fuel : Factor w/ 6 levels "Diesel","Gasoline or E85",..: 6 6 6 6 6 6 6 6 6 5 ...
## $ hwy : int 26 28 26 27 29 26 27 29 26 23 ...
## $ cty : int 20 22 18 19 21 17 20 21 17 18 ...
head(xxx)
## id make model year class trans
## 8 13309 Acura 2.2CL/3.0CL 1997 Subcompact Cars Automatic 4-spd
## 9 13310 Acura 2.2CL/3.0CL 1997 Subcompact Cars Manual 5-spd
## 10 13311 Acura 2.2CL/3.0CL 1997 Subcompact Cars Automatic 4-spd
## 11 14038 Acura 2.3CL/3.0CL 1998 Subcompact Cars Automatic 4-spd
## 12 14039 Acura 2.3CL/3.0CL 1998 Subcompact Cars Manual 5-spd
## 13 14040 Acura 2.3CL/3.0CL 1998 Subcompact Cars Automatic 4-spd
## drive cyl displ fuel hwy cty
## 8 Front-Wheel Drive 4 2.2 Regular 26 20
## 9 Front-Wheel Drive 4 2.2 Regular 28 22
## 10 Front-Wheel Drive 6 3.0 Regular 26 18
## 11 Front-Wheel Drive 4 2.3 Regular 27 19
## 12 Front-Wheel Drive 4 2.3 Regular 29 21
## 13 Front-Wheel Drive 6 3.0 Regular 26 17
tail(xxx)
## id make model year class trans
## 10115 20864 Dodge Stratus 4 Door 2005 Midsize Cars Automatic 4-spd
## 10116 21499 Dodge Stratus 4 Door 2005 Midsize Cars Automatic 4-spd
## 10117 21500 Dodge Stratus 4 Door 2005 Midsize Cars Automatic 4-spd
## 10118 21954 Dodge Stratus 4 Door 2006 Midsize Cars Automatic 4-spd
## 10119 22615 Dodge Stratus 4 Door 2006 Midsize Cars Automatic 4-spd
## 10120 22616 Dodge Stratus 4 Door 2006 Midsize Cars Automatic 4-spd
## drive cyl displ fuel hwy cty
## 10115 Front-Wheel Drive 4 2.4 Regular 27 20
## 10116 Front-Wheel Drive 6 2.7 Gasoline or E85 25 19
## 10117 Front-Wheel Drive 6 2.7 Gasoline or E85 25 19
## 10118 Front-Wheel Drive 4 2.4 Regular 27 20
## 10119 Front-Wheel Drive 6 2.7 Gasoline or E85 25 19
## 10120 Front-Wheel Drive 6 2.7 Gasoline or E85 25 19
In this recipe, a Taguchi design will be created to analyze whether any of the 5 factors listed above have an effect on highway fuel economy.
The goal is to determine whether any of these factors cause a change in the highway mileage of a car.
The dataset is a collection of survey data from the EPA on cars from 1985 to 2015.
There are no replicates or repeated measures.
Blocking is not used in this design.
Mean highway mileage (hwy) by each of the 5 factors:
tapply(xxx$hwy, xxx$make, mean)
## Acura Audi Chevrolet Dodge
## 25.75 23.55 28.67 26.98
tapply(xxx$hwy, xxx$class, mean)
## Compact Cars Midsize Cars Subcompact Cars Two Seaters
## 27.40 25.82 27.72 22.74
tapply(xxx$hwy, xxx$drive, mean)
## 4-Wheel or All-Wheel Drive All-Wheel Drive
## 22.63 24.00
## Front-Wheel Drive Rear-Wheel Drive
## 28.07 24.08
tapply(xxx$hwy, xxx$trans, mean)
## Automatic (S5) Automatic 4-spd Automatic 5-spd Manual 5-spd
## 25.78 25.74 24.40 28.21
tapply(xxx$hwy, xxx$fuel, mean)
## Diesel Gasoline or E85 Gasoline or natural gas
## 41.00 26.88 28.75
## Midgrade Premium Regular
## 25.40 24.53 28.05
Audis appear to have a lower mileage than the other 3 makes. Two-seaters seem to have a lower mileage than the other classes. Front-wheel drive and manual 5-speed cars have the highest mileage within their factors. Diesel appears to have a much higher mean mileage than the other 5 fuel types. The means will be further analyzed in the boxplots below.
Boxplots:
boxplot(xxx$hwy~xxx$make, xlab="Make", ylab="hwy mpg")
boxplot(xxx$hwy~xxx$class, xlab="Class", ylab="hwy mpg")
boxplot(xxx$hwy~xxx$drive, xlab="Drive", ylab="hwy mpg")
boxplot(xxx$hwy~xxx$trans, xlab="trans", ylab="hwy mpg")
boxplot(xxx$hwy~xxx$fuel, xlab="fuel", ylab="hwy mpg")
From the boxplots, the trends observed above are reinforced. One important observation is that many of the factor levels have several outliers of high highway mileage. It is also important to note that when looking at fuel type, some levels have only a couple of observations (diesel and midgrade).
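Since some levels have very few observations, it can help to look at the count per level next to each mean. A minimal sketch on made-up data (toy and summary_tbl are hypothetical names; the values are illustrative, not taken from the dataset):

```r
# Hypothetical toy data; values are illustrative only
toy <- data.frame(
  fuel = c("Regular", "Regular", "Premium", "Diesel", "Regular", "Premium"),
  hwy  = c(28, 27, 24, 41, 29, 25)
)

counts <- table(toy$fuel)                 # observations per level
means  <- tapply(toy$hwy, toy$fuel, mean) # mean hwy per level

# Combine counts and means, aligned by level name
summary_tbl <- data.frame(
  fuel     = names(means),
  n        = as.integer(counts[names(means)]),
  mean_hwy = as.numeric(means)
)
summary_tbl
```

A level with n of 1 or 2 has a mean that rests on very little data, so differences involving that level should be read with caution.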
Histogram of highway mileage:
hist(xxx$hwy, breaks=20)
The most frequent mileage is between 25 and 30 mpg.
Null Hypothesis: The variation in highway fuel mileage cannot be explained by anything other than random variation; that is, none of the five factors has an effect.
First, assign integer levels to all factors:
xxx$make<-as.character(xxx$make)
xxx$make[xxx$make == "Acura"]<-1
xxx$make[xxx$make == "Audi"]<-2
xxx$make[xxx$make == "Chevrolet"]<-3
xxx$make[xxx$make == "Dodge"]<-4
xxx$class<-as.character(xxx$class)
xxx$class[xxx$class == "Compact Cars"]<-1
xxx$class[xxx$class == "Midsize Cars"]<-2
xxx$class[xxx$class == "Subcompact Cars"]<-3
xxx$class[xxx$class == "Two Seaters"]<-4
xxx$drive<-as.character(xxx$drive)
xxx$drive[xxx$drive == "4-Wheel or All-Wheel Drive"]<-1
xxx$drive[xxx$drive == "All-Wheel Drive"]<-2
xxx$drive[xxx$drive == "Front-Wheel Drive"]<-3
xxx$drive[xxx$drive == "Rear-Wheel Drive"]<-4
xxx$trans<-as.character(xxx$trans)
xxx$trans[xxx$trans == "Automatic (S5)"]<-1
xxx$trans[xxx$trans == "Automatic 4-spd"]<-2
xxx$trans[xxx$trans == "Automatic 5-spd"]<-3
xxx$trans[xxx$trans == "Manual 5-spd"]<-4
xxx$fuel<-as.character(xxx$fuel)
xxx$fuel[xxx$fuel == "Diesel"]<-1
xxx$fuel[xxx$fuel == "Gasoline or E85"]<-2
xxx$fuel[xxx$fuel == "Gasoline or natural gas"]<-3
xxx$fuel[xxx$fuel == "Midgrade"]<-4
xxx$fuel[xxx$fuel == "Premium"]<-5
xxx$fuel[xxx$fuel == "Regular"]<-6
xxx$make<-as.integer(xxx$make)
xxx$class<-as.integer(xxx$class)
xxx$drive<-as.integer(xxx$drive)
xxx$trans<-as.integer(xxx$trans)
xxx$fuel<-as.integer(xxx$fuel)
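The same integer coding can be obtained in one step by converting a factor with an explicit level order. A minimal sketch on a small character vector (make, make_levels, and make_code here are hypothetical stand-ins, not the columns of xxx):

```r
# Hypothetical example values
make <- c("Dodge", "Acura", "Chevrolet", "Audi", "Acura")

# The position of each level in this vector becomes its integer code
make_levels <- c("Acura", "Audi", "Chevrolet", "Dodge")
make_code <- as.integer(factor(make, levels = make_levels))

make_code  # 4 1 3 2 1
```

Stating the levels explicitly keeps the mapping from label to code visible in one place.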
Create ANOVA model:
model1=aov(xxx$hwy~xxx$make+xxx$class+xxx$drive+xxx$trans+xxx$fuel)
summary(model1)
## Df Sum Sq Mean Sq F value Pr(>F)
## xxx$make 1 957 957 61.17 1.2e-14 ***
## xxx$class 1 45 45 2.87 0.091 .
## xxx$drive 1 307 307 19.59 1.0e-05 ***
## xxx$trans 1 1243 1243 79.41 < 2e-16 ***
## xxx$fuel 1 491 491 31.35 2.7e-08 ***
## Residuals 1169 18293 16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
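Since the p-values for make, drive, trans, and fuel are small, there is a low probability that the observed variation is due to chance alone, so we reject the null hypothesis for these factors. It is likely that make, drive, trans, and fuel type each explain some of the variation in highway mileage. There is no evidence at the 0.05 level (p = 0.091), however, that the class of the car is associated with variation in highway mileage. Note that each term has only 1 degree of freedom because the factors were converted to integers before fitting, so this model tests a linear trend in the arbitrary integer codes rather than differences among the level means.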
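The effect of integer coding on the degrees of freedom can be seen on simulated data. This is a minimal sketch with made-up values (toy, class_code, fit_numeric, and fit_factor are hypothetical names), not a re-analysis of the dataset:

```r
set.seed(1)

# Hypothetical toy data: 4 groups of 10, with made-up group means
toy <- data.frame(class_code = rep(1:4, each = 10))
toy$hwy <- c(27, 26, 28, 23)[toy$class_code] + rnorm(40)

# Integer-coded predictor: a single slope, 1 degree of freedom
fit_numeric <- aov(hwy ~ class_code, data = toy)

# Factor predictor: one mean per level, 3 degrees of freedom
fit_factor <- aov(hwy ~ factor(class_code), data = toy)

df_numeric <- summary(fit_numeric)[[1]]$Df  # 1 38
df_factor  <- summary(fit_factor)[[1]]$Df   # 3 36
df_numeric
df_factor
```

With a factor predictor, the test compares all level means; with an integer predictor, it only asks whether the response rises or falls linearly with the code.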
Before a Taguchi design is created, we must load the qualityTools and DoE.base packages:
library("qualityTools")
## Warning: package 'qualityTools' was built under R version 3.1.2
library("DoE.base")
## Warning: package 'DoE.base' was built under R version 3.1.2
## Loading required package: grid
## Loading required package: conf.design
## Warning: package 'conf.design' was built under R version 3.1.2
##
## Attaching package: 'DoE.base'
##
## The following objects are masked from 'package:stats':
##
## aov, lm
##
## The following object is masked from 'package:graphics':
##
## plot.design
Now, we will create a Taguchi design to find the factor level combinations that lead to the highest S/N (signal-to-noise) ratio.
First, create an orthogonal array:
array=oa.design(factor.names=c("make", "class", "drive", "trans", "fuel"), nlevels=c(4,4,4,4,6), columns="min3")
array
## make class drive trans fuel
## 1 4 4 3 1 2
## 2 4 4 4 4 1
## 3 1 1 3 2 6
## 4 4 2 2 2 3
## 5 3 2 2 2 5
## 6 1 3 2 1 4
## 7 3 1 2 4 6
## 8 3 3 4 2 2
## 9 4 2 1 3 6
## 10 3 2 3 4 1
## 11 1 4 4 4 5
## 12 2 2 4 1 6
## 13 2 1 4 3 4
## 14 4 3 2 1 4
## 15 2 4 3 1 6
## 16 3 3 3 3 1
## 17 3 4 3 1 3
## 18 2 2 1 3 2
## 19 1 4 2 3 5
## 20 3 1 4 3 4
## 21 2 3 3 3 5
## 22 1 2 4 1 2
## 23 4 3 1 4 2
## 24 3 1 1 1 3
## 25 4 1 3 2 5
## 26 1 3 4 2 6
## 27 2 3 1 4 5
## 28 2 1 3 2 2
## 29 3 4 2 3 2
## 30 4 3 4 2 3
## 31 2 1 2 4 3
## 32 3 3 1 4 6
## 33 1 1 2 4 2
## 34 4 4 2 3 6
## 35 3 4 1 2 4
## 36 2 4 4 4 3
## 37 4 2 3 4 4
## 38 1 2 1 3 3
## 39 1 1 1 1 1
## 40 2 4 1 2 4
## 41 1 2 3 4 4
## 42 3 2 4 1 5
## 43 4 1 4 3 1
## 44 1 3 3 3 3
## 45 2 3 2 1 1
## 46 2 2 2 2 1
## 47 1 4 1 2 1
## 48 4 1 1 1 5
## class=design, type= oa
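The balance property that makes an array orthogonal can be checked in base R: for every pair of columns, each combination of levels should appear equally often. A minimal sketch using a hand-written 4-run, 3-factor, 2-level array rather than the 48-run design above (l4 and is_balanced_pair are hypothetical names):

```r
# Hand-written 4-run array with three 2-level columns (illustrative only)
l4 <- data.frame(
  A = c(1, 1, 2, 2),
  B = c(1, 2, 1, 2),
  C = c(1, 2, 2, 1)
)

# TRUE if every level combination of columns i and j occurs equally often
is_balanced_pair <- function(d, i, j) {
  counts <- table(d[[i]], d[[j]])
  all(counts == counts[1])
}

# Check every pair of columns
col_pairs <- combn(ncol(l4), 2)
all_pairs_balanced <- all(apply(col_pairs, 2, function(p) {
  is_balanced_pair(l4, p[1], p[2])
}))
all_pairs_balanced
```

The same check could be applied to the columns of a larger design after converting it to a plain data frame.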
Merge the array with the original data and identify the unique rows in the new set:
new=merge(array, xxx, by=c("make", "class", "drive", "trans", "fuel"), all=FALSE)
unique=unique(new[,1:5])
unique
## make class drive trans fuel
## 1 1 1 3 2 6
## 7 1 4 4 4 5
## 13 2 3 1 4 5
## 15 4 1 3 2 5
rownames(unique)
## [1] "1" "7" "13" "15"
Find and save the highway mileage for the unique rows (the first matching observation for each) and use them to create a second array:
hwy2=new$hwy[c(1,7,13,15)]
hwy2
## [1] 20 22 22 25
array2 = cbind(unique,hwy2)
array2
## make class drive trans fuel hwy2
## 1 1 1 3 2 6 20
## 7 1 4 4 4 5 22
## 13 2 3 1 4 5 22
## 15 4 1 3 2 5 25
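The step above keeps one observation per matched design row. An alternative, sketched here on made-up data, is to average all observations that match each design row (design, obs, matched, and run_means are hypothetical names, and the values are illustrative):

```r
# Hypothetical design rows and observations
design <- data.frame(make = c(1, 2), trans = c(2, 4))
obs <- data.frame(
  make  = c(1, 1, 2, 3),
  trans = c(2, 2, 4, 1),
  hwy   = c(20, 24, 23, 30)
)

# Inner join: keep only observations whose levels match a design row
matched <- merge(design, obs, by = c("make", "trans"))

# One mean response per design row
run_means <- aggregate(hwy ~ make + trans, data = matched, FUN = mean)
run_means
```

Averaging uses every matching car rather than whichever happens to come first after the merge, at the cost of hiding the spread within each run.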
Now, we can calculate the signal to noise ratio:
s.n = -10*log10(1/array2$hwy2^2)
s.n
## [1] 26.02 26.85 26.85 27.96
The highest S/N ratio is 27.95880, which corresponds to the fourth row of array2 (row name 15). This combination of factor levels gives the best (highest) highway fuel mileage among the matched runs. Looking back at array2, the factor/level combination for that row is: make-4-Dodge, class-1-Compact Cars, drive-3-Front-Wheel Drive, trans-2-Automatic 4-spd, fuel-5-Premium.
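The formula used above is the larger-the-better S/N ratio, -10*log10(mean(1/y^2)), evaluated with a single observation per run. A minimal sketch that recomputes it for the four hwy2 values and locates the largest (sn_larger_better is a hypothetical helper name):

```r
# Larger-the-better S/N ratio for a vector of responses from one run
sn_larger_better <- function(y) -10 * log10(mean(1 / y^2))

# The four responses from array2
hwy2 <- c(20, 22, 22, 25)

# One S/N value per run (each run has a single observation here)
s.n <- sapply(hwy2, sn_larger_better)
round(s.n, 2)     # 26.02 26.85 26.85 27.96

best_run <- which.max(s.n)
best_run          # 4
```

With one observation per run the ratio reduces to 20*log10(y), so it simply ranks the runs by their mileage; it becomes more informative when each run has several observations.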
Finally, we run a new ANOVA model with the unique row array (array2):
model2=aov(array2$hwy2~array2$make+array2$class+array2$drive+array2$trans+array2$fuel, data=array2)
summary(model2)
## Df Sum Sq Mean Sq
## array2$make 2 10.8 5.38
## array2$class 1 2.0 2.00
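Since array2 has only 4 rows, there are only 3 degrees of freedom in total. make and class use all of them, leaving none for the remaining factors or for the residuals, so no F values or p-values can be computed. This is not a sufficient amount of data to run the analysis.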
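The degrees-of-freedom count can be confirmed directly. A minimal sketch that rebuilds the make, class, and hwy2 columns of array2 by hand (runs and fit are hypothetical names) and inspects the residual degrees of freedom:

```r
# make, class, and hwy2 values copied from array2
runs <- data.frame(
  make  = factor(c(1, 1, 2, 4)),
  class = factor(c(1, 4, 3, 1)),
  hwy2  = c(20, 22, 22, 25)
)

# Four observations, and the terms absorb all 3 available degrees of freedom
fit <- aov(hwy2 ~ make + class, data = runs)

resid_df <- df.residual(fit)
resid_df  # 0
```

With zero residual degrees of freedom there is no estimate of error variance, which is why the summary table shows no F value or Pr(>F) columns.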
Visually inspect normality of original data:
qqnorm(residuals(model1))
qqline(residuals(model1))
The residuals appear to deviate from normality. A Shapiro-Wilk normality test could be used to check this; R's shapiro.test accepts between 3 and 5000 values, so the 1175 residuals of model1 fall within its range, although the test was not run here. In case the data are in fact not normal, an alternative test is performed in the contingencies section below.
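A Shapiro-Wilk test on model residuals takes one call. A minimal sketch on simulated data with deliberately skewed noise (toy, fit, and sw are hypothetical names; this is not the result for model1):

```r
set.seed(42)

# Hypothetical toy data: three groups with skewed (exponential) noise
toy <- data.frame(group = factor(rep(c("a", "b", "c"), each = 50)))
toy$y <- c(25, 27, 29)[as.integer(toy$group)] + rexp(150)

fit <- aov(y ~ group, data = toy)

# Null hypothesis: the residuals come from a normal distribution
sw <- shapiro.test(residuals(fit))
sw$statistic
sw$p.value
```

A small p-value is evidence against normality. With large samples the test flags even mild departures, so it is best read together with the Q-Q plot.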
Fitted vs Residuals Plot:
plot(fitted(model1),residuals(model1))
The residuals should be symmetric about zero and spread evenly across the range of fitted values. We can assume the fit is adequate since both of these appear to be true.
Since the data may not have fulfilled the normality assumption of the model, a Kruskal-Wallis rank sum test, a non-parametric analogue of one-way analysis of variance, is performed for each factor:
kruskal.test(xxx$hwy~xxx$make, data=xxx)
##
## Kruskal-Wallis rank sum test
##
## data: xxx$hwy by xxx$make
## Kruskal-Wallis chi-squared = 223.1, df = 3, p-value < 2.2e-16
kruskal.test(xxx$hwy~xxx$class, data=xxx)
##
## Kruskal-Wallis rank sum test
##
## data: xxx$hwy by xxx$class
## Kruskal-Wallis chi-squared = 91.99, df = 3, p-value < 2.2e-16
kruskal.test(xxx$hwy~xxx$trans, data=xxx)
##
## Kruskal-Wallis rank sum test
##
## data: xxx$hwy by xxx$trans
## Kruskal-Wallis chi-squared = 100.7, df = 3, p-value < 2.2e-16
kruskal.test(xxx$hwy~xxx$drive, data=xxx)
##
## Kruskal-Wallis rank sum test
##
## data: xxx$hwy by xxx$drive
## Kruskal-Wallis chi-squared = 347.2, df = 3, p-value < 2.2e-16
kruskal.test(xxx$hwy~xxx$fuel, data=xxx)
##
## Kruskal-Wallis rank sum test
##
## data: xxx$hwy by xxx$fuel
## Kruskal-Wallis chi-squared = 206.6, df = 5, p-value < 2.2e-16
The null hypothesis of the Kruskal-Wallis test is that the mean ranks of the samples from the populations are expected to be the same (this is not the same as saying the populations have identical means).
Since the test results have a low p-value for each of the factors, it is likely that each factor can explain some of the variation in highway mileage.
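The five separate calls can be collapsed into a loop over factor names. A minimal sketch on simulated data (toy, factor_names, and p_values are hypothetical names; the effect sizes are made up):

```r
set.seed(7)

# Hypothetical toy data with two grouping factors
toy <- data.frame(
  make  = factor(rep(c("A", "B", "C"), each = 20)),
  drive = factor(rep(c("FWD", "RWD"), times = 30))
)
toy$hwy <- 25 + 3 * (toy$make == "C") + rnorm(60)

# Run one Kruskal-Wallis test per factor and collect the p-values
factor_names <- c("make", "drive")
p_values <- sapply(factor_names, function(f) {
  kruskal.test(reformulate(f, response = "hwy"), data = toy)$p.value
})
p_values
```

reformulate builds the formula hwy ~ make (and so on) from the character name, so the list of factors lives in one place.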
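These results agree with the ANOVA run earlier, except in the case of class. This inconsistency may be due to the data not being normal, or to the ANOVA having treated class as a single numeric code (1 degree of freedom in its summary) while the Kruskal-Wallis test compares all four class levels (df = 3).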
None used.
Data is from the fueleconomy Package
All included above.