This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
as of August 28, 2014, superceding the version of August 24. Always use the most recent version.
This analysis uses fuel economy data to test the effect of 5 factors on the response variable of highway fuel economy us a Taguchi Design of experiment.
Below is the installation and initial examination of the dataset:
#Installing data package
install.packages("fueleconomy", repos='http://cran.us.r-project.org')
## Installing package into 'C:/Users/tothk2/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)
## package 'fueleconomy' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\tothk2\AppData\Local\Temp\RtmpAh9Ain\downloaded_packages
library("fueleconomy", lib.loc="~/R/win-library/3.1")
## Warning: package 'fueleconomy' was built under R version 3.1.2
data<-vehicles
attach(data)
## The following object is masked _by_ .GlobalEnv:
##
## model
For this experiment we will have 5 factors which are year, fuel type, displacement, cylinders and drive.
Year will have 16 levels, which will be each year from 2000 to 2016.
Fuel type has three levels which are Regular, Premium and Diesel.
Cylinders has three levels which are <6, 6-10 and >=10 cylinders.
Displacement has 4 levels which are <2, 2-4, 4-6 and >6
Lastly drive has 4 levels which are 4-wheel drive, 2-wheel drive, rear wheel drive, front wheel drive.
#Subsetting Cylinders into 3 levels
data$cyl[as.numeric(data$cyl)>= 2 & as.numeric(data$cyl) < 6 ] ="<6"
data$cyl[as.numeric(data$cyl)>= 6 & as.numeric(data$cyl) < 10 ] = "6-10"
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
data$cyl[as.numeric(data$cyl)>= 10] =">10"
## Warning: NAs introduced by coercion
#Subsetting Displacement into 4 levels
data$displ[data$displ >= 0 & data$displ < 3] = "<3"
data$displ[data$displ >= 3 & data$displ < 6] = "3-6"
data$displ[data$displ >= 6] = ">6"
#Subset from year 2000 and on
datasub <- subset(data, year >= 2000)
datasub$year[datasub$year >= 2000 & datasub$year < 2005] = "2000-2005"
datasub$year[datasub$year >= 2005 & datasub$year < 2010] = "2005-2010"
datasub$year[datasub$year >= 2010] = ">2010"
#Subset Fuel Types
datasub<-datasub[datasub$fuel %in% c("Regular","Premium","Diesel"), ]
#Subset Drive Types
datasub<-datasub[datasub$drive %in% c("Front-Wheel Drive", "Rear-Wheel Drive", "4-Wheel Drive","2-Wheel Drive"), ]
#Levels of Year
unique(datasub$year)
## [1] "2000-2005" ">2010" "2005-2010"
#Levels of each Fuel Type
unique(datasub$fuel)
## [1] "Premium" "Regular" "Diesel"
# Levels of Drive Type
unique(datasub$drive)
## [1] "Front-Wheel Drive" "Rear-Wheel Drive" "4-Wheel Drive"
## [4] "2-Wheel Drive"
#Levels of Number of Cylinders
unique(datasub$cyl)
## [1] "6-10" "<6" ">10"
#Levels of Displacement
unique(datasub$displ)
## [1] "3-6" "<3" ">6"
head(datasub)
## id make model year class trans
## 24 16573 Acura 3.2CL 2000-2005 Compact Cars Automatic (S5)
## 25 17489 Acura 3.2CL 2000-2005 Compact Cars Automatic (S5)
## 26 18458 Acura 3.2CL 2000-2005 Compact Cars Manual 6-spd
## 27 18459 Acura 3.2CL 2000-2005 Compact Cars Automatic (S5)
## 29 15871 Acura 3.2TL 2000-2005 Midsize Cars Automatic 5-spd
## 30 16734 Acura 3.2TL 2000-2005 Midsize Cars Automatic (S5)
## drive cyl displ fuel hwy cty
## 24 Front-Wheel Drive 6-10 3-6 Premium 27 17
## 25 Front-Wheel Drive 6-10 3-6 Premium 27 17
## 26 Front-Wheel Drive 6-10 3-6 Premium 26 17
## 27 Front-Wheel Drive 6-10 3-6 Premium 27 17
## 29 Front-Wheel Drive 6-10 3-6 Premium 27 17
## 30 Front-Wheel Drive 6-10 3-6 Premium 27 17
tail(datasub)
## id make model year class trans
## 33431 26294 smart fortwo coupe 2005-2010 Two Seaters Automatic (AM5)
## 33432 29812 smart fortwo coupe >2010 Two Seaters Auto(AM5)
## 33433 30918 smart fortwo coupe >2010 Two Seaters Auto(AM5)
## 33434 32172 smart fortwo coupe >2010 Two Seaters Auto(AM5)
## 33435 32357 smart fortwo coupe >2010 Two Seaters Auto(AM5)
## 33436 34460 smart fortwo coupe >2010 Two Seaters Auto(AM5)
## drive cyl displ fuel hwy cty
## 33431 Rear-Wheel Drive <6 <3 Premium 41 33
## 33432 Rear-Wheel Drive <6 <3 Premium 41 33
## 33433 Rear-Wheel Drive <6 <3 Premium 41 33
## 33434 Rear-Wheel Drive <6 <3 Premium 38 34
## 33435 Rear-Wheel Drive <6 <3 Premium 38 34
## 33436 Rear-Wheel Drive <6 <3 Premium 38 34
summary(datasub)
## id make model year
## Min. :15589 Length:11085 Length:11085 Length:11085
## 1st Qu.:19432 Class :character Class :character Class :character
## Median :23901 Mode :character Mode :character Mode :character
## Mean :24692
## 3rd Qu.:30656
## Max. :34920
## class trans drive
## Length:11085 Length:11085 Length:11085
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## cyl displ fuel hwy
## Length:11085 Length:11085 Length:11085 Min. :11.0
## Class :character Class :character Class :character 1st Qu.:22.0
## Mode :character Mode :character Mode :character Median :25.0
## Mean :25.6
## 3rd Qu.:29.0
## Max. :61.0
## cty
## Min. : 7.0
## 1st Qu.:15.0
## Median :18.0
## Mean :18.5
## 3rd Qu.:21.0
## Max. :53.0
The continuous variables in this data are the city (“cty”) and highway (“hwy”) gas mileage of each vehicle. Highway mileage ranges from 9 to 109 and City mileage ranges from 6 to 138.
For this experiment we will be focusing on the highway gas mileage for the response variable being effected by the chosen factors.
The data set is organized by the following variables: id, make, model, year, class, trans, drive, cyl, displ, fuel, hwy, cty
Make, model and class are indications of the manufacturer and type of the vehicle such as Audi and Ford and the model of the vehicle is model from that manufacturere such as Passat or Gran Prix. Class indicates vehicle type such as compact car or van.
Year is simply the year that vehicle was manufactured.
trans, drive, cyl, and displ all describe the type of set up the car has, mostly relating to the engine. The trans is the transmission which is thinks like Automatic 9-spd or Manual 5-spd. The drive describes the type of wheel drive like All-wheel or front-wheel. Cyl is the number of cylinders the engine has and displ shows the displacement in liters of the engine.
Fuel is the type of fuel the engine uses such as Regular or Premium.
Cty and hwy are the gas mileage for city driving and gas mileage for highway driving.
It is unknown whether or not the data collected for this study was collected by a randomly designed experiment.
To perform the experiment the data will first be altered so that our data has a reduced number of levels to decrease complexity. A linear model will be created and an analysis of variance will be performed on that model. We will then create an orthogonal taguchi design to test our anova. From these analysis we will be able to test the null hypothesis, that highway gas mileage is independant of drive type, year, cylinders, displacement and fuel type.
The rationale for using an analysis of variance test is used when multiple factors are considered. It checks whether the means of several groups are equal. The alternative would be to use multiple two-sample t-tests however there is more likely chance of the test resulting in a false hypothesis.
The data was collected in an unknown way so we do not know if there was any randomization to it.
There are no replicated or repeated measures in the data. Each unique vehicle had it’s fuel economy statistics measured once.
There was no blocking.
To start our statistical analysis we will make our variables factors for the analysis of variance and look at some boxplots of those factors.
#Defining Fuel Type as a factor
datasub$fuel = as.factor(datasub$fuel)
#Defining Year as a factor
datasub$year = as.factor(datasub$year)
#Defining Drive Type as a factor
datasub$drive = as.factor(datasub$drive)
#Defining Cylinders as a factor
datasub$cyl = as.factor(datasub$cyl)
#Defining Displacement as a factor
datasub$displ = as.factor(datasub$displ)
#Boxplots of the means of each factor against response variable
boxplot(hwy~fuel, data=datasub, xlab="Fuel Type", ylab="Highway Fuel Economy")
boxplot(hwy~year, data=datasub, xlab="Year", ylab="Highway Fuel Economy")
boxplot(hwy~drive, data=datasub, xlab="Drive Type", ylab="Highway Fuel Economy")
boxplot(hwy~cyl, data=datasub, xlab="Number of Cylinders", ylab="Highway Fuel Economy")
boxplot(hwy~displ, data=datasub, xlab="Enginge Displacement", ylab="Highway Fuel Economy")
We can see a positibve relationship between highway mileage and years and inverse relationships with cylinders and displacement. There are no discernible relationships with fuel type or number of cylinders.
To test the hypotheses we perform an ANOVA test on the linear model of the factors
The null hypothesis of the tests is that the factors do not have an effect on the response variable of highway gas mileage.
#Analysis of Variance
model=aov(hwy~year+drive+cyl+displ+fuel, data=datasub)
anova(model)
## Analysis of Variance Table
##
## Response: hwy
## Df Sum Sq Mean Sq F value Pr(>F)
## year 2 21558 10779 941 <2e-16 ***
## drive 3 140501 46834 4089 <2e-16 ***
## cyl 2 47139 23569 2058 <2e-16 ***
## displ 2 5862 2931 256 <2e-16 ***
## fuel 2 15787 7894 689 <2e-16 ***
## Residuals 11073 126811 11
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For each of our individual tests we found resulting p-values of 2.2e-16. The p-value represents the probability that we can get the F value using the null hypothesis. Since our probability is extremely close to 0 we can assume that each factor demonstrates an effect on the response variable. We are lead to reject the null hypothesis.
Next we will load another R library in order to create an orthogonal array for a taguchi design of experiment.
library(DoE.base)
## Warning: package 'DoE.base' was built under R version 3.1.2
## Loading required package: grid
## Loading required package: conf.design
## Warning: package 'conf.design' was built under R version 3.1.2
##
## Attaching package: 'DoE.base'
##
## The following objects are masked from 'package:stats':
##
## aov, lm
##
## The following object is masked from 'package:graphics':
##
## plot.design
#Defining factors as integers
datasub$fuel = as.integer(datasub$fuel)
datasub$year = as.integer(datasub$year)
datasub$drive = as.integer(datasub$drive)
datasub$cyl = as.integer(datasub$cyl)
datasub$displ = as.integer(datasub$displ)
fuelarray = oa.design(factor.names=c("year", "drive", "cyl","displ","fuel"), nlevels=c(3,4,3,3,3), columns="min3")
Then we combine our experimental array with the original dataset to get the results of the experimental runs.
dataarray = merge(fuelarray, datasub, by=c("year","drive","cyl","displ","fuel"), all=FALSE)
head(dataarray)
## year drive cyl displ fuel id make model
## 1 1 2 1 1 3 30777 Ford Escape Hybrid 4WD
## 2 1 2 1 1 3 29946 Hyundai Santa Fe 4WD
## 3 1 2 1 1 3 32461 Kia Sportage 4WD
## 4 1 2 1 1 3 29464 Mercury Mariner Hybrid 4WD
## 5 1 2 1 1 3 33129 Mitsubishi Outlander Sport 4WD
## 6 1 2 1 1 3 31969 Jeep Compass 4WD
## class trans hwy cty
## 1 Sport Utility Vehicle - 4WD Automatic (variable gear ratios) 27 30
## 2 Sport Utility Vehicle - 4WD Automatic 6-spd 27 21
## 3 Small Sport Utility Vehicle 4WD Manual 6-spd 25 19
## 4 Sport Utility Vehicle - 4WD Automatic (variable gear ratios) 27 30
## 5 Small Sport Utility Vehicle 4WD Auto(AV-S6) 29 24
## 6 Sport Utility Vehicle - 4WD Automatic (variable gear ratios) 23 20
Then we need to eliminate all duplicate entries in our data and find unique rows for our data.
factors = unique(dataarray[ ,1:5])
factors
## year drive cyl displ fuel
## 1 1 2 1 1 3
## 108 1 3 3 3 3
## 394 1 4 3 1 2
## 423 2 3 1 1 3
## 1266 2 4 3 2 3
## 1270 3 2 3 3 3
## 1272 3 3 3 1 2
## 1290 3 4 1 3 3
## 1328 3 4 2 3 2
rownames(factors)
## [1] "1" "108" "394" "423" "1266" "1270" "1272" "1290" "1328"
We then use those rows to pull in the response variable to our entries.
#Select Response Variables
responsevar = dataarray$hwy[index=c(1,108,394,423,1266,1270,1272,1290,1328)]
#Combine data
finaldata = cbind(factors,responsevar)
finaldata
## year drive cyl displ fuel responsevar
## 1 1 2 1 1 3 27
## 108 1 3 3 3 3 22
## 394 1 4 3 1 2 30
## 423 2 3 1 1 3 27
## 1266 2 4 3 2 3 15
## 1270 3 2 3 3 3 18
## 1272 3 3 3 1 2 26
## 1290 3 4 1 3 3 23
## 1328 3 4 2 3 2 17
We then calculate the signal to noise ratio for our experimental runs.
SN=-10*log10(1/finaldata$responsevar^2)
SN
## [1] 28.63 26.85 29.54 28.63 23.52 25.11 28.30 27.23 24.61
bestsn = which(SN==max(SN))
bestsn
## [1] 3
We find that our best solution is the one with the highest signal to noise ratio. This for us is entry 3 with levels 1, 4, 3, 1, 2 for our factors year, drive, cyl, displ and fuel respectively.
Lastly we run another anova on the new experimental array.
TaguchiModel = lm(finaldata$responsevar~finaldata$year+finaldata$drive+finaldata$cyl+finaldata$displ+finaldata$fuel, data=finaldata)
anova(TaguchiModel)
## Warning: ANOVA F-tests on an essentially perfect fit are unreliable
## Analysis of Variance Table
##
## Response: finaldata$responsevar
## Df Sum Sq Mean Sq F value Pr(>F)
## finaldata$year 2 56.9 28.4
## finaldata$drive 2 23.3 11.6
## finaldata$cyl 2 55.2 27.6
## finaldata$displ 2 80.2 40.1
## Residuals 0 0.0
Unfortunately this anova yields a perfect fit which is unreliable, therefore we fail to reject the null hypothesis using a Taguchi design.
Next we graph Q-Q plots to check our data in our models for normality. If the data is not normal the results of the analysis may not be valid.
#QQ plots for residuals of first model
qqnorm(residuals(model))
qqline(residuals(model))
#QQ plots for residuals of second model
qqnorm(residuals(TaguchiModel))
qqline(residuals(TaguchiModel))
We also perform the Tukey test which is used to determine which groups in the sample differ as opposed to the anova which tells whether or not there are differences in groups.
#Tukey HSD test for first model
TukeyHSD(model, ordered=FALSE, conf.level = .95)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = hwy ~ year + drive + cyl + displ + fuel, data = datasub)
##
## $year
## diff lwr upr p adj
## 2000-2005->2010 -3.0641 -3.249 -2.8796 0e+00
## 2005-2010->2010 -2.7297 -2.911 -2.5480 0e+00
## 2005-2010-2000-2005 0.3344 0.146 0.5228 1e-04
##
## $drive
## diff lwr upr p adj
## 4-Wheel Drive-2-Wheel Drive 2.486 -1.4196 6.392 0.3586
## Front-Wheel Drive-2-Wheel Drive 10.956 7.0657 14.847 0.0000
## Rear-Wheel Drive-2-Wheel Drive 4.050 0.1598 7.941 0.0376
## Front-Wheel Drive-4-Wheel Drive 8.470 8.0850 8.855 0.0000
## Rear-Wheel Drive-4-Wheel Drive 1.564 1.1777 1.951 0.0000
## Rear-Wheel Drive-Front-Wheel Drive -6.906 -7.0754 -6.736 0.0000
##
## $cyl
## diff lwr upr p adj
## >10-<6 -7.021 -7.476 -6.566 0
## 6-10-<6 -2.903 -3.057 -2.748 0
## 6-10->10 4.118 3.666 4.570 0
##
## $displ
## diff lwr upr p adj
## >6-<3 -2.2674 -2.6515 -1.883 0
## 3-6-<3 -0.3612 -0.5153 -0.207 0
## 3-6->6 1.9062 1.5237 2.289 0
##
## $fuel
## diff lwr upr p adj
## Premium-Diesel -8.628 -9.289 -7.9674 0
## Regular-Diesel -9.678 -10.335 -9.0205 0
## Regular-Premium -1.050 -1.204 -0.8948 0
Lastly we plot a fitted model against the residuals. We do not see a very large degree of variation among the plot.
#Plot of Fitted vs Residuals of original linear model
plot(fitted(model), residuals(model))
#Plot of Fitted vs Residuals of Taguchi models
plot(fitted(TaguchiModel), residuals(TaguchiModel))
Overrall the results of our model lead us to believe our original linear model is adequate and does explain the effect of the cylinders, displacement, fuel type, year, and drive type on the variance in highway gas mileage. However we cannot use the Taguchi method anova as it’s perfect leaves us skeptical of the actual fit and any model adequacy measures.
No literature was used
The R package in which this data was found can be located at https://github.com/hadley/fueleconomy
It is possible that the conclusions of our analysis are the results of chance. One concern is that the data for each car was collected only once. We don’t know the condition of the tests and it is possible that the conditions of each test may have resulted in better or worse outputs for highway gas mileage. There are other factors involved as well. Two vehicles could have the exact same engine set up yet it is possible that they have different weights therefore effecting their mileage output.
Also since our data appears to not be normal, and data being normal is an assumption for the ANOVA test we can instead perform a Kruskal-Wallis rank sum test. Kruskal-Wallis is a non-parametric to test whether the different groups all have the same distribution. We use this on each of our factors.
#Kruskal-Wallis test on Fuel Type factor
kruskal.test(hwy~fuel, data=datasub)
##
## Kruskal-Wallis rank sum test
##
## data: hwy by fuel
## Kruskal-Wallis chi-squared = 447.3, df = 2, p-value < 2.2e-16
#Kruskal-Wallis test on Year factor
kruskal.test(hwy~year, data=datasub)
##
## Kruskal-Wallis rank sum test
##
## data: hwy by year
## Kruskal-Wallis chi-squared = 614.1, df = 2, p-value < 2.2e-16
#Kruskal-Wallis test on Drive Type factor
kruskal.test(hwy~drive, data=datasub)
##
## Kruskal-Wallis rank sum test
##
## data: hwy by drive
## Kruskal-Wallis chi-squared = 4711, df = 3, p-value < 2.2e-16
#Kruskal-Wallis test on Cylinder Number factor
kruskal.test(hwy~cyl, data=datasub)
##
## Kruskal-Wallis rank sum test
##
## data: hwy by cyl
## Kruskal-Wallis chi-squared = 5382, df = 2, p-value < 2.2e-16
#Kruskal-Wallis test on Displacement factor
kruskal.test(hwy~displ, data=datasub)
##
## Kruskal-Wallis rank sum test
##
## data: hwy by displ
## Kruskal-Wallis chi-squared = 5262, df = 2, p-value < 2.2e-16