The purpose of this project is to use the Taguchi design method to determine the best combination of inputs to achieve the optimum fuel economy of vehicles in the US.
The dataset used for this experiment is from the “2014 EPA fuel economy” package in R (loaded as fueleconomy), which is used to explore the factors that influence vehicle fuel economy in the US. The original dataset has 33442 observations of 12 variables. In this study, we focus on the effects of the following 5 factors:
1. year (year of manufacture)
2. make (vehicle make)
3. drive (drivetrain)
4. cyl (number of cylinders)
5. disp (vehicle displacement)
The null hypothesis and alternative hypothesis of the experiment can be stated as:
H0: The variation in vehicle highway fuel economy is due to sample randomization only (i.e., the selected 5 factors cannot explain the variation in vehicle highway fuel economy).
HA: The variation in vehicle highway fuel economy is due to something other than sample randomization (i.e., the selected 5 factors may affect the highway fuel economy).
To test the hypothesis, we first conduct an analysis of variance on the selected 5 variables to examine the variation in vehicle highway fuel economy. Then we apply the Taguchi design method to the same five factors, compare the model estimates (including residuals, coefficients and ANOVA results) with the initial analysis of variance results, and determine the optimum point.
#Read in the data
library(fueleconomy)
## Warning: package 'fueleconomy' was built under R version 3.1.2
data1<- vehicles
summary(data1)
## id make model year
## Min. : 1 Length:33442 Length:33442 Min. :1984
## 1st Qu.: 8361 Class :character Class :character 1st Qu.:1991
## Median :16724 Mode :character Mode :character Median :1999
## Mean :17038 Mean :1999
## 3rd Qu.:25265 3rd Qu.:2008
## Max. :34932 Max. :2015
##
## class trans drive cyl
## Length:33442 Length:33442 Length:33442 Min. : 2.00
## Class :character Class :character Class :character 1st Qu.: 4.00
## Mode :character Mode :character Mode :character Median : 6.00
## Mean : 5.77
## 3rd Qu.: 6.00
## Max. :16.00
## NA's :58
## displ fuel hwy cty
## Min. :0.00 Length:33442 Min. : 9.0 Min. : 6.0
## 1st Qu.:2.30 Class :character 1st Qu.: 19.0 1st Qu.: 15.0
## Median :3.00 Mode :character Median : 23.0 Median : 17.0
## Mean :3.35 Mean : 23.6 Mean : 17.5
## 3rd Qu.:4.30 3rd Qu.: 27.0 3rd Qu.: 20.0
## Max. :8.40 Max. :109.0 Max. :138.0
## NA's :57
Sub-sampling: we select a subsample from the original database for the analysis. It contains the vehicles made by eleven major manufacturers in the US, Europe and Japan (Chevrolet, Ford, Dodge, GMC, Toyota, BMW, Nissan, Mercedes-Benz, Volkswagen, Mitsubishi and Mazda). The subsample has 18488 observations with 17 variables.
Factors: each of the five selected factors is re-grouped so that all factors have three levels:
1. cyl1: 1 if the number of cylinders is 5 or fewer; 2 if it is more than 5 and at most 8; 3 if it is 10 or more.
2. drive1: 1 if the vehicle is 4-wheel/all-wheel drive; 2 if it is front-wheel drive; 3 if it is rear-wheel drive.
3. disp1: 1 if the vehicle displacement is less than 3; 2 if it is at least 3 and less than 5; 3 if it is 5 or more.
4. year1: 1 if the vehicle was made before 2000; 2 if it was made from 2000 to 2009; 3 if it was made in 2010 or later.
5. make1: 1 if the vehicle was made by a US manufacturer; 2 if it was made by a Japanese manufacturer; 3 if it was made by a European manufacturer.
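The re-grouping rules above can also be sketched with base R's cut(). This is only an illustration on a toy data frame (the object `toy` and its values are made up), not the recoding used in this report, which relies on logical indexing. One difference to keep in mind: with these breaks a 9-cylinder engine would fall into level 3, whereas the rules above leave it unassigned.

```r
# Toy data frame standing in for the vehicles subsample (values are illustrative).
toy <- data.frame(
  cyl   = c(4, 6, 8, 12),
  displ = c(2.0, 3.0, 4.9, 5.0),
  year  = c(1995, 2000, 2009, 2010)
)

# cut() with the default right = TRUE builds intervals (a, b];
# right = FALSE builds intervals [a, b).
toy$cyl1  <- cut(toy$cyl,   breaks = c(-Inf, 5, 8, Inf), labels = 1:3)
toy$disp1 <- cut(toy$displ, breaks = c(-Inf, 3, 5, Inf), labels = 1:3, right = FALSE)
toy$year1 <- cut(toy$year,  breaks = c(-Inf, 2000, 2010, Inf), labels = 1:3, right = FALSE)

toy
```

cut() returns factors directly, so no separate as.factor() step is needed, and the interval boundaries are stated in one place.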
Response variable: in this study, we treat the continuous variable “hwy” (the vehicle fuel economy on the highway, in mpg) as the response variable.
Organization: this dataset summarizes the Environmental Protection Agency (EPA) fuel economy data from 1984 to 2015. It contains highway and city fuel economy information as well as vehicle characteristics such as number of cylinders, make, model and so on.
Randomization: the data were randomly selected among the vehicles sold in the US market. However, there is no random assignment and no random execution order in the experiment.
# Data sub-sampling
data1<-subset(data1, make %in% c("Chevrolet","Ford","Dodge","GMC","Toyota","BMW","Nissan","Mercedes-Benz","Volkswagen","Mitsubishi","Mazda"))
head(data1)
## id make model year class trans
## 1234 29823 BMW 128ci Convertible 2010 Subcompact Cars Automatic (S6)
## 1235 29824 BMW 128ci Convertible 2010 Subcompact Cars Manual 6-spd
## 1236 30007 BMW 128ci Convertible 2011 Subcompact Cars Manual 6-spd
## 1237 30008 BMW 128ci Convertible 2011 Subcompact Cars Automatic (S6)
## 1238 31152 BMW 128ci Convertible 2012 Subcompact Cars Automatic (S6)
## 1239 31153 BMW 128ci Convertible 2012 Subcompact Cars Manual 6-spd
## drive cyl displ fuel hwy cty
## 1234 Rear-Wheel Drive 6 3 Premium 27 18
## 1235 Rear-Wheel Drive 6 3 Premium 28 18
## 1236 Rear-Wheel Drive 6 3 Premium 28 18
## 1237 Rear-Wheel Drive 6 3 Premium 27 18
## 1238 Rear-Wheel Drive 6 3 Premium 27 18
## 1239 Rear-Wheel Drive 6 3 Premium 28 18
attach(data1)
#data recoding
data1$cyl1<-NA
data1$cyl1[cyl<=5]<-1
data1$cyl1[cyl>5&cyl<=8]<-2
data1$cyl1[cyl>=10]<-3
data1$drive1<-NA
data1$drive1[drive=="4-Wheel Drive"|drive=="4-Wheel or All-Wheel Drive"|drive=="All-Wheel Drive"]<-1
data1$drive1[drive=="Front-Wheel Drive"]<-2
data1$drive1[drive=="Rear-Wheel Drive"]<-3
data1$disp1<-NA
data1$disp1[displ<3]<-1
data1$disp1[displ>=3&displ<5]<-2
data1$disp1[displ>=5]<-3
data1$year1<-NA
data1$year1[year <2000]<-1
data1$year1[year >=2000 & year<2010]<-2
data1$year1[year >=2010]<-3
data1$make1<-NA
data1$make1[make =="Chevrolet"|make =="Ford"|make =="Dodge"]<-1
data1$make1[make == "Toyota"|make=="Nissan"|make == "Mitsubishi"|make == "Mazda"]<-2
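#NOTE: GMC is a US make but is assigned to level 3 here; all output shown afterwards reflects this grouping
data1$make1[make == "GMC"|make=="BMW"|make == "Mercedes-Benz"|make == "Volkswagen"]<-3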
attach(data1)
## The following objects are masked from data1 (pos = 3):
##
## class, cty, cyl, displ, drive, fuel, hwy, id, make, model,
## trans, year
data1$make1<-as.factor(make1)
data1$year1<-as.factor(year1)
data1$cyl1<-as.factor(cyl1)
data1$disp1<-as.factor(disp1)
data1$drive1<-as.factor(drive1)
attach(data1)
## The following objects are masked from data1 (pos = 3):
##
## class, cty, cyl, cyl1, disp1, displ, drive, drive1, fuel, hwy,
## id, make, make1, model, trans, year, year1
##
## The following objects are masked from data1 (pos = 4):
##
## class, cty, cyl, displ, drive, fuel, hwy, id, make, model,
## trans, year
data1<-na.omit(data1)
summary(data1[,c(13:17)])
## cyl1 drive1 disp1 year1 make1
## 1: 6064 1:4803 1:6936 1:9737 1:8285
## 2:12228 2:5415 2:7122 2:5768 2:4613
## 3: 196 3:8270 3:4430 3:2983 3:5590
str(data1[,c(13:17)])
## Classes 'tbl_df', 'tbl' and 'data.frame': 18488 obs. of 5 variables:
## $ cyl1 : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 2 2 2 2 ...
## $ drive1: Factor w/ 3 levels "1","2","3": 3 3 3 3 3 3 3 3 3 3 ...
## $ disp1 : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 2 2 2 2 ...
## $ year1 : Factor w/ 3 levels "1","2","3": 3 3 3 3 3 3 3 3 2 2 ...
## $ make1 : Factor w/ 3 levels "1","2","3": 3 3 3 3 3 3 3 3 3 3 ...
In this sample recipe, the effects of the selected 5 factors on the vehicle highway fuel economy are studied. First, an analysis of variance is performed on the data to check whether the variation in vehicle highway fuel economy is due to pure sample randomization.
Then we apply the Taguchi design method to the same data, estimate the residuals, coefficients and ANOVA, and estimate the optimum. By doing so, we can test the hypothesis and reach an optimum efficiently.
As discussed in the previous section, the data are randomly selected; however, they are not randomly assigned or executed.
There are replicates for the same vehicle make in different years; however, no blocking is used in this experiment.
The boxplots show that the vehicle highway fuel economy varies considerably among the levels of the selected factors, indicating that the variation in highway fuel economy may be due to the variation in manufacturing year, vehicle make, number of cylinders, drivetrain and/or displacement. Comparatively speaking, number of cylinders, drivetrain and displacement seem to have the strongest effects, as the medians differ most among the levels of these three factors.
#boxplots
par(mfrow=c(2,3))
boxplot(hwy~year1,xlab="Year",ylab="Highway fuel economy (mpg)")
boxplot(hwy~make1,xlab="Make",ylab="Highway fuel economy (mpg)")
boxplot(hwy~cyl1,xlab="Number of cylinders",ylab="Highway fuel economy (mpg)")
boxplot(hwy~drive1,xlab="Drivetrain",ylab ="Highway fuel economy (mpg)")
boxplot(hwy~disp1,xlab="Displacement",ylab="Highway fuel economy (mpg)")
According to the analysis of variance results, the main effects of all five selected variables are statistically significant at the 0.05 level (p-value <2e-16). Therefore we reject the null hypothesis that the variation in vehicle highway fuel economy is due to sample randomization only: year, make, cyl, drive and disp are all shown to have an effect on the variation in vehicle highway fuel economy.
model1 = aov(hwy~year1+make1+cyl1+drive1+disp1,data=data1)
summary(model1)
## Df Sum Sq Mean Sq F value Pr(>F)
## year1 2 46759 23380 2233 <2e-16 ***
## make1 2 26967 13484 1288 <2e-16 ***
## cyl1 2 222985 111492 10646 <2e-16 ***
## drive1 2 97720 48860 4666 <2e-16 ***
## disp1 2 25792 12896 1231 <2e-16 ***
## Residuals 18477 193497 10
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We then apply the Taguchi design to the same data and estimate the residuals, coefficients and ANOVA. Since we have five factors with 3 levels each, an L18 design array with 18 runs is constructed and saved in “array”. We then use the merge function to select the matching observations for these 18 runs in the original data. However, due to data limitations, only 10 unique matching factor combinations could be identified. As a result, the orthogonal array with 10 observations is stored in “orthogonalarray”. Note that in an ideal experimental design we would have exactly 18 unique matching observations; however, since the data used in this analysis were collected before the experiment was designed, not all required combinations of the factors are present.
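A minimal sketch of how one might check which design runs have no matching observation, using a left join instead of an inner join. The objects `runs` and `obs` and their values are toy stand-ins for the design array and the data, not objects from this analysis.

```r
# Toy design runs and toy observations (all values illustrative).
runs <- data.frame(A = c(1, 1, 2, 2), B = c(1, 2, 1, 2))
obs  <- data.frame(A = c(1, 1, 2), B = c(1, 1, 2), y = c(24, 23, 30))

# all.x = TRUE keeps every design run; runs without data get y = NA.
matched <- merge(runs, obs, by = c("A", "B"), all.x = TRUE)

# Design runs for which no observation exists.
missing_runs <- unique(matched[is.na(matched$y), c("A", "B")])
missing_runs
```

With all = FALSE, as used in this report, the unmatched runs are silently dropped; the left join makes the gaps in the array explicit.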
We then estimate the S/N ratios of all the observations in the orthogonal array and take the run with the maximum S/N ratio as the optimum (row 117 of the merged data: cyl1 = 1, disp1 = 1, drive1 = 2, make1 = 2, year1 = 3). The result shows that vehicles with 5 or fewer cylinders, displacement less than 3, front-wheel drive, made by a Japanese manufacturer in 2010 or later tend to have higher highway fuel economy. More generally, small Japanese vehicles tend to have high highway fuel economy, which is consistent with marketing findings.
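For reference, the larger-the-better S/N ratio is SN = -10*log10(mean(1/y^2)); with a single response per run it reduces to 20*log10(y). The sketch below uses a hypothetical helper `sn_larger_better` (not part of any package) and, for the replicated case, toy values.

```r
# Larger-the-better signal-to-noise ratio (hypothetical helper).
sn_larger_better <- function(y) {
  -10 * log10(mean(1 / y^2))
}

# One response per run: equivalent to 20 * log10(y).
single <- sapply(c(18, 23, 16), sn_larger_better)
round(single, 2)

# Several replicates in one run (toy values): the result lies between
# the S/N ratios of the smallest and the largest response.
replicated <- sn_larger_better(c(24, 23, 20))
round(replicated, 2)
```

Computing the ratio over all replicates of a run, rather than over one selected observation, is the usual way to let the S/N ratio reflect within-run variability.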
The last step is to try to replicate the ANOVA results of model1. Since only 10 observations were obtained in the orthogonal array, we cannot fit a full analysis of variance with five three-level factors. We therefore select four of the five variables to construct an ANOVA for a general comparison. As expected, the ANOVA estimates are very different from those of the previous model: the effects are much weaker than in model1, and none of the four factors is significant at the 0.1 level (the smallest p-value is 0.12, for disp1). This is partly because the sample size has been reduced drastically. Consequently, the new ANOVA on its own does not allow us to reject the null hypothesis; the rejection rests on model1. For future analysis, we recommend trying other combinations of variables or including more variables (in this study we tried five different combinations with two re-grouping strategies, 3-level and 5-level, and the one with 10 matching cases gave the best result).
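The degrees-of-freedom bookkeeping behind this limitation can be sketched as follows. The variable names are illustrative, and the choice of year1 as the dropped factor follows the four-factor model fitted in this report.

```r
n_runs   <- 10   # unique matched runs in the orthogonal array
n_levels <- c(cyl1 = 3, disp1 = 3, drive1 = 3, make1 = 3, year1 = 3)

# Each factor needs (levels - 1) degrees of freedom for its main effect.
df_five <- sum(n_levels - 1)
df_four <- sum(n_levels[names(n_levels) != "year1"] - 1)

# Residual df = runs - 1 (intercept) - main-effect df.
resid_five <- n_runs - 1 - df_five   # negative: the model cannot be fitted
resid_four <- n_runs - 1 - df_four   # what remains if no effect is aliased

c(five_factors = resid_five, four_factors = resid_four)
```

In the four-factor fit reported here, drive1 receives only 1 degree of freedom and the residual 2, presumably because one drive1 contrast is aliased with the other factors in the incomplete array; the total is still 9.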
#Construct the Taguchi orthogonal array (L18)
library(qualityTools)
## Warning: package 'qualityTools' was built under R version 3.1.2
library(DoE.base)
## Warning: package 'DoE.base' was built under R version 3.1.2
## Loading required package: grid
## Loading required package: conf.design
##
## Attaching package: 'DoE.base'
##
## The following objects are masked from 'package:stats':
##
## aov, lm
##
## The following object is masked from 'package:graphics':
##
## plot.design
array <- oa.design(factor.names = c("cyl1","disp1","drive1","make1","year1"),nlevels = c(3,3,3,3,3),columns = "min3")
array
## cyl1 disp1 drive1 make1 year1
## 1 2 2 3 3 1
## 2 3 1 2 3 1
## 3 3 2 2 1 3
## 4 1 2 3 1 2
## 5 3 2 1 2 1
## 6 3 3 1 1 2
## 7 3 1 3 2 2
## 8 2 1 1 3 2
## 9 2 2 2 2 2
## 10 2 1 3 1 3
## 11 3 3 3 3 3
## 12 1 1 2 2 3
## 13 1 1 1 1 1
## 14 1 2 1 3 3
## 15 2 3 2 1 1
## 16 1 3 3 2 1
## 17 1 3 2 3 2
## 18 2 3 1 2 3
## class=design, type= oa
newdata <- merge(array, data1, by=c("cyl1","disp1","drive1","make1","year1"),all = FALSE)
head(newdata)
## cyl1 disp1 drive1 make1 year1 id make model
## 1 1 1 1 1 1 27602 Chevrolet T10 (S10) Blazer 4WD
## 2 1 1 1 1 1 28344 Chevrolet T10 (S10) Pickup 4WD
## 3 1 1 1 1 1 28273 Dodge Power Ram 50 Pickup 4WD
## 4 1 1 1 1 1 15512 Chevrolet Tracker 4WD Convertible
## 5 1 1 1 1 1 4019 Chevrolet T10 (S10) Blazer 4WD
## 6 1 1 1 1 1 4018 Chevrolet T10 (S10) Blazer 4WD
## year class trans
## 1 1984 Special Purpose Vehicle 4WD Manual 4-spd
## 2 1984 Standard Pickup Trucks 4WD Manual 4-spd
## 3 1984 Small Pickup Trucks 4WD Manual 4-spd
## 4 1999 Sport Utility Vehicle - 4WD Automatic 4-spd
## 5 1987 Special Purpose Vehicles Manual 4-spd
## 6 1987 Special Purpose Vehicles Automatic 4-spd
## drive cyl displ fuel hwy cty
## 1 4-Wheel or All-Wheel Drive 4 2.0 Regular 24 18
## 2 4-Wheel or All-Wheel Drive 4 2.0 Regular 23 18
## 3 4-Wheel or All-Wheel Drive 4 2.0 Regular 20 18
## 4 4-Wheel or All-Wheel Drive 4 2.0 Regular 23 20
## 5 4-Wheel or All-Wheel Drive 4 2.5 Regular 23 19
## 6 4-Wheel or All-Wheel Drive 4 2.5 Regular 21 17
#remove duplicate rows with same value combinations existing in range of columns 1:5
unique = unique(newdata[ , 1:5])
unique
## cyl1 disp1 drive1 make1 year1
## 1 1 1 1 1 1
## 117 1 1 2 2 3
## 440 1 2 1 3 3
## 449 1 2 3 1 2
## 469 2 1 1 3 2
## 511 2 1 3 1 3
## 512 2 2 2 2 2
## 751 2 2 3 3 1
## 1270 2 3 1 2 3
## 1297 3 3 3 3 3
rownames(unique)
## [1] "1" "117" "440" "449" "469" "511" "512" "751" "1270" "1297"
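#construct the orthogonal array
#NOTE: this reads the cty (city) column, so the values stored in hwy here are city mpg; all output shown afterwards reflects that
hwy = newdata$cty[c(1,117,440,449,469,511,512,751,1270,1297)]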
hwy
## [1] 18 23 16 16 17 18 15 11 13 14
orthogonalarray = cbind(unique,hwy)
orthogonalarray
## cyl1 disp1 drive1 make1 year1 hwy
## 1 1 1 1 1 1 18
## 117 1 1 2 2 3 23
## 440 1 2 1 3 3 16
## 449 1 2 3 1 2 16
## 469 2 1 1 3 2 17
## 511 2 1 3 1 3 18
## 512 2 2 2 2 2 15
## 751 2 2 3 3 1 11
## 1270 2 3 1 2 3 13
## 1297 3 3 3 3 3 14
#Compute S/N ratio
sn = -10*log10(1/orthogonalarray$hwy^2)
sn
## [1] 25.11 27.23 24.08 24.08 24.61 25.11 23.52 20.83 22.28 22.92
index = which.max(sn)
index
## [1] 2
orthogonalarray[index, ]
## cyl1 disp1 drive1 make1 year1 hwy
## 117 1 1 2 2 3 23
#ANOVA analysis
attach(orthogonalarray)
## The following object is masked _by_ .GlobalEnv:
##
## hwy
##
## The following objects are masked from data1 (pos = 7):
##
## cyl1, disp1, drive1, hwy, make1, year1
##
## The following objects are masked from data1 (pos = 8):
##
## cyl1, disp1, drive1, hwy, make1, year1
##
## The following object is masked from data1 (pos = 9):
##
## hwy
model2<-aov(hwy~make1+cyl1+disp1+drive1)
summary(model2)
## Df Sum Sq Mean Sq F value Pr(>F)
## make1 2 17.2 8.62 2.46 0.29
## cyl1 2 22.6 11.28 3.22 0.24
## disp1 2 49.7 24.83 7.09 0.12
## drive1 1 0.4 0.45 0.13 0.75
## Residuals 2 7.0 3.50
The Q-Q plots of the residuals suggest that the normality assumption is questionable for model 2. Again, this is likely because we only have 10 observations in the orthogonal array, and normality is hard to assess, and less likely to hold, in such a small sample. Comparing the Q-Q plot of model 2 with that of model 1 shows that the sample size plays an important role in satisfying the normality assumption: in model 1’s Q-Q plot, the points lie more closely on the reference line (except in the tails).
Due to the sample size limitation, the residuals vs. fitted values plot of model 2 does not tell a very convincing story (there is no obvious trend, but a trend may appear as the number of samples increases).
The bottom line is therefore that we should try to increase the sample size in future studies.
#Q-Q plots of the residuals (model 2, then model 1)
par(mfrow=c(1,1))
qqnorm(residuals(model2),ylab="Residuals (mpg)")
qqline(residuals(model2))
qqnorm(residuals(model1),ylab="Residuals (mpg)")
qqline(residuals(model1))
#residuals vs. fitted values for model 2
plot(fitted(model2), residuals(model2), xlab="Fitted values (mpg)", ylab="Residuals (mpg)")
References to the literature
http://www.epa.gov/fueleconomy/basicinformation.htm