The data analyzed in this recipe is from one of Hadley Wickham’s datasets. In this recipe, the fuel economy dataset was used.
This dataset is the result of vehicle testing done at the Environmental Protection Agency’s National Vehicle and Fuel Emmisions laboratory in Ann Arbor, Michigan.
# Read in dataset
library("fueleconomy", lib.loc="~/R/win-library/3.1")
x <- vehicles
The original data being examined has 33,442 observations of 12 variables.
head(x)
## id make model year class
## 1 27550 AM General DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 2 28426 AM General DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 3 27549 AM General FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 4 28425 AM General FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 5 1032 AM General Post Office DJ5 2WD 1985 Special Purpose Vehicle 2WD
## 6 1033 AM General Post Office DJ8 2WD 1985 Special Purpose Vehicle 2WD
## trans drive cyl displ fuel hwy cty
## 1 Automatic 3-spd 2-Wheel Drive 4 2.5 Regular 17 18
## 2 Automatic 3-spd 2-Wheel Drive 4 2.5 Regular 17 18
## 3 Automatic 3-spd 2-Wheel Drive 6 4.2 Regular 13 13
## 4 Automatic 3-spd 2-Wheel Drive 6 4.2 Regular 13 13
## 5 Automatic 3-spd Rear-Wheel Drive 4 2.5 Regular 17 16
## 6 Automatic 3-spd Rear-Wheel Drive 6 4.2 Regular 13 13
summary(x)
## id make model year
## Min. : 1 Length:33442 Length:33442 Min. :1984
## 1st Qu.: 8361 Class :character Class :character 1st Qu.:1991
## Median :16724 Mode :character Mode :character Median :1999
## Mean :17038 Mean :1999
## 3rd Qu.:25265 3rd Qu.:2008
## Max. :34932 Max. :2015
##
## class trans drive cyl
## Length:33442 Length:33442 Length:33442 Min. : 2.000
## Class :character Class :character Class :character 1st Qu.: 4.000
## Mode :character Mode :character Mode :character Median : 6.000
## Mean : 5.772
## 3rd Qu.: 6.000
## Max. :16.000
## NA's :58
## displ fuel hwy cty
## Min. :0.000 Length:33442 Min. : 9.00 Min. : 6.00
## 1st Qu.:2.300 Class :character 1st Qu.: 19.00 1st Qu.: 15.00
## Median :3.000 Mode :character Median : 23.00 Median : 17.00
## Mean :3.353 Mean : 23.55 Mean : 17.49
## 3rd Qu.:4.300 3rd Qu.: 27.00 3rd Qu.: 20.00
## Max. :8.400 Max. :109.00 Max. :138.00
## NA's :57
# As seen below, there are 33,442 observations of 12 variables, including: 56 factors and 6 numeric variables.
str(x)
## Classes 'tbl_df', 'tbl' and 'data.frame': 33442 obs. of 12 variables:
## $ id : int 27550 28426 27549 28425 1032 1033 3347 13309 13310 13311 ...
## $ make : chr "AM General" "AM General" "AM General" "AM General" ...
## $ model: chr "DJ Po Vehicle 2WD" "DJ Po Vehicle 2WD" "FJ8c Post Office" "FJ8c Post Office" ...
## $ year : int 1984 1984 1984 1984 1985 1985 1987 1997 1997 1997 ...
## $ class: chr "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" ...
## $ trans: chr "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" ...
## $ drive: chr "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" ...
## $ cyl : int 4 4 6 6 4 6 6 4 4 6 ...
## $ displ: num 2.5 2.5 4.2 4.2 2.5 4.2 3.8 2.2 2.2 3 ...
## $ fuel : chr "Regular" "Regular" "Regular" "Regular" ...
## $ hwy : int 17 17 13 13 17 13 21 26 28 26 ...
## $ cty : int 18 18 13 13 16 13 14 20 22 18 ...
# Cleaning data:
x$make <- as.factor(x$make)
x$trans <- as.factor(x$trans)
x$drive <- as.factor(x$drive)
x$cyl <- as.factor(x$cyl)
x$fuel <- as.factor(x$fuel)
# Make 'year' into a factor by defining intervals
x$year <- cut(x$year, c(1980,1990,2000,2010,2020))
x$year <- as.character(x$year)
x$year[x$year == "(1.98e+03,1.99e+03]"] <- "1980s"
x$year[x$year == "(1.99e+03,2e+03]"] <- "1990s"
x$year[x$year == "(2e+03,2.01e+03]"] <- "2000s"
x$year[x$year == "(2.01e+03,2.02e+03]"] <- "2010s"
x$year <- as.factor(x$year)
# Subset only top 4 car makes
x <- x[x$make %in% c("Chevrolet","Ford", "Dodge", "GMC"), ]
x$make <- droplevels(x$make)
# Subset top 4 car drive types
x$drive <- as.factor(x$drive)
x <- x[x$drive %in% c("2-Wheel Drive","4-Wheel or All-Wheel Drive", "Front-Wheel Drive", "Rear-Wheel Drive"), ]
x$drive <- droplevels(x$drive)
# Subset top 4 fuel types
x$fuel <- as.factor(x$fuel)
x <- x[x$fuel %in% c("CNG","Diesel", "Premium", "Regular"), ]
x$fuel <- droplevels(x$fuel)
# Subset top 3 cylinders
x$cyl <- as.factor(x$cyl)
x <- x[x$cyl %in% c("4", "6", "8"), ]
x$cyl <- droplevels(x$cyl)
x$cty <- as.numeric(x$cty)
The continuous variables in this dataset are mileages of the cars - both city and highway.
The response variable being analyzed in this dataset is the city mileage of the car in miles per gallon.
As seen below, the log of the hourly wage ranged from 6 to 138 mpg, with a mean of 17.49.
summary(x$cty)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 13.00 15.00 15.51 17.00 47.00
The data is originally consisted of 12 variables, but was subseted down to 6 columns: 5 factors, and 1 continuous variable.
There was no randomization in this recipe because the data points are based on observations of cars.
A Taguchi design will be used to analyze which, if any, of the five factors are responsible for variation in the response variable, city mileage.
The experiment was based on observations, and was therefore not randomized.
The experiment had no replicates or repeated measures.
There was no blocking in this recipe.
The boxplot below shows the the city mileage of the cars versus the make of the car. All of the cars had relatively similar medians, but the ranges were quite varied. Ford cars had the greatest range in mileage, and GMC had that smallest range.
par(mfrow=c(1,1))
attach(x)
## The following object is masked _by_ .GlobalEnv:
##
## model
plot(cty ~ make, xlab = "Make of Car", ylab = "City Mileage (mpg)", main="City Mileage by Make of Car" )
The boxplot below shows the city mileage of the cars versus the year the car was built. Cars in the 80s, 90s, and 00s, all had similar medians. The medians of cars of the 2010s had significantly higher city mileage. The ranges were fairly consistent among time intervals.
plot(cty ~ year, xlab = "Year of Car", ylab = "City Mileage (mpg)", main="City Mileage by Year of Car" )
The boxplot below shows city mileage versus the drive type of the car. Front-Wheel drive cars have the greatest median mileage, and tend to be positively skewed. The other three, 2-wheel drive, Rear-wheel drive and All-wheel drive have similarly lower medians and smaller ranges.
plot(cty ~ drive, xlab = "Drive Type", ylab = "City Mileage (mpg)", main="City Mileage by Drive Type")
The boxplot below shows the city mileage versus the number of cylinders in the car. 4-cylinder cars have the highest medians, and also the greatest range. 6 and 8-cylinder cars have much smaller ranges.
plot(cty ~ cyl, xlab = "Number of Cylinders", ylab = "City Mileage (mpg)", main="City Mileage by Cylinder Type" )
The boxplot below shows the relationship between fuel type and city mileage. Regular gas cars have the largest range, whereas compressed natural gas (CNG) cars has the smallest range. The medians of Diesel, Premium, and Regular are relatively similar, but the median of CNG is fairly lower.
plot(cty ~ fuel, xlab = "Fuel Type", ylab = "City Mileage (mpg)", main="City Mileage by Fuel Type" )
Comparing the average city gas mileage to five factors: drive type, year, make, cylinders, and fuel type. H0: There is no difference between the means of the samples. The variation in city gas mileage is not caused by anything other than randomization. Ha: The difference in variation of city gas mileage can caused by something other than randomization.
# Assign all factors an integer value
x$year <- as.character(x$year)
x$year[x$year == "1980s"] <- 1
x$year[x$year == "1990s"] <- 2
x$year[x$year == "2000s"] <- 3
x$year[x$year == "2010s"] <- 4
x$year <- as.factor(x$year)
x$drive <- as.character(x$drive)
x$drive[x$drive=="2-Wheel Drive"] <- 1
x$drive[x$drive=="4-Wheel or All-Wheel Drive"] <- 2
x$drive[x$drive=="Front-Wheel Drive"] <-3
x$drive[x$drive=="Rear-Wheel Drive"] <-4
x$drive <- as.factor(x$drive)
x$fuel <- as.character(x$fuel)
x$fuel[x$fuel=="CNG"] <- 1
x$fuel[x$fuel=="Diesel"] <- 2
x$fuel[x$fuel=="Premium"] <- 3
x$fuel[x$fuel=="Regular"] <- 4
x$fuel <- as.factor(x$fuel)
x$make<-as.character(x$make)
x$make[x$make=="Chevrolet"] <- 1
x$make[x$make=="Ford"] <- 2
x$make[x$make=="Dodge"] <- 3
x$make[x$make=="GMC"] <- 4
x$make <- as.factor(x$make)
x$cyl <- as.character(x$cyl)
x$cyl[x$cyl == "4"] <- 1
x$cyl[x$cyl == "6"] <- 2
x$cyl[x$cyl == "8"] <- 3
x$cyl <- as.factor(x$cyl)
x <- na.omit(x)
x$make <- as.integer(x$make)
x$fuel <- as.integer(x$fuel)
x$cyl <- as.integer(x$cyl)
x$drive <- as.integer(x$drive)
x$year<- as.integer(x$year)
model <- aov(cty~make+fuel+cyl+drive+year,data=x)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## make 1 3756 3756 731.65 <2e-16 ***
## fuel 1 476 476 92.63 <2e-16 ***
## cyl 1 88294 88294 17199.04 <2e-16 ***
## drive 1 440 440 85.61 <2e-16 ***
## year 1 2530 2530 492.78 <2e-16 ***
## Residuals 9669 49637 5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-values of all of the main effects are less than an alpha of 0.05. Therefore, we can fail to reject the null hypothesis, and instead support the alternative, that something other than randomization causes the variation in city mileage.
Next, a Taguchi Design was used to find which combination of factors and levels results in the highest signal-to-noise ratio.
#install.packages("qualityTools")
library(qualityTools)
## Warning: package 'qualityTools' was built under R version 3.1.2
#install.packages("DoE.base")
library("DoE.base")
## Warning: package 'DoE.base' was built under R version 3.1.2
## Loading required package: grid
## Loading required package: conf.design
## Warning: package 'conf.design' was built under R version 3.1.2
##
## Attaching package: 'DoE.base'
##
## The following objects are masked from 'package:stats':
##
## aov, lm
##
## The following object is masked from 'package:graphics':
##
## plot.design
array = oa.design(factor.names=c("make", "fuel", "cyl", "drive", "year"), nlevels=c(4,4,3, 4, 4 ), columns="min3")
array
## make fuel cyl drive year
## 1 2 3 1 1 4
## 2 3 1 2 1 1
## 3 2 4 1 3 1
## 4 3 3 3 4 2
## 5 1 2 2 4 1
## 6 3 3 1 3 3
## 7 2 4 3 1 2
## 8 2 1 2 3 2
## 9 3 1 1 2 4
## 10 3 3 2 1 4
## 11 2 1 3 2 4
## 12 1 1 2 3 2
## 13 4 2 1 1 3
## 14 4 4 3 3 1
## 15 4 1 1 3 2
## 16 4 2 2 2 2
## 17 1 3 1 4 2
## 18 1 1 1 1 1
## 19 2 1 1 4 3
## 20 4 3 1 2 1
## 21 3 4 1 1 2
## 22 2 2 3 1 3
## 23 3 2 3 2 2
## 24 3 2 1 4 1
## 25 3 2 2 3 4
## 26 4 4 1 4 4
## 27 4 1 3 1 1
## 28 4 4 2 2 3
## 29 1 1 3 2 4
## 30 3 4 2 2 3
## 31 2 2 1 2 2
## 32 4 2 3 3 4
## 33 1 4 3 4 4
## 34 4 1 2 4 3
## 35 4 3 2 1 4
## 36 1 3 2 3 3
## 37 1 3 3 2 1
## 38 1 2 1 3 4
## 39 1 4 2 1 2
## 40 4 3 3 4 2
## 41 2 3 3 3 3
## 42 2 3 2 2 1
## 43 1 2 3 1 3
## 44 2 4 2 4 4
## 45 3 4 3 3 1
## 46 3 1 3 4 3
## 47 1 4 1 2 3
## 48 2 2 2 4 1
## class=design, type= oa
#Merge the Taguchi Design with the response variable from the data
new_x <- merge(array, x, by=c("make", "fuel", "cyl", "drive", "year"), all=FALSE)
# Identify unique rows
unique_new = unique(new_x[,1:5])
unique_new
## make fuel cyl drive year
## 1 1 2 1 3 4
## 2 1 3 2 3 3
## 6 1 4 1 2 3
## 42 1 4 3 4 4
## 74 2 4 1 3 1
## 199 2 4 2 4 4
## 221 3 1 3 4 3
## 226 3 3 1 3 3
## 230 3 4 2 2 3
## 272 4 4 1 4 4
## 280 4 4 2 2 3
rownames(unique_new)
## [1] "1" "2" "6" "42" "74" "199" "221" "226" "230" "272" "280"
cty = new_x$cty[index=c(1,2,6,42,74,199,221,226,230,272,280)]
cty
## [1] 27 16 16 12 21 16 12 19 16 18 13
new_array = cbind(unique_new,cty)
new_array
## make fuel cyl drive year cty
## 1 1 2 1 3 4 27
## 2 1 3 2 3 3 16
## 6 1 4 1 2 3 16
## 42 1 4 3 4 4 12
## 74 2 4 1 3 1 21
## 199 2 4 2 4 4 16
## 221 3 1 3 4 3 12
## 226 3 3 1 3 3 19
## 230 3 4 2 2 3 16
## 272 4 4 1 4 4 18
## 280 4 4 2 2 3 13
# Caluclate the Signal-to-Noise Ratio
s_n = -10*log10(1/new_array$cty^2)
s_n
## [1] 28.62728 24.08240 24.08240 21.58362 26.44439 24.08240 21.58362
## [8] 25.57507 24.08240 25.10545 22.27887
#Maximum S/N
max_s_n=which(s_n==max(s_n))
max_s_n
## [1] 1
new_array[max_s_n,]
## make fuel cyl drive year cty
## 1 1 2 1 3 4 27
#make=1 - Chevrolet
#fuel=2 - Diesel
#cylinder=1 - 4 cylinders
#drive=3 - front-wheel drive
#year=4 - 2010s
new_model = aov(cty~make+fuel+cyl+drive+year, data=new_array)
summary(new_model)
## Df Sum Sq Mean Sq
## make 3 16.49 5.50
## fuel 3 141.13 47.04
## cyl 2 27.16 13.58
## drive 2 6.12 3.06
# Unfortunately, there were not enough values to perform an ANOVA
Assumptions for ANOVA tests include that the data is normally distributed.
shapiro.test(new_x$cty)
##
## Shapiro-Wilk normality test
##
## data: new_x$cty
## W = 0.9535, p-value = 6.738e-09
The null of a Shapiro-Wilk test states that the data presented is normally distributed. Unfortunately, the p-value is less than an alpha of 0.05, and therefore, we must reject the null hypothesis. This suggests, instead, that the city gas mileage is not normally distributed.
par(mfrow = c(1,1))
qqnorm(residuals(model))
qqline(residuals(model))
The Q-Q Normality Plots of the residuals for the model do not appear to be normal. When normal, the data points would lay linearly, and in this case, they do not seem to lie along the line. This suggests that the data may not be parametric.
plot(fitted(model),residuals(model))
The plots of the fitted values versus the residual values are supposed to appear symmetric over a residual of 0. The datapoints of the model do appear to be symmetric over the length of the dynamic range, but do not seem to be symmetric over 0.
Based on the model adequacy testing, it appears that the data is not normally distributed, and that this may not be the best fit of the data.
A Kruskal-Wallis test can be used to evaluate whether groups’ distributions are identical, without the assumption that the data is normally distributed.
With the kruskal-wallis test, we could only analyze the main effect of one factor at a time.
kruskal.test(cty ~ year, data=x)
##
## Kruskal-Wallis rank sum test
##
## data: cty by year
## Kruskal-Wallis chi-squared = 488.9773, df = 3, p-value < 2.2e-16
The p-value of this Kruskal-Wallis test is less than an alpha of 0.05. Therefore, you must reject the null - that variation in city mileage cannot be explained by anything other than randomization. This suggests that year of creation may explain variation in the city mileage.
kruskal.test(cty ~ make, data=x)
##
## Kruskal-Wallis rank sum test
##
## data: cty by make
## Kruskal-Wallis chi-squared = 265.6952, df = 3, p-value < 2.2e-16
The p-value of this Kruskal-Wallis test is less than an alpha of 0.05. Therefore, you must reject the null - that variation in city mileage cannot be explained by anything other than randomization. This suggests that car make may explain variation in the city mileage.
kruskal.test(cty ~ fuel, data=x)
##
## Kruskal-Wallis rank sum test
##
## data: cty by fuel
## Kruskal-Wallis chi-squared = 206.7565, df = 3, p-value < 2.2e-16
The p-value of this Kruskal-Wallis test is less than an alpha of 0.05. Therefore, you must reject the null - that variation in city mileage cannot be explained by anything other than randomization. This suggests that fuel type may explain variation in the city mileage.
kruskal.test(cty ~ cyl, data=x)
##
## Kruskal-Wallis rank sum test
##
## data: cty by cyl
## Kruskal-Wallis chi-squared = 6070.349, df = 2, p-value < 2.2e-16
The p-value of this Kruskal-Wallis test is less than an alpha of 0.05. Therefore, you must reject the null - that variation in city mileage cannot be explained by anything other than randomization. This suggests that number of cylinders may explain variation in the city mileage.
kruskal.test(cty ~ drive, data=x)
##
## Kruskal-Wallis rank sum test
##
## data: cty by drive
## Kruskal-Wallis chi-squared = 4076.988, df = 3, p-value < 2.2e-16
The p-value of this Kruskal-Wallis test is less than an alpha of 0.05. Therefore, you must reject the null - that variation in city mileage cannot be explained by anything other than randomization. This suggests that drive type may explain variation in the city mileage.