as of August 28, 2014, superceding the version of August 24. Always use the most recent version.
Dataset under analysis includes data on vehicles from the EPA. This experiment will seek to understand which factors have an effect on highway fuel efficiency (miles per gallon).
remove(list=ls())
library("fueleconomy", lib.loc="~/R/win-library/3.1")
#view first few lines
head(vehicles)
## id make model year class
## 1 27550 AM General DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 2 28426 AM General DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 3 27549 AM General FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 4 28425 AM General FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 5 1032 AM General Post Office DJ5 2WD 1985 Special Purpose Vehicle 2WD
## 6 1033 AM General Post Office DJ8 2WD 1985 Special Purpose Vehicle 2WD
## trans drive cyl displ fuel hwy cty
## 1 Automatic 3-spd 2-Wheel Drive 4 2.5 Regular 17 18
## 2 Automatic 3-spd 2-Wheel Drive 4 2.5 Regular 17 18
## 3 Automatic 3-spd 2-Wheel Drive 6 4.2 Regular 13 13
## 4 Automatic 3-spd 2-Wheel Drive 6 4.2 Regular 13 13
## 5 Automatic 3-spd Rear-Wheel Drive 4 2.5 Regular 17 16
## 6 Automatic 3-spd Rear-Wheel Drive 6 4.2 Regular 13 13
#observe the structure of the data, ie. how many variables
str(vehicles)
## Classes 'tbl_df', 'tbl' and 'data.frame': 33442 obs. of 12 variables:
## $ id : int 27550 28426 27549 28425 1032 1033 3347 13309 13310 13311 ...
## $ make : chr "AM General" "AM General" "AM General" "AM General" ...
## $ model: chr "DJ Po Vehicle 2WD" "DJ Po Vehicle 2WD" "FJ8c Post Office" "FJ8c Post Office" ...
## $ year : int 1984 1984 1984 1984 1985 1985 1987 1997 1997 1997 ...
## $ class: chr "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" ...
## $ trans: chr "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" ...
## $ drive: chr "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" ...
## $ cyl : int 4 4 6 6 4 6 6 4 4 6 ...
## $ displ: num 2.5 2.5 4.2 4.2 2.5 4.2 3.8 2.2 2.2 3 ...
## $ fuel : chr "Regular" "Regular" "Regular" "Regular" ...
## $ hwy : int 17 17 13 13 17 13 21 26 28 26 ...
## $ cty : int 18 18 13 13 16 13 14 20 22 18 ...
#55 observations of 26 variables
attach(vehicles)
#typecast variables as factors for use in ANOVA
vehicles$year=as.factor(vehicles$year)
vehicles$class=as.factor(vehicles$class)
vehicles$trans=as.factor(vehicles$trans)
vehicles$drive=as.factor(vehicles$drive)
vehicles$cyl=as.factor(vehicles$cyl)
vehicles$make=as.factor(vehicles$make)
vehicles$model=as.factor(vehicles$model)
vehicles$displ=as.factor(vehicles$displ)
vehicles$fuel=as.factor(vehicles$fuel)
year(32 levels), class(34 levels), trans(48 levels), displ(64 levels), cyl(9 levels)
summary(vehicles)
## id make model
## Min. : 1 Chevrolet: 3461 F150 Pickup 2WD : 202
## 1st Qu.: 8361 Ford : 2807 Truck 2WD : 187
## Median :16724 Dodge : 2350 F150 Pickup 4WD : 180
## Mean :17038 GMC : 2265 Ranger Pickup 2WD: 169
## 3rd Qu.:25265 Toyota : 1727 Mustang : 154
## Max. :34932 BMW : 1395 Jetta : 150
## (Other) :19437 (Other) :32400
## year class
## 1985 : 1701 Compact Cars : 4739
## 1987 : 1247 Subcompact Cars : 4185
## 2014 : 1214 Midsize Cars : 3621
## 1986 : 1210 Standard Pickup Trucks : 2354
## 2008 : 1186 Sport Utility Vehicle - 4WD: 2091
## 2009 : 1184 Sport Utility Vehicle - 2WD: 1626
## (Other):25700 (Other) :14826
## trans drive
## Automatic 4-spd:10771 2-Wheel Drive : 507
## Manual 5-spd : 7858 4-Wheel Drive : 699
## Automatic 3-spd: 2720 4-Wheel or All-Wheel Drive: 6647
## Automatic 5-spd: 2158 All-Wheel Drive : 1267
## Manual 6-spd : 2067 Front-Wheel Drive :12233
## Automatic (S6) : 1946 Part-time 4-Wheel Drive : 96
## (Other) : 5922 Rear-Wheel Drive :11993
## cyl displ fuel hwy
## 4 :12381 2 : 2806 Regular :22622 Min. : 9.0
## 6 :11885 3 : 2573 Premium : 8617 1st Qu.: 19.0
## 8 : 7550 2.5 : 2153 Gasoline or E85: 1043 Median : 23.0
## 5 : 718 2.4 : 1721 Diesel : 874 Mean : 23.6
## 12 : 478 1.8 : 1353 Premium or E85 : 88 3rd Qu.: 27.0
## (Other): 372 (Other):22779 CNG : 58 Max. :109.0
## NA's : 58 NA's : 57 (Other) : 140
## cty
## Min. : 6.0
## 1st Qu.: 15.0
## Median : 17.0
## Mean : 17.5
## 3rd Qu.: 20.0
## Max. :138.0
##
Highway mileage is the continuous response variable.
The data are tabluated into 12 columns, with detailed information on vehicle id, make, model, year, class, transmission, drive, cylinders, displacement, fuel type, and highway and city mileage.
The data were collected by the Environmental Protection Agency (EPA) from 1985 to 2015, containing the above listed variables. Each vehicle has complete data.
Analysis of Variance will be used to understand the effects of year, class, and transmission type on highway gas mileage. Blocking will also be used to remove any variation due to possible nuisance factors, namely engine displacement and number of cylinders (determined by boxplots below, small within group variation and large between group variation).
This project seeks to discover if year, class, and transmission type may have an individual or interaction effect on highway gas mileage, while holding engine displacement and number of cylinders constant. A linear model will be designed and analysis of variance conducted to determine if the individual or joint effects are statistically significant. Blocking is necessary because it helps remove variation in the response due to possible confounding variables (displ and cyl).
There are replicates but not repeated measures. There are multiple observations for the same make and model vehicles but one vehicle will only have one predetermined set of static attributes, therefore there are no repeated measures.
Yes, the analysis of variance accounts for nuisance variables (displ and cyl) and therefore controls for them appropriately.
Discover which factors should be used as blocking variables
par(mfrow=c(1,1))
boxplot(hwy~year,las=3,xlab="Year",ylab="Highway Gas Mileage")
boxplot(hwy~class,las=3,xlab="Class",ylab="Highway Gas Mileage")
boxplot(hwy~trans,las=3,xlab="Transmission Type",ylab="Highway Gas Mileage")
boxplot(hwy~drive,las=3,xlab="Drive",ylab="Highway Gas Mileage")
boxplot(hwy~cyl,las=3,xlab="Cylinders",ylab="Highway Gas Mileage")
boxplot(hwy~displ,las=3,xlab="Displacement",ylab="Highway Gas Mileage")
boxplot(hwy~fuel,las=3,xlab="Fuel Type",ylab="Highway Gas Mileage")
displ and cyl appear to have small within group variation but large between group variation. Thus, these factors will be chosen as blocking variables.
set up factorial design
library("DoE.base", lib.loc="~/R/win-library/3.1")
## Loading required package: grid
## Loading required package: conf.design
##
## Attaching package: 'DoE.base'
##
## The following objects are masked from 'package:stats':
##
## aov, lm
##
## The following object is masked from 'package:graphics':
##
## plot.design
library("agricolae", lib.loc="~/R/win-library/3.1")
design = fac.design(nlevels=c(32,34,48),factor.names=c("year","class","trans"),blocks=2)
## creating full factorial with 52224 runs ...
head(design)
## Blocks year class trans
## 1 1 25 34 19
## 2 1 23 12 10
## 3 1 32 18 16
## 4 1 10 4 3
## 5 1 22 16 27
## 6 1 28 25 38
str(design)
## Classes 'design' and 'data.frame': 52224 obs. of 4 variables:
## $ Blocks: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ year : Factor w/ 32 levels "1","2","3","4",..: 25 23 32 10 22 28 15 4 2 31 ...
## ..- attr(*, "contrasts")= num [1:32, 1:31] -0.297 -0.278 -0.258 -0.239 -0.22 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr "1" "2" "3" "4" ...
## .. .. ..$ : chr ".L" ".Q" ".C" "^4" ...
## $ class : Factor w/ 34 levels "1","2","3","4",..: 34 12 18 4 16 25 30 18 17 21 ...
## ..- attr(*, "contrasts")= num [1:34, 1:33] -0.288 -0.271 -0.253 -0.236 -0.219 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr "1" "2" "3" "4" ...
## .. .. ..$ : chr ".L" ".Q" ".C" "^4" ...
## $ trans : Factor w/ 48 levels "1","2","3","4",..: 19 10 16 3 27 38 20 18 3 26 ...
## ..- attr(*, "contrasts")= num [1:48, 1:47] -0.245 -0.234 -0.224 -0.214 -0.203 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr "1" "2" "3" "4" ...
## .. .. ..$ : chr ".L" ".Q" ".C" "^4" ...
## - attr(*, "desnum")= num [1:52224, 1:112] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr "1" "2" "3" "4" ...
## .. ..$ : chr "Blocks2" "year.L" "year.Q" "year.C" ...
## - attr(*, "run.order")='data.frame': 52224 obs. of 3 variables:
## ..$ run.no.in.std.order: Factor w/ 52224 levels "1.1.1","4.1.2",..: 10333 5084 8448 1141 14395 20526 10808 9522 1345 13936 ...
## ..$ run.no : int 1 2 3 4 5 6 7 8 9 10 ...
## ..$ run.no.std.rp : Factor w/ 52224 levels "1.1.1","10.1.5",..: 11853 189 7664 14247 20880 34506 12909 10052 18780 19859 ...
## - attr(*, "design.info")=List of 16
## ..$ type : chr "full factorial.blocked"
## ..$ nruns : num 52224
## ..$ nfactors : int 3
## ..$ nlevels : num 32 34 48
## ..$ factor.names:List of 3
## .. ..$ year : int 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ class: int 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ trans: int 1 2 3 4 5 6 7 8 9 10 ...
## ..$ replications: num 1
## ..$ repeat.only : logi FALSE
## ..$ randomize : logi TRUE
## ..$ seed : NULL
## ..$ creator : language fac.design(nlevels = c(32, 34, 48), factor.names = c("year", "class", "trans"), blocks = 2)
## ..$ block.name : chr "Blocks"
## ..$ bbreps : num 1
## ..$ wbreps : num 1
## ..$ nblocks : num 2
## ..$ blocksize : num 26112
## ..$ block.gen : num [1, 1:12] 1 1 1 1 1 1 0 1 1 1 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : NULL
## .. .. ..$ : Named chr "year" "year" "year" "year" ...
## .. .. .. ..- attr(*, "names")= chr "year1" "year2" "year3" "year4" ...
Run analysis of variance for individual factor effects
model1=(aov(hwy~year+displ,data=vehicles))
anova(model1)
## Analysis of Variance Table
##
## Response: hwy
## Df Sum Sq Mean Sq F value Pr(>F)
## year 31 81071 2615 273 <2e-16 ***
## displ 63 694098 11017 1150 <2e-16 ***
## Residuals 33290 319025 10
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model2=(aov(hwy~class+displ,data=vehicles))
anova(model2)
## Analysis of Variance Table
##
## Response: hwy
## Df Sum Sq Mean Sq F value Pr(>F)
## class 33 455707 13809 1540 <2e-16 ***
## displ 63 339954 5396 602 <2e-16 ***
## Residuals 33288 298533 9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model3=(aov(hwy~trans+displ,data=vehicles))
anova(model3)
## Analysis of Variance Table
##
## Response: hwy
## Df Sum Sq Mean Sq F value Pr(>F)
## trans 46 160547 3490 358 <2e-16 ***
## displ 63 608935 9666 990 <2e-16 ***
## Residuals 33275 324712 10
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model4=(aov(hwy~year+cyl,data=vehicles))
anova(model4)
## Analysis of Variance Table
##
## Response: hwy
## Df Sum Sq Mean Sq F value Pr(>F)
## year 31 81069 2615 187 <2e-16 ***
## cyl 8 547024 68378 4892 <2e-16 ***
## Residuals 33344 466098 14
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model5=(aov(hwy~class+cyl,data=vehicles))
anova(model5)
## Analysis of Variance Table
##
## Response: hwy
## Df Sum Sq Mean Sq F value Pr(>F)
## class 33 455708 13809 1267 <2e-16 ***
## cyl 8 275026 34378 3154 <2e-16 ***
## Residuals 33342 363457 11
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model6=(aov(hwy~trans+cyl,data=vehicles))
anova(model6)
## Analysis of Variance Table
##
## Response: hwy
## Df Sum Sq Mean Sq F value Pr(>F)
## trans 46 160557 3490 257 <2e-16 ***
## cyl 8 481345 60168 4434 <2e-16 ***
## Residuals 33329 452289 14
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Each of the above variables have an individual effect on the highway gas mileage when blocking for engine displacement and number of cylinders.For each, the p-value is essentially zero so we can reject the H0 that each has no effect on the response. Further, any variation in the response can be solely attributed to variation in each of the factor levels.
Run analysis of variance using interaction. Data must first be subset so there is ample computational memory to execute the ANOVA command to include interaction.We will seek to find if there are interaction effects for year, class, and transmission type just for the years 1985 and 1986.
subset=subset(vehicles,year==1985:1986)
interaction=aov(hwy~year*class*trans+displ+cyl, data=subset)
anova(interaction)
## Analysis of Variance Table
##
## Response: hwy
## Df Sum Sq Mean Sq F value Pr(>F)
## year 1 21 21 2.56 0.11
## class 13 22627 1741 211.13 < 2e-16 ***
## trans 5 5502 1100 133.47 < 2e-16 ***
## displ 37 14023 379 45.97 < 2e-16 ***
## cyl 4 945 236 28.64 < 2e-16 ***
## year:class 12 113 9 1.14 0.33
## year:trans 4 53 13 1.61 0.17
## class:trans 39 712 18 2.21 3.1e-05 ***
## year:class:trans 32 209 7 0.79 0.79
## Residuals 1312 10816 8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this particular model, we fail to reject the H0 that year and class jointly have no effect on highway gas mileage in the years 1985 to 1986. Further, we fail to reject the H0 that year and transmission have no effect on highway gas mileage in the years 1985 to 1986. However, we reject the H0 that vehicle class and transmission type have no effect in the years 1985 to 1986. That is, variation in the response variable can be explained by the interaction effect between vehicle class and transmission. Since this model blocks on the nuisance variables displ and cyl, it is safe to assume the model is more accurate because variation due to erroneous factors has been accounted for.
# Shapiro-Wilk test of normality. Adequate if p < 0.
# Cannot perform test on entire dataset, so sampling was used
sampling = vehicles[sample(1:nrow(vehicles), 500, replace=FALSE),]
shapiro.test(sampling$hwy)
##
## Shapiro-Wilk normality test
##
## data: sampling$hwy
## W = 0.9756, p-value = 2.085e-07
# We reject the H0 that the data are normally distributed.
qqnorm(sampling$hwy)
qqline(sampling$hwy)
# Thus, we attempt to transform the data
shapiro.test(sampling$hwy^2) #still not normally distributed
##
## Shapiro-Wilk normality test
##
## data: sampling$hwy^2
## W = 0.9058, p-value < 2.2e-16
shapiro.test(sqrt(sampling$hwy)) #still not normally distributed
##
## Shapiro-Wilk normality test
##
## data: sqrt(sampling$hwy)
## W = 0.9911, p-value = 0.004213
shapiro.test(log(sampling$hwy)) #still not normallly distributed
##
## Shapiro-Wilk normality test
##
## data: log(sampling$hwy)
## W = 0.9922, p-value = 0.01008
shapiro.test(log10(sampling$hwy)) #still not normallly distributed
##
## Shapiro-Wilk normality test
##
## data: log10(sampling$hwy)
## W = 0.9922, p-value = 0.01008
shapiro.test(1/sampling$hwy) #still not normallly distributed
##
## Shapiro-Wilk normality test
##
## data: 1/sampling$hwy
## W = 0.951, p-value = 8.094e-12
shapiro.test(1/sqrt(sampling$hwy)) #still not normallly distributed
##
## Shapiro-Wilk normality test
##
## data: 1/sqrt(sampling$hwy)
## W = 0.9787, p-value = 1.077e-06
# It appears for this sampling subset the data cannot be readily transformed into a normal distribution.
One of the primary assumptions of the analysis of variance is that the data are normally distributed, and since they are (and cannot be readily transformed into such), the results from the analysis of variance can potentially be invalid.
Describe
qqnorm(residuals(interaction))
qqline(residuals(interaction))
plot(fitted(interaction),residuals(interaction))
#The model appears to have a good fit to the data since the points are evenly dispersed across the Cartesian plane of resiual error and the fitted model.
interaction.plot(class,trans,year)
#There appears to be significant interaction among the factors because the lines are not perfectly parallel.
These data are one of the original Hadley Wickham datasets compiled by the US EPA.
Please see above.
Please see above.