Here I look at fuel economy data acquired by the US EPA from 1984-2015.
install.packages("fueleconomy", repos='http://cran.us.r-project.org')
## Installing package into 'C:/Users/Anthony/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)
## package 'fueleconomy' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Anthony\AppData\Local\Temp\RtmpuMAy2v\downloaded_packages
library("fueleconomy")
data<-vehicles
head(data)
## id make model year class
## 1 27550 AM General DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 2 28426 AM General DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 3 27549 AM General FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 4 28425 AM General FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 5 1032 AM General Post Office DJ5 2WD 1985 Special Purpose Vehicle 2WD
## 6 1033 AM General Post Office DJ8 2WD 1985 Special Purpose Vehicle 2WD
## trans drive cyl displ fuel hwy cty
## 1 Automatic 3-spd 2-Wheel Drive 4 2.5 Regular 17 18
## 2 Automatic 3-spd 2-Wheel Drive 4 2.5 Regular 17 18
## 3 Automatic 3-spd 2-Wheel Drive 6 4.2 Regular 13 13
## 4 Automatic 3-spd 2-Wheel Drive 6 4.2 Regular 13 13
## 5 Automatic 3-spd Rear-Wheel Drive 4 2.5 Regular 17 16
## 6 Automatic 3-spd Rear-Wheel Drive 6 4.2 Regular 13 13
As part of a two-factor, multi-level analysis, I will be focusing on the response variable of highway gas mileage.
The first factor that I will consider is the number of cylinders in the vehicle’s engine, and the 9 levels of this factor will be 2, 3, 4, 5, 6, 8, 10, 12, and 16 cylinders.
The second factor that I will consider is denoted as ‘drive’, which describes the drivetrain of the car and has levels such as 4-wheel drive, 2-wheel drive, all-wheel drive, etc.
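The levels of both factors can be listed directly from the data before any modeling. The following is a minimal sketch, assuming the `fueleconomy` package is installed; the names `cyl_counts` and `drive_counts` are introduced here for illustration only.

```r
library(fueleconomy)  # provides the `vehicles` data frame

# Count vehicles at each level of the two factors, keeping NAs visible
cyl_counts   <- table(vehicles$cyl, useNA = "ifany")
drive_counts <- table(vehicles$drive, useNA = "ifany")

print(cyl_counts)
print(drive_counts)
```

Keeping `useNA = "ifany"` makes any vehicles without a recorded cylinder count show up as their own column instead of being dropped silently.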
tail(data)
## id make model year class
## 33437 31064 smart fortwo electric drive cabriolet 2011 Two Seaters
## 33438 33305 smart fortwo electric drive convertible 2013 Two Seaters
## 33439 34393 smart fortwo electric drive convertible 2014 Two Seaters
## 33440 31065 smart fortwo electric drive coupe 2011 Two Seaters
## 33441 33306 smart fortwo electric drive coupe 2013 Two Seaters
## 33442 34394 smart fortwo electric drive coupe 2014 Two Seaters
## trans drive cyl displ fuel hwy cty
## 33437 Automatic (A1) Rear-Wheel Drive NA NA Electricity 79 94
## 33438 Automatic (A1) Rear-Wheel Drive NA NA Electricity 93 122
## 33439 Automatic (A1) Rear-Wheel Drive NA NA Electricity 93 122
## 33440 Automatic (A1) Rear-Wheel Drive NA NA Electricity 79 94
## 33441 Automatic (A1) Rear-Wheel Drive NA NA Electricity 93 122
## 33442 Automatic (A1) Rear-Wheel Drive NA NA Electricity 93 122
summary(data)
## id make model year
## Min. : 1 Length:33442 Length:33442 Min. :1984
## 1st Qu.: 8361 Class :character Class :character 1st Qu.:1991
## Median :16724 Mode :character Mode :character Median :1999
## Mean :17038 Mean :1999
## 3rd Qu.:25265 3rd Qu.:2008
## Max. :34932 Max. :2015
##
## class trans drive cyl
## Length:33442 Length:33442 Length:33442 Min. : 2.00
## Class :character Class :character Class :character 1st Qu.: 4.00
## Mode :character Mode :character Mode :character Median : 6.00
## Mean : 5.77
## 3rd Qu.: 6.00
## Max. :16.00
## NA's :58
## displ fuel hwy cty
## Min. :0.00 Length:33442 Min. : 9.0 Min. : 6.0
## 1st Qu.:2.30 Class :character 1st Qu.: 19.0 1st Qu.: 15.0
## Median :3.00 Mode :character Median : 23.0 Median : 17.0
## Mean :3.35 Mean : 23.6 Mean : 17.5
## 3rd Qu.:4.30 3rd Qu.: 27.0 3rd Qu.: 20.0
## Max. :8.40 Max. :109.0 Max. :138.0
## NA's :57
The continuous variables that this data set looks at include:
City gas mileage
Highway gas mileage
Here I focus on highway gas mileage as the response variable that will describe differences between the levels of the two factors of interest.
This data set contains the variables ‘id’, ‘make’, ‘model’, ‘year’, ‘class’, ‘trans’, ‘drive’, ‘cyl’, ‘displ’, ‘fuel’, ‘hwy’, and ‘cty’. ‘Id’ is a unique identifier for each record. ‘Make’ corresponds to the vehicle manufacturer, e.g. Toyota, Honda, etc. ‘Model’ refers to the car model from the specific manufacturer. ‘Year’ is the model year of the vehicle of interest. ‘Class’ is a descriptive category used to group car models into vehicle types such as midsize car, compact car, sport utility vehicle, etc. ‘Trans’ is used to define whether the vehicle has a manual or an automatic transmission. ‘Drive’ describes the drivetrain of the vehicle, e.g. rear-wheel drive or 4-wheel drive. ‘Cyl’ is the variable used to describe how many cylinders each car’s engine has. ‘Displ’ is the car’s engine displacement. ‘Fuel’ is the type of fuel the vehicle uses, e.g. Regular or Electricity. And ‘hwy’ and ‘cty’ are the variables that display the vehicle’s highway and city gas mileage respectively.
This data was obtained as a result of testing done at the EPA’s National Vehicle and Fuel Emissions Laboratory in Ann Arbor, Michigan*. This is all of the background information that is available regarding the data collection. Thus it may not be safe to assume that this experiment had a completely randomized design.
How will the experiment be organized and conducted to test the hypothesis?
In this experiment I will begin by analyzing the levels of the first factor of interest, which is the number of cylinders in the vehicle’s engine. I will look into the highway gas mileages at all levels of cylinder count to see if there are any discernible differences or correlations.
Next, I will look at the second factor, which is the ‘drive’ of the vehicle. The response variable that I will be analyzing will again be the highway mileage of the vehicles with different drives.
What is the rationale for this design?
I have chosen to use this experimental design to demonstrate proper experimentation with a data set with at least two factors and at least two levels of each factor.
Randomize: What is the Randomization Scheme?
Because it is not clear if the data was originally collected in a random manner, the only randomization involved in this experimentation comes from the fact that I chose the two factors and their corresponding levels randomly.
Replicate: Are there replicates and/or repeated measures?
There were no replicates or repeated measures in the original data collection for this data set.
Block: Did you use blocking in the design?
The blocking that I performed in this experimental data analysis is seen in the blocking of vehicles into the different levels of their respective factors.
Here I define the number of cylinders in the vehicle’s engine (‘cyl’) and the drive of the vehicle (‘drive’) as factors for analysis.
# To define the number of cylinders ('cyl') as a factor
data$cyl=as.factor(data$cyl)
# To define the drive type ('drive') as a factor
data$drive=as.factor(data$drive)
Below are boxplots of the highway gas mileages of all levels of the two factors of interest.
boxplot(hwy~cyl,data=data)
boxplot(hwy~drive,data=data)
Here the first two Analyses of Variance (ANOVA) are used to analyze the differences in the mean highway gas mileage of vehicles with varying numbers of cylinders and varying drive types. The third ANOVA analyzes both factors together along with the interaction effect between them.
# Assign models for the data of interest
model_cyl=aov(hwy~cyl,data=data)
model_drive=aov(hwy~drive,data=data)
model_cyl_drive=aov(hwy~cyl*drive,data=data)
# Perform ANOVA on the three models
anova(model_cyl)
## Analysis of Variance Table
##
## Response: hwy
## Df Sum Sq Mean Sq F value Pr(>F)
## cyl 8 525219 65652 3851 <2e-16 ***
## Residuals 33375 568972 17
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model_drive)
## Analysis of Variance Table
##
## Response: hwy
## Df Sum Sq Mean Sq F value Pr(>F)
## drive 6 471740 78623 3212 <2e-16 ***
## Residuals 33435 818471 24
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model_cyl_drive)
## Analysis of Variance Table
##
## Response: hwy
## Df Sum Sq Mean Sq F value Pr(>F)
## cyl 8 525219 65652 5337.1 <2e-16 ***
## drive 6 149845 24974 2030.2 <2e-16 ***
## cyl:drive 24 8942 373 30.3 <2e-16 ***
## Residuals 33345 410185 12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
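The residual degrees of freedom differ between the tables above (33375 for ‘cyl’, 33435 for ‘drive’), which is consistent with `aov()` dropping the 58 rows where ‘cyl’ is missing. The sketch below checks this, assuming the `fueleconomy` package is installed; `n_total`, `n_missing`, `fit_cyl`, and `n_used` are illustrative names.

```r
library(fueleconomy)  # provides the `vehicles` data frame

n_total   <- nrow(vehicles)            # all vehicles in the data set
n_missing <- sum(is.na(vehicles$cyl))  # vehicles with no cylinder count

# aov() uses na.omit by default, so rows with a missing 'cyl' are dropped
fit_cyl <- aov(hwy ~ factor(cyl), data = vehicles)
n_used  <- length(residuals(fit_cyl))

cat("rows in data:", n_total, "\n")
cat("rows with missing cyl:", n_missing, "\n")
cat("rows used by the cyl model:", n_used, "\n")
```

Because the single-factor ‘drive’ model keeps those rows while the ‘cyl’ models do not, the three ANOVAs are not fitted to exactly the same set of vehicles.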
The ANOVA that analyzed the variation in highway gas mileage across the number of cylinders in the vehicle’s engine returned a p-value of less than 2e-16. This very small p-value means that, if the number of cylinders had no effect, there would be a very small probability of observing differences in highway gas mileage this large by chance alone. Thus the conclusion may be drawn that highway gas mileage changes with the number of cylinders in the vehicle’s engine.
The ANOVA that analyzed the variation in highway gas mileage across the drive type of the car also returned a p-value of less than 2e-16; R prints any p-value below machine precision (about 2.2e-16) as <2e-16 rather than reporting its exact value. This very small p-value means that, if the drive type had no effect, there would be a very small probability of observing differences in highway gas mileage this large by chance alone. Thus another conclusion may be drawn that highway gas mileage changes with the vehicle’s drive type.
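The `<2e-16` shown in the tables is a printing threshold rather than a computed value. As a rough illustration using only base R, the sketch below recomputes the p-value from the F statistic and degrees of freedom reported in the ‘cyl’ table; the variable names are illustrative.

```r
# F statistic and degrees of freedom copied from the 'cyl' ANOVA table
f_value <- 3851
df_num  <- 8
df_den  <- 33375

# Upper-tail probability of the F distribution
p_value <- pf(f_value, df_num, df_den, lower.tail = FALSE)

# The same probability on the log scale, which avoids underflow
log_p <- pf(f_value, df_num, df_den, lower.tail = FALSE, log.p = TRUE)

print(.Machine$double.eps)            # threshold used when printing p-values
print(p_value < .Machine$double.eps)  # is the p-value below that threshold?
print(format.pval(p_value))           # how R formats it for display
print(log_p)
```

The log-scale value gives a sense of how far below the printing threshold the p-value actually lies.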
Because both ANOVAs indicated that each factor can affect the highway mileage of the vehicles, I then performed an ANOVA to analyze the interaction effect of the two factors. The resulting p-value for the interaction term was once again less than 2e-16, which indicates that there is a very small probability that the observed interaction is a result of chance alone; that is, the effect of one factor on highway gas mileage depends on the level of the other.
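In the tables above, the sum of squares for ‘drive’ is 471740 when it is fitted alone but 149845 when it is entered after ‘cyl’. This is expected: `anova()` reports sequential sums of squares, and when the groups have unequal sizes these depend on the order of the terms. The sketch below fits the two-factor model in both orders, assuming the `fueleconomy` package is installed; `d`, `tab_cyl_first`, and `tab_drive_first` are illustrative names.

```r
library(fueleconomy)  # provides the `vehicles` data frame

# Keep only rows where the response and both factors are present
d <- vehicles[complete.cases(vehicles[, c("hwy", "cyl", "drive")]), ]
d$cyl   <- factor(d$cyl)
d$drive <- factor(d$drive)

# Same model, two different term orders
tab_cyl_first   <- anova(aov(hwy ~ cyl * drive, data = d))
tab_drive_first <- anova(aov(hwy ~ drive * cyl, data = d))

# Sequential sum of squares attributed to 'drive' under each order
print(tab_cyl_first["drive", "Sum Sq"])
print(tab_drive_first["drive", "Sum Sq"])

# The residual sum of squares does not depend on the order
print(tab_cyl_first["Residuals", "Sum Sq"])
print(tab_drive_first["Residuals", "Sum Sq"])
```

Only the split between the two main effects changes with the order; the overall fit of the model stays the same.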
To check the adequacy of using the ANOVA as a means of analyzing this set of data, I created Quantile-Quantile (Q-Q) plots of the residuals to determine whether the residuals followed a normal distribution. I also created an interaction plot to see if there was an interaction effect between the two factors.
The nearly linear fit of the residuals in the two Q-Q plots is an indication that the model is adequate for this analysis.
The interaction plot following the Q-Q plots shows that the two factors are interacting with each other to create an effect in the response variable wherever the curves on the plot are not parallel, most visibly where they intersect.
The third type of plot is a Residuals vs. Fits plot, which is used to check whether the residuals are spread evenly around zero across the fitted values and to determine if there are any outlying values. Because there are slightly more outliers in the ‘drive’ model than in the ‘cyl’ model, it can be reasoned that the ANOVA is slightly less adequate for modeling the ‘drive’ data.
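The curves in an interaction plot are the mean responses for each combination of factor levels, so the same information can be read as a table. The sketch below computes those cell means and the number of vehicles behind each one, assuming the `fueleconomy` package is installed; `cell_means` and `cell_counts` are illustrative names.

```r
library(fueleconomy)  # provides the `vehicles` data frame

# Mean highway mileage for every cylinder-by-drive combination;
# combinations with no vehicles come out as NA
cell_means <- with(vehicles,
                   tapply(hwy, list(cyl = cyl, drive = drive), mean))

# Number of vehicles in each combination
cell_counts <- with(vehicles, table(cyl, drive))

print(round(cell_means, 1))
print(cell_counts)
```

Cells with very few vehicles deserve caution, since a crossing of curves there may rest on only a handful of observations.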
# QQ Plot for residuals in analysis of cylinder effect on highway gas mileage
qqnorm(residuals(model_cyl))
qqline(residuals(model_cyl))
# QQ Plot for residuals in analysis of drive effect on highway gas mileage
qqnorm(residuals(model_drive))
qqline(residuals(model_drive))
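# Interaction plot of mean highway gas mileage by cylinders and drive
interaction.plot(data$cyl,data$drive,data$hwy)
# Residuals vs. Fits plots for the two single-factor models
plot(fitted(model_cyl),residuals(model_cyl))
plot(fitted(model_drive),residuals(model_drive))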
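The diagnostic plots above cover the two single-factor models. The same checks can be applied to the two-factor model with interaction; the following is a sketch under the assumption that the `fueleconomy` package is installed, with `d`, `fit`, and `res` as illustrative names.

```r
library(fueleconomy)  # provides the `vehicles` data frame

# Keep only rows where the response and both factors are present
d <- vehicles[complete.cases(vehicles[, c("hwy", "cyl", "drive")]), ]
d$cyl   <- factor(d$cyl)
d$drive <- factor(d$drive)

# Two-factor model with interaction, as in model_cyl_drive
fit <- aov(hwy ~ cyl * drive, data = d)
res <- residuals(fit)

# Q-Q plot of the residuals against a normal distribution
qqnorm(res)
qqline(res)

# Residuals vs. Fits plot with a reference line at zero
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

Points far from the reference line in either plot would suggest that the normality or equal-variance assumptions deserve a closer look for this model.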
The data from the fueleconomy data set can be accessed at https://github.com/hadley/fueleconomy.