As of August 28, 2014, superseding the version of August 24. Always use the most recent version.
This analysis uses fuel economy data collected by the EPA for model years 1984 to 2015 to test the effect that different factors may have on fuel consumption.
Below is the installation and initial examination of the dataset:
#Installing data package
install.packages("fueleconomy", repos='http://cran.us.r-project.org')
## Installing package into 'C:/Users/tothk2/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)
## package 'fueleconomy' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\tothk2\AppData\Local\Temp\Rtmp088G8X\downloaded_packages
#Load the package from the default library path
library(fueleconomy)
data<-vehicles
head(data)
## id make model year class
## 1 27550 AM General DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 2 28426 AM General DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 3 27549 AM General FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 4 28425 AM General FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 5 1032 AM General Post Office DJ5 2WD 1985 Special Purpose Vehicle 2WD
## 6 1033 AM General Post Office DJ8 2WD 1985 Special Purpose Vehicle 2WD
## trans drive cyl displ fuel hwy cty
## 1 Automatic 3-spd 2-Wheel Drive 4 2.5 Regular 17 18
## 2 Automatic 3-spd 2-Wheel Drive 4 2.5 Regular 17 18
## 3 Automatic 3-spd 2-Wheel Drive 6 4.2 Regular 13 13
## 4 Automatic 3-spd 2-Wheel Drive 6 4.2 Regular 13 13
## 5 Automatic 3-spd Rear-Wheel Drive 4 2.5 Regular 17 16
## 6 Automatic 3-spd Rear-Wheel Drive 6 4.2 Regular 13 13
attach(data)
For this two-factor analysis we examine the factors “cyl”, the number of cylinders in the engine, and “fuel”, the type of fuel consumed by the car.
There are nine levels of “cyl”: 2, 3, 4, 5, 6, 8, 10, 12, and 16. This analysis examines only 3 of the 13 levels of fuel: Regular, Premium, and Diesel.
#Subset data for specific values of fuel type and provide summary statistics
datasub<-data[data$fuel %in% c("Regular","Premium","Diesel"), ]
head(datasub)
## id make model year class
## 1 27550 AM General DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 2 28426 AM General DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 3 27549 AM General FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 4 28425 AM General FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 5 1032 AM General Post Office DJ5 2WD 1985 Special Purpose Vehicle 2WD
## 6 1033 AM General Post Office DJ8 2WD 1985 Special Purpose Vehicle 2WD
## trans drive cyl displ fuel hwy cty
## 1 Automatic 3-spd 2-Wheel Drive 4 2.5 Regular 17 18
## 2 Automatic 3-spd 2-Wheel Drive 4 2.5 Regular 17 18
## 3 Automatic 3-spd 2-Wheel Drive 6 4.2 Regular 13 13
## 4 Automatic 3-spd 2-Wheel Drive 6 4.2 Regular 13 13
## 5 Automatic 3-spd Rear-Wheel Drive 4 2.5 Regular 17 16
## 6 Automatic 3-spd Rear-Wheel Drive 6 4.2 Regular 13 13
tail(datasub)
## id make model year class trans
## 33431 26294 smart fortwo coupe 2009 Two Seaters Automatic (AM5)
## 33432 29812 smart fortwo coupe 2010 Two Seaters Auto(AM5)
## 33433 30918 smart fortwo coupe 2011 Two Seaters Auto(AM5)
## 33434 32172 smart fortwo coupe 2012 Two Seaters Auto(AM5)
## 33435 32357 smart fortwo coupe 2013 Two Seaters Auto(AM5)
## 33436 34460 smart fortwo coupe 2014 Two Seaters Auto(AM5)
## drive cyl displ fuel hwy cty
## 33431 Rear-Wheel Drive 3 1 Premium 41 33
## 33432 Rear-Wheel Drive 3 1 Premium 41 33
## 33433 Rear-Wheel Drive 3 1 Premium 41 33
## 33434 Rear-Wheel Drive 3 1 Premium 38 34
## 33435 Rear-Wheel Drive 3 1 Premium 38 34
## 33436 Rear-Wheel Drive 3 1 Premium 38 34
summary(datasub)
## id make model year
## Min. : 1 Length:32113 Length:32113 Min. :1984
## 1st Qu.: 8029 Class :character Class :character 1st Qu.:1990
## Median :16066 Mode :character Mode :character Median :1999
## Mean :16577 Mean :1999
## 3rd Qu.:24620 3rd Qu.:2007
## Max. :34931 Max. :2015
##
## class trans drive cyl
## Length:32113 Length:32113 Length:32113 Min. : 2.00
## Class :character Class :character Class :character 1st Qu.: 4.00
## Mode :character Mode :character Mode :character Median : 6.00
## Mean : 5.72
## 3rd Qu.: 6.00
## Max. :16.00
## NA's :3
## displ fuel hwy cty
## Min. :0.00 Length:32113 Min. : 9.0 Min. : 6.0
## 1st Qu.:2.20 Class :character 1st Qu.:20.0 1st Qu.:15.0
## Median :3.00 Mode :character Median :23.0 Median :17.0
## Mean :3.31 Mean :23.5 Mean :17.4
## 3rd Qu.:4.30 3rd Qu.:27.0 3rd Qu.:20.0
## Max. :8.40 Max. :61.0 Max. :53.0
## NA's :2
#Levels of Cylinders
unique(datasub$cyl)
## [1] 4 6 5 8 12 10 16 3 2 NA
#Level of Fuel
unique(datasub$fuel)
## [1] "Regular" "Premium" "Diesel"
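Listing the levels of each factor separately does not show whether every cylinder count occurs with every fuel type. A cross-tabulation answers that, and the 11 interaction degrees of freedom reported in the two-factor ANOVA later on (rather than 8 × 2 = 16) suggest that some combinations are empty. The sketch below is only illustrative: it uses the built-in `mtcars` data as a stand-in, not the fueleconomy data, with `gear` playing the role of the second factor.

```r
# Stand-in data: built-in mtcars, NOT the fueleconomy data.
# Count how many vehicles fall in each combination of two factors.
cells <- with(mtcars, table(cyl, gear))
print(cells)

# Locate combinations with no observations at all
empty <- which(cells == 0, arr.ind = TRUE)
print(empty)
```

With the data used here, the analogous call would presumably be `table(datasub$cyl, datasub$fuel)`.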
The continuous variables in this data are the city (“cty”) and highway (“hwy”) gas mileage of each vehicle. Highway mileage ranges from 9 to 109 and city mileage ranges from 6 to 138. These figures evidently describe the full data set, since within the three-fuel subset summarized above the ranges are 9 to 61 and 6 to 53.

### Response variables

For this experiment the response variable is city gas mileage, which may be affected by the number of cylinders and the fuel type.

### The Data: How is it organized and what does it look like?

The data set is organized by the following variables: id, make, model, year, class, trans, drive, cyl, displ, fuel, hwy, cty.
Make and model identify the manufacturer of the vehicle (such as Audi or Ford) and the specific model from that manufacturer (such as Passat or Grand Prix). Class indicates the vehicle type, such as compact car or van.
Year is simply the year that vehicle was manufactured.
trans, drive, cyl, and displ all describe the setup of the car, mostly relating to the engine. trans is the transmission, such as Automatic 9-spd or Manual 5-spd. drive is the type of wheel drive, such as all-wheel or front-wheel. cyl is the number of cylinders in the engine and displ is the engine displacement in liters.
Fuel is the type of fuel the engine uses like Regular or Premium.
Cty and hwy are the gas mileage for city driving and gas mileage for highway driving.
It is unknown whether or not the data collected for this study was collected by a randomly designed experiment.
To perform the experiment the data will first be subset by fuel type. Then an analysis of variance will be performed on each factor individually and then on both factors combined. From these analyses we will be able to test the null hypothesis that city gas mileage is independent of fuel type and number of cylinders.
An analysis of variance is appropriate when a factor has more than two levels or when multiple factors are considered, because it tests whether the means of several groups are equal. The alternative would be to run multiple two-sample t-tests; however, each additional test increases the chance of a false positive, that is, of rejecting a null hypothesis that is actually true.
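To see why many pairwise t-tests are risky, consider the chance of at least one false positive across all pairwise comparisons of the nine cylinder levels. The sketch below assumes a 0.05 significance level per test and, as a simplification, treats the tests as independent (pairwise tests on shared groups are not strictly independent, so this is only a rough guide).

```r
alpha   <- 0.05            # assumed per-test significance level
k       <- 9               # number of levels of "cyl"
n_tests <- choose(k, 2)    # number of pairwise comparisons among k groups

# Probability of at least one false positive across all tests,
# under the simplifying assumption that the tests are independent
fwer <- 1 - (1 - alpha)^n_tests

print(n_tests)
print(fwer)
```

Under these assumptions the familywise error rate is far above the nominal 5%, which is the motivation for testing all group means at once with a single F test.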
The data was collected in an unknown way so we do not know if there was any randomization to it.
There are no replicated or repeated measures in the data. Each unique vehicle had its fuel economy statistics measured once.
There was no blocking performed in the design of this experiment. All vehicles that contained the observed levels of each factor were analyzed together.
To start our statistical analysis we will make our variables factors for the analysis of variance and look at some boxplots of those factors.
#Defining "cyl" as a factor
datasub$cyl = as.factor(datasub$cyl)
#Defining "fuel" as a factor
datasub$fuel = as.factor(datasub$fuel)
#Boxplots of city gas mileage by each factor
boxplot(cty~cyl, data=datasub, xlab="Number of Cylinders", ylab="City Gas Mileage")
boxplot(cty~fuel, data=datasub, xlab="Fuel Type", ylab="City Gas Mileage")
The boxplot showing cylinders implies an inverse relationship between city gas mileage and number of cylinders while the fuel boxplot shows a somewhat neutral effect on city gas mileage. From both boxplots we see a large number of outliers for certain values of each variable, particularly 4 and 6 cylinders and Premium and Regular fuel type.
To test the hypotheses we perform an ANOVA test on the factors individually and then in combination.
The null hypothesis of the first two tests is that the single factor does not have an effect on the response variable of city gas mileage.
#Analysis of Variance for Factor "cyl"
cylmodel=aov(cty~cyl, data=datasub)
anova(cylmodel)
## Analysis of Variance Table
##
## Response: cty
## Df Sum Sq Mean Sq F value Pr(>F)
## cyl 8 390755 48844 5574 <2e-16 ***
## Residuals 32101 281291 9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Analysis of Variance for Factor "fuel"
fuelmodel=aov(cty~fuel, data=datasub)
anova(fuelmodel)
## Analysis of Variance Table
##
## Response: cty
## Df Sum Sq Mean Sq F value Pr(>F)
## fuel 2 18986 9493 467 <2e-16 ***
## Residuals 32110 653099 20
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Analysis of Variance for Both Factors "cyl" and "fuel"
cylfuelmodel=aov(cty~cyl*fuel, data=datasub)
anova(cylfuelmodel)
## Analysis of Variance Table
##
## Response: cty
## Df Sum Sq Mean Sq F value Pr(>F)
## cyl 8 390755 48844 6229 <2e-16 ***
## fuel 2 19485 9743 1243 <2e-16 ***
## cyl:fuel 11 10202 927 118 <2e-16 ***
## Residuals 32088 251604 8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For each of our tests the resulting p-value is below 2e-16. The p-value is the probability of observing an F value at least as large as the one obtained if the null hypothesis were true. Since this probability is extremely close to 0, we conclude that each factor has an effect on the response variable, and we are led to reject the null hypothesis.
Next we draw Q-Q plots to check the residuals of each model for normality. If the residuals are not normally distributed, the results of the analysis may not be valid.
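The F value and p-value in an ANOVA table follow directly from the sums of squares and degrees of freedom. As a minimal sketch, the numbers below are copied from the fuel-only table above; the variable names are only illustrative.

```r
# Values copied from the fuel-only ANOVA table
ss_fuel  <- 18986;  df_fuel  <- 2
ss_resid <- 653099; df_resid <- 32110

ms_fuel  <- ss_fuel / df_fuel      # mean square for fuel
ms_resid <- ss_resid / df_resid    # residual mean square
f_value  <- ms_fuel / ms_resid     # F statistic (ratio of mean squares)

# Upper-tail probability of the F distribution under the null hypothesis
p_value <- pf(f_value, df_fuel, df_resid, lower.tail = FALSE)

print(f_value)
print(p_value)
```

R prints `<2e-16` in the table because p-values smaller than machine precision (about 2.2e-16) are reported as a bound rather than as an exact value.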
#QQ plots for residuals of cylinder only model
qqnorm(residuals(cylmodel))
qqline(residuals(cylmodel))
#QQ plots for residuals of fuel only model
qqnorm(residuals(fuelmodel))
qqline(residuals(fuelmodel))
#QQ plots for residuals of combination fuel and cylinder model
qqnorm(residuals(cylfuelmodel))
qqline(residuals(cylfuelmodel))
We also use an interaction plot to visualize the interaction of the factors on the response variable. The lines are roughly parallel in places, which would imply no interaction effect; however, they also intersect in a few places, leading us to believe there may be an interaction effect.
# Interaction Plot of factors
interaction.plot(datasub$cyl,datasub$fuel,datasub$cty)
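An interaction plot draws the mean response for each combination of the two factors, so the same information can be read from a table of cell means. The sketch below is only illustrative: it uses the built-in `mtcars` data as a stand-in, not the fueleconomy data, with `mpg` as the response and `gear` as the second factor.

```r
# Stand-in data: built-in mtcars, NOT the fueleconomy data.
# Mean response for every combination of the two factors;
# combinations with no observations come out as NA.
cell_means <- tapply(mtcars$mpg,
                     list(cyl = mtcars$cyl, gear = mtcars$gear),
                     mean)
print(round(cell_means, 2))
```

If the differences between columns are about the same in every row, the lines in the plot are parallel and there is little sign of interaction. With the data used here, the analogous call would presumably be `tapply(datasub$cty, list(datasub$cyl, datasub$fuel), mean)`.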
Lastly we plot the residuals of the combined model against its fitted values. The spread of the residuals does not vary greatly across the plot.
#Plot of Fitted vs Residuals of combination cylinder and fuel type model
plot(fitted(cylfuelmodel), residuals(cylfuelmodel))
Overall, the results lead us to believe that our model is not adequate and does not explain the effect of the number of cylinders and fuel type on the variance in city gas mileage.
No literature was used.
Data can be found at https://github.com/hadley/fueleconomy
It is possible that the conclusions of our analysis are the result of chance. One concern is that the data for each car was collected only once. We do not know the conditions of the tests, and those conditions may have produced better or worse values for city gas mileage. There are other factors involved as well. Two vehicles could have exactly the same engine setup yet differ in weight, thereby affecting their mileage.
It may be better to use a blocking factor to address some of these contingencies. By blocking the data by vehicle type (van, compact car, truck) we may be able to lower the variation in unknown factors such as weight and drag.
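One way to express the blocking idea in R is to add the block as an additive term ahead of the factor of interest. The sketch below is only illustrative: it uses the built-in `mtcars` data as a stand-in (not the fueleconomy data) and treats `gear` as a hypothetical blocking variable in place of vehicle class.

```r
# Stand-in data: built-in mtcars, NOT the fueleconomy data.
toy <- data.frame(
  cty   = mtcars$mpg,            # stand-in response
  cyl   = factor(mtcars$cyl),    # factor of interest
  block = factor(mtcars$gear)    # hypothetical block standing in for vehicle class
)

# The block enters first, so variation between blocks is removed
# before the effect of cyl is assessed (sequential sums of squares)
blocked <- aov(cty ~ block + cyl, data = toy)
blocked_table <- anova(blocked)
print(blocked_table)
```

With the data used in this analysis, the analogous call would presumably be `aov(cty ~ class + cyl * fuel, data = datasub)`, though that model has not been fitted here.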