as of August 28, 2014, superseding the version of August 24. Always use the most recent version.
This study designs an experiment with two factors, each with three levels, to examine how the ‘number of cylinders’ and the ‘type of transmission’ of a vehicle affect its fuel economy. The data contain two response variables, the mileage of each vehicle in the city and on the highway, but only the ‘city’ values are considered in this experiment. The analysis aims to estimate the effect of the number of cylinders (levels: 4, 6, 8) and the type of transmission (three levels: two automatic and one manual) on the mileage of a vehicle. For the analysis we take a subset of the data restricted to the make ‘Toyota’ and to vehicles from the past 10 years. We further subset the data by explicitly eliminating certain transmission types so that only these three levels remain.
install.packages("fueleconomy", repos='http://cran.us.r-project.org')
## Installing package into 'C:/Users/uzma/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)
## package 'fueleconomy' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\uzma\AppData\Local\Temp\RtmpUfwqTM\downloaded_packages
library(fueleconomy)
x<-vehicles
head(x)
## id make model year class
## 1 27550 AM General DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 2 28426 AM General DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 3 27549 AM General FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 4 28425 AM General FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 5 1032 AM General Post Office DJ5 2WD 1985 Special Purpose Vehicle 2WD
## 6 1033 AM General Post Office DJ8 2WD 1985 Special Purpose Vehicle 2WD
## trans drive cyl displ fuel hwy cty
## 1 Automatic 3-spd 2-Wheel Drive 4 2.5 Regular 17 18
## 2 Automatic 3-spd 2-Wheel Drive 4 2.5 Regular 17 18
## 3 Automatic 3-spd 2-Wheel Drive 6 4.2 Regular 13 13
## 4 Automatic 3-spd 2-Wheel Drive 6 4.2 Regular 13 13
## 5 Automatic 3-spd Rear-Wheel Drive 4 2.5 Regular 17 16
## 6 Automatic 3-spd Rear-Wheel Drive 6 4.2 Regular 13 13
This is a two-factor experiment with three levels per factor. The two factors are the ‘number of cylinders’ in a car and its ‘transmission’ type, and we take three levels of each to see their effect on the fuel economy of each vehicle.
tail(x)
## id make model year class
## 33437 31064 smart fortwo electric drive cabriolet 2011 Two Seaters
## 33438 33305 smart fortwo electric drive convertible 2013 Two Seaters
## 33439 34393 smart fortwo electric drive convertible 2014 Two Seaters
## 33440 31065 smart fortwo electric drive coupe 2011 Two Seaters
## 33441 33306 smart fortwo electric drive coupe 2013 Two Seaters
## 33442 34394 smart fortwo electric drive coupe 2014 Two Seaters
## trans drive cyl displ fuel hwy cty
## 33437 Automatic (A1) Rear-Wheel Drive NA NA Electricity 79 94
## 33438 Automatic (A1) Rear-Wheel Drive NA NA Electricity 93 122
## 33439 Automatic (A1) Rear-Wheel Drive NA NA Electricity 93 122
## 33440 Automatic (A1) Rear-Wheel Drive NA NA Electricity 79 94
## 33441 Automatic (A1) Rear-Wheel Drive NA NA Electricity 93 122
## 33442 Automatic (A1) Rear-Wheel Drive NA NA Electricity 93 122
summary(x)
## id make model year
## Min. : 1 Length:33442 Length:33442 Min. :1984
## 1st Qu.: 8361 Class :character Class :character 1st Qu.:1991
## Median :16724 Mode :character Mode :character Median :1999
## Mean :17038 Mean :1999
## 3rd Qu.:25265 3rd Qu.:2008
## Max. :34932 Max. :2015
##
## class trans drive cyl
## Length:33442 Length:33442 Length:33442 Min. : 2.00
## Class :character Class :character Class :character 1st Qu.: 4.00
## Mode :character Mode :character Mode :character Median : 6.00
## Mean : 5.77
## 3rd Qu.: 6.00
## Max. :16.00
## NA's :58
## displ fuel hwy cty
## Min. :0.00 Length:33442 Min. : 9.0 Min. : 6.0
## 1st Qu.:2.30 Class :character 1st Qu.: 19.0 1st Qu.: 15.0
## Median :3.00 Mode :character Median : 23.0 Median : 17.0
## Mean :3.35 Mean : 23.6 Mean : 17.5
## 3rd Qu.:4.30 3rd Qu.: 27.0 3rd Qu.: 20.0
## Max. :8.40 Max. :109.0 Max. :138.0
## NA's :57
Most of the numeric variables in the data set (year, cyl, hwy and cty) take integer values and can be treated as discrete, while displ (engine displacement) is continuous. The make and transmission variables are categorical.
The response variable is the mileage (in mpg) of each vehicle. However, there are two different values given in the data set for the mileage. One is for the highway (hwy) and the other for the city (cty). For analysis purposes we consider only the city mileage as the response variable in our experiment.
The given data set is the fuel economy data from the EPA. It covers the years 1984 to 2015 for various car models, and each row gives a detailed specification of one vehicle.

### Randomization
We can assume the data to be randomized because they result from vehicle testing done at the Environmental Protection Agency’s National Vehicle and Fuel Emissions Laboratory. Because almost every vehicle needs to clear this testing, the data can be treated as representative of the population as a whole.
This is a factorial design experiment in which we consider two factors with multiple levels in order to analyze the main effects and the interaction effect. Our null hypothesis is that neither the number of cylinders nor the type of transmission has an effect on vehicle mileage in the city, and that the two factors do not interact. The alternative is that at least one of these effects is present.
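The hypotheses can be written out in the usual two-factor fixed-effects notation. The symbols below are an assumption for illustration, since the report does not state the model explicitly: $y_{ijk}$ is the city mileage of vehicle $k$ with cylinder level $i$ and transmission level $j$.

```latex
\begin{align*}
y_{ijk} &= \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk},
  \qquad \varepsilon_{ijk} \sim N(0, \sigma^2) \\
H_0^{\text{cyl}} &: \alpha_i = 0 \ \text{for all } i
  \quad \text{vs.} \quad H_1^{\text{cyl}} : \alpha_i \neq 0 \ \text{for some } i \\
H_0^{\text{trans}} &: \beta_j = 0 \ \text{for all } j
  \quad \text{vs.} \quad H_1^{\text{trans}} : \beta_j \neq 0 \ \text{for some } j \\
H_0^{\text{int}} &: (\alpha\beta)_{ij} = 0 \ \text{for all } i, j
  \quad \text{vs.} \quad H_1^{\text{int}} : (\alpha\beta)_{ij} \neq 0 \ \text{for some } i, j
\end{align*}
```

Each F test in a two-way ANOVA table corresponds to one of these three null hypotheses.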
Increasing the number of cylinders in a vehicle can give it more power, which can lower its fuel economy. Similarly, the transmission type (manual or automatic) can affect the fuel economy of a vehicle. Since mileage is adversely affected in city traffic, we analyze the main effects and the interaction effect in this setting only.
The data can be assumed to be well randomized because essentially every vehicle is required to pass the fuel economy test at the EPA.
Since each vehicle is tested once before it is sold, there are no replicates or repeated measures in the experiment.
In the statistical analysis we use ANOVA as a tool. It tests whether group means differ significantly when there are more than two groups, and so generalizes the two-sample t-test to a more complex setting.
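The link between the t-test and ANOVA can be checked numerically: with exactly two groups, the one-way ANOVA F statistic equals the square of the pooled-variance t statistic. The sketch below uses R's built-in `mtcars` data purely as a stand-in (it is not the EPA data analyzed in this report), and the object names are illustrative.

```r
# Stand-in data: built-in mtcars, NOT the fueleconomy data used in this report.
d <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

# Pooled-variance two-sample t-test of mpg by transmission type.
tt <- t.test(mpg ~ am, data = d, var.equal = TRUE)

# One-way ANOVA on the same two groups.
tab <- anova(aov(mpg ~ am, data = d))

# With two groups, F should equal t^2 and the p-values should agree.
t_squared <- unname(tt$statistic)^2
f_value   <- tab[["F value"]][1]
print(c(t_squared = t_squared, F = f_value))
```

With three or more groups there is no single t statistic to compare against, which is the case ANOVA is designed for.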
In the data analysis, we consider a subset of the data containing only vehicles from the past ten years. The analysis is also restricted to ‘Toyota’ vehicles, and some of the levels of the factor ‘transmission’ are explicitly removed in order to focus on three specific levels.
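# Keep Toyota vehicles from model year 2004 onward and drop the
# transmission types that are not among the three levels studied.
Y <- subset(x, year > 2003 & make == 'Toyota' &
              trans != 'Automatic (S6)' & trans != 'Automatic (S5)' &
              trans != 'Automatic (variable gear ratios)' &
              trans != 'Manual 5-spd' & trans != 'Auto(AV-S7)' &
              trans != 'Automatic 5-spd' & trans != 'Automatic (S4)')
# Treat both factors as categorical for the ANOVA.
Y$cyl <- as.factor(Y$cyl)
Y$trans <- as.factor(Y$trans)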
summary(Y)
## id make model year
## Min. :19390 Length:146 Length:146 Min. :2004
## 1st Qu.:21030 Class :character Class :character 1st Qu.:2005
## Median :24498 Mode :character Mode :character Median :2008
## Mean :25819 Mean :2008
## 3rd Qu.:30809 3rd Qu.:2011
## Max. :34724 Max. :2014
## class trans drive cyl
## Length:146 Auto(AV-S6) : 4 Length:146 4:103
## Class :character Automatic 4-spd:100 Class :character 6: 39
## Mode :character Manual 6-spd : 42 Mode :character 8: 4
##
##
##
## displ fuel hwy cty
## Min. :1.50 Length:146 Min. :16 Min. :13.0
## 1st Qu.:1.80 Class :character 1st Qu.:20 1st Qu.:16.0
## Median :2.45 Mode :character Median :26 Median :20.0
## Mean :2.70 Mean :26 Mean :20.5
## 3rd Qu.:3.50 3rd Qu.:31 3rd Qu.:25.0
## Max. :4.70 Max. :39 Max. :40.0
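The summary shows that the levels are far from balanced (for example, only 4 vehicles have 8 cylinders). A cross-tabulation of the two factors is a quick way to see how many observations fall in each cell and whether any combination is empty. The sketch below uses a small made-up data frame, `toy`, as a hypothetical stand-in for `Y`. Its values are illustrative only and are not taken from the EPA data.

```r
# Hypothetical stand-in for Y: six made-up rows, not the EPA data.
toy <- data.frame(
  cyl   = factor(c(4, 4, 4, 6, 6, 8)),
  trans = factor(c("Automatic 4-spd", "Manual 6-spd", "Manual 6-spd",
                   "Automatic 4-spd", "Auto(AV-S6)", "Automatic 4-spd"))
)

# Number of observations in each cylinder-by-transmission cell.
cell_counts <- table(toy$cyl, toy$trans)
print(cell_counts)

# Count the factor combinations that have no observations at all.
empty_cells <- sum(cell_counts == 0)
print(empty_cells)
```

On the report's data the equivalent call would be `table(Y$cyl, Y$trans)`. The two-way ANOVA table later in this report gives the cyl:trans term 2 degrees of freedom rather than the 4 a complete 3×3 layout would have, which suggests that some combinations are indeed empty.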
# Boxplots
boxplot(cty~cyl,data=Y)
boxplot(cty~trans,data=Y)
Analysis of variance for the factor cylinder
model=aov(cty~cyl,data=Y)
anova(model)
## Analysis of Variance Table
##
## Response: cty
## Df Sum Sq Mean Sq F value Pr(>F)
## cyl 2 1620 810 52.4 <2e-16 ***
## Residuals 143 2212 15
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of variance for the factor transmission
model=aov(cty~trans,data=Y)
anova(model)
## Analysis of Variance Table
##
## Response: cty
## Df Sum Sq Mean Sq F value Pr(>F)
## trans 2 1287 644 36.1 2e-13 ***
## Residuals 143 2545 18
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of variance for both factors: cylinder and transmission
model=aov(cty~cyl*trans,data=Y)
anova(model)
## Analysis of Variance Table
##
## Response: cty
## Df Sum Sq Mean Sq F value Pr(>F)
## cyl 2 1620 810 87.10 <2e-16 ***
## trans 2 891 445 47.90 <2e-16 ***
## cyl:trans 2 28 14 1.52 0.22
## Residuals 139 1293 9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
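Taken separately, each factor returns a very small p-value in its one-way ANOVA. In the two-factor model, both main effects (number of cylinders and transmission type) again have p-values below 2e-16, but the cyl:trans interaction has a p-value of 0.22 and is not significant at the 0.05 level. We can therefore conclude that there is a very small probability that the variation in city gas mileage with respect to number of cylinders or transmission type is due to chance alone, while the data give no evidence that the effect of one factor depends on the level of the other.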
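Reading p-values off a printed table is error-prone, so it can help to pull them out of the ANOVA table programmatically and compare each against a chosen significance level. The sketch below again uses the built-in `mtcars` data as a stand-in for the report's data. The 0.05 threshold and the object names are assumptions for illustration.

```r
# Stand-in data: built-in mtcars, NOT the fueleconomy data used in this report.
d <- transform(mtcars, cyl = factor(cyl), am = factor(am))

# Two-way ANOVA with interaction, in the same form as cty ~ cyl * trans.
tab <- anova(aov(mpg ~ cyl * am, data = d))

# Named vector of p-values; the Residuals row has none, so drop it.
p <- setNames(tab[["Pr(>F)"]], rownames(tab))
p <- p[!is.na(p)]

alpha <- 0.05  # assumed significance level
significant <- p < alpha
print(data.frame(term = names(p), p_value = signif(p, 3),
                 significant = significant, row.names = NULL))
```

Note that `anova()` on an `aov` fit reports sequential sums of squares, so with unbalanced data the result for each term can depend on the order in which the terms enter the formula.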
In this section we check the adequacy of the ANOVA model.
qqnorm(residuals(model))
qqline(residuals(model))
plot(fitted(model),residuals(model))
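The normal Q-Q plot can be complemented by a formal test of normality of the residuals. The sketch below applies the Shapiro-Wilk test, which is not part of the original analysis and is offered only as an optional check. It is shown on the built-in `mtcars` data as a stand-in, with illustrative object names.

```r
# Stand-in data: built-in mtcars, NOT the fueleconomy data used in this report.
d <- transform(mtcars, cyl = factor(cyl), am = factor(am))
fit <- aov(mpg ~ cyl * am, data = d)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal,
# so a small p-value points to a departure from normality.
sw <- shapiro.test(residuals(fit))
print(sw)
```

On the report's model the equivalent call would be `shapiro.test(residuals(model))`. The plots should still be inspected, since a single test statistic cannot show where or how the residuals depart from normality.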
Interaction Plot
interaction.plot(Y$cyl,Y$trans,Y$cty)
The data from the fueleconomy data set is available at https://github.com/hadley/fueleconomy.