This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
Data of this test is obtained from the ‘fueleconomy’ dataset. Last time we analyzed effect of two factors: manufacture and year, on fuel comsumption. This time we are going to analyze other possible factors that might influence aim of this analysis is to test factors that may influence fuel comsumption using complete randomized block design. A quick view of the dataset is as below.
install.packages("fueleconomy", repos='http://cran.us.r-project.org')
## Installing package into 'C:/Users/acer1/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)
## package 'fueleconomy' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\acer1\AppData\Local\Temp\RtmpgJbLsd\downloaded_packages
library("fueleconomy")
data<-vehicles
head(data)
## id make model year class
## 1 27550 AM General DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 2 28426 AM General DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 3 27549 AM General FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 4 28425 AM General FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 5 1032 AM General Post Office DJ5 2WD 1985 Special Purpose Vehicle 2WD
## 6 1033 AM General Post Office DJ8 2WD 1985 Special Purpose Vehicle 2WD
## trans drive cyl displ fuel hwy cty
## 1 Automatic 3-spd 2-Wheel Drive 4 2.5 Regular 17 18
## 2 Automatic 3-spd 2-Wheel Drive 4 2.5 Regular 17 18
## 3 Automatic 3-spd 2-Wheel Drive 6 4.2 Regular 13 13
## 4 Automatic 3-spd 2-Wheel Drive 6 4.2 Regular 13 13
## 5 Automatic 3-spd Rear-Wheel Drive 4 2.5 Regular 17 16
## 6 Automatic 3-spd Rear-Wheel Drive 6 4.2 Regular 13 13
tail(data)
## id make model year class
## 33437 31064 smart fortwo electric drive cabriolet 2011 Two Seaters
## 33438 33305 smart fortwo electric drive convertible 2013 Two Seaters
## 33439 34393 smart fortwo electric drive convertible 2014 Two Seaters
## 33440 31065 smart fortwo electric drive coupe 2011 Two Seaters
## 33441 33306 smart fortwo electric drive coupe 2013 Two Seaters
## 33442 34394 smart fortwo electric drive coupe 2014 Two Seaters
## trans drive cyl displ fuel hwy cty
## 33437 Automatic (A1) Rear-Wheel Drive NA NA Electricity 79 94
## 33438 Automatic (A1) Rear-Wheel Drive NA NA Electricity 93 122
## 33439 Automatic (A1) Rear-Wheel Drive NA NA Electricity 93 122
## 33440 Automatic (A1) Rear-Wheel Drive NA NA Electricity 79 94
## 33441 Automatic (A1) Rear-Wheel Drive NA NA Electricity 93 122
## 33442 Automatic (A1) Rear-Wheel Drive NA NA Electricity 93 122
summary(data)
## id make model year
## Min. : 1 Length:33442 Length:33442 Min. :1984
## 1st Qu.: 8361 Class :character Class :character 1st Qu.:1991
## Median :16724 Mode :character Mode :character Median :1999
## Mean :17038 Mean :1999
## 3rd Qu.:25265 3rd Qu.:2008
## Max. :34932 Max. :2015
##
## class trans drive cyl
## Length:33442 Length:33442 Length:33442 Min. : 2.00
## Class :character Class :character Class :character 1st Qu.: 4.00
## Mode :character Mode :character Mode :character Median : 6.00
## Mean : 5.77
## 3rd Qu.: 6.00
## Max. :16.00
## NA's :58
## displ fuel hwy cty
## Min. :0.00 Length:33442 Min. : 9.0 Min. : 6.0
## 1st Qu.:2.30 Class :character 1st Qu.: 19.0 1st Qu.: 15.0
## Median :3.00 Mode :character Median : 23.0 Median : 17.0
## Mean :3.35 Mean : 23.6 Mean : 17.5
## 3rd Qu.:4.30 3rd Qu.: 27.0 3rd Qu.: 20.0
## Max. :8.40 Max. :109.0 Max. :138.0
## NA's :57
This time we set five factors that may influence vehicle fuel consumption, namely drive train, transmissio type, fuel type, numbber of cylinders and vehicle size, which are present as ‘drive’, ‘trans’, ‘fuel’, ‘cyl’ and ‘class’ in the dataset. Initially, we hope to test all data provided in order to get a more convincing conclusion, however in the factors of ‘trans’ and ‘class’, there are too many levels which R is unable to analyze. Therefore we select several most common types of transmission and vehicle size as levels for these factors as an example.
# factors and levels setting
M<-data$class=='Compact Cars'|data$class=='Midsize Cars'|data$class=='Sport Utility Vehicle - 4WD'|data$class=='Vans'|data$class=='Standard Pickup Trucks'
N<-data[M,]
P<-N$trans=='Auto (AV-S8)'|N$trans=='Automatic 6-spd'|N$trans=='Manual 6-spd'|N$trans=='Auto (AV)'|N$trans=='Manual 4-spd'|N$trans=='Automatic 4-spd'
sample<-N[P,]
#factors and levels
fuel<-factor(sample$fuel)
cyl<-factor(sample$cyl)
drive<-factor(sample$drive)
class<-factor(sample$class)
trans<-factor(sample$trans)
levels(fuel)
## [1] "CNG" "Diesel"
## [3] "Gasoline or E85" "Midgrade"
## [5] "Premium" "Premium Gas or Electricity"
## [7] "Premium or E85" "Regular"
levels(cyl)
## [1] "4" "5" "6" "8" "10" "12"
levels(drive)
## [1] "4-Wheel Drive" "4-Wheel or All-Wheel Drive"
## [3] "All-Wheel Drive" "Front-Wheel Drive"
## [5] "Part-time 4-Wheel Drive" "Rear-Wheel Drive"
levels(class)
## [1] "Compact Cars" "Midsize Cars"
## [3] "Sport Utility Vehicle - 4WD" "Standard Pickup Trucks"
## [5] "Vans"
levels(trans)
## [1] "Auto (AV-S8)" "Auto (AV)" "Automatic 4-spd" "Automatic 6-spd"
## [5] "Manual 4-spd" "Manual 6-spd"
All factors are categorical variables, and the response variable ‘cty’, which means city fuel economy in mpg (miles per galon) is continuous variable which can be any integer between 6 to 138.
Response variable in this test is ‘cty’, referring to city fuel economy in mpg (miles per gallon)
Detailed information is provided in the ‘fueleconomy’ dataset, including almost everything to evaluate an auto.There are five factors selected in this analysis to investigate how these factors affect vehicle fuel economy.
str(sample)
## Classes 'tbl_df', 'tbl' and 'data.frame': 6225 obs. of 12 variables:
## $ id : int 3347 11789 12673 12674 13421 13422 14163 14164 18458 15069 ...
## $ make : chr "ASC Incorporated" "Acura" "Acura" "Acura" ...
## $ model: chr "GNX" "2.5TL" "2.5TL/3.2TL" "2.5TL/3.2TL" ...
## $ year : int 1987 1995 1996 1996 1997 1997 1998 1998 2003 1999 ...
## $ class: chr "Midsize Cars" "Compact Cars" "Compact Cars" "Compact Cars" ...
## $ trans: chr "Automatic 4-spd" "Automatic 4-spd" "Automatic 4-spd" "Automatic 4-spd" ...
## $ drive: chr "Rear-Wheel Drive" "Front-Wheel Drive" "Front-Wheel Drive" "Front-Wheel Drive" ...
## $ cyl : int 6 5 5 6 5 6 5 6 6 6 ...
## $ displ: num 3.8 2.5 2.5 3.2 2.5 3.2 2.5 3.2 3.2 3.2 ...
## $ fuel : chr "Premium" "Premium" "Premium" "Premium" ...
## $ hwy : int 21 23 23 22 23 22 23 22 26 25 ...
## $ cty : int 14 18 18 17 18 17 17 17 17 17 ...
Fuel economy data are the result of vehicle testing done at the Environmental Protection Agency’s National Vehicle and Fuel Emissions Laboratory in Ann Arbor, Michigan, and by vehicle manufacturers with oversight by EPA. Therefore it is reasonalbe to assume these data are complete randomized. In this case, we will set two factors as blocks, so it is a complete randomized block design.
There is not replication or repeated measures in raw data or this analysis.
There will be two blocks in this study which are selected depend on following exploratory data analysis. (Which turn out to be fuel type and numbers of cylinder)
Obtain data from ‘fueleconomy’ dataset and collect sample data based on most common conditions. Then through exploratory data analysis to decide two blocks (‘fuel’ and ‘cyl’). Conduct ANOVA to test effect of individual factor and interaction between factors on vehicle fuel economy. Following analysis include Tukey’s Honest test and interaction plot. Therefore, null hypothesis is: H0: the difference in vehicle city fuel economy can only be explained by randomization. Alternative hypothesis: Ha: the difference in vehicle city fuel economy can be explained by anything other than randomization.
Fuel economy is a critical element when purchasing a car, and there are many facotrs that may influence car fuel economy other than manufacture and year a car was made, including vehicle type, transmission type, vehicle size and so on. Therefore through this analysis we are able to get a useful reference on how possible factors influence city fuel economy, and then get experience on choosing our own cars.
#Histogram of response variable
hist(sample$cty, ylab="City Fuel Economy(mpg)")
#Boxplot
boxplot(cty~trans, data=sample)
boxplot(cty~drive, data=sample)
boxplot(cty~class, data=sample)
boxplot(cty~fuel, data=sample)
boxplot(cty~cyl, data=sample)
Boxplots revel that for factors of transmission, drive train and vehicle size class, variation between groups is mainly smaller than variation within groups, therefore they can be selected as factors under test. However for fuel type and number of cylinders, varation between groups is larger than variation within groups, therefore we select these two factors as block, and will not consider them in future tests.
ANOVA test is conducted as below:
#variance of drive train, vehicle size class, and transmission
model1=aov(cty ~ drive, data=sample)
anova(model1)
## Analysis of Variance Table
##
## Response: cty
## Df Sum Sq Mean Sq F value Pr(>F)
## drive 5 42492 8498 990 <2e-16 ***
## Residuals 6219 53405 9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model2=aov(cty ~ class, data=sample)
anova(model2)
## Analysis of Variance Table
##
## Response: cty
## Df Sum Sq Mean Sq F value Pr(>F)
## class 4 36995 9249 977 <2e-16 ***
## Residuals 6220 58902 9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model3=aov(cty ~ trans, data=sample)
anova(model3)
## Analysis of Variance Table
##
## Response: cty
## Df Sum Sq Mean Sq F value Pr(>F)
## trans 5 8887 1777 127 <2e-16 ***
## Residuals 6219 87010 14
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#variance of interaction between factors
modeli=aov(cty ~ drive*class*trans, data=sample)
anova(modeli)
## Analysis of Variance Table
##
## Response: cty
## Df Sum Sq Mean Sq F value Pr(>F)
## drive 5 42492 8498 1315.9 < 2e-16 ***
## class 4 5745 1436 222.4 < 2e-16 ***
## trans 5 3380 676 104.7 < 2e-16 ***
## drive:class 8 636 80 12.3 < 2e-16 ***
## drive:trans 10 2467 247 38.2 < 2e-16 ***
## class:trans 7 877 125 19.4 < 2e-16 ***
## drive:class:trans 4 381 95 14.7 5.5e-12 ***
## Residuals 6181 39919 6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
modeli1=aov(cty ~ drive*class, data=sample)
anova(modeli1)
## Analysis of Variance Table
##
## Response: cty
## Df Sum Sq Mean Sq F value Pr(>F)
## drive 5 42492 8498 1122.4 < 2e-16 ***
## class 4 5745 1436 189.7 < 2e-16 ***
## drive:class 8 663 83 10.9 1.9e-15 ***
## Residuals 6207 46997 8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
modeli2=aov(cty ~ drive*trans, data=sample)
anova(modeli2)
## Analysis of Variance Table
##
## Response: cty
## Df Sum Sq Mean Sq F value Pr(>F)
## drive 5 42492 8498 1149.0 <2e-16 ***
## trans 5 4956 991 134.0 <2e-16 ***
## drive:trans 10 2563 256 34.6 <2e-16 ***
## Residuals 6204 45886 7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
modeli3=aov(cty ~ class*trans, data=sample)
anova(modeli3)
## Analysis of Variance Table
##
## Response: cty
## Df Sum Sq Mean Sq F value Pr(>F)
## class 4 36995 9249 1068.0 <2e-16 ***
## trans 5 3000 600 69.3 <2e-16 ***
## class:trans 7 2140 306 35.3 <2e-16 ***
## Residuals 6208 53762 9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
All ANOVA tests for each individual factor have a very small p-value, indicating variance of all three factors are able to explain variance of vehicle city fuel economy. ANOVA tests for interactions between factors also have small p-value, however p-value of some interactions is smaller than p-value of individual factor, therefore probably interaction will decrease relative importance of each factor, but interaction can still explain variance of response variable. Then we can reject the null hypothesis and accept alternative hypothesis, which is variance of vehicle city fuel economy can be explained by something other than randomization.
Summary of data sample and raw data is as below. Compared to raw data set, sample data change in the levels of ‘trans’ and ‘class’, therefore may lead to limitations. However, since sample data contain most common types of transmissions and vehicle class, it is believed to be enough in daily use.
summary(sample)
## id make model year
## Min. : 242 Length:6225 Length:6225 Min. :1985
## 1st Qu.: 7108 Class :character Class :character 1st Qu.:1990
## Median :12821 Mode :character Mode :character Median :1995
## Mean :14338 Mean :1997
## 3rd Qu.:20682 3rd Qu.:2004
## Max. :34909 Max. :2015
## class trans drive cyl
## Length:6225 Length:6225 Length:6225 Min. : 4
## Class :character Class :character Class :character 1st Qu.: 4
## Mode :character Mode :character Mode :character Median : 6
## Mean : 6
## 3rd Qu.: 8
## Max. :12
## displ fuel hwy cty
## Min. :0.00 Length:6225 Min. :11.0 Min. : 8.0
## 1st Qu.:2.40 Class :character 1st Qu.:18.0 1st Qu.:14.0
## Median :3.40 Mode :character Median :22.0 Median :16.0
## Mean :3.63 Mean :22.5 Mean :16.5
## 3rd Qu.:5.00 3rd Qu.:26.0 3rd Qu.:18.0
## Max. :7.40 Max. :43.0 Max. :35.0
summary(data)
## id make model year
## Min. : 1 Length:33442 Length:33442 Min. :1984
## 1st Qu.: 8361 Class :character Class :character 1st Qu.:1991
## Median :16724 Mode :character Mode :character Median :1999
## Mean :17038 Mean :1999
## 3rd Qu.:25265 3rd Qu.:2008
## Max. :34932 Max. :2015
##
## class trans drive cyl
## Length:33442 Length:33442 Length:33442 Min. : 2.00
## Class :character Class :character Class :character 1st Qu.: 4.00
## Mode :character Mode :character Mode :character Median : 6.00
## Mean : 5.77
## 3rd Qu.: 6.00
## Max. :16.00
## NA's :58
## displ fuel hwy cty
## Min. :0.00 Length:33442 Min. : 9.0 Min. : 6.0
## 1st Qu.:2.30 Class :character 1st Qu.: 19.0 1st Qu.: 15.0
## Median :3.00 Mode :character Median : 23.0 Median : 17.0
## Mean :3.35 Mean : 23.6 Mean : 17.5
## 3rd Qu.:4.30 3rd Qu.: 27.0 3rd Qu.: 20.0
## Max. :8.40 Max. :109.0 Max. :138.0
## NA's :57
#shapiro test
#shapiro.test(sample$cty)
Shapiro test reveals a p-value much smaller than 0.1, which means a normally distribution.
#qqplot
qqnorm(residuals(modeli))
qqline(residuals(modeli))
Q-Q plot and Q-Q line of residuals exhibit linear pattern of residuals, which means previous model used is valid.
interaction.plot(sample$trans,sample$class,sample$cty)
interaction.plot(sample$trans,sample$drive,sample$cty)
interaction.plot(sample$drive,sample$class,sample$cty)
Interaction polt does not appear as parallel lines, which means interactions between factor are probably adequate to explain variance of cty.
plot(fitted(modeli),residuals(modeli))
Plot of fitted model and residuals model shows residuals are mainly evenly distributed on each side of 0, indicating the model of interaction among those three factors is adequate enough to explian the variance of cty.
To further understand how interactions of factors affect response variable and avoid the false positives we conduct Tukey’s HSD test.
Tukey_modeli<-TukeyHSD(modeli, order=FALSE, conf.level=0.95)
plot(Tukey_modeli)
Since there are so many combinations, Tukey’s plot is not appear as clear lines, howeve we can still know that not all levels have difference in means containing zero, so it can be assumed that there is difference in means to some extent.
For Shapiro-wilk normality test, we can run this function in R Markdown, but it cannot be knitted to HTML since there are more than 5000 pieces of data.
#Shapiro-Wilk normality test
#data: sample$cty
#W = 0.818, p-value < 2.2e-16