This recipe examines the vehicles data from the fueleconomy package. The dataset contains fuel economy data from vehicle testing done at the Environmental Protection Agency’s National Vehicle and Fuel Emissions Laboratory in Ann Arbor, Michigan. The experiment tests the effect of the number of cylinders and the fuel type on highway fuel economy for Hondas.
install.packages("fueleconomy", repos='http://cran.us.r-project.org')
## package 'fueleconomy' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\tranc3\AppData\Local\Temp\Rtmp0OcwLo\downloaded_packages
library(fueleconomy)
v <- vehicles
In this experiment, the two factors are the fuel type and the number of cylinders. The fuel types are CNG, electricity, premium, and regular. The cylinder counts are 3, 4, and 6.
head(v)
## id make model year class
## 1 27550 AM General DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 2 28426 AM General DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 3 27549 AM General FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 4 28425 AM General FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 5 1032 AM General Post Office DJ5 2WD 1985 Special Purpose Vehicle 2WD
## 6 1033 AM General Post Office DJ8 2WD 1985 Special Purpose Vehicle 2WD
## trans drive cyl displ fuel hwy cty
## 1 Automatic 3-spd 2-Wheel Drive 4 2.5 Regular 17 18
## 2 Automatic 3-spd 2-Wheel Drive 4 2.5 Regular 17 18
## 3 Automatic 3-spd 2-Wheel Drive 6 4.2 Regular 13 13
## 4 Automatic 3-spd 2-Wheel Drive 6 4.2 Regular 13 13
## 5 Automatic 3-spd Rear-Wheel Drive 4 2.5 Regular 17 16
## 6 Automatic 3-spd Rear-Wheel Drive 6 4.2 Regular 13 13
summary(v)
## id make model year
## Min. : 1 Length:33442 Length:33442 Min. :1984
## 1st Qu.: 8361 Class :character Class :character 1st Qu.:1991
## Median :16724 Mode :character Mode :character Median :1999
## Mean :17038 Mean :1999
## 3rd Qu.:25265 3rd Qu.:2008
## Max. :34932 Max. :2015
##
## class trans drive cyl
## Length:33442 Length:33442 Length:33442 Min. : 2.00
## Class :character Class :character Class :character 1st Qu.: 4.00
## Mode :character Mode :character Mode :character Median : 6.00
## Mean : 5.77
## 3rd Qu.: 6.00
## Max. :16.00
## NA's :58
## displ fuel hwy cty
## Min. :0.00 Length:33442 Min. : 9.0 Min. : 6.0
## 1st Qu.:2.30 Class :character 1st Qu.: 19.0 1st Qu.: 15.0
## Median :3.00 Mode :character Median : 23.0 Median : 17.0
## Mean :3.35 Mean : 23.6 Mean : 17.5
## 3rd Qu.:4.30 3rd Qu.: 27.0 3rd Qu.: 20.0
## Max. :8.40 Max. :109.0 Max. :138.0
## NA's :57
The continuous variables in the dataset are engine displacement in litres, highway fuel economy in mpg, and city fuel economy in mpg.
In this experiment, the response variable is highway fuel economy, in mpg.
The dataset was obtained from vehicle testing done at the Environmental Protection Agency’s National Vehicle and Fuel Emissions Laboratory in Ann Arbor, Michigan. The 12 variables are id, make, model, year, class, trans, drive, cyl, displ, fuel, hwy, and cty.
The dataset contains categorical variables such as make/manufacturer, model, year, class, transmission, drive train, number of cylinders, and fuel type. The testing is performed on pre-production vehicles. The vehicle is placed on a machine called a dynamometer that simulates the driving environment. A professional driver runs the vehicle through a standardized driving routine simulating trips in the city or on the highway. The engine exhaust is collected during the tests and measured to calculate the amount of fuel burned.
The ANOVA tests whether the variation in highway mileage can be attributed to variation in the number of cylinders or the type of fuel. The null hypothesis for this experiment is that the variation in highway mileage cannot be attributed to the variation in the number of cylinders or the type of fuel. The alternative is that the variation can be attributed to the variation in the number of cylinders or the type of fuel.
An ANOVA is used to analyze the observed variance in a response variable. The variance is partitioned into components associated with each factor, and each component is tested to determine whether the factor explains the variation. One may assume that the number of cylinders or the fuel type affects the mileage observed when driving on the highway. However, this may not be true, so this experiment is used to test the hypothesis.
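The partition of variance described above can be sketched by hand. This is a minimal illustration, not the recipe's own analysis: it uses the built-in `mtcars` data (mpg by cylinder count) as a stand-in so that it runs without the fueleconomy package, and the variable names are assumptions made for the example.

```r
# Stand-in data: built-in mtcars, NOT the Honda subset used in this recipe
d <- data.frame(y = mtcars$mpg, g = factor(mtcars$cyl))

grand_mean  <- mean(d$y)
group_means <- tapply(d$y, d$g, mean)    # one mean per factor level
group_sizes <- tapply(d$y, d$g, length)  # observations per factor level

# Between-group sum of squares: variation of the group means around the grand mean
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
# Within-group sum of squares: variation of observations around their own group mean
ss_within  <- sum((d$y - group_means[as.character(d$g)])^2)

df_between <- nlevels(d$g) - 1
df_within  <- nrow(d) - nlevels(d$g)

# F is the ratio of the two mean squares
f_by_hand  <- (ss_between / df_between) / (ss_within / df_within)
f_from_aov <- anova(aov(y ~ g, data = d))[["F value"]][1]

print(c(by_hand = f_by_hand, from_aov = f_from_aov))
```

The two values agree, which shows that the F value reported by `anova()` is the between-group mean square divided by the residual mean square.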
Under controlled laboratory conditions and using a standardized test procedure, the engine exhaust is collected to calculate the amount of fuel burned during the test. Judging by the id numbers, the trials do not appear to have been run in random order, because many Hondas were tested consecutively.
There are no replicates. Each car is tested and the engine exhaust is collected and measured.
The design holds the manufacturer constant by subsetting the Honda rows from the whole dataset. This restricts the analysis to one make rather than blocking in the strict sense.
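# Subset the Honda rows
Honda <- subset(v, make == "Honda")
# Treat cylinder count and fuel type as factors
Honda$cyl <- as.factor(Honda$cyl)
Honda$fuel <- as.factor(Honda$fuel)
# Boxplots of highway mpg for Hondas, by cylinder count and by fuel type
boxplot(hwy ~ cyl, data = Honda)
boxplot(hwy ~ fuel, data = Honda)
# One-way ANOVA: highway mpg by cylinder count
model1 <- aov(hwy ~ cyl, data = Honda)
anova(model1)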
## Analysis of Variance Table
##
## Response: hwy
## Df Sum Sq Mean Sq F value Pr(>F)
## cyl 2 14736 7368 198 <2e-16 ***
## Residuals 783 29094 37
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# One-way ANOVA: highway mpg by fuel type
model2 <- aov(hwy ~ fuel, data = Honda)
anova(model2)
## Analysis of Variance Table
##
## Response: hwy
## Df Sum Sq Mean Sq F value Pr(>F)
## fuel 4 12651 3163 58.5 <2e-16 ***
## Residuals 783 42364 54
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Two-way ANOVA with a fuel-by-cylinder interaction term
model3 <- aov(hwy ~ fuel * cyl, data = Honda)
anova(model3)
## Analysis of Variance Table
##
## Response: hwy
## Df Sum Sq Mean Sq F value Pr(>F)
## fuel 3 1466 489 14 7e-09 ***
## cyl 2 15084 7542 216 <2e-16 ***
## Residuals 780 27279 35
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the results of the first ANOVA, we reject the null hypothesis: the variation in highway fuel economy for Hondas can be explained by something other than randomization, and can be attributed to the number of cylinders. The probability of getting an F value of 198 under randomization is less than 2e-16. For the second ANOVA, we also reject the null hypothesis: highway fuel economy for Hondas can be attributed to the type of fuel. For the third ANOVA, we reject the null hypothesis for both fuel type and number of cylinders. The third table contains no row for the fuel-by-cylinder interaction, so the interaction was not tested.
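The F values and p-values quoted above can be recomputed from the sums of squares and degrees of freedom printed in the tables. The sketch below does this for the first two tables. The helper name `f_from_table` is hypothetical, and because the printed table values are rounded, the results match the tables only to that rounding.

```r
# Hypothetical helper: recompute F and its upper-tail p-value from the
# Sum Sq and Df of a term row and of the Residuals row of an ANOVA table.
f_from_table <- function(ss_term, df_term, ss_resid, df_resid) {
  f <- (ss_term / df_term) / (ss_resid / df_resid)
  p <- pf(f, df_term, df_resid, lower.tail = FALSE)
  c(F = f, p = p)
}

# Values copied from the printed tables above
cyl_check  <- f_from_table(14736, 2, 29094, 783)  # first table, cyl row
fuel_check <- f_from_table(12651, 4, 42364, 783)  # second table, fuel row

print(cyl_check)
print(fuel_check)
```

R prints `<2e-16` when a p-value falls below its display threshold, so the tables give an upper bound rather than the exact probability.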
qqnorm(residuals(model3))
qqline(residuals(model3))
plot(fitted(model3), residuals(model3))
interaction.plot(Honda$fuel,Honda$cyl,Honda$hwy)
A Q-Q plot compares the distribution of the residuals with a theoretical normal distribution. The residuals do not follow the Q-Q line, so they do not appear to be normally distributed. In the plot of the residuals against the fitted values, the points do not appear to be randomly scattered. There does not appear to be any interaction based on the interaction plot.
I tried to fix model 3 so that it included the interaction of cyl and fuel, but I was not able to, so my interaction plot didn't work.
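One possible explanation, which the source does not confirm, is that some fuel and cylinder combinations contain no vehicles, leaving the interaction with no estimable degrees of freedom. R drops such a term from the ANOVA table. The sketch below reproduces that behaviour on a small made-up data frame. The values in `toy` are invented purely for illustration and are not taken from the fueleconomy data.

```r
# Made-up illustrative data: the Premium / 6-cylinder cell is empty on purpose
toy <- data.frame(
  fuel = factor(c("Regular", "Regular", "Regular", "Regular", "Premium", "Premium")),
  cyl  = factor(c(4, 4, 6, 6, 4, 4)),
  hwy  = c(30, 32, 25, 27, 28, 29)
)

# Cross-tabulate the two factors to look for empty cells
cells <- table(toy$fuel, toy$cyl)
print(cells)

# Fit the same kind of model as model3: main effects plus interaction
fit <- lm(hwy ~ fuel * cyl, data = toy)

# The interaction coefficient cannot be estimated and is reported as NA
print(coef(fit))

# The ANOVA table lists fuel and cyl but has no fuel:cyl row
print(anova(fit))
```

Running `table(Honda$fuel, Honda$cyl)` on the Honda subset would show whether the same situation applies there.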
A nonparametric test could be used to test the hypothesis, for example the Kruskal-Wallis test or Friedman's test. Both are rank-based: the Kruskal-Wallis test is a rank sum test for independent groups, and Friedman's test ranks the observations within blocks. The Kruskal-Wallis test does not assume that the residuals are normally distributed.
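A Kruskal-Wallis test can be run with `kruskal.test()` from base R. The sketch below uses the built-in `mtcars` data as a stand-in so that it runs without the fueleconomy package, and the column name `hwy_like` is an assumption made for the example. For this recipe, the formula would be `hwy ~ cyl` with `data = Honda`.

```r
# Stand-in data: built-in mtcars, with mpg playing the role of highway mpg
d <- data.frame(hwy_like = mtcars$mpg, cyl = factor(mtcars$cyl))

# Rank-based test of whether the response differs across cylinder groups
kw <- kruskal.test(hwy_like ~ cyl, data = d)
print(kw)
```

The test statistic is compared with a chi-squared distribution whose degrees of freedom equal the number of groups minus one.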