This recipe examines the vehicle data from the fueleconomy package. The dataset contains fuel economy data from vehicle testing done at the Environmental Protection Agency’s National Vehicle and Fuel Emissions Laboratory in Ann Arbor, Michigan. The experiment tests the effect of three factors and two blocking factors on city fuel economy.
install.packages("fueleconomy", repos='http://cran.us.r-project.org')
## package 'fueleconomy' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\tranc3\AppData\Local\Temp\RtmpAXxAlv\downloaded_packages
library("fueleconomy")
v<-vehicles
In this experiment, the three factors being observed are fuel type, number of cylinders, and transmission. The types of fuel are Regular, Premium, Diesel, and Premium or E85. The numbers of cylinders considered are 4, 5, 6, and 8. The types of transmission are Automatic 4-spd, Manual 5-spd, Automatic (S5), Manual 6-spd, Automatic 5-spd, Auto(AV-S7), Automatic (S6), Automatic (S4), Automatic (S7), Automatic 3-spd, Auto(AM-S6), Automatic (variable gear ratios), Automatic (AV), Auto(AV-S8), Automatic (S8), Automatic (AM6), Auto(AM6), and Auto(AM-S7).
head(v)
## id make model year class
## 1 27550 AM General DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 2 28426 AM General DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 3 27549 AM General FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 4 28425 AM General FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 5 1032 AM General Post Office DJ5 2WD 1985 Special Purpose Vehicle 2WD
## 6 1033 AM General Post Office DJ8 2WD 1985 Special Purpose Vehicle 2WD
## trans drive cyl displ fuel hwy cty
## 1 Automatic 3-spd 2-Wheel Drive 4 2.5 Regular 17 18
## 2 Automatic 3-spd 2-Wheel Drive 4 2.5 Regular 17 18
## 3 Automatic 3-spd 2-Wheel Drive 6 4.2 Regular 13 13
## 4 Automatic 3-spd 2-Wheel Drive 6 4.2 Regular 13 13
## 5 Automatic 3-spd Rear-Wheel Drive 4 2.5 Regular 17 16
## 6 Automatic 3-spd Rear-Wheel Drive 6 4.2 Regular 13 13
The continuous variables in the data set are the engine displacement, in litres, highway fuel economy in mpg, and city fuel economy in mpg.
In this experiment, the response variable is the city fuel economy, in mpg.
The dataset was obtained from vehicle testing done at the Environmental Protection Agency’s National Vehicle and Fuel Emissions Laboratory in Ann Arbor, Michigan. The 12 variables are id, make, model, year, class, trans, drive, cyl, displ, fuel, hwy, and cty.
The dataset contains categorical variables such as make/manufacturer, model, year, class, transmission, drive train, number of cylinders, and fuel type. The testing is performed on pre-production vehicles. The vehicle is placed on a machine called a dynamometer that simulates the driving environment. A professional driver runs the vehicle through a standardized driving routine simulating trips in the city or on the highway. The engine exhaust is collected during tests and measured to calculate the amount of fuel burned during the test. The data collected is survey data.
The ANOVA test analyzes the variation in city gas mileage. It examines fuel type, number of cylinders, and transmission type to determine which of them affect city gas mileage. The null hypothesis for this experiment is that the variation in city mileage cannot be attributed to the variation in fuel type, number of cylinders, or transmission type while blocking on the year and the make of the car.
The ANOVA test is used to analyze the observed variance in a variable. The variance is partitioned among factors, which are tested to determine whether they explain the variation. The response of interest in this experiment is city mileage. The test looks at components of the car, namely fuel type, number of cylinders, and transmission, to see their effect on city mileage. The year and make of the car are blocked to reduce the unexplained variation in city mileage.
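To make the variance partition concrete, here is a minimal sketch that computes a one-way F statistic by hand and compares it with the value from `aov()`. The data frame `toy` is made up purely for illustration and is not the vehicles data; the variable names are assumptions chosen to echo this recipe.

```r
# Toy data (made up for illustration; NOT the vehicles data)
set.seed(1)
toy <- data.frame(
  cty = c(rnorm(10, mean = 22), rnorm(10, mean = 19), rnorm(10, mean = 17)),
  cyl = factor(rep(c("4", "6", "8"), each = 10))
)

grand_mean  <- mean(toy$cty)
group_means <- tapply(toy$cty, toy$cyl, mean)
group_n     <- tapply(toy$cty, toy$cyl, length)

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between <- sum(group_n * (group_means - grand_mean)^2)
# Within-group sum of squares: how far each observation is from its group mean
ss_within  <- sum((toy$cty - group_means[as.character(toy$cyl)])^2)

df_between <- nlevels(toy$cyl) - 1
df_within  <- nrow(toy) - nlevels(toy$cyl)

# F is the ratio of the two mean squares
f_manual <- (ss_between / df_between) / (ss_within / df_within)
f_aov    <- anova(aov(cty ~ cyl, data = toy))[["F value"]][1]

print(c(manual = f_manual, aov = f_aov))
```

A large F means the group means differ by more than the within-group scatter would suggest, which is the basis for rejecting the null hypothesis.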
Under controlled conditions in a laboratory and using a standardized test procedure, the engine exhaust is collected to calculate the amount of fuel burned during the test. It is unclear if randomization was used.
There are no replicates. Each car is tested and the engine exhaust is collected and measured.
The design is blocked by year and make. Year is split into two levels: vehicles made in 2000 or earlier and vehicles made after 2000. Make is restricted to two levels, Acura and Audi.
summary (v)
## id make model year
## Min. : 1 Length:33442 Length:33442 Min. :1984
## 1st Qu.: 8361 Class :character Class :character 1st Qu.:1991
## Median :16724 Mode :character Mode :character Median :1999
## Mean :17038 Mean :1999
## 3rd Qu.:25265 3rd Qu.:2008
## Max. :34932 Max. :2015
##
## class trans drive cyl
## Length:33442 Length:33442 Length:33442 Min. : 2.00
## Class :character Class :character Class :character 1st Qu.: 4.00
## Mode :character Mode :character Mode :character Median : 6.00
## Mean : 5.77
## 3rd Qu.: 6.00
## Max. :16.00
## NA's :58
## displ fuel hwy cty
## Min. :0.00 Length:33442 Min. : 9.0 Min. : 6.0
## 1st Qu.:2.30 Class :character 1st Qu.: 19.0 1st Qu.: 15.0
## Median :3.00 Mode :character Median : 23.0 Median : 17.0
## Mean :3.35 Mean : 23.6 Mean : 17.5
## 3rd Qu.:4.30 3rd Qu.: 27.0 3rd Qu.: 20.0
## Max. :8.40 Max. :109.0 Max. :138.0
## NA's :57
#removing NAs in cylinder types
v2<-v[!is.na(v$cyl),]
summary(v2)
## id make model year
## Min. : 1 Length:33384 Length:33384 Min. :1984
## 1st Qu.: 8347 Class :character Class :character 1st Qu.:1991
## Median :16696 Mode :character Mode :character Median :1999
## Mean :17014 Mean :1999
## 3rd Qu.:25227 3rd Qu.:2007
## Max. :34932 Max. :2015
## class trans drive cyl
## Length:33384 Length:33384 Length:33384 Min. : 2.00
## Class :character Class :character Class :character 1st Qu.: 4.00
## Mode :character Mode :character Mode :character Median : 6.00
## Mean : 5.77
## 3rd Qu.: 6.00
## Max. :16.00
## displ fuel hwy cty
## Min. :0.00 Length:33384 Min. : 9.0 Min. : 6.0
## 1st Qu.:2.30 Class :character 1st Qu.:19.0 1st Qu.:15.0
## Median :3.00 Mode :character Median :23.0 Median :17.0
## Mean :3.35 Mean :23.5 Mean :17.4
## 3rd Qu.:4.30 3rd Qu.:27.0 3rd Qu.:20.0
## Max. :8.40 Max. :61.0 Max. :53.0
# Focusing on vehicles with 4, 5, 6, and 8 cylinders
# (compare cyl as a number; comparing against the string "3" would compare as text)
v3<-subset(v2, cyl>3)
v4<-subset(v3, cyl<10)
unique(v4$cyl)
## [1] 4 6 5 8
v4$cyl=as.factor(v4$cyl)
v4$fuel=as.factor(v4$fuel)
v4$trans=as.factor(v4$trans)
#Subset for cars made by Acura or Audi
v5<-subset(v4, make == "Acura" |make== "Audi")
#Years for the cars are placed into two bins: 2000 or earlier, and after 2000
#(binned in one step so that no comparison runs on an already relabeled column)
v5$year<-ifelse(v5$year<=2000, "<=2000", ">2000")
B42000<-subset(v5, year=="<=2000")
A2000<-subset(v5, year==">2000")
IQR(B42000$cty)
## [1] 2
IQR(A2000$cty)
## [1] 4
mean(B42000$cty)-mean(A2000$cty)
## [1] -0.8627
Acura<-subset(v5, make == "Acura")
Audi<-subset(v5, make == "Audi")
IQR(Acura$cty)
## [1] 5
IQR(Audi$cty)
## [1] 3
mean(Acura$cty)-mean(Audi$cty)
## [1] 1.263
For number of cylinders, this experiment focuses on the more common values: 4, 5, 6, and 8. Assuming that car components are similar within an era, we blocked the sample into cars made in 2000 or earlier and cars made after 2000. Likewise, assuming that nice cars like Acura and Audi are similar in their components, we blocked by make. However, looking at the IQRs and the differences between the block means, neither year nor make seems to be a sufficient blocking variable for this design.
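The block comparison described above (spread within each block and the gap between block centers) can be gathered into one table. The following is a minimal sketch on a small made-up data frame, not the vehicles data; `summarise_block` is a hypothetical helper name introduced only for this illustration.

```r
# Toy data (made up for illustration; NOT the vehicles data)
toy <- data.frame(
  cty  = c(18, 19, 20, 21, 22, 17, 18, 18, 19, 23),
  year = rep(c("<=2000", ">2000"), each = 5)
)

# Hypothetical helper: spread and center of one block
summarise_block <- function(x) {
  c(iqr = IQR(x), median = median(x), mean = mean(x))
}

# One column per block, one row per statistic
by_year <- sapply(split(toy$cty, toy$year), summarise_block)
print(by_year)

# Gap between block centers, to set against the within-block IQRs
center_gap <- by_year[c("median", "mean"), "<=2000"] -
              by_year[c("median", "mean"), ">2000"]
print(center_gap)
```

If the gap between block centers is small relative to the within-block IQRs, the blocking variable is separating little of the variation.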
boxplot(cty~cyl,data=v5, ylab="City fuel economy, in mpg", xlab= "Number of Cylinders")
boxplot(cty~fuel, data=v5, ylab="City Fuel Economy, in mpg", xlab="Type of Fuel")
boxplot(cty~trans, data=v5, ylab="City fuel economy, in mpg", xlab="Transmission Type")
Looking at the boxplots for number of cylinders, the city fuel economy values for 4-cylinder vehicles sit higher than those for 5, 6, or 8 cylinders. This makes sense because more cylinders mean more internal friction, which raises fuel consumption. The four boxplots for fuel type are Diesel, Premium, Premium or E85, and Regular. Diesel has the largest range of values for city fuel economy and also a higher median than the rest. According to Carsdirect.com, diesel gets better fuel economy than gasoline, which explains why its median is higher than those of the other fuel types. I believe that the medians for premium and regular should be close because they are basically the same kind of fuel apart from the octane rating. The boxplots of transmission types show considerable variation in city gas mileage across transmission types.
v5$year=as.factor(v5$year)
model11=aov(cty~fuel+year, data=v5)
model12=aov(cty~fuel+make, data=v5)
anova(model11)
## Analysis of Variance Table
##
## Response: cty
## Df Sum Sq Mean Sq F value Pr(>F)
## fuel 3 511 170.4 21.5 1.6e-13 ***
## year 1 238 237.8 30.0 5.4e-08 ***
## Residuals 1005 7959 7.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model12)
## Analysis of Variance Table
##
## Response: cty
## Df Sum Sq Mean Sq F value Pr(>F)
## fuel 3 511 170 21.9 9.3e-14 ***
## make 1 384 384 49.5 3.7e-12 ***
## Residuals 1005 7812 8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model21=aov(cty~cyl+year, data=v5)
model22=aov(cty~cyl+make, data=v5)
anova(model21)
## Analysis of Variance Table
##
## Response: cty
## Df Sum Sq Mean Sq F value Pr(>F)
## cyl 3 5082 1694 476.4 < 2e-16 ***
## year 1 53 53 14.8 0.00012 ***
## Residuals 1005 3574 4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model22)
## Analysis of Variance Table
##
## Response: cty
## Df Sum Sq Mean Sq F value Pr(>F)
## cyl 3 5082 1694 475.6 < 2e-16 ***
## make 1 47 47 13.1 0.00031 ***
## Residuals 1005 3580 4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model31=aov(cty~trans+year, data=v5)
model32=aov(cty~trans+make, data=v5)
anova(model31)
## Analysis of Variance Table
##
## Response: cty
## Df Sum Sq Mean Sq F value Pr(>F)
## trans 17 2409 141.7 22.5 <2e-16 ***
## year 1 66 65.8 10.4 0.0013 **
## Residuals 991 6234 6.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model32)
## Analysis of Variance Table
##
## Response: cty
## Df Sum Sq Mean Sq F value Pr(>F)
## trans 17 2409 142 24.7 <2e-16 ***
## make 1 614 614 107.1 <2e-16 ***
## Residuals 991 5685 6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the first set of ANOVAs, we would reject the null hypothesis whether blocking by year or by make: the variation in city fuel economy can be explained by something other than randomization, and it can be attributed to the type of fuel. For the second set of ANOVAs, we would also reject the null hypothesis; the variation in city fuel economy can be attributed to the number of cylinders. Lastly, for the third set of ANOVAs, we would reject the null hypothesis; the variation in city fuel economy can be attributed to the type of transmission. However, in all three sets the p-value for make or year was also low, so year and make could themselves affect city fuel economy. For blocking, we would have preferred a higher p-value for make or year, so that fuel type, number of cylinders, and transmission would be the only factors affecting city fuel economy.
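The reject/fail-to-reject reading of each table can also be done programmatically by pulling the `Pr(>F)` column out of the `anova()` result. This is a minimal sketch on made-up data, not the vehicles data, with an assumed significance level of 0.05.

```r
# Toy data (made up for illustration; NOT the vehicles data)
set.seed(2)
toy <- data.frame(
  fuel = factor(rep(c("Regular", "Premium", "Diesel"), times = 20)),
  year = factor(rep(c("<=2000", ">2000"), each = 30))
)
# Build a response in which Diesel is clearly higher than the other fuels
toy$cty <- 18 + 4 * (toy$fuel == "Diesel") + rnorm(nrow(toy))

fit <- aov(cty ~ fuel + year, data = toy)
tab <- anova(fit)

alpha <- 0.05  # assumed significance level
terms <- setdiff(rownames(tab), "Residuals")

# One row per model term: its p-value and the resulting decision
decision <- data.frame(
  term        = terms,
  p_value     = tab[terms, "Pr(>F)"],
  reject_null = tab[terms, "Pr(>F)"] < alpha
)
print(decision)
```

The same pattern applies to any of the six fitted models by swapping in the model object.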
qqnorm(residuals(model11))
qqline(residuals(model11))
qqnorm(residuals(model12))
qqline(residuals(model12))
qqnorm(residuals(model21))
qqline(residuals(model21))
qqnorm(residuals(model22))
qqline(residuals(model22))
qqnorm(residuals(model31))
qqline(residuals(model31))
qqnorm(residuals(model32))
qqline(residuals(model32))
plot(fitted(model11), residuals(model11))
plot(fitted(model12), residuals(model12))
plot(fitted(model21), residuals(model21))
plot(fitted(model22), residuals(model22))
plot(fitted(model31), residuals(model31))
plot(fitted(model32), residuals(model32))
A Q-Q plot compares the shape of a sample's distribution against a reference distribution, here the normal. For each model, the residuals follow the Q-Q line at first, but toward the right they drift to the high end, away from the line. A plot of residuals against fitted values supports the model when the points are symmetric about zero with roughly constant spread. The points for the different models appear to be symmetric about zero. For some models this is hard to judge, however, because there are few points toward the right of the graph.
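Since the same two diagnostic plots are drawn for every model, they can be wrapped in a small helper. The sketch below is one possible way to do it; `check_residuals` is a hypothetical name, the data are made up, and the Shapiro-Wilk p-value it returns is an optional numerical complement to the visual check that the original analysis did not use.

```r
# Hypothetical helper: Q-Q plot and residuals-vs-fitted plot side by side
check_residuals <- function(model, label = "") {
  res <- residuals(model)
  old_par <- par(mfrow = c(1, 2))
  on.exit(par(old_par))  # restore the plotting layout on exit

  qqnorm(res, main = paste("Normal Q-Q", label))
  qqline(res)

  plot(fitted(model), res,
       xlab = "Fitted values", ylab = "Residuals",
       main = paste("Residuals vs fitted", label))
  abline(h = 0, lty = 2)  # reference line for symmetry about zero

  # Shapiro-Wilk test of normality (valid for 3 to 5000 residuals)
  invisible(shapiro.test(res)$p.value)
}

# Toy data (made up for illustration; NOT the vehicles data)
set.seed(5)
toy <- data.frame(cyl = factor(rep(c("4", "6", "8"), each = 20)))
toy$cty <- c(22, 19, 16)[as.integer(toy$cyl)] + rnorm(nrow(toy))

p_normal <- check_residuals(aov(cty ~ cyl, data = toy), label = "(toy model)")
print(p_normal)
```

A small Shapiro-Wilk p-value would point the same way as residuals leaving the Q-Q line: the normality assumption is in doubt.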
tukey11<-TukeyHSD(model11, ordered=FALSE, conf.level=.95)
tukey11
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = cty ~ fuel + year, data = v5)
##
## $fuel
## diff lwr upr p adj
## Premium-Diesel -5.5436 -7.4308 -3.6564 0.0000
## Premium or E85-Diesel -3.0667 -6.0231 -0.1103 0.0386
## Regular-Diesel -5.2700 -7.2153 -3.3246 0.0000
## Premium or E85-Premium 2.4770 0.1727 4.7812 0.0294
## Regular-Premium 0.2737 -0.3209 0.8682 0.6368
## Regular-Premium or E85 -2.2033 -4.5554 0.1488 0.0757
##
## $year
## diff lwr upr p adj
## >2000-<=2000 0.8278 0.4624 1.193 0
plot(tukey11)
tukey12<-TukeyHSD(model12, ordered=FALSE, conf.level=.95)
plot(tukey12)
tukey21<-TukeyHSD(model21, ordered=FALSE, conf.level=.95)
tukey21
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = cty ~ cyl + year, data = v5)
##
## $cyl
## diff lwr upr p adj
## 5-4 -4.3993 -4.94474 -3.8538 0.0000
## 6-4 -3.9541 -4.30835 -3.5998 0.0000
## 8-4 -6.2562 -6.74303 -5.7695 0.0000
## 6-5 0.4452 -0.08471 0.9751 0.1347
## 8-5 -1.8570 -2.48328 -1.2307 0.0000
## 8-6 -2.3022 -2.77144 -1.8329 0.0000
##
## $year
## diff lwr upr p adj
## >2000-<=2000 0.4202 0.1754 0.665 8e-04
plot(tukey21)
tukey22<-TukeyHSD(model22, ordered=FALSE, conf.level=.95)
tukey22
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = cty ~ cyl + make, data = v5)
##
## $cyl
## diff lwr upr p adj
## 5-4 -4.3993 -4.94520 -3.8533 0.0000
## 6-4 -3.9541 -4.30865 -3.5995 0.0000
## 8-4 -6.2562 -6.74345 -5.7690 0.0000
## 6-5 0.4452 -0.08516 0.9755 0.1352
## 8-5 -1.8570 -2.48381 -1.2302 0.0000
## 8-6 -2.3022 -2.77184 -1.8325 0.0000
##
## $make
## diff lwr upr p adj
## Audi-Acura -0.4647 -0.7283 -0.201 6e-04
plot(tukey22)
tukey31<-TukeyHSD(model31, ordered=FALSE, conf.level=.95)
plot(tukey31)
tukey32<-TukeyHSD(model32, ordered=FALSE, conf.level=.95)
plot(tukey32)
In the Tukey test, the null hypothesis for each pair is that the two group means do not differ. For fuel, in the model blocked by year, there is a difference in means for every pair except Regular and Premium (p adj = 0.6368) and Regular and Premium or E85 (p adj = 0.0757), whose intervals include 0. For cylinders, the Tukey test shows a difference in means for every pair except the pair with 6 and 5 cylinders. For transmission, the pairs whose intervals do not include 0 differ in mean.
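With many pairs, as for transmission, reading the plot is tedious, so the pairs can be filtered from the `TukeyHSD` result directly. This is a minimal sketch on made-up data, not the vehicles data; the group means are chosen so that the 5- and 6-cylinder groups share the same true mean.

```r
# Toy data (made up for illustration; NOT the vehicles data)
set.seed(3)
toy <- data.frame(cyl = factor(rep(c("4", "5", "6", "8"), each = 15)))
toy$cty <- c(22, 18, 18, 15)[as.integer(toy$cyl)] + rnorm(nrow(toy))

fit <- aov(cty ~ cyl, data = toy)
tk  <- TukeyHSD(fit, conf.level = 0.95)

# Each factor in the model has its own matrix: diff, lwr, upr, "p adj"
pairs <- as.data.frame(tk$cyl)

# A pair differs at the 5% family-wise level when its interval excludes 0
pairs$excludes_zero <- pairs$lwr > 0 | pairs$upr < 0
pairs$significant   <- pairs[["p adj"]] < 0.05

print(pairs)
print(rownames(pairs)[pairs$significant])
```

The interval check and the adjusted p-value are two views of the same decision, so either column can be used to list the differing pairs.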
Since the dataset is survey data with only one set of measurements for each car, it is possible that these results arose by chance. The test simulates city driving, but actually driving in the city may produce different results from those found in the lab. The nuisance factors appear to be controlled in the lab, but in real life many more factors come into play. Also, in the Q-Q plots the residuals appear normal except at the upper end. Since ANOVA assumes normally distributed residuals, we can use the Kruskal-Wallis test as a check. The Kruskal-Wallis test examines whether the variation can be attributed to anything other than randomization without assuming a normal distribution.
kruskal.test(cty~cyl, data=v5)
##
## Kruskal-Wallis rank sum test
##
## data: cty by cyl
## Kruskal-Wallis chi-squared = 658.5, df = 3, p-value < 2.2e-16
kruskal.test(cty~trans, data=v5)
##
## Kruskal-Wallis rank sum test
##
## data: cty by trans
## Kruskal-Wallis chi-squared = 190.7, df = 17, p-value < 2.2e-16
kruskal.test(cty~fuel, data=v5)
##
## Kruskal-Wallis rank sum test
##
## data: cty by fuel
## Kruskal-Wallis chi-squared = 28.8, df = 3, p-value = 2.466e-06
The p-values for the three Kruskal-Wallis tests are all less than alpha = .05. That means we would reject the null hypothesis: the number of cylinders, transmission, and fuel type help explain the variation in city mileage.
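The three tests above follow the same pattern, so they can be run in a loop and compared with alpha in one step. The sketch below uses made-up data, not the vehicles data, and an assumed alpha of 0.05; it calls `kruskal.test(x, g)` with the response and the grouping factor.

```r
# Toy data (made up for illustration; NOT the vehicles data)
set.seed(4)
toy <- data.frame(
  cyl  = factor(rep(c("4", "6", "8"), each = 20)),
  fuel = factor(rep(c("Regular", "Premium"), times = 30))
)
# Only cyl has a built-in effect in this toy response
toy$cty <- c(22, 19, 16)[as.integer(toy$cyl)] + rnorm(nrow(toy))

alpha   <- 0.05  # assumed significance level
factors <- c("cyl", "fuel")

# One Kruskal-Wallis test per factor; keep only the p-value
p_values <- sapply(factors, function(f) {
  kruskal.test(toy$cty, toy[[f]])$p.value
})

reject <- p_values < alpha
print(data.frame(p_value = p_values, reject_null = reject))
```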