Recipe 5

Cheryl Tran

RPI

10/23/2014 Version 1

1. Setting

System under test

This recipe examines the vehicle data from the fueleconomy package. This dataset contains fuel economy data resulting from vehicle testing done at the Environmental Protection Agency’s National Vehicle and Fuel Emissions Laboratory in Ann Arbor, Michigan. This experiment tests the effect of three factors and two blocking factors on city fuel economy.

install.packages("fueleconomy", repos='http://cran.us.r-project.org')

## package 'fueleconomy' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\tranc3\AppData\Local\Temp\RtmpAXxAlv\downloaded_packages

library(fueleconomy)
v<-vehicles

Factors and Levels

In this experiment, the three factors being observed are fuel type, number of cylinders, and transmission. The types of fuel are Regular, Premium, Diesel, and Premium or E85. The numbers of cylinders are 4, 5, 6, and 8. The types of transmission are Automatic 4-spd, Manual 5-spd, Automatic (S5), Manual 6-spd, Automatic 5-spd, Auto(AV-S7), Automatic (S6), Automatic (S4), Automatic (S7), Automatic 3-spd, Auto(AM-S6), Automatic (variable gear ratios), Automatic (AV), Auto(AV-S8), Automatic (S8), Automatic (AM6), Auto(AM6), and Auto(AM-S7).

head(v)

##      id       make               model year                       class
## 1 27550 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 2 28426 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 3 27549 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 4 28425 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 5  1032 AM General Post Office DJ5 2WD 1985 Special Purpose Vehicle 2WD
## 6  1033 AM General Post Office DJ8 2WD 1985 Special Purpose Vehicle 2WD
##             trans            drive cyl displ    fuel hwy cty
## 1 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 2 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 3 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 4 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 5 Automatic 3-spd Rear-Wheel Drive   4   2.5 Regular  17  16
## 6 Automatic 3-spd Rear-Wheel Drive   6   4.2 Regular  13  13

Continuous variables (if any)

The continuous variables in the data set are the engine displacement, in litres, highway fuel economy in mpg, and city fuel economy in mpg.

Response variables

In this experiment, the response variable is the city fuel economy, in mpg.

The Data: How is it organized and what does it look like?

The dataset was obtained from vehicle testing done at the Environmental Protection Agency’s National Vehicle and Fuel Emissions Laboratory in Ann Arbor, Michigan. The 12 variables are id, make, model, year, class, trans, drive, cyl, displ, fuel, hwy, and cty.

Randomization

The dataset contains categorical variables such as make/manufacturer, model, year, class, transmission, drive train, number of cylinders, and fuel type. The testing is performed on pre-production vehicles. The vehicle is placed on a machine called a dynamometer that simulates the driving environment. A professional driver runs the vehicle through a standardized driving routine simulating trips in the city or on the highway. The engine exhaust is collected during the tests and measured to calculate the amount of fuel burned. The data collected are survey data.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

The ANOVA test analyzes the variation in city gas mileage. It examines fuel type, number of cylinders, and transmission type to determine which of them affect city gas mileage. The null hypothesis for this experiment is that the variation in city mileage cannot be attributed to variation in fuel type, number of cylinders, or transmission type, while blocking on the year and the make of the car.

What is the rationale for this design?

ANOVA is used to analyze the observed variance in a response variable. The total variance is partitioned among the factors, which are then tested to determine whether they explain the variation. The response of interest in this experiment is city mileage. The test looks at components of the car (fuel type, number of cylinders, and transmission) to see their effect on city mileage. The year and make of the car are used as blocks to reduce the unexplained variation in city mileage.
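As a sketch of the model behind each test (the notation here is ours and is not part of the original recipe), a single factor combined with one blocking variable and no interaction term can be written as follows, where y is city mileage, tau_i is the effect of level i of the factor (for example, a fuel type), beta_j is the effect of block j (a year bin or a make), and epsilon is the random error:

```latex
y_{ijk} = \mu + \tau_i + \beta_j + \varepsilon_{ijk},
\qquad \varepsilon_{ijk} \sim N(0, \sigma^2)

H_0:\ \tau_1 = \tau_2 = \dots = \tau_a = 0
\qquad \text{vs.} \qquad
H_1:\ \tau_i \neq 0 \ \text{for at least one } i
```

The F statistic for the factor is the ratio of its mean square to the residual mean square; a large ratio is evidence against the null hypothesis.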

Randomize: What is the Randomization Scheme?

Under controlled conditions in a laboratory and using a standardized test procedure, the engine exhaust is collected to calculate the amount of fuel burned during the test. It is unclear if randomization was used.

Replicate: Are there replicates and/or repeated measures?

There are no replicates. Each car is tested and the engine exhaust is collected and measured.

Block: Did you use blocking in the design?

The design is blocked by year and make. Year is split into two levels: vehicles made in 2000 or earlier and vehicles made after 2000. Make has two levels: Acura and Audi.
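The year split described above can be sketched in a few lines. The years below are hypothetical values chosen for illustration and are not taken from the dataset; the point is that the boundary year 2000 falls in the first bin.

```r
# Hypothetical model years, for illustration only
years <- c(1984, 1999, 2000, 2001, 2015)

# Two-level blocking factor: 2000 or earlier versus after 2000
year_bin <- factor(ifelse(years <= 2000, "<=2000", ">2000"),
                   levels = c("<=2000", ">2000"))

# Count how many vehicles fall in each block
table(year_bin)
```

Binning from the numeric years in a single step avoids comparing years as text, which depends on the locale's sort order.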

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

summary(v)

##        id            make              model                year     
##  Min.   :    1   Length:33442       Length:33442       Min.   :1984  
##  1st Qu.: 8361   Class :character   Class :character   1st Qu.:1991  
##  Median :16724   Mode  :character   Mode  :character   Median :1999  
##  Mean   :17038                                         Mean   :1999  
##  3rd Qu.:25265                                         3rd Qu.:2008  
##  Max.   :34932                                         Max.   :2015  
##                                                                      
##     class              trans              drive                cyl       
##  Length:33442       Length:33442       Length:33442       Min.   : 2.00  
##  Class :character   Class :character   Class :character   1st Qu.: 4.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.00  
##                                                           Mean   : 5.77  
##                                                           3rd Qu.: 6.00  
##                                                           Max.   :16.00  
##                                                           NA's   :58     
##      displ          fuel                hwy             cty       
##  Min.   :0.00   Length:33442       Min.   :  9.0   Min.   :  6.0  
##  1st Qu.:2.30   Class :character   1st Qu.: 19.0   1st Qu.: 15.0  
##  Median :3.00   Mode  :character   Median : 23.0   Median : 17.0  
##  Mean   :3.35                      Mean   : 23.6   Mean   : 17.5  
##  3rd Qu.:4.30                      3rd Qu.: 27.0   3rd Qu.: 20.0  
##  Max.   :8.40                      Max.   :109.0   Max.   :138.0  
##  NA's   :57

#removing NAs in cylinder types
v2<-v[!is.na(v$cyl),]
summary(v2)

##        id            make              model                year     
##  Min.   :    1   Length:33384       Length:33384       Min.   :1984  
##  1st Qu.: 8347   Class :character   Class :character   1st Qu.:1991  
##  Median :16696   Mode  :character   Mode  :character   Median :1999  
##  Mean   :17014                                         Mean   :1999  
##  3rd Qu.:25227                                         3rd Qu.:2007  
##  Max.   :34932                                         Max.   :2015  
##     class              trans              drive                cyl       
##  Length:33384       Length:33384       Length:33384       Min.   : 2.00  
##  Class :character   Class :character   Class :character   1st Qu.: 4.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.00  
##                                                           Mean   : 5.77  
##                                                           3rd Qu.: 6.00  
##                                                           Max.   :16.00  
##      displ          fuel                hwy            cty      
##  Min.   :0.00   Length:33384       Min.   : 9.0   Min.   : 6.0  
##  1st Qu.:2.30   Class :character   1st Qu.:19.0   1st Qu.:15.0  
##  Median :3.00   Mode  :character   Median :23.0   Median :17.0  
##  Mean   :3.35                      Mean   :23.5   Mean   :17.4  
##  3rd Qu.:4.30                      3rd Qu.:27.0   3rd Qu.:20.0  
##  Max.   :8.40                      Max.   :61.0   Max.   :53.0

# Focusing on vehicles with 4, 5, 6, and 8 cylinders
# (compare cyl with numbers; comparing with the string "3" falls back to text ordering)
v3<-subset(v2, cyl>3)
v4<-subset(v3, cyl<10)
unique(v4$cyl)

## [1] 4 6 5 8

v4$cyl=as.factor(v4$cyl)
v4$fuel=as.factor(v4$fuel)
v4$trans=as.factor(v4$trans)
#Subset for cars made by Acura or Audi
v5<-subset(v4, make == "Acura" |make== "Audi")
#Years for the cars are placed into two bins: 2000 or earlier, and after 2000
#(binned in one step so the numeric years are never compared as text)
v5$year<-ifelse(v5$year<=2000, "<=2000", ">2000")
B42000<-subset(v5, year=="<=2000")
A2000<-subset(v5, year==">2000")
IQR(B42000$cty)

## [1] 2

IQR(A2000$cty)

## [1] 4

mean(B42000$cty)-mean(A2000$cty)

## [1] -0.8627

Acura<-subset(v5, make == "Acura")
Audi<-subset(v5, make == "Audi")
IQR(Acura$cty)

## [1] 5

IQR(Audi$cty)

## [1] 3

mean(Acura$cty)-mean(Audi$cty)

## [1] 1.263

When looking at the number of cylinders, this experiment focuses on the more common values: 4, 5, 6, and 8. Assuming that car components are similar within an era, we blocked the sample into cars made in 2000 or earlier and cars made after 2000. Likewise, assuming that upmarket makes such as Acura and Audi use similar components, we blocked by make. However, judging by the IQRs and the differences between the group means, neither year nor make appears to be a sufficient blocking variable for this design.
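The per-block comparison above can be gathered into one table. This is a minimal sketch: `block_summary` is a hypothetical helper, and the built-in `mtcars` data (with `am` standing in for a two-level block) is used only so the example runs without the fueleconomy package. With the recipe's data, the call would be along the lines of `block_summary(v5, "cty", "make")`.

```r
# Summarize a response within each level of a candidate blocking variable.
# (Hypothetical helper, not part of the original analysis.)
block_summary <- function(data, response, block) {
  groups <- split(data[[response]], data[[block]])
  data.frame(
    level  = names(groups),
    n      = vapply(groups, length, integer(1)),
    mean   = vapply(groups, mean, numeric(1)),
    median = vapply(groups, median, numeric(1)),
    iqr    = vapply(groups, IQR, numeric(1))
  )
}

# Stand-in data: mtcars ships with R; "am" plays the role of the block.
am_summary <- block_summary(mtcars, "mpg", "am")
am_summary
```

Seeing the means, medians, and IQRs side by side makes it easier to judge whether the blocks differ enough to be worth including.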

boxplot(cty~cyl,data=v5, ylab="City fuel economy, in mpg", xlab= "Number of Cylinders")

boxplot(cty~fuel, data=v5, ylab="City Fuel Economy, in mpg", xlab="Type of Fuel")

boxplot(cty~trans, data=v5, ylab="City fuel economy, in mpg", xlab="Transmission Type")

Looking at the boxplots by number of cylinders, city fuel economy is higher for 4-cylinder cars than for cars with 5, 6, or 8 cylinders. This makes sense because more cylinders mean more internal friction, which raises fuel consumption. The four boxplots by fuel type are Diesel, Premium, Premium or E85, and Regular. Diesel has the largest range of city fuel economy values and also a higher median than the other fuel types. According to Carsdirect.com, diesel delivers better fuel economy than gasoline, which explains why its median is higher. I believe the medians for Premium and Regular should be close because the two are essentially the same kind of fuel apart from the octane rating. The boxplots by transmission type show substantial variation in city gas mileage across transmissions.

Testing

v5$year=as.factor(v5$year)
model11=aov(cty~fuel+year, data=v5)
model12=aov(cty~fuel+make, data=v5)
anova(model11)

## Analysis of Variance Table
## 
## Response: cty
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## fuel         3    511   170.4    21.5 1.6e-13 ***
## year         1    238   237.8    30.0 5.4e-08 ***
## Residuals 1005   7959     7.9                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(model12)

## Analysis of Variance Table
## 
## Response: cty
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## fuel         3    511     170    21.9 9.3e-14 ***
## make         1    384     384    49.5 3.7e-12 ***
## Residuals 1005   7812       8                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

model21=aov(cty~cyl+year, data=v5)
model22=aov(cty~cyl+make, data=v5)
anova(model21)

## Analysis of Variance Table
## 
## Response: cty
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## cyl          3   5082    1694   476.4 < 2e-16 ***
## year         1     53      53    14.8 0.00012 ***
## Residuals 1005   3574       4                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(model22)

## Analysis of Variance Table
## 
## Response: cty
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## cyl          3   5082    1694   475.6 < 2e-16 ***
## make         1     47      47    13.1 0.00031 ***
## Residuals 1005   3580       4                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

model31=aov(cty~trans+year, data=v5)
model32=aov(cty~trans+make, data=v5)
anova(model31)

## Analysis of Variance Table
## 
## Response: cty
##            Df Sum Sq Mean Sq F value Pr(>F)    
## trans      17   2409   141.7    22.5 <2e-16 ***
## year        1     66    65.8    10.4 0.0013 ** 
## Residuals 991   6234     6.3                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(model32)

## Analysis of Variance Table
## 
## Response: cty
##            Df Sum Sq Mean Sq F value Pr(>F)    
## trans      17   2409     142    24.7 <2e-16 ***
## make        1    614     614   107.1 <2e-16 ***
## Residuals 991   5685       6                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the first pair of ANOVAs, we reject the null hypothesis whether blocking by year or by make: the variation in city fuel economy can be explained by something other than randomization, and it can be attributed to the type of fuel. For the second pair of ANOVAs, we also reject the null hypothesis, and the variation in city fuel economy can be attributed to the number of cylinders. Lastly, for the third pair, we reject the null hypothesis, and the variation in city fuel economy can be attributed to the type of transmission. However, in all three pairs the p-value for year or make was also low, so year and make could themselves affect city fuel economy. For blocking, we would have preferred a higher p-value for make or year, so that fuel type, number of cylinders, and transmission would be the only factors affecting city fuel economy.
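To make the tables easier to read, the F value for fuel in the first ANOVA table (cty ~ fuel + year) can be rebuilt from its sums of squares and degrees of freedom. This is a sketch that uses the rounded values as printed, so the result only approximately matches the table.

```r
# Values copied from the first ANOVA table (cty ~ fuel + year), as printed
ss_fuel  <- 511
df_fuel  <- 3
ss_resid <- 7959
df_resid <- 1005

# Mean square = sum of squares / degrees of freedom
ms_fuel  <- ss_fuel / df_fuel
ms_resid <- ss_resid / df_resid

# F statistic = factor mean square / residual mean square
f_fuel <- ms_fuel / ms_resid

# Upper-tail probability of the F distribution gives the p-value
p_fuel <- pf(f_fuel, df_fuel, df_resid, lower.tail = FALSE)

round(c(MS_fuel = ms_fuel, MS_resid = ms_resid, F = f_fuel), 2)
p_fuel
```

The computed F is about 21.5, in line with the printed table, and the same arithmetic applies to the year and make rows.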

Diagnostics/Model Adequacy Checking

qqnorm(residuals(model11))
qqline(residuals(model11))

qqnorm(residuals(model12))
qqline(residuals(model12))

qqnorm(residuals(model21))
qqline(residuals(model21))

qqnorm(residuals(model22))
qqline(residuals(model22))

qqnorm(residuals(model31))
qqline(residuals(model31))

qqnorm(residuals(model32))
qqline(residuals(model32))

plot(fitted(model11), residuals(model11))

plot(fitted(model12), residuals(model12))

plot(fitted(model21), residuals(model21))

plot(fitted(model22), residuals(model22))

plot(fitted(model31), residuals(model31))

plot(fitted(model32), residuals(model32))

A Q-Q plot compares the distribution of the residuals with a normal distribution. For each model, the residuals follow the Q-Q line at first, but toward the right of the plot they run high relative to the line. A plot of residuals against fitted values indicates a good fit when the points are scattered symmetrically about zero with roughly constant spread. The points for the different models appear to be symmetric about zero. However, for some models this is hard to judge because there aren't many points toward the right of the graph.
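The visual reading of the Q-Q plots can be backed by numbers. The sketch below measures how far the residuals sit from the reference line that `qqline()` draws (the line through the first and third quartiles) and reports the residual spread per factor level. The model is fitted to the built-in `mtcars` data purely as a stand-in; with the recipe's objects, `fit` would be one of `model11` to `model32`.

```r
# Stand-in model so the example runs on its own; not the recipe's data.
fit <- aov(mpg ~ factor(cyl) + factor(am), data = mtcars)
res <- residuals(fit)

# Q-Q coordinates without drawing: x = theoretical normal quantiles,
# y = the residuals, both in the original data order
qq <- qqnorm(res, plot.it = FALSE)

# Rebuild the qqline() reference line through the quartiles
q_res  <- unname(quantile(res, c(0.25, 0.75)))
q_norm <- qnorm(c(0.25, 0.75))
slope     <- diff(q_res) / diff(q_norm)
intercept <- q_res[1] - slope * q_norm[1]

# Positive values lie above the reference line, negative values below it
deviation <- qq$y - (intercept + slope * qq$x)

# Average deviation in the upper tail (theoretical quantile above 1)
upper_tail <- mean(deviation[qq$x > 1])
upper_tail

# Residual standard deviation per factor level: a rough constant-variance check
spread <- tapply(res, mtcars$cyl, sd)
spread
```

A clearly positive upper-tail average corresponds to residuals that rise above the line on the right of the plot, and very unequal standard deviations across levels would question the constant-variance assumption.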

tukey11<-TukeyHSD(model11, ordered=FALSE, conf.level=.95)
tukey11

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = cty ~ fuel + year, data = v5)
## 
## $fuel
##                           diff     lwr     upr  p adj
## Premium-Diesel         -5.5436 -7.4308 -3.6564 0.0000
## Premium or E85-Diesel  -3.0667 -6.0231 -0.1103 0.0386
## Regular-Diesel         -5.2700 -7.2153 -3.3246 0.0000
## Premium or E85-Premium  2.4770  0.1727  4.7812 0.0294
## Regular-Premium         0.2737 -0.3209  0.8682 0.6368
## Regular-Premium or E85 -2.2033 -4.5554  0.1488 0.0757
## 
## $year
##                diff    lwr   upr p adj
## >2000-<=2000 0.8278 0.4624 1.193     0

plot(tukey11)

tukey12<-TukeyHSD(model12, ordered=FALSE, conf.level=.95)
plot(tukey12)

tukey21<-TukeyHSD(model21, ordered=FALSE, conf.level=.95)
tukey21

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = cty ~ cyl + year, data = v5)
## 
## $cyl
##        diff      lwr     upr  p adj
## 5-4 -4.3993 -4.94474 -3.8538 0.0000
## 6-4 -3.9541 -4.30835 -3.5998 0.0000
## 8-4 -6.2562 -6.74303 -5.7695 0.0000
## 6-5  0.4452 -0.08471  0.9751 0.1347
## 8-5 -1.8570 -2.48328 -1.2307 0.0000
## 8-6 -2.3022 -2.77144 -1.8329 0.0000
## 
## $year
##                diff    lwr   upr p adj
## >2000-<=2000 0.4202 0.1754 0.665 8e-04

plot(tukey21)

tukey22<-TukeyHSD(model22, ordered=FALSE, conf.level=.95)
tukey22

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = cty ~ cyl + make, data = v5)
## 
## $cyl
##        diff      lwr     upr  p adj
## 5-4 -4.3993 -4.94520 -3.8533 0.0000
## 6-4 -3.9541 -4.30865 -3.5995 0.0000
## 8-4 -6.2562 -6.74345 -5.7690 0.0000
## 6-5  0.4452 -0.08516  0.9755 0.1352
## 8-5 -1.8570 -2.48381 -1.2302 0.0000
## 8-6 -2.3022 -2.77184 -1.8325 0.0000
## 
## $make
##               diff     lwr    upr p adj
## Audi-Acura -0.4647 -0.7283 -0.201 6e-04

plot(tukey22)

tukey31<-TukeyHSD(model31, ordered=FALSE, conf.level=.95)
plot(tukey31)

tukey32<-TukeyHSD(model32, ordered=FALSE, conf.level=.95)
plot(tukey32)

For the Tukey test, the null hypothesis states that there is no difference between the means of a pair of levels. For fuel (in the model with year), the output shows a difference in means for every pair except Regular-Premium (p adj = 0.64) and Regular-Premium or E85 (p adj = 0.076), whose intervals include 0. For cylinders, the Tukey test shows a difference in means for every pair except 6 and 5 cylinders. For transmission, there is a difference in means for those pairs whose interval does not include 0.
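With 18 transmission levels the Tukey plots are crowded, so it can help to list the significant pairs directly. A minimal sketch: `significant_pairs` is a hypothetical helper that filters a `TukeyHSD` result by adjusted p-value, demonstrated on the built-in `mtcars` data as a stand-in. With the recipe's objects the call would be along the lines of `significant_pairs(tukey31, "trans")`.

```r
# Keep only the pairwise comparisons whose adjusted p-value is below alpha.
# (Hypothetical helper, not part of the original analysis.)
significant_pairs <- function(tukey, term, alpha = 0.05) {
  tab <- tukey[[term]]
  tab[tab[, "p adj"] < alpha, , drop = FALSE]
}

# Stand-in data: mtcars ships with R; cyl must be a factor for TukeyHSD.
mt  <- transform(mtcars, cyl = factor(cyl))
fit <- aov(mpg ~ cyl, data = mt)
tk  <- TukeyHSD(fit, "cyl", conf.level = 0.95)

sig <- significant_pairs(tk, "cyl")
sig
```

Every row returned has a confidence interval that excludes 0, which is the same criterion used when reading the plots.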

4. Contingencies

Since the dataset is a survey and there is only one set of measurements for each car, it is possible that these values arose by chance. The test simulated city driving, but actually driving in the city may produce different results from those found in the lab. The nuisance factors appear to be controlled in the lab, but in real life many more factors can come into play. Also, the Q-Q plots show residuals that look normal except in the upper tail. Since ANOVA assumes normality, we can use the Kruskal-Wallis test as an alternative. The Kruskal-Wallis test checks whether the variation can be attributed to anything other than randomization without assuming a normal distribution.

kruskal.test(cty~cyl, data=v5)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  cty by cyl
## Kruskal-Wallis chi-squared = 658.5, df = 3, p-value < 2.2e-16

kruskal.test(cty~trans, data=v5)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  cty by trans
## Kruskal-Wallis chi-squared = 190.7, df = 17, p-value < 2.2e-16

kruskal.test(cty~fuel, data=v5)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  cty by fuel
## Kruskal-Wallis chi-squared = 28.8, df = 3, p-value = 2.466e-06

The p-values for all three Kruskal-Wallis tests are less than alpha = .05. We therefore reject the null hypothesis in each case: the number of cylinders, transmission, and fuel type each help explain the variation in city mileage.
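The three tests can also be run in one pass and collected into a single table. This is a sketch only: `kw_table` is a hypothetical helper, and the built-in `mtcars` data stands in for the recipe's data so that the example is self-contained. With the recipe's objects the call would be along the lines of `kw_table(v5, "cty", c("cyl", "trans", "fuel"))`.

```r
# Run a Kruskal-Wallis test of the response against each factor in turn.
# (Hypothetical helper, not part of the original analysis.)
kw_table <- function(data, response, factors) {
  rows <- lapply(factors, function(f) {
    kw <- kruskal.test(data[[response]], factor(data[[f]]))
    data.frame(
      factor    = f,
      statistic = unname(kw$statistic),
      df        = unname(kw$parameter),
      p.value   = kw$p.value
    )
  })
  do.call(rbind, rows)
}

# Stand-in data: mtcars ships with R.
kw <- kw_table(mtcars, "mpg", c("cyl", "gear", "am"))
kw
```

Each row reports the chi-squared statistic, its degrees of freedom (number of levels minus one), and the p-value to compare against alpha.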

5. References to the literature

6. Appendices

A summary of, or pointer to, the raw data

complete and documented R code