Recipe 2

Cheryl Tran

RPI

10/1/2014 Version 1

1. Setting

System under test

This recipe examines the vehicles data from the fueleconomy package. The dataset contains fuel economy data from vehicle testing done at the Environmental Protection Agency’s National Vehicle and Fuel Emissions Laboratory in Ann Arbor, Michigan. This experiment tests the effect of the number of cylinders and the fuel type on highway fuel economy for Hondas.

install.packages("fueleconomy", repos='http://cran.us.r-project.org')

## package 'fueleconomy' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\tranc3\AppData\Local\Temp\Rtmp0OcwLo\downloaded_packages

# Load the package from the default library path
library("fueleconomy")
v <- vehicles

Factors and Levels

In this experiment, the two factors being observed are the fuel type and the number of cylinders. The types of fuel are CNG, electricity, premium, and regular. The numbers of cylinders are 3, 4, and 6.
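The levels listed above can be checked directly against the data. The following is a minimal sketch, assuming the fueleconomy package is installed; the names `honda`, `cyl_counts`, and `fuel_counts` are illustrative and not part of the original analysis.

```r
library(fueleconomy)

# Keep only the Honda rows of the vehicles data
honda <- subset(vehicles, make == "Honda")

# Count observations at each level; useNA = "ifany" keeps rows
# with a missing value visible instead of silently dropping them
cyl_counts  <- table(honda$cyl,  useNA = "ifany")
fuel_counts <- table(honda$fuel, useNA = "ifany")

print(cyl_counts)
print(fuel_counts)
```

Comparing the number of levels printed here with the Df column of each ANOVA table (levels minus one) is a quick consistency check.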

head(v)

##      id       make               model year                       class
## 1 27550 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 2 28426 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 3 27549 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 4 28425 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 5  1032 AM General Post Office DJ5 2WD 1985 Special Purpose Vehicle 2WD
## 6  1033 AM General Post Office DJ8 2WD 1985 Special Purpose Vehicle 2WD
##             trans            drive cyl displ    fuel hwy cty
## 1 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 2 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 3 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 4 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 5 Automatic 3-spd Rear-Wheel Drive   4   2.5 Regular  17  16
## 6 Automatic 3-spd Rear-Wheel Drive   6   4.2 Regular  13  13

summary(v)

##        id            make              model                year     
##  Min.   :    1   Length:33442       Length:33442       Min.   :1984  
##  1st Qu.: 8361   Class :character   Class :character   1st Qu.:1991  
##  Median :16724   Mode  :character   Mode  :character   Median :1999  
##  Mean   :17038                                         Mean   :1999  
##  3rd Qu.:25265                                         3rd Qu.:2008  
##  Max.   :34932                                         Max.   :2015  
##                                                                      
##     class              trans              drive                cyl       
##  Length:33442       Length:33442       Length:33442       Min.   : 2.00  
##  Class :character   Class :character   Class :character   1st Qu.: 4.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.00  
##                                                           Mean   : 5.77  
##                                                           3rd Qu.: 6.00  
##                                                           Max.   :16.00  
##                                                           NA's   :58     
##      displ          fuel                hwy             cty       
##  Min.   :0.00   Length:33442       Min.   :  9.0   Min.   :  6.0  
##  1st Qu.:2.30   Class :character   1st Qu.: 19.0   1st Qu.: 15.0  
##  Median :3.00   Mode  :character   Median : 23.0   Median : 17.0  
##  Mean   :3.35                      Mean   : 23.6   Mean   : 17.5  
##  3rd Qu.:4.30                      3rd Qu.: 27.0   3rd Qu.: 20.0  
##  Max.   :8.40                      Max.   :109.0   Max.   :138.0  
##  NA's   :57

Continuous variables (if any)

The continuous variables in the dataset are engine displacement (in litres), highway fuel economy (in mpg), and city fuel economy (in mpg).

Response variables

In this experiment, the response variable is the highway fuel economy, in mpg.

The Data: How is it organized and what does it look like?

The dataset was obtained from vehicle testing done at the Environmental Protection Agency’s National Vehicle and Fuel Emissions Laboratory in Ann Arbor, Michigan. The 12 variables are id, make, model, year, class, trans, drive, cyl, displ, fuel, hwy, and cty.

Randomization

The dataset contains categorical variables such as make/manufacturer, model, year, class, transmission, drive train, number of cylinders, and fuel type. The testing is performed on pre-production vehicles. The vehicle is placed on a machine called a dynamometer that simulates the driving environment. A professional driver runs the vehicle through a standardized driving routine simulating trips in the city or on the highway. The engine exhaust is collected during the tests and measured to calculate the amount of fuel burned.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

The ANOVA tests whether the variation in highway mileage can be attributed to variation in the number of cylinders or the type of fuel. The null hypothesis for this experiment is that the variation in highway mileage cannot be attributed to variation in the number of cylinders or the type of fuel. The alternative is that the variation can be attributed to variation in the number of cylinders or the type of fuel.

What is the rationale for this design?

ANOVA is used to analyze the observed variance in a variable. The variance is partitioned among the factors and tested to determine whether the factors can explain the variation. One may assume that the number of cylinders or the fuel type affects the mileage observed when driving on the highway. However, this may not be true, so this experiment is used to test the hypothesis.

Randomize: What is the Randomization Scheme?

Under controlled conditions in a laboratory and using a standardized test procedure, the engine exhaust is collected to calculate the amount of fuel burned during the test. Judging by the id numbers, the trials do not appear to have been run in random order, because many Hondas were tested consecutively.

Replicate: Are there replicates and/or repeated measures?

There are no replicates. Each car is tested and the engine exhaust is collected and measured.

Block: Did you use blocking in the design?

Blocking was used for this design by subsetting the Honda data from the whole dataset.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

# Subset the Hondas
Honda <- subset(v, make == "Honda")

# Boxplots of highway mpg for Hondas
Honda$cyl <- as.factor(Honda$cyl)
Honda$fuel <- as.factor(Honda$fuel)
boxplot(hwy ~ cyl, data = Honda)

[Figure: boxplot of hwy by cyl for Hondas]

boxplot(hwy~fuel, data=Honda)

[Figure: boxplot of hwy by fuel for Hondas]

Testing

model1=aov(hwy~cyl, data=Honda)
anova(model1)

## Analysis of Variance Table
## 
## Response: hwy
##            Df Sum Sq Mean Sq F value Pr(>F)    
## cyl         2  14736    7368     198 <2e-16 ***
## Residuals 783  29094      37                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

model2=aov(hwy~fuel, data=Honda)
anova(model2)

## Analysis of Variance Table
## 
## Response: hwy
##            Df Sum Sq Mean Sq F value Pr(>F)    
## fuel        4  12651    3163    58.5 <2e-16 ***
## Residuals 783  42364      54                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

model3=aov(hwy~fuel*cyl, data=Honda)
anova(model3)

## Analysis of Variance Table
## 
## Response: hwy
##            Df Sum Sq Mean Sq F value Pr(>F)    
## fuel        3   1466     489      14  7e-09 ***
## cyl         2  15084    7542     216 <2e-16 ***
## Residuals 780  27279      35                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the results of the first ANOVA, we reject the null hypothesis: the variation in highway fuel economy for Hondas is unlikely to be due to randomization alone and can be attributed to the number of cylinders. The probability of getting an F value of 198 or larger under the null hypothesis is less than 2e-16. For the second ANOVA, we also reject the null hypothesis: highway fuel economy for Hondas can be attributed to the type of fuel (F = 58.5, p < 2e-16). For the third ANOVA, both fuel type (F = 14, p = 7e-09) and number of cylinders (F = 216, p < 2e-16) are significant, so we reject the null hypothesis for both main effects. The table contains no fuel:cyl row, so this model does not provide a test of the interaction.
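One way to look into the third model is to count how many vehicles fall into each fuel and cylinder combination, and to fit the main effects without an interaction term. This is a sketch under assumptions, not the original analysis: it assumes the fueleconomy package is installed, drops rows with a missing cylinder count, and uses the illustrative names `honda`, `cells`, and `additive`.

```r
library(fueleconomy)

# Hondas with a recorded cylinder count, with both factors coded as factors
honda <- subset(vehicles, make == "Honda" & !is.na(cyl))
honda$cyl  <- factor(honda$cyl)
honda$fuel <- factor(honda$fuel)

# Cell counts: a zero marks a fuel/cylinder combination with no vehicles
cells <- xtabs(~ fuel + cyl, data = honda)
print(cells)

# Main-effects-only model (assumption: no interaction term is fitted)
additive <- aov(hwy ~ fuel + cyl, data = honda)
print(anova(additive))
```

Because `anova()` on an `aov` fit reports sequential sums of squares, the row for each factor depends on the order of the terms when the design is unbalanced; refitting with `hwy ~ cyl + fuel` shows how much the order matters.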

Diagnostics/Model Adequacy Checking

qqnorm(residuals(model3))
qqline(residuals(model3))

[Figure: normal Q-Q plot of the residuals of model3]

plot(fitted(model3), residuals(model3))

[Figure: residuals of model3 against fitted values]

interaction.plot(Honda$fuel,Honda$cyl,Honda$hwy)

[Figure: interaction plot of hwy by fuel and cyl for Hondas]

A Q-Q plot compares the distribution of the residuals with a normal distribution. Here the residuals do not follow the Q-Q line closely, so they do not appear to be normally distributed. The plot of the residuals against the fitted values does not show a random scatter. The interaction plot does not suggest any interaction.

I tried to fix model 3 so that it included the interaction of cyl and fuel, but was not able to, so my interaction plot didn't work.
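The visual checks can be supplemented with numeric ones. The sketch below is an illustration rather than part of the original analysis: it assumes the fueleconomy package is installed, refits a main-effects model under the hypothetical name `fit`, and uses the Shapiro-Wilk test and per-group standard deviations as stand-ins for the Q-Q and residual plots.

```r
library(fueleconomy)

# Hondas with a recorded cylinder count
honda <- subset(vehicles, make == "Honda" & !is.na(cyl))
honda$cyl  <- factor(honda$cyl)
honda$fuel <- factor(honda$fuel)

# Main-effects model used only to obtain residuals for the checks
fit <- aov(hwy ~ fuel + cyl, data = honda)

# Shapiro-Wilk test: a small p-value suggests non-normal residuals
sw <- shapiro.test(residuals(fit))
print(sw)

# Spread of highway mpg within each cylinder group; very different
# values suggest the equal-variance assumption is doubtful
group_sd <- tapply(honda$hwy, honda$cyl, sd)
print(group_sd)
```

With several hundred observations the Shapiro-Wilk test can flag small departures from normality, so it is best read together with the Q-Q plot.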

4. Contingencies

A nonparametric test could be used to test the hypothesis, for example the Kruskal-Wallis test or the Friedman test. Both are rank-based tests. The Kruskal-Wallis test does not assume a normal distribution of the residuals.
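As a minimal sketch of this contingency, the Kruskal-Wallis test can be run once per factor with `kruskal.test()` from base R. It assumes the fueleconomy package is installed; the names `honda`, `kw_cyl`, and `kw_fuel` are illustrative. The Friedman test is left out because `friedman.test()` expects an unreplicated complete block design, which these data do not obviously form.

```r
library(fueleconomy)

# Hondas with a recorded cylinder count
honda <- subset(vehicles, make == "Honda" & !is.na(cyl))
honda$cyl  <- factor(honda$cyl)
honda$fuel <- factor(honda$fuel)

# Rank-based one-way tests of highway mpg across the levels of each factor
kw_cyl  <- kruskal.test(hwy ~ cyl,  data = honda)
kw_fuel <- kruskal.test(hwy ~ fuel, data = honda)

print(kw_cyl)
print(kw_fuel)
```

A small p-value in either test indicates that at least one group differs in location from the others, without relying on normally distributed residuals.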

5. References to the literature

6. Appendices

A summary of, or pointer to, the raw data

complete and documented R code