Recipe 2

Matthew Macchi

Rensselaer Polytechnic Institute

10/2/14 Version 1

1. Setting

System under test

This recipe conducts an experiment on the fueleconomy dataset. It examines the Audi subset and uses analysis of variance to test how engine displacement and fuel type relate to city mpg, in hopes of supporting or refuting the claim that city mpg does not vary much across the levels of these two factors.

install.packages("fueleconomy", repos='http://cran.us.r-project.org')
## 
## The downloaded binary packages are in
##  /var/folders/55/ql66yz5j3jzgkn6dmnb9sk1c0000gn/T//Rtmp2ANCwu/downloaded_packages
library(fueleconomy)
x<-vehicles
head(x)
##      id       make               model year                       class
## 1 27550 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 2 28426 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 3 27549 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 4 28425 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 5  1032 AM General Post Office DJ5 2WD 1985 Special Purpose Vehicle 2WD
## 6  1033 AM General Post Office DJ8 2WD 1985 Special Purpose Vehicle 2WD
##             trans            drive cyl displ    fuel hwy cty
## 1 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 2 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 3 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 4 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 5 Automatic 3-spd Rear-Wheel Drive   4   2.5 Regular  17  16
## 6 Automatic 3-spd Rear-Wheel Drive   6   4.2 Regular  13  13

Factors and Levels

A factor of an experiment is a controlled independent variable; a variable whose levels are set by the experimenter. In this instance, I am conducting a two-factor analysis.

A level is one of the values that a factor can take. Both factors here have more than two levels (the Df values in the ANOVA tables imply 18 levels of displ and 4 of fuel among the Audi rows), so this is a multi-level analysis.

The first factor that this experiment will examine is the engine displacement of the vehicle, specifically in Audis.

The second factor that I will consider is the fuel type, which describes the type of fuel best suited to the upkeep and optimal performance of the engine.
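As a minimal sketch of how R represents factors and levels, the snippet below converts two columns of a small made-up data frame to factors and counts their levels. The data frame `toy` and its values are hypothetical and chosen only for illustration; they are not taken from the fueleconomy data.

```r
# Hypothetical toy data; not taken from the fueleconomy package.
toy <- data.frame(
  displ = c(1.8, 2.0, 2.0, 2.8, 2.8, 4.2),
  fuel  = c("Premium", "Premium", "Diesel", "Premium", "Regular", "Premium"),
  cty   = c(21, 20, 29, 17, 18, 13)
)

# Treat both predictors as categorical factors.
toy$displ <- as.factor(toy$displ)
toy$fuel  <- as.factor(toy$fuel)

# Each distinct value becomes one level of the factor.
displ_levels <- levels(toy$displ)
fuel_levels  <- levels(toy$fuel)
n_displ <- nlevels(toy$displ)
n_fuel  <- nlevels(toy$fuel)

print(displ_levels)
print(fuel_levels)
```

Converting a numeric column such as displacement to a factor tells R to estimate a separate mean for each distinct value instead of fitting a slope.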

tail(x)
##          id  make                             model year       class
## 33437 31064 smart   fortwo electric drive cabriolet 2011 Two Seaters
## 33438 33305 smart fortwo electric drive convertible 2013 Two Seaters
## 33439 34393 smart fortwo electric drive convertible 2014 Two Seaters
## 33440 31065 smart       fortwo electric drive coupe 2011 Two Seaters
## 33441 33306 smart       fortwo electric drive coupe 2013 Two Seaters
## 33442 34394 smart       fortwo electric drive coupe 2014 Two Seaters
##                trans            drive cyl displ        fuel hwy cty
## 33437 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  79  94
## 33438 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  93 122
## 33439 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  93 122
## 33440 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  79  94
## 33441 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  93 122
## 33442 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  93 122
summary(x)
##        id            make              model                year     
##  Min.   :    1   Length:33442       Length:33442       Min.   :1984  
##  1st Qu.: 8361   Class :character   Class :character   1st Qu.:1991  
##  Median :16724   Mode  :character   Mode  :character   Median :1999  
##  Mean   :17038                                         Mean   :1999  
##  3rd Qu.:25265                                         3rd Qu.:2008  
##  Max.   :34932                                         Max.   :2015  
##                                                                      
##     class              trans              drive                cyl       
##  Length:33442       Length:33442       Length:33442       Min.   : 2.00  
##  Class :character   Class :character   Class :character   1st Qu.: 4.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.00  
##                                                           Mean   : 5.77  
##                                                           3rd Qu.: 6.00  
##                                                           Max.   :16.00  
##                                                           NA's   :58     
##      displ          fuel                hwy             cty       
##  Min.   :0.00   Length:33442       Min.   :  9.0   Min.   :  6.0  
##  1st Qu.:2.30   Class :character   1st Qu.: 19.0   1st Qu.: 15.0  
##  Median :3.00   Mode  :character   Median : 23.0   Median : 17.0  
##  Mean   :3.35                      Mean   : 23.6   Mean   : 17.5  
##  3rd Qu.:4.30                      3rd Qu.: 27.0   3rd Qu.: 20.0  
##  Max.   :8.40                      Max.   :109.0   Max.   :138.0  
##  NA's   :57

Continuous variables (if any)

If a variable can take on any value between its minimum value and its maximum value, it is called a continuous variable; otherwise, it is called a discrete variable.

In this instance, only one variable is treated as continuous: city mpg (cty), the response. Engine displacement is numeric in the raw data, but it is converted to a factor for this analysis, and fuel type is categorical.

Response variables

A response variable is defined as the outcome of a study. It is a variable you would be interested in predicting or forecasting. It is often called a dependent variable or predicted variable. In this instance, the response variable is city gas mileage (cty), since the analysis compares it across the levels of the two factors of interest.

The Data: How is it organized and what does it look like?

The data is organized initially into a 12-column table. The columns are titled as follows: id, make, model, year, class, trans, drive, cyl, displ, fuel, hwy, cty. All columns are numeric except make, model, class, trans, drive, and fuel, which are character strings. Since the experiment focuses on Audi vehicles, the data has been subset to only the rows with that make.

yaudi<-subset(x,x$make=='Audi')
summary(yaudi)
##        id            make              model                year     
##  Min.   :   50   Length:772         Length:772         Min.   :1985  
##  1st Qu.:15072   Class :character   Class :character   1st Qu.:1999  
##  Median :21352   Mode  :character   Mode  :character   Median :2005  
##  Mean   :20672                                         Mean   :2004  
##  3rd Qu.:28635                                         3rd Qu.:2010  
##  Max.   :34931                                         Max.   :2015  
##     class              trans              drive                cyl       
##  Length:772         Length:772         Length:772         Min.   : 4.00  
##  Class :character   Class :character   Class :character   1st Qu.: 4.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.00  
##                                                           Mean   : 5.84  
##                                                           3rd Qu.: 6.00  
##                                                           Max.   :12.00  
##      displ          fuel                hwy            cty      
##  Min.   :1.80   Length:772         Min.   :17.0   Min.   :11.0  
##  1st Qu.:2.00   Class :character   1st Qu.:22.0   1st Qu.:15.0  
##  Median :2.80   Mode  :character   Median :24.0   Median :17.0  
##  Mean   :2.88                      Mean   :24.6   Mean   :17.2  
##  3rd Qu.:3.20                      3rd Qu.:27.0   3rd Qu.:18.0  
##  Max.   :6.30                      Max.   :42.0   Max.   :30.0
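The column types and the subsetting step described above can be checked directly in R. The sketch below does so on a small made-up data frame; `toy` and `toy_audi` are hypothetical names and values standing in for the vehicles table, not the real data.

```r
# Hypothetical toy data standing in for the vehicles table.
toy <- data.frame(
  make  = c("Audi", "Audi", "BMW", "Audi", "Acura"),
  displ = c(1.8, 2.8, 3.0, 4.2, 2.4),
  fuel  = c("Premium", "Premium", "Premium", "Regular", "Regular"),
  cty   = c(21L, 17L, 16L, 13L, 20L),
  stringsAsFactors = FALSE
)

# Report the storage class of each column (numeric, integer, character).
col_classes <- sapply(toy, class)
print(col_classes)

# Keep only the rows for one make; subset() lets column names be used directly.
toy_audi <- subset(toy, make == "Audi")
print(nrow(toy_audi))
```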

Randomization

This data comes from tests conducted at the EPA’s National Vehicle and Fuel Emissions Laboratory in Ann Arbor, Michigan. Since this is the only background information available about the data collection, it is entirely possible that the data were not collected under a completely randomized design.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

To conduct this experiment, I will carry out two separate analyses of the factors at hand. First, I will analyze multiple levels of the engine displacement (displ) of Audis and look at the city mileage (cty) values to see if an obvious difference or pattern can be seen.

Second, I will analyze multiple levels of the fuel type (fuel) of the vehicles, which is the second factor. Again, I will look at the cty values to see if an obvious difference or pattern can be seen.

What is the rationale for this design?

I have chosen to use this type of experimental design to demonstrate proper experimentation with a data set with at least two factors and at least two levels of each factor.

Randomize: What is the Randomization Scheme?

As stated previously, there is no credible proof that this data is randomized. The only randomization involved in this experiment lies in the fact that the factors and their corresponding levels were chosen at random by myself, the experiment conductor.

Replicate: Are there replicates and/or repeated measures?

There are no replicates, but repeated measures do occur between the factors and levels.
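One way to inspect replication is to count the observations in each combination of factor levels: a cell with more than one observation is replicated, and an empty cell contributes nothing to an interaction term. The sketch below illustrates this on made-up data; the `toy` data frame and its values are hypothetical.

```r
# Hypothetical toy data; the factor names mirror the recipe, the values do not.
toy <- data.frame(
  displ = factor(c("1.8", "1.8", "2", "2", "2", "2.8")),
  fuel  = factor(c("Premium", "Premium", "Diesel", "Premium", "Premium", "Regular"))
)

# Cross-tabulate the two factors: one cell per displ/fuel combination.
cell_counts <- table(toy$displ, toy$fuel)
print(cell_counts)

# Count empty cells and cells holding more than one observation.
n_empty      <- sum(cell_counts == 0)
n_replicated <- sum(cell_counts > 1)
```

With the recipe's data, the equivalent call would be `table(yaudi$displ, yaudi$fuel)`.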

Block: Did you use blocking in the design?

The only blocking that I performed in this experimental data analysis is seen in the blocking of vehicles into the different levels of their respective factors.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

At this point, I must define the amount of engine displacement (displ) and the fuel type (fuel) as the factors for analysis.

yaudi$displ=as.factor(yaudi$displ)
yaudi$fuel=as.factor(yaudi$fuel)

Below are a histogram and boxplots of city gas mileage, first for all Audis and then across all levels of the two factors of interest.

par(mfrow=c(1,1))
hist(yaudi$cty)

Plot: histogram of city mpg (cty) for the Audi subset

par(mfrow=c(1,1))
boxplot(yaudi$cty, main="Boxplot of Cty MPG for Audi", xlab="Audi", ylab=" Cty MPG", names=c("Cty"))

Plot: boxplot of city mpg for the Audi subset

boxplot(cty~displ, data=yaudi)

Plot: boxplots of city mpg by engine displacement (displ)

boxplot(cty~fuel, data=yaudi)

Plot: boxplots of city mpg by fuel type (fuel)

Testing

At this point, I introduce the Analysis of Variance (ANOVA) test. Two one-way ANOVA tests analyze the differences in mean city gas mileage of Audis across levels of engine displacement and across fuel types. A third, two-way ANOVA test analyzes the interaction effect between the two factors.

model_displ=aov(cty~displ,data=yaudi)
model_fuel=aov(cty~fuel,data=yaudi)
model_displ_fuel=aov(cty~displ*fuel,data=yaudi)
anova(model_displ)
## Analysis of Variance Table
## 
## Response: cty
##            Df Sum Sq Mean Sq F value Pr(>F)    
## displ      17   4740     279     137 <2e-16 ***
## Residuals 754   1532       2                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model_fuel)
## Analysis of Variance Table
## 
## Response: cty
##            Df Sum Sq Mean Sq F value Pr(>F)    
## fuel        3    658   219.4      30 <2e-16 ***
## Residuals 768   5614     7.3                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model_displ_fuel)
## Analysis of Variance Table
## 
## Response: cty
##             Df Sum Sq Mean Sq F value Pr(>F)    
## displ       17   4740   278.9   200.0 <2e-16 ***
## fuel         3    353   117.8    84.5 <2e-16 ***
## displ:fuel   5    138    27.7    19.8 <2e-16 ***
## Residuals  746   1040     1.4                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
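Each F value in the tables above is the ratio of the factor's mean square to the residual mean square, where a mean square is Sum Sq divided by Df. As a check, the sketch below recomputes the F values of the two one-way models from the rounded Sum Sq and Df entries printed above. The variable and function names are illustrative only, and because the inputs are rounded the results match the tables only approximately.

```r
# Values copied from the printed one-way ANOVA tables (rounded as shown there).
ss_displ <- 4740; df_displ <- 17          # displ row
ss_res_displ <- 1532; df_res_displ <- 754 # residuals of the displ model

ss_fuel <- 658; df_fuel <- 3              # fuel row
ss_res_fuel <- 5614; df_res_fuel <- 768   # residuals of the fuel model

# F = (factor Sum Sq / factor Df) / (residual Sum Sq / residual Df)
f_stat <- function(ss, df, ss_res, df_res) {
  (ss / df) / (ss_res / df_res)
}

f_displ <- f_stat(ss_displ, df_displ, ss_res_displ, df_res_displ)
f_fuel  <- f_stat(ss_fuel, df_fuel, ss_res_fuel, df_res_fuel)
print(round(c(displ = f_displ, fuel = f_fuel), 1))

# The p-value is the upper-tail probability of the F distribution.
p_displ <- pf(f_displ, df_displ, df_res_displ, lower.tail = FALSE)
p_fuel  <- pf(f_fuel, df_fuel, df_res_fuel, lower.tail = FALSE)
```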

ANOVA Results

The ANOVA test that analyzed the variation in city gas mileage across levels of engine displacement returned a p-value below 2e-16. Such a small p-value means that, if displacement had no effect, differences in mean city gas mileage this large would be very unlikely to arise by chance alone. Thus the null hypothesis of equal means is rejected, and the conclusion may be drawn that city gas mileage in Audis differs with the engine’s displacement.

The ANOVA test that analyzed the variation in city gas mileage across fuel types also returned a p-value below 2e-16; R prints any p-value smaller than this threshold as <2e-16. This very small p-value means there is a very small probability that the observed differences in city gas mileage between fuel types are due to chance alone. Thus another conclusion may be drawn that city gas mileage differs with the Audi’s fuel type.

Because both ANOVAs indicated that each factor can affect the city mileage of the vehicles, I then performed an ANOVA to analyze the interaction effect of the two factors. The resulting p-value for the displ:fuel term was once again below 2e-16, which indicates a very small probability that the interaction seen in city gas mileage is due to chance alone; the effect of one factor appears to depend on the level of the other.
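The 2e-16 figure that recurs in these results is a printing convention rather than an exact p-value: R's p-value formatting shows any value below machine epsilon (`.Machine$double.eps`, about 2.2e-16) as less than that threshold. A minimal sketch, using a made-up p-value (`tiny_p` is hypothetical):

```r
# Smallest relative spacing of double-precision numbers (about 2.2e-16).
eps <- .Machine$double.eps
print(eps)

# format.pval() shows any p-value below `eps` as "<eps" instead of its value.
tiny_p <- 1e-30   # hypothetical p-value, for illustration only
shown  <- format.pval(tiny_p, digits = 1)
print(shown)

# The underlying number is still stored and can be compared as usual.
is_below_eps <- tiny_p < eps
```

The three tests therefore share the same printed p-value even though their F values (137, 30, and 19.8 for the interaction) differ considerably.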

Diagnostics/Model Adequacy Checking

To check the adequacy of ANOVA as a means of analyzing this data set, I drew Quantile-Quantile (Q-Q) plots of the residuals to determine whether they follow a normal distribution. I also created an interaction plot to see if there was an interaction effect between the two factors.

The nearly linear fit of the residuals in the first Q-Q plot, for the ‘displ’ model, is an indication that the normality assumption is reasonable and the model is adequate for this analysis.

The non-linear fit of the residuals in the second Q-Q plot, for the ‘fuel’ model, is an indication that the normality assumption is violated and the model is not adequate for this analysis.

The interaction plot following the Q-Q plots suggests that the two factors interact in their effect on the response variable: wherever the curves cross or are not parallel, the effect of one factor depends on the level of the other.

The third type of plot is a Residuals vs. Fits plot, which is used to check whether the spread of the residuals is roughly constant across fitted values and to determine if there are any outlying values. Because there are slightly more outliers in the residuals of the ‘fuel’ model than of the ‘displ’ model, it can be reasoned that the model is slightly less adequate for the ‘fuel’ factor.

qqnorm(residuals(model_displ))
qqline(residuals(model_displ))

Plot: normal Q-Q plot of residuals from the displ model

qqnorm(residuals(model_fuel))
qqline(residuals(model_fuel))

Plot: normal Q-Q plot of residuals from the fuel model

interaction.plot(yaudi$displ, yaudi$fuel, yaudi$cty)

Plot: interaction plot of mean city mpg by displ and fuel

plot(fitted(model_displ),residuals(model_displ))
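
The diagnostics discussion compares residual plots for both one-way models, while the code above draws only the one for the displ model. As a hedged sketch of how the two could be drawn side by side, the example below uses simulated data; `toy`, `fit_displ`, and `fit_fuel` are hypothetical names, not the recipe's objects. With the recipe's data one would substitute `model_displ` and `model_fuel`.

```r
# Simulated data for illustration only; not the fueleconomy values.
set.seed(1)
toy <- data.frame(
  displ = factor(rep(c("1.8", "2.8", "4.2"), each = 10)),
  fuel  = factor(rep(c("Premium", "Regular"), times = 15))
)
# In this simulation city mpg depends on displacement only, plus random noise.
toy$cty <- c(21, 17, 13)[as.integer(toy$displ)] + rnorm(nrow(toy))

fit_displ <- aov(cty ~ displ, data = toy)
fit_fuel  <- aov(cty ~ fuel,  data = toy)

# Draw the two residuals-vs-fitted plots side by side.
old_par <- par(mfrow = c(1, 2))
plot(fitted(fit_displ), residuals(fit_displ),
     xlab = "Fitted", ylab = "Residuals", main = "displ model")
abline(h = 0, lty = 2)
plot(fitted(fit_fuel), residuals(fit_fuel),
     xlab = "Fitted", ylab = "Residuals", main = "fuel model")
abline(h = 0, lty = 2)
par(old_par)

# A numeric companion to the plots: spread of the residuals in each model.
resid_sd <- c(displ = sd(residuals(fit_displ)), fuel = sd(residuals(fit_fuel)))
print(resid_sd)
```

Points far from the dashed zero line are candidate outliers, and a spread that widens or narrows across fitted values would suggest non-constant variance.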