Recipe 10

Cheryl Tran

RPI

12/8/2014 Version 1

1. Setting

System under test

This recipe is examining the the fuel economy dataset. We are looking at how 5 different factors, each with 3 or more levels effect the city mileage.

#installing package
install.packages("fueleconomy", repos='http://cran.us.r-project.org')

## package 'fueleconomy' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\tranc3\AppData\Local\Temp\RtmpeOLzjH\downloaded_packages

library("fueleconomy", lib.loc="C:/Program Files/R/R-3.1.1/library")
v<-vehicles

head(v)

##      id       make               model year                       class
## 1 27550 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 2 28426 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 3 27549 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 4 28425 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 5  1032 AM General Post Office DJ5 2WD 1985 Special Purpose Vehicle 2WD
## 6  1033 AM General Post Office DJ8 2WD 1985 Special Purpose Vehicle 2WD
##             trans            drive cyl displ    fuel hwy cty
## 1 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 2 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 3 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 4 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 5 Automatic 3-spd Rear-Wheel Drive   4   2.5 Regular  17  16
## 6 Automatic 3-spd Rear-Wheel Drive   6   4.2 Regular  13  13

summary(v)

##        id            make              model                year     
##  Min.   :    1   Length:33442       Length:33442       Min.   :1984  
##  1st Qu.: 8361   Class :character   Class :character   1st Qu.:1991  
##  Median :16724   Mode  :character   Mode  :character   Median :1999  
##  Mean   :17038                                         Mean   :1999  
##  3rd Qu.:25265                                         3rd Qu.:2008  
##  Max.   :34932                                         Max.   :2015  
##                                                                      
##     class              trans              drive                cyl        
##  Length:33442       Length:33442       Length:33442       Min.   : 2.000  
##  Class :character   Class :character   Class :character   1st Qu.: 4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.000  
##                                                           Mean   : 5.772  
##                                                           3rd Qu.: 6.000  
##                                                           Max.   :16.000  
##                                                           NA's   :58      
##      displ           fuel                hwy              cty        
##  Min.   :0.000   Length:33442       Min.   :  9.00   Min.   :  6.00  
##  1st Qu.:2.300   Class :character   1st Qu.: 19.00   1st Qu.: 15.00  
##  Median :3.000   Mode  :character   Median : 23.00   Median : 17.00  
##  Mean   :3.353                      Mean   : 23.55   Mean   : 17.49  
##  3rd Qu.:4.300                      3rd Qu.: 27.00   3rd Qu.: 20.00  
##  Max.   :8.400                      Max.   :109.00   Max.   :138.00  
##  NA's   :57

vs<-v[v$make %in% c("Acura","Audi","Honda"),] #subset with three levels of make
vs$make<-as.factor(vs$make)
vs$cyl<-as.numeric(vs$cyl)
vs$cyl<-cut(vs$cyl, c(0,4,6,12))
vs$cyl<-as.factor(vs$cyl) #cyl into three levels
vs$displ<-cut(vs$displ,c(1.0,1.8,2.4,3.1,6.3)) #displ into four levels
vs$displ<-as.factor(vs$displ)
vs<-vs[vs$class %in% c("Compact Cars", "Midsize Cars", "Subcompact Cars"),] #class into three levels
vs$class<-as.factor(vs$class)
vs<-vs[vs$fuel %in% c("Regular", "Premium", "CNG"),] #three levels of fuel
vs$fuel<-as.factor(vs$fuel)
summary(vs)

##        id           make        model                year     
##  Min.   :   50   Acura:197   Length:1072        Min.   :1985  
##  1st Qu.:10678   Audi :463   Class :character   1st Qu.:1993  
##  Median :17365   Honda:412   Mode  :character   Median :2001  
##  Mean   :17691                                  Mean   :2001  
##  3rd Qu.:24788                                  3rd Qu.:2008  
##  Max.   :34926                                  Max.   :2015  
##              class        trans              drive               cyl     
##  Compact Cars   :388   Length:1072        Length:1072        (0,4] :589  
##  Midsize Cars   :269   Class :character   Class :character   (4,6] :385  
##  Subcompact Cars:415   Mode  :character   Mode  :character   (6,12]: 98  
##                                                                          
##                                                                          
##                                                                          
##        displ          fuel          hwy             cty       
##  (1,1.8]  :304   CNG    : 17   Min.   :17.00   Min.   :12.00  
##  (1.8,2.4]:358   Premium:539   1st Qu.:24.00   1st Qu.:17.00  
##  (2.4,3.1]:203   Regular:516   Median :27.00   Median :19.00  
##  (3.1,6.3]:207                 Mean   :27.77   Mean   :20.56  
##                                3rd Qu.:30.00   3rd Qu.:22.00  
##                                Max.   :50.00   Max.   :50.00

Factors and Levels

In this experiment, we are looking at 5 factors with at least 3 levels. The 5 factors are Make, Cyl, Fuel, Displ, and Class. Make has three levels: Acura, Audi, and Honda. Cyl has three levels: (0,4], (4,6], and (6,12]. Fuel has three levels: Regular, Premium, and CNG. Displ has four levels (1.0,1.8], (1.8,2.4],(2.4,3.1], and (3.1,6.3]. And Class has three levels: Compact Cars, Midsize Cars, and SUbcompact Cars.

Continuous variables (if any)

The continuous variable in the experiment is the city milage and displacement

Response variables

In this experiment, we are looking at the City Milage.

The Data: How is it organized and what does it look like?

The dataset used for the experiment has 1072 observations. Organized with 12 variables: Id, make, model, year, class, trans, drive, cyl, displ, fuel, hwy, and cty.

Randomization

The dataset consists of observations so it unknow if there was randomization.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

This experiment being conducted is a Taguchi design to analyze if any of the five factors has an effect on the variation in city mileage.

What is the rationale for this design?

This method is used to model and analyze problems to help improve quality.

Randomize: What is the Randomization Scheme?

The dataset consists of observations so it was not randomized

Replicate: Are there replicates and/or repeated measures?

There are no replicates or repeated measures

Block: Did you use blocking in the design?

There was no blocking used in the design.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

boxplot(cty~make, data=vs, xlab="Make", ylab="City Mileage")
title("City Mileage by Make")

boxplot(cty~cyl, data=vs, xlab="Number of Cylinders", ylab="City Mileage")
title("City Mileage by Number of Cylinders")

boxplot(cty~fuel, data=vs, xlab="Fuel Type", ylab="City Mileage by Fuel Type")
title("City Mileage by Fuel Type")

boxplot(cty~displ, data=vs, xlab="Displacement", ylab="City Mileage")
title("City Mileage by Displacement")

boxplot(cty~class, data=vs, xlab="Class", ylab="City Mileage")
title("City Mileage by Class")

When looking at the boxplots of the city milage by Make, Audis has the smallest range while Honda has the largest range. Also, Honda’s median is higher than Acura and Audi. When looking at the City Mileage by number of Cylinders, it makes sense that the number of mileage goes down as the number of cylinders goes up. A car with more cylinders use up more gas because cars with more cylinders accelerate quicker. Causing gas consumption to be quicker and less mileage. When looking at fuel type, Regular gas has a larger range of values but has a lot of outliers. Engine displacement is the volume of an engine’s cylinders. It makes sense that as displacement goes up there is less mileage which can be seen in the boxplots because the medians of each group is less and less as displacement goes up. When looking at city mileage by class, it seems that they would all be around the same and their medians are similar values.

Testing

#Anova
model1=aov(cty~make+cyl+fuel+displ+class, data=vs)
anova(model1)

## Analysis of Variance Table
## 
## Response: cty
##             Df  Sum Sq Mean Sq  F value    Pr(>F)    
## make         2 12853.0  6426.5 545.2150 < 2.2e-16 ***
## cyl          2  4371.0  2185.5 185.4164 < 2.2e-16 ***
## fuel         2   137.6    68.8   5.8387  0.003007 ** 
## displ        3  3452.9  1151.0  97.6476 < 2.2e-16 ***
## class        2   403.8   201.9  17.1282 4.775e-08 ***
## Residuals 1060 12494.3    11.8                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The null hypothesis for this ANOVA is that the variation in the city mileage can not be attributed to anything other than randomization. The alternate hypothesis is that the variation in the city mileage can be attributed to one of the 5 factors. When looking at the p-value with an alpha level of .05, we would reject the null hypothesis so something other than randomization explains the variation in the city mileage.

library(DoE.base)

## Loading required package: grid
## Loading required package: conf.design
## 
## Attaching package: 'DoE.base'
## 
## The following objects are masked from 'package:stats':
## 
##     aov, lm
## 
## The following object is masked from 'package:graphics':
## 
##     plot.design

#making factors into integers
vs$make=as.integer(vs$make)
vs$cyl=as.integer(vs$cyl)
vs$fuel=as.integer(vs$fuel)
vs$displ=as.integer(vs$displ)
vs$class=as.integer(vs$class)
#make an array
fuel=oa.design(factor.names=c("make","class","cyl","displ","fuel"), nlevels=c(3,3,3,4,3),columns="min3")
#combining array with original dataset
fuel2=merge(fuel,vs,by=c("make","class","cyl","displ","fuel"),all=FALSE)
head(fuel2)

##   make class cyl displ fuel    id       model year           trans
## 1    1     1   2     3    2 12673 2.5TL/3.2TL 1996 Automatic 4-spd
## 2    1     1   2     3    2 11789       2.5TL 1995 Automatic 4-spd
## 3    1     1   2     3    2  8834       Vigor 1992    Manual 5-spd
## 4    1     1   2     3    2 14163 2.5TL/3.2TL 1998 Automatic 4-spd
## 5    1     1   2     3    2  9880       Vigor 1993    Manual 5-spd
## 6    1     1   2     3    2 10847       Vigor 1994 Automatic 4-spd
##               drive hwy cty
## 1 Front-Wheel Drive  23  18
## 2 Front-Wheel Drive  23  18
## 3 Front-Wheel Drive  24  18
## 4 Front-Wheel Drive  23  17
## 5 Front-Wheel Drive  24  18
## 6 Front-Wheel Drive  24  18

#unique rows for data
factors=unique(fuel2[,1:5])
factors

##    make class cyl displ fuel
## 1     1     1   2     3    2
## 11    2     1   2     4    2
## 12    2     2   1     2    2
## 17    2     2   2     2    3
## 55    2     3   1     1    3
## 61    3     2   2     3    3

rownames(factors)

## [1] "1"  "11" "12" "17" "55" "61"

#select response variables
responsevar=fuel2$cty[index=c(1,11,12,17,55,61)]
#combine data
final=cbind(factors,responsevar)
final

##    make class cyl displ fuel responsevar
## 1     1     1   2     3    2          18
## 11    2     1   2     4    2          17
## 12    2     2   1     2    2          20
## 17    2     2   2     2    3          16
## 55    2     3   1     1    3          21
## 61    3     2   2     3    3          18

#signal to noise ratio
SN=-10*log10(1/final$responsevar^2)
SN

## [1] 25.10545 24.60898 26.02060 24.08240 26.44439 25.10545

bestsn=which(SN==max(SN))
bestsn

## [1] 5

Taguchi=lm(final$responsevar~final$make+final$class+final$cyl+final$displ+final$fuel, data=final)
anova(Taguchi)

## Warning in anova.lm(Taguchi): ANOVA F-tests on an essentially perfect fit
## are unreliable

## Analysis of Variance Table
## 
## Response: final$responsevar
##             Df Sum Sq Mean Sq F value Pr(>F)
## final$make   2 0.3333  0.1667               
## final$class  2 9.0000  4.5000               
## final$cyl    1 8.0000  8.0000               
## Residuals    0 0.0000

When looking at our signal to noise ratio, we want a high value. We find that entry 5 and 6 is our best solution. Both entries have 1, 1, 2, 3, 2 for the factors Make, Class, Cyl, Displ, Fuel. From our results of the anova, we got that it essentially was a perfect fit so our results are unreliable. We fail to reject the null hypothesis.

Diagnostics/Model Adequacy Checking

qqnorm(residuals(model1))
qqline(residuals(model1))

plot(fitted(model1),residuals(model1))

A Q-Q plot can be used to compare the shape of the distribution of the dataset. The Q-Q plot and Q-Q line of the residuals for original ANOVA models appear to be normal. However at each of the ends, the residuals seem to stray from the line. When looking at the plot of the fitted model and the residuals, a good fit would be if the plot was symmetric at zero and had a high variation. The points seem to be symmetric at zero however the variation gets larger as it moves towards the right.

qqnorm(residuals(Taguchi))
qqline(residuals(Taguchi))

plot(fitted(Taguchi),residuals(Taguchi))

TukeyHSD(model1, ordered=FALSE, conf.level=.95)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = cty ~ make + cyl + fuel + displ + class, data = vs)
## 
## $make
##                  diff       lwr       upr p adj
## Audi-Acura  -2.122935 -2.808369 -1.437501     0
## Honda-Acura  5.454463  4.756481  6.152445     0
## Honda-Audi   7.577397  7.031662  8.123133     0
## 
## $cyl
##                    diff       lwr         upr     p adj
## (4,6]-(0,4]  -3.0024068 -3.530498 -2.47431581 0.0000000
## (6,12]-(0,4] -3.9806021 -4.859675 -3.10152911 0.0000000
## (6,12]-(4,6] -0.9781954 -1.889886 -0.06650485 0.0319963
## 
## $fuel
##                       diff        lwr      upr     p adj
## Premium-CNG     -0.1286058 -2.1134929 1.856281 0.9873324
## Regular-CNG      0.4069273 -1.5793120 2.393167 0.8803879
## Regular-Premium  0.5355330  0.0392559 1.031810 0.0307718
## 
## $displ
##                           diff       lwr        upr     p adj
## (1.8,2.4]-(1,1.8]   -4.3269419 -5.015932 -3.6379522 0.0000000
## (2.4,3.1]-(1,1.8]   -2.0150731 -2.815794 -1.2143520 0.0000000
## (3.1,6.3]-(1,1.8]   -2.1519842 -2.948053 -1.3559155 0.0000000
## (2.4,3.1]-(1.8,2.4]  2.3118689  1.535704  3.0880337 0.0000000
## (3.1,6.3]-(1.8,2.4]  2.1749577  1.403593  2.9463221 0.0000000
## (3.1,6.3]-(2.4,3.1] -0.1369112 -1.009522  0.7357001 0.9777464
## 
## $class
##                                    diff       lwr        upr     p adj
## Midsize Cars-Compact Cars    -0.3982910 -1.037596  0.2410138 0.3096613
## Subcompact Cars-Compact Cars -1.2861759 -1.855206 -0.7171460 0.0000004
## Subcompact Cars-Midsize Cars -0.8878849 -1.518617 -0.2571525 0.0028308

The Taguchi model fits perfectly on the QQnorm plots and is perfectly on zero for the plot of the fitted by residuals. The tukey test was used to see which groups in the sample differ.Our original ANOVA model seems acceptable to explain the variation in city mileage by make, class, cyl, displ, and fuel. The taguchi method does not seem appropriate because our results were perfect and unreliable.

4. Contingencies

Since the dataset consist of observations of fuel economy, it is possible that all of the values were from chance and that there might have been something different about the year the observations were made.There may be other nuisance factors that were not recorded. The Kruskal Wallis test does not assume a normal distrubtion of the residuals and could be used to test if the variation can be attributed to anything other than randomization without assuming a normal distribution.

kruskal.test(cty~make, data=vs)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  cty by make
## Kruskal-Wallis chi-squared = 509.6748, df = 2, p-value < 2.2e-16

kruskal.test(cty~class, data=vs)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  cty by class
## Kruskal-Wallis chi-squared = 169.0658, df = 2, p-value < 2.2e-16

kruskal.test(cty~cyl, data=vs)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  cty by cyl
## Kruskal-Wallis chi-squared = 770.7305, df = 2, p-value < 2.2e-16

kruskal.test(cty~displ, data=vs)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  cty by displ
## Kruskal-Wallis chi-squared = 660.3455, df = 3, p-value < 2.2e-16

kruskal.test(cty~fuel, data=vs)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  cty by fuel
## Kruskal-Wallis chi-squared = 282.2526, df = 2, p-value < 2.2e-16

When looking at the results of the Kruskal-Wallis test, we would reject the null hypothesis and the variation in the city mileage can be attributed to make, class, cyl, displ, and fuel with an alpha level of .05.

5. References to the literature

The R package in which this data was found can be located at https://github.com/hadley/fueleconomy