1. Setting

System under test

The data being analyzed in this recipe is from one of the Hadley Wickham’s datasets. The fuel economy dataset is the result of vehicle testing done at the EPA’s national Vehicle and Fuel Emissions laboratory in Ann Arbor, Michigan.

install.packages("fueleconomy", repos='http://cran.us.r-project.org')

## 
## The downloaded binary packages are in
##  /var/folders/55/ql66yz5j3jzgkn6dmnb9sk1c0000gn/T//RtmpPhi85m/downloaded_packages

library("fueleconomy", lib.loc="/Library/Frameworks/R.framework/Versions/3.1/Resources/library")
x<-vehicles

Factors and Levels

The data being examined has 33442 observations of 12 variables.

head(x)

##      id       make               model year                       class
## 1 27550 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 2 28426 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 3 27549 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 4 28425 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 5  1032 AM General Post Office DJ5 2WD 1985 Special Purpose Vehicle 2WD
## 6  1033 AM General Post Office DJ8 2WD 1985 Special Purpose Vehicle 2WD
##             trans            drive cyl displ    fuel hwy cty
## 1 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 2 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 3 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 4 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 5 Automatic 3-spd Rear-Wheel Drive   4   2.5 Regular  17  16
## 6 Automatic 3-spd Rear-Wheel Drive   6   4.2 Regular  13  13

tail(x)

##          id  make                             model year       class
## 33437 31064 smart   fortwo electric drive cabriolet 2011 Two Seaters
## 33438 33305 smart fortwo electric drive convertible 2013 Two Seaters
## 33439 34393 smart fortwo electric drive convertible 2014 Two Seaters
## 33440 31065 smart       fortwo electric drive coupe 2011 Two Seaters
## 33441 33306 smart       fortwo electric drive coupe 2013 Two Seaters
## 33442 34394 smart       fortwo electric drive coupe 2014 Two Seaters
##                trans            drive cyl displ        fuel hwy cty
## 33437 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  79  94
## 33438 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  93 122
## 33439 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  93 122
## 33440 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  79  94
## 33441 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  93 122
## 33442 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  93 122

str(x)

## Classes 'tbl_df', 'tbl' and 'data.frame':    33442 obs. of  12 variables:
##  $ id   : int  27550 28426 27549 28425 1032 1033 3347 13309 13310 13311 ...
##  $ make : chr  "AM General" "AM General" "AM General" "AM General" ...
##  $ model: chr  "DJ Po Vehicle 2WD" "DJ Po Vehicle 2WD" "FJ8c Post Office" "FJ8c Post Office" ...
##  $ year : int  1984 1984 1984 1984 1985 1985 1987 1997 1997 1997 ...
##  $ class: chr  "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" ...
##  $ trans: chr  "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" ...
##  $ drive: chr  "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" ...
##  $ cyl  : int  4 4 6 6 4 6 6 4 4 6 ...
##  $ displ: num  2.5 2.5 4.2 4.2 2.5 4.2 3.8 2.2 2.2 3 ...
##  $ fuel : chr  "Regular" "Regular" "Regular" "Regular" ...
##  $ hwy  : int  17 17 13 13 17 13 21 26 28 26 ...
##  $ cty  : int  18 18 13 13 16 13 14 20 22 18 ...

summary(x)

##        id            make              model                year     
##  Min.   :    1   Length:33442       Length:33442       Min.   :1984  
##  1st Qu.: 8361   Class :character   Class :character   1st Qu.:1991  
##  Median :16724   Mode  :character   Mode  :character   Median :1999  
##  Mean   :17038                                         Mean   :1999  
##  3rd Qu.:25265                                         3rd Qu.:2008  
##  Max.   :34932                                         Max.   :2015  
##                                                                      
##     class              trans              drive                cyl        
##  Length:33442       Length:33442       Length:33442       Min.   : 2.000  
##  Class :character   Class :character   Class :character   1st Qu.: 4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.000  
##                                                           Mean   : 5.772  
##                                                           3rd Qu.: 6.000  
##                                                           Max.   :16.000  
##                                                           NA's   :58      
##      displ           fuel                hwy              cty        
##  Min.   :0.000   Length:33442       Min.   :  9.00   Min.   :  6.00  
##  1st Qu.:2.300   Class :character   1st Qu.: 19.00   1st Qu.: 15.00  
##  Median :3.000   Mode  :character   Median : 23.00   Median : 17.00  
##  Mean   :3.353                      Mean   : 23.55   Mean   : 17.49  
##  3rd Qu.:4.300                      3rd Qu.: 27.00   3rd Qu.: 20.00  
##  Max.   :8.400                      Max.   :109.00   Max.   :138.00  
##  NA's   :57

Continuous variables (if any)

If a variable can take on any value between its minimum value and its maximum value, it is called a continuous variable; otherwise, it is called a discrete variable. The continuous variables in this dataset the gas mileages, both city and highway ### Response variables A response variable is defined as the outcome of a study. It is a variable you would be interested in predicting or forecasting. It is often called a dependent variable or predicted variable. In this instance, a response variable is the highway gas mileage in miles per gallon.
### The Data: How is it organized and what does it look like? The data is organized initially into a 12 column table. Data in the selected columns have been transformed into binary integers.The integers were assigned a 1, 2, or 3 depending on whether or not the initial value fell in between a certain range. In order to perform the Taguchi methods, the data must be maniupulated to show ranges of factors as levels and then assigned a binary counterpart.

x$make <- as.factor(x$make)
x$trans <- as.factor(x$trans)
x$drive <- as.factor(x$drive)
x$cyl <- as.factor(x$cyl)
x$fuel <- as.factor(x$fuel)
x$year <- cut(x$year, c(1980,1990,2000,2010,2020))
x$year <- as.character(x$year)
x$year[x$year == "(1.98e+03,1.99e+03]"] <- "1980s"
x$year[x$year == "(1.99e+03,2e+03]"] <- "1990s"
x$year[x$year == "(2e+03,2.01e+03]"] <- "2000s"
x$year[x$year == "(2.01e+03,2.02e+03]"] <- "2010s"
x$year <- as.factor(x$year)
x <- x[x$make %in% c("Acura","Audi", "Chevrolet", "Aston Martin"), ]
x$make <- droplevels(x$make)
x$drive <- as.factor(x$drive)
x <- x[x$drive %in% c("2-Wheel Drive","4-Wheel or All-Wheel Drive", "Front-Wheel Drive", "Rear-Wheel Drive"), ]
x$drive <- droplevels(x$drive)
x$fuel <- as.factor(x$fuel)
x <- x[x$fuel %in% c("CNG","Diesel", "Premium", "Regular"), ]
x$fuel <- droplevels(x$fuel)
x$cyl <- as.factor(x$cyl)
x <- x[x$cyl %in% c("4", "6", "8"), ]
x$cyl <- droplevels(x$cyl)
x$hwy <- as.numeric(x$hwy)
summary(x$hwy)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00   18.00   22.00   22.25   26.00   46.00

Randomization

Since this is the only information available in regards to background information about the data collection, it is entirely possible that this data might not be completely randomized or the experiment had a completely randomized design. # 2. (Experimental) Design ### How will the experiment be organized and conducted to test the hypothesis? In order to conduct this experiment, I will conduct an an experiment using Taguchi methods. A Taguchi design will be used to analyze which, if any, of the five factors are responsible for variation in the response variable, highway mileage. ### Randomize: What is the Randomization Scheme? The experiment was based on observations and therefore was not randomized. ### Replicate: Are there replicates and/or repeated measures? There are no replicates, but repeated measures do occur between the factors and levels. ### Block: Did you use blocking in the design? No blocking was performed in this design. # 3. Statistical Analysis ##Exploratory Data Analysis: Graphics and Decriptive Summary The boxplot below shows the the highway mileage of the cars versus the make of the car. Based on the results of the boxplots, it can be argued that all of the cars had relatively dissimilar medians, but the ranges were quite varied. Graphs showing trends in the data:

par(mfrow=c(1,1))

Next, I create plots to attempt to identify any patterns in the data.

plot(x$hwy ~ x$make, xlab = "Make of Car", ylab = "Highway Mileage (mpg)", main = "Highway Milage by Make of Car")

plot(x$hwy ~ x$year, xlab = "Make of Car", ylab = "Highway Mileage (mpg)", main = "Highway Milage by Make of Car")

plot(x$hwy ~ x$drive, xlab = "Make of Car", ylab = "Highway Mileage (mpg)", main = "Highway Milage by Make of Car")

plot(x$hwy ~ x$cyl, xlab = "Make of Car", ylab = "Highway Mileage (mpg)", main = "Highway Milage by Make of Car")

plot(x$hwy ~ x$fuel, xlab = "Make of Car", ylab = "Highway Mileage (mpg)", main = "Highway Milage by Make of Car")

##ANOVA Testing Comparing the average highway gas mileage to five factors: drive type, year, make, cylinders, and fuel type. H0: There is no difference between the means of the samples. The variation in highway gas mileage is not caused by anything other than randomization. Ha: The difference in variation of highway gas mileage can caused by something other than randomization.

x$year <- as.character(x$year)
x$year[x$year == "1980s"] <- 1
x$year[x$year == "1990s"] <- 2
x$year[x$year == "2000s"] <- 3
x$year[x$year == "2010s"] <- 4
x$year <- as.factor(x$year)
x$drive <- as.character(x$drive)
x$drive[x$drive=="2-Wheel Drive"] <- 1
x$drive[x$drive=="4-Wheel or All-Wheel Drive"] <- 2
x$drive[x$drive=="Front-Wheel Drive"] <-3
x$drive[x$drive=="Rear-Wheel Drive"] <-4
x$drive <- as.factor(x$drive)
x$fuel <- as.character(x$fuel)
x$fuel[x$fuel=="CNG"] <- 1
x$fuel[x$fuel=="Diesel"] <- 2
x$fuel[x$fuel=="Premium"] <- 3
x$fuel[x$fuel=="Regular"] <- 4
x$fuel <- as.factor(x$fuel)
x$make<-as.character(x$make)
x$make[x$make=="Acura"] <- 1
x$make[x$make=="Aston Martin"] <- 2
x$make[x$make=="Audi"] <- 3
x$make[x$make=="Chevrolet"] <- 4
x$make <- as.factor(x$make)
x$cyl <- as.character(x$cyl)
x$cyl[x$cyl == "4"] <- 1
x$cyl[x$cyl == "6"] <- 2
x$cyl[x$cyl == "8"] <- 3
x$cyl <- as.factor(x$cyl)
x <- na.omit(x)
x$make <- as.integer(x$make)
x$fuel <- as.integer(x$fuel)
x$cyl <- as.integer(x$cyl)
x$drive <- as.integer(x$drive)
x$year<- as.integer(x$year)
model <- aov(hwy~make+fuel+cyl+drive+year,data=x)
summary(model)

##               Df Sum Sq Mean Sq  F value Pr(>F)    
## make           1   3516    3516  305.764 <2e-16 ***
## fuel           1     58      58    5.076 0.0243 *  
## cyl            1  54802   54802 4765.762 <2e-16 ***
## drive          1   1327    1327  115.415 <2e-16 ***
## year           1   3902    3902  339.328 <2e-16 ***
## Residuals   3787  43547      11                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-values of all of the main effects except for fuel are less than an alpha of 0.05. Therefore, we can fail to reject the null hypothesis, and instead support the alternative, that something other than randomization causes the variation in highway mileage for 4 out of 5 factors. However, we do reject the null hypothesis in regards to fuel. ##Taguchi Design Next, a Taguchi Design was used to find which combination of factors and levels results in the highest signal-to-noise ratio.

install.packages("qualityTools", repos='http://cran.us.r-project.org')

## 
## The downloaded binary packages are in
##  /var/folders/55/ql66yz5j3jzgkn6dmnb9sk1c0000gn/T//RtmpPhi85m/downloaded_packages

library(qualityTools)
install.packages("DoE.base", repos='http://cran.us.r-project.org')

## 
## The downloaded binary packages are in
##  /var/folders/55/ql66yz5j3jzgkn6dmnb9sk1c0000gn/T//RtmpPhi85m/downloaded_packages

library("DoE.base")

## Loading required package: grid
## Loading required package: conf.design
## 
## Attaching package: 'DoE.base'
## 
## The following objects are masked from 'package:stats':
## 
##     aov, lm
## 
## The following object is masked from 'package:graphics':
## 
##     plot.design

array = oa.design(factor.names=c("make", "fuel", "cyl", "drive", "year"), nlevels=c(4,4,3, 4, 4 ), columns="min3")
array

##    make fuel cyl drive year
## 1     4    3   1     2    1
## 2     1    3   2     3    3
## 3     1    3   3     2    1
## 4     3    4   1     1    2
## 5     2    1   2     3    2
## 6     1    1   1     1    1
## 7     3    2   2     3    4
## 8     3    3   1     3    3
## 9     4    3   2     1    4
## 10    3    4   2     2    3
## 11    3    1   2     1    1
## 12    1    4   2     1    2
## 13    4    4   2     2    3
## 14    2    4   1     3    1
## 15    2    1   3     2    4
## 16    1    4   3     4    4
## 17    4    1   2     4    3
## 18    4    2   3     3    4
## 19    1    2   2     4    1
## 20    3    1   1     2    4
## 21    1    1   3     2    4
## 22    4    4   1     4    4
## 23    1    1   2     3    2
## 24    2    3   2     2    1
## 25    4    2   1     1    3
## 26    4    1   1     3    2
## 27    1    2   3     1    3
## 28    2    3   3     3    3
## 29    2    1   1     4    3
## 30    2    2   2     4    1
## 31    4    2   2     2    2
## 32    1    3   1     4    2
## 33    4    1   3     1    1
## 34    3    3   3     4    2
## 35    2    4   3     1    2
## 36    4    3   3     4    2
## 37    2    3   1     1    4
## 38    2    2   1     2    2
## 39    3    2   3     2    2
## 40    2    2   3     1    3
## 41    3    1   3     4    3
## 42    1    4   1     2    3
## 43    2    4   2     4    4
## 44    4    4   3     3    1
## 45    3    2   1     4    1
## 46    3    4   3     3    1
## 47    3    3   2     1    4
## 48    1    2   1     3    4
## class=design, type= oa

x2 <- merge(array, x, by=c("make", "fuel", "cyl", "drive", "year"), all=FALSE)
xunique = unique(x2[,1:5])
xunique

##     make fuel cyl drive year
## 1      1    3   2     3    3
## 27     3    3   1     3    3
## 81     4    3   3     4    2
## 129    4    4   1     4    4
## 137    4    4   2     2    3

rownames(xunique)

## [1] "1"   "27"  "81"  "129" "137"

hiway = x2$hwy[index=c(1,27,81,129,137)]
hiway

## [1] 26 28 23 25 17

array2 = cbind(xunique, hiway)
array2

##     make fuel cyl drive year hiway
## 1      1    3   2     3    3    26
## 27     3    3   1     3    3    28
## 81     4    3   3     4    2    23
## 129    4    4   1     4    4    25
## 137    4    4   2     2    3    17

snrate = -10*log10(1/array2$hiway^2)
snrate

## [1] 28.29947 28.94316 27.23456 27.95880 24.60898

maxsn = which(snrate==max(snrate))
array2[maxsn,]

##    make fuel cyl drive year hiway
## 27    3    3   1     3    3    28

model2 = aov(hiway~make+fuel+cyl+drive+year, data=array2)

Shapiro-Wilk

Here I use the Shapiro-Wilk normality test to determine if the response variable is normally distributed.

shapiro.test(x2$hwy)

## 
##  Shapiro-Wilk normality test
## 
## data:  x2$hwy
## W = 0.9482, p-value = 4.563e-07

The null of a Shapiro-Wilk test states that the data presented is normally distributed. Unfortunately, the p-value is less than an alpha of 0.05, and therefore, we must reject the null hypothesis. This suggests, instead, that the highway mileage data is normally distributed. ##Diagnostics/Model Adequacy Checking Quantile-Quantile (Q-Q) plots are graphs used to verify the distributional assumption for a set of data. Based on the theoretical distribution, the expected value for each datum is determined. If the data values in a set follow the theoretical distribution, then they will appear as a straight line on a Q-Q plot. When an anova is performed, it is done so with the assumption that the test statistic follows a normal distribution. Visualization of a Q-Q plot will further confirm if that assumption is correct for the anova tests that were performed.

qqnorm(residuals(model))
qqline(residuals(model))

plot(fitted(model),residuals(model))

A Residuals vs. Fits Plot is a common graph used in residual analysis. It is a scatter plot of residuals as a function of fitted values, or the estimated responses. These plots are used to identify linearity, outliers, and error variances.

Recipe 10