Recipe 8: 2^k Fractional Factorial Design

Recipes for the Design of Experiments

Zoe Konrad

Rensselaer Polytechnic Institute

Fall 2014 v1

1. Setting

System under test

This analysis uses the popular Vinho Verde data collected from May/2004 to February/2007. We will look specifically at almost 1600 data points for red wine sensory preference (pervieced quality on a 1-10 scale.)

We are interested in modeling the effect of several physicochemical factors on the percieved quality of red wine.

x <- read.csv("~/Desktop/Zoe/winequality-red.csv")
str(x)
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Factors and Levels

We chose 6 physicochemical factors: pH, sulfates, total sulfur dioxide, alcohol, volatile acidity, and residual sugar. Each factor is transformed into two levels each: -1 for below the average and +1 for above the average.

summary(x$pH)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.74    3.21    3.31    3.31    3.40    4.01
summary(x$sulphates)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.330   0.550   0.620   0.658   0.730   2.000
summary(x$total.sulfur.dioxide)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     6.0    22.0    38.0    46.5    62.0   289.0
summary(x$alcohol)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.4     9.5    10.2    10.4    11.1    14.9
summary(x$volatile.acidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.120   0.390   0.520   0.528   0.640   1.580
summary(x$residual.sugar)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.90    1.90    2.20    2.54    2.60   15.50
msulphates=mean(x$sulphates)
x$sulphates[x$sulphates<msulphates] = "-1"
x$sulphates[x$sulphates>=msulphates] = "+1"

mpH=mean(x$pH)
x$pH[x$pH<mpH] = "-1"
x$pH[x$pH>=mpH] = "+1"

mtotal.sulfur.dioxide=mean(x$total.sulfur.dioxide)
x$total.sulfur.dioxide[x$total.sulfur.dioxide<mtotal.sulfur.dioxide] = "-1"
x$total.sulfur.dioxide[as.numeric(x$total.sulfur.dioxide)>=mtotal.sulfur.dioxide] = "+1"

malcohol=mean(x$alcohol)
x$alcohol[x$alcohol<malcohol] = "-1"
x$alcohol[x$alcohol>=malcohol] = "+1"

mvolatile.acidity=mean(x$volatile.acidity)
x$volatile.acidity[x$volatile.acidity<mvolatile.acidity] = "-1"
x$volatile.acidity[x$volatile.acidity>=mvolatile.acidity] = "+1"

mresidual.sugar=mean(x$residual.sugar)
x$residual.sugar[x$residual.sugar<mresidual.sugar] = "-1"
x$residual.sugar[as.numeric(x$residual.sugar)>=mresidual.sugar] = "+1"

exp = ddply(.data = x ,
    .variables = .(pH, sulphates, total.sulfur.dioxide, alcohol, volatile.acidity, residual.sugar) ,
    .fun = function(x)
    { return(data.frame(m=mean(x$quality))) } )
attach(exp)

Response variables

The response variable is now the average of quality (sensory preference) accross all experimental runs for each combination of factor levels.

summary(x$quality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    5.00    6.00    5.64    6.00    8.00

2. (Experimental) Design

We will run two ANOVAs: First, for this 2^6 experiment we will start by analyzing the full factorial model and second, we will use the FrF2 package to design and analyze a 2^6-1 (half factorial) experiment.

3. (Statistical) Analysis

Exploratory Data Analysis Graphics

Below are boxplots of response variable quality with respect to each of the 6 factors. Most appear to have some difference in means, except for pH and residual sugar which are both rather close.

par(mfrow=c(3,2))
boxplot(m~pH, xlab="pH")
boxplot(m~sulphates, xlab="sulphates")
boxplot(m~total.sulfur.dioxide, xlab="total sulfur dioxide")
boxplot(m~alcohol, xlab="alcohol")
boxplot(m~volatile.acidity, xlab="volatile acidity")
boxplot(m~residual.sugar, xlab="residual sugar")

plot of chunk unnamed-chunk-5

The response varaible ‘quality’ appears to be close to normally distributed. A shapiro-wilks test confirms the assumption.

summary(m)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.94    5.26    5.59    5.67    6.06    6.55
qqnorm(m); qqline(m)

plot of chunk unnamed-chunk-6

Testing

model <- aov(m~pH+sulphates+total.sulfur.dioxide+alcohol+volatile.acidity+residual.sugar)
anova(model)
## Analysis of Variance Table
## 
## Response: m
##                      Df Sum Sq Mean Sq F value  Pr(>F)    
## pH                    1   0.14    0.14    2.23  0.1410    
## sulphates             1   1.93    1.93   30.01 1.0e-06 ***
## total.sulfur.dioxide  1   0.76    0.76   11.74  0.0011 ** 
## alcohol               1   4.04    4.04   62.76 9.3e-11 ***
## volatile.acidity      1   1.44    1.44   22.38 1.5e-05 ***
## residual.sugar        1   0.07    0.07    1.06  0.3068    
## Residuals            57   3.67    0.06                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All factors are significant except for pH and residual sugar. We reject the null hypothesis that randmoization alone can account for the variation in quality.

model <- aov(m~pH+sulphates+total.sulfur.dioxide+alcohol+volatile.acidity+residual.sugar, data=exp)
anova(model)
## Analysis of Variance Table
## 
## Response: m
##                      Df Sum Sq Mean Sq F value  Pr(>F)    
## pH                    1   0.14    0.14    2.23  0.1410    
## sulphates             1   1.93    1.93   30.01 1.0e-06 ***
## total.sulfur.dioxide  1   0.76    0.76   11.74  0.0011 ** 
## alcohol               1   4.04    4.04   62.76 9.3e-11 ***
## volatile.acidity      1   1.44    1.44   22.38 1.5e-05 ***
## residual.sugar        1   0.07    0.07    1.06  0.3068    
## Residuals            57   3.67    0.06                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fractional = FrF2(32, nfactors=6,
     factor.names=c("pH","sulphates","total.sulfur.dioxide","alcohol","volatile.acidity","residual.sugar"),
     estimable=formula("~A+B+C+D+E+F:(A+B+C+D+E+F)"),
     default.levels = c("-1", "+1"),
     res5=TRUE, clear = FALSE)

aliasprint(fractional)
## $legend
## [1] A=pH                   B=sulphates            C=total.sulfur.dioxide
## [4] D=alcohol              E=volatile.acidity     F=residual.sugar      
## 
## [[2]]
## [1] no aliasing among main effects and 2fis
exp2 = merge(fractional,exp,by=c("pH","sulphates","total.sulfur.dioxide","alcohol","volatile.acidity","residual.sugar"),all=FALSE)
model2 <- aov(m~pH+sulphates+total.sulfur.dioxide+alcohol+volatile.acidity+residual.sugar, data=exp2)
anova(model2)
## Analysis of Variance Table
## 
## Response: m
##                      Df Sum Sq Mean Sq F value  Pr(>F)    
## pH                    1  0.005   0.005    0.08 0.77866    
## sulphates             1  0.902   0.902   14.34 0.00085 ***
## total.sulfur.dioxide  1  0.224   0.224    3.56 0.07067 .  
## alcohol               1  2.516   2.516   40.01 1.3e-06 ***
## volatile.acidity      1  0.951   0.951   15.12 0.00066 ***
## residual.sugar        1  0.009   0.009    0.15 0.70599    
## Residuals            25  1.572   0.063                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Estimation (of Parameters)

# TukeyHSD(model)

The most significant effect on quality is alcohol with a differrence of .5 quality points between high/low values. Sulfates, total sulfur dioxide, and volatile acidity were all significant as well with differences around .3 qualilty points between high/low values. PH and residual sugar did not have a significant difference in means between levels.

Diagnostics/Model Adequacy Checking

The residuals of the model are perfectly normally distributed which validates the primary assumption of normality. We can concluded that this model is appropriate.

qqnorm(residuals(model)) ; qqline(residuals(model))

plot of chunk unnamed-chunk-13

qqnorm(residuals(model2)) ; qqline(residuals(model2))

plot of chunk unnamed-chunk-13

4. References to the literature

Application of data mining using support vector machines to model wine preferences from physicochemical properties: http://www.sciencedirect.com/science/article/pii/S0167923609001377