Recipie for Fractional Factorial Designs

Ali Svoobda

RPI

11/20/14

1. Setting

System under test

For this recipie, we will examine the Somerville dataset from the Ecdat package. This dataset contains statistics on individuals who visit Lake Somerville.

Read in and subset data:

install.packages("Ecdat")
## Installing package into 'C:/Users/svoboa/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)
## Error: trying to use CRAN without setting a mirror
library("Ecdat", lib.loc="C:/Users/svoboa/Documents/R/win-library/3.1")
## Warning: package 'Ecdat' was built under R version 3.1.2
## Loading required package: Ecfun
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange
x<-Somerville

For more on the dataset:

?Somerville
## starting httpd help server ... done

Factors and Levels

The dataset contains 7 factors and we will examine 6 of them. We will convert each of these to binary factors quality: quality ranking score for the lake 1 if high(5-10) -1 if low(0-4)

ski: if visitors are engaged in water skiiing or not 1 if yes -1 if no

income: annual household income 1 if high(5-10) -1 if smaller(0-4)

feeSom: is a annual user fee paid at Lake Somerville? 1 if yes -1 if no

costSom: expenditures when visiting Lake Somerville 1 if high(60-500) -1 if low(0-60)

costCon: expenditures when visiting a nearby lake, lake Conroe 1 if high(60-500) -1 if low(0-60)

Set up factors with binary levels: Quality:

x$quality<-as.numeric(x$quality)
x$quality<-cut(x$quality, c(0,4,10))
x$quality<-as.character(x$quality)

x$quality[x$quality == "(0,4]"] <- -1
x$quality[x$quality == "(4,10]"] <- 1

data<-na.omit(x)

Now that the NA's are omitted, before we set up the rest of the factors, we look at a summary of the data to decide how to cut up the data (levels to choose for continous variables)

summary(data)
##      visits        quality           ski          income     feeSom   
##  Min.   : 0.00   Length:285         no :161   Min.   :1.00   no :272  
##  1st Qu.: 1.00   Class :character   yes:124   1st Qu.:3.00   yes: 13  
##  Median : 2.00   Mode  :character             Median :4.00            
##  Mean   : 5.12                                Mean   :3.97            
##  3rd Qu.: 5.00                                3rd Qu.:5.00            
##  Max.   :88.00                                Max.   :9.00            
##     costCon         costSom        costHoust    
##  Min.   : 12.7   Min.   :  4.8   Min.   : 14.4  
##  1st Qu.: 30.9   1st Qu.: 30.3   1st Qu.: 32.4  
##  Median : 44.4   Median : 47.3   Median : 48.2  
##  Mean   : 59.5   Mean   : 59.7   Mean   : 61.0  
##  3rd Qu.: 74.6   3rd Qu.: 75.0   3rd Qu.: 73.7  
##  Max.   :493.8   Max.   :491.5   Max.   :491.0

The mean for the expenditure factors (costSom, costCon) are about 60 and the mean for income is about 4.

Ski:

data$ski<-as.character(data$ski)

data$ski[data$ski == "no"] <- -1
data$ski[data$ski == "yes"] <- 1

Income:

data$income<-as.numeric(data$income)
data$income<-cut(data$income, c(0,4,10))
data$income<-as.character(data$income)


data$income[data$income == "(0,4]"] <- -1
data$income[data$income == "(4,10]"] <- 1

feeSom:

data$feeSom<-as.character(data$feeSom)

data$feeSom[data$feeSom == "no"] <- -1
data$feeSom[data$feeSom == "yes"] <- 1

costSom:

data$costSom<-as.numeric(data$costSom)
data$costSom<-cut(data$costSom, c(0,60,500))
data$costSom<-as.character(data$costSom)


data$costSom[data$costSom == "(0,60]"] <- -1
data$costSom[data$costSom == "(60,500]"] <- 1

costCon:

data$costCon<-as.numeric(data$costCon)
data$costCon<-cut(data$costCon, c(0,60,500))
data$costCon<-as.character(data$costCon)


data$costCon[data$costCon == "(0,60]"] <- -1
data$costCon[data$costCon == "(60,500]"] <- 1

Now that the factors are set up as binary values, we can save them as integers:

data$quality<-as.integer(data$quality)
data$ski<-as.integer(data$ski)
data$income<-as.integer(data$income)
data$feeSom<-as.integer(data$feeSom)
data$costCon<-as.integer(data$costCon)
data$costSom<-as.integer(data$costSom)

Continuous Variables

Originally, all variables besides ski and feeSom were continuous but they were turned to binary values. For this recipie, visits is the only continous variable.

Response Variables

The number of visits to Lake Somerville (visits) will be the response variable for this experiment.

The Data: How is it organized and what does it look like?

The Somerville dataset has 659 observations of 8 total variables from individuals in 1980 . Each individual is only observed once.

Structure and first/last observations of dataset:

str(data)
## 'data.frame':    285 obs. of  8 variables:
##  $ visits   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ quality  : int  -1 -1 -1 -1 -1 -1 1 -1 -1 -1 ...
##  $ ski      : int  -1 1 1 -1 -1 -1 1 1 -1 -1 ...
##  $ income   : int  -1 -1 1 -1 -1 1 1 1 1 -1 ...
##  $ feeSom   : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ costCon  : int  -1 -1 -1 -1 1 1 -1 -1 -1 -1 ...
##  $ costSom  : int  -1 -1 -1 -1 1 1 -1 -1 -1 -1 ...
##  $ costHoust: num  28.1 28 29.9 14.6 57.9 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:374] 1 2 3 4 5 6 7 8 9 11 ...
##   .. ..- attr(*, "names")= chr [1:374] "1" "2" "3" "4" ...
head(data)
##    visits quality ski income feeSom costCon costSom costHoust
## 10      0      -1  -1     -1     -1      -1      -1     28.07
## 49      0      -1   1     -1     -1      -1      -1     27.98
## 61      0      -1   1      1     -1      -1      -1     29.89
## 62      0      -1  -1     -1     -1      -1      -1     14.62
## 73      0      -1  -1     -1     -1       1       1     57.86
## 75      0      -1  -1      1     -1       1       1    350.39
tail(data)
##     visits quality ski income feeSom costCon costSom costHoust
## 654     30      -1   1     -1     -1      -1      -1     57.30
## 655     40       1   1      1      1      -1      -1     29.68
## 656     40      -1   1     -1     -1      -1      -1     25.80
## 657     40      -1   1     -1     -1      -1      -1     62.76
## 658     50      -1   1     -1     -1      -1      -1     37.27
## 659     88      -1  -1     -1     -1      -1      -1     25.46

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

This experiment will utilize a fractional factorial design [26-1]. Each factor has been manipulated (see Factors and Levels section above) to have only two levels (1 or -1).

An analysis of variance will be performed. We will test to see if the six factors mentioned above have an effect on the number of visits an indivdual visits Lake Somerville.

Null: The variation in number of visits cannot be explained by anything other than randomization.

What is the Rationale for this design?

The goal is to determine what factors make someone more likely to visit the lake. For example, does a persons engagemnet in water skiing make them visit more frequently? Are people with higher incomes able to come more frequently? Which of the six factors, if any, cause variation in the number of visits.

Randomize: What is the Randomization Scheme?

The dataset is a collection of survey data on individuals visits to Lake Somerville.

Replicate: Are there replicates and/or repeated measures?

There are no replicates or repeated measures.

Block: Did you use blocking in the design?

Blocking is not used in this design.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

Mean number of visits by the 6 factors:

tapply(data$visits, data$quality, mean)
##    -1     1 
## 5.078 5.315
tapply(data$visits, data$income, mean)
##    -1     1 
## 5.720 3.718
tapply(data$visits, data$ski, mean)
##    -1     1 
## 4.832 5.500
tapply(data$visits, data$feeSom, mean)
##     -1      1 
##  4.669 14.615
tapply(data$visits, data$costCon, mean)
##    -1     1 
## 5.955 3.738
tapply(data$visits, data$costSom, mean)
##    -1     1 
## 6.337 3.103

It appears quality ranking of the lake may not affect the number of visits. The lower income group has a higher mean visit rate. The mean number of visits from skiers is higher than non-skiers. The mean for visitors who pay an annual fee is 3 times that of guests who do not have an annual fee. People who spend less money at both lakes tend to visit lake somerville more often.

Boxplots:

boxplot(data$visits~data$quality, xlab="Quality Rank", ylab="Number of Visits)")

plot of chunk unnamed-chunk-13

boxplot(data$visits~data$income, xlab="Visitor Income", ylab="Number of Visits)")

plot of chunk unnamed-chunk-13

boxplot(data$visits~data$ski, xlab="non-Skier and skier", ylab="Number of Visits)")

plot of chunk unnamed-chunk-13

boxplot(data$visits~data$feeSom, xlab="No Annual Fee and Annual Fee", ylab="Number of Visits)")

plot of chunk unnamed-chunk-13

boxplot(data$visits~data$costCon, xlab="Amount of Expenditures at Lake Conroe", ylab="Number of Visits)")

plot of chunk unnamed-chunk-13

boxplot(data$visits~data$costSom, xlab="Amount of Expenditures at Lake Somerville", ylab="Number of Visits)")

plot of chunk unnamed-chunk-13

From the boxplots, the factor that seems to have the biggest influence on number of visits is whether or not the individual pays an annual fee. Further anlaysis will be required to tell if other factors are explanatory variables of number of visits

Histogram of Visits:

hist(data$visits, breaks=20)

plot of chunk unnamed-chunk-14

The most frequent number of visits is between 0 and 5. Few people visit more than 20 times.

Testing

ANOVA Model 1

Null Hypothesis: The variation in the response number of visits cannot be explained by anything other than randomization. Alternative Hypothesis: The variation in response can be explained by something other than randomization (such as ). Assumption: Data is normally distributed.

model1=aov(data$visits~data$quality+data$ski+data$income+data$feeSom+data$costSom+data$costCon)
summary(model1)
##               Df Sum Sq Mean Sq F value  Pr(>F)    
## data$quality   1      2       2    0.04  0.8514    
## data$ski       1     31      31    0.44  0.5054    
## data$income    1    284     284    4.07  0.0446 *  
## data$feeSom    1   1181    1181   16.90 5.2e-05 ***
## data$costSom   1    616     616    8.82  0.0032 ** 
## data$costCon   1    112     112    1.60  0.2066    
## Residuals    278  19426      70                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Income level, if an annual fee is paid, and cost spent on expenditures at lake Somerville all have low p-values (less than .05), so therefore we reject the null hypothesis that number of visits cannot be explained by anything other than randomization. It is likely that these three factors can explain some of the variation in number of visits.

Fractional Factorial Design

We need to install the FrF2 package in order to design the matrix:

install.packages(FrF2)
## Error: object 'FrF2' not found
library("FrF2", lib.loc="C:/Users/svoboa/Documents/R/win-library/3.1")
## Warning: package 'FrF2' was built under R version 3.1.2
## Loading required package: DoE.base
## Warning: package 'DoE.base' was built under R version 3.1.2
## Loading required package: grid
## Loading required package: conf.design
## Warning: package 'conf.design' was built under R version 3.1.2
## 
## Attaching package: 'DoE.base'
## 
## The following objects are masked from 'package:stats':
## 
##     aov, lm
## 
## The following object is masked from 'package:graphics':
## 
##     plot.design

We will generate a 26-1 fractional factorial design. A full design would use 64 runs (26) but a half-fractional design will be used (32 runs or 25 ). We have 6 factors.

?
data$quality<-as.factor(data$quality)
data$ski<-as.factor(data$ski)
data$income<-as.factor(data$income)
data$feeSom<-as.factor(data$feeSom)
data$costCon<-as.factor(data$costCon)
data$costSom<-as.factor(data$costSom)

attach(data)

fracdesign<- FrF2(32, nfactors=6, estimable=formula("~quality+ski+income+feeSom+costSom+costCon+quality:(ski+income+feeSom+costSom+costCon)"), factor.names= c("Quality", "Ski?", "Income", "Fee?", "costSom", "costCon"), res4=TRUE, clear=FALSE)
## Error: formula must refer to elements of Letters or to factor names.

Diagnostics/Model Adequacy Checking

Visually inspect normality of data:

qqnorm(residuals(model1))
qqline(residuals(model1))

plot of chunk unnamed-chunk-18

The data appears it may not be normal.

Shapiro-Wilks normality test to check:

shapiro.test(residuals(model1))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(model1)
## W = 0.6277, p-value < 2.2e-16

Null hypothesis: The data came from a normally distributed population. We reject the null. We cannot assume the data is normal. This will be addressed in the contingencies section below.

Fitted vs Residuals Plot

plot(fitted(model1),residuals(model1))

plot of chunk unnamed-chunk-20

The data should be symetric over the zero and spread out over the dynamic range. The model may not be a good fit since many points are clustered along the zero/left side.

4. Contingencies

Since the data did not fulfill the normality assumption of the ancova model (which could be why the fit of the model was questionable), a Kruskal-Wallis non-parametric analysis of variance by Rank Sum Test should be performed:

kruskal.test(visits~quality)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  visits by quality
## Kruskal-Wallis chi-squared = 3.052, df = 1, p-value = 0.08062
kruskal.test(visits~ski)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  visits by ski
## Kruskal-Wallis chi-squared = 0.0279, df = 1, p-value = 0.8673
kruskal.test(visits~income)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  visits by income
## Kruskal-Wallis chi-squared = 1.977, df = 1, p-value = 0.1597
kruskal.test(visits~feeSom)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  visits by feeSom
## Kruskal-Wallis chi-squared = 17.3, df = 1, p-value = 3.193e-05
kruskal.test(visits~costSom)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  visits by costSom
## Kruskal-Wallis chi-squared = 4.34, df = 1, p-value = 0.03722
kruskal.test(visits~costCon)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  visits by costCon
## Kruskal-Wallis chi-squared = 0.4542, df = 1, p-value = 0.5004

The null hypothesis of the kruskal test is that the mean ranks of the samples from the populations are expected to be the same (this is not the same as saying the populations have identical means).
Since the test results have a low p-value for the ski and costSom factors, we reject this null hypothesis. It is likely that the variation in the rank means of if someone pays an annual fee and how much money they spend at Somerville Lake can explain the variaion in number of visits. The only difference between this result and the results of the ANOVA above is that the ANOVA also returned income level as being significant.

5. References to the Literature

None used.

6. Appendicies

Link to raw data

Data is from the Ecdat Package

Complete R Code

All included above.