Recipie for Fractional Factorial Designs

Ali Svoobda

RPI

11/20/14

1. Setting

System under test

For this recipie, we will examine the Somerville dataset from the Ecdat package. This dataset contains statistics on individuals who visit Lake Somerville.

Read in and subset data:

install.packages("Ecdat")

## Installing package into 'C:/Users/svoboa/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)

## Error: trying to use CRAN without setting a mirror

library("Ecdat", lib.loc="C:/Users/svoboa/Documents/R/win-library/3.1")

## Warning: package 'Ecdat' was built under R version 3.1.2

## Loading required package: Ecfun
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange

x<-Somerville

For more on the dataset:

?Somerville

## starting httpd help server ... done

Factors and Levels

The dataset contains 7 factors and we will examine 6 of them. We will convert each of these to binary factors quality: quality ranking score for the lake 1 if high(5-10) -1 if low(0-4)

ski: if visitors are engaged in water skiiing or not 1 if yes -1 if no

income: annual household income 1 if high(5-10) -1 if smaller(0-4)

feeSom: is a annual user fee paid at Lake Somerville? 1 if yes -1 if no

costSom: expenditures when visiting Lake Somerville 1 if high(60-500) -1 if low(0-60)

costCon: expenditures when visiting a nearby lake, lake Conroe 1 if high(60-500) -1 if low(0-60)

Set up factors with binary levels: Quality:

x$quality<-as.numeric(x$quality)
x$quality<-cut(x$quality, c(0,4,10))
x$quality<-as.character(x$quality)

x$quality[x$quality == "(0,4]"] <- -1
x$quality[x$quality == "(4,10]"] <- 1

data<-na.omit(x)

Now that the NA's are omitted, before we set up the rest of the factors, we look at a summary of the data to decide how to cut up the data (levels to choose for continous variables)

summary(data)

##      visits        quality           ski          income     feeSom   
##  Min.   : 0.00   Length:285         no :161   Min.   :1.00   no :272  
##  1st Qu.: 1.00   Class :character   yes:124   1st Qu.:3.00   yes: 13  
##  Median : 2.00   Mode  :character             Median :4.00            
##  Mean   : 5.12                                Mean   :3.97            
##  3rd Qu.: 5.00                                3rd Qu.:5.00            
##  Max.   :88.00                                Max.   :9.00            
##     costCon         costSom        costHoust    
##  Min.   : 12.7   Min.   :  4.8   Min.   : 14.4  
##  1st Qu.: 30.9   1st Qu.: 30.3   1st Qu.: 32.4  
##  Median : 44.4   Median : 47.3   Median : 48.2  
##  Mean   : 59.5   Mean   : 59.7   Mean   : 61.0  
##  3rd Qu.: 74.6   3rd Qu.: 75.0   3rd Qu.: 73.7  
##  Max.   :493.8   Max.   :491.5   Max.   :491.0

The mean for the expenditure factors (costSom, costCon) are about 60 and the mean for income is about 4.

Ski:

data$ski<-as.character(data$ski)

data$ski[data$ski == "no"] <- -1
data$ski[data$ski == "yes"] <- 1

Income:

data$income<-as.numeric(data$income)
data$income<-cut(data$income, c(0,4,10))
data$income<-as.character(data$income)


data$income[data$income == "(0,4]"] <- -1
data$income[data$income == "(4,10]"] <- 1

feeSom:

data$feeSom<-as.character(data$feeSom)

data$feeSom[data$feeSom == "no"] <- -1
data$feeSom[data$feeSom == "yes"] <- 1

costSom:

data$costSom<-as.numeric(data$costSom)
data$costSom<-cut(data$costSom, c(0,60,500))
data$costSom<-as.character(data$costSom)


data$costSom[data$costSom == "(0,60]"] <- -1
data$costSom[data$costSom == "(60,500]"] <- 1

costCon:

data$costCon<-as.numeric(data$costCon)
data$costCon<-cut(data$costCon, c(0,60,500))
data$costCon<-as.character(data$costCon)


data$costCon[data$costCon == "(0,60]"] <- -1
data$costCon[data$costCon == "(60,500]"] <- 1

Now that the factors are set up as binary values, we can save them as integers:

data$quality<-as.integer(data$quality)
data$ski<-as.integer(data$ski)
data$income<-as.integer(data$income)
data$feeSom<-as.integer(data$feeSom)
data$costCon<-as.integer(data$costCon)
data$costSom<-as.integer(data$costSom)

Continuous Variables

Originally, all variables besides ski and feeSom were continuous but they were turned to binary values. For this recipie, visits is the only continous variable.

Response Variables

The number of visits to Lake Somerville (visits) will be the response variable for this experiment.

The Data: How is it organized and what does it look like?

The Somerville dataset has 659 observations of 8 total variables from individuals in 1980 . Each individual is only observed once.

Structure and first/last observations of dataset:

str(data)

## 'data.frame':    285 obs. of  8 variables:
##  $ visits   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ quality  : int  -1 -1 -1 -1 -1 -1 1 -1 -1 -1 ...
##  $ ski      : int  -1 1 1 -1 -1 -1 1 1 -1 -1 ...
##  $ income   : int  -1 -1 1 -1 -1 1 1 1 1 -1 ...
##  $ feeSom   : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ costCon  : int  -1 -1 -1 -1 1 1 -1 -1 -1 -1 ...
##  $ costSom  : int  -1 -1 -1 -1 1 1 -1 -1 -1 -1 ...
##  $ costHoust: num  28.1 28 29.9 14.6 57.9 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:374] 1 2 3 4 5 6 7 8 9 11 ...
##   .. ..- attr(*, "names")= chr [1:374] "1" "2" "3" "4" ...

head(data)

##    visits quality ski income feeSom costCon costSom costHoust
## 10      0      -1  -1     -1     -1      -1      -1     28.07
## 49      0      -1   1     -1     -1      -1      -1     27.98
## 61      0      -1   1      1     -1      -1      -1     29.89
## 62      0      -1  -1     -1     -1      -1      -1     14.62
## 73      0      -1  -1     -1     -1       1       1     57.86
## 75      0      -1  -1      1     -1       1       1    350.39

tail(data)

##     visits quality ski income feeSom costCon costSom costHoust
## 654     30      -1   1     -1     -1      -1      -1     57.30
## 655     40       1   1      1      1      -1      -1     29.68
## 656     40      -1   1     -1     -1      -1      -1     25.80
## 657     40      -1   1     -1     -1      -1      -1     62.76
## 658     50      -1   1     -1     -1      -1      -1     37.27
## 659     88      -1  -1     -1     -1      -1      -1     25.46

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

This experiment will utilize a fractional factorial design [2^6-1]. Each factor has been manipulated (see Factors and Levels section above) to have only two levels (1 or -1).

An analysis of variance will be performed. We will test to see if the six factors mentioned above have an effect on the number of visits an indivdual visits Lake Somerville.

Null: The variation in number of visits cannot be explained by anything other than randomization.

What is the Rationale for this design?

The goal is to determine what factors make someone more likely to visit the lake. For example, does a persons engagemnet in water skiing make them visit more frequently? Are people with higher incomes able to come more frequently? Which of the six factors, if any, cause variation in the number of visits.

Randomize: What is the Randomization Scheme?

The dataset is a collection of survey data on individuals visits to Lake Somerville.

Replicate: Are there replicates and/or repeated measures?

There are no replicates or repeated measures.

Block: Did you use blocking in the design?

Blocking is not used in this design.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

Mean number of visits by the 6 factors:

tapply(data$visits, data$quality, mean)

##    -1     1 
## 5.078 5.315

tapply(data$visits, data$income, mean)

##    -1     1 
## 5.720 3.718

tapply(data$visits, data$ski, mean)

##    -1     1 
## 4.832 5.500

tapply(data$visits, data$feeSom, mean)

##     -1      1 
##  4.669 14.615

tapply(data$visits, data$costCon, mean)

##    -1     1 
## 5.955 3.738

tapply(data$visits, data$costSom, mean)

##    -1     1 
## 6.337 3.103

It appears quality ranking of the lake may not affect the number of visits. The lower income group has a higher mean visit rate. The mean number of visits from skiers is higher than non-skiers. The mean for visitors who pay an annual fee is 3 times that of guests who do not have an annual fee. People who spend less money at both lakes tend to visit lake somerville more often.

Boxplots:

boxplot(data$visits~data$quality, xlab="Quality Rank", ylab="Number of Visits)")