For this recipie, we will examine the Somerville dataset from the Ecdat package. This dataset contains statistics on individuals who visit Lake Somerville.
Read in and subset data:
install.packages("Ecdat")
## Installing package into 'C:/Users/svoboa/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)
## Error: trying to use CRAN without setting a mirror
library("Ecdat", lib.loc="C:/Users/svoboa/Documents/R/win-library/3.1")
## Warning: package 'Ecdat' was built under R version 3.1.2
## Loading required package: Ecfun
##
## Attaching package: 'Ecdat'
##
## The following object is masked from 'package:datasets':
##
## Orange
x<-Somerville
For more on the dataset:
?Somerville
## starting httpd help server ... done
The dataset contains 7 factors and we will examine 6 of them. We will convert each of these to binary factors quality: quality ranking score for the lake 1 if high(5-10) -1 if low(0-4)
ski: if visitors are engaged in water skiiing or not 1 if yes -1 if no
income: annual household income 1 if high(5-10) -1 if smaller(0-4)
feeSom: is a annual user fee paid at Lake Somerville? 1 if yes -1 if no
costSom: expenditures when visiting Lake Somerville 1 if high(60-500) -1 if low(0-60)
costCon: expenditures when visiting a nearby lake, lake Conroe 1 if high(60-500) -1 if low(0-60)
Set up factors with binary levels: Quality:
x$quality<-as.numeric(x$quality)
x$quality<-cut(x$quality, c(0,4,10))
x$quality<-as.character(x$quality)
x$quality[x$quality == "(0,4]"] <- -1
x$quality[x$quality == "(4,10]"] <- 1
data<-na.omit(x)
Now that the NA's are omitted, before we set up the rest of the factors, we look at a summary of the data to decide how to cut up the data (levels to choose for continous variables)
summary(data)
## visits quality ski income feeSom
## Min. : 0.00 Length:285 no :161 Min. :1.00 no :272
## 1st Qu.: 1.00 Class :character yes:124 1st Qu.:3.00 yes: 13
## Median : 2.00 Mode :character Median :4.00
## Mean : 5.12 Mean :3.97
## 3rd Qu.: 5.00 3rd Qu.:5.00
## Max. :88.00 Max. :9.00
## costCon costSom costHoust
## Min. : 12.7 Min. : 4.8 Min. : 14.4
## 1st Qu.: 30.9 1st Qu.: 30.3 1st Qu.: 32.4
## Median : 44.4 Median : 47.3 Median : 48.2
## Mean : 59.5 Mean : 59.7 Mean : 61.0
## 3rd Qu.: 74.6 3rd Qu.: 75.0 3rd Qu.: 73.7
## Max. :493.8 Max. :491.5 Max. :491.0
The mean for the expenditure factors (costSom, costCon) are about 60 and the mean for income is about 4.
Ski:
data$ski<-as.character(data$ski)
data$ski[data$ski == "no"] <- -1
data$ski[data$ski == "yes"] <- 1
Income:
data$income<-as.numeric(data$income)
data$income<-cut(data$income, c(0,4,10))
data$income<-as.character(data$income)
data$income[data$income == "(0,4]"] <- -1
data$income[data$income == "(4,10]"] <- 1
feeSom:
data$feeSom<-as.character(data$feeSom)
data$feeSom[data$feeSom == "no"] <- -1
data$feeSom[data$feeSom == "yes"] <- 1
costSom:
data$costSom<-as.numeric(data$costSom)
data$costSom<-cut(data$costSom, c(0,60,500))
data$costSom<-as.character(data$costSom)
data$costSom[data$costSom == "(0,60]"] <- -1
data$costSom[data$costSom == "(60,500]"] <- 1
costCon:
data$costCon<-as.numeric(data$costCon)
data$costCon<-cut(data$costCon, c(0,60,500))
data$costCon<-as.character(data$costCon)
data$costCon[data$costCon == "(0,60]"] <- -1
data$costCon[data$costCon == "(60,500]"] <- 1
Now that the factors are set up as binary values, we can save them as integers:
data$quality<-as.integer(data$quality)
data$ski<-as.integer(data$ski)
data$income<-as.integer(data$income)
data$feeSom<-as.integer(data$feeSom)
data$costCon<-as.integer(data$costCon)
data$costSom<-as.integer(data$costSom)
Originally, all variables besides ski and feeSom were continuous but they were turned to binary values. For this recipie, visits is the only continous variable.
The number of visits to Lake Somerville (visits) will be the response variable for this experiment.
The Somerville dataset has 659 observations of 8 total variables from individuals in 1980 . Each individual is only observed once.
Structure and first/last observations of dataset:
str(data)
## 'data.frame': 285 obs. of 8 variables:
## $ visits : num 0 0 0 0 0 0 0 0 0 0 ...
## $ quality : int -1 -1 -1 -1 -1 -1 1 -1 -1 -1 ...
## $ ski : int -1 1 1 -1 -1 -1 1 1 -1 -1 ...
## $ income : int -1 -1 1 -1 -1 1 1 1 1 -1 ...
## $ feeSom : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ costCon : int -1 -1 -1 -1 1 1 -1 -1 -1 -1 ...
## $ costSom : int -1 -1 -1 -1 1 1 -1 -1 -1 -1 ...
## $ costHoust: num 28.1 28 29.9 14.6 57.9 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:374] 1 2 3 4 5 6 7 8 9 11 ...
## .. ..- attr(*, "names")= chr [1:374] "1" "2" "3" "4" ...
head(data)
## visits quality ski income feeSom costCon costSom costHoust
## 10 0 -1 -1 -1 -1 -1 -1 28.07
## 49 0 -1 1 -1 -1 -1 -1 27.98
## 61 0 -1 1 1 -1 -1 -1 29.89
## 62 0 -1 -1 -1 -1 -1 -1 14.62
## 73 0 -1 -1 -1 -1 1 1 57.86
## 75 0 -1 -1 1 -1 1 1 350.39
tail(data)
## visits quality ski income feeSom costCon costSom costHoust
## 654 30 -1 1 -1 -1 -1 -1 57.30
## 655 40 1 1 1 1 -1 -1 29.68
## 656 40 -1 1 -1 -1 -1 -1 25.80
## 657 40 -1 1 -1 -1 -1 -1 62.76
## 658 50 -1 1 -1 -1 -1 -1 37.27
## 659 88 -1 -1 -1 -1 -1 -1 25.46
This experiment will utilize a fractional factorial design [26-1]. Each factor has been manipulated (see Factors and Levels section above) to have only two levels (1 or -1).
An analysis of variance will be performed. We will test to see if the six factors mentioned above have an effect on the number of visits an indivdual visits Lake Somerville.
Null: The variation in number of visits cannot be explained by anything other than randomization.
The goal is to determine what factors make someone more likely to visit the lake. For example, does a persons engagemnet in water skiing make them visit more frequently? Are people with higher incomes able to come more frequently? Which of the six factors, if any, cause variation in the number of visits.
The dataset is a collection of survey data on individuals visits to Lake Somerville.
There are no replicates or repeated measures.
Blocking is not used in this design.
Mean number of visits by the 6 factors:
tapply(data$visits, data$quality, mean)
## -1 1
## 5.078 5.315
tapply(data$visits, data$income, mean)
## -1 1
## 5.720 3.718
tapply(data$visits, data$ski, mean)
## -1 1
## 4.832 5.500
tapply(data$visits, data$feeSom, mean)
## -1 1
## 4.669 14.615
tapply(data$visits, data$costCon, mean)
## -1 1
## 5.955 3.738
tapply(data$visits, data$costSom, mean)
## -1 1
## 6.337 3.103
It appears quality ranking of the lake may not affect the number of visits. The lower income group has a higher mean visit rate. The mean number of visits from skiers is higher than non-skiers. The mean for visitors who pay an annual fee is 3 times that of guests who do not have an annual fee. People who spend less money at both lakes tend to visit lake somerville more often.
Boxplots:
boxplot(data$visits~data$quality, xlab="Quality Rank", ylab="Number of Visits)")
boxplot(data$visits~data$income, xlab="Visitor Income", ylab="Number of Visits)")
boxplot(data$visits~data$ski, xlab="non-Skier and skier", ylab="Number of Visits)")
boxplot(data$visits~data$feeSom, xlab="No Annual Fee and Annual Fee", ylab="Number of Visits)")
boxplot(data$visits~data$costCon, xlab="Amount of Expenditures at Lake Conroe", ylab="Number of Visits)")
boxplot(data$visits~data$costSom, xlab="Amount of Expenditures at Lake Somerville", ylab="Number of Visits)")
From the boxplots, the factor that seems to have the biggest influence on number of visits is whether or not the individual pays an annual fee. Further anlaysis will be required to tell if other factors are explanatory variables of number of visits
Histogram of Visits:
hist(data$visits, breaks=20)
The most frequent number of visits is between 0 and 5. Few people visit more than 20 times.
Null Hypothesis: The variation in the response number of visits cannot be explained by anything other than randomization. Alternative Hypothesis: The variation in response can be explained by something other than randomization (such as ). Assumption: Data is normally distributed.
model1=aov(data$visits~data$quality+data$ski+data$income+data$feeSom+data$costSom+data$costCon)
summary(model1)
## Df Sum Sq Mean Sq F value Pr(>F)
## data$quality 1 2 2 0.04 0.8514
## data$ski 1 31 31 0.44 0.5054
## data$income 1 284 284 4.07 0.0446 *
## data$feeSom 1 1181 1181 16.90 5.2e-05 ***
## data$costSom 1 616 616 8.82 0.0032 **
## data$costCon 1 112 112 1.60 0.2066
## Residuals 278 19426 70
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Income level, if an annual fee is paid, and cost spent on expenditures at lake Somerville all have low p-values (less than .05), so therefore we reject the null hypothesis that number of visits cannot be explained by anything other than randomization. It is likely that these three factors can explain some of the variation in number of visits.
We need to install the FrF2 package in order to design the matrix:
install.packages(FrF2)
## Error: object 'FrF2' not found
library("FrF2", lib.loc="C:/Users/svoboa/Documents/R/win-library/3.1")
## Warning: package 'FrF2' was built under R version 3.1.2
## Loading required package: DoE.base
## Warning: package 'DoE.base' was built under R version 3.1.2
## Loading required package: grid
## Loading required package: conf.design
## Warning: package 'conf.design' was built under R version 3.1.2
##
## Attaching package: 'DoE.base'
##
## The following objects are masked from 'package:stats':
##
## aov, lm
##
## The following object is masked from 'package:graphics':
##
## plot.design
We will generate a 26-1 fractional factorial design. A full design would use 64 runs (26) but a half-fractional design will be used (32 runs or 25 ). We have 6 factors.
?
data$quality<-as.factor(data$quality)
data$ski<-as.factor(data$ski)
data$income<-as.factor(data$income)
data$feeSom<-as.factor(data$feeSom)
data$costCon<-as.factor(data$costCon)
data$costSom<-as.factor(data$costSom)
attach(data)
fracdesign<- FrF2(32, nfactors=6, estimable=formula("~quality+ski+income+feeSom+costSom+costCon+quality:(ski+income+feeSom+costSom+costCon)"), factor.names= c("Quality", "Ski?", "Income", "Fee?", "costSom", "costCon"), res4=TRUE, clear=FALSE)
## Error: formula must refer to elements of Letters or to factor names.
Visually inspect normality of data:
qqnorm(residuals(model1))
qqline(residuals(model1))
The data appears it may not be normal.
Shapiro-Wilks normality test to check:
shapiro.test(residuals(model1))
##
## Shapiro-Wilk normality test
##
## data: residuals(model1)
## W = 0.6277, p-value < 2.2e-16
Null hypothesis: The data came from a normally distributed population. We reject the null. We cannot assume the data is normal. This will be addressed in the contingencies section below.
plot(fitted(model1),residuals(model1))
The data should be symetric over the zero and spread out over the dynamic range. The model may not be a good fit since many points are clustered along the zero/left side.
Since the data did not fulfill the normality assumption of the ancova model (which could be why the fit of the model was questionable), a Kruskal-Wallis non-parametric analysis of variance by Rank Sum Test should be performed:
kruskal.test(visits~quality)
##
## Kruskal-Wallis rank sum test
##
## data: visits by quality
## Kruskal-Wallis chi-squared = 3.052, df = 1, p-value = 0.08062
kruskal.test(visits~ski)
##
## Kruskal-Wallis rank sum test
##
## data: visits by ski
## Kruskal-Wallis chi-squared = 0.0279, df = 1, p-value = 0.8673
kruskal.test(visits~income)
##
## Kruskal-Wallis rank sum test
##
## data: visits by income
## Kruskal-Wallis chi-squared = 1.977, df = 1, p-value = 0.1597
kruskal.test(visits~feeSom)
##
## Kruskal-Wallis rank sum test
##
## data: visits by feeSom
## Kruskal-Wallis chi-squared = 17.3, df = 1, p-value = 3.193e-05
kruskal.test(visits~costSom)
##
## Kruskal-Wallis rank sum test
##
## data: visits by costSom
## Kruskal-Wallis chi-squared = 4.34, df = 1, p-value = 0.03722
kruskal.test(visits~costCon)
##
## Kruskal-Wallis rank sum test
##
## data: visits by costCon
## Kruskal-Wallis chi-squared = 0.4542, df = 1, p-value = 0.5004
The null hypothesis of the kruskal test is that the mean ranks of the samples from the populations are expected to be the same (this is not the same as saying the populations have identical means).
Since the test results have a low p-value for the ski and costSom factors, we reject this null hypothesis. It is likely that the variation in the rank means of if someone pays an annual fee and how much money they spend at Somerville Lake can explain the variaion in number of visits.
The only difference between this result and the results of the ANOVA above is that the ANOVA also returned income level as being significant.
None used.
Data is from the Ecdat Package
All included above.