Recipes for the Design of Experiments: Recipe Outline

as of August 28, 2014, superceding the version of August 24. Always use the most recent version.

Design of Experiments: Recipe 7 - One Factor ANOVA with Resampling

Kevin Toth

RPI

11/05/2014 V2.0

1. Setting

Marital Affairs data

This analysis uses marital affairs data to test the effect of one factor on the response variable of cigarette sales per pack per capita.

Below is the installation and initial examination of the dataset:

#Installing data package
library("Ecdat", lib.loc="~/R/win-library/3.1")
## Loading required package: Ecfun
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange
#Attaching Marital Affair Data Set
data<-Fair
attach(data)

Factors and Levels

For this single factor analysis we will be examining the effect of religion (religious) on the response variable number of affairs (nbaffairs)

Our factor of religion is broken up into 5 levels which are 1,2,3,4, and 5. 1 represents that the person identifies as anti or opposed to religion, 3 is neutral and 5 is very religous.

#Levels of Religious
unique(data$religious)
## [1] 3 4 1 5 2
#Summary of Religious
summary(data$religious)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00    3.00    3.12    4.00    5.00
#Overall Look at Data
head(data)
##      sex age    ym child religious education occupation rate nbaffairs
## 1   male  37 10.00    no         3        18          7    4         0
## 2 female  27  4.00    no         4        14          6    4         0
## 3 female  32 15.00   yes         1        12          1    4         0
## 4   male  57 15.00   yes         5        18          6    5         0
## 5   male  22  0.75    no         2        17          6    3         0
## 6 female  32  1.50    no         2        17          5    5         0
tail(data)
##        sex age   ym child religious education occupation rate nbaffairs
## 596   male  47 15.0   yes         3        16          4    2         7
## 597   male  22  1.5   yes         1        12          2    5         1
## 598 female  32 10.0   yes         2        18          5    4         7
## 599   male  32 10.0   yes         2        17          6    5         2
## 600   male  22  7.0   yes         3        18          6    2         2
## 601 female  32 15.0   yes         3        14          1    5         1
summary(data)
##      sex           age             ym         child       religious   
##  female:315   Min.   :17.5   Min.   : 0.125   no :171   Min.   :1.00  
##  male  :286   1st Qu.:27.0   1st Qu.: 4.000   yes:430   1st Qu.:2.00  
##               Median :32.0   Median : 7.000             Median :3.00  
##               Mean   :32.5   Mean   : 8.178             Mean   :3.12  
##               3rd Qu.:37.0   3rd Qu.:15.000             3rd Qu.:4.00  
##               Max.   :57.0   Max.   :15.000             Max.   :5.00  
##    education      occupation       rate        nbaffairs    
##  Min.   : 9.0   Min.   :1.0   Min.   :1.00   Min.   : 0.00  
##  1st Qu.:14.0   1st Qu.:3.0   1st Qu.:3.00   1st Qu.: 0.00  
##  Median :16.0   Median :5.0   Median :4.00   Median : 0.00  
##  Mean   :16.2   Mean   :4.2   Mean   :3.93   Mean   : 1.46  
##  3rd Qu.:18.0   3rd Qu.:6.0   3rd Qu.:5.00   3rd Qu.: 0.00  
##  Max.   :20.0   Max.   :7.0   Max.   :5.00   Max.   :12.00

Continuous variables (if any)

No observed variables are continuous.

Response variables

For this experiment we will be focusing on the number of affairs for the response variable being effected by the chosen factor.

The Data: How is it organized and what does it look like?

The data set is organized by the following variables: sex, age, ym, child, religious, education, occupation, rate, nbaffairs

Sex and age are descipritive of the person interviewed. Sex can be male or female and age ranges from 17 to 57.

ym and child represent characteristics of the marriage. ym is years married which ranges in the data from .125 to 15 and child is number of children which is a binary yes or no factor.

religious represents where the person stands with religion ranging from 1 (meaning anti-religion or opposed) to 5 (representing very religious).

education is a number ranging from 9 to 20 however there is no description of this variable and we can reasonably assuming this is represented in years.

Occupation is the persons job represented by the numbers 1 to 7 using the hollingshead classification

rate represents the persons rating of their marriage from 1 being very unhappy to 5 being very happy.

nbaffairs is our response variable which represents the number of affairs in the past year. This variable ranges from 0 to 12 in the data.

Randomization

It is unknown whether or not the data collected for this study was collected by a randomly designed experiment.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

To perform the experiment we will create linear models using the factor and with that model perform an analysis of variance. From these analysis we will be able to test the null hypothesis, that the number of affairs is independant of a persons stance on religion. The alternative is that religion does have an effect on the number of affairs. We will also perform several tests to check if our data is normal and therefore fits assumptions made when using analysis of covariance.

We will also perform a resampling analysis of variance on our data as a secondary test.

What is the rationale for this design?

The rationale for using an analysis of covariance test is that it checks whether the means of several groups are equal. The alternative would be to use a two-sample t-test however there is more likely chance of the test resulting in a false positive of the hypothesis. By following up with a resampling we can further test the validity of our analysis.

Randomize: What is the Randomization Scheme?

The data was collected in an unknown way so we do not know if there was any randomization to it.

Replicate: Are there replicates and/or repeated measures?

There are no replicated or repeated measures in the data. There is no specific identification of individuals so we can assume each person was only surveyed once that year.

Block: Did you use blocking in the design?

There was no blocking in the design of this experiment

3. Statistical Analysis

Exploratory Data Analysis: Graphics and Descriptive summary

To start our statistical analysis we will make our variable into a factor for the analysis of variance and look at the boxplot of that factor. We will also use this to make an initial determination as to whether or not our variable appears to have an effect

#Defining stance on religion as a factor
data$religious = as.factor(data$religious)

#Boxplots of the means of of religious against the response variable number of affairs.
boxplot(nbaffairs~religious, data=data, xlab="Religious", ylab="Number of Affairs")

plot of chunk unnamed-chunk-3

By examining the boxplot there seems to be a negative trend between religious and number of affairs. It is also worth noting that for each level of religious there is an outlier at 12.

Testing

First we generate a linear model with our factor. Then to test the hypotheses we perform an ANOVA test on the factor.

The null hypothesis of the tests is that the single factor does not have an effect on the response variable of number of affairs.

#Generation of Linear Model
model=aov(nbaffairs~religious, data=data)
anova(model)
## Analysis of Variance Table
## 
## Response: nbaffairs
##            Df Sum Sq Mean Sq F value Pr(>F)   
## religious   4    193    48.3    4.54 0.0013 **
## Residuals 596   6336    10.6                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For each our analysis of variance test we found a resulting p-value of .001281. The p-value represents the probability that we can get the F value using the null hypothesis. Since our probability is very low we can assume that the factor demonstrates an effect on the response variable. We are lead to reject the null hypothesis.

As a secondary test we will perform a resampling anova using bootstrap.

data(Fair)
x <- data(Fair)
with(Fair,tapply(nbaffairs,religious,mean))
##      1      2      3      4      5 
## 2.5833 1.6524 1.9767 0.8579 0.8857
with(Fair,tapply(nbaffairs,religious,var))
##      1      2      3      4      5 
## 17.567 12.474 13.945  6.588  6.480
with(Fair,tapply(nbaffairs, religious,length))
##   1   2   3   4   5 
##  48 164 129 190  70
# We need the sd(residual error) from the summary for the Monte Carlo simulation
summary(aov(nbaffairs~religious,data=Fair))
##              Df Sum Sq Mean Sq F value  Pr(>F)    
## religious     1    136   136.3    12.8 0.00038 ***
## Residuals   599   6393    10.7                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
meanstar = mean(Fair$nbaffairs)
# sd is sqrt(residual error). Get this value from summary(), above.
sdstar = sqrt(10.67)
simreligious = Fair$religious
R = 10000  # 10,000 iterations
Fstar = numeric(R)
for (i in 1:R) {
  # make residual normally distributed with known pooled variance
  # this loop is a Monte Carlo simulation since distribution is known normal
  groupA = rnorm(120, mean=meanstar, sd=sdstar)
  groupB = rnorm(120, mean=meanstar, sd=sdstar)
  groupC = rnorm(120, mean=meanstar, sd=sdstar)
  groupD = rnorm(120, mean=meanstar, sd=sdstar)
  groupE = rnorm(121, mean=meanstar, sd=sdstar)
  simaffair = c(groupA,groupB,groupC,groupD,groupE)
  simdata = data.frame(simaffair,simreligious)
  Fstar[i] = oneway.test(simaffair~simreligious, var.equal=T, data=simdata)$statistic
}

# computes empirical distribution of F under H0 and display as histogram
hist(Fstar, ylim=c(0,1), xlim=c(0, 8), prob=T)
# Plot F distribution from .25 to 6 in .5 increments
x=seq(.25,6,.25)
# Overlay the analytic F distribution as red dots (degrees of freedom known)
# 5 = 6 levels-1 mean; 599= 604 (n)-6 means
# Notice the quality of the fit of the Monte Carlo simulation to the empirical distribution of F
points(x,y=df(x,1,604),type="b",col="red")

plot of chunk unnamed-chunk-5

#using Bootstrap to do ANOVA
meanstar = with(data, tapply(nbaffairs,religious,mean))
grpA = data$nbaffairs[data$religious=="1"] - meanstar[1]
grpB = data$nbaffairs[data$religious=="2"] - meanstar[2]
grpC = data$nbaffairs[data$religious=="3"] - meanstar[3]
grpD = data$nbaffairs[data$religious=="4"] - meanstar[4]
grpE = data$nbaffairs[data$religious=="5"] - meanstar[5]
simreligious2 = data$religious
R = 10000
Fstar2 = numeric(R)
for (i in 1:R) {
  groupA = sample(grpA, size=120, replace=T)
  groupB = sample(grpB, size=120, replace=T)
  groupC = sample(grpC, size=120, replace=T)
  groupD = sample(grpD, size=120, replace=T)
  groupE = sample(grpE, size=121, replace=T)
  simaffair2 = c(groupA,groupB,groupC,groupD,groupE)
  simdata2 = data.frame(simaffair2,simreligious2)
  Fstar2[i] = oneway.test(simaffair2~simreligious2, var.equal=T, data=simdata2)$statistic
}

# Now generate a similar plot, comparing bootstrapped distribution to the known F distribution
hist(Fstar2, ylim=c(0,1), xlim=c(0,8), prob=T)
x=seq(.25,6,.25)
points(x,y=df(x,1,604),type="b",col="red")  # The F distribution

plot of chunk unnamed-chunk-5

# Here is the alpha level from the bootstrapped distribution
print(realFstar <- oneway.test(nbaffairs~religious, var.equal=T, data=Fair)$statistic)
##    F 
## 4.54
mean(Fstar>=realFstar)
## [1] 0.0011
# estimate alpha = 0.05 value of the test statistic, given F* from empirical distribution above
# generate quantiles of the analytic F distribution
qf(.95,1,604)
## [1] 3.857
# alpha = 0.05 value of bootstrapped F* is output here
quantile(Fstar2,.95)
##   95% 
## 2.402

Next we preform another ANOVA on the newly created simaffair and simreligious.

model2=aov(simaffair2~simreligious2, data=simdata2)
anova(model2)
## Analysis of Variance Table
## 
## Response: simaffair2
##                Df Sum Sq Mean Sq F value Pr(>F)
## simreligious2   4     78    19.6    1.61   0.17
## Residuals     596   7259    12.2

Our p-value in this case is very high at .7365 and we are led to accept the null hypothesis that the number of affairs is independent of how religious a person is.

Diagnostics/Model Adequacy Checking

Next we graph Q-Q plots to check our data in our model for normality. If the data is not normal the results of the analysis may not be valid.

#QQ plots for residuals of the model
qqnorm(residuals(model))
qqline(residuals(model))

plot of chunk unnamed-chunk-7

qqnorm(residuals(model2))
qqline(residuals(model2))

plot of chunk unnamed-chunk-7

Our QQ plots show an extreme jump towards the end, leading us to think our data is not normal.

Second we plot a fitted model against the residuals. We see a large amount of symmetry across the horizontal.

#Plot of Fitted vs Residuals of model
plot(fitted(model), residuals(model))

plot of chunk unnamed-chunk-8

plot(fitted(model2), residuals(model2))

plot of chunk unnamed-chunk-8

Overrall the results of our model lead us to believe our model is not adequate and does not explain the effect of how religious a person is on the response variable of number of affairs. This is largely due to our qq plot making the data appear extremely non-parametric.

4. References to the literature

No literature was used

5. Appendices

The R package in which this data was found can be located in the R library named Ecdat

6. Contingencies

It is possible that the results of our study are the result of chance. One concern could be the inherant honesty of the data. Marital affairs are sensitive matters and some participating parties may have been untruthful when the data was collected. While there is not an obvious causation of religiousness on the number of affairs there may in turn be an effect of religiousness on whether or not affairs occured.

There are also factors that may not have been accounted for. Perhaps the data was collected in an area where religion isn’t extremely prevalent or vice versa. Religion is also a complicated topic that may not easily be placed upon a 5 point scale. There also was no inclusion as to whether it was a specific religion or just religion in general.

The Kruskal-Wallis test is a non-parametric test of two different groups of means. As we can see from our following tests our p-value is close to 0 and therefore significant.

#Kruskal-Wallis test on explanatory variable price
kruskal.test(nbaffairs~religious, data=data)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  nbaffairs by religious
## Kruskal-Wallis chi-squared = 20.55, df = 4, p-value = 0.0003879