as of August 28, 2014, superceding the version of August 24. Always use the most recent version.
This analysis uses marital affairs data to test the effect of one factor on the response variable of cigarette sales per pack per capita.
Below is the installation and initial examination of the dataset:
#Installing data package
library("Ecdat", lib.loc="~/R/win-library/3.1")
## Loading required package: Ecfun
##
## Attaching package: 'Ecdat'
##
## The following object is masked from 'package:datasets':
##
## Orange
#Attaching Marital Affair Data Set
data<-Fair
attach(data)
For this single factor analysis we will be examining the effect of religion (religious) on the response variable number of affairs (nbaffairs)
Our factor of religion is broken up into 5 levels which are 1,2,3,4, and 5. 1 represents that the person identifies as anti or opposed to religion, 3 is neutral and 5 is very religous.
#Levels of Religious
unique(data$religious)
## [1] 3 4 1 5 2
#Summary of Religious
summary(data$religious)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 3.00 3.12 4.00 5.00
#Overall Look at Data
head(data)
## sex age ym child religious education occupation rate nbaffairs
## 1 male 37 10.00 no 3 18 7 4 0
## 2 female 27 4.00 no 4 14 6 4 0
## 3 female 32 15.00 yes 1 12 1 4 0
## 4 male 57 15.00 yes 5 18 6 5 0
## 5 male 22 0.75 no 2 17 6 3 0
## 6 female 32 1.50 no 2 17 5 5 0
tail(data)
## sex age ym child religious education occupation rate nbaffairs
## 596 male 47 15.0 yes 3 16 4 2 7
## 597 male 22 1.5 yes 1 12 2 5 1
## 598 female 32 10.0 yes 2 18 5 4 7
## 599 male 32 10.0 yes 2 17 6 5 2
## 600 male 22 7.0 yes 3 18 6 2 2
## 601 female 32 15.0 yes 3 14 1 5 1
summary(data)
## sex age ym child religious
## female:315 Min. :17.5 Min. : 0.125 no :171 Min. :1.00
## male :286 1st Qu.:27.0 1st Qu.: 4.000 yes:430 1st Qu.:2.00
## Median :32.0 Median : 7.000 Median :3.00
## Mean :32.5 Mean : 8.178 Mean :3.12
## 3rd Qu.:37.0 3rd Qu.:15.000 3rd Qu.:4.00
## Max. :57.0 Max. :15.000 Max. :5.00
## education occupation rate nbaffairs
## Min. : 9.0 Min. :1.0 Min. :1.00 Min. : 0.00
## 1st Qu.:14.0 1st Qu.:3.0 1st Qu.:3.00 1st Qu.: 0.00
## Median :16.0 Median :5.0 Median :4.00 Median : 0.00
## Mean :16.2 Mean :4.2 Mean :3.93 Mean : 1.46
## 3rd Qu.:18.0 3rd Qu.:6.0 3rd Qu.:5.00 3rd Qu.: 0.00
## Max. :20.0 Max. :7.0 Max. :5.00 Max. :12.00
No observed variables are continuous.
For this experiment we will be focusing on the number of affairs for the response variable being effected by the chosen factor.
The data set is organized by the following variables: sex, age, ym, child, religious, education, occupation, rate, nbaffairs
Sex and age are descipritive of the person interviewed. Sex can be male or female and age ranges from 17 to 57.
ym and child represent characteristics of the marriage. ym is years married which ranges in the data from .125 to 15 and child is number of children which is a binary yes or no factor.
religious represents where the person stands with religion ranging from 1 (meaning anti-religion or opposed) to 5 (representing very religious).
education is a number ranging from 9 to 20 however there is no description of this variable and we can reasonably assuming this is represented in years.
Occupation is the persons job represented by the numbers 1 to 7 using the hollingshead classification
rate represents the persons rating of their marriage from 1 being very unhappy to 5 being very happy.
nbaffairs is our response variable which represents the number of affairs in the past year. This variable ranges from 0 to 12 in the data.
It is unknown whether or not the data collected for this study was collected by a randomly designed experiment.
To perform the experiment we will create linear models using the factor and with that model perform an analysis of variance. From these analysis we will be able to test the null hypothesis, that the number of affairs is independant of a persons stance on religion. The alternative is that religion does have an effect on the number of affairs. We will also perform several tests to check if our data is normal and therefore fits assumptions made when using analysis of covariance.
We will also perform a resampling analysis of variance on our data as a secondary test.
The rationale for using an analysis of covariance test is that it checks whether the means of several groups are equal. The alternative would be to use a two-sample t-test however there is more likely chance of the test resulting in a false positive of the hypothesis. By following up with a resampling we can further test the validity of our analysis.
The data was collected in an unknown way so we do not know if there was any randomization to it.
There are no replicated or repeated measures in the data. There is no specific identification of individuals so we can assume each person was only surveyed once that year.
There was no blocking in the design of this experiment
To start our statistical analysis we will make our variable into a factor for the analysis of variance and look at the boxplot of that factor. We will also use this to make an initial determination as to whether or not our variable appears to have an effect
#Defining stance on religion as a factor
data$religious = as.factor(data$religious)
#Boxplots of the means of of religious against the response variable number of affairs.
boxplot(nbaffairs~religious, data=data, xlab="Religious", ylab="Number of Affairs")
By examining the boxplot there seems to be a negative trend between religious and number of affairs. It is also worth noting that for each level of religious there is an outlier at 12.
First we generate a linear model with our factor. Then to test the hypotheses we perform an ANOVA test on the factor.
The null hypothesis of the tests is that the single factor does not have an effect on the response variable of number of affairs.
#Generation of Linear Model
model=aov(nbaffairs~religious, data=data)
anova(model)
## Analysis of Variance Table
##
## Response: nbaffairs
## Df Sum Sq Mean Sq F value Pr(>F)
## religious 4 193 48.3 4.54 0.0013 **
## Residuals 596 6336 10.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For each our analysis of variance test we found a resulting p-value of .001281. The p-value represents the probability that we can get the F value using the null hypothesis. Since our probability is very low we can assume that the factor demonstrates an effect on the response variable. We are lead to reject the null hypothesis.
As a secondary test we will perform a resampling anova using bootstrap.
data(Fair)
x <- data(Fair)
with(Fair,tapply(nbaffairs,religious,mean))
## 1 2 3 4 5
## 2.5833 1.6524 1.9767 0.8579 0.8857
with(Fair,tapply(nbaffairs,religious,var))
## 1 2 3 4 5
## 17.567 12.474 13.945 6.588 6.480
with(Fair,tapply(nbaffairs, religious,length))
## 1 2 3 4 5
## 48 164 129 190 70
# We need the sd(residual error) from the summary for the Monte Carlo simulation
summary(aov(nbaffairs~religious,data=Fair))
## Df Sum Sq Mean Sq F value Pr(>F)
## religious 1 136 136.3 12.8 0.00038 ***
## Residuals 599 6393 10.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
meanstar = mean(Fair$nbaffairs)
# sd is sqrt(residual error). Get this value from summary(), above.
sdstar = sqrt(10.67)
simreligious = Fair$religious
R = 10000 # 10,000 iterations
Fstar = numeric(R)
for (i in 1:R) {
# make residual normally distributed with known pooled variance
# this loop is a Monte Carlo simulation since distribution is known normal
groupA = rnorm(120, mean=meanstar, sd=sdstar)
groupB = rnorm(120, mean=meanstar, sd=sdstar)
groupC = rnorm(120, mean=meanstar, sd=sdstar)
groupD = rnorm(120, mean=meanstar, sd=sdstar)
groupE = rnorm(121, mean=meanstar, sd=sdstar)
simaffair = c(groupA,groupB,groupC,groupD,groupE)
simdata = data.frame(simaffair,simreligious)
Fstar[i] = oneway.test(simaffair~simreligious, var.equal=T, data=simdata)$statistic
}
# computes empirical distribution of F under H0 and display as histogram
hist(Fstar, ylim=c(0,1), xlim=c(0, 8), prob=T)
# Plot F distribution from .25 to 6 in .5 increments
x=seq(.25,6,.25)
# Overlay the analytic F distribution as red dots (degrees of freedom known)
# 5 = 6 levels-1 mean; 599= 604 (n)-6 means
# Notice the quality of the fit of the Monte Carlo simulation to the empirical distribution of F
points(x,y=df(x,1,604),type="b",col="red")
#using Bootstrap to do ANOVA
meanstar = with(data, tapply(nbaffairs,religious,mean))
grpA = data$nbaffairs[data$religious=="1"] - meanstar[1]
grpB = data$nbaffairs[data$religious=="2"] - meanstar[2]
grpC = data$nbaffairs[data$religious=="3"] - meanstar[3]
grpD = data$nbaffairs[data$religious=="4"] - meanstar[4]
grpE = data$nbaffairs[data$religious=="5"] - meanstar[5]
simreligious2 = data$religious
R = 10000
Fstar2 = numeric(R)
for (i in 1:R) {
groupA = sample(grpA, size=120, replace=T)
groupB = sample(grpB, size=120, replace=T)
groupC = sample(grpC, size=120, replace=T)
groupD = sample(grpD, size=120, replace=T)
groupE = sample(grpE, size=121, replace=T)
simaffair2 = c(groupA,groupB,groupC,groupD,groupE)
simdata2 = data.frame(simaffair2,simreligious2)
Fstar2[i] = oneway.test(simaffair2~simreligious2, var.equal=T, data=simdata2)$statistic
}
# Now generate a similar plot, comparing bootstrapped distribution to the known F distribution
hist(Fstar2, ylim=c(0,1), xlim=c(0,8), prob=T)
x=seq(.25,6,.25)
points(x,y=df(x,1,604),type="b",col="red") # The F distribution
# Here is the alpha level from the bootstrapped distribution
print(realFstar <- oneway.test(nbaffairs~religious, var.equal=T, data=Fair)$statistic)
## F
## 4.54
mean(Fstar>=realFstar)
## [1] 0.0011
# estimate alpha = 0.05 value of the test statistic, given F* from empirical distribution above
# generate quantiles of the analytic F distribution
qf(.95,1,604)
## [1] 3.857
# alpha = 0.05 value of bootstrapped F* is output here
quantile(Fstar2,.95)
## 95%
## 2.402
Next we preform another ANOVA on the newly created simaffair and simreligious.
model2=aov(simaffair2~simreligious2, data=simdata2)
anova(model2)
## Analysis of Variance Table
##
## Response: simaffair2
## Df Sum Sq Mean Sq F value Pr(>F)
## simreligious2 4 78 19.6 1.61 0.17
## Residuals 596 7259 12.2
Our p-value in this case is very high at .7365 and we are led to accept the null hypothesis that the number of affairs is independent of how religious a person is.
Next we graph Q-Q plots to check our data in our model for normality. If the data is not normal the results of the analysis may not be valid.
#QQ plots for residuals of the model
qqnorm(residuals(model))
qqline(residuals(model))
qqnorm(residuals(model2))
qqline(residuals(model2))
Our QQ plots show an extreme jump towards the end, leading us to think our data is not normal.
Second we plot a fitted model against the residuals. We see a large amount of symmetry across the horizontal.
#Plot of Fitted vs Residuals of model
plot(fitted(model), residuals(model))
plot(fitted(model2), residuals(model2))
Overrall the results of our model lead us to believe our model is not adequate and does not explain the effect of how religious a person is on the response variable of number of affairs. This is largely due to our qq plot making the data appear extremely non-parametric.
No literature was used
The R package in which this data was found can be located in the R library named Ecdat
It is possible that the results of our study are the result of chance. One concern could be the inherant honesty of the data. Marital affairs are sensitive matters and some participating parties may have been untruthful when the data was collected. While there is not an obvious causation of religiousness on the number of affairs there may in turn be an effect of religiousness on whether or not affairs occured.
There are also factors that may not have been accounted for. Perhaps the data was collected in an area where religion isn’t extremely prevalent or vice versa. Religion is also a complicated topic that may not easily be placed upon a 5 point scale. There also was no inclusion as to whether it was a specific religion or just religion in general.
The Kruskal-Wallis test is a non-parametric test of two different groups of means. As we can see from our following tests our p-value is close to 0 and therefore significant.
#Kruskal-Wallis test on explanatory variable price
kruskal.test(nbaffairs~religious, data=data)
##
## Kruskal-Wallis rank sum test
##
## data: nbaffairs by religious
## Kruskal-Wallis chi-squared = 20.55, df = 4, p-value = 0.0003879