Ecdat package was installed through package list next to console window.
data = read.csv("Fair.csv", header = TRUE, sep = ",")
#selects desired dataset, however this line is actually unnecessary
Data from a Yale University study conducted in 1977 by Ray C. Fair. Cross-section containing 601 observations of extramarital affair data.
Dataset includes:
sex, a factor with levels (m/f)
age, reported age
ym, number of years married
child, a factor with levels (y/n)
religious, how religious, integer ranging from 1:5 (anti:very)
education, education level
occupation, occupation, integer ranging 1:7 according to Hollingshead classification
rate, self measure of marriage happiness, integer ranging from 1:5 (very unhappy:very happy)
nbaffairs, number of affairs in past year
For a condensed view of the data, call head/tail for a quick look at the set.
head(data, n=5) #displays first 5 rows
## X sex age ym child religious education occupation rate nbaffairs
## 1 1 male 37 10.00 no 3 18 7 4 0
## 2 2 female 27 4.00 no 4 14 6 4 0
## 3 3 female 32 15.00 yes 1 12 1 4 0
## 4 4 male 57 15.00 yes 5 18 6 5 0
## 5 5 male 22 0.75 no 2 17 6 3 0
tail(data, n=5) #shows last 5 rows
## X sex age ym child religious education occupation rate
## 597 597 male 22 1.5 yes 1 12 2 5
## 598 598 female 32 10.0 yes 2 18 5 4
## 599 599 male 32 10.0 yes 2 17 6 5
## 600 600 male 22 7.0 yes 3 18 6 2
## 601 601 female 32 15.0 yes 3 14 1 5
## nbaffairs
## 597 1
## 598 7
## 599 2
## 600 2
## 601 1
Instructions call for the selection of a pair of categorical independent variables (IVs) Factors selected are gender (sex) and whether or not the subject has children (child) Both factors contain just two levels.
str(data) #allows examination of dataset and variables
## 'data.frame': 601 obs. of 10 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 2 2 1 1 2 1 2 ...
## $ age : num 37 27 32 57 22 32 22 57 32 22 ...
## $ ym : num 10 4 15 15 0.75 1.5 0.75 15 15 1.5 ...
## $ child : Factor w/ 2 levels "no","yes": 1 1 2 2 1 1 1 2 2 1 ...
## $ religious : int 3 4 1 5 2 2 2 2 4 4 ...
## $ education : int 18 14 12 18 17 17 12 14 16 14 ...
## $ occupation: int 7 6 1 6 6 5 1 4 1 4 ...
## $ rate : int 4 4 4 5 3 5 3 4 2 5 ...
## $ nbaffairs : int 0 0 0 0 0 0 0 0 0 0 ...
Continuous dependent variable (DV) will be selected as the response variable, (nbaffairs)
Cohen’s D will be examined to measure the effect size of variables, confirming whether or not I have selected IVs effective enough to be worth studying in my opinion. A factor with two levels makes calculating Cohen’s D relatively easy. Since we are looking at just two factors and two levels, I will compute Cohen’s d with brute force. Observe all code segments in the appendix of this project. I noticed that Michael Deagan implemented a very nifty function which will conduct the analysis for each factor with less lines of codes than my simple analysis.
CohensDCHILD = abs(meandiff2/STDEVPOOLED2) #computes Cohen's D according to formula, using variables generated
CohensDGENDER #displays for CDgender
## [1] 0.09653795
CohensDCHILD #displays CDchild
## [1] 0.8621838
Gender shows a rather small effect, while the presence of a child in the family dictates a very strong effect. Intuitively this seems very plausible.
Variables are defined above in Systems Under Test section. Graphics and Descriptive summary will be provided in the Statistical Analysis section The response variable nbaffairs ranges from 0-12
Having looked at the effect size, an IV to study and an IV to block can be selected. A Null Hypothesis Statistical Test may now be designed and executed. Factor sex will be blocked, and factor child will be studied
G*Power will be used to determine sample size according to input parameters. Effect size =0.8622, alpha =0.05, Power =0.95, No. Groups=2
Resulting sample size N = 20 and Critical F value = 4.414 A random sample of 20 observations will be called later
Type I error alpha = 0.05 Type II error beta = 0.05 These are default options on G*Power and are common cutoffs for determining statistical significance.
The first objective was to choose a dataset and design that fit the requirements of the Project 2 description. The set was ideal because it contained two factors at low levels, which made Cohen’s D a simple analysis
Details on data collection are not provided. Since the method is unknown, it is not known to have randomization or not. Randomization reduces bias, so well generate some of that using a Monte Carlo simulation later.
There are no numbers/names/identifications of individuals. It is safe to assume that each person was surveyed once. There do not appear to be repeated measures in the data.
Replication should increase precision
Blocking is used as required by project description. By looking at effect sizes of two factors, a decision was made to block the less effective factor and study the more predictive variable. If I did this project again, I would likely block the effect of gender again and study the effect of religion. Blocking increases precision in the experiment.
Variables are factors and can be subject to ANOVA. First, examine a few descriptive statistics
boxplot(data$nbaffairs~data$child, xlab="Children?", ylab="Number of Affairs") #boxplot comparing yes/no children to number of affairs
Boxplot of studied IV against response (# affairs)
While the effect of children is high, I found that result to be quite intuitive. What came as a great surprise to me, is that by observing the boxplot, it appears that within this set of observations, one is more likely to have an affair if they have a child. While it can be said that correlation does not imply causation, this made me extremely curious and I consulted friends and roommates for thoughts on this finding. There was some speculation that folks in relationships with children are likely to stay in relationships “for their children” even if it is an unhappy relationship. Adultery is a fallacy regardless of the presence of children, so it seems plausible that those who choose to have affairs do not consider their children to be a deciding factor, rather it seems that it may make them more likely to stay married despite an unhappy marriage. Again this is speculation, but some thoughts I found very interesting.
Using results from G*power, simulation sample size will consist of 20 observations randomly selected from the sample population. The first ANOVA is executed below, and using a bootstrap method we will generate several more iterations of this analysis. The bootstrap method will be used for resampling, an alternative to NHST that is required for ISYE 6020 students. These alternative methods will follow model adequacy checking.
SampleSize=20 #sets sample size, determined in G*power
set.seed(2) #important for RNG, since factor has two levels
FairChildSample=data[sample(nrow(data), SampleSize),] #takes sample
Single_AOV= aov(FairChildSample$nbaffairs ~ FairChildSample$child + FairChildSample$sex) #generates single anova with sample
summary(Single_AOV) #shows F stat, residuals, and summary of first ANOVA
## Df Sum Sq Mean Sq F value Pr(>F)
## FairChildSample$child 1 9.80 9.80 0.538 0.473
## FairChildSample$sex 1 42.25 42.25 2.319 0.146
## Residuals 17 309.75 18.22
qqnorm(residuals(Single_AOV)) #generates graph of residuals
qqline(residuals(Single_AOV)) #fits a normal line
Residuals do not form linearity in the Q-Q plot, so an assumption of normality is invalid. This suggests that the results of the analysis may not be valid as well, a frustrating realization.
plot(fitted(Single_AOV),residuals(Single_AOV)) #plots residuals
QQ plot, as well as qqnorm, lead us to believe the model is inadequate for determining the effect of the presence of children toward the response variable affairs. This is mostly because the qqnorm trend suggests non-parametric data, and the qqplot as well.
To accompany NHST, we will execute two alternative methods. First, resampling statistics will be used as suggested in the project description. Effect size may be used as the second method, but since it was considered earlier using Cohen’s D to determine how I wanted to conduct the experiment, the other approach will be Tukey’s test (since its function is available in R, this method should be relatively easy to execute)
Ive observed this simulation conducted with a bootstrap method, Monte Carlo simulation, and permutation tests in other statistical analysis classes. The simplest way to generate several iterations of the AOV of a repeatedly randomized sample is to implement a for loop and use the randomization scheme executed in the “testing” section I think the definition of Monte Carlo simulation is pretty broad so I’d feel confident classifying this as a Monte Carlo experiment since it uses a for loop to generate repeated random sampling.
Alternative 1) Monte Carlo Simulation
SSize=20 #sets sample size, determined in G*power
set.seed(2) #needed for RNG
Ftest_val = 0 #sets initial value
R = 10000 # number of replications
for (i in 1:R) #Monte Carlo for loop
{
FairChildSample = data[sample(nrow(data), SampleSize),] #generates random sample like first ANOVA
Ftest_val[i] <- summary(aov(FairChildSample$nbaffairs ~ FairChildSample$child+FairChildSample$sex))[[1]][["F value"]][1]
}
hist(Ftest_val, freq = FALSE, xlab = "F-value", main = "PDF: (Histogram of F value from Resampling)") #plots PDF
x=seq(.25,8,.5) #makes trend line for PDF
points(x,y=df(x,1,604),type="b",col="red") #makes trend line for PDF
Plot
Alternative 2) Tukey’s Test to generate Confidence Intervals Error rate of 0.05 (or 95% CI) is expected from the value of alpha The procedure controls the experimentwise or “family” error rate at the selected level of alpha.
tukeytest = TukeyHSD(Single_AOV) #executes test, remember single_aov still uses the N=20 sample
plot(tukeytest) #plots results
Since the confidence interval contains zero, this suggests that the statistical significance of the difference in mean gender is null or insignificant.
Fair, R. (1977) “A note on the computation of the tobit estimator”, Econometrica, 45, 1723-1727.
install.packages("Ecdat")
library("Ecdat", lib.loc="~/R/win-library/3.2")
data = Fair #selects desired dataset, however this line is actually unnecessary
?Fair #Command helps understand variable names and gives background information on the dataset
head(data, n=5) #displays first 5 rows
tail(data, n=5) #shows last 5 rows
str(data) #allows examination of dataset and variables
mean1=mean(Fair$sex=="male")
mean2=mean(Fair$sex=="female")
meandiff = mean2-mean1 #computes diff1
STDEV1= sd(data$sex=="male")
STDEV2= sd(data$sex=="female")
STDEVPOOLED = (STDEV1 + STDEV2)/2
CohensDGENDER= abs(meandiff/STDEVPOOLED) #first cohens d value according to formula
meanA=mean(data$child=="yes")
meanB=mean(data$child=="no")
meandiff2= meanB-meanA #computs diff 2
STDEV1.2= sd(data$child=="yes")
STDEV2.2= sd(data$child=="no")
STDEVPOOLED2 = (STDEV1 + STDEV2)/2
CohensDCHILD = abs(meandiff2/STDEVPOOLED2) # second Cohen's d value using formula
CohensDGENDER #shows values
CohensDCHILD #' '
image:  #embeds G*power image. This line was not used
boxplot(data$nbaffairs~data$child, xlab="Children?", ylab="Number of Affairs") #boxplot comparing yes/no children to number of affairs
boxplot(data$nbaffairs~data$sex, xlab="Gender?", ylab="Number of Affairs") #boxplot comparing gender to no. affairs
SampleSize=20 #sets sample size, determined in G*power
set.seed(2) #needed for RNG
FairChildSample=data[sample(nrow(data), SampleSize),] #sample, just like first AOV
Single_AOV= aov(FairChildSample$nbaffairs ~ FairChildSample$child + FairChildSample$sex)
summary(Single_AOVblocked) #provides summary (Fstat, residuals, etc)
qqnorm(residuals(Single_AOV)) #graphs residuals
qqline(residuals(Single_AOV)) #fitted normal line in residual graph
plot(fitted(Single_AOV),residuals(Single_AOV)) #plots residuals
SampleSize=20 #sets sample size, determined in G*power
set.seed(2) # for RNG
Ftest_val = 0 #initial value
R = 10000 #sets number of iterations
for (i in 1:R)
{
FairChildSample = data[sample(nrow(data), SampleSize),] #takes sample like original AOV
Ftest_val[i] <- summary(aov(FairChildSample$nbaffairs ~ FairChildSample$child+FairChildSample$sex))[[1]][["F value"]][1]
}
hist(Ftest_val, freq = FALSE, xlab = "F-value", main = "PDF: (Histogram of F value from Resampling)")
x=seq(.25,8,.5) #makes trend line for PDF
points(x,y=df(x,1,604),type="b",col="red") #makes trend line for PDF
tukeytest = TukeyHSD(Single_AOV) #executes test, remember single_aov still uses the N=20 sample
plot(tukeytest) #generates plot