Project #2: Null Hypothesis Statistical Testing

Andreas Vought

Rensselaer Polytechnic Institute

11/10/2016, V1.0

1. Setting

Ecdat package was installed through package list next to console window.

data = read.csv("Fair.csv", header = TRUE, sep = ",")
#selects desired dataset, however this line is actually unnecessary

Data from a Yale University study conducted in 1977 by Ray C. Fair. Cross-section containing 601 observations of extramarital affair data.

System Under Test

Dataset includes:

sex, a factor with levels (m/f)
age, reported age
ym, number of years married
child, a factor with levels (y/n)
religious, how religious, integer ranging from 1:5 (anti:very)
education, education level
occupation, occupation, integer ranging 1:7 according to Hollingshead classification
rate, self measure of marriage happiness, integer ranging from 1:5 (very unhappy:very happy)
nbaffairs, number of affairs in past year

For a condensed view of the data, call head/tail for a quick look at the set.

head(data, n=5) #displays first 5 rows

##   X    sex age    ym child religious education occupation rate nbaffairs
## 1 1   male  37 10.00    no         3        18          7    4         0
## 2 2 female  27  4.00    no         4        14          6    4         0
## 3 3 female  32 15.00   yes         1        12          1    4         0
## 4 4   male  57 15.00   yes         5        18          6    5         0
## 5 5   male  22  0.75    no         2        17          6    3         0

tail(data, n=5) #shows last 5 rows

##       X    sex age   ym child religious education occupation rate
## 597 597   male  22  1.5   yes         1        12          2    5
## 598 598 female  32 10.0   yes         2        18          5    4
## 599 599   male  32 10.0   yes         2        17          6    5
## 600 600   male  22  7.0   yes         3        18          6    2
## 601 601 female  32 15.0   yes         3        14          1    5
##     nbaffairs
## 597         1
## 598         7
## 599         2
## 600         2
## 601         1

Factors and Levels

Instructions call for the selection of a pair of categorical independent variables (IVs) Factors selected are gender (sex) and whether or not the subject has children (child) Both factors contain just two levels.

str(data) #allows examination of dataset and variables

## 'data.frame':    601 obs. of  10 variables:
##  $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sex       : Factor w/ 2 levels "female","male": 2 1 1 2 2 1 1 2 1 2 ...
##  $ age       : num  37 27 32 57 22 32 22 57 32 22 ...
##  $ ym        : num  10 4 15 15 0.75 1.5 0.75 15 15 1.5 ...
##  $ child     : Factor w/ 2 levels "no","yes": 1 1 2 2 1 1 1 2 2 1 ...
##  $ religious : int  3 4 1 5 2 2 2 2 4 4 ...
##  $ education : int  18 14 12 18 17 17 12 14 16 14 ...
##  $ occupation: int  7 6 1 6 6 5 1 4 1 4 ...
##  $ rate      : int  4 4 4 5 3 5 3 4 2 5 ...
##  $ nbaffairs : int  0 0 0 0 0 0 0 0 0 0 ...

Continuous variables (if any)

Continuous dependent variable (DV) will be selected as the response variable, (nbaffairs)

Response Variables

Cohen’s D will be examined to measure the effect size of variables, confirming whether or not I have selected IVs effective enough to be worth studying in my opinion. A factor with two levels makes calculating Cohen’s D relatively easy. Since we are looking at just two factors and two levels, I will compute Cohen’s d with brute force. Observe all code segments in the appendix of this project. I noticed that Michael Deagan implemented a very nifty function which will conduct the analysis for each factor with less lines of codes than my simple analysis.

CohensDCHILD = abs(meandiff2/STDEVPOOLED2) #computes Cohen's D according to formula, using variables generated 

CohensDGENDER #displays for CDgender

## [1] 0.09653795

CohensDCHILD #displays CDchild

## [1] 0.8621838

Gender shows a rather small effect, while the presence of a child in the family dictates a very strong effect. Intuitively this seems very plausible.

The Data: How is it organized and what does it look like?

Variables are defined above in Systems Under Test section. Graphics and Descriptive summary will be provided in the Statistical Analysis section The response variable nbaffairs ranges from 0-12

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

Having looked at the effect size, an IV to study and an IV to block can be selected. A Null Hypothesis Statistical Test may now be designed and executed. Factor sex will be blocked, and factor child will be studied

G*Power will be used to determine sample size according to input parameters. Effect size =0.8622, alpha =0.05, Power =0.95, No. Groups=2

Resulting sample size N = 20 and Critical F value = 4.414 A random sample of 20 observations will be called later

Type I error alpha = 0.05 Type II error beta = 0.05 These are default options on G*Power and are common cutoffs for determining statistical significance.

What is the rational for this design?

The first objective was to choose a dataset and design that fit the requirements of the Project 2 description. The set was ideal because it contained two factors at low levels, which made Cohen’s D a simple analysis

Randomize:

Details on data collection are not provided. Since the method is unknown, it is not known to have randomization or not. Randomization reduces bias, so well generate some of that using a Monte Carlo simulation later.

Replicate:

There are no numbers/names/identifications of individuals. It is safe to assume that each person was surveyed once. There do not appear to be repeated measures in the data.

Replication should increase precision

Block:

Blocking is used as required by project description. By looking at effect sizes of two factors, a decision was made to block the less effective factor and study the more predictive variable. If I did this project again, I would likely block the effect of gender again and study the effect of religion. Blocking increases precision in the experiment.

3. (Statistical) Analysis

Variables are factors and can be subject to ANOVA. First, examine a few descriptive statistics

(Exploratory Data Analysis) Graphics and descriptive summary

boxplot(data$nbaffairs~data$child, xlab="Children?", ylab="Number of Affairs") #boxplot comparing yes/no children to number of affairs

Boxplot of studied IV against response (# affairs)

While the effect of children is high, I found that result to be quite intuitive. What came as a great surprise to me, is that by observing the boxplot, it appears that within this set of observations, one is more likely to have an affair if they have a child. While it can be said that correlation does not imply causation, this made me extremely curious and I consulted friends and roommates for thoughts on this finding. There was some speculation that folks in relationships with children are likely to stay in relationships “for their children” even if it is an unhappy relationship. Adultery is a fallacy regardless of the presence of children, so it seems plausible that those who choose to have affairs do not consider their children to be a deciding factor, rather it seems that it may make them more likely to stay married despite an unhappy marriage. Again this is speculation, but some thoughts I found very interesting.

Testing

Using results from G*power, simulation sample size will consist of 20 observations randomly selected from the sample population. The first ANOVA is executed below, and using a bootstrap method we will generate several more iterations of this analysis. The bootstrap method will be used for resampling, an alternative to NHST that is required for ISYE 6020 students. These alternative methods will follow model adequacy checking.

SampleSize=20 #sets sample size, determined in G*power
set.seed(2) #important for RNG, since factor has two levels
FairChildSample=data[sample(nrow(data), SampleSize),] #takes sample
Single_AOV= aov(FairChildSample$nbaffairs ~ FairChildSample$child + FairChildSample$sex) #generates single anova with sample
summary(Single_AOV) #shows F stat, residuals, and summary of first ANOVA

##                       Df Sum Sq Mean Sq F value Pr(>F)
## FairChildSample$child  1   9.80    9.80   0.538  0.473
## FairChildSample$sex    1  42.25   42.25   2.319  0.146
## Residuals             17 309.75   18.22

Diagnostics / Model Adequacy Checking

qqnorm(residuals(Single_AOV)) #generates graph of residuals
qqline(residuals(Single_AOV)) #fits a normal line

Residuals do not form linearity in the Q-Q plot, so an assumption of normality is invalid. This suggests that the results of the analysis may not be valid as well, a frustrating realization.

plot(fitted(Single_AOV),residuals(Single_AOV)) #plots residuals

QQ plot, as well as qqnorm, lead us to believe the model is inadequate for determining the effect of the presence of children toward the response variable affairs. This is mostly because the qqnorm trend suggests non-parametric data, and the qqplot as well.

Alternatives to Null Hypothesis Statistical Testing

To accompany NHST, we will execute two alternative methods. First, resampling statistics will be used as suggested in the project description. Effect size may be used as the second method, but since it was considered earlier using Cohen’s D to determine how I wanted to conduct the experiment, the other approach will be Tukey’s test (since its function is available in R, this method should be relatively easy to execute)

Ive observed this simulation conducted with a bootstrap method, Monte Carlo simulation, and permutation tests in other statistical analysis classes. The simplest way to generate several iterations of the AOV of a repeatedly randomized sample is to implement a for loop and use the randomization scheme executed in the “testing” section I think the definition of Monte Carlo simulation is pretty broad so I’d feel confident classifying this as a Monte Carlo experiment since it uses a for loop to generate repeated random sampling.

Alternative 1) Monte Carlo Simulation

SSize=20 #sets sample size, determined in G*power
set.seed(2) #needed for RNG
Ftest_val = 0 #sets initial value
R = 10000 # number of replications 
for (i in 1:R) #Monte Carlo for loop
{
  FairChildSample = data[sample(nrow(data), SampleSize),] #generates random sample like first ANOVA
  Ftest_val[i] <- summary(aov(FairChildSample$nbaffairs ~ FairChildSample$child+FairChildSample$sex))[[1]][["F value"]][1]
}
hist(Ftest_val, freq = FALSE, xlab = "F-value", main = "PDF: (Histogram of F value from Resampling)") #plots PDF
x=seq(.25,8,.5) #makes trend line for PDF 
points(x,y=df(x,1,604),type="b",col="red") #makes trend line for PDF

Plot

Alternative 2) Tukey’s Test to generate Confidence Intervals Error rate of 0.05 (or 95% CI) is expected from the value of alpha The procedure controls the experimentwise or “family” error rate at the selected level of alpha.

tukeytest = TukeyHSD(Single_AOV) #executes test, remember single_aov still uses the N=20 sample
plot(tukeytest) #plots results

Since the confidence interval contains zero, this suggests that the statistical significance of the difference in mean gender is null or insignificant.

4. References to the literature

Design and Analysis of Experiments, 8th Edition. Douglas C. Montgomery, 2013.
http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/anova/multiple-comparisons/what-is-tukey-s-method/
http://stats.stackexchange.com/questions/144084/variance-of-cohens-d-statistic
https://www.r-bloggers.com/probability-and-monte-carlo-methods/

5. Appendices

A summary of, or pointer to, the raw data

Fair, R. (1977) “A note on the computation of the tobit estimator”, Econometrica, 45, 1723-1727.

http://fairmodel.econ.yale.edu/rayfair/pdf/1978A200.PDF.

complete and documented R code

install.packages("Ecdat")
library("Ecdat", lib.loc="~/R/win-library/3.2")
data = Fair #selects desired dataset, however this line is actually unnecessary 

?Fair #Command helps understand variable names and gives background information on the dataset

head(data, n=5) #displays first 5 rows
tail(data, n=5) #shows last 5 rows 

str(data) #allows examination of dataset and variables 
mean1=mean(Fair$sex=="male")
mean2=mean(Fair$sex=="female")

meandiff = mean2-mean1 #computes diff1

STDEV1= sd(data$sex=="male")
STDEV2= sd(data$sex=="female")
STDEVPOOLED = (STDEV1 + STDEV2)/2

CohensDGENDER= abs(meandiff/STDEVPOOLED) #first cohens d value according to formula

meanA=mean(data$child=="yes")
meanB=mean(data$child=="no")

meandiff2= meanB-meanA #computs diff 2

STDEV1.2= sd(data$child=="yes")
STDEV2.2= sd(data$child=="no")
STDEVPOOLED2 = (STDEV1 + STDEV2)/2

CohensDCHILD = abs(meandiff2/STDEVPOOLED2) # second Cohen's d value using formula

CohensDGENDER #shows values
CohensDCHILD #' '

image: ![](C:\Users\andre_000\Desktop\5th Year\Fall\DoE\Pics\Gpower.jpg) #embeds G*power image. This line was not used

boxplot(data$nbaffairs~data$child, xlab="Children?", ylab="Number of Affairs") #boxplot comparing yes/no children to number of affairs 

boxplot(data$nbaffairs~data$sex, xlab="Gender?", ylab="Number of Affairs") #boxplot comparing gender to no. affairs

SampleSize=20 #sets sample size, determined in G*power
set.seed(2) #needed for RNG
FairChildSample=data[sample(nrow(data), SampleSize),] #sample, just like first AOV
Single_AOV= aov(FairChildSample$nbaffairs ~ FairChildSample$child + FairChildSample$sex)
summary(Single_AOVblocked) #provides summary (Fstat, residuals, etc)

qqnorm(residuals(Single_AOV)) #graphs residuals 
qqline(residuals(Single_AOV)) #fitted normal line in residual graph

plot(fitted(Single_AOV),residuals(Single_AOV)) #plots residuals 

SampleSize=20 #sets sample size, determined in G*power
set.seed(2) # for RNG
Ftest_val = 0 #initial value
R = 10000 #sets number of iterations 
for (i in 1:R) 
{
  FairChildSample = data[sample(nrow(data), SampleSize),] #takes sample like original AOV
  Ftest_val[i] <- summary(aov(FairChildSample$nbaffairs ~ FairChildSample$child+FairChildSample$sex))[[1]][["F value"]][1]
}
hist(Ftest_val, freq = FALSE, xlab = "F-value", main = "PDF: (Histogram of F value from Resampling)")
x=seq(.25,8,.5) #makes trend line for PDF 
points(x,y=df(x,1,604),type="b",col="red") #makes trend line for PDF 

tukeytest = TukeyHSD(Single_AOV) #executes test, remember single_aov still uses the N=20 sample
plot(tukeytest) #generates plot