Design of Experiments Project 2

Null Hypothesis Testing

Felipe Ortiz

RPI

November 10th 2016 V1

1. Setting

For this example we will select a dataset with at least two categorical independent variables and a continuous dependent variable. The objective is to test a hypothesis between the DV and one of the IVs while blocking the other.

The data set selected is a part of the Ecdat R Package called Fair, it has 601 observations and 9 variables.

System under test

The data collected provides the amount of affairs a person has had in the past year, while also having collected various factors of interest associated with the individual. Below we show the data being loaded from the library

library(Ecdat)
Data <- Fair

When the data was collected the variables were defined as follows:

  • sex a factor with levels (male,female)

  • age age

  • ym number of years married

  • child children ? a factor

  • religious how religious, from 1 (anti) to 5 (very)

  • education education

  • occupation occupation, from 1 to 7, according to hollingshead classification (reverse numbering)

  • rate self rating of mariage, from 1 (very unhappy) to 5 (very happy)

  • nbaffairs number of affairs in past year

Next we look at the stucture of the data.

str(Data)
## 'data.frame':    601 obs. of  9 variables:
##  $ sex       : Factor w/ 2 levels "female","male": 2 1 1 2 2 1 1 2 1 2 ...
##  $ age       : num  37 27 32 57 22 32 22 57 32 22 ...
##  $ ym        : num  10 4 15 15 0.75 1.5 0.75 15 15 1.5 ...
##  $ child     : Factor w/ 2 levels "no","yes": 1 1 2 2 1 1 1 2 2 1 ...
##  $ religious : int  3 4 1 5 2 2 2 2 4 4 ...
##  $ education : num  18 14 12 18 17 17 12 14 16 14 ...
##  $ occupation: int  7 6 1 6 6 5 1 4 1 4 ...
##  $ rate      : int  4 4 4 5 3 5 3 4 2 5 ...
##  $ nbaffairs : num  0 0 0 0 0 0 0 0 0 0 ...

We see that R has incorrectly assigned some data as integers rather than factors, so we will tell it which ones are factors

Data$religious <- as.factor(Data$religious)

Data$education <- as.factor(Data$education)

Data$occupation <- as.factor(Data$occupation)

Data$rate <- as.factor(Data$rate)

Factors and Levels

The Factors present in the data set and their levels.

levels(Data$sex)
## [1] "female" "male"
levels(Data$child)
## [1] "no"  "yes"
levels(Data$religious)
## [1] "1" "2" "3" "4" "5"
levels(Data$education)
## [1] "9"  "12" "14" "16" "17" "18" "20"
levels(Data$occupation)
## [1] "1" "2" "3" "4" "5" "6" "7"
levels(Data$rate)
## [1] "1" "2" "3" "4" "5"

Continuous variables

The Continuous variables present are the following

summary(Data$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.50   27.00   32.00   32.49   37.00   57.00
summary(Data$ym)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.125   4.000   7.000   8.178  15.000  15.000
summary(Data$nbaffairs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   1.456   0.000  12.000

Response variables

The response variable that will be used is nbaffairs. The number of affairs the subject had over the past year.

Data Organization

Although these steps are simple, they are an important part of setting up the higher level analysis. We can see the summary of the data as well as the first 10 rows of data.

summary(Data)
##      sex           age              ym         child     religious
##  female:315   Min.   :17.50   Min.   : 0.125   no :171   1: 48    
##  male  :286   1st Qu.:27.00   1st Qu.: 4.000   yes:430   2:164    
##               Median :32.00   Median : 7.000             3:129    
##               Mean   :32.49   Mean   : 8.178             4:190    
##               3rd Qu.:37.00   3rd Qu.:15.000             5: 70    
##               Max.   :57.00   Max.   :15.000                      
##                                                                   
##  education occupation rate      nbaffairs     
##  9 :  7    1:113      1: 16   Min.   : 0.000  
##  12: 44    2: 13      2: 66   1st Qu.: 0.000  
##  14:154    3: 47      3: 93   Median : 0.000  
##  16:115    4: 68      4:194   Mean   : 1.456  
##  17: 89    5:204      5:232   3rd Qu.: 0.000  
##  18:112    6:143              Max.   :12.000  
##  20: 80    7: 13
head(Data, n=10)
##       sex age    ym child religious education occupation rate nbaffairs
## 1    male  37 10.00    no         3        18          7    4         0
## 2  female  27  4.00    no         4        14          6    4         0
## 3  female  32 15.00   yes         1        12          1    4         0
## 4    male  57 15.00   yes         5        18          6    5         0
## 5    male  22  0.75    no         2        17          6    3         0
## 6  female  32  1.50    no         2        17          5    5         0
## 7  female  22  0.75    no         2        12          1    3         0
## 8    male  57 15.00   yes         2        14          4    4         0
## 9  female  32 15.00   yes         4        16          1    2         0
## 10   male  22  1.50    no         4        14          4    5         0

2. Experimental Design

How will the experiment be organized and conducted to test the hypothesis?

After we have looked at the data, we will formulate our hypothesis.

We see that the intended response variable is the nbaffairs, in other words our null hypothesis should be as follows: The average value of affairs a married person has had in the past year is the same for all groups of the independent varaible.

We will choose rate as the independent variable.

What is the rationale for this design?

In this case the experiments have already been run, and there is not data stating how or why it was done. As such we cannot comment on the design.

Randomize: What is the Randomization Scheme?

Randomized design is used to control for effects from extraneous variables, or variables that have an effect but are not being studied. The assumption is that on average these variables will affect the response variable equally so any significant differences should come from the independent variable(s) being tested.

Again, since we weren’t present for the collection of this data we cannot assure that this data was collected randomly. We will however be sub-setting and ordering the data randomly, this will help provide some randomization.

Replicate: Are there replicates and/or repeated measures?

Replication is an important part of experimental design, especially for validating results. It is not only more convincing, but more significant if the same parameters on an experiment are run several times, and the dependent results are similar. In this case however it would only be possible to have replications if two subjects answered equally to all the questions (Independent variables).

Block: Did you use blocking in the design?

Blocking is used to eliminate the influence of extraneous variables, in this case we will be using the child factor to block.

3. Statistical Analysis

Exploratory Data Analysis - Graphics and descriptive summary

We will now begin some analysis of the data, to better understand it.

hist(Data$nbaffairs)

As we can see, the majority of responses were 0. Now lets try to split it up by the levels of our factor.

boxplot(Data$nbaffairs~ Data$rate, main = "Boxplot of Affairs", xlab = "Marriage Rating")

It does seem that the independent variable has an effect on the response.

Testing

To begin testing we will need to select a type I error α and a type II error β. As well as this we We will need to investigate the effect size of the main effect. Using Cohen’s d we find that the effect is very large.

Power Analysis and Effect Size

Using the following parameters we use G*Power under the F-test family.

  • Effect size: .99 (large)

  • α: 0.05

  • β: 0.10

  • Number of Groups: 5

We find the necessary N to be 25, due to the large effect size. We sample Data randomly 25 times and name our sample Data2

Data2 <- Data[sample(601, 25, replace = T),]

ANOVA

Full<- aov(Data$nbaffairs~ Data$rate+Data$child)
summary(Full)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## Data$rate     4    626  156.42  15.808 2.64e-12 ***
## Data$child    1     16   15.90   1.607    0.205    
## Residuals   595   5887    9.89                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
qqnorm(residuals(Full))
qqline(residuals(Full))

Sample<- aov(Data2$nbaffairs~ Data2$rate+Data2$child)
summary(Sample)
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## Data2$rate   4 135.64   33.91  15.477 8.7e-06 ***
## Data2$child  1   0.10    0.10   0.043   0.837    
## Residuals   19  41.63    2.19                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
qqnorm(residuals(Sample))
qqline(residuals(Sample))

We see that the results for both the Full and Samples sets are similar. However we see that the residuals of the ANOVA test are not normal. Lets run some more diagnostic plots.

par(mfrow=c(2,2))
plot(Full)

Looking at the plot we can draw the following conclusions.

  • Residuals vs Fitted: Shows possible non-linear pattern in the residuals.

  • Normal QQ Plot: Residuals of the linear regression are not normal.

  • Scale – Location: Division line is somewhat horizontal. Showing possible homoscedasticity.

  • Residuals vs Leverage: No proof of influential outliers.

At this point it is typical to attempt alternative hypothesis testing techniques. We will try two others.

Confidence Intervals

After conducting an ANOVA it is a good idea to perform a Tukey test even if the diagnostic plots are ideal. This test will analyze the difference between the means of the levels and return a CI for this estimate. As such if the result contains the value of 0 you usually reject that there is a difference in the means.

par(mfrow=c(1,1))
CI <- TukeyHSD(x=Full, conf.level=0.95)
CI
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Data$nbaffairs ~ Data$rate + Data$child)
## 
## $`Data$rate`
##            diff       lwr        upr     p adj
## 2-1  0.04924242 -2.349169  2.4476535 0.9999977
## 3-1 -2.44489247 -4.774380 -0.1154047 0.0342271
## 4-1 -2.56572165 -4.804430 -0.3270137 0.0154155
## 5-1 -3.15517241 -5.379868 -0.9304772 0.0010932
## 3-2 -2.49413490 -3.879403 -1.1088670 0.0000107
## 4-2 -2.61496407 -3.841450 -1.3884783 0.0000001
## 5-2 -3.20441484 -4.405132 -2.0036974 0.0000000
## 4-3 -0.12082918 -1.206373  0.9647144 0.9981274
## 5-3 -0.71027994 -1.766623  0.3460628 0.3516310
## 5-4 -0.58945076 -1.426804  0.2479023 0.3045381
## 
## $`Data$child`
##             diff        lwr       upr     p adj
## yes-no 0.3499589 -0.2085683 0.9084861 0.2189713
plot(CI)

In this case we that some levels do show some difference (the P-Value is close to 0) however others do not. This gives us evidence that there is change for some levels, but there is not for others.

Resampling

P = numeric(10000) 
for (i in 1:10000){
  Data2 <- Data[sample(601, 50, replace = T),]
  Fits = anova(lm(Data2$nbaffairs~ Data2$rate+Data2$child))
  p <- Fits$"Pr(>F)"[1]  
  P[i] <- p # save p-value 
}

Due to the fact that some results were being emitted from the results due to warnings with the data fitting, the sample size was doubled to 50.

par(mfrow=c(1,1))
hist(P)

We can see from the Histogram that most results lie below a P value of 0.05

summary(P < .05)
##    Mode   FALSE    TRUE    NA's 
## logical    5212    4788       0
mean(P)
## [1] 0.1926917

Due to the P-value being above 0.05; In this case we accept the null and

The same tests were run with sample sizes of 100 to assure accurate results, and the results were equivalent.

Conclusion

These results show the importance of not blindly trusting ANOVA results and using alternative tests. We ran the ANOVA and saw there was a significant effect from the IV. However when we look more deeply we saw through the diagnostic plots that the data did not exactly validate the ANOVA assumptions. From there instead of attempting data transforms we used confidence intervals and resampling as alternative tests. From both of these we found possible evidence that there was en effect; however we do end up accepting the Null Hypothesis for both of the alternative tests.

4. References to the literature

Design and Analysis of Experiments, 8th Edition Douglas C. Montgomery

https://www.r-bloggers.com/anova-and-tukeys-test-on-r/

5. Appendices

Appendix I: Data

The data set is a part of the Ecdat R Package, more information at: https://cran.r-project.org/web/packages/Ecdat/Ecdat.pdf

Appendix II: Full Code

rm(list=ls())

#Read in the selected data set
library(Ecdat)
Data <- Fair


str(Data)

#Correct Data
Data$religious <- as.factor(Data$religious)
Data$education <- as.factor(Data$education)
Data$occupation <- as.factor(Data$occupation)
Data$rate <- as.factor(Data$rate)


#Levels
levels(Data$sex)
levels(Data$child)
levels(Data$religious)
levels(Data$education)
levels(Data$occupation)
levels(Data$rate)

#Cont. Var
summary(Data$age)
summary(Data$ym)
summary(Data$nbaffairs)

#Explore
summary(Data)
head(Data, n=10)
boxplot(Data$nbaffairs~ Data$rate, main = "Boxplot of Affairs", xlab = "Marriage Rating")

#G*Power Sample
Data2 <- Data[sample(601, 25, replace = T),]

### ANOVA
Full<- aov(Data$nbaffairs~ Data$rate+Data$child)
summary(Full)
qqnorm(residuals(Full))
qqline(residuals(Full))

Sample<- aov(Data2$nbaffairs~ Data2$rate+Data2$child)
summary(Sample)
qqnorm(residuals(Sample))
qqline(residuals(Sample))

#Diagnostic Plots
par(mfrow=c(2,2))
plot(Full)


#Tukey and CI
par(mfrow=c(2,1))
CI <- TukeyHSD(x=Full, conf.level=0.95)
CI
plot(CI)



### Resampling - 10000 Iterations
P = numeric(10000) 
for (i in 1:10000){
  Data2 <- Data[sample(601, 50, replace = T),]
  Fits = anova(lm(Data2$nbaffairs~ Data2$rate+Data2$child))
  p <- Fits$"Pr(>F)"[1]  
  P[i] <- p 
}


par(mfrow=c(1,1))
hist(P)
summary(P < .05)
mean(P)