For this example we will select a dataset with at least two categorical independent variables and a continuous dependent variable. The objective is to test a hypothesis between the DV and one of the IVs while blocking the other.
The data set selected is a part of the Ecdat R Package called Fair, it has 601 observations and 9 variables.
The data collected provides the amount of affairs a person has had in the past year, while also having collected various factors of interest associated with the individual. Below we show the data being loaded from the library
library(Ecdat)
Data <- Fair
When the data was collected the variables were defined as follows:
sex a factor with levels (male,female)
age age
ym number of years married
child children ? a factor
religious how religious, from 1 (anti) to 5 (very)
education education
occupation occupation, from 1 to 7, according to hollingshead classification (reverse numbering)
rate self rating of mariage, from 1 (very unhappy) to 5 (very happy)
nbaffairs number of affairs in past year
Next we look at the stucture of the data.
str(Data)
## 'data.frame': 601 obs. of 9 variables:
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 2 2 1 1 2 1 2 ...
## $ age : num 37 27 32 57 22 32 22 57 32 22 ...
## $ ym : num 10 4 15 15 0.75 1.5 0.75 15 15 1.5 ...
## $ child : Factor w/ 2 levels "no","yes": 1 1 2 2 1 1 1 2 2 1 ...
## $ religious : int 3 4 1 5 2 2 2 2 4 4 ...
## $ education : num 18 14 12 18 17 17 12 14 16 14 ...
## $ occupation: int 7 6 1 6 6 5 1 4 1 4 ...
## $ rate : int 4 4 4 5 3 5 3 4 2 5 ...
## $ nbaffairs : num 0 0 0 0 0 0 0 0 0 0 ...
We see that R has incorrectly assigned some data as integers rather than factors, so we will tell it which ones are factors
Data$religious <- as.factor(Data$religious)
Data$education <- as.factor(Data$education)
Data$occupation <- as.factor(Data$occupation)
Data$rate <- as.factor(Data$rate)
The Factors present in the data set and their levels.
levels(Data$sex)
## [1] "female" "male"
levels(Data$child)
## [1] "no" "yes"
levels(Data$religious)
## [1] "1" "2" "3" "4" "5"
levels(Data$education)
## [1] "9" "12" "14" "16" "17" "18" "20"
levels(Data$occupation)
## [1] "1" "2" "3" "4" "5" "6" "7"
levels(Data$rate)
## [1] "1" "2" "3" "4" "5"
The Continuous variables present are the following
summary(Data$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.50 27.00 32.00 32.49 37.00 57.00
summary(Data$ym)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.125 4.000 7.000 8.178 15.000 15.000
summary(Data$nbaffairs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 1.456 0.000 12.000
The response variable that will be used is nbaffairs. The number of affairs the subject had over the past year.
Although these steps are simple, they are an important part of setting up the higher level analysis. We can see the summary of the data as well as the first 10 rows of data.
summary(Data)
## sex age ym child religious
## female:315 Min. :17.50 Min. : 0.125 no :171 1: 48
## male :286 1st Qu.:27.00 1st Qu.: 4.000 yes:430 2:164
## Median :32.00 Median : 7.000 3:129
## Mean :32.49 Mean : 8.178 4:190
## 3rd Qu.:37.00 3rd Qu.:15.000 5: 70
## Max. :57.00 Max. :15.000
##
## education occupation rate nbaffairs
## 9 : 7 1:113 1: 16 Min. : 0.000
## 12: 44 2: 13 2: 66 1st Qu.: 0.000
## 14:154 3: 47 3: 93 Median : 0.000
## 16:115 4: 68 4:194 Mean : 1.456
## 17: 89 5:204 5:232 3rd Qu.: 0.000
## 18:112 6:143 Max. :12.000
## 20: 80 7: 13
head(Data, n=10)
## sex age ym child religious education occupation rate nbaffairs
## 1 male 37 10.00 no 3 18 7 4 0
## 2 female 27 4.00 no 4 14 6 4 0
## 3 female 32 15.00 yes 1 12 1 4 0
## 4 male 57 15.00 yes 5 18 6 5 0
## 5 male 22 0.75 no 2 17 6 3 0
## 6 female 32 1.50 no 2 17 5 5 0
## 7 female 22 0.75 no 2 12 1 3 0
## 8 male 57 15.00 yes 2 14 4 4 0
## 9 female 32 15.00 yes 4 16 1 2 0
## 10 male 22 1.50 no 4 14 4 5 0
After we have looked at the data, we will formulate our hypothesis.
We see that the intended response variable is the nbaffairs, in other words our null hypothesis should be as follows: The average value of affairs a married person has had in the past year is the same for all groups of the independent varaible.
We will choose rate as the independent variable.
In this case the experiments have already been run, and there is not data stating how or why it was done. As such we cannot comment on the design.
Randomized design is used to control for effects from extraneous variables, or variables that have an effect but are not being studied. The assumption is that on average these variables will affect the response variable equally so any significant differences should come from the independent variable(s) being tested.
Again, since we weren’t present for the collection of this data we cannot assure that this data was collected randomly. We will however be sub-setting and ordering the data randomly, this will help provide some randomization.
Replication is an important part of experimental design, especially for validating results. It is not only more convincing, but more significant if the same parameters on an experiment are run several times, and the dependent results are similar. In this case however it would only be possible to have replications if two subjects answered equally to all the questions (Independent variables).
Blocking is used to eliminate the influence of extraneous variables, in this case we will be using the child factor to block.
We will now begin some analysis of the data, to better understand it.
hist(Data$nbaffairs)
As we can see, the majority of responses were 0. Now lets try to split it up by the levels of our factor.
boxplot(Data$nbaffairs~ Data$rate, main = "Boxplot of Affairs", xlab = "Marriage Rating")
It does seem that the independent variable has an effect on the response.
To begin testing we will need to select a type I error α and a type II error β. As well as this we We will need to investigate the effect size of the main effect. Using Cohen’s d we find that the effect is very large.
Using the following parameters we use G*Power under the F-test family.
Effect size: .99 (large)
α: 0.05
β: 0.10
Number of Groups: 5
We find the necessary N to be 25, due to the large effect size. We sample Data randomly 25 times and name our sample Data2
Data2 <- Data[sample(601, 25, replace = T),]
Full<- aov(Data$nbaffairs~ Data$rate+Data$child)
summary(Full)
## Df Sum Sq Mean Sq F value Pr(>F)
## Data$rate 4 626 156.42 15.808 2.64e-12 ***
## Data$child 1 16 15.90 1.607 0.205
## Residuals 595 5887 9.89
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
qqnorm(residuals(Full))
qqline(residuals(Full))
Sample<- aov(Data2$nbaffairs~ Data2$rate+Data2$child)
summary(Sample)
## Df Sum Sq Mean Sq F value Pr(>F)
## Data2$rate 4 135.64 33.91 15.477 8.7e-06 ***
## Data2$child 1 0.10 0.10 0.043 0.837
## Residuals 19 41.63 2.19
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
qqnorm(residuals(Sample))
qqline(residuals(Sample))
We see that the results for both the Full and Samples sets are similar. However we see that the residuals of the ANOVA test are not normal. Lets run some more diagnostic plots.
par(mfrow=c(2,2))
plot(Full)
Looking at the plot we can draw the following conclusions.
Residuals vs Fitted: Shows possible non-linear pattern in the residuals.
Normal QQ Plot: Residuals of the linear regression are not normal.
Scale – Location: Division line is somewhat horizontal. Showing possible homoscedasticity.
Residuals vs Leverage: No proof of influential outliers.
At this point it is typical to attempt alternative hypothesis testing techniques. We will try two others.
After conducting an ANOVA it is a good idea to perform a Tukey test even if the diagnostic plots are ideal. This test will analyze the difference between the means of the levels and return a CI for this estimate. As such if the result contains the value of 0 you usually reject that there is a difference in the means.
par(mfrow=c(1,1))
CI <- TukeyHSD(x=Full, conf.level=0.95)
CI
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Data$nbaffairs ~ Data$rate + Data$child)
##
## $`Data$rate`
## diff lwr upr p adj
## 2-1 0.04924242 -2.349169 2.4476535 0.9999977
## 3-1 -2.44489247 -4.774380 -0.1154047 0.0342271
## 4-1 -2.56572165 -4.804430 -0.3270137 0.0154155
## 5-1 -3.15517241 -5.379868 -0.9304772 0.0010932
## 3-2 -2.49413490 -3.879403 -1.1088670 0.0000107
## 4-2 -2.61496407 -3.841450 -1.3884783 0.0000001
## 5-2 -3.20441484 -4.405132 -2.0036974 0.0000000
## 4-3 -0.12082918 -1.206373 0.9647144 0.9981274
## 5-3 -0.71027994 -1.766623 0.3460628 0.3516310
## 5-4 -0.58945076 -1.426804 0.2479023 0.3045381
##
## $`Data$child`
## diff lwr upr p adj
## yes-no 0.3499589 -0.2085683 0.9084861 0.2189713
plot(CI)
In this case we that some levels do show some difference (the P-Value is close to 0) however others do not. This gives us evidence that there is change for some levels, but there is not for others.
P = numeric(10000)
for (i in 1:10000){
Data2 <- Data[sample(601, 50, replace = T),]
Fits = anova(lm(Data2$nbaffairs~ Data2$rate+Data2$child))
p <- Fits$"Pr(>F)"[1]
P[i] <- p # save p-value
}
Due to the fact that some results were being emitted from the results due to warnings with the data fitting, the sample size was doubled to 50.
par(mfrow=c(1,1))
hist(P)
We can see from the Histogram that most results lie below a P value of 0.05
summary(P < .05)
## Mode FALSE TRUE NA's
## logical 5212 4788 0
mean(P)
## [1] 0.1926917
Due to the P-value being above 0.05; In this case we accept the null and
The same tests were run with sample sizes of 100 to assure accurate results, and the results were equivalent.
These results show the importance of not blindly trusting ANOVA results and using alternative tests. We ran the ANOVA and saw there was a significant effect from the IV. However when we look more deeply we saw through the diagnostic plots that the data did not exactly validate the ANOVA assumptions. From there instead of attempting data transforms we used confidence intervals and resampling as alternative tests. From both of these we found possible evidence that there was en effect; however we do end up accepting the Null Hypothesis for both of the alternative tests.
Design and Analysis of Experiments, 8th Edition Douglas C. Montgomery
The data set is a part of the Ecdat R Package, more information at: https://cran.r-project.org/web/packages/Ecdat/Ecdat.pdf
rm(list=ls())
#Read in the selected data set
library(Ecdat)
Data <- Fair
str(Data)
#Correct Data
Data$religious <- as.factor(Data$religious)
Data$education <- as.factor(Data$education)
Data$occupation <- as.factor(Data$occupation)
Data$rate <- as.factor(Data$rate)
#Levels
levels(Data$sex)
levels(Data$child)
levels(Data$religious)
levels(Data$education)
levels(Data$occupation)
levels(Data$rate)
#Cont. Var
summary(Data$age)
summary(Data$ym)
summary(Data$nbaffairs)
#Explore
summary(Data)
head(Data, n=10)
boxplot(Data$nbaffairs~ Data$rate, main = "Boxplot of Affairs", xlab = "Marriage Rating")
#G*Power Sample
Data2 <- Data[sample(601, 25, replace = T),]
### ANOVA
Full<- aov(Data$nbaffairs~ Data$rate+Data$child)
summary(Full)
qqnorm(residuals(Full))
qqline(residuals(Full))
Sample<- aov(Data2$nbaffairs~ Data2$rate+Data2$child)
summary(Sample)
qqnorm(residuals(Sample))
qqline(residuals(Sample))
#Diagnostic Plots
par(mfrow=c(2,2))
plot(Full)
#Tukey and CI
par(mfrow=c(2,1))
CI <- TukeyHSD(x=Full, conf.level=0.95)
CI
plot(CI)
### Resampling - 10000 Iterations
P = numeric(10000)
for (i in 1:10000){
Data2 <- Data[sample(601, 50, replace = T),]
Fits = anova(lm(Data2$nbaffairs~ Data2$rate+Data2$child))
p <- Fits$"Pr(>F)"[1]
P[i] <- p
}
par(mfrow=c(1,1))
hist(P)
summary(P < .05)
mean(P)