Recipe 6: Analysis of Covariance"

Recipes for the Design of Experiments

Unemployment

Jane Braun

RPI

November 6th Version 1.0

1. Setting

System under test

The data used in this recipe was found using the “100+ interesting data sets” webpage. It is part of the R dataset package called “Ecdat,” which various datasets for the purpose of econometrics.

This dataset in particular has data of unemployed individuals in the United States in 1993.

# Read in dataset from Ecdat package
#install.packages("Ecdat")
# library(Ecdat)
data(Unemployment, package = "Ecdat")

Factors and Levels

The data being examined has 452 observations of 12 variables.

head(Unemployment)
##   duration spell     race    sex reason search pubemp ftp1 ftp2 ftp3 ftp4
## 1        4     1    white   male reentr    yes    yes    1    0    0    0
## 2        7     0    white   male   lose     no     no    1    1    1    1
## 3        1     0 nonwhite   male   lose     no     no    0    0    0    0
## 4        1     1 nonwhite   male reentr     no     no    0    1    0    0
## 5        3     1 nonwhite female reentr     no     no    0    0    0    0
## 6        1     1    white female reentr     no     no    0    0    0    0
##   nobs
## 1    1
## 2    2
## 3    1
## 4    1
## 5    1
## 6    1
summary(Unemployment)
##     duration          spell              race         sex     
##  Min.   :  0.00   Min.   :0.0000   nonwhite:117   male  :242  
##  1st Qu.:  4.00   1st Qu.:0.0000   white   :335   female:210  
##  Median : 10.00   Median :1.0000                              
##  Mean   : 18.51   Mean   :0.5664                              
##  3rd Qu.: 22.25   3rd Qu.:1.0000                              
##  Max.   :117.00   Max.   :1.0000                              
##     reason    search    pubemp         ftp1             ftp2       
##  new   : 41   no :309   no :360   Min.   :0.0000   Min.   :0.0000  
##  lose  :171   yes:143   yes: 92   1st Qu.:0.0000   1st Qu.:0.0000  
##  leave : 92                       Median :1.0000   Median :0.0000  
##  reentr:148                       Mean   :0.6726   Mean   :0.4336  
##                                   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                                   Max.   :1.0000   Max.   :1.0000  
##       ftp3            ftp4             nobs      
##  Min.   :0.000   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:1.000  
##  Median :0.000   Median :0.0000   Median :1.000  
##  Mean   :0.354   Mean   :0.3053   Mean   :1.788  
##  3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:2.000  
##  Max.   :1.000   Max.   :1.0000   Max.   :4.000
# As seen below, there are 452 observations of 12 variables, including: 5 factors and 7 numeric variables.
str(Unemployment)
## 'data.frame':    452 obs. of  12 variables:
##  $ duration: int  4 7 1 1 3 1 65 4 113 9 ...
##  $ spell   : int  1 0 0 1 1 1 0 0 0 1 ...
##  $ race    : Factor w/ 2 levels "nonwhite","white": 2 2 1 1 1 2 2 2 2 2 ...
##  $ sex     : Factor w/ 2 levels "male","female": 1 1 1 1 2 2 1 2 2 1 ...
##  $ reason  : Factor w/ 4 levels "new","lose","leave",..: 4 2 2 4 4 4 2 4 4 3 ...
##  $ search  : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 2 1 1 2 ...
##  $ pubemp  : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 2 1 1 1 ...
##  $ ftp1    : int  1 1 0 0 0 0 1 0 0 1 ...
##  $ ftp2    : int  0 1 0 1 0 0 1 0 0 0 ...
##  $ ftp3    : int  0 1 0 0 0 0 1 0 0 0 ...
##  $ ftp4    : int  0 1 0 0 0 0 1 0 0 1 ...
##  $ nobs    : int  1 2 1 1 1 1 4 1 3 1 ...

Continuous variables

The only continuous variable in this dataset is the duration of unemployment.

Response variables

The response variable being analyzed in this dataset is the duration of unemployment in weeks. As seen below, the range of unemployment is between 0 and 117 weeks, with a mean of 18.51 weeks.

summary(Unemployment$duration)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    4.00   10.00   18.51   22.25  117.00

The Data

The data is organized into 12 variables, including: 5 factors and 7 numeric variables.

Randomization

There was no randomization in this recipe because of the data points are based on observations about unemployment among American individuals.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

The experiment will use a one-way, 4-level analysis of variance (ANOVA). It will use a single factor - reason for unemployment - which has 4 levels. We will be testing the effect of this factor on the response variable - duration of unemployment in weeks. The null hypothesis for this analysis - that variation in unemployment duration cannot be explained by anything other than randomization - will be tested.

What is the rationale for this design?

Analysis of varariance is used to analyze the effect of a categorical factor on a contiuous dependent variable. A one-way ANOVA was used because we were interested in the difference between the means of the various “treatment levels” - in our case, the reason for leaving.

Randomize:

The experiment was based on observations, and was therefore not randomized.

Replicate:

The experiment had no repeated measures.

Block:

There was no blocking in this recipe.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

summary(Unemployment)
##     duration          spell              race         sex     
##  Min.   :  0.00   Min.   :0.0000   nonwhite:117   male  :242  
##  1st Qu.:  4.00   1st Qu.:0.0000   white   :335   female:210  
##  Median : 10.00   Median :1.0000                              
##  Mean   : 18.51   Mean   :0.5664                              
##  3rd Qu.: 22.25   3rd Qu.:1.0000                              
##  Max.   :117.00   Max.   :1.0000                              
##     reason    search    pubemp         ftp1             ftp2       
##  new   : 41   no :309   no :360   Min.   :0.0000   Min.   :0.0000  
##  lose  :171   yes:143   yes: 92   1st Qu.:0.0000   1st Qu.:0.0000  
##  leave : 92                       Median :1.0000   Median :0.0000  
##  reentr:148                       Mean   :0.6726   Mean   :0.4336  
##                                   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                                   Max.   :1.0000   Max.   :1.0000  
##       ftp3            ftp4             nobs      
##  Min.   :0.000   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:1.000  
##  Median :0.000   Median :0.0000   Median :1.000  
##  Mean   :0.354   Mean   :0.3053   Mean   :1.788  
##  3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:2.000  
##  Max.   :1.000   Max.   :1.0000   Max.   :4.000
plot(Unemployment$duration ~ Unemployment$reason, xlab = "Reason for Unemployment", ylab = "Duration of Unemployment (in weeks)")    

As seen in the plot above, “lose”, or lost their job has the greatest mean, and what also appears to have the largest IQR. “New,” or new to the job markets, has the lowest median and also the lowest IQR.

Testing

Analysis of Variance

The focus of this recipe was on a single-factor, 4-level analysis of variance.

data(Unemployment, package = "Ecdat") # "Unemployment," part of the Ecdat package
# We need the sd(residual error) from the summary for the Monte Carlo simulation
model <- aov(duration~reason,data=Unemployment)
summary(model)   
##              Df Sum Sq Mean Sq F value Pr(>F)
## reason        3   2965   988.2   1.865  0.135
## Residuals   448 237378   529.9

The null hypothesis of an analysis of variance states that the variation in duration of unemployment cannot be explained by anything other than randomization. The alternative to this hypothesis would be that there is a factor which can explain the variation in duration. In this case, the p-value is 0.135. This is greater than an alpha level of 0.05, and therefore we must fail to reject the null.

Tukey’s Range Tests

Tukey’s range tests are used alongside ANOVAs in order to find means that are significantly different from each other, by compairing pairs of means. This test compares the mean of each treatment level to the mean of the other treatment levels.

The null hypothesis for a Tukey test states that there is no difference between the means of a pair of data, while the alternative states that there is a significant difference between the means. In this Tukey test, the difference between the means of pairs of reasons for unemployment were compared.

model <- aov(duration~reason,data=Unemployment)
tukey1 <- TukeyHSD(model)
tukey1
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = duration ~ reason, data = Unemployment)
## 
## $reason
##                   diff        lwr       upr     p adj
## lose-new      8.265012  -2.056898 18.586922 0.1663287
## leave-new     5.032874  -6.113205 16.178953 0.6496613
## reentr-new    3.855142  -6.620732 14.331015 0.7783599
## leave-lose   -3.232138 -10.906955  4.442679 0.6982993
## reentr-lose  -4.409870 -11.074079  2.254338 0.3215074
## reentr-leave -1.177732  -9.058399  6.702935 0.9805247
plot(tukey1)

The results of the Tukey Test display all p-values greater than an alpha, 0.05 between all pairs of reasons for unemployment. This leads us to fail to reject the null. Therefore, there is some probability that there is no significant difference between the mean durations of the pairs of reasons. The plot of the Tukey also supports this claim. The 95% confidence intervals of the pair’s difference still include 0 as a difference in mean levels. Therefore, “no difference” can not be not included from the possibilities.

Diagnostics/Model Adequacy Checking

Assumptions for ANOVA tests include that the data is normally distributed.

shapiro.test(Unemployment$duration)
## 
##  Shapiro-Wilk normality test
## 
## data:  Unemployment$duration
## W = 0.7233, p-value < 2.2e-16

The null of a Shapiro-Wilk test states that the data presented is normally distributed. Unfortunately, the p-value is less than an alpha of 0.05, and therefore, we must reject the null hypothesis. This suggests, instead, that the duration data is not normally distributed.

par(mfrow = c(1,1))
qqnorm(residuals(model))
qqline(residuals(model))

The Q-Q Normality Plots of the residuals for this model do not appear to be normal. When normal, the data points would lay linearly, and in this case, they do not seem to lie along the line. This suggests that the data may not be parametric.

plot(fitted(model),residuals(model))

The plots of the fitted values versus the residual values are supposed to appear symmetric over a residual of 0. The datapoints of the model do not appear to be symmetric over the length of the dynamic range, and are not symmetric over 0.

Resampling techniques

This recipe will outline two methods for resampling - the Monte Carlo Simulation and Bootstrapping.

Monte Carlo Simulation

One of the assumptions for this technique is that there is a known normal distribution. Unfortunately, this is not known - as seen in the previous section.

with(Unemployment,tapply(duration,reason,mean))
##      new     lose    leave   reentr 
## 13.09756 21.36257 18.13043 16.95270
with(Unemployment,tapply(duration,reason,var))
##      new     lose    leave   reentr 
## 301.9902 627.3736 457.0377 524.1814
with(Unemployment,tapply(duration,reason,length))
##    new   lose  leave reentr 
##     41    171     92    148
meanstar = mean(Unemployment$duration)
# sd is sqrt(residual error). Get this value from summary(), above.
sdstar = sqrt(529.9)
simblock = Unemployment$reason
R = 10000  # 10,000 iterations
Fstar = numeric(R)
for (i in 1:R) {
  # make residual normally distributed with known pooled variance
  # this loop is a Monte Carlo simulation since distribution is known normal
  groupA = rnorm(113, mean=meanstar, sd=sdstar)
  groupB = rnorm(113, mean=meanstar, sd=sdstar)
  groupC = rnorm(113, mean=meanstar, sd=sdstar)
  groupD = rnorm(113, mean=meanstar, sd=sdstar)
  simyield = c(groupA,groupB,groupC,groupD)
  simdata = data.frame(simyield,simblock)
  Fstar[i] = oneway.test(simyield~simblock, var.equal=T, data=simdata)$statistic
}

# computes empirical distribution of F under H0 and display as histogram
hist(Fstar, ylim=c(0,1), xlim=c(0, 8), prob=T)
# Plot F distribution from .25 to 6 in .5 increments
x=seq(.25,6,.25)
# Overlay the analytic F distribution as red dots (degrees of freedom known)
# 3 = 4 levels-1 mean; 449= 452 (n)-3 means
# Notice the quality of the fit of the Monte Carlo simulation to the empirical distribution of F
points(x,y=df(x,3,449),type="b",col="red")

The fit for the analytic F distribution (shows as red dots) versus the empirical distribution is relatively good. The fit seems to be better in the right tail of the graph, rather than near the origin.

# now the bootstrap version of ANOVA
#
rm(list=ls()) # only if you want to clear environment and start over
data(Unemployment, package="Ecdat")
meanstar = with(Unemployment, tapply(duration,reason,mean))
grpA = Unemployment$duration[Unemployment$reason=="new"] - meanstar[1]
grpB = Unemployment$duration[Unemployment$reason=="lose"] - meanstar[2]
grpC = Unemployment$duration[Unemployment$reason=="leave"] - meanstar[3]
grpD = Unemployment$duration[Unemployment$reason=="reentr"] - meanstar[4]

R = 10000
Fstar = numeric(R)
simblock = Unemployment$reason
for (i in 1:R) {
  groupA = sample(grpA, size=113, replace=T)
  groupB = sample(grpB, size=113, replace=T)
  groupC = sample(grpC, size=113, replace=T)
  groupD = sample(grpD, size=113, replace=T)

  simyield = c(groupA,groupB,groupC,groupD)
  simdata = data.frame(simyield,simblock)
  Fstar[i] = oneway.test(simyield~simblock, var.equal=T, data=simdata)$statistic
}

# Now generate a similar plot, comparing bootstrapped distribution to the known F distribution
hist(Fstar, ylim=c(0,1), xlim=c(0,8), prob=T)
x=seq(.25,6,.25)
points(x,y=df(x,3,449),type="b",col="red")  # The F distribution   

The fit of the bootstrappd distribution compared to the known distribution is relatively similar to the first fit. The values are better fits on the tail of the graph, versus on the initial points near 0.

# Here is the alpha level from the bootstrapped distribution
print(realFstar<-oneway.test(duration~reason, var.equal=T, data=Unemployment)$statistic)
##        F 
## 1.865083
mean(Fstar>=realFstar)
## [1] 0.1351
# estimate alpha = 0.05 value of the test statistic, given F* from empirical distribution above
# generate quantiles of the analytic F distribution
qf(.95,3,449)
## [1] 2.624769
# alpha = 0.05 value of bootstrapped F* is output here
quantile(Fstar,.95)
##      95% 
## 2.592683

4. Contingencies

Based on the model adequacy testing, it appears that the data is not normally distributed, and that this may not be the best fit of the data.

A Kruskal-Wallis test can be used to evaluate whether groups’ distributions are identical, without the assumption that the data is normally distributed.

kruskal.test(duration~reason,data=Unemployment)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  duration by reason
## Kruskal-Wallis chi-squared = 3.7148, df = 3, p-value = 0.294

With the kruskal-wallis test, we could only analyze how the various “reason” groups’ mean varied. The p-value of this Kruskal-Wallis test is greater than an alpha of 0.05. Therefore, you must fail to reject the null - that variation in assets cannot be explained by anything other than randomization.