The data used in this recipe was found via the “100+ interesting data sets” webpage. It is part of the R package “Ecdat,” which contains various datasets for econometrics. This particular dataset contains observations on unemployed individuals in the United States in 1993.
# Read in the dataset from the Ecdat package
# install.packages("Ecdat") # uncomment if Ecdat is not installed
# library(Ecdat) # not required when using data(..., package = "Ecdat")
data(Unemployment, package = "Ecdat")
The data being examined has 452 observations of 12 variables.
head(Unemployment)
## duration spell race sex reason search pubemp ftp1 ftp2 ftp3 ftp4
## 1 4 1 white male reentr yes yes 1 0 0 0
## 2 7 0 white male lose no no 1 1 1 1
## 3 1 0 nonwhite male lose no no 0 0 0 0
## 4 1 1 nonwhite male reentr no no 0 1 0 0
## 5 3 1 nonwhite female reentr no no 0 0 0 0
## 6 1 1 white female reentr no no 0 0 0 0
## nobs
## 1 1
## 2 2
## 3 1
## 4 1
## 5 1
## 6 1
summary(Unemployment)
## duration spell race sex
## Min. : 0.00 Min. :0.0000 nonwhite:117 male :242
## 1st Qu.: 4.00 1st Qu.:0.0000 white :335 female:210
## Median : 10.00 Median :1.0000
## Mean : 18.51 Mean :0.5664
## 3rd Qu.: 22.25 3rd Qu.:1.0000
## Max. :117.00 Max. :1.0000
## reason search pubemp ftp1 ftp2
## new : 41 no :309 no :360 Min. :0.0000 Min. :0.0000
## lose :171 yes:143 yes: 92 1st Qu.:0.0000 1st Qu.:0.0000
## leave : 92 Median :1.0000 Median :0.0000
## reentr:148 Mean :0.6726 Mean :0.4336
## 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000
## ftp3 ftp4 nobs
## Min. :0.000 Min. :0.0000 Min. :1.000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:1.000
## Median :0.000 Median :0.0000 Median :1.000
## Mean :0.354 Mean :0.3053 Mean :1.788
## 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:2.000
## Max. :1.000 Max. :1.0000 Max. :4.000
# As seen below, there are 452 observations of 12 variables: 5 factors and 7 integer (numeric) variables.
str(Unemployment)
## 'data.frame': 452 obs. of 12 variables:
## $ duration: int 4 7 1 1 3 1 65 4 113 9 ...
## $ spell : int 1 0 0 1 1 1 0 0 0 1 ...
## $ race : Factor w/ 2 levels "nonwhite","white": 2 2 1 1 1 2 2 2 2 2 ...
## $ sex : Factor w/ 2 levels "male","female": 1 1 1 1 2 2 1 2 2 1 ...
## $ reason : Factor w/ 4 levels "new","lose","leave",..: 4 2 2 4 4 4 2 4 4 3 ...
## $ search : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 2 1 1 2 ...
## $ pubemp : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 2 1 1 1 ...
## $ ftp1 : int 1 1 0 0 0 0 1 0 0 1 ...
## $ ftp2 : int 0 1 0 1 0 0 1 0 0 0 ...
## $ ftp3 : int 0 1 0 0 0 0 1 0 0 0 ...
## $ ftp4 : int 0 1 0 0 0 0 1 0 0 1 ...
## $ nobs : int 1 2 1 1 1 1 4 1 3 1 ...
The only continuous variable in this dataset is the duration of unemployment.
The response variable being analyzed in this dataset is the duration of unemployment in weeks. As seen below, unemployment duration ranges from 0 to 117 weeks, with a mean of 18.51 weeks.
summary(Unemployment$duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 4.00 10.00 18.51 22.25 117.00
The data is organized into 12 variables: 5 factors and 7 numeric variables.
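A quick way to verify this split is to tally the class of each column; this sketch uses only base R functions as a sanity check:
# Tally variable classes: expect 5 factor and 7 integer columns
table(sapply(Unemployment, class))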
There was no randomization in this recipe because the data points are based on observations of unemployment among American individuals.
The experiment will use a one-way, 4-level analysis of variance (ANOVA) with a single factor, reason for unemployment. We will test the effect of this factor on the response variable, the duration of unemployment in weeks. The null hypothesis for this analysis - that variation in unemployment duration cannot be explained by anything other than randomization - will be tested.
Analysis of variance is used to analyze the effect of a categorical factor on a continuous dependent variable. A one-way ANOVA was used because we were interested in the differences between the means of the various “treatment levels” - in our case, the reason for unemployment. The factor structure can be confirmed directly, as the sketch below shows.
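# Confirm that the single factor has 4 levels, and inspect the group sizes
levels(Unemployment$reason)
table(Unemployment$reason)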
The experiment was based on observations, and was therefore not randomized.
The experiment had no repeated measures.
There was no blocking in this recipe.
summary(Unemployment)
## duration spell race sex
## Min. : 0.00 Min. :0.0000 nonwhite:117 male :242
## 1st Qu.: 4.00 1st Qu.:0.0000 white :335 female:210
## Median : 10.00 Median :1.0000
## Mean : 18.51 Mean :0.5664
## 3rd Qu.: 22.25 3rd Qu.:1.0000
## Max. :117.00 Max. :1.0000
## reason search pubemp ftp1 ftp2
## new : 41 no :309 no :360 Min. :0.0000 Min. :0.0000
## lose :171 yes:143 yes: 92 1st Qu.:0.0000 1st Qu.:0.0000
## leave : 92 Median :1.0000 Median :0.0000
## reentr:148 Mean :0.6726 Mean :0.4336
## 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000
## ftp3 ftp4 nobs
## Min. :0.000 Min. :0.0000 Min. :1.000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:1.000
## Median :0.000 Median :0.0000 Median :1.000
## Mean :0.354 Mean :0.3053 Mean :1.788
## 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:2.000
## Max. :1.000 Max. :1.0000 Max. :4.000
plot(Unemployment$duration ~ Unemployment$reason, xlab = "Reason for Unemployment", ylab = "Duration of Unemployment (in weeks)")
As seen in the plot above, “lose” (those who lost their job) has the greatest mean and appears to have the largest IQR. “New” (those new to the job market) has the lowest median and the smallest IQR.
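These visual impressions can be checked numerically; a quick base R sketch:
# Medians and IQRs by reason, the quantities summarized in the boxplot
with(Unemployment, tapply(duration, reason, median))
with(Unemployment, tapply(duration, reason, IQR))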
The focus of this recipe was on a single-factor, 4-level analysis of variance.
data(Unemployment, package = "Ecdat") # "Unemployment," part of the Ecdat package
# We need the residual standard deviation, sqrt(residual mean square), from the summary for the Monte Carlo simulation
model <- aov(duration ~ reason, data = Unemployment)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## reason 3 2965 988.2 1.865 0.135
## Residuals 448 237378 529.9
The null hypothesis of an analysis of variance states that the variation in duration of unemployment cannot be explained by anything other than randomization. The alternative to this hypothesis would be that there is a factor which can explain the variation in duration. In this case, the p-value is 0.135. This is greater than an alpha level of 0.05, and therefore we must fail to reject the null.
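For completeness, the p-value can be pulled out of the aov summary programmatically rather than read off the table; one way to do this in base R:
# Extract the p-value for the reason factor from the aov summary
pval <- summary(model)[[1]][["Pr(>F)"]][1]
pval # ~0.135, matching the table above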
Tukey’s range test is used alongside an ANOVA to find means that are significantly different from each other by comparing pairs of means: the mean of each treatment level is compared against the mean of every other treatment level.
The null hypothesis for a Tukey test states that there is no difference between the means of a given pair, while the alternative states that there is a significant difference. In this Tukey test, the differences between mean durations were compared for each pair of reasons for unemployment.
model <- aov(duration ~ reason, data = Unemployment)
tukey1 <- TukeyHSD(model)
tukey1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = duration ~ reason, data = Unemployment)
##
## $reason
## diff lwr upr p adj
## lose-new 8.265012 -2.056898 18.586922 0.1663287
## leave-new 5.032874 -6.113205 16.178953 0.6496613
## reentr-new 3.855142 -6.620732 14.331015 0.7783599
## leave-lose -3.232138 -10.906955 4.442679 0.6982993
## reentr-lose -4.409870 -11.074079 2.254338 0.3215074
## reentr-leave -1.177732 -9.058399 6.702935 0.9805247
plot(tukey1)
The results of the Tukey test show adjusted p-values greater than an alpha of 0.05 for all pairs of reasons for unemployment. This leads us to fail to reject the null for every pair: we cannot conclude that the mean durations differ between any pair of reasons. The plot of the Tukey test supports this claim, as the 95% confidence interval for each pair’s difference in means includes 0. Therefore, “no difference” cannot be excluded from the possibilities.
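The adjusted p-values can also be pulled straight from the TukeyHSD object; a small sketch:
# tukey1$reason is a matrix with columns diff, lwr, upr, and "p adj"
round(tukey1$reason[, "p adj"], 3)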
Assumptions for ANOVA include that the data (more precisely, the model residuals) are normally distributed.
shapiro.test(Unemployment$duration)
##
## Shapiro-Wilk normality test
##
## data: Unemployment$duration
## W = 0.7233, p-value < 2.2e-16
The null of a Shapiro-Wilk test states that the data presented is normally distributed. Unfortunately, the p-value is less than an alpha of 0.05, and therefore, we must reject the null hypothesis. This suggests, instead, that the duration data is not normally distributed.
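Since the normality assumption strictly applies to the model residuals rather than the raw response, a test on the residuals is the more direct check; given how skewed duration is, the same conclusion is expected:
# Shapiro-Wilk test on the residuals of the fitted ANOVA model
shapiro.test(residuals(model))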
par(mfrow = c(1,1))
qqnorm(residuals(model))
qqline(residuals(model))
The Q-Q normality plot of the residuals for this model does not look normal. If the residuals were normal, the points would lie along the reference line, and here they clearly depart from it. This suggests that parametric assumptions may not hold for this data.
plot(fitted(model),residuals(model))
A plot of fitted values versus residuals should appear roughly symmetric about a residual of 0 across the full range of fitted values. The points for this model are not symmetric about 0, which is further evidence against the model assumptions.
This recipe will outline two resampling methods - Monte Carlo simulation and bootstrapping.
One of the assumptions of the Monte Carlo technique is a known normal distribution. Unfortunately, as seen in the previous section, normality does not appear to hold for this data.
with(Unemployment, tapply(duration, reason, mean)) # group means
## new lose leave reentr
## 13.09756 21.36257 18.13043 16.95270
with(Unemployment, tapply(duration, reason, var)) # group variances
## new lose leave reentr
## 301.9902 627.3736 457.0377 524.1814
with(Unemployment, tapply(duration, reason, length)) # group sizes
## new lose leave reentr
## 41 171 92 148
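Rather than hand-copying the residual mean square out of summary(model), the pooled residual standard deviation can be computed from the fitted model; a sketch (deviance() on an aov fit returns the residual sum of squares):
# Pooled residual sd from the model; should agree with sqrt(529.9) up to rounding
model <- aov(duration ~ reason, data = Unemployment)
sqrt(deviance(model) / df.residual(model))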
meanstar = mean(Unemployment$duration)
# sdstar is the residual standard deviation: sqrt(residual mean square) from summary(model), above
sdstar = sqrt(529.9)
simblock = Unemployment$reason
R = 10000 # 10,000 iterations
Fstar = numeric(R)
for (i in 1:R) {
# Under H0 every observation shares the same normal distribution
# (mean = grand mean, sd = pooled residual sd), so this loop is a Monte
# Carlo simulation from a known normal; 4 draws of 113 (= 452/4) are
# paired with the original reason labels
groupA = rnorm(113, mean=meanstar, sd=sdstar)
groupB = rnorm(113, mean=meanstar, sd=sdstar)
groupC = rnorm(113, mean=meanstar, sd=sdstar)
groupD = rnorm(113, mean=meanstar, sd=sdstar)
simyield = c(groupA,groupB,groupC,groupD)
simdata = data.frame(simyield,simblock)
Fstar[i] = oneway.test(simyield~simblock, var.equal=T, data=simdata)$statistic
}
# Compute the empirical distribution of F under H0 and display it as a histogram
hist(Fstar, ylim=c(0,1), xlim=c(0, 8), prob=T)
# x values from 0.25 to 6 in 0.25 increments, for plotting the analytic F overlay
x=seq(.25,6,.25)
# Overlay the analytic F distribution as red dots (degrees of freedom known)
# df1 = 4 levels - 1 = 3; df2 = 452 observations - 4 group means = 448
# Notice the quality of the fit of the Monte Carlo simulation to the empirical distribution of F
points(x, y = df(x, 3, 448), type = "b", col = "red")
The fit of the analytic F distribution (shown as red dots) to the empirical distribution is relatively good. The fit seems better in the right tail of the graph than near the origin.
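One way to quantify the fit in the tail: if the simulated null distribution matches F(3, 448), the fraction of simulated F* values beyond the analytic 0.95 quantile should be close to 0.05.
# Tail calibration check: expect a value near 0.05
mean(Fstar >= qf(0.95, 3, 448))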
# now the bootstrap version of ANOVA
#
rm(list=ls()) # only if you want to clear environment and start over
data(Unemployment, package="Ecdat")
# Center each group at zero by subtracting its own group mean (bootstrap under H0)
meanstar = with(Unemployment, tapply(duration, reason, mean))
grpA = Unemployment$duration[Unemployment$reason == "new"] - meanstar["new"]
grpB = Unemployment$duration[Unemployment$reason == "lose"] - meanstar["lose"]
grpC = Unemployment$duration[Unemployment$reason == "leave"] - meanstar["leave"]
grpD = Unemployment$duration[Unemployment$reason == "reentr"] - meanstar["reentr"]
R = 10000
Fstar = numeric(R)
simblock = Unemployment$reason
for (i in 1:R) {
# resample 113 (= 452/4) centered residuals per group, with replacement
groupA = sample(grpA, size=113, replace=T)
groupB = sample(grpB, size=113, replace=T)
groupC = sample(grpC, size=113, replace=T)
groupD = sample(grpD, size=113, replace=T)
simyield = c(groupA,groupB,groupC,groupD)
simdata = data.frame(simyield,simblock)
Fstar[i] = oneway.test(simyield~simblock, var.equal=T, data=simdata)$statistic
}
# Now generate a similar plot, comparing bootstrapped distribution to the known F distribution
hist(Fstar, ylim=c(0,1), xlim=c(0,8), prob=T)
x=seq(.25,6,.25)
points(x, y = df(x, 3, 448), type = "b", col = "red") # the analytic F(3, 448) distribution
The fit of the bootstrapped distribution to the known F distribution is similar to the Monte Carlo fit: the values fit better in the tail of the graph than at the initial points near 0.
# The observed F statistic from the actual data
print(realFstar <- oneway.test(duration ~ reason, var.equal = TRUE, data = Unemployment)$statistic)
## F
## 1.865083
# Bootstrap p-value: fraction of F* at least as large as the observed F
mean(Fstar >= realFstar)
## [1] 0.1351
# estimate alpha = 0.05 value of the test statistic, given F* from empirical distribution above
# generate quantiles of the analytic F distribution
qf(.95,3,449)
## [1] 2.624769
# the alpha = 0.05 critical value of the bootstrapped F* distribution
quantile(Fstar,.95)
## 95%
## 2.592683
Based on the model adequacy testing, it appears that the data is not normally distributed and that a standard ANOVA may not be the best fit for this data.
A Kruskal-Wallis test can be used to evaluate whether groups’ distributions are identical, without the assumption that the data is normally distributed.
kruskal.test(duration~reason,data=Unemployment)
##
## Kruskal-Wallis rank sum test
##
## data: duration by reason
## Kruskal-Wallis chi-squared = 3.7148, df = 3, p-value = 0.294
With the Kruskal-Wallis test, we can only assess whether the distributions of the various “reason” groups differ. The p-value of this Kruskal-Wallis test is greater than an alpha of 0.05. Therefore, we fail to reject the null - that variation in unemployment duration cannot be explained by anything other than randomization.