This is an R Markdown document. Markdown is a simple formatting syntax for authoring web pages (click the MD toolbar button for help on Markdown).
When you click the Knit HTML button a web page will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
as of August 28, 2014, superceding the version of August 24. Always use the most recent version.
In this study that uses data collected from “The impact of unemployment insurance benefit levels on recipiency” (McCall, B.P. 1995), a single-factor, multi-level experiment is performed to see if the reason for a blue collar worker’s job loss has a statistically significant effect on the state unemployment rate (in %). In the dataset, the factor ‘joblost’ refers to the reason for a blue collar worker’s job loss. Additionally, this analysis’ response variable is referred to in the dataset as ‘stateur’, which denotes the state unemployment rate in the analysis. [1]
#Install and load the 'Ecdat' dataset into R.
rm(list=ls())
install.packages("Ecdat", repos='http://cran.us.r-project.org')
## package 'Ecdat' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\howelb\AppData\Local\Temp\RtmpQZInif\downloaded_packages
library("Ecdat",lib.loc="C:/Program Files/R/R-3.1.1/library")
## Warning: package 'Ecdat' was built under R version 3.1.2
## Loading required package: Ecfun
## Warning: package 'Ecfun' was built under R version 3.1.2
##
## Attaching package: 'Ecdat'
##
## The following object is masked from 'package:datasets':
##
## Orange
#Assign a variable, 'benefits', to the complete dataframe, 'Benefits'. Then, determine the "head" and "tail" of the dataset, 'benefits'.
benefits<-Benefits
head(benefits)
## stateur statemb state age tenure joblost nwhite school12 sex
## 1 4.5 167 42 49 21 other no no male
## 2 10.5 251 55 26 2 slack_work no no male
## 3 7.2 260 21 40 19 other no yes female
## 4 5.8 245 56 51 17 slack_work yes no female
## 5 6.5 125 58 33 1 slack_work no yes male
## 6 7.5 188 11 51 3 other no no male
## bluecol smsa married dkids dykids yrdispl rr head ui
## 1 yes yes no no no 7 0.2906 yes yes
## 2 yes yes no yes yes 10 0.5202 yes no
## 3 yes yes yes no no 10 0.4325 yes yes
## 4 yes yes yes no no 10 0.5000 no yes
## 5 yes yes yes yes yes 4 0.3906 yes no
## 6 yes yes yes no no 10 0.4822 yes yes
tail(benefits)
## stateur statemb state age tenure joblost nwhite school12 sex
## 4872 8.6 125 59 21 2 slack_work no no male
## 4873 7.4 168 33 35 12 slack_work yes yes female
## 4874 7.0 189 74 39 5 slack_work no no male
## 4875 8.0 168 74 59 13 slack_work no no male
## 4876 6.3 191 41 33 11 slack_work no no male
## 4877 9.3 188 94 27 2 other no no male
## bluecol smsa married dkids dykids yrdispl rr head ui
## 4872 yes yes no no no 2 0.3125 yes no
## 4873 yes yes no yes no 6 0.4636 yes yes
## 4874 yes yes yes no no 4 0.3937 yes yes
## 4875 yes yes yes yes no 2 0.3724 yes yes
## 4876 yes yes yes yes yes 3 0.5000 yes no
## 4877 yes yes yes yes yes 7 0.4954 yes yes
This analysis considers one single factor (that has multiple levels), which is denoted in the dataset as ‘joblost’. In the original dataset “Benefits”, the factor ‘joblost’ is characterized as being a factor (or, in other words, a categorical variable) with four specific categorical levels. This factor was selected intuitively, since this analysis aims to determine whether or not the reason for a blue collar worker’s job loss has a statistically significant effect on the state unemployment rate (in %).
#Display the summary statistics of "Benefits".
summary(benefits)
## stateur statemb state age
## Min. : 2.20 Min. : 84 Min. :11.0 Min. :20.0
## 1st Qu.: 5.70 1st Qu.:150 1st Qu.:31.0 1st Qu.:28.0
## Median : 7.20 Median :177 Median :55.0 Median :34.0
## Mean : 7.51 Mean :181 Mean :52.8 Mean :36.1
## 3rd Qu.: 9.00 3rd Qu.:205 3rd Qu.:74.0 3rd Qu.:43.0
## Max. :18.00 Max. :293 Max. :95.0 Max. :61.0
## tenure joblost nwhite school12
## Min. : 1.00 other :1976 no :4159 no :3950
## 1st Qu.: 2.00 position_abolished: 402 yes: 718 yes: 927
## Median : 3.00 seasonal_job_ended: 177
## Mean : 5.66 slack_work :2322
## 3rd Qu.: 7.00
## Max. :41.00
## sex bluecol smsa married dkids dykids
## female:1150 yes:4877 no :1694 no :1791 no :2508 no :3796
## male :3727 yes:3183 yes:3086 yes:2369 yes:1081
##
##
##
##
## yrdispl rr head ui
## Min. : 1.0 Min. :0.0386 no :1558 no :1542
## 1st Qu.: 2.0 1st Qu.:0.3752 yes:3319 yes:3335
## Median : 5.0 Median :0.4904
## Mean : 5.2 Mean :0.4384
## 3rd Qu.: 8.0 3rd Qu.:0.5116
## Max. :10.0 Max. :0.6912
#Display the names found in "Benefits".
names(benefits)
## [1] "stateur" "statemb" "state" "age" "tenure" "joblost"
## [7] "nwhite" "school12" "sex" "bluecol" "smsa" "married"
## [13] "dkids" "dykids" "yrdispl" "rr" "head" "ui"
#Display the structure of "Benefits".
str(benefits)
## 'data.frame': 4877 obs. of 18 variables:
## $ stateur : num 4.5 10.5 7.2 5.8 6.5 7.5 5.8 5.8 7.7 6 ...
## $ statemb : int 167 251 260 245 125 188 166 214 213 187 ...
## $ state : int 42 55 21 56 58 11 93 84 84 33 ...
## $ age : int 49 26 40 51 33 51 30 26 54 31 ...
## $ tenure : int 21 2 19 17 1 3 5 3 20 1 ...
## $ joblost : Factor w/ 4 levels "other","position_abolished",..: 1 4 1 4 4 1 2 4 1 4 ...
## $ nwhite : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ school12: Factor w/ 2 levels "no","yes": 1 1 2 1 2 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "female","male": 2 2 1 1 2 2 2 2 2 2 ...
## $ bluecol : Factor w/ 1 level "yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ smsa : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ married : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 2 2 2 2 ...
## $ dkids : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 2 2 1 1 ...
## $ dykids : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 2 2 1 1 ...
## $ yrdispl : int 7 10 10 10 4 10 6 8 6 8 ...
## $ rr : num 0.291 0.52 0.432 0.5 0.391 ...
## $ head : Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 2 2 ...
## $ ui : Factor w/ 2 levels "no","yes": 2 1 2 2 1 2 2 2 1 2 ...
In this dataset, there are a few variables that can be considered to be continuous variables; these variables are the ones which are categorized as being numeric variables. By this standard, the continuous variables that exist in this dataset include ‘stateur’ (which refers to the state unemployment rate (in %)) and ‘rr’(which refers to the replacement rate of blue collar workers). Additionally, five variables in this dataset currently exist as integer variables, including the state maximum benefit level (‘statemb’), the state of residence code (‘state’), the age in years of each blue collar worker (‘age’), the years of tenure that a blue collar worker had in the job that they lost (‘tenure’), and the year of a blue collar worker’s job displacement (‘yrdispl’).
This analysis will consider one response variable, ‘stateur’, which denotes the state unemployment rate in the analysis.
As a whole, this dataset contains a cross-section of information from 1972 pertaining to the unemployment rates of blue collar workers in the United States, collected for the research corresponding to the article entitled “”The impact of unemployment insurance benefit levels on recipiency" (McCall, B.P. 1995) [1]. It includes data pertaining to state unemployment rates and to the reasons for job loss among blue collar workers. It contains 4877 observations of 18 variables, which are defined below:
stateur [state unemployment rate (in %)]
statemb [state maximum benefit level]
state [state of residence code]
age [age in years]
tenure [years of tenure in job lost]
joblost [a factor with four levels (“slack_work”, “position_abolished”, “seasonal_job_ended”, “other”)]
nwhite [non-white ?]
school12 [more than 12 years of school ?]
sex [a factor with two levels (“male”, “female”)]
bluecol [blue collar worker ?]
smsa [lives is smsa ?]
married [married ?]
dkids [has kids ?]
dykids [has young kids (0-5 yrs) ?]
yrdispl [year of job displacement (1982=1,…, 1991=10)]
rr [replacement rate]
head [is head of household ?]
ui [applied for (and received) UI benefits ?]
According to the research publication being considered in this analysis (McCall, B.P. 1995), the observations contained within this dataset were simply the results of taking survey data. Therefore, we can likely make the assumption that this experiment exhibits sufficient randomization.
In this experiment, we are trying to determine whether or not the variation that is observed in the response variable (which corresponds to ‘stateur’ in this analysis) can be explained by the variation existent in the single treatment of the experiment (which correspond to ‘joblost’). Therefore, the null hypothesis that is being tested states that the reason for an individual blue collar worker’s job loss does not have have a statistically significant effect on the state unemployment rate (in %). In carrying out this analysis, we perform an analysis of variance (ANOVA) for the state unemployment rate (‘stateur’) to see if there is a significant difference in the means for this response variable when considering the different reasons for job loss for blue collar workers, which are contained in this dataset. Additionally, in further testing this experiment, the analysis of variance (ANOVA) will be repeated using resampling techniques (namely, Bootstrapping).
The rationale for this design lies primarily in the fact that we’re trying to determine if the reason for a blue collar worker’s job loss has any effect on the state unemployment rate. So, since the state unemployment rate (in %) is a useful and relevant metric to consider when determining the GDP and the economic performance/status of the United States, this design of a single-factor, multi-level experiment (as it corresponds to an analysis of variance) was crafted to see if the reason for a blue collar worker’s job loss has a statistically significant effect on the state unemployment rate. By performing this analysis, we can hope to receive some insight regarding the different factors that come into play that typically result in a given level of the rate of state unemployment.
Since the observations contained within this dataset were simply the results of taking survey data (McCall, B.P. 1995), we can likely make the assumption that this experiment exhibits sufficient randomization.
In this experiment, there are no replicates or repeated measures present.
In designing the experiment in such a way that can be classified as a one-factor, four-level experiment that uses an “originally” categorical variable as its factor, blocking is inherently built into the experiment’s design. The experiment’s single factor, which is denoted as ‘joblost’ in the dataset [referring to the different reasons for a given blue collar worker’s job loss], is represented by the distinct levels of “slack_work”, “position_abolished”, “seasonal_job_ended”, and “other” (these levels ultimately designate the main blocking groups in the experiment).
In beginning to display this data graphically, summary statistics were gathered for the dataset being considered here, “benefits”. Additionally, histograms and boxplots were created to represent the different state unemployment rates.
#Display the summary statistics of "Benefits".
summary(benefits)
## stateur statemb state age
## Min. : 2.20 Min. : 84 Min. :11.0 Min. :20.0
## 1st Qu.: 5.70 1st Qu.:150 1st Qu.:31.0 1st Qu.:28.0
## Median : 7.20 Median :177 Median :55.0 Median :34.0
## Mean : 7.51 Mean :181 Mean :52.8 Mean :36.1
## 3rd Qu.: 9.00 3rd Qu.:205 3rd Qu.:74.0 3rd Qu.:43.0
## Max. :18.00 Max. :293 Max. :95.0 Max. :61.0
## tenure joblost nwhite school12
## Min. : 1.00 other :1976 no :4159 no :3950
## 1st Qu.: 2.00 position_abolished: 402 yes: 718 yes: 927
## Median : 3.00 seasonal_job_ended: 177
## Mean : 5.66 slack_work :2322
## 3rd Qu.: 7.00
## Max. :41.00
## sex bluecol smsa married dkids dykids
## female:1150 yes:4877 no :1694 no :1791 no :2508 no :3796
## male :3727 yes:3183 yes:3086 yes:2369 yes:1081
##
##
##
##
## yrdispl rr head ui
## Min. : 1.0 Min. :0.0386 no :1558 no :1542
## 1st Qu.: 2.0 1st Qu.:0.3752 yes:3319 yes:3335
## Median : 5.0 Median :0.4904
## Mean : 5.2 Mean :0.4384
## 3rd Qu.: 8.0 3rd Qu.:0.5116
## Max. :10.0 Max. :0.6912
#Display the names found in "Benefits".
names(benefits)
## [1] "stateur" "statemb" "state" "age" "tenure" "joblost"
## [7] "nwhite" "school12" "sex" "bluecol" "smsa" "married"
## [13] "dkids" "dykids" "yrdispl" "rr" "head" "ui"
#Display the structure of "Benefits".
str(benefits)
## 'data.frame': 4877 obs. of 18 variables:
## $ stateur : num 4.5 10.5 7.2 5.8 6.5 7.5 5.8 5.8 7.7 6 ...
## $ statemb : int 167 251 260 245 125 188 166 214 213 187 ...
## $ state : int 42 55 21 56 58 11 93 84 84 33 ...
## $ age : int 49 26 40 51 33 51 30 26 54 31 ...
## $ tenure : int 21 2 19 17 1 3 5 3 20 1 ...
## $ joblost : Factor w/ 4 levels "other","position_abolished",..: 1 4 1 4 4 1 2 4 1 4 ...
## $ nwhite : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ school12: Factor w/ 2 levels "no","yes": 1 1 2 1 2 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "female","male": 2 2 1 1 2 2 2 2 2 2 ...
## $ bluecol : Factor w/ 1 level "yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ smsa : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ married : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 2 2 2 2 ...
## $ dkids : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 2 2 1 1 ...
## $ dykids : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 2 2 1 1 ...
## $ yrdispl : int 7 10 10 10 4 10 6 8 6 8 ...
## $ rr : num 0.291 0.52 0.432 0.5 0.391 ...
## $ head : Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 2 2 ...
## $ ui : Factor w/ 2 levels "no","yes": 2 1 2 2 1 2 2 2 1 2 ...
#Display the head and tail of "Benefits".
head(benefits)
## stateur statemb state age tenure joblost nwhite school12 sex
## 1 4.5 167 42 49 21 other no no male
## 2 10.5 251 55 26 2 slack_work no no male
## 3 7.2 260 21 40 19 other no yes female
## 4 5.8 245 56 51 17 slack_work yes no female
## 5 6.5 125 58 33 1 slack_work no yes male
## 6 7.5 188 11 51 3 other no no male
## bluecol smsa married dkids dykids yrdispl rr head ui
## 1 yes yes no no no 7 0.2906 yes yes
## 2 yes yes no yes yes 10 0.5202 yes no
## 3 yes yes yes no no 10 0.4325 yes yes
## 4 yes yes yes no no 10 0.5000 no yes
## 5 yes yes yes yes yes 4 0.3906 yes no
## 6 yes yes yes no no 10 0.4822 yes yes
tail(benefits)
## stateur statemb state age tenure joblost nwhite school12 sex
## 4872 8.6 125 59 21 2 slack_work no no male
## 4873 7.4 168 33 35 12 slack_work yes yes female
## 4874 7.0 189 74 39 5 slack_work no no male
## 4875 8.0 168 74 59 13 slack_work no no male
## 4876 6.3 191 41 33 11 slack_work no no male
## 4877 9.3 188 94 27 2 other no no male
## bluecol smsa married dkids dykids yrdispl rr head ui
## 4872 yes yes no no no 2 0.3125 yes no
## 4873 yes yes no yes no 6 0.4636 yes yes
## 4874 yes yes yes no no 4 0.3937 yes yes
## 4875 yes yes yes yes no 2 0.3724 yes yes
## 4876 yes yes yes yes yes 3 0.5000 yes no
## 4877 yes yes yes yes yes 7 0.4954 yes yes
#Display the levels of 'joblost' within "Benefits".
levels(benefits$joblost)
## [1] "other" "position_abolished" "seasonal_job_ended"
## [4] "slack_work"
par(mfrow=c(1,1))
#Create a histogram of state unemployment rate ('stateur').
hist(benefits$stateur, xlim=c(2,18), main = "State Unemployment Rates in 1972", xlab = "State Unemployment Rate (in %)")
par(mfrow=c(1,1))
#Create a boxplot of state unemployment rate ('stateur').
boxplot(benefits$stateur~benefits$joblost, main = "Boxplot of State Unemployment Rates in 1972 among varying Reasons for Job Loss ", ylim = c(2,18), xlab = "Reasons for Job Loss", ylab = "State Unemployment Rate (in %)")
In order to determine if the variation that is observed in the response variable (which corresponds to the state unemployment rate in this analysis) can be explained by the variation existent in the treatment of the experiment (which corresponds to the reasons for job loss among blue collar workers), an analysis of variance (ANOVA) is performed as a means for analyzing the differences in state unemployment rate values for each of the different reasons for job loss among blue collar workers being considered in this dataset.
For the ANOVA model that is designed in this experiment, the null hypothesis that is being tested (which we will either reject or fail to reject by the end of our analysis) states that the reason for job loss for a given blue collar worker does not have have a statistically significant effect on the state unemployment rate (in %), implying that the differences in the mean values of the state unemployment rate were solely the result of randomization in this experiment. In other words, if we reject the null hypothesis, we would infer that the differences in mean values of the state unemployment rate for each of the corresponding reasons for job loss among blue collar workers being considered in this dataset is caused by something other than randomization, leading us to believe that the variation that is observed in the mean values of the state unemployment rate can be explained by the variation existent in the different reasons for job loss among blue collar workers being considered in this analysis. Alternately, if we fail to reject the null hypothesis, we would infer that the variation that is observed in the mean values of the state unemployment rate cannot be explained by the variation existent in the different reasons for job loss among blue collar workers being considered in this analysis and, as such, is likely caused by randomization.
#Perform an analysis of variance (ANOVA) for the different mean values observed in the state unemployment rate, given the factor 'joblost'.
model_job_loss <- aov(stateur~joblost,benefits)
anova(model_job_loss)
## Analysis of Variance Table
##
## Response: stateur
## Df Sum Sq Mean Sq F value Pr(>F)
## joblost 3 135 45.0 7.22 7.8e-05 ***
## Residuals 4873 30354 6.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For the analysis of variance (ANOVA) that is performed where ‘joblost’ is analyzed against the response variable ‘stateur’, a p-value = 7.81e-05 is returned, indicating that there is roughly a probability of 7.81e-05 that the resulting associated F-value (7.2233) is the result of solely randomization. Therefore, based on this result, we would reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of the state unemployment rate can be explained by the variation existent in the different reasons for job loss among blue collar workers being considered in this analysis and, as such, is likely not caused solely by randomization. (See above results for p-value and F-value.)
In further carrying out this analysis, we can compute Tukey Honest Significant Differences (via “TukeyHSD()”) as a means for determining the specific levels of the single factor existent in this analysis that are truly independent from each other and that significantly affect the response variable, ‘stateur’.
#Perform a TukeyHSD Test for "model_job_loss".
Tukey_job_loss = TukeyHSD(model_job_loss, ordered = FALSE, conf.level = 0.95)
Tukey_job_loss
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = stateur ~ joblost, data = benefits)
##
## $joblost
## diff lwr upr p adj
## position_abolished-other -0.04472 -0.39565 0.3062 0.9879
## seasonal_job_ended-other -0.05790 -0.56114 0.4453 0.9910
## slack_work-other 0.32072 0.12441 0.5170 0.0002
## seasonal_job_ended-position_abolished -0.01319 -0.59178 0.5654 0.9999
## slack_work-position_abolished 0.36544 0.01895 0.7119 0.0341
## slack_work-seasonal_job_ended 0.37863 -0.12152 0.8788 0.2092
par(mfrow=c(1,1))
plot(Tukey_job_loss)
After observing the results of these Tukey Honest Significant Differences for “model_job_loss”, it’s seemingly clear that a majority of the different level-comparisons suggest a not-so-significant difference in means among a great majority of the different level-pairs (including “position_abolished-other”, “seasonal_job_ended-other”, “seasonal_job_ended-position_abolished”, and “slack_work-seasonal_job_ended”, since, unlike the other level-pairs “slack_work-other” and “slack_work-position_abolished”, their respective p-values are greater than 0.05), which does not really serve to suggest the significance of the results that were gathered in the above ANOVA model “model_job_loss”.
In further analyzing the differences in state unemployment rate values for each of the different reasons for job loss among blue collar workers being considered in this dataset, we can perform an analysis of variance (ANOVA) test a second time using the resampling technique known as Bootstrapping. By resampling the data via Bootstrapping, an analysis of variance (ANOVA) can be performed without necessarily having to ensure that the data exhibits a normal distribution (which would likely result in a more accurate analysis overall).
#Perform ANOVA Test with Resampling (via Bootstrapping).
with(benefits,tapply(stateur,joblost,mean))
## other position_abolished seasonal_job_ended
## 7.364 7.319 7.306
## slack_work
## 7.685
with(benefits,tapply(stateur,joblost,var))
## other position_abolished seasonal_job_ended
## 6.426 6.057 5.643
## slack_work
## 6.136
with(benefits,tapply(stateur,joblost,length))
## other position_abolished seasonal_job_ended
## 1976 402 177
## slack_work
## 2322
#Obtain the sd(residual error) from summary statistics.
summary(aov(stateur~joblost,data=benefits))
## Df Sum Sq Mean Sq F value Pr(>F)
## joblost 3 135 45.0 7.22 7.8e-05 ***
## Residuals 4873 30354 6.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
meanstar = mean(benefits$stateur)
#From summary statistics above, input residual error below to calculate sd(residual error).
sdstar = sqrt(6.23)
simjoblost = benefits$joblost
R = 10000 #10,000 iterations
Fstar = numeric(R)
for (i in 1:R) {
#Residual should be normally distributed with known pooled-variance.
groupA = rnorm(1976, mean=meanstar, sd=sdstar)
groupB = rnorm(402, mean=meanstar, sd=sdstar)
groupC = rnorm(177, mean=meanstar, sd=sdstar)
groupD = rnorm(2322, mean=meanstar, sd=sdstar)
simstateur = c(groupA,groupB,groupC,groupD)
simdata = data.frame(simstateur,simjoblost)
Fstar[i] = oneway.test(simstateur~simjoblost, var.equal=T, data=simdata)$statistic
}
#Computes empirical distribution of F under H0 and display as histogram.
par(mfrow=c(1,1))
hist(Fstar, ylim=c(0,1), xlim=c(0, 8), prob=T, main = "Historgram of Empirical F-distribution")
#Plot F-distribution from .25 to 6 in .5 increments.
x=seq(.25,6,.25)
#Overlay the analytic F-distribution as red dots (degrees of freedom known).
#(Note: 4 levels-1 mean = 3; 4877 (n)-4 means = 4873)
#Notice the quality of the fit of the Monte Carlo simulation to the empirical distribution of F.
points(x,y=df(x,3,4873),type="b",col="red")
#Carry out the Bootstrap version of ANOVA.
meanstar = with(benefits, tapply(stateur,joblost,mean))
grpA = benefits$stateur[benefits$joblost=="other"] - meanstar[1]
grpB = benefits$stateur[benefits$joblost=="position_abolished"] - meanstar[2]
grpC = benefits$stateur[benefits$joblost=="seasonal_job_ended"] - meanstar[3]
grpD = benefits$stateur[benefits$joblost=="slack_work"] - meanstar[4]
simjoblost = benefits$joblost
R = 10000
Fstar_real = numeric(R)
for (i in 1:R) {
groupA = sample(grpA, size=1976, replace=T)
groupB = sample(grpB, size=402, replace=T)
groupC = sample(grpC, size=177, replace=T)
groupD = sample(grpD, size=2322, replace=T)
simstateur = c(groupA,groupB,groupC,groupD)
simdata = data.frame(simstateur,simjoblost)
Fstar_real[i] = oneway.test(simstateur~simjoblost, var.equal=T, data=simdata)$statistic
}
#Generate a plot that compares the Bootstrapped distribution to the known F-distribution.
par(mfrow=c(1,1))
hist(Fstar_real, ylim=c(0,1), xlim=c(0,8), prob=T, main = "Histogram of Actual F-distribution via Bootstrapping")
x=seq(.25,6,.25)
points(x,y=df(x,3,4873),type="b",col="red") #The F-distribution
#(Note: Here is the alpha-level from the Bootstrapped distribution.)
print(realFstar<-oneway.test(stateur~joblost, var.equal=T, data=benefits)$statistic)
## F
## 7.223
mean(Fstar>=realFstar)
## [1] 0
#Estimate the "alpha = 0.05" value of the test statistic, given F* from the empirical distribution above.
#Generate quantiles of the analytic F-distribution.
qf(.95,5,90)
## [1] 2.316
#Print the "alpha = 0.05" value of Bootstrapped F*.
quantile(Fstar_real,.95)
## 95%
## 2.614
In comparing each of the two histograms that result upon generating this “ANOVA with Resampling” analysis, it’s clear that a similar fit is exhibited when considering both the empirical F-distribution and the actual F-distribution, individually. Furthermore, this Bootstrapping ANOVA Test has a resulting p-value = 0 (zero), indicating that there is roughly a probability = 0 that the resulting associated “Bootstrapped” F-value [see above output for associated F-value] is the result of solely randomization. Therefore, based on this result, we would reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of the state unemployment rate can be explained by the variation existent in the different reasons for job loss among blue collar workers being considered in this analysis and, as such, is likely not caused solely by randomization.
In estimating the different parameters of the experiment, summary statistics were determined for the relevant data in the dataset pertaining to the state unemployment rate contained in “Benefits” (which includes both the average state unemployment rate values for all of the states contained within the dataset and the standard deviation of those average state unemployment rates) and the reasons for job loss among blue collar workers contained in “Benefits” (which includes both the quantities of individuals who are categorized as relating to the four different reasons for job loss among blue collar workers, respectively, and the standard deviation of those distributed quantities).
#Display summary statistics of benefits$stateur.
summary(benefits$stateur)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.20 5.70 7.20 7.51 9.00 18.00
#Display standard deviation of benefits$stateur.
sd(benefits$stateur, na.rm = FALSE)
## [1] 2.501
#Display summary statistics of benefits$joblost.
summary(benefits$joblost)
## other position_abolished seasonal_job_ended
## 1976 402 177
## slack_work
## 2322
#Display standard deviation of benefits$joblost.
sd(benefits$joblost, na.rm = FALSE)
## [1] 1.416
In verifying the results of this experiment, it’s important to ensure that the dataset itself meets all of the assumptions that correlate with the design approach that was carried out. In this way, we want to make sure that our dataset exhibits normality. Until we know that our dataset does, in fact, exhibit normality, we cannot yet say with confidence that our results are significant and representative of a properly carried-out modeling approach. In verifying our dataset for normality, we can both create a Normal Quantile-Quantile (QQ) Plot of our data and perform a Shapiro-Wilk Test of Normality on our data.
#Create a Normal Q-Q Plot for the state unemployment rate.
qqnorm(benefits[,"stateur"], main = "Normal Q-Q Plot of the State Unemployment Rate")
qqline(benefits[,"stateur"])
#Create a Normal Q-Q Plot of the residuals for "model_job_loss".
qqnorm(residuals(model_job_loss), main = "Normal Q-Q Plot of Residuals of 'model_job_loss'")
qqline(residuals(model_job_loss))
#Perform Shapiro-Wilk Test of Normality on the state unemployment rate (normality is assummed if p > 0.1).
shapiro.test(benefits[,"stateur"])
##
## Shapiro-Wilk normality test
##
## data: benefits[, "stateur"]
## W = 0.9602, p-value < 2.2e-16
Upon both constructing Normal Q-Q Plots and performing a Shapiro-Wilk Test of Normality on the data in this analysis, it’s likely that we cannot readily assume that our data exhibits normality, since the resulting p-value of the Shapiro-Wilk Test of Normality for “benefits[,“stateur”]” was found to be < 0.1 at its value of < 2.2e-16 and since all of the constructed Normal Q-Q Plots (and their respective “tails”, especially) did not seem to display a trend of data that aligned closely with the Normal Q-Q Line.
In further attempting to verify the validity of our original ANOVA model’s results, we can generate a “quality of fit” model that plots residual error against the fitted model that was developed in our original analysis of variance (ANOVA).
#Create a "Quality of Fit Model" that plots the residuals of "model_job_loss" against its fitted model.
plot(fitted(model_job_loss),residuals(model_job_loss))
Because the resulting plot does not appear to be scatted and clumped around zero, the ANOVA model that was developed does not suggest good fit under this “model adequacy checking” criteria. Thus, we cannot confidently rely on both the modeling approach that we carried out and the dataset that we analyzed in justifying the significance of our results.
Since our modeling assumptions of data normality and factor independence seemingly failed in our analysis, we can err on the side of caution by performing the nonparametric Kruskal-Wallis rank sum test to back up our original results (which will help us to decide whether the population distributions are identical without necessarily exhibiting a normal distribution).
#Perform Kruskal-Wallis Rank Sum Test on 'stateur' within the "Benefits" dataset for 'joblost' (identical populations is assummed if p > 0.05).
kruskal.test(benefits[,"stateur"],benefits$joblost)
##
## Kruskal-Wallis rank sum test
##
## data: benefits[, "stateur"] and benefits$joblost
## Kruskal-Wallis chi-squared = 26.38, df = 3, p-value = 7.96e-06
Since the p-value for the resulting Kruskal-Wallis rank sum test that considers the factors ‘joblost’ against the response variable ‘stateur’ is less than 0.05, we can assume that the mean values of the state unemployment rate compared to the different types of reasons for job loss among blue collar workers are comparatively nonidentical populations. Therefore, this result suggests that we would reject the null hypothesis of our main experiment, which would lead us to infer that the differences in mean values of the state unemployment rate for each of the corresponding reasons for job loss among blue collar workers being considered in this dataset is caused by something other than randomization, leading us to believe that the variation that is observed in the mean values of the state unemployment rate can be explained by the variation existent in the different reasons for job loss among blue collar workers being considered in this analysis. Furthermore, in addition to treating our data in such a way that uses a nonparametric analysis upon any realization that normality cannot be assumed, transformations such as the “Box-Cox Power Transformation” certainly could have been performed on the data to make it approximately normal. However, these transformations would not be necessary for this analysis, since the nonparametric significance results that we generated by using the Kruskal-Wallis rank sums test were suitable in giving us confidence in the results of our analysis.
[1] McCall, B.P. (1995) “The impact of unemployment insurance benefit levels on recipiency”, Journal of Business and Economic Statistics, 13, 189-198.
This dataframe contains a cross-section of information from 1972 pertaining to the unemployment rates of blue collar workers in the United States, collected for the research corresponding to the article entitled “The impact of unemployment insurance benefit levels on recipiency” (McCall, B.P. 1995) [1]. [http://www.stats4stem.org/r-benefits-data.html]. This dataframe, which is called ‘Benefits’, has been conveniently packaged for R users in the R Library ‘Ecdat’.