Recipe 7: Resampling Methods


Recipes for the Design of Experiments

as of August 28, 2014, superseding the version of August 24. Always use the most recent version.

Analysis with Resampling of Average State Unemployment Rate (Single-Factor, Multi-Level Experiment)

Brendan Howell

Rensselaer Polytechnic Institute

11/06/14 - Version 1.0

1. Setting

Dataset of Unemployment Rates of Blue Collar Workers in the United States (‘Benefits’).

Description: This dataframe contains a cross-section of information from 1972 pertaining to the unemployment rates of blue collar workers in the United States, collected for the research corresponding to the article entitled “The impact of unemployment insurance benefit levels on recipiency” (McCall, B.P. 1995) [1].

Using the data collected for “The impact of unemployment insurance benefit levels on recipiency” (McCall, B.P. 1995), this study performs a single-factor, multi-level experiment to determine whether the reason for a blue collar worker’s job loss has a statistically significant effect on the state unemployment rate (in %). In the dataset, the factor ‘joblost’ records the reason for a blue collar worker’s job loss, and the response variable, ‘stateur’, denotes the state unemployment rate. [1]

#Install and load the 'Ecdat' package, which contains the 'Benefits' dataframe.
rm(list=ls())
install.packages("Ecdat", repos='http://cran.us.r-project.org')
## package 'Ecdat' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\howelb\AppData\Local\Temp\RtmpQZInif\downloaded_packages
library("Ecdat",lib.loc="C:/Program Files/R/R-3.1.1/library")
## Warning: package 'Ecdat' was built under R version 3.1.2
## Loading required package: Ecfun
## Warning: package 'Ecfun' was built under R version 3.1.2
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange
#Assign the complete dataframe 'Benefits' to the variable 'benefits'. Then display the "head" and "tail" of 'benefits'.
benefits<-Benefits
head(benefits)
##   stateur statemb state age tenure    joblost nwhite school12    sex
## 1     4.5     167    42  49     21      other     no       no   male
## 2    10.5     251    55  26      2 slack_work     no       no   male
## 3     7.2     260    21  40     19      other     no      yes female
## 4     5.8     245    56  51     17 slack_work    yes       no female
## 5     6.5     125    58  33      1 slack_work     no      yes   male
## 6     7.5     188    11  51      3      other     no       no   male
##   bluecol smsa married dkids dykids yrdispl     rr head  ui
## 1     yes  yes      no    no     no       7 0.2906  yes yes
## 2     yes  yes      no   yes    yes      10 0.5202  yes  no
## 3     yes  yes     yes    no     no      10 0.4325  yes yes
## 4     yes  yes     yes    no     no      10 0.5000   no yes
## 5     yes  yes     yes   yes    yes       4 0.3906  yes  no
## 6     yes  yes     yes    no     no      10 0.4822  yes yes
tail(benefits)
##      stateur statemb state age tenure    joblost nwhite school12    sex
## 4872     8.6     125    59  21      2 slack_work     no       no   male
## 4873     7.4     168    33  35     12 slack_work    yes      yes female
## 4874     7.0     189    74  39      5 slack_work     no       no   male
## 4875     8.0     168    74  59     13 slack_work     no       no   male
## 4876     6.3     191    41  33     11 slack_work     no       no   male
## 4877     9.3     188    94  27      2      other     no       no   male
##      bluecol smsa married dkids dykids yrdispl     rr head  ui
## 4872     yes  yes      no    no     no       2 0.3125  yes  no
## 4873     yes  yes      no   yes     no       6 0.4636  yes yes
## 4874     yes  yes     yes    no     no       4 0.3937  yes yes
## 4875     yes  yes     yes   yes     no       2 0.3724  yes yes
## 4876     yes  yes     yes   yes    yes       3 0.5000  yes  no
## 4877     yes  yes     yes   yes    yes       7 0.4954  yes yes

Factors and Levels

This analysis considers a single factor with multiple levels, denoted ‘joblost’ in the dataset. In the original dataframe “Benefits”, ‘joblost’ is a categorical variable with four levels. This factor was selected because the analysis aims to determine whether the reason for a blue collar worker’s job loss has a statistically significant effect on the state unemployment rate (in %).
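
As a quick check (the counts per level appear in the full summary output below), one can confirm that ‘joblost’ is stored as a factor and count its levels; this is a small sketch using the ‘benefits’ dataframe created above.

#Confirm that 'joblost' is stored as a factor and count its levels.
is.factor(benefits$joblost)
nlevels(benefits$joblost)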

#Display the summary statistics of "Benefits".
summary(benefits)
##     stateur         statemb        state           age      
##  Min.   : 2.20   Min.   : 84   Min.   :11.0   Min.   :20.0  
##  1st Qu.: 5.70   1st Qu.:150   1st Qu.:31.0   1st Qu.:28.0  
##  Median : 7.20   Median :177   Median :55.0   Median :34.0  
##  Mean   : 7.51   Mean   :181   Mean   :52.8   Mean   :36.1  
##  3rd Qu.: 9.00   3rd Qu.:205   3rd Qu.:74.0   3rd Qu.:43.0  
##  Max.   :18.00   Max.   :293   Max.   :95.0   Max.   :61.0  
##      tenure                    joblost     nwhite     school12  
##  Min.   : 1.00   other             :1976   no :4159   no :3950  
##  1st Qu.: 2.00   position_abolished: 402   yes: 718   yes: 927  
##  Median : 3.00   seasonal_job_ended: 177                        
##  Mean   : 5.66   slack_work        :2322                        
##  3rd Qu.: 7.00                                                  
##  Max.   :41.00                                                  
##      sex       bluecol     smsa      married    dkids      dykids    
##  female:1150   yes:4877   no :1694   no :1791   no :2508   no :3796  
##  male  :3727              yes:3183   yes:3086   yes:2369   yes:1081  
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##     yrdispl           rr          head        ui      
##  Min.   : 1.0   Min.   :0.0386   no :1558   no :1542  
##  1st Qu.: 2.0   1st Qu.:0.3752   yes:3319   yes:3335  
##  Median : 5.0   Median :0.4904                        
##  Mean   : 5.2   Mean   :0.4384                        
##  3rd Qu.: 8.0   3rd Qu.:0.5116                        
##  Max.   :10.0   Max.   :0.6912
#Display the names found in "Benefits".
names(benefits)
##  [1] "stateur"  "statemb"  "state"    "age"      "tenure"   "joblost" 
##  [7] "nwhite"   "school12" "sex"      "bluecol"  "smsa"     "married" 
## [13] "dkids"    "dykids"   "yrdispl"  "rr"       "head"     "ui"
#Display the structure of "Benefits".
str(benefits)
## 'data.frame':    4877 obs. of  18 variables:
##  $ stateur : num  4.5 10.5 7.2 5.8 6.5 7.5 5.8 5.8 7.7 6 ...
##  $ statemb : int  167 251 260 245 125 188 166 214 213 187 ...
##  $ state   : int  42 55 21 56 58 11 93 84 84 33 ...
##  $ age     : int  49 26 40 51 33 51 30 26 54 31 ...
##  $ tenure  : int  21 2 19 17 1 3 5 3 20 1 ...
##  $ joblost : Factor w/ 4 levels "other","position_abolished",..: 1 4 1 4 4 1 2 4 1 4 ...
##  $ nwhite  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ school12: Factor w/ 2 levels "no","yes": 1 1 2 1 2 1 1 1 1 1 ...
##  $ sex     : Factor w/ 2 levels "female","male": 2 2 1 1 2 2 2 2 2 2 ...
##  $ bluecol : Factor w/ 1 level "yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ smsa    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ married : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 2 2 2 2 ...
##  $ dkids   : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 2 2 1 1 ...
##  $ dykids  : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 2 2 1 1 ...
##  $ yrdispl : int  7 10 10 10 4 10 6 8 6 8 ...
##  $ rr      : num  0.291 0.52 0.432 0.5 0.391 ...
##  $ head    : Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 2 2 ...
##  $ ui      : Factor w/ 2 levels "no","yes": 2 1 2 2 1 2 2 2 1 2 ...

Continuous variables (if any)

Several variables in this dataset can be treated as continuous, namely those stored as numeric: ‘stateur’ (the state unemployment rate, in %) and ‘rr’ (the replacement rate of blue collar workers). Five additional variables are stored as integers: the state maximum benefit level (‘statemb’), the state of residence code (‘state’), the worker’s age in years (‘age’), the years of tenure in the job that was lost (‘tenure’), and the year of the worker’s job displacement (‘yrdispl’).
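
As a quick check of these classifications (a small sketch using the ‘benefits’ dataframe created above), the storage class of each column can be listed directly:

#Report the storage class of each column; numeric (double) columns are the candidates for truly continuous variables.
sapply(benefits, class)
#List the numeric (double) and the integer columns separately.
names(benefits)[sapply(benefits, is.double)]
names(benefits)[sapply(benefits, is.integer)]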

Response variables

This analysis will consider one response variable, ‘stateur’, which denotes the state unemployment rate in the analysis.

The Data: How is it organized and what does it look like?

As a whole, this dataset contains a cross-section of information from 1972 pertaining to the unemployment rates of blue collar workers in the United States, collected for the research corresponding to the article entitled “The impact of unemployment insurance benefit levels on recipiency” (McCall, B.P. 1995) [1]. It includes data pertaining to state unemployment rates and to the reasons for job loss among blue collar workers. It contains 4877 observations of 18 variables, which are defined below:

Variable ID [Description]

stateur [state unemployment rate (in %)]

statemb [state maximum benefit level]

state [state of residence code]

age [age in years]

tenure [years of tenure in job lost]

joblost [a factor with four levels (“slack_work”, “position_abolished”, “seasonal_job_ended”, “other”)]

nwhite [non-white ?]

school12 [more than 12 years of school ?]

sex [a factor with two levels (“male”, “female”)]

bluecol [blue collar worker ?]

smsa [lives in SMSA ?]

married [married ?]

dkids [has kids ?]

dykids [has young kids (0-5 yrs) ?]

yrdispl [year of job displacement (1982=1,…, 1991=10)]

rr [replacement rate]

head [is head of household ?]

ui [applied for (and received) UI benefits ?]

Randomization

According to the research publication considered in this analysis (McCall, B.P. 1995), the observations in this dataset were obtained from survey data. We can therefore reasonably assume that the experiment exhibits sufficient randomization.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

In this experiment, we are trying to determine whether the variation observed in the response variable (‘stateur’) can be explained by the variation in the single treatment of the experiment (‘joblost’). The null hypothesis being tested therefore states that the reason for an individual blue collar worker’s job loss does not have a statistically significant effect on the state unemployment rate (in %). To carry out this analysis, we perform an analysis of variance (ANOVA) on the state unemployment rate (‘stateur’) to see whether its mean differs significantly across the reasons for job loss recorded in this dataset. The analysis of variance is then repeated using resampling techniques (namely, bootstrapping).
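
As a brief formal sketch of this design, the one-way ANOVA model underlying the test can be written as $y_{ij} = \mu + \tau_i + \epsilon_{ij}$, where $y_{ij}$ is the state unemployment rate recorded for the $j$-th worker in the $i$-th level of ‘joblost’, $\mu$ is the overall mean, $\tau_i$ is the effect of the $i$-th reason for job loss, and $\epsilon_{ij} \sim N(0, \sigma^2)$. The null hypothesis described above is then $H_0: \tau_1 = \tau_2 = \tau_3 = \tau_4 = 0$, against the alternative that at least one $\tau_i$ is nonzero.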

What is the rationale for this design?

The rationale for this design lies primarily in the fact that we are trying to determine whether the reason for a blue collar worker’s job loss has any effect on the state unemployment rate. Because the state unemployment rate (in %) is a useful and relevant metric for assessing the economic performance of the United States, this single-factor, multi-level design (analyzed via an analysis of variance) was crafted to test whether the reason for a blue collar worker’s job loss has a statistically significant effect on that rate. By performing this analysis, we hope to gain some insight into the factors that contribute to a given level of state unemployment.

Randomize: What is the Randomization Scheme?

Since the observations contained within this dataset were simply the results of taking survey data (McCall, B.P. 1995), we can likely make the assumption that this experiment exhibits sufficient randomization.

Replicate: Are there replicates and/or repeated measures?

In this experiment, there are no replicates or repeated measures present.

Block: Did you use blocking in the design?

Because the experiment is a one-factor, four-level design whose factor is an originally categorical variable, blocking is inherently built into the design. The single factor, ‘joblost’ (the reason for a given blue collar worker’s job loss), has the distinct levels “slack_work”, “position_abolished”, “seasonal_job_ended”, and “other”; these levels ultimately designate the main blocking groups in the experiment.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and Descriptive Summary

To begin exploring the data, the summary statistics, variable names, structure, and head/tail of the ‘benefits’ dataframe were examined (these are displayed in Section 1 above). Here, the levels of the factor ‘joblost’ are listed, and a histogram and a boxplot of the state unemployment rate are created.

#Display the summary statistics of "Benefits".
summary(benefits)
##     stateur         statemb        state           age      
##  Min.   : 2.20   Min.   : 84   Min.   :11.0   Min.   :20.0  
##  1st Qu.: 5.70   1st Qu.:150   1st Qu.:31.0   1st Qu.:28.0  
##  Median : 7.20   Median :177   Median :55.0   Median :34.0  
##  Mean   : 7.51   Mean   :181   Mean   :52.8   Mean   :36.1  
##  3rd Qu.: 9.00   3rd Qu.:205   3rd Qu.:74.0   3rd Qu.:43.0  
##  Max.   :18.00   Max.   :293   Max.   :95.0   Max.   :61.0  
##      tenure                    joblost     nwhite     school12  
##  Min.   : 1.00   other             :1976   no :4159   no :3950  
##  1st Qu.: 2.00   position_abolished: 402   yes: 718   yes: 927  
##  Median : 3.00   seasonal_job_ended: 177                        
##  Mean   : 5.66   slack_work        :2322                        
##  3rd Qu.: 7.00                                                  
##  Max.   :41.00                                                  
##      sex       bluecol     smsa      married    dkids      dykids    
##  female:1150   yes:4877   no :1694   no :1791   no :2508   no :3796  
##  male  :3727              yes:3183   yes:3086   yes:2369   yes:1081  
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##     yrdispl           rr          head        ui      
##  Min.   : 1.0   Min.   :0.0386   no :1558   no :1542  
##  1st Qu.: 2.0   1st Qu.:0.3752   yes:3319   yes:3335  
##  Median : 5.0   Median :0.4904                        
##  Mean   : 5.2   Mean   :0.4384                        
##  3rd Qu.: 8.0   3rd Qu.:0.5116                        
##  Max.   :10.0   Max.   :0.6912
#Display the names found in "Benefits".
names(benefits)
##  [1] "stateur"  "statemb"  "state"    "age"      "tenure"   "joblost" 
##  [7] "nwhite"   "school12" "sex"      "bluecol"  "smsa"     "married" 
## [13] "dkids"    "dykids"   "yrdispl"  "rr"       "head"     "ui"
#Display the structure of "Benefits".
str(benefits)
## 'data.frame':    4877 obs. of  18 variables:
##  $ stateur : num  4.5 10.5 7.2 5.8 6.5 7.5 5.8 5.8 7.7 6 ...
##  $ statemb : int  167 251 260 245 125 188 166 214 213 187 ...
##  $ state   : int  42 55 21 56 58 11 93 84 84 33 ...
##  $ age     : int  49 26 40 51 33 51 30 26 54 31 ...
##  $ tenure  : int  21 2 19 17 1 3 5 3 20 1 ...
##  $ joblost : Factor w/ 4 levels "other","position_abolished",..: 1 4 1 4 4 1 2 4 1 4 ...
##  $ nwhite  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ school12: Factor w/ 2 levels "no","yes": 1 1 2 1 2 1 1 1 1 1 ...
##  $ sex     : Factor w/ 2 levels "female","male": 2 2 1 1 2 2 2 2 2 2 ...
##  $ bluecol : Factor w/ 1 level "yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ smsa    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ married : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 2 2 2 2 ...
##  $ dkids   : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 2 2 1 1 ...
##  $ dykids  : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 2 2 1 1 ...
##  $ yrdispl : int  7 10 10 10 4 10 6 8 6 8 ...
##  $ rr      : num  0.291 0.52 0.432 0.5 0.391 ...
##  $ head    : Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 2 2 ...
##  $ ui      : Factor w/ 2 levels "no","yes": 2 1 2 2 1 2 2 2 1 2 ...
#Display the head and tail of "Benefits".
head(benefits)
##   stateur statemb state age tenure    joblost nwhite school12    sex
## 1     4.5     167    42  49     21      other     no       no   male
## 2    10.5     251    55  26      2 slack_work     no       no   male
## 3     7.2     260    21  40     19      other     no      yes female
## 4     5.8     245    56  51     17 slack_work    yes       no female
## 5     6.5     125    58  33      1 slack_work     no      yes   male
## 6     7.5     188    11  51      3      other     no       no   male
##   bluecol smsa married dkids dykids yrdispl     rr head  ui
## 1     yes  yes      no    no     no       7 0.2906  yes yes
## 2     yes  yes      no   yes    yes      10 0.5202  yes  no
## 3     yes  yes     yes    no     no      10 0.4325  yes yes
## 4     yes  yes     yes    no     no      10 0.5000   no yes
## 5     yes  yes     yes   yes    yes       4 0.3906  yes  no
## 6     yes  yes     yes    no     no      10 0.4822  yes yes
tail(benefits)
##      stateur statemb state age tenure    joblost nwhite school12    sex
## 4872     8.6     125    59  21      2 slack_work     no       no   male
## 4873     7.4     168    33  35     12 slack_work    yes      yes female
## 4874     7.0     189    74  39      5 slack_work     no       no   male
## 4875     8.0     168    74  59     13 slack_work     no       no   male
## 4876     6.3     191    41  33     11 slack_work     no       no   male
## 4877     9.3     188    94  27      2      other     no       no   male
##      bluecol smsa married dkids dykids yrdispl     rr head  ui
## 4872     yes  yes      no    no     no       2 0.3125  yes  no
## 4873     yes  yes      no   yes     no       6 0.4636  yes yes
## 4874     yes  yes     yes    no     no       4 0.3937  yes yes
## 4875     yes  yes     yes   yes     no       2 0.3724  yes yes
## 4876     yes  yes     yes   yes    yes       3 0.5000  yes  no
## 4877     yes  yes     yes   yes    yes       7 0.4954  yes yes
#Display the levels of 'joblost' within "Benefits".
levels(benefits$joblost)
## [1] "other"              "position_abolished" "seasonal_job_ended"
## [4] "slack_work"
par(mfrow=c(1,1))
#Create a histogram of state unemployment rate ('stateur').
hist(benefits$stateur, xlim=c(2,18), main = "State Unemployment Rates in 1972", xlab = "State Unemployment Rate (in %)")

[Figure: histogram of state unemployment rates in 1972]

par(mfrow=c(1,1))
#Create a boxplot of state unemployment rate ('stateur').
boxplot(benefits$stateur~benefits$joblost, main = "Boxplot of State Unemployment Rates in 1972 among varying Reasons for Job Loss ", ylim = c(2,18), xlab = "Reasons for Job Loss", ylab = "State Unemployment Rate (in %)")

[Figure: boxplot of state unemployment rate (in %) by reason for job loss]

Testing

To determine whether the variation observed in the response variable (the state unemployment rate) can be explained by the variation in the treatment of the experiment (the reason for job loss among blue collar workers), an analysis of variance (ANOVA) is performed, comparing the state unemployment rate values across the different reasons for job loss represented in this dataset.

For the ANOVA model used in this experiment, the null hypothesis (which we will either reject or fail to reject) states that the reason for a given blue collar worker’s job loss does not have a statistically significant effect on the state unemployment rate (in %); under this hypothesis, any differences in the mean values of the state unemployment rate are attributable to chance alone. If we reject the null hypothesis, we infer that the differences in mean state unemployment rates across the reasons for job loss are caused by something other than chance, and therefore that the variation observed in those means can be explained by the variation in the reasons for job loss considered in this analysis. If we fail to reject the null hypothesis, we infer that the variation observed in the mean state unemployment rates cannot be explained by the reasons for job loss and is instead likely due to chance.

#Perform an analysis of variance (ANOVA) for the different mean values observed in the state unemployment rate, given the factor 'joblost'.
model_job_loss <- aov(stateur~joblost,benefits)
anova(model_job_loss)
## Analysis of Variance Table
## 
## Response: stateur
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## joblost      3    135    45.0    7.22 7.8e-05 ***
## Residuals 4873  30354     6.2                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For the analysis of variance (ANOVA) in which ‘joblost’ is tested against the response variable ‘stateur’, a p-value of 7.81e-05 is returned, indicating that, if the null hypothesis were true, the probability of obtaining an F-value at least as large as the observed one (7.2233) would be roughly 7.81e-05. Based on this result, we reject the null hypothesis and conclude that the variation observed in the mean values of the state unemployment rate can be explained, at least in part, by the different reasons for job loss among blue collar workers rather than by chance alone. (See the ANOVA output above for the p-value and F-value.)
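
For reference, the F-value and p-value quoted above can also be pulled out of the fitted model programmatically; the following is a small sketch using the ‘model_job_loss’ object created above.

#Store the ANOVA table and extract the F statistic and p-value for the 'joblost' effect (first row).
anova_table <- anova(model_job_loss)
anova_table[1, "F value"]
anova_table[1, "Pr(>F)"]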

Tukey Honest Significant Differences

To carry this analysis further, we can compute Tukey Honest Significant Differences (via “TukeyHSD()”) to determine which specific pairs of levels of the single factor differ significantly in their mean values of the response variable, ‘stateur’.

#Perform a TukeyHSD Test for "model_job_loss".
Tukey_job_loss = TukeyHSD(model_job_loss, ordered = FALSE, conf.level = 0.95)
Tukey_job_loss
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = stateur ~ joblost, data = benefits)
## 
## $joblost
##                                           diff      lwr    upr  p adj
## position_abolished-other              -0.04472 -0.39565 0.3062 0.9879
## seasonal_job_ended-other              -0.05790 -0.56114 0.4453 0.9910
## slack_work-other                       0.32072  0.12441 0.5170 0.0002
## seasonal_job_ended-position_abolished -0.01319 -0.59178 0.5654 0.9999
## slack_work-position_abolished          0.36544  0.01895 0.7119 0.0341
## slack_work-seasonal_job_ended          0.37863 -0.12152 0.8788 0.2092
par(mfrow=c(1,1))
plot(Tukey_job_loss)

[Figure: Tukey HSD 95% family-wise confidence intervals for differences in mean 'stateur' between 'joblost' levels]

The Tukey Honest Significant Differences for “model_job_loss” show that most of the level-pairs (“position_abolished-other”, “seasonal_job_ended-other”, “seasonal_job_ended-position_abolished”, and “slack_work-seasonal_job_ended”) have adjusted p-values greater than 0.05 and therefore show no significant difference in means. Only “slack_work-other” and “slack_work-position_abolished” differ significantly, so the pairwise comparisons provide only partial support for the overall significance found in the ANOVA model “model_job_loss” above.
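
To make this explicit, the pairwise comparisons can be filtered on their adjusted p-values; below is a brief sketch using the ‘Tukey_job_loss’ object computed above.

#Keep only the level-pairs whose Tukey-adjusted p-value falls below 0.05.
tukey_matrix <- Tukey_job_loss$joblost
tukey_matrix[tukey_matrix[, "p adj"] < 0.05, ]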

Testing with Resampling (via Bootstrapping)

To further analyze the differences in state unemployment rate values across the different reasons for job loss among blue collar workers in this dataset, we can perform the analysis of variance (ANOVA) a second time using the resampling technique known as bootstrapping. By resampling the data, an ANOVA can be carried out without relying on the assumption that the data follow a normal distribution, which should make the overall analysis more robust.

#Perform ANOVA Test with Resampling (via Bootstrapping).
with(benefits,tapply(stateur,joblost,mean))
##              other position_abolished seasonal_job_ended 
##              7.364              7.319              7.306 
##         slack_work 
##              7.685
with(benefits,tapply(stateur,joblost,var))
##              other position_abolished seasonal_job_ended 
##              6.426              6.057              5.643 
##         slack_work 
##              6.136
with(benefits,tapply(stateur,joblost,length))
##              other position_abolished seasonal_job_ended 
##               1976                402                177 
##         slack_work 
##               2322
#Obtain the sd(residual error) from summary statistics.
summary(aov(stateur~joblost,data=benefits))
##               Df Sum Sq Mean Sq F value  Pr(>F)    
## joblost        3    135    45.0    7.22 7.8e-05 ***
## Residuals   4873  30354     6.2                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
meanstar = mean(benefits$stateur)
#From summary statistics above, input residual error below to calculate sd(residual error).
sdstar = sqrt(6.23)
simjoblost = benefits$joblost
R = 10000  #10,000 iterations
Fstar = numeric(R)
for (i in 1:R) {
  #Residual should be normally distributed with known pooled-variance.
  groupA = rnorm(1976, mean=meanstar, sd=sdstar)
  groupB = rnorm(402, mean=meanstar, sd=sdstar)
  groupC = rnorm(177, mean=meanstar, sd=sdstar)
  groupD = rnorm(2322, mean=meanstar, sd=sdstar)
  simstateur = c(groupA,groupB,groupC,groupD)
  simdata = data.frame(simstateur,simjoblost)
  Fstar[i] = oneway.test(simstateur~simjoblost, var.equal=T, data=simdata)$statistic
}
#Compute the empirical distribution of F under H0 and display it as a histogram.
par(mfrow=c(1,1))
hist(Fstar, ylim=c(0,1), xlim=c(0, 8), prob=T, main = "Histogram of Empirical F-distribution")
#Plot the analytic F-distribution from 0.25 to 6 in 0.25 increments.
x=seq(.25,6,.25)
#Overlay the analytic F-distribution as red dots (degrees of freedom known).
#(Note: numerator df = 4 levels - 1 = 3; denominator df = 4877 observations - 4 group means = 4873.)
#Notice how closely the simulated (empirical) distribution of F matches the analytic F-distribution.
points(x,y=df(x,3,4873),type="b",col="red")

[Figure: histogram of the empirical (simulated) F-distribution under H0, with the analytic F(3, 4873) density overlaid in red]

#Carry out the Bootstrap version of ANOVA.
meanstar = with(benefits, tapply(stateur,joblost,mean))
grpA = benefits$stateur[benefits$joblost=="other"] - meanstar[1]
grpB = benefits$stateur[benefits$joblost=="position_abolished"] - meanstar[2]
grpC = benefits$stateur[benefits$joblost=="seasonal_job_ended"] - meanstar[3]
grpD = benefits$stateur[benefits$joblost=="slack_work"] - meanstar[4]
simjoblost = benefits$joblost
R = 10000
Fstar_real = numeric(R)
for (i in 1:R) {
  groupA = sample(grpA, size=1976, replace=T)
  groupB = sample(grpB, size=402, replace=T)
  groupC = sample(grpC, size=177, replace=T)
  groupD = sample(grpD, size=2322, replace=T)
  simstateur = c(groupA,groupB,groupC,groupD)
  simdata = data.frame(simstateur,simjoblost)
  Fstar_real[i] = oneway.test(simstateur~simjoblost, var.equal=T, data=simdata)$statistic
}
#Generate a plot that compares the Bootstrapped distribution to the known F-distribution.
par(mfrow=c(1,1))
hist(Fstar_real, ylim=c(0,1), xlim=c(0,8), prob=T, main = "Histogram of Actual F-distribution via Bootstrapping")
x=seq(.25,6,.25)
points(x,y=df(x,3,4873),type="b",col="red")  #The F-distribution

[Figure: histogram of the bootstrapped F-distribution, with the analytic F(3, 4873) density overlaid in red]

#(Note: compute the observed F statistic for the real data, then the proportion of simulated null F* values (Fstar) that reach it - i.e., the simulated p-value.)
print(realFstar<-oneway.test(stateur~joblost, var.equal=T, data=benefits)$statistic)
##     F 
## 7.223
mean(Fstar>=realFstar)
## [1] 0
#Estimate the "alpha = 0.05" value of the test statistic, given F* from the empirical distribution above.
#Generate quantiles of the analytic F-distribution.
qf(.95,5,90)
## [1] 2.316
#Print the "alpha = 0.05" value of Bootstrapped F*.
quantile(Fstar_real,.95)
##   95% 
## 2.614

Comparing the two histograms produced in this “ANOVA with Resampling” analysis, both the simulated (empirical) null distribution and the bootstrapped null distribution fit the analytic F-distribution closely. Furthermore, the resampling test returns mean(Fstar >= realFstar) = 0, meaning that none of the 10,000 simulated F-values reached the observed F-value [see output above], so the estimated p-value is less than 1/10,000. Based on this result, we again reject the null hypothesis and conclude that the variation observed in the mean values of the state unemployment rate can be explained, at least in part, by the different reasons for job loss among blue collar workers rather than by chance alone.
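
If a strictly positive estimate of this p-value is preferred, a common convention (shown here as a minimal sketch using the ‘Fstar’, ‘realFstar’, and ‘R’ objects defined above) is the “add one” Monte Carlo estimator, which reports (number of exceedances + 1)/(R + 1) instead of an exact zero.

#"Add-one" Monte Carlo p-value: avoids reporting exactly zero when no simulated F reaches the observed value.
(sum(Fstar >= realFstar) + 1)/(R + 1)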

Estimation (of Parameters)

To estimate the parameters of the experiment, summary statistics were computed for the relevant variables in “Benefits”: for the state unemployment rate (‘stateur’), the usual summary statistics together with the standard deviation; and for the reason for job loss (‘joblost’), the number of blue collar workers falling into each of the four categories, together with the standard deviation reported by R for that factor (interpreted with caution below).

Tables of Parameter Values

#Display summary statistics of benefits$stateur.
summary(benefits$stateur)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.20    5.70    7.20    7.51    9.00   18.00
#Display standard deviation of benefits$stateur.
sd(benefits$stateur, na.rm = FALSE)
## [1] 2.501
#Display summary statistics of benefits$joblost.
summary(benefits$joblost)
##              other position_abolished seasonal_job_ended 
##               1976                402                177 
##         slack_work 
##               2322
#Display standard deviation of benefits$joblost.
sd(benefits$joblost, na.rm = FALSE)
## [1] 1.416
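
Because ‘joblost’ is a factor, the standard deviation reported above is computed from its internal integer codes and is not statistically meaningful (recent versions of R refuse to compute var() or sd() on a factor at all). A more informative summary of this parameter, sketched below, is a table of counts and proportions per level.

#Summarize 'joblost' by counts and proportions per level rather than by a standard deviation.
table(benefits$joblost)
round(prop.table(table(benefits$joblost)), 3)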

Diagnostics/Model Adequacy Checking

In verifying the results of this experiment, it is important to check that the dataset meets the assumptions of the modeling approach; in particular, ANOVA assumes that the data (and the model residuals) are approximately normally distributed. Until normality has been checked, we cannot confidently claim that our results are significant and that the modeling approach was properly applied. To check normality, we create Normal Quantile-Quantile (Q-Q) plots of the data and of the model residuals, and we perform a Shapiro-Wilk test of normality.

#Create a Normal Q-Q Plot for the state unemployment rate.
qqnorm(benefits[,"stateur"], main = "Normal Q-Q Plot of the State Unemployment Rate")
qqline(benefits[,"stateur"])

[Figure: Normal Q-Q plot of the state unemployment rate]

#Create a Normal Q-Q Plot of the residuals for "model_job_loss".
qqnorm(residuals(model_job_loss), main = "Normal Q-Q Plot of Residuals of 'model_job_loss'")
qqline(residuals(model_job_loss))

[Figure: Normal Q-Q plot of the residuals of 'model_job_loss']

Shapiro-Wilk Test of Normality Results

#Perform the Shapiro-Wilk test of normality on the state unemployment rate (normality is assumed if p > 0.1).
shapiro.test(benefits[,"stateur"])
## 
##  Shapiro-Wilk normality test
## 
## data:  benefits[, "stateur"]
## W = 0.9602, p-value < 2.2e-16

Based on both the Normal Q-Q plots and the Shapiro-Wilk test of normality, we cannot reasonably assume that our data are normally distributed: the p-value of the Shapiro-Wilk test for benefits[,“stateur”] is < 2.2e-16, far below the 0.1 threshold, and the Normal Q-Q plots (especially in their tails) depart noticeably from the Normal Q-Q line.

To further check the validity of our original ANOVA model’s results, we can also plot the residuals of the model against its fitted values as a quality-of-fit check.

#Create a "Quality of Fit Model" that plots the residuals of "model_job_loss" against its fitted model.
plot(fitted(model_job_loss),residuals(model_job_loss))

[Figure: residuals of 'model_job_loss' plotted against its fitted values]

Because the resulting plot does not show the residuals scattered evenly around zero, the fitted ANOVA model does not appear adequate under this model-adequacy check. Thus, we cannot confidently rely on the modeling approach and the dataset alone to justify the significance of our results.
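
As a supplementary check (not part of the original analysis), base R can also produce its standard set of diagnostic plots, including the residuals-versus-fitted and normal Q-Q views, directly from the fitted aov object; a brief sketch:

#Display R's standard diagnostic plots for the fitted ANOVA model on a single page.
par(mfrow=c(2,2))
plot(model_job_loss)
par(mfrow=c(1,1))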

4. Contingencies to the Experimental Design/Analysis if Modeling Assumptions Failed

Since our modeling assumptions of data normality and factor independence did not appear to hold in this analysis, we can err on the side of caution by performing the nonparametric Kruskal-Wallis rank sum test to back up our original results; this test assesses whether the population distributions are identical without requiring a normal distribution.

Kruskal-Wallis Rank Sum Test Results

#Perform the Kruskal-Wallis rank sum test of 'stateur' by 'joblost' within the "Benefits" dataset (identical populations are assumed if p > 0.05).
kruskal.test(benefits[,"stateur"],benefits$joblost)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  benefits[, "stateur"] and benefits$joblost
## Kruskal-Wallis chi-squared = 26.38, df = 3, p-value = 7.96e-06

Since the p-value of the Kruskal-Wallis rank sum test of ‘stateur’ by ‘joblost’ is well below 0.05, we conclude that the distributions of the state unemployment rate differ across the reasons for job loss among blue collar workers. This result again leads us to reject the null hypothesis of our main experiment: the differences in mean state unemployment rates across the reasons for job loss appear to be caused by something other than chance, so the variation observed in the response can be explained, at least in part, by the different reasons for job loss considered in this analysis. In addition to falling back on a nonparametric analysis when normality cannot be assumed, a transformation such as the Box-Cox power transformation could have been applied to make the data approximately normal. Such a transformation was not pursued here, since the nonparametric result from the Kruskal-Wallis test was sufficient to give us confidence in the conclusions of the analysis.
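
Had a transformation been pursued, the Box-Cox power transformation mentioned above is available through the ‘MASS’ package; the following is a minimal sketch (assuming ‘MASS’ is installed) that profiles the transformation parameter lambda for this model.

#Profile the Box-Cox transformation parameter lambda for the state unemployment rate model.
library(MASS)
boxcox(stateur ~ joblost, data = benefits, lambda = seq(-2, 2, 0.1))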

5. References to the literature

[1] McCall, B.P. (1995) “The impact of unemployment insurance benefit levels on recipiency”, Journal of Business and Economic Statistics, 13, 189-198.

6. Appendices

A summary of, or pointer to, the raw data

complete and documented R code

This dataframe contains a cross-section of information from 1972 pertaining to the unemployment rates of blue collar workers in the United States, collected for the research corresponding to the article entitled “The impact of unemployment insurance benefit levels on recipiency” (McCall, B.P. 1995) [1]. [http://www.stats4stem.org/r-benefits-data.html]. This dataframe, which is called ‘Benefits’, has been conveniently packaged for R users in the R Library ‘Ecdat’.