1. Setting

From 1980 to 1987, 4360 observations were collected from 545 young males working in 9 different occupations across 12 industries. The data collected included information pertaining to hourly wage earned, ethnicity, years of schooling, years of experience, marriage status, and union representation.

System Under Test

I used the Males data set from the Ecdat R package. Originally the data set contained the following variables:

nr: identification
year: year observation taken
school: years of schooling
exper: years of experience (=age-6-school)
union: wage set by collective bargaining? (factor with two levels: yes, no)
ethn: ethnicity (factor with three levels: black, hisp, other)
married: marriage status (factor with two levels: yes, no)
health: health problem? (factor with two levels: yes, no)
wage: log of hourly wage
industry: industry employed in (factor with 12 levels)
occupation: occupation (factor with 9 levels)
residence: area lived in (factor with 4 levels)

To simplify the data, I controlled for the occupation, selecting only the observations that pertained to individuals employed as Craftsmen, Foremen, and kindred. I also subsetted the data frame to include only the variables:

nr
union
ethn
married
wage
industry

Below is a look at the data:

##      nr union  ethn maried     wage         industry
## 5    13    no other     no 1.568125 Personal_Service
## 16   17    no other     no 1.820334     Construction
## 78  166    no other     no 1.728221            Trade
## 80  166    no other    yes 1.851075            Trade
## 105 212    no other     no 1.784472    Manufacturing
## 121 243    no other     no 1.564181   Transportation

##        nr        union        ethn     maried         wage        
##  Min.   :   13   no :713   other:732   no :402   Min.   :-0.8877  
##  1st Qu.: 2191   yes:221   black: 70   yes:532   1st Qu.: 1.4490  
##  Median : 4510             hisp :132             Median : 1.7477  
##  Mean   : 5152                                   Mean   : 1.7119  
##  3rd Qu.: 8370                                   3rd Qu.: 2.0354  
##  Max.   :12548                                   Max.   : 2.9109  
##                                                                   
##                         industry  
##  Manufacturing              :343  
##  Construction               :174  
##  Trade                      :140  
##  Business_and_Repair_Service:124  
##  Transportation             : 63  
##  Public_Administration      : 30  
##  (Other)                    : 60

Factors and Levels

First let’s take a look at the structure of the data frame.

## 'data.frame':    934 obs. of  6 variables:
##  $ nr      : int  13 17 166 166 212 243 259 259 259 259 ...
##  $ union   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 2 2 2 2 ...
##  $ ethn    : Factor w/ 3 levels "other","black",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ maried  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 2 2 2 2 ...
##  $ wage    : num  1.57 1.82 1.73 1.85 1.78 ...
##  $ industry: Factor w/ 12 levels "Agricultural",..: 8 3 4 4 10 5 10 10 10 10 ...

From this I selected two categorical independent variables (IVs): union and industry.
union has two factors and will be the variable that I conduct null hypothesis testing on. industry has 12 factors and will be the variable that I will block on.

Continuous Variables

The response variables considered in this experiment will be wage. It is a continuous variable that represents the natural log of hourly wage earned by each observation.

2. Experimental Design

Having selected a pair of categorical IVs and a continuous dependent (response) variable, it is now time to set up the experiment for null hypothesis statistical testing on the non-blocked IV, union.

A randomized incomplete block design will be used to examine the effect of union on wage and ANOVA testing with two factors with no interaction will be conducted to analyze the effect of union while blocking across industry.

Selection of Sample Size

In order to determine the appropriate sample size for this experiment, we first need to set type I error (alpha) and type II error (beta) and estimate the effect size from the main effect of the non-blocking IV (union).

Alpha = 0.05 Beta = 0.05 Power = 1 - Beta = 0.95

Estimate effect size using Cohen’s D. Use cohen.d function from effsize package to calculate.

require(effsize)

## Loading required package: effsize

## Warning: package 'effsize' was built under R version 3.3.2

cohen.d(craftsmen$wage, craftsmen$union)

## 
## Cohen's d
## 
## d estimate: -0.5289825 (medium)
## 95 percent confidence interval:
##        inf        sup 
## -0.6821412 -0.3758239

Use G*Power to determine sample size given above inputs. G*Power Output

Sample Size = 50

#set seed so results are reproduceable
set.seed(1587)
#select 25 observations from each level of union factor to ensure equal group size
sample1 <- craftsmen[sample(which(craftsmen$union =="yes"), 25), ]
sample2 <- craftsmen[sample(which(craftsmen$union =="no"), 25), ]
#combine samples together and randomize order
my_sample <- rbind(sample1, sample2)
my_sample <- my_sample[sample(nrow(my_sample)), ]

Rationale for Design

Since we are not interested in interaction effects and want to evaluate only the main effect of one IV, it makes sense to use a randomized block design.

Randomization

While we had no control over how the original data were collected, by selecting a random sample from the data, we are incorporating randomization into the model.

Replicaton

There is no replication in this model.

Blocking

There is blocking in this model. The decision was made to block on the industry factor because we are not interested in how industry impacts wage but we do think that it will have an impact. By blocking, we are better able to evaluate the factor that we do care about, union.

3. Analysis

To begin we will only use Null Hypothesis Statistical Testing (NHST). Alternatives to NHST will be looked at in the next section. ##Exploratory Data Analysis First we will look at the histogram of wage to see if it meets the assumption of normality.

hist(craftsmen$wage, main = "Log Hourly Wage of Craftsmen")

As you can see, the distribution looks to be normal.

Next we will consider the main effect of union. Since there are only two levels the main effect is easy to measure by calculating the difference in means between the two levels.

high <- mean(subset(craftsmen$wage, craftsmen$union == "yes"))
low <- mean(subset(craftsmen$wage, craftsmen$union == "no"))
main_effect <- high - low
main_effect

## [1] 0.2418119

To give this calculation some perspective we can also look at the boxplot.

boxplot(craftsmen$wage ~ craftsmen$union, main = "Main Effect of Union Representation on Wage", xlab = "Union Representation?")

ANOVA Test

Next we will perform a two-way ANOVA test with no interaction.

model <- (aov(my_sample$wage ~ my_sample$union + my_sample$industry))
summary(model)

##                    Df Sum Sq Mean Sq F value   Pr(>F)    
## my_sample$union     1  1.341  1.3410  12.788 0.000894 ***
## my_sample$industry  6  1.303  0.2172   2.071 0.077144 .  
## Residuals          42  4.404  0.1049                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA output we get an F statistic of 12.788 and a p-value of 0.000894. This indicates that union representation has a sizable effect on the log hourly wage of craftsmen. However it should be noted that depending on the sample of 50 taken from the larger data set, the ANOVA results varied drastically with some p-values much greater than 0.05. This makes me question whether 50 is really an adequate number of observations from which to draw any meaningful conclusions.

Model Adequacey Checking

qqnorm(residuals(model))
qqline(residuals(model))

From the QQ plot, the residuals nearly form a linear line and we can summarize that the assumptions of normality are met.

plot(fitted(model),residuals(model))

Looking at this plot the distribution of points looks fairly random although we observe a few outliers and some linearity. Overall, it is not too concerning.

4. Alternatives to Null Hypothesis Statistical Testing

In addition to the NHST performed in the previous section, we will also conduct to alternative evaluation methods, Resampling Statistics and effect size.

Resampling Statistics

Since union only has two levels, we can analyze it using a t-test. First we will calculate the true t-value statistic by taking the difference in group means between union workers and non-union workers. Then by randomly selecting two groups from the same sample and computing the difference between the mean and repeating this process 10,000 times we can determine the probability of randomly finding a difference in means greater than the original t-value we computed between the two groups.

The below code was used to calculate the t-value for the two groups, union workers and non-union workers.

union_yes <- sample1$wage
union_no <- sample2$wage
#compute real statistic
real_T <- t.test(union_yes, union_no)$statistic
real_T

##        t 
## 3.358272

Next we resampled 10,000 times and assigned all values to an array called t.values.

R <- 10000
t.values <- numeric(R)

for (i in 1:R) {
  index <- sample(1:50, size = 25, replace = FALSE)
  group1 <- my_sample$wage[index]
  group2 <- my_sample$wage[-index]
  t.values[i] <- t.test(group1, group2)$statistic
}

Finally, we plot the t.values to see if it is normally distributed and compare to the true t-value to the array of t-values to determine the probability that a randomly grouped sample will yield a higher t-value than the original.

hist(t.values)

mean(t.values >= real_T)

## [1] 5e-04

As you can see from the histogram, the t.values look to be normally distributed. Also we see that our original t-value of 3.358272 is far on the right tail of the distribution and that there is only a 0.0005 probability that a t-value calculated in the resampling would be higher than the true test statistic. This indicates that there is indeed a strong main effect of union representation on hourly wage earned.

Effect Size

The following equation will be used to calculated effect size for union representation on hourly wage:
\[(Observed Effect - Hypothesized Null Effect)/Variation of Observed Effect\] In this experiment the hypothesized null effect is zero so the equation simplifies to:
\[Observed Effect/Variation of Observed Effect\]

Alternatively, we can also use the Cohen’s D function from the effsize package used earlier to help determine the sample size. However, now instead of applying it to the whole data set, we will only apply it to our sample.

cohen.d(my_sample$wage, my_sample$union)

## 
## Cohen's d
## 
## d estimate: -0.9498629 (large)
## 95 percent confidence interval:
##        inf        sup 
## -1.5634303 -0.3362955

Here we get an larger value for effect size than we did earlier. The effect size of 0.9498 is categorized as a large effect and reinforces the conclusions made earlier in this experiment that union representation has a significant and sizable effect on log hourly wage of craftsmen employed from 1980 to 1987. Looking at the associated confidence interval, however, shows a substantial range for the effect size. Still, even the lower bound of the interval is an effect size of 0.3362 which while a much smalled effect, shows that union representation does have a least some effect on hourly wage.

5 References

Literature: Montgomery, Douglas C.. Design and Analysis of Experiments, 8th Edition, PDF #6. Appendices Raw R code:

#load and unpack Ecdat package
#save Males data set as my_data
require(Ecdat)
my_data <- Males

#subset dataframe to only include Craftsmen. Get rid of unnecessary columns
craftsmen <- subset(my_data, my_data$occupation == "Craftsmen, Foremen_and_kindred")
craftsmen <- subset(craftsmen, select = -c(year, school, exper, health, occupation, residence))

#show head and summary of subsetted dataframe
head(craftsmen)
summary(craftsmen)

#show structure to give preview of factors and levels
str(craftsmen)

#load and unpack effsize package
#apply cohen d function to estimat effect size of union
require(effsize)
cohen.d(craftsmen$wage, craftsmen$union)

#select sample of 50 from dataframe
#set seed so results are reproduceable
set.seed(1587)
#select 25 observations from each level of union factor to ensure equal group size
sample1 <- craftsmen[sample(which(craftsmen$union =="yes"), 25), ]
sample2 <- craftsmen[sample(which(craftsmen$union =="no"), 25), ]
#combine samples together and randomize order
my_sample <- rbind(sample1, sample2)
my_sample <- my_sample[sample(nrow(my_sample)), ]

#plot histogram of log hourly wage
hist(craftsmen$wage, main = "Log Hourly Wage of Craftsmen")

#determine main effect of union
high <- mean(subset(craftsmen$wage, craftsmen$union == "yes"))
low <- mean(subset(craftsmen$wage, craftsmen$union == "no"))
main_effect <- high - low
main_effect

#boxplot to show main effect of union
boxplot(craftsmen$wage ~ craftsmen$union, main = "Main Effect of Union Representation on Wage", xlab = "Union Representation?")

#conduct 2 factor anova with no interaction
#union = treatment, industry = block
model <- (aov(my_sample$wage ~ my_sample$union + my_sample$industry))
summary(model)

#model adequacy checking
qqnorm(residuals(model))
qqline(residuals(model))
plot(fitted(model),residuals(model))

#resampling technique
union_yes <- sample1$wage
union_no <- sample2$wage
#compute real statistic
real_T <- t.test(union_yes, union_no)$statistic
#print test statisitic
real_T

#set R = 10000
R <- 10000
#define array of R values
t.values <- numeric(R)

#create for loop to compute R t.values for R different group combinations
for (i in 1:R) {
  index <- sample(1:50, size = 25, replace = FALSE)
  group1 <- my_sample$wage[index]
  group2 <- my_sample$wage[-index]
  t.values[i] <- t.test(group1, group2)$statistic
}

#plot histogram of t values (should be normal)
hist(t.values)
#print probability that t value would be greater than real test statistic
mean(t.values >= real_T)

#use cohen d function to compute effect size for sample of 50
cohen.d(my_sample$wage, my_sample$union)

Project 2: Null Hypothesis Statistical Testing

Clare Dorsey

November 6, 2016