From 1980 to 1987, 4360 observations were collected from 545 young males working in 9 different occupations across 12 industries. The data collected included information pertaining to hourly wage earned, ethnicity, years of schooling, years of experience, marriage status, and union representation.
I used the Males data set from the Ecdat R package. Originally the data set contained the following variables:
nr: identification
year: year observation taken
school: years of schooling
exper: years of experience (=age-6-school)
union: wage set by collective bargaining? (factor with two levels: yes, no)
ethn: ethnicity (factor with three levels: black, hisp, other)
married: marriage status (factor with two levels: yes, no)
health: health problem? (factor with two levels: yes, no)
wage: log of hourly wage
industry: industry employed in (factor with 12 levels)
occupation: occupation (factor with 9 levels)
residence: area lived in (factor with 4 levels)
To simplify the data, I controlled for the occupation, selecting only the observations that pertained to individuals employed as Craftsmen, Foremen, and kindred. I also subsetted the data frame to include only the variables:
nr
union
ethn
married
wage
industry
Below is a look at the data:
## nr union ethn maried wage industry
## 5 13 no other no 1.568125 Personal_Service
## 16 17 no other no 1.820334 Construction
## 78 166 no other no 1.728221 Trade
## 80 166 no other yes 1.851075 Trade
## 105 212 no other no 1.784472 Manufacturing
## 121 243 no other no 1.564181 Transportation
## nr union ethn maried wage
## Min. : 13 no :713 other:732 no :402 Min. :-0.8877
## 1st Qu.: 2191 yes:221 black: 70 yes:532 1st Qu.: 1.4490
## Median : 4510 hisp :132 Median : 1.7477
## Mean : 5152 Mean : 1.7119
## 3rd Qu.: 8370 3rd Qu.: 2.0354
## Max. :12548 Max. : 2.9109
##
## industry
## Manufacturing :343
## Construction :174
## Trade :140
## Business_and_Repair_Service:124
## Transportation : 63
## Public_Administration : 30
## (Other) : 60
First let’s take a look at the structure of the data frame.
## 'data.frame': 934 obs. of 6 variables:
## $ nr : int 13 17 166 166 212 243 259 259 259 259 ...
## $ union : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 2 2 2 2 ...
## $ ethn : Factor w/ 3 levels "other","black",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ maried : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 2 2 2 2 ...
## $ wage : num 1.57 1.82 1.73 1.85 1.78 ...
## $ industry: Factor w/ 12 levels "Agricultural",..: 8 3 4 4 10 5 10 10 10 10 ...
From this I selected two categorical independent variables (IVs): union and industry.
union has two factors and will be the variable that I conduct null hypothesis testing on. industry has 12 factors and will be the variable that I will block on.
The response variables considered in this experiment will be wage. It is a continuous variable that represents the natural log of hourly wage earned by each observation.
Having selected a pair of categorical IVs and a continuous dependent (response) variable, it is now time to set up the experiment for null hypothesis statistical testing on the non-blocked IV, union.
A randomized incomplete block design will be used to examine the effect of union on wage and ANOVA testing with two factors with no interaction will be conducted to analyze the effect of union while blocking across industry.
In order to determine the appropriate sample size for this experiment, we first need to set type I error (alpha) and type II error (beta) and estimate the effect size from the main effect of the non-blocking IV (union).
Alpha = 0.05 Beta = 0.05 Power = 1 - Beta = 0.95
Estimate effect size using Cohen’s D. Use cohen.d function from effsize package to calculate.
require(effsize)
## Loading required package: effsize
## Warning: package 'effsize' was built under R version 3.3.2
cohen.d(craftsmen$wage, craftsmen$union)
##
## Cohen's d
##
## d estimate: -0.5289825 (medium)
## 95 percent confidence interval:
## inf sup
## -0.6821412 -0.3758239
Use G*Power to determine sample size given above inputs.
Sample Size = 50
#set seed so results are reproduceable
set.seed(1587)
#select 25 observations from each level of union factor to ensure equal group size
sample1 <- craftsmen[sample(which(craftsmen$union =="yes"), 25), ]
sample2 <- craftsmen[sample(which(craftsmen$union =="no"), 25), ]
#combine samples together and randomize order
my_sample <- rbind(sample1, sample2)
my_sample <- my_sample[sample(nrow(my_sample)), ]
Since we are not interested in interaction effects and want to evaluate only the main effect of one IV, it makes sense to use a randomized block design.
While we had no control over how the original data were collected, by selecting a random sample from the data, we are incorporating randomization into the model.
There is no replication in this model.
There is blocking in this model. The decision was made to block on the industry factor because we are not interested in how industry impacts wage but we do think that it will have an impact. By blocking, we are better able to evaluate the factor that we do care about, union.
To begin we will only use Null Hypothesis Statistical Testing (NHST). Alternatives to NHST will be looked at in the next section. ##Exploratory Data Analysis First we will look at the histogram of wage to see if it meets the assumption of normality.
hist(craftsmen$wage, main = "Log Hourly Wage of Craftsmen")
As you can see, the distribution looks to be normal.
Next we will consider the main effect of union. Since there are only two levels the main effect is easy to measure by calculating the difference in means between the two levels.
high <- mean(subset(craftsmen$wage, craftsmen$union == "yes"))
low <- mean(subset(craftsmen$wage, craftsmen$union == "no"))
main_effect <- high - low
main_effect
## [1] 0.2418119
To give this calculation some perspective we can also look at the boxplot.
boxplot(craftsmen$wage ~ craftsmen$union, main = "Main Effect of Union Representation on Wage", xlab = "Union Representation?")
Next we will perform a two-way ANOVA test with no interaction.
model <- (aov(my_sample$wage ~ my_sample$union + my_sample$industry))
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## my_sample$union 1 1.341 1.3410 12.788 0.000894 ***
## my_sample$industry 6 1.303 0.2172 2.071 0.077144 .
## Residuals 42 4.404 0.1049
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the ANOVA output we get an F statistic of 12.788 and a p-value of 0.000894. This indicates that union representation has a sizable effect on the log hourly wage of craftsmen. However it should be noted that depending on the sample of 50 taken from the larger data set, the ANOVA results varied drastically with some p-values much greater than 0.05. This makes me question whether 50 is really an adequate number of observations from which to draw any meaningful conclusions.
qqnorm(residuals(model))
qqline(residuals(model))
From the QQ plot, the residuals nearly form a linear line and we can summarize that the assumptions of normality are met.
plot(fitted(model),residuals(model))
Looking at this plot the distribution of points looks fairly random although we observe a few outliers and some linearity. Overall, it is not too concerning.
In addition to the NHST performed in the previous section, we will also conduct to alternative evaluation methods, Resampling Statistics and effect size.
Since union only has two levels, we can analyze it using a t-test. First we will calculate the true t-value statistic by taking the difference in group means between union workers and non-union workers. Then by randomly selecting two groups from the same sample and computing the difference between the mean and repeating this process 10,000 times we can determine the probability of randomly finding a difference in means greater than the original t-value we computed between the two groups.
The below code was used to calculate the t-value for the two groups, union workers and non-union workers.
union_yes <- sample1$wage
union_no <- sample2$wage
#compute real statistic
real_T <- t.test(union_yes, union_no)$statistic
real_T
## t
## 3.358272
Next we resampled 10,000 times and assigned all values to an array called t.values.
R <- 10000
t.values <- numeric(R)
for (i in 1:R) {
index <- sample(1:50, size = 25, replace = FALSE)
group1 <- my_sample$wage[index]
group2 <- my_sample$wage[-index]
t.values[i] <- t.test(group1, group2)$statistic
}
Finally, we plot the t.values to see if it is normally distributed and compare to the true t-value to the array of t-values to determine the probability that a randomly grouped sample will yield a higher t-value than the original.
hist(t.values)
mean(t.values >= real_T)
## [1] 5e-04
As you can see from the histogram, the t.values look to be normally distributed. Also we see that our original t-value of 3.358272 is far on the right tail of the distribution and that there is only a 0.0005 probability that a t-value calculated in the resampling would be higher than the true test statistic. This indicates that there is indeed a strong main effect of union representation on hourly wage earned.
The following equation will be used to calculated effect size for union representation on hourly wage:
\[(Observed Effect - Hypothesized Null Effect)/Variation of Observed Effect\] In this experiment the hypothesized null effect is zero so the equation simplifies to:
\[Observed Effect/Variation of Observed Effect\]
Alternatively, we can also use the Cohen’s D function from the effsize package used earlier to help determine the sample size. However, now instead of applying it to the whole data set, we will only apply it to our sample.
cohen.d(my_sample$wage, my_sample$union)
##
## Cohen's d
##
## d estimate: -0.9498629 (large)
## 95 percent confidence interval:
## inf sup
## -1.5634303 -0.3362955
Here we get an larger value for effect size than we did earlier. The effect size of 0.9498 is categorized as a large effect and reinforces the conclusions made earlier in this experiment that union representation has a significant and sizable effect on log hourly wage of craftsmen employed from 1980 to 1987. Looking at the associated confidence interval, however, shows a substantial range for the effect size. Still, even the lower bound of the interval is an effect size of 0.3362 which while a much smalled effect, shows that union representation does have a least some effect on hourly wage.
Literature: Montgomery, Douglas C.. Design and Analysis of Experiments, 8th Edition, PDF #6. Appendices Raw R code:
#load and unpack Ecdat package
#save Males data set as my_data
require(Ecdat)
my_data <- Males
#subset dataframe to only include Craftsmen. Get rid of unnecessary columns
craftsmen <- subset(my_data, my_data$occupation == "Craftsmen, Foremen_and_kindred")
craftsmen <- subset(craftsmen, select = -c(year, school, exper, health, occupation, residence))
#show head and summary of subsetted dataframe
head(craftsmen)
summary(craftsmen)
#show structure to give preview of factors and levels
str(craftsmen)
#load and unpack effsize package
#apply cohen d function to estimat effect size of union
require(effsize)
cohen.d(craftsmen$wage, craftsmen$union)
#select sample of 50 from dataframe
#set seed so results are reproduceable
set.seed(1587)
#select 25 observations from each level of union factor to ensure equal group size
sample1 <- craftsmen[sample(which(craftsmen$union =="yes"), 25), ]
sample2 <- craftsmen[sample(which(craftsmen$union =="no"), 25), ]
#combine samples together and randomize order
my_sample <- rbind(sample1, sample2)
my_sample <- my_sample[sample(nrow(my_sample)), ]
#plot histogram of log hourly wage
hist(craftsmen$wage, main = "Log Hourly Wage of Craftsmen")
#determine main effect of union
high <- mean(subset(craftsmen$wage, craftsmen$union == "yes"))
low <- mean(subset(craftsmen$wage, craftsmen$union == "no"))
main_effect <- high - low
main_effect
#boxplot to show main effect of union
boxplot(craftsmen$wage ~ craftsmen$union, main = "Main Effect of Union Representation on Wage", xlab = "Union Representation?")
#conduct 2 factor anova with no interaction
#union = treatment, industry = block
model <- (aov(my_sample$wage ~ my_sample$union + my_sample$industry))
summary(model)
#model adequacy checking
qqnorm(residuals(model))
qqline(residuals(model))
plot(fitted(model),residuals(model))
#resampling technique
union_yes <- sample1$wage
union_no <- sample2$wage
#compute real statistic
real_T <- t.test(union_yes, union_no)$statistic
#print test statisitic
real_T
#set R = 10000
R <- 10000
#define array of R values
t.values <- numeric(R)
#create for loop to compute R t.values for R different group combinations
for (i in 1:R) {
index <- sample(1:50, size = 25, replace = FALSE)
group1 <- my_sample$wage[index]
group2 <- my_sample$wage[-index]
t.values[i] <- t.test(group1, group2)$statistic
}
#plot histogram of t values (should be normal)
hist(t.values)
#print probability that t value would be greater than real test statistic
mean(t.values >= real_T)
#use cohen d function to compute effect size for sample of 50
cohen.d(my_sample$wage, my_sample$union)