From 1976 to 1982, a panel of 595 individual workers was observed, giving 4165 observations in total (595 workers over 7 years). The data include information such as work experience, type of worker, place of residence, marital status and gender. I used the Wages data from the Ecdat R package. The following is a look at the data:
## exp wks bluecol ind south smsa married sex union ed black lwage
## 1 3 32 no 0 yes no yes male no 9 no 5.56068
## 2 4 43 no 0 yes no yes male no 9 no 5.72031
## 3 5 40 no 0 yes no yes male no 9 no 5.99645
## 4 6 39 no 0 yes no yes male no 9 no 5.99645
## 5 7 42 no 1 yes no yes male no 9 no 6.06146
## 6 8 35 no 1 yes no yes male no 9 no 6.17379
Let’s first look at the structure of the data frame.
## 'data.frame': 4165 obs. of 12 variables:
## $ exp : int 3 4 5 6 7 8 9 30 31 32 ...
## $ wks : int 32 43 40 39 42 35 32 34 27 33 ...
## $ bluecol: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 2 2 ...
## $ ind : int 0 0 0 0 1 1 1 0 0 1 ...
## $ south : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 1 1 1 ...
## $ smsa : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ married: Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ sex : Factor w/ 2 levels "female","male": 2 2 2 2 2 2 2 2 2 2 ...
## $ union : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
## $ ed : int 9 9 9 9 9 9 9 11 11 11 ...
## $ black : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ lwage : num 5.56 5.72 6 6 6.06 ...
From this, the categorical independent variables selected are sex (the gender of the individual) and black (whether or not the individual is black). Both of these variables have two levels each.
The continuous dependent variable selected is lwage, the logarithm of the wage.
With a pair of categorical IVs and a continuous dependent variable selected, the experiment applies null hypothesis statistical testing to the non-blocked independent variable, black.
A randomized incomplete block design will be used to examine the effect of black on lwage. A two-factor ANOVA with no interaction term will be conducted to analyze the effect of black while blocking across sex.
To determine an appropriate sample size for this experiment, we first set the Type I error rate to 0.05 and the Type II error rate to 0.1, which corresponds to a power of 0.9.
The effect size is estimated using Cohen's d as follows.
library(effsize)
cohen.d(W$lwage,W$black)
##
## Cohen's d
##
## d estimate: 0.7452586 (medium)
## 95 percent confidence interval:
## inf sup
## 0.6268203 0.8636969
G*Power was utilized to determine the sample size based on the inputs given above. The output parameters given are the following:
Noncentrality parameter: 12.2190284
Critical F: 4.3512435
Numerator df: 1
Denominator df: 20
Total sample size: 22
Actual power: 0.9136259
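The G*Power figures above can be cross-checked in base R. The following is a minimal sketch, assuming that the Cohen's d estimate of 0.7452586 was entered into G*Power as the effect size f and that the test was a fixed-effects ANOVA F test with two groups. Under those assumptions the noncentrality parameter is f squared times the total sample size, and the power comes from the noncentral F distribution. The variable names are illustrative.

```r
# Inputs taken from the text above. Treating d as G*Power's effect size f
# is an assumption, not something the report states.
f_effect <- 0.7452586
alpha    <- 0.05
n_total  <- 22
n_groups <- 2
df_num   <- 1

df_den <- n_total - n_groups                 # denominator degrees of freedom
ncp    <- f_effect^2 * n_total               # noncentrality parameter
crit_f <- qf(1 - alpha, df_num, df_den)      # critical F at the chosen alpha
power  <- pf(crit_f, df_num, df_den, ncp = ncp, lower.tail = FALSE)

round(c(ncp = ncp, crit_f = crit_f, df_den = df_den, power = power), 4)
```

If the assumptions hold, the printed values should be close to the G*Power output listed above.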
#set seed so results are able to be reproduced
set.seed(512)
#select 11 observations from each level of the black factor
sample1 <- W[sample(which(W$black == "yes"),11),]
sample2 <- W[sample(which(W$black == "no"),11),]
#combine sample and randomize order
wsample <- rbind(sample1,sample2)
Wsample <- wsample[sample(nrow(wsample),22),]
Wsample
## exp wks bluecol ind south smsa married sex union ed black lwage
## 2083 14 49 yes 1 yes yes yes male no 12 yes 6.66568
## 1621 30 47 no 0 no no yes male no 17 no 7.01571
## 3087 34 51 yes 0 no no yes male no 12 no 6.66823
## 420 24 43 yes 0 no yes no female yes 11 yes 6.17170
## 3725 32 50 no 0 no no yes male no 13 no 6.29157
## 3380 12 49 no 0 yes yes no male yes 14 yes 6.84268
## 2283 10 39 yes 0 yes yes no female no 12 yes 5.93225
## 2670 22 49 yes 0 no yes yes male yes 12 yes 6.33683
## 3687 9 49 no 0 no yes yes male no 16 yes 6.74524
## 4103 12 50 yes 0 no yes yes male yes 12 yes 6.61874
## 1768 10 49 yes 1 yes no yes male yes 12 yes 6.58617
## 927 21 50 yes 1 yes yes yes male yes 12 yes 6.58617
## 1265 18 49 yes 1 yes no yes male no 12 no 7.13090
## 1254 11 48 no 0 no yes yes male no 12 no 6.28972
## 3239 17 47 no 0 no yes no female no 17 no 6.90776
## 524 28 51 yes 1 yes no yes male no 8 yes 6.43775
## 3510 16 44 no 0 no no no female yes 17 no 6.77422
## 652 11 36 no 0 no yes yes male yes 17 no 6.79122
## 1330 12 50 yes 1 yes no yes male no 9 no 6.95464
## 2319 37 48 yes 0 yes yes no female no 12 yes 5.82008
## 547 10 50 yes 0 no no no female no 8 no 5.42935
## 3020 29 52 yes 0 yes yes yes male yes 12 no 7.03878
Since we are not looking at interaction effects and want to evaluate only the main effect of one independent variable, it’s logical to use a randomized block design.
Because there was no way to control how the original data were collected, randomization is incorporated into the model by drawing a random sample from the data set.
Replication isn’t utilized in this model.
First, look at the histogram of logarithmic wage.
It appears to follow an approximately normal distribution.
Next we will determine the main effect of the variable black. Because there are only two levels the main effect is the difference between the means of the two levels.
high <- mean(subset(W$lwage,W$black =="yes"))
low <- mean(subset(W$lwage,W$black == "no"))
main_effect <- high-low
main_effect
## [1] -0.3377531
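Because lwage is a logarithm, a difference in mean lwage can be read as a ratio on the original wage scale. As a rough sketch, assuming lwage is the natural logarithm of the wage, exponentiating the main effect gives the ratio of the geometric mean wages of the two groups. The names wage_ratio and percent_diff are illustrative.

```r
# Main effect on the log scale, as printed above
main_effect <- -0.3377531

# Assumption: lwage is a natural logarithm, so exp() undoes it
wage_ratio   <- exp(main_effect)         # geometric mean wage ratio, "yes" / "no"
percent_diff <- (wage_ratio - 1) * 100   # percent difference in geometric mean wage

round(c(wage_ratio = wage_ratio, percent_diff = percent_diff), 2)
```

A ratio below 1 means that the geometric mean wage of the "yes" group is lower than that of the "no" group.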
We can also look at the boxplot.
boxplot(W$lwage ~ W$black,main="Main Effect of Race on Logarithmic Wage",xlab="Is the individual Black?")
Next, perform a two-way ANOVA with no interaction term:
model <- aov(Wsample$lwage~Wsample$black+Wsample$sex)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Wsample$black 1 0.2953 0.2953 2.333 0.14317
## Wsample$sex 1 1.1571 1.1571 9.140 0.00699 **
## Residuals 19 2.4053 0.1266
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
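With an F value of 2.333 and a p-value of 0.14317 for the variable black, the effect on the logarithmic wage is not statistically significant at the 0.05 level in this sample. The blocking variable sex, by contrast, is significant (p = 0.00699).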
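The ANOVA above can be reproduced without the Ecdat package by typing in the 22 sampled rows. The sketch below transcribes black, sex and lwage from the printed sample (the data frame name sample_df is my own), fits the same additive model through the data argument so that the term labels are shorter, and extracts the p-value for black so it can be compared with the 0.05 threshold set earlier.

```r
# Values transcribed from the printed sample, in the order shown
sample_df <- data.frame(
  black = factor(c("yes", "no", "no", "yes", "no", "yes", "yes", "yes",
                   "yes", "yes", "yes", "yes", "no", "no", "no", "yes",
                   "no", "no", "no", "yes", "no", "no")),
  sex = factor(c("male", "male", "male", "female", "male", "male", "female",
                 "male", "male", "male", "male", "male", "male", "male",
                 "female", "male", "female", "male", "male", "female",
                 "female", "male")),
  lwage = c(6.66568, 7.01571, 6.66823, 6.17170, 6.29157, 6.84268, 5.93225,
            6.33683, 6.74524, 6.61874, 6.58617, 6.58617, 7.13090, 6.28972,
            6.90776, 6.43775, 6.77422, 6.79122, 6.95464, 5.82008, 5.42935,
            7.03878)
)

# How the blocking variable is distributed across the two groups
table(sample_df$sex, sample_df$black)

# Same additive model as above, written with the data argument
fit <- aov(lwage ~ black + sex, data = sample_df)
anova_table <- summary(fit)[[1]]
anova_table

# p-value for black (first row of the table) against the 0.05 threshold
p_black <- anova_table[["Pr(>F)"]][1]
p_black < 0.05
```

In this sample each level of black contains 8 men and 3 women, so the two factors are balanced against each other and the order of the terms in the formula should not change the sums of squares.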
In the Q-Q plot, the residuals fall approximately along a straight line, so the normality assumption appears to be reasonably met.
qqnorm(residuals(model))
qqline(residuals(model))
In the plot of residuals against fitted values below, the points appear fairly randomly scattered, although some linear pattern is visible.
plot(fitted(model),residuals(model))
As a supplement to the NHST performed in section 3, two alternative evaluation methods will be used: resampling statistics and effect size.
Because black has only two levels, it will be analyzed using a t-test. First, calculate the observed t statistic for the difference in group means between black individuals and individuals who aren't black. Then randomly split the same sample into two groups, compute the t statistic for the difference between their means, and repeat the process 10,000 times. The resulting distribution gives the probability of randomly finding a t-value greater than the one originally computed between the two groups.
Calculate the t-value for the two groups: black individuals and individuals who aren't black.
black_y <-sample1$lwage
black_n <-sample2$lwage
real_T <- t.test(black_y,black_n)$statistic
real_T
## t
## -1.287574
Resample 10,000 times and store the resulting values in a vector called t.values:
R <- 10000
t.values <- numeric(R)
for (i in 1:R) {
  index <- sample(1:22, size = 11, replace = FALSE)
  group1 <- Wsample$lwage[index]
  group2 <- Wsample$lwage[-index]
  t.values[i] <- t.test(group1, group2)$statistic
}
Finally, plot t.values to see whether they are normally distributed, and compare the observed t-value with the resampled t-values to determine the probability that a randomly grouped sample yields a higher t-value than the original.
hist(t.values)
mean(t.values >= real_T)
## [1] 0.8921
From the histogram, the t.values appear to be approximately normally distributed. The original t-value of -1.287574 lies within the bulk of the distribution rather than in its extreme tail: 89.21% of the resampled t-values are at least as high as the observed test statistic, so about 10.79% are lower. This indicates that there isn't a significant main effect of whether or not an individual is black on the logarithmic wage.
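The proportion computed above looks only at resampled t-values at or above the observed one. One common variant, shown here as a sketch and not as the method used in this report, compares absolute t-values to obtain a two-sided resampling p-value. The lwage values are transcribed from the printed sample, and names such as perm_t, p_lower and p_two_sided are illustrative.

```r
# lwage values transcribed from the printed sample, split by black
black_y <- c(6.66568, 6.17170, 6.84268, 5.93225, 6.33683, 6.74524,
             6.61874, 6.58617, 6.58617, 6.43775, 5.82008)
black_n <- c(7.01571, 6.66823, 6.29157, 7.13090, 6.28972, 6.90776,
             6.77422, 6.79122, 6.95464, 5.42935, 7.03878)
pooled <- c(black_y, black_n)

# Observed Welch t statistic for the two groups
observed_t <- unname(t.test(black_y, black_n)$statistic)

# Arbitrary seed; the proportions vary slightly from run to run
set.seed(1)
R <- 10000
perm_t <- numeric(R)
for (i in seq_len(R)) {
  # Randomly reassign the 22 values to two groups of 11
  index <- sample(length(pooled), size = length(black_y))
  perm_t[i] <- t.test(pooled[index], pooled[-index])$statistic
}

p_lower     <- mean(perm_t <= observed_t)            # one-sided, lower tail
p_two_sided <- mean(abs(perm_t) >= abs(observed_t))  # two-sided

c(observed_t = observed_t, p_lower = p_lower, p_two_sided = p_two_sided)
```

The lower-tail proportion is the complement of the upper-tail proportion used above (ties aside), and the two-sided version counts extreme values in either direction.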
Apply the Cohen’s d function to the sample data set:
cohen.d(Wsample$lwage,Wsample$black)
##
## Cohen's d
##
## d estimate: 0.5490232 (medium)
## 95 percent confidence interval:
## inf sup
## -0.4029837 1.5010301
The effect size is smaller than the one estimated earlier from the full data set. The value of 0.5490232 is categorized as a medium effect, and its 95 percent confidence interval (-0.4029837 to 1.5010301) includes zero. This reinforces the conclusion made earlier that whether or not an individual is black doesn’t have a strong effect on the logarithmic wage.
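For two groups of equal size, the magnitude of Cohen's d can be recovered from the t statistic as |t| times sqrt(1/n1 + 1/n2), because the Welch and pooled standard errors coincide when n1 equals n2. The short check below is a sketch that uses only the t-value and group sizes reported earlier; d_from_t is an illustrative name.

```r
# t statistic and group sizes reported in the resampling section
real_T <- -1.287574
n1 <- 11
n2 <- 11

# Magnitude of Cohen's d implied by the t statistic (equal group sizes)
d_from_t <- abs(real_T) * sqrt(1 / n1 + 1 / n2)
d_from_t
```

The result should agree with the d estimate printed above up to rounding, which ties the effect size back to the t-test on the same sample.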
The complete R code used in this analysis is listed below.
W <- Ecdat::Wages
head(W)
str(W)
library(effsize)
cohen.d(W$lwage,W$black)
#set seed so results are able to be reproduced
set.seed(512)
#select 11 observations from each level of the black factor
sample1 <- W[sample(which(W$black == "yes"),11),]
sample2 <- W[sample(which(W$black == "no"),11),]
#combine sample and randomize order
wsample <- rbind(sample1,sample2)
Wsample <- wsample[sample(nrow(wsample),22),]
Wsample
hist(W$lwage,main="Histogram of Logarithmic Wage",xlab="Logarithmic Wage ($)")
high <- mean(subset(W$lwage,W$black =="yes"))
low <- mean(subset(W$lwage,W$black == "no"))
main_effect <- high-low
main_effect
boxplot(W$lwage ~ W$black,main="Main Effect of Race on Logarithmic Wage",xlab="Is the individual Black?")
model <- aov(Wsample$lwage~Wsample$black+Wsample$sex)
summary(model)
qqnorm(residuals(model))
qqline(residuals(model))
plot(fitted(model),residuals(model))
black_y <-sample1$lwage
black_n <-sample2$lwage
real_T <- t.test(black_y,black_n)$statistic
real_T
R <- 10000
t.values <- numeric(R)
for (i in 1:R) {
  index <- sample(1:22, size = 11, replace = FALSE)
  group1 <- Wsample$lwage[index]
  group2 <- Wsample$lwage[-index]
  t.values[i] <- t.test(group1, group2)$statistic
}
hist(t.values)
mean(t.values >= real_T)
cohen.d(Wsample$lwage,Wsample$black)