1. Setting

System under test

From 1976 to 1982, a panel of 595 individual workers was observed once per year, giving 4165 observations in total (595 workers × 7 years). The data collected included information such as work experience, type of worker, place of residence, marital status and gender. I used the Wages data from the Ecdat R package. The following is a look at the data:

##   exp wks bluecol ind south smsa married  sex union ed black   lwage
## 1   3  32      no   0   yes   no     yes male    no  9    no 5.56068
## 2   4  43      no   0   yes   no     yes male    no  9    no 5.72031
## 3   5  40      no   0   yes   no     yes male    no  9    no 5.99645
## 4   6  39      no   0   yes   no     yes male    no  9    no 5.99645
## 5   7  42      no   1   yes   no     yes male    no  9    no 6.06146
## 6   8  35      no   1   yes   no     yes male    no  9    no 6.17379

Factors and Levels

Let’s first look at the structure of the data frame.

## 'data.frame':    4165 obs. of  12 variables:
##  $ exp    : int  3 4 5 6 7 8 9 30 31 32 ...
##  $ wks    : int  32 43 40 39 42 35 32 34 27 33 ...
##  $ bluecol: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 2 2 ...
##  $ ind    : int  0 0 0 0 1 1 1 0 0 1 ...
##  $ south  : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 1 1 1 ...
##  $ smsa   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ married: Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ sex    : Factor w/ 2 levels "female","male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ union  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
##  $ ed     : int  9 9 9 9 9 9 9 11 11 11 ...
##  $ black  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ lwage  : num  5.56 5.72 6 6 6.06 ...

From this, two categorical independent variables are selected: sex (the gender of the individual) and black (whether or not the individual is black). Each of these factors has two levels.

Continuous Variables

The continuous variable selected is lwage, the logarithm of the wage.

2. Experimental Design

After selecting a pair of categorical IVs and a continuous dependent variable, the experiment applies null hypothesis statistical testing to the non-blocked independent variable, black.

A randomized incomplete block design will be used to examine the effect of black on lwage. A two-factor ANOVA with no interaction term will be conducted to analyze the effect of black while blocking on sex.

Selection of Sample Size

To determine an appropriate sample size for this experiment, we first set the Type I error rate to 0.05 and the Type II error rate to 0.1, which corresponds to a target power of 0.9.

The effect size is estimated with Cohen’s d, computed on the full data set as follows.

library(effsize)
cohen.d(W$lwage,W$black)
## 
## Cohen's d
## 
## d estimate: 0.7452586 (medium)
## 95 percent confidence interval:
##       inf       sup 
## 0.6268203 0.8636969

G*Power was utilized to determine the sample size based on the inputs given above. The output parameters given are the following:

Noncentrality parameter: 12.2190284

Critical F: 4.3512435

Numerator df: 1

Denominator df: 20

Total sample size: 22

Actual power: 0.9136259
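
The G*Power figures above can be cross-checked in base R. The following is a minimal sketch, not the G*Power procedure itself: it assumes a one-way fixed-effects ANOVA with two groups, and it assumes that the Cohen's d estimate of 0.7452586 was entered as G*Power's effect size f, which is what the reported noncentrality parameter implies (0.7452586² × 22 ≈ 12.219). The variable names are illustrative.

```r
# Assumption: the Cohen's d estimate was entered as G*Power's effect size f
f     <- 0.7452586  # effect size as entered (assumed)
n     <- 22         # total sample size reported by G*Power
k     <- 2          # number of groups (levels of black)
alpha <- 0.05       # Type I error rate

df1    <- k - 1                      # numerator degrees of freedom
df2    <- n - k                      # denominator degrees of freedom
lambda <- f^2 * n                    # noncentrality parameter
f_crit <- qf(1 - alpha, df1, df2)    # critical F under the null
# Power: probability that a noncentral F exceeds the critical value
power  <- pf(f_crit, df1, df2, ncp = lambda, lower.tail = FALSE)

print(round(c(lambda = lambda, critical_F = f_crit, power = power), 4))
```

For two groups of equal size, Cohen's f equals d/2, so entering d directly as f assumes a larger effect than d = 0.745 describes; with f = d/2 the same calculation would call for a larger sample.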

#set seed so results are able to be reproduced
set.seed(512)
#select 11 observations from each level of the black factor
sample1 <- W[sample(which(W$black == "yes"),11),]
sample2 <- W[sample(which(W$black == "no"),11),]
#combine sample and randomize order
wsample <- rbind(sample1,sample2)
Wsample <- wsample[sample(nrow(wsample),22),]
Wsample
##      exp wks bluecol ind south smsa married    sex union ed black   lwage
## 2083  14  49     yes   1   yes  yes     yes   male    no 12   yes 6.66568
## 1621  30  47      no   0    no   no     yes   male    no 17    no 7.01571
## 3087  34  51     yes   0    no   no     yes   male    no 12    no 6.66823
## 420   24  43     yes   0    no  yes      no female   yes 11   yes 6.17170
## 3725  32  50      no   0    no   no     yes   male    no 13    no 6.29157
## 3380  12  49      no   0   yes  yes      no   male   yes 14   yes 6.84268
## 2283  10  39     yes   0   yes  yes      no female    no 12   yes 5.93225
## 2670  22  49     yes   0    no  yes     yes   male   yes 12   yes 6.33683
## 3687   9  49      no   0    no  yes     yes   male    no 16   yes 6.74524
## 4103  12  50     yes   0    no  yes     yes   male   yes 12   yes 6.61874
## 1768  10  49     yes   1   yes   no     yes   male   yes 12   yes 6.58617
## 927   21  50     yes   1   yes  yes     yes   male   yes 12   yes 6.58617
## 1265  18  49     yes   1   yes   no     yes   male    no 12    no 7.13090
## 1254  11  48      no   0    no  yes     yes   male    no 12    no 6.28972
## 3239  17  47      no   0    no  yes      no female    no 17    no 6.90776
## 524   28  51     yes   1   yes   no     yes   male    no  8   yes 6.43775
## 3510  16  44      no   0    no   no      no female   yes 17    no 6.77422
## 652   11  36      no   0    no  yes     yes   male   yes 17    no 6.79122
## 1330  12  50     yes   1   yes   no     yes   male    no  9    no 6.95464
## 2319  37  48     yes   0   yes  yes      no female    no 12   yes 5.82008
## 547   10  50     yes   0    no   no      no female    no  8    no 5.42935
## 3020  29  52     yes   0   yes  yes     yes   male   yes 12    no 7.03878

Rationale for Design

Since we are not looking at interaction effects and want to evaluate only the main effect of one independent variable, it’s logical to use a randomized block design.

Randomization

There was no way to control how the original data were collected, so randomization is incorporated by drawing a random sample from the data set and randomizing the order of the sampled observations.

Replication

Replication isn’t utilized in this model.

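Blocking

The design blocks on sex: it enters the ANOVA as a second factor so that its variation is separated from the effect of black.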

3. Analysis

First, look at the histogram of the logarithmic wage.

It appears to follow a normal distribution.

Next, we determine the main effect of the variable black. Because there are only two levels, the main effect is the difference between the means of the two levels.

high <- mean(subset(W$lwage,W$black =="yes"))
low <- mean(subset(W$lwage,W$black == "no"))
main_effect <- high-low
main_effect
## [1] -0.3377531

We can also look at the boxplot.

boxplot(W$lwage ~ W$black,main="Main Effect of Race on Logarithmic Wage",xlab="Is the individual Black?")

ANOVA Test

Perform a two-way ANOVA with no interaction term.

model <- aov(Wsample$lwage~Wsample$black+Wsample$sex)
summary(model)
##               Df Sum Sq Mean Sq F value  Pr(>F)   
## Wsample$black  1 0.2953  0.2953   2.333 0.14317   
## Wsample$sex    1 1.1571  1.1571   9.140 0.00699 **
## Residuals     19 2.4053  0.1266                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

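For the variable black, the F value is 2.333 and the p-value is 0.143. Since the p-value is above the 0.05 threshold, we fail to reject the null hypothesis: the sample does not show a statistically significant effect of black on the logarithmic wage. The blocking variable sex, by contrast, is significant (p = 0.007).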
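
The same model can be fitted with a data argument, which keeps the term names short and makes the p-value easy to extract. This is a sketch under one assumption: the data frame ws (a hypothetical name) is retyped from the printed Wsample rows, grouped by black, so its values are rounded to the five decimals shown.

```r
# 22 sampled rows retyped from the printed Wsample output (rounded as printed);
# the first 11 rows are black == "yes", the last 11 are black == "no"
ws <- data.frame(
  lwage = c(6.66568, 6.17170, 6.84268, 5.93225, 6.33683, 6.74524,
            6.61874, 6.58617, 6.58617, 6.43775, 5.82008,
            7.01571, 6.66823, 6.29157, 7.13090, 6.28972, 6.90776,
            6.77422, 6.79122, 6.95464, 5.42935, 7.03878),
  black = factor(rep(c("yes", "no"), each = 11)),
  sex   = factor(c("male", "female", "male", "female", "male", "male",
                   "male", "male", "male", "male", "female",
                   "male", "male", "male", "male", "male", "female",
                   "female", "male", "male", "female", "male"))
)

# Additive two-factor model: black is the factor of interest, sex is the block
model_ws <- aov(lwage ~ black + sex, data = ws)
tab <- summary(model_ws)[[1]]   # ANOVA table as a data frame
print(tab)

# p-value for black (first row of the table)
p_black <- tab[["Pr(>F)"]][1]
print(p_black > 0.05)           # TRUE means: fail to reject the null at 0.05
```

In this sample each level of black contains 3 women and 8 men, so the two factors are balanced against each other and the order of the terms in the formula does not change the sums of squares.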

Model Adequacy Checking

In the QQ plot, the residuals fall approximately on a straight line, so the normality assumption can be taken as met.

qqnorm(residuals(model))
qqline(residuals(model))

In the plot of residuals against fitted values below, the points appear fairly randomly distributed, although some linear structure is visible.

plot(fitted(model),residuals(model))

4. Alternatives to Null Hypothesis Statistical Testing

As a supplement to the NHST performed in Section 3, two alternative evaluation methods, resampling statistics and effect size, will be used.

Resampling Statistics

Because black has only two levels, it can be analyzed with a t-test. First, calculate the observed t-value for the difference in mean lwage between black individuals and individuals who aren’t black in the sample. Then randomly split the same 22 observations into two groups of 11, compute the t-value for that split, and repeat the process 10,000 times. The share of resampled t-values that are at least as large as the observed one estimates the probability of finding such a difference by chance.

Calculate the t-value for the two groups: black individuals and individuals who aren’t black.

black_y <-sample1$lwage
black_n <-sample2$lwage
real_T <- t.test(black_y,black_n)$statistic
real_T
##         t 
## -1.287574

Resample 10,000 times and store the values in a vector called t.values.

R <- 10000
t.values <- numeric(R)
for (i in 1:R) {
  index <- sample(1:22, size = 11, replace = FALSE)
  group1 <- Wsample$lwage[index]
  group2 <- Wsample$lwage[-index]
  t.values[i] <- t.test(group1, group2)$statistic
}

Finally, plot t.values to see whether they are normally distributed, and compare the observed t-value with the resampled t-values to determine the probability that a randomly grouped sample yields a t-value at least as high as the original.

hist(t.values)

mean(t.values >= real_T)
## [1] 0.8921

From the histogram, the t.values appear to be approximately normally distributed. The original t-value of -1.287574 lies in the lower part of the distribution but not in its extreme tail: 89.21% of the resampled t-values are at least as large as the observed test statistic, so about 10.8% are smaller. This indicates that there isn’t a significant main effect of whether or not an individual is black on the logarithmic wage.
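
Because the observed t-value is negative, the share of resampled values at or above it is not the tail probability that a permutation test usually reports. A minimal sketch of the lower-tail and two-sided versions follows. It assumes the 22 lwage values retyped from the printed sample (rounded as shown); the names pooled, t_perm, p_lower and p_two_sided are illustrative and not part of the original analysis.

```r
set.seed(512)  # seed reused from the report; any fixed seed works

# lwage values retyped from the printed sample (assumption: rounded as printed)
black_y <- c(6.66568, 6.17170, 6.84268, 5.93225, 6.33683, 6.74524,
             6.61874, 6.58617, 6.58617, 6.43775, 5.82008)
black_n <- c(7.01571, 6.66823, 6.29157, 7.13090, 6.28972, 6.90776,
             6.77422, 6.79122, 6.95464, 5.42935, 7.03878)
pooled  <- c(black_y, black_n)

# Observed Welch t statistic (black minus non-black)
t_obs <- unname(t.test(black_y, black_n)$statistic)

# Reshuffle group membership R times and recompute the statistic each time
R <- 10000
t_perm <- unname(replicate(R, {
  idx <- sample(length(pooled), length(black_y))
  t.test(pooled[idx], pooled[-idx])$statistic
}))

# Lower tail: resampled values at least as negative as the observed one
p_lower <- mean(t_perm <= t_obs)
# Two-sided: resampled values at least as extreme in either direction
p_two_sided <- mean(abs(t_perm) >= abs(t_obs))

print(c(t_obs = t_obs, p_lower = p_lower, p_two_sided = p_two_sided))
```

Either tail probability can then be compared with the 0.05 Type I error rate chosen in the experimental design.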

Effect Size

Apply the cohen.d function to the sample data set.

cohen.d(Wsample$lwage,Wsample$black)
## 
## Cohen's d
## 
## d estimate: 0.5490232 (medium)
## 95 percent confidence interval:
##        inf        sup 
## -0.4029837  1.5010301

The effect size in the sample is smaller than the value estimated from the full data set. The estimate of 0.5490232 is categorized as a medium effect, and its 95% confidence interval (-0.40 to 1.50) includes zero. This is consistent with the earlier conclusion that the sample does not show a significant effect of whether or not an individual is black on the logarithmic wage.
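
For reference, the estimate reported by cohen.d can be reproduced by hand as the difference in group means divided by the pooled standard deviation. The sketch below assumes the lwage values retyped from the printed sample and uses the sign convention (non-black minus black) that matches the positive estimate in the output; the variable names are illustrative.

```r
# lwage values retyped from the printed sample (assumption: rounded as printed)
black_y <- c(6.66568, 6.17170, 6.84268, 5.93225, 6.33683, 6.74524,
             6.61874, 6.58617, 6.58617, 6.43775, 5.82008)
black_n <- c(7.01571, 6.66823, 6.29157, 7.13090, 6.28972, 6.90776,
             6.77422, 6.79122, 6.95464, 5.42935, 7.03878)

n_n <- length(black_n)
n_y <- length(black_y)

# Pooled standard deviation of the two groups
s_pooled <- sqrt(((n_n - 1) * var(black_n) + (n_y - 1) * var(black_y)) /
                 (n_n + n_y - 2))

# Cohen's d: standardized difference in means (non-black minus black)
d <- (mean(black_n) - mean(black_y)) / s_pooled
print(d)
```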

R Code

W <-Ecdat::Wages
head(W)
str(W)
library(effsize)
cohen.d(W$lwage,W$black)
#set seed so results are able to be reproduced
set.seed(512)
#select 11 observations from each level of the black factor
sample1 <- W[sample(which(W$black == "yes"),11),]
sample2 <- W[sample(which(W$black == "no"),11),]
#combine sample and randomize order
wsample <- rbind(sample1,sample2)
Wsample <- wsample[sample(nrow(wsample),22),]
Wsample
hist(W$lwage,main="Histogram of Logarithmic Wage",xlab="Logarithmic Wage ($)")
high <- mean(subset(W$lwage,W$black =="yes"))
low <- mean(subset(W$lwage,W$black == "no"))
main_effect <- high-low
main_effect
boxplot(W$lwage ~ W$black,main="Main Effect of Race on Logarithmic Wage",xlab="Is the individual Black?")
model <- aov(Wsample$lwage~Wsample$black+Wsample$sex)
summary(model)
qqnorm(residuals(model))
qqline(residuals(model))
plot(fitted(model),residuals(model))
black_y <-sample1$lwage
black_n <-sample2$lwage
real_T <- t.test(black_y,black_n)$statistic
real_T
R <- 10000
t.values <- numeric(R)
for (i in 1:R) {
  index <- sample(1:22, size = 11, replace = FALSE)
  group1 <- Wsample$lwage[index]
  group2 <- Wsample$lwage[-index]
  t.values[i] <- t.test(group1, group2)$statistic
}
hist(t.values)
mean(t.values >= real_T)
cohen.d(Wsample$lwage,Wsample$black)