The European Community Household Panel conducted a study on hourly wages in Belgium in 1994. This dataset contains 1472 observations of individuals and includes information on hourly wages, education level, years of work experience, and sex. There is value in analyzing this dataset to determine the factors that impact hourly wages. In this report, the effect of sex of an individual on their wage is analyzed.
The Wages in Belgium data set is in the Ecdat package in the Bwages dataframe. The Bwages dataframe is a cross-section of 1472 individuals from 1994 and includes the following information for each observation:
wage gross hourly wage in euros
educ education level from 1 [low] to 5 [high]
exper years of experience
sex gender with levels male and female
To begin to analyze the data, we can first inspect the head and tail of the dataframe.
Bwages <- read.delim("C:/Users/wheels/Desktop/Design of Experiments/Topic 4/Project #2/Bwages.txt")
#view the first observations in Bwages dataframe
head(Bwages)
## wage educ exper sex
## 1 7.780208 1 23 1
## 2 4.818505 1 15 0
## 3 10.563645 1 31 1
## 4 7.042429 1 32 1
## 5 7.887521 1 9 1
## 6 8.200058 1 15 0
# view the last observations in Bwages dataframe
tail(Bwages)
## wage educ exper sex
## 1467 9.637010 5 5 0
## 1468 11.119626 5 1 1
## 1469 5.453375 5 14 0
## 1470 11.085489 5 19 0
## 1471 15.649843 5 15 0
## 1472 15.023850 5 24 1
The two categorical independent variables (IVs) that I selected were sex and educ , which are the sex and education level of the individual.
Sex is a factor with 2 levels. The 2 levels are male and female.
Education Level is a factor with 5 levels. The levels range from a low level of education to a high level of education. Though examining the structure of the dataframe presents this variable as in integer, it is indeed a categorical independent variable.
We can examine the factors by examining the structure of the dataframe below.
# view the structure of Bwages dataframe
str(Bwages)
## 'data.frame': 1472 obs. of 4 variables:
## $ wage : num 7.78 4.82 10.56 7.04 7.89 ...
## $ educ : int 1 1 1 1 1 1 1 1 1 1 ...
## $ exper: int 23 15 31 32 9 15 26 23 13 22 ...
## $ sex : int 1 0 1 1 1 0 1 1 1 1 ...
To perform ANOVA, a continuous dependent variable (DV) has to be selected as the response variable. Since I was interested in examining wages, I chose wage as the response variable. wage is a continuous variable. We will discuss this in further detail in the Experimental Design.
The first test that was conducted was a Null Hypothesis Statistical Test (NHST), where ANOVA was performed. After this, two alternatives were conducted. These alternatives were Resampling Statistics and the Plot Plus Error Bar (PPE) Procedure.
This dataset and the factors and response variables were chosen because of an interest in exploring wage data. For ANOVA, the IVs needs to be categorical and the DV needs to be continuous. sex and educ are appropriate IVs because they are both categorical variables that we have interest in seeing their effect on the response variable. wage is an appropriate DV because it is continuous.
For this experiment, the null hypothesis states that there is no difference in wage, hourly wage, between the two levels of sex , male and female. The hypothesis is there is a difference in hourly wages that is impacted by the sex of the individual.
A Type I error of 0.05 was selected because it is generally used as a cutoff to determine statistical significance. A type 1 error is when the null hypothesis is rejected when it shouldn’t be.
A Type 2 error of 0.05 was selected because a power of (1 - beta) of 0.8 or greater is typically used. Increasing this power increases the sample size necessary to draw a statistically significant conclusion. A type 2 error is retaining a false null hypothesis when you shouldn’t.
I am conducting an ANOVA because it is a sufficient analysis for a blocked experiment. A two-way ANOVA can be used to look at the effect of a factor of interest while blocking the other factor.
Randomization was utilized for this design. A sufficient sample size was determined using the G*Power program. A seed was set to unsure reproducibility for the sake of this report. The sample() function was used to select a random sample and will be shown in the code in a later section.
There was no possibility for replication in this experiment because of the nature of the dataset. There is only one set of data points for each individual.
The blocking IV in this experiment was educ. The data was blocked by education level to reduce noise in the wage data caused by the education level of the individuals being observed. In this experiment, a 2-way ANOVA was performed, but the F-statistic associated with the blocking variable will be ignored. This will allow us to examine the effect of sex on wage without the noise caused by education level.
A histogram of wages was created to look at the distribution before further analysis was completed. From the histogram, we can see that there is some slight skewing, which violates the assumption of normality for ANOVA. However, since the skew is slight, we will continue with the analysis.
The main effect of sex on the response variable wage is the difference between the means of the 2 levels, as shown below.
#calculate main effect
m_f <- mean(subset(Bwages$wage, Bwages$sex=="0"))
m_m <- mean(subset(Bwages$wage, Bwages$sex=="1"))
me <- m_m - m_f
me
## [1] 1.300687
To further examine the main effect of sex on wage, the following boxplot was generated to show hourly wages for each gender level.
While we can see from the boxplot that there is a difference in wages between the two groups, we will want to perform a 2-way ANOVA on this data.
cohen.d() function was used from the effsize package to calculate effect size.
#store as factor
Bwages$sex <- as.factor(Bwages$sex)
#load effsize package
library("effsize", lib.loc="~/R/win-library/3.3")
## Warning: package 'effsize' was built under R version 3.3.2
#use cohen.d function from effsize package
cohen.d(Bwages$wage,Bwages$sex)
##
## Cohen's d
##
## d estimate: -0.2951819 (small)
## 95 percent confidence interval:
## inf sup
## -0.4004602 -0.1899036
There are times that you might want to minimize a sample size to make analysis easier, or to introduce some randomization into an experiment. In this project, I used G*Power 3.1.9.2 for Windows to determine the sample size. The following parameters were used:
*Effect size f = .2951819
*alpha error prob = 0.05
*Power (1-beta err prob) = 0.95
*Number of groups = 2
The image above shows the output that was generated. As you can see, the critical F statistic is 3.90 and there is a sample size of N = 152.
The code below shows how a sample size of 152 was randomly selected from the Bwages dataframe. A seed was used for reproducibility of the data for this report.
#select a sample from dataframe
sample_size <- 152
#set seed for reproducibility of data
set.seed(8)
wage_sample <- Bwages[sample(1:nrow(Bwages),sample_size,replace=FALSE),]
A 2-way ANOVA was performed. Since the data was blocked by educ we should ignore the F-statistic associated with that factor.
#select a sample from dataframe
sample_size <- 152
#set seed for reproducibility of data
set.seed(8)
wage_sample <- Bwages[sample(1:nrow(Bwages),sample_size,replace=FALSE),]
#store as factors
wage_sample$sex <- as.factor(wage_sample$sex)
wage_sample$educ <- as.factor(wage_sample$educ)
#2-way anova on sample
anova <- aov(wage_sample$wage ~ wage_sample$sex + wage_sample$educ)
summary(anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## wage_sample$sex 1 421 421.2 17.73 4.43e-05 ***
## wage_sample$educ 4 1099 274.8 11.57 3.47e-08 ***
## Residuals 146 3468 23.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To reach a sufficient conclusion with ANOVA, it is important to check that the dataset meets the conditions of the model. The Normal Q-Q plot shows that the data is relatively normally distributed, but it does show the same skew that was visible in the histogram of wage.
The residuals plot is not as randomly dispersed as we would like to see. This shows that the quality of fit for the model could be better, but is alright.
The Q-Q plot shows that there is some deviation from normality. As mentioned, a seed was used to ensure reproducibility in the report. When the seed was removed to repeat the sample selections, it seemed if about one third of the samples the ANOVA returned an F-statistic below the critical value of 3.90. This could mean that the sample size is too small. It could also be a result of the skew in the distribution or the outliers on the upper end of the wage distribution. Since the p-value that ANOVA returned is below 0.05 I reject the null hypothesis. Therefore, there is a difference in hourly wage based on sex.
While NHST is a valuable tool, not all distributions meet the assumptions of ANOVA. In the following section, we examine two alternatives to NHST. These are Resampling Statistics and Plot Plus Error Bar (PPE) Procedure.
Resampling Statistics is a computational approach that allows you to estimate the probability distribution function of the data. This can be valuable when the population distribution isn’t normalized. Resampling tests the F-statistic by performing many iterations of the experiment to look at the distribution of F-statistics. Resampling was done using a for loop and the number of iterations was arbitrarily set at 10,000. Each iteration, a sample is generated, Anova is performed, and an F-statistic is generated. A histogram of the array of F-statistics shows the shape of the probability distribution function.
#resampling technique
#for loop to take new sample, perform aov, and store F-value
#create array F_values to store F-value for each test
F_values <- 0
#designate the number of iterations in the for loop
iterations <- 10000
for (i in 1:iterations){
wage_sample <- Bwages[sample(1:nrow(Bwages),sample_size,replace=FALSE),]
F_values[i] <- summary(aov(wage_sample$wage ~ wage_sample$sex + wage_sample$educ))[[1]][["F value"]][1]
}
hist(F_values, freq=FALSE, xlab = "F-value", main = "Probability Density Function from Resampling")
#determine PDF above critical F-value
F_critical <- 3.90
length(which(F_values>F_critical))/length(F_values)
## [1] 0.5042
Resampling shows that the probability of the F-value falling above the critical F-value is right around 50%. This shows that we must carefully check assumptions of normality and homogeneity before using G*Power to reduce sample size.
Error bars can be added to a plot to help show the confidence level with which a conclusion can be made. This bar plot compares mean wages with error bars representing a 95% confidence interval. If there is no overlap in the bars, we can say with a 95% confidence level that hourly wages are different between males and females. This code was adapted from work done by James Jones.
#calculate standard deviation
sd_f <- sd(subset(Bwages$wage, Bwages$sex=="0"))
sd_m <- sd(subset(Bwages$wage, Bwages$sex=="1"))
#Plot Plus Error Bar Procedure
error.bar <- function(x, y, upper, lower=upper, length=0.1,...){
if(length(x) != length(y) | length(y) !=length(lower) | length(lower) != length(upper))
stop("vectors must be same length")
arrows(x,y+upper, x, y-lower, angle=90, code=3, length=length, ...)
}
#Plot Plus Error Bar (PPE)
wage_means <- c(m_f,m_m)
wage_sds <- c(sd_f,sd_m)
bar_plot <- barplot(wage_means, names.arg = c("Female","Male"), ylim = c(0,15), xpd = FALSE, col="gray", axis.lty=1, xlab="Gender", ylab="Hourly Wage (in EUR)", main = "Bar Plot of Mean Wage with Error Bars")
error.bar(bar_plot,wage_means,1.96*wage_sds/sqrt(length(Bwages$wage)))
The PPE Procedure is a good way to visually look at the difference between two means at a confidence interval. This shows that there is a difference in mean wages between males and females at a 95% confidence interval. #5. References
Design and Analysis of Experiments, 8th Edition Douglas C. Montgomery
Jones, J. “Plotting Error Bars in R.” August 24, 2009. Accessed November 7, 2016 http://monkeysuncle.stanford.edu/?p=485
ISYE 6020 class resources
The Bwages data set was used from the Ecdat package in R. More information on this data set can be found at: https://cran.r-project.org/web/packages/Ecdat/Ecdat.pdf The Bwages data has missing values in the Ecdat package. The complete dataset can be found here:http://eu.wiley.com/legacy/wileychi/verbeek2ed/datasets.html
# Project 2: Null Hypothesis Statistical Testing
# Shamus Wheeler
# load Ecdat R package
library(Ecdat)
# view the first observations in Bwages dataframe
head(Bwages)
# view the last observations in Bwages dataframe
tail(Bwages)
# view the structure of Bwages dataframe
str(Bwages)
#store as factor
Bwages$sex <- as.factor(Bwages$sex)
#use cohen.d function from effsize package
cohen.d(Bwages$wage,Bwages$sex)
# create histogram of price variable
hist(Bwages$wage, main = "Hourly Wage in Belgium in 1994 ", xlab = "Hourly wage (in EUR)", ylab = "Number of Individuals")
#calculate main effect
m_f <- mean(subset(Bwages$wage, Bwages$sex=="0"))
m_m <- mean(subset(Bwages$wage, Bwages$sex=="1"))
me <- m_m - m_f
me
#calculate standard deviation
sd_f <- sd(subset(Bwages$wage, Bwages$sex=="0"))
sd_m <- sd(subset(Bwages$wage, Bwages$sex=="1"))
#create boxplot
boxplot(Bwages$wage~Bwages$sex, xlab="Sex",ylab="Wage",
main = "Wage by Sex",names=c("Female","Male"))
#select a sample from dataframe
sample_size <- 152
#set seed for reproducibility of data
set.seed(8)
wage_sample <- Bwages[sample(1:nrow(Bwages),sample_size,replace=FALSE),]
#store as factors
wage_sample$sex <- as.factor(wage_sample$sex)
wage_sample$educ <- as.factor(wage_sample$educ)
#2-way anova on sample
anova <- aov(wage_sample$wage ~ wage_sample$sex + wage_sample$educ)
summary(anova)
#create Q-Q plot to see how the data fits a normal distribution
qqnorm(residuals(anova))
#add reference line to Q-Q plot
qqline(residuals(anova))
#create a residuals plot
plot(fitted(anova),residuals(anova))
#resampling technique
#for loop to take new sample, perform aov, and store F-value
#create array F_values to store F-value for each test
F_values <- 0
#designate the number of iterations in the for loop
iterations <- 10000
for (i in 1:iterations){
wage_sample <- Bwages[sample(1:nrow(Bwages),sample_size,replace=FALSE),]
F_values[i] <- summary(aov(wage_sample$wage ~ wage_sample$sex + wage_sample$educ))[[1]][["F value"]][1]
}
hist(F_values, freq=FALSE, xlab = "F-value", main = "Probability Density Function from Resampling")
#determine PDF above critical F-value
F_critical <- 3.90
length(which(F_values>F_critical))/length(F_values)
#Plot Plus Error Bar Procedure
#James Jones PPE error.bar function
error.bar <- function(x, y, upper, lower=upper, length=0.1,...){
if(length(x) != length(y) | length(y) !=length(lower) | length(lower) != length(upper))
stop("vectors must be same length")
arrows(x,y+upper, x, y-lower, angle=90, code=3, length=length, ...)
}
#Plot Plus Error Bar (PPE)
wage_means <- c(m_f,m_m)
wage_sds <- c(sd_f,sd_m)
bar_plot <- barplot(wage_means, names.arg = c("Female","Male"), ylim = c(0,15), xpd = FALSE, col="gray", axis.lty=1, xlab="Gender", ylab="Hourly Wage (in EUR)", main = "Bar Plot of Mean Wage with Error Bars")
error.bar(bar_plot,wage_means,1.96*wage_sds/sqrt(length(Bwages$wage)))