Experimental Setting

System under test

This data comes from the Panel Study of Income Dynamics, a survey of wages under a variety of conditions conducted in the United States from 1976 to 1982. The survey covered 595 heads of household, with each person interviewed 7 times over the course of the study. The data set was obtained from the Ecdat package in R.

The data set can be used to examine the effect of a variety of factors on the wage a worker earns over the course of a year. Wages are recorded as a logarithm in this study. This analysis focuses on the factor (categorical) variables in the data set rather than the continuous variables.

Factors and Levels

Four factors are studied. Three have levels of yes and no: whether the worker is a blue-collar worker (bluecol), whether the worker lives in the south (south), and whether the worker's wages are set by a union (union). The fourth is the worker's sex (sex), with levels of male and female.

Continuous Variables

While continuous independent variables exist in the data set (experience (exp), weeks worked annually (wks), and years of education (ed)), they will not be studied, in favor of focusing on the factor variables. The response variable, the log-transformed wage (lwage), is also continuous.

Response Variables

The response variable is log transformed wages, presented as a numerical value.

The Data

head(df)
##   exp wks bluecol ind south smsa married  sex union ed black   lwage
## 1   3  32      no   0   yes   no     yes male    no  9    no 5.56068
## 2   4  43      no   0   yes   no     yes male    no  9    no 5.72031
## 3   5  40      no   0   yes   no     yes male    no  9    no 5.99645
## 4   6  39      no   0   yes   no     yes male    no  9    no 5.99645
## 5   7  42      no   1   yes   no     yes male    no  9    no 6.06146
## 6   8  35      no   1   yes   no     yes male    no  9    no 6.17379
str(df)
## 'data.frame':    4165 obs. of  12 variables:
##  $ exp    : int  3 4 5 6 7 8 9 30 31 32 ...
##  $ wks    : int  32 43 40 39 42 35 32 34 27 33 ...
##  $ bluecol: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 2 2 ...
##  $ ind    : int  0 0 0 0 1 1 1 0 0 1 ...
##  $ south  : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 1 1 1 ...
##  $ smsa   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ married: Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ sex    : Factor w/ 2 levels "female","male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ union  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
##  $ ed     : int  9 9 9 9 9 9 9 11 11 11 ...
##  $ black  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ lwage  : num  5.56 5.72 6 6 6.06 ...

This data set has a total of 4165 observations (595 individuals, each observed 7 times) and 12 variables. It will be pared down to the factors discussed above for the present study.
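
The paring is done in Appendix 2 by column index; an equivalent, name-based version would look like the following sketch.

condensed_data <- df[, c("bluecol", "south", "sex", "union", "lwage")] # keep the four factors plus the response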

Following the paring of the data set, it looks as follows.

head(condensed_data)
##   bluecol south  sex union   lwage
## 1      no   yes male    no 5.56068
## 2      no   yes male    no 5.72031
## 3      no   yes male    no 5.99645
## 4      no   yes male    no 5.99645
## 5      no   yes male    no 6.06146
## 6      no   yes male    no 6.17379
str(condensed_data)
## 'data.frame':    4165 obs. of  5 variables:
##  $ bluecol: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 2 2 ...
##  $ south  : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 1 1 1 ...
##  $ sex    : Factor w/ 2 levels "female","male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ union  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
##  $ lwage  : num  5.56 5.72 6 6 6.06 ...
summary(condensed_data)
##  bluecol    south          sex       union          lwage      
##  no :2036   no :2956   female: 469   no :2649   Min.   :4.605  
##  yes:2129   yes:1209   male  :3696   yes:1516   1st Qu.:6.395  
##                                                 Median :6.685  
##                                                 Mean   :6.676  
##                                                 3rd Qu.:6.953  
##                                                 Max.   :8.537

Experimental Design

The purpose of this experiment is to provide a predictive model for worker wages based on social, geographic, and occupational factors. Such a model could be used in the future to determine which groups are at a greater disadvantage compared to others.

How will the experiment be organized and conducted?

The experiment will be conducted as an analysis of the available data. Exploratory statistics will be computed for each of the factors, followed by analysis of main effects and two-way interaction effects. This will consist of plots of the main effects, followed by a significance check using a one-factor ANOVA test. An alpha of .05 will be used, and the proportion of variance each factor explains will be considered when assessing its predictive power and whether it should be incorporated into the model.
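
As a sketch of this workflow (assuming the condensed_data frame built in Appendix 2, and using the factor's share of the total sum of squares as a rough measure of variance explained):

anova_fit <- anova(aov(lwage ~ bluecol, data = condensed_data)) # one-factor ANOVA table
anova_fit[["Pr(>F)"]][1]                              # p-value, compared against alpha = .05
anova_fit[["Sum Sq"]][1] / sum(anova_fit[["Sum Sq"]]) # proportion of variance explained by the factor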

The data is already available in the Ecdat package for R. However, if I were to design a method to collect it myself, I would take a random sample of both cell-phone and landline numbers and call the owners. The questions would be asked only if the subject was willing to participate in the panel for the full seven-year period. The appropriate data would be collected and placed into the data set; however, the data should not contain any ID indicator.

What is the rationale for this design?

It is important to contact people randomly in order to prevent bias. A purely voluntary study can introduce bias, as it may result in a sample with more people who are ‘proud’ of their income. Financial incentives will not work either, as they could make lower-income groups more likely to participate, since the money means more to them. Additionally, no information should be known about subjects beforehand; the calling should be done blinded to eliminate any subconscious bias.

ISYE 6020 Discussion: Relevant Theory and Assumptions

The basic assumption of a completely randomized design is that treatments are assigned entirely at random, following the randomization, replication, and blocking paradigm. In order to determine the viability of a data set for factorial design, each of these must be considered.

Randomization

In a properly randomized design, subjects are randomly assigned (in equal numbers) to the experimental groups. This assumes a large pool of subjects that can be divided evenly into treatment groups.

Randomization appears to have been used to some extent in the original data set. A large pool of people was interviewed and categorized based on their answers. Because the data set is limited, certain groups have fewer replicates, which will be discussed later, but the data set seems as random as can be expected given its size.

Because this data is structured not around treatments but around individual characteristics of people, it is difficult to balance replicates equally across groups while eliminating selection bias. This causes a few issues in the analysis of the data set, but nothing disqualifying.

Replication

In a full factorial design, replication is important in ensuring that the sample is representative of the whole population. It reduces the variability of the estimated effects, which improves the significance and confidence assessments of the results.

Replicates here are the interviewed people who fall into each of the possible combinations of factor levels. With the exception of the (bluecol = no, south = yes, sex = female, union = yes) grouping, all factor combinations have at least one replicate.
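
One way to check the cell counts directly (a sketch, using the condensed data frame from Appendix 2):

cell_counts <- xtabs(~ bluecol + south + sex + union, data = condensed_data) # observations per factor combination
ftable(cell_counts) # flat table; cells with zero or one observation are the concern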

As the original study cannot be located, it is not possible to tell for certain how the data was sampled. Ideally, more replicates would be present for some of the smaller groups, but that is a risk of truly random sampling, as opposed to proportional or stratified sampling, in which researchers contact people until a target number of replicates is reached for each subset.

Blocking

In experimental design, blocking is the process by which subjects are divided into blocks before a treatment is applied. Blocking could be used for a factor like age, where separate groups are created in 10- or 20-year bands instead of studying age as its own variable. This decreases the number of sampling groups. Control groups are very important in this style of experimentation to ensure significance of results.

No blocking is necessary in this experimental design, as all of the factors involved are simply survey questions rather than applied treatments. As a result, it is unnecessary to limit the number of runs through blocking. Additionally, exp (years of experience), the most likely blocking variable, is recorded as a continuous variable in the data set, not as blocked groups (1-10, 11-20, etc.).
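
Purely as a hypothetical illustration (not part of this analysis), exp could be binned into such blocks with something like:

exp_block <- cut(df$exp, breaks = c(seq(0, 50, by = 10), Inf),
                 labels = c("1-10", "11-20", "21-30", "31-40", "41-50", "50+")) # 10-year experience blocks
table(exp_block) # observations per block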

Assumptions

This data set meets many of the assumptions required for a completely randomized design. Where it falls short is replication, as one of the groups has only a single replicate. Whether this is intentional (a result of proportional sampling) or simply a byproduct of random sampling cannot be determined, as the original data set and accompanying documentation cannot be located. However, based on the principles of randomization, replication, and blocking, the data set appears to be a viable, albeit not perfect, candidate for factorial analysis.

Statistical Analysis

In order to determine the effects of the factors, main effect and interaction effect plots will be created. One- and two-way ANOVA will be used to test the significance of these effects.

Descriptive Summary

Grouped boxplots of log wages against all four factors together, as well as against each factor individually, will be examined in order to understand the data and show differences between groupings.

Testing

Main effects will be computed for each variable, as well as all 6 two-way interaction effects. A main effect is computed by averaging the lwage variable within each level of a single factor and taking the difference between the two level means; the other variables are not manipulated in any fashion and are simply averaged over. Each main effect will be shown as a single line plot connecting the two levels of the given factor.
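
As a compact sketch (assuming the condensed_data frame from Appendix 2), the two level means for every factor can be obtained in one pass; the main effect of a factor is the gap between its two level means.

lapply(condensed_data[c("bluecol", "south", "sex", "union")],
       function(f) tapply(condensed_data$lwage, f, mean)) # mean lwage at each level of each factor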

Interaction effects will be shown as a set of lines, with each line representing one level of one factor, and the x-axis showing the levels of the other factor in the interaction. If the lines are not parallel, an interaction is likely present between the two variables. Significance must still be tested.

Once it is determined whether main and interaction effects appear to be present, their statistical significance must be assessed. To do this, ANOVA testing will be carried out with one and two factors. With a p-value of less than .05, we can reject the null hypothesis that no main or interaction effect is present for the factor or factor pair in question.

Main Effects

The data set is manipulated in order to calculate the main effects. This is done by subsetting the data in terms of the appropriate main factor (bluecol, south, sex, or union). The values of lwage will be averaged for both levels of this factor.

The data will be subset using code similar to the following chunk and then main effect plots will be produced using the average of each subset:

ME_BC_y <- subset(condensed_data,condensed_data$bluecol == "yes")
ME_BC_n <- subset(condensed_data,condensed_data$bluecol == "no")
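
The main effect value can then be recovered as the difference of the two subset means; for bluecol this gives roughly the -.3 reported below (a sketch, with the sign following the no-to-yes ordering).

mean(ME_BC_y$lwage) - mean(ME_BC_n$lwage) # bluecol main effect, yes minus no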

There appear to be main effects for each of the variables except union, as the lines in the main effect plots have visible slopes. The calculated values of the main effects (from no to yes, and from male to female for sex) are:

bluecol = -.3

south = -.18

sex = -.47

union = .01

In order to confirm the statistical significance of differing means, one-factor ANOVA must be carried out on each of the main effects.

## Analysis of Variance Table
## 
## Response: lwage
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## bluecol      1  89.49  89.486  467.17 < 2.2e-16 ***
## Residuals 4163 797.42   0.192                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: lwage
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## south        1  28.87 28.8678  140.06 < 2.2e-16 ***
## Residuals 4163 858.04  0.2061                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: lwage
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## sex          1  93.69  93.691  491.72 < 2.2e-16 ***
## Residuals 4163 793.21   0.191                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: lwage
##             Df Sum Sq  Mean Sq F value Pr(>F)
## union        1   0.07 0.067121  0.3151 0.5746
## Residuals 4163 886.84 0.213029

This analysis shows that the main effects are significant for bluecol, south, and sex, but not for union. Each of the first three has a very low p-value, allowing the null hypothesis to be rejected and confirming that a main effect is present. These three factors appear to have reasonable predictive power, although in a data set this large, with other unstudied factors, it is unlikely that any single factor explains a large share of the variance.

Interaction Effects

Interaction plots will be used to visually search for interaction effects. The lines for the two levels of one factor will be plotted together, and differences in their slopes will indicate the possible presence of an interaction effect. After visual inspection, two-factor ANOVA will be used to confirm the relationship.

The magnitude of the interaction effect is calculated as a difference of differences between the two lines: the gap between the lines at the left-hand level is compared with the gap at the right-hand level, and the difference between these two gaps is the interaction effect.

The equation is as follows, where L is left and R is right for the factor graphs:

\(IE = (Line1_L - Line2_L) - (Line1_R - Line2_R)\)
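
Mirroring the ie6 calculation in Appendix 2, this can be computed from the cell means produced by by(); the sign depends on how the factor levels are ordered.

cell_means <- by(condensed_data$lwage,
                 list(condensed_data$bluecol, condensed_data$union), mean) # 2 x 2 cell means
(cell_means[1,1] - cell_means[2,1]) - (cell_means[1,2] - cell_means[2,2])  # difference of differences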

The resulting interaction effects for each pair of factors are listed below:

bluecol x south = -0.0113653
bluecol x sex = 0.1277816
bluecol x union = 0.4175451
south x sex = -0.0529726
south x union = 0.1788942
sex x union = -0.1882084

Based on the interaction plots and the values above, interaction effects appear to be present between several of the variables. The most obvious is between the bluecol and union factors, with south x union and sex x union also showing evidence of an interaction. Two-factor ANOVA will be used to confirm the presence of a statistically significant interaction, as well as to give an idea of how much variance each interaction explains.
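
The two-factor tables that follow can be produced with models along these lines (a sketch; the exact calls are not listed in Appendix 2):

print(anova(aov(lwage ~ bluecol * south, data = condensed_data))) # main effects plus the bluecol:south interaction
print(anova(aov(lwage ~ bluecol * sex,   data = condensed_data)))
print(anova(aov(lwage ~ bluecol * union, data = condensed_data)))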

bluecol x south

## Analysis of Variance Table
## 
## Response: lwage
##                 Df Sum Sq Mean Sq  F value Pr(>F)    
## bluecol          1  89.49  89.486 481.9919 <2e-16 ***
## south            1  24.87  24.867 133.9406 <2e-16 ***
## bluecol:south    1   0.03   0.028   0.1484    0.7    
## Residuals     4161 772.52   0.186                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

bluecol x sex

## Analysis of Variance Table
## 
## Response: lwage
##               Df Sum Sq Mean Sq  F value    Pr(>F)    
## bluecol        1  89.49  89.486 543.4540 < 2.2e-16 ***
## sex            1 110.64 110.636 671.8985 < 2.2e-16 ***
## bluecol:sex    1   1.63   1.628   9.8877  0.001676 ** 
## Residuals   4161 685.16   0.165                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

bluecol x union

## Analysis of Variance Table
## 
## Response: lwage
##                 Df Sum Sq Mean Sq F value    Pr(>F)    
## bluecol          1  89.49  89.486 498.445 < 2.2e-16 ***
## union            1  17.20  17.199  95.802 < 2.2e-16 ***
## bluecol:union    1  33.20  33.196 184.904 < 2.2e-16 ***
## Residuals     4161 747.02   0.180                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

south x sex

## Analysis of Variance Table
## 
## Response: lwage
##             Df Sum Sq Mean Sq  F value Pr(>F)    
## south        1  28.87  28.868 156.1732 <2e-16 ***
## sex          1  88.63  88.633 479.5016 <2e-16 ***
## south:sex    1   0.26   0.264   1.4273 0.2323    
## Residuals 4161 769.14   0.185                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

south x union

## Analysis of Variance Table
## 
## Response: lwage
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## south          1  28.87 28.8678 140.946 < 2.2e-16 ***
## union          1   0.39  0.3892   1.900    0.1681    
## south:union    1   5.42  5.4155  26.441 2.842e-07 ***
## Residuals   4161 852.23  0.2048                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

sex x union

## Analysis of Variance Table
## 
## Response: lwage
##             Df Sum Sq Mean Sq  F value    Pr(>F)    
## sex          1  93.69  93.691 493.5033 < 2.2e-16 ***
## union        1   0.71   0.709   3.7348 0.0533582 .  
## sex:union    1   2.54   2.540  13.3767 0.0002579 ***
## Residuals 4161 789.96   0.190                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the ANOVA analysis, it can be seen that there are statistically significant interaction effects in the following factor combinations:

bluecol x sex

bluecol x union

south x union

sex x union

No statistical significance was observed in these groups:

bluecol x south

south x sex

However, not all of the statistically significant interaction effects are large enough to be worth considering. For example, the bluecol x sex interaction, while statistically significant, has a much smaller F value than the main effects. Nevertheless, noticeable interaction effects are present within the data.
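
For a rough sense of scale (my own calculation from the bluecol x sex table above, taking the interaction sum of squares as a share of the total):

1.63 / (89.49 + 110.64 + 1.63 + 685.16) # roughly 0.002, i.e. about 0.2% of the variance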


Appendix 1: Raw Data

The data was drawn from the R package Ecdat, and the dataset used was Wages. The structure and head of the data set can be seen in the section Experimental Setting - The Data.

Appendix 2: R Code

library("Ecdat")
df <- Wages

head(df) # View the top of the data set
summary(df) # Breaks down the data frame by class

condensed_data <- df[,c(3,5,8,9,12)] # keeps bluecol, south, sex, union, lwage; removes unnecessary variables

# Exploratory Statistics

par(mar = c(9,5,0,1)) # Changes graph space so bottom labels fit
boxplot(condensed_data$lwage~condensed_data$bluecol + condensed_data$south + 
  condensed_data$sex + condensed_data$union,las = 2, ylab = "log of wages") 
# produces boxplot by four factors, with rotated grouping labels 

legend(x = "bottomright", legend = "Factor order= bluecol, South, Sex , Union",
bty = "n", cex = .75) 
# Creates a small, borderless legend which is used as a label,
# due to the number of categories below the plot prohibiting 
# x-axis labeling

boxplot(condensed_data$lwage~condensed_data$bluecol,xlab="Blue Collar?", 
        ylab= "log of wages") 
# bluecol boxplot

boxplot(condensed_data$lwage~condensed_data$south, xlab = "South?", 
        ylab = "log of wages")
# south boxplot

boxplot(condensed_data$lwage~condensed_data$sex,xlab = "Sex of worker",
        ylab = "log of wages") 
# sex boxplot

boxplot(condensed_data$lwage~condensed_data$union,xlab = "Union-set Wages?",
        ylab = "log of wages")
# union boxplot



# breaks down data by bluecol factor
ME_BC_y <- subset(condensed_data,condensed_data$bluecol == "yes") 
ME_BC_n <- subset(condensed_data,condensed_data$bluecol == "no")


plot(c(1,2), c(mean(ME_BC_n$lwage),mean(ME_BC_y$lwage)),type = 'l',
  main = "ME of Blue Collar on log of wages",xlab = "Blue Collar?",
  ylab = "log of wages",xlim = c(.5,2.5),xaxt = "n") 
# line plot of main effects with no x-axis

axis(1,at = c(1,2),labels = c("No","Yes")) 
# allows for insertion of factors on x-axis instead of numbers

text(c(1,2) - .25,c(mean(ME_BC_n$lwage),mean(ME_BC_y$lwage)),
     round(c(mean(ME_BC_n$lwage),mean(ME_BC_y$lwage)),digits = 2)) 
# prints mean values next to points. 

# breaks down data by south factor
ME_south_yes <- subset(condensed_data,condensed_data$south == "yes")
ME_south_no  <- subset(condensed_data,condensed_data$south == "no")



plot(c(1,2),c(mean(ME_south_no$lwage),mean(ME_south_yes$lwage)),type = 'l',
     main = "ME of Location on log of wages",xlab = "South?",ylab = "log of wages",
     xlim = c(.5,2.5),xaxt = "n") 

# line plot of main effects with no x-axis


axis(1,at = c(1,2),labels = c("No","Yes")) 
# allows for insertion of factors on x-axis instead of numbers

text(c(1,2)-.25,c(mean(ME_south_no$lwage),mean(ME_south_yes$lwage)),
     round(c(mean(ME_south_no$lwage),mean(ME_south_yes$lwage)),digits = 2)) 
# prints mean values next to points.

# breaks down data by sex factor

ME_sex_f<- subset(condensed_data,condensed_data$sex == "female")
ME_sex_m<- subset(condensed_data,condensed_data$sex == "male")


plot(c(1,2),c(mean(ME_sex_m$lwage),mean(ME_sex_f$lwage)),type = 'l',
     main = "ME of Sex on log of wages",xlab = "Sex",ylab = " log of wages",
     xlim = c(.5,2.5),xaxt = "n") 
# line plot of main effects with no x-axis


axis(1,at = c(1,2),labels = c("M","F")) 
# allows for insertion of factors on x-axis instead of numbers

text(c(1,2)-.25,c(mean(ME_sex_m$lwage),mean(ME_sex_f$lwage)),
     round(c(mean(ME_sex_m$lwage),mean(ME_sex_f$lwage)),digits = 2))

# prints mean values next to points.


# breaks down data by union factor
ME_union_y <- subset(condensed_data,condensed_data$union == "yes") 
ME_union_n <- subset(condensed_data,condensed_data$union == "no")


plot(c(1,2), c(mean(ME_union_n$lwage),mean(ME_union_y$lwage)),type = 'l',
     main = "ME of Union Set Wages on log of wages",xlab = "Union Set Wages?",
     ylab = "log of wages",xlim = c(.5,2.5),xaxt = "n") 
# line plot of main effects with no x-axis


axis(1,at = c(1,2),labels = c("no","yes")) 
# allows for insertion of factors on x-axis instead of numbers


text(c(1,2) - .25, c(mean(ME_union_n$lwage),mean(ME_union_y$lwage)),
     round(c(mean(ME_union_n$lwage),mean(ME_union_y$lwage)),digits = 2))
# prints mean values next to points.


# Test for Significance

# first line of group creates analysis of variance model
# next line prints it to console 

model1 <- aov(lwage~bluecol,data = condensed_data)
print(anova(model1))

model2 <- aov(lwage~south,data = condensed_data)
print(anova(model2))

model3 <- aov(lwage~sex,data = condensed_data)
print(anova(model3))

model4 <- aov(lwage~union,data = condensed_data)
print(anova(model4))

## Interaction plot

# This uses the interaction.plot function to compare the interaction of 
# each of the two-factor interactions. 6 interactions are present. 
interaction.plot(condensed_data$bluecol,condensed_data$south,condensed_data$lwage,
                 legend = F,xlab = "Blue Collar?",ylab = "log of wages")
# Calculates legend and places it appropriately
legend(x = "bottomleft",legend = c("yes","no"), title = "South", lty = c(1,2))

# Calculates the points of the interaction plot 
# for analysis of interaction effects
# These lines are repeated through the code. 
means = by(condensed_data$lwage,list(condensed_data$bluecol,condensed_data$south),mean)



interaction.plot(condensed_data$bluecol,condensed_data$sex,condensed_data$lwage,
      legend = F,xlab = "Blue Collar?",ylab = "log of wages")
legend(x = "bottomleft",legend = c("male","female"), title = "Sex", lty = c(1,2))

means = by(condensed_data$lwage,list(condensed_data$bluecol,condensed_data$sex),mean)

interaction.plot(condensed_data$bluecol,condensed_data$union,condensed_data$lwage,
      legend = F,xlab = "Blue Collar?",ylab = "log of wages")
legend(x = "bottomleft",legend = c("yes","no"), title = "Union Set Wages?", lty = c(1,2))

means = by(condensed_data$lwage,list(condensed_data$bluecol,condensed_data$union),mean)


interaction.plot(condensed_data$south,condensed_data$sex,condensed_data$lwage,
      legend = F,xlab = "South?",ylab = "log of wages")
legend(x = "bottomleft",legend = c("male","female"), title = "Sex", lty = c(1,2))

means = by(condensed_data$lwage,list(condensed_data$south,condensed_data$sex),mean)

interaction.plot(condensed_data$south,condensed_data$union,condensed_data$lwage,
      legend = F,xlab = "South?",ylab = "log of wages")
legend(x = "bottomleft",legend = c("yes","no"), title = "Union Set Wages?", lty = c(1,2))

means = by(condensed_data$lwage,list(condensed_data$south,condensed_data$union),mean)

interaction.plot(condensed_data$sex,condensed_data$union,condensed_data$lwage,
       legend = F,xlab = "Sex?",ylab = "log of wages")
legend(x = "bottomleft",legend = c("yes","no"), title = "Union Set Wages?", lty = c(1,2))

means = by(condensed_data$lwage,list(condensed_data$sex,condensed_data$union),mean)

# Example interaction effect (IE) calculation, using the most recent cell means (sex x union)
ie6 <- (means[1,1] - means[2,1]) - (means[1,2] - means[2,2])