1. Setting

System Under Test

This experiment examines the effects of several childhood experiences on future wages. The data were taken from a study on wages, schooling, and proximity to college campuses, and were collected from individuals in the United States in 1976. The data set contains 3010 observations and is available as the Schooling data set in the Ecdat package. A further description of the experiment is given in the following sections. The code below loads the package and assigns the data set to a working name:

#load Ecdat package
library("Ecdat")
## Loading required package: Ecfun
## 
## Attaching package: 'Ecfun'
## The following object is masked from 'package:base':
## 
##     sign
## 
## Attaching package: 'Ecdat'
## The following object is masked from 'package:datasets':
## 
##     Orange
#rename dataset as school
school <- Schooling

Factors and Levels

The Schooling dataset consists of 3010 observations of 28 variables. For this experiment, four factors were studied, each with two levels. The first factor (smsa66) is whether or not a person grew up in a metropolitan statistical area (an area with a relatively high population density), with the levels yes and no. The second factor (nearc4) is whether or not a person grew up in close proximity to a 4 year college, with the levels yes and no. The third factor (libcrd14) is whether or not a person had a library card at age 14, with levels yes and no. The fourth factor (sinmom14) is whether or not a person had a single mother at age 14, with levels yes and no.
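The factor coding described above can be checked directly in R. The sketch below uses a small made-up data frame (`toy`, with invented values, not the Schooling data) that carries the same four column names, so it runs without the Ecdat package. With the package loaded, `toy` could be replaced by `school`.

```r
# Hypothetical stand-in for the Schooling data: same column names, invented values
toy <- data.frame(
  smsa66   = factor(c("yes", "no", "yes", "no"), levels = c("no", "yes")),
  nearc4   = factor(c("yes", "yes", "no", "no"), levels = c("no", "yes")),
  libcrd14 = factor(c("no", "yes", "yes", "no"), levels = c("no", "yes")),
  sinmom14 = factor(c("no", "no", "yes", "no"), levels = c("no", "yes"))
)

factors <- c("smsa66", "nearc4", "libcrd14", "sinmom14")

# Number of levels per factor; each should be 2
n_levels <- sapply(toy[factors], nlevels)
print(n_levels)

# A full factorial of four two-level factors has 2^4 = 16 cells
n_cells <- prod(n_levels)
print(n_cells)
```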

Continuous Variables (if any)

The continuous variables in the data set are IQ score (iqscore), KWW (kww) (similar to IQ score), wage (wage76), and logarithm of wage (lwage76). The log of wages will be the response variable that this experiment is examining, while the other continuous variables won’t be examined further.

Response Variables

The response variable is the log of wages (lwage76). This variable is a continuous numerical value that appears approximately normally distributed, as the histogram in the Statistical Analysis section indicates.

The Data

head(school)
##   smsa66 smsa76 nearc2 nearc4 nearc4a nearc4b ed76 ed66 age76 daded
## 1    yes    yes     no     no      no      no    7    5    29  9.94
## 2    yes    yes     no     no      no      no   12   11    27  8.00
## 3    yes    yes     no     no      no      no   12   12    34 14.00
## 4    yes    yes    yes    yes     yes      no   11   11    27 11.00
## 5    yes    yes    yes    yes     yes      no   12   12    34  8.00
## 6    yes    yes    yes    yes     yes      no   12   11    26  9.00
##   nodaded momed nomomed momdad14 sinmom14 step14 south66 south76  lwage76
## 1     yes 10.25     yes      yes       no     no      no      no 6.306275
## 2      no  8.00      no      yes       no     no      no      no 6.175867
## 3      no 12.00      no      yes       no     no      no      no 6.580639
## 4      no 12.00      no      yes       no     no      no      no 5.521461
## 5      no  7.00      no      yes       no     no      no      no 6.591674
## 6      no 12.00      no      yes       no     no      no      no 6.214608
##   famed black wage76 enroll76 kww iqscore mar76 libcrd14 exp76
## 1     9   yes    548       no  15      NA   yes       no    16
## 2     8    no    481       no  35      93   yes      yes     9
## 3     2    no    721       no  42     103   yes      yes    16
## 4     6    no    250       no  25      88   yes      yes    10
## 5     8    no    729       no  34     108   yes       no    16
## 6     6    no    500       no  38      85   yes      yes     8
str(school)
## 'data.frame':    3010 obs. of  28 variables:
##  $ smsa66  : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ smsa76  : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ nearc2  : Factor w/ 2 levels "no","yes": 1 1 1 2 2 2 2 2 2 2 ...
##  $ nearc4  : Factor w/ 2 levels "no","yes": 1 1 1 2 2 2 2 2 2 2 ...
##  $ nearc4a : Factor w/ 2 levels "no","yes": 1 1 1 2 2 2 2 2 2 2 ...
##  $ nearc4b : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ed76    : int  7 12 12 11 12 12 18 14 12 12 ...
##  $ ed66    : int  5 11 12 11 12 11 16 13 12 12 ...
##  $ age76   : int  29 27 34 27 34 26 33 29 28 29 ...
##  $ daded   : num  9.94 8 14 11 8 9 14 14 12 12 ...
##  $ nodaded : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
##  $ momed   : num  10.2 8 12 12 7 ...
##  $ nomomed : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
##  $ momdad14: Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ sinmom14: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ step14  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ south66 : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ south76 : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ lwage76 : num  6.31 6.18 6.58 5.52 6.59 ...
##  $ famed   : int  9 8 2 6 8 6 1 1 3 3 ...
##  $ black   : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
##  $ wage76  : int  548 481 721 250 729 500 565 608 425 515 ...
##  $ enroll76: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ kww     : int  15 35 42 25 34 38 41 46 32 34 ...
##  $ iqscore : int  NA 93 103 88 108 85 119 108 96 97 ...
##  $ mar76   : Factor w/ 6 levels "2","3","4","5",..: 6 6 6 6 6 6 6 6 3 6 ...
##  $ libcrd14: Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 2 1 2 ...
##  $ exp76   : int  16 9 16 10 16 8 9 9 10 11 ...

2. Experimental Design

This experiment is looking for the effects of childhood experiences on wages. This model could be of interest to parents, legislators, and school officials when looking to improve the economic outcome of the next generation.

How will the experiment be organized and conducted to test the hypothesis?

This experiment will begin with exploratory data analysis of each of the factors, starting with boxplots. Each main effect will be plotted so that it can be quantified. Each of the six two-way interactions will also be plotted for analysis. One-way ANOVA will be used to test the main effect of each factor, and two-way ANOVA will be used to test each interaction. For this experiment, alpha = 0.05.
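As a minimal sketch of the one-way ANOVA step, the code below fits a model on simulated data and compares the p-value with alpha. The data frame `toy`, its columns `group` and `lwage`, and the simulated effect size are all invented for illustration and are not part of the Schooling data.

```r
# Simulated stand-in data: one yes/no factor and a log-wage-like response
set.seed(42)
toy <- data.frame(
  group = factor(rep(c("no", "yes"), each = 30), levels = c("no", "yes"))
)
toy$lwage <- 6 + 0.3 * (toy$group == "yes") + rnorm(60, sd = 0.4)

# One-way ANOVA: does mean lwage differ between the two levels of group?
fit <- aov(lwage ~ group, data = toy)
tab <- anova(fit)
print(tab)

# Decision rule at alpha = 0.05
alpha <- 0.05
p_value <- tab[["Pr(>F)"]][1]
reject_null <- p_value < alpha
print(reject_null)
```

Passing the data frame through `data =` keeps the term names in the ANOVA table short (`group` rather than `toy$group`).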

What is the rationale for this design?

This experiment was designed to fit a pre-existing data set. Because the collection method is unknown, it is difficult to determine how the data were collected or whether any biases exist. If I were responsible for collection, I would take a random sample of phone numbers and hold phone interviews with those people. Also, with the exception of the library card (libcrd14) data, the rest of the data may be available in census information. If I had access to that information, I would generate a random sample of observations from it. I believe that either of these methods could be used to collect data without biases. More on the experimental design will be discussed in the ISYE 6020 Discussion section.

ISYE 6020 Summary of Relevant Theory

Randomization

For an experimental design to be randomized, both the allocation of observations to groups and the order of the individual runs must be random. This data set seems to have some randomization, although it is most likely not completely randomized. It helps that a large number (3010) of observations were taken; if they were collected in a way that minimizes biases, this will aid randomization. From observing the data, it is clear that some groups have fewer replicates than others: the largest group has 320 replicates, while the smallest has 8. These replicates are discussed further in the next section. That said, I would still consider this data set to be random. The observations are attributes of people, and it is entirely possible that the data set was drawn from a random population without selection biases. It would be helpful to have a larger sample or to know the method that was used to collect the data.

Replication

A replicate is a repeat run of a set of factor combinations. In this example, it would be the number of people who had the same set of factor combinations. Because this is a four factor, two level design, there are 16 groups that a person can be categorized into. The largest group is smsa66=yes, nearc4=yes, libcrd14=no, sinmom14=no with 320 replicates. The smallest group is smsa66=yes, nearc4=no, libcrd14=no, sinmom14=yes with 8 replicates. It is good that all groups have replicates so that error can be more accurately measured.
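The replicate counts per group can be obtained by cross-tabulating the four factors. The sketch below does this on simulated data (`toy`, with invented proportions), so the counts it prints are not those of the Schooling data. With Ecdat loaded, the same call could be applied to the four factor columns of `school`.

```r
# Simulated stand-in for the four factors (invented proportions, deliberately unbalanced)
set.seed(1)
n <- 200
yn <- function(p) factor(ifelse(runif(n) < p, "yes", "no"), levels = c("no", "yes"))
toy <- data.frame(
  smsa66   = yn(0.65),
  nearc4   = yn(0.70),
  libcrd14 = yn(0.67),
  sinmom14 = yn(0.10)
)

# Cross-tabulate all four factors; each row is one of the 16 cells
cells <- as.data.frame(table(toy), responseName = "replicates")

# Sort so the largest and smallest groups are easy to read off
cells <- cells[order(cells$replicates, decreasing = TRUE), ]
print(cells)
```

Note that `table()` drops rows with missing values by default, so observations with an `NA` in any of the four factors would not be counted.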

Blocking

Blocking groups observations into relatively homogeneous sets on a nuisance variable before treatments are compared, so that variability from that variable does not obscure the treatment effects. Blocking can be used for variables such as years of experience. For example, the years of experience that someone has in the workplace can be grouped into five-year periods (0-5 years, 6-10 years, etc.) instead of being treated as individual levels. This wasn't necessary in this experiment because all of the factors had two levels, yes and no. Many of the other variables in the data set were excluded from the experiment entirely because they weren't of interest in this study.
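The five-year grouping mentioned above can be sketched with base R's `cut()`. The `experience` values here are invented for illustration, and the upper limit of 20 years is an assumption.

```r
# Hypothetical years-of-experience values (not taken from the data set)
experience <- c(0, 3, 6, 10, 11, 16)

# Group into five-year ranges; include.lowest = TRUE places 0 in the first range
experience_group <- cut(
  experience,
  breaks = c(0, 5, 10, 15, 20),
  labels = c("0-5", "6-10", "11-15", "16-20"),
  include.lowest = TRUE
)

# Count how many observations fall into each range
print(table(experience_group))
```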

Assumptions

This data set seems to meet many of the requirements for a randomized design. It has replicates for every group, did not need blocking, and seems to be relatively random for a study that looks at characteristics of individuals. From this discussion, we can assume that the data set is random and that it is appropriate to conduct further analysis.

3. Statistical Analysis

To examine the effects of childhood experiences on future wages, main and interaction effects are plotted, and one-way and two-way ANOVA are used to test the effects for significance.

Descriptive Summary

A histogram of lwage76 shows that the log of wages is approximately normally distributed.
Boxplots of log wages over the two levels of each factor show, graphically, the effect of each factor on log wages.

Testing

The main effects on log wage for each of the four factors and the six two-way interactions will be plotted and computed. Each main effect will be displayed as a single line over the two levels of the factor being examined, with the response averaged over the levels of the other factors. Each interaction effect will be displayed as two lines, one for each level of the second factor. Significance will be tested with ANOVA for both main effects and two-way interactions. If the p-value is below 0.05, we will be able to reject the null hypothesis, which is that the factor has no effect on the response variable.

Main Effects

The main effects for each factor were plotted by subsetting the data over the levels for that factor. Below are the main effects for each factor:

Graphically, there are main effects for each of the factors. This can be seen in the slope of the lines over the levels. The calculated values for the main effects are:

smsa66= 0.2

nearc4= 0.15

libcrd14= 0.21

sinmom14= -0.13
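A main effect of this kind can be computed as the mean response at the "yes" level minus the mean response at the "no" level. The sketch below assumes that definition and uses a small invented data frame (`toy`, with hypothetical columns `lwage` and `factor_level`), so the result is not one of the values listed above.

```r
# Hypothetical data: log wages for people at the "no" and "yes" levels of one factor
toy <- data.frame(
  lwage        = c(6.0, 6.2, 6.4, 6.1, 6.3, 6.5),
  factor_level = factor(c("no", "no", "no", "yes", "yes", "yes"),
                        levels = c("no", "yes"))
)

# Mean response at each level of the factor
level_means <- tapply(toy$lwage, toy$factor_level, mean)
print(level_means)

# Main effect = mean at "yes" minus mean at "no"
main_effect <- unname(level_means["yes"] - level_means["no"])
print(round(main_effect, 2))
```

Because the response is on a log scale, a main effect of 0.2 corresponds to a ratio of about exp(0.2) ≈ 1.22 between the geometric-mean wages of the two levels.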

ANOVA must be performed on each of these main effects to test whether they are statistically significant.

## Analysis of Variance Table
## 
## Response: school$lwage76
##                 Df Sum Sq Mean Sq F value    Pr(>F)    
## school$smsa66    1  27.47 27.4665  146.18 < 2.2e-16 ***
## Residuals     3008 565.18  0.1879                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: school$lwage76
##                 Df Sum Sq Mean Sq F value    Pr(>F)    
## school$nearc4    1  15.87 15.8660  82.745 < 2.2e-16 ***
## Residuals     3008 576.78  0.1917                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: school$lwage76
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## school$libcrd14    1  29.13 29.1251  155.41 < 2.2e-16 ***
## Residuals       2995 561.31  0.1874                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: school$lwage76
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## school$sinmom14    1   4.64  4.6390  23.731 1.165e-06 ***
## Residuals       3008 588.00  0.1955                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This shows that all four main effects are statistically significant. Since each p-value is below 0.05, the null hypothesis can be rejected in every case. This indicates that each factor has a main effect on log wages.
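The link between the mean squares, the F value, and the p-value in these tables can be reproduced by hand. The sketch below uses the rounded figures printed in the sinmom14 table, so the result may differ slightly from the table in the last digits.

```r
# Values copied from the sinmom14 ANOVA table above (rounded as printed)
ms_factor <- 4.6390   # Mean Sq for sinmom14
ms_resid  <- 0.1955   # Mean Sq for Residuals
df_factor <- 1
df_resid  <- 3008

# The F statistic is the ratio of the two mean squares
f_value <- ms_factor / ms_resid

# The p-value is the upper tail of the F distribution at that statistic
p_value <- pf(f_value, df_factor, df_resid, lower.tail = FALSE)
print(c(F = f_value, p = p_value))

# Decision rule at alpha = 0.05: TRUE means the null hypothesis is rejected
alpha <- 0.05
print(p_value < alpha)
```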

Interaction Effects

To examine the interaction effects, each possible two-way interaction is plotted with log wages on the y-axis. The slopes of the lines show interaction: parallel lines indicate no interaction effect, while non-parallel lines (including lines that cross) indicate an interaction effect. ANOVA will be used to statistically test these interactions.
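What "parallel" means numerically can be sketched with a 2x2 table of cell means. The factor names `A` and `B` and all of the numbers below are invented for illustration.

```r
# Hypothetical cell means of log wage for two yes/no factors A and B
cell_means <- matrix(
  c(6.0, 6.2,   # B = no:  A = no, A = yes
    6.1, 6.5),  # B = yes: A = no, A = yes
  nrow = 2,
  dimnames = list(A = c("no", "yes"), B = c("no", "yes"))
)

# Effect of A (yes minus no) at each level of B; these are the slopes of the two lines
effect_A_at_B <- cell_means["yes", ] - cell_means["no", ]
print(effect_A_at_B)

# Equal slopes give parallel lines; a nonzero difference means the lines are not parallel
interaction_contrast <- unname(effect_A_at_B["yes"] - effect_A_at_B["no"])
print(interaction_contrast)
```

In this made-up example the two lines never cross, yet their slopes differ by 0.2, so an interaction is present in the cell means. Whether such a difference is statistically significant is what the ANOVA interaction term tests.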

## Analysis of Variance Table
## 
## Response: school$lwage76
##                                 Df Sum Sq Mean Sq  F value Pr(>F)    
## school$smsa66                    1  27.66 27.6557 152.0722 <2e-16 ***
## school$libcrd14                  1  18.00 18.0014  98.9853 <2e-16 ***
## school$smsa66:school$libcrd14    1   0.47  0.4689   2.5781 0.1085    
## Residuals                     2993 544.30  0.1819                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: school$lwage76
##                                   Df Sum Sq Mean Sq  F value    Pr(>F)    
## school$libcrd14                    1  29.13 29.1251 156.3637 < 2.2e-16 ***
## school$sinmom14                    1   3.09  3.0859  16.5670 4.818e-05 ***
## school$libcrd14:school$sinmom14    1   0.73  0.7282   3.9093   0.04811 *  
## Residuals                       2993 557.49  0.1863                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: school$lwage76
##                                 Df Sum Sq Mean Sq F value    Pr(>F)    
## school$sinmom14                  1   4.64  4.6390 24.3922 8.288e-07 ***
## school$nearc4                    1  16.09 16.0864 84.5846 < 2.2e-16 ***
## school$sinmom14:school$nearc4    1   0.23  0.2302  1.2103    0.2714    
## Residuals                     3006 571.69  0.1902                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: school$lwage76
##                               Df Sum Sq Mean Sq F value Pr(>F)    
## school$nearc4                  1  15.87 15.8660 84.9136 <2e-16 ***
## school$smsa66                  1  14.74 14.7383 78.8782 <2e-16 ***
## school$nearc4:school$smsa66    1   0.37  0.3691  1.9752   0.16    
## Residuals                   3006 561.67  0.1868                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: school$lwage76
##                                 Df Sum Sq Mean Sq  F value    Pr(>F)    
## school$smsa66                    1  27.47 27.4665 147.4256 < 2.2e-16 ***
## school$sinmom14                  1   4.97  4.9696  26.6742 2.565e-07 ***
## school$smsa66:school$sinmom14    1   0.16  0.1646   0.8834    0.3473    
## Residuals                     3006 560.04  0.1863                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: school$lwage76
##                                 Df Sum Sq Mean Sq F value    Pr(>F)    
## school$libcrd14                  1  29.13 29.1251 157.962 < 2.2e-16 ***
## school$nearc4                    1   9.45  9.4547  51.279 1.004e-12 ***
## school$libcrd14:school$nearc4    1   0.00  0.0017   0.009    0.9246    
## Residuals                     2993 551.85  0.1844                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It is difficult to tell graphically whether there are any interaction effects. Most of the lines appear to be parallel to each other. However, from the slopes it appears that libcrd14 x sinmom14 could have an interaction effect. ANOVA is performed to test for significance.

Based on ANOVA, there are no statistically significant interaction effects for five of the possible interactions:

smsa66 x libcrd14

sinmom14 x nearc4

nearc4 x smsa66

smsa66 x sinmom14

libcrd14 x nearc4

Based on ANOVA, there is statistical significance for one interaction effect:

libcrd14 x sinmom14 is statistically significant at the 0.05 level (p = 0.048), which indicates that an interaction effect is present.

4. References

Montgomery, Douglas C. Design and Analysis of Experiments, 8th Edition.

5. Appendices

Appendix A: Raw Data

The Schooling data set was used from the Ecdat package in R. More information on this data set can be found at: https://cran.r-project.org/web/packages/Ecdat/Ecdat.pdf

Appendix B: Complete R Code

#load Ecdat package
library("Ecdat")

#rename dataset as school
school <- Schooling

#show head of dataset 
head(school)

#show structure of dataset
str(school)

#show levels of 

#histogram of wages
hist(school$lwage76)

#boxplot of ME of smsa on log wage
boxplot(school$lwage76~school$smsa66, xlab="Lived in SMSA",ylab="Log Wage")
title("Lived in SMSA")

#boxplot of ME of proximity to 4-year college on log wage
boxplot(school$lwage76~school$nearc4, xlab="Lived near 4-Year College",ylab="Log Wage")
title("Lived near 4-Year College")

#boxplot of ME of library card on log wage
boxplot(school$lwage76~school$libcrd14, xlab="Had Library Card at Age 14",ylab="Log Wage")
title("Ownership of Library Card")

#boxplot of ME of single mother on log wage
boxplot(school$lwage76~school$sinmom14, xlab="Had a Single Mother at Age 14",ylab="Log Wage")
title("Single Mother")

#split smsa factor by levels
smsa_y <- subset(school,school$smsa66 == "yes")
smsa_n <- subset(school,school$smsa66 == "no")

#plot of ME of smsa on wage
plot(c(1,2), c(mean(smsa_n$lwage76),mean(smsa_y$lwage76)),type = 'l',
     main = "ME of SMSA on Wages",xlab = "Lived in SMSA",
     ylab = "Log Wages",xlim = c(.5,2.5),xaxt = "n")
#label x-axis
axis(1,at = c(1,2),labels = c("No","Yes")) 
#show means on plot
text(c(1,2) - .25,c(mean(smsa_n$lwage76),mean(smsa_y$lwage76)),
     round(c(mean(smsa_n$lwage76),mean(smsa_y$lwage76)),digits = 2))

#split proximity to 4-year college factor by levels
nearc4_y <- subset(school,school$nearc4 == "yes")
nearc4_n <- subset(school,school$nearc4 == "no")

#plot of ME of proximity to 4-year college on wage
plot(c(1,2), c(mean(nearc4_n$lwage76),mean(nearc4_y$lwage76)),type = 'l',
     main = "ME of Proximity to 4-year College on Wages",xlab = "Lived near 4 year college",
     ylab = "Log Wages",xlim = c(.5,2.5),xaxt = "n")
#label x-axis
axis(1,at = c(1,2),labels = c("No","Yes")) 
#show means on plot
text(c(1,2) - .25,c(mean(nearc4_n$lwage76),mean(nearc4_y$lwage76)),
     round(c(mean(nearc4_n$lwage76),mean(nearc4_y$lwage76)),digits = 2))

#split library card factor by levels
lib_y <- subset(school,school$libcrd14 == "yes")
lib_n <- subset(school,school$libcrd14 == "no")

#plot of ME of library card on wage
plot(c(1,2), c(mean(lib_n$lwage76),mean(lib_y$lwage76)),type = 'l',
     main = "ME of Library Card on Wages",xlab = "Had library card at age 14",
     ylab = "Log Wages",xlim = c(.5,2.5),xaxt = "n")
#label x-axis
axis(1,at = c(1,2),labels = c("No","Yes")) 
#show means on plot
text(c(1,2) - .25,c(mean(lib_n$lwage76),mean(lib_y$lwage76)),
     round(c(mean(lib_n$lwage76),mean(lib_y$lwage76)),digits = 2))

#split single mother factor by levels
mom_y <- subset(school,school$sinmom14 == "yes")
mom_n <- subset(school,school$sinmom14 == "no")

#plot of ME of single mother on wage
plot(c(1,2), c(mean(mom_n$lwage76),mean(mom_y$lwage76)),type = 'l',
     main = "ME of Single Mother on Wages",xlab = "Had a single mother at age 14",
     ylab = "Log Wages",xlim = c(.5,2.5),xaxt = "n")
#label x-axis
axis(1,at = c(1,2),labels = c("No","Yes")) 
#show means on plot
text(c(1,2) - .25,c(mean(mom_n$lwage76),mean(mom_y$lwage76)),
     round(c(mean(mom_n$lwage76),mean(mom_y$lwage76)),digits = 2))

#anova of smsa on wage
anova1 <- aov(school$lwage76~school$smsa66)
anova(anova1)

#anova of proximity to 4-year college on wage
anova2 <- aov(school$lwage76~school$nearc4)
anova(anova2)

#anova of library card on wage
anova3 <- aov(school$lwage76~school$libcrd14)
anova(anova3)

#anova of single mother on wage
anova4 <- aov(school$lwage76~school$sinmom14)
anova(anova4)

#interaction between smsa and library card on wage
int1 <- aov(school$lwage76~school$smsa66*school$libcrd14)
anova(int1)

#interaction plot of smsa and library card on wage
interaction.plot(school$libcrd14,school$smsa66,school$lwage76)

#interaction between library card and single mother on wage
int2 <- aov(school$lwage76~school$libcrd14*school$sinmom14)
anova(int2)

#interaction plot of library card and single mother on wage
interaction.plot(school$sinmom14,school$libcrd14,school$lwage76)

#interaction between single mother and near 4 year college on wage
int3 <- aov(school$lwage76~school$sinmom14*school$nearc4)
anova(int3)

#interaction plot of single mother and near 4 year college on wage
interaction.plot(school$nearc4,school$sinmom14,school$lwage76)

#interaction between near 4 year college and smsa on wage
int4 <- aov(school$lwage76~school$nearc4*school$smsa66)
anova(int4)

#interaction plot of near 4 year college and smsa on wage
interaction.plot(school$smsa66,school$nearc4,school$lwage76)

#interaction between smsa and single mother on wage
int5 <- aov(school$lwage76~school$smsa66*school$sinmom14)
anova(int5)

#interaction plot of smsa and single mother on wage
interaction.plot(school$sinmom14,school$smsa66,school$lwage76)

#interaction between library card and near 4 year college on wage
int6 <- aov(school$lwage76~school$libcrd14*school$nearc4)
anova(int6)

#interaction plot of library card and near 4 year college on wage
interaction.plot(school$nearc4,school$libcrd14,school$lwage76)