Recipes for the Design of Experiments

Wages for Males

Jane Braun

RPI

December 1st Version 1.0

1. Setting

System under test

The data used in this recipe was found using the “100+ interesting data sets” webpage. It is part of the R dataset package called “Ecdat,” which various datasets for the purpose of econometrics.

This dataset in particular has data for wages and education level of 4,300 young males in the united states from 1980 to 1987.

# Read in dataset from Ecdat package
#install.packages("Ecdat")
# library(Ecdat)
data(Males, package = "Ecdat")
x <- Males

Factors and Levels

The original data being examined has 4,360 observations of 12 variables.

head(x)

##   nr year school exper union  ethn maried health     wage
## 1 13 1980     14     1    no other     no     no 1.197540
## 2 13 1981     14     2   yes other     no     no 1.853060
## 3 13 1982     14     3    no other     no     no 1.344462
## 4 13 1983     14     4    no other     no     no 1.433213
## 5 13 1984     14     5    no other     no     no 1.568125
## 6 13 1985     14     6    no other     no     no 1.699891
##                      industry                          occupation
## 1 Business_and_Repair_Service                     Service_Workers
## 2            Personal_Service                     Service_Workers
## 3 Business_and_Repair_Service                     Service_Workers
## 4 Business_and_Repair_Service                     Service_Workers
## 5            Personal_Service      Craftsmen, Foremen_and_kindred
## 6 Business_and_Repair_Service Managers, Officials_and_Proprietors
##    residence
## 1 north_east
## 2 north_east
## 3 north_east
## 4 north_east
## 5 north_east
## 6 north_east

summary(x)

##        nr             year          school          exper       
##  Min.   :   13   Min.   :1980   Min.   : 3.00   Min.   : 0.000  
##  1st Qu.: 2329   1st Qu.:1982   1st Qu.:11.00   1st Qu.: 4.000  
##  Median : 4569   Median :1984   Median :12.00   Median : 6.000  
##  Mean   : 5262   Mean   :1984   Mean   :11.77   Mean   : 6.515  
##  3rd Qu.: 8406   3rd Qu.:1985   3rd Qu.:12.00   3rd Qu.: 9.000  
##  Max.   :12548   Max.   :1987   Max.   :16.00   Max.   :18.000  
##                                                                 
##  union         ethn      maried     health          wage       
##  no :3296   other:3176   no :2446   no :4286   Min.   :-3.579  
##  yes:1064   black: 504   yes:1914   yes:  74   1st Qu.: 1.351  
##             hisp : 680                         Median : 1.671  
##                                                Mean   : 1.649  
##                                                3rd Qu.: 1.991  
##                                                Max.   : 4.052  
##                                                                
##                              industry   
##  Manufacturing                   :1231  
##  Trade                           :1169  
##  Professional_and_Related Service: 333  
##  Business_and_Repair_Service     : 331  
##  Construction                    : 327  
##  Transportation                  : 286  
##  (Other)                         : 683  
##                                occupation            residence   
##  Craftsmen, Foremen_and_kindred     :934   rural_area     :  85  
##  Operatives_and_kindred             :881   north_east     : 733  
##  Service_Workers                    :509   nothern_central: 964  
##  Clerical_and_kindred               :486   south          :1333  
##  Professional, Technical_and_kindred:453   NA's           :1245  
##  Laborers_and_farmers               :401                         
##  (Other)                            :696

# As seen below, there are 452 observations of 12 variables, including: 5 factors and 7 numeric variables.
str(x)

## 'data.frame':    4360 obs. of  12 variables:
##  $ nr        : int  13 13 13 13 13 13 13 13 17 17 ...
##  $ year      : int  1980 1981 1982 1983 1984 1985 1986 1987 1980 1981 ...
##  $ school    : int  14 14 14 14 14 14 14 14 13 13 ...
##  $ exper     : int  1 2 3 4 5 6 7 8 4 5 ...
##  $ union     : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
##  $ ethn      : Factor w/ 3 levels "other","black",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ maried    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ health    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ wage      : num  1.2 1.85 1.34 1.43 1.57 ...
##  $ industry  : Factor w/ 12 levels "Agricultural",..: 7 8 7 7 8 7 7 7 4 4 ...
##  $ occupation: Factor w/ 9 levels "Professional, Technical_and_kindred",..: 9 9 9 9 5 2 2 2 2 2 ...
##  $ residence : Factor w/ 4 levels "rural_area","north_east",..: 2 2 2 2 2 2 2 2 2 2 ...

x$school <- as.numeric(x$school) # assign as a numeric variable
x$school <- cut(x$school, c(2, 7, 11, 17)) #create intervals out of continuous variable
x$school <- as.factor(x$school) #reassign as a factor 

x <- x[, c(3,6,9,11,12)] # eliminate all unused variables

Continuous variables

The continuous variables in this dataset are years of schooling, years of experience, and the log of the hourly wage.

Response variables

The response variable being analyzed in this dataset is the log of the hourly wage earned by the individual.

As seen below, the log of the hourly wage ranged from -3.579 to 4.052, with a mean of 1.649.

summary(x$wage)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -3.579   1.351   1.671   1.649   1.991   4.052

The Data

The data is originally consisted of 12 variables, but was subseted down to 5 columns: 4 factors, and 1 continuous variable.

Randomization

There was no randomization in this recipe because the data points are based on observations of wages among young American males.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

Response surface methods will be used to analyze which, if any, of the four factors are responsible for variation in the response variable, the log of hourly wage.

Randomize:

The experiment was based on observations, and was therefore not randomized.

Replicate:

The experiment had no replicates or repeated measures.

Block:

There was no blocking in this recipe.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

The boxplot below shows the log of hourly wages in response to various intervals of schooling. All three of the intervals show relatively similar medians, however, their ranges vary greatly. the 2-7 interval has the smallest range, whereas the 7-ll years interval has the largest range in wages.

par(mfrow=c(1,1))
attach(x)
plot(wage ~ school, xlab = "Interval of Years of School", ylab = "Hourly wages", main="Wages by Years of Schooling" )

The boxplot below shows how log of hourly wages is affected by ethnicity. Similarly to years of schooling, each of the three levels have similar medians, but vary in the ranges of wages. “Other” has the largest range, whereas “black” seems to have the smallest.

plot(wage ~ ethn, xlab = "Ethnicity", ylab = "Hourly wages", main="Wages by Ethnicity" )

The boxplot below shows how occupation may affect the log of hourly wages. As seen below, there seems to be some variation in both the medians and ranges of wages.

plot(wage ~ occupation, xlab = "Occupation", ylab = "Hourly wages", main="Wages by Occupation" )

The boxplot shows variaous residences/locations of the young males, and their associated wages. “Rural area” has the smallest range in wages. “South” seems to be relatively symmetric, with a large range. “Northern central” has a large range and seems to be negatively skewed. All of the medians of the groups, however, tend to be relatively similar.

plot(wage ~ residence, xlab = "Residence", ylab = "Hourly wages", main="Wages by Residence" )

Testing

Response Surface Methods

The response surface method below analyzed the log of hourly wages as a response to 4 different factors, with >2 levels.

The first step was to convert the levels of the factors into integers, in order to use the “rsm” function.

# Assign all factors as binary
x$school <- as.character(x$school)
x$school[x$school == "(2,7]"] <- 1
x$school[x$school == "(7,11]"] <- 2
x$school[x$school == "(11,17]"] <- 3

x$ethn <- as.character(x$ethn)
x$ethn[x$ethn == "other"] <- 1
x$ethn[x$ethn == "black"] <- 2
x$ethn[x$ethn == "hisp"] <- 3

x$occupation <- as.character(x$occupation)
x$occupation[x$occupation == "Professional, Technical_and_kindred"] <- 1
x$occupation[x$occupation == "Managers, Officials_and_Proprietors"] <- 2
x$occupation[x$occupation == "Sales_Workers"] <- 3
x$occupation[x$occupation == "Clerical_and_kindred"] <- 4
x$occupation[x$occupation == "Craftsmen, Foremen_and_kindred"] <- 5
x$occupation[x$occupation == "Operatives_and_kindred"] <- 6
x$occupation[x$occupation == "Laborers_and_farmers"] <- 7
x$occupation[x$occupation == "Farm_Laborers_and_Foreman"] <- 8
x$occupation[x$occupation == "Service_Workers"] <- 9

x$residence <- as.character(x$residence)
x$residence[x$residence == "rural_area"] <- 1
x$residence[x$residence == "north_east"] <- 2
x$residence[x$residence == "nothern_central"] <- 3
x$residence[x$residence == "south"] <- 4

x <- na.omit(x)

x$school <- as.integer(x$school)
x$ethn <- as.integer(x$ethn)
x$occupation <- as.integer(x$occupation)
x$residence <- as.integer(x$residence)

Next, a second order response surface was completed using the 4 factors.

#install.packages("rsm")
library("rsm")

## Warning: package 'rsm' was built under R version 3.1.2

wage.rsm <- rsm(wage~SO(ethn, school, occupation, residence), data=x)
summary(wage.rsm)

## 
## Call:
## rsm(formula = wage ~ SO(ethn, school, occupation, residence), 
##     data = x)
## 
##                         Estimate  Std. Error t value  Pr(>|t|)    
## (Intercept)           2.47914968  0.46575231  5.3229 1.094e-07 ***
## ethn                 -0.30036812  0.14965680 -2.0070  0.044831 *  
## school               -0.81263361  0.29217192 -2.7814  0.005446 ** 
## occupation           -0.04089938  0.03495928 -1.1699  0.242125    
## residence             0.15931298  0.11251746  1.4159  0.156907    
## ethn:school           0.00268620  0.02836155  0.0947  0.924549    
## ethn:occupation       0.00529164  0.00609105  0.8688  0.385048    
## ethn:residence       -0.02261332  0.01429618 -1.5818  0.113803    
## school:occupation     0.00071236  0.00895847  0.0795  0.936626    
## school:residence     -0.04906397  0.02214647 -2.2154  0.026803 *  
## occupation:residence  0.00283332  0.00461751  0.6136  0.539522    
## ethn^2                0.08875190  0.02959825  2.9986  0.002734 ** 
## school^2              0.24393150  0.05646101  4.3204 1.607e-05 ***
## occupation^2         -0.00175769  0.00147352 -1.1929  0.233019    
## residence^2          -0.00865855  0.01351586 -0.6406  0.521816    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Multiple R-squared:  0.1076, Adjusted R-squared:  0.1036 
## F-statistic:  26.7 on 14 and 3100 DF,  p-value: < 2.2e-16
## 
## Analysis of Variance Table
## 
## Response: wage
##                                            Df Sum Sq Mean Sq F value
## FO(ethn, school, occupation, residence)     4  85.02 21.2544 83.6452
## TWI(ethn, school, occupation, residence)    6   2.02  0.3366  1.3245
## PQ(ethn, school, occupation, residence)     4   7.93  1.9827  7.8026
## Residuals                                3100 787.72  0.2541        
## Lack of fit                               143  94.70  0.6623  2.8258
## Pure error                               2957 693.01  0.2344        
##                                             Pr(>F)
## FO(ethn, school, occupation, residence)  < 2.2e-16
## TWI(ethn, school, occupation, residence)    0.2424
## PQ(ethn, school, occupation, residence)   2.97e-06
## Residuals                                         
## Lack of fit                              < 2.2e-16
## Pure error                                        
## 
## Stationary point of response surface:
##       ethn     school occupation  residence 
##  1.9639364  1.7145060 -7.9458988  0.4774514 
## 
## Eigenanalysis:
## $values
## [1]  0.246329573  0.090006970 -0.001546365 -0.012523012
## 
## $vectors
##                   [,1]        [,2]         [,3]        [,4]
## ethn        0.01541906  0.99355424 -0.009046895  0.11193914
## school      0.99522013 -0.02579419  0.014650056  0.09304257
## occupation  0.00104265  0.02689393  0.986897301 -0.15908912
## residence  -0.09642639 -0.10705765  0.160428324  0.97647497

The null hypothesis for the ANOVA is that variation in the log of hourly wages cannot be explained by anything other than randomization. The p-values above for main effect show that ethnicity and years of schooling are significant for an alpha of 0.05. This is because the p-value is less than alpha, so we must reject the null, and suggest that something other than randomization explains the variation in wage. The interaction between years of school and residence is also significant, with a p-value of 0.027.

The stationary points are 1.96 (ethnicity), 1.715 (school), -7.946 (occupation), and 0.477 (residence).

We will now use contour plots to characterize the stationary points seen in the above summary of the response surfaces.

par(mfrow=c(2,3))
contour(wage.rsm, ~ethn + school + occupation + residence, image=TRUE, at=summary(wage.rsm$canoncial$xs))

The plots above show all of the second order response surfaces of the data. We will now look at each separately.

par(mfrow=c(1,1))
contour(wage.rsm, ~ethn + school, image=TRUE, at=summary(wage.rsm$canoncial$xs))

This second order response surface is shown for ethnicity and school. This appears to be a minima, because the values are increasing from this centralized point.

par(mfrow=c(1,1))
contour(wage.rsm, ~ethn + occupation, image=TRUE, at=summary(wage.rsm$canoncial$xs))

This stationary point seems to be a falling ridge.

par(mfrow=c(1,1))
contour(wage.rsm, ~ethn + residence, image=TRUE, at=summary(wage.rsm$canoncial$xs))

This stationary point is slightly more ambiguous. It appears as though it migh be a ridge, or part of a saddle point.

par(mfrow=c(1,1))
contour(wage.rsm, ~school + occupation, image=TRUE, at=summary(wage.rsm$canoncial$xs))

This stationary point seems to be also a falling ridge.

par(mfrow=c(1,1))
contour(wage.rsm, ~school + residence, image=TRUE, at=summary(wage.rsm$canoncial$xs))

This stationary point is a ridge.

par(mfrow=c(1,1))
contour(wage.rsm, ~occupation + residence, image=TRUE, at=summary(wage.rsm$canoncial$xs))

This stationary point seems to be a maxima on the bottom left corner of the plot.

Diagnostics/Model Adequacy Checking

Assumptions for ANOVA tests include that the data is normally distributed.

shapiro.test(x$wage)

## 
##  Shapiro-Wilk normality test
## 
## data:  x$wage
## W = 0.9525, p-value < 2.2e-16

The null of a Shapiro-Wilk test states that the data presented is normally distributed. Unfortunately, the p-value is less than an alpha of 0.05, and therefore, we must reject the null hypothesis. This suggests, instead, that the log of hourly wages is not normally distributed.

par(mfrow = c(1,1))
qqnorm(residuals(wage.rsm))
qqline(residuals(wage.rsm))

The Q-Q Normality Plots of the residuals for the model do not appear to be normal. When normal, the data points would lay linearly, and in this case, they do not seem to lie along the line. This suggests that the data may not be parametric.

plot(fitted(wage.rsm),residuals(wage.rsm))

The plots of the fitted values versus the residual values are supposed to appear symmetric over a residual of 0. The datapoints of the model do not appear to be symmetric over the length of the dynamic range, but seem to be mostly symmetric over 0.

4. Contingencies

Based on the model adequacy testing, it appears that the data is not normally distributed, and that this may not be the best fit of the data.

A Kruskal-Wallis test can be used to evaluate whether groups’ distributions are identical, without the assumption that the data is normally distributed.

With the kruskal-wallis test, we could only analyze the main effect of one factor at a time. I chose to analyze ethnicity and years of schooling.

kruskal.test(wage ~ ethn, data=x)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  wage by ethn
## Kruskal-Wallis chi-squared = 41.1272, df = 2, p-value = 1.173e-09

The p-value of this Kruskal-Wallis test is less than an alpha of 0.05. Therefore, you must reject the null - that variation in wage cannot be explained by anything other than randomization. This suggests that ethnicity may explain variation in the log of hourly wages.

kruskal.test(wage ~ school, data=x)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  wage by school
## Kruskal-Wallis chi-squared = 241.1581, df = 2, p-value < 2.2e-16

The p-value for the factor, years of schooling, of this Kruskal-Wallis test is less than an alpha of 0.05. Therefore, you must reject the null - that variation in wages cannot be explained by something other than randomization, and suggest an alternative - that the randomization can be explained by something other than randomization.

4. References

http://cran.r-project.org/web/packages/Ecdat/Ecdat.pdf

Wage for Males

Jane Braun

December 1st 2014

Recipe 9: Response Surface Methods

Recipes for the Design of Experiments

Wages for Males

Jane Braun

RPI

December 1st Version 1.0

1. Setting

System under test

Factors and Levels

Continuous variables

Response variables

The Data

Randomization

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

Randomize:

Replicate:

Block:

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

Testing

Response Surface Methods

Diagnostics/Model Adequacy Checking

4. Contingencies

4. References