1. Setting

Wages and Education in Young Males

The wages and education of young males were recorded from 1980 to 1987. The dataset has been obtained from the R package “Ecdat” and contains 4360 observations of 12 variables.

library(Ecfun) # Package needed for Ecdat
library(Ecdat)
data("Males") # Load dataset

The first six observations are shown below.

head(Males)
##   nr year school exper union  ethn maried health     wage
## 1 13 1980     14     1    no other     no     no 1.197540
## 2 13 1981     14     2   yes other     no     no 1.853060
## 3 13 1982     14     3    no other     no     no 1.344462
## 4 13 1983     14     4    no other     no     no 1.433213
## 5 13 1984     14     5    no other     no     no 1.568125
## 6 13 1985     14     6    no other     no     no 1.699891
##                      industry                          occupation
## 1 Business_and_Repair_Service                     Service_Workers
## 2            Personal_Service                     Service_Workers
## 3 Business_and_Repair_Service                     Service_Workers
## 4 Business_and_Repair_Service                     Service_Workers
## 5            Personal_Service      Craftsmen, Foremen_and_kindred
## 6 Business_and_Repair_Service Managers, Officials_and_Proprietors
##    residence
## 1 north_east
## 2 north_east
## 3 north_east
## 4 north_east
## 5 north_east
## 6 north_east

Factors and levels

The following 4 factors will be used in the model: industry, residence, union and occupation. The levels are shown below.

levels(Males$industry)
##  [1] "Agricultural"                     "Mining"                          
##  [3] "Construction"                     "Trade"                           
##  [5] "Transportation"                   "Finance"                         
##  [7] "Business_and_Repair_Service"      "Personal_Service"                
##  [9] "Entertainment"                    "Manufacturing"                   
## [11] "Professional_and_Related Service" "Public_Administration"
levels(Males$residence)
## [1] "rural_area"      "north_east"      "nothern_central" "south"
levels(Males$union)
## [1] "no"  "yes"
levels(Males$occupation)
## [1] "Professional, Technical_and_kindred"
## [2] "Managers, Officials_and_Proprietors"
## [3] "Sales_Workers"                      
## [4] "Clerical_and_kindred"               
## [5] "Craftsmen, Foremen_and_kindred"     
## [6] "Operatives_and_kindred"             
## [7] "Laborers_and_farmers"               
## [8] "Farm_Laborers_and_Foreman"          
## [9] "Service_Workers"

Response variable

The response variable is wage, defined as the logarithm of the hourly wage. It is a continuous variable.
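Because the response is on a log scale, it can help to convert a few values back to hourly wages. The sketch below assumes the logarithm is the natural logarithm (this is an assumption, not something stated above) and applies exp() to the six wage values printed by head(Males).

```r
# Log hourly wages of the first six observations, copied from head(Males) above
log_wage <- c(1.197540, 1.853060, 1.344462, 1.433213, 1.568125, 1.699891)

# Assumption: wage is the natural logarithm of the hourly wage,
# so exp() recovers the wage on its original scale
hourly_wage <- exp(log_wage)
round(hourly_wage, 2)
```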

2. Experimental Design

The Ecdat package manual cites a reference for the dataset. Unfortunately, the link points to an outdated website that gives no information about how the experiment was conducted or how the data were collected.

3. Statistical Analysis

Exploratory data analysis

In this section, the levels of each factor are shown in a boxplot to assess their influence on the response, wage, starting with industry. Keep in mind that the y-axes show the logarithm of the wage.

boxplot(wage~industry,data=Males,las=2,ylab="wage") # boxplot() has no 'vertical' argument; boxes are vertical by default

The median wage differs considerably between industries. The top-earning industries include Mining and Transportation. The next factor is residence.

boxplot(wage~residence,data=Males,ylab="wage")

Here, the medians are very close. The variance for “south” appears to be higher than the other levels.

boxplot(wage~union,data=Males,ylab="wage",xlab="Was the person in a union?")

Union membership can influence an employee's wage, and this case is no different: people who were in a union received a higher wage. However, on the log scale the difference is quite small.
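A small gap on the log scale can still be a noticeable relative difference, because a difference of d in log wage corresponds to a wage ratio of exp(d). The gaps in the sketch below are hypothetical values chosen for illustration, not estimates from the data, and the natural logarithm is assumed.

```r
# Hypothetical differences in log wage between two groups (illustration only)
log_gap <- c(0.05, 0.10, 0.20)

# A difference d on the natural-log scale is a ratio exp(d) on the original scale
wage_ratio <- exp(log_gap)

# Relative difference in percent
percent_higher <- 100 * (wage_ratio - 1)
data.frame(log_gap,
           wage_ratio = round(wage_ratio, 3),
           percent_higher = round(percent_higher, 1))
```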

boxplot(wage~occupation,data=Males,las=2,cex.axis=0.7,ylab="wage") # boxplot() has no 'vertical' argument; boxes are vertical by default

Farm laborers and foremen seem to have a lower wage than the others.

Testing

In this section, the main effect of each factor is tested using the function aov. The result is presented in the following table.

m1 = aov(wage~industry+residence+union+occupation,data=Males)
summary(m1)
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## industry      11   99.2   9.019   39.00  < 2e-16 ***
## residence      3    9.3   3.091   13.36 1.15e-08 ***
## union          1    6.5   6.519   28.19 1.18e-07 ***
## occupation     8   52.8   6.597   28.52  < 2e-16 ***
## Residuals   3091  714.9   0.231                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1245 observations deleted due to missingness
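The entries of this table are linked by simple arithmetic, which can be checked by hand. The sketch below copies the rounded values for union and the residuals from the table above, so the recomputed numbers agree with the printed ones only up to rounding.

```r
# Rounded values copied from summary(m1) above
# (union has 1 degree of freedom, so its sum of squares equals its mean square)
ss_union <- 6.519; df_union <- 1
ss_resid <- 714.9; df_resid <- 3091

# Mean square = sum of squares / degrees of freedom
ms_union <- ss_union / df_union
ms_resid <- ss_resid / df_resid

# F value = mean square of the factor / residual mean square
f_union <- ms_union / ms_resid

# p-value = upper tail of the F distribution
p_union <- pf(f_union, df_union, df_resid, lower.tail = FALSE)
c(F = f_union, p = p_union)

# Degrees of freedom add up to the number of rows used minus one:
# 4360 observations - 1245 deleted due to missingness = 3115 rows
df_check <- sum(c(11, 3, 1, 8, 3091)) == 4360 - 1245 - 1
df_check
```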

Using a similar function call, the interaction effects can be determined.

m2 = aov(wage~(industry+residence+union+occupation)^2,data=Males)
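The ^2 operator in the formula expands to all main effects plus every two-way interaction. As a quick check, the expansion can be inspected without fitting anything, using terms() from base R on the same formula.

```r
# The same model formula as used for m2
f <- wage ~ (industry + residence + union + occupation)^2

# terms() expands the formula; no data are needed for this step
term_labels <- attr(terms(f), "term.labels")
term_labels

# 4 main effects + choose(4, 2) = 6 two-way interactions
length(term_labels) == 4 + choose(4, 2)
```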

Finally, analysis of variance is used to determine which factors have an influence on wage. The result is shown in the following table.

anova(m2)
## Analysis of Variance Table
## 
## Response: wage
##                        Df Sum Sq Mean Sq F value    Pr(>F)    
## industry               11  99.21  9.0194 42.1567 < 2.2e-16 ***
## residence               3   9.27  3.0908 14.4463 2.425e-09 ***
## union                   1   6.52  6.5190 30.4697 3.686e-08 ***
## occupation              8  52.78  6.5970 30.8342 < 2.2e-16 ***
## industry:residence     29  20.72  0.7147  3.3403 4.640e-09 ***
## industry:union         11   5.75  0.5231  2.4448 0.0049063 ** 
## industry:occupation    74  42.11  0.5691  2.6599 1.525e-12 ***
## residence:union         3   4.14  1.3799  6.4494 0.0002383 ***
## residence:occupation   24  10.60  0.4417  2.0646 0.0017254 ** 
## union:occupation        8   2.13  0.2663  1.2446 0.2685622    
## Residuals            2942 629.44  0.2139                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this table, the test statistic is calculated for each factor and interaction. At a significance level of \(\alpha=0.05\), all four main effects and every two-way interaction except union:occupation (p-value of about 0.27) are statistically significant. For the factor “union”, the p-value of about \(3.7\cdot10^{-8}\) means that, if union membership had no effect on wage, a difference this large would be very unlikely to arise by chance.
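To make the comparison with the significance level explicit, the p-values can be copied from the table and checked against \(\alpha\). This is a small sketch using the printed values; entries shown as "< 2.2e-16" are entered as that upper bound.

```r
alpha <- 0.05

# p-values copied from anova(m2); "< 2.2e-16" is entered as its upper bound
p_values <- c(
  industry               = 2.2e-16,
  residence              = 2.425e-09,
  union                  = 3.686e-08,
  occupation             = 2.2e-16,
  "industry:residence"   = 4.640e-09,
  "industry:union"       = 0.0049063,
  "industry:occupation"  = 1.525e-12,
  "residence:union"      = 0.0002383,
  "residence:occupation" = 0.0017254,
  "union:occupation"     = 0.2685622
)

# Terms that are not significant at level alpha
not_significant <- names(p_values)[p_values >= alpha]
not_significant
```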

4. References

R Package Ecdat: https://cran.r-project.org/web/packages/Ecdat/index.html