The wages and education of young males were recorded from 1980 to 1987. The dataset has been obtained from the R package “Ecdat” and contains 4360 observations of 12 variables.
library(Ecfun) # Package needed for Ecdat
library(Ecdat)
data("Males") # Load dataset
The first 6 observations are shown below.
head(Males)
## nr year school exper union ethn maried health wage
## 1 13 1980 14 1 no other no no 1.197540
## 2 13 1981 14 2 yes other no no 1.853060
## 3 13 1982 14 3 no other no no 1.344462
## 4 13 1983 14 4 no other no no 1.433213
## 5 13 1984 14 5 no other no no 1.568125
## 6 13 1985 14 6 no other no no 1.699891
## industry occupation
## 1 Business_and_Repair_Service Service_Workers
## 2 Personal_Service Service_Workers
## 3 Business_and_Repair_Service Service_Workers
## 4 Business_and_Repair_Service Service_Workers
## 5 Personal_Service Craftsmen, Foremen_and_kindred
## 6 Business_and_Repair_Service Managers, Officials_and_Proprietors
## residence
## 1 north_east
## 2 north_east
## 3 north_east
## 4 north_east
## 5 north_east
## 6 north_east
The following 4 factors will be used in the model: industry, residence, union and occupation. The levels are shown below.
levels(Males$industry)
## [1] "Agricultural" "Mining"
## [3] "Construction" "Trade"
## [5] "Transportation" "Finance"
## [7] "Business_and_Repair_Service" "Personal_Service"
## [9] "Entertainment" "Manufacturing"
## [11] "Professional_and_Related Service" "Public_Administration"
levels(Males$residence)
## [1] "rural_area" "north_east" "nothern_central" "south"
levels(Males$union)
## [1] "no" "yes"
levels(Males$occupation)
## [1] "Professional, Technical_and_kindred"
## [2] "Managers, Officials_and_Proprietors"
## [3] "Sales_Workers"
## [4] "Clerical_and_kindred"
## [5] "Craftsmen, Foremen_and_kindred"
## [6] "Operatives_and_kindred"
## [7] "Laborers_and_farmers"
## [8] "Farm_Laborers_and_Foreman"
## [9] "Service_Workers"
The response variable is wage, defined as the logarithm of the hourly wage. It is a continuous variable.
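Because wage is stored as a logarithm, a difference between two groups on this scale corresponds to a ratio of hourly wages. The sketch below assumes the natural logarithm (the base is not stated here) and uses made-up log-wage values; `log_diff_to_ratio` is a hypothetical helper name introduced only for illustration.

```r
# Convert a difference in log wage into a ratio of hourly wages.
# Assumption: wage is the natural logarithm of the hourly wage.
log_diff_to_ratio <- function(log_a, log_b) {
  exp(log_a - log_b)
}

# Illustrative values only, not taken from the dataset
ratio <- log_diff_to_ratio(1.8, 1.6)
percent_higher <- (ratio - 1) * 100
print(round(percent_higher, 1))  # 22.1
```

Under this assumption, a gap of 0.2 on the log scale corresponds to an hourly wage that is roughly 22% higher, so small-looking differences in the boxplots can still be meaningful.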
There is a reference in the Ecdat package manual supposedly with information about the dataset. Unfortunately, the link is for an outdated website with no information about how the experiment was conducted or how the data was collected.
In this section, the levels of each factor are shown in boxplots to assess their influence on the response, wage, starting with industry. Keep in mind that wage is already a logarithm, so the y axes show log hourly wage.
boxplot(wage~industry,data=Males,las=2,ylab="wage") # las=2 rotates the axis labels
The median wage differs considerably between industries. Mining and Transportation are among the top-earning industries. The next factor is residence.
boxplot(wage~residence,data=Males,ylab="wage")
Here, the medians are very close. The variance for “south” appears to be higher than for the other levels.
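The visual impression from the boxplot can be checked numerically. The following is a minimal sketch that computes the median and variance of a variable within each level of a factor; `spread_by_group` is a hypothetical helper, the toy values are made up, and the call on `Males` only runs if the Ecdat package is installed.

```r
# Median and variance of a numeric vector within each level of a factor.
spread_by_group <- function(values, groups) {
  cbind(
    median   = tapply(values, groups, median, na.rm = TRUE),
    variance = tapply(values, groups, var, na.rm = TRUE)
  )
}

# Toy data to show the shape of the result (not from the dataset)
toy <- spread_by_group(c(1, 2, 3, 10, 20, 30),
                       factor(c("a", "a", "a", "b", "b", "b")))
print(toy)

# Applied to the real data, if Ecdat is available
if (requireNamespace("Ecdat", quietly = TRUE)) {
  data("Males", package = "Ecdat")
  print(spread_by_group(Males$wage, Males$residence))
}
```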
boxplot(wage~union,data=Males,ylab="wage",xlab="Was the person in a union?")
Union membership can influence an employee's wage, and this dataset is no exception: people who were in a union received a higher wage. However, on the log scale the difference is quite small.
boxplot(wage~occupation,data=Males,las=2,cex.axis=0.7,ylab="wage") # smaller, rotated axis labels
Farm laborers and foremen seem to have a lower wage than the others.
In this section, the main effect of each factor is determined using the function aov, which fits the model through lm. The result is presented in the following table.
m1 = aov(wage~industry+residence+union+occupation,data=Males)
summary(m1)
##               Df Sum Sq Mean Sq F value   Pr(>F)
## industry      11   99.2   9.019   39.00  < 2e-16 ***
## residence      3    9.3   3.091   13.36 1.15e-08 ***
## union          1    6.5   6.519   28.19 1.18e-07 ***
## occupation     8   52.8   6.597   28.52  < 2e-16 ***
## Residuals   3091  714.9   0.231
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1245 observations deleted due to missingness
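The note about deleted observations can be reconciled with the degrees of freedom in the table. As a quick arithmetic check, using only numbers reported above: the rows used in the fit are the total rows minus the deleted rows, and this count should also equal the residual degrees of freedom plus the factor degrees of freedom plus one for the intercept.

```r
n_total   <- 4360   # observations in the dataset
n_deleted <- 1245   # rows dropped because of missing values
n_used    <- n_total - n_deleted

df_factors  <- c(industry = 11, residence = 3, union = 1, occupation = 8)
df_residual <- 3091

# Residual df + factor df + 1 (intercept) should equal the rows used
n_from_df <- df_residual + sum(df_factors) + 1
print(c(n_used = n_used, n_from_df = n_from_df))
```

Both counts come out to 3115, so the model is fitted on 3115 of the 4360 observations.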
Using a similar function call, the interaction effects can be determined.
m2 = aov(wage~(industry+residence+union+occupation)^2,data=Males)
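In R's formula syntax, raising a sum of terms to the power 2 expands to all main effects plus all two-way interactions. The sketch below lists the terms that this formula generates; it only inspects the formula and does not need the data.

```r
# Expand the formula and list its terms (no data required)
f <- wage ~ (industry + residence + union + occupation)^2
term_labels <- attr(terms(f), "term.labels")
print(term_labels)

# Main effects contain no ":", two-way interactions contain one
n_main  <- sum(!grepl(":", term_labels, fixed = TRUE))
n_inter <- sum(grepl(":", term_labels, fixed = TRUE))
print(c(main = n_main, interactions = n_inter))
```

With four factors this gives 4 main effects and 6 two-way interactions, which is the set of rows that appears in the analysis of variance table for m2.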
Finally, analysis of variance is used to determine which factors have an influence on wage. The result is shown in the following table.
anova(m2)
## Analysis of Variance Table
##
## Response: wage
##                        Df Sum Sq Mean Sq F value    Pr(>F)
## industry               11  99.21  9.0194 42.1567 < 2.2e-16 ***
## residence               3   9.27  3.0908 14.4463 2.425e-09 ***
## union                   1   6.52  6.5190 30.4697 3.686e-08 ***
## occupation              8  52.78  6.5970 30.8342 < 2.2e-16 ***
## industry:residence     29  20.72  0.7147  3.3403 4.640e-09 ***
## industry:union         11   5.75  0.5231  2.4448 0.0049063 **
## industry:occupation    74  42.11  0.5691  2.6599 1.525e-12 ***
## residence:union         3   4.14  1.3799  6.4494 0.0002383 ***
## residence:occupation   24  10.60  0.4417  2.0646 0.0017254 **
## union:occupation        8   2.13  0.2663  1.2446 0.2685622
## Residuals            2942 629.44  0.2139
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this table, the F statistic is calculated for each factor and interaction. At a significance level of \(\alpha=0.05\), all four main effects and every interaction except union:occupation are statistically significant. For the factor “union”, the p-value of \(3.7\cdot10^{-8}\) means that a wage difference this large would be very unlikely to arise by chance alone if union membership had no effect on wage.
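Each F value in the table is the mean square of the term divided by the residual mean square, and the p-value is the upper tail of the F distribution with the corresponding degrees of freedom. As an illustration, the union row can be reproduced from the printed table entries; because those entries are rounded, small differences in the last digits are to be expected.

```r
# Values copied from the ANOVA table for m2 (rounded as printed)
ms_union    <- 6.5190
df_union    <- 1
ss_residual <- 629.44
df_residual <- 2942

ms_residual <- ss_residual / df_residual
f_union <- ms_union / ms_residual

# Upper-tail probability of the F distribution
p_union <- pf(f_union, df_union, df_residual, lower.tail = FALSE)
print(c(F = f_union, p = p_union))
```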
R Package Ecdat: https://cran.r-project.org/web/packages/Ecdat/index.html