What effect does union membership have on wages?
library(ggplot2)
library(reshape2)
library(knitr)
library(dplyr)
library(tidyr)
library(ggthemes)
I chose the Wages data set because, throughout my economics degree, the relationship between wages, experience, and schooling was continually used as an example for various econometric applications. Even with all the time spent discussing it, I never actually analyzed the data myself. This is the remedy for that.
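# stringsAsFactors = TRUE keeps the yes/no columns as factors on R >= 4.0, matching the str() output below
df = read.csv("https://raw.githubusercontent.com/kaiserxc/CUNY_MSDA/master/Wages.csv", stringsAsFactors = TRUE)
# Drop the first column, which is not one of the 12 variables used below
df = df[-c(1)]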
kable(head(df))
| exp | wks | bluecol | ind | south | smsa | married | sex | union | ed | black | lwage |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 32 | no | 0 | yes | no | yes | male | no | 9 | no | 5.56068 |
| 4 | 43 | no | 0 | yes | no | yes | male | no | 9 | no | 5.72031 |
| 5 | 40 | no | 0 | yes | no | yes | male | no | 9 | no | 5.99645 |
| 6 | 39 | no | 0 | yes | no | yes | male | no | 9 | no | 5.99645 |
| 7 | 42 | no | 1 | yes | no | yes | male | no | 9 | no | 6.06146 |
| 8 | 35 | no | 1 | yes | no | yes | male | no | 9 | no | 6.17379 |
names(df)
## [1] "exp" "wks" "bluecol" "ind" "south" "smsa" "married"
## [8] "sex" "union" "ed" "black" "lwage"
str(df)
## 'data.frame': 4165 obs. of 12 variables:
## $ exp : int 3 4 5 6 7 8 9 30 31 32 ...
## $ wks : int 32 43 40 39 42 35 32 34 27 33 ...
## $ bluecol: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 2 2 ...
## $ ind : int 0 0 0 0 1 1 1 0 0 1 ...
## $ south : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 1 1 1 ...
## $ smsa : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ married: Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ sex : Factor w/ 2 levels "female","male": 2 2 2 2 2 2 2 2 2 2 ...
## $ union : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
## $ ed : int 9 9 9 9 9 9 9 11 11 11 ...
## $ black : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ lwage : num 5.56 5.72 6 6 6.06 ...
summary(df)
## exp wks bluecol ind south
## Min. : 1.00 Min. : 5.00 no :2036 Min. :0.0000 no :2956
## 1st Qu.:11.00 1st Qu.:46.00 yes:2129 1st Qu.:0.0000 yes:1209
## Median :18.00 Median :48.00 Median :0.0000
## Mean :19.85 Mean :46.81 Mean :0.3954
## 3rd Qu.:29.00 3rd Qu.:50.00 3rd Qu.:1.0000
## Max. :51.00 Max. :52.00 Max. :1.0000
## smsa married sex union ed black
## no :1442 no : 773 female: 469 no :2649 Min. : 4.00 no :3864
## yes:2723 yes:3392 male :3696 yes:1516 1st Qu.:12.00 yes: 301
## Median :12.00
## Mean :12.85
## 3rd Qu.:16.00
## Max. :17.00
## lwage
## Min. :4.605
## 1st Qu.:6.395
## Median :6.685
## Mean :6.676
## 3rd Qu.:6.953
## Max. :8.537
We can see that the majority of observations are for men (3,696 of 4,165). Experience varies widely, from a minimum of 1 year to a maximum of 51, with a mean of 19.85 and first and third quartiles of 11 and 29. Education is less spread out, ranging from 4 to 17 years with a mean of 12.85. It is also tightly clustered: the first quartile and the median are both 12, and the third quartile is 16.
Log wages range from 4.605 to 8.537, with a mean of 6.676. In levels, that corresponds to wages of roughly 100 to 5,100.
Because even experienced mathematicians may have trouble converting log wages to actual wages in their heads, I added a new column with wages in levels. This won’t be especially useful for the analysis, since the log scale is a good fit for data that is strictly positive and spans a wide range. Taking logs also helps with linearity.
df$wage = exp(df$lwage)
df = rename(.data = df, metropolitan = smsa, manufacturing = ind, education = ed)
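As a quick sanity check on the conversion, the sketch below exponentiates the minimum and maximum of `lwage` reported by `summary(df)`. The names `lwage_range` and `wage_range` are only for illustration, and the inputs are the rounded values from the summary, so the results are approximate (roughly 100 and 5,100).

```r
# Rounded min and max of lwage, copied from the summary(df) output.
lwage_range <- c(min = 4.605, max = 8.537)

# exp() undoes the natural log, giving wages in their original units.
wage_range <- exp(lwage_range)
print(round(wage_range))
```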
ggplot(data = df) + geom_histogram(aes(x = wage)) + theme_economist()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = df) + geom_histogram(aes(x = lwage)) + theme_economist()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
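The `stat_bin()` message asks for an explicit bin width. One common choice is the Freedman-Diaconis rule, sketched here on simulated data so the block runs on its own. The data frame `sim`, its standard deviation of 0.45, and the helper name `fd_binwidth` are assumptions for illustration and not part of the original analysis; with the real data, `sim` would be replaced by `df`.

```r
library(ggplot2)

# Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3).
fd_binwidth <- function(x) 2 * IQR(x) / length(x)^(1 / 3)

# Simulated stand-in for lwage (mean taken from the summary; sd is assumed).
set.seed(42)
sim <- data.frame(lwage = rnorm(4165, mean = 6.676, sd = 0.45))

# Pass the computed width explicitly so ggplot2 no longer falls back to 30 bins.
bw <- fd_binwidth(sim$lwage)
p <- ggplot(sim, aes(x = lwage)) + geom_histogram(binwidth = bw)
print(p)
```

Other rules (Sturges, Scott) would give different widths; the point is only that the width is chosen deliberately rather than left at the default.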
ggplot(data = df, aes(x = exp, y = lwage)) + geom_point() + theme_economist()
ggplot(data = df, aes(x = education, y = lwage)) + geom_point() + theme_economist()
ggplot(data = df, aes(x = bluecol, y = lwage)) + geom_boxplot() + theme_economist()
ggplot(data = df, aes(x = union, y = lwage)) + geom_boxplot() + theme_economist()
We see some unsurprising results here. Men make more. The union box plot shows more variation, which makes sense: union jobs tend to be blue collar, which depresses wages, but union members would still earn more than non-union blue-collar workers.
reg = lm(lwage ~ . -wage, data = df)
summary(reg)
##
## Call:
## lm(formula = lwage ~ . - wage, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.20177 -0.23653 -0.00949 0.24248 1.97630
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.0878563 0.0691572 73.569 < 2e-16 ***
## exp 0.0103709 0.0005354 19.372 < 2e-16 ***
## wks 0.0049445 0.0011059 4.471 7.99e-06 ***
## bluecolyes -0.1486343 0.0149933 -9.913 < 2e-16 ***
## manufacturing 0.0530578 0.0120663 4.397 1.12e-05 ***
## southyes -0.0532101 0.0128247 -4.149 3.41e-05 ***
## metropolitanyes 0.1453039 0.0123480 11.767 < 2e-16 ***
## marriedyes 0.0660799 0.0210208 3.144 0.00168 **
## sexmale 0.3533016 0.0256743 13.761 < 2e-16 ***
## unionyes 0.1020758 0.0130870 7.800 7.79e-15 ***
## education 0.0571539 0.0026749 21.366 < 2e-16 ***
## blackyes -0.1671226 0.0225679 -7.405 1.58e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3577 on 4153 degrees of freedom
## Multiple R-squared: 0.4009, Adjusted R-squared: 0.3993
## F-statistic: 252.6 on 11 and 4153 DF, p-value: < 2.2e-16
Check to see if anything screwy is happening with the residuals. Heteroskedasticity does not seem to be a problem.
regf <- fortify(reg)
ggplot(regf, aes(x = .fitted, y = .resid)) + geom_point() + theme_economist()
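A residual plot is a visual check; a formal test can back it up. The sketch below computes Koenker's studentized version of the Breusch-Pagan test by hand in base R, on simulated data whose error spread deliberately grows with `x`. The data and all variable names are assumptions for illustration. It is not a result for the wage regression, where `fit` would be replaced by `reg`.

```r
# Simulated data where the error spread grows with x (assumed, for illustration).
set.seed(1)
n <- 500
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the original regressors.
e2 <- resid(fit)^2
aux <- lm(e2 ~ x)

# LM statistic = n * R^2, asymptotically chi-squared with k degrees of freedom,
# where k is the number of regressors excluding the intercept.
lm_stat <- n * summary(aux)$r.squared
k <- length(coef(aux)) - 1
p_value <- pchisq(lm_stat, df = k, lower.tail = FALSE)

cat("LM statistic:", lm_stat, " p-value:", p_value, "\n")
```

A small p-value rejects constant error variance. The `lmtest` package offers `bptest()` as a packaged alternative to this manual calculation.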
reg2 = lm(wage ~ . -lwage, data = df)
summary(reg2)
##
## Call:
## lm(formula = wage ~ . - lwage, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1228.6 -227.0 -52.4 161.2 4219.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -425.8648 71.0428 -5.994 2.21e-09 ***
## exp 9.8074 0.5499 17.833 < 2e-16 ***
## wks 2.9957 1.1361 2.637 0.008398 **
## bluecolyes -115.9502 15.4021 -7.528 6.27e-14 ***
## manufacturing 47.3383 12.3953 3.819 0.000136 ***
## southyes -37.2573 13.1743 -2.828 0.004706 **
## metropolitanyes 119.1433 12.6847 9.393 < 2e-16 ***
## marriedyes 44.4769 21.5939 2.060 0.039490 *
## sexmale 276.2223 26.3744 10.473 < 2e-16 ***
## unionyes 14.2916 13.4438 1.063 0.287816
## education 52.0817 2.7479 18.953 < 2e-16 ***
## blackyes -113.9089 23.1832 -4.913 9.29e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 367.4 on 4153 degrees of freedom
## Multiple R-squared: 0.3132, Adjusted R-squared: 0.3114
## F-statistic: 172.2 on 11 and 4153 DF, p-value: < 2.2e-16
regf2 <- fortify(reg2)
ggplot(regf2, aes(x = .fitted, y = .resid)) + geom_point() + theme_economist()
It’s pretty interesting to see the benefit of log wages on heteroskedasticity.
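From this analysis, union membership multiplies wages by exp(0.1020758) = 1.1074674, that is, union members earn about 10.7% more with the other variables held fixed. This effect is small, however, and in the levels regression the union coefficient (14.29) is not statistically significant (p = 0.288). From the box plots it seems that union membership decreases inequality: wages of union members are more similar to one another, while there is more diversity among non-unionized workers.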
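The conversion from a log-wage coefficient to a percentage effect can be sketched as follows. The estimate and standard error are copied from the `unionyes` row of `summary(reg)`; the variable names and the use of the normal critical value 1.96 for an approximate 95% interval are choices made for this illustration.

```r
# Estimate and standard error for unionyes, copied from summary(reg).
b  <- 0.1020758
se <- 0.0130870

# In a log-linear model, exp(b) is the multiplicative effect on wages.
ratio <- exp(b)
pct   <- (ratio - 1) * 100

# Approximate 95% interval: build it on the log scale, then transform.
ci_pct <- (exp(b + c(-1.96, 1.96) * se) - 1) * 100

cat(sprintf("ratio = %.4f, premium = %.1f%%\n", ratio, pct))
cat(sprintf("approx. 95%% interval: %.1f%% to %.1f%%\n", ci_pct[1], ci_pct[2]))
```

Building the interval on the log scale and exponentiating the endpoints keeps it consistent with the model that was actually estimated.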
We also see that union membership and blue-collar status are correlated,
print(cor(as.numeric(df$union), as.numeric(df$bluecol)))
## [1] 0.3784186
I hypothesize that this matters because blue-collar workers who belong to a union make more than their non-union blue-collar counterparts. However, white-collar workers would tend to be non-unionized and also make more.
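One way to probe this hypothesis would be to let the union premium differ by collar type with an interaction term. The sketch below uses simulated data with assumed effect sizes purely so that it runs on its own; the helper `yn`, the data frame `sim`, and every number in it are hypothetical and say nothing about the real data. On the real data, the model would be fit with `data = df`, ideally alongside the other controls.

```r
# Simulated data (assumed effect sizes), only to make the sketch runnable.
set.seed(7)
n <- 2000
yn <- function(z) factor(ifelse(z, "yes", "no"), levels = c("no", "yes"))

bluecol_flag <- rbinom(n, 1, 0.5) == 1
# Union membership is made more likely among blue-collar workers.
union_flag <- rbinom(n, 1, ifelse(bluecol_flag, 0.5, 0.2)) == 1

sim <- data.frame(bluecol = yn(bluecol_flag), union = yn(union_flag))
# Assumed truth: blue collar lowers log wages, union raises them for blue collar only.
sim$lwage <- 6.8 - 0.3 * bluecol_flag + 0.2 * (bluecol_flag & union_flag) +
  rnorm(n, sd = 0.3)

# union * bluecol expands to both main effects plus their interaction.
fit <- lm(lwage ~ union * bluecol, data = sim)
b <- coef(fit)

# Union premium among white-collar workers, and among blue-collar workers.
premium_white <- b[["unionyes"]]
premium_blue  <- b[["unionyes"]] + b[["unionyes:bluecolyes"]]
print(c(white_collar = premium_white, blue_collar = premium_blue))
```

If the hypothesis holds in the real data, the blue-collar premium would be clearly positive even when the pooled union coefficient looks small.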