Question:

What effect does union membership have on wages?

library(ggplot2)
library(reshape2)
library(knitr)
library(dplyr)
library(tidyr)
library(ggthemes)

Data Exploration:

I chose the Wages data set because, throughout my economics degree, wages, experience, and schooling were continually used as an example for various econometric applications. Even with all the time spent discussing them, I never did any analysis of the actual data. This is the remedy for that.

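# stringsAsFactors = TRUE keeps the yes/no columns as factors on R >= 4.0,
# which is what the str() and summary() output below assumes
df = read.csv("https://raw.githubusercontent.com/kaiserxc/CUNY_MSDA/master/Wages.csv", stringsAsFactors = TRUE)
df = df[-c(1)]  # drop the first column, which is not one of the 12 variables listed below
kable(head(df))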
exp  wks  bluecol  ind  south  smsa  married  sex   union  ed  black  lwage
  3   32  no         0  yes    no    yes      male  no      9  no     5.56068
  4   43  no         0  yes    no    yes      male  no      9  no     5.72031
  5   40  no         0  yes    no    yes      male  no      9  no     5.99645
  6   39  no         0  yes    no    yes      male  no      9  no     5.99645
  7   42  no         1  yes    no    yes      male  no      9  no     6.06146
  8   35  no         1  yes    no    yes      male  no      9  no     6.17379
names(df)
##  [1] "exp"     "wks"     "bluecol" "ind"     "south"   "smsa"    "married"
##  [8] "sex"     "union"   "ed"      "black"   "lwage"
str(df)
## 'data.frame':    4165 obs. of  12 variables:
##  $ exp    : int  3 4 5 6 7 8 9 30 31 32 ...
##  $ wks    : int  32 43 40 39 42 35 32 34 27 33 ...
##  $ bluecol: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 2 2 ...
##  $ ind    : int  0 0 0 0 1 1 1 0 0 1 ...
##  $ south  : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 1 1 1 ...
##  $ smsa   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ married: Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ sex    : Factor w/ 2 levels "female","male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ union  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
##  $ ed     : int  9 9 9 9 9 9 9 11 11 11 ...
##  $ black  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ lwage  : num  5.56 5.72 6 6 6.06 ...
summary(df)
##       exp             wks        bluecol         ind         south     
##  Min.   : 1.00   Min.   : 5.00   no :2036   Min.   :0.0000   no :2956  
##  1st Qu.:11.00   1st Qu.:46.00   yes:2129   1st Qu.:0.0000   yes:1209  
##  Median :18.00   Median :48.00              Median :0.0000             
##  Mean   :19.85   Mean   :46.81              Mean   :0.3954             
##  3rd Qu.:29.00   3rd Qu.:50.00              3rd Qu.:1.0000             
##  Max.   :51.00   Max.   :52.00              Max.   :1.0000             
##   smsa      married        sex       union            ed        black     
##  no :1442   no : 773   female: 469   no :2649   Min.   : 4.00   no :3864  
##  yes:2723   yes:3392   male  :3696   yes:1516   1st Qu.:12.00   yes: 301  
##                                                 Median :12.00             
##                                                 Mean   :12.85             
##                                                 3rd Qu.:16.00             
##                                                 Max.   :17.00             
##      lwage      
##  Min.   :4.605  
##  1st Qu.:6.395  
##  Median :6.685  
##  Mean   :6.676  
##  3rd Qu.:6.953  
##  Max.   :8.537

Commentary:

We can see that the majority of participants are male (3,696 of 4,165 observations). Experience varies widely, from a minimum of 1 year to a maximum of 51, with a mean of 19.85 and first and third quartiles of 11 and 29. Education is less spread out, ranging from 4 to 17 years with a mean of 12.85. It is also tightly clustered: the first quartile and the median are both 12, and the third quartile is 16.

Log wages range from 4.605 to 8.537, with a mean of 6.676. In levels, that is a range from about 100 to about 5,100.

Data Wrangling:

Because even experienced mathematicians may have trouble converting log wages to actual wages in their heads, I added a new column with actual wages. It won’t be very useful for the analysis itself, since the log scale is a good way to look at data that is strictly positive and spans a wide range. Taking logs also helps with linearity.

df$wage = exp(df$lwage)
df = rename(.data = df, metropolitan = smsa, manufacturing = ind, education = ed)
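
As a quick sanity check on the conversion, the sketch below exponentiates the log-wage figures reported by summary(df) above. The vector name lwage_summary is introduced here for illustration, and the numbers are copied from the printed output, so the results are rounded approximations rather than values recomputed from the data.

```r
# Log-wage summary values copied from the summary(df) output above
lwage_summary <- c(min = 4.605, median = 6.685, mean = 6.676, max = 8.537)

# exp() undoes the natural log, giving wages in levels
wage_levels <- exp(lwage_summary)
print(round(wage_levels))

# Caveat: exp(mean(lwage)) is the geometric mean of wages,
# which is generally not equal to mean(wage)

# log() takes us back to the original scale
roundtrip <- log(wage_levels)
```

The minimum comes out at roughly 100, which is a useful reminder of how compressed the log scale is compared with the raw wage column.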

Graphics:

ggplot(data = df) + geom_histogram(aes(x = wage)) + theme_economist()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = df) + geom_histogram(aes(x = lwage)) + theme_economist()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = df, aes(x = exp, y = lwage)) + geom_point() + theme_economist()

ggplot(data = df, aes(x = education, y = lwage)) + geom_point() + theme_economist()

ggplot(data = df, aes(x = bluecol, y = lwage)) + geom_boxplot() + theme_economist()

ggplot(data = df, aes(x = union, y = lwage)) + geom_boxplot() + theme_economist()

We see some unsurprising results here. Men make more. There is also more variation in the union plot, which makes sense: union jobs tend to be blue collar, which depresses wages, but union members should earn more than non-union blue-collar workers.
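
One way to look at the union and blue-collar story more directly is to compare mean log wages in each of the four union-by-bluecol cells. The sketch below only shows the mechanics: the data frame toy and every number in it are invented for illustration and say nothing about the real Wages data. Swapping in data = df would run the same summary on the actual data set.

```r
# Invented toy data (NOT the Wages data); column names mirror df
toy <- data.frame(
  union   = c("no", "no", "yes", "yes", "no", "no", "yes", "yes"),
  bluecol = c("no", "no", "no", "no", "yes", "yes", "yes", "yes"),
  lwage   = c(7.0, 6.8, 6.9, 6.7, 6.2, 6.4, 6.6, 6.5)
)

# Mean log wage for every combination of union and bluecol
cell_means <- aggregate(lwage ~ union + bluecol, data = toy, FUN = mean)
print(cell_means)

# Number of observations per cell, to see how unbalanced the groups are
cell_counts <- table(union = toy$union, bluecol = toy$bluecol)
print(cell_counts)
```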

reg = lm(lwage ~ . -wage, data = df)
summary(reg)
## 
## Call:
## lm(formula = lwage ~ . - wage, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.20177 -0.23653 -0.00949  0.24248  1.97630 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.0878563  0.0691572  73.569  < 2e-16 ***
## exp              0.0103709  0.0005354  19.372  < 2e-16 ***
## wks              0.0049445  0.0011059   4.471 7.99e-06 ***
## bluecolyes      -0.1486343  0.0149933  -9.913  < 2e-16 ***
## manufacturing    0.0530578  0.0120663   4.397 1.12e-05 ***
## southyes        -0.0532101  0.0128247  -4.149 3.41e-05 ***
## metropolitanyes  0.1453039  0.0123480  11.767  < 2e-16 ***
## marriedyes       0.0660799  0.0210208   3.144  0.00168 ** 
## sexmale          0.3533016  0.0256743  13.761  < 2e-16 ***
## unionyes         0.1020758  0.0130870   7.800 7.79e-15 ***
## education        0.0571539  0.0026749  21.366  < 2e-16 ***
## blackyes        -0.1671226  0.0225679  -7.405 1.58e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3577 on 4153 degrees of freedom
## Multiple R-squared:  0.4009, Adjusted R-squared:  0.3993 
## F-statistic: 252.6 on 11 and 4153 DF,  p-value: < 2.2e-16

Check to see if anything screwy is happening with the residuals. Heteroskedasticity does not seem to be a problem.

regf <- fortify(reg)
ggplot(regf, aes(x = .fitted, y = .resid)) + geom_point() + theme_economist()

reg2 = lm(wage ~ . -lwage, data = df)
summary(reg2)
## 
## Call:
## lm(formula = wage ~ . - lwage, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1228.6  -227.0   -52.4   161.2  4219.0 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -425.8648    71.0428  -5.994 2.21e-09 ***
## exp                9.8074     0.5499  17.833  < 2e-16 ***
## wks                2.9957     1.1361   2.637 0.008398 ** 
## bluecolyes      -115.9502    15.4021  -7.528 6.27e-14 ***
## manufacturing     47.3383    12.3953   3.819 0.000136 ***
## southyes         -37.2573    13.1743  -2.828 0.004706 ** 
## metropolitanyes  119.1433    12.6847   9.393  < 2e-16 ***
## marriedyes        44.4769    21.5939   2.060 0.039490 *  
## sexmale          276.2223    26.3744  10.473  < 2e-16 ***
## unionyes          14.2916    13.4438   1.063 0.287816    
## education         52.0817     2.7479  18.953  < 2e-16 ***
## blackyes        -113.9089    23.1832  -4.913 9.29e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 367.4 on 4153 degrees of freedom
## Multiple R-squared:  0.3132, Adjusted R-squared:  0.3114 
## F-statistic: 172.2 on 11 and 4153 DF,  p-value: < 2.2e-16
regf2 <- fortify(reg2)
ggplot(regf2, aes(x = .fitted, y = .resid)) + geom_point() +theme_economist()

It’s pretty interesting to see how much using log wages helps with heteroskedasticity.
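
The residual plots are an eyeball check. As a rough complement, here is a sketch of a studentized Breusch-Pagan style test written in base R and run on simulated data. Everything in it is an assumption for illustration: the helper bp_test is hypothetical (not from a package), and the simulated wages are generated with multiplicative errors, not taken from the Wages data. The same helper could be applied to reg and reg2.

```r
# Hypothetical helper: regress squared residuals on the model's regressors
# and use n * R^2 as a chi-squared statistic (studentized Breusch-Pagan)
bp_test <- function(model) {
  X   <- model.matrix(model)             # includes the intercept column
  u2  <- residuals(model)^2
  aux <- lm.fit(X, u2)
  r2  <- 1 - sum(aux$residuals^2) / sum((u2 - mean(u2))^2)
  stat <- length(u2) * r2
  dof  <- ncol(X) - 1
  c(statistic = stat, df = dof,
    p.value = pchisq(stat, df = dof, lower.tail = FALSE))
}

# Simulated data: log wage is linear in education with constant-variance noise,
# so the wage in levels has noise that grows with the fitted value
set.seed(42)
n         <- 5000
education <- runif(n, min = 4, max = 17)
lwage_sim <- 5 + 0.06 * education + rnorm(n, sd = 0.4)
wage_sim  <- exp(lwage_sim)

bp_levels <- bp_test(lm(wage_sim ~ education))
bp_logs   <- bp_test(lm(lwage_sim ~ education))
print(rbind(levels = bp_levels, logs = bp_logs))
```

A small p-value in the levels row and not in the logs row would be the numerical counterpart of the fan shape that appears in one residual plot but not the other.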

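From this analysis, union membership multiplies wages by exp(0.1020758) = 1.1074674, an increase of about 10.7% holding the other variables fixed. The effect is statistically significant in the log model but modest in size, and in the levels model the union coefficient is not significant (p = 0.288). From the box plots it seems that union membership decreases inequality: union members’ wages are more similar to one another, while there is more diversity among non-unionized workers.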
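
The arithmetic behind that interpretation can be sketched as follows. The coefficient and standard error are copied from the summary(reg) output above. The variable names and the normal-approximation interval are my own additions for illustration.

```r
# unionyes estimate and standard error, copied from summary(reg)
beta_union <- 0.1020758
se_union   <- 0.0130870

# In a log-linear model, a dummy's coefficient b implies a wage ratio of exp(b)
wage_factor <- exp(beta_union)
pct_effect  <- 100 * (wage_factor - 1)

# Approximate 95% interval: build it on the log scale, then transform
ci_log <- beta_union + c(-1, 1) * qnorm(0.975) * se_union
ci_pct <- 100 * (exp(ci_log) - 1)

print(round(c(factor = wage_factor, percent = pct_effect), 4))
print(round(ci_pct, 1))
```

Reading the raw coefficient as "about 10%" is a fine shortcut for small values, but the exp() conversion is the exact ratio and the gap between the two grows with larger coefficients such as sexmale.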

We also see that union membership and blue-collar status are positively correlated:

print(cor(as.numeric(df$union), as.numeric(df$bluecol)))
## [1] 0.3784186

I hypothesize that this is because blue-collar workers who belong to a union make more than their non-union blue-collar counterparts. However, white-collar workers also tend to be non-unionized and make more.
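
A note on what that correlation measures: as.numeric() on a two-level factor returns the codes 1 and 2, and the Pearson correlation of two binary variables is the phi coefficient of their 2x2 table. The sketch below checks this on two small invented factors (not the Wages data). The names union_toy and bluecol_toy are hypothetical.

```r
# Invented toy factors with the same "no"/"yes" levels as the real columns
union_toy   <- factor(c("no", "no", "yes", "yes", "no", "yes", "no", "no", "yes", "no"),
                      levels = c("no", "yes"))
bluecol_toy <- factor(c("no", "yes", "yes", "yes", "no", "yes", "no", "yes", "no", "no"),
                      levels = c("no", "yes"))

# Pearson correlation on the integer codes (1 = "no", 2 = "yes")
r <- cor(as.numeric(union_toy), as.numeric(bluecol_toy))

# Phi coefficient computed directly from the 2x2 contingency table
tab <- table(union_toy, bluecol_toy)
phi <- (tab["yes", "yes"] * tab["no", "no"] - tab["yes", "no"] * tab["no", "yes"]) /
  sqrt(prod(rowSums(tab)) * prod(colSums(tab)))

print(tab)
print(c(pearson = r, phi = phi))
```

Printing the table alongside the coefficient makes it easier to see which cells drive the association, which a single number hides.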