Using Bayesian Linear Regression to Predict Worker Wages

Background

Bayesian modeling provides variable selection techniques that assure convidence in variable selection. The study of how socio-economic factors affect income and wages provides ample opportunity to apply these techniques, while also offering insight into topics ranging from gender discrimination to the benefits of higher education. Below, the Bayesian Information Criterion (BIC) and Bayesian Model Averaging are applied to construct a parsimonious income prediction model.

The data were gathered as a random sample of 935 respondents throughout the United States. This data set was released as part of the series Instructional Stata Datasets for Econometrics by the Boston College Department of Economics (DATA - Wooldridge 2000). The dataset is available through the statsr package.

Load packages

The data will first be explored using the dplyr package as well as visualized using the ggplot2 package. Later, the MASS package to implement stepwise Bayesian linear regression and the BAS package will assist with Bayesian Model Averaging (BMA).

library(statsr)
library(MASS)
library(dplyr)
library(ggplot2)
library(BAS)

The data

data(wage)

The Wooldridge Datasets web page gives the provides the following table of variable descriptions (fmwww.bc.edu/ec-p/data/wooldrige/wage2.des):

variable	description
`wage`	weekly earnings (dollars)
`hours`	average hours worked per week
`IQ`	IQ score
`kww`	knowledge of world work score
`educ`	number of years of education
`exper`	years of work experience
`tenure`	years with current employer
`age`	age in years
`married`	=1 if married
`black`	=1 if black
`south`	=1 if live in south
`urban`	=1 if live in a Standard Metropolitan Statistical Area
`sibs`	number of siblings
`brthord`	birth order
`meduc`	mother’s education (years)
`feduc`	father’s education (years)
`lwage`	natural log of `wage`

Exploring the data

As with any new data set a good place to start is standard exploratory data analysis. Summary tables are an easy and informative first step.

# a summary table of all variables in the dataset - both continuous and categorical
summary(wage)

##       wage            hours             iq             kww       
##  Min.   : 115.0   Min.   :20.00   Min.   : 50.0   Min.   :12.00  
##  1st Qu.: 669.0   1st Qu.:40.00   1st Qu.: 92.0   1st Qu.:31.00  
##  Median : 905.0   Median :40.00   Median :102.0   Median :37.00  
##  Mean   : 957.9   Mean   :43.93   Mean   :101.3   Mean   :35.74  
##  3rd Qu.:1160.0   3rd Qu.:48.00   3rd Qu.:112.0   3rd Qu.:41.00  
##  Max.   :3078.0   Max.   :80.00   Max.   :145.0   Max.   :56.00  
##                                                                  
##       educ           exper           tenure            age        married
##  Min.   : 9.00   Min.   : 1.00   Min.   : 0.000   Min.   :28.00   0:100  
##  1st Qu.:12.00   1st Qu.: 8.00   1st Qu.: 3.000   1st Qu.:30.00   1:835  
##  Median :12.00   Median :11.00   Median : 7.000   Median :33.00          
##  Mean   :13.47   Mean   :11.56   Mean   : 7.234   Mean   :33.08          
##  3rd Qu.:16.00   3rd Qu.:15.00   3rd Qu.:11.000   3rd Qu.:36.00          
##  Max.   :18.00   Max.   :23.00   Max.   :22.000   Max.   :38.00          
##                                                                          
##  black   south   urban        sibs           brthord           meduc      
##  0:815   0:616   0:264   Min.   : 0.000   Min.   : 1.000   Min.   : 0.00  
##  1:120   1:319   1:671   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 8.00  
##                          Median : 2.000   Median : 2.000   Median :12.00  
##                          Mean   : 2.941   Mean   : 2.277   Mean   :10.68  
##                          3rd Qu.: 4.000   3rd Qu.: 3.000   3rd Qu.:12.00  
##                          Max.   :14.000   Max.   :10.000   Max.   :18.00  
##                                           NA's   :83       NA's   :78     
##      feduc           lwage      
##  Min.   : 0.00   Min.   :4.745  
##  1st Qu.: 8.00   1st Qu.:6.506  
##  Median :10.00   Median :6.808  
##  Mean   :10.22   Mean   :6.779  
##  3rd Qu.:12.00   3rd Qu.:7.056  
##  Max.   :18.00   Max.   :8.032  
##  NA's   :194

A histogram of the response variable (wage) gives an idea of what reasonable predictions should look like.

# simple histogram of the wage data
hist(wage$wage, main = 'Weekly Wages', xlab = 'Weekly Wage in Dollars', breaks = 30)

The histogram can also be used to get a general sense of where results become unlikely.

# checking the number of points in the "tails" of the plot
sum(wage$wage < 300)

## [1] 6

sum(wage$wage > 2000)

## [1] 20

```

Simple linear regression

Since weekly wage (‘wage’) is the response variable in this analysis, we would like to explore the relationship of the other variables as predictors. One possible, simplistic, explanation for the variation in wages that we see in the data is that smarter people make more money. The plot below visualizes a scatterplot between weekly wage and IQ score.

ggplot(wage, aes(iq, wage)) + geom_point() +geom_smooth() +
  geom_point(col = "steelblue", alpha = 1/2)

## `geom_smooth()` using method = 'loess'

It appears there may be a slight positive linear relationship between IQ score and wage, but IQ alone would not reliably predict wages. Nonetheless, the relationship can be quantified by fitting a simple linear regression which gives: \[wage_i = \alpha + \beta \cdot iq_i + \epsilon_i\]

m_wage_iq = lm(wage ~ iq, data = wage)
m_wage_iq$coefficients

## (Intercept)          iq 
##  116.991565    8.303064

\[wage_i = 116.99 + 8.3 \cdot iq_i + \epsilon_i\]

Before turing to Bayes to improve upon this meager model, note that Bayesian modeling assumes that the errors ($\epsilon_i$) are normally distributed with a constant variance. This assumption is checked by examining the distribution of the residuals for the model. If the residuals are highly non-normal or skewed, the assumption is violated and any subsequent inference is not valid. To check the assumptions, plot the resdiuals as follows:

# checking the normality assumption with a scatterplot 
# and a histogram of model error residuals
ggplot(data = m_wage_iq, aes(x = .fitted, y = .resid)) +
  geom_jitter() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

ggplot(data = m_wage_iq, aes(x = .resid)) +
  geom_histogram(binwidth = 100) +
  xlab("Residuals")

Variable transformation

Both plots show the residuals are right skewed. Thus, IQ (as it currently exists in the data set) should not be used as a Bayesian predictor. However, using a (natural) log transform on a skewed dependent variable with positive-only values will often correct the problem. Below, the model is refit with the wage variable transformed.

# fititng th model using natural log of IQ
m_lwage_iq = lm(lwage ~ iq, data = wage)
summary(m_lwage_iq)

## 
## Call:
## lm(formula = lwage ~ iq, data = wage)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.09324 -0.25547  0.02261  0.27544  1.21486 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.8869942  0.0890206   66.13   <2e-16 ***
## iq          0.0088072  0.0008694   10.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3999 on 933 degrees of freedom
## Multiple R-squared:  0.09909,    Adjusted R-squared:  0.09813 
## F-statistic: 102.6 on 1 and 933 DF,  p-value: < 2.2e-16

# residual sctterplot and histogram of transformed data
ggplot(data = m_lwage_iq, aes(x = .fitted, y = .resid)) +
  geom_jitter() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

ggplot(data = m_lwage_iq, aes(x = .resid)) +
  geom_histogram(binwidth = .1) +
  xlab("Residuals")

Good news! The residuals have indeed become roughly normally distributed. However, the resulting coefficient of IQ is very small (only 0.0088), which is expected since a one point increase in IQ score can hardly be expected to have much of an effect on wage. Further refinements are needed. Fortunately, the dataset contains lots more information.

Multiple linear regression and BIC

We can first include all the potential explanatory variables in a regression model in a rough attempt to explain as much wage variation as possible.

# run a linear model on all variables in thedataset using the '.' convention
m_lwage_full = lm(lwage ~ . - wage, data = wage)
summary(m_lwage_full)

## 
## Call:
## lm(formula = lwage ~ . - wage, data = wage)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.96887 -0.19460  0.00923  0.22401  1.34185 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.156439   0.225286  22.888  < 2e-16 ***
## hours       -0.006548   0.001934  -3.385 0.000754 ***
## iq           0.003186   0.001223   2.604 0.009425 ** 
## kww          0.003735   0.002390   1.562 0.118662    
## educ         0.041267   0.008942   4.615 4.74e-06 ***
## exper        0.010749   0.004435   2.424 0.015629 *  
## tenure       0.007102   0.002894   2.454 0.014401 *  
## age          0.009107   0.005977   1.524 0.128058    
## married1     0.200760   0.045998   4.365 1.48e-05 ***
## black1      -0.105141   0.055667  -1.889 0.059373 .  
## south1      -0.049076   0.030753  -1.596 0.111019    
## urban1       0.195658   0.031240   6.263 6.88e-10 ***
## sibs         0.009619   0.007876   1.221 0.222423    
## brthord     -0.018465   0.011569  -1.596 0.110975    
## meduc        0.009633   0.006167   1.562 0.118753    
## feduc        0.005590   0.005398   1.036 0.300805    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3507 on 647 degrees of freedom
##   (272 observations deleted due to missingness)
## Multiple R-squared:  0.2925, Adjusted R-squared:  0.2761 
## F-statistic: 17.84 on 15 and 647 DF,  p-value: < 2.2e-16

The above summary of the full linear model shows that many coefficients of independent variables are not statistically significant (see the p-values in the 4th numeric column). One way to select model variables is by using the Bayesian Information Criterion (BIC). BIC is a numeric assessment of model fit, that also penalizes a higher numbers of parameters in proportion to the sample size. Here is the BIC for the full linear model:

BIC(m_lwage_full)

## [1] 586.3732

A smaller BIC value indicates a better fit. Thus, BIC can calculated for a variety of reduced models, and then compared to the full model BIC to find the best model for the wage prediction job. Of course, R has a function that can systematically preform these BIC adjustments called stepAIC.

# calculate the model using stepAIC
step_model = stepAIC(lm(lwage ~ . - wage, data = na.omit(wage)), trace = FALSE, k = log(length(na.omit(wage))))
step_model

## 
## Call:
## lm(formula = lwage ~ hours + iq + educ + exper + tenure + age + 
##     married + black + south + urban + meduc, data = na.omit(wage))
## 
## Coefficients:
## (Intercept)        hours           iq         educ        exper  
##    5.043796    -0.006175     0.003734     0.045144     0.010216  
##      tenure          age     married1       black1       south1  
##    0.007286     0.012555     0.206907    -0.106256    -0.054877  
##      urban1        meduc  
##    0.201508     0.014247

# show the step-wise model's BIC
BIC(step_model)

## [1] 566.7482

Calling stepAIC finds the combination of variables that produce the lowest BIC, and provides their corfficients. Pretty nice.

Bayesian model averaging

Even with BIC at its lowest possible value, how certain can one be that the resulting model is the true “best fit”? The answer most likely depends on the size and stability of the underlying data. In these times of uncertainty, Bayesian model averaging (BMA) is helpful. BMA agerages multiple models to obtain posteriors of coefficients and predictions from new data. Below, BMA is applied to the wage data (after NA values are omitted).

# exlcude the NAs
wage_no_na = na.omit(wage)

# run the BMA, specifying BIC as the standard by which the resulting models are judged
bma_lwage = bas.lm(lwage ~ . -wage, data = wage_no_na,
                   prior = "BIC", 
                   modelprior = uniform())

# display the results
summary(bma_lwage)

##           P(B != 0 | Y)    model 1       model 2       model 3
## Intercept    1.00000000     1.0000     1.0000000     1.0000000
## hours        0.85540453     1.0000     1.0000000     1.0000000
## iq           0.89732383     1.0000     1.0000000     1.0000000
## kww          0.34789688     0.0000     0.0000000     0.0000000
## educ         0.99887165     1.0000     1.0000000     1.0000000
## exper        0.70999255     0.0000     1.0000000     1.0000000
## tenure       0.70388781     1.0000     1.0000000     1.0000000
## age          0.52467710     1.0000     1.0000000     0.0000000
## married1     0.99894488     1.0000     1.0000000     1.0000000
## black1       0.34636467     0.0000     0.0000000     0.0000000
## south1       0.32028825     0.0000     0.0000000     0.0000000
## urban1       0.99999983     1.0000     1.0000000     1.0000000
## sibs         0.04152242     0.0000     0.0000000     0.0000000
## brthord      0.12241286     0.0000     0.0000000     0.0000000
## meduc        0.57339302     1.0000     1.0000000     1.0000000
## feduc        0.23274084     0.0000     0.0000000     0.0000000
## BF                   NA     1.0000     0.5219483     0.5182769
## PostProbs            NA     0.0455     0.0237000     0.0236000
## R2                   NA     0.2710     0.2767000     0.2696000
## dim                  NA     9.0000    10.0000000     9.0000000
## logmarg              NA -1490.0530 -1490.7032349 -1490.7102938
##                 model 4       model 5
## Intercept     1.0000000     1.0000000
## hours         1.0000000     1.0000000
## iq            1.0000000     1.0000000
## kww           1.0000000     0.0000000
## educ          1.0000000     1.0000000
## exper         1.0000000     0.0000000
## tenure        1.0000000     1.0000000
## age           0.0000000     1.0000000
## married1      1.0000000     1.0000000
## black1        0.0000000     1.0000000
## south1        0.0000000     0.0000000
## urban1        1.0000000     1.0000000
## sibs          0.0000000     0.0000000
## brthord       0.0000000     0.0000000
## meduc         1.0000000     1.0000000
## feduc         0.0000000     0.0000000
## BF            0.4414346     0.4126565
## PostProbs     0.0201000     0.0188000
## R2            0.2763000     0.2762000
## dim          10.0000000    10.0000000
## logmarg   -1490.8707736 -1490.9381880

The resulting table shows the five most probable models, as well as the probability that each coefficient would be included in the true model. We see that birth order and presence of siblings are the least likely variables to be included while education and the IQ variables are locks. The BMA model rankings can also be visualized using an image plot which clearly shows which variables are in all models, which variables are excluded from all models, and those in between.

image(bma_lwage, top.models = 5, rotate=F, cex.axis = 1)

We can also provide 95% confidence intervals for the model coefficients. The results below bolster decisions about the coefficients to include or exclude. For example, where zero is included in the interval, there is substantial evidence favoring exclusion of the variable.

coef_lwage = coefficients(bma_lwage)
confint(coef_lwage)

##                    2.5%        97.5%          beta
## Intercept  6.785810e+00 6.839245e+00  6.8142970694
## hours     -9.331613e-03 0.000000e+00 -0.0053079979
## iq         0.000000e+00 6.271139e-03  0.0037983313
## kww       -6.260177e-06 8.482645e-03  0.0019605787
## educ       2.228374e-02 6.649064e-02  0.0440707549
## exper     -2.867630e-06 2.102556e-02  0.0100264057
## tenure     0.000000e+00 1.280466e-02  0.0059357058
## age        0.000000e+00 2.582214e-02  0.0089659753
## married1   1.179266e-01 2.990897e-01  0.2092940731
## black1    -1.867578e-01 2.357883e-05 -0.0441863361
## south1    -1.030339e-01 0.000000e+00 -0.0221757978
## urban1     1.323568e-01 2.573709e-01  0.1981221313
## sibs       0.000000e+00 0.000000e+00  0.0000218455
## brthord   -1.961975e-02 9.150558e-05 -0.0019470674
## meduc      0.000000e+00 2.271478e-02  0.0086717156
## feduc      0.000000e+00 1.571052e-02  0.0025125930
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

Prediction Using the BAS Package

With the model constructed, pediction is just a matter of plugging in data:

# example prediction data using a fictional worker's stats
worker <- data.frame(hours = 44, iq = 88, kww = 33, educ = 11, exper = 11,
                     tenure = 7, age = 44, married = as.factor(0), black = as.factor(0), 
                     south = as.factor(0), urban = as.factor(1), sibs = 3, brthord = 3, 
                     meduc = 7, feduc = 5, std_iq = -0.6524859)

# making the prediction
worker_predict = predict(step_model, newdata = worker, estimator = "BMA")

# translating the result back to dollars
exp(worker_predict)

##        1 
## 745.7072

This made up worker is predicted to be paid $745/week. How accurate is that? You’d have to ask her, but we’re confident in our variable selection and did our best with the available data. The Bayesian techniques applied here provide confidence in the result.

Acknowledgements

Dr. Merlise Clyde of Duke University is the primary author of the ‘BAS’ package. Drs. Clyde, Mine Cetinkaya-Rundel, and David Banks provided the inspiration and resources for this post through their online course Bayesian Statistics. That course is available at www.coursera.com.

DATA - Wooldridge, Jeffrey. 2000. Introductory Econometrics- A Modern Approach. South-Western College Publishing. http://fmwww.bc.edu/ec-p/data/wooldridge/wage2.dta.