Complete all Exercises, and submit answers to Questions in the Quiz: Week 4 of Lab on Coursera.

Modeling Wages

In the field of labor economics, the study of income and wages provides insight about topics ranging from gender discrimination to the benefits of higher education. In this lab, we will analyze cross-sectional wage data in order to practice using Bayesian selection methods such as BIC and Bayesian Model Averaging to construct parsimonious predictive models.

Getting Started

In this lab we will explore the data using the dplyr package and visualize it using the ggplot2 package for data visualization. Both of these packages are part of the tidyverse. We will review simple linear regression using the lm function and how the output can be interpreted from a Bayesian perspective. We will also use the broom package to turn regression outputs into tidy data frames to help with diagnostic plotting. We will use the stepAIC function from the MASS package for step-wise model selection based on BIC, and later in the lab we will use the bas.lm function from the BAS package to implement Bayesian Model Averaging. Please make sure that the version of BAS is 1.4.9 or greater. The data can be found in the companion package for this course, statsr. Some learners may want to review material from the earlier courses in the specialization that cover EDA and regression if they are unfamiliar with ggplot basics or the lm function.

Load packages

Let’s load the packages that we will be using:

library(MASS)
library(tidyverse)
library(statsr)
library(BAS)
library(broom)

options(width=100)

The data

The data we will be using in this lab were gathered as a random sample of 935 respondents throughout the United States. This data set was released as part of the series Instructional Stata Datasets for Econometrics by the Boston College Department of Economics [@Wooldridge2000].

Let’s start by loading the data:

data(wage)

variable   description
---------  ------------------------------------------------------
wage       weekly earnings (dollars)
hours      average hours worked per week
iq         IQ score
kww        knowledge of world of work score
educ       number of years of education
exper      years of work experience
tenure     years with current employer
age        age in years
married    =1 if married
black      =1 if black
south      =1 if live in south
urban      =1 if live in a Standard Metropolitan Statistical Area
sibs       number of siblings
brthord    birth order
meduc      mother's education (years)
feduc      father's education (years)
lwage      natural log of wage

Is this an observational study or an experiment? You may refer to http://study.com/academy/lesson/experiments-vs-observational-studies.html for the definitions of the two.

  • Observational study –>correct
  • Experiment

Setting a seed

In this lab we will do some random generation, which means you should set a seed at the top of your document. Setting a seed causes R to draw the same random sample each time you knit your document. This makes sure your results don't change each time you knit, and it also ensures reproducibility of your work (by setting the same seed it will be possible to reproduce your results). You can set a seed like this:

set.seed(18382)

The number above is completely arbitrary. If you need inspiration, you can use your ID, your birthday, or just a random string of numbers. The important thing is that you use each seed only once. You only need to do this once in your R Markdown document, and make sure it comes before any sampling.

Exploring the data

For a new data set, a good place to start is standard exploratory data analysis. We will begin with the wage variable since it will be the response variable in our models. We may use a histogram to visualize the distribution.

ggplot(data = wage, aes(x = wage)) +
  geom_histogram(binwidth = 100)

For numeric summary statistics, the summary function provides additional insights about the distribution of wage.

summary(wage$wage)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   115.0   669.0   905.0   957.9  1160.0  3078.0

Which of the following statements is false about the distribution of weekly wages?

  • The median of the distribution is 905.
  • 25% of respondents make at least 1160 dollars per week.
  • 10 of the respondents make strictly less than 300 dollars per week –>correct
  • wage is right-skewed, meaning that more respondents have weekly wages below the mean weekly wage than above it.
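
If you would like to verify claims like these yourself, a short dplyr sketch (using the wage data loaded above) counts the respondents earning strictly less than 300 dollars per week:

# Number of respondents with weekly wages strictly below 300 dollars
wage %>%
  filter(wage < 300) %>%
  nrow()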

Simple linear regression

Since wage is our response variable, we would like to explore the relationship between wage and other variables as predictors. One possible, simplistic explanation for the variation in wages that we see in the data is that smarter people make more money. The scatterplot below shows weekly wage versus IQ score.

ggplot(data = wage, aes(x = iq, y = wage)) +
  geom_point()

There appears to be a positive relationship between IQ score and wage. We can quantify this by fitting a Bayesian simple linear regression

\[\text{wage}_i = \alpha + \beta \cdot \text{iq}_i + \epsilon_i\] to the observed data using the reference prior. We can fit the model using the lm function:

m_wage_iq <- lm(wage ~ iq, data = wage)

and extract the summary statistics for the posterior distribution from the lm output by applying the tidy function from the broom package.

tidy(m_wage_iq)
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   117.      85.6        1.37 1.72e- 1
## 2 iq              8.30     0.836      9.93 3.79e-22

The estimate column displays the posterior means of the linear model's y-intercept and of the regression coefficient of iq.

With this we can write down the posterior mean of the regression line \[116.992 + 8.303 \times \text{IQ}\] and create a scatterplot with the posterior mean for the regression line laid on top.

ggplot(data = wage, aes(x = iq, y = wage)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)

Under the assumption that the errors \(\epsilon_i\) are independent and normally distributed with mean zero and an unknown variance \(\sigma^2\), the posterior distributions for the intercept and slope will have a Student-t distribution under the reference prior with the posterior means and scales equal to the ordinary least squares estimates and standard errors respectively. We can create 95% credible intervals for the two parameters using the confint function:

confint(m_wage_iq)
##                  2.5 %     97.5 %
## (Intercept) -51.080781 285.063910
## iq            6.661631   9.944498

Fit a new model that uses educ (education) to predict average weekly wages. Using the estimates from the R output, write the equation of the posterior mean of the regression line and obtain a 95% credible interval for the coefficients. What does the slope tell us in the context of the relationship between education and earnings?

  • Each additional year of education increases weekly wages by $60.21.
  • Each additional year of education increases weekly wages by $146.95.
  • For each additional year of education, there is a 95% chance that average weekly wages will possibly decrease by $5.56 or increase by $299.47.
  • For each additional year of education, there is a 95% chance that average weekly wages will increase by $49.04 to $71.39.

–>correct

m_wage_edu <- lm(wage ~ educ, data = wage)

tidy(m_wage_edu)
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    147.      77.7       1.89 5.89e- 2
## 2 educ            60.2      5.69     10.6  9.35e-25
confint(m_wage_edu)
##                2.5 %    97.5 %
## (Intercept) -5.56393 299.46881
## educ        49.03783  71.39074

Model diagnostics

The Bayesian model specification assumes that the errors are normally distributed with a constant variance and that the mean expected weekly wage is linear in IQ. We can check these assumptions by examining the distribution of the residuals for the model.

In order to do so we will use the predicted values, residuals, and standardized residuals of the model we fit earlier. The augment function in the broom package is going to come in handy here as it takes in a model object (the output of an lm) and returns a data frame with columns corresponding to the variables in the model as well as predicted values (.fitted), residuals (.resid), and standardized residuals (.std.resid), along with a few others.

m_wage_iq_aug <- augment(m_wage_iq)

Linearity and Constant Variance: You already checked if the relationship between weekly wages and IQ is linear using a scatterplot. We should also verify this condition with a plot of the residuals vs. fitted (predicted) values.

ggplot(data = m_wage_iq_aug, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals")

Also note that we're getting fancy with the code here. We set the alpha level of our points to a value lower than 1 (0.6 to be precise) in order to plot the points with some transparency. This will allow us to more easily identify where the points are more dense vs. more sparse. Then, we overlay a horizontal dashed line at \(y = 0\) (to help us check whether residuals are distributed evenly around 0 at each fitted value), and we also adjust the axis labels to be more informative.

Normality: To check this condition, we can look at a histogram of residuals

ggplot(data = m_wage_iq_aug, aes(x = .resid)) +
  geom_histogram(binwidth = 100) +
  xlab("Residuals")

or a normal probability plot of the residuals

ggplot(m_wage_iq_aug) +
  geom_qq(aes(sample = .std.resid)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(x = "Theoretical quantiles", y = "Standardized residuals")

where we expect the points to be close to the dashed line if the assumption of normality holds. Note that the \(y\)-axis in the plot uses standardized residuals, which are the residuals divided by their standard deviations, so that they have a normal distribution with mean zero and constant variance if the model holds.

Which of the following statements about the residual plots are false?

  • The residuals appear to be randomly distributed around 0.
  • The residuals are strongly left skewed, hence the normal distribution of errors condition is not met. –>correct
  • The variability of the residuals appears to increase as the fitted values increase, suggesting that the constant variance assumption does not hold.
  • There are more individuals for whom the model under-predicts weekly wages than individuals for whom it over-predicts weekly wages.
ggplot(m_wage_iq_aug, aes(x = .resid)) +
  geom_histogram(bins = 50)

Refit the model by using educ (education) as the independent variable. Does your answer to the previous exercise change?

ggplot(augment(m_wage_edu), aes(x = .resid)) +
  geom_histogram(bins = 50)

Linear Regression After Transforming wage

One way to accommodate the right-skewness in the residuals is to (natural) log-transform the dependent variable. Note that this is only possible if the variable is strictly positive, since the log of a negative value is not defined and \(\ln(0) = -\infty\). Let us try to fit a linear model with log-wage (lwage) as the dependent variable. The next two questions will be based on this log-transformed model.

m_lwage_iq <- lm(lwage ~ iq, data = wage)

Examine the residuals of this model. Is the assumption of normally distributed residuals reasonable?

m_lwage_iq_aug <- augment(m_lwage_iq)

ggplot(m_lwage_iq_aug, aes(x= .fitted, y = .resid)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0) +
  labs(x = "Fitted Values", y = 'Residuals')

ggplot(m_lwage_iq_aug, aes(x = .resid)) +
  geom_histogram(alpha = 0.6, bins = 30) +
  xlab("Residuals")

ggplot(m_lwage_iq_aug) +
  geom_qq(aes(sample = .std.resid)) +
  geom_abline(slope = 1, intercept = 0, linetype = 'dashed') +
  labs(x = "Theoretical quantiles", y = "Standardized residuals")

Excluding wage and lwage, select two other variables that you think might be good predictors of lwage. Visualize their relationships with lwage and check the model assumptions using appropriate plots.

# exper variables
m_lwage_exper <- lm(lwage ~ exper, data = wage)

confint(m_lwage_exper)
##                    2.5 %      97.5 %
## (Intercept)  6.679618899 6.832520731
## exper       -0.004200756 0.008167327
m_lwage_exper_aug <- augment(m_lwage_exper)

ggplot(m_lwage_exper_aug, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = 'dashed') +
  labs(x = 'Fitted Values', y = 'Residuals')

ggplot(m_lwage_exper_aug, aes(x = .resid)) +
  geom_histogram(bins = 30) +
  xlab('Residuals')

ggplot(m_lwage_exper_aug, aes(sample = .resid)) +
  stat_qq()

# hours variable
m_lwage_hours <- lm(lwage ~ hours, data = wage)

confint(m_lwage_hours)
##                    2.5 %       97.5 %
## (Intercept)  6.733368440 7.0664860483
## hours       -0.006493986 0.0009886344
m_lwage_hours_aug <- augment(m_lwage_hours)

ggplot(m_lwage_hours_aug, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = 'dashed') +
  labs(x = 'Fitted Values', y = 'Residuals')

ggplot(m_lwage_hours_aug, aes(x = .resid)) +
  geom_histogram(bins = 30) +
  xlab('Residuals')

ggplot(m_lwage_hours_aug, aes(sample = .resid)) +
  stat_qq()

Outliers

We declared observations to be outliers with respect to the population model if their deviation or error \(\epsilon_i\) was more than \(k=3\) standard deviations above or below 0. Let's use the Bayes.outlier function from BAS to calculate these probabilities for the model m_lwage_iq and plot them against the case number.

We start by calculating the probabilities,

outliers <- Bayes.outlier(m_lwage_iq, k = 3)
names(outliers)
## [1] "e"            "hat"          "prob.outlier" "prior.prob"

and then store the results in a data frame and plot them.

outliers_df <- data.frame(probability = outliers$prob.outlier,
                          case = 1:length(outliers$prob.outlier))
ggplot(outliers_df, aes(ymax = probability, x = case)) +
  geom_linerange(ymin = 0) +
  labs(y = "Probability")

To identify which cases have probabilities greater than 0.50 of being an outlier, we can use the filter function to return which cases have probability > 0.50.

outliers_df %>%
  filter(probability > 0.50)
##   probability case
## 1   0.9937375  434
## 2   1.0000000  514
## 3   0.6610078  616
## 4   0.5309985  705
## 5   1.0000000  784

Using the definition of outlier above, which statement is false?

  • Case 434 has a probability of close to 1 that it is an outlier under the normal error model for regressing lwage on iq
  • Case 514 has a probability of close to 1 that it is an outlier under the normal error model for regressing lwage on iq
  • Case 616 has a probability of close to 1 that it is an outlier under the normal error model for regressing lwage on iq –>correct
  • Case 784 has a probability of close to 1 that it is an outlier under the normal error model for regressing lwage on iq

While an error of more than \(3\) standard deviations above or below zero seems like an extremely unlikely event for a single observation, for large sample sizes there is often a rather high prior probability that at least one error \(\epsilon_i\) will exceed \(3\) standard deviations. We can calculate this as follows

# prob of a case being an outlier:
#   being below or above 3 standard deviations from 0
(prob_outlier <- pnorm(-3) + pnorm(3, lower.tail = FALSE))
## [1] 0.002699796
# probability of a single case not being an outlier is therefore the complement
(prob_not_outlier <- 1 - prob_outlier)
## [1] 0.9973002
# probability of no outliers in the sample of n assuming errors are independent a priori
n <- nrow(wage)
(prob_no_outliers <- prob_not_outlier^n)
## [1] 0.07984061
# probability of at least one outlier in the sample is the complement of the 
# probability of no outliers in the sample of n
1 - prob_no_outliers
## [1] 0.9201594

With a sample size of 935 and using \(3\) standard deviations to define outliers, the chance of having at least one outlier in the sample is 92.02% so the fact that we did discover some outliers is not that surprising.

So instead of fixing the number of standard deviations to \(k=3\), an alternative is to fix the prior probability of there being no outliers in the sample, \[P(\text{no outliers in sample}) = P(\text{observation is not an outlier})^n = 0.95\] which we can solve for

\[ P(\text{observation is not an outlier}) = 0.95^{1/n} \]

and then solve for \(k\) using the normal quantile function.

n <- nrow(wage)
(prob_obs_not_outlier <- 0.95^(1/n))
## [1] 0.9999451
(newk <- qnorm(0.5 + 0.5 * prob_obs_not_outlier))
## [1] 4.033904

The function Bayes.outlier can also calculate k internally if we specify the prior probability of there being no outliers in the sample:

outliers <- Bayes.outlier(m_lwage_iq, prior.prob=0.95)
names(outliers)
## [1] "e"            "hat"          "prob.outlier" "prior.prob"

Use the new value of \(k\) to calculate the posterior probability of each observation being an outlier. Which observation has a posterior probability of being an outlier that exceeds the prior probability of being an outlier?

  • Case 434
  • Case 514
  • Case 616
  • Case 784 –>correct
outliers_df <- data.frame(probability = outliers$prob.outlier,
                          case = 1:length(outliers$prob.outlier))

ggplot(outliers_df, aes(ymax = probability, x = case)) +
  geom_linerange(ymin = 0)
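
To see numerically which cases have a posterior probability of being an outlier that exceeds the prior probability, one option (a sketch reusing prob_obs_not_outlier computed above) is to filter and sort:

# Per-observation prior probability of being an outlier under the new k
prior_prob_outlier <- 1 - prob_obs_not_outlier

# Cases whose posterior probability exceeds that prior probability
outliers_df %>%
  filter(probability > prior_prob_outlier) %>%
  arrange(desc(probability))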

Multiple linear regression

It is evident that wage can be explained by many predictors, such as experience, education, IQ, and so on. We can include all relevant covariates in a regression model in an attempt to explain as much wage variation as possible. In addition, outliers can sometimes be explained away by adding other predictors to the model, so let's take a look at multiple regression before removing any cases.

m_lwage_full <- lm(lwage ~ . - wage, data = wage)

The use of . - wage in the lm function tells R to include all covariates in the model except the wage variable from the data set.

However, running this full model has a cost: we will lose observations from our data if some measurements in the variables (e.g. birth order, mother's education, and father's education) are missing. By default, the lm function does a complete-case analysis, so it removes any observations with a missing (NA) value in one or more of the predictor variables.

Because of these missing values we must make an additional assumption in order for our inferences to be valid. Excluding rows with missing values requires that there is no systematic reason for the values to be missing; in other words, our data must be missing at random. For example, if all first-born children did not report their birth order, the data would not be missing at random. Without any additional information we will assume this is reasonable and use the 663 complete observations (as opposed to the original 935) to fit the model. Both Bayesian and frequentist methods exist for handling data sets with missing data, but they are beyond the scope of this course.
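
As a quick sanity check (a short sketch using base R), we can confirm how many complete cases remain once rows with any missing values are dropped:

# Number of rows with no missing values, and the total number of rows
sum(complete.cases(wage))
nrow(wage)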

From the model, all else being equal, who would you expect to make more: a married black man or a single non-black man?

  • The married black man –>correct
  • The single non-black man
tidy(m_lwage_full)
## # A tibble: 16 x 5
##    term        estimate std.error statistic  p.value
##    <chr>          <dbl>     <dbl>     <dbl>    <dbl>
##  1 (Intercept)  5.16      0.225       22.9  2.16e-85
##  2 hours       -0.00655   0.00193     -3.39 7.54e- 4
##  3 iq           0.00319   0.00122      2.60 9.43e- 3
##  4 kww          0.00373   0.00239      1.56 1.19e- 1
##  5 educ         0.0413    0.00894      4.61 4.74e- 6
##  6 exper        0.0107    0.00443      2.42 1.56e- 2
##  7 tenure       0.00710   0.00289      2.45 1.44e- 2
##  8 age          0.00911   0.00598      1.52 1.28e- 1
##  9 married1     0.201     0.0460       4.36 1.48e- 5
## 10 black1      -0.105     0.0557      -1.89 5.94e- 2
## 11 south1      -0.0491    0.0308      -1.60 1.11e- 1
## 12 urban1       0.196     0.0312       6.26 6.88e-10
## 13 sibs         0.00962   0.00788      1.22 2.22e- 1
## 14 brthord     -0.0185    0.0116      -1.60 1.11e- 1
## 15 meduc        0.00963   0.00617      1.56 1.19e- 1
## 16 feduc        0.00559   0.00540      1.04 3.01e- 1

As you can see from a quick summary of the full linear model, many coefficients of the independent variables are not statistically significant. In previous labs within this specialization, you selected variables based on Adjusted \(R^2\). This module introduced the Bayesian Information Criterion (BIC), a criterion that can be used for model selection. BIC is based on model fit, while simultaneously penalizing the number of parameters in proportion to the sample size. We can calculate the BIC of the full linear model using the command below:

BIC(m_lwage_full)
## [1] 586.3732
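
For reference, for a model with \(p\) estimated parameters fit to \(n\) observations with maximized likelihood \(\hat{L}\), the criterion is \[\text{BIC} = -2\ln\hat{L} + p\ln(n),\] so smaller values of BIC indicate a better trade-off between goodness of fit and model complexity.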

We can compare the BIC of the full model with that of a reduced model. Let us try to remove birth order from the model. To ensure that the observations remain the same, the data set can be specified as na.omit(wage), which includes only the observations with no missing values in any variables in the data set.

m_lwage_nobrthord <- lm(lwage ~ . - wage - brthord, data = na.omit(wage))
BIC(m_lwage_nobrthord)
## [1] 582.4815

As you can see, removing birth order from the regression reduces the BIC, which is the quantity we seek to minimize in model selection.

Elimination of which variable from the full model yielded the lowest BIC?

  • brthord
  • sibs
  • feduc –>correct
  • meduc
# Remove brthord
m_lwage_nobrthord <- lm(lwage ~ . - wage - brthord, data = na.omit(wage))
BIC(m_lwage_nobrthord)
## [1] 582.4815
# Remove sibs
m_lwage_sibs <- lm(lwage ~ . - wage - sibs, data = na.omit(wage))
BIC(m_lwage_sibs)
## [1] 581.4031
# Remove feduc
m_lwage_feduc <- lm(lwage ~ . - wage - feduc, data = na.omit(wage))
BIC(m_lwage_feduc)
## [1] 580.9743
# Remove medu
m_lwage_meduc <- lm(lwage ~ . - wage - meduc, data = na.omit(wage))
BIC(m_lwage_meduc)
## [1] 582.3722

R has a function stepAIC in the MASS package that will work backwards through the model space, removing variables until the score can no longer be lowered. It takes the full model and a penalty parameter \(k\) as inputs. The default setting, \(k=2\), corresponds to the AIC score. Find the best model according to BIC (in which case k = log(n), where \(n\) is the number of observations). Remember to use na.omit(wage) as your data set. You may type ?stepAIC in the RStudio Console to see the usage and examples for the stepAIC function; a short sketch of the call follows below.

?stepAIC
## starting httpd help server ... done
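
Below is a minimal sketch of what such a call could look like, assuming we first refit the full model on the complete cases; the object names wage_no_na, m_lwage_full_nona, and m_lwage_bic are just illustrative:

# Refit the full model on complete cases so all candidate models are
# compared on the same observations
wage_no_na <- na.omit(wage)
m_lwage_full_nona <- lm(lwage ~ . - wage, data = wage_no_na)

# Backward elimination with a BIC penalty: k = log(n)
m_lwage_bic <- stepAIC(m_lwage_full_nona, direction = "backward",
                       k = log(nrow(wage_no_na)), trace = FALSE)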

Bayesian model averaging

Often, several models are equally plausible and choosing only one ignores the inherent uncertainty involved in choosing the variables to include in the model. A way to get around this problem is to implement Bayesian model averaging (BMA), in which multiple models are averaged to obtain posteriors of coefficients and predictions from new data. Dr. Merlise Clyde is the primary author of the R package BAS, which implements BMA [@Clyde2018]. We can use this for either implementing BMA or selecting models.

We start by applying BMA to the wage data using all \(15\) potential predictors.

# Exclude observations with missing values in the data set
wage_no_na <- na.omit(wage)

# Fit the model using Bayesian linear regression, `bas.lm` function in the `BAS` package
bma_lwage <- bas.lm(lwage ~ . -wage, data = wage_no_na,
                   prior = "BIC", 
                   modelprior = uniform())

# Print out the marginal posterior inclusion probabilities for each variable                
bma_lwage
## 
## Call:
## bas.lm(formula = lwage ~ . - wage, data = wage_no_na, prior = "BIC", 
##     modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
## Intercept      hours         iq        kww       educ      exper     tenure        age   married1  
##   1.00000    0.85540    0.89732    0.34790    0.99887    0.70999    0.70389    0.52468    0.99894  
##    black1     south1     urban1       sibs    brthord      meduc      feduc  
##   0.34636    0.32029    1.00000    0.04152    0.12241    0.57339    0.23274
# Top 5 most probably models
summary(bma_lwage)
##           P(B != 0 | Y)    model 1       model 2       model 3       model 4       model 5
## Intercept    1.00000000     1.0000     1.0000000     1.0000000     1.0000000     1.0000000
## hours        0.85540453     1.0000     1.0000000     1.0000000     1.0000000     1.0000000
## iq           0.89732383     1.0000     1.0000000     1.0000000     1.0000000     1.0000000
## kww          0.34789688     0.0000     0.0000000     0.0000000     1.0000000     0.0000000
## educ         0.99887165     1.0000     1.0000000     1.0000000     1.0000000     1.0000000
## exper        0.70999255     0.0000     1.0000000     1.0000000     1.0000000     0.0000000
## tenure       0.70388781     1.0000     1.0000000     1.0000000     1.0000000     1.0000000
## age          0.52467710     1.0000     1.0000000     0.0000000     0.0000000     1.0000000
## married1     0.99894488     1.0000     1.0000000     1.0000000     1.0000000     1.0000000
## black1       0.34636467     0.0000     0.0000000     0.0000000     0.0000000     1.0000000
## south1       0.32028825     0.0000     0.0000000     0.0000000     0.0000000     0.0000000
## urban1       0.99999983     1.0000     1.0000000     1.0000000     1.0000000     1.0000000
## sibs         0.04152242     0.0000     0.0000000     0.0000000     0.0000000     0.0000000
## brthord      0.12241286     0.0000     0.0000000     0.0000000     0.0000000     0.0000000
## meduc        0.57339302     1.0000     1.0000000     1.0000000     1.0000000     1.0000000
## feduc        0.23274084     0.0000     0.0000000     0.0000000     0.0000000     0.0000000
## BF                   NA     1.0000     0.5219483     0.5182769     0.4414346     0.4126565
## PostProbs            NA     0.0455     0.0237000     0.0236000     0.0201000     0.0188000
## R2                   NA     0.2710     0.2767000     0.2696000     0.2763000     0.2762000
## dim                  NA     9.0000    10.0000000     9.0000000    10.0000000    10.0000000
## logmarg              NA -1490.0530 -1490.7032349 -1490.7102938 -1490.8707736 -1490.9381880

Printing the model object and calling the summary command give us both the marginal posterior inclusion probability for each variable and the most probable models. For example, the posterior probability that hours is included in the model is 0.855. Further, the most likely model, which has posterior probability of 0.0455, includes an intercept, hours worked, IQ, education, tenure, age, marital status, urban living status, and mother's education. While a posterior probability of 0.0455 sounds small, it is much larger than the uniform prior probability assigned to it, since there are \(2^{15}\) possible models.
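
To put that 0.0455 in context, the uniform prior probability assigned to any single model can be computed directly:

# Uniform prior probability of any one of the 2^15 candidate models
1 / 2^15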

It is also possible to visualize the posterior distribution of the coefficients under the model averaging approach. We graph the posterior distributions of the coefficients of iq and sibs below. Note that the subset argument dictates which variables are plotted.

# Obtain the coefficients from the model `bma_lwage`
coef_lwage <- coefficients(bma_lwage)


# `iq` is the 3rd variable, while `sibs` is the 13th variable in the data set
plot(coef_lwage, subset = c(3,13), ask = FALSE)

We can also provide 95% credible intervals for these coefficients:

confint(coef_lwage)
##                   2.5%        97.5%          beta
## Intercept  6.787292902 6.841036e+00  6.8142970694
## hours     -0.009276652 0.000000e+00 -0.0053079979
## iq         0.000000000 6.254572e-03  0.0037983313
## kww        0.000000000 8.355291e-03  0.0019605787
## educ       0.022385507 6.618889e-02  0.0440707549
## exper      0.000000000 2.098825e-02  0.0100264057
## tenure     0.000000000 1.277263e-02  0.0059357058
## age        0.000000000 2.548132e-02  0.0089659753
## married1   0.117744818 2.998621e-01  0.2092940731
## black1    -0.194130537 1.519152e-07 -0.0441863361
## south1    -0.102712596 1.281341e-04 -0.0221757978
## urban1     0.135265538 2.595701e-01  0.1981221313
## sibs       0.000000000 0.000000e+00  0.0000218455
## brthord   -0.020560222 0.000000e+00 -0.0019470674
## meduc      0.000000000 2.263957e-02  0.0086717156
## feduc      0.000000000 1.553921e-02  0.0025125930
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

For Questions 9-10, we’ll use a reduced data set which excludes wage, number of siblings, birth order, and parental education.

wage_red <- wage %>%
  select(-wage, -sibs, -brthord, -meduc, -feduc)

Let’s use BMA with the Zellner-Siow prior on the regression coefficients:

bma_lwage_red <- bas.lm(lwage ~ ., data = wage_red,  
                        prior = "ZS-null",
                        modelprior = uniform())

Based on this reduced data set, according to Bayesian model averaging, which of the following variables has the lowest marginal posterior inclusion probability?

  • kww
  • black
  • south
  • age –>correct
summary(bma_lwage_red)
##           P(B != 0 | Y)  model 1     model 2     model 3     model 4     model 5
## Intercept     1.0000000   1.0000   1.0000000   1.0000000   1.0000000   1.0000000
## hours         0.9486375   1.0000   1.0000000   1.0000000   1.0000000   1.0000000
## iq            0.9475699   1.0000   1.0000000   1.0000000   1.0000000   0.0000000
## kww           0.5000338   1.0000   0.0000000   0.0000000   1.0000000   1.0000000
## educ          1.0000000   1.0000   1.0000000   1.0000000   1.0000000   1.0000000
## exper         0.9517468   1.0000   1.0000000   1.0000000   1.0000000   1.0000000
## tenure        0.9987506   1.0000   1.0000000   1.0000000   1.0000000   1.0000000
## age           0.2696057   0.0000   0.0000000   1.0000000   1.0000000   0.0000000
## married1      0.9999700   1.0000   1.0000000   1.0000000   1.0000000   1.0000000
## black1        0.9866068   1.0000   1.0000000   1.0000000   1.0000000   1.0000000
## south1        0.9157105   1.0000   1.0000000   1.0000000   1.0000000   1.0000000
## urban1        1.0000000   1.0000   1.0000000   1.0000000   1.0000000   1.0000000
## BF                   NA   1.0000   0.7635498   0.4039829   0.1630009   0.1148965
## PostProbs            NA   0.3292   0.2514000   0.1330000   0.0537000   0.0378000
## R2                   NA   0.2751   0.2708000   0.2737000   0.2760000   0.2678000
## dim                  NA  11.0000  10.0000000  11.0000000  12.0000000  10.0000000
## logmarg              NA 120.9272 120.6574088 120.0208031 119.1131860 118.7634622

True or False: The naive model with all variables included has posterior probability greater than 0.5.

  • True
  • False –>correct
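
A quick way to convince yourself of this is to look at the largest single-model posterior probability; the sketch below assumes, as in the BAS documentation, that the fitted bas object stores its model posterior probabilities in the postprobs component:

# Largest posterior probability across all models considered; if even the
# most probable model falls below 0.5, the full model must as well
max(bma_lwage_red$postprobs)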

Graph the posterior distribution of the coefficient of age, using the data set wage_red.

coef_lwage_red <- coefficients(bma_lwage_red)

plot(coef_lwage_red, subset = 8, ask = FALSE)

Because we have log-transformed wage, the coefficients in the output are not directly useful for understanding how the different predictors influence wages in dollars. Instead we can interpret coefficients after transforming back to the original units by exponentiation. The exponential of the posterior mean is not the same as the mean of the exponential; however, the median of wage can be found by exponentiating the median of log wage (i.e. the middle value is still in the middle under transformations that do not change the order of values).

Let’s look at the coefficients and 95% credible intervals

coef(bma_lwage_red)
## 
##  Marginal Posterior Summaries of Coefficients: 
## 
##  Using  BMA 
## 
##  Based on the top  2048 models 
##            post mean  post SD    post p(B != 0)
## Intercept   6.779004   0.011815   1.000000     
## hours      -0.005153   0.002011   0.948638     
## iq          0.003216   0.001266   0.947570     
## kww         0.002125   0.002527   0.500034     
## educ        0.051663   0.007925   1.000000     
## exper       0.011777   0.004362   0.951747     
## tenure      0.010513   0.002490   0.998751     
## age         0.002391   0.004879   0.269606     
## married1    0.197358   0.038797   0.999970     
## black1     -0.147152   0.044153   0.986607     
## south1     -0.074443   0.033753   0.915710     
## urban1      0.179091   0.027027   1.000000
coef(bma_lwage_red) %>%
  confint()
##                    2.5%        97.5%         beta
## Intercept  6.755818e+00  6.802062485  6.779003810
## hours     -8.154330e-03  0.000000000 -0.005152980
## iq         0.000000e+00  0.005080933  0.003216182
## kww        0.000000e+00  0.006903470  0.002124717
## educ       3.538127e-02  0.066409877  0.051663021
## exper      0.000000e+00  0.017864041  0.011777136
## tenure     5.629795e-03  0.015309703  0.010513266
## age       -2.078995e-05  0.015011820  0.002390900
## married1   1.212962e-01  0.272942812  0.197358085
## black1    -2.363023e-01 -0.066087597 -0.147151969
## south1    -1.234276e-01  0.000000000 -0.074442861
## urban1     1.260957e-01  0.231860444  0.179091430
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

The exponential transformation applied to coefficients has a multiplicative effect on the posterior median. What this means is that a one unit increase in predictor \(X_j\) leads to a \((\exp(\beta_j) -1) \times 100\) percent increase in the median wage [@StatNews83]. For a factor or indicator variable like urban, the posterior median for wages for urban areas (urban == 1) will be \((\exp(0.1791) - 1) \times 100\) percent higher than in rural areas (urban == 0). Similarly we can use the same transformation with the credible interval. First, let’s calculate the credible interval for the coefficient of urban and exponentiate it.

ci_urban <- coef(bma_lwage_red) %>%
  confint(parm = "urban1") %>%
  exp()

ci_urban
##            2.5%    97.5%    beta
## urban1 1.133498 1.261976 1.19613
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

Then, we can subtract 1 from the bounds of the interval and multiply them by 100 to make them a bit more straightforward to interpret

(ci_urban - 1) * 100
##            2.5%    97.5%     beta
## urban1 13.34977 26.19764 19.61301
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

Based on this data, there is a 95% chance that median wages in urban areas are 13.3% to 26.2% higher than in rural areas.

Find a 95% credible interval for the coefficient of educ and provide an interpretation.

ci_educ <- coef(bma_lwage_red) %>%
  confint(parm = "educ") %>%
  exp()

(ci_educ - 1)*100
##          2.5%    97.5%     beta
## educ 3.543903 6.826096 5.302084
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

Prediction with BAS

Similar to last week's lab, we will use the Bayesian predictive distribution for predictions and their interpretation. Simulation is used in BAS to construct predictive intervals with Bayesian Model Averaging, while exact inference is often possible with predictive intervals under model selection.

Returning to the wage data set, let us find the predicted values under the Best Predictive Model (BPM), the model whose predictions are closest to those from BMA, along with the corresponding posterior standard deviations.

BPM_pred_lwage <- predict(bma_lwage, estimator = "BPM", se.fit = TRUE)
variable.names(BPM_pred_lwage)
##  [1] "Intercept" "hours"     "iq"        "kww"       "educ"      "exper"     "tenure"    "age"      
##  [9] "married1"  "urban1"    "meduc"

In the code above, the function variable.names can be used to extract the names of all of the predictors in the Best Predictive Model (BPM). The same function can be used to identify the variables in the Highest Probability Model (HPM)

HPM_pred_lwage <- predict(bma_lwage, estimator = "HPM")
variable.names(HPM_pred_lwage)
## [1] "Intercept" "hours"     "iq"        "educ"      "tenure"    "age"       "married1"  "urban1"   
## [9] "meduc"

and the Median Probability Model (MPM)

MPM_pred_lwage <- predict(bma_lwage, estimator = "MPM")
variable.names(MPM_pred_lwage)
##  [1] "Intercept" "hours"     "iq"        "educ"      "exper"     "tenure"    "age"       "married1" 
##  [9] "urban1"    "meduc"

The MPM includes exper in addition to all of the variables in the Highest Probability Model (HPM), while the BPM includes kww in addition to all of the variables in the MPM.

Based on these results which covariates are included in all of the following: the best predictive model, the median probability model, and the highest posterior probability model?

  • kww, married, urban
  • married, age, black
  • black, south, married
  • meduc, urban, married –>correct
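
One way to check this programmatically (a sketch reusing the three prediction objects above) is to intersect their predictor name vectors:

# Predictors common to the BPM, MPM, and HPM
Reduce(intersect, list(variable.names(BPM_pred_lwage),
                       variable.names(MPM_pred_lwage),
                       variable.names(HPM_pred_lwage)))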

Let us turn to examine what characteristics lead to the highest wages in the BPM model.

# Find the index of observation with the largest fitted value
opt <- which.max(BPM_pred_lwage$fit)

# Extract the row with this observation and glimpse at the row
wage_no_na %>% 
  slice(opt) %>%
  glimpse()
## Observations: 1
## Variables: 17
## $ wage    <int> 1586
## $ hours   <int> 40
## $ iq      <int> 127
## $ kww     <int> 48
## $ educ    <int> 16
## $ exper   <int> 16
## $ tenure  <int> 12
## $ age     <int> 37
## $ married <fct> 1
## $ black   <fct> 0
## $ south   <fct> 0
## $ urban   <fct> 1
## $ sibs    <int> 4
## $ brthord <int> 4
## $ meduc   <int> 16
## $ feduc   <int> 16
## $ lwage   <dbl> 7.36897

A 95% credible interval for predicting log wages can be obtained by

ci_lwage <- confint(BPM_pred_lwage, parm = "pred")
ci_lwage[opt,]
##     2.5%    97.5%     pred 
## 6.661865 8.056455 7.359160

To translate this back to wage (recall that we regress lwage), we may exponentiate the interval to obtain a 95% prediction interval for the wages of an individual with covariates at the levels of the individual specified by opt.

exp(ci_lwage[opt,])
##      2.5%     97.5%      pred 
##  782.0078 3154.0905 1570.5169

If we were to use BMA instead, the interval would be obtained as follows. Again, let us examine the characteristics of the individual with the highest fitted wages, this time under model averaging.

BMA_pred_lwage <- predict(bma_lwage, estimator = "BMA", se.fit = TRUE)
# Find the index of observation with the largest fitted value
opt0 <- which.max(BMA_pred_lwage$fit)

# Extract the row with this observation and glimpse at the row
wage_no_na %>% 
  slice(opt0) %>%
  glimpse()
## Observations: 1
## Variables: 17
## $ wage    <int> 1586
## $ hours   <int> 40
## $ iq      <int> 127
## $ kww     <int> 48
## $ educ    <int> 16
## $ exper   <int> 16
## $ tenure  <int> 12
## $ age     <int> 37
## $ married <fct> 1
## $ black   <fct> 0
## $ south   <fct> 0
## $ urban   <fct> 1
## $ sibs    <int> 4
## $ brthord <int> 4
## $ meduc   <int> 16
## $ feduc   <int> 16
## $ lwage   <dbl> 7.36897

A 95% credible interval for predicting log wages can be obtained by

ci_lwage0 <- confint(BMA_pred_lwage, parm = "pred")
ci_lwage0[opt0,]
##     2.5%    97.5%     pred 
## 6.585177 7.985750 7.309875

To translate this back to wage (recall that we regress lwage), we may exponentiate the interval to obtain a 95% prediction interval for the wages of an individual with covariates at the levels of the individual specified by opt0.

exp(ci_lwage0[opt0,])
##      2.5%     97.5%      pred 
##  724.2794 2938.7793 1494.9899

Repeat these calculations to obtain a 95% prediction interval for the wages of the individual who is predicted to have the highest wages based on the best predictive model (BPM).

  • [414, 1717]
  • [782, 1571]
  • [782, 3154] –>correct
  • [706, 2950]
BPM_pred_lwage <- predict(bma_lwage, estimator = 'BPM', se.fit = TRUE)
# Find the index of observation with the largest fitted value
opt1 <- which.max(BPM_pred_lwage$fit)

# Extract the row with this observation and glimpse at the row
wage_no_na %>% 
  slice(opt1) %>%
  glimpse()
## Observations: 1
## Variables: 17
## $ wage    <int> 1586
## $ hours   <int> 40
## $ iq      <int> 127
## $ kww     <int> 48
## $ educ    <int> 16
## $ exper   <int> 16
## $ tenure  <int> 12
## $ age     <int> 37
## $ married <fct> 1
## $ black   <fct> 0
## $ south   <fct> 0
## $ urban   <fct> 1
## $ sibs    <int> 4
## $ brthord <int> 4
## $ meduc   <int> 16
## $ feduc   <int> 16
## $ lwage   <dbl> 7.36897

A 95% credible interval for predicting log wages can be obtained by

ci_lwage1 <- confint(BPM_pred_lwage, parm = "pred")
ci_lwage1[opt1,]
##     2.5%    97.5%     pred 
## 6.661865 8.056455 7.359160

To translate this back to wage (recall that we regress lwage), we may exponentiate the interval to obtain a 95% prediction interval for the wages of an individual with covariates at the levels of the individual specified by opt1.

exp(ci_lwage1[opt1,])
##      2.5%     97.5%      pred 
##  782.0078 3154.0905 1570.5169

This work is licensed under GNU General Public License v3.0.

References