Complete all Exercises, and submit answers to Questions on the Coursera platform.

Modeling Wages

In the field of labor economics, the study of income and wages provides insight about topics ranging from gender discrimination to the benefits of higher education. In this lab, we will analyze cross-sectional wage data in order to practice using Bayesian methods such as BIC and Bayesian Model Averaging to construct parsimonious predictive models.

Getting Started

Load packages

In this lab we will explore the data using the dplyr package and visualize it using the ggplot2 package for data visualization. We also may use the MASS package to implement stepwise linear regression in one of the exercises. The data can be found in the companion package for this course, statsr.

Let’s load the packages.

library(statsr)
library(MASS)
library(dplyr)
library(ggplot2)
library(BAS)

This is the first time we’re using the BAS package. We will be using the bas.lm function from this package later in the lab to implement Bayesian Model Averaging. Please make sure that the version of BAS is 1.3.0 or greater.

The data

The data will be using in this lab were gathered as a random sample of 935 respondents throughout the United States. This data set was released as part of the series Instructional Stata Datasets for Econometrics by the Boston College Department of Economics (Wooldridge 2000).

Let’s load the data:

data(wage)

variable	description
`wage`	weekly earnings (dollars)
`hours`	average hours worked per week
`IQ`	IQ score
`kww`	knowledge of world work score
`educ`	number of years of education
`exper`	years of work experience
`tenure`	years with current employer
`age`	age in years
`married`	=1 if married
`black`	=1 if black
`south`	=1 if live in south
`urban`	=1 if live in a Standard Metropolitan Statistical Area
`sibs`	number of siblings
`brthord`	birth order
`meduc`	mother’s education (years)
`feduc`	father’s education (years)
`lwage`	natural log of `wage`

Is this an observational study or an experiment?
1. Observational study
2. Experiment

Exploring the data

As with any new data set a good place to start is standard exploratory data analysis. We will begin with the wage variable since it will be the response variable in our models.

Which of the following statements is false about the distribution of wage?
1. The median of the distribution is 905.
2. 25% of respondents make more than 1160 dollars per week.
3. 7 of the respondents make less than 300 dollars per week
4. wage is right-skewed, meaning that more respondents fall below the mean wage than above it.

n <- nrow(wage)
wage %>% summarize(Min = min(wage), Q25 = quantile(wage, 0.25), Median = median(wage), Mean = mean(wage), Q75 = quantile(wage, 0.75), Max = max(wage), Less300 = quantile(wage, 7/n))

## # A tibble: 1 x 7
##     Min   Q25 Median     Mean   Q75   Max  Less300
##   <dbl> <dbl>  <int>    <dbl> <dbl> <dbl>    <dbl>
## 1   115   669    905 957.9455  1160  3078 309.9251

Since wage is our response, we would like to explore the relationship of the other variables as predictors.

Exercise: Excluding wage and lwage, select two other variables that you think might be a good predictor of wage. Visualize their relationships with wage using appropriate plots.

ggplot(wage, aes(x=hours, y=wage))+geom_point()

ggplot(wage, aes(x=educ, y=wage))+geom_point()

Simple linear regression

One possible, simplistic, explanation for the variation in wages that we see in the data is that smarter people make more money. The plot below visualizes a scatterplot between weekly wage and IQ score.

ggplot(data = wage, aes(x = iq, y = wage)) +
  geom_point()

This plot is rather noisy. While there may be a slight positive linear relationship between IQ score and wage, IQ is at best a crude predictor of wages. We can quantify this by fitting a simple linear regression.

m_wage_iq = lm(wage ~ iq, data = wage)
m_wage_iq$coefficients

## (Intercept)          iq 
##  116.991565    8.303064

summary(m_wage_iq)$sigma

## [1] 384.7667

Recall from the lectures that under the model

\[wage_i = \alpha + \beta \cdot iq_i + \epsilon_i\]

if \(\epsilon_i \sim N(0, \sigma^2)\) and the reference prior \(p(\alpha, \beta, \sigma^2) \propto 1/\sigma^2\) is used, then the Bayesian posterior means and standard deviations will be equal to the frequentist estimates and standard errors respectively.

The Bayesian model specification assumes that the errors are normally distributed with a constant variance. As with the frequentist approach we check this assumption by examining the distribution of the residuals for the model. If the residuals are highly non-normal or skewed, the assumption is violated and any subsequent inference is not valid.

Examine the residuals of m_wage_iq. Is the assumption of normally distributed errors valid?
1. Yes, since the distribution of the dependent variable (wage) is roughly normally distributed.
2. Yes, since the distribution of the residuals of the model looks approximately normal.
3. No, since the distribution of the residuals of the model is left-skewed.
4. No, since the distribution of the residuals of the model is right-skewed.

plot(m_wage_iq, which = 1)

plot(m_wage_iq, which = 2)

Exercise: Refit the model, this time using educ (education) as the independent variable. Does your answer to the previous exercise change?

The assumption of normally distributed errors is not valid since the distribution of the residuals of the model is righr skewed.

# Summary statistics of the model
m_wage_educ <- lm(wage ~ educ, data = wage)
c(m_wage_educ$coef, Sigma = summary(m_wage_educ)$sigma)

## (Intercept)        educ       Sigma 
##   146.95244    60.21428   382.32033

# Information with details
summary(m_wage_educ)

## 
## Call:
## lm(formula = wage ~ educ, data = wage)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -877.38 -268.63  -38.38  207.05 2148.26 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  146.952     77.715   1.891   0.0589 .  
## educ          60.214      5.695  10.573   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 382.3 on 933 degrees of freedom
## Multiple R-squared:  0.107,  Adjusted R-squared:  0.106 
## F-statistic: 111.8 on 1 and 933 DF,  p-value: < 2.2e-16

# Checking assumption of errors being normal distribution of residudals with a constant variance
ggplot(m_wage_educ, aes(x=fitted(m_wage_educ), y=residuals(m_wage_educ)))+geom_point()+geom_abline()

plot(m_wage_educ, which = 2)

Variable transformation

One way to accommodate the right-skewness in the data is to (natural) log transform the dependent variable. Note that this is only possible if the variable is strictly positive, since the log of negative value is not defined and \(\log(0) = -\infty\). Let’s try to fit a linear model with log-wage as the dependent variable. Question 4 will be based on this log transformed model.

m_lwage_iq = lm(lwage ~ iq, data = wage)
m_lwage_iq$coefficients

## (Intercept)          iq 
## 5.886994223 0.008807157

summary(m_lwage_iq)$sigma

## [1] 0.399948

summary(m_lwage_iq)

## 
## Call:
## lm(formula = lwage ~ iq, data = wage)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.09324 -0.25547  0.02261  0.27544  1.21486 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.8869942  0.0890206   66.13   <2e-16 ***
## iq          0.0088072  0.0008694   10.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3999 on 933 degrees of freedom
## Multiple R-squared:  0.09909,    Adjusted R-squared:  0.09813 
## F-statistic: 102.6 on 1 and 933 DF,  p-value: < 2.2e-16

Exercise: Examine the residuals of this model. Is the assumption of normally distributed residuals reasonable?

A: The assumption of normally distributed residuals is reasonable since the residuals are normally distributed with a relatively constant variance.

plot(m_lwage_iq, which = 1)

plot(m_lwage_iq, which = 2)

Recall that the posterior distribution of \(\alpha\) and \(\beta\) given \(\sigma^2\) is normal, but marginally follows a \(t\) distribution with \(n-p-1\) degrees of freedom. In this case, \(p=1\), since IQ is the only predictor of log-wage included in our model. Therefore both \(\alpha\) and \(\beta\) will have a posteriors that follow a \(t\) distribution 933 degrees of freedom - since the df is so large these distributions will actually be approximately normal.

Under the reference prior \(p(\alpha, \beta, \sigma^2) \propto 1/\sigma^2\), give a 95% posterior credible interval for \(\beta\), the coefficient of IQ.
1. (0.00793, 0.00967)
2. (0.00709, 0.01050)
3. (0.00663, 0.01098)
4. (0.00010, 0.01750)

confint(m_lwage_iq, level = 0.95)

##                   2.5 %     97.5 %
## (Intercept) 5.712290445 6.06169800
## iq          0.007100959 0.01051335

Exercise: The coefficient of IQ is very small, which is expected since a one point increase in IQ score can hardly be expected to have a high multiplicative effect on wage. One way to make the coefficient more interpretable is to standardize IQ before putting it into the model. From this new model, an increase in IQ of 1 standard deviation (15 points) is estimated to increase wage by what percentage?

The increase percent of wage per 1 standard deviation increase in IQ is approximately 0.3336.

m_lwage_liq <- lm(lwage ~ log(iq), data = wage)
summary(m_lwage_liq)

## 
## Call:
## lm(formula = lwage ~ log(iq), data = wage)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.09743 -0.24828  0.02547  0.27542  1.21110 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.94217    0.38409   7.660 4.64e-14 ***
## log(iq)      0.83299    0.08334   9.995  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4005 on 933 degrees of freedom
## Multiple R-squared:  0.09672,    Adjusted R-squared:  0.09576 
## F-statistic: 99.91 on 1 and 933 DF,  p-value: < 2.2e-16

sd <- summary(m_lwage_liq)$sigma
beta_liq <- summary(m_lwage_liq)$coefficients[2,1]
wage_increase_by_1sd <- beta_liq*sd; print(wage_increase_by_1sd)

## [1] 0.3335917

Multiple linear regression

It is evident that wage can be explained by many predictors, such as experience, education, and IQ. We can include all relevant covariates in a regression model in an attempt to explain as much wage variation as possible.

m_lwage_full = lm(lwage ~ . - wage, data = wage)

The use of . in the lm tells R to include all covariates in the model which we then further modify with -wage which then excludes the wage variable from the model.

However, running this full model has a cost: we remove observations from our data since some measurements for (e.g. birth order, mother’s education, and father’s education) are missing. By default, the lm function does a complete-case analysis, and so it removes any observations with a missing (NA) value in one or more of the predictor variables.

Because of these missing values we must make an addition assumption in order for our inferences to be valid. This exclusion of rows with missing values requires that the data there is no systematic reason for the values to be missing, or in other words our data must be missing at random. For example, if all first-born children did not report their birth order, the data would not be missing at random. Without any additional information we will assume missing at random is reasonable and use the 663 complete observations (as opposed to the original 935) to fit the model. Both Bayesian and frequentist methods exist to handle data sets with missing data, but they are beyond the scope of this course.

From the model, all else begin equal, who would you expect to make more: a married black man or a single non-black man?
1. The married black man
2. The single non-black man

beta_m_lwage_full_contant <- summary(m_lwage_full)$coefficients[1,1]
beta_m_lwage_full_married <- summary(m_lwage_full)$coefficients[9,1]
beta_m_lwage_full_black <- summary(m_lwage_full)$coefficients[10,1]
wage_married_black <- beta_m_lwage_full_contant + beta_m_lwage_full_black * 1 + beta_m_lwage_full_married * 1
# 1 each for black and married #
wage_single_nonblack <- beta_m_lwage_full_contant + beta_m_lwage_full_black * 0 + beta_m_lwage_full_married * 0
# 0 each for nonblack and non-married #
data.frame(wage_married_black, wage_single_nonblack)

##   wage_married_black wage_single_nonblack
## 1           5.252058             5.156439

As you can see from a quick summary of the full linear model, many coefficients of independent variables are not statistically significant. In previous labs within this specialization, you selected variables based on Adjusted \(R^2.\) This module introduced the Bayesian Information Criterion (BIC), which is a metric that can be used for model selection. BIC is based on model fit, while simultaneously penalizing the number of parameters in proportion to the sample size. We can calculate the BIC of the full linear model using the command below:

BIC(m_lwage_full)

## [1] 586.3732

We can compare the BIC of the full model with that of a reduced model. Let’s try to remove birth order from the model. To ensure that the observations remain the same, the data set can be specified as na.omit(wage), which includes only the observations with no missing values.

m_lwage_nobrthord = lm(lwage ~ . -wage -brthord, data = na.omit(wage))
BIC(m_lwage_nobrthord)

## [1] 582.4815

As you can see, removing birth order from the regression reduces BIC, which we seek to minimize by model selection.

Elimination of which variable from the full model yielded the lowest BIC?
1. brthord
2. sibs
3. feduc
4. meduc

m_lwage_nosib <- lm(lwage ~ . -wage -sibs, data = na.omit(wage))
m_lwage_nofeduc <- lm(lwage ~ . -wage -feduc, data = na.omit(wage))
m_lwage_nomeduc <- lm(lwage ~ . -wage -meduc, data = na.omit(wage))
BIC_nobrthord <- BIC(m_lwage_nobrthord)
BIC_nosibs <- BIC(m_lwage_nosib)
BIC_nofeduc <- BIC(m_lwage_nofeduc)
BIC_nomeduc <- BIC(m_lwage_nomeduc)
data.frame(BIC_nobrthord, BIC_nosibs, BIC_nofeduc, BIC_nomeduc)

##   BIC_nobrthord BIC_nosibs BIC_nofeduc BIC_nomeduc
## 1      582.4815   581.4031    580.9743    582.3722

Exercise: R has a function stepAIC that will work backwards through the model space, removing variables until BIC can be no longer be lowered. It takes as inputs a full model, and a penalty parameter \(k\). Find the best model according to BIC (in which case \(k = \log(n)\)). Remember to use na.omit(wage) as your data set.

m_lwage_full1 <- lm(lwage ~ . -wage, data = na.omit(wage))
m_lwage_full1_step <- stepAIC(m_lwage_full1, trace = FALSE, direction = "backward")
m_lwage_full1_step$anova

## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## lwage ~ (wage + hours + iq + kww + educ + exper + tenure + age + 
##     married + black + south + urban + sibs + brthord + meduc + 
##     feduc) - wage
## 
## Final Model:
## lwage ~ hours + iq + kww + educ + exper + tenure + age + married + 
##     black + south + urban + meduc
## 
## 
##        Step Df  Deviance Resid. Df Resid. Dev       AIC
## 1                              647   79.57803 -1373.584
## 2   - feduc  1 0.1318928       648   79.70992 -1374.487
## 3    - sibs  1 0.1948095       649   79.90473 -1374.868
## 4 - brthord  1 0.1662921       650   80.07102 -1375.490

Bayesian model averaging

Often, several models are equally plausible and choosing only one ignores the inherent uncertainty involved in choosing the variables to include in the model. A way to get around this problem is to implement Bayesian model averaging (BMA), in which multiple models are averaged to obtain posteriors of coefficients and predictions from new data. Dr. Merlise Clyde is the primary author of the R package BAS, which implements BMA. We can use this for either implementing BMA or selecting models. We start by applying BMA to the wage data.

wage_no_na = na.omit(wage)
bma_lwage = bas.lm(lwage ~ . -wage, data = wage_no_na,
                   prior = "BIC", 
                   modelprior = uniform())
bma_lwage

## 
## Call:
## bas.lm(formula = lwage ~ . - wage, data = wage_no_na, prior = "BIC",     modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
## Intercept      hours         iq        kww       educ      exper  
##   1.00000    0.85540    0.89732    0.34790    0.99887    0.70999  
##    tenure        age   married1     black1     south1     urban1  
##   0.70389    0.52468    0.99894    0.34636    0.32029    1.00000  
##      sibs    brthord      meduc      feduc  
##   0.04152    0.12241    0.57339    0.23274

summary(bma_lwage)

##           P(B != 0 | Y)    model 1       model 2       model 3
## Intercept    1.00000000     1.0000     1.0000000     1.0000000
## hours        0.85540453     1.0000     1.0000000     1.0000000
## iq           0.89732383     1.0000     1.0000000     1.0000000
## kww          0.34789688     0.0000     0.0000000     0.0000000
## educ         0.99887165     1.0000     1.0000000     1.0000000
## exper        0.70999255     0.0000     1.0000000     1.0000000
## tenure       0.70388781     1.0000     1.0000000     1.0000000
## age          0.52467710     1.0000     1.0000000     0.0000000
## married1     0.99894488     1.0000     1.0000000     1.0000000
## black1       0.34636467     0.0000     0.0000000     0.0000000
## south1       0.32028825     0.0000     0.0000000     0.0000000
## urban1       0.99999983     1.0000     1.0000000     1.0000000
## sibs         0.04152242     0.0000     0.0000000     0.0000000
## brthord      0.12241286     0.0000     0.0000000     0.0000000
## meduc        0.57339302     1.0000     1.0000000     1.0000000
## feduc        0.23274084     0.0000     0.0000000     0.0000000
## BF                   NA     1.0000     0.5219483     0.5182769
## PostProbs            NA     0.0455     0.0237000     0.0236000
## R2                   NA     0.2710     0.2767000     0.2696000
## dim                  NA     9.0000    10.0000000     9.0000000
## logmarg              NA -1490.0530 -1490.7032349 -1490.7102938
##                 model 4       model 5
## Intercept     1.0000000     1.0000000
## hours         1.0000000     1.0000000
## iq            1.0000000     1.0000000
## kww           1.0000000     0.0000000
## educ          1.0000000     1.0000000
## exper         1.0000000     0.0000000
## tenure        1.0000000     1.0000000
## age           0.0000000     1.0000000
## married1      1.0000000     1.0000000
## black1        0.0000000     1.0000000
## south1        0.0000000     0.0000000
## urban1        1.0000000     1.0000000
## sibs          0.0000000     0.0000000
## brthord       0.0000000     0.0000000
## meduc         1.0000000     1.0000000
## feduc         0.0000000     0.0000000
## BF            0.4414346     0.4126565
## PostProbs     0.0201000     0.0188000
## R2            0.2763000     0.2762000
## dim          10.0000000    10.0000000
## logmarg   -1490.8707736 -1490.9381880

Printing the model object and the summary command gives us both the posterior model inclusion probability for each variable and the most probable models. For example, the posterior probability that hours is included in the model is 0.855. Further, the most likely model, which has posterior probability of 0.0455, includes an intercept, hours worked, IQ, education, tenure, age, marital status, urban living status, and mother’s education. While a posterior probability of 0.0455 sounds small, it is much larger than the uniform prior probability assigned to it, since there are \(2^{16}\) possible models.

It is also possible to visualize the posterior distribution of the coefficients under the model averaging approach. We graph the posterior distribution of the coefficients of iq and sibs below. Note that the subset command dictates which variable is plotted.

par(mfrow = c(1,2))
coef_lwage = coefficients(bma_lwage)
plot(coef_lwage, subset = c(3,13), ask=FALSE) # when "ask" argument is set to "FALSE", it doesn't ask for "return" for plot #

We can also provide 95% credible intervals for these coefficients:

confint(coef_lwage)

##                    2.5%       97.5%          beta
## Intercept  6.787139e+00 6.841368966  6.8142970694
## hours     -9.358674e-03 0.000000000 -0.0053079979
## iq         0.000000e+00 0.006243809  0.0037983313
## kww        0.000000e+00 0.008362018  0.0019605787
## educ       2.233127e-02 0.066241424  0.0440707549
## exper      0.000000e+00 0.021169031  0.0100264057
## tenure    -3.528865e-06 0.012780165  0.0059357058
## age        0.000000e+00 0.025445412  0.0089659753
## married1   1.123590e-01 0.296958331  0.2092940731
## black1    -1.935153e-01 0.000000000 -0.0441863361
## south1    -1.025376e-01 0.000000000 -0.0221757978
## urban1     1.331549e-01 0.257992901  0.1981221313
## sibs       0.000000e+00 0.000000000  0.0000218455
## brthord   -1.899410e-02 0.000000000 -0.0019470674
## meduc      0.000000e+00 0.022778650  0.0086717156
## feduc      0.000000e+00 0.015645278  0.0025125930
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

For questions 7-8, we’ll use a reduced data set which excludes number of siblings, birth order, and parental education.

wage_red <- subset(wage,
                   select = -c(sibs, brthord, feduc, meduc))

Based on this reduced data set, according to Bayesian model averaging, which of the following variables has the lowest marginal posterior inclusion probability?
1. kww
2. black
3. south
4. age

wage_red_no.na <- na.omit(wage_red)
bma_lwage_red <- bas.lm(lwage ~ . -wage, data = wage_red_no.na,
                        prior = "BIC",
                        modelprior = uniform())
bma_lwage_red

## 
## Call:
## bas.lm(formula = lwage ~ . - wage, data = wage_red_no.na, prior = "BIC",     modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
## Intercept      hours         iq        kww       educ      exper  
##    1.0000     0.8692     0.9172     0.3217     1.0000     0.9335  
##    tenure        age   married1     black1     south1     urban1  
##    0.9980     0.1786     0.9999     0.9761     0.8149     1.0000

sort(summary(bma_lwage_red)[1:12, 1], decreasing = FALSE)

##       age       kww    south1     hours        iq     exper    black1 
## 0.1786252 0.3216992 0.8148861 0.8691891 0.9171607 0.9334844 0.9761347 
##    tenure  married1      educ    urban1 Intercept 
## 0.9980015 0.9999368 1.0000000 1.0000000 1.0000000

True or False: The naive model with all variables included has posterior probability greater than 0.5. (Use a Zellner-Siow null prior for the coefficients and a Beta-Binomial (1,1) prior for the models)
1. True
2. False

zs_lwage_red <- bas.lm(lwage ~ . -wage, data = wage_red_no.na,
                       prior = "ZS-null",
                       modelprior = beta.binomial(1,1))

## Warning in bas.lm(lwage ~ . - wage, data = wage_red_no.na, prior = "ZS-
## null", : We recommend using the implementation using the Jeffreys-Zellner-
## Siow prior (prior='JZS') which uses numerical integration rahter than the
## Laplace approximation

zs_lwage_red

## 
## Call:
## bas.lm(formula = lwage ~ . - wage, data = wage_red_no.na, prior = "ZS-null",     modelprior = beta.binomial(1, 1))
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
## Intercept      hours         iq        kww       educ      exper  
##    1.0000     0.9932     0.9893     0.8290     1.0000     0.9856  
##    tenure        age   married1     black1     south1     urban1  
##    0.9997     0.6534     1.0000     0.9973     0.9866     1.0000

sort(summary(zs_lwage_red)[1:12, 1], decreasing = FALSE)

##       age       kww     exper    south1        iq     hours    black1 
## 0.6533855 0.8289845 0.9856385 0.9866055 0.9893468 0.9931742 0.9973459 
##    tenure  married1    urban1      educ Intercept 
## 0.9997110 0.9999948 1.0000000 1.0000000 1.0000000

Exercise: Graph the posterior distribution of the coefficient of age, using the data set wage_red.

par(mfrow = c(1,1))
coef_lwage_red <- coefficients(bma_lwage_red)
plot(coef_lwage_red, subset = 8, ask = FALSE)

Prediction

A key advantage of Bayesian statistics is prediction and the probabilistic interpretation of predictions. Much of Bayesian prediction is done using simulation techniques, some of which was discussed near the end of this module. This is often applied in regression modeling, although we’ll work through an example with just an intercept term.

Suppose you observe four numerical observations of \(y\), which are 2, 2, 0 and 0 respectively with sample mean \(\bar{y} = 1\) and sample variance \(s^2 = 4/3\). Assuming that \(y \sim N(\mu, \sigma^2)\), under the reference prior \(p(\mu,\sigma^2) \propto 1/\sigma^2\), our posterior becomes

\[\mu|\sigma^2, y \sim N(1, \sigma^2/4)\] which is centered at the sample mean and \[1/\sigma^2, y \sim Gamma(\alpha = 3/2,\beta = 4/2)\] where \(\alpha = (n - 1)/2\) and \(\beta = s^2 (n-1)/2 = 2\).

To obtain the predictive distribution for \(y_5\), we can first simulate \(\sigma^2\) from its posterior and then \(\mu\) followed by \(y_5\). Our draws of \(y_5\) will be from the posterior predictive distribution for a new observation. The example below draws 100,000 times from the posterior predictive distribution of \(y_5\).

set.seed(314)
N = 100000
phi = rgamma(N,3/2,2)
sigma2 = 1/phi
mu = rnorm(N, 1, sqrt(sigma2/4))
y_5 = rnorm(N, mu, sqrt(sigma2))

We can view an estimate of the predictive distribution, by looking at the a smoothed version of the histogram of the simulated data:

hist(y_5, prob=T, breaks=100, xlab=expression(y[5]), main="")

A 95% central credible interval for a new observation is the interval (L, U) where \(P(Y_5 < L \mid Y) = .05/2\) and \(P(Y_5 > U \mid Y) = .05/2)\). In this case L is the 0.025 quantile and U is the 0.975 quantile. We can obtain those values using the quantile function to find the sample quantiles for 0.025 and 0.975 of \(y_5\).

Estimate a 95% central credible interval for a new observation \(y_5\)
1. (-3.71, 5.73)
2. (-3.11, 5.13)
3. (-1.18, 3.19)

L <- quantile(y_5, 0.025)
U <- quantile(y_5, 0.975)
c(L, U)

##      2.5%     97.5% 
## -3.109585  5.132511

Exercise: In the simple example above, it is possible to use integration to calculate the posterior predictive analytically. In this case, it is a scaled \(t\) distribution with 3 degrees of freedom \((n - 1)\) with mean \(1\) and scale = 5/3 (\(s^2(1 + 1/n)\)). Plot the empirical density of \(y\) alongside the actual density of the t-distribution. How do they compare?

x <- y_5
y <- pt(x, 3)
plot(x, y, type = "p")

Prediction with BAS

Simulation is used in BAS to construct predictive intervals with Bayesian Model averaging, while exact inference is often possible with predictive intervals under model selection.

Returning to the wage data set, let’s find predictive values under the best predictive model, the one that has predictions closest to BMA and corresponding posterior standard deviations.

BPM_pred_lwage =  predict(bma_lwage, estimator="BPM", se.fit=TRUE)
bma_lwage$namesx[BPM_pred_lwage$bestmodel+1]

##  [1] "Intercept" "hours"     "iq"        "kww"       "educ"     
##  [6] "exper"     "tenure"    "age"       "married1"  "urban1"   
## [11] "meduc"

We can compare this to the Highest probability model that we found earlier and the Median Probability Model (MPM)

MPM_pred_lwage =  predict(bma_lwage, estimator="MPM")
bma_lwage$namesx[MPM_pred_lwage$bestmodel+1]

##  [1] "Intercept" "hours"     "iq"        "educ"      "exper"    
##  [6] "tenure"    "age"       "married1"  "urban1"    "meduc"

The MPM includes exper in addition to all of the variables as the HPM, while the BPM includes kwh in addition to all of the variables in the MPM.

Exercise: Using the reduced data, what covariates are included in the best predictive model, the median probability model and the highest posterior probability model?

A: Three models include the same covariates. Those are “hours”, “iq”, “educ”, “exper”, “tenure”, “married”, “black”, “south” and “urabn”.

# BPM reduced #
BPM_pred_lwage_red <- predict(bma_lwage_red, estimator = "BPM")
BPM_cov <- bma_lwage_red$namesx[BPM_pred_lwage_red$bestmodel+1]

# MPM reduced #
MPM_pred_lwage_red <- predict(bma_lwage_red, estimator = "MPM")
MPM_cov <- bma_lwage_red$namesx[MPM_pred_lwage_red$bestmodel+1]

# HPM reduced #
HPM_pred_lwage_red <- predict(bma_lwage_red, estimator = "HPM")
HPM_cov <- bma_lwage_red$namesx[HPM_pred_lwage_red$bestmodel+1]

BPM_cov; MPM_cov; HPM_cov

##  [1] "Intercept" "hours"     "iq"        "educ"      "exper"    
##  [6] "tenure"    "married1"  "black1"    "south1"    "urban1"

##  [1] "Intercept" "hours"     "iq"        "educ"      "exper"    
##  [6] "tenure"    "married1"  "black1"    "south1"    "urban1"

##  [1] "Intercept" "hours"     "iq"        "educ"      "exper"    
##  [6] "tenure"    "married1"  "black1"    "south1"    "urban1"

Let’s turn to see what characteristics lead to the highest wages with the BPM model.

opt = which.max(BPM_pred_lwage$fit)
t(wage_no_na[opt, ])

##         [,1]     
## wage    "1586"   
## hours   "40"     
## iq      "127"    
## kww     "48"     
## educ    "16"     
## exper   "16"     
## tenure  "12"     
## age     "37"     
## married "1"      
## black   "0"      
## south   "0"      
## urban   "1"      
## sibs    "4"      
## brthord "4"      
## meduc   "16"     
## feduc   "16"     
## lwage   "7.36897"

A 95% credible interval for predicting log wages can be obtained by

ci_lwage = confint(BPM_pred_lwage, parm="pred")
ci_lwage[opt,]

##     2.5%    97.5%     pred 
## 6.661863 8.056457 7.359160

To translated back to wages, we may exponentiate the interval

exp(ci_lwage[opt,])

##      2.5%     97.5%      pred 
##  782.0062 3154.0967 1570.5169

to obtain a 95% prediction interval for the wages of an individual with covariates at the levels of the individual specified by opt.

If were to use BMA, the interval would be

BMA_pred_lwage =  predict(bma_lwage, estimator="BMA", se.fit=TRUE)
ci_bma_lwage = confint(BMA_pred_lwage, estimator="BMA")
opt_bma = which.max(BMA_pred_lwage$fit)
exp(ci_bma_lwage[opt_bma,])

##      2.5%     97.5%      pred 
##  733.3446 2989.2076 1494.9899

Exercise: Using the reduced data, construct a 95% prediction interval for the individual who is predicted to have the highest predicted wages under the BPM.

BPM_pred_lwage_red <- predict(bma_lwage_red, estimator = "BPM", se.fit = TRUE)
ci_bpm_lwage_red <- confint(BPM_pred_lwage_red, estimator = "BPM")
opt_bpm <- which.max(BPM_pred_lwage_red$fit)
exp(ci_bpm_lwage_red[opt_bpm,])

##      2.5%     97.5%      pred 
##  706.4592 2950.5928 1443.7706

References

Wooldridge, Jeffrey. 2000. Introductory Econometrics- A Modern Approach. South-Western College Publishing. http://fmwww.bc.edu/ec-p/data/wooldridge/wage2.dta.

Bayesian Linear Regression