Posterior QQ envelopes for linear regression

QQ (quantile-quantile) plots are a useful tool for comparing two probability distributions. In particular, they provide a simple check of the normality assumption on the residual errors of a linear regression model: the empirical distribution of the residuals is compared against the fitted normal distribution.

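As a point of reference for the Bayesian construction developed below, a minimal classical version of this check can be written with base R only (the data-generating step here is purely illustrative):

set.seed(1)
x   <- rnorm(100)
y   <- 1 + 2*x + rnorm(100, 0, 0.5)   # data generated under the fitted model
fit <- lm(y ~ x)
qqnorm(rstandard(fit))                # empirical quantiles of standardised residuals vs N(0,1)
qqline(rstandard(fit))                # reference line through the quartiles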
Consider the linear regression model \[{\bf y}={\bf X}\beta+\epsilon,\] where \({\bf y}=(y_1,\dots,y_n)^\top\), \({\bf X}\) is the \(n\times p\) design matrix, \(\beta\in {\mathbb R}^p\), and \(\epsilon_i\stackrel{iid}{\sim} N(0,\sigma^2)\). In a Bayesian context, one can simulate \(m\) samples of size \(n\) from the posterior predictive distribution \(\pi(y_{n+1}\vert{\bf y},{\bf X})\), using a sample from the posterior distribution of \((\beta,\sigma)\), and compare them against the observed sample. This strategy defines an envelope of QQ plots: a \(100(1-\alpha)\%\) QQ envelope is obtained by taking, at each quantile point, a \(100(1-\alpha)\%\) probability interval of the simulated quantiles. The resulting plot is a useful tool for visually checking the predictive ability of the model. Examples 1-4 in the R code below illustrate the construction of QQ envelopes for linear regression models, fitted assuming normal errors, in four cases (the corresponding error generators are previewed right after this list):

  1. The errors are truly normal.

  2. The errors are heavy tailed.

  3. The errors are asymmetric.

  4. The errors come from a mixture of normals.

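For concreteness, the error distributions used in the four cases are generated as follows; this is only a preview of the simulation code in Examples 1-4 (rtp3 is the two-piece sampler from the twopiece package loaded in Example 1, and the object names are illustrative):

library(twopiece)                                 # provides rtp3 (two-piece distributions)
set.seed(123)
n <- 250
e_normal <- rnorm(n, 0, 0.5)                      # case 1: normal errors
e_heavy  <- rt(n, df = 2)                         # case 2: heavy tails (Student t with 2 df)
e_skew   <- rtp3(n, 0, 0.5, -0.75, FUN = rnorm, param = "eps")  # case 3: asymmetric (two-piece normal)
e_mix    <- c(rnorm(50, -1, 5), rnorm(200, 3, 1)) # case 4: two-component normal mixture (weights 0.2 and 0.8)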
The prior used for the fitted linear regression model with normal errors is the non-informative prior \(\pi(\beta,\sigma)\propto \dfrac{1}{\sigma}\). The strategy to construct the QQ envelopes is explained in [1] (see the References), and summarised below:

  1. Obtain \(m\) samples \({\bf y}^{(1)}_{n+1},\dots,{\bf y}^{(m)}_{n+1}\), each of size \(n\) (the same size as the original data), from the posterior predictive distribution \(\pi(y_{n+1}\vert{\bf y},{\bf X})\).

  2. Construct \(m\) QQ plots, one for each simulated sample against the original data. From these \(m\) QQ plots, generate an envelope by taking the \(\alpha/2\) and \(1-\alpha/2\) quantiles at each quantile point; this envelope is shown as the shaded area in the plots below.

  3. Compare the envelope against the straight line with intercept 0 and slope 1, which represents a perfect fit. The plots below also show the mean of the QQ plots, which summarises the centre of the predictive distribution at each quantile point. (A compact, self-contained sketch of these three steps is given right after this list.)

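The following is a compact, self-contained sketch of these three steps; it is not the code used in the examples below (which sample the posterior with the Rtwalk MCMC sampler), and the function name qq_envelope and its arguments are ad hoc. It uses the standard closed form of the posterior under the prior \(\pi(\beta,\sigma)\propto 1/\sigma\): \((n-p)s^2/\sigma^2\vert{\bf y}\sim\chi^2_{n-p}\) and \(\beta\vert\sigma,{\bf y}\sim N_p(\hat{\beta},\sigma^2({\bf X}^\top{\bf X})^{-1})\), where \(\hat{\beta}\) is the least-squares estimate and \(s^2\) the usual residual variance estimate.

qq_envelope <- function(y, X, m = 5000, alpha = 0.05) {
  n <- nrow(X); p <- ncol(X)
  XtX_inv  <- solve(crossprod(X))                  # (X'X)^{-1}
  beta_hat <- XtX_inv %*% crossprod(X, y)          # least-squares estimate / posterior mean of beta
  s2 <- sum((y - X %*% beta_hat)^2) / (n - p)      # residual variance estimate
  R  <- chol(XtX_inv)                              # R'R = (X'X)^{-1}
  ypred <- matrix(0, nrow = m, ncol = n)
  for (j in 1:m) {
    sigma <- sqrt((n - p) * s2 / rchisq(1, df = n - p))  # draw sigma | y
    beta  <- beta_hat + sigma * t(R) %*% rnorm(p)        # draw beta | sigma, y
    ypred[j, ] <- sort(X %*% beta + rnorm(n, 0, sigma))  # step 1: sorted predictive sample of size n
  }
  band <- apply(ypred, 2, quantile, probs = c(alpha/2, 1 - alpha/2))  # step 2: pointwise envelope
  ys <- sort(y)
  plot(ys, colMeans(ypred), type = "l", ylim = range(band),
       xlab = "Data quantiles", ylab = "Predictive quantiles", main = "QQ plot envelope")
  polygon(c(ys, rev(ys)), c(band[1, ], rev(band[2, ])), col = "grey85", border = NA)
  lines(ys, colMeans(ypred))                       # mean of the QQ plots
  abline(a = 0, b = 1, lwd = 2)                    # step 3: the line y = x
}
# e.g. qq_envelope(y1, X1) produces a plot of the same type as in the examples below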
References

  1. Rubio, F. J. and Genton, M. G. (2016). Bayesian linear regression with skew-symmetric error distributions with applications to survival analysis. Statistics in Medicine.
#########################################################################################################################
# Example 1: Good fit
#########################################################################################################################
rm(list=ls())
# required packages
library(Rtwalk)
## Package: Rtwalk, Version: 1.8.0
## The R Implementation of 't-walk' MCMC Sampler
## http://www.cimat.mx/~jac/twalk/
## For citations, please use:
## Christen JA and Fox C (2010). A general purpose sampling algorithm for
## continuous distributions (the t-walk). Bayesian Analysis, 5(2),
## pp. 263-282. <URL:
## http://ba.stat.cmu.edu/journal/2010/vol05/issue02/christen.pdf>.
library(twopiece)
## Loading required package: flexclust
## Warning: package 'flexclust' was built under R version 3.3.2
## Loading required package: grid
## Loading required package: lattice
## Loading required package: modeltools
## Warning: package 'modeltools' was built under R version 3.3.2
## Loading required package: stats4
## Loading required package: mvtnorm
## Loading required package: label.switching
## Warning: package 'label.switching' was built under R version 3.3.2
# Simulated data
ns <- 250 # sample size (number of simulated observations)
set.seed(123)
e1 <- rnorm(ns,0,0.5) # residual errors
X1 <- cbind(1,rnorm(ns),rnorm(ns),rnorm(ns)) # design matrix
beta1 <- c(1,2,3,4) # true value of the regression parameters
y1 = X1%*%beta1 + e1

# Maximum likelihood analysis
summary(lm(y1~X1-1))
## 
## Call:
## lm(formula = y1 ~ X1 - 1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.19494 -0.32331 -0.05984  0.28689  1.62071 
## 
## Coefficients:
##     Estimate Std. Error t value Pr(>|t|)    
## X11  0.99307    0.03001   33.09   <2e-16 ***
## X12  2.03288    0.03006   67.63   <2e-16 ***
## X13  2.99201    0.02982  100.32   <2e-16 ***
## X14  3.99407    0.02959  134.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4726 on 246 degrees of freedom
## Multiple R-squared:  0.9925, Adjusted R-squared:  0.9924 
## F-statistic:  8143 on 4 and 246 DF,  p-value: < 2.2e-16
# Support function
Support <- function(VAR){ (VAR[5]>0)}

# minus log posterior
lp <- function(par){
  ll <- sum(dnorm(y1 - X1%*%par[1:4],0,par[5],log=T))
  lp0 <- -log(par[5])
  return(-ll - lp0)
}

X0 <- function(x) c(beta1,0.5) + runif(5,-0.1,0.1)

# Sampling from the posterior using the twalk
NS = 105000 # number of posterior simulations
out <- Runtwalk(Tr=NS,dim=5,Obj=lp,Supp=Support,x0=X0(),xp0=X0(),PlotLogPost = FALSE)
## This is the twalk for R.
## Evaluating the objective density at initial values.
## Opening 12 X 105001 matrix to save output.  Sampling (no graphics mode).
# Thinning and burn-in
ind <- seq(5000,NS,25)

# Final posterior sample
post.samp<-out$output[ind,]

# Some histograms, traceplots, and posterior means
hist(post.samp[,1])

hist(post.samp[,2])

hist(post.samp[,3])

hist(post.samp[,4])

hist(post.samp[,5])

apply(post.samp,2,mean)
## [1] 0.9935689 2.0320059 2.9929442 3.9946685 0.4751480
plot(post.samp[,1],type="l")

plot(post.samp[,2],type="l")

plot(post.samp[,3],type="l")

plot(post.samp[,4],type="l")

plot(post.samp[,5],type="l")

# Constructing the envelopes
#---------------------------------
# Simulation from the predictive
#---------------------------------

# Function to obtain N samples of size n from the posterior predictive distribution
sim.pred = function(N){
  index <- sample(dim(post.samp)[1],N,replace = T) # resample rows of the posterior sample
  samp <- post.samp[index,]
  y.pred <- matrix(0, nrow = N, ncol = length(y1))
  for(i in 1:N){
    # each of the n observations gets its own error draw
    y.pred[i,] <- X1%*%samp[i,1:4] + rnorm(length(y1),0,samp[i,5])
  }
  return(y.pred)
}

N = 5000 
simpred1 = sim.pred(N)

#-------------------------------------------
# Predictive QQ plot envelope
#-------------------------------------------

fitsim <- matrix(rep(0,ns*N),ncol=ns,nrow=N)

for(i in 1:N){
  fitsim[i,] <- sort(sample(simpred1,ns,replace=F))
}
# Predictive envelope bounds (2.5% and 97.5% quantiles) and mean
smin = smax = smean = vector()
for(i in 1:ns) smin[i] = quantile(fitsim[,i],0.025)
for(i in 1:ns) smax[i] = quantile(fitsim[,i],0.975)
for(i in 1:ns) smean[i] = mean(fitsim[,i])

# Plot against the line y=x
plot(sort(y1),smin,xlim=range(y1),ylim=c(-15,25),pch=3,cex=0.5,type="l",cex.axis=1.5,xlab="",ylab="",main="QQ plot envelope")
points(sort(y1),smax,pch=3,cex=0.5,type="l")
polygon(c(sort(y1),rev(sort(y1))),c(smin,rev(smax)),col=colors()[355])
points(sort(y1),smean,pch=3,cex=0.5,type="l")
abline(a=0,b=1,lwd=2)
box()

########################################################################################################################
# Example 2: Bad fit due to heavy tails
########################################################################################################################
rm(list=ls())
# Simulated data
ns <- 250 # sample size (number of simulated observations)
set.seed(123)
e1 <- rt(ns,df=2) # residual errors
X1 <- cbind(1,rnorm(ns),rnorm(ns),rnorm(ns)) # design matrix
beta1 <- c(1,2,3,4) # true value of the regression parameters
y1 = X1%*%beta1 + e1

# Maximum likelihood analysis
summary(lm(y1~X1-1))
## 
## Call:
## lm(formula = y1 ~ X1 - 1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -30.0878  -0.8453   0.1186   0.9966  14.7995 
## 
## Coefficients:
##     Estimate Std. Error t value Pr(>|t|)    
## X11   0.8738     0.1774   4.926 1.54e-06 ***
## X12   1.8067     0.1721  10.496  < 2e-16 ***
## X13   2.8086     0.1737  16.174  < 2e-16 ***
## X14   4.0561     0.1799  22.547  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.798 on 246 degrees of freedom
## Multiple R-squared:  0.8054, Adjusted R-squared:  0.8022 
## F-statistic: 254.5 on 4 and 246 DF,  p-value: < 2.2e-16
# Support function
Support <- function(VAR){ (VAR[5]>0)}

# minus log posterior
lp <- function(par){
  ll <- sum(dnorm(y1 - X1%*%par[1:4],0,par[5],log=T))
  lp0 <- -log(par[5])
  return(-ll - lp0)
}

X0 <- function(x) c(beta1,0.5) + runif(5,-0.1,0.1)

# Sampling from the posterior using the twalk
NS = 105000 # number of posterior simulations
out <- Runtwalk(Tr=NS,dim=5,Obj=lp,Supp=Support,x0=X0(),xp0=X0(),PlotLogPost = FALSE)
## This is the twalk for R.
## Evaluating the objective density at initial values.
## Opening 12 X 105001 matrix to save output.  Sampling (no graphics mode).
# Thinning and burn-in
ind <- seq(5000,NS,25)

# Final posterior sample
post.samp<-out$output[ind,]

# Some histograms, traceplots, and posterior means
hist(post.samp[,1])

hist(post.samp[,2])

hist(post.samp[,3])

hist(post.samp[,4])

hist(post.samp[,5])

apply(post.samp,2,mean)
## [1] 0.8585485 1.8092475 2.8105556 4.0556496 2.8081134
plot(post.samp[,1],type="l")

plot(post.samp[,2],type="l")

plot(post.samp[,3],type="l")

plot(post.samp[,4],type="l")

plot(post.samp[,5],type="l")

# Constructing the envelopes
#---------------------------------
# Simulation from the predictive
#---------------------------------

# Function to obtain N samples of size n from the posterior predictive distribution
sim.pred = function(N){
  index <- sample(dim(post.samp)[1],N,replace = T) # resample rows of the posterior sample
  samp <- post.samp[index,]
  y.pred <- matrix(0, nrow = N, ncol = length(y1))
  for(i in 1:N){
    # each of the n observations gets its own error draw
    y.pred[i,] <- X1%*%samp[i,1:4] + rnorm(length(y1),0,samp[i,5])
  }
  return(y.pred)
}

N = 5000
simpred1 = sim.pred(N)

#-------------------------------------------
# Predictive QQ plot envelope
#-------------------------------------------

fitsim <- matrix(rep(0,ns*N),ncol=ns,nrow=N)

for(i in 1:N){
  fitsim[i,] <- sort(sample(simpred1,ns,replace=F))
}
# Predictive envelope bounds (2.5% and 97.5% quantiles) and mean
smin = smax = smean = vector()
for(i in 1:ns) smin[i] = quantile(fitsim[,i],0.025)
for(i in 1:ns) smax[i] = quantile(fitsim[,i],0.975)
for(i in 1:ns) smean[i] = mean(fitsim[,i])

# Plot against the line y=x
plot(sort(y1),smin,xlim=range(y1),ylim=c(-25,25),pch=3,cex=0.5,type="l",cex.axis=1.5,xlab="",ylab="",main="QQ plot envelope")
points(sort(y1),smax,pch=3,cex=0.5,type="l")
polygon(c(sort(y1),rev(sort(y1))),c(smin,rev(smax)),col=colors()[355])
points(sort(y1),smean,pch=3,cex=0.5,type="l")
abline(a=0,b=1,lwd=2)
box()

########################################################################################################################
# Example 3: Bad fit due to asymmetry (there is bias in the estimates and a lot of variability in the right-tail)
########################################################################################################################
rm(list=ls())
# Simulated data
ns <- 250 # sample size (number of simulated observations)
set.seed(123)
e1 <- rtp3(ns,0,0.5,-0.75,FUN=rnorm,param="eps") # residual errors
X1 <- cbind(1,rnorm(ns),rnorm(ns),rnorm(ns)) # design matrix
beta1 <- c(1,2,3,4) # true value of the regression parameters
y1 = X1%*%beta1 + e1

# Maximum likelihood analysis
summary(lm(y1~X1-1))
## 
## Call:
## lm(formula = y1 ~ X1 - 1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9083 -0.4242 -0.1028  0.3341  1.8268 
## 
## Coefficients:
##     Estimate Std. Error t value Pr(>|t|)    
## X11  1.61165    0.03427   47.02   <2e-16 ***
## X12  2.01463    0.03314   60.80   <2e-16 ***
## X13  3.01639    0.03349   90.07   <2e-16 ***
## X14  3.98965    0.03522  113.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5406 on 246 degrees of freedom
## Multiple R-squared:  0.9915, Adjusted R-squared:  0.9913 
## F-statistic:  7140 on 4 and 246 DF,  p-value: < 2.2e-16
# Support function
Support <- function(VAR){ (VAR[5]>0)}

# minus log posterior
lp <- function(par){
  ll <- sum(dnorm(y1 - X1%*%par[1:4],0,par[5],log=T))
  lp0 <- -log(par[5])
  return(-ll - lp0)
}

X0 <- function(x) c(beta1,0.5) + runif(5,-0.1,0.1)

# Sampling from the posterior using the twalk
NS = 105000 # number of posterior simulations
out <- Runtwalk(Tr=NS,dim=5,Obj=lp,Supp=Support,x0=X0(),xp0=X0(),PlotLogPost = FALSE)
## This is the twalk for R.
## Evaluating the objective density at initial values.
## Opening 12 X 105001 matrix to save output.  Sampling (no graphics mode).
# Thinning and burn-in
ind <- seq(5000,NS,25)

# Final posterior sample
post.samp<-out$output[ind,]

# Some histograms, traceplots, and posterior means
hist(post.samp[,1])

hist(post.samp[,2])

hist(post.samp[,3])

hist(post.samp[,4])

hist(post.samp[,5])

apply(post.samp,2,mean)
## [1] 1.6115685 2.0146675 3.0160383 3.9897554 0.5431182
plot(post.samp[,1],type="l")

plot(post.samp[,2],type="l")

plot(post.samp[,3],type="l")

plot(post.samp[,4],type="l")

plot(post.samp[,5],type="l")

# Constructing the envelopes
#---------------------------------
# Simulation from the predictive
#---------------------------------

# Function to obtain N samples of size n from the posterior predictive distribution
sim.pred = function(N){
  index <- sample(dim(post.samp)[1],N,replace = T) # resample rows of the posterior sample
  samp <- post.samp[index,]
  y.pred <- matrix(0, nrow = N, ncol = length(y1))
  for(i in 1:N){
    # each of the n observations gets its own error draw
    y.pred[i,] <- X1%*%samp[i,1:4] + rnorm(length(y1),0,samp[i,5])
  }
  return(y.pred)
}

N = 5000 
simpred1 = sim.pred(N)

#-------------------------------------------
# Predictive QQ plot envelope
#-------------------------------------------

fitsim <- matrix(rep(0,ns*N),ncol=ns,nrow=N)

for(i in 1:N){
  fitsim[i,] <- sort(sample(simpred1,ns,replace=F))
}
# Predictive envelope bounds (2.5% and 97.5% quantiles) and mean
smin = smax = smean = vector()
for(i in 1:ns) smin[i] = quantile(fitsim[,i],0.025)
for(i in 1:ns) smax[i] = quantile(fitsim[,i],0.975)
for(i in 1:ns) smean[i] = mean(fitsim[,i])

# Plot against the line y=x
plot(sort(y1),smin,xlim=range(y1),ylim=c(-15,25),pch=3,cex=0.5,type="l",cex.axis=1.5,xlab="",ylab="",main="QQ plot envelope")
points(sort(y1),smax,pch=3,cex=0.5,type="l")
polygon(c(sort(y1),rev(sort(y1))),c(smin,rev(smax)),col=colors()[355])
points(sort(y1),smean,pch=3,cex=0.5,type="l")
abline(a=0,b=1,lwd=2)
box()

#########################################################################################################################
# Example 4: Bad fit due to errors from a two-component normal mixture
# (bimodal, weights 0.2 and 0.8): there is bias in the estimates and some variability and lack of fit in the tails
#########################################################################################################################

rm(list=ls())
# Simulated data
ns <- 250 # sample size (number of simulated observations)
set.seed(123)
#e1<-c(rnorm(ns/2,-1,1),rnorm(ns/2,1,1)) # Multimodal two component mixture, weights=0.5
e1<-c(rnorm(50,-1,5),rnorm(200,3,1))  # Bimodal two component mixture, weights=0.2 and 0.8

X1 <- cbind(1,rnorm(ns),rnorm(ns),rnorm(ns)) # design matrix
beta1 <- c(1,2,3,4) # true value of the regression parameters
y1 = X1%*%beta1 + e1

# Maximum likelihood analysis
summary(lm(y1~X1-1))
## 
## Call:
## lm(formula = y1 ~ X1 - 1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.1357  -0.2582   0.5039   1.2362   7.5295 
## 
## Coefficients:
##     Estimate Std. Error t value Pr(>|t|)    
## X11   3.2214     0.1722   18.71   <2e-16 ***
## X12   1.9441     0.1724   11.27   <2e-16 ***
## X13   2.9508     0.1711   17.25   <2e-16 ***
## X14   4.0078     0.1697   23.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.711 on 246 degrees of freedom
## Multiple R-squared:  0.843,  Adjusted R-squared:  0.8405 
## F-statistic: 330.3 on 4 and 246 DF,  p-value: < 2.2e-16
# Support function
Support <- function(VAR){ (VAR[5]>0)}

# minus log posterior
lp <- function(par){
  ll <- sum(dnorm(y1 - X1%*%par[1:4],0,par[5],log=T))
  lp0 <- -log(par[5])
  return(-ll - lp0)
}

X0 <- function(x) c(beta1,0.5) + runif(5,-0.1,0.1)

# Sampling from the posterior using the twalk
NS = 105000 # number of posterior simulations
out <- Runtwalk(Tr=NS,dim=5,Obj=lp,Supp=Support,x0=X0(),xp0=X0(), PlotLogPost = FALSE)
## This is the twalk for R.
## Evaluating the objective density at initial values.
## Opening 12 X 105001 matrix to save output.  Sampling (no graphics mode).
# Thinning and burn-in
ind <- seq(5000,NS,25)

# Final posterior sample
post.samp<-out$output[ind,]

# Some histograms, traceplots, and posterior means
hist(post.samp[,1])

hist(post.samp[,2])

hist(post.samp[,3])

hist(post.samp[,4])

hist(post.samp[,5])

apply(post.samp,2,mean)
## [1] 3.223586 1.939797 2.956954 4.003701 2.722752
plot(post.samp[,1],type="l")

plot(post.samp[,2],type="l")

plot(post.samp[,3],type="l")

plot(post.samp[,4],type="l")

plot(post.samp[,5],type="l")

# Constructing the envelopes
#---------------------------------
# Simulation from the predictive
#---------------------------------

# Function to obtain N samples of size n from the posterior predictive distribution
sim.pred = function(N){
  index <- sample(dim(post.samp)[1],N,replace = T) # resample rows of the posterior sample
  samp <- post.samp[index,]
  y.pred <- matrix(0, nrow = N, ncol = length(y1))
  for(i in 1:N){
    # each of the n observations gets its own error draw
    y.pred[i,] <- X1%*%samp[i,1:4] + rnorm(length(y1),0,samp[i,5])
  }
  return(y.pred)
}

N = 5000 
simpred1 = sim.pred(N)

#-------------------------------------------
# Predictive QQ plot envelope
#-------------------------------------------

fitsim <- matrix(rep(0,ns*N),ncol=ns,nrow=N)

for(i in 1:N){
  fitsim[i,] <- sort(sample(simpred1,ns,replace=F))
}
# Predictive envelope bounds (2.5% and 97.5% quantiles) and mean
smin = smax = smean = vector()
for(i in 1:ns) smin[i] = quantile(fitsim[,i],0.025)
for(i in 1:ns) smax[i] = quantile(fitsim[,i],0.975)
for(i in 1:ns) smean[i] = mean(fitsim[,i])

# Plot against the line y=x
plot(sort(y1),smin,xlim=range(y1),ylim=c(-20,25),pch=3,cex=0.5,type="l",cex.axis=1.5,xlab="",ylab="",main="QQ plot envelope")
points(sort(y1),smax,pch=3,cex=0.5,type="l")
polygon(c(sort(y1),rev(sort(y1))),c(smin,rev(smax)),col=colors()[355])
points(sort(y1),smean,pch=3,cex=0.5,type="l")
abline(a=0,b=1,lwd=2)
box()