QQ plots are a useful tool to compare two probability distributions. In particular, they represent a powerful tool to check the validity of the assumption of the normality of the residual errors in linear regression models. The idea is to compare the empirical distribution of the residual errors against the fitted normal distribution.
Consider the linear regression model \[{\bf y}={\bf X}^T\beta+\epsilon\] where \({\bf y}=(y_1,\dots,y_n)\), \(\beta\in {\mathbb R}^p\), and \(\epsilon_i\stackrel{iid}{\sim} N(0,\sigma)\). In a Bayesian context, one can think of simulating \(m\) samples of size \(n\) from the predictive distribution of \(\pi(y_{n+1}\vert{\bf y},{\bf X})\), and comparing them against the true sample, which can be done using a sample from the posterior distribution of \((\beta,\sigma)\). This strategy defines an envelope of QQ plots. Thus, it is possible to construct a \((1-\alpha)\%\) QQ envelope by taking, at each point of the sample, the \((1-\alpha)\%\) probability interval of the samples. This plot is a useful tool to visually check the predictive ability of the model. Examples 1-4 in the following R code illustrates the construction of QQ envelopes for linear regression models, assuming normal errors, in 4 cases:
The errors are truly normal.
The errors are heavy tailed.
The errors are asymmetric.
The errors come from a mixture of normals.
The prior used for the fitted linear regression model with normal errors is the non-informative prior \(\pi(\beta,\sigma)\propto \dfrac{1}{\sigma}\). The strategy to construct the QQ envelopes is explained in [1] (see the References), and summarised below:
Obtain \({\bf y}^{(1)}_{n+1},\dots,{\bf y}^{(m)}_{n+1}\) samples of size \(n\) (same size as the original data) from the posterior predictive distribution \(\pi(y_{n+1}\vert{\bf y},{\bf X})\).
Construct \(m\) QQ-plots for each simulated sample against the original data. Using these \(m\) QQ-plots, we can generate an envelope, by taking the \(\alpha\%\) and \((1-\alpha)\%\) quantiles of the QQ-plots at each quantile point, which is shown in the shaded area.
The envelope is compared against a straight line with intercept 0 and slope 1, which represents a perfect fit. The plots below also show the mean of the QQ plots, which represent the predictive distribution.
References
#########################################################################################################################
# Example 1: Good fit
#########################################################################################################################
rm(list=ls())
# required packages
library(Rtwalk)
## Package: Rtwalk, Version: 1.8.0
## The R Implementation of 't-walk' MCMC Sampler
## http://www.cimat.mx/~jac/twalk/
## For citations, please use:
## Christen JA and Fox C (2010). A general purpose sampling algorithm for
## continuous distributions (the t-walk). Bayesian Analysis, 5(2),
## pp. 263-282. <URL:
## http://ba.stat.cmu.edu/journal/2010/vol05/issue02/christen.pdf>.
library(twopiece)
## Loading required package: flexclust
## Warning: package 'flexclust' was built under R version 3.3.2
## Loading required package: grid
## Loading required package: lattice
## Loading required package: modeltools
## Warning: package 'modeltools' was built under R version 3.3.2
## Loading required package: stats4
## Loading required package: mvtnorm
## Loading required package: label.switching
## Warning: package 'label.switching' was built under R version 3.3.2
# Simulated data
ns <- 250 # number of simulations
set.seed(123)
e1 <- rnorm(ns,0,0.5) # residual errors
X1 <- cbind(1,rnorm(ns),rnorm(ns),rnorm(ns)) # design matrix
beta1 <- c(1,2,3,4) # true value of the regression parameters
y1 = X1%*%beta1 + e1
# Maximum Likehood analysis
summary(lm(y1~X1-1))
##
## Call:
## lm(formula = y1 ~ X1 - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.19494 -0.32331 -0.05984 0.28689 1.62071
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## X11 0.99307 0.03001 33.09 <2e-16 ***
## X12 2.03288 0.03006 67.63 <2e-16 ***
## X13 2.99201 0.02982 100.32 <2e-16 ***
## X14 3.99407 0.02959 134.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4726 on 246 degrees of freedom
## Multiple R-squared: 0.9925, Adjusted R-squared: 0.9924
## F-statistic: 8143 on 4 and 246 DF, p-value: < 2.2e-16
# Support function
Support <- function(VAR){ (VAR[5]>0)}
# minus log posterior
lp <- function(par){
ll <- sum(dnorm(y1 - X1%*%par[1:4],0,par[5],log=T))
lp0 <- -log(par[5])
return(-ll - lp0)
}
X0 <- function(x) c(beta1,0.5) + runif(5,-0.1,0.1)
# Sampling from the posterior using the twalk
NS = 105000 # number of posterior simulations
out <- Runtwalk(Tr=NS,dim=5,Obj=lp,Supp=Support,x0=X0(),xp0=X0(),PlotLogPost = FALSE)
## This is the twalk for R.
## Evaluating the objective density at initial values.
## Opening 12 X 105001 matrix to save output. Sampling (no graphics mode).
# Thinning and burn-in
ind <- seq(5000,NS,25)
# Final posterior sample
post.samp<-out$output[ind,]
# Some histograms, traceplots, and posterior medians
hist(post.samp[,1])
hist(post.samp[,2])
hist(post.samp[,3])
hist(post.samp[,4])
hist(post.samp[,5])
apply(post.samp,2,mean)
## [1] 0.9935689 2.0320059 2.9929442 3.9946685 0.4751480
plot(post.samp[,1],type="l")
plot(post.samp[,2],type="l")
plot(post.samp[,3],type="l")
plot(post.samp[,4],type="l")
plot(post.samp[,5],type="l")
# Constructing the envelopes
#---------------------------------
# Simulation from the predictive
#---------------------------------
# Functions to obtain N samples from the predictive distribution
sim.pred = function(N){
index <- sample(dim(post.samp)[1],N,replace = T)
samp <- post.samp[index,]
y.pred <- matrix(0, nrow = N, ncol = length(y1))
for(i in 1:N){
y.pred[i,] <- X1%*%samp[i,1:4] + rnorm(1,0,samp[i,5])
}
return(y.pred)
}
N = 5000
simpred1 = sim.pred(N)
#-------------------------------------------
# Predictive QQ plot envelope
#-------------------------------------------
fitsim <- matrix(rep(0,ns*N),ncol=ns,nrow=N)
for(i in 1:N){
fitsim[i,] <- sort(sample(simpred1,ns,replace=F))
}
# Predictive envelops
smin = smax = smean = vector()
for(i in 1:ns) smin[i] = quantile(fitsim[,i],0.025)
for(i in 1:ns) smax[i] = quantile(fitsim[,i],0.975)
for(i in 1:ns) smean[i] = mean(fitsim[,i])
# Plot agains the line y=x
plot(sort(y1),smin,xlim=c(min(sort(y1)),max(sort(y1))),ylim=c(-15,25),pch=3,cex=0.5,type="l",cex.axis=1.5,xlab="",ylab="",main="QQ plot envelope")
points(sort(y1),smax,xlim=c(0,1.2),ylim=c(0,2),pch=3,cex=0.5,type="l")
polygon(c(sort(y1),rev(sort(y1))),c(smin,rev(smax)),col=colors()[355])
points(sort(y1),smean,xlim=c(0,1.2),ylim=c(0,2),pch=3,cex=0.5,type="l")
abline(a=0,b=1,lwd=2)
box()
########################################################################################################################
# Example 2: Bad fit due to heavy tails
########################################################################################################################
rm(list=ls())
# Simulated data
ns <- 250 # number of simulations
set.seed(123)
e1 <- rt(ns,df=2) # residual errors
X1 <- cbind(1,rnorm(ns),rnorm(ns),rnorm(ns)) # design matrix
beta1 <- c(1,2,3,4) # true value of the regression parameters
y1 = X1%*%beta1 + e1
# Maximum Likehood analysis
summary(lm(y1~X1-1))
##
## Call:
## lm(formula = y1 ~ X1 - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.0878 -0.8453 0.1186 0.9966 14.7995
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## X11 0.8738 0.1774 4.926 1.54e-06 ***
## X12 1.8067 0.1721 10.496 < 2e-16 ***
## X13 2.8086 0.1737 16.174 < 2e-16 ***
## X14 4.0561 0.1799 22.547 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.798 on 246 degrees of freedom
## Multiple R-squared: 0.8054, Adjusted R-squared: 0.8022
## F-statistic: 254.5 on 4 and 246 DF, p-value: < 2.2e-16
# Support function
Support <- function(VAR){ (VAR[5]>0)}
# minus log posterior
lp <- function(par){
ll <- sum(dnorm(y1 - X1%*%par[1:4],0,par[5],log=T))
lp0 <- -log(par[5])
return(-ll - lp0)
}
X0 <- function(x) c(beta1,0.5) + runif(5,-0.1,0.1)
# Sampling from the posterior using the twalk
NS = 105000 # number of posterior simulations
out <- Runtwalk(Tr=NS,dim=5,Obj=lp,Supp=Support,x0=X0(),xp0=X0(),PlotLogPost = FALSE)
## This is the twalk for R.
## Evaluating the objective density at initial values.
## Opening 12 X 105001 matrix to save output. Sampling (no graphics mode).
# Thinning and burn-in
ind <- seq(5000,NS,25)
# Final posterior sample
post.samp<-out$output[ind,]
# Some histograms, traceplots, and posterior medians
hist(post.samp[,1])
hist(post.samp[,2])
hist(post.samp[,3])
hist(post.samp[,4])
hist(post.samp[,5])
apply(post.samp,2,mean)
## [1] 0.8585485 1.8092475 2.8105556 4.0556496 2.8081134
plot(post.samp[,1],type="l")
plot(post.samp[,2],type="l")
plot(post.samp[,3],type="l")
plot(post.samp[,4],type="l")
plot(post.samp[,5],type="l")
# Constructing the envelopes
#---------------------------------
# Simulation from the predictive
#---------------------------------
# Functions to obtain N samples from the predictive distribution
sim.pred = function(N){
index <- sample(dim(post.samp)[1],N,replace = T)
samp <- post.samp[index,]
y.pred <- matrix(0, nrow = N, ncol = length(y1))
for(i in 1:N){
y.pred[i,] <- X1%*%samp[i,1:4] + rnorm(1,0,samp[i,5])
}
return(y.pred)
}
N = 5000
simpred1 = sim.pred(N)
#-------------------------------------------
# Predictive QQ plot envelope
#-------------------------------------------
fitsim <- matrix(rep(0,ns*N),ncol=ns,nrow=N)
for(i in 1:N){
fitsim[i,] <- sort(sample(simpred1,ns,replace=F))
}
# Predictive envelops
smin = smax = smean = vector()
for(i in 1:ns) smin[i] = quantile(fitsim[,i],0.025)
for(i in 1:ns) smax[i] = quantile(fitsim[,i],0.975)
for(i in 1:ns) smean[i] = mean(fitsim[,i])
# Plot agains the line y=x
plot(sort(y1),smin,xlim=c(min(sort(y1)),max(sort(y1))),ylim=c(-25,25),pch=3,cex=0.5,type="l",cex.axis=1.5,xlab="",ylab="",main="QQ plot envelope")
points(sort(y1),smax,xlim=c(0,1.2),ylim=c(0,2),pch=3,cex=0.5,type="l")
polygon(c(sort(y1),rev(sort(y1))),c(smin,rev(smax)),col=colors()[355])
points(sort(y1),smean,xlim=c(0,1.2),ylim=c(0,2),pch=3,cex=0.5,type="l")
abline(a=0,b=1,lwd=2)
box()
########################################################################################################################
# Example 3: Bad fit due to asymmetry (there is bias in the estimates and a lot of variability in the right-tail)
########################################################################################################################
rm(list=ls())
# Simulated data
ns <- 250 # number of simulations
set.seed(123)
e1 <- rtp3(ns,0,0.5,-0.75,FUN=rnorm,param="eps") # residual errors
X1 <- cbind(1,rnorm(ns),rnorm(ns),rnorm(ns)) # design matrix
beta1 <- c(1,2,3,4) # true value of the regression parameters
y1 = X1%*%beta1 + e1
# Maximum Likehood analysis
summary(lm(y1~X1-1))
##
## Call:
## lm(formula = y1 ~ X1 - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9083 -0.4242 -0.1028 0.3341 1.8268
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## X11 1.61165 0.03427 47.02 <2e-16 ***
## X12 2.01463 0.03314 60.80 <2e-16 ***
## X13 3.01639 0.03349 90.07 <2e-16 ***
## X14 3.98965 0.03522 113.27 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5406 on 246 degrees of freedom
## Multiple R-squared: 0.9915, Adjusted R-squared: 0.9913
## F-statistic: 7140 on 4 and 246 DF, p-value: < 2.2e-16
# Support function
Support <- function(VAR){ (VAR[5]>0)}
# minus log posterior
lp <- function(par){
ll <- sum(dnorm(y1 - X1%*%par[1:4],0,par[5],log=T))
lp0 <- -log(par[5])
return(-ll - lp0)
}
X0 <- function(x) c(beta1,0.5) + runif(5,-0.1,0.1)
# Sampling from the posterior using the twalk
NS = 105000 # number of posterior simulations
out <- Runtwalk(Tr=NS,dim=5,Obj=lp,Supp=Support,x0=X0(),xp0=X0(),PlotLogPost = FALSE)
## This is the twalk for R.
## Evaluating the objective density at initial values.
## Opening 12 X 105001 matrix to save output. Sampling (no graphics mode).
# Thinning and burn-in
ind <- seq(5000,NS,25)
# Final posterior sample
post.samp<-out$output[ind,]
# Some histograms, traceplots, and posterior medians
hist(post.samp[,1])
hist(post.samp[,2])
hist(post.samp[,3])
hist(post.samp[,4])
hist(post.samp[,5])
apply(post.samp,2,mean)
## [1] 1.6115685 2.0146675 3.0160383 3.9897554 0.5431182
plot(post.samp[,1],type="l")
plot(post.samp[,2],type="l")
plot(post.samp[,3],type="l")
plot(post.samp[,4],type="l")
plot(post.samp[,5],type="l")
# Constructing the envelopes
#---------------------------------
# Simulation from the predictive
#---------------------------------
# Functions to obtain N samples from the predictive distribution
sim.pred = function(N){
index <- sample(dim(post.samp)[1],N,replace = T)
samp <- post.samp[index,]
y.pred <- matrix(0, nrow = N, ncol = length(y1))
for(i in 1:N){
y.pred[i,] <- X1%*%samp[i,1:4] + rnorm(1,0,samp[i,5])
}
return(y.pred)
}
N = 5000
simpred1 = sim.pred(N)
#-------------------------------------------
# Predictive QQ plot envelope
#-------------------------------------------
fitsim <- matrix(rep(0,ns*N),ncol=ns,nrow=N)
for(i in 1:N){
fitsim[i,] <- sort(sample(simpred1,ns,replace=F))
}
# Predictive envelops
smin = smax = smean = vector()
for(i in 1:ns) smin[i] = quantile(fitsim[,i],0.025)
for(i in 1:ns) smax[i] = quantile(fitsim[,i],0.975)
for(i in 1:ns) smean[i] = mean(fitsim[,i])
# Plot agains the line y=x
plot(sort(y1),smin,xlim=c(min(sort(y1)),max(sort(y1))),ylim=c(-15,25),pch=3,cex=0.5,type="l",cex.axis=1.5,xlab="",ylab="",main="QQ plot envelope")
points(sort(y1),smax,xlim=c(0,1.2),ylim=c(0,2),pch=3,cex=0.5,type="l")
polygon(c(sort(y1),rev(sort(y1))),c(smin,rev(smax)),col=colors()[355])
points(sort(y1),smean,xlim=c(0,1.2),ylim=c(0,2),pch=3,cex=0.5,type="l")
abline(a=0,b=1,lwd=2)
box()
#########################################################################################################################
# Example 4: Bad induced for a two component mixture with equal weights for the errors:
# A mixture inducing bimodality: there is bias in the estimates and some variability and lack of fit in the tails
#########################################################################################################################
rm(list=ls())
# Simulated data
ns <- 250 # number of simulations
set.seed(123)
#e1<-c(rnorm(ns/2,-1,1),rnorm(ns/2,1,1)) # Multimodal two component mixture, weights=0.5
e1<-c(rnorm(50,-1,5),rnorm(200,3,1)) # Bimodal two component mixture, weights=0.2 and 0.8
X1 <- cbind(1,rnorm(ns),rnorm(ns),rnorm(ns)) # design matrix
beta1 <- c(1,2,3,4) # true value of the regression parameters
y1 = X1%*%beta1 + e1
# Maximum Likehood analysis
summary(lm(y1~X1-1))
##
## Call:
## lm(formula = y1 ~ X1 - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.1357 -0.2582 0.5039 1.2362 7.5295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## X11 3.2214 0.1722 18.71 <2e-16 ***
## X12 1.9441 0.1724 11.27 <2e-16 ***
## X13 2.9508 0.1711 17.25 <2e-16 ***
## X14 4.0078 0.1697 23.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.711 on 246 degrees of freedom
## Multiple R-squared: 0.843, Adjusted R-squared: 0.8405
## F-statistic: 330.3 on 4 and 246 DF, p-value: < 2.2e-16
# Support function
Support <- function(VAR){ (VAR[5]>0)}
# minus log posterior
lp <- function(par){
ll <- sum(dnorm(y1 - X1%*%par[1:4],0,par[5],log=T))
lp0 <- -log(par[5])
return(-ll - lp0)
}
X0 <- function(x) c(beta1,0.5) + runif(5,-0.1,0.1)
# Sampling from the posterior using the twalk
NS = 105000 # number of posterior simulations
out <- Runtwalk(Tr=NS,dim=5,Obj=lp,Supp=Support,x0=X0(),xp0=X0(), PlotLogPost = FALSE)
## This is the twalk for R.
## Evaluating the objective density at initial values.
## Opening 12 X 105001 matrix to save output. Sampling (no graphics mode).
# Thinning and burn-in
ind <- seq(5000,NS,25)
# Final posterior sample
post.samp<-out$output[ind,]
# Some histograms, traceplots, and posterior medians
hist(post.samp[,1])
hist(post.samp[,2])
hist(post.samp[,3])
hist(post.samp[,4])
hist(post.samp[,5])
apply(post.samp,2,mean)
## [1] 3.223586 1.939797 2.956954 4.003701 2.722752
plot(post.samp[,1],type="l")
plot(post.samp[,2],type="l")
plot(post.samp[,3],type="l")
plot(post.samp[,4],type="l")
plot(post.samp[,5],type="l")
# Constructing the envelopes
#---------------------------------
# Simulation from the predictive
#---------------------------------
# Functions to obtain N samples from the predictive distribution
sim.pred = function(N){
index <- sample(dim(post.samp)[1],N,replace = T)
samp <- post.samp[index,]
y.pred <- matrix(0, nrow = N, ncol = length(y1))
for(i in 1:N){
y.pred[i,] <- X1%*%samp[i,1:4] + rnorm(1,0,samp[i,5])
}
return(y.pred)
}
N = 5000
simpred1 = sim.pred(N)
#-------------------------------------------
# Predictive QQ plot envelope
#-------------------------------------------
fitsim <- matrix(rep(0,ns*N),ncol=ns,nrow=N)
for(i in 1:N){
fitsim[i,] <- sort(sample(simpred1,ns,replace=F))
}
# Predictive envelops
smin = smax = smean = vector()
for(i in 1:ns) smin[i] = quantile(fitsim[,i],0.025)
for(i in 1:ns) smax[i] = quantile(fitsim[,i],0.975)
for(i in 1:ns) smean[i] = mean(fitsim[,i])
# Plot agains the line y=x
plot(sort(y1),smin,xlim=c(min(sort(y1)),max(sort(y1))),ylim=c(-20,25),pch=3,cex=0.5,type="l",cex.axis=1.5,xlab="",ylab="",main="QQ plot envelope")
points(sort(y1),smax,xlim=c(0,1.2),ylim=c(0,2),pch=3,cex=0.5,type="l")
polygon(c(sort(y1),rev(sort(y1))),c(smin,rev(smax)),col=colors()[355])
points(sort(y1),smean,xlim=c(0,1.2),ylim=c(0,2),pch=3,cex=0.5,type="l")
abline(a=0,b=1,lwd=2)
box()