QQ plots are a useful tool to compare two probability distributions. The idea is to compare the empirical distribution of the data against the quantiles of the fitted normal distribution.
Let \({\bf y}=(y_1,\dots,y_n)\) be a random sample, and suppose that we are interested in visually assessing the normality assumption on this sample. In a Bayesian context, one can think of simulating \(m\) samples of size \(n\) from the predictive distribution of \(\pi(y_{n+1}\vert{\bf y})\), and comparing them against the true sample, which can be done using a sample from the posterior distribution of \((\mu,\sigma)\). This strategy defines an envelope of QQ plots. Thus, it is possible to construct a \((1-\alpha)\%\) QQ envelope by taking, at each point of the sample, the \((1-\alpha)\%\) probability interval of the samples. This plot is a useful tool to visually check the predictive ability of the model. Examples 1-3 in the following R code illustrates the construction of QQ envelopes for assesing the normality assumption in 3 cases:
The data are truly normal.
The data are heavy tailed.
The data are asymmetric.
The data are simulated from a mixture of normals.
The prior used for the fitted normal model is the non-informative prior \(\pi(\mu,\sigma)\propto \dfrac{1}{\sigma}\). The strategy to construct the QQ envelopes is explained in [1] (see the References), and summarised below:
Obtain \({\bf y}^{(1)}_{n+1},\dots,{\bf y}^{(m)}_{n+1}\) samples of size \(n\) (same size as the original data) from the posterior predictive distribution \(\pi(y_{n+1}\vert{\bf y})\).
Construct \(m\) QQ-plots for each simulated sample against the original data. Using these \(m\) QQ-plots, we can generate an envelope, by taking the \(\alpha\%\) and \((1-\alpha)\%\) quantiles of the QQ-plots at each quantile point, which is shown in the shaded area.
The envelope is compared against a straight line with intercept 0 and slope 1, which represents a perfect fit. The plots below also show the mean of the QQ plots, which represent the predictive distribution.
References
#########################################################################################################################
# Example 1: Good fit
#########################################################################################################################
rm(list=ls())
# required packages
library(Rtwalk)
## Package: Rtwalk, Version: 1.8.0
## The R Implementation of 't-walk' MCMC Sampler
## http://www.cimat.mx/~jac/twalk/
## For citations, please use:
## Christen JA and Fox C (2010). A general purpose sampling algorithm for
## continuous distributions (the t-walk). Bayesian Analysis, 5(2),
## pp. 263-282. <URL:
## http://ba.stat.cmu.edu/journal/2010/vol05/issue02/christen.pdf>.
library(twopiece)
## Loading required package: flexclust
## Warning: package 'flexclust' was built under R version 3.3.2
## Loading required package: grid
## Loading required package: lattice
## Loading required package: modeltools
## Warning: package 'modeltools' was built under R version 3.3.2
## Loading required package: stats4
## Loading required package: mvtnorm
## Loading required package: label.switching
## Warning: package 'label.switching' was built under R version 3.3.2
# Simulated data
ns <- 250 # number of simulations
set.seed(123)
data <- rnorm(ns,0,1) # data
# Support function
Support <- function(VAR){ (VAR[2]>0)}
# minus log posterior
lp <- function(par){
ll <- sum(dnorm(data,par[1],par[2],log=T))
lp0 <- -log(par[2])
return(-ll - lp0)
}
X0 <- function(x) c(0,0.5) + runif(2,-0.1,0.1)
# Sampling from the posterior using the twalk
NS = 105000 # number of posterior simulations
out <- Runtwalk(Tr=NS,dim=2,Obj=lp,Supp=Support,x0=X0(),xp0=X0(),PlotLogPost = FALSE)
## This is the twalk for R.
## Evaluating the objective density at initial values.
## Opening 6 X 105001 matrix to save output. Sampling (no graphics mode).
# Thinning and burn-in
ind <- seq(5000,NS,25)
# Final posterior sample
post.samp<-out$output[ind,]
# Some histograms, traceplots, and posterior medians
hist(post.samp[,1])
hist(post.samp[,2])
apply(post.samp,2,mean)
## [1] -0.008334019 0.945697643
# MLE
c(mean(data),sqrt(var(data)))
## [1] -0.008560464 0.942100153
plot(post.samp[,1],type="l")
plot(post.samp[,2],type="l")
# Constructing the envelopes
#---------------------------------
# Simulation from the predictive
#---------------------------------
# Functions to obtain N samples from the predictive distribution
sim.pred = function(N){
index <- sample(dim(post.samp)[1],N,replace = T)
samp <- post.samp[index,]
y.pred <- matrix(0, nrow = N, ncol = length(data))
for(i in 1:N){
y.pred[i,] <- rnorm(1,samp[i,1],samp[i,2])
}
return(y.pred)
}
N = 5000
simpred1 = sim.pred(N)
#-------------------------------------------
# Predictive QQ plot envelope
#-------------------------------------------
fitsim <- matrix(rep(0,ns*N),ncol=ns,nrow=N)
for(i in 1:N){
fitsim[i,] <- sort(sample(simpred1,ns,replace=F))
}
# Predictive envelops
smin = smax = smean = vector()
for(i in 1:ns) smin[i] = quantile(fitsim[,i],0.025)
for(i in 1:ns) smax[i] = quantile(fitsim[,i],0.975)
for(i in 1:ns) smean[i] = mean(fitsim[,i])
# Plot agains the line y=x
plot(sort(data),smin,xlim=c(min(sort(data)),max(sort(data))),ylim=c(-5,5),pch=3,cex=0.5,type="l",cex.axis=1.5,xlab="",ylab="",main="QQ plot envelope")
points(sort(data),smax,xlim=c(0,1.2),ylim=c(0,2),pch=3,cex=0.5,type="l")
polygon(c(sort(data),rev(sort(data))),c(smin,rev(smax)),col=colors()[355])
points(sort(data),smean,xlim=c(0,1.2),ylim=c(0,2),pch=3,cex=0.5,type="l")
abline(a=0,b=1,lwd=2)
box()
#########################################################################################################################
# Example 2: Bad fit
#########################################################################################################################
rm(list=ls())
# required packages
library(Rtwalk)
library(twopiece)
# Simulated data
ns <- 250 # number of simulations
set.seed(123)
data <- rt(ns,df=2) # data
# Support function
Support <- function(VAR){ (VAR[2]>0)}
# minus log posterior
lp <- function(par){
ll <- sum(dnorm(data,par[1],par[2],log=T))
lp0 <- -log(par[2])
return(-ll - lp0)
}
X0 <- function(x) c(0,0.5) + runif(2,-0.1,0.1)
# Sampling from the posterior using the twalk
NS = 105000 # number of posterior simulations
out <- Runtwalk(Tr=NS,dim=2,Obj=lp,Supp=Support,x0=X0(),xp0=X0(),PlotLogPost = FALSE)
## This is the twalk for R.
## Evaluating the objective density at initial values.
## Opening 6 X 105001 matrix to save output. Sampling (no graphics mode).
# Thinning and burn-in
ind <- seq(5000,NS,25)
# Final posterior sample
post.samp<-out$output[ind,]
# Some histograms, traceplots, and posterior medians
hist(post.samp[,1])
hist(post.samp[,2])
apply(post.samp,2,mean)
## [1] -0.1205728 2.8017692
# MLE
c(mean(data),sqrt(var(data)))
## [1] -0.1224136 2.7954842
plot(post.samp[,1],type="l")
plot(post.samp[,2],type="l")
# Constructing the envelopes
#---------------------------------
# Simulation from the predictive
#---------------------------------
# Functions to obtain N samples from the predictive distribution
sim.pred = function(N){
index <- sample(dim(post.samp)[1],N,replace = T)
samp <- post.samp[index,]
y.pred <- matrix(0, nrow = N, ncol = length(data))
for(i in 1:N){
y.pred[i,] <- rnorm(1,samp[i,1],samp[i,2])
}
return(y.pred)
}
N = 5000
simpred1 = sim.pred(N)
#-------------------------------------------
# Predictive QQ plot envelope
#-------------------------------------------
fitsim <- matrix(rep(0,ns*N),ncol=ns,nrow=N)
for(i in 1:N){
fitsim[i,] <- sort(sample(simpred1,ns,replace=F))
}
# Predictive envelops
smin = smax = smean = vector()
for(i in 1:ns) smin[i] = quantile(fitsim[,i],0.025)
for(i in 1:ns) smax[i] = quantile(fitsim[,i],0.975)
for(i in 1:ns) smean[i] = mean(fitsim[,i])
# Plot agains the line y=x
plot(sort(data),smin,xlim=c(min(sort(data)),max(sort(data))),ylim=c(-15,15),pch=3,cex=0.5,type="l",cex.axis=1.5,xlab="",ylab="",main="QQ plot envelope")
points(sort(data),smax,xlim=c(0,1.2),ylim=c(0,2),pch=3,cex=0.5,type="l")
polygon(c(sort(data),rev(sort(data))),c(smin,rev(smax)),col=colors()[355])
points(sort(data),smean,xlim=c(0,1.2),ylim=c(0,2),pch=3,cex=0.5,type="l")
abline(a=0,b=1,lwd=2)
box()
#########################################################################################################################
# Example 3: bad fit
#########################################################################################################################
rm(list=ls())
# required packages
library(Rtwalk)
library(twopiece)
# Simulated data
ns <- 250 # number of simulations
set.seed(123)
data <- rtp3(ns,0,0.5,-0.75,FUN=rnorm,param="eps") # residual errors
# Support function
Support <- function(VAR){ (VAR[2]>0)}
# minus log posterior
lp <- function(par){
ll <- sum(dnorm(data,par[1],par[2],log=T))
lp0 <- -log(par[2])
return(-ll - lp0)
}
X0 <- function(x) c(0,0.5) + runif(2,-0.1,0.1)
# Sampling from the posterior using the twalk
NS = 105000 # number of posterior simulations
out <- Runtwalk(Tr=NS,dim=2,Obj=lp,Supp=Support,x0=X0(),xp0=X0(),PlotLogPost = FALSE)
## This is the twalk for R.
## Evaluating the objective density at initial values.
## Opening 6 X 105001 matrix to save output. Sampling (no graphics mode).
# Thinning and burn-in
ind <- seq(5000,NS,25)
# Final posterior sample
post.samp<-out$output[ind,]
# Some histograms, traceplots, and posterior medians
hist(post.samp[,1])
hist(post.samp[,2])
apply(post.samp,2,mean)
## [1] 0.6135358 0.5385218
# MLE
c(mean(data),sqrt(var(data)))
## [1] 0.6131926 0.5378883
plot(post.samp[,1],type="l")
plot(post.samp[,2],type="l")
# Constructing the envelopes
#---------------------------------
# Simulation from the predictive
#---------------------------------
# Functions to obtain N samples from the predictive distribution
sim.pred = function(N){
index <- sample(dim(post.samp)[1],N,replace = T)
samp <- post.samp[index,]
y.pred <- matrix(0, nrow = N, ncol = length(data))
for(i in 1:N){
y.pred[i,] <- rnorm(1,samp[i,1],samp[i,2])
}
return(y.pred)
}
N = 5000
simpred1 = sim.pred(N)
#-------------------------------------------
# Predictive QQ plot envelope
#-------------------------------------------
fitsim <- matrix(rep(0,ns*N),ncol=ns,nrow=N)
for(i in 1:N){
fitsim[i,] <- sort(sample(simpred1,ns,replace=F))
}
# Predictive envelops
smin = smax = smean = vector()
for(i in 1:ns) smin[i] = quantile(fitsim[,i],0.025)
for(i in 1:ns) smax[i] = quantile(fitsim[,i],0.975)
for(i in 1:ns) smean[i] = mean(fitsim[,i])
# Plot agains the line y=x
plot(sort(data),smin,xlim=c(min(sort(data)),max(sort(data))),ylim=c(-3,3),pch=3,cex=0.5,type="l",cex.axis=1.5,xlab="",ylab="",main="QQ plot envelope")
points(sort(data),smax,xlim=c(0,1.2),ylim=c(0,2),pch=3,cex=0.5,type="l")
polygon(c(sort(data),rev(sort(data))),c(smin,rev(smax)),col=colors()[355])
points(sort(data),smean,xlim=c(0,1.2),ylim=c(0,2),pch=3,cex=0.5,type="l")
abline(a=0,b=1,lwd=2)
box()
#########################################################################################################################
# Example 4: bad fit
#########################################################################################################################
rm(list=ls())
# required packages
library(Rtwalk)
library(twopiece)
# Simulated data
ns <- 250 # number of simulations
set.seed(123)
data <- c(rnorm(50,-1,5),rnorm(200,3,1)) # residual errors
# Support function
Support <- function(VAR){ (VAR[2]>0)}
# minus log posterior
lp <- function(par){
ll <- sum(dnorm(data,par[1],par[2],log=T))
lp0 <- -log(par[2])
return(-ll - lp0)
}
X0 <- function(x) c(0,0.5) + runif(2,-0.1,0.1)
# Sampling from the posterior using the twalk
NS = 105000 # number of posterior simulations
out <- Runtwalk(Tr=NS,dim=2,Obj=lp,Supp=Support,x0=X0(),xp0=X0(),PlotLogPost = FALSE)
## This is the twalk for R.
## Evaluating the objective density at initial values.
## Opening 6 X 105001 matrix to save output. Sampling (no graphics mode).
# Thinning and burn-in
ind <- seq(5000,NS,25)
# Final posterior sample
post.samp<-out$output[ind,]
# Some histograms, traceplots, and posterior medians
hist(post.samp[,1])
hist(post.samp[,2])
apply(post.samp,2,mean)
## [1] 2.207620 2.704341
# MLE
c(mean(data),sqrt(var(data)))
## [1] 2.218962 2.695553
plot(post.samp[,1],type="l")
plot(post.samp[,2],type="l")
# Constructing the envelopes
#---------------------------------
# Simulation from the predictive
#---------------------------------
# Functions to obtain N samples from the predictive distribution
sim.pred = function(N){
index <- sample(dim(post.samp)[1],N,replace = T)
samp <- post.samp[index,]
y.pred <- matrix(0, nrow = N, ncol = length(data))
for(i in 1:N){
y.pred[i,] <- rnorm(1,samp[i,1],samp[i,2])
}
return(y.pred)
}
N = 5000
simpred1 = sim.pred(N)
#-------------------------------------------
# Predictive QQ plot envelope
#-------------------------------------------
fitsim <- matrix(rep(0,ns*N),ncol=ns,nrow=N)
for(i in 1:N){
fitsim[i,] <- sort(sample(simpred1,ns,replace=F))
}
# Predictive envelops
smin = smax = smean = vector()
for(i in 1:ns) smin[i] = quantile(fitsim[,i],0.025)
for(i in 1:ns) smax[i] = quantile(fitsim[,i],0.975)
for(i in 1:ns) smean[i] = mean(fitsim[,i])
# Plot agains the line y=x
plot(sort(data),smin,xlim=c(min(sort(data)),max(sort(data))),ylim=c(-15,15),pch=3,cex=0.5,type="l",cex.axis=1.5,xlab="",ylab="",main="QQ plot envelope")
points(sort(data),smax,xlim=c(0,1.2),ylim=c(0,2),pch=3,cex=0.5,type="l")
polygon(c(sort(data),rev(sort(data))),c(smin,rev(smax)),col=colors()[355])
points(sort(data),smean,xlim=c(0,1.2),ylim=c(0,2),pch=3,cex=0.5,type="l")
abline(a=0,b=1,lwd=2)
box()