Posterior QQ envelopes: normality test

Posterior QQ envelopes for normality test

QQ plots are a useful tool to compare two probability distributions. The idea is to compare the empirical distribution of the data against the quantiles of the fitted normal distribution.

Let \({\bf y}=(y_1,\dots,y_n)\) be a random sample, and suppose that we are interested in visually assessing the normality assumption on this sample. In a Bayesian context, one can think of simulating \(m\) samples of size \(n\) from the predictive distribution of \(\pi(y_{n+1}\vert{\bf y})\), and comparing them against the true sample, which can be done using a sample from the posterior distribution of \((\mu,\sigma)\). This strategy defines an envelope of QQ plots. Thus, it is possible to construct a \((1-\alpha)\%\) QQ envelope by taking, at each point of the sample, the \((1-\alpha)\%\) probability interval of the samples. This plot is a useful tool to visually check the predictive ability of the model. Examples 1-3 in the following R code illustrates the construction of QQ envelopes for assesing the normality assumption in 3 cases:

The data are truly normal.
The data are heavy tailed.
The data are asymmetric.
The data are simulated from a mixture of normals.

The prior used for the fitted normal model is the non-informative prior \(\pi(\mu,\sigma)\propto \dfrac{1}{\sigma}\). The strategy to construct the QQ envelopes is explained in [1] (see the References), and summarised below:

Obtain \({\bf y}^{(1)}_{n+1},\dots,{\bf y}^{(m)}_{n+1}\) samples of size \(n\) (same size as the original data) from the posterior predictive distribution \(\pi(y_{n+1}\vert{\bf y})\).
Construct \(m\) QQ-plots for each simulated sample against the original data. Using these \(m\) QQ-plots, we can generate an envelope, by taking the \(\alpha\%\) and \((1-\alpha)\%\) quantiles of the QQ-plots at each quantile point, which is shown in the shaded area.
The envelope is compared against a straight line with intercept 0 and slope 1, which represents a perfect fit. The plots below also show the mean of the QQ plots, which represent the predictive distribution.

References

Bayesian linear regression with skew-symmetric error distributions with applications to survival analysis

#########################################################################################################################
# Example 1: Good fit
#########################################################################################################################
rm(list=ls())
# required packages
library(Rtwalk)

## Package: Rtwalk, Version: 1.8.0

## The R Implementation of 't-walk' MCMC Sampler

## http://www.cimat.mx/~jac/twalk/

## For citations, please use:

## Christen JA and Fox C (2010). A general purpose sampling algorithm for

## continuous distributions (the t-walk). Bayesian Analysis, 5(2),

## pp. 263-282. <URL:

## http://ba.stat.cmu.edu/journal/2010/vol05/issue02/christen.pdf>.

library(twopiece)

## Loading required package: flexclust

## Warning: package 'flexclust' was built under R version 3.3.2

## Loading required package: grid

## Loading required package: lattice

## Loading required package: modeltools

## Warning: package 'modeltools' was built under R version 3.3.2

## Loading required package: stats4

## Loading required package: mvtnorm

## Loading required package: label.switching

## Warning: package 'label.switching' was built under R version 3.3.2

# Simulated data
ns <- 250 # number of simulations
set.seed(123)
data <- rnorm(ns,0,1) # data


# Support function
Support <- function(VAR){ (VAR[2]>0)}

# minus log posterior
lp <- function(par){
  ll <- sum(dnorm(data,par[1],par[2],log=T))
  lp0 <- -log(par[2])
  return(-ll - lp0)
}

X0 <- function(x) c(0,0.5) + runif(2,-0.1,0.1)

# Sampling from the posterior using the twalk
NS = 105000 # number of posterior simulations
out <- Runtwalk(Tr=NS,dim=2,Obj=lp,Supp=Support,x0=X0(),xp0=X0(),PlotLogPost = FALSE)

## This is the twalk for R.
## Evaluating the objective density at initial values.
## Opening 6 X 105001 matrix to save output.  Sampling (no graphics mode).

# Thinning and burn-in
ind <- seq(5000,NS,25)

# Final posterior sample
post.samp<-out$output[ind,]

# Some histograms, traceplots, and posterior medians
hist(post.samp[,1])

hist(post.samp[,2])

apply(post.samp,2,mean)

## [1] -0.008334019  0.945697643

# MLE
c(mean(data),sqrt(var(data)))

## [1] -0.008560464  0.942100153

plot(post.samp[,1],type="l")

plot(post.samp[,2],type="l")

# Constructing the envelopes
#---------------------------------
# Simulation from the predictive
#---------------------------------

# Functions to obtain N samples from the predictive distribution
sim.pred = function(N){
  index <- sample(dim(post.samp)[1],N,replace = T) 
  samp <- post.samp[index,]
  y.pred <- matrix(0, nrow = N, ncol = length(data))
  for(i in 1:N){
    y.pred[i,] <-   rnorm(1,samp[i,1],samp[i,2])
  }
  return(y.pred)
}

N = 5000 
simpred1 = sim.pred(N)

#-------------------------------------------
# Predictive QQ plot envelope
#-------------------------------------------

fitsim <- matrix(rep(0,ns*N),ncol=ns,nrow=N)

for(i in 1:N){
  fitsim[i,] <- sort(sample(simpred1,ns,replace=F))
}
# Predictive envelops
smin = smax = smean = vector()
for(i in 1:ns) smin[i] = quantile(fitsim[,i],0.025)
for(i in 1:ns) smax[i] = quantile(fitsim[,i],0.975)
for(i in 1:ns) smean[i] = mean(fitsim[,i])

# Plot agains the line y=x
plot(sort(data),smin,xlim=c(min(sort(data)),max(sort(data))),ylim=c(-5,5),pch=3,cex=0.5,type="l",cex.axis=1.5,xlab="",ylab="",main="QQ plot envelope")
points(sort(data),smax,xlim=c(0,1.2),ylim=c(0,2),pch=3,cex=0.5,type="l")
polygon(c(sort(data),rev(sort(data))),c(smin,rev(smax)),col=colors()[355])
points(sort(data),smean,xlim=c(0,1.2),ylim=c(0,2),pch=3,cex=0.5,type="l")
abline(a=0,b=1,lwd=2)
box()

#########################################################################################################################
# Example 2: Bad fit
#########################################################################################################################
rm(list=ls())
# required packages
library(Rtwalk)
library(twopiece)

# Simulated data
ns <- 250 # number of simulations
set.seed(123)
data <- rt(ns,df=2) # data


# Support function
Support <- function(VAR){ (VAR[2]>0)}

# minus log posterior
lp <- function(par){
  ll <- sum(dnorm(data,par[1],par[2],log=T))
  lp0 <- -log(par[2])
  return(-ll - lp0)
}

X0 <- function(x) c(0,0.5) + runif(2,-0.1,0.1)

# Sampling from the posterior using the twalk
NS = 105000 # number of posterior simulations
out <- Runtwalk(Tr=NS,dim=2,Obj=lp,Supp=Support,x0=X0(),xp0=X0(),PlotLogPost = FALSE)

## This is the twalk for R.
## Evaluating the objective density at initial values.
## Opening 6 X 105001 matrix to save output.  Sampling (no graphics mode).

# Thinning and burn-in
ind <- seq(5000,NS,25)

# Final posterior sample
post.samp<-out$output[ind,]

# Some histograms, traceplots, and posterior medians
hist(post.samp[,1])

hist(post.samp[,2])

apply(post.samp,2,mean)

## [1] -0.1205728  2.8017692

# MLE
c(mean(data),sqrt(var(data)))

## [1] -0.1224136  2.7954842

plot(post.samp[,1],type="l")

plot(post.samp[,2],type="l")

# Constructing the envelopes
#---------------------------------
# Simulation from the predictive
#---------------------------------

# Functions to obtain N samples from the predictive distribution
sim.pred = function(N){
  index <- sample(dim(post.samp)[1],N,replace = T) 
  samp <- post.samp[index,]
  y.pred <- matrix(0, nrow = N, ncol = length(data))
  for(i in 1:N){
    y.pred[i,] <-   rnorm(1,samp[i,1],samp[i,2])
  }
  return(y.pred)
}

N = 5000 
simpred1 = sim.pred(N)

#-------------------------------------------
# Predictive QQ plot envelope
#-------------------------------------------

fitsim <- matrix(rep(0,ns*N),ncol=ns,nrow=N)

for(i in 1:N){
  fitsim[i,] <- sort(sample(simpred1,ns,replace=F))
}
# Predictive envelops
smin = smax = smean = vector()
for(i in 1:ns) smin[i] = quantile(fitsim[,i],0.025)
for(i in 1:ns) smax[i] = quantile(fitsim[,i],0.975)
for(i in 1:ns) smean[i] = mean(fitsim[,i])

# Plot agains the line y=x
plot(sort(data),smin,xlim=c(min(sort(data)),max(sort(data))),ylim=c(-15,15),pch=3,cex=0.5,type="l",cex.axis=1.5,xlab="",ylab="",main="QQ plot envelope")
points(sort(data),smax,xlim=c(0,1.2),ylim=c(0,2),pch=3,cex=0.5,type="l")
polygon(c(sort(data),rev(sort(data))),c(smin,rev(smax)),col=colors()[355])
points(sort(data),smean,xlim=c(0,1.2),ylim=c(0,2),pch=3,cex=0.5,type="l")
abline(a=0,b=1,lwd=2)
box()

#########################################################################################################################
# Example 3: bad fit
#########################################################################################################################
rm(list=ls())
# required packages
library(Rtwalk)
library(twopiece)

# Simulated data
ns <- 250 # number of simulations
set.seed(123)
data <- rtp3(ns,0,0.5,-0.75,FUN=rnorm,param="eps") # residual errors


# Support function
Support <- function(VAR){ (VAR[2]>0)}

# minus log posterior
lp <- function(par){
  ll <- sum(dnorm(data,par[1],par[2],log=T))
  lp0 <- -log(par[2])
  return(-ll - lp0)
}

X0 <- function(x) c(0,0.5) + runif(2,-0.1,0.1)

# Sampling from the posterior using the twalk
NS = 105000 # number of posterior simulations
out <- Runtwalk(Tr=NS,dim=2,Obj=lp,Supp=Support,x0=X0(),xp0=X0(),PlotLogPost = FALSE)

## This is the twalk for R.
## Evaluating the objective density at initial values.
## Opening 6 X 105001 matrix to save output.  Sampling (no graphics mode).

# Thinning and burn-in
ind <- seq(5000,NS,25)

# Final posterior sample
post.samp<-out$output[ind,]

# Some histograms, traceplots, and posterior medians
hist(post.samp[,1])

hist(post.samp[,2])

apply(post.samp,2,mean)

## [1] 0.6135358 0.5385218

# MLE
c(mean(data),sqrt(var(data)))

## [1] 0.6131926 0.5378883

plot(post.samp[,1],type="l")

plot(post.samp[,2],type="l")

# Constructing the envelopes
#---------------------------------
# Simulation from the predictive
#---------------------------------

# Functions to obtain N samples from the predictive distribution
sim.pred = function(N){
  index <- sample(dim(post.samp)[1],N,replace = T) 
  samp <- post.samp[index,]
  y.pred <- matrix(0, nrow = N, ncol = length(data))
  for(i in 1:N){
    y.pred[i,] <-   rnorm(1,samp[i,1],samp[i,2])
  }
  return(y.pred)
}

N = 5000 
simpred1 = sim.pred(N)

#-------------------------------------------
# Predictive QQ plot envelope
#-------------------------------------------

fitsim <- matrix(rep(0,ns*N),ncol=ns,nrow=N)

for(i in 1:N){
  fitsim[i,] <- sort(sample(simpred1,ns,replace=F))
}
# Predictive envelops
smin = smax = smean = vector()
for(i in 1:ns) smin[i] = quantile(fitsim[,i],0.025)
for(i in 1:ns) smax[i] = quantile(fitsim[,i],0.975)
for(i in 1:ns) smean[i] = mean(fitsim[,i])

# Plot agains the line y=x
plot(sort(data),smin,xlim=c(min(sort(data)),max(sort(data))),ylim=c(-3,3),pch=3,cex=0.5,type="l",cex.axis=1.5,xlab="",ylab="",main="QQ plot envelope")
points(sort(data),smax,xlim=c(0,1.2),ylim=c(0,2),pch=3,cex=0.5,type="l")
polygon(c(sort(data),rev(sort(data))),c(smin,rev(smax)),col=colors()[355])
points(sort(data),smean,xlim=c(0,1.2),ylim=c(0,2),pch=3,cex=0.5,type="l")
abline(a=0,b=1,lwd=2)
box()

#########################################################################################################################
# Example 4: bad fit
#########################################################################################################################
rm(list=ls())
# required packages
library(Rtwalk)
library(twopiece)

# Simulated data
ns <- 250 # number of simulations
set.seed(123)
data <- c(rnorm(50,-1,5),rnorm(200,3,1)) # residual errors

# Support function
Support <- function(VAR){ (VAR[2]>0)}

# minus log posterior
lp <- function(par){
  ll <- sum(dnorm(data,par[1],par[2],log=T))
  lp0 <- -log(par[2])
  return(-ll - lp0)
}

X0 <- function(x) c(0,0.5) + runif(2,-0.1,0.1)

# Sampling from the posterior using the twalk
NS = 105000 # number of posterior simulations
out <- Runtwalk(Tr=NS,dim=2,Obj=lp,Supp=Support,x0=X0(),xp0=X0(),PlotLogPost = FALSE)

## This is the twalk for R.
## Evaluating the objective density at initial values.
## Opening 6 X 105001 matrix to save output.  Sampling (no graphics mode).

# Thinning and burn-in
ind <- seq(5000,NS,25)

# Final posterior sample
post.samp<-out$output[ind,]

# Some histograms, traceplots, and posterior medians
hist(post.samp[,1])

hist(post.samp[,2])

apply(post.samp,2,mean)

## [1] 2.207620 2.704341

# MLE
c(mean(data),sqrt(var(data)))

## [1] 2.218962 2.695553

plot(post.samp[,1],type="l")

plot(post.samp[,2],type="l")

# Constructing the envelopes
#---------------------------------
# Simulation from the predictive
#---------------------------------

# Functions to obtain N samples from the predictive distribution
sim.pred = function(N){
  index <- sample(dim(post.samp)[1],N,replace = T) 
  samp <- post.samp[index,]
  y.pred <- matrix(0, nrow = N, ncol = length(data))
  for(i in 1:N){
    y.pred[i,] <-   rnorm(1,samp[i,1],samp[i,2])
  }
  return(y.pred)
}

N = 5000 
simpred1 = sim.pred(N)

#-------------------------------------------
# Predictive QQ plot envelope
#-------------------------------------------

fitsim <- matrix(rep(0,ns*N),ncol=ns,nrow=N)

for(i in 1:N){
  fitsim[i,] <- sort(sample(simpred1,ns,replace=F))
}
# Predictive envelops
smin = smax = smean = vector()
for(i in 1:ns) smin[i] = quantile(fitsim[,i],0.025)
for(i in 1:ns) smax[i] = quantile(fitsim[,i],0.975)
for(i in 1:ns) smean[i] = mean(fitsim[,i])

# Plot agains the line y=x
plot(sort(data),smin,xlim=c(min(sort(data)),max(sort(data))),ylim=c(-15,15),pch=3,cex=0.5,type="l",cex.axis=1.5,xlab="",ylab="",main="QQ plot envelope")
points(sort(data),smax,xlim=c(0,1.2),ylim=c(0,2),pch=3,cex=0.5,type="l")
polygon(c(sort(data),rev(sort(data))),c(smin,rev(smax)),col=colors()[355])
points(sort(data),smean,xlim=c(0,1.2),ylim=c(0,2),pch=3,cex=0.5,type="l")
abline(a=0,b=1,lwd=2)
box()

Posterior QQ envelopes: normality test

F Javier Rubio and Jairo Fuquene

26/06/2017

Posterior QQ envelopes for normality test