Tractable Bayesian Variable Selection: Beyond normality. Analysis of DLD data

Consider the linear regression model \[ y = X \theta + \epsilon, \] where \(y=(y_1,\ldots,y_n)^T\) is the observed outcome for \(n\) individuals, \(X\) is an \(n \times p\) matrix with potential predictors, \(\theta=(\theta_1,\ldots,\theta_p)^T \in \mathbb{R}^p\) are regression coefficients and \(\epsilon=(\epsilon_1,\ldots,\epsilon_n)^T\) are independent and identically distributed (id) errors. Suppose that the goal is to determine the non-zero coefficients in \(\theta\) under an arbitrary data-generating distribution for the \(\epsilon_i\)’s, building a framework that remains convenient for large \(p\). Let \(\gamma_j=\mbox{I}(\theta_j \neq 0)\) for \(j=1,\ldots,p\) be variable inclusion indicators and \(p_\gamma=\sum_{j=1}^{p} \gamma_j\) the number of active variables. To consider that residuals may be asymmetric and/or have thicker-than-normal tails \(\gamma_{p+1}=1\) denotes the presence of asymmetry (\(\gamma_{p+1}=0\) otherwise) and \(\gamma_{p+2}=1\) that of thick tails (\(\gamma_{p+2}=0\) for Normal tails). Thus \(\gamma=(\gamma_1,\ldots,\gamma_{p+2})\) denotes the assumed model.

Rossell and Rubio [1] proposed the use of two-piece distributions [3,4] to flexibly model the residual errors. A random variable \(y_i \in \mathbb{R}\) following a two-piece distribution [1,4] with location \(x_{\gamma i}^T\theta_\gamma\), scale \(\sqrt{\vartheta} \in \mathbb{R}^+\) and asymmetry \(\alpha\) has density function \(s(y_i;x_{\gamma i}^T\theta_\gamma,\vartheta,\alpha) =\) \[ \dfrac{2}{\sqrt{\vartheta}[a(\alpha)+b(\alpha)]} \left[f\left(\dfrac{y_i-x_{\gamma i}^T\theta_\gamma}{\sqrt{\vartheta} a(\alpha)} \right) I(y_i<x_{\gamma i}^T\theta_\gamma) + f\left(\dfrac{y_i-x_{\gamma i}^T \theta_\gamma}{\sqrt{\vartheta}b(\alpha)} \right) I(y_i \geq x_{\gamma i}^T \theta_\gamma) \right],\nonumber \] where \(f(\cdot)\) is a symmetric unimodal density with mode at \(0\) and support on \(\mathbb R\), and \(a(\alpha),b(\alpha) \in \mathbb{R}^+\). These distributions are implemented in the R package ‘twopiece’ [3]. Rossell and Rubio [1] consider the cases when \(f\) is a Normal distribution or a Laplace distribution since these models lead to interpretable estimates of the regression parameters (linked to asymmetric least squares and quantile regression) and appealing asymptotic properties. In order to conduct Bayesian variable selection, Rossell and Rubio [1] employ non-local priors [5] to enforce sparsity and discard degrees of asymmetry that are irrelevant in practice (see [1] for a detailed formulation of these priors). This methodology is implemented in the R package ‘mombf’ [2].

The following R code presents the analysis of the DLD data presented in an example in [1]. The data set and source files are available at [6].

References

  1. Tractable Bayesian variable selection: beyond normality
  2. ‘mombf’ R package
  3. ‘twopiece’ R package
  4. Inference in two-piece location-scale models with Jeffreys priors
  5. On the use of non-local prior densities in Bayesian hypothesis tests
  6. DLD Data and other examples
library(mombf)     #at CRAN
## Warning: package 'mombf' was built under R version 3.3.3
## Loading required package: mvtnorm
## Warning: package 'mvtnorm' was built under R version 3.3.2
## Loading required package: ncvreg
## Warning: package 'ncvreg' was built under R version 3.3.3
## Loading required package: actuar
## Warning: package 'actuar' was built under R version 3.3.3
## 
## Attaching package: 'actuar'
## The following object is masked from 'package:grDevices':
## 
##     cm
## Loading required package: mgcv
## Warning: package 'mgcv' was built under R version 3.3.3
## Loading required package: nlme
## Warning: package 'nlme' was built under R version 3.3.3
## This is mgcv 1.8-17. For overview type 'help("mgcv-package")'.
library(twopiece)  #at https://r-forge.r-project.org/projects/twopiece
## Loading required package: flexclust
## Warning: package 'flexclust' was built under R version 3.3.2
## Loading required package: grid
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 3.3.3
## Loading required package: modeltools
## Warning: package 'modeltools' was built under R version 3.3.2
## Loading required package: stats4
## Loading required package: label.switching
## Warning: package 'label.switching' was built under R version 3.3.2
library(parallel)  #if not available replace mclapply below for lapply
library(xtable)    #for tables
source('routines_simulation.R')

#############################################################################################################################
## 6. Analysis of DLD data
## GSE71008 DATA (provided as supplement to Rossell and Rubio, Tractable Bayesian variable selection: beyond normality)
#############################################################################################################################

x <- read.table('dld.txt',sep='\t',header=TRUE)
y <- x[,1]; x <- cbind(rep(1,nrow(x)),x[,-1])
priorCoef <- momprior(tau=0.348); priorSkew <- momprior(tau=0.357); priorDelta <- modelbbprior(1,1)
fitn <- modelSelection(y, x=x, family='twopiecenormal', priorCoef=priorCoef, priorSkew=priorSkew, priorDelta=priorDelta, niter=10^4, method='auto')
## Greedy searching posterior mode... Done.
## Running Gibbs sampler........... Done.
fita <- modelSelection(y, x=x, family='auto', priorCoef=priorCoef, priorSkew=priorSkew, priorDelta=priorDelta, niter=10^4, method='Laplace')
## Greedy searching posterior mode... Done.
## Running Gibbs sampler........... Done.
#Top models
head(postProb(fitn))
##                modelid         family         pp
## 1       23,43,46,53,54 twopiecenormal 0.58355850
## 2    19,23,43,46,53,54 twopiecenormal 0.13935526
## 3    23,25,43,46,53,54 twopiecenormal 0.11534804
## 4 19,23,25,43,46,53,54 twopiecenormal 0.01954103
## 5          23,43,46,53 twopiecenormal 0.01571814
## 6       19,23,43,46,53 twopiecenormal 0.01302452
head(postProb(fita))
##                modelid  family         pp
## 1       23,43,46,53,54 laplace 0.36899700
## 2    23,25,43,46,53,54 laplace 0.19752439
## 3    19,23,25,43,46,54 laplace 0.13714687
## 4       19,23,43,46,54 laplace 0.11151517
## 5       23,25,43,46,54 laplace 0.03206751
## 6 19,23,25,43,46,53,54 laplace 0.02687166
fita$margpp[-1:-ncol(x)] #inference on residual distribution
##    family.normal  family.tpnormal   family.laplace family.tplaplace 
##     2.983065e-09     3.492783e-12     9.992666e-01     7.333568e-04
#Residual QQ-plot under top model
sel <- as.numeric(strsplit(as.character(postProb(fitn)[1,'modelid']), split=',')[[1]])
xsel <- as.matrix(x[,sel])
lm1 <- lm(y ~ -1 + xsel)
qqnorm(scale(residuals(lm1)),main=''); abline(0,1)

#Marginal inclusion probabilities under Normal and inferred error distribution
margpp <- data.frame(symbol=colnames(x),normal=round(fitn$margpp,3),auto=round(fita$margpp[1:ncol(x)],3))
margpp <- margpp[order(margpp$auto,decreasing=TRUE),]
head(margpp,n=8)
##            symbol normal  auto
## C6orf226 C6orf226  1.000 1.000
## ECH1         ECH1  1.000 1.000
## CSF2RA     CSF2RA  1.000 1.000
## RRP1B       RRP1B  0.943 0.998
## FBXL19     FBXL19  0.993 0.683
## MTMR1       MTMR1  0.184 0.446
## SLC35B4   SLC35B4  0.212 0.308
## RAB3GAP2 RAB3GAP2  0.007 0.036
#Quantile regression (corresponds to fixed asymmetry parameter alpha)
priorCoef <- momprior(tau=0.348); priorDelta <- modelbbprior(1,1)
fitalpha0 <- modelSelection(y, x=x, family='twopiecelaplace', priorCoef=priorCoef, priorSkew=0, priorDelta=priorDelta, niter=10^4) #median regression
## Greedy searching posterior mode... Done.
## Running Gibbs sampler........... Done.
fitalpha1 <- modelSelection(y, x=x, family='twopiecelaplace', priorCoef=priorCoef, priorSkew=atanh(-0.5), priorDelta=priorDelta, niter=10^4) #first quartile
## Greedy searching posterior mode... Done.
## Running Gibbs sampler........... Done.
fitalpha2 <- modelSelection(y, x=x, family='twopiecelaplace', priorCoef=priorCoef, priorSkew=atanh(0.5), priorDelta=priorDelta, niter=10^4) #third quartile
## Greedy searching posterior mode... Done.
## Running Gibbs sampler........... Done.
tab1 <- postProb(fitalpha0)[1:5,c(1,3)]
tab2 <- postProb(fitalpha1)[1:5,c(1,3)]
tab3 <- postProb(fitalpha2)[1:5,c(1,3)]
tab1[,1] <- sapply(strsplit(as.character(tab1[,1]),split=','), function(z) paste(names(x)[as.numeric(z)],collapse=', '))
tab2[,1] <- sapply(strsplit(as.character(tab2[,1]),split=','), function(z) paste(names(x)[as.numeric(z)],collapse=', '))
tab3[,1] <- sapply(strsplit(as.character(tab3[,1]),split=','), function(z) paste(names(x)[as.numeric(z)],collapse=', '))

xtable(tab1,digits=c(0,0,3)) #alpha= 0
## % latex table generated in R 3.3.1 by xtable 1.8-2 package
## % Thu Aug 10 16:20:17 2017
## \begin{table}[ht]
## \centering
## \begin{tabular}{rlr}
##   \hline
##  & modelid & pp \\ 
##   \hline
## 1 & C6orf226, ECH1, CSF2RA, FBXL19, RRP1B & 0.346 \\ 
##   2 & C6orf226, MTMR1, ECH1, CSF2RA, FBXL19, RRP1B & 0.270 \\ 
##   3 & SLC35B4, C6orf226, MTMR1, ECH1, CSF2RA, RRP1B & 0.129 \\ 
##   4 & SLC35B4, C6orf226, ECH1, CSF2RA, RRP1B & 0.082 \\ 
##   5 & SLC35B4, C6orf226, MTMR1, ECH1, CSF2RA, FBXL19, RRP1B & 0.035 \\ 
##    \hline
## \end{tabular}
## \end{table}
xtable(tab2,digits=c(0,0,3)) #alpha= -0.5
## % latex table generated in R 3.3.1 by xtable 1.8-2 package
## % Thu Aug 10 16:20:17 2017
## \begin{table}[ht]
## \centering
## \begin{tabular}{rlr}
##   \hline
##  & modelid & pp \\ 
##   \hline
## 1 & C6orf226, ECH1, CSF2RA, FBXL19, RRP1B & 0.425 \\ 
##   2 & SLC35B4, C6orf226, ECH1, CSF2RA, RRP1B & 0.386 \\ 
##   3 & C6orf226, MTMR1, ECH1, CSF2RA, FBXL19, RRP1B & 0.054 \\ 
##   4 & SLC35B4, C6orf226, MTMR1, ECH1, CSF2RA, RRP1B & 0.043 \\ 
##   5 & SLC35B4, C6orf226, MTMR1, ECH1, CSF2RA, FBXL19, RRP1B & 0.019 \\ 
##    \hline
## \end{tabular}
## \end{table}
xtable(tab3,digits=c(0,0,3)) #alpha= 0.5
## % latex table generated in R 3.3.1 by xtable 1.8-2 package
## % Thu Aug 10 16:20:17 2017
## \begin{table}[ht]
## \centering
## \begin{tabular}{rlr}
##   \hline
##  & modelid & pp \\ 
##   \hline
## 1 & C6orf226, ECH1, CSF2RA, FBXL19, RRP1B & 0.387 \\ 
##   2 & SLC35B4, C6orf226, ECH1, CSF2RA, RRP1B & 0.349 \\ 
##   3 & SLC35B4, C6orf226, MTMR1, ECH1, CSF2RA, RRP1B & 0.117 \\ 
##   4 & C6orf226, MTMR1, ECH1, CSF2RA, FBXL19, RRP1B & 0.049 \\ 
##   5 & C6orf226, MTMR1, RAB3GAP2, ECH1, CSF2RA, RRP1B & 0.023 \\ 
##    \hline
## \end{tabular}
## \end{table}
#Marginal inclusion probabilities
# - Genes C6orf226, ECH1, CSF2RA, FBXL19, RRP1B appear in the top model for all alpha's (all have margpp>0.99)
# - Gene MTMR1 features only for alpha=0
common <- which(names(x) %in% c('C6orf226','ECH1','CSF2RA','FBXL19','RRP1B'))
fitalpha0$margpp[common]
##  C6orf226      ECH1    CSF2RA    FBXL19     RRP1B 
## 1.0000000 0.9999999 1.0000000 0.7283919 0.9989878
fitalpha1$margpp[common]
##  C6orf226      ECH1    CSF2RA    FBXL19     RRP1B 
## 1.0000000 0.9999992 1.0000000 0.5796193 0.9978713
fitalpha2$margpp[common]
##  C6orf226      ECH1    CSF2RA    FBXL19     RRP1B 
## 1.0000000 0.9999997 1.0000000 0.4763648 0.9990625
spec <- which(names(x) %in% c('MTMR1'))
round(rbind(fitalpha1$margpp[spec],fitalpha0$margpp[spec],fitalpha2$margpp[spec]),3)
##      MTMR1
## [1,] 0.148
## [2,] 0.499
## [3,] 0.228