1 Two group analysis: Fitting normal model for 2 groups with no predictors

1.1 IQ data

This example is based on (Kruschke 2015), Chapter 16.

One example of two-groups problem is testing effect of a drug when one group receives a placebo and another receives the drug. The response variable is result of IQ test.

In this example we estimate mean and standard deviation for two groups using robust \(t\)-distribution with common normality parameter \(\lambda\).

The diagram of the model structure shows how the model needs to be coded. Prepare the data.

myDataFrame <- read.csv("TwoGroupIQ.csv")
y <- as.numeric(myDataFrame[,"Score"])
x <- as.numeric(as.factor(myDataFrame[,"Group"]))
(xLevels <- levels(as.factor(myDataFrame[,"Group"])))
## [1] "Placebo"    "Smart Drug"
Ntotal = length(y)
# Specify the data in a list, for later shipment to JAGS:
dataList = list(
    y = y,
    x = x,
    Ntotal = Ntotal,
    meanY = mean(y),
    sdY = sd(y)
)
dataList
## $y
##   [1] 102 107  92 101 110  68 119 106  99 103  90  93  79  89 137 119 126 110
##  [19]  71 114 100  95  91  99  97 106 106 129 115 124 137  73  69  95 102 116
##  [37] 111 134 102 110 139 112 122  84 129 112 127 106 113 109 208 114 107  50
##  [55] 169 133  50  97 139  72 100 144 112 109  98 106 101 100 111 117 104 106
##  [73]  89  84  88  94  78 108 102  95  99  90 116  97 107 102  91  94  95  86
##  [91] 108 115 108  88 102 102 120 112 100 105 105  88  82 111  96  92 109  91
## [109]  92 123  61  59 105 184  82 138  99  93  93  72
## 
## $x
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1
## 
## $Ntotal
## [1] 120
## 
## $meanY
## [1] 104.1333
## 
## $sdY
## [1] 22.43532

The utilities file from (Kruschke 2015).

#source("../DBDA2Eprograms/DBDA2E-utilities.R")
library(rstan)
## Loading required package: StanHeaders
## Loading required package: ggplot2
## rstan (Version 2.21.2, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## Do not specify '-march=native' in 'LOCAL_CPPFLAGS' or a Makevars file
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

1.2 Normal assumption

Assuming that both groups samples have normal distributions, estimate the parameters and compare them.

Write description of the model for stan.

\[ y_{i\mid j} \sim N(\mu_j, \sigma^2_j) \\ \mu_j \sim N(\bar{Y}, \frac{1}{100*SD_Y^2}) \\ \sigma_j \sim uniform(\frac{SD_Y}{1000}, SD_Y*1000) \]

# Use the description for Stan from file "ch16_2.stan"
modelString = "
data {
    int<lower=1> Ntotal;
    int x[Ntotal];
    real y[Ntotal];
    real meanY;
    real sdY;
}
transformed data {
    real unifLo;
    real unifHi;
    real normalSigma;
    unifLo = sdY/1000;
    unifHi = sdY*1000;
    normalSigma = sdY*100;
}
parameters {
    real mu[2];
    real<lower=0> sigma[2];
}
model {
    sigma ~ uniform(unifLo, unifHi);
    mu ~ normal(meanY, normalSigma);
    for ( i in 1:Ntotal ) {
        y[i] ~ normal(mu[x[i]] , sigma[x[i]]);
    }
}
"

If running the description in modelString for the first time create stanDSONormal, otherwise reuse it.

stanDsoNormal <- stan_model(model_code=modelString)

If saved DSO is used load it, then run the chains.

save(stanDsoNormal, file="DSONormal1.Rds")
load("DSONormal1.Rds")

Run MCMC.

parameters = c("mu","sigma")     # The parameters to be monitored
adaptSteps = 500               # Number of steps to "tune" the samplers
burnInSteps = 1000
nChains = 4
thinSteps = 1
numSavedSteps = 5000
stanFitNormal <- sampling(stanDsoNormal,
                   data = dataList,
                   pars = parameters, # optional
                   chains = nChains,
                   iter = (ceiling(numSavedSteps/nChains)*thinSteps+burnInSteps),
                   warmup = burnInSteps,
                   init = "random", # optional
                   thin = thinSteps
                   )

Save the fit object for further analysis.

save(stanFitNormal,file="StanNormalFit2Groups.Rdata")
load("StanNormalFit2Groups.Rdata")

Explore the results.

# text statistics:
print (stanFitNormal)
## Inference for Stan model: a4cb1c74477f9a8b1e476a89a6556245.
## 4 chains, each with iter=2250; warmup=1000; thin=1; 
## post-warmup draws per chain=1250, total post-warmup draws=5000.
## 
##             mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff
## mu[1]     100.01    0.03 2.44   95.18   98.40   99.98  101.61  104.78  5045
## mu[2]     107.85    0.04 3.24  101.56  105.73  107.83  110.02  114.29  5544
## sigma[1]   18.33    0.03 1.83   15.22   17.05   18.19   19.43   22.25  4636
## sigma[2]   25.98    0.03 2.43   21.81   24.32   25.79   27.46   31.28  5124
## lp__     -423.24    0.03 1.46 -426.93 -423.92 -422.91 -422.20 -421.43  2835
##          Rhat
## mu[1]       1
## mu[2]       1
## sigma[1]    1
## sigma[2]    1
## lp__        1
## 
## Samples were drawn using NUTS(diag_e) at Sun Mar 14 01:33:42 2021.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
# estimates & hdi:
plot(stanFitNormal)
## ci_level: 0.8 (80% intervals)
## outer_level: 0.95 (95% intervals)

# samples
rstan::traceplot(stanFitNormal, ncol=1, inc_warmup=F)

pairs(stanFitNormal, pars=c('mu','sigma'))

stan_scat(stanFitNormal, c('mu[1]','mu[2]'))

stan_scat(stanFitNormal, c('mu[1]','sigma[1]'))

stan_scat(stanFitNormal, c('mu[1]','sigma[2]'))

stan_scat(stanFitNormal, c('mu[2]','sigma[1]'))

stan_scat(stanFitNormal, c('mu[2]','sigma[2]'))

stan_scat(stanFitNormal, c('sigma[1]','sigma[2]'))

stan_hist(stanFitNormal)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

stan_dens(stanFitNormal)

# autocorrelation:
stan_ac(stanFitNormal, separate_chains = T)

stan_diag(stanFitNormal,information = "sample",chain=0)
## Warning: Removed 2 rows containing missing values (geom_bar).

stan_diag(stanFitNormal,information = "stepsize",chain = 0)

stan_diag(stanFitNormal,information = "treedepth",chain = 0)

stan_diag(stanFitNormal,information = "divergence",chain = 0)

If you prefer output using coda class reformat the chains into coda:

library(coda)
## 
## Attaching package: 'coda'
## The following object is masked from 'package:rstan':
## 
##     traceplot
stan2coda <- function(stanFitNormal) {
    # apply to all chains
    mcmc.list(lapply(1:ncol(stanFitNormal), function(x) mcmc(as.array(stanFitNormal)[,x,])))
}
codaSamples <- stan2coda(stanFitNormal)
summary(codaSamples)
## 
## Iterations = 1:1250
## Thinning interval = 1 
## Number of chains = 4 
## Sample size per chain = 1250 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##             Mean    SD Naive SE Time-series SE
## mu[1]     100.01 2.435  0.03444        0.03373
## mu[2]     107.85 3.245  0.04589        0.04481
## sigma[1]   18.33 1.827  0.02584        0.02572
## sigma[2]   25.98 2.425  0.03430        0.03312
## lp__     -423.24 1.463  0.02069        0.02771
## 
## 2. Quantiles for each variable:
## 
##             2.5%     25%     50%     75%   97.5%
## mu[1]      95.18   98.40   99.98  101.61  104.78
## mu[2]     101.56  105.73  107.83  110.02  114.29
## sigma[1]   15.22   17.05   18.19   19.43   22.25
## sigma[2]   21.81   24.32   25.79   27.46   31.28
## lp__     -426.93 -423.92 -422.91 -422.20 -421.43
plot(codaSamples)

autocorr.plot(codaSamples)

effectiveSize(codaSamples)
##    mu[1]    mu[2] sigma[1] sigma[2]     lp__ 
## 5333.533 5268.788 5773.737 5374.483 2824.515
gelman.diag(codaSamples)
## Potential scale reduction factors:
## 
##          Point est. Upper C.I.
## mu[1]             1       1.00
## mu[2]             1       1.00
## sigma[1]          1       1.00
## sigma[2]          1       1.01
## lp__              1       1.01
## 
## Multivariate psrf
## 
## 1
gelman.plot(codaSamples)

plot(density(codaSamples[[1]][,1]),xlim=c(10,120),ylim=c(0,.25),main="Posterior Densities")  # mu[1], 1st chain
lines(density(codaSamples[[1]][,2]))                         # mu[2], 1st chain
lines(density(codaSamples[[1]][,3]))                         # sigma[1], 1st chain
lines(density(codaSamples[[1]][,4]))                         # sigma[2], 1st chain
lines(density(codaSamples[[2]][,1]),col="red")               # mu[1], 2nd chain
lines(density(codaSamples[[2]][,2]),col="red")               # mu[2], 2nd chain
lines(density(codaSamples[[2]][,3]),col="red")               # sigma[1], 2nd chain
lines(density(codaSamples[[2]][,4]),col="red")               # sigma[2], 2nd chain

Or you can use shinystan to analyze fitted model

library(shinystan)
## Loading required package: shiny
## 
## This is shinystan version 2.5.0

Launch shiny application with the loaded object.

launch_shinystan(stanFitNormal)
## 
## Launching ShinyStan interface... for large models this  may take some time.
## 
## Listening on http://127.0.0.1:3759

1.3 Robust assumption

Use the robust assumption of Student-t distribution instead of normal.

modelString = "
data {
    int<lower=1> Ntotal;
    int x[Ntotal];
    real y[Ntotal];
    real meanY;
    real sdY;
}
transformed data {
    real unifLo;
    real unifHi;
    real normalSigma;
    real expLambda ;
    unifLo = sdY/1000;
    unifHi = sdY*1000;
    normalSigma = sdY*100;
    expLambda = 1/29.0;
}
parameters {
    real<lower=0> nuMinusOne;
    real mu[2] ; // 2 groups
    real<lower=0> sigma[2] ; // 2 groups
}
transformed parameters {
    real<lower=0> nu ;
    nu = nuMinusOne + 1 ;
}
model {
    sigma  ~ uniform( unifLo , unifHi ) ; // vectorized 2 groups
    mu ~ normal( meanY , normalSigma ) ; // vectorized 2 groups
    nuMinusOne ~ exponential( expLambda ) ;
    for ( i in 1:Ntotal ) {
      y[i] ~ student_t( nu , mu[x[i]] , sigma[x[i]] ) ; // nested index of group
    }
}
"

If running the description in modelString for the first time create stanDSO, otherwise reuse it.

stanDsoRobust <- stan_model(model_code=modelString) 

If saved DSO is used load it, then run the chains.

save(stanDsoRobust,file="DSORobust1.Rds")
load(file="DSORobust1.Rds")

If necessary, initialize chains. Parameter init can be: one of digit 0, string “0” or “random,” a function that returns a list, or a list of initial parameter values with which to indicate how the initial values of parameters are specified.

  • “0”: initialize all to be zero on the unconstrained support;
  • “random”: randomly generated;
  • list: a list of lists equal in length to the number of chains (parameter chains), where each list in the list of lists * specifies the initial values of parameters by name for the corresponding chain.
  • function: a function that returns a list for specifying the initial values of parameters for a chain. The function can take an optional parameter chain_id.

Since Stan has pretty complicated parameter tuning process during which among other parameters it selects initial values, it may be a good idea to let Stan select default initial parameters until you get enough experience.

Run MCMC.

parameters = c( "mu" , "sigma" , "nu" )     # The parameters to be monitored
adaptSteps = 500               # Number of steps to "tune" the samplers
burnInSteps = 1000
nChains = 4 
thinSteps = 1
numSavedSteps<-5000
# Get MC sample of posterior:
stanFitRobust <- sampling( object=stanDsoRobust , 
                   data = dataList , 
                   pars = parameters , # optional
                   chains = nChains ,
                   cores=nChains,
                   iter = ( ceiling(numSavedSteps/nChains)*thinSteps
                            +burnInSteps ) , 
                   warmup = burnInSteps , 
                   init = "random" , # optional
                   thin = thinSteps )

Save the fit object.

save(stanFitRobust,file="StanRobustFit2Groups.Rdata")
load("StanRobustFit2Groups.Rdata")

Explore the results.

print(stanFitRobust)
## Inference for Stan model: 4b6e29d2aaa99872bfe773f032d14961.
## 4 chains, each with iter=2250; warmup=1000; thin=1; 
## post-warmup draws per chain=1250, total post-warmup draws=5000.
## 
##             mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff
## mu[1]      99.20    0.03 1.86   95.55   97.90   99.20  100.47  102.83  5133
## mu[2]     107.16    0.03 2.61  102.12  105.37  107.18  108.86  112.33  5907
## sigma[1]   11.39    0.03 1.76    8.31   10.16   11.26   12.50   15.20  4184
## sigma[2]   17.98    0.04 2.68   13.16   16.11   17.88   19.68   23.57  4513
## nu          3.90    0.03 1.64    1.95    2.84    3.55    4.50    8.09  3562
## lp__     -451.37    0.04 1.61 -455.36 -452.21 -451.05 -450.18 -449.20  1852
##          Rhat
## mu[1]       1
## mu[2]       1
## sigma[1]    1
## sigma[2]    1
## nu          1
## lp__        1
## 
## Samples were drawn using NUTS(diag_e) at Sun Mar 14 01:36:46 2021.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
plot(stanFitRobust)
## ci_level: 0.8 (80% intervals)
## outer_level: 0.95 (95% intervals)

rstan::traceplot(stanFitRobust, ncol=1, inc_warmup=F)

pairs(stanFitRobust, pars=c('nu','mu','sigma'))

stan_scat(stanFitRobust, c('nu','mu[1]'))

stan_scat(stanFitRobust, c('nu','mu[2]'))

stan_scat(stanFitRobust, c('nu','sigma[1]'))

stan_scat(stanFitRobust, c('nu','sigma[2]'))

stan_scat(stanFitRobust, c('mu[1]','sigma[1]'))

stan_scat(stanFitRobust, c('sigma[1]','sigma[2]'))

Note correlation between the sigma parameters.
Is there a sign of correlation in non-robust approach?
How can you explain positive correlation?

stan_hist(stanFitRobust)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

stan_dens(stanFitRobust)

stan_ac(stanFitRobust, separate_chains = T)

stan_diag(stanFitRobust,information = "sample",chain=0)
## Warning: Removed 2 rows containing missing values (geom_bar).

stan_diag(stanFitRobust,information = "stepsize",chain = 0)

stan_diag(stanFitRobust,information = "treedepth",chain = 0)

stan_diag(stanFitRobust,information = "divergence",chain = 0)

launch_shinystan(stanFitRobust)
## 
## Launching ShinyStan interface... for large models this  may take some time.
## 
## Listening on http://127.0.0.1:3759

2 Comparison of the groups

2.1 FNP approach

For comparison decide whether the two groups are different or not using the FNP approach.

Use t-test with unequal variances:

qqnorm(y[x==1])
qqline(y[x==1])

qqnorm(y[x==2])
qqline(y[x==2])

2.2 Bayesian approach

2.3 FNP approach to Markov chains

References

Kruschke, John K. 2015. Doing Bayesian Data Analysis : A Tutorial with r, JAGS, and Stan. Book. 2E [edition]. Amsterdam: Academic Press is an imprint of Elsevier.