This example is based on (Kruschke 2015), Chapter 16.
One example of two-groups problem is testing effect of a drug when one group receives a placebo and another receives the drug. The response variable is result of IQ test.
In this example we estimate mean and standard deviation for two groups using robust \(t\)-distribution with common normality parameter \(\lambda\).
Prepare the data.
myDataFrame <- read.csv("TwoGroupIQ.csv")
y <- as.numeric(myDataFrame[,"Score"])
x <- as.numeric(as.factor(myDataFrame[,"Group"]))
(xLevels <- levels(as.factor(myDataFrame[,"Group"])))
## [1] "Placebo" "Smart Drug"
Ntotal = length(y)
# Specify the data in a list, for later shipment to JAGS:
dataList = list(
y = y,
x = x,
Ntotal = Ntotal,
meanY = mean(y),
sdY = sd(y)
)
dataList
## $y
## [1] 102 107 92 101 110 68 119 106 99 103 90 93 79 89 137 119 126 110
## [19] 71 114 100 95 91 99 97 106 106 129 115 124 137 73 69 95 102 116
## [37] 111 134 102 110 139 112 122 84 129 112 127 106 113 109 208 114 107 50
## [55] 169 133 50 97 139 72 100 144 112 109 98 106 101 100 111 117 104 106
## [73] 89 84 88 94 78 108 102 95 99 90 116 97 107 102 91 94 95 86
## [91] 108 115 108 88 102 102 120 112 100 105 105 88 82 111 96 92 109 91
## [109] 92 123 61 59 105 184 82 138 99 93 93 72
##
## $x
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1
##
## $Ntotal
## [1] 120
##
## $meanY
## [1] 104.1333
##
## $sdY
## [1] 22.43532
The utilities file from (Kruschke 2015).
#source("../DBDA2Eprograms/DBDA2E-utilities.R")
library(rstan)
## Loading required package: StanHeaders
## Loading required package: ggplot2
## rstan (Version 2.21.2, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## Do not specify '-march=native' in 'LOCAL_CPPFLAGS' or a Makevars file
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
Assuming that both groups samples have normal distributions, estimate the parameters and compare them.
Write description of the model for stan.
\[ y_{i\mid j} \sim N(\mu_j, \sigma^2_j) \\ \mu_j \sim N(\bar{Y}, \frac{1}{100*SD_Y^2}) \\ \sigma_j \sim uniform(\frac{SD_Y}{1000}, SD_Y*1000) \]
# Use the description for Stan from file "ch16_2.stan"
modelString = "
data {
int<lower=1> Ntotal;
int x[Ntotal];
real y[Ntotal];
real meanY;
real sdY;
}
transformed data {
real unifLo;
real unifHi;
real normalSigma;
unifLo = sdY/1000;
unifHi = sdY*1000;
normalSigma = sdY*100;
}
parameters {
real mu[2];
real<lower=0> sigma[2];
}
model {
sigma ~ uniform(unifLo, unifHi);
mu ~ normal(meanY, normalSigma);
for ( i in 1:Ntotal ) {
y[i] ~ normal(mu[x[i]] , sigma[x[i]]);
}
}
"
If running the description in modelString for the first time create stanDSONormal, otherwise reuse it.
stanDsoNormal <- stan_model(model_code=modelString)
If saved DSO is used load it, then run the chains.
save(stanDsoNormal, file="DSONormal1.Rds")
load("DSONormal1.Rds")
Run MCMC.
parameters = c("mu","sigma") # The parameters to be monitored
adaptSteps = 500 # Number of steps to "tune" the samplers
burnInSteps = 1000
nChains = 4
thinSteps = 1
numSavedSteps = 5000
stanFitNormal <- sampling(stanDsoNormal,
data = dataList,
pars = parameters, # optional
chains = nChains,
iter = (ceiling(numSavedSteps/nChains)*thinSteps+burnInSteps),
warmup = burnInSteps,
init = "random", # optional
thin = thinSteps
)
Save the fit object for further analysis.
save(stanFitNormal,file="StanNormalFit2Groups.Rdata")
load("StanNormalFit2Groups.Rdata")
Explore the results.
# text statistics:
print (stanFitNormal)
## Inference for Stan model: a4cb1c74477f9a8b1e476a89a6556245.
## 4 chains, each with iter=2250; warmup=1000; thin=1;
## post-warmup draws per chain=1250, total post-warmup draws=5000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff
## mu[1] 100.01 0.03 2.44 95.18 98.40 99.98 101.61 104.78 5045
## mu[2] 107.85 0.04 3.24 101.56 105.73 107.83 110.02 114.29 5544
## sigma[1] 18.33 0.03 1.83 15.22 17.05 18.19 19.43 22.25 4636
## sigma[2] 25.98 0.03 2.43 21.81 24.32 25.79 27.46 31.28 5124
## lp__ -423.24 0.03 1.46 -426.93 -423.92 -422.91 -422.20 -421.43 2835
## Rhat
## mu[1] 1
## mu[2] 1
## sigma[1] 1
## sigma[2] 1
## lp__ 1
##
## Samples were drawn using NUTS(diag_e) at Sun Mar 14 01:33:42 2021.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
# estimates & hdi:
plot(stanFitNormal)
## ci_level: 0.8 (80% intervals)
## outer_level: 0.95 (95% intervals)
# samples
rstan::traceplot(stanFitNormal, ncol=1, inc_warmup=F)
pairs(stanFitNormal, pars=c('mu','sigma'))
stan_scat(stanFitNormal, c('mu[1]','mu[2]'))
stan_scat(stanFitNormal, c('mu[1]','sigma[1]'))
stan_scat(stanFitNormal, c('mu[1]','sigma[2]'))
stan_scat(stanFitNormal, c('mu[2]','sigma[1]'))
stan_scat(stanFitNormal, c('mu[2]','sigma[2]'))
stan_scat(stanFitNormal, c('sigma[1]','sigma[2]'))
stan_hist(stanFitNormal)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
stan_dens(stanFitNormal)
# autocorrelation:
stan_ac(stanFitNormal, separate_chains = T)
stan_diag(stanFitNormal,information = "sample",chain=0)
## Warning: Removed 2 rows containing missing values (geom_bar).
stan_diag(stanFitNormal,information = "stepsize",chain = 0)
stan_diag(stanFitNormal,information = "treedepth",chain = 0)
stan_diag(stanFitNormal,information = "divergence",chain = 0)
If you prefer output using coda class reformat the chains into coda:
library(coda)
##
## Attaching package: 'coda'
## The following object is masked from 'package:rstan':
##
## traceplot
stan2coda <- function(stanFitNormal) {
# apply to all chains
mcmc.list(lapply(1:ncol(stanFitNormal), function(x) mcmc(as.array(stanFitNormal)[,x,])))
}
codaSamples <- stan2coda(stanFitNormal)
summary(codaSamples)
##
## Iterations = 1:1250
## Thinning interval = 1
## Number of chains = 4
## Sample size per chain = 1250
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## mu[1] 100.01 2.435 0.03444 0.03373
## mu[2] 107.85 3.245 0.04589 0.04481
## sigma[1] 18.33 1.827 0.02584 0.02572
## sigma[2] 25.98 2.425 0.03430 0.03312
## lp__ -423.24 1.463 0.02069 0.02771
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## mu[1] 95.18 98.40 99.98 101.61 104.78
## mu[2] 101.56 105.73 107.83 110.02 114.29
## sigma[1] 15.22 17.05 18.19 19.43 22.25
## sigma[2] 21.81 24.32 25.79 27.46 31.28
## lp__ -426.93 -423.92 -422.91 -422.20 -421.43
plot(codaSamples)
autocorr.plot(codaSamples)
effectiveSize(codaSamples)
## mu[1] mu[2] sigma[1] sigma[2] lp__
## 5333.533 5268.788 5773.737 5374.483 2824.515
gelman.diag(codaSamples)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## mu[1] 1 1.00
## mu[2] 1 1.00
## sigma[1] 1 1.00
## sigma[2] 1 1.01
## lp__ 1 1.01
##
## Multivariate psrf
##
## 1
gelman.plot(codaSamples)
plot(density(codaSamples[[1]][,1]),xlim=c(10,120),ylim=c(0,.25),main="Posterior Densities") # mu[1], 1st chain
lines(density(codaSamples[[1]][,2])) # mu[2], 1st chain
lines(density(codaSamples[[1]][,3])) # sigma[1], 1st chain
lines(density(codaSamples[[1]][,4])) # sigma[2], 1st chain
lines(density(codaSamples[[2]][,1]),col="red") # mu[1], 2nd chain
lines(density(codaSamples[[2]][,2]),col="red") # mu[2], 2nd chain
lines(density(codaSamples[[2]][,3]),col="red") # sigma[1], 2nd chain
lines(density(codaSamples[[2]][,4]),col="red") # sigma[2], 2nd chain
Or you can use shinystan to analyze fitted model
library(shinystan)
## Loading required package: shiny
##
## This is shinystan version 2.5.0
Launch shiny application with the loaded object.
launch_shinystan(stanFitNormal)
##
## Launching ShinyStan interface... for large models this may take some time.
##
## Listening on http://127.0.0.1:3759
Use the robust assumption of Student-t distribution instead of normal.
modelString = "
data {
int<lower=1> Ntotal;
int x[Ntotal];
real y[Ntotal];
real meanY;
real sdY;
}
transformed data {
real unifLo;
real unifHi;
real normalSigma;
real expLambda ;
unifLo = sdY/1000;
unifHi = sdY*1000;
normalSigma = sdY*100;
expLambda = 1/29.0;
}
parameters {
real<lower=0> nuMinusOne;
real mu[2] ; // 2 groups
real<lower=0> sigma[2] ; // 2 groups
}
transformed parameters {
real<lower=0> nu ;
nu = nuMinusOne + 1 ;
}
model {
sigma ~ uniform( unifLo , unifHi ) ; // vectorized 2 groups
mu ~ normal( meanY , normalSigma ) ; // vectorized 2 groups
nuMinusOne ~ exponential( expLambda ) ;
for ( i in 1:Ntotal ) {
y[i] ~ student_t( nu , mu[x[i]] , sigma[x[i]] ) ; // nested index of group
}
}
"
If running the description in modelString for the first time create stanDSO, otherwise reuse it.
stanDsoRobust <- stan_model(model_code=modelString)
If saved DSO is used load it, then run the chains.
save(stanDsoRobust,file="DSORobust1.Rds")
load(file="DSORobust1.Rds")
If necessary, initialize chains. Parameter init can be: one of digit 0, string “0” or “random,” a function that returns a list, or a list of initial parameter values with which to indicate how the initial values of parameters are specified.
Since Stan has pretty complicated parameter tuning process during which among other parameters it selects initial values, it may be a good idea to let Stan select default initial parameters until you get enough experience.
Run MCMC.
parameters = c( "mu" , "sigma" , "nu" ) # The parameters to be monitored
adaptSteps = 500 # Number of steps to "tune" the samplers
burnInSteps = 1000
nChains = 4
thinSteps = 1
numSavedSteps<-5000
# Get MC sample of posterior:
stanFitRobust <- sampling( object=stanDsoRobust ,
data = dataList ,
pars = parameters , # optional
chains = nChains ,
cores=nChains,
iter = ( ceiling(numSavedSteps/nChains)*thinSteps
+burnInSteps ) ,
warmup = burnInSteps ,
init = "random" , # optional
thin = thinSteps )
Save the fit object.
save(stanFitRobust,file="StanRobustFit2Groups.Rdata")
load("StanRobustFit2Groups.Rdata")
Explore the results.
print(stanFitRobust)
## Inference for Stan model: 4b6e29d2aaa99872bfe773f032d14961.
## 4 chains, each with iter=2250; warmup=1000; thin=1;
## post-warmup draws per chain=1250, total post-warmup draws=5000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff
## mu[1] 99.20 0.03 1.86 95.55 97.90 99.20 100.47 102.83 5133
## mu[2] 107.16 0.03 2.61 102.12 105.37 107.18 108.86 112.33 5907
## sigma[1] 11.39 0.03 1.76 8.31 10.16 11.26 12.50 15.20 4184
## sigma[2] 17.98 0.04 2.68 13.16 16.11 17.88 19.68 23.57 4513
## nu 3.90 0.03 1.64 1.95 2.84 3.55 4.50 8.09 3562
## lp__ -451.37 0.04 1.61 -455.36 -452.21 -451.05 -450.18 -449.20 1852
## Rhat
## mu[1] 1
## mu[2] 1
## sigma[1] 1
## sigma[2] 1
## nu 1
## lp__ 1
##
## Samples were drawn using NUTS(diag_e) at Sun Mar 14 01:36:46 2021.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
plot(stanFitRobust)
## ci_level: 0.8 (80% intervals)
## outer_level: 0.95 (95% intervals)
rstan::traceplot(stanFitRobust, ncol=1, inc_warmup=F)
pairs(stanFitRobust, pars=c('nu','mu','sigma'))
stan_scat(stanFitRobust, c('nu','mu[1]'))
stan_scat(stanFitRobust, c('nu','mu[2]'))
stan_scat(stanFitRobust, c('nu','sigma[1]'))
stan_scat(stanFitRobust, c('nu','sigma[2]'))
stan_scat(stanFitRobust, c('mu[1]','sigma[1]'))
stan_scat(stanFitRobust, c('sigma[1]','sigma[2]'))
Note correlation between the sigma parameters.
Is there a sign of correlation in non-robust approach?
How can you explain positive correlation?
stan_hist(stanFitRobust)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
stan_dens(stanFitRobust)
stan_ac(stanFitRobust, separate_chains = T)
stan_diag(stanFitRobust,information = "sample",chain=0)
## Warning: Removed 2 rows containing missing values (geom_bar).
stan_diag(stanFitRobust,information = "stepsize",chain = 0)
stan_diag(stanFitRobust,information = "treedepth",chain = 0)
stan_diag(stanFitRobust,information = "divergence",chain = 0)
launch_shinystan(stanFitRobust)
##
## Launching ShinyStan interface... for large models this may take some time.
##
## Listening on http://127.0.0.1:3759
For comparison decide whether the two groups are different or not using the FNP approach.
Use t-test with unequal variances:
qqnorm(y[x==1])
qqline(y[x==1])
qqnorm(y[x==2])
qqline(y[x==2])