Finding the structure of a chain event graph

An example: The BEST dataset

Chain Event Graphs are a powerful class of models that depict a process of unfolding events. This tutorial will focus on how to take a new dataset and produce a functional chain event graph. For illustrative purposes we will consider an artificial sample dataset from the https://euroqol.org/eq-5d-instruments/eq-5d-3l-about/. The BEST study records the following variables

  • euroqu:11: Mobility
  • euroqu:12: Self-Care
  • euroqu:13: Usual Activities
  • euroqu:14: Pain/Discomfort
  • euroqu:15: Anxiety/Depression

Each variable has three levels:

  • Level 1: No Problems
  • Level 2: Some Problems
  • Level 3: Extreme Problems

We want to understand the relationships between these variables. Are any of these relationships causal? How can we tell? One common starting point for asking questions like this is to use a bn discovery algorithm, such as those in the package deal.

summary(bp)
##  euroqol1  euroqol2  euroqol3  euroqol4  euroqol5 
##  -99: 14   -99: 17   -99: 12   -99: 11   -99: 15  
##  1  :358   1  :603   1  :202   1  : 72   1  :406  
##  2  :491   2  :186   2  :565   2  :608   2  :354  
##            3  : 57   3  : 84   3  :172   3  : 88

The missing data (indicated by a 9) has been removed from this data. The subsequent methods rely on a complete dataset. Other preprocessing. Chain event graphs work on discrete data, which for this example in R we will class as factors. Any continuous data must be binned at this point.

Current search methods take a saturated event tree and find the highest scoring chain event graph. Thus, we must first specify an ordering for our data. There are several ways to do this. We might first consult with experts who use the best study and determine what ordering they would suggest.

bp <- bp[,c(4,2,3,5,1)] #within these orderings the score doesn't seem to matter too much

Bayesian Network Discovery

Alternatively, as in Collazo 2016, we can first fit a Bayesian network model to the data. BNs admit a partial ordering that we can then use for the CEG search. Here, we find the optimal CEG using the deal package:

library(deal)
bp.fit <-network(bp) #cast dataframe as a network with prob property for each node
bp.fit.prior <- jointprior(bp.fit, 325)#determine joint distribution for thenetwork
## Warning: Your choice of imaginary sample size is very low
## We advice you to set the imaginary sample size to more than 1536 
## Imaginary sample size: 325
bp.fit <- getnetwork(learn(bp.fit,bp,bp.fit.prior))#determines the parents of each node & local parameters
bp.hisc <- autosearch(bp.fit,bp,bp.fit.prior,trace=F,removecycles = T)#takes initial network, flips arrows, and looks for highest score
## [Autosearch (1) -4296.079 [euroqol4|euroqol2][euroqol2][euroqol3][euroqol5][euroqol1]
## (2) -4198.299 [euroqol4|euroqol2][euroqol2][euroqol3|euroqol4][euroqol5][euroqol1]
## (3) -4110.058 [euroqol4|euroqol2][euroqol2][euroqol3|euroqol4][euroqol5|euroqol4][euroqol1]
## (4) -4031.455 [euroqol4|euroqol2][euroqol2][euroqol3|euroqol4][euroqol5|euroqol4][euroqol1|euroqol3]
## (5) -3972.333 [euroqol4|euroqol2][euroqol2][euroqol3|euroqol4:euroqol2][euroqol5|euroqol4][euroqol1|euroqol3]
## (6) -3922.118 [euroqol4|euroqol2][euroqol2][euroqol3|euroqol4:euroqol2][euroqol5|euroqol4][euroqol1|euroqol2:euroqol3]
## (7) -3876.248 [euroqol4|euroqol2][euroqol2][euroqol3|euroqol4:euroqol2][euroqol5|euroqol4:euroqol2][euroqol1|euroqol2:euroqol3]
## (8) -3856.549 [euroqol4|euroqol2][euroqol2][euroqol3|euroqol4:euroqol2][euroqol5|euroqol4:euroqol2][euroqol1|euroqol4:euroqol2:euroqol3]
## (9) -3844.105 [euroqol4|euroqol2][euroqol2][euroqol3|euroqol4:euroqol2][euroqol5|euroqol4:euroqol2:euroqol3][euroqol1|euroqol4:euroqol2:euroqol3]
## Total 0.24 add 0.12 rem 0 turn 0 sort 0 choose 0.1 rest 0.02 ]
plot(getnetwork(bp.hisc))#behold

What conditional independence relationships does this encode? By inspection, using the Markov property, we can tell that 5 (Anxiety/Depression) is independent of 1 (Mobility) given 2, 3, and 4 (Self Care, Usual Activities, and Pain/Discomfort). The global Markov property tells us a node is independent of its non-descendants given its parents.

What variable ordering is implicit in this structure? A node’s parents should come after its parents. Thus, one possible ordering for this graph is 2, 1, 4, 3, 5. This variable ordering enables us to search the space of possible trees with this ordering below.

The CEG package

We now turn our attention to the CEG package.

The function below is specific to the CEG package and removes NAs and whitespace. However, we need to be careful because the coding of the problem indicates that the -99 values are the missing values.

library(ceg)
## Warning: package 'ceg' was built under R version 3.4.3
bp <- CheckAndCleanData(bp) #check to remove NAs & whitespace
## Your data do not contain rows with <NA>, absent, or whitespace  values
bp[-which(bp==-99,arr.in=TRUE),] -> bp  #remove the actual white space

Now we are ready to find the stratified event tree.

bp.set <- Stratified.event.tree(bp)
## ~~~ Stratified.event.tree: initializator ~~~
plot(bp.set)
## pdf 
##   2

Our data populates the contingency tables of the tree

bp.prior <- PriorDistribution(bp.set,3)#alpha=3
bp.sst <- Stratified.staged.tree(bp)
plot(bp.sst)

Our stratified stage tree fails here. By inspection, we notice that many of the pathways in the contingency table are zeored out. We do not have enough data to initialize the tree there. We notice that the extreme cases (Level 3 in the data) are very sparse, so we’ll discretize the data so that it just has two datas and try again.

bp_binary <- bp
bp_binary[which(bp==3,arr.in=TRUE)] <- 2
bp_binary <- as.data.frame(lapply(bp_binary,factor))#ensure treated as discrete rather than continuous variables
#
bp.bin.sst <- Stratified.staged.tree(bp_binary)
## ~~~ Stratified.event.tree: initializator ~~~ 
## ~~~ OAHC: initializator ~~~ 
## ~~~ OAHC: initializator ~~~ 
## ~~~ OAHC: initializator ~~~ 
## ~~~ OAHC: initializator ~~~ 
## ~~~ OAHC: initializator ~~~ 
## ~~~ Stratified.staged.tree: initializator ~~~
bp.bin.set <- Stratified.event.tree(bp_binary)
## ~~~ Stratified.event.tree: initializator ~~~
plot(bp.bin.set)
## pdf 
##   2
ceg:::ContingencyTable(stratified.event.tree = bp.bin.set,data=bp_binary)
## [[1]]
##      [,1] [,2]
## [1,]   72  764
## 
## [[2]]
##      [,1] [,2]
## [1,]   36   36
## [2,]  558  206
## 
## [[3]]
##      [,1] [,2]
## [1,]   17   19
## [2,]   12   24
## [3,]  147  411
## [4,]   25  181
## 
## [[4]]
##      [,1] [,2]
## [1,]   12    5
## [2,]    9   10
## [3,]    4    8
## [4,]    8   16
## [5,]  101   46
## [6,]  207  204
## [7,]    9   16
## [8,]   50  131
## 
## [[5]]
##       [,1] [,2]
##  [1,]   11    1
##  [2,]    3    2
##  [3,]    4    5
##  [4,]    4    6
##  [5,]    2    2
##  [6,]    4    4
##  [7,]    4    4
##  [8,]    8    8
##  [9,]   75   26
## [10,]   30   16
## [11,]   90  117
## [12,]   72  132
## [13,]    5    4
## [14,]    8    8
## [15,]   12   38
## [16,]   24  107
plot(bp.bin.sst)
## pdf 
##   2

The colorings on these stages tell us which stages need to be combined in the CEG

plot(bp.bin.sst)

## png 
##   2
bp.bin.ceg <- Stratified.ceg.model(bp.bin.sst)
## ~~~ CEGStretified: initializator ~~~
plot(bp.bin.ceg)

## png 
##   2

Success. Now we’ve found a CEG with an optimal coloring under the OAHC algorithm.