Graphical Markov Models Lectures

PhD Program in Statistics

Author

Affiliation

Giovanni M. Marchetti

DISIA, University of Florence

Published

April 2, 2026

1 Introduction

1.1 Basic ideas

Graphical Markov Models discuss a particular structure of multivariate distributions based on independencies among a set of variables. The independence structure is not as straightforward as it might seem at first glance, especially when considering marginal and conditional independencies.

Graphical models define precisely what is the independence structure of a multivariate model using graphs where the nodes are the variables and the edges (lines or arrows) present or missing define the associations or the independencies.
These notes are partly based on Cox and Wermuth (1996), Evans (2025), Højsgaard, Edwards, and Lauritzen (2012) and Whittaker (1990).
This chapter starts with a discussion of conditioning and marginalizing in multivariate distributions with examples from two classical distributions like the Gaussian and the Multinomial.
As usual we start from a vector of random variables denoted by X_V = (X_1, X_2, \dots, X_d)^T representing a multivariate population with d components. The suffix V indicates just the set V = \{1, 2,\dots, d\}.
To simplify the notation we often omit the subscript V by writing X instead of X_V. Also we use the simpler notation for the variables like X, Y, Z and so on.
We will assume that the population can be either multivariate Gaussian or Bernoulli and that we observe an i.i.d. sample X_V^{(i)} \equiv (X^{(1)}, \dots, X^{(n)})^T.

1.2 Marginal and conditional densities

Distribuzioni marginali

Given two variables (X, Y) the marginal density of X is defined by p(x) = \begin{cases} \int_{\mathcal{X}} p(x, y) dy & \text{continuous case}\\ \sum_{x \in \mathcal{X}} p(x,y) & \text{discrete case}. \end{cases}
Remember that marginalizing implies considering a set of variables, but ignoring the others. For instance, by observing a single variable in a population means including implicitly all the other possible features.

This concept is essential when in a specific study there are some variables that can be observed and other variables that are unobserved.

Distribuzioni condizionate

Invece, a differenza della marginalizzazione, l’operazione di condizionamento consente di capire come si trasforma la distribuzione data se si seleziona una sotto-popolazione.
A conditional density of X given Y is defined by any function p(x|y) such that p(x,y) = p(y) \cdot p(x|y) The conditional density p(x|y) is undefined if p(y) = 0. If p(y) >0 we get the usual definition p(x|y) = \frac{p(x,y)}{p(y)}.

We use a simple but imprecise notation for p(y) and p(x|y) instead of p_Y(y) and p_{X|Y}(x|y).

Notation

The marginal and conditional distributions are denoted using a subset notation instead of the whole variables.
With (X_1, X_2, X_3) the marginal density of (X_1, X_3) is denoted by

p(x_1, x_3)\quad \text{ or } \quad p(x_a) \text{ where } a = \{1,3\}. In this case it is common to say: we marginalize over X_2.
The density of X_2 \mid (X_1, X_3) is denoted by

p(x_2\mid x_1, x_3)\quad \text{ or } \quad p(x_b | x_a) \text{ where } a = \{1, 3\} \text{ and } b = \{2\}.

In this case it is common to say: we condition on (X_1, X_3).

Example with Gaussian distributions

A d\times 1 random vector X\in \mathbb{R}^d has a multivariate Gaussian distribution N(0, \Sigma) with mean zero and covariance matrix \Sigma if it has a density p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\Big \{ - \textstyle\frac{1}{2} x^T\Sigma^{-1}x\Big \} with a support \mathbb{R}^d. The general Gaussian Y \sim N(\mu, \Sigma) with mean \mu is obtained by the transformation Y = \mu + X.
The Gaussian distributions satisfy two important properties.
1. The marginal distribution of any subset X_a of the random vector X is Gaussian
2. The conditional distribution of any subset X_b given a disjoint subset X_a of X is Gaussian.
This implies that the family of multivariate Gaussian distributions is closed under marginalization and conditioning.
Specifically, assume that X has 5 components with a = \{1, 2\} and b = \{3, 4, 5\}.

Then, by marginalizing X = (X_a, X_b) over X_b we get X_a \sim N(\mu_a, \Sigma_{aa}) where \mu_a = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \Sigma_{aa} = \begin{pmatrix} \sigma_{11} & \cdot \\ \sigma_{21} & \sigma_{22} \end{pmatrix}

By conditioning X_b on X_a we get X_{b} \mid X_a \sim N(\mu_{b|a}, \Sigma_{bb|a}) where \begin{align*} \mu_{b|a} &= \mu_b + B(X_a -\mu_a), \text{ with } B = \Sigma_{ba}\Sigma_{aa}^{-1} \\ \Sigma_{bb|a} &= \Sigma_{bb} - \Sigma_{ba}\Sigma_{aa}^{-1}\Sigma_{ab} \end{align*}

Therefore the conditional expectation E(X_b \mid X_a) is a linear function of X_a and the conditional covariance matrix does not depend on X_a.

1.3 Types of independence

Now we start to discuss the structure of independencies in a multivariate distribution by making a distinction between marginal and conditional independencies.

Marginal independence

Two random variables X and Y wth supports \mathcal{X} and \mathcal{Y} are said marginally independent, denoted by X \;\perp \hspace{-2.2ex}\perp\;Y, if

p(x, y) = p(x) p(y)

for all (x, y) \in \mathcal{X} \times \mathcal{Y}. Remember that the last condition is essential.

Exercise Let (X,Y) be a random vector with uniform density on the support \{(x,y) : -1 \le x \le 1,\; 0 \le y \le 1 - x^2\}. Are X and Y independent?

Conditional independence

Suppose that X,Y are two random variables defined on a product space \mathcal{X} \times \mathcal{Y}, and that there is a third random variable Z so that the distribution is p(x, y, z).

Then X and Y are said conditionally independent given Z if p (x| y, z) = p(x|z) for all (x, y, z) \in \mathcal{X} \times \mathcal{Y} \times \mathcal{Z} such that p(y, z) > 0.
The interpretation is that if we study the conditional distribution of X given Y and Z, the variable Y is irrelevant when we condition on Z.

Basic theorem

Let X, Y, Z be random variables on a Cartesian product space \mathcal{X} \times \mathcal{Y} \times \mathcal{Z}. Then the following are equivalent.

\begin{align*} 1.\; & p(x | y, z) = p(x | z) \text{for all } x, y, z \text{ such that } p(y, z) > 0\\ 2.\; & p(x, y | z) = p(x | z) \cdot p(y | z) \text{ for all } x, y, z \text{ such that } p(z) > 0\\ 3.\; & p(x, y, z) = p(y, z) \cdot p(x | z) \text{ for all } x, y, z \\ 4.\; & p(z) \cdot p(x, y, z) = p(x, z) \cdot p(y, z) \text{ for all } x, y, z \\ 5.\; & p(x, y, z) = f(x, z) \cdot g(y, z) \text{ for some functions } f, g \text{ and all } x, y, z. \end{align*}

1.4 Properties of conditional independence

Conditional independence satisfies the following implications called semigraphoid properties:

X \;\perp \hspace{-2.2ex}\perp\;Y \mid Z \implies Y \;\perp \hspace{-2.2ex}\perp\;X \mid Z (symmetry)
X \;\perp \hspace{-2.2ex}\perp\;(Y,W) \mid Z \implies X \;\perp \hspace{-2.2ex}\perp\;Y \mid Z (decomposition)
X \;\perp \hspace{-2.2ex}\perp\;(Y,W) \mid Z \implies X \;\perp \hspace{-2.2ex}\perp\;W \mid (Y,Z) (weak union)
X \;\perp \hspace{-2.2ex}\perp\;W \mid (Y,Z) \; \& \; X \;\perp \hspace{-2.2ex}\perp\;Y \mid Z \implies X \;\perp \hspace{-2.2ex}\perp\;(Y,W) \mid Z (contraction)

The essential property

Note that the converse to (4. contraction) follows from (2. decomposition) and (3. weak union). Thus we can combine the two properties as the rule X \;\perp \hspace{-2.2ex}\perp\;(Y,W) \mid Z \iff X \;\perp \hspace{-2.2ex}\perp\;Y \mid Z \;\&\; X \;\perp \hspace{-2.2ex}\perp\;W \mid (Y,Z)

Thus, the semigraphoid properties reduce to the above property plus the symmetry property.

Intersection property

There is an additional property called intersection (that is the reverse of weak union)

If p(x,y,z,w) > 0 then $X W (Y,Z) ; ;&; XY (W,Z) X (Y,W) Z.

The intersection property however doesn’t generally hold if the distribution is not positive.

The rules from 1. to 5. define the so called graphoid properties.

Examples of independence for binary data

Marginal independence

Let X_1, X_2 be a binary random variables with joint probabilities \pi_{ij}>0. When X_1 \;\perp \hspace{-2.2ex}\perp\;X_2? The answer is if \pi_{ij} = \pi_{i+}\pi_{+j}, \quad \text{ for all } (i,j) \in \mathcal{I}_2. Equivalently, X_1\;\perp \hspace{-2.2ex}\perp\;X_2 \iff \pi_{00}\pi_{11} - \pi_{01}\pi_{10} = 0 \iff \text{OR} = \frac{\pi_{00}\pi_{11}}{\pi_{01}\pi_{10}} = 1. The OR, called odds-ratio (or cross product ratio), is an association measure for binary variables, taking values > 0.

Conditional independence

Consider now three binary variables X_1, X_2, X_3 with joint probabilities \pi_{ijk}>0. The probabilities can be stored in a 2\times 2\times 2 table. The conditional independence requires that for each k = 0,1 the conditional odds-ratio of the tables (X_1, X_2) \mid X_3 = k are all equal to 1. Formally

X_{1} \;\perp \hspace{-2.2ex}\perp\;X_{2} \mid X_3 \iff \frac{\pi_{00k}\pi_{11k}}{\pi_{01k}\pi_{10k}} = 1, \quad \text{ for all } k = 0,1.

Marginal independence does not imply a conditional independence

X_1 and X_2 are not independent conditionally on X_3 (divide the counts by 100) \begin{array}{cc|cc|cc} & X_3 & 0 & 0 & 1 & 1 \\ X_1 & X_2 & 0 & 1 & 0 & 1 \\ \hline 0 & & 37 & 8 & 9 & 38 \\ 1 & & 3 & 2 & 1 & 2 \\ \hline & & \text{OR}= 3.08 & &\text{OR}= 0.47 & \\ \end{array}

But X_1 and X_2 are independent if we marginalize over X_3 \begin{array}{cc|cc} X_1& X_2& 0 & 1\\ \hline 0 & & 46 & 46 \\ 1 & & 4 & 4\\\hline & &\text{OR}= 1 & \\ \end{array} > Notice that the conditional association are positive and negative.

Conditional independence and marginal independence

X_1 and X_2 are independent conditionally on X_3 (divide the counts by 1000)

\begin{array}{cc|cc|cc} & X_3 & 0 & 0 & 1 & 1 \\ X_1 & X_2 & 0 & 1 & 0 & 1 \\ \hline 0 & & 384 & 96 & 40 & 360 \\ 1 & & 16 & 4 & 10 & 90 \\ \hline & & \text{odr}= 1 & &\text{odr}= 1 & \\ \end{array}

But X_1 and X_2 are dependent if we ignore X_3 \begin{array}{cc|cc} X_1& X_2& 0 & 1\\ \hline 0 & & 424 & 456 \\ 1 & & 26 & 94\\\hline & &\text{odr}= 3.36 & \\ \end{array}

1.5 Computational advantages of conditional independence

Consider X, Y and Z on a product space with joint density p_\theta(x,y,z) = g_\tau(x,y) h_\xi(y, z) for some functions g_\tau(), h_\xi() with \theta = (\tau, \xi) variation independent.
Then, to find the MLE of \theta we can maximize separately the factors \max_{\tau} \prod_{i= 1}^n g_\tau(x_i, y_i), \qquad \max_\xi \prod_{i=1}^n h_\xi(y_i, z_i), by considering the two marginal distribution of (X,Y) and (Y, Z).

Computations

Even moderatedly sized contingency table will cause statistical problems due to the curse of dimensionality.
If we have d binary variables the contingency table has 2^d cells. For example with d = 15 we have 32768 cells, most of them probably empty unless you have a quite large number of individuals.
But conditional independence can help: if we have the independence X_A \;\perp \hspace{-2.2ex}\perp\;X_B \mid X_C, with V = A\cup B \cup C this implies that

p(x_V) = p(x_C) \cdot p(x_A \mid x_C) \cdot p(x_B\mid x_C)

and we can store each factor in computer memory separately wich means

2^{|A|} + 2^{|A|+|C|} + 2^{|B|+|C|} = 2^{|C|} (1 + 2^{|A|} + 2^{|B|})\quad \text{cells}

that represent a considerable saving with respect to 2^{|A|+|B| + |C|}.

If d = 15 with A = 5, |C| = 2 and |B| = 8, the total number of cells is 4 ( 1 + 32 + 256) = 1156. instead of 32768.

1.6 Multivariate Bernoulli

When the variables X_V = (X_1, \dots, X_d)^T are all binary with levels \{0,1\} the joint distribution is called a multivariate Bernoulli with probabilities p(x_V) = \Pr(X_1 = x_1, \dots, X_d = x_d), \quad x \in \mathcal{I} = \{0,1\}^d with p(x) >0 and \sum_{x \in \mathcal{I}} p(x) = 1.

The parameters are \pi = (p_V(x), x \in \mathcal{I}), a (2^d\times 1) vector taking value in a simplex of dimension (2^d-1).

For example for d = 3 a multivariate Bernoulli is

\begin{array}{cccc} 1 & 2 & 3 & \pi \\ \hline 0 & 0 & 0 & 0.38 \\ 1 & 0 & 0 & 0.01 \\ 0 & 1 & 0 & 0.07 \\ 1 & 1 & 0 & 0.01 \\ 0 & 0 & 1 & 0.42 \\ 1 & 0 & 1 & 0.02 \\ 0 & 1 & 1 & 0.08 \\ 1 & 1 & 1 & 0.01 \\ \hline \end{array}

where \pi is a 2^3-vector of the joint probabilities p(x_1, x_2, x_3).

An i.i.d sample of size n = 100 from this Bernoulli is obtained using random sample from a Multinomial with size 1 and probabilities \pi.

R code

set.seed(111)
n <- 100
p <- c(0.38, 0.01, 0.07, 0.01,
       0.42, 0.02, 0.08, 0.01)
X <- expand.grid(X1 = factor(0:1),
                 X2 = factor(0:1),
                 X3 = factor(0:1))
Z <- rmultinom(n, size = 1, prob = p)
cell <- apply(Z, 2, function(x) which(x==1))
data <- X[cell,]
rownames(data) <- 1:n

The result is an n \times 3 binary matrix. The first rows are shown below:

head(data)

When the data frame is cross-tabulated we obtain the frequencies that are the number of units who have a specific response pattern. Here it is:

df<- as.data.frame(table(data))
df

  X1 X2 X3 Freq
1  0  0  0   43
2  1  0  0    2
3  0  1  0    9
4  1  1  0    1
5  0  0  1   38
6  1  0  1    1
7  0  1  1    6
8  1  1  1    0

In this case the vector of counts has a multinomial distribution with index n and probability vector \pi.
Notice that some combinations of levels can have frequency zero if the sample size is small.

Log-linear parameterization

The vector of the probabilities \pi > 0 is written in a lexicographical order.
The 2^d parameters \pi must satisfy the constraint \sum \pi_i = 1 and its parameter space is a simplex of dimension 2^d -1.

Reparametrization

There is a 1-1 map from \pi to a parameter \theta that can vary in a real space of dimension 2^{d}-1.
I show first the map for two variables.
- Define a matrix with a Kronecker product L = \begin{pmatrix} 1 & 0\\ -1 & 1 \\ \end{pmatrix} \otimes \begin{pmatrix} 1 & 0\\ -1 & 1 \\ \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0\\ -1 & 1 & 0 & 0\\ -1 & 0 & 1 & 0\\ 1 & -1 & -1 & 1\\ \end{pmatrix}
- Then define \theta = L \log \pi.

R example

Let (X_1, X_2) two binary variables. If \lambda = (0, 1/2, -1/4, 1/10) the joint distribution is as follows.

lambda <- c(0, 1/2, -1/4, 1/10)
L <- matrix(c(1, 1,
              0, 1), 2, 2)
print(L %x% L)

     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    1    1    0    0
[3,]    1    0    1    0
[4,]    1    1    1    1

p <- exp((L %x% L) %*% lambda); p <- p/sum(p)
p

          [,1]
[1,] 0.2063307
[2,] 0.3401817
[3,] 0.1606905
[4,] 0.2927971

The joint probability distribution is p(x_1, x_2) = \exp\{\lambda_0 + \lambda_1 x_1 + \lambda_2 x_2 + \lambda_{12}x_1 x_2\}.
The component of the final vector \lambda are denoted (\lambda_0, \lambda_1, \lambda_2, \lambda_{12}) corresponding to the all the subsets of the variables in the same lexicographic order of \pi.
In general the log-linear parameters for p(x_V)>0 as defined by the relation p(x_V) = \exp\Big(\sum_{A\subseteq V} \lambda_A x_A\Big)

Log-likelihood in the log-linear parameterization

Assume that X_V is a binary random vector and that we have an i.i.d sample of size N from this random vector. After cross-tabulating the sample data we define a vector of frequencies n(x_v) (the number of individuals who have the response pattern x_V.
These frequencies are the sufficient statistics for a multinomial model whose log-likelihood is

\ell(p) = \sum_{x_V} n(x_V) \log p(x_V), \qquad p(x_V) > 0, \quad\sum_{x_V} p(x_V) = 1.
From this it can be shown that the multinomial distribution is an exponential family with canonical parameters \lambda.
The log-likelihood for \lambda given a sample of size N and cell frequencies stored in a vector n = n(x_V) can be written as \ell(\lambda) = n \mathbb{L} \lambda - N \log[{\boldsymbol{1}}^T \exp(\mathbb{L}\lambda )] Here the canonical parameters are the component of \lambda called the log-linear parameters and the sufficient statistics are the cell frequencies x.

Conditional independence in log-linear models

When the log-linear parametrization defines a conditional independence X_2 \;\perp \hspace{-2.2ex}\perp\;X_3 \mid X_1?
The joint distribution for a binary r.v. (X_1, X_2, X_3) is p(x_1, x_2, x_3) = \exp(\lambda_0 + \lambda_1 x_1 + \lambda_2 x_2 + \lambda_{12}x_1 x_2 + \lambda_3 x_3 + \lambda_{13}x_1 x_3 + \lambda_{23}x_2 x_3 + \lambda_{123}x_1 x_2 x_3)

Proposition: p>0 satisfies X_2 \;\perp \hspace{-2.2ex}\perp\;X_3 \mid X_1 if and only if \lambda_{23} = \lambda_{123}= 0.

Proof \begin{align*} p(x_1, x_2 x_3) &= \exp(\lambda_0 + \lambda_1 x_1 + \lambda_2 x_2 + \lambda_{12}x_1 x_2 + \lambda_3 x_3 + \lambda_{13}x_1 x_3 ) \\ &=\exp(\lambda_0 + \lambda_{1}x_1 + \lambda_2 x_2 + \lambda_{12}x_1 x_2) \cdot \exp(\lambda_3 x_3 + \lambda_{13}x_1 x_3) \end{align*} so that p(x_1, x_2, x_3) = \psi_{12}(x_1, x_2) \cdot \psi_{13}(x_1, x_3).

In general, we have this theorem.

Theorem: If p >0 is distribution for d binary variables, with associated parameters \lambda_C for C\subseteq V then the conditional independence X_i \;\perp \hspace{-2.2ex}\perp\;X_j \mid X_{V \setminus \{i,j\}} holds if an only if \lambda_{ij} = 0 \qquad \text{for all } \{i,j\} \subseteq C \subseteq V.

Proof. See Evans (2024) p. 16.

1.7 Multivariate Gaussian

Let K be the inverse of the covariance matrix \Sigma > 0, called the concentration matrix. The joint density can be written in exponential family form p(x) = \textstyle \frac{1}{(2\pi)^{d/2}}\exp\{-\frac{1}{2} x^T_V K x_V + \mu^T K x_v -\frac{1}{2} \mu^T K \mu + \frac{1}{2} \log |K|\}. - The canonical parameters are \eta \equiv K\mu and K and - the corresponding sufficient statistics are s(x_V) = (x_V , \textstyle -\frac{1}{2}x_V x_V^T). - The cumulant function is 2 k(K, \eta) = \eta^T K^{-1} \eta - \log|K|.

The MLE of the parameters \mu and \Sigma are
- \hat \mu = \frac{1}{n} \sum_i x_i
- \hat \Sigma = \frac{1}{n} \sum_i x_i x_i^T - (\frac{1}{n} x_i)(\frac{1}{n} x_i^T)

Marginal independence for the Gaussian

It is well-known that in a Gaussian distribution X \sim N(\mu, \Sigma) two variables X_i and X_j are independent if and only if \sigma_{ij} = 0. This rule can be extended to a random vector X = (X_a, X_b) where a and b define a partition of V. The covariance matrix has a block structure \Sigma = \begin{pmatrix} \Sigma_{aa} & \cdot\\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix} where the dot means a block that is symmetric. Then, X_a \;\perp \hspace{-2.2ex}\perp\;X_b if and only if \Sigma_{ab} = 0. Thus the covariance matrix is block diagonal.

Composition property for Gaussian variables

We have seen that in general from the decomposition property
X \;\perp \hspace{-2.2ex}\perp\;YW \implies X \;\perp \hspace{-2.2ex}\perp\;Y \text{ and } X \;\perp \hspace{-2.2ex}\perp\;W. Gaussian distributions satisfy also the reverse implication so that the following equivalence holds X \;\perp \hspace{-2.2ex}\perp\;YW \iff X \;\perp \hspace{-2.2ex}\perp\;Y \text{ and } X \;\perp \hspace{-2.2ex}\perp\;W called the composition property. Notice that this property is not shared by discrete distributions.

Example: You can verify that the two marginal tables (X_1, X_2) and (X_1, X_3) are uniform, implying the two independencies X_1\;\perp \hspace{-2.2ex}\perp\;X_3 and X_2 \;\perp \hspace{-2.2ex}\perp\;X_3. However surprisingly the joint independence X_1 \;\perp \hspace{-2.2ex}\perp\;(X_2, X_3) does not hold. \begin{array}{ccccccc} \hline &X_3 &0 & 0 & 1 & 1 & \\ X_1 & X_2 & 0 & 1 & 0 & 1 & \text{sum} \\ \hline 0 & & 0.20 & 0.05 & 0.05 & 0.20 & 0.5 \\ 1 & & 0.15 & 0.10 & 0.10 & 0.15 & 0.5 \\ \hline \text{sum} & & 0.35 & 0.15 & 0.15 & 0.35 & 1.0 \\ \end{array}

Role of the concentration matrix

Proposition In a Gaussian distribution
X_i \;\perp \hspace{-2.2ex}\perp\;X_j \mid X_{V \setminus\{i,j\}} \text{ if and only if } \kappa_{ij} = 0 where \kappa_{ij} is the correponding entry in the concentration matrix.

Fundamental properties of Gaussian distributions

In general, a Gaussian distribution is closed under marginalization and conditioning. This means that

the marginal distribution of any subset X_a of the random vector X_V is multivariate normal
the conditional distribution of any subset X_b given a disjoint subset X_a of X_V are multivariate normal.

Specifically:

by marginalizing X = (X_a, X_b) over X_b we get X_a \sim N(\mu_a, \Sigma_{aa})
by conditioning X_b on X_a we get X_{b|a} \sim N(\mu_{b|a}, \Sigma_{bb|a}) where \begin{align*} \mu_{b|a} &= \mu_b + B(X_a -\mu_a), \text{ with } B = \Sigma_{ba}\Sigma_{aa}^{-1} \\ \Sigma_{bb|a} &= \Sigma_{bb} - \Sigma_{ba}\Sigma_{aa}^{-1}\Sigma_{ab} \end{align*}

Therefore the conditional expectation E(X_b \mid X_a) is a linear function of X_a and the conditional covariance matrix does not depend on X_a.

Independence of the residuals and the conditioning variables

Proposition The random vector \varepsilon_b = X_b - B X_a is independent of X_a.

Proof. As both the variables are Gaussian we can prove their independence by checking that the covariance is zero: \begin{align*} \mathrm{cov}(X_b - B X_a, X_a) &= \mathrm{cov}(X_b, X_a) - B \mathrm{var}(X_a)\\ &= \Sigma_{ba} - B \Sigma_{aa}\\ &= \Sigma_{ba} - \Sigma_{ba} \Sigma_{aa}^{-1} \Sigma_{aa} = 0. \end{align*}

Factorization of a bivariate Gaussian

A bivariate Gaussian distribution of (X, Y) with mean zero and covariance matrix \Sigma with variances \sigma_{XX}, \sigma_{YY} and covariance \sigma_{YX} can be factorized as p(x, y) = p(x) p(y|x) by using the following set of transformations:

\begin{align*} X &= \varepsilon_{X}\\ Y &= \textstyle \beta X + \varepsilon_{Y} \end{align*}

where - \beta = \sigma_{YX}/\sigma_{XX} is the regression coefficient of Y on X, - \varepsilon_{X} and \varepsilon_{Y} are two independent normal variables with mean zero and variances respectively \sigma_{XX} and \sigma_{YY}- \sigma_{YX}^2/\sigma_{XX}

Prove this result.

Multivariate regression

The previous equations can be generalized.

For instance, let X_V be a trivariate Gaussian distribution (X_1, X_2, X_3) and let (a,b) a partition of the variables with a = \{1,2\} and b = \{3\}. Then, the factorization
f_{ab}(x_a, x_b) = f_a(x_a) f_{b|a}(x_b|x_a) can be obtained by \begin{align*} X_a &= \varepsilon_a\\ X_b &= B X_a + \varepsilon_b \end{align*} where

B = \Sigma_{ba}\Sigma_{aa}^{-1} is a vector of the partial regression coefficients of X_b on X_a,
\varepsilon_{a} and \varepsilon_{b} are two independent Gaussian with mean zero and with covariance matrix \Sigma_{aa} and variance \sigma_{bb} - \Sigma_{ba}\Sigma_{aa}^{-1} \Sigma_{ab}, respectively.

R Example

Consider again the Gaussian distribution with mean zero and covariance matrix

\Sigma = \begin{pmatrix} 1 & \cdot & \cdot \\ -0.8 & 1 & \cdot\\ -0.32 & 0.4 & 1 \end{pmatrix}.

Partition the variables with a = \{1,2\} and b = \{3\} and calculate \beta and the variance of \varepsilon_{b}.

R code

S <- matrix(c(1, -0.8, -0.32,
              -0.8, 1, 0.4,
             -0.32, 0.4, 1), 3, 3)
a <- c(1,2)
b <- 3
beta <- S[b,a] %*% solve(S[a,a])
print(beta)

             [,1] [,2]
[1,] 9.375217e-17  0.4

print(S[b,b] - S[b,a] %*% solve(S[a,a]) %*% S[a,b])

     [,1]
[1,] 0.84

that gives \beta = (0, 0.4) and \mathrm{var}(\varepsilon) = 0.84.

Results

The equations are

\begin{align*} X_{a} &= N(0, \Sigma_{aa})\\ X_3 &= N(0\, X_1 + 0.4\, X_2 , \;0.84). \end{align*}

The conditional distribution of X_3 given (X_1, X_2) depends only on X_2 so that, by definition, X_3 \;\perp \hspace{-2.2ex}\perp\;X_1 \mid X_2.

Blocks of the concentration matrix

Proposition If the concentration matrix K = \Sigma^{-1} is decomposed in blocks as

K = \begin{pmatrix} K_{aa} & K_{ba}^T \\ K_{ba} & K_{bb} \end{pmatrix}

the following identities hold:

K_{bb}^{-1} = \Sigma_{bb|a}
\Sigma_{ba}\Sigma_{aa}^{-1} = - K_{bb}^{-1} K_{ba}

A special case happens if a = \{1,\dots, d-1\} and b = \{d\}:

\sigma_{dd|a} = \displaystyle \frac{1}{\kappa_{dd}}
\beta_{j} = \displaystyle -\frac{\kappa_{jd}}{\kappa_{dd}}

R Example

K <- solve(S)
print(K)

              [,1]       [,2]          [,3]
[1,]  2.777778e+00  2.2222222 -6.555603e-17
[2,]  2.222222e+00  2.9682540 -4.761905e-01
[3,] -3.436405e-17 -0.4761905  1.190476e+00

beta <- - solve(K[b,b]) %*% K[b,a]
print(beta)

            [,1] [,2]
[1,] 2.88658e-17  0.4

print(solve(K[b,b]))

     [,1]
[1,] 0.84

Conditional independence in Gaussian models

Let X_V = (Y, X, Z) a trivariate Gaussian distribution with mean zero. It is rather easy to verify whether X \;\perp \hspace{-2.2ex}\perp\;Y \mid Z by looking at the regression coefficients.

We start from the equation Y = \beta_{X} X + \beta_{Z} Z + \varepsilon_Y where \varepsilon_Y \sim N(0, \sigma_{YY|XZ}) is jointly independent on (X, Z).

Recall that \sigma_{YY|XZ} is a constant.

A fundamental result

Proposition The following conditions are equivalent: \begin{align*} Y \;\perp \hspace{-2.2ex}\perp\;X \mid Z\\ \beta_X = 0\\ \kappa_{YX} = 0. \end{align*}

Proof: Y \;\perp \hspace{-2.2ex}\perp\;X\mid Z if and only if p(y\mid x, z) = p(y|x) for all (x,y,z). In a conditional Gaussian distribution this is equivalent to Y = \beta_{Z} Z + \varepsilon_Y with \beta_X = 0.

Thus the last condition follows from \beta_X = \frac{-\kappa_{YX}}{\kappa_{YY}}.

Notice that here \beta_{X} is a conditional regression coefficient of Y on X given Z. This is sometimes denoted by \beta_{Y|X.Z} not to be confused with a marginal regression coefficient \beta_{Y|X}.

2 Undirected graphs

2.1 Graphical models

It is useful to regard conditional independence as expressing irrelevance. For example, X\;\perp \hspace{-2.2ex}\perp\;Y \mid Z is interpretable by saying that if we know Z the information about Y is irrelevant for knowledge of X.
The general idea of graphical models is to use graphs to represent a set of (conditional or marginal) independences. This is done by identifying the variables with the nodes and the pairwise dependencies as edges so that the missing edges and the separation between groups of nodes express the independencies.

Undirected graph

We give first the definition of an undirected graph based on two fundamental objects:

a set of nodes (or vertices) V
and a set of edges E = \{ \{i,j\} : i,j \in V, i \ne j\}.

The graph is therefore denoted by \mathcal{G}= (V, E).

Definitions

All the edges connecting two nodes i and j are represented by lines i -\!\!\!- j. If i and j are connected by an edge they are sais to be adjacent (in symbols i \sim j) in the graph. The nodes adjacent to i are said the neighbours of i and their set is said the boundary of i.

A compact way of defining an undirected graph is the adjacency matrix defined by a symmetric |V| \times |V| Boolean matrix G = (g_{ij}) whose entries are g_{ij} = 1 if i\sim j and 0 otherwise.

Given a subset of nodes W of \mathcal{G}, the graph \mathcal{G}_W with nodes W and edges from \mathcal{G} whose endpoints are contained in W is said the an induced subgraph of \mathcal{G}.

Examples

The following examples use the R package gRbase.

library(gRbase)
ug1 <-  ug(~A:B + B:C:D + E, result = "igraph")
ug1

IGRAPH 12dc986 UN-- 5 4 -- 
+ attr: name (v/c)
+ edges from 12dc986 (vertex names):
[1] A--B B--C B--D C--D

ug(~A:B + B:C:D + E, result = "matrix")

  A B C D E
A 0 1 0 0 0
B 1 0 1 1 0
C 0 1 0 1 0
D 0 1 1 0 0
E 0 0 0 0 0

plot(ug1, 
      vertex.size = 40, vertex.label.cex =1.3, 
      vertex.label.family = "sans", edge.width = 2)

Paths

In a graph there are missing edges and multiple connections creating paths. A path is a sequence of adjacent nodes without repetition, meaning that following a path you cannot meet a node multiple times.

In this graph there are two paths from A to C, that is A, B, C and A, B, D, C.

The concept of separation

Definition Two nodes i, j are said separated by S in the graph if every path from i to j contain at least one node in S.

Example In the previous graph ug, A and C are separated by S = \{B, D\}. They are also separated by S = \{B\}, but they are not separated by \{D\}. In fact, in the two paths A, B, C and A, B, D, C the first does not contain D.

In the graph the node E is not adjacent to any other node. Therefore, for example, there are no paths from A to E. In this case we say that A and E are separated by the empty set.

The definition of separation is generalized to three subset of nodes A, B, S not necessarily disjoint. In this case we say that A and B are separated by S if every path from any node i in A and any node j in B contains at least one node in S.

Exercises

Exercise Is it true that the definition of separation if and only if A and B are separated by the empty set in \mathcal{G}_{V\setminus S} ?
Exercise Consider the graph

Check if \{X,W\} and Z are separated by Y.

2.2 Markov properties

A graph should code an independence model for the variables associated to its nodes. The independencies among the variables are read off the graph according to its edges. The rules that define the independencies are called the undirected graph Markov properties. The undirected graph models are called Markov networks or Markov random fields.

The pairwise Markov property

Let \mathcal{G} be and undirected graph with set of nodes V, and let p be a probability distribution over the random variables X_V.

Definition We say that p satisfies the pairwise Markov property for \mathcal{G} if i \not\sim j \implies X_i \;\perp \hspace{-2.2ex}\perp\;X_j \mid X_{X \setminus \{i,j\}} That is, for each missing edge in \mathcal{G} the variables X_i and X_j are conditionally independent given the remaining ones.

Example Given the graph ug2 a distribution obeys the pairwise Markov property if and only if

W \;\perp \hspace{-2.2ex}\perp\;U \mid X,Y,Z,
X \;\perp \hspace{-2.2ex}\perp\;Z \mid U,Y,W and
W \;\perp \hspace{-2.2ex}\perp\;Z \mid X,Y,U.

Let \mathcal{G} be and undirected graph with set of nodes V, and let p be a probability distribution over the random variables X_V.

The global Markov property

Definition We say that p satisfies the global Markov property for \mathcal{G} if for any disjoint sets A, B, S \text{$A$ and $B$ are separated by $S$ in } \mathcal{G}\implies X_A \;\perp \hspace{-2.2ex}\perp\;X_B \mid X_C

Example If a distribution p obeys the global Markov property with respect to the graph ug1 then for example

A \;\perp \hspace{-2.2ex}\perp\;C \mid B
A \;\perp \hspace{-2.2ex}\perp\;CD \mid B
AE \;\perp \hspace{-2.2ex}\perp\;CD \mid B
C \;\perp \hspace{-2.2ex}\perp\;E \mid \emptyset, that is C \;\perp \hspace{-2.2ex}\perp\;E.

The separation can be verified in R using the function separates in package gRbase as follows.

separates("A", "C", "B", ug1)

[1] TRUE

separates("A", c("C", "D"), "B", ug1)

[1] TRUE

separates("C", "E", "", ug1)

[1] TRUE

Proposition The global Markov property implies the pairwise Markov property.

Exercise Prove the claim.

2.3 Cliques and Factorization

Let \mathcal{G} be an undirected graph with vertices V.

Cliques

Definition A subset of nodes C is said to be complete if i \sim j for every i, j \in C. A maximal complete set is called a clique.

Let’s find the cliques of the graph ug2 usng the R package igraph.

library(igraph, warn = FALSE)

plot(ug2,vertex.size = 40, vertex.label.cex =1.3, 
      vertex.label.family = "sans", edge.width = 2)

max_cliques(ug2)

[[1]]
+ 3/5 vertices, named, from b941197:
[1] U Y Z

[[2]]
+ 3/5 vertices, named, from b941197:
[1] U Y X

[[3]]
+ 3/5 vertices, named, from b941197:
[1] Y X W

So the set of cliques for the graph ug2 is \mathcal{C}(\mathcal{G}) = \{\{U, Y, Z\}, \{U, Y, X\}, \{Y, X, W\}\}.

Finding the maximal cliques of a general undirected graph requires exponential time.

Factorization

Definition We say a distribution with density p factorizes according to \mathcal{G} if

p(x_V) = \prod_{C \in \mathcal{C}(\mathcal{G})} \psi_C(x_C)

for some functions \psi_C(x_C) called potentials.

Example Let x_V = (x_1, x_2, x_3) a binary random vector with

p(x_V) =2^{\frac{1}{13} x_1 x_2 + x_1 x_3}

factorizes as the product of two potentials

\psi_{12}(x_1, x_2) =2^{ \frac{1}{13}x_1 x_2}, \qquad \psi_{13}(x_1, x_3) = 2^{x_1 x_3}

according to \mathcal{G}= 2 -\!\!\!- 1 -\!\!\!- 3 with cliques \{1,2\} and \{1, 3\}. Moreover the factorization implies the conditional independence X_2 \;\perp \hspace{-2.2ex}\perp\;X_3 \mid X_1.

Relations among the Markov properties

The following result shows that in fact the factorization according to the cliques implies the conditional independencies given by the global Markov property.

Theorem If p(x_V ) factorizes according to \mathcal{G}, then p obeys the global Markov property with respect to \mathcal{G}.

Proof See Evans (2024) p. 20.

But there is also a result by Hammersley and Clifford, valid for strictly positive distributions that shows that the pairwise Markov property implies the factorization.

Theorem (Hammersley-Clifford) If p(x_V) > 0 obeys the pairwise Markov property with respect to \mathcal{G}, then p factorizes according to \mathcal{G}.

Proof See Lauritzen (1996).

We can conclude from the previous results that \text{factorization} \implies \text{global Markov property} \implies \text{pairwise Markov property}.
Moreover if p is positive \text{factorization} \iff \text{pairwise Markov property} in which case all three conditions are equivalent.
Remember that in general the result is not true if p is not strictly positive.

2.4 Decompositions

Some undirected graph models have the property of decomposability.

Definition If the set of nodes is V = A \cup B \cup S, with A, B, S disjoint we say that (A, B, S) is a decomposition of \mathcal{G} if

the subgraph \mathcal{G}_S is complete
A and B are separated by S in \mathcal{G}.

Moreover if A and B are both non-empty the decomposition is said to be proper.

R Example

ug5 <- ug(~ x1:x2:x3 + x3:x4:x5)
plot(ug5, vertex.size = 40, vertex.label.cex =1.3, 
      vertex.label.family = "sans", edge.width = 2)

is.decomposition(c("x4", "x5"), c("x1", "x2"), "x3", ug5)

[1] TRUE

separates(c("x4", "x5"), c("x1", "x2"), "x3", ug5)

[1] TRUE

Thus the graph G admits a decomposition with A = \{X_4, X_5\}, B = \{X_2, X_3\}, S = \{X_3\}.

Note that S containing a single node is complete.

Exercise Find another decomposition of the previous graph.

Exercise Consider the graph below. Find all the possible separating sets. Is there a decomposition?

ug4 <- ug(~ A:B + B:C + C:D + A:D)
plot(ug4, vertex.size = 40, vertex.label.cex =1.3, 
      vertex.label.family = "sans", edge.width = 2)

Decomposability

Definition An undirected graph is said to be decomposable if it is complete or there is a proper decomposition (A, B, S) and \mathcal{G}_{A\cup S} and \mathcal{G}_{B \cup S} are also decomposable.

The concept of decomposability is quite important and it is related to the structure of cycles in the undirected graph \mathcal{G}.

A cycle is a sequence of distinct nodes \langle v_1, v_2, \dots, v_k \rangle such that there is a path v_1- v_2- \cdots - v_k and an edge v_k - v_1.
A chord on a cycle is any edge between two nodes not adjacent on the cycle.
A graph is said chordal or triangulated if whenever there is a cycle of length \ge 4 it contains a chord.

Then we have the following theorem:

Theorem Let \mathcal{G} be a un undirected graph. Then \mathcal{G}\text{ is decomposable } \iff \mathcal{G}\text{ is triangulated}

Proof See Evans (2024).

R examples

plot(ug1)

is.triangulated(ug1)

[1] TRUE

plot(ug2)

is.triangulated(ug2)

[1] TRUE

plot(ug4)

is.triangulated(ug4)

[1] FALSE

plot(ug5)

is.triangulated(ug5)

[1] TRUE

Exercise Find without using R whether the following graph is decomposable

ug5 <- ug(~ a:b:c + d:e:c + a:d + b:e)
plot(ug5, vertex.size = 40, vertex.label.cex =1.3, 
      vertex.label.family = "sans", edge.width = 2)

2.5 The running intersection property

If a graph is triangulated it can be proved that the cliques can be always ordered as (C_1,\dots,C_k) say, to respect the so called running intersection property.
The property states that for every j = 2, \dots, k there is a \sigma(j) such that

C_j \cap (C_1 \cup \cdots \cup C_{j-1}) = C_j \cap C_{\sigma(j)},

that is such that the intersection of each set with all the preceding objects is contained in a single set.

Given that order, we define the jth separator set for j \ge 2 as S_j= C_j \cap (C_1 \cup \cdots \cup C_{j-1}).
Also we define the sets R_j= C_j \setminus S_j with S_1 = \emptyset.
The sets S_j are called separators as they separate R_j from \bigcup_{i=1}^{j-1} C_i \setminus S_j.
Any clique C_i where S_j \subset C_i with i < j is a possible parent of C_i.

In R package gRbase the rip() function returns the running intersection ordering if the graph is triangulated (otherwise, it returns list()).

Examples

Example Consider again the triangulated graph ug2:

plot(ug2,  vertex.size = 40, vertex.label.cex =1.3, 
      vertex.label.family = "sans", edge.width = 2)

print(rip(ug2)[1:3])

$nodes
[1] "U" "Z" "Y" "X" "W"

$cliques
$cliques[[1]]
[1] "U" "Y" "Z"

$cliques[[2]]
[1] "U" "Y" "X"

$cliques[[3]]
[1] "Y" "X" "W"


$separators
$separators[[1]]
character(0)

$separators[[2]]
[1] "U" "Y"

$separators[[3]]
[1] "Y" "X"

Example Another graph with 7 nodes:

ug7 <- ug(~ x4:x6 + x6:x7 + x7:x3:x2  + x3:x2:x5 + x3:x5:x1)
plot(ug7,  vertex.size = 40, vertex.label.cex =1.3, 
      vertex.label.family = "sans", edge.width = 3, layout = layout_as_tree)

print(rip(ug7)[1:3])

$nodes
[1] "x4" "x6" "x7" "x3" "x2" "x5" "x1"

$cliques
$cliques[[1]]
[1] "x4" "x6"

$cliques[[2]]
[1] "x6" "x7"

$cliques[[3]]
[1] "x2" "x3" "x7"

$cliques[[4]]
[1] "x2" "x3" "x5"

$cliques[[5]]
[1] "x5" "x3" "x1"


$separators
$separators[[1]]
character(0)

$separators[[2]]
[1] "x6"

$separators[[3]]
[1] "x7"

$separators[[4]]
[1] "x2" "x3"

$separators[[5]]
[1] "x5" "x3"

GYO algorithm

An algorithm to verify if the cliques satisfy the running intersection property was proposed in 1979 by Graham and independently by Yu and Özsoyoğlu. This algorithm is named GYO and is used in database theory to check whether a hypergraph is acyclic. See Ullman slides.

Below there is my R implementation.

`gyo` <- function(mar) {
  while (length(mar) > 1) {
    fe <- find_ears(mar)
    if (all(fe == FALSE)) {
      return(FALSE)
    }
    h <- mar[fe][1]
    mar <- setdiff(mar, h)
    print(sapply(h, function(x) paste(x, collapse = ",")))
  }
  print(sapply(mar, function(x) paste(x, collapse = ",")))
  return(TRUE)
}

`find_ears` <- function(mar) {
  T <- length(mar)
  e <- rep(0, T)
  for (t in 1:T) {
    h <- mar[t]
    dif <- setdiff(mar, h)
    U <- intersect(unlist(h), unique(unlist(dif)))
    ear <- sum(sapply(dif, function(x) all(U %in% x)))
    # print( sapply(h, function(x) paste(x, collapse =  ".")) )
    e[t] <- ear
  }
  e >= 1
}

Example Check whether the ordering of the following cliques satisfies the running intersection property.

cli <- list(c(4,6), c(6,7), c(2,3,5), c(1,3,5), c(7,3,2))
gyo(cli)

[1] "4,6"
[1] "6,7"
[1] "1,3,5"
[1] "2,3,5"
[1] "7,3,2"

[1] TRUE

cli2 <- list(c(4,6), c(6,7), c(2,3,5), c(1,3,5), c(7,3,4))
gyo(cli2)

[1] "2,3,5"
[1] "1,3,5"

[1] FALSE

cli3 <- list(c(1,2), c(2,3), c(3,4), c(1,4))
gyo(cli3)

[1] FALSE

2.6 Decomposable graphs and factorization

Theorem Let \mathcal{G} be a decomposable undirected graph with cliques C_1, \dots, C_k. Then a distribution p factorizes with respect to \mathcal{G} if and only if

p(x_V) = \prod_{i=1}^k p(x_{C_i \setminus S_i} \mid x_{S_i} ) = \prod_{i=1}^k \frac{p(x_{C_i})}{p(x_{S_i})}.

The quantities p(x_{C_i\setminus S_i} \mid x_{S_i} ) are variation independent and inference for p(x_V) can be based on separate inferences for each p(x_{C_i}).

Proof See Evans (2024).

Example The factorization for the decomposable undirected graph ug7 is based on the clique ordering \begin{array}{ccc} C_i & S_i & C_i \setminus S_i \\ \hline 4,6 & \emptyset & 4,6 \\ 6,7 & 6 & 7\\ 2,3,7 & 7 & 2,3\\ 2,3,5 & 2,3 & 5\\ 1,3,5 & 3,5 & 1\\ \hline \end{array}

Therefore \begin{align*} p(x_V) &= p(x_4, x_6) \cdot \frac{p(x_6, x_7)}{p(x_6)} \cdot \frac{p(x_2, x_3, x_7)}{p(x_7)} \cdot \frac{p(x_2, x_3, x_5)}{p(x_2, x_3)} \cdot \frac{p(x_1, x_3, x_5)}{p(x_3, x_5)}\\ &= p(x_4, x_6) \cdot p(x_7 | x_6) \cdot p(x_2, x_3|x_7) \cdot p(x_5|x_2,x_3) \cdot p(x_1| x_3, x_5). \end{align*}

2.7 Fit a decomposable model to a contingency table

The last result is quite useful for statistical inference since we only need to consider the marginals of variables corresponding to cliques.
Suppose that we have a contingency table with counts n(x_V). The likelihood for a decomposable graph is

\begin{align*} \ell(p; n) &= \sum_{x_V} n(x_V) \log p(x_V) \\ &= \sum_{x_V} n(x_V) \sum_{i = 1}^k \log p(x_{C_i\setminus S_i} \mid x_{S_i})\\ &= \sum_{i = 1}^k \sum_{x_{C_i}} \log p(x_{C_i \setminus S_i} \mid x_{S_i}) \end{align*}

so inference about p(x_{C_i \setminus S_i} \mid x_{S_i}) should be based entirely upon n(x_{C_i}).
Using Lagrange multipliers to find the maximum under normalization constraints we can find the MLE in closed form

\hat p(x_{C_i \setminus S_i} \mid n_{x_{S_i}}) = \frac{n(x_{C_i})}{n(x_{S_i})} \quad \text{ and }\quad \hat p(x_{C_i}) = \frac{ n(x_{C_i})}{N}.

using the empirical distribution for each clique.

Example

Consider this table with 6 binary variables collected at the beginning of a 15 year follow-up study of probable risk factors for coronary thrombosis. Data are from all men employed in a car factory.

Variables

smoking
strenous mental work
strenuous physical work
systolic blood pressure
ratio of lipoproteins
Family anamnesis of coronary heart disease.

ftable(smoke + mental + phys ~ systol + protein + family , data = reinis)

                      smoke    y               n            
                      mental   y       n       y       n    
                      phys     y   n   y   n   y   n   y   n
systol protein family                                       
y      y       y              44 129 112  12  40 145  67  23
               n               5   9  21   1   7  17   9   4
       n       y              23  50  70   7  32  80  66  13
               n               7   9  14   2   3  16  14   3
n      y       y              35 109  80   7  12  67  33   9
               n               4  14  11   5   3  17   8   2
       n       y              24  51  73   7  25  63  57  16
               n               4   5  13   4   0  14  11   4

Structure learning based on model selection suggested the following decomposable graph with 4 cliques.

g <-  ug(~ systol *smoke *protein + protein*phys *smoke + phys * smoke * mental + mental*family)

plot(g,vertex.size = 40, vertex.label.cex =1.3, 
      vertex.label.family = "sans", edge.width = 3)

Below I show the cliques, the separators and the running intersection ordering.

print(rip(g)[2:3])

$cliques
$cliques[[1]]
[1] "smoke"   "protein" "systol" 

$cliques[[2]]
[1] "smoke"   "protein" "phys"   

$cliques[[3]]
[1] "smoke"  "mental" "phys"  

$cliques[[4]]
[1] "family" "mental"


$separators
$separators[[1]]
character(0)

$separators[[2]]
[1] "smoke"   "protein"

$separators[[3]]
[1] "smoke" "phys" 

$separators[[4]]
[1] "mental"

2.8 Maximum likelihood

The multinomial graphical model is fitted by maximum likelihood with a Poisson generalized linear model using the result below:

Proposition: Let X_i \sim \text{Poisson}(\mu_i) independently, and let N = \sum_{i=1}^t X_i. Then,

\begin{align*} N &\sim \text{Poisson} (\sum_i \mu_i)\\ (X_1, \dots, X_t)^T \mid N = n &\sim \text{Multinom}(n, (\pi_1, \dots, \pi_t)^T) \end{align*}

where \pi= \mu_i / \sum_j \mu_j.

Specifically, if we assume that the joint frequencies are independently Poisson X_i, then we get a Multinomial by conditioning on their sum.

df <- as.data.frame(reinis)
ml <-  glm(Freq ~ systol *smoke *protein + protein*phys *smoke + phys * smoke * mental + mental*family, 
    family = poisson, data = df)
ml


Call:  glm(formula = Freq ~ systol * smoke * protein + protein * phys * 
    smoke + phys * smoke * mental + mental * family, family = poisson, 
    data = df)

Coefficients:
            (Intercept)                  systoln                   smoken  
                3.71485                 -0.22841                 -0.21133  
               proteinn                    physn                  mentaln  
               -0.41861                  1.10791                  0.95080  
                familyn           systoln:smoken         systoln:proteinn  
               -1.93627                 -0.49731                  0.22290  
        smoken:proteinn           proteinn:physn             smoken:physn  
                0.35535                 -0.43706                  0.34545  
          physn:mentaln           smoken:mentaln          mentaln:familyn  
               -3.11567                 -0.21704                  0.29251  
systoln:smoken:proteinn    smoken:proteinn:physn     smoken:physn:mentaln  
                0.32489                 -0.01973                  0.60616  

Degrees of Freedom: 63 Total (i.e. Null);  46 Residual
Null Deviance:      2027 
Residual Deviance: 51.36    AIC: 377

Notice that the fitted values coincide within the marginal distribution of the cliques.

nhat <- fitted(ml)
dfhat <- df; dfhat$Freq <- nhat
tabhat <- xtabs(nhat ~ . , data = dfhat)
margin.table(tabhat, margin = c("family", "mental"))

      mental
family   y   n
     y 929 652
     n 134 126

margin.table(reinis, margin = c("family", "mental"))

      mental
family   y   n
     y 929 652
     n 134 126

margin.table(tabhat, margin = c("mental", "phys", "smoke"))

, , smoke = y

      phys
mental   y   n
     y 146 376
     n 394  45

, , smoke = n

      phys
mental   y   n
     y 122 419
     n 265  74

margin.table(reinis, margin = c("mental", "phys", "smoke"))

, , smoke = y

      phys
mental   y   n
     y 146 376
     n 394  45

, , smoke = n

      phys
mental   y   n
     y 122 419
     n 265  74

Conditional independencies

Also the empirical distribution of the cliques given the separators match those in the fitted values.

2.9 Gaussian graphical models

Preliminaries

Recall a basic result for the multivariate Gaussian distribution: if X_V \sim N_d(\mu, \Sigma) and A is a q × d matrix of full rank q then A X_V \sim N_q(A\mu, A\Sigma A^T).
From here on we will assume throughout that \mu = 0. Note that the dependence structure is entirely determined by \Sigma, and \mu is an orthogonal parameter to \Sigma.
We consider only the case when \Sigma >0, thus the density is strictly positive
Hence, by the Hammersley-Clifford Theorem, the pairwise and global Markov properties, and the factorization criterion are all equivalent.
Recall that X_a \;\perp \hspace{-2.2ex}\perp\;X_b if and only if \Sigma_{ab} = 0 and that the composition property holds. Thus X \;\perp \hspace{-2.2ex}\perp\;Y and X \;\perp \hspace{-2.2ex}\perp\;Z does imply X \;\perp \hspace{-2.2ex}\perp\;YZ.

A basic result

Proposition Let K = \Sigma^{−1}. Then the distribution of X_V is Markov with respect to a graph \mathcal{G} if and only if \kappa_{ij} = 0 whenever there is a missing edge i -\!\!\!-j in \mathcal{G}.

R example

K <- matrix(c(4, 1.2, -1.3, 0.8,
             1.2, 3,   0,   2.1,
            -1.3, 0,   2,    0,
             0.8, 2.1, 0,    2), 4, 4, byrow = TRUE)
S <- solve(K)
eigen(S)$values

[1] 2.9841818 0.8150429 0.2895912 0.1670892

There are two zero concentrations $_{23} = 0 $ and \kappa_{34}=0. Both imply X_2 \;\perp \hspace{-2.2ex}\perp\;X_3 \mid (X_1, X_4) \quad \& \quad X_3 \;\perp \hspace{-2.2ex}\perp\;X_4 \mid (X_1, X_2) Thus the Gaussian distribution is Markov wrt the graph

G <- ug(~ 1:2:4 + 1:3)
plot(G, vertex.size = 40, vertex.label.cex =1.3, 
      vertex.label.family = "sans", edge.width = 2)

Structure of the concentration matrix for a decomposition

Notation

If M is a matrix whose rows and columns are indexed by A \subseteq V, when we write \{M\}_{AA} we mean a matrix of the same order whose A,A entries are M and with zeroes elsewhere.

# R example of {M}
expand <- function(M, d, A){
    M0 <- matrix(0, d, d)
    M0[A,A] <- M
    M0
}
M <- matrix(1:9, 3, 3)
A <- 3:5
expand(M, 5, A)

     [,1] [,2] [,3] [,4] [,5]
[1,]    0    0    0    0    0
[2,]    0    0    0    0    0
[3,]    0    0    1    4    7
[4,]    0    0    2    5    8
[5,]    0    0    3    6    9

Lemma Let \mathcal{G} be a graph with decomposition (A, B, S), and X_V \sim N_d(0, \Sigma). Then p(x_V ) is Markov with respect to \mathcal{G} if and only if the following identity holds:

K = \{\Sigma_{AS,AS}^{-1}\}_{AS,AS} + \{\Sigma_{SB,SB}^{-1}\}_{SB,SB} - \{\Sigma_{SS}^{-1}\}_{SS} and \Sigma_{AS,AS} and \Sigma_{SB,SB} are Markov with respect to \mathcal{G}_{A\cup S} and \mathcal{G}_{B\cup S} respectively.

# R Example 

A <- c(1,2); S <- c(3); B <- c(4,5)

K <- structure(c(0.005234, -0.00244,-0.00287,  0,          0, 
                 -0.00244, 0.01035, -0.00561,  0,          0, 
                 -0.00287, -0.00561, 0.02849, -0.00755,   -0.00493, 
                 0,            0,   -0.00755,  0.00982,   -0.00204,
                 0,            0,   -0.00493, -0.00204,    0.00644), 
               dim = c(5, 5), dimnames = list(1:5, 1:5))

Sigma <- solve(K)
Sigma

         1         2         3         4         5
1 306.1365 127.27471 101.66135 100.97305 109.80986
2 127.2747 172.80277  85.19758  84.62074  92.02645
3 101.6614  85.19758 112.96635 112.20150 122.02099
4 100.9730  84.62074 112.20150 220.44805 155.72476
5 109.8099  92.02645 122.02099 155.72476 298.01894

AS <- union(A, S)
SB <- union(S, B)
expand(solve(Sigma[AS,AS]), 5, 1:3) + 
expand(solve(Sigma[SB,SB]), 5, 3:5) -
expand(solve(Sigma[S,S]), 5, 3)

          [,1]     [,2]     [,3]     [,4]     [,5]
[1,]  0.005234 -0.00244 -0.00287  0.00000  0.00000
[2,] -0.002440  0.01035 -0.00561  0.00000  0.00000
[3,] -0.002870 -0.00561  0.02849 -0.00755 -0.00493
[4,]  0.000000  0.00000 -0.00755  0.00982 -0.00204
[5,]  0.000000  0.00000 -0.00493 -0.00204  0.00644

By applying the previous result repeatedly to a decomposable graph with cliques C_t and separators S_t (t = 1, \dots, k) we get that X_V is Markov with respect to \mathcal{G} if and only if \Sigma^{-1} = \sum_{t=1}^k \{\Sigma_{C_t, C_t}^{-1}\}_{C_t, C_t} - \sum_{t=2}^k \{\Sigma_{S_t, S_t}^{-1}\}_{S_t, S_t}.

2.10 Maximum likelihood estimation

Let $(X^{(1)}, , X^{(n)})T $ be an i.i.d. sample of size n from \sim N(0, \Sigma).

Under an unrestricted model the sufficient statistics for \Sigma is the sample covariance matrix W = \textstyle \frac{1}{n} \sum_{i=1}^n X^{(i)} X^{(i)T}. Also W\equiv \hat\Sigma is the MLE for \Sigma.
If there is a graphical model defined by \mathcal{G} let \hat \Sigma^{\mathcal{G}} be the MLE of the covariance matrix under the restriction that the distribution satisfies the Markov property for \mathcal{G}.
Also let \hat K^\mathcal{G} its inverse.

As i\not \sim j \implies \kappa_{ij} = 0, it turns out that the sufficient statistics for a graph \mathcal{G} reduce to the entries in W that correspond to edges in the graph.

Thus the estimated concentration matrix \hat K is defined by \begin{align*} \kappa_{ij} = 0 & \qquad \text{ if } i \not\sim j\\ \kappa^{ij} = W_{ij} & \qquad\text{ if } i \sim j \end{align*} where \kappa^{ij} is the entry (i,j) of \hat K.

For a decomposable graph \mathcal{G} with cliques C_1,\dots, C_k this means that the MLE is

\hat K^{-1} = \sum_{t=1}^k \{W_{C_t, C_t}^{-1}\}_{C_t, C_t} - \sum_{t=2}^k \{W_{S_t, S_t}^{-1}\}_{S_t, S_t}. This matches the sufficient statistics so that \hat \Sigma_{C_t,C_t} = W_{C_t, C_t} for each t = 1, \dots, k.

2.11 Fitting non-decomposable graphs

If the graph \mathcal{G} is non-decomposable the MLE is not in closed form and we need an algorithm based on iterative proportion fitting (IPF). We can use the R function fitConGraph in package ggm.

Example

Consider the dataset simdat with 6 variables observed on 200 individuals assumed to be sampled from a multivariate Gaussian distribution and estimate the sample covariance matrix S

source("simdat.R")
S <- cov(simdat)
print(round(S, 3))

        y     x1     x2     x3     x4     x5
y   4.569 -2.415 -0.337  2.813 -0.646  0.418
x1 -2.415  2.485  0.291 -1.630  0.697  0.003
x2 -0.337  0.291  0.488 -0.593  0.053 -0.023
x3  2.813 -1.630 -0.593  4.040 -0.320  0.173
x4 -0.646  0.697  0.053 -0.320  1.009 -0.047
x5  0.418  0.003 -0.023  0.173 -0.047  1.094

A simple method to select a graphical model is considering all the partial correlations conditional on the rest of the variables, i.e., r_{ij.V\setminus \{i,j\}} that can be found using the estimator \rho_{ij.V\setminus \{i,j\}} = - \frac{\sigma^{ij}}{\sqrt{\sigma^{ii}\sigma^{jj}}} where k^{ij} = (S^{-1})_{ij} are the estimated concentrations. The partial correlations can be obtained with parcor.

library(ggm, warn = FALSE)
round(cor(simdat),3)

        y     x1     x2     x3     x4     x5
y   1.000 -0.717 -0.226  0.655 -0.301  0.187
x1 -0.717  1.000  0.265 -0.514  0.440  0.002
x2 -0.226  0.265  1.000 -0.423  0.075 -0.032
x3  0.655 -0.514 -0.423  1.000 -0.158  0.082
x4 -0.301  0.440  0.075 -0.158  1.000 -0.045
x5  0.187  0.002 -0.032  0.082 -0.045  1.000

P <- parcor(cor(simdat))
round(P,3)

        y     x1     x2     x3     x4     x5
y   1.000 -0.573  0.141  0.485 -0.005  0.257
x1 -0.573  1.000  0.132 -0.046  0.347  0.205
x2  0.141  0.132  1.000 -0.365 -0.018 -0.037
x3  0.485 -0.046 -0.365  1.000  0.074 -0.044
x4 -0.005  0.347 -0.018  0.074  1.000 -0.057
x5  0.257  0.205 -0.037 -0.044 -0.057  1.000

The values in the partial correlation matrix that are closer to zero indicate the missing edges:

G <- 0 + (abs(P)>= 0.05); diag(G) <- 0
G

   y x1 x2 x3 x4 x5
y  0  1  1  1  0  1
x1 1  0  1  0  1  1
x2 1  1  0  1  0  0
x3 1  0  1  0  1  0
x4 0  1  0  1  0  1
x5 1  1  0  0  1  0

This suggests a graph to be fitted.

drawGraph(G)

The graph is non-decomposable and we use the IPF

ml <- fitConGraph(G, S, n = 200)
ml

$Shat
            y           x1           x2         x3          x4           x5
y   4.5688786 -2.414701046 -0.336759557  2.8129414 -0.64548858  0.417974557
x1 -2.4147010  2.485218932  0.291400308 -1.5457704  0.69693839  0.002859092
x2 -0.3367596  0.291400308  0.487919173 -0.5933878  0.06524697 -0.008774948
x3  2.8129414 -1.545770408 -0.593387850  4.0395543 -0.31966182  0.240089720
x4 -0.6454886  0.696938390  0.065246970 -0.3196618  1.00875647 -0.047098406
x5  0.4179746  0.002859092 -0.008774948  0.2400897 -0.04709841  1.093722417

$dev
[1] 1.228574

$df
[1] 5

$it
[1] 4

Note that the IPF required 4 iterations to find the MLE
The Deviance is \chi^2_5 = 1.223 suggesting a well-fitting model.
The fitted concentration matrix is obtained by inverting the fitted covariance matrix

Khat <- solve(ml$Shat)
round(Khat, 3)

        y     x1     x2     x3     x4     x5
y   0.651  0.476 -0.188 -0.287  0.000 -0.188
x1  0.476  0.997 -0.220  0.000 -0.380 -0.202
x2 -0.188 -0.220  2.566  0.424  0.000  0.000
x3 -0.287  0.000  0.424  0.506 -0.051  0.000
x4  0.000 -0.380  0.000 -0.051  1.241  0.066
x5 -0.188 -0.202  0.000  0.000  0.066  0.990

After removing also the edge x_3\sim x_4 we obtain a decomposable model with a good fit and deviance \chi_5 = 2.6.

G0 <- G
G0[4,3] <- G0[3,4] <- 0
drawGraph(G0)

ml0 <- fitConGraph(G0, S, n = 200)
lapply(ml0, round, 3)

$Shat
        y     x1     x2     x3     x4     x5
y   4.569 -2.415 -0.337  2.813 -0.654  0.418
x1 -2.415  2.485  0.291 -1.455  0.697  0.003
x2 -0.337  0.291  0.488 -0.204  0.081 -0.010
x3  2.813 -1.455 -0.204  4.040 -0.320  0.258
x4 -0.654  0.697  0.081 -0.320  1.009 -0.047
x5  0.418  0.003 -0.010  0.258 -0.047  1.094

$dev
[1] 31.783

$df
[1] 6

$it
[1] 5

round(solve(ml0$Shat), 3)

        y     x1     x2     x3     x4     x5
y   0.659  0.474  0.053 -0.273  0.000 -0.188
x1  0.474  0.991 -0.207  0.000 -0.370 -0.202
x2  0.053 -0.207  2.210  0.000  0.000  0.000
x3 -0.273  0.000  0.000  0.435 -0.039  0.000
x4  0.000 -0.370  0.000 -0.039  1.238  0.064
x5 -0.188 -0.202  0.000  0.000  0.064  0.989

2.12 References

Cox, R., D., and N. Wermuth. 1996. Multivariate Dependencies. Models, Analysis and Interpretation. London: Chapman & Hall.

Evans, R. J. 2025. “Graphical Models.” Lecture notes, University of Oxford. https://www.stats.ox.ac.uk/~evans/gms/lecturenotes_25.pdf.

Højsgaard, S., D. Edwards, and S. Lauritzen. 2012. Graphical Models with r. Use r! Springer: New York. https://books.google.it/books?id=em5QdpWmljAC.

Whittaker, J. 1990. Graphical Models in Applied Multivariate Statistics. Chichester: John Wiley & Sons.