1 Introducción

Podemos pensar que los vértices de una red pertenecen a clases y que la propensión a establecer vínculos entre pares de vértices depende de la pertenencia de clase de los dos vértices.

Los enlaces se producen debido a la equivalencia estructural (structural equivalence), es decir, la similitud de los roles sociales.

El vértice \(i\in V\) del grafo \(G=(V,E)\) pertenece a una sola clase de una partición \(\mathcal{P} = \{C_1,\ldots,C_Q\}\) de \(V\) con \(Q\) comunidades.

El modelo de bloques se puede escribir como: \[ p(\mathbf{y}\mid\boldsymbol{\theta}) = \frac{1}{\kappa}\,\exp{\left\{ \sum_{q,r}\theta_{q,r}\, L_{q,r}(\mathbf{y}) \right\}} \] donde:

2 Modelo de bloques estocásticos

En la práctica, no se sabe a qué clases pertenecen los vértices ni el número de clases.

La pertenencia a una clase (class membership) de cada vértice \(i\) se determina de forma independiente, de acuerdo con una distribución de probabilidad común sobre el conjunto \(\{1,\ldots,Q\}\): \[ z_{i,q} = \begin{cases} 1 & \mbox{si el vértice $i$ pertence a la comunidad $q$} \\ 0 & \mbox{en otro caso} \end{cases} \]

Así, \[ \boldsymbol{z}_i\mid \boldsymbol{\alpha}\stackrel{\text{iid}}{\sim} \textsf{Multinomial}(1,\boldsymbol{\alpha}) \qquad\Longleftrightarrow\qquad \textsf{Pr}(z_{i,q}=1\mid\alpha_q) = \alpha_q \] donde \(\boldsymbol{z}_i=(z_{i,1},\ldots,z_{i,Q})\), para \(i = 1,\ldots,n\), y \(\boldsymbol{\alpha}=(\alpha_1,\ldots,\alpha_Q)\) con \(\sum_{q=1}^Q\alpha_q = 1\).

Por lo tanto, dados los valores de \(\boldsymbol{z}_1,\ldots,\boldsymbol{z}_n\) se tiene que las díadas se pueden modelar como condicionalmente independientes con distribución Bernoulli: \[ y_{i,j}\mid\mathbf{\Pi},\boldsymbol{z}_i,\boldsymbol{z}_j \stackrel{\text{ind}}{\sim} \textsf{Bernoulli}\left( \pi_{\xi_i,\xi_j} \right) \] donde \(\mathbf{\Pi}=[\pi_{q,r}]\) es una martiz de \(Q\times Q\) que contiene las probabilidades de interacción, y \(\xi_i=\xi(\boldsymbol{z}_i)\) denota la posición \(q\) en \(\boldsymbol{z}_i\) tal que \(z_{i,q} = 1\) (i.e., \(\xi_i=q\) significa que el vértice \(i\) hace parte de la comunidad \(q\)).

Se han propuesto en la literatura varios métodos que aproximan o modifican el método de máxima verosimilitud. El paquete blockmodels implementa uno de esos métodos llamado algoritmo EM variacional, el cual optimiza un límite inferior en la verosimilitud de los datos observados.

3 Ejemplo: Blogs de política

Red de blogs políticos franceses clasificados por el proyecto Observatoire Presidentielle en cuanto a afiliación política. Un enlace indica que al menos uno de los dos blogs hacía referencia al otro en su página web. Disponible en la libreria sand de R.

Fuente original: http://observatoire-presidentielle.fr/

suppressMessages(suppressWarnings(library(igraph)))
# datos
data(fblog)
# orden
vcount(fblog)
## [1] 192
# tamaño
ecount(fblog)
## [1] 1431
# dirigida?
is_directed(fblog)
## [1] FALSE
# paquete para ajustar SBM
# membership_type = "SBM_sym" para redes no dirigidas 
# membership_type = "SBM" para redes dirigidas
suppressMessages(suppressWarnings(library(blockmodels)))
set.seed(42)
# matriz de adyacencia
A <- as.matrix(igraph::as_adjacency_matrix(fblog))
dim(A)
## [1] 192 192
# formulacion del modelo
fblog.sbm <- blockmodels::BM_bernoulli(membership_type = "SBM_sym", adj = A, verbosity = 0, plotting = "")
# estimacion
fblog.sbm$estimate()

Verosimilitud de clasificación de integración (ICL, integration classification likelihood). Este criterio es similar en espíritu a varios criterios de información de los modelos estándar, por ejemplo, como el criterio de información de Akaike (AIC) y el criterio de información Bayesiana (BIC), etc.), pero adaptado específicamente a problemas de agrupamiento.

Para \(Q=10\), podemos extraer las estimaciones de la probabilidad de pertenencia a las comunidades, es decir, las estimaciones de los valores esperados de los \(\boldsymbol{z}_i\), dado que \(\mathbf{Y} = \mathbf{y}\).

# criterio
ICLs <- fblog.sbm$ICL
ICLs
##  [1] -5028.312 -4580.705 -4318.101 -4118.934 -3986.373 -3865.908 -3810.839
##  [8] -3768.287 -3742.687 -3720.026 -3725.161 -3752.341 -3783.983 -3833.919
## [15] -3886.516
# n. de grupos optimo
Q <- which.max(ICLs)
Q
## [1] 10
# grafico del ICL
plot(fblog.sbm$ICL, xlab = "Q", ylab = "ICL", type = "b", pch = 16)
lines(x = c(Q,Q), y = c(min(ICLs),max(ICLs)), col = "red", lty = 2)

Estas estimaciones se pueden usar para determinar las etiquetas de las asignaciones a las comunidades (cluster assignments o class memberships). Esto se puede lograr por medio de la probabilidad máximal (superiores a 0.8586 en este caso).

# probabilidades estimadas de pertenencia a las comunidades
Z <- fblog.sbm$memberships[[Q]]$Z
head(x = round(Z,3), n = 10)
##        [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
##  [1,] 0.995 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001
##  [2,] 0.995 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001
##  [3,] 0.995 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001
##  [4,] 0.995 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001
##  [5,] 0.995 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001
##  [6,] 0.995 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001
##  [7,] 0.995 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001
##  [8,] 0.001 0.001 0.995 0.001 0.001 0.001 0.001 0.001 0.001 0.001
##  [9,] 0.001 0.001 0.995 0.001 0.001 0.001 0.001 0.001 0.001 0.001
## [10,] 0.001 0.001 0.001 0.001 0.001 0.995 0.001 0.001 0.001 0.001
tail(x = round(Z,3), n = 10)
##         [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
## [183,] 0.001 0.001 0.001 0.001 0.995 0.001 0.001 0.001 0.001 0.001
## [184,] 0.001 0.001 0.001 0.001 0.995 0.001 0.001 0.001 0.001 0.001
## [185,] 0.001 0.001 0.001 0.001 0.995 0.001 0.001 0.001 0.001 0.001
## [186,] 0.001 0.001 0.001 0.001 0.995 0.001 0.001 0.001 0.001 0.001
## [187,] 0.001 0.001 0.001 0.001 0.995 0.001 0.001 0.001 0.001 0.001
## [188,] 0.001 0.001 0.001 0.001 0.995 0.001 0.001 0.001 0.001 0.001
## [189,] 0.001 0.001 0.001 0.001 0.995 0.001 0.001 0.001 0.001 0.001
## [190,] 0.001 0.001 0.001 0.001 0.995 0.001 0.001 0.001 0.001 0.001
## [191,] 0.001 0.001 0.001 0.001 0.995 0.001 0.001 0.001 0.001 0.001
## [192,] 0.001 0.001 0.001 0.001 0.995 0.001 0.001 0.001 0.001 0.001
# dimension
dim(Z)
## [1] 192  10
# asignaciones
cl.labs <- apply(Z,1,which.max)
head(x = cl.labs, n = 10)
##  [1] 1 1 1 1 1 1 1 3 3 6
tail(x = cl.labs, n = 10)
##  [1] 5 5 5 5 5 5 5 5 5 5
# resumen de las probabilidades maximales
summary(Z[cbind(1:vcount(fblog),cl.labs)])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.8586  0.9953  0.9953  0.9938  0.9953  0.9953
# tamaño de las comunidades
table(cl.labs)
## cl.labs
##  1  2  3  4  5  6  7  8  9 10 
## 35 27 11 21 24 25  6  7 13 23
# probabilidades de los grupos
alpha <- table(cl.labs)/vcount(fblog)
round(alpha, 3)
## cl.labs
##     1     2     3     4     5     6     7     8     9    10 
## 0.182 0.141 0.057 0.109 0.125 0.130 0.031 0.036 0.068 0.120
# probabilidades de los grupos (ordenadas)
round(alpha[order(alpha, decreasing = T)], 3)
## cl.labs
##     1     2     6     5    10     4     9     3     8     7 
## 0.182 0.141 0.130 0.125 0.120 0.109 0.068 0.057 0.036 0.031

Probabilidades de interacción \(\mathbf{\Pi}=[\pi_{q,r}]\) (intection probabilities):

# matriz de probabilidades de interaccion
Pi.mat <- fblog.sbm$model_parameters[[Q]]$pi
round(Pi.mat, 3)
##        [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
##  [1,] 0.084 0.003 0.003 0.015 0.004 0.006 0.105 0.062 0.087 0.031
##  [2,] 0.003 0.212 0.007 0.001 0.002 0.005 0.014 0.615 0.034 0.001
##  [3,] 0.003 0.007 0.910 0.001 0.001 0.036 0.018 0.002 0.043 0.001
##  [4,] 0.015 0.001 0.001 0.349 0.001 0.002 0.429 0.069 0.081 0.091
##  [5,] 0.004 0.002 0.001 0.001 0.543 0.009 0.029 0.008 0.093 0.004
##  [6,] 0.006 0.005 0.036 0.002 0.009 0.341 0.041 0.085 0.108 0.004
##  [7,] 0.105 0.014 0.018 0.429 0.029 0.041 0.906 0.191 0.380 0.776
##  [8,] 0.062 0.615 0.002 0.069 0.008 0.085 0.191 0.974 0.382 0.063
##  [9,] 0.087 0.034 0.043 0.081 0.093 0.108 0.380 0.382 0.383 0.071
## [10,] 0.031 0.001 0.001 0.091 0.004 0.004 0.776 0.063 0.071 0.545
# probabilidades grupo 3
round(Pi.mat[3,], 3)
##  [1] 0.003 0.007 0.910 0.001 0.001 0.036 0.018 0.002 0.043 0.001
# grafico
corrplot::corrplot(corr = Pi.mat, type = "full", col.lim = c(0,1),  method = "shade", addgrid.col = "gray90", tl.col = "black")

Matriz de adyacencia original y ordenada:

# funciones
get_adjacency_ordered <- function(xi, A) 
{
  xi2 <- xi[order(xi)]
  indices <- order(xi)
  d <- NULL
  for (i in 1:(length(xi)-1)) if (xi2[i] != xi2[i+1]) d <- c(d, i)
  list(A = A[indices,indices], d = d)
}
heat.plot0 <- function (mat, show.grid = FALSE, cex.axis, tick, labs, col.axis, ...)
{ 
        JJ <- dim(mat)[1]
        colorscale <- c("white", rev(heat.colors(100)))
        if(missing(labs))     labs <- 1:JJ
        if(missing(col.axis)) col.axis <- rep("black", JJ)
        if(missing(cex.axis)) cex.axis <- 1
        if(missing(tick))     tick <- TRUE
        ## adjacency matrix
        image(seq(1, JJ), seq(1, JJ), mat, axes = FALSE, xlab = "", ylab = "", col = colorscale[seq(floor(100*min(mat)), floor(100*max(mat)))], ...)
        for(j in 1:JJ){
                axis(1, at = j, labels = labs[j], las = 2, cex.axis = cex.axis, tick, col.axis = col.axis[j], col.ticks = col.axis[j])
                axis(2, at = j, labels = labs[j], las = 2, cex.axis = cex.axis, tick, col.axis = col.axis[j], col.ticks = col.axis[j])
        }
        box()
        if(show.grid) grid(nx = JJ, ny = JJ)
}
# asignaciones 
xi <- apply(Z,1,which.max)
# asignaciones ordenadas 
xi2 <- xi[order(xi)]
# matriz de adyacencia original
Y <- A
# matriz de adyacencia ordenada y lineas divisorias de acuerdo con las comunidades
tmp   <- get_adjacency_ordered(xi = xi, A = Y)
A.ord <- tmp$A
d     <- tmp$d
# grafico
par(mfrow = c(1,2))
heat.plot0(mat = Y,     tick = F, labs = NA)
heat.plot0(mat = A.ord, tick = F, labs = NA)
abline(v = d+.5, h = d+.5)

Grafo:

par(mfrow = c(1,2))
# grafico original
cols <- RColorBrewer::brewer.pal(n = 9, name = "Set1")
party.nums <- as.numeric(as.factor(V(fblog)$PolParty))
V(fblog)$color <- cols[party.nums]
set.seed(42)
plot(fblog, layout = layout_with_fr, vertex.label=NA, vertex.size = 5, vertex.frame.color = "black")
# grafico comunidades
cols <- RColorBrewer::brewer.pal(n = 12, name = "Paired")
V(fblog)$color <- cols[cl.labs]
set.seed(42)
plot(fblog, layout = layout_with_fr, vertex.label=NA, vertex.size = 5, vertex.frame.color = "black")

Para realizar la bondad de ajuste del modelo se pueden usar estadísticos de prueba por medio de métodos de simulación.

# simulacion
n <- vcount(fblog)
B <- 1200
Pi.mat <- 0.5*(t(Pi.mat) + Pi.mat) # asegurarse que Pi.mat sea simetrica
sdeg <- list(B)
stat <- matrix(NA, B, 6)
for (i in 1:B) {
   bs <- stats::rmultinom(n = 1, size = n, prob = alpha)
   g  <- igraph::sample_sbm(n = n, pref.matrix = Pi.mat, block.sizes = bs, directed = F)
   sdeg[[i]]  <- summary(igraph::degree(g))
   stat[i,1] <- igraph::edge_density(graph = g, loops = F)
   stat[i,2] <- igraph::transitivity(graph = g, type = "global")
   stat[i,3] <- igraph::assortativity_degree(graph = g, directed = F)
   stat[i,4] <- igraph::mean_distance(graph = g, directed = F)
   stat[i,5] <- mean(igraph::degree(graph = g))
   stat[i,6] <- sd(igraph::degree(graph = g))
 }
# valores observados
dens_obs <- igraph::edge_density(graph = fblog, loops = F)
tran_obs <- igraph::transitivity(graph = fblog, type = "global")
asso_obs <- igraph::assortativity_degree(graph = fblog, directed = F)
mdis_obs <- igraph::mean_distance(graph = fblog, directed = F)
mdeg_obs <- mean(igraph::degree(graph = fblog))
sdeg_obs <- sd(igraph::degree(graph = fblog))
# resumen distr. grado observado
summary(degree(fblog))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    8.00   13.00   14.91   18.00   56.00
# resumen distr. grado simulacion
Reduce('+', sdeg)/B
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.941   9.142  13.110  15.155  18.858  49.614
# valor p
round(mean(stat[,1] > dens_obs), 4)
## [1] 0.5725
round(mean(stat[,2] > tran_obs), 4)
## [1] 0.0108
round(mean(stat[,3] > asso_obs), 4)
## [1] 0.9125
round(mean(stat[,4] > mdis_obs), 4)
## [1] 0.1067
round(mean(stat[,5] > mdeg_obs), 4)
## [1] 0.5725
round(mean(stat[,6] > sdeg_obs), 4)
## [1] 0.1733
par(mfrow = c(2,3))

hist(x = stat[,1], freq = F, col = "gray90", border = "gray90", xlim = range(stat[,1]), xlab = "Densidad",       ylab = "Densidad", main = "Densidad")
abline(v = dens_obs, col = 4, lty = 2)
abline(v = quantile(stat[,1], c(0.025, 0.975)), lty = 3, col = 2)

hist(x = stat[,2], freq = F, col = "gray90", border = "gray90", xlim = range(stat[,2]), xlab = "Transitividad",  ylab = "Densidad", main = "Transitividad")
abline(v = tran_obs, col = 4, lty = 2)
abline(v = quantile(stat[,2], c(0.025, 0.975)), lty = 3, col = 2)

hist(x = stat[,3], freq = F, col = "gray90", border = "gray90", xlim = range(stat[,3]), xlab = "Asortatividad", ylab = "Densidad", main = "Asortatividad")
abline(v = asso_obs, col = 4, lty = 2)
abline(v = quantile(stat[,3], c(0.025, 0.975)), lty = 3, col = 2)

hist(x = stat[,4], freq = F, col = "gray90", border = "gray90", xlim = range(stat[,4]), xlab = "Densidad",       ylab = "Densidad", main = "Distancia prom.")
abline(v = mdis_obs, col = 4, lty = 2)
abline(v = quantile(stat[,4], c(0.025, 0.975)), lty = 3, col = 2)

hist(x = stat[,5], freq = F, col = "gray90", border = "gray90", xlim = range(stat[,5]), xlab = "Transitividad",  ylab = "Densidad", main = "Grado prom.")
abline(v = mdeg_obs, col = 4, lty = 2)
abline(v = quantile(stat[,5], c(0.025, 0.975)), lty = 3, col = 2)

hist(x = stat[,6], freq = F, col = "gray90", border = "gray90", xlim = range(stat[,6]), xlab = "Asortatividad", ylab = "Densidad", main = "Grado DE")
abline(v = sdeg_obs, col = 4, lty = 2)
abline(v = quantile(stat[,6], c(0.025, 0.975)), lty = 3, col = 2)

Comparación con el agrupamiento natural:

# agrupamiento natural
party.nums <- as.numeric(as.factor(V(fblog)$PolParty))
# agrupamiento estimado
cl.labs <- apply(Z,1,which.max)
# comparacion
round(igraph::compare(comm1 = party.nums, comm2 = cl.labs, method = "rand"), 4)
## [1] 0.8619
round(igraph::compare(comm1 = party.nums, comm2 = cl.labs, method = "adjusted.rand"), 4)
## [1] 0.4608
round(igraph::compare(comm1 = party.nums, comm2 = cl.labs, method = "nmi"), 4)
## [1] 0.6565
table(party.nums, cl.labs)
##           cl.labs
## party.nums  1  2  3  4  5  6  7  8  9 10
##          1  2  0  0  0  0  0  0  0  0  0
##          2  2  1  0  0  0  0  0  1  7  0
##          3  7  0  0  0  0  0  0  0  0  0
##          4  1  0  0  0 24  0  0  0  0  0
##          5  9  0  0  0  0  0  0  0  1  1
##          6  7  0  0  0  0  0  0  0  0  0
##          7  6  0  0 21  0  0  6  0  2 22
##          8  0 24  0  0  0  1  0  6  1  0
##          9  1  2 11  0  0 24  0  0  2  0
# grafo
g.cl <- graph_from_adjacency_matrix(adjmatrix = Pi.mat, mode = "undirected", weighted = TRUE)
# parametros del grafico
vsize    <- 100*sqrt(alpha)
ewidth   <- 10*E(g.cl)$weight
PolP     <- V(fblog)$PolParty
class.by.PolP <- as.matrix(table(cl.labs,PolP))
pie.vals <- lapply(1:Q, function(i) as.vector(class.by.PolP[i,]))
my.cols  <- topo.colors(length(unique(PolP)))
# grafico 
set.seed(42)
plot(g.cl, edge.width = ewidth, vertex.shape = "pie", vertex.pie = pie.vals, vertex.pie.color = list(my.cols), vertex.size = vsize, vertex.label.dist = 0.1*vsize, vertex.label.degree = pi)
# leyenda
my.names    <- names(table(PolP))
my.names[2] <- "Comm. Anal."
my.names[5] <- "PR de G"
legend(x = "topleft", my.names, fill = my.cols, bty = "n", ncol = 2)