Assignment 8: Networks

Data

The data come from Moviegalaxies - Social Networks in Movies. The basic description of the data is:

Methods: We created a movie script parser and determined same-scene appearance of characters as a proxy of connectedness (each co-appearance is measured as one degree unit per scene).

For the following exercises, use one of the movies in the datos/movie-galaxies folder, or use Gephi to select another movie and then export it in graphml format. You can also work inside Gephi if you prefer.

Example

library(tidyverse)

## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0

## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(tidygraph)

## 
## Attaching package: 'tidygraph'

## The following object is masked from 'package:stats':
## 
##     filter

library(ggraph)
theme_set(theme_minimal())

getwd()

## [1] "/home/rstudio/metodos-analiticos-mcd-2020"

setwd("~/metodos-analiticos-mcd-2020/ejercicios/redes")
#The read_graph function is able to read graphs in various representations from a file, or from a http connection. Currently some simple formats are supported.
red_rj <- igraph::read_graph("../../datos/movie-galaxies/lesmiserables.graphml", 
  format = "graphml") %>% 
  as_tbl_graph
#The tbl_graph class is a thin wrapper around an igraph object that provides methods for manipulating the graph using the tidy API.
red_rj

## # A tbl_graph: 77 nodes and 254 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 77 x 9 (active)
##   label          r     g     b      x       y  size `Modularity Class` id   
##   <chr>      <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl>              <dbl> <chr>
## 1 Valjean      245    91    91  -87.9    6.81 100                    1 11   
## 2 Gavroche      91   245    91  388.  -110.    61.6                  8 48   
## 3 Marius       194    91   245  206.    13.8   53.4                  6 55   
## 4 Javert        91   245   194  -81.5  204.    47.9                  7 27   
## 5 Thenardier    91   245   194   82.8  203.    45.1                  7 25   
## 6 Fantine       91   194   245 -313.   289.    42.4                  2 23   
## # … with 71 more rows
## #
## # Edge Data: 254 x 5
##    from    to `Edge Label` weight id   
##   <int> <int> <chr>         <dbl> <chr>
## 1    18    61 ""                1 0    
## 2    18    45 ""                8 1    
## 3    18    46 ""               10 2    
## # … with 251 more rows

You can access the node table and the edge table as follows:

red_rj %>% activate(nodes)

## # A tbl_graph: 77 nodes and 254 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 77 x 9 (active)
##   label          r     g     b      x       y  size `Modularity Class` id   
##   <chr>      <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl>              <dbl> <chr>
## 1 Valjean      245    91    91  -87.9    6.81 100                    1 11   
## 2 Gavroche      91   245    91  388.  -110.    61.6                  8 48   
## 3 Marius       194    91   245  206.    13.8   53.4                  6 55   
## 4 Javert        91   245   194  -81.5  204.    47.9                  7 27   
## 5 Thenardier    91   245   194   82.8  203.    45.1                  7 25   
## 6 Fantine       91   194   245 -313.   289.    42.4                  2 23   
## # … with 71 more rows
## #
## # Edge Data: 254 x 5
##    from    to `Edge Label` weight id   
##   <int> <int> <chr>         <dbl> <chr>
## 1    18    61 ""                1 0    
## 2    18    45 ""                8 1    
## 3    18    46 ""               10 2    
## # … with 251 more rows

#Activate will ungroup a grouped_tbl_graph
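
As a complement, here is a minimal sketch of pulling each table out as a plain tibble. It uses a made-up three-node graph (`toy`, with hypothetical labels and weights) rather than the movie data, so it runs on its own; the same two pipelines should apply to `red_rj`.

```r
library(dplyr)
library(tidygraph)

# Hypothetical toy graph: three characters, two weighted undirected edges
toy <- tbl_graph(
  nodes = tibble(label = c("A", "B", "C")),
  edges = tibble(from = c(1, 1), to = c(2, 3), weight = c(3, 1)),
  directed = FALSE
)

# activate() chooses which table is in focus; as_tibble() returns that table
nodes_tbl <- toy %>% activate(nodes) %>% as_tibble()
edges_tbl <- toy %>% activate(edges) %>% as_tibble()

nodes_tbl
edges_tbl
```

Once extracted, the tables are ordinary tibbles and can be summarised or joined with the usual dplyr verbs.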

Question 1 Explain what information is available at the node level and at the edge level

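The nodes are the characters in the movie. For each node the table stores a label, a display colour (r, g, b), layout coordinates (x, y), a size, a modularity class, and an id. Some of the characters are:

- Jean Valjean (Hugh Jackman)
- Marius Pontmercy (Eddie Redmayne)
- Enjolras (Aaron Tveit)
- Courfeyrac (Fra Fee)
- Cosette (Amanda Seyfried)
- Javert (Russell Crowe)
- Gavroche (Daniel Huttlestone)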

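The edges are connections between one character and another. The graph is undirected, so an edge has no direction. Each edge stores its two endpoints (from, to), an edge label, a weight, and an id; according to the data description, the weight counts the scenes in which the two characters appear together. The importance of a character depends on how many other characters have interacted with them and also on how important those characters are.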

Node centrality

We compute the centrality of each character according to the number of connections it has in the graph, that is, the degree of each node.

# note: the degree is computed weighted by the edge weights (without weighting, 
# each edge contributes 1; with weighting, each edge contributes its weight.)
red_rj <- red_rj %>% activate(nodes) %>% 
  mutate(central_grado = centrality_degree(weights = weight)) 
#The centrality of a node measures the importance of node in the network. As the concept of importance is ill-defined and dependent on the network and the questions under consideration, many centrality measures exist.
#The weight of the edges to use for the calculation. Will be evaluated in the context of the edge data.
resumen_central <- red_rj %>% as_tibble() %>% 
  select(label, central_grado) %>% arrange(desc(central_grado))
resumen_central
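
To make the difference between plain and weighted degree concrete, here is a small sketch on a made-up graph (the object `toy`, its labels, and its weights are hypothetical). Node A has two neighbours, but its weighted degree is the sum of its edge weights.

```r
library(dplyr)
library(tidygraph)

# Hypothetical toy graph: A-B with weight 3, A-C with weight 1
toy <- tbl_graph(
  nodes = tibble(label = c("A", "B", "C")),
  edges = tibble(from = c(1, 1), to = c(2, 3), weight = c(3, 1)),
  directed = FALSE
) %>%
  activate(nodes) %>%
  mutate(
    degree_plain    = centrality_degree(),                 # each edge counts 1
    degree_weighted = centrality_degree(weights = weight)  # each edge counts its weight
  )

degree_tbl <- as_tibble(toy)
degree_tbl
```

In this sketch A has plain degree 2 and weighted degree 4, which is why a weighted degree should be read as a total of co-appearances rather than as a count of distinct neighbours.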

Question 2: Which characters are the most important in terms of degree (number of connections) in the movie you chose? Which are the least important?

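The most important is Valjean, with a weighted degree of 158. Because the degree is weighted, this is the sum of the weights of his edges, not the number of distinct characters he is connected to. He is followed by Marius with 104 and then Enjolras with 91. The least important include Oldman and Boulatruelle, each with degree 1.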

We will also use the betweenness of the nodes, which measures how well a node connects any other pair of nodes: a node has high betweenness when many shortest paths between other nodes pass through it.

Betweenness: how important or unique a node is for connecting other pairs of nodes in the network (for example, a person with high betweenness can more easily control the flow of information in a social network).

# note: betweenness is computed using the edge weights; igraph interprets the weights as 
# edge lengths, so a larger weight means a longer (costlier) connection on a shortest path.
red_rj <- red_rj %>% activate(nodes) %>% 
  mutate(central_between = centrality_betweenness(weights = weight)) 
resumen_central <- red_rj %>% as_tibble() %>% select(label, central_grado, central_between) %>% 
  arrange(desc(central_grado))
resumen_central
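
The following sketch illustrates how weights change betweenness on a made-up triangle. The graph `toy` and the edge column `n_scenes` are hypothetical names chosen for this example. Because weights are treated as lengths, the heavy A-C edge becomes a long path, and the shortest route from A to C goes through B.

```r
library(dplyr)
library(tidygraph)

# Hypothetical triangle: A-B and B-C are short (1), A-C is long (5)
toy <- tbl_graph(
  nodes = tibble(label = c("A", "B", "C")),
  edges = tibble(from = c(1, 2, 1), to = c(2, 3, 3), n_scenes = c(1, 1, 5)),
  directed = FALSE
) %>%
  activate(nodes) %>%
  mutate(
    between_plain    = centrality_betweenness(),                   # every edge has length 1
    between_weighted = centrality_betweenness(weights = n_scenes)  # weights act as lengths
  )

between_tbl <- as_tibble(toy)
between_tbl
```

If a large weight is meant to express closeness (many shared scenes), one option is to pass an inverted weight such as `1 / weight`; whether that is appropriate for this assignment is a modelling choice, not something the data description settles.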

library(ggrepel)
#ggrepel This package contains extra geoms for ggplot2.
ggplot(resumen_central, aes(x = central_grado + 1 , y = central_between + 1, label = label)) + 
  geom_point() + geom_text_repel() +
  scale_x_log10() + scale_y_log10()

#geom_text_repel adds text directly to the plot. geom_label_repel draws a rectangle underneath the text, making it easier to read. The text labels repel away from each other and away from the data points.

Question 3 Which characters in your movie are important in one centrality measure but not so much in the other?

Important in degree centrality but not so much in betweenness (degree / betweenness):

- Courfeyrac, 84 / 1.8214286
- Combeferre, 68 / 0.0000000
- Enjolras, 91 / 7.8682900

Visualization

Start by making a simple plot:

ggraph(red_rj, layout = "circle") + 
    geom_edge_link(colour = "gray") + 
    geom_node_point() +
    geom_node_text(aes(label = label))

#ggraph() is the ggraph equivalent of ggplot2::ggplot(). It takes care of setting up the plot object along with creating the layout for the plot based on the graph and the specification passed in.

Now add size and colour to the nodes in your plot. You can use a centrality measure:

ggraph(red_rj, layout = "circle") + 
    geom_edge_link(colour = "gray") + 
    geom_node_point(aes(size = central_between, colour = central_between)) +
    geom_node_text(aes(label = label), repel = TRUE)

Choose a force-directed layout (you can experiment with stress, fr, graphopt, gem):

This is a diagram with fr, followed by stress, graphopt, and gem:

ggraph(red_rj, layout = "fr") + 
    geom_edge_link(colour = "gray") + 
    geom_node_point(aes(size = central_between, colour = central_between)) +
    geom_node_text(aes(label = label), size = 3, repel = TRUE)

ggraph(red_rj, layout = "stress") + 
    geom_edge_link(colour = "gray") + 
    geom_node_point(aes(size = central_between, colour = central_between)) +
    geom_node_text(aes(label = label), size = 3, repel = TRUE)

ggraph(red_rj, layout = "graphopt") + 
    geom_edge_link(colour = "gray") + 
    geom_node_point(aes(size = central_between, colour = central_between)) +
    geom_node_text(aes(label = label), size = 3, repel = TRUE)

ggraph(red_rj, layout = "gem") + 
    geom_edge_link(colour = "gray") + 
    geom_node_point(aes(size = central_between, colour = central_between)) +
    geom_node_text(aes(label = label), size = 3, repel = TRUE)
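
When comparing layouts, note that some of them (fr, for example) start from random positions, so the picture can change between runs. The sketch below, on a made-up ring graph built with `create_ring()`, suggests one way to check that fixing the seed makes the coordinates repeatable. It assumes the layout draws from R's random number generator, as igraph layouts called from R normally do.

```r
library(tidygraph)
library(ggraph)

# Hypothetical toy graph: a ring of 10 nodes
toy <- create_ring(10)

# create_layout() returns a data frame with the x and y position of each node
set.seed(8823)
lay1 <- create_layout(toy, layout = "fr")
set.seed(8823)
lay2 <- create_layout(toy, layout = "fr")

# Same seed, same coordinates
same_layout <- isTRUE(all.equal(lay1$x, lay2$x)) && isTRUE(all.equal(lay1$y, lay2$y))
same_layout
```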

Add edge width depending on the weight:

ggraph(red_rj, layout = "stress") + 
    geom_edge_link(aes(edge_width = weight), alpha = 0.5, colour = "gray70") + 
    geom_node_point(aes(size = central_between, colour = central_between)) +
    geom_node_text(aes(label = label), size = 3, repel = TRUE)

We can use the logarithm of the centrality measures to see the variation more clearly:

set.seed(8823)
ggraph(red_rj, layout = "fr") + 
    geom_edge_link(aes(edge_width = weight), alpha = 0.5, colour = "gray70") + 
    # log1p(x) = log(1 + x) stays finite for nodes whose betweenness is 0
    geom_node_point(aes(size = central_between, colour = log1p(central_between))) +
    geom_node_text(aes(label = label), size = 3, repel = TRUE)
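
One caveat with a log scale is that betweenness can be exactly zero (Combeferre above, for instance), and the logarithm of zero is not finite. The short sketch below, using made-up values, compares `log()` with `log1p()`, which is one common way around this; it plays the same role as the `+ 1` shift in the scatter plot earlier.

```r
# Hypothetical betweenness values; the zero mimics a node on no shortest path
between <- c(0, 1.8, 7.9, 120)

log_plain <- log(between)    # first element is -Inf
log_shift <- log1p(between)  # log(1 + x), first element is 0

data.frame(between, log_plain, log_shift)
```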

Question 4. Which layout algorithms worked best for your graph? Why?

Question 5: In your graph, can you explain why some nodes are relatively more important in one centrality measure than in the other?

Question 6 Were you able to learn anything about the structure of the movie by examining these representations?