Tidy Network Analysis with {tidytext} and Network Visualization with {ggraph}

alper yilmaz

2023-05-25

You can view this presentation at https://rpubs.com/alperyilmaz/tidygraph-slides

Contents are modified from the I2DS Tools for Data Science workshop by Ania Matysiak and Johanna Mehler

Network analysis

  • “Connected data” is pretty common: social networks, biological networks
  • Tackled with “graph theory” algorithms

Basic concepts of networks

Nodes and edges

Graph or Network

Types of networks

Adjacency Matrices
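As a minimal sketch (this three-node network is made up for illustration), an adjacency matrix stores a 1 at row i, column j whenever node i and node j are connected:

nodes <- c("a", "b", "c")
adj <- matrix(0, nrow = 3, ncol = 3, dimnames = list(nodes, nodes))
adj["a", "b"] <- adj["b", "a"] <- 1  # undirected edge a--b
adj["b", "c"] <- adj["c", "b"] <- 1  # undirected edge b--c
adj
#   a b c
# a 0 1 0
# b 1 0 1
# c 0 1 0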

Network Topology

Topology is the way in which the nodes and edges are arranged within a network

Biological networks

PPI

Tidy approach for network data

How can we represent network data in a tidy way?

The {tidygraph} package stores nodes and edges as two separate tibbles, both of which can be manipulated and processed with {dplyr} verbs.

Graphs, a tidy approach

Sample network

A simple friend/follower network

library(tidyverse)
library(tidygraph)

friends <- tibble(person1 = c("alice", "charles", "david", "bob", "fiona", "gary", "bob", "alice", "david"), 
                  person2 = c("bob", "bob", "bob", "allen", "bob", "bob", "henry", "allen", "fiona"))

friends
# A tibble: 9 × 2
  person1 person2
  <chr>   <chr>  
1 alice   bob    
2 charles bob    
3 david   bob    
4 bob     allen  
5 fiona   bob    
6 gary    bob    
7 bob     henry  
8 alice   allen  
9 david   fiona  

Let’s convert this data frame to a graph/network

friends_g <- friends |> as_tbl_graph()

friends_g
# A tbl_graph: 8 nodes and 9 edges
#
# A directed acyclic simple graph with 1 component
#
# Node Data: 8 × 1 (active)
  name   
  <chr>  
1 alice  
2 charles
3 david  
4 bob    
5 fiona  
6 gary   
# … with 2 more rows
#
# Edge Data: 9 × 2
   from    to
  <int> <int>
1     1     4
2     2     4
3     3     4
# … with 6 more rows

Let’s calculate degrees. Notice that “bob” is followed by 5 people and “bob” follows 2 people (total degree 7)

friends_g |> mutate(deg = centrality_degree())
# A tbl_graph: 8 nodes and 9 edges
#
# A directed acyclic simple graph with 1 component
#
# Node Data: 8 × 2 (active)
  name      deg
  <chr>   <dbl>
1 alice       2
2 charles     1
3 david       2
4 bob         2
5 fiona       1
6 gary        1
# … with 2 more rows
#
# Edge Data: 9 × 2
   from    to
  <int> <int>
1     1     4
2     2     4
3     3     4
# … with 6 more rows

By default, centrality_degree() uses mode="out"

Let’s get all connections (in and out)

friends_g |>
  mutate(deg = centrality_degree(mode="all"))
# A tbl_graph: 8 nodes and 9 edges
#
# A directed acyclic simple graph with 1 component
#
# Node Data: 8 × 2 (active)
  name      deg
  <chr>   <dbl>
1 alice       2
2 charles     1
3 david       2
4 bob         7
5 fiona       2
6 gary        1
# … with 2 more rows
#
# Edge Data: 9 × 2
   from    to
  <int> <int>
1     1     4
2     2     4
3     3     4
# … with 6 more rows

Compare “in”, “out” and “all” degree counts for directed and undirected graphs. Remember, the default is a directed graph and the default degree mode is “out”.

friends |> as_tbl_graph() |> mutate(default_deg=centrality_degree(),
                                    in_deg=centrality_degree(mode="in"),
                                    out_deg=centrality_degree(mode="out"),
                                    all_deg=centrality_degree(mode="all"))
# A tbl_graph: 8 nodes and 9 edges
#
# A directed acyclic simple graph with 1 component
#
# Node Data: 8 × 5 (active)
  name    default_deg in_deg out_deg all_deg
  <chr>         <dbl>  <dbl>   <dbl>   <dbl>
1 alice             2      0       2       2
2 charles           1      0       1       1
3 david             2      0       2       2
4 bob               2      5       2       7
5 fiona             1      1       1       2
6 gary              1      0       1       1
# … with 2 more rows
#
# Edge Data: 9 × 2
   from    to
  <int> <int>
1     1     4
2     2     4
3     3     4
# … with 6 more rows
friends |> as_tbl_graph(directed = FALSE) |> mutate(default_deg=centrality_degree(),
                                    in_deg=centrality_degree(mode="in"),
                                    out_deg=centrality_degree(mode="out"),
                                    all_deg=centrality_degree(mode="all"))
# A tbl_graph: 8 nodes and 9 edges
#
# An undirected simple graph with 1 component
#
# Node Data: 8 × 5 (active)
  name    default_deg in_deg out_deg all_deg
  <chr>         <dbl>  <dbl>   <dbl>   <dbl>
1 alice             2      2       2       2
2 charles           1      1       1       1
3 david             2      2       2       2
4 bob               7      7       7       7
5 fiona             2      2       2       2
6 gary              1      1       1       1
# … with 2 more rows
#
# Edge Data: 9 × 2
   from    to
  <int> <int>
1     1     4
2     2     4
3     3     4
# … with 6 more rows

We can use dplyr verbs on the nodes (or edges) tibble. Let’s do an inner_join and some filtering.

Let’s create a data frame of birth years for each person

friends_bday <- tibble(name = c("alice", "charles", "david", "bob", "fiona", "gary", "henry", "allen"), 
       bday = c("2001", "2001", "2002", "1999", "1997", "2003", "2000", "2005"))

Here are several operations chained together (annotated)

result <- friends_g |>
  mutate(deg = centrality_degree(mode="all")) |>  # calculate degree for nodes
  filter(deg > 1) |>                              # remove nodes with deg==1
  inner_join(friends_bday) |>                     # join bday data with nodes
  arrange(bday) |>                                # arrange nodes according to bday
  activate(edges) |>                              # switch to edge tibble
  mutate(betw = centrality_edge_betweenness()) |> # calculate betweenness for edges
  activate(nodes)                                 # switch back to nodes tibble
result
# A tbl_graph: 5 nodes and 6 edges
#
# A directed acyclic simple graph with 1 component
#
# Node Data: 5 × 3 (active)
  name    deg bday 
  <chr> <dbl> <chr>
1 fiona     2 1997 
2 bob       7 1999 
3 alice     2 2001 
4 david     2 2002 
5 allen     2 2005 
#
# Edge Data: 6 × 3
   from    to  betw
  <int> <int> <dbl>
1     3     2     1
2     4     2     2
3     2     5     3
# … with 3 more rows

Visualization

Let’s create, process and visualize the toy network below

sample_net <- readxl::read_excel("data/sample_network.xlsx") |>
  as_tbl_graph(directed=FALSE)

sample_net
# A tbl_graph: 19 nodes and 32 edges
#
# An undirected simple graph with 1 component
#
# Node Data: 19 × 1 (active)
  name 
  <chr>
1 a    
2 b    
3 c    
4 d    
5 e    
6 f    
# … with 13 more rows
#
# Edge Data: 32 × 2
   from    to
  <int> <int>
1     1     2
2     1     4
3     1     6
# … with 29 more rows

Let’s calculate the degree (number of connections) of each node and then arrange by degree

sample_net %>% 
  mutate(deg=centrality_degree()) %>% 
  arrange(-deg)
# A tbl_graph: 19 nodes and 32 edges
#
# An undirected simple graph with 1 component
#
# Node Data: 19 × 2 (active)
  name    deg
  <chr> <dbl>
1 j         7
2 d         6
3 g         5
4 h         5
5 b         4
6 f         4
# … with 13 more rows
#
# Edge Data: 32 × 2
   from    to
  <int> <int>
1     5     8
2     2     8
3     6     8
# … with 29 more rows

Note

arrange() does not change the shape or contents of the network; it only reorders the node tibble (edge indices are renumbered to match)
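As a quick sanity check (a sketch, assuming {igraph} is available — tidygraph is built on top of it), isomorphic() should confirm that the arranged graph has the same structure:

reordered <- sample_net |>
  mutate(deg = centrality_degree()) |>
  arrange(-deg)

igraph::isomorphic(sample_net, reordered)  # TRUE: same structure, nodes renumbered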

Which node would you expect to have the highest betweenness?

sample_net %>% 
  mutate(betw=centrality_betweenness()) %>% 
  arrange(-betw)
# A tbl_graph: 19 nodes and 32 edges
#
# An undirected simple graph with 1 component
#
# Node Data: 19 × 2 (active)
  name   betw
  <chr> <dbl>
1 h      88  
2 i      81.5
3 j      66.5
4 m      32  
5 g      28.8
6 q      17  
# … with 13 more rows
#
# Edge Data: 32 × 2
   from    to
  <int> <int>
1    13    15
2     8    15
3     7    15
# … with 29 more rows

{ggraph} package

ggraph is very similar to ggplot. We don’t need to map x and y values; we only have node and edge information. We need to choose a geom for the nodes (point, circle, text, etc.) and a geom for the edges (link, arc, diagonal, elbow, etc.)

library(ggraph)

sample_net |>
  ggraph() +
  geom_node_point() +
  geom_edge_link()

Labels as layer

Let’s add node labels with geom_node_label. This geom needs the label aesthetic mapped to a column; we’ll use the name column.

sample_net |>
  ggraph() +
  geom_node_point() +
  geom_edge_link() +
  geom_node_label(aes(label=name))

We can map graph metrics (in this case centrality_degree) to node aesthetics. Let’s add repel for the labels and use a theme specific to graphs.

sample_net |>
  ggraph() +
  geom_node_point(aes(size=centrality_degree())) +
  geom_edge_link() +
  geom_node_label(aes(label=name), repel = TRUE) +
  theme_graph()

sample_net %>%
  ggraph() +
  geom_node_point(aes(size=centrality_degree(), 
                      color=centrality_betweenness())) +  
  geom_edge_link() +
  geom_node_label(aes(label=name), repel = TRUE) +
  theme_graph()

Network visualization layouts
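ggraph offers many layout algorithms (its own plus those of {igraph}). Here is a sketch comparing a few common ones on the same graph; arranging the plots side by side with {patchwork} is an assumption here, any plot-composition approach works:

library(patchwork)  # assumed available, used only to place plots side by side

layout_demo <- function(layout_name) {
  ggraph(sample_net, layout = layout_name) +
    geom_edge_link() +
    geom_node_point() +
    ggtitle(layout_name) +
    theme_graph()
}

layout_demo("stress") + layout_demo("kk") + layout_demo("circle")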

Relationship between words

Last week, we used {tidytext} to process text data. Let’s calculate relationships between words and visualize it as a network.

A bigram is a pair of consecutive written units such as letters, syllables, or words. In our case we’ll be working with word bigrams. Here’s an example of bigram tokenization for the sentence “Quick brown fox jumps over lazy dog.”

Quick brown
      brown fox
            fox jumps
                jumps over
                      over lazy
                           lazy dog
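As a quick sketch, unnest_tokens() from {tidytext} produces exactly this tokenization (note that it lowercases by default):

library(tibble)
library(tidytext)

tibble(text = "Quick brown fox jumps over lazy dog.") |>
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
# one `bigram` column: quick brown, brown fox, fox jumps,
# jumps over, over lazy, lazy dog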

Packages

These are the packages that will be used

library(tidytext)
library(janeaustenr)
library(widyr)
library(ggraph)

Here’s how bigram words are tokenized

austen_bigrams <- austen_books() %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
austen_bigrams
# A tibble: 675,025 × 2
   book                bigram         
   <fct>               <chr>          
 1 Sense & Sensibility sense and      
 2 Sense & Sensibility and sensibility
 3 Sense & Sensibility <NA>           
 4 Sense & Sensibility by jane        
 5 Sense & Sensibility jane austen    
 6 Sense & Sensibility <NA>           
 7 Sense & Sensibility <NA>           
 8 Sense & Sensibility <NA>           
 9 Sense & Sensibility <NA>           
10 Sense & Sensibility <NA>           
# … with 675,015 more rows

Let’s check most frequent bigrams

austen_bigrams |> count(bigram, sort=TRUE)
# A tibble: 193,210 × 2
   bigram      n
   <chr>   <int>
 1 <NA>    12242
 2 of the   2853
 3 to be    2670
 4 in the   2221
 5 it was   1691
 6 i am     1485
 7 she had  1405
 8 of her   1363
 9 to the   1315
10 she was  1309
# … with 193,200 more rows

Removing stop words won’t be a simple anti_join since each bigram contains two words. We need to separate them into individual words first.

austen_bigram_clean <- austen_bigrams |> 
  separate(bigram, into=c("word1","word2"), sep=" ") |>  # split each bigram into two columns
  anti_join(stop_words, by=c("word1"="word")) |>         # drop pairs whose first word is a stop word
  anti_join(stop_words, by=c("word2"="word"))            # drop pairs whose second word is a stop word

Let’s count now

austen_bigram_clean |> count(word1, word2, sort=TRUE) |> head()
# A tibble: 6 × 3
  word1   word2         n
  <chr>   <chr>     <int>
1 <NA>    <NA>      12242
2 sir     thomas      300
3 miss    crawford    240
4 captain wentworth   167
5 miss    woodhouse   152
6 lady    russell     136

Network of bigrams

In the bigram count output, we have “word1”, “word2” and their frequency. This looks like two nodes and an edge between them. Let’s convert this data into a network

austen_bigram_clean |> 
  filter(!is.na(word1)) |> 
  count(word1, word2, sort=TRUE) |> 
  filter(n>20) |>                    # filtering out infrequent pairs for cleaner result
  as_tbl_graph()
# A tbl_graph: 86 nodes and 70 edges
#
# A directed acyclic simple graph with 18 components
#
# Node Data: 86 × 1 (active)
  name   
  <chr>  
1 sir    
2 miss   
3 captain
4 lady   
5 frank  
6 colonel
# … with 80 more rows
#
# Edge Data: 70 × 3
   from    to     n
  <int> <int> <int>
1     1    29   300
2     2    30   240
3     3    31   167
# … with 67 more rows

Let’s visualize the network

austen_bigram_clean |> 
  filter(!is.na(word1)) |> 
  count(word1, word2, sort=TRUE) |> 
  filter(n>20) |>                    
  as_tbl_graph() |>
  ggraph(layout = "kk") + 
  geom_node_point() + 
  geom_edge_link() + 
  geom_node_label(aes(label=name))

Co-occurrence and correlation

Bigram counts only consider words that are directly adjacent. The {widyr} package allows pairwise counting of words within a predefined section/window.

Let’s count words within 10-line sections (considering a single book)

austen_section_words <- austen_books() |>
  filter(book == "Pride & Prejudice") |>
  mutate(section = row_number() %/% 10) |>
  filter(section > 0) |>
  unnest_tokens(word, text) |>
  anti_join(stop_words)

austen_section_words
# A tibble: 37,240 × 3
   book              section word        
   <fct>               <dbl> <chr>       
 1 Pride & Prejudice       1 truth       
 2 Pride & Prejudice       1 universally 
 3 Pride & Prejudice       1 acknowledged
 4 Pride & Prejudice       1 single      
 5 Pride & Prejudice       1 possession  
 6 Pride & Prejudice       1 fortune     
 7 Pride & Prejudice       1 wife        
 8 Pride & Prejudice       1 feelings    
 9 Pride & Prejudice       1 views       
10 Pride & Prejudice       1 entering    
# … with 37,230 more rows
austen_section_words |>
  pairwise_count(word, section, sort = TRUE) 
# A tibble: 796,008 × 3
   item1     item2         n
   <chr>     <chr>     <dbl>
 1 darcy     elizabeth   144
 2 elizabeth darcy       144
 3 miss      elizabeth   110
 4 elizabeth miss        110
 5 elizabeth jane        106
 6 jane      elizabeth   106
 7 miss      darcy        92
 8 darcy     miss         92
 9 elizabeth bingley      91
10 bingley   elizabeth    91
# … with 795,998 more rows

So, the words “darcy” and “elizabeth” were found in the same section (10 lines) 144 times, irrespective of order or of the distance between them.

Again, the words are nodes and the count is the edge data; let’s convert this into a network

austen_section_words |>
  pairwise_count(word, section, sort = TRUE) |> 
  filter(n>20) |> 
  as_tbl_graph(directed = F) |> 
  activate(edges) |> 
  distinct() |>            # removing duplicate edges
  ggraph() + 
  geom_node_point() + 
  geom_edge_link() + 
  geom_node_label(aes(label=name))

Pairwise correlation

Pairs like “Elizabeth” and “Darcy” are the most common co-occurring words, but that’s not particularly meaningful since they’re also the most common individual words. We may instead want to examine correlation among words, which indicates how often they appear together relative to how often they appear separately.

In particular, here we’ll focus on the phi coefficient, a common measure for binary correlation. The focus of the phi coefficient is how much more likely it is that either both word X and Y appear, or neither do, than that one appears without the other.

Consider the following table:

|            | Has word Y      | No word Y       | Total          |
|------------|-----------------|-----------------|----------------|
| Has word X | \(n_{11}\)      | \(n_{10}\)      | \(n_{1\cdot}\) |
| No word X  | \(n_{01}\)      | \(n_{00}\)      | \(n_{0\cdot}\) |
| Total      | \(n_{\cdot 1}\) | \(n_{\cdot 0}\) | \(n\)          |

For example, \(n_{11}\) represents the number of sections where both word X and word Y appear, \(n_{00}\) the number where neither appears, and \(n_{10}\) and \(n_{01}\) the cases where one appears without the other. In terms of this table, the phi coefficient is:

\[\phi=\frac{n_{11}n_{00}-n_{10}n_{01}}{\sqrt{n_{1\cdot}n_{0\cdot}n_{\cdot0}n_{\cdot1}}}\]

> The phi coefficient is equivalent to the Pearson correlation, which you may have heard of elsewhere, when it is applied to binary data.
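To make the formula concrete, here is a sketch computing phi by hand for one pair ("pounds" and "thousand", picked from the output below); it should closely match the value pairwise_cor() reports:

# which sections contain each word?
secs   <- unique(austen_section_words$section)
x_secs <- unique(austen_section_words$section[austen_section_words$word == "pounds"])
y_secs <- unique(austen_section_words$section[austen_section_words$word == "thousand"])

n11 <- sum(secs %in% x_secs & secs %in% y_secs)       # both words appear
n10 <- sum(secs %in% x_secs & !(secs %in% y_secs))    # only "pounds"
n01 <- sum(!(secs %in% x_secs) & secs %in% y_secs)    # only "thousand"
n00 <- sum(!(secs %in% x_secs) & !(secs %in% y_secs)) # neither word

phi <- (n11 * n00 - n10 * n01) /
  sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
phi  # should be close to the pairwise_cor() value below (~0.70)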

austen_section_words |>
  group_by(word) |>
  filter(n() >= 10) |>   # removing infrequent words for simpler calculations
  ungroup() |>
  pairwise_cor(word, section, sort = TRUE)
# A tibble: 742,182 × 3
   item1     item2     correlation
   <chr>     <chr>           <dbl>
 1 bourgh    de              0.951
 2 de        bourgh          0.951
 3 pounds    thousand        0.701
 4 thousand  pounds          0.701
 5 william   sir             0.664
 6 sir       william         0.664
 7 catherine lady            0.663
 8 lady      catherine       0.663
 9 forster   colonel         0.622
10 colonel   forster         0.622
# … with 742,172 more rows

Let’s convert this to a network and visualize it

austen_section_words |>
  group_by(word) |>
  filter(n() >= 10) |>   # try lower numbers and see what happens
  ungroup() |>
  pairwise_cor(word, section, sort = TRUE) |> 
  filter(correlation > 0.25) |> 
  as_tbl_graph(directed=FALSE) |> 
  ggraph(layout="kk") + 
  geom_node_point() + 
  geom_edge_link() + 
  geom_node_label(aes(label=name))

Legos of tidy verbs

Consider the following code: since different functions from different packages take in tidy data and output tidy data, we can combine them easily. The package of each function is annotated on each line.

austen_books() |>
  filter(book == "Pride & Prejudice") |>       # dplyr
  mutate(section = row_number() %/% 10) |>     # dplyr
  filter(section > 0) |>                       # dplyr
  unnest_tokens(word, text) |>                 # tidytext
  anti_join(stop_words) |>                     # dplyr
  group_by(word) |>                            # dplyr
  filter(n() >= 10) |>                         # dplyr
  ungroup() |>                                 # dplyr
  pairwise_cor(word, section, sort = TRUE) |>  # widyr
  filter(correlation > 0.25) |>                # dplyr
  as_tbl_graph(directed=FALSE) |>              # tidygraph
  mutate(deg=centrality_degree()) |>           # dplyr / tidygraph
  ggraph(layout="kk") +                        # ggraph
  geom_node_point() +                          # ggraph
  geom_edge_link() +                           # ggraph
  geom_node_label(aes(label=name))             # ggraph