0. Prep

  • Download “example retweets.csv” and “network exercise.csv” to your folder

  • Open a new R file and SAVE it to your folder

  • Set working directory to your folder
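
The prep steps above can be sketched in R roughly as follows. The folder path is a hypothetical placeholder, not something given in these notes, so replace it with your own:

```r
# Hypothetical folder; replace with the path to your own folder
my_folder <- "~/network-workshop"

# Only change directory if the folder exists, to avoid an error
if (dir.exists(my_folder)) {
  setwd(my_folder)
}

getwd()  # confirm the current working directory

# Check that both data files are where R expects them
files_needed <- c("example retweets.csv", "network exercise.csv")
files_found <- file.exists(files_needed)
files_found
```

If either value is FALSE, the file is not in the working directory yet.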

1. Packages

We will focus on two packages: tidygraph for preprocessing and analysis, and ggraph for visualization. They are both developed by Thomas Lin Pedersen.

There are several other packages widely used for network analysis and visualization, such as igraph and network.

Although tidygraph and ggraph are newer packages than igraph and network, they have a big advantage: they bring network analysis into the tidyverse workflow.

  • Under the framework we’ve been using (for textual analyses), network data seem harder to grasp. There is a discrepancy between relational data and the tidy data idea: relational data cannot in any meaningful way be encoded as a single tidy data frame.

  • To resolve this discrepancy, tidygraph and ggraph adopt the tidy data idea. You can view tidygraph as an extension of dplyr: it lets us use the dplyr functions we’ve grown familiar with to manipulate data frames. You can view ggraph as an extension of ggplot2.
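
To make the idea concrete, here is a minimal sketch of a network stored as two tidy tables, one for nodes and one for edges. The toy names (a, b, c) are made up for illustration:

```r
library(tidygraph)
library(dplyr)

# Two tidy tables: one row per node, one row per edge (toy data)
toy_nodes <- data.frame(name = c("a", "b", "c"))
toy_edges <- data.frame(from = c("a", "b"), to = c("b", "c"))

# tbl_graph() combines them; character from/to values are matched to `name`
toy_graph <- tbl_graph(nodes = toy_nodes, edges = toy_edges, directed = TRUE)
toy_graph

# A familiar dplyr verb applied to the nodes table;
# edges attached to a removed node are dropped as well
toy_small <- toy_graph %>%
  activate(nodes) %>%
  filter(name != "c")
toy_small
```

The point of the sketch is that neither table alone describes the network, but the pair does, and each can be handled with ordinary dplyr verbs.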

More on tidygraph and ggraph: https://www.data-imaginist.com/2017/introducing-tidygraph/

#install.packages("tidygraph")
#install.packages("ggraph")

library(tidygraph)
## 
## Attaching package: 'tidygraph'
## The following object is masked from 'package:stats':
## 
##     filter
library(ggraph)
## Loading required package: ggplot2
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

2. Preprocessing

Usually, the data we have is in a raw, textual format:

data <- read.csv("example retweets.csv")
str(data)
## 'data.frame':    1246 obs. of  6 variables:
##  $ id             : int  6672 6673 6705 6707 6712 6714 6717 6718 6720 6724 ...
##  $ date           : chr  "10/30/16" "10/30/16" "10/30/16" "10/30/16" ...
##  $ screen_name    : chr  "canadaballer22" "bigman9378" "Bojispirit" "StCathWriter" ...
##  $ text           : chr  "RT @CNN: Fake news has become a plague on the web, says @brianstelter. His advice? Triple check before you shar"| __truncated__ "RT @CNN: Fake news has become a plague on the web, says @brianstelter. His advice? Triple check before you shar"| __truncated__ "RT @CNN: Fake news has become a plague on the web, says @brianstelter. His advice? Triple check before you shar"| __truncated__ "RT @CNN: Fake news has become a plague on the web, says @brianstelter. His advice? Triple check before you shar"| __truncated__ ...
##  $ follower_count : int  11 74 21 4453 311 8956 7 334 3282 1474 ...
##  $ following_count: int  269 321 282 3285 176 1433 79 2510 4959 4087 ...

We need to identify the accounts that retweeted others and the accounts that were retweeted:

data_clean <- data %>%
  mutate(from = screen_name, to = gsub("RT @(\\S*): .*","\\1",text)) %>% 
  select(from, to) %>% # Question: What does select do?
  count(from, to) # Question: What does this line do?
head(data_clean)
##              from              to n
## 1 01507db5dba34b2             CNN 1
## 2      0DarylZero        _Makada_ 1
## 3    10thCrusader realDonaldTrump 1
## 4      123_talent             CNN 1
## 5   1966OldSchool             CNN 1
## 6         1DanCox realDonaldTrump 1
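
To see what the regular expression and count() are doing, here is a small sketch on made-up tweets. The toy data and the extra filter() step are illustrative assumptions, not part of the original pipeline:

```r
library(dplyr)

# Toy tweets (made up for illustration); the last one is not a retweet
toy <- data.frame(
  screen_name = c("user_a", "user_b", "user_a", "user_c"),
  text = c("RT @CNN: Fake news has become a plague",
           "RT @FoxNews: Breaking story",
           "RT @CNN: Another headline",
           "Just an ordinary tweet")
)

# "RT @(\\S*): .*" captures the non-space characters between "RT @" and ": ";
# "\\1" then replaces the whole tweet with that captured account name
toy_targets <- gsub("RT @(\\S*): .*", "\\1", toy$text)
toy_targets

# When the pattern does not match, gsub() returns the text unchanged,
# so this sketch keeps only rows that start with "RT @" before extracting
toy_edges <- toy %>%
  filter(grepl("^RT @", text)) %>%
  mutate(from = screen_name, to = gsub("RT @(\\S*): .*", "\\1", text)) %>%
  count(from, to)  # one row per from-to pair; n = number of retweets
toy_edges
```

Whether the filter() step is needed depends on whether your data contain tweets that are not retweets.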

Up to now, we have been working with a data frame. We need to transform it into a “network type” object (one with nodes and edges).

The package tidygraph provides an easy way for us to do that (as_tbl_graph):

data_graph <- as_tbl_graph(data_clean)
data_graph
## # A tbl_graph: 1301 nodes and 1233 edges
## #
## # A directed multigraph with 69 components
## #
## # A tibble: 1,301 × 1
##   name           
##   <chr>          
## 1 01507db5dba34b2
## 2 0DarylZero     
## 3 10thCrusader   
## 4 123_talent     
## 5 1966OldSchool  
## 6 1DanCox        
## # ℹ 1,295 more rows
## #
## # A tibble: 1,233 × 3
##    from    to     n
##   <int> <int> <int>
## 1     1  1226     1
## 2     2  1227     1
## 3     3  1228     1
## # ℹ 1,230 more rows
#View(data_graph)

We will mainly work with this tidygraph object (i.e., network data) from now on. We use activate() to tell R whether we want to work on the nodes part or the edges part of the object.

As a side note, you can get the tidygraph object back into data frames easily:

data_nodes <- data_graph %>%
  activate(nodes) %>%
  as_tibble()
#View(data_nodes)

data_edges <- data_graph %>%
  activate(edges) %>%
  as_tibble()
#View(data_edges)

Okay, back to the tidygraph object. As noted above, one of the biggest advantages of tidygraph is that we can use dplyr functions to manipulate the nodes or the edges in a way very similar to how we manipulate data frames.

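Let’s unpack the code below. While the edges are active, .N() gives access to the node data, so .N()$conservative_elites[to] looks up, for each edge, the value stored on the node that edge points to: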

data_graph_ideology <- data_graph %>%
  activate(nodes) %>%
  mutate(conservative_elites = ifelse(name=="realDonaldTrump" | name=="FoxNews" | name=="BreitbartNews",1,0), 
         liberal_elites = ifelse(name=="BernieSanders" | name=="CNN" | name=="HuffPostPol",1,0)) %>% 
  activate(edges) %>%
  mutate(retweet_ideology = factor(ifelse(.N()$conservative_elites[to] == 1, "conservative",
                                          ifelse(.N()$liberal_elites[to] == 1, "liberal",
                                                 "unknown")))) 

data_graph_ideology
## # A tbl_graph: 1301 nodes and 1233 edges
## #
## # A directed multigraph with 69 components
## #
## # A tibble: 1,233 × 4
##    from    to     n retweet_ideology
##   <int> <int> <int> <fct>           
## 1     1  1226     1 liberal         
## 2     2  1227     1 unknown         
## 3     3  1228     1 conservative    
## 4     4  1226     1 liberal         
## 5     5  1226     1 liberal         
## 6     6  1228     1 conservative    
## # ℹ 1,227 more rows
## #
## # A tibble: 1,301 × 3
##   name            conservative_elites liberal_elites
##   <chr>                         <dbl>          <dbl>
## 1 01507db5dba34b2                   0              0
## 2 0DarylZero                        0              0
## 3 10thCrusader                      0              0
## # ℹ 1,298 more rows
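
The .N() lookup used above can be easier to follow on a tiny graph. The following is a minimal sketch with made-up accounts; the column names is_cnn, target_name, and to_cnn are hypothetical:

```r
library(tidygraph)
library(dplyr)

# Toy retweet network: user1 retweets CNN, user2 retweets FoxNews
toy <- tbl_graph(
  nodes = data.frame(name = c("user1", "user2", "CNN", "FoxNews")),
  edges = data.frame(from = c("user1", "user2"), to = c("CNN", "FoxNews"))
)

toy_labeled <- toy %>%
  activate(nodes) %>%
  mutate(is_cnn = name == "CNN") %>%        # a node attribute
  activate(edges) %>%
  mutate(target_name = .N()$name[to],       # name of the node each edge points to
         to_cnn = .N()$is_cnn[to])          # node attribute copied onto the edge

toy_edge_table <- toy_labeled %>%
  activate(edges) %>%
  as_tibble()
toy_edge_table
```

In the edge table, from and to are stored as row numbers of the node table, which is why they can be used as an index into .N()$name.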

3. Visualizing

We use the package ggraph to visualize our network.

The grammar of ggraph is similar to ggplot2. We use + rather than %>% to connect the commands.

A simple plot:

set.seed(10000)
ggraph(data_graph_ideology, layout = "kk") + 
  geom_node_point() +
  geom_edge_link()
## Warning: Using the `size` aesthetic in this geom was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` in the `default_aes` field and elsewhere instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Change the style of the plot:

set.seed(10000)
ggraph(data_graph_ideology, layout = "graphopt") + 
  geom_node_point(size = 0.3) +
  geom_edge_diagonal(aes(color = retweet_ideology, width = n), # color by group
                 arrow = arrow(length = unit(1, 'mm')), # add arrows
                 start_cap = circle(0.3, 'mm'), end_cap = circle(0.3, 'mm')) + # add space between edge and node
  scale_edge_width(range = c(0.1, 0.7), guide = "none") + 
  scale_edge_color_manual(values=c("tomato","steelblue","grey")) + # group colors
  theme_graph() # use theme for graph (e.g. no background)

More on layouts: https://www.data-imaginist.com/2017/ggraph-introduction-layouts/

More on styles of nodes: https://www.data-imaginist.com/2017/ggraph-introduction-nodes/

More on styles of edges: https://www.data-imaginist.com/2017/ggraph-introduction-edges/
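
A layout is, in the end, a table of x and y coordinates for the nodes. The sketch below uses ggraph's create_layout() on a toy ring graph to inspect those coordinates. The set.seed() calls in this tutorial are presumably there because some layout algorithms start from random positions; the comparison at the end is one way to check whether a fixed seed reproduces the same layout on your machine:

```r
library(tidygraph)
library(ggraph)

toy_ring <- create_ring(6)  # toy graph: six nodes connected in a circle

# Compute the same layout twice with the same seed
set.seed(10000)
layout_a <- create_layout(toy_ring, layout = "graphopt")
set.seed(10000)
layout_b <- create_layout(toy_ring, layout = "graphopt")

# Each row is a node; x and y are the coordinates ggraph would plot
layout_a$x
layout_a$y

# Compare the two runs
all.equal(layout_a$x, layout_b$x)
```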

4. Calculating network measures

Centrality

Here we calculate two types of centrality: in-degree centrality for the nodes and betweenness centrality for the edges:

data_graph_centrality <- data_graph_ideology %>%
  activate(nodes) %>%
  mutate(in_degree = centrality_degree(mode = "in")) %>%
  activate(edges) %>%
  mutate(edge_betweenness = centrality_edge_betweenness())

data_graph_centrality 
## # A tbl_graph: 1301 nodes and 1233 edges
## #
## # A directed multigraph with 69 components
## #
## # A tibble: 1,233 × 5
##    from    to     n retweet_ideology edge_betweenness
##   <int> <int> <int> <fct>                       <dbl>
## 1     1  1226     1 liberal                         1
## 2     2  1227     1 unknown                         1
## 3     3  1228     1 conservative                    1
## 4     4  1226     1 liberal                         1
## 5     5  1226     1 liberal                         1
## 6     6  1228     1 conservative                    1
## # ℹ 1,227 more rows
## #
## # A tibble: 1,301 × 4
##   name            conservative_elites liberal_elites in_degree
##   <chr>                         <dbl>          <dbl>     <dbl>
## 1 01507db5dba34b2                   0              0         0
## 2 0DarylZero                        0              0         0
## 3 10thCrusader                      0              0         0
## # ℹ 1,298 more rows
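
In a retweet network, a node's in-degree is the number of edges pointing at it, that is, how many accounts retweeted it. A minimal sketch on a made-up network (the account names u1 to u4 are hypothetical):

```r
library(tidygraph)
library(dplyr)

# Toy network: three users retweet CNN, one retweets FoxNews
toy_rt <- tbl_graph(
  nodes = data.frame(name = c("u1", "u2", "u3", "u4", "CNN", "FoxNews")),
  edges = data.frame(from = c("u1", "u2", "u3", "u4"),
                     to   = c("CNN", "CNN", "CNN", "FoxNews"))
)

toy_centrality <- toy_rt %>%
  activate(nodes) %>%
  mutate(in_degree = centrality_degree(mode = "in")) %>%  # incoming edges per node
  as_tibble()
toy_centrality
```

Here CNN ends up with an in-degree of 3, FoxNews with 1, and the four users with 0, since nobody retweeted them.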

Update the plot using node in-degree centrality:

set.seed(10000)
ggraph(data_graph_centrality, layout = "graphopt") + 
  geom_node_point(aes(size = in_degree)) + # NEW LINE
  scale_size_continuous(range = c(0.1, 10)) + # NEW LINE
  geom_node_text(aes(label = name, filter = in_degree > 1), size = 2.5, repel = TRUE) + # NEW LINE
  geom_edge_diagonal(aes(color = retweet_ideology, width = n), 
                 arrow = arrow(length = unit(1, 'mm')),
                 alpha = 0.5) + # REVISED
  scale_edge_width(range = c(0.1, 0.7), guide = "none") + 
  scale_edge_color_manual(values=c("tomato","steelblue","grey")) + 
  theme_graph() 

Community detection

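We can see how the nodes “cluster” together by performing community detection. Here we use group_components(), which puts nodes in the same group when they belong to the same connected component. See a list of different community detection algorithms here: https://tidygraph.data-imaginist.com/reference/group_graph.html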

data_graph_community <- data_graph_centrality %>% 
  activate(nodes) %>% 
  mutate(community = group_components())

summary(factor(as.data.frame(data_graph_community)$community))
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
## 556 532  30  28  10   6   4   4   4   3   3   3   3   3   3   3   3   2   2   2 
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
##  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
##  61  62  63  64  65  66  67  68  69 
##   2   2   2   2   2   2   2   2   1
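
To see what grouping by component means, here is a small sketch on a made-up graph with two separate pieces (node names a to e are hypothetical):

```r
library(tidygraph)
library(dplyr)

# Toy graph with two disconnected pieces: a-b-c and d-e
toy_groups <- tbl_graph(
  nodes = data.frame(name = c("a", "b", "c", "d", "e")),
  edges = data.frame(from = c("a", "b", "d"), to = c("b", "c", "e"))
) %>%
  activate(nodes) %>%
  mutate(community = group_components()) %>%  # same id = same connected component
  as_tibble()

toy_groups
table(toy_groups$community)  # number of nodes in each group
```

Nodes a, b, and c share one group id and d and e share another, because no edge links the two pieces.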

We simplify this result by combining the smaller communities:

data_graph_community <- data_graph_community %>% 
  activate(nodes) %>% 
  mutate(community_clean = factor(ifelse(community %in% 1:5, community, 6)))

summary(factor(as.data.frame(data_graph_community)$community_clean))
##   1   2   3   4   5   6 
## 556 532  30  28  10 145

Visualize the network using community detection. Question: How is this plot different from the last plot?

set.seed(10000)
ggraph(data_graph_community, layout = "graphopt") + 
  geom_node_point(aes(size = in_degree, color = community_clean)) +
  scale_size_continuous(range = c(0.1, 10)) +
  geom_node_text(aes(label = name, filter = in_degree > 1), size = 2.5, repel = TRUE)+
  geom_edge_diagonal(width = 0.1, alpha = 0.3) +
  theme_graph() 

Global measures

We can also calculate measures for the whole network. See more here: https://tidygraph.data-imaginist.com/reference/graph_measures.html

data_graph_global <- data_graph_community %>%
  mutate(diameter = graph_diameter(directed = FALSE), # What is diameter?
         number_of_edges = graph_size(),
         number_of_nodes = graph_order(),
         graph_clique = graph_clique_num(),# Get the size of the largest clique
         graph_reciprocity = graph_reciprocity()) %>% # Measures the proportion of mutual connections in the graph
  as_tibble()
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `graph_clique = graph_clique_num()`.
## Caused by warning in `clique_num()`:
## ! At core/cliques/maximal_cliques_template.h:269 : Edge directions are ignored for maximal clique calculation.
str(data_graph_global)
## tibble [1,301 × 11] (S3: tbl_df/tbl/data.frame)
##  $ name               : chr [1:1301] "01507db5dba34b2" "0DarylZero" "10thCrusader" "123_talent" ...
##  $ conservative_elites: num [1:1301] 0 0 0 0 0 0 0 0 0 0 ...
##  $ liberal_elites     : num [1:1301] 0 0 0 0 0 0 0 0 0 0 ...
##  $ in_degree          : Named num [1:1301] 0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "names")= chr [1:1301] "01507db5dba34b2" "0DarylZero" "10thCrusader" "123_talent" ...
##  $ community          : int [1:1301] 2 1 1 2 2 1 1 2 2 1 ...
##  $ community_clean    : Factor w/ 6 levels "1","2","3","4",..: 2 1 1 2 2 1 1 2 2 1 ...
##  $ diameter           : num [1:1301] 7 7 7 7 7 7 7 7 7 7 ...
##  $ number_of_edges    : num [1:1301] 1233 1233 1233 1233 1233 ...
##  $ number_of_nodes    : int [1:1301] 1301 1301 1301 1301 1301 1301 1301 1301 1301 1301 ...
##  $ graph_clique       : int [1:1301] 2 2 2 2 2 2 2 2 2 2 ...
##  $ graph_reciprocity  : num [1:1301] 0 0 0 0 0 0 0 0 0 0 ...
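
As a hint for the diameter question: the diameter is the length of the longest shortest path between any two nodes. The sketch below applies the same measures to a toy path graph of four nodes in a line, where the two end nodes are three steps apart:

```r
library(tidygraph)
library(dplyr)

toy_path <- create_path(4)  # toy graph: 1 - 2 - 3 - 4

toy_global <- toy_path %>%
  mutate(diameter = graph_diameter(directed = FALSE),  # longest shortest path
         number_of_edges = graph_size(),
         number_of_nodes = graph_order()) %>%
  as_tibble()
toy_global
```

As in the retweet example, each graph-level value is repeated on every row, because mutate() stores it as a node column.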

Because much of tidygraph is a wrapper around igraph, you can use igraph functions directly on tidygraph objects. Below we calculate network density using an igraph function:

library(igraph)
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following object is masked from 'package:tidygraph':
## 
##     groups
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
igraph::edge_density(data_graph)
## [1] 0.000729025
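
For a directed graph without self-loops, density is the number of edges divided by the number of possible edges, n × (n − 1). As a quick check, the value above can be reproduced by hand from the node and edge counts printed earlier (1301 nodes, 1233 edges); the toy graph at the end is made up for illustration:

```r
# Counts taken from the printed tbl_graph summary above
n_nodes <- 1301
n_edges <- 1233

# Directed graph, no self-loops: n * (n - 1) possible edges
density_by_hand <- n_edges / (n_nodes * (n_nodes - 1))
density_by_hand

# Same idea on a toy directed graph: 1 -> 2 -> 3 (3 nodes, 2 edges)
toy_directed <- igraph::make_graph(c(1, 2, 2, 3), directed = TRUE)
toy_density <- igraph::edge_density(toy_directed)
toy_density  # 2 edges out of 3 * 2 = 6 possible
```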

Exercise

  • Step 0: Clean your environment using rm(list=ls()). Open a new R file, SAVE it to your folder, and set the working directory.

  • Step 1: Read “network exercise.csv” into your environment. Inspect the data.

    • This file is a small sample of the Reddit Hyperlink Network data posted in the Stanford Large Network Dataset Collection. This dataset aggregates Reddit posts that create hyperlinks from one subreddit to another subreddit.

    • SOURCE_SUBREDDIT: the subreddit where the link originates

    • TARGET_SUBREDDIT: the subreddit where the link ends

    • POST_ID: the post in the source subreddit that starts the link

    • TIMESTAMP: time of the post

    • LINK_SENTIMENT: sentiment towards the target post. The value is -1 if the source is negative towards the target post, and 1 if it is neutral or positive

  • Step 2: Preprocess the data.

    • Count the number of times a subreddit is hyperlinked by a post from another subreddit

    • Transform the data to tidygraph format

  • Step 3: Calculate the measures below using the code we’ve gone over in class.

    • Node in-degree centrality

    • Edge betweenness centrality

  • Step 4: Calculate two new measures by modifying the code we’ve gone over and reading the documentation.

  • Step 5: Perform community detection.

  • Step 6: Visualize the network using at least two of the measures you calculated in Steps 3 to 5.

  • Step 7: Share your visualization with two other people. Compare your plots. Remember, programming is both science and art.