Download “example retweets.csv” and “network exercise.csv” to your folder
Open a new R file and SAVE it to your folder
Set working directory to your folder
We will focus on two packages: tidygraph for
preprocessing and analysis, and ggraph for visualization.
They are both developed by Thomas Lin Pedersen.
There are several other packages widely used for network analysis and
visualization, such as igraph and network.
Although tidygraph and ggraph are newer
packages than igraph and network, they have a
big advantage: they bring network analysis into the tidyverse
workflow.
Within the tidy framework we’ve been using so far (for textual analyses), network data can seem harder to grasp. Relational data do not line up neatly with the tidy data idea: a network has two kinds of observations (nodes and edges), so it cannot be encoded in any meaningful way as a single tidy data frame.
To resolve this discrepancy, tidygraph and
ggraph adopt the tidy data idea. You can view
tidygraph as an extension of dplyr: it allows
us to use the dplyr functions we’ve grown familiar with to
manipulate network data the way we manipulate data frames. You can view
ggraph as an extension of ggplot2.
More on tidygraph and ggraph: https://www.data-imaginist.com/2017/introducing-tidygraph/
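To make the "two tables" idea concrete, here is a minimal sketch with made-up data (the account names and the `toy_graph` object are hypothetical, not part of the course files). It stores the nodes and the edges as two separate data frames and lets `tbl_graph()` hold them together in one object.

```r
library(tidygraph)

# Hypothetical toy data: three accounts and two retweets
nodes <- data.frame(name = c("alice", "bob", "carol"))

# from/to are row numbers in the nodes table: alice -> carol, bob -> carol
edges <- data.frame(from = c(1, 2), to = c(3, 3))

# tbl_graph() keeps the two tidy tables side by side in a single object
toy_graph <- tbl_graph(nodes = nodes, edges = edges, directed = TRUE)
toy_graph
```

Printing the object shows both tables, which is the same layout we will see for the retweet network later on.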
#install.packages("tidygraph")
#install.packages("ggraph")
library(tidygraph)
##
## Attaching package: 'tidygraph'
## The following object is masked from 'package:stats':
##
## filter
library(ggraph)
## Loading required package: ggplot2
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Usually, the data we have is in a raw, textual format:
data <- read.csv("example retweets.csv")
str(data)
## 'data.frame': 1246 obs. of 6 variables:
## $ id : int 6672 6673 6705 6707 6712 6714 6717 6718 6720 6724 ...
## $ date : chr "10/30/16" "10/30/16" "10/30/16" "10/30/16" ...
## $ screen_name : chr "canadaballer22" "bigman9378" "Bojispirit" "StCathWriter" ...
## $ text : chr "RT @CNN: Fake news has become a plague on the web, says @brianstelter. His advice? Triple check before you shar"| __truncated__ "RT @CNN: Fake news has become a plague on the web, says @brianstelter. His advice? Triple check before you shar"| __truncated__ "RT @CNN: Fake news has become a plague on the web, says @brianstelter. His advice? Triple check before you shar"| __truncated__ "RT @CNN: Fake news has become a plague on the web, says @brianstelter. His advice? Triple check before you shar"| __truncated__ ...
## $ follower_count : int 11 74 21 4453 311 8956 7 334 3282 1474 ...
## $ following_count: int 269 321 282 3285 176 1433 79 2510 4959 4087 ...
We need to identify the accounts that retweeted others (from) and the accounts being retweeted (to):
data_clean <- data %>%
mutate(from = screen_name, to = gsub("RT @(\\S*): .*","\\1",text)) %>%
select(from, to) %>% # Question: What does select do?
count(from, to) # Question: What does this line do?
head(data_clean)
## from to n
## 1 01507db5dba34b2 CNN 1
## 2 0DarylZero _Makada_ 1
## 3 10thCrusader realDonaldTrump 1
## 4 123_talent CNN 1
## 5 1966OldSchool CNN 1
## 6 1DanCox realDonaldTrump 1
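To see what the regular expression is doing, here is a small base R sketch on two example strings (the first is shortened from the data above, the second is made up). Note that `gsub()` returns the text unchanged when the pattern does not match, so a tweet that is not a retweet would end up with its full text in `to`. The `grepl()` check is one possible safeguard, offered as a suggestion rather than as part of the original workflow.

```r
# Example tweets: one retweet and one regular tweet (the second is hypothetical)
tweets <- c(
  "RT @CNN: Fake news has become a plague on the web",
  "Just a regular tweet, no retweet here"
)

# Same pattern as above: keep only the handle between "RT @" and ": "
handles <- gsub("RT @(\\S*): .*", "\\1", tweets)
handles

# Possible safeguard: flag which rows actually start with "RT @handle: "
is_retweet <- grepl("^RT @\\S+: ", tweets)
is_retweet
```

If your own data mix retweets and original tweets, filtering on something like `is_retweet` before counting would keep non-retweets out of the edge list.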
Up to now, we have been working with a data frame. We need to transform it into a “network type” (where there are nodes and edges).
The package tidygraph provides an easy way for us to do
that (as_tbl_graph):
data_graph <- as_tbl_graph(data_clean)
data_graph
## # A tbl_graph: 1301 nodes and 1233 edges
## #
## # A directed multigraph with 69 components
## #
## # A tibble: 1,301 × 1
## name
## <chr>
## 1 01507db5dba34b2
## 2 0DarylZero
## 3 10thCrusader
## 4 123_talent
## 5 1966OldSchool
## 6 1DanCox
## # ℹ 1,295 more rows
## #
## # A tibble: 1,233 × 3
## from to n
## <int> <int> <int>
## 1 1 1226 1
## 2 2 1227 1
## 3 3 1228 1
## # ℹ 1,230 more rows
#View(data_graph)
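Notice that from and to in the edge table are now integers rather than account names. The sketch below, on a hypothetical three-row edge list shaped like data_clean, illustrates why: the node table gets one row per unique name, and each edge points to row numbers in that table.

```r
library(tidygraph)
library(dplyr)

# Hypothetical edge list in the same from / to / n shape as data_clean
toy_edges <- data.frame(
  from = c("user_a", "user_b", "user_c"),
  to   = c("CNN", "CNN", "FoxNews"),
  n    = c(1, 2, 1)
)

toy_graph <- as_tbl_graph(toy_edges)

# Node table: one row per unique account name
toy_nodes <- toy_graph %>% activate(nodes) %>% as_tibble()
toy_nodes

# Edge table: from / to are now row numbers in the node table
toy_edges_tbl <- toy_graph %>% activate(edges) %>% as_tibble()
toy_edges_tbl
```

Extra columns in the edge list (here n) are carried along as edge attributes.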
We will mainly work with this tidygraph object (i.e.,
network data) from now on. We use activate() to tell R whether
we want to work on the nodes part or the edges part of
the object.
As a side note, you can get the tidygraph object back
into data frames easily:
data_nodes <- data_graph %>%
activate(nodes) %>%
as_tibble()
#View(data_nodes)
data_edges <- data_graph %>%
activate(edges) %>%
as_tibble()
#View(data_edges)
Okay, back to the tidygraph object. As noted above, one
of the biggest advantages of tidygraph is that we can use
dplyr functions to manipulate the nodes or the edges in a
way very similar to how we manipulate data frames.
Let’s unpack the code below:
data_graph_ideology <- data_graph %>%
activate(nodes) %>%
mutate(conservative_elites = ifelse(name=="realDonaldTrump" | name=="FoxNews" | name=="BreitbartNews",1,0),
liberal_elites = ifelse(name=="BernieSanders" | name=="CNN" | name=="HuffPostPol",1,0)) %>%
activate(edges) %>%
mutate(retweet_ideology = factor(ifelse(.N()$conservative_elites[to] == 1, "conservative",
ifelse(.N()$liberal_elites[to] == 1, "liberal",
"unknown"))))
data_graph_ideology
## # A tbl_graph: 1301 nodes and 1233 edges
## #
## # A directed multigraph with 69 components
## #
## # A tibble: 1,233 × 4
## from to n retweet_ideology
## <int> <int> <int> <fct>
## 1 1 1226 1 liberal
## 2 2 1227 1 unknown
## 3 3 1228 1 conservative
## 4 4 1226 1 liberal
## 5 5 1226 1 liberal
## 6 6 1228 1 conservative
## # ℹ 1,227 more rows
## #
## # A tibble: 1,301 × 3
## name conservative_elites liberal_elites
## <chr> <dbl> <dbl>
## 1 01507db5dba34b2 0 0
## 2 0DarylZero 0 0
## 3 10thCrusader 0 0
## # ℹ 1,298 more rows
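The least familiar piece in the code above is probably `.N()`. While the edges are active, `.N()` gives access to the node table, and indexing a node column with `[to]` looks up the value for each edge's target node. Here is a minimal sketch on a hypothetical two-edge graph (the names and the `target_name` column are made up for illustration):

```r
library(tidygraph)
library(dplyr)

# Hypothetical toy graph: user_a -> CNN, user_b -> FoxNews
toy_graph <- as_tbl_graph(data.frame(
  from = c("user_a", "user_b"),
  to   = c("CNN", "FoxNews")
))

toy_graph <- toy_graph %>%
  activate(edges) %>%
  # .N() is the node table; [to] picks the row of each edge's target node
  mutate(target_name = .N()$name[to])

toy_edges_tbl <- toy_graph %>% activate(edges) %>% as_tibble()
toy_edges_tbl
```

In the retweet example the same lookup is applied to conservative_elites and liberal_elites instead of name, so each edge is labeled by the type of account being retweeted.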
We use the package ggraph to visualize our network.
The grammar of ggraph is similar to
ggplot2. We use + rather than
%>% to connect the commands.
A simple plot:
set.seed(10000)
ggraph(data_graph_ideology, layout = "kk") +
geom_node_point() +
geom_edge_link()
## Warning: Using the `size` aesthetic in this geom was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` in the `default_aes` field and elsewhere instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Change the style of plot:
set.seed(10000)
ggraph(data_graph_ideology, layout = "graphopt") +
geom_node_point(size = 0.3) +
geom_edge_diagonal(aes(color = retweet_ideology, width = n), # color by group
arrow = arrow(length = unit(1, 'mm')), # add arrows
start_cap = circle(0.3, 'mm'), end_cap = circle(0.3, 'mm')) + # add space between edge and node
scale_edge_width(range = c(0.1, 0.7), guide = "none") +
scale_edge_color_manual(values=c("tomato","steelblue","grey")) + # group colors
theme_graph() # use theme for graph (e.g. no background)
More on layouts: https://www.data-imaginist.com/2017/ggraph-introduction-layouts/
More on styles of nodes: https://www.data-imaginist.com/2017/ggraph-introduction-nodes/
More on styles of edges: https://www.data-imaginist.com/2017/ggraph-introduction-edges/
Here we calculate two types of centrality:
data_graph_centrality <- data_graph_ideology %>%
activate(nodes) %>%
mutate(in_degree = centrality_degree(mode = "in")) %>%
activate(edges) %>%
mutate(edge_betweenness = centrality_edge_betweenness())
data_graph_centrality
## # A tbl_graph: 1301 nodes and 1233 edges
## #
## # A directed multigraph with 69 components
## #
## # A tibble: 1,233 × 5
## from to n retweet_ideology edge_betweenness
## <int> <int> <int> <fct> <dbl>
## 1 1 1226 1 liberal 1
## 2 2 1227 1 unknown 1
## 3 3 1228 1 conservative 1
## 4 4 1226 1 liberal 1
## 5 5 1226 1 liberal 1
## 6 6 1228 1 conservative 1
## # ℹ 1,227 more rows
## #
## # A tibble: 1,301 × 4
## name conservative_elites liberal_elites in_degree
## <chr> <dbl> <dbl> <dbl>
## 1 01507db5dba34b2 0 0 0
## 2 0DarylZero 0 0 0
## 3 10thCrusader 0 0 0
## # ℹ 1,298 more rows
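If the meaning of in-degree is not obvious, a tiny hand-checkable example may help. In the hypothetical graph below, two users retweet CNN and CNN retweets user_c, so we can count the incoming and outgoing edges of every node by eye and compare with what `centrality_degree()` returns.

```r
library(tidygraph)
library(dplyr)

# Hypothetical toy graph: user_a -> CNN, user_b -> CNN, CNN -> user_c
toy_graph <- as_tbl_graph(data.frame(
  from = c("user_a", "user_b", "CNN"),
  to   = c("CNN", "CNN", "user_c")
)) %>%
  activate(nodes) %>%
  mutate(in_degree  = centrality_degree(mode = "in"),   # times retweeted by others
         out_degree = centrality_degree(mode = "out"))  # times retweeting others

toy_nodes <- toy_graph %>% activate(nodes) %>% as_tibble()
toy_nodes
```

In a retweet network, a high in-degree therefore marks accounts that many others retweet. Edge betweenness, in contrast, counts the shortest paths that run through each edge.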
Update the plot using node in-degree centrality:
set.seed(10000)
ggraph(data_graph_centrality, layout = "graphopt") +
geom_node_point(aes(size = in_degree)) + # NEW LINE
scale_size_continuous(range = c(0.1, 10)) + # NEW LINE
geom_node_text(aes(label = name, filter = in_degree > 1), size = 2.5, repel = TRUE) + # NEW LINE
geom_edge_diagonal(aes(color = retweet_ideology, width = n),
arrow = arrow(length = unit(1, 'mm')),
alpha = 0.5) + # REVISED
scale_edge_width(range = c(0.1, 0.7), guide = "none") +
scale_edge_color_manual(values=c("tomato","steelblue","grey")) +
theme_graph()
We can see how the nodes “cluster” together by performing community detection. Here we use group_components(), which simply assigns each node to its connected component. See a list of different community detection algorithms here: https://tidygraph.data-imaginist.com/reference/group_graph.html
data_graph_community <- data_graph_centrality %>%
activate(nodes) %>%
mutate(community = group_components())
summary(factor(as.data.frame(data_graph_community)$community))
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 556 532 30 28 10 6 4 4 4 3 3 3 3 3 3 3 3 2 2 2
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 61 62 63 64 65 66 67 68 69
## 2 2 2 2 2 2 2 2 1
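To see what `group_components()` does, here is a minimal sketch on a hypothetical graph with two disconnected pieces: two users retweeting CNN, and one user retweeting FoxNews. Nodes that can reach each other (ignoring edge direction, which we assume is the default behavior) should share a community number.

```r
library(tidygraph)
library(dplyr)

# Hypothetical toy graph with two separate pieces
toy_graph <- as_tbl_graph(data.frame(
  from = c("user_a", "user_b", "user_c"),
  to   = c("CNN", "CNN", "FoxNews")
)) %>%
  activate(nodes) %>%
  mutate(community = group_components())

toy_nodes <- toy_graph %>% activate(nodes) %>% as_tibble()
toy_nodes
```

This also explains the many communities of size 2 in the output above: each is a single account retweeting one other account, with no link to the rest of the network.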
We simplify this result by combining the smaller communities:
data_graph_community <- data_graph_community %>%
activate(nodes) %>%
mutate(community_clean = factor(ifelse(community==1|community==2|community==3|community==4|community==5, community, 6)))
summary(factor(as.data.frame(data_graph_community)$community_clean))
## 1 2 3 4 5 6
## 556 532 30 28 10 145
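As a side note, the long chain of `community==1|community==2|...` can be written more compactly. The base R sketch below uses a made-up vector of community labels (not the real data) and a single comparison; it is one possible shorthand and should give the same grouping as the code above.

```r
# Hypothetical community labels, standing in for the community column
community <- c(1, 2, 3, 4, 5, 6, 7, 69)

# Keep communities 1 to 5 as they are; lump everything else into group 6
community_clean <- factor(ifelse(community <= 5, community, 6))
community_clean

# Count how many nodes fall into each group
table(community_clean)
```

Inside the pipeline, the same expression would go in the `mutate()` call in place of the longer condition.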
Visualize the network using community detection. Question: How is this plot different from the last plot?
set.seed(10000)
ggraph(data_graph_community, layout = "graphopt") +
geom_node_point(aes(size = in_degree, color = community_clean)) +
scale_size_continuous(range = c(0.1, 10)) +
geom_node_text(aes(label = name, filter = in_degree > 1), size = 2.5, repel = TRUE)+
geom_edge_diagonal(width = 0.1, alpha = 0.3) +
theme_graph()
We can also calculate measures for the whole network. See more here: https://tidygraph.data-imaginist.com/reference/graph_measures.html
data_graph_global <- data_graph_community %>%
mutate(diameter = graph_diameter(directed = FALSE), # What is diameter?
number_of_edges = graph_size(),
number_of_nodes = graph_order(),
graph_clique = graph_clique_num(),# Get the size of the largest clique
graph_reciprocity = graph_reciprocity()) %>% # Measures the proportion of mutual connections in the graph
as_tibble()
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `graph_clique = graph_clique_num()`.
## Caused by warning in `clique_num()`:
## ! At core/cliques/maximal_cliques_template.h:269 : Edge directions are ignored for maximal clique calculation.
str(data_graph_global)
## tibble [1,301 × 11] (S3: tbl_df/tbl/data.frame)
## $ name : chr [1:1301] "01507db5dba34b2" "0DarylZero" "10thCrusader" "123_talent" ...
## $ conservative_elites: num [1:1301] 0 0 0 0 0 0 0 0 0 0 ...
## $ liberal_elites : num [1:1301] 0 0 0 0 0 0 0 0 0 0 ...
## $ in_degree : Named num [1:1301] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "names")= chr [1:1301] "01507db5dba34b2" "0DarylZero" "10thCrusader" "123_talent" ...
## $ community : int [1:1301] 2 1 1 2 2 1 1 2 2 1 ...
## $ community_clean : Factor w/ 6 levels "1","2","3","4",..: 2 1 1 2 2 1 1 2 2 1 ...
## $ diameter : num [1:1301] 7 7 7 7 7 7 7 7 7 7 ...
## $ number_of_edges : num [1:1301] 1233 1233 1233 1233 1233 ...
## $ number_of_nodes : int [1:1301] 1301 1301 1301 1301 1301 1301 1301 1301 1301 1301 ...
## $ graph_clique : int [1:1301] 2 2 2 2 2 2 2 2 2 2 ...
## $ graph_reciprocity : num [1:1301] 0 0 0 0 0 0 0 0 0 0 ...
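Notice in the output above that each whole-network measure is repeated on every node row, because `mutate()` adds it as a node column. If we only want the single value, one option is tidygraph's `with_graph()`, which evaluates a tidygraph function against a graph outside of `mutate()`. Below is a minimal sketch on a hypothetical toy graph:

```r
library(tidygraph)

# Hypothetical toy graph: user_a -> CNN, user_b -> CNN, CNN -> user_c
toy_graph <- as_tbl_graph(data.frame(
  from = c("user_a", "user_b", "CNN"),
  to   = c("CNN", "CNN", "user_c")
))

# Each call returns one value for the whole graph
n_nodes <- with_graph(toy_graph, graph_order())  # number of nodes
n_edges <- with_graph(toy_graph, graph_size())   # number of edges
c(nodes = n_nodes, edges = n_edges)
```

Applied to our data, `with_graph(data_graph_community, graph_order())` would be the analogous call.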
Because much of tidygraph is a wrapper of
igraph, you can directly use igraph functions
for tidygraph objects. Below we calculate network density
using an igraph function:
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following object is masked from 'package:tidygraph':
##
## groups
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
igraph::edge_density(data_graph)
## [1] 0.000729025
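We can check this number by hand. For a directed network without self-loops, density is the number of observed edges divided by the number of possible edges, n × (n − 1). The sketch below plugs in the node and edge counts printed earlier for data_graph.

```r
# Counts taken from the printed tbl_graph above
n_nodes <- 1301
n_edges <- 1233

# Each node could send an edge to every other node: n * (n - 1) possible edges
possible_edges <- n_nodes * (n_nodes - 1)

density <- n_edges / possible_edges
density
```

The result should agree with the value from `igraph::edge_density()`, which tells us this network is very sparse.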
Step 0: Clean your environment using
rm(list=ls()). Open a new R file, SAVE it to your folder,
and set the working directory.
Step 1: Read “network exercise.csv” into your environment. Inspect the data.
This file is a small sample of the Reddit Hyperlink Network data posted in the Stanford Large Network Dataset Collection. This dataset aggregates Reddit posts that create hyperlinks from one subreddit to another subreddit.
SOURCE_SUBREDDIT: the subreddit where the link originates
TARGET_SUBREDDIT: the subreddit where the link ends
POST_ID: the post in the source subreddit that starts the link
TIMESTAMP: time of the post
LINK_SENTIMENT: sentiment towards the target post. The value is -1 if the source is negative towards the target post, and 1 if it is neutral or positive
Step 2: Preprocess the data.
Count the number of times a subreddit is hyperlinked by a post from another subreddit
Transform the data to tidygraph format
Step 3: Calculate the measures below using the code we’ve gone over in class.
Node in-degree centrality
Edge betweenness centrality
Step 4: Calculate two new measures by modifying the code we’ve gone over and reading the documentation.
Node betweenness centrality is the notion of
being a bridge between others. Use
centrality_betweenness(). https://tidygraph.data-imaginist.com/reference/centrality.html
Node transitivity is the notion of “the friends
of my friends are my friends”. Use local_transitivity() to
calculate the transitivity of each node, that is, the propensity for the
node’s neighbors to be connected. https://tidygraph.data-imaginist.com/reference/local_graph.html
Step 5: Perform community detection.
Step 6: Visualize the network by leveraging at least two measures you’ve calculated in Steps 3-5.
Step 7: Share your visualization with two other people. Compare your plots. Remember, programming is both science and art.