This article shows how to apply graph theory to a social network analysis.
Social Network Analysis is the process of investigating social structures through the use of networks and graph theory. It characterizes networked structures in terms of nodes (individual actors, people, or things within the network) and the ties, edges, or links (relationships or interactions) that connect them.
Below is the list of packages required if you wish to reproduce the code. All code and the dataset are available at my GitHub repo.
options(scipen = 999)
# for data wrangling. very helpful for preparing nodes and edges data
library(tidyverse)
library(lubridate)
# for building network and visualization
library(tidygraph)
library(graphlayouts)
# igraph is a dependency of tidygraph; loaded explicitly just for clarity
library(igraph)
library(ggraph)
# for crawling Twitter data
library(rtweet)
All data is extracted directly from Twitter using the Twitter API. To get access, you first need to create a Twitter Developer app. The tutorial can be accessed on this website. After you have created a Twitter app, you need to create a token using the create_token() function in R. All the access keys can be obtained from the Twitter app.
apikey <- "xxx"
apisecret <- "xxx"
acctoken <- "xxx"
tokensecret <- "xxx"
token <- create_token(app = "xxx",
                      consumer_key = apikey,
                      consumer_secret = apisecret,
                      access_token = acctoken,
                      access_secret = tokensecret)
After you have created a token, you can start searching for tweets. For this illustration, we want to search for all tweets with the hashtag #COVID19.
I have prepared a dataset that contains tweets related to #COVID19. The data covers June 7-9, 2020.
We will create an activity network that visualizes mentions and retweets, the forms of interaction between Twitter users. To build a network, we first need to build a graph. A graph consists of two main elements: edges and vertices (nodes). An edge is the link or connection between two vertices, or in this case, between two users. An edge can be directed (drawn with an arrow to indicate direction) or undirected (no arrow). A graph whose nodes and edges carry additional information, such as the type of interaction, is called a network.
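As a minimal sketch of these definitions, the toy example below builds the same two edges once as a directed graph and once as an undirected graph with igraph. The user names (alice, bob, carol) are made up for illustration and do not come from the dataset.

```r
library(igraph)

# Hypothetical edge list: each row is one connection between two users
toy_edges <- data.frame(from = c("alice", "bob"),
                        to   = c("bob", "carol"))

# The same edges as a directed graph (with arrows) ...
g_directed <- graph_from_data_frame(toy_edges, directed = TRUE)

# ... and as an undirected graph (no arrows)
g_undirected <- graph_from_data_frame(toy_edges, directed = FALSE)

# Both graphs have 3 vertices and 2 edges; only the direction differs
vcount(g_directed)
ecount(g_directed)
is_directed(g_directed)
is_directed(g_undirected)
```

The vertices are derived from the unique names in the from and to columns, which is the same idea used for the real data in the rest of this article.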
Below are the edges that appear in our dataset. Each connection is represented by the columns from and to.
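# Function to clean the mentioned screen names
mention_clean <- function(x){
  # Only entries containing a comma (several screen names) need cleaning
  if(grepl(",", x)){
    # Drop the first character and every character outside the allowed set
    # (letters, digits, blanks and a few punctuation marks such as _ and ,)
    gsub('^.|[^[:alnum:][:blank:]_,?&/\\-]', "", x)
  } else{
    x
  }
}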
# Apply mention_clean function to mentions_screen_name column using sapply()
edge_nn <- tweet %>%
  select(screen_name, is_retweet, mentions_screen_name) %>%
  mutate(mentions_screen_name = sapply(mentions_screen_name, mention_clean)) %>%
  filter(mentions_screen_name != "NA")
# specify interaction type
edge_nn <- edge_nn %>%
  mutate(type = ifelse(is_retweet == "TRUE", "retweet", "mention"))
# separate values in mentions_screen_name by comma
edge_nn <- edge_nn %>%
  select(screen_name, mentions_screen_name, type) %>%
  separate_rows(mentions_screen_name, sep = ",") %>%
  setNames(c("from", "to", "type")) %>%
  count(from, to, type)
edge_nn %>% head()
We might want to inspect how many of the interactions are retweets and how many are mentions or replies.
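One way to do this, sketched here on a made-up edge table with the same columns as edge_nn (from, to, type, n), is to sum the interaction counts per type with dplyr's count(). The rows below are illustrative only and are not taken from the dataset.

```r
library(dplyr)

# Hypothetical edge table shaped like edge_nn: one row per user pair and type,
# with n holding how many times that interaction occurred
toy_edges <- data.frame(
  from = c("alice", "alice", "bob"),
  to   = c("bob", "carol", "carol"),
  type = c("retweet", "mention", "retweet"),
  n    = c(2, 1, 1),
  stringsAsFactors = FALSE
)

# Sum the interaction counts for each type (wt = n weights each row by n)
type_summary <- toy_edges %>%
  count(type, wt = n, name = "interactions")

type_summary
```

Replacing toy_edges with edge_nn should give the same summary for the real data, assuming edge_nn has the columns shown in the previous step.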
We already have the edges for our graph. Now we also need to create the vertices, which are the unique users that appear in the edges.
# create nodes data frame from the unique values in both edge columns
nodes_nn <- data.frame(V = unique(c(edge_nn$from, edge_nn$to)),
                       stringsAsFactors = FALSE)
tail(nodes_nn)
Now we can build the graph using the graph_from_data_frame() function from the igraph package. For this analysis, we will only make an undirected graph (no arrow to indicate direction).
# Build graph data
network_nn <- graph_from_data_frame(d = edge_nn,         # Edges
                                    vertices = nodes_nn, # Vertices
                                    directed = FALSE     # Is the graph directed?
                                    ) %>%
  as_tbl_graph() # Convert the igraph object to a tidygraph tbl_graph
network_nn
## # A tbl_graph: 35327 nodes and 39067 edges
## #
## # An undirected multigraph with 3882 components
## #
## # Node Data: 35,327 x 1 (active)
## name
## <chr>
## 1 _____tigerlily_
## 2 ___ACSaid
## 3 ___yezzy
## 4 __aboutmary
## 5 __allisxn
## 6 __BL00M__
## # ... with 35,321 more rows
## #
## # Edge Data: 39,067 x 4
## from to type n
## <int> <int> <chr> <int>
## 1 1 25504 retweet 1
## 2 2 25505 retweet 1
## 3 3 6773 retweet 1
## # ... with 39,064 more rows
There are several metrics that can be used to analyze the properties of the graph.
The graph density represents how dense the connections in the graph are. The density is the ratio of the number of existing edges to the number of all possible edges.
## [1] 0.0000626093
This shows that, of all possible connections between nodes, only about 0.006% exist in the graph. This suggests that the graph is not very efficient at spreading information, because only a small fraction of the possible connections has been built.
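For an undirected graph with n nodes and m edges, the number of possible edges is n(n-1)/2, so the density is m / (n(n-1)/2). The sketch below checks this on a toy graph using igraph's edge_density(), which is an assumption, since the article does not show which function produced the figure above. It then repeats the arithmetic with the node and edge counts printed for network_nn.

```r
library(igraph)

# Toy undirected graph: 4 nodes, 3 edges (names are illustrative only)
toy_edges <- data.frame(from = c("a", "b", "c"),
                        to   = c("b", "c", "d"))
toy_graph <- graph_from_data_frame(toy_edges, directed = FALSE)

# Density as computed by igraph
toy_density <- edge_density(toy_graph)

# The same value by hand: existing edges / possible edges = 3 / 6
n_nodes <- vcount(toy_graph)
n_edges <- ecount(toy_graph)
manual_density <- n_edges / (n_nodes * (n_nodes - 1) / 2)

# The same arithmetic with the counts printed for network_nn
# (35,327 nodes and 39,067 edges)
article_density <- 39067 / (35327 * 35326 / 2)

c(toy = toy_density, manual = manual_density, article = article_density)
```

The last value is roughly 0.0000626, which matches the density reported above and corresponds to about 0.006% of all possible edges.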
Average path length is the mean shortest-path distance between pairs of nodes in the graph.
## [1] 6.978752
The number indicates that, on average, it takes about 7 steps to travel from one node to another.
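To make the definition concrete, here is a small sketch on a toy chain of four nodes, using igraph's mean_distance(). This is an assumption about the function, since the article does not show the call that produced the figure above. The node names are illustrative only.

```r
library(igraph)

# Toy undirected chain: a - b - c - d
toy_edges <- data.frame(from = c("a", "b", "c"),
                        to   = c("b", "c", "d"))
toy_graph <- graph_from_data_frame(toy_edges, directed = FALSE)

# Average shortest-path length as computed by igraph
toy_avg_path <- mean_distance(toy_graph, directed = FALSE)

# The same value by hand: average the shortest-path lengths of all node pairs
dist_matrix <- distances(toy_graph)
pair_distances <- dist_matrix[upper.tri(dist_matrix)]
manual_avg_path <- mean(pair_distances)

c(igraph = toy_avg_path, manual = manual_avg_path)
```

The six pairs have distances 1, 2, 3, 1, 2 and 1, so both values equal 10/6. By default, mean_distance() only averages over pairs that are actually connected by a path, which matters for a graph with many separate components such as the one in this article.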
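Centrality measures the importance of a single node based on its relations with other nodes. There are several centrality measures, including:
Degree centrality: the number of edges attached to a node.
Betweenness centrality: how often a node lies on the shortest paths between other nodes.
Closeness centrality: how close a node is, on average, to the other nodes.
Eigenvector centrality: how well a node is connected to other well-connected nodes.
For a more detailed resource on graph centrality, you may visit this website.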
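As a rough illustration of how these measures behave, the sketch below computes degree, betweenness, closeness and eigenvector centrality with plain igraph functions on a toy star-shaped graph. The node names are made up, and the tidygraph functions used later in this article are wrappers around the same igraph calculations.

```r
library(igraph)

# Toy star graph: one hub connected to four other users (illustrative names)
toy_edges <- data.frame(from = rep("hub", 4),
                        to   = c("a", "b", "c", "d"))
toy_graph <- graph_from_data_frame(toy_edges, directed = FALSE)

# One row per node, one column per centrality measure
toy_centrality <- data.frame(
  name      = V(toy_graph)$name,
  degree    = degree(toy_graph),                  # number of attached edges
  between   = betweenness(toy_graph),             # shortest paths passing through
  closeness = closeness(toy_graph),               # inverse of total distance to others
  eigen     = eigen_centrality(toy_graph)$vector  # connection to well-connected nodes
)

toy_centrality
```

In this toy graph the hub ranks first on every measure: it has degree 4, and all 6 shortest paths between the outer nodes pass through it. In a real network the measures usually disagree, which is why the article compares the top users of each measure separately.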
We will also detect communities based on the relations between nodes. A community is a collection of nodes that are densely connected to each other.
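A small sketch of the idea, assuming igraph's cluster_louvain() (the function that tidygraph's group_louvain() builds on): the toy graph below consists of two tightly connected groups of four users joined by a single edge. All names are illustrative.

```r
library(igraph)

# Two fully connected groups (a-d and e-h) joined by one bridge edge d - e
toy_edges <- data.frame(
  from = c("a", "a", "a", "b", "b", "c", "e", "e", "e", "f", "f", "g", "d"),
  to   = c("b", "c", "d", "c", "d", "d", "f", "g", "h", "g", "h", "h", "e")
)
toy_graph <- graph_from_data_frame(toy_edges, directed = FALSE)

# Louvain community detection; the seed keeps the run reproducible
set.seed(123)
toy_comm <- cluster_louvain(toy_graph)

# Named vector: community id for each node
toy_membership <- setNames(as.vector(membership(toy_comm)), V(toy_graph)$name)
toy_membership
```

The nodes a to d should end up in one community and e to h in another, because each group has far more edges inside it than the single edge that connects the two.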
# create communities, calculate centrality and remove loop edges
set.seed(123)
network_nn <- network_nn %>%
  activate(nodes) %>%
  mutate(community = group_louvain(),        # Detect communities with the Louvain clustering algorithm
         degree = centrality_degree(),       # Calculate degree centrality
         between = centrality_betweenness(), # Calculate betweenness centrality
         closeness = centrality_closeness(), # Calculate closeness centrality
         eigen = centrality_eigen()) %>%     # Calculate eigenvector centrality
  activate(edges) %>%
  filter(!edge_is_loop())                    # Remove loop edges
network_act_df <- as.data.frame(network_nn %>% activate(nodes))
network_act_df %>%
  head()
We will inspect the top users for each centrality measure (head() returns six rows by default).
kp_activity <- data.frame(
  network_act_df %>% arrange(-degree) %>% select(name) %>% head(),
  network_act_df %>% arrange(-between) %>% select(name) %>% head(),
  network_act_df %>% arrange(-closeness) %>% select(name) %>% head(),
  network_act_df %>% arrange(-eigen) %>% select(name) %>% head()) %>%
  setNames(c("Degree", "Betweenness", "Closeness", "Eigen"))
kp_activity
The top user by degree centrality is the user with the most interactions, whether retweets or mentions.
We will look at some of the most retweeted tweets from DrRobDavidson, who is the executive director of the Committee to Protect Medicare.
tweet %>%
filter(mentions_screen_name == "DrRobDavidson") %>%
arrange(desc(retweet_count)) %>%
distinct(text) %>%
pull(text)
## [1] "Bottom line: If we see a big spike in #COVID19 cases, it's due to lack of testing & contact tracing, & a weak president rushing to reopen society prematurely for political reasons. Don’t let @realdonaldtrump use politics to rewrite history or shift blame. #BlackLivesMatter 8/8"
## [2] "We’re seeing a spike in cases of #COVID19 in AR, AZ, CA, MA, NC, NH, NV, OK, SC, TN, UT, and WA.And FL saw the most cases in 1 day this week. The cause? The rushed & reckless #reopeningofAmerica pushed by @realDonaldTrump while failing to implement a national testing program. 2/8"
## [3] "Further, while there may be risk of getting #COVID19 from gathering close to others, the risk of an African American man dying at the hands of the police in their lifetime is 1/1000. Which means the current state of policing in America is a public health crisis. 5/8"
## [4] "The difference between these protests & anti-lockdown protests of weeks past? Previous protests opposed the very measures that were clearly flattening the curve of #COVID19. This week’s protests oppose measures that are killing Americans. They’re opposites. 6/8"
## [5] "The president has a transparent motive to link a rise in #COVID19 to #peacefulprotests against police brutality. However, reopening entire states nationwide is far more risky than targeted protests in select cities. We wouldn’t see an impact of protests for another few wks. 4/8"
## [6] "#coronavirus #COVID19\n#MemorialDayWeekend \n\nTHREAD\n\nBy @DrRobDavidson https://t.co/F82jwaowmu"
We will also check the top tweets by number of replies from ANI (Asian News International), which mostly covers news about the situation in India.
tweet %>%
filter(mentions_screen_name == "ANI") %>%
arrange(desc(reply_count)) %>%
distinct(text) %>%
pull(text) %>%
head()
## [1] "Delhi Medical Association strongly condemns the way Delhi CM is warning the doctors & threatening hospitals about #COVID19 patients' admissions&tests. FIR on Sir Ganga Ram Hospital is highly condemnable and demoralizing for the whole medical fraternity: Delhi Medical Association https://t.co/SsirANUdVC"
## [2] "Delhi hospitals will be available for the people of Delhi only, while Central hospitals will remain open for all: Delhi Chief Minister Arvind Kejriwal. #COVID19 https://t.co/W66TrJmCr3"
## [3] "#WATCH Delhi hospitals will be available for the people of Delhi only, while Central Govt hospitals will remain open for all. Private hospitals except those where special surgeries like neurosurgery are performed also reserved for Delhi residents: CM Arvind Kejriwal #COVID19 https://t.co/D47nRhXaUZ"
## [4] "Uttarakhand: Preparations for reopening of Badrinath temple underway. However, temple authorities had written to CM&Chamoli DM urging them to keep yatra suspended till June 30. #COVID19\n\nCentre has allowed reopening of religious places from tomorrow. State Govt yet to decide. https://t.co/SiucTzdzPa"
## [5] "A senior officer of the Delhi Disaster Management Authority has tested positive for #COVID19."
## [6] "#WATCH Rajasthan: Social distancing norms violated as people in huge numbers gathered at Pratap Chowk in Baran for the inauguration ceremony of Maharana Pratap’s statue, amid #COVID19 pandemic. Congress MLA Panachand Meghwal also took part in the event. (06.06.2020) https://t.co/rWb4jSLWgh"
Several charts can be used to visualize a network; two popular ones are the chord diagram and the classic network graph.
important_person <- network_act_df %>%
filter(community %in% 1:5) %>%
select(-community) %>%
pivot_longer(-name, names_to = "measures", values_to = "values") %>%
group_by(measures) %>%
arrange(desc(values)) %>%
slice(1:6) %>%
ungroup() %>%
distinct(name) %>%
pull(name)
network_nn %>%
activate(nodes) %>%
mutate(ids = row_number()) %>%
filter(community %in% 1:3) %>% arrange(community,ids) %>%
mutate(node_label = ifelse(name %in% important_person, name, "")) %>%
mutate(node_size = ifelse(name %in% important_person, degree, 0)) %>%
ggraph(layout = "linear", circular = T) +
geom_edge_arc(alpha = 0.05, aes(col = as.factor(type), edge_width = n*0.5)) +
geom_node_label(aes(label = node_label, size = node_size), repel = T,
show.legend = F, fontface = "bold", label.size = 0,
segment.colour="slateblue", fill = "#ffffff66") +
coord_fixed() +
labs(title = "Twitter Activity Network #COVID19",
subtitle = "Retweets and mention between 3 top communities") +
theme_graph() +
guides(edge_width = "none",
edge_colour = guide_legend(title = "Tweet Type",
override.aes = list(edge_alpha = 1))) +
theme(legend.position = "bottom",
plot.title = element_text(size = rel(2)),
plot.subtitle = element_text(size = rel(1)),
legend.text = element_text(size = rel(1)))
The chord diagram shows the relations of each entity with other entities, with a single line indicating each connection. The chord diagram for the 3 separate communities shows that most of the interactions are retweets (blue lines).
We can also visualize the network as a classic network graph. We will visualize only the top 5 communities, since visualizing the whole network would take too much time to process.
set.seed(13)
network_nn %>%
activate(nodes) %>%
mutate(ids = row_number(),
community = as.character(community)) %>%
filter(community %in% 1:5) %>%
arrange(community,ids) %>%
mutate(node_label = ifelse(name %in% important_person, name, "")) %>%
ggraph(layout = "fr") +
geom_edge_link(alpha = 0.3, aes(color = type)) +
geom_node_point(aes(size = degree, fill = community), shape = 21, alpha = 0.7, color = "grey30") +
geom_node_label(aes(label = node_label), repel = T, alpha = 0.5) +
scale_fill_manual(values = c("firebrick", "blue4", "magenta", "green3", "orange")) +
guides(size = "none") +
labs(title = "Top 5 Community of #COVID19",
color = "Interaction", fill = "Community") +
theme_void() +
theme(legend.position = "top")
We can clearly see that most of the interactions are retweets directed at the big nodes inside each community. Important users, including the big nodes (with high degree centrality) and users with high betweenness (who act as bridges between nodes), are highlighted by showing their screen names.