LBB: Social Network Analysis

Team Algoritma

1/20/2022

1 Introduction

For this LBB (Learning By Building) on Social Network Analysis, we will use Twitter data on the trending topic #SecureTheTribe, extracted through the Twitter API. There are 30,000 tweets related to this trending topic. We extracted the data before performing this analysis and saved it into a CSV file named worldwide.csv.
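
For reproducibility, the extraction could have looked roughly like the sketch below. This is a minimal illustration only, assuming an authenticated rtweet token; it is not run as part of this analysis.

# Minimal sketch of the extraction step (assumes an authenticated rtweet token)
library(rtweet)

raw_tweets <- search_tweets("#SecureTheTribe",
                            n = 30000,               # target number of tweets
                            include_rts = TRUE,      # keep retweets
                            retryonratelimit = TRUE) # wait out API rate limits

write_as_csv(raw_tweets, "worldwide.csv")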

2 Library

# Data Wrangling
library(tidyverse) 

# For graph and visualization
library(tidygraph)
library(ggraph)
library(igraph)
library(rtweet)

3 Exploratory Data Analysis

3.1 Read Data

tweets <- read.csv("worldwide.csv")

head(tweets)

Let’s find out how many retweets there are in this data frame.

table(tweets$is_retweet)
## 
## FALSE  TRUE 
## 11051 18949
tweets %>% 
  group_by(screen_name) %>%
  summarise(total_retweet = sum(retweet_count)) %>%
  arrange(desc(total_retweet)) %>% 
  head()

The most retweeted screen_name in this data frame is Whizsword.

3.2 Data Cleansing

tweets %>% 
  select(screen_name, mentions_screen_name)

  • As seen above, the mentions_screen_name column may contain several usernames separated by a space, so rows holding more than one username need to be split apart (see the illustration after this list).

  • Note that the mentions_screen_name column contains no punctuation other than “_” (underscore). We decided not to remove underscores since they are part of Twitter usernames; if removed, we would no longer be able to find the username in the Twitter app.

  • We also noted that mentions_screen_name contains many blank values, which need to be filtered out of the data frame.
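
As a quick illustration of the split we need, separate_rows() from tidyr turns a space-separated mention string into one row per username. The usernames below are hypothetical.

# Toy example of separate_rows() on hypothetical usernames
data.frame(screen_name = "user_a",
           mentions_screen_name = "user_b user_c") %>% 
  separate_rows(mentions_screen_name, sep = " ")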

For data cleansing, we will perform the following steps:

  1. Select the screen_name and mentions_screen_name columns.
  2. Separate rows with multiple names in mentions_screen_name into one row per name.
  3. Delete missing/blank values in mentions_screen_name.
  4. Rename the screen_name and mentions_screen_name columns to from and to.
  5. Save the cleaned data into an object called edge_df.

edge_df <- tweets %>% 
  select(screen_name, mentions_screen_name) %>% # Step 1
  separate_rows(mentions_screen_name, sep = " ") %>% # Step 2
  filter(mentions_screen_name != "") %>% # Step 3
  rename(from = screen_name,
         to = mentions_screen_name) # Step 4

edge_df

After cleaning the tweets data and saving it into the object edge_df, we want to create another object called nodes_df, which contains the unique values from both the from and to columns.

nodes_df <- data.frame(name = unique(c(edge_df$from,edge_df$to)),
                        stringsAsFactors = F)

Now we want to create the graph data using the tbl_graph() function with the undirected-graph parameter (directed = F), assuming relations between nodes go both ways.

graph_tweets <- tbl_graph(nodes = nodes_df,
                          edges = edge_df,
                          directed = F)

3.3 Centrality Measurement

Now we will measure the centrality of each node using:

  • Degree: for finding very connected individuals, popular individuals, individuals who are likely to hold the most information, or individuals who can quickly connect with the wider network.
  • Betweenness: for finding the individuals who influence the flow around a system.
  • Closeness: for finding the individuals who are best placed to influence the entire network most quickly.
  • Eigenvector: measures a node’s influence based on the number of links it has to other nodes, then goes a step further by also taking into account how well connected those nodes are, how many links their connections have, and so on through the network.

For reference on these centrality measures, see the tidygraph documentation.
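
Since a tbl_graph inherits from igraph, the same measures can also be cross-checked with base igraph functions; the following is an optional sketch using the graph object created above.

# Optional cross-check via base igraph (a tbl_graph is also an igraph object)
head(igraph::degree(graph_tweets))
head(igraph::eigen_centrality(graph_tweets)$vector)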

graph_tweets <- graph_tweets %>% 
  activate(nodes) %>%
  mutate(degree = centrality_degree(), # Degree centrality
         between = centrality_betweenness(normalized = T), # Betweenness centrality
         closeness = centrality_closeness(), # Closeness centrality
         eigen = centrality_eigen() # Eigen centrality
         )

# Convert graph data into a data frame
network_act_df <- graph_tweets %>% 
  activate(nodes) %>% 
  as.data.frame()

Now we want to know which username has the highest value on each centrality measure, so that we can consider it a “popular username”.

pop_username <- data.frame(
  network_act_df %>% arrange(-degree) %>% select(name) %>% head(),
  network_act_df %>% arrange(-between) %>% select(name) %>% head(),
  network_act_df %>% arrange(-closeness) %>% select(name) %>% head(),
  network_act_df %>% arrange(-eigen) %>% select(name) %>% head()
) %>% setNames(c("Degree","Betweenness","Closeness","Eigen"))

pop_username

From the result above, we note that the username gimbakakanda has the highest value on every centrality measure, so gimbakakanda can be considered the popular username. gimbakakanda is an author and journalist who writes for Al Jazeera and Daily Trust.

tweets %>% 
  filter(mentions_screen_name == "gimbakakanda") %>%
  group_by(mentions_screen_name, text) %>% 
  tally() %>% 
  arrange(-n) %>%
  pull(text) %>% 
  head(3)
## [1] "We don jam tribalism wey pass our own today.  #SecureTheTribe"           
## [2] "Thank you @gimbakakanda #SecureTheTribe"                                 
## [3] "\"Stop telling your history from 1970.\" - @gimbakakanda #SecureTheTribe"

The most common tweet mentioning gimbakakanda is “We don jam tribalism wey pass our own today.”

3.4 Graph Visualization

We will create clusters using the Louvain method, one of the most popular methods for uncovering community structure.
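
Under the hood, tidygraph’s group_louvain() wraps igraph::cluster_louvain(). As a quick sketch, the same clustering can be run directly to inspect its modularity score (a measure of how well the network divides into communities):

# Sketch: run Louvain directly via igraph and inspect its modularity
cl <- igraph::cluster_louvain(graph_tweets)
igraph::modularity(cl)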

set.seed(123)
graph_tweets <- graph_tweets %>% 
  activate(nodes) %>% 
  mutate(community = group_louvain()) %>% # clustering
  activate(edges) %>% 
  filter(!edge_is_loop())  # Remove loop edges

graph_tweets %>% 
  activate(nodes) %>% 
  as.data.frame() %>% 
  count(community)

In total, 388 clusters were created.
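
Since 388 communities are far too many to inspect one by one, we can sort them by member count to see the largest ones:

# Peek at the largest communities by member count
graph_tweets %>% 
  activate(nodes) %>% 
  as.data.frame() %>% 
  count(community, sort = TRUE) %>% 
  head()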

To gather the important usernames within each cluster, we will create a function and assign it to an object called important_user.

# Function to get the important people in each cluster
important_user <- function(data) {
  name_person <- data %>%
    as.data.frame() %>% 
    filter(community %in% 1:5) %>%  # keep the first 5 communities
    select(-community) %>% 
    pivot_longer(-name, names_to = "measures", values_to = "values") %>% 
    group_by(measures) %>% 
    arrange(desc(values)) %>% 
    slice(1:6) %>%  # top 6 nodes per centrality measure
    ungroup() %>% 
    distinct(name) %>% 
    pull(name)
  
  return(name_person)
}

# Create object containing the important people
important_person <- graph_tweets %>% 
  activate(nodes) %>% 
  important_user()

# Visualization using ggraph
set.seed(123)
graph_tweets %>%
  activate(nodes) %>%
  mutate(ids = row_number(),
         community = as.character(community)) %>%
  filter(community %in% 1:3) %>% # keep the 3 largest communities
  arrange(community,ids) %>%
  mutate(node_label = ifelse(name %in% important_person, name,NA)) %>%
  ggraph(layout = "fr") +
  geom_edge_link(alpha = 0.3 ) +
  geom_node_point(aes(size = degree, fill = community), shape = 21, alpha = 0.7, color = "grey30") +
  geom_node_label(aes(label = node_label), repel = T, alpha = 0.8 ) +
  guides(size = "none") +
  labs(title = "Top 3 Community of #SecureTheTribe",
       color = "Interaction",
       fill = "Community") +
  theme_void() +
  theme(legend.position = "top")

3.5 Conclusion

From the top 3 community visualization above, we note that the most popular username, gimbakakanda, is in community 1 (red). Communities 1 (red) and 2 (green) each have two important people, while community 3 (blue) has only one important person, Lola_OJ. idgaf_whatever sits close to community 2 (green). The denser the community/cluster, the more efficiently information can spread within it.