1 Introduction
For this LBB (Learning By Building) on Social Network Analysis, we use Twitter data on the trending topic #SecureTheTribe, extracted from the Twitter API. The dataset contains 30,000 tweets related to this topic. The data was extracted before this analysis and saved as a CSV file named worldwide.csv.
2 Library
# Data Wrangling
library(tidyverse)
# For graph and visualization
library(tidygraph)
library(ggraph)
library(igraph)
library(rtweet)
3 Exploratory Data Analysis
3.1 Read Data
tweets <- read.csv("worldwide.csv")
head(tweets)
Let’s count how many tweets in this data frame are retweets.
table(tweets$is_retweet)
##
## FALSE TRUE
## 11051 18949
tweets %>%
group_by(screen_name) %>%
summarise(total_retweet = sum(retweet_count)) %>%
arrange(desc(total_retweet)) %>%
head()
The screen_name with the highest total retweet count in this data frame is Whizsword.
3.2 Data cleansing
tweets %>%
select(screen_name, mentions_screen_name)
As seen above:
- The mentions_screen_name column may contain several usernames separated by a space, so we need to split the rows that hold more than one username.
- The mentions_screen_name column contains no punctuation other than “_” (underscore). We keep the underscore because it is part of the Twitter username; without it we could not find the username in the Twitter app.
- The mentions_screen_name column contains many blank values, which need to be filtered out of the data frame.
For data cleansing, we will perform the following steps:
- Select the screen_name and mentions_screen_name string columns.
- Separate multiple names in mentions_screen_name into separate rows.
- Delete missing/blank values in mentions_screen_name.
- Rename the screen_name and mentions_screen_name columns to from and to.
- Save the cleaned data into an object called edge_df.
edge_df <-
  tweets %>%
  select(screen_name, mentions_screen_name) %>%        # Step 1
  separate_rows(mentions_screen_name, sep = " ") %>%   # Step 2
  filter(mentions_screen_name != "") %>%               # Step 3
  rename(from = screen_name,
         to = mentions_screen_name)                    # Step 4
edge_df
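As a quick check of the cleaning steps, the sketch below runs them on a three-row toy data frame. The usernames are hypothetical and not taken from worldwide.csv, and the example assumes, as above, that multiple mentions are separated by a single space.

```r
library(dplyr)
library(tidyr)

# Toy data: hypothetical usernames, not taken from worldwide.csv
toy_tweets <- data.frame(
  screen_name          = c("user_a", "user_b", "user_c"),
  mentions_screen_name = c("user_b user_c", "", "user_a"),
  stringsAsFactors     = FALSE
)

toy_edges <- toy_tweets %>%
  separate_rows(mentions_screen_name, sep = " ") %>%  # one row per mentioned user
  filter(mentions_screen_name != "") %>%              # drop tweets without mentions
  rename(from = screen_name, to = mentions_screen_name)

toy_edges
```

The two mentions by user_a become two edges, and user_b, who mentions nobody, drops out of the edge list.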
After cleaning the tweets data and saving it into edge_df, we create another object called nodes_df, which contains the unique values found in the from and to columns.
nodes_df <- data.frame(name = unique(c(edge_df$from, edge_df$to)),
                       stringsAsFactors = FALSE)
Now we create the graph data using the tbl_graph() function. We build an undirected graph, assuming the relation between nodes goes both ways.
graph_tweets <- tbl_graph(nodes = nodes_df,
                          edges = edge_df,
                          directed = FALSE)
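The effect of the undirected assumption can be seen on a small sketch. The two edges below are hypothetical; the example builds the same edge list once as an undirected and once as a directed graph, then compares node degrees with igraph::degree().

```r
library(tidygraph)

# Hypothetical edge list: user_a mentions user_b, user_c mentions user_a
toy_edges <- data.frame(from = c("user_a", "user_c"),
                        to   = c("user_b", "user_a"),
                        stringsAsFactors = FALSE)
toy_nodes <- data.frame(name = unique(c(toy_edges$from, toy_edges$to)),
                        stringsAsFactors = FALSE)

g_undirected <- tbl_graph(nodes = toy_nodes, edges = toy_edges, directed = FALSE)
g_directed   <- tbl_graph(nodes = toy_nodes, edges = toy_edges, directed = TRUE)

# Undirected: every mention counts for both endpoints
deg_undirected <- igraph::degree(g_undirected)
# Directed: in-degree counts only the mentions a user receives
deg_in <- igraph::degree(g_directed, mode = "in")

deg_undirected
deg_in
```

In the undirected graph user_a has degree 2, while in the directed graph it receives only one mention. Choosing an undirected graph therefore treats mentioning and being mentioned as the same relation.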
3.3 Centrality Measurement
Now we will measure the centrality of each node using:
- Degree: finds highly connected, popular individuals, who are likely to hold the most information or who can quickly connect with the wider network.
- Betweenness: finds the individuals who influence the flow around a system.
- Closeness: finds the individuals who are best placed to influence the entire network most quickly.
- Eigen: measures a node’s influence based on the number of links it has to other nodes in the network, then goes a step further by also taking into account how well connected its neighbours are, how many links their connections have, and so on through the network.
For reference Centrality measurement
graph_tweets <- graph_tweets %>%
  activate(nodes) %>%
  mutate(degree = centrality_degree(),                          # Degree centrality
         between = centrality_betweenness(normalized = TRUE),   # Betweenness centrality
         closeness = centrality_closeness(),                    # Closeness centrality
         eigen = centrality_eigen()                             # Eigenvector centrality
  )
#Convert graph data into data frame
network_act_df <- graph_tweets %>%
activate(nodes) %>%
as.data.frame()
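To get a feel for what the four measures return, here is a minimal sketch on a hypothetical five-node path graph a - b - c - d - e. It is not part of the tweet data. Unlike the code above, betweenness is left unnormalized here so the values are simple counts of shortest paths.

```r
library(dplyr)
library(tidygraph)

# Hypothetical path graph: a - b - c - d - e
toy_graph <- tbl_graph(
  nodes = data.frame(name = letters[1:5], stringsAsFactors = FALSE),
  edges = data.frame(from = 1:4, to = 2:5),
  directed = FALSE
)

toy_centrality <- toy_graph %>%
  activate(nodes) %>%
  mutate(degree    = centrality_degree(),       # number of neighbours
         between   = centrality_betweenness(),  # shortest paths passing through the node
         closeness = centrality_closeness(),    # 1 / sum of distances to all other nodes
         eigen     = centrality_eigen()) %>%    # influence weighted by neighbours' influence
  as_tibble()

toy_centrality
```

The middle node c scores highest on betweenness, closeness and eigenvector centrality, while b, c and d tie on degree. This is why the analysis compares several measures instead of relying on degree alone.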
Now we want to know which accounts have the highest value on each centrality measure; we can consider these the “popular usernames”.
pop_username <- data.frame(
network_act_df %>% arrange(-degree) %>% select(name) %>% head(),
network_act_df %>% arrange(-between) %>% select(name) %>% head(),
network_act_df %>% arrange(-closeness) %>% select(name) %>% head(),
network_act_df %>% arrange(-eigen) %>% select(name) %>% head()
) %>% setNames(c("Degree","Betweenness","Closeness","Eigen"))
pop_username
From the result above, the username gimbakakanda has the highest value on every centrality measure, so we can consider gimbakakanda the most popular username. gimbakakanda is an author and journalist from Aljazeera and Daily Trust.
tweets %>%
filter(mentions_screen_name == "gimbakakanda") %>%
group_by(mentions_screen_name, text) %>%
tally() %>%
arrange(-n) %>%
pull(text) %>%
head(3)
## [1] "We don jam tribalism wey pass our own today. #SecureTheTribe"
## [2] "Thank you @gimbakakanda #SecureTheTribe"
## [3] "\"Stop telling your history from 1970.\" - @gimbakakanda #SecureTheTribe"
Among the tweets that mention gimbakakanda, the most frequent text is “We don jam tribalism wey pass our own today.”
3.4 Graph Visualization
We will create clusters using the Louvain method, one of the most popular methods for uncovering community structure.
set.seed(123)
graph_tweets <- graph_tweets %>%
activate(nodes) %>%
mutate(community = group_louvain()) %>% # clustering
activate(edges) %>%
filter(!edge_is_loop()) # Remove loop edges
graph_tweets %>%
activate(nodes) %>%
as.data.frame() %>%
count(community)
The Louvain method creates 388 clusters.
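As a small illustration of what group_louvain() does, the sketch below uses a hypothetical graph of two triangles joined by a single edge. The node names are made up; the expectation is that each triangle ends up as its own community.

```r
library(dplyr)
library(tidygraph)

# Hypothetical graph: two triangles (1-2-3 and 4-5-6) joined by one edge (3-4)
toy_graph <- tbl_graph(
  nodes = data.frame(name = paste0("user_", 1:6), stringsAsFactors = FALSE),
  edges = data.frame(from = c(1L, 1L, 2L, 4L, 4L, 5L, 3L),
                     to   = c(2L, 3L, 3L, 5L, 6L, 6L, 4L)),
  directed = FALSE
)

set.seed(123)  # Louvain visits nodes in random order
toy_communities <- toy_graph %>%
  activate(nodes) %>%
  mutate(community = group_louvain()) %>%  # community id per node
  as_tibble()

count(toy_communities, community)
```

The single bridge between the triangles is not enough to merge them, so the nodes inside each triangle share one community id.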
To gather information on the important usernames within each cluster, we will create a function called important_user.
# Function to get the important people in each cluster
important_user <- function(data) {
name_person <- data %>%
as.data.frame() %>%
filter(community %in% 1:5) %>%
select(-community) %>%
pivot_longer(-name, names_to = "measures", values_to = "values") %>%
group_by(measures) %>%
arrange(desc(values)) %>%
slice(1:6) %>%
ungroup() %>%
distinct(name) %>%
pull(name)
return(name_person)
}
# Create an object containing the important people
important_person <-
graph_tweets %>%
activate(nodes) %>%
important_user()
# Visualization using ggraph.
set.seed(123)
graph_tweets %>%
activate(nodes) %>%
mutate(ids = row_number(),
community = as.character(community)) %>%
filter(community %in% 1:3) %>% # number of community.
arrange(community,ids) %>%
mutate(node_label = ifelse(name %in% important_person, name,NA)) %>%
ggraph(layout = "fr") +
geom_edge_link(alpha = 0.3 ) +
geom_node_point(aes(size = degree, fill = community), shape = 21, alpha = 0.7, color = "grey30") +
geom_node_label(aes(label = node_label), repel = T, alpha = 0.8 ) +
guides(size = "none") +
labs(title = "Top 3 Community of #SecureTheTribe",
color = "Interaction",
fill = "Community") +
theme_void() +
theme(legend.position = "top")
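The reshaping inside important_user() can be hard to follow on the full graph, so here is a hedged sketch of the same idea on a hypothetical node table. It uses slice_max() instead of arrange() plus slice(), and keeps only the single top user per measure to keep the output small.

```r
library(dplyr)
library(tidyr)

# Hypothetical node table with two centrality measures
toy_nodes <- data.frame(
  name    = c("user_a", "user_b", "user_c", "user_d"),
  degree  = c(5, 3, 1, 1),
  between = c(0.1, 0.9, 0, 0),
  stringsAsFactors = FALSE
)

top_users <- toy_nodes %>%
  pivot_longer(-name, names_to = "measures", values_to = "values") %>%  # one row per user and measure
  group_by(measures) %>%
  slice_max(values, n = 1, with_ties = FALSE) %>%                       # best user per measure
  ungroup() %>%
  distinct(name) %>%                                                    # a user may top several measures
  pull(name)

top_users
```

Here user_a wins on degree and user_b on betweenness, so both are returned; a user who tops several measures appears only once.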
3.5 Conclusion
In the top 3 community visualization above, the most popular username, gimbakakanda, is in community 1 (red). Communities 1 (red) and 2 (green) each have two important users, while community 3 (blue) has only one, Lola_OJ. idgaf_whatever sits slightly closer to community 2 (green). The denser the community/cluster, the more efficiently information can spread.
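Density was not computed in the analysis above. As a hedged illustration of what “denser” means, the sketch below compares two hypothetical four-node graphs using igraph::edge_density(), which returns the share of possible edges that are actually present.

```r
library(igraph)

dense_graph  <- make_full_graph(4)  # every pair of the 4 nodes is connected
sparse_graph <- make_ring(4)        # each node is connected to only 2 neighbours

dens_dense  <- edge_density(dense_graph)   # 6 of 6 possible edges
dens_sparse <- edge_density(sparse_graph)  # 4 of 6 possible edges

c(dense = dens_dense, sparse = dens_sparse)
```

In the fully connected graph every user reaches every other user in one step, whereas in the ring some pairs need two steps. The same call could be applied to each community subgraph to compare them, which is left as an assumption here and not something the original analysis did.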