library(tidyverse)
library(tidygraph)
library(ggraph)
library(igraph)
library(Rcpp)
Humans are social creatures, and social interaction is one of our basic needs. Today there are many ways to interact socially, including through social media, a tool that makes it easier for people to connect and share information.
Through social media we can interact quickly and easily with people who are geographically far away, without being limited by distance. The interactions that happen on social media produce data that can be analyzed to understand how users relate to one another. One form of such analysis is social network analysis.
This project aims to give the public an understanding of how graph concepts can be used to analyze the interactions that occur on social media.
Exploratory Data Analysis (EDA) refers to the critical process of performing initial investigations on data in order to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.
tweets <- read_rds("nft.RDS")
head(tweets, 10)
The head() function shows the first few rows (six by default). From the results above, we can see that there are 90 columns; some of the most relevant ones are:
- user_id : id of the twitter user
- status_id : id of the created status
- created_at : time when the status was created
- screen_name : twitter username
- is_retweet : whether the status is a retweet or not
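As a quick sanity check, we can peek at just those columns (a minimal sketch; glimpse() comes from dplyr, which tidyverse already loads):
tweets %>%
  select(user_id, status_id, created_at, screen_name, is_retweet) %>%
  glimpse()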
Before creating a graph, let's explore the data. We can count how many tweets are retweets and how many are not with the count() function.
tweets %>%
count(is_retweet)
From the results above, it can be seen that the majority of the tweets are retweets, so let's look at a summary of the retweet counts.
tweets %>%
filter(is_retweet) %>%
select(is_retweet, retweet_count) %>%
summary()
##  is_retweet     retweet_count
##  Mode:logical   Min.   :    0
##  TRUE:13058     1st Qu.:   88
##                 Median :  458
##                 Mean   : 1379
##                 3rd Qu.: 1395
##                 Max.   :49552
From the results above, the most retweeted tweet was retweeted 49552 times, whereas the median is only 458. Let's find out which tweet has the largest retweet_count and what its content is.
tweets %>%
filter(retweet_count == max(retweet_count))
We can conclude that the screen_name b1nary_eth has the largest retweet_count.
In social network analysis, each node is represented by a user name, while edges can represent the existing relationships: mentions, retweets, and quotes. In this analysis we will look at the connections between users based on mentions in tweets.
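To make the idea concrete, here is a minimal toy sketch (with made-up user names, not from the dataset) of how a handful of mentions becomes a graph:
# hypothetical mention pairs: who tweeted (from) and who was mentioned (to)
toy_edges <- data.frame(from = c("alice", "alice", "bob"),
                        to = c("bob", "carol", "carol"))
as_tbl_graph(toy_edges, directed = FALSE)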
The purpose of this cleansing process is to make the raw data ready for the graph-building step. The first stage is to take the screen_name and mentions_screen_name columns using select(); these two columns will become the endpoints of the edges (from and to in the graph data).
tweets %>%
select(screen_name,mentions_screen_name)
screen_name is the name of the twitter user who posted the tweet, while mentions_screen_name lists the users mentioned in that tweet. Notice that the mentions_screen_name column can still contain several names in one row, so it must be normalized until each row holds exactly one from-to pair.
Before that, we need to remove the "c(" prefix and ")" suffix that wrap the values in the mentions_screen_name column. Removing strings that match a certain pattern can be done with the str_remove_all() function, passing the pattern you want to remove.
tweets %>%
select(screen_name,mentions_screen_name) %>%
mutate(mentions_screen_name =str_remove_all(string = mentions_screen_name,
pattern = "^c\\(|\\)$"))
Next we separate the rows that hold multiple values in the mentions_screen_name column into several rows, one per mention, using the separate_rows() function.
tweets %>%
select(screen_name,mentions_screen_name) %>%
mutate(mentions_screen_name =str_replace_all(string = mentions_screen_name,
pattern = "^c\\(|\\)$",
replacement = "")) %>%
separate_rows(mentions_screen_name,sep = ",") %>%
na.omit()
The last steps are to remove all rows containing NA with the na.omit() function and to strip the quotation marks (") left in the mentions_screen_name column. We also rename the screen_name column to from and mentions_screen_name to to, then save the result in an object named edge_df.
edge_df <-
tweets %>%
select(screen_name,mentions_screen_name) %>%
mutate(mentions_screen_name =str_replace_all(string = mentions_screen_name,
pattern = "^c\\(|\\)$",
replacement = "")) %>%
separate_rows(mentions_screen_name,sep = ",") %>%
na.omit() %>%
# strip the leftover quotation marks and spaces (note: [[:punct:]] also
# removes other punctuation, such as underscores in mentioned user names)
mutate(mentions_screen_name = str_replace_all(string = mentions_screen_name,
                                              pattern = "[[:punct:] ]+",
                                              replacement = "")) %>%
rename(from = screen_name,
to = mentions_screen_name)
head(edge_df)
After cleaning the data to create the edges, one more piece of data is needed: the nodes. The nodes can be obtained from all the unique user names that appear in the edge data.
nodes_df <- data.frame(name = unique(c(edge_df$from,edge_df$to)))
tail(nodes_df)
After getting all the data, the graph can now be built with the tbl_graph() function. We set the directed parameter to FALSE, which makes this an undirected graph: a relationship between two nodes can be read in both directions. We could also make it directed if we assume the relationship between nodes is not necessarily two-way; a sketch of that variant follows the next chunk.
graph_tweets <- tbl_graph(nodes = nodes_df,
edges = edge_df,
directed = F)
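For completeness, the directed variant mentioned above would look like this (a sketch only; graph_tweets_directed is an illustrative name, and the rest of the analysis keeps the undirected graph):
# directed alternative: each edge points from the author to the mentioned user
graph_tweets_directed <- tbl_graph(nodes = nodes_df,
                                   edges = edge_df,
                                   directed = TRUE)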
After creating the graph we can calculate centrality values for each node. Four measures of centrality are used here: degree (the number of connections a node has), betweenness (how often a node lies on the shortest paths between other nodes), closeness (how near a node is to every other node), and eigenvector centrality (how well connected a node is to other well-connected nodes).
graph_tweets <- graph_tweets %>%
activate(nodes) %>%
mutate(degree = centrality_degree(),
between = centrality_betweenness(normalized = T),
closeness = centrality_closeness(),
eigen = centrality_eigen()
)
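To build intuition for these measures, here is a small hedged example (not part of the dataset): a star graph where node 1 is connected to nodes 2 through 5, so node 1 should come out on top of every measure.
# toy star graph: node 1 sits in the center, nodes 2-5 on the spokes
toy_star <- as_tbl_graph(data.frame(from = c(1, 1, 1, 1),
                                    to = c(2, 3, 4, 5)),
                         directed = FALSE)
toy_star %>%
  activate(nodes) %>%
  mutate(degree = centrality_degree(),
         between = centrality_betweenness(normalized = T),
         closeness = centrality_closeness(),
         eigen = centrality_eigen()) %>%
  as.data.frame()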
To inspect the results of the centrality calculation above, the nodes can be extracted as a data frame, which makes further analysis easier.
network_act_df <- graph_tweets %>%
activate(nodes) %>%
as.data.frame()
head(network_act_df)
From the data above, each node has a different centrality value. To find out which user names score highest on each centrality measure, we can reshape the results into the format below.
network_act_df %>%
arrange(-eigen) %>%
select(name) %>%
slice(1:6)
kp_activity <- data.frame(
network_act_df %>% arrange(-degree) %>% select(name) %>% slice(1:6),
network_act_df %>% arrange(-between) %>% select(name) %>% slice(1:6),
network_act_df %>% arrange(-closeness) %>% select(name) %>% slice(1:6),
network_act_df %>% arrange(-eigen) %>% select(name) %>% slice(1:6)
) %>% setNames(c("degree","betweenness","closeness","eigen"))
kp_activity
From the data above, we can see that mandytn98 is the account with the highest eigenvector centrality and degree, which means it is the most “popular” account both locally and across the global network.
We can also see that the GenWealth0 account appears at the top for degree, closeness, and betweenness, which means it has close relationships with the well-known accounts in the NFT network.
From the graph construction and the centrality calculations, we can now visualize the network. To simplify interpretation of the plot, we first need to group the nodes into several clusters, so we will run a clustering process on the graph.
The clustering method used on this graph is the Louvain method, which groups nodes by optimizing modularity, i.e. the density of connections inside communities relative to connections between them. If the graph is changed to directed, use group_leading_eigen instead.
set.seed(123)
graph_tweets <- graph_tweets %>%
activate(nodes) %>%
mutate(community = group_louvain()) %>%
activate(edges) %>%
filter(!edge_is_loop())
The group_louvain() function builds the clusters using the Louvain method and labels each node with its community directly.
graph_tweets %>%
activate(nodes) %>%
as.data.frame() %>%
count(community)
Important people in each cluster are the ones with the greatest centrality values. The %in% operator is used when you want to filter data against multiple values, as the short example below shows.
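A minimal illustration of %in% with made-up values:
c("alice", "bob", "carol") %in% c("alice", "carol")
## [1]  TRUE FALSE  TRUE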
# function to get the important people in each cluster
important_user <- function(data) {
name_person <- data %>%
as.data.frame() %>%
filter(community %in% 1:5) %>%
select(-community) %>%
pivot_longer(-name, names_to = "measures", values_to = "values") %>%
group_by(measures) %>%
arrange(desc(values)) %>%
slice(1:6) %>%
ungroup() %>%
distinct(name) %>%
pull(name)
return(name_person)
}
So that the visualization displayed is not messy, the user_name label that is displayed is only the account with the highest centrality value in each cluster.
graph_tweets %>%
  activate(nodes) %>%
  as.data.frame() %>% # convert the nodes to a data frame
  filter(community %in% 1:5) %>% # keep only communities 1 through 5
  select(-community) %>% # drop the community column
  pivot_longer(-name, names_to = "measures", values_to = "values") %>% # reshape from wide to long format
  group_by(measures) %>% # group the data by centrality measure
  arrange(desc(values)) %>% # sort by centrality value
  slice(1:6) %>% # take the 6 largest values for each measure
  ungroup() %>% # release the grouping
  distinct(name) %>% # keep only unique names
  pull(name) # pull just the name column
## [1] "GenWealth0" "opensea" "nftspartan" "TastyBonesNFT"
## [5] "misscryptolog" "bgrathod6" "smolrunnersNFT" "JAYNFTs"
## [9] "REALSWAK" "CryptoKing1st" "snowcorp" "mandytn98"
## [13] "Arizona998" "mandyyyNFT" "meganXONFT" "veejayart"
## [17] "Gabby_NFT"
important_person <-
graph_tweets %>%
activate(nodes) %>%
important_user()
important_person
## [1] "GenWealth0" "opensea" "nftspartan" "TastyBonesNFT"
## [5] "misscryptolog" "bgrathod6" "smolrunnersNFT" "JAYNFTs"
## [9] "REALSWAK" "CryptoKing1st" "snowcorp" "mandytn98"
## [13] "Arizona998" "mandyyyNFT" "meganXONFT" "veejayart"
## [17] "Gabby_NFT"
set.seed(13)
graph_tweets %>%
activate(nodes) %>%
mutate(ids = row_number(),
community = as.character(community)) %>%
filter(community %in% 1:5) %>% # adjust 1:n to the number of clusters you want to analyze
arrange(community,ids) %>%
mutate(node_label = ifelse(name %in% important_person, name,NA)) %>%
ggraph(layout = "fr") +
geom_edge_link(alpha = 0.3 ) +
geom_node_point(aes(size = degree, fill = community), shape = 21, alpha = 0.7, color = "grey30") +
geom_node_label(aes(label = node_label), repel = T, alpha = 0.3 ) +
guides(size = "none") +
labs(title = "Top 5 Community of #NFT",
     fill = "Community") +
theme_void() +
theme(legend.position = "top")
From the graph visualization above, we can see the 5 clusters. Each cluster has several important accounts based on its centrality values; the larger a node, the higher its degree centrality, i.e. the more connections it has. The visualization also shows that the clusters differ in density: the denser a cluster, the more efficiently information spreads within it.
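As a rough check on that density claim, here is a hedged sketch (not part of the original write-up) that computes the edge density of each community's subgraph. Since a tbl_graph is also an igraph object, igraph functions can be applied to it directly:
# edge density of each of the top 5 communities (higher = denser cluster)
sapply(1:5, function(k) {
  members <- which(igraph::V(graph_tweets)$community == k)
  igraph::edge_density(igraph::induced_subgraph(graph_tweets, members))
})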