This note illustrates how I used R to analyse tweets (messages posted on the social network Twitter), in this case tweets with the hashtag #Singapore. It consists of three parts. First, I identified the users who were pivotal in spreading tweets with #Singapore during the last week of October 2020. Second, I identified the key topics underlying the tweets with this hashtag. Finally, I analysed the sentiments of these tweets. I have included annotations to explain the steps I took.
For the first part, I used data on the network of users who retweeted posts with #Singapore. I collected up to 5,000 tweets that were posted between about 25 and 31 Oct 2020.
# I used the standard Twitter Application Programming Interface (API) to access tweets published in the past seven days for free. To do so, I only needed a normal Twitter account.
# Install and load the rtweet package to collect Twitter data.
install.packages("rtweet", repos="https://cran.rstudio.com")
library(rtweet)
# Install and load the httpuv package to authenticate access to the Twitter API via the web browser.
install.packages("httpuv", repos="https://cran.rstudio.com")
library(httpuv)
# Collect up to 5,000 tweets with #Singapore for the past seven days using the following script:
# twts_sg <- search_tweets("#Singapore", n = 5000, lang = "en")
# For convenience, I collected the data in advance and uploaded the data set to my GitHub repository.
# Read the data stored on GitHub.
twts_sg <- read.csv("https://raw.githubusercontent.com/chenfwei/Sabbatical-Project/main/twts_sg_repo.csv")
A network consists of nodes (vertices) and ties (edges). In a retweet network, the ties are directed. The source nodes are the users who retweet while the target nodes are the users who posted the original tweet. From the tweets collected, I extracted the source and target nodes.
# Extract source and target nodes from the tweet data frame.
sg_df <- twts_sg[, c("screen_name" , "retweet_screen_name" )]
# View the first 6 rows of the data frame.
head(sg_df)
## screen_name retweet_screen_name
## 1 style_artist94 kateStrasdin
## 2 joeltheobscure kixes
## 3 mercadomagico
## 4 Blytheweigh
## 5 LucySussex kateStrasdin
## 6 ABF_1994 kixes
# Install and load "dplyr" for manipulating data.
install.packages("dplyr", repos="https://cran.rstudio.com")
library(dplyr)
# Convert blanks to NAs, then remove rows with NAs.
sg_complete <- sg_df %>% mutate_all(na_if,"")
sg_complete <- sg_complete[complete.cases(sg_complete), ]
# Create a matrix of the data frame and view its first 6 rows.
sg_matrx <- as.matrix(sg_complete)
head(sg_matrx)
## screen_name retweet_screen_name
## 1 "style_artist94" "kateStrasdin"
## 2 "joeltheobscure" "kixes"
## 5 "LucySussex" "kateStrasdin"
## 6 "ABF_1994" "kixes"
## 7 "razzbabble" "fmtoday"
## 488 "UKASEAN" "UKASEAN"
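The cleaning step above can be sketched on a toy table using base R only. The user names below are hypothetical, and the base R calls stand in for the dplyr pipeline used in this note. A blank retweet_screen_name is assumed to mark an original tweet rather than a retweet.

```r
# Toy retweet table (hypothetical users); a blank retweet_screen_name
# is assumed to mark an original tweet rather than a retweet.
toy_df <- data.frame(
  screen_name         = c("user_a", "user_b", "user_c", "user_d"),
  retweet_screen_name = c("user_x", "", "user_x", ""),
  stringsAsFactors    = FALSE
)

# Convert blanks to NA, then keep only the rows without NAs.
toy_df[toy_df == ""] <- NA
toy_complete <- toy_df[complete.cases(toy_df), ]

# Convert to a two-column character matrix (an edge list).
toy_matrix <- as.matrix(toy_complete)
print(toy_matrix)
```

Only the two retweet rows survive, and they keep their original row labels. This is why the row labels in the output above are not consecutive.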
Next, I created the retweet network in order to analyse and visualise the relationships among the nodes.
# Install and load the "igraph" package for analysing and visualising networks.
install.packages("igraph", repos="https://cran.rstudio.com")
library(igraph)
# Convert the matrix to a retweet network.
sg_nw_rt <- graph_from_edgelist(el = sg_matrx, directed = TRUE)
# View the retweet network.
sg_nw_rt
## IGRAPH 098e7ac DN-- 1978 1922 --
## + attr: name (v/c)
## + edges from 098e7ac (vertex names):
## [1] style_artist94->kateStrasdin joeltheobscure->kixes
## [3] LucySussex ->kateStrasdin ABF_1994 ->kixes
## [5] razzbabble ->fmtoday UKASEAN ->UKASEAN
## [7] UKASEAN ->UKASEAN UKASEAN ->UKASEAN
## [9] UKASEAN ->UKASEAN UKASEAN ->UKASEAN
## [11] UKASEAN ->UKASEAN UKASEAN ->UKASEAN
## [13] UKASEAN ->UKASEAN UKASEAN ->UKASEAN
## [15] UKASEAN ->UKASEAN UKASEAN ->UKASEAN
## + ... omitted several edges
# Explore the set of nodes.
V(sg_nw_rt)
## + 1978/1978 vertices, named, from 098e7ac:
## [1] style_artist94 kateStrasdin joeltheobscure kixes
## [5] LucySussex ABF_1994 razzbabble fmtoday
## [9] UKASEAN AlvinMK_ cassTTAAecg CapitaLand
## [13] krishzen CurieuxExplorer Michael65413248 AnnielizzieSten
## [17] JerryHicksUnite Charismehehehe sining_nihiraya Avargas2403
## [21] SUPERBRUTAL_ armaanyounis VivMilano Fxworld2
## [25] ExanteData ASEAN_Insider RommelOliva3 asia_mobility
## [29] ChristinFairy matturban4 allmusicxcess DebsCa
## [33] anhistorianblog corixpartners abhishek__AI Dan4tographer
## [37] AFPphoto BetoCTeves PawlowskiMario Transform_Sec
## + ... omitted several vertices
# Print the number of nodes.
vcount(sg_nw_rt)
## [1] 1978
# Explore the set of ties.
E(sg_nw_rt)
## + 1922/1922 edges from 098e7ac (vertex names):
## [1] style_artist94 ->kateStrasdin joeltheobscure ->kixes
## [3] LucySussex ->kateStrasdin ABF_1994 ->kixes
## [5] razzbabble ->fmtoday UKASEAN ->UKASEAN
## [7] UKASEAN ->UKASEAN UKASEAN ->UKASEAN
## [9] UKASEAN ->UKASEAN UKASEAN ->UKASEAN
## [11] UKASEAN ->UKASEAN UKASEAN ->UKASEAN
## [13] UKASEAN ->UKASEAN UKASEAN ->UKASEAN
## [15] UKASEAN ->UKASEAN UKASEAN ->UKASEAN
## [17] UKASEAN ->UKASEAN UKASEAN ->UKASEAN
## [19] UKASEAN ->UKASEAN AlvinMK_ ->AlvinMK_
## + ... omitted several edges
# Print the number of ties.
ecount(sg_nw_rt)
## [1] 1922
# Add node attribute id.
V(sg_nw_rt)$id <- 1:vcount(sg_nw_rt)
I analysed the influence of the users based on three indices: out-degree, in-degree, and betweenness.
The out-degree of a user indicates the number of times the user retweets posts. A user with a high out-degree score actively amplifies other users' posts and can therefore serve as a channel for spreading them. The users with the top 10 out-degree scores are indicated below.
# Create data frames containing the nodes and ties.
sg_nw_rt_df <- as_data_frame(sg_nw_rt, what = "both")
sg_nodes <- sg_nw_rt_df$vertices
sg_ties <- sg_nw_rt_df$edges
# Calculate the out-degree scores, sort the nodes based on these scores, and view the users with the top 10 scores.
out_degree <- degree(sg_nw_rt, mode = "out")
out_degree_sort <- sort(out_degree, decreasing = TRUE)
out_degree_sort[1:10]
## Khattiy74899201 UKASEAN hkdemonow ROB19TIFTIF econometriclub
## 15 14 13 13 12
## Rh63th Michael65413248 HaseenahKoya CarbonCraftLtd surveybellcom
## 11 7 6 6 6
The in-degree of a user indicates the number of times the user’s posts are retweeted. A user with a high in-degree score is influential because their tweets are retweeted many times. The users with the top 10 in-degree scores are indicated below.
# Calculate the in-degree scores, sort the nodes based on these scores, and view the users with the top 10 scores.
in_degree <- degree(sg_nw_rt, mode = "in")
in_degree_sort <- sort(in_degree, decreasing = TRUE)
in_degree_sort[1:10]
## kixes hkfp alvinfoo Michael65413248 authorjamiller
## 154 81 76 74 61
## FangSladeDrum RoyalNavy TimGurung mvollmer1 jetex
## 48 36 31 30 29
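To make the two degree measures concrete, here is a minimal base R sketch that counts them directly from a toy edge list. The users and edges are hypothetical. Counting rows in this way is only an illustration of what degree() reports, not the igraph implementation.

```r
# Toy directed edge list (hypothetical users):
# each row reads "source retweeted target".
toy_edges <- matrix(
  c("user_a", "user_x",
    "user_b", "user_x",
    "user_c", "user_x",
    "user_a", "user_y",
    "user_x", "user_y"),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("source", "target"))
)

users <- sort(unique(as.vector(toy_edges)))

# Out-degree: how often each user appears as a source (retweets made).
out_deg <- vapply(users, function(u) sum(toy_edges[, "source"] == u), integer(1))
# In-degree: how often each user appears as a target (retweets received).
in_deg <- vapply(users, function(u) sum(toy_edges[, "target"] == u), integer(1))

print(sort(out_deg, decreasing = TRUE))
print(sort(in_deg, decreasing = TRUE))
```

In this toy network, user_a retweets most often (out-degree 2), while user_x is retweeted most often (in-degree 3).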
The betweenness of a user measures how often the user lies on the shortest paths between pairs of other users, that is, the degree to which the user stands between users who are not directly connected. A user with a high betweenness score has more control over the network because more information passes through that user. The users with the top 10 betweenness scores are indicated below.
# Compute the betweenness scores, sort the nodes based on these scores, and view the users with the top 10 scores.
betwn_nw <- betweenness(sg_nw_rt, directed = TRUE)
betwn_nw_sort <- betwn_nw %>%
sort(decreasing = TRUE) %>%
round()
betwn_nw_sort[1:10]
## kixes HaseenahKoya tjc_singapore glengyron econometriclub
## 130 38 18 9 8
## FrRonconi mathmadeasy woonglyyou WorldDengueDay USAmbSG
## 5 5 4 3 3
# Add the scores to the nodes.
sg_nodes_scores <- sg_nodes %>%
mutate(out_degree = out_degree) %>%
mutate(in_degree = in_degree) %>%
mutate(betweenness = betwn_nw)
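For readers who want to see what a betweenness score counts, the sketch below computes it on a toy directed network in base R. It follows Brandes' shortest-path counting approach. The function name toy_betweenness, the users, and the edges are hypothetical, and duplicate edges are collapsed for simplicity, so this is not how igraph computes the scores in this note.

```r
# Betweenness for a small directed, unweighted edge list (illustrative only).
toy_betweenness <- function(edges) {
  nodes <- sort(unique(as.vector(edges)))
  # Successors of each node (duplicate edges collapsed).
  succ <- lapply(nodes, function(v) unique(edges[edges[, 1] == v, 2]))
  names(succ) <- nodes
  score <- setNames(numeric(length(nodes)), nodes)

  for (s in nodes) {
    dist  <- setNames(rep(-1, length(nodes)), nodes)  # BFS distance from s
    sigma <- setNames(numeric(length(nodes)), nodes)  # no. of shortest paths from s
    preds <- setNames(vector("list", length(nodes)), nodes)
    dist[s] <- 0
    sigma[s] <- 1
    queue <- s
    visited <- character(0)
    # Breadth-first search from s, counting shortest paths.
    while (length(queue) > 0) {
      v <- queue[1]
      queue <- queue[-1]
      visited <- c(visited, v)
      for (w in succ[[v]]) {
        if (dist[w] < 0) {
          dist[w] <- dist[v] + 1
          queue <- c(queue, w)
        }
        if (dist[w] == dist[v] + 1) {
          sigma[w] <- sigma[w] + sigma[v]
          preds[[w]] <- c(preds[[w]], v)
        }
      }
    }
    # Walk back from the farthest nodes, accumulating each node's share
    # of the shortest paths that start at s.
    delta <- setNames(numeric(length(nodes)), nodes)
    for (w in rev(visited)) {
      for (v in preds[[w]]) {
        delta[v] <- delta[v] + sigma[v] / sigma[w] * (1 + delta[w])
      }
      if (w != s) score[w] <- score[w] + delta[w]
    }
  }
  score
}

# Hypothetical network: user_a reaches user_c via user_b or via user_e;
# user_d reaches user_c only via user_b.
toy_edges <- matrix(
  c("user_a", "user_b",
    "user_b", "user_c",
    "user_d", "user_b",
    "user_a", "user_e",
    "user_e", "user_c"),
  ncol = 2, byrow = TRUE
)
toy_scores <- toy_betweenness(toy_edges)
print(toy_scores)
```

Here user_b scores 1.5: it carries the only shortest path from user_d to user_c and one of the two shortest paths from user_a to user_c. user_e scores 0.5 for the other path from user_a to user_c.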
As shown above, the differences in out-degree scores are not large. However, the user “kixes” has much higher in-degree and betweenness scores than other users: this user's tweets are retweeted many times, and the user links many other users.
To understand the influence of “kixes” better, I visualised this user's network.
# Install and load "ggraph" and "graphlayouts" packages to visualise the network.
install.packages("ggraph", repos="https://cran.rstudio.com")
library(ggraph)
install.packages("graphlayouts", repos="https://cran.rstudio.com")
library(graphlayouts)
# Extract a sub-graph comprising the nodes within 10 steps of "kixes" (following ties in either direction) and the ties among them.
sg_ego <- ego(sg_nw_rt, order = 10, nodes = c("kixes"))
sg_sel <- induced_subgraph(sg_nw_rt, unlist(sg_ego))
# Add the betweenness scores to the graph.
V(sg_sel)$betweenness <- betweenness(sg_sel)
# View the graph.
sg_graph <- ggraph(sg_sel, layout = "stress") +
geom_edge_link0()+
geom_node_point()
sg_graph
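The sub-graph extraction above keeps every user who can be reached from "kixes" within a given number of steps. Below is a minimal base R sketch of that idea. It assumes ties are followed in either direction, which is how I understand the default behaviour of ego(). The helper name neighbourhood, the users, and the edges are hypothetical.

```r
# Toy directed edge list (hypothetical users).
toy_edges <- matrix(
  c("user_a", "user_x",
    "user_b", "user_a",
    "user_c", "user_b",
    "user_d", "user_e"),
  ncol = 2, byrow = TRUE
)

# Collect all users within `order` steps of `start`, ignoring tie direction.
neighbourhood <- function(edges, start, order) {
  reached  <- start
  frontier <- start
  for (step in seq_len(order)) {
    # Neighbours (in either direction) of the current frontier.
    nxt <- c(edges[edges[, 1] %in% frontier, 2],
             edges[edges[, 2] %in% frontier, 1])
    frontier <- setdiff(nxt, reached)
    if (length(frontier) == 0) break
    reached <- c(reached, frontier)
  }
  reached
}

print(neighbourhood(toy_edges, "user_x", order = 1))
print(neighbourhood(toy_edges, "user_x", order = 10))
```

With order = 1, only user_x and its direct neighbour user_a are kept. With order = 10, the whole connected chain (user_x, user_a, user_b, user_c) is kept, while the unconnected pair user_d and user_e is left out.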
To examine the network more closely, I created an interactive visualisation.
# Install and load the "visNetwork" package for creating interactive visualisations.
install.packages("visNetwork", repos="https://cran.rstudio.com")
library(visNetwork)
# Convert the graph data into the visNetwork format.
visdata <- toVisNetworkData(sg_sel)
nodes <- visdata$nodes
edges <- visdata$edges
nodes$size <- nodes$betweenness
# Visualise the network.
visNetwork(nodes = nodes, edges = edges) %>%
# Label the nodes using the user names.
visNodes(label = id) %>%
# Use arrows pointing from the retweeting user to the original poster to indicate the directions of the ties.
visEdges(arrows = "to", smooth = TRUE) %>%
# Allow highlighting of nearest nodes and ties and selection of nodes by user name.
visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE)
The size of each node represents its betweenness score. Users in the network of “kixes” who have high betweenness scores are “HaseenahKoya”, “tjc_singapore” and “glengyron”. In other words, the transmission of tweets with #Singapore was mainly driven by “kixes” and this user's connections with these three users.
For the second part, I created a corpus from the texts of the tweets with #Singapore. Before doing so, I cleaned the texts so that only words were retained.
# Install and load the "tm" package to mine texts.
install.packages("tm", repos="https://cran.rstudio.com")
library(tm)
# Install and load the "qdap" and "qdapRegex" packages to clean texts.
install.packages("qdap", repos="https://cran.rstudio.com")
library(qdap)
install.packages("qdapRegex", repos="https://cran.rstudio.com")
library(qdapRegex)
# Load the "magrittr" package, which provides the pipe operator (%>%).
library(magrittr)
# Extract texts from tweets.
text_twts_sg <- twts_sg$text
# Remove urls.
text_tws_url <- rm_twitter_url(text_twts_sg)
# Remove special characters, punctuation, and numbers.
text_tws_chrs <- gsub("[^A-Za-z]"," " , text_tws_url)
# Convert texts into a corpus.
text_sg_corpus <- text_tws_chrs %>%
VectorSource() %>%
Corpus()
# Convert the corpus to lower case (content_transformer() wraps tolower so that the result remains a corpus).
text_sg_corpus_lwr <- tm_map(text_sg_corpus, content_transformer(tolower))
# Remove English stopwords.
text_sg_corpus_stpwd <- tm_map(text_sg_corpus_lwr, removeWords, stopwords("english"))
# Collapse extra whitespace left behind by the earlier steps.
corpus_sg <- tm_map(text_sg_corpus_stpwd, stripWhitespace)
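The cleaning steps above can be traced on two toy tweets using base R only. The tweets, the URL pattern, and the three-word stopword list are hypothetical stand-ins for the qdapRegex and tm functions used in this note.

```r
# Two hypothetical tweets.
toy_tweets <- c(
  "Travel bubble news: https://example.com/abc #Singapore 2020!",
  "The food in #Singapore is GREAT &amp; cheap"
)

# Remove URLs (a simple pattern, not the one used by rm_twitter_url).
no_url <- gsub("https?://\\S+", " ", toy_tweets, perl = TRUE)
# Replace anything that is not a letter with a space.
letters_only <- gsub("[^A-Za-z]", " ", no_url)
# Convert to lower case.
lowered <- tolower(letters_only)
# Remove a few stopwords (a tiny hypothetical list, not tm's English list).
toy_stopwords <- c("the", "in", "is")
pattern <- paste0("\\b(", paste(toy_stopwords, collapse = "|"), ")\\b")
no_stop <- gsub(pattern, " ", lowered, perl = TRUE)
# Collapse runs of whitespace and trim both ends.
cleaned <- trimws(gsub("\\s+", " ", no_stop, perl = TRUE))
print(cleaned)
```

The two tweets reduce to "travel bubble news singapore" and "food singapore great amp cheap". The leftover "amp" comes from the HTML entity &amp;, which may be why "amp" appears among the top terms of the topic model later in this note.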
Next, I created a document-term matrix (DTM) from the corpus. A DTM is a matrix representation of a corpus in which the documents (here, the tweets) are the rows and the words are the columns. Each cell holds the number of times a word appears in a document.
# Create a DTM.
dtm_sg <- DocumentTermMatrix(corpus_sg)
dtm_sg
## <<DocumentTermMatrix (documents: 4525, terms: 10629)>>
## Non-/sparse entries: 71391/48024834
## Sparsity : 100%
## Maximal term length: 34
## Weighting : term frequency (tf)
# Find the sum of word counts in each document.
rowTotals <- apply(dtm_sg, 1, sum)
head(rowTotals)
## 1 2 3 4 5 6
## 21 28 19 10 21 12
# Select rows with a row total greater than zero to exclude documents with no words.
dtm_sg_new <- dtm_sg[rowTotals > 0, ]
dtm_sg_new
## <<DocumentTermMatrix (documents: 4525, terms: 10629)>>
## Non-/sparse entries: 71391/48024834
## Sparsity : 100%
## Maximal term length: 34
## Weighting : term frequency (tf)
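As a minimal illustration of what a DTM holds, the sketch below builds one by hand in base R for three hypothetical documents, then drops the empty document in the same way as the row-total filter above. It is not how tm stores the matrix internally, since tm uses a sparse format.

```r
# Three hypothetical cleaned documents; the third is empty.
toy_docs <- c(
  doc1 = "travel bubble hongkong travel",
  doc2 = "covid cases travel",
  doc3 = ""
)

# Split each document into words and collect the vocabulary.
tokens <- strsplit(toy_docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Count every vocabulary word in every document (rows = documents).
toy_dtm <- t(vapply(tokens, function(tok) {
  as.numeric(table(factor(tok, levels = vocab)))
}, numeric(length(vocab))))
colnames(toy_dtm) <- vocab
print(toy_dtm)

# Keep only the documents that contain at least one word.
toy_dtm_nonempty <- toy_dtm[rowSums(toy_dtm) > 0, , drop = FALSE]
print(rownames(toy_dtm_nonempty))
```

The cell for doc1 and "travel" is 2 because the word appears twice in that document, and doc3 is removed because its row total is zero.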
Then, I derived topics from the DTM using topic modelling, which is the task of discovering latent topics in a large collection of text. The model used here is latent Dirichlet allocation (LDA). For this analysis, I fitted a model with five topics and extracted the top eight terms for each topic.
# Install and load the "topicmodels" package to run topic models.
install.packages("topicmodels", repos="https://cran.rstudio.com")
library(topicmodels)
# Create a topic model with 5 topics.
topicmodl_5 <- LDA(dtm_sg_new, k = 5)
# Select and view the top 8 terms in the topic model.
top_8terms <- terms(topicmodl_5,8)
top_8terms
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
## [1,] "singapore" "singapore" "singapore" "singapore" "singapore"
## [2,] "travel" "free" "amp" "building" "covid"
## [3,] "bubble" "zerowaste" "malaysia" "shot" "amp"
## [4,] "hongkong" "jobs" "australia" "amos" "hongkong"
## [5,] "will" "new" "usa" "child" "wwii"
## [6,] "covid" "amp" "thailand" "urban" "travel"
## [7,] "hong" "story" "vietnam" "porn" "cases"
## [8,] "kong" "wine" "canada" "yee" "murder"
The results may differ slightly each time the model is run because the fitting procedure starts from a random initialisation; setting a seed through the control argument of LDA() makes a run reproducible. Generally, however, the most popular topics were related to COVID-19 (e.g., Singapore’s travel bubble with Hong Kong), Singapore’s sustainability initiatives, and high-profile criminal cases (e.g., drug runner Gobi Avedian who escaped the death penalty, Amos Yee who was arrested for possessing child pornography).
For the third part, I carried out sentiment analysis, which involves deriving and quantifying emotions from the tweets with #Singapore. Each tweet was treated as one unit of text (the code below passes the tweets to the scoring function without splitting them into sentences). Within each unit, each word was matched to (a) a positive or negative sentiment and/or (b) at least one of eight specific emotions (e.g., anger, surprise) based on a pre-defined sentiment dictionary. Every word-sentiment or word-emotion match corresponded to a score of “1”. Words that were not found in the dictionary were not matched.
Take, for example, the sentence “I might get angry and decide to do something horrible”. “angry” would be matched to “negative”, “anger”, and “disgust”, while “horrible” would be matched to “negative”, “anger”, “disgust”, and “fear”. Therefore, this sentence would score “2” each for “negative”, “anger”, and “disgust”, as well as “1” for “fear”.
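The worked example above can be reproduced with a few lines of base R. The mini-lexicon below is hypothetical and contains only the two matches described above, whereas the real NRC lexicon covers thousands of words. The function score_sentence is an illustrative helper, not part of syuzhet.

```r
# Hypothetical mini-lexicon holding only the two matches described above.
toy_lexicon <- list(
  angry    = c("negative", "anger", "disgust"),
  horrible = c("negative", "anger", "disgust", "fear")
)
categories <- c("anger", "anticipation", "disgust", "fear", "joy",
                "sadness", "surprise", "trust", "negative", "positive")

# Score one piece of text: every word-category match adds 1.
score_sentence <- function(sentence, lexicon, categories) {
  # Keep letters only, convert to lower case, and split into words.
  words <- strsplit(tolower(gsub("[^A-Za-z]", " ", sentence)), "\\s+")[[1]]
  # Look up the categories of every word found in the lexicon.
  matched <- unlist(lexicon[words[words %in% names(lexicon)]], use.names = FALSE)
  scores <- table(factor(matched, levels = categories))
  setNames(as.vector(scores), categories)
}

toy_scores <- score_sentence(
  "I might get angry and decide to do something horrible",
  toy_lexicon, categories
)
print(toy_scores)
```

The result is 2 each for "negative", "anger", and "disgust", 1 for "fear", and 0 for the remaining categories, which matches the scores worked out above.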
# Install and load the "syuzhet" package to perform sentiment analysis.
install.packages("syuzhet", repos="https://cran.rstudio.com")
library(syuzhet)
# Perform sentiment analysis for tweets with #Singapore. The "get_nrc_sentiment" function implements Saif Mohammad’s NRC emotion lexicon, which is a list of words and their associations with two sentiments (negative and positive) and eight emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust). The function returns a data frame in which each row represents one element of the input vector (here, one tweet). The columns represent each of the sentiments/emotions. The figures represent the scores for the respective tweets and sentiments/emotions.
sent.value <- get_nrc_sentiment(text_twts_sg)
# View the scores.
sent.value[1:5, 1:7]
## anger anticipation disgust fear joy sadness surprise
## 1 0 0 0 0 0 0 0
## 2 3 3 1 3 0 2 1
## 3 1 0 0 1 0 1 0
## 4 1 0 0 1 0 1 0
## 5 0 0 0 0 0 0 0
# Calculate the sum of scores.
score <- colSums(sent.value)
# Convert the scores to a data frame.
score_df <- data.frame(score)
# Convert the row names into a "sentiment" column and combine it with the scores.
sent.score <- cbind(sentiment = row.names(score_df),
score_df, row.names = NULL)
# View the data frame.
print(sent.score)
## sentiment score
## 1 anger 1384
## 2 anticipation 3016
## 3 disgust 849
## 4 fear 2306
## 5 joy 2255
## 6 sadness 1246
## 7 surprise 1274
## 8 trust 2927
## 9 negative 2947
## 10 positive 5971
Next, I plotted the aggregate scores for the two sentiments and eight emotions.
# Load "ggplot2" explicitly, then plot the aggregate scores.
library(ggplot2)
ggplot(data = sent.score, aes(x = sentiment, y = score,
fill = sentiment)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
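Based on the plot, the tweets with #Singapore were mostly positive: the aggregate positive score (5,971) was about twice the negative score (2,947). The most common emotions were anticipation (3,016) and trust (2,927). Among the negative emotions, fear (2,306) was the most common.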