Objectives

This note illustrates how I used R to analyse messages trending on the social network Twitter (i.e., tweets), in this case tweets with the hashtag #Singapore. It consists of three parts. First, I identified the users who were pivotal in the transmission of #Singapore during the last week of October 2020. Second, I identified the key topics underlying the tweets with this hashtag. Finally, I analysed the sentiments of these tweets. I have included annotations to explain the steps I took.

1. Identifying influential users

I used data on the network of users who retweeted posts with #Singapore. I collected up to 5,000 tweets posted during the period 25 to 31 Oct 2020.

# I used the standard Twitter Application Programming Interface (API) to access tweets published in the past seven days for free. To do so, I only needed a normal Twitter account.

# Install and load the rtweet package to collect Twitter data.
install.packages("rtweet", repos="https://cran.rstudio.com")
library(rtweet)

# Install and load the httpuv package to authenticate access to the Twitter API via the web browser.
install.packages("httpuv", repos="https://cran.rstudio.com")
library(httpuv)
# Collect up to 5,000 tweets with #Singapore for the past seven days using the following script:
# twts_sg <- search_tweets("#Singapore", n = 5000, lang = "en")
# For convenience, I collected the data in advance and uploaded the data set to my GitHub repository.

# Read the data stored on GitHub.
twts_sg <- read.csv("https://raw.githubusercontent.com/chenfwei/Sabbatical-Project/main/twts_sg_repo.csv")

A network consists of nodes (vertices) and ties (edges). In a retweet network, the ties are directed: the source nodes are the users who retweet, while the target nodes are the users who posted the original tweets. From the tweets collected, I extracted the source and target nodes.

# Extract source and target nodes from the tweet data frame.
sg_df <- twts_sg[, c("screen_name", "retweet_screen_name")]

# View the first 6 rows of the data frame.
head(sg_df)
##      screen_name retweet_screen_name
## 1 style_artist94        kateStrasdin
## 2 joeltheobscure               kixes
## 3  mercadomagico                    
## 4    Blytheweigh                    
## 5     LucySussex        kateStrasdin
## 6       ABF_1994               kixes
# Install and load "dplyr" for manipulating data.
install.packages("dplyr", repos="https://cran.rstudio.com")
library(dplyr)
# Convert blanks to NAs, then remove rows with NAs.
sg_complete <- sg_df %>% mutate_all(na_if, "")
sg_complete <- sg_complete[complete.cases(sg_complete), ]

# Create a matrix of the data frame and view its first 6 rows.
sg_matrx <- as.matrix(sg_complete)
head(sg_matrx)
##     screen_name      retweet_screen_name
## 1   "style_artist94" "kateStrasdin"     
## 2   "joeltheobscure" "kixes"            
## 5   "LucySussex"     "kateStrasdin"     
## 6   "ABF_1994"       "kixes"            
## 7   "razzbabble"     "fmtoday"          
## 488 "UKASEAN"        "UKASEAN"

Next, I created the retweet network to analyse and visualise the relationships among the nodes.

# Install and upload the "igraph" package for analysing and visualising networks.
install.packages("igraph", repos="https://cran.rstudio.com")
library(igraph)
# Convert the matrix to a retweet network.
sg_nw_rt <- graph_from_edgelist(el = sg_matrx, directed = TRUE)

# View the retweet network.
sg_nw_rt
## IGRAPH 098e7ac DN-- 1978 1922 -- 
## + attr: name (v/c)
## + edges from 098e7ac (vertex names):
##  [1] style_artist94->kateStrasdin joeltheobscure->kixes       
##  [3] LucySussex    ->kateStrasdin ABF_1994      ->kixes       
##  [5] razzbabble    ->fmtoday      UKASEAN       ->UKASEAN     
##  [7] UKASEAN       ->UKASEAN      UKASEAN       ->UKASEAN     
##  [9] UKASEAN       ->UKASEAN      UKASEAN       ->UKASEAN     
## [11] UKASEAN       ->UKASEAN      UKASEAN       ->UKASEAN     
## [13] UKASEAN       ->UKASEAN      UKASEAN       ->UKASEAN     
## [15] UKASEAN       ->UKASEAN      UKASEAN       ->UKASEAN     
## + ... omitted several edges
# Explore the set of nodes.
V(sg_nw_rt)
## + 1978/1978 vertices, named, from 098e7ac:
##    [1] style_artist94  kateStrasdin    joeltheobscure  kixes          
##    [5] LucySussex      ABF_1994        razzbabble      fmtoday        
##    [9] UKASEAN         AlvinMK_        cassTTAAecg     CapitaLand     
##   [13] krishzen        CurieuxExplorer Michael65413248 AnnielizzieSten
##   [17] JerryHicksUnite Charismehehehe  sining_nihiraya Avargas2403    
##   [21] SUPERBRUTAL_    armaanyounis    VivMilano       Fxworld2       
##   [25] ExanteData      ASEAN_Insider   RommelOliva3    asia_mobility  
##   [29] ChristinFairy   matturban4      allmusicxcess   DebsCa         
##   [33] anhistorianblog corixpartners   abhishek__AI    Dan4tographer  
##   [37] AFPphoto        BetoCTeves      PawlowskiMario  Transform_Sec  
## + ... omitted several vertices
# Print the number of nodes.
vcount(sg_nw_rt)
## [1] 1978
# Explore the set of ties.
E(sg_nw_rt)
## + 1922/1922 edges from 098e7ac (vertex names):
##  [1] style_artist94 ->kateStrasdin    joeltheobscure ->kixes          
##  [3] LucySussex     ->kateStrasdin    ABF_1994       ->kixes          
##  [5] razzbabble     ->fmtoday         UKASEAN        ->UKASEAN        
##  [7] UKASEAN        ->UKASEAN         UKASEAN        ->UKASEAN        
##  [9] UKASEAN        ->UKASEAN         UKASEAN        ->UKASEAN        
## [11] UKASEAN        ->UKASEAN         UKASEAN        ->UKASEAN        
## [13] UKASEAN        ->UKASEAN         UKASEAN        ->UKASEAN        
## [15] UKASEAN        ->UKASEAN         UKASEAN        ->UKASEAN        
## [17] UKASEAN        ->UKASEAN         UKASEAN        ->UKASEAN        
## [19] UKASEAN        ->UKASEAN         AlvinMK_       ->AlvinMK_       
## + ... omitted several edges
# Print the number of ties.
ecount(sg_nw_rt)
## [1] 1922
# Add node attribute id. 
V(sg_nw_rt)$id <- 1:vcount(sg_nw_rt)

I analysed the influence of the users based on three indices: out-degree, in-degree, and betweenness.

The out-degree of a user indicates the number of times the user retweets posts. A user with a high out-degree score acts as a conduit, spreading other users' posts through the network. The users with the top 10 out-degree scores are shown below.

# Create data frames containing the nodes and ties.
sg_nw_rt_df <- as_data_frame(sg_nw_rt, what = "both")
sg_nodes <- sg_nw_rt_df$vertices
sg_ties <- sg_nw_rt_df$edges

# Calculate the out-degree scores, sort the nodes based on these scores, and view the users with the top 10 scores.
out_degree <- degree(sg_nw_rt, mode = "out")
out_degree_sort <- sort(out_degree, decreasing = TRUE)
out_degree_sort[1:10]
## Khattiy74899201         UKASEAN       hkdemonow     ROB19TIFTIF  econometriclub 
##              15              14              13              13              12 
##          Rh63th Michael65413248    HaseenahKoya  CarbonCraftLtd   surveybellcom 
##              11               7               6               6               6

The in-degree of a user indicates the number of times the user's posts are retweeted. A user with a high in-degree score is influential, as their tweets are retweeted many times. The users with the top 10 in-degree scores are shown below.

# Calculate the in-degree scores, sort the nodes based on these scores, and view the users with the top 10 scores.
in_degree <- degree(sg_nw_rt, mode = "in")
in_degree_sort <- sort(in_degree, decreasing = TRUE)
in_degree_sort[1:10]
##           kixes            hkfp        alvinfoo Michael65413248  authorjamiller 
##             154              81              76              74              61 
##   FangSladeDrum       RoyalNavy       TimGurung       mvollmer1           jetex 
##              48              36              31              30              29

The betweenness of a user measures the extent to which the user lies on the paths between other users who are not directly connected. A user with a high betweenness score has more control over the network because more information passes through that user. The users with the top 10 betweenness scores are shown below.
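As a minimal illustration with a toy graph (not part of the #Singapore data): in the directed path a -> b -> c, only "b" lies between two other nodes, so only "b" scores above zero.

# Toy example: "b" bridges "a" and "c" in the directed path a -> b -> c.
toy_nw <- graph_from_literal(a -+ b -+ c)
betweenness(toy_nw, directed = TRUE)
## a b c 
## 0 1 0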

# Compute the betweenness scores, sort the nodes based on these scores, and view the users with the top 10 scores.
betwn_nw <- betweenness(sg_nw_rt, directed = TRUE)
betwn_nw_sort <- betwn_nw %>%
  sort(decreasing = TRUE) %>%
  round()
betwn_nw_sort[1:10]
##          kixes   HaseenahKoya  tjc_singapore      glengyron econometriclub 
##            130             38             18              9              8 
##      FrRonconi    mathmadeasy     woonglyyou WorldDengueDay        USAmbSG 
##              5              5              4              3              3
# Add the scores to the nodes.
sg_nodes_scores <- sg_nodes %>%
  mutate(out_degree = out_degree,
         in_degree = in_degree,
         betweenness = betwn_nw)

As shown above, the differences in out-degree scores are not large. However, the user "kixes" has markedly higher in-degree and betweenness scores than other users: his tweets are retweeted many times, and he links many users who are not otherwise connected.
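For instance, the combined indices for any single user can be looked up in the data frame assembled above (the "name" column comes from igraph's vertex name attribute):

# Look up all three scores for "kixes" (dplyr is loaded above).
sg_nodes_scores %>% filter(name == "kixes")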

To understand the influence of “kixes” better, I visualised his network.

# Install and load "ggraph" and "graphlayouts" packages to visualise the network.
install.packages("ggraph", repos="https://cran.rstudio.com")
library(ggraph)
install.packages("graphlayouts", repos="https://cran.rstudio.com")
library(graphlayouts)
# Extract a sub-graph comprising "kixes", all nodes within 10 steps of him, and the ties among them.
sg_ego <- ego(sg_nw_rt, order = 10, nodes = "kixes")
sg_sel <- induced_subgraph(sg_nw_rt, unlist(sg_ego))

# Add the betweenness scores to the graph.
V(sg_sel)$betweenness <- betweenness(sg_sel)

# View the graph.
sg_graph <- ggraph(sg_sel, layout = "stress") +
  geom_edge_link0() +
  geom_node_point()
sg_graph

To examine the network more closely, I created an interactive visualisation.

# Install and upload the "visNetwork" package for creating interactive visualisations.
install.packages("visNetwork", repos="https://cran.rstudio.com")
library(visNetwork)
# Convert the graph data into the visNetwork format.
visdata <- toVisNetworkData(sg_sel)
nodes <- visdata$nodes
edges <- visdata$edges
nodes$size <- nodes$betweenness
# Label the nodes using the user names.
nodes$label <- nodes$id

# Visualise the network.
visNetwork(nodes = nodes, edges = edges) %>% 
  # Use arrows to indicate the directions of the ties.
  visEdges(arrows = "to", smooth = TRUE) %>%
  # Allow highlighting of nearest nodes and ties and selection of nodes by user name.
  visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE)
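As an optional step, the interactive graph can be saved as a self-contained HTML file for sharing outside of R, using visSave() from the visNetwork package. A minimal sketch; the object name "sg_vis" and the file name are my own choices:

# Assign the widget to an object, then save it as a standalone HTML file.
sg_vis <- visNetwork(nodes = nodes, edges = edges) %>%
  visEdges(arrows = "to", smooth = TRUE) %>%
  visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE)
visSave(sg_vis, file = "sg_network.html")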

The size of each node represents its betweenness score. Users in the network of “kixes” who have high betweenness scores are “HaseenahKoya”, “tjc_singapore” and “glengyron”. In other words, the transmission of tweets with #Singapore was mainly driven by “kixes” and his connections with these three users.

2. Identifying underlying topics

First, I created a corpus from the texts of the tweets with #Singapore. To prepare the data for analysis, I retained only the words in the corpus.

# Install and upload "tm" package to mine texts.
install.packages("tm", repos="https://cran.rstudio.com")
library(tm)

# Install and upload "qdap" and "qdapRegex" packages to clean texts.
install.packages("qdap", repos="https://cran.rstudio.com")
library(qdap)
install.packages("qdapRegex", repos="https://cran.rstudio.com")
library(qdapRegex)

# Install and upload "magrittr" package to provide a pipe-like operator.
library(magrittr)
# Extract texts from tweets.
text_twts_sg <- twts_sg$text

# Remove urls.
text_tws_url <- rm_twitter_url(text_twts_sg)

# Remove special characters, punctuation, and numbers.
text_tws_chrs <- gsub("[^A-Za-z]", " ", text_tws_url)

# Convert texts into a corpus.
text_sg_corpus <- text_tws_chrs %>% 
  VectorSource() %>% 
  Corpus() 

# Convert the corpus to lower case; content_transformer() keeps the corpus structure intact.
text_sg_corpus_lwr <- tm_map(text_sg_corpus, content_transformer(tolower))

# Remove English stopwords.
text_sg_corpus_stpwd <- tm_map(text_sg_corpus_lwr, removeWords, stopwords("english"))

# Collapse extra whitespace.
corpus_sg <- tm_map(text_sg_corpus_stpwd, stripWhitespace)

Next, I created a document term matrix (DTM) from the corpus. A DTM is a matrix representation of a corpus where the documents are the rows and the words are the columns. This allowed me to count the frequency of each word in each document.
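As a minimal illustration with two made-up documents (using the tm package loaded above):

# Toy example: a DTM of two short documents.
toy_corpus <- Corpus(VectorSource(c("merlion park merlion", "marina bay sands")))
inspect(DocumentTermMatrix(toy_corpus))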

# Create a DTM.
dtm_sg <- DocumentTermMatrix(corpus_sg)
dtm_sg
## <<DocumentTermMatrix (documents: 4525, terms: 10629)>>
## Non-/sparse entries: 71391/48024834
## Sparsity           : 100%
## Maximal term length: 34
## Weighting          : term frequency (tf)
# Find the sum of word counts in each document.
rowTotals <- apply(dtm_sg, 1, sum)
head(rowTotals)
##  1  2  3  4  5  6 
## 21 28 19 10 21 12
# Select rows with a row total greater than zero to exclude documents with no words.
dtm_sg_new <- dtm_sg[rowTotals > 0, ]
dtm_sg_new
## <<DocumentTermMatrix (documents: 4525, terms: 10629)>>
## Non-/sparse entries: 71391/48024834
## Sparsity           : 100%
## Maximal term length: 34
## Weighting          : term frequency (tf)

Then, I derived topics from the DTM using topic modelling, a technique for discovering the themes that underlie a collection of texts. For this analysis, I fitted a latent Dirichlet allocation (LDA) model with five topics and surfaced the top eight terms for each topic.

# Install and load the "topicmodels" package to run topic models.
install.packages("topicmodels", repos="https://cran.rstudio.com")
library(topicmodels)
# Create a topic model with 5 topics.
topicmodl_5 <- LDA(dtm_sg_new, k = 5)

# Select and view the top 8 terms in the topic model.
top_8terms <- terms(topicmodl_5,8)
top_8terms
##      Topic 1     Topic 2     Topic 3     Topic 4     Topic 5    
## [1,] "singapore" "singapore" "singapore" "singapore" "singapore"
## [2,] "travel"    "free"      "amp"       "building"  "covid"    
## [3,] "bubble"    "zerowaste" "malaysia"  "shot"      "amp"      
## [4,] "hongkong"  "jobs"      "australia" "amos"      "hongkong" 
## [5,] "will"      "new"       "usa"       "child"     "wwii"     
## [6,] "covid"     "amp"       "thailand"  "urban"     "travel"   
## [7,] "hong"      "story"     "vietnam"   "porn"      "cases"    
## [8,] "kong"      "wine"      "canada"    "yee"       "murder"

The results may differ slightly each time the model is run (a point addressed in the sketch below). But generally, the most popular topics related to COVID-19 (e.g., Singapore's travel bubble with Hong Kong), Singapore's sustainability initiatives, and high-profile criminal cases (e.g., drug runner Gobi Avedian, who escaped the death penalty, and Amos Yee, who was arrested for possessing child pornography).
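To make a run reproducible, or to compare candidate numbers of topics, one can fix the random seed via LDA's control argument and compare perplexity scores (lower is better). A minimal sketch; the seed 1234 and the candidate k values are arbitrary choices of mine:

# Fit seeded models for two candidate topic counts and compare their perplexity.
topicmodl_s5  <- LDA(dtm_sg_new, k = 5,  control = list(seed = 1234))
topicmodl_s10 <- LDA(dtm_sg_new, k = 10, control = list(seed = 1234))
perplexity(topicmodl_s5)
perplexity(topicmodl_s10)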

3. Analysing sentiments

This involves deriving and quantifying emotions from the tweets with #Singapore. First, each tweet was broken down into individual words. Then, each word was matched to (a) a positive or negative sentiment and/or (b) at least one of eight specific emotions (e.g., anger, surprise) based on a pre-defined sentiment dictionary. Every word-sentiment or word-emotion match contributed a score of 1. Words that were not found in the dictionary were not matched.

Take, for example, the sentence “I might get angry and decide to do something horrible”. “angry” would be matched to “negative”, “anger”, and “disgust”, while “horrible” would be matched to “negative”, “anger”, “disgust”, and “fear”. Therefore, this sentence would score “2” each for “negative”, “anger”, and “disgust”, as well as “1” for “fear”.
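This hand count can be checked by scoring the sentence with the same function used below (the exact scores depend on the version of the NRC lexicon bundled with the package):

# Score the example sentence; the "syuzhet" package is installed and loaded in the next step.
get_nrc_sentiment("I might get angry and decide to do something horrible")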

# Install and load the "syuzhet" package to perform sentiment analysis.
install.packages("syuzhet", repos="https://cran.rstudio.com")
library(syuzhet)
# Perform sentiment analysis for tweets with #Singapore.
# The "get_nrc_sentiment" function implements Saif Mohammad's NRC emotion
# lexicon, a list of words and their associations with two sentiments
# (negative and positive) and eight emotions (anger, fear, anticipation,
# trust, surprise, sadness, joy, and disgust). It returns a data frame in
# which each row represents a tweet and each column a sentiment/emotion;
# the figures are the corresponding scores.
sent.value <- get_nrc_sentiment(text_twts_sg)

# View the scores.
sent.value[1:5, 1:7]
##   anger anticipation disgust fear joy sadness surprise
## 1     0            0       0    0   0       0        0
## 2     3            3       1    3   0       2        1
## 3     1            0       0    1   0       1        0
## 4     1            0       0    1   0       1        0
## 5     0            0       0    0   0       0        0
# Calculate the sum of scores.
score <- colSums(sent.value)

# Convert the scores to a data frame.
score_df <- data.frame(score)

# Convert the row names into a "sentiment" column and combine it with the scores.
sent.score <- cbind(sentiment = row.names(score_df), 
                    score_df, row.names = NULL)

# View the data frame.
print(sent.score)
##       sentiment score
## 1         anger  1384
## 2  anticipation  3016
## 3       disgust   849
## 4          fear  2306
## 5           joy  2255
## 6       sadness  1246
## 7      surprise  1274
## 8         trust  2927
## 9      negative  2947
## 10     positive  5971

Next, I plotted the aggregate scores for the two sentiments and eight emotions.

# Load "ggplot2" to plot the scores (it was also attached earlier with "ggraph").
library(ggplot2)

# Plot the aggregate scores.
ggplot(data = sent.score, aes(x = sentiment, y = score, fill = sentiment)) + 
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Based on the plot, the tweets with #Singapore were mostly positive, expressing anticipation and trust. Tweets with negative sentiments mainly expressed fear.
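As a quick numerical check on this reading, the aggregate scores above imply that positive matches outnumber negative ones roughly two to one.

# Ratio of aggregate positive to negative scores (5971 / 2947 from the table above).
sum(sent.value$positive) / sum(sent.value$negative)
## [1] 2.026128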