This is a follow-up to my earlier publication, "Sentiment Analysis in R". Here, I will demonstrate the use of the syuzhet package in R to perform sentiment analysis on tweets about Singapore's Housing & Development Board (HDB). The dataset consists of 144 tweets extracted using the Twitter Search API (see http://rpubs.com/kevinsis/sentiment1).
library(syuzhet)
library(plotly)
library(tm)
library(wordcloud)
load("tweets.rda")
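For reference, the saved some_tweets object could have been collected along the following lines. This is only a sketch: it assumes the twitteR package, the four credential variables are placeholders, and the exact query term and count come from the earlier publication.
library(twitteR)
# authenticate first (all four credential variables are placeholders)
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
# pull recent tweets mentioning HDB and save them for offline analysis
some_tweets <- searchTwitter("HDB", n = 150, lang = "en")
save(some_tweets, file = "tweets.rda")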
As in the previous publication, I have written a function to help with data cleaning:
# Function for data cleaning
f_clean_tweets <- function (tweets) {
  clean_tweets = sapply(tweets, function(x) x$getText())
  # remove retweet entities
  clean_tweets = gsub('(RT|via)((?:\\b\\W*@\\w+)+)', '', clean_tweets)
  # remove at people
  clean_tweets = gsub('@\\w+', '', clean_tweets)
  # remove punctuation
  clean_tweets = gsub('[[:punct:]]', '', clean_tweets)
  # remove numbers
  clean_tweets = gsub('[[:digit:]]', '', clean_tweets)
  # remove html links (punctuation is already stripped, so URLs look like 'httptcoxxxx')
  clean_tweets = gsub('http\\w+', '', clean_tweets)
  # collapse runs of whitespace into a single space, then trim both ends
  clean_tweets = gsub('[ \t]{2,}', ' ', clean_tweets)
  clean_tweets = gsub('^\\s+|\\s+$', '', clean_tweets)
  # remove emojis or special characters, which enc2native renders as '<U+XXXX>'
  clean_tweets = gsub('<U\\+\\w+>', '', enc2native(clean_tweets))
  clean_tweets = tolower(clean_tweets)
  clean_tweets
}
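To sanity-check the cleaning steps without calling the Twitter API, we can feed the function a made-up object that mimics a status (purely hypothetical; real twitteR statuses expose getText() as a method):
# hypothetical stand-in for a status object
fake_status <- list(getText = function() "RT @user: HDB news 2017 https://t.co/abc!! :)")
f_clean_tweets(list(fake_status))
# [1] "hdb news"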
This time round, I would also like to show how the sentiment score varies over time. To do this, I extract the timestamps as follows:
# First call the function defined above to clean the data
clean_tweets <- f_clean_tweets(some_tweets)
# extract the timestamps (getCreated returns the creation time; sapply coerces it to seconds since the epoch)
timestamp <- as.POSIXct(sapply(some_tweets, function(x) x$getCreated()), origin = "1970-01-01", tz = "GMT")
# drop duplicate tweets, keeping the timestamps aligned
timestamp <- timestamp[!duplicated(clean_tweets)]
clean_tweets <- clean_tweets[!duplicated(clean_tweets)]
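A quick sanity check that the two vectors stay aligned after de-duplication:
# every remaining tweet should still have exactly one timestamp
stopifnot(length(timestamp) == length(clean_tweets))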
The syuzhet package incorporates four sentiment lexicons: afinn, bing, nrc, and the default, syuzhet itself (see https://github.com/mjockers/syuzhet).
Let's score the tweets with all four lexicons and collect the results, together with the timestamps, in a data frame:
# Get sentiments using the four different lexicons
syuzhet <- get_sentiment(clean_tweets, method="syuzhet")
bing <- get_sentiment(clean_tweets, method="bing")
afinn <- get_sentiment(clean_tweets, method="afinn")
nrc <- get_sentiment(clean_tweets, method="nrc")
sentiments <- data.frame(syuzhet, bing, afinn, nrc, timestamp)
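A quick peek at the resulting data frame (one row per tweet, one column per method, plus the timestamp):
# inspect the first few rows
head(sentiments, 3)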
Emotion analysis can be done with the NRC Word-Emotion Association Lexicon:
# get the emotions using the NRC dictionary
emotions <- get_nrc_sentiment(clean_tweets)
emo_bar = colSums(emotions)
emo_sum = data.frame(count=emo_bar, emotion=names(emo_bar))
emo_sum$emotion = factor(emo_sum$emotion, levels=emo_sum$emotion[order(emo_sum$count, decreasing = TRUE)])
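For context, get_nrc_sentiment returns one row per tweet, with a count column for each of the eight NRC emotions plus overall negative and positive polarity; colSums then aggregates these counts over all tweets.
# one row per tweet, ten columns (eight emotions plus negative/positive)
head(emotions, 3)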
Now we are ready to visualize the results. Let's start by comparing the sentiment scores across the four methods:
# plot the sentiment scores from the four methods
plot_ly(sentiments, x=~timestamp, y=~syuzhet, type="scatter", mode="lines", name="syuzhet") %>%
add_trace(y=~bing, mode="lines", name="bing") %>%
add_trace(y=~afinn, mode="lines", name="afinn") %>%
add_trace(y=~nrc, mode="lines", name="nrc") %>%
layout(title="Recent sentiments of HDB in Singapore",
yaxis=list(title="score"), xaxis=list(title="date"))
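They look pretty consistent/correlated! To put a number on that impression, we can compute the pairwise correlations of the four scores (plain base R, nothing beyond what is already loaded):
# correlation matrix of the four sentiment scores
round(cor(sentiments[, c("syuzhet", "bing", "afinn", "nrc")]), 2)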
Next, let's see what sorts of emotions are dominant in the tweets:
# Visualize the emotions from NRC sentiments
plot_ly(emo_sum, x=~emotion, y=~count, type="bar", color=~emotion) %>%
layout(xaxis=list(title=""), showlegend=FALSE,
title="Distribution of emotion categories for HDB (1-10 June 2017)")
Mainly positive! Again, this is good news for HDB, and these results agree with our previous analysis using the old sentiment package.
Finally, let's see which words contribute to which emotions:
# Comparison word cloud: build one 'document' per emotion by concatenating all tweets that score above zero on that emotion
all = c(
paste(clean_tweets[emotions$anger > 0], collapse=" "),
paste(clean_tweets[emotions$anticipation > 0], collapse=" "),
paste(clean_tweets[emotions$disgust > 0], collapse=" "),
paste(clean_tweets[emotions$fear > 0], collapse=" "),
paste(clean_tweets[emotions$joy > 0], collapse=" "),
paste(clean_tweets[emotions$sadness > 0], collapse=" "),
paste(clean_tweets[emotions$surprise > 0], collapse=" "),
paste(clean_tweets[emotions$trust > 0], collapse=" ")
)
all <- removeWords(all, stopwords("english"))
# create corpus
corpus = Corpus(VectorSource(all))
#
# create term-document matrix
tdm = TermDocumentMatrix(corpus)
#
# convert to matrix and name the emotion columns
tdm = as.matrix(tdm)
colnames(tdm) = c('anger', 'anticipation', 'disgust', 'fear', 'joy', 'sadness', 'surprise', 'trust')
#
# drop very long terms so the words fit nicely in the cloud
tdm1 <- tdm[nchar(rownames(tdm)) < 11, ]
comparison.cloud(tdm1, random.order=FALSE,
colors = c("#00B2FF", "red", "#FF0099", "#6600CC", "green", "orange", "blue", "brown"),
title.size=1, max.words=250, scale=c(2.5, 0.4), rot.per=0.4)
As a note, I excluded words with more than 10 characters (fewer than 7% of all terms) so that the words fit nicely into the word cloud. In practice, we could also shorten these long words instead of removing them.
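For instance, here is a minimal sketch of that shortening alternative, using base R's abbreviate (the minlength value is arbitrary, and rowsum merges any terms that collide after truncation):
# truncate long terms instead of dropping them
rownames(tdm) <- abbreviate(rownames(tdm), minlength = 10, strict = TRUE)
# merge rows whose truncated names now coincide, then reuse in comparison.cloud
tdm1 <- rowsum(tdm, group = rownames(tdm))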