Overview

In this demo, I will detail the approach I used (along with the code) to pull data from Twitter and analyze the sentiment of tweets relating to the Housing and Development Board (HDB) of Singapore.

Setup

library(twitteR)
library(devtools)
# Rstem and sentiment have been archived on CRAN, so install them from the archive if missing
if(!require(Rstem)) install_url("http://cran.r-project.org/src/contrib/Archive/Rstem/Rstem_0.4-1.tar.gz")
if(!require(sentiment)) install_url("http://cran.r-project.org/src/contrib/Archive/sentiment/sentiment_0.2.tar.gz")
library(Rstem)
library(sentiment)
library(tm)           # for removeWords(), stopwords() and TermDocumentMatrix() later on
library(plotly)
library(dplyr)
library(wordcloud)
library(RColorBrewer) # brewer.pal() for the comparison cloud

Twitter Authentication

In order to authorize R for the Twitter Search API, I have set up a Twitter app that uses OAuth authentication, so that R can interface with the Twitter API through the twitteR package.

The authorization credentials are not shown in this document, but they have been set up so that we can search Twitter using the following lines:

setup_twitter_oauth(api_key, api_secret)
# alternatively, supply a pre-generated access token directly:
# setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)

some_tweets <- searchTwitter('hdb+singapore', lang='en', n=1000) # a sample search query; the Twitter Search API only returns a sample of recent tweets
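
If you want a quick look at what came back before any cleaning, twitteR can flatten the list of status objects into a data frame via twListToDF (text, created and screenName are among its standard columns):

# optional: flatten the status objects into a data frame for inspection
tweets_df <- twListToDF(some_tweets)
head(tweets_df[, c('text', 'created', 'screenName')])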

Data Cleaning

This is the most time-consuming part. To do the cleaning effectively, I have written a simple function for it (building on this post from my favorite blog):

f_clean_tweets <- function (tweets) {
  
  clean_tweets = sapply(tweets, function(x) x$getText())
  # remove retweet entities
  clean_tweets = gsub('(RT|via)((?:\\b\\W*@\\w+)+)', '', clean_tweets)
  # remove at people
  clean_tweets = gsub('@\\w+', '', clean_tweets)
  # remove html links (before stripping punctuation, while the URLs are still intact)
  clean_tweets = gsub('https?://\\S+', '', clean_tweets)
  # remove punctuation
  clean_tweets = gsub('[[:punct:]]', '', clean_tweets)
  # remove numbers
  clean_tweets = gsub('[[:digit:]]', '', clean_tweets)
  # collapse runs of spaces and tabs into a single space
  # (replacing with '' would glue neighbouring words together)
  clean_tweets = gsub('[ \t]{2,}', ' ', clean_tweets)
  clean_tweets = gsub('^\\s+|\\s+$', '', clean_tweets)
  # remove emojis and other special characters, which enc2native renders as '<...>' escapes;
  # the non-greedy '<.*?>' avoids deleting everything between two distant brackets
  clean_tweets = gsub('<.*?>', '', enc2native(clean_tweets), perl = TRUE)
  
  clean_tweets = tolower(clean_tweets)
  
  clean_tweets
}

We can just call that function to do the cleaning, like so:

clean_tweets <- f_clean_tweets(some_tweets)

# removing duplicates due to retweets
clean_tweets <- clean_tweets[!duplicated(clean_tweets)]
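
As a quick sanity check, it is worth seeing how many unique tweets survived the cleaning and eyeballing a few of them:

# how many unique tweets are left, and what do they look like?
length(clean_tweets)
head(clean_tweets, 3)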

Sentiment Analysis

For demonstration purposes, I will use the sentiment package in R. Even though there are more recent (and better) packages, the old sentiment package is still useful for instructional purposes: the classification of emotions and polarities is as simple as the two calls below.

# using the sentiment package to classify emotions
emotions <- classify_emotion(clean_tweets, algorithm='bayes')

# using the sentiment package to classify polarities
polarities <- classify_polarity(clean_tweets, algorithm='bayes')

df <- data.frame(text=clean_tweets, emotion=emotions[,'BEST_FIT'],
                 polarity=polarities[,'BEST_FIT'], stringsAsFactors=FALSE)
df[is.na(df)] <- "N.A."
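
Both classifiers return a matrix with one column of scores per category plus a BEST_FIT column, which is what we keep in df. To inspect the raw output (ANGER, JOY and SADNESS below are among the emotion categories used by the sentiment package):

# peek at the raw classifier scores alongside the best-fit label
head(emotions[, c('ANGER', 'JOY', 'SADNESS', 'BEST_FIT')], 3)

# tabulate the assigned emotion labels, including the "N.A." bucket
table(df$emotion)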

The results can then be conveniently visualized:

# plot the emotions
plot_ly(df, x=~emotion,type="histogram",
        marker = list(color = c('grey', 'red',
                                'orange', 'navy',
                                'yellow'))) %>%
  layout(yaxis = list(title='Count'), title="Sentiment Analysis: Emotions")

A lot of tweets end up with an unknown emotion: this is OK, as the model was trained on a relatively small dataset. But we should still be able to classify polarity reasonably well:

plot_ly(df, x=~polarity, type="histogram",
        marker = list(color = c('magenta', 'gold',
                                'lightblue'))) %>%
  layout(yaxis = list(title='Count'), title="Sentiment Analysis: Polarity")
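
To put numbers behind the plot, we can simply tabulate the polarity labels with dplyr, which is already loaded:

# count tweets per polarity to back up the histogram
df %>% count(polarity)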

Mostly positive! This is good news for HDB. Finally, let’s see which words contributed to which polarity in the tweets:

# visualize the words by polarity: collapse all tweets of each polarity into one document
df <- df %>%
  group_by(polarity) %>%
  summarise(pasted=paste(text, collapse=" "))

# remove stopwords
df$pasted <- removeWords(df$pasted, stopwords('english'))

# create corpus and term-document matrix (one document per polarity)
corpus <- Corpus(VectorSource(df$pasted))
tdm <- TermDocumentMatrix(corpus)
tdm <- as.matrix(tdm)
colnames(tdm) <- df$polarity

# comparison word cloud
comparison.cloud(tdm, colors = brewer.pal(3, 'Dark2'),
                 scale = c(3,.5), random.order = FALSE, title.size = 1.5)

This is definitely not perfect (perhaps that’s why the package is no longer maintained on CRAN). In the next post, I will show how we can do sentiment analysis using the newer syuzhet package.