1 Introduction

Kavita Ganesan and Hyun Duk Kim of the University of Illinois at Urbana-Champaign define Sentiment Analysis as part of a broader topic called Subjectivity Analysis, which also includes Review Mining, Opinion Mining, and Appraisal Extraction.

According to Marie-Claire Jenkins of Science for SEO, sentiment analysis software automatically extracts opinions, emotions, and sentiments from text. It allows us to track attitudes and feelings on the web. People write blog posts, comments, reviews, and tweets about all sorts of topics. We can track products, brands, and people, for example, and determine whether they are viewed positively or negatively on the web.

This analysis uses data obtained from Twitter. I have used the naive Bayes classification algorithms from the sentiment library. In addition, I have included graphs of the prevalent emotions and the polarity distribution to show what is currently happening in Venezuela.

The neat thing about this analysis is that it always stays up to date: rerunning the markdown document extracts the tweets again, so the results reflect the current situation regarding presos politicos (political prisoners) and maduro (current socialist dictator in Venezuela).

I have replaced some of the Spanish nouns used in the analysis with their proper English equivalents. Because the sentiment library is English-based, leaving them in Spanish would bias the results heavily toward the ‘neutral’ category.

CPQ Energy & Analytics

Emotion Analysis

The text used for the analysis was obtained from Twitter. I classify it by type of emotion: anger, disgust, fear, joy, sadness, and surprise, which are the categories supported by the library. The classification is performed with a naive Bayes classifier trained on Carlo Strapparava and Alessandro Valitutti’s emotions lexicon.

Polarity Analysis

In contrast to the emotion classes, the ‘classify_polarity’ function classifies a piece of text as positive or negative. Here the classification is done with a naive Bayes algorithm trained on Janyce Wiebe’s subjectivity lexicon.
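To make the idea of a lexicon-trained naive Bayes classifier concrete, here is a minimal sketch in base R. The six-word lexicon and the function name `toy_polarity` are assumptions invented for illustration; this is not the sentiment package's code or data, only the general technique (class prior plus add-one smoothed word likelihoods, compared in log space).

```r
# Toy lexicon (hypothetical): each word counts as one training example of its class
lexicon <- data.frame(
  word     = c("good", "great", "happy", "bad", "terrible", "sad"),
  polarity = c("positive", "positive", "positive",
               "negative", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns the class with the highest naive Bayes log score
toy_polarity <- function(text, lexicon) {
  # split the text into lower-case words
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words <- words[words != ""]
  classes    <- c("positive", "negative")
  vocab_size <- nrow(lexicon)
  scores <- sapply(classes, function(cl) {
    class_words <- lexicon$word[lexicon$polarity == cl]
    # log prior: share of lexicon entries that belong to this class
    log_prior <- log(length(class_words) / vocab_size)
    # log likelihood of each word, with add-one (Laplace) smoothing
    hits    <- words %in% class_words
    log_lik <- sum(log((hits + 1) / (length(class_words) + vocab_size)))
    log_prior + log_lik
  })
  names(scores)[which.max(scores)]
}

toy_polarity("a great and happy day", lexicon)
toy_polarity("terrible and sad news", lexicon)
```

Words that are absent from the lexicon contribute the same smoothed probability to both classes, so only lexicon words move the decision. This is also why untranslated Spanish words carry no signal for an English lexicon.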

A tutorial covering the latest techniques is also available on Colin Priest's website.


2 Loading Libraries

library(devtools)
library(stringr)
library(twitteR)
library(RColorBrewer)
library(ggplot2)

3 Including Library sentiment

#
# the sentiment package is no longer available on CRAN, so we have to
# download the archived source code and install it with this R script
# Note: the sentiment package only needs to be downloaded and installed once
#
if (!require("pacman")) install.packages("pacman")
pacman::p_load(devtools, installr)
install.Rtools()
install_url("http://cran.r-project.org/src/contrib/Archive/Rstem/Rstem_0.4-1.tar.gz")
install_url("http://cran.r-project.org/src/contrib/Archive/sentiment/sentiment_0.2.tar.gz")

library(sentiment)

4 Keys and Access Tokens required to access Twitter

# api_key
# api_secret
# access_token
# access_token_secret
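One way to supply these four values without writing them into the document is to read them from environment variables. The sketch below is an assumption, not part of the original workflow: the variable names (`TWITTER_API_KEY` and so on) and the helper `read_twitter_keys` are hypothetical and can be renamed freely.

```r
# Hypothetical environment variable names holding the four credentials
key_names <- c("TWITTER_API_KEY", "TWITTER_API_SECRET",
               "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_TOKEN_SECRET")

# Hypothetical helper: read the credentials, stopping if any is not set
read_twitter_keys <- function(var_names = key_names) {
  keys <- Sys.getenv(var_names, unset = NA)   # named character vector
  missing_vars <- var_names[is.na(keys)]
  if (length(missing_vars) > 0) {
    stop("Missing environment variables: ",
         paste(missing_vars, collapse = ", "))
  }
  keys
}
```

After calling `keys <- read_twitter_keys()`, the individual values can be assigned with, for example, `api_key <- keys[["TWITTER_API_KEY"]]`.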

5 Submitting request for Authorization

For complete instructions on how to get tweets from R, please follow the very well written step-by-step tutorial by Colin Priest.

setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)

6 Collecting tweets containing terms: “maduro”, “presos”, “politicos”

# Harvest some tweets 

tweets = searchTwitter("maduro + presos + politicos", 
                       n=100000, 
                       lang="es", 
                       since='2015-12-01')

# Get some text

tweets_txt = sapply(tweets, function(x) x$getText())

7 Filtering and extracting noisy characters from tweets

# remove retweet entities
tweets_cl = gsub("(RT|via)((?:\\b\\W*@\\w+)+)","", tweets_txt)

# remove URLs (match from "http" up to the next blank, so no "://..." residue is left behind)
tweets_cl = gsub("http[^[:blank:]]+", "", tweets_cl)
# remove at people
tweets_cl = gsub("@\\w+", "", tweets_cl)

# remove punctuation
tweets_cl = gsub("[[:punct:]]", " ", tweets_cl)
# remove numbers
tweets_cl = gsub("[[:digit:]]", "", tweets_cl)
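# replace any remaining non-alphanumeric characters with spaces
tweets_cl = gsub("[^[:alnum:]]", " ", tweets_cl)
# collapse repeated whitespace and trim leading/trailing spaces
tweets_cl = gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", tweets_cl, perl=TRUE)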
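As a quick check of what this kind of filtering does, here is a minimal sketch that applies the same sequence of substitutions to a single made-up tweet. The sample text and the wrapper name `clean_tweet` are assumptions for illustration, and the whitespace step is simplified to a plain `\\s+` collapse followed by `trimws()`.

```r
# Hypothetical sample tweet, invented for illustration
sample_tweet <- "RT @usuario: Libertad para los presos politicos!! https://t.co/abc123 #Venezuela 2016"

# Hypothetical wrapper applying the cleaning steps in order
clean_tweet <- function(x) {
  x <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", x, perl = TRUE)  # retweet entities
  x <- gsub("http[^[:blank:]]+", "", x)                         # URLs
  x <- gsub("@\\w+", "", x)                                     # mentions
  x <- gsub("[[:punct:]]", " ", x)                              # punctuation
  x <- gsub("[[:digit:]]", "", x)                               # digits
  x <- gsub("\\s+", " ", x)                                     # collapse whitespace
  trimws(x)                                                     # trim both ends
}

clean_tweet(sample_tweet)
# → "Libertad para los presos politicos Venezuela"
```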

8 Lower case tweets and Removing NA

# define a "tolower" wrapper with error handling
try.error = function(x)
{
  # return the lower-cased text, or NA if tolower() raises an error
  # (for example on text with an invalid encoding)
  tryCatch(tolower(x), error = function(e) NA)
}
# lower case using try.error with sapply 
tweets_cl = sapply(tweets_cl, try.error)

# remove NAs in tweets_cl
tweets_cl = tweets_cl[!is.na(tweets_cl)]
names(tweets_cl) = NULL

# remove conjunctions, prepositions, pronouns, and articles
# (the "conjunction" vector below also holds common prepositions)
conjunction <- c(
  "y", "ni", "o", "ya", "luego", "conque", "pues", "e", "de", 
  "pero", "como", "cuando", "tal", "para", "a", "al", "en", "del", "por"
)
pronoun <- c("que", "quien", "quienes", "cual")
article <- c("las", "los", "el", "la", 
              "un", "unos", "una", "unas")

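# combine everything into one vector (named so that it does not mask base::t)
stop_words_es <- c(article, pronoun, conjunction)

# eliminate specific words from tweets
library(tm)
tweets_clean <- removeWords(tweets_cl, stop_words_es)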
tweets_clean = gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", tweets_clean, perl=TRUE)

9 Word cloud of tweets based on frequency encountered

This is a brief word cloud highlighting the most common terms in the set of retrieved tweets. In this case it indicates the main concerns and aspirations of the Venezuelan population: presos (prisoners), libertad (freedom), lucha (fight), politicos (politicians), asilo (asylum), and maduro (current socialist president).

# create a word cloud
library(wordcloud)
col <- brewer.pal(8, "Dark2")
wordcloud(tweets_clean, min.freq = 5, scale = c(6,3), rot.per = 0.25, 
          random.color = TRUE, max.words = 35, random.order = FALSE, colors = col)

10 Performing sentiment analysis: emotions and polarity

# classify emotion
class_emo = classify_emotion(tweets_clean, algorithm="bayes", prior=1.0)
# get emotion best fit
emotion = class_emo[,7]
# substitute NA's by "unknown"
emotion[is.na(emotion)] = "unknown"

# classify polarity
class_pol = classify_polarity(tweets_clean, algorithm="bayes")
# get polarity best fit
polarity = class_pol[,4]

11 Creation of data frame with results: emotion & polarity

#
# using DT library to create a nice table interface
#

library("DT")

# data frame with results
sent_df = data.frame(text = tweets_clean, emotion = emotion,
                     polarity = polarity, stringsAsFactors = FALSE)

# sort data frame
sent_df = within(sent_df,
                 emotion <- factor(emotion, 
                                   levels = names(sort(table(emotion), 
                                                       decreasing = TRUE))))

# "cell-border" and "stripe" are two separate table classes
datatable(sent_df, class = "cell-border stripe", 
          rownames = TRUE, 
          colnames = c("Tweet", "Emotion", "Polarity"), 
          options = list(pageLength = 10))
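The "sort data frame" step works by reordering the factor levels of `emotion` by frequency, which is what later makes the bars in the plot appear from most to least common. A minimal sketch with a made-up vector (the values are assumptions for illustration):

```r
# Hypothetical emotion labels, invented for illustration
emotion_toy <- c("joy", "anger", "joy", "unknown", "joy", "anger")

# count each label and sort the counts from most to least frequent
counts <- sort(table(emotion_toy), decreasing = TRUE)

# rebuild the factor so that its levels follow that frequency order
emotion_sorted <- factor(emotion_toy, levels = names(counts))

levels(emotion_sorted)
# → "joy" "anger" "unknown"
```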


12 ggplot to visualize results: emotion

#
# some plots to visualize preliminary results
#

# plot distribution of emotions
ggplot(sent_df, aes(x = emotion)) + 
  # geom_bar() counts the tweets in each category by default
  geom_bar(aes(fill = emotion)) + 
  scale_fill_brewer(palette = "Dark2") + 
  # the title is set through labs(); theme() only controls its appearance
  labs(title = "Sentiment Analysis of Tweets about Political Situation in Venezuela\n(classification by emotion)", 
       x = "emotion categories", y = "number of tweets") + 
  theme(plot.title = element_text(size = 12))


13 ggplot to visualize results: polarity

# plot distribution of polarity
ggplot(sent_df, aes(x = polarity)) +
  # geom_bar() counts the tweets in each category by default
  geom_bar(aes(fill = polarity)) +
  scale_fill_brewer(palette = "RdGy") +
  # the title is set through labs(); theme() only controls its appearance
  labs(title = "Sentiment Analysis of Tweets about Political Situation in Venezuela\n(classification by polarity)",
       x = "polarity categories", y = "number of tweets") +
  theme(plot.title = element_text(size = 12))


14 Visualize emotions using a cloud

#
# Separate the text by emotions and visualize 
# the words with a comparison cloud
#

emos = levels(factor(sent_df$emotion))
nemo = length(emos)
emo.docs = rep("", nemo)
for (i in 1:nemo)
{
  tmp = tweets_clean[emotion == emos[i]]
  emo.docs[i] = paste(tmp, collapse = " ")
}

# remove stopwords
emo.docs = removeWords(emo.docs, stopwords("english"))
# create corpus
corpus = Corpus(VectorSource(emo.docs))
tdm = TermDocumentMatrix(corpus)
tdm = as.matrix(tdm)
colnames(tdm) = emos
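The matrix that feeds the comparison cloud has one row per term and one column per emotion, with each cell holding how often the term occurs in that emotion's combined text. The sketch below rebuilds such a matrix in base R for two made-up documents; the sample texts and the name `toy_tdm` are assumptions for illustration, and tm's `TermDocumentMatrix` applies additional processing (such as dropping very short words by default) that is not reproduced here.

```r
# Hypothetical per-emotion documents, invented for illustration
emo_docs_toy <- c(anger = "hambre fracaso libertad",
                  joy   = "libertad concentracion libertad")

# split each document into words (names of the vector are kept)
tokens <- strsplit(emo_docs_toy, " ", fixed = TRUE)

# vocabulary: every distinct term, in alphabetical order
terms <- sort(unique(unlist(tokens)))

# count each term in each document: rows = terms, columns = emotions
toy_tdm <- vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = terms))),
  integer(length(terms))
)
rownames(toy_tdm) <- terms

toy_tdm
```

In this toy matrix "libertad" appears in both columns (once under anger, twice under joy). The comparison cloud places each term in the zone where it is most over-represented relative to its average rate across the documents.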

This ‘cloud’ groups words by emotion, based on the classification made by the naive Bayes algorithm included in the sentiment library.
First, in the anger area we observe nouns such as hambre (hunger) and fracaso (failure), as well as terms related to amnesty and libertad (freedom), or the lack of them, all of which clearly convey anger.
In the disgust zone we observe terms such as horror, indicating the population's revulsion and revolt against the establishment. Curiously, the disgust zone is sparsely populated. This might indicate that Venezuelans do not predominantly feel loathing toward Maduro, but are trying to remove him from power by democratic means sooner rather than later.
In the sadness zone the expressions we encountered are less clear. This may be due to the use of the English-based sentiment library.
Within the fear zone we observe the terms diosdado (Maduro's current accomplice), golpes (blows), delito (crime, offense), and disfraza (disguise), which are reasonable expressions of fear in the population.
In joy we find people making an encouraging effort to be stronger, with terms like opositora (opposition), concentracion (demonstration), power (exercise of power), and so on, which give a picture of what people are determined to achieve.

# comparison word cloud
comparison.cloud(tdm, colors = brewer.pal(nemo, "Dark2"),
                 scale = c(3,.5), random.order = FALSE, title.size = 1.5)

Copyright 2016 — CPQ Energy & Analytics