Basic Sentiment Analysis

Sentiment analysis is the process of determining whether a piece of writing is positive, negative or neutral. It is also known as opinion mining: deriving the opinion or attitude of a speaker. As data volumes grow, it is no longer feasible to read every text manually to gauge its emotion. Instead, an algorithm can extract emotions from thousands of text documents in seconds. We will review one such algorithm here.

Basic Setup

#set working directory 
setwd("C:/Users/awani/Desktop")

#load required libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(twitteR, wordcloud, tm, tidyr, tidytext, syuzhet, ggplot2, NLP, RColorBrewer, RTextTools)

Data Extraction from Twitter

The first step is to get text comments or reviews from social media or any other website. In this example, we will use the “twitteR” package in R to extract Twitter data.

Visit https://apps.twitter.com/app/new and log in with your Twitter credentials. You need to create a Twitter app to obtain API keys, and connect to Twitter with them, before you can extract tweets.

## Twitter API credentials (replace the placeholders with your own keys)
consumer_key = "Enter your Key"
consumer_secret = "Enter your Key"
access_token = "Enter your Key"
access_secret = "Enter your Key"

## Set up connection
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret) # set up twitter connection

## [1] "Using direct authentication"
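Keys written directly into a script are easy to leak when the script is shared. As a minimal sketch, the credentials could instead be read from environment variables. The variable names (TWITTER_CONSUMER_KEY and so on) and the helper read_twitter_credentials are assumptions made for illustration; they are not part of the twitteR package.

```r
# Hypothetical helper: reads the four credentials from environment variables.
# The variable names are assumptions; set them in ~/.Renviron or in the shell.
read_twitter_credentials = function() {
  vars = c(consumer_key    = "TWITTER_CONSUMER_KEY",
           consumer_secret = "TWITTER_CONSUMER_SECRET",
           access_token    = "TWITTER_ACCESS_TOKEN",
           access_secret   = "TWITTER_ACCESS_SECRET")
  creds = Sys.getenv(unname(vars), unset = NA, names = FALSE)
  names(creds) = names(vars)

  # fail early with a readable message if anything is missing or empty
  missing = vars[is.na(creds) | creds == ""]
  if (length(missing) > 0) {
    stop("Missing environment variables: ", paste(missing, collapse = ", "))
  }
  as.list(creds)
}

# Usage (needs the variables to be set and a network connection):
# creds = read_twitter_credentials()
# setup_twitter_oauth(creds$consumer_key, creds$consumer_secret,
#                     creds$access_token, creds$access_secret)
```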

# Search tweets
fb = searchTwitter("zuckerberg", n= 1000, lang = "en")

#save as dataframe
fb = do.call(rbind, lapply(fb, as.data.frame))

Data Cleaning

Extracted tweets contain plenty of content that conveys no sentiment, such as user names, punctuation and links. It is better to remove it before moving forward. Some custom cleaning might be required, but the code below takes care of the most common data cleaning steps.

### clean data ####

text = fb$text   # save tweets to a separate vector "text"

text = gsub("(RT|via)((?:\\b\\W*@\\w+)+)","",text)         # remove retweet/via headers
text = gsub("http[^[:blank:]]+","",text)                   # remove URLs
text = gsub("@\\w+","",text)                               # remove remaining @mentions
text = gsub("[[:punct:]]","",text)                         # remove punctuation
text = gsub('[[:digit:]]+', '', text)                      # remove digits
text = gsub("[\r\n]+", " ", text)                          # replace line breaks with a space
text = iconv(text, to = "ASCII//TRANSLIT")                 # transliterate non-ASCII characters
text = iconv(text, "ASCII", "UTF-8", sub="")               # drop anything still not ASCII
text = trimws(text)                                        # trim leading/trailing whitespace
text = tolower(text)                                       # lower case
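To see what the cleaning steps do, it can help to run them on a couple of made-up tweets before applying them to real data. The sketch below wraps the main substitutions in a helper, clean_tweets, which is a name introduced here for illustration. The sample tweets are invented, and the encoding conversion is left out because the samples are plain ASCII.

```r
# Hypothetical helper bundling the regex-based cleaning steps.
clean_tweets = function(x) {
  x = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", x)  # retweet/via headers
  x = gsub("http[^[:blank:]]+", "", x)            # URLs
  x = gsub("@\\w+", "", x)                        # remaining @mentions
  x = gsub("[[:punct:]]", "", x)                  # punctuation
  x = gsub("[[:digit:]]+", "", x)                 # digits
  x = gsub("[[:space:]]+", " ", x)                # line breaks and repeated spaces
  tolower(trimws(x))
}

# Invented sample tweets, for illustration only
sample_tweets = c(
  "RT @someone: Zuckerberg testifies today! https://t.co/abc123",
  "@friend I read 2 articles about Facebook\nprivacy"
)

cleaned = clean_tweets(sample_tweets)
print(cleaned)
# [1] "zuckerberg testifies today"
# [2] "i read articles about facebook privacy"
```

Checking a few inputs this way makes it easier to spot a pattern that removes too much, or too little, before it affects the sentiment scores.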

Word Cloud

A word cloud is a quick and effective way to visualize word frequency and identify the most repeated words in a corpus. One must, however, be vigilant when making word clouds: used carelessly, they often paint a misleading picture.

### Word Cloud ###
corpus = Corpus(VectorSource(text))                          # convert tweets to corpus

#some more cleaning
corpus = tm_map(corpus, removeWords, stopwords("english"))   # remove stopwords like "and", "the" and "that"
corpus = tm_map(corpus, stripWhitespace)                     # remove extra whitespace

# word frequency
uniqwords = as.matrix(TermDocumentMatrix(corpus))            # convert corpus to a term-document matrix
wordfreq = sort(rowSums(uniqwords),decreasing=TRUE)          # total frequency of each word across all tweets
WCinput = data.frame(word = names(wordfreq),freq=wordfreq)   # word frequency dataframe

#generate the wordcloud
wordcloud(words = WCinput$word, freq = WCinput$freq, min.freq = 2,
          max.words=200, random.order=FALSE, rot.per=0.35,
          colors=brewer.pal(8, "Dark2"))
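Because a word cloud shows frequency only through font size, it can be worth looking at the raw counts as well. The following is a minimal base-R sketch of the same counting idea, run on a small invented vector of already-cleaned documents. It does not use tm, so its tokenization is simpler than that of TermDocumentMatrix.

```r
# Invented, already-cleaned documents, for illustration only
docs = c("privacy data facebook",
         "facebook hearing congress",
         "facebook data")

# split on whitespace and count every word across all documents
words = unlist(strsplit(docs, "[[:space:]]+"))
freq  = sort(table(words), decreasing = TRUE)

top_words = data.frame(word = names(freq),
                       freq = as.integer(freq),
                       stringsAsFactors = FALSE)
print(head(top_words, 2))
```

Here "facebook" appears three times and "data" twice, so they head the table. On real tweets, a table like this makes it clear how large the gap between the top words actually is.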

Basic Sentiment Analysis

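We will now use the “syuzhet” package to summarize the sentiments attached to the extracted tweets. The get_nrc_sentiment function scores each text against 10 categories from the NRC lexicon: eight emotions (anger, anticipation, disgust, fear, joy, sadness, surprise and trust) and two sentiments (negative and positive), giving us an idea of the wide range of emotions expressed. We can then aggregate these scores and visualize them to get a better understanding.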

### Sentiment Analysis ####

#get sentiment
Sentiment = get_nrc_sentiment(as.character(text), cl =NULL,language = "english")

#aggregate
agg_sent = data.frame(apply(Sentiment,2,sum))
names(agg_sent) = "WordCount"
agg_sent$Emotion = row.names(agg_sent)

#plot sentiments
ggplot(data = agg_sent, aes(x = Emotion, y = WordCount)) +
  geom_bar(aes(fill = Emotion), stat = "identity") +
  theme(legend.position = "none")+
  xlab("Emotion") + ylab("Total Word Count") + ggtitle("Twitter Sentiment Analysis")
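The scoring above is lexicon-based: words in each text are looked up in a word-emotion dictionary and the matches are counted per category. The sketch below illustrates that idea with a tiny made-up lexicon. Both toy_lexicon and score_emotions are hypothetical names introduced here; this is not the NRC lexicon, nor the implementation used by syuzhet.

```r
# Toy lexicon for illustration only; NOT the NRC lexicon.
# A word may map to more than one emotion ("love" -> joy and trust).
toy_lexicon = data.frame(
  word    = c("happy", "love", "angry", "afraid", "love"),
  emotion = c("joy",   "joy",  "anger", "fear",   "trust"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per text, one column per emotion.
score_emotions = function(text, lexicon) {
  emotions = sort(unique(lexicon$emotion))
  rows = lapply(text, function(doc) {
    tokens = strsplit(tolower(doc), "[^a-z]+")[[1]]
    # collect every emotion attached to every token
    hits = unlist(lapply(tokens, function(tk) lexicon$emotion[lexicon$word == tk]))
    as.integer(table(factor(hits, levels = emotions)))
  })
  scores = as.data.frame(do.call(rbind, rows))
  names(scores) = emotions
  scores
}

scores = score_emotions(c("I love this, so happy", "Angry and afraid"), toy_lexicon)
print(scores)
#   anger fear joy trust
# 1     0    0   2     1
# 2     1    1   0     0

print(colSums(scores))   # totals per emotion, analogous to the aggregation step
```

The column totals play the same role as agg_sent above: one count per emotion across all texts, ready to be plotted.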

Awanindra Singh

July 17, 2018