Analysis performed on 2015-06-28 14:18:46

1 Introduction

Social media platforms such as Twitter, Facebook, and LinkedIn are among the most popular free public platforms for expressing opinions on a diverse range of subjects. With millions of tweets (feeds) posted daily, there is a wealth of information out there. Twitter, in particular, is used extensively by individuals and companies for status updates and product/service marketing. Twitter is also a great text mining source for data scientists. From analyzing behavior, incidents, and sentiments to predicting stock markets, events, and trends, it provides a ton of data for mining and contextual analytics.

In this post, I will show how to do a simple sentiment analysis. We will download twitter feeds on a subject and compare them against lists of positive and negative words. The ratio of matched positive to negative words is the sentiment ratio. We will also define functions to find the most frequently occurring words, which can provide useful contextual information on public opinion and sentiment. The data set of positive and negative opinion words (sentiment words) comes from Hu and Liu, KDD-2004.

The main packages used in this analysis are twitteR, dplyr, stringr, ggplot2, tm, SnowballC, qdap, and wordcloud. It is important to install and load these packages using the install.packages() and library() commands, as sketched below.
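A minimal setup sketch (install.packages() needs to run only once; the lapply() loop is just one convenient way to load everything):

pkgs=c("twitteR","dplyr","stringr","ggplot2","tm","SnowballC","qdap","wordcloud")
install.packages(pkgs)                      # one-time installation
lapply(pkgs,library,character.only=TRUE)    # load the packages for this session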

2 Load Twitter API

The first step is to register in the Twitter application developer’s portal and obtain authorization. You need four credentials:

api_key= "your api key"
api_secret= "your api secret"
access_token= "your access token"
access_token_secret= "your access token secret"

After obtaining the credentials, we set up authorization to access the Twitter API.

setup_twitter_oauth(api_key,api_secret,access_token,access_token_secret)

3 Load word dictionaries

The next step is to load the positive and negative sentiment word lists (dictionaries) into your R working directory. The words are then read in and assigned to the variables positive and negative, as shown below.

positive=scan('positive-words.txt',what='character',comment.char=';')
negative=scan('negative-words.txt',what='character',comment.char=';')
positive[20:30]
##  [1] "accurately"   "achievable"   "achievement"  "achievements"
##  [5] "achievible"   "acumen"       "adaptable"    "adaptive"    
##  [9] "adequate"     "adjustable"   "admirable"
negative[500:510]
##  [1] "byzantine"    "cackle"       "calamities"   "calamitous"  
##  [5] "calamitously" "calamity"     "callous"      "calumniate"  
##  [9] "calumniation" "calumnies"    "calumnious"

There are a total of 2006 positive and 4783 negative words. The output above also shows some example words from the two dictionaries.

Additional words can be added to or removed from the dictionaries. In the code below, we add the word "cloud" to the positive dictionary and remove it from the negative dictionary, since in a technology context "cloud" carries no negative sentiment.

positive=c(positive,"cloud")
negative=negative[negative!="cloud"]

4 Search twitter feeds

The next step is to define a Twitter search string and assign it to a variable, findfd. The number of tweets to extract is assigned to another variable, number. The time taken to search and extract the tweets depends on this number; a slow internet connection or complex search terms may add further delays.

findfd= "CyberSecurity"
number= 5000

In the above code, we use the CyberSecurity search string to retrieve 5000 tweets. The code for searching Twitter is:

tweet=searchTwitter(findfd,number)
## Time difference of 1.301408 mins

4.1 Getting text from feeds

Twitter feeds have tons of additional fields and embedded superfluous information. We use the getText() method to extract just the text field and assign the resulting list to a variable, tweetT. The extraction is applied to all 5000 tweets with lapply(). The code below also shows the results for the first 5 feeds.

tweetT=lapply(tweet,function(t)t$getText())
head(tweetT,5)
## [[1]]
## [1] "RT @PCIAA: \"You must have realtime technology\" how do you defend against #Cyberattacks? @FireEye #cybersecurity http://t.co/Eg5H9UmVlY"
## 
## [[2]]
## [1] "@MPBorman: ‘#Cybersecurity on agenda for 80% corporate boards http://t.co/eLfxkgi2FT  @CS… http://t.co/h9tjop0ete http://t.co/qiyfP94FlQ"
## 
## [[3]]
## [1] "The FDA takes steps to strengthen cybersecurity of medical devices | @scoopit via @60601Testing http://t.co/9eC5LhGgBa"
## 
## [[4]]
## [1] "Senior Solutions Architect, Cybersecurity, NYC-Long Island region, Virtual offic... http://t.co/68aOUMNgqy #job#cybersecurity"
## 
## [[5]]
## [1] "RT @Cyveillance: http://t.co/Ym8WZXX55t #cybersecurity #infosec - The #DarkWeb As You Know It Is A Myth via @Wired http://t.co/R67Nh6Ck70"

4.2 Defining text cleaning functions

In this step, we write a function that executes a series of commands to clean the text: it removes punctuation, special characters, embedded HTTP links, extra spaces, and digits. The function also converts upper-case characters to lower case using tolower(). However, tolower() can stop unexpectedly when it encounters special characters, halting execution of the R code. To avoid this, we write an error-catching wrapper, tryTolower, and embed it in the text cleaning function.

tryTolower = function(x)
{
  y = NA
  # tryCatch error
  try_error = tryCatch(tolower(x), error = function(e) e)
  # if not an error
  if (!inherits(try_error, "error"))
    y = tolower(x)
  return(y)
}
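A quick check of the wrapper's behavior (the second call assumes a UTF-8 locale, where tolower() would otherwise stop with an "invalid multibyte string" error):

tryTolower("Cyber Security")  # returns "cyber security"
tryTolower("caf\xe9")         # returns NA instead of raising an error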

The clean() function cleans the twitter feeds and splits the strings into a vector of words:

clean=function(t){
 t=gsub('[[:punct:]]','',t)    # remove punctuation
 t=gsub('[[:cntrl:]]','',t)    # remove control characters
 t=gsub('\\d+','',t)           # remove digits
 t=gsub('@\\w+','',t)          # remove @mentions (the '@' is already stripped above, so usernames remain as plain words)
 t=gsub('http\\w+','',t)       # remove links
 t=gsub("^\\s+|\\s+$", "", t)  # trim leading and trailing whitespace
 t=sapply(t,function(x) tryTolower(x))  # safe lower-casing
 t=str_split(t," ")            # split into words (stringr)
 t=unlist(t)
 return(t)
}

4.3 Cleaning and splitting twitter feeds

In this step, we apply the clean() function defined above to the list of 5000 feeds. The resulting word vectors are stored in a list object, tweetclean. The following code also shows the first 5 tweets cleaned and split by this function:

tweetclean=lapply(tweetT,function(x) clean(x))
head(tweetclean,5)
## [[1]]
##  [1] "rt"            "pciaa"         "you"           "must"         
##  [5] "have"          "realtime"      "technology"    "how"          
##  [9] "do"            "you"           "defend"        "against"      
## [13] "cyberattacks"  "fireeye"       "cybersecurity"
## 
## [[2]]
##  [1] "mpborman"       "‘cybersecurity" "on"             "agenda"        
##  [5] "for"            ""               "corporate"      "boards"        
##  [9] " "              "cs…"           
## 
## [[3]]
##  [1] "the"           "fda"           "takes"         "steps"        
##  [5] "to"            "strengthen"    "cybersecurity" "of"           
##  [9] "medical"       "devices"       ""              "scoopit"      
## [13] "via"           "testing"      
## 
## [[4]]
##  [1] "senior"           "solutions"        "architect"       
##  [4] "cybersecurity"    "nyclong"          "island"          
##  [7] "region"           "virtual"          "offic"           
## [10] ""                 "jobcybersecurity"
## 
## [[5]]
##  [1] "rt"            "cyveillance"   ""              "cybersecurity"
##  [5] "infosec"       ""              "the"           "darkweb"      
##  [9] "as"            "you"           "know"          "it"           
## [13] "is"            "a"             "myth"          "via"          
## [17] "wired"

5 Analyzing twitter feeds

Here we get into the actual task of analyzing the feeds. We compare the twitter text feeds against the word dictionaries and retrieve the matching words. To do this, we first define a function to count the number of words matching the positive dictionary (the negative counterpart follows later). Here is the code for the counting function, returnpscore:

returnpscore=function(tweet) {
    pos.match=match(tweet,positive)
    pos.match=!is.na(pos.match)
    pos.score=sum(pos.match)
    return(pos.score)
}

Next, we apply this function to the tweetclean list:

positive.score=lapply(tweetclean,function(x) returnpscore(x))

Next, we define a loop to count the total number of positive words present in the tweets:

pcount=0
for (i in 1:length(positive.score)) {
  pcount=pcount+positive.score[[i]]
}
pcount
## [1] 1569
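
The same total can be computed without an explicit loop, using base R's unlist() and sum():

pcount=sum(unlist(positive.score))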

As seen above, there are 1569 positive words in the extracted tweets. A similar method is used to find the negative score in the feeds.
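
A minimal sketch of the negative counterpart, following the same pattern (the names returnnscore and negative.score are assumptions, consistent with how ncount is used later):

returnnscore=function(tweet) {
    neg.match=match(tweet,negative)
    neg.match=!is.na(neg.match)
    neg.score=sum(neg.match)
    return(neg.score)
}
negative.score=lapply(tweetclean,function(x) returnnscore(x))
ncount=sum(unlist(negative.score))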

Next, we retrieve the matching words themselves. The following function returns the positive words matched in a tweet:

poswords=function(tweets){
    pmatch=match(tweets,positive)   # match against the positive dictionary
    posw=positive[pmatch]
    posw=posw[!is.na(posw)]         # keep only the matched words
    return(posw)
  }

This function is applied to our tweetclean list, and a loop accumulates the matched words in pdatamart (note that concatenating with c() turns the initial data frame into a list of words). The code below also shows the first 10 positive matches:

words=NULL
pdatamart=data.frame(words)

for (t in tweetclean) {
  pdatamart=c(poswords(t),pdatamart)
}
head(pdatamart,10)
## [[1]]
## [1] "best"
## 
## [[2]]
## [1] "safe"
## 
## [[3]]
## [1] "capable"
## 
## [[4]]
## [1] "tough"
## 
## [[5]]
## [1] "fortune"
## 
## [[6]]
## [1] "excited"
## 
## [[7]]
## [1] "kudos"
## 
## [[8]]
## [1] "appropriate"
## 
## [[9]]
## [1] "humour"
## 
## [[10]]
## [1] "worth"

Similarly, we write a matching function and loop for the negative words and accumulate the results in another object, ndatamart; a sketch follows.
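
A minimal sketch of these steps, mirroring poswords() above (the function name negwords is an assumption):

negwords=function(tweets){
    nmatch=match(tweets,negative)
    negw=negative[nmatch]
    negw=negw[!is.na(negw)]
    return(negw)
  }

words=NULL
ndatamart=data.frame(words)

for (t in tweetclean) {
  ndatamart=c(negwords(t),ndatamart)
}

Here are the first 10 negative words from the tweets.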

head(ndatamart,10)
## [[1]]
## [1] "attacks"
## 
## [[2]]
## [1] "breach"
## 
## [[3]]
## [1] "issues"
## 
## [[4]]
## [1] "attacks"
## 
## [[5]]
## [1] "poverty"
## 
## [[6]]
## [1] "attacks"
## 
## [[7]]
## [1] "dead"
## 
## [[8]]
## [1] "dead"
## 
## [[9]]
## [1] "dead"
## 
## [[10]]
## [1] "dead"

5.1 Plotting high frequency negative and positive words

In this step, we create some charts to show the distribution of high-frequency positive and negative words. We use the unlist() function to convert the lists to vectors; the vector variables pwords and nwords are then converted to frequency data frames with table().
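
A minimal sketch of the conversion (the variable names follow the prose above):

pwords=unlist(pdatamart)
nwords=unlist(ndatamart)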

dpwords=data.frame(table(pwords))
dnwords=data.frame(table(nwords))

Using the dplyr package, we first mutate the words to character variables and then filter for more than 15 repetitions, for both the positive and negative sets. The positive case:

dpwords=dpwords%>%
  mutate(pwords=as.character(pwords))%>%
  filter(Freq>15)
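
The negative words get the same treatment (a sketch, using the dnwords frame built above):

dnwords=dnwords%>%
  mutate(nwords=as.character(nwords))%>%
  filter(Freq>15)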

We plot the major positive words and their frequencies using the ggplot2 package. As computed earlier, there are a total of 1569 positive words; the frequency distribution gives an indication of the degree of positive sentiment.

ggplot(dpwords,aes(pwords,Freq))+geom_bar(stat="identity",fill="lightblue")+theme_bw()+
  geom_text(aes(pwords,Freq,label=Freq),size=4)+
  labs(x="Major Positive Words", y="Frequency of Occurrence",title=paste("Major Positive Words and Occurrence in \n '",findfd,"' twitter feeds, n =",number))+
  geom_text(aes(1,5,label=paste("Total Positive Words :",pcount)),size=4,hjust=0)+theme(axis.text.x=element_text(angle=45))

Likewise, we plot the negative words and their frequencies. There are a total of 2063 negative words in the 5000 twitter feeds on the CyberSecurity search string.
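
A sketch of the corresponding negative-word plot, mirroring the positive one (dnwords and ncount come from the sketches above):

ggplot(dnwords,aes(nwords,Freq))+geom_bar(stat="identity",fill="lightblue")+theme_bw()+
  geom_text(aes(nwords,Freq,label=Freq),size=4)+
  labs(x="Major Negative Words", y="Frequency of Occurrence",title=paste("Major Negative Words and Occurrence in \n '",findfd,"' twitter feeds, n =",number))+
  geom_text(aes(1,5,label=paste("Total Negative Words :",ncount)),size=4,hjust=0)+theme(axis.text.x=element_text(angle=45))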

6 Removing common words and creating wordcloud

Here we convert tweetclean into a word corpus using the Corpus and VectorSource functions. A word corpus enables us to eliminate superfluous common words using the text mining package tm. Removing these common words, also called stop words, lets us focus on the important words and helps establish context. The following code provides a few examples of stop words:

tweetscorpus=Corpus(VectorSource(tweetclean))
tweetscorpus=tm_map(tweetscorpus,removeWords,stopwords("english"))
stopwords("english")[30:50]
##  [1] "what"   "which"  "who"    "whom"   "this"   "that"   "these" 
##  [8] "those"  "am"     "is"     "are"    "was"    "were"   "be"    
## [15] "been"   "being"  "have"   "has"    "had"    "having" "do"

Next, we create a word cloud of the tweets using the wordcloud package. Note that we limit the maximum number of words to 300.

wordcloud(tweetscorpus,scale=c(5,0.5),random.order = TRUE,rot.per = 0.20,use.r.layout = FALSE,colors = brewer.pal(6,"Dark2"),max.words = 300)

7 Analyzing and plotting high frequency words

In this final step, we convert the word corpus into a document-term matrix using the function DocumentTermMatrix. The matrix can be analyzed to examine the most frequently occurring uncommon words. Next, we remove the sparse terms (extremely low-frequency words) from the matrix. The code then displays the most frequent terms (with a frequency of 100 or above):

dtm=DocumentTermMatrix(tweetscorpus)
# remove sparse terms (extremely low-frequency words)
dtms=removeSparseTerms(dtm,.99)
freq=sort(colSums(as.matrix(dtms)),decreasing=TRUE)
# list terms appearing 100 times or more
findFreqTerms(dtm,lowfreq=100)
##  [1] "amp"           "atf"           "better"        "breach"       
##  [5] "china"         "cyber"         "cybercrime"    "cybersecurity"
##  [9] "data"          "experts"       "federal"       "firm"         
## [13] "government"    "hackers"       "hack"          "healthcare"   
## [17] "help"          "heres"         "http…"         "icit"         
## [21] "infosec"       "investigation" "iot"           "learn"        
## [25] "look"          "love"          "lunch"         "new"          
## [29] "news"          "next"          "official"      "opm"          
## [33] "passwords"     "possible"      "post"          "privacy"      
## [37] "reportedly"    "securing"      "security"      "senior"       
## [41] "share"         "site"          "startups"      "talk"         
## [45] "thehill"       "tips"          "took"          "top"          
## [49] "via"           "wanted"        "wed"           "whats"

Finally, we convert the term frequencies to a data frame, filter for a minimum frequency of 75, and plot using ggplot2.
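
The plot annotation below calls a helper, result(ncount,pcount), which is not defined elsewhere in the post; a minimal sketch of what it might look like (the verdict strings are an assumption):

result=function(n,p){
  # hypothetical verdict helper: compares negative and positive counts
  if (p>n) "Overall Sentiment: Positive"
  else if (n>p) "Overall Sentiment: Negative"
  else "Overall Sentiment: Neutral"
}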

wf=data.frame(word=names(freq),freq=freq)
wfh=wf%>%
  filter(freq>=75,word!=tolower(findfd))
ggplot(wfh,aes(word,freq))+geom_bar(stat="identity",fill='lightblue')+theme_bw()+
  theme(axis.text.x=element_text(angle=45,hjust=1))+
  geom_text(aes(word,freq,label=freq),size=4)+labs(x="High Frequency Words",y="Number of Occurrences", title=paste("High Frequency Words and Occurrence in \n '",findfd,"' twitter feeds, n =",number))+
  geom_text(aes(1,max(freq)-100,label=paste("# Positive Words:",pcount,"\n","# Negative Words:",ncount,"\n",result(ncount,pcount))),size=5, hjust=0)