Introduction to Text Mining

In this discussion, we will show you how to perform sentiment analysis on any trending topic on Twitter.

In this session, we will show you how to create an application on the Twitter developer site and generate the access keys needed to pull the trending topics, with the help of the twitteR package by geoffjentry.

Prerequisites

Please install the packages used in this walkthrough: devtools, twitteR, dplyr, tm, plyr, stringr, SnowballC, wordcloud, and ggplot2.
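If any of these are missing, they can be installed from CRAN in a single call (twitteR itself is installed from GitHub in a later section); a minimal sketch:

## Install the CRAN packages used in this walkthrough
install.packages(c("devtools", "dplyr", "tm", "plyr", "stringr",
                   "SnowballC", "wordcloud", "ggplot2"))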

Application Development on Twitter

The first step is to log in to your Twitter account and then click on the Apps link. On the top right corner of the page there is a tab for “Create New App”. Click on it to see the following page:

Twitter Application Page

  • Put the name of your application in the Name field.
  • Enter the description in the Description field.
  • The Website field needs to be filled with a valid URL; any placeholder URL will do.
  • You can leave the Callback URL field blank.

See the following screenshot to understand how to generate the API keys and tokens for OAuth.

Twitter App API Key

  • Click on the highlighted link on the page to get the Access Token, as shown below:

Twitter Access Token

Get Your Hands Dirty with the Data

First, load the devtools package, as it is required to install the twitteR package from GitHub. The following code installs the package and sets up the session shown above.

Installation of twitteR Package in R

## Install the twitteR package from GitHub (requires devtools)
library(devtools)
install_github("geoffjentry/twitteR")
library(twitteR)
library(dplyr)
## Credentials from the application page (replace with your own keys)
api_key <- "V7K1ZkputrE22Kb2Hlmleb5ze"
api_secret <- "KuEtwZjcIZ8VuKENgoQ3w6l5ngFMhnniiQlqwubCldOG6Hf6d2"
access_token <- "774465035552886785-7uBmIK2xCfnPk4zF00UM5JxrtNMCBAv"
access_token_secret <- "s51j97gnlBDeWypMLiyyoAUfHjwuwuCPieSnG33DZtC1j"
## Set up the twitteR OAuth session
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)

The call setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret) establishes the authenticated session with Twitter.
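To confirm the handshake worked, it helps to fire a small request before moving on; a minimal check (the query string here is arbitrary):

## If OAuth is set up correctly, a small search should return results
test_tweets <- searchTwitter("news", n = 5)
length(test_tweets)  # expect 5 (or fewer, if there are few matches)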

Trending Topic: Demonetisation

## Search the tweets for the topic "Demonetisation"
Demon <- searchTwitter("Demonetisation", n = 200, since = "2016-11-07")
Demon_Tweets <- sapply(Demon, function(x) x$getText()) %>% as.data.frame()
colnames(Demon_Tweets) <- c("Text")
write.csv(Demon_Tweets, "Tweets.csv")
Demon_DF <- read.csv("Tweets.csv")
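As an aside, the twitteR package also ships a helper, twListToDF(), which converts the list of status objects straight into a data frame, including metadata such as timestamps and retweet counts; a sketch of the equivalent step:

## Alternative: convert the status list directly to a data frame
Demon_DF2 <- twListToDF(Demon)
head(Demon_DF2$text)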

Clean the Data

Primary Cleaning using gsub()

## Clean the tweets for sentiment analysis
# Remove links, which are not required for sentiment analysis
tweet1 <- gsub("http[s]?://\\S+", "", Demon_DF$Text)
tweet2 <- gsub("#", "", tweet1)
tweet3 <- gsub("@", "", tweet2)
# Remove retweet markers
tweet4 <- gsub("RT|via", "", tweet3)
# Remove emojis and other non-ASCII characters
tweet5 <- gsub("[^\x01-\x7F]", "", tweet4)
# Replace line breaks with spaces and drop digits
tweet6 <- gsub("\n", " ", tweet5)
tweet7 <- gsub("[[:digit:]]", "", tweet6)
write.csv(tweet7, "tweets_clean.csv")
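To verify the pipeline behaves as intended, it can be run on a single made-up tweet (the text below is a hypothetical example, not drawn from the data set):

## Hypothetical tweet to sanity-check the cleaning steps
sample_tweet <- "RT @user: Long queues at banks https://t.co/abc123 #Demonetisation"
step1 <- gsub("http[s]?://\\S+", "", sample_tweet)  # drop the link
step2 <- gsub("#|@", "", step1)                     # drop hash and at signs
gsub("RT|via", "", step2)                           # drop retweet markers
## [1] " user: Long queues at banks  Demonetisation"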

Secondary Cleaning using the “tm” Package

## Load the tm package to clean the corpus
library(tm)
library(plyr)
library(stringr)
library(SnowballC)
tweet_text <- read.csv("tweets_clean.csv")

## Build the corpus
DemonCorpus <- Corpus(VectorSource(tweet_text$x))

## Convert to plain text documents (note: on recent versions of tm this step
## is often unnecessary; skip it if later steps throw errors)
DemonCorpus <- tm_map(DemonCorpus, PlainTextDocument)
## Convert to lower case
DemonCorpus <- tm_map(DemonCorpus, content_transformer(tolower))

## Strip whitespace
DemonCorpus <- tm_map(DemonCorpus, stripWhitespace)

## Remove punctuation
DemonCorpus <- tm_map(DemonCorpus, removePunctuation)

## Scan the stopwords (custom list)
sw <- scan('stopwords.txt', what = 'character', sep = "")
words <- unlist(str_split(sw, '\\s+'))
# Remove stopwords, plus a few leftover encoding artefacts
DemonCorpus <- tm_map(DemonCorpus, removeWords,
                      c(words, "eduaubdedubuc", "jamewils", "yuvjrmfs"))
print(as.character(DemonCorpus[[100]]))
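Note that stopwords.txt above is a custom list. If you do not have such a file at hand, tm ships a standard English stopword list that serves the same purpose:

## Fallback: use tm's built-in English stopword list instead of a custom file
DemonCorpus <- tm_map(DemonCorpus, removeWords, stopwords("english"))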

## Document term matrix
docterm_corpus <- DocumentTermMatrix(DemonCorpus)

## Find frequent terms: total count of each term across the corpus
colS <- colSums(as.matrix(docterm_corpus))
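Before plotting, it is worth checking which terms clear a frequency threshold; tm's findFreqTerms() reports them directly (the threshold of 10 here is arbitrary):

## List all terms that appear at least 10 times across the corpus
findFreqTerms(docterm_corpus, lowfreq = 10)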

Word Cloud

## Create the word cloud
library(wordcloud)
wordcloud(names(colS), colS, scale = c(4, 2), min.freq = 1,
          random.order = TRUE, colors = brewer.pal(6, 'Dark2'), max.words = 200)
Demonetisation Word Cloud
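Since random.order = TRUE places words at random, the layout changes on every run; fixing the random seed just before the wordcloud() call makes the figure reproducible:

set.seed(123)  # call this before wordcloud() to get the same layout every run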

Sentiment Analysis

Here, we will not use the popular Naive Bayes classifier for sentiment analysis; instead, we will do a polarity analysis, from Very Negative to Very Positive, using a simple match algorithm.

We have used an opinion lexicon (the word lists distributed as positive-words.txt and negative-words.txt) to match against the words in the tweets. Using a simple match algorithm we have classified the polarity into the following five categories: Very Negative, Negative, Neutral, Positive, and Very Positive.

The classification is based on the sentiment score. You can customise the range for each category depending on the total range of sentiment scores for the tweet text; in this case it runs from -2 to +3. Below is the code:


## Sentiment Analysis

demondf <- read.csv("tweets_clean.csv")
demondf <- demondf[!is.na(demondf$x), ]
## Scan the lexicon word lists (English), which are in txt format
opinion.lexicon.pos <- scan('positive-words.txt', what = 'character', comment.char = ';')
opinion.lexicon.neg <- scan('negative-words.txt', what = 'character', comment.char = ';')
# Extend the positive and negative word lists with domain-specific terms
pos.words <- c(opinion.lexicon.pos, 'cpimspeak', 'aayog')
neg.words <- c(opinion.lexicon.neg, 'haunt', 'modi', 'fcuk', 'cancel')
## Create the function for the sentiment score
getsentimentscore <- function(sentences, words.positive, words.negative, .progress = 'none') {
  require(plyr)
  require(stringr)
  scores <- laply(sentences, function(sentence, words.positive, words.negative) {
    # Split each sentence on whitespace
    words <- unlist(str_split(sentence, '\\s+'))
    # Match each word against the positive and negative lists
    pos.matches <- !is.na(match(words, words.positive))
    neg.matches <- !is.na(match(words, words.negative))
    # Score = number of positive matches minus number of negative matches
    sum(pos.matches) - sum(neg.matches)
  }, words.positive, words.negative, .progress = .progress)
  ## Return a data frame of the sentences with their scores
  data.frame(text = sentences, score = scores)
}
Result <- getsentimentscore(demondf$x, pos.words, neg.words)
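Before fixing the category boundaries, check the actual distribution of scores; in this run they fell between -2 and +3, which is what the cut-offs used next assume:

## Frequency of each sentiment score; the categories below assume -2 to +3
table(Result$score)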

## Categorise the Demonetisation tweets from Very Negative to Very Positive
vNeg <- sum(Result$score == -2)
Neg <- sum(Result$score == -1)
Neutral <- sum(Result$score == 0)
Pos <- sum(Result$score == 1 | Result$score == 2)
vPos <- sum(Result$score == 3)
## Build the data frame for plotting
Tweets <- as.data.frame(c(vNeg, Neg, Neutral, Pos, vPos))
colnames(Tweets) <- c("Number of Tweets")
Emotion <- as.data.frame(c("vNeg", "Neg", "Neutral", "Pos", "vPos"))
colnames(Emotion) <- c("Degree of Emotion")
Plot.Result <- cbind(Emotion, Tweets)

Plot the Polarity using ggplot2

library(ggplot2)

ggplot(data = Plot.Result, aes(x = `Degree of Emotion`, y = `Number of Tweets`)) +
  geom_bar(aes(fill = `Degree of Emotion`), stat = "identity", width = 0.4) +
  scale_fill_brewer(palette = "Dark2") +
  xlab("Degree of Polarity") +
  ylab("Number of Tweets") +
  geom_text(aes(label = `Number of Tweets`), vjust = -0.5, colour = "brown") +
  theme_bw() +
  theme(panel.border = element_blank(), panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"), legend.position = 'none')
Polarity

Conclusion

You can also use the popular Naive Bayes algorithm via the sentiment package. With the help of this algorithm you can extract different kinds of emotion from a trending topic. In the next discussion, we will introduce the sentiment package for sentiment analysis with a comparative word cloud.

Stay Tuned …