Introduction to Text Mining
In this discussion, we will show you how to perform sentiment analysis on any trending topic on Twitter. We will walk through creating an application on the Twitter developer site and generating the session keys needed to access trending topics, with the help of the twitteR package by geoffjentry.
Prerequisites
Please install the following packages; a one-line install command is shown after the list:
- ggplot2
- devtools
- dplyr
- tm
- SnowballC
- wordcloud
- stringr
- plyr
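A minimal sketch to install everything from CRAN in one go (package names as listed above):

## Install all required packages from CRAN
install.packages(c("ggplot2", "devtools", "dplyr", "tm",
                   "SnowballC", "wordcloud", "stringr", "plyr"))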
Get Your Hands Dirty with the Data
First, load the “devtools” package, as it is required to install the twitteR package from its GitHub account. A sketch of the installation and authentication code follows below.
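A minimal sketch of the install-and-authenticate step, assuming you have already created an app on the Twitter developer site; the four key/secret strings are placeholders for your own credentials:

## Install twitteR from geoffjentry's GitHub account
library(devtools)
install_github("geoffjentry/twitteR")
library(twitteR)
## Authenticate with the keys generated for your Twitter app (placeholders below)
setup_twitter_oauth(consumer_key    = "YOUR_CONSUMER_KEY",
                    consumer_secret = "YOUR_CONSUMER_SECRET",
                    access_token    = "YOUR_ACCESS_TOKEN",
                    access_secret   = "YOUR_ACCESS_SECRET")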
Trending Topic: Demonetisation
## Search the tweets for the topic "demonetisation"
library(dplyr)  # provides the %>% pipe used below
Demon <- searchTwitter("Demonetisation", n = 200, since = "2016-11-07")
Demon_Tweets <- sapply(Demon, function(x) x$getText()) %>% as.data.frame()
colnames(Demon_Tweets) <- c("Text")
write.csv(Demon_Tweets, "Tweets.csv")
Demon_DF <- read.csv("Tweets.csv")
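A quick illustrative sanity check on what came back before cleaning:

## Peek at the first few tweets (illustrative)
head(Demon_DF$Text, 3)
nrow(Demon_DF)  # number of tweets retrieved (up to 200)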
Clean The Data
Primary Cleaning using gsub()
## Clean the tweets for sentiment analysis
# remove URLs (e.g. t.co links), which are not required for sentiment analysis
tweet1 <- gsub("http\\S+", "", Demon_DF$Text)
# drop hashtag and mention symbols, keeping the words themselves
tweet2 <- gsub("#", "", tweet1)
tweet3 <- gsub("@", "", tweet2)
# drop retweet markers
tweet4 <- gsub("RT|via", "", tweet3)
# replace line breaks with spaces
tweet5 <- gsub("\n", " ", tweet4)
# strip emojis and any other non-ASCII characters
tweet6 <- gsub("[^\x01-\x7F]", "", tweet5)
# remove digits
tweet7 <- gsub("[[:digit:]]", "", tweet6)
write.csv(tweet7, "tweets_clean.csv")
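To confirm the cleaning worked, an illustrative before-and-after look at a single tweet:

## Compare a raw tweet with its cleaned version
Demon_DF$Text[1]
tweet7[1]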
Secondary Cleaning using the “tm” package
## Load the tm package to clean the corpus
library(tm)
library(plyr)
library(stringr)
library(SnowballC)
tweet_text=read.csv("tweets_clean.csv")
## Build the corpus
DemonCorpus <- Corpus(VectorSource(tweet_text$x))
## Convert to lower case
DemonCorpus <- tm_map(DemonCorpus, content_transformer(tolower))
## Strip whitespace
DemonCorpus <- tm_map(DemonCorpus, stripWhitespace)
## Remove punctuation
DemonCorpus <- tm_map(DemonCorpus, removePunctuation)
## Scan the custom stopword list (one file of space-separated words)
sw <- scan('stopwords.txt', what = 'character', sep = "")
words <- unlist(str_split(sw, '\\s+'))
# Remove stopwords, plus a few leftover garbage tokens from the emoji bytes
DemonCorpus <- tm_map(DemonCorpus, removeWords, c(words, "eduaubdedubuc", "jamewils", "yuvjrmfs"))
print(as.character(DemonCorpus[[100]]))
## Document term matrix
docterm_corpus <- DocumentTermMatrix(DemonCorpus)
#find frequent terms
colS <- colSums(as.matrix(docterm_corpus))
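Before drawing the cloud, it helps to check which terms dominate; a quick illustrative peek:

## Top 10 most frequent terms
head(sort(colS, decreasing = TRUE), 10)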
Word Cloud
## Create the word cloud
library(wordcloud)
wordcloud(names(colS), colS, scale = c(4, 2), min.freq = 1, max.words = 200,
          random.order = TRUE, colors = brewer.pal(6, 'Dark2'))
Sentiment Analysis
Here, we will not use the popular Naive Bayes classifier for the sentiment analysis; instead, we will analyse polarity from Very Negative to Very Positive using a simple word-matching algorithm. We match the words in the tweets against an opinion lexicon dictionary and classify each tweet's polarity into the following five categories:
- Very Negative
- Negative
- Neutral
- Positive
- Very Positive
We classify on the basis of the sentiment score. You can customise the range for each category depending on the total range of sentiment scores for the tweet text; in this case, it lies between -2 and +3. Below is the code:
## Sentiment Analysis
demondf <- read.csv("tweets_clean.csv")
demondf <- demondf[!is.na(demondf$x), ]
## Scan the opinion lexicon word lists (plain text; ';' marks comment lines)
opinion.lexicon.pos <- scan('positive-words.txt', what = 'character', comment.char = ';')
opinion.lexicon.neg <- scan('negative-words.txt', what = 'character', comment.char = ';')
# extend the positive and negative word lists with topic-specific terms
pos.words <- c(opinion.lexicon.pos, 'cpimspeak', 'aayog')
neg.words <- c(opinion.lexicon.neg, 'haunt', 'modi', 'fcuk', 'cancel')
## Create the function for the sentiment score
getsentimentscore <- function(sentences, words.positive, words.negative, .progress = 'none') {
  require(plyr)
  require(stringr)
  # ensure a plain character vector (read.csv may return factors)
  sentences <- as.character(sentences)
  scores <- laply(sentences, function(sentence, words.positive, words.negative) {
    # split each sentence on whitespace
    words <- unlist(str_split(sentence, '\\s+'))
    # match the words against the positive and negative lists
    pos.matches <- !is.na(match(words, words.positive))
    neg.matches <- !is.na(match(words, words.negative))
    # score = number of positive matches minus number of negative matches
    sum(pos.matches) - sum(neg.matches)
  }, words.positive, words.negative, .progress = .progress)
  ## Return a data frame with each sentence and its score
  return(data.frame(text = sentences, score = scores))
}
Result <- getsentimentscore(demondf$x, pos.words, neg.words)
## Categorise the demonetisation tweets from very negative to very positive
vNeg    <- sum(Result$score == -2)
Neg     <- sum(Result$score == -1)
Neutral <- sum(Result$score == 0)
Pos     <- sum(Result$score == 1 | Result$score == 2)
vPos    <- sum(Result$score == 3)
## Build the data frame for plotting
Plot.Result <- data.frame(
  "Degree of Emotion" = c("vNeg", "Neg", "Neutral", "Pos", "vPos"),
  "Number of Tweets"  = c(vNeg, Neg, Neutral, Pos, vPos),
  check.names = FALSE
)
## Keep the categories in order from very negative to very positive on the x-axis
Plot.Result$`Degree of Emotion` <- factor(Plot.Result$`Degree of Emotion`,
                                          levels = c("vNeg", "Neg", "Neutral", "Pos", "vPos"))
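Before plotting, you can eyeball the summary table (an illustrative check):

## Inspect the polarity counts
print(Plot.Result)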
Plot the Polarity using ggplot2
library(ggplot2)
ggplot(data = Plot.Result, aes(x = `Degree of Emotion`, y = `Number of Tweets`)) +
  geom_bar(aes(fill = `Degree of Emotion`), stat = "identity", width = 0.4) +
  scale_fill_brewer(palette = "Dark2") +
  xlab("Degree of Polarity") +
  ylab("Number of Tweets") +
  geom_text(aes(label = `Number of Tweets`), vjust = -0.5, colour = "brown") +
  theme_bw() +
  theme(panel.border = element_blank(), panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"), legend.position = 'none')
Conclusion
You can also apply the popular Naive Bayes algorithm using the sentiment package. With the help of this algorithm you can detect different kinds of emotion in the trending topic. In the next discussion, we will introduce the sentiment package for sentiment analysis with a comparative word cloud.
Stay Tuned …