After weeks of intense speculation about the situation in Jammu and Kashmir, the Narendra Modi government finally revealed its cards: two days ago, Home Minister Amit Shah announced in Parliament the revocation of Article 370. Since then there has been an exponential surge in online activity, especially on Twitter, where people are sharing their views and displeasure. In this article, I demonstrate how we can use data science skills to analyze what people are posting on Twitter about this topic and how they felt about the decision. This analysis helps us understand whether people see the decision as good or bad. I will use the R programming language to perform the text mining.
Let's get started. We will analyze the sentiment of tweets containing the hashtag #article370. The code proceeds in the following parts: loading the required packages, authenticating with the Twitter API and fetching tweets, building and cleaning a corpus, computing word frequencies from a term-document matrix, plotting a word cloud, and finally scoring emotions and sentiments.
We will first install the relevant packages. To extract tweets from Twitter we need the 'twitteR' package; we will also use the 'tm' package for text mining of those tweets and the 'ROAuth' package to authenticate with Twitter's server.
## Load the required libraries
library(tm)
## Loading required package: NLP
library(ROAuth)
library(twitteR)
We will use the Twitter API to fetch data with the code below. Beforehand, I generated my API key, API secret, access token, and token secret by visiting https://dev.twitter.com/apps and following the necessary steps, which I will not repeat here. Next, we invoke the Twitter API using the app we created and the keys and access tokens we obtained through it.
api_key="XXXXXXXXXXXXXXXXXXXXXXXXX"                 ## replace with your own API key
api_secret="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" ## replace with your own API secret
access_token="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" ## replace with your own access token
access_token_secret="XXXXXXXXXXXXXXXXXXXXXXXXXXXXX" ## replace with your own token secret
setup_twitter_oauth(api_key,api_secret,access_token,access_token_secret)
## [1] "Using direct authentication"
length(tweets)
## [1] 10000
Next we will transform those tweets into a data frame, which is easier to work with, using the function twListToDF.
df=twListToDF(tweets)
We will build a corpus of those tweets using the Corpus function from the tm package.
library(tm)
library(stringr)
corpus_370=Corpus(VectorSource(df$text))
corpus_370 = tm_map(corpus_370, function(x) iconv(enc2utf8(x), sub = "byte"))
## Warning in tm_map.SimpleCorpus(corpus_370, function(x) iconv(enc2utf8(x), :
## transformation drops documents
The corpus needs to be cleaned for better analysis: we remove stop words, punctuation, and numbers, strip extra white space, and convert everything to lower case. These are elements that do not express any emotion.
corpus_370 <- tm_map(corpus_370, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(corpus_370, content_transformer(tolower)):
## transformation drops documents
corpus_370 <- tm_map(corpus_370, removePunctuation)
## Warning in tm_map.SimpleCorpus(corpus_370, removePunctuation):
## transformation drops documents
corpus_370 <- tm_map(corpus_370, removeNumbers)
## Warning in tm_map.SimpleCorpus(corpus_370, removeNumbers): transformation
## drops documents
The corpus contains the tweet text along with hashtags and URLs. We need to remove the hashtags and URLs so that we are left with only the main tweet text to run our sentiment analysis on. We will write a function for this and apply it to the corpus.
Textprocessing <- function(x) {
  ## Each gsub result must be assigned back to x; otherwise only the
  ## last substitution would take effect.
  x <- gsub("http[[:alnum:]]*", '', x)  ## Remove URLs
  x <- gsub('http\\S+\\s*', '', x)      ## Remove URLs
  x <- gsub('\\b+RT', '', x)            ## Remove RT
  x <- gsub('#\\S+', '', x)             ## Remove hashtags
  x <- gsub('@\\S+', '', x)             ## Remove mentions
  x <- gsub('[[:cntrl:]]', '', x)       ## Remove control characters
  x <- gsub("\\d", '', x)               ## Remove digits
  x <- gsub('[[:punct:]]', '', x)       ## Remove punctuation
  x <- gsub("^[[:space:]]*", "", x)     ## Remove leading whitespace
  x <- gsub("[[:space:]]*$", "", x)     ## Remove trailing whitespace
  x <- gsub(' +', ' ', x)               ## Collapse extra whitespace
  x
}
corpus_370 <- tm_map(corpus_370,Textprocessing)
## Warning in tm_map.SimpleCorpus(corpus_370, Textprocessing): transformation
## drops documents
corpus_370= tm_map(corpus_370, stripWhitespace)
## Warning in tm_map.SimpleCorpus(corpus_370, stripWhitespace): transformation
## drops documents
corpus_370 = tm_map(corpus_370,removeWords,stopwords("english"))
## Warning in tm_map.SimpleCorpus(corpus_370, removeWords,
## stopwords("english")): transformation drops documents
Now the cleaned corpus is transformed into a term-document matrix, which records the frequency of every word in the corpus. We use the function TermDocumentMatrix as shown below.
tdm_370=TermDocumentMatrix(corpus_370)
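With 10,000 tweets the resulting matrix is large and very sparse. If memory becomes a concern when converting it with as.matrix, tm's removeSparseTerms can drop very rare terms first. This is an optional step of my own; the 0.999 sparsity cutoff is an arbitrary choice, and I assign to a new name so the results below are unaffected:
## Optional: keep only terms present in at least ~0.1% of documents
tdm_small = removeSparseTerms(tdm_370, sparse = 0.999)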
We will organize terms by their frequency by converting the term-document matrix to an ordinary matrix. In a term-document matrix, row names are terms and column names are documents, so summing across each row gives the total frequency of each word.
m=as.matrix(tdm_370)
v=sort(rowSums(m),decreasing = T)
d=data.frame(word=names(v),freq=v)
head(d,20)
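As a quick cross-check on the frequency table, tm's findFreqTerms can list frequent terms directly from the term-document matrix; the threshold of 100 below is an arbitrary choice of mine:
## List all terms appearing at least 100 times (threshold is arbitrary)
findFreqTerms(tdm_370, lowfreq = 100)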
Plot word frequencies
The ten most frequent words are plotted below.
barplot(d[1:10,]$freq, las = 2, names.arg = d[1:10,]$word,
col ="lightblue", main ="Most frequent words",
ylab = "Word frequencies")
The bar graph above suggests that the most common words used in those tweets are article, followed by govt, decision, and so on.
The importance of words can also be illustrated as a word cloud, a visual representation of the words in the tweets. We will write the code to generate the word cloud.
library(wordcloud)
## Loading required package: RColorBrewer
library(RColorBrewer)
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 10,max.words=500, random.order=FALSE,
scale = c(3, 0.5), colors = rainbow(50))
The word cloud above shows that the most frequently used words in the tweets are article, Kashmir, India, decision, and so on. The different colors and sizes of the words indicate their frequency; for example, 'article' has a higher frequency than the other words, followed by decision, govt, Kashmir, and so on.
Our main aim is to analyze people's sentiments around Article 370. The analysis covers eight different emotions and two sentiments, positive and negative.
library(syuzhet)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:syuzhet':
##
## rescale
library(reshape2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:lubridate':
##
## intersect, setdiff, union
## The following objects are masked from 'package:twitteR':
##
## id, location
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Getting eight different emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust) and their corresponding valence from the NRC dictionary
The get_nrc_sentiment function from the syuzhet package compares all the tokenized words against the NRC Emotion Lexicon (EmoLex), which contains a large number of words tagged with different emotions.
s=get_nrc_sentiment(df$text)
head(s)
barplot(colSums(s),las=2,col=rainbow(10),ylab="count",main = "Sentiment scores for Article 370")
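Because the bar heights for positive and negative can be hard to compare by eye, a quick sketch (my addition) compares their overall totals as proportions using the same scores s:
## Compare total positive vs negative scores as proportions
valence = colSums(s[, c("positive", "negative")])
round(prop.table(valence), 2)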
The bar graph above visualizes the various sentiments behind the tweets. As expected, positive is the highest, followed by trust and negative. This means a large number of people think the decision to revoke Article 370 will bring positive changes. However, an almost comparable number of people think it will unnecessarily open Pandora's box.
Conclusion
My conclusion is based purely on the 10,000 tweets I pulled from the Twitter API on 7th August 2019.
Reactions from people were mostly positive: positive had the highest sentiment score, with trust and negative close behind.