After weeks of intense speculation about the situation in Jammu and Kashmir, the Narendra Modi government finally revealed its cards: two days ago, Home Minister Amit Shah announced in Parliament the revocation of Article 370. Since then there has been an exponential surge in online activity, especially on Twitter, where people are sharing their views and displeasure. In this article, I demonstrate how we can use data science skills to analyze what people are posting on Twitter about this topic and how they felt about the decision. This analysis helps us understand whether people see the decision as good or bad. I will use the R programming language to perform the text mining.
Let's get started. We will analyze the sentiment of tweets containing the hashtag #article370. The code proceeds in the following parts: loading the required packages, authenticating with the Twitter API and fetching tweets, building and cleaning a corpus, computing word frequencies from a term-document matrix, plotting a word cloud, and finally scoring emotions and sentiments.
We will first install the relevant packages. To extract tweets from Twitter we need the 'twitteR' package; we will also use the 'tm' package for text mining of those tweets and the 'ROAuth' package to authenticate with Twitter's server.
## Load the required libraries
library(tm)
## Loading required package: NLP
library(ROAuth)
library(twitteR)
We will use the Twitter API to fetch data with the code below. Beforehand, I generated my API key, API secret, access token, and token secret by visiting https://dev.twitter.com/apps and following the necessary steps, which I will not repeat here. Next, we invoke the Twitter API using the app we created and the keys and access tokens we obtained through it.
api_key="XXXXXXXXXXXXXXXXXXXXXXXXX"                 ## replace with your own API key
api_secret="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" ## replace with your own API secret
access_token="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" ## replace with your own access token
access_token_secret="XXXXXXXXXXXXXXXXXXXXXXXXXXXXX" ## replace with your own token secret
setup_twitter_oauth(api_key,api_secret,access_token,access_token_secret)
## [1] "Using direct authentication"
length(tweets)
## [1] 10000
Next we will transform those tweets into a data frame, which is easier to work with, using the function twListToDF.
df=twListToDF(tweets)
We will build a corpus of those tweets using the Corpus function from the tm package.
library(tm)
library(stringr)
corpus_370=Corpus(VectorSource(df$text))
corpus_370 = tm_map(corpus_370, function(x) iconv(enc2utf8(x), sub = "byte"))
## Warning in tm_map.SimpleCorpus(corpus_370, function(x) iconv(enc2utf8(x), :
## transformation drops documents
The corpus needs to be cleaned for better analysis: we remove stop words, punctuation, and numbers, strip extra white space, and convert everything to lower case. These are elements that do not express any emotion.
corpus_370 <- tm_map(corpus_370, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(corpus_370, content_transformer(tolower)):
## transformation drops documents
corpus_370 <- tm_map(corpus_370, removePunctuation)
## Warning in tm_map.SimpleCorpus(corpus_370, removePunctuation):
## transformation drops documents
corpus_370 <- tm_map(corpus_370, removeNumbers)
## Warning in tm_map.SimpleCorpus(corpus_370, removeNumbers): transformation
## drops documents
The corpus contains the tweet text along with hashtags and URLs. We need to remove the hashtags and URLs so that we are left with only the main tweet text to run our sentiment analysis on. We will write a function for this and apply it to the corpus.
Textprocessing <- function(x) {
  ## Each gsub result must be assigned back to x; otherwise only the
  ## last substitution would take effect.
  x <- gsub("http[[:alnum:]]*", '', x)  ## Remove URLs
  x <- gsub('http\\S+\\s*', '', x)      ## Remove URLs
  x <- gsub('\\b+RT', '', x)            ## Remove RT
  x <- gsub('#\\S+', '', x)             ## Remove hashtags
  x <- gsub('@\\S+', '', x)             ## Remove mentions
  x <- gsub('[[:cntrl:]]', '', x)       ## Remove control characters
  x <- gsub("\\d", '', x)               ## Remove digits
  x <- gsub('[[:punct:]]', '', x)       ## Remove punctuation
  x <- gsub("^[[:space:]]*", "", x)     ## Remove leading whitespace
  x <- gsub("[[:space:]]*$", "", x)     ## Remove trailing whitespace
  x <- gsub(' +', ' ', x)               ## Collapse extra whitespace
  x
}
corpus_370 <- tm_map(corpus_370,Textprocessing)
## Warning in tm_map.SimpleCorpus(corpus_370, Textprocessing): transformation
## drops documents
corpus_370= tm_map(corpus_370, stripWhitespace)
## Warning in tm_map.SimpleCorpus(corpus_370, stripWhitespace): transformation
## drops documents
corpus_370 = tm_map(corpus_370,removeWords,stopwords("english"))
## Warning in tm_map.SimpleCorpus(corpus_370, removeWords,
## stopwords("english")): transformation drops documents
Now the cleaned corpus is transformed into a term-document matrix, which records the frequency of every word in the corpus. We use the function TermDocumentMatrix as shown below.
tdm_370=TermDocumentMatrix(corpus_370)
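With 10,000 tweets the resulting matrix is large and very sparse. If memory becomes a concern when converting it with as.matrix, tm's removeSparseTerms can drop very rare terms first. This is an optional step of my own; the 0.999 sparsity cutoff is an arbitrary choice, and I assign to a new name so the results below are unaffected:
## Optional: keep only terms present in at least ~0.1% of documents
tdm_small = removeSparseTerms(tdm_370, sparse = 0.999)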
We will organize terms by their frequency by converting the term-document matrix to an ordinary matrix. In a term-document matrix, row names are terms and column names are documents, so summing across each row gives the total frequency of each word.
m=as.matrix(tdm_370)
v=sort(rowSums(m),decreasing = T)
d=data.frame(word=names(v),freq=v)
head(d,20)
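As a quick cross-check on the frequency table, tm's findFreqTerms can list frequent terms directly from the term-document matrix; the threshold of 100 below is an arbitrary choice of mine:
## List all terms appearing at least 100 times (threshold is arbitrary)
findFreqTerms(tdm_370, lowfreq = 100)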
Plot word frequencies
The ten most frequent words are plotted below.
barplot(d[1:10,]$freq, las = 2, names.arg = d[1:10,]$word,
col ="lightblue", main ="Most frequent words",
ylab = "Word frequencies")
The bar graph above suggests that the most common words used in those tweets are article, followed by govt, decision, and so on.
The importance of words can also be illustrated as a word cloud, a visual representation of the words in the tweets. We will write the code to generate the word cloud.
library(wordcloud)
## Loading required package: RColorBrewer
library(RColorBrewer)
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 10,max.words=500, random.order=FALSE,
scale = c(3, 0.5), colors = rainbow(50))
The word cloud above shows that the most frequently used words in the tweets are article, Kashmir, India, decision, and so on. The different colors and sizes of the words indicate their frequency; for example, 'article' has a higher frequency than the other words, followed by decision, govt, Kashmir, and so on.
Our main aim is to analyze people's sentiments around Article 370. The analysis covers eight different emotions and two sentiments, positive and negative.
library(syuzhet)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:syuzhet':
##
## rescale
library(reshape2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:lubridate':
##
## intersect, setdiff, union
## The following objects are masked from 'package:twitteR':
##
## id, location
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Getting eight different emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust) and their corresponding valence from the NRC dictionary
The get_nrc_sentiment function from the syuzhet package compares all the tokenized words against the NRC Emotion Lexicon (EmoLex), which contains a large number of words tagged with different emotions.
s=get_nrc_sentiment(df$text)
head(s)
barplot(colSums(s),las=2,col=rainbow(10),ylab="count",main = "Sentiment scores for Article 370")
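Because the bar heights for positive and negative can be hard to compare by eye, a quick sketch (my addition) compares their overall totals as proportions using the same scores s:
## Compare total positive vs negative scores as proportions
valence = colSums(s[, c("positive", "negative")])
round(prop.table(valence), 2)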
The bar graph above visualizes the various sentiments behind the tweets. As expected, positive is the highest, followed by trust and negative. This means a large number of people think the decision to revoke Article 370 will bring positive changes. However, an almost comparable number of people think it will unnecessarily open Pandora's box.
Conclusion
My conclusion is based purely on the 10,000 tweets I pulled from the Twitter API on 7th August 2019.
Reactions from people were mostly positive: positive had the highest sentiment score, with trust and negative close behind.