Make in India is an initiative launched by the Prime Minister in September 2014 to encourage the manufacturing sector in India and to transform the country into a global design and manufacturing hub. The hashtag #MakeinIndia has been trending over the past 2 years. The government, organisations and individuals have been tweeting announcements and opinions related to the #MakeInIndia campaign.
To know more about the Make in India campaign, visit: http://www.makeinindia.com/home
The objectives of this report:
A tweet is a message posted by a user on Twitter and is restricted to 140 characters. A tweet can contain hashtags (#), which essentially capture the topic of the tweet. Similarly, the ‘@’ symbol is used when the user addresses a message to another user or tags a user in the tweet. A tweet that a user re-shares publicly with his/her followers is known as a retweet. Retweets help us measure the popularity of a tweet.
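The components described above can be pulled out of raw tweet text with regular expressions. The following is a minimal base R sketch, not the method used in the report: the sample tweets are made up, and `extract_tokens` is a hypothetical helper name.

```r
# Toy tweets, made up for illustration (not taken from the dataset)
tweets <- c(
  "RT @makeinindia: New factory announced #MakeInIndia #Manufacturing",
  "Great news for investors @CimGOI #MakeInIndia",
  "No tags here"
)

# Hypothetical helper: return every match of a pattern, one list element per tweet
extract_tokens <- function(text, pattern) {
  regmatches(text, gregexpr(pattern, text))
}

hashtags <- extract_tokens(tweets, "#[[:alnum:]_]+")
mentions <- extract_tokens(tweets, "@[[:alnum:]_]+")

# Assumption: a leading "RT @user" marks a retweet in the exported text
is_retweet <- grepl("^RT @", tweets)

print(hashtags)
print(mentions)
print(is_retweet)
```

Tweets without any hashtag or mention yield an empty character vector, so the result always has one list element per tweet.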
The twitteR package in R provides access to the Twitter API. This package was used to establish a connection with Twitter, using the function setup_twitter_oauth(), and to extract the data. A total of 15,344 tweets were extracted, covering the period from 23rd August 2017 to 11th September 2017. The data was stored in a file called “MakeinIndia.csv”.
TwData <- read.csv("MakeinIndia.csv")
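In this project, the following packages were used for the various analyses: twitteR, tm, qdap, ggplot2, gridExtra, wordcloud, cluster, topicmodels, sentiment, data.table, streamgraph and syuzhet.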
Once the tweets are imported, the data has to be cleaned and parsed into a corpus for text analytics. Non-English words were removed from the data.
#Remove words between the character " <", " >"
TwData$text <- genX(TwData$text, " <", ">")
#Convert the data into a "Volatile" corpus
myCorpus<- VCorpus(VectorSource(TwData$text))
#Converting the text into lower characters
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
#Removing URLs from the text
removeURL <- function(x) gsub("http[^[:space:]]*", "", x) # matches both http:// and https:// links
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))
#Removing Numbers & Punctuation
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeNumPunct))
#Removing Stopwords & other common words
myStopWords<- c((stopwords('english')),c("rt", "use", "used", "via", "amp", "makeinindia"))
myCorpus<- tm_map(myCorpus,removeWords , myStopWords)
#Removing Whitespace
myCorpus<- tm_map(myCorpus, stripWhitespace)
#Mapping the whole words "indias" and "indian" to "india" (word boundaries keep the surrounding spaces intact)
myCorpus <- tm_map(myCorpus, content_transformer(gsub), pattern = "\\b(indias|indian)\\b", replacement = "india")
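To see what the cleaning steps do to an individual tweet, they can be approximated in base R without building a corpus. This is only a sketch under stated assumptions: the sample text is made up, `clean_tweet` is a hypothetical helper, and stop-word removal is left out for brevity.

```r
# Toy tweet text, made up for illustration
raw <- c(
  "RT @user: India's exports UP 12%! https://t.co/abc123 #MakeInIndia",
  "Indian   startups are ready http://example.com/x"
)

# Hypothetical helper mirroring the order of the cleaning steps
clean_tweet <- function(x) {
  x <- tolower(x)                                  # lower case
  x <- gsub("http[^[:space:]]*", "", x)            # drop URLs (http and https)
  x <- gsub("[^[:alpha:][:space:]]", "", x)        # drop digits and punctuation
  x <- gsub("\\b(indias|indian)\\b", "india", x)   # map variants to "india"
  x <- gsub("[[:space:]]+", " ", x)                # collapse runs of whitespace
  trimws(x)                                        # trim leading/trailing spaces
}

cleaned <- clean_tweet(raw)
print(cleaned)
```

Because the replacement of "indias" and "indian" is anchored on word boundaries, longer tokens such as "makeinindia" are left untouched.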
Once the data is cleaned, a TermDocumentMatrix is created to calculate the frequency of the terms that occur across the collection of tweets.
There are many ways to measure the influence and effectiveness of a hashtag on Twitter; the number of followers is not the only measure. The graph below visualises the most active users who talked about the Make in India campaign, together with the users whose tweets were retweeted the most.
#Number of tweets per user (table() works for both factor and character columns)
tweet.counts <- table(TwData$screenName)
e1 <- data.frame(username = names(tweet.counts), No.of.Tweets = as.vector(tweet.counts))
e1 <- subset(e1, No.of.Tweets > 60 & No.of.Tweets < 1000)
e1 <- ggplot(e1, aes(reorder(username, No.of.Tweets), No.of.Tweets)) + theme_minimal() + geom_bar(stat = "identity") + coord_flip() + labs(title = "Most active users", x = "Username", y = "Tweet Counts")
#Total number of retweets received per user
e2 <- aggregate(retweetCount ~ screenName, data = TwData, FUN = sum)
e2 <- subset(e2, retweetCount > 3000)
e2 <- ggplot(e2, aes(reorder(screenName, retweetCount), retweetCount)) + theme_minimal() + geom_bar(stat = "identity") + coord_flip() + labs(title = "Retweets", x = "", y = "Retweet Count")
grid.arrange(e1, e2, ncol = 2)
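The two summaries behind these charts, tweets per user and retweets per user, can be illustrated on a toy data frame. This is a sketch with made-up values; only the column names `screenName` and `retweetCount` come from the code above.

```r
# Toy data with the two columns used above (values are made up)
toy <- data.frame(
  screenName   = c("alice", "bob", "alice", "carol", "alice", "bob"),
  retweetCount = c(10, 5, 20, 0, 30, 15),
  stringsAsFactors = FALSE
)

# Tweets per user
counts <- table(toy$screenName)
tweet_counts <- data.frame(screenName = names(counts),
                           tweets = as.vector(counts),
                           stringsAsFactors = FALSE)

# Retweets received per user
retweet_totals <- aggregate(retweetCount ~ screenName, data = toy, FUN = sum)

# Combine both measures and rank by activity
activity <- merge(tweet_counts, retweet_totals, by = "screenName")
activity <- activity[order(-activity$tweets), ]
print(activity)
```

Looking at both columns side by side makes it easy to spot users who tweet a lot but are rarely retweeted, and vice versa.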
We can see that most of the tweets originated from the official Twitter handle of the Make in India campaign (@makeinindia). The second-highest number of retweets also originated from the same account. The account is followed by 3.24 million people (as of 15th September 2017) and has tweeted at least 14,300 times.
The user Sheetal Angural (@sheetalang1983), who is the General Sec. SC Morcha BJP Punjab, has played a huge role in popularising the Make in India campaign on Twitter. He accounts for a significant number of tweets and retweets about MakeinIndia. @MODIfyingBHARAT is another handle, followed by only 4,850 people; however, it tweets very frequently about Make in India and hence tops the chart. Similarly, the user Harikishan Joshi (@iHarikishanabvp), who is a State Executive Council Member at Bharat Tibbat Sahyog Manch (Rajasthan) and a member of the RSS, is also one of the top users who talked about the Make in India campaign.
Hence, we can observe that the popularity of #MakeinIndia is mainly due to tweets by the Government, the BJP or supporters of Hindu nationalism.
The tweets are analysed to understand the most frequently occurring words.
#Term-document matrix (terms as rows) and document-term matrix (tweets as rows)
tdm <- TermDocumentMatrix(myCorpus)
dtm <- DocumentTermMatrix(myCorpus)
#Terms that occur at least 500 times
freq.terms <- findFreqTerms(tdm, lowfreq = 500)
term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq, term.freq > 500)
df <- data.frame(term = names(term.freq), freq = term.freq)
ggplot(df, aes(reorder(term, freq), freq)) + theme_bw() + geom_bar(stat = "identity") + coord_flip() + labs(title = "Term Frequency Chart", x = "Terms", y = "Term Counts")
#Word cloud of up to 100 terms, most frequent terms in the centre
library(wordcloud)
word.freq <- sort(rowSums(as.matrix(tdm)), decreasing = FALSE)
pal <- brewer.pal(8, "Dark2")
wordcloud(words = names(word.freq), freq = word.freq, min.freq = 2, random.order = FALSE, colors = pal, max.words = 100)
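A term-document matrix is simply a table of counts with one row per term and one column per document. The base R sketch below builds one by hand for three made-up, already cleaned tweets; it is meant to illustrate the idea, not to replace TermDocumentMatrix().

```r
# Three made-up, already cleaned tweets
docs <- c("india ready invest",
          "india bullet train",
          "bullet train project india")

tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Rows = terms, columns = documents, cells = number of occurrences
tdm_toy <- vapply(tokens,
                  function(tk) as.integer(table(factor(tk, levels = vocab))),
                  integer(length(vocab)))
dimnames(tdm_toy) <- list(Terms = vocab, Docs = as.character(seq_along(docs)))

# Overall frequency of each term, and the terms used at least twice
term_freq <- sort(rowSums(tdm_toy), decreasing = TRUE)
frequent  <- names(term_freq)[term_freq >= 2]

print(tdm_toy)
print(term_freq)
```

Summing the rows gives the term frequencies that feed both the bar chart and the word cloud.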
After the necessary cleaning, a wordcloud was made to understand the most frequently used terms. The following observations are made:
The most common word used in the tweets was ‘India’, followed by ‘cimgoi’, the official handle of the Commerce Minister (@CimGOI). The Commerce Minister, Nirmala Sitharaman, was tagged in the tweets more often than the Prime Minister Narendra Modi and the PMO. The new Commerce Minister, Suresh Prabhu, is also gaining popularity in the tweets associated with Make in India.
Some words gain popularity because of an event or a trend and occur in the tweets over a period of time. The word usage depends upon the popularity, significance and reach of the hashtag. E.g. Twitter users tagged the #SaveJPWishtown hashtag with the Make in India campaign in order to gain attention from the government. Many homebuyers wanted the government to resolve the issue of their flats not being delivered as promised.
Another popular term is #IndiainStockholm. The aim of this campaign by MakeinIndia was to attract and create awareness about Swedish investments in India. Most tweets during the period focused on Swedish investments ahead of the two-day workshop that will be held in Stockholm, Sweden in October.
WordCorr1 <- apply_as_df(myCorpus[1:6000], word_cor, word = "cimgoi", r=.25)
p1 <- plot(WordCorr1)
WordCorr2 <- apply_as_df(myCorpus[1:6000], word_cor, word = "pmoindia", r=.25)
p2 <- plot(WordCorr2)
grid.arrange(p1, p2, ncol=2)
Some words are more strongly associated with the most frequent words than others. We must understand the difference between association and correlation. Association refers to a general relationship between two random variables, while correlation refers to a more or less linear relationship between the random variables.
The Commerce Minister’s handle is focused on attracting new investors to make in India. The word “Ready” is frequently used in these tweets. We can observe that the “Tata-Boeing” collaboration for Apache Helicopters was tweeted along with @CimGOI.
The PMO mainly tweets about announcements and news related to Make in India. We can observe that @PMOIndia is highly correlated with terms related to the upcoming bullet train project connecting Mumbai and Ahmedabad. The Prime Minister is set to lay the foundation of the project on 14th September. It is regarded as one of the initiatives under the Make in India project. Hence, we can see that Twitter users were excited about the upcoming Bullet Train project.
findAssocs(tdm, c("cimgoi", "pmoindia"), 0.2)
## $cimgoi
##           supply            ready         facility         incoming
##             0.27             0.26             0.23             0.23
## indiainstockholm          exports         register
##             0.23             0.20             0.20
##
## $pmoindia
##              abe         ceremony   groundbreaking     mumahmedabad
##             0.43             0.43             0.43             0.43
##          perform            train           bullet          project
##             0.43             0.42             0.41             0.39
##     narendramodi             ever      bureaucrats          discuss
##             0.35             0.32             0.27             0.27
##      indiavision     indiabiggest           reform          unified
##             0.27             0.24             0.24             0.24
##      inaugurates        swedishpm              tbt        alongside
##             0.23             0.23             0.23             0.22
##          cheated   savejpwishtown      surbhimadan              tax
##             0.22             0.22             0.22             0.22
##          visited             asks            chief         gemindia
##             0.22             0.21             0.21             0.21
##            meets  myogiadityanath      secretaries             pick
##             0.21             0.21             0.21             0.20
##      procurement
##             0.20
The association table above confirms the same. The words most frequently associated with @CimGOI include “Ready”, “Supply”, “Facility”, “Incoming” etc. In the case of @PMOIndia, most words are related to meetings, announcements and visits associated with Make in India.
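The scores in the table can be read as correlations between the occurrence pattern of one term and that of every other term across the tweets. The sketch below reproduces that idea in base R on a made-up matrix, assuming a Pearson correlation; `assoc_with` is a hypothetical helper, not the implementation of findAssocs().

```r
# Toy term-document matrix: rows = terms, columns = tweets (values are made up)
tdm_toy <- rbind(
  pmoindia = c(1, 1, 0, 0, 1, 0),
  bullet   = c(1, 1, 0, 0, 0, 0),
  train    = c(1, 1, 0, 0, 1, 0),
  exports  = c(0, 0, 1, 1, 0, 1)
)

# Hypothetical helper: terms whose correlation with `term` is at least `corlimit`
assoc_with <- function(m, term, corlimit) {
  r <- cor(t(m))[term, ]          # Pearson correlation of `term` with every term
  r <- r[names(r) != term]        # drop the term itself
  sort(round(r[r >= corlimit], 2), decreasing = TRUE)
}

assoc <- assoc_with(tdm_toy, "pmoindia", 0.2)
print(assoc)
```

In this toy example "train" always appears together with "pmoindia", "bullet" appears with it in two of three tweets, and "exports" never does, so it falls below the threshold.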
Since there are a large number of similar tweets getting generated with #MakeinIndia, it becomes challenging to make meaningful interpretations from the huge volumes of data that need to be processed.
We try to group terms that occur in similar tweets into clusters, so that related themes emerge. Hierarchical clustering attempts to build different levels of clusters. The R function hclust() was used to perform hierarchical clustering. It uses the agglomerative method.
To perform this operation, the corpus was converted into a document-term matrix with one row per tweet and one column per term. Extremely sparse terms, i.e. terms that appear in fewer than about 3% of the tweets, were removed, and the matrix was transposed so that distances are computed between terms. Ward’s method for hierarchical clustering was used. The results of the hierarchical clustering are presented below in a dendrogram.
library(cluster)
d1 <- dist(t(removeSparseTerms(dtm, 0.967)), method = "euclidean")
fit <- hclust(d1, method = "ward.D") # for a different look try substituting: method="ward.D2"
plot(fit, hang=-1)
groups <- cutree(fit, k=9) # "k=" defines the number of clusters you are using
rect.hclust(fit, k=9, border="red")
The height of each node in the plot is proportional to the value of the intergroup dissimilarity between its two child clusters. We can observe that Rahul Gandhi’s comments on the failure of Make in India and Chinese goods are clustered together.
Another cluster revolves around #savejpwishtown, urging the Prime Minister to take action. We can also observe that the #IndiainStockholm campaign forms a cluster of its own. Finally, there is a cluster around the PMO and Narendra Modi.
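The clustering steps can be tried on a small matrix using only the stats functions that ship with R. This is a sketch on made-up data: six terms and six tweets, constructed so that two themes are clearly separated.

```r
# Toy matrix: rows = terms, columns = tweets (1 = term occurs in the tweet)
m <- rbind(
  bullet    = c(1, 1, 1, 0, 0, 0),
  train     = c(1, 1, 0, 0, 0, 0),
  project   = c(1, 0, 1, 0, 0, 0),
  invest    = c(0, 0, 0, 1, 1, 1),
  sweden    = c(0, 0, 0, 1, 1, 0),
  stockholm = c(0, 0, 0, 1, 0, 1)
)

d      <- dist(m, method = "euclidean")   # pairwise distances between terms
fit    <- hclust(d, method = "ward.D")    # agglomerative clustering, Ward's method
groups <- cutree(fit, k = 2)              # cut the tree into two clusters

print(groups)
# plot(fit, hang = -1); rect.hclust(fit, k = 2, border = "red")  # optional dendrogram
```

Terms that occur in the same tweets end up in the same cluster, which is how the themes above (bullet train, #savejpwishtown, #IndiainStockholm) emerge from the real data.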
Another technique that is employed to deduce the themes of a text is topic modelling. Several algorithms exist for topic modelling, but the most commonly used is Latent Dirichlet Allocation (LDA). It allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.
In this case, 6 topics were derived from the tweets. The following are the topics and the words associated:
rowTotals <- apply(dtm , 1, sum)
NullDocs <- dtm[rowTotals==0, ]
dtm <- dtm[rowTotals> 0, ]
if (length(NullDocs$dimnames$Docs) > 0) {
TwData <- TwData[-as.numeric(NullDocs$dimnames$Docs),]
}
lda <- LDA(dtm, k = 6) # find 6 topics
term <- terms(lda, 7) # first 7 terms of every topic
term <- apply(term, MARGIN = 2, paste, collapse = ", ")
topics<- topics(lda)
topics<- data.frame(date=as.Date(TwData$created, format="%d-%m-%Y"), topic = topics)
qplot (date, ..count.., data=topics, geom ="density", fill= term[topic], alpha=I(0.9)) + theme(legend.position="bottom") + scale_fill_manual(values=c("#D50000", "#3F51B5", "#8BC34A", "#FFC107", "#607D8B", "#00BCD4"), name="Topics", labels=names(term))
From the above graph, we can observe that tweets related to PM Narendra Modi are consistently trending. We can also observe that Rahul Gandhi’s comments about Make in India created hype. Nirmala Sitharaman’s popularity in the campaign decreased as soon as she left office, while Suresh Prabhu managed to create hype. #IndiainSweden is slowly gaining popularity in the tweets about Make in India.
A sentiment analysis is performed to determine whether the tweets were positive, negative or neutral. It is also known as opinion mining, as it derives the opinion or attitude of the user about the Make in India campaign.
library(sentiment)
sentiments <- sentiment(TwData$text)
table(sentiments$polarity)
##
## negative  neutral positive
##      199    10057     4967
sentiments$score <- 0
sentiments$score[sentiments$polarity == "positive"] <- 1
sentiments$score[sentiments$polarity == "negative"] <- -1
sentiments$date <- as.IDate(TwData$created, format="%d-%m-%Y")
result <- aggregate(score ~ date, data = sentiments, sum)
plot(result, type = "l")
From the above graph, we can observe that most of the tweets were positive in nature. There was a dip on September 3, 2017. This could be because of the transitory period between Nirmala Sitharaman and Suresh Prabhu, who took charge as Minister of Commerce & Industry on 4th September 2017. Some users were also unhappy with Suresh Prabhu taking charge as Commerce Minister. The streamgraph below visualises the same:
Data<-data.frame(sentiments$polarity)
colnames(Data)[1] <- "polarity"
Data$Date <- as.IDate(TwData$created, format="%d-%m-%Y")
Data$text <- NULL
Data$Count <- 1
graphdata <- aggregate(Count ~ polarity + Date,data=Data,FUN=length)
colnames(graphdata)[2] <- "Date"
library(streamgraph)
graphdata %>%
streamgraph(polarity, Count, Date) %>%
sg_axis_x(20) %>%
sg_axis_x(1, "Date", "%d %b") %>%
sg_legend(show=TRUE, label="Polarity: ")
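The daily net sentiment score used for the line chart earlier in this section is just the number of positive tweets minus the number of negative tweets on each day. A minimal base R sketch on made-up data (the dates and labels below are illustrative only):

```r
# Toy polarity labels per tweet (made up)
toy <- data.frame(
  date     = as.Date(c("2017-09-02", "2017-09-02", "2017-09-03",
                       "2017-09-03", "2017-09-03", "2017-09-04")),
  polarity = c("positive", "neutral", "negative",
               "negative", "positive", "positive"),
  stringsAsFactors = FALSE
)

# Map each label to a score: +1, 0 or -1
score_map <- c(positive = 1, neutral = 0, negative = -1)
toy$score <- unname(score_map[toy$polarity])

# Net score per day, and the tweet count per polarity and day
daily  <- aggregate(score ~ date, data = toy, FUN = sum)
counts <- table(toy$date, toy$polarity)

print(daily)
print(counts)
```

A day with a negative net score, like 3rd September in this toy example, shows up as a dip in the line chart, while the count table corresponds to the layers of the streamgraph.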
The syuzhet package offers a few different methods, each taking a different approach to sentiment scoring. In this case, emotion scoring is done based on the NRC lexicon. The bar graph below shows the total score for each emotion.
#Sentiment-2
library('syuzhet')
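#Score every tweet against the NRC lexicon (one row per tweet, one column per emotion)
d <- get_nrc_sentiment(as.character(TwData$text))
#Total score per emotion over all tweets
td_new <- data.frame(sentiment = names(d), count = colSums(d))
rownames(td_new) <- NULL
#Keep the eight emotions; the last two rows are the overall negative and positive scores
td_new2 <- td_new[1:8, ]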
qplot(sentiment, data=td_new2, weight=count, geom="bar",fill=sentiment)+ggtitle("Twitter sentiments")
We can observe that anticipation and trust score the highest. People believe that the Make in India campaign will help transform the nation. They trust the government to deliver on the promises it has laid out. There is also a sense of joy and surprise when a new announcement is made. There is also fear among users, especially when comments about JPWishTown and China were made by others.
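Lexicon-based emotion scoring of this kind boils down to looking up each word of a tweet in a word-to-emotion table and counting the hits. The sketch below illustrates the principle in base R; the mini-lexicon is hypothetical and is not the real NRC lexicon, and `score_emotions` is an illustrative helper rather than the syuzhet implementation.

```r
# Hypothetical mini-lexicon (NOT the real NRC entries)
lexicon <- data.frame(
  word    = c("promise", "deliver", "launch", "cheated", "delay"),
  emotion = c("trust", "trust", "anticipation", "anger", "fear"),
  stringsAsFactors = FALSE
)

# Made-up tweets
tweets <- c("govt will deliver on its promise",
            "buyers cheated by delay",
            "launch next week")

# Hypothetical helper: one row per tweet, one column per emotion
score_emotions <- function(text, lexicon) {
  emotions <- sort(unique(lexicon$emotion))
  scores <- t(vapply(text, function(tw) {
    words <- strsplit(tolower(tw), "[^a-z]+")[[1]]
    hits  <- lexicon$emotion[match(words, lexicon$word)]  # NA if word not in lexicon
    as.integer(table(factor(hits, levels = emotions)))
  }, integer(length(emotions)), USE.NAMES = FALSE))
  colnames(scores) <- emotions
  scores
}

scores <- score_emotions(tweets, lexicon)
totals <- colSums(scores)   # total score per emotion over all tweets
print(totals)
```

The column totals play the same role as the bar heights in the chart above: they show which emotions are triggered most often across the whole set of tweets.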
Overall, the Make in India campaign has a good brand perception. Most of the feedback associated with #MakeinIndia is positive. However, the campaign is yet to reach people who do not already follow it. Currently, most promoters are in some way related to the government or the BJP. To be effective, the government should try to get corporates to also promote the #MakeinIndia hashtag on their pages. There is a high level of trust associated with the government. People anticipate that the initiative will help transform the nation.
The study considers only 15,344 tweets out of the whole set of tweets that would have been sent on the subject. The observations were made over a limited period (23rd August 2017 to 11th September 2017). The study also did not consider captions of pictures, news reports and other social media reports, which could have generated additional insights. There exist other topic models and black box techniques for similar analysis that have a record of better performance. These have not been used as they are beyond the scope of this exercise.
This project was prepared by Bharat S Raj (DM18118), PGDM Student (Class of 2018) of Great Lakes Institute of Management, Chennai as part of the Web & Social Media Analytics course. Thanks to Prof. Tushar Sharma, Founder, Voksedigital Consultancy Services for the opportunity to work on the project.