Project Description:

Describe in one sentence what you aim to examine using user-generated text data and sentiment analysis: In this project I aim to analyse the sentiment expressed on Reddit towards Narendra Modi, the current Prime Minister of India, over the last 12 months.

1. Loading the required packages

packages <- c("RedditExtractoR", "anytime", "magrittr", "httr", "tidytext", "tidyverse", "igraph", 
              "ggraph", "wordcloud2", "textdata", "sf", "tmap", "here", "ggdark", "syuzhet", 
              "sentimentr", "lubridate","stringi")

# Load packages
invisible(lapply(packages, library, character.only = TRUE))

2. Searching Reddit threads using a keyword of my choice

# using both subreddit and keyword
reddit_threads <- find_thread_urls(keywords= "Narendra Modi", 
                              subreddit = "India", 
                              sort_by = 'relevance', 
                              period = 'year') %>% 
  drop_na()
## parsing URLs on page 1...
## parsing URLs on page 2...
rownames(reddit_threads) <- NULL

head(reddit_threads, 5) %>% knitr::kable()
| date_utc | timestamp | title | text | subreddit | comments | url |
|---|---|---|---|---|---|---|
| 2024-05-09 | 1715255486 | Justices MB Lokur, AP Shah & N Ram Invite PM Narendra Modi & Rahul Gandhi To Public Debate On Lok Sabha Elections | | india | 4 | https://www.reddit.com/r/india/comments/1cnvd4c/justices_mb_lokur_ap_shah_n_ram_invite_pm/ |
| 2024-01-12 | 1705078116 | Line of Actual Control (LAC) \| No-intrusion mystery deepens: Army chief General Manoj Pande’s ‘first aim’ of restoring China status quo contradicts Prime Minister Narendra Modi’s border claim | | india | 17 | https://www.reddit.com/r/india/comments/194zljd/line_of_actual_control_lac_nointrusion_mystery/ |
| 2024-08-02 | 1722593428 | PM Narendra Modi has provided swimming facilities for all MP’s in the new Parliament House that he built at a cost of 22000 crore of tax payers hard earned money. | | india | 1 | https://www.reddit.com/r/india/comments/1ei6d4m/pm_narendra_modi_has_provided_swimming_facilities/ |
| 2024-07-26 | 1722026914 | Editorial With Sujit Nair \| Henley Report ranks Indian Passport 82nd \| Narendra Modi \| Jaishankar | | india | 1 | https://www.reddit.com/r/india/comments/1ecyjfo/editorial_with_sujit_nair_henley_report_ranks/ |
| 2024-05-31 | 1717146764 | Narendra Modis quest to reshape India for a thousand years | | india | 7 | https://www.reddit.com/r/india/comments/1d4rbns/narendra_modis_quest_to_reshape_india_for_a/ |

3. Clean your text data and then tokenize it

# Load the stop-word lexicon shipped with tidytext
data("stop_words")

# Pattern for URLs and leftover HTML entities (&amp;, &lt;, &gt;)
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"

# Clean the text first, then tokenize it into single words
reddit_words <- reddit_threads %>% 
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  unnest_tokens(word, text, token = "words") %>% 
  anti_join(stop_words, by = "word") %>%   # drop stop words
  filter(str_detect(word, "[a-z]"))        # keep tokens containing at least one letter

# Plot the 20 most frequent words (ties at the cut-off are kept)
reddit_words %>%
  count(word, sort = TRUE) %>%
  top_n(20, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "words",
       y = "counts",
       title = "Top 20 most frequent words")
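
The URL part of `replace_reg` only matches letters, digits, `/` and `.`, so it stops at the first other character (for example `_` or `-`). This can leave URL fragments in the text, which are then tokenized like ordinary words. The base-R sketch below illustrates that behaviour. The example string and the stricter pattern `strict_reg` are illustrative assumptions, not part of the original analysis.

```r
# Same pattern as replace_reg in the cleaning step
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"
# Hypothetical stricter alternative: consume everything up to the next whitespace
strict_reg <- "http[s]?://\\S+|&amp;|&lt;|&gt;"

# Made-up example text containing a Reddit-style URL with underscores
example <- "Read https://www.reddit.com/r/india/comments/abc123/some_post_title/ &amp; comment"

# perl = TRUE so that \d and \S are interpreted as regex character classes
cleaned_original <- gsub(replace_reg, "", example, perl = TRUE)
cleaned_strict   <- gsub(strict_reg, "", example, perl = TRUE)

print(cleaned_original)  # part of the URL slug survives
print(cleaned_strict)    # the whole URL is removed
```

Switching to the stricter pattern would change the word counts reported in this section, so it is shown only as a possible refinement.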

4. Generate a word cloud

reddit_words %>% 
  count(word, sort = TRUE) %>% 
  wordcloud2()

On visual inspection, the word cloud appears to contain the terms one would expect for this keyword and subreddit, including political terms such as elections, policy and the names of opposition members. Another interesting insight is the frequent recurrence of terms related to policies and events that drew criticism, such as the farmers' protest, the Citizenship Amendment Act (CAA) and the recent violence in Manipur.

5. Conduct a tri-gram analysis

Tri-gram analysis

words_ngram <- reddit_threads %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  select(text) %>%
  unnest_tokens(output = paired_words,
                input = text,
                token = "ngrams",
                n = 3)

#separate the words into 3 columns
words_ngram_trio <- words_ngram %>%
  separate(paired_words, c("word1", "word2", "word3"), sep = " ")

# drop tri-grams in which any of the three words is a stop word
words_ngram_trio_filtered <- words_ngram_trio %>%
  filter(!word1 %in% stop_words$word & !word2 %in% stop_words$word & 
           !word3 %in% stop_words$word) %>%
  # keep only tri-grams in which every word contains at least one letter
  filter(str_detect(word1, "[a-z]") & str_detect(word2, "[a-z]") &
           str_detect(word3, "[a-z]"))
# keep only tri-grams whose three words are all ASCII-encoded
words_ngram_trio_filtered %<>% 
  filter(stri_enc_isascii(word1) & stri_enc_isascii(word2) &
           stri_enc_isascii(word3))
# count each tri-gram and sort by frequency
words_counts <- words_ngram_trio_filtered %>%
  count(word1, word2, word3) %>%
  arrange(desc(n))

head(words_counts, 20) %>% knitr::kable()
| word1 | word2 | word3 | n |
|---|---|---|---|
| prime | minister | narendra | 19 |
| minister | narendra | modi | 18 |
| uniform | civil | code | 11 |
| pm | narendra | modi | 10 |
| anti | caa | movement | 7 |
| citizenship | amendment | act | 7 |
| ajai | shukla | disengagement | 4 |
| aligarh | muslim | university | 4 |
| citizenship | amendment | bill | 4 |
| cm | biren | singh | 4 |
| defence | minister | rajnath | 4 |
| devangana | kalita | natasha | 4 |
| digital | content | creators | 4 |
| disengagement | china | india | 4 |
| home | minister | amit | 4 |
| kalita | natasha | narwal | 4 |
| minister | amit | shah | 4 |
| minister | rajnath | singh | 4 |
| narendra | modi | bjp | 4 |
| narwal | safoora | zargar | 4 |
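
Several of the top tri-grams overlap, for example "prime minister narendra" and "minister narendra modi". This is because n-gram tokenization slides a window over the text one word at a time, so a single four-word phrase produces two tri-grams. The sketch below shows the idea in base R. The helper `make_ngrams` and its simplified word splitting are assumptions for illustration and do not reproduce tidytext's tokenizer exactly.

```r
# Hypothetical helper: build n-grams with a sliding window over the words of a text
make_ngrams <- function(text, n = 3) {
  # simplified tokenizer: lower-case, then split on anything that is not a letter, digit or apostrophe
  words <- strsplit(tolower(text), "[^a-z0-9']+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

trigrams <- make_ngrams("Prime Minister Narendra Modi spoke")
print(trigrams)
```

The near-identical counts of the top two tri-grams (19 and 18) are therefore largely two views of the same phrase rather than two separate findings.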

The tri-gram analysis does not provide detailed insight into the sentiment of users posting on Reddit, because the most frequent tri-grams are names, titles and policy names. I therefore proceed with a bi-gram analysis.

Bi-gram analysis

# Perform bi-gram analysis as well:
words_ngram2 <- reddit_threads %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  select(text) %>%
  unnest_tokens(output = paired_words,
                input = text,
                token = "ngrams",
                n = 2)

#separate the words into 2 columns
words_ngram_pair <- words_ngram2 %>%
  separate(paired_words, c("word1", "word2"), sep = " ")

# drop bi-grams in which either word is a stop word
words_ngram_pair_filtered <- words_ngram_pair %>%
  filter(!word1 %in% stop_words$word & !word2 %in% stop_words$word) %>% 
  # keep only bi-grams in which both words contain at least one letter
  filter(str_detect(word1, "[a-z]") & str_detect(word2, "[a-z]"))
# keep only bi-grams whose two words are both ASCII-encoded
words_ngram_pair_filtered %<>% 
  filter(stri_enc_isascii(word1) & stri_enc_isascii(word2))
# count each bi-gram and sort by frequency
words_counts2 <- words_ngram_pair_filtered %>%
  count(word1, word2) %>%
  arrange(desc(n))

head(words_counts2,20) %>% knitr::kable()
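| word1 | word2 | n |
|---|---|---|
| narendra | modi | 80 |
| prime | minister | 34 |
| pm | modi | 25 |
| electoral | bonds | 24 |
| modi | government | 23 |
| minister | narendra | 19 |
| amit | shah | 17 |
| rahul | gandhi | 16 |
| legal | guarantee | 15 |
| shaheen | bagh | 15 |
| citizenship | amendment | 11 |
| civil | code | 11 |
| content | creators | 11 |
| pm | narendra | 11 |
| uniform | civil | 11 |
| lok | sabha | 10 |
| supreme | court | 10 |
| caa | nrc | 9 |
| yogendra | yadav | 9 |
| biren | singh | 8 |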

The bi-gram analysis does not reveal any concrete sentiment towards the Prime Minister either, as the most frequent bi-grams are the names of politicians from the ruling or opposition parties. However, bi-grams referring to the CAA and NRC, which were among the most criticised policies implemented by the ruling party, recur frequently ("citizenship amendment", "caa nrc"). Although these policies were introduced some time ago, they still appear to attract strong sentiment.

6. Conducting a sentiment analysis

sentiment_reddit <- sentiment(reddit_threads$text) %>%
  arrange(desc(sentiment)) 
                    
head(sentiment_reddit, 10) %>% 
  knitr::kable()
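| element_id | sentence_id | word_count | sentiment |
|---|---|---|---|
| 31 | 2 | 5 | 1.7609035 |
| 46 | 15 | 4 | 0.8750000 |
| 79 | 23 | 26 | 0.8482023 |
| 60 | 13 | 15 | 0.8262364 |
| 77 | 23 | 1 | 0.8000000 |
| 53 | 7 | 32 | 0.7954951 |
| 74 | 51 | 3 | 0.7794229 |
| 56 | 7 | 13 | 0.7349778 |
| 56 | 25 | 16 | 0.7250000 |
| 66 | 11 | 19 | 0.7226596 |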

7. Display 10 sample texts alongside their sentiment scores and evaluate the credibility of the sentiment analysis outcomes.

# set a seed so that the random selection of 10 posts is reproducible
set.seed(1234)
# keep posts with non-empty text that does not contain a URL
reddit_sentiment_filtered <- reddit_threads %>%
  filter(nzchar(text) & !grepl("http[s]?://", text)) %>%
  unnest(text)
random_indices <- sample(nrow(reddit_sentiment_filtered), size = 10)
ten_reddit_text_samples <- reddit_sentiment_filtered[random_indices, ]
# sentiment() returns one score per sentence; [1] keeps only the score of the
# first sentence of each post, so sentiment_score is a first-sentence score
ten_reddit_text_samples$sentiment_score <- sapply(ten_reddit_text_samples$text, function(text) {
  sentiment(text)$sentiment[1]
})
# selecting columns and sorting by sentiment for display
columns <- c("date_utc", "title", "text", "subreddit", "sentiment_score")
ten_reddit_text_samples_final <- ten_reddit_text_samples[, columns] %>%
  arrange(desc(sentiment_score))
ten_reddit_text_samples_final$title <- strtrim(ten_reddit_text_samples_final$title, 50)
ten_reddit_text_samples_final$text <- strtrim(ten_reddit_text_samples_final$text, 50)
head(ten_reddit_text_samples_final, 10) %>% 
  knitr::kable()
| date_utc | title | text | subreddit | sentiment_score |
|---|---|---|---|---|
| 2024-03-05 | First Lalu, now A Raja: INDIA bloc keeps lobbing f | Seizing on Rashtriya Janata Dal (RJD) leader Lalu | india | 0.0575396 |
| 2024-05-08 | Rahul Gandhi And Fighting A Political Battle On On | On 22 January 2024, as Prime Minister set out for | india | 0.0456435 |
| 2024-04-22 | Anshuman Sail Nehru (@AnshumanSail) on X | Today, Narendra Modi has shamelessly lied about Dr | india | 0.0335410 |
| 2024-05-19 | How Narendra Modi made himself unbeatable I DW New | Not BBC, DW. I guess they know when they see one. | india | 0.0000000 |
| 2024-11-25 | Did Virat Kohli start the beard trend among men in | Ive been observing that these days, almost 8 out o | india | 0.0000000 |
| 2024-02-19 | Point out the shortcomings of the ruling party in | I’ll go first. I’m from West Bengal and honestly, | india | 0.0000000 |
| 2024-03-24 | What is the right kind of Protests? | A while ago, PM Narendra Modi in Parliament mocked | india | -0.0250000 |
| 2024-04-19 | BBC interview of Narendra Modi 2002 post Godhra ri | Its clear he doesnt make the same mistake twice. | india | -0.0753778 |
| 2024-02-19 | If youre Indian National Congress President, what | When Rahul Gandhi was President of INC during the | india | -0.1039230 |
| 2024-03-19 | If Somebody received a message like this,Please bl | Send a screenshot with a message showing that you | india | -0.1109400 |

Sentiment analysis of 10 samples: As is often the case in politics, the sampled posts about Prime Minister Modi show a mix of positive and negative scores, reflecting the range of opinions expressed by users. All ten scores fall in a narrow band between roughly -0.11 and 0.06, so even the most positive and most negative samples are close to neutral. Posts supportive of Modi tend to receive positive scores and critical posts negative ones, but there are exceptions.

On closer review, sentimentr classifies most of the sampled posts in line with their apparent tone. One exception is the third post: its text states that Modi "has shamelessly lied", which is clearly critical, yet it receives a positive score (0.034). Another post, about Virat Kohli and beard trends, is unrelated to the keyword and receives a neutral score of 0.0. Note also that each score is computed from the first sentence of the post only, so it may not reflect the tone of the full text.

While sentimentr's performance appears reasonable at face value, this evaluation rests on ten truncated samples. A more robust assessment of its accuracy would require further analysis of the full text.
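
Because `sentiment()` returns one row per sentence, taking element `[1]` scores a post by its first sentence only. The sketch below contrasts that with a plain mean over all sentences of a post. The data frame is hypothetical: it only mimics the `element_id`, `sentence_id` and `sentiment` columns shown in section 6, and the numbers are made up.

```r
# Hypothetical sentence-level scores for two posts, laid out like sentiment() output
sentence_scores <- data.frame(
  element_id  = c(1, 1, 1, 2, 2),
  sentence_id = c(1, 2, 3, 1, 2),
  sentiment   = c(0.50, -0.25, 0.00, 0.25, -0.75)
)

# Score as computed in the sampling code: first sentence of each post only
first_sentence <- sentence_scores[sentence_scores$sentence_id == 1,
                                  c("element_id", "sentiment")]

# Alternative: plain mean over all sentences of each post
post_mean <- aggregate(sentiment ~ element_id, data = sentence_scores, FUN = mean)

# Put both scores side by side, one row per post
comparison <- merge(first_sentence, post_mean, by = "element_id",
                    suffixes = c("_first", "_mean"))
print(comparison)
```

In this toy example the second post looks positive by its first sentence but negative on average. sentimentr also provides `sentiment_by()` for post-level aggregation; its default averaging is not a plain mean, so its results would differ from this sketch.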

8. Discuss intriguing insights derived from the sentiment analysis, supporting your observations with at least two plots.

# re-run the sentiment scoring used for the 10 samples on all posts for plotting
reddit_sentiment_plots <- reddit_threads %>%
  filter(nzchar(text) & !grepl("http[s]?://", text)) %>%
  unnest(text) %>%
  drop_na()
# as in section 7, [1] keeps only the first sentence's score for each post
reddit_sentiment_plots$sentiment_score <-
  sapply(reddit_sentiment_plots$text, function(text) {
    sentiment(text)$sentiment[1]
  })
columns <- c("date_utc", "title", "text", "subreddit", "sentiment_score")
reddit_sentiment_plots <- reddit_sentiment_plots[, columns] %>%
  arrange(desc(sentiment_score))
reddit_sentiment_plots$title <- strtrim(reddit_sentiment_plots$title, 50)
reddit_sentiment_plots$text <- strtrim(reddit_sentiment_plots$text, 50)
# turn date_utc into Day of the Week for plotting
reddit_sentiment_plots$DoW <- wday(reddit_sentiment_plots$date_utc, label = TRUE, abbr = FALSE)
reddit_sentiment_plots <- reddit_sentiment_plots %>% select(date_utc, DoW, title, text,
                                                            subreddit, sentiment_score)
# density plot to show distribution of scores
ggplot(reddit_sentiment_plots, aes(x = sentiment_score)) +
  geom_density(fill = "cyan", alpha = 0.9) +
  labs(title = "Distribution of Sentiment", x = "Sentiment_score", y = "Density") +
  theme_dark()

The graph displays the density distribution of sentiment scores, with values concentrated around 0. This suggests that the majority of posts are neutral or balance positive and negative wording. The curve is slightly skewed to the right, indicating a minor prevalence of positive over negative sentiment. The density tapers off towards the extremes, with few posts exhibiting strongly negative or strongly positive sentiment. This is consistent with the view that, despite being a right-wing leader, the Prime Minister also has the agreement of many moderates, as suggested by his coalition winning a majority in the recently concluded elections in a nation whose citizens have historically voted for moderate political parties.
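
A visual reading of skew can be checked numerically by comparing the shares of negative, neutral and positive scores with the mean and median. The sketch below does this for the ten sample scores from section 7. On the full data, `reddit_sentiment_plots$sentiment_score` would take their place. The helper name `sentiment_shares` is an assumption for illustration.

```r
# Hypothetical helper: share of negative, neutral (exactly 0) and positive scores
sentiment_shares <- function(scores) {
  labels <- factor(sign(scores), levels = c(-1, 0, 1),
                   labels = c("negative", "neutral", "positive"))
  prop.table(table(labels))
}

# The ten sample scores shown in the table in section 7
sample_scores <- c(0.0575396, 0.0456435, 0.0335410, 0, 0, 0,
                   -0.0250000, -0.0753778, -0.1039230, -0.1109400)

shares <- sentiment_shares(sample_scores)
print(shares)
print(mean(sample_scores))
print(median(sample_scores))
```

In this ten-post sample the mean is slightly negative while the median is exactly 0. A sample of this size says little about the full distribution, which is why the same check on all posts would be the more informative one.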

#day-of-week bar graph
reddit_sentiment_plots %>% 
  ggplot(aes(x = DoW)) +
  geom_bar(fill = 'lightblue') +
  theme_classic()

The bar chart shows the number of posts for each day of the week. There is no clear pattern: Tuesday and Wednesday have slightly higher counts, but the variation is not substantial enough to suggest that posting activity depends on the day of the week.
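
Whether the weekday counts differ by more than chance can be checked with a chi-squared goodness-of-fit test against equal counts on every day. The sketch below uses made-up counts, since the actual counts behind the chart are not reproduced in this report.

```r
# Hypothetical post counts per weekday (NOT the counts from the bar chart)
dow_counts <- c(Sunday = 9, Monday = 11, Tuesday = 15, Wednesday = 16,
                Thursday = 12, Friday = 11, Saturday = 10)

# Goodness-of-fit test: null hypothesis is the same expected count on every day
dow_test <- chisq.test(dow_counts)

print(dow_test$statistic)
print(dow_test$p.value)
```

On the real data the input would be `table(reddit_sentiment_plots$DoW)`. A large p-value would support the reading that posting volume does not depend on the weekday.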

# violin plot of sentiment scores by day of the week
ggplot(reddit_sentiment_plots, aes(x = DoW, y = sentiment_score)) +
  geom_violin(fill = "black", color = "white") +
  labs(title = "Sentiment by Day of the Week", x = "Day of the week", y = "Sentiment score") +
  theme_dark()

The violin plot illustrates the distribution of sentiment scores towards Prime Minister Modi over the last 12 months, segmented by day of the week. The distribution is relatively consistent across all days, indicating that sentiment, whether positive, neutral or negative, does not vary noticeably with the day of the week. Most sentiment scores are centered around zero, reflecting neutral or balanced opinions. However, the noticeable tails in both directions indicate that, while most posts remain neutral, some express strong support or strong criticism. This may be related to the Prime Minister's affiliation with a right-wing party, as politicians from right-wing parties often evoke more polarized opinions than those with moderate ideologies. The absence of a weekday effect is consistent with the idea that sentiment towards PM Modi is shaped more by events and public discourse than by day-specific patterns.
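
If sentiment is driven by events rather than weekdays, grouping scores by calendar month is a more direct way to look for it. The sketch below computes the mean score and the number of posts per month for the ten sample rows from section 7. On the full data, `reddit_sentiment_plots` would take the place of the hypothetical `samples` data frame.

```r
# The ten sample posts from section 7: post date and first-sentence sentiment score
samples <- data.frame(
  date_utc = as.Date(c("2024-03-05", "2024-05-08", "2024-04-22", "2024-05-19",
                       "2024-11-25", "2024-02-19", "2024-03-24", "2024-04-19",
                       "2024-02-19", "2024-03-19")),
  sentiment_score = c(0.0575396, 0.0456435, 0.0335410, 0, 0, 0,
                      -0.0250000, -0.0753778, -0.1039230, -0.1109400)
)

# Derive a year-month label and average the scores within each month
samples$month <- format(samples$date_utc, "%Y-%m")
monthly <- aggregate(sentiment_score ~ month, data = samples, FUN = mean)

# Add the number of posts per month so that thin months are easy to spot
monthly$n_posts <- as.vector(table(samples$month)[monthly$month])
print(monthly)
```

With only one to three posts per month, these sample means are not meaningful on their own. The point of the sketch is the grouping step, which could then be compared against the dates of known events.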

9. Conclusion

The sentiment analysis of Reddit posts about Prime Minister Narendra Modi over the past 12 months reveals a predominantly neutral sentiment, with occasional posts expressing strong support or strong criticism. These extremes reflect the polarization often associated with right-wing leaders, as Modi's policies and leadership style evoke diverse reactions. Even so, a slight skew towards positive sentiment is consistent with his ability to appeal to a broad voter base, as suggested by his recent electoral success in a nation that has historically favoured moderate parties. The analysis suggests that public sentiment is shaped more by specific events and discourse than by regular patterns such as the day of the week.

The results from sentimentr, although not entirely precise, provided useful insights for analysis. The algorithm's difficulty in accurately assessing some posts highlights the complexity of human emotion and expression.