Major Assignment 3

Project Description:

Describe in one sentence what you aim to examine using user-generated text data and sentiment analysis: In this project I aim to analyse the sentiment towards Narendra Modi, the current Prime Minister of India, over the last 12 months.

1. Loading the required packages

packages <- c("RedditExtractoR", "anytime", "magrittr", "httr", "tidytext", "tidyverse", "igraph", 
              "ggraph", "wordcloud2", "textdata", "sf", "tmap", "here", "ggdark", "syuzhet", 
              "sentimentr", "lubridate","stringi")

# Load packages
invisible(lapply(packages, library, character.only = TRUE))
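
The loading step above fails on the first package that is not installed. A minimal sketch, assuming a hypothetical helper `missing_packages()` (not part of the original script), that reports missing packages before `library()` is called:

```r
# Sketch: report which packages are not installed before calling library().
# `missing_packages` is a hypothetical helper, not part of the original script.
missing_packages <- function(pkgs) {
  # requireNamespace() loads a package without attaching it and
  # returns FALSE when the package is not installed
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Two packages that ship with R are used here so the example runs anywhere
to_install <- missing_packages(c("stats", "utils"))
if (length(to_install) > 0) {
  message("Install first: ", paste(to_install, collapse = ", "))
}
```

In the script itself, the `packages` vector defined above could be passed to the helper instead of the two base packages.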


2. Searching Reddit threads using a keyword of my choice

# using both subreddit and keyword
reddit_threads <- find_thread_urls(keywords= "Narendra Modi", 
                              subreddit = "India", 
                              sort_by = 'relevance', 
                              period = 'year') %>% 
  drop_na()

## parsing URLs on page 1...
## parsing URLs on page 2...

rownames(reddit_threads) <- NULL

head(reddit_threads, 5) %>% knitr::kable()

date_utc	timestamp	title	subreddit	comments	url
2024-05-09	1715255486	Justices MB Lokur, AP Shah & N Ram Invite PM Narendra Modi & Rahul Gandhi To Public Debate On Lok Sabha Elections	india	4	https://www.reddit.com/r/india/comments/1cnvd4c/justices_mb_lokur_ap_shah_n_ram_invite_pm/
2024-01-12	1705078116	Line of Actual Control (LAC) \| No-intrusion mystery deepens: Army chief General Manoj Pande’s ‘first aim’ of restoring China status quo contradicts Prime Minister Narendra Modi’s border claim	india	17	https://www.reddit.com/r/india/comments/194zljd/line_of_actual_control_lac_nointrusion_mystery/
2024-08-02	1722593428	PM Narendra Modi has provided swimming facilities for all MP’s in the new Parliament House that he built at a cost of 22000 crore of tax payers hard earned money.	india	1	https://www.reddit.com/r/india/comments/1ei6d4m/pm_narendra_modi_has_provided_swimming_facilities/
2024-07-26	1722026914	Editorial With Sujit Nair \| Henley Report ranks Indian Passport 82nd \| Narendra Modi \| Jaishankar	india	1	https://www.reddit.com/r/india/comments/1ecyjfo/editorial_with_sujit_nair_henley_report_ranks/
2024-05-31	1717146764	Narendra Modis quest to reshape India for a thousand years	india	7	https://www.reddit.com/r/india/comments/1d4rbns/narendra_modis_quest_to_reshape_india_for_a/
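
As a quick sanity check on the search results, the Unix `timestamp` column can be converted to a UTC date and compared with `date_utc`. The sketch below uses base R only; the two rows are copied from the table above, and the small data frame `threads` is a stand-in for `reddit_threads`.

```r
# Sketch: convert the Unix `timestamp` to a UTC date and compare with `date_utc`.
# The two values are copied from the first two rows of the table above.
threads <- data.frame(
  date_utc  = c("2024-05-09", "2024-01-12"),
  timestamp = c(1715255486, 1705078116)
)

# Seconds since 1970-01-01 UTC, formatted as a calendar date
threads$date_from_ts <- format(
  as.POSIXct(threads$timestamp, origin = "1970-01-01", tz = "UTC"),
  "%Y-%m-%d"
)

# TRUE when every derived date matches the reported date
all(threads$date_from_ts == threads$date_utc)
```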

3. Clean your text data and then tokenize it

# Load the stop-word list that ships with tidytext
data("stop_words")

# Pattern for URLs and HTML entities to strip before tokenizing
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"

# Clean the text, split it into one word per row, drop stop words,
# and keep only tokens that contain at least one lowercase letter
reddit_words <- reddit_threads %>% 
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  unnest_tokens(word, text, token = "words") %>% 
  anti_join(stop_words, by = "word") %>% 
  filter(str_detect(word, "[a-z]"))
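
One thing worth checking in the cleaning step is how much of a URL the pattern actually removes. The sketch below uses base `gsub()` with `perl = TRUE` as a stand-in for `str_replace_all()`. The example string and the broader pattern `replace_reg_broad` are assumptions for illustration, not part of the original analysis.

```r
# The pattern used in the cleaning step versus a broader, hypothetical alternative
replace_reg       <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"
replace_reg_broad <- "https?://\\S+|&amp;|&lt;|&gt;"

# Made-up example string containing a Reddit-style URL with an underscore
x <- "see https://www.reddit.com/r/india/comments/abc/some_title/ &amp; more"

narrow <- gsub(replace_reg, "", x, perl = TRUE)
broad  <- gsub(replace_reg_broad, "", x, perl = TRUE)

narrow  # the URL match stops at the first underscore, so "_title/" remains
broad   # the whole URL is removed
```

Because Reddit URLs often contain underscores and hyphens, the narrower pattern may leave URL fragments behind that are later tokenized as words.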

# Plot the 20 most frequent words (ties can add extra rows)
reddit_words %>%
  count(word, sort = TRUE) %>%
  top_n(20, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "words",
       y = "counts",
       title = "Unique word counts")
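
To make the tokenize, filter and count steps concrete, here is a rough base-R approximation on made-up text. `toy_text` and `toy_stop_words` are illustrative assumptions, and the simple split on non-letters is not the tokenizer that tidytext uses.

```r
# Base-R approximation of: tokenize -> remove stop words -> count.
# `toy_text` and `toy_stop_words` are made up for illustration.
toy_text <- c("Modi addressed the rally", "The rally was about the elections")
toy_stop_words <- c("the", "was", "about")

# Lowercase, split on anything that is not a letter, drop empties and stop words
tokens <- unlist(strsplit(tolower(toy_text), "[^a-z]+"))
tokens <- tokens[nzchar(tokens) & !tokens %in% toy_stop_words]

# Frequency table sorted from most to least common
word_counts <- sort(table(tokens), decreasing = TRUE)
word_counts
```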

4. Generate a word cloud

reddit_words %>% 
  count(word, sort = TRUE) %>% 
  wordcloud2()

On visual inspection, the word cloud appears to contain the relevant terms for the keyword and the subreddit, including political terms such as elections, policy and the names of opposition members. Another interesting insight is the frequent recurrence of terms related to criticised policies and events, such as the farmers’ protest, the CAA and the recent violence in Manipur.

5. Conduct a tri-gram analysis

Tri-gram analysis

words_ngram <- reddit_threads %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  select(text) %>%
  unnest_tokens(output = paired_words,
                input = text,
                token = "ngrams",
                n = 3)

#separate the words into 3 columns
words_ngram_trio <- words_ngram %>%
  separate(paired_words, c("word1", "word2", "word3"), sep = " ")

# Drop tri-grams in which any of the three words is a stop word
words_ngram_trio_filtered <- words_ngram_trio %>%
  filter(!word1 %in% stop_words$word & !word2 %in% stop_words$word & 
           !word3 %in% stop_words$word) %>%
  # Keep only tri-grams in which every word contains at least one lowercase letter
  filter(str_detect(word1, "[a-z]") & str_detect(word2, "[a-z]") &
           str_detect(word3, "[a-z]"))

# Keep only tri-grams in which all three words are ASCII-encoded
words_ngram_trio_filtered %<>% 
  filter(stri_enc_isascii(word1) & stri_enc_isascii(word2) &
           stri_enc_isascii(word3))

# Count each tri-gram and sort by frequency
words_counts <- words_ngram_trio_filtered %>%
  count(word1, word2, word3) %>%
  arrange(desc(n))

head(words_counts, 20) %>% knitr::kable()

word1	word2	word3	n
prime	minister	narendra	19
minister	narendra	modi	18
uniform	civil	code	11
pm	narendra	modi	10
anti	caa	movement	7
citizenship	amendment	act	7
ajai	shukla	disengagement	4
aligarh	muslim	university	4
citizenship	amendment	bill	4
cm	biren	singh	4
defence	minister	rajnath	4
devangana	kalita	natasha	4
digital	content	creators	4
disengagement	china	india	4
home	minister	amit	4
kalita	natasha	narwal	4
minister	amit	shah	4
minister	rajnath	singh	4
narendra	modi	bjp	4
narwal	safoora	zargar	4
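
The near-identical counts for "prime minister narendra" and "minister narendra modi" follow from how n-grams are built: a window of n words slides over the text one word at a time, so a four-word phrase yields two overlapping tri-grams. A minimal base-R sketch, where `make_ngrams()` is a hypothetical helper and not the tidytext implementation:

```r
# `make_ngrams` is a hypothetical helper that mimics a sliding n-word window.
make_ngrams <- function(tokens, n) {
  # Fewer than n tokens means no complete window
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens <- c("prime", "minister", "narendra", "modi")
make_ngrams(tokens, 3)  # two overlapping tri-grams
make_ngrams(tokens, 2)  # three overlapping bi-grams
```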

The tri-gram analysis does not provide detailed insight into the sentiment of the users posting on Reddit, so I’ll proceed with a bi-gram analysis.

Bi-gram analysis

# Perform bi-gram analysis as well:
words_ngram2 <- reddit_threads %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  select(text) %>%
  unnest_tokens(output = paired_words,
                input = text,
                token = "ngrams",
                n = 2)

#separate the words into 2 columns
words_ngram_pair <- words_ngram2 %>%
  separate(paired_words, c("word1", "word2"), sep = " ")

# Drop bi-grams in which either word is a stop word
words_ngram_pair_filtered <- words_ngram_pair %>%
  filter(!word1 %in% stop_words$word & !word2 %in% stop_words$word) %>% 
  # Keep only bi-grams in which both words contain at least one lowercase letter
  filter(str_detect(word1, "[a-z]") & str_detect(word2, "[a-z]"))

# Keep only bi-grams in which both words are ASCII-encoded
words_ngram_pair_filtered %<>% 
  filter(stri_enc_isascii(word1) & stri_enc_isascii(word2))

# Count each bi-gram and sort by frequency
words_counts2 <- words_ngram_pair_filtered %>%
  count(word1, word2) %>%
  arrange(desc(n))

head(words_counts2,20) %>% knitr::kable()

word1	word2	n
narendra	modi	80
prime	minister	34
pm	modi	25
electoral	bonds	24
modi	government	23
minister	narendra	19
amit	shah	17
rahul	gandhi	16
legal	guarantee	15
shaheen	bagh	15
citizenship	amendment	11
civil	code	11
content	creators	11
pm	narendra	11
uniform	civil	11
lok	sabha	10
supreme	court	10
caa	nrc	9
yogendra	yadav	9
biren	singh	8

The bi-gram analysis also does not reveal any concrete sentiment towards the Prime Minister, as the most frequent pairs are the names of members of the ruling or opposition parties. However, the CAA and NRC, which were among the most criticised policies of the ruling party, appear frequently. Although these policies were introduced some time ago, they still appear to draw strong reactions.

6. Conducting a sentiment analysis

# Score every sentence in every post, then sort from most positive to most negative
sentiment_reddit <- sentiment(reddit_threads$text) %>%
  arrange(desc(sentiment))

head(sentiment_reddit, 10) %>% 
  knitr::kable()

element_id	sentence_id	word_count	sentiment
31	2	5	1.7609035
46	15	4	0.8750000
79	23	26	0.8482023
60	13	15	0.8262364
77	23	1	0.8000000
53	7	32	0.7954951
74	51	3	0.7794229
56	7	13	0.7349778
56	25	16	0.7250000
66	11	19	0.7226596

7. Display 10 sample texts alongside their sentiment scores and evaluate the credibility of the sentiment analysis outcomes.

# Set a seed so the random sample of 10 posts is reproducible
set.seed(1234)
# Keep posts with non-empty text that contains no URL
reddit_sentiment_filtered <- reddit_threads %>%
  filter(nzchar(text) & !grepl("http[s]?://", text)) %>%
  unnest(text)
random_indices <- sample(nrow(reddit_sentiment_filtered), size = 10)
ten_reddit_text_samples <- reddit_sentiment_filtered[random_indices, ]
# Note: sentiment() returns one score per sentence; [1] keeps only the score
# of the first sentence of each post
ten_reddit_text_samples$sentiment_score <- sapply(ten_reddit_text_samples$text, function(text) {
  sentiment(text)$sentiment[1]
})
# selecting columns and sorting by sentiment for display
columns <- c("date_utc", "title", "text", "subreddit", "sentiment_score")
ten_reddit_text_samples_final <- ten_reddit_text_samples[, columns] %>%
  arrange(desc(sentiment_score))
ten_reddit_text_samples_final$title <- strtrim(ten_reddit_text_samples_final$title, 50)
ten_reddit_text_samples_final$text <- strtrim(ten_reddit_text_samples_final$text, 50)
head(ten_reddit_text_samples_final, 10) %>% 
  knitr::kable()

date_utc	title	text	subreddit	sentiment_score
2024-03-05	First Lalu, now A Raja: INDIA bloc keeps lobbing f	Seizing on Rashtriya Janata Dal (RJD) leader Lalu	india	0.0575396
2024-05-08	Rahul Gandhi And Fighting A Political Battle On On	On 22 January 2024, as Prime Minister set out for	india	0.0456435
2024-04-22	Anshuman Sail Nehru (@AnshumanSail) on X	Today, Narendra Modi has shamelessly lied about Dr	india	0.0335410
2024-05-19	How Narendra Modi made himself unbeatable I DW New	Not BBC, DW. I guess they know when they see one.	india	0.0000000
2024-11-25	Did Virat Kohli start the beard trend among men in	Ive been observing that these days, almost 8 out o	india	0.0000000
2024-02-19	Point out the shortcomings of the ruling party in	I’ll go first. I’m from West Bengal and honestly,	india	0.0000000
2024-03-24	What is the right kind of Protests?	A while ago, PM Narendra Modi in Parliament mocked	india	-0.0250000
2024-04-19	BBC interview of Narendra Modi 2002 post Godhra ri	Its clear he doesnt make the same mistake twice.	india	-0.0753778
2024-02-19	If youre Indian National Congress President, what	When Rahul Gandhi was President of INC during the	india	-0.1039230
2024-03-19	If Somebody received a message like this,Please bl	Send a screenshot with a message showing that you	india	-0.1109400
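
Because `sentiment()` returns one row per sentence, taking `$sentiment[1]` scores each post by its first sentence only. The sketch below contrasts that with a simple per-post mean, using base R and made-up per-sentence scores in the same column layout. The sentimentr package also provides `sentiment_by()` for aggregated scores per text, which may be a better fit than the plain mean shown here.

```r
# Toy per-sentence scores in the shape returned by sentiment();
# the numbers are made up for illustration.
per_sentence <- data.frame(
  element_id  = c(1, 1, 1, 2, 2),
  sentence_id = c(1, 2, 3, 1, 2),
  word_count  = c(5, 20, 15, 10, 10),
  sentiment   = c(0.00, -0.40, -0.30, 0.20, 0.10)
)

# Score taken from the first sentence only (what `$sentiment[1]` does per post)
first_only <- per_sentence$sentiment[per_sentence$sentence_id == 1]

# Simple mean of all sentence scores per post
post_mean <- aggregate(sentiment ~ element_id, data = per_sentence, FUN = mean)

data.frame(element_id        = post_mean$element_id,
           first_sentence    = first_only,
           mean_of_sentences = post_mean$sentiment)
```

In this toy case the first post looks neutral on its first sentence but negative overall, which is the kind of difference that could affect the scores reported above.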

Sentiment analysis of 10 samples: As is often the case in politics, the sampled posts about Prime Minister Modi show a mix of positive and negative sentiment scores, reflecting the diverse opinions expressed by the public. Posts supportive of Modi typically receive positive scores, while those critical of him receive negative scores. However, there are exceptions, as discussed below.

On closer review, sentimentr has generally classified the posts in line with their tone, with supportive content receiving positive scores and critical posts receiving negative scores. One exception is the third post: although its headline and text are critical of the political leader, it received a positive sentiment score. In addition, one post that is unrelated to the keyword or the subreddit context was assigned a neutral score of 0.0.

While sentimentr’s performance appears accurate at face value, this evaluation rests on surface-level observations of a small sample. A more detailed and robust assessment of its effectiveness would require further analysis of the text.
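
One way such a further assessment could be set up is to hand-label a sample of posts and compare the labels with the sign of the sentiment score. The sketch below is illustrative only: the labels and scores are made up, and `score_to_label()` with its `tol` dead zone is a hypothetical helper.

```r
# Hypothetical hand labels and scores, made up for illustration only.
manual_label    <- c("positive", "negative", "neutral", "negative", "positive", "neutral")
sentiment_score <- c(0.12, 0.03, 0.00, -0.20, -0.05, 0.00)

# Map a numeric score to a label; `tol` is an assumed dead zone around zero
score_to_label <- function(score, tol = 0) {
  ifelse(score > tol, "positive", ifelse(score < -tol, "negative", "neutral"))
}

predicted <- score_to_label(sentiment_score)

# Cross-tabulate manual labels against predicted labels
confusion <- table(manual = manual_label, predicted = predicted)

# Share of posts where the two labels agree
agreement <- mean(manual_label == predicted)

confusion
agreement
```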

8. Discuss intriguing insights derived from the sentiment analysis, supporting your observations with at least two plots.

# re-run sentiment analysis code used for 10 samples on all text for plotting
reddit_sentiment_plots <- reddit_threads %>%
  filter(nzchar(text) & !grepl("http[s]?://", text)) %>%
  unnest(text) %>%
  drop_na()
# sentiment() scores each sentence; [1] keeps the first sentence's score per post
reddit_sentiment_plots$sentiment_score <-
  sapply(reddit_sentiment_plots$text, function(text) {
    sentiment(text)$sentiment[1]
  })
columns <- c("date_utc", "title", "text", "subreddit", "sentiment_score")
reddit_sentiment_plots <- reddit_sentiment_plots[, columns] %>%
  arrange(desc(sentiment_score))
reddit_sentiment_plots$title <- strtrim(reddit_sentiment_plots$title, 50)
reddit_sentiment_plots$text <- strtrim(reddit_sentiment_plots$text, 50)
# turn date_utc into Day of the Week for plotting
reddit_sentiment_plots$DoW <- wday(reddit_sentiment_plots$date_utc, label = TRUE, abbr = FALSE)
reddit_sentiment_plots <- reddit_sentiment_plots %>% select(date_utc, DoW, title, text,
                                                            subreddit, sentiment_score)
# density plot to show distribution of scores
ggplot(reddit_sentiment_plots, aes(x = sentiment_score)) +
  geom_density(fill = "cyan", alpha = 0.9) +
  labs(title = "Distribution of Sentiment", x = "Sentiment score", y = "Density") +
  theme_dark()

The graph shows the density distribution of sentiment scores, with values concentrated around 0. This suggests that the majority of posts are neutral or balance positive and negative language. The curve is slightly skewed to the right, indicating a minor prevalence of positive over negative sentiment. The density tapers off towards the extremes, with few posts showing strongly negative or strongly positive sentiment. This is consistent with the view that, although the Prime Minister is a right-wing leader, many moderates also tend to agree with him, as suggested by his achieving a majority victory in the recently concluded elections in a nation whose citizens have historically voted for moderate political parties.
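
The reading of the density plot can be backed up with a few summary numbers. The sketch below is a minimal base-R example on made-up scores. `summarise_scores()` is a hypothetical helper, and in the analysis it would be applied to `reddit_sentiment_plots$sentiment_score`.

```r
# `summarise_scores` is a hypothetical helper; `toy_scores` are made-up values.
summarise_scores <- function(scores) {
  c(mean           = mean(scores),
    median         = median(scores),
    share_positive = mean(scores > 0),   # proportion of scores above zero
    share_negative = mean(scores < 0),   # proportion of scores below zero
    share_zero     = mean(scores == 0))  # proportion of exactly neutral scores
}

toy_scores <- c(-0.30, -0.10, 0, 0, 0, 0.05, 0.10, 0.40)
round(summarise_scores(toy_scores), 3)
```

A mean above the median together with a larger positive than negative share would be consistent with the slight right skew described above.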

#day-of-week bar graph
reddit_sentiment_plots %>% 
  ggplot(aes(x = DoW)) +
  geom_bar(fill = 'lightblue') +
  theme_classic()

The bar chart shows the number of posts for each day of the week. Tuesday and Wednesday have slightly higher counts, but the variation is small and does not suggest a clear weekday pattern in posting activity. Because the chart shows post counts only, it does not by itself indicate whether sentiment scores vary by day.

# violin graph by day-of-week
ggplot(reddit_sentiment_plots, aes(x = DoW, y = sentiment_score)) +
  geom_violin(fill = "black", color = "white") +
  #labs(title = "Violin Plot by Group", x = "Group", y = "Value") +
  theme_dark()

The violin plot shows the distribution of sentiment scores towards Prime Minister Modi over the last 12 months, split by day of the week. The distribution is broadly consistent across all days, indicating that sentiment, whether positive, neutral or negative, does not vary noticeably by day of the week. Most scores are centered around zero, reflecting neutral or balanced opinions. However, the visible tails in both directions indicate that, while most posts remain neutral, some express strong support or strong criticism. This may be related to the Prime Minister’s affiliation with a right-wing party, as politicians from right-wing parties often evoke more polarized opinions than those with moderate ideologies. The absence of a weekday pattern is consistent with the idea that sentiment towards PM Modi is shaped more by events and public discourse than by the day of the week.
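
The visual impression that day of the week does not matter could be checked with two standard tests from base R's stats package: a Kruskal-Wallis test for differences in scores between days, and a chi-squared goodness-of-fit test for whether posts are spread evenly across days. The data below are simulated for illustration. In the analysis, `reddit_sentiment_plots` would take the place of `toy`.

```r
# Simulated stand-in for reddit_sentiment_plots; values are not real results.
set.seed(1)
days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")
toy <- data.frame(
  DoW             = factor(sample(days, 140, replace = TRUE), levels = days),
  sentiment_score = rnorm(140, mean = 0, sd = 0.2)
)

# Do sentiment scores differ between weekdays? (rank-based test)
kw <- kruskal.test(sentiment_score ~ DoW, data = toy)

# Are posts spread evenly across weekdays? (goodness of fit against equal shares)
chi <- chisq.test(table(toy$DoW))

kw$p.value
chi$p.value
```

Large p-values in both tests would support the reading that neither posting volume nor sentiment depends on the weekday, although with a small sample the tests have limited power.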

9. Conclusion

The sentiment analysis of Reddit posts about Prime Minister Narendra Modi over the past 12 months reveals a predominantly neutral sentiment, with occasional instances of strong support and strong criticism. These extremes reflect the polarization often associated with right-wing leaders, as Modi’s policies and leadership style evoke diverse reactions. Even so, the slight skew towards positive sentiment is consistent with his ability to appeal to a broad voter base, as suggested by his recent electoral success in a nation historically favoring moderate parties. The analysis indicates that public sentiment is shaped more by specific events and discourse than by consistent patterns, offering insight into the complexities of political perception.

The results from sentimentr, although not entirely precise, provided valuable insights for the analysis. The algorithm’s difficulty in accurately assessing some posts highlights the complexity of human emotion and expression.