Describe in one sentence what you aim to examine using user-generated text data and sentiment analysis: In this project I aim to analyse sentiment towards Narendra Modi, the current Prime Minister of India, over the last 12 months.
packages <- c("RedditExtractoR", "anytime", "magrittr", "httr", "tidytext", "tidyverse", "igraph",
"ggraph", "wordcloud2", "textdata", "sf", "tmap", "here", "ggdark", "syuzhet",
"sentimentr", "lubridate","stringi")
# Load packages
invisible(lapply(packages, library, character.only = TRUE))
# using both subreddit and keyword
reddit_threads <- find_thread_urls(keywords= "Narendra Modi",
subreddit = "India",
sort_by = 'relevance',
period = 'year') %>%
drop_na()
## parsing URLs on page 1...
## parsing URLs on page 2...
rownames(reddit_threads) <- NULL
head(reddit_threads, 5) %>% knitr::kable()
| date_utc | timestamp | title | text | subreddit | comments | url |
|---|---|---|---|---|---|---|
| 2024-05-09 | 1715255486 | Justices MB Lokur, AP Shah & N Ram Invite PM Narendra Modi & Rahul Gandhi To Public Debate On Lok Sabha Elections | | india | 4 | https://www.reddit.com/r/india/comments/1cnvd4c/justices_mb_lokur_ap_shah_n_ram_invite_pm/ |
| 2024-01-12 | 1705078116 | Line of Actual Control (LAC) \| No-intrusion mystery deepens: Army chief General Manoj Pande’s ‘first aim’ of restoring China status quo contradicts Prime Minister Narendra Modi’s border claim | | india | 17 | https://www.reddit.com/r/india/comments/194zljd/line_of_actual_control_lac_nointrusion_mystery/ |
| 2024-08-02 | 1722593428 | PM Narendra Modi has provided swimming facilities for all MP’s in the new Parliament House that he built at a cost of 22000 crore of tax payers hard earned money. | | india | 1 | https://www.reddit.com/r/india/comments/1ei6d4m/pm_narendra_modi_has_provided_swimming_facilities/ |
| 2024-07-26 | 1722026914 | Editorial With Sujit Nair \| Henley Report ranks Indian Passport 82nd \| Narendra Modi \| Jaishankar | | india | 1 | https://www.reddit.com/r/india/comments/1ecyjfo/editorial_with_sujit_nair_henley_report_ranks/ |
| 2024-05-31 | 1717146764 | Narendra Modis quest to reshape India for a thousand years | | india | 7 | https://www.reddit.com/r/india/comments/1d4rbns/narendra_modis_quest_to_reshape_india_for_a/ |
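# initial tokenisation of the raw post text (overwritten by the cleaned version below)
reddit_words <- reddit_threads %>%
unnest_tokens(output = word, input = text, token = "words")
# load the tidytext stop-word list; URLs and stray &, <, > characters are stripped before re-tokenising
data("stop_words")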
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&|<|>"
reddit_words <- reddit_threads %>%
mutate(text = str_replace_all(text, replace_reg, "")) %>%
unnest_tokens(word, text, token = "words") %>%
anti_join(stop_words, by = "word") %>%
filter(str_detect(word, "[a-z]"))
# bar chart of the 20 most frequent words (ties may add a few extra rows)
reddit_words %>%
count(word, sort = TRUE) %>%
top_n(20, n) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_col() +
coord_flip() +
labs(x = "words",
y = "counts",
title = "Most frequent words")
reddit_words %>%
count(word, sort = TRUE) %>%
wordcloud2()
On visual inspection, the word cloud appears to contain the terms one would expect for this keyword and subreddit, including political terms such as elections, policy and the names of opposition members. Another notable pattern is the frequent recurrence of terms tied to widely criticised policies and events, such as the farmers' protest, the Citizenship Amendment Act (CAA) and the recent violence in Manipur.
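The word counts behind the bar chart and word cloud come from three steps: tokenise, drop stop words, count. A minimal base R sketch of the same idea is given here, assuming two made-up posts and a tiny hypothetical stop-word list (`toy_stop_words`); it is not the tidytext implementation, only an illustration of what each step does.

```r
# Toy posts (hypothetical text, not taken from the Reddit data)
posts <- c(
  "The election results were announced today",
  "Farmers protest the new policy ahead of the election"
)

# A tiny illustrative stop-word list; tidytext::stop_words is far larger
toy_stop_words <- c("the", "were", "of", "new", "ahead")

# Lower-case, split on anything that is not a letter, flatten to one vector
tokens <- unlist(strsplit(tolower(posts), "[^a-z]+"))
tokens <- tokens[nzchar(tokens)]

# Drop stop words (the role played by anti_join(stop_words, by = "word"))
tokens <- tokens[!tokens %in% toy_stop_words]

# Count and sort (the role played by count(word, sort = TRUE))
word_counts <- sort(table(tokens), decreasing = TRUE)
print(word_counts)
```

Because stop-word removal happens before counting, the ranking depends on which list is used; a different list could change which words reach the top 20.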
Tri-gram analysis
words_ngram <- reddit_threads %>%
mutate(text = str_replace_all(text, replace_reg, "")) %>%
select(text) %>%
unnest_tokens(output = paired_words,
input = text,
token = "ngrams",
n = 3)
#separate the words into 3 columns
words_ngram_trio <- words_ngram %>%
separate(paired_words, c("word1", "word2", "word3"), sep = " ")
# remove rows where any of word1, word2 or word3 is a stop word
words_ngram_trio_filtered <- words_ngram_trio %>%
  filter(!word1 %in% stop_words$word & !word2 %in% stop_words$word &
           !word3 %in% stop_words$word) %>%
  # keep rows where every word contains at least one letter (drops purely numeric tokens)
  filter(str_detect(word1, "[a-z]") & str_detect(word2, "[a-z]") &
           str_detect(word3, "[a-z]"))
# keep only rows where all 3 words are encoded in ASCII
words_ngram_trio_filtered %<>%
  filter(stri_enc_isascii(word1) & stri_enc_isascii(word2) &
           stri_enc_isascii(word3))
# Sort the new tri-gram (n=3) counts:
words_counts <- words_ngram_trio_filtered %>%
count(word1, word2, word3) %>%
arrange(desc(n))
head(words_counts, 20) %>% knitr::kable()
| word1 | word2 | word3 | n |
|---|---|---|---|
| prime | minister | narendra | 19 |
| minister | narendra | modi | 18 |
| uniform | civil | code | 11 |
| pm | narendra | modi | 10 |
| anti | caa | movement | 7 |
| citizenship | amendment | act | 7 |
| ajai | shukla | disengagement | 4 |
| aligarh | muslim | university | 4 |
| citizenship | amendment | bill | 4 |
| cm | biren | singh | 4 |
| defence | minister | rajnath | 4 |
| devangana | kalita | natasha | 4 |
| digital | content | creators | 4 |
| disengagement | china | india | 4 |
| home | minister | amit | 4 |
| kalita | natasha | narwal | 4 |
| minister | amit | shah | 4 |
| minister | rajnath | singh | 4 |
| narendra | modi | bjp | 4 |
| narwal | safoora | zargar | 4 |
The tri-gram analysis does not give detailed insight into the sentiment of users posting on Reddit, as the most frequent tri-grams are mostly names and titles, so I proceed with a bi-gram analysis.
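N-grams are built by sliding a window of width n across the token sequence, so a single four-word phrase produces two overlapping tri-grams. This may explain why "prime minister narendra" and "minister narendra modi" sit next to each other at the top of the table. The sketch below illustrates this in base R; `make_ngrams` is a hypothetical helper written for this illustration, not a tidytext function.

```r
# Build n-grams from a vector of tokens by sliding a window of width n
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical token sequence for illustration
tokens <- c("prime", "minister", "narendra", "modi", "addressed", "parliament")

trigrams <- make_ngrams(tokens, 3)
bigrams  <- make_ngrams(tokens, 2)

print(trigrams)
print(bigrams)
```

A sequence of k tokens yields k - n + 1 n-grams, so frequent multi-word names are counted several times over in slightly shifted forms.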
Bi-gram analysis
# Perform bi-gram analysis as well:
words_ngram2 <- reddit_threads %>%
mutate(text = str_replace_all(text, replace_reg, "")) %>%
select(text) %>%
unnest_tokens(output = paired_words,
input = text,
token = "ngrams",
n = 2)
#separate the words into 2 columns
words_ngram_pair <- words_ngram2 %>%
separate(paired_words, c("word1", "word2"), sep = " ")
# remove rows where either word1 or word2 is a stop word
words_ngram_pair_filtered <- words_ngram_pair %>%
  filter(!word1 %in% stop_words$word & !word2 %in% stop_words$word) %>%
  # keep rows where both words contain at least one letter (drops purely numeric tokens)
  filter(str_detect(word1, "[a-z]") & str_detect(word2, "[a-z]"))
# keep only rows where both words are encoded in ASCII
words_ngram_pair_filtered %<>%
  filter(stri_enc_isascii(word1) & stri_enc_isascii(word2))
# Sort the bi-gram:
words_counts2 <- words_ngram_pair_filtered %>%
count(word1, word2) %>%
arrange(desc(n))
head(words_counts2,20) %>% knitr::kable()
| word1 | word2 | n |
|---|---|---|
| narendra | modi | 80 |
| prime | minister | 34 |
| pm | modi | 25 |
| electoral | bonds | 24 |
| modi | government | 23 |
| minister | narendra | 19 |
| amit | shah | 17 |
| rahul | gandhi | 16 |
| legal | guarantee | 15 |
| shaheen | bagh | 15 |
| citizenship | amendment | 11 |
| civil | code | 11 |
| content | creators | 11 |
| pm | narendra | 11 |
| uniform | civil | 11 |
| lok | sabha | 10 |
| supreme | court | 10 |
| caa | nrc | 9 |
| yogendra | yadav | 9 |
| biren | singh | 8 |
The bi-gram analysis also does not reveal any concrete sentiment towards the Prime Minister, as the most frequent pairs are the names of members of the ruling or opposition parties. However, pairs relating to the CAA and NRC recur frequently; these were among the most criticised policies introduced by the ruling party. Although some time has passed since they were introduced, they still appear to attract strong sentiment.
sentiment_reddit <- sentiment(reddit_threads$text) %>%
arrange(desc(sentiment))
head(sentiment_reddit, 10) %>%
knitr::kable()
| element_id | sentence_id | word_count | sentiment |
|---|---|---|---|
| 31 | 2 | 5 | 1.7609035 |
| 46 | 15 | 4 | 0.8750000 |
| 79 | 23 | 26 | 0.8482023 |
| 60 | 13 | 15 | 0.8262364 |
| 77 | 23 | 1 | 0.8000000 |
| 53 | 7 | 32 | 0.7954951 |
| 74 | 51 | 3 | 0.7794229 |
| 56 | 7 | 13 | 0.7349778 |
| 56 | 25 | 16 | 0.7250000 |
| 66 | 11 | 19 | 0.7226596 |
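Several of the highest-scoring sentences in this table are very short (word counts of 5, 4, 1 and 3). As far as I understand the sentimentr documentation, sentence scores are a sum of weighted polarity words divided by the square root of the word count, which lets a single strong word dominate a short sentence. The sketch below is a deliberately simplified version of that idea: `toy_lexicon` is a hypothetical mini-dictionary, and valence shifters (negators, amplifiers) are ignored, so it will not reproduce sentimentr's actual numbers.

```r
# Hypothetical mini-lexicon: word -> polarity weight (not sentimentr's dictionary)
toy_lexicon <- c(good = 1, great = 1, support = 0.5, bad = -1, corrupt = -1)

# Simplified polarity score: sum of lexicon weights / sqrt(number of words)
score_sentence <- function(sentence, lexicon = toy_lexicon) {
  words <- unlist(strsplit(tolower(sentence), "[^a-z]+"))
  words <- words[nzchar(words)]
  if (length(words) == 0) return(0)
  weights <- lexicon[words]      # NA for words that are not in the lexicon
  weights[is.na(weights)] <- 0   # treat unknown words as neutral
  sum(weights) / sqrt(length(words))
}

short_score <- score_sentence("Great policy")
long_score  <- score_sentence("A great policy was announced by the government today")

print(c(short = short_score, long = long_score))
```

Both toy sentences contain the same single positive word, but the shorter one scores higher because the denominator is smaller.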
# setting seed and setting up code to randomly select 10 posts from the entire population
set.seed(1234)
reddit_sentiment_filtered <- reddit_threads %>%
filter(nzchar(text) & !grepl("http[s]?://", text)) %>%
unnest(text)
random_indices <- sample(nrow(reddit_sentiment_filtered), size = 10)
ten_reddit_text_samples <- reddit_sentiment_filtered[random_indices, ]
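# NOTE: sentiment() returns one row per sentence, so [1] keeps only the first sentence's score for each post
ten_reddit_text_samples$sentiment_score <- sapply(ten_reddit_text_samples$text, function(text) {
sentiment(text)$sentiment[1]
})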
# selecting columns and sorting by sentiment for display
columns <- c("date_utc", "title", "text", "subreddit", "sentiment_score")
ten_reddit_text_samples_final <- ten_reddit_text_samples[, columns] %>%
arrange(desc(sentiment_score))
ten_reddit_text_samples_final$title <- strtrim(ten_reddit_text_samples_final$title, 50)
ten_reddit_text_samples_final$text <- strtrim(ten_reddit_text_samples_final$text, 50)
head(ten_reddit_text_samples_final, 10) %>%
knitr::kable()
| date_utc | title | text | subreddit | sentiment_score |
|---|---|---|---|---|
| 2024-03-05 | First Lalu, now A Raja: INDIA bloc keeps lobbing f | Seizing on Rashtriya Janata Dal (RJD) leader Lalu | india | 0.0575396 |
| 2024-05-08 | Rahul Gandhi And Fighting A Political Battle On On | On 22 January 2024, as Prime Minister set out for | india | 0.0456435 |
| 2024-04-22 | Anshuman Sail Nehru (@AnshumanSail) on X | Today, Narendra Modi has shamelessly lied about Dr | india | 0.0335410 |
| 2024-05-19 | How Narendra Modi made himself unbeatable I DW New | Not BBC, DW. I guess they know when they see one. | india | 0.0000000 |
| 2024-11-25 | Did Virat Kohli start the beard trend among men in | Ive been observing that these days, almost 8 out o | india | 0.0000000 |
| 2024-02-19 | Point out the shortcomings of the ruling party in | I’ll go first. I’m from West Bengal and honestly, | india | 0.0000000 |
| 2024-03-24 | What is the right kind of Protests? | A while ago, PM Narendra Modi in Parliament mocked | india | -0.0250000 |
| 2024-04-19 | BBC interview of Narendra Modi 2002 post Godhra ri | Its clear he doesnt make the same mistake twice. | india | -0.0753778 |
| 2024-02-19 | If youre Indian National Congress President, what | When Rahul Gandhi was President of INC during the | india | -0.1039230 |
| 2024-03-19 | If Somebody received a message like this,Please bl | Send a screenshot with a message showing that you | india | -0.1109400 |
Sentiment Analysis of 10 Samples: As is often the case in politics, the sentiment scores for posts about Prime Minister Modi show a mix of positive and negative values, reflecting the diverse opinions expressed by the public. Posts supportive of Modi typically receive positive scores, while those critical of him receive negative scores. There are exceptions, however, as seen in the classification of certain posts.
On closer review, sentimentr has generally classified the posts in line with their tone: supportive content receives positive scores and critical content negative scores. One exception is the third post. Although both the headline and the text are critical of the political leader, the post has been assigned a positive sentiment score. There is also a post unrelated to the keyword or subreddit context, which has been assigned a neutral score of 0.0.
While sentimentr's performance appears accurate at face value, this evaluation rests on surface-level observation of ten posts. A more detailed and robust assessment of its effectiveness requires further analysis of the text.
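One thing worth checking is how a post-level score is derived from sentence-level scores. The sampling code uses `sentiment(text)$sentiment[1]`, and since `sentiment()` returns one row per sentence, this keeps only the first sentence of each post. The sketch below uses hypothetical per-sentence scores, shaped like `sentiment()` output, to show how the first-sentence score can differ from an average over the whole post. sentimentr also provides `sentiment_by()` for post-level aggregation; the plain mean here is only a stand-in for it.

```r
# Hypothetical per-sentence scores for two posts (made-up numbers)
sentence_scores <- data.frame(
  element_id  = c(1, 1, 1, 2, 2),
  sentence_id = c(1, 2, 3, 1, 2),
  sentiment   = c(0.05, -0.40, -0.30, 0.00, 0.60)
)

# Score of the first sentence only (what `$sentiment[1]` keeps)
first_sentence <- as.vector(tapply(sentence_scores$sentiment,
                                   sentence_scores$element_id,
                                   function(s) s[1]))

# Simple mean over all sentences in each post
post_mean <- as.vector(tapply(sentence_scores$sentiment,
                              sentence_scores$element_id,
                              mean))

comparison <- data.frame(
  element_id     = sort(unique(sentence_scores$element_id)),
  first_sentence = first_sentence,
  post_mean      = post_mean
)
print(comparison)
```

In this toy case the first post opens mildly positive but is negative overall, which is one possible way a critical post could end up with a positive score.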
# re-run sentiment analysis code used for 10 samples on all text for plotting
reddit_sentiment_plots <- reddit_threads %>%
filter(nzchar(text) & !grepl("http[s]?://", text)) %>%
unnest(text) %>%
drop_na()
# NOTE: as with the 10 samples, [1] keeps only the first sentence's score for each post
reddit_sentiment_plots$sentiment_score <-
sapply(reddit_sentiment_plots$text, function(text) {
sentiment(text)$sentiment[1]
})
columns <- c("date_utc", "title", "text", "subreddit", "sentiment_score")
reddit_sentiment_plots <- reddit_sentiment_plots[, columns] %>%
arrange(desc(sentiment_score))
reddit_sentiment_plots$title <- strtrim(reddit_sentiment_plots$title, 50)
reddit_sentiment_plots$text <- strtrim(reddit_sentiment_plots$text, 50)
# turn date_utc into Day of the Week for plotting
reddit_sentiment_plots$DoW <- wday(reddit_sentiment_plots$date_utc, label = TRUE, abbr = FALSE)
reddit_sentiment_plots <- reddit_sentiment_plots %>% select(date_utc, DoW, title, text,
subreddit, sentiment_score)
# density plot to show distribution of scores
ggplot(reddit_sentiment_plots, aes(x = sentiment_score)) +
geom_density(fill = "cyan", alpha = 0.9) +
labs(title = "Distribution of Sentiment", x = "Sentiment_score", y = "Density") +
theme_dark()
The graph displays the density of sentiment scores, which are concentrated around 0. This suggests that most posts are neutral or balance positive and negative language. The curve is slightly skewed to the right, indicating that positive sentiment is marginally more common than negative sentiment. The density tapers off towards the extremes, with few posts showing strongly negative or strongly positive sentiment. This is consistent with the view that, although he is a right-wing leader, many moderates also tend to agree with the Prime Minister, as suggested by his coalition winning a majority in the recently concluded general election in a nation whose citizens have historically voted for moderate political parties.
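A visual impression of skew can be backed up with summary statistics. The sketch below computes the mean, median and a moment-based skewness for the ten sampled scores shown in the earlier sample table. Ten values are far too few to settle the shape of the full distribution, and `skewness` here is a small helper defined for this illustration, not a function from any loaded package. The same function could be applied to `reddit_sentiment_plots$sentiment_score`.

```r
# The ten sampled scores from the sample table (sample only, not the full data)
scores <- c(0.0575396, 0.0456435, 0.0335410, 0, 0, 0,
            -0.0250000, -0.0753778, -0.1039230, -0.1109400)

# Moment-based skewness: > 0 means a longer right tail, < 0 a longer left tail
skewness <- function(x) {
  x <- x[!is.na(x)]
  centred <- x - mean(x)
  mean(centred^3) / (mean(centred^2)^1.5)
}

summary_stats <- c(
  mean     = mean(scores),
  median   = median(scores),
  skewness = skewness(scores)
)
print(round(summary_stats, 4))
```

Comparing the mean with the median gives a quick sanity check on the direction of skew read off the density plot.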
#day-of-week bar graph
reddit_sentiment_plots %>%
ggplot(aes(x = DoW)) +
geom_bar(fill = 'lightblue') +
theme_classic()
The bar chart shows the number of posts on each day of the week, with no clear pattern in posting activity. While Tuesday and Wednesday have slightly higher counts, the variation is not large enough to suggest that posting volume depends on the day of the week. Note that this chart shows post counts only; whether sentiment scores vary by day is examined in the violin plot that follows.
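Whether post counts differ across days more than chance would allow can be checked with a chi-squared goodness-of-fit test against equal proportions. The counts below are hypothetical placeholders, since the real counts are not printed in this report; with the actual data the input would be `table(reddit_sentiment_plots$DoW)`.

```r
# Hypothetical post counts per weekday (placeholders, not the real data)
day_counts <- c(Sunday = 9, Monday = 10, Tuesday = 14, Wednesday = 13,
                Thursday = 10, Friday = 9, Saturday = 8)

# Goodness-of-fit test: by default chisq.test() on a vector of counts
# compares them with equal expected proportions
gof <- chisq.test(day_counts)

print(gof$statistic)
print(gof$p.value)
```

A large p-value would be consistent with no day-of-week effect on posting volume, though with a sample this small the test has little power to detect modest differences.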
# violin plot of sentiment score by day of week
ggplot(reddit_sentiment_plots, aes(x = DoW, y = sentiment_score)) +
geom_violin(fill = "black", color = "white") +
labs(title = "Sentiment by Day of Week", x = "Day of week", y = "Sentiment score") +
theme_dark()
The violin plot illustrates the distribution of sentiment scores towards Prime Minister Modi over the last 12 months, split by day of the week. The distribution is broadly similar across all days, indicating that sentiment, whether positive, neutral or negative, does not vary noticeably by day. Most scores are centred around zero, reflecting neutral or balanced opinions. However, the visible tails in both directions indicate that, while most posts remain neutral, some express strong support or strong criticism. This may be related to the Prime Minister's affiliation with a right-wing party, as politicians from right-wing parties often evoke more polarised opinions than those with moderate ideologies. It also supports the idea that sentiment towards PM Modi is shaped more by events and public discourse than by day-specific patterns.
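The visual reading that sentiment does not vary by day can be tested with a Kruskal-Wallis rank-sum test, which compares groups without assuming normality. The sketch below runs it on simulated scores, purely to show the call; `toy` is hypothetical data, and the real analysis would pass `reddit_sentiment_plots` instead.

```r
# Simulated scores for illustration only (not the Reddit data)
set.seed(42)
toy <- data.frame(
  DoW = factor(rep(c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
                   each = 10)),
  sentiment_score = rnorm(70, mean = 0, sd = 0.1)
)

# Kruskal-Wallis test: do the score distributions differ between weekdays?
kw <- kruskal.test(sentiment_score ~ DoW, data = toy)

print(kw$statistic)
print(kw$p.value)
```

A p-value above the chosen threshold would be consistent with the interpretation of the violin plot, but it would not prove that no day-of-week effect exists.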
The sentiment analysis of posts about Prime Minister Narendra Modi over the past 12 months reveals a predominantly neutral sentiment, with occasional instances of strong support and strong criticism. These extremes reflect the polarisation often associated with right-wing leaders, as Modi's policies and leadership style evoke diverse reactions. Despite this, a slight skew towards positive sentiment is consistent with his ability to appeal to a broad voter base, as suggested by his recent electoral success in a nation that has historically favoured moderate parties. The analysis suggests that public sentiment is shaped more by specific events and discourse than by consistent patterns, offering insight into the complexities of political perception.
The results from sentimentr, although not entirely precise, provided useful material for analysis. The algorithm's difficulty in accurately assessing some posts highlights the complexity of human emotion and expression.