Research Aim: To examine how Reddit users express positive and negative sentiment toward the Harry Potter series, focusing on both their engagement as a fandom and their reactions to specific characters, plot developments, and related content.

# Define package list
packages <- c(
  "tidyverse", "tidytext", "textclean", "wordcloud2",
  "quanteda", "quanteda.textstats", "stopwords",
  "ggplot2", "stringr", "SnowballC", "RedditExtractoR",
  "anytime", "magrittr", "httr", "igraph", "ggraph",
  "textdata", "here"
)

# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(!installed_packages)) {
  install.packages(packages[!installed_packages])
}

# Load packages
invisible(lapply(packages, library, character.only = TRUE))
# using keyword
threads_1 <- find_thread_urls(
  keywords = "Harry Potter",
  sort_by = "relevance",
  period = "all"
)

rownames(threads_1) <- NULL



# Sanitize text
threads_1 %<>% 
  mutate(across(
    where(is.character),
    ~ .x %>%
        str_replace_all("\\|", "/") %>%   # replace vertical bars
        str_replace_all("\\n", " ") %>%   # replace newlines
        str_squish()                      # clean up extra spaces
  ))

colnames(threads_1)
head(threads_1, 3) %>% knitr::kable()
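
The same sanitizing step is repeated for every data frame in this report. A minimal sketch of how it could be wrapped in one reusable function is shown here, written in base R so it has no package dependencies. The name `sanitize_text` and the demo data frame are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: applies the report's three cleaning steps
# (vertical bars, newlines, extra spaces) to every character column.
sanitize_text <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(x) {
    x <- gsub("|", "/", x, fixed = TRUE)   # replace vertical bars
    x <- gsub("\n", " ", x, fixed = TRUE)  # replace newlines
    x <- gsub("\\s+", " ", x)              # collapse runs of whitespace
    trimws(x)                              # drop leading/trailing spaces
  })
  df
}

# Made-up demo data, only to show the effect
demo <- data.frame(
  title = "a|b",
  text = "  line one\nline   two ",
  comments = 3L,
  stringsAsFactors = FALSE
)
clean <- sanitize_text(demo)
print(clean)
```

With such a helper, each later cleaning step would reduce to a single call, for example `threads_2 <- sanitize_text(threads_2)`.
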

Finding subreddits ranked by number of subscribers

# search for subreddits
subreddit_list <- RedditExtractoR::find_subreddits("harry potter fandom")
subreddit_list %>% 
  arrange(desc(subscribers)) %>% 
  .[1:25,c('subreddit','title','subscribers')] %>% 
  knitr::kable()
threads_1$subreddit %>% table() %>% sort(decreasing = TRUE) %>% head(20)

Looking at the top subreddits

# using subreddit
threads_2 <- find_thread_urls(subreddit = "harrypotter", 
                              sort_by = 'top', 
                              period = 'year') 
 

rownames(threads_2) <- NULL

# Sanitize text
threads_2 %<>% 
  mutate(across(
    where(is.character),
    ~ .x %>%
        str_replace_all("\\|", "/") %>% 
        str_replace_all("\\n", " ") %>%
        str_squish()
  ))

head(threads_2, 3) %>% knitr::kable()
# using both subreddit and keyword
threads_3 <- find_thread_urls(keywords= "Harry Potter", 
                              subreddit = "Movies", 
                              sort_by = 'relevance', 
                              period = 'all') 
 
rownames(threads_3) <- NULL

# Sanitize text
threads_3 %<>% 
  mutate(across(
    where(is.character),
    ~ .x %>%
        str_replace_all("\\|", "/") %>% 
        str_replace_all("\\n", " ") %>% 
        str_squish() 
  ))

head(threads_3, 3) %>% knitr::kable()
# get individual comments
threads_2_content <- get_thread_content(threads_2$url[1:4])

Looking at the up and downvote ratio

names(threads_2_content)

# check upvotes and downvotes
print(threads_2_content$threads[,c('upvotes','downvotes','up_ratio')])

None of the top four threads shows any downvotes, and the upvote ratio is always above 0.99, which suggests that most other redditors agree with these posts.

# Sanitize text
threads_2_content$comments %<>% 
  mutate(across(
    where(is.character),
    ~ .x %>%
        str_replace_all("\\|", "/") %>% 
        str_replace_all("\\n", " ") %>% 
        str_squish() 
  ))

head(threads_2_content$comments, 3) %>% knitr::kable()
# Read the saved CSV into a new object called thread_2
thread_2 <- read.csv("C:/Users/SHAMBHAVI SINHA/Downloads/threads_2.csv", stringsAsFactors = FALSE)

# View the first few rows to confirm
head(thread_2)
##     date_utc  timestamp
## 1 2025-07-17 1752765888
## 2 2025-08-31 1756604624
## 3 2025-08-13 1755112666
## 4 2025-07-15 1752572839
## 5 2025-09-15 1757906935
## 6 2025-08-27 1756289330
##                                                           title
## 1            Chris Columbus was just what 11 year old me needed
## 2                 I really like Prisoner of Azkaban's aesthetic
## 3                             Why did Lego release an empty set
## 4                             My Harry Potter inspired wedding!
## 5                      How did Dumbledore know Harry was there?
## 6 What rewritten scene (NOT omitted scene) annoys you the most?
##   text
## 1 
## 2 
## 3 
## 4 These are only preview pictures we are still waiting for the final ones. But I'm so happy I just want to share a little bit of our magical day!
## 5 How did Dumbledore know Harry and Ron were there at Hagrids hut in chamber of secrets under the cloak? Was it intuition, did he feel their presence? Could he see them or maybe he heard them? What do you think?
## 6 So I mean a scene where they used a similar amount of time, but just told it a different way to the books. So leaving out Gaunt memories etc. doesn't count. Mine is how they butchered Neville's most epic moment in the film. It would have taken the same amount of time, in fact I believe it could have been much less, to show exactly how it was in the book, which is infinitely better. Book: Harry tells Neville before going to the forest that killing the Snake is essential. When Harry is seen dead, Neville just fucking lunges for Voldemort like an absolute badass. Just goes for him. Voldemort body binds him, tells him as a pure blood they would love to have him on their side, otherwise he will die. Neville screams out that he'll join them when Hell freezes over. Voldemort says very well, puts the sorting hat on his head (to mock the old sorting system) and sets him on fire, to burn to dead while paralysed. The body binds him charm breaks, Neville whips out the sword and slashes Nagini's head off right next to Voldemort, who stands there looking like a shocked dumbass in front of all the death eaters. One of the best scenes in all the books. Movie: they changed it to Voldemort asks for people to change sides, Neville steps out and gives a slow, emotional speech to everyone about how Harry and others didn't die in vain, and they shouldn't give up the fight. Then he pulls the sword out of the hat to use instead of his wand, and stands there long enough for V to blast him backwards. Then later, he awakes in chaos and it is played for laughs that he is confused and bumbling around, happens upon Rob and Hermione being attacked by Nagini and kills her with the sword to defend them, not because he was attacking on Harry's word.
##     subreddit comments
## 1 harrypotter       87
## 2 harrypotter      183
## 3 harrypotter       90
## 4 harrypotter      205
## 5 harrypotter      780
## 6 harrypotter      728
##                                                                                                         url
## 1 https://www.reddit.com/r/harrypotter/comments/1m2ah4w/chris_columbus_was_just_what_11_year_old_me_needed/
## 2       https://www.reddit.com/r/harrypotter/comments/1n4igsw/i_really_like_prisoner_of_azkabans_aesthetic/
## 3                  https://www.reddit.com/r/harrypotter/comments/1mpds3o/why_did_lego_release_an_empty_set/
## 4                   https://www.reddit.com/r/harrypotter/comments/1m0dixm/my_harry_potter_inspired_wedding/
## 5            https://www.reddit.com/r/harrypotter/comments/1nhb9fj/how_did_dumbledore_know_harry_was_there/
## 6  https://www.reddit.com/r/harrypotter/comments/1n1czsd/what_rewritten_scene_not_omitted_scene_annoys_you/
# create new column: date
thread_2 %<>% 
  mutate(date = as.POSIXct(date_utc)) %>%
  filter(!is.na(date))

# number of threads per week (binwidth = 604800 seconds = 7 days)
thread_2 %>% 
  ggplot(aes(x = date)) +
  geom_histogram(color="black", position = 'stack', binwidth = 604800) +
  scale_x_datetime(date_labels = "%b %y",
                   breaks = seq(min(thread_2$date, na.rm = TRUE), 
                                max(thread_2$date, na.rm = TRUE), 
                                by = "1 month")) +
  theme_minimal()

The distribution of threads over the last year shows a sharp rise in August in discussion within the Harry Potter fandom. I looked online for the reason: it turns out that HBO had started filming its Harry Potter series, which sparked a large discussion among fans.
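
A simple count per month can back up what the histogram suggests. The sketch below uses only the six `date_utc` values printed by `head(thread_2)` above, so the counts are illustrative and not the result for the full data set. The object names are assumptions introduced for this example.

```r
# The six dates shown in the head() output above (not the full data set)
date_utc <- c("2025-07-17", "2025-08-31", "2025-08-13",
              "2025-07-15", "2025-09-15", "2025-08-27")

# Reduce each date to its year-month and count threads per month
month <- format(as.Date(date_utc), "%Y-%m")
threads_per_month <- sort(table(month), decreasing = TRUE)
print(threads_per_month)
```

Run on the full `thread_2$date_utc` column, the same two lines would give the monthly totals behind the plot.
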

# create new columns: day_of_week, is_weekend
thread_2 %<>%  
  mutate(day_of_week = wday(date, label = TRUE)) %>% 
  mutate(is_weekend = ifelse(day_of_week %in% c("Sat", "Sun"), "Weekend", "Weekday"))

# number of threads by day of week
thread_2 %>% 
  ggplot(aes(x = day_of_week, fill = is_weekend)) +
  geom_bar(color = 'black') +
  scale_fill_manual(values = c("Weekday" = "gray", "Weekend" = "pink")) + 
  theme_minimal()

The graph suggests that people posted about Harry Potter more over the weekend. This is easily explained by people having more leisure time then.
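
The weekend flag above compares against the labels "Sat" and "Sun", and `wday(label = TRUE)` produces those labels in the session's locale. A sketch of a locale-independent alternative in base R, using the numeric weekday instead of its label, is given below. It again uses only the six dates from the `head(thread_2)` output as sample input.

```r
# The six dates shown in the head() output above
dates <- as.Date(c("2025-07-17", "2025-08-31", "2025-08-13",
                   "2025-07-15", "2025-09-15", "2025-08-27"))

# POSIXlt stores the weekday as a number: 0 = Sunday, ..., 6 = Saturday
wday_num <- as.POSIXlt(dates)$wday
is_weekend <- ifelse(wday_num %in% c(0, 6), "Weekend", "Weekday")
print(data.frame(dates, wday_num, is_weekend))
```

Only 2025-08-31, a Sunday, is flagged as a weekend among these six sample dates.
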

print(thread_2$timestamp[1])
## [1] 1752765888
print(thread_2$timestamp[1] %>% anytime(tz = anytime:::getTZ()))
## [1] "2025-07-17 11:24:48 EDT"
thread_2 %<>%  
  mutate(time = timestamp %>% 
           anytime(tz = anytime:::getTZ()) %>% 
           hour())  # lubridate::hour() returns the local hour (0-23)
# number of threads by time of day
thread_2 %>% 
  ggplot(aes(x = time)) +
  geom_histogram(bins = 24, color = 'black') +
  scale_x_continuous(breaks = seq(0, 24, by=2)) + 
  theme_minimal()

Activity peaks between 9 am and 1 pm (in the local time zone used for the conversion above), when redditors are most actively talking about Harry Potter.
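
The hour of day depends on the time zone used for the conversion, so the peak hours shift if the report is run elsewhere. The base R sketch below converts the first timestamp printed above with an explicit time zone. The zone name "America/New_York" is an assumption based on the EDT label in the output.

```r
# First timestamp from the output above (seconds since 1970-01-01 UTC)
ts <- 1752765888
t_utc <- as.POSIXct(ts, origin = "1970-01-01", tz = "UTC")

# Same instant, read as an hour of day in two time zones
hour_utc <- as.integer(format(t_utc, "%H"))
# Assumed zone; matches the 11:24:48 EDT printed above
hour_ny <- as.integer(format(t_utc, "%H", tz = "America/New_York"))

print(c(utc = hour_utc, new_york = hour_ny))
```

Fixing the time zone explicitly, instead of relying on the machine's setting, would make the hourly histogram reproducible across machines.
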

library(RedditExtractoR)
library(dplyr)
library(stringr)
library(magrittr)

# Try calling the subreddit safely
threads_2 <- tryCatch(
  {
    find_thread_urls(
      subreddit = "harrypotter",  # MUST NOT CONTAIN SPACES
      sort_by   = "top",
      period    = "year"
    )
  },
  error = function(e) {
    stop("Reddit API call failed: ", e$message)
  }
)

# Check the result
if (is.null(threads_2) || nrow(threads_2) == 0) {
  stop("threads_2 is empty. Please check your internet connection or subreddit name.")
}

rownames(threads_2) <- NULL

# Clean text fields
threads_2 <- threads_2 %>% 
  mutate(across(
    where(is.character),
    ~ .x %>%
      str_replace_all("\\|", "/") %>%
      str_replace_all("\\n", " ") %>%
      str_squish()
  ))

print("threads_2 successfully created.")
head(threads_2)
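
The `tryCatch()` call above stops at the first failure. If the Reddit request fails only intermittently, one option is to retry a few times before giving up. The following is a minimal base R sketch of that idea. The helper `with_retry` and the `flaky` test function are hypothetical and are not part of RedditExtractoR.

```r
# Hypothetical helper: call f() up to `attempts` times, pausing `wait`
# seconds between tries, and return the first successful result.
with_retry <- function(f, attempts = 3, wait = 0) {
  stopifnot(attempts >= 1)
  for (i in seq_len(attempts)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    if (i < attempts) Sys.sleep(wait)
  }
  stop("All ", attempts, " attempts failed: ", conditionMessage(result))
}

# Stand-in for a network call that fails twice and then succeeds
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("temporary failure")
  "ok"
}

out <- with_retry(flaky)
print(out)
```

In this report the wrapped function would be the `find_thread_urls()` call, for example `with_retry(function() find_thread_urls(subreddit = "harrypotter", sort_by = "top", period = "year"), wait = 5)`.
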
# Word tokenization
# tidytext is already installed and loaded above, so it is not reinstalled here
words <- thread_2 %>% 
  unnest_tokens(output = word, input = text, token = "words")

words %>%
  count(word, sort = TRUE) %>%
  top_n(20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(x = "words",
       y = "counts",
       title = "Top 20 Word Counts")
## Selecting by n

library(tidytext)

# load list of stop words - from the tidytext package
data("stop_words")
# view 100 random stop words
print(stop_words$word[sample(1:nrow(stop_words), 100)])
##   [1] "oh"           "didn't"       "himself"      "these"        "tell"        
##   [6] "inasmuch"     "three"        "take"         "also"         "c"           
##  [11] "us"           "saw"          "own"          "against"      "which"       
##  [16] "downwards"    "what's"       "sees"         "now"          "about"       
##  [21] "later"        "give"         "like"         "should"       "come"        
##  [26] "she"          "i'd"          "working"      "going"        "particularly"
##  [31] "has"          "under"        "welcome"      "man"          "wants"       
##  [36] "say"          "alone"        "whence"       "were"         "keeps"       
##  [41] "near"         "beside"       "down"         "i've"         "four"        
##  [46] "anywhere"     "otherwise"    "thought"      "generally"    "good"        
##  [51] "available"    "go"           "so"           "placed"       "hasn't"      
##  [56] "towards"      "pointing"     "made"         "indicate"     "changes"     
##  [61] "its"          "parted"       "longest"      "together"     "on"          
##  [66] "formerly"     "nevertheless" "with"         "seems"        "secondly"    
##  [71] "ie"           "downed"       "part"         "before"       "cant"        
##  [76] "ever"         "smallest"     "currently"    "clearly"      "concerning"  
##  [81] "only"         "merely"       "this"         "an"           "thus"        
##  [86] "better"       "into"         "himself"      "nowhere"      "whereafter"  
##  [91] "against"      "members"      "use"          "whatever"     "wanted"      
##  [96] "twice"        "i've"         "order"        "whole"        "place"
# load required packages (install first if needed)
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
if (!requireNamespace("stringr", quietly = TRUE)) install.packages("stringr")
if (!requireNamespace("tidytext", quietly = TRUE)) install.packages("tidytext")
if (!requireNamespace("glue", quietly = TRUE)) install.packages("glue")

library(dplyr)      # provides %>% and data-manip verbs
library(stringr)    # str_replace_all, str_detect
library(tidytext)   # unnest_tokens, stop_words
library(glue)       # glue()

# sanity check: ensure thread_2 exists
if (!exists("thread_2")) stop("Object 'thread_2' not found. Create thread_2 before tokenizing.")

# Optional: ensure `text` column exists
if (!"text" %in% names(thread_2)) stop("thread_2 does not contain a 'text' column. Create or rename the text column first.")

# Regex that matches URL-type string
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"

# Tokenize into words (keep an initial 'words' object for comparison)
words <- thread_2 %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  unnest_tokens(word, text, token = "words")

# Prepare stop_words (from tidytext); it's available once tidytext is loaded
data("stop_words")  # safe to call; tidytext must be loaded

# Cleaned words: remove URLs, tokenize, remove stop words and non-alphabet tokens
words_clean <- thread_2 %>% 
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  unnest_tokens(word, text, token = "words") %>% 
  anti_join(stop_words, by = "word") %>% 
  filter(str_detect(word, "[a-z]")) 

# Print counts before / after
cat(glue("Before: {nrow(words)}, After: {nrow(words_clean)}\n"))
## Before: 5146, After: 1784
words_clean %>%
  count(word, sort = TRUE) %>%
  top_n(20, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(x = "words",
       y = "counts",
       title = "Top 20 word counts after removing stop words")

words %>% 
  count(word, sort = TRUE) %>% 
  wordcloud2()
words_clean %>% 
  count(word, sort = TRUE) %>% 
  wordcloud2()

“Harry” is the largest word, showing that he is the central topic of discussion. Other prominent words such as “Potter,” “Voldemort,” “Hermione,” “Ron,” “books,” “movies,” “Hogwarts,” “Snitch,” “Weasley,” and “Dumbledore” reflect key characters, objects, and themes from the series.

Words like “time,” “team,” “found,” “game,” and “family” suggest that discussions also touch on plot events, relationships, and major story elements. The mixture of character names, magical terms, and story-related nouns shows that users engage both with the narrative content (books, movies, events) and with the characters themselves.
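
The word sizes in the cloud come from simple frequency counts after stop words are dropped. A dependency-free sketch of that pipeline is shown below. The sentence and the two-word stop list are made up for illustration. The report itself uses `unnest_tokens()` and the much larger `stop_words` table from tidytext.

```r
# Made-up sentence, only to illustrate the counting step
text <- "Harry and Ron found the Snitch, and Harry kept the Snitch"

# Tokenize: split on anything that is not a letter or apostrophe, lowercase
tokens <- tolower(unlist(strsplit(text, "[^A-Za-z']+")))
tokens <- tokens[nzchar(tokens)]

# Tiny illustrative stop list (tidytext's stop_words is far longer)
mini_stop <- c("and", "the")
kept <- tokens[!tokens %in% mini_stop]

# Frequency table, most frequent first: this is what the cloud visualizes
word_counts <- sort(table(kept), decreasing = TRUE)
print(word_counts)
```

Here 11 tokens shrink to 7 after filtering, which is the same effect as the drop from 5146 to 1784 words reported above, on a much smaller scale.
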

n <- 20 # number of words with color
h <- runif(n, 0, 1) # any color
s <- runif(n, 0.6, 1) # vivid
v <- runif(n, 0.3, 0.7) # neither too dark nor too bright

df_hsv <- data.frame(h = h, s = s, v = v)
pal <- apply(df_hsv, 1, function(x) hsv(x['h'], x['s'], x['v']))
pal <- c(pal, rep("grey", 10000))
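
Because the palette is drawn with `runif()`, the colors change on every run. A sketch of a reproducible variant is given below. It fixes the random seed and calls `hsv()` directly on the vectors, since `hsv()` is vectorised. The seed value 42, the shorter grey tail, and the names `n_colored` and `pal_demo` are arbitrary choices for this example.

```r
set.seed(42)     # arbitrary seed so the same colors come back on every run
n_colored <- 20  # number of top words that get their own color

# hsv() is vectorised, so no row-wise apply() is needed
pal_demo <- hsv(
  h = runif(n_colored, 0, 1),      # any hue
  s = runif(n_colored, 0.6, 1),    # vivid
  v = runif(n_colored, 0.3, 0.7)   # neither too dark nor too bright
)

# All remaining words are drawn in grey (shortened tail for the demo)
pal_demo <- c(pal_demo, rep("grey", 100))
print(head(pal_demo, 3))
```

The first 20 entries are hex color strings for the most frequent words, and every later word falls back to grey.
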

Clean word cloud with less relevant words greyed out

words_clean %>% 
  count(word, sort = TRUE) %>% 
  wordcloud2(color = pal, 
             minRotation = 0, 
             maxRotation = 0, 
             ellipticity = 0.8)

This cloud shows a lot of discussion around the golden trio, as well as the side characters and the antagonist Voldemort.

# Get ngrams. You may try playing around with the value of n, n=3, n=4
# Load required packages
library(dplyr)
library(stringr)
library(tidytext)

# Define regex to clean text
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"

# Create ngrams (e.g., n = 3)
words_ngram <- thread_2 %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  select(text) %>%
  unnest_tokens(output = paired_words,
                input = text,
                token = "ngrams",
                n = 3)
# Show ngrams with sorted values
words_ngram %>%
  count(paired_words, sort = TRUE) %>% 
  head(20) %>% 
  knitr::kable()
| paired_words        |   n |
|:--------------------|----:|
| NA                  | 154 |
| harry potter and    |   8 |
| potter and the      |   7 |
| a lot of            |   6 |
| in the book         |   6 |
| for the first       |   4 |
| harry and hermione  |   4 |
| one of the          |   4 |
| the first time      |   4 |
| and i thought       |   3 |
| catching the snitch |   3 |
| do you think        |   3 |
| in the movies       |   3 |
| it turned out       |   3 |
| your team is        |   3 |
| 295,000 matches and |   2 |
| a few hours         |   2 |
| able to turn        |   2 |
| actually looks like |   2 |
| amount of time      |   2 |
#separate the paired words into three columns
words_ngram_pair <- words_ngram %>%
  separate(paired_words, c("word1", "word2", "word3"), sep = " ")

# filter rows where word1, word2, or word3 is a stop word
words_ngram_pair_filtered <- words_ngram_pair %>%
  # drop stop words
  filter(!word1 %in% stop_words$word & !word2 %in% stop_words$word & !word3 %in% stop_words$word) %>% 
  # drop non-alphabet-only strings
  filter(str_detect(word1, "[a-z]") & str_detect(word2, "[a-z]") & str_detect(word3, "[a-z]"))

# Filter out words that are not encoded in ASCII
# To see what's ASCII, google 'ASCII table'
library(stringi)
library(dplyr)
library(magrittr)  # for %<>%
words_ngram_pair_filtered %<>% 
  filter(stri_enc_isascii(word1) & 
         stri_enc_isascii(word2) & 
         stri_enc_isascii(word3))

# Sort the new tri-gram (n=3) counts:
words_counts <- words_ngram_pair_filtered %>%
  count(word1, word2, word3) %>%
  arrange(desc(n))

head(words_counts, 20) %>% 
  knitr::kable()
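| word1      | word2    | word3          |   n |
|:-----------|:---------|:---------------|----:|
| alan       | rickman  | writes         |   1 |
| arrest     | harry    | isn            |   1 |
| art        | credit   | lulusketches   |   1 |
| art        | credit   | wizardingworld |   1 |
| bad        | quality  | pic            |   1 |
| beaters    | bludgers | chasers        |   1 |
| birthday   | cake     | sorting        |   1 |
| blood      | prince   | harry          |   1 |
| bludgers   | chasers  | keepers        |   1 |
| book       | harry    | tells          |   1 |
| books      | creative | liberty        |   1 |
| breaks     | neville  | whips          |   1 |
| broke      | af       | egypt          |   1 |
| butterbeer | birthday | cake           |   1 |
| butterbeer | pumpkin  | juice          |   1 |
| cake       | sorting  | hat            |   1 |
| charm      | breaks   | neville        |   1 |
| christmas  | holidays | painting       |   1 |
| cold       | butterbeer | pumpkin      |   1 |
| cold       | wet      | draughty       |   1 |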
# plot word network
words_counts %>%
  filter(n >= 1) %>%
  graph_from_data_frame() %>% # convert to graph
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_width = n), edge_alpha = 0.6) +
  geom_node_point(color = "darkslategray4", size = 3) +
  geom_node_text(aes(label = name), vjust = 1.8) +
  labs(title = "Word Networks",
       x = "", y = "")

The trigrams do not repeat: every filtered trigram occurs only once, so I am not sure how well the network represents the correlations among the themes discussed in the series. The strength of the connections between themes does not show a clear pattern to me, although the network does seem to merge into a cluster around “Potter.” The results were significantly different when I tried two-word pairs.
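
Why trigrams repeat less often than bigrams can be seen from how n-grams are built: a window of n words slides over the text, and longer windows are less likely to match twice. The base R sketch below illustrates this on a made-up token sequence. The function `make_ngrams` is a hypothetical stand-in for what `unnest_tokens(token = "ngrams")` does, not its actual implementation.

```r
# Hypothetical helper: slide a window of n tokens over the sequence
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))  # too short for any n-gram
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Made-up token sequence for illustration
tokens <- c("harry", "potter", "and", "ron", "love", "harry", "potter", "books")

bigrams <- make_ngrams(tokens, 2)
trigrams <- make_ngrams(tokens, 3)

print(sort(table(bigrams), decreasing = TRUE))
print(sort(table(trigrams), decreasing = TRUE))
```

In this toy sequence "harry potter" occurs twice as a bigram, while all six trigrams are unique. A text shorter than n words yields no n-gram at all, which may be why posts with an empty body show up as the `NA` row in the tables above.
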

2-word pair

# Get ngrams. You may try playing around with the value of n, n=3, n=4
# Load required packages
library(dplyr)
library(stringr)
library(tidytext)

# Define regex to clean text
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"

# Create ngrams (here n = 2, i.e. bigrams)
words_ngram <- thread_2 %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  select(text) %>%
  unnest_tokens(output = paired_words,
                input = text,
                token = "ngrams",
                n = 2)
# Show ngrams with sorted values
words_ngram %>%
  count(paired_words, sort = TRUE) %>% 
  head(20) %>% 
  knitr::kable()
|paired_words |   n|
|:------------|---:|
|NA           | 153|
|in the       |  27|
|of the       |  26|
|harry potter |  16|
|the books    |  12|
|to the       |  12|
|and i        |  11|
|and the      |  11|
|for the      |  11|
|it was       |  10|
|on the       |  10|
|the snitch   |  10|
|but i        |   9|
|the movies   |   9|
|potter and   |   8|
|the book     |   8|
|the first    |   8|
|he was       |   7|
|i m          |   7|
|it s         |   7|
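
The `NA` row at the top of this table most likely comes from posts whose text is missing or shorter than two words, for which no bigram can be formed. A minimal sketch of dropping those rows before counting, using a hypothetical toy data frame `toy_posts` rather than `thread_2`:

```r
library(dplyr)
library(tidytext)

# Toy posts: one long enough for bigrams, one single word, one empty (assumed examples)
toy_posts <- data.frame(text = c("the sorting hat sings", "wow", ""))

toy_ngrams <- toy_posts %>%
  unnest_tokens(output = paired_words, input = text, token = "ngrams", n = 2)

# Keep only rows where a bigram was actually produced
toy_ngrams_clean <- toy_ngrams %>%
  filter(!is.na(paired_words))
```

In this analysis the `NA` rows disappear later anyway, because `separate()` yields `NA` words that fail the `str_detect()` filter, but removing them early keeps the raw frequency table easier to read.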
# separate the paired words into two columns
words_ngram_pair <- words_ngram %>%
  separate(paired_words, c("word1", "word2"), sep = " ")

# filter rows where there are stop words under word 1 column and word 2 column
words_ngram_pair_filtered <- words_ngram_pair %>%
  # drop stop words
  filter(!word1 %in% stop_words$word & !word2 %in% stop_words$word ) %>% 
  # drop non-alphabet-only strings
  filter(str_detect(word1, "[a-z]") & str_detect(word2, "[a-z]"))

# Filter out words that are not encoded in ASCII
# To see what's ASCII, google 'ASCII table'
library(stringi)
library(dplyr)
library(magrittr)  # for %<>%
words_ngram_pair_filtered %<>% 
  filter(stri_enc_isascii(word1) & 
         stri_enc_isascii(word2))
        

# Sort the new bi-gram (n=2) counts:
words_counts <- words_ngram_pair_filtered %>%
  count(word1, word2) %>%
  arrange(desc(n))

head(words_counts, 20) %>% 
  knitr::kable()
|word1     |word2      |  n|
|:---------|:----------|--:|
|harry     |potter     | 16|
|truth     |serum      |  4|
|death     |eaters     |  3|
|art       |credit     |  2|
|avada     |kedavra    |  2|
|body      |binds      |  2|
|daniel    |radcliffe  |  2|
|death     |eater      |  2|
|diagon    |alley      |  2|
|forbidden |forest     |  2|
|lightning |bolt       |  2|
|predicted |harry      |  2|
|red       |shirt      |  2|
|snake     |form       |  2|
|sorting   |hat        |  2|
|triwizard |tournament |  2|
|2,5kgs    |5,5lbs     |  1|
|25k       |cash       |  1|
|2nd       |income     |  1|
|3rd       |instalment |  1|
# plot word network
words_counts %>%
  filter(n >= 3) %>%
  graph_from_data_frame() %>% # convert to graph
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_width = n), edge_alpha = .6) +  # constant alpha belongs outside aes()
  geom_node_point(color = "darkslategray4", size = 3) +
  geom_node_text(aes(label = name), vjust = 1.8) +
  labs(title = "Word Networks",
       x = "", y = "")

The redditors mostly talk about "Harry Potter" as a generic term (16 occurrences), while "truth serum" (4) and "death eaters" (3) also attract some attention. I would have expected more of the main themes to appear, such as the Golden Trio or the Marauder's Map.
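
To check directly whether an expected theme occurs at all, the bigram counts can be joined against a short list of target pairs. The sketch below uses a toy subset of the table above (`bigram_counts`) and a hypothetical lookup list (`expected_pairs`); the zero counts it returns reflect only this toy subset, not the full thread.

```r
library(dplyr)

# Toy subset of the bigram counts (values copied from the table above)
bigram_counts <- data.frame(
  word1 = c("harry", "truth", "death", "sorting"),
  word2 = c("potter", "serum", "eaters", "hat"),
  n     = c(16L, 4L, 3L, 2L)
)

# Hypothetical list of themes to look for (lowercase, as produced by tokenisation)
expected_pairs <- data.frame(
  word1 = c("golden", "marauders", "sorting"),
  word2 = c("trio", "map", "hat")
)

# Pairs that never occur get a count of 0 instead of NA
theme_counts <- expected_pairs %>%
  left_join(bigram_counts, by = c("word1", "word2")) %>%
  mutate(n = coalesce(n, 0L))

theme_counts
```

On the real data, `bigram_counts` would be replaced by `words_counts`. Spelling variants of a theme would need their own rows in the lookup list.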

#############sentiment analysis######################################################

Dictionary Method

# Package names
packages <- c(
  "RedditExtractoR", "anytime", "magrittr", "httr",
  "tidytext", "tidyverse", "igraph", "ggraph",
  "wordcloud2", "textdata", "here", "sentimentr", "glue"
)

installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages])
}

invisible(lapply(packages, library, character.only = TRUE))
## Warning: package 'sentimentr' was built under R version 4.5.2
#Combine title and text for sentiment analysis
thread_2 <- thread_2 %>%
  mutate(
    title  = replace_na(title, ""),
    text   = replace_na(text, ""),
    text_all = stringr::str_c(title, text, sep = ". ")
  )

# Run sentiment analysis (dictionary + negation-aware)
sent_res <- sentiment_by(thread_2$text_all)

# Attach back to main data frame
thread_2$sentiment  <- sent_res$ave_sentiment
thread_2$word_count <- sent_res$word_count

summary(thread_2$sentiment)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.75000 -0.04276  0.00000  0.05532  0.19823  0.77942
library(sentimentr)

# 1. Keep only valid non-empty rows
threads_2_valid <- thread_2[!is.na(thread_2$text) & thread_2$text != "", ]

# 2. Break into sentences
sentences_2 <- get_sentences(threads_2_valid$text)

# 3. Compute sentiment PER SENTENCE (element_id keeps mapping)
sent_df <- sentiment(sentences_2)

# 4. Aggregate back to ONE score per original text
sentiment_per_text <- aggregate(sent_df$sentiment, 
                                by = list(sent_df$element_id),
                                FUN = mean)

# Rename columns
colnames(sentiment_per_text) <- c("row_id", "sentiment")

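# 5. Attach sentiment back to threads_2_valid, matching on element_id
#    instead of relying on row order
threads_2_valid$sentiment <- sentiment_per_text$sentiment[
  match(seq_len(nrow(threads_2_valid)), sentiment_per_text$row_id)
]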

# 6. Sample 10 rows
set.seed(123)
sample_rows <- threads_2_valid[sample(1:nrow(threads_2_valid), 10), ]

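# 7. Intended to extract the first sentence. Note: this pattern matches from the
#    first period to the end of the string and substitutes the same text back,
#    so first_sentence holds the full post text (as the output shows)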
sample_rows$first_sentence <- sub("(\\..*).*", "\\1", sample_rows$text)
sample_rows$first_sentence <- ifelse(sample_rows$first_sentence == "",
                                     sample_rows$text,
                                     sample_rows$first_sentence)

# 8. Output
sample_output_2 <- sample_rows[, c("first_sentence", "sentiment")]
sample_output_2
##     sentiment   first_sentence
## 64   0.00000000  Can you reverse the obliviate spell?
## 199 -0.25000000  No, really. Is he an idiot?
## 114 -0.02099398  I know the lady who could turn into her snake form (Nagini) do it on command, but her curse would eventually make her into a Snake permanently and not being able to turn back to human again, So this begs the question, how was Bathilda Bagshot able to turn into Nagini the snake? - unless Bathilda is the old lady Harry and Hermione met in Grodrics Hollow from The Grimes Of Grindlewald, but as I mentioned the curse the lady had from Fantastic Beasts would eventually take control over her and force her to remain in Snake form forever.
## 25   0.17182679  I don't hear people talking abt it much. That scene was hilarious!!! Slughorn: Harry!! Potter: SiRrRrrRr! Slughorn: I can't let you roam around by yourself Harry: Well then by all means cOme AloNG sir! Aragog death Not to mention the pincers.
## 172  0.12001020  Thanks to u/distinct-leather-382 for letting me know about these screenings. Total dream moment to be seeing this in theaters again after all these years!
## 92   0.54858364  It's still a work in progress and needs more updates and wall decor but I'm so proud of how it's turned our so far!!
## 113  0.05961321  As it is now, 99.9% of Quidditch games are determined by the seeker. Beaters, bludgers, chasers, keepers… all of it is just extra fluff when catching the snitch gets you 150 points and ends the game. Honestly, it was such a lazy way of making Harry so central and important to the team. BUT… one tiny change makes the entire game more compelling and challenging while making the entire team useful: NO POINTS FOR THE SNITCH. Catching the snitch *only* ends the game. Hear me out: The way it’s written, catching the snitch is something to always strive for, because you’re gonna win the game. Period. In 7 books, only ONE exception to that was ever mentioned. But think of how it plays out if you can ONLY catch the snitch when your team is up because if you catch it when your team is down, you lose the game for your team. So the seeker for the team that currently has the most points looks for the snitch as normal. But the other seeker has to try to keep the snitch in play until their team can score more goals. So, if the snitch is flying in Harry’s face but Gryffindor is down a goal, he can’t just catch it. But he has to make sure that neither do the opponents. And If, during the struggle to keep the other seeker from the snitch, Gryffindor scores a goal, then the objectives of the two seekers have to change (I guess this would also mean that, in the event of a tie, the team that caught the snitch gets the tie-break). This makes the whole thing more exciting and allows the rest of the players to be just as important to the game as the seeker. EDIT TO ADD: A lot of comments in here about how 150 points isn’t all that big a deal, like being 15 goals ahead is nothing special. Well, this view overlooks a couple of things: 1) If your team is down by anything near 15 goals, they absolutely don’t deserve to win because one guy grabs a tiny ball. That’s just… unsportsmanlike (pardon the gendered term). And 2) Quidditch is *very* clearly modeled on football (or “soccer” to Americans), in which goals are pretty rare and scores tend on the low end (the most common score in football is actually 1-1, happening 11% of the time). I went to a site called FootyStats, which analyzed nearly 295,000 matches and posted the instances of the various score outcomes. A 15 goal spread happened exactly TWICE out of those 295,000 matches. And both instances were 15-0, so clearly cases where one of the teams was seriously outclassed in probably every metric. Doesn’t quite seem fair, then, that those outclassed teams should pull out a win because someone finds a golf ball on the pitch, does it?
## 95   0.00571878  I'm listening to the audiobook and had a revelation this morning Harry asks her how she recognized him and she said by his expression. and I FINALLY REALIZED. Luna has always been far more perceptive than anybody gives her credit for, but this one . . . After 16 years of seeing the various scandalized, curious, or exasperated expressions of people who meet her for the first time or meet her again and again. And we find out later about the ceiling mural . . . She knows instantly when she sees how Harry is looking at her that he is her friend and makes the obvious conclusion. They were never romantically involved, the look on his face wouldn't have been one of love or adoration. It must have simply been a look of complete acceptance and welcome. And that must have meant the world to her.
## 223  0.00000000  This healed a part of me.
## 49  -0.11926030  GOF was the first HP I read and I was instantly hooked. I picked up the first 3 books and devoured them instantly. From Moody's introduction, to Dumbledore's memories to recalling all those little snippets such as "If there's one thing I hate more than any other, it's a Death Eater who walked free", everything leading up to it was bloody brilliant

The dictionary-based sentiment analysis of the selected Reddit posts shows that the method performs reliably when the text contains clear emotional cues but struggles with more nuanced or context-dependent expressions. Sentences that include explicit positive language, such as “I’m so proud of how it’s turned out” or references to humorous or exciting moments, receive distinctly positive scores, reflecting their enthusiastic tone. Similarly, strongly negative wording, such as calling someone an “idiot,” is captured through moderately negative values. However, many posts in the dataset are narrative, descriptive, or analytical in nature, which leads to scores clustering around zero because the dictionary method relies on surface-level indicators. As a result, emotionally meaningful but implicitly positive statements, such as “This healed a part of me,” are misclassified as neutral, demonstrating the method’s limitations in detecting subtle sentiment or contextual meaning. Overall, the results indicate that while dictionary methods can identify overt sentiment, they often underrepresent the emotional depth of fan discussions, especially when feelings are implied rather than directly stated.
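To make this limitation concrete, the toy scorer below mimics a bare-bones dictionary approach. The four-word lexicon, its weights, and the `toy_score()` helper are hypothetical and far simpler than the lexicon used in the analysis; the sketch only illustrates that a sentence containing no lexicon words scores exactly zero.

```r
# Hypothetical mini-lexicon; real lexicons hold thousands of weighted entries
toy_lexicon <- c(proud = 1, brilliant = 1, idiot = -1, ridiculous = -1)

# Average lexicon weight over all words in a sentence (0 if no word matches)
toy_score <- function(sentence) {
  words <- strsplit(tolower(gsub("[^A-Za-z' ]", " ", sentence)), " +")[[1]]
  words <- words[words != ""]
  hits <- toy_lexicon[words]          # NA for words not in the lexicon
  sum(hits, na.rm = TRUE) / length(words)
}

examples <- c(
  "I'm so proud of how it's turned out",
  "He is an idiot",
  "This healed a part of me"
)
toy_scores <- sapply(examples, toy_score, USE.NAMES = FALSE)
print(toy_scores)
```

The third sentence is clearly positive to a human reader, yet it contains no lexicon entry, so the score is zero.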

#8. Discuss intriguing insights derived from the sentiment analysis, supporting your observations with at least THREE plots.
library(ggplot2)

ggplot(threads_2_valid, aes(x = sentiment)) +
  geom_density(fill = "blue", alpha = 0.35) +
  labs(
    title = "Density of Sentiment Scores (Reddit Threads)- Dictionary",
    x = "Sentiment Score",
    y = "Density"
  ) +
  theme_minimal(base_size = 14)

# Add sentiment label
sentiment_cat_2 <- threads_2_valid %>%
  mutate(
    sentiment_label = case_when(
      sentiment > 0 ~ "Positive",
      sentiment < 0 ~ "Negative",
      TRUE ~ "Neutral"
    )
  ) %>%
  filter(sentiment_label != "Neutral")

ggplot(sentiment_cat_2, aes(x = sentiment, fill = sentiment_label)) +
  geom_histogram(bins = 30, alpha = 0.6, position = "identity", color = "white") +
  scale_fill_manual(values = c("Positive" = "blue", "Negative" = "firebrick")) +
  labs(
    title = "Distribution of Positive vs Negative Sentiment Scores",
    x = "Sentiment Score",
    y = "Count",
    fill = "Sentiment Type"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", size = 18, hjust = 0.5),
    legend.title = element_text(face = "bold")
  )

Deep Learning Model Results

options(repos = c(CRAN = "https://cran.rstudio.com/"))

# Install sentimentr only if it is missing (ggplot2 is already loaded above)
if (!requireNamespace("sentimentr", quietly = TRUE)) install.packages("sentimentr")

# Load the package
library(sentimentr)

# Score each Reddit post title in thread_2 with the dictionary method
sentiment_scores <- sentiment_by(thread_2$title)
head(sentiment_scores)
## Key: <element_id>
##    element_id word_count    sd ave_sentiment
##         <int>      <int> <num>         <num>
## 1:          1          9    NA  0.000000e+00
## 2:          2          7    NA -2.098124e-17
## 3:          3          7    NA -9.449112e-02
## 4:          4          5    NA  2.236068e-01
## 5:          5          7    NA  0.000000e+00
## 6:          6         10    NA  1.581139e-01
# Package names
packages <- c("tidytext", "tidyverse", "textdata", "anytime", "magrittr", "wordcloud2",
              "syuzhet", "sentimentr", "lubridate", "here")

# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(!installed_packages)) {
  install.packages(packages[!installed_packages])
}

I used the Google Colab notebook to run my threads_2 file through the BERT model and obtain a sentiment label for each thread.

# import the data
reddit_sentiment <- read_csv("C:/Users/SHAMBHAVI SINHA/Downloads/sample_reddit_bert.csv")
## New names:
## Rows: 249 Columns: 10
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (5): title, text, subreddit, url, bert_label dbl (4): ...1, timestamp,
## comments, bert_score date (1): date_utc
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
# drop NAs
reddit_sentiment %<>% drop_na('bert_label')

2-2. Comparison with the dictionary method

Get sentiment scores using the dictionary method for comparison.

# Join thread title and text.
reddit_sentiment %<>%
  mutate(title = replace_na(title, ""),
         text = replace_na(text, ""),
         title_text = str_c(title, text, sep = ". "))

# dictionary method
reddit_sentiment_dictionary <- sentiment_by(reddit_sentiment$title_text)

reddit_sentiment$sentiment_dict <- reddit_sentiment_dictionary %>% pull(ave_sentiment)
reddit_sentiment$word_count <- reddit_sentiment_dictionary %>% pull(word_count)
reddit_sentiment %<>% mutate(bert_label_numeric = str_sub(bert_label, 1, 1) %>% as.numeric())

cor(reddit_sentiment$bert_label_numeric, reddit_sentiment$sentiment_dict)
## [1] 0.5333322

This is a moderate positive correlation. It suggests that, in general, higher dictionary-based sentiment scores tend to align with more positive BERT labels, but the two methods are not perfectly consistent. The difference is expected because BERT can capture context, sarcasm, and multi-word expressions better than a dictionary method, which scores text mainly from the individual words found in its lexicon.
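Because the BERT labels are ordinal star ratings rather than a continuous score, a rank-based correlation is a reasonable robustness check alongside Pearson's r. The sketch below uses hypothetical toy vectors, not the thread data, to show the two calls side by side.

```r
# Hypothetical toy data: star ratings (ordinal) and dictionary scores
stars <- c(1, 1, 2, 3, 3, 4, 5, 5)
dict  <- c(-0.40, -0.05, 0.00, -0.10, 0.10, 0.05, 0.30, 0.60)

pearson  <- cor(stars, dict)                       # assumes a linear relationship
spearman <- cor(stars, dict, method = "spearman")  # uses ranks; ties get average ranks
print(c(pearson = pearson, spearman = spearman))
```

On the real data the equivalent call would be `cor(reddit_sentiment$bert_label_numeric, reddit_sentiment$sentiment_dict, method = "spearman")`.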

library(ggplot2)

# Dark theme similar to ggdark::dark_theme_grey()
theme_dark_grey_like <- function(base_size = 12, base_family = "") {
  theme_grey(base_size = base_size, base_family = base_family) %+replace%
    theme(
      panel.background = element_rect(fill = "#222222", color = NA),
      plot.background = element_rect(fill = "#222222", color = NA),
      panel.grid.major = element_line(color = "#444444", linewidth = 0.5),
      panel.grid.minor = element_line(color = "#333333", linewidth = 0.25),
      axis.text = element_text(color = "white"),
      axis.title = element_text(color = "white", face = "bold"),
      plot.title = element_text(color = "white", size = 16, face = "bold", hjust = 0.5),
      plot.subtitle = element_text(color = "white", size = 12, hjust = 0.5),
      legend.background = element_rect(fill = "#222222"),
      legend.key = element_rect(fill = "#222222"),
      legend.text = element_text(color = "white"),
      legend.title = element_text(color = "white", face = "bold")
    )
}

# Plot using this dark theme
ggplot(data = reddit_sentiment, aes(x = bert_label_numeric, y = sentiment_dict)) +
  geom_jitter(width = 0.1, height = 0, color = "skyblue") +
  geom_hline(yintercept = 0, color = '#FFD700', lwd = 1, linetype = 'dashed') +
  theme_dark_grey_like()

As bert_label_numeric increases, the dictionary-based sentiment tends to shift upwards, consistent with the positive correlation of ~0.53: higher BERT labels generally correspond to higher dictionary scores. However, each BERT category has a wide spread of dictionary scores. For example, bert_label_numeric = 3 shows dictionary scores ranging from negative to slightly positive. This reflects differences in context handling: the dictionary method can miss nuances captured by BERT. Many points cluster around 0, showing that the dictionary often returns near-neutral sentiment even when BERT indicates stronger sentiment. Some dictionary scores are strongly negative or positive while BERT is moderate, highlighting misalignment for certain texts, likely cases of sarcasm, negation, or complex phrasing.
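The spread within each BERT category can also be summarised numerically. The sketch below computes the mean and standard deviation of dictionary scores per star level; the vectors are hypothetical stand-ins for `bert_label_numeric` and `sentiment_dict`, not the actual thread data.

```r
# Hypothetical toy data standing in for bert_label_numeric and sentiment_dict
stars <- c(1, 1, 1, 3, 3, 3, 5, 5, 5)
dict  <- c(-0.30, 0.00, -0.10, -0.20, 0.00, 0.15, 0.10, 0.00, 0.50)

# Mean and standard deviation of dictionary scores within each star level
spread_by_star <- data.frame(
  stars = sort(unique(stars)),
  mean  = as.numeric(tapply(dict, stars, mean)),
  sd    = as.numeric(tapply(dict, stars, sd))
)
print(spread_by_star)
```

A rising mean together with a large within-group standard deviation is the numeric counterpart of an upward shift with a wide vertical spread in the scatterplot.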

bert_example <- reddit_sentiment %>%
  filter(bert_label_numeric %in% c(1,5)) %>%
  group_by(bert_label) %>%
  arrange(desc(bert_score)) %>%
  slice_head(n = 3) %>%
  ungroup()

# 1 star
bert_example %>% filter(bert_label_numeric == 1) %>% pull(title_text) %>% print()
## [1] "Anyone ever seen HP World this empty?. "                                                                                                                                                                                                                    
## [2] "13.5 years later and I still can't get over how absolutely ridiculous this entire sequence is. Remember when parts of it were and in the trailer and I thought - oh wow must be some sort of dream sequence they added or something - NOPE. Just insane."
## [3] "Ugly covers? Gotta mention this one cause what the hell is this?. "
# 5 star
bert_example %>% filter(bert_label_numeric == 5) %>% pull(title_text) %>% print()
## [1] "One of the best McGonagall quotes of all time. "                                                                                                             
## [2] "She absolutely nailed it, my favorite costume from last night. "                                                                                             
## [3] "My GF is awesome. Since I was young I had always low-key wanted Harry's cake from hagrid because it just looked good IMO&this year, my gf made it for me!"
sentimentr_example <- reddit_sentiment %>%
  mutate(sentimentr_abs = abs(sentiment_dict),
         # Scores of exactly zero are labelled neutral rather than negative
         sentimentr_binary = case_when(sentiment_dict > 0 ~ 'positive',
                                       sentiment_dict < 0 ~ 'negative',
                                       TRUE ~ 'neutral')) %>%
  group_by(sentimentr_binary) %>%
  arrange(desc(sentimentr_abs)) %>%
  slice_head(n = 3) %>%
  ungroup() %>%
  arrange(sentiment_dict)

# negative
sentimentr_example %>% filter(sentimentr_binary == 'negative') %>% pull(title_text) %>% print()
## [1] "Damn. "                            "The bias was always crazy. "      
## [3] "That would be so confusing lmao. "
# positive
sentimentr_example %>% filter(sentimentr_binary == 'positive') %>% pull(title_text) %>% print()
## [1] "Can we all agree that the acting of this gentleman was absolutely excellent.. I loved this man lol"
## [2] "Which Weasley is the most powerful/skilled, and why?. "                                            
## [3] "Pretty awesome decorations. "
# Load necessary libraries
library(dplyr)      # for %>% and data manipulation
library(ggplot2)    # for ggplot2 plotting


reddit_sentiment %>%
  ggplot(aes(x = bert_label)) +
  geom_bar(fill = "skyblue") +
  labs(title = "Number of Threads by Sentiment Category",
       x = "Sentiment",
       y = "Number of Threads") +
  theme_dark_grey_like()

The sentiment around Harry Potter content appears polarized: the 1-star and 5-star categories contain the most threads, while the middle categories (2, 3, and 4 stars) contain noticeably fewer. This suggests that users are more inclined to post strongly negative or strongly positive opinions than moderate ones. The strong presence of 5-star ratings suggests a passionate fanbase, while the 1-star threads could represent strong criticisms or controversies surrounding the franchise.
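The degree of polarization can be expressed as the share of threads falling in the two extreme categories. The sketch below uses hypothetical labels; the `"<n> star(s)"` format is an assumption about how `bert_label` is encoded, and the counts are invented for illustration.

```r
# Hypothetical toy labels, assumed to follow a "<n> star(s)" format
labels <- c("1 star", "1 star", "1 star", "2 stars", "3 stars",
            "4 stars", "5 stars", "5 stars", "5 stars", "5 stars")

counts <- table(labels)
shares <- round(prop.table(counts), 2)

# Combined share of the most negative and most positive categories
extreme_share <- sum(shares[c("1 star", "5 stars")])
print(shares)
print(extreme_share)
```

Applied to the real data, `prop.table(table(reddit_sentiment$bert_label))` would give the corresponding shares.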

Distribution of Sentiment scores

# Word counts by sentiment category

reddit_sentiment %>%
  ggplot(aes(x = bert_label, y = word_count)) +
  geom_jitter(height = 0, width = 0.05, color = "skyblue") +
  stat_summary(fun = mean, geom = "crossbar", width = 0.4, color = "red") +
  labs(title = "Word Counts by Sentiment Category",
       x = "Sentiment",
       y = "Word Count") +
  theme_dark_grey_like()

The distribution of word counts across sentiment categories shows that users discussing Harry Potter express their opinions with similar levels of detail regardless of whether their sentiment is positive or negative. Although each category contains a few very long posts, suggesting highly passionate or strongly opinionated users, the average word counts remain relatively consistent from 1 star through 5 stars, as indicated by the red crossbars. This consistency suggests that sentiment strength does not drive message length: highly negative and highly positive threads are no more verbose than neutral or moderately rated ones. Instead, the data reveal substantial variation within each sentiment category, with clusters of short comments and scattered long-form responses across all ratings. Users' writing styles differ widely, and both brief reactions and detailed commentary occur at every sentiment level. While the earlier sentiment distribution showed a polarized community dominated by 1-star and 5-star threads, the word count analysis adds nuance by indicating that this polarization stems from users' opinions rather than from differences in how extensively they articulate them. Together, these trends point to a fandom where people hold strong and often contrasting views about Harry Potter content, yet express these views at similarly varied lengths.

# Remove outliers for comments

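# Within each BERT label, keep threads whose comment count lies inside
# the 1.5 * IQR fences around the first and third quartiles
reddit_sentiment_rm_outlier <- reddit_sentiment %>%
  group_by(bert_label) %>%
  filter(
    between(
      comments,
      quantile(comments, 0.25) - 1.5 * IQR(comments),
      quantile(comments, 0.75) + 1.5 * IQR(comments)
    )
  )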

# Correlation analysis

cor.test(reddit_sentiment_rm_outlier$comments, reddit_sentiment_rm_outlier$bert_label_numeric)
## 
##  Pearson's product-moment correlation
## 
## data:  reddit_sentiment_rm_outlier$comments and reddit_sentiment_rm_outlier$bert_label_numeric
## t = -4.417, df = 229, p-value = 1.544e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3949829 -0.1567828
## sample estimates:
##        cor 
## -0.2801903
# Scatterplot

reddit_sentiment_rm_outlier %>%
  ggplot(aes(x = bert_label_numeric, y = comments)) +
  geom_jitter(height = 0, width = 0.05, color = "skyblue") +
  geom_smooth(method = 'loess', span = 0.75, color = "gold") +
  labs(title = "Comments vs Sentiment",
       x = "Sentiment (Numeric)",
       y = "Number of Comments") +
  theme_dark_grey_like()
## `geom_smooth()` using formula = 'y ~ x'

The relationship between sentiment ratings and the number of comments reveals a subtle but meaningful trend in how Harry Potter discussions unfold. The scatterplot shows substantial variation in comment volume across all sentiment levels, with some threads in every category receiving hundreds of comments. However, the LOESS smoothing line indicates a gradual downward trend: as sentiment becomes more positive, the average number of comments decreases slightly. This visual pattern is supported statistically by the Pearson correlation coefficient of -0.28, which is weak but significant (p < 0.001). Threads with more negative sentiment therefore tend to attract more engagement, likely because negative or controversial opinions spark debate, disagreement, or extended discussion among community members. In contrast, highly positive threads, while still present, tend to receive somewhat fewer comments on average. The data indicate that negativity generates more interaction within the Harry Potter community, reflecting a broader online trend in which critical or contentious viewpoints draw higher levels of participation. This matches my own experience as a long-time member of such communities: negative criticism or a very controversial opinion often sparks longer discussions and debates.
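The correlation above was computed after removing comment-count outliers with the 1.5 × IQR rule. The sketch below applies the same rule to a hypothetical vector of comment counts so the effect of the fences is easy to see; the numbers are invented for illustration.

```r
# Hypothetical comment counts with one extreme thread
comments <- c(3, 5, 8, 10, 12, 15, 18, 22, 400)

q1 <- quantile(comments, 0.25)
q3 <- quantile(comments, 0.75)
fence_low  <- q1 - 1.5 * IQR(comments)   # lower fence
fence_high <- q3 + 1.5 * IQR(comments)   # upper fence

# Keep only values inside the fences
kept <- comments[comments >= fence_low & comments <= fence_high]
print(kept)
```

Only the 400-comment thread falls outside the fences here. In the analysis the rule is applied separately within each BERT label, so the fences differ by category.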

library(dplyr)
library(stringr)
library(tidytext)

# Regex to remove URLs and HTML entities
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"

reddit_sentiment_clean <- reddit_sentiment %>%
  mutate(title_text = str_replace_all(title_text, replace_reg, "")) %>%
  # Tokenize titles into words
  unnest_tokens(word, title_text, token = "words") %>%
  # Remove stop words
  anti_join(stop_words, by = "word") %>%
  # Keep tokens that contain at least one letter (drops pure numbers)
  filter(str_detect(word, "[a-z]")) %>%
  # Remove specific keywords (lowercase, because unnest_tokens() lowercases tokens)
  filter(!word %in% c('wizard', 'hogwarts', 'harry'))
# negative text
reddit_sentiment_clean_negative <- reddit_sentiment_clean %>%
  filter(bert_label_numeric %in% c(1,2))
# positive text
reddit_sentiment_clean_positive <- reddit_sentiment_clean %>%
  filter(bert_label_numeric %in% c(4,5))

# Remove words that are commonly seen in both negative and positive threads
reddit_sentiment_clean_negative_unique <- reddit_sentiment_clean_negative %>%
  anti_join(reddit_sentiment_clean_positive, by = 'word')
reddit_sentiment_clean_positive_unique <- reddit_sentiment_clean_positive %>%
  anti_join(reddit_sentiment_clean_negative, by = 'word')
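The two `anti_join()` calls act as a set difference on tokens. The base-R sketch below reproduces that logic on hypothetical token vectors to show how strict it is: a word that appears even once in the other group is dropped entirely, however frequent it is in its own group.

```r
# Hypothetical token lists for negative and positive threads
neg_words <- c("plot", "ridiculous", "love", "ridiculous", "boring")
pos_words <- c("love", "cake", "plot", "cake")

# Base-R analogue of the anti_join step: keep tokens never seen in the other group
neg_unique <- neg_words[!neg_words %in% pos_words]
pos_unique <- pos_words[!pos_words %in% neg_words]

print(sort(table(neg_unique), decreasing = TRUE))
print(sort(table(pos_unique), decreasing = TRUE))
```

A softer alternative would compare relative frequencies between the two groups instead of excluding shared words outright, at the cost of a less clean separation in the word clouds.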

  • Words appearing in negative threads.

# Wordcloud with a custom color palette
n <- 20
h <- runif(n, 0, 1) # any color
s <- runif(n, 0.6, 1) # vivid
v <- runif(n, 0.3, 0.7) # neither too dark nor too bright

df_hsv <- data.frame(h = h, s = s, v = v)
pal <- apply(df_hsv, 1, function(x) hsv(x['h'], x['s'], x['v']))
pal <- c(pal, rep("grey", 10000))

# Word cloud of words unique to negative threads (wordcloud2 is already loaded above)
reddit_sentiment_clean_negative_unique %>%
  count(word, sort = TRUE) %>%
  wordcloud2(color = pal,
             minRotation = -pi/6,
             maxRotation = -pi/6,
             rotateRatio = 1)
knitr::include_graphics("C:/Users/SHAMBHAVI SINHA/Downloads/w1.png")

The word cloud shows the words that appear in negative threads (1 and 2 stars) but not in positive ones, with word size reflecting frequency. Words such as “shit,” “literally,” and “crazy” are among the most prominent and suggest negative sentiment. The term “shit” stands out as the clearest indicator, as it is commonly used in a derogatory or frustrated context.

  • Words appearing in positive threads.
# Wordcloud with a custom color palette
n <- 20
h <- runif(n, 0, 1) # any color
s <- runif(n, 0.6, 1) # vivid
v <- runif(n, 0.3, 0.7) # neither too dark nor too bright

df_hsv <- data.frame(h = h, s = s, v = v)
pal <- apply(df_hsv, 1, function(x) hsv(x['h'], x['s'], x['v']))
pal <- c(pal, rep("grey", 10000))

reddit_sentiment_clean_positive_unique %>%
  count(word, sort = TRUE) %>%
  wordcloud2(color = pal,
       minRotation = pi/6,
       maxRotation = pi/6,
       rotateRatio = 1)
knitr::include_graphics("C:/Users/SHAMBHAVI SINHA/Downloads/w2.png")

The word cloud for positive threads (4 and 5 stars) has a noticeably lighter tone. Prominent words like “love,” “cake,” “neville,” and “butterbeer” suggest lighthearted, fandom-oriented discussions.

The prominence of “neville” and “love” may point to discussions centered on character appreciation or fun fan interactions. “Butterbeer” and “cake” are typically associated with comfort and enjoyment, further indicating a positive context.

#Temporal Analysis

library(dplyr)
library(lubridate)
library(anytime)
library(stringr)
library(magrittr)

reddit_sentiment %<>%
  mutate(date = as.POSIXct(date_utc)) %>%
  filter(!is.na(date)) %>%
  mutate(year = year(date),
         day_of_week = wday(date, label = TRUE),
         is_weekend = ifelse(day_of_week %in% c("Sat", "Sun"), "Weekend", "Weekday"),
         # Hour of day (0-23) in the local time zone, which is anytime()'s default
         time = hour(anytime(timestamp)))

Day of the Week Rating Variations

reddit_sentiment %>%
  ggplot(aes(x = day_of_week, fill = bert_label)) +
  geom_bar(position = 'fill') +
  scale_fill_brewer(palette = 'PuRd', direction = -1)

Mondays seem to have a higher proportion of darker colors (lower stars), indicating that posts on Mondays might lean more negative. On Sundays, the distribution appears to be a mix of sentiments, with possibly a slight increase in higher star ratings (more positive). Mid-week (Tue to Thu) shows a more balanced and consistent mix of star ratings. Fridays and Saturdays show a more varied distribution, possibly reflecting a larger share of posts with moderate to positive sentiment.
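The shares read off the stacked bar chart can be computed directly as row-wise proportions. The sketch below uses hypothetical weekday and sentiment vectors with only two days and two classes; it illustrates the calculation behind `position = 'fill'` and is not based on the actual thread data.

```r
# Hypothetical toy data: weekday and a coarse sentiment label per thread
day   <- factor(c("Mon", "Mon", "Mon", "Sat", "Sat", "Sat", "Sat"),
                levels = c("Mon", "Sat"))
label <- c("negative", "negative", "positive", "positive", "positive",
           "negative", "positive")

# Row-wise proportions: each weekday sums to 1, like position = "fill"
day_counts <- table(day, label)
day_shares <- prop.table(day_counts, margin = 1)
print(round(day_shares, 2))
```

With only a few dozen threads per weekday, differences of a few percentage points may be noise, so it helps to inspect the raw counts in `day_counts` next to the shares.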

reddit_sentiment %>%
  ggplot(aes(x = time, fill = bert_label)) +
  geom_histogram(bins = 24, position = 'fill', color = 'black', lwd = 0.2) +
  scale_x_continuous(breaks = seq(0, 24, by = 1)) +
  scale_fill_manual(values = c('#bc5090', '#bc5090', '#ff6361', '#ffa600', '#ffa600'))

The analysis shows no clear relationship between the time of day a thread was posted and its rating.
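The hour-of-day variable behind this plot is derived from each thread's timestamp. The sketch below shows one way to extract it in base R, assuming the timestamps are Unix epoch seconds (an assumption about the `timestamp` column); the three values are hypothetical.

```r
# Hypothetical Unix timestamps (seconds since 1970-01-01 UTC)
ts <- c(0, 13 * 3600 + 25 * 60, 23 * 3600 + 59 * 60)

# Convert to POSIXct in UTC and pull out the hour of day (0-23)
posted_at <- as.POSIXct(ts, origin = "1970-01-01", tz = "UTC")
hour_utc  <- as.integer(format(posted_at, "%H"))
print(hour_utc)
```

The choice of time zone matters for interpretation: hours in UTC do not correspond to the local time of day for users spread across different regions.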

reddit_sentiment %>%
  filter(!is.na(sentiment_dict), !is.na(word_count)) %>%
  ggplot(aes(x = sentiment_dict, y = word_count)) +
  geom_jitter(width = 0.02, height = 0, alpha = 0.4) +
  geom_smooth(method = "loess", se = FALSE, color = "darkgreen") +
  labs(
    title = "Relationship between sentiment and text length",
    x = "Dictionary Sentiment Score",
    y = "Word Count"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

The plot indicates that sentiment and word count are not strongly related. Posts with both extreme and neutral sentiment scores have varying word counts, and there is no clear trend suggesting that longer posts are more positive or more negative on average. This suggests that both short and long posts can express a wide range of sentiments.

reddit_sentiment %>%
  filter(!is.na(sentiment_dict)) %>%   # drop threads without a dictionary score
  ggplot(aes(x = sentiment_dict)) +
  geom_histogram(bins = 30, color = "black", fill = "blue") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
  labs(
    title = "Distribution of Sentiment about Harry Potter",
    x = "Sentiment score (sentimentr dictionary)",
    y = "Number of threads"
  ) +
  theme_minimal()

There is a noticeable skew toward more neutral to slightly negative sentiment scores, but fewer threads seem to have extremely positive or negative sentiments. This suggests that while most Reddit threads might express neutral opinions, a smaller portion might express either enthusiasm or frustration about Harry Potter-related topics.

Comparison of Dictionary-Based Sentiment vs Deep-Learning Sentiment

When comparing the sentiment distributions produced by the dictionary method and the deep-learning model for my Harry Potter Reddit dataset, the methodological contrasts become especially clear once the nature of fandom discussions is taken into account. The dictionary method generates sentiment values tightly clustered around zero, with very few strongly positive or negative scores, because lexicon-based approaches depend on predefined sentiment words and thus classify most context-rich, descriptive, or lore-oriented posts as neutral. This is particularly limiting in the Harry Potter context, where a large share of user comments involve plot analysis, worldbuilding debates, continuity clarifications (such as the Bathilda Bagshot–Nagini connection), or mechanical discussions about Quidditch rules. These posts often embed emotional subtext without using explicit sentiment vocabulary. As a result, statements conveying nostalgia, reverence for characters, or disappointment with adaptations (such as “This healed a part of me” or sarcastic remarks about film choices) are flattened into neutrality. In contrast, the deep-learning model produces a wider sentiment distribution, with a higher proportion of strongly positive and strongly negative ratings. This reflects the model’s ability to account for context, idiomatic phrasing, sarcasm, rhetorical questioning, and the layered emotional cues typical of fandom discourse. It identifies indirect negativity (e.g., subtle criticism of character arcs or canon decisions) and captures forms of positive sentiment that rely on shared cultural memory, excitement, or personal attachment rather than overt positive adjectives. In effect, while the dictionary method provides a conservative baseline, it systematically underestimates the emotional depth of Harry Potter fandom conversations.
The deep-learning method, by incorporating contextual semantics and narrative cues, produces results that align more closely with how fans actually express sentiment through nuanced storytelling, irony, nostalgia, and collective interpretation.
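One way to quantify the disagreement described above is to collapse both measures to three classes and cross-tabulate them. The sketch below uses hypothetical vectors standing in for `sentiment_dict` and `bert_label_numeric`; the cut-offs (stars of 4 or more are positive, 2 or fewer are negative) follow the grouping used for the word clouds, and the data are invented for illustration.

```r
# Hypothetical toy data standing in for sentiment_dict and bert_label_numeric
dict  <- c(0.00, 0.00, -0.25, 0.30, 0.00, 0.55, 0.00, -0.10)
stars <- c(5,    1,    1,     5,    4,    5,    2,    3)

# Collapse both measures to three classes so they can be compared directly
three_levels <- c("negative", "neutral", "positive")
dict_class <- factor(ifelse(dict > 0, "positive",
                     ifelse(dict < 0, "negative", "neutral")),
                     levels = three_levels)
bert_class <- factor(ifelse(stars >= 4, "positive",
                     ifelse(stars <= 2, "negative", "neutral")),
                     levels = three_levels)

# Rows: dictionary class; columns: BERT class; diagonal cells are agreements
confusion <- table(dictionary = dict_class, bert = bert_class)
agreement <- sum(diag(confusion)) / sum(confusion)
print(confusion)
print(agreement)
```

In this toy example most disagreements sit in the dictionary's "neutral" row, which is the pattern the comparison above describes: texts the dictionary scores as zero while the model assigns a clear rating.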