Sentiment Analysis with Text Mining in R

Author

Mark Hamer

Introduction

This project reproduces and extends the sentiment analysis example from Chapter 2 of Text Mining with R (Silge & Robinson, 2017). The base example uses Jane Austen’s novels to demonstrate how sentiment lexicons can trace emotional arcs across a narrative. The extension applies similar methods to a more charged question: how do sentiment and language differ across male online communities associated with loneliness and ideological grievance? Three subreddits were selected along a spectrum of radicalization: r/lonely, r/ForeverAlone, and r/mensrights. Posts were collected using the RedditExtractoR package and analyzed with three tools: the Bing lexicon for comparison with the base example, a custom manosphere lexicon built for this project, and the sentimentr package for context-aware, sentence-level scoring. Notable communities from the original research design, including r/MGTOW and r/redpill, were inaccessible due to Reddit bans, a finding that is itself worth acknowledging.

Setup

library(tidytext)
library(janeaustenr)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(stringr)
library(ggplot2)
library(tidyr)
library(textdata)
library(sentimentr)

reddit_data <- readRDS("reddit_data.rds")

tidy_books <- austen_books() |>
  group_by(book) |>
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                            ignore_case = TRUE)))
  ) |>
  ungroup()

tidy_books <- tidy_books |>
  unnest_tokens(word, text)

Base Example: Sentiment in Jane Austen’s Novels

nrc_joy <- get_sentiments("nrc") |>
  filter(sentiment == "joy")

tidy_books |>
  filter(book == "Emma") |>
  inner_join(nrc_joy) |>
  count(word, sort = TRUE)

Joining with `by = join_by(word)`

# A tibble: 301 × 2
   word          n
   <chr>     <int>
 1 good        359
 2 friend      166
 3 hope        143
 4 happy       125
 5 love        117
 6 deal         92
 7 found        92
 8 present      89
 9 kind         82
10 happiness    76
# ℹ 291 more rows

jane_austen_sentiment <- tidy_books |>
  inner_join(get_sentiments("bing"), relationship = "many-to-many") |>
  count(book, index = linenumber %/% 80, sentiment) |>
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  mutate(sentiment = positive - negative)

Joining with `by = join_by(word)`

Jane Austen Sentiment Analysis Plot

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

Jane Austen Sentiment Analysis Interpretation

Each bar represents net sentiment (positive minus negative word counts) within an 80-line chunk of text. Austen’s novels trend positive overall, fitting her romantic style, but each novel dips negative at points that tend to coincide with its dramatic crises. Pride and Prejudice, for instance, turns sharply negative around Lydia’s elopement crisis. All six novels recover and end on a positive note, reflecting Austen’s signature happy endings.
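The chunking behind each bar can be illustrated on toy data. The sketch below is not part of the original analysis: the `toy` tibble and its values are invented for illustration. It shows how integer division (`%/%`) assigns line numbers to 80-line bins and how net sentiment is then computed per bin.

```r
library(dplyr)
library(tidyr)

# Hypothetical sentiment-bearing words, tagged with the line they occur on
toy <- tibble(
  linenumber = c(1, 79, 80, 159, 160),
  sentiment  = c("positive", "negative", "positive", "positive", "negative")
)

toy_net <- toy |>
  # Integer division maps lines 1-79 to bin 0, 80-159 to bin 1, and so on
  count(index = linenumber %/% 80, sentiment) |>
  # One column per sentiment; bins with no words of a given sentiment get 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  mutate(net = positive - negative)

toy_net
```

Because `row_number()` starts at 1, the first bin holds 79 lines rather than 80, which is negligible at the scale of a novel.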

Citation

Silge, J., & Robinson, D. (2017). Text Mining with R: A Tidy Approach. O’Reilly Media. https://www.tidytextmining.com

Extension: Sentiment Analysis of Male Loneliness on Reddit

tidy_reddit <- reddit_data |>
  unnest_tokens(word, text) |>
  anti_join(stop_words)

Joining with `by = join_by(word)`

reddit_sentiment <- tidy_reddit |>
  inner_join(get_sentiments("bing"), relationship = "many-to-many") |>
  count(subreddit, sentiment) |>
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  mutate(
    net_sentiment = positive - negative,
    ratio = positive / (positive + negative)
  )

Joining with `by = join_by(word)`

tidy_reddit |>
  inner_join(get_sentiments("bing"), relationship = "many-to-many") |>
  filter(sentiment == "negative") |>
  count(subreddit, word, sort = TRUE) |>
  group_by(subreddit) |>
  slice_max(n, n = 10) |>
  ggplot(aes(x = reorder(word, n), y = n, fill = subreddit)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~subreddit, scales = "free_y") +
  coord_flip() +
  labs(
    title = "Top Negative Words by Subreddit",
    x = "Word",
    y = "Count"
  ) +
  theme_minimal()

Joining with `by = join_by(word)`

ggplot(reddit_sentiment, aes(x = subreddit, y = net_sentiment, fill = subreddit)) +
  geom_col(show.legend = FALSE) +
  labs(
    title = "Net Sentiment Across Subreddits",
    x = "Subreddit",
    y = "Net Sentiment (Positive - Negative)"
  ) +
  theme_minimal()
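
Net sentiment is a raw difference, so it grows with the amount of text collected from each subreddit. The `ratio` column computed earlier does not depend on corpus size, but it is never plotted. A minimal sketch with invented counts (the `toy_counts` tibble and its subreddit names are hypothetical) shows how the two measures can disagree:

```r
library(dplyr)

# Invented counts: same positive share, ten times more text in r/large
toy_counts <- tibble(
  subreddit = c("r/small", "r/large"),
  positive  = c(40, 400),
  negative  = c(60, 600)
) |>
  mutate(
    net_sentiment = positive - negative,
    ratio = positive / (positive + negative)
  )

toy_counts
```

Here the net scores differ tenfold while the ratios are identical. Plotting `ratio` in place of `net_sentiment` would be one way to compare subreddits regardless of how many posts each contributed.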

manosphere_lexicon <- tibble(
  word = c(
    # Dehumanizing slurs
    "femoid", "roastie", "whore", "slut", "thot", "cunt", "bitch",
    "gold digger", "used goods",
    # Incel/manosphere coded language
    "hypergamy", "awalt", "blackpill", "chad", "stacy", "betabux",
    "smv", "gynocentric", "misandry", "redpill", "alpha", "beta",
    "sexual marketplace", "cock carousel", "pair bonding", "wall",
    "female nature", "beta male",
    # Grievance framing
    "divorce rape", "false accusations", "female privilege",
    "family court", "child support trap", "men are oppressed",
    "gynocentric society", "misandry",
    # Dehumanizing framing
    "females", "validation", "attention seekers", "easy mode",
    "women always", "women only", "modern women", "female privilege",
    "used goods", "logic vs emotion", "men build"
  ),
  score = c(
    # Slurs
    -4, -4, -3, -3, -3, -4, -3,
    -3, -3,
    # Coded language
    -2, -3, -3, -1, -1, -2,
    -2, -2, -2, -2, -1, -1,
    -2, -4, -2, -2,
    -2, -2,
    # Grievance
    -3, -2, -2,
    -2, -3, -2,
    -2, -2,
    # Framing
    -1, -1, -2, -2,
    -2, -2, -2, -2,
    -3, -2, -2
  ),
  lexicon = "manosphere"
)
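
Several lexicon entries are multi-word phrases, but `unnest_tokens(word, text)` produces single words, so those entries cannot match. One possible workaround, sketched here on invented posts (`toy_posts` and `phrase_lexicon` are hypothetical and not part of the original analysis), is to tokenize the raw text into bigrams and join the phrase entries against those:

```r
library(dplyr)
library(tidytext)

toy_posts <- tibble(
  subreddit = "r/example",
  text = c("The family court ruled today", "I hit a wall")
)
phrase_lexicon <- tibble(
  word  = c("family court", "wall"),
  score = c(-2, -2)
)

# Single-word tokens: only "wall" can match
unigram_hits <- toy_posts |>
  unnest_tokens(word, text) |>
  inner_join(phrase_lexicon, by = "word")

# Bigram tokens: "family court" now matches, "wall" no longer does
bigram_hits <- toy_posts |>
  unnest_tokens(word, text, token = "ngrams", n = 2) |>
  inner_join(phrase_lexicon, by = "word")

# Combine both passes so single words and phrases are each counted once
all_hits <- bind_rows(unigram_hits, bigram_hits)
```

Bigrams have to be built from the raw text, before stop-word removal, and three-word entries such as "child support trap" would need a further pass with `n = 3`.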

manosphere_scores <- tidy_reddit |>
  inner_join(manosphere_lexicon, by = "word") |>
  group_by(subreddit) |>
  summarise(
    manosphere_score = sum(score),
    term_count = n()
  )

Warning in inner_join(tidy_reddit, manosphere_lexicon, by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 34332 of `x` matches multiple rows in `y`.
ℹ Row 7 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

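# The warning above arises because some lexicon terms (e.g. "misandry") are
# listed twice. relationship = "many-to-many" silences it, but each duplicated
# term still contributes two rows per match.
manosphere_scores <- tidy_reddit |>
  inner_join(manosphere_lexicon, by = "word", relationship = "many-to-many") |>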
  group_by(subreddit) |>
  summarise(
    manosphere_score = sum(score),
    term_count = n()
  )

ggplot(manosphere_scores, aes(x = subreddit, y = manosphere_score, fill = subreddit)) +
  geom_col(show.legend = FALSE) +
  labs(
    title = "Manosphere Language Score by Subreddit",
    x = "Subreddit",
    y = "Manosphere Score (lower = more extreme)"
  ) +
  theme_minimal()

tidy_reddit |>
  inner_join(manosphere_lexicon, by = "word", relationship = "many-to-many") |>
  count(subreddit, word, sort = TRUE) |>
  group_by(subreddit) |>
  slice_max(n, n = 5)

# A tibble: 16 × 3
# Groups:   subreddit [3]
   subreddit      word            n
   <chr>          <chr>       <int>
 1 r/ForeverAlone bitch           4
 2 r/ForeverAlone validation      2
 3 r/ForeverAlone alpha           1
 4 r/ForeverAlone beta            1
 5 r/ForeverAlone wall            1
 6 r/lonely       bitch           2
 7 r/lonely       wall            1
 8 r/mensrights   misandry       64
 9 r/mensrights   females         7
10 r/mensrights   validation      3
11 r/mensrights   alpha           1
12 r/mensrights   beta            1
13 r/mensrights   cunt            1
14 r/mensrights   gynocentric     1
15 r/mensrights   hypergamy       1
16 r/mensrights   wall            1
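
The 64 hits for “misandry” deserve a second look: the term is listed twice in `manosphere_lexicon`, and a many-to-many join returns one row per matching pair. The sketch below reproduces the effect on invented data (`toy_tokens` and `toy_lexicon` are hypothetical). It also shows one way to guard against it: deduplicate with `distinct()` and declare the join `many-to-one`, so dplyr raises an error if duplicates reappear.

```r
library(dplyr)

toy_tokens <- tibble(
  subreddit = "r/example",
  word = c("misandry", "misandry", "wall")
)
# "misandry" appears twice, mirroring the duplicate in the project lexicon
toy_lexicon <- tibble(
  word  = c("misandry", "wall", "misandry"),
  score = c(-2, -2, -2)
)

# Each "misandry" token matches two lexicon rows: 2 * 2 + 1 = 5 rows
inflated <- toy_tokens |>
  inner_join(toy_lexicon, by = "word", relationship = "many-to-many")

# Keep the first row per term, then require at most one match per token
dedup_lexicon <- toy_lexicon |>
  distinct(word, .keep_all = TRUE)

corrected <- toy_tokens |>
  inner_join(dedup_lexicon, by = "word", relationship = "many-to-one")

nrow(inflated)   # 5
nrow(corrected)  # 3
```

`distinct()` keeps the first score it encounters, so duplicated terms with conflicting scores would need to be resolved by hand.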

reddit_sentimentr <- sentiment_by(
  get_sentences(reddit_data$text[!is.na(reddit_data$text)]),
  by = list(reddit_data$subreddit[!is.na(reddit_data$text)])
)

reddit_sentimentr <- reddit_sentimentr |>
  rename(subreddit = `reddit_data$subreddit[!is.na(reddit_data$text)]`)

ggplot(reddit_sentimentr, aes(x = subreddit, y = ave_sentiment, fill = subreddit)) +
  geom_col(show.legend = FALSE) +
  labs(
    title = "Average Sentiment by Subreddit (sentimentr)",
    x = "Subreddit",
    y = "Average Sentiment Score"
  ) +
  theme_minimal()

Comparing the Base Example to the Extension

Austen’s novels offer clean, formal text written by a single author, where sentiment shifts map neatly onto narrative structure. Reddit is the opposite: thousands of voices, informal language, sarcasm, and community slang that standard lexicons were never built to handle. The analysis is messier, but the questions are more urgent.

What the Three Lexicons Revealed

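Bing found all three subreddits net negative but struggled to distinguish them meaningfully. The custom manosphere lexicon told a sharper story. r/mensrights produced 80 ideological term hits versus 9 in r/ForeverAlone and 3 in r/lonely, suggesting organized grievance rather than personal pain. Two caveats apply to these counts. First, “misandry” is listed twice in the lexicon, so the many-to-many join counts each occurrence twice: the 64 “misandry” rows reflect 32 occurrences, which puts the deduplicated r/mensrights total at 48. The ordering of the three subreddits is unchanged, but most of the r/mensrights signal comes from that single term. Second, the counts are raw totals and are not normalized by the amount of text collected from each subreddit. sentimentr, which handles negation and context, produced the cleanest gradient, with r/mensrights most negative and r/ForeverAlone closest to neutral.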
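
The difference between word-level and sentence-level scoring is easiest to see on a negated sentence. The example sentences below are invented for illustration. A word-by-word Bing lookup sees only the positive word “happy”, while sentimentr’s valence shifters flip the polarity when “not” precedes it.

```r
library(dplyr)
library(tidytext)
library(sentimentr)

# Word level: Bing labels "happy" as positive regardless of context
bing_label <- get_sentiments("bing") |>
  filter(word == "happy") |>
  pull(sentiment)

# Sentence level: sentimentr accounts for the negator "not"
plain   <- sentiment(get_sentences("I am happy."))$sentiment
negated <- sentiment(get_sentences("I am not happy."))$sentiment

bing_label      # "positive"
sign(plain)     # 1
sign(negated)   # -1
```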

Limitations

The most radicalized communities in the original design, r/MGTOW and r/redpill, were banned before data collection, so this analysis captures only a moderate slice of the manosphere. Many manosphere terms are context-dependent (“wall”, “alpha”, and “validation” all have everyday meanings), and scoring them uniformly negative introduces noise. The lexicon’s multi-word entries, such as “family court”, can never match because the text was tokenized into single words. Standard lexicons may also misclassify sarcasm and slang. Finally, only post text was analyzed; comments were excluded due to API limitations.