This analysis follows the workflow in Text Mining with R, Chapter 2 (“Sentiment analysis with tidy data”) and then extends it to Week 10 discussion text.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
library(janeaustenr) # for the book example
# Get Jane Austen books and make them tidy
austen <- janeaustenr::austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number()) %>%
  ungroup()

tidy_austen <- austen %>%
  unnest_tokens(word, text)
# Get a sentiment lexicon from tidytext
bing_lex <- get_sentiments("bing")
austen_sent <- tidy_austen %>%
  inner_join(bing_lex, by = "word") %>%
  count(book, sentiment) %>%
  tidyr::pivot_wider(names_from = sentiment,
                     values_from = n,
                     values_fill = 0) %>%
  mutate(net_sentiment = positive - negative)
## Warning in inner_join(., bing_lex, by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
austen_sent
## # A tibble: 6 × 4
## book negative positive net_sentiment
## <fct> <int> <int> <int>
## 1 Sense & Sensibility 3671 4933 1262
## 2 Pride & Prejudice 3652 5052 1400
## 3 Mansfield Park 4828 6749 1921
## 4 Emma 4809 7157 2348
## 5 Northanger Abbey 2518 3244 726
## 6 Persuasion 2201 3473 1272
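Chapter 2 visualizes sentiment with bar charts; here is a minimal analogue for the per-book totals above (a sketch that plots net_sentiment per book rather than the chapter's faceted trajectories):
# Net Bing sentiment per book from the austen_sent summary above.
austen_sent %>%
  ggplot(aes(x = book, y = net_sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(x = NULL, y = "Net sentiment (positive - negative)")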
# ---- corpus ----
my_posts <- tibble::tibble(
  source = c("Kevin_discussion", "Candace_post", "Joao_post", "Kevin_reaction"),
  text = c(
    "I described a use case where I had to process multiple JSON files from different folders.",
    "Candace talked about using Microsoft Graph JSON batching to reduce network roundtrips.",
    "Joao described e-commerce JSON files with product, customer, and revenue data.",
    "This workflow is really helpful but a little frustrating when the JSON is deeply nested."
  )
)
tidy_posts <- my_posts %>%
  unnest_tokens(word, text)
# NOTE: stop words are not removed here because the corpus is so small
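For a larger corpus, the usual tidytext step would be an anti_join() against the built-in stop_words table. A sketch of what that would look like (not used downstream here):
# Stop-word removal, if the corpus were big enough to warrant it.
tidy_posts_nostop <- tidy_posts %>%
  anti_join(tidytext::stop_words, by = "word")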
# ---- diagnostic: what matched bing? ----
tidy_posts %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(source, sentiment, word, sort = TRUE)
## # A tibble: 2 × 4
## source sentiment word n
## <chr> <chr> <chr> <int>
## 1 Kevin_reaction negative frustrating 1
## 2 Kevin_reaction positive helpful 1
# all sources, kept so posts with zero matches still show up later
all_sources <- tidy_posts %>%
  dplyr::distinct(source)
bing_posts <- tidy_posts %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(source, sentiment) %>%
  tidyr::pivot_wider(
    names_from = sentiment,
    values_from = n,
    values_fill = list(n = 0)
  ) %>%
  # make sure both sentiment columns exist even if one never matched
  { if (!"positive" %in% names(.)) dplyr::mutate(., positive = 0L) else . } %>%
  { if (!"negative" %in% names(.)) dplyr::mutate(., negative = 0L) else . } %>%
  dplyr::mutate(net_sentiment = positive - negative) %>%
  # join back so sources with zero matches still appear
  right_join(all_sources, by = "source") %>%
  # replace the NAs introduced by the right_join
  mutate(
    positive = tidyr::replace_na(positive, 0L),
    negative = tidyr::replace_na(negative, 0L),
    net_sentiment = tidyr::replace_na(net_sentiment, 0L)
  )
bing_posts
## # A tibble: 4 × 4
## source negative positive net_sentiment
## <chr> <int> <int> <int>
## 1 Kevin_reaction 1 1 0
## 2 Kevin_discussion 0 0 0
## 3 Candace_post 0 0 0
## 4 Joao_post 0 0 0
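The column-existence guards and the right_join above can be collapsed into a single tidyr::complete() step. A more compact equivalent (a sketch that should produce the same table, reusing the objects above):
# complete() fills in every missing source/sentiment combination
# before pivoting, so no post-hoc NA handling is needed.
bing_posts_alt <- tidy_posts %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(source, sentiment) %>%
  tidyr::complete(source = all_sources$source,
                  sentiment = c("negative", "positive"),
                  fill = list(n = 0L)) %>%
  tidyr::pivot_wider(names_from = sentiment, values_from = n) %>%
  mutate(net_sentiment = positive - negative)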
# ---- afinn_posts ----
afinn_posts <- tidy_posts %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(source) %>%
  summarise(afinn_score = sum(value), .groups = "drop")
afinn_posts
## # A tibble: 1 × 2
## source afinn_score
## <chr> <dbl>
## 1 Kevin_reaction 0
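A score of 0 looks like "no matches," but it is really two offsetting words. Listing the individual AFINN values makes that visible (a quick check using the same objects):
# Show each matched word with its AFINN value; opposite-signed words
# cancel out in the summed score.
tidy_posts %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  select(source, word, value)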
# ---- loughran_posts ----
loughran_posts <- tidy_posts %>%
  inner_join(get_sentiments("loughran"), by = "word") %>%
  count(source, sentiment, sort = TRUE) %>%
  tidyr::pivot_wider(
    names_from = sentiment,
    values_from = n,
    values_fill = 0
  )
loughran_posts
## # A tibble: 1 × 2
## source negative
## <chr> <int>
## 1 Kevin_reaction 1
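Unlike Bing, the Loughran-McDonald lexicon tags words with six finance-oriented categories rather than a simple positive/negative split. A quick count of the lexicon itself shows them:
# Loughran-McDonald categories (negative, positive, uncertainty,
# litigious, constraining, superfluous) and their word counts.
get_sentiments("loughran") %>%
  count(sentiment, sort = TRUE)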
I first reproduced the sentiment example from Text Mining with R, Chapter 2 to make sure my setup was correct. I then created a small corpus from Week 10 discussion content: my post, two classmates' posts about JSON, and a short reaction of my own. I applied three different sentiment lexicons (Bing, AFINN, and Loughran) to show that the same text can be scored in different ways depending on the dictionary. This matches the assignment instructions.
This assignment helped me explore how different sentiment lexicons (Bing, AFINN, and Loughran) analyze text and how their results change depending on the type of language used. The Bing lexicon worked best for my short dataset since the text contained a few subjective words like helpful and frustrating. The AFINN lexicon matched those same two words, but their numeric values offset each other, leaving a net score of zero. The Loughran lexicon, which is designed for financial and business contexts, caught only a single negative word, showing how domain-specific tools can be more suitable for certain topics.
Even though I reviewed the materials beforehand, I didn’t think about how little emotional vocabulary my chosen text would include. Since my discussion post focused on processing JSON files, I thought it would be interesting to reuse that same topic for sentiment analysis. Once I ran it, I saw that most of the words were technical and neutral, which meant the results were limited and didn’t produce visuals like the example bar chart in the textbook.
I briefly considered switching to a more emotional dataset just to get more colorful results, but the concept still worked—and I liked the connection to my discussion topic. This showed me that sentiment analysis depends heavily on the type of text being analyzed. Some topics, like data processing or technical writing, naturally contain less emotion, and that’s fine—the key is understanding why that happens.
If I wanted to see a wider emotional range, I could have used:
- Product or restaurant reviews (e.g., "The food was amazing, but the service was terrible."; see the sketch after this list)
- Movie reviews or social media posts (e.g., "I love this show!" or "This update completely ruined it.")
- News articles or blog comments about topics that trigger opinions or emotions.
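As a quick check that such text would score more richly, here is a tiny Bing example on the first review sentence above (a hypothetical sentence, not part of the discussion data):
# Hypothetical review sentence: Bing should flag "amazing" as positive
# and "terrible" as negative.
tibble::tibble(text = "The food was amazing, but the service was terrible.") %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word")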
Overall, this project reminded me that sentiment tools aren’t one-size-fits-all; the results only make sense when the dataset fits the type of analysis.