Sources: AFINN from Finn Årup Nielsen; Bing from Bing Liu and collaborators; NRC from Saif Mohammad and Peter Turney.

In this section, we load three lexicons:

NRC: This lexicon categorizes words into emotions such as joy, sadness, anger, fear, surprise, and trust, as well as positive and negative sentiment.

Bing: This lexicon classifies words as either positive or negative.

AFINN: This is a list of English words rated with scores ranging from -5 (very negative) to +5 (very positive).

knitr::opts_chunk$set(echo = TRUE)
#install.packages("tidytext")
#install.packages("textdata")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
#library(gutenbergr)
library(textdata)  


afinn <- get_sentiments("afinn")

bing <- get_sentiments("bing")

nrc <- get_sentiments("nrc")
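Before going further, a quick look at each lexicon confirms the structure described above (a sanity-check sketch; output not shown):

afinn %>% slice_head(n = 3)            # columns: word, value (integer scores -5 to +5)
bing %>% count(sentiment)              # two categories: negative, positive
nrc %>% count(sentiment, sort = TRUE)  # eight emotions plus negative and positive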

Next, we use the text of Jane Austen’s novels to create a tidy data frame with one word per row. We then filter to the book “Emma” and use the NRC lexicon to count the words that express joy.

library(janeaustenr)
library(dplyr)
library(stringr)

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),  # line number within each book
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%  # running chapter count
  ungroup() %>%
  unnest_tokens(word, text)  # one row per word



nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining with `by = join_by(word)`
## # A tibble: 301 × 2
##    word          n
##    <chr>     <int>
##  1 good        359
##  2 friend      166
##  3 hope        143
##  4 happy       125
##  5 love        117
##  6 deal         92
##  7 found        92
##  8 present      89
##  9 kind         82
## 10 happiness    76
## # ℹ 291 more rows

Next, we visualize positive and negative words across the Jane Austen novels. From the graph, we can see that Sense & Sensibility shows more negative sentiment than the other novels.

library(tidyr)

jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
library(ggplot2)

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

For this analysis, we apply all three lexicons (Bing, AFINN, and NRC) to Pride and Prejudice and visualize how their results differ across the narrative.

pride_prejudice <- tidy_books %>% 
  filter(book == "Pride & Prejudice")


afinn <- pride_prejudice %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>%  # 80-line sections of the novel
  summarise(sentiment = sum(value)) %>%    # net AFINN score per section
  mutate(method = "AFINN")
## Joining with `by = join_by(word)`
bing_and_nrc <- bind_rows(
  pride_prejudice %>% 
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>% 
    inner_join(get_sentiments("nrc") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 215 of `x` matches multiple rows in `y`.
## ℹ Row 5178 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
bind_rows(afinn, 
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
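Part of why the absolute values differ across methods is the makeup of the lexicons themselves: they contain different proportions of positive and negative words. We can check those proportions directly (output not shown):

get_sentiments("nrc") %>% 
  filter(sentiment %in% c("positive", "negative")) %>% 
  count(sentiment)

get_sentiments("bing") %>% 
  count(sentiment)

Both lexicons contain more negative than positive words, but the ratio of negative to positive is higher in Bing than in NRC, which helps explain why the Bing estimates tend to run lower.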

Source: https://www.kaggle.com/datasets/reihanenamdari/mental-health-corpus?resource=download

Using a dataset from Kaggle on mental health, we first tokenized the text, breaking it down into smaller pieces called tokens. We then produced a word cloud of the most prominent words in the dataset, excluding common stop words. From the visualization, we can see that some of the most prominent words are “I’m,” “family,” “people,” “love,” and “die.”

#install.packages("wordcloud")
library(wordcloud)
## Loading required package: RColorBrewer
mental_health_data <- read.csv("C:/Users/Daniel.Brusche/Downloads/archive/mental_health.csv")


tidy_mental_health <- mental_health_data %>%
  unnest_tokens(word, text)  # Tokenize the text into words

tidy_mental_health %>%
  anti_join(stop_words) %>%  # Remove common stop words
  count(word) %>%            # Count occurrences of each word
  with(wordcloud(word, n, max.words = 100, random.order = FALSE, 
                  scale = c(4, 0.5), colors = brewer.pal(8, "Dark2")))
## Joining with `by = join_by(word)`
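Because this word cloud mixes positive and negative terms, one possible extension is to split it by sentiment with wordcloud::comparison.cloud(), which expects a term matrix. A sketch, assuming the Bing lexicon and reshape2 for the reshaping:

library(reshape2)

tidy_mental_health %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments("bing"), by = "word",
             relationship = "many-to-many") %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%  # word-by-sentiment matrix
  comparison.cloud(colors = c("gray20", "gray80"), max.words = 100)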

Source: https://search.r-project.org/CRAN/refmans/textdata/html/lexicon_loughran.html

From the article linked above, another lexicon is the Loughran-McDonald lexicon, which is often applied to financial documents. It categorizes words as negative, positive, litigious, uncertainty, constraining, or superfluous. From the visualization, we see that negative and positive are the most frequent categories in this dataset, which makes sense for a mental health corpus: the language is strongly emotional, while finance-specific categories such as litigious and constraining rarely apply.

loughran1 <- tidy_mental_health%>%
  right_join(get_sentiments("loughran")) %>%
  filter(!is.na(sentiment)) %>%
  count(sentiment, sort = TRUE)
## Joining with `by = join_by(word)`
## Warning in right_join(., get_sentiments("loughran")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1650 of `x` matches multiple rows in `y`.
## ℹ Row 2239 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
# Visualization of the sentiment counts
loughran1 %>%
  ggplot(aes(x = sentiment, y = n, fill = sentiment)) +
  geom_col() +
  labs(title = "Sentiment Analysis using Loughran-McDonald Lexicon",
       x = "Sentiment",
       y = "Count") +
  theme_minimal()
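One caveat about the join above: right_join() keeps every entry of the Loughran lexicon, including words that never appear in the corpus, and because sentiment comes from the lexicon side, filter(!is.na(sentiment)) does not drop those unmatched rows. Each category’s count is therefore inflated by the lexicon words absent from the data. A sketch of a stricter version that counts only words actually present in the corpus:

loughran_matched <- tidy_mental_health %>%
  inner_join(get_sentiments("loughran"), by = "word",
             relationship = "many-to-many") %>%  # some words carry more than one label
  count(sentiment, sort = TRUE)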

In conclusion, we use lexicons to make sense of text data, which is qualitative in nature. Lexicons give us a structured way to describe and analyze text, yielding insights that can inform decision-making and understanding in a variety of contexts.