General Setup

Summary of approach for this assignment

The approach for extending the code from the chapter is:

  1. Get the code working as in the chapter, using the lexicons available through tidytext and the Jane Austen corpus provided by janeaustenr.

  2. Load the lexicon R package and use a few of its included lexicons.

  3. Load a few books in .txt format from a cloud server, which we will use as the new corpus to analyze.

For this assignment we will use “The Scarlet Letter”, but we also downloaded and tested “War and Peace”, “Ulysses” and “The Great Gatsby”.

Libraries to use

rm(list=ls())
library(janeaustenr)
library(tidytext)
library(tidyverse)
# The following are loaded as part of tidyverse:
#library(dplyr)
#library(stringr)
#library(tidyr)
#library(ggplot2)
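
If any of these packages are missing, a one-time install along these lines should work (a minimal sketch; the textdata package is an assumption on my part, needed so that get_sentiments() can download the AFINN and NRC lexicons):

# One-time setup, only needed if the packages are not yet installed
install.packages(c("janeaustenr", "tidytext", "tidyverse",
                   "textdata", "lexicon", "wordcloud", "reshape2"))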

Lexicons Included with tidytext

We will take a look at the three lexicons accessible through tidytext’s get_sentiments():

  1. AFINN
  2. Bing
  3. NRC
head(get_sentiments("afinn"))
## # A tibble: 6 x 2
##   word       value
##   <chr>      <dbl>
## 1 abandon       -2
## 2 abandoned     -2
## 3 abandons      -2
## 4 abducted      -2
## 5 abduction     -2
## 6 abductions    -2
head(get_sentiments("bing"))
## # A tibble: 6 x 2
##   word       sentiment
##   <chr>      <chr>    
## 1 2-faces    negative 
## 2 abnormal   negative 
## 3 abolish    negative 
## 4 abominable negative 
## 5 abominably negative 
## 6 abominate  negative
head(get_sentiments("nrc"))
## # A tibble: 6 x 2
##   word      sentiment
##   <chr>     <chr>    
## 1 abacus    trust    
## 2 abandon   fear     
## 3 abandon   negative 
## 4 abandon   sadness  
## 5 abandoned anger    
## 6 abandoned fear

The three lexicons take somewhat different approaches to rating a word’s sentiment: AFINN assigns an integer score from -5 to 5, Bing labels each word as positive or negative, and NRC tags each word with one or more of ten categories (positive, negative, and eight emotions). We need to keep each one’s range of values in mind when analyzing data.
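
As a quick check of these differences (a minimal sketch, not part of the original chapter code), we can inspect each lexicon’s rating scheme directly:

# AFINN: integer scores
get_sentiments("afinn") %>% summarise(min = min(value), max = max(value))

# Bing: positive/negative labels
get_sentiments("bing") %>% count(sentiment)

# NRC: ten sentiment/emotion categories
get_sentiments("nrc") %>% count(sentiment)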

Sentiment Analysis

Let’s process the Jane Austen books

Let’s do some basic tidying on the corpus of Jane Austen books: we record line numbers and chapter starts, and the last step is to tokenize the text into one word per row.

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

NEW Custom Lexicon and Corpus

We will use the R package lexicon from Tyler Rinker at https://github.com/trinker/lexicon

library(lexicon)
## Warning: package 'lexicon' was built under R version 4.1.3
# available_data('English')

We have uploaded a few books to our server in the cloud:

  • Ulysses
  • The Great Gatsby
  • Scarlet Letter
  • War and Peace

For this assignment we will use The Scarlet Letter.

First we define the base URL and the .txt files available on our server.

url_1 <- "http://3.86.40.38/data607/"
url_2a <- "scarlet.txt"
url_2b <- "gatsby.txt"
url_2c <- "ulysses.txt"
url_2d <- "warandpeace.txt"

#We will start with Scarlet Letter
url12 <- paste0(url_1,url_2a)

Let’s read in the selected book and convert it into a data frame in tidy format.

corpus_txt <- read_lines(url12)
numberoflines <- length(corpus_txt)

corpus_df <- tibble(line = 1:numberoflines, text = corpus_txt)
tidy_corpus <- corpus_df %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  unnest_tokens(word, text)

New Corpus Sentiment using BING

Let’s test the new corpus using the standard Bing lexicon. We bin the text into sections of 80 lines (linenumber %/% 80) and compute each section’s net sentiment as the count of positive words minus the count of negative words.

corpus_sentiment <- tidy_corpus %>%
  inner_join(get_sentiments("bing")) %>%
  count(index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining, by = "word"

Let’s plot the results

library(ggplot2)

ggplot(corpus_sentiment, aes(index, sentiment)) +
  geom_col(show.legend = FALSE) 

Some Basic Analysis of the NEW Corpus

Now let’s use the NRC sentiment lexicon to assess the different sentiments represented across the new corpus.

tidy_corpus %>%
        right_join(get_sentiments("nrc")) %>%
        filter(!is.na(sentiment)) %>%
        count(sentiment, sort = TRUE)
## Joining, by = "word"
## # A tibble: 10 x 2
##    sentiment        n
##    <chr>        <int>
##  1 positive      6767
##  2 negative      6038
##  3 trust         3708
##  4 fear          3108
##  5 anticipation  3011
##  6 sadness       2934
##  7 joy           2728
##  8 anger         2357
##  9 disgust       1970
## 10 surprise      1457

New Lexicons from the R Package lexicon

Let’s take a look at the lexicons provided by the lexicon package. It offers several lexicons with words rated for sentiment, either as a simple polarity or on a numeric scale.
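
To see the full set of sentiment tables the package ships with, one option (a quick sketch using base R, not part of the original assignment) is to list the datasets whose names start with hash_sentiment:

# List all sentiment hash tables bundled with the lexicon package
grep("^hash_sentiment", data(package = "lexicon")$results[, "Item"], value = TRUE)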

#-1 OR +1
head(hash_sentiment_huliu)
##             x  y
## 1:     a plus  1
## 2:   abnormal -1
## 3:    abolish -1
## 4: abominable -1
## 5: abominably -1
## 6:  abominate -1
# head(hash_sentiment_nrc)

# From -1 to +1 in decimals
# head(hash_sentiment_senticnet)
#head(hash_sentiment_sentiword)
head(hash_sentiment_jockers)
##              x     y
## 1:     abandon -0.75
## 2:   abandoned -0.50
## 3:   abandoner -0.25
## 4: abandonment -0.25
## 5:    abandons -1.00
## 6:    abducted -1.00

As you can see above, these lexicons fall into two categories: those that rate words simply as -1 or +1, and those that give a decimal rating between -1 and +1.

For this exercise we will use hash_sentiment_huliu for comparisons to lexicons that, like Bing, have only positive or negative labels.

We will use hash_sentiment_jockers for comparisons to AFINN, which uses a numeric scale of sentiment ratings. A quick range check of both tables follows below.
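
As a sanity check (a minimal sketch, not part of the original chapter code), we can confirm the value range of each table; both store words in column x and ratings in column y:

# hash_sentiment_huliu should contain only -1 and +1
range(hash_sentiment_huliu$y)

# hash_sentiment_jockers should contain decimal values between -1 and +1
range(hash_sentiment_jockers$y)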

hash_words_Scale <- hash_sentiment_jockers %>%
    rename(word = x, sentiment = y)
hash_words_PosNeg <- hash_sentiment_huliu %>%
    rename(word = x, sentiment = y) %>%
    mutate(sentiment = replace(sentiment, sentiment == 1, "positive")) %>%
    mutate(sentiment = replace(sentiment, sentiment == -1, "negative"))

Basic analysis of new Lexicons

Let’s take a look at the Jockers lexicon using the new corpus.

joker_word_counts <- tidy_corpus %>%
  inner_join(hash_words_Scale) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
joker_word_counts
## # A tibble: 2,756 x 3
##    word      sentiment     n
##    <chr>         <dbl> <int>
##  1 child          0.6    197
##  2 like           0.5    146
##  3 good           0.75   121
##  4 new            0.8     96
##  5 well           0.8     77
##  6 better         0.8     65
##  7 physician     -0.25    65
##  8 great          0.5     61
##  9 work           0.25    60
## 10 young          0.4     58
## # ... with 2,746 more rows

Let’s take a look at the Hu & Liu (huliu) lexicon using the new corpus.

huliu_word_counts <- tidy_corpus %>%
  inner_join(hash_words_PosNeg) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
huliu_word_counts
## # A tibble: 1,609 x 3
##    word    sentiment     n
##    <chr>   <chr>     <int>
##  1 like    positive    146
##  2 good    positive    121
##  3 well    positive     77
##  4 better  positive     65
##  5 great   positive     61
##  6 work    positive     60
##  7 smile   positive     53
##  8 wild    negative     51
##  9 strange negative     48
## 10 poor    negative     44
## # ... with 1,599 more rows

Sentiment analysis of New corpus with BING

newcorpus_sentiment <- tidy_corpus %>%
  inner_join(get_sentiments("bing")) %>%
  count(index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining, by = "word"
ggplot(newcorpus_sentiment, aes(index, sentiment)) +
  geom_col(show.legend = FALSE)

Sentiment analysis of New corpus with HULIU

newcorpus_sentiment <- tidy_corpus %>%
  inner_join(hash_words_PosNeg) %>%
  count(index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining, by = "word"
ggplot(newcorpus_sentiment, aes(index, sentiment)) +
  geom_col(show.legend = FALSE)

What we can see is that the results are very similar; small differences between the plots do exist, but the overall agreement is not surprising, since the Bing lexicon in tidytext and hash_sentiment_huliu appear to derive from the same Hu and Liu opinion lexicon.
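
To quantify the similarity (a minimal sketch, not part of the original assignment; it assumes corpus_sentiment from the Bing section and the huliu-based newcorpus_sentiment above are still in memory), we can correlate the per-section sentiment scores:

# Correlate Bing and Hu & Liu net sentiment per 80-line section
inner_join(corpus_sentiment, newcorpus_sentiment,
           by = "index", suffix = c("_bing", "_huliu")) %>%
  summarise(correlation = cor(sentiment_bing, sentiment_huliu))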

Sentiment analysis of Jane Austen’s “Pride and Prejudice” corpus with all Lexicons

pride_prejudice <- tidy_books %>% 
  filter(book == "Pride & Prejudice")
afinn <- pride_prejudice %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")
## Joining, by = "word"
bing_and_nrc <- bind_rows(
  pride_prejudice %>% 
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>% 
    inner_join(get_sentiments("nrc") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
huliu <- pride_prejudice %>%
  inner_join(hash_words_PosNeg) %>%
  count(index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative, method="HULIU")
## Joining, by = "word"
joker <- pride_prejudice %>% 
  inner_join(hash_words_Scale) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(sentiment)) %>% 
  mutate(method = "JOKER")
## Joining, by = "word"

Let’s plot the whole thing now.

bind_rows(afinn, 
          bing_and_nrc,joker,huliu) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

Sentiment Analysis of the NEW corpus with all Lexicons

afinn <- tidy_corpus %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")
## Joining, by = "word"
bing_and_nrc <- bind_rows(
  tidy_corpus %>% 
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  tidy_corpus %>% 
    inner_join(get_sentiments("nrc") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
huliu <- tidy_corpus %>%
  inner_join(hash_words_PosNeg) %>%
  count(index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative, method="HULIU")
## Joining, by = "word"
joker <- tidy_corpus %>% 
  inner_join(hash_words_Scale) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(sentiment)) %>% 
  mutate(method = "JOKER")
## Joining, by = "word"

Let’s plot the whole thing now.

bind_rows(afinn, 
          bing_and_nrc,joker,huliu) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

Other Analysis

Most Common Positive and Negative words

Let’s check JOKER (the Jockers lexicon)

joker_word_counts <- tidy_corpus %>%
  inner_join(hash_words_Scale) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
joker_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)

Let’s check HULIU

huliu_word_counts <- tidy_corpus %>%
  inner_join(hash_words_PosNeg) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
huliu_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)

Wordcloud

library(wordcloud)

tidy_corpus %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))

Comparison Cloud

library(reshape2)

tidy_corpus %>%
  inner_join(hash_words_PosNeg) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)