Text Mining with R, Chapter 2, looks at sentiment analysis. The authors provide an example using the text of Jane Austen's six completed, published novels from the janeaustenr package. All of the code is credited to the authors unless otherwise noted.
library(tidyverse)   # dplyr, ggplot2, readr, tidyr, etc.
library(tidytext)    # unnest_tokens() and get_sentiments()
library(janeaustenr) # text of Jane Austen's novels
library(stringr)     # string helpers for chapter detection
library(jsonlite)    # fromJSON() for the NY Times API
library(glue)        # string interpolation for the API URL
library(lubridate)   # ymd_hms() and hour() for publication dates
library(ggrepel)     # non-overlapping labels on scatter plots
The authors take the text of the novels and convert it to the tidy format using unnest_tokens(). They also create additional columns to keep track of which line and chapter of the book each word comes from.
tidy_books <-
austen_books() |>
group_by(book) |>
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))) |>
ungroup() |>
unnest_tokens(word, text)
knitr::kable(head(tidy_books), caption = "Brief View of Tokenized Words")
book | linenumber | chapter | word |
---|---|---|---|
Sense & Sensibility | 1 | 0 | sense |
Sense & Sensibility | 1 | 0 | and |
Sense & Sensibility | 1 | 0 | sensibility |
Sense & Sensibility | 3 | 0 | by |
Sense & Sensibility | 3 | 0 | jane |
Sense & Sensibility | 3 | 0 | austen |
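The next step joins these tokens against the Bing lexicon from tidytext, which is simply a two-column table labeling each word as positive or negative. A quick look at its structure:
knitr::kable(head(get_sentiments("bing")), caption = "Brief View of the Bing Lexicon")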
Next, the authors count how many positive and negative words fall in defined sections of each book and compute a net sentiment score (positive minus negative). They define an index to keep track of where they are in the narrative; the integer division linenumber %/% 80 breaks each book into chunks of roughly 80 lines of text.
jane_austen_sentiment <-
tidy_books |>
inner_join(get_sentiments("bing")) |>
count(book, index = linenumber %/% 80, sentiment) |>
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
mutate(sentiment = positive - negative)
knitr::kable(head(jane_austen_sentiment), caption = "Brief View of Sentiment Scores by Indexing")
book | index | negative | positive | sentiment |
---|---|---|---|---|
Sense & Sensibility | 0 | 16 | 32 | 16 |
Sense & Sensibility | 1 | 19 | 53 | 34 |
Sense & Sensibility | 2 | 12 | 31 | 19 |
Sense & Sensibility | 3 | 15 | 31 | 16 |
Sense & Sensibility | 4 | 16 | 34 | 18 |
Sense & Sensibility | 5 | 16 | 51 | 35 |
Finally, they plot how each novel trends toward more positive or negative sentiment over the trajectory of the story.
jane_austen_sentiment |>
ggplot(aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")
The authors also build a proportion table to determine which chapter of each book has the highest ratio of negative words to total words.
bingnegative <-
get_sentiments("bing") |>
filter(sentiment == "negative")
wordcounts <- tidy_books |>
group_by(book, chapter) |>
summarize(words = n())
ratio_tbl <-
tidy_books |>
semi_join(bingnegative) |>
group_by(book, chapter) |>
summarize(negativewords = n()) |>
left_join(wordcounts, by = c("book", "chapter")) |>
mutate(ratio = negativewords/words) |>
filter(chapter != 0) |>
slice_max(ratio, n = 1) |>
ungroup()
knitr::kable(ratio_tbl)
book | chapter | negativewords | words | ratio |
---|---|---|---|---|
Sense & Sensibility | 43 | 161 | 3405 | 0.0472834 |
Pride & Prejudice | 34 | 111 | 2104 | 0.0527567 |
Mansfield Park | 46 | 173 | 3685 | 0.0469471 |
Emma | 15 | 151 | 3340 | 0.0452096 |
Northanger Abbey | 21 | 149 | 2982 | 0.0499665 |
Persuasion | 4 | 62 | 1807 | 0.0343110 |
Note: All work from this point forward has been created by me.
Let's look at the NY Times articles published in March 2023, focusing primarily on the lead paragraph of each article. The goal is to get an idea of what kind of sentiment appears across the different sections the paper offers, such as Arts, U.S., and Sports. I will also look at whether certain times of day lend themselves more to positive or negative sentiment.
# Pull the March 2023 article archive from the NY Times Archive API;
# the API key is requested interactively so it is never stored in the script
api_cnxn <-
  fromJSON(glue("https://api.nytimes.com/svc/archive/v1/2023/3.json?api-key={rstudioapi::askForPassword('Enter NY Times API Key')}"), flatten = TRUE)
ny_times <-
  as.data.frame(api_cnxn) |>
  janitor::clean_names()
# Cache the pull locally so the API does not have to be called on every run
write_csv(ny_times, 'ny_times.csv')
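With the archive cached, later runs can skip the API call and simply read the saved file back in (a minimal sketch, assuming ny_times.csv sits in the working directory):
# Reload the cached archive instead of re-querying the API
ny_times <- read_csv('ny_times.csv')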
Next, keep only the publication date, section name, and lead paragraph, clean up the date formatting, and extract the hour each article was published.
section_df <-
ny_times |>
select(response_docs_pub_date, response_docs_section_name, response_docs_lead_paragraph) |>
mutate(response_docs_pub_date = str_extract(response_docs_pub_date, "[:graph:]*(?=\\+)")) |>
rename(pub_date = response_docs_pub_date, lead_paragraph = response_docs_lead_paragraph, section = response_docs_section_name)
section_df$pub_date <-
section_df$pub_date |>
ymd_hms()
section_df$hour <-
section_df$pub_date |>
hour()
tokenize_df <-
section_df |>
unnest_tokens(word, lead_paragraph)
knitr::kable(head(tokenize_df))
pub_date | section | hour | word |
---|---|---|---|
2023-03-01 00:00:07 | Opinion | 0 | to |
2023-03-01 00:00:07 | Opinion | 0 | president |
2023-03-01 00:00:07 | Opinion | 0 | emmanuel |
2023-03-01 00:00:07 | Opinion | 0 | macron |
2023-03-01 00:00:07 | Opinion | 0 | of |
2023-03-01 00:00:07 | Opinion | 0 | france |
Setting aside the sections with only one or two matched words, the U.S. section has the most negative overall sentiment at roughly -71%. On the positive side, Food leads at about 42%, while Arts, which has the most matched positive words of any section, comes in at about 22%.
sentiment_df <-
tokenize_df |>
inner_join(get_sentiments("bing")) |>
count(section, sentiment) |>
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
mutate(total_words = negative + positive,
ovr_sentiment = positive - negative,
pct = round(ovr_sentiment/total_words *100, 2))
knitr::kable(sentiment_df, caption = "Overall Sentiment based on Section")
section | negative | positive | total_words | ovr_sentiment | pct |
---|---|---|---|---|---|
Admin | 1 | 0 | 1 | -1 | -100.00 |
Arts | 311 | 484 | 795 | 173 | 21.76 |
Books | 365 | 354 | 719 | -11 | -1.53 |
Briefing | 130 | 87 | 217 | -43 | -19.82 |
Business Day | 458 | 311 | 769 | -147 | -19.12 |
Climate | 28 | 29 | 57 | 1 | 1.75 |
Corrections | 51 | 32 | 83 | -19 | -22.89 |
Crosswords & Games | 104 | 135 | 239 | 31 | 12.97 |
Education | 11 | 2 | 13 | -9 | -69.23 |
Fashion & Style | 5 | 5 | 10 | 0 | 0.00 |
Food | 71 | 173 | 244 | 102 | 41.80 |
Health | 105 | 39 | 144 | -66 | -45.83 |
Magazine | 151 | 151 | 302 | 0 | 0.00 |
Movies | 290 | 328 | 618 | 38 | 6.15 |
New York | 189 | 166 | 355 | -23 | -6.48 |
Obituaries | 7 | 8 | 15 | 1 | 6.67 |
Opinion | 456 | 446 | 902 | -10 | -1.11 |
Podcasts | 39 | 56 | 95 | 17 | 17.89 |
Real Estate | 56 | 79 | 135 | 23 | 17.04 |
Science | 74 | 48 | 122 | -26 | -21.31 |
Smarter Living | 0 | 2 | 2 | 2 | 100.00 |
Sports | 161 | 283 | 444 | 122 | 27.48 |
Style | 114 | 225 | 339 | 111 | 32.74 |
T Brand | 0 | 1 | 1 | 1 | 100.00 |
T Magazine | 76 | 124 | 200 | 48 | 24.00 |
Technology | 40 | 67 | 107 | 27 | 25.23 |
The Learning Network | 65 | 79 | 144 | 14 | 9.72 |
The Upshot | 5 | 9 | 14 | 4 | 28.57 |
Theater | 98 | 121 | 219 | 23 | 10.50 |
Times Insider | 18 | 17 | 35 | -1 | -2.86 |
Travel | 38 | 49 | 87 | 11 | 12.64 |
U.S. | 2672 | 447 | 3119 | -2225 | -71.34 |
Video | 1 | 0 | 1 | -1 | -100.00 |
Well | 52 | 39 | 91 | -13 | -14.29 |
World | 648 | 349 | 997 | -299 | -29.99 |
Your Money | 4 | 4 | 8 | 0 | 0.00 |
sentiment_df |>
ggplot(aes(x = negative, y = positive, label = section)) +
geom_point() +
geom_label_repel(box.padding = 0.35) +
xlim(0, 3000) +
ylim(0, 3000)
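One optional refinement of the plot above is a y = x reference line: sections below the dashed line have more negative than positive matched words, and sections above it skew positive.
# Same scatter as above, with a y = x reference line added
sentiment_df |>
  ggplot(aes(x = negative, y = positive, label = section)) +
  geom_point() +
  geom_label_repel(box.padding = 0.35) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  xlim(0, 3000) +
  ylim(0, 3000)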
When categorizing by time of day, the 0600 hour has the most positive lead-paragraph sentiment at about 31%, while the 1200 hour is by far the most negative at about -88%.
pub_date_df <-
  tokenize_df |>
  inner_join(get_sentiments("bing")) |>
  count(hour, sentiment) |>
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  mutate(total_words = negative + positive,
         ovr_sentiment = positive - negative,
         pct = round(ovr_sentiment/total_words * 100, 2))
knitr::kable(pub_date_df, caption = "Overall Sentiment based on Hour")
hour | negative | positive | total_words | ovr_sentiment | pct |
---|---|---|---|---|---|
0 | 153 | 124 | 277 | -29 | -10.47 |
1 | 77 | 53 | 130 | -24 | -18.46 |
2 | 77 | 66 | 143 | -11 | -7.69 |
3 | 68 | 57 | 125 | -11 | -8.80 |
4 | 107 | 115 | 222 | 8 | 3.60 |
5 | 84 | 93 | 177 | 9 | 5.08 |
6 | 46 | 87 | 133 | 41 | 30.83 |
7 | 163 | 164 | 327 | 1 | 0.31 |
8 | 80 | 78 | 158 | -2 | -1.27 |
9 | 749 | 836 | 1585 | 87 | 5.49 |
10 | 554 | 485 | 1039 | -69 | -6.64 |
11 | 196 | 170 | 366 | -26 | -7.10 |
12 | 2106 | 140 | 2246 | -1966 | -87.53 |
13 | 190 | 195 | 385 | 5 | 1.30 |
14 | 209 | 208 | 417 | -1 | -0.24 |
15 | 326 | 303 | 629 | -23 | -3.66 |
16 | 247 | 258 | 505 | 11 | 2.18 |
17 | 230 | 225 | 455 | -5 | -1.10 |
18 | 173 | 226 | 399 | 53 | 13.28 |
19 | 239 | 232 | 471 | -7 | -1.49 |
20 | 212 | 177 | 389 | -35 | -9.00 |
21 | 221 | 190 | 411 | -31 | -7.54 |
22 | 204 | 132 | 336 | -72 | -21.43 |
23 | 183 | 135 | 318 | -48 | -15.09 |
pub_date_df |>
  ggplot(aes(x = hour, y = pct)) +
  geom_col()
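Rather than scanning the table for the extremes, slice_max() and slice_min() on the pct column pull them out of pub_date_df directly:
# Hour with the most positive lead-paragraph sentiment
pub_date_df |>
  slice_max(pct, n = 1)
# Hour with the most negative lead-paragraph sentiment
pub_date_df |>
  slice_min(pct, n = 1)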
While researching other sentiment lexicons R has to offer, I came across the lexicon package. It contains many different lookup tables, one of which is hash_sentiment_senticnet: a data.table holding an augmented version of Cambria, Poria, Bajpai, & Schuller's (2016) positive/negative word list as sentiment lookup values. Further details are in the lexicon package documentation.
library(lexicon)
knitr::kable(head(hash_sentiment_senticnet), caption = 'Glance of Lookup Values')
x | y |
---|---|
aaa | 0.606 |
aah | -0.510 |
abandon | -0.560 |
abandonment | -0.650 |
abase | -0.580 |
abasement | -0.580 |
What we see is that the relationship between positive and negative word counts skews much more heavily toward positive words in every section. This is because I simply categorized each word as positive or negative from the sign of its score, without accounting for its strength value (a score-weighted alternative is sketched after the plot below). Even so, the U.S. section still leads all sections in total negative words.
lexicon_df <-
hash_sentiment_senticnet |>
rename(word = x, value = y)
lex_sentiment_df <-
tokenize_df |>
inner_join(lexicon_df)
lex_section_df <-
  lex_sentiment_df |>
  # Collapse each continuous score to a simple positive/negative label
  mutate(type = case_when(value < 0 ~ "negative",
                          value > 0 ~ "positive")) |>
  group_by(section, type) |>
  summarise(total_words = n()) |>
  pivot_wider(names_from = type, values_from = total_words) |>
  # Sections with no matched negative words (e.g., T Brand) end up with NA totals
  mutate(total_words = negative + positive)
knitr::kable(lex_section_df, caption = "Overall Sentiment based on Section")
section | negative | positive | total_words |
---|---|---|---|
Admin | 2 | 1 | 3 |
Arts | 1158 | 3207 | 4365 |
Books | 923 | 2238 | 3161 |
Briefing | 339 | 925 | 1264 |
Business Day | 1201 | 3174 | 4375 |
Climate | 124 | 298 | 422 |
Corrections | 146 | 435 | 581 |
Crosswords & Games | 344 | 1031 | 1375 |
Education | 15 | 35 | 50 |
Fashion & Style | 23 | 38 | 61 |
Food | 380 | 1154 | 1534 |
Headway | 7 | 2 | 9 |
Health | 258 | 494 | 752 |
Magazine | 518 | 1088 | 1606 |
Movies | 931 | 2067 | 2998 |
New York | 697 | 1796 | 2493 |
Obituaries | 16 | 35 | 51 |
Opinion | 1441 | 3372 | 4813 |
Podcasts | 147 | 414 | 561 |
Reader Center | 5 | 12 | 17 |
Real Estate | 320 | 1010 | 1330 |
Science | 223 | 460 | 683 |
Smarter Living | 2 | 16 | 18 |
Special Series | 11 | 8 | 19 |
Sports | 818 | 1828 | 2646 |
Style | 596 | 1667 | 2263 |
T Brand | NA | 4 | NA |
T Magazine | 355 | 1027 | 1382 |
Technology | 245 | 702 | 947 |
The Learning Network | 257 | 852 | 1109 |
The New York Times Presents | 1 | 4 | 5 |
The Upshot | 29 | 113 | 142 |
Theater | 345 | 797 | 1142 |
Times Insider | 70 | 159 | 229 |
Travel | 175 | 444 | 619 |
U.S. | 4121 | 10856 | 14977 |
Video | 6 | 5 | 11 |
Well | 175 | 350 | 525 |
World | 1924 | 4218 | 6142 |
Your Money | 13 | 30 | 43 |
lex_section_df |>
ggplot(aes(x = negative, y = positive, label = section)) +
geom_point() +
geom_label_repel(box.padding = 0.35) +
xlim(0, 11000) +
ylim(0, 11000)
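Since the senticnet scores carry a strength value that the simple positive/negative split throws away, one alternative is to weight each word by its score rather than just counting it. A rough sketch using the lex_sentiment_df built above (the lex_weighted_df name is my own):
# Summarize each section by the sum and mean of its senticnet scores
lex_weighted_df <-
  lex_sentiment_df |>
  group_by(section) |>
  summarise(matched_words = n(),
            total_score = sum(value),
            avg_score = mean(value)) |>
  arrange(avg_score)
knitr::kable(head(lex_weighted_df), caption = "Sections with the Lowest Average SenticNet Score")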
Depending on which sentiment lexicon is used, the individual scores can change drastically; however, the general tendencies are the same, in that the U.S. section of the NY Times tends to have the most negative sentiment words.
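As a quick check of that claim, the negative word counts under the two lexicons can be placed side by side by joining the section summaries built earlier (a small sketch; the lexicon_compare name is my own):
# Negative word counts per section under the Bing and SenticNet lexicons
lexicon_compare <-
  sentiment_df |>
  select(section, bing_negative = negative) |>
  inner_join(lex_section_df |>
               select(section, senticnet_negative = negative),
             by = "section") |>
  arrange(desc(bing_negative))
knitr::kable(head(lexicon_compare), caption = "Most Negative Sections Under Each Lexicon")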