Initial Steps

Install and Load Packages

# install.packages(c("tidyverse", "tidytext", "gutenbergr", "textdata"))
# textdata is not loaded below, but tidytext uses it later to fetch the AFINN sentiment lexicon.

library(tidyverse)
library(tidytext)
library(gutenbergr)

Download Data

austen <- gutenberg_download(1342) %>%
  mutate(line = row_number())
gutenberg_id  text                                                                      line
1342          THERE IS AN ILLUSTRATED EDITION OF THIS TITLE WHICH MAY VIEWED AT EBOOK   1
1342          [# 42671 ]                                                                2
1342                                                                                    3
1342          cover                                                                     4
1342                                                                                    5
1342                                                                                    6

Tokenize Data

austen_tokens <- austen %>%
  unnest_tokens(input = text, output = word, token = "words")
gutenberg_id  line  word
1342          1     there
1342          1     is
1342          1     an
1342          1     illustrated
1342          1     edition
1342          1     of

Activity 1

Try downloading and tokenizing a copy of another book from Project Gutenberg.

Save the results as my_document and my_words.
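
If you get stuck, here is a minimal sketch of one way to do it. The ID 11 below is only a placeholder, so swap in the ID of whichever book you chose (you can browse IDs on gutenberg.org or look them up with gutenberg_works()).

my_document <- gutenberg_download(11) %>%
  mutate(line = row_number())

my_words <- my_document %>%
  unnest_tokens(input = text, output = word, token = "words")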

Word Frequencies

counts <- austen_tokens %>%
  group_by(word) %>%
  tally() %>%
  arrange(-n)
word  n
the   4333
to    4163
of    3612
and   3586
her   2202
i     2047

Cleaning Data

We can use a regular expression to strip out any characters that are not letters.

tokens_clean <- austen_tokens %>%
  mutate(word = str_remove_all(string = word, pattern = "[^a-zA-Z]"))

counts_clean <- tokens_clean %>%
  group_by(word) %>%
  tally() %>%
  arrange(-n)
word  n
the   4333
to    4164
of    3613
and   3586
her   2226
i     2069

Activity 2

Take a few minutes to find the top 10 most frequent words from your book (my_document).

What are they? Put a few of them in the chat.
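
One possible sketch, assuming the my_words object from Activity 1; it mirrors the counting pipeline used above.

my_counts <- my_words %>%
  group_by(word) %>%
  tally() %>%
  arrange(-n)

head(my_counts, 10)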

Removing Stopwords

data(stop_words)
word         lexicon
a            SMART
a's          SMART
able         SMART
about        SMART
above        SMART
according    SMART
accordingly  SMART
across       SMART
actually     SMART
after        SMART

my_stop_words <- stop_words$word

counts_filtered <- counts_clean %>%
  filter(!(word %in% my_stop_words))
word       n
elizabeth  596
darcy      374
bennet     295
miss       283
jane       264
bingley    258
time       203
lady       183
sister     180
wickham    162

Adding Our Own Stopwords

my_stop_words <- c(my_stop_words, "", "chapter", "darcy")

counts_filtered <- counts_clean %>%
  filter(!(word %in% my_stop_words))
word       n
elizabeth  596
bennet     295
miss       283
jane       264
bingley    258
time       203
lady       183
sister     180
wickham    162
dear       158

Activity 3

Add some custom stopwords to your own list.

What are they? What new words made it into your top words after this?
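
A sketch of one approach, reusing my_counts from the Activity 2 sketch; the extra stopwords here are only placeholders for whatever dominates your own list.

my_stop_words <- c(stop_words$word, "", "chapter", "thee", "thou")

my_counts_filtered <- my_counts %>%
  filter(!(word %in% my_stop_words))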

Saving Results

# This assumes a "data" folder already exists in your working directory.
write_csv(counts_filtered, "data/counts_filtered.csv")

Activity 4

Save your word frequencies to a CSV on your computer.

Sentiment Analysis

We start with a basic, word-level sentiment analysis: each word is scored against a dictionary, and negating words are ignored (so a phrase like "not happy" still counts as positive).

Let’s begin by loading in a sentiment dictionary.

# The first time this runs, you will be prompted (via the textdata package) to download the AFINN lexicon.
sentiments <- get_sentiments("afinn")
word        value
abandon     -2
abandoned   -2
abandons    -2
abducted    -2
abduction   -2
abductions  -2
abhor       -3
abhorred    -3
abhorrent   -3
abhors      -3

Let’s combine the book into a single text, break it into sentences, and then break each sentence into words.

austen_text <- tibble(text = paste(austen$text, collapse = " "))


austen_sentences <- austen_text %>%
  unnest_tokens(input = text, output = sentence, token = "sentences") %>%
  mutate(line = row_number())

sentence_words <- austen_sentences %>%
  unnest_tokens(input = sentence, output = word, token = "words")

Now that we have a sentence-by-sentence, word-by-word breakdown, we can compute an average sentiment score for each sentence.

word_sentiments <- sentence_words %>%
  inner_join(sentiments, by = "word") # keeps only the words that appear in the AFINN dictionary

sentence_sentiments <- word_sentiments %>%
  group_by(line) %>%
  summarize(mean_sentiment = mean(value, na.rm = TRUE))

Plotting Sentiment

ggplot(sentence_sentiments, aes(x = line, y = mean_sentiment)) +
  geom_col() +
  geom_smooth()

Activity 5

Plot the sentiment as it changes throughout your book.
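
A sketch of one way to do this, assuming my_document from Activity 1 and the sentiments dictionary loaded above; it simply repeats the pipeline we built for Pride and Prejudice with new object names.

my_text <- tibble(text = paste(my_document$text, collapse = " "))

my_sentences <- my_text %>%
  unnest_tokens(input = text, output = sentence, token = "sentences") %>%
  mutate(line = row_number())

my_sentence_sentiments <- my_sentences %>%
  unnest_tokens(input = sentence, output = word, token = "words") %>%
  inner_join(sentiments, by = "word") %>%
  group_by(line) %>%
  summarize(mean_sentiment = mean(value, na.rm = TRUE))

ggplot(my_sentence_sentiments, aes(x = line, y = mean_sentiment)) +
  geom_col() +
  geom_smooth()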

Getting a little less granular

We can use %/% for integer division, which lets us break the book up into sections of 50 sentences each (remember that our line column now numbers sentences).
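
As a quick illustration of how %/% behaves:

123 %/% 50   # returns 2, so sentences 100 through 149 all land in section 2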

section_sentiments <- word_sentiments %>%
  mutate(section = line %/% 50) %>%
  group_by(section) %>%
  summarize(mean_sentiment = mean(value, na.rm = TRUE))

ggplot(section_sentiments, aes(x = section, y = mean_sentiment)) +
  geom_col() +
  geom_smooth()

Acknowledgement

This tutorial borrows very heavily from Text Mining with R:

Silge, Julia, and David Robinson. Text Mining with R: A Tidy Approach. 1st edition. Beijing; Boston: O’Reilly Media, 2017. https://www.tidytextmining.com/index.html.