If you are an analyst, I think you have mostly learned to work with data that is numeric. But now your company wants to extract and classify information from text, such as tweets, emails, product reviews, and survey responses. This is where the tidytext R package by Silge and Robinson comes in, giving us a great toolkit for generating insights from unstructured, text-heavy data.
library(gutenbergr) # For downloading books from Project Gutenberg
library(tidyverse) # For ggplot, dplyr, etc.
library(tidytext) # For generating insights from the literature, news and social media
library(cleanNLP) # For NLP analysis
We use the gutenbergr package to download some books directly from Project Gutenberg. The ID for each book comes from the URL of its page on the Project Gutenberg website.
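If you don't know a book's ID, you can also search the Project Gutenberg metadata from R. A quick sketch (the title filter here is just an illustration):
# Look up the ID by searching the metadata that ships with gutenbergr
gutenberg_works(str_detect(title, "Dorian Gray"))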
# 174 The Picture of Dorian Gray by Oscar Wilde
The_Picture_of_Dorian_Gray_original <- gutenberg_download(174, meta_fields = "title")
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
write_csv(The_Picture_of_Dorian_Gray_original, "data/The_Picture_of_Dorian_Gray_original.csv")
read_csv("data/The_Picture_of_Dorian_Gray_original.csv")
## Parsed with column specification:
## cols(
## gutenberg_id = col_double(),
## text = col_character(),
## title = col_character()
## )
## # A tibble: 8,498 x 3
## gutenberg_id text title
## <dbl> <chr> <chr>
## 1 174 The Picture of Dorian Gray The Picture of Dorian Gray
## 2 174 <NA> The Picture of Dorian Gray
## 3 174 by The Picture of Dorian Gray
## 4 174 <NA> The Picture of Dorian Gray
## 5 174 Oscar Wilde The Picture of Dorian Gray
## 6 174 <NA> The Picture of Dorian Gray
## 7 174 <NA> The Picture of Dorian Gray
## 8 174 <NA> The Picture of Dorian Gray
## 9 174 <NA> The Picture of Dorian Gray
## 10 174 THE PREFACE The Picture of Dorian Gray
## # ... with 8,488 more rows
The book we get from Project Gutenberg comes in a tidy format, with a column for the book ID, a column for the title, and a column for the text. That is not always the case, though; it depends on how the book is formatted at Project Gutenberg, so make sure to check your data.
Let's look at the book we just downloaded:
head(The_Picture_of_Dorian_Gray_original)
## # A tibble: 6 x 3
## gutenberg_id text title
## <int> <chr> <chr>
## 1 174 "The Picture of Dorian Gray" The Picture of Dorian Gray
## 2 174 "" The Picture of Dorian Gray
## 3 174 "by" The Picture of Dorian Gray
## 4 174 "" The Picture of Dorian Gray
## 5 174 "Oscar Wilde" The Picture of Dorian Gray
## 6 174 "" The Picture of Dorian Gray
If we look at the data with view(), we can see the whole dataset; the first 57 rows are the table of contents and other front matter.
Because I want to group by chapter in the later analysis, I use the cumsum() function to number the chapters.
The_Picture_of_Dorian_Gray <- The_Picture_of_Dorian_Gray_original %>%
# The actual book doesn't start until line 58
slice(58:n()) %>%
# Get rid of rows where text is missing
drop_na(text) %>%
# Chapters start with CHAPTER X, so mark if each row is a chapter start
# cumsum() calculates the cumulative sum, so it'll increase every time
# there's a new chapter and automatically make chapter numbers
mutate(chapter = str_detect(text, "^CHAPTER"),
chapter_number = cumsum(chapter)) %>%
# Remove columns we don't need
select(-gutenberg_id, -title, -chapter)
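Instead of hard-coding row 58 as above, you could also find where the body starts programmatically. A small sketch, assuming the first line matching the chapter pattern really is the start of the text (i.e. the front matter doesn't contain its own "CHAPTER" lines):
# Hypothetical alternative to slice(58:n()): locate the first chapter heading
first_chapter_row <- The_Picture_of_Dorian_Gray_original %>%
mutate(row = row_number()) %>%
filter(str_detect(text, "^CHAPTER")) %>%
slice(1) %>%
pull(row)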
word_frequencies <- The_Picture_of_Dorian_Gray %>%
# The unnest_tokens() function from tidytext splits the text into tokens
# (words, bigrams, paragraphs, etc.), putting each token in its own row
unnest_tokens(word, text) %>%
# Remove stop words
anti_join(stop_words) %>%
# Use str_extract() here because the UTF-8 encoded texts from
# Project Gutenberg use underscores around some words to indicate
# emphasis (like italics); without this, "_any_" would be counted
# separately from "any", which we don't want
mutate(word = str_extract(word, "[a-z']+")) %>%
# Count all the words
count(word, sort = TRUE)
## Joining, by = "word"
word_frequencies %>%
# Keep top 15
top_n(15) %>%
# Make the words an ordered factor so they plot in order
mutate(word = fct_inorder(word)) %>%
ggplot(aes(x = n, y = word))+
geom_col()
## Selecting by n
The_Picture_of_Dorian_Gray_bigrams <- The_Picture_of_Dorian_Gray %>%
# token = "ngrams" with n = 2 gives bigrams; n = 3 would give trigrams, and so on
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
# Split the bigrams into two words so we can remove stopwords
separate(bigram, c("w1","w2"), sep = " ") %>%
filter(!w1 %in% stop_words$word,
!w2 %in% stop_words$word) %>%
# Put the two word columns back together
unite(bigram, w1, w2, sep = " ")
bigram_frequencies <- The_Picture_of_Dorian_Gray_bigrams %>%
# Count all the bigrams
count(bigram, sort = TRUE)
bigram_frequencies %>%
top_n(15) %>%
mutate(bigram = fct_inorder(bigram)) %>%
ggplot(aes(x = n, y = bigram))+
geom_col() +
labs(y = "Count", x = NULL,
title = "15 most frequent bigrams")
## Selecting by n
We will use the cleanNLP package for this analysis. cleanNLP supports several taggers:
cnlp_init_udpipe(): an R-only tagger that works without installing anything extra (a little slower than the others, but requires no extra steps!)
cnlp_init_spacy(): uses spaCy (if you've installed it on your computer with Python)
cnlp_init_corenlp(): uses Stanford's NLP library (if you've installed it on your computer with Java)
# For the tagger to work, each row needs to be unique, which means we need to
# combine all the text into individual chapter-based rows. This takes a little
# bit of text-wrangling with dplyr:
The_Picture_of_Dorian_Gray_tag <- The_Picture_of_Dorian_Gray %>%
# Group by chapter number
group_by(chapter_number) %>%
# Take all the rows in each chapter and collapse them into a single cell
nest(data = c(text)) %>%
ungroup() %>%
# Look at each individual cell full of text lines and paste them together into
# one really long string of text per chapter
mutate(text = map_chr(data, ~paste(.$text, collapse = " "))) %>%
# Get rid of this column
select(-data)
The_Picture_of_Dorian_Gray_tag
## # A tibble: 20 x 2
## chapter_number text
## <int> <chr>
## 1 1 "CHAPTER 1 The studio was filled with the rich odour of rose~
## 2 2 "CHAPTER 2 As they entered they saw Dorian Gray. He was sea~
## 3 3 "CHAPTER 3 At half-past twelve next day Lord Henry Wotton st~
## 4 4 "CHAPTER 4 One afternoon, a month later, Dorian Gray was rec~
## 5 5 "CHAPTER 5 \"Mother, Mother, I am so happy!\" whispered the ~
## 6 6 "CHAPTER 6 \"I suppose you have heard the news, Basil?\" sai~
## 7 7 "CHAPTER 7 For some reason or other, the house was crowded t~
## 8 8 "CHAPTER 8 It was long past noon when he awoke. His valet h~
## 9 9 "CHAPTER 9 As he was sitting at breakfast next morning, Basi~
## 10 10 "CHAPTER 10 When his servant entered, he looked at him stead~
## 11 11 "CHAPTER 11 For years, Dorian Gray could not free himself fr~
## 12 12 "CHAPTER 12 It was on the ninth of November, the eve of his ~
## 13 13 "CHAPTER 13 He passed out of the room and began the ascent, ~
## 14 14 "CHAPTER 14 At nine o'clock the next morning his servant cam~
## 15 15 "CHAPTER 15 That evening, at eight-thirty, exquisitely dress~
## 16 16 "CHAPTER 16 A cold rain began to fall, and the blurred stree~
## 17 17 "CHAPTER 17 A week later Dorian Gray was sitting in the cons~
## 18 18 "CHAPTER 18 The next day he did not leave the house, and, in~
## 19 19 "CHAPTER 19 \"There is no use your telling me that you are g~
## 20 20 "CHAPTER 20 It was a lovely night, so warm that he threw his~
# Use the built-in R-based tagger
cnlp_init_udpipe()
The_Picture_of_Dorian_Gray_tagged <- cnlp_annotate(The_Picture_of_Dorian_Gray_tag,
text_name = "text",
doc_name = "chapter_number")
write_csv(The_Picture_of_Dorian_Gray_tagged$token, "data/The_Picture_of_Dorian_Gray_tagged.csv")
The_Picture_of_Dorian_Gray_tagged <- read_csv("data/The_Picture_of_Dorian_Gray_tagged.csv")
## Parsed with column specification:
## cols(
## doc_id = col_double(),
## sid = col_double(),
## tid = col_double(),
## token = col_character(),
## token_with_ws = col_character(),
## lemma = col_character(),
## upos = col_character(),
## xpos = col_character(),
## feats = col_character(),
## tid_source = col_double(),
## relation = col_character()
## )
# Find all proper nouns
proper_nouns <- The_Picture_of_Dorian_Gray_tagged %>%
filter(upos == "PROPN")
main_characters_by_chapter <- proper_nouns %>%
# Find only Dorian , Hallward, Wotton, Sibyl, James and Alan
filter(lemma %in% c("Dorian", "Hallward", "Wotton",
"Sibyl","James", "Alan")) %>%
# Group by chapter and character name
group_by(doc_id, lemma) %>%
# Get the count of mentions
summarize(n = n()) %>%
# Make a new column named "name" that is an ordered factor of the names
mutate(name = factor(lemma, levels = c("Dorian", "Hallward", "Wotton",
"Sibyl","James", "Alan"), ordered = TRUE)) %>%
# Rename this so it's called chapter
rename(chapter = doc_id) %>%
# Group by chapter
group_by(chapter) %>%
# Calculate the proportion of each name in each chapter
mutate(prop = n / sum(n)) %>%
ungroup() %>%
# Make a cleaner chapter name column
mutate(chapter_name = paste("Chapter", chapter)) %>%
mutate(chapter_name = fct_inorder(chapter_name))
## `summarise()` regrouping output by 'doc_id' (override with `.groups` argument)
main_characters_by_chapter
## # A tibble: 63 x 6
## chapter lemma n name prop chapter_name
## <dbl> <chr> <int> <ord> <dbl> <fct>
## 1 1 Dorian 17 Dorian 0.5 Chapter 1
## 2 1 Hallward 16 Hallward 0.471 Chapter 1
## 3 1 Wotton 1 Wotton 0.0294 Chapter 1
## 4 2 Dorian 31 Dorian 0.585 Chapter 2
## 5 2 Hallward 17 Hallward 0.321 Chapter 2
## 6 2 Wotton 5 Wotton 0.0943 Chapter 2
## 7 3 Dorian 7 Dorian 0.778 Chapter 3
## 8 3 Hallward 1 Hallward 0.111 Chapter 3
## 9 3 Wotton 1 Wotton 0.111 Chapter 3
## 10 4 Dorian 18 Dorian 0.562 Chapter 4
## # ... with 53 more rows
ggplot(main_characters_by_chapter, aes(x = prop, y = 1, fill = fct_rev(name))) +
geom_col(position = position_stack()) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
scale_fill_viridis_d(option = "plasma", end = 0.9, name = NULL) +
guides(fill = guide_legend(reverse = TRUE)) +
labs(x = NULL, y = NULL,
title = "Proportion of mentions of each character per chapter") +
facet_wrap(vars(chapter_name), nrow = 5) +
theme(legend.position = "top",
axis.text = element_blank(),
axis.ticks = element_blank(),
strip.background = element_rect(fill = "white"),
legend.text = element_text(face = "bold", size = rel(1)),
plot.title = element_text(face = "bold", hjust = 0.5, size = rel(1.7)),
plot.subtitle = element_text(hjust = 0.5, size = rel(1.1)))
Dorian clearly dominates the book. This is only the beginning of a long road with NLP, and I need more time to dig deeper into it. Once the data has been tagged, though, we can do pretty much anything we want with it.
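For example, the same tagged token table can answer other questions. A quick, purely illustrative sketch that counts the most common adjective lemmas in the book ("ADJ" is the universal part-of-speech tag for adjectives):
# Count the most frequent adjectives (by lemma) across the whole book
The_Picture_of_Dorian_Gray_tagged %>%
filter(upos == "ADJ") %>%
count(lemma, sort = TRUE) %>%
top_n(10)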
Another useful measure is tf-idf, which highlights the words that are most distinctive to each document in a collection:
\[
\begin{aligned}
tf(\text{term}) &= \frac{n_{\text{term}}}{n_{\text{terms in document}}} \\
idf(\text{term}) &= \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)} \\
tf\text{-}idf(\text{term}) &= tf(\text{term}) \times idf(\text{term})
\end{aligned}
\]
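As a quick sanity check on these formulas, here is a tiny sketch using made-up toy counts (hypothetical data, not from the books) that computes tf-idf by hand and compares it with tidytext's bind_tf_idf():
# Toy term counts for two invented documents
toy <- tribble(
~document, ~word, ~n,
"A", "dorian", 3,
"A", "picture", 1,
"B", "picture", 2,
"B", "happy", 4
)
# tf-idf by hand, following the definitions above
toy %>%
group_by(document) %>%
mutate(tf = n / sum(n)) %>%
group_by(word) %>%
mutate(idf = log(n_distinct(toy$document) / n_distinct(document)),
tf_idf = tf * idf) %>%
ungroup()
# The same thing with tidytext
toy %>%
bind_tf_idf(word, document, n)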
To do this analysis we need more than one document, so let's add some other books by Oscar Wilde:
# 844: The Importance of Being Earnest: A Trivial Comedy for Serious People
b2 <- gutenberg_download(844, meta_fields = "title")
b2_clean <- b2 %>%
slice(5:n()) %>%
drop_na(text)
# 902: The Happy Prince, and Other Tales
b3 <- gutenberg_download(902, meta_fields = "title")
b3_clean <- b3 %>%
slice(63:n()) %>%
drop_na(text)
# 885: An Ideal Husband
b4 <- gutenberg_download(885, meta_fields = "title")
b4_clean <- b4 %>%
slice(29:n()) %>%
drop_na(text)
# Use bind_rows() from dplyr to combine multiple data frames by row
Oscar_Wilde_books <- bind_rows(b2_clean, b3_clean, b4_clean)
book_words <- Oscar_Wilde_books %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>%
# Use str_extract() here because the UTF-8 encoded texts from
# Project Gutenberg use underscores around some words to indicate
# emphasis (like italics); without this, "_any_" would be counted
# separately from "any", which we don't want
mutate(word = str_extract(word, "[a-z']+")) %>%
count(title, word, sort = TRUE)
## Joining, by = "word"
# find the words most distinctive to each document
book_words_tf_idf <- book_words %>%
bind_tf_idf(word, title, n)
# Get the 10 most distinctive words for each book
book_words_10 <- book_words_tf_idf %>%
arrange(desc(tf_idf)) %>%
group_by(title) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = fct_inorder(word))
## Selecting by tf_idf
# Plot
ggplot(book_words_10,
aes(y = fct_rev(word), x = tf_idf, fill = title)) +
geom_col() +
guides(fill = FALSE) +
labs(x = "tf-idf", y = NULL) +
facet_wrap(~ title, scales = "free") +
theme_bw()
To see what the different dictionaries look like, use get_sentiments():
get_sentiments("afinn")    # Scoring system
get_sentiments("bing")     # Negative/positive
get_sentiments("nrc")      # Specific emotions
get_sentiments("loughran") # Designed for financial statements; positive/negative
Split into word tokens
The_Picture_of_Dorian_Gray_new <- The_Picture_of_Dorian_Gray %>%
unnest_tokens(word, text)
Join the sentiment dictionary
The_Picture_of_Dorian_Gray_sa <- The_Picture_of_Dorian_Gray_new %>%
inner_join(get_sentiments("bing"))
## Joining, by = "word"
Get a count of positive and negative words in each chapter, then convert the sentiment column into two columns named "positive" and "negative":
sentiment_analysis_chapter <- The_Picture_of_Dorian_Gray_sa %>%
# Get a count of positive and negative words in each chapter
count(chapter_number, sentiment) %>%
# Convert the sentiment column into two columns named "positive" and "negative"
pivot_wider(names_from = sentiment, values_from = n) %>%
# Calculate net sentiment
mutate(net_sentiment = positive - negative)
# Plot it
sentiment_analysis_chapter %>%
ggplot(aes(x = chapter_number, y = net_sentiment)) +
geom_line()
Another approach is to split the data into fixed-size chunks of rows, which gives a more granular view of how the sentiment progresses through the plot:
sentiment_analysis_range <- The_Picture_of_Dorian_Gray_sa %>%
# Number each row (after tokenizing, each row is one sentiment word)
mutate(line_number = row_number()) %>%
# Divide the rows into chunks of 100
mutate(index = line_number %/% 100) %>%
# Get a count of positive and negative words in each chunk
count(index, sentiment) %>%
# Convert the sentiment column into two columns named "positive" and "negative"
pivot_wider(names_from = sentiment, values_from = n) %>%
# Calculate net sentiment
mutate(net_sentiment = positive - negative)
# Plot it
sentiment_analysis_range %>%
ggplot(aes(x = index, y = net_sentiment)) +
geom_col(aes(fill = net_sentiment > 0))