Introduction

If you are an analyst, you have probably been trained to work with data that is mostly numeric. But now your company wants to extract and classify information from text, such as tweets, emails, product reviews, and survey responses. This is where the tidytext R package by Silge and Robinson comes in: it gives us a great toolkit for generating insights from unstructured, text-heavy data.

Load the libraries we’ll be using

library(gutenbergr) # For downloading books from Project Gutenberg
library(tidyverse) # For ggplot, dplyr, etc.
library(tidytext) # For generating insights from the literature, news and social media
library(cleanNLP) # For NLP analysis

Get data

We use the gutenbergr package to download some books directly from Project Gutenberg. The IDs for these books come from the URLs on the Project Gutenberg website.

# 174 The Picture of Dorian Gray by Oscar Wilde
The_Picture_of_Dorian_Gray_original <- gutenberg_download(174, meta_fields = "title")
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
write_csv(The_Picture_of_Dorian_Gray_original, "data/The_Picture_of_Dorian_Gray_original.csv")
read_csv("data/The_Picture_of_Dorian_Gray_original.csv")
## Parsed with column specification:
## cols(
##   gutenberg_id = col_double(),
##   text = col_character(),
##   title = col_character()
## )
## # A tibble: 8,498 x 3
##    gutenberg_id text                       title                     
##           <dbl> <chr>                      <chr>                     
##  1          174 The Picture of Dorian Gray The Picture of Dorian Gray
##  2          174 <NA>                       The Picture of Dorian Gray
##  3          174 by                         The Picture of Dorian Gray
##  4          174 <NA>                       The Picture of Dorian Gray
##  5          174 Oscar Wilde                The Picture of Dorian Gray
##  6          174 <NA>                       The Picture of Dorian Gray
##  7          174 <NA>                       The Picture of Dorian Gray
##  8          174 <NA>                       The Picture of Dorian Gray
##  9          174 <NA>                       The Picture of Dorian Gray
## 10          174 THE PREFACE                The Picture of Dorian Gray
## # ... with 8,488 more rows

Clean up

The book we get from Project Gutenberg comes in a convenient format, with a column for the book ID, a column for the title, and a column for the text. But this isn't always the case; it depends on how the book is formatted at Project Gutenberg, so make sure to check your data.

Let's look at the book we just downloaded:

head(The_Picture_of_Dorian_Gray_original)
## # A tibble: 6 x 3
##   gutenberg_id text                         title                     
##          <int> <chr>                        <chr>                     
## 1          174 "The Picture of Dorian Gray" The Picture of Dorian Gray
## 2          174 ""                           The Picture of Dorian Gray
## 3          174 "by"                         The Picture of Dorian Gray
## 4          174 ""                           The Picture of Dorian Gray
## 5          174 "Oscar Wilde"                The Picture of Dorian Gray
## 6          174 ""                           The Picture of Dorian Gray

If we look at the data with View(), we can see the whole dataset; the first 57 rows are the table of contents and other parts of the front matter.

Because I want to group by chapter in later analyses, I use the cumsum() function to create a chapter number column.

The_Picture_of_Dorian_Gray <- The_Picture_of_Dorian_Gray_original %>% 
  # The actual book doesn't start until line 58
  slice(58:n()) %>% 
  # Get rid of rows where text is missing
  drop_na(text) %>% 
  # Chapters start with CHAPTER X, so mark if each row is a chapter start
  # cumsum() calculates the cumulative sum, so it'll increase every time 
  # there's a new chapter and automatically make chapter numbers
  mutate(chapter = str_detect(text, "^CHAPTER"),
         chapter_number = cumsum(chapter)) %>% 
  # Remove columns we don't need
  select(-gutenberg_id, -title, -chapter)

Tokens and counting words

word_frequencies <- The_Picture_of_Dorian_Gray %>%
  # The unnest_tokens() function from tidytext splits the text into
  # tokens (words, bigrams, sentences, etc.), one token per row
  unnest_tokens(word, text) %>% 
  # Remove stop words
  anti_join(stop_words) %>% 
  # Use str_extract() because the UTF-8 encoded texts from Project
  # Gutenberg sometimes wrap words in underscores to indicate emphasis
  # (like italics); without this, "_any_" would be counted separately
  # from "any", which would throw off the word counts.
  mutate(word = str_extract(word, "[a-z']+")) %>% 
  # Count all the words
  count(word, sort = TRUE)
## Joining, by = "word"
word_frequencies %>% 
  # Keep top 15
  top_n(15) %>%
  # Make the words an ordered factor so they plot in order
  mutate(word = fct_inorder(word)) %>% 
  ggplot(aes(x = n, y = word)) +
  geom_col()
## Selecting by n

Bigrams

The_Picture_of_Dorian_Gray_bigrams <- The_Picture_of_Dorian_Gray %>% 
  # token = "ngrams" with n = 2 gives bigrams; n = 3 would give trigrams, etc.
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% 
  # Split the bigrams into two words so we can remove stopwords
  separate(bigram, c("w1","w2"), sep = " ") %>% 
  filter(!w1 %in% stop_words$word,
         !w2 %in% stop_words$word) %>% 
  # Put the two word columns back together
  unite(bigram, w1, w2, sep = " ")
bigram_frequencies <- The_Picture_of_Dorian_Gray_bigrams %>% 
  # Count all the bigrams
  count(bigram, sort = TRUE)
 
bigram_frequencies %>% 
  top_n(15) %>%
  mutate(bigram = fct_inorder(bigram)) %>% 
  ggplot(aes(x = n, y = bigram)) +
  geom_col() +
  labs(x = "Count", y = NULL, 
       title = "15 most frequent bigrams")
## Selecting by n

Scratching the surface of part-of-speech tagging

We use the cleanNLP package for this analysis. cleanNLP supports several taggers:

cnlp_init_udpipe(): Use an R-only tagger that works without installing anything extra (a little slower than the others, but requires no extra steps!)
cnlp_init_spacy(): Use spaCy (if you've installed it on your computer with Python)
cnlp_init_corenlp(): Use Stanford's NLP library (if you've installed it on your computer with Java)
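Switching backends is just a matter of calling a different initializer before annotating; here is a minimal sketch (only the udpipe backend is actually used in this post):

# R-only backend (used below); downloads a small udpipe model the first time it runs
cnlp_init_udpipe()

# Alternatives, if the external dependencies are installed:
# cnlp_init_spacy()    # spaCy, via Python
# cnlp_init_corenlp()  # Stanford CoreNLP, via Java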

# For the tagger to work, each row needs to be unique, which means we need to
# combine all the text into individual chapter-based rows. This takes a little
# bit of text-wrangling with dplyr:
The_Picture_of_Dorian_Gray_tag <- The_Picture_of_Dorian_Gray %>% 
  # Group by chapter number
  group_by(chapter_number) %>% 
  # Take all the rows in each chapter and collapse them into a single cell
  nest(data = c(text)) %>% 
  ungroup() %>% 
  # Look at each individual cell full of text lines and paste them together into
  # one really long string of text per chapter
  mutate(text = map_chr(data, ~paste(.$text, collapse = " "))) %>% 
  # Get rid of this column
  select(-data)
The_Picture_of_Dorian_Gray_tag
## # A tibble: 20 x 2
##    chapter_number text                                                          
##             <int> <chr>                                                         
##  1              1 "CHAPTER 1  The studio was filled with the rich odour of rose~
##  2              2 "CHAPTER 2  As they entered they saw Dorian Gray.  He was sea~
##  3              3 "CHAPTER 3  At half-past twelve next day Lord Henry Wotton st~
##  4              4 "CHAPTER 4  One afternoon, a month later, Dorian Gray was rec~
##  5              5 "CHAPTER 5  \"Mother, Mother, I am so happy!\" whispered the ~
##  6              6 "CHAPTER 6  \"I suppose you have heard the news, Basil?\" sai~
##  7              7 "CHAPTER 7  For some reason or other, the house was crowded t~
##  8              8 "CHAPTER 8  It was long past noon when he awoke.  His valet h~
##  9              9 "CHAPTER 9  As he was sitting at breakfast next morning, Basi~
## 10             10 "CHAPTER 10  When his servant entered, he looked at him stead~
## 11             11 "CHAPTER 11  For years, Dorian Gray could not free himself fr~
## 12             12 "CHAPTER 12  It was on the ninth of November, the eve of his ~
## 13             13 "CHAPTER 13  He passed out of the room and began the ascent, ~
## 14             14 "CHAPTER 14  At nine o'clock the next morning his servant cam~
## 15             15 "CHAPTER 15  That evening, at eight-thirty, exquisitely dress~
## 16             16 "CHAPTER 16  A cold rain began to fall, and the blurred stree~
## 17             17 "CHAPTER 17  A week later Dorian Gray was sitting in the cons~
## 18             18 "CHAPTER 18  The next day he did not leave the house, and, in~
## 19             19 "CHAPTER 19  \"There is no use your telling me that you are g~
## 20             20 "CHAPTER 20  It was a lovely night, so warm that he threw his~
# Use the built-in R-based tagger
cnlp_init_udpipe()

The_Picture_of_Dorian_Gray_tagged <- cnlp_annotate(The_Picture_of_Dorian_Gray_tag, 
                                     text_name = "text", 
                                     doc_name = "chapter_number")

write_csv(The_Picture_of_Dorian_Gray_tagged$token, "data/The_Picture_of_Dorian_Gray_tagged.csv")
The_Picture_of_Dorian_Gray_tagged <- read_csv("data/The_Picture_of_Dorian_Gray_tagged.csv")
## Parsed with column specification:
## cols(
##   doc_id = col_double(),
##   sid = col_double(),
##   tid = col_double(),
##   token = col_character(),
##   token_with_ws = col_character(),
##   lemma = col_character(),
##   upos = col_character(),
##   xpos = col_character(),
##   feats = col_character(),
##   tid_source = col_double(),
##   relation = col_character()
## )
# Find all proper nouns
proper_nouns <- The_Picture_of_Dorian_Gray_tagged %>%
  filter(upos == "PROPN")

main_characters_by_chapter <- proper_nouns %>% 
  # Find only Dorian, Hallward, Wotton, Sibyl, James, and Alan
  filter(lemma %in% c("Dorian", "Hallward", "Wotton",
                      "Sibyl","James", "Alan")) %>% 
  # Group by chapter and character name
  group_by(doc_id, lemma) %>% 
  # Get the count of mentions
  summarize(n = n()) %>% 
  # Make a new column named "name" that is an ordered factor of the names
  mutate(name = factor(lemma, levels = c("Dorian", "Hallward", "Wotton",
                      "Sibyl","James", "Alan"), ordered = TRUE)) %>% 
  # Rename this so it's called chapter
  rename(chapter = doc_id) %>% 
  # Group by chapter
  group_by(chapter) %>% 
  # Calculate the proportion of each name in each chapter
  mutate(prop = n / sum(n)) %>% 
  ungroup() %>% 
  # Make a cleaner chapter name column
  mutate(chapter_name = paste("Chapter", chapter)) %>% 
  mutate(chapter_name = fct_inorder(chapter_name))
## `summarise()` regrouping output by 'doc_id' (override with `.groups` argument)
main_characters_by_chapter
## # A tibble: 63 x 6
##    chapter lemma        n name       prop chapter_name
##      <dbl> <chr>    <int> <ord>     <dbl> <fct>       
##  1       1 Dorian      17 Dorian   0.5    Chapter 1   
##  2       1 Hallward    16 Hallward 0.471  Chapter 1   
##  3       1 Wotton       1 Wotton   0.0294 Chapter 1   
##  4       2 Dorian      31 Dorian   0.585  Chapter 2   
##  5       2 Hallward    17 Hallward 0.321  Chapter 2   
##  6       2 Wotton       5 Wotton   0.0943 Chapter 2   
##  7       3 Dorian       7 Dorian   0.778  Chapter 3   
##  8       3 Hallward     1 Hallward 0.111  Chapter 3   
##  9       3 Wotton       1 Wotton   0.111  Chapter 3   
## 10       4 Dorian      18 Dorian   0.562  Chapter 4   
## # ... with 53 more rows
ggplot(main_characters_by_chapter, aes(x = prop, y = 1, fill = fct_rev(name))) + 
  geom_col(position = position_stack()) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_fill_viridis_d(option = "plasma", end = 0.9, name = NULL) +
  guides(fill = guide_legend(reverse = TRUE)) +
  labs(x = NULL, y = NULL,
       title = "Proportion of mentions of each character per chapter") +
  facet_wrap(vars(chapter_name), nrow = 5) +
  theme(legend.position = "top",
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        strip.background = element_rect(fill = "white"),
        legend.text = element_text(face = "bold", size = rel(1)),
        plot.title = element_text(face = "bold", hjust = 0.5, size = rel(1.7)),
        plot.subtitle = element_text(hjust = 0.5, size = rel(1.1)))

Dorian basically dominates the book. This is just the beginning of a long road with NLP, and I need more time to dig deeper into it. Of course, once the data is tagged we can do almost anything we want with it.
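As one example of a follow-up, the same tagged token table could be used to pull out the most common verb lemmas (a minimal sketch, not part of the original analysis):

# Most frequent verb lemmas across the whole book
The_Picture_of_Dorian_Gray_tagged %>% 
  filter(upos == "VERB") %>% 
  count(lemma, sort = TRUE) %>% 
  head(10)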

Term frequency-inverse document frequency (tf-idf)

\[ \begin{aligned} tf(\text{term}) &= \frac{n_{\text{term}}}{n_{\text{terms in document}}} \\ idf(\text{term}) &= \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)} \\ tf\text{-}idf(\text{term}) &= tf(\text{term}) \times idf(\text{term}) \end{aligned} \]
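As a quick check on these formulas, here is a toy example (a hypothetical two-document word count, not part of the analysis below) where tidytext's bind_tf_idf() reproduces the arithmetic by hand:

toy_counts <- tribble(
  ~document, ~word, ~n,
  "doc1",    "cat",  2,
  "doc1",    "dog",  1,
  "doc2",    "dog",  3
)

toy_counts %>% 
  bind_tf_idf(word, document, n)
# "cat" appears only in doc1: tf = 2/3, idf = ln(2/1) ≈ 0.69, tf-idf ≈ 0.46
# "dog" appears in both documents: idf = ln(2/2) = 0, so its tf-idf is 0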

To do this analysis, we add a few more books by Oscar Wilde:

# The Importance of Being Earnest: A Trivial Comedy for Serious People
  b2 <- gutenberg_download(844, meta_fields = "title")
  b2_clean <- b2 %>% 
    slice(5:n()) %>% 
    drop_na(text)
# 902: The Happy Prince, and Other Tales
  b3 <- gutenberg_download(902, meta_fields = "title")
  b3_clean <- b3 %>% 
    slice(63:n()) %>% 
    drop_na(text)
# 885: An Ideal Husband
  b4 <- gutenberg_download(885, meta_fields = "title")
  b4_clean <- b4 %>% 
    slice(29:n()) %>% 
    drop_na(text)
# Use bind_rows() from dplyr to bind multiple data by row
  Oscar_Wilde_books <- bind_rows(b2_clean, b3_clean, b4_clean)
  
  book_words <- Oscar_Wilde_books %>% 
    unnest_tokens(word, text) %>% 
    anti_join(stop_words) %>% 
    # Use str_extract() because the UTF-8 encoded texts from Project
    # Gutenberg sometimes wrap words in underscores to indicate emphasis
    # (like italics); without this, "_any_" would be counted separately
    # from "any", which would throw off the word counts.
    mutate(word = str_extract(word, "[a-z']+")) %>% 
    count(title, word, sort = TRUE)
## Joining, by = "word"
# find the words most distinctive to each document
book_words_tf_idf <-  book_words %>% 
  bind_tf_idf(word, title, n)
# Get the top 10 most distinctive words per book
book_words_10 <- book_words_tf_idf %>% 
  arrange(desc(tf_idf)) %>% 
  group_by(title) %>% 
  top_n(10) %>% 
  ungroup() %>% 
  mutate(word = fct_inorder(word))
## Selecting by tf_idf
# Plot
ggplot(book_words_10, 
       aes(y = fct_rev(word), x = tf_idf, fill = title)) +
  geom_col() +
  guides(fill = FALSE) +
  labs(x = "tf-idf", y = NULL) +
  facet_wrap(~ title, scales = "free") +
  theme_bw()

Sentiment analysis (sa)

To see what the different dictionaries look like, use get_sentiments():

get_sentiments("afinn")    # Scoring system
get_sentiments("bing")     # Negative/positive
get_sentiments("nrc")      # Specific emotions
get_sentiments("loughran") # Designed for financial statements; positive/negative
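The bing lexicon is used for the rest of this section. As an aside, a numeric lexicon like AFINN would let us sum word scores per chapter instead of counting positive/negative labels; a minimal sketch, assuming the AFINN lexicon has been downloaded via the textdata package:

# Net AFINN score per chapter: each matched word contributes a value from -5 to +5
afinn_by_chapter <- The_Picture_of_Dorian_Gray %>% 
  unnest_tokens(word, text) %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(chapter_number) %>% 
  summarize(net_sentiment = sum(value))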

Split into word tokens

The_Picture_of_Dorian_Gray_new <- The_Picture_of_Dorian_Gray %>%
  unnest_tokens(word, text)

Join the sentiment dictionary

The_Picture_of_Dorian_Gray_sa <- The_Picture_of_Dorian_Gray_new %>%
  inner_join(get_sentiments("bing"))
## Joining, by = "word"

Get a count of positive and negative words in each chapter, then convert the sentiment column into two columns named "positive" and "negative":

sentiment_analysis_chapter <- The_Picture_of_Dorian_Gray_sa %>% 
  # Get a count of positive and negative words in each chapter
  count(chapter_number, sentiment) %>% 
  # Convert the sentiment column into two columns named "positive" and "negative"
  pivot_wider(names_from = sentiment, values_from = n) %>% 
  # Calculate net sentiment
  mutate(net_sentiment = positive - negative)

# Plot it
sentiment_analysis_chapter %>% 
  ggplot(aes(x = chapter_number, y = net_sentiment)) +
  geom_line()

Another approach is to split the data into groups of lines, which gives a more granular view of how the plot progresses.

sentiment_analysis_range <- The_Picture_of_Dorian_Gray_sa %>% 
  mutate(line_number = row_number()) %>% 
  # Divide lines into groups of 100
  mutate(index = line_number %/% 100) %>% 
  # Get a count of positive and negative words in each 100-line chunk
  count(index, sentiment) %>% 
  # Convert the sentiment column into two columns named "positive" and "negative"
  pivot_wider(names_from = sentiment, values_from = n) %>% 
  # Calculate net sentiment
  mutate(net_sentiment = positive - negative)

# Plot it
sentiment_analysis_range %>% 
  ggplot(aes(x = index, y = net_sentiment)) +
  geom_col(aes(fill = net_sentiment > 0))