Text Mining - TOPIC MODELING OF SHERLOCK HOLMES STORIES
- Libraries
- Explore tf-idf
- Implement topic modeling
- Document-term matrix
Libraries
library(tidyverse)
library(gutenbergr)
library(tidytext)
library(quanteda)
library(stm)

Here we want to find all the story titles, which appear in ALL CAPS. Each title begins with "ADVENTURE", so we can detect the titles with str_detect(), carry each one down through its story with fill(), and drop the collection title, leaving every line of text labeled with the story it belongs to. We can also check that the stories appear in order and count how many lines each one has (a quick count sketch follows the output below).
sherlock_raw <- gutenberg_download(1661)
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
sherlock <- sherlock_raw %>%
mutate(story = ifelse(str_detect(text, "ADVENTURE"),
text,
NA)) %>%
fill(story) %>%
filter(story != "THE ADVENTURES OF SHERLOCK HOLMES") %>%
mutate(story = factor(story, levels = unique(story)))
sherlock
## # A tibble: 12,624 x 3
## gutenberg_id text story
## <int> <chr> <fct>
## 1 1661 ADVENTURE I. A SCANDAL IN BOHEMIA ADVENTURE I. A SCAN~
## 2 1661 "" ADVENTURE I. A SCAN~
## 3 1661 I. ADVENTURE I. A SCAN~
## 4 1661 "" ADVENTURE I. A SCAN~
## 5 1661 To Sherlock Holmes she is always THE~ ADVENTURE I. A SCAN~
## 6 1661 him mention her under any other name~ ADVENTURE I. A SCAN~
## 7 1661 and predominates the whole of her se~ ADVENTURE I. A SCAN~
## 8 1661 any emotion akin to love for Irene A~ ADVENTURE I. A SCAN~
## 9 1661 one particularly, were abhorrent to ~ ADVENTURE I. A SCAN~
## 10 1661 admirably balanced mind. He was, I t~ ADVENTURE I. A SCAN~
## # ... with 12,614 more rows
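As a quick added sketch (not in the original post), we can count how many text lines (including blanks) each story contains:

sherlock %>%
  count(story)   # one row per story, n = number of lines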
Here we transform the data into tidy text with unnest_tokens(), putting one token per row in the word column, then remove stop words with anti_join(). We also filter out "holmes", since it appears throughout every story and carries no distinguishing information.
tidy_sherlock <- sherlock %>%
mutate(line = row_number()) %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>%
filter(word != "holmes")
## Joining, by = "word"
tidy_sherlock %>%
count(word, sort = TRUE)
## # A tibble: 7,437 x 2
## word n
## <chr> <int>
## 1 time 151
## 2 door 144
## 3 matter 125
## 4 house 123
## 5 hand 120
## 6 night 114
## 7 heard 113
## 8 found 108
## 9 day 106
## 10 morning 102
## # ... with 7,427 more rows
“The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites.” (Silge & Robinson, Text Mining with R). Computing tf-idf for each word within each story lets us plot, for all 12 stories, the words most characteristic of each one.
Explore tf-idf
tidy_sherlock %>%
count(story, word, sort = TRUE) %>%
bind_tf_idf(word, story, n) %>%
group_by(story) %>%
top_n(10) %>%
ungroup %>%
mutate(word = reorder(word, tf_idf)) %>%
ggplot(aes(word, tf_idf, fill = story)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ story, scales = "free") +
coord_flip()
## Selecting by tf_idf
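To make the tf-idf statistic concrete, here is a minimal by-hand version (an added sketch, not part of the original analysis): tf is a word's share of the words in a story, idf is ln(number of stories / number of stories containing the word), and tf-idf is their product. The word "goose" is just an illustrative choice; the values should match what bind_tf_idf() produced above.

tidy_sherlock %>%
  count(story, word) %>%
  group_by(story) %>%
  mutate(tf = n / sum(n)) %>%                 # term frequency within a story
  group_by(word) %>%
  mutate(idf = log(12 / n_distinct(story)),   # 12 stories in this collection
         tf_idf = tf * idf) %>%
  ungroup() %>%
  filter(word == "goose")                     # characteristic of one story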
Implement topic modeling
Document-term matrix

We count words per story and cast the counts into a quanteda document-feature matrix (dfm), which stm() can take directly as input.
sherlock_dfm <- tidy_sherlock %>%
count(story, word, sort = TRUE) %>%
cast_dfm(story, word, n)
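As a quick sanity check (an added sketch), the matrix should have one row per story and one column per word, matching the 12 documents and 7,437-word dictionary reported by summary() below:

dim(sherlock_dfm)   # expect: 12 stories by 7437 words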
topic_model <- stm(sherlock_dfm, K = 6,
verbose = FALSE, init.type = "Spectral")
summary(topic_model)
## A topic model with 6 topics, 12 documents and a 7437 word dictionary.
## Topic 1 Top Words:
## Highest Prob: st, simon, lord, day, lady, found, matter
## FREX: simon, clair, neville, doran, pa, lascar, opium
## Lift: aloysius, ceremony, doran, millar, 2s, allegro, amused
## Score: simon, clair, 1846, st, neville, frank, doran
## Topic 2 Top Words:
## Highest Prob: hat, goose, stone, bird, geese, baker, sir
## FREX: geese, horner, ryder, henry, peterson, salesman, countess
## Lift: henry, peterson, 117, 12s, 221b, 22nd, 249
## Score: goose, geese, horner, 12s, ryder, henry, peterson
## Topic 3 Top Words:
## Highest Prob: street, matter, hosmer, woman, photograph, door, angel
## FREX: hosmer, angel, windibank, majesty, briony, photograph, king
## Lift: godfrey, leadenhall, mask, 1, 1858, 1888, 31
## Score: hosmer, angel, windibank, 1, majesty, photograph, adler
## Topic 4 Top Words:
## Highest Prob: father, mccarthy, time, son, hand, lestrade, left
## FREX: mccarthy, pool, boscombe, openshaw, pips, horsham, turner
## Lift: bone, dundee, horsham, pondicherry, presumption, savannah, sundial
## Score: mccarthy, pool, lestrade, boscombe, 140, openshaw, turner
## Topic 5 Top Words:
## Highest Prob: door, miss, house, night, heard, matter, morning
## FREX: rucastle, hunter, stoner, toller, roylott, ventilator, beeches
## Lift: fowler, inhabited, slit, terrified, winchester, 1100, 120
## Score: rucastle, hunter, coronet, stoner, toller, 40, roylott
## Topic 6 Top Words:
## Highest Prob: time, door, red, business, heard, colonel, day
## FREX: wilson, league, merryweather, jones, hydraulic, coburg, eyford
## Lift: cared, cost, daring, fee, hydraulic, saturday, vincent
## Score: wilson, league, 17, merryweather, jones, headed, colonel
Here we tidy the model's beta matrix to see which words contribute the most to each topic.
td_beta <- tidy(topic_model)
td_beta %>%
group_by(topic) %>%
top_n(10) %>%
ungroup %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = topic)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip()
## Selecting by beta
Next we make a histogram of gamma, the per-document topic probability: for each story, gamma estimates the probability that the story belongs to a given topic.
td_gamma <- tidy(topic_model, matrix = "gamma",
document_names = rownames(sherlock_dfm))
ggplot(td_gamma, aes(gamma, fill = as.factor(topic))) +
geom_histogram(alpha = 0.8, show.legend = FALSE) +
facet_wrap(~ topic, ncol = 3) +
labs(title = "Distribution of document probabilities for each topic",
subtitle = "Each topic is associated with 1-3 stories",
y = "Number of stories", x = expression(gamma))## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
All in all, we trained a topic model on the short stories of Sherlock Holmes to see which stories are similar and which stories focus on certain topics.
Thank you!
References:

Silge, Julia, and David Robinson. Text Mining with R: A Tidy Approach. O'Reilly Media, 2017. https://www.tidytextmining.com/