Text Mining - TOPIC MODELING OF SHERLOCK HOLMES STORIES
- Libraries
- Explore tf-idf
- Implement topic modeling
- Document-term matrix
Libraries
library(tidyverse)
library(gutenbergr)
library(tidytext)
library(quanteda)
library(stm)

Here we want to find all the story titles, which appear in ALL CAPS. Each title begins with "ADVENTURE", so we can detect the titles with str_detect(), carry each one down through its story with fill(), and drop the collection title, leaving every line of text labeled with the story it belongs to. We can also check that the stories appear in order and count how many lines each one has (a quick count sketch follows the output below).
sherlock_raw <- gutenberg_download(1661)
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
sherlock <- sherlock_raw %>%
mutate(story = ifelse(str_detect(text, "ADVENTURE"),
text,
NA)) %>%
fill(story) %>%
filter(story != "THE ADVENTURES OF SHERLOCK HOLMES") %>%
mutate(story = factor(story, levels = unique(story)))
sherlock
## # A tibble: 12,624 x 3
## gutenberg_id text story
## <int> <chr> <fct>
## 1 1661 ADVENTURE I. A SCANDAL IN BOHEMIA ADVENTURE I. A SCAN~
## 2 1661 "" ADVENTURE I. A SCAN~
## 3 1661 I. ADVENTURE I. A SCAN~
## 4 1661 "" ADVENTURE I. A SCAN~
## 5 1661 To Sherlock Holmes she is always THE~ ADVENTURE I. A SCAN~
## 6 1661 him mention her under any other name~ ADVENTURE I. A SCAN~
## 7 1661 and predominates the whole of her se~ ADVENTURE I. A SCAN~
## 8 1661 any emotion akin to love for Irene A~ ADVENTURE I. A SCAN~
## 9 1661 one particularly, were abhorrent to ~ ADVENTURE I. A SCAN~
## 10 1661 admirably balanced mind. He was, I t~ ADVENTURE I. A SCAN~
## # ... with 12,614 more rows
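As a quick added sketch (not in the original post), we can count how many text lines (including blanks) each story contains:

sherlock %>%
  count(story)   # one row per story, n = number of lines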
Here we transform the data into tidy text with unnest_tokens(), putting one token per row in the word column, then remove stop words with anti_join(). We also filter out "holmes", since it appears throughout every story and carries no distinguishing information.
tidy_sherlock <- sherlock %>%
mutate(line = row_number()) %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>%
filter(word != "holmes")
## Joining, by = "word"
tidy_sherlock %>%
count(word, sort = TRUE)
## # A tibble: 7,437 x 2
## word n
## <chr> <int>
## 1 time 151
## 2 door 144
## 3 matter 125
## 4 house 123
## 5 hand 120
## 6 night 114
## 7 heard 113
## 8 found 108
## 9 day 106
## 10 morning 102
## # ... with 7,427 more rows
“The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites.” (Silge & Robinson, Text Mining with R). Computing tf-idf for each word within each story lets us plot, for all 12 stories, the words most characteristic of each one.
Explore tf-idf
tidy_sherlock %>%
count(story, word, sort = TRUE) %>%
bind_tf_idf(word, story, n) %>%
group_by(story) %>%
top_n(10) %>%
ungroup %>%
mutate(word = reorder(word, tf_idf)) %>%
ggplot(aes(word, tf_idf, fill = story)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ story, scales = "free") +
coord_flip()
## Selecting by tf_idf
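To make the tf-idf statistic concrete, here is a minimal by-hand version (an added sketch, not part of the original analysis): tf is a word's share of the words in a story, idf is ln(number of stories / number of stories containing the word), and tf-idf is their product. The word "goose" is just an illustrative choice; the values should match what bind_tf_idf() produced above.

tidy_sherlock %>%
  count(story, word) %>%
  group_by(story) %>%
  mutate(tf = n / sum(n)) %>%                 # term frequency within a story
  group_by(word) %>%
  mutate(idf = log(12 / n_distinct(story)),   # 12 stories in this collection
         tf_idf = tf * idf) %>%
  ungroup() %>%
  filter(word == "goose")                     # characteristic of one story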
Implement topic modeling
Document-term matrix

We count words per story and cast the counts into a quanteda document-feature matrix (dfm), which stm() can take directly as input.
sherlock_dfm <- tidy_sherlock %>%
count(story, word, sort = TRUE) %>%
cast_dfm(story, word, n)
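As a quick sanity check (an added sketch), the matrix should have one row per story and one column per word, matching the 12 documents and 7,437-word dictionary reported by summary() below:

dim(sherlock_dfm)   # expect: 12 stories by 7437 words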
topic_model <- stm(sherlock_dfm, K = 6,
verbose = FALSE, init.type = "Spectral")
summary(topic_model)
## A topic model with 6 topics, 12 documents and a 7437 word dictionary.
## Topic 1 Top Words:
## Highest Prob: st, simon, lord, day, lady, found, matter
## FREX: simon, clair, neville, doran, pa, lascar, opium
## Lift: aloysius, ceremony, doran, millar, 2s, allegro, amused
## Score: simon, clair, 1846, st, neville, frank, doran
## Topic 2 Top Words:
## Highest Prob: hat, goose, stone, bird, geese, baker, sir
## FREX: geese, horner, ryder, henry, peterson, salesman, countess
## Lift: henry, peterson, 117, 12s, 221b, 22nd, 249
## Score: goose, geese, horner, 12s, ryder, henry, peterson
## Topic 3 Top Words:
## Highest Prob: street, matter, hosmer, woman, photograph, door, angel
## FREX: hosmer, angel, windibank, majesty, briony, photograph, king
## Lift: godfrey, leadenhall, mask, 1, 1858, 1888, 31
## Score: hosmer, angel, windibank, 1, majesty, photograph, adler
## Topic 4 Top Words:
## Highest Prob: father, mccarthy, time, son, hand, lestrade, left
## FREX: mccarthy, pool, boscombe, openshaw, pips, horsham, turner
## Lift: bone, dundee, horsham, pondicherry, presumption, savannah, sundial
## Score: mccarthy, pool, lestrade, boscombe, 140, openshaw, turner
## Topic 5 Top Words:
## Highest Prob: door, miss, house, night, heard, matter, morning
## FREX: rucastle, hunter, stoner, toller, roylott, ventilator, beeches
## Lift: fowler, inhabited, slit, terrified, winchester, 1100, 120
## Score: rucastle, hunter, coronet, stoner, toller, 40, roylott
## Topic 6 Top Words:
## Highest Prob: time, door, red, business, heard, colonel, day
## FREX: wilson, league, merryweather, jones, hydraulic, coburg, eyford
## Lift: cared, cost, daring, fee, hydraulic, saturday, vincent
## Score: wilson, league, 17, merryweather, jones, headed, colonel
Here we tidy the model's beta matrix to see which words contribute the most to each topic.
td_beta <- tidy(topic_model)
td_beta %>%
group_by(topic) %>%
top_n(10) %>%
ungroup %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = topic)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip()
## Selecting by beta
Next we make a histogram of gamma, the per-document topic probability: for each story, gamma estimates the probability that the story belongs to a given topic.
td_gamma <- tidy(topic_model, matrix = "gamma",
document_names = rownames(sherlock_dfm))
ggplot(td_gamma, aes(gamma, fill = as.factor(topic))) +
geom_histogram(alpha = 0.8, show.legend = FALSE) +
facet_wrap(~ topic, ncol = 3) +
labs(title = "Distribution of document probabilities for each topic",
subtitle = "Each topic is associated with 1-3 stories",
y = "Number of stories", x = expression(gamma))## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
All in all, we trained a topic model on the short stories of Sherlock Holmes to see which stories are similar and which stories focus on certain topics.
Thank you!
References:

Silge, Julia, and David Robinson. Text Mining with R: A Tidy Approach. O'Reilly Media, 2017. https://www.tidytextmining.com/