Analyse what words are in the book “The Adventures of Sherlock Holmes”.
https://sherlock-holm.es/ascii/
Sample text from the story “A Scandal in Bohemia”:
Chapter I
To Sherlock Holmes she is always the woman. I have seldom heard him
mention her under any other name. In his eyes she eclipses and
predominates the whole of her sex. It was not that he felt any
emotion akin to love for Irene Adler. All emotions, and that one
particularly, were abhorrent to his cold, precise but admirably
balanced mind. He was, I take it, the most perfect reasoning and
observing machine that the world has seen, but as a lover he would
have placed himself in a false position. He never spoke of the softer
passions, save with a gibe and a sneer......
Book Index:
The Adventures of Sherlock Holmes
A Scandal in Bohemia (51 kB)
The Red-Headed League (54 kB)
A Case of Identity (42 kB)
The Boscombe Valley Mystery (56 kB)
The Five Orange Pips (43 kB)
The Man with the Twisted Lip (54 kB)
The Adventure of the Blue Carbuncle (46 kB)
The Adventure of the Speckled Band (58 kB)
The Adventure of the Engineer’s Thumb (49 kB)
The Adventure of the Noble Bachelor (49 kB)
The Adventure of the Beryl Coronet (56 kB)
The Adventure of the Copper Beeches (58 kB)
filename.A_Scandal_in_Bohemia <- "TheAdventuresOfSherlockHolmes\\scan.txt"
filename.The_Red_Headed_League <- "TheAdventuresOfSherlockHolmes\\redh.txt"
filename.A_Case_of_Identity <- "TheAdventuresOfSherlockHolmes\\iden.txt"
filename.The_Boscombe_Valley_Mystery <- "TheAdventuresOfSherlockHolmes\\bosc.txt"
filename.The_Five_Orange_Pips <- "TheAdventuresOfSherlockHolmes\\five.txt"
filename.The_Man_with_the_Twisted_Lip <- "TheAdventuresOfSherlockHolmes\\twis.txt"
filename.The_Adventure_of_the_Blue_Carbuncle <- "TheAdventuresOfSherlockHolmes\\blue.txt"
filename.The_Adventure_of_the_Speckled_Band <- "TheAdventuresOfSherlockHolmes\\spec.txt"
filename.The_Adventure_of_the_Engineers_Thumb <- "TheAdventuresOfSherlockHolmes\\engr.txt"
filename.The_Adventure_of_the_Noble_Bachelor <- "TheAdventuresOfSherlockHolmes\\nobl.txt"
filename.The_Adventure_of_the_Beryl_Coronet <- "TheAdventuresOfSherlockHolmes\\bery.txt"
filename.The_Adventure_of_the_Copper_Beeches <- "TheAdventuresOfSherlockHolmes\\copp.txt"
conn <- file(filename.A_Scandal_in_Bohemia,open="r")
story.A_Scandal_in_Bohemia <-readLines(conn)
close(conn)
conn <- file(filename.The_Red_Headed_League,open="r")
story.The_Red_Headed_League <-readLines(conn)
close(conn)
conn <- file(filename.A_Case_of_Identity,open="r")
story.A_Case_of_Identity <-readLines(conn)
close(conn)
conn <- file(filename.The_Boscombe_Valley_Mystery,open="r")
story.The_Boscombe_Valley_Mystery <-readLines(conn)
close(conn)
conn <- file(filename.The_Five_Orange_Pips,open="r")
story.The_Five_Orange_Pips <-readLines(conn)
close(conn)
conn <- file(filename.The_Man_with_the_Twisted_Lip,open="r")
story.The_Man_with_the_Twisted_Lip <-readLines(conn)
close(conn)
conn <- file(filename.The_Adventure_of_the_Blue_Carbuncle,open="r")
story.The_Adventure_of_the_Blue_Carbuncle <-readLines(conn)
close(conn)
conn <- file(filename.The_Adventure_of_the_Speckled_Band,open="r")
story.The_Adventure_of_the_Speckled_Band <-readLines(conn)
close(conn)
conn <- file(filename.The_Adventure_of_the_Engineers_Thumb,open="r")
story.The_Adventure_of_the_Engineers_Thumb <-readLines(conn)
close(conn)
conn <- file(filename.The_Adventure_of_the_Noble_Bachelor,open="r")
story.The_Adventure_of_the_Noble_Bachelor <-readLines(conn)
close(conn)
conn <- file(filename.The_Adventure_of_the_Beryl_Coronet,open="r")
story.The_Adventure_of_the_Beryl_Coronet <-readLines(conn)
close(conn)
conn <- file(filename.The_Adventure_of_the_Copper_Beeches,open="r")
story.The_Adventure_of_the_Copper_Beeches <-readLines(conn)
close(conn)
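The twelve read blocks above all follow the same open, read, close pattern, so they could be collapsed into a loop. The following is only a sketch: the helper `read_story`, the vector `filenames`, and the list `stories` are hypothetical names that the rest of this analysis does not use, and the two small stand-in files are written to a temporary folder so that the sketch runs without the real story files.

```r
# Stand-in files so the sketch runs anywhere; the real files are the
# ones listed above in TheAdventuresOfSherlockHolmes.
demo_dir <- tempfile("sherlock_demo")
dir.create(demo_dir)
writeLines(c("To Sherlock Holmes she is always the woman.",
             "I have seldom heard him"),
           file.path(demo_dir, "scan.txt"))
writeLines("placeholder line", file.path(demo_dir, "redh.txt"))

# Hypothetical helper: open a connection, read every line, and make sure
# the connection is closed even if readLines() fails.
read_story <- function(path) {
  conn <- file(path, open = "r")
  on.exit(close(conn))
  readLines(conn)
}

# Named vector: story name -> file path.
filenames <- c(
  A_Scandal_in_Bohemia  = file.path(demo_dir, "scan.txt"),
  The_Red_Headed_League = file.path(demo_dir, "redh.txt")
)

# One character vector of lines per story, kept together in a named list.
stories <- lapply(filenames, read_story)
names(stories)
length(stories$A_Scandal_in_Bohemia)
```

With the real files, `filenames` would hold the twelve paths defined earlier, and `stories$A_Scandal_in_Bohemia` would play the role of `story.A_Scandal_in_Bohemia`.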
library(tidytext)
library(dplyr)
library(stringr)
library(ggplot2)
Tidy Data
We are going to use unnest_tokens() to split the text into one word per row:
# tibble() replaces the deprecated data_frame()
text_df.A_Scandal_in_Bohemia <- tibble(story = "A_Scandal_in_Bohemia", text = story.A_Scandal_in_Bohemia)
text_df.A_Scandal_in_Bohemia.tidytext <- text_df.A_Scandal_in_Bohemia %>%
  unnest_tokens(word, text)
head(text_df.A_Scandal_in_Bohemia.tidytext)
## # A tibble: 6 x 2
## story word
## <chr> <chr>
## 1 A_Scandal_in_Bohemia a
## 2 A_Scandal_in_Bohemia scandal
## 3 A_Scandal_in_Bohemia in
## 4 A_Scandal_in_Bohemia bohemia
## 5 A_Scandal_in_Bohemia arthur
## 6 A_Scandal_in_Bohemia conan
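As the output shows, unnest_tokens() returns one row per word, and by default it lowercases the text and drops punctuation. A small sketch on the first two lines of the sample text makes this easier to see; `demo` and `demo_words` are names made up for this illustration.

```r
library(dplyr)
library(tidytext)

# Two lines of text, one row per line, as readLines() would give us.
demo <- tibble(story = "demo",
               text  = c("To Sherlock Holmes she is always the woman.",
                         "I have seldom heard him"))

# word = name of the new output column, text = name of the input column.
demo_words <- demo %>%
  unnest_tokens(word, text)

demo_words$word
```

The `story` column is carried along unchanged, which is why every row in the output above is labelled with the story name.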
Remove stop-words and analyze the words
text_df.A_Scandal_in_Bohemia.tidytext.count <- text_df.A_Scandal_in_Bohemia.tidytext %>%
anti_join(stop_words ) %>%
count(word, sort=TRUE ) %>%
mutate(
document = "A_Scandal_in_Bohemia"
)
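The step above relies on anti_join() keeping only the rows whose word is absent from the stop_words table that ships with tidytext, and on count() tallying what is left. A minimal sketch on a hand-made word column (the `demo_words` and `demo_counts` names are illustrative):

```r
library(dplyr)
library(tidytext)

# A tiny one-word-per-row table, as unnest_tokens() would produce.
demo_words <- tibble(word = c("the", "adler", "the", "holmes",
                              "and", "holmes", "holmes"))

demo_counts <- demo_words %>%
  anti_join(stop_words, by = "word") %>%  # drop words found in the stop-word list
  count(word, sort = TRUE)                # one row per remaining word, most frequent first

demo_counts
```

Naming the join column with `by = "word"` is optional here, but it avoids the "Joining, by = ..." message.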
Plot the words ordered by frequency, keeping only words that occur more than 10 times:
text_df.A_Scandal_in_Bohemia.tidytext.count %>%
filter(n > 10) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
coord_flip()
Now we try to find the sentiment of the story, using the NRC lexicon:
text_df.A_Scandal_in_Bohemia.tidytext.sentiment <- text_df.A_Scandal_in_Bohemia.tidytext.count %>%
inner_join(get_sentiments("nrc"))
## Joining, by = "word"
text_df.A_Scandal_in_Bohemia.tidytext.sentiment.score <- text_df.A_Scandal_in_Bohemia.tidytext.sentiment %>%
group_by(sentiment) %>%
mutate(
n_sum = sum(n)
) %>%
select (
sentiment, n_sum
) %>%
arrange(-n_sum) %>%
filter(row_number()==1)
text_df.A_Scandal_in_Bohemia.tidytext.sentiment.score
## # A tibble: 10 x 2
## # Groups: sentiment [10]
## sentiment n_sum
## <chr> <int>
## 1 positive 345
## 2 trust 210
## 3 negative 209
## 4 anticipation 150
## 5 joy 130
## 6 fear 129
## 7 sadness 101
## 8 anger 80
## 9 surprise 58
## 10 disgust 55
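The pipeline above gets one total per sentiment by attaching the group sum to every row and then keeping the first row of each group. The same totals can be computed more directly with summarise(). The sketch below uses a made-up table, `toy_sentiment`, whose words and labels are illustrative and not real NRC output.

```r
library(dplyr)

# Made-up word counts with sentiment labels (a word can carry several labels).
toy_sentiment <- tibble(
  word      = c("love", "love", "cold", "sneer", "perfect"),
  n         = c(4, 4, 3, 2, 5),
  sentiment = c("positive", "joy", "negative", "negative", "positive")
)

# One row per sentiment with the summed word counts, largest first.
toy_score <- toy_sentiment %>%
  group_by(sentiment) %>%
  summarise(n_sum = sum(n)) %>%
  arrange(desc(n_sum))

toy_score
```

Unlike the mutate-and-filter version, the result of summarise() is no longer grouped, so it prints without the "Groups:" header.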
How about other Sherlock Holmes stories?
TF-IDF (term frequency × inverse document frequency) is a very popular representation for text. We can use this measure, instead of a stop-word dictionary, to remove unwanted common words.
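To make the measure concrete: the term frequency of a word in a document is its count divided by the total number of words in that document, and the inverse document frequency is the natural log of the number of documents divided by the number of documents that contain the word. The sketch below computes both by hand on a made-up two-document corpus (`toy_counts` is illustrative) and compares the result with bind_tf_idf() from tidytext.

```r
library(dplyr)
library(tidytext)

# Made-up word counts for two tiny documents.
toy_counts <- tibble(
  document = c("doc1", "doc1", "doc2", "doc2"),
  word     = c("the", "hound", "the", "league"),
  n        = c(8, 2, 6, 4)
)

n_docs <- n_distinct(toy_counts$document)

toy_tf_idf <- toy_counts %>%
  group_by(document) %>%
  mutate(tf = n / sum(n)) %>%                       # share of the document's words
  group_by(word) %>%
  mutate(idf = log(n_docs / n_distinct(document))) %>%  # 0 when the word is in every document
  ungroup() %>%
  mutate(tf_idf = tf * idf)

# The same quantities from tidytext.
check <- toy_counts %>%
  bind_tf_idf(word, document, n)

all.equal(toy_tf_idf$tf_idf, check$tf_idf)
```

Because "the" occurs in both documents, its idf is log(2 / 2) = 0 and its tf-idf is 0 no matter how often it occurs, which is what lets tf-idf stand in for a stop-word list.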
Let’s prepare the word counts without removing stop-words:
text_df.A_Scandal_in_Bohemia.tidytext.count_raw <- text_df.A_Scandal_in_Bohemia.tidytext %>%
count(word, sort=TRUE ) %>%
mutate(
document = "A_Scandal_in_Bohemia"
)
Prepare the word counts for the other 11 stories:
#prepare one table per story with the columns word, n (frequency) and document
#The Red-Headed League
text_df.The_Red_Headed_League <- tibble(story = "The_Red_Headed_League", text = story.The_Red_Headed_League)
text_df.The_Red_Headed_League.tidytext <- text_df.The_Red_Headed_League %>%
unnest_tokens(word, text)
text_df.The_Red_Headed_League.count_raw <- text_df.The_Red_Headed_League.tidytext %>%
count(word, sort=TRUE ) %>%
mutate(
document = "The_Red_Headed_League"
)
#A Case of Identity
text_df.A_Case_of_Identity <- tibble(story = "A_Case_of_Identity", text = story.A_Case_of_Identity)
text_df.A_Case_of_Identity.tidytext <- text_df.A_Case_of_Identity %>%
unnest_tokens(word, text)
text_df.A_Case_of_Identity.count_raw <- text_df.A_Case_of_Identity.tidytext %>%
count(word, sort=TRUE ) %>%
mutate(
document = "A_Case_of_Identity"
)
#The Boscombe Valley Mystery
text_df.The_Boscombe_Valley_Mystery <- tibble(story = "The_Boscombe_Valley_Mystery", text = story.The_Boscombe_Valley_Mystery)
text_df.The_Boscombe_Valley_Mystery.tidytext <- text_df.The_Boscombe_Valley_Mystery %>%
unnest_tokens(word, text)
text_df.The_Boscombe_Valley_Mystery.count_raw <- text_df.The_Boscombe_Valley_Mystery.tidytext %>%
count(word, sort=TRUE ) %>%
mutate(
document = "The_Boscombe_Valley_Mystery"
)
#The Five Orange Pips
text_df.The_Five_Orange_Pips <- tibble(story = "The_Five_Orange_Pips", text = story.The_Five_Orange_Pips)
text_df.The_Five_Orange_Pips.tidytext <- text_df.The_Five_Orange_Pips %>%
unnest_tokens(word, text)
text_df.The_Five_Orange_Pips.count_raw <- text_df.The_Five_Orange_Pips.tidytext %>%
count(word, sort=TRUE ) %>%
mutate(
document = "The_Five_Orange_Pips"
)
#The Man with the Twisted Lip
text_df.The_Man_with_the_Twisted_Lip <- tibble(story = "The_Man_with_the_Twisted_Lip", text = story.The_Man_with_the_Twisted_Lip)
text_df.The_Man_with_the_Twisted_Lip.tidytext <- text_df.The_Man_with_the_Twisted_Lip %>%
unnest_tokens(word, text)
text_df.The_Man_with_the_Twisted_Lip.count_raw <- text_df.The_Man_with_the_Twisted_Lip.tidytext %>%
count(word, sort=TRUE ) %>%
mutate(
document = "The_Man_with_the_Twisted_Lip"
)
#The Adventure of the Blue Carbuncle
text_df.The_Adventure_of_the_Blue_Carbuncle <- tibble(story = "The_Adventure_of_the_Blue_Carbuncle", text = story.The_Adventure_of_the_Blue_Carbuncle)
text_df.The_Adventure_of_the_Blue_Carbuncle.tidytext <- text_df.The_Adventure_of_the_Blue_Carbuncle %>%
unnest_tokens(word, text)
text_df.The_Adventure_of_the_Blue_Carbuncle.count_raw <- text_df.The_Adventure_of_the_Blue_Carbuncle.tidytext %>%
count(word, sort=TRUE ) %>%
mutate(
document = "The_Adventure_of_the_Blue_Carbuncle"
)
#The Adventure of the Speckled Band
text_df.The_Adventure_of_the_Speckled_Band <- tibble(story = "The_Adventure_of_the_Speckled_Band", text = story.The_Adventure_of_the_Speckled_Band)
text_df.The_Adventure_of_the_Speckled_Band.tidytext <- text_df.The_Adventure_of_the_Speckled_Band %>%
unnest_tokens(word, text)
text_df.The_Adventure_of_the_Speckled_Band.count_raw <- text_df.The_Adventure_of_the_Speckled_Band.tidytext %>%
count(word, sort=TRUE ) %>%
mutate(
document = "The_Adventure_of_the_Speckled_Band"
)
#The Adventure of the Engineer's Thumb
text_df.The_Adventure_of_the_Engineers_Thumb <- tibble(story = "The_Adventure_of_the_Engineers_Thumb", text = story.The_Adventure_of_the_Engineers_Thumb)
text_df.The_Adventure_of_the_Engineers_Thumb.tidytext <- text_df.The_Adventure_of_the_Engineers_Thumb %>%
unnest_tokens(word, text)
text_df.The_Adventure_of_the_Engineers_Thumb.count_raw <- text_df.The_Adventure_of_the_Engineers_Thumb.tidytext %>%
count(word, sort=TRUE ) %>%
mutate(
document = "The_Adventure_of_the_Engineers_Thumb"
)
#The Adventure of the Noble Bachelor
text_df.The_Adventure_of_the_Noble_Bachelor <- tibble(story = "The_Adventure_of_the_Noble_Bachelor", text = story.The_Adventure_of_the_Noble_Bachelor)
text_df.The_Adventure_of_the_Noble_Bachelor.tidytext <- text_df.The_Adventure_of_the_Noble_Bachelor %>%
unnest_tokens(word, text)
text_df.The_Adventure_of_the_Noble_Bachelor.count_raw <- text_df.The_Adventure_of_the_Noble_Bachelor.tidytext %>%
count(word, sort=TRUE ) %>%
mutate(
document = "The_Adventure_of_the_Noble_Bachelor"
)
#The Adventure of the Beryl Coronet
text_df.The_Adventure_of_the_Beryl_Coronet <- tibble(story = "The_Adventure_of_the_Beryl_Coronet", text = story.The_Adventure_of_the_Beryl_Coronet)
text_df.The_Adventure_of_the_Beryl_Coronet.tidytext <- text_df.The_Adventure_of_the_Beryl_Coronet %>%
unnest_tokens(word, text)
text_df.The_Adventure_of_the_Beryl_Coronet.count_raw <- text_df.The_Adventure_of_the_Beryl_Coronet.tidytext %>%
count(word, sort=TRUE ) %>%
mutate(
document = "The_Adventure_of_the_Beryl_Coronet"
)
#The Adventure of the Copper Beeches
text_df.The_Adventure_of_the_Copper_Beeches <- tibble(story = "The_Adventure_of_the_Copper_Beeches", text = story.The_Adventure_of_the_Copper_Beeches)
text_df.The_Adventure_of_the_Copper_Beeches.tidytext <- text_df.The_Adventure_of_the_Copper_Beeches %>%
unnest_tokens(word, text)
text_df.The_Adventure_of_the_Copper_Beeches.count_raw <- text_df.The_Adventure_of_the_Copper_Beeches.tidytext %>%
count(word, sort=TRUE ) %>%
mutate(
document = "The_Adventure_of_the_Copper_Beeches"
)
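The blocks above differ only in the story name, so the tokenise-and-count step could be wrapped in a function and applied to every story. This is a sketch under the assumption that the lines of each story are held in a named list; `count_story_words`, `demo_stories`, and `demo_books_words` are hypothetical names, and the two "stories" are made-up lines.

```r
library(dplyr)
library(tidytext)

# Hypothetical helper: lines of one story -> word counts tagged with the story name.
count_story_words <- function(lines, name) {
  tibble(story = name, text = lines) %>%
    unnest_tokens(word, text) %>%
    count(word, sort = TRUE) %>%
    mutate(document = name)
}

# Stand-in for the real stories: a named list of character vectors.
demo_stories <- list(
  story_a = c("the hound and the league", "the hound"),
  story_b = c("the league of the league")
)

# Apply the helper to each (lines, name) pair and stack the results.
demo_books_words <- bind_rows(
  Map(count_story_words, demo_stories, names(demo_stories))
)

demo_books_words
```

The stacked table has the same three columns (word, n, document) as the per-story count_raw tables built above.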
#combine the per-story word counts into one table (one row per story and word)
books_words <- rbind(
  text_df.The_Red_Headed_League.count_raw,
  text_df.A_Scandal_in_Bohemia.tidytext.count_raw,
  text_df.A_Case_of_Identity.count_raw,
  text_df.The_Boscombe_Valley_Mystery.count_raw,
  text_df.The_Five_Orange_Pips.count_raw,
  text_df.The_Man_with_the_Twisted_Lip.count_raw,
  text_df.The_Adventure_of_the_Blue_Carbuncle.count_raw,
  text_df.The_Adventure_of_the_Speckled_Band.count_raw,
  text_df.The_Adventure_of_the_Engineers_Thumb.count_raw,
  text_df.The_Adventure_of_the_Noble_Bachelor.count_raw,
  text_df.The_Adventure_of_the_Beryl_Coronet.count_raw,
  text_df.The_Adventure_of_the_Copper_Beeches.count_raw
)
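Take advantage of tf-idf to create our own word list. TF-IDF gives more weight to words that are concentrated in a few stories and less weight to words that appear in most of them. A word that appears in all 12 stories has idf = ln(12/12) = 0, so its tf-idf is 0.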
books_words.tf_idf <- books_words %>%
bind_tf_idf(word, document, n)
head(books_words.tf_idf)
## # A tibble: 6 x 6
## word n document tf idf tf_idf
## <chr> <int> <chr> <dbl> <dbl> <dbl>
## 1 the 464 The_Red_Headed_League 0.04990320 0 0
## 2 and 281 The_Red_Headed_League 0.03022155 0 0
## 3 i 261 The_Red_Headed_League 0.02807055 0 0
## 4 a 241 The_Red_Headed_League 0.02591955 0 0
## 5 to 232 The_Red_Headed_League 0.02495160 0 0
## 6 of 227 The_Red_Headed_League 0.02441385 0 0
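#create our own Detective-Stop-Words: words with tf_idf == 0, i.e. words that appear in every story
Detective_stop_words <-
  books_words.tf_idf %>%
  filter(
    tf_idf == 0
  ) %>%
  select(
    word
  ) %>%
  distinct()  # each word is listed once per story, so keep a single copy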
head(Detective_stop_words)
## # A tibble: 6 x 1
## word
## <chr>
## 1 the
## 2 and
## 3 i
## 4 a
## 5 to
## 6 of
DetectiveWords <-
books_words.tf_idf %>%
filter(
tf_idf > 0
)
DetectiveWords %>%
filter(n > 20) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
coord_flip()
More…
3) TF-IDF can be used for cosine similarity analysis.
4) TF-IDF can be used as input for a search engine: TF, IDF, and TF-IDF help to link keywords to documents.
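As a rough illustration of the cosine similarity idea, each document can be treated as a vector of tf-idf weights, and two documents compared by the cosine of the angle between their vectors. The sketch below uses base R only; the matrix values and the function name `cosine_similarity` are made up for this example and do not come from the stories.

```r
# Toy tf-idf matrix: rows are documents, columns are words (made-up values).
tfidf <- rbind(
  doc1 = c(hound = 0.14, league = 0.00, pips = 0.00),
  doc2 = c(hound = 0.00, league = 0.28, pips = 0.00),
  doc3 = c(hound = 0.07, league = 0.00, pips = 0.10)
)

# Cosine similarity between every pair of rows:
# dot product divided by the product of the two vector lengths.
cosine_similarity <- function(m) {
  norms <- sqrt(rowSums(m^2))
  (m %*% t(m)) / (norms %o% norms)
}

sim <- cosine_similarity(tfidf)
round(sim, 2)
```

Documents that share no weighted words (doc1 and doc2 here) get a similarity of 0, and every document has a similarity of 1 with itself. For the stories, a matrix like `tfidf` could be built from books_words.tf_idf with one row per document and one column per word.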