Data607 Presentation Tidying Text

Tidying Sherlock Holmes stories

Analyze which words appear in the book “The Adventures of Sherlock Holmes”.

Sample text from the story “A Scandal in Bohemia”:

Chapter I

 To Sherlock Holmes she is always the woman. I have seldom heard him
 mention her under any other name. In his eyes she eclipses and
 predominates the whole of her sex. It was not that he felt any
 emotion akin to love for Irene Adler. All emotions, and that one
 particularly, were abhorrent to his cold, precise but admirably
 balanced mind. He was, I take it, the most perfect reasoning and
 observing machine that the world has seen, but as a lover he would
 have placed himself in a false position. He never spoke of the softer
 passions, save with a gibe and a sneer......

Book Index:

The Adventures of Sherlock Holmes

A Scandal in Bohemia  51 kB
The Red-Headed League  54 kB
A Case of Identity  42 kB
The Boscombe Valley Mystery  56 kB
The Five Orange Pips  43 kB
The Man with the Twisted Lip  54 kB
The Adventure of the Blue Carbuncle  46 kB
The Adventure of the Speckled Band  58 kB
The Adventure of the Engineer’s Thumb  49 kB
The Adventure of the Noble Bachelor  49 kB
The Adventure of the Beryl Coronet  56 kB
The Adventure of the Copper Beeches  58 kB

filename.A_Scandal_in_Bohemia <- "TheAdventuresOfSherlockHolmes\\scan.txt"
filename.The_Red_Headed_League <- "TheAdventuresOfSherlockHolmes\\redh.txt"
filename.A_Case_of_Identity <- "TheAdventuresOfSherlockHolmes\\iden.txt"
filename.The_Boscombe_Valley_Mystery <- "TheAdventuresOfSherlockHolmes\\bosc.txt"
filename.The_Five_Orange_Pips <- "TheAdventuresOfSherlockHolmes\\five.txt"
filename.The_Man_with_the_Twisted_Lip <- "TheAdventuresOfSherlockHolmes\\twis.txt"
filename.The_Adventure_of_the_Blue_Carbuncle <- "TheAdventuresOfSherlockHolmes\\blue.txt"
filename.The_Adventure_of_the_Speckled_Band <- "TheAdventuresOfSherlockHolmes\\spec.txt"
filename.The_Adventure_of_the_Engineers_Thumb <- "TheAdventuresOfSherlockHolmes\\engr.txt"
filename.The_Adventure_of_the_Noble_Bachelor <- "TheAdventuresOfSherlockHolmes\\nobl.txt"
filename.The_Adventure_of_the_Beryl_Coronet <- "TheAdventuresOfSherlockHolmes\\bery.txt"
filename.The_Adventure_of_the_Copper_Beeches <- "TheAdventuresOfSherlockHolmes\\copp.txt"

conn <- file(filename.A_Scandal_in_Bohemia,open="r")

story.A_Scandal_in_Bohemia <-readLines(conn)

close(conn)

conn <- file(filename.The_Red_Headed_League,open="r")

story.The_Red_Headed_League <-readLines(conn)

close(conn)

conn <- file(filename.A_Case_of_Identity,open="r")

story.A_Case_of_Identity <-readLines(conn)

close(conn)


conn <- file(filename.The_Boscombe_Valley_Mystery,open="r")

story.The_Boscombe_Valley_Mystery <-readLines(conn)

close(conn)


conn <- file(filename.The_Five_Orange_Pips,open="r")

story.The_Five_Orange_Pips <-readLines(conn)

close(conn)


conn <- file(filename.The_Man_with_the_Twisted_Lip,open="r")

story.The_Man_with_the_Twisted_Lip <-readLines(conn)

close(conn)


conn <- file(filename.The_Adventure_of_the_Blue_Carbuncle,open="r")

story.The_Adventure_of_the_Blue_Carbuncle <-readLines(conn)

close(conn)


conn <- file(filename.The_Adventure_of_the_Speckled_Band,open="r")

story.The_Adventure_of_the_Speckled_Band <-readLines(conn)

close(conn)


conn <- file(filename.The_Adventure_of_the_Engineers_Thumb,open="r")

story.The_Adventure_of_the_Engineers_Thumb <-readLines(conn)

close(conn)

conn <- file(filename.The_Adventure_of_the_Noble_Bachelor,open="r")

story.The_Adventure_of_the_Noble_Bachelor <-readLines(conn)

close(conn)


conn <- file(filename.The_Adventure_of_the_Beryl_Coronet,open="r")

story.The_Adventure_of_the_Beryl_Coronet <-readLines(conn)

close(conn)


conn <- file(filename.The_Adventure_of_the_Copper_Beeches,open="r")

story.The_Adventure_of_the_Copper_Beeches <-readLines(conn)

close(conn)
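
The twelve read blocks above all follow the same pattern, so they could also be written as a loop. A minimal sketch, assuming a named character vector of file paths (the names `story_files` and `stories`, and the temporary stand-in files that let the sketch run on its own, are illustrative and not part of the original analysis):

```r
# Hypothetical stand-in files so the sketch runs without the real texts.
tmp_dir <- tempdir()
story_files <- c(
  A_Scandal_in_Bohemia  = file.path(tmp_dir, "scan.txt"),
  The_Red_Headed_League = file.path(tmp_dir, "redh.txt")
)
writeLines(c("A SCANDAL IN BOHEMIA", "To Sherlock Holmes she is always the woman."),
           story_files[["A_Scandal_in_Bohemia"]])
writeLines(c("THE RED-HEADED LEAGUE", "I had called upon my friend."),
           story_files[["The_Red_Headed_League"]])

# Read every file into a named list: one character vector of lines per story.
# readLines() opens and closes the connection itself when given a path.
stories <- lapply(story_files, readLines)

names(stories)
length(stories$A_Scandal_in_Bohemia)
```

With the real files, `story_files` would hold the twelve paths defined earlier, and `stories$A_Scandal_in_Bohemia` would play the role of `story.A_Scandal_in_Bohemia`.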

library(tidytext)
library(dplyr)
library(stringr)
library(ggplot2)

Tidy Data

We are going to use unnest_tokens() to split the text into one word per row:

text_df.A_Scandal_in_Bohemia <- tibble(story = "A_Scandal_in_Bohemia", text = story.A_Scandal_in_Bohemia)

text_df.A_Scandal_in_Bohemia.tidytext <- text_df.A_Scandal_in_Bohemia %>%
  unnest_tokens(word, text) 

head(text_df.A_Scandal_in_Bohemia.tidytext)

## # A tibble: 6 x 2
##                  story    word
##                  <chr>   <chr>
## 1 A_Scandal_in_Bohemia       a
## 2 A_Scandal_in_Bohemia scandal
## 3 A_Scandal_in_Bohemia      in
## 4 A_Scandal_in_Bohemia bohemia
## 5 A_Scandal_in_Bohemia  arthur
## 6 A_Scandal_in_Bohemia   conan
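
To see what unnest_tokens() does on a small input, here is a sketch using the first sentence of the sample text (the names `sample_df` and `sample_words` are illustrative). With its default settings the function lowercases the words and drops punctuation:

```r
library(dplyr)
library(tidytext)

# One row of text, in the same shape as the story data frames above.
sample_df <- tibble(
  story = "A_Scandal_in_Bohemia",
  text  = "To Sherlock Holmes she is always the woman."
)

# One row per word after tokenizing.
sample_words <- sample_df %>%
  unnest_tokens(word, text)

sample_words$word
```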

Remove stop-words and analyze the words

# Drop the tidytext stop words, then count how often each remaining word occurs
text_df.A_Scandal_in_Bohemia.tidytext.count <- text_df.A_Scandal_in_Bohemia.tidytext %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  mutate(document = "A_Scandal_in_Bohemia")

Plot the words ordered by frequency, keeping only those that appear more than 10 times:

text_df.A_Scandal_in_Bohemia.tidytext.count %>%
  filter(n > 10) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

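Now we try to find the sentiment of the story using the NRC lexicon, which tags words with sentiment categories such as positive, negative, joy and fear: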

text_df.A_Scandal_in_Bohemia.tidytext.sentiment <- text_df.A_Scandal_in_Bohemia.tidytext.count %>% 
  inner_join(get_sentiments("nrc"))

## Joining, by = "word"

# Total count per sentiment: attach the group total to every row,
# sort by that total, then keep one row per sentiment
text_df.A_Scandal_in_Bohemia.tidytext.sentiment.score <- text_df.A_Scandal_in_Bohemia.tidytext.sentiment %>%
  group_by(sentiment) %>%
  mutate(n_sum = sum(n)) %>%
  select(sentiment, n_sum) %>%
  arrange(-n_sum) %>%
  filter(row_number() == 1)


text_df.A_Scandal_in_Bohemia.tidytext.sentiment.score

## # A tibble: 10 x 2
## # Groups:   sentiment [10]
##       sentiment n_sum
##           <chr> <int>
##  1     positive   345
##  2        trust   210
##  3     negative   209
##  4 anticipation   150
##  5          joy   130
##  6         fear   129
##  7      sadness   101
##  8        anger    80
##  9     surprise    58
## 10      disgust    55
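
The same per-sentiment totals can be computed more directly with summarise(). A minimal sketch on a small made-up table (the words, counts and sentiment labels in `toy` are invented for illustration and are not taken from the story):

```r
library(dplyr)

# Invented word counts already joined to sentiment labels.
toy <- tibble(
  word      = c("love", "love", "cold", "sneer", "friend"),
  n         = c(5L, 5L, 3L, 3L, 4L),
  sentiment = c("positive", "joy", "negative", "negative", "positive")
)

# One row per sentiment with the summed count, largest first.
toy_score <- toy %>%
  group_by(sentiment) %>%
  summarise(n_sum = sum(n)) %>%
  arrange(desc(n_sum))

toy_score
```

Unlike the mutate() and filter() version, summarise() returns one row per sentiment by construction, so no de-duplication step is needed.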

How about other Sherlock Holmes stories?

TF-IDF (term frequency × inverse document frequency) is a very popular way to weight the words in a text. We can use this measure, instead of a stop-word dictionary, to remove unwanted common words.
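
As a small worked example of the calculation, the sketch below computes TF-IDF by hand in base R for three made-up documents. It uses tf = (count of the word in the document) / (total words in the document) and idf = ln(number of documents / number of documents containing the word), which is the definition used by tidytext's bind_tf_idf(). The toy counts and the names `counts`, `tf`, `idf` and `tf_idf` are assumptions for illustration.

```r
# Rows are documents, columns are words; entries are invented raw counts.
counts <- matrix(
  c(10, 2, 0,
     8, 0, 3,
    12, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("the", "holmes", "league"))
)

tf     <- counts / rowSums(counts)                 # term frequency within each document
idf    <- log(nrow(counts) / colSums(counts > 0))  # ln(documents / documents containing the word)
tf_idf <- sweep(tf, 2, idf, `*`)                   # scale each word's column by its idf

round(idf, 3)
round(tf_idf, 3)
```

Here "the" appears in all three documents, so its idf is ln(3/3) = 0 and its TF-IDF is 0 everywhere, while "league" appears in only one document and gets the largest idf.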

Let’s prepare the word counts without removing stop words:

text_df.A_Scandal_in_Bohemia.tidytext.count_raw <- text_df.A_Scandal_in_Bohemia.tidytext %>%
  count(word, sort=TRUE ) %>%
  mutate(
    document = "A_Scandal_in_Bohemia"
  )

Prepare the word counts for the other 11 stories:

# Prepare one row per document per term, with the term's frequency

#The Red-Headed League

text_df.The_Red_Headed_League <- tibble(story = "The_Red_Headed_League", text = story.The_Red_Headed_League)

text_df.The_Red_Headed_League.tidytext <- text_df.The_Red_Headed_League %>%
  unnest_tokens(word, text) 

text_df.The_Red_Headed_League.count_raw <- text_df.The_Red_Headed_League.tidytext %>%
  count(word, sort=TRUE ) %>%
  mutate(
    document = "The_Red_Headed_League"
  )

#A Case of Identity

text_df.A_Case_of_Identity <- tibble(story = "A_Case_of_Identity", text = story.A_Case_of_Identity)

text_df.A_Case_of_Identity.tidytext <- text_df.A_Case_of_Identity %>%
  unnest_tokens(word, text) 

text_df.A_Case_of_Identity.count_raw <- text_df.A_Case_of_Identity.tidytext %>%
  count(word, sort=TRUE ) %>%
  mutate(
    document = "A_Case_of_Identity"
  )


#The Boscombe Valley Mystery
text_df.The_Boscombe_Valley_Mystery <- tibble(story = "The_Boscombe_Valley_Mystery", text = story.The_Boscombe_Valley_Mystery)

text_df.The_Boscombe_Valley_Mystery.tidytext <- text_df.The_Boscombe_Valley_Mystery %>%
  unnest_tokens(word, text) 

text_df.The_Boscombe_Valley_Mystery.count_raw <- text_df.The_Boscombe_Valley_Mystery.tidytext %>%
  count(word, sort=TRUE ) %>%
  mutate(
    document = "The_Boscombe_Valley_Mystery"
  )


#The Five Orange Pips
text_df.The_Five_Orange_Pips <- tibble(story = "The_Five_Orange_Pips", text = story.The_Five_Orange_Pips)

text_df.The_Five_Orange_Pips.tidytext <- text_df.The_Five_Orange_Pips %>%
  unnest_tokens(word, text) 

text_df.The_Five_Orange_Pips.count_raw <- text_df.The_Five_Orange_Pips.tidytext %>%
  count(word, sort=TRUE ) %>%
  mutate(
    document = "The_Five_Orange_Pips"
  )

#The Man with the Twisted Lip

text_df.The_Man_with_the_Twisted_Lip <- tibble(story = "The_Man_with_the_Twisted_Lip", text = story.The_Man_with_the_Twisted_Lip)

text_df.The_Man_with_the_Twisted_Lip.tidytext <- text_df.The_Man_with_the_Twisted_Lip %>%
  unnest_tokens(word, text) 

text_df.The_Man_with_the_Twisted_Lip.count_raw <- text_df.The_Man_with_the_Twisted_Lip.tidytext %>%
  count(word, sort=TRUE ) %>%
  mutate(
    document = "The_Man_with_the_Twisted_Lip"
  )


#The Adventure of the Blue Carbuncle


text_df.The_Adventure_of_the_Blue_Carbuncle <- tibble(story = "The_Adventure_of_the_Blue_Carbuncle", text = story.The_Adventure_of_the_Blue_Carbuncle)

text_df.The_Adventure_of_the_Blue_Carbuncle.tidytext <- text_df.The_Adventure_of_the_Blue_Carbuncle %>%
  unnest_tokens(word, text) 

text_df.The_Adventure_of_the_Blue_Carbuncle.count_raw <- text_df.The_Adventure_of_the_Blue_Carbuncle.tidytext %>%
  count(word, sort=TRUE ) %>%
  mutate(
    document = "The_Adventure_of_the_Blue_Carbuncle"
  )


#The Adventure of the Speckled Band
text_df.The_Adventure_of_the_Speckled_Band <- tibble(story = "The_Adventure_of_the_Speckled_Band", text = story.The_Adventure_of_the_Speckled_Band)

text_df.The_Adventure_of_the_Speckled_Band.tidytext <- text_df.The_Adventure_of_the_Speckled_Band %>%
  unnest_tokens(word, text) 

text_df.The_Adventure_of_the_Speckled_Band.count_raw <- text_df.The_Adventure_of_the_Speckled_Band.tidytext %>%
  count(word, sort=TRUE ) %>%
  mutate(
    document = "The_Adventure_of_the_Speckled_Band"
  )

#The Adventure of the Engineer's Thumb
text_df.The_Adventure_of_the_Engineers_Thumb <- tibble(story = "The_Adventure_of_the_Engineers_Thumb", text = story.The_Adventure_of_the_Engineers_Thumb)

text_df.The_Adventure_of_the_Engineers_Thumb.tidytext <- text_df.The_Adventure_of_the_Engineers_Thumb %>%
  unnest_tokens(word, text) 

text_df.The_Adventure_of_the_Engineers_Thumb.count_raw <- text_df.The_Adventure_of_the_Engineers_Thumb.tidytext %>%
  count(word, sort=TRUE ) %>%
  mutate(
    document = "The_Adventure_of_the_Engineers_Thumb"
  )

#The Adventure of the Noble Bachelor
text_df.The_Adventure_of_the_Noble_Bachelor <- tibble(story = "The_Adventure_of_the_Noble_Bachelor", text = story.The_Adventure_of_the_Noble_Bachelor)

text_df.The_Adventure_of_the_Noble_Bachelor.tidytext <- text_df.The_Adventure_of_the_Noble_Bachelor %>%
  unnest_tokens(word, text) 

text_df.The_Adventure_of_the_Noble_Bachelor.count_raw <- text_df.The_Adventure_of_the_Noble_Bachelor.tidytext %>%
  count(word, sort=TRUE ) %>%
  mutate(
    document = "The_Adventure_of_the_Noble_Bachelor"
  )

#The Adventure of the Beryl Coronet
text_df.The_Adventure_of_the_Beryl_Coronet <- tibble(story = "The_Adventure_of_the_Beryl_Coronet", text = story.The_Adventure_of_the_Beryl_Coronet)

text_df.The_Adventure_of_the_Beryl_Coronet.tidytext <- text_df.The_Adventure_of_the_Beryl_Coronet %>%
  unnest_tokens(word, text) 

text_df.The_Adventure_of_the_Beryl_Coronet.count_raw <- text_df.The_Adventure_of_the_Beryl_Coronet.tidytext %>%
  count(word, sort=TRUE ) %>%
  mutate(
    document = "The_Adventure_of_the_Beryl_Coronet"
  )

#The Adventure of the Copper Beeches
text_df.The_Adventure_of_the_Copper_Beeches <- tibble(story = "The_Adventure_of_the_Copper_Beeches", text = story.The_Adventure_of_the_Copper_Beeches)

text_df.The_Adventure_of_the_Copper_Beeches.tidytext <- text_df.The_Adventure_of_the_Copper_Beeches %>%
  unnest_tokens(word, text) 

text_df.The_Adventure_of_the_Copper_Beeches.count_raw <- text_df.The_Adventure_of_the_Copper_Beeches.tidytext %>%
  count(word, sort=TRUE ) %>%
  mutate(
    document = "The_Adventure_of_the_Copper_Beeches"
  )
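
Because every story goes through the same three steps, the blocks above could also be produced by one helper function. A sketch, assuming a hypothetical helper `count_story_words()` and a named list `toy_stories` holding two short invented texts so the code runs on its own:

```r
library(dplyr)
library(tidytext)

# Hypothetical helper: tokenize one story and count its words.
count_story_words <- function(story_name, story_lines) {
  tibble(story = story_name, text = story_lines) %>%
    unnest_tokens(word, text) %>%
    count(word, sort = TRUE) %>%
    mutate(document = story_name)
}

# Invented stand-ins for the real story texts.
toy_stories <- list(
  story_one = c("The league was dissolved.", "The league is gone."),
  story_two = c("The woman.", "Holmes called her the woman.")
)

# Apply the helper to every story and stack the results.
books_words_sketch <- bind_rows(
  Map(count_story_words, names(toy_stories), toy_stories)
)

books_words_sketch
```

The result has the same columns (word, n, document) as the per-story count tables, already combined into one data frame.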



# Combine the per-story word counts into one data frame
books_words <- bind_rows(
  text_df.The_Red_Headed_League.count_raw,
  text_df.A_Scandal_in_Bohemia.tidytext.count_raw,
  text_df.A_Case_of_Identity.count_raw,
  text_df.The_Boscombe_Valley_Mystery.count_raw,
  text_df.The_Five_Orange_Pips.count_raw,
  text_df.The_Man_with_the_Twisted_Lip.count_raw,
  text_df.The_Adventure_of_the_Blue_Carbuncle.count_raw,
  text_df.The_Adventure_of_the_Speckled_Band.count_raw,
  text_df.The_Adventure_of_the_Engineers_Thumb.count_raw,
  text_df.The_Adventure_of_the_Noble_Bachelor.count_raw,
  text_df.The_Adventure_of_the_Beryl_Coronet.count_raw,
  text_df.The_Adventure_of_the_Copper_Beeches.count_raw
)

We can take advantage of TF-IDF to create our own word list. TF-IDF gives more weight to words that are rare across the stories and less weight to words that appear in most of them.

books_words.tf_idf <- books_words %>% 
  bind_tf_idf(word, document, n) 
  
head(books_words.tf_idf)

## # A tibble: 6 x 6
##    word     n              document         tf   idf tf_idf
##   <chr> <int>                 <chr>      <dbl> <dbl>  <dbl>
## 1   the   464 The_Red_Headed_League 0.04990320     0      0
## 2   and   281 The_Red_Headed_League 0.03022155     0      0
## 3     i   261 The_Red_Headed_League 0.02807055     0      0
## 4     a   241 The_Red_Headed_League 0.02591955     0      0
## 5    to   232 The_Red_Headed_League 0.02495160     0      0
## 6    of   227 The_Red_Headed_League 0.02441385     0      0

Stop-word list from TF-IDF: a word with TF-IDF == 0 appears in all 12 stories, so we treat it as a stop word.

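# Create our own detective stop-word list: words whose TF-IDF is 0.
# distinct() keeps each word once, since it occurs once per story in books_words.tf_idf.
Detective_stop_words <-
  books_words.tf_idf %>%
  filter(tf_idf == 0) %>%
  select(word) %>%
  distinct()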
 

head(Detective_stop_words)

## # A tibble: 6 x 1
##    word
##   <chr>
## 1   the
## 2   and
## 3     i
## 4     a
## 5    to
## 6    of

Characteristic words are those with TF-IDF > 0:

DetectiveWords <-
  books_words.tf_idf %>%
  filter(tf_idf > 0)

DetectiveWords %>%
  filter(n > 20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

More…

3) TF-IDF can be used for cosine similarity analysis.
4) As input for a search engine: TF, IDF, and TF-IDF help to link keywords to documents.
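
As an illustration of point 3, the cosine similarity between two documents is the dot product of their TF-IDF vectors divided by the product of the vector lengths. A minimal base R sketch, assuming a small made-up document-by-word TF-IDF matrix (the values and the helper `cosine_similarity()` are hypothetical; in practice the matrix could be built from books_words.tf_idf):

```r
# Invented TF-IDF weights: rows are documents, columns are words.
tf_idf_matrix <- matrix(
  c(0.2, 0.0, 0.1,
    0.4, 0.0, 0.2,
    0.0, 0.3, 0.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("league", "pips", "coronet"))
)

# Hypothetical helper: cosine of the angle between two numeric vectors.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

sim_12 <- cosine_similarity(tf_idf_matrix["doc1", ], tf_idf_matrix["doc2", ])
sim_13 <- cosine_similarity(tf_idf_matrix["doc1", ], tf_idf_matrix["doc3", ])

sim_12  # doc1 and doc2 use the same words in the same proportions
sim_13  # doc1 and doc3 share no weighted words
```

Documents whose characteristic words overlap score close to 1, and documents with no characteristic words in common score 0.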

Yuen Chun Wong

October 24, 2017