#Text Mining and Natural Language Processing
The goals of this week's assignment are as follows:

1. Get the primary example code from chapter 2 of "Text Mining with R: A Tidy Approach" by Julia Silge and David Robinson.
2. Work with a different corpus. For this, I chose a collection of works by my favorite author, Kurt Vonnegut Jr., downloaded from Project Gutenberg.
3. Incorporate at least one additional sentiment lexicon.
4. Find the most frequent words.
5. Find the most important words.
6. Publish to RPubs and GitHub, and post the links to BB.
#Import Data
library(gutenbergr); library(tidytext); library(tidyverse)  # corpus download, tokenizing + lexicons, dplyr/ggplot2/stringr
kurt <- gutenberg_works(author == "Vonnegut, Kurt")
kurt_works <- gutenberg_download(c(21279, 30240))
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
Unfortunately, only two of Kurt Vonnegut Jr.'s works are available on Project Gutenberg: "2 B R 0 2 B" (ID 21279) and "The Big Trip Up Yonder" (ID 30240).
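As a quick check, we can print the metadata table built above to confirm which Vonnegut works Project Gutenberg offers. A minimal sketch, assuming the standard gutenberg_works() metadata columns (gutenberg_id, title, author):
kurt %>%
  select(gutenberg_id, title, author)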
#Tidy the Data
tidy_kurt <- kurt_works %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"
#Most Frequent Words
kurt_frequency <- tidy_kurt %>%
  count(word)
#Plot Frequencies
kurt_top_words <- kurt_frequency %>%
  filter(n > 14)
# Reorder the bars by count and flip the axes so the word labels stay legible
ggplot(kurt_top_words, aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "word")
As we can see, the most frequently used words are "gramps" and "lou", which are likely character names.
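If you prefer exact counts to bar heights, the same information can be read straight off the frequency table; a quick sketch using the kurt_frequency table built above:
kurt_frequency %>%
  arrange(desc(n)) %>%
  head(10)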
#Get Sentiments
get_sentiments("afinn")
## # A tibble: 2,477 x 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ... with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ... with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,875 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,865 more rows
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
Here I filtered the NRC lexicon down to just the words associated with joy.
Next, I need to find those joyful words in Kurt Vonnegut's works.
#NRC Joy
Here we take all of the words from Vonnegut's works that NRC relates to joy.
nrc_joy_kurt <- tidy_kurt %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
As you can see, the "joyful" word that is used the most is "happy".
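Since the full table is not printed here, a quick way to confirm that is to peek at the first few rows of nrc_joy_kurt (count() with sort = TRUE already placed the most frequent words at the top):
head(nrc_joy_kurt, 10)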
#AFINN Joy
This pulls out the words from Vonnegut's works that AFINN rates as positive.
afinn_joy <- get_sentiments("afinn") %>%
  filter(value > 0)
kurt_afinn_joy <- tidy_kurt %>%
  inner_join(afinn_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
#Bing Joy
Here we find the positive words in Vonnegut's works using the Bing lexicon.
bing_joy <- get_sentiments("bing") %>%
  filter(sentiment == "positive")
kurt_bing_joy <- tidy_kurt %>%
  inner_join(bing_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
#Loughran Joy
Here we will use a lexicon that is not used in the examples in the book.
loughran_joy <- get_sentiments("loughran") %>%
  filter(sentiment == "positive")
kurt_loughran_joy <- tidy_kurt %>%
  inner_join(loughran_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
#Graphing the Top Words
This is a great way to compare how the sentiment lexicons work.
graph_kurt_afinn <- kurt_afinn_joy %>%
  filter(n > 3)
graph_kurt_bing <- kurt_bing_joy %>%
  filter(n > 3)
graph_kurt_loughran <- kurt_loughran_joy %>%
  filter(n > 3)
graph_kurt_nrc <- nrc_joy_kurt %>%
  filter(n > 3)
# One layer per lexicon: red = AFINN, blue = Bing, green = Loughran, yellow = NRC
ggplot() +
  geom_point(data = graph_kurt_afinn, aes(x = word, y = n), color = "red") +
  geom_point(data = graph_kurt_bing, aes(x = word, y = n), color = "blue") +
  geom_point(data = graph_kurt_loughran, aes(x = word, y = n), color = "green") +
  geom_point(data = graph_kurt_nrc, aes(x = word, y = n), color = "yellow")
#Which Lexicon to Use?
It definitely seems to be a matter of preference. I personally want to see more words being matched, so I would prefer NRC, which simply has more words; a rough coverage comparison is sketched below.
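To put a rough number on that preference, we can compare how many distinct corpus words each positive/joy subset matched. A minimal sketch reusing the kurt_*_joy tables built above (these are corpus matches, not full lexicon sizes):
tibble(
  lexicon = c("AFINN", "Bing", "Loughran", "NRC"),
  matched_words = c(nrow(kurt_afinn_joy), nrow(kurt_bing_joy),
                    nrow(kurt_loughran_joy), nrow(nrc_joy_kurt))
)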
#Which Words Are Most Important?
To find which words are the most important, we use tf-idf (term frequency weighted by inverse document frequency); a worked example for one word appears after the tf-idf table below.
# Rename the gutenberg_id column to book and swap the IDs for the book titles
colnames(tidy_kurt) <- c("book", "word")
tidy_kurt$book <- tidy_kurt$book %>%
  str_replace_all(c("21279" = "2 b r 0 2 b", "30240" = "the big trip up yonder"))
# Now pull out the word frequencies per book
book_words <- tidy_kurt %>%
  count(book, word, sort = TRUE)
total_words <- book_words %>%
  group_by(book) %>%
  summarize(total = sum(n))
book_words <- left_join(book_words, total_words)
## Joining, by = "book"
freq_by_rank <- book_words %>%
  group_by(book) %>%
  mutate(rank = row_number(),
         `term frequency` = n / total) %>%
  ungroup()
freq_by_rank
## # A tibble: 1,515 x 6
## book word n total rank `term frequency`
## <chr> <chr> <int> <int> <int> <dbl>
## 1 the big trip up yonder gramps 44 1454 1 0.0303
## 2 the big trip up yonder lou 44 1454 2 0.0303
## 3 2 b r 0 2 b wehling 22 910 1 0.0242
## 4 2 b r 0 2 b hitz 21 910 2 0.0231
## 5 2 b r 0 2 b dr 18 910 3 0.0198
## 6 2 b r 0 2 b painter 14 910 4 0.0154
## 7 the big trip up yonder em 14 1454 3 0.00963
## 8 the big trip up yonder time 14 1454 4 0.00963
## 9 2 b r 0 2 b people 13 910 5 0.0143
## 10 2 b r 0 2 b duncan 12 910 6 0.0132
## # ... with 1,505 more rows
book_tf_idf <- book_words %>%
  bind_tf_idf(word, book, n)
book_tf_idf %>%
  arrange(desc(tf_idf))
## # A tibble: 1,515 x 7
## book word n total tf idf tf_idf
## <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 the big trip up yonder gramps 44 1454 0.0303 0.693 0.0210
## 2 the big trip up yonder lou 44 1454 0.0303 0.693 0.0210
## 3 2 b r 0 2 b wehling 22 910 0.0242 0.693 0.0168
## 4 2 b r 0 2 b painter 14 910 0.0154 0.693 0.0107
## 5 2 b r 0 2 b duncan 12 910 0.0132 0.693 0.00914
## 6 2 b r 0 2 b orderly 11 910 0.0121 0.693 0.00838
## 7 2 b r 0 2 b happy 10 910 0.0110 0.693 0.00762
## 8 the big trip up yonder em 14 1454 0.00963 0.693 0.00667
## 9 the big trip up yonder anti 12 1454 0.00825 0.693 0.00572
## 10 the big trip up yonder gerasone 12 1454 0.00825 0.693 0.00572
## # ... with 1,505 more rows
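As a sanity check on what bind_tf_idf is doing, here is the tf-idf for "gramps" computed by hand from the counts shown above (44 occurrences out of 1454 words in "the big trip up yonder", and the word appears in only 1 of the 2 books):
tf  <- 44 / 1454   # term frequency, about 0.0303
idf <- log(2 / 1)  # inverse document frequency, about 0.693
tf * idf           # about 0.0210, matching the bind_tf_idf output above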
#Conclusions
In conclusion, the most frequently used words also turn out to be the most important words by tf-idf. With only two short stories in the corpus, that makes sense: character names dominate the raw counts, and since each name appears in only one of the two books, they also receive the maximum idf weight. I wonder how this would change if Project Gutenberg had more of Kurt Vonnegut's books; I imagine the character names would no longer top the lists for either frequency or importance.