Week 10 Assignment

# Text Mining and Natural Language Processing

The goals of this week’s assignment are as follows:

1. Reproduce the primary example code from Chapter 2 of “Text Mining with R: A Tidy Approach” by Julia Silge and David Robinson.
2. Work with a different corpus. For this, I chose a collection of works by my favorite author, Kurt Vonnegut Jr., downloaded from Project Gutenberg with the gutenbergr package.
3. Incorporate at least one additional sentiment lexicon.
4. Find the most frequent words.
5. Find the most important words.
6. Publish to RPubs and GitHub, and post the links to BB.
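The chunks below assume the following packages are loaded; the original setup chunk is not shown, so this is a reconstruction of the likely setup:

library(gutenbergr)  # gutenberg_works(), gutenberg_download()
library(tidytext)    # unnest_tokens(), stop_words, get_sentiments(), bind_tf_idf()
library(dplyr)       # count(), filter(), mutate(), joins
library(stringr)     # str_replace_all()
library(ggplot2)     # plots

Note that the AFINN, NRC, and Loughran lexicons are fetched through the textdata package the first time get_sentiments() requests them.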

# Import Data

kurt <- gutenberg_works(author == "Vonnegut, Kurt")
kurt_works <- gutenberg_download(c(21279, 30240))
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org

Unfortunately, there are only two books by Kurt Vonnegut Jr. in Project Gutenberg.
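To confirm which titles those two IDs correspond to, the gutenberg_works() metadata stored in kurt can be inspected (a quick sketch; output not shown here):

kurt %>%
  select(gutenberg_id, title)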

# Tidy the Data

tidy_kurt <- kurt_works %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"

# Most Frequent Words

kurt_frequency <- tidy_kurt %>%
  count(word)

# Plot Frequencies

kurt_top_words <- kurt_frequency %>%
  filter(n > 14)

ggplot(kurt_top_words, aes(x = word, y = n)) + 
  geom_col()

As we can see, the most frequently used words are “gramps” and “lou”, which are likely character names.
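The bars above appear in alphabetical order, which makes the labels hard to scan. A purely cosmetic variant of the same plot (same data, just sorted and flipped):

ggplot(kurt_top_words, aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "word", y = "count")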

# Get Sentiments

get_sentiments("afinn")
## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ... with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,875 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,865 more rows
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")

Here I filtered the NRC lexicon down to the words associated with joy.

Next, I need to find those joyful words in Kurt Vonnegut’s works.

# NRC Joy

Here we take all of the words in the Vonnegut works that the NRC lexicon associates with joy.

nrc_joy_kurt <- tidy_kurt %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"

As you can see, the joyful word used most often is “happy”.

# AFINN Joy

This will pull out the words from the Vonnegut works that AFINN rates as positive (value > 0).

afinn_joy <- get_sentiments("afinn") %>%
  filter(value > 0)

kurt_afinn_joy <- tidy_kurt %>%
  inner_join(afinn_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"

# Bing Joy

Here we find positive words in the Vonnegut works using the Bing lexicon.

bing_joy <- get_sentiments("bing") %>%
  filter(sentiment == "positive")

kurt_bing_joy <- tidy_kurt %>%
  inner_join(bing_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"

# Loughran Joy

Here we use a lexicon that does not appear in the book’s examples.

loughran_joy <- get_sentiments("loughran") %>%
  filter(sentiment == "positive")

kurt_loughran_joy <- tidy_kurt %>%
  inner_join(loughran_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"

# Graphing the Top Words

This is a great way to compare how the sentiment lexicons behave.

graph_kurt_afinn <- kurt_afinn_joy %>%
  filter(n > 3)
graph_kurt_bing <- kurt_bing_joy %>%
  filter(n > 3)
graph_kurt_loughran <- kurt_loughran_joy %>%
  filter(n > 3)
graph_kurt_nrc <- nrc_joy_kurt %>%
  filter(n > 3)
ggplot() +
  geom_point(data = graph_kurt_afinn, aes(x = word, y = n), color = "red") +
  geom_point(data = graph_kurt_bing, aes(x = word, y = n), color = "blue") +
  geom_point(data = graph_kurt_loughran, aes(x = word, y = n), color = "green") +
  geom_point(data = graph_kurt_nrc, aes(x = word, y = n), color = "yellow")
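Layering four geom_point() calls with hard-coded colors works, but it does not produce a legend, so the colors have to be decoded from the code. A sketch of an alternative that stacks the same four data frames and maps the lexicon name to color (joy_counts is just an illustrative name):

joy_counts <- bind_rows(
  mutate(graph_kurt_afinn, lexicon = "AFINN"),
  mutate(graph_kurt_bing, lexicon = "Bing"),
  mutate(graph_kurt_loughran, lexicon = "Loughran"),
  mutate(graph_kurt_nrc, lexicon = "NRC")
)

ggplot(joy_counts, aes(x = word, y = n, color = lexicon)) +
  geom_point() +
  coord_flip()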

# Which Lexicon to Use?

It largely seems to be a matter of preference. I personally want to see more words flagged, so I would prefer NRC: it simply matched more words.
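One quick way to back up that impression is to count how many distinct words each lexicon actually matched in this corpus (a sketch built from the data frames above):

tibble(
  lexicon = c("AFINN", "Bing", "Loughran", "NRC"),
  words_matched = c(nrow(kurt_afinn_joy),
                    nrow(kurt_bing_joy),
                    nrow(kurt_loughran_joy),
                    nrow(nrc_joy_kurt))
)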

# Which Words Are Most Important?

To find the most important words, we use tf-idf (via tidytext’s bind_tf_idf()).

colnames(tidy_kurt) <- c("book", "word")
tidy_kurt$book <- tidy_kurt$book %>%
  str_replace_all(c("21279" = "2 b r 0 2 b", "30240" = "the big trip up yonder"))
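For reference, the same relabeling could be done in one dplyr step starting from the downloaded text, which avoids renaming columns by position (tidy_kurt_alt is just an illustrative name and is not used below):

tidy_kurt_alt <- kurt_works %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  mutate(book = recode(as.character(gutenberg_id),
                       "21279" = "2 b r 0 2 b",
                       "30240" = "the big trip up yonder")) %>%
  select(book, word)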

# Word Frequencies per Book

book_words <- tidy_kurt %>%
  count(book, word, sort = TRUE)

total_words <- book_words %>%
  group_by(book) %>%
  summarize(total = sum(n))

book_words <- left_join(book_words, total_words)
## Joining, by = "book"
freq_by_rank <- book_words %>%
  group_by(book) %>%
  mutate(rank = row_number(),
         `term frequency` = n / total) %>%
  ungroup()
freq_by_rank
## # A tibble: 1,515 x 6
##    book                   word        n total  rank `term frequency`
##    <chr>                  <chr>   <int> <int> <int>            <dbl>
##  1 the big trip up yonder gramps     44  1454     1          0.0303 
##  2 the big trip up yonder lou        44  1454     2          0.0303 
##  3 2 b r 0 2 b            wehling    22   910     1          0.0242 
##  4 2 b r 0 2 b            hitz       21   910     2          0.0231 
##  5 2 b r 0 2 b            dr         18   910     3          0.0198 
##  6 2 b r 0 2 b            painter    14   910     4          0.0154 
##  7 the big trip up yonder em         14  1454     3          0.00963
##  8 the big trip up yonder time       14  1454     4          0.00963
##  9 2 b r 0 2 b            people     13   910     5          0.0143 
## 10 2 b r 0 2 b            duncan     12   910     6          0.0132 
## # ... with 1,505 more rows
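In Silge and Robinson, this rank/term-frequency table is plotted on log-log axes to check how closely the corpus follows Zipf’s law; a sketch of that plot for these two works:

freq_by_rank %>%
  ggplot(aes(rank, `term frequency`, color = book)) +
  geom_line() +
  scale_x_log10() +
  scale_y_log10()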
book_tf_idf <- book_words %>%
  bind_tf_idf(word, book, n)
book_tf_idf %>%
  arrange(desc(tf_idf))
## # A tibble: 1,515 x 7
##    book                   word         n total      tf   idf  tf_idf
##    <chr>                  <chr>    <int> <int>   <dbl> <dbl>   <dbl>
##  1 the big trip up yonder gramps      44  1454 0.0303  0.693 0.0210 
##  2 the big trip up yonder lou         44  1454 0.0303  0.693 0.0210 
##  3 2 b r 0 2 b            wehling     22   910 0.0242  0.693 0.0168 
##  4 2 b r 0 2 b            painter     14   910 0.0154  0.693 0.0107 
##  5 2 b r 0 2 b            duncan      12   910 0.0132  0.693 0.00914
##  6 2 b r 0 2 b            orderly     11   910 0.0121  0.693 0.00838
##  7 2 b r 0 2 b            happy       10   910 0.0110  0.693 0.00762
##  8 the big trip up yonder em          14  1454 0.00963 0.693 0.00667
##  9 the big trip up yonder anti        12  1454 0.00825 0.693 0.00572
## 10 the big trip up yonder gerasone    12  1454 0.00825 0.693 0.00572
## # ... with 1,505 more rows
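It is worth noting why the idf column is a constant 0.693 here: with only two documents, a word that appears in just one book has idf = ln(2/1) ≈ 0.693, and a word that appears in both books has idf = ln(2/2) = 0, so its tf-idf is zero no matter how often it occurs. A quick check (a sketch; output omitted):

log(2)  # ~0.693, the idf shown for book-unique words above

book_tf_idf %>%
  filter(idf == 0) %>%   # words that appear in both books
  arrange(desc(tf))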

# Conclusions

In conclusion, the most frequently used words also turn out to be the most important ones by tf-idf. I wonder how this would change if Project Gutenberg had more of Kurt Vonnegut’s books; I imagine the character names would no longer top the lists for either frequency or importance.