This is an R Markdown Notebook.

Assignment:

  1. Unnest the words of the tweets, remove stop words and weird web “words”, and create a table and a word cloud of the top words.
  2. Conduct a sentiment analysis using bing, remove any errors like trump = positive, and create a graph of the words that contribute most to each sentiment.
  3. Do the same as above but with the nrc sentiment lexicon.
  4. Unnest the tweets as bigrams, remove stop words and errors, and create a table and word cloud of the most common bigrams.
  5. Using the bigrams, look for the most common words that follow two different words. You may choose trump and pelosi, or choose your own.
usatoday_tweets <- get_timeline("usatoday", n =5000)
usatoday_words <- usatoday_tweets %>% 
  unnest_tokens(word, text) %>% 
  select(screen_name, word)
usatoday_words %>% 
  anti_join(get_stopwords()) %>% 
  filter(!word == "https") %>% 
  filter(!word == "t.co") %>% 
  count(word, sort = T)
Joining, by = "word"

God, this is going to be sad. Okay. Top words are corona-related. No surprise.

usatoday_words %>% 
  anti_join(get_stopwords()) %>% 
  filter(!word == "https") %>% 
  filter(!word == "t.co") %>% 
  count(word, sort = T)%>% 
  top_n(200) %>% 
  wordcloud2(size = .6)
Joining, by = "word"
Selecting by n

Cardi B Voice: CAH-ROHNA VI-RUSS shit is real shit is getting real.

bing_words <- get_sentiments("bing")
usatoday_words %>% 
  inner_join(bing_words) %>% 
  filter(!word == "trump") %>% 
  count(word, sentiment, sort = TRUE) %>% 
  group_by(sentiment) %>% 
  top_n(10) %>% 
  ungroup() %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) + 
  facet_wrap(vars(sentiment), scales = "free") +
  labs(y = "Bing Sentiment Contribution by Word (USA Today Tweets)",
       x = NULL) +
  coord_flip() +
  theme_minimal()
Joining, by = "word"
Selecting by n

I was right: this is depressing. All the positive words are like an overly cheerful person telling you to just “stop being sad,” and the negative words are all about dying. Great.

nrc_words <- get_sentiments("nrc")
usatoday_words %>% 
  inner_join(nrc_words) %>% 
  filter(!word == "trump") %>% 
  count(word, sentiment, sort = TRUE) %>% 
  group_by(sentiment) %>% 
  top_n(5) %>% 
  ungroup() %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) + 
  facet_wrap(vars(sentiment), scales = "free") +
  labs(y = "NRC Sentiment Contribution by Word (USA Today Tweets)",
       x = NULL) +
  coord_flip() +
  theme_minimal()
Joining, by = "word"
Selecting by n

We’re all so scared. “Pandemic” shows up at the top for negative, sadness, and fear.

remove_words = c("https", "t.co")

usatoday_tweets %>% 
  select(text) %>% 
  unnest_tokens(words, text, token = "ngrams", n = 2) %>% 
  separate(words, c("word1", "word2"), sep = " ") %>% 
  filter(!word1 %in% stop_words$word) %>% 
  filter(!word2 %in% stop_words$word) %>% 
  filter(!word1 %in% remove_words) %>% 
  filter(!word2 %in% remove_words) %>% 
  unite(words, word1, word2, sep = " ") -> usatoday_bigrams
usatoday_bigrams %>% 
  count(words, sort = T)

This would be the ideal project to see the spread of tweets/interaction regarding the presidential candidates right before election, but now the world is ending and all we want to talk about is COVID.

usatoday_bigrams %>% 
  count(words, sort = T) %>% 
  top_n(100) %>% 
  wordcloud2(size = .5)
Selecting by n

See? Even with the limited coverage surrounding the 2020 presidential bid, it’s still cool to see how it could have been.

first_words <- c("biden", "sanders")

usatoday_bigrams %>% 
  count(words, sort = TRUE) %>% 
  separate(words, c("word1", "word2"), sep = " ") %>% 
  filter(word1 %in% first_words) %>% 
  count(word1, word2, wt = n, sort = TRUE) %>% 
  rename(total = nn)

Let’s do this one time with DNC candidates. Fairly positive.

first_words <- c("corona", "coronavirus")

usatoday_bigrams %>% 
  count(words, sort = TRUE) %>% 
  separate(words, c("word1", "word2"), sep = " ") %>% 
  filter(word1 %in% first_words) %>% 
  count(word1, word2, wt = n, sort = TRUE) %>% 
  rename(total = nn)

Oh, look. We’re sad again.

LS0tCnRpdGxlOiAiQVByYXR0IHR3ZWV0cyBBc3NpZ25tZW50IgpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sKLS0tCgpUaGlzIGlzIGFuIFtSIE1hcmtkb3duXShodHRwOi8vcm1hcmtkb3duLnJzdHVkaW8uY29tKSBOb3RlYm9vay4gCgpBc3NpZ25tZW50OgoKCjEuIFVubmVzdCB0aGUgd29yZHMgb2YgdGhlIHR3ZWV0cywgcmVtb3ZlIHN0b3Agd29yZHMgYW5kIHdlaXJkIHdlYiAid29yZHMiLCBhbmQgY3JlYXRlIGEgdGFibGUgYW5kIGEgd29yZCBjbG91ZCBvZiB0aGUgdG9wIHdvcmRzLiAgCjIuIENvbmR1Y3QgYSBzZW50aW1lbnQgYW5hbHlzaXMgdXNpbmcgYmluZywgcmVtb3ZlIGFueSBlcnJvcnMgbGlrZSB0cnVtcCA9IHBvc2l0aXZlLCBhbmQgY3JlYXRlIGEgZ3JhcGggb2YgdGhlIHdvcmRzIHRoYXQgY29udHJpYnV0ZSBtb3N0IHRvIGVhY2ggc2VudGltZW50LiAgCjMuIERvIHRoZSBzYW1lIGFzIGFib3ZlIGJ1dCB3aXRoIHRoZSBucmMgc2VudGltZW50IGxleGljb24uICAKNC4gVW5uZXN0IHRoZSB0d2VldHMgYXMgYmlncmFtcywgcmVtb3ZlIHN0b3Agd29yZHMgYW5kIGVycm9ycywgYW5kIGNyZWF0ZSBhIHRhYmxlIGFuZCB3b3JkIGNsb3VkIG9mIHRoZSBtb3N0IGNvbW1vbiBiaWdyYW1zLiAgCjUuIFVzaW5nIHRoZSBiaWdyYW1zLCBsb29rIGZvciB0aGUgbW9zdCBjb21tb24gd29yZHMgdGhhdCBmb2xsb3cgdHdvIGRpZmZlcmVudCB3b3Jkcy4gWW91IG1heSBjaG9vc2UgdHJ1bXAgYW5kIHBlbG9zaSwgb3IgY2hvb3NlIHlvdXIgb3duLiAgCgpgYGB7cn0KdXNhdG9kYXlfdHdlZXRzIDwtIGdldF90aW1lbGluZSgidXNhdG9kYXkiLCBuID01MDAwKQpgYGAKCmBgYHtyfQp1c2F0b2RheV93b3JkcyA8LSB1c2F0b2RheV90d2VldHMgJT4lIAogIHVubmVzdF90b2tlbnMod29yZCwgdGV4dCkgJT4lIAogIHNlbGVjdChzY3JlZW5fbmFtZSwgd29yZCkKYGBgCgpgYGB7cn0KdXNhdG9kYXlfd29yZHMgJT4lIAogIGFudGlfam9pbihnZXRfc3RvcHdvcmRzKCkpICU+JSAKICBmaWx0ZXIoIXdvcmQgPT0gImh0dHBzIikgJT4lIAogIGZpbHRlcighd29yZCA9PSAidC5jbyIpICU+JSAKICBjb3VudCh3b3JkLCBzb3J0ID0gVCkKYGBgCkdvZCwgdGhpcyBpcyBnb2luZyB0byBiZSBzYWQuIE9rYXkuIFRvcCB3b3JkcyBhcmUgY29yb25hLXJlbGF0ZWQuIE5vIHN1cnByaXNlLiAKCmBgYHtyfQp1c2F0b2RheV93b3JkcyAlPiUgCiAgYW50aV9qb2luKGdldF9zdG9wd29yZHMoKSkgJT4lIAogIGZpbHRlcighd29yZCA9PSAiaHR0cHMiKSAlPiUgCiAgZmlsdGVyKCF3b3JkID09ICJ0LmNvIikgJT4lIAogIGNvdW50KHdvcmQsIHNvcnQgPSBUKSU+JSAKICB0b3BfbigyMDApICU+JSAKICB3b3JkY2xvdWQyKHNpemUgPSAuNikKYGBgCgpDYXJkaSBCIFZvaWNlOiBDQUgtUk9ITkEgVkktUlVTUyBzaGl0IGlzIHJlYWwgc2hpdCBpcyBnZXR0aW5nIHJlYWwuIAoKYGBge3J9CmJpbmdfd29yZHMgPC0gZ2V0X3NlbnRpbWVudHMoImJpbmciKQoKYGBgCgpgYGB7cn0KdXNhdG9kYXlfd29yZHMgJT4lIAogIGlubmVyX2pvaW4oYmluZ193b3JkcykgJT4lIAogIGZpbHRlcighd29yZCA9PSAidHJ1bXAiKSAlPiUgCiAgY291bnQod29yZCwgc2VudGltZW50LCBzb3J0ID0gVFJVRSkgJT4lIAogIGdyb3VwX2J5KHNlbnRpbWVudCkgJT4lIAogIHRvcF9uKDEwKSAlPiUgCiAgdW5ncm91cCgpICU+JSAKICBtdXRhdGUod29yZCA9IHJlb3JkZXIod29yZCwgbikpICU+JSAKICBnZ3Bsb3QoYWVzKHdvcmQsIG4sIGZpbGwgPSBzZW50aW1lbnQpKSArCiAgZ2VvbV9jb2woc2hvdy5sZWdlbmQgPSBGQUxTRSkgKyAKICBmYWNldF93cmFwKHZhcnMoc2VudGltZW50KSwgc2NhbGVzID0gImZyZWUiKSArCiAgbGFicyh5ID0gIkJpbmcgU2VudGltZW50IENvbnRyaWJ1dGlvbiBieSBXb3JkIChVU0EgVG9kYXkgVHdlZXRzKSIsCiAgICAgICB4ID0gTlVMTCkgKwogIGNvb3JkX2ZsaXAoKSArCiAgdGhlbWVfbWluaW1hbCgpCmBgYApJIHdhcyByaWdodDogdGhpcyBpcyBkZXByZXNzaW5nLiBBbGwgdGhlIHBvc2l0aXZlIHdvcmRzIGFyZSBsaWtlIGFuIG92ZXJseSBjaGVlcmZ1bCBwZXJzb24gdGVsbGluZyB5b3UgdG8ganVzdCAic3RvcCBiZWluZyBzYWQsIiBhbmQgdGhlIG5lZ2F0aXZlIHdvcmRzIGFyZSBhbGwgYWJvdXQgZHlpbmcuIEdyZWF0LiAKCmBgYHtyfQpucmNfd29yZHMgPC0gZ2V0X3NlbnRpbWVudHMoIm5yYyIpCgpgYGAKYGBge3J9CnVzYXRvZGF5X3dvcmRzICU+JSAKICBpbm5lcl9qb2luKG5yY193b3JkcykgJT4lIAogIGZpbHRlcighd29yZCA9PSAidHJ1bXAiKSAlPiUgCiAgY291bnQod29yZCwgc2VudGltZW50LCBzb3J0ID0gVFJVRSkgJT4lIAogIGdyb3VwX2J5KHNlbnRpbWVudCkgJT4lIAogIHRvcF9uKDUpICU+JSAKICB1bmdyb3VwKCkgJT4lIAogIG11dGF0ZSh3b3JkID0gcmVvcmRlcih3b3JkLCBuKSkgJT4lIAogIGdncGxvdChhZXMod29yZCwgbiwgZmlsbCA9IHNlbnRpbWVudCkpICsKICBnZW9tX2NvbChzaG93LmxlZ2VuZCA9IEZBTFNFKSArIAogIGZhY2V0X3dyYXAodmFycyhzZW50aW1lbnQpLCBzY2FsZXMgPSAiZnJlZSIpICsKICBsYWJzKHkgPSAiTlJDIFNlbnRpbWVudCBDb250cmlidXRpb24gYnkgV29yZCAoVVNBIFRvZGF5IFR3ZWV0cykiLAogICAgICAgeCA9IE5VTEwpICsKICBjb29yZF9mbGlwKCkgKwogIHRoZW1lX21pbmltYWwoKQpgYGAKCldlJ3JlIGFsbCBzbyBzY2FyZWQuICJQYW5kZW1pYyIgc2hvd3MgdXAgYXQgdGhlIHRvcCBmb3IgbmVnYXRpdmUsIHNhZG5lc3MsIGFuZCBmZWFyLiAKCmBgYHtyfQpyZW1vdmVfd29yZHMgPSBjKCJodHRwcyIsICJ0LmNvIikKCnVzYXRvZGF5X3R3ZWV0cyAlPiUgCiAgc2VsZWN0KHRleHQpICU+JSAKICB1bm5lc3RfdG9rZW5zKHdvcmRzLCB0ZXh0LCB0b2tlbiA9ICJuZ3JhbXMiLCBuID0gMikgJT4lIAogIHNlcGFyYXRlKHdvcmRzLCBjKCJ3b3JkMSIsICJ3b3JkMiIpLCBzZXAgPSAiICIpICU+JSAKICBmaWx0ZXIoIXdvcmQxICVpbiUgc3RvcF93b3JkcyR3b3JkKSAlPiUgCiAgZmlsdGVyKCF3b3JkMiAlaW4lIHN0b3Bfd29yZHMkd29yZCkgJT4lIAogIGZpbHRlcighd29yZDEgJWluJSByZW1vdmVfd29yZHMpICU+JSAKICBmaWx0ZXIoIXdvcmQyICVpbiUgcmVtb3ZlX3dvcmRzKSAlPiUgCiAgdW5pdGUod29yZHMsIHdvcmQxLCB3b3JkMiwgc2VwID0gIiAiKSAtPiB1c2F0b2RheV9iaWdyYW1zCmBgYAoKYGBge3J9CnVzYXRvZGF5X2JpZ3JhbXMgJT4lIAogIGNvdW50KHdvcmRzLCBzb3J0ID0gVCkKYGBgClRoaXMgd291bGQgYmUgdGhlIGlkZWFsIHByb2plY3QgdG8gc2VlIHRoZSBzcHJlYWQgb2YgdHdlZXRzL2ludGVyYWN0aW9uIHJlZ2FyZGluZyB0aGUgcHJlc2lkZW50aWFsIGNhbmRpZGF0ZXMgcmlnaHQgYmVmb3JlIGVsZWN0aW9uLCBidXQgbm93IHRoZSB3b3JsZCBpcyBlbmRpbmcgYW5kIGFsbCB3ZSB3YW50IHRvIHRhbGsgYWJvdXQgaXMgQ09WSUQuIAoKYGBge3J9CnVzYXRvZGF5X2JpZ3JhbXMgJT4lIAogIGNvdW50KHdvcmRzLCBzb3J0ID0gVCkgJT4lIAogIHRvcF9uKDEwMCkgJT4lIAogIHdvcmRjbG91ZDIoc2l6ZSA9IC41KQpgYGAKU2VlPyBFdmVuIHdpdGggdGhlIGxpbWl0ZWQgY292ZXJhZ2Ugc3Vycm91bmRpbmcgdGhlIDIwMjAgcHJlc2lkZW50aWFsIGJpZCwgaXQncyBzdGlsbCBjb29sIHRvIHNlZSBob3cgaXQgY291bGQgaGF2ZSBiZWVuLiAKCmBgYHtyfQpmaXJzdF93b3JkcyA8LSBjKCJiaWRlbiIsICJzYW5kZXJzIikKCnVzYXRvZGF5X2JpZ3JhbXMgJT4lIAogIGNvdW50KHdvcmRzLCBzb3J0ID0gVFJVRSkgJT4lIAogIHNlcGFyYXRlKHdvcmRzLCBjKCJ3b3JkMSIsICJ3b3JkMiIpLCBzZXAgPSAiICIpICU+JSAKICBmaWx0ZXIod29yZDEgJWluJSBmaXJzdF93b3JkcykgJT4lIAogIGNvdW50KHdvcmQxLCB3b3JkMiwgd3QgPSBuLCBzb3J0ID0gVFJVRSkgJT4lIAogIHJlbmFtZSh0b3RhbCA9IG5uKQpgYGAKCkxldCdzIGRvIHRoaXMgb25lIHRpbWUgd2l0aCBETkMgY2FuZGlkYXRlcy4gRmFpcmx5IHBvc2l0aXZlLiAKCmBgYHtyfQpmaXJzdF93b3JkcyA8LSBjKCJjb3JvbmEiLCAiY29yb25hdmlydXMiKQoKdXNhdG9kYXlfYmlncmFtcyAlPiUgCiAgY291bnQod29yZHMsIHNvcnQgPSBUUlVFKSAlPiUgCiAgc2VwYXJhdGUod29yZHMsIGMoIndvcmQxIiwgIndvcmQyIiksIHNlcCA9ICIgIikgJT4lIAogIGZpbHRlcih3b3JkMSAlaW4lIGZpcnN0X3dvcmRzKSAlPiUgCiAgY291bnQod29yZDEsIHdvcmQyLCB3dCA9IG4sIHNvcnQgPSBUUlVFKSAlPiUgCiAgcmVuYW1lKHRvdGFsID0gbm4pCmBgYAoKT2gsIGxvb2suIFdlJ3JlIHNhZCBhZ2Fpbi4gCg==