Kaitlin Kavlie PSYC-541

Lab #9: Text Analysis of Tweets

  1. I extracted 5,000 tweets from CNN’s Twitter timeline, split the tweets into individual words, removed stop words and web-related terms, and created a table and a word cloud of the top words.

I extracted the tweets using the code below.

library(rtweet)  # provides get_timeline()
info_tweets <- get_timeline("cnn", n = 5000)

Then I split the tweets into individual words (one word per row) with the code below.

info_words <- info_tweets %>% 
  unnest_tokens(word, text) %>% 
  select(screen_name, word)
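
As a small illustration of what unnest_tokens() does, here is a sketch on two made-up tweets. The toy_tweets data and the "demo" screen name are hypothetical stand-ins, not part of the CNN data.

```r
library(dplyr)
library(tidytext)

# Two made-up tweets standing in for the real timeline (hypothetical data)
toy_tweets <- tibble(
  screen_name = c("demo", "demo"),
  text = c("Breaking: Storm hits the coast", "Storm warnings lifted")
)

# unnest_tokens() lowercases the text, strips punctuation,
# and returns one row per word
toy_words <- toy_tweets %>%
  unnest_tokens(word, text) %>%
  select(screen_name, word)

toy_words
```

Each tweet becomes several rows, one per word, which is the shape that the counting and joining steps in the rest of the lab expect.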

Using the code chunk below, I removed stop words and web-related terms and created a table of the top words.

info_words %>% 
  anti_join(stop_words, by = "word") %>%          # drop common stop words
  count(word, sort = TRUE) %>%
  filter(!word %in% c("https", "t.co"))           # drop URL fragments

With this last code chunk I created a word cloud of the top words.

info_words %>% 
  anti_join(stop_words, by = "word") %>%          # drop common stop words
  count(word, sort = TRUE) %>%
  filter(!word %in% c("https", "t.co")) %>%       # drop URL fragments
  slice_max(n, n = 100) %>%                       # keep the 100 most frequent words
  wordcloud2(size = .5)

  2. I conducted a sentiment analysis using the bing lexicon, removed several words that were misclassified as sentiment words, and created a graph of the words that contribute the most to each sentiment.

I loaded the bing sentiment lexicon using the first code below.

bing <- get_sentiments("bing")
bing

Then I joined the words to the lexicon and removed several words that were misclassified as sentiment words with the following code.

info_words %>% 
  inner_join(bing, by = "word") %>%               # keep only words found in the lexicon
  count(word, sentiment, sort = TRUE) %>%
  filter(!word %in% c("trump", "like", "top"))    # drop misclassified words
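
To show why words such as "trump" need to be filtered out, here is a minimal sketch on a handful of made-up words. The toy_words vector is hypothetical. The bing lexicon lists "trump" as a positive word, so the surname is counted as positive sentiment unless it is removed.

```r
library(dplyr)
library(tidytext)

bing <- get_sentiments("bing")

# Hypothetical words that might appear in news tweets
toy_words <- tibble(word = c("trump", "crisis", "win", "trump", "election"))

# inner_join() keeps only the words that appear in the lexicon,
# so "election" (not a sentiment word) is dropped
toy_sentiment <- toy_words %>%
  inner_join(bing, by = "word") %>%
  count(word, sentiment, sort = TRUE)

toy_sentiment
```

In this toy data "trump" comes out as the most frequent positive word, which is the kind of artifact the filter step is meant to remove.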

Using this last code I created a graph of the words that contribute the most to each sentiment.

info_words %>% 
  inner_join(bing, by = "word") %>% 
  count(word, sentiment, sort = TRUE) %>%
  filter(!word %in% c("trump", "like", "top")) %>%   # drop misclassified words
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%                           # top 10 words per sentiment
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(vars(sentiment), scales = "free") +
  labs(y = "News headlines: Words that contribute the most to each sentiment",
       x = NULL) +
  coord_flip() +
  theme_minimal()

  3. I unnested the tweets as bigrams, removed stop words and web-related terms, and created a table and word cloud of the most common bigrams.

This first code chunk was used to unnest the tweets as bigrams.

info_tweets %>%
  select(text) %>%
  unnest_tokens(words, text, token = "ngrams", n = 2) %>%
  count(words, sort = TRUE)
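
As a quick illustration of bigram tokenization, here is a sketch on one made-up sentence. The toy_text data is hypothetical.

```r
library(dplyr)
library(tidytext)

# One made-up tweet (hypothetical data)
toy_text <- tibble(text = "Russia invades Ukraine today")

# token = "ngrams" with n = 2 slides a two-word window across the text,
# so a four-word sentence yields three overlapping bigrams
toy_bigrams <- toy_text %>%
  unnest_tokens(words, text, token = "ngrams", n = 2)

toy_bigrams
```

Because the windows overlap, every word except the first and last appears in two bigrams, once as the first word and once as the second.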

This next code filtered out stop words.

info_tweets %>%
  select(text) %>%
  unnest_tokens(words, text, token = "ngrams", n = 2) %>% 
  separate(words, c("word1", "word2"), sep = " ") %>%   # split each bigram into two columns
  filter(!word1 %in% stop_words$word) %>%               # drop bigrams whose first word is a stop word
  filter(!word2 %in% stop_words$word) %>%               # drop bigrams whose second word is a stop word
  unite(words, word1, word2, sep = " ")                 # rejoin the two words

Then this code was used to also filter out web terms and to save the result as info_bigrams.

remove_words <- c("https", "t.co")

info_bigrams <- info_tweets %>%
  select(text) %>%
  unnest_tokens(words, text, token = "ngrams", n = 2) %>% 
  separate(words, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  filter(!word1 %in% remove_words) %>%
  filter(!word2 %in% remove_words) %>%
  unite(words, word1, word2, sep = " ")

This code created a table of the most common bigrams.

info_bigrams %>%
  count(words, sort = T)

Then this code was used to create a word cloud of the most common bigrams.

info_bigrams %>%
  count(words, sort = TRUE) %>%
  slice_max(n, n = 100) %>%        # keep the 100 most frequent bigrams
  wordcloud2(size = .5)

  4. In question 3 above, I created bigrams of the tweets, removed the stop words, and created a table and word cloud of the most common bigrams. I believe question 4 repeats question 3.

  5. I used the bigram method and found the most common words that come after ‘ukraine’ and ‘russia’.

firstinfo_word <- c("ukraine", "russia")                                  

info_bigrams %>%             
  count(words, sort = TRUE) %>%
  separate(words, c("word1", "word2"), sep = " ") %>%     
  filter(word1 %in% firstinfo_word) %>%                          
  count(word1, word2, wt = n, sort = TRUE)
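
The same idea can be sketched on a few made-up bigrams. The toy_bigrams values are hypothetical and are not results from the CNN tweets.

```r
library(dplyr)
library(tidyr)

# A handful of made-up bigrams (hypothetical data)
toy_bigrams <- tibble(
  words = c("ukraine war", "ukraine war", "russia sanctions",
            "ukraine crisis", "white house")
)

# Split each bigram, keep those starting with a target word,
# and count which second words follow it
toy_following <- toy_bigrams %>%
  separate(words, c("word1", "word2"), sep = " ") %>%
  filter(word1 %in% c("ukraine", "russia")) %>%
  count(word1, word2, sort = TRUE)

toy_following
```

Filtering on word1 is what turns a general bigram table into a "what follows this word" table. Filtering on word2 instead would give the words that come before the target.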

After finding the most common words that come after ‘ukraine’ and ‘russia’, I created a bar graph displaying the results for each word.

firstinfo_word <- c("ukraine", "russia")

info_bigrams %>%
  count(words, sort = TRUE) %>%
  separate(words, c("word1", "word2"), sep = " ") %>%
  filter(word1 %in% firstinfo_word) %>%
  count(word1, word2, wt = n, sort = TRUE) %>%
  mutate(word2 = factor(word2, levels = rev(unique(word2)))) %>%   # order bars by frequency
  group_by(word1) %>% 
  slice_max(n, n = 5) %>%                                          # top 5 following words per target
  ungroup() %>%
  ggplot(aes(word2, n, fill = word1)) +
  scale_fill_viridis_d() +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = NULL, title = "Word following:") +
  facet_wrap(~word1, scales = "free") +
  coord_flip()
