Revised Mining for Tweets Tutorial

Steve Dutky by way of Rachel Saidi

August 14, 2019

Search for Tweets Related to a Query

Use the tutorial on the following website:

https://www.earthdatascience.org/courses/earth-analytics/get-data-using-apis/text-mining-twitter-data-intro-r/

In this example, let’s find tweets that contain the phrase “whistle blower”.

First, you load rtweet and the other needed R packages. Note that two new packages, igraph and ggraph, are introduced later in this lesson.

# load twitter library - the rtweet library is recommended now over twitteR
library(rtweet)
# plotting and pipes - tidyverse!
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# text mining library
library(tidytext)
# plotting packages
library(igraph)
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
library(ggraph)

The first time you run search_tweets() below, R will take you to your Twitter page in a browser, where you have to agree to let this app work with R.
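If you would rather authenticate explicitly than through the browser pop-up, rtweet can also build a token from your own Twitter app credentials with create_token(). A minimal sketch, where the app name and every key value are placeholders:

# optional: authenticate with your own app's keys instead of the pop-up
# (all values below are placeholders -- substitute your own credentials)
# token <- create_token(
#   app = "my_twitter_app",
#   consumer_key = "XXXX",
#   consumer_secret = "XXXX",
#   access_token = "XXXX",
#   access_secret = "XXXX")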

Set up query parameters

# set up query parameters
# BE SURE TO CHECK THE VALUE OF loadFromTwitter: TRUE fetches fresh tweets
# from the API and caches them; FALSE reloads the cached .rda file instead
query <- "whistle blower"
queryTitle <- "whistleBlower"
queryFile <- "whistleBlower.rda"
loadFromTwitter <- list()
loadFromTwitter[[query]] <- TRUE
if (loadFromTwitter[[query]]) {
  query_tweets <- search_tweets(q = query, n = 10000,
                                lang = "en",
                                include_rts = FALSE)
  # cache the result so later runs can skip the API call
  save(query_tweets, file = queryFile)
  loadFromTwitter[[query]] <- FALSE
} else {
  load(file = queryFile)
}
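# note: the standard search API returns at most roughly 18,000 tweets per
# 15-minute window; for larger pulls, rtweet can pause at the limit and
# retry automatically (the 50,000 figure below is just an example):
# big_pull <- search_tweets(q = query, n = 50000, lang = "en",
#                           include_rts = FALSE, retryonratelimit = TRUE)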
# check data to see if there are emojis
head(query_tweets$text)
## [1] "Says the \"pay for play access fraud\" that defines Hillary Clinton. The House jumped the gun. The whistle blower was  a paid for spy and all he delivered was easily refutable gossip. Hillary coming out so strong? I wonder if she were a part of this scam, too! It would make sense. https://t.co/qYxqzCzyz8"
## [2] "I believe Schiff not only knew about the whistle blower, I wonder if he or one of his buddies is the whistle blower. We already know that Schiff is corrupt. What he said before the hearing should have him expelled...not censured. We do not need that kind of behavior in OUR House. https://t.co/hsN2inerOl" 
## [3] "That is why they changed the whistle blower rules and are trying to protect the he/she/it (they may only be a real person on paper). It is so obvious what these desperadoes have done. Now, I want to ask, can I sue the Democrat party for destroying my peace of mind? https://t.co/QnxWzPAvEf"                
## [4] "Makes you wonder how much Brennan, the CIA fraud, was involved with the phony whistle blower report. Trying to impeach a President for doing what he is legally allowed to do is going to make anyone who votes for impeachment guilty of treason for subverting the Constitution. https://t.co/DnvN7w2DFA"       
## [5] "@GOPChairwoman @realDonaldTrump Me Romney, the transcript speaks for itself Trump and Giuliani self incriminated on live TV. Nothing in the whistle blower complaint has been disproven \n\nErin Brockavitch was a whistleblower who didn’t witness the poisoning of H2O first hand. That’s not a requirement."   
## [6] "@kylegriffin1 I fear for this whistle blower."
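If the text does turn out to contain emojis or other non-ASCII symbols, an optional way to drop them before the analysis is base R’s iconv(), which substitutes anything it cannot convert:

# optional: strip emojis and other non-ASCII characters
# sub = "" replaces every character that cannot be converted
query_tweets$text <- iconv(query_tweets$text, from = "UTF-8", to = "ASCII", sub = "")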

Data Clean-Up

Looking at the data above, it becomes clear that there is a lot of clean-up associated with social media data.

First, there are URLs in your tweets. If you want to do a text analysis to figure out which words are most common in your tweets, the URLs won’t be helpful. Let’s remove those.

# remove urls
# a tidyverse version works if mutate() is given the regex directly:
# query_tweets <- query_tweets %>%
#   mutate(stripped_text = gsub("http\\S+\\s*", "", text))

# or remove http elements manually
# the regex ends each url at the next whitespace character, and
# "http\\S+" also matches https urls, so one gsub() call is enough
query_tweets$stripped_text <- gsub('http\\S+\\s*', "", query_tweets$text)
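You can verify the result with a quick head() call:

# check that the urls are gone
head(query_tweets$stripped_text)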

Finally, you can clean up your text. If you are trying to create a list of unique words in your tweets, capitalized words will be counted separately from their lowercase forms. You also don’t want punctuation to be returned as unique words.

You can use the unnest_tokens() function in the tidytext package to magically clean up your text! When you use this function, the following things are cleaned up in the text:

  1. Text is converted to lowercase: each word found in the text is converted to lowercase, so you don’t get duplicate words due to variation in capitalization.
  2. Punctuation is removed: all instances of periods, commas, etc. are removed from your list of words.
  3. A unique id associated with the tweet is added for each occurrence of the word.

The unnest_tokens() function takes two arguments:

  1. The name of the column where the unique word will be stored and
  2. The column name from the data.frame that you are using that you want to pull unique words from.

In your case, you want to use the stripped_text column which is where you have your cleaned up tweet text stored.

# remove punctuation, convert to lowercase, add id for each tweet!
query_tweets_clean <- query_tweets %>%
  dplyr::select(stripped_text) %>%
  unnest_tokens(word, stripped_text)
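It is worth peeking at the result. The data.frame now holds one row per word, lowercased and stripped of punctuation:

# one row per word, all lowercase, punctuation removed
head(query_tweets_clean)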

Now you can plot your data. What do you notice?

# plot the top 15 words -- notice any issues?
query_tweets_clean %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in tweets")
## Selecting by n

Your plot of unique words contains some words that are not useful, for instance “a” and “to”. In the world of text mining these are called ‘stop words’. You want to remove them from your analysis, as they are fillers used to compose a sentence.

Lucky for us, the tidytext package has a function that will help us remove stop words! To use it, you:

Load the stop_words data included with tidytext. This data is simply a list of words that you may want to remove in a natural language analysis. Then you use anti_join() to remove all stop words from your analysis. Let’s give this a try next!

# load list of stop words - from the tidytext package
data("stop_words")
# view first 6 words
head(stop_words)
## # A tibble: 6 x 2
##   word      lexicon
##   <chr>     <chr>  
## 1 a         SMART  
## 2 a's       SMART  
## 3 able      SMART  
## 4 about     SMART  
## 5 above     SMART  
## 6 according SMART

# count the words -- your total will differ, since search results change over time
nrow(query_tweets_clean)
## [1] 279711

# remove stop words from your list of words
cleaned_tweet_words <- query_tweets_clean %>%
  anti_join(stop_words)
## Joining, by = "word"
# there should be fewer words now (again, your exact count will differ)
nrow(cleaned_tweet_words)
## [1] 133706

Explore Networks of Words

You might also want to explore words that occur together in tweets. Let’s do that next.

Passing token = "ngrams" with n = 2 tells unnest_tokens() to break the text into pairs of consecutive words (bigrams).

# library(devtools)
# install_github("dgrtwo/widyr")
library(widyr)

# tokenize into bigrams: lowercase pairs of consecutive words
query_tweets_paired_words <- query_tweets %>%
  dplyr::select(stripped_text) %>%
  unnest_tokens(paired_words, stripped_text, token = "ngrams", n = 2)

query_tweets_paired_words %>%
  count(paired_words, sort = TRUE)
## # A tibble: 112,817 x 2
##    paired_words         n
##    <chr>            <int>
##  1 whistle blower    9338
##  2 the whistle       5115
##  3 a whistle         1156
##  4 of the            1088
##  5 blower complaint   867
##  6 blower is          715
##  7 in the             686
##  8 is a               610
##  9 to the             599
## 10 is the             531
## # … with 112,807 more rows
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:igraph':
## 
##     crossing
query_tweets_separated_words <- query_tweets_paired_words %>%
  separate(paired_words, c("word1", "word2"), sep = " ")

query_tweets_filtered <- query_tweets_separated_words %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

# new bigram counts:
query_words_counts <- query_tweets_filtered %>%
  count(word1, word2, sort = TRUE)

head(query_words_counts)
## # A tibble: 6 x 3
##   word1   word2         n
##   <chr>   <chr>     <int>
## 1 whistle blower     9338
## 2 blower  complaint   867
## 3 blower  rules       379
## 4 blower  report      353
## 5 hand    knowledge   341
## 6 white   house       254

Finally, plot the data:

library(igraph)
library(ggraph)

# plot the query word network
# (with some versions of ggraph, plotting the edges fails; if it does,
#  comment out the geom_edge_link() line below)
query_words_counts %>%
        filter(n >= 24) %>%
        graph_from_data_frame() %>%
        ggraph(layout = "fr") +
        geom_edge_link(aes(edge_alpha = n, edge_width = n)) +
        geom_node_point(color = "darkslategray4", size = 3) +
        geom_node_text(aes(label = name), vjust = 1.8, size = 3) +
        labs(title = paste("Word Network: Tweets containing",query),
             subtitle = "Text mining twitter data ",
             x = "", y = "")