Revised Mining for Tweets Tutorial

Steve Dutky by way of Rachel Saidi

August 14, 2019

Search for Tweets Related to query

Use the tutorial on the following website:

https://www.earthdatascience.org/courses/earth-analytics/get-data-using-apis/text-mining-twitter-data-intro-r/

In this example, let’s find tweets that use the phrase “witch hunt” in them.

First, load rtweet and the other R packages you need. Note that two new packages, igraph and ggraph, are introduced later in this lesson.

# load twitter library - the rtweet library is recommended now over twitteR
library(rtweet)
# plotting and pipes - tidyverse!
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# text mining library
library(tidytext)
# plotting packages
library(igraph)
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
library(ggraph)

The first time you run search_tweets(), rtweet will open your Twitter page in a browser and ask you to agree that you are allowing this app to work with R.
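If you would rather not depend on the browser prompt, rtweet can also build a token directly from a Twitter developer app. Here is a minimal sketch, assuming you have created such an app; every credential value below is a placeholder:

# optional: create a token explicitly instead of relying on the browser prompt
# (assumes a Twitter developer app; all credential values are placeholders)
token <- create_token(
  app             = "my_twitter_app",
  consumer_key    = "YOUR_CONSUMER_KEY",
  consumer_secret = "YOUR_CONSUMER_SECRET",
  access_token    = "YOUR_ACCESS_TOKEN",
  access_secret   = "YOUR_ACCESS_SECRET")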

Set up query parameters

# set up query parameters -- BE SURE TO CHECK THE VALUE OF loadFromTwitter:
# TRUE pulls fresh tweets from Twitter and saves them to queryFile;
# FALSE loads the previously saved tweets from queryFile
query      <- "witch hunt"
queryTitle <- "witchHunt"
queryFile  <- "witchHunt.rda"
loadFromTwitter <- list()
loadFromTwitter[[query]] <- FALSE
if (loadFromTwitter[[query]]) {
  query_tweets <- search_tweets(q = query, n = 10000,
                                lang = "en",
                                include_rts = FALSE)
  save(query_tweets, file = queryFile)
  loadFromTwitter[[query]] <- FALSE
} else {
  load(file = queryFile)
}
# check data to see if there are emojis
head(query_tweets$text)
## [1] "@realDonaldTrump @POTUS @PeteHegseth @foxandfriends It ain't no witch hunt it's \"TRUMPERY\"...LMFAO"                                                                                                                                                
## [2] "@RudyGiuliani Do you recall the witch-hunt against Hillary and the money spent  only to find nothing."                                                                                                                                               
## [3] "@RNCResearch @realDonaldTrump President Trump asked Quid Pro Joe to release his transcript, his campaign and the Fake News media IGNORED IT. Ridiculous! We can’t let these LIARS get away with ANOTHER reckless WITCH HUNT!"                        
## [4] "Dr Kafeel Khan gets 'clean chit' &amp; demands 'honour &amp; self-respect'\nUP Govt. says, 'probe still on', while Opposition calls it 'witch hunt'.\n\nWatch Athar Khan on @thenewshour SPECIAL EDITION tonight at 9:30 PM. https://t.co/9ANe3JvhUe"
## [5] "Dr Kafeel Khan gets 'clean chit’ &amp; demands 'honour &amp; self-respect'.\nUP Govt says 'probe still on', while opposition calls it 'witch hunt'. \n\nTIMES NOW’s Mohit and Wajji with details. Listen in. https://t.co/OsVeNtnb9y"                
## [6] "I do not understand how this witch hunt can just continue to go on from one thing to another! https://t.co/qjPDltgwju"

Data Clean-Up

Looking at the data above, it becomes clear that there is a lot of clean-up associated with social media data.

First, there are URLs in your tweets. If you want to do a text analysis to figure out which words are most common in your tweets, the URLs won’t be helpful. Let’s remove them.

# remove urls from the tweet text
# a working tidyverse equivalent of the base-R call below would be:
# query_tweets <- query_tweets %>%
#   mutate(stripped_text = gsub("http\\S+\\s*", "", text))

# remove http elements manually
# the regex ends each url at the next whitespace character;
# "http\\S+" also matches https urls, so one substitution covers both
query_tweets$stripped_text <- gsub("http\\S+\\s*", "", query_tweets$text)
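To confirm the links are gone, compare a few of the cleaned tweets with the originals shown earlier (your output will differ with your search):

# inspect the cleaned text; the t.co links should be gone
head(query_tweets$stripped_text)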

Finally, you can clean up your text. If you are trying to create a list of unique words in your tweets, words with capitalization will be different from words that are all lowercase. Also you don’t need punctuation to be returned as a unique word.

You can use the unnest_tokens() function from the tidytext package to magically clean up your text! When you use this function, the following things are cleaned up in the text:

  1. Text is converted to lowercase: each word found in the text is converted to lowercase, so you don’t get duplicate words due to variation in capitalization.
  2. Punctuation is removed: all instances of periods, commas, etc. are removed from your list of words.
  3. A unique id associated with the tweet is added for each occurrence of a word.

The unnest_tokens() function takes two arguments:

  1. The name of the column where the unique word will be stored and
  2. The name of the column in the data.frame you are using from which you want to pull the unique words.

In your case, you want to use the stripped_text column which is where you have your cleaned up tweet text stored.

# remove punctuation, convert to lowercase, add id for each tweet!
query_tweets_clean <- query_tweets %>%
  dplyr::select(stripped_text) %>%
  unnest_tokens(word, stripped_text)
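Before plotting, it helps to peek at the result; each row is now a single word from one tweet (your output will vary with your search):

# one row per word: unnest_tokens() lowercased the text and dropped punctuation
head(query_tweets_clean)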

Now you can plot your data. What do you notice?

# plot the top 15 words -- notice any issues?
query_tweets_clean %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in tweets")
## Selecting by n

Your plot of unique words contains some words that may not be useful, for instance “a” and “to”. In the world of text mining these are called ‘stop words’. You want to remove them from your analysis because they are fillers used to compose a sentence.

Lucky for us, the tidytext package ships with a stop_words data set that will help us clean up stop words! To use this you:

Load the stop_words data included with tidytext. This data is simply a list of words that you may want to remove in a natural language analysis. Then you use anti_join to remove all stop words from your analysis. Let’s give this a try next!

# load list of stop words - from the tidytext package
data("stop_words")
# view first 6 words
head(stop_words)
## # A tibble: 6 x 2
##   word      lexicon
##   <chr>     <chr>  
## 1 a         SMART  
## 2 a's       SMART  
## 3 able      SMART  
## 4 about     SMART  
## 5 above     SMART  
## 6 according SMART

# how many words are left? your count will differ depending on when you search
nrow(query_tweets_clean)
## [1] 280325

# remove stop words from your list of words
cleaned_tweet_words <- query_tweets_clean %>%
  anti_join(stop_words)
## Joining, by = "word"
# there should be fewer words now; again, your count will differ
nrow(cleaned_tweet_words)
## [1] 131614
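With the stop words removed, you can re-plot the top 15 words. This mirrors the earlier plot, just using cleaned_tweet_words instead:

# plot the top 15 words after removing stop words
cleaned_tweet_words %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in tweets",
       subtitle = "Stop words removed from the list")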

Explore Networks of Words

You might also want to explore words that occur together in tweets. Let’s do that next.

In the call below, token = "ngrams" tells unnest_tokens() to return groups of consecutive words, and n = 2 sets the group size to two, i.e. word pairs (bigrams).

# library(devtools)
# install_github("dgrtwo/widyr")
library(widyr)

# tokenize the cleaned text into bigrams (pairs of consecutive words)
query_tweets_paired_words <- query_tweets %>%
  dplyr::select(stripped_text) %>%
  unnest_tokens(paired_words, stripped_text, token = "ngrams", n = 2)

query_tweets_paired_words %>%
  count(paired_words, sort = TRUE)
## # A tibble: 121,649 x 2
##    paired_words      n
##    <chr>         <int>
##  1 witch hunt     9922
##  2 a witch        2478
##  3 the witch      1204
##  4 this is         891
##  5 of the          819
##  6 is a            770
##  7 in the          601
##  8 this witch      561
##  9 another witch   537
## 10 hunt and        492
## # … with 121,639 more rows
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:igraph':
## 
##     crossing
query_tweets_separated_words <- query_tweets_paired_words %>%
  separate(paired_words, c("word1", "word2"), sep = " ")

query_tweets_filtered <- query_tweets_separated_words %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

# new bigram counts:
query_words_counts <- query_tweets_filtered %>%
  count(word1, word2, sort = TRUE)

head(query_words_counts)
## # A tibble: 6 x 3
##   word1       word2             n
##   <chr>       <chr>         <int>
## 1 witch       hunt           9922
## 2 fake        news            393
## 3 impeachment inquiry         348
## 4 hunt        impeachment     289
## 5 slams       democrats       280
## 6 democrats   whistleblower   270

Finally, plot the data as a word network.

# igraph and ggraph were attached earlier; re-loading them here is harmless
library(igraph)
library(ggraph)

# plot query word network
# (plotting graph edges is currently broken)
query_words_counts %>%
        filter(n >= 24) %>%
        graph_from_data_frame() %>%
        ggraph(layout = "fr") +
        geom_edge_link(aes(edge_alpha = n, edge_width = n)) +
        geom_node_point(color = "darkslategray4", size = 3) +
        geom_node_text(aes(label = name), vjust = 1.8, size = 3) +
        labs(title = paste("Word Network: Tweets containing",query),
             subtitle = "Text mining twitter data ",
             x = "", y = "")