In this example, let’s find tweets that use the phrase “whistle blower”.
First, load rtweet and the other R packages you need. Note that this lesson introduces two new packages later on: igraph and ggraph.
# load twitter library - the rtweet library is recommended now over twitteR
library(rtweet)
# plotting and pipes - tidyverse!
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# text mining library
library(tidytext)
# plotting packages
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library(ggraph)
The first time you run search_tweets() below, rtweet will open your Twitter page in the browser and ask you to authorize the app to access Twitter from R.
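If the interactive browser flow isn’t available (for example, when running on a server), you can authenticate explicitly instead. Here is a minimal sketch, assuming you have created your own Twitter developer app; the app name and all key/secret values below are placeholders, and create_token() is the pre-1.0 rtweet interface (rtweet 1.0+ replaces it with rtweet_app() and auth_as()):
# minimal authentication sketch - every value below is a placeholder
# that you would replace with your own app credentials
token <- create_token(
  app             = "my_text_mining_app",    # hypothetical app name
  consumer_key    = "YOUR_CONSUMER_KEY",
  consumer_secret = "YOUR_CONSUMER_SECRET",
  access_token    = "YOUR_ACCESS_TOKEN",
  access_secret   = "YOUR_ACCESS_SECRET"
)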
# set up query parameters - BE SURE TO CHECK THE VALUE OF loadFromTwitter
query <- "whistle blower"
queryTitle <- "whistleBlower"
queryFile <- "whistleBlower.rda"
# this flag controls whether to query twitter or load cached results;
# set it to FALSE after the first run to reuse the saved .rda file
loadFromTwitter <- list()
loadFromTwitter[[query]] <- TRUE
if (loadFromTwitter[[query]]) {
  query_tweets <- search_tweets(q = query, n = 10000,
                                lang = "en",
                                include_rts = FALSE)
  save(query_tweets, file = queryFile)
  loadFromTwitter[[query]] <- FALSE
} else {
  load(file = queryFile)
}
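As an aside, a slightly more robust pattern, sketched here, is to skip the manual flag and query Twitter only when the cached file is missing:
# alternative caching sketch: hit the twitter API only if there is no
# saved .rda file from a previous run
if (!file.exists(queryFile)) {
  query_tweets <- search_tweets(q = query, n = 10000,
                                lang = "en",
                                include_rts = FALSE)
  save(query_tweets, file = queryFile)
} else {
  load(file = queryFile)
}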
# check data to see if there are emojis
head(query_tweets$text)
## [1] "Says the \"pay for play access fraud\" that defines Hillary Clinton. The House jumped the gun. The whistle blower was a paid for spy and all he delivered was easily refutable gossip. Hillary coming out so strong? I wonder if she were a part of this scam, too! It would make sense. https://t.co/qYxqzCzyz8"
## [2] "I believe Schiff not only knew about the whistle blower, I wonder if he or one of his buddies is the whistle blower. We already know that Schiff is corrupt. What he said before the hearing should have him expelled...not censured. We do not need that kind of behavior in OUR House. https://t.co/hsN2inerOl"
## [3] "That is why they changed the whistle blower rules and are trying to protect the he/she/it (they may only be a real person on paper). It is so obvious what these desperadoes have done. Now, I want to ask, can I sue the Democrat party for destroying my peace of mind? https://t.co/QnxWzPAvEf"
## [4] "Makes you wonder how much Brennan, the CIA fraud, was involved with the phony whistle blower report. Trying to impeach a President for doing what he is legally allowed to do is going to make anyone who votes for impeachment guilty of treason for subverting the Constitution. https://t.co/DnvN7w2DFA"
## [5] "@GOPChairwoman @realDonaldTrump Me Romney, the transcript speaks for itself Trump and Giuliani self incriminated on live TV. Nothing in the whistle blower complaint has been disproven \n\nErin Brockavitch was a whistleblower who didn’t witness the poisoning of H2O first hand. That’s not a requirement."
## [6] "@kylegriffin1 I fear for this whistle blower."
Looking at the data above, it becomes clear that there is a lot of clean-up associated with social media data.
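The comment in the chunk above asks about emojis. One rough check, shown here as a sketch, is to flag tweets containing any non-ASCII character (this catches emojis along with accented letters and other symbols, not emojis specifically):
# flag tweets containing any non-ASCII character - a rough proxy for
# emojis and other special symbols
has_non_ascii <- grepl("[^\x01-\x7F]", query_tweets$text)
sum(has_non_ascii)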
First, there are URLs in your tweets. If you want to do a text analysis to figure out which words are most common in your tweets, the URLs won’t be helpful. Let’s remove those.
# remove urls - the tidyverse approach fails here for some reason:
# query_tweets %>%
#   mutate_at(c("stripped_text"), gsub("http.*", "", .))
# remove http elements manually instead; the regex is modified so each
# url match ends at the next whitespace character
query_tweets$stripped_text <- gsub("http\\S+\\s*", "", query_tweets$text)
# note: "http\\S+" already matches https urls, so this second call is
# redundant but harmless
query_tweets$stripped_text <- gsub("https\\S+\\s*", "", query_tweets$stripped_text)
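For reference, the commented-out tidyverse attempt above fails because mutate_at() is handed the result of calling gsub() rather than a function to apply. A working equivalent, as a sketch, is a plain mutate():
# working tidyverse equivalent of the gsub calls above
query_tweets <- query_tweets %>%
  mutate(stripped_text = gsub("http\\S+\\s*", "", text))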
Finally, you can clean up your text. If you are trying to create a list of unique words in your tweets, capitalized words will be treated as different from their lowercase forms. You also don’t want punctuation returned as unique words.
You can use the unnest_tokens() function from the tidytext package to magically clean up your text! When you use this function, the following things are cleaned up in the text:
Text is converted to lowercase: each word found in the text is converted to lowercase, so you don’t get duplicate words due to variation in capitalization.
Punctuation is removed: all instances of periods, commas, etc. are removed from your list of words.
A unique id associated with the tweet is added for each occurrence of the word.
The unnest_tokens() function takes two arguments: the name of the new column where the individual words will be stored, and the name of the column in your data frame that holds the text you want to split. In your case, that is the stripped_text column, which is where you stored your cleaned-up tweet text.
# remove punctuation, convert to lowercase, add id for each tweet!
query_tweets_clean <- query_tweets %>%
  dplyr::select(stripped_text) %>%
  unnest_tokens(word, stripped_text)
# plot the top 15 words -- notice any issues?
query_tweets_clean %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in tweets")
## Selecting by n
Your plot of unique words contains some words that may not be useful for your analysis, for instance “a” and “to”. In the world of text mining these are called ‘stop words’. You want to remove them from your analysis, as they are fillers used to compose a sentence.
Lucky for us, the tidytext package has a function that will help us clean up stop words! To use it, you:
Load the stop_words data included with tidytext. This data is simply a list of words that you may want to remove in a natural language analysis. Then you use anti_join() to remove all stop words from your analysis. Let’s give this a try next!
# load list of stop words - from the tidytext package
data("stop_words")
# view first 6 words
head(stop_words)
## # A tibble: 6 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
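The stop_words data combines several lexicons; a quick sketch to see how many words each one contributes:
# count how many stop words come from each lexicon
count(stop_words, lexicon)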
nrow(query_tweets_clean)
## [1] 279711
# remove stop words from your list of words
cleaned_tweet_words <- query_tweets_clean %>%
  anti_join(stop_words)
## Joining, by = "word"
# there should be fewer words now
nrow(cleaned_tweet_words)
## [1] 133706
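As a quick sanity check, you can compute the fraction of tokens that were removed as stop words (the exact value will vary with your dataset):
# fraction of all tokens that were stop words
1 - nrow(cleaned_tweet_words) / nrow(query_tweets_clean)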
You might also want to explore words that occur together in tweets. Let’s do that next.
In the call below, token = "ngrams" tells unnest_tokens() to return adjacent word pairs, and n = 2 sets the number of words per ngram.
# library(devtools)
# install_github("dgrtwo/widyr")
library(widyr)
# tokenize into word pairs (bigrams) rather than single words
query_tweets_paired_words <- query_tweets %>%
  dplyr::select(stripped_text) %>%
  unnest_tokens(paired_words, stripped_text, token = "ngrams", n = 2)
query_tweets_paired_words %>%
  count(paired_words, sort = TRUE)
## # A tibble: 112,817 x 2
## paired_words n
## <chr> <int>
## 1 whistle blower 9338
## 2 the whistle 5115
## 3 a whistle 1156
## 4 of the 1088
## 5 blower complaint 867
## 6 blower is 715
## 7 in the 686
## 8 is a 610
## 9 to the 599
## 10 is the 531
## # … with 112,807 more rows
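Before building a network, you could reuse the earlier bar-chart pattern on these pair counts; a sketch (stop words have not been removed yet, so filler pairs like “the whistle” will dominate):
# bar chart of the 15 most common word pairs, mirroring the earlier plot
query_tweets_paired_words %>%
  count(paired_words, sort = TRUE) %>%
  top_n(15) %>%
  mutate(paired_words = reorder(paired_words, n)) %>%
  ggplot(aes(x = paired_words, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word pairs", y = "Count",
       title = "Count of word pairs found in tweets")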
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:igraph':
##
## crossing
query_tweets_separated_words <- query_tweets_paired_words %>%
  separate(paired_words, c("word1", "word2"), sep = " ")
# drop pairs where either word is a stop word
query_tweets_filtered <- query_tweets_separated_words %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
# new bigram counts:
query_words_counts <- query_tweets_filtered %>%
  count(word1, word2, sort = TRUE)
head(query_words_counts)
## # A tibble: 6 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 whistle blower 9338
## 2 blower complaint 867
## 3 blower rules 379
## 4 blower report 353
## 5 hand knowledge 341
## 6 white house 254
library(igraph)
library(ggraph)
# plot the query word network
# (nodes are words; edge darkness and width scale with pair counts)
query_words_counts %>%
  filter(n >= 24) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n)) +
  geom_node_point(color = "darkslategray4", size = 3) +
  geom_node_text(aes(label = name), vjust = 1.8, size = 3) +
  labs(title = paste("Word Network: Tweets containing", query),
       subtitle = "Text mining twitter data",
       x = "", y = "")