Use the tutorial on the following website:
In this example, let's find tweets that use the phrase "witch hunt" in them.
First, you load the rtweet and other needed R packages. Note that this lesson introduces two new packages further down: igraph and ggraph.
# load twitter library - the rtweet library is recommended now over twitteR
library(rtweet)
# plotting and pipes - tidyverse!
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# text mining library
library(tidytext)
# plotting packages
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library(ggraph)
The first time you run the query code below, rtweet will take you to your Twitter page in the browser, where you have to authorize the app to work with R.
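If the browser hand-off does not work (for example on a remote server), rtweet also lets you authenticate explicitly with keys from a Twitter app of your own via create_token(). The values below are placeholders, not real keys:
# optional: authenticate with your own app keys (all values are placeholders)
# token <- create_token(
#   app = "my_twitter_app",
#   consumer_key = "XXXX",
#   consumer_secret = "XXXX",
#   access_token = "XXXX",
#   access_secret = "XXXX")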
# set up query parameters -- BE SURE TO CHECK THE VALUE OF loadFromTwitter
query <- "witch hunt"
queryTitle <- "witchHunt"
queryFile <- "witchHunt.rda"
# TRUE queries the Twitter API; FALSE re-loads previously saved results
loadFromTwitter <- list()
loadFromTwitter[[query]] <- FALSE
if (loadFromTwitter[[query]]) {
  query_tweets <- search_tweets(q = query, n = 10000,
                                lang = "en",
                                include_rts = FALSE)
  save(query_tweets, file = queryFile)
  loadFromTwitter[[query]] <- FALSE
} else {
  load(file = queryFile)
}
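The chunk above caches the search results: with the flag set to FALSE it re-loads the saved witchHunt.rda file instead of hitting the Twitter API. To pull fresh tweets, flip the flag before running the chunk:
# set to TRUE to re-query Twitter on the next run
loadFromTwitter[[query]] <- TRUE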
# check data to see if there are emojis
head(query_tweets$text)
## [1] "@realDonaldTrump @POTUS @PeteHegseth @foxandfriends It ain't no witch hunt it's \"TRUMPERY\"...LMFAO"
## [2] "@RudyGiuliani Do you recall the witch-hunt against Hillary and the money spent only to find nothing."
## [3] "@RNCResearch @realDonaldTrump President Trump asked Quid Pro Joe to release his transcript, his campaign and the Fake News media IGNORED IT. Ridiculous! We can’t let these LIARS get away with ANOTHER reckless WITCH HUNT!"
## [4] "Dr Kafeel Khan gets 'clean chit' & demands 'honour & self-respect'\nUP Govt. says, 'probe still on', while Opposition calls it 'witch hunt'.\n\nWatch Athar Khan on @thenewshour SPECIAL EDITION tonight at 9:30 PM. https://t.co/9ANe3JvhUe"
## [5] "Dr Kafeel Khan gets 'clean chit’ & demands 'honour & self-respect'.\nUP Govt says 'probe still on', while opposition calls it 'witch hunt'. \n\nTIMES NOW’s Mohit and Wajji with details. Listen in. https://t.co/OsVeNtnb9y"
## [6] "I do not understand how this witch hunt can just continue to go on from one thing to another! https://t.co/qjPDltgwju"
Looking at the data above, it becomes clear that social media data requires a lot of clean-up.
First, there are URLs in your tweets. If you want to do a text analysis to figure out which words are most common in your tweets, the URLs won't be helpful. Let's remove those.
# remove http elements: the regex matches a URL up to the next whitespace
# character (the "http" pattern also catches "https" URLs)
query_tweets$stripped_text <- gsub("http\\S+\\s*", "", query_tweets$text)
# the same clean-up in tidyverse style, if you prefer:
# query_tweets <- query_tweets %>%
#   mutate(stripped_text = gsub("http\\S+\\s*", "", text))
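A quick sanity check on a made-up tweet shows what the regex does (the string below is invented for illustration):
# the URL is stripped, the rest of the text survives
gsub("http\\S+\\s*", "", "This witch hunt continues https://t.co/qjPDltgwju")
## [1] "This witch hunt continues "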
Finally, you can clean up your text. If you are trying to create a list of unique words in your tweets, words with capitalization will be counted separately from words that are all lowercase. Also, you don't want punctuation to be returned as a unique word.
You can use the unnest_tokens() function from the tidytext package to magically clean up your text! When you use this function, the following things happen:
Text is converted to lowercase: each word found in the text is converted to lowercase, ensuring that you don't get duplicate words due to variation in capitalization.
Punctuation is removed: all instances of periods, commas, etc. are removed from your list of words.
A unique id associated with the tweet is added for each occurrence of the word.
The unnest_tokens() function takes two arguments: the name of the new column that will hold the individual words (word below) and the name of the input column to split apart.
In your case, you want to split the stripped_text column, which is where your cleaned-up tweet text is stored.
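To see what unnest_tokens() does, here is a minimal sketch on a one-row, made-up tibble:
# toy example: one fake tweet, tokenized into lowercase words
demo_tweet <- tibble(stripped_text = "This Witch Hunt must END, now!")
demo_tweet %>%
  unnest_tokens(word, stripped_text)
## # A tibble: 6 x 1
##   word
##   <chr>
## 1 this
## 2 witch
## 3 hunt
## 4 must
## 5 end
## 6 now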
# remove punctuation, convert to lowercase, add id for each tweet!
query_tweets_clean <- query_tweets %>%
  dplyr::select(stripped_text) %>%
  unnest_tokens(word, stripped_text)
# plot the top 15 words -- notice any issues?
query_tweets_clean %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in tweets")
## Selecting by n
Your plot of unique words contains some words that may not be useful, for instance "a" and "to". In the world of text mining these are called 'stop words'. You want to remove them from your analysis because they are fillers used to compose a sentence.
Lucky for us, the tidytext package has a function that will help us clean up stop words! To use it you:
Load the stop_words data included with tidytext. This data is simply a list of words that you may want to remove in a natural language analysis. Then you use anti_join() to remove all stop words from your word list. Let's give this a try next!
# load list of stop words - from the tidytext package
data("stop_words")
# view first 6 words
head(stop_words)
## # A tibble: 6 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
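As a tiny illustration of what anti_join() does, the made-up three-word table below keeps only the rows that have no match in stop_words:
# "the" is a stop word and gets dropped; "witch" and "hunt" survive
anti_join(tibble(word = c("witch", "hunt", "the")), stop_words, by = "word")
## # A tibble: 2 x 1
##   word
##   <chr>
## 1 witch
## 2 hunt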
nrow(query_tweets_clean)
## [1] 280325
Your exact counts will differ depending on when you ran the Twitter search.
# remove stop words from your list of words
cleaned_tweet_words <- query_tweets_clean %>%
  anti_join(stop_words)
## Joining, by = "word"
# there should be fewer words now
nrow(cleaned_tweet_words)
## [1] 131614
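With the stop words gone, you can re-run the earlier plot on cleaned_tweet_words to see the words that actually carry meaning (same plotting code as above):
# plot the top 15 words after removing stop words
cleaned_tweet_words %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in tweets",
       subtitle = "Stop words removed from the list")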
You might also want to explore words that occur together in tweets. Let's do that next.
In the call below, token = "ngrams" tells unnest_tokens() to tokenize into sequences of words rather than single words, and n = 2 asks for pairs of two words (bigrams).
# widyr is on CRAN; run install.packages("widyr") if you don't have it
library(widyr)
# tokenize into bigrams: pairs of adjacent words
# (still lowercased, punctuation removed)
query_tweets_paired_words <- query_tweets %>%
  dplyr::select(stripped_text) %>%
  unnest_tokens(paired_words, stripped_text, token = "ngrams", n = 2)
query_tweets_paired_words %>%
  count(paired_words, sort = TRUE)
## # A tibble: 121,649 x 2
## paired_words n
## <chr> <int>
## 1 witch hunt 9922
## 2 a witch 2478
## 3 the witch 1204
## 4 this is 891
## 5 of the 819
## 6 is a 770
## 7 in the 601
## 8 this witch 561
## 9 another witch 537
## 10 hunt and 492
## # … with 121,639 more rows
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:igraph':
##
## crossing
# separate each bigram into two columns so stop words can be filtered out
query_tweets_separated_words <- query_tweets_paired_words %>%
  separate(paired_words, c("word1", "word2"), sep = " ")
# drop any pair where either word is a stop word
query_tweets_filtered <- query_tweets_separated_words %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
# new bigram counts:
query_words_counts <- query_tweets_filtered %>%
  count(word1, word2, sort = TRUE)
head(query_words_counts)
## # A tibble: 6 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 witch hunt 9922
## 2 fake news 393
## 3 impeachment inquiry 348
## 4 hunt impeachment 289
## 5 slams democrats 280
## 6 democrats whistleblower 270
library(igraph)
library(ggraph)
# plot the word network of bigram co-occurrences
query_words_counts %>%
  filter(n >= 24) %>%     # keep only pairs seen at least 24 times
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n)) +
  geom_node_point(color = "darkslategray4", size = 3) +
  geom_node_text(aes(label = name), vjust = 1.8, size = 3) +
  labs(title = paste("Word Network: Tweets containing", query),
       subtitle = "Text mining twitter data",
       x = "", y = "")