Note that two new packages are introduced later in this lesson: igraph and ggraph.
# load twitter library - the rtweet library is recommended now over twitteR
library(rtweet)
# plotting and pipes - tidyverse!
library(ggplot2)
library(dplyr)
library(tidyverse)
# text mining library
library(tidytext)
To download tweets, you will need a Twitter account.
Then you will need a developer account. Follow this CRAN tutorial in order to create a developer account: https://cran.r-project.org/web/packages/rtweet/vignettes/auth.html
Copy and paste the four keys (along with the name of your app) into an R script file and pass them along to create_token().
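# replace each "…" with the value from your own developer account
api_key <- "…"
api_secret_key <- "…"
access_token <- "…"
access_token_secret <- "…"
# app is the name of your app
create_token(app = "…",
  consumer_key = api_key, consumer_secret = api_secret_key,
  access_token = access_token, access_secret = access_token_secret)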
Search for tweets related to Russia
russia_tweets <- search_tweets(q = "Putin Navalny", n = 5000, # had to lower the number to get under rate limit
lang = "en",
include_rts = FALSE)
# check data to see if there are emojis
head(russia_tweets$text)
## [1] "Putin needs to go same as his pupil @realDonaldTrump https://t.co/6WjCNa4SGS"
## [2] "Questions...Putin, Xi, Un also don’t like questions and throw people and journalists in jail, and some other “nice” countries kill them #Khashoggi... @Kasparov63 @navalny Who are the people of the USA \U0001f1fa\U0001f1f8 with??? @SenKamalaHarris @JoeBiden https://t.co/zaDTf71tS0"
## [3] "\U0001f923Who even believable these things from the minions of Putin?!? https://t.co/th64BN3CCx"
## [4] "@realDonaldTrump invites little green men to “observe “ election? #Putin @navalny https://t.co/JCAAEKSpOa"
## [5] "@mfa_russia @guardian @RussianEmbassy @FCDOGovUK Russia is a terrorist state.\nThe West would support Navalny, who is the one of few visible threats to Putin's managed democracy, why on earth would they poison him?"
## [6] "@realDonaldTrump Speaking of poison you still have not mentioned the poisoning and attempted assassination of Putin opposition leader Alexey Navalny, also you still have not punished Russia for putting bounties on USA soldiers."
Navalny and Putin are two different people, so I don’t expect their names to appear joined together in a single string, and I won’t try searching for a concatenated form. I’ll proceed to cleaning the data I do have.
Looking at the data above, it becomes clear that there is a lot of clean-up associated with social media data.
First, there are URLs in your tweets. If you want to do a text analysis to figure out which words are most common in your tweets, the URLs won’t be helpful. Let’s remove those.
# remove http elements manually
# note: "http.*" deletes everything from the first "http" to the end of the tweet;
# it also matches "https", so a single call is enough
russia_tweets$stripped_text <- gsub("http.*", "", russia_tweets$text)
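A note on the pattern used above: `"http.*"` matches from the first "http" through the end of the string, so any words that follow a link are removed along with it, and a tweet that begins with a link is emptied entirely. The sketch below compares it with a narrower pattern that removes only the link itself. The sample strings and the names `sample_text`, `greedy`, and `narrow` are made up for illustration; this is one possible alternative, not the method used in the rest of this lesson.

```r
# Toy strings standing in for tweet text (hypothetical examples)
sample_text <- c("Putin news https://t.co/abc123 more words here",
                 "https://t.co/xyz789 link comes first")

# Pattern used in this lesson: drops everything from "http" to the end
greedy <- gsub("http.*", "", sample_text)

# Narrower alternative: drop only the URL, i.e. "http" or "https",
# then "://", then every following non-space character
narrow <- gsub("https?://\\S+", "", sample_text)

# Collapse repeated spaces and trim the ends
narrow <- trimws(gsub("\\s+", " ", narrow))

print(greedy)
print(narrow)
```

Whether the narrower pattern is preferable depends on the analysis; it keeps more words per tweet, so word counts would differ from the ones shown in this lesson.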
You can use the unnest_tokens() function from the tidytext package to clean up your text. When you use this function, the following happens:
Text is converted to lowercase: each word found in the text is lowercased, so you don’t get duplicate words due to variation in capitalization.
Punctuation is removed: all periods, commas, and so on are dropped from your list of words.
Each word gets its own row: any other columns you keep, such as a tweet id, are repeated for every word from that tweet.
The unnest_tokens() function takes two main arguments: the name of the output column that will hold the words (here, word) and the name of the input column that holds the text. In your case, the input is the stripped_text column, which is where your cleaned-up tweet text is stored.
# split into one word per row: removes punctuation and converts to lowercase
russia_tweets_clean <- russia_tweets %>%
dplyr::select(stripped_text) %>%
unnest_tokens(word, stripped_text)
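To see what unnest_tokens() does without downloading anything, here is a minimal sketch on two made-up strings. The `toy_tweets` data and the `tweet_id` column are hypothetical; they are not part of the tweet data used in this lesson.

```r
library(dplyr)
library(tidytext)

# Two made-up "tweets" (hypothetical data, not from the search above)
toy_tweets <- tibble(tweet_id = c(1, 2),
                     stripped_text = c("Hello, World!", "hello AGAIN..."))

# One row per word; tweet_id is repeated for every word of its tweet
toy_words <- toy_tweets %>%
  unnest_tokens(word, stripped_text)

print(toy_words)
```

Note that the lesson’s own pipeline selects only stripped_text before tokenizing, so no id column is carried along there.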
# plot the top 15 words -- notice any issues?
russia_tweets_clean %>%
count(word, sort = TRUE) %>%
top_n(15) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_col() +
xlab(NULL) +
coord_flip() +
labs(x = "Unique words",
y = "Count",
title = "Count of unique words found in tweets")
## Selecting by n
Your plot of unique words contains some words that may not be useful, such as “a” and “to”. These are called stop words: fillers used to compose a sentence that carry little meaning of their own. You want to remove them from your analysis.
Luckily, the tidytext package includes data that will help you clean up stop words. To use it, you:
Load the stop_words data included with tidytext. This data is simply a list of words that you may want to remove in a natural language analysis.
Use anti_join() to remove all stop words from your analysis.
Let’s give this a try next!
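As a quick check of how anti_join() behaves, here is a minimal sketch on a made-up word list. The `toy_words` and `toy_clean` names and their contents are hypothetical.

```r
library(dplyr)
library(tidytext)

# Load the stop word list shipped with tidytext
data("stop_words")

# A made-up word list (hypothetical data)
toy_words <- tibble(word = c("putin", "to", "the", "navalny", "a"))

# Keep only rows whose word does NOT appear in stop_words
toy_clean <- toy_words %>%
  anti_join(stop_words, by = "word")

print(toy_clean)
```

Passing `by = "word"` explicitly avoids the message that dplyr prints when it has to guess the join column.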
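# load the stop word list and remove stop words from the list of words
data("stop_words")
cleaned_tweet_words <- russia_tweets_clean %>%
anti_join(stop_words, by = "word")
# plot the top 15 words again
cleaned_tweet_words %>%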
count(word, sort = TRUE) %>%
top_n(15) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_col() +
xlab(NULL) +
coord_flip() +
labs(x = "Unique words",
y = "Count",
title = "Count of unique words found in tweets")
## Selecting by n
## Explore Networks of Words
You might also want to explore words that occur together in tweets. Let's do that next.
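In the call to unnest_tokens(), token = "ngrams" splits the text into runs of consecutive words rather than single words, and n = 2 sets the length of each run to two words.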
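To make those two settings concrete, here is a small sketch on a single made-up sentence. The `toy_text` data is hypothetical.

```r
library(dplyr)
library(tidytext)

# One made-up sentence (hypothetical data)
toy_text <- tibble(stripped_text = "Putin says he will respond")

# token = "ngrams" with n = 2 yields overlapping two-word windows
toy_bigrams <- toy_text %>%
  unnest_tokens(paired_words, stripped_text, token = "ngrams", n = 2)

print(toy_bigrams$paired_words)
```

Each word except the first and last appears in two pairs, once as the second word and once as the first, which is why common stop words show up so often in the raw bigram counts.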
```r
# library(devtools)
# install_github("dgrtwo/widyr")
library(widyr)
# split each tweet into bigrams (pairs of consecutive words)
russia_tweets_paired_words <- russia_tweets %>%
  dplyr::select(stripped_text) %>%
  unnest_tokens(paired_words, stripped_text, token = "ngrams", n = 2)
russia_tweets_paired_words %>%
  count(paired_words, sort = TRUE)
```
## # A tibble: 8,121 x 2
## paired_words n
## <chr> <int>
## 1 alexei navalny 71
## 2 putin says 67
## 3 navalny to 59
## 4 says he 52
## 5 to be 51
## 6 of the 42
## 7 for treatment 41
## 8 vladimir putin 40
## 9 to germany 35
## 10 poisoning of 34
## # … with 8,111 more rows
This is not unexpected. Alexei (or Alexey depending on transliteration) is Navalny’s first name and Vladimir is Putin’s first name.
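Again, eliminate the stop words, this time from the paired words. Each pair is first split into two columns so that each word can be checked against the stop word list.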
library(tidyr)
russia_tweets_separated_words <- russia_tweets_paired_words %>%
separate(paired_words, c("word1", "word2"), sep = " ")
russia_tweets_filtered <- russia_tweets_separated_words %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
# new bigram counts:
russia_words_counts <- russia_tweets_filtered %>%
count(word1, word2, sort = TRUE)
head(russia_words_counts)
## # A tibble: 6 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 alexei navalny 71
## 2 vladimir putin 40
## 3 alexey navalny 29
## 4 leave russia 22
## 5 navalny poisoning 22
## 6 opposition leader 22
# plotting packages
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:purrr':
##
## compose, simplify
## The following object is masked from 'package:tidyr':
##
## crossing
## The following object is masked from 'package:tibble':
##
## as_data_frame
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library(ggraph)
# (plotting graph edges is currently broken)
russia_words_counts %>%
filter(n >= 24) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n)) +
geom_node_point(color = "darkslategray4", size = 3) +
geom_node_text(aes(label = name), vjust = 1.8, size = 3) +
labs(title = "Word Network: Tweets using the words Navalny and Putin",
subtitle = "Text mining Twitter data",
x = "", y = "")
This is not particularly revealing: the network simply connects each man’s first name to his last name. It does show that Alexei is a more popular transliteration than Alexey.
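For readers unfamiliar with how a table of bigram counts becomes a network, the sketch below builds the graph object on its own, without plotting. It uses the three most frequent bigrams from the head() output above. The `toy_counts` and `toy_graph` names are made up for illustration.

```r
library(igraph)

# The three most frequent bigrams, copied from the head() output above
toy_counts <- data.frame(word1 = c("alexei", "vladimir", "alexey"),
                         word2 = c("navalny", "putin", "navalny"),
                         n     = c(71, 40, 29))

# The first two columns become the endpoints of each edge;
# remaining columns (here n) are stored as edge attributes
toy_graph <- graph_from_data_frame(toy_counts)

print(V(toy_graph)$name)  # one node per distinct word
print(E(toy_graph)$n)     # counts that ggraph maps to edge width and alpha
```

Because "navalny" appears in two rows, it becomes a single node with two incoming edges, which is how the two transliterations end up connected in the plot.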
I wonder if I can do this in Russian.
russian_lang_tweets <- search_tweets(q = "путин навальный", n = 4500, # lowered number so total <10000
lang = "ru",
include_rts = FALSE)
# check data to see if there are emojis
head(russian_lang_tweets$text)
## [1] "@i_korotchenko Вообще-то его сейчас нет в эфире, если не считать вчерашнее интервью Л. Соболь на канале Навальный лайв. Вернется в эфир и очень быстро наберет ещё больше подписчиков, так же как и Платошкин. Путин и олигархоз - ВСË!"
## [2] "@gudkov_g Это тот самый Путин, который вместе с СК и рафаэлкой \"вступился\" за лругого ветерана, якобы Навальный его сильно оскорбил. Вот так."
## [3] "@vsyo @redhot4ili Болотная была не одна, основная за день до инаугурации, когда \"чуть Кремль не взяли\" и Путин на неё ехал по пустынным улицам. Именно Навальный отказался покинуть площадь и людей начали вытеснять силой. Я тогда был не на самой площади, а Вы где?"
## [4] "https://t.co/V9wmYdbupr Антон Орехъ: Путин проболтался! Дело не в том, что Навальный опроверг путинские слова, дело в самих словах. Важна фраза Путина, что он дал указание Гепрокуратуре и «этого гражданина» отпустили лечиться за границу."
## [5] "@EchoMskRu Странно, Путин согласно сливу французов обвиняет, что Навальный изготавливает на территории РФ боевое отравляющее вещество, но расследовать это не хочет)"
## [6] "Рейтинг ста персон, пользующихся наибольшим доверием россиян от исследовательского холдинга «Ромир», обновляется раз в квартал:\n1. Путин\n2. Лавров\n3. Жириновский\n4. Навальный\n5. Шойгу\n6. Мишустин\n...\nhttps://t.co/pA12Lyljzc"
Ha! It does work, though obviously I will have a problem with stop words in Russian.
I’ll clean the data the same way as before.
# remove http elements manually
# note: "http.*" deletes everything from the first "http" to the end of the tweet;
# it also matches "https", so a single call is enough
russian_lang_tweets$stripped_text <- gsub("http.*", "", russian_lang_tweets$text)
# split into one word per row: removes punctuation and converts to lowercase
russian_lang_tweets_clean <- russian_lang_tweets %>%
dplyr::select(stripped_text) %>%
unnest_tokens(word, stripped_text)
# plot the top 15 words -- notice any issues?
russian_lang_tweets_clean %>%
count(word, sort = TRUE) %>%
top_n(15) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_col() +
xlab(NULL) +
coord_flip() +
labs(x = "Unique words",
y = "Count",
title = "Count of unique words found in tweets")
## Selecting by n
All of the words below the top two (Putin and Navalny) are prepositions, conjunctions, or pronouns. I need to figure out how to get rid of them!
Google tells me the “stopwords” package should work with Russian.
library(stopwords)
stopwords("russian")
## [1] "и" "в" "во" "не" "что" "он" "на"
## [8] "я" "с" "со" "как" "а" "то" "все"
## [15] "она" "так" "его" "но" "да" "ты" "к"
## [22] "у" "же" "вы" "за" "бы" "по" "только"
## [29] "ее" "мне" "было" "вот" "от" "меня" "еще"
## [36] "нет" "о" "из" "ему" "теперь" "когда" "даже"
## [43] "ну" "вдруг" "ли" "если" "уже" "или" "ни"
## [50] "быть" "был" "него" "до" "вас" "нибудь" "опять"
## [57] "уж" "вам" "сказал" "ведь" "там" "потом" "себя"
## [64] "ничего" "ей" "может" "они" "тут" "где" "есть"
## [71] "надо" "ней" "для" "мы" "тебя" "их" "чем"
## [78] "была" "сам" "чтоб" "без" "будто" "человек" "чего"
## [85] "раз" "тоже" "себе" "под" "жизнь" "будет" "ж"
## [92] "тогда" "кто" "этот" "говорил" "того" "потому" "этого"
## [99] "какой" "совсем" "ним" "здесь" "этом" "один" "почти"
## [106] "мой" "тем" "чтобы" "нее" "кажется" "сейчас" "были"
## [113] "куда" "зачем" "сказать" "всех" "никогда" "сегодня" "можно"
## [120] "при" "наконец" "два" "об" "другой" "хоть" "после"
## [127] "над" "больше" "тот" "через" "эти" "нас" "про"
## [134] "всего" "них" "какая" "много" "разве" "сказала" "три"
## [141] "эту" "моя" "впрочем" "хорошо" "свою" "этой" "перед"
## [148] "иногда" "лучше" "чуть" "том" "нельзя" "такой" "им"
## [155] "более" "всегда" "конечно" "всю" "между"
That’ll work.
nrow(russian_lang_tweets_clean)
## [1] 11202
russianStopWords <- tibble(word = stopwords("russian") )
# remove stop words from your list of words
cleaned_rl_tweet_words <- russian_lang_tweets_clean %>%
anti_join(russianStopWords, by = 'word', copy = TRUE)
# there should be fewer words now
nrow(cleaned_rl_tweet_words)
## [1] 7422
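The same join can be tried on a handful of made-up Russian tokens to confirm that the stop word list behaves as expected. The `russian_stops`, `toy_ru_words`, and `toy_ru_clean` names and the token list are hypothetical.

```r
library(dplyr)
library(stopwords)

# Wrap the Russian stop word vector in a one-column tibble so it can be joined
russian_stops <- tibble(word = stopwords("russian"))

# A made-up token list (hypothetical data)
toy_ru_words <- tibble(word = c("путин", "и", "навальный", "не", "что"))

# Drop every token that appears in the stop word list
toy_ru_clean <- toy_ru_words %>%
  anti_join(russian_stops, by = "word")

print(toy_ru_clean$word)
```

The join is an exact string match, so a word missing from the list, or spelled slightly differently in the tweets, will pass through unfiltered.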
Now we’ll see what the unique words are:
# plot the top 15 words -- notice any issues?
cleaned_rl_tweet_words %>%
count(word, sort = TRUE) %>%
top_n(15) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_col() +
xlab(NULL) +
coord_flip() +
labs(x = "Unique words",
y = "Count",
title = "Count of unique words found in tweets")
## Selecting by n
Much better. I still need to get rid of “это”, which means “this is”, and “все”, which means “everything”, but other than that the list looks good. I’ll try tweaking the russianStopWords tibble.
# I copied the data into a text file, added the words I wanted, and edited the text to make it a csv --- "myrussianwords.txt"
# Next I'll load it and convert it into a tibble, just like I did with the pre-packaged stop word list
myRussianStopWords <- read_csv("myrussianwords.txt")
## Parsed with column specification:
## cols(
## word = col_character()
## )
Test if that worked:
# remove stop words from your list of words
cleaned_rl_tweet_words <- russian_lang_tweets_clean %>%
anti_join(myRussianStopWords, by = 'word', copy = TRUE)
# plot the top 15 words -- notice any issues?
cleaned_rl_tweet_words %>%
count(word, sort = TRUE) %>%
top_n(15) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_col() +
xlab(NULL) +
coord_flip() +
labs(x = "Unique words",
y = "Count",
title = "Count of unique words found in tweets")
## Selecting by n
Excellent! I got rid of the two extraneous words, and now I have a text file I can go back to whenever I notice more words that should be treated as stop words. I still need to figure out how to handle the endings of the names: they decline in Russian, so the different forms are basically the same word with a few different letters. Perhaps regular expressions?
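One way the regular-expression idea might look is sketched below: any token that starts with a surname’s stem is replaced by the dictionary form. This is a crude assumption rather than a real stemmer. It would also fold in derived words that share the stem (for example an adjective built on "путин", or the feminine surname form), so it is only a starting point. The `toy_forms` tokens and the `normalised` name are made up for illustration.

```r
# Made-up tokens: declined forms of the two surnames plus an unrelated word
toy_forms <- c("путин", "путина", "путину",
               "навальный", "навального", "россия")

# Crude normalisation (an assumption, not a real stemmer):
# any word starting with the stem is replaced by the dictionary form
normalised <- sub("^путин.*", "путин", toy_forms)
normalised <- sub("^навальн.*", "навальный", normalised)

print(normalised)
print(table(normalised))
```

Applied to the word column before count(), a step like this would merge the declined forms into one bar per name in the frequency plot.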
# library(devtools)
# install_github("dgrtwo/widyr")
library(widyr)
# split each tweet into bigrams (pairs of consecutive words)
russianlanguage_tweets_paired_words <- russian_lang_tweets %>%
dplyr::select(stripped_text) %>%
unnest_tokens(paired_words, stripped_text, token = "ngrams", n = 2)
russianlanguage_tweets_paired_words %>%
count(paired_words, sort = TRUE)
## # A tibble: 8,964 x 2
## paired_words n
## <chr> <int>
## 1 что путин 42
## 2 путин и 37
## 3 путин не 29
## 4 и навальный 26
## 5 путин навальный 25
## 6 навальный и 20
## 7 навальный путин 20
## 8 фамилию навальный 19
## 9 что навальный 19
## 10 алексей навальный 18
## # … with 8,954 more rows
That has a lot of stop words in it, and a combination of first and last names.
library(tidyr)
russianlanguage_tweets_separated_words <- russianlanguage_tweets_paired_words %>%
separate(paired_words, c("word1", "word2"), sep = " ")
russianlanguage_tweets_filtered <- russianlanguage_tweets_separated_words %>%
filter(!word1 %in% russianStopWords$word) %>%
filter(!word2 %in% russianStopWords$word)
# new bigram counts:
russianlanguage_counts <- russianlanguage_tweets_filtered %>%
count(word1, word2, sort = TRUE)
head(russianlanguage_counts)
## # A tibble: 6 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 путин навальный 25
## 2 навальный путин 20
## 3 фамилию навальный 19
## 4 алексей навальный 18
## 5 владимир путин 14
## 6 1 путин 11
Plot the data:
# plotting packages
library(igraph)
library(ggraph)
# plot the Russian-language word network
# (plotting graph edges is currently broken)
russianlanguage_counts %>%
filter(n >= 10) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n)) +
geom_node_point(color = "darkslategray4", size = 3) +
geom_node_text(aes(label = name), vjust = 1.8, size = 3) +
labs(title = "Word Network: Tweets using the words Navalny and Putin",
subtitle = "Text mining Twitter data",
x = "", y = "")