It has been over two weeks since Australia's Covid-19 travel ban started. Many people are expressing their concerns and opinions about Covid-19 on Twitter. It is interesting to find out what other topics Australians are discussing along with Covid-19 on Twitter.
This vignette introduces how to collect data from Twitter with the rtweet package and walks through an example of handling and processing text strings (hashtags) with the stringr package in R.
To get started, we need to install and load the rtweet package in R. According to Twitter trends, #Covid_19australia is the most popular hashtag in Australia at the moment, so we can start by searching for all tweets that include this hashtag. We simply send a search request to Twitter's API using the function search_tweets().
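A minimal sketch of the request, assuming the query "#Covid_19australia" and the object name tweets used throughout this vignette (the value of n here is illustrative):

library(rtweet)

tweets <- search_tweets(
  "#Covid_19australia",   # hashtag to search for
  n = 18000,              # how many recent tweets to request (the API maximum)
  include_rts = FALSE     # exclude retweets from the results
)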
Within the function, n specifies the desired total number of latest tweets (from the past 9 days) to return (the default is 100 and the maximum is 18,000); include_rts indicates whether to include retweets in the search results (we set it to FALSE to filter out retweets).
When we run the code above, a web browser will pop up asking for our Twitter account details:
Figure 1
Then simply click “Authorize app” to interact with Twitter.
Figure 2
With rtweet, it is no longer necessary to obtain a developer account to use Twitter's API, which makes interaction much faster and easier. There are many more functions besides search_tweets() to explore, but this is the only one we need for our example.
Now we have our dataset ready. As we know, data can be divided into quantitative data and qualitative data. Qualitative data, also known as categorical data, is descriptive and non-numerical in nature and is collected through methods such as observation and interviews. Tweets are this type of data, in the form of text and strings. R may not be as rich as some other scripting languages when it comes to string manipulation, but for continuity and consistency it is better to stay in the same environment, and R handles character strings and text perfectly well.
The stringr package is included in the tidyverse package, so we load tidyverse here:
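library(tidyverse)   # attaches stringr along with dplyr, ggplot2 and friends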
Before we start processing, check the structure of the dataset to get a sense of what the data looks like:
glimpse(tweets)
#> Observations: 6,430
#> Variables: 90
#> $ user_id <chr> "2470735572", "2470735572", "2470735572", "...
#> $ status_id <chr> "1246628328724770818", "1246170563720048640...
#> $ created_at <dbl> 43926.11, 43924.85, 43925.92, 43925.93, 439...
#> $ screen_name <chr> "MelissackovacM", "MelissackovacM", "Meliss...
#> $ text <chr> "@DanielAndrewsMP\n@JennyMikakos\n@Victoria...
#> $ source <chr> "Twitter for Android", "Twitter for Android...
#> $ display_text_width <dbl> 151, 249, 248, 224, 241, 142, 72, 237, 182,...
#> $ reply_to_status_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "12...
#> $ reply_to_user_id <chr> "228535666", NA, NA, NA, NA, NA, NA, NA, NA...
#> $ reply_to_screen_name <chr> "DanielAndrewsMP", NA, NA, NA, NA, NA, NA, ...
#> $ is_quote <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
#> $ is_retweet <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
#> $ favorite_count <dbl> 0, 0, 0, 2, 0, 0, 0, 1, 2, 1, 0, 0, 1, 0, 0...
#> $ retweet_count <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 3, 0, 0...
#> $ quote_count <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ reply_count <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ hashtags <chr> "Covid_19australia, coronavirusaus, COVID19...
#> $ symbols <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ urls_url <chr> NA, "abc.net.au/news/health/20…", "abc.net....
#> $ urls_t.co <chr> NA, "https://t.co/M1lKeRc4rS", "https://t.c...
#> $ urls_expanded_url <chr> NA, "https://www.abc.net.au/news/health/202...
#> $ media_url <chr> NA, NA, NA, NA, NA, NA, "http://pbs.twimg.c...
#> $ media_t.co <chr> NA, NA, NA, NA, NA, NA, "https://t.co/61rlW...
#> $ media_expanded_url <chr> NA, NA, NA, NA, NA, NA, "https://twitter.co...
#> $ media_type <chr> NA, NA, NA, NA, NA, NA, "photo", NA, NA, NA...
#> $ ext_media_url <chr> NA, NA, NA, NA, NA, NA, "http://pbs.twimg.c...
#> $ ext_media_t.co <chr> NA, NA, NA, NA, NA, NA, "https://t.co/61rlW...
#> $ ext_media_expanded_url <chr> NA, NA, NA, NA, NA, NA, "https://twitter.co...
#> $ ext_media_type <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ mentions_user_id <chr> "228535666, 418667840, 1182090678999736321"...
#> $ mentions_screen_name <chr> "DanielAndrewsMP, JennyMikakos, VictorianCH...
#> $ lang <chr> "en", "en", "en", "en", "en", "en", "und", ...
#> $ quoted_status_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, "1246384584...
#> $ quoted_text <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Our peacef...
#> $ quoted_created_at <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 43925.44, 4...
#> $ quoted_source <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Twitter fo...
#> $ quoted_favorite_count <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 66, 32, NA,...
#> $ quoted_retweet_count <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 40, 10, NA,...
#> $ quoted_user_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, "9304142869...
#> $ quoted_screen_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, "FarhadBand...
#> $ quoted_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Farhad Ban...
#> $ quoted_followers_count <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 300, 25236,...
#> $ quoted_friends_count <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 119, 2005, ...
#> $ quoted_statuses_count <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 87, 46744, ...
#> $ quoted_location <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Melbourne,...
#> $ quoted_description <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Artist/mus...
#> $ quoted_verified <lgl> NA, NA, NA, NA, NA, NA, NA, NA, FALSE, TRUE...
#> $ retweet_status_id <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ retweet_text <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ retweet_created_at <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ retweet_source <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ retweet_favorite_count <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ retweet_retweet_count <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ retweet_user_id <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ retweet_screen_name <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ retweet_name <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ retweet_followers_count <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ retweet_friends_count <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ retweet_statuses_count <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ retweet_location <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ retweet_description <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ retweet_verified <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ place_url <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ place_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ place_full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ place_type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ country <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ country_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ geo_coords <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ coords_coords <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ bbox_coords <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ status_url <chr> "https://twitter.com/MelissackovacM/status/...
#> $ name <chr> "ogden", "ogden", "ogden", "ogden", "ogden"...
#> $ location <chr> "", "", "", "", "", "", "", "", "Sydney, Au...
#> $ description <chr> "stuck in a rut", "stuck in a rut", "stuck ...
#> $ url <chr> NA, NA, NA, NA, NA, NA, NA, NA, "https://t....
#> $ protected <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
#> $ followers_count <dbl> 2, 2, 2, 2, 2, 182, 182, 182, 2810, 2810, 2...
#> $ friends_count <dbl> 6, 6, 6, 6, 6, 485, 485, 485, 2640, 2640, 2...
#> $ listed_count <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 347, 347, 347, 347,...
#> $ statuses_count <dbl> 93, 93, 93, 93, 93, 13276, 13276, 13276, 11...
#> $ favourites_count <dbl> 5, 5, 5, 5, 5, 19794, 19794, 19794, 25417, ...
#> $ account_created_at <dbl> 41759.49, 41759.49, 41759.49, 41759.49, 417...
#> $ verified <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
#> $ profile_url <chr> NA, NA, NA, NA, NA, NA, NA, NA, "https://t....
#> $ profile_expanded_url <chr> NA, NA, NA, NA, NA, NA, NA, NA, "http://www...
#> $ account_lang <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ profile_banner_url <chr> NA, NA, NA, NA, NA, "https://pbs.twimg.com/...
#> $ profile_background_url <chr> "http://abs.twimg.com/images/themes/theme1/...
#> $ profile_image_url <chr> "http://abs.twimg.com/sticky/default_profil...
tweets is a large dataset with 90 variables and 6,430 observations. The text column holds the content of the tweets; since we are going to analyse hashtags, this is the column we will focus on.
First, we create a pattern to search for hashtags within the tweet text.
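A sketch of the pattern and the extraction step (the exact regular expression used originally is not shown, so this one is an approximation of the description that follows):

hashtag_pat <- "#[A-Za-z0-9_.ー-]+"                    # "#" followed by letters, digits, "_", "-", "ー" or "."

hashtag <- str_extract_all(tweets$text, hashtag_pat)   # one character vector of hashtags per tweet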
hashtag_pat looks for strings that start with # and are followed by letters, digits or the special characters "_", "-", "ー" or "." of any length. Then, with str_extract_all, the results are stored in the list hashtag.
Second, we convert the list to a vector for further processing. To merge hashtags with the same content, we convert all hashtags to lowercase and remove special characters.
hashtag_word <- unlist(hashtag)                          # flatten the list into a single character vector
hashtag_word <- tolower(hashtag_word)                    # lowercase so the same hashtag in different cases merges
hashtag_word <- gsub("[[:punct:]ー]", "", hashtag_word)  # strip "#", other punctuation and "ー"
In the last step, since our purpose is to find out what other topics Australians discuss along with Covid-19 on Twitter, all hashtags containing "covid" or "corona" are removed.
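One way to apply that filter with stringr (a sketch; the original code is not shown):

hashtag_word <- hashtag_word[!str_detect(hashtag_word, "covid|corona")]   # drop covid/corona hashtags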
Now we have a clean dataset, so we can count the frequency of each unique hashtag to see the top 20 most popular topics.
hashtag_count <- table(hashtag_word)
top_20_freqs <- sort(hashtag_count, decreasing = TRUE)[1:20]
top_20_freqs
#> hashtag_word
#> auspol stayathome australia rubyprincess
#> 1125 176 125 111
#> stayhomeaustralia socialdistancing scottyfrommarketing lockdownaustralia
#> 87 82 77 70
#> nswpol lockdown auspol2020 memes
#> 69 66 60 54
#> australialockdown stayhomesavelives insiders scottyfromhillsong
#> 50 47 46 42
#> stayhome auspoi 5g nsw
#> 41 40 33 33
Here is a barplot of the top 20 hashtags in descending order.
as.data.frame(hashtag_word) %>%
  count(hashtag_word, sort = TRUE) %>%                  # tally each unique hashtag
  mutate(hashtag_word = reorder(hashtag_word, n)) %>%   # order bars by frequency
  top_n(20) %>%                                         # keep the 20 most frequent
  ggplot(aes(x = hashtag_word, y = n)) +
  geom_col() +
  coord_flip() +                                        # horizontal bars: hashtags on the vertical axis
  labs(x = "Hashtag",
       y = "Count",
       title = "Top 20 Popular Hashtags along with Covid19")
Figure 3
There is another package, wordcloud, that can help us visualise the ranking.
library(wordcloud)
top_20_hashtags <- as.character(as.data.frame(top_20_freqs)[, 1])
wordcloud(top_20_hashtags, top_20_freqs,
          scale = c(3.5, 1.5), random.order = FALSE, rot.per = .25)
Figure 4
Through the hashtag frequency ranking, we can see that the most popular hashtag discussed along with Covid-19 in Australia is "#auspol" (or "#Auspol"), which is far more popular than any other tag. It seems many Australians are commenting on the federal government during the pandemic lockdown ;)
Here are helpful resources I used - read and enjoy!