Search Tweets & Analyse Hashtags with R

Jun Chen

2020-04-05

It has been over two weeks since Australia's Covid-19 travel ban started. Many people are expressing their concerns and opinions about Covid-19 on Twitter. It is interesting to find out what other topics Australians are discussing alongside Covid-19 on Twitter.

This vignette introduces how to collect data from Twitter with the rtweet package and gives an example of handling and processing text strings (hashtags) with the stringr package in R.

Search Tweets

To get started, we need to load the rtweet package:

library(rtweet)

According to Twitter trends, #Covid_19australia is the most popular hashtag in Australia at the moment, so we can start by searching for all tweets that include this hashtag. We simply send a search request to Twitter's API using the function search_tweets():

tweets <- search_tweets("#Covid_19Australia", n = 10000, include_rts = FALSE)

Within the function, n specifies the desired number of recent tweets (from roughly the past six to nine days) to return (the default is 100 and the maximum is 18,000); include_rts indicates whether to include retweets in the search results (we set it to FALSE to filter out retweets).
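If we ever need more than 18,000 tweets, search_tweets() can wait out the rate limit and keep collecting. A minimal sketch, assuming we want a larger sample (n = 50000 and the object name are purely illustrative):

# retryonratelimit = TRUE tells rtweet to pause when the search rate limit
# is hit and resume automatically until n tweets have been collected
bigger_search <- search_tweets("#Covid_19Australia", n = 50000,
                               include_rts = FALSE,
                               retryonratelimit = TRUE)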

When we run the code above, a web browser will pop up asking for our Twitter account details:

Figure 1

Then simply click “Authorize app” to interact with Twitter.

Figure 2

With rtweet it is no longer necessary to obtain a developer account to use Twitter's API, which makes interacting with Twitter much faster and easier. There are many more functions besides search_tweets() to explore, but this one is all we need for our example.
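For instance, the trending topics mentioned earlier can also be pulled directly from R with rtweet's get_trends(). A quick sketch (trends change constantly, so your output will differ):

aus_trends <- get_trends("australia")      # trending topics for Australia
head(aus_trends[, c("trend", "tweet_volume")])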

Processing Text

Now we have our dataset ready. As we know, data can be divided into quantitative data and qualitative data. Qualitative data, also known as categorical data, is descriptive and non-numerical in nature and collected through methods like observations and interviews. The tweets are this type of data and in the form of text and strings. In general, R may not be as rich and diverse as other scripting languages when it comes to string manipulation, but for continuity and consistency, it is better to stay in the same environment. On the other hand, R is very useful when it comes to computation of character strings and text.

The stringr package is included in the tidyverse, so we load tidyverse here:
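library(tidyverse)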

Before we start processing, let's check the structure of the dataset to get an idea of what the data looks like:

glimpse(tweets)
#> Observations: 6,430
#> Variables: 90
#> $ user_id                 <chr> "2470735572", "2470735572", "2470735572", "...
#> $ status_id               <chr> "1246628328724770818", "1246170563720048640...
#> $ created_at              <dbl> 43926.11, 43924.85, 43925.92, 43925.93, 439...
#> $ screen_name             <chr> "MelissackovacM", "MelissackovacM", "Meliss...
#> $ text                    <chr> "@DanielAndrewsMP\n@JennyMikakos\n@Victoria...
#> $ source                  <chr> "Twitter for Android", "Twitter for Android...
#> $ display_text_width      <dbl> 151, 249, 248, 224, 241, 142, 72, 237, 182,...
#> $ reply_to_status_id      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "12...
#> $ reply_to_user_id        <chr> "228535666", NA, NA, NA, NA, NA, NA, NA, NA...
#> $ reply_to_screen_name    <chr> "DanielAndrewsMP", NA, NA, NA, NA, NA, NA, ...
#> $ is_quote                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
#> $ is_retweet              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
#> $ favorite_count          <dbl> 0, 0, 0, 2, 0, 0, 0, 1, 2, 1, 0, 0, 1, 0, 0...
#> $ retweet_count           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 3, 0, 0...
#> $ quote_count             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ reply_count             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ hashtags                <chr> "Covid_19australia, coronavirusaus, COVID19...
#> $ symbols                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ urls_url                <chr> NA, "abc.net.au/news/health/20…", "abc.net....
#> $ urls_t.co               <chr> NA, "https://t.co/M1lKeRc4rS", "https://t.c...
#> $ urls_expanded_url       <chr> NA, "https://www.abc.net.au/news/health/202...
#> $ media_url               <chr> NA, NA, NA, NA, NA, NA, "http://pbs.twimg.c...
#> $ media_t.co              <chr> NA, NA, NA, NA, NA, NA, "https://t.co/61rlW...
#> $ media_expanded_url      <chr> NA, NA, NA, NA, NA, NA, "https://twitter.co...
#> $ media_type              <chr> NA, NA, NA, NA, NA, NA, "photo", NA, NA, NA...
#> $ ext_media_url           <chr> NA, NA, NA, NA, NA, NA, "http://pbs.twimg.c...
#> $ ext_media_t.co          <chr> NA, NA, NA, NA, NA, NA, "https://t.co/61rlW...
#> $ ext_media_expanded_url  <chr> NA, NA, NA, NA, NA, NA, "https://twitter.co...
#> $ ext_media_type          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ mentions_user_id        <chr> "228535666, 418667840, 1182090678999736321"...
#> $ mentions_screen_name    <chr> "DanielAndrewsMP, JennyMikakos, VictorianCH...
#> $ lang                    <chr> "en", "en", "en", "en", "en", "en", "und", ...
#> $ quoted_status_id        <chr> NA, NA, NA, NA, NA, NA, NA, NA, "1246384584...
#> $ quoted_text             <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Our peacef...
#> $ quoted_created_at       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 43925.44, 4...
#> $ quoted_source           <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Twitter fo...
#> $ quoted_favorite_count   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 66, 32, NA,...
#> $ quoted_retweet_count    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 40, 10, NA,...
#> $ quoted_user_id          <chr> NA, NA, NA, NA, NA, NA, NA, NA, "9304142869...
#> $ quoted_screen_name      <chr> NA, NA, NA, NA, NA, NA, NA, NA, "FarhadBand...
#> $ quoted_name             <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Farhad Ban...
#> $ quoted_followers_count  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 300, 25236,...
#> $ quoted_friends_count    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 119, 2005, ...
#> $ quoted_statuses_count   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 87, 46744, ...
#> $ quoted_location         <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Melbourne,...
#> $ quoted_description      <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Artist/mus...
#> $ quoted_verified         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, FALSE, TRUE...
#> $ retweet_status_id       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ retweet_text            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ retweet_created_at      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ retweet_source          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ retweet_favorite_count  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ retweet_retweet_count   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ retweet_user_id         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ retweet_screen_name     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ retweet_name            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ retweet_followers_count <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ retweet_friends_count   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ retweet_statuses_count  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ retweet_location        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ retweet_description     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ retweet_verified        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ place_url               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ place_name              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ place_full_name         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ place_type              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ country                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ country_code            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ geo_coords              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ coords_coords           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ bbox_coords             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ status_url              <chr> "https://twitter.com/MelissackovacM/status/...
#> $ name                    <chr> "ogden", "ogden", "ogden", "ogden", "ogden"...
#> $ location                <chr> "", "", "", "", "", "", "", "", "Sydney, Au...
#> $ description             <chr> "stuck in a rut", "stuck in a rut", "stuck ...
#> $ url                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, "https://t....
#> $ protected               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
#> $ followers_count         <dbl> 2, 2, 2, 2, 2, 182, 182, 182, 2810, 2810, 2...
#> $ friends_count           <dbl> 6, 6, 6, 6, 6, 485, 485, 485, 2640, 2640, 2...
#> $ listed_count            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 347, 347, 347, 347,...
#> $ statuses_count          <dbl> 93, 93, 93, 93, 93, 13276, 13276, 13276, 11...
#> $ favourites_count        <dbl> 5, 5, 5, 5, 5, 19794, 19794, 19794, 25417, ...
#> $ account_created_at      <dbl> 41759.49, 41759.49, 41759.49, 41759.49, 417...
#> $ verified                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
#> $ profile_url             <chr> NA, NA, NA, NA, NA, NA, NA, NA, "https://t....
#> $ profile_expanded_url    <chr> NA, NA, NA, NA, NA, NA, NA, NA, "http://www...
#> $ account_lang            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ profile_banner_url      <chr> NA, NA, NA, NA, NA, "https://pbs.twimg.com/...
#> $ profile_background_url  <chr> "http://abs.twimg.com/images/themes/theme1/...
#> $ profile_image_url       <chr> "http://abs.twimg.com/sticky/default_profil...

tweets is a large dataset with 90 variables and 6,430 observations. The text column holds the content of the tweets; since we are going to analyse hashtags, this is the column we will focus on.
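A quick look at the first few entries of text gives a feel for the raw tweets we will be parsing (your tweets will of course differ):

head(tweets$text, 3)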

First, we create a pattern to search for hashtags within the text:

hashtag_pat <- "#[a-zA-Z0-9_ー.-]+"
hashtag <- str_extract_all(tweets$text, hashtag_pat)

hashtag_pat looks for strings that start with # followed by one or more letters, digits or the special characters "_", "-", "ー" or "." (the hyphen sits at the end of the character class so it is matched literally rather than read as a range). str_extract_all() then applies the pattern to every tweet and stores the results in the list hashtag.
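To see the pattern in action, here is a quick check on a made-up tweet (the string is purely illustrative):

str_extract_all("Stay home and stay safe #auspol #Covid_19Australia #stay-at-home", hashtag_pat)
#> [[1]]
#> [1] "#auspol"            "#Covid_19Australia" "#stay-at-home"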

Second, we convert the list to a vector for further processing. In order to merge hashtags with the same content, we convert all hashtags to lowercase and remove special characters (including the leading #).

hashtag_word <- unlist(hashtag)
hashtag_word <- tolower(hashtag_word)
hashtag_word <- gsub("[[:punct:]ー]", "", hashtag_word)
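A quick sanity check at this point shows how many hashtag occurrences we extracted and how many of them are distinct (the counts vary from run to run as search results change):

length(hashtag_word)
length(unique(hashtag_word))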

Last, since our purpose is to find out what other topics Australians discuss along with Covid-19 on Twitter, all hashtags that include "covid" or "corona" are removed.

hashtag_word <- hashtag_word[!str_detect(hashtag_word, "covid")]
hashtag_word <- hashtag_word[!str_detect(hashtag_word, "corona")]
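Equivalently, the two filters above can be combined into a single str_detect() call using regular-expression alternation:

hashtag_word <- hashtag_word[!str_detect(hashtag_word, "covid|corona")]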

Analyse Data

Now we have a clean dataset, so we can count the frequency of each unique hashtag to see the top 20 most popular topics.

hashtag_count <- table(hashtag_word)
top_20_freqs <- sort(hashtag_count, decreasing = TRUE)[1:20]
top_20_freqs
#> hashtag_word
#>              auspol          stayathome           australia        rubyprincess 
#>                1125                 176                 125                 111 
#>   stayhomeaustralia    socialdistancing scottyfrommarketing   lockdownaustralia 
#>                  87                  82                  77                  70 
#>              nswpol            lockdown          auspol2020               memes 
#>                  69                  66                  60                  54 
#>   australialockdown   stayhomesavelives            insiders  scottyfromhillsong 
#>                  50                  47                  46                  42 
#>            stayhome              auspoi                  5g                 nsw 
#>                  41                  40                  33                  33

Here is a barplot of the top 20 hashtags in descending order.

as.data.frame(hashtag_word) %>%
  count(hashtag_word, sort = TRUE) %>%
  mutate(hashtag_word = reorder(hashtag_word, n)) %>%
  top_n(20) %>%
  ggplot(aes(x = hashtag_word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Hashtag",
       y = "Count",
       title = "Top 20 Popular Hashtags along with Covid19")
Figure 3

There is another package, wordcloud, that can help us visualise the ranking.

library(wordcloud)
top_20_hashtags <- as.character(as.data.frame(top_20_freqs)[,1])
wordcloud(top_20_hashtags, top_20_freqs, 
          scale=c(3.5,1.5), random.order=FALSE, rot.per=.25)
Figure 4

Through the hashtag frequency ranking, we can see that the most popular hashtag discussed along with Covid-19 in Australia is "#auspol" (in its various capitalisations), which is far more popular than any other tag. It seems many Australians are commenting on the federal government during the pandemic lockdown ;)

Resources

Here are helpful resources I used - read and enjoy!