For this analysis, we will capture, process, analyze, store, and visualize tweets from @VaxHuntersCan in order to build a map of Covid19 vaccine clinic, supply, and eligibility data in Canada.

Vaccine Hunters Canada is a volunteer organization helping Canadians navigate the patchwork of Covid19 vaccine clinics, supply, and eligibility criteria. Primarily operating on Twitter & Discord, @VaxHuntersCan has created a crowd-sourced hub for Covid19 vaccine information in Canada. In addition to the usual metadata that the Twitter API gives developers access to, @VaxHuntersCan tweets contain a common data structure from which we can parse useful location, supply, and eligibility information.

This project would not be possible without the great work being done by the creators & maintainers of @VaxHuntersCan.

Load Project Libraries

library(rtweet)
library(stringr)
library(dplyr)
library(mapboxapi)
library(prettydoc)
library(leaflet)
library(leaflet.providers)
library(shiny)
library(shinydashboard)
library(ggplot2)
library(DT)
library(googleLanguageR)
library(tidyr)

Authenticate to Twitter API

First, a Twitter developer account is required. Request one here to start building apps that utilize the Twitter API.

Next, create a new app project in your Twitter developer console, then copy your api_key, api_secret_key, access_token, and access_token_secret into the relevant fields of the twitter_auth.R file. We source that file in the snippet below and use those values to generate an auth token.
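
A minimal sketch of what twitter_auth.R might contain; the variable names match the create_token() call below, and the values are placeholders for your own credentials:

# twitter_auth.R (sketch) -- replace the placeholder strings with your own keys
api_key             <- "YOUR_API_KEY"
api_secret_key      <- "YOUR_API_SECRET_KEY"
access_token        <- "YOUR_ACCESS_TOKEN"
access_token_secret <- "YOUR_ACCESS_TOKEN_SECRET"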

source("twitter_auth.R")
  
token <- create_token(
  app = "VaxHuntersTweetBot",
  consumer_key = api_key,
  consumer_secret = api_secret_key,
  access_token = access_token,
  access_secret = access_token_secret
)

Pull Tweets from @VaxHuntersCan Timeline

Here, we use the get_timeline() function from the rtweet package to pull tweets from the @VaxHuntersCan Twitter timeline. This function returns a data frame, which we call vax_tweets. We then use the colnames() function to take a look at the column headers of the returned data frame.

vax_tweets <- rtweet::get_timeline("VaxHuntersCan", n = 3000, parse = TRUE)

colnames(vax_tweets)
##  [1] "user_id"                 "status_id"              
##  [3] "created_at"              "screen_name"            
##  [5] "text"                    "source"                 
##  [7] "display_text_width"      "reply_to_status_id"     
##  [9] "reply_to_user_id"        "reply_to_screen_name"   
## [11] "is_quote"                "is_retweet"             
## [13] "favorite_count"          "retweet_count"          
## [15] "quote_count"             "reply_count"            
## [17] "hashtags"                "symbols"                
## [19] "urls_url"                "urls_t.co"              
## [21] "urls_expanded_url"       "media_url"              
## [23] "media_t.co"              "media_expanded_url"     
## [25] "media_type"              "ext_media_url"          
## [27] "ext_media_t.co"          "ext_media_expanded_url" 
## [29] "ext_media_type"          "mentions_user_id"       
## [31] "mentions_screen_name"    "lang"                   
## [33] "quoted_status_id"        "quoted_text"            
## [35] "quoted_created_at"       "quoted_source"          
## [37] "quoted_favorite_count"   "quoted_retweet_count"   
## [39] "quoted_user_id"          "quoted_screen_name"     
## [41] "quoted_name"             "quoted_followers_count" 
## [43] "quoted_friends_count"    "quoted_statuses_count"  
## [45] "quoted_location"         "quoted_description"     
## [47] "quoted_verified"         "retweet_status_id"      
## [49] "retweet_text"            "retweet_created_at"     
## [51] "retweet_source"          "retweet_favorite_count" 
## [53] "retweet_retweet_count"   "retweet_user_id"        
## [55] "retweet_screen_name"     "retweet_name"           
## [57] "retweet_followers_count" "retweet_friends_count"  
## [59] "retweet_statuses_count"  "retweet_location"       
## [61] "retweet_description"     "retweet_verified"       
## [63] "place_url"               "place_name"             
## [65] "place_full_name"         "place_type"             
## [67] "country"                 "country_code"           
## [69] "geo_coords"              "coords_coords"          
## [71] "bbox_coords"             "status_url"             
## [73] "name"                    "location"               
## [75] "description"             "url"                    
## [77] "protected"               "followers_count"        
## [79] "friends_count"           "listed_count"           
## [81] "statuses_count"          "favourites_count"       
## [83] "account_created_at"      "verified"               
## [85] "profile_url"             "profile_expanded_url"   
## [87] "account_lang"            "profile_banner_url"     
## [89] "profile_background_url"  "profile_image_url"

Pre-process Tweets

Using a custom helper function, unescape_html(), in conjunction with sapply(), we start cleaning up the vax_tweets$text column by un-escaping HTML entities with the xml2 package, so that encodings like ‘&amp;’ display as ‘&’ instead. Then, we replace carriage-return and line-feed characters with blank spaces to further clean up the vax_tweets$text field.

# Pre-process tweets by unescaping xml/html values using xml2 package
unescape_html <- function(str)
  {xml2::xml_text(xml2::read_html(paste0("<x>",str,"</x>")))}

vax_tweets$text <- sapply(vax_tweets$text, unescape_html, USE.NAMES = FALSE)

# Replace carriage & line returns using stringr
vax_tweets$text <- str_replace_all(vax_tweets$text, "[\r\n]" , " ")
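
As a quick check of the helper on a made-up string, escaped entities should come back as plain characters:

# Quick check on a made-up example: "&amp;" should come back as "&"
unescape_html("Pfizer &amp; Moderna appointments available")
# expected: "Pfizer & Moderna appointments available"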

Separate date & time fields

# Split 'created_at' field into date & time fields
vax_tweets$date <- as.Date(vax_tweets$created_at)
vax_tweets$time <- format(vax_tweets$created_at,"%H:%M:%S")
# Display the cleaned tweet text column in an interactive table
datatable(head(vax_tweets[5], 3000), options = list(
  columnDefs = list(list(className = 'dt-left', targets = "_all",width='auto')),
  pageLength = 5,
  lengthMenu = c(5,10),
  fontSize='50%'
))

Simple plot of the frequency of @VaxHuntersCan tweets over time

# Group by and tally the number of tweets per day
daily_vax_tweets <- vax_tweets %>% group_by(date) %>% tally()

# Plot daily tweet counts: raw line plus a smoothed trend
g <- ggplot(daily_vax_tweets, aes(date, n)) +
  geom_smooth(color = "blueviolet") +
  geom_line(color = "firebrick")

g

Authenticate to Google Cloud Natural Language API & Perform Entity Analysis

Next, we send the cleaned tweets to the Google Cloud Natural Language API using the googleLanguageR package. The goal here is to perform entity analysis on the tweets with the analyzeEntities method, extract address information from vax_tweets$text, and store the result in nlp_result.
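
googleLanguageR authenticates with a Google Cloud service-account key. A minimal sketch, assuming the JSON key file is saved locally as gcp_auth.json (a placeholder filename):

# Authenticate to the Google Cloud Natural Language API with a service-account key
# (the filename below is a placeholder -- point it at your own JSON key file)
gl_auth("gcp_auth.json")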

# Send the first 100 cleaned tweets for entity analysis
nlp_result <- gl_nlp(vax_tweets$text[1:100], nlp_type = "analyzeEntities")

head(nlp_result[["entities"]][[80]]$name,10)
##  [1] "0"                                       
##  [2] "40"                                      
##  [3] "45"                                      
##  [4] "45 Innovator Ave"                        
##  [5] "45 Innovator Ave, Whitchurch-Stouffville"
##  [6] "66"                                      
##  [7] "ANYONE"                                  
##  [8] "COVID19ON #vhcON"                        
##  [9] "L4A 0G4"                                 
## [10] "ON"
# Helper functions to unnest the nested NLP object returned by Google
# depth() reports how deeply a list is nested
depth <- function(this) ifelse(is.list(this), 1L + max(sapply(this, depth)), 0L)

# bind_at_any_depth() recursively row-binds a nested list of data frames into one tibble
bind_at_any_depth <- function(l) {
  if (depth(l) == 2) {
    return(bind_rows(l))
  } else {
    l <- purrr::modify_depth(l, depth(l) - 2, bind_rows)
    bind_at_any_depth(l)
  }
}
# Flatten the nested entities results into a single tibble
x <- bind_at_any_depth(nlp_result[["entities"]])
str(x)
## tibble [861 x 24] (S3: tbl_df/tbl/data.frame)
##  $ name           : chr [1:861] "18" "@MounikaReddyRa1" "@VaxHuntersON" "AstraZeneca" ...
##  $ type           : chr [1:861] "NUMBER" "PERSON" "PERSON" "ORGANIZATION" ...
##  $ salience       : num [1:861] 0 0.2994 0.2994 0.2218 0.0557 ...
##  $ mid            : chr [1:861] NA NA NA "/m/028zbs" ...
##  $ wikipedia_url  : chr [1:861] NA NA NA "https://en.wikipedia.org/wiki/AstraZeneca" ...
##  $ value          : chr [1:861] "18" NA NA NA ...
##  $ magnitude      : num [1:861] NA NA NA NA NA NA NA NA NA NA ...
##  $ score          : num [1:861] NA NA NA NA NA NA NA NA NA NA ...
##  $ beginOffset    : int [1:861] 45 0 17 59 78 78 78 78 41 49 ...
##  $ mention_type   : chr [1:861] "TYPE_UNKNOWN" "PROPER" "PROPER" "PROPER" ...
##  $ year           : chr [1:861] NA NA NA NA ...
##  $ street_number  : chr [1:861] NA NA NA NA ...
##  $ street_name    : chr [1:861] NA NA NA NA ...
##  $ locality       : chr [1:861] NA NA NA NA ...
##  $ country        : chr [1:861] NA NA NA NA ...
##  $ broad_region   : chr [1:861] NA NA NA NA ...
##  $ narrow_region  : chr [1:861] NA NA NA NA ...
##  $ postal_code    : chr [1:861] NA NA NA NA ...
##  $ area_code      : chr [1:861] NA NA NA NA ...
##  $ number         : chr [1:861] NA NA NA NA ...
##  $ national_prefix: chr [1:861] NA NA NA NA ...
##  $ day            : chr [1:861] NA NA NA NA ...
##  $ month          : chr [1:861] NA NA NA NA ...
##  $ sublocality    : chr [1:861] NA NA NA NA ...
addresses <- x %>% filter(type == "ADDRESS")
addresses
## # A tibble: 19 x 24
##    name  type  salience mid   wikipedia_url value magnitude score beginOffset
##    <chr> <chr>    <dbl> <chr> <chr>         <chr>     <dbl> <dbl>       <int>
##  1 1510~ ADDR~        0 <NA>  <NA>          <NA>         NA    NA          85
##  2 1472~ ADDR~        0 <NA>  <NA>          <NA>         NA    NA          47
##  3 1472~ ADDR~        0 <NA>  <NA>          <NA>         NA    NA          47
##  4 Vaug~ ADDR~        0 <NA>  <NA>          <NA>         NA    NA          59
##  5 1510~ ADDR~        0 <NA>  <NA>          <NA>         NA    NA          86
##  6 1055~ ADDR~        0 <NA>  <NA>          <NA>         NA    NA          32
##  7 1472~ ADDR~        0 <NA>  <NA>          <NA>         NA    NA          49
##  8 45 I~ ADDR~        0 <NA>  <NA>          <NA>         NA    NA         140
##  9 60 Q~ ADDR~        0 <NA>  <NA>          <NA>         NA    NA          37
## 10 1633~ ADDR~        0 <NA>  <NA>          <NA>         NA    NA         110
## 11 1633~ ADDR~        0 <NA>  <NA>          <NA>         NA    NA         110
## 12 5600~ ADDR~        0 <NA>  <NA>          <NA>         NA    NA          98
## 13 5600~ ADDR~        0 <NA>  <NA>          <NA>         NA    NA          98
## 14 60 Q~ ADDR~        0 <NA>  <NA>          <NA>         NA    NA         211
## 15 7070~ ADDR~        0 <NA>  <NA>          <NA>         NA    NA          41
## 16 L5N2~ ADDR~        0 <NA>  <NA>          <NA>         NA    NA          82
## 17 3029~ ADDR~        0 <NA>  <NA>          <NA>         NA    NA          45
## 18 L5N2~ ADDR~        0 <NA>  <NA>          <NA>         NA    NA          89
## 19 L5N2~ ADDR~        0 <NA>  <NA>          <NA>         NA    NA          89
## # ... with 15 more variables: mention_type <chr>, year <chr>,
## #   street_number <chr>, street_name <chr>, locality <chr>, country <chr>,
## #   broad_region <chr>, narrow_region <chr>, postal_code <chr>,
## #   area_code <chr>, number <chr>, national_prefix <chr>, day <chr>,
## #   month <chr>, sublocality <chr>

Authenticate to Mapbox API

Similar to the Twitter authentication step above, here we authenticate with Mapbox, which allows us to send the addresses extracted from vax_tweets to the Mapbox (Forward) Geocoding API, which takes in addresses and returns coordinates (longitude/latitude). The free tier through Mapbox allows for 100,000 API requests per month.

# Load the Mapbox access token from a local auth file
source(file = "mapbox_auth.R")
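
A minimal sketch of what mapbox_auth.R might contain; the token value is a placeholder, and mb_access_token() registers it for subsequent mapboxapi calls:

# mapbox_auth.R (sketch) -- replace the placeholder with your own Mapbox token
access_token <- "pk.YOUR_MAPBOX_ACCESS_TOKEN"
mb_access_token(access_token, install = FALSE)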

Format & Send Request to Mapbox Geocoding API

Now that we’ve authenticated to the Mapbox API, we can use the mb_geocode() function to convert street addresses into coordinates (longitude/latitude), which will allow us to plot points on a leaflet map.

One-off Geocoding Request

# Geocode a single, well-known address as a test
geocoded_address <- mb_geocode(
  search_text = "CN Tower",
  endpoint = "mapbox.places",
  limit = 1,
  types = NULL,
  search_within = NULL,
  language = NULL,
  output = "coordinates",
  access_token = access_token
)

geocoded_address
## [1] -79.38716  43.64264

Batch Geocoding Request

# Geocode each extracted address; each mb_geocode() call returns a c(longitude, latitude) pair
geocoded_addresses <- as.data.frame(do.call(rbind, lapply(addresses$name, mb_geocode)))
colnames(geocoded_addresses) <- c("lon", "lat")
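
As a quick sanity check, a rough sketch (assuming every address geocoded successfully and using the lon/lat columns built above) drops the batch results onto a basic leaflet map:

# Plot the batch-geocoded addresses on a simple leaflet map
leaflet(geocoded_addresses) %>%
  addTiles() %>%
  addMarkers(lng = ~lon, lat = ~lat, popup = addresses$name)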

Shiny & Leaflet (Next Steps)

Using the leaflet and shiny packages, we will build an interactive map to visualize the data in our vax_tweets data frame.

# Shiny dashboard skeleton (commented out for now)
# ui <- dashboardPage(
#   dashboardHeader(),
#   dashboardSidebar(),
#   dashboardBody(
#     tags$style(type = "text/css", "#map {height: calc(100vh - 80px) !important;}"),
#     leafletOutput("map")
#   )
# )

# server <- function(input, output) {
#   output$map <- renderLeaflet({
leaflet() %>%
  # Base groups
  addTiles(group = "OSM (default)") %>%
  addProviderTiles(providers$CartoDB.DarkMatter, group = "Dark") %>%
  addProviderTiles(providers$Esri.WorldImagery, group = "Satellite") %>%
  # Layers control (group names must match the groups defined above)
  addLayersControl(
    baseGroups = c("OSM (default)", "Dark", "Satellite"),
    options = layersControlOptions(collapsed = FALSE)
  ) %>%
  addMarkers(-79.38716, 43.64264) %>%
  setView(-79.38716, 43.64264, zoom = 4) %>%
  addProviderTiles(providers$Stamen.TonerLines,
    options = providerTileOptions(opacity = 0.35)) %>%
  addProviderTiles(providers$CartoDB.VoyagerOnlyLabels)
#   })
# }

# runApp(shinyApp(ui, server), launch.browser = FALSE)