For this analysis, we will capture, process, analyze, store, and visualize tweets from @VaxHuntersCan in order to build a map of Covid19 vaccine clinic, supply, and eligibility data in Canada.
Vaccine Hunters Canada is a volunteer organization helping Canadians navigate the patchwork of Covid19 vaccine clinics, supply, and eligibility criteria. Primarily operating on Twitter & Discord, @VaxHuntersCan has created a crowd-sourced hub for Covid19 vaccine information in Canada. In addition to the usual metadata that the Twitter API gives developers access to, @VaxHuntersCan tweets contain a common data structure from which we can parse useful location, supply, and eligibility information.
This project would not be possible without the great work being done by the creators & maintainers of @VaxHuntersCan.
Load Project Libraries
library(rtweet)
library(stringr)
library(dplyr)
library(mapboxapi)
library(prettydoc)
library(leaflet)
library(leaflet.providers)
library(shiny)
library(shinydashboard)
library(ggplot2)
library(DT)
library(googleLanguageR)
library(tidyr)
Authenticate to Twitter API
First, a Twitter developer account is required. Request one here to start building apps that utilize the Twitter API.
Next, create a new app project in your Twitter developer console, and copy and paste your api_key, api_secret_key, access_token, and access_token_secret into the relevant fields of the twitter_auth.R file, which we will source in the snippet below and use as environment variables to generate an auth token.
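For reference, the twitter_auth.R file might look something like the minimal sketch below; the placeholder strings are assumptions and should be replaced with the credentials from your developer console.

# twitter_auth.R (a minimal sketch; replace the placeholders with your own keys)
api_key             <- "YOUR_API_KEY"
api_secret_key      <- "YOUR_API_SECRET_KEY"
access_token        <- "YOUR_ACCESS_TOKEN"
access_token_secret <- "YOUR_ACCESS_TOKEN_SECRET"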
source("twitter_auth.R")
<- create_token(
token app = "VaxHuntersTweetBot",
consumer_key = api_key,
consumer_secret = api_secret_key,
access_token = access_token,
access_secret = access_token_secret
)
Pull Tweets from @VaxHuntersCan Timeline
Here, we use the get_timeline() function from the rtweet package to pull tweets from the @VaxHuntersCan Twitter timeline. This function creates a data frame which we will call vax_tweets. We then use the colnames() function to take a look at the column headers in the returned data frame.
vax_tweets <- rtweet::get_timeline("VaxHuntersCan", n = 3000, parse = TRUE)
colnames(vax_tweets)
## [1] "user_id" "status_id"
## [3] "created_at" "screen_name"
## [5] "text" "source"
## [7] "display_text_width" "reply_to_status_id"
## [9] "reply_to_user_id" "reply_to_screen_name"
## [11] "is_quote" "is_retweet"
## [13] "favorite_count" "retweet_count"
## [15] "quote_count" "reply_count"
## [17] "hashtags" "symbols"
## [19] "urls_url" "urls_t.co"
## [21] "urls_expanded_url" "media_url"
## [23] "media_t.co" "media_expanded_url"
## [25] "media_type" "ext_media_url"
## [27] "ext_media_t.co" "ext_media_expanded_url"
## [29] "ext_media_type" "mentions_user_id"
## [31] "mentions_screen_name" "lang"
## [33] "quoted_status_id" "quoted_text"
## [35] "quoted_created_at" "quoted_source"
## [37] "quoted_favorite_count" "quoted_retweet_count"
## [39] "quoted_user_id" "quoted_screen_name"
## [41] "quoted_name" "quoted_followers_count"
## [43] "quoted_friends_count" "quoted_statuses_count"
## [45] "quoted_location" "quoted_description"
## [47] "quoted_verified" "retweet_status_id"
## [49] "retweet_text" "retweet_created_at"
## [51] "retweet_source" "retweet_favorite_count"
## [53] "retweet_retweet_count" "retweet_user_id"
## [55] "retweet_screen_name" "retweet_name"
## [57] "retweet_followers_count" "retweet_friends_count"
## [59] "retweet_statuses_count" "retweet_location"
## [61] "retweet_description" "retweet_verified"
## [63] "place_url" "place_name"
## [65] "place_full_name" "place_type"
## [67] "country" "country_code"
## [69] "geo_coords" "coords_coords"
## [71] "bbox_coords" "status_url"
## [73] "name" "location"
## [75] "description" "url"
## [77] "protected" "followers_count"
## [79] "friends_count" "listed_count"
## [81] "statuses_count" "favourites_count"
## [83] "account_created_at" "verified"
## [85] "profile_url" "profile_expanded_url"
## [87] "account_lang" "profile_banner_url"
## [89] "profile_background_url" "profile_image_url"
Pre-process Tweets
Using a custom helper function unescape_html() in conjunction with sapply(), we start cleaning up the vax_tweets$text column by un-escaping HTML values using the xml2 package. This cleans up the text of the tweets so that encodings like ‘&amp;’ end up displaying as ‘&’ instead. Then, we replace carriage return & line feed characters with blank spaces to further clean up the vax_tweets$text field.
# Pre-process tweets by unescaping xml/html values using the xml2 package
unescape_html <- function(str) {
  xml2::xml_text(xml2::read_html(paste0("<x>", str, "</x>")))
}

vax_tweets$text <- sapply(vax_tweets$text, unescape_html, USE.NAMES = FALSE)

# Replace carriage & line returns using stringr
vax_tweets$text <- str_replace_all(vax_tweets$text, "[\r\n]", " ")
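As a quick illustration of what the un-escaping step does, here is a hypothetical one-off call (not from the original tweets) on a string containing an escaped ampersand:

# Hypothetical example: an escaped ampersand in raw tweet text is restored
unescape_html("Pfizer &amp; Moderna doses available")
# expected output: "Pfizer & Moderna doses available"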
Separate date & time fields
# Split 'created_at' field into date & time fields
vax_tweets$date <- as.Date(vax_tweets$created_at)
vax_tweets$time <- format(vax_tweets$created_at, "%H:%M:%S")
# Display the cleaned tweet text (column 5) in an interactive DT table
datatable(head(vax_tweets[5], 3000), options = list(
  columnDefs = list(list(className = 'dt-left', targets = "_all", width = 'auto')),
  pageLength = 5,
  lengthMenu = c(5, 10),
  fontSize = '50%'
))
Simple plot of the frequency of @VaxHuntersCan tweets over time
# Group by and tally the number of tweets per day
daily_vax_tweets <- vax_tweets %>% group_by(date) %>% tally()

g <- ggplot(daily_vax_tweets, aes(date, n)) +
  geom_smooth(color = "blueviolet") +
  geom_line(color = "firebrick")
g
Authenticate to Google Cloud Natural Language API & Perform Entity Analysis
Next, we will send the cleaned tweets to the Google Natural Language API using the googleLanguageR package. The goal here is to perform Entity Analysis on the tweets using the analyzeEntities method to extract address information from vax_tweets$text and return the result to nlp_result.
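The chunk below assumes we have already authenticated to Google Cloud. A minimal sketch of that step, assuming a service-account JSON key file (the filename here is a placeholder), would be:

# Authenticate googleLanguageR with a service-account key (placeholder filename)
gl_auth("gcp_service_account_key.json")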
# Run entity analysis on the first 100 cleaned tweets
nlp_result <- gl_nlp(vax_tweets$text[1:100], nlp_type = "analyzeEntities")

head(nlp_result[["entities"]][[80]]$name, 10)
## [1] "0"
## [2] "40"
## [3] "45"
## [4] "45 Innovator Ave"
## [5] "45 Innovator Ave, Whitchurch-Stouffville"
## [6] "66"
## [7] "ANYONE"
## [8] "COVID19ON #vhcON"
## [9] "L4A 0G4"
## [10] "ON"
# Helper functions to unnest the returned NLP object from Google
depth <- function(this) ifelse(is.list(this), 1L + max(sapply(this, depth)), 0L)

bind_at_any_depth <- function(l) {
  if (depth(l) == 2) {
    return(bind_rows(l))
  } else {
    l <- at_depth(l, depth(l) - 2, bind_rows)
    bind_at_any_depth(l)
  }
}
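Note that at_depth() comes from older versions of purrr and has since been removed; if the recursion above errors on a newer purrr release, swapping in modify_depth() should behave equivalently:

# Equivalent line for newer purrr versions, where at_depth() no longer exists
l <- purrr::modify_depth(l, depth(l) - 2, bind_rows)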
# Flatten the nested list of entity results into a single tibble
x <- bind_at_any_depth(nlp_result[["entities"]])
str(x)
## tibble [861 x 24] (S3: tbl_df/tbl/data.frame)
## $ name : chr [1:861] "18" "@MounikaReddyRa1" "@VaxHuntersON" "AstraZeneca" ...
## $ type : chr [1:861] "NUMBER" "PERSON" "PERSON" "ORGANIZATION" ...
## $ salience : num [1:861] 0 0.2994 0.2994 0.2218 0.0557 ...
## $ mid : chr [1:861] NA NA NA "/m/028zbs" ...
## $ wikipedia_url : chr [1:861] NA NA NA "https://en.wikipedia.org/wiki/AstraZeneca" ...
## $ value : chr [1:861] "18" NA NA NA ...
## $ magnitude : num [1:861] NA NA NA NA NA NA NA NA NA NA ...
## $ score : num [1:861] NA NA NA NA NA NA NA NA NA NA ...
## $ beginOffset : int [1:861] 45 0 17 59 78 78 78 78 41 49 ...
## $ mention_type : chr [1:861] "TYPE_UNKNOWN" "PROPER" "PROPER" "PROPER" ...
## $ year : chr [1:861] NA NA NA NA ...
## $ street_number : chr [1:861] NA NA NA NA ...
## $ street_name : chr [1:861] NA NA NA NA ...
## $ locality : chr [1:861] NA NA NA NA ...
## $ country : chr [1:861] NA NA NA NA ...
## $ broad_region : chr [1:861] NA NA NA NA ...
## $ narrow_region : chr [1:861] NA NA NA NA ...
## $ postal_code : chr [1:861] NA NA NA NA ...
## $ area_code : chr [1:861] NA NA NA NA ...
## $ number : chr [1:861] NA NA NA NA ...
## $ national_prefix: chr [1:861] NA NA NA NA ...
## $ day : chr [1:861] NA NA NA NA ...
## $ month : chr [1:861] NA NA NA NA ...
## $ sublocality : chr [1:861] NA NA NA NA ...
addresses <- x %>% filter(type == "ADDRESS")
addresses
## # A tibble: 19 x 24
## name type salience mid wikipedia_url value magnitude score beginOffset
## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <int>
## 1 1510~ ADDR~ 0 <NA> <NA> <NA> NA NA 85
## 2 1472~ ADDR~ 0 <NA> <NA> <NA> NA NA 47
## 3 1472~ ADDR~ 0 <NA> <NA> <NA> NA NA 47
## 4 Vaug~ ADDR~ 0 <NA> <NA> <NA> NA NA 59
## 5 1510~ ADDR~ 0 <NA> <NA> <NA> NA NA 86
## 6 1055~ ADDR~ 0 <NA> <NA> <NA> NA NA 32
## 7 1472~ ADDR~ 0 <NA> <NA> <NA> NA NA 49
## 8 45 I~ ADDR~ 0 <NA> <NA> <NA> NA NA 140
## 9 60 Q~ ADDR~ 0 <NA> <NA> <NA> NA NA 37
## 10 1633~ ADDR~ 0 <NA> <NA> <NA> NA NA 110
## 11 1633~ ADDR~ 0 <NA> <NA> <NA> NA NA 110
## 12 5600~ ADDR~ 0 <NA> <NA> <NA> NA NA 98
## 13 5600~ ADDR~ 0 <NA> <NA> <NA> NA NA 98
## 14 60 Q~ ADDR~ 0 <NA> <NA> <NA> NA NA 211
## 15 7070~ ADDR~ 0 <NA> <NA> <NA> NA NA 41
## 16 L5N2~ ADDR~ 0 <NA> <NA> <NA> NA NA 82
## 17 3029~ ADDR~ 0 <NA> <NA> <NA> NA NA 45
## 18 L5N2~ ADDR~ 0 <NA> <NA> <NA> NA NA 89
## 19 L5N2~ ADDR~ 0 <NA> <NA> <NA> NA NA 89
## # ... with 15 more variables: mention_type <chr>, year <chr>,
## # street_number <chr>, street_name <chr>, locality <chr>, country <chr>,
## # broad_region <chr>, narrow_region <chr>, postal_code <chr>,
## # area_code <chr>, number <chr>, national_prefix <chr>, day <chr>,
## # month <chr>, sublocality <chr>
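Because the same clinic address often appears in several tweets (and sometimes several times within one tweet), it may be worth deduplicating before geocoding to avoid redundant API calls; a minimal sketch using dplyr:

# Drop repeated address strings before geocoding (illustrative, not part of the original workflow)
unique_addresses <- addresses %>% distinct(name, .keep_all = TRUE)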
Authenticate to Mapbox API
Similar to the Twitter authentication step above, here we authenticate with Mapbox, which allows us to send the addresses extracted from vax_tweets to the Mapbox (Forward) Geocoding API, which takes in addresses and returns coordinates (longitude/latitude). The free tier through Mapbox allows for 100,000 API requests per month.
# Source Mapbox credentials
source(file = "mapbox_auth.R")
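The mapbox_auth.R file is not shown in this post; a possible minimal version is sketched below, with the token string as a placeholder (the mapboxapi package can also store a token for you via mb_access_token()).

# mapbox_auth.R (a possible minimal version; the token value is a placeholder)
access_token <- "pk.your-mapbox-public-token"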
Format & Send Request to Mapbox Geocoding API
Now that we’ve authenticated to the Mapbox API, we can use the mb_geocode() function to convert street addresses into coordinates (longitude/latitude), which will allow us to plot points on a leaflet map.
One-off Geocoding Request
# Geocode a single, well-known place as a sanity check
geocoded_address <- mb_geocode(
  search_text = "CN Tower",
  endpoint = "mapbox.places",
  limit = 1,
  types = NULL,
  search_within = NULL,
  language = NULL,
  output = "coordinates",
  access_token = access_token
)
geocoded_address
## [1] -79.38716 43.64264
Batch Geocoding Request
# Geocode every extracted address
geocoded_addresses <- as.data.frame(lapply(addresses$name, mb_geocode))
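The lapply() call above yields one longitude/latitude pair per address, spread across the columns of a wide data frame. Before mapping, it may be easier to work with a one-row-per-address layout; a sketch of that reshaping, assuming mb_geocode() keeps returning length-two longitude/latitude vectors:

# Reshape into one row per address with explicit lng/lat columns (illustrative)
coords <- do.call(rbind, lapply(addresses$name, mb_geocode))
geocoded_df <- data.frame(
  address = addresses$name,
  lng = coords[, 1],
  lat = coords[, 2]
)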
Shiny & Leaflet (Next Steps)
Using the leaflet and shiny packages, we will build an interactive map to visualize the data in our vax_tweets data frame.
# Next step: wrap the leaflet map below in a Shiny dashboard
# ui <- dashboardPage(
#   dashboardHeader(),
#   dashboardSidebar(),
#   dashboardBody(
#     tags$style(type = "text/css", "#map {height: calc(100vh - 80px) !important;}"),
#     leafletOutput("map")
#   )
# )

# server <- function(input, output) {
#   output$map <- renderLeaflet({
leaflet() %>%
  # Base groups
  addTiles(group = "OSM (default)") %>%
  addProviderTiles(providers$CartoDB.DarkMatter, group = "Dark") %>%
  addProviderTiles(providers$Esri.WorldImagery, group = "Satellite") %>%
  # Layers control
  addLayersControl(
    baseGroups = c("OSM (default)", "Dark", "Satellite"),
    options = layersControlOptions(collapsed = FALSE)
  ) %>%
  addMarkers(-79.38716, 43.64264) %>%
  setView(-79.38716, 43.64264, 4) %>%
  addProviderTiles(providers$Stamen.TonerLines,
                   options = providerTileOptions(opacity = 0.35)) %>%
  addProviderTiles(providers$CartoDB.VoyagerOnlyLabels)
#   })
# }

# runApp(shinyApp(ui, server), launch.browser = FALSE)
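Once the addresses have been geocoded, the same map can display them directly; a sketch using the hypothetical geocoded_df data frame from the batch-geocoding step above:

# Plot each geocoded address as a marker with the address text as a popup (illustrative)
leaflet(geocoded_df) %>%
  addTiles() %>%
  addMarkers(lng = ~lng, lat = ~lat, popup = ~address) %>%
  setView(-79.38716, 43.64264, zoom = 5)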