The Infodemic Tomograph (InTo) is an infoveillance tool that enables public health officials and healthcare managers to forecast healthcare demand given social media content during public health crises. At present, the tool presents the results of forecasting Covid-19 healthcare pressure given the positivity of text messages (tweets) from Twitter. The InTo process requires a large volume of data and is expected to be redone on a scheduled basis. This document provides a walk-through of the R code used to conduct one iteration of this process. The source code can be viewed here and is licensed for use under the GPL-3.0 License.
The basic InTo work-flow is illustrated below. This work-flow was predicated on the granularity of data accessible to the Nexus Lab and on our initial objectives when the process was created. We were particularly interested in misinformation and healthcare, so our disaggregation of the social media data considered only those categories. Epidemiological data were only available at the city scale, so extracting additional spatial features was not useful. The user can augment this process given their own objectives and data.
InTo Workflow
Before getting started, the work environment will need to be set up properly and certain dependencies will need to be installed. Users will need to have JIDT downloaded from here and unzipped into the working directory; we suggest downloading the full distribution rather than cloning the git repository. The analyst will not need a Twitter API key thanks to rtweet (though having one increases the amount of data accessible to the user), but a Google Maps API key will be necessary.
# Set work directory, clean work environment and random seed
setwd("C:/Users/USER/OneDrive/Desktop/InTo")
remove(list = ls())
set.seed(12345)
# For this workbook
library(knitr)
# For downloading twitter data
library(tidyverse)
library(lubridate)
library(rtweet)
# For analyzing twitter data
library(qdap)
library(tidytext)
library(SnowballC)
# For using the JIDT package
library("rJava")
# For conducting the time series forecasts
library(tsibble)
library(fable)
library(fabletools)
# For conducting geo-statistical kriging
library(sp)
library(automap)
opts_chunk$set(echo = T, message=F, error=F, warning=F, tidy = F)
# Set the location of interest here
loc <- "bangkok"
# Set folder path here
folder.path <- paste0("./tweetData/", loc, "/")
The beginning of the process involves acquiring the necessary data.
rtweet returns the fields provided by Twitter’s API, so see here for a full description of the data returned. For our purposes, we retain an n x 6 table where n is the number of tweets. Each row represents an entry for one tweet and the columns we retain are user_id, status_id, created_at, text, retweet_count, and coords_coords.
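The tweets themselves can be gathered with rtweet’s search endpoint. The chunk below is a minimal sketch rather than the exact query we ran; the search string, tweet count and use of lookup_coords() for the geocode are assumptions to be adapted by the user.
# Illustrative download of recent Covid-related tweets near the location of
# interest (the query, count and geocode below are assumptions, not our settings)
tweet <- search_tweets(q = "covid OR coronavirus",
                       n = 18000,
                       geocode = lookup_coords(loc),
                       include_rts = FALSE,
                       retryonratelimit = TRUE) %>%
  select(user_id, status_id, created_at, text, retweet_count, coords_coords)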
kable(head(tweet))
| user_id | status_id | created_at | text | retweet_count | coords_coords |
|---|---|---|---|---|---|
| 399872095 | 1283561909493444608 | 2020-07-16 00:39:41 | @AnnCoulter Yes - it’s part of the bad side of America - NY City - a Democratic Party hellhole, with a mayor who slaughtered thousands of citizens by sending Covid-19 infected persons into nursing homes, and spends time helping paint “Brazen Leftist Malcontent” banners onto his streets. | 0 | NA NA |
| 15277758 | 1283561495914098688 | 2020-07-16 00:38:02 | How to plan a remote funeral or memorial during the coronavirus pandemic - CNET https://t.co/8EHRID9PJc https://t.co/uuxixu6dTS | 0 | NA NA |
| 108644476 | 1283560524664504320 | 2020-07-16 00:34:11 | Seems the government are keen to whip up the foreigner covid frenzy at every opportunity. Good deflection 〜 | 0 | NA NA |
| 70439136 | 1283557298661494784 | 2020-07-16 00:21:21 | Schools shut amid coronavirus fears https://t.co/7QEw214NL0 | 1 | NA NA |
| 70439136 | 1280834111381200896 | 2020-07-08 12:00:23 | Brazil’s president contracts Covid: Virus world update https://t.co/7QEw214NL0 | 0 | NA NA |
| 936882758 | 1283556813946753024 | 2020-07-16 00:19:26 | The Interior Ministry has instructed provincial governors to tell people who had travelled to 7 Eleven on 11-7st to immediately contact the Department of Good Taste hotline. Today has also seen record numbers of either road deaths or covid, I dunno I can’t read Thai. https://t.co/XcM1i3LymS | 0 | NA NA |
The epidemiological data used by InTo are daily cases and hospitalizations. These data can be obtained from any reliable and reputable source the user decides. They should be in an n x 3 table where n is the number of days. The three columns are Date (YYYY-MM-DD), Cases (daily new cases) and Hospitalization (daily new hospitalizations). Each row represents an entry for that date. Finer spatio-temporal granularity is preferred and would necessitate the inclusion of a coordinates column; here, however, we have one metric per pixel rather than several values distributed within that pixel.
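How the table is loaded will depend on the source; below is a minimal sketch assuming a local CSV with Date, Cases and Hospitalization columns (the file path and raw column names are placeholders, not part of the InTo distribution).
# Illustrative only: read a local CSV and rename its columns to the names
# used throughout this walk-through (path and source column names assumed)
epi_data <- read_csv("./epiData/bangkok_epi.csv") %>%
  transmute(recordDate = as.Date(Date),
            daily_case = Cases,
            new_hosp = Hospitalization)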
epi_data %>%
head() %>%
kable()
| recordDate | daily_case | new_hosp |
|---|---|---|
| 2020-04-18 | 17 | NA |
| 2020-04-19 | 16 | -55 |
| 2020-04-20 | 14 | -22 |
| 2020-04-21 | 10 | -45 |
| 2020-04-22 | 8 | -115 |
| 2020-04-23 | 7 | -33 |
At this point in the work-flow the user will disaggregate the data into the subsets of interest.
Twitter data are disaggregated into three sets: all tweets, misinformation, and healthcare. Misinformation and healthcare tweets are identified as those messages containing certain key terms or phrases. These terms can be changed to whatever the user feels is most appropriate. Any other subset can be created by defining key terms and filtering the main dataset as shown below.
all_tweets <- tweet
misinfo_term <- "fake|misinformation|lie|false"
health_term <- "healthcare|hospital|test"
misinfo_tweets <- tweet %>%
filter(str_detect(text, misinfo_term))
health_tweets <- tweet %>%
filter(str_detect(text, health_term))
Epidemiological data tend to already be disaggregated into cases and hospitalizations. Nonetheless, the user may be interested in the cases or hospitalizations of specific groups, such as females, the poor, tourists, or some other group. This is not yet implemented but is quite possible if the epidemiological data contain the fields needed for such subsetting, as sketched below.
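For illustration only, if the epidemiological table carried a hypothetical grouping column, the subsetting would mirror the tweet filtering above.
# Hypothetical sketch: subset epidemiological records for a group of interest,
# assuming the data contain a "group" column (not present in our data)
female_epi <- epi_data %>%
  filter(group == "female")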
The main features of interest for the social media data are: n-grams (bi-grams), sentiments (affects and positivity), and metadata (volume, popularity). For emotions, the textdata package will need to be installed.
To begin, the text data must be preprocessed and tokenized. Here we use the qdap package to replace abbreviations, symbols, contractions and ordinals before using tidytext to separate each text into its constituent words. The InTo process takes a bag-of-words approach to sentiment quantification, so we do not consider the order of words important.
To measure positivity we use the labMT lexicon as employed here. This lexical approach involves matching words in the text against a lexicon that assigns each word a positivity score. We first match the words as they appear, then stem the unmatched words and attempt quantification again. This last step is useful because sometimes a word, like viruses, may not have a value, but its stem, like virus, will. The InTo process only uses positivity, but the user may be interested in the distribution of emotional affects in the text as well; the same lexical approach is used to determine the emotional affects associated with each word.
unnest_tweet <- all_tweets %>%
mutate(text = replace_abbreviation(text) %>%
replace_symbol() %>%
replace_contraction() %>%
replace_ordinal()) %>%
unnest_tokens("word", "text", token = "tweets", strip_punct = T, strip_url = T)
####----- Sentiment Database: labMT ------####
tweet_sentiment <- unnest_tweet %>%
# determine sentiments of words
left_join(labMT) %>%
mutate(stemmed_word = ifelse(is.na(happiness_average),
wordStem(word, language = "en"),
"")) %>%
left_join(labMT, by = c("stemmed_word" = "word")) %>%
mutate(positivity = ifelse(is.na(happiness_average.x),
happiness_average.y,
happiness_average.x)) %>%
# remove neutral words (positivity between 4 and 6)
filter(!(4 < positivity & positivity < 6)) %>%
## count the number of positive and negative words per status per user
group_by(user_id, status_id, coords_coords,
"day_created" = strftime(created_at, format = "%Y-%m-%d")) %>%
summarise(Sent = mean(positivity, na.rm = T)) %>%
ungroup() %>%
separate(col = "coords_coords", into = c("lng", "lat"), sep = " ") %>%
mutate(lng = as.numeric(lng),
lat = as.numeric(lat),
day_created = as.Date(day_created))
tweet_sentiment %>%
head() %>%
kable()
| user_id | status_id | lng | lat | day_created | Sent |
|---|---|---|---|---|---|
| 1000764871720321024 | 1280686050403213312 | NA | NA | 2020-07-08 | 5.620000 |
| 1000764871720321024 | 1280701935973068803 | NA | NA | 2020-07-08 | 6.170000 |
| 1000960419568103425 | 1277956284394827776 | NA | NA | 2020-06-30 | 6.795000 |
| 1001145719942627328 | 1257653769942130688 | NA | NA | 2020-05-05 | 5.652500 |
| 1001145719942627328 | 1269863014800306177 | NA | NA | 2020-06-08 | 6.343333 |
| 1001145719942627328 | 1270936858071990272 | NA | NA | 2020-06-11 | 6.380000 |
# Extract daily bigram counts from the tweet text, removing URLs, punctuation,
# digits, stop words and Covid-related terms
tweetBigrams <- all_tweets %>%
mutate(text = rm_twitter_url(text) %>%
str_remove_all(pattern = "[:punct:]") %>%
str_remove_all(pattern = "[:digit:]")) %>%
unnest_tokens("word", "text", token = "ngrams", n = 2) %>%
separate(col = word, into = c("word1", "word2"), sep = " ") %>%
filter(!(word1 %in% c(stop_words$word, "coronavirus", "covid19",
"#covid19", "#coronavirus", "#covid2019",
"amp", "covid", "-", "|", "19"))) %>%
filter(!(word2 %in% c(stop_words$word, "coronavirus", "covid19",
"#covid19", "#coronavirus", "#covid2019",
"amp", "covid", "-", "|", "19"))) %>%
unite(col = "pairs", c(word1, word2), sep = " ") %>%
group_by("date_created" = as.Date(created_at), pairs) %>%
count() %>%
ungroup()
tweetBigrams %>%
head() %>%
kable()
| date_created | pairs | n |
|---|---|---|
| 2020-04-17 | abruzzo region | 1 |
| 2020-04-17 | abuse online | 1 |
| 2020-04-17 | accidentally leaked | 1 |
| 2020-04-17 | acknowledges fox | 1 |
| 2020-04-17 | active lifestyles | 1 |
| 2020-04-17 | americans canada | 1 |
# Count the daily frequency of each NRC emotional affect in the tweets
tweetEmotions <- unnest_tweet %>%
inner_join(get_sentiments("nrc")) %>%
group_by("day_created" = strftime(created_at, format = "%Y-%m-%d"),
sentiment) %>%
count() %>%
ungroup()
tweetEmotions %>%
head() %>%
kable()
| day_created | sentiment | n |
|---|---|---|
| 2020-04-17 | anger | 22 |
| 2020-04-17 | anticipation | 24 |
| 2020-04-17 | disgust | 14 |
| 2020-04-17 | fear | 27 |
| 2020-04-17 | joy | 23 |
| 2020-04-17 | negative | 39 |
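The metadata features listed earlier, daily volume and popularity, can be computed directly from the tweet table. A minimal sketch is shown below; the summary column names are our own.
# Daily tweet volume and popularity (total retweets)
tweetMeta <- all_tweets %>%
  group_by("day_created" = as.Date(created_at)) %>%
  summarise(volume = n(),
            popularity = sum(retweet_count, na.rm = TRUE)) %>%
  ungroup()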
This next step involves calculating the correlation and transfer entropy between positivity and the epidemiological data. Correlation quantifies the nature and strength of the linear, atemporal relationship between the variables. Transfer entropy from positivity to the epidemiological data quantifies the directed information flow between the two time series, serving as a measure of how predictive positivity is of hospitalizations and cases. The code below normalizes positivity, cases and hospitalizations and uses those normalized values to calculate the correlation coefficient and the transfer entropy from positivity to the epidemiological data.
.jinit()
## [1] 0
# Change location of jar to match yours:
.jaddClassPath("./infodynamics-dist-1.5/infodynamics.jar")
teCal_jidt_knl_func <- function(srcArr,dstArr,histLen,width){
# histLen: 1L as an example; width: 0.5 as an example
# Create a TE calculator and run it:
teCalc<-.jnew("infodynamics/measures/continuous/kernel/TransferEntropyCalculatorKernel")
.jcall(teCalc,"V","setProperty", "NORMALISE", "true") # Normalise the individual variables
.jcall(teCalc,"V","initialise", histLen, width) # Use history length 1 (Schreiber k=1), kernel width of 0.5 normalised units
.jcall(teCalc,"V","setObservations", srcArr, dstArr)
# Compute the average transfer entropy over all observations:
result <- .jcall(teCalc,"D","computeAverageLocalOfObservations") # bit
return(result)
}
miCal_jidt_func_lag <- function(srcArr,dstArr,time_lag){
# time_lag: string -> "1" or "2"
# different calculators JIDT provides (select one of them!):
# implementingClass <- "infodynamics/measures/continuous/kraskov/MutualInfoCalculatorMultiVariateKraskov1" # MI([1,2], [3,4]) = 0.36353
implementingClass <- "infodynamics/measures/continuous/kernel/MutualInfoCalculatorMultiVariateKernel"
# implementingClass <- "infodynamics/measures/continuous/gaussian/MutualInfoCalculatorMultiVariateGaussian"
# implementingClass <- "infodynamics/measures/continuous/kraskov/MutualInfoCalculatorMultiVariateKraskov1"
miCalc<-.jnew(implementingClass)
.jcall(miCalc,"V","setProperty", "TIME_DIFF", time_lag)
# a. Initialise the calculator to use the required number of
# dimensions for each variable:
.jcall(miCalc,"V","initialise")
# b. Supply the observations to compute the PDFs from:
.jcall(miCalc,"V","setObservations",srcArr,dstArr)
# c. Make the MI calculation:
result <- .jcall(miCalc,"D","computeAverageLocalOfObservations") # bit
return(result)
}
opt_lag4mi_func <- function(srcArr,dstArr,max_lag){
opt_lag <- 0
opt_mi <- miCal_jidt_func_lag(srcArr,dstArr,as.character(opt_lag))
for (i in seq(max_lag)) {
mi <- miCal_jidt_func_lag(srcArr,dstArr,as.character(i))
if(mi>opt_mi){
opt_mi <- mi
opt_lag <- i
} else {
next
}
}
if(opt_lag == 0){
# since the time lag for the TE calculation cannot be zero;
# if zero, it reports an error saying "source index -1 out of bounds for double[]"
opt_lag <- 1
}
result_list <- list(opt_lag = opt_lag,max_mi = opt_mi)
return(result_list)
}
te_cal_opt_hist_func <- function(src_data,dst_data,max_hist,knl_width){
te_value <- teCal_jidt_knl_func(src_data,dst_data,
as.integer(opt_lag4mi_func(src_data,dst_data,max_hist)$opt_lag),
knl_width)
return(te_value)
}
# Normalize positivity, cases and hospitalizations, then compute the
# correlation coefficients and transfer entropies from positivity
pos_epi <- tweet_sentiment %>%
left_join(epi_data, by = c("day_created" = "recordDate")) %>%
mutate(norm_sent = (Sent-min(tweet_sentiment$Sent,na.rm = T))/(max(tweet_sentiment$Sent,na.rm = T)-min(tweet_sentiment$Sent,na.rm = T)),
norm_hosp = (new_hosp-min(epi_data$new_hosp,na.rm = T))/(max(epi_data$new_hosp,na.rm = T)-min(epi_data$new_hosp,na.rm = T)),
norm_case = (daily_case-min(epi_data$daily_case,na.rm = T))/(max(epi_data$daily_case,na.rm = T)-min(epi_data$daily_case, na.rm = T))) %>%
ungroup() %>%
summarise(
pos_case_cor = cor(x = norm_sent, y = norm_case, use = "complete.obs"),
pos_hosp_cor = cor(y = norm_hosp, x = norm_sent, use = "complete.obs"),
pos_case_te = te_cal_opt_hist_func(src_data = norm_sent, dst_data = norm_case,
max_hist = 1L, knl_width = 0.5),
pos_hosp_te = te_cal_opt_hist_func(src_data = norm_sent, dst_data = norm_hosp,
max_hist = 1L, knl_width = 0.5)
)
pos_epi %>%
kable()
| pos_case_cor | pos_hosp_cor | pos_case_te | pos_hosp_te |
|---|---|---|---|
| 0.0143249 | 0.0104277 | 0 | 0 |
Here we employ an ARIMA model to forecast the daily hospitalizations and cases given the tweet positivity. The urca package will need to be installed. We forecast hospitalization and cases based on their individual histories and their relationship to positivity, calculating the forecasted values with the fable package. This set of code first forecasts positivity given its autocorrelation and then extracts the upper and lower limits of the forecasted values.
# Combine positivity data and epidemiological data and convert to a time series
pos_epi_ts <- tweet_sentiment %>%
left_join(epi_data, by = c("day_created" = "recordDate")) %>%
group_by(day_created) %>%
summarise(positivity = mean(Sent, na.rm = T),
hospitalization = mean(new_hosp, na.rm = T),
cases = mean(daily_case, na.rm = T)) %>%
ungroup() %>%
filter(day_created <= ymd("2020-07-01")) %>%
as_tsibble(index = day_created)
# Forecast positivity given its autocorrelation
pos_fcast <- pos_epi_ts %>%
model(pos.arima = ARIMA(positivity)) %>%
forecast(h = 14)
# Store the forecasted positivity as a time series for later use
newdata_pos <- as_tibble(pos_fcast) %>%
select(day_created, "positivity" = .mean) %>%
as_tsibble(index = day_created)
# Forecast hospitalization given its history and relationship to positivity
pos_hosp_fcast <- pos_epi_ts %>%
model(arima.auto = ARIMA(hospitalization ~ positivity)) %>%
forecast(new_data = newdata_pos)
hosp_fcast_data <- pos_hosp_fcast %>%
hilo() %>%
unpack_hilo(c(`80%`, `95%`)) %>%
select(day_created, "hosp_fcast" = .mean,
`80%_lower`, `80%_upper`, `95%_lower`, `95%_upper`) %>%
as_tibble()
hosp_fcast_data %>%
head() %>%
kable()
| day_created | hosp_fcast | 80%_lower | 80%_upper | 95%_lower | 95%_upper |
|---|---|---|---|---|---|
| 2020-07-02 | -0.6719441 | -17.65083 | 16.30694 | -26.63890 | 25.29502 |
| 2020-07-03 | -0.6719441 | -18.67218 | 17.32829 | -28.20093 | 26.85704 |
| 2020-07-04 | -0.6719441 | -19.63862 | 18.29473 | -29.67897 | 28.33508 |
| 2020-07-05 | -0.6719441 | -20.55814 | 19.21425 | -31.08526 | 29.74137 |
| 2020-07-06 | -0.6719441 | -21.43699 | 20.09310 | -32.42933 | 31.08545 |
| 2020-07-07 | -0.6719441 | -22.28012 | 20.93623 | -33.71879 | 32.37490 |
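Although not part of the core work-flow, the 14-day hospitalization forecast and its intervals can be visualised directly from the fable object; a minimal sketch:
# Optional: plot the forecasted hospitalizations with 80% and 95% intervals
pos_hosp_fcast %>%
  autoplot(pos_epi_ts) +
  labs(x = "Date", y = "New hospitalizations") +
  theme_bw()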
pos_case_fcast <- pos_epi_ts %>%
model(arima.auto = ARIMA(cases ~ positivity)) %>%
forecast(new_data = newdata_pos)
case_fcast_data <- pos_case_fcast %>%
hilo() %>%
unpack_hilo(c(`80%`, `95%`)) %>%
select(day_created, "hosp_fcast" = .mean,
`80%_lower`, `80%_upper`, `95%_lower`, `95%_upper`) %>%
as_tibble()
case_fcast_data %>%
head() %>%
kable()
| day_created | case_fcast | 80%_lower | 80%_upper | 95%_lower | 95%_upper |
|---|---|---|---|---|---|
| 2020-07-02 | 0.5531869 | -3.652086 | 4.758460 | -5.878223 | 6.984597 |
| 2020-07-03 | 0.3788096 | -3.979387 | 4.737007 | -6.286477 | 7.044097 |
| 2020-07-04 | 0.2044324 | -4.301501 | 4.710366 | -6.686799 | 7.095663 |
| 2020-07-05 | 0.0300552 | -4.618923 | 4.679033 | -7.079943 | 7.140053 |
| 2020-07-06 | -0.1443221 | -4.932073 | 4.643428 | -7.466555 | 7.177911 |
| 2020-07-07 | -0.3186993 | -5.241312 | 4.603913 | -7.847185 | 7.209787 |
Below we illustrate the performance of our approach by plotting the fitted values against the observed values over the full time series of hospitalizations.
pos_hosp_model <- pos_epi_ts %>%
model(arima.auto = ARIMA(hospitalization ~ positivity))
augment(pos_hosp_model) %>%
ggplot(aes(x = day_created)) +
geom_line(aes(y = hospitalization, color = "Observed")) +
geom_line(aes(y = .fitted, color = "Forecasted")) +
labs(x="Date", y = "", color = "Values") +
theme_bw() +  # apply the complete theme first so the aspect ratio below is kept
theme(aspect.ratio = 5/7)
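As a complement to the visual comparison, fabletools can summarise the in-sample fit of the same model; a brief sketch using the model object fitted above:
# Optional: training-set accuracy measures (RMSE, MAE, etc.) for the model
pos_hosp_model %>%
  accuracy() %>%
  kable()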
### Spatial Forecast

Now we can use geostatistical kriging to forecast the spatial spread of hospitalization given the positivity.
# Join tweet-level positivity with the epidemiological data and keep complete, geolocated records
loc_pos_epi <- tweet_sentiment %>%
left_join(epi_data, by = c("day_created" = "recordDate")) %>%
na.omit() %>%
distinct()
coordinates(loc_pos_epi) <- ~lat+lng
loc_pos_epi <- loc_pos_epi[which(!duplicated(loc_pos_epi@coords)),]
# Look up the location's bounding box and draw random points within it as prediction locations
bbox <- lookup_coords(loc)
random_points <- data.frame("lng" = c(runif(10000, min = bbox$box[1], max = bbox$box[3])),
"lat" = c(runif(10000, min = bbox$box[2], max = bbox$box[4])))
coordinates(random_points) <- ~lat+lng
krige_pos <- autoKrige(formula = Sent ~ 1,
input_data = loc_pos_epi,
new_data = random_points)
## [using ordinary kriging]
krige_pos_output <- krige_pos$krige_output %>%
as.data.frame() %>%
rename("Sent" = var1.pred)
coordinates(krige_pos_output) <- ~lat+lng
krige_hosp <- autoKrige(formula = new_hosp ~ Sent,
input_data = loc_pos_epi,
new_data = krige_pos_output)
## [using universal kriging]
krige_hosp_output <- krige_hosp$krige_output %>%
as.data.frame() %>%
rename("Sent" = var1.pred)