Detecting Flu Epidemics via Search Engine Query Data

Flu epidemics constitute a major public health concern causing respiratory illnesses, hospitalizations, and deaths. According to the National Vital Statistics Reports published in October 2012, influenza ranked as the eighth leading cause of death in 2011 in the United States. Each year, 250,000 to 500,000 deaths are attributed to influenza related diseases throughout the world.

The U.S. Centers for Disease Control and Prevention (CDC) and the European Influenza Surveillance Scheme (EISS) detect influenza activity through virologic and clinical data, including Influenza-like Illness (ILI) physician visits. Reporting national and regional data, however, are published with a 1-2 week lag.

The Google Flu Trends project was initiated to see if faster reporting can be made possible by considering flu-related online search queries – data that is available almost immediately.

Understanding the Data

We would like to estimate influenza-like illness (ILI) activity using Google web search logs.

ILI Data - The CDC publishes on its website the official regional and state-level percentage of patient visits to healthcare providers for ILI purposes on a weekly basis.

Google Search Queries - Google Trends allows public retrieval of weekly counts for every query searched by users around the world. For each location, the counts are normalized by dividing the count for each query in a particular week by the total number of online search queries submitted in that location during the week. Then, the values are adjusted to be between 0 and 1.

The csv file FluTrain.csv aggregates this data from January 1, 2004 until December 31, 2011 as follows:

“Week” - The range of dates represented by this observation, in year/month/day format.

“ILI” - This column lists the percentage of ILI-related physician visits for the corresponding week.

“Queries” - This column lists the fraction of queries that are ILI-related for the corresponding week, adjusted to be between 0 and 1 (higher values correspond to more ILI-related search queries).

FluTrain = read.csv("FluTrain.csv")
head(FluTrain)

##                      Week      ILI   Queries
## 1 2004-01-04 - 2004-01-10 2.418331 0.2377158
## 2 2004-01-11 - 2004-01-17 1.809056 0.2204515
## 3 2004-01-18 - 2004-01-24 1.712024 0.2257636
## 4 2004-01-25 - 2004-01-31 1.542495 0.2377158
## 5 2004-02-01 - 2004-02-07 1.437868 0.2244356
## 6 2004-02-08 - 2004-02-14 1.324274 0.2071713

str(FluTrain)

## 'data.frame':    417 obs. of  3 variables:
##  $ Week   : Factor w/ 417 levels "2004-01-04 - 2004-01-10",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ ILI    : num  2.42 1.81 1.71 1.54 1.44 ...
##  $ Queries: num  0.238 0.22 0.226 0.238 0.224 ...

summary(FluTrain)

##                       Week          ILI            Queries       
##  2004-01-04 - 2004-01-10:  1   Min.   :0.5341   Min.   :0.04117  
##  2004-01-11 - 2004-01-17:  1   1st Qu.:0.9025   1st Qu.:0.15671  
##  2004-01-18 - 2004-01-24:  1   Median :1.2526   Median :0.28154  
##  2004-01-25 - 2004-01-31:  1   Mean   :1.6769   Mean   :0.28603  
##  2004-02-01 - 2004-02-07:  1   3rd Qu.:2.0587   3rd Qu.:0.37849  
##  2004-02-08 - 2004-02-14:  1   Max.   :7.6189   Max.   :1.00000  
##  (Other)                :411

Looking at the time period 2004-2011, which week corresponds to the highest percentage of ILI-related physician visits?

    FluTrain[which.max(FluTrain$ILI),]

##                        Week      ILI Queries
## 303 2009-10-18 - 2009-10-24 7.618892       1

Which week corresponds to the highest percentage of ILI-related query fraction?

subset(FluTrain, Queries == max(Queries))

##                        Week      ILI Queries
## 303 2009-10-18 - 2009-10-24 7.618892       1

The Distribution of Values of ILI

library(ggplot2)
library(plotly)
ggplotly(ggplot(FluTrain, aes(ILI)) + geom_histogram(bins = 10,color = "blue") + xlab("ILI") + ylab("Total ILI Visits") + ggtitle("DISTRIBITION OF ILI VALUES"))

Most of the ILI values are small, with a relatively small number of much larger values. The distribution is skewed to the right.

When handling a skewed dependent variable, it is often useful to predict the logarithm of the dependent variable instead of the dependent variable itself – this prevents the small number of unusually large or small observations from having an undue influence on the sum of squared errors of predictive models.Thus, we will predict the natural log of the ILI variable.

ggplotly(ggplot(FluTrain, aes(log(ILI), Queries)) + geom_point(color = "orange") + ggtitle("LOG(ILI) VS QUERIES"))

There is a positive, linear relationship between log(ILI) and Queries.

Linear Regression Model

Based on the plot, a linear regression model could be a good modeling choice. Thus our estimation model would be: log(ILI) = intercept + coefficient x Queries, where the coefficient is positive

FluTrend1 <- lm(log(ILI) ~ Queries , data = FluTrain)
summary(FluTrend1)

## 
## Call:
## lm(formula = log(ILI) ~ Queries, data = FluTrain)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.76003 -0.19696 -0.01657  0.18685  1.06450 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.49934    0.03041  -16.42   <2e-16 ***
## Queries      2.96129    0.09312   31.80   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2995 on 415 degrees of freedom
## Multiple R-squared:  0.709,  Adjusted R-squared:  0.7083 
## F-statistic:  1011 on 1 and 415 DF,  p-value: < 2.2e-16

The Pr value from the summary results show that Queries is a very good predictor of ILI. The R-squared value is 0.709, implying the model is a good fit.

The Correlation between Dependent and Independent Variables

cor(FluTrain$Queries, log(FluTrain$ILI))

## [1] 0.8420333

It appears that Correlation^2 is equal to the R-squared value. It can be proved that this is always the case.

Performance on the Test Set Data

The csv file FluTest.csv provides the 2012 weekly data of the ILI-related search queries and the observed weekly percentage of ILI-related physician visits.

FluTest <- read.csv("FluTest.csv")
str(FluTest)

## 'data.frame':    52 obs. of  3 variables:
##  $ Week   : Factor w/ 52 levels "2012-01-01 - 2012-01-07",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ ILI    : num  1.77 1.54 1.65 1.68 1.86 ...
##  $ Queries: num  0.594 0.499 0.501 0.479 0.471 ...

Normally, we would obtain test-set predictions from the model FluTrend1 using the code

PredTest1 = predict(FluTrend1, newdata=FluTest)

However, the dependent variable in our model is log(ILI), so PredTest1 would contain predictions of the log(ILI) value. We are instead interested in obtaining predictions of the ILI value. We can convert from predictions of log(ILI) to predictions of ILI via exponentiation function. The new code, which predicts the ILI value, is

PredTest1 = exp(predict(FluTrend1, newdata=FluTest))
head(PredTest1)

##        1        2        3        4        5        6 
## 3.520332 2.662689 2.673181 2.510160 2.451624 2.694289

What is our estimate for the percentage of ILI-related physician visits for the week of March 11, 2012?

which(FluTest$Week == "2012-03-11 - 2012-03-17")

## [1] 11

PredTest1[11]

##       11 
## 2.187378

What is the relative error betweeen the estimate (our prediction) and the observed value for the week of March 11, 2012?

observed_ILI <- FluTest$ILI[11]
Estimated_ILI <- PredTest1[11]
RE <- (observed_ILI - Estimated_ILI)/observed_ILI
RE

##         11 
## 0.04623827

What is the Root Mean Square Error (RMSE) between our estimates and the actual observations for the percentage of ILI-related physician visits, on the test set?

SSE <- sum((PredTest1 - FluTest$ILI)^2)
RMSE <- sqrt (SSE/nrow(FluTest))
RMSE

## [1] 0.7490645

Training a Time Series Model

The observations in this dataset are consecutive weekly measurements of the dependent and independent variables. This sort of dataset is called a “time series.” Often, statistical models can be improved by predicting the current value of the dependent variable using the value of the dependent variable from earlier weeks. In our models, this means we will predict the ILI variable in the current week using values of the ILI variable from previous weeks.

First, we need to decide the amount of time to lag the observations. Because the ILI variable is reported with a 1- or 2-week lag, a decision maker cannot rely on the previous week’s ILI value to predict the current week’s value. Instead, the decision maker will only have data available from 2 or more weeks ago. We will build a variable called ILILag2 that contains the ILI value from 2 weeks before the current observation.

To do so, we will use the “zoo” package, which provides a number of helpful methods for time series models.

library(zoo)

We now create the ILILag2 variable in the training set:

ILILag2 = lag(zoo(FluTrain$ILI), -2, na.pad=TRUE)
FluTrain$ILILag2 = coredata(ILILag2)

In these commands, the value of -2 passed to lag means to return 2 observations before the current one; a positive value would have returned future observations. The parameter na.pad=TRUE means to add missing values for the first two weeks of our dataset, where we can’t compute the data from 2 weeks earlier.

How many values are missing in the new ILILag2 variable?

summary(FluTrain$ILILag2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.5341  0.9010  1.2519  1.6754  2.0580  7.6189       2

ggplotly(ggplot(FluTrain, aes(log(ILILag2), log(ILI))) + geom_point(color = "red") + ggtitle("RELATIONSHIP BETWEEN ILILag2 AND ILI"))

There is a strong positive relationship between log(ILILag2) and log(ILI).

We can the train a linear regression model on the FluTrain dataset to predict the log of the ILI variable using the Queries variable as well as the log of the ILILag2 variable. We will Call this model as FluTrend2.

FluTrend2 <- lm(log(ILI) ~ Queries + log(ILILag2), data = FluTrain)
summary(FluTrend2)

## 
## Call:
## lm(formula = log(ILI) ~ Queries + log(ILILag2), data = FluTrain)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52209 -0.11082 -0.01819  0.08143  0.76785 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.24064    0.01953  -12.32   <2e-16 ***
## Queries       1.25578    0.07910   15.88   <2e-16 ***
## log(ILILag2)  0.65569    0.02251   29.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1703 on 412 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.9063, Adjusted R-squared:  0.9059 
## F-statistic:  1993 on 2 and 412 DF,  p-value: < 2.2e-16

As can be seen, all three coefficients are highly significant, and the R^2 value is 0.9063.

On the basis of R-squared value and significance of coefficients, moving from FluTrend1 to FluTrend2, in-sample R^2 improved from 0.709 to 0.9063, and the new variable is highly significant. As a result, there is no sign of overfitting, and FluTrend2 is superior to FluTrend1 on the training set.

Evaluating the Time Series Model in the Test Set

So far, we have only added the ILILag2 variable to the FluTrain data frame. To make predictions with our FluTrend2 model, we will also need to add ILILag2 to the FluTest data frame (note that adding variables before splitting into a training and testing set can prevent this duplication of effort)

ILILag2 <- lag(zoo(FluTest$ILI), -2, na.pad=TRUE)
FluTest$ILILag2 <- coredata(ILILag2)
summary(FluTest$ILILag2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.9018  1.1359  1.3409  1.5188  1.7606  3.6002       2

The new variable (ILILag2) has two missing variables.

In this problem, the training and testing sets are split sequentially - the training set contains all observations from 2004-2011 and the testing set contains all observations from 2012. There is no time gap between the two datasets, meaning the first observation in FluTest was recorded one week after the last observation in FluTrain. From this, we can identify how to fill in the missing values for the ILILag2 variable in FluTest.

To fill in the ILILag2 variable for the first observation in FluTest, we use the ILI value of the second-to-last observation in the FluTrain data frame, while to fill in the ILILag2 variable for the second observation in FluTest, we use the ILI value of the last observation in the FluTrain data frame.

Filling in the Missing Values

nrow(FluTrain)

## [1] 417

FluTest$ILILag2[1] = FluTrain$ILI[416]
FluTest$ILILag2[2] = FluTrain$ILI[417]

FluTest$ILILag2[1]

## [1] 1.852736

FluTest$ILILag2[2]

## [1] 2.12413

Evaluating the Time Series Model in the Test Set Data

We now set out to obtain the test set predictions of the ILI variable from the FluTrend2 model, again remembering to call the exp() function on the result of the predict() function to obtain predictions for ILI instead of log(ILI).Subsequently, we determine the test-set RMSE of the FluTrend2 model?

pred <- exp(predict(FluTrend2 , newdata = FluTest))
summary(pred)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.004   1.275   1.655   1.808   2.141   5.006

SSE <- sum((pred - FluTest$ILI)^2)
SSE

## [1] 4.500877

RMSE <- sqrt(mean((pred - FluTest$ILI)^2))
RMSE

## [1] 0.2942029

Comparing the two models, FluTrend2 obtained the best test-set RMSE.The test-set RMSE of FluTrend2 is 0.294, as opposed to the 0.749 value obtained by the FluTrend1 model.