Goolge flu trend https://www.google.org/flutrends/about/#US project was initiated to see if faster reporting can be made possible by considering flu-related online search queries – data that is available almost immediately.

“Week” - The range of dates represented by this observation, in year/month/day format.
“ILI” - This column lists the percentage of ILI-related physician visits for the corresponding week.
“Queries” - This column lists the fraction of queries that are ILI-related for the corresponding week, adjusted to be between 0 and 1 (higher values correspond to more ILI-related search queries).

subset(FluTrain, ILI == max(ILI))

##                        Week      ILI Queries
## 303 2009-10-18 - 2009-10-24 7.618892       1

-Oct. 18 2009 corresponds to the highest percentage of ILI-related physician visits.

which.max(FluTrain$ILI)

## [1] 303

FluTrain$Week[303]

## [1] 2009-10-18 - 2009-10-24
## 417 Levels: 2004-01-04 - 2004-01-10 ... 2011-12-25 - 2011-12-31

An alternative to find out the highest percentage of ILI-related physician visits.

subset(FluTrain, Queries == max(Queries))

##                        Week      ILI Queries
## 303 2009-10-18 - 2009-10-24 7.618892       1

Oct. 18 2009 week corresponds to the highest percentage of ILI-related query fraction

Most of the ILI values are small, with a relatively small number of much larger values (in statistics, this sort of data is called “skew right”).

hist(FluTrain$ILI)

When handling a skewed dependent variable, it is often useful to predict the logarithm of the dependent variable instead of the dependent variable itself – this prevents the small number of unusually large or small observations from having an undue influence on the sum of squared errors of predictive models.

plot(Queries ~ log(ILI), data=FluTrain)

There is a positive, linear relationship between log(ILI) and Queries.
log(ILI) = intercept + coefficient x Queries, where the coefficient is positive

LINEAR REGRESSION MODEL

FluTrend1 = lm(log(ILI) ~ Queries, data = FluTrain)
summary(FluTrend1)

## 
## Call:
## lm(formula = log(ILI) ~ Queries, data = FluTrain)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.76003 -0.19696 -0.01657  0.18685  1.06450 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.49934    0.03041  -16.42   <2e-16 ***
## Queries      2.96129    0.09312   31.80   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2995 on 415 degrees of freedom
## Multiple R-squared:  0.709,  Adjusted R-squared:  0.7083 
## F-statistic:  1011 on 1 and 415 DF,  p-value: < 2.2e-16

R-squared value for FluTrend1 model is 0.709

For a single variable linear regression model, there is a direct relationship between the R-squared and the correlation between the independent and the dependent variables:

R-squared = Correlation^2.

To test these hypotheses, we first need to compute the correlation between the independent variable used in the model (Queries) and the dependent variable (log(ILI)).

Correlation = cor(FluTrain$Queries, log(FluTrain$ILI))
Correlation

## [1] 0.8420333

Correlation^2 = 0.7090201
log(1/Correlation) = 0.1719357
exp(-0.5*Correlation) = 0.6563792
It appears that Correlation^2 is equal to the R-squared value

Load this data into a data frame called FluTest.

## 'data.frame':    52 obs. of  3 variables:
##  $ Week   : Factor w/ 52 levels "2012-01-01 - 2012-01-07",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ ILI    : num  1.77 1.54 1.65 1.68 1.86 ...
##  $ Queries: num  0.594 0.499 0.501 0.479 0.471 ...

Normally, we would obtain test-set predictions from the model FluTrend1 using the code

PredTest1 = predict(FluTrend1, newdata=FluTest)

However, the dependent variable in our model is log(ILI), so PredTest1 would contain predictions of the log(ILI) value. We are instead interested in obtaining predictions of the ILI value. We can convert from predictions of log(ILI) to predictions of ILI via exponentiation, or the exp() function. The new code, which predicts the ILI value, is

PredTest1 = exp(predict(FluTrend1, newdata=FluTest))
PredTest1

##        1        2        3        4        5        6        7        8 
## 3.520332 2.662689 2.673181 2.510160 2.451624 2.694289 2.780402 2.673181 
##        9       10       11       12       13       14       15       16 
## 2.375693 2.357081 2.187378 1.796914 1.789861 1.714084 1.789861 1.615894 
##       17       18       19       20       21       22       23       24 
## 1.529333 1.499555 1.380689 1.487806 1.375269 1.317045 1.359139 1.276254 
##       25       26       27       28       29       30       31       32 
## 1.317045 1.231872 1.276254 1.332675 1.256335 1.276254 1.266255 1.281283 
##       33       34       35       36       37       38       39       40 
## 1.348491 1.447408 1.523330 1.628654 1.755010 2.094771 2.195997 2.357081 
##       41       42       43       44       45       46       47       48 
## 2.422870 2.338614 2.329436 2.539950 2.442001 2.858006 2.758619 3.690445 
##       49       50       51       52 
## 4.439679 4.898351 6.250888 6.591252

which(FluTest$Week == "2012-03-11 - 2012-03-17")

## [1] 11

PredTest1[11]

##       11 
## 2.187378

What is our estimate for the percentage of ILI-related physician visits for the week of March 11, 2012? 2.187378%

What is the relative error betweeen the estimate (our prediction) and the observed value for the week of March 11, 2012?

FluTest$ILI[11]

## [1] 2.293422

PredTest1[11]

##       11 
## 2.187378

(Observed ILI - Estimated ILI)/Observed ILI
(2.293422 - 2.187378)/2.293422 = 0.04624

What is the Root Mean Square Error (RMSE) between our estimates and the actual observations for the percentage of ILI-related physician visits, on the test set?

SSE = sum((PredTest1-FluTest$ILI)^2)
SSE

## [1] 29.17708

nrow(FluTest)

## [1] 52

RMSE = sqrt(SSE / nrow(FluTest))

RMSE = 0.7490645

sqrt(mean((PredTest1-FluTest$ILI)^2))

## [1] 0.7490645

An alternative way to calculate RMSE

The observations in this dataset are consecutive weekly measurements of the dependent and independent variables. This sort of dataset is called a “time series.” Often, statistical models can be improved by predicting the current value of the dependent variable using the value of the dependent variable from earlier weeks. In our models, this means we will predict the ILI variable in the current week using values of the ILI variable from previous weeks.

we need to decide the amount of time to lag the observations. Because the ILI variable is reported with a 1- or 2-week lag, a decision maker cannot rely on the previous week’s ILI value to predict the current week’s value. Instead, the decision maker will only have data available from 2 or more weeks ago. We will build a variable called ILILag2 that contains the ILI value from 2 weeks before the current observation.
install package “zoo”, create the ILILag2 variable in the training set

library(zoo)

## Warning: package 'zoo' was built under R version 3.2.2

## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

ILILag2 = lag(zoo(FluTrain$ILI), -2, na.pad=TRUE)
FluTrain$ILILag2 = coredata(ILILag2)
str(FluTrain)

## 'data.frame':    417 obs. of  4 variables:
##  $ Week   : Factor w/ 417 levels "2004-01-04 - 2004-01-10",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ ILI    : num  2.42 1.81 1.71 1.54 1.44 ...
##  $ Queries: num  0.238 0.22 0.226 0.238 0.224 ...
##  $ ILILag2: num  NA NA 2.42 1.81 1.71 ...

summary(FluTrain$ILILag2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.5341  0.9010  1.2520  1.6750  2.0580  7.6190       2

the value of -2 passed to lag means to return 2 observations before the current one; a positive value would have returned future observations. The parameter na.pad=TRUE means to add missing values for the first two weeks of our dataset, where we can’t compute the data from 2 weeks earlier.

plot the log of ILILag2 against the log of ILI

plot(log(ILILag2) ~ log(ILI), data=FluTrain)

There is a strong positive relationship between log(ILILag2) and log(ILI)

Train a linear regression model on the FluTrain dataset to predict the log of the ILI variable using the Queries variable as well as the log of the ILILag2 variable

FluTrend2  = lm(log(ILI) ~ Queries + log(ILILag2), data = FluTrain)
summary(FluTrend2)

## 
## Call:
## lm(formula = log(ILI) ~ Queries + log(ILILag2), data = FluTrain)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52209 -0.11082 -0.01819  0.08143  0.76785 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.24064    0.01953  -12.32   <2e-16 ***
## Queries       1.25578    0.07910   15.88   <2e-16 ***
## log(ILILag2)  0.65569    0.02251   29.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1703 on 412 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.9063, Adjusted R-squared:  0.9059 
## F-statistic:  1993 on 2 and 412 DF,  p-value: < 2.2e-16

All three coefficients are highly significant, and the R^2 value is 0.9063.
FluTrend2 is a stronger model than FluTrend1 on the training set. Moving from FluTrend1 to FluTrend2, in-sample R^2 improved from 0.709 to 0.9063, and the new variable is highly significant. As a result, there is no sign of overfitting, and FluTrend2 is superior to FluTrend1 on the training set.

So far, we have only added the ILILag2 variable to the FluTrain data frame. To make predictions with our FluTrend2 model, we will also need to add ILILag2 to the FluTest data frame

ILILag2 = lag(zoo(FluTest$ILI), -2, na.pad=TRUE)
FluTest$ILILag2 = coredata(ILILag2)
summary(FluTest$ILILag2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.9018  1.1360  1.3410  1.5190  1.7610  3.6000       2

The training set contains all observations from 2004-2011 and the testing set contains all observations from 2012. There is no time gap between the two datasets, meaning the first observation in FluTest was recorded one week after the last observation in FluTrain.

The ILI value of the second-to-last observation in the FluTrain data frame should be used to fill in the ILILag2 variable for the first observation in FluTest.
The ILI value of the last observation in the FluTrain data frame should be used to fill in the ILILag2 variable for the second observation in FluTest.

Fill in the missing values for ILILag2 in FluTest

nrow(FluTrain)

## [1] 417

FluTest$ILILag2[1] = FluTrain$ILI[416]
FluTest$ILILag2[2] = FluTrain$ILI[417]
FluTest$ILILag2[1]

## [1] 1.852736

the new value of the ILILag2 variable in the first row of FluTest

FluTest$ILILag2[2]

## [1] 2.12413

the new value of the ILILag2 variable in the second row of FluTest

EVALUATING THE TIME SERIES MODEL IN THE TEST SET

PredTest2 = exp(predict(FluTrend2, newdata=FluTest))
SSE = sum((PredTest2-FluTest$ILI)^2)
SSE

## [1] 4.500877

nrow(FluTest)

## [1] 52

RMSE = sqrt(SSE / nrow(FluTest))
RMSE

## [1] 0.2942029

The test-set RMSE of FluTrend2 is 0.294, as opposed to the 0.749 value obtained by the FluTrend1 model.

sqrt(mean((PredTest2-FluTest$ILI)^2))

## [1] 0.2942029

An alternative