This tutorial will walk you through the process of cleaning up case count data, converting it to a time series, and developing a statistically sound forecast using R's built-in ts objects and several time series modeling packages.
Time series: any metric that is measured over regular time intervals. Examples of well-studied time series include weather phenomena, river volumes, and stock prices.

Each data point at time \(t\) in a time series can be expressed as either a sum or a product of three components: seasonality (\(S_t\)), trend (\(T_t\)), and error or noise (\(e_t\)). A time series can therefore be additive or multiplicative.

Additive time series:
\[ Y_t = S_t + T_t + e_t \]

Multiplicative time series:
\[ Y_t = S_t \times T_t \times e_t \]

A multiplicative time series can be converted to an additive one by taking the log of the series:
\[ \log Y_t = \log S_t + \log T_t + \log e_t \]

To accurately model a time series and develop forecasts, we must first decompose the time series into these components.
Visit the R website. Select a mirror link based on your location in the world. From this link, download and install the appropriate R files for your computer and operating system.
Next, visit the RStudio website and download the Free RStudio version.
Install the R file first, then the RStudio file by double-clicking on the downloaded files, and selecting Install.
For this example, we will use the World Health Organization (WHO)’s FluNet database for Costa Rica. Select data from the oldest available year/week (1995/1) to the newest available year/week (2018/53). Click Display Report, and then the floppy drive icon to export the data to a Comma Separated Values (CSV) file.
For the purposes of this exercise, this dataset can also be downloaded directly from this link; in our script, we will read the data directly from that link.
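Assuming the export (or the file fetched from the link above) is saved as a CSV, a minimal sketch of loading and inspecting it; the file name here is a placeholder for your own path or URL:

```r
# Read the FluNet export and inspect the first rows
flu <- read.csv("FluNetInteractiveReport.csv", stringsAsFactors = FALSE)
head(flu)
```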
## Country WHOREGION FLUREGION
## 1 Costa Rica Region of the Americas of WHO Central America and Caribbean
## 2 Costa Rica Region of the Americas of WHO Central America and Caribbean
## 3 Costa Rica Region of the Americas of WHO Central America and Caribbean
## 4 Costa Rica Region of the Americas of WHO Central America and Caribbean
## 5 Costa Rica Region of the Americas of WHO Central America and Caribbean
## 6 Costa Rica Region of the Americas of WHO Central America and Caribbean
## Year Week SDATE EDATE SPEC_RECEIVED_NB SPEC_PROCESSED_NB AH1
## 1 2005 31 2005-08-01 2005-08-07 NA 1 NA
## 2 2005 33 2005-08-15 2005-08-21 NA 1 NA
## 3 2005 35 2005-08-29 2005-09-04 NA 1 NA
## 4 2006 23 2006-06-05 2006-06-11 NA NA NA
## 5 2006 26 2006-06-26 2006-07-02 NA NA NA
## 6 2006 27 2006-07-03 2006-07-09 NA NA NA
## AH1N12009 AH3 AH5 ANOTSUBTYPED INF_A BYAMAGATA BVICTORIA BNOTDETERMINED
## 1 NA 1 NA NA 1 NA NA 0
## 2 NA 1 NA NA 1 NA NA 0
## 3 NA 1 NA NA 1 NA NA 0
## 4 NA 1 NA NA 1 NA NA 0
## 5 NA NA NA NA 0 NA NA 5
## 6 NA 3 NA NA 3 NA NA 0
## INF_B ALL_INF ALL_INF2 TITLE
## 1 0 1 NA No Report
## 2 0 1 NA No Report
## 3 0 1 NA No Report
## 4 0 1 NA No Report
## 5 5 5 NA No Report
## 6 0 3 NA No Report
Since we only have count data at the weekly level, weeks with no cases are likely simply not recorded on FluNet. We therefore need to fill in the gaps with 0 cases to ensure a complete time series, as sketched below.
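A minimal sketch of the gap filling, assuming the data frame `flu` from above and using influenza A counts (INF_A) as the series of interest:

```r
library(dplyr)

# Build a complete year/week grid (52 weeks per year, for simplicity)
# and fill unreported weeks with 0 cases
grid <- expand.grid(Year = 1995:2018, Week = 1:52)
flu_full <- grid %>%
  left_join(flu[, c("Year", "Week", "INF_A")], by = c("Year", "Week")) %>%
  arrange(Year, Week) %>%
  mutate(INF_A = ifelse(is.na(INF_A), 0, INF_A))

# Convert to a weekly time series object and plot it
dat_ts <- ts(flu_full$INF_A, start = c(1995, 1), frequency = 52)
plot(dat_ts)
```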
Time Series of Flu Cases in Costa Rica, Source: WHO FluNet
Are negative values for case counts possible for an observational time series?
The above decomposition plot provides a closer look at the random, seasonal, and trend components of the observed time series. However, looking at the y-axis of each panel, we see negative values, which is infeasible for a count-based time series where the lowest possible value is 0 (meaning no observed cases). We will have to adjust our modeling approach to accommodate this property of our dataset.
We can also use another decomposition method, STL (Loess smoothing), to assess the seasonal component. This decomposition also shows negative remainders, which we know are impossible given our count data.
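Both decompositions can be produced with base R, assuming the `dat_ts` series built earlier:

```r
# Classical additive decomposition into trend, seasonal, and random parts
plot(decompose(dat_ts))

# STL decomposition via Loess smoothing; s.window = "periodic" assumes
# a seasonal pattern that is stable across years
plot(stl(dat_ts, s.window = "periodic"))
```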
Stationary time series:
\[ e_t = Y_t - S_t - T_t \]
A time series is stationary if the following are true:

- The mean of the series does not depend on time.
- The variance of the series does not depend on time.
- The autocovariance between values depends only on the lag separating them, not on time itself.

If these conditions hold, the time series has no trend or seasonal pattern; it looks like random white noise. This can be verified using statistical tests (discussed further below).
Based on this seasonal pattern, do you think this time series is stationary?
A lag shifts the time series by a given number of periods. Lags of a time series are often used as explanatory variables to model the series itself, because its state a few periods back may influence its current state.
Can you guess the lag period based on the graph?
To answer this question, we must examine two properties of the time series further: the autocorrelation function (ACF) and the partial autocorrelation function (PACF). Lags determine whether previous states of the time series influence its present or future states, and ACF or PACF plots can point to specific lags with significant influence on the series.

If the autocorrelation at a given lag crosses the dashed blue line, that lag is significantly correlated with the current series. Based on this explanation, which lags do you think are significant for this time series?
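For reference, the plots can be produced with:

```r
# Autocorrelation and partial autocorrelation; the dashed blue lines mark
# approximate 95% significance bounds
acf(dat_ts, lag.max = 104)
pacf(dat_ts, lag.max = 104)
```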
As with most data, you can use several approaches to model this time series. Let’s start with the most commonly-used models for time series data:
Autoregressive (AR) model:
\[ Y_t = \beta_0 + \beta_1 Y_{t-1} + \cdots + \beta_n Y_{t-n} + \epsilon_t \]

where the \(\beta\)s are coefficients, \(Y_t\) is the response, \(\epsilon_t\) is the error, and \(n\) is the time period (order) for which the model is specified.

Moving average (MA) model:
\[ Y_t = \mu + \epsilon_t + \theta_1 \epsilon_{t-1} + \cdots + \theta_n \epsilon_{t-n} \]

where \(\mu\) is the mean of the time series, the \(\epsilon\)s are errors, the \(\theta\)s are coefficients, and \(n\) is the time period (order) for which the model is specified.
Both AR and MA model components can be combined in ARIMA models. An ARIMA model is specified as ARIMA(p,d,q), where:

- p is the order of the autoregressive (AR) component,
- d is the degree of differencing applied to make the series stationary, and
- q is the order of the moving average (MA) component.
The following steps can be used to build a simple time series model: test for stationarity, identify candidate orders from the ACF/PACF plots, fit the model, and check the fit. We begin by testing for stationarity.
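The output below comes from the Augmented Dickey-Fuller (ADF) test. A minimal sketch of the call, assuming `ts.sa` holds the seasonally adjusted series (the observed series with the seasonal component removed):

```r
library(tseries)

# ADF test; the null hypothesis is that the series contains a unit root
# (i.e., is non-stationary)
adf.test(ts.sa)
```

In the output, the p-value of 0.01 lets us reject the null hypothesis of a unit root, so we can treat the series as stationary.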
##
## Augmented Dickey-Fuller Test
##
## data: ts.sa
## Dickey-Fuller = -5.817, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary
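With stationarity established, an ARIMA model can be fit. A minimal sketch using the forecast package's automatic order selection; this is an illustrative assumption, not necessarily the exact model behind the forecast table shown later:

```r
library(forecast)

# Automatically select ARIMA(p, d, q) orders by AICc and forecast 10 steps
fit_arima <- auto.arima(dat_ts)
forecast(fit_arima, h = 10)
```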
Using the tsglm() function from the tscount package, we can build models that include multiple seasonalities. Below we fit two models to the Costa Rica flu data, one Poisson and one Negative Binomial. Both distributions are appropriate for count data, making this approach more suitable than the ARIMA method discussed above.
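A sketch of the two fits, assuming `dat_ts` holds the completed weekly counts; the lags mirror the model summary shown later in this tutorial:

```r
library(tscount)

# Condition on past observations at lags 1, 2, 6, 12, 24, and 52 weeks,
# plus the conditional mean 52 weeks back
mod <- list(past_obs = c(1, 2, 6, 12, 24, 52), past_mean = 52)

fit_pois <- tsglm(dat_ts, model = mod, distr = "poisson")
fit_nb   <- tsglm(dat_ts, model = mod, distr = "nbinom")
```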
We can begin comparing the two models using histograms as well as scoring methods.
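A sketch of these diagnostics using tscount's built-in functions, assuming the two fits above:

```r
# PIT histograms assess calibration: bars rising above the uniform
# reference indicate a poorly calibrated model
pit(fit_pois)
pit(fit_nb)

# Marginal calibration plots and proper scoring rules
marcal(fit_pois)
marcal(fit_nb)
rbind(Poisson = scoring(fit_pois), NegBin = scoring(fit_nb))
```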
## logarithmic quadratic spherical rankprob dawseb normsq
## Poisson 2.817562 -0.3027172 -0.4945904 2.330446 11.608200 10.9435731
## NegBin 2.558048 -0.1540738 -0.4709549 4.154732 5.078168 0.9877368
## sqerror
## Poisson 46.94582
## NegBin 46.94582
The histograms indicate that significant portions of the Poisson model's histogram fall above the blue line, so the Poisson model is not adequately calibrated. The Negative Binomial histogram fits the data comparatively better.
The marginal calibration graph suggests the opposite, i.e., the Poisson model hovers closer to a difference of 0, and thus may be preferable to the Negative Binomial distribution.
The scoring table can help us decide which model is a better fit. The smaller the scoring rule, the better; based on this table, we select the Negative Binomial model as the better description of the data.
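The summary below, for the selected Negative Binomial model, can be produced with (assuming the fit object from the sketch above):

```r
# Coefficient estimates, 95% confidence intervals, and information criteria
summary(fit_nb)
```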
##
## Call:
## tsglm(ts = dat_ts, model = list(past_obs = c(1, 2, 6, 12, 24,
## 52), past_mean = 52), distr = "nbinom")
##
## Coefficients:
## Estimate Std.Error CI(lower) CI(upper)
## (Intercept) 4.84e-01 0.2032 0.0855 0.8821
## beta_1 6.13e-01 0.4411 -0.2511 1.4780
## beta_2 3.01e-01 0.3849 -0.4538 1.0551
## beta_6 3.32e-03 0.0829 -0.1591 0.1657
## beta_12 1.24e-09 0.0247 -0.0484 0.0484
## beta_24 2.27e-10 0.0145 -0.0284 0.0284
## beta_52 1.22e-11 0.0218 -0.0427 0.0427
## alpha_52 1.41e-04 0.0242 -0.0474 0.0476
## sigmasq 1.48e+01 NA NA NA
## Standard errors and confidence intervals (level = 95 %) obtained
## by normal approximation.
##
## Link function: identity
## Distribution family: nbinom (with overdispersion coefficient 'sigmasq')
## Number of coefficients: 9
## Log-likelihood: -1665.289
## AIC: 3348.578
## BIC: 3388.885
## QIC: 5138.774
## EDATE INF_A
## 1 2018-01-21 1
## 2 2018-01-28 1
## 3 2018-02-04 2
## 4 2018-02-11 2
## 5 2018-02-18 2
## 6 2018-02-25 2
## 7 2018-03-04 3
## 8 2018-03-11 3
## 9 2018-03-18 3
## 10 2018-03-25 3
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2018.115 1.5104761 -6.825365 9.846317 -11.23809 14.25904
## 2018.135 0.8980272 -8.537916 10.333971 -13.53300 15.32906
## 2018.154 1.3011369 -9.702043 12.304317 -15.52678 18.12905
## 2018.173 2.2930639 -9.797984 14.384112 -16.19860 20.78473
## 2018.192 3.0423367 -9.748941 15.833615 -16.52023 22.60491
## 2018.212 2.6541989 -10.522972 15.831369 -17.49854 22.80694
## 2018.231 3.0659366 -10.605808 16.737681 -17.84319 23.97507
## 2018.250 4.0376725 -10.022169 18.097514 -17.46500 25.54034
## 2018.269 4.0717493 -10.161430 18.304928 -17.69602 25.83952
## 2018.288 3.7618841 -10.623002 18.146770 -18.23790 25.76167
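Note that the 80% and 95% lower bounds above are negative, the same infeasibility for count data that we flagged during decomposition.

Another count-aware option is the acp package, which fits an autoregressive conditional Poisson (ACP) model. A sketch of the call matching the output below, where `dat2` is assumed to be a data frame of weekly counts with column INF_A:

```r
library(acp)

# Autoregressive Conditional Poisson: p = 2 lagged counts and
# q = 1 lagged conditional mean; INF_A ~ -1 fits without covariates
fit_acp <- acp(INF_A ~ -1, data = dat2, p = 2, q = 1)
summary(fit_acp)
```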
## Call:
## acp.formula(formula = INF_A ~ -1, data = dat2, p = 2, q = 1)
##
## Estimate StdErr t.value p.value
## a 0.428546 0.050691 8.4541 < 2.2e-16 ***
## b 1 0.611889 0.025860 23.6613 < 2.2e-16 ***
## b 2 0.247466 0.058656 4.2189 2.805e-05 ***
## c 1 0.065259 0.062034 1.0520 0.2932
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Logl:
## [1] -1836.097
##
## AIC:
## [1] 5.653139
##
## BIC:
## [1] 5.680656
##
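Finally, the glarma package fits generalized linear autoregressive moving average (GLARMA) models for count series. The sketch below reproduces the call shown in the output, with `dat_sub` assumed to be a vector of weekly counts:

```r
library(glarma)

# GLARMA with a Negative Binomial response, Fisher scoring ("FS"),
# Pearson residuals, and autoregressive terms at lags 1, 2, 6, and 12;
# X is an intercept-only design matrix
fit_glarma <- glarma(y = dat_sub, X = as.matrix(rep(1, length(dat_sub))),
                     type = "NegBin", method = "FS", residuals = "Pearson",
                     phiLags = c(1, 2, 6, 12), maxit = 100, grad = 1e-06)
fit_glarma
```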
## Call: glarma(y = dat_sub, X = as.matrix(rep(1, length(dat_sub))), type = "NegBin",
## method = "FS", residuals = "Pearson", phiLags = c(1, 2, 6,
## 12), maxit = 100, grad = 1e-06)
##
## Negative Binomial Parameter:
## <NA>
## 0.7384749
##
## GLARMA Coefficients:
## phi_2 phi_6 phi_12 alpha
## 0.42775671 0.19011039 0.07768772 -0.04133052
##
## Linear Model Coefficients:
## phi_1
## 0.8860162
##
## Degrees of Freedom: 644 Total (i.e. Null); 639 Residual
## Null Deviance: 611.7529
## Residual Deviance: 1866.685
## AIC: 2709.121
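To forecast from the GLARMA fit, the glarma package provides a forecast() method. A sketch matching the printed forecast object below (its `$n.ahead`, `$newdata`, and `$newoffset` components are visible in the output):

```r
# 12-step-ahead forecast; newdata is the intercept-only design matrix
# for the forecast horizon, matching the one used in fitting
fc <- forecast(fit_glarma, n.ahead = 12, newdata = matrix(1, 12, 1),
               newoffset = 0)
fc
```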
## $eta
## [1] 0.8860162 NA NA NA NA NA NA
## [8] NA NA NA NA NA
##
## $W
## [1] 1.673114 NA NA NA NA NA NA
## [8] NA NA NA NA NA
##
## $mu
## [1] 5.328733 NA NA NA NA NA NA
## [8] NA NA NA NA NA
##
## $Y
## [1] 4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
##
## $n.ahead
## [1] 12
##
## $newdata
## [,1]
## [1,] 1
## [2,] 1
## [3,] 1
## [4,] 1
## [5,] 1
## [6,] 1
## [7,] 1
## [8,] 1
## [9,] 1
## [10,] 1
## [11,] 1
## [12,] 1
##
## $newoffset
## [1] 0
##
## $newm
## [1] 1
##
## $model
##
## Call: glarma(y = dat_sub, X = as.matrix(rep(1, length(dat_sub))), type = "NegBin",
## method = "FS", residuals = "Pearson", phiLags = c(1, 2, 6,
## 12), maxit = 100, grad = 1e-06)
##
## Negative Binomial Parameter:
## <NA>
## 0.7384749
##
## GLARMA Coefficients:
## phi_2 phi_6 phi_12 alpha
## 0.42775671 0.19011039 0.07768772 -0.04133052
##
## Linear Model Coefficients:
## phi_1
## 0.8860162
##
## Degrees of Freedom: 644 Total (i.e. Null); 639 Residual
## Null Deviance: 611.7529
## Residual Deviance: 1866.685
## AIC: 2709.121
##
##
## attr(,"class")
## [1] "glarmaForecast"