This tutorial will walk you through the process of cleaning up case count data, converting it to a time series, and developing a statistically sound forecast using R's built-in ts objects and several time series modeling packages.
Time series: any metric that is measured over regular time intervals. Examples of well-studied time series include weather phenomena, river volumes, and stock prices.

Each data point at time \(t\) in a time series can be expressed as either a sum or a product of three components: seasonality (\(S_t\)), trend (\(T_t\)), and error or noise (\(e_t\)). A time series can therefore be additive or multiplicative.

Additive time series:
\[ Y_t = S_t + T_t + e_t \]

Multiplicative time series:
\[ Y_t = S_t \times T_t \times e_t \]

A multiplicative time series can be converted to an additive one by taking the log of the series:
\[ \log Y_t = \log S_t + \log T_t + \log e_t \]

To accurately model a time series and develop forecasts, we must first decompose the time series into these components.
Visit the R website. Select a mirror link based on your location in the world. From this link, download and install the appropriate R files for your computer and operating system.
Next, visit the RStudio website and download the Free RStudio version.
Install the R file first, then the RStudio file by double-clicking on the downloaded files, and selecting Install.
For this example, we will use the World Health Organization (WHO)’s FluNet database for Costa Rica. Select data from the oldest available year/week (1995/1) to the newest available year/week (2018/53). Click Display Report, and then the floppy drive icon to export the data to a Comma Separated Values (CSV) file.
For the purposes of this exercise, this dataset can also be downloaded directly from this link; in our script, we will read the data directly from that link.
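Assuming the export (or the file fetched from the link above) is saved as a CSV, a minimal sketch of loading and inspecting it; the file name here is a placeholder for your own path or URL:

```r
# Read the FluNet export and inspect the first rows
flu <- read.csv("FluNetInteractiveReport.csv", stringsAsFactors = FALSE)
head(flu)
```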
## Country WHOREGION FLUREGION
## 1 Costa Rica Region of the Americas of WHO Central America and Caribbean
## 2 Costa Rica Region of the Americas of WHO Central America and Caribbean
## 3 Costa Rica Region of the Americas of WHO Central America and Caribbean
## 4 Costa Rica Region of the Americas of WHO Central America and Caribbean
## 5 Costa Rica Region of the Americas of WHO Central America and Caribbean
## 6 Costa Rica Region of the Americas of WHO Central America and Caribbean
## Year Week SDATE EDATE SPEC_RECEIVED_NB SPEC_PROCESSED_NB AH1
## 1 2005 31 2005-08-01 2005-08-07 NA 1 NA
## 2 2005 33 2005-08-15 2005-08-21 NA 1 NA
## 3 2005 35 2005-08-29 2005-09-04 NA 1 NA
## 4 2006 23 2006-06-05 2006-06-11 NA NA NA
## 5 2006 26 2006-06-26 2006-07-02 NA NA NA
## 6 2006 27 2006-07-03 2006-07-09 NA NA NA
## AH1N12009 AH3 AH5 ANOTSUBTYPED INF_A BYAMAGATA BVICTORIA BNOTDETERMINED
## 1 NA 1 NA NA 1 NA NA 0
## 2 NA 1 NA NA 1 NA NA 0
## 3 NA 1 NA NA 1 NA NA 0
## 4 NA 1 NA NA 1 NA NA 0
## 5 NA NA NA NA 0 NA NA 5
## 6 NA 3 NA NA 3 NA NA 0
## INF_B ALL_INF ALL_INF2 TITLE
## 1 0 1 NA No Report
## 2 0 1 NA No Report
## 3 0 1 NA No Report
## 4 0 1 NA No Report
## 5 5 5 NA No Report
## 6 0 3 NA No Report
Since we only have count data at the weekly level, weeks with no cases are likely simply not recorded on FluNet. We therefore need to fill in the gaps with 0 cases to ensure a complete time series, as sketched below.
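A minimal sketch of the gap filling, assuming the data frame `flu` from above and using influenza A counts (INF_A) as the series of interest:

```r
library(dplyr)

# Build a complete year/week grid (52 weeks per year, for simplicity)
# and fill unreported weeks with 0 cases
grid <- expand.grid(Year = 1995:2018, Week = 1:52)
flu_full <- grid %>%
  left_join(flu[, c("Year", "Week", "INF_A")], by = c("Year", "Week")) %>%
  arrange(Year, Week) %>%
  mutate(INF_A = ifelse(is.na(INF_A), 0, INF_A))

# Convert to a weekly time series object and plot it
dat_ts <- ts(flu_full$INF_A, start = c(1995, 1), frequency = 52)
plot(dat_ts)
```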
Time Series of Flu Cases in Costa Rica, Source: WHO FluNet
Are negative values for case counts possible for an observational time series?
The above decomposition plot provides a closer look at the random, seasonal, and trend components of the observed time series. However, looking at the y-axis of each panel, we see negative values, which is infeasible for a count-based time series where the lowest possible value is 0 (meaning no observed cases). We will have to adjust our modeling approach to accommodate this property of our dataset.
We can also use another decomposition method, STL (Loess smoothing), to assess the seasonal component. This decomposition also shows negative remainders, which we know are impossible given our count data.
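Both decompositions can be produced with base R, assuming the `dat_ts` series built earlier:

```r
# Classical additive decomposition into trend, seasonal, and random parts
plot(decompose(dat_ts))

# STL decomposition via Loess smoothing; s.window = "periodic" assumes
# a seasonal pattern that is stable across years
plot(stl(dat_ts, s.window = "periodic"))
```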
Stationary time series:
\[ e_t = Y_t - S_t - T_t \]
A time series is stationary if the following are true:

- The mean of the series does not depend on time.
- The variance of the series does not depend on time.
- The autocovariance between values depends only on the lag separating them, not on time itself.

If these conditions hold, the time series has no trend or seasonal pattern; it looks like random white noise. This can be verified using statistical tests (discussed further below).
Based on this seasonal pattern, do you think this time series is stationary?
A lag shifts the time series by a given number of periods. Lags of a time series are often used as explanatory variables to model the series itself, because its state a few periods back may influence its current state.
Can you guess the lag period based on the graph?
To answer this question, we must examine two properties of the time series further: the autocorrelation function (ACF) and the partial autocorrelation function (PACF). Lags determine whether previous states of the time series influence its present or future states, and ACF or PACF plots can point to specific lags with significant influence on the series.

If the autocorrelation at a given lag crosses the dashed blue line, that lag is significantly correlated with the current series. Based on this explanation, which lags do you think are significant for this time series?
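For reference, the plots can be produced with:

```r
# Autocorrelation and partial autocorrelation; the dashed blue lines mark
# approximate 95% significance bounds
acf(dat_ts, lag.max = 104)
pacf(dat_ts, lag.max = 104)
```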
As with most data, you can use several approaches to model this time series. Let’s start with the most commonly-used models for time series data:
Autoregressive (AR) model:
\[ Y_t = \beta_0 + \beta_1 Y_{t-1} + \cdots + \beta_n Y_{t-n} + \epsilon_t \]

where the \(\beta\)s are coefficients, \(Y_t\) is the response, \(\epsilon_t\) is the error, and \(n\) is the time period (order) for which the model is specified.

Moving average (MA) model:
\[ Y_t = \mu + \epsilon_t + \theta_1 \epsilon_{t-1} + \cdots + \theta_n \epsilon_{t-n} \]

where \(\mu\) is the mean of the time series, the \(\epsilon\)s are errors, the \(\theta\)s are coefficients, and \(n\) is the time period (order) for which the model is specified.
Both AR and MA model components can be combined in ARIMA models. An ARIMA model is specified as ARIMA(p,d,q), where:

- p is the order of the autoregressive (AR) component,
- d is the degree of differencing applied to make the series stationary, and
- q is the order of the moving average (MA) component.
The following steps can be used to build a simple time series model: test for stationarity, identify candidate orders from the ACF/PACF plots, fit the model, and check the fit. We begin by testing for stationarity.
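The output below comes from the Augmented Dickey-Fuller (ADF) test. A minimal sketch of the call, assuming `ts.sa` holds the seasonally adjusted series (the observed series with the seasonal component removed):

```r
library(tseries)

# ADF test; the null hypothesis is that the series contains a unit root
# (i.e., is non-stationary)
adf.test(ts.sa)
```

In the output, the p-value of 0.01 lets us reject the null hypothesis of a unit root, so we can treat the series as stationary.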
##
## Augmented Dickey-Fuller Test
##
## data: ts.sa
## Dickey-Fuller = -5.817, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary
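With stationarity established, an ARIMA model can be fit. A minimal sketch using the forecast package's automatic order selection; this is an illustrative assumption, not necessarily the exact model behind the forecast table shown later:

```r
library(forecast)

# Automatically select ARIMA(p, d, q) orders by AICc and forecast 10 steps
fit_arima <- auto.arima(dat_ts)
forecast(fit_arima, h = 10)
```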
Using the tsglm() function from the tscount package, we can build models that include multiple seasonalities. Below we fit two models to the Costa Rica flu data, one Poisson and one Negative Binomial. Both distributions are appropriate for count data, making this approach more suitable than the ARIMA method discussed above.
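A sketch of the two fits, assuming `dat_ts` holds the completed weekly counts; the lags mirror the model summary shown later in this tutorial:

```r
library(tscount)

# Condition on past observations at lags 1, 2, 6, 12, 24, and 52 weeks,
# plus the conditional mean 52 weeks back
mod <- list(past_obs = c(1, 2, 6, 12, 24, 52), past_mean = 52)

fit_pois <- tsglm(dat_ts, model = mod, distr = "poisson")
fit_nb   <- tsglm(dat_ts, model = mod, distr = "nbinom")
```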
We can begin comparing the two models using histograms as well as scoring methods.
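A sketch of these diagnostics using tscount's built-in functions, assuming the two fits above:

```r
# PIT histograms assess calibration: bars rising above the uniform
# reference indicate a poorly calibrated model
pit(fit_pois)
pit(fit_nb)

# Marginal calibration plots and proper scoring rules
marcal(fit_pois)
marcal(fit_nb)
rbind(Poisson = scoring(fit_pois), NegBin = scoring(fit_nb))
```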
## logarithmic quadratic spherical rankprob dawseb normsq
## Poisson 2.817562 -0.3027172 -0.4945904 2.330446 11.608200 10.9435731
## NegBin 2.558048 -0.1540738 -0.4709549 4.154732 5.078168 0.9877368
## sqerror
## Poisson 46.94582
## NegBin 46.94582
The histograms indicate that significant portions of the Poisson model's histogram fall above the blue line, so the Poisson model is not adequately calibrated. The Negative Binomial histogram fits the data comparatively better.
The marginal calibration graph suggests the opposite, i.e., the Poisson model hovers closer to a difference of 0, and thus may be preferable to the Negative Binomial distribution.
The scoring table can help us decide which model is a better fit. The smaller the scoring rule, the better; based on this table, we select the Negative Binomial model as the better description of the data.
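The summary below, for the selected Negative Binomial model, can be produced with (assuming the fit object from the sketch above):

```r
# Coefficient estimates, 95% confidence intervals, and information criteria
summary(fit_nb)
```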
##
## Call:
## tsglm(ts = dat_ts, model = list(past_obs = c(1, 2, 6, 12, 24,
## 52), past_mean = 52), distr = "nbinom")
##
## Coefficients:
## Estimate Std.Error CI(lower) CI(upper)
## (Intercept) 4.84e-01 0.2032 0.0855 0.8821
## beta_1 6.13e-01 0.4411 -0.2511 1.4780
## beta_2 3.01e-01 0.3849 -0.4538 1.0551
## beta_6 3.32e-03 0.0829 -0.1591 0.1657
## beta_12 1.24e-09 0.0247 -0.0484 0.0484
## beta_24 2.27e-10 0.0145 -0.0284 0.0284
## beta_52 1.22e-11 0.0218 -0.0427 0.0427
## alpha_52 1.41e-04 0.0242 -0.0474 0.0476
## sigmasq 1.48e+01 NA NA NA
## Standard errors and confidence intervals (level = 95 %) obtained
## by normal approximation.
##
## Link function: identity
## Distribution family: nbinom (with overdispersion coefficient 'sigmasq')
## Number of coefficients: 9
## Log-likelihood: -1665.289
## AIC: 3348.578
## BIC: 3388.885
## QIC: 5138.774
## EDATE INF_A
## 1 2018-01-21 1
## 2 2018-01-28 1
## 3 2018-02-04 2
## 4 2018-02-11 2
## 5 2018-02-18 2
## 6 2018-02-25 2
## 7 2018-03-04 3
## 8 2018-03-11 3
## 9 2018-03-18 3
## 10 2018-03-25 3
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2018.115 1.5104761 -6.825365 9.846317 -11.23809 14.25904
## 2018.135 0.8980272 -8.537916 10.333971 -13.53300 15.32906
## 2018.154 1.3011369 -9.702043 12.304317 -15.52678 18.12905
## 2018.173 2.2930639 -9.797984 14.384112 -16.19860 20.78473
## 2018.192 3.0423367 -9.748941 15.833615 -16.52023 22.60491
## 2018.212 2.6541989 -10.522972 15.831369 -17.49854 22.80694
## 2018.231 3.0659366 -10.605808 16.737681 -17.84319 23.97507
## 2018.250 4.0376725 -10.022169 18.097514 -17.46500 25.54034
## 2018.269 4.0717493 -10.161430 18.304928 -17.69602 25.83952
## 2018.288 3.7618841 -10.623002 18.146770 -18.23790 25.76167
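Note that the 80% and 95% lower bounds above are negative, the same infeasibility for count data that we flagged during decomposition.

Another count-aware option is the acp package, which fits an autoregressive conditional Poisson (ACP) model. A sketch of the call matching the output below, where `dat2` is assumed to be a data frame of weekly counts with column INF_A:

```r
library(acp)

# Autoregressive Conditional Poisson: p = 2 lagged counts and
# q = 1 lagged conditional mean; INF_A ~ -1 fits without covariates
fit_acp <- acp(INF_A ~ -1, data = dat2, p = 2, q = 1)
summary(fit_acp)
```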
## Call:
## acp.formula(formula = INF_A ~ -1, data = dat2, p = 2, q = 1)
##
## Estimate StdErr t.value p.value
## a 0.428546 0.050691 8.4541 < 2.2e-16 ***
## b 1 0.611889 0.025860 23.6613 < 2.2e-16 ***
## b 2 0.247466 0.058656 4.2189 2.805e-05 ***
## c 1 0.065259 0.062034 1.0520 0.2932
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Logl:
## [1] -1836.097
##
## AIC:
## [1] 5.653139
##
## BIC:
## [1] 5.680656
##
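Finally, the glarma package fits generalized linear autoregressive moving average (GLARMA) models for count series. The sketch below reproduces the call shown in the output, with `dat_sub` assumed to be a vector of weekly counts:

```r
library(glarma)

# GLARMA with a Negative Binomial response, Fisher scoring ("FS"),
# Pearson residuals, and autoregressive terms at lags 1, 2, 6, and 12;
# X is an intercept-only design matrix
fit_glarma <- glarma(y = dat_sub, X = as.matrix(rep(1, length(dat_sub))),
                     type = "NegBin", method = "FS", residuals = "Pearson",
                     phiLags = c(1, 2, 6, 12), maxit = 100, grad = 1e-06)
fit_glarma
```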
## Call: glarma(y = dat_sub, X = as.matrix(rep(1, length(dat_sub))), type = "NegBin",
## method = "FS", residuals = "Pearson", phiLags = c(1, 2, 6,
## 12), maxit = 100, grad = 1e-06)
##
## Negative Binomial Parameter:
## <NA>
## 0.7384749
##
## GLARMA Coefficients:
## phi_2 phi_6 phi_12 alpha
## 0.42775671 0.19011039 0.07768772 -0.04133052
##
## Linear Model Coefficients:
## phi_1
## 0.8860162
##
## Degrees of Freedom: 644 Total (i.e. Null); 639 Residual
## Null Deviance: 611.7529
## Residual Deviance: 1866.685
## AIC: 2709.121
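To forecast from the GLARMA fit, the glarma package provides a forecast() method. A sketch matching the printed forecast object below (its `$n.ahead`, `$newdata`, and `$newoffset` components are visible in the output):

```r
# 12-step-ahead forecast; newdata is the intercept-only design matrix
# for the forecast horizon, matching the one used in fitting
fc <- forecast(fit_glarma, n.ahead = 12, newdata = matrix(1, 12, 1),
               newoffset = 0)
fc
```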
## $eta
## [1] 0.8860162 NA NA NA NA NA NA
## [8] NA NA NA NA NA
##
## $W
## [1] 1.673114 NA NA NA NA NA NA
## [8] NA NA NA NA NA
##
## $mu
## [1] 5.328733 NA NA NA NA NA NA
## [8] NA NA NA NA NA
##
## $Y
## [1] 4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
##
## $n.ahead
## [1] 12
##
## $newdata
## [,1]
## [1,] 1
## [2,] 1
## [3,] 1
## [4,] 1
## [5,] 1
## [6,] 1
## [7,] 1
## [8,] 1
## [9,] 1
## [10,] 1
## [11,] 1
## [12,] 1
##
## $newoffset
## [1] 0
##
## $newm
## [1] 1
##
## $model
##
## Call: glarma(y = dat_sub, X = as.matrix(rep(1, length(dat_sub))), type = "NegBin",
## method = "FS", residuals = "Pearson", phiLags = c(1, 2, 6,
## 12), maxit = 100, grad = 1e-06)
##
## Negative Binomial Parameter:
## <NA>
## 0.7384749
##
## GLARMA Coefficients:
## phi_2 phi_6 phi_12 alpha
## 0.42775671 0.19011039 0.07768772 -0.04133052
##
## Linear Model Coefficients:
## phi_1
## 0.8860162
##
## Degrees of Freedom: 644 Total (i.e. Null); 639 Residual
## Null Deviance: 611.7529
## Residual Deviance: 1866.685
## AIC: 2709.121
##
##
## attr(,"class")
## [1] "glarmaForecast"