Processing math: 100%

Abstract


This report’s objective is to find an adequate model to represent the percentage of US deaths attributable to Pneumonia and Influenza. I outline the model selection process, describe the resulting solution, and provide predictions for 70 future observations.

Based on the residual diagnostics and the AICc/BIC values I selected a seasonal ARIMA model with parameters (1,1,5) x (0,1,1)52. It appears to describe the model well and the predictions look reasonable.

Dataset Selection


Description

I chose to evaluate the percentage of all US deaths due to Pneumonia and Influenza. The data can be retrieved from the following CDC website:

Mortality Data

The dataset begins at week 40 in 2015 and continues until week 41 in 2019, for a total of 210 observations.

Motivation

I have a background in Finance and spent many years as an equity analyst. I primarily covered the healthcare sector, with a focus on healthcare services and medical devices. Trading in healthcare stocks sometimes requires an estimate of patient visits for the current period, but since I did not have a background in Statistics I relied mostly on conversations with hospital administrators. I often wondered if there was a better way to forecast, but until now I have never attempted to do so with time series models.

Unfortunately there is no weekly or monthly data available for healthcare utilization, and so I chose this dataset as a proxy.

Data Exploration & Transformation


Transformation

I began by viewing the time series plot of the dataset, as shown below. We can see that there is a clear seasonal pattern, with peaks during the winter months and troughs during the summer. We can also see that the variance from year to year is not constant, with a particularly strong flu season 2 years ago. To determine if a transformation would help stabilize the variance I ran a Box-Cox test, which suggested that an inverse transformation would be useful (value of -0.9999). We can see from the inverse plot below that some of the variance is lessened.

library(astsa)
library(forecast)
library(tseries)
library(TSA)
library(knitr)
library(kableExtra)

#Import Dataset
PIdths <- read.csv("~/Baruch/Baruch Class Projects/9701 Time Series/PIdeaths.csv")
PIdths = ts(PIdths) #convert to time series

#Plot the timeseries
plot.ts(PIdths,main="Percentage of all deaths due to pneumonia and influenza",ylab="% of Deaths")

#Determine whether transformation is necessary
BoxCox.lambda(PIdths)
## [1] -0.9999242
#result of -.9999 indicates inverse transformation 

InvPIdths = PIdths^-1 # Convert data to inverse

#Plot the inverse timeseries
plot.ts(InvPIdths,main="Inverse % of all deaths due to pneumonia and influenza",ylab="% of Deaths")

Next, I viewed both the ACF and PACF of the inversed data. We can see the seasonal pattern in the ACF, and a spike in the PACF at 52 weeks.

#Plot the sample ACF and PACF
acf(PIdths,100, main="")

pacf(PIdths,100)

Stationarity

To normalize the seasonal pattern I differenced the weekly data by 52. I then evaluated the Augmented Dickey-Fuller test along with the ndiffs function, which suggested that another difference would help make the data stationary. We can see from the resulting plot that the time series does appear to have a constant mean and relatively stable variance. I again evaluated the Augmented Dickey-Fuller test, this time on the differenced data (52 week difference followed by 1 week difference), and with a p-value of <0.01 we can reject the null hypothesis that the data is not stationary.

# 52 week data.  Clear seasonal trend, so take 52 week difference
dInvPIdths = diff(PIdths,52)

#Plot the 52 week differenced timeseries
plot.ts(dInvPIdths,main="Inverse % of all deaths due to P&I, 52 week diff",ylab="% of Deaths")

#Test whether data is stationary or if differencing is necessary
adf.test(dInvPIdths) # Augmented Dickey-Fuller test

    Augmented Dickey-Fuller Test

data:  dInvPIdths
Dickey-Fuller = -3.5047, Lag order = 5, p-value = 0.04429
alternative hypothesis: stationary
ndiffs(dInvPIdths)  # confirm with ndiffs.  Recommends 1 difference
[1] 1
ddInvPIdths = diff(dInvPIdths,1) #differece the data again

#Plot the double differenced timeseries
plot.ts(ddInvPIdths,main="Inverse % of all deaths due to P&I, diff 52 and 1",ylab="% of Deaths")

# plot looks stable with constant variance and mean

#Test whether data is stationary or if differencing is necessary
adf.test(ddInvPIdths) # Augmented Dickey-Fuller test

    Augmented Dickey-Fuller Test

data:  ddInvPIdths
Dickey-Fuller = -5.4507, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary
ndiffs(ddInvPIdths)  # confirm with ndiffs
[1] 0
# confirmed stationary

ACF and PACF review

I then viewed the ACF and PACF plots of the differenced data. We can see that the seasonal component (52 week lags) suggest a possible seasonal MA(1) model, since the ACF cuts off after lag 1 and PACF tails off. The ACF and PACF of the non-seasonal component do not have clear cut-off points, which seems to suggest that an ARMA model would be appropriate.

acf(ddInvPIdths,200, main="") # non seasonal ACF tails off, seasonal spike at 52, 54 weeks, cuts off

pacf(ddInvPIdths,200) #non seasonal PACF tails off, #seasonal PACF tails off

#Since the seasonal ACF cuts off and seaonal PACF tails off, use seasonal MA model
#Since non-seasonal ACF and PACF tail off, use non-seaonal ARMA
#Use 1 seasonal difference at 52 weeks and 1 non-seaonal difference

Model Selection


Evaluation

Based on the ACF/PACF analysis I began by evaluating seasonal ARIMA models with 1 non-seasonal difference, 1 seasonal difference, a seasonal MA(1) component, and an ARMA non-seasonal component. I began by evaluating the simplest model first, an ARIMA(1,1,1) x (0,1,1)52. I then continued evaluating different possibilities until I was satisfied with the residual analysis and AICc/BIC values.

#try simplest model first
fit1 <- sarima(InvPIdths,1,1,1,0,1,1,52) #diagnostics OK, but ACF of some resids significant
fit1$AICc 
[1] -5.658188
fit1$BIC 
[1] -5.59998
Model AICc BIC Residual Analysis
SARIMA(1,1,1) x (0,1,1)52 -5.658 -5.599 A few residuals with significant ACF values
SARIMA(1,1,1) x (0,1,2)52 -5.675 -5.602 A few residuals with significant ACF values
SARIMA(1,1,3) x (0,1,1)52 -5.647 -5.560 A few residuals with significant ACF values
SARIMA(1,1,5) x (0,1,1)52 -5.692 -5.605 Excellent. No significant ACF values. Ljung-Box all pass

Based on the results, I chose the SARIMA(1,1,5) x (0,1,1)52 model.

fit3 <- sarima(InvPIdths,1,1,5,0,1,1,52) #diagnostics perfect
fit3$ttable
     Estimate     SE  t.value p.value
ar1    0.8360 0.0730  11.4454  0.0000
ma1   -1.1995 0.1039 -11.5449  0.0000
ma2    0.4070 0.1308   3.1127  0.0022
ma3   -0.0703 0.1337  -0.5260  0.5997
ma4    0.1283 0.1140   1.1259  0.2620
ma5   -0.2450 0.0794  -3.0851  0.0024
sma1  -0.5311 0.1533  -3.4650  0.0007

Since the ma3 and ma4 coefficients were not significant, I eliminated them from the final model.

Residual Diagnostics

We can see that the final model’s residuals have no clear pattern, have no significant ACF values, and all have Ljung-Box p-values over 0.05.

Final Model

The seasonal autoregressive integrated moving average model is represented by:

SARIMA(p,d,q)x(P,D,Q)s

Φp(Bs)ϕ(B)DsdXt=δ+ΘQ(Bs)θ(B)wt

Where:
ΦP(Bs)=1Φ1BsΦ2B2s...ΦPBPs
ϕ(B)=1ϕ1Bϕ2B2...ϕPBP
Ds=(1Bs)D
d=(1B)d
ΘQ(Bs)=1+Θ1Bs+Θ2B2s+...+ΘQBQs
θ(B)=1+θ1B+θ2B2+...+θqBq

Model Coefficients

fit4 <- sarima(InvPIdths,1,1,5,0,1,1,52, fixed=c(NA,NA,NA,0,0,NA,NA))
fit4$ttable
     Estimate     SE  t.value p.value
ar1    0.8463 0.0724  11.6970   0e+00
ma1   -1.2164 0.1073 -11.3325   0e+00
ma2    0.4065 0.0879   4.6228   0e+00
ma5   -0.1755 0.0464  -3.7794   2e-04
sma1  -0.5230 0.1514  -3.4541   7e-04

Based on the resulting coefficients I was able to obtain the following model:
SARIMA(1,1,5)x(0,1,1)52
(10.85B)(1B52)(1B)zt=(10.52B52)(11.22B+0.41B20.18B5)wt zt=+1.85zt10.85zt2+zt521.85zt53+0.85zt54 +wt1.22wt1+0.41wt20.18wt50.52wt52+0.63wt530.21wt54+0.09wt57

Where zt=1xt

Model Forecasting


# predict forward 70
fit5 <-sarima.for(InvPIdths,70,1,1,5,0,1,1,52,fixed=c(NA,NA,NA,0,0,NA,NA))
mod_fore <- fit5$pred^-1 # convert back to non-inverse
Mod_se1 <- (fit5$pred - 1.96*fit5$se)^-1 #95% interval, non-inverse
Mod_se2 <- (fit5$pred + 1.96*fit5$se)^-1 #95% interval, non-inverse


ts.plot(PIdths,xlim=c(0,275), ylim=c(3,11),main="Percentage of all deaths due to P&I forecasted ahead 70",ylab="% of Deaths") #plot original data
points(mod_fore,type="l",col=2) #plot predictions

points(Mod_se1, type="l", col=4, lty=4) #plot 95% bands
points(Mod_se2, type="l", col=4, lty=4) #plot 95% bands

I used the resulting model to predict forward 70 weeks. The red line represents the model’s prediction and the blue lines represent the 95% confidence interval. It appears to reflect the seasonal pattern very well.

#first 125 datapoints
InvPIdthsA = InvPIdths[1:150,]
length(InvPIdthsA)
## [1] 150
length(PIdths)
## [1] 210
# predict forward 60 from shortened data and compare to actual data
fit6 <-sarima.for(InvPIdthsA,60,1,1,5,0,1,1,52,fixed=c(NA,NA,NA,0,0,NA,NA))
mod_fore <- fit6$pred^-1 # convert back to non-inverse
Mod_se1 <- (fit6$pred - 1.96*fit6$se)^-1 #95% interval, non-inverse
Mod_se2 <- (fit6$pred + 1.96*fit6$se)^-1 #95% interval, non-inverse

ts.plot(PIdths,xlim=c(0,225), ylim=c(3,11),main="Percentage of all deaths due to P&I vs prediction",ylab="% of Deaths") #plot original data
points(mod_fore,type="l",col=2) #plot predictions

I also eliminated the last 60 values of the dataset and predicted forward 60. I then compared the prediction to the actual values. Most of the predicted values align well, but the model does predict a higher peak.