This report’s objective is to find an adequate model to represent the percentage of US deaths attributable to Pneumonia and Influenza. I outline the model selection process, describe the resulting solution, and provide predictions for 70 future observations.
Based on the residual diagnostics and the AICc/BIC values, I selected a seasonal ARIMA model with parameters (1,1,5) x (0,1,1)52. It appears to describe the data well, and the predictions look reasonable.
I chose to evaluate the percentage of all US deaths due to Pneumonia and Influenza. The data can be retrieved from the CDC website.
The dataset begins at week 40 in 2015 and continues until week 41 in 2019, for a total of 210 observations.
I have a background in Finance and spent many years as an equity analyst. I primarily covered the healthcare sector, with a focus on healthcare services and medical devices. Trading in healthcare stocks sometimes requires an estimate of patient visits for the current period, but since I did not have a background in Statistics I relied mostly on conversations with hospital administrators. I often wondered if there was a better way to forecast, but until now I have never attempted to do so with time series models.
Unfortunately, no weekly or monthly healthcare-utilization data is available, so I chose this dataset as a proxy.
I began by viewing the time series plot of the dataset, as shown below. We can see a clear seasonal pattern, with peaks during the winter months and troughs during the summer. We can also see that the variance from year to year is not constant, with a particularly strong flu season two years ago. To determine whether a transformation would help stabilize the variance, I estimated the Box-Cox transformation parameter, which at -0.9999 (effectively -1) suggests an inverse transformation. The plot of the inverse-transformed series below shows that some of the variance is lessened.
library(astsa)
library(forecast)
library(tseries)
library(TSA)
library(knitr)
library(kableExtra)
#Import Dataset
PIdths <- read.csv("~/Baruch/Baruch Class Projects/9701 Time Series/PIdeaths.csv")
PIdths = ts(PIdths) #convert to time series
#Plot the timeseries
plot.ts(PIdths,main="Percentage of all deaths due to pneumonia and influenza",ylab="% of Deaths")
#Determine whether transformation is necessary
BoxCox.lambda(PIdths)
## [1] -0.9999242
#result of -.9999 indicates inverse transformation
InvPIdths = PIdths^-1 # Convert data to inverse
#Plot the inverse timeseries
plot.ts(InvPIdths,main="Inverse % of all deaths due to pneumonia and influenza",ylab="Inverse % of Deaths")
Next, I viewed the ACF and PACF of the inverse-transformed data. We can see the seasonal pattern in the ACF and a spike in the PACF at 52 weeks.
#Plot the sample ACF and PACF
acf(InvPIdths,100, main="")
pacf(InvPIdths,100)
To normalize the seasonal pattern I differenced the weekly data at lag 52. I then ran the Augmented Dickey-Fuller test along with the ndiffs function; the ADF p-value (0.044) was only marginally significant, and ndiffs recommended one additional difference to make the data stationary. The resulting plot shows a series with a roughly constant mean and relatively stable variance. I ran the Augmented Dickey-Fuller test again on the twice-differenced data (a 52-week difference followed by a 1-week difference), and with a p-value below 0.01 we can reject the null hypothesis that the data is non-stationary.
# Clear seasonal pattern, so take a 52 week difference of the inverse-transformed data
dInvPIdths = diff(InvPIdths,52)
#Plot the 52 week differenced timeseries
plot.ts(dInvPIdths,main="Inverse % of all deaths due to P&I, 52 week diff",ylab="% of Deaths")
#Test whether data is stationary or if differencing is necessary
adf.test(dInvPIdths) # Augmented Dickey-Fuller test
Augmented Dickey-Fuller Test
data: dInvPIdths
Dickey-Fuller = -3.5047, Lag order = 5, p-value = 0.04429
alternative hypothesis: stationary
ndiffs(dInvPIdths) # confirm with ndiffs. Recommends 1 difference
[1] 1
ddInvPIdths = diff(dInvPIdths,1) #difference the data again
#Plot the double differenced timeseries
plot.ts(ddInvPIdths,main="Inverse % of all deaths due to P&I, diff 52 and 1",ylab="% of Deaths")
# plot looks stable with constant variance and mean
#Test whether data is stationary or if differencing is necessary
adf.test(ddInvPIdths) # Augmented Dickey-Fuller test
Augmented Dickey-Fuller Test
data: ddInvPIdths
Dickey-Fuller = -5.4507, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary
ndiffs(ddInvPIdths) # confirm with ndiffs
[1] 0
# confirmed stationary
I then viewed the ACF and PACF plots of the differenced data. The seasonal component (lags at multiples of 52 weeks) suggests a seasonal MA(1) model, since the seasonal ACF cuts off after lag 1 and the seasonal PACF tails off. The ACF and PACF of the non-seasonal component have no clear cut-off points, which suggests an ARMA model for the non-seasonal part.
acf(ddInvPIdths,200, main="") # non-seasonal ACF tails off; seasonal ACF spikes near lag 52, then cuts off
pacf(ddInvPIdths,200) # non-seasonal PACF tails off; seasonal PACF tails off
#Since the seasonal ACF cuts off and the seasonal PACF tails off, use a seasonal MA model
#Since the non-seasonal ACF and PACF both tail off, use a non-seasonal ARMA
#Use 1 seasonal difference at 52 weeks and 1 non-seasonal difference
Based on the ACF/PACF analysis, I evaluated seasonal ARIMA models with one non-seasonal difference, one seasonal difference, a seasonal MA(1) component, and a non-seasonal ARMA component. I started with the simplest model, an ARIMA(1,1,1) x (0,1,1)52, and then evaluated further candidates until I was satisfied with the residual analysis and the AICc/BIC values.
#try simplest model first
fit1 <- sarima(InvPIdths,1,1,1,0,1,1,52) #diagnostics OK, but ACF of some resids significant
fit1$AICc
[1] -5.658188
fit1$BIC
[1] -5.59998
Model | AICc | BIC | Residual Analysis |
---|---|---|---|
SARIMA(1,1,1) x (0,1,1)52 | -5.658 | -5.600 | A few residuals with significant ACF values |
SARIMA(1,1,1) x (0,1,2)52 | -5.675 | -5.602 | A few residuals with significant ACF values |
SARIMA(1,1,3) x (0,1,1)52 | -5.647 | -5.560 | A few residuals with significant ACF values |
SARIMA(1,1,5) x (0,1,1)52 | -5.692 | -5.605 | Excellent. No significant ACF values. Ljung-Box all pass |
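For reference, the remaining candidates in the table above can be fit the same way; a minimal sketch of the comparison loop (details=FALSE suppresses sarima's per-model output, and the candidate list simply mirrors the table):
#Fit each candidate SARIMA specification and collect AICc/BIC
candidates <- list(c(1,1,1,0,1,1), c(1,1,1,0,1,2), c(1,1,3,0,1,1), c(1,1,5,0,1,1))
results <- t(sapply(candidates, function(o) {
  fit <- sarima(InvPIdths, o[1],o[2],o[3], o[4],o[5],o[6], 52, details=FALSE)
  c(AICc = fit$AICc, BIC = fit$BIC)
}))
rownames(results) <- c("(1,1,1)x(0,1,1)","(1,1,1)x(0,1,2)","(1,1,3)x(0,1,1)","(1,1,5)x(0,1,1)")
results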
Based on the results, I chose the SARIMA(1,1,5) x (0,1,1)52 model.
fit3 <- sarima(InvPIdths,1,1,5,0,1,1,52) #diagnostics perfect
fit3$ttable
Estimate SE t.value p.value
ar1 0.8360 0.0730 11.4454 0.0000
ma1 -1.1995 0.1039 -11.5449 0.0000
ma2 0.4070 0.1308 3.1127 0.0022
ma3 -0.0703 0.1337 -0.5260 0.5997
ma4 0.1283 0.1140 1.1259 0.2620
ma5 -0.2450 0.0794 -3.0851 0.0024
sma1 -0.5311 0.1533 -3.4650 0.0007
Since the ma3 and ma4 coefficients were not significant, I eliminated them from the final model.
We can see that the final model’s residuals have no clear pattern, have no significant ACF values, and all have Ljung-Box p-values over 0.05.
The seasonal autoregressive integrated moving average model, SARIMA$(p,d,q)\times(P,D,Q)_s$, is represented by:

$$\Phi_P(B^s)\,\phi(B)\,\nabla_s^D\nabla^d X_t = \delta + \Theta_Q(B^s)\,\theta(B)\,w_t$$

Where:

$$\Phi_P(B^s) = 1 - \Phi_1 B^s - \Phi_2 B^{2s} - \dots - \Phi_P B^{Ps}$$

$$\phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \dots - \phi_p B^p$$

$$\nabla_s^D = (1 - B^s)^D$$

$$\nabla^d = (1 - B)^d$$

$$\Theta_Q(B^s) = 1 + \Theta_1 B^s + \Theta_2 B^{2s} + \dots + \Theta_Q B^{Qs}$$

$$\theta(B) = 1 + \theta_1 B + \theta_2 B^2 + \dots + \theta_q B^q$$
fit4 <- sarima(InvPIdths,1,1,5,0,1,1,52, fixed=c(NA,NA,NA,0,0,NA,NA))
fit4$ttable
Estimate SE t.value p.value
ar1 0.8463 0.0724 11.6970 0e+00
ma1 -1.2164 0.1073 -11.3325 0e+00
ma2 0.4065 0.0879 4.6228 0e+00
ma5 -0.1755 0.0464 -3.7794 2e-04
sma1 -0.5230 0.1514 -3.4541 7e-04
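As a supplementary check on the Ljung-Box statistics shown in sarima's diagnostic panel, the test can also be run directly on the final model's residuals; a minimal sketch (the choice of lag = 12, and fitdf = 5 for the five estimated coefficients, are my assumptions):
resids <- resid(fit4$fit) # residuals of the final fitted model
Box.test(resids, lag = 12, type = "Ljung-Box", fitdf = 5) # H0: residuals are white noise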
Based on the resulting coefficients I was able to obtain the following model:
SARIMA$(1,1,5)\times(0,1,1)_{52}$:

$$(1 - 0.85B)(1 - B^{52})(1 - B)z_t = (1 - 0.52B^{52})(1 - 1.22B + 0.41B^2 - 0.18B^5)w_t$$

Expanding:

$$z_t = 1.85z_{t-1} - 0.85z_{t-2} + z_{t-52} - 1.85z_{t-53} + 0.85z_{t-54} + w_t - 1.22w_{t-1} + 0.41w_{t-2} - 0.18w_{t-5} - 0.52w_{t-52} + 0.63w_{t-53} - 0.21w_{t-54} + 0.09w_{t-57}$$

where $z_t = 1/x_t$.
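The expanded moving average coefficients above can be double-checked by multiplying the two MA polynomials; a small sketch using base R's convolve for polynomial multiplication (the coefficient vectors are in ascending powers of B):
# Coefficient vectors in ascending powers of B
ma  <- c(1, -1.22, 0.41, 0, 0, -0.18) # 1 - 1.22B + 0.41B^2 - 0.18B^5
sma <- c(1, rep(0, 51), -0.52)        # 1 - 0.52B^52
prod_coefs <- convolve(ma, rev(sma), type = "open") # polynomial product via convolution
names(prod_coefs) <- paste0("B^", seq_along(prod_coefs) - 1)
round(prod_coefs[abs(prod_coefs) > 1e-8], 2) # nonzero terms match the expansion above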
# predict forward 70 weeks
fit5 <- sarima.for(InvPIdths,70,1,1,5,0,1,1,52,fixed=c(NA,NA,NA,0,0,NA,NA))
mod_fore <- fit5$pred^-1 # convert back from the inverse scale
Mod_se1 <- (fit5$pred - 1.96*fit5$se)^-1 # lower inverse-scale band becomes the upper 95% band
Mod_se2 <- (fit5$pred + 1.96*fit5$se)^-1 # upper inverse-scale band becomes the lower 95% band
ts.plot(PIdths,xlim=c(0,275), ylim=c(3,11),main="Percentage of all deaths due to P&I forecasted ahead 70",ylab="% of Deaths") #plot original data
points(mod_fore,type="l",col=2) #plot predictions
points(Mod_se1, type="l", col=4, lty=4) #plot 95% bands
points(Mod_se2, type="l", col=4, lty=4) #plot 95% bands
I used the resulting model to predict forward 70 weeks. The red line represents the model's prediction and the blue lines represent the 95% prediction interval (inverting the series back to the original scale flips the upper and lower bands). The forecast appears to reflect the seasonal pattern very well.
#first 150 datapoints
InvPIdthsA = InvPIdths[1:150,]
length(InvPIdthsA)
## [1] 150
length(PIdths)
## [1] 210
# predict forward 60 from shortened data and compare to actual data
fit6 <-sarima.for(InvPIdthsA,60,1,1,5,0,1,1,52,fixed=c(NA,NA,NA,0,0,NA,NA))
mod_fore <- fit6$pred^-1 # convert back to non-inverse
Mod_se1 <- (fit6$pred - 1.96*fit6$se)^-1 #95% interval, non-inverse
Mod_se2 <- (fit6$pred + 1.96*fit6$se)^-1 #95% interval, non-inverse
ts.plot(PIdths,xlim=c(0,225), ylim=c(3,11),main="Percentage of all deaths due to P&I vs prediction",ylab="% of Deaths") #plot original data
points(mod_fore,type="l",col=2) #plot predictions
Finally, I removed the last 60 values of the dataset and predicted forward 60 weeks from the shortened series, then compared the predictions to the actual values. Most of the predicted values align well, though the model predicts a higher peak than actually occurred.
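To quantify this holdout comparison, simple error metrics could be computed between the predictions and the withheld observations; a minimal sketch, assuming PIdths holds a single column so that the subsetting below drops to a plain vector (as in the code above):
actual <- PIdths[151:210,] # the 60 withheld observations
pred <- as.numeric(mod_fore) # predictions on the original (non-inverse) scale
rmse <- sqrt(mean((pred - actual)^2)) # root mean squared error
mape <- mean(abs(pred - actual)/actual)*100 # mean absolute percentage error
c(RMSE = rmse, MAPE = mape)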