Data Introduction

I decided to use the COVID-19 data published by the Institute for Health Metrics and Evaluation (IHME). The data is published to the web to help policymakers plan around the uncertainty surrounding COVID-19, and it is collected from local and national governments, hospitals and associations, the WHO, and other sources.

They “use data from government websites for Brazil, Canada, France, Germany, Italy, Japan, Mexico, Pakistan, South Africa, Spain, United Kingdom, and the US states of Indiana, Illinois, Maryland, and Washington”, while data for New York is collected from the New York Times GitHub repository. I chose this dataset because of the importance of understanding the COVID-19 pandemic and the factors influencing its transmission. The topic is closely connected to my everyday life, and analyzing this data will help me understand the problem so that I can help those around me stay safer and engage with my community in a more informed way.

I will specifically be analyzing the time series of “Mean Daily Cases” as the variable of interest for this analysis, looking at the global aggregate rather than any country-specific data.

What drives variation in Daily Cases?

The data was generated by disease transmission, so variation in the variable is tied heavily to adherence to pandemic protocols (or lack thereof due to fatigue), as well as to variants of the virus mutating to become more or less infectious or harmful. Localized quarantine procedures have been observed to be effective at containing the virus (Royal Society for Public Health, Volume 189), and disinfecting procedures have been observed to limit exposure in non-sterile environments (Vardoulakis et al.), though high levels of pandemic-related stress (Park et al.) may erode willingness to adhere to public health guidelines. This will be a very difficult variable to forecast because of the interaction between the nature of the virus itself and the behavior of the population. The potential for new variants to emerge from other parts of the globe heightens an already dangerous situation, given how closely interconnected the world's populations are.

Disclaimer - the articles cited above may be paywalled to non-UC users.

Date Range

The maximum date value contained in this dataset is in May of 2022 - the data comes from the reference scenario produced by IHME, meaning the tail of the series consists of IHME's own projections of daily cases. For the purposes of this analysis, I will remove all values following today's date (2/24/2022) and will treat the remaining series as ground truth in order to test forecasting methods, though this will not become necessary until later in the process, when I am no longer simply exploring the data and begin evaluating forecasts against it.
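In code, that truncation is a single filter, applied once the working data frame is assembled in the EDA section below:

data <- subset(data, as.Date(date) <= as.Date("2022-02-24"))  # drop IHME's forward projections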

Section 1 - EDA

Exploratory Graphics

library(dplyr)       # data wrangling verbs used throughout
library(kableExtra)  # kbl()/kable_classic() table styling

data <- df[, c(2, 4, 23)]                        # keep date, location_name, cases_mean
data <- subset(data, location_name == "Global")  # global aggregate only
data$cases_mean[data$cases_mean == 0] <- 0.1     # guard against zeros ahead of log/Box-Cox transforms
unique(data$location_name)
## [1] "Global"
summary(data)%>%
  kbl(caption = "Summary Table for Global Daily Mean COVID-19 Cases Data") %>%
  kable_classic(full_width = F, html_font = "Cambria")
Summary Table for Global Daily Mean COVID-19 Cases Data
     date            location_name        cases_mean
 Length:486         Length:486         Min.   :  53214
 Class :character   Class :character   1st Qu.: 420254
 Mode  :character   Mode  :character   Median : 560434
                                       Mean   : 882330
                                       3rd Qu.: 703547
                                       Max.   :5410623

The above table displays summary statistics for the variables retained from the dataset. In the HTML submission file you can scroll through the columns of the full table; I trimmed it down here for brevity, given the high number of columns in the source data.

library(kableExtra)

head(data)  %>% 
  kbl(caption = "Mean Cases vs. Date") %>%
  kable_classic(full_width = F, html_font = "Cambria")
Mean Cases vs. Date
date location_name cases_mean
2021-01-01 Global 635601.5
2021-01-02 Global 646073.9
2021-01-03 Global 657019.8
2021-01-04 Global 665052.6
2021-01-05 Global 680489.2
2021-01-06 Global 708657.8

Above we can see summary statistics for our data as well as its first six records.

Missingness of Data

# Build the proportion-null table directly with a clean column name
nulls <- data.frame('Proportion Null' = round(colSums(is.na(data)) / nrow(data), digits = 2),
                    check.names = FALSE)
names(nulls)
## [1] "Proportion Null"
nulls %>% 
 kbl(caption = "Proportion of Missing Values by Variable") %>%
  kable_styling(full_width = F) %>% 
  kable_paper("hover", full_width = F) %>%
  column_spec(1, bold = TRUE, border_right = TRUE, color = "black", background = "lightgrey")
Proportion of Missing Values by Variable
Proportion Null
date 0
location_name 0
cases_mean 0

Our data isn’t missing any values in the time series of interest.

Time Series Plot

library(magrittr)
library(ggplot2)
data$date <- as.Date(data$date)
data %>%
  ggplot()+
      geom_line(aes(date,cases_mean))+
      theme_bw()+
      ggtitle("Mean Cases Over Time")+
      ylab("Mean Cases")+
      xlab("Date")

Linear Regression Model

Fit a linear regression predicting your outcome (the time-series) as a function of time (measured in hours, days, weeks, etc.). Is there a statistically significant impact of time on your outcome? Present the results of the regression in a publication-quality table, and interpret the output. How much of the variation in the outcome is explained by a time trend? Does it look like a linear fit would do a good job of predicting the behavior of the time series?

library(sjPlot)
model <- lm(cases_mean ~ date, data = data)
tab_model(model)  # tab_model takes the fitted model object, not its summary
  cases mean
Predictors Estimates CI p
(Intercept) -44765600.99 -56900516.01 – -32630685.96 <0.001
date 2419.01 1775.97 – 3062.06 <0.001
Observations 486
R2 / R2 adjusted 0.101 / 0.100

The time coefficient is statistically significant (p < 0.001), but the model's R² and adjusted R² are very low: a linear time trend explains only about 10% of the variation in mean daily cases. A straight-line fit would therefore do a poor job of predicting the behavior of the series, though a more finely tuned model may perform better.
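To see how poorly a straight line tracks the series, we can overlay the fitted trend on the raw data; a quick sketch using ggplot2's built-in linear smoother:

data %>%
  ggplot(aes(date, cases_mean)) +
  geom_line() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +  # the same linear trend fit above
  theme_bw() +
  ggtitle("Mean Cases with Fitted Linear Trend") +
  ylab("Mean Cases") +
  xlab("Date")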

ARIMA Modeling

library(magrittr)
library(ggplot2)
data$date <- as.Date(data$date)
data %>%
  ggplot()+
      geom_line(aes(date,cases_mean))+
      theme_bw()+
      ggtitle("Mean Cases Over Time")+
      ylab("Mean Cases")+
      xlab("Date")

There are observable trends in the data in terms of rising average case numbers worldwide, with what appear to be roughly six-month humps in cases and a recent high driven by the widespread infectiousness of Omicron. Given this apparent periodicity, we are unlikely to observe stationarity: there seems to be a regular cadence at which cases begin to rise. Seasonality may not be the right word for the pattern, since the disease is so new that only a short span of data exists.
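One informal check before formal testing is to plot a rolling mean and standard deviation: if either drifts over time, the series is unlikely to be stationary. A sketch using the zoo package (an assumption; any rolling-window helper would do):

library(zoo)

data %>%
  mutate(roll_mean = zoo::rollmean(cases_mean, k = 30, fill = NA),     # 30-day rolling mean
         roll_sd = zoo::rollapply(cases_mean, 30, sd, fill = NA)) %>%  # 30-day rolling sd
  ggplot() +
  geom_line(aes(date, cases_mean), alpha = 0.3) +
  geom_line(aes(date, roll_mean), color = "blue") +
  geom_line(aes(date, roll_sd), color = "red") +
  theme_bw() +
  ggtitle("30-Day Rolling Mean (blue) and SD (red) of Mean Daily Cases")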

data = data %>% 
  mutate(lag1_cases = dplyr::lag(cases_mean), 
         lag2_cases = dplyr::lag(cases_mean,2), 
         lag3_cases = dplyr::lag(cases_mean,3))
head(data) %>%
  kbl(caption = "First Six Rows in the New Dataset") %>%
  kable_styling(full_width = F) %>% 
  kable_paper("hover", full_width = F) %>%
  kable_classic(full_width = F, html_font = "Cambria")
First Six Rows in the New Dataset
date location_name cases_mean lag1_cases lag2_cases lag3_cases
2021-01-01 Global 635601.5 NA NA NA
2021-01-02 Global 646073.9 635601.5 NA NA
2021-01-03 Global 657019.8 646073.9 635601.5 NA
2021-01-04 Global 665052.6 657019.8 646073.9 635601.5
2021-01-05 Global 680489.2 665052.6 657019.8 646073.9
2021-01-06 Global 708657.8 680489.2 665052.6 657019.8

Differencing and Transformations

cases_diff = data %>%
  mutate(case_diff = cases_mean - lag(cases_mean))

cases_diff %>% 
  select(date,cases_mean,case_diff) %>% 
  head()  %>% 
  kbl(caption = "Head of Differenced Case Data") %>%
  kable_classic(full_width = F, html_font = "Cambria")
Head of Differenced Case Data
date cases_mean case_diff
2021-01-01 635601.5 NA
2021-01-02 646073.9 10472.43
2021-01-03 657019.8 10945.96
2021-01-04 665052.6 8032.74
2021-01-05 680489.2 15436.63
2021-01-06 708657.8 28168.53
cases_diff %>%
  ggplot()+
      geom_line(aes(date,case_diff))+
      theme_bw()+
      ggtitle("Mean Daily Cases (First Difference)")+
      ylab("Mean Cases (Difference))")+
      xlab("Date")

library(tidyr)
cases_diff = cases_diff %>%
  mutate(case_log = log1p(case_diff)) %>% 
  mutate(case_log_diff = case_log - lag(case_log))

nan_df <- subset(cases_diff, is.na(case_log_diff))  # rows where the log of a negative difference is undefined

cases_diff %>%
  ggplot()+
      geom_line(aes(date,case_log_diff))+
      theme_bw()+
      ggtitle("Mean Daily Cases (Log; First Difference)")+
      ylab("Mean Cases (Difference))")+
      xlab("Date")

cases_log_diff <- data %>%
 arrange(date) %>%
  mutate(
    cases_log = log1p(cases_mean), # Log for Variance Stationarity
    cases_diff = cases_mean - lag(cases_mean),
    cases_log_diff = cases_log - lag(cases_log) # Difference for Mean Stationarity
  ) %>%
  drop_na() 

cases_log_diff %>%
  ggplot() +
  geom_line(aes(date, cases_log_diff)) +
  theme_bw() +
  ggtitle("Difference in Log Values, Mean Daily Cases over Time") +
  ylab("Mean Daily Cases (Difference))") +
  xlab("Date")

In the above section I attempt to create a more stationary version of the time series so that it will be easier to forecast. I've tried differencing the data (accounting for lag) and applying log transformations, though neither appears to have made the series fully stationary on its own.

Statistical Testing

  • ADF test: (null: non-stationary) p-value <0.05 indicates stationary.
  • KPSS test: (null: stationary) p-value <0.05 indicates non-stationary.
library(tseries) # Use tseries package for adf.test function
adf_mean <- adf.test(cases_log_diff$cases_mean) # Raw Close Value
adf_diff <- adf.test(cases_log_diff$cases_diff) # First Difference
adf_log <- adf.test(cases_log_diff$cases_log) # Log Close Value
adf_log_diff <- adf.test(cases_log_diff$cases_log_diff) # Differenced Log Close Value
kp_mean <- kpss.test(cases_log_diff$cases_mean) # Raw close value
kp_diff <- kpss.test(cases_log_diff$cases_diff) # First difference
kp_log <- kpss.test(cases_log_diff$cases_log) # Log close value
kp_log_diff <- kpss.test(cases_log_diff$cases_log_diff) # Differenced log close value

stats <- matrix(c("ADF","Log Transform",adf_log$p.value,"Stationary",
                "KPSS","Log Transform",kp_log$p.value,"Non-stationary",
                "ADF","First Difference",adf_diff$p.value,"Stationary",
                "KPSS","First Difference",kp_diff$p.value,"Stationary"),
              ncol=4,byrow=TRUE)

stats %>%
  kbl(caption = "Table 2: ADF/KPSS Tests on Log Transform vs. First Difference") %>%
  kable_classic(full_width = F, html_font = "Cambria")
Table 2: ADF/KPSS Tests on Log Transform vs. First Difference
ADF    Log Transform      0.0190727806666848   Stationary
KPSS   Log Transform      0.0279839115781265   Non-stationary
ADF    First Difference   0.01                 Stationary
KPSS   First Difference   0.1                  Stationary

After running ADF (Augmented Dickey-Fuller) tests on the raw mean daily cases as well as the first difference, the log of cases, and the log difference, we can see that the p-values for the first difference and the log-transformed series fall below 0.05: there is sufficient evidence to reject the ADF null of non-stationarity and conclude that those series are stationary.

The KPSS test tells a slightly different story: the log transform rejects the KPSS null of stationarity (p ≈ 0.028), while the first difference fails to reject it (p = 0.1) and so looks trend stationary. Taken together, the tests favor working with differenced data; since the ARIMA specifications compared below all difference the series at least once, the log and Box-Cox transformations serve mainly to stabilize the variance on top of that differencing.

Autocorrelation Plots

Above are the ACF and PACF charts for the first-order differences and log transformations of daily mean cases. The log transformation appears to be the most effective, but it may be worth exploring some second transformations to see whether they reduce the autocorrelation visible in the ACF plots.
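For reference, a minimal sketch of the code behind those panels, using the fields from cases_log_diff defined above:

par(mfrow = c(2, 2))
acf(cases_log_diff$cases_diff, lag.max = 20, main = "ACF: First Difference")
pacf(cases_log_diff$cases_diff, lag.max = 20, main = "PACF: First Difference")
acf(cases_log_diff$cases_log, lag.max = 20, main = "ACF: Log Transform")
pacf(cases_log_diff$cases_log, lag.max = 20, main = "PACF: Log Transform")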

2nd Transformations

cases_log_diff <- data %>%
 arrange(date) %>%
  mutate(
    cases_log = log1p(cases_mean), # Log for Variance Stationarity
    cases_diff = cases_mean - lag(cases_mean),
    cases_log_diff = cases_log - lag(cases_log), # Difference for Mean Stationarity
    cases_diff2 = cases_mean - lag(cases_mean,2),
    cases_log_diff2 = cases_log - lag(cases_log, 2)
  ) %>%
  drop_na() 

Above I've calculated two-period (lag-2) differences of the raw and log-transformed mean daily cases; note these are differences at lag 2 rather than twice-differenced series. I'll run the KPSS and ADF tests on the new fields and graph their ACF and PACF plots, as in the sketch below.
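A sketch of those checks on the new lag-2 fields, mirroring the tests run earlier:

# Stationarity tests on the two-period differences
adf.test(cases_log_diff$cases_diff2)
kpss.test(cases_log_diff$cases_diff2)
adf.test(cases_log_diff$cases_log_diff2)
kpss.test(cases_log_diff$cases_log_diff2)

# Autocorrelation structure of the two-period differences
par(mfrow = c(2, 2))
acf(cases_log_diff$cases_diff2, lag.max = 20)
pacf(cases_log_diff$cases_diff2, lag.max = 20)
acf(cases_log_diff$cases_log_diff2, lag.max = 20)
pacf(cases_log_diff$cases_log_diff2, lag.max = 20)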

The transformations calculated above show stronger autocorrelation than the first-degree transformations, so I will proceed with the first-degree transformations only.

cases_log_diff %>%
  ggplot()+
      geom_line(aes(date,cases_log))+
      theme_bw()+
      ggtitle("Mean Daily Cases (Log Transformed)")+
      ylab("Mean Cases (Log))")+
      xlab("Date")

data$cases_mean[data$cases_mean == 0] <- 0.1  # Box-Cox requires strictly positive values (guard repeated from above)
cases_boxcox = data %>%
 # filter(date>=ymd('2009-01-01') & date<=ymd("2015-01-01")) %>%
  mutate(case_boxcox = forecast::BoxCox(cases_mean,lambda='auto'))

min(data$cases_mean)
## [1] 53214.16
cases_boxcox %>%
  ggplot()+
      geom_line(aes(date,case_boxcox))+
      theme_bw()+
      ggtitle("Mean Cases over Time (Box-Cox Transformation)")+
      ylab("Mean Cases (Box-Cox Tranformation)")+
      xlab("Date")

Above I've calculated a Box-Cox transformation of mean daily cases in an attempt to find a more forecastable representation. Below I build a variety of ARIMA models with different orders and evaluate them against one another.

Model Selection

BIC(
  arima(cases_boxcox$case_boxcox,order=c(0,1,1)),
  arima(cases_boxcox$case_boxcox,order=c(0,1,2)),
  arima(cases_boxcox$case_boxcox,order=c(0,1,3)),
  arima(cases_boxcox$case_boxcox,order=c(1,1,0)),
  arima(cases_boxcox$case_boxcox,order=c(1,1,1)),
  arima(cases_boxcox$case_boxcox,order=c(2,1,0)),
  arima(cases_boxcox$case_boxcox,order=c(3,1,0))
) %>% 
  kbl(caption = "Comparing Model BIC") %>%
  kable_classic(full_width = F, html_font = "Cambria")
Comparing Model BIC
                                                     df       BIC
arima(cases_boxcox$case_boxcox, order = c(0, 1, 1))   2   -3680.105
arima(cases_boxcox$case_boxcox, order = c(0, 1, 2))   3   -4252.574
arima(cases_boxcox$case_boxcox, order = c(0, 1, 3))   4   -4573.805
arima(cases_boxcox$case_boxcox, order = c(1, 1, 0))   2   -4930.274
arima(cases_boxcox$case_boxcox, order = c(1, 1, 1))   3   -5234.642
arima(cases_boxcox$case_boxcox, order = c(2, 1, 0))   3   -5046.246
arima(cases_boxcox$case_boxcox, order = c(3, 1, 0))   4   -5116.265

The best of these is the ARIMA(1,1,1) model, with the lowest BIC (-5234.642). Let's compare those results against the log-transformed ARIMA models.

BIC(
  arima(cases_log_diff$cases_log,order=c(0,1,1)),
  arima(cases_log_diff$cases_log,order=c(0,1,2)),
  arima(cases_log_diff$cases_log,order=c(0,1,3)),
  arima(cases_log_diff$cases_log,order=c(1,1,0)),
  arima(cases_log_diff$cases_log,order=c(1,1,1)),
  arima(cases_log_diff$cases_log,order=c(2,1,0)),
  arima(cases_log_diff$cases_log,order=c(3,1,0))
) %>% 
  kbl(caption = "Comparing Model BIC") %>%
  kable_classic(full_width = F, html_font = "Cambria")
Comparing Model BIC
                                                    df       BIC
arima(cases_log_diff$cases_log, order = c(0, 1, 1))  2   -2651.339
arima(cases_log_diff$cases_log, order = c(0, 1, 2))  3   -3224.673
arima(cases_log_diff$cases_log, order = c(0, 1, 3))  4   -3568.429
arima(cases_log_diff$cases_log, order = c(1, 1, 0))  2   -3893.528
arima(cases_log_diff$cases_log, order = c(1, 1, 1))  3   -4202.144
arima(cases_log_diff$cases_log, order = c(2, 1, 0))  3   -4018.758
arima(cases_log_diff$cases_log, order = c(3, 1, 0))  4   -4089.790

These models don't perform as well as the Box-Cox transformed models according to the BIC criterion. Before settling on a single model, however, I will also fit some auto.arima models to see whether there are better options I haven't considered.

library(forecast)
auto.arima(cases_boxcox$case_boxcox,stationary=FALSE,allowdrift=FALSE,
seasonal=FALSE,stepwise=FALSE,approximation=FALSE)
## Series: cases_boxcox$case_boxcox 
## ARIMA(0,2,5) 
## 
## Coefficients:
##          ma1      ma2      ma3    ma4      ma5
##       0.9648  -0.0360  -0.1682  0.053  -0.1165
## s.e.  0.0446   0.0646   0.0588  0.064   0.0489
## 
## sigma^2 estimated as 1.418e-06:  log likelihood=2628.56
## AIC=-5245.13   AICc=-5244.95   BIC=-5220.03

The auto-ARIMA search on the Box-Cox transformation selected a model of order (0, 2, 5), with a BIC comparable to the ARIMA(1,1,1) model selected above. The Box-Cox transformation with stationary = FALSE performed better than with stationary = TRUE (see the sketch below).
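For completeness, the stationary = TRUE variant differs by only that one flag; it restricts the search to stationary models (no differencing), which is a poor fit for this trending series:

auto.arima(cases_boxcox$case_boxcox, stationary = TRUE, allowdrift = FALSE,
           seasonal = FALSE, stepwise = FALSE, approximation = FALSE)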

auto.arima(cases_log_diff$cases_log,stationary=FALSE,allowdrift=FALSE,
seasonal=FALSE,stepwise=FALSE,approximation=FALSE)
## Series: cases_log_diff$cases_log 
## ARIMA(0,2,5) 
## 
## Coefficients:
##          ma1     ma2      ma3     ma4      ma5
##       1.0244  0.0251  -0.1846  0.0305  -0.1187
## s.e.  0.0461  0.0703   0.0625  0.0701   0.0526
## 
## sigma^2 estimated as 9.669e-06:  log likelihood=2115.2
## AIC=-4218.41   AICc=-4218.23   BIC=-4193.35

The auto.arima models for both the Box-Cox and log-transformed mean cases perform strongly, but the Box-Cox transformed model appears stronger.

Forecasting & Diagnostics

best_mod = arima(cases_boxcox$case_boxcox,order=c(1,1,1))
forecast::checkresiduals(best_mod)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,1)
## Q* = 31.839, df = 8, p-value = 9.951e-05
## 
## Model df: 2.   Total lags used: 10

The residual diagnostics above indicate that some autocorrelation still lingers in the ARIMA model chosen through the model selection process above.

par(mfrow=c(1,2))
resid = best_mod$resid
acf(resid,lag.max=20)
pacf(resid,lag.max=20)

This effect can be seen most obviously in the first period lag.

Box.test(resid,type='Ljung-Box',lag=1)
## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 0.18195, df = 1, p-value = 0.6697
Box.test(resid,type='Ljung-Box',lag=5)
## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 3.9355, df = 5, p-value = 0.5587
Box.test(resid,type='Ljung-Box',lag=10)
## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 31.839, df = 10, p-value = 0.0004259
Box.test(resid,type='Ljung-Box',lag=20)
## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 43.535, df = 20, p-value = 0.001736

The Box-Ljung tests above indicate significant autocorrelation remaining at the 10- and 20-lag horizons of the fitted model, even though the shortest lags look clean (see the refit sketch below).
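One hedged way to chase down that remaining autocorrelation is to refit with the richer (0,2,5) order that auto.arima selected earlier and re-run the same diagnostics; this is a check rather than a replacement for the chosen model:

# Refit with the auto.arima order and re-test the residuals
alt_mod <- arima(cases_boxcox$case_boxcox, order = c(0, 2, 5))
forecast::checkresiduals(alt_mod)
Box.test(alt_mod$residuals, type = 'Ljung-Box', lag = 10)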

resid = best_mod$residuals

# Fitted values are the observed series minus the residuals
pred = cases_boxcox$case_boxcox - resid

ggplot()+
  geom_line(aes(cases_boxcox$date,cases_boxcox$case_boxcox))+
  geom_line(aes(cases_boxcox$date,pred),color='blue',alpha=0.4)+
  theme_bw()+
  ggtitle("Box-Cox Transformed Mean Daily Cases Predictions vs. Actuals (Training Model") +
  xlab("Date")+
  ylab("Box-Cox Daily Cases")

# NB: expm1() inverts log1p(); the exact inverse of the Box-Cox transform is
# forecast::InvBoxCox() with the fitted lambda, so treat this RMSE as approximate
RMSE = sqrt(mean((expm1(pred) - expm1(cases_boxcox$case_boxcox))^2,na.rm=T))
RMSE
## [1] 4.860085
best_mod %>%
  forecast(h=60) %>% 
  autoplot()

Our current model indicates that cases will stay low in the near future. These values seem plausible: past trends show a reduction in cases after the holiday season, and Omicron has spread so widely that it is running out of susceptible people relative to the remaining vulnerable population.

library(lubridate)
# Estimate the Box-Cox lambda on the training window only, then reuse it for the
# test window so both series sit on the same transformed scale
lambda <- forecast::BoxCox.lambda(data$cases_mean[data$date < ymd("2022-01-31")])
train = data %>% 
  filter(date < ymd("2022-01-31")) %>% 
  mutate(case_boxcox = forecast::BoxCox(cases_mean, lambda = lambda))
test = data %>% 
  filter(date >= ymd("2022-01-31") & date < ymd("2022-02-24")) %>% 
  mutate(case_boxcox = forecast::BoxCox(cases_mean, lambda = lambda))
best_mod = arima(train$case_boxcox, order = c(1,1,1))

test_pred1 = predict(best_mod, n.ahead = 24)
error_df = data.frame(test_pred1$pred, test$case_boxcox)
error1 = sqrt(mean((error_df$test_pred1.pred - error_df$test.case_boxcox)^2))  # RMSE (note: ^1/2 would halve, not take a root)
error1
## [1] 44115394

The out-of-sample error of the best ARIMA model is substantial, which makes sense given how wildly case counts have swung: even the minor hills earlier in the dataset represent swings on the order of hundreds of thousands of cases.
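Because this error is computed on the Box-Cox scale, it is worth translating back into case units for interpretation; a sketch reusing the training-window lambda from above:

# Invert the Box-Cox transform and recompute the error in daily-case units
pred_cases <- forecast::InvBoxCox(as.numeric(test_pred1$pred), lambda)
sqrt(mean((pred_cases - test$cases_mean)^2))  # out-of-sample RMSE in cases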

Prophet Modeling

library(magrittr)
library(ggplot2)
data$date <- as.Date(data$date)
data %>%
  ggplot()+
      geom_line(aes(date,cases_mean))+
      theme_bw()+
      ggtitle("Mean Cases Over Time")+
      ylab("Mean Cases")+
      xlab("Date")

The above graphic displays mean daily cases over time starting in 2021. We can see a major spike in daily cases driven by the rapid spread of the Omicron variant around the start of 2022.

As noted above, the maximum date in this dataset is in May of 2022, because the data comes from IHME's reference scenario and its tail consists of IHME's projections of daily cases. For the purposes of this analysis, I remove all values following today's date (2/17/2022) and treat the remaining series as ground truth in order to test the forecasting method.

library(prophet)
library(dplyr)
prophet_data = data %>% 
    rename(ds = date,       # Prophet requires the date variable to be named "ds"
           y = cases_mean)  # and the outcome series to be named "y"
m <- prophet(prophet_data)  # fit on the full series; used later to plot ground truth vs. projections
train = prophet_data %>% 
  filter(ds<ymd("2021-12-31"))

test = prophet_data %>%
  filter(ds>=ymd("2021-12-31")) %>% 
  filter(ds<ymd("2022-02-17"))

model = prophet(train)

future = make_future_dataframe(model,periods =31)

forecast = predict(model,future)
tail(future) %>% 
  kbl(caption = "Tail of the Future Dataframe") %>%
  kable_classic(full_width = F, html_font = "Cambria")
Tail of the Future Dataframe
ds
390 2022-01-25
391 2022-01-26
392 2022-01-27
393 2022-01-28
394 2022-01-29
395 2022-01-30
tail(forecast[c('ds', 'yhat', 'yhat_lower', 'yhat_upper')]) %>% 
  kbl(caption = "Tail of Future Forecasting Bounds") %>%
  kable_classic(full_width = F, html_font = "Cambria")
Tail of Future Forecasting Bounds
ds yhat yhat_lower yhat_upper
390 2022-01-25 819777.6 647605.2 997128.8
391 2022-01-26 825841.0 637025.6 1011474.6
392 2022-01-27 831199.3 646897.9 1011912.5
393 2022-01-28 821353.3 640962.6 1001604.6
394 2022-01-29 825413.1 655076.5 993271.1
395 2022-01-30 830027.2 658915.2 1008848.7

Above we can see the first forecasts generated by feeding the data into the Prophet model. We're given upper and lower bounds for the dates leading up to 1/30/2022, roughly 30 days after the end of our training data. We can visualize these results and their intervals with a simple plot:

plot(m, forecast) +
  ylab("Daily Covid Cases") + xlab("Date") + theme_bw()

prophet_plot_components(m, forecast)

By plotting the ground truth of our time series against the projections, we can see that this model was unable to predict the dramatic rise in cases due to Omicron, though it was responsive enough to detect the uptick at the end of the training data. The decomposition of the seasonal elements indicates that Mondays are low points for daily mean COVID cases, with Sunday being the high point.

dyplot.prophet(model,forecast)

We can also generate an interactive plot that reports specific dates with actual and predicted values and displays a helpful density graphic below the x-axis. We can examine shorter time periods by zooming in with the filters attached to the bottom of the graphic.

Changepoints

plot(model,forecast)+add_changepoints_to_plot(model)+theme_bw()+xlab("Date")+ylab("Daily Covid Cases")

Prophet allows for the easy inclusion of changepoints, which help us identify when trends reverse course. This is useful in the case of the Omicron variant: recognizing a significant departure from our modeled values beginning in October would put us on high alert about the number of cases on the horizon.

We can modify the number of changepoints as shown below by adding a simple parameter to the prophet model.

# Number of Changepoints
model = prophet(train,n.changepoints=100)

forecast = predict(model,future)

plot(model,forecast)+add_changepoints_to_plot(model)+theme_bw()+xlab("Date")+ylab("COVID Cases")
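Beyond the number of candidate changepoints, Prophet also exposes changepoint.prior.scale, which controls how sharply the trend is allowed to bend at each changepoint (the default is 0.05); a sketch with a more flexible trend:

# Larger prior scale = more flexible trend, at the risk of overfitting
flexible_model = prophet(train, changepoint.prior.scale = 0.5)
flexible_fcst = predict(flexible_model, future)
plot(flexible_model, flexible_fcst) + add_changepoints_to_plot(flexible_model) +
  theme_bw() + xlab("Date") + ylab("COVID Cases")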

Decomposition

prophet_plot_components(model,forecast)

The decomposed seasonality elements from the changepoint model above indicate that Thursdays, rather than Mondays, are the day to watch for weekly COVID case changes. This fits our original time-series decomposition, where Sundays were the weekly high point: cases build through the week, peak around Thursday, and then reverse.

The Two-Year-Future Forecast

By making forecasts 2 years into the future, we can observe the current trends that Prophet expects the data to follow.

two_yr_future = make_future_dataframe(model,periods = 730)
two_yr_forecast = predict(model,two_yr_future)
plot(model,two_yr_forecast)+theme_bw()+xlab("Date")+ylab("Mean Daily Cases")

Considering that our training data ends in the throes of the Omicron surge, we shouldn't be surprised that the two-year forecast indicates very high case counts. We should, however, consider adding a ceiling to the prediction to rein in forecasts that are too heavily influenced by recent trends.

# Set "floor" in training data
train$floor = 0
train$cap = 9999999
future$floor = 0
future$cap = 9999999
# Set floor in forecsat data
two_yr_future$floor = 0
two_yr_future$cap = 9999999
sat_model = prophet(train,growth='logistic')

sat_two_yr_forecast = predict(sat_model,two_yr_future)
plot(sat_model,sat_two_yr_forecast)+ylim(0,999999)+
  theme_bw()+xlab("Date")+ylab("Daily Mean Cases")

We can see a more modest estimate of daily mean cases above when the ceiling was added to our model.

Additive/Multiplicative Model

Below is an additive model that makes very similar predictions regarding the near future of cases to our original prophet model.

additive = prophet(train)
add_fcst = predict(additive,future)

plot(additive,add_fcst)+
ylim(0,999999)

prophet_plot_components(additive,add_fcst)

The seasonal components of our additive model indicate that Thursday is still a major source of information for weekly case growth.

multi = prophet(train,seasonality.mode = 'multiplicative')
multi_fcst = predict(multi,future)

plot(multi,multi_fcst)+ylim(0,999999)

prophet_plot_components(multi, multi_fcst)

The multiplicative model above further verifies the findings of the Additive model.

Holidays

Many time series contain holiday-driven patterns that influence their behavior. We can use Prophet to examine these holiday effects and build awareness of them into our model.

Components Plot

model = prophet(train,fit=FALSE)
model = add_country_holidays(model,country_name = 'US')
model = fit.prophet(model,train)
forecast = predict(model,future)
prophet_plot_components(model,forecast)

The above plot displays numerous peaks and valleys related to holidays, and maintains that Thursdays are the most important day of the week for trend changes in mean daily cases.

We can make this information more useful by examining the holidays directly by name:

library(tidyr)
library(ggplot2)
library(dplyr)
forecast %>%
  filter(holidays != 0) %>%
  dplyr::select(-ds:-additive_terms_upper, -holidays:-holidays_upper, -weekly:-yhat, -contains("upper"), -contains("lower")) %>%
  mutate_all(~ if_else(. == 0, as.numeric(NA), .)) %>%
  summarize_if(is.numeric, ~ max(., na.rm = T)) %>%
  pivot_longer(
    cols = `Christmas Day`:`Washington's Birthday`,
    names_to = 'holiday', 
    values_to = 'effect'
  ) %>%
ggplot() +
  geom_col(aes(effect,holiday))+
  ggtitle("Effect of Holidays on Mean Daily Cases") +
  theme_bw()

Many holidays have an observable effect on mean daily COVID-19 cases. Many summer holidays are associated with low points, which may reflect lower case numbers in summertime or lower transmission at outdoor events. Holidays from late November through January are generally associated with a major increase in case numbers. The most surprising finding is Thanksgiving's negative effect on overall global cases; this may be because the series is global while I'm only accounting for US holidays.
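If the US holiday calendar is a poor proxy for global case dynamics, Prophet also accepts a hand-built holidays data frame, so region-specific events could be supplied directly. A hypothetical sketch (the event and windows here are illustrative, not part of the analysis above):

# Hypothetical custom holiday table; `holiday` and `ds` columns are required
lunar_new_year <- data.frame(
  holiday = 'lunar_new_year',
  ds = as.Date(c('2021-02-12', '2022-02-01')),
  lower_window = -3,  # include a few days on either side of the event
  upper_window = 3
)
custom_model = prophet(train, holidays = lunar_new_year)
custom_fcst = predict(custom_model, future)
prophet_plot_components(custom_model, custom_fcst)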

In-Sample Performance

Assessing model performance is essential to the relevance of an analysis. I'll be using RMSE (root mean squared error), MAE (mean absolute error), and MAPE (mean absolute percentage error) to assess model performance, computed as sketched below.
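The chunks that follow compute these inline; a small helper would keep the three definitions in one place (a sketch):

# RMSE, MAE, and MAPE for a vector of actuals and predictions
forecast_metrics <- function(actual, predicted) {
  err <- actual - predicted
  c(RMSE = sqrt(mean(err^2)),
    MAE = mean(abs(err)),
    MAPE = mean(abs(err / actual)))
}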

# IS Performance
forecast_metric_data = forecast %>% 
  as_tibble() %>% 
  mutate(ds = as.Date(ds)) %>% 
  filter(ds<ymd("2021-12-31"))

RMSE = sqrt(mean((train$y - forecast_metric_data$yhat)^2))

MAE = mean(abs(train$y - forecast_metric_data$yhat))

MAPE = mean(abs((train$y - forecast_metric_data$yhat)/train$y))

print(paste("RMSE:",round(RMSE,2)))
## [1] "RMSE: 136623.17"
print(paste("MAE:",round(MAE,2)))
## [1] "MAE: 107534.14"
print(paste("MAPE:",round(MAPE,2)))
## [1] "MAPE: 0.2"

The in-sample performance isn't terrible: a MAPE of 20% means we account for much of the day-to-day variation in mean global COVID cases. However, 20% of a series this large corresponds to a massive disparity in predicted case counts, so the model may be useful for reading general trends but not for operationally sensitive planning like hospital bed capacity or supply staging.

# OOS Performance

forecast_metric_data = forecast %>% 
  as_tibble() %>% 
  mutate(ds = as.Date(ds)) %>% 
  filter(ds>=ymd("2021-12-31"))
test = test %>% 
  filter(ds<=max(forecast_metric_data$ds  ))

RMSE = sqrt(mean((test$y - forecast_metric_data$yhat)^2))

MAE = mean(abs(test$y - forecast_metric_data$yhat))

MAPE = mean(abs((test$y - forecast_metric_data$yhat)/test$y))

print(paste("RMSE:",round(RMSE,2)))
## [1] "RMSE: 3367194.02"
print(paste("MAE:",round(MAE,2)))
## [1] "MAE: 3097973.31"
print(paste("MAPE:",round(MAPE,2)))
## [1] "MAPE: 0.77"

The model performs far worse on out-of-sample data, with a whopping 77% MAPE. This likely reflects the major surge in cases that immediately follows the end of the training data.

Cross-Validation

library(kableExtra)
df.cv <- cross_validation(model, initial = 30, period = 30, horizon = 30, units = 'days')
head(df.cv) %>% 
  kbl(caption = "Cross Validated Data Frame | Head") %>%
  kable_classic(full_width = F, html_font = "Cambria")
Cross Validated Data Frame | Head
y ds yhat yhat_lower yhat_upper cutoff
454060.5 2021-02-04 454994.9 452966.2 456802.9 2021-02-03
443958.4 2021-02-05 446576.2 439632.6 453497.1 2021-02-03
433448.3 2021-02-06 437647.0 424185.8 451155.9 2021-02-03
424259.9 2021-02-07 425666.4 403892.5 447427.2 2021-02-03
414205.9 2021-02-08 412419.0 381054.0 444289.9 2021-02-03
405118.9 2021-02-09 398445.8 357398.0 441609.8 2021-02-03

Above I’m creating 30-day horizon rolling periods for cross validation on the Prophet model.

df.cv %>% 
  ggplot()+
  ggtitle("Cross Validated Error by Month")+
  geom_point(aes(ds,y)) +
  geom_point(aes(ds,yhat,color=factor(cutoff)))+
  theme_bw()+
  xlab("Date")+
  ylab("Daily Cases")+
  scale_color_discrete(name = 'Cutoff')

The disparities in the month of January are staggering relative to the model's general performance: cutoffs trained before the Omicron wave had no way to anticipate the spike.

round_df <- function(df, digits) {
  nums <- vapply(df, is.numeric, FUN.VALUE = logical(1))

  df[,nums] <- round(df[,nums], digits = digits)

  (df)
}
library(kableExtra)
round_df(performance_metrics(df.cv),2) %>% 
  kbl(caption = "Cross-Validated Error Metrics") %>%
  kable_classic(full_width = F, html_font = "Cambria")
Cross-Validated Error Metrics
horizon mse rmse mae mape mdape smape coverage
3 days 8315574810 91189.77 68732.69 0.13 0.09 0.13 0.70
4 days 9061562940 95192.24 70502.24 0.13 0.07 0.14 0.70
5 days 9854070470 99267.67 72902.47 0.13 0.06 0.14 0.64
6 days 10709813512 103488.23 76212.80 0.14 0.06 0.15 0.58
7 days 11941433604 109276.87 80932.20 0.15 0.08 0.16 0.55
8 days 13797381733 117462.26 87326.07 0.16 0.11 0.17 0.55
9 days 16070433060 126769.21 94707.31 0.17 0.13 0.18 0.55
10 days 18483278627 135953.22 102519.98 0.19 0.17 0.20 0.55
11 days 20492407738 143151.69 109395.83 0.20 0.20 0.21 0.55
12 days 22524750971 150082.48 116008.21 0.22 0.22 0.23 0.55
13 days 24573864537 156760.53 122239.92 0.23 0.22 0.24 0.52
14 days 26998870963 164313.33 129329.60 0.24 0.22 0.26 0.48
15 days 29833719847 172724.40 136990.15 0.26 0.24 0.27 0.45
16 days 32818027539 181157.47 145407.50 0.27 0.26 0.29 0.45
17 days 35877642221 189413.94 153729.94 0.29 0.27 0.31 0.48
18 days 38675921911 196661.95 161567.13 0.31 0.28 0.33 0.48
19 days 41875779430 204635.72 169553.87 0.32 0.30 0.35 0.48
20 days 45572836665 213477.95 177633.48 0.34 0.31 0.37 0.45
21 days 50574296782 224887.30 187085.73 0.36 0.31 0.39 0.48
22 days 56996612875 238739.63 197590.59 0.38 0.31 0.41 0.52
23 days 64236613113 253449.43 208828.08 0.40 0.33 0.44 0.55
24 days 71476242725 267350.41 219210.50 0.41 0.33 0.46 0.55
25 days 78638209321 280425.05 228980.73 0.43 0.33 0.48 0.55
26 days 87428366669 295682.88 240058.33 0.45 0.33 0.51 0.55
27 days 99062017172 314741.19 253038.31 0.46 0.32 0.53 0.55
28 days 115623293421 340034.25 269099.83 0.48 0.32 0.56 0.55
29 days 137047751911 370199.61 287390.09 0.50 0.32 0.59 0.55
30 days 161522190296 401898.23 306993.68 0.53 0.32 0.62 0.52

There is a general trend: the further the forecast extends beyond its cutoff, the worse the cross-validated predictions of daily cases become. We can confirm this by plotting the error metric against the forecast horizon.

Plot of Cross-Validated Accuracy of Model as Time Window from Present Increases

plot_cross_validation_metric(df.cv, metric = 'rmse') 

There is a general trend toward inaccuracy as the horizon grows, with the RMSE at the 30-day horizon (roughly 400,000 cases) more than four times that at the 3-day horizon (roughly 91,000 cases).

Model Comparison

RMSE = sqrt(mean((test$y - forecast_metric_data$yhat)^2))
RMSE
## [1] 3367194
error1
## [1] 44115394

The RMSE of the Prophet model is about 3.37 × 10^6, while the RMSE of the best ARIMA model is about 4.41 × 10^7. The reason I'm using RMSE rather than MAPE or other criteria is that I want all comparisons in the original units: case counts. MAPE could be misleading here given the wide swings between peaks and valleys, since the same absolute miss produces a much larger percentage error at the low points of the series. RMSE communicates an understandable metric in absolute units.