I decided to use the COVID-19 data published by IHME, the Institute for Health Metrics and Evaluation. The data is published to the web by IHME to assist policymakers with planning around the uncertainty surrounding COVID. It is collected from local and national governments, hospitals and associations, the WHO, and other sources.
They “use data from government websites for Brazil, Canada, France, Germany, Italy, Japan, Mexico, Pakistan, South Africa, Spain, United Kingdom, and the US states of Indiana, Illinois, Maryland, and Washington”, while data for New York is collected from the New York Times GitHub repository. I decided on this dataset because of the importance of understanding the COVID-19 pandemic and the factors impacting its transmission. This topic is closely connected to my everyday life, and analyzing this data will help me understand the problem so that I can help those around me be safer and engage with the community in a more informed way.
I will specifically be analyzing the time series of “Mean Daily Cases” for the variable of interest for this analysis, looking at the global population, rather than any country-specific data.
The data reflect disease transmission, so variation in the variable is tied heavily to adherence to pandemic protocols (or lack thereof due to fatigue), as well as to variants of the virus mutating to become more or less infectious or harmful. Localized quarantine procedures have been observed to be effective at containing the virus (Royal Society for Public Health, Volume 189), and disinfecting procedures have been observed to limit exposure in non-sterile environments (Vardoulakis et al.), though high levels of pandemic-related stress (Park et al.) may reduce willingness to adhere to public health guidelines. This will be a very difficult variable to forecast because of the interaction between the nature of the virus itself and the behaviors exhibited by the population. The potential for new variants to emerge in other parts of the globe heightens an already dangerous situation, given how closely interconnected the world's populations are.
Disclaimer - the articles cited above may be paywalled to non-UC users.
The maximum date in the dataset falls in May of 2022 because the data come from the reference scenario produced by IHME, which includes IHME's own projections of daily cases. For the purposes of this analysis, I will remove all values following today's date (2/24/2022) and treat the remaining series as ground truth in order to test forecasting methods, though this will not become immediately necessary until later in the process.
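A minimal sketch of that truncation step (my assumption about how the cut would be applied; it presumes the raw data frame `df` and its `date` column used below):

library(dplyr)
# Drop IHME's projected rows so only observed history up to the analysis date remains
df <- df %>%
filter(as.Date(date) <= as.Date("2022-02-24"))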
library(dplyr)       # pipes and data manipulation
library(kableExtra)  # kbl() / kable_classic() formatted tables

data <- df[, c(2, 4, 23)]  # keep the date, location_name, and cases_mean columns
data <- subset(data, location_name == "Global")
data$cases_mean[data$cases_mean == 0] <- 0.1  # guard against log/Box-Cox issues at zero
unique(data$location_name)
## [1] "Global"
summary(data)%>%
kbl(caption = "Summary Table for Global Daily Mean COVID-19 Cases Data") %>%
kable_classic(full_width = F, html_font = "Cambria")
| date | location_name | cases_mean |
|---|---|---|
| Length:486 | Length:486 | Min. : 53214 |
| Class :character | Class :character | 1st Qu.: 420254 |
| Mode :character | Mode :character | Median : 560434 |
| | | Mean : 882330 |
| | | 3rd Qu.: 703547 |
| | | Max. :5410623 |
The above table displays summary statistics for the retained columns in a concise, well formatted layout. In the HTML submission file the table can be scrolled horizontally; I trimmed the original data frame down to three columns for brevity in the report, given the large number of columns it contains.
library(kableExtra)
head(data) %>%
kbl(caption = "Mean Cases vs. Date") %>%
kable_classic(full_width = F, html_font = "Cambria")
| date | location_name | cases_mean |
|---|---|---|
| 2021-01-01 | Global | 635601.5 |
| 2021-01-02 | Global | 646073.9 |
| 2021-01-03 | Global | 657019.8 |
| 2021-01-04 | Global | 665052.6 |
| 2021-01-05 | Global | 680489.2 |
| 2021-01-06 | Global | 708657.8 |
Above we can see summary statistics for our data as well as the first six records of the dataset.
nulls <- data.frame(round(colSums(is.na(data)) / nrow(data), digits = 2))
names(nulls) <- 'Proportion Null'
names(nulls)
## [1] "Proportion Null"
nulls %>%
kbl(caption = "Proportion of Missing Values by Variable") %>%
kable_styling(full_width = F) %>%
kable_paper("hover", full_width = F) %>%
column_spec(1, bold = TRUE, border_right = TRUE, color = "black", background = "lightgrey")
| | Proportion Null |
|---|---|
| date | 0 |
| location_name | 0 |
| cases_mean | 0 |
Our data isn’t missing any values in the time series of interest.
library(magrittr)
library(ggplot2)
data$date <- as.Date(data$date)
data %>%
ggplot()+
geom_line(aes(date,cases_mean))+
theme_bw()+
ggtitle("Mean Cases Over Time")+
ylab("Mean Cases")+
xlab("Date")
Fit a linear regression predicting your outcome (the time-series) as a function of time (measured in hours, days, weeks, etc.). Is there a statistically significant impact of time on your outcome? Present the results of the regression in a publication-quality table, and interpret the output. How much of the variation in the outcome is explained by a time trend? Does it look like a linear fit would do a good job of predicting the behavior of the time series?
library(sjPlot)
model <- lm(cases_mean ~ date, data = data)
tab_model(model) # tab_model() takes the fitted model object directly
Outcome: cases_mean

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | -44765600.99 | -56900516.01 – -32630685.96 | <0.001 |
| date | 2419.01 | 1775.97 – 3062.06 | <0.001 |
| Observations | 486 | | |
| R2 / R2 adjusted | 0.101 / 0.100 | | |
The above linear model has a very low R^2 and adjusted R^2 (about 0.10), meaning a time trend alone explains only around 10% of the variation in daily cases. The effect of time is statistically significant (p < 0.001), but a straight line clearly would not do a good job of predicting the behavior of this series; a better model with more finely tuned parameters may change that.
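To see this visually, one option is to overlay the fitted trend on the series; a quick sketch using ggplot2's geom_smooth with a linear fit:

library(ggplot2)
# Raw series with the least-squares trend line layered on top
data %>%
ggplot(aes(date, cases_mean))+
geom_line()+
geom_smooth(method = "lm", se = FALSE, color = "red")+
theme_bw()+
ggtitle("Mean Cases Over Time with Linear Trend")+
ylab("Mean Cases")+
xlab("Date")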
There are some observable trends in the data: average case numbers are rising worldwide, with what appear to be roughly six-month humps in cases and a recent high driven by the widespread infectiousness of Omicron. Given this apparent periodicity, the raw series is unlikely to be stationary; there seems to be a regular cadence at which cases begin to rise. Seasonality may not be the right word for what I'm observing, since the disease is so new that only a short history is available.
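One way to probe that suspected cadence is a seasonal-trend decomposition; a rough sketch, where treating the data as a weekly-frequency series (frequency = 7) is my assumption rather than anything established above:

# Decompose daily cases into trend, weekly seasonal, and remainder components
cases_ts <- ts(data$cases_mean, frequency = 7)
plot(stl(cases_ts, s.window = "periodic"))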
data = data %>%
mutate(lag1_cases = dplyr::lag(cases_mean),
lag2_cases = dplyr::lag(cases_mean,2),
lag3_cases = dplyr::lag(cases_mean,3))
head(data) %>%
kbl(caption = "First 10 Rows in the New Dataset") %>%
kable_styling(full_width = F) %>%
kable_paper("hover", full_width = F) %>%
kable_classic(full_width = F, html_font = "Cambria")
| date | location_name | cases_mean | lag1_cases | lag2_cases | lag3_cases |
|---|---|---|---|---|---|
| 2021-01-01 | Global | 635601.5 | NA | NA | NA |
| 2021-01-02 | Global | 646073.9 | 635601.5 | NA | NA |
| 2021-01-03 | Global | 657019.8 | 646073.9 | 635601.5 | NA |
| 2021-01-04 | Global | 665052.6 | 657019.8 | 646073.9 | 635601.5 |
| 2021-01-05 | Global | 680489.2 | 665052.6 | 657019.8 | 646073.9 |
| 2021-01-06 | Global | 708657.8 | 680489.2 | 665052.6 | 657019.8 |
cases_diff = data %>%
mutate(case_diff = cases_mean - lag(cases_mean))
cases_diff %>%
select(date,cases_mean,case_diff) %>%
head() %>%
kbl(caption = "Head of Differenced Case Data") %>%
kable_classic(full_width = F, html_font = "Cambria")
| date | cases_mean | case_diff |
|---|---|---|
| 2021-01-01 | 635601.5 | NA |
| 2021-01-02 | 646073.9 | 10472.43 |
| 2021-01-03 | 657019.8 | 10945.96 |
| 2021-01-04 | 665052.6 | 8032.74 |
| 2021-01-05 | 680489.2 | 15436.63 |
| 2021-01-06 | 708657.8 | 28168.53 |
cases_diff %>%
ggplot()+
geom_line(aes(date,case_diff))+
theme_bw()+
ggtitle("Mean Daily Cases (First Difference)")+
ylab("Mean Cases (Difference))")+
xlab("Date")
library(tidyr)
cases_diff = cases_diff %>%
mutate(case_log = log1p(case_diff)) %>%
mutate(case_log_diff = case_log - lag(case_log))
nan_df <- cases_diff[!complete.cases(cases_diff), ] # rows containing NA/NaN (e.g., from differencing, or log1p of differences below -1)
cases_diff %>%
ggplot()+
geom_line(aes(date,case_log_diff))+
theme_bw()+
ggtitle("Mean Daily Cases (Log; First Difference)")+
ylab("Mean Cases (Difference))")+
xlab("Date")
cases_log_diff <- data %>%
arrange(date) %>%
mutate(
cases_log = log1p(cases_mean), # Log for Variance Stationarity
cases_diff = cases_mean - lag(cases_mean),
cases_log_diff = cases_log - lag(cases_log) # Difference for Mean Stationarity
) %>%
drop_na()
cases_log_diff %>%
ggplot() +
geom_line(aes(date, cases_log_diff)) +
theme_bw() +
ggtitle("Difference in Log Values, Mean Daily Cases over Time") +
ylab("Mean Daily Cases (Difference))") +
xlab("Date")
In the above section I attempt to create a more stationary version of the time series, such that it will be easier to forecast. I’ve attempted to do this utilizing differenced calculations in the data (accounting for lag) and utilizing log transformations, though these don’t appear to have effectively made the series more stationary.
library(tseries) # tseries provides adf.test() and kpss.test()
adf_mean <- adf.test(cases_log_diff$cases_mean) # Raw mean daily cases
adf_diff <- adf.test(cases_log_diff$cases_diff) # First difference
adf_log <- adf.test(cases_log_diff$cases_log) # Log of mean daily cases
adf_log_diff <- adf.test(cases_log_diff$cases_log_diff) # Differenced log of mean daily cases
kp_mean <- kpss.test(cases_log_diff$cases_mean) # Raw mean daily cases
kp_diff <- kpss.test(cases_log_diff$cases_diff) # First difference
kp_log <- kpss.test(cases_log_diff$cases_log) # Log of mean daily cases
kp_log_diff <- kpss.test(cases_log_diff$cases_log_diff) # Differenced log of mean daily cases
stats <- matrix(c("ADF","Log Transform",adf_log$p.value,"Stationary",
"KPSS","Log Transform",kp_log$p.value,"Stationary",
"ADF","First Difference",adf_diff$p.value,"Non-stationary",
"KPSS","First Difference",kp_diff$p.value,"Stationary"),
ncol=4,byrow=TRUE)
stats %>%
kbl(caption = "Table 2: ADF/KPSS test on Log vs. Box-cox transform") %>%
kable_classic(full_width = F, html_font = "Cambria")
| Test | Transformation | p-value | Conclusion |
|---|---|---|---|
| ADF | Log Transform | 0.0191 | Stationary |
| KPSS | Log Transform | 0.028 | Non-stationary |
| ADF | First Difference | 0.01 | Stationary |
| KPSS | First Difference | 0.1 | Stationary |
After running an ADF (Augmented Dickey-Fuller) test on mean daily cases as well as the first difference, the log, and the log difference of cases, the small p-values for the first difference (0.01) and the log transform (0.019) provide sufficient evidence to reject the null hypothesis of a unit root, indicating those series are stationary in the mean.
The KPSS test reverses the hypotheses: its null is that the series is stationary. The first difference fails to reject that null (p = 0.1) and so looks stationary, while the log transform rejects it at the 5% level (p = 0.028), suggesting a trend remains after the log transformation alone. Taken together, these results indicate that the log transform helps stabilize the variance while some trend remains, which the d = 1 differencing in the ARIMA models below addresses; the log-transformed (and Box-Cox-transformed) series are therefore what I carry forward.
Above I've created the ACF and PACF charts for the first-order differences and log transformations of daily mean cases. The log transformation appears to be the most effective, but it may be worth exploring some second-order transformations to see whether they reduce the autocorrelation observed in the ACF plots.
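The code for those charts isn't reproduced above; a sketch of how they could be generated from the transformed series using base R's acf and pacf:

# Autocorrelation diagnostics for the differenced and log-transformed series
par(mfrow = c(3, 2))
acf(cases_log_diff$cases_diff, lag.max = 30, main = "ACF: First Difference")
pacf(cases_log_diff$cases_diff, lag.max = 30, main = "PACF: First Difference")
acf(cases_log_diff$cases_log, lag.max = 30, main = "ACF: Log Transform")
pacf(cases_log_diff$cases_log, lag.max = 30, main = "PACF: Log Transform")
acf(cases_log_diff$cases_log_diff, lag.max = 30, main = "ACF: Log First Difference")
pacf(cases_log_diff$cases_log_diff, lag.max = 30, main = "PACF: Log First Difference")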
cases_log_diff <- data %>%
arrange(date) %>%
mutate(
cases_log = log1p(cases_mean), # Log for Variance Stationarity
cases_diff = cases_mean - lag(cases_mean),
cases_log_diff = cases_log - lag(cases_log), # Difference for Mean Stationarity
cases_diff2 = cases_mean - lag(cases_mean,2), # Lag-2 difference
cases_log_diff2 = cases_log - lag(cases_log, 2) # Lag-2 difference of the log
) %>%
drop_na()
Above I've calculated second-order (lag-2) transformations for the log and difference values of mean daily cases, and will evaluate KPSS and ADF tests on the new fields as well as their ACF and PACF plots below.
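Those checks aren't reproduced here; a sketch of what they could look like on the lag-2 fields (adf.test and kpss.test come from the tseries package loaded earlier):

# Stationarity tests and autocorrelation plots for the lag-2 transformations
adf.test(cases_log_diff$cases_log_diff2)
kpss.test(cases_log_diff$cases_log_diff2)
par(mfrow = c(2, 2))
acf(cases_log_diff$cases_diff2, lag.max = 30, main = "ACF: Lag-2 Difference")
pacf(cases_log_diff$cases_diff2, lag.max = 30, main = "PACF: Lag-2 Difference")
acf(cases_log_diff$cases_log_diff2, lag.max = 30, main = "ACF: Lag-2 Log Difference")
pacf(cases_log_diff$cases_log_diff2, lag.max = 30, main = "PACF: Lag-2 Log Difference")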
The lag-2 transformations calculated above show stronger autocorrelation than the first-order transformations, so I will use the first-order transformations only.
cases_log_diff %>%
ggplot()+
geom_line(aes(date,cases_log))+
theme_bw()+
ggtitle("Mean Daily Cases (Log Transformed)")+
ylab("Mean Cases (Log))")+
xlab("Date")
data$cases_mean[data$cases_mean == 0] <- 0.1
cases_boxcox = data %>%
# filter(date>=ymd('2009-01-01') & date<=ymd("2015-01-01")) %>%
mutate(case_boxcox = forecast::BoxCox(cases_mean,lambda='auto'))
min(data$cases_mean)
## [1] 53214.16
cases_boxcox %>%
ggplot()+
geom_line(aes(date,case_boxcox))+
theme_bw()+
ggtitle("Mean Cases over Time (Box-Cox Transformation)")+
ylab("Mean Cases (Box-Cox Tranformation)")+
xlab("Date")
Above I have calculated a Box-Cox transformation of mean daily cases in an attempt to identify a more suitable model, and below I build a variety of ARIMA models with different orders to evaluate against one another.
BIC(
arima(cases_boxcox$case_boxcox,order=c(0,1,1)),
arima(cases_boxcox$case_boxcox,order=c(0,1,2)),
arima(cases_boxcox$case_boxcox,order=c(0,1,3)),
arima(cases_boxcox$case_boxcox,order=c(1,1,0)),
arima(cases_boxcox$case_boxcox,order=c(1,1,1)),
arima(cases_boxcox$case_boxcox,order=c(2,1,0)),
arima(cases_boxcox$case_boxcox,order=c(3,1,0))
) %>%
kbl(caption = "Comparing Model BIC") %>%
kable_classic(full_width = F, html_font = "Cambria")
| | df | BIC |
|---|---|---|
| arima(cases_boxcox$case_boxcox, order = c(0, 1, 1)) | 2 | -3680.105 |
| arima(cases_boxcox$case_boxcox, order = c(0, 1, 2)) | 3 | -4252.574 |
| arima(cases_boxcox$case_boxcox, order = c(0, 1, 3)) | 4 | -4573.805 |
| arima(cases_boxcox$case_boxcox, order = c(1, 1, 0)) | 2 | -4930.274 |
| arima(cases_boxcox$case_boxcox, order = c(1, 1, 1)) | 3 | -5234.642 |
| arima(cases_boxcox$case_boxcox, order = c(2, 1, 0)) | 3 | -5046.246 |
| arima(cases_boxcox$case_boxcox, order = c(3, 1, 0)) | 4 | -5116.265 |
The best of the models above appears to be the ARIMA model of order (1,1,1); let's compare those results with those of the log-transformed ARIMA models.
BIC(
arima(cases_log_diff$cases_log,order=c(0,1,1)),
arima(cases_log_diff$cases_log,order=c(0,1,2)),
arima(cases_log_diff$cases_log,order=c(0,1,3)),
arima(cases_log_diff$cases_log,order=c(1,1,0)),
arima(cases_log_diff$cases_log,order=c(1,1,1)),
arima(cases_log_diff$cases_log,order=c(2,1,0)),
arima(cases_log_diff$cases_log,order=c(3,1,0))
) %>%
kbl(caption = "Comparing Model BIC") %>%
kable_classic(full_width = F, html_font = "Cambria")
| | df | BIC |
|---|---|---|
| arima(cases_log_diff$cases_log, order = c(0, 1, 1)) | 2 | -2651.339 |
| arima(cases_log_diff$cases_log, order = c(0, 1, 2)) | 3 | -3224.673 |
| arima(cases_log_diff$cases_log, order = c(0, 1, 3)) | 4 | -3568.429 |
| arima(cases_log_diff$cases_log, order = c(1, 1, 0)) | 2 | -3893.528 |
| arima(cases_log_diff$cases_log, order = c(1, 1, 1)) | 3 | -4202.144 |
| arima(cases_log_diff$cases_log, order = c(2, 1, 0)) | 3 | -4018.758 |
| arima(cases_log_diff$cases_log, order = c(3, 1, 0)) | 4 | -4089.790 |
These models don't perform as well as the Box-Cox-transformed models according to the BIC criterion; however, before settling on a single model I will also fit some auto.arima models to see whether there are better options I haven't considered.
library(forecast)
auto.arima(cases_boxcox$case_boxcox,stationary=FALSE,allowdrift=FALSE,
seasonal=FALSE,stepwise=FALSE,approximation=FALSE)
## Series: cases_boxcox$case_boxcox
## ARIMA(0,2,5)
##
## Coefficients:
## ma1 ma2 ma3 ma4 ma5
## 0.9648 -0.0360 -0.1682 0.053 -0.1165
## s.e. 0.0446 0.0646 0.0588 0.064 0.0489
##
## sigma^2 estimated as 1.418e-06: log likelihood=2628.56
## AIC=-5245.13 AICc=-5244.95 BIC=-5220.03
The auto-ARIMA fit to the Box-Cox-transformed series selected an order of (0, 2, 5) and has a BIC comparable to the (1,1,1) model selected above. The Box-Cox transformation with stationary = FALSE performed better than with stationary = TRUE.
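The stationary = TRUE run isn't shown above; for reference, a sketch of that comparison call:

# Restrict the automatic search to stationary models (forces d = 0)
auto.arima(cases_boxcox$case_boxcox, stationary = TRUE, allowdrift = FALSE,
seasonal = FALSE, stepwise = FALSE, approximation = FALSE)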
auto.arima(cases_log_diff$cases_log,stationary=FALSE,allowdrift=FALSE,
seasonal=FALSE,stepwise=FALSE,approximation=FALSE)
## Series: cases_log_diff$cases_log
## ARIMA(0,2,5)
##
## Coefficients:
## ma1 ma2 ma3 ma4 ma5
## 1.0244 0.0251 -0.1846 0.0305 -0.1187
## s.e. 0.0461 0.0703 0.0625 0.0701 0.0526
##
## sigma^2 estimated as 9.669e-06: log likelihood=2115.2
## AIC=-4218.41 AICc=-4218.23 BIC=-4193.35
The auto.arima models for both the Box-Cox- and log-transformed mean cases appear to perform well; of the two, the Box-Cox-transformed model appears stronger.
best_mod = arima(cases_boxcox$case_boxcox,order=c(1,1,1))
forecast::checkresiduals(best_mod)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,1)
## Q* = 31.839, df = 8, p-value = 9.951e-05
##
## Model df: 2. Total lags used: 10
The residual diagnostics above indicate that some autocorrelation still lingers in the ARIMA model chosen through the model selection process above.
par(mfrow=c(1,2))
resid = best_mod$resid
acf(resid,lag.max=20)
pacf(resid,lag.max=20)
This effect can be seen most obviously in the first period lag.
Box.test(resid,type='Ljung-Box',lag=1)
##
## Box-Ljung test
##
## data: resid
## X-squared = 0.18195, df = 1, p-value = 0.6697
Box.test(resid,type='Ljung-Box',lag=5)
##
## Box-Ljung test
##
## data: resid
## X-squared = 3.9355, df = 5, p-value = 0.5587
Box.test(resid,type='Ljung-Box',lag=10)
##
## Box-Ljung test
##
## data: resid
## X-squared = 31.839, df = 10, p-value = 0.0004259
Box.test(resid,type='Ljung-Box',lag=20)
##
## Box-Ljung test
##
## data: resid
## X-squared = 43.535, df = 20, p-value = 0.001736
The Box-Ljung tests above indicate significant autocorrelation when testing jointly over the first 10 and 20 lags, while the tests at 1 and 5 lags do not.
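A possible follow-up, not pursued further here, would be to check whether a slightly richer specification absorbs the remaining autocorrelation; a sketch:

# Add an extra AR and MA term and re-run the residual diagnostics
alt_mod = arima(cases_boxcox$case_boxcox, order = c(2, 1, 2))
forecast::checkresiduals(alt_mod)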
resid = best_mod$residuals
pred = cases_boxcox$case_boxcox - resid # fitted values = observed minus residuals
ggplot()+
geom_line(aes(cases_boxcox$date,cases_boxcox$case_boxcox))+
geom_line(aes(cases_boxcox$date,pred),color='blue',alpha=0.4)+
theme_bw()+
ggtitle("Box-Cox Transformed Mean Daily Cases: Predictions vs. Actuals (Training Data)") +
xlab("Date")+
ylab("Box-Cox Daily Cases")
RMSE = sqrt(mean((expm1(pred) - expm1(cases_boxcox$case_boxcox))^2,na.rm=T)) # expm1 inverts a log1p transform; InvBoxCox with the fitted lambda would be the exact inverse for a Box-Cox series
RMSE
## [1] 4.860085
best_mod %>%
forecast(h=60) %>%
autoplot()
Our current model indicates that cases will remain relatively low in the future. These values seem reasonable: observed trends in the data suggest the period after the holiday season sees a reduction in cases, and Omicron has spread so widely that it is losing steam relative to the remaining susceptible population.
library(lubridate)
train = data %>%
filter(date<ymd("2022-1-31")) %>%
mutate(case_boxcox = forecast::BoxCox(cases_mean,lambda='auto'))
test = data %>%
filter(date>=ymd("2022-1-31") & date<ymd("2022-02-24")) %>%
mutate(case_boxcox = forecast::BoxCox(cases_mean,lambda='auto'))
best_mod = arima(train$case_boxcox,order=c(1,1,1))
test_pred1 = predict(best_mod,24)
error_df = data.frame(test_pred1$pred, test$case_boxcox)
error1 = sqrt(mean((error_df$test_pred1.pred - error_df$test.case_boxcox)^2)) # RMSE on the Box-Cox scale
error1
## [1] 44115394
The out-of-sample error of the best ARIMA model is substantial, which makes sense given how sharply case counts have swung; even the smaller humps observed earlier in the dataset represent swings on the order of hundreds of thousands of cases.
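Because that error is computed on the Box-Cox scale, a sketch of how it could be re-expressed in raw case counts (BoxCox.lambda and InvBoxCox are from the forecast package; this assumes the same train/test split as above):

# Recover the training-series lambda, back-transform the forecasts,
# and compute RMSE in units of daily cases
lambda <- forecast::BoxCox.lambda(train$cases_mean)
pred_cases <- forecast::InvBoxCox(as.numeric(test_pred1$pred), lambda)
sqrt(mean((pred_cases - test$cases_mean)^2))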
library(magrittr)
library(ggplot2)
data$date <- as.Date(data$date)
data %>%
ggplot()+
geom_line(aes(date,cases_mean))+
theme_bw()+
ggtitle("Mean Cases Over Time")+
ylab("Mean Cases")+
xlab("Date")
The above graphic displays mean daily cases over time starting in 2021. We can see a major spike in daily cases around the start of 2022 due to the rapid spread of the Omicron variant.
The maximum date value contained in this model is for May of 2022 - this data is from the reference scenario produced by IHME, which means that it will be displaying projections created by IHME regarding daily cases. For the purposes of this analysis, I will remove all values following today’s date (2/17/2022) and will treat the series as if it is the ground truth in order to test the forecasting method.
library(prophet)
library(dplyr)
prophet_data = data %>%
rename(ds = date, # Have to name our date variable "ds"
y = cases_mean) # Have to name our time series "y"
m <- prophet(prophet_data) # fit on the full series; used later to plot actuals against the training-model forecast
train = prophet_data %>%
filter(ds<ymd("2021-12-31"))
test = prophet_data %>%
filter(ds>=ymd("2021-12-31")) %>%
filter(ds<ymd("2022-02-17"))
model = prophet(train)
future = make_future_dataframe(model,periods =31)
forecast = predict(model,future)
tail(future) %>%
kbl(caption = "Tail of Future Caps on Data") %>%
kable_classic(full_width = F, html_font = "Cambria")
| | ds |
|---|---|
| 390 | 2022-01-25 |
| 391 | 2022-01-26 |
| 392 | 2022-01-27 |
| 393 | 2022-01-28 |
| 394 | 2022-01-29 |
| 395 | 2022-01-30 |
tail(forecast[c('ds', 'yhat', 'yhat_lower', 'yhat_upper')]) %>%
kbl(caption = "Tail of Future Forecasting Bounds") %>%
kable_classic(full_width = F, html_font = "Cambria")
| | ds | yhat | yhat_lower | yhat_upper |
|---|---|---|---|---|
| 390 | 2022-01-25 | 819777.6 | 647605.2 | 997128.8 |
| 391 | 2022-01-26 | 825841.0 | 637025.6 | 1011474.6 |
| 392 | 2022-01-27 | 831199.3 | 646897.9 | 1011912.5 |
| 393 | 2022-01-28 | 821353.3 | 640962.6 | 1001604.6 |
| 394 | 2022-01-29 | 825413.1 | 655076.5 | 993271.1 |
| 395 | 2022-01-30 | 830027.2 | 658915.2 | 1008848.7 |
Above we can see the first forecasts generated by feeding the data into the Prophet model. We're given upper and lower bounds for the dates leading up to 1/30/2022, roughly a month beyond the end of our training dataset. We can visualize these results and their intervals with a simple plot:
plot(m, forecast) +
ylab("Daily Covid Cases")+xlab("Date") +theme_bw()
prophet_plot_components(m, forecast)
Plotting the ground-truth series against the projections shows that this model was unable to predict the dramatic rise in cases due to Omicron, but it was responsive enough to detect the uptick represented at the end of the training data. The decomposition of the seasonal components present in the data indicates that Mondays are low points for daily mean COVID cases, with Sundays being the high point.
dyplot.prophet(model,forecast)
We can also view interactive plots that show specific dates with their actual and predicted values, along with a helpful density graphic below the x-axis. We can examine shorter time periods by zooming in on the period in question using the filters attached to the bottom of the graphic.
plot(model,forecast)+add_changepoints_to_plot(model)+theme_bw()+xlab("Date")+ylab("Daily Covid Cases")
Prophet allows for the easy inclusion of changepoints, which can help us identify when trends reverse course. This can be helpful in the case of the Omicron variant: a significant departure from our modeled values beginning in October would put us on high alert about the number of cases on the horizon.
We can modify the number of changepoints as shown below by adding a simple parameter to the prophet model.
# Number of Changepoints
model = prophet(train,n.changepoints=100)
forecast = predict(model,future)
plot(model,forecast)+add_changepoints_to_plot(model)+theme_bw()+xlab("Date")+ylab("COVID Cases")
prophet_plot_components(model,forecast)
The decomposed seasonal components from the changepoint model above indicate that Thursdays, rather than Mondays, are the day to watch in terms of weekly COVID case changes. In our original decomposition we saw that Sundays were the weekly high point, which is consistent with the increase observed through the week, peaking around Thursday and then reversing.
By making forecasts 2 years into the future, we can observe the current trends that Prophet expects the data to follow.
two_yr_future = make_future_dataframe(model,periods = 730)
two_yr_forecast = predict(model,two_yr_future)
plot(model,two_yr_forecast)+theme_bw()+xlab("Date")+ylab("Mean Daily Cases")
Considering that our training data ends in the throes of the Omicron surge, we shouldn’t be surprised that the two year forecast for our data indicates high amounts of cases. We should, however, consider adding a ceiling to this prediction to manage any overzealous predictions that are too heavily influenced by recent trends.
# Set "floor" in training data
train$floor = 0
train$cap = 9999999
future$floor = 0
future$cap = 9999999
# Set floor in forecsat data
two_yr_future$floor = 0
two_yr_future$cap = 9999999
sat_model = prophet(train,growth='logistic')
sat_two_yr_forecast = predict(sat_model,two_yr_future)
plot(sat_model,sat_two_yr_forecast)+ylim(0,999999)+
theme_bw()+xlab("Date")+ylab("Daily Mean Cases")
We can see a more modest estimate of daily mean cases above when the ceiling was added to our model.
Below is an additive-seasonality model, which makes predictions for the near future very similar to those of our original Prophet model.
additive = prophet(train)
add_fcst = predict(additive,future)
plot(additive,add_fcst)+
ylim(0,999999)
prophet_plot_components(additive,add_fcst)
The seasonal components of our additive model indicate that Thursday is still a major source of information for weekly case growth.
multi = prophet(train,seasonality.mode = 'multiplicative')
multi_fcst = predict(multi,future)
plot(multi,multi_fcst)+ylim(0,999999)
prophet_plot_components(multi, multi_fcst)
The multiplicative model above further verifies the findings of the Additive model.
Many time series contain holiday-related patterns that influence their behavior. We can use Prophet to examine these holiday effects and build awareness of them into our model.
model = prophet(train,fit=FALSE)
model = add_country_holidays(model,country_name = 'US')
model = fit.prophet(model,train)
forecast = predict(model,future)
prophet_plot_components(model,forecast)
The above plot displays numerous peaks and valleys related to holidays, while maintaining that Thursdays are the most important day of the week for changes in mean daily cases.
We can make this information more useful by examining the holidays directly by name:
library(tidyr)
library(ggplot2)
library(dplyr)
forecast %>%
filter(holidays != 0) %>%
dplyr::select(-ds:-additive_terms_upper, -holidays:-holidays_upper, -weekly:-yhat, -contains("upper"), -contains("lower")) %>%
mutate_all(~ if_else(. == 0, as.numeric(NA), .)) %>%
summarize_if(is.numeric, ~ max(., na.rm = T)) %>%
pivot_longer(
cols = `Christmas Day`:`Washington's Birthday`,
names_to = 'holiday',
values_to = 'effect'
) %>%
ggplot() +
geom_col(aes(effect,holiday))+
ggtitle("Effect of Holidays on Mean Daily Cases") +
theme_bw()
Many holidays have an observable effect on the number of mean daily COVID-19 cases. Many of the summer holidays are associated with low points, which may be due to lower case numbers in summertime or to lower transmission at outdoor events. Holidays from late November through January are generally associated with a major increase in case numbers. The most surprising finding here is Thanksgiving's negative effect on overall global cases; this may be because the series is global while I'm only accounting for US holidays. It may also be noted that the highest
Assessing model performance is highly important to the relevance of an analysis. I'll be using RMSE (root mean squared error), MAE (mean absolute error), and MAPE (mean absolute percentage error) to assess model performance.
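For reference, with actual values $y_t$, predictions $\hat{y}_t$, and $n$ evaluation points, these metrics are defined as:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}(y_t - \hat{y}_t)^2},\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n}\lvert y_t - \hat{y}_t\rvert,\qquad
\mathrm{MAPE} = \frac{1}{n}\sum_{t=1}^{n}\left\lvert \frac{y_t - \hat{y}_t}{y_t} \right\rvert
$$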
# IS Performance
forecast_metric_data = forecast %>%
as_tibble() %>%
mutate(ds = as.Date(ds)) %>%
filter(ds<ymd("2021-12-31"))
RMSE = sqrt(mean((train$y - forecast_metric_data$yhat)^2))
MAE = mean(abs(train$y - forecast_metric_data$yhat))
MAPE = mean(abs((train$y - forecast_metric_data$yhat)/train$y))
print(paste("RMSE:",round(RMSE,2)))
## [1] "RMSE: 136623.17"
print(paste("MAE:",round(MAE,2)))
## [1] "MAE: 107534.14"
print(paste("MAPE:",round(MAPE,2)))
## [1] "MAPE: 0.2"
The model's in-sample performance isn't terrible: a mean absolute percentage error of 20% means we account for much of the day-to-day variation in mean global COVID cases. However, a 20% error still corresponds to a massive disparity in the total number of cases predicted, which means the model may be useful for gauging general trends but not for operationally relevant planning like bed allocation or material gathering.
# OOS Performance
forecast_metric_data = forecast %>%
as_tibble() %>%
mutate(ds = as.Date(ds)) %>%
filter(ds>=ymd("2021-12-31"))
test = test %>%
filter(ds<=max(forecast_metric_data$ds ))
RMSE = sqrt(mean((test$y - forecast_metric_data$yhat)^2))
MAE = mean(abs(test$y - forecast_metric_data$yhat))
MAPE = mean(abs((test$y - forecast_metric_data$yhat)/test$y))
print(paste("RMSE:",round(RMSE,2)))
## [1] "RMSE: 3367194.02"
print(paste("MAE:",round(MAE,2)))
## [1] "MAE: 3097973.31"
print(paste("MAPE:",round(MAPE,2)))
## [1] "MAPE: 0.77"
The model performs far worse on Out-Of-Sample data, a whopping 77% MAPE. This likely has to do with the major uptick in case numbers that immediately follows the end of the training data.
library(kableExtra)
df.cv <- cross_validation(model, initial = 30, period = 30, horizon = 30, units = 'days')
head(df.cv) %>%
kbl(caption = "Cross Validated Data Frame | Head") %>%
kable_classic(full_width = F, html_font = "Cambria")
| y | ds | yhat | yhat_lower | yhat_upper | cutoff |
|---|---|---|---|---|---|
| 454060.5 | 2021-02-04 | 454994.9 | 452966.2 | 456802.9 | 2021-02-03 |
| 443958.4 | 2021-02-05 | 446576.2 | 439632.6 | 453497.1 | 2021-02-03 |
| 433448.3 | 2021-02-06 | 437647.0 | 424185.8 | 451155.9 | 2021-02-03 |
| 424259.9 | 2021-02-07 | 425666.4 | 403892.5 | 447427.2 | 2021-02-03 |
| 414205.9 | 2021-02-08 | 412419.0 | 381054.0 | 444289.9 | 2021-02-03 |
| 405118.9 | 2021-02-09 | 398445.8 | 357398.0 | 441609.8 | 2021-02-03 |
Above I’m creating 30-day horizon rolling periods for cross validation on the Prophet model.
df.cv %>%
ggplot()+
ggtitle("Cross Validated Error by Month")+
geom_point(aes(ds,y)) +
geom_point(aes(ds,yhat,color=factor(cutoff)))+
theme_bw()+
xlab("Date")+
ylab("Daily Cases")+
scale_color_discrete(name = 'Cutoff')
The disparities witnessed in the dataset in the month of January are staggering relative to the general performance of the model.
round_df <- function(df, digits) {
nums <- vapply(df, is.numeric, FUN.VALUE = logical(1))
df[,nums] <- round(df[,nums], digits = digits)
(df)
}
library(kableExtra)
round_df(performance_metrics(df.cv),2) %>%
kbl(caption = "Cross-Validated Error Metrics") %>%
kable_classic(full_width = F, html_font = "Cambria")
| horizon | mse | rmse | mae | mape | mdape | smape | coverage |
|---|---|---|---|---|---|---|---|
| 3 days | 8315574810 | 91189.77 | 68732.69 | 0.13 | 0.09 | 0.13 | 0.70 |
| 4 days | 9061562940 | 95192.24 | 70502.24 | 0.13 | 0.07 | 0.14 | 0.70 |
| 5 days | 9854070470 | 99267.67 | 72902.47 | 0.13 | 0.06 | 0.14 | 0.64 |
| 6 days | 10709813512 | 103488.23 | 76212.80 | 0.14 | 0.06 | 0.15 | 0.58 |
| 7 days | 11941433604 | 109276.87 | 80932.20 | 0.15 | 0.08 | 0.16 | 0.55 |
| 8 days | 13797381733 | 117462.26 | 87326.07 | 0.16 | 0.11 | 0.17 | 0.55 |
| 9 days | 16070433060 | 126769.21 | 94707.31 | 0.17 | 0.13 | 0.18 | 0.55 |
| 10 days | 18483278627 | 135953.22 | 102519.98 | 0.19 | 0.17 | 0.20 | 0.55 |
| 11 days | 20492407738 | 143151.69 | 109395.83 | 0.20 | 0.20 | 0.21 | 0.55 |
| 12 days | 22524750971 | 150082.48 | 116008.21 | 0.22 | 0.22 | 0.23 | 0.55 |
| 13 days | 24573864537 | 156760.53 | 122239.92 | 0.23 | 0.22 | 0.24 | 0.52 |
| 14 days | 26998870963 | 164313.33 | 129329.60 | 0.24 | 0.22 | 0.26 | 0.48 |
| 15 days | 29833719847 | 172724.40 | 136990.15 | 0.26 | 0.24 | 0.27 | 0.45 |
| 16 days | 32818027539 | 181157.47 | 145407.50 | 0.27 | 0.26 | 0.29 | 0.45 |
| 17 days | 35877642221 | 189413.94 | 153729.94 | 0.29 | 0.27 | 0.31 | 0.48 |
| 18 days | 38675921911 | 196661.95 | 161567.13 | 0.31 | 0.28 | 0.33 | 0.48 |
| 19 days | 41875779430 | 204635.72 | 169553.87 | 0.32 | 0.30 | 0.35 | 0.48 |
| 20 days | 45572836665 | 213477.95 | 177633.48 | 0.34 | 0.31 | 0.37 | 0.45 |
| 21 days | 50574296782 | 224887.30 | 187085.73 | 0.36 | 0.31 | 0.39 | 0.48 |
| 22 days | 56996612875 | 238739.63 | 197590.59 | 0.38 | 0.31 | 0.41 | 0.52 |
| 23 days | 64236613113 | 253449.43 | 208828.08 | 0.40 | 0.33 | 0.44 | 0.55 |
| 24 days | 71476242725 | 267350.41 | 219210.50 | 0.41 | 0.33 | 0.46 | 0.55 |
| 25 days | 78638209321 | 280425.05 | 228980.73 | 0.43 | 0.33 | 0.48 | 0.55 |
| 26 days | 87428366669 | 295682.88 | 240058.33 | 0.45 | 0.33 | 0.51 | 0.55 |
| 27 days | 99062017172 | 314741.19 | 253038.31 | 0.46 | 0.32 | 0.53 | 0.55 |
| 28 days | 115623293421 | 340034.25 | 269099.83 | 0.48 | 0.32 | 0.56 | 0.55 |
| 29 days | 137047751911 | 370199.61 | 287390.09 | 0.50 | 0.32 | 0.59 | 0.55 |
| 30 days | 161522190296 | 401898.23 | 306993.68 | 0.53 | 0.32 | 0.62 | 0.52 |
There appears to be a general trend: the further the forecast horizon extends past the cutoff, the worse our predictions of daily cases become under cross-validation. We can test this by plotting the error metric against the horizon over which it is assessed.
plot_cross_validation_metric(df.cv, metric = 'rmse')
There is a general trend toward inaccuracy as the horizon grows, with the error at the 30-day window approaching 400,000-500,000 cases, several times the error at the shortest horizons.
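The same horizon-based view can be produced for the percentage error using the same prophet helper:

# Cross-validated MAPE as a function of forecast horizon
plot_cross_validation_metric(df.cv, metric = 'mape')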
RMSE = sqrt(mean((test$y - forecast_metric_data$yhat)^2))
RMSE
## [1] 3367194
error1
## [1] 44115394
The RMSE of the Prophet model is roughly 3.37 × 10^6 cases, while the RMSE of the best ARIMA model is roughly 4.41 × 10^7. The reason I'm using RMSE rather than MAPE or other criteria is that I want to consider all comparisons in terms of the original units: case numbers. MAPE could be misleading in this context because of the wide variation in case numbers between peaks and valleys; the same absolute error produces a much larger percentage error at the low points of the series. RMSE communicates an understandable metric in absolute units.