I decided to use the COVID-19 Data published by IHME. The data is published to the web by the Institute of Health Metrics and Evaluation for the sake of assisting policymakers with planning for uncertainty surrounding COVID. The data is collected from local and national governments, as well as hospitals and associations, WHO, and others.
They “use data from government websites for Brazil, Canada, France, Germany, Italy, Japan, Mexico, Pakistan, South Africa, Spain, United Kingdom, and the US states of Indiana, Illinois, Maryland, and Washington”, while data for New York is collected from the New York Times Github Repository. I decided on this dataset because of the importance of understanding the COVID-19 pandemic and the relevant factors impacting its transmission. This is a topic that is very closely connected to my everyday life and analyzing this data will help me to understand the problem in order to help those around me be safer and engage with the community in a more informed way.
I will specifically be analyzing the time series of “Mean Daily Cases” for the variable of interest for this analysis, looking at the global population, rather than any country-specific data.
The data was generated based on disease transmission, so variation in the variable is tied pretty heavily to adherence to pandemic protocols, or lack thereof due to fatigue, in addition to variants of the virus itself mutating to become more or less infectious/harmful. Localized quarantine procedures have been observed to be effective (Royal Society for Public Health - Volume 189) at containing the virus and disinfecting procedures have been observed to limit exposure in non-sterile environments (Vardoulakis et al.), though high levels of stress (Park et al) due to the pandemic may damage willingness to adhere to public health guidelines. This will be a very difficult variable to forecast because of the interactions between the nature of the virus itself as well as the behaviors exhibited by the population. This, in addition to the potential for new variants to appear from other portions of the globe, heightens an already dangerous situation due to how closely interconnected all the peoples of the world are.
Disclaimer - the articles cited above may be paywalled to non-UC users.
The maximum date value contained in this model is for May of 2022 - this data is from the reference scenario produced by IHME, which means that it will be displaying projections created by IHME regarding daily cases. For the purposes of this analysis, I will remove all values following today’s date (2/24/2022) and will treat the series as if it is the ground truth in order to test forecasting methods, though this will not become immediately necessary until later in the process, when I am no longer
data <- df[, c(2, 4, 23)]
data <- subset(data, location_name == "Global")
data$cases_mean[data$cases_mean == 0] <- 0.1
unique(data$location_name)
## [1] "Global"
summary(data)%>%
kbl(caption = "Summary Table for Global Daily Mean COVID-19 Cases Data") %>%
kable_classic(full_width = F, html_font = "Cambria")
| date | location_name | cases_mean | |
|---|---|---|---|
| Length:486 | Length:486 | Min. : 53214 | |
| Class :character | Class :character | 1st Qu.: 420254 | |
| Mode :character | Mode :character | Median : 560434 | |
| NA | NA | Mean : 882330 | |
| NA | NA | 3rd Qu.: 703547 | |
| NA | NA | Max. :5410623 |
The above table displays summary statistics for the entire dataset in a concise and well formatted table. By viewing this table in the HTML submission file you’ll be able to scroll through the columns inside the table, I sized it down for the purpose of brevity in the report, given the high number of columns.
library(kableExtra)
head(data) %>%
kbl(caption = "Mean Cases vs. Date") %>%
kable_classic(full_width = F, html_font = "Cambria")
| date | location_name | cases_mean |
|---|---|---|
| 2021-01-01 | Global | 635601.5 |
| 2021-01-02 | Global | 646073.9 |
| 2021-01-03 | Global | 657019.8 |
| 2021-01-04 | Global | 665052.6 |
| 2021-01-05 | Global | 680489.2 |
| 2021-01-06 | Global | 708657.8 |
Above we can see summary statistics of our data as well as the first 10 records of the data.
nulls <- rbind(data.frame(round(colSums(is.na(data))/nrow(data), digits = 2)))
names(nulls)[names(nulls) == 'round.colSums.is.na.data...nrow.data...digits...2.'] <- 'Proportion Null'
names(nulls)
## [1] "Proportion Null"
nulls %>%
kbl(caption = "Proportion of Missing Values by Variable") %>%
kable_styling(full_width = F) %>%
kable_paper("hover", full_width = F) %>%
column_spec(1, bold = TRUE, border_right = TRUE, color = "black", background = "lightgrey")
| Proportion Null | |
|---|---|
| date | 0 |
| location_name | 0 |
| cases_mean | 0 |
Our data isn’t missing any values in the time series of interest.
library(magrittr)
library(ggplot2)
data$date <- as.Date(data$date)
data %>%
ggplot()+
geom_line(aes(date,cases_mean))+
theme_bw()+
ggtitle("Mean Cases Over Time")+
ylab("Mean Cases")+
xlab("Date")
Fit a linear regression predicting your outcome (the time-series) as a function of time (measured in hours, days, weeks, etc.). Is there a statistically significant impact of time on your outcome? Present the results of the regression in a publication-quality table, and interpret the output. How much of the variation in the outcome is explained by a time trend? Does it look like a linear fit would do a good job of predicting the behavior of the time series?
library(sjPlot)
model <- lm(cases_mean ~ date, data = data)
summary_model <- summary(model)
tab_model(summary_model)
| cases mean | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | -44765600.99 | -56900516.01 – -32630685.96 | <0.001 |
| date | 2419.01 | 1775.97 – 3062.06 | <0.001 |
| Observations | 486 | ||
| R2 / R2 adjusted | 0.101 / 0.100 | ||
The above linear model appears to have a very low R^2 and Adj. R^2, which likely means that there is not a linear relationship between the two, however, fitting a better model with more finely tuned parameters may change that.
library(magrittr)
library(ggplot2)
data$date <- as.Date(data$date)
data %>%
ggplot()+
geom_line(aes(date,cases_mean))+
theme_bw()+
ggtitle("Mean Cases Over Time")+
ylab("Mean Cases")+
xlab("Date")
There are some observable trends in the data in terms of rising average case numbers worldwide, with what appears to be 6-month humps in cases with an observed high recently due to the widespread infectiousness of Omicron. Due to the observed periodicity of the dataset we will likely not be able to observe stationarity - a regular pace seems to exist wherein cases begin to rise. Seasonality may not be the correct word for what I’m observing, as the data has only been relevant for a short period due to the novelty of the disease.
data = data %>%
mutate(lag1_cases = dplyr::lag(cases_mean),
lag2_cases = dplyr::lag(cases_mean,2),
lag3_cases = dplyr::lag(cases_mean,3))
head(data) %>%
kbl(caption = "First 10 Rows in the New Dataset") %>%
kable_styling(full_width = F) %>%
kable_paper("hover", full_width = F) %>%
kable_classic(full_width = F, html_font = "Cambria")
| date | location_name | cases_mean | lag1_cases | lag2_cases | lag3_cases |
|---|---|---|---|---|---|
| 2021-01-01 | Global | 635601.5 | NA | NA | NA |
| 2021-01-02 | Global | 646073.9 | 635601.5 | NA | NA |
| 2021-01-03 | Global | 657019.8 | 646073.9 | 635601.5 | NA |
| 2021-01-04 | Global | 665052.6 | 657019.8 | 646073.9 | 635601.5 |
| 2021-01-05 | Global | 680489.2 | 665052.6 | 657019.8 | 646073.9 |
| 2021-01-06 | Global | 708657.8 | 680489.2 | 665052.6 | 657019.8 |
cases_diff = data %>%
mutate(case_diff = cases_mean - lag(cases_mean))
cases_diff %>%
select(date,cases_mean,case_diff) %>%
head() %>%
kbl(caption = "Head of Differenced Case Data") %>%
kable_classic(full_width = F, html_font = "Cambria")
| date | cases_mean | case_diff |
|---|---|---|
| 2021-01-01 | 635601.5 | NA |
| 2021-01-02 | 646073.9 | 10472.43 |
| 2021-01-03 | 657019.8 | 10945.96 |
| 2021-01-04 | 665052.6 | 8032.74 |
| 2021-01-05 | 680489.2 | 15436.63 |
| 2021-01-06 | 708657.8 | 28168.53 |
cases_diff %>%
ggplot()+
geom_line(aes(date,case_diff))+
theme_bw()+
ggtitle("Mean Daily Cases (First Difference)")+
ylab("Mean Cases (Difference))")+
xlab("Date")
library(tidyr)
cases_diff = cases_diff %>%
mutate(case_log = log1p(case_diff)) %>%
mutate(case_log_diff = case_log - lag(case_log))
nan_df <- subset(cases_diff, is.na(cases_diff))
cases_diff %>%
ggplot()+
geom_line(aes(date,case_log_diff))+
theme_bw()+
ggtitle("Mean Daily Cases (Log; First Difference)")+
ylab("Mean Cases (Difference))")+
xlab("Date")
cases_log_diff <- data %>%
arrange(date) %>%
mutate(
cases_log = log1p(cases_mean), # Log for Variance Stationarity
cases_diff = cases_mean - lag(cases_mean),
cases_log_diff = cases_log - lag(cases_log) # Difference for Mean Stationarity
) %>%
drop_na()
cases_log_diff %>%
ggplot() +
geom_line(aes(date, cases_log_diff)) +
theme_bw() +
ggtitle("Difference in Log Values, Mean Daily Cases over Time") +
ylab("Mean Daily Cases (Difference))") +
xlab("Date")
In the above section I attempt to create a more stationary version of the time series, such that it will be easier to forecast. I’ve attempted to do this utilizing differenced calculations in the data (accounting for lag) and utilizing log transformations, though these don’t appear to have effectively made the series more stationary.
library(tseries) # Use tseries package for adf.test function
adf_mean <- adf.test(cases_log_diff$cases_mean) # Raw Close Value
adf_diff <- adf.test(cases_log_diff$cases_diff) # First Difference
adf_log <- adf.test(cases_log_diff$cases_log) # Log Close Value
adf_log_diff <- adf.test(cases_log_diff$cases_log_diff) # Differenced Log Close Value
kp_mean <- kpss.test(cases_log_diff$cases_mean) # Raw close value
kp_diff <- kpss.test(cases_log_diff$cases_diff) # First difference
kp_log <- kpss.test(cases_log_diff$cases_log) # Log close value
kp_log_diff <- kpss.test(cases_log_diff$cases_log_diff) # Differenced log close value
stats <- matrix(c("ADF","Log Transform",adf_log$p.value,"Stationary",
"KPSS","Log Transform",kp_log$p.value,"Stationary",
"ADF","First Difference",adf_diff$p.value,"Non-stationary",
"KPSS","First Difference",kp_diff$p.value,"Stationary"),
ncol=4,byrow=TRUE)
stats %>%
kbl(caption = "Table 2: ADF/KPSS test on Log vs. Box-cox transform") %>%
kable_classic(full_width = F, html_font = "Cambria")
| ADF | Log Transform | 0.0190727806666848 | Stationary |
| KPSS | Log Transform | 0.0279839115781265 | Stationary |
| ADF | First Difference | 0.01 | Non-stationary |
| KPSS | First Difference | 0.1 | Stationary |
After running and ADF (Augmented Dickey-Fuller) Test on the mean of daily cases as well as the difference in daily cases, the log of cases, and the log difference of daily cases, we can see that the p values of the 1st difference in cases and the log difference in cases indicate there is sufficient evidence to reject the null hypothesis and accept the alternative hypothesis that the data are stationary, indicating that the 1st difference and log of cases are stationary series.
The KPSS test indicates that the Log transformation fails to reject the null hypothesis, meaning that the series is trend stationary, while all others reject the null hypothesis and indicate that the series is not trend stationary. Our results for these tests together indicate that the log transformation on undifferenced data will likely be best for our time-series model.
Above I’ve created the ACF and PACF charts for 1st order differences and log transformations of the daily mean cases. It appears that the most effective transformation is the log transformation, however, it may be useful to explore some 2nd transformations to observe whether they have an impact on the autocorrelation effects observed in the ACF plots.
cases_log_diff <- data %>%
arrange(date) %>%
mutate(
cases_log = log1p(cases_mean), # Log for Variance Stationarity
cases_diff = cases_mean - lag(cases_mean),
cases_log_diff = cases_log - lag(cases_log), # Difference for Mean Stationarity
cases_diff2 = cases_mean - lag(cases_mean,2),
cases_log_diff2 = cases_log - lag(cases_log, 2)
) %>%
drop_na()
Above I’ve calculated second degree transformations for the log and difference values of mean daily cases, and will calculate the KPSS & ADF test values on the new fields as well as graph their ACF & PACF plots below.
The transformations calculated above indicate higher autocorrelative effects than the first degree transformations, so I will utilize the first degree transformations only.
cases_log_diff %>%
ggplot()+
geom_line(aes(date,cases_log))+
theme_bw()+
ggtitle("Mean Daily Cases (Log Transformed)")+
ylab("Mean Cases (Log))")+
xlab("Date")
data$cases_mean[data$cases_mean == 0] <- 0.1
cases_boxcox = data %>%
# filter(date>=ymd('2009-01-01') & date<=ymd("2015-01-01")) %>%
mutate(case_boxcox = forecast::BoxCox(cases_mean,lambda='auto'))
min(data$cases_mean)
## [1] 53214.16
cases_boxcox %>%
ggplot()+
geom_line(aes(date,case_boxcox))+
theme_bw()+
ggtitle("Mean Cases over Time (Box-Cox Transformation)")+
ylab("Mean Cases (Box-Cox Tranformation)")+
xlab("Date")
I have above calculated a Boxcox transformation of mean daily cases in an attempt to identify a more suitable model, and have built a variety of models below altering settings within ARIMA models to evaluate the models against one another.
BIC(
arima(cases_boxcox$case_boxcox,order=c(0,1,1)),
arima(cases_boxcox$case_boxcox,order=c(0,1,2)),
arima(cases_boxcox$case_boxcox,order=c(0,1,3)),
arima(cases_boxcox$case_boxcox,order=c(1,1,0)),
arima(cases_boxcox$case_boxcox,order=c(1,1,1)),
arima(cases_boxcox$case_boxcox,order=c(2,1,0)),
arima(cases_boxcox$case_boxcox,order=c(3,1,0))
) %>%
kbl(caption = "Comparing Model BIC") %>%
kable_classic(full_width = F, html_font = "Cambria")
| df | BIC | |
|---|---|---|
| arima(cases_boxcox\(case_boxcox, order = c(0, 1, 1)) </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> -3680.105 </td> </tr> <tr> <td style="text-align:left;"> arima(cases_boxcox\)case_boxcox, order = c(0, 1, 2)) | 3 | -4252.574 |
| arima(cases_boxcox\(case_boxcox, order = c(0, 1, 3)) </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> -4573.805 </td> </tr> <tr> <td style="text-align:left;"> arima(cases_boxcox\)case_boxcox, order = c(1, 1, 0)) | 2 | -4930.274 |
| arima(cases_boxcox\(case_boxcox, order = c(1, 1, 1)) </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> -5234.642 </td> </tr> <tr> <td style="text-align:left;"> arima(cases_boxcox\)case_boxcox, order = c(2, 1, 0)) | 3 | -5046.246 |
| arima(cases_boxcox$case_boxcox, order = c(3, 1, 0)) | 4 | -5116.265 |
The best model of the above seems to be the ARIMA model of order (1,1,1), let’s compare those results with that of the log transformed arima models.
BIC(
arima(cases_log_diff$cases_log,order=c(0,1,1)),
arima(cases_log_diff$cases_log,order=c(0,1,2)),
arima(cases_log_diff$cases_log,order=c(0,1,3)),
arima(cases_log_diff$cases_log,order=c(1,1,0)),
arima(cases_log_diff$cases_log,order=c(1,1,1)),
arima(cases_log_diff$cases_log,order=c(2,1,0)),
arima(cases_log_diff$cases_log,order=c(3,1,0))
) %>%
kbl(caption = "Comparing Model BIC") %>%
kable_classic(full_width = F, html_font = "Cambria")
| df | BIC | |
|---|---|---|
| arima(cases_log_diff\(cases_log, order = c(0, 1, 1)) </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> -2651.339 </td> </tr> <tr> <td style="text-align:left;"> arima(cases_log_diff\)cases_log, order = c(0, 1, 2)) | 3 | -3224.673 |
| arima(cases_log_diff\(cases_log, order = c(0, 1, 3)) </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> -3568.429 </td> </tr> <tr> <td style="text-align:left;"> arima(cases_log_diff\)cases_log, order = c(1, 1, 0)) | 2 | -3893.528 |
| arima(cases_log_diff\(cases_log, order = c(1, 1, 1)) </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> -4202.144 </td> </tr> <tr> <td style="text-align:left;"> arima(cases_log_diff\)cases_log, order = c(2, 1, 0)) | 3 | -4018.758 |
| arima(cases_log_diff$cases_log, order = c(3, 1, 0)) | 4 | -4089.790 |
These models don’t perform as well as the boxcox transformation models according to the BIC criterion, however, before accepting a single model I will also create some auto.arima models to see if there are any better options I haven’t considered.
library(forecast)
auto.arima(cases_boxcox$case_boxcox,stationary=FALSE,allowdrift=FALSE,
seasonal=FALSE,stepwise=FALSE,approximation=FALSE)
## Series: cases_boxcox$case_boxcox
## ARIMA(0,2,5)
##
## Coefficients:
## ma1 ma2 ma3 ma4 ma5
## 0.9648 -0.0360 -0.1682 0.053 -0.1165
## s.e. 0.0446 0.0646 0.0588 0.064 0.0489
##
## sigma^2 estimated as 1.418e-06: log likelihood=2628.56
## AIC=-5245.13 AICc=-5244.95 BIC=-5220.03
The auto-ARIMA model of a boxcox transformation selected a model of order (0, 2, 5), and has a comparable BIC to the model selected above.The boxcox transformation with a stationary = FALSE setting performed better than the stationary = TRUE setting.
auto.arima(cases_log_diff$cases_log,stationary=FALSE,allowdrift=FALSE,
seasonal=FALSE,stepwise=FALSE,approximation=FALSE)
## Series: cases_log_diff$cases_log
## ARIMA(0,2,5)
##
## Coefficients:
## ma1 ma2 ma3 ma4 ma5
## 1.0244 0.0251 -0.1846 0.0305 -0.1187
## s.e. 0.0461 0.0703 0.0625 0.0701 0.0526
##
## sigma^2 estimated as 9.669e-06: log likelihood=2115.2
## AIC=-4218.41 AICc=-4218.23 BIC=-4193.35
The auto arima models for both boxcox and log transormed mean cases appear to perform strongly, however, the boxcox transformed arima model appears stronger.
best_mod = arima(cases_boxcox$case_boxcox,order=c(1,1,1))
forecast::checkresiduals(best_mod)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,1)
## Q* = 31.839, df = 8, p-value = 9.951e-05
##
## Model df: 2. Total lags used: 10
The above residuals indicate that some autocorrelative effects are still lingering in the ARIMA model selected through the above model selection means.
par(mfrow=c(1,2))
resid = best_mod$resid
acf(resid,lag.max=20)
pacf(resid,lag.max=20)
This effect can be seen most obviously in the first period lag.
Box.test(resid,type='Ljung-Box',lag=1)
##
## Box-Ljung test
##
## data: resid
## X-squared = 0.18195, df = 1, p-value = 0.6697
Box.test(resid,type='Ljung-Box',lag=5)
##
## Box-Ljung test
##
## data: resid
## X-squared = 3.9355, df = 5, p-value = 0.5587
Box.test(resid,type='Ljung-Box',lag=10)
##
## Box-Ljung test
##
## data: resid
## X-squared = 31.839, df = 10, p-value = 0.0004259
Box.test(resid,type='Ljung-Box',lag=20)
##
## Box-Ljung test
##
## data: resid
## X-squared = 43.535, df = 20, p-value = 0.001736
The Box-Ljung tests above indicate that there are significant autocorrelative effects occurring at the 10 and 20 period lags in the model as it currently runs.
resid = best_mod$residuals
pred = resid+cases_boxcox$case_boxcox
ggplot()+
geom_line(aes(cases_boxcox$date,cases_boxcox$case_boxcox))+
geom_line(aes(cases_boxcox$date,pred),color='blue',alpha=0.4)+
theme_bw()+
ggtitle("Box-Cox Transformed Mean Daily Cases Predictions vs. Actuals (Training Model") +
xlab("Date")+
ylab("Box-Cox Daily Cases")
RMSE = sqrt(mean((expm1(pred) - expm1(cases_boxcox$case_boxcox))^2,na.rm=T))
RMSE
## [1] 4.860085
best_mod %>%
forecast(h=60) %>%
autoplot()
Our current model indicates that cases will continue to be low in the future. These values seem reasonable as observed trends in the past indicate that the period after the holiday season sees a reduction in cases, and the Omicron variant has been so far reaching that it is losing steam relative to the remaining vulnerable population.
library(lubridate)
train = data %>%
filter(date<ymd("2022-1-31")) %>%
mutate(case_boxcox = forecast::BoxCox(cases_mean,lambda='auto'))
test = data %>%
filter(date>=ymd("2022-1-31") & date<ymd("2022-02-24")) %>%
mutate(case_boxcox = forecast::BoxCox(cases_mean,lambda='auto'))
best_mod = arima(train$case_boxcox,order=c(1,1,1))
test_pred1 = predict(best_mod,24)
error_df = data.frame(test_pred1$pred, test$case_boxcox)
error1 = (mean((error_df$test_pred1.pred - error_df$test.case_boxcox)^2)^1/2)
error1
## [1] 44115394
The out of sample error of the best ARIMA model selected from the data is pretty substantial, which makes sense because the number of cases has differed so heavily, with the minor hills observed earlier in the dataset representing swings in case numbers to the order of 100’s of 1000’s of cases.
library(magrittr)
library(ggplot2)
data$date <- as.Date(data$date)
data %>%
ggplot()+
geom_line(aes(date,cases_mean))+
theme_bw()+
ggtitle("Mean Cases Over Time")+
ylab("Mean Cases")+
xlab("Date")
The above graphic displays the mean daily cases over time starting in 2021, we can see a major spike in total daily cases due to the rapid spread of the Omicron variant starting in early 2022.
The maximum date value contained in this model is for May of 2022 - this data is from the reference scenario produced by IHME, which means that it will be displaying projections created by IHME regarding daily cases. For the purposes of this analysis, I will remove all values following today’s date (2/17/2022) and will treat the series as if it is the ground truth in order to test the forecasting method.
# R
library(prophet)
library(dplyr)
prophet_data = data %>%
rename(ds = date, # Have to name our date variable "ds"
y = cases_mean) # Have to name our time series "y"
m <- prophet(prophet_data, )
train = prophet_data %>%
filter(ds<ymd("2021-12-31"))
test = prophet_data %>%
filter(ds>=ymd("2021-12-31")) %>%
filter(ds<ymd("2022-02-17"))
model = prophet(train)
future = make_future_dataframe(model,periods =31)
forecast = predict(model,future)
tail(future) %>%
kbl(caption = "Tail of Future Caps on Data") %>%
kable_classic(full_width = F, html_font = "Cambria")
| ds | |
|---|---|
| 390 | 2022-01-25 |
| 391 | 2022-01-26 |
| 392 | 2022-01-27 |
| 393 | 2022-01-28 |
| 394 | 2022-01-29 |
| 395 | 2022-01-30 |
tail(forecast[c('ds', 'yhat', 'yhat_lower', 'yhat_upper')]) %>%
kbl(caption = "Tail of Future Forecasting Bounds") %>%
kable_classic(full_width = F, html_font = "Cambria")
| ds | yhat | yhat_lower | yhat_upper | |
|---|---|---|---|---|
| 390 | 2022-01-25 | 819777.6 | 647605.2 | 997128.8 |
| 391 | 2022-01-26 | 825841.0 | 637025.6 | 1011474.6 |
| 392 | 2022-01-27 | 831199.3 | 646897.9 | 1011912.5 |
| 393 | 2022-01-28 | 821353.3 | 640962.6 | 1001604.6 |
| 394 | 2022-01-29 | 825413.1 | 655076.5 | 993271.1 |
| 395 | 2022-01-30 | 830027.2 | 658915.2 | 1008848.7 |
Above we can see the first forecasts generated by dropping the data into the Prophet model. We’re given a range of upper and lower values for the dates leading up to 1/30/2022, 30 days after the end of our training dataset. We can visualize these results and the intervals included with a simple plot:
# R
plot(m, forecast) +
ylab("Daily Covid Cases")+xlab("Date") +theme_bw()
prophet_plot_components(m, forecast)
We can see by plotting the ground truth data of our time series vs. the projections that this model was unable to predict the dramatic rise in cases due to Omicron, but it was responsive enought to detect the uptick in cases represented at the end of the training data. The decomposition of the seasonality elements present in the data indicates that Monday’s are low points for daily mean COVID cases, with Sunday being the high point.
dyplot.prophet(model,forecast)
We can also observe interactive plots which indicate specific dates, actual, and predicted values, which displays a helpful density graphic below the X axis. We can examine shorter time periods by zooming in on the time period in question using the filters attached to the bottom of the graphic.
plot(model,forecast)+add_changepoints_to_plot(model)+theme_bw()+xlab("Date")+ylab("Daily Covid Cases")
Prophet allows for the easy inclusion of changepoints (click here to learn more about changpoints), which can help us to identify when trends reverse course, this can be helpful in the case of the Omicron variant by helping us to recognize a significant departure from our modeled values beginning in October, putting us on high alert about the number of cases on the horizon.
We can modify the number of changepoints as shown below by adding a simple parameter to the prophet model.
# Number of Changepoints
model = prophet(train,n.changepoints=100)
forecast = predict(model,future)
plot(model,forecast)+add_changepoints_to_plot(model)+theme_bw()+xlab("Date")+ylab("COVID Cases")
prophet_plot_components(model,forecast)
The decomposed seasonality elements yielded by the above model for changepoints indicates that Thursdays, rather than Monday’s, are to be watched out for in terms of weekly COVID case changepoints. When considering our original time series decomposition, we saw that Sundays were the high point of the week, which is supported above by the increase observed through the week, peaking at thursday, and then reversing.
By making forecasts 2 years into the future, we can observe the current trends that Prophet expects the data to follow.
two_yr_future = make_future_dataframe(model,periods = 730)
two_yr_forecast = predict(model,two_yr_future)
plot(model,two_yr_forecast)+theme_bw()+xlab("Date")+ylab("Mean Daily Cases")
Considering that our training data ends in the throes of the Omicron surge, we shouldn’t be surprised that the two year forecast for our data indicates high amounts of cases. We should, however, consider adding a ceiling to this prediction to manage any overzealous predictions that are too heavily influenced by recent trends.
# Set "floor" in training data
train$floor = 0
train$cap = 9999999
future$floor = 0
future$cap = 9999999
# Set floor in forecsat data
two_yr_future$floor = 0
two_yr_future$cap = 9999999
sat_model = prophet(train,growth='logistic')
sat_two_yr_forecast = predict(sat_model,two_yr_future)
plot(sat_model,sat_two_yr_forecast)+ylim(0,999999)+
theme_bw()+xlab("Date")+ylab("Daily Mean Cases")
We can see a more modest estimate of daily mean cases above when the ceiling was added to our model.
Below is an additive model that makes very similar predictions regarding the near future of cases to our original prophet model.
additive = prophet(train)
add_fcst = predict(additive,future)
plot(additive,add_fcst)+
ylim(0,999999)
prophet_plot_components(additive,add_fcst)
The seasonal components of our additive model indicate that Thursday is still a major source of information for weekly case growth.
multi = prophet(train,seasonality.mode = 'multiplicative')
multi_fcst = predict(multi,future)
plot(multi,multi_fcst)+ylim(0,999999)
prophet_plot_components(multi, multi_fcst)
The multiplicative model above further verifies the findings of the Additive model.
Many time series’ contain trends built into holiday data that influence the behavior of the time series. We can use prophet to examine these holiday trends and build an awareness of them into our model.
model = prophet(train,fit=FALSE)
model = add_country_holidays(model,country_name = 'US')
model = fit.prophet(model,train)
forecast = predict(model,future)
prophet_plot_components(model,forecast)
The above plot displays numerous peaks and valley’s related to holiday data, maintaining that Thursdays are the most important day of the week for trending changes in mean daily cases.
We can make this information more useful by examining the holiday’s directly by name:
library(tidyr)
library(ggplot2)
library(dplyr)
forecast %>%
filter(holidays != 0) %>%
dplyr::select(-ds:-additive_terms_upper, -holidays:-holidays_upper, -weekly:-yhat, -contains("upper"), -contains("lower")) %>%
mutate_all(~ if_else(. == 0, as.numeric(NA), .)) %>%
summarize_if(is.numeric, ~ max(., na.rm = T)) %>%
pivot_longer(
cols = `Christmas Day`:`Washington's Birthday`,
names_to = 'holiday',
values_to = 'effect'
) %>%
ggplot() +
geom_col(aes(effect,holiday))+
ggtitle("Effect of Holidays on Mean Daily Cases") +
theme_bw()
Many holidays have an observed effect on the number of mean daily COVID-19 Cases. Many of the summer holidays that we observe are associated with low-points, which may be due to lower case numbers in summertime, or lower transmission due to outdoor events. Holidays after November 31st and before February are associated with a major increase in case numbers, generally. The most surprising find here is Thanksgiving’s negative effect on overall global cases. This may be due to the fact that this is for global case data while I’m only accounting for US holidays. It may also be noted that the highest
Assessing model performance is highly important to the relavence of an analysis, I’ll be using the RMSE (Root mean squared error), MAE (Mean absolute error) and MAPE (mean absolute percentage error) to assess model performance.
# IS Performance
forecast_metric_data = forecast %>%
as_tibble() %>%
mutate(ds = as.Date(ds)) %>%
filter(ds<ymd("2021-12-31"))
RMSE = sqrt(mean((train$y - forecast_metric_data$yhat)^2))
MAE = mean(abs(train$y - forecast_metric_data$yhat))
MAPE = mean(abs((train$y - forecast_metric_data$yhat)/train$y))
print(paste("RMSE:",round(RMSE,2)))
## [1] "RMSE: 136623.17"
print(paste("MAE:",round(MAE,2)))
## [1] "MAE: 107534.14"
print(paste("MAPE:",round(MAPE,2)))
## [1] "MAPE: 0.2"
The model performance of the in-sample data isn’t terrible, it has a Mean Absolute Percentage Error of 20%, indicating that we account for much of the daily differences in mean covid cases globally, however, that 20% difference corresponds to a massive disparity in the total number of cases predicted, which means that this may be useful for general trending but not for operationally relevant implementation like bed-planning or material gathering.
# OOS Performance
forecast_metric_data = forecast %>%
as_tibble() %>%
mutate(ds = as.Date(ds)) %>%
filter(ds>=ymd("2021-12-31"))
test = test %>%
filter(ds<=max(forecast_metric_data$ds ))
RMSE = sqrt(mean((test$y - forecast_metric_data$yhat)^2))
MAE = mean(abs(test$y - forecast_metric_data$yhat))
MAPE = mean(abs((test$y - forecast_metric_data$yhat)/test$y))
print(paste("RMSE:",round(RMSE,2)))
## [1] "RMSE: 3367194.02"
print(paste("MAE:",round(MAE,2)))
## [1] "MAE: 3097973.31"
print(paste("MAPE:",round(MAPE,2)))
## [1] "MAPE: 0.77"
The model performs far worse on Out-Of-Sample data, a whopping 77% MAPE. This likely has to do with the major uptick in case numbers that immediately follows the end of the training data.
library(kableExtra)
df.cv <- cross_validation(model, initial = 30, period = 30, horizon = 30, units = 'days')
head(df.cv) %>%
kbl(caption = "Cross Validated Data Frame | Head") %>%
kable_classic(full_width = F, html_font = "Cambria")
| y | ds | yhat | yhat_lower | yhat_upper | cutoff |
|---|---|---|---|---|---|
| 454060.5 | 2021-02-04 | 454994.9 | 452966.2 | 456802.9 | 2021-02-03 |
| 443958.4 | 2021-02-05 | 446576.2 | 439632.6 | 453497.1 | 2021-02-03 |
| 433448.3 | 2021-02-06 | 437647.0 | 424185.8 | 451155.9 | 2021-02-03 |
| 424259.9 | 2021-02-07 | 425666.4 | 403892.5 | 447427.2 | 2021-02-03 |
| 414205.9 | 2021-02-08 | 412419.0 | 381054.0 | 444289.9 | 2021-02-03 |
| 405118.9 | 2021-02-09 | 398445.8 | 357398.0 | 441609.8 | 2021-02-03 |
Above I’m creating 30-day horizon rolling periods for cross validation on the Prophet model.
df.cv %>%
ggplot()+
ggtitle("Cross Validated Error by Month")+
geom_point(aes(ds,y)) +
geom_point(aes(ds,yhat,color=factor(cutoff)))+
theme_bw()+
xlab("Date")+
ylab("Daily Cases")+
scale_color_discrete(name = 'Cutoff')
The disparities witnessed in the dataset in the month of January are staggering relative to the general performance of the model.
round_df <- function(df, digits) {
nums <- vapply(df, is.numeric, FUN.VALUE = logical(1))
df[,nums] <- round(df[,nums], digits = digits)
(df)
}
library(kableExtra)
round_df(performance_metrics(df.cv),2) %>%
kbl(caption = "Cross-Validated Error Metrics") %>%
kable_classic(full_width = F, html_font = "Cambria")
| horizon | mse | rmse | mae | mape | mdape | smape | coverage |
|---|---|---|---|---|---|---|---|
| 3 days | 8315574810 | 91189.77 | 68732.69 | 0.13 | 0.09 | 0.13 | 0.70 |
| 4 days | 9061562940 | 95192.24 | 70502.24 | 0.13 | 0.07 | 0.14 | 0.70 |
| 5 days | 9854070470 | 99267.67 | 72902.47 | 0.13 | 0.06 | 0.14 | 0.64 |
| 6 days | 10709813512 | 103488.23 | 76212.80 | 0.14 | 0.06 | 0.15 | 0.58 |
| 7 days | 11941433604 | 109276.87 | 80932.20 | 0.15 | 0.08 | 0.16 | 0.55 |
| 8 days | 13797381733 | 117462.26 | 87326.07 | 0.16 | 0.11 | 0.17 | 0.55 |
| 9 days | 16070433060 | 126769.21 | 94707.31 | 0.17 | 0.13 | 0.18 | 0.55 |
| 10 days | 18483278627 | 135953.22 | 102519.98 | 0.19 | 0.17 | 0.20 | 0.55 |
| 11 days | 20492407738 | 143151.69 | 109395.83 | 0.20 | 0.20 | 0.21 | 0.55 |
| 12 days | 22524750971 | 150082.48 | 116008.21 | 0.22 | 0.22 | 0.23 | 0.55 |
| 13 days | 24573864537 | 156760.53 | 122239.92 | 0.23 | 0.22 | 0.24 | 0.52 |
| 14 days | 26998870963 | 164313.33 | 129329.60 | 0.24 | 0.22 | 0.26 | 0.48 |
| 15 days | 29833719847 | 172724.40 | 136990.15 | 0.26 | 0.24 | 0.27 | 0.45 |
| 16 days | 32818027539 | 181157.47 | 145407.50 | 0.27 | 0.26 | 0.29 | 0.45 |
| 17 days | 35877642221 | 189413.94 | 153729.94 | 0.29 | 0.27 | 0.31 | 0.48 |
| 18 days | 38675921911 | 196661.95 | 161567.13 | 0.31 | 0.28 | 0.33 | 0.48 |
| 19 days | 41875779430 | 204635.72 | 169553.87 | 0.32 | 0.30 | 0.35 | 0.48 |
| 20 days | 45572836665 | 213477.95 | 177633.48 | 0.34 | 0.31 | 0.37 | 0.45 |
| 21 days | 50574296782 | 224887.30 | 187085.73 | 0.36 | 0.31 | 0.39 | 0.48 |
| 22 days | 56996612875 | 238739.63 | 197590.59 | 0.38 | 0.31 | 0.41 | 0.52 |
| 23 days | 64236613113 | 253449.43 | 208828.08 | 0.40 | 0.33 | 0.44 | 0.55 |
| 24 days | 71476242725 | 267350.41 | 219210.50 | 0.41 | 0.33 | 0.46 | 0.55 |
| 25 days | 78638209321 | 280425.05 | 228980.73 | 0.43 | 0.33 | 0.48 | 0.55 |
| 26 days | 87428366669 | 295682.88 | 240058.33 | 0.45 | 0.33 | 0.51 | 0.55 |
| 27 days | 99062017172 | 314741.19 | 253038.31 | 0.46 | 0.32 | 0.53 | 0.55 |
| 28 days | 115623293421 | 340034.25 | 269099.83 | 0.48 | 0.32 | 0.56 | 0.55 |
| 29 days | 137047751911 | 370199.61 | 287390.09 | 0.50 | 0.32 | 0.59 | 0.55 |
| 30 days | 161522190296 | 401898.23 | 306993.68 | 0.53 | 0.32 | 0.62 | 0.52 |
There appears to be a general trend that the further we get away from the current date the worse our predictions are for daily cases using cross-validation. We can test this out by plotting the error metrics against the windows it’s assessed for.
plot_cross_validation_metric(df.cv, metric = 'rmse')
There is a general trend towards inaccuracy as time goes on, with the 30 day window approaching a difference of 500,000 cases relative to the 1 day window.
RMSE = sqrt(mean((test$y - forecast_metric_data$yhat)^2))
RMSE
## [1] 3367194
error1
## [1] 44115394
The RMSE of the Prophet model is 3.36719410^{6}, while the RMSE of the best ARIMA model is 4.411539410^{7}. The reason that I’m operating using the RMSE vs. the MAPE or other criteria is that I want to consider all comparisons in terms of the original units - by case numbers. I think that the MAPE value could be misleading in the context due to the wide variation in case numbers during peaks and valleys, smaller variations will make for wider deviations during the low points of case numbers. The RMSE communicates an understandable metric in absolute units.