Midterm Paper

Introduction

Over the past two years, the people of the world have been undergoing the SARS-CoV-2 virus, causing COVID-19. In this paper, we will use data from Kaggle to forecast daily world-wide cases and fatalities. The data we have is from January 23, 2020 to June 10, 2020. With the continued fear of additional variants arising, it is important for us to understand what the future may bring. With this data, we will be able to compare our forecasts to the test data, also provided by Kaggle.

Literature Review

In previous work, we see how COVID-19 data has reacted to time series forecasting methods. Naumov et. al. uses ARIMA and Holt-Winters’ exponential smoothing models. An important note to consider is they used the additive method for Holt-Winters’ model (Naumov, A.V et al). In our analysis we will compare both additive and multiplicative using the MAPE metric - the method with the MAPE closest to zero is most ideal.

Doornik, Castle, and Hendry looked at COVID-19 data from the United States only. In their data, they can see seasonality in the plots, which is something we will need to be looking for with the world-wide data ([2])Doornik, Jurgen A, Jennifer L Castle, and David F Hendry). In contrast from Naumov, Doornik uses a multiplicative model, also why we will compare it with the additive approach.

Data Exploration

We first convert our training data into two time series, one for world-wide confirmed cases and the other for world-wide fatalities.

# require libraries
library(fpp)

## Loading required package: forecast

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

## Loading required package: fma

## Loading required package: expsmooth

## Loading required package: lmtest

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Loading required package: tseries

library(fpp2)

## -- Attaching packages ---------------------------------------------- fpp2 2.4 --

## v ggplot2 3.3.5

##

## 
## Attaching package: 'fpp2'

## The following objects are masked from 'package:fpp':
## 
##     ausair, ausbeer, austa, austourists, debitcards, departures,
##     elecequip, euretail, guinearice, oil, sunspotarea, usmelec

library(fpp3)

## -- Attaching packages -------------------------------------------- fpp3 0.4.0 --

## v tibble      3.1.5      v tsibble     1.1.0 
## v dplyr       1.0.7      v tsibbledata 0.3.0 
## v tidyr       1.1.4      v feasts      0.2.2 
## v lubridate   1.7.10     v fable       0.3.1

## -- Conflicts ------------------------------------------------- fpp3_conflicts --
## x lubridate::date()      masks base::date()
## x dplyr::filter()        masks stats::filter()
## x fabletools::forecast() masks forecast::forecast()
## x tsibble::index()       masks zoo::index()
## x tsibble::intersect()   masks base::intersect()
## x tsibble::interval()    masks lubridate::interval()
## x dplyr::lag()           masks stats::lag()
## x tsibble::setdiff()     masks base::setdiff()
## x tsibble::union()       masks base::union()

## 
## Attaching package: 'fpp3'

## The following object is masked from 'package:fpp2':
## 
##     insurance

## The following object is masked from 'package:fpp':
## 
##     insurance

library(forecast)
library(grid)
library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

library(seasonal)

## 
## Attaching package: 'seasonal'

## The following object is masked from 'package:tibble':
## 
##     view

# read data
covid = read.csv("C:/Users/Jake/Documents/PredAnalytics/covid.csv")
# separate data from "confirmed" and "Fatalities"
confirm = covid[which(covid$Target == "ConfirmedCases"),]
fatal = covid[which(covid$Target == "Fatalities"),]
# Create time series for each target value
confirm2 = aggregate(TargetValue ~ Date, data = confirm, FUN = sum)
confirm_ts = ts(confirm2)
fatal2 = aggregate(TargetValue ~ Date, data = fatal, FUN = sum)
fatal_ts = ts(fatal2)
# plot time series
a = autoplot(confirm_ts[,"TargetValue"], main = "Time Series of Confirmed Cases") + 
  xlab("Days") + ylab("Confirmed Cases")
b = autoplot(fatal_ts[,"TargetValue"], main = "Time Series of Deaths") +
  xlab("Days") + ylab("Fatalities")
grid.arrange(a, b, nrow = 2)

# summaries of time series
summary(confirm_ts[,"TargetValue"])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     194    4926  116928   82349  142319  197934

summary(fatal_ts[,"TargetValue"])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0   206.5  4451.0  4666.2  8238.0 17355.0

As discussed in Doornik, Castle, and Hendry’s work and in the plots above, we can see a slight seasonality in both the fatalities and confirmed cases data. In the confirmed cases data, we can see a slight increase over the 140 days our data was collected. Whereas in the the fatalities data, we see an increase over the first half and a slow decrease on the latter half. Next, we look at the ACF and PACF graphs for each time series.

# Create Univariate Time Series
confirm_ts2 = ts(subset(confirm2, select = "TargetValue"), frequency = 2)
fatal_ts2 = ts(subset(fatal2, select = "TargetValue"), frequency = 2)

# Plots
ggtsdisplay(confirm_ts2)

ggtsdisplay(fatal_ts2)

For both time series, we can see both have slight auto-correlation based on the ACF plots. When looking at the PACF plots, we can see the data is non-stationary, this is consistent with the observed characteristics of each plot. This information will be important when we look at the ARIMA models.

Types of Models

In this analysis, we will look at a few models highlighted by our literature and discussed through our discussions. The first is a linear regression which will produce a line of best fit using the least squares estimate. Next, we will look at several types of exponential smoothing models, Holt’s linear trend method as example. Finally, we will walk through some ARIMA models to assess their performance.

When we look at the linear regression, we are assuming a linear relationship between the forecast variable (the number of confirmed cases or fatalities) and the predictor variable. Since this is a time series, time serves as our predictor variable.

As we walk through the exponential smoothing models, we are using weighted averages of observations, the weights decrease as the observation gets older. There are many different types of exponential smoothing models, each has its own algorithm to assign the weights of each observation.

Lastly, the ARIMA models show autocorrelation between observations. ARIMA models make it important to understand the data and how the data reacts to other parts of the data. This is why we discussed the ACF and PACF plots of each time series.

Model Building

As we listed above, we will first assess the performance of a linear regression on each time series.

# Creating Linear Regression Models
lm_confirm = tslm(TargetValue ~ Date, data = confirm_ts)
lm_fatal = tslm(TargetValue ~ Date, data = fatal_ts)

# Plot Linear Regression Models
a = autoplot(confirm_ts[,"TargetValue"], 
             main = "Confirmed Cases Linear Regression") + 
  autolayer(fitted(lm_confirm), series = "Fitted Linear Regression") +
  ylab("Confirmed Cases") + xlab("Days")
b = autoplot(fatal_ts[,"TargetValue"], 
             main = "Fatalities Linear Regression") + 
  autolayer(fitted(lm_fatal), series = "Fitted Linear Regression") +
  ylab("Fatalities") + xlab("Days")
grid.arrange(a, b, nrow = 2)

# Check Residuals of Linear Regressions
checkresiduals(lm_confirm)

## 
##  Breusch-Godfrey test for serial correlation of order up to 10
## 
## data:  Residuals from Linear regression model
## LM test = 80.519, df = 10, p-value = 3.972e-13

checkresiduals(lm_fatal)

## 
##  Breusch-Godfrey test for serial correlation of order up to 10
## 
## data:  Residuals from Linear regression model
## LM test = 104.43, df = 10, p-value < 2.2e-16

In the first two plots, we can see how the linear regression on the time series shows the increasing trend on both subsets of our data. When we look at the residual plots, we can see our residuals are skewed for both models which could affect the prediction intervals. Also, we see a lot more autocorrelation in the fatality data than we do in the confirmed cases. Next, we will look at the forecast results for the next 10 days.

newdata = data.frame(Date = c(141:150))
forecast(lm_confirm, newdata = newdata)

##     Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 141       185747.5 144575.7 226919.2 122527.8 248967.2
## 142       187214.1 146030.0 228398.3 123975.4 250452.9
## 143       188680.8 147484.0 229877.5 125422.7 251938.9
## 144       190147.4 148937.9 231356.9 126869.7 253425.1
## 145       191614.1 150391.6 232836.5 128316.5 254911.6
## 146       193080.7 151845.2 234316.3 129763.0 256398.4
## 147       194547.4 153298.5 235796.2 131209.3 257885.4
## 148       196014.0 154751.7 237276.3 132655.3 259372.7
## 149       197480.7 156204.8 238756.6 134101.0 260860.3
## 150       198947.3 157657.6 240237.0 135546.5 262348.1

forecast(lm_fatal, newdata = newdata)

##     Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 141       10017.59 5875.367 14159.81 3657.158 16378.02
## 142       10093.50 5950.023 14236.97 3731.144 16455.85
## 143       10169.40 6024.662 14314.14 3805.105 16533.70
## 144       10245.31 6099.284 14391.33 3879.038 16611.58
## 145       10321.21 6173.888 14468.54 3952.946 16689.48
## 146       10397.12 6248.475 14545.77 4026.827 16767.41
## 147       10473.03 6323.045 14623.01 4100.682 16845.37
## 148       10548.93 6397.598 14700.27 4174.510 16923.35
## 149       10624.84 6472.134 14777.54 4248.312 17001.36
## 150       10700.74 6546.653 14854.84 4322.088 17079.40

We are going to follow suit with our literature and perform Holt-Winters’ seasonal method. As stated before, our literature uses the additive method, but with the characteristics of our data we will also perform the multiplicative method.

# Build Models
hw_confirm1 = hw(confirm_ts2, seasonal = "additive")
hw_confirm2 = hw(confirm_ts2, seasonal = "multiplicative")
hw_fatal1 = hw(fatal_ts2, seasonal = "additive")
hw_fatal2 = hw(fatal_ts2, seasonal = "multiplicative")

# Plot 
a = autoplot(confirm_ts2, 
             main = "Confirmed Cases Holt-Winters'") + 
  autolayer(fitted(hw_confirm1), series = "Fitted Additive") +
  autolayer(fitted(hw_confirm2), series = "Fitted Multiplicative") +
  ylab("Confirmed Cases") + xlab("Days")
b = autoplot(fatal_ts2, 
             main = "Fatalities Holt-Winters'") + 
  autolayer(fitted(hw_fatal1), series = "Fitted Additive") +
  autolayer(fitted(hw_fatal2), series = "Fitted Multiplicative") +
  ylab("Fatalities") + xlab("Days")
grid.arrange(a, b, nrow = 2)

# Accuracy Parameters of the 4 Models
accuracy(hw_confirm1)

##                    ME     RMSE      MAE      MPE     MAPE    MASE       ACF1
## Training set 927.4969 22925.19 12488.46 -102.806 121.5237 0.72955 0.06542865

accuracy(hw_confirm2)

##                    ME     RMSE      MAE       MPE     MAPE      MASE       ACF1
## Training set 612.0846 22596.26 12470.21 -104.7138 123.8985 0.7284835 0.01907746

accuracy(hw_fatal1)

##                    ME     RMSE      MAE       MPE     MAPE     MASE       ACF1
## Training set 31.52125 1824.612 1083.984 -211.0357 249.9751 0.657035 0.03837122

accuracy(hw_fatal2)

##                     ME     RMSE      MAE       MPE     MAPE      MASE      ACF1
## Training set -18.88527 2125.411 1341.387 -195.8879 223.0567 0.8130548 0.3926904

From the looks of our accuracy parameters, none of these models are really that great with high values for basically every statistic. We will attempt to dampen these models in the next step.

#Build Damped Models
d_confirm1 = hw(confirm_ts2, seasonal = "additive", damped = T)
d_confirm2 = hw(confirm_ts2, seasonal = "multiplicative", damped = T)
d_fatal1 = hw(fatal_ts2, seasonal = "additive", damped = T)
d_fatal2 = hw(fatal_ts2, seasonal = "multiplicative", damped = T)

# Plot
a = autoplot(confirm_ts2, 
             main = "Confirmed Cases Damped Holt-Winters'") + 
  autolayer(fitted(d_confirm1), series = "Fitted Additive") +
  autolayer(fitted(d_confirm2), series = "Fitted Multiplicative") +
  ylab("Confirmed Cases") + xlab("Days")
b = autoplot(fatal_ts2, 
             main = "Fatalities Damped Holt-Winters'") + 
  autolayer(fitted(d_fatal1), series = "Fitted Additive") +
  autolayer(fitted(d_fatal2), series = "Fitted Multiplicative") +
  ylab("Fatalities") + xlab("Days")
grid.arrange(a, b, nrow = 2)

# Accuracy Parameters of the 4 Models
accuracy(d_confirm1)

##                    ME     RMSE      MAE       MPE     MAPE      MASE       ACF1
## Training set 1420.602 22953.32 12534.18 -97.59156 118.3253 0.7322207 0.06512467

accuracy(d_confirm2)

##                    ME     RMSE      MAE       MPE     MAPE      MASE       ACF1
## Training set 1175.533 22603.35 12529.59 -97.74458 119.6468 0.7319527 0.01115771

accuracy(d_fatal1)

##                    ME     RMSE      MAE       MPE     MAPE      MASE      ACF1
## Training set 31.80999 1831.864 1085.039 -194.2802 233.4022 0.6576744 0.0372698

accuracy(d_fatal2)

##                    ME   RMSE      MAE       MPE     MAPE      MASE     ACF1
## Training set 15.78152 2124.1 1339.434 -180.8904 209.8006 0.8118706 0.391933

We can see some improvement when we perform a damped Holt-Winters’ seasonal method.

The next model we will test is the ARIMA model. If we use the auto.arima() function in R, we will produce the optimal ARIMA model for the time series.

# Build Models
a_confirm = auto.arima(confirm_ts2)
a_fatal = auto.arima(fatal_ts2)

# Check Residuals
checkresiduals(a_confirm)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)(1,0,1)[2] with drift
## Q* = 1.3285, df = 3, p-value = 0.7224
## 
## Model df: 4.   Total lags used: 7

checkresiduals(a_fatal)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,0)(0,0,1)[2]
## Q* = 8.3476, df = 3, p-value = 0.03935
## 
## Model df: 1.   Total lags used: 4

# Forecasts
fc_confirm = forecast(a_confirm, h = 10)
fc_fatal = forecast(a_fatal, h = 10)

#Plots
a = autoplot(fc_confirm, main = "ARIMA(0,1,1)(1,0,1) Confirmed Cases") +
  xlab("Days") + ylab("Confirmed Cases")
b = autoplot(fc_fatal, main = "ARIMA(0,1,0)(0,0,1) Fatalities") +
  xlab("Days") + ylab("Fatalities")
grid.arrange(a, b, nrow = 2)

# Accuracy Statistics
accuracy(a_confirm)

##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -430.9472 20848.36 12974.37 -151.0193 163.9024 0.7579355
##                     ACF1
## Training set 0.001282694

accuracy(a_fatal)

##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 63.32939 1759.477 1073.451 -158.0723 191.4443 0.6506506
##                     ACF1
## Training set -0.02764222

These ARIMA models are performing much better, the residuals of both models are normally distributed. Consistantly with what we have seen in earlier models, the fatalities model has more autocorrelation than the confirmed cases model. We will compare these results to an ETS() model for each time series.

# Build Models
ets_confirm = ets(confirm_ts2)
ets_fatal = ets(fatal_ts2)

# Check Residuals
checkresiduals(ets_confirm)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,N)
## Q* = 10.217, df = 3, p-value = 0.01681
## 
## Model df: 2.   Total lags used: 5

checkresiduals(ets_fatal)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,A,N)
## Q* = 48.151, df = 3, p-value = 1.977e-10
## 
## Model df: 4.   Total lags used: 7

# Forecasts
fc_confirm = forecast(ets_confirm, h = 10)
fc_fatal = forecast(ets_fatal, h = 10)

# Plots
a = autoplot(fc_confirm, main = "ETS Confirmed Cases") +
  xlab("Days") + ylab("Confirmed Cases")
b = autoplot(fc_fatal, main = "ETS Fatalities") +
  xlab("Days") + ylab("Fatalities")
grid.arrange(a, b, nrow = 2)

# Accuracy Statistics
accuracy(ets_confirm)

##                    ME     RMSE      MAE       MPE     MAPE     MASE       ACF1
## Training set 1534.797 22958.47 12551.06 -100.1059 124.2105 0.733207 0.06001178

accuracy(ets_fatal)

##                     ME     RMSE     MAE       MPE     MAPE      MASE      ACF1
## Training set -31.95403 2429.768 1573.61 -211.2078 243.6681 0.9538121 0.6699932

We can see R created an ETS(A,N,N) model for the confirmed cases and ETS(M,A,N) for the fatalities model. For the confirmed cases ETS model, the residuals seem to keep their normality whereas the residuals for the fatalities model does not. The ETS model of the confirmed cases data shows more autocorrelation than the ARIMA. The ACF plot of the fatalities ETS model shows more stationary data, which may cause the skewness of the residual histogram. To test the error of both the ARIMA and ETS models, we will produce test errors for each and see how they compare.

# Create Functions to Run tsCV() function
fets = function(x,h){
  forecast(ets(x), h = h)
}
farima = function(x, h){
  forecast(auto.arima(x), h = h)
}

# Compute CV errors for each 
e_confirm_ets = tsCV(confirm_ts2, fets, h = 10)
e_confirm_arima = tsCV(confirm_ts2, farima, h = 10)
e_fatal_ets = tsCV(fatal_ts2, fets, h = 10)
e_fatal_arima = tsCV(fatal_ts2, farima, h = 10)

# Find MSE of each model
mean(e_confirm_ets^2, na.rm = T)

## [1] 1435921713

mean(e_confirm_arima^2, na.rm = T)

## [1] 1655047634

mean(e_fatal_ets^2, na.rm = T)

## [1] 16246586

mean(e_fatal_arima^2, na.rm = T)

## [1] 10965513

From computing the MSE of the CV errors, we can see the ETS model is preferred for the Confirmed Cases time series, and the ARIMA model is better for the fatalities time series.

Limitations

There are a few limitations to make note of that was noticed through building the models. First, we only have 140 observations once we combine the data into daily world-wide. This can cause some issues because it is harder to find seasonality in the data since it does not spread throughout a full year. Also, every country had (or has) a different strategy on how to regulate the virus. For example, the United States went through lock down regulations, whereas other countries like Sweden had little to no regulations. Inconsistent regulations could cause for more irregularities in the data.

Future Work

A project that would be interesting from this data is how each individual country was affected from the virus. In the previous section, we discussed how countries like the United States and Sweden had different strategies. Also, there are countries with different socio-economic statuses, how did mitigation strategies of first-world countries compare to that of third-world countries? Further researchers can also look into comparing first-world countries with out first-world countries. Similarly, with third world countries. This will highlight what strategies worked early in the pandemic and when we get more recent data, we will really see what mitigation techniques were effective.

What Was Learned

We can see the two time series we examine throughout the analysis react completely different in each forecasting technique. For example, we see the exponential smoothing model works better than the ARIMA for the Confirmed Cases and the ARIMA performed better than the ETS model for the Fatalities time series. This may have something to do with all the dependent variables not seen in our data (individual patient’s medical history, hospital’s ability, etc.). We can also see there is no linear relationship between time and number of daily world-wide cases. We also see that Doornik, Jurgen, and Castle’s use of the additive Holt-Winters’ method could have been improved by using the damped and multiplicative method. All this being said, none of the models shown above performs at such a high level that it would successfully forecast the daily world-wide confirmed cases nor fatalities. This is just another topic we need to improve our understanding of the virus to get back to normal life.

Work Cited

Naumov, A.V et al. “Baseline Accuracies of Forecasting COVID-19 Cases in Russian Regions on a Year in Retrospect Using Basic Statistical and Machine Learning Methods.” Procedia computer science 193 (2021): 276–284. Web.

Doornik, Jurgen A, Jennifer L Castle, and David F Hendry. “Modeling and Forecasting the COVID‐19 Pandemic Time‐series Data.” Social science quarterly 102.5 (2021): 2070–2087. Web.