# Load the tidy forecasting stack (tsibble, feasts, fable) and the fpp2 data/forecast packages
library(fpp3)
library(fpp2)

Predictive Analytics: Predicting Covid Cases and Fatalities

Problem

The challenge is to predict the number of COVID-19 cases and fatalities around the world. The Kaggle contest uses data from the beginning of the pandemic (March-May) with the goal of predicting cases for May and June. The key difficulty is using a limited amount of data with a changing trend to predict into the future. This model could be used (or could have been used, had it been built earlier) to predict near-term cases, which could guide public policy.

The problem will continue to change. The goal here is to provide another potential model, or avenue, for predicting the spread of Covid, which could later be applied to broader data for more current predictions.

The contest can be found here: https://www.kaggle.com/c/covid19-global-forecasting-week-5/data?select=submission.csv

Significance

This investigation is based on the following key questions:

  1. Based on past data, what are the expected cases and deaths from Covid-19?
  2. What predictive models are best suited to modeling Covid-19 cases and fatalities?

Since this research is being done retroactively, it primarily serves to show how well time-series models can predict Covid cases, which could be useful for modeling future pandemics or building future models of Covid-19. It is important to note that much has changed in terms of variants, public policy, and human behavior since the start of the pandemic. Thus, this research does not seek to assert the accuracy of particular values or predictions, but instead to use the best-performing models as a reference for future research.

Data Cleaning and Time Series Exploration

Below, I convert the data into a time series: the Kaggle training data are split into confirmed cases and fatalities, each aggregated to worldwide daily totals and stored as a tsibble indexed by date.

# Read the Kaggle training data and separate the two targets
df <- read.csv("/Users/Luke/Documents/BC/Predictive Analytics/Midterm/covid19-global-forecasting-week-5/train.csv")
confirm <- df[which(df$Target == "ConfirmedCases"), ]
fatal   <- df[which(df$Target == "Fatalities"), ]

# Aggregate to worldwide daily totals
confirm_2 <- aggregate(TargetValue ~ Date, data = confirm, FUN = sum)
confirm_2$Date <- as.Date(confirm_2$Date)
fatal_2 <- aggregate(TargetValue ~ Date, data = fatal, FUN = sum)
fatal_2$Date <- as.Date(fatal_2$Date)

# Convert to tsibbles with a daily index
c_ts <- tsibble(confirm_2, index = Date)
f_ts <- tsibble(fatal_2, index = Date)
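
As a quick sanity check before modelling (an added step, not part of the original cleaning), we can confirm that the daily indexes have no gaps; has_gaps() comes from the tsibble package already loaded above.

# Added check: confirm the daily indexes contain no missing dates
c_ts %>% has_gaps()
f_ts %>% has_gaps()
# If either reported TRUE, fill_gaps() could insert the missing dates before modelling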

Description of Data

Cases and deaths show similar patterns, with upticks in deaths lagging slightly behind cases; logically, this occurs because the virus takes some time to cause deaths. Neither cases nor fatalities spike until March, and both then follow an up-and-down pattern that stays relatively level through the spring. This presents challenges for trend-based or linear models, which will assume an upward trend, and a general shift may have occurred with the onset of summer as some places experienced outbreaks while others saw reduced cases.

autoplot(c_ts)+
  labs( y= "Cases", title= "Worldwide Covid Cases Feb-June 2020")
## Plot variable not specified, automatically selected `.vars = TargetValue`

autoplot(f_ts)+
    labs(y = "Deaths", title= "Worldwide Covid Deaths Feb-June2020")
## Plot variable not specified, automatically selected `.vars = TargetValue`
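
To put a rough number on the lag between cases and deaths described above, one option (an added check, not in the original analysis) is the cross-correlation function from feasts. The mutate() step assumes the two aggregated series cover exactly the same dates, as they do here; the 28-day lag window is an arbitrary choice.

# Added check: cross-correlation between daily cases and deaths
cd_ts <- c_ts %>%
  mutate(Deaths = f_ts$TargetValue)   # assumes identical dates in both tsibbles
cd_ts %>%
  CCF(TargetValue, Deaths, lag_max = 28) %>%
  autoplot()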

Additionally, we can examine the ACF and PACF plots for cases and fatalities. From the cases plots we can conclude that the data are nonstationary, since the ACF tails off slowly while the PACF has a spike at lag 1, which suggests both seasonal and nonseasonal components in the series. The fatalities ACF shows a general decay, but with an up-and-down pattern that suggests a low level of seasonality.

c_ts %>% ACF(TargetValue) %>% autoplot()

c_ts %>% PACF(TargetValue) %>% autoplot()

f_ts %>% ACF(TargetValue) %>% autoplot()

f_ts %>% PACF(TargetValue) %>% autoplot()
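
The nonstationarity read off these plots can also be checked more formally. The sketch below (an added step, not in the original analysis) applies the KPSS unit-root test and the ndiffs/nsdiffs features from feasts to the cases series to estimate how many ordinary and seasonal differences it appears to need.

# Added check: formal stationarity tests for the cases series
c_ts %>% features(TargetValue, unitroot_kpss)    # small p-value suggests nonstationarity
c_ts %>% features(TargetValue, unitroot_ndiffs)  # suggested number of ordinary differences
c_ts %>% features(TargetValue, unitroot_nsdiffs) # suggested number of seasonal differences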

Below, I split the data into training and test sets, with slightly more than 80% of the series in the training set. The slightly larger training set was used to accommodate the change in trend across the span, with the hope that the models may better capture some of the oscillation that occurs after April 1.

c_train <- filter(c_ts, between(c_ts$Date, as.Date("2020-01-23"), as.Date("2020-05-01")))
c_test <- filter (c_ts,  between(c_ts$Date, as.Date("2020-05-01"), as.Date("2020-06-10")))
f_train<- filter(f_ts, between(f_ts$Date, as.Date("2020-01-23"), as.Date("2020-05-01")))
f_test <- filter (f_ts,  between(f_ts$Date, as.Date("2020-05-01"), as.Date("2020-06-10")))

Types of Models

The models used below are a linear model, a linear model with a Fourier component, an exponential smoothing (ETS) model, and an autoregressive integrated moving average (ARIMA) model.

The linear model was selected primarily as a benchmark. Since we are not provided with a breadth of data about policies, demographics, and other regional factors, a linear model will likely perform poorly, but it is worth examining to show whether the other models add utility. The linear model can also be adjusted to include Fourier terms, which could be useful given the up-and-down pattern of the data after its uptick in April.

Additionally, I will use an ETS model, which smooths the level, trend, and seasonal components with exponentially weighted averages and often performs better than a linear model. Finally, I will use an ARIMA model, which, in short, regresses the series on its own lagged values and lagged forecast errors after differencing, giving it important differences from ETS.

Literature Review

Much of the literature on this topic has appeared since June 2020 and is therefore based on more complete data than this investigation uses. Thus, many of the models reviewed here may predict results more accurately, but they also help shape the research I engage in here.

It is important to note that many of the existing models acknowledge that changes in public policy can have a dramatic impact on results. Zhao et al. proposed a method for short-term predictions that models incident cases with a Poisson distribution, uses a gamma distribution for the time interval between cases, and draws future cases from posterior distributions (Zhao et al., 2021). Their model, given a fixed set of policy and variant conditions, predicted well over short intervals, which is useful at the regional level for policymaking; this study used data similar to what will be applied here. Relatedly, Wattanachit et al. found that ensemble forecasts can increase the robustness of existing forecasts (Ray et al., 2021).

Another strand of the research uses a broader set of predictors: Barcellos, Fernandes, and de Souza predict morbidity and death using linear-type models (Barcellos et al., 2021). Other research has used machine learning algorithms such as XGBoost and random forests and found that the random forest model predicts Covid deaths very well (Moulaei et al., 2022).

Research more directly in line with this investigation compared time series models (ARIMA, ETS, NNAR, TBATS) on the second wave of Covid hospitalizations in Italy. The results find that, for hospitalizations, hybrid models functioned as the strongest predictors; but, relevant to my investigation here, ARIMA models performed well and were present in all of the best hybrid models (Perone, 2021).

Model Building and Validation

For each of the models below, I will use AIC, AICc, and BIC to assess model fit on the training data, and RMSE, MAE, and MAPE on the held-out test set to evaluate forecast accuracy. The ideal model will score low on both sets of metrics.

I will begin by going through each model for ‘cases’ then do the models in bulk for ‘fatalities.’
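
To make the comparisons below easier to reproduce, the four candidate models for cases can also be fit in a single call and their metrics pulled out together. This is only a sketch of that workflow, not output from the original analysis; the model labels (linear, fourier, ets, arima) are names I chose.

# Sketch: fit the four candidate models for cases in one mable, then compare
# training-set information criteria and test-set forecast errors
c_fits <- c_train %>%
  model(
    linear  = TSLM(TargetValue ~ trend()),
    fourier = TSLM(TargetValue ~ trend() + fourier(K = 1)),
    ets     = ETS(TargetValue),
    arima   = ARIMA(TargetValue)
  )
glance(c_fits) %>% select(.model, AIC, AICc, BIC)
c_fits %>%
  forecast(h = 40) %>%
  accuracy(c_test) %>%
  select(.model, RMSE, MAE, MAPE)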

Linear Model

Below is the linear model. We can see on inspection that a simple linear model with only a trend term heavily overpredicts future outcomes. When we include the Fourier terms, the predictions improve somewhat, but the model still relies too heavily on trend and misses the mark, with high RMSE, MAE, and MAPE on the test set.

When we instead model deaths with cases as a predictor (I am not sure whether this was allowed or intended by the contest), we achieve a much lower RMSE of approximately 4,000, the lowest of the linear models considered here; MAE and MAPE are also comparatively low. Allowing the exogenous variable of cases into the model gives a much higher level of accuracy than the models that exclude outside variables. The challenge with this approach is that deaths lag cases by only two to three weeks, so a practical model would not have this wide six-week window of known case values for prediction.
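
One way to respect that timing constraint would be to regress deaths on cases observed roughly two weeks earlier, so that forecasts only rely on case counts that would actually be known in advance. The sketch below is hypothetical (the 14-day lag and the lag_cases column are illustrative choices, not part of the original model) and simply shows how such a variant could be fit.

# Hypothetical variant: deaths regressed on cases observed 14 days earlier
lagged_train <- c_train %>%
  mutate(Deaths    = f_train$TargetValue,
         lag_cases = lag(TargetValue, 14)) %>%
  filter(!is.na(lag_cases))            # drop the first 14 days, which have no lagged value
De_lag <- lagged_train %>%
  model(TSLM(Deaths ~ lag_cases + trend()))
report(De_lag)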

lin_cases<- c_train %>%
  model(TSLM(TargetValue ~ trend()))%>%
  report()
## Series: TargetValue 
## Model: TSLM 
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49986 -26173   5171  25326  52752 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) -39642.4     5945.1  -6.668        0.00000000154 ***
## trend()       1911.6      102.2  18.704 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29500 on 98 degrees of freedom
## Multiple R-squared: 0.7812,  Adjusted R-squared: 0.7789
## F-statistic: 349.8 on 1 and 98 DF, p-value: < 0.000000000000000222
c_train$Deaths<-f_train$TargetValue

De<-c_train %>%
  model(TSLM(Deaths ~ TargetValue + trend()))%>%
  report()
## Series: Deaths 
## Model: TSLM 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3244.9  -431.6   165.0   353.9  6553.1 
## 
## Coefficients:
##               Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) -506.28392  373.98852  -1.354               0.179    
## TargetValue    0.07032    0.00527  13.342 <0.0000000000000002 ***
## trend()        5.27297   11.39939   0.463               0.645    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1539 on 97 degrees of freedom
## Multiple R-squared: 0.8991,  Adjusted R-squared: 0.897
## F-statistic: 432.1 on 2 and 97 DF, p-value: < 0.000000000000000222
de_predict <- forecast(De, new_data = c_test)
de_predict
## # A fable: 41 x 5 [1D]
## # Key:     .model [1]
##    .model                        Date                  Deaths  .mean TargetValue
##    <chr>                         <date>                <dist>  <dbl>       <int>
##  1 TSLM(Deaths ~ TargetValue + … 2020-05-01 N(11113, 2463898) 11113.      157734
##  2 TSLM(Deaths ~ TargetValue + … 2020-05-02  N(9952, 2469857)  9952.      141156
##  3 TSLM(Deaths ~ TargetValue + … 2020-05-03  N(9311, 2483757)  9311.      131963
##  4 TSLM(Deaths ~ TargetValue + … 2020-05-04  N(8676, 2504388)  8676.      122862
##  5 TSLM(Deaths ~ TargetValue + … 2020-05-05  N(9166, 2498585)  9166.      129748
##  6 TSLM(Deaths ~ TargetValue + … 2020-05-06 N(10041, 2487607) 10041.      142112
##  7 TSLM(Deaths ~ TargetValue + … 2020-05-07 N(10509, 2486416) 10509.      148698
##  8 TSLM(Deaths ~ TargetValue + … 2020-05-08 N(10453, 2492022) 10453.      147831
##  9 TSLM(Deaths ~ TargetValue + … 2020-05-09  N(9743, 2510780)  9743.      137659
## 10 TSLM(Deaths ~ TargetValue + … 2020-05-10  N(8344, 2562847)  8344.      117680
## # … with 31 more rows
augment(De) %>%
  ggplot(aes(x = Date)) +
  geom_line(aes(y = Deaths, colour = "Data")) +
  geom_line(aes(y = .fitted, colour = "Fitted")) +
  scale_colour_manual(
    values = c(Data = "black", Fitted = "#D55E00")
  ) +
  labs(y = "Cases",
       title = "Linear Model with Cases as Predictor") +
  guides(colour = guide_legend(title = "Series"))

new_f_test<-f_test
new_f_test$Deaths<-new_f_test$TargetValue
c_train%>%
  autoplot(Deaths) +
  autolayer(de_predict)

library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
acc_de=accuracy(de_predict, new_f_test)
acc_de%>%kbl(caption="Cases predicting Death")%>%kable_classic(html_font="Cambria")
Cases predicting Death
.model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
TSLM(Deaths ~ TargetValue + trend()) Test -3421.232 4039.478 3542.627 -62.34604 63.47889 NaN NaN 0.7743951

augment(lin_cases) %>%
  ggplot(aes(x = Date)) +
  geom_line(aes(y = TargetValue, colour = "Data")) +
  geom_line(aes(y = .fitted, colour = "Fitted")) +
  scale_colour_manual(
    values = c(Data = "black", Fitted = "#D55E00")
  ) +
  labs(y = "Cases",
       title = "Linear Model with Trend") +
  guides(colour = guide_legend(title = "Series"))

lin_cases2<- c_train %>%
  model(TSLM(TargetValue ~ trend()+fourier(K = 1)))%>%
  report()
## Series: TargetValue 
## Model: TSLM 
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -54857 -23759   3230  22864  45961 
## 
## Coefficients:
##                    Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)        -39852.1     5924.2  -6.727        0.00000000125 ***
## trend()              1913.4      101.9  18.786 < 0.0000000000000002 ***
## fourier(K = 1)C1_7   4990.2     4141.1   1.205                0.231    
## fourier(K = 1)S1_7   4792.0     4174.2   1.148                0.254    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29380 on 96 degrees of freedom
## Multiple R-squared: 0.7874,  Adjusted R-squared: 0.7807
## F-statistic: 118.5 on 3 and 96 DF, p-value: < 0.000000000000000222
augment(lin_cases2) %>%
  ggplot(aes(x = Date)) +
  geom_line(aes(y = TargetValue, colour = "Data")) +
  geom_line(aes(y = .fitted, colour = "Fitted")) +
  scale_colour_manual(
    values = c(Data = "black", Fitted = "#D55E00")
  ) +
  labs(y = "Cases",
       title = "Linear Model with Fourier") +
  guides(colour = guide_legend(title = "Series"))

harmfc_predict <- forecast(lin_cases2, h=40)
c_train%>%
  autoplot(TargetValue) +
  autolayer(harmfc_predict)

library(kableExtra)
acc_harmfc=accuracy(harmfc_predict, c_test)
acc_harmfc%>%kbl(caption="Linear Model with Fourier")%>%kable_classic(html_font="Cambria")
Linear Model with Fourier
.model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
TSLM(TargetValue ~ trend() + fourier(K = 1)) Test -44439.62 47040.24 44439.62 -30.95849 30.95849 NaN NaN 0.6472627

Ultimately, the linear model that includes cases as a predictor outperformed the other linear models on measures of fit. Its forecasts, however, carry a consistently negative mean error, meaning it systematically overpredicts deaths in the test window, so it would be stronger if it included other predictors.

ETS Model

For the ETS model, I used the ETS() function to identify the optimal components. In this case it chose an ETS(A,A,A) model, implying additive error, trend, and seasonality. The resulting measures of fit and accuracy show that it improves substantially on the linear models in-sample, and it beats the Fourier model on test-set MAE and MAPE, although its test RMSE is similar.

#chose AAA model
c_ets <- c_train %>%
  model(ETS(TargetValue))
report(c_ets)
## Series: TargetValue 
## Model: ETS(A,A,A) 
##   Smoothing parameters:
##     alpha = 0.1498886 
##     beta  = 0.127196 
##     gamma = 0.3731849 
## 
##   Initial states:
##       l[0]     b[0]      s[0]     s[-1]     s[-2]     s[-3]    s[-4]    s[-5]
##  -4377.998 1098.091 -1924.491 -3910.677 -6475.073 -1537.856 1833.689 6317.191
##     s[-6]
##  5697.216
## 
##   sigma^2:  78724309
## 
##      AIC     AICc      BIC 
## 2291.010 2294.596 2322.272
c_ets_predict <- forecast(c_ets, h=40)

c_train %>%
  autoplot(TargetValue) +
  autolayer(c_ets_predict)

library(kableExtra)
acc_ets=accuracy(c_ets_predict, c_test)
acc_ets%>%kbl(caption="ETS")%>%kable_classic(html_font="Cambria")
ETS
.model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
ETS(TargetValue) Test 40478.36 48771.27 40478.36 26.6126 26.6126 NaN NaN 0.869517

ARIMA Model

Similar to the model above, I allowed the ARIMA() function to select the specification automatically. The ARIMA model performs much better than the ETS model out of sample, with far lower RMSE, MAE, and MAPE. Its information criteria (AIC, AICc, and BIC) are also lower, although information criteria are not strictly comparable across the ETS and ARIMA model classes.

c_arima <- c_train %>%
  model(ARIMA(TargetValue))
report(c_arima)
## Series: TargetValue 
## Model: ARIMA(2,0,2)(0,1,2)[7] 
## 
## Coefficients:
##          ar1     ar2     ma1      ma2     sma1     sma2
##       0.0208  0.9496  0.5747  -0.2688  -0.3161  -0.3313
## s.e.  0.0469  0.0473  0.1191   0.1091   0.1329   0.1276
## 
## sigma^2 estimated as 77334443:  log likelihood=-975.94
## AIC=1965.89   AICc=1967.21   BIC=1983.62
c_arima_predict <- forecast(c_arima, h=40)

c_train %>%
  autoplot(TargetValue) +
  autolayer(c_arima_predict)

acc_arima=accuracy(c_arima_predict, c_test)
acc_arima%>%kbl(caption="ARIMA")%>%kable_classic(html_font="Cambria")
ARIMA
.model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
ARIMA(TargetValue) Test -3281.024 10773.56 8334.135 -2.911194 5.845196 NaN NaN 0.4839195
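
Before trusting the ARIMA forecasts, it is worth checking that the residuals resemble white noise. The sketch below (an added diagnostic, not in the original write-up) plots the innovation residuals and runs a Ljung-Box test; dof = 6 reflects the six estimated coefficients of the selected ARIMA(2,0,2)(0,1,2)[7] model.

# Added diagnostic: residual checks for the automatically selected ARIMA model
c_arima %>% gg_tsresiduals()
augment(c_arima) %>%
  features(.innov, ljung_box, lag = 14, dof = 6)  # a large p-value is consistent with white-noise residuals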

Below are the equivalent models and outputs for ETS and ARIMA applied to fatalities. We once again see that the predictive qualities of the ARIMA model are superior: its test-set RMSE, MAE, and MAPE are roughly half those of the ETS model, even though its information criteria are slightly higher (and, as noted above, not strictly comparable across model classes). It is therefore the stronger model for fatalities as well.

f_ets <- f_train %>%
  model(ETS(TargetValue))
report(f_ets)
## Series: TargetValue 
## Model: ETS(M,A,N) 
##   Smoothing parameters:
##     alpha = 0.3327147 
##     beta  = 0.002484359 
## 
##   Initial states:
##       l[0]     b[0]
##  -12.89152 14.59331
## 
##   sigma^2:  0.1899
## 
##      AIC     AICc      BIC 
## 1643.177 1643.815 1656.203
f_ets_predict <- forecast(f_ets, h=40)

f_train %>%
  autoplot(TargetValue) +
  autolayer(f_ets_predict)

acc_ets_f=accuracy(f_ets_predict, f_test)
acc_ets_f%>%kbl(caption="ETS Fatalities")%>%kable_classic(html_font="Cambria")
ETS Fatalities
.model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
ETS(TargetValue) Test -4467.807 5033.8 4516.64 -82.32491 82.7544 NaN NaN 0.7271761
f_arima <- f_train %>%
  model(ARIMA(TargetValue))
report(f_arima)
## Series: TargetValue 
## Model: ARIMA(1,1,1)(1,0,0)[7] 
## 
## Coefficients:
##          ar1      ma1    sar1
##       0.3749  -0.7218  0.4519
## s.e.  0.1658   0.1056  0.1068
## 
## sigma^2 estimated as 1461947:  log likelihood=-842.5
## AIC=1693   AICc=1693.42   BIC=1703.38
f_arima_predict <- forecast(f_arima, h=40)

f_train %>%
  autoplot(TargetValue) +
  autolayer(f_arima_predict)

acc_arima_f=accuracy(f_arima_predict, f_test)
acc_arima_f%>%kbl(caption="ARIMA Fatalities")%>%kable_classic(html_font="Cambria")
ARIMA Fatalities
.model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
ARIMA(TargetValue) Test -1960.634 2554.761 2091.078 -40.05277 41.27876 NaN NaN 0.6184166

Limitations

The primary limitations of these models relate to incomplete information. I concluded that the ARIMA model predicts future cases most accurately, and that the linear model using cases as an input predicts deaths most accurately. The problem with the deaths model that depends on cases is that its prediction horizon is not realistic: in practice deaths lag cases by only two to three weeks, so we would not have six weeks of known case values to plug in, and for any utility beyond that window we would need predictions of cases as well. By combining the ARIMA predictions for cases with the cases-driven model for deaths, we could obtain a model that may be reliable for cases and thereby for deaths, which could be used to anticipate needs in testing, fatalities, and hospital beds.
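
A sketch of that two-stage idea follows. It is illustrative rather than part of the fitted results: it feeds the ARIMA point forecasts for cases into the deaths regression as new data, the 21-day horizon is an arbitrary choice, and it ignores the uncertainty in the case forecasts, which a production model would need to propagate.

# Sketch of a two-stage forecast: ARIMA case forecasts feed the deaths regression
case_fc <- forecast(c_arima, h = 21)
future_cases <- case_fc %>%
  as_tibble() %>%
  mutate(TargetValue = .mean) %>%       # use the point forecasts of cases as the regressor
  select(Date, TargetValue) %>%
  as_tsibble(index = Date)
death_fc <- forecast(De, new_data = future_cases)
death_fc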

Additionally, these models do not account for public policy measures, behavior changes, or virus mutations. A model with wider information would do a better job of predicting, and the cases-driven regression for deaths could easily be expanded to use more information. Finally, this work does not use hybrid models (such as combined ETS and ARIMA), so its prediction accuracy likely suffers compared to other available models.

Future work

In the future, the following could improve the models and their predictive accuracy:

  1. Use data that spans a wider period, especially once public health measures are in place.
  2. Include a wider set of predictor variables, such as public policy, available medications, and vaccination rates.
  3. Use the most advanced models for case predictions, such as hybrid models.

Ultimately, such a model could be used for predicting trends, especially if it were easy for local scientists to adapt it to the specific measures in their community and use it as a predictive modeling tool.

Learnings

The key learning here is that cases are a strong predictor of deaths, and that ARIMA models can predict cases well. Additionally, the data present a unique challenge given their flat-then-spiking shape. Either using autoregressive models or bringing cases into the deaths regression produces a better predictive model.

Ultimately, the full data are now available, and we could evaluate these models over a much longer period. They would likely perform poorly because of omitted-variable bias: public policy, variants, vaccines, and behavioral changes all matter across a longer timescale. These factors are essential to a strong model but were not available in this time period.

Works Cited

Barcellos, D. d. S., Fernandes, G. M. K., & de Souza, F. T. (2021). Data based model for predicting COVID-19 morbidity and mortality in metropolis. Scientific Reports, 11, 24491. https://doi.org/10.1038/s41598-021-04029-6

Moulaei, K., Shanbehzadeh, M., Mohammadi-Taghiabad, Z., et al. (2022). Comparing machine learning algorithms for predicting COVID-19 mortality. BMC Medical Informatics and Decision Making, 22, 2. https://doi.org/10.1186/s12911-021-01742-0

Ray, E. L., Wattanachit, N., Niemi, J., Kanji, A. H., et al. (2021). Ensemble forecasts of Coronavirus Disease 2019 (COVID-19) in the U.S. medRxiv 2020.08.19.20177493. https://doi.org/10.1101/2020.08.19.20177493

Zhao, H., Mercent, N., McNulty, A., Radcliff, T., Cole, M., Fischer, R., Sang, H., & Ory, M. (2021). COVID-19: Short term prediction model using daily incidence data. PLOS ONE. https://doi.org/10.1371/journal.pone.0250110