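# Packages assumed from the functions used below; the original setup chunk is not shown
library(fpp3)      # tsibble, fable, feasts, dplyr
library(forecast)  # accuracy() for lm objects
train <- read.csv("//Users//kevinclifford//Downloads//train.csv", header=TRUE)
test <- read.csv("//Users//kevinclifford//Downloads//test.csv", header=TRUE)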
##Problem
This project examines global COVID-19 cases and looks for a way to forecast confirmed cases and fatalities. The issue is pressing: the pandemic is ongoing and unprecedented in both the number of cases and its global spread. Countries handled COVID-19 in different ways and with differing levels of effectiveness. The problem posed by this Kaggle competition is how to forecast an ongoing and unpredictable pandemic effectively. Looking back, it is also useful to examine our forecasting techniques to see whether they can be improved for future forecasting.
##Significance
In addition, several variants have become common. As we look toward forecasting these variants, it is useful to examine how well our models perform on daily COVID-19 data from 2020. Forecasting is a vital input to health policy. Underestimating COVID-19, both in the public eye and in forecasting models, could have serious consequences. It is therefore important that forecasting models are carefully built, examined, and evaluated, so that those who make health policy decisions are as well informed as possible.
##Data
The training data consists of 140 days of observations of global COVID-19 data. As shown below, the dates span from January 23rd to June 10th, 2020. The data covers 187 countries and regions, all of which are listed below. The data set’s main variable, TargetValue, is split by the Target column into two categories, confirmed cases and fatalities, each given a weight applied in proportion to population. Looking at the ACF and PACF plots of this data and of the differenced data, it is clear that the series is not white noise. Therefore, effective model specification and evaluation may be difficult.
summary(train)
##        Id            County          Province_State     Country_Region
##  Min.   :     1   Length:969640      Length:969640      Length:969640
##  1st Qu.:242411   Class :character   Class :character   Class :character
##  Median :484820   Mode  :character   Mode  :character   Mode  :character
##  Mean   :484820
##  3rd Qu.:727230
##  Max.   :969640
##    Population            Weight            Date              Target
##  Min.   :8.600e+01   Min.   :0.04749   Length:969640      Length:969640
##  1st Qu.:1.213e+04   1st Qu.:0.09684   Class :character   Class :character
##  Median :3.053e+04   Median :0.34941   Mode  :character   Mode  :character
##  Mean   :2.720e+06   Mean   :0.53087
##  3rd Qu.:1.056e+05   3rd Qu.:0.96838
##  Max.   :1.396e+09   Max.   :2.23919
##   TargetValue
##  Min.   :-10034.00
##  1st Qu.:     0.00
##  Median :     0.00
##  Mean   :    12.56
##  3rd Qu.:     0.00
##  Max.   : 36163.00
unique(train$Date)
## [1] "2020-01-23" "2020-01-24" "2020-01-25" "2020-01-26" "2020-01-27"
## [6] "2020-01-28" "2020-01-29" "2020-01-30" "2020-01-31" "2020-02-01"
## [11] "2020-02-02" "2020-02-03" "2020-02-04" "2020-02-05" "2020-02-06"
## [16] "2020-02-07" "2020-02-08" "2020-02-09" "2020-02-10" "2020-02-11"
## [21] "2020-02-12" "2020-02-13" "2020-02-14" "2020-02-15" "2020-02-16"
## [26] "2020-02-17" "2020-02-18" "2020-02-19" "2020-02-20" "2020-02-21"
## [31] "2020-02-22" "2020-02-23" "2020-02-24" "2020-02-25" "2020-02-26"
## [36] "2020-02-27" "2020-02-28" "2020-02-29" "2020-03-01" "2020-03-02"
## [41] "2020-03-03" "2020-03-04" "2020-03-05" "2020-03-06" "2020-03-07"
## [46] "2020-03-08" "2020-03-09" "2020-03-10" "2020-03-11" "2020-03-12"
## [51] "2020-03-13" "2020-03-14" "2020-03-15" "2020-03-16" "2020-03-17"
## [56] "2020-03-18" "2020-03-19" "2020-03-20" "2020-03-21" "2020-03-22"
## [61] "2020-03-23" "2020-03-24" "2020-03-25" "2020-03-26" "2020-03-27"
## [66] "2020-03-28" "2020-03-29" "2020-03-30" "2020-03-31" "2020-04-01"
## [71] "2020-04-02" "2020-04-03" "2020-04-04" "2020-04-05" "2020-04-06"
## [76] "2020-04-07" "2020-04-08" "2020-04-09" "2020-04-10" "2020-04-11"
## [81] "2020-04-12" "2020-04-13" "2020-04-14" "2020-04-15" "2020-04-16"
## [86] "2020-04-17" "2020-04-18" "2020-04-19" "2020-04-20" "2020-04-21"
## [91] "2020-04-22" "2020-04-23" "2020-04-24" "2020-04-25" "2020-04-26"
## [96] "2020-04-27" "2020-04-28" "2020-04-29" "2020-04-30" "2020-05-01"
## [101] "2020-05-02" "2020-05-03" "2020-05-04" "2020-05-05" "2020-05-06"
## [106] "2020-05-07" "2020-05-08" "2020-05-09" "2020-05-10" "2020-05-11"
## [111] "2020-05-12" "2020-05-13" "2020-05-14" "2020-05-15" "2020-05-16"
## [116] "2020-05-17" "2020-05-18" "2020-05-19" "2020-05-20" "2020-05-21"
## [121] "2020-05-22" "2020-05-23" "2020-05-24" "2020-05-25" "2020-05-26"
## [126] "2020-05-27" "2020-05-28" "2020-05-29" "2020-05-30" "2020-05-31"
## [131] "2020-06-01" "2020-06-02" "2020-06-03" "2020-06-04" "2020-06-05"
## [136] "2020-06-06" "2020-06-07" "2020-06-08" "2020-06-09" "2020-06-10"
unique(train$Country_Region)
## [1] "Afghanistan" "Albania"
## [3] "Algeria" "Andorra"
## [5] "Angola" "Antigua and Barbuda"
## [7] "Argentina" "Armenia"
## [9] "Australia" "Austria"
## [11] "Azerbaijan" "Bahamas"
## [13] "Bahrain" "Bangladesh"
## [15] "Barbados" "Belarus"
## [17] "Belgium" "Belize"
## [19] "Benin" "Bhutan"
## [21] "Bolivia" "Bosnia and Herzegovina"
## [23] "Botswana" "Brazil"
## [25] "Brunei" "Bulgaria"
## [27] "Burkina Faso" "Burma"
## [29] "Burundi" "Cabo Verde"
## [31] "Cambodia" "Cameroon"
## [33] "Canada" "Central African Republic"
## [35] "Chad" "Chile"
## [37] "China" "Colombia"
## [39] "Comoros" "Congo (Brazzaville)"
## [41] "Congo (Kinshasa)" "Costa Rica"
## [43] "Cote d'Ivoire" "Croatia"
## [45] "Cuba" "Cyprus"
## [47] "Czechia" "Denmark"
## [49] "Diamond Princess" "Djibouti"
## [51] "Dominica" "Dominican Republic"
## [53] "Ecuador" "Egypt"
## [55] "El Salvador" "Equatorial Guinea"
## [57] "Eritrea" "Estonia"
## [59] "Eswatini" "Ethiopia"
## [61] "Fiji" "Finland"
## [63] "France" "Gabon"
## [65] "Gambia" "Georgia"
## [67] "Germany" "Ghana"
## [69] "Greece" "Grenada"
## [71] "Guatemala" "Guinea"
## [73] "Guinea-Bissau" "Guyana"
## [75] "Haiti" "Holy See"
## [77] "Honduras" "Hungary"
## [79] "Iceland" "India"
## [81] "Indonesia" "Iran"
## [83] "Iraq" "Ireland"
## [85] "Israel" "Italy"
## [87] "Jamaica" "Japan"
## [89] "Jordan" "Kazakhstan"
## [91] "Kenya" "Korea, South"
## [93] "Kosovo" "Kuwait"
## [95] "Kyrgyzstan" "Laos"
## [97] "Latvia" "Lebanon"
## [99] "Liberia" "Libya"
## [101] "Liechtenstein" "Lithuania"
## [103] "Luxembourg" "MS Zaandam"
## [105] "Madagascar" "Malawi"
## [107] "Malaysia" "Maldives"
## [109] "Mali" "Malta"
## [111] "Mauritania" "Mauritius"
## [113] "Mexico" "Moldova"
## [115] "Monaco" "Mongolia"
## [117] "Montenegro" "Morocco"
## [119] "Mozambique" "Namibia"
## [121] "Nepal" "Netherlands"
## [123] "New Zealand" "Nicaragua"
## [125] "Niger" "Nigeria"
## [127] "North Macedonia" "Norway"
## [129] "Oman" "Pakistan"
## [131] "Panama" "Papua New Guinea"
## [133] "Paraguay" "Peru"
## [135] "Philippines" "Poland"
## [137] "Portugal" "Qatar"
## [139] "Romania" "Russia"
## [141] "Rwanda" "Saint Kitts and Nevis"
## [143] "Saint Lucia" "Saint Vincent and the Grenadines"
## [145] "San Marino" "Sao Tome and Principe"
## [147] "Saudi Arabia" "Senegal"
## [149] "Serbia" "Seychelles"
## [151] "Sierra Leone" "Singapore"
## [153] "Slovakia" "Slovenia"
## [155] "Somalia" "South Africa"
## [157] "South Sudan" "Spain"
## [159] "Sri Lanka" "Sudan"
## [161] "Suriname" "Sweden"
## [163] "Switzerland" "Syria"
## [165] "Taiwan*" "Tajikistan"
## [167] "Tanzania" "Thailand"
## [169] "Timor-Leste" "Togo"
## [171] "Trinidad and Tobago" "Tunisia"
## [173] "Turkey" "US"
## [175] "Uganda" "Ukraine"
## [177] "United Arab Emirates" "United Kingdom"
## [179] "Uruguay" "Uzbekistan"
## [181] "Venezuela" "Vietnam"
## [183] "West Bank and Gaza" "Western Sahara"
## [185] "Yemen" "Zambia"
## [187] "Zimbabwe"
plot(train$TargetValue)
## Making data a time series for ETS and ARIMA models
test_ts <- as_tsibble(test, index = ForecastId)
train_ts <- as_tsibble(train, index = Id)
train_ts %>%
gg_tsdisplay(TargetValue, plot_type = 'partial')
train_ts %>% gg_tsdisplay(difference(TargetValue), plot_type = 'partial')
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
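The tsibbles above use the row identifiers (Id, ForecastId) as the index, so the models treat all 969,640 rows as one long sequence rather than as daily series. One alternative, sketched here on made-up toy data, is to index by Date and key by location and Target so that each location and target forms its own daily series. The toy data frame and its values are hypothetical; for the real train.csv the key would presumably also need County and Province_State, which is an assumption about how the rows are identified.

```r
library(tsibble)

# Toy stand-in for train.csv (values are made up for illustration)
toy <- data.frame(
  Country_Region = rep(c("A", "B"), each = 6),
  Target = rep(rep(c("ConfirmedCases", "Fatalities"), each = 3), times = 2),
  Date = rep(c("2020-01-23", "2020-01-24", "2020-01-25"), times = 4),
  TargetValue = c(1, 3, 5, 0, 0, 1, 2, 2, 4, 0, 1, 0)
)

# read.csv() returns dates as character strings, so convert them first
toy$Date <- as.Date(toy$Date)

# One daily series per (Country_Region, Target) combination
toy_ts <- as_tsibble(toy, key = c(Country_Region, Target), index = Date)

print(n_keys(toy_ts))     # number of separate series
print(index_var(toy_ts))  # name of the time index
```

With a key in place, model() fits one model per series instead of a single model over all rows.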
##Literature Review
In general, a variety of published articles apply ETS or ARIMA models to forecasting COVID-19 cases. Both were used with varying levels of effectiveness, so I will explore both to see which fits this data best. The three articles I selected show that these models can be applied to this forecasting task:
Kirbas et al.
This journal article explores forecasting COVID-19 cases in several European countries. The data came from the European Center for Disease Prevention and Control. It varies by country, but it is generally daily data very similar to the data set used in this competition. The authors applied ARIMA, neural network (NARNN), and Long Short-Term Memory (LSTM) models to eight European countries. They found that the LSTM model was the most effective; however, the ARIMA model was particularly effective in forecasting COVID-19 cases in the United Kingdom.
Perone
This journal article is of particular interest because it applies and compares ARIMA and ETS models, alongside two other models. Perone examines 236 observations of real-time COVID-19 cases in Italy, spanning February to October 2020. The ARIMA model outperforms the ETS model, but the author also uses hybrid models, which is something to consider in future exploration of this kind of data.
Ribeiro et al.
This journal article is very similar to the other two: ARIMA and several more sophisticated models are used to forecast COVID-19 cases in Brazil. In this article ARIMA is the second-best performing model, which supports the notion that it can be used effectively in this competition.
##Types of Models
Based on the literature I reviewed and the techniques I know, I will mainly focus on applying ETS and ARIMA models to this problem. I am also interested in how a linear model might perform, and will use one as a baseline. Finally, I will attempt an STL decomposition of the data to determine whether the seasonally adjusted data yields better ETS or ARIMA models. The ETS and ARIMA models will be built using R functions and evaluated by their respective AIC, AICc, BIC, and RMSE values.
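As a rough illustration of the accuracy measures used for evaluation, the following base R sketch computes RMSE, MAE, and MAPE by hand on made-up numbers. The vectors are hypothetical and are not taken from the competition data. It also shows one plausible reason a MAPE can come out as Inf: percentage errors divide by the actual values, and daily counts are often zero.

```r
# Hypothetical actual and fitted values (not from the competition data)
actual <- c(0, 2, 5, 0, 3)
fitted_values <- c(1, 2, 4, 0, 5)

err <- actual - fitted_values

rmse <- sqrt(mean(err^2))   # root mean squared error
mae  <- mean(abs(err))      # mean absolute error

# Percentage errors divide by the actual value, so a zero actual with a
# non-zero error gives an infinite term (0/0 gives NaN, dropped by na.rm)
pct_err <- 100 * err / actual
mape <- mean(abs(pct_err), na.rm = TRUE)

print(c(RMSE = rmse, MAE = mae, MAPE = mape))
```

Because RMSE and MAE stay finite when the actual values contain zeros, they are the more informative measures for count data like this.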
##Building Models
Linear Model
First, I will build a linear model as a baseline regression. I will model TargetValue, which covers both cases and fatalities, as a function of the variables Population and Weight.
lm <- lm(TargetValue ~ Population + Weight, train)
summary(lm)
##
## Call:
## lm(formula = TargetValue ~ Population + Weight, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10118 -19 -7 2 35692
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.098e+01 4.690e-01 44.73 <2e-16 ***
## Population 1.391e-06 8.720e-09 159.47 <2e-16 ***
## Weight -2.298e+01 6.711e-01 -34.24 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 298.4 on 969637 degrees of freedom
## Multiple R-squared: 0.02717, Adjusted R-squared: 0.02717
## F-statistic: 1.354e+04 on 2 and 969637 DF, p-value: < 2.2e-16
plot(lm)
accuracy(lm)
## ME RMSE MAE MPE MAPE MASE
## Training set -3.807993e-13 298.3863 23.68562 NaN Inf 1.012281
prediction <- predict(lm, test)
plot(prediction)
plot(lm$residuals)
ETS Model
For the ETS model, I will use the function ETS() on the target value to determine the model that best fits the data set. The model that the function selects is ETS(A,Ad,N): additive errors, an additive damped trend (Holt’s damped trend method), and no seasonal component.
ets <- train_ts %>% model(ETS(TargetValue))
ets
## # A mable: 1 x 1
## `ETS(TargetValue)`
## <model>
## 1 <ETS(A,Ad,N)>
accuracy(ets)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ETS(TargetValue) Training 0.0000290 218. 12.9 NaN Inf 0.567 0.540 -0.785
gg_tsresiduals(ets)
ets_fc <- ets %>% forecast(h=30)
ets_fc %>% autoplot(train_ts)
prediction2 <- predict(ets, test_ts)
plot(prediction2$.mean)
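To make the ETS(A,Ad,N) notation more concrete, here is a small base R sketch of how a damped additive trend produces point forecasts. The function name damped_forecast and the level, trend, and phi values are illustrative assumptions, not parameters estimated by the model above. With a damping parameter phi between 0 and 1, each further step adds a smaller share of the trend, so the forecasts level off instead of growing without bound.

```r
# Point forecasts for horizons 1..h under a damped additive trend:
# forecast(k) = level + (phi + phi^2 + ... + phi^k) * trend
damped_forecast <- function(level, trend, phi, h) {
  level + cumsum(phi^seq_len(h)) * trend
}

fc <- damped_forecast(level = 100, trend = 10, phi = 0.8, h = 3)
print(fc)

# The long-run limit is level + trend * phi / (1 - phi)
limit <- 100 + 10 * 0.8 / (1 - 0.8)
print(limit)
```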
ARIMA Model
The ARIMA model is selected using the function ARIMA() on the target value of cases and fatalities. The final model is an ARIMA(5,1,1). This is consistent with the ACF and PACF plots examined earlier: the autocorrelations at many lags were very strong, which fits the five autoregressive terms, and the single difference is intended to make the series stationary.
arima <- train_ts %>% model(ARIMA(TargetValue))
arima
## # A mable: 1 x 1
## `ARIMA(TargetValue)`
## <model>
## 1 <ARIMA(5,1,1)>
accuracy(arima)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA(TargetValue) Traini… 1.95e-5 89.9 4.33 NaN Inf 0.191 0.223 0.00738
gg_tsresiduals(arima)
arima_fc <- arima %>% forecast(h=24)
arima_fc %>% autoplot(train_ts)
prediction3 <- predict(arima, test_ts)
plot(prediction3$.mean)
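The middle 1 in ARIMA(5,1,1) means the series is differenced once before the AR and MA terms are fitted. The base R sketch below uses a short made-up series to show what first differencing does; the numbers are hypothetical.

```r
# A made-up series with an upward drift
y <- c(5, 8, 12, 18, 20)

# First difference: the change from one observation to the next
dy <- diff(y)
print(dy)

# Differencing can be undone given the first value of the series
y_back <- diffinv(dy, xi = y[1])
print(y_back)
```

The differenced series is one observation shorter than the original, which is why the differenced plot earlier produced a warning about one removed row.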
Separating Cases and Fatalities
As I was building these models, I realized that by grouping together cases and fatalities, the variable TargetValue might become diluted. Cases and fatalities are correlated, but grouping them in a single forecast model might make the predictions for both ineffective. Therefore, I filtered confirmed cases and fatalities into separate series and applied the two main models, ETS and ARIMA, to each.
For the ETS model of cases, the model remains the same, but for fatalities, interestingly, only a simple exponential smoothing model is selected. The damped trend therefore appears to come from the confirmed cases, and it could have had a negative effect on fatalities in the previous models.
For the ARIMA model of cases, the model changes to an ARIMA(3,1,3). For fatalities, the model changes to an ARIMA(2,1,4). Looking at the residuals, they still do not appear to be white noise for either cases or fatalities.
##ETS
# Confirmed Cases
train_cases <- train_ts %>% filter(Target == "ConfirmedCases")
test_cases <- test_ts %>% filter(Target == "ConfirmedCases")
ets_cases <- train_cases %>% model(ETS(TargetValue))
ets_cases
## # A mable: 1 x 1
## `ETS(TargetValue)`
## <model>
## 1 <ETS(A,Ad,N)>
accuracy(ets_cases)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ETS(TargetValue) Training 0.0000394 127. 7.51 NaN Inf 0.950 0.937 0.0631
gg_tsresiduals(ets_cases)
cases_fc <- ets_cases %>% forecast(h=30)
cases_fc %>% autoplot(train_cases)
prediction_cases <- predict(ets_cases, test_cases)
plot(prediction_cases$.mean)
# Fatalities
train_deaths <- train_ts %>% filter(Target == "Fatalities")
test_deaths <- test_ts %>% filter(Target == "Fatalities")
ets_deaths <- train_deaths %>% model(ETS(TargetValue))
accuracy(ets_deaths)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ETS(TargetValue) Training -9.40e-9 12.1 0.617 NaN Inf 0.950 0.923 0.146
gg_tsresiduals(ets_deaths)
deaths_fc <- ets_deaths %>% forecast(h=30)
deaths_fc %>% autoplot(train_deaths)
prediction_deaths <- predict(ets_deaths, test_deaths)
plot(prediction_deaths$.mean)
##ARIMA
# Confirmed Cases
arima_cases <- train_cases %>% model(ARIMA(TargetValue))
accuracy(arima_cases)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA(TargetValue) Train… 3.16e-5 124. 7.70 NaN Inf 0.974 0.916 -0.00652
gg_tsresiduals(arima_cases)
cases_fc2 <- arima_cases %>% forecast(h=30)
cases_fc2 %>% autoplot(train_cases)
prediction_cases2 <- predict(arima_cases, test_cases)
plot(prediction_cases2$.mean)
# Fatalities
arima_deaths <- train_deaths %>% model(ARIMA(TargetValue))
accuracy(arima_deaths)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA(TargetValue) Train… 6.59e-10 11.0 0.603 NaN Inf 0.929 0.839 0.00382
gg_tsresiduals(arima_deaths)
deaths_fc2 <- arima_deaths %>% forecast(h=30)
deaths_fc2 %>% autoplot(train_deaths)
prediction_deaths2 <- predict(arima_deaths, test_deaths)
plot(prediction_deaths2$.mean)
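The residual checks above rely on reading the gg_tsresiduals() plots. A portmanteau test such as Ljung-Box can complement the visual check. The sketch below uses base R's Box.test() on simulated series rather than on the actual model residuals; the seed, the series length, and the lag of 10 are arbitrary choices made for illustration.

```r
set.seed(42)

noise <- rnorm(200)          # simulated white noise
walk  <- cumsum(rnorm(200))  # simulated random walk, strongly autocorrelated

# Null hypothesis: no autocorrelation up to the chosen lag
lb_noise <- Box.test(noise, lag = 10, type = "Ljung-Box")
lb_walk  <- Box.test(walk,  lag = 10, type = "Ljung-Box")

print(lb_noise$p.value)
print(lb_walk$p.value)  # a very small p-value rejects white noise
```

When the test is applied to residuals from a fitted model, the fitdf argument of Box.test() is usually set to the number of estimated parameters.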
STL Decomposition
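I also ran an STL decomposition and fitted models to the seasonally adjusted series, but it had no effect: the accuracy measures below are identical to those of the unadjusted models. This is likely because the series is indexed by row Id rather than by date, so the decomposition has no seasonal period to work with. Here is the code.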
train_ts %>%
model(
STL(TargetValue ~ trend(window = 7) +
season(window = "periodic"),
robust = TRUE)) %>%
components() %>%
autoplot()
decomp <- train_ts %>%
model(
STL(TargetValue ~ trend(window = 7) +
season(window = "periodic"),
robust = TRUE)) %>%
components()
##ETS
train_ts$season_adjust <- decomp$season_adjust
ets_seas <- train_ts %>% model(ETS(season_adjust))
glance(ets_seas)
## # A tibble: 1 × 9
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ETS(season_adjust) 47630. -11905185. 2.38e7 2.38e7 2.38e7 47629. 47212. 12.9
accuracy(ets_seas)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ETS(season_adjust) Training 2.90e-5 218. 12.9 NaN Inf 0.567 0.540 -0.785
ets_fc2 <- ets_seas %>% forecast(h=24)
ets_fc2 %>% autoplot(train_ts)
prediction4 <- predict(ets_seas, test_ts)
plot(prediction4$.mean)
##ARIMA
arima_seas <- train_ts %>% model(ARIMA(season_adjust))
glance(arima_seas)
## # A tibble: 1 × 8
## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 ARIMA(season_adjust) 8079. -5737771. 1.15e7 1.15e7 1.15e7 <cpl> <cpl>
accuracy(arima_seas)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA(season_adjust) Trai… 1.95e-5 89.9 4.33 NaN Inf 0.191 0.223 0.00738
arima_fc2 <- arima_seas %>% forecast(h=24)
arima_fc2 %>% autoplot(train_ts)
prediction5 <- predict(arima_seas, test_ts)
plot(prediction5$.mean)
Submissions (if we were submitting)
s <- data.frame(Id=test$ForecastId,TargetValue=prediction)
write.csv(s,file="Forecasting_Kaggle COVID_LM.csv", row.names=FALSE)
s2 <- data.frame(Id=test_ts$ForecastId,TargetValue=prediction2$.mean)
write.csv(s2,file="Forecasting_Kaggle COVID_ETS.csv", row.names=FALSE)
s3 <- data.frame(Id=test_ts$ForecastId,TargetValue=prediction3$.mean)
# Write the ARIMA predictions to their own file so they do not overwrite the ETS file
write.csv(s3,file="Forecasting_Kaggle COVID_ARIMA.csv", row.names=FALSE)
#Model Evaluation
Confirmed Cases and Fatalities Together
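Looking at the RMSE, the ARIMA model greatly outperforms the ETS model (89.9 versus 218). The ARIMA model also has a lower AIC, AICc, and BIC, although information criteria are not directly comparable between ETS and ARIMA models because the two classes compute their likelihoods differently, so the RMSE is the more reliable comparison here. Together, these results indicate that the ARIMA model is the better fit to this data set. The RMSE of the linear model (298.4) is considerably higher than that of both the ARIMA and ETS models.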
accuracy(lm)
## ME RMSE MAE MPE MAPE MASE
## Training set -3.807993e-13 298.3863 23.68562 NaN Inf 1.012281
#Total
a <- bind_rows(ets = accuracy(ets), arima = accuracy(arima))
a
## # A tibble: 2 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ETS(TargetValue) Train… 2.90e-5 218. 12.9 NaN Inf 0.567 0.540 -0.785
## 2 ARIMA(TargetValue) Train… 1.95e-5 89.9 4.33 NaN Inf 0.191 0.223 0.00738
glance(ets)
## # A tibble: 1 × 9
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ETS(TargetValue) 47630. -11905185. 23810383. 2.38e7 2.38e7 47629. 47212. 12.9
glance(arima)
## # A tibble: 1 × 8
## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 ARIMA(TargetValue) 8079. -5737771. 11475556. 1.15e7 1.15e7 <cpl> <cpl>
Confirmed Cases and Fatalities Separated
When confirmed cases and fatalities are separated, the better-performing model is the same. The ARIMA models have the lowest AIC, AICc, BIC, and RMSE values for both confirmed cases and fatalities, although the RMSE values are much closer than they were for the combined data (124 versus 127 for cases, 11.0 versus 12.1 for fatalities).
# Cases
bind_rows(ets_cases = accuracy(ets_cases), arima_cases = accuracy(arima_cases))
## # A tibble: 2 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ETS(TargetValue) Train… 3.94e-5 127. 7.51 NaN Inf 0.950 0.937 0.0631
## 2 ARIMA(TargetValue) Train… 3.16e-5 124. 7.70 NaN Inf 0.974 0.916 -0.00652
glance(ets_cases)
## # A tibble: 1 × 9
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ETS(TargetValue) 16094. -5521550. 11043112. 1.10e7 1.10e7 16094. 21433. 7.51
glance(arima_cases)
## # A tibble: 1 × 8
## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 ARIMA(TargetValue) 15370. -3024789. 6049592. 6049592. 6.05e6 <cpl> <cpl>
#Fatalities
bind_rows(ets_deaths = accuracy(ets_deaths), arima_deaths = accuracy(arima_deaths))
## # A tibble: 2 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ETS(TargetValue) Trai… -9.40e- 9 12.1 0.617 NaN Inf 0.950 0.923 0.146
## 2 ARIMA(TargetValue) Trai… 6.59e-10 11.0 0.603 NaN Inf 0.929 0.839 0.00382
glance(ets_deaths)
## # A tibble: 1 × 9
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ETS(TargetValue) 147. -4383170. 8766346. 8766346. 8766380. 147. 183. 0.617
glance(arima_deaths)
## # A tibble: 1 × 8
## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 ARIMA(TargetValue) 122. -1851628. 3703269. 3703269. 3.70e6 <cpl> <cpl>
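Every accuracy table above reports training-set measures (.type is Training), which tend to favor more flexible models. A hedged sketch of a holdout comparison follows, using base R's arima() on a simulated series; the series, the 100/20 split, and the ARIMA(1,1,0) order are assumptions made for illustration and are not the models fitted in this report.

```r
set.seed(1)

# Simulated series with drift, standing in for a single daily series
y <- cumsum(rnorm(120, mean = 0.5))

# Keep the last 20 observations aside as a holdout set
train_y <- y[1:100]
test_y  <- y[101:120]

# Fit on the training part only
fit <- arima(train_y, order = c(1, 1, 0))

# Forecast the holdout period and compare against what actually happened
fc <- predict(fit, n.ahead = length(test_y))$pred

rmse_train <- sqrt(mean(residuals(fit)^2))
rmse_test  <- sqrt(mean((test_y - fc)^2))

print(c(train = rmse_train, test = rmse_test))
```

Comparing models on the holdout RMSE rather than the training RMSE gives a more direct picture of forecasting performance.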
#Limitations
As I worked on this, I felt that I could have done a lot of things differently. For starters, I should have evaluated the data at the country level, that is, built models for each country. As stated previously, different countries handled the virus with varying levels of success. Outbreaks differed in size and occurred at different times depending on each country’s population level and density.
In addition, I had difficulty building a model whose residuals looked like white noise. The non-stationarity of the data, with very strong autocorrelation at many lags, makes it difficult to trust that my models forecast this data well.
Finally, I think the problem examined in this competition is volatile by nature. The COVID-19 pandemic is fraught with outbreaks and a longevity that baffle experts in all fields. Working with this kind of data means a lot of trial and error with forecasting methods. And even when you think you have cracked the case, situations can easily change (e.g., vaccinations, new variants behaving differently, easing of COVID protocols). Thus, it is a complex and ever-changing problem that requires complex and adaptable solutions.
#Future Work
Building on the previous section, the limitations of the models indicate many avenues for future work. I would like to examine COVID-19 cases at the country level. The literature review has me intrigued to examine the United Kingdom’s COVID-19 cases with both the ARIMA and ETS models to see how effective they would be. I would also like to examine why ARIMA might be a good fit for the UK and not for other countries. In addition, as I become comfortable with other forecasting methods such as neural networks and GARCH, I would like to employ them on this problem. As we look to the future of forecasting COVID-19, the increasing data sources alongside health policies should allow people, policymakers, and forecasters to better understand the nature of this pandemic and how we can hopefully avoid the rapid growth in cases and fatalities seen during 2020.
#What was Learned
The number one thing I learned is that I am still not as comfortable with these forecasting methods as I would like. I knew the target value was not white noise, but the models I fit did not resolve the non-stationarity.
The main takeaway is that the ARIMA models were better suited to the volatility of COVID-19 cases and fatalities. Fitting ETS and ARIMA models to the residuals of the first models might have been a better approach. With more time after this class is over, I plan to come back to this competition and try that alongside the strategies outlined in Future Work. Methods such as ETS, ARIMA, and neural networks that were effective at capturing the volatility of COVID-19 during its beginning stages may be key in evaluating the different variants seen throughout the world.
#Works Cited
Kırbaş, İsmail, et al. “Comparative Analysis and Forecasting of COVID-19 Cases in Various European Countries with ARIMA, NARNN and LSTM Approaches.” Chaos, Solitons & Fractals, vol. 138, 2020, p. 110015, https://doi.org/10.1016/j.chaos.2020.110015.
Perone, Gaetano. “Comparison of ARIMA, ETS, NNAR, TBATS and Hybrid Models to Forecast the Second Wave of COVID-19 Hospitalizations in Italy.” The European Journal of Health Economics, 2021, https://doi.org/10.1007/s10198-021-01347-4.
Ribeiro, Matheus Henrique, et al. “Short-Term Forecasting COVID-19 Cumulative Confirmed Cases: Perspectives for Brazil.” Chaos, Solitons & Fractals, vol. 135, 2020, p. 109853, https://doi.org/10.1016/j.chaos.2020.109853.