# Libraries ---------------------------------------------------------------

library('tidyverse')
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library('lubridate')
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library('forecast')
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library('feasts')
## Loading required package: fabletools
## 
## Attaching package: 'fabletools'
## The following objects are masked from 'package:forecast':
## 
##     accuracy, forecast
library('fpp3') 
## -- Attaching packages -------------------------------------------- fpp3 0.4.0 --
## v tsibble     1.1.1     v fable       0.3.1
## v tsibbledata 0.4.0
## -- Conflicts ------------------------------------------------- fpp3_conflicts --
## x lubridate::date()      masks base::date()
## x dplyr::filter()        masks stats::filter()
## x fabletools::forecast() masks forecast::forecast()
## x tsibble::intersect()   masks base::intersect()
## x tsibble::interval()    masks lubridate::interval()
## x dplyr::lag()           masks stats::lag()
## x tsibble::setdiff()     masks base::setdiff()
## x tsibble::union()       masks base::union()
library('gridExtra') 
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine

The world was transformed into an unprecedented when covid-19 emergered. A threat to public safety came out in force and shut down governments as well as their economies across the globe. The data analyzed in this assignment is the first year of cases and fatalities of the covid-19 pandemic. The importance of forecasting and analyzing this data helps provide useful information to medical centers so they can be prepared for increased amounts of patients. It also helps governments navigate the pandemic so they can help the public and return life to a more normal state.

train <- read.csv("train.csv")
head(train)
##   Id County Province_State Country_Region Population     Weight       Date
## 1  1                          Afghanistan   27657145 0.05835874 2020-01-23
## 2  2                          Afghanistan   27657145 0.58358737 2020-01-23
## 3  3                          Afghanistan   27657145 0.05835874 2020-01-24
## 4  4                          Afghanistan   27657145 0.58358737 2020-01-24
## 5  5                          Afghanistan   27657145 0.05835874 2020-01-25
## 6  6                          Afghanistan   27657145 0.58358737 2020-01-25
##           Target TargetValue
## 1 ConfirmedCases           0
## 2     Fatalities           0
## 3 ConfirmedCases           0
## 4     Fatalities           0
## 5 ConfirmedCases           0
## 6     Fatalities           0
table(train$Target)
## 
## ConfirmedCases     Fatalities 
##         484820         484820
table(train$Country_Region)
## 
##                      Afghanistan                          Albania 
##                              280                              280 
##                          Algeria                          Andorra 
##                              280                              280 
##                           Angola              Antigua and Barbuda 
##                              280                              280 
##                        Argentina                          Armenia 
##                              280                              280 
##                        Australia                          Austria 
##                             2520                              280 
##                       Azerbaijan                          Bahamas 
##                              280                              280 
##                          Bahrain                       Bangladesh 
##                              280                              280 
##                         Barbados                          Belarus 
##                              280                              280 
##                          Belgium                           Belize 
##                              280                              280 
##                            Benin                           Bhutan 
##                              280                              280 
##                          Bolivia           Bosnia and Herzegovina 
##                              280                              280 
##                         Botswana                           Brazil 
##                              280                              280 
##                           Brunei                         Bulgaria 
##                              280                              280 
##                     Burkina Faso                            Burma 
##                              280                              280 
##                          Burundi                       Cabo Verde 
##                              280                              280 
##                         Cambodia                         Cameroon 
##                              280                              280 
##                           Canada         Central African Republic 
##                             3640                              280 
##                             Chad                            Chile 
##                              280                              280 
##                            China                         Colombia 
##                             9520                              280 
##                          Comoros              Congo (Brazzaville) 
##                              280                              280 
##                 Congo (Kinshasa)                       Costa Rica 
##                              280                              280 
##                    Cote d'Ivoire                          Croatia 
##                              280                              280 
##                             Cuba                           Cyprus 
##                              280                              280 
##                          Czechia                          Denmark 
##                              280                              840 
##                 Diamond Princess                         Djibouti 
##                              280                              280 
##                         Dominica               Dominican Republic 
##                              280                              280 
##                          Ecuador                            Egypt 
##                              280                              280 
##                      El Salvador                Equatorial Guinea 
##                              280                              280 
##                          Eritrea                          Estonia 
##                              280                              280 
##                         Eswatini                         Ethiopia 
##                              280                              280 
##                             Fiji                          Finland 
##                              280                              280 
##                           France                            Gabon 
##                             3080                              280 
##                           Gambia                          Georgia 
##                              280                              280 
##                          Germany                            Ghana 
##                              280                              280 
##                           Greece                          Grenada 
##                              280                              280 
##                        Guatemala                           Guinea 
##                              280                              280 
##                    Guinea-Bissau                           Guyana 
##                              280                              280 
##                            Haiti                         Holy See 
##                              280                              280 
##                         Honduras                          Hungary 
##                              280                              280 
##                          Iceland                            India 
##                              280                              280 
##                        Indonesia                             Iran 
##                              280                              280 
##                             Iraq                          Ireland 
##                              280                              280 
##                           Israel                            Italy 
##                              280                              280 
##                          Jamaica                            Japan 
##                              280                              280 
##                           Jordan                       Kazakhstan 
##                              280                              280 
##                            Kenya                     Korea, South 
##                              280                              280 
##                           Kosovo                           Kuwait 
##                              280                              280 
##                       Kyrgyzstan                             Laos 
##                              280                              280 
##                           Latvia                          Lebanon 
##                              280                              280 
##                          Liberia                            Libya 
##                              280                              280 
##                    Liechtenstein                        Lithuania 
##                              280                              280 
##                       Luxembourg                       Madagascar 
##                              280                              280 
##                           Malawi                         Malaysia 
##                              280                              280 
##                         Maldives                             Mali 
##                              280                              280 
##                            Malta                       Mauritania 
##                              280                              280 
##                        Mauritius                           Mexico 
##                              280                              280 
##                          Moldova                           Monaco 
##                              280                              280 
##                         Mongolia                       Montenegro 
##                              280                              280 
##                          Morocco                       Mozambique 
##                              280                              280 
##                       MS Zaandam                          Namibia 
##                              280                              280 
##                            Nepal                      Netherlands 
##                              280                             1400 
##                      New Zealand                        Nicaragua 
##                              280                              280 
##                            Niger                          Nigeria 
##                              280                              280 
##                  North Macedonia                           Norway 
##                              280                              280 
##                             Oman                         Pakistan 
##                              280                              280 
##                           Panama                 Papua New Guinea 
##                              280                              280 
##                         Paraguay                             Peru 
##                              280                              280 
##                      Philippines                           Poland 
##                              280                              280 
##                         Portugal                            Qatar 
##                              280                              280 
##                          Romania                           Russia 
##                              280                              280 
##                           Rwanda            Saint Kitts and Nevis 
##                              280                              280 
##                      Saint Lucia Saint Vincent and the Grenadines 
##                              280                              280 
##                       San Marino            Sao Tome and Principe 
##                              280                              280 
##                     Saudi Arabia                          Senegal 
##                              280                              280 
##                           Serbia                       Seychelles 
##                              280                              280 
##                     Sierra Leone                        Singapore 
##                              280                              280 
##                         Slovakia                         Slovenia 
##                              280                              280 
##                          Somalia                     South Africa 
##                              280                              280 
##                      South Sudan                            Spain 
##                              280                              280 
##                        Sri Lanka                            Sudan 
##                              280                              280 
##                         Suriname                           Sweden 
##                              280                              280 
##                      Switzerland                            Syria 
##                              280                              280 
##                          Taiwan*                       Tajikistan 
##                              280                              280 
##                         Tanzania                         Thailand 
##                              280                              280 
##                      Timor-Leste                             Togo 
##                              280                              280 
##              Trinidad and Tobago                          Tunisia 
##                              280                              280 
##                           Turkey                           Uganda 
##                              280                              280 
##                          Ukraine             United Arab Emirates 
##                              280                              280 
##                   United Kingdom                          Uruguay 
##                             3080                              280 
##                               US                       Uzbekistan 
##                           895440                              280 
##                        Venezuela                          Vietnam 
##                              280                              280 
##               West Bank and Gaza                   Western Sahara 
##                              280                              280 
##                            Yemen                           Zambia 
##                              280                              280 
##                         Zimbabwe 
##                              280
colSums(is.na(train))
##             Id         County Province_State Country_Region     Population 
##              0              0              0              0              0 
##         Weight           Date         Target    TargetValue 
##              0              0              0              0
colSums(train == "")
##             Id         County Province_State Country_Region     Population 
##              0          89600          52360              0              0 
##         Weight           Date         Target    TargetValue 
##              0              0              0              0
train$Date <- as_date(train$Date)
duplicates(train)
## Using `Date` as index variable.
## # A tibble: 969,640 x 9
##       Id County Province_State Country_Region Population Weight Date      
##    <int> <chr>  <chr>          <chr>               <int>  <dbl> <date>    
##  1     1 ""     ""             Afghanistan      27657145 0.0584 2020-01-23
##  2     2 ""     ""             Afghanistan      27657145 0.584  2020-01-23
##  3     3 ""     ""             Afghanistan      27657145 0.0584 2020-01-24
##  4     4 ""     ""             Afghanistan      27657145 0.584  2020-01-24
##  5     5 ""     ""             Afghanistan      27657145 0.0584 2020-01-25
##  6     6 ""     ""             Afghanistan      27657145 0.584  2020-01-25
##  7     7 ""     ""             Afghanistan      27657145 0.0584 2020-01-26
##  8     8 ""     ""             Afghanistan      27657145 0.584  2020-01-26
##  9     9 ""     ""             Afghanistan      27657145 0.0584 2020-01-27
## 10    10 ""     ""             Afghanistan      27657145 0.584  2020-01-27
## # ... with 969,630 more rows, and 2 more variables: Target <chr>,
## #   TargetValue <int>

Below is the transformation of the data. It was turned into a world data set rather than split out by country. It is aggregated by date. This will help narrow down specify trends.

trainCC <- train %>% 
  filter(Target == "ConfirmedCases") 

trainfate <- train %>% 
  filter(Target == "Fatalities") 
#Creating a new data that displays the data horizontally rather than duplicating it
traindata <- trainCC %>% 
  mutate(Fatalities = trainfate$Target) 
#dates are duplicating because there is a new set per country 
traindata <- trainCC %>% 
  mutate(Fatalitiesvalue = trainfate$TargetValue)
#aggregating the total cases per day for the whole datasheet
worldcases <- as.data.frame(aggregate(TargetValue~Date, trainCC,sum))
worldfatalities <- as.data.frame(aggregate(TargetValue~Date, trainfate,sum))
worldcovid <- worldcases %>% 
  mutate(Fatalities = worldfatalities$TargetValue) 


worldcovid <- as_tsibble(worldcovid, index = Date)

The training data is a multi-time series data set. It has data from countries all around the world. It has state and county data for the United states. The important variables of the data is the amount of covid cases and fatalities per day. These are broken out by the target variable in the original data. In the aggregated world data, Target value is all confirmed cases and Fatalities are confirmed fatalities due to the disease.

worldcases %>% 
  ggplot(aes(x=Date,y=TargetValue)) +
  geom_line() + 
  labs(title = "World Covid Cases Per Day", 
       y="Number of Cases",
       x = "2020 Dates")

worldfatalities  %>% 
  ggplot(aes(x=Date,y=TargetValue)) +
  geom_line() + 
  labs(title = "World Covid Case Fatalities Per Day", 
       y="Number of Fatalities",
       x = "2020 Dates")

worldcovid %>% 
  ggplot(aes(x=Date)) +
  geom_line(aes(y = TargetValue, color = "Positive Cases")) + 
  geom_line(aes(y= Fatalities, color = "Fatalities")) + 
  labs(title = "World Covid Cases and Fatalities",
       y = "Cases/fatalities", 
       color = "Legend",
       x = "2020 Dates")  

#Decomposing the data 
worldcovid %>% 
  model(STL(log(TargetValue))) %>% 
  components() %>%
  autoplot()

worldcovid %>% 
  model(STL(log(Fatalities))) %>% 
  components() %>%
  autoplot()

The data was decomposed to further the understanding. A multiplicative STL approach was used. STL was used due to it robustness and ability to handle all forms of seasonality when compared to other approaches (3.6 Hyndman). The multiplicative approach was used by taking the log of the data. It was done to account for the variation of the data when graphing the data originally.

Comparing the decomposition for both covid cases and fatalities abck to their respective graphs, the log emphasizes the variation in the data more significantly in the first 3 months of the data. Both trends show an increase at the beginning, a dip in the middle, and then a significant increase. The seasonality has a lot of variation throughout the data. The remainder of the data isn’t random. It shows the variation in the beginning and smooths out.

wcmodel <- worldcovid %>% 
  slice(-n()) %>%
  stretch_tsibble(.init = 10) %>%
  model(naive = NAIVE(TargetValue),
         ETS = ETS(TargetValue), 
         arima = ARIMA(TargetValue)
        )  

wcmodel %>% 
  forecast(h = 1) %>% 
  accuracy(worldcovid) 
## # A tibble: 3 x 10
##   .model .type      ME   RMSE    MAE    MPE  MAPE  MASE RMSSE      ACF1
##   <chr>  <chr>   <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>     <dbl>
## 1 arima  Test     51.2  9959.  7307. -33.0   47.2 0.567 0.496  0.0259  
## 2 ETS    Test  -5269.  38922. 16383. -55.0   84.6 1.27  1.94  -0.100   
## 3 naive  Test   1201.  11632.  8236.  -9.31  26.7 0.639 0.580  0.000501

The was test on 4 separate models; Naive, ETS, and ARIMA. The base functions were used so the models could automatically chosen and tested for to find the best model. ETS and ARIMA models were used because they are able to handle seasonality and variation of data better than more traditional methods like linear forecasting models. A naive model was used as a base for comparison. The test data provide had overlapping indexed data with the training data. Due to this it was decided to use cross validation to asses the robustness of each model. In this case, the best model was the ARIMA model with an RMSE of 9959.

models <- worldcovid %>% 
  model( 
        arima = ARIMA(TargetValue)
  )  
report(models)
## Series: TargetValue 
## Model: ARIMA(1,0,4)(0,1,1)[7] 
## 
## Coefficients:
##          ar1      ma1     ma2      ma3     ma4     sma1
##       0.9732  -0.4704  0.0439  -0.0223  0.2271  -0.5582
## s.e.  0.0190   0.0942  0.1110   0.0795  0.0961   0.1000
## 
## sigma^2 estimated as 80446334:  log likelihood=-1398.21
## AIC=2810.43   AICc=2811.32   BIC=2830.66

The model used was a an (1,0,4)(0,1,1)[7].

models  %>% gg_tsresiduals()

When looking at the diagnostic plots of the data we can see the innovation residuals isn’t completely random as it has a similar pattern to the data. The residual plot is fairly normally distributed. The ACF shows that there is some slight autocorrelation.

augment(models) %>%
  filter(.model == "arima") %>%
  features(.innov, ljung_box, lag=14, dof=3)
## # A tibble: 1 x 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 arima     21.5    0.0287

using the Ljung box test a p-value smaller than 0.05 indicates there is autocorrelation. The residuals are not similar to white noise. This proves what was seen in the ACF plot.

models %>% 
  forecast(h=14) %>% 
  autoplot(worldcovid)

Though the model has some serial correlation it was the best fitting model to the data when compared to the other models. Using the model, the next 14 days were forecasted.

models %>% 
  forecast(h=14) %>% 
  hilo()
## # A tsibble: 14 x 6 [1D]
## # Key:       .model [1]
##    .model Date              TargetValue   .mean                  `80%`
##    <chr>  <date>                 <dist>   <dbl>                 <hilo>
##  1 arima  2020-06-11   N(169161, 8e+07) 169161. [157666.7, 180655.6]80
##  2 arima  2020-06-12   N(183422, 1e+08) 183422. [170556.1, 196288.1]80
##  3 arima  2020-06-13 N(171726, 1.2e+08) 171726. [157474.2, 185977.6]80
##  4 arima  2020-06-14 N(148478, 1.4e+08) 148478. [133125.4, 163831.3]80
##  5 arima  2020-06-15 N(141243, 1.8e+08) 141243. [123852.6, 158632.5]80
##  6 arima  2020-06-16 N(154897, 2.2e+08) 154897. [135776.8, 174017.4]80
##  7 arima  2020-06-17 N(161620, 2.6e+08) 161620. [140994.2, 182245.9]80
##  8 arima  2020-06-18 N(169861, 3.6e+08) 169861. [145688.2, 194034.7]80
##  9 arima  2020-06-19 N(184730, 4.2e+08) 184730. [158615.2, 210844.3]80
## 10 arima  2020-06-20 N(172465, 4.7e+08) 172465. [144558.7, 200371.7]80
## 11 arima  2020-06-21 N(149981, 5.3e+08) 149981. [120514.0, 179448.2]80
## 12 arima  2020-06-22 N(142705, 5.9e+08) 142705. [111469.1, 173941.1]80
## 13 arima  2020-06-23 N(156320, 6.6e+08) 156320. [123496.9, 189144.1]80
## 14 arima  2020-06-24 N(163005, 7.1e+08) 163005. [128745.7, 197265.0]80
## # ... with 1 more variable: `95%` <hilo>

Next we can see the prediction interval at 80% and 95% which make up the forecasted portion of the model.

Now that the cases have been forecasted, next will be the fatalities. This will involve a similar process. Cross validation will be used to assess the accuracy of models to select the most accurate one.

wfmodel <- worldcovid %>% 
  slice(-n()) %>%
  stretch_tsibble(.init = 10) %>%
  model(naive = NAIVE(Fatalities),
        ETS = ETS(Fatalities), 
        arima = ARIMA(Fatalities)
  )  

wfmodel %>% 
  forecast(h = 1) %>% 
  accuracy(worldcovid) 
## # A tibble: 3 x 10
##   .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE    ACF1
##   <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 arima  Test  -91.9 1338.  752. -69.9  92.0 0.660 0.729 -0.124 
## 2 ETS    Test  -15.7 1677. 1116. -91.3 120.  0.979 0.914  0.474 
## 3 naive  Test   50.6 1479.  902. -83.9 107.  0.791 0.806  0.0947
#like the covid cases, fatalities will use an arima model because the RMSE is the lowest.  

fatemodel <- worldcovid %>% 
  model(arima = ARIMA(Fatalities)) 

fatemodel %>% report()
## Series: Fatalities 
## Model: ARIMA(2,0,3)(0,1,1)[7] 
## 
## Coefficients:
##          ar1     ar2     ma1      ma2      ma3     sma1
##       0.0543  0.8842  0.6301  -0.3609  -0.2218  -0.5396
## s.e.  0.0672  0.0631  0.1080   0.0969   0.0830   0.1040
## 
## sigma^2 estimated as 1201759:  log likelihood=-1118.5
## AIC=2251.01   AICc=2251.9   BIC=2271.24
#AICc of 2251.01  

fatemodel %>% gg_tsresiduals()

#similar to the the cases model, there seems to be some autocorrelation and heteroscadictity. 

  
augment(fatemodel) %>%
  filter(.model == "arima") %>%
  features(.innov, ljung_box, lag=14, dof=3)
## # A tibble: 1 x 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 arima     21.4    0.0299
#Box test confirms autocorrelation 
fatemodel %>% 
  forecast(h=14) %>% 
  autoplot(worldcovid)

fatemodel %>% 
  forecast(h=14) %>% 
  hilo()
## # A tsibble: 14 x 6 [1D]
## # Key:       .model [1]
##    .model Date             Fatalities .mean                    `80%`
##    <chr>  <date>               <dist> <dbl>                   <hilo>
##  1 arima  2020-06-11 N(6540, 1201759) 6540. [5134.90634, 7944.702]80
##  2 arima  2020-06-12 N(6661, 1764738) 6661. [4958.45789, 8363.373]80
##  3 arima  2020-06-13 N(5143, 2142352) 5143. [3267.09941, 7018.656]80
##  4 arima  2020-06-14 N(3239, 2348231) 3239. [1274.80200, 5202.486]80
##  5 arima  2020-06-15 N(3847, 2670864) 3847. [1752.15221, 5940.975]80
##  6 arima  2020-06-16 N(6155, 2857539) 6155. [3988.26454, 8321.000]80
##  7 arima  2020-06-17 N(6770, 3133918) 6770. [4501.50164, 9038.931]80
##  8 arima  2020-06-18   N(6374, 4e+06) 6374. [3820.19820, 8928.289]80
##  9 arima  2020-06-19 N(6481, 4665056) 6481. [3713.13520, 9249.115]80
## 10 arima  2020-06-20 N(4845, 5117256) 4845. [1945.50017, 7743.586]80
## 11 arima  2020-06-21 N(3063, 5553958) 3063. [  43.25326, 6083.677]80
## 12 arima  2020-06-22   N(3573, 6e+06) 3573. [ 446.81996, 6699.683]80
## 13 arima  2020-06-23 N(5985, 6334125) 5985. [2759.52105, 9210.259]80
## 14 arima  2020-06-24 N(6519, 6683530) 6519. [3206.19139, 9832.460]80
## # ... with 1 more variable: `95%` <hilo>

Overall, the fatalities model seems to be very similar to the covid cases model. The ARIMA model was the most accurate. The model selected was ARIMA(2,0,3)(0,1,1). It has the same seasonal component but different trend parameters. This isn’t so surprising when comparing to the decomposition conducted earlier.

The residuals of the model also provide similar results to that of the cases model. There seems to be patterns within the residual plots indicating heteroscedasticity. The ACF plot indicates autocorrelation and it is confirmed with the Ljung-box test. The Ljung-box test had a p-value of less then 0.05 indicating the residuals are not identical to white noise.

In review of both models, ARIMA served best. Both had some caveats proving to predict well but had autocorrelation and some heteroscedasticity. This gives limitation to interpret-ability which might be an issue due to the importance of the data to public health. The models didn’t have as much data to work with than originally seemed which provided another limitation. The training data started with 969640 observations but ended up with only 140 when appropriately aggregated for modelling. The testing data also didn’t prove to be helpful either being redundant. A limitation of both models is long term forecasting. Since they both have a d = 0 (Cases = ARIMA(1,0,4) fatalities =ARIM(2,0,3)), they won’t be as accurate forecasting in the longer term because it will eventually revert to the mean of the data. This will also make any long term forecast prediction intervals become the same (Hyndman 9.5).The cases model also won’t take into account any cyclical behaviors since it’s p is less than 2. The fatalities model would be better at prediction because it accounts for cyclical behavior.

A lot of future work could be beneficial when working with this data. A priority in the future would be to dive into more granular portions of the data such as US specific data and forecast by state. More transformations could have been looked into help with the variation of the data which was seen when doing the multiplicative STL decomposition. A Guerrero box-cox transformation could have helped with the variance in the data. Testing out a wider variety of models, like neural networks, could also be a good learning experience but also provide more insight.

The biggest take aways from this project were really understanding the data, testing a wide variety of models, understanding a forecasting project from start to finish. Upon review, the original data doesn’t seem too out of place. The multiplicative decomposition sheds light on how volatile the original data was, how different the trend can be, and standout points in the data. Having multiple models to test would have been better to see which one works better. The personal bias was that ARIMA models would have worked better due to its robustness with data and ability to handle various types of seasonal data. Auto ARIMA models were used, fourier terms would’ve helped with predictive accuracy within the model process (Hyndman 13.1). Forecasting with STL decomposition would’ve been a viable option for forecasting but was only used to under stand the data (Hyndman 13.1). The midterm also helped in understanding a forecast project from start to finish. Time series data is similar to normal modelling procedures but has differences that require a unique understanding. Going through a whole data start to finish helps with learning the needs a forecasting project.