# Libraries ---------------------------------------------------------------
library('tidyverse')
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library('lubridate')
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library('forecast')
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library('feasts')
## Loading required package: fabletools
##
## Attaching package: 'fabletools'
## The following objects are masked from 'package:forecast':
##
## accuracy, forecast
library('fpp3')
## -- Attaching packages -------------------------------------------- fpp3 0.4.0 --
## v tsibble 1.1.1 v fable 0.3.1
## v tsibbledata 0.4.0
## -- Conflicts ------------------------------------------------- fpp3_conflicts --
## x lubridate::date() masks base::date()
## x dplyr::filter() masks stats::filter()
## x fabletools::forecast() masks forecast::forecast()
## x tsibble::intersect() masks base::intersect()
## x tsibble::interval() masks lubridate::interval()
## x dplyr::lag() masks stats::lag()
## x tsibble::setdiff() masks base::setdiff()
## x tsibble::union() masks base::union()
library('gridExtra')
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
The world was thrown into an unprecedented state when COVID-19 emerged. The threat to public safety arrived in force and shut down governments and their economies across the globe. The data analyzed in this assignment covers the first several months of cases and fatalities of the COVID-19 pandemic. Forecasting and analyzing this data provides useful information to medical centers so they can prepare for surges in patients, and it helps governments navigate the pandemic so they can support the public and return life to a more normal state.
train <- read.csv("train.csv")
head(train)
## Id County Province_State Country_Region Population Weight Date
## 1 1 Afghanistan 27657145 0.05835874 2020-01-23
## 2 2 Afghanistan 27657145 0.58358737 2020-01-23
## 3 3 Afghanistan 27657145 0.05835874 2020-01-24
## 4 4 Afghanistan 27657145 0.58358737 2020-01-24
## 5 5 Afghanistan 27657145 0.05835874 2020-01-25
## 6 6 Afghanistan 27657145 0.58358737 2020-01-25
## Target TargetValue
## 1 ConfirmedCases 0
## 2 Fatalities 0
## 3 ConfirmedCases 0
## 4 Fatalities 0
## 5 ConfirmedCases 0
## 6 Fatalities 0
table(train$Target)
##
## ConfirmedCases Fatalities
## 484820 484820
table(train$Country_Region)
##
## Afghanistan Albania
## 280 280
## Algeria Andorra
## 280 280
## Angola Antigua and Barbuda
## 280 280
## Argentina Armenia
## 280 280
## Australia Austria
## 2520 280
## Azerbaijan Bahamas
## 280 280
## Bahrain Bangladesh
## 280 280
## Barbados Belarus
## 280 280
## Belgium Belize
## 280 280
## Benin Bhutan
## 280 280
## Bolivia Bosnia and Herzegovina
## 280 280
## Botswana Brazil
## 280 280
## Brunei Bulgaria
## 280 280
## Burkina Faso Burma
## 280 280
## Burundi Cabo Verde
## 280 280
## Cambodia Cameroon
## 280 280
## Canada Central African Republic
## 3640 280
## Chad Chile
## 280 280
## China Colombia
## 9520 280
## Comoros Congo (Brazzaville)
## 280 280
## Congo (Kinshasa) Costa Rica
## 280 280
## Cote d'Ivoire Croatia
## 280 280
## Cuba Cyprus
## 280 280
## Czechia Denmark
## 280 840
## Diamond Princess Djibouti
## 280 280
## Dominica Dominican Republic
## 280 280
## Ecuador Egypt
## 280 280
## El Salvador Equatorial Guinea
## 280 280
## Eritrea Estonia
## 280 280
## Eswatini Ethiopia
## 280 280
## Fiji Finland
## 280 280
## France Gabon
## 3080 280
## Gambia Georgia
## 280 280
## Germany Ghana
## 280 280
## Greece Grenada
## 280 280
## Guatemala Guinea
## 280 280
## Guinea-Bissau Guyana
## 280 280
## Haiti Holy See
## 280 280
## Honduras Hungary
## 280 280
## Iceland India
## 280 280
## Indonesia Iran
## 280 280
## Iraq Ireland
## 280 280
## Israel Italy
## 280 280
## Jamaica Japan
## 280 280
## Jordan Kazakhstan
## 280 280
## Kenya Korea, South
## 280 280
## Kosovo Kuwait
## 280 280
## Kyrgyzstan Laos
## 280 280
## Latvia Lebanon
## 280 280
## Liberia Libya
## 280 280
## Liechtenstein Lithuania
## 280 280
## Luxembourg Madagascar
## 280 280
## Malawi Malaysia
## 280 280
## Maldives Mali
## 280 280
## Malta Mauritania
## 280 280
## Mauritius Mexico
## 280 280
## Moldova Monaco
## 280 280
## Mongolia Montenegro
## 280 280
## Morocco Mozambique
## 280 280
## MS Zaandam Namibia
## 280 280
## Nepal Netherlands
## 280 1400
## New Zealand Nicaragua
## 280 280
## Niger Nigeria
## 280 280
## North Macedonia Norway
## 280 280
## Oman Pakistan
## 280 280
## Panama Papua New Guinea
## 280 280
## Paraguay Peru
## 280 280
## Philippines Poland
## 280 280
## Portugal Qatar
## 280 280
## Romania Russia
## 280 280
## Rwanda Saint Kitts and Nevis
## 280 280
## Saint Lucia Saint Vincent and the Grenadines
## 280 280
## San Marino Sao Tome and Principe
## 280 280
## Saudi Arabia Senegal
## 280 280
## Serbia Seychelles
## 280 280
## Sierra Leone Singapore
## 280 280
## Slovakia Slovenia
## 280 280
## Somalia South Africa
## 280 280
## South Sudan Spain
## 280 280
## Sri Lanka Sudan
## 280 280
## Suriname Sweden
## 280 280
## Switzerland Syria
## 280 280
## Taiwan* Tajikistan
## 280 280
## Tanzania Thailand
## 280 280
## Timor-Leste Togo
## 280 280
## Trinidad and Tobago Tunisia
## 280 280
## Turkey Uganda
## 280 280
## Ukraine United Arab Emirates
## 280 280
## United Kingdom Uruguay
## 3080 280
## US Uzbekistan
## 895440 280
## Venezuela Vietnam
## 280 280
## West Bank and Gaza Western Sahara
## 280 280
## Yemen Zambia
## 280 280
## Zimbabwe
## 280
colSums(is.na(train))
## Id County Province_State Country_Region Population
## 0 0 0 0 0
## Weight Date Target TargetValue
## 0 0 0 0
colSums(train == "")
## Id County Province_State Country_Region Population
## 0 89600 52360 0 0
## Weight Date Target TargetValue
## 0 0 0 0
train$Date <- as_date(train$Date)
duplicates(train)
## Using `Date` as index variable.
## # A tibble: 969,640 x 9
## Id County Province_State Country_Region Population Weight Date
## <int> <chr> <chr> <chr> <int> <dbl> <date>
## 1 1 "" "" Afghanistan 27657145 0.0584 2020-01-23
## 2 2 "" "" Afghanistan 27657145 0.584 2020-01-23
## 3 3 "" "" Afghanistan 27657145 0.0584 2020-01-24
## 4 4 "" "" Afghanistan 27657145 0.584 2020-01-24
## 5 5 "" "" Afghanistan 27657145 0.0584 2020-01-25
## 6 6 "" "" Afghanistan 27657145 0.584 2020-01-25
## 7 7 "" "" Afghanistan 27657145 0.0584 2020-01-26
## 8 8 "" "" Afghanistan 27657145 0.584 2020-01-26
## 9 9 "" "" Afghanistan 27657145 0.0584 2020-01-27
## 10 10 "" "" Afghanistan 27657145 0.584 2020-01-27
## # ... with 969,630 more rows, and 2 more variables: Target <chr>,
## # TargetValue <int>
Below is the transformation of the data. Rather than keeping the series split out by country, it is aggregated by date into a single world-level data set. This makes it easier to identify overall trends.
trainCC <- train %>%
filter(Target == "ConfirmedCases")
trainfate <- train %>%
filter(Target == "Fatalities")
# Create a wide data set with case and fatality counts side by side rather than stacked;
# dates repeat in the raw data because there is a separate series per country/region
traindata <- trainCC %>%
  mutate(Fatalitiesvalue = trainfate$TargetValue)
#aggregating the total cases per day for the whole datasheet
worldcases <- as.data.frame(aggregate(TargetValue~Date, trainCC,sum))
worldfatalities <- as.data.frame(aggregate(TargetValue~Date, trainfate,sum))
worldcovid <- worldcases %>%
mutate(Fatalities = worldfatalities$TargetValue)
worldcovid <- as_tsibble(worldcovid, index = Date)
The training data is a multiple-time-series data set. It contains data from countries all around the world, including state- and county-level data for the United States. The key variables are the number of covid cases and fatalities per day, which are separated by the Target variable in the original data. In the aggregated world data, TargetValue is the count of confirmed cases and Fatalities is the count of confirmed fatalities due to the disease.
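A quick check along these lines could confirm that the aggregated world series has one row per day with both measures:
# Inspect the aggregated series: one observation per date, cases and fatalities side by side
head(worldcovid)
nrow(worldcovid)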
worldcases %>%
ggplot(aes(x=Date,y=TargetValue)) +
geom_line() +
labs(title = "World Covid Cases Per Day",
y="Number of Cases",
x = "2020 Dates")
worldfatalities %>%
ggplot(aes(x=Date,y=TargetValue)) +
geom_line() +
labs(title = "World Covid Case Fatalities Per Day",
y="Number of Fatalities",
x = "2020 Dates")
worldcovid %>%
ggplot(aes(x=Date)) +
geom_line(aes(y = TargetValue, color = "Positive Cases")) +
geom_line(aes(y= Fatalities, color = "Fatalities")) +
labs(title = "World Covid Cases and Fatalities",
y = "Cases/fatalities",
color = "Legend",
x = "2020 Dates")
#Decomposing the data
worldcovid %>%
model(STL(log(TargetValue))) %>%
components() %>%
autoplot()
worldcovid %>%
model(STL(log(Fatalities))) %>%
components() %>%
autoplot()
The data was decomposed to build further understanding. A multiplicative STL approach was used: STL was chosen for its robustness and its ability to handle any type of seasonality compared with other approaches (Hyndman 3.6), and the multiplicative form was obtained by taking the log of the data to account for the increasing variation seen when the data was first plotted.
Comparing the decompositions of both covid cases and fatalities back to their respective graphs, the log scale emphasizes the variation in the first three months of the data. Both trends show an increase at the beginning, a dip in the middle, and then a significant increase. The seasonal component varies considerably throughout the series, and the remainder is not purely random: it shows large variation at the beginning and then smooths out.
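If finer control over the decomposition were wanted, the STL trend and seasonal windows could also be set explicitly; the window values below are illustrative guesses rather than tuned choices.
# STL with an explicit three-week trend window and a fixed (periodic) weekly seasonal pattern
worldcovid %>%
  model(STL(log(TargetValue) ~ trend(window = 21) + season(window = "periodic"))) %>%
  components() %>%
  autoplot()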
wcmodel <- worldcovid %>%
slice(-n()) %>%
stretch_tsibble(.init = 10) %>%
model(naive = NAIVE(TargetValue),
ETS = ETS(TargetValue),
arima = ARIMA(TargetValue)
)
wcmodel %>%
forecast(h = 1) %>%
accuracy(worldcovid)
## # A tibble: 3 x 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima Test 51.2 9959. 7307. -33.0 47.2 0.567 0.496 0.0259
## 2 ETS Test -5269. 38922. 16383. -55.0 84.6 1.27 1.94 -0.100
## 3 naive Test 1201. 11632. 8236. -9.31 26.7 0.639 0.580 0.000501
The data was tested on three separate models: naive, ETS, and ARIMA. The base model functions were used so the best specification of each type could be chosen and fitted automatically. ETS and ARIMA models were used because they handle seasonality and varying data better than more traditional methods such as linear forecasting models, and a naive model was used as a baseline for comparison. The provided test data overlapped with the indexed training data, so cross-validation was used instead to assess the robustness of each model. In this case, the best model was the ARIMA model with an RMSE of 9959.
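For convenience, the cross-validated accuracy table could be sorted so the lowest-RMSE model appears first; this is just a reordering of the accuracy output shown above.
wcmodel %>%
  forecast(h = 1) %>%
  accuracy(worldcovid) %>%
  arrange(RMSE) %>%
  select(.model, RMSE, MAE, MASE)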
models <- worldcovid %>%
model(
arima = ARIMA(TargetValue)
)
report(models)
## Series: TargetValue
## Model: ARIMA(1,0,4)(0,1,1)[7]
##
## Coefficients:
## ar1 ma1 ma2 ma3 ma4 sma1
## 0.9732 -0.4704 0.0439 -0.0223 0.2271 -0.5582
## s.e. 0.0190 0.0942 0.1110 0.0795 0.0961 0.1000
##
## sigma^2 estimated as 80446334: log likelihood=-1398.21
## AIC=2810.43 AICc=2811.32 BIC=2830.66
The selected model was an ARIMA(1,0,4)(0,1,1)[7]: one autoregressive term and four moving-average terms in the non-seasonal part, with one seasonal difference and one seasonal moving-average term at the weekly (period 7) level.
models %>% gg_tsresiduals()
Looking at the diagnostic plots, the innovation residuals are not completely random; they follow a pattern similar to the data. The histogram of residuals is fairly normally distributed. The ACF shows some slight autocorrelation.
augment(models) %>%
filter(.model == "arima") %>%
features(.innov, ljung_box, lag=14, dof=3)
## # A tibble: 1 x 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 arima 21.5 0.0287
Using the Ljung-Box test, a p-value smaller than 0.05 indicates there is autocorrelation, so the residuals are not similar to white noise. This confirms what was seen in the ACF plot.
models %>%
forecast(h=14) %>%
autoplot(worldcovid)
Though the model has some serial correlation, it was the best-fitting model when compared to the others. Using this model, the next 14 days were forecasted.
models %>%
forecast(h=14) %>%
hilo()
## # A tsibble: 14 x 6 [1D]
## # Key: .model [1]
## .model Date TargetValue .mean `80%`
## <chr> <date> <dist> <dbl> <hilo>
## 1 arima 2020-06-11 N(169161, 8e+07) 169161. [157666.7, 180655.6]80
## 2 arima 2020-06-12 N(183422, 1e+08) 183422. [170556.1, 196288.1]80
## 3 arima 2020-06-13 N(171726, 1.2e+08) 171726. [157474.2, 185977.6]80
## 4 arima 2020-06-14 N(148478, 1.4e+08) 148478. [133125.4, 163831.3]80
## 5 arima 2020-06-15 N(141243, 1.8e+08) 141243. [123852.6, 158632.5]80
## 6 arima 2020-06-16 N(154897, 2.2e+08) 154897. [135776.8, 174017.4]80
## 7 arima 2020-06-17 N(161620, 2.6e+08) 161620. [140994.2, 182245.9]80
## 8 arima 2020-06-18 N(169861, 3.6e+08) 169861. [145688.2, 194034.7]80
## 9 arima 2020-06-19 N(184730, 4.2e+08) 184730. [158615.2, 210844.3]80
## 10 arima 2020-06-20 N(172465, 4.7e+08) 172465. [144558.7, 200371.7]80
## 11 arima 2020-06-21 N(149981, 5.3e+08) 149981. [120514.0, 179448.2]80
## 12 arima 2020-06-22 N(142705, 5.9e+08) 142705. [111469.1, 173941.1]80
## 13 arima 2020-06-23 N(156320, 6.6e+08) 156320. [123496.9, 189144.1]80
## 14 arima 2020-06-24 N(163005, 7.1e+08) 163005. [128745.7, 197265.0]80
## # ... with 1 more variable: `95%` <hilo>
The output above shows the 80% and 95% prediction intervals that accompany the point forecasts.
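If the interval bounds are wanted as plain numeric columns (for example, to export them), the hilo column can be unpacked; the column name "95%" below assumes the default naming produced by hilo().
models %>%
  forecast(h = 14) %>%
  hilo(level = 95) %>%
  unpack_hilo("95%")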
Now that the cases have been forecasted, the fatalities are next. This follows a similar process: cross-validation is again used to assess the accuracy of the candidate models and select the most accurate one.
wfmodel <- worldcovid %>%
slice(-n()) %>%
stretch_tsibble(.init = 10) %>%
model(naive = NAIVE(Fatalities),
ETS = ETS(Fatalities),
arima = ARIMA(Fatalities)
)
wfmodel %>%
forecast(h = 1) %>%
accuracy(worldcovid)
## # A tibble: 3 x 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima Test -91.9 1338. 752. -69.9 92.0 0.660 0.729 -0.124
## 2 ETS Test -15.7 1677. 1116. -91.3 120. 0.979 0.914 0.474
## 3 naive Test 50.6 1479. 902. -83.9 107. 0.791 0.806 0.0947
# Like the covid cases, fatalities will use an ARIMA model because its RMSE is the lowest.
fatemodel <- worldcovid %>%
model(arima = ARIMA(Fatalities))
fatemodel %>% report()
## Series: Fatalities
## Model: ARIMA(2,0,3)(0,1,1)[7]
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3 sma1
## 0.0543 0.8842 0.6301 -0.3609 -0.2218 -0.5396
## s.e. 0.0672 0.0631 0.1080 0.0969 0.0830 0.1040
##
## sigma^2 estimated as 1201759: log likelihood=-1118.5
## AIC=2251.01 AICc=2251.9 BIC=2271.24
# AIC of 2251.01 (AICc of 2251.9)
fatemodel %>% gg_tsresiduals()
# Similar to the cases model, there appears to be some autocorrelation and heteroscedasticity.
augment(fatemodel) %>%
filter(.model == "arima") %>%
features(.innov, ljung_box, lag=14, dof=3)
## # A tibble: 1 x 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 arima 21.4 0.0299
# The Ljung-Box test confirms the autocorrelation (p = 0.0299 < 0.05)
fatemodel %>%
forecast(h=14) %>%
autoplot(worldcovid)
fatemodel %>%
forecast(h=14) %>%
hilo()
## # A tsibble: 14 x 6 [1D]
## # Key: .model [1]
## .model Date Fatalities .mean `80%`
## <chr> <date> <dist> <dbl> <hilo>
## 1 arima 2020-06-11 N(6540, 1201759) 6540. [5134.90634, 7944.702]80
## 2 arima 2020-06-12 N(6661, 1764738) 6661. [4958.45789, 8363.373]80
## 3 arima 2020-06-13 N(5143, 2142352) 5143. [3267.09941, 7018.656]80
## 4 arima 2020-06-14 N(3239, 2348231) 3239. [1274.80200, 5202.486]80
## 5 arima 2020-06-15 N(3847, 2670864) 3847. [1752.15221, 5940.975]80
## 6 arima 2020-06-16 N(6155, 2857539) 6155. [3988.26454, 8321.000]80
## 7 arima 2020-06-17 N(6770, 3133918) 6770. [4501.50164, 9038.931]80
## 8 arima 2020-06-18 N(6374, 4e+06) 6374. [3820.19820, 8928.289]80
## 9 arima 2020-06-19 N(6481, 4665056) 6481. [3713.13520, 9249.115]80
## 10 arima 2020-06-20 N(4845, 5117256) 4845. [1945.50017, 7743.586]80
## 11 arima 2020-06-21 N(3063, 5553958) 3063. [ 43.25326, 6083.677]80
## 12 arima 2020-06-22 N(3573, 6e+06) 3573. [ 446.81996, 6699.683]80
## 13 arima 2020-06-23 N(5985, 6334125) 5985. [2759.52105, 9210.259]80
## 14 arima 2020-06-24 N(6519, 6683530) 6519. [3206.19139, 9832.460]80
## # ... with 1 more variable: `95%` <hilo>
Overall, the fatalities model is very similar to the covid cases model. The ARIMA model was again the most accurate, and the selected specification was ARIMA(2,0,3)(0,1,1)[7]. It has the same seasonal component as the cases model but different non-seasonal parameters, which is not surprising given the decompositions conducted earlier.
The residuals of this model tell a similar story to the cases model. There appear to be patterns within the residual plots indicating heteroscedasticity. The ACF plot indicates autocorrelation, and this is confirmed by the Ljung-Box test: its p-value is less than 0.05, indicating the residuals are not consistent with white noise.
In review, the ARIMA models served best for both series. Both predicted reasonably well but had autocorrelation and some heteroscedasticity in their residuals, which limits interpretability; this matters given the importance of the data to public health. The models also had less data to work with than it originally seemed: the training data started with 969,640 observations but ended up with only 140 daily observations once appropriately aggregated for modelling. The provided testing data was redundant and therefore not helpful either. A further limitation of both models is long-term forecasting. Since both have d = 0 (cases = ARIMA(1,0,4), fatalities = ARIMA(2,0,3)), long-term forecasts will eventually revert to the mean of the data, and long-term prediction intervals will become essentially constant (Hyndman 9.5). The cases model also cannot capture cyclical behavior since its p is less than 2; the fatalities model, with p = 2, is better placed to account for cyclic behavior.
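To illustrate the long-horizon limitation described above, the fitted models could be pushed out further than 14 days; the 60-day horizon here is an arbitrary choice for illustration.
# With d = 0 in the non-seasonal part, long-horizon point forecasts settle toward the mean
models %>% forecast(h = 60) %>% autoplot(worldcovid)
fatemodel %>% forecast(h = 60) %>% autoplot(worldcovid)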
A lot of future work could be beneficial with this data. A priority would be to dive into more granular portions of the data, such as the US-specific series, and forecast by state. More transformations could also be explored to help with the variation in the data seen in the multiplicative STL decomposition; a Box-Cox transformation with the lambda chosen by the Guerrero method could help stabilize the variance. Testing a wider variety of models, such as neural networks, would be a good learning experience and could provide more insight.
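A sketch of that future work, using the guerrero feature from feasts to pick a Box-Cox lambda and fable's NNETAR as a simple neural-network candidate; the settings are package defaults, not tuned values.
# Select a Box-Cox lambda via the Guerrero method and decompose the transformed series
lambda <- worldcovid %>%
  features(TargetValue, features = guerrero) %>%
  pull(lambda_guerrero)
worldcovid %>%
  model(STL(box_cox(TargetValue, lambda))) %>%
  components() %>%
  autoplot()
# A neural-network autoregression as an additional candidate model
worldcovid %>%
  model(nn = NNETAR(TargetValue)) %>%
  forecast(h = 14) %>%
  autoplot(worldcovid)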
The biggest takeaways from this project were really understanding the data, testing a variety of models, and seeing a forecasting project through from start to finish. Upon review, the original data does not seem too out of place. The multiplicative decomposition sheds light on how volatile the original data was, how much the trend can change, and which points stand out. Testing more models would have made it clearer which one works best; the personal bias was that ARIMA models would work better because of their robustness and ability to handle various types of seasonal data. Automatic ARIMA selection was used, but Fourier terms could have improved predictive accuracy within the model process (Hyndman 13.1), and forecasting with an STL decomposition would have been a viable option but was only used here to understand the data (Hyndman 13.1). The midterm also helped in understanding a forecasting project from start to finish. Time series data is handled much like other modelling procedures but has differences that require a unique understanding, and going through a whole project from start to finish helps with learning the needs of a forecasting project.
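A sketch of those two ideas: a dynamic harmonic regression with weekly Fourier terms (period 7 and K = 3 are assumed, not tuned) and a forecast built from an STL decomposition.
worldcovid %>%
  model(
    # Weekly Fourier terms with non-seasonal ARIMA errors
    dhr = ARIMA(log(TargetValue) ~ fourier(period = 7, K = 3) + PDQ(0, 0, 0)),
    # ETS on the seasonally adjusted series; the seasonal component is re-added with SNAIVE by default
    stl_ets = decomposition_model(STL(TargetValue), ETS(season_adjust ~ season("N")))
  ) %>%
  forecast(h = 14) %>%
  autoplot(worldcovid)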