#Project setup

setwd("~/Downloads/BANA4090 FINAL PROJECT")
cod<- read_csv("cod.csv.zip")
## Multiple files in zip: reading 'cause_of_deaths.csv'
## Rows: 6120 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): Country/Territory, Code
## dbl (32): Year, Meningitis, Alzheimer's Disease and Other Dementias, Parkins...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
show_col_types = FALSE
cod
## # A tibble: 6,120 × 34
##    Country…¹ Code   Year Menin…² Alzhe…³ Parki…⁴ Nutri…⁵ Malaria Drown…⁶ Inter…⁷
##    <chr>     <chr> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 Afghanis… AFG    1990    2159    1116     371    2087      93    1370    1538
##  2 Afghanis… AFG    1991    2218    1136     374    2153     189    1391    2001
##  3 Afghanis… AFG    1992    2475    1162     378    2441     239    1514    2299
##  4 Afghanis… AFG    1993    2812    1187     384    2837     108    1687    2589
##  5 Afghanis… AFG    1994    3027    1211     391    3081     211    1809    2849
##  6 Afghanis… AFG    1995    3102    1225     394    3131     175    1881    2969
##  7 Afghanis… AFG    1996    3193    1239     398    3175     175    1969    3331
##  8 Afghanis… AFG    1997    3304    1253     402    3250     240    2078    3028
##  9 Afghanis… AFG    1998    3281    1267     405    3193     563    2098    3098
## 10 Afghanis… AFG    1999    3200    1281     409    3115     468    2084    2917
## # … with 6,110 more rows, 24 more variables: `Maternal Disorders` <dbl>,
## #   `HIV/AIDS` <dbl>, `Drug Use Disorders` <dbl>, Tuberculosis <dbl>,
## #   `Cardiovascular Diseases` <dbl>, `Lower Respiratory Infections` <dbl>,
## #   `Neonatal Disorders` <dbl>, `Alcohol Use Disorders` <dbl>,
## #   `Self-harm` <dbl>, `Exposure to Forces of Nature` <dbl>,
## #   `Diarrheal Diseases` <dbl>, `Environmental Heat and Cold Exposure` <dbl>,
## #   Neoplasms <dbl>, `Conflict and Terrorism` <dbl>, …

#Introduction This dataset contains historical data on causes of death around the world from different countries

The dataset is from the Kaggle site and was aggregated by different authors for research purposes

The index for this data set would be the year column. There are many different time series in the dataset but since we are looking to forecast for a specific country, we will let autoplot map out what Malaria deaths look like in different places and choose the country that has interesting information.

#Data wrangling

cod %>% as_tsibble(index = Year, key = Code) %>% filter(Code == "NGA" | Code == "BGD" | Code == "BTN" | Code == "MYS" | Code == "MUS" | Code == "PAK") %>%  autoplot(Malaria)

After primary analysis of different countries, we decided that the forecast variable will be the number of deaths caused by malaria in Nigeria because they have one of the highest deaths per 100,000 population according to the health cases data. The predictor variables will be: nutritional deficiencies, drug use disorders, exposure to forces of nature, environmental heat and cold exposure, conflict and terrorism (after some graphing against the forecast variable, this could change)

We chose this forecast variable because Malaria is a very dangerous and prevalent disease throughout the world taking many lives yearly so we wanted to investigate what other factors can contribute to those deaths in certain area of the world other than being bitten by a female anopheles mosquito

This forecast can raise awareness amongst people and bring their attention to variables that contribute to malaria deaths other than the obvious female anopheles mosquito

cod_work<- cod %>% 
  filter(`Country/Territory` == "Nigeria") %>% 
  select(Year, `Country/Territory` , Malaria, `Nutritional Deficiencies` , `Drug Use Disorders`, `Exposure to Forces of Nature`, `Environmental Heat and Cold Exposure` , `Lower Respiratory Infections` , `Conflict and Terrorism`)

cod_work
## # A tibble: 30 × 9
##     Year Country/Terri…¹ Malaria Nutri…² Drug …³ Expos…⁴ Envir…⁵ Lower…⁶ Confl…⁷
##    <dbl> <chr>             <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1  2007 Nigeria          267368    9382     161      91     909  208401     628
##  2  2008 Nigeria          280604    8850     165       0     892  202257     913
##  3  2009 Nigeria          276715    8663     172      31     895  199250    2668
##  4  2013 Nigeria          227310    7542     220      19     910  191077    6260
##  5  2016 Nigeria          186482    6884     246      46     962  191915    5064
##  6  1990 Nigeria          148931   10236      95       0     728  169472      81
##  7  1991 Nigeria          157502   10517     100       0     744  174686      13
##  8  1992 Nigeria          165722   10624     105       0     782  180284     147
##  9  1993 Nigeria          173695   10693     110       0     781  185946       3
## 10  1994 Nigeria          180588   11053     113      30     800  192168      90
## # … with 20 more rows, and abbreviated variable names ¹​`Country/Territory`,
## #   ²​`Nutritional Deficiencies`, ³​`Drug Use Disorders`,
## #   ⁴​`Exposure to Forces of Nature`, ⁵​`Environmental Heat and Cold Exposure`,
## #   ⁶​`Lower Respiratory Infections`, ⁷​`Conflict and Terrorism`
cod_nigeria <- cod_work %>% as_tsibble(index = Year)

There is no missing data in our data set and as of now we are going to use the variables within the times series for forecasting. Given that they are numerical values we will refrain from creating new variables like dummy variables or fourier variables. Once we start fitting models, we will include lags of some of the predictor variables in the table to aid in forecasting but as of now we refrainfrom doing so.

#Exploratory anaylsis and visualization for the data set

cod_nigeria %>% autoplot(Malaria)

There seems to be a constant rise in the deaths up until approximately 2008 where it drops steadily until approximately 2017 and then rises again until 2020.

Now we are going to check for stationarity in the data set that we chose using an ACF plot and an unit root kpss test.

ACF plot analysis

cod_nigeria %>% ACF(Malaria) %>% autoplot()

There seems to be a decaying correlation between the lags of the values related to deaths caused by Malaria but there is some strong correlation in the first lags which means that the data set is not stationary; let us confirm this with a unit root test

Unit root test

cod_nigeria %>% features(Malaria, unitroot_kpss)
## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1     0.411      0.0723

The low kpss value tells us that the data is not stationary. We can handle this by differencing out the data, taking a log transform or taking a box-cox transformation but because there is not a lot of seasonality in the dataset, we do not want to extract the meaningful information that might be held in the slight variance in mean.

Now we are going to plot some of the predictor variables against the forecast variable to see if there is a relationship between the two and to pick what variables to include in our models

cod_nigeria %>% ggplot(aes(x = `Nutritional Deficiencies`, y = Malaria))+ geom_point() + labs(title = "Malaria by nutrition")

cod_nigeria %>% ggplot(aes(x = `Exposure to Forces of Nature`, y = Malaria))+ geom_point() + labs(title = "Malaria by exposure to nature")

cod_nigeria %>% ggplot(aes(x = `Drug Use Disorders`, y = Malaria))+ geom_point() + labs(title = "Malaria by drug use")

cod_nigeria %>% ggplot(aes(x = `Lower Respiratory Infections`, y = Malaria))+ geom_point() + labs(title = "Malaria by infection")

cod_nigeria %>% ggplot(aes(x = `Environmental Heat and Cold Exposure`, y = Malaria))+ geom_point() + labs(title = "Malaria by Environmental pressure")

cod_nigeria %>% ggplot(aes(x = `Conflict and Terrorism`, y = Malaria))+ geom_point() + labs(title = "Malaria by national state")

From plotting the predictor variables against the forecast variable, it seems that there is a relationship between Malaria deaths and deaths caused by Drug use disorders, Environmental cold and heat exposure and potentially lower respiratory infections.

The deaths from Malaria and drug use disorders having a correlation could result from treating Malaria with drugs but when the drug use goes over a certain threshold, it could become detrimental to the health. We see that as deaths in Malaria increase with certain drug use disorders and then start to decrease and then rise again

Environmental conditions typically have a strong relationship with Malaria cases and infections with Malaria worsen in hot temperatures. So as the deaths from exposure to cold and heat pressure increase, it is natural that deaths caused by Malaria would increase as well.

The deaths caused by lower respiratory infections have a strange but positive relationship with the deaths caused by Malaria. This could be because as cases involving lower respiratory infections get worse, your ability to fight off Malaria reduces as well reduces which results in the deaths trending together.

Before fitting models, we are going to add lags of some variables to add to some of our model fittings like regression

cod_nigeria_lag<- cod_nigeria %>% mutate(
  lag_Malaria = lag(Malaria, 1 ), 
   lag_environment = lag(`Environmental Heat and Cold Exposure`, 1),
  lag_drug_use = lag(`Drug Use Disorders`, 1) ) %>% filter(!is.na(lag_Malaria)) %>% filter(!is.na(lag_environment)) %>% filter(!is.na(lag_drug_use))

#Model Fitting

train<- cod_nigeria_lag %>% filter(Year < "2012")
test<- cod_nigeria_lag %>% filter(Year >= "2012")

We chose to split the training set before 2012 because that is approximately 80% of the data and the rest is the remaining 20%

#Fit TSLM, ETS and ARIMA models

For the first model fitting, we will have r pick all the parameters and then using our analysis of the data and the plot above, we will try to beat the models r comes up with

fit_1 <- train %>% 
          model(TSLM_r = TSLM(Malaria),
                ARIMA_r = ARIMA(Malaria),
                ETS_r = ETS(Malaria))

We are going to check the models fitted above for which one fits best with the data using accuracy since they are all different model families and then we will check the residuals of the best model. So we will first create a forecast and test our fits against the trainging data set

fc_1 <- fit_1 %>% forecast(new_data = test, level = NULL)
fc_1 %>% autoplot(test, level = NULL)

fc_1 %>% accuracy(test) %>% arrange(RMSE)
## # A tibble: 3 × 10
##   .model  .type      ME   RMSE    MAE    MPE  MAPE  MASE RMSSE  ACF1
##   <chr>   <chr>   <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ETS_r   Test    8048. 16713.  9423.   4.23  4.90   NaN   NaN 0.399
## 2 ARIMA_r Test  -20775. 23946. 20775. -10.9  10.9    NaN   NaN 0.626
## 3 TSLM_r  Test  -17122. 27223. 23402.  -9.54 12.2    NaN   NaN 0.657

From the models fitted above, the ETS model that r fitted did the best. It is not a very good model as we can tell from the graph and deviates completely from the testing data after 2017 so in the long run it might not be a good model but let us see what the residual components say about the model as well

fit_1 %>% select(ETS_r) %>% gg_tsresiduals()

The makeup of the residuals does not do to bad because the model matches with the test data set up until a major deviaton. The variance looks fairly constant, there is no autocorrelation in the residuals and the mean has a normal distribution. We are still going to fit other models to improve on our accuracy.

#Fitting models with predictor variables, ARIMA and ARIMA errors

We will do some testing with the data set to figure out what ARIMA model will best fit the data before fitting it with models with predictors and regression

cod_nigeria_lag %>% features(Malaria, unitroot_nsdiffs)
## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       0

The unit root test comes up with 0 for seasonality differencing because as mentioned above, there is little to no seasonality in the Malaria plot.

Now let us check to see if regular differencing is needed

cod_nigeria_lag %>% features(Malaria, unitroot_ndiffs)
## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      2

The unit root test returns that in order for the data set to be stationary, we need to take two differences of the Malaria column so we will incorporate that into our ARIMA model to improve our forecasting

We will create differences columns and check for stationarity then analyze the ACF and PACF plots that are produced

cod_nigeria_lag <- cod_nigeria_lag %>% mutate(Difference_Malaria = difference(Malaria),
                           Difference2_Malaria = difference(Difference_Malaria))

Let us check for stationarity in the double differenced column to fact chech the unit root results

cod_nigeria_lag %>% features(Difference2_Malaria, unitroot_kpss)
## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1     0.101         0.1

The high kpss values tells us that the data is stationary. We will incorporate that into our ARIMA modeling. Now let us look at the PACF and ACF plots

cod_nigeria_lag %>% gg_tsdisplay(Difference2_Malaria, plot_type = "partial")
## Warning: Removed 2 row(s) containing missing values (geom_path).
## Warning: Removed 2 rows containing missing values (geom_point).

From the result above, we are going to fit MA model and a mixed AR and MA model

#Fitting models with different variables

fit_2<- train %>% 
  model(model_1 = TSLM(Malaria ~ `Drug Use Disorders`),
        model_2 = TSLM(Malaria ~ `Drug Use Disorders` + `Lower Respiratory Infections` + `Environmental Heat and Cold Exposure`),
        model_3 = TSLM(Malaria ~ lag_Malaria + lag_environment + lag_drug_use),
        ETS_SES = ETS(Malaria ~ error("A") + trend("A") +season("N")),
        ETS_Damped = ETS(Malaria ~ error("A") + trend("Ad") +season("N")),
        ARIMA_MA = ARIMA(Malaria ~ pdq(0,2,0) + PDQ(0,0,3)),
        ARIMA_mixed = ARIMA(Malaria ~ pdq(2,2,2)),
        Arima_errors = ARIMA(Malaria ~ lag_Malaria + I(lag_Malaria^2) + lag_drug_use + I(lag_drug_use ^2 )),
        model_nnetar = NNETAR(Malaria))

#Creating a forecast with the models fitted above

fc_2 <- fit_2 %>% forecast(new_data = test, level = NULL)
fc_2 %>% autoplot(test, level = NULL)

#Checking the accuracy for the models fitted above and then picking the best one

fc_2 %>% accuracy(test) %>% arrange(RMSE)
## # A tibble: 9 × 10
##   .model       .type       ME    RMSE     MAE    MPE  MAPE  MASE RMSSE  ACF1
##   <chr>        <chr>    <dbl>   <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA_mixed  Test     5135.  13754.   8322.   2.66  4.38   NaN   NaN 0.364
## 2 ARIMA_MA     Test     6391   15360.   9017.   3.37  4.71   NaN   NaN 0.387
## 3 ETS_SES      Test     7575.  16318.   9308.   3.98  4.85   NaN   NaN 0.396
## 4 ETS_Damped   Test   -19287.  22336.  19287. -10.1  10.1    NaN   NaN 0.610
## 5 model_3      Test    31548.  38005.  31548.  16.4  16.4    NaN   NaN 0.586
## 6 model_nnetar Test   -61565.  65991.  61565. -31.7  31.7    NaN   NaN 0.645
## 7 Arima_errors Test    71487.  82068.  71487.  37.0  37.0    NaN   NaN 0.602
## 8 model_2      Test  -133667. 140332. 133667. -68.3  68.3    NaN   NaN 0.637
## 9 model_1      Test  -154304. 161381. 154304. -78.7  78.7    NaN   NaN 0.644

Model 3 seems to have the best fit with the testing data set and it also beats the models that r picked as well about by about 7122.354 on the RMSE scale.

Now we are going to assess how benchmark models do on the Malaria variable

#Fitting benchmark models

fit_3 <- train %>% 
  model(
   mean =  MEAN(Malaria),
   naive = NAIVE(Malaria),
   drift = RW(Malaria ~ drift()))

Note: We left out the seasonal naive model above because our data set is non seasonal so it would produce an error message if fitted

fc_3 <- fit_3 %>% forecast(test)
fc_3 %>% autoplot(test, level = NULL)

fc_3 %>% accuracy(test) %>% arrange(RMSE)
## # A tibble: 3 × 10
##   .model .type      ME   RMSE    MAE    MPE  MAPE  MASE RMSSE  ACF1
##   <chr>  <chr>   <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 mean   Test  -17122. 27223. 23402.  -9.54  12.2   NaN   NaN 0.657
## 2 naive  Test  -50156  54438. 50156  -25.9   25.9   NaN   NaN 0.657
## 3 drift  Test  -71884. 78443. 71884. -37.2   37.2   NaN   NaN 0.663

The mean model has the lowest RMSA but it is still very far off from the testing data set. We chose to use RMSE because it is the most sensitive to bigger and smaller values in the models. Compared to the benchmark models and the models that r picked, model 3 in the ones we fitted above did best against the data set so we will select that and do some analysis on it.

final_model <- fit_2 %>% select(model_3)
final_model %>% gg_tsresiduals()

The residual plots seem fairly decent. The variance in the residuals are constant up until 2015, there is no autocorrelation in the residuals as well and the mean has a fairly normal distribution. Overall, the model seems like a decent one.

#Create a forecast for the future

We are going to forecast 10 years into the future because anything out of that time frame might be risky since conditions can change as time goes on

forecast<- final_model %>% forecast(h = “10 years”)

Some things to consider when putting this forecast into production is making sure the variables used to forecast deaths caused by malaria are still relevant in the future; there might be other variables that need to be considered as time goes on and this will change the course of the forecasting model