#Project setup
setwd("~/Downloads/BANA4090 FINAL PROJECT")
cod<- read_csv("cod.csv.zip")
## Multiple files in zip: reading 'cause_of_deaths.csv'
## Rows: 6120 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Country/Territory, Code
## dbl (32): Year, Meningitis, Alzheimer's Disease and Other Dementias, Parkins...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
show_col_types = FALSE
cod
## # A tibble: 6,120 × 34
## Country…¹ Code Year Menin…² Alzhe…³ Parki…⁴ Nutri…⁵ Malaria Drown…⁶ Inter…⁷
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanis… AFG 1990 2159 1116 371 2087 93 1370 1538
## 2 Afghanis… AFG 1991 2218 1136 374 2153 189 1391 2001
## 3 Afghanis… AFG 1992 2475 1162 378 2441 239 1514 2299
## 4 Afghanis… AFG 1993 2812 1187 384 2837 108 1687 2589
## 5 Afghanis… AFG 1994 3027 1211 391 3081 211 1809 2849
## 6 Afghanis… AFG 1995 3102 1225 394 3131 175 1881 2969
## 7 Afghanis… AFG 1996 3193 1239 398 3175 175 1969 3331
## 8 Afghanis… AFG 1997 3304 1253 402 3250 240 2078 3028
## 9 Afghanis… AFG 1998 3281 1267 405 3193 563 2098 3098
## 10 Afghanis… AFG 1999 3200 1281 409 3115 468 2084 2917
## # … with 6,110 more rows, 24 more variables: `Maternal Disorders` <dbl>,
## # `HIV/AIDS` <dbl>, `Drug Use Disorders` <dbl>, Tuberculosis <dbl>,
## # `Cardiovascular Diseases` <dbl>, `Lower Respiratory Infections` <dbl>,
## # `Neonatal Disorders` <dbl>, `Alcohol Use Disorders` <dbl>,
## # `Self-harm` <dbl>, `Exposure to Forces of Nature` <dbl>,
## # `Diarrheal Diseases` <dbl>, `Environmental Heat and Cold Exposure` <dbl>,
## # Neoplasms <dbl>, `Conflict and Terrorism` <dbl>, …
#Introduction This dataset contains historical data on causes of death around the world from different countries
The dataset is from the Kaggle site and was aggregated by different authors for research purposes
The index for this data set would be the year column. There are many different time series in the dataset but since we are looking to forecast for a specific country, we will let autoplot map out what Malaria deaths look like in different places and choose the country that has interesting information.
#Data wrangling
cod %>% as_tsibble(index = Year, key = Code) %>% filter(Code == "NGA" | Code == "BGD" | Code == "BTN" | Code == "MYS" | Code == "MUS" | Code == "PAK") %>% autoplot(Malaria)
After primary analysis of different countries, we decided that the
forecast variable will be the number of deaths caused by malaria in
Nigeria because they have one of the highest deaths per 100,000
population according to the health cases data. The predictor variables
will be: nutritional deficiencies, drug use disorders, exposure to
forces of nature, environmental heat and cold exposure, conflict and
terrorism (after some graphing against the forecast variable, this could
change)
We chose this forecast variable because Malaria is a very dangerous and prevalent disease throughout the world taking many lives yearly so we wanted to investigate what other factors can contribute to those deaths in certain area of the world other than being bitten by a female anopheles mosquito
This forecast can raise awareness amongst people and bring their attention to variables that contribute to malaria deaths other than the obvious female anopheles mosquito
cod_work<- cod %>%
filter(`Country/Territory` == "Nigeria") %>%
select(Year, `Country/Territory` , Malaria, `Nutritional Deficiencies` , `Drug Use Disorders`, `Exposure to Forces of Nature`, `Environmental Heat and Cold Exposure` , `Lower Respiratory Infections` , `Conflict and Terrorism`)
cod_work
## # A tibble: 30 × 9
## Year Country/Terri…¹ Malaria Nutri…² Drug …³ Expos…⁴ Envir…⁵ Lower…⁶ Confl…⁷
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2007 Nigeria 267368 9382 161 91 909 208401 628
## 2 2008 Nigeria 280604 8850 165 0 892 202257 913
## 3 2009 Nigeria 276715 8663 172 31 895 199250 2668
## 4 2013 Nigeria 227310 7542 220 19 910 191077 6260
## 5 2016 Nigeria 186482 6884 246 46 962 191915 5064
## 6 1990 Nigeria 148931 10236 95 0 728 169472 81
## 7 1991 Nigeria 157502 10517 100 0 744 174686 13
## 8 1992 Nigeria 165722 10624 105 0 782 180284 147
## 9 1993 Nigeria 173695 10693 110 0 781 185946 3
## 10 1994 Nigeria 180588 11053 113 30 800 192168 90
## # … with 20 more rows, and abbreviated variable names ¹`Country/Territory`,
## # ²`Nutritional Deficiencies`, ³`Drug Use Disorders`,
## # ⁴`Exposure to Forces of Nature`, ⁵`Environmental Heat and Cold Exposure`,
## # ⁶`Lower Respiratory Infections`, ⁷`Conflict and Terrorism`
cod_nigeria <- cod_work %>% as_tsibble(index = Year)
There is no missing data in our data set and as of now we are going to use the variables within the times series for forecasting. Given that they are numerical values we will refrain from creating new variables like dummy variables or fourier variables. Once we start fitting models, we will include lags of some of the predictor variables in the table to aid in forecasting but as of now we refrainfrom doing so.
#Exploratory anaylsis and visualization for the data set
cod_nigeria %>% autoplot(Malaria)
There seems to be a constant rise in the deaths up until approximately
2008 where it drops steadily until approximately 2017 and then rises
again until 2020.
Now we are going to check for stationarity in the data set that we chose using an ACF plot and an unit root kpss test.
ACF plot analysis
cod_nigeria %>% ACF(Malaria) %>% autoplot()
There seems to be a decaying correlation between the lags of the values
related to deaths caused by Malaria but there is some strong correlation
in the first lags which means that the data set is not stationary; let
us confirm this with a unit root test
Unit root test
cod_nigeria %>% features(Malaria, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.411 0.0723
The low kpss value tells us that the data is not stationary. We can handle this by differencing out the data, taking a log transform or taking a box-cox transformation but because there is not a lot of seasonality in the dataset, we do not want to extract the meaningful information that might be held in the slight variance in mean.
Now we are going to plot some of the predictor variables against the forecast variable to see if there is a relationship between the two and to pick what variables to include in our models
cod_nigeria %>% ggplot(aes(x = `Nutritional Deficiencies`, y = Malaria))+ geom_point() + labs(title = "Malaria by nutrition")
cod_nigeria %>% ggplot(aes(x = `Exposure to Forces of Nature`, y = Malaria))+ geom_point() + labs(title = "Malaria by exposure to nature")
cod_nigeria %>% ggplot(aes(x = `Drug Use Disorders`, y = Malaria))+ geom_point() + labs(title = "Malaria by drug use")
cod_nigeria %>% ggplot(aes(x = `Lower Respiratory Infections`, y = Malaria))+ geom_point() + labs(title = "Malaria by infection")
cod_nigeria %>% ggplot(aes(x = `Environmental Heat and Cold Exposure`, y = Malaria))+ geom_point() + labs(title = "Malaria by Environmental pressure")
cod_nigeria %>% ggplot(aes(x = `Conflict and Terrorism`, y = Malaria))+ geom_point() + labs(title = "Malaria by national state")
From plotting the predictor variables against the forecast variable, it seems that there is a relationship between Malaria deaths and deaths caused by Drug use disorders, Environmental cold and heat exposure and potentially lower respiratory infections.
The deaths from Malaria and drug use disorders having a correlation could result from treating Malaria with drugs but when the drug use goes over a certain threshold, it could become detrimental to the health. We see that as deaths in Malaria increase with certain drug use disorders and then start to decrease and then rise again
Environmental conditions typically have a strong relationship with Malaria cases and infections with Malaria worsen in hot temperatures. So as the deaths from exposure to cold and heat pressure increase, it is natural that deaths caused by Malaria would increase as well.
The deaths caused by lower respiratory infections have a strange but positive relationship with the deaths caused by Malaria. This could be because as cases involving lower respiratory infections get worse, your ability to fight off Malaria reduces as well reduces which results in the deaths trending together.
Before fitting models, we are going to add lags of some variables to add to some of our model fittings like regression
cod_nigeria_lag<- cod_nigeria %>% mutate(
lag_Malaria = lag(Malaria, 1 ),
lag_environment = lag(`Environmental Heat and Cold Exposure`, 1),
lag_drug_use = lag(`Drug Use Disorders`, 1) ) %>% filter(!is.na(lag_Malaria)) %>% filter(!is.na(lag_environment)) %>% filter(!is.na(lag_drug_use))
#Model Fitting
train<- cod_nigeria_lag %>% filter(Year < "2012")
test<- cod_nigeria_lag %>% filter(Year >= "2012")
We chose to split the training set before 2012 because that is approximately 80% of the data and the rest is the remaining 20%
#Fit TSLM, ETS and ARIMA models
For the first model fitting, we will have r pick all the parameters and then using our analysis of the data and the plot above, we will try to beat the models r comes up with
fit_1 <- train %>%
model(TSLM_r = TSLM(Malaria),
ARIMA_r = ARIMA(Malaria),
ETS_r = ETS(Malaria))
We are going to check the models fitted above for which one fits best with the data using accuracy since they are all different model families and then we will check the residuals of the best model. So we will first create a forecast and test our fits against the trainging data set
fc_1 <- fit_1 %>% forecast(new_data = test, level = NULL)
fc_1 %>% autoplot(test, level = NULL)
fc_1 %>% accuracy(test) %>% arrange(RMSE)
## # A tibble: 3 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ETS_r Test 8048. 16713. 9423. 4.23 4.90 NaN NaN 0.399
## 2 ARIMA_r Test -20775. 23946. 20775. -10.9 10.9 NaN NaN 0.626
## 3 TSLM_r Test -17122. 27223. 23402. -9.54 12.2 NaN NaN 0.657
From the models fitted above, the ETS model that r fitted did the best. It is not a very good model as we can tell from the graph and deviates completely from the testing data after 2017 so in the long run it might not be a good model but let us see what the residual components say about the model as well
fit_1 %>% select(ETS_r) %>% gg_tsresiduals()
The makeup of the residuals does not do to bad because the model matches with the test data set up until a major deviaton. The variance looks fairly constant, there is no autocorrelation in the residuals and the mean has a normal distribution. We are still going to fit other models to improve on our accuracy.
#Fitting models with predictor variables, ARIMA and ARIMA errors
We will do some testing with the data set to figure out what ARIMA model will best fit the data before fitting it with models with predictors and regression
cod_nigeria_lag %>% features(Malaria, unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 0
The unit root test comes up with 0 for seasonality differencing because as mentioned above, there is little to no seasonality in the Malaria plot.
Now let us check to see if regular differencing is needed
cod_nigeria_lag %>% features(Malaria, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 2
The unit root test returns that in order for the data set to be stationary, we need to take two differences of the Malaria column so we will incorporate that into our ARIMA model to improve our forecasting
We will create differences columns and check for stationarity then analyze the ACF and PACF plots that are produced
cod_nigeria_lag <- cod_nigeria_lag %>% mutate(Difference_Malaria = difference(Malaria),
Difference2_Malaria = difference(Difference_Malaria))
Let us check for stationarity in the double differenced column to fact chech the unit root results
cod_nigeria_lag %>% features(Difference2_Malaria, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.101 0.1
The high kpss values tells us that the data is stationary. We will incorporate that into our ARIMA modeling. Now let us look at the PACF and ACF plots
cod_nigeria_lag %>% gg_tsdisplay(Difference2_Malaria, plot_type = "partial")
## Warning: Removed 2 row(s) containing missing values (geom_path).
## Warning: Removed 2 rows containing missing values (geom_point).
From the result above, we are going to fit MA model and a mixed AR and MA model
#Fitting models with different variables
fit_2<- train %>%
model(model_1 = TSLM(Malaria ~ `Drug Use Disorders`),
model_2 = TSLM(Malaria ~ `Drug Use Disorders` + `Lower Respiratory Infections` + `Environmental Heat and Cold Exposure`),
model_3 = TSLM(Malaria ~ lag_Malaria + lag_environment + lag_drug_use),
ETS_SES = ETS(Malaria ~ error("A") + trend("A") +season("N")),
ETS_Damped = ETS(Malaria ~ error("A") + trend("Ad") +season("N")),
ARIMA_MA = ARIMA(Malaria ~ pdq(0,2,0) + PDQ(0,0,3)),
ARIMA_mixed = ARIMA(Malaria ~ pdq(2,2,2)),
Arima_errors = ARIMA(Malaria ~ lag_Malaria + I(lag_Malaria^2) + lag_drug_use + I(lag_drug_use ^2 )),
model_nnetar = NNETAR(Malaria))
#Creating a forecast with the models fitted above
fc_2 <- fit_2 %>% forecast(new_data = test, level = NULL)
fc_2 %>% autoplot(test, level = NULL)
#Checking the accuracy for the models fitted above and then picking the
best one
fc_2 %>% accuracy(test) %>% arrange(RMSE)
## # A tibble: 9 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA_mixed Test 5135. 13754. 8322. 2.66 4.38 NaN NaN 0.364
## 2 ARIMA_MA Test 6391 15360. 9017. 3.37 4.71 NaN NaN 0.387
## 3 ETS_SES Test 7575. 16318. 9308. 3.98 4.85 NaN NaN 0.396
## 4 ETS_Damped Test -19287. 22336. 19287. -10.1 10.1 NaN NaN 0.610
## 5 model_3 Test 31548. 38005. 31548. 16.4 16.4 NaN NaN 0.586
## 6 model_nnetar Test -61565. 65991. 61565. -31.7 31.7 NaN NaN 0.645
## 7 Arima_errors Test 71487. 82068. 71487. 37.0 37.0 NaN NaN 0.602
## 8 model_2 Test -133667. 140332. 133667. -68.3 68.3 NaN NaN 0.637
## 9 model_1 Test -154304. 161381. 154304. -78.7 78.7 NaN NaN 0.644
Model 3 seems to have the best fit with the testing data set and it also beats the models that r picked as well about by about 7122.354 on the RMSE scale.
Now we are going to assess how benchmark models do on the Malaria variable
#Fitting benchmark models
fit_3 <- train %>%
model(
mean = MEAN(Malaria),
naive = NAIVE(Malaria),
drift = RW(Malaria ~ drift()))
Note: We left out the seasonal naive model above because our data set is non seasonal so it would produce an error message if fitted
fc_3 <- fit_3 %>% forecast(test)
fc_3 %>% autoplot(test, level = NULL)
fc_3 %>% accuracy(test) %>% arrange(RMSE)
## # A tibble: 3 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 mean Test -17122. 27223. 23402. -9.54 12.2 NaN NaN 0.657
## 2 naive Test -50156 54438. 50156 -25.9 25.9 NaN NaN 0.657
## 3 drift Test -71884. 78443. 71884. -37.2 37.2 NaN NaN 0.663
The mean model has the lowest RMSA but it is still very far off from the testing data set. We chose to use RMSE because it is the most sensitive to bigger and smaller values in the models. Compared to the benchmark models and the models that r picked, model 3 in the ones we fitted above did best against the data set so we will select that and do some analysis on it.
final_model <- fit_2 %>% select(model_3)
final_model %>% gg_tsresiduals()
The residual plots seem fairly decent. The variance in the residuals are
constant up until 2015, there is no autocorrelation in the residuals as
well and the mean has a fairly normal distribution. Overall, the model
seems like a decent one.
#Create a forecast for the future
We are going to forecast 10 years into the future because anything out of that time frame might be risky since conditions can change as time goes on
forecast<- final_model %>% forecast(h = “10 years”)
Some things to consider when putting this forecast into production is making sure the variables used to forecast deaths caused by malaria are still relevant in the future; there might be other variables that need to be considered as time goes on and this will change the course of the forecasting model