Introduction

About COVID-19

Objective

The objective of this tutorial is to provide a guide for a robust analysis of the death burden of COVID-19 in low- and middle-income countries.

Mortality data are usually scattered across different agencies, such as the civil registry, health ministries or statistical offices.

Analysis framework

Data integration

Figure 1 schematically represents the complexity of dealing with mortality data. There are at least 5 pieces of information, which will be addressed from the outermost to the innermost circle.

The first circle is the estimated population, which is derived from factors such as fertility, migration and deaths, and takes into account the demographic structure (age and sex differences) of the population. Data are collected through censuses and used to estimate trends.

The second circle refers to the deaths that actually occurred. As there is no perfect count of them, they are usually estimated as expected deaths based on demographic projections and expressed as crude mortality rates.

The third circle refers to those deaths that are officially registered by different authorities, such as hospitals, doctors, police, judges and other civil or communal authorities.

The fourth circle refers to deaths caused by COVID-19, whether or not they were registered or identified as such.

Finally, the last circle refers to deaths registered as COVID-19.

As Table 1 shows, there are several possible mismatches between the registration and the identification of deaths.

Table 1: Possible sources of bias in the estimation of COVID-19 deaths

                             | Death by COVID-19 | Death by other causes
Registered as COVID-19       | OK                | False positive
Not registered as COVID-19   | False negative    | OK
Expected but not registered  | False negative    | -
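
The cells of Table 1 can be read as a small decision rule. The sketch below is only illustrative: the function name `classify`, its three boolean inputs and the sample records are hypothetical, and in practice the true cause of death is not directly observed.

```python
from collections import Counter

def classify(died_of_covid, registered, registered_as_covid):
    """Return the Table 1 cell for a single death (hypothetical helper)."""
    if not registered:
        # Expected but never registered: a missed COVID-19 death is a false negative.
        return "false negative" if died_of_covid else "-"
    if registered_as_covid:
        return "ok" if died_of_covid else "false positive"
    # Registered, but under another cause.
    return "false negative" if died_of_covid else "ok"

# Hypothetical records: (died_of_covid, registered, registered_as_covid)
records = [
    (True, True, True),     # correctly registered COVID-19 death
    (False, True, True),    # other cause registered as COVID-19
    (True, True, False),    # COVID-19 death registered under another cause
    (False, True, False),   # correctly registered non-COVID-19 death
    (True, False, False),   # COVID-19 death that was never registered
]

counts = Counter(classify(*record) for record in records)
print(counts)
```

Counting the cells in this way gives a first idea of the size of each measurement bias, provided some validated sample with known causes of death is available.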

Steps

For each circle, there are two steps:

  • Step 1: Finding and knowing the data

    • Visualising to understand trends, patterns and errors.
  • Step 2: Modelling and forecasting

    • Data quality checks (looking at data itself)

    • Data is not necessarily reconciled or matched so it is important to perform basic data checks.

    • Identify magnitudes of measurement bias (false negatives and false positives)

    • To completeness

    • Considering uncertainty

    • For instance, in the last circle, modelling estimates of excess deaths while considering the uncertainty from the previous steps

Relevant concepts

https://www.prb.org/glossary/

  • Mortality ratio \(Mortality= Deaths/Population\)

  • Crude mortality ratio

  • Estimated mortality

  • Standardised mortality ratio (SMR): the ratio between the observed number of deaths in a study population and the number of deaths that would be expected, based on the age- and sex-specific rates in a standard population and the size of the study population in the same age and sex groups. If the ratio of observed to expected deaths is greater than 1.0, there are said to be “excess deaths” in the study population.
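
As a worked illustration of the SMR definition above, the following minimal sketch computes expected deaths by applying standard age- and sex-specific rates to a study population. All rates, population sizes and the observed count are made-up numbers, and the function names are hypothetical.

```python
def expected_deaths(standard_rates, study_population):
    """Sum, over age/sex groups, of standard rate times study population size."""
    return sum(standard_rates[group] * size
               for group, size in study_population.items())

def smr(observed, standard_rates, study_population):
    """Standardised mortality ratio: observed deaths over expected deaths."""
    return observed / expected_deaths(standard_rates, study_population)

# Hypothetical standard rates (deaths per person per year) by (age group, sex).
standard_rates = {
    ("0-59", "F"): 0.002, ("0-59", "M"): 0.003,
    ("60+", "F"): 0.030, ("60+", "M"): 0.040,
}
# Hypothetical study population sizes for the same groups.
study_population = {
    ("0-59", "F"): 40_000, ("0-59", "M"): 42_000,
    ("60+", "F"): 6_000, ("60+", "M"): 5_000,
}
observed = 700  # hypothetical observed deaths in the study population

expected = expected_deaths(standard_rates, study_population)
ratio = smr(observed, standard_rates, study_population)
print(round(expected, 1), round(ratio, 3))
```

With these invented inputs the ratio is above 1.0, which under the definition above would be read as excess deaths in the study population.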

Application

Example: Peru

Peru….

Peru has 25 second-tier government units (regions).

Map

Missing pieces

Forecast missing….

First circle: Population

Step 1: Finding and knowing the data

Population data are usually compiled by combining periodic census data with adjusted projections based on fertility, mortality, migration and life expectancy rates, among others.

In the case of Peru, we provide the estimates of the total population from 2005 to 2020 produced by the National Statistics Agency (INEI). The estimated population is around 32.8 million, with an average growth rate of around 1% per year.

As the brunt of COVID-19 deaths falls on older people, understanding the evolution of the population by age range is highly relevant for the analysis. Peru has about 3.6 million people over 60, nearly 11% of the country's total population. The figure below shows how this group has grown significantly over the last two decades.

To estimate COVID-19 mortality, we will analyse it by sub-category. In this case, we will focus on the regional level, with a further disaggregation by age range and sex. The estimated population pyramid for 2020 is presented below, where a significantly larger number of women can be observed among older people.

Step 2: Analysis and forecasts

In the case of Peru, there are no updated official population estimates disaggregated by region, sex and age range. Therefore, I estimate two different sets of models in order to forecast the population up to 2020 by those categories.

Projections by region and sex

The National Statistics Agency publishes estimates per region disaggregated by sex from 2000 to 2017. The following plot shows their evolution. Lima, which concentrates around 30% of the country's population, and Lambayeque are the only two regions with a clear female majority. Other regions, such as Callao, La Libertad, Ica and Piura, show figures closer to parity, while in the remaining regions the male population outnumbers the female population.

The basic concept is that we forecast the time series of interest \(y\) assuming that it has a linear relationship with other time series \(x\).

Following Hyndman and Athanasopoulos (2019), we forecast the time series using different univariate models and compare the predicted values with ex-ante results to establish the accuracy of the predictions. The models used are the autoregressive integrated moving average (ARIMA) model, exponential smoothing (ETS), the random walk model, and time series linear models with different specifications, such as piecewise, linear and exponential regression with trends.
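
The idea of comparing candidate models by their accuracy can be sketched as follows. This is not the fable workflow used in the tutorial: it is a simplified stand-in, written in plain Python, that compares only a linear trend and a random walk with drift on a hold-out period, using the mean absolute error. The population series, the three-year hold-out and all function names are assumptions made for illustration.

```python
from statistics import mean

def forecast_linear(years, values, future_years):
    """Fit values = a + b * year by ordinary least squares and extrapolate."""
    x_bar, y_bar = mean(years), mean(values)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, values))
    sxx = sum((x - x_bar) ** 2 for x in years)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [intercept + slope * t for t in future_years]

def forecast_drift(years, values, future_years):
    """Random walk with drift: last value plus the average yearly change."""
    drift = (values[-1] - values[0]) / (years[-1] - years[0])
    return [values[-1] + drift * (t - years[-1]) for t in future_years]

def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))

# Hypothetical regional population (thousands), 2000-2017, growing 1% a year.
years = list(range(2000, 2018))
population = [1000.0 * 1.01 ** (t - 2000) for t in years]

# Hold out the last three years to measure out-of-sample accuracy.
train_years, test_years = years[:-3], years[-3:]
train_values, test_values = population[:-3], population[-3:]

scores = {
    "linear trend": mae(
        test_values, forecast_linear(train_years, train_values, test_years)),
    "random walk with drift": mae(
        test_values, forecast_drift(train_years, train_values, test_years)),
}
best = min(scores, key=scores.get)
print(scores, best)
```

The same loop extends to any other candidate model that returns forecasts for the hold-out years.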

In this case, the ARIMA models produce predicted values similar to those projected by the National Statistical Office for 2020. In the following graph we observe that the differences are lower than 0.2%, which represents a maximum difference of 750 people between the two.

The model can be written as \(y_{t} = \beta_{0} + \beta_{1}x_{t}+\eta_{t}\), where \(\eta_{t}\) is the error term, which is allowed to follow an ARIMA process. We use the R package fable, which allows automated searches through the model space to identify the best ARIMA model, that is, the one with the lowest value of the selection criterion, in this case the mean absolute error (MAE).

Projections by region and range of age

The same process was applied to model age ranges by region and, to be consistent, we chose ARIMA models for the forecasting; they also produce smaller differences from the official figures, as can be seen in the following plot. The outliers are values for young age groups in the regions of the Amazon forest (Amazonas, Loreto, Ucayali).

As we are dealing with uncertainty, we want to estimate a range of values that is likely to include the population value with a certain degree of confidence. In this case, we use a 95% confidence level.

The 95% CI for the population is given by \(\hat y \pm 1.96\hat\sigma\), where \(\hat y\) are the predicted values and \(\hat\sigma\) is the estimated standard deviation of the residuals of the ARIMA model. Having \(\hat y_{pop}\) and \(\hat y_{CRM}\), the computation of expected deaths is simply \(\hat{Deaths} = \hat y_{pop}\hat y_{CRM}\).

Below we plot the previous graph again, adding the predicted values with their confidence intervals.
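
A minimal sketch of the interval \(\hat y \pm 1.96\hat\sigma\), assuming the residual standard deviation is computed from in-sample fitted values. The observed and fitted series, the point forecast and the function names are hypothetical and are not output from the fitted ARIMA models.

```python
from math import sqrt

def residual_sd(actual, fitted):
    """Standard deviation of the residuals (no degrees-of-freedom correction)."""
    residuals = [a - f for a, f in zip(actual, fitted)]
    return sqrt(sum(r * r for r in residuals) / len(residuals))

def interval_95(point_forecast, sigma):
    """Return the (lower, upper) bounds of y_hat +/- 1.96 * sigma."""
    half_width = 1.96 * sigma
    return point_forecast - half_width, point_forecast + half_width

# Hypothetical observed and fitted population values (thousands).
actual = [100.0, 102.0, 104.5, 106.0]
fitted = [100.5, 101.5, 104.0, 106.5]

sigma = residual_sd(actual, fitted)
lower, upper = interval_95(108.0, sigma)  # 108.0 is a made-up point forecast
print(sigma, lower, upper)
```

Forecasting packages typically widen the interval as the horizon grows; the constant-width version here corresponds only to the simple formula stated in the text.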

Second circle: Expected deaths

Step 1: Finding and knowing the data

The first focus is on expected deaths, which are estimated through the crude mortality rate (CMR). The CMR represents the number of deaths per 1,000 people in the population over a year.

The CMR depends on many factors, such as the population pyramid, life expectancy and external shocks such as natural hazards or, in this case, diseases.

As a first step, it is useful to compare the country's official rate with those of similar countries. Information can be found here.

In the following graph I present the evolution of the CMR over recent years in South America. As of 2020, the expected mortality rate of Peru is 5.83 deaths per 1,000 people per year.

Step 2: Analysis and forecasts

While the pattern for Peru appears to be relatively linear, when we turn to region-level estimates of mortality rates over the same period, the picture changes significantly, as can be seen below. Many regions in the Amazon basin and the Andes have had a significant decrease in their expected mortality rates, such as Amazonas, with a drop from 14.87 in 1990 to 8.37 expected deaths per 1,000 people in 2020.

Data on expected mortality, in the case of Peru, are more limited than data on population. For instance, at the regional level, values are available only every 5 years, without further disaggregation into sub-categories. Due to these limitations, we will estimate the yearly crude mortality rate by region using the same tools as before. In this case, we will compare the predicted and expected results for 2020.

The most accurate predictions were achieved by the linear piecewise models, with knots in 2000, 2005 and 2010, as seen in the bottom-right table.

Projections by region and sex

Considering that \(CRM = Deaths/Population\), having estimates of \(Population\) and \(CRM\) allows us to estimate the number of deaths per year at the regional level. Additionally, to deal with uncertainty, we estimate a lower and an upper bound for deaths.

First, we estimate a 95% confidence interval for \(CRM\) using Byar’s approximation (Breslow, Day, and Heseltine 1980), which is computed as follows:

\(\hat{CRM}_{lower} = O \left(1 - \frac{1}{9O} - \frac{1.96}{3\sqrt{O}}\right)^3 / n\)

\(\hat{CRM}_{higher} = (O+1) \left(1 - \frac{1}{9(O+1)} + \frac{1.96}{3\sqrt{O+1}}\right)^3 / n\)

where \(O\) is the number of observed counts, in this case, deaths, and \(n\) refers to the population.
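
Byar’s approximation is straightforward to compute directly. The sketch below uses its standard form, with the normal quantile 1.96 for a 95% interval and a plus sign in the upper bound; the function name `byar_interval` and the example counts are hypothetical.

```python
from math import sqrt

def byar_interval(observed, population, z=1.96):
    """Approximate confidence interval for the rate observed / population.

    Uses Byar's approximation to the Poisson limits of the observed count;
    z = 1.96 corresponds to a 95% interval.
    """
    if observed <= 0:
        raise ValueError("Byar's approximation needs a positive count")
    o = observed
    lower_count = o * (1 - 1 / (9 * o) - z / (3 * sqrt(o))) ** 3
    o1 = o + 1
    upper_count = o1 * (1 - 1 / (9 * o1) + z / (3 * sqrt(o1))) ** 3
    return lower_count / population, upper_count / population

# Hypothetical example: 100 deaths in a population of 20,000 people.
rate_lower, rate_upper = byar_interval(100, 20_000)
print(rate_lower * 1000, rate_upper * 1000)  # bounds per 1,000 people
```

The bounds are returned per person, so they are multiplied by 1,000 here to match the usual way of quoting a crude mortality rate.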

Then, we estimate our deaths’ confidence intervals as the product of the lower and upper values obtained from \(\hat{CRM}\) and \(\hat y_{pop}\), as follows:

\(\hat{Deaths_{lower}} = \hat{CRM}_{lower} * (\hat y_{pop} - 1.96\hat\sigma_{pop})\)

\(\hat{Deaths_{higher}} = \hat{CRM}_{higher} * (\hat y_{pop} + 1.96\hat\sigma_{pop})\)
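
The two bounds above combine the rate interval with the population interval. A minimal sketch, assuming the rate bounds are expressed per person (a rate quoted per 1,000 people would first be divided by 1,000); every number and the function name `death_bounds` are illustrative.

```python
def death_bounds(crm_lower, crm_upper, pop_hat, pop_sigma, z=1.96):
    """Lower and upper bounds for deaths from rate and population bounds."""
    pop_lower = pop_hat - z * pop_sigma
    pop_upper = pop_hat + z * pop_sigma
    return crm_lower * pop_lower, crm_upper * pop_upper

# Hypothetical region: rate bounds of 5.5 and 6.2 deaths per 1,000 people,
# a forecast population of 1,000,000 and a residual standard deviation of 5,000.
deaths_lower, deaths_upper = death_bounds(
    crm_lower=5.5 / 1000,
    crm_upper=6.2 / 1000,
    pop_hat=1_000_000,
    pop_sigma=5_000,
)
print(round(deaths_lower, 1), round(deaths_upper, 1))
```

Multiplying the two lower bounds and the two upper bounds gives a deliberately wide range, since it assumes both quantities are at their extremes at the same time.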

The widest ranges are 908 and 862 deaths for females and males, respectively, in Lima (\(\mu=29493.9\)).

Projections by region and range of age

The same procedure was applied to the estimation of expected mortality for regions and range of age.

The largest ranges are 1523.7 and 824.3 deaths, for Lima in the 20-30 and 10-20 age ranges, respectively.

Third circle: Registered deaths

Step 1: Finding and knowing the data

Conclusion

Resources

References

Breslow, Norman E, Nicholas E Day, and Elisabeth Heseltine. 1980. Statistical methods in cancer research. Vol. 1. Lyon: International Agency for Research on Cancer.

Hyndman, R, and G Athanasopoulos. 2019. Forecasting: principles and practice. 3rd ed. Melbourne, Australia: OTexts.


  1. University of East Anglia

  2. University of East Anglia