Introduction

This page describes how routine data on the number of people living in an area, broken down by age-group (e.g. 0-1 years old, 1-4 years old etc) and data on who has died in each age-group, can tell us about the health of a population - generating a measure known as life expectancy.

Life expectancy lets us generate a measure that tells us how long people can be expected to live. Health Knowledge has some more information about this and other measures (https://www.healthknowledge.org.uk/public-health-textbook/health-information/3a-populations/life-tables-demographic-applications), while the Office for National Statistics has a lot more information on how life expectancy is calculated and the various approaches that can be used to calculating the measure (https://www.ons.gov.uk/peoplepopulationandcommunity/healthandsocialcare/healthandlifeexpectancies/articles/lifeexpectancyreleasesandtheirdifferentuses/2018-12-17).

One useful application of this measure is to observe over time, as policy, service and/or changes in the environment of a population, how changes in mortality affect life expectancy. The chart below, from an analysis we did in Wales in 2021 (https://pubmed.ncbi.nlm.nih.gov/33735693/), shows how life expectancy changed from 2002 to 2018, with a major slowdown in improvement after around 2012:

Chart showing trends in life expectancy for females and males, 2002-2018

First we need to load our data.

for this we will access the Application Programming Interface/API that allows R to scrape data from NOMIS - a large data repository featuring a catalogue of multiple UK government datasets.

see https://www.nomisweb.co.uk. For a summary of nomisR, the package we are going to use, see https://cran.r-project.org/web/packages/nomisr.

NOMIS provides data for across the UK, but let’s focus for now on Wales, a relatively small country with around 3 million people, large enough to produce estimates with some accuracy (more on that later).

Population data

x <- nomis_search(name = '*population*')

The dataset we want is NM_31_1, population estimates by 5-year age band and sex at local authority level.

We now need to find the code for the right geographic level:

c <- nomis_get_metadata(id = "NM_31_1", concept = "geography", type = "type")

The id for the geographic level (country) is TYPE499. We are now ready to pull the population data from NOMIS using the function nomis_get_data.

df <- nomis_get_data(id = "NM_31_1", 
                     time = "latest", 
                     geography = "TYPE499")

This data covers several UK geographies. We need to filter this data focussing just on Wales. We need to see the different levels available in each geography and their description before we know what to filter for.

unique(df$GEOGRAPHY_NAME)

df_wales <- df %>% 
  filter(GEOGRAPHY_NAME == "Wales")

Now, we don’t need every variable in this dataset, as it covers lots of descriptors, metadata (https://en.wikipedia.org/wiki/Metadata) so select the variables we do need, and pick just the observed value of the population estimate and remove the all ages value as we don’t need

df_wales %>% names()

popn_wales_22 <- df_wales %>%
  filter(MEASURES_NAME == "Value" & AGE_NAME != "All ages") %>%
  select(DATE, GEOGRAPHY_NAME, SEX_NAME, AGE_NAME, 
         OBS_VALUE)

Let’s also rename the variables to make them more simple and easier to refer to. Shorter variable names without spaces are best.

popn_wales_22 <- popn_wales_22 %>% 
  rename(year=1, country=2, sex=3, age_group=4, popn=5)

We now have a dataset for Welsh population estimates for 2022, broken down by age-group, in 5-year age bands, and sex. That wasn’t all that hard was it?

Mortality data

Now we’ve got what’s called the denominator (https://www.healthknowledge.org.uk/public-health-textbook/research-methods/1a-epidemiology/numerators-denominators-populations) or population health data for our analysis, let’s focus on extracting data on mortality for the same period (2022) also from NOMIS.

We need to know again which dataset we’re pulling data from, so we’ll use again the nomis_search function from nomisR.

y <- nomis_search(name = '*mortality*')

The dataset ID we are looking for is NM_161_1, “Mortality statistics - underlying cause, sex and age”. This dataset can let you explore data on causes of death from specific conditions (more on that in a future tutorial) - for now we’re going to just use the all-cause mortality which adds deaths from different causes or conditions together.

Like with the population data, we need to see what geographic levels we can get data at for this dataset.

d <- nomis_get_metadata(id = "NM_161_1", concept = "geography", type = "type")

Again we want data at a national (Wales) level, so we’ll need to use the same TYPE499 ID we used above.

As mentioned above, we want to focus on deaths from all causes, for both genders, and also just for Wales. We’ll need to find the ID for all-cause mortality.

e <- nomis_get_metadata(id = "NM_161_1", concept = "CAUSE_OF_DEATH")

The ID we want for all cause mortality is 888888. Let’s see what the ID is for specific genders:

f <- nomis_get_metadata(id = "NM_161_1", concept = "GENDER")

The ID codes we want then for gender are 1 and 2.

We are ready now to scrape the mortality data with the specific geography, cause and gender levels.

deaths <- nomis_get_data(id = "NM_161_1", 
                               time = "latest", 
                         geography = "TYPE499", 
                         CAUSE_OF_DEATH = "888888",
                         GENDER = c(1,2)
                         )

This data does though include mortality in England and other UK nations, so let’s filter just for Welsh data:

deaths_wales <- deaths %>% 
  filter(GEOGRAPHY_NAME == "Wales")

Like we did with the population data, we can lose several of the variables in this dataframe, and can filter for certain levels of the data that we need, as well as renaming the variables:

deaths_wales <- deaths_wales %>%
  filter(AGE_NAME != "total (all ages)" & MEASURE_NAME == "Deaths") %>%
  select(DATE, GEOGRAPHY_NAME, GENDER_NAME, AGE_NAME, OBS_VALUE)

deaths_wales <- deaths_wales %>%
  rename(year=1, country=2, sex=3, age_group=4, deaths=5)

We now have a dataset of deaths registered in Wales in 2022, broken down by sex and age-group.

Before we combine the mortality and population data, we need to make sure the age-groups in the deaths dataset are identical to our population dataset (this is important for the next step):

unique(deaths_wales$age_group)
unique(popn_wales_22$age_group)

#the oldest age-group in the population data is 85+, whereas it is 90+ in the mortality data. As we will see later, the life expectancy function expects the open-ended age-group to be 90+, so while we could work with 85+, it’s better to have more detailed data in this age-group, so we’ll need to go back to NOMIS to get more population data

x <- nomis_search(name = '*population*')

The dataset in NOMIS for population estimates by single year age-band is “NM_2002_1”). Let’s extract this data for 2022 and inspect:

popn_sing_yr <- nomis_get_data(id = "NM_2002_1", 
                     time = "latest", 
                     geography = "TYPE499")

We only need the 90+ data for each sex so can filter out the rest of the data, which covers every single age year (0,1,2,3…etc).

unique(popn_sing_yr$C_AGE_NAME)

popn_sing_yr %>% names()

unique(popn_sing_yr$GEOGRAPHY_NAME)

popn_sing_yr_wales_90_plus <- popn_sing_yr %>%
  filter(GEOGRAPHY_NAME == "Wales") %>%
  filter(MEASURES_NAME == "Value" & C_AGE_NAME == "Aged 90+" & GENDER_NAME %in% c("Male", "Female")) %>%
  select(DATE, GEOGRAPHY_NAME, GENDER_NAME, C_AGE_NAME, 
         OBS_VALUE)

We can now rename the variables in this dataframe and join it with the original population dataset, using bind_rows:

popn_sing_yr_wales_90_plus <- popn_sing_yr_wales_90_plus %>% 
  rename(year=1, country=2, sex=3, age_group=4, popn=5)

popn_wales_22 %>% head()
popn_sing_yr_wales_90_plus %>% head()

#we can merge these datasets quite easily just by combining the rows

popn_wales_22 <- bind_rows(popn_wales_22, popn_sing_yr_wales_90_plus)

We need again to make sure the age groups between the population and mortality datasets are identical:

unique(popn_wales_22$age_group)
unique(deaths_wales$age_group)

The age_group labels are both different and differently described, so we will need to filter out those not needed and relabel. First let’s filter out from the population dataset which has broad categories we don’t need to keep:

A good trick for this is create a function to filter out similar to %in% but the opposite (see https://stackoverflow.com/questions/5831794/opposite-of-in-exclude-rows-with-values-specified-in-a-vector):

`%nin%` <- Negate(`%in%`)

popn_wales_22 <- popn_wales_22 %>%
  filter(age_group %nin% c("Aged 16 - 59/64", "Aged 18 - 24", 
                           "Aged 16 - 64", "Aged 0 - 15", 
                           "Aged 65 and over"))

Compare the age bands again:

unique(popn_wales_22$age_group)
unique(deaths_wales$age_group)

Recoding data can be done in several ways, and is often more efficient using some sort of function or programming loop. For now let’s do it manually using the case_when function from the dplyr package:

deaths_wales <- deaths_wales %>%
  mutate(age_group = case_when(
    age_group == "Aged under 1" ~ 0,
    age_group == "Aged 1 to 4" ~ 1,
    age_group == "Aged 5 to 9" ~ 5,
    age_group == "Aged 10-14" ~ 10,
    age_group == "Aged 15-19" ~ 15,
    age_group == "Aged 20-24" ~ 20,
    age_group == "Aged 25-29" ~ 25,
    age_group == "Aged 30-34" ~ 30,
    age_group == "Aged 35-39" ~ 35,
    age_group == "Aged 40-44" ~ 40,
    age_group == "Aged 45-49" ~ 45,
    age_group == "Aged 50-54" ~ 50,
    age_group == "Aged 55-59" ~ 55,
    age_group == "Aged 60-64" ~ 60,
    age_group == "Aged 65-69" ~ 65,
    age_group == "Aged 70-74" ~ 70,
    age_group == "Aged 75-79" ~ 75,
    age_group == "Aged 80-84" ~ 80,
    age_group == "Aged 85-89" ~ 85,
    age_group == "Aged 90 and over" ~ 90,
    TRUE ~ as.numeric(age_group)))

Repeat this for the population dataframe:

unique(popn_wales_22$age_group)

popn_wales_22 <- popn_wales_22 %>%
  mutate(age_group = case_when(
    age_group == "Aged under 1 year" ~ 0,
    age_group == "Aged 1 - 4 years" ~ 1,
    age_group == "Aged 5 - 9 years" ~ 5,
    age_group == "Aged 10 - 14 years" ~ 10,
    age_group == "Aged 15 - 19 years" ~ 15,
    age_group == "Aged 20 - 24 years" ~ 20,
    age_group == "Aged 25 - 29 years" ~ 25,
    age_group == "Aged 30 - 34 years" ~ 30,
    age_group == "Aged 35 - 39 years"~ 35,
    age_group == "Aged 40 - 44 years" ~ 40,
    age_group == "Aged 45 - 49 years" ~ 45,
    age_group == "Aged 50 - 54 years" ~ 50,
    age_group == "Aged 55 - 59 years" ~ 55,
    age_group == "Aged 60 - 64 years" ~ 60,
    age_group == "Aged 65 - 69 years" ~ 65,
    age_group == "Aged 70 - 74 years"~ 70,
    age_group == "Aged 75 - 79 years" ~ 75,
    age_group == "Aged 80 - 84 years" ~ 80,
    age_group == "Aged 85 and over" ~ 85,
    age_group == "Aged 90+" ~ 90,
    TRUE ~ as.numeric(age_group)))

Now the dataframes both have identical age group data, let’s merge them using the left_join function from the dplyr package. For a good overview of this and other join functions, see https://lindsaydbrin.github.io/CREATE_R_Workshop/Lesson_-_dplyr_join.html.

le_data <- left_join(deaths_wales, popn_wales_22, 
          by = c("year", "country", "sex", "age_group")) %>%
  rename(ageband = age_group)

We now have a single dataset of population and mortality data, broken down by sex and age-group, that will allow us to estimate life expectancy for each sex in 2022 in Wales.

To calculate life expectancy, we need to choose what type of measure we’re calculating. Because we’re using data and the age-specific mortality rates in a specific period, we’ll be estimating a measure known as period life expectancy. Period life expectancy can be calculated using different approaches too, but to keep it simple we’ll follow the approach used by the ONS and public health agencies across the UK (https://www.ons.gov.uk/peoplepopulationandcommunity/healthandsocialcare/healthandlifeexpectancies/methodologies/guidetocalculatingnationallifetables).

Life tables involve the calculation of several different variables, from a dataset of deaths and population counts in a certain period and population. First we calculate the mortality rate in each age group, before then applying this to a hypothetical population, for example of 100,000 people, before applying the probability of dying in each period to the number of people surviving to the beginning of a period.

The link above gives a good descrption for each step - for the code here, I’ve copied the approach taken in an earlier study we did with some public health colleagues to explore inequalities in life expectancy between the most and least deprived areas in Wales (more on how we did this later) - see https://linkinghub.elsevier.com/retrieve/pii/S0033-3506(21)00047-0 for more.

life_table <- le_data %>%
  group_by(year, sex) %>%
  mutate(ax #average fraction of interval lived
         = if_else(ageband %in% c("0"), 0.1, 0.5),
         Mx #age-specific death rate
         = deaths/popn, 
         n #number of years between intervals
         = if_else(ageband <90, dplyr::lead(ageband)-ageband, 2/Mx), 
         qx #proportion dying in interval
         = if_else(ageband <90, (n*Mx)/(1+n*(1-ax)*Mx), 1),
         px #proportion surviving 
         = 1-qx,
         lx #number alive at age x
         = if_else(ageband %in% c("0"), 100000, 1),
         px_lag = dplyr::lag(px, default = 1),
         lx = if_else(ageband >0, cumprod(lx*px_lag), lx),
         dx #number dying in interval
         = lx - dplyr::lead(lx, default = 0),
         Lx #total years lived in interval
         = if_else(ageband <90, n*(dplyr::lead(lx)+(ax*dx)), lx/Mx),
         Tx #total years lived beyond age x
         = rev(cumsum(rev(Lx))),
         ex #estimated life expectancy
         = Tx/lx)

life_table %>% select(sex, ageband, ex) %>% filter(ageband == 0)

There is a further step we can take to add in an uncertainty range around our life expectancy estimates. We often use for uncertainty ranges what are called confidence intervals - specifically 95% confidence intervals. These can broadly allow us to say, with 95% confidence, the true value of life expectancy in this population and period lies between the lower and upper estimates generated by this calculation (see https://en.wikipedia.org/wiki/Confidence_interval). The narrower the confidence intervals, the more precise our estimates are, which tends to happen both with more accurate estimates, but also larger samples of data which tend to increase precision.

life_table <- life_table %>%
  mutate(step1 = ((qx^2)*(1-qx))/deaths) %>%
  mutate(step2 = (lx^2)*(((1-ax)*n+lead(ex)^2)*step1)) %>%
  replace_na(list(step2 = 0)) %>%
  mutate(step3 = rev(cumsum(rev(step2)))) %>% 
  mutate(step4 = step3/(lx^2)) %>% 
  mutate(se = sqrt(step4)) %>% 
  mutate(lci = ex-(1.96*se), 
         uci = ex+(1.96*se))

From this data, male life expectancy at birth in 2022 in Wales was 78.8 years (95% CI 78.6-79.0) and females 82.5 (95% CI 82.3 - 82.6 years).

How open source population and mortality data can generate life expectancy estimates for a population

Jonny Currie

2024-03-10

Introduction

Population data

Mortality data