Life Expectancy in England is faltering

While average life expectancy has increased in England in recent decades, since 2010 the rate of improvement has slowed. Presently, there is a lot of debate among experts as to why, with some of the causes mentioned being flu epidemics and austerity1 2. We probably won’t come up with any definite answers in this study, however that doesn’t prevent us from performing some basic, up-to-date analysis on the subject.

In this study we will examine how life expectancy at birth (LE) in England has evolved since 2001-03. We will also focus on the inequalities of LE among local authorities, which don’t seem to be decreasing despite the UK goverment’s efforts3. In particular, we will look into inequalities that are associated with income or sex. Besides some descriptive analysis, we will also try to build a very simple regression model describing the relationship between life expectancy and gross disposable household income at the Local Authority (LA) level.


Outline of the study

This practical session demonstrates a step-by-step analysis on life expectancies and inequalities in England. It covers some key aspects of such studies, from getting the necessary data to building (simple) regression models exploring the relationships between life expectancies and a few social determinants at the Local Authority level.

In general, this practical should run very smoothly; running it should be quite fast and everything should be contained within the script, including downloading data. However, there is always room for more, and thus we have included several “optional” extra tasks about the analysis.

This practical is trying to keep to Base R as much as possible (i.e. without using third-party libraries). We also assume you run this within RStudio (tested for Version 1.1.453 and R version 3.6.1), which is free and you can find here.

By the end of the practical, you should be able to:
– Read data into R.
– Identify data structures.
– Subset, clean and re-code data.
– Find missing values and outliers.
– Join data tables based on common variables.
– Summarise and plot data.
– Calculate variable relationships.
– Build a linear regression model with a confounding variable.


Data

Most of the relevant data are accessible through the Public Health England (PHE) “Fingertips” platform or through the Office for National Statistics (ONS). The code should download the relevant data automatically in your working directory folder.

We can start by loading the necessary libraries - remember to install these from the “Tools” menu first if you haven’t done so already:

# Load necessary libraries
library(fingertipsR)
library(readxl)
library(ggplot2)

We will download LE at birth data directly from PHE using the fingertips library. If you are wondering how to find the necessary fingertips codes for what you want these two commands will help you (optional):

# Available Indicators
ind_codes <- indicators()

# Available Area Types
area_types()

An easy way of “eyeballing” the data within RStudio is by viewing the file using either View(ind_codes) or clicking the file from your environment tab. When the table opens, click filter button at the top left, and within the textbox that appears above each column, type “life expectancy” for column IndicatorName in order to search for it. Indicator code 90366 indicates the LE at birth for all persons and an AreaTypeID of 101 returns data at the Lower Tier LA level (Districts and UAs).

Now we can load the data into R:

# Get data
le0 <- fingertips_data(IndicatorID = 90366, AreaTypeID = 101)
# View top of data
head(le0)
##   IndicatorID            IndicatorName ParentCode ParentName  AreaCode          AreaName AreaType    Sex      Age CategoryType Category Timeperiod    Value LowerCI95.0limit UpperCI95.0limit LowerCI99.8limit UpperCI99.8limit Count Denominator Valuenote RecentTrend ComparedtoEnglandvalueorpercentiles ComparedtoRegionvalueorpercentiles TimeperiodSortable Newdata Comparedtogoal
## 1       90366 Life expectancy at birth       <NA>       <NA> E92000001           England  England   Male All ages         <NA>     <NA>  2001 - 03 76.19791         76.16661         76.22922               NA               NA    NA          NA      <NA>        <NA>                        Not compared                       Not compared           20010000    <NA>           <NA>
## 2       90366 Life expectancy at birth       <NA>       <NA> E92000001           England  England Female All ages         <NA>     <NA>  2001 - 03 80.69957         80.67004         80.72910               NA               NA    NA          NA      <NA>        <NA>                        Not compared                       Not compared           20010000    <NA>           <NA>
## 3       90366 Life expectancy at birth  E92000001    England E12000001 North East region   Region   Male All ages         <NA>     <NA>  2001 - 03 74.70091         74.56450         74.83732               NA               NA    NA          NA      <NA>        <NA>                               Worse                       Not compared           20010000    <NA>           <NA>
##  [ reached 'max' / getOption("max.print") -- omitted 3 rows ]
# Structure
str(le0)
## 'data.frame':    10728 obs. of  26 variables:
##  $ IndicatorID                        : int  90366 90366 90366 90366 90366 90366 90366 90366 90366 90366 ...
##  $ IndicatorName                      : chr  "Life expectancy at birth" "Life expectancy at birth" "Life expectancy at birth" "Life expectancy at birth" ...
##  $ ParentCode                         : chr  NA NA "E92000001" "E92000001" ...
##  $ ParentName                         : chr  NA NA "England" "England" ...
##  $ AreaCode                           : chr  "E92000001" "E92000001" "E12000001" "E12000002" ...
##  $ AreaName                           : chr  "England" "England" "North East region" "North West region" ...
##  $ AreaType                           : chr  "England" "England" "Region" "Region" ...
##  $ Sex                                : chr  "Male" "Female" "Male" "Male" ...
##  $ Age                                : chr  "All ages" "All ages" "All ages" "All ages" ...
##  $ CategoryType                       : chr  NA NA NA NA ...
##  $ Category                           : chr  NA NA NA NA ...
##  $ Timeperiod                         : chr  "2001 - 03" "2001 - 03" "2001 - 03" "2001 - 03" ...
##  $ Value                              : num  76.2 80.7 74.7 74.8 75.5 ...
##  $ LowerCI95.0limit                   : num  76.2 80.7 74.6 74.7 75.4 ...
##  $ UpperCI95.0limit                   : num  76.2 80.7 74.8 74.9 75.6 ...
##  $ LowerCI99.8limit                   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ UpperCI99.8limit                   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Count                              : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Denominator                        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Valuenote                          : chr  NA NA NA NA ...
##  $ RecentTrend                        : chr  NA NA NA NA ...
##  $ ComparedtoEnglandvalueorpercentiles: chr  "Not compared" "Not compared" "Worse" "Worse" ...
##  $ ComparedtoRegionvalueorpercentiles : chr  "Not compared" "Not compared" "Not compared" "Not compared" ...
##  $ TimeperiodSortable                 : int  20010000 20010000 20010000 20010000 20010000 20010000 20010000 20010000 20010000 20010000 ...
##  $ Newdata                            : chr  NA NA NA NA ...
##  $ Comparedtogoal                     : chr  NA NA NA NA ...

Without going into much detail, the str() function is very helpful when you are performing any calculations and load data into R through text files, excels etc., as many times R will read some values wrong. This is particularly true when handing numerical data; for instance, missing values and decimal types might need your attention, and it is best if you caught these “errors” early on.

Notice the structure of the data; these are generally called panel data, i.e. data with observations accross a number of time periods. There are generally two ways with which these are supplied, either in long or wide format. Wide format is when each column represents a time period, while long is when each line represents a time period.

Q: With regards to the format, what kind of panel data is the life expectancies table?

Answer: It’s in long format.

Depending on what you want to accomplish, you might want to tranform between the two, which is not particularly easy. Fortunately we don’t need to do this now, but in case you are wondering take a look at the reshape() function.


Inequalities at Local Authority level

For the next part we will try to explore the LE of local authorities based on deprivation quintiles. Since the current table contains more than the LA-level data that we need, we will try to remove these and generally try to clean it from unnecessary data, so we have a nice table that we can use for our analysis.

# Subset and Clean
table(le0$AreaType)
## 
## District & UA (pre 4/19)                  England                   Region 
##                    10408                       32                      288
le0 <- le0[le0$AreaType == "District & UA (pre 4/19)", ]
le0 <- le0[, c("Year", "ParentCode", "ParentName", "AreaCode", "AreaName", # geography and time
               "Sex", "Age",                                               # type
               "LE_Birth", "LowerCI95.0limit", "UpperCI95.0limit")]           # values

# No useful information about age
table(le0$Age)
## 
## All ages 
##    10408
# Delete
le0$Age <- NULL

The table() function is a good way to get all the unique values in a column. We will only keep the District and Unitary Authorities (Lower Tier LAs). Notice the LE is for all persons at birth, so the Age column doesn’t really give any useful information.

It is particularly important to look for any irregularities in the data, like missing values. Thankfully we didn’t have to deal with any issues up till now. The summary() function can be very useful for this.

# Look for irregularities
summary(le0)
##       Year       ParentCode         ParentName          AreaCode           AreaName             Sex               LE_Birth     LowerCI95.0limit UpperCI95.0limit
##  Min.   :2002   Length:10408       Length:10408       Length:10408       Length:10408       Length:10408       Min.   :71.72   Min.   :71.36    Min.   :72.09   
##  1st Qu.:2005   Class :character   Class :character   Class :character   Class :character   Class :character   1st Qu.:78.66   1st Qu.:78.04    1st Qu.:79.29   
##  Median :2009   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Median :80.76   Median :80.16    Median :81.35   
##  Mean   :2009                                                                                                  Mean   :80.56   Mean   :79.96    Mean   :81.15   
##  3rd Qu.:2013                                                                                                  3rd Qu.:82.61   3rd Qu.:82.02    3rd Qu.:83.19   
##  Max.   :2017                                                                                                  Max.   :87.04   Max.   :86.43    Max.   :87.65   
##                                                                                                                NA's   :68      NA's   :68       NA's   :68

The most important information here is the missing values in the LE LE_Birth column. This could potentially be problematic.

# Which are these?
le0[is.na(le0$LE_Birth) == TRUE, ]
##      Year ParentCode        ParentName  AreaCode        AreaName    Sex LE_Birth LowerCI95.0limit UpperCI95.0limit
## 70   2002  E12000009 South West region E06000053 Isles of Scilly   Male       NA               NA               NA
## 312  2002  E12000007     London region E09000001  City of London   Male       NA               NA               NA
## 370  2002  E12000009 South West region E06000053 Isles of Scilly Female       NA               NA               NA
## 612  2002  E12000007     London region E09000001  City of London Female       NA               NA               NA
## 742  2003  E12000009 South West region E06000053 Isles of Scilly   Male       NA               NA               NA
## 984  2003  E12000007     London region E09000001  City of London   Male       NA               NA               NA
## 1042 2003  E12000009 South West region E06000053 Isles of Scilly Female       NA               NA               NA
## 1284 2003  E12000007     London region E09000001  City of London Female       NA               NA               NA
## 1414 2004  E12000009 South West region E06000053 Isles of Scilly   Male       NA               NA               NA
## 1656 2004  E12000007     London region E09000001  City of London   Male       NA               NA               NA
## 1714 2004  E12000009 South West region E06000053 Isles of Scilly Female       NA               NA               NA
##  [ reached 'max' / getOption("max.print") -- omitted 57 rows ]
# Remove them
le0 <- le0[is.na(le0$LE_Birth) == F, ]
# So how many LAs are left?
length(unique(le0$AreaCode))
## [1] 324

It is very common to exclude those two LAs. The Isles of Scilly and the City of London are so small in terms of population, that usually ONS doesn’t publish any data for them. Since there are 326 LAs as of 2018, we are left with 324, which is correct.

Now lets try to plot these LE data by deprivation. We will use the 2015 Index of Multiple Deprivation (IMD) and construct quintiles based on the average LSOA deprivation rank. We will have to get these data and join them to our LE table.

# Download data
imd_link <- "https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/467764/File_1_ID_2015_Index_of_Multiple_Deprivation.xlsx"
download.file(imd_link, "imd.xlsx", method = "curl") # this is downloaded to your working directory
imd <- read_xlsx("imd.xlsx", sheet = 2, col_names = T)
head(imd)
## # A tibble: 6 x 6
##   `LSOA code (2011)` `LSOA name (2011)` `Local Authority District code (2013)` `Local Authority District name (2013)` `Index of Multiple Deprivation (IMD) Rank (where 1 is most deprived)` `Index of Multiple Deprivation (IMD) Decile (where 1 is most deprived 10% of LSOAs)`
##   <chr>              <chr>              <chr>                                  <chr>                                                                                                  <dbl>                                                                                <dbl>
## 1 E01031349          Adur 001A          E07000223                              Adur                                                                                                   21352                                                                                    7
## 2 E01031350          Adur 001B          E07000223                              Adur                                                                                                    8864                                                                                    3
## 3 E01031351          Adur 001C          E07000223                              Adur                                                                                                   22143                                                                                    7
## 4 E01031352          Adur 001D          E07000223                              Adur                                                                                                   17252                                                                                    6
## 5 E01031370          Adur 001E          E07000223                              Adur                                                                                                   15643                                                                                    5
## 6 E01031374          Adur 001F          E07000223                              Adur                                                                                                   21176                                                                                    7

In order to calculate the average rank per LA, we use aggregate() with FUN = "mean".

# Rename columns
names(imd) <- c("LSOA_code", "LSOA_name", "LA_code", "LA_name", "IMD_rank", "IMD_decile")
imd <- aggregate(IMD_rank ~ LA_code + LA_name, data = imd, FUN = "mean")

We can now join the IMD data to the LE data using merge(). For simplicity, we assume here that the 2015 IMD ranks remain stable over the period of analysis. Another potential issue with the IMD is that, being a composite measure of deprivation, it already contains measures of mortality; and as such it is not particularly insightful to use in conjunction with life expectancies. A better measure would be to use just one domain of deprivation, such as income, to perform the analysis. Again, for practical reasons, we will still carry out this analysis using the composite IMD rank to organise LAs.

# Join data
le0 <- merge(x = le0, y = imd, by.x = "AreaCode", by.y = "LA_code", all.x = T)

Note the all.x argument is the very important; an all.x = T represents a left join, i.e. we keep all data from the x object (le0), and discard those from y (imd) that don’t match - in this case, the IMD data for Wales, Scotland etc. An all.y = T would keep all the imd data (right join). An all = T represents a full join, and an all = F an inner join. check the merge() help page for more information.

Next, we will make the 5 quintiles based on the average rank. We use a combination of quantile() and findInterval() to find where the cuts should be and which LA belongs to which interval respectively.

# Helper variable
imd_Q5 <- quantile(le0$IMD_rank, probs=c(0.2, 0.4, 0.6, 0.8))
# Cut the variable into 5 equal pieces
le0$IMD_Quintile <- findInterval(le0$IMD_rank, c(0, imd_Q5, Inf))
# Rename using factors; it is a type of variable, like numeric, string, etc.
le0$IMD_Quintile <- factor(le0$IMD_Quintile, labels=c("Q1","Q2","Q3", "Q4","Q5")) # where Q1 is the most deprived quintile
# Preview
head(le0)
##    AreaCode Year ParentCode        ParentName   AreaName    Sex LE_Birth LowerCI95.0limit UpperCI95.0limit    LA_name IMD_rank IMD_Quintile
## 1 E06000001 2002  E12000001 North East region Hartlepool   Male 73.41968         72.68208         74.15728 Hartlepool 11101.05           Q1
## 2 E06000001 2017  E12000001 North East region Hartlepool Female 81.33117         80.67800         81.98434 Hartlepool 11101.05           Q1
## 3 E06000001 2011  E12000001 North East region Hartlepool   Male 77.24366         76.52737         77.95995 Hartlepool 11101.05           Q1
## 4 E06000001 2005  E12000001 North East region Hartlepool   Male 74.48324         73.76928         75.19719 Hartlepool 11101.05           Q1
## 5 E06000001 2017  E12000001 North East region Hartlepool   Male 76.81525         76.05943         77.57108 Hartlepool 11101.05           Q1
## 6 E06000001 2012  E12000001 North East region Hartlepool Female 81.50619         80.85787         82.15451 Hartlepool 11101.05           Q1
# Delete variables that we don't need any more
remove(imd_Q5)

Now we can make a new data frame with the aggregate average LE values per IMD quintile. Essentially, we are telling to R to find the averages of LE_Birth by Year, IMD_Quintile and Sex. Translate by with ~ and you should get an idea of how to write formulas in R now.

# Calcalute averages by IMD Quintile
le0_imd <- aggregate(LE_Birth ~ Year + IMD_Quintile + Sex, data = le0, FUN = "mean")
head(le0_imd)
##   Year IMD_Quintile    Sex LE_Birth
## 1 2002           Q1 Female 79.54735
## 2 2003           Q1 Female 79.73821
## 3 2004           Q1 Female 79.97114
## 4 2005           Q1 Female 80.30640
## 5 2006           Q1 Female 80.52099
## 6 2007           Q1 Female 80.64881

Q: Try to plot the results using ggplot again. Can you make a plot for male life expectancy by IMD Quintile, as we did above for LE by gender?

Answer:

le0_imd_male <- le0_imd[le0_imd$Sex == "Male", ]

ggplot(le0_imd_male, aes(x = Year, y = LE_Birth, group = IMD_Quintile)) + 
  geom_point(aes(color = IMD_Quintile)) + 
  geom_line(aes(color = IMD_Quintile)) +
  labs(y = "Male Life Expectancy at Birth", 
       x = "Mid-Year of 3-Year Period", 
       caption = "where Q1 is the most deprived",
       title = "Trends in Male Life Expectancy at birth between 2001-03 and 2015-17 in England")

For this last part of the analysis, we will try to plot LEs by deprivation and gender. Optimally, we would want to somehow compare LEs for both males and females easily in a graph.

We will try something a little more advanced at this point. Since we have mean values for both males and females stored in the variable Sex, we can add facet_grid(cols = vars(Sex)) to create one plot for each gender. We specify how many plot by adding cols i.e. separate plots as columns (it could also be rows), and use the variable Sex to do the separation by (in this case, male and female). Remember, you can always search online to find out code examples!

# Plot by IMD Quintile
ggplot(le0_imd, aes(x = Year, y = LE_Birth, group = IMD_Quintile)) + 
  geom_point(aes(color = IMD_Quintile)) + 
  geom_line(aes(color = IMD_Quintile)) +
  facet_grid(cols = vars(Sex)) + 
  labs(y = "Life Expectancy at Birth", 
       x= "Mid-Year of 3-Year Period",
       caption = "where Q1 is the most deprived",
       title = "Trends in Life Expectancy at birth between 2001-03 and 2015-17 in England")


Modelling the relationships between LE and Income

The second part of this analysis is to make a simple regression model looking at the realtionship between LE and average income. We won’t be using panel data in this intance; we will try to build this model just for the latest (2016) data. Let’s begin by subsetting the LE table:

# LA - level data for latest year
# Subset latest year
le0_16 <- le0[le0$Year == "2016", ]

We now need to attach the income data to the le0_16 table. For this example we will use the Gross Disposable Household Income (GDHI) estimates per Local Authority. We use the {readxl} library to read excel files directly into R. It is very fast, but note that it reads data into a tibble format, which is a format you might not be familiar with (but does not really matter in this case).

# Income
gdhi_link <- "https://www.ons.gov.uk/file?uri=%2feconomy%2fregionalaccounts%2fgrossdisposablehouseholdincome%2fdatasets%2fregionalgrossdisposablehouseholdincomegdhibylocalauthorityintheuk%2f1997to2016/vcregionalgdhibylareordered.xlsx"
download.file(gdhi_link, "gdhi.xlsx", method = "curl") 
# Read Excel
gdhi <- read_xlsx("gdhi.xlsx", sheet = 2, skip = 2, n_max = 392, col_names = T)

Income figures are not per person, but rather as totals in £ millions. Thankfully, the next sheet (no. 3) in the excel file has the total population estimates per LA.

# Read Excel
pop <- read_xlsx("gdhi.xlsx", sheet = 3, skip = 2, n_max = 392, col_names = T)

We can now calculate values per person for 2016:

# Subset year 2016
gdhi <- gdhi[, c("LAU1 code", "LA name", "2016")]
pop <- pop[, c("LAU1 code", "LA name", "2016")]

# Rename
gdhi$GDHI_2016 <- gdhi$`2016` 
gdhi$`2016` <- NULL
pop$Pop_2016 <- pop$`2016` 
pop$`2016` <- NULL

# Merge 
gdhi <- merge(gdhi, pop, by = c("LAU1 code", "LA name"), all = T)
gdhi$GDHI_2016_PP <- gdhi$GDHI_2016 / gdhi$Pop_2016
gdhi$GDHI_2016_PP <- round(gdhi$GDHI_2016_PP*1000, 3) # in £ thousands

Let’s take a look at the distribution of the income data. We will use the histogram geometry of ggplot, which is essentially adding geom_histogram() to the core plot:

# Descriptives
summary(gdhi$GDHI_2016_PP)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.23   16.65   18.81   19.78   21.13   62.60
# Distribution
# The geom_histogram makes a histogram from the data
ggplot(data = gdhi, aes(x = GDHI_2016_PP)) + 
  geom_histogram(bins = 50, fill="steelblue", color = "white")

Notice are some outliers in the data values, possibly in London, which might become a problem later in the analysis. For now though we can proceed and join our GDHI data to our LE data:

# Merge with LE
le0_16 <- merge(le0_16, gdhi, by.x = "AreaCode", by.y = "LAU1 code", all.x = T)
# Check for irregularities again
summary(le0_16)
##    AreaCode              Year       ParentCode         ParentName          AreaName             Sex               LE_Birth     LowerCI95.0limit UpperCI95.0limit   LA_name             IMD_rank     IMD_Quintile   LA name            GDHI_2016        Pop_2016        GDHI_2016_PP  
##  Length:648         Min.   :2016   Length:648         Length:648         Length:648         Length:648         Min.   :74.16   Min.   :73.54    Min.   :74.77    Length:648         Min.   : 6683   Q1:128       Length:648         Min.   :  608   Min.   :  34475   Min.   :12.23  
##  Class :character   1st Qu.:2016   Class :character   Class :character   Class :character   Class :character   1st Qu.:79.90   1st Qu.:79.35    1st Qu.:80.50    Class :character   1st Qu.:13407   Q2:130       Class :character   1st Qu.: 1940   1st Qu.:  98780   1st Qu.:17.39  
##  Mode  :character   Median :2016   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Median :81.60   Median :81.00    Median :82.23    Mode  :character   Median :17785   Q3:130       Mode  :character   Median : 2590   Median : 132810   Median :19.30  
##                     Mean   :2016                                                                               Mean   :81.56   Mean   :80.99    Mean   :82.14                       Mean   :17710   Q4:130                          Mean   : 3389   Mean   : 170551   Mean   :20.22  
##                     3rd Qu.:2016                                                                               3rd Qu.:83.44   3rd Qu.:82.86    3rd Qu.:83.98                       3rd Qu.:21872   Q5:130                          3rd Qu.: 3978   3rd Qu.: 208997   3rd Qu.:21.84  
##                     Max.   :2016                                                                               Max.   :86.55   Max.   :85.98    Max.   :87.11                       Max.   :29719                                   Max.   :15898   Max.   :1128077   Max.   :62.60

We will perform the analysis separately for males and females, and so we will create two separate files:

# Split
le0_m <- le0_16[le0_16$Sex == "Male", ]
le0_f <- le0_16[le0_16$Sex == "Female", ]

A first step in the analysis is to look for any correlations between the two variables. Correlation gives us an idea on the level of association between two variables. This is easily carried out with the cor() function. By default, this gives us the Pearson’s correlation. The cor.test() is essentially the same, but it gives a little bit more information than just the correlation value.

# Correlations
cor(le0_m$LE_Birth, le0_m$GDHI_2016_PP)
## [1] 0.5576461
cor.test(le0_f$LE_Birth, le0_f$GDHI_2016_PP)
## 
##  Pearson's product-moment correlation
## 
## data:  le0_f$LE_Birth and le0_f$GDHI_2016_PP
## t = 13.877, df = 322, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5386832 0.6756615
## sample estimates:
##       cor 
## 0.6117375

It seems there is a high correlation between LE and GDHI. Plotting the data might also be useful in visualising this relationship, particularly spotting any possible outliers:

# Plot
le_f_plot <- ggplot(data = le0_f, aes(LE_Birth, GDHI_2016_PP)) + geom_point(col = "orchid") 
le_f_plot

Notice this time we saved the plot in an object, le_f_plot, which we can call whenever we want. The best part of this is that now we can add more details to it, like labs() using the +.

Q: Can you make the same plot for males and add axis labels and a title? Try to make it with a different colour for points.

Answer:

le_m_plot <- ggplot(le0_m, aes(LE_Birth, GDHI_2016_PP)) + geom_point(col = "blue") 
le_m_plot + labs(x = "Life Expectancy at Birth", 
                 y = "Gross Disposable Household Income per person",
                 title = "Relationship between male life expectancy and income by LA, 2016 data")


#### Regression Model

We know there is a significant positive correlation between LE and average GDHI, however a regression model will quantify that relationship into something more meaningful and interpretable. We will make two simple linear regression models, one for males and one for females. In short, we will calculate the statistical importance of the relationship (the p-value), how much is the effect (the coefficient), and how much LE variability can be explained by income figures (the R2). We will make two models, one for male and one for female life expectancy.

# Model for males
le_model_m <- lm(LE_Birth ~ GDHI_2016_PP, data = le0_m)
summary(le_model_m)
## 
## Call:
## lm(formula = LE_Birth ~ GDHI_2016_PP, data = le0_m)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1626 -0.8914  0.1276  0.9950  2.9635 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  76.41332    0.28964  263.82   <2e-16 ***
## GDHI_2016_PP  0.16692    0.01385   12.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.338 on 322 degrees of freedom
## Multiple R-squared:  0.311,  Adjusted R-squared:  0.3088 
## F-statistic: 145.3 on 1 and 322 DF,  p-value: < 2.2e-16
# Model for females
le_model_f <- lm(LE_Birth ~ GDHI_2016_PP, data = le0_f)
summary(le_model_f)
## 
## Call:
## lm(formula = LE_Birth ~ GDHI_2016_PP, data = le0_f)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3581 -0.6481 -0.0095  0.7769  2.7562 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  80.20438    0.23378  343.08   <2e-16 ***
## GDHI_2016_PP  0.15509    0.01118   13.88   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.08 on 322 degrees of freedom
## Multiple R-squared:  0.3742, Adjusted R-squared:  0.3723 
## F-statistic: 192.6 on 1 and 322 DF,  p-value: < 2.2e-16

Q: According to these two models, which gender is expected to have higher gains of life expectancy at birth when their GDHI increases?

Answer: Males. An increase of £1,000 is expected to increase their life expectancy by 0.167 years. Females have a slightly smaller coefficient with 0.155 years.

We can also plot the model for some useful information about heteroskedasticity of residuals, outliers etc. We can press enter to scroll through the plots.

# Plots
plot(le_model_m)
plot(le_model_f)

Based on one of the plots, there seems to be an issue with GDHI outlier values. We have also spotted those earlier in the density plot - around 5-6 of them. In this case we will try to remove those LAs with very high GDHI (but note there are other ways to deal with outlier values as well, including some specific techniques). For this study we will do so manually; we order the dataset and remove the top 6 LAs according to GDHI.

From now on we will only focus on the male life expectancy model.

# Remove top 6 outliers
le0_m <- le0_m[order(le0_m$GDHI_2016_PP, decreasing = T), ]
le0_m <- le0_m[-c(1:6), ]

Then we re-run our model using the new dataset without the outliers:

# Models
le_model_m <- lm(LE_Birth ~ GDHI_2016_PP, data = le0_m)
summary(le_model_m)
## 
## Call:
## lm(formula = LE_Birth ~ GDHI_2016_PP, data = le0_m)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1576 -0.6726 -0.0002  0.8608  3.0677 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   73.9833     0.3668  201.73   <2e-16 ***
## GDHI_2016_PP   0.2928     0.0183   15.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.185 on 316 degrees of freedom
## Multiple R-squared:  0.4474, Adjusted R-squared:  0.4456 
## F-statistic: 255.8 on 1 and 316 DF,  p-value: < 2.2e-16

Notice now how the expected relationship is much higher, to 0.29 years for every £1,000, and how the overall variability explained is also higher (R2 = 0.447 / 0.446).

We can visualise the linear model again using ggplot, very easily by adding the geom_smooth() and a method = "lm" parameter (i.e. a linear model), as follows:

ggplot(le0_m, aes(LE_Birth, GDHI_2016_PP)) + 
  geom_point(col = "skyblue") + 
  geom_smooth(method = "lm")

Q: In the above plot, can you guess what the grey shaded area along the regression line represents?

Answer: It represents the confidence interval of the regression line; in this case that is 95%. Or, in other words, it means that we’re 95% confident that the true regression line lies somewhere in that gray zone. By default, the confidence graphic in geom_smooth() is always true and set to 0.95. These parameters are controlled by the se and level arguments, like so: geom_smooth(method = "lm", se = TRUE, level = 0.95).


Adding confounders

One issue with how life expectancy is calculated is that it assumes the mortality for people born now, using mortality patterns of all ages based on how they are now. However, mortality ratios of e.g. people at 75 years of age now will probably not be same for people born now in 75 years. For example, there are reasons to believe that mortality ratios of older people are much higher as an effect of the smoking epidemics in the 1970’s and 1980’s.

In this study, we will try to control for the age distribution of LAs and their association to life expectancies. For simplicity, in this example we will just include the proportion of the population over 65 years of age for each LA.

Firstly, let’s get the available population estimates by age group. The excel files has three sheets, the first includes notes, the second estimates for males and the third estimates for females. As aforementioned, we will only work with the male life expectancy model from now on.

# Data
pop16_link <- "https://www.ons.gov.uk/file?uri=%2fpeoplepopulationandcommunity%2fpopulationandmigration%2fpopulationprojections%2fdatasets%2flocalauthoritiesinenglandtable2%2f2016based/table2.xls"
download.file(pop16_link, "pop_16.xls", method = "curl")

pop16_m <- read_xls("pop_16.xls", sheet = 2, skip = 6, n_max = 7387, col_names = T)

# Preview
head(pop16_m)
## # A tibble: 6 x 29
##   CODE      AREA    `AGE GROUP` `2016` `2017` `2018` `2019` `2020` `2021` `2022` `2023` `2024` `2025` `2026` `2027` `2028` `2029` `2030` `2031` `2032` `2033` `2034` `2035` `2036` `2037` `2038` `2039` `2040` `2041`
##   <chr>     <chr>   <chr>        <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 E92000001 England 0-4          1758.  1734.  1723.  1719.  1714.  1708.  1710.  1707.  1704.  1700.  1697.  1693.  1689.  1684.  1679.  1674.  1670.  1668.  1668.  1670.  1675   1682.  1691.  1702.  1716.  1730.
## 2 E92000001 England 5-9          1756.  1791.  1805.  1813   1814.  1807.  1781.  1768.  1762.  1756.  1749.  1751.  1748.  1745.  1742.  1738.  1734.  1730   1725   1720.  1715.  1712.  1710.  1710.  1712.  1716.
## 3 E92000001 England 10-14        1572.  1622.  1676.  1718.  1760.  1794.  1827.  1840.  1847.  1846.  1839.  1813.  1800.  1794.  1788.  1781.  1782.  1780.  1777.  1774.  1770.  1766.  1762   1757.  1752.  1747.
## 4 E92000001 England 15-19        1632.  1600.  1582.  1578.  1588.  1621.  1668.  1720.  1761.  1802.  1835.  1868.  1881.  1888.  1888.  1880.  1854.  1842.  1836.  1830.  1822.  1824   1821.  1818.  1815   1812.
## 5 E92000001 England 20-24        1824.  1814.  1799.  1780.  1760.  1728.  1691.  1670.  1662.  1669.  1701.  1747.  1800.  1840.  1881.  1914   1947.  1960.  1967.  1967.  1960.  1934.  1921.  1915.  1910.  1902.
## 6 E92000001 England 25-29        1924.  1946.  1948.  1952.  1934.  1910.  1895.  1876.  1851.  1828.  1794.  1756.  1734.  1726.  1733.  1765.  1812.  1864.  1905.  1946.  1979   2013.  2025.  2032.  2032.  2025.

We will now have to construct a variable adding all age groups of 65 years and over.

# All age groups of 65 years and over
target_groups <- c("65-69", "70-74", "75-79", "80-84", "85-89", "90+")

# Use the %in% to subset those that match our target age groups
pop16_m_65 <- pop16_m[pop16_m$`AGE GROUP` %in% target_groups, ]

# Subset columns
pop16_m_65 <- pop16_m_65[, c("CODE",  "AREA",  "AGE GROUP", "2016")]

# Aggregate using FUN = "sum"
pop16_m_65 <- aggregate(`2016` ~ CODE + AREA, data = pop16_m_65, FUN = "sum")

We also need the total amount the obtain the ratio:

# subset the total only
pop16_m_total <- pop16_m[pop16_m$`AGE GROUP` == "All ages", ]

# Subset columns
pop16_m_total <- pop16_m_total[, c("CODE",  "AREA",  "AGE GROUP", "2016")]

# Rename and clean table
pop16_m_total$Pop_Male <- pop16_m_total$`2016`
pop16_m_total$`AGE GROUP` <- NULL
pop16_m_total$`2016` <- NULL

Merging the file will gives us two columns, the population over 65 years and the total population. Dividing the two columns will give us the proportion of people of 65 years of age and over to the total LA population, according to 2016 estimates.

# Join
pop16_m_65 <- merge(pop16_m_65, pop16_m_total, by = c("CODE", "AREA"), all.x = T)

# Calculate proportion
pop16_m_65$Pop_M65_ratio <- pop16_m_65$`2016` / pop16_m_65$Pop_Male

# Subset columns
pop16_m_65 <- pop16_m_65[, c("CODE", "Pop_M65_ratio")]

# Preview
head(pop16_m_65)
##        CODE Pop_M65_ratio
## 1 E06000001     0.1721854
## 2 E06000002     0.1443001
## 3 E06000003     0.2066869
## 4 E06000004     0.1616580
## 5 E06000005     0.1833977
## 6 E06000006     0.1672026
# Check values
summary(pop16_m_65$Pop_M65_ratio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.05287 0.15461 0.17840 0.17978 0.20922 0.31928

The pop16_m_65 table contains the information we need to add to the LE table. we can again use the LA ONS code to join the two tables:

# Join to male LE table
le0_m <- merge(le0_m, pop16_m_65, by.x = "AreaCode", by.y = "CODE", all.x = T)

We now have a multiple linear regression model, using two explanatory variables, GDHI_2016_PP and Pop_M65_ratio. In multiple regression, it is important to look out for Multicollinearity. Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly correlated, which can be problematic, as the model assumes no association to work correctly. We can check the level of correlation here:

# Correlation
cor(le0_m$GDHI_2016_PP, le0_m$Pop_M65_ratio)
## [1] -0.1601987

Which shows that our two variables are not very highly linearly correlated. In this case the correlation value can be considered sufficient evidence.

We can now add the population over 65 ratio as a confounder to our model specification. In R, the way to do that is to add variable using the + character on the right side of the formula.

# New model
le_ext_model_m <- lm(LE_Birth ~ GDHI_2016_PP + Pop_M65_ratio, data = le0_m)
summary(le_ext_model_m)
## 
## Call:
## lm(formula = LE_Birth ~ GDHI_2016_PP + Pop_M65_ratio, data = le0_m)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1206 -0.5313  0.0099  0.6532  2.5376 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    70.7915     0.4048  174.87   <2e-16 ***
## GDHI_2016_PP    0.3223     0.0154   20.93   <2e-16 ***
## Pop_M65_ratio  14.4678     1.2089   11.97   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.984 on 315 degrees of freedom
## Multiple R-squared:  0.6201, Adjusted R-squared:  0.6177 
## F-statistic: 257.1 on 2 and 315 DF,  p-value: < 2.2e-16

The final model shows that when controlling for the proportion of population of 65 years age and over, the expected effect of gross disposable income is 0.32 years increase for an increase of £1,000. Both our explanatory variables are highly significant.

Note that when there is an increase in the percentage of population of 65 years age and over, our model expects also an increase in life expectancy. This which might not be easily explained. Perhaps that are other underlying reasons regarding specific latent characteristics of LAs that contribute to these relationships, which might also be related to the Pop_M65_ratio variable.

For this purpose, maybe a fixed-effects regression model, controlling for characteristics within each Local Authority, would be more appropriate. We will look at these types of models in later sessions.


Exercise:


– As we did above for Male LE, try to carry out the same analysis and construct a similar model using female life expectancies.
– What differences can you see?

Answer:

# Read data
pop16_f <- read_xls("pop_16.xls", sheet = 3, skip = 6, n_max = 7387, col_names = T)

# Preview
head(pop16_f)
## # A tibble: 6 x 29
##   CODE      AREA    `AGE GROUP` `2016` `2017` `2018` `2019` `2020` `2021` `2022` `2023` `2024` `2025` `2026` `2027` `2028` `2029` `2030` `2031` `2032` `2033` `2034` `2035` `2036` `2037` `2038` `2039` `2040` `2041`
##   <chr>     <chr>   <chr>        <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 E92000001 England 0-4          1671.  1649.  1640.  1636.  1633.  1628.  1630.  1627.  1624.  1621.  1618.  1614.  1610.  1605.  1600   1596.  1592.  1590.  1590.  1592.  1596.  1603.  1612.  1622.  1635.  1649.
## 2 E92000001 England 5-9          1673.  1707.  1719.  1726.  1726.  1718.  1693.  1682.  1677.  1673.  1666.  1668.  1665.  1662.  1659.  1656.  1653.  1648.  1643.  1639.  1634.  1631.  1629.  1629.  1631.  1635.
## 3 E92000001 England 10-14        1498.  1544   1596.  1636.  1676.  1708.  1741.  1752.  1758.  1755.  1747.  1722.  1712.  1706.  1702   1696.  1697.  1695.  1692.  1689.  1686.  1682.  1678.  1673.  1668   1664.
## 4 E92000001 England 15-19        1548.  1516.  1499.  1496.  1507.  1538.  1582.  1632.  1671.  1710.  1741.  1774.  1785.  1791.  1789   1781   1756.  1745.  1740.  1736.  1730.  1731.  1728.  1726.  1722.  1719.
## 5 E92000001 England 20-24        1736.  1716.  1700.  1680.  1662.  1637   1602.  1581.  1574.  1583.  1612   1656.  1706   1745.  1784.  1815.  1848.  1859.  1865.  1863   1855.  1830.  1820.  1814.  1810.  1804.
## 6 E92000001 England 25-29        1887.  1895.  1882.  1873.  1851.  1821.  1797.  1775   1752.  1730   1703.  1666.  1645.  1638.  1647.  1676   1720.  1770.  1810.  1848.  1880.  1913.  1924.  1930.  1928   1920
# All age groups of 65 years and over
target_groups <- c("65-69", "70-74", "75-79", "80-84", "85-89", "90+")

# Use the %in% to subset those that match our target age groups
pop16_f_65 <- pop16_f[pop16_f$`AGE GROUP` %in% target_groups, ]

# Subset columns
pop16_f_65 <- pop16_f_65[, c("CODE",  "AREA",  "AGE GROUP", "2016")]

# Aggregate using FUN = "sum"
pop16_f_65 <- aggregate(`2016` ~ CODE + AREA, data = pop16_f_65, FUN = "sum")

# subset the total only
pop16_f_total <- pop16_f[pop16_f$`AGE GROUP` == "All ages", ]

# Subset columns
pop16_f_total <- pop16_f_total[, c("CODE",  "AREA",  "AGE GROUP", "2016")]

# Rename and clean table
pop16_f_total$Pop_Female <- pop16_f_total$`2016`
pop16_f_total$`AGE GROUP` <- NULL
pop16_f_total$`2016` <- NULL

# Join
pop16_f_65 <- merge(pop16_f_65, pop16_f_total, by = c("CODE", "AREA"), all.x = T)

# Calculate proportion
pop16_f_65$Pop_F65_ratio <- pop16_f_65$`2016` / pop16_f_65$Pop_Female

# Subset columns
pop16_f_65 <- pop16_f_65[, c("CODE", "Pop_F65_ratio")]

# Preview
head(pop16_f_65)
##        CODE Pop_F65_ratio
## 1 E06000001     0.2016807
## 2 E06000002     0.1718310
## 3 E06000003     0.2309900
## 4 E06000004     0.1879397
## 5 E06000005     0.2110092
## 6 E06000006     0.1843318
# Check values
summary(pop16_f_65$Pop_F65_ratio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0688  0.1814  0.2065  0.2080  0.2383  0.3408
# Remove outliers and join ratios to female LE table
le0_f <- le0_f[order(le0_f$GDHI_2016_PP, decreasing = T), ]
le0_f <- le0_f[-c(1:6), ]
le0_f <- merge(le0_f, pop16_f_65, by.x = "AreaCode", by.y = "CODE", all.x = T)

# Correlation
cor(le0_f$GDHI_2016_PP, le0_f$Pop_F65_ratio)
## [1] -0.1419296
# New model
le_ext_model_f <- lm(LE_Birth ~ GDHI_2016_PP + Pop_F65_ratio, data = le0_f)
summary(le_ext_model_f)
## 
## Call:
## lm(formula = LE_Birth ~ GDHI_2016_PP + Pop_F65_ratio, data = le0_f)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.50036 -0.58993  0.02241  0.59656  2.28810 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   75.73595    0.34866  217.22   <2e-16 ***
## GDHI_2016_PP   0.27737    0.01285   21.58   <2e-16 ***
## Pop_F65_ratio 10.07454    0.96924   10.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8235 on 315 degrees of freedom
## Multiple R-squared:  0.623,  Adjusted R-squared:  0.6207 
## F-statistic: 260.3 on 2 and 315 DF,  p-value: < 2.2e-16



References

  1. Hiam L, Harrison D, McKee M, et al. Why is life expectancy in England and Wales ‘stalling’? J Epidemiol Community Health 2018;72:404-408.

  2. Leon DA, Jdanov DA, Shkolnikov VM. Trends in life expectancy and age-specific mortality in England and Wales, 1970–2016, in comparison with a set of 22 high-income countries: an analysis of vital statistics data. Lancet Public Health, 2019;4:575-582.

  3. Barr B, Higgerson J, Whitehead M. Investigating the impact of the English health inequalities strategy: time trend analysis. BMJ 2017; 358:j3310.