Public Health Data Lab II

Recent trends in Life Expectancy in England

While the average life expectancy in England has been increasing during the past decades, since 2010 the rate of improvement has slowed down considerably. Presently, there is a lot of debate among experts as to why, with one of the causes mentioned being austerity¹ ² ³. Another key issue that many studies have also focused on during this period, is whether inequalities or the gap between the least and most disadvantaged groups has increased³ ⁴. We probably won’t come up with any definite answers in this study, however that doesn’t prevent us from performing some basic, up-to-date analysis on the subject.

In this study we will examine how life expectancy (LE) at birth in England has evolved in recent years. We will also focus on the geographical inequalities of LE across local authorities. In particular, we will focus on LE inequalities that are associated with deprivation and income level. Besides some descriptive analysis, we will also try to build a very simple regression model describing the relationship between life expectancy and income at the Local Authority (LA) level.

Outline of the study

This practical session demonstrates a step-by-step analysis on life expectancies and inequalities in England. It covers some key aspects of such studies, from getting the necessary data to building (simple) regression models exploring the relationships between life expectancy and the social determinants of health. However, instead of individual-level data we will use aggregated data at Local Authority level to perform the analysis.

In general, this practical should run smoothly; everything should be contained within the script, including downloading data. Furthermore, we have included several “optional” tasks about the analysis that you can do.

This practical is trying to keep to Base R as much as possible (i.e. without using third-party libraries), with some exceptions. It also assume you run this within RStudio (tested for Version 1.1.453 and R version 3.6.1), which is free and you can find here.

By the end of the practical, you should be able to:
– Read data into R.
– Identify data structures.
– Subset, clean and re-code data.
– Find missing values and outliers.
– Join data tables based on common variables.
– Summarise and plot data.
– Calculate variable relationships.
– Calculate linear regression models.

There are also some sections included in a text-box; these are generally optional, but they do contain additional information, exercises, or remind you some things from previous practicals.

Getting life expectancy data

Most of the relevant data are accessible through the Public Health England (PHE) “Fingertips” repository or through the Office for National Statistics (ONS). The code should download the relevant data automatically in your working directory folder.

We can start by loading the necessary libraries - you can to install these from the “Tools” menu first if you haven’t done so already. Alternatively try using these commands:

# Install libraries
install.packages("readxl")
install.packages("ggplot2")

For fingertipsR, if the library is not available on CRAN, try:

# Install libraries
install.packages("fingertipsR", repos = "https://dev.ropensci.org")

# Load necessary libraries
library(fingertipsR)
library(readxl)
library(ggplot2)

We will download LE at birth data directly from PHE using the fingertipsR library. If you are wondering how to find the necessary fingertips codes for what you want these two commands will help you (optional):

# Available Indicators
ind_codes <- indicators()

# Available Area Types
area_types()

An easy way of “eyeballing” the data within RStudio is by viewing the file using either View(ind_codes) or clicking the file from your environment tab. When the table opens, click filter button at the top left, and within the textbox that appears above each column, type “life expectancy” for column IndicatorName in order to search for it. Indicator code 90366 indicates the LE at birth for all persons and an AreaTypeID of 101 returns data at the Lower Tier LA level (Districts and UAs).

Now we can load the data into R:

# Get data
le0 <- fingertips_data(IndicatorID = 90366, AreaTypeID = 101)
# View top of data
head(le0)

##   IndicatorID            IndicatorName ParentCode ParentName  AreaCode          AreaName AreaType    Sex      Age CategoryType Category Timeperiod    Value LowerCI95.0limit UpperCI95.0limit LowerCI99.8limit UpperCI99.8limit Count Denominator Valuenote RecentTrend ComparedtoEnglandvalueorpercentiles ComparedtoRegionvalueorpercentiles TimeperiodSortable Newdata Comparedtogoal
## 1       90366 Life expectancy at birth       <NA>       <NA> E92000001           England  England   Male All ages         <NA>     <NA>  2001 - 03 76.19791         76.16661         76.22922               NA               NA    NA          NA      <NA>        <NA>                        Not compared                       Not compared           20010000    <NA>           <NA>
## 2       90366 Life expectancy at birth       <NA>       <NA> E92000001           England  England Female All ages         <NA>     <NA>  2001 - 03 80.69957         80.67004         80.72910               NA               NA    NA          NA      <NA>        <NA>                        Not compared                       Not compared           20010000    <NA>           <NA>
## 3       90366 Life expectancy at birth  E92000001    England E12000001 North East region   Region   Male All ages         <NA>     <NA>  2001 - 03 74.70091         74.56450         74.83732               NA               NA    NA          NA      <NA>        <NA>                               Worse                       Not compared           20010000    <NA>           <NA>
##  [ reached 'max' / getOption("max.print") -- omitted 3 rows ]

# Structure
str(le0)

## 'data.frame':    24422 obs. of  26 variables:
##  $ IndicatorID                        : int  90366 90366 90366 90366 90366 90366 90366 90366 90366 90366 ...
##  $ IndicatorName                      : chr  "Life expectancy at birth" "Life expectancy at birth" "Life expectancy at birth" "Life expectancy at birth" ...
##  $ ParentCode                         : chr  NA NA "E92000001" "E92000001" ...
##  $ ParentName                         : chr  NA NA "England" "England" ...
##  $ AreaCode                           : chr  "E92000001" "E92000001" "E12000001" "E12000002" ...
##  $ AreaName                           : chr  "England" "England" "North East region" "North West region" ...
##  $ AreaType                           : chr  "England" "England" "Region" "Region" ...
##  $ Sex                                : chr  "Male" "Female" "Male" "Male" ...
##  $ Age                                : chr  "All ages" "All ages" "All ages" "All ages" ...
##  $ CategoryType                       : chr  NA NA NA NA ...
##  $ Category                           : chr  NA NA NA NA ...
##  $ Timeperiod                         : chr  "2001 - 03" "2001 - 03" "2001 - 03" "2001 - 03" ...
##  $ Value                              : num  76.2 80.7 74.7 74.8 75.5 ...
##  $ LowerCI95.0limit                   : num  76.2 80.7 74.6 74.7 75.4 ...
##  $ UpperCI95.0limit                   : num  76.2 80.7 74.8 74.9 75.6 ...
##  $ LowerCI99.8limit                   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ UpperCI99.8limit                   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Count                              : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Denominator                        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Valuenote                          : chr  NA NA NA NA ...
##  $ RecentTrend                        : chr  NA NA NA NA ...
##  $ ComparedtoEnglandvalueorpercentiles: chr  "Not compared" "Not compared" "Worse" "Worse" ...
##  $ ComparedtoRegionvalueorpercentiles : chr  "Not compared" "Not compared" "Not compared" "Not compared" ...
##  $ TimeperiodSortable                 : int  20010000 20010000 20010000 20010000 20010000 20010000 20010000 20010000 20010000 20010000 ...
##  $ Newdata                            : chr  NA NA NA NA ...
##  $ Comparedtogoal                     : chr  NA NA NA NA ...

Without going into much detail, the str() function is very helpful when you are performing any calculations and load data into R through text files, excels etc., as many times R will read some values wrong. This is particularly true when handing numerical data; for instance, missing values and decimal types might need your attention, and it is best if you caught these “errors” early on.

Notice the structure of the data; these are generally called panel data, i.e. data with observations across a number of time periods. There are generally two ways with which these are supplied, either in long or wide format. Wide format is when each column represents a time period, while long is when each line represents a time period.

Q: With regards to the format, what kind of panel data is the life expectancy table?

Answer:

It’s in long format.

Depending on what you want to accomplish, you might want to transform between the two, which is not particularly easy. Fortunately we don’t need to do this now, but in case you are wondering, take a look at the reshape() function.

The data are also supplied for multiple levels, so besides LA level, national and regional data are also supplied. Furthermore, LE data are supplied as single year (added very recently) and 3-year rolling averages (the traditional way). For this study we will use the single-year estimates. We can easily subset just those using the nchar() function; For example, “2001” has 4 characters but “2001 - 03” has 9.

# Change column names by making a copy to a new one and deleting the original
# We can also use names(le0)[7] <- "LE_Birth" (i.e. change the 7th column name)
le0$LE_Birth <- le0$Value
le0$Value <- NULL 

# Keep only single year estimates
# For this, the Time period should be just 4 characters long
le0 <-le0[nchar(le0$Timeperiod) == 4,]

# Rename
le0$Year <- le0$Timeperiod
le0$Timeperiod <- NULL

Now let’s try to make a simple plot regarding national trends in England. First, we must subset the data.

# England data only
le0_eng <- le0[le0$AreaName == "England", ]
# Males only
le0_eng_male <- le0_eng[le0_eng$Sex == "Male", ]

# Simple plot
plot(le0_eng_male$Year, le0_eng_male$LE_Birth)

Q: Can you make a similar plot for female life expectancy?

Answer:

le0_eng_female <- le0_eng[le0_eng$Sex == "Female", ]
plot(le0_eng_female$Year, le0_eng_female$LE_Birth)

Plotting with `ggplot2`: Basic Plots

Click to open section:

It would be nice if we could overlay trends in both male and female LE at birth have to create another data frame with females plot it, This is rather ugly in base R - as you probably have noticed - so from now on we will use the ggplot2 library to visualize data. ggplot2 has its own grammar, so it might be confusing at first. However, it is easy once you get used to; the best way to get started is to look online for examples, including this cheetsheet.

Let’s start with a simple plot:

ggplot(data = le0_eng_male, aes(x = Year, y = LE_Birth)) + 
  geom_point()

The premise is very simple; you tell R to create a plot using the data = le0_eng_male. Then you add the aesthetics, essentially telling it how to plot the data you just supplied. We specify the Year on the x-axis and LE_Birth on the y-axis. However, we are not telling it how to represent them; if we want them as points, we add a geometry with geom_point(). If we wanted lines, we would add geom_line(). If we wanted both, we could add both.

We can add another argument in the aesthetics, group = Sex with which we specify to group observations by sex. We can also add some labels with labs. The rest are done automatically:

ggplot(le0_eng, aes(x = Year, y = LE_Birth, group = Sex)) + 
  geom_point() + 
  geom_line(aes(color = Sex)) +
  labs(y = "Life Expectancy at Birth", 
       x = "Year", 
       title = "Trends in Life Expectancy at birth between 2001 and 2020 in England")

Since we could like to colour every line differently we can specify geom_line(aes(color = Sex)) to do that. R will pick two colours automatically.

Q: The points above are all the default black colour. What do we need to change to make them the same colour as the lines?

Answer:

Replace with geom_point(aes(color = Sex)).

Inequalities at Local Authority level

For the next part we will try to explore the LE of local authorities by deprivation. Since the current table contains more than the LA-level data that we need, we will try to remove these. Generally we try to clean datasets from unnecessary data, so we have a nice table that we can use easily in our analysis.

# Subset and Clean
table(le0$AreaType)

## 
## Districts & UAs (pre Apr 2019)                        England                         Region 
##                          12040                             40                            360

le0 <- le0[le0$AreaType == "Districts & UAs (pre Apr 2019)", ]
le0 <- le0[, c("Year", "ParentCode", "ParentName", "AreaCode", "AreaName", # geography and time
               "Sex", "Age",                                               # type
               "LE_Birth", "LowerCI95.0limit", "UpperCI95.0limit")]        # values

The table() function is a good way to get all the unique values in a column. We will only keep the District and Unitary Authorities (Lower Tier LAs). Notice the measure we are using is LE for all persons at birth, so the Age column doesn’t really give any useful information.

# No useful information about age
table(le0$Age)

## 
## All ages 
##    12040

# Delete
le0$Age <- NULL

It is particularly important to look for any irregularities in the data, like missing values. Thankfully we didn’t have to deal with any issues up till now. The summary() function can be very useful for this.

# Look for irregularities
summary(le0)

##      Year            ParentCode         ParentName          AreaCode           AreaName             Sex               LE_Birth     LowerCI95.0limit UpperCI95.0limit
##  Length:12040       Length:12040       Length:12040       Length:12040       Length:12040       Length:12040       Min.   :71.18   Min.   :70.32    Min.   :71.81   
##  Class :character   Class :character   Class :character   Class :character   Class :character   Class :character   1st Qu.:78.62   1st Qu.:77.58    1st Qu.:79.67   
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Median :80.76   Median :79.77    Median :81.77   
##                                                                                                                    Mean   :80.56   Mean   :79.56    Mean   :81.57   
##                                                                                                                    3rd Qu.:82.62   3rd Qu.:81.64    3rd Qu.:83.64   
##                                                                                                                    Max.   :88.39   Max.   :87.31    Max.   :89.46   
##                                                                                                                    NA's   :80      NA's   :80       NA's   :80

The most important information here is the missing values in the LE LE_Birth column. This could potentially be problematic. We have to investigate.

# Which are these?
le0[is.na(le0$LE_Birth) == TRUE, ]

##       Year ParentCode        ParentName  AreaCode        AreaName    Sex LE_Birth LowerCI95.0limit UpperCI95.0limit
## 12052 2001  E12000009 South West region E06000053 Isles of Scilly   Male       NA               NA               NA
## 12271 2001  E12000007     London region E09000001  City of London   Male       NA               NA               NA
## 12329 2001  E12000009 South West region E06000053 Isles of Scilly Female       NA               NA               NA
## 12548 2001  E12000007     London region E09000001  City of London Female       NA               NA               NA
## 12674 2002  E12000009 South West region E06000053 Isles of Scilly   Male       NA               NA               NA
## 12893 2002  E12000007     London region E09000001  City of London   Male       NA               NA               NA
## 12951 2002  E12000009 South West region E06000053 Isles of Scilly Female       NA               NA               NA
## 13170 2002  E12000007     London region E09000001  City of London Female       NA               NA               NA
## 13296 2003  E12000009 South West region E06000053 Isles of Scilly   Male       NA               NA               NA
## 13515 2003  E12000007     London region E09000001  City of London   Male       NA               NA               NA
## 13573 2003  E12000009 South West region E06000053 Isles of Scilly Female       NA               NA               NA
##  [ reached 'max' / getOption("max.print") -- omitted 69 rows ]

We do not have data for the two LAs, The Isles of Scilly and the City of London. We can remove them now.

# Remove them
le0 <- le0[is.na(le0$LE_Birth) == F, ]
# So how many LAs are left?
length(unique(le0$AreaCode))

## [1] 299

It is very common to exclude those two LAs. The Isles of Scilly and the City of London are so small in terms of population, that usually ONS doesn’t publish any data for them. While there were 326 LAs as of 2018, we are missing data on LAs that have been some re-organisations since 2019 (e.g. Bournemouth), which brings the total to 299.

Now lets try to plot these LE data by deprivation. We will use the 2015 Index of Multiple Deprivation (IMD) and construct quintiles based on the average LSOA deprivation rank. We will have to get these data and join them to our LE table. We use the {readxl} library to read excel files directly into R. It is very fast, but note that it reads data into a tibble format, which is a format you might not be familiar with (but does not really matter in this case).

# Download data
imd_link <- "https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/467764/File_1_ID_2015_Index_of_Multiple_Deprivation.xlsx"
download.file(imd_link, "imd.xlsx", method = "curl") # this is downloaded to your working directory
imd <- read_xlsx("imd.xlsx", sheet = 2, col_names = T)
head(imd)

## # A tibble: 6 x 6
##   `LSOA code (2011)` `LSOA name (2011)` `Local Authority District code (2013)` `Local Authority District name (2013)` `Index of Multiple Deprivation (IMD) Rank (where 1 is most deprived)` `Index of Multiple Deprivation (IMD) Decile (where 1 is most deprived 10% of LSOAs)`
##   <chr>              <chr>              <chr>                                  <chr>                                                                                                  <dbl>                                                                                <dbl>
## 1 E01031349          Adur 001A          E07000223                              Adur                                                                                                   21352                                                                                    7
## 2 E01031350          Adur 001B          E07000223                              Adur                                                                                                    8864                                                                                    3
## 3 E01031351          Adur 001C          E07000223                              Adur                                                                                                   22143                                                                                    7
## 4 E01031352          Adur 001D          E07000223                              Adur                                                                                                   17252                                                                                    6
## 5 E01031370          Adur 001E          E07000223                              Adur                                                                                                   15643                                                                                    5
## 6 E01031374          Adur 001F          E07000223                              Adur                                                                                                   21176                                                                                    7

In order to calculate the average rank per LA, we use aggregate() with FUN = "mean".

# Rename columns
names(imd) <- c("LSOA_code", "LSOA_name", "LA_code", "LA_name", "IMD_rank", "IMD_decile")
imd <- aggregate(IMD_rank ~ LA_code + LA_name, data = imd, FUN = "mean")

We can now join the IMD data to the LE data using merge(). For simplicity, we assume here that the 2015 IMD ranks remain stable over the period of analysis. Another potential issue with the IMD is that, being a composite measure of deprivation, it already contains measures of mortality; and as such it wouldn’t be best to use in conjunction with life expectancies. A better measure would be to use just one domain of deprivation, such as income, to perform the analysis. For practical reasons however, we will just use the general deprivation rank for the classification.

# Join data
le0 <- merge(x = le0, y = imd, by.x = "AreaCode", by.y = "LA_code", all.x = T)

Note the all.x argument is the very important; an all.x = T represents a left join, i.e. we keep all data from the x object (le0), and discard those from y (imd) that don’t match - in this case, the IMD data for Wales, Scotland etc. An all.y = T would keep all the imd data (right join). An all = T represents a full join, and an all = F an inner join. check the merge() help page for more information.

Next, we will make the 5 quintiles based on the average rank. We use a combination of quantile() and findInterval() to find where the cuts should be and which LA belongs to which interval respectively.

# Helper variable
imd_Q5 <- quantile(le0$IMD_rank, probs=c(0.2, 0.4, 0.6, 0.8))
# Cut the variable into 5 equal pieces
le0$IMD_Quintile <- findInterval(le0$IMD_rank, c(0, imd_Q5, Inf))
# Rename using factors; it is a type of variable, like numeric, string, etc.
le0$IMD_Quintile <- factor(le0$IMD_Quintile, labels=c("Q1 most deprived","Q2","Q3", "Q4","Q5 least deprived"))
# Preview
head(le0)

##    AreaCode Year ParentCode        ParentName   AreaName    Sex LE_Birth LowerCI95.0limit UpperCI95.0limit    LA_name IMD_rank     IMD_Quintile
## 1 E06000001 2001  E12000001 North East region Hartlepool   Male 73.53947         72.31141         74.76753 Hartlepool 11101.05 Q1 most deprived
## 2 E06000001 2014  E12000001 North East region Hartlepool   Male 77.04187         75.77484         78.30890 Hartlepool 11101.05 Q1 most deprived
## 3 E06000001 2013  E12000001 North East region Hartlepool Female 81.85011         80.75253         82.94770 Hartlepool 11101.05 Q1 most deprived
## 4 E06000001 2017  E12000001 North East region Hartlepool Female 81.75981         80.70720         82.81243 Hartlepool 11101.05 Q1 most deprived
## 5 E06000001 2018  E12000001 North East region Hartlepool   Male 77.77052         76.45557         79.08547 Hartlepool 11101.05 Q1 most deprived
## 6 E06000001 2005  E12000001 North East region Hartlepool   Male 75.08698         74.00260         76.17137 Hartlepool 11101.05 Q1 most deprived

# Delete variable that we don't need any more
remove(imd_Q5)

Now we can make a new data frame with the aggregate average LE values per IMD quintile. Essentially, we are telling to R to find the averages of LE_Birth by Year, IMD_Quintile and Sex. Translate by with ~ and you should get an idea of how to write formulas in R now.

# Calcalute averages by IMD Quintile
le0_imd <- aggregate(LE_Birth ~ Year + IMD_Quintile + Sex, data = le0, FUN = "mean")
head(le0_imd)

##   Year     IMD_Quintile    Sex LE_Birth
## 1 2001 Q1 most deprived Female 79.45209
## 2 2002 Q1 most deprived Female 79.57067
## 3 2003 Q1 most deprived Female 79.44721
## 4 2004 Q1 most deprived Female 79.98570
## 5 2005 Q1 most deprived Female 80.29279
## 6 2006 Q1 most deprived Female 80.39948

Plotting with `ggplot2`: Advanced plots

Click to open section:

Let’s try to plot some results using ggplot() again. We can make a plot for male life expectancy by IMD Quintile, as we did above for LE by gender:

le0_imd_male <- le0_imd[le0_imd$Sex == "Male", ]

ggplot(le0_imd_male, aes(x = Year, y = LE_Birth, group = IMD_Quintile)) + 
  geom_point(aes(color = IMD_Quintile)) + 
  geom_line(aes(color = IMD_Quintile)) +
  labs(y = "Male Life Expectancy at Birth", 
       x = "Year", 
       color = "Deprivation Quintile",
       caption = "Based on the 2015 IMD average rank",
       title = "Trends in Male Life Expectancy at birth between 2001 and 2020 in England")

Now, we will try to plot LEs by deprivation and gender. Optimally, we would want to somehow compare LEs for both males and females easily in a graph.

We will try something a little more advanced at this point. Since we have mean values for both males and females stored in the variable Sex, we can add facet_grid(cols = vars(Sex)) to create one plot for each gender. We specify how many plot by adding cols i.e. separate plots as columns (it could also be rows), and use the variable Sex to do the separation by (in this case, male and female). Remember, you can always search online to find out code examples!

# Plot by IMD Quintile
ggplot(le0_imd, aes(x = Year, y = LE_Birth, group = IMD_Quintile)) + 
  geom_point(aes(color = IMD_Quintile)) + 
  geom_line(aes(color = IMD_Quintile)) +
  facet_grid(cols = vars(Sex)) + 
  labs(y = "Life Expectancy at Birth", 
       x= "Year",
       color = "Deprivation Quintile",
       caption = "Based on the 2015 IMD average rank",
       title = "Trends in Life Expectancy at birth between 2001 and 2020 in England")

Modelling the relationships between LE and Income

The second part of this analysis is to make a simple regression model looking at the relationship between LE and average income. We won’t be using panel data in this instance; we will try to build this model just for the 2016 data. Let’s begin by subsetting the LE table:

# LA - level data for latest year
# Subset latest year
le0_16 <- le0[le0$Year == "2016", ]

We now need to attach the income data to the le0_16 table. For this example we will use the Gross Disposable Household Income (GDHI) estimates per Local Authority. We also need to specify the sheet number (2), how many lines to skip at the beginning (2), and how many lines total to read (392).

# Income
gdhi_link <- "https://www.ons.gov.uk/file?uri=%2feconomy%2fregionalaccounts%2fgrossdisposablehouseholdincome%2fdatasets%2fregionalgrossdisposablehouseholdincomegdhibylocalauthorityintheuk%2f1997to2016/vcregionalgdhibylareordered.xlsx"
download.file(gdhi_link, "gdhi.xlsx", method = "curl") 
# Read Excel
gdhi <- read_xlsx("gdhi.xlsx", sheet = 2, skip = 2, n_max = 392, col_names = T)

Income figures are not per person, but rather as totals in £ millions. Thankfully, the next sheet (no. 3) in the excel file has the total population estimates per LA.

# Read Excel
pop <- read_xlsx("gdhi.xlsx", sheet = 3, skip = 2, n_max = 392, col_names = T)

We can now calculate values per person for 2016:

# Subset year 2016
gdhi <- gdhi[, c("LAU1 code", "LA name", "2016")]
pop <- pop[, c("LAU1 code", "LA name", "2016")]

# Rename
gdhi$GDHI_2016 <- gdhi$`2016` 
gdhi$`2016` <- NULL
pop$Pop_2016 <- pop$`2016` 
pop$`2016` <- NULL

# Merge 
gdhi <- merge(gdhi, pop, by = c("LAU1 code", "LA name"), all = T)
gdhi$GDHI_2016_PP <- gdhi$GDHI_2016 / gdhi$Pop_2016
gdhi$GDHI_2016_PP <- round(gdhi$GDHI_2016_PP*1000, 3) # in £ thousands

# Descriptives
summary(gdhi$GDHI_2016_PP)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.23   16.65   18.81   19.78   21.13   62.60

Plotting with `ggplot2`: Histograms

Click to open section:

For more information we can look at the distribution of the income data. We will use the histogram geometry of ggplot(), which is essentially adding geom_histogram() to the core plot:

# Distribution
# The geom_histogram makes a histogram from the data
ggplot(data = gdhi, aes(x = GDHI_2016_PP)) + 
  geom_histogram(bins = 50, fill="steelblue", color = "white")

Notice are some outliers in the data values, possibly in London, which might become a problem later in the analysis. For now though we can proceed and join our GDHI data to our LE data:

# Merge with LE
le0_16 <- merge(le0_16, gdhi, by.x = "AreaCode", by.y = "LAU1 code", all.x = T)
# Check for irregularities again
summary(le0_16)

##    AreaCode             Year            ParentCode         ParentName          AreaName             Sex               LE_Birth     LowerCI95.0limit UpperCI95.0limit   LA_name             IMD_rank                IMD_Quintile   LA name            GDHI_2016        Pop_2016        GDHI_2016_PP  
##  Length:598         Length:598         Length:598         Length:598         Length:598         Length:598         Min.   :74.48   Min.   :73.43    Min.   :75.53    Length:598         Min.   : 6683   Q1 most deprived :118   Length:598         Min.   :  711   Min.   :  38949   Min.   :12.23  
##  Class :character   Class :character   Class :character   Class :character   Class :character   Class :character   1st Qu.:79.92   1st Qu.:78.89    1st Qu.:80.83    Class :character   1st Qu.:13205   Q2               :120   Class :character   1st Qu.: 1970   1st Qu.: 101474   1st Qu.:17.07  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Median :81.65   Median :80.68    Median :82.64    Mode  :character   Median :17525   Q3               :120   Mode  :character   Median : 2680   Median : 138523   Median :19.29  
##                                                                                                                    Mean   :81.55   Mean   :80.57    Mean   :82.54                       Mean   :17539   Q4               :120                      Mean   : 3492   Mean   : 176099   Mean   :20.19  
##                                                                                                                    3rd Qu.:83.38   3rd Qu.:82.39    3rd Qu.:84.38                       3rd Qu.:21856   Q5 least deprived:120                      3rd Qu.: 4038   3rd Qu.: 216241   3rd Qu.:22.01  
##                                                                                                                    Max.   :88.09   Max.   :86.91    Max.   :89.28                       Max.   :29719                                              Max.   :15898   Max.   :1128077   Max.   :62.60

We will perform the analysis separately for males and females, and so we will create two separate files:

# Split
le0_m <- le0_16[le0_16$Sex == "Male", ]
le0_f <- le0_16[le0_16$Sex == "Female", ]

Correlations

Click to open section:

A first step in any analysis is to look for any correlations between the two variables. Correlation gives us an idea on the level of association between two variables. This is easily carried out with the cor() function. By default, this gives us the Pearson’s correlation. The cor.test() is essentially the same, but it gives a little bit more information than just the correlation value.

# Correlations
cor(le0_m$LE_Birth, le0_m$GDHI_2016_PP)

## [1] 0.563493

cor.test(le0_f$LE_Birth, le0_f$GDHI_2016_PP)

## 
##  Pearson's product-moment correlation
## 
## data:  le0_f$LE_Birth and le0_f$GDHI_2016_PP
## t = 13.686, df = 297, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5470445 0.6868644
## sample estimates:
##       cor 
## 0.6218859

It seems there is a high correlation between LE and GDHI. Plotting the data might also be useful in visualising this relationship, particularly spotting any possible outliers:

# Plot
le_f_plot <- ggplot(data = le0_f, aes(LE_Birth, GDHI_2016_PP)) + geom_point(col = "orchid") 
le_f_plot

Notice this time we saved the plot in an object, le_f_plot, which we can call whenever we want. The best part of this is that now we can add more details to it, like labs() using the +.

Q: Can you make the same plot for males and add axis labels and a title? Try to make it with a different colour for points.

Answer:

le_m_plot <- ggplot(le0_m, aes(LE_Birth, GDHI_2016_PP)) + geom_point(col = "blue") 
le_m_plot + labs(x = "Life Expectancy at Birth (years)", 
                 y = "Gross Disposable Household Income per person (£1000s)",
                 title = "Relationship between male life expectancy and income by LA, 2016 data")

Regression Models

We know there is a significant positive correlation between LE and average GDHI, however a regression model will quantify that relationship into something more meaningful and easy to interpret. We will now calculate a simple linear regression model. In brief, we will calculate the statistical importance of the relationship (the p-value), how much is the effect (the coefficient), and how much LE variability can be explained by income figures (the R²). We will make two models, one for male and one for female life expectancy.

# Model for males
le_model_m <- lm(LE_Birth ~ GDHI_2016_PP, data = le0_m)
summary(le_model_m)

## 
## Call:
## lm(formula = LE_Birth ~ GDHI_2016_PP, data = le0_m)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4317 -0.8770  0.0387  0.9787  3.8973 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  76.30076    0.30604  249.31   <2e-16 ***
## GDHI_2016_PP  0.17186    0.01462   11.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.396 on 297 degrees of freedom
## Multiple R-squared:  0.3175, Adjusted R-squared:  0.3152 
## F-statistic: 138.2 on 1 and 297 DF,  p-value: < 2.2e-16

# Model for females
le_model_f <- lm(LE_Birth ~ GDHI_2016_PP, data = le0_f)
summary(le_model_f)

## 
## Call:
## lm(formula = LE_Birth ~ GDHI_2016_PP, data = le0_f)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5069 -0.7042  0.0050  0.8142  4.5380 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   80.0188     0.2511  318.60   <2e-16 ***
## GDHI_2016_PP   0.1642     0.0120   13.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.146 on 297 degrees of freedom
## Multiple R-squared:  0.3867, Adjusted R-squared:  0.3847 
## F-statistic: 187.3 on 1 and 297 DF,  p-value: < 2.2e-16

Q: According to these two models, which gender is expected to have higher gains of life expectancy at birth when GDHI increases?

Answer:

Males. An increase of 1 in GDHI (i.e., £1,000) is associated with an increase in life expectancy by 0.17 years. Females have a slightly smaller coefficient with 0.16 years.

Basic Regression Plots

By plotting a regression model in R we can get some useful information about how “good” our model is, such as heteroskedasticity of residuals, outliers etc. In this instance we will only focus on outliers, and we will only focus on the male life expectancy model.

We can plot a model really easily by using the plot() command. We can press enter to scroll through the plots.

# Plots
plot(le_model_m)

Let’s focus on plot no. 5:

Based on the above plot, there seems to be an issue with GDHI outlier values. We have also spotted those earlier in the density plot - around 5-6 of them. In this case we will try to remove those LAs with very high GDHI (but note there are better ways to deal with outlier values as well, including specific techniques). For this study we will do so manually; we order the dataset by GDHI value and remove the top 6 LAs.

# Remove top 6 outliers
le0_m <- le0_m[order(le0_m$GDHI_2016_PP, decreasing = T), ]
head(le0_m)

##      AreaCode Year ParentCode    ParentName               AreaName  Sex LE_Birth LowerCI95.0limit UpperCI95.0limit                LA_name IMD_rank      IMD_Quintile                LA name GDHI_2016 Pop_2016 GDHI_2016_PP
## 571 E09000020 2016  E12000007 London region Kensington and Chelsea Male 83.82800         82.48897         85.16702 Kensington and Chelsea 14172.27                Q2 Kensington and Chelsea      9814   156773       62.600
## 557 E09000013 2016  E12000007 London region Hammersmith and Fulham Male 80.41607         79.27609         81.55605 Hammersmith and Fulham 13045.04                Q2 Hammersmith and Fulham     10098   181783       55.550
## 597 E09000033 2016  E12000007 London region            Westminster Male 82.91912         81.89379         83.94446            Westminster 11616.65  Q1 most deprived            Westminster     12319   241974       50.910
## 545 E09000007 2016  E12000007 London region                 Camden Male 82.25523         81.21920         83.29127                 Camden 13155.79                Q2                 Camden     11185   249162       44.890
## 595 E09000032 2016  E12000007 London region             Wandsworth Male 80.08397         79.27273         80.89521             Wandsworth 17288.01                Q3             Wandsworth     11988   321497       37.288
## 586 E09000027 2016  E12000007 London region   Richmond upon Thames Male 82.43894         81.56494         83.31295   Richmond upon Thames 24744.86 Q5 least deprived   Richmond upon Thames      6492   195187       33.260

le0_m <- le0_m[-c(1:6), ]

Then we re-run our model using the new dataset without the outliers:

# Models
le_model_m <- lm(LE_Birth ~ GDHI_2016_PP, data = le0_m)
summary(le_model_m)

## 
## Call:
## lm(formula = LE_Birth ~ GDHI_2016_PP, data = le0_m)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2543 -0.8327 -0.0882  0.8866  4.0160 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  73.94272    0.40075  184.51   <2e-16 ***
## GDHI_2016_PP  0.29453    0.02006   14.68   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.262 on 291 degrees of freedom
## Multiple R-squared:  0.4255, Adjusted R-squared:  0.4235 
## F-statistic: 215.5 on 1 and 291 DF,  p-value: < 2.2e-16

Notice now how the expected relationship is much higher, i.e. 0.29 years for every £1,000, and how the overall variability explained is also higher (R² = 0.43).

Visualising the regression line

Click to open section:

We can visualise the linear model again using ggplot2, very easily by adding the geom_smooth() and a method = "lm" parameter (as it’s a linear model), as follows:

ggplot(le0_m, aes(LE_Birth, GDHI_2016_PP)) + 
  geom_point(col = "skyblue") + 
  geom_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'

Q: In the above plot, can you guess what the gray shaded area along the regression line represents?

Answer:

It represents the confidence interval of the regression line; in this case that is 95%. Or, in other words, it means that we’re 95% confident that the true regression line lies somewhere in that gray zone. By default, the confidence graphic in geom_smooth() is always true and set to 0.95. These parameters are controlled by the se and level arguments, like so: geom_smooth(method = "lm", se = TRUE, level = 0.95).

Adding confounders

One issue with how life expectancy is calculated is that it assumes the mortality for people born now, using mortality patterns of all ages based on individuals that were born in the past (i.e. current mortality patterns). However, mortality rates of e.g. people at 65 years of age now will probably not be same for people born now in 65 years. For example, there are reasons to believe that mortality rates of older people are much higher as an effect of the smoking epidemics in the 1970’s and 1980’s.

In this study, we will try to control for the age distribution of LAs and their association to life expectancies. For simplicity, in this example we will just include the proportion of the population over 65 years of age for each LA. We will do that as an example, although whether that parameter should be included in the model is debatable.

Firstly, let’s get the available population estimates by age group. The excel files has three sheets, the first includes notes, the second estimates for males and the third estimates for females. As aforementioned, we will only work with the male life expectancy model from now on.

# Data
pop16_link <- "https://www.ons.gov.uk/file?uri=%2fpeoplepopulationandcommunity%2fpopulationandmigration%2fpopulationprojections%2fdatasets%2flocalauthoritiesinenglandtable2%2f2016based/table2.xls"
download.file(pop16_link, "pop_16.xls", method = "curl")

pop16_m <- read_xls("pop_16.xls", sheet = 2, skip = 6, n_max = 7387, col_names = T)

# Preview
head(pop16_m)

## # A tibble: 6 x 29
##   CODE      AREA    `AGE GROUP` `2016` `2017` `2018` `2019` `2020` `2021` `2022` `2023` `2024` `2025` `2026` `2027` `2028` `2029` `2030` `2031` `2032` `2033` `2034` `2035` `2036` `2037` `2038` `2039` `2040` `2041`
##   <chr>     <chr>   <chr>        <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 E92000001 England 0-4          1758.  1734.  1723.  1719.  1714.  1708.  1710.  1707.  1704.  1700.  1697.  1693.  1689.  1684.  1679.  1674.  1670.  1668.  1668.  1670.  1675   1682.  1691.  1702.  1716.  1730.
## 2 E92000001 England 5-9          1756.  1791.  1805.  1813   1814.  1807.  1781.  1768.  1762.  1756.  1749.  1751.  1748.  1745.  1742.  1738.  1734.  1730   1725   1720.  1715.  1712.  1710.  1710.  1712.  1716.
## 3 E92000001 England 10-14        1572.  1622.  1676.  1718.  1760.  1794.  1827.  1840.  1847.  1846.  1839.  1813.  1800.  1794.  1788.  1781.  1782.  1780.  1777.  1774.  1770.  1766.  1762   1757.  1752.  1747.
## 4 E92000001 England 15-19        1632.  1600.  1582.  1578.  1588.  1621.  1668.  1720.  1761.  1802.  1835.  1868.  1881.  1888.  1888.  1880.  1854.  1842.  1836.  1830.  1822.  1824   1821.  1818.  1815   1812.
## 5 E92000001 England 20-24        1824.  1814.  1799.  1780.  1760.  1728.  1691.  1670.  1662.  1669.  1701.  1747.  1800.  1840.  1881.  1914   1947.  1960.  1967.  1967.  1960.  1934.  1921.  1915.  1910.  1902.
## 6 E92000001 England 25-29        1924.  1946.  1948.  1952.  1934.  1910.  1895.  1876.  1851.  1828.  1794.  1756.  1734.  1726.  1733.  1765.  1812.  1864.  1905.  1946.  1979   2013.  2025.  2032.  2032.  2025.

We will now have to construct a variable adding all age groups of 65 years and over.

# All age groups of 65 years and over
target_groups <- c("65-69", "70-74", "75-79", "80-84", "85-89", "90+")

# Use the %in% to subset those that match our target age groups
pop16_m_65 <- pop16_m[pop16_m$`AGE GROUP` %in% target_groups, ]

# Subset columns
pop16_m_65 <- pop16_m_65[, c("CODE",  "AREA",  "AGE GROUP", "2016")]

# Aggregate using FUN = "sum"
pop16_m_65 <- aggregate(`2016` ~ CODE + AREA, data = pop16_m_65, FUN = "sum")

We also need the total amount the obtain the ratio:

# subset the total only
pop16_m_total <- pop16_m[pop16_m$`AGE GROUP` == "All ages", ]

# Subset columns
pop16_m_total <- pop16_m_total[, c("CODE",  "AREA",  "AGE GROUP", "2016")]

# Rename and clean table
pop16_m_total$Pop_Male <- pop16_m_total$`2016`
pop16_m_total$`AGE GROUP` <- NULL
pop16_m_total$`2016` <- NULL

Merging the file will gives us two columns, the population over 65 years and the total population. Dividing the two columns will give us the proportion of people of 65 years of age and over to the total LA population, according to 2016 estimates.

# Join
pop16_m_65 <- merge(pop16_m_65, pop16_m_total, by = c("CODE", "AREA"), all.x = T)

# Calculate proportion
pop16_m_65$Pop_M65_ratio <- pop16_m_65$`2016` / pop16_m_65$Pop_Male

# Subset columns
pop16_m_65 <- pop16_m_65[, c("CODE", "Pop_M65_ratio")]

# Preview
head(pop16_m_65)

##        CODE Pop_M65_ratio
## 1 E06000001     0.1721854
## 2 E06000002     0.1443001
## 3 E06000003     0.2066869
## 4 E06000004     0.1616580
## 5 E06000005     0.1833977
## 6 E06000006     0.1672026

# Check values
summary(pop16_m_65$Pop_M65_ratio)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.05287 0.15461 0.17840 0.17978 0.20922 0.31928

The pop16_m_65 table contains the information we need to add to the LE table. we can again use the LA ONS code to join the two tables:

# Join to male LE table
le0_m <- merge(le0_m, pop16_m_65, by.x = "AreaCode", by.y = "CODE", all.x = T)

We now have a multiple linear regression model, using two explanatory variables, GDHI_2016_PP and Pop_M65_ratio. In multiple regression, it is important to look out for Multicollinearity. Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly correlated, which can be problematic. We can check the level of correlation here:

# Correlation
cor(le0_m$GDHI_2016_PP, le0_m$Pop_M65_ratio)

## [1] -0.1735995

Which shows that our two variables are not very highly (linearly) correlated. In this case the correlation value can be considered sufficient evidence.

We can now add the population over 65 ratio as a confounder to our model specification. In R, the way to do that is to add variable using the + character on the right side of the formula.

# New model
le_ext_model_m <- lm(LE_Birth ~ GDHI_2016_PP + Pop_M65_ratio, data = le0_m)
summary(le_ext_model_m)

## 
## Call:
## lm(formula = LE_Birth ~ GDHI_2016_PP + Pop_M65_ratio, data = le0_m)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3203 -0.7468  0.0707  0.7191  3.5161 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   71.13330    0.48455 146.803  < 2e-16 ***
## GDHI_2016_PP   0.32175    0.01821  17.668  < 2e-16 ***
## Pop_M65_ratio 12.78323    1.48453   8.611 4.71e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.128 on 290 degrees of freedom
## Multiple R-squared:  0.5425, Adjusted R-squared:  0.5393 
## F-statistic: 171.9 on 2 and 290 DF,  p-value: < 2.2e-16

While relatively basic, the above model shows that when controlling for the proportion of population of 65 years age and over, the expected effect of gross disposable income is 0.32 years increase for an increase in income of £1,000. Both our explanatory variables are highly significant.

Take a moment to reflect on the model outcomes:
– How can these results inform policy?
– Does this model tell us anything about trends in LE?
– What does it mean for inequalities?
– What other confounding variables could be included in the model? For instance, other demographics? What about access to healthcare? Also, is there any way we can check whether these associations are varying by area deprivation?

R coding Exercise:

– As we did above for Male LE, try to carry out the same analysis and construct a similar model using female life expectancies.

Answer:

# Read data
pop16_f <- read_xls("pop_16.xls", sheet = 3, skip = 6, n_max = 7387, col_names = T)

# Preview
head(pop16_f)

## # A tibble: 6 x 29
##   CODE      AREA    `AGE GROUP` `2016` `2017` `2018` `2019` `2020` `2021` `2022` `2023` `2024` `2025` `2026` `2027` `2028` `2029` `2030` `2031` `2032` `2033` `2034` `2035` `2036` `2037` `2038` `2039` `2040` `2041`
##   <chr>     <chr>   <chr>        <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 E92000001 England 0-4          1671.  1649.  1640.  1636.  1633.  1628.  1630.  1627.  1624.  1621.  1618.  1614.  1610.  1605.  1600   1596.  1592.  1590.  1590.  1592.  1596.  1603.  1612.  1622.  1635.  1649.
## 2 E92000001 England 5-9          1673.  1707.  1719.  1726.  1726.  1718.  1693.  1682.  1677.  1673.  1666.  1668.  1665.  1662.  1659.  1656.  1653.  1648.  1643.  1639.  1634.  1631.  1629.  1629.  1631.  1635.
## 3 E92000001 England 10-14        1498.  1544   1596.  1636.  1676.  1708.  1741.  1752.  1758.  1755.  1747.  1722.  1712.  1706.  1702   1696.  1697.  1695.  1692.  1689.  1686.  1682.  1678.  1673.  1668   1664.
## 4 E92000001 England 15-19        1548.  1516.  1499.  1496.  1507.  1538.  1582.  1632.  1671.  1710.  1741.  1774.  1785.  1791.  1789   1781   1756.  1745.  1740.  1736.  1730.  1731.  1728.  1726.  1722.  1719.
## 5 E92000001 England 20-24        1736.  1716.  1700.  1680.  1662.  1637   1602.  1581.  1574.  1583.  1612   1656.  1706   1745.  1784.  1815.  1848.  1859.  1865.  1863   1855.  1830.  1820.  1814.  1810.  1804.
## 6 E92000001 England 25-29        1887.  1895.  1882.  1873.  1851.  1821.  1797.  1775   1752.  1730   1703.  1666.  1645.  1638.  1647.  1676   1720.  1770.  1810.  1848.  1880.  1913.  1924.  1930.  1928   1920

# All age groups of 65 years and over
target_groups <- c("65-69", "70-74", "75-79", "80-84", "85-89", "90+")

# Use the %in% to subset those that match our target age groups
pop16_f_65 <- pop16_f[pop16_f$`AGE GROUP` %in% target_groups, ]

# Subset columns
pop16_f_65 <- pop16_f_65[, c("CODE",  "AREA",  "AGE GROUP", "2016")]

# Aggregate using FUN = "sum"
pop16_f_65 <- aggregate(`2016` ~ CODE + AREA, data = pop16_f_65, FUN = "sum")

# subset the total only
pop16_f_total <- pop16_f[pop16_f$`AGE GROUP` == "All ages", ]

# Subset columns
pop16_f_total <- pop16_f_total[, c("CODE",  "AREA",  "AGE GROUP", "2016")]

# Rename and clean table
pop16_f_total$Pop_Female <- pop16_f_total$`2016`
pop16_f_total$`AGE GROUP` <- NULL
pop16_f_total$`2016` <- NULL

# Join
pop16_f_65 <- merge(pop16_f_65, pop16_f_total, by = c("CODE", "AREA"), all.x = T)

# Calculate proportion
pop16_f_65$Pop_F65_ratio <- pop16_f_65$`2016` / pop16_f_65$Pop_Female

# Subset columns
pop16_f_65 <- pop16_f_65[, c("CODE", "Pop_F65_ratio")]

# Preview
head(pop16_f_65)

##        CODE Pop_F65_ratio
## 1 E06000001     0.2016807
## 2 E06000002     0.1718310
## 3 E06000003     0.2309900
## 4 E06000004     0.1879397
## 5 E06000005     0.2110092
## 6 E06000006     0.1843318

# Check values
summary(pop16_f_65$Pop_F65_ratio)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0688  0.1814  0.2065  0.2080  0.2383  0.3408

# Remove outliers and join ratios to female LE table
le0_f <- le0_f[order(le0_f$GDHI_2016_PP, decreasing = T), ]
le0_f <- le0_f[-c(1:6), ]
le0_f <- merge(le0_f, pop16_f_65, by.x = "AreaCode", by.y = "CODE", all.x = T)

# Correlation
cor(le0_f$GDHI_2016_PP, le0_f$Pop_F65_ratio)

## [1] -0.1567037

# New model
le_ext_model_f <- lm(LE_Birth ~ GDHI_2016_PP + Pop_F65_ratio, data = le0_f)
summary(le_ext_model_f)

## 
## Call:
## lm(formula = LE_Birth ~ GDHI_2016_PP + Pop_F65_ratio, data = le0_f)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9895 -0.6483  0.0046  0.6881  4.1029 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   76.06449    0.42285  179.89  < 2e-16 ***
## GDHI_2016_PP   0.27472    0.01536   17.89  < 2e-16 ***
## Pop_F65_ratio  8.83670    1.20391    7.34 2.16e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9543 on 290 degrees of freedom
## Multiple R-squared:  0.5405, Adjusted R-squared:  0.5374 
## F-statistic: 170.6 on 2 and 290 DF,  p-value: < 2.2e-16

Public Health Data Lab II

Trends in life expectancy and inequalities in England

Alexandros Alexiou

9 March 2022

Recent trends in Life Expectancy in England

Outline of the study

Getting life expectancy data

Plotting with `ggplot2`: Basic Plots

Inequalities at Local Authority level

Plotting with `ggplot2`: Advanced plots

Modelling the relationships between LE and Income

Plotting with `ggplot2`: Histograms

Correlations

Regression Models

Basic Regression Plots

Visualising the regression line

Adding confounders

R coding Exercise:

References

Public Health Data Lab II

Trends in life expectancy and inequalities in England

Alexandros Alexiou

9 March 2022

Recent trends in Life Expectancy in England

Outline of the study

Getting life expectancy data

Plotting with ggplot2: Basic Plots

Inequalities at Local Authority level

Plotting with ggplot2: Advanced plots

Modelling the relationships between LE and Income

Plotting with ggplot2: Histograms

Correlations

Regression Models

Basic Regression Plots

Visualising the regression line

Adding confounders

R coding Exercise:

References

Plotting with `ggplot2`: Basic Plots

Plotting with `ggplot2`: Advanced plots

Plotting with `ggplot2`: Histograms