While average life expectancy has increased in England in recent decades, since 2010 the rate of improvement has slowed. Presently, there is a lot of debate among experts as to why, with some of the causes mentioned being flu epidemics and austerity1 2. We probably won’t come up with any definite answers in this study, however that doesn’t prevent us from performing some basic, up-to-date analysis on the subject.
In this study we will examine how life expectancy at birth (LE) in England has evolved since 2001-03. We will also focus on the inequalities of LE among local authorities, which don’t seem to be decreasing despite the UK goverment’s efforts3. In particular, we will look into inequalities that are associated with income or sex. Besides some descriptive analysis, we will also try to build a very simple regression model describing the relationship between life expectancy and gross disposable household income at the Local Authority (LA) level.
This practical session demonstrates a step-by-step analysis on life expectancies and inequalities in England. It covers some key aspects of such studies, from getting the necessary data to building (simple) regression models exploring the relationships between life expectancies and a few social determinants at the Local Authority level.
In general, this practical should run very smoothly; running it should be quite fast and everything should be contained within the script, including downloading data. However, there is always room for more, and thus we have included several “optional” extra tasks about the analysis.
This practical is trying to keep to Base R as much as possible (i.e. without using third-party libraries). We also assume you run this within RStudio (tested for Version 1.1.453 and R version 3.6.1), which is free and you can find here.
By the end of the practical, you should be able to:
– Read data into R.
– Identify data structures.
– Subset, clean and re-code data.
– Find missing values and outliers.
– Join data tables based on common variables.
– Summarise and plot data.
– Calculate variable relationships.
– Build a linear regression model with a confounding variable.
Most of the relevant data are accessible through the Public Health England (PHE) “Fingertips” platform or through the Office for National Statistics (ONS). The code should download the relevant data automatically in your working directory folder.
We can start by loading the necessary libraries - remember to install these from the “Tools” menu first if you haven’t done so already:
# Load necessary libraries
library(fingertipsR)
library(readxl)
library(ggplot2)
We will download LE at birth data directly from PHE using the fingertips library. If you are wondering how to find the necessary fingertips codes for what you want these two commands will help you (optional):
# Available Indicators
ind_codes <- indicators()
# Available Area Types
area_types()
An easy way of “eyeballing” the data within RStudio is by viewing the file using either View(ind_codes) or clicking the file from your environment tab. When the table opens, click filter button at the top left, and within the textbox that appears above each column, type “life expectancy” for column IndicatorName in order to search for it. Indicator code 90366 indicates the LE at birth for all persons and an AreaTypeID of 101 returns data at the Lower Tier LA level (Districts and UAs).
Now we can load the data into R:
# Get data
le0 <- fingertips_data(IndicatorID = 90366, AreaTypeID = 101)
# View top of data
head(le0)
## IndicatorID IndicatorName ParentCode ParentName AreaCode AreaName AreaType Sex Age CategoryType Category Timeperiod Value LowerCI95.0limit UpperCI95.0limit LowerCI99.8limit UpperCI99.8limit Count Denominator Valuenote RecentTrend ComparedtoEnglandvalueorpercentiles ComparedtoRegionvalueorpercentiles TimeperiodSortable Newdata Comparedtogoal
## 1 90366 Life expectancy at birth <NA> <NA> E92000001 England England Male All ages <NA> <NA> 2001 - 03 76.19791 76.16661 76.22922 NA NA NA NA <NA> <NA> Not compared Not compared 20010000 <NA> <NA>
## 2 90366 Life expectancy at birth <NA> <NA> E92000001 England England Female All ages <NA> <NA> 2001 - 03 80.69957 80.67004 80.72910 NA NA NA NA <NA> <NA> Not compared Not compared 20010000 <NA> <NA>
## 3 90366 Life expectancy at birth E92000001 England E12000001 North East region Region Male All ages <NA> <NA> 2001 - 03 74.70091 74.56450 74.83732 NA NA NA NA <NA> <NA> Worse Not compared 20010000 <NA> <NA>
## [ reached 'max' / getOption("max.print") -- omitted 3 rows ]
# Structure
str(le0)
## 'data.frame': 10728 obs. of 26 variables:
## $ IndicatorID : int 90366 90366 90366 90366 90366 90366 90366 90366 90366 90366 ...
## $ IndicatorName : chr "Life expectancy at birth" "Life expectancy at birth" "Life expectancy at birth" "Life expectancy at birth" ...
## $ ParentCode : chr NA NA "E92000001" "E92000001" ...
## $ ParentName : chr NA NA "England" "England" ...
## $ AreaCode : chr "E92000001" "E92000001" "E12000001" "E12000002" ...
## $ AreaName : chr "England" "England" "North East region" "North West region" ...
## $ AreaType : chr "England" "England" "Region" "Region" ...
## $ Sex : chr "Male" "Female" "Male" "Male" ...
## $ Age : chr "All ages" "All ages" "All ages" "All ages" ...
## $ CategoryType : chr NA NA NA NA ...
## $ Category : chr NA NA NA NA ...
## $ Timeperiod : chr "2001 - 03" "2001 - 03" "2001 - 03" "2001 - 03" ...
## $ Value : num 76.2 80.7 74.7 74.8 75.5 ...
## $ LowerCI95.0limit : num 76.2 80.7 74.6 74.7 75.4 ...
## $ UpperCI95.0limit : num 76.2 80.7 74.8 74.9 75.6 ...
## $ LowerCI99.8limit : num NA NA NA NA NA NA NA NA NA NA ...
## $ UpperCI99.8limit : num NA NA NA NA NA NA NA NA NA NA ...
## $ Count : num NA NA NA NA NA NA NA NA NA NA ...
## $ Denominator : num NA NA NA NA NA NA NA NA NA NA ...
## $ Valuenote : chr NA NA NA NA ...
## $ RecentTrend : chr NA NA NA NA ...
## $ ComparedtoEnglandvalueorpercentiles: chr "Not compared" "Not compared" "Worse" "Worse" ...
## $ ComparedtoRegionvalueorpercentiles : chr "Not compared" "Not compared" "Not compared" "Not compared" ...
## $ TimeperiodSortable : int 20010000 20010000 20010000 20010000 20010000 20010000 20010000 20010000 20010000 20010000 ...
## $ Newdata : chr NA NA NA NA ...
## $ Comparedtogoal : chr NA NA NA NA ...
Without going into much detail, the str() function is very helpful when you are performing any calculations and load data into R through text files, excels etc., as many times R will read some values wrong. This is particularly true when handing numerical data; for instance, missing values and decimal types might need your attention, and it is best if you caught these “errors” early on.
Notice the structure of the data; these are generally called panel data, i.e. data with observations accross a number of time periods. There are generally two ways with which these are supplied, either in long or wide format. Wide format is when each column represents a time period, while long is when each line represents a time period.
Q: With regards to the format, what kind of panel data is the life expectancies table? Answer:
It’s in long format.
Depending on what you want to accomplish, you might want to tranform between the two, which is not particularly easy. Fortunately we don’t need to do this now, but in case you are wondering take a look at the reshape() function.
The data are also supplied for multiple levels, so besides LA level, national and regional data are also supplied. Furthermore, LE data are supplied as 3-year rolling averages. It would probably be helpful if we had a numeric year to represent every observation, such as the mid-year.
# Change column names by making a copy to a new one and deleting the original
# We can also use names(le0)[7] <- "LE_Birth" (i.e. change the 7th column name)
le0$LE_Birth <- le0$Value
le0$Value <- NULL
# Mid-year
le0$Year <- substr(x = le0$Timeperiod, start = 1, stop = 4)
le0$Year <- as.numeric(le0$Year) + 1
Now let’s try to make a simple plot regarding national trends in England. First, we must subset the data.
# England trend
le0_eng <- le0[le0$AreaName == "England", ]
le0_eng_male <- le0_eng[le0_eng$Sex == "Male", ]
# Simple plot
plot(le0_eng_male$Year, le0_eng_male$LE_Birth)
Q: Can you make a similar plot for female life expectancy? Answer:
le0_eng_female <- le0_eng[le0_eng$Sex == "Female", ]
plot(le0_eng_female$Year, le0_eng_female$LE_Birth)
It would be nice if we could overlay trends in both male and female LE at birth have to create another data frame with females plot it, This is rather ugly in base R - as you probably have noticed - so from now on we will use the ggplot2 library to visualize data. ggplot2 has its own grammar, so it might be confusing at first. However, it is easy once you get used to; the best way to get started is to look online, including this cheetsheet.
Let’s start with a simple plot:
ggplot(data = le0_eng_male, aes(x = Year, y = LE_Birth)) +
geom_point()
The premise is very simple; you tell R to create a plot using the data = le0_eng_male. Then you add the aesthetics, essentially telling it how to plot the data you just supplied. We specify the Year on the x-axis and LE_Birth on the y-axis. However, we are not telling it how to represent them; if we want them as points, we add a geometry with geom_point(). If we wanted lines, we would add geom_line(). If we wanted both, we could add both.
We can add another argument in the aesthetics, group = Sex with which we specify to group observations by sex. We can also add some labels with labs. The rest are done automatically:
ggplot(le0_eng, aes(x = Year, y = LE_Birth, group = Sex)) +
geom_point() +
geom_line(aes(color = Sex)) +
labs(y = "Life Expectancy at Birth",
x = "Mid-Year of 3-Year Period",
title = "Trends in Life Expectancy at birth between 2001-03 and 2015-17 in England")
Since we could like to colour every line differently we can specify geom_line(aes(color = Sex)) to do that. R will pick two colours automatically to make the distinction
Q: The points above are all the default black colour. What do we need to change to make them the same colour as the lines? Answer:
Replace with geom_point(aes(color = Sex)).
The second part of this analysis is to make a simple regression model looking at the realtionship between LE and average income. We won’t be using panel data in this intance; we will try to build this model just for the latest (2016) data. Let’s begin by subsetting the LE table:
# LA - level data for latest year
# Subset latest year
le0_16 <- le0[le0$Year == "2016", ]
We now need to attach the income data to the le0_16 table. For this example we will use the Gross Disposable Household Income (GDHI) estimates per Local Authority. We use the {readxl} library to read excel files directly into R. It is very fast, but note that it reads data into a tibble format, which is a format you might not be familiar with (but does not really matter in this case).
# Income
gdhi_link <- "https://www.ons.gov.uk/file?uri=%2feconomy%2fregionalaccounts%2fgrossdisposablehouseholdincome%2fdatasets%2fregionalgrossdisposablehouseholdincomegdhibylocalauthorityintheuk%2f1997to2016/vcregionalgdhibylareordered.xlsx"
download.file(gdhi_link, "gdhi.xlsx", method = "curl")
# Read Excel
gdhi <- read_xlsx("gdhi.xlsx", sheet = 2, skip = 2, n_max = 392, col_names = T)
Income figures are not per person, but rather as totals in £ millions. Thankfully, the next sheet (no. 3) in the excel file has the total population estimates per LA.
# Read Excel
pop <- read_xlsx("gdhi.xlsx", sheet = 3, skip = 2, n_max = 392, col_names = T)
We can now calculate values per person for 2016:
# Subset year 2016
gdhi <- gdhi[, c("LAU1 code", "LA name", "2016")]
pop <- pop[, c("LAU1 code", "LA name", "2016")]
# Rename
gdhi$GDHI_2016 <- gdhi$`2016`
gdhi$`2016` <- NULL
pop$Pop_2016 <- pop$`2016`
pop$`2016` <- NULL
# Merge
gdhi <- merge(gdhi, pop, by = c("LAU1 code", "LA name"), all = T)
gdhi$GDHI_2016_PP <- gdhi$GDHI_2016 / gdhi$Pop_2016
gdhi$GDHI_2016_PP <- round(gdhi$GDHI_2016_PP*1000, 3) # in £ thousands
Let’s take a look at the distribution of the income data. We will use the histogram geometry of ggplot, which is essentially adding geom_histogram() to the core plot:
# Descriptives
summary(gdhi$GDHI_2016_PP)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.23 16.65 18.81 19.78 21.13 62.60
# Distribution
# The geom_histogram makes a histogram from the data
ggplot(data = gdhi, aes(x = GDHI_2016_PP)) +
geom_histogram(bins = 50, fill="steelblue", color = "white")
Notice are some outliers in the data values, possibly in London, which might become a problem later in the analysis. For now though we can proceed and join our GDHI data to our LE data:
# Merge with LE
le0_16 <- merge(le0_16, gdhi, by.x = "AreaCode", by.y = "LAU1 code", all.x = T)
# Check for irregularities again
summary(le0_16)
## AreaCode Year ParentCode ParentName AreaName Sex LE_Birth LowerCI95.0limit UpperCI95.0limit LA_name IMD_rank IMD_Quintile LA name GDHI_2016 Pop_2016 GDHI_2016_PP
## Length:648 Min. :2016 Length:648 Length:648 Length:648 Length:648 Min. :74.16 Min. :73.54 Min. :74.77 Length:648 Min. : 6683 Q1:128 Length:648 Min. : 608 Min. : 34475 Min. :12.23
## Class :character 1st Qu.:2016 Class :character Class :character Class :character Class :character 1st Qu.:79.90 1st Qu.:79.35 1st Qu.:80.50 Class :character 1st Qu.:13407 Q2:130 Class :character 1st Qu.: 1940 1st Qu.: 98780 1st Qu.:17.39
## Mode :character Median :2016 Mode :character Mode :character Mode :character Mode :character Median :81.60 Median :81.00 Median :82.23 Mode :character Median :17785 Q3:130 Mode :character Median : 2590 Median : 132810 Median :19.30
## Mean :2016 Mean :81.56 Mean :80.99 Mean :82.14 Mean :17710 Q4:130 Mean : 3389 Mean : 170551 Mean :20.22
## 3rd Qu.:2016 3rd Qu.:83.44 3rd Qu.:82.86 3rd Qu.:83.98 3rd Qu.:21872 Q5:130 3rd Qu.: 3978 3rd Qu.: 208997 3rd Qu.:21.84
## Max. :2016 Max. :86.55 Max. :85.98 Max. :87.11 Max. :29719 Max. :15898 Max. :1128077 Max. :62.60
We will perform the analysis separately for males and females, and so we will create two separate files:
# Split
le0_m <- le0_16[le0_16$Sex == "Male", ]
le0_f <- le0_16[le0_16$Sex == "Female", ]
A first step in the analysis is to look for any correlations between the two variables. Correlation gives us an idea on the level of association between two variables. This is easily carried out with the cor() function. By default, this gives us the Pearson’s correlation. The cor.test() is essentially the same, but it gives a little bit more information than just the correlation value.
# Correlations
cor(le0_m$LE_Birth, le0_m$GDHI_2016_PP)
## [1] 0.5576461
cor.test(le0_f$LE_Birth, le0_f$GDHI_2016_PP)
##
## Pearson's product-moment correlation
##
## data: le0_f$LE_Birth and le0_f$GDHI_2016_PP
## t = 13.877, df = 322, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5386832 0.6756615
## sample estimates:
## cor
## 0.6117375
It seems there is a high correlation between LE and GDHI. Plotting the data might also be useful in visualising this relationship, particularly spotting any possible outliers:
# Plot
le_f_plot <- ggplot(data = le0_f, aes(LE_Birth, GDHI_2016_PP)) + geom_point(col = "orchid")
le_f_plot
Notice this time we saved the plot in an object, le_f_plot, which we can call whenever we want. The best part of this is that now we can add more details to it, like labs() using the +.
Q: Can you make the same plot for males and add axis labels and a title? Try to make it with a different colour for points. Answer:
le_m_plot <- ggplot(le0_m, aes(LE_Birth, GDHI_2016_PP)) + geom_point(col = "blue")
le_m_plot + labs(x = "Life Expectancy at Birth",
y = "Gross Disposable Household Income per person",
title = "Relationship between male life expectancy and income by LA, 2016 data")
#### Regression Model
We know there is a significant positive correlation between LE and average GDHI, however a regression model will quantify that relationship into something more meaningful and interpretable. We will make two simple linear regression models, one for males and one for females. In short, we will calculate the statistical importance of the relationship (the p-value), how much is the effect (the coefficient), and how much LE variability can be explained by income figures (the R2). We will make two models, one for male and one for female life expectancy.
# Model for males
le_model_m <- lm(LE_Birth ~ GDHI_2016_PP, data = le0_m)
summary(le_model_m)
##
## Call:
## lm(formula = LE_Birth ~ GDHI_2016_PP, data = le0_m)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1626 -0.8914 0.1276 0.9950 2.9635
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 76.41332 0.28964 263.82 <2e-16 ***
## GDHI_2016_PP 0.16692 0.01385 12.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.338 on 322 degrees of freedom
## Multiple R-squared: 0.311, Adjusted R-squared: 0.3088
## F-statistic: 145.3 on 1 and 322 DF, p-value: < 2.2e-16
# Model for females
le_model_f <- lm(LE_Birth ~ GDHI_2016_PP, data = le0_f)
summary(le_model_f)
##
## Call:
## lm(formula = LE_Birth ~ GDHI_2016_PP, data = le0_f)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3581 -0.6481 -0.0095 0.7769 2.7562
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.20438 0.23378 343.08 <2e-16 ***
## GDHI_2016_PP 0.15509 0.01118 13.88 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.08 on 322 degrees of freedom
## Multiple R-squared: 0.3742, Adjusted R-squared: 0.3723
## F-statistic: 192.6 on 1 and 322 DF, p-value: < 2.2e-16
Q: According to these two models, which gender is expected to have higher gains of life expectancy at birth when their GDHI increases? Answer:
Males. An increase of £1,000 is expected to increase their life expectancy by 0.167 years. Females have a slightly smaller coefficient with 0.155 years.
We can also plot the model for some useful information about heteroskedasticity of residuals, outliers etc. We can press enter to scroll through the plots.
# Plots
plot(le_model_m)
plot(le_model_f)
Based on one of the plots, there seems to be an issue with GDHI outlier values. We have also spotted those earlier in the density plot - around 5-6 of them. In this case we will try to remove those LAs with very high GDHI (but note there are other ways to deal with outlier values as well, including some specific techniques). For this study we will do so manually; we order the dataset and remove the top 6 LAs according to GDHI.
From now on we will only focus on the male life expectancy model.
# Remove top 6 outliers
le0_m <- le0_m[order(le0_m$GDHI_2016_PP, decreasing = T), ]
le0_m <- le0_m[-c(1:6), ]
Then we re-run our model using the new dataset without the outliers:
# Models
le_model_m <- lm(LE_Birth ~ GDHI_2016_PP, data = le0_m)
summary(le_model_m)
##
## Call:
## lm(formula = LE_Birth ~ GDHI_2016_PP, data = le0_m)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1576 -0.6726 -0.0002 0.8608 3.0677
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73.9833 0.3668 201.73 <2e-16 ***
## GDHI_2016_PP 0.2928 0.0183 15.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.185 on 316 degrees of freedom
## Multiple R-squared: 0.4474, Adjusted R-squared: 0.4456
## F-statistic: 255.8 on 1 and 316 DF, p-value: < 2.2e-16
Notice now how the expected relationship is much higher, to 0.29 years for every £1,000, and how the overall variability explained is also higher (R2 = 0.447 / 0.446).
We can visualise the linear model again using ggplot, very easily by adding the geom_smooth() and a method = "lm" parameter (i.e. a linear model), as follows:
ggplot(le0_m, aes(LE_Birth, GDHI_2016_PP)) +
geom_point(col = "skyblue") +
geom_smooth(method = "lm")
Q: In the above plot, can you guess what the grey shaded area along the regression line represents? Answer:
It represents the confidence interval of the regression line; in this case that is 95%. Or, in other words, it means that we’re 95% confident that the true regression line lies somewhere in that gray zone. By default, the confidence graphic in geom_smooth() is always true and set to 0.95. These parameters are controlled by the se and level arguments, like so: geom_smooth(method = "lm", se = TRUE, level = 0.95).
One issue with how life expectancy is calculated is that it assumes the mortality for people born now, using mortality patterns of all ages based on how they are now. However, mortality ratios of e.g. people at 75 years of age now will probably not be same for people born now in 75 years. For example, there are reasons to believe that mortality ratios of older people are much higher as an effect of the smoking epidemics in the 1970’s and 1980’s.
In this study, we will try to control for the age distribution of LAs and their association to life expectancies. For simplicity, in this example we will just include the proportion of the population over 65 years of age for each LA.
Firstly, let’s get the available population estimates by age group. The excel files has three sheets, the first includes notes, the second estimates for males and the third estimates for females. As aforementioned, we will only work with the male life expectancy model from now on.
# Data
pop16_link <- "https://www.ons.gov.uk/file?uri=%2fpeoplepopulationandcommunity%2fpopulationandmigration%2fpopulationprojections%2fdatasets%2flocalauthoritiesinenglandtable2%2f2016based/table2.xls"
download.file(pop16_link, "pop_16.xls", method = "curl")
pop16_m <- read_xls("pop_16.xls", sheet = 2, skip = 6, n_max = 7387, col_names = T)
# Preview
head(pop16_m)
## # A tibble: 6 x 29
## CODE AREA `AGE GROUP` `2016` `2017` `2018` `2019` `2020` `2021` `2022` `2023` `2024` `2025` `2026` `2027` `2028` `2029` `2030` `2031` `2032` `2033` `2034` `2035` `2036` `2037` `2038` `2039` `2040` `2041`
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 E92000001 England 0-4 1758. 1734. 1723. 1719. 1714. 1708. 1710. 1707. 1704. 1700. 1697. 1693. 1689. 1684. 1679. 1674. 1670. 1668. 1668. 1670. 1675 1682. 1691. 1702. 1716. 1730.
## 2 E92000001 England 5-9 1756. 1791. 1805. 1813 1814. 1807. 1781. 1768. 1762. 1756. 1749. 1751. 1748. 1745. 1742. 1738. 1734. 1730 1725 1720. 1715. 1712. 1710. 1710. 1712. 1716.
## 3 E92000001 England 10-14 1572. 1622. 1676. 1718. 1760. 1794. 1827. 1840. 1847. 1846. 1839. 1813. 1800. 1794. 1788. 1781. 1782. 1780. 1777. 1774. 1770. 1766. 1762 1757. 1752. 1747.
## 4 E92000001 England 15-19 1632. 1600. 1582. 1578. 1588. 1621. 1668. 1720. 1761. 1802. 1835. 1868. 1881. 1888. 1888. 1880. 1854. 1842. 1836. 1830. 1822. 1824 1821. 1818. 1815 1812.
## 5 E92000001 England 20-24 1824. 1814. 1799. 1780. 1760. 1728. 1691. 1670. 1662. 1669. 1701. 1747. 1800. 1840. 1881. 1914 1947. 1960. 1967. 1967. 1960. 1934. 1921. 1915. 1910. 1902.
## 6 E92000001 England 25-29 1924. 1946. 1948. 1952. 1934. 1910. 1895. 1876. 1851. 1828. 1794. 1756. 1734. 1726. 1733. 1765. 1812. 1864. 1905. 1946. 1979 2013. 2025. 2032. 2032. 2025.
We will now have to construct a variable adding all age groups of 65 years and over.
# All age groups of 65 years and over
target_groups <- c("65-69", "70-74", "75-79", "80-84", "85-89", "90+")
# Use the %in% to subset those that match our target age groups
pop16_m_65 <- pop16_m[pop16_m$`AGE GROUP` %in% target_groups, ]
# Subset columns
pop16_m_65 <- pop16_m_65[, c("CODE", "AREA", "AGE GROUP", "2016")]
# Aggregate using FUN = "sum"
pop16_m_65 <- aggregate(`2016` ~ CODE + AREA, data = pop16_m_65, FUN = "sum")
We also need the total amount the obtain the ratio:
# subset the total only
pop16_m_total <- pop16_m[pop16_m$`AGE GROUP` == "All ages", ]
# Subset columns
pop16_m_total <- pop16_m_total[, c("CODE", "AREA", "AGE GROUP", "2016")]
# Rename and clean table
pop16_m_total$Pop_Male <- pop16_m_total$`2016`
pop16_m_total$`AGE GROUP` <- NULL
pop16_m_total$`2016` <- NULL
Merging the file will gives us two columns, the population over 65 years and the total population. Dividing the two columns will give us the proportion of people of 65 years of age and over to the total LA population, according to 2016 estimates.
# Join
pop16_m_65 <- merge(pop16_m_65, pop16_m_total, by = c("CODE", "AREA"), all.x = T)
# Calculate proportion
pop16_m_65$Pop_M65_ratio <- pop16_m_65$`2016` / pop16_m_65$Pop_Male
# Subset columns
pop16_m_65 <- pop16_m_65[, c("CODE", "Pop_M65_ratio")]
# Preview
head(pop16_m_65)
## CODE Pop_M65_ratio
## 1 E06000001 0.1721854
## 2 E06000002 0.1443001
## 3 E06000003 0.2066869
## 4 E06000004 0.1616580
## 5 E06000005 0.1833977
## 6 E06000006 0.1672026
# Check values
summary(pop16_m_65$Pop_M65_ratio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.05287 0.15461 0.17840 0.17978 0.20922 0.31928
The pop16_m_65 table contains the information we need to add to the LE table. we can again use the LA ONS code to join the two tables:
# Join to male LE table
le0_m <- merge(le0_m, pop16_m_65, by.x = "AreaCode", by.y = "CODE", all.x = T)
We now have a multiple linear regression model, using two explanatory variables, GDHI_2016_PP and Pop_M65_ratio. In multiple regression, it is important to look out for Multicollinearity. Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly correlated, which can be problematic, as the model assumes no association to work correctly. We can check the level of correlation here:
# Correlation
cor(le0_m$GDHI_2016_PP, le0_m$Pop_M65_ratio)
## [1] -0.1601987
Which shows that our two variables are not very highly linearly correlated. In this case the correlation value can be considered sufficient evidence.
We can now add the population over 65 ratio as a confounder to our model specification. In R, the way to do that is to add variable using the + character on the right side of the formula.
# New model
le_ext_model_m <- lm(LE_Birth ~ GDHI_2016_PP + Pop_M65_ratio, data = le0_m)
summary(le_ext_model_m)
##
## Call:
## lm(formula = LE_Birth ~ GDHI_2016_PP + Pop_M65_ratio, data = le0_m)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1206 -0.5313 0.0099 0.6532 2.5376
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.7915 0.4048 174.87 <2e-16 ***
## GDHI_2016_PP 0.3223 0.0154 20.93 <2e-16 ***
## Pop_M65_ratio 14.4678 1.2089 11.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.984 on 315 degrees of freedom
## Multiple R-squared: 0.6201, Adjusted R-squared: 0.6177
## F-statistic: 257.1 on 2 and 315 DF, p-value: < 2.2e-16
The final model shows that when controlling for the proportion of population of 65 years age and over, the expected effect of gross disposable income is 0.32 years increase for an increase of £1,000. Both our explanatory variables are highly significant.
Note that when there is an increase in the percentage of population of 65 years age and over, our model expects also an increase in life expectancy. This which might not be easily explained. Perhaps that are other underlying reasons regarding specific latent characteristics of LAs that contribute to these relationships, which might also be related to the Pop_M65_ratio variable.
For this purpose, maybe a fixed-effects regression model, controlling for characteristics within each Local Authority, would be more appropriate. We will look at these types of models in later sessions.
– As we did above for Male LE, try to carry out the same analysis and construct a similar model using female life expectancies.
– What differences can you see?
Answer:
# Read data
pop16_f <- read_xls("pop_16.xls", sheet = 3, skip = 6, n_max = 7387, col_names = T)
# Preview
head(pop16_f)
## # A tibble: 6 x 29
## CODE AREA `AGE GROUP` `2016` `2017` `2018` `2019` `2020` `2021` `2022` `2023` `2024` `2025` `2026` `2027` `2028` `2029` `2030` `2031` `2032` `2033` `2034` `2035` `2036` `2037` `2038` `2039` `2040` `2041`
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 E92000001 England 0-4 1671. 1649. 1640. 1636. 1633. 1628. 1630. 1627. 1624. 1621. 1618. 1614. 1610. 1605. 1600 1596. 1592. 1590. 1590. 1592. 1596. 1603. 1612. 1622. 1635. 1649.
## 2 E92000001 England 5-9 1673. 1707. 1719. 1726. 1726. 1718. 1693. 1682. 1677. 1673. 1666. 1668. 1665. 1662. 1659. 1656. 1653. 1648. 1643. 1639. 1634. 1631. 1629. 1629. 1631. 1635.
## 3 E92000001 England 10-14 1498. 1544 1596. 1636. 1676. 1708. 1741. 1752. 1758. 1755. 1747. 1722. 1712. 1706. 1702 1696. 1697. 1695. 1692. 1689. 1686. 1682. 1678. 1673. 1668 1664.
## 4 E92000001 England 15-19 1548. 1516. 1499. 1496. 1507. 1538. 1582. 1632. 1671. 1710. 1741. 1774. 1785. 1791. 1789 1781 1756. 1745. 1740. 1736. 1730. 1731. 1728. 1726. 1722. 1719.
## 5 E92000001 England 20-24 1736. 1716. 1700. 1680. 1662. 1637 1602. 1581. 1574. 1583. 1612 1656. 1706 1745. 1784. 1815. 1848. 1859. 1865. 1863 1855. 1830. 1820. 1814. 1810. 1804.
## 6 E92000001 England 25-29 1887. 1895. 1882. 1873. 1851. 1821. 1797. 1775 1752. 1730 1703. 1666. 1645. 1638. 1647. 1676 1720. 1770. 1810. 1848. 1880. 1913. 1924. 1930. 1928 1920
# All age groups of 65 years and over
target_groups <- c("65-69", "70-74", "75-79", "80-84", "85-89", "90+")
# Use the %in% to subset those that match our target age groups
pop16_f_65 <- pop16_f[pop16_f$`AGE GROUP` %in% target_groups, ]
# Subset columns
pop16_f_65 <- pop16_f_65[, c("CODE", "AREA", "AGE GROUP", "2016")]
# Aggregate using FUN = "sum"
pop16_f_65 <- aggregate(`2016` ~ CODE + AREA, data = pop16_f_65, FUN = "sum")
# subset the total only
pop16_f_total <- pop16_f[pop16_f$`AGE GROUP` == "All ages", ]
# Subset columns
pop16_f_total <- pop16_f_total[, c("CODE", "AREA", "AGE GROUP", "2016")]
# Rename and clean table
pop16_f_total$Pop_Female <- pop16_f_total$`2016`
pop16_f_total$`AGE GROUP` <- NULL
pop16_f_total$`2016` <- NULL
# Join
pop16_f_65 <- merge(pop16_f_65, pop16_f_total, by = c("CODE", "AREA"), all.x = T)
# Calculate proportion
pop16_f_65$Pop_F65_ratio <- pop16_f_65$`2016` / pop16_f_65$Pop_Female
# Subset columns
pop16_f_65 <- pop16_f_65[, c("CODE", "Pop_F65_ratio")]
# Preview
head(pop16_f_65)
## CODE Pop_F65_ratio
## 1 E06000001 0.2016807
## 2 E06000002 0.1718310
## 3 E06000003 0.2309900
## 4 E06000004 0.1879397
## 5 E06000005 0.2110092
## 6 E06000006 0.1843318
# Check values
summary(pop16_f_65$Pop_F65_ratio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0688 0.1814 0.2065 0.2080 0.2383 0.3408
# Remove outliers and join ratios to female LE table
le0_f <- le0_f[order(le0_f$GDHI_2016_PP, decreasing = T), ]
le0_f <- le0_f[-c(1:6), ]
le0_f <- merge(le0_f, pop16_f_65, by.x = "AreaCode", by.y = "CODE", all.x = T)
# Correlation
cor(le0_f$GDHI_2016_PP, le0_f$Pop_F65_ratio)
## [1] -0.1419296
# New model
le_ext_model_f <- lm(LE_Birth ~ GDHI_2016_PP + Pop_F65_ratio, data = le0_f)
summary(le_ext_model_f)
##
## Call:
## lm(formula = LE_Birth ~ GDHI_2016_PP + Pop_F65_ratio, data = le0_f)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.50036 -0.58993 0.02241 0.59656 2.28810
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 75.73595 0.34866 217.22 <2e-16 ***
## GDHI_2016_PP 0.27737 0.01285 21.58 <2e-16 ***
## Pop_F65_ratio 10.07454 0.96924 10.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8235 on 315 degrees of freedom
## Multiple R-squared: 0.623, Adjusted R-squared: 0.6207
## F-statistic: 260.3 on 2 and 315 DF, p-value: < 2.2e-16