April 5, 2019

Isn't this a beautiful map of life expectancy across the world through the years?

What is the purpose of this presentation?

  • Serves purposes of instruction, and also serves as a personal record for myself for methods used.
  • I've been using R for a while now, but it was only recently in a GIS course that I realized that this lovely software package can be used to represent spatial data as well.
  • I will be walking through the methods outlined by Jakub Nowosab in his blog post.
  • I will also try to use these methods to create a map of life expectancy in the US through the years.

Let's get right to it

We are going to use R to download and visualize life expectancy data for all the countries in the world from 1963 to 2013.

Let us load the libraries we need for this project.

library(tidyverse)
library(sf)
library(tmap)
library(rnaturalearth)
library(wbstats)
library(devtools)
library(magick)
library(knitr)

(wbstats)

  • This is a package that enables you to interface directly with World Bank open data.
  • You do not need to sift through mounds of data.
  • The function of interest is wb()
  • Type in ?wb to read the documentation for this function.

Documentation

?wb
  • Under documentation, we see that that we can get started by just specifying an indicator (a measure of interest, like life expectancy), a country, and a year (startdate/enddate.)

To see the list of indicators, type this:

View(wbindicators(lang =c("en"))) 
  • Gives you a dataframe with indicatorID for indicators
  • Thousands of indicators are available, such as completeness of birth registration, mortality rates, etc.
  • Using the search function in the viewer, we see that the code for 'life expectancy at birth, total (years)' is 'SP.DYN.LE00.IN'

So our initial code to access the data is:

wb(indicator = "SP.DYN.LE00.IN")

But don't run it just yet! We need to provide more arguments for the function to work.

The next argument that the wb function needs is countries:

We want the data to be pulled from all countries. We can specify this using country = "countries_only". Our code now is:

wb(indicator = "SP.DYN.LE00.IN", country = "countries_only")

Now all we need is a startdate and an enddate:

  • We determined at the start that we wanted life expectancy data from 1963 to 2013.
  • Putting it all together, this is what the wb() call looks like.
wb(indicator = "SP.DYN.LE00.IN", country = "countries_only", 
   startdate = 1963, enddate = 2013)
  • However, this dataset is not easy to read.
  • It would be a lot easier to read if the life_expectancy at birth label was a column of its own.
  • Let us write a function to give us a neat dataframe.

I'm creating a function called wb_data_create2() to do this for me.

  • This function is a slight modification of Jakub Nowosab's wb_data_create() function.
  • I have changed it to avoid incorporating the map_dfr() function in my code.
wb_data_create2 = function(indicator, new_name, year1, year2, ...){
  df = wb(country = "countries_only", indicator = indicator,
          startdate = year1, enddate = year2, ...) %>%
    select(iso_a2 = iso2c, value, year = date) %>%
    mutate(indicator = new_name) %>%
    spread(indicator, value) %>%
    as_data_frame()
  # return the dataframe
  df
}

What is this function doing?

wb_data_create2 = function(indicator, new_name, year1, year2, ...){
  df = wb(country = "countries_only", indicator = indicator,
          startdate = year1, enddate = year2, ...) %>%
    select(iso_a2 = iso2c, value, year = date) %>%
    mutate(indicator = new_name) %>%
    spread(indicator, value) %>%
    as_data_frame()
  # return the dataframe
  df
}

It takes four inputs to obtain a dataframe:

  • indicator (The indicatorID)
  • new_name (What you want to rename the indicator ID to)
  • year1 (Year from which to begin aggregating data)
  • year2 (Year at which to stop collecting data)

What is this function doing?

  • Extracts three columns from the dataframe (iso2c, value and date)- These are the country name, life expectancy value, and year respectively.
  • Renames iso2c to iso_a2 (We need this column to be named this way for a join later).
  • adds a new column called new_name to replace the indicatorID and drops the life expectancy values under it.

How would you use it?

wb_data_create2("SP.DYN.LE00.IN", "life_expectancy", 1963, 2013)

Here's the first few rows of the output:

## # A tibble: 6 x 3
##   iso_a2 year  life_expectancy
##   <chr>  <chr>           <dbl>
## 1 AE     1963             55.4
## 2 AE     1964             56.4
## 3 AE     1965             57.4
## 4 AE     1966             58.3
## 5 AE     1967             59.2
## 6 AE     1968             60.1

This gives us a dataframe with life expectancy for every country for every year in the interval!

Let's make this dataset manageable

  • 5 year intervals could serve as useful checkpoints to assess if life expectancy is improving or not.
  • Now let's filter it to only include data for every 5 years ( 1963, 1968, 1973 and so on until 2013)
index <- seq(1963, 2013, by = 5)

df <- wb_data_create2("SP.DYN.LE00.IN", "life_expectancy", 1963, 2013)

data_life_exp_MINE <-  df %>% filter(year %in% index)

Here's the output

## # A tibble: 6 x 3
##   iso_a2 year  life_expectancy
##   <chr>  <chr>           <dbl>
## 1 AE     1963             55.4
## 2 AE     1968             60.1
## 3 AE     1973             63.8
## 4 AE     1978             66.7
## 5 AE     1983             69.0
## 6 AE     1988             70.9

Hooray! This yields a dataframe with 2143 rows and 3 columns, with life expectancies for each country given at 5-year intervals.

Let's use a dplyr pipe to group our life expectancies by year and get the average:

data_life_exp_avg = data_life_exp_MINE %>% 
  group_by(year) %>% 
  summarise(mean_life_exp = mean(life_expectancy, na.rm = TRUE)) 

Here's the output:

## # A tibble: 11 x 2
##    year  mean_life_exp
##    <chr>         <dbl>
##  1 1963           55.2
##  2 1968           57.3
##  3 1973           59.3
##  4 1978           61.1
##  5 1983           63.0
##  6 1988           64.5
##  7 1993           65.4
##  8 1998           66.5
##  9 2003           67.9
## 10 2008           69.8
## 11 2013           71.4

We can use ggplot() to visualize it:

ggplot() +
  geom_line(data = data_life_exp_MINE, 
            aes(year, life_expectancy, group = iso_a2), 
            color = "gray40") + 
  geom_line(data = data_life_exp_avg,
            aes(year, mean_life_exp, group = 1),
            color = "red", size = 2) +
  labs(x = "Year", y = "Life expectancy") + 
  theme_minimal()

Here's the output:

The new few slides break down the ggplot visualization.

Feel free to skip ahead to slide 45 if you're already familiar with the intricacies of ggplot().

Life expectancy through the years

The data_life_exp_MINE dataframe has a value of life expectancy for every country for every year. So we are going to try and use it to map out the changes in life expectancy over the years.

ggplot() + geom_line( data = data_life_exp_MINE, 
aes( x = year, y = life_expectancy))

Here's the output:

This doesn't yield a useful result. Why?

I like to think of the geom_line mapping as creating a series of points, joining them using lines and then erasing the points. Clearly something went wrong with the points!

So let us add the point mapping to try and find out what went wrong with geom_line.

ggplot() + geom_point( data = data_life_exp_MINE, 
aes( x = year, y = life_expectancy))

Here's the output:

It becomes clear now.

The map with the geom_line mapping is basically looking at the data_life_exp_MINE dataframe, representing the years on the x axis.

Take year 1963 on the x axis. For this year, there are 189 values of life expectancy, one for each country, all of which are represented as dots on the y axis for 1963.

Trying again

We need to group them by country. Let's try the exact same function again, but this time, with a group specified within the aes argument. We want a distinct line for each country, so let us specify group = iso_a2.

ggplot() + geom_line( data = data_life_exp_MINE, 
aes( x = year, y = life_expectancy, group = iso_a2))

Here's the output:

It works!

I'm going to change the color of all the lines to grey.

ggplot() + geom_line( data = data_life_exp_MINE, 
aes( x = year, y = life_expectancy, group = iso_a2), col = "gray40")

  • Now, I want to add a line that connects the mean life expectancy (for all countries) at every year.
  • Remember, this value is contained in the data_life_exp_avg dataframe, where we already neatly grouped and summarized the data using dplyr pipes.
ggplot() + geom_line( data = data_life_exp_avg, 
aes( x = year, y = mean_life_exp ))

Here's the output:

## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?

We get a similar error. The geom_line mapping doesn't understand that the data is already grouped at 5 - year intervals. To pacify it, let's specify a group within the aes argument.

ggplot() + geom_line( data = data_life_exp_avg,
aes( x = year, y = mean_life_exp, group = 1 ))

Here's the output:

It worked!

But remember, in our data_life_exp_avg dataframe, our life expectancies are already grouped by year. We just supply the group = 1 argument as a dummy variable for geom_line to use.

We could even say group = "anything" and it would still work. Try it for yourself!

ggplot() + geom_line( data = data_life_exp_avg, 
aes( x = year, y = mean_life_exp, group = "anything" ))

We get the exact same graph:

Combining the graphs

Let's add this line representing overall life expectancy by year to the graph that has life expectancy by year for each country. We just need to copy and paste the geom_line portion from this graph to the previous graph.

ggplot() + geom_line( data = data_life_exp_MINE, 
  aes( x = year, y = life_expectancy, group = iso_a2), col = "gray40") + 
  geom_line( data = data_life_exp_avg, 
             aes( x = year, y = mean_life_exp, group = 1 ))

Here's the output:

Can we improve this?

Our overall line (black) doesn't look too clear in a sea of grey country lines. Let's change the color to red for the overall line, and also make it thicker.

ggplot() + geom_line( data = data_life_exp_MINE,
    aes( x = year, y = life_expectancy, group = iso_a2), 
    col = "gray40") + 
  geom_line( data = data_life_exp_avg, 
  aes( x = year, y = mean_life_exp, group = "anything" ), 
  col = "red", size = 2) 

This is great. It tells us that mean life expectancy throughout the world has generally been increasing steadily.

Let us download spatial data of the world.

We are using the ne_countries function from the 'rnaturalearth' package.

world = ne_countries(returnclass = "sf")%>%
  select(iso_a2, name_long, continent) 

Here's the result:

## Simple feature collection with 6 features and 3 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -73.41544 ymin: -55.25 xmax: 75.15803 ymax: 42.68825
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##   iso_a2            name_long     continent                       geometry
## 0     AF          Afghanistan          Asia MULTIPOLYGON (((61.21082 35...
## 1     AO               Angola        Africa MULTIPOLYGON (((16.32653 -5...
## 2     AL              Albania        Europe MULTIPOLYGON (((20.59025 41...
## 3     AE United Arab Emirates          Asia MULTIPOLYGON (((51.57952 24...
## 4     AR            Argentina South America MULTIPOLYGON (((-65.5 -55.2...
## 5     AM              Armenia          Asia MULTIPOLYGON (((43.58275 41...

This returns a spatial polygons dataframe.

Columns we have selected are the 2 character country code, the full name of the country, and the continent in which it is located.

Now, we need to join this spatial data (world) to the non-spatial life expectancy data (data_life_exp_MINE)

  • The key column in the spatial dataset is called "iso_a2".
  • This holds two character country codes.
  • We need this to join the two datasets. This is why the function we wrote named the column in the non-spatial data as 'iso_a2'.

We can do this in 2 ways:

  • Join the spatial data to the non-spatial data. This is easy, but the disadvantage is that we will end up removing a lot of values.
  • Join the non-spatial data to the spatial data. This involves more steps, but we will retain rows with NAs without removal.

What are the steps?

  • Make the non-spatial data wide.
  • Join it to the spatial data.
  • Change the joined data set to long.

Here's how you do it:

data_life_exp_wide = data_life_exp_MINE %>% 
  spread(year, life_expectancy)

world_temporal <- world %>% 
  left_join(data_life_exp_wide, by = "iso_a2") %>% 
  gather(year, life_exp, `1963`:`2013`)

Here's what the prepared dataset looks like:

head(world_temporal)
## Simple feature collection with 6 features and 5 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -73.41544 ymin: -55.25 xmax: 75.15803 ymax: 42.68825
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##   iso_a2            name_long     continent year life_exp
## 1     AF          Afghanistan          Asia 1963   33.624
## 2     AO               Angola        Africa 1963   34.272
## 3     AL              Albania        Europe 1963   64.911
## 4     AE United Arab Emirates          Asia 1963   55.375
## 5     AR            Argentina South America 1963   65.311
## 6     AM              Armenia          Asia 1963   67.276
##                         geometry
## 1 MULTIPOLYGON (((61.21082 35...
## 2 MULTIPOLYGON (((16.32653 -5...
## 3 MULTIPOLYGON (((20.59025 41...
## 4 MULTIPOLYGON (((51.57952 24...
## 5 MULTIPOLYGON (((-65.5 -55.2...
## 6 MULTIPOLYGON (((43.58275 41...

We are now ready to create the map!

We will use the tmap package to fill the shapefiles by life expectancy:

Adding the along = "year" gives us a separate map for each year. The tmap_animation function uses ImageMagick to create a gif of the different maps.

 my_ani <- tm_shape(world_temporal, projection = "robin") +
  tm_fill("life_exp", title = "Life expectancy",
          palette = "inferno") +
  tm_facets( along = "year" ) +tm_layout(frame = FALSE)

tmap_animation(my_ani, filename = "Animation.gif",
               width = 1500, height = 600, delay = 60)

Here's the result.

It is beautiful to see how great strides have been made with life expectancies throughout the world in this format. This is superior to a facet view as even minute detail can be seen.

## Animation saved to D:\Lakshman\A_Cornell\Side_project\Life_expectancy\code_life_expectancy\AnimationUS.gif

Here is a map of life expectancy in the US.

I created this using a similar methodology, but decided to create a frame for every year.

To quantify these life expectancies a little further, I downloaded the NCHS life expectancy at birth dataset from the CDC:

  • You can find it here.

Let's rename the columns. Here's the code:

am_lifeex <- read.csv("cdc_data.csv")

colnames(am_lifeex)[4] <- "life_ex"
colnames(am_lifeex)[1] <- "Year"

Now let us filter and plot the data:

 p <- am_lifeex %>% 
   filter( !(is.na(life_ex)) & 
             Sex == "Both Sexes" & 
            Race %in% c("White", "Black")) %>% 
   ggplot()
 
  plot <- p +  
    geom_jitter(aes(x = Year, y = life_ex, 
                col = Race)) + 
ggtitle(" US Life Expectancy at Birth by Race for the past century") +
ylab("Life Expectancy in Years")

And here's the graph:

What does this graph tell us?

Two things are immediately apparent from this graph.

One, the African-American population has clearly had lower life expectancies throughout the past century.

Two, this has been changing in recent years, with the gap closing.

This concludes my walkthrough of visualizing open-access life expectancy data in R!