April 5, 2019
We are going to use R to download and visualize life expectancy data for all the countries in the world from 1963 to 2013.
library(tidyverse) library(sf) library(tmap) library(rnaturalearth) library(wbstats) library(devtools) library(magick) library(knitr)
?wb
View(wbindicators(lang =c("en")))
wb(indicator = "SP.DYN.LE00.IN")
But don't run it just yet! We need to provide more arguments for the function to work.
We want the data to be pulled from all countries. We can specify this using country = "countries_only". Our code now is:
wb(indicator = "SP.DYN.LE00.IN", country = "countries_only")
wb(indicator = "SP.DYN.LE00.IN", country = "countries_only", startdate = 1963, enddate = 2013)
wb_data_create2 = function(indicator, new_name, year1, year2, ...){
df = wb(country = "countries_only", indicator = indicator,
startdate = year1, enddate = year2, ...) %>%
select(iso_a2 = iso2c, value, year = date) %>%
mutate(indicator = new_name) %>%
spread(indicator, value) %>%
as_data_frame()
# return the dataframe
df
}
wb_data_create2 = function(indicator, new_name, year1, year2, ...){
df = wb(country = "countries_only", indicator = indicator,
startdate = year1, enddate = year2, ...) %>%
select(iso_a2 = iso2c, value, year = date) %>%
mutate(indicator = new_name) %>%
spread(indicator, value) %>%
as_data_frame()
# return the dataframe
df
}
wb_data_create2("SP.DYN.LE00.IN", "life_expectancy", 1963, 2013)
## # A tibble: 6 x 3 ## iso_a2 year life_expectancy ## <chr> <chr> <dbl> ## 1 AE 1963 55.4 ## 2 AE 1964 56.4 ## 3 AE 1965 57.4 ## 4 AE 1966 58.3 ## 5 AE 1967 59.2 ## 6 AE 1968 60.1
index <- seq(1963, 2013, by = 5)
df <- wb_data_create2("SP.DYN.LE00.IN", "life_expectancy", 1963, 2013)
data_life_exp_MINE <- df %>% filter(year %in% index)
## # A tibble: 6 x 3 ## iso_a2 year life_expectancy ## <chr> <chr> <dbl> ## 1 AE 1963 55.4 ## 2 AE 1968 60.1 ## 3 AE 1973 63.8 ## 4 AE 1978 66.7 ## 5 AE 1983 69.0 ## 6 AE 1988 70.9
Hooray! This yields a dataframe with 2143 rows and 3 columns, with life expectancies for each country given at 5-year intervals.
data_life_exp_avg = data_life_exp_MINE %>% group_by(year) %>% summarise(mean_life_exp = mean(life_expectancy, na.rm = TRUE))
## # A tibble: 11 x 2 ## year mean_life_exp ## <chr> <dbl> ## 1 1963 55.2 ## 2 1968 57.3 ## 3 1973 59.3 ## 4 1978 61.1 ## 5 1983 63.0 ## 6 1988 64.5 ## 7 1993 65.4 ## 8 1998 66.5 ## 9 2003 67.9 ## 10 2008 69.8 ## 11 2013 71.4
ggplot() +
geom_line(data = data_life_exp_MINE,
aes(year, life_expectancy, group = iso_a2),
color = "gray40") +
geom_line(data = data_life_exp_avg,
aes(year, mean_life_exp, group = 1),
color = "red", size = 2) +
labs(x = "Year", y = "Life expectancy") +
theme_minimal()
Feel free to skip ahead to slide 45 if you're already familiar with the intricacies of ggplot().
The data_life_exp_MINE dataframe has a value of life expectancy for every country for every year. So we are going to try and use it to map out the changes in life expectancy over the years.
ggplot() + geom_line( data = data_life_exp_MINE, aes( x = year, y = life_expectancy))
This doesn't yield a useful result. Why?
So let us add the point mapping to try and find out what went wrong with geom_line.
ggplot() + geom_point( data = data_life_exp_MINE, aes( x = year, y = life_expectancy))
The map with the geom_line mapping is basically looking at the data_life_exp_MINE dataframe, representing the years on the x axis.
Take year 1963 on the x axis. For this year, there are 189 values of life expectancy, one for each country, all of which are represented as dots on the y axis for 1963.
We need to group them by country. Let's try the exact same function again, but this time, with a group specified within the aes argument. We want a distinct line for each country, so let us specify group = iso_a2.
ggplot() + geom_line( data = data_life_exp_MINE, aes( x = year, y = life_expectancy, group = iso_a2))
It works!
ggplot() + geom_line( data = data_life_exp_MINE, aes( x = year, y = life_expectancy, group = iso_a2), col = "gray40")
ggplot() + geom_line( data = data_life_exp_avg, aes( x = year, y = mean_life_exp ))
## geom_path: Each group consists of only one observation. Do you need to ## adjust the group aesthetic?
ggplot() + geom_line( data = data_life_exp_avg, aes( x = year, y = mean_life_exp, group = 1 ))
But remember, in our data_life_exp_avg dataframe, our life expectancies are already grouped by year. We just supply the group = 1 argument as a dummy variable for geom_line to use.
We could even say group = "anything" and it would still work. Try it for yourself!
ggplot() + geom_line( data = data_life_exp_avg, aes( x = year, y = mean_life_exp, group = "anything" ))
Let's add this line representing overall life expectancy by year to the graph that has life expectancy by year for each country. We just need to copy and paste the geom_line portion from this graph to the previous graph.
ggplot() + geom_line( data = data_life_exp_MINE,
aes( x = year, y = life_expectancy, group = iso_a2), col = "gray40") +
geom_line( data = data_life_exp_avg,
aes( x = year, y = mean_life_exp, group = 1 ))
Our overall line (black) doesn't look too clear in a sea of grey country lines. Let's change the color to red for the overall line, and also make it thicker.
ggplot() + geom_line( data = data_life_exp_MINE,
aes( x = year, y = life_expectancy, group = iso_a2),
col = "gray40") +
geom_line( data = data_life_exp_avg,
aes( x = year, y = mean_life_exp, group = "anything" ),
col = "red", size = 2)
This is great. It tells us that mean life expectancy throughout the world has generally been increasing steadily.
We are using the ne_countries function from the 'rnaturalearth' package.
world = ne_countries(returnclass = "sf")%>% select(iso_a2, name_long, continent)
## Simple feature collection with 6 features and 3 fields ## geometry type: MULTIPOLYGON ## dimension: XY ## bbox: xmin: -73.41544 ymin: -55.25 xmax: 75.15803 ymax: 42.68825 ## epsg (SRID): 4326 ## proj4string: +proj=longlat +datum=WGS84 +no_defs ## iso_a2 name_long continent geometry ## 0 AF Afghanistan Asia MULTIPOLYGON (((61.21082 35... ## 1 AO Angola Africa MULTIPOLYGON (((16.32653 -5... ## 2 AL Albania Europe MULTIPOLYGON (((20.59025 41... ## 3 AE United Arab Emirates Asia MULTIPOLYGON (((51.57952 24... ## 4 AR Argentina South America MULTIPOLYGON (((-65.5 -55.2... ## 5 AM Armenia Asia MULTIPOLYGON (((43.58275 41...
This returns a spatial polygons dataframe.
Columns we have selected are the 2 character country code, the full name of the country, and the continent in which it is located.
data_life_exp_wide = data_life_exp_MINE %>% spread(year, life_expectancy) world_temporal <- world %>% left_join(data_life_exp_wide, by = "iso_a2") %>% gather(year, life_exp, `1963`:`2013`)
head(world_temporal)
## Simple feature collection with 6 features and 5 fields ## geometry type: MULTIPOLYGON ## dimension: XY ## bbox: xmin: -73.41544 ymin: -55.25 xmax: 75.15803 ymax: 42.68825 ## epsg (SRID): 4326 ## proj4string: +proj=longlat +datum=WGS84 +no_defs ## iso_a2 name_long continent year life_exp ## 1 AF Afghanistan Asia 1963 33.624 ## 2 AO Angola Africa 1963 34.272 ## 3 AL Albania Europe 1963 64.911 ## 4 AE United Arab Emirates Asia 1963 55.375 ## 5 AR Argentina South America 1963 65.311 ## 6 AM Armenia Asia 1963 67.276 ## geometry ## 1 MULTIPOLYGON (((61.21082 35... ## 2 MULTIPOLYGON (((16.32653 -5... ## 3 MULTIPOLYGON (((20.59025 41... ## 4 MULTIPOLYGON (((51.57952 24... ## 5 MULTIPOLYGON (((-65.5 -55.2... ## 6 MULTIPOLYGON (((43.58275 41...
Adding the along = "year" gives us a separate map for each year. The tmap_animation function uses ImageMagick to create a gif of the different maps.
my_ani <- tm_shape(world_temporal, projection = "robin") +
tm_fill("life_exp", title = "Life expectancy",
palette = "inferno") +
tm_facets( along = "year" ) +tm_layout(frame = FALSE)
tmap_animation(my_ani, filename = "Animation.gif",
width = 1500, height = 600, delay = 60)
It is beautiful to see how great strides have been made with life expectancies throughout the world in this format. This is superior to a facet view as even minute detail can be seen.
## Animation saved to D:\Lakshman\A_Cornell\Side_project\Life_expectancy\code_life_expectancy\AnimationUS.gif
I created this using a similar methodology, but decided to create a frame for every year.
am_lifeex <- read.csv("cdc_data.csv")
colnames(am_lifeex)[4] <- "life_ex"
colnames(am_lifeex)[1] <- "Year"
p <- am_lifeex %>%
filter( !(is.na(life_ex)) &
Sex == "Both Sexes" &
Race %in% c("White", "Black")) %>%
ggplot()
plot <- p +
geom_jitter(aes(x = Year, y = life_ex,
col = Race)) +
ggtitle(" US Life Expectancy at Birth by Race for the past century") +
ylab("Life Expectancy in Years")
Two things are immediately apparent from this graph.
One, the African-American population has clearly had lower life expectancies throughout the past century.
Two, this has been changing in recent years, with the gap closing.