This project is about using ggplot2 to create plots that help better understand trends in the world health and economics. It describes concepts such as faceting, time series plots, transformations and ridge plots.
Research questions 1) Is it a fair characterization of today’s world to say it is divided into western rich nations and the developing world in Africa, Asia, and Latin America?
To answer these questions, we will be using the gapminder dataset provided in “dslabs”.
## Loading the required packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(dslabs)
## Loading the gapminder dataset
data("gapminder")
##For each of the six pairs of countries below, which country do you think had the highest child mortality rates in 2015? Which pairs do you think are most similar?
gapminder |> filter(year == 2015 & country %in% c("Sri Lanka", "Turkey")) |>
select(country, infant_mortality)
## country infant_mortality
## 1 Sri Lanka 8.4
## 2 Turkey 11.6
##Turkey has a higher infant mortality rate compared to Sri Lanka.
gapminder |> filter(year == 2015 & country %in% c("Poland", "South Korea")) |>
select(country, infant_mortality)
## country infant_mortality
## 1 South Korea 2.9
## 2 Poland 4.5
gapminder |> filter(year == 2015 & country %in% c("Malaysia","Russia")) |>
select(country, infant_mortality, fertility)
## country infant_mortality fertility
## 1 Malaysia 6.0 1.94
## 2 Russia 8.2 1.61
gapminder |>
filter(year == 2015 & country %in% c("Thailand","South Africa")) |>
select(country, infant_mortality, fertility)
## country infant_mortality fertility
## 1 South Africa 33.6 2.34
## 2 Thailand 10.5 1.38
##We see that the European countries on this list have higher child mortality rates: Poland has a higher rate than South Korea, and Russia has a higher rate than Malaysia. We also see that Pakistan has a much higher rate than Vietnam, and South Africa has a much higher rate than Thailand.
##Scatterplots
##The reason for this stems from the preconceived notion that the world is divided into two groups: the western world (Western Europe and North America), characterized by long life spans and small families, versus the developing world (Africa, Asia, and Latin America) characterized by short life spans and large families.
##In order to analyze this world view, our first plot is a scatterplot of life expectancy versus fertility rates (average number of children per woman). We start by looking at data from about 50 years ago, when perhaps this view was first cemented in our minds.
gapminder |>
filter(year == 1962) |>
ggplot(aes(fertility, life_expectancy)) +
geom_point()
##Short lifespans characterized by large families and long lifespans characterized by small families.
##To confirm that indeed these countries are from the regions we expect, we can use color to represent continent.
gapminder |>
filter(year == 1962) |>
ggplot(aes(fertility, life_expectancy, color = continent)) +
geom_point()
##In 1962, “the West versus developing world” view was grounded in some reality. Is this still the case 50 years later?
##Faceting
##We could easily plot the 2012 data in the same way we did for 1962. To make comparisons, however, side by side plots are preferable. In ggplot2, we can achieve this by faceting variables: we stratify the data by some variable and make the same plot for each strata.
##To achieve faceting, we add a layer with the function facet_grid, which automatically separates the plots. This function lets you facet by up to two variables using columns to represent one variable and rows to represent the other. The function expects the row and column variables to be separated by a ~.
gapminder |>
filter(year %in% c(1962, 2012)) |>
ggplot(aes(fertility, life_expectancy, color = continent)) +
geom_point() +
facet_grid(year ~ continent)
gapminder |>
filter(year %in% c(1962, 2012)) |>
ggplot(aes(fertility, life_expectancy, color = continent)) +
geom_point() +
facet_grid(. ~ year)
##This plot clearly shows that the majority of countries have moved from the developing world cluster to the western world one. In 2012, the western versus developing world view no longer makes sense. This is particularly clear when comparing Europe to Asia, the latter of which includes several countries that have made great improvements.
##Facet wrap
##To explore how this transformation happened through the years, we can make the plot for several years. For example, we can add 1970, 1980, 1990, and 2000. If we do this, we will not want all the plots on the same row, the default behavior of facet_grid, since they will become too thin to show the data. Instead, we will want to use multiple rows and columns. The function facet_wrap permits us to do this by automatically wrapping the series of plots so that each display has viewable dimensions.
years <-c(1962,1980,1990,2000,2012)
continents <-c("Europe", "Asia")
gapminder |>
filter(year %in% years & continent %in% continents)|>
ggplot(aes(fertility, life_expectancy, color = continent)) +
geom_point() +
facet_wrap(.~year)
##This plot clearly shows how most Asian countries have improved at a much faster rate than European ones.
##Fixed scales for better comparisons
##The default choice of the range of the axes is important. When not using facet, this range is determined by the data shown in the plot. When using facet, this range is determined by the data shown in all plots and therefore kept fixed across plots. This makes comparisons across plots much easier. For example, in the above plot, we can see that life expectancy has increased and the fertility has decreased across most countries. We see this because the cloud of points moves. This is not the case if we adjust the scales.
gapminder |>
filter(year %in% c(1962,2012)) |>
ggplot(aes(fertility, life_expectancy, color = continent)) +
geom_point() +
facet_wrap(.~year, scales = "free")
##In the plot above, we have to pay special attention to the range to notice that the plot on the right has a larger life expectancy.
##Time series plots
##The visualizations above effectively illustrate that data no longer supports the western versus developing world view. Once we see these plots, new questions emerge. For example, which countries are improving more and which ones less? Was the improvement constant during the last 50 years or was it more accelerated during certain periods? For a closer look that may help answer these questions, we introduce time series plots. For example, the below plot is a trend of the United States fertility rate.
gapminder|>
filter(country == "United States") |>
ggplot(aes(year, fertility)) +
geom_point()
## Warning: Removed 1 rows containing missing values (`geom_point()`).
##We see that the trend is not linear at all. Instead there is sharp drop during the 1960s and 1970s to below 2. Then the trend comes back to 2 and stabilizes during the 1990s.
##When the points are regularly and densely spaced, as they are here, we create curves by joining the points with lines, to convey that these data are from a single series, here a country. To do this, we use the geom_line function instead of geom_point.
gapminder |>
filter(country == "United States") |>
ggplot(aes(year, fertility)) +
geom_line()
## Warning: Removed 1 row containing missing values (`geom_line()`).
##This is particularly helpful when we look at two countries. If we subset the data to include two countries, one from Europe and one from Asia, then adapt the following code.
countries <-c("South Korea", "Germany")
gapminder |>
filter(country %in% countries & !is.na(fertility)) |> ggplot(aes(year, fertility, group = country)) +
geom_line()
##But which line goes with which country? We can assign colors to make this distinction. A useful side-effect of using the color argument to assign different colors to the different countries is that the data is automatically grouped.
gapminder |> filter(country %in% countries & !is.na(fertility)) |>
ggplot(aes(year, fertility, color = country)) +
geom_line()
##The plot clearly shows how South Korea’s fertility rate dropped drastically during the 1960s and 1970s, and by 1990 had a similar rate to that of Germany.
##Labels instead of legends
##For trend plots we recommend labeling the lines rather than using legends since the viewer can quickly see which line is which country. This suggestion actually applies to most plots: labeling is usually preferred over legends. We demonstrate how we can do this using the geomtextpath package. We define a data table with the label locations and then use a second mapping just for these labels.
##We now shift our attention to the second question related to the commonly held notion that wealth distribution across the world has become worse during the last decades. When general audiences are asked if poor countries have become poorer and rich countries become richer, the majority answers yes. By using stratification, histograms, smooth densities, and boxplots, we will be able to understand if this is in fact the case. First we learn how transformations can sometimes help provide more informative summaries and plots.
##The gapminder data table includes a column with the countries’ gross domestic product (GDP). GDP measures the market value of goods and services produced by a country in a year. The GDP per person is often used as a rough summary of a country’s wealth. Here we divide this quantity by 365 to obtain the more interpretable measure dollars per day. Using current US dollars as a unit, a person surviving on an income of less than $2 a day is defined to be living in absolute poverty. We add this variable to the data table.
gapminder_update <- gapminder |> mutate(
dollars_per_day = gdp/population/365
)
##The GDP values are adjusted for inflation and represent current US dollars, so these values are meant to be comparable across the years. Of course, these are country averages and within each country there is much variability. All the graphs and insights described below relate to country averages and not to individuals.
##Log transformations
past_year <- 1970
gapminder_update |>
filter(year == past_year & !is.na(gdp)) |>
ggplot(aes(dollars_per_day)) +
geom_histogram(binwidth = 1, color = "black")
##In this plot, we see that for the majority of countries, averages are below $10 a day. However, the majority of the x-axis is dedicated to the 35 countries with averages above $10. So the plot is not very informative about countries with values below $10 a day.
##It might be more informative to quickly be able to see how many countries have average daily incomes of about $1 (extremely poor), $2 (very poor), $4 (poor), $8 (middle), $16 (well off), $32 (rich), $64 (very rich) per day. These changes are multiplicative and log transformations convert multiplicative changes into additive ones: when using base 2, a doubling of a value turns into an increase by 1.
gapminder_update |> filter(year == past_year & !is.na(gdp)) |>
ggplot(aes(log2(dollars_per_day))) +
geom_histogram(binwidth = 1, color = "black")
gapminder_update |> filter(year == past_year &
!is.na(gdp)) |>
ggplot(aes(dollars_per_day)) +
geom_histogram(binwidth = 1, color = "black") +
scale_x_continuous(trans = "log2")
gapminder_update |>
filter(year==past_year & !is.na(gdp)) |>
ggplot(aes(log10(population))) +
geom_histogram(binwidth = 0.5, color = "black")
##comparing multiple distributions with boxplots and ridge plots
gapminder_update |>
filter(year == past_year & !is.na(gdp)) |>
mutate(region = reorder(region, dollars_per_day, FUN = median)) |>
ggplot(aes(dollars_per_day, region)) +
geom_point() +
scale_x_continuous(trans = "log2")
##We can already see that there is indeed a “west versus the rest” dichotomy: we see two clear groups, with the rich group composed of North America, Northern and Western Europe, New Zealand and Australia.
gapminder_update <- gapminder_update |>
mutate(group = case_when(
region %in% c("Western Europe", "Northern Europe","Southern Europe",
"Northern America",
"Australia and New Zealand") ~ "West",
region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
region %in% c("Caribbean", "Central America",
"South America") ~ "Latin America",
continent == "Africa" &
region != "Northern Africa" ~ "Sub-Saharan",
TRUE ~ "Others"))
##We turn this group variable into a factor to control the order of the levels.
gapminder_update <- gapminder_update |>
mutate(group = factor(group, levels = c("Others", "Latin America",
"East Asia", "Sub-Saharan",
"West")))
##Boxplots
gapminder_update |> filter(year == past_year & !is.na(gdp)) |>
ggplot(aes(group, dollars_per_day)) +
geom_boxplot() +
scale_y_continuous(trans = "log2") +
xlab("") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
##Boxplots have the limitation that by summarizing the data into five numbers, we might miss important characteristics of the data. One way to avoid this is by showing the data.
gapminder_update |> filter(year == past_year & !is.na(gdp)) |>
ggplot(aes(group, dollars_per_day)) +
geom_boxplot() +
geom_point() +
scale_y_continuous(trans = "log2") +
xlab("") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
##1970 Versus 2010 Income Distributions
##Data exploration clearly shows that in 1970 there was a “west versus the rest” dichotomy. But does this dichotomy persist? Let’s use facet_grid see how the distributions have changed. To start, we will focus on two groups: the west and the rest. We make four histograms.
present_year <- 2010
past_present <-c(1970, 2010)
gapminder_update |>
filter(year %in% past_present & !is.na(gdp)) |>
mutate(west = ifelse(group == "West","West","Developing")) |>
ggplot(aes(dollars_per_day)) +
geom_histogram(binwidth = 1, color = "black") +
scale_x_continuous(trans = "log2") +
facet_grid(year ~ west)
##Before we interpret the findings of this plot, we notice that there are more countries represented in the 2010 histograms than in 1970: the total counts are larger. One reason for this is that several countries were founded after 1970. For example, the Soviet Union divided into several countries during the 1990s. Another reason is that data was available for more countries in 2010. We remake the plots using only countries with data available for both years. In the data wrangling part of this book, we will learn tidyverse tools that permit us to write efficient code for this, but here we can use simple code using the intersect function.
country_list_1 <- gapminder_update |>
filter(year == past_year & !is.na(dollars_per_day)) |>
pull(country)
country_list_2 <- gapminder_update |>
filter(year == present_year & !is.na(dollars_per_day)) |>
pull(country)
country_list <- intersect(country_list_1, country_list_2)
gapminder_update |>
filter(year %in% past_present & !is.na(gdp) & country %in% country_list) |>
mutate(west = ifelse(group == "West","West","Developing")) |>
ggplot(aes(dollars_per_day)) +
geom_histogram(binwidth = 1, color = "black") +
scale_x_continuous(trans = "log2") +
facet_grid(year~ west)
##We now see that the rich countries have become a bit richer, but percentage-wise, the poor countries appear to have improved more. In particular, we see that the proportion of developing countries earning more than $16 a day increased substantially.
##To see which specific regions improved the most, we can remake the boxplots we made above, but now adding the year 2010 and then using facet to compare the two years.
gapminder_update |>
filter(year %in% past_present & country %in% country_list) |>
ggplot(aes(group, dollars_per_day)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_y_continuous(trans = "log2") +
xlab("") +
facet_grid(.~ year)
##Instead of faceting, we keep the data from each year together and ask to color (or fill) them depending on the year. Note that groups are automatically separated by year and each pair of boxplots drawn next to each other. Because year is a number, we turn it into a factor since ggplot2 automatically assigns a color to each category of a factor. Note that we have to convert the year columns from numeric to factor.
gapminder_update |>
filter(year%in% past_present & country %in% country_list) |>
mutate(year = factor(year)) |>
ggplot(aes(group, dollars_per_day, fill = year)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_y_continuous(trans = "log2") +
xlab("")
##The previous data exploration suggested that the income gap between rich and poor countries has narrowed considerably during the last 40 years. We used a series of histograms and boxplots to see this. We suggest a succinct way to convey this message with just one plot.
##Let’s start by noting that density plots for income distribution in 1970 and 2010 deliver the message that the gap is closing.
gapminder_update |>
filter(year %in% past_present & country %in% country_list) |>
ggplot(aes(dollars_per_day)) +
geom_density(fill = "grey") +
scale_x_continuous(trans = "log2") +
facet_grid(.~year)
##In the 1970 plot, we see two clear modes: poor and rich countries. In 2010, it appears that some of the poor countries have shifted towards the right, closing the gap.