0.1 Introduction
In February 2006, Hans Rosling, a Swedish physician and data advocate, gave a TED talk titled “The best stats you’ve ever seen”, in which he presented global economic, health, and development data compiled by the Gapminder Foundation.
The talk featured a famous bubble plot similar to this:
A bubble plot is a type of scatter plot where a third dimension is added: the value of an additional numeric variable is represented through the size of the points.
We will be using Gapminder data to create a bubble plot with
ggplot2.
1 Packages
To get started, load in the needed packages: {tidyverse}, {here}, and {gapminder}.
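# Load packages (install {pacman} first if it is not already available)
if (!require(pacman)) install.packages("pacman")
pacman::p_load(here, tidyverse, gapminder)
1.1 Gapminder data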
The R package {gapminder}, which we just loaded, contains global economic, health, and development data compiled by the Gapminder Foundation.
Run the following code to load the gapminder data frame from
the {gapminder} package:
# Tell R to get the inbuilt dataframe from the package
data(gapminder, package="gapminder")
# Print dataframe
gapminder
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ℹ 1,694 more rows
Each row in this table corresponds to a country-year combination. For each row, we have 6 columns:
- country: Country name
- continent: Geographic region of the world
- year: Calendar year
- lifeExp: Average number of years a newborn child would live if current mortality patterns were to stay the same
- pop: Total population
- gdpPercap: Gross domestic product per person (inflation-adjusted US dollars)
The glimpse() and summary() functions can
tell us more about these variables.
# Data structure
glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
This version of the gapminder dataset
contains information for 142 countries, divided into
5 continents or world regions.
Data are recorded every 5 years from 1952 to 2007 (a total of 12 years).
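These counts can be checked directly from the data. The snippet below is a small sketch using n_distinct() from {dplyr}; the object name gapminder_counts is our own choice for illustration and is not used elsewhere in the lesson.

```r
library(dplyr)
library(gapminder)

# Count the distinct countries, continents, and survey years
gapminder_counts <- gapminder %>%
  summarise(
    n_countries  = n_distinct(country),
    n_continents = n_distinct(continent),
    n_years      = n_distinct(year)
  )
gapminder_counts

# List the survey years to confirm the 5-year spacing
sort(unique(gapminder$year))
```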
# Data summary
summary(gapminder)
## country continent year lifeExp
## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
## Algeria : 12 Asia :396 Median :1980 Median :60.71
## Angola : 12 Europe :360 Mean :1980 Mean :59.47
## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
## Australia : 12 Max. :2007 Max. :82.60
## (Other) :1632
## pop gdpPercap
## Min. :6.001e+04 Min. : 241.2
## 1st Qu.:2.794e+06 1st Qu.: 1202.1
## Median :7.024e+06 Median : 3531.8
## Mean :2.960e+07 Mean : 7215.3
## 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
## Max. :1.319e+09 Max. :113523.1
##
In this lesson, we will be using the gapminder dataset,
but only the data from 2007. We can use {dplyr} functions to manipulate
the dataset and prepare it for plotting. Read through the commented code
below to understand how and why these manipulations are done:
# Create new data frame called gapminder07
gapminder07 <- gapminder %>%
# filter data frame to only include rows from 2007
filter(year == 2007) %>%
# remove the year column
select(-year) %>%
# rename columns to make them easier to understand
rename(life_expectancy = lifeExp,
population = pop,
gdp_per_capita = gdpPercap) %>%
# reorder dataset by population size (this will be useful later)
arrange(desc(population))
gapminder07
## # A tibble: 142 × 5
## country continent life_expectancy population gdp_per_capita
## <fct> <fct> <dbl> <int> <dbl>
## 1 China Asia 73.0 1318683096 4959.
## 2 India Asia 64.7 1110396331 2452.
## 3 United States Americas 78.2 301139947 42952.
## 4 Indonesia Asia 70.6 223547000 3541.
## 5 Brazil Americas 72.4 190010647 9066.
## 6 Pakistan Asia 65.5 169270617 2606.
## 7 Bangladesh Asia 64.1 150448339 1391.
## 8 Nigeria Africa 46.9 135031164 2014.
## 9 Japan Asia 82.6 127467972 31656.
## 10 Mexico Americas 76.2 108700891 11978.
## # ℹ 132 more rows
We will start with a regular scatter plot showing the relationship between two numerical variables, and then make it a bubble plot by adding a third dimension.
Let’s say we want to view the relationship between life expectancy and GDP per capita. Create a scatter plot, with GDP per capita on the x axis and life expectancy on the y axis:
ggplot(data = gapminder07,
mapping = aes(x= gdp_per_capita,
y= life_expectancy))+
geom_point()
Let’s view this plot through the grammar of graphics:
- The geometric objects - visual marks that represent the data - are points.
- The data variable gdp_per_capita gets mapped to the x-position aesthetic of the points.
- The data variable life_expectancy gets mapped to the y-position aesthetic of the points.
What we have created is a simple scatterplot by adding together the following components:
1.2 Quick detour: Plots as objects
A ggplot2 graph can be saved as a named R object (like a
data frame), manipulated further, and then printed or saved.
We use the assignment operator to save the plot as an object, just as we have done with data frames.
# create scatterplot and save it
gap_plot_01 <- ggplot(
data = gapminder07,
mapping = aes(
x = gdp_per_capita,
y = life_expectancy)
) +
geom_point()
This will appear in your environment, but it will not be printed. To print the graph, simply type and run the name of the object:
gap_plot_01
You can add a line of best fit to your scatter plot and save it as a new plot, without having to write the old code again:
gap_plot_02 <- gap_plot_01 +
geom_smooth()
gap_plot_02
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
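Because a plot is an ordinary object, it can also be written to a file. The following is a minimal sketch using ggsave() from {ggplot2}. The names gap07_demo, scatter_demo, and out_file, the file name, and the image dimensions are all illustrative assumptions, and the example works on the original gapminder column names so that it runs on its own.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Stand-in data for this sketch: the 2007 rows of the packaged dataset
gap07_demo <- gapminder %>%
  filter(year == 2007)

# Build the scatter plot and keep it as an object
scatter_demo <- ggplot(
  data = gap07_demo,
  mapping = aes(x = gdpPercap, y = lifeExp)
) +
  geom_point()

# Hypothetical output location: a PNG in the session's temporary folder
out_file <- file.path(tempdir(), "gap_scatter_demo.png")

# Passing the plot explicitly avoids relying on "the last plot printed"
ggsave(filename = out_file, plot = scatter_demo,
       width = 6, height = 4, units = "in", dpi = 300)

# Confirm that the file was written
file.exists(out_file)
```

In a real project you would probably replace the temporary folder with a path inside your project directory, for example one built with here::here().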
1.3 Bubble plots with geom_point()
With {ggplot2}, bubble plots are built using the
geom_point() function. At least three arguments must be
provided to aes(): x, y and
size.
So let’s add an additional variable, population, and map
it to the size aesthetic.
gapminder_bubble <- ggplot(data = gapminder07,
mapping = aes(x= gdp_per_capita,
y= life_expectancy))+
geom_point(mapping = aes(size=population))
gapminder_bubble
Here, the population of each country is represented through point size. The legend will automatically be built by {ggplot2}, showing how point size scales with population size.
Many of the points overlap, so let’s decrease their opacity to 50%. Because this is a fixed value rather than a variable from the data, alpha is set as an argument to geom_point(), outside aes(). Placing it inside aes() would treat 0.5 as data and add an unwanted legend.
gapminder_plot_03 <- ggplot(data = gapminder07,
                            mapping = aes(x = gdp_per_capita,
                                          y = life_expectancy)) +
  geom_point(mapping = aes(size = population),
             alpha = 0.5)
gapminder_plot_03
1.4 Modifying scales
One set of optional grammar of graphics layers that we haven’t
learned about yet is the scale_*() functions.
In this section, you can simply run the code we’ve already written for you. We will use two new scale functions.
1.4.1 Control point size with scale_size()
The first thing we need to improve on the previous bubble plot is the
size range of the bubbles. scale_size() lets us set the
size of the smallest and the largest point using the range
argument.
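A quick way to see what range does is to inspect the sizes that {ggplot2} actually draws. The sketch below uses layer_data() from {ggplot2} for this. It works on the original gapminder column names so that it runs on its own, and the object names gap07_demo, size_demo, and drawn_sizes are placeholders of our own.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Stand-in data for this sketch: the 2007 rows of the packaged dataset
gap07_demo <- gapminder %>%
  filter(year == 2007)

size_demo <- ggplot(
  data = gap07_demo,
  mapping = aes(x = gdpPercap, y = lifeExp, size = pop)
) +
  geom_point(alpha = 0.5) +
  scale_size(range = c(1, 20))

# layer_data() returns the computed values used to draw layer 1,
# including the point sizes after the size scale has been applied
drawn_sizes <- layer_data(size_demo, i = 1)$size

# Smallest and largest drawn point sizes
range(drawn_sizes)
```

The smallest country should come out at the lower end of range and the largest at the upper end. In between, scale_size() scales the area of the points rather than their radius. The lesson's version of the plot follows.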
gapminder_plot_04 <- ggplot(
data = gapminder07,
mapping = aes(
x = gdp_per_capita,
y = life_expectancy,
size = population
)
) +
geom_point(alpha = 0.5,color="darkred") +
scale_size(range = c(1, 20))
gapminder_plot_04
1.5 Log transform scales
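In the plot above, most countries are bunched together on the left side of the x-axis: GDP per capita is strongly right-skewed, so a small number of wealthy countries stretch the axis while the majority sit at much lower values.
We can address this by log-transforming the x-axis using
scale_x_log10(), which log-scales the x-axis (as the name
suggests). We will add this function as a new layer after a
+ sign, as usual: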
gapminder_plot_06 <- ggplot(
data = gapminder07,
mapping = aes(
x = gdp_per_capita,
y = life_expectancy,
size = population
)
) +
geom_point(alpha = 0.5) +
scale_size(range = c(1, 20)) +
scale_x_log10()
gapminder_plot_06
1.6 Adding a fourth dimension: color
Since we have one more variable in our dataset
(continent), why not show it using point color? Modify
the previous code to map the continent variable to the
color aesthetic:
gapminder_plot_07 <- ggplot(
data = gapminder07,
mapping = aes(
x = gdp_per_capita,
y = life_expectancy,
size = population,
color = continent)) +
geom_point(alpha = 0.5
) +
scale_size(range = c(1, 20)) +
scale_x_log10()
gapminder_plot_07
This produces a bubble plot that displays four variables from our data frame at once.
Let’s again view this plot through the grammar of graphics. The first
three components are the same as in the simple scatter plot, but now we have added
two additional aesthetic mappings.
- The geometric objects - visual marks that represent the data - are points.
- The data variable gdp_per_capita gets mapped to the x-position aesthetic of the points.
- The data variable life_expectancy gets mapped to the y-position aesthetic of the points.
- The data variable population gets mapped to the size aesthetic of the points.
- The data variable continent gets mapped to the color aesthetic of the points.
We built upon the simple scatterplot by adding variable colors and
sizes:
1.7 Bonus Challenge (optional)
Our current bubble plot doesn’t show us which country each bubble
represents, or what the exact population and GDP of the country are. One way to
communicate this information without crowding the graph is to make it
interactive. The ggplotly() function from the {plotly}
package can convert your plot to be interactive! Your challenge for this
section is to find out how to use this function to make your plot
interactive, so that you can hover over the points to see additional
information. Good luck!
pacman::p_load(plotly)
ggplotly(gapminder_plot_07)
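By default, the hover box only lists variables that are mapped to an aesthetic, so the country name is still missing. One possible approach, sketched here under our own assumptions, is to add a text mapping and pass tooltip = "text" to ggplotly(). The object names gap07_demo, hover_demo, and interactive_demo and the tooltip wording are illustrative, and the example uses the original gapminder column names so that it runs on its own.

```r
library(dplyr)
library(ggplot2)
library(gapminder)
library(plotly)

# Stand-in data for this sketch: the 2007 rows of the packaged dataset
gap07_demo <- gapminder %>%
  filter(year == 2007)

hover_demo <- ggplot(
  data = gap07_demo,
  mapping = aes(
    x = gdpPercap,
    y = lifeExp,
    size = pop,
    color = continent,
    # 'text' is not drawn by ggplot2; ggplotly() can use it for the tooltip.
    # <br> starts a new line inside the hover box.
    text = paste0(
      country,
      "<br>Population: ", formatC(pop, format = "d", big.mark = ","),
      "<br>GDP per capita: ", round(gdpPercap)
    )
  )
) +
  geom_point(alpha = 0.5) +
  scale_size(range = c(1, 20)) +
  scale_x_log10()

# Show only the custom text when hovering over a bubble
interactive_demo <- ggplotly(hover_demo, tooltip = "text")
interactive_demo
```

Leaving out the tooltip argument keeps the default hover box and appends the custom text to it.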