Workshop 9: Bubble plots, scales, and {plotly}

SHEMA HUGOR

2023-06-28

0.1 Introduction

In February 2006, a Swedish physician and data advocate named Hans Rosling gave a famous TED talk titled “The best stats you’ve ever seen”, in which he presented global economic, health, and development data compiled by the Gapminder Foundation.

The talk featured a famous bubble plot similar to this:

A bubble plot is a type of scatter plot where a third dimension is added: the value of an additional numeric variable is represented through the size of the points.

We will be using Gapminder data to create a bubble plot with ggplot2.

1 Packages

To get started, load in the needed packages: {tidyverse}, {here}, and {gapminder}.

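# Install {pacman} if it is missing, then load packages
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")
pacman::p_load(here, tidyverse, gapminder)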

1.1 Gapminder data

The R package gapminder, which we just loaded, contains global economic, health, and development data compiled by the Gapminder Foundation.

Run the following code to load the gapminder data frame from the gapminder package:

# Tell R to get the inbuilt dataframe from the package
data(gapminder, package="gapminder")

# Print dataframe
gapminder
## # A tibble: 1,704 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ℹ 1,694 more rows

Each row in this table corresponds to a country-year combination. For each row, we have 6 columns:

  1. country: Country name

  2. continent: Geographic region of the world

  3. year: Calendar year

  4. lifeExp: Average number of years a newborn child would live if current mortality patterns were to stay the same

  5. pop: Total population

  6. gdpPercap: Gross domestic product per person (inflation-adjusted US dollars)

The glimpse() and summary() functions can tell us more about these variables.

# Data structure
glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

This version of the gapminder dataset contains information for 142 countries, divided into 5 continents or world regions.

Data are recorded every 5 years from 1952 to 2007 (a total of 12 years).

# Data summary
summary(gapminder)
##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
## 
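The counts quoted above (142 countries, 5 continents, 12 survey years) can be checked directly. This is a minimal sketch, assuming {dplyr} and {gapminder} are installed; the object name gapminder_counts is only illustrative:

```r
library(dplyr)
library(gapminder)

# Count the distinct countries, continents, and years in the full dataset
gapminder_counts <- gapminder %>%
  summarise(
    n_countries  = n_distinct(country),
    n_continents = n_distinct(continent),
    n_years      = n_distinct(year)
  )

gapminder_counts
```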

In this lesson, we will be using the gapminder dataset, but only the data from 2007. We can use {dplyr} functions to manipulate the dataset and prepare it for plotting. Read through the commented code below to understand how and why these manipulations are done:

# Create new data frame called gapminder07
gapminder07 <- gapminder %>%
  # filter data frame to only include rows from 2007
  filter(year == 2007) %>%
  # remove the year column
  select(-year) %>%
  # rename columns to make them easier to understand
  rename(life_expectancy = lifeExp,
         population = pop,
         gdp_per_capita = gdpPercap) %>% 
  # reorder dataset by population size (this will be useful later)
  arrange(desc(population))

gapminder07
## # A tibble: 142 × 5
##    country       continent life_expectancy population gdp_per_capita
##    <fct>         <fct>               <dbl>      <int>          <dbl>
##  1 China         Asia                 73.0 1318683096          4959.
##  2 India         Asia                 64.7 1110396331          2452.
##  3 United States Americas             78.2  301139947         42952.
##  4 Indonesia     Asia                 70.6  223547000          3541.
##  5 Brazil        Americas             72.4  190010647          9066.
##  6 Pakistan      Asia                 65.5  169270617          2606.
##  7 Bangladesh    Asia                 64.1  150448339          1391.
##  8 Nigeria       Africa               46.9  135031164          2014.
##  9 Japan         Asia                 82.6  127467972         31656.
## 10 Mexico        Americas             76.2  108700891         11978.
## # ℹ 132 more rows

We will start with a regular scatter plot showing the relationship between two numerical variables, and then make it a bubble plot by adding a third dimension.

Let’s say we want to view the relationship between life expectancy and GDP per capita. Create a scatter plot, with GDP per capita on the x axis and life expectancy on the y axis:

ggplot(data = gapminder07,
       mapping = aes(x = gdp_per_capita,
                     y = life_expectancy)) +
  geom_point()

Let’s view this plot through the grammar of graphics:

  1. The geometric objects - visual marks that represent the data - are points.
  2. The data variable gdp_per_capita gets mapped to the x-position aesthetic of the points.
  3. The data variable life_expectancy gets mapped to the y-position aesthetic of the points.

In other words, we have created a simple scatterplot by combining three components: the data, the aesthetic mappings, and a point geometry.

1.2 Quick detour: Plots as objects

A ggplot2 graph can be saved as a named R object (like a data frame), manipulated further, and then printed or saved.

We use the assignment operator to save the plot as an object, just as we have done with data frames.

# create scatterplot and save it
gap_plot_01 <- ggplot(
  data = gapminder07,
  mapping = aes(
    x = gdp_per_capita,
    y = life_expectancy)
) +
  geom_point() 

The plot object will appear in your environment, but it will not be printed. To print the graph, simply type and run the name of the object:

gap_plot_01

You can add a line of best fit to your scatter plot and save it as a new plot, without having to write the old code again:

gap_plot_02 <- gap_plot_01 + 
  geom_smooth()

gap_plot_02
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
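Because the plot is stored as an object, it can also be written to a file. The sketch below uses ggsave() from {ggplot2}. The file name, the temporary output folder, and the dimensions are assumptions for illustration, and the first lines simply rebuild the scatter plot so the example runs on its own:

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Rebuild the 2007 data and the basic scatter plot
gapminder07 <- gapminder %>%
  filter(year == 2007) %>%
  rename(life_expectancy = lifeExp, gdp_per_capita = gdpPercap)

gap_plot_01 <- ggplot(gapminder07,
                      aes(x = gdp_per_capita, y = life_expectancy)) +
  geom_point()

# Hypothetical output path: a PNG in the session's temporary folder
out_file <- file.path(tempdir(), "gap_plot_01.png")

# Write the stored plot object to disk (width and height are in inches)
ggsave(filename = out_file, plot = gap_plot_01,
       width = 6, height = 4, dpi = 300)
```

In a project you would more likely build the path with here::here() instead of tempdir(), so the file lands in a known folder.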

1.3 Bubble plots with geom_point()

With {ggplot2}, bubble plots are built using the geom_point() function. At least three arguments must be provided to aes(): x, y and size.

So let’s add an additional variable, population, and map it to the size aesthetic.

gapminder_bubble <- ggplot(data = gapminder07,
                           mapping = aes(x = gdp_per_capita,
                                         y = life_expectancy)) +
  geom_point(mapping = aes(size = population))

gapminder_bubble

Here, the population of each country is represented through point size. The legend will automatically be built by {ggplot2}, showing how point size scales with population size.

Many of the points overlap, so let’s make them semi-transparent by setting their opacity (alpha) to 50%.

gapminder_plot_03 <- ggplot(data = gapminder07,
                            mapping = aes(x = gdp_per_capita,
                                          y = life_expectancy)) +
  # alpha is set outside aes() because it is a fixed value, not a data variable
  geom_point(mapping = aes(size = population),
             alpha = 0.5)

gapminder_plot_03
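It matters whether a fixed value such as alpha goes inside or outside aes(). The sketch below contrasts the two; the object names base_plot, alpha_set, and alpha_mapped are illustrative, and the data preparation is repeated so the example runs on its own:

```r
library(dplyr)
library(ggplot2)
library(gapminder)

gapminder07 <- gapminder %>%
  filter(year == 2007) %>%
  rename(life_expectancy = lifeExp,
         population = pop,
         gdp_per_capita = gdpPercap)

base_plot <- ggplot(gapminder07,
                    aes(x = gdp_per_capita, y = life_expectancy))

# Setting: alpha is a fixed value given outside aes(); no legend is created
alpha_set <- base_plot +
  geom_point(aes(size = population), alpha = 0.5)

# Mapping: alpha is inside aes(), so 0.5 is treated as a data value.
# ggplot2 adds an "alpha" legend, and the drawn opacity is chosen by the
# alpha scale rather than set directly.
alpha_mapped <- base_plot +
  geom_point(aes(size = population, alpha = 0.5))

# Opacity actually used for the points in the "setting" version
unique(layer_data(alpha_set)$alpha)
```

As a rule of thumb, variables from the data go inside aes(), while constants go outside it.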

1.4 Modifying scales

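One of the optional grammar of graphics layers that we haven’t learned about yet is the family of scale_*() functions, which control how data values are translated into visual properties such as position, size, and color.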

In this section, you can simply run the code we’ve already written for you. We will use two new scale functions.

1.4.1 Control point size with scale_size()

The first thing we need to improve on the previous bubble plot is the size range of the bubbles. scale_size() allows us to set the size of the smallest and the biggest point using the range argument.

gapminder_plot_04 <- ggplot(
  data = gapminder07,
  mapping = aes(
    x = gdp_per_capita,
    y = life_expectancy,
    size = population
   )
) +
  geom_point(alpha = 0.5, color = "darkred") +
  scale_size(range = c(1, 20))

gapminder_plot_04
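To see what the range argument does, we can inspect the sizes that {ggplot2} computes for the points. This is a minimal sketch: layer_data() returns a layer's data after the scales have been applied, and the names bubble and drawn_sizes are illustrative.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

gapminder07 <- gapminder %>%
  filter(year == 2007) %>%
  rename(life_expectancy = lifeExp,
         population = pop,
         gdp_per_capita = gdpPercap)

bubble <- ggplot(gapminder07,
                 aes(x = gdp_per_capita,
                     y = life_expectancy,
                     size = population)) +
  geom_point(alpha = 0.5) +
  scale_size(range = c(1, 20))

# Point sizes after scale_size() has been applied; the least and most
# populous countries sit at the two ends of `range`
drawn_sizes <- layer_data(bubble)$size
range(drawn_sizes)
```

Note that scale_size() scales the area of the points, not their radius, so a country with twice the population is drawn with roughly twice the area.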

1.5 Log transform scales

In the plots above, most countries are squeezed into the left side of the x-axis because GDP per capita is strongly right-skewed: a few wealthy countries stretch the axis.

We can address this by log-transforming the x-axis using scale_x_log10(), which log-scales the x-axis (as the name suggests). We will add this function as a new layer after a + sign, as usual:

gapminder_plot_06 <- ggplot(
  data = gapminder07,
  mapping = aes(
    x = gdp_per_capita,
    y = life_expectancy,
    size = population
  )
) +
  geom_point(alpha = 0.5) +
  scale_size(range = c(1, 20)) +
  scale_x_log10()

gapminder_plot_06
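The skew that motivates the log scale can be seen in the numbers themselves. The sketch below compares summaries of 2007 GDP per capita before and after a log10 transform; gdp_2007 is an illustrative name.

```r
library(dplyr)
library(gapminder)

# GDP per capita for 2007 as a plain numeric vector
gdp_2007 <- gapminder %>%
  filter(year == 2007) %>%
  pull(gdpPercap)

# On the raw scale the mean sits above the median (right skew)
summary(gdp_2007)

# After a log10 transform the values are spread more evenly
summary(log10(gdp_2007))
```

scale_x_log10() applies the same transformation to the axis only, so the underlying data frame is left unchanged.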

1.6 Adding a fourth dimension: color

Since we have one more variable in our dataset (continent), why not show it using point color? Modify the previous code to map the continent variable to the color aesthetic:

gapminder_plot_07 <- ggplot(
  data = gapminder07,
  mapping = aes(
    x = gdp_per_capita,
    y = life_expectancy,
    size = population,
    color = continent
  )
) +
  geom_point(alpha = 0.5) +
  scale_size(range = c(1, 20)) +
  scale_x_log10()
  

gapminder_plot_07

This produces a bubble plot that displays four variables from our data frame at once: GDP per capita, life expectancy, population, and continent.

Let’s again view this plot through the grammar of graphics. The first three components are the same as in our first scatter plot, but now we have added two additional aesthetic mappings.

  1. The geometric objects - visual marks that represent the data - are points.
  2. The data variable gdp_per_capita gets mapped to the x-position aesthetic of the points.
  3. The data variable life_expectancy gets mapped to the y-position aesthetic of the points.
  4. The data variable population gets mapped to the size aesthetic of the points.
  5. The data variable continent gets mapped to the color aesthetic of the points.

We built upon the simple scatterplot by mapping two more variables to point size and color.

1.7 Bonus Challenge (optional)

Our current bubble plot doesn’t show us which country each bubble represents, or what the exact population and GDP per capita of that country are. One way to communicate this information without crowding the graph is to make it interactive. The ggplotly() function from the {plotly} package can convert your plot into an interactive one! Your challenge for this section is to find out how to use this function, so that you can hover over the points to see additional information. Good luck!

pacman::p_load(plotly)
ggplotly(gapminder_plot_07)
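By default the hover box only lists the mapped aesthetics, so the country name is still missing. One possible approach, sketched below, is to add the country as an extra text aesthetic and name the fields to show with the tooltip argument of ggplotly(). The object names bubble and interactive_bubble are illustrative, and the plot is rebuilt here so the example runs on its own:

```r
library(dplyr)
library(ggplot2)
library(gapminder)
library(plotly)

gapminder07 <- gapminder %>%
  filter(year == 2007) %>%
  rename(life_expectancy = lifeExp,
         population = pop,
         gdp_per_capita = gdpPercap)

# The extra `text` aesthetic is not drawn by ggplot2 itself;
# ggplotly() can pick it up for the hover tooltip
bubble <- ggplot(
  gapminder07,
  aes(x = gdp_per_capita,
      y = life_expectancy,
      size = population,
      color = continent,
      text = country)
) +
  geom_point(alpha = 0.5) +
  scale_size(range = c(1, 20)) +
  scale_x_log10()

# Show only the chosen fields when hovering over a bubble
interactive_bubble <- ggplotly(bubble,
                               tooltip = c("text", "x", "y", "size"))
interactive_bubble
```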

1.8 GG animate code