Agenda

1. Getting to know gapminder

For the first half of the lab, we’ll use the gapminder data set and the ggplot2 package to learn the basics of data visualization. ggplot2 is a core package of the tidyverse. The gapminder data frame becomes available once you have loaded the gapminder package.

For the second half of the lab, we’ll use the Seattle Airbnb listings from Inside Airbnb.

Question 1.1. Install the gapminder package in the console. Load gapminder and tidyverse in the code chunk below.

options(repos = c(CRAN = "https://cloud.r-project.org"))

install.packages("gapminder")
## 
## The downloaded binary packages are in
##  /var/folders/xs/l69xp4vd4l11sk3y3_3hzbf40000gn/T//RtmpSYFr7p/downloaded_packages
library(gapminder)
data(gapminder)

Before we start, let’s take a look at the gapminder data.

Question 1.2. Use str() to take a look at the gapminder data frame.

str(gapminder)
## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

Question 1.3. From checking the structure of the data frame, how many different types of data do you find? Which are continuous, and which are categorical? (See Hint 1.3 at the bottom.)

  • 2 types: continuous and categorical

  • 2 categorical (country, continent): qualitative, can’t really be measured against each other

  • 4 numeric (year, lifeExp, pop, gdpPercap): lifeExp and gdpPercap can take any value in a range, while year and pop are stored as integers

Question 1.4. How many observations does the data frame contain?

You may also notice that gapminder has a nested/hierarchical structure: years within countries within continents. Data like these, where the same units are observed repeatedly over time, are called panel data.
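As a quick check of that nested structure, the sketch below counts rows, countries, and years using base R. The object names (n_rows, n_countries, n_years) are arbitrary and only for illustration.

```r
library(gapminder)

n_rows <- nrow(gapminder)                         # number of observations
n_countries <- length(unique(gapminder$country))  # number of distinct countries
n_years <- length(unique(gapminder$year))         # number of distinct years

# If every country appears exactly once per year, this is TRUE
n_rows == n_countries * n_years
```

If the comparison is TRUE, the panel is balanced: each row is one country in one year.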

Question 1.5. Now, create a subset containing only the rows for the country Algeria.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
algeria <- gapminder %>% filter(country == "Algeria") # keep only the rows for Algeria

Read about the origins of the gapminder data set by typing ?gapminder, and by looking at the source of the data: https://www.gapminder.org/data/.

2. Visualize covariations using ggplot2

A great way of plotting is to use the ggplot2 package. The core idea underlying this package is the layered grammar of graphics: we can break up elements of a plot into pieces and combine them.

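ggplot2 graph objects consist of two primary components:

  • Layers: the pieces of a graph, such as points and lines, which we add to a plot using +.

  • Aesthetics: mappings from variables to visual properties such as position and color, which we set inside aes().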

We’ll use ggplot2 to learn how to show the covariation between variables. Unlike bar charts or histograms, which display the variation of a single variable, in this lab we focus on how different variables vary together.
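As a rough illustration of the layered idea, the sketch below builds a plot in steps and counts the layers at each stage. The object names (base_plot, with_points) are arbitrary, and inspecting the layers element is just one way to peek inside a plot object.

```r
library(ggplot2)
library(gapminder)

# Step 1: data and aesthetic mappings only; nothing is drawn yet
base_plot <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp))

# Step 2: add a layer of points with +
with_points <- base_plot + geom_point()

# The plot object keeps a list of its layers
length(base_plot$layers)
length(with_points$layers)

with_points # display the plot
```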

Scatterplots

Scatterplots are a good way to visualize the relationship between two variables, and to look for outliers.

theme_set(theme_minimal()) # setting minimal theme as default

p <- # use the assignment operator to save a plot!
  ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) + # use the gapminder data, set x to the gdp and y to the life expectancy
  geom_point() # plot a point (x,y) for each country-year observation in the data set

p # by calling the object 'p' we can display the graph

Question 2.1. Make a scatterplot comparing population and life expectancy, and assign it to an object named ‘q’.

q <- # use the assignment operator to save a plot!
  ggplot(gapminder, aes(x = pop, y = lifeExp)) + # use the gapminder data, set x to the pop and y to the life expectancy
  geom_point() # plot a point (x,y) for each country-year observation in the data set

q # by calling the object 'q' we can display the graph

Scales

It appears that countries with higher per capita GDP may have higher life expectancy. However, most data points are lumped together in the 0-30,000 range of gdpPercap. To see our data better, we can transform the x-axis to a log scale.

p + # taking our old graph, the '+' lets us add to it
  scale_x_log10() # add our scale
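To see what a log scale does to the spacing of values, here is a small base R sketch. The GDP figures are made-up round numbers chosen only for illustration.

```r
# Hypothetical per capita GDP values, each ten times the previous one
gdp_values <- c(100, 1000, 10000, 100000)

# On a linear axis, the gap between neighbors grows tenfold each time
linear_gaps <- diff(gdp_values)

# On a log10 axis, the same values are evenly spaced
log_gaps <- diff(log10(gdp_values))

print(linear_gaps)
print(log_gaps)
```

This is why the crowded low-GDP points spread out once the x-axis is on a log scale: equal distances now represent equal ratios rather than equal differences.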

Question 2.2. Add a log scale to your graph q.

q + scale_x_log10() # ggplot2 layers are added with +, not with the pipe

Question 2.3. What do you think accounts for the distinctive shape of your scatterplot? Each streak of points is a single country observed over time, and life expectancy generally goes up over time.

Colors

In addition to the x and y axes, another aesthetic available to use is color. This means we can look at how up to three variables change together!

For example, we can color data points by continent, a categorical variable:

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) + # same as p above but adding color
  geom_point() + # plot the points
  scale_x_log10() # add the same scale

If you map a continuous/numeric variable onto color, ggplot2 will pick a gradient scale:

ggplot(gapminder, aes(x = pop, y = lifeExp, color = gdpPercap)) + 
  geom_point() 

Question 2.4. Remake plot q, choosing a continuous variable to map to color.

ggplot(gapminder, aes(x = pop, y = lifeExp, color = year)) + 
  geom_point() + 
  scale_x_log10()

Question 2.5. Remake plot q, choosing a categorical variable to map to color that helps explain the interesting shapes.

q <- 
  ggplot(gapminder, aes(x = pop, y = lifeExp, color = continent)) +
  geom_point()
  
q

Facets

Let’s try a different way of breaking down these data by continent. This time, we’ll facet the data into “small multiple” plots.

To do this, we add a new layer with facet_wrap. Note: the syntax is slightly different! You use a tilde (~) before the variable name.

(Why? Because you can facet by more than one variable. In R, this syntax is called a formula.)

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) + 
  geom_point() + 
  scale_x_log10() + 
  facet_wrap(~continent) 
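As a sketch of faceting by more than one variable, the example below uses facet_grid() with a two-sided formula. The two years are chosen arbitrarily to keep the grid small, and the names gap_two_years and p_grid are only for illustration.

```r
library(ggplot2)
library(gapminder)

# Keep two years (chosen arbitrarily) so the grid stays readable
gap_two_years <- subset(gapminder, year %in% c(1952, 2007))

p_grid <-
  ggplot(gap_two_years, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10() +
  facet_grid(year ~ continent) # rows = year, columns = continent

p_grid
```

In the formula, the variable to the left of the tilde defines the rows of the grid and the variable to the right defines the columns.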

Question 2.6: Add a continent facet wrap to q

q <- ggplot(gapminder, aes(x = pop, y = lifeExp, color = continent)) + # same scatterplot as q above
  geom_point() +
  facet_wrap(~continent) # one panel per continent
  
q

3. Exercise: Life expectancy over time

Instead of looking at the relationship between life expectancy and GDP, now we’ll look at changes in life expectancy over time. You can use the last code block for all the questions.

3.1: Line plot of life expectancy by year

Create a plot where x = year and y = lifeExp. This time, use a new geom: geom_line() instead of geom_point(). Initially, it won’t look quite right.
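A quick count hints at why the first attempt looks wrong: every year has one life expectancy value per country, so a single line has to pass through all of them. A minimal sketch, where obs_per_year is just an illustrative name:

```r
library(gapminder)

# How many observations share each value of year?
obs_per_year <- table(gapminder$year)
obs_per_year
```

With many y values stacked at each x value, geom_line() connects them all into one zigzag unless it is told which points belong together.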

3.2: The group aesthetic

You need to tell the plot to group the lines by country. To do this, you’ll need a new aesthetic, group = country. Incorporate this into your plot. Does it look more reasonable now? Yes.

3.3: Facet and interpret

Finally, facet by continent. Does life expectancy seem to have increased over time everywhere? Do you see any dips or decreases? It seems to have increased in nearly every country. There are some dips, which may reflect events such as war or famine during those years.

ggplot(gapminder, aes(x = year, y = lifeExp, group = country)) + # one line per country
  geom_line() + 
  facet_wrap(~continent) # year stays on a linear scale; a log scale is not meaningful here

4. Maps

You can create maps in R using latitude and longitude data. This uses a new package called leaflet (https://rstudio.github.io/leaflet/), which you’ll need to install in the console.

The Inside Airbnb data has latitudes and longitudes for each listing, so we’ll use that.

Question 4.1: Install leaflet, load it, and then read the file ‘data/listings.csv’ and name it airbnb_data

install.packages("leaflet")
## 
## The downloaded binary packages are in
##  /var/folders/xs/l69xp4vd4l11sk3y3_3hzbf40000gn/T//RtmpSYFr7p/downloaded_packages
library(leaflet)
airbnb_data <- read.csv("listings.csv") # adjust the path (e.g. "data/listings.csv") to wherever the file is saved

Points and popups

leaflet relies on the pipe (%>%) to add layers to maps.

leaflet(airbnb_data) %>% # begin by passing the data to leaflet
  addTiles() %>% # add the map tiles to the plot; 
                 # leaflet automatically uses the latitude and longitude data 
                 # to find the right map
  addCircles(popup = ~name) # add circles for each listing; 
                            # by adding the popup argument, we can click on a 
                            # circle to show the name
## Assuming "longitude" and "latitude" are longitude and latitude, respectively

Question 4.2: Make a leaflet plot that includes only listings with a price over $200 and shows the price when a circle is clicked.

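airbnb_data %>%
  filter(price > 200) %>% # keep only listings priced above $200 (assumes price is numeric)
  leaflet() %>%
  addTiles() %>%
  addCircles(popup = ~paste0("$", price)) # popups expect text, so build a string from the price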
## Assuming "longitude" and "latitude" are longitude and latitude, respectively

Colors and legends: qualitative/categorical

Colors are a bit less automatic than with ggplot2. You need to create a palette using one of the provided functions, and then use that palette for your data and the legend.

# We're going to start by making a smaller data frame to use for our visualization
example_data <- 
  airbnb_data %>% # assign the small dataframe the name 'example_data'
  filter(neighbourhood == "University District") # only include listings from the UDistrict
 
# this palette is based on the type of room
# "Set1" is a qualitative palette name
room_type_pal <- colorFactor("Set1", example_data$room_type) 

leaflet(example_data) %>% # make a leaflet plot with our example data
  addTiles() %>% # add the map tiles
  addCircles(popup = ~name, # add circles which we can click to see the names
             color = ~room_type_pal(room_type)) %>% # color the circles with our palette
  addLegend(pal = room_type_pal, values = ~room_type) # add a legend so we know what's what
## Assuming "longitude" and "latitude" are longitude and latitude, respectively
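It may help to see that a palette created this way is itself a function that turns values into colors. The sketch below uses made-up room type labels, which are an assumption and may not match the values in your data.

```r
library(leaflet)

# Hypothetical room types, used only for illustration
room_types <- c("Entire home/apt", "Private room", "Shared room")

demo_pal <- colorFactor("Set1", room_types)

# Calling the palette like a function returns one hex color per value
demo_colors <- demo_pal(c("Private room", "Shared room", "Private room"))
print(demo_colors)
```

The same value always gets the same color, which is why the one palette can be passed to both addCircles() and addLegend().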

Question 4.3: Use the same code from above, but map a neighborhood group of your choice

# We're going to start by making a smaller data frame to use for our visualization
example_data <- 
  airbnb_data %>% # assign the small dataframe the name 'example_data'
  filter(neighbourhood == "Roosevelt") # only include listings from Roosevelt
 
# this palette is based on the type of room
# "Set1" is a qualitative palette name
room_type_pal <- colorFactor("Set1", example_data$room_type) 

leaflet(example_data) %>% # make a leaflet plot with our example data
  addTiles() %>% # add the map tiles
  addCircles(popup = ~name, # add circles which we can click to see the names
             color = ~room_type_pal(room_type)) %>% # color the circles with our palette
  addLegend(pal = room_type_pal, values = ~room_type) # add a legend so we know what's what
## Assuming "longitude" and "latitude" are longitude and latitude, respectively

Colors and legends: numeric/continuous

You should use different color palettes for categorical vs numeric data. You’ve got a few options for plotting numeric data:

  • colorNumeric: linear mapping of numbers onto a color gradient. For example, colorNumeric("RdPu", data$variable).

  • colorBin: bins by values, so each color spans the same numeric range. For example, colorBin("RdPu", data$variable, bins = 5).

  • colorQuantile: bins by quantiles, so each color has the same number of data points. For example, colorQuantile("RdPu", data$variable, n = 5).

The examples above use a red-purple gradient (“RdPu”) as the color palette. Type RColorBrewer::brewer.pal.info into your console for a full list of possible palettes.
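To compare the three options side by side, the sketch below applies each palette to a small vector of made-up prices. The prices and object names are hypothetical and chosen only to include one unusually expensive listing.

```r
library(leaflet)

# Hypothetical nightly prices with one expensive outlier
prices <- c(40, 55, 60, 75, 90, 120, 150, 400)

num_pal <- colorNumeric("RdPu", prices)           # smooth gradient over the range
bin_pal <- colorBin("RdPu", prices, bins = 4)     # equal-width bins
qnt_pal <- colorQuantile("RdPu", prices, n = 4)   # equal-count bins

# One row per price, showing the color each palette assigns
palette_comparison <- data.frame(
  price    = prices,
  numeric  = num_pal(prices),
  binned   = bin_pal(prices),
  quantile = qnt_pal(prices)
)
palette_comparison
```

With a skewed variable like this one, the linear and equal-width palettes tend to spend most of their colors on the outlier, while the quantile palette spreads its colors evenly across the listings.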

example_data <- 
  airbnb_data %>% 
  filter(neighbourhood == "University District") 

price_pal <- colorNumeric("RdPu", example_data$price)

leaflet(example_data) %>%
  addTiles() %>%
  addCircles(popup = ~name, 
             color = ~price_pal(price)) %>%
  addLegend(pal = price_pal, values = ~price)
## Assuming "longitude" and "latitude" are longitude and latitude, respectively

Question 4.4: Use the same code from above, but make a palette based on quantiles instead.

example_data <- 
  airbnb_data %>% 
  filter(neighbourhood == "University District") 

price_pal <- colorQuantile("RdPu", example_data$price)

leaflet(example_data) %>%
  addTiles() %>%
  addCircles(popup = ~name, 
             color = ~price_pal(price)) %>%
  addLegend(pal = price_pal, values = ~price)
## Assuming "longitude" and "latitude" are longitude and latitude, respectively

Question 4.5: Compare the two maps of prices in the UDistrict. Which mapping of prices to colors (linear mapping by value or binning by quantile) do you think is more useful here, and why? There isn’t a right answer.

I think the quantile palette gives better insight into what you may be looking for, such as areas with many listings in a given price range and how prices are distributed.

Hint 1.3: Variables country and continent are factor variables. Factor variables are categorical data with an underlying numerical representation.

References

Kieran Healy, Data Visualization: A Practical Introduction

Charles Lanfear, Introduction to R for Social Scientists