gapminder
For the first half of the lab, we’ll use the gapminder
data set and the ggplot2 package to learn the basics of data
visualization. ggplot2 is a core package of the
tidyverse. The data frame gapminder will
become available once you have loaded the gapminder package.
For the second half of the lab, we’ll use the Seattle Airbnb listings from Inside Airbnb.
Question 1.1. Install the gapminder package in
the console. Load gapminder and tidyverse in
the code chunk below.
options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("gapminder")
##
## The downloaded binary packages are in
## /var/folders/xs/l69xp4vd4l11sk3y3_3hzbf40000gn/T//RtmpSYFr7p/downloaded_packages
library(gapminder)
data(gapminder)
Before we start, let’s take a look at the gapminder
data.
Question 1.2. Use str() to take a look at the data frame
gapminder.
str(gapminder)
## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
Question 1.3. From checking the structure of the data frame, how many different types of data do you find? Which are continuous, and which are categorical? (Hint at the bottom.)
* 2 types: continuous and categorical
* 2 categorical: qualitative, can’t really be measured against each other
* 4 continuous: can take any value in a range
Question 1.4. How many observations does the data frame contain?
You may also notice that gapminder has a
nested/hierarchical structure: year in country
in continent. These are panel data!
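One way to check this nesting, sketched here under the assumption that the gapminder and dplyr packages are installed, is to count the rows recorded for each country within each continent. The column name n_years is our own choice for illustration.

```r
library(gapminder)
library(dplyr)

# Count the rows (one per survey year) for every country within its continent.
# "n_years" is an illustrative column name, not part of the data set.
obs_per_country <- gapminder %>%
  count(continent, country, name = "n_years")

# One row per country if each country sits in a single continent
nrow(obs_per_country)

# Distribution of panel lengths across countries
table(obs_per_country$n_years)
```

If every country shows the same number of years, the panel is balanced.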
Question 1.5. Now, create a subset containing only the country Algeria.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
algeria <- gapminder %>% filter(country=="Algeria")
Read about the origins of the gapminder data set by
typing ?gapminder, and by looking at the source of the
data: https://www.gapminder.org/data/.
A great way of plotting is to use the ggplot2 package.
The core idea underlying this package is the layered grammar of
graphics: we can break up elements of a plot into pieces and
combine them.
ggplot2 graph objects consist of two primary
components:
ggplot2 object using
+.
We’ll use ggplot2 to learn how to show the
covariation between variables. Different from bar charts or
histograms where only variation of a single variable is
displayed, in this lab we focus on how different variables may vary
together.
Scatterplots are a good way to visualize the relationship between two variables, and to look for outliers.
theme_set(theme_minimal()) # setting minimal theme as default
p <- # use the assignment operator to save a plot!
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) + # use the gapminder data, set x to the gdp and y to the life expectancy
geom_point() # plot points (x,y) for each country in the data set
p # by calling the object 'p' we can display the graph
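The layered idea described earlier can be made visible by building a plot in steps and checking how many layers each object holds. This is a minimal sketch, assuming ggplot2 and gapminder are installed; the object names and the alpha and labs() settings are illustrative choices, not part of the lab.

```r
library(ggplot2)
library(gapminder)

# Start with data and aesthetic mappings only; no layers are drawn yet
base <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp))

# Add a geometry layer with `+`
with_points <- base + geom_point(alpha = 0.5)

# Add axis labels as another piece; the earlier objects are left unchanged
with_labels <- with_points +
  labs(x = "GDP per capita", y = "Life expectancy (years)")

length(base$layers)         # number of layers before adding a geom
length(with_points$layers)  # number of layers after adding geom_point()
```

Because each step returns a new plot object, you can keep a base plot and try different layers on top of it.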
Question 2.1. Make a scatterplot comparing the population and the life expectancy, and assign it to an object named ‘q’.
q <- # use the assignment operator to save a plot!
ggplot(gapminder, aes(x = pop, y = lifeExp)) + # use the gapminder data, set x to the pop and y to the life expectancy
geom_point() # plot points (x,y) for each country in the data set
q # by calling the object 'q' we can display the graph
It appears that countries with higher per capita GDP may have higher life expectancy. However, the data points are lumped together within the 0-30,000 range of gdpPercap. To better see our data, we can transform the x-axis into a log scale.
p + # taking our old graph, the '+' lets us add to it
scale_x_log10() # add our scale
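To see why a log scale spreads out crowded points, here is a small base R sketch with made-up GDP values (hypothetical, not taken from gapminder). Values that differ by a constant factor end up evenly spaced after log10().

```r
# Hypothetical GDP per capita values, each ten times the previous one
gdp <- c(100, 1000, 10000, 100000)

# On the raw scale the gaps grow rapidly
diff(gdp)

# On the log10 scale every step is the same size
log_gdp <- log10(gdp)
diff(log_gdp)
```

This is the transformation scale_x_log10() applies to the axis, which is why the low-GDP countries no longer pile up at the left edge.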
Question 2.2. Add a log scale to your graph q.
q + scale_x_log10()
Question 2.3. What do you think accounts for the distinctive shape of your scatterplot? Each line is a country, and naturally life expectancy is going up over time.
## Colors
In addition to the x and y axes, another aesthetic available to use
is color. This means we can look at how up to three
variables change together!
For example, we can color data points by continent, a categorical variable:
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) + # same as p above but adding color
geom_point() + # plot the points
scale_x_log10() # add the same scale
If you map a continuous/numeric variable onto color,
ggplot2 will pick a gradient scale:
ggplot(gapminder, aes(x = pop, y = lifeExp, color = gdpPercap)) +
geom_point()
Question 2.4. Remake plot q by choosing a continuous variable to add colors
ggplot(gapminder, aes(x = pop, y = lifeExp, color = year)) +
geom_point() +
scale_x_log10()
Question 2.5. Remake plot q by choosing a categorical variable to add colors that explain the interesting shapes
q <-
ggplot(gapminder, aes(x = pop, y = lifeExp, color = continent)) +
geom_point()
q
Let’s take a different way of breaking down these data by continent. This time, we’ll facet the data into “small multiple” plots.
To do this, we add a new layer with facet_wrap. Note:
the syntax is slightly different! You use a tilde (~)
before the variable name.
(Why? Because you can facet by more than one variable. In R, this syntax is called a formula.)
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point() +
scale_x_log10() +
facet_wrap(~continent)
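The tilde syntax can be explored on its own in base R. The sketch below only inspects formula objects; the object names are illustrative.

```r
# A one-sided formula: nothing on the left of the tilde
by_continent <- ~ continent
class(by_continent)      # the object is a formula, not a character string
all.vars(by_continent)   # variable names mentioned in the formula

# Several variables can be combined with `+`
by_continent_year <- ~ continent + year
all.vars(by_continent_year)
```

A formula like the second one could be passed to facet_wrap() to get one panel per continent and year combination, although with gapminder that would produce many panels.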
Question 2.6: Add a continent facet wrap to q
q <- ggplot(gapminder, aes(x = year, y = lifeExp, color = continent, group = country)) +
geom_line() +
facet_wrap(~continent)
q
Instead of looking at the relationship between life expectancy and GDP, now we’ll look at changes in life expectancy over time. You can use the last code block for all the questions.
Create a plot where x = year and
y = lifeExp. This time, use a new geom:
geom_line() instead of geom_point().
Initially, it won’t look quite right.
group aesthetic
You need to tell the plot to group the lines by
country. To do this, you’ll need a new aesthetic,
group = country. Incorporate this into your plot. Does it
look more reasonable now? Yes.
## 3.3: Facet and interpret
Finally, facet by continent. Does life expectancy seem
to have increased over time everywhere? Do you see any dips or
decreases? It seems to have increased in every country for the most
part. There are some dips, which may indicate some kind of war or famine
during those years.
ggplot(gapminder, aes(x = year, y = lifeExp, group = country)) +
geom_line() +
facet_wrap(~continent)
You can create maps in R using latitude and longitude data. This uses
a new package called leaflet (https://rstudio.github.io/leaflet/), which you’ll need
to install in the console.
The Inside Airbnb data has latitudes and longitudes for each listing, so we’ll use that.
Question 4.1: Install leaflet, load it, and then read the file ‘data/listings.csv’ and name it airbnb_data
install.packages("leaflet")
##
## The downloaded binary packages are in
## /var/folders/xs/l69xp4vd4l11sk3y3_3hzbf40000gn/T//RtmpSYFr7p/downloaded_packages
library(leaflet)
airbnb_data <- read.csv("listings.csv")
leaflet relies on the pipe (%>%) to add
layers to maps.
leaflet(airbnb_data) %>% # begin by passing the data to leaflet
addTiles() %>% # add the map tiles to the plot;
# leaflet automatically uses the latitude and longitude data
# to find the right map
addCircles(popup = ~name) # add circles for each listing;
# by adding the popup argument, we can click on a
# circle to show the name
## Assuming "longitude" and "latitude" are longitude and latitude, respectively
Question 4.2: Make a leaflet plot that includes only listings with a price over $200 and shows the price when the circle is clicked.
leaflet(filter(airbnb_data, price > 200)) |>
addTiles() |>
addCircles(popup = ~price)
## Assuming "longitude" and "latitude" are longitude and latitude, respectively
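Depending on which Inside Airbnb file you downloaded, price may be stored as text such as "$1,250.00" rather than as a number, in which case price > 200 would not compare numeric values. The sketch below is only an assumption about that case; it uses made-up values and a hypothetical helper name, parse_price, to show one way to convert such text to numbers in base R.

```r
# Hypothetical helper: strip "$" and "," and convert to numeric
parse_price <- function(x) {
  as.numeric(gsub("[$,]", "", x))
}

# Made-up example values in the assumed text format
raw_price <- c("$85.00", "$1,250.00", "$200.00")
clean_price <- parse_price(raw_price)

# Logical vector that could be used inside filter()
clean_price > 200
```

Running str(airbnb_data$price) first will tell you whether a conversion like this is needed at all.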
Colors are a bit less automatic than with ggplot2. You
need to create a palette using one of the provided functions, and then
use that palette for your data and the legend.
# We're going to start by making a smaller data frame to use for our visualization
example_data <-
airbnb_data %>% # assign the small dataframe the name 'example_data'
filter(neighbourhood == "University District") # only include listings from the UDistrict
# this palette is based on the type of room
# "Set1" is a qualitative palette name
room_type_pal <- colorFactor("Set1", example_data$room_type)
leaflet(example_data) %>% # make a leaflet plot with our example data
addTiles() %>% # add the map tiles
addCircles(popup = ~name, # add circles which we can click to see the names
color = ~room_type_pal(room_type)) %>% # color the circles with our palette
addLegend(pal = room_type_pal, values = ~room_type) # add a legend so we know what's what
## Assuming "longitude" and "latitude" are longitude and latitude, respectively
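It can help to see that the palette created by colorFactor is itself a function that turns category values into colors. This is a minimal sketch, assuming leaflet is installed; the room type labels are assumed for illustration and may differ from those in your data.

```r
library(leaflet)

# Room types assumed for illustration; check unique(airbnb_data$room_type)
room_types <- c("Entire home/apt", "Private room", "Shared room")

# colorFactor() returns a function that maps each category to a color
demo_pal <- colorFactor("Set1", room_types)

# Calling the palette function on values gives one hex color per value
demo_colors <- demo_pal(c("Private room", "Shared room", "Private room"))
demo_colors
```

This is why the same palette object can be passed both to color = inside addCircles() and to pal = inside addLegend().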
Question 4.3: Use the same code from above, but map a neighborhood group of your choice
# We're going to start by making a smaller data frame to use for our visualization
example_data <-
airbnb_data %>% # assign the small dataframe the name 'example_data'
filter(neighbourhood == "Roosevelt") # only include listings from Roosevelt
# this palette is based on the type of room
# "Set1" is a qualitative palette name
room_type_pal <- colorFactor("Set1", example_data$room_type)
leaflet(example_data) %>% # make a leaflet plot with our example data
addTiles() %>% # add the map tiles
addCircles(popup = ~name, # add circles which we can click to see the names
color = ~room_type_pal(room_type)) %>% # color the circles with our palette
addLegend(pal = room_type_pal, values = ~room_type) # add a legend so we know what's what
## Assuming "longitude" and "latitude" are longitude and latitude, respectively
You should use different color palettes for categorical vs. numeric data. You have a few options for plotting numeric data:
* colorNumeric: linear mapping of numbers onto a color
gradient. For example,
colorNumeric("RdPu", data$variable).
* colorBin: bins by values, so each color spans the
same numeric range. For example,
colorBin("RdPu", data$variable, bins = 5).
* colorQuantile: bins by quantiles, so each color has
the same number of data points. For example,
colorQuantile("RdPu", data$variable, n = 5).
The examples above use a red-purple gradient (“RdPu”) as the color
palette. Type RColorBrewer::brewer.pal.info into your
console for a full list of possible palettes.
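The difference between binning by value and binning by quantile can be illustrated without a map. The base R sketch below uses made-up, skewed prices (hypothetical values, not from the Airbnb data) and cut() to mimic, roughly, how colorBin and colorQuantile would group them. It is an analogy, not leaflet’s internal code.

```r
# Hypothetical nightly prices with two expensive outliers
prices <- c(40, 55, 60, 75, 80, 95, 110, 400, 900)

# Equal-width bins: each bin spans the same numeric range
equal_width <- cut(prices, breaks = 4)
table(equal_width)

# Quantile bins: break points are the quartiles, so counts are roughly equal
quartiles <- quantile(prices, probs = c(0, 0.25, 0.5, 0.75, 1))
quantile_bins <- cut(prices, breaks = quartiles, include.lowest = TRUE)
table(quantile_bins)
```

With skewed data like this, most values fall into the first equal-width bin, while the quantile bins stay balanced.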
example_data <-
airbnb_data %>%
filter(neighbourhood == "University District")
price_pal <- colorNumeric("RdPu", example_data$price)
leaflet(example_data) %>%
addTiles() %>%
addCircles(popup = ~name,
color = ~price_pal(price)) %>%
addLegend(pal = price_pal, values = ~price)
## Assuming "longitude" and "latitude" are longitude and latitude, respectively
Question 4.4: Use the same code from above, but make a palette based on quantiles instead.
example_data <-
airbnb_data %>%
filter(neighbourhood == "University District")
price_pal <- colorQuantile("RdPu", example_data$price)
leaflet(example_data) %>%
addTiles() %>%
addCircles(popup = ~name,
color = ~price_pal(price)) %>%
addLegend(pal = price_pal, values = ~price)
## Assuming "longitude" and "latitude" are longitude and latitude, respectively
Question 4.5: Compare the two maps of prices in the UDistrict. Which mapping of prices to colors (linear mapping by value or binning by quantile) do you think is more useful here, and why? There isn’t a right answer.
I think quantile gives better insight into what you may be looking for, such as areas with higher volumes of a certain price and the distribution.
Hint 1.3. Variables country and
continent are factor variables. Factor
variables are categorical data with an underlying numerical
representation.
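The underlying numerical representation can be seen with a small factor built from made-up values (a base R sketch, independent of the gapminder data):

```r
# A small factor built from made-up values
continent <- factor(c("Asia", "Europe", "Africa", "Asia"))

levels(continent)      # category labels, sorted alphabetically by default
as.integer(continent)  # integer codes pointing into the levels
```

This is why the str() output shows continent as a run of 3s for the first rows: the code 3 points to the third level, Asia.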
Kieran Healy, Data Visualization: A Practical Introduction
Charles Lanfear, Introduction to R for Social Scientists