1. Maps

So far, we have worked with relatively simple data types - numbers (340) and characters ("Data Science"). However, R knows how to work with many different types of data.

Today, we’ll work with maps. Download the world.rds object at this link and place it in your data folder. This is an R-Dataset file (RDS) - basically, an object that I created for you in R and saved for you (and others!) to read back in. [1]

We will use the sf package to make maps. If you are on RStudio Cloud, it has already been installed for you. If you are on your own computer, run install.packages("sf"). This may take a minute - let me know if you have any issues.

Let’s read the file into an object named world - you’ll see why soon!

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(sf)
## Linking to GEOS 3.8.1, GDAL 3.1.4, PROJ 6.3.1
world <- readRDS("data/world.rds")

If we look at the object, we’ll see a lot of familiar data types:

world
## Simple feature collection with 246 features and 11 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -180 ymin: -90 xmax: 180 ymax: 83.57027
## CRS:           +proj=longlat +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +no_defs
## First 10 features:
##     FIPS ISO2 ISO3 UN                NAME   AREA  POP2005 REGION SUBREGION
## ATG   AC   AG  ATG 28 Antigua and Barbuda     44    83039     19        29
## DZA   AG   DZ  DZA 12             Algeria 238174 32854159      2        15
## AZE   AJ   AZ  AZE 31          Azerbaijan   8260  8352021    142       145
## ALB   AL   AL  ALB  8             Albania   2740  3153731    150        39
## ARM   AM   AM  ARM 51             Armenia   2820  3017661    142       145
## AGO   AO   AO  AGO 24              Angola 124670 16095214      2        17
## ASM   AQ   AS  ASM 16      American Samoa     20    64051      9        61
## ARG   AR   AR  ARG 32           Argentina 273669 38747148     19         5
## AUS   AS   AU  AUS 36           Australia 768230 20310208      9        53
## BHR   BA   BH  BHR 48             Bahrain     71   724788    142       145
##          LON     LAT                       geometry
## ATG  -61.783  17.078 MULTIPOLYGON (((-61.68667 1...
## DZA    2.632  28.163 MULTIPOLYGON (((2.96361 36....
## AZE   47.395  40.430 MULTIPOLYGON (((46.57138 41...
## ALB   20.068  41.143 MULTIPOLYGON (((19.43621 41...
## ARM   44.563  40.534 MULTIPOLYGON (((45.15387 41...
## AGO   17.544 -12.296 MULTIPOLYGON (((13.9975 -5....
## ASM -170.730 -14.318 MULTIPOLYGON (((-169.4445 -...
## ARG  -65.167 -35.377 MULTIPOLYGON (((-65.74806 -...
## AUS  136.189 -24.973 MULTIPOLYGON (((142.5128 -1...
## BHR   50.562  26.019 MULTIPOLYGON (((50.53222 26...

However, look at the header (“Geometry type”, “Dimension”, “Bounding Box”, etc.) and the geometry column. This tells you that the file also contains a geometric object. There are other common sources for these geometric objects as well, like shapefiles (files with a .shp extension).

The MULTIPOLYGON entries in the geometry column tell you that each row holds a geometric object. You can see the first few coordinate values (-61.68667 1...), which define the shape drawn on the map.
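To make the idea of a geometry column concrete, here is a minimal sketch that builds a one-row map object by hand. The square, the name "Squareland", and the object names ring, square, geom, and toy are made up for illustration; only the sf functions are real.

```r
library(sf)

# A closed ring: the first and last points must be identical
ring <- rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))

# Wrap the ring in a polygon, then in a geometry column with a lon/lat CRS
square <- st_polygon(list(ring))
geom <- st_sfc(square, crs = 4326)

# Attach an ordinary data column to get an sf object, similar to world
toy <- st_sf(name = "Squareland", geometry = geom)

# Inspect the same pieces that appear in the printed header
st_geometry_type(toy)  # the geometry type of each row
st_bbox(toy)           # the bounding box
```

Printing toy should show the same kind of header as world, just with one feature and one field.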

However, you don’t need to know how to read the underlying representation of the data. Thankfully, R knows how to read it for you! Amazingly, there are already built-in geoms to do the heavy lifting for you. This code should look very familiar:

world %>% 
  ggplot() + 
  geom_sf()

All of the functions you already know work in the way you would expect:

world %>% 
  filter(NAME == "Australia") %>%
  ggplot() + 
  geom_sf(color = "darkred", fill = "darkblue") +
  theme_bw()

world %>% 
  ggplot() + 
  geom_sf(aes(fill = factor(REGION))) +
  theme_bw()

You can even summarise! Instead of adding numbers, you can use functions to add, subtract, intersect, etc. the underlying map data. For example, st_union() is a function that combines several map objects into a single shape:

# the %in% operator returns TRUE if a value is in the vector,
# FALSE otherwise
# equivalent to NAME == "United States" | NAME == "Mexico" ...
world %>%
  filter(NAME %in% c("United States", "Mexico", "Canada")) %>%
  summarise(geometry = st_union(geometry)) %>%
  ggplot() + 
    geom_sf()
## although coordinates are longitude/latitude, st_union assumes that they are planar

# this map stretches across the whole x-axis
# coord_sf() sets limits for the x-axis
world %>%
  filter(NAME %in% c("United States", "Mexico", "Canada")) %>%
  summarise(geometry = st_union(geometry)) %>%
  ggplot() + 
    geom_sf() +
    coord_sf(xlim = c(-180, -40))
## although coordinates are longitude/latitude, st_union assumes that they are planar
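As a rough sketch of what st_union() does to the shapes themselves, the toy example below merges two overlapping squares. The helper unit_square() and the objects squares and merged are hypothetical names invented here, and no CRS is set, so the areas are in arbitrary units.

```r
library(sf)

# Helper (hypothetical): a 1 x 1 square whose lower-left corner is at (x, 0)
unit_square <- function(x) {
  st_polygon(list(rbind(c(x, 0), c(x + 1, 0), c(x + 1, 1), c(x, 1), c(x, 0))))
}

# Two squares that overlap by half
squares <- st_sfc(unit_square(0), unit_square(0.5))

# st_union() dissolves them into one geometry
merged <- st_union(squares)

length(squares)        # 2 separate shapes
length(merged)         # 1 combined shape
sum(st_area(squares))  # 2: the overlap is counted twice
st_area(merged)        # 1.5: the overlap is counted once
```

The same thing happens with the three countries: their shared borders disappear and a single outline remains.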

Exercises

  1. Using the world data, design a nice visualization that shows the population of a few adjacent countries of your choice.

2. Merging

So far, we have only ever worked with one dataset at a time. However, in many real-life situations (including on your final project), you will want to combine information from different datasets.

We will do this on the World Cup data, but let’s start with a simple example. Imagine two datasets, one of students and one of grades:

students <- tibble(name = c("Emma", "Josh", "Alice", "Olivia"),
       class = c(2022, 2024, 2023, 2022))
grades <- tibble(name = c("Josh", "Alice", "Tyler", "Emma"),
       grade = c(94, 99, 74, 95))

students
## # A tibble: 4 x 2
##   name   class
##   <chr>  <dbl>
## 1 Emma    2022
## 2 Josh    2024
## 3 Alice   2023
## 4 Olivia  2022
grades
## # A tibble: 4 x 2
##   name  grade
##   <chr> <dbl>
## 1 Josh     94
## 2 Alice    99
## 3 Tyler    74
## 4 Emma     95

Imagine you want the student, class, and grade all in the same dataset. To do that, we need to combine the students and grades datasets.

What do they have in common? Student names! In both datasets, the column is called name. There are several ways to join in R:

  • full_join(): keep all rows from both datasets, even if a row fails to match.
  • left_join(): keep all rows from the left dataset, even if a row fails to match (right_join() also exists and keeps all rows from the right dataset).
  • inner_join(): keep only the rows that are found in both datasets.

The differences are easiest to see visually. We use the by argument to tell R which column to match on:

inner_join(students, grades, by = "name")
## # A tibble: 3 x 3
##   name  class grade
##   <chr> <dbl> <dbl>
## 1 Emma   2022    95
## 2 Josh   2024    94
## 3 Alice  2023    99
full_join(students, grades, by = "name")
## # A tibble: 5 x 3
##   name   class grade
##   <chr>  <dbl> <dbl>
## 1 Emma    2022    95
## 2 Josh    2024    94
## 3 Alice   2023    99
## 4 Olivia  2022    NA
## 5 Tyler     NA    74
left_join(students, grades, by = "name")
## # A tibble: 4 x 3
##   name   class grade
##   <chr>  <dbl> <dbl>
## 1 Emma    2022    95
## 2 Josh    2024    94
## 3 Alice   2023    99
## 4 Olivia  2022    NA
right_join(students, grades, by = "name")
## # A tibble: 4 x 3
##   name  class grade
##   <chr> <dbl> <dbl>
## 1 Emma   2022    95
## 2 Josh   2024    94
## 3 Alice  2023    99
## 4 Tyler    NA    74

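Imagine the student column name wasn’t the same in both datasets. Then, you would have to use a named vector in the by argument, with the left dataset’s column name on the left of the = and the right dataset’s on the right: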

grades <- grades %>%
  rename(student_name = name)
grades
## # A tibble: 4 x 2
##   student_name grade
##   <chr>        <dbl>
## 1 Josh            94
## 2 Alice           99
## 3 Tyler           74
## 4 Emma            95
inner_join(students, grades, by = c("name" = "student_name"))
## # A tibble: 3 x 3
##   name  class grade
##   <chr> <dbl> <dbl>
## 1 Emma   2022    95
## 2 Josh   2024    94
## 3 Alice  2023    99

Exercises

  1. Merges are tricky! Let’s practice here. Merge the countries and artists datasets together in two ways. First, keep only the artists whose country is in countries (i.e. Canada and France should not be in your final answer):
artists <- tibble(name = c("Lorde", "Taylor Swift", 
                           "Drake", "BTS", "Harry Styles"),
       country = c("New Zealand", "USA", "Canada",
                   "South Korea", "England"))

countries <- tibble(country = c("New Zealand", "USA", 
                   "France", "South Korea", "England"),
       capital = c("Wellington", "Washington DC", "Paris",
                   "Seoul", "London"))

artists
## # A tibble: 5 x 2
##   name         country    
##   <chr>        <chr>      
## 1 Lorde        New Zealand
## 2 Taylor Swift USA        
## 3 Drake        Canada     
## 4 BTS          South Korea
## 5 Harry Styles England
countries
## # A tibble: 5 x 2
##   country     capital      
##   <chr>       <chr>        
## 1 New Zealand Wellington   
## 2 USA         Washington DC
## 3 France      Paris        
## 4 South Korea Seoul        
## 5 England     London
inner_join(artists, countries, by = "country")
## # A tibble: 4 x 3
##   name         country     capital      
##   <chr>        <chr>       <chr>        
## 1 Lorde        New Zealand Wellington   
## 2 Taylor Swift USA         Washington DC
## 3 BTS          South Korea Seoul        
## 4 Harry Styles England     London
  2. Now, merge countries and artists so that all artists (but not all countries) are kept in the resulting dataset even if they don’t match.
left_join(artists, countries, by = "country")
## # A tibble: 5 x 3
##   name         country     capital      
##   <chr>        <chr>       <chr>        
## 1 Lorde        New Zealand Wellington   
## 2 Taylor Swift USA         Washington DC
## 3 Drake        Canada      <NA>         
## 4 BTS          South Korea Seoul        
## 5 Harry Styles England     London

Conditional edits - ifelse() and case_when()

Here our goal is to combine data from the world object (which contains geometric shapes) with World Cup results from yesterday’s dataset. [2] Let’s create a dataset of the number of World Cup wins per country.

cups <- read_csv("data/world_cups.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   year = col_double(),
##   team = col_character(),
##   scored = col_double(),
##   conceded = col_double(),
##   penalties = col_double(),
##   matches = col_double(),
##   shots_on_goal = col_double(),
##   shots_wide = col_double(),
##   free_kicks = col_double(),
##   offside = col_double(),
##   corners = col_double(),
##   won = col_double(),
##   drawn = col_double(),
##   lost = col_double(),
##   wc_winner = col_logical()
## )
# don't forget, this dataset only goes up to 2006
wc_wins <- cups %>%
  group_by(team) %>%
  summarise(wins = sum(wc_winner))
## `summarise()` ungrouping output (override with `.groups` argument)
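The sum(wc_winner) step works because R counts TRUE as 1 and FALSE as 0 when you add up a logical column. Here is a small sketch of that idea; the teams "A" and "B" and their results are made up, and toy_cups and toy_wins are names invented for this example.

```r
library(dplyr)

# Made-up results: one row per team per tournament
toy_cups <- tibble(
  team = c("A", "A", "A", "B", "B"),
  wc_winner = c(TRUE, FALSE, TRUE, FALSE, FALSE)
)

# Summing a TRUE/FALSE column counts how many TRUEs each group has
toy_wins <- toy_cups %>%
  group_by(team) %>%
  summarise(wins = sum(wc_winner))

toy_wins
```

Team A ends up with 2 wins and team B with 0, which is why teams that never won still appear in wc_wins with a 0.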

Sometimes you want to directly change values in your data. For example, this dataset only goes up to 2006. We can directly edit the values to update it to the present day. Since 2006, Spain (2010), Germany (2014), and France (2018) have won the World Cup. Let’s update those values directly using ifelse().

ifelse() will let you change values directly. For example, right now the value for France in our wc_wins data is 1. We know it should be 2, so we could use ifelse() to change the value only if the country is France, and leave it alone otherwise. Since we are editing the values of a column, we can use mutate():

# if the team name == France, make it 2
# if the team name does not equal France, leave it alone (= wins)
wc_wins %>%
  mutate(wins = ifelse(team == "France", 2, wins))
## # A tibble: 77 x 2
##    team       wins
##    <chr>     <dbl>
##  1 Algeria       0
##  2 Angola        0
##  3 Argentina     2
##  4 Australia     0
##  5 Austria       0
##  6 Belgium       0
##  7 Bolivia       0
##  8 Brazil        5
##  9 Bulgaria      0
## 10 Cameroon      0
## # … with 67 more rows

We can use multiple ifelse() statements if we want to change values for multiple countries:

wc_wins <- wc_wins %>%
  mutate(wins = ifelse(team == "France", 2, wins),
         wins = ifelse(team == "Germany", 4, wins),
         wins = ifelse(team == "Spain", 1, wins))

However, if you have many values like this it can be annoying to type. case_when() is a way of performing multiple if statements at once:

wc_wins %>%
  mutate(wins = case_when(team == "France" ~ 2, 
                          team == "Germany" ~ 4,
                          team == "Spain" ~ 1,
                          TRUE ~ as.numeric(wins)))
## # A tibble: 77 x 2
##    team       wins
##    <chr>     <dbl>
##  1 Algeria       0
##  2 Angola        0
##  3 Argentina     2
##  4 Australia     0
##  5 Austria       0
##  6 Belgium       0
##  7 Bolivia       0
##  8 Brazil        5
##  9 Bulgaria      0
## 10 Cameroon      0
## # … with 67 more rows
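To see what the TRUE ~ ... line is doing, it can help to try case_when() on a plain vector. This is only an illustrative sketch: the scores and the letter labels are made up, and with_fallback and without_fallback are names invented here.

```r
library(dplyr)

scores <- c(95, 82, 67, NA)  # made-up values

# Conditions are checked in order; the first one that is TRUE wins
with_fallback <- case_when(
  scores >= 90 ~ "A",
  scores >= 80 ~ "B",
  TRUE ~ "other"
)

# Without the TRUE line, anything that matches no condition becomes NA
without_fallback <- case_when(
  scores >= 90 ~ "A",
  scores >= 80 ~ "B"
)

with_fallback     # "A" "B" "other" "other"
without_fallback  # "A" "B" NA NA
```

In the wc_wins code, TRUE ~ as.numeric(wins) plays the same role as "other" here: it keeps the existing value for every team that is not France, Germany, or Spain.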

Exercises

  1. ifelse() is important, so let’s practice it. With the teachers dataset I’ve created below, use ifelse() to recode the school variable such that M becomes “Middle” and H becomes “High” (HINT: there are only two values). Don’t forget to save your object.
teachers <- tibble(names = c("Mr. Bourgeau", 
                             "Mrs. Nisraiyya", 
                             "Mr. Gundrum"),
                   school = c("H", "H", "M"),
                   subject = c("ENG", "SCI", "MAT"))
teachers
## # A tibble: 3 x 3
##   names          school subject
##   <chr>          <chr>  <chr>  
## 1 Mr. Bourgeau   H      ENG    
## 2 Mrs. Nisraiyya H      SCI    
## 3 Mr. Gundrum    M      MAT
teachers %>%
  mutate(school = ifelse(school == "M", "Middle", "High"))
## # A tibble: 3 x 3
##   names          school subject
##   <chr>          <chr>  <chr>  
## 1 Mr. Bourgeau   High   ENG    
## 2 Mrs. Nisraiyya High   SCI    
## 3 Mr. Gundrum    Middle MAT
  2. Next, use case_when() to recode the subject variable so the values “ENG”, “SCI”, and “MAT” are recoded to “English”, “Science”, and “Math”. Those are the only three values, so you can skip the TRUE ~ ... line from before.
teachers %>%
  mutate(subject = case_when(subject == "ENG" ~ "English",
                             subject == "SCI" ~ "Science",
                             subject == "MAT" ~ "Math"))
## # A tibble: 3 x 3
##   names          school subject
##   <chr>          <chr>  <chr>  
## 1 Mr. Bourgeau   H      English
## 2 Mrs. Nisraiyya H      Science
## 3 Mr. Gundrum    M      Math

3. Merging World Cup data

Our goal is to add the wins column from wc_wins onto the world dataset. Here are the datasets we want to merge - what do they have in common?

world
## Simple feature collection with 246 features and 11 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -180 ymin: -90 xmax: 180 ymax: 83.57027
## CRS:           +proj=longlat +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +no_defs
## First 10 features:
##     FIPS ISO2 ISO3 UN                NAME   AREA  POP2005 REGION SUBREGION
## ATG   AC   AG  ATG 28 Antigua and Barbuda     44    83039     19        29
## DZA   AG   DZ  DZA 12             Algeria 238174 32854159      2        15
## AZE   AJ   AZ  AZE 31          Azerbaijan   8260  8352021    142       145
## ALB   AL   AL  ALB  8             Albania   2740  3153731    150        39
## ARM   AM   AM  ARM 51             Armenia   2820  3017661    142       145
## AGO   AO   AO  AGO 24              Angola 124670 16095214      2        17
## ASM   AQ   AS  ASM 16      American Samoa     20    64051      9        61
## ARG   AR   AR  ARG 32           Argentina 273669 38747148     19         5
## AUS   AS   AU  AUS 36           Australia 768230 20310208      9        53
## BHR   BA   BH  BHR 48             Bahrain     71   724788    142       145
##          LON     LAT                       geometry
## ATG  -61.783  17.078 MULTIPOLYGON (((-61.68667 1...
## DZA    2.632  28.163 MULTIPOLYGON (((2.96361 36....
## AZE   47.395  40.430 MULTIPOLYGON (((46.57138 41...
## ALB   20.068  41.143 MULTIPOLYGON (((19.43621 41...
## ARM   44.563  40.534 MULTIPOLYGON (((45.15387 41...
## AGO   17.544 -12.296 MULTIPOLYGON (((13.9975 -5....
## ASM -170.730 -14.318 MULTIPOLYGON (((-169.4445 -...
## ARG  -65.167 -35.377 MULTIPOLYGON (((-65.74806 -...
## AUS  136.189 -24.973 MULTIPOLYGON (((142.5128 -1...
## BHR   50.562  26.019 MULTIPOLYGON (((50.53222 26...
wc_wins
## # A tibble: 77 x 2
##    team       wins
##    <chr>     <dbl>
##  1 Algeria       0
##  2 Angola        0
##  3 Argentina     2
##  4 Australia     0
##  5 Austria       0
##  6 Belgium       0
##  7 Bolivia       0
##  8 Brazil        5
##  9 Bulgaria      0
## 10 Cameroon      0
## # … with 67 more rows

Both have country names! In wc_wins, the column is called team while in world it is called NAME, but they contain the same type of information - a country name. So here, we will merge by country name. This way, R can “link” the world and wc_wins datasets.

world_join <- left_join(world, wc_wins, by = c("NAME" = "team"))

world_join %>%
  filter(REGION == 150 & NAME != "Russia") %>%
  ggplot() + geom_sf(aes(fill = factor(wins)))
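The order of the two datasets matters here: putting the map object on the left keeps every shape and keeps the result a map object. The sketch below shows this on a toy map. The country names, the wins value, the helper unit_square(), and the objects toy_map, toy_wins, and toy_join are all made up for illustration.

```r
library(dplyr)
library(sf)

# Helper (hypothetical): a 1 x 1 square with lower-left corner at (x, 0)
unit_square <- function(x) {
  st_polygon(list(rbind(c(x, 0), c(x + 1, 0), c(x + 1, 1), c(x, 1), c(x, 0))))
}

# A two-country toy map and a results table that covers only one of them
toy_map <- st_sf(
  NAME = c("Northland", "Southland"),
  geometry = st_sfc(unit_square(0), unit_square(2))
)
toy_wins <- tibble(team = "Northland", wins = 3)

# Map on the left: every shape is kept, matched or not
toy_join <- left_join(toy_map, toy_wins, by = c("NAME" = "team"))

class(toy_join)  # still includes "sf", so geom_sf() can draw it
toy_join$wins    # 3 for the match, NA for the country with no results
```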

Here, what is the difference between NA and 0? NA (meaning “no value” or “empty”) marks countries in world that were not found in wc_wins. Why might this be? Well, these may be countries that have never been in the World Cup (Lithuania, Finland, etc.). However, we still have data on their geometries (even though the match failed) because we used left_join().

We can use ifelse() to fix them. Instead of == or != like normal, we will use is.na(), which is a function that returns TRUE if a value is NA and FALSE otherwise. Here, we want to turn all NAs into 0 and keep the values the same otherwise.

# example
values <- c(1, 2, 3, NA)
is.na(values)
## [1] FALSE FALSE FALSE  TRUE
world_join <- world_join %>%
  mutate(wins = ifelse(is.na(wins), 0, wins)) 

# filter down to Europe (150) for simplicity
world_join %>%
  filter(REGION == 150 & NAME != "Russia") %>%
  ggplot() + geom_sf(aes(fill = factor(wins)))

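But wait! How about England? They won a World Cup in 1966. Why didn’t their merge work? The names don’t match: the World Cup team is England, but the matching row in world is named United Kingdom. We can fix that row by hand: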

library(ggthemes)
europe <- world_join %>%
  filter(REGION == 150 & NAME != "Russia") %>%
  mutate(wins = ifelse(NAME == "United Kingdom", 1, wins)) %>%
  ggplot() + geom_sf(aes(fill = factor(wins))) + 
  theme_map()

europe
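One way to catch name mismatches like this before plotting is anti_join(), a dplyr function that returns the rows of the first dataset with no match in the second. The sketch below uses small made-up stand-ins (map_names and cup_wins, with invented wins values) rather than the real world and wc_wins objects.

```r
library(dplyr)

# Made-up stand-ins for world and wc_wins
map_names <- tibble(NAME = c("France", "United Kingdom", "Finland"))
cup_wins <- tibble(team = c("France", "England"), wins = c(1, 1))

# Rows of cup_wins that have no partner in map_names
unmatched <- anti_join(cup_wins, map_names, by = c("team" = "NAME"))
unmatched
```

Only the England row comes back, which tells you that this team name needs fixing before the join. The same call on the real data would presumably be anti_join(wc_wins, world, by = c("team" = "NAME")).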

Exercises

  1. Using the code above and world_join, make a similar graph for the Americas (REGION == 19). To make the plot nicer, you may also want to remove Greenland. Try adding theme_map() from the ggthemes package (don’t forget to load it).
library(ggthemes)
world_join %>%
  filter(REGION == 19 & NAME != "Greenland") %>%
  ggplot() + geom_sf(aes(fill = factor(wins))) + 
  coord_sf(xlim = c(-180, -40)) + 
  theme_map()


  1. This object is from the maptools package: https://github.com/nasa/World-Wind-Java/tree/master/WorldWind/testData/shapefiles

  2. This dataset comes from user jokecamp on GitHub.