So far, we have worked with relatively simple data types - numbers (340) and characters ("Data Science"). However, R knows how to work with many different types of data.
Today, we’ll work with maps. Download the world.rds object at this link and place it in your data folder. This is an R-Dataset file (RDS) - basically, an object that I created for you in R and saved for you (and others!) to read back in. [1]
We will use the sf package to make maps. If you are on RStudio Cloud, it has already been installed for you. If you are on your own computer, run install.packages("sf"). This may take a minute - let me know if you have any issues.
Let’s read the file into an object named world - you’ll see why soon!
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.4 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Linking to GEOS 3.8.1, GDAL 3.1.4, PROJ 6.3.1
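To make the idea of an RDS file concrete, here is a small sketch of the save-and-read round trip using base R's saveRDS() and readRDS(). The toy object and the temporary file are assumptions for illustration only; for the map, you would point the reading function at wherever you saved world.rds.

```r
# A hypothetical toy object standing in for the real map data
toy <- data.frame(name = c("Australia", "Bahrain"),
                  pop2005 = c(20310208, 724788))

# Save it to a temporary .rds file (a stand-in for your data folder)
path <- tempfile(fileext = ".rds")
saveRDS(toy, path)

# Read it back in: the object comes back as it was saved
toy_again <- readRDS(path)
identical(toy, toy_again)
```

This is why RDS is handy for sharing: whoever reads the file gets the same R object, column types and all, without any parsing.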
If we look at the object, we’ll see a lot of familiar data types:
## Simple feature collection with 246 features and 11 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -180 ymin: -90 xmax: 180 ymax: 83.57027
## CRS: +proj=longlat +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +no_defs
## First 10 features:
## FIPS ISO2 ISO3 UN NAME AREA POP2005 REGION SUBREGION
## ATG AC AG ATG 28 Antigua and Barbuda 44 83039 19 29
## DZA AG DZ DZA 12 Algeria 238174 32854159 2 15
## AZE AJ AZ AZE 31 Azerbaijan 8260 8352021 142 145
## ALB AL AL ALB 8 Albania 2740 3153731 150 39
## ARM AM AM ARM 51 Armenia 2820 3017661 142 145
## AGO AO AO AGO 24 Angola 124670 16095214 2 17
## ASM AQ AS ASM 16 American Samoa 20 64051 9 61
## ARG AR AR ARG 32 Argentina 273669 38747148 19 5
## AUS AS AU AUS 36 Australia 768230 20310208 9 53
## BHR BA BH BHR 48 Bahrain 71 724788 142 145
## LON LAT geometry
## ATG -61.783 17.078 MULTIPOLYGON (((-61.68667 1...
## DZA 2.632 28.163 MULTIPOLYGON (((2.96361 36....
## AZE 47.395 40.430 MULTIPOLYGON (((46.57138 41...
## ALB 20.068 41.143 MULTIPOLYGON (((19.43621 41...
## ARM 44.563 40.534 MULTIPOLYGON (((45.15387 41...
## AGO 17.544 -12.296 MULTIPOLYGON (((13.9975 -5....
## ASM -170.730 -14.318 MULTIPOLYGON (((-169.4445 -...
## ARG -65.167 -35.377 MULTIPOLYGON (((-65.74806 -...
## AUS 136.189 -24.973 MULTIPOLYGON (((142.5128 -1...
## BHR 50.562 26.019 MULTIPOLYGON (((50.53222 26...
However, look at the header (“Geometry type”, “Dimension”, “Bounding Box”, etc.) and the geometry column. This tells you that the file also contains a geometric object. There are other common sources for these geometric objects as well, like shapefiles (files with a .shp extension).
The MULTIPOLYGON entries in the geometry column tell you that each row stores a geometric object. You can see the first few coordinates (-61.68667, 1…), which are what define the shape on the map.
However, you don’t need to know how to read the underlying representation of the data. Thankfully, R knows how to read it for you! Amazingly, there are already built-in geoms to do the heavy lifting for you. This code should look very familiar:
All of the functions you already know work in the way you would expect:
world %>%
  filter(NAME == "Australia") %>%
  ggplot() +
  geom_sf(color = "darkred", fill = "darkblue") +
  theme_bw()
You can even summarise! Instead of adding numbers, you can use functions to add, subtract, intersect, etc. the underlying map data. For example, st_union() is a function that merges several map objects into a single shape:
# the %in% operator returns TRUE if a value is in the vector,
# FALSE otherwise
# equivalent to NAME == "United States" | NAME == "Mexico" ...
world %>%
  filter(NAME %in% c("United States", "Mexico", "Canada")) %>%
  summarise(geometry = st_union(geometry)) %>%
  ggplot() +
  geom_sf()
## although coordinates are longitude/latitude, st_union assumes that they are planar
# this map stretches really far along the x-axis
# coord_sf(xlim = ...) sets limits for the x-axis
world %>%
  filter(NAME %in% c("United States", "Mexico", "Canada")) %>%
  summarise(geometry = st_union(geometry)) %>%
  ggplot() +
  geom_sf() +
  coord_sf(xlim = c(-180, -40))
## although coordinates are longitude/latitude, st_union assumes that they are planar
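If you want to see what st_union() does without a whole world map, here is a minimal sketch with two made-up squares that share an edge. The shapes and names are hypothetical, not real borders; only the sf functions themselves come from the package.

```r
library(sf)

# Two hypothetical unit squares that touch along the line x = 1
square_a <- st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))))
square_b <- st_polygon(list(rbind(c(1, 0), c(2, 0), c(2, 1), c(1, 1), c(1, 0))))

# Put them in a small sf object, one row per shape
shapes <- st_sf(name = c("a", "b"), geometry = st_sfc(square_a, square_b))

# st_union() dissolves the shared edge and returns a single geometry
merged <- st_union(shapes$geometry)

length(shapes$geometry)  # number of separate shapes before the union
length(merged)           # number of shapes after the union
st_area(merged)          # area of the combined shape
```

The same thing happens with the three countries above: their internal borders disappear and one combined outline is left.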
Using the world data, design a nice visualization that shows the population of a few adjacent countries of your choice.
So far, we have only ever worked with one dataset at a time. However, in many real-life situations (including on your final project), you will want to combine information from different datasets.
We will do this on the World Cup data, but let’s start with a simple example. Imagine two datasets, one of students and one of grades:
students <- tibble(name = c("Emma", "Josh", "Alice", "Olivia"),
                   class = c(2022, 2024, 2023, 2022))
grades <- tibble(name = c("Josh", "Alice", "Tyler", "Emma"),
                 grade = c(94, 99, 74, 95))
students
## # A tibble: 4 x 2
## name class
## <chr> <dbl>
## 1 Emma 2022
## 2 Josh 2024
## 3 Alice 2023
## 4 Olivia 2022
## # A tibble: 4 x 2
## name grade
## <chr> <dbl>
## 1 Josh 94
## 2 Alice 99
## 3 Tyler 74
## 4 Emma 95
Imagine you want the student, class, and grade all in the same dataset. To do that, we need to combine the students and grades datasets.
What do they have in common? Student names! In both datasets, the column is called name. There are several ways to join in R:
full_join(): keep all rows from both datasets, even if a row fails to match.
left_join(): keep all rows from the left dataset, even if a row fails to match (right_join() also exists).
inner_join(): keep only the rows that are found in both datasets.
The differences are easiest to see visually. We use the by argument to tell R which column the two datasets have in common.
## # A tibble: 3 x 3
## name class grade
## <chr> <dbl> <dbl>
## 1 Emma 2022 95
## 2 Josh 2024 94
## 3 Alice 2023 99
## # A tibble: 5 x 3
## name class grade
## <chr> <dbl> <dbl>
## 1 Emma 2022 95
## 2 Josh 2024 94
## 3 Alice 2023 99
## 4 Olivia 2022 NA
## 5 Tyler NA 74
## # A tibble: 4 x 3
## name class grade
## <chr> <dbl> <dbl>
## 1 Emma 2022 95
## 2 Josh 2024 94
## 3 Alice 2023 99
## 4 Olivia 2022 NA
## # A tibble: 4 x 3
## name class grade
## <chr> <dbl> <dbl>
## 1 Emma 2022 95
## 2 Josh 2024 94
## 3 Alice 2023 99
## 4 Tyler NA 74
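The four results above are consistent with an inner, full, left, and right join, in that order. Here is a sketch of calls that would produce them, assuming the students and grades tibbles defined earlier (the object names inner, full, left, and right are just for illustration):

```r
library(dplyr)

students <- tibble(name = c("Emma", "Josh", "Alice", "Olivia"),
                   class = c(2022, 2024, 2023, 2022))
grades <- tibble(name = c("Josh", "Alice", "Tyler", "Emma"),
                 grade = c(94, 99, 74, 95))

# only students who appear in both datasets
inner <- inner_join(students, grades, by = "name")

# everyone from either dataset; gaps are filled with NA
full <- full_join(students, grades, by = "name")

# everyone in students (the left dataset), matched or not
left <- left_join(students, grades, by = "name")

# everyone in grades (the right dataset), matched or not
right <- right_join(students, grades, by = "name")
```

Notice where the NAs come from: Olivia has no grade, so she only survives joins that keep all of students, and Tyler has no class, so he only survives joins that keep all of grades.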
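Imagine the student column name wasn’t the same in both datasets. Then, you would have to pass a named vector to the by argument, with the left dataset’s column name on the left and the right dataset’s column name on the right: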
## # A tibble: 4 x 2
## student_name grade
## <chr> <dbl>
## 1 Josh 94
## 2 Alice 99
## 3 Tyler 74
## 4 Emma 95
## # A tibble: 3 x 3
## name class grade
## <chr> <dbl> <dbl>
## 1 Emma 2022 95
## 2 Josh 2024 94
## 3 Alice 2023 99
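As a sketch of how that looks in code: below, the grades data uses student_name instead of name, and the named vector in by tells R that the two columns mean the same thing. The object names grades2 and joined are made up for this example.

```r
library(dplyr)

students <- tibble(name = c("Emma", "Josh", "Alice", "Olivia"),
                   class = c(2022, 2024, 2023, 2022))

# hypothetical version of grades with a differently named key column
grades2 <- tibble(student_name = c("Josh", "Alice", "Tyler", "Emma"),
                  grade = c(94, 99, 74, 95))

# "name" is the column in the left dataset, "student_name" in the right
joined <- inner_join(students, grades2, by = c("name" = "student_name"))
joined
```

The result keeps the left dataset's column name (name), which is why the joined output has no student_name column.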
Join the countries and artists datasets together in two ways. First, keep only the artists whose country has a capital listed in countries (i.e. Canada and France should not be in your final answer):
artists <- tibble(name = c("Lorde", "Taylor Swift",
                           "Drake", "BTS", "Harry Styles"),
                  country = c("New Zealand", "USA", "Canada",
                              "South Korea", "England"))
countries <- tibble(country = c("New Zealand", "USA",
                                "France", "South Korea", "England"),
                    capital = c("Wellington", "Washington DC", "Paris",
                                "Seoul", "London"))
artists
## # A tibble: 5 x 2
## name country
## <chr> <chr>
## 1 Lorde New Zealand
## 2 Taylor Swift USA
## 3 Drake Canada
## 4 BTS South Korea
## 5 Harry Styles England
## # A tibble: 5 x 2
## country capital
## <chr> <chr>
## 1 New Zealand Wellington
## 2 USA Washington DC
## 3 France Paris
## 4 South Korea Seoul
## 5 England London
## # A tibble: 4 x 3
## name country capital
## <chr> <chr> <chr>
## 1 Lorde New Zealand Wellington
## 2 Taylor Swift USA Washington DC
## 3 BTS South Korea Seoul
## 4 Harry Styles England London
Second, join countries and artists so that all artists (but not all countries) are kept in the resulting dataset even if they don’t match.
## # A tibble: 5 x 3
## name country capital
## <chr> <chr> <chr>
## 1 Lorde New Zealand Wellington
## 2 Taylor Swift USA Washington DC
## 3 Drake Canada <NA>
## 4 BTS South Korea Seoul
## 5 Harry Styles England London
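One possible solution to both parts, sketched under the assumption that artists is the left dataset in each call (the names matched_only and all_artists are made up here):

```r
library(dplyr)

artists <- tibble(name = c("Lorde", "Taylor Swift",
                           "Drake", "BTS", "Harry Styles"),
                  country = c("New Zealand", "USA", "Canada",
                              "South Korea", "England"))
countries <- tibble(country = c("New Zealand", "USA",
                                "France", "South Korea", "England"),
                    capital = c("Wellington", "Washington DC", "Paris",
                                "Seoul", "London"))

# part 1: keep only rows found in both datasets (drops Canada and France)
matched_only <- inner_join(artists, countries, by = "country")

# part 2: keep every artist, even without a matching country (Drake gets NA)
all_artists <- left_join(artists, countries, by = "country")
```

Other answers work too. For example, right_join(countries, artists, by = "country") keeps the same rows as the second part, just with the columns in a different order.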
ifelse() and case_when()
Here our goal is to combine data from the world object (which contains geometric shapes) with World Cup results from yesterday’s dataset. [2] Let’s create a dataset of the number of World Cup wins per country.
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## year = col_double(),
## team = col_character(),
## scored = col_double(),
## conceded = col_double(),
## penalties = col_double(),
## matches = col_double(),
## shots_on_goal = col_double(),
## shots_wide = col_double(),
## free_kicks = col_double(),
## offside = col_double(),
## corners = col_double(),
## won = col_double(),
## drawn = col_double(),
## lost = col_double(),
## wc_winner = col_logical()
## )
# don't forget, this dataset only goes up to 2006
wc_wins <- cups %>%
  group_by(team) %>%
  summarise(wins = sum(wc_winner))
## `summarise()` ungrouping output (override with `.groups` argument)
Sometimes you want to directly change values in your data. For example, this dataset only goes up to 2006. We can directly edit the values to update it to the present day. Since 2006, Spain (2010), Germany (2014), and France (2018) have won the World Cup. Let’s update those values directly using ifelse().
ifelse() takes three arguments: a condition, the value to use when the condition is TRUE, and the value to use when it is FALSE. For example, right now the value for France in our wc_wins data is 1. We know it should be 2, so we could use ifelse() to change the value only if the country is France, and leave it alone otherwise. Since we are editing the values of a column, we can use mutate():
# if the team name == France, make it 2
# if the team name does not equal France, leave it alone (= wins)
wc_wins %>%
  mutate(wins = ifelse(team == "France", 2, wins))
## # A tibble: 77 x 2
## team wins
## <chr> <dbl>
## 1 Algeria 0
## 2 Angola 0
## 3 Argentina 2
## 4 Australia 0
## 5 Austria 0
## 6 Belgium 0
## 7 Bolivia 0
## 8 Brazil 5
## 9 Bulgaria 0
## 10 Cameroon 0
## # … with 67 more rows
We can use multiple ifelse() statements if we want to change values for multiple countries:
wc_wins <- wc_wins %>%
  mutate(wins = ifelse(team == "France", 2, wins),
         wins = ifelse(team == "Germany", 4, wins),
         wins = ifelse(team == "Spain", 1, wins))
However, if you have many values like this it can be annoying to type. case_when() is a way of performing multiple if statements at once:
wc_wins %>%
  mutate(wins = case_when(team == "France" ~ 2,
                          team == "Germany" ~ 4,
                          team == "Spain" ~ 1,
                          TRUE ~ as.numeric(wins)))
## # A tibble: 77 x 2
## team wins
## <chr> <dbl>
## 1 Algeria 0
## 2 Angola 0
## 3 Argentina 2
## 4 Australia 0
## 5 Austria 0
## 6 Belgium 0
## 7 Bolivia 0
## 8 Brazil 5
## 9 Bulgaria 0
## 10 Cameroon 0
## # … with 67 more rows
ifelse() is important, so let’s practice it. With the teachers dataset I’ve created below, use ifelse() to recode the school variable such that M becomes “Middle School” and H becomes “High School” (HINT: there are only two values). Don’t forget to save your object.
teachers <- tibble(names = c("Mr. Bourgeau",
                             "Mrs. Nisraiyya",
                             "Mr. Gundrum"),
                   school = c("H", "H", "M"),
                   subject = c("ENG", "SCI", "MAT"))
teachers
## # A tibble: 3 x 3
## names school subject
## <chr> <chr> <chr>
## 1 Mr. Bourgeau H ENG
## 2 Mrs. Nisraiyya H SCI
## 3 Mr. Gundrum M MAT
## # A tibble: 3 x 3
## names school subject
## <chr> <chr> <chr>
## 1 Mr. Bourgeau High ENG
## 2 Mrs. Nisraiyya High SCI
## 3 Mr. Gundrum Middle MAT
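Here is a sketch of one way to do the ifelse() recode. It follows the wording of the exercise ("High School" and "Middle School"), whereas the printed result above uses the shorter labels High and Middle; swap in whichever labels you prefer.

```r
library(dplyr)

teachers <- tibble(names = c("Mr. Bourgeau",
                             "Mrs. Nisraiyya",
                             "Mr. Gundrum"),
                   school = c("H", "H", "M"),
                   subject = c("ENG", "SCI", "MAT"))

# there are only two values, so anything that is not "H" must be "M"
teachers <- teachers %>%
  mutate(school = ifelse(school == "H", "High School", "Middle School"))
teachers
```

Because the result is assigned back to teachers, the recoded column is saved rather than just printed.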
Use case_when() to recode the subject variable so the values “ENG”, “SCI”, and “MAT” are recoded to “English”, “Science”, and “Math”. Those are the only three values, so you can skip the TRUE ~ ... line from before.
teachers %>%
  mutate(subject = case_when(subject == "ENG" ~ "English",
                             subject == "SCI" ~ "Science",
                             subject == "MAT" ~ "Math"))
## # A tibble: 3 x 3
## names school subject
## <chr> <chr> <chr>
## 1 Mr. Bourgeau H English
## 2 Mrs. Nisraiyya H Science
## 3 Mr. Gundrum M Math
Our goal is to add the wins column from wc_wins onto the world dataset. Here are the datasets we want to merge - what do they have in common?
## Simple feature collection with 246 features and 11 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -180 ymin: -90 xmax: 180 ymax: 83.57027
## CRS: +proj=longlat +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +no_defs
## First 10 features:
## FIPS ISO2 ISO3 UN NAME AREA POP2005 REGION SUBREGION
## ATG AC AG ATG 28 Antigua and Barbuda 44 83039 19 29
## DZA AG DZ DZA 12 Algeria 238174 32854159 2 15
## AZE AJ AZ AZE 31 Azerbaijan 8260 8352021 142 145
## ALB AL AL ALB 8 Albania 2740 3153731 150 39
## ARM AM AM ARM 51 Armenia 2820 3017661 142 145
## AGO AO AO AGO 24 Angola 124670 16095214 2 17
## ASM AQ AS ASM 16 American Samoa 20 64051 9 61
## ARG AR AR ARG 32 Argentina 273669 38747148 19 5
## AUS AS AU AUS 36 Australia 768230 20310208 9 53
## BHR BA BH BHR 48 Bahrain 71 724788 142 145
## LON LAT geometry
## ATG -61.783 17.078 MULTIPOLYGON (((-61.68667 1...
## DZA 2.632 28.163 MULTIPOLYGON (((2.96361 36....
## AZE 47.395 40.430 MULTIPOLYGON (((46.57138 41...
## ALB 20.068 41.143 MULTIPOLYGON (((19.43621 41...
## ARM 44.563 40.534 MULTIPOLYGON (((45.15387 41...
## AGO 17.544 -12.296 MULTIPOLYGON (((13.9975 -5....
## ASM -170.730 -14.318 MULTIPOLYGON (((-169.4445 -...
## ARG -65.167 -35.377 MULTIPOLYGON (((-65.74806 -...
## AUS 136.189 -24.973 MULTIPOLYGON (((142.5128 -1...
## BHR 50.562 26.019 MULTIPOLYGON (((50.53222 26...
## # A tibble: 77 x 2
## team wins
## <chr> <dbl>
## 1 Algeria 0
## 2 Angola 0
## 3 Argentina 2
## 4 Australia 0
## 5 Austria 0
## 6 Belgium 0
## 7 Bolivia 0
## 8 Brazil 5
## 9 Bulgaria 0
## 10 Cameroon 0
## # … with 67 more rows
Both have country names! In wc_wins, the column is called team while in world it is called NAME, but they contain the same type of information - a country name. So here, we will merge by country name. This way, R can “link” the world and wc_wins datasets.
world_join <- left_join(world, wc_wins, by = c("NAME" = "team"))
world_join %>%
  filter(REGION == 150 & NAME != "Russia") %>%
  ggplot() + geom_sf(aes(fill = factor(wins)))
Here, what is the difference between NA and 0? NA (meaning “no value” or “empty”) marks countries in world that were not found in wc_wins. Why might this be? Well, these are countries that may have never been in the World Cup (Lithuania, Finland, etc.). However, we still have data on their geometries (even though the match failed) because we used left_join().
We can use ifelse() to fix them. Instead of == or != like normal, we will use is.na(), which is a function that returns TRUE if a value is NA and FALSE otherwise. Here, we want to turn all NAs into 0 and keep the values the same otherwise.
## [1] FALSE FALSE FALSE TRUE
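The output above is what is.na() gives for a vector whose last element is missing. The input that produced it is not shown, so the vector in this minimal sketch is made up:

```r
# a hypothetical vector with one missing value at the end
x <- c(1, 2, 3, NA)

# is.na() checks each element: TRUE where the value is missing
is.na(x)

# combined with ifelse(), the missing value is replaced and the rest are kept
ifelse(is.na(x), 0, x)
```

Note that x == NA would not work here: comparing anything to NA gives NA, which is why the dedicated is.na() function exists.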
world_join <- world_join %>%
  mutate(wins = ifelse(is.na(wins), 0, wins))
# filter down to Europe (150) for simplicity
world_join %>%
  filter(REGION == 150 & NAME != "Russia") %>%
  ggplot() + geom_sf(aes(fill = factor(wins)))
But wait! How about England? They won a World Cup in 1966. Why didn’t their merge work?
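One way to investigate a failed match like this is dplyr's anti_join(), which keeps the rows of the first dataset that have no partner in the second. The two tiny tibbles below are hypothetical stand-ins for wc_wins and world, just to show the idea:

```r
library(dplyr)

# hypothetical miniature versions of the two datasets
cup_teams <- tibble(team = c("France", "England", "Spain"),
                    wins = c(2, 1, 1))
map_names <- tibble(NAME = c("France", "United Kingdom", "Spain"))

# rows of cup_teams whose team name never appears in map_names$NAME
unmatched <- anti_join(cup_teams, map_names, by = c("team" = "NAME"))
unmatched
```

In this toy example the team is recorded as England while the map calls the country United Kingdom, so the two names never line up and the join quietly leaves that row unmatched.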
library(ggthemes)
europe <- world_join %>%
  filter(REGION == 150 & NAME != "Russia") %>%
  mutate(wins = ifelse(NAME == "United Kingdom", 1, wins)) %>%
  ggplot() + geom_sf(aes(fill = factor(wins))) +
  theme_map()
europe
Using world_join, make a similar graph for North America (REGION == 19). To make the plot nicer, you may also want to remove Greenland. Try adding theme_map() from the ggthemes package (don’t forget to load it).
library(ggthemes)
world_join %>%
  filter(REGION == 19 & NAME != "Greenland") %>%
  ggplot() + geom_sf(aes(fill = factor(wins))) +
  coord_sf(xlim = c(-180, -40)) +
  theme_map()
[1] This object is from the maptools package: https://github.com/nasa/World-Wind-Java/tree/master/WorldWind/testData/shapefiles