\(\underline{\textbf{Chapter 2 Description}}\)
Learn two more mutating joins, the left and right join, which are mirror images of each other! You’ll learn use cases for each type of join as you explore parts and colors of LEGO themes. Then, you’ll explore how to join tables to themselves to understand the hierarchy of LEGO themes in the data.
library(dplyr)
library(tidyr) # Used to replace NA values
Chapter 2 Datasets
parts <- readRDS("Datasets/parts.rds")
part_categories <- readRDS("Datasets/part_categories.rds")
inventory_parts <- readRDS("Datasets/inventory_parts.rds")
sets <- readRDS("Datasets/sets.rds")
inventories <- readRDS("Datasets/inventories.rds")
colors <- readRDS("Datasets/colors.rds")
themes <- readRDS("Datasets/themes.rds")
inventory_parts_joined <- inventories %>%
inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
inner_join(sets, by = "set_num") %>%
select("set_num", "part_num", "color_id", "quantity")
millennium_falcon <- inventory_parts_joined %>%
filter(set_num == "7965-1")
star_destroyer <- inventory_parts_joined %>%
filter(set_num == "75190-1")
left_join VerbIn the video, you learned how to left join two LEGO sets. Now you’ll practice your ability to do this looking at two new sets: the Millennium Falcon and Star Destroyer sets.
\((A)\) Left join
star_destroyerandmillennium_falcontables on thepart_numandcolor_id
\(\hphantom{(c~)}\) columns with the suffixes_falconand_star_destroyer.
millennium_falcon %>%
left_join(star_destroyer, by = c("part_num", "color_id"), suffix = c("_falcon", "_star_destroyer"))
# A tibble: 263 × 6
set_num_falcon part_num color_id quantity_falcon set_num_star_destr…¹ quant…²
<chr> <chr> <dbl> <dbl> <chr> <dbl>
1 7965-1 12825 72 3 <NA> NA
2 7965-1 2412b 72 20 75190-1 11
3 7965-1 2412b 320 2 <NA> NA
4 7965-1 2419 71 1 <NA> NA
5 7965-1 2420 0 4 75190-1 1
6 7965-1 2420 71 1 <NA> NA
7 7965-1 2420 71 7 <NA> NA
8 7965-1 2431 72 2 <NA> NA
9 7965-1 2431 0 1 75190-1 3
10 7965-1 2431 19 2 <NA> NA
# … with 253 more rows, and abbreviated variable names ¹set_num_star_destroyer,
# ²quantity_star_destroyer
# ℹ Use `print(n = ...)` to see more rows
Concluding Remarks
Using this table, and the
part_numorcolor_idcolumns, you can determine the frequency that a specific part or piece appears in the Millennium Falcon and Star Destroyer sets!
In the videos and the last exercise, you joined two sets based on their part and color. What if you joined the datasets by color alone?
\((A)\) Sum the
quantitycolumn groupings bycolor_idin the Millennium Falcon dataset.\((B)\) Now, sum the
quantitycolumn bycolor_idin the Star Destroyer dataset.\((C)\) Left join the two datasets,
millennium_falcon_colorsandstar_destroyer_colors,
\(\hphantom{(c~)}\) using thecolor_idcolumn and the_falconand_star_destroyersuffixes.
# (1) Aggregate Millennium Falcon for the total quantity in each part
millennium_falcon_colors <- millennium_falcon %>%
group_by(color_id) %>%
summarize(total_quantity = sum(quantity))
# millennium_falcon_colors
# (2) Aggregate Star Destroyer for the total quantity in each part
star_destroyer_colors <- star_destroyer %>%
group_by(color_id) %>%
summarize(total_quantity = sum(quantity))
# (3) Left join the Millennium Falcon colors to the Star Destroyer colors
millennium_falcon_colors %>%
left_join(star_destroyer_colors, by = "color_id", suffix = c("_falcon", "_star_destroyer"))
# A tibble: 21 × 3
color_id total_quantity_falcon total_quantity_star_destroyer
<dbl> <dbl> <dbl>
1 0 201 336
2 1 15 23
3 4 17 53
4 14 3 4
5 15 15 17
6 19 95 12
7 28 3 16
8 33 5 NA
9 36 1 14
10 41 6 15
# … with 11 more rows
# ℹ Use `print(n = ...)` to see more rows
Concluding Remarks
That’s right! Sometimes, the level on which you’re comparing two sets- by piece or just by color- will require some thought and some pre-processing.
Left joins are really great for testing your assumptions about a data set and ensuring your data has integrity.
For example, the inventories table has a
version column, for when a LEGO kit gets some kind of
change or upgrade. It would be fair to assume that all sets
(which joins well with inventories) would have at least a
version 1. But let’s test this assumption out in the following
exercise.
\((A)\) Use a
left_jointo join togethersetsandinventory_version_1using their common column.\((B)\)
filterfor where theversioncolumn isNAusingis.na.
inventory_version_1 <- inventories %>%
filter(version == 1)
# (A) Join versions to sets
sets %>%
left_join(inventory_version_1, by = "set_num") %>%
# (B) Filter for where version is na
filter(is.na(version))
# A tibble: 1 × 6
set_num name year theme_id id version
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 40198-1 Ludo game 2018 598 NA NA
Concluding Remarks
It looks like there are cases where a set does not have an original version.
In Chapter 3, you’ll learn another way to find observations like this:anti_join.
right_join VerbNow that you’ve learned three different types of joins, you will be able to identify situations in which each join should be used.
\((A)\) Match each definition with the type of join that is described.
Concluding Remarks
You’ve distinguished between three different types of mutating joins: the inner, left, and right join.
In the next chapter, you’ll learn another type of mutating join: the full join!
Sometimes you’ll want to do some processing before you do a join, and prioritize keeping the second (right) table’s rows instead. In this case, a right join is for you.
In this exercise, we’ll count the part_cat_id from
parts, before using a right_join to join with
part_categories. The reason we do this is because we don’t
only want to know the count of part_cat_id in
parts, but we also want to know if there are any
part_cat_ids not present in parts.
\((A)\) Use the
countverb to count eachpart_cat_idin thepartstable.\((B)\) Use a
right_jointo joinpart_categories.
\(\qquad\hookrightarrow\) You’ll need to use thepart_cat_idfrom the count and theidcolumn frompart_categories.\((C)\)
filterfor where the columnnisNA.
parts %>%
# (A) Count the part_cat_id
count(part_cat_id) %>%
# (B) Right join part_categories
right_join(part_categories, by = c("part_cat_id" = "id")) %>%
# (C) Filter for NA
filter(is.na(n))
# A tibble: 1 × 3
part_cat_id n name
<dbl> <int> <chr>
1 66 NA Modulex
Concluding Remarks
From this filter, you found an instance where a part category is in one table but missing from the other table.
It’s important to understand which entries would be impacted byreplace_na(), so that we know which entries we would be omitting by using that function.
In both left and right joins, there is the opportunity for there to
be NA values in the resulting table. Fortunately, the
replace_na function can turn those NAs into meaningful
values.
In the last exercise, we saw that the n column had NAs
after the right_join. Let’s use the replace_na
column, which takes a list of column names and the values
with which NAs should be replaced, to clean up our table.
# ?replace_na
\((A)\) Use
replace_nato replace NAs in thencolumn with the value0.
# (1)
parts %>%
count(part_cat_id) %>%
right_join(part_categories, by = c("part_cat_id" = "id")) %>%
# Use replace_na to replace missing values in the n column
replace_na(list(n=0))
# A tibble: 64 × 3
part_cat_id n name
<dbl> <int> <chr>
1 1 135 Baseplates
2 3 303 Bricks Sloped
3 4 1900 Duplo, Quatro and Primo
4 5 107 Bricks Special
5 6 128 Bricks Wedged
6 7 97 Containers
7 8 24 Technic Bricks
8 9 167 Plates Special
9 11 490 Bricks
10 12 85 Technic Connectors
# … with 54 more rows
# ℹ Use `print(n = ...)` to see more rows
Concluding Remarks
Now you have replaced NAs with 0s for your analysis.
Tables can be joined to themselves!
In the themes table, which is available for you to
inspect in the console, you’ll notice there is both an id column and a
parent_id column. Keeping that in mind, you can join the
themes table to itself to determine the parent-child
relationships that exist for different themes.
In the videos, you saw themes joined to their own parents. In this
exercise, you’ll try a similar approach of joining themes to their own
children, which is similar but reversed. Let’s try this out to
discover what children the theme "Harry Potter" has.
\((A)\) Inner join
themesto their own children, resulting in suffixes"_parent"&"_child", respectively.\((B)\) Filter this table to find the children of the
"Harry Potter"theme.
themes %>%
# (A) Inner join the themes table
inner_join(themes, by = c("id" = "parent_id"), suffix = c("_parent", "_child")) %>%
# (B) Filter for the "Harry Potter" parent name
filter(name_parent == "Harry Potter")
# A tibble: 6 × 5
id name_parent parent_id id_child name_child
<dbl> <chr> <dbl> <dbl> <chr>
1 246 Harry Potter NA 247 Chamber of Secrets
2 246 Harry Potter NA 248 Goblet of Fire
3 246 Harry Potter NA 249 Order of the Phoenix
4 246 Harry Potter NA 250 Prisoner of Azkaban
5 246 Harry Potter NA 251 Sorcerer's Stone
6 246 Harry Potter NA 667 Fantastic Beasts
Concluding Remarks
It looks like the Harry Potter parent theme has a few children, which are named after the books and movies from the series.
We can go a step further than looking at themes and their children. Some themes actually have grandchildren: their children’s children.
Here, we can inner join themes to a filtered version of
itself again to establish a connection between our last join’s
children and their children.
\((A)\) Use another inner join to combine
themesagain with itself.
- Be sure to use the suffixes
"_parent"&"_grandchild"so the columns in the resulting table are clear.
- Update the
byargument to specify the correct columns to join on.
- If you’re unsure of what columns to join on, it might help to look at the result of the first join.
# (1) Join themes to itself again to find the grandchild relationships
themes %>%
inner_join(themes, by = c("id" = "parent_id"), suffix = c("_parent", "_child")) %>%
inner_join(themes, by = c("id_child" = "parent_id"), suffix = c("_parent", "_grandchild"))
# A tibble: 158 × 7
id_parent name_parent parent_id id_child name_child id_grandchild name
<dbl> <chr> <dbl> <dbl> <chr> <dbl> <chr>
1 1 Technic NA 5 Model 6 Airport
2 1 Technic NA 5 Model 7 Constructi…
3 1 Technic NA 5 Model 8 Farm
4 1 Technic NA 5 Model 9 Fire
5 1 Technic NA 5 Model 10 Harbor
6 1 Technic NA 5 Model 11 Off-Road
7 1 Technic NA 5 Model 12 Race
8 1 Technic NA 5 Model 13 Riding Cyc…
9 1 Technic NA 5 Model 14 Robot
10 1 Technic NA 5 Model 15 Traffic
# … with 148 more rows
# ℹ Use `print(n = ...)` to see more rows
Concluding Remarks
It looks like there are 158 grandparent-grandchild relationships of LEGO themes.
Now you’re a whiz at using self joins to learn about hierarchical relationships.
So far, you’ve been inner joining a table to itself in order to find
the children of themes like "Harry Potter" or
"The Lord of the Rings".
But some themes might not have any children at all, which means they
won’t be included in the inner join. As you’ve learned in this chapter,
you can identify those with a left_join and a
filter().
\((A)\) Left join the themes table to its own children, with the suffixes
_parentand_childrespectively.\((B)\) Filter the result of the join to find themes that have no children.
themes %>%
# Left join the themes table to its own children
left_join(themes, by = c("id" = "parent_id"), suffix = c("_parent", "_child")) %>%
# Filter for themes that have no child themes
filter(is.na(id_child))
# A tibble: 586 × 5
id name_parent parent_id id_child name_child
<dbl> <chr> <dbl> <dbl> <chr>
1 2 Arctic Technic 1 NA <NA>
2 3 Competition 1 NA <NA>
3 4 Expert Builder 1 NA <NA>
4 6 Airport 5 NA <NA>
5 7 Construction 5 NA <NA>
6 8 Farm 5 NA <NA>
7 9 Fire 5 NA <NA>
8 10 Harbor 5 NA <NA>
9 11 Off-Road 5 NA <NA>
10 12 Race 5 NA <NA>
# … with 576 more rows
# ℹ Use `print(n = ...)` to see more rows
Concluding Remarks
From this table, we learned that there are 586 themes in total that do not have any child themes.