\(\underline{\textbf{Chapter 2 Description}}\)

Learn two more mutating joins, the left and right join, which are mirror images of each other! You’ll learn use cases for each type of join as you explore parts and colors of LEGO themes. Then, you’ll explore how to join tables to themselves to understand the hierarchy of LEGO themes in the data.


library(dplyr)
library(tidyr)  # Used to replace NA values

Chapter 2 Datasets

parts <- readRDS("Datasets/parts.rds")
part_categories <- readRDS("Datasets/part_categories.rds")

inventory_parts <- readRDS("Datasets/inventory_parts.rds")

sets <- readRDS("Datasets/sets.rds")
inventories <- readRDS("Datasets/inventories.rds")

colors <- readRDS("Datasets/colors.rds")

themes <- readRDS("Datasets/themes.rds")
inventory_parts_joined <- inventories %>%
  inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
  inner_join(sets, by = "set_num") %>%
  select("set_num", "part_num", "color_id", "quantity")
  
millennium_falcon <- inventory_parts_joined %>%
  filter(set_num == "7965-1")
star_destroyer <- inventory_parts_joined %>%
  filter(set_num == "75190-1")



The left_join Verb


Left joining two sets by part and color

In the video, you learned how to left join two LEGO sets. Now you’ll practice your ability to do this looking at two new sets: the Millennium Falcon and Star Destroyer sets.

Instructions

\((A)\) Left join star_destroyer and millennium_falcon tables on the part_num and color_id
\(\hphantom{(c~)}\) columns with the suffixes _falcon and _star_destroyer.

millennium_falcon %>%
  left_join(star_destroyer, by = c("part_num", "color_id"), suffix = c("_falcon", "_star_destroyer"))
# A tibble: 263 × 6
   set_num_falcon part_num color_id quantity_falcon set_num_star_destr…¹ quant…²
   <chr>          <chr>       <dbl>           <dbl> <chr>                  <dbl>
 1 7965-1         12825          72               3 <NA>                      NA
 2 7965-1         2412b          72              20 75190-1                   11
 3 7965-1         2412b         320               2 <NA>                      NA
 4 7965-1         2419           71               1 <NA>                      NA
 5 7965-1         2420            0               4 75190-1                    1
 6 7965-1         2420           71               1 <NA>                      NA
 7 7965-1         2420           71               7 <NA>                      NA
 8 7965-1         2431           72               2 <NA>                      NA
 9 7965-1         2431            0               1 75190-1                    3
10 7965-1         2431           19               2 <NA>                      NA
# … with 253 more rows, and abbreviated variable names ¹​set_num_star_destroyer,
#   ²​quantity_star_destroyer
# ℹ Use `print(n = ...)` to see more rows


Concluding Remarks

Using this table, and the part_num or color_id columns, you can determine the frequency that a specific part or piece appears in the Millennium Falcon and Star Destroyer sets!




Left joining two sets by color

In the videos and the last exercise, you joined two sets based on their part and color. What if you joined the datasets by color alone?

Instructions

\((A)\) Sum the quantity column groupings by color_id in the Millennium Falcon dataset.

\((B)\) Now, sum the quantity column by color_id in the Star Destroyer dataset.

\((C)\) Left join the two datasets, millennium_falcon_colors and star_destroyer_colors,
\(\hphantom{(c~)}\) using the color_id column and the _falcon and _star_destroyer suffixes.

# (1) Aggregate Millennium Falcon for the total quantity in each part
millennium_falcon_colors <- millennium_falcon %>%
  group_by(color_id) %>%
  summarize(total_quantity = sum(quantity))
# millennium_falcon_colors
# (2) Aggregate Star Destroyer for the total quantity in each part
star_destroyer_colors <- star_destroyer %>%
  group_by(color_id) %>%
  summarize(total_quantity = sum(quantity))
# (3) Left join the Millennium Falcon colors to the Star Destroyer colors
millennium_falcon_colors %>%
  left_join(star_destroyer_colors, by = "color_id", suffix = c("_falcon", "_star_destroyer"))
# A tibble: 21 × 3
   color_id total_quantity_falcon total_quantity_star_destroyer
      <dbl>                 <dbl>                         <dbl>
 1        0                   201                           336
 2        1                    15                            23
 3        4                    17                            53
 4       14                     3                             4
 5       15                    15                            17
 6       19                    95                            12
 7       28                     3                            16
 8       33                     5                            NA
 9       36                     1                            14
10       41                     6                            15
# … with 11 more rows
# ℹ Use `print(n = ...)` to see more rows


Concluding Remarks

That’s right! Sometimes, the level on which you’re comparing two sets- by piece or just by color- will require some thought and some pre-processing.




Finding an observation that doesn’t have a match

Left joins are really great for testing your assumptions about a data set and ensuring your data has integrity.

For example, the inventories table has a version column, for when a LEGO kit gets some kind of change or upgrade. It would be fair to assume that all sets (which joins well with inventories) would have at least a version 1. But let’s test this assumption out in the following exercise.

Instructions

\((A)\) Use a left_join to join together sets and inventory_version_1 using their common column.

\((B)\) filter for where the version column is NA using is.na.

inventory_version_1 <- inventories %>%
  filter(version == 1)

# (A) Join versions to sets
sets %>%
  left_join(inventory_version_1, by = "set_num") %>%
  # (B) Filter for where version is na
  filter(is.na(version))
# A tibble: 1 × 6
  set_num name       year theme_id    id version
  <chr>   <chr>     <dbl>    <dbl> <dbl>   <dbl>
1 40198-1 Ludo game  2018      598    NA      NA


Concluding Remarks

It looks like there are cases where a set does not have an original version.
In Chapter 3, you’ll learn another way to find observations like this: anti_join.




The right_join Verb


Which join is best?

Now that you’ve learned three different types of joins, you will be able to identify situations in which each join should be used.

Instructions

\((A)\) Match each definition with the type of join that is described.


Concluding Remarks

You’ve distinguished between three different types of mutating joins: the inner, left, and right join.
In the next chapter, you’ll learn another type of mutating join: the full join!




Counting part colors

Sometimes you’ll want to do some processing before you do a join, and prioritize keeping the second (right) table’s rows instead. In this case, a right join is for you.

In this exercise, we’ll count the part_cat_id from parts, before using a right_join to join with part_categories. The reason we do this is because we don’t only want to know the count of part_cat_id in parts, but we also want to know if there are any part_cat_ids not present in parts.

Instructions

\((A)\) Use the count verb to count each part_cat_id in the parts table.

\((B)\) Use a right_join to join part_categories.
\(\qquad\hookrightarrow\) You’ll need to use the part_cat_id from the count and the id column from part_categories.

\((C)\) filter for where the column n is NA.

parts %>%
  # (A) Count the part_cat_id
  count(part_cat_id) %>%
  # (B) Right join part_categories
  right_join(part_categories, by = c("part_cat_id" = "id")) %>%
  # (C) Filter for NA
  filter(is.na(n))
# A tibble: 1 × 3
  part_cat_id     n name   
        <dbl> <int> <chr>  
1          66    NA Modulex


Concluding Remarks

From this filter, you found an instance where a part category is in one table but missing from the other table.
It’s important to understand which entries would be impacted by replace_na(), so that we know which entries we would be omitting by using that function.




Cleaning up your count

In both left and right joins, there is the opportunity for there to be NA values in the resulting table. Fortunately, the replace_na function can turn those NAs into meaningful values.

In the last exercise, we saw that the n column had NAs after the right_join. Let’s use the replace_na column, which takes a list of column names and the values with which NAs should be replaced, to clean up our table.

# ?replace_na

Instructions

\((A)\) Use replace_na to replace NAs in the n column with the value 0.

# (1)
parts %>%
  count(part_cat_id) %>%
  right_join(part_categories, by = c("part_cat_id" = "id")) %>%
  # Use replace_na to replace missing values in the n column
  replace_na(list(n=0))
# A tibble: 64 × 3
   part_cat_id     n name                   
         <dbl> <int> <chr>                  
 1           1   135 Baseplates             
 2           3   303 Bricks Sloped          
 3           4  1900 Duplo, Quatro and Primo
 4           5   107 Bricks Special         
 5           6   128 Bricks Wedged          
 6           7    97 Containers             
 7           8    24 Technic Bricks         
 8           9   167 Plates Special         
 9          11   490 Bricks                 
10          12    85 Technic Connectors     
# … with 54 more rows
# ℹ Use `print(n = ...)` to see more rows


Concluding Remarks

Now you have replaced NAs with 0s for your analysis.




Joining tables to their children

Tables can be joined to themselves!

In the themes table, which is available for you to inspect in the console, you’ll notice there is both an id column and a parent_id column. Keeping that in mind, you can join the themes table to itself to determine the parent-child relationships that exist for different themes.

In the videos, you saw themes joined to their own parents. In this exercise, you’ll try a similar approach of joining themes to their own children, which is similar but reversed. Let’s try this out to discover what children the theme "Harry Potter" has.

Instructions

\((A)\) Inner join themes to their own children, resulting in suffixes "_parent" & "_child", respectively.

\((B)\) Filter this table to find the children of the "Harry Potter"theme.

themes %>% 
  # (A) Inner join the themes table
  inner_join(themes, by = c("id" = "parent_id"), suffix = c("_parent", "_child")) %>%
  # (B) Filter for the "Harry Potter" parent name 
  filter(name_parent == "Harry Potter")
# A tibble: 6 × 5
     id name_parent  parent_id id_child name_child          
  <dbl> <chr>            <dbl>    <dbl> <chr>               
1   246 Harry Potter        NA      247 Chamber of Secrets  
2   246 Harry Potter        NA      248 Goblet of Fire      
3   246 Harry Potter        NA      249 Order of the Phoenix
4   246 Harry Potter        NA      250 Prisoner of Azkaban 
5   246 Harry Potter        NA      251 Sorcerer's Stone    
6   246 Harry Potter        NA      667 Fantastic Beasts    


Concluding Remarks

It looks like the Harry Potter parent theme has a few children, which are named after the books and movies from the series.




Joining themes to their grandchildren

We can go a step further than looking at themes and their children. Some themes actually have grandchildren: their children’s children.

Here, we can inner join themes to a filtered version of itself again to establish a connection between our last join’s children and their children.

Instructions

\((A)\) Use another inner join to combine themes again with itself.

  • Be sure to use the suffixes "_parent" & "_grandchild" so the columns in the resulting table are clear.
  • Update the by argument to specify the correct columns to join on.
  • If you’re unsure of what columns to join on, it might help to look at the result of the first join.
# (1) Join themes to itself again to find the grandchild relationships
themes %>% 
  inner_join(themes, by = c("id" = "parent_id"), suffix = c("_parent", "_child")) %>%
  inner_join(themes, by = c("id_child" = "parent_id"), suffix = c("_parent", "_grandchild"))
# A tibble: 158 × 7
   id_parent name_parent parent_id id_child name_child id_grandchild name       
       <dbl> <chr>           <dbl>    <dbl> <chr>              <dbl> <chr>      
 1         1 Technic            NA        5 Model                  6 Airport    
 2         1 Technic            NA        5 Model                  7 Constructi…
 3         1 Technic            NA        5 Model                  8 Farm       
 4         1 Technic            NA        5 Model                  9 Fire       
 5         1 Technic            NA        5 Model                 10 Harbor     
 6         1 Technic            NA        5 Model                 11 Off-Road   
 7         1 Technic            NA        5 Model                 12 Race       
 8         1 Technic            NA        5 Model                 13 Riding Cyc…
 9         1 Technic            NA        5 Model                 14 Robot      
10         1 Technic            NA        5 Model                 15 Traffic    
# … with 148 more rows
# ℹ Use `print(n = ...)` to see more rows


Concluding Remarks

It looks like there are 158 grandparent-grandchild relationships of LEGO themes.
Now you’re a whiz at using self joins to learn about hierarchical relationships.




Left joining a table to itself

So far, you’ve been inner joining a table to itself in order to find the children of themes like "Harry Potter" or "The Lord of the Rings".

But some themes might not have any children at all, which means they won’t be included in the inner join. As you’ve learned in this chapter, you can identify those with a left_join and a filter().

Instructions

\((A)\) Left join the themes table to its own children, with the suffixes _parent and _child respectively.

\((B)\) Filter the result of the join to find themes that have no children.

themes %>% 
  # Left join the themes table to its own children
  left_join(themes, by = c("id" = "parent_id"), suffix = c("_parent", "_child")) %>%
  # Filter for themes that have no child themes
  filter(is.na(id_child))
# A tibble: 586 × 5
      id name_parent    parent_id id_child name_child
   <dbl> <chr>              <dbl>    <dbl> <chr>     
 1     2 Arctic Technic         1       NA <NA>      
 2     3 Competition            1       NA <NA>      
 3     4 Expert Builder         1       NA <NA>      
 4     6 Airport                5       NA <NA>      
 5     7 Construction           5       NA <NA>      
 6     8 Farm                   5       NA <NA>      
 7     9 Fire                   5       NA <NA>      
 8    10 Harbor                 5       NA <NA>      
 9    11 Off-Road               5       NA <NA>      
10    12 Race                   5       NA <NA>      
# … with 576 more rows
# ℹ Use `print(n = ...)` to see more rows


Concluding Remarks

From this table, we learned that there are 586 themes in total that do not have any child themes.