\(\underline{\textbf{Course Description}}\)

Often in data science, you’ll encounter fascinating data that is spread across multiple tables. This course will teach you the skills you’ll need to join multiple tables together to analyze them in combination. You’ll practice your skills using a fun dataset about LEGOs from the Rebrickable website. The dataset contains information about the sets, parts, themes, and colors of LEGOs, but is spread across many tables.

\(\underline{\textbf{Chapter 1 Description}}\)

Get started with your first joining verb: inner-join! You’ll learn to join tables together to answer questions about the LEGO dataset, which contains information across many tables about the sets, parts, themes, and colors of LEGOs over time.


library(dplyr)
library(tidyr)

Loading Chapter 1 Datasets

# Exercises 1.1-1.2
parts <- readRDS("Datasets/parts.rds")
part_categories <- readRDS("Datasets/part_categories.rds")

# Exercise 1.3-1.4
inventory_parts <- readRDS("Datasets/inventory_parts.rds")

# Exercise 1.5
sets <- readRDS("Datasets/sets.rds")
inventories <- readRDS("Datasets/inventories.rds")

# Exercise 1.6
colors <- readRDS("Datasets/colors.rds")



The inner_join Verb


What columns would you join on?

You’ll be joining together the parts and part_categories tables.

To join these two tables together using the inner_join verb, what columns would you join from each table?

apply(parts, 2, n_distinct)
   part_num        name part_cat_id 
      17501       17418          63 
summary(parts)
   part_num             name            part_cat_id   
 Length:17501       Length:17501       Min.   : 1.00  
 Class :character   Class :character   1st Qu.:17.00  
 Mode  :character   Mode  :character   Median :37.00  
                                       Mean   :36.38  
                                       3rd Qu.:59.00  
                                       Max.   :65.00  
apply(part_categories, 2, n_distinct)
  id name 
  64   64 
summary(part_categories)
       id            name          
 Min.   : 1.00   Length:64         
 1st Qu.:18.75   Class :character  
 Median :34.50   Mode  :character  
 Mean   :34.36                     
 3rd Qu.:50.25                     
 Max.   :66.00                     
table(part_categories$id)

 1  3  4  5  6  7  8  9 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 
29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 
55 56 57 58 59 60 61 62 63 64 65 66 
 1  1  1  1  1  1  1  1  1  1  1  1 
table(parts$part_cat_id)

   1    3    4    5    6    7    8    9   11   12   13   14   15   16   17   18 
 135  303 1900  107  128   97   24  167  490   85  162   59  161  257  370  155 
  19   20   21   22   23   24   25   26   27   28   29   30   31   32   33   34 
 804  173  118   31  134   17   58  116  937  529  227  113   63  102   39   59 
  35   36   37   38   39   40   41   42   43   44   45   46   47   48   49   50 
 143  311  213  322   20   48  701  288   20   18  262   28  197  129   57   23 
  51   52   53   54   55   56   57   58   59   60   61   62   63   64   65 
  33   65   20    7   24   11  357  518 1565 2091  805   61  134  107  803 


My Answer

Based on my quick assessment of each data set, it makes sense to join the part_cat_id column from parts with the id column from part_categories.
All unique values for part_cat_id are also in id.




Joining parts and part categories

The inner_join is the key to bring tables together. To use it, you need to provide the two tables that must be joined and the columns on which they should be joined.

In this exercise, you’ll join a list of LEGO parts, available as parts, with these parts’ corresponding categories, available as part_categories. For example, the part Sticker Sheet 1 for Set 1650-1 is from the Stickers part category. You can join these tables to see all parts’ categories!

Instructions

\((A)\) Add the correct joining verb, the name of the second table, and the joining column for the second table.

\((B)\) Now, use the suffix argument to add "_part" and "_category" suffixes to replace
\(\hphantom{(c~)}\) the name.x and name.y fields.

# (A) Add the correct verb, table, and joining column
parts %>% 
  inner_join(part_categories, by = c("part_cat_id" = "id"))
# A tibble: 17,501 × 4
   part_num   name.x                                              part_…¹ name.y
   <chr>      <chr>                                                 <dbl> <chr> 
 1 0901       Baseplate 16 x 30 with Set 080 Yellow House Print         1 Basep…
 2 0902       Baseplate 16 x 24 with Set 080 Small White House P…       1 Basep…
 3 0903       Baseplate 16 x 24 with Set 080 Red House Print            1 Basep…
 4 0904       Baseplate 16 x 24 with Set 080 Large White House P…       1 Basep…
 5 1          Homemaker Bookcase 2 x 4 x 4                              7 Conta…
 6 10016414   Sticker Sheet #1 for 41055-1                             58 Stick…
 7 10026stk01 Sticker for Set 10026 - (44942/4184185)                  58 Stick…
 8 10039      Pullback Motor 8 x 4 x 2/3                               44 Mecha…
 9 10048      Minifig Hair Tousled                                     65 Minif…
10 10049      Minifig Shield Broad with Spiked Bottom and Cutout…      27 Minif…
# … with 17,491 more rows, and abbreviated variable name ¹​part_cat_id
# ℹ Use `print(n = ...)` to see more rows
# (B) Use the suffix argument to replace .x and .y suffixes
parts %>% 
  inner_join(part_categories, by = c("part_cat_id" = "id"), suffix = c("_part", "_category"))
# A tibble: 17,501 × 4
   part_num   name_part                                          part_…¹ name_…²
   <chr>      <chr>                                                <dbl> <chr>  
 1 0901       Baseplate 16 x 30 with Set 080 Yellow House Print        1 Basepl…
 2 0902       Baseplate 16 x 24 with Set 080 Small White House …       1 Basepl…
 3 0903       Baseplate 16 x 24 with Set 080 Red House Print           1 Basepl…
 4 0904       Baseplate 16 x 24 with Set 080 Large White House …       1 Basepl…
 5 1          Homemaker Bookcase 2 x 4 x 4                             7 Contai…
 6 10016414   Sticker Sheet #1 for 41055-1                            58 Sticke…
 7 10026stk01 Sticker for Set 10026 - (44942/4184185)                 58 Sticke…
 8 10039      Pullback Motor 8 x 4 x 2/3                              44 Mechan…
 9 10048      Minifig Hair Tousled                                    65 Minifi…
10 10049      Minifig Shield Broad with Spiked Bottom and Cutou…      27 Minifi…
# … with 17,491 more rows, and abbreviated variable names ¹​part_cat_id,
#   ²​name_category
# ℹ Use `print(n = ...)` to see more rows


Concluding Remarks

You can now see the part category associated with each part, and you used the suffix argument to make sure your table’s fields are clear.




Joining with a One-To-Many Relationship


Joining parts and inventories

The LEGO data has many tables that can be joined together. Often times, some of the things you care about may be a few tables away (we’ll get to that later in the course). For now, we know that parts is a list of all LEGO parts, and a new table, inventory_parts, has some additional information about those parts, such as the color_id of each part you would find in a specific LEGO kit.

Let’s join these two tables together to observe how joining parts with inventory_parts increases the size of your table because of the one-to-many relationship that exists between these two tables.

Instructions

\((A)\) Connect the parts and inventory_parts tables by their part numbers using an inner join.

# (A) Combine the parts and inventory_parts tables
parts %>%
  inner_join(inventory_parts, by = c("part_num" = "part_num"))
# A tibble: 258,958 × 6
   part_num name                                 part_…¹ inven…² color…³ quant…⁴
   <chr>    <chr>                                  <dbl>   <dbl>   <dbl>   <dbl>
 1 0901     Baseplate 16 x 30 with Set 080 Yell…       1    1973       2       1
 2 0902     Baseplate 16 x 24 with Set 080 Smal…       1    1973       2       1
 3 0903     Baseplate 16 x 24 with Set 080 Red …       1    1973       2       1
 4 0904     Baseplate 16 x 24 with Set 080 Larg…       1    1973       2       1
 5 1        Homemaker Bookcase 2 x 4 x 4               7     508      15       1
 6 1        Homemaker Bookcase 2 x 4 x 4               7    1158      15       2
 7 1        Homemaker Bookcase 2 x 4 x 4               7    6590      15       2
 8 1        Homemaker Bookcase 2 x 4 x 4               7    9679      15       2
 9 1        Homemaker Bookcase 2 x 4 x 4               7   12256       1       2
10 1        Homemaker Bookcase 2 x 4 x 4               7   13356      15       1
# … with 258,948 more rows, and abbreviated variable names ¹​part_cat_id,
#   ²​inventory_id, ³​color_id, ⁴​quantity
# ℹ Use `print(n = ...)` to see more rows


Concluding Remarks

Recall, this is an example of a one-to-many relationship.
Notice that the table increased in the number of rows after the join.




Joining in either direction

An inner_join works the same way with either table in either position. The table that is specified first is arbitrary, since you will end up with the same information in the resulting table either way.

Let’s prove this by joining the same two tables from the last exercise in the opposite order!

Instructions

\((A)\) Connect the inventory_parts and parts tables by their part numbers using an inner join.

# (A) Combine the parts and inventory_parts tables
inventory_parts %>%
  inner_join(parts, by = "part_num") # Only one join id needed if same name
# A tibble: 258,958 × 6
   inventory_id part_num             color_id quantity name              part_…¹
          <dbl> <chr>                   <dbl>    <dbl> <chr>               <dbl>
 1           21 3009                        7       50 Brick 1 x 6            11
 2           25 21019c00pat004pr1033       15        1 Legs and Hips wi…      61
 3           25 24629pr0002                78        1 Minifig Head Spe…      59
 4           25 24634pr0001                 5        1 Headwear Accesso…      27
 5           25 24782pr0001                 5        1 Minifig Hipwear …      27
 6           25 88646                       0        1 Tile Special 4 x…      15
 7           25 973pr3314c01                5        1 Torso with 1 Whi…      60
 8           26 14226c11                    0        3 String with End …      31
 9           26 2340px2                    15        1 Tail 4 x 1 x 3 w…      35
10           26 2340px3                    15        1 Tail 4 x 1 x 3 w…      35
# … with 258,948 more rows, and abbreviated variable name ¹​part_cat_id
# ℹ Use `print(n = ...)` to see more rows


Concluding Remarks

This is the same join as the last exercise, but the order of the tables is reversed.
For an inner_join, either direction will yield a table that contains the same information!
Note that the columns will appear in a different order depending on which table comes first.




Joining Three or More Tables


Joining three tables

You can string together multiple joins with inner_join and the pipe (%>%), both with which you are already very familiar!

We’ll now connect sets, a table that tells us about each LEGO kit, with inventories, a table that tells us the specific version of a given set, and finally to inventory_parts, a table which tells us how many of each part is available in each LEGO kit.

So if you were building a Batman LEGO set, sets would tell you the name of the set, inventories would give you IDs for each of the versions of the set, and inventory_parts would tell you how many of each part would be in each version.

Instructions

\((A)\) Combine the inventories table with the sets table.

\((B)\) Next, join the inventory_parts table to the table you created in the previous join by the inventory IDs.

sets %>%
  # (A) Add inventories using an inner join 
  inner_join(inventories, by = "set_num") %>%
  # (B) Add inventory_parts using an inner join 
  inner_join(inventory_parts, by = c("id" = "inventory_id"))
# A tibble: 258,958 × 9
   set_num name               year theme…¹    id version part_…² color…³ quant…⁴
   <chr>   <chr>             <dbl>   <dbl> <dbl>   <dbl> <chr>     <dbl>   <dbl>
 1 700.3-1 Medium Gift Set …  1949     365 24197       1 bdoor01       2       2
 2 700.3-1 Medium Gift Set …  1949     365 24197       1 bdoor01      15       1
 3 700.3-1 Medium Gift Set …  1949     365 24197       1 bdoor01       4       1
 4 700.3-1 Medium Gift Set …  1949     365 24197       1 bslot02      15       6
 5 700.3-1 Medium Gift Set …  1949     365 24197       1 bslot02       2       6
 6 700.3-1 Medium Gift Set …  1949     365 24197       1 bslot02       4       6
 7 700.3-1 Medium Gift Set …  1949     365 24197       1 bslot02       1       6
 8 700.3-1 Medium Gift Set …  1949     365 24197       1 bslot02      14       6
 9 700.3-1 Medium Gift Set …  1949     365 24197       1 bslot0…      15       6
10 700.3-1 Medium Gift Set …  1949     365 24197       1 bslot0…       2       6
# … with 258,948 more rows, and abbreviated variable names ¹​theme_id,
#   ²​part_num, ³​color_id, ⁴​quantity
# ℹ Use `print(n = ...)` to see more rows


Concluding Remarks

You joined together multiple tables using inner_join() and the pipe!




What’s the most common color?

Now let’s join an additional table, colors, which will tell us the color of each part in each set, so that we can answer the question, “what is the most common color of a LEGO piece?”

Instructions

\((A)\) Inner join the colors table using the color_id column from the previous join and
\(\hphantom{(c~)}\) the id column from colors; use the suffixes "_set" and "_color".

\((B)\) Count the name_color column and sort the results so the most prominent colors appear first.

# (A) Add an inner join for the colors table
sets %>%
  inner_join(inventories, by = "set_num") %>%
  inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
  inner_join(colors, by = c("color_id" = "id"), suffix = c("_set",  "_color"))
# A tibble: 258,958 × 11
   set_num name_set   year theme…¹    id version part_…² color…³ quant…⁴ name_…⁵
   <chr>   <chr>     <dbl>   <dbl> <dbl>   <dbl> <chr>     <dbl>   <dbl> <chr>  
 1 700.3-1 Medium G…  1949     365 24197       1 bdoor01       2       2 Green  
 2 700.3-1 Medium G…  1949     365 24197       1 bdoor01      15       1 White  
 3 700.3-1 Medium G…  1949     365 24197       1 bdoor01       4       1 Red    
 4 700.3-1 Medium G…  1949     365 24197       1 bslot02      15       6 White  
 5 700.3-1 Medium G…  1949     365 24197       1 bslot02       2       6 Green  
 6 700.3-1 Medium G…  1949     365 24197       1 bslot02       4       6 Red    
 7 700.3-1 Medium G…  1949     365 24197       1 bslot02       1       6 Blue   
 8 700.3-1 Medium G…  1949     365 24197       1 bslot02      14       6 Yellow 
 9 700.3-1 Medium G…  1949     365 24197       1 bslot0…      15       6 White  
10 700.3-1 Medium G…  1949     365 24197       1 bslot0…       2       6 Green  
# … with 258,948 more rows, 1 more variable: rgb <chr>, and abbreviated
#   variable names ¹​theme_id, ²​part_num, ³​color_id, ⁴​quantity, ⁵​name_color
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
# (B) Count the number of colors and sort
sets %>%
  inner_join(inventories, by = "set_num") %>%
  inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
  inner_join(colors, by = c("color_id" = "id"), suffix = c("_set", "_color")) %>%
  count(name_color, sort = TRUE)
# A tibble: 134 × 2
   name_color            n
   <chr>             <int>
 1 Black             48068
 2 White             30105
 3 Light Bluish Gray 26024
 4 Red               21602
 5 Dark Bluish Gray  19948
 6 Yellow            17088
 7 Blue              12980
 8 Light Gray         8632
 9 Reddish Brown      6960
10 Tan                6664
# … with 124 more rows
# ℹ Use `print(n = ...)` to see more rows


Concluding Remarks

Notice that Black and White are the two most prominent colors.