\(\underline{\textbf{Course Description}}\)
Often in data science, you’ll encounter fascinating data that is spread across multiple tables. This course will teach you the skills you’ll need to join multiple tables together to analyze them in combination. You’ll practice your skills using a fun dataset about LEGOs from the Rebrickable website. The dataset contains information about the sets, parts, themes, and colors of LEGOs, but is spread across many tables.
\(\underline{\textbf{Chapter 1 Description}}\)
Get started with your first joining verb: inner-join! You’ll learn to join tables together to answer questions about the LEGO dataset, which contains information across many tables about the sets, parts, themes, and colors of LEGOs over time.
library(dplyr)
library(tidyr)
Loading Chapter 1 Datasets
# Exercises 1.1-1.2
parts <- readRDS("Datasets/parts.rds")
part_categories <- readRDS("Datasets/part_categories.rds")
# Exercise 1.3-1.4
inventory_parts <- readRDS("Datasets/inventory_parts.rds")
# Exercise 1.5
sets <- readRDS("Datasets/sets.rds")
inventories <- readRDS("Datasets/inventories.rds")
# Exercise 1.6
colors <- readRDS("Datasets/colors.rds")
inner_join VerbYou’ll be joining together the parts and
part_categories tables.
To join these two tables together using the inner_join
verb, what columns would you join from each table?
apply(parts, 2, n_distinct)
part_num name part_cat_id
17501 17418 63
summary(parts)
part_num name part_cat_id
Length:17501 Length:17501 Min. : 1.00
Class :character Class :character 1st Qu.:17.00
Mode :character Mode :character Median :37.00
Mean :36.38
3rd Qu.:59.00
Max. :65.00
apply(part_categories, 2, n_distinct)
id name
64 64
summary(part_categories)
id name
Min. : 1.00 Length:64
1st Qu.:18.75 Class :character
Median :34.50 Mode :character
Mean :34.36
3rd Qu.:50.25
Max. :66.00
table(part_categories$id)
1 3 4 5 6 7 8 9 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
55 56 57 58 59 60 61 62 63 64 65 66
1 1 1 1 1 1 1 1 1 1 1 1
table(parts$part_cat_id)
1 3 4 5 6 7 8 9 11 12 13 14 15 16 17 18
135 303 1900 107 128 97 24 167 490 85 162 59 161 257 370 155
19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
804 173 118 31 134 17 58 116 937 529 227 113 63 102 39 59
35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
143 311 213 322 20 48 701 288 20 18 262 28 197 129 57 23
51 52 53 54 55 56 57 58 59 60 61 62 63 64 65
33 65 20 7 24 11 357 518 1565 2091 805 61 134 107 803
Based on my quick assessment of each data set, it makes sense to join the
part_cat_idcolumn frompartswith theidcolumn frompart_categories.
All unique values forpart_cat_idare also inid.
The inner_join is the key to bring tables together. To
use it, you need to provide the two tables that must be joined and the
columns on which they should be joined.
In this exercise, you’ll join a list of LEGO parts, available as
parts, with these parts’ corresponding categories,
available as part_categories. For example, the part
Sticker Sheet 1 for Set 1650-1 is from the
Stickers part category. You can join these tables to see
all parts’ categories!
\((A)\) Add the correct joining verb, the name of the second table, and the joining column for the second table.
\((B)\) Now, use the
suffixargument to add"_part"and"_category"suffixes to replace
\(\hphantom{(c~)}\) thename.xandname.yfields.
# (A) Add the correct verb, table, and joining column
parts %>%
inner_join(part_categories, by = c("part_cat_id" = "id"))
# A tibble: 17,501 × 4
part_num name.x part_…¹ name.y
<chr> <chr> <dbl> <chr>
1 0901 Baseplate 16 x 30 with Set 080 Yellow House Print 1 Basep…
2 0902 Baseplate 16 x 24 with Set 080 Small White House P… 1 Basep…
3 0903 Baseplate 16 x 24 with Set 080 Red House Print 1 Basep…
4 0904 Baseplate 16 x 24 with Set 080 Large White House P… 1 Basep…
5 1 Homemaker Bookcase 2 x 4 x 4 7 Conta…
6 10016414 Sticker Sheet #1 for 41055-1 58 Stick…
7 10026stk01 Sticker for Set 10026 - (44942/4184185) 58 Stick…
8 10039 Pullback Motor 8 x 4 x 2/3 44 Mecha…
9 10048 Minifig Hair Tousled 65 Minif…
10 10049 Minifig Shield Broad with Spiked Bottom and Cutout… 27 Minif…
# … with 17,491 more rows, and abbreviated variable name ¹part_cat_id
# ℹ Use `print(n = ...)` to see more rows
# (B) Use the suffix argument to replace .x and .y suffixes
parts %>%
inner_join(part_categories, by = c("part_cat_id" = "id"), suffix = c("_part", "_category"))
# A tibble: 17,501 × 4
part_num name_part part_…¹ name_…²
<chr> <chr> <dbl> <chr>
1 0901 Baseplate 16 x 30 with Set 080 Yellow House Print 1 Basepl…
2 0902 Baseplate 16 x 24 with Set 080 Small White House … 1 Basepl…
3 0903 Baseplate 16 x 24 with Set 080 Red House Print 1 Basepl…
4 0904 Baseplate 16 x 24 with Set 080 Large White House … 1 Basepl…
5 1 Homemaker Bookcase 2 x 4 x 4 7 Contai…
6 10016414 Sticker Sheet #1 for 41055-1 58 Sticke…
7 10026stk01 Sticker for Set 10026 - (44942/4184185) 58 Sticke…
8 10039 Pullback Motor 8 x 4 x 2/3 44 Mechan…
9 10048 Minifig Hair Tousled 65 Minifi…
10 10049 Minifig Shield Broad with Spiked Bottom and Cutou… 27 Minifi…
# … with 17,491 more rows, and abbreviated variable names ¹part_cat_id,
# ²name_category
# ℹ Use `print(n = ...)` to see more rows
Concluding Remarks
You can now see the part category associated with each part, and you used the
suffixargument to make sure your table’s fields are clear.
The LEGO data has many tables that can be joined together. Often
times, some of the things you care about may be a few tables away (we’ll
get to that later in the course). For now, we know that
parts is a list of all LEGO parts, and a new table,
inventory_parts, has some additional information about
those parts, such as the color_id of each part you would
find in a specific LEGO kit.
Let’s join these two tables together to observe how joining
parts with inventory_parts increases the size
of your table because of the one-to-many relationship that exists
between these two tables.
\((A)\) Connect the
partsandinventory_partstables by their part numbers using an inner join.
# (A) Combine the parts and inventory_parts tables
parts %>%
inner_join(inventory_parts, by = c("part_num" = "part_num"))
# A tibble: 258,958 × 6
part_num name part_…¹ inven…² color…³ quant…⁴
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 0901 Baseplate 16 x 30 with Set 080 Yell… 1 1973 2 1
2 0902 Baseplate 16 x 24 with Set 080 Smal… 1 1973 2 1
3 0903 Baseplate 16 x 24 with Set 080 Red … 1 1973 2 1
4 0904 Baseplate 16 x 24 with Set 080 Larg… 1 1973 2 1
5 1 Homemaker Bookcase 2 x 4 x 4 7 508 15 1
6 1 Homemaker Bookcase 2 x 4 x 4 7 1158 15 2
7 1 Homemaker Bookcase 2 x 4 x 4 7 6590 15 2
8 1 Homemaker Bookcase 2 x 4 x 4 7 9679 15 2
9 1 Homemaker Bookcase 2 x 4 x 4 7 12256 1 2
10 1 Homemaker Bookcase 2 x 4 x 4 7 13356 15 1
# … with 258,948 more rows, and abbreviated variable names ¹part_cat_id,
# ²inventory_id, ³color_id, ⁴quantity
# ℹ Use `print(n = ...)` to see more rows
Concluding Remarks
Recall, this is an example of a one-to-many relationship.
Notice that the table increased in the number of rows after the join.
An inner_join works the same way with either table in
either position. The table that is specified first is arbitrary, since
you will end up with the same information in the resulting table either
way.
Let’s prove this by joining the same two tables from the last exercise in the opposite order!
\((A)\) Connect the
inventory_partsandpartstables by their part numbers using an inner join.
# (A) Combine the parts and inventory_parts tables
inventory_parts %>%
inner_join(parts, by = "part_num") # Only one join id needed if same name
# A tibble: 258,958 × 6
inventory_id part_num color_id quantity name part_…¹
<dbl> <chr> <dbl> <dbl> <chr> <dbl>
1 21 3009 7 50 Brick 1 x 6 11
2 25 21019c00pat004pr1033 15 1 Legs and Hips wi… 61
3 25 24629pr0002 78 1 Minifig Head Spe… 59
4 25 24634pr0001 5 1 Headwear Accesso… 27
5 25 24782pr0001 5 1 Minifig Hipwear … 27
6 25 88646 0 1 Tile Special 4 x… 15
7 25 973pr3314c01 5 1 Torso with 1 Whi… 60
8 26 14226c11 0 3 String with End … 31
9 26 2340px2 15 1 Tail 4 x 1 x 3 w… 35
10 26 2340px3 15 1 Tail 4 x 1 x 3 w… 35
# … with 258,948 more rows, and abbreviated variable name ¹part_cat_id
# ℹ Use `print(n = ...)` to see more rows
Concluding Remarks
This is the same join as the last exercise, but the order of the tables is reversed.
For aninner_join, either direction will yield a table that contains the same information!
Note that the columns will appear in a different order depending on which table comes first.
You can string together multiple joins with inner_join
and the pipe (%>%), both with which you are already very
familiar!
We’ll now connect sets, a table that tells us about each
LEGO kit, with inventories, a table that tells us the
specific version of a given set, and finally to
inventory_parts, a table which tells us how many of each
part is available in each LEGO kit.
So if you were building a Batman LEGO set, sets would
tell you the name of the set, inventories would give you
IDs for each of the versions of the set, and
inventory_parts would tell you how many of each part would
be in each version.
\((A)\) Combine the
inventoriestable with thesetstable.\((B)\) Next, join the
inventory_partstable to the table you created in the previous join by the inventory IDs.
sets %>%
# (A) Add inventories using an inner join
inner_join(inventories, by = "set_num") %>%
# (B) Add inventory_parts using an inner join
inner_join(inventory_parts, by = c("id" = "inventory_id"))
# A tibble: 258,958 × 9
set_num name year theme…¹ id version part_…² color…³ quant…⁴
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
1 700.3-1 Medium Gift Set … 1949 365 24197 1 bdoor01 2 2
2 700.3-1 Medium Gift Set … 1949 365 24197 1 bdoor01 15 1
3 700.3-1 Medium Gift Set … 1949 365 24197 1 bdoor01 4 1
4 700.3-1 Medium Gift Set … 1949 365 24197 1 bslot02 15 6
5 700.3-1 Medium Gift Set … 1949 365 24197 1 bslot02 2 6
6 700.3-1 Medium Gift Set … 1949 365 24197 1 bslot02 4 6
7 700.3-1 Medium Gift Set … 1949 365 24197 1 bslot02 1 6
8 700.3-1 Medium Gift Set … 1949 365 24197 1 bslot02 14 6
9 700.3-1 Medium Gift Set … 1949 365 24197 1 bslot0… 15 6
10 700.3-1 Medium Gift Set … 1949 365 24197 1 bslot0… 2 6
# … with 258,948 more rows, and abbreviated variable names ¹theme_id,
# ²part_num, ³color_id, ⁴quantity
# ℹ Use `print(n = ...)` to see more rows
Concluding Remarks
You joined together multiple tables using
inner_join()and the pipe!
Now let’s join an additional table, colors, which will
tell us the color of each part in each set, so that we can answer the
question, “what is the most common color of a LEGO piece?”
\((A)\) Inner join the
colorstable using thecolor_idcolumn from the previous join and
\(\hphantom{(c~)}\) theidcolumn fromcolors; use the suffixes"_set"and"_color".\((B)\) Count the
name_colorcolumn and sort the results so the most prominent colors appear first.
# (A) Add an inner join for the colors table
sets %>%
inner_join(inventories, by = "set_num") %>%
inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
inner_join(colors, by = c("color_id" = "id"), suffix = c("_set", "_color"))
# A tibble: 258,958 × 11
set_num name_set year theme…¹ id version part_…² color…³ quant…⁴ name_…⁵
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
1 700.3-1 Medium G… 1949 365 24197 1 bdoor01 2 2 Green
2 700.3-1 Medium G… 1949 365 24197 1 bdoor01 15 1 White
3 700.3-1 Medium G… 1949 365 24197 1 bdoor01 4 1 Red
4 700.3-1 Medium G… 1949 365 24197 1 bslot02 15 6 White
5 700.3-1 Medium G… 1949 365 24197 1 bslot02 2 6 Green
6 700.3-1 Medium G… 1949 365 24197 1 bslot02 4 6 Red
7 700.3-1 Medium G… 1949 365 24197 1 bslot02 1 6 Blue
8 700.3-1 Medium G… 1949 365 24197 1 bslot02 14 6 Yellow
9 700.3-1 Medium G… 1949 365 24197 1 bslot0… 15 6 White
10 700.3-1 Medium G… 1949 365 24197 1 bslot0… 2 6 Green
# … with 258,948 more rows, 1 more variable: rgb <chr>, and abbreviated
# variable names ¹theme_id, ²part_num, ³color_id, ⁴quantity, ⁵name_color
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
# (B) Count the number of colors and sort
sets %>%
inner_join(inventories, by = "set_num") %>%
inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
inner_join(colors, by = c("color_id" = "id"), suffix = c("_set", "_color")) %>%
count(name_color, sort = TRUE)
# A tibble: 134 × 2
name_color n
<chr> <int>
1 Black 48068
2 White 30105
3 Light Bluish Gray 26024
4 Red 21602
5 Dark Bluish Gray 19948
6 Yellow 17088
7 Blue 12980
8 Light Gray 8632
9 Reddish Brown 6960
10 Tan 6664
# … with 124 more rows
# ℹ Use `print(n = ...)` to see more rows
Concluding Remarks
Notice that Black and White are the two most prominent colors.