Libraries
library(tidyverse)
library(forcats)
library(lubridate)
Data
parts <- readRDS("parts.rds")
answers <- readRDS("answers.rds")
colors <- readRDS("~/R Data/colors.rds")
inventories <- readRDS("~/R Data/inventories.rds")
inventory_parts <- readRDS("~/R Data/inventory_parts.rds")
parts <- readRDS("~/R Data/parts.rds")
part_categories <- readRDS("~/R Data/part_categories.rds")
question_tags <- readRDS("~/R Data/question_tags.rds")
questions <- readRDS("~/R Data/questions.rds")
sets <- readRDS("~/R Data/sets.rds")
tags <- readRDS("~/R Data/tags.rds")
themes <- readRDS("~/R Data/themes.rds")
You’ll work with the data throughout the course as you learn a total of six different joins! You’ll learn four mutating joins: inner join, left join, right join, and full join, and two filtering joins: semi join and anti join. In the final chapter, you’ll apply your new skills to Stack Overflow data, containing each of the almost 300,000 Stack Oveflow questions that are tagged with R, including information about their answers, the date they were asked, and their score. Get ready to take your dplyr skills to the next level!
Get started with your first joining verb: inner-join! You’ll learn to join tables together to answer questions about the LEGO dataset, which contains information across many tables about the sets, parts, themes, and colors of LEGOs over time.
The inner_join is the key to bring tables together. To use it, you need to provide the two tables that must be joined and the columns on which they should be joined.
## Add the correct verb, table, and joining column
parts %>%
inner_join(part_categories, by = c("part_cat_id" = "id"))
## # A tibble: 17,501 × 4
## part_num name.x part_cat_id name.y
## <chr> <chr> <dbl> <chr>
## 1 0901 Baseplate 16 x 30 with Set 080 Yellow … 1 Baseplates
## 2 0902 Baseplate 16 x 24 with Set 080 Small W… 1 Baseplates
## 3 0903 Baseplate 16 x 24 with Set 080 Red Hou… 1 Baseplates
## 4 0904 Baseplate 16 x 24 with Set 080 Large W… 1 Baseplates
## 5 1 Homemaker Bookcase 2 x 4 x 4 7 Containers
## 6 10016414 Sticker Sheet #1 for 41055-1 58 Stickers
## 7 10026stk01 Sticker for Set 10026 - (44942/4184185) 58 Stickers
## 8 10039 Pullback Motor 8 x 4 x 2/3 44 Mechanical
## 9 10048 Minifig Hair Tousled 65 Minifig Headw…
## 10 10049 Minifig Shield Broad with Spiked Botto… 27 Minifig Acces…
## # … with 17,491 more rows
## Use the suffix argument to replace .x and .y suffixes
parts %>%
inner_join(part_categories, by = c("part_cat_id" = "id"), suffix=c("_part", "_category"))
## # A tibble: 17,501 × 4
## part_num name_part part_cat_id name_category
## <chr> <chr> <dbl> <chr>
## 1 0901 Baseplate 16 x 30 with Set 080 Yello… 1 Baseplates
## 2 0902 Baseplate 16 x 24 with Set 080 Small… 1 Baseplates
## 3 0903 Baseplate 16 x 24 with Set 080 Red H… 1 Baseplates
## 4 0904 Baseplate 16 x 24 with Set 080 Large… 1 Baseplates
## 5 1 Homemaker Bookcase 2 x 4 x 4 7 Containers
## 6 10016414 Sticker Sheet #1 for 41055-1 58 Stickers
## 7 10026stk01 Sticker for Set 10026 - (44942/41841… 58 Stickers
## 8 10039 Pullback Motor 8 x 4 x 2/3 44 Mechanical
## 9 10048 Minifig Hair Tousled 65 Minifig Headwear
## 10 10049 Minifig Shield Broad with Spiked Bot… 27 Minifig Accesso…
## # … with 17,491 more rows
Let’s join these two tables together to observe how joining parts with inventory_parts increases the size of your table because of the one-to-many relationship that exists between these two tables.
## Combine the parts and inventory_parts tables
parts %>% inner_join(inventory_parts, by = "part_num")
## # A tibble: 258,958 × 6
## part_num name part_cat_id inventory_id color_id quantity
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 0901 Baseplate 16 x 30 with S… 1 1973 2 1
## 2 0902 Baseplate 16 x 24 with S… 1 1973 2 1
## 3 0903 Baseplate 16 x 24 with S… 1 1973 2 1
## 4 0904 Baseplate 16 x 24 with S… 1 1973 2 1
## 5 1 Homemaker Bookcase 2 x 4… 7 508 15 1
## 6 1 Homemaker Bookcase 2 x 4… 7 1158 15 2
## 7 1 Homemaker Bookcase 2 x 4… 7 6590 15 2
## 8 1 Homemaker Bookcase 2 x 4… 7 9679 15 2
## 9 1 Homemaker Bookcase 2 x 4… 7 12256 1 2
## 10 1 Homemaker Bookcase 2 x 4… 7 13356 15 1
## # … with 258,948 more rows
# Combine the parts and inventory_parts tables
inventory_parts %>% inner_join(parts, by = "part_num")
## # A tibble: 258,958 × 6
## inventory_id part_num color_id quantity name part_cat_id
## <dbl> <chr> <dbl> <dbl> <chr> <dbl>
## 1 21 3009 7 50 Brick 1 x 6 11
## 2 25 21019c00pat004pr1033 15 1 Legs and Hip… 61
## 3 25 24629pr0002 78 1 Minifig Head… 59
## 4 25 24634pr0001 5 1 Headwear Acc… 27
## 5 25 24782pr0001 5 1 Minifig Hipw… 27
## 6 25 88646 0 1 Tile Special… 15
## 7 25 973pr3314c01 5 1 Torso with 1… 60
## 8 26 14226c11 0 3 String with … 31
## 9 26 2340px2 15 1 Tail 4 x 1 x… 35
## 10 26 2340px3 15 1 Tail 4 x 1 x… 35
## # … with 258,948 more rows
We’ll now connect sets, a table that tells us about each LEGO kit, with inventories, a table that tells us the specific version of a given set, and finally to inventory_parts, a table which tells us how many of each part is available in each LEGO kit.
## Add inventories using an inner join, and add inventory_parts using an inner join
sets %>%
inner_join(inventories, by = "set_num") %>%
inner_join(inventory_parts, by = c("id" = "inventory_id"))
## # A tibble: 258,958 × 9
## set_num name year theme_id id version part_num color_id quantity
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 700.3-1 Medium Gift … 1949 365 24197 1 bdoor01 2 2
## 2 700.3-1 Medium Gift … 1949 365 24197 1 bdoor01 15 1
## 3 700.3-1 Medium Gift … 1949 365 24197 1 bdoor01 4 1
## 4 700.3-1 Medium Gift … 1949 365 24197 1 bslot02 15 6
## 5 700.3-1 Medium Gift … 1949 365 24197 1 bslot02 2 6
## 6 700.3-1 Medium Gift … 1949 365 24197 1 bslot02 4 6
## 7 700.3-1 Medium Gift … 1949 365 24197 1 bslot02 1 6
## 8 700.3-1 Medium Gift … 1949 365 24197 1 bslot02 14 6
## 9 700.3-1 Medium Gift … 1949 365 24197 1 bslot02a 15 6
## 10 700.3-1 Medium Gift … 1949 365 24197 1 bslot02a 2 6
## # … with 258,948 more rows
## Count the number of colors and sort
sets %>%
inner_join(inventories, by = "set_num") %>%
inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
inner_join(colors, by = c("color_id" = "id"), suffix = c("_set", "_color")) %>%
count(name_color, sort = TRUE)
## # A tibble: 134 × 2
## name_color n
## <chr> <int>
## 1 Black 48068
## 2 White 30105
## 3 Light Bluish Gray 26024
## 4 Red 21602
## 5 Dark Bluish Gray 19948
## 6 Yellow 17088
## 7 Blue 12980
## 8 Light Gray 8632
## 9 Reddish Brown 6960
## 10 Tan 6664
## # … with 124 more rows
Learn two more mutating joins, the left and right join, which are mirror images of each other! You’ll learn use cases for each type of join as you explore parts and colors of LEGO themes. Then, you’ll explore how to join tables to themselves to understand the hierarchy of LEGO themes in the data.
We’ll practice your ability to do this looking at two new sets: the Millennium Falcon and Star Destroyer sets.
inventory_parts_joined <- inventories %>%
inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
select(-id, -version) %>%
arrange(desc(quantity))
## Combine the star_destroyer and millennium_falcon tables
millennium_falcon <- inventory_parts_joined %>% filter(set_num == "7965-1")
star_destroyer <- inventory_parts_joined %>% filter(set_num == "75190-1")
millennium_falcon %>%
left_join(star_destroyer, by = c("part_num", "color_id"), suffix = c("_falcon", "_star_destroyer"))
## # A tibble: 263 × 6
## set_num_falcon part_num color_id quantity_falcon set_num_star_destroyer
## <chr> <chr> <dbl> <dbl> <chr>
## 1 7965-1 63868 71 62 <NA>
## 2 7965-1 3023 0 60 <NA>
## 3 7965-1 3021 72 46 75190-1
## 4 7965-1 2780 0 37 75190-1
## 5 7965-1 60478 72 36 <NA>
## 6 7965-1 6636 71 34 75190-1
## 7 7965-1 3009 71 28 75190-1
## 8 7965-1 3665 71 22 <NA>
## 9 7965-1 2412b 72 20 75190-1
## 10 7965-1 3010 71 19 <NA>
## # … with 253 more rows, and 1 more variable: quantity_star_destroyer <dbl>
In the video, you learned how to left join two LEGO sets. Now you’ll practice your ability to do this looking at two new sets: the Millennium Falcon and Star Destroyer sets.
# Aggregate Millennium Falcon for the total quantity in each part
millennium_falcon_colors <- millennium_falcon %>%
group_by(color_id) %>%
summarize(total_quantity = sum(quantity))
# Aggregate Star Destroyer for the total quantity in each part
star_destroyer_colors <- star_destroyer %>%
group_by(color_id) %>%
summarize(total_quantity = sum(quantity))
# Left join the Millennium Falcon colors to the Star Destroyer colors
millennium_falcon_colors %>%
left_join(star_destroyer_colors, by =c("color_id"), suffix =c("_falcon", "_star_destroyer"))
## # A tibble: 21 × 3
## color_id total_quantity_falcon total_quantity_star_destroyer
## <dbl> <dbl> <dbl>
## 1 0 201 336
## 2 1 15 23
## 3 4 17 53
## 4 14 3 4
## 5 15 15 17
## 6 19 95 12
## 7 28 3 16
## 8 33 5 NA
## 9 36 1 14
## 10 41 6 15
## # … with 11 more rows
Left joins are really great for testing your assumptions about a data set and ensuring your data has integrity.
# Join versions to sets and filter for where version is na
inventory_version_1 <- inventories %>% filter(version == 1)
sets %>%
left_join(inventory_version_1, by = "set_num") %>% filter(is.na(version))
## # A tibble: 1 × 6
## set_num name year theme_id id version
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 40198-1 Ludo game 2018 598 NA NA
# Count the part_cat_id, Right join part_categories
parts %>%
count(part_cat_id) %>%
right_join(part_categories, by = c("part_cat_id" = "id"))
## # A tibble: 64 × 3
## part_cat_id n name
## <dbl> <int> <chr>
## 1 1 135 Baseplates
## 2 3 303 Bricks Sloped
## 3 4 1900 Duplo, Quatro and Primo
## 4 5 107 Bricks Special
## 5 6 128 Bricks Wedged
## 6 7 97 Containers
## 7 8 24 Technic Bricks
## 8 9 167 Plates Special
## 9 11 490 Bricks
## 10 12 85 Technic Connectors
## # … with 54 more rows
## Filter for NA
parts %>%
count(part_cat_id) %>%
right_join(part_categories, by = c("part_cat_id" = "id")) %>%
filter(is.na(n))
## # A tibble: 1 × 3
## part_cat_id n name
## <dbl> <int> <chr>
## 1 66 NA Modulex
In both left and right joins, there is the opportunity for there to be NA values in the resulting table. Fortunately, the replace_na function can turn those NAs into meaningful values.
## Use replace_na to replace missing values in the n column
parts %>%
count(part_cat_id) %>%
right_join(part_categories, by = c("part_cat_id" = "id")) %>%
replace_na(list(n=0))
## # A tibble: 64 × 3
## part_cat_id n name
## <dbl> <dbl> <chr>
## 1 1 135 Baseplates
## 2 3 303 Bricks Sloped
## 3 4 1900 Duplo, Quatro and Primo
## 4 5 107 Bricks Special
## 5 6 128 Bricks Wedged
## 6 7 97 Containers
## 7 8 24 Technic Bricks
## 8 9 167 Plates Special
## 9 11 490 Bricks
## 10 12 85 Technic Connectors
## # … with 54 more rows
## Inner join the themes table, Filter for the "Harry Potter" parent name
themes %>%
inner_join(themes, by = c("id" = "parent_id"), suffix = c("_parent", "_child")) %>% filter(name_parent == "Harry Potter")
## # A tibble: 6 × 5
## id name_parent parent_id id_child name_child
## <dbl> <chr> <dbl> <dbl> <chr>
## 1 246 Harry Potter NA 247 Chamber of Secrets
## 2 246 Harry Potter NA 248 Goblet of Fire
## 3 246 Harry Potter NA 249 Order of the Phoenix
## 4 246 Harry Potter NA 250 Prisoner of Azkaban
## 5 246 Harry Potter NA 251 Sorcerer's Stone
## 6 246 Harry Potter NA 667 Fantastic Beasts
## Join themes to itself again to find the grandchild relationships
themes %>%
inner_join(themes, by = c("id" = "parent_id"), suffix = c("_parent", "_child")) %>%
inner_join(themes, by = c("id_child" = "parent_id"), suffix = c("_parent", "_grandchild"))
## # A tibble: 158 × 7
## id_parent name_parent parent_id id_child name_child id_grandchild name
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <chr>
## 1 1 Technic NA 5 Model 6 Airport
## 2 1 Technic NA 5 Model 7 Constructi…
## 3 1 Technic NA 5 Model 8 Farm
## 4 1 Technic NA 5 Model 9 Fire
## 5 1 Technic NA 5 Model 10 Harbor
## 6 1 Technic NA 5 Model 11 Off-Road
## 7 1 Technic NA 5 Model 12 Race
## 8 1 Technic NA 5 Model 13 Riding Cyc…
## 9 1 Technic NA 5 Model 14 Robot
## 10 1 Technic NA 5 Model 15 Traffic
## # … with 148 more rows
## Left join the themes table to its own children, Filter for themes that have no child themes
themes %>%
left_join(themes, by = c("id" = "parent_id"), suffix = c("_parent", "_child")) %>%
filter(is.na(name_child))
## # A tibble: 586 × 5
## id name_parent parent_id id_child name_child
## <dbl> <chr> <dbl> <dbl> <chr>
## 1 2 Arctic Technic 1 NA <NA>
## 2 3 Competition 1 NA <NA>
## 3 4 Expert Builder 1 NA <NA>
## 4 6 Airport 5 NA <NA>
## 5 7 Construction 5 NA <NA>
## 6 8 Farm 5 NA <NA>
## 7 9 Fire 5 NA <NA>
## 8 10 Harbor 5 NA <NA>
## 9 11 Off-Road 5 NA <NA>
## 10 12 Race 5 NA <NA>
## # … with 576 more rows
You will cover three more joining verbs: full-join, semi-join, and anti-join. You’ll then use these verbs to answer questions about the similarities and differences between a variety of LEGO sets.
# Start with inventory_parts_joined table, Combine with the sets table, Combine with the themes table
inventory_parts_joined %>%
inner_join(sets, by = "set_num") %>%
inner_join(themes, by = c("theme_id" = "id"), suffix = c("_set", "_theme"))
## # A tibble: 258,958 × 9
## set_num part_num color_id quantity name_set year theme_id name_theme
## <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 40179-1 3024 72 900 Personalised M… 2016 277 Mosaic
## 2 40179-1 3024 15 900 Personalised M… 2016 277 Mosaic
## 3 40179-1 3024 0 900 Personalised M… 2016 277 Mosaic
## 4 40179-1 3024 71 900 Personalised M… 2016 277 Mosaic
## 5 40179-1 3024 14 900 Personalised M… 2016 277 Mosaic
## 6 k34434-1 3024 15 810 Lego Mosaic Ti… 2003 277 Mosaic
## 7 21010-1 3023 320 771 Robie House 2011 252 Architect…
## 8 k34431-1 3024 0 720 Lego Mosaic Cat 2003 277 Mosaic
## 9 42083-1 2780 0 684 Bugatti Chiron 2018 5 Model
## 10 k34434-1 3024 0 540 Lego Mosaic Ti… 2003 277 Mosaic
## # … with 258,948 more rows, and 1 more variable: parent_id <dbl>
Let’s focus on the batwing dataset, and use our skills to determine which parts are in both the batwing and batmobile sets, and which sets are in one, but not the other. While answering these questions, we’ll also be determining whether or not the parts we’re looking at in both sets also have the same color in common.
batmobile <- inventory_parts_joined %>%
filter(set_num == "7784-1") %>%
select(-set_num)
batwing <- inventory_parts_joined %>%
filter(set_num == "70916-1") %>%
select(-set_num)
# Filter the batwing set for parts that are also in the batmobile set
batwing %>% semi_join(batmobile, by = c("part_num"))
## # A tibble: 126 × 3
## part_num color_id quantity
## <chr> <dbl> <dbl>
## 1 3023 0 22
## 2 3024 0 22
## 3 3623 0 20
## 4 2780 0 17
## 5 3666 0 16
## 6 3710 0 14
## 7 6141 4 12
## 8 2412b 71 10
## 9 6141 72 10
## 10 6558 1 9
## # … with 116 more rows
# Filter the batwing set for parts that aren't in the batmobile set
batwing %>% anti_join(batmobile, by = c("part_num"))
## # A tibble: 183 × 3
## part_num color_id quantity
## <chr> <dbl> <dbl>
## 1 11477 0 18
## 2 99207 71 18
## 3 22385 0 14
## 4 99563 0 13
## 5 10247 72 12
## 6 2877 72 12
## 7 61409 72 12
## 8 11153 0 10
## 9 98138 46 10
## 10 2419 72 9
## # … with 173 more rows
Each set included in the LEGO data has an associated version number. We want to understand the version we are looking at to learn more about the parts that are included. Before doing that, we should confirm that there aren’t any sets that are missing a particular version.
# Use inventory_parts to find colors included in at least one set
colors %>% semi_join(inventory_parts, by = c("id" = "color_id"))
## # A tibble: 134 × 3
## id name rgb
## <dbl> <chr> <chr>
## 1 -1 [Unknown] #0033B2
## 2 0 Black #05131D
## 3 1 Blue #0055BF
## 4 2 Green #237841
## 5 3 Dark Turquoise #008F9B
## 6 4 Red #C91A09
## 7 5 Dark Pink #C870A0
## 8 6 Brown #583927
## 9 7 Light Gray #9BA19D
## 10 8 Dark Gray #6D6E5C
## # … with 124 more rows
## Use filter() to extract version 1
version_1_inventories <- inventories %>%
filter(version == 1)
## Use anti_join() to find which set is missing a version 1
sets %>%
anti_join(version_1_inventories, by = "set_num")
## # A tibble: 1 × 4
## set_num name year theme_id
## <chr> <chr> <dbl> <dbl>
## 1 40198-1 Ludo game 2018 598
We’ll want to add a column so that we can understand the fractions of specific pieces that are part of each set, rather than looking at the numbers of pieces alone.
inventory_parts_themes <- inventories %>%
inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
arrange(desc(quantity)) %>%
select(-id, -version) %>%
inner_join(sets, by = "set_num") %>%
inner_join(themes, by = c("theme_id" = "id"), suffix = c("_set", "_theme"))
# Filter the inventory_parts_themes table for the Batman theme, Add a fraction column of the total divided by the sum of the total
batman_colors <- inventory_parts_themes %>%
filter(name_theme == "Batman") %>%
group_by(color_id) %>%
summarize(total = sum(quantity)) %>%
mutate(fraction = total / sum(total))
head(batman_colors)
## # A tibble: 6 × 3
## color_id total fraction
## <dbl> <dbl> <dbl>
## 1 0 2807 0.296
## 2 1 243 0.0256
## 3 2 158 0.0167
## 4 4 529 0.0558
## 5 5 1 0.000105
## 6 10 13 0.00137
# Filter and aggregate the Star Wars set data; add a fraction column
star_wars_colors <- inventory_parts_themes %>%
filter(name_theme == "Star Wars") %>%
group_by(color_id) %>%
summarize(total = sum(quantity)) %>%
mutate(fraction = total / sum(total))
head(star_wars_colors)
## # A tibble: 6 × 3
## color_id total fraction
## <dbl> <dbl> <dbl>
## 1 0 3258 0.207
## 2 1 410 0.0261
## 3 2 36 0.00229
## 4 3 25 0.00159
## 5 4 434 0.0276
## 6 6 40 0.00254
The data you aggregated in the last exercise has been preloaded for you as batman_colors and star_wars_colors. Prior to visualizing the data, you’ll want to combine these tables to be able to directly compare the themes’ colors.
batman_colors %>%
# Join the Batman and Star Wars colors
full_join(star_wars_colors, by = "color_id", suffix = c("_batman", "_star_wars")) %>%
# Replace NAs in the total_batman and total_star_wars columns
replace_na(list(total_batman=0, total_star_wars=0)) %>%
inner_join(colors, by = c("color_id" = "id"))
## # A tibble: 63 × 7
## color_id total_batman fraction_batman total_star_wars fraction_star_wa… name
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 0 2807 0.296 3258 0.207 Black
## 2 1 243 0.0256 410 0.0261 Blue
## 3 2 158 0.0167 36 0.00229 Green
## 4 4 529 0.0558 434 0.0276 Red
## 5 5 1 0.000105 0 NA Dark…
## 6 10 13 0.00137 6 0.000382 Brig…
## 7 14 426 0.0449 207 0.0132 Yell…
## 8 15 404 0.0426 1771 0.113 White
## 9 19 142 0.0150 1012 0.0644 Tan
## 10 25 36 0.00380 36 0.00229 Oran…
## # … with 53 more rows, and 1 more variable: rgb <chr>
batman_colors %>%
full_join(star_wars_colors, by = "color_id", suffix = c("_batman", "_star_wars")) %>%
replace_na(list(total_batman = 0, total_star_wars = 0)) %>%
inner_join(colors, by = c("color_id" = "id")) %>%
# Create the difference and total columns
mutate(difference = fraction_batman - fraction_star_wars,
total = total_batman+ total_star_wars) %>%
# Filter for totals greater than 200
filter(total >= 200)
## # A tibble: 16 × 9
## color_id total_batman fraction_batman total_star_wars fraction_star_wa… name
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 0 2807 0.296 3258 0.207 Black
## 2 1 243 0.0256 410 0.0261 Blue
## 3 4 529 0.0558 434 0.0276 Red
## 4 14 426 0.0449 207 0.0132 Yell…
## 5 15 404 0.0426 1771 0.113 White
## 6 19 142 0.0150 1012 0.0644 Tan
## 7 28 98 0.0103 183 0.0116 Dark…
## 8 36 86 0.00907 246 0.0156 Tran…
## 9 46 200 0.0211 39 0.00248 Tran…
## 10 70 297 0.0313 373 0.0237 Redd…
## 11 71 1148 0.121 3264 0.208 Ligh…
## 12 72 1453 0.153 2433 0.155 Dark…
## 13 84 278 0.0293 31 0.00197 Medi…
## 14 179 154 0.0162 232 0.0148 Flat…
## 15 378 22 0.00232 430 0.0273 Sand…
## 16 7 0 NA 209 0.0133 Ligh…
## # … with 3 more variables: rgb <chr>, difference <dbl>, total <dbl>
Put together all the types of join you learned in this course to analyze a new dataset: Stack Overflow questions, answers, and tags. This includes calculating and visualizing trends for some notable tags like dplyr and ggplot2. You’ll also master one more method for combining tables, the bind_rows verb, which stacks tables on top of each other.
Now we’ll join together questions with answers so we can measure the time between questions and answers.
questions %>%
# Inner join questions and answers with proper suffixes
inner_join(answers, by =c("id" = "question_id"), suffix=c("_question", "_answer")) %>%
# Subtract creation_date_question from creation_date_answer to create gap
mutate(gap = as.integer(creation_date_answer - creation_date_question))
## # A tibble: 380,643 × 7
## id creation_date_question score_question id_answer creation_date_answer
## <int> <date> <int> <int> <date>
## 1 22557677 2014-03-21 1 22560670 2014-03-21
## 2 22557707 2014-03-21 2 22558516 2014-03-21
## 3 22557707 2014-03-21 2 22558726 2014-03-21
## 4 22558084 2014-03-21 2 22558085 2014-03-21
## 5 22558084 2014-03-21 2 22606545 2014-03-24
## 6 22558084 2014-03-21 2 22610396 2014-03-24
## 7 22558084 2014-03-21 2 34374729 2015-12-19
## 8 22558395 2014-03-21 2 22559327 2014-03-21
## 9 22558395 2014-03-21 2 22560102 2014-03-21
## 10 22558395 2014-03-21 2 22560288 2014-03-21
## # … with 380,633 more rows, and 2 more variables: score_answer <int>, gap <int>
We can also determine how many questions actually yield answers. If we count the number of answers for each question, we can then join the answers counts with the questions table.
# Count and sort the question id column in the answers table
answer_counts <- answers %>%
count(question_id, sort = TRUE)
answer_counts
## # A tibble: 243,930 × 2
## question_id n
## <int> <int>
## 1 1295955 34
## 2 2547402 30
## 3 1358003 27
## 4 4090169 26
## 5 1535021 25
## 6 1189759 24
## 7 1815606 24
## 8 5963269 23
## 9 17200114 22
## 10 102056 21
## # … with 243,920 more rows
# Combine the answer_counts and questions tables
questions %>%
left_join(answer_counts, by = c("id" = "question_id")) %>%
# Replace the NAs in the n column
replace_na(list(n = 0))
## # A tibble: 294,735 × 4
## id creation_date score n
## <int> <date> <int> <dbl>
## 1 22557677 2014-03-21 1 1
## 2 22557707 2014-03-21 2 2
## 3 22558084 2014-03-21 2 4
## 4 22558395 2014-03-21 2 3
## 5 22558613 2014-03-21 0 1
## 6 22558677 2014-03-21 2 2
## 7 22558887 2014-03-21 8 1
## 8 22559180 2014-03-21 1 1
## 9 22559312 2014-03-21 0 1
## 10 22559322 2014-03-21 2 5
## # … with 294,725 more rows
To learn more about the questions and answers tables, you’ll want to use the question_tags table to understand the tags associated with each question that was asked, and each answer that was provided. You’ll be able to combine these tables using two inner joins on both the questions table and the answers table.
# Inner join the question_tags and tags tables with the questions table
questions %>%
inner_join(question_tags, by = c("id" = "question_id")) %>%
inner_join(tags, by = c("tag_id" = "id"))
## # A tibble: 497,153 × 5
## id creation_date score tag_id tag_name
## <int> <date> <int> <dbl> <chr>
## 1 22557677 2014-03-21 1 18 regex
## 2 22557677 2014-03-21 1 139 string
## 3 22557677 2014-03-21 1 16088 time-complexity
## 4 22557677 2014-03-21 1 1672 backreference
## 5 22558084 2014-03-21 2 6419 time-series
## 6 22558084 2014-03-21 2 92764 panel-data
## 7 22558395 2014-03-21 2 5569 function
## 8 22558395 2014-03-21 2 134 sorting
## 9 22558395 2014-03-21 2 9412 vectorization
## 10 22558395 2014-03-21 2 18621 operator-precedence
## # … with 497,143 more rows
# Inner join the question_tags and tags tables with the answers table
answers %>%
inner_join(question_tags, by = "question_id") %>%
inner_join(tags, by = c("tag_id" = "id"))
## # A tibble: 625,845 × 6
## id creation_date question_id score tag_id tag_name
## <int> <date> <int> <int> <dbl> <chr>
## 1 39143935 2016-08-25 39142481 0 4240 average
## 2 39143935 2016-08-25 39142481 0 5571 summary
## 3 39144014 2016-08-25 39024390 0 85748 shiny
## 4 39144014 2016-08-25 39024390 0 83308 r-markdown
## 5 39144014 2016-08-25 39024390 0 116736 htmlwidgets
## 6 39144252 2016-08-25 39096741 6 67746 rstudio
## 7 39144375 2016-08-25 39143885 5 105113 data.table
## 8 39144430 2016-08-25 39144077 0 276 variables
## 9 39144625 2016-08-25 39142728 1 46457 dataframe
## 10 39144625 2016-08-25 39142728 1 9047 subset
## # … with 625,835 more rows