Course Description
Often in data science, you’ll encounter fascinating data that is spread across multiple tables. This course will teach you the skills you’ll need to join multiple tables together to analyze them in combination. You’ll practice your skills using a fun dataset about LEGOs from the Rebrickable website. The dataset contains information about the sets, parts, themes, and colors of LEGOs, but is spread across many tables. You’ll work with the data throughout the course as you learn a total of six different joins! You’ll learn four mutating joins: inner join, left join, right join, and full join, and two filtering joins: semi join and anti join. In the final chapter, you’ll apply your new skills to Stack Overflow data, containing each of the almost 300,000 Stack Oveflow questions that are tagged with R, including information about their answers, the date they were asked, and their score. Get ready to take your dplyr skills to the next level!
Get started with your first joining verb: inner-join! You’ll learn to join tables together to answer questions about the LEGO dataset, which contains information across many tables about the sets, parts, themes, and colors of LEGOs over time.
You’ll be joining together the parts
and
part_categories
tables. You can first inspect them in the
console. To join these two tables together using the
inner_join
verb, what columns would you join from each
table?
Since you’ll be working with dplyr throughout the course, the package will be preloaded for you in each exercise.
c("id" = "part_num")
c("part_cat_id" = "id")
c("part_num" = "part_cat_id")
c("part_material_id" = "part_categories")
The inner_join
is the key to bring tables together. To
use it, you need to provide the two tables that must be joined and the
columns on which they should be joined.
In this exercise, you’ll join a list of LEGO parts, available as
parts
, with these parts’ corresponding categories,
available as part_categories
. For example, the part
Sticker Sheet 1 for Set 1650-1
is from the
Stickers
part category. You can join these tables to see
all parts’ categories!
suffix
argument to add
"_part"
and "_category"
suffixes to replace
the name.x
and name.y
fields.# edited/added
library(tidyverse)
parts <- readRDS('parts.rds')
part_categories <- readRDS('part_categories.rds')
# Add the correct verb, table, and joining column
parts %>%
inner_join(part_categories, by = c("part_cat_id" = "id"))
# Use the suffix argument to replace .x and .y suffixes
parts %>%
inner_join(part_categories, by = c("part_cat_id" = "id"), suffix = c("_part", "_category"))
The LEGO data has many tables that can be joined together. Often
times, some of the things you care about may be a few tables away (we’ll
get to that later in the course). For now, we know that
parts
is a list of all LEGO parts, and a new table,
inventory_parts
, has some additional information about
those parts, such as the color_id
of each part you would
find in a specific LEGO kit.
Let’s join these two tables together to observe how joining
parts
with inventory_parts
increases the size
of your table because of the one-to-many relationship that exists
between these two tables.
parts
and inventory_parts
tables by their part numbers using an inner join.# edited/added
inventory_parts = readRDS('inventory_parts.rds')
# Combine the parts and inventory_parts tables
parts %>%
inner_join(inventory_parts, by = "part_num")
An inner_join
works the same way with either table in
either position. The table that is specified first is arbitrary, since
you will end up with the same information in the resulting table either
way.
Let’s prove this by joining the same two tables from the last exercise in the opposite order!
inventory_parts
and parts
tables by their part numbers using an inner join.# Combine the parts and inventory_parts tables
inventory_parts %>%
inner_join(parts, by = "part_num")
You can string together multiple joins with inner_join
and the pipe (%>%
), both with which you are already very
familiar!
We’ll now connect sets
, a table that tells us about each
LEGO kit, with inventories
, a table that tells us the
specific version of a given set, and finally to
inventory_parts
, a table which tells us how many of each
part is available in each LEGO kit.
So if you were building a Batman LEGO set, sets
would
tell you the name of the set, inventories
would give you
IDs for each of the versions of the set, and
inventory_parts
would tell you how many of each part would
be in each version.
inventories
table with the
sets
table.inventory_parts
table to the table you
created in the previous join by the inventory IDs.# edited/added
sets <- readRDS('sets.rds')
inventories <- readRDS('inventories.rds')
sets %>%
# Add inventories using an inner join
inner_join(inventories, by = "set_num") %>%
# Add inventory_parts using an inner join
inner_join(inventory_parts, by = c("id" = "inventory_id"))
Now let’s join an additional table, colors
, which will
tell us the color of each part in each set, so that we can answer the
question, “what is the most common color of a LEGO piece?”
colors
table using the
color_id
column from the previous join and the
id
column from colors
; use the suffixes
"_set"
and "_color"
.name_color
column and sort the results so the
most prominent colors appear first.# edited/added
colors = readRDS('colors.rds')
# Add an inner join for the colors table
sets %>%
inner_join(inventories, by = "set_num") %>%
inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
inner_join(colors, by = c("color_id" = "id"), suffix = c("_set", "_color"))
# Count the number of colors and sort
sets %>%
inner_join(inventories, by = "set_num") %>%
inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
inner_join(colors, by = c("color_id" = "id"), suffix = c("_set", "_color")) %>%
count(name_color, sort = TRUE)
Learn two more mutating joins, the left and right join, which are mirror images of each other! You’ll learn use cases for each type of join as you explore parts and colors of LEGO themes. Then, you’ll explore how to join tables to themselves to understand the hierarchy of LEGO themes in the data.
In the video, you learned how to left join two LEGO sets. Now you’ll practice your ability to do this looking at two new sets: the Millennium Falcon and Star Destroyer sets. We’ve created these for you and they have been preloaded for you:
millennium_falcon <- inventory_parts_joined %>%
filter(set_num == "7965-1")
star_destroyer <- inventory_parts_joined %>%
filter(set_num == "75190-1")
star_destroyer
and
millennium_falcon
tables on the part_num
and
color_id
columns with the suffixes _falcon
and
_star_destroyer
.# edited/added
inventory_parts_joined <- sets %>%
inner_join(inventories, by = "set_num") %>%
inner_join(inventory_parts, by = c("id" = "inventory_id"))
millennium_falcon <- inventory_parts_joined %>%
filter(set_num == "7965-1")
star_destroyer <- inventory_parts_joined %>%
filter(set_num == "75190-1")
# Combine the star_destroyer and millennium_falcon tables
millennium_falcon %>%
left_join(star_destroyer, by = c("part_num", "color_id"), suffix = c("_falcon", "_star_destroyer"))
In the videos and the last exercise, you joined two sets based on their part and color. What if you joined the datasets by color alone? As with the last exercise, the Millennium Falcon and Star Destroyer sets have been created and preloaded for you:
millennium_falcon <- inventory_parts_joined %>%
filter(set_num == "7965-1")
star_destroyer <- inventory_parts_joined %>%
filter(set_num == "75190-1")
quantity
column by color_id
in the
Millennium Falcon dataset.quantity
column by color_id
in the Star Destroyer dataset.millennium_falcon_colors
and star_destroyer_colors
, using the color_id
column and the _falcon
and _star_destroyer
suffixes.# Aggregate Millennium Falcon for the total quantity in each part
millennium_falcon_colors <- millennium_falcon %>%
group_by(color_id) %>%
summarize(total_quantity = sum(quantity))
# Aggregate Millennium Falcon for the total quantity in each part
millennium_falcon_colors <- millennium_falcon %>%
group_by(color_id) %>%
summarize(total_quantity = sum(quantity))
# Aggregate Star Destroyer for the total quantity in each part
star_destroyer_colors <- star_destroyer %>%
group_by(color_id) %>%
summarize(total_quantity = sum(quantity))
# Aggregate Millennium Falcon for the total quantity in each part
millennium_falcon_colors <- millennium_falcon %>%
group_by(color_id) %>%
summarize(total_quantity = sum(quantity))
# Aggregate Star Destroyer for the total quantity in each part
star_destroyer_colors <- star_destroyer %>%
group_by(color_id) %>%
summarize(total_quantity = sum(quantity))
# Left join the Millennium Falcon colors to the Star Destroyer colors
millennium_falcon_colors %>%
left_join(star_destroyer_colors, by = "color_id", suffix = c("_falcon", "_star_destroyer"))
Left joins are really great for testing your assumptions about a data set and ensuring your data has integrity.
For example, the inventories
table has a
version
column, for when a LEGO kit gets some kind of
change or upgrade. It would be fair to assume that all sets
(which joins well with inventories
) would have at least a
version 1. But let’s test this assumption out in the following
exercise.
left_join
to join together sets
and
inventory_version_1
using their common column.filter
for where the version
column is
NA
using is.na
.inventory_version_1 <- inventories %>%
filter(version == 1)
# Join versions to sets
sets %>%
left_join(inventory_version_1, by = "set_num") %>%
# Filter for where version is na
filter(is.na(version))
Now that you’ve learned three different types of joins, you will be able to identify situations in which each join should be used.
You want to keep only observations that match perfectly between tables.
You want to keep all observations in the first table, including matching observations in the second table.
You want to keep all observations in the second table, including matching observations in the first table.
Sometimes you’ll want to do some processing before you do a join, and prioritize keeping the second (right) table’s rows instead. In this case, a right join is for you.
In this exercise, we’ll count the part_cat_id
from
parts
, before using a right_join
to join with
part_categories
. The reason we do this is because we don’t
only want to know the count of part_cat_id
in
parts
, but we also want to know if there are any
part_cat_id
s not present in
parts
.
count
verb to count each
part_cat_id
in the parts
table.right_join
to join part_categories
.
You’ll need to use the part_cat_id
from the count and the
id
column from part_categories
.filter
for where the column n
is NA.parts %>%
# Count the part_cat_id
count(part_cat_id) %>%
# Right join part_categories
right_join(part_categories, by = c("part_cat_id" = "id"))
parts %>%
count(part_cat_id) %>%
right_join(part_categories, by = c("part_cat_id" = "id")) %>%
# Filter for NA
filter(is.na(n))
In both left and right joins, there is the opportunity for there to
be NA values in the resulting table. Fortunately, the
replace_na
function can turn those NAs into meaningful
values.
In the last exercise, we saw that the n
column had NAs
after the right_join
. Let’s use the replace_na
column, which takes a list
of column names and the values
with which NAs should be replaced, to clean up our table.
replace_na
to replace NAs in the n
column with the value 0
.parts %>%
count(part_cat_id) %>%
right_join(part_categories, by = c("part_cat_id" = "id")) %>%
# Use replace_na to replace missing values in the n column
replace_na(list(n = 0))
Tables can be joined to themselves!
In the themes
table, which is available for you to
inspect in the console, you’ll notice there is both an id
column and a parent_id
column. Keeping that in mind, you
can join the themes
table to itself to determine the
parent-child relationships that exist for different themes.
In the videos, you saw themes joined to their own parents. In this
exercise, you’ll try a similar approach of joining themes to their own
children, which is similar but reversed. Let’s try this out to
discover what children the theme "Harry Potter"
has.
themes
to their own children, resulting in
the suffixes "_parent"
and "_child"
,
respectively."Harry Potter"
theme.# edited/added
themes <- readRDS('themes.rds')
themes %>%
# Inner join the themes table
inner_join(themes, by = c("id" = "parent_id"), suffix = c("_parent", "_child")) %>%
# Filter for the "Harry Potter" parent name
filter(name_parent == "Harry Potter")
We can go a step further than looking at themes and their children. Some themes actually have grandchildren: their children’s children.
Here, we can inner join themes
to a filtered version of
itself again to establish a connection between our last join’s
children and their children.
themes
again with
itself.
"_parent"
and
"_grandchild"
so the columns in the resulting table are
clear.by
argument to specify the correct columns
to join on. If you’re unsure of what columns to join on, it might help
to look at the result of the first join to get a feel for it.# Join themes to itself again to find the grandchild relationships
themes %>%
inner_join(themes, by = c("id" = "parent_id"), suffix = c("_parent", "_child")) %>%
inner_join(themes, by = c("id_child" = "parent_id"), suffix = c("_parent", "_grandchild"))
So far, you’ve been inner joining a table to itself in order to find
the children of themes like "Harry Potter"
or
"The Lord of the Rings"
.
But some themes might not have any children at all, which means they
won’t be included in the inner join. As you’ve learned in this chapter,
you can identify those with a left_join
and a
filter()
.
_parent
and _child
respectively.themes %>%
# Left join the themes table to its own children
left_join(themes, by = c("id" = "parent_id"), suffix = c("_parent", "_child")) %>%
# Filter for themes that have no child themes
filter(is.na(name_child))
In this chapter, you’ll cover three more joining verbs: full-join, semi-join, and anti-join. You’ll then use these verbs to answer questions about the similarities and differences between a variety of LEGO sets.
In the video, you compared two sets. Now, you’ll compare two themes, each of which is made up of many sets.
First, you’ll need to join in the themes
. Recall that
doing so requires going through the sets
first. You’ll use
the inventory_parts_joined
table from the video, which is
already available to you in the console.
inventory_parts_joined <- inventories %>%
inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
arrange(desc(quantity)) %>%
select(-id, -version)
inventory_parts_joined
and sets
tables.themes
table,
using the suffix argument to clarify which table each name
came from ("_set"
or "_theme"
).# edited/added
inventory_parts_joined <- inventories %>%
inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
arrange(desc(quantity)) %>%
select(-id, -version)
# Start with inventory_parts_joined table
inventory_parts_joined %>%
# Combine with the sets table
inner_join(sets, by = "set_num") %>%
# Combine with the themes table
inner_join(themes, by = c("theme_id" = "id"), suffix = c("_set", "_theme"))
Previously, you combined tables to compare themes. Before doing this comparison, you’ll want to aggregate the data to learn more about the pieces that are a part of each theme, as well as the colors of those pieces.
The table you created previously has been preloaded for you as
inventory_sets_themes
. It was filtered for each theme, and
the objects have been saved as batman
and
star_wars
.
inventory_sets_themes <- inventory_parts_joined %>%
inner_join(sets, by = "set_num") %>%
inner_join(themes, by = c("theme_id" = "id"), suffix = c("_set", "_theme"))
batman <- inventory_sets_themes %>%
filter(name_theme == "Batman")
star_wars <- inventory_sets_themes %>%
filter(name_theme == "Star Wars")
# edited/added
inventory_sets_themes <- inventory_parts_joined %>%
inner_join(sets, by = "set_num") %>%
inner_join(themes, by = c("theme_id" = "id"), suffix = c("_set", "_theme"))
batman <- inventory_sets_themes %>%
filter(name_theme == "Batman")
star_wars <- inventory_sets_themes %>%
filter(name_theme == "Star Wars")
# Count the part number and color id, weight by quantity
batman %>%
count(part_num, color_id, wt = quantity)
star_wars %>%
count(part_num, color_id, wt = quantity)
Now that you’ve got separate tables for the pieces in the
batman
and star_wars
themes, you’ll want to be
able to combine them to see any similarities or differences between the
two themes. The aggregating from the last exercise has been saved as
batman_parts
and star_wars_parts
, and is
preloaded for you.
batman_parts <- batman %>%
count(part_num, color_id, wt = quantity)
star_wars_parts <- star_wars %>%
count(part_num, color_id, wt = quantity)
star_wars_parts
table with the
batman_parts
table; use the suffix argument to include the
"_batman"
and "_star_wars"
suffixes.n_batman
and
n_star_wars
columns with 0s.# edited/added
batman_parts <- batman %>%
count(part_num, color_id, wt = quantity)
star_wars_parts <- star_wars %>%
count(part_num, color_id, wt = quantity)
batman_parts %>%
# Combine the star_wars_parts table
full_join(star_wars_parts, by = c("part_num", "color_id"), suffix = c("_batman", "_star_wars")) %>%
# Replace NAs with 0s in the n_batman and n_star_wars columns
replace_na(list(n_batman = 0, n_star_wars = 0))
The table you created in the last exercise includes the part number
of each piece, the color id, and the number of each piece in the Star
Wars and Batman themes. However, we have more information about each of
these parts that we can gain by combining this table with some of the
information we have in other tables. Before we compare the themes, let’s
ensure that we have enough information to make our findings more
interpretable. The table from the last exercise has been saved as
parts_joined
and is preloaded for you.
parts_joined <- batman_parts %>%
full_join(star_wars_parts, by = c("part_num", "color_id"), suffix = c("_batman", "_star_wars")) %>%
replace_na(list(n_batman = 0, n_star_wars = 0))
parts_joined
table in descending order.colors
table to the
parts_joined
table.parts
table to the previous join using an
inner join; add "_color"
and "_part"
suffixes
to specify whether or not the information came from the
colors
table or the parts
table.# edited/added
parts_joined <- batman_parts %>%
full_join(star_wars_parts, by = c("part_num", "color_id"), suffix = c("_batman", "_star_wars")) %>%
replace_na(list(n_batman = 0, n_star_wars = 0))
parts_joined %>%
# Sort the number of star wars pieces in descending order
arrange(desc(n_star_wars)) %>%
# Join the colors table to the parts_joined table
inner_join(colors, by = c("color_id" = "id")) %>%
# Join the parts table to the previous join
inner_join(parts, by = "part_num", suffix = c("_color", "_part"))
Earlier in the course, you distinguished between inner, left, and right joins. Since then, you’ve learned three new types of joins! Can you match these joins to the type of join described here?
Keep all observations from both tables.
Filter the first table for observations that match the second.
Filter the first table for observations that don’t match the second.
In the videos, you learned how to filter using the semi- and anti
join verbs to answer questions you have about your data. Let’s focus on
the batwing
dataset, and use our skills to determine which
parts are in both the batwing
and batmobile
sets, and which sets are in one, but not the other. While answering
these questions, we’ll also be determining whether or not the parts
we’re looking at in both sets also have the same color in common.
The batmobile
and batwing
datasets have
been preloaded for you.
batmobile <- inventory_parts_joined %>%
filter(set_num == "7784-1") %>%
select(-set_num)
batwing <- inventory_parts_joined %>%
filter(set_num == "70916-1") %>%
select(-set_num)
batwing
set for parts that are
also in the batmobile
, whether or not they
have the same color.batwing
set for parts that
aren’t also in the batmobile
, whether or
not they have the same color.# edited/added
batmobile <- inventory_parts_joined %>%
filter(set_num == "7784-1") %>%
select(-set_num)
batwing <- inventory_parts_joined %>%
filter(set_num == "70916-1") %>%
select(-set_num)
# Filter the batwing set for parts that are also in the batmobile set
batwing %>%
semi_join(batmobile, by = c("part_num"))
# Filter the batwing set for parts that aren't in the batmobile set
batwing %>%
anti_join(batmobile, by = c("part_num"))
Besides comparing two sets directly, you could also use a filtering
join like semi_join
to find out which colors ever appear in
any inventory part. Some of the colors could be optional, meaning they
aren’t included in any sets.
The inventory_parts
and colors
tables have
been preloaded for you.
inventory_parts
table to find the colors that
are included in at least one set.# Use inventory_parts to find colors included in at least one set
colors %>%
semi_join(inventory_parts, by = c("id" = "color_id"))
Each set included in the LEGO data has an associated version number. We want to understand the version we are looking at to learn more about the parts that are included. Before doing that, we should confirm that there aren’t any sets that are missing a particular version.
Let’s start by looking at the first version of each set to see if there are any sets that don’t include a first version.
filter()
to extract version
1
from the inventories
table; save the filter
to version_1_inventories
.anti_join
to combine
version_1_inventories
with sets
to determine
which set is missing a version 1.# Use filter() to extract version 1
version_1_inventories <- inventories %>%
filter(version == 1)
# Use anti_join() to find which set is missing a version 1
sets %>%
anti_join(version_1_inventories, by = "set_num")
To compare two individual sets, and the kinds of LEGO pieces that comprise them, we’ll need to aggregate the data into separate themes. Additionally, as we saw in the video, we’ll want to add a column so that we can understand the fractions of specific pieces that are part of each set, rather than looking at the numbers of pieces alone.
The inventory_parts_themes
table has been preloaded for
you.
inventory_parts_themes <- inventories %>%
inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
arrange(desc(quantity)) %>%
select(-id, -version) %>%
inner_join(sets, by = "set_num") %>%
inner_join(themes, by = c("theme_id" = "id"), suffix = c("_set", "_theme"))
"Batman"
theme to create the
batman_colors
object.fraction
column to batman_colors
that displays the total divided by the sum of the total."Star Wars"
set data to create the
star_wars_colors
object.fraction
column to star_wars_colors
to display the fraction of the total.# edited/added
inventory_parts_themes <- inventories %>%
inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
arrange(desc(quantity)) %>%
select(-id, -version) %>%
inner_join(sets, by = "set_num") %>%
inner_join(themes, by = c("theme_id" = "id"), suffix = c("_set", "_theme"))
batman_colors <- inventory_parts_themes %>%
# Filter the inventory_parts_themes table for the Batman theme
filter(name_theme == "Batman") %>%
group_by(color_id) %>%
summarize(total = sum(quantity)) %>%
# Add a fraction column of the total divided by the sum of the total
mutate(fraction = total / sum(total))
# Filter and aggregate the Star Wars set data; add a fraction column
star_wars_colors <- inventory_parts_themes %>%
filter(name_theme == "Star Wars") %>%
group_by(color_id) %>%
summarize(total = sum(quantity)) %>%
mutate(fraction = total / sum(total))
The data you aggregated in the last exercise has been preloaded for
you as batman_colors
and star_wars_colors
.
Prior to visualizing the data, you’ll want to combine these tables to be
able to directly compare the themes’ colors.
batman_colors <- inventory_parts_themes %>%
filter(name_theme == "Batman") %>%
group_by(color_id) %>%
summarize(total = sum(quantity)) %>%
mutate(fraction = total / sum(total))
star_wars_colors <- inventory_parts_themes %>%
filter(name_theme == "Star Wars") %>%
group_by(color_id) %>%
summarize(total = sum(quantity)) %>%
mutate(fraction = total / sum(total))
batman_colors
and
star_wars_colors
tables; be sure to include all
observations from both tables.total_batman
and
total_star_wars
columns.difference
column which is the difference between
fraction_batman
and fraction_star_wars
, and a
total
column, which is the sum of total_batman
and total_star_wars.
filter
to select observations where
total
is at least 200.batman_colors %>%
# Join the Batman and Star Wars colors
full_join(star_wars_colors, by = "color_id", suffix = c("_batman", "_star_wars")) %>%
# Replace NAs in the total_batman and total_star_wars columns
replace_na(list(total_batman = 0, total_star_wars = 0)) %>%
inner_join(colors, by = c("color_id" = "id"))
batman_colors %>%
full_join(star_wars_colors, by = "color_id", suffix = c("_batman", "_star_wars")) %>%
replace_na(list(total_batman = 0, total_star_wars = 0)) %>%
inner_join(colors, by = c("color_id" = "id")) %>%
# Create the difference and total columns
mutate(difference = fraction_batman - fraction_star_wars,
total = total_batman + total_star_wars) %>%
# Filter for totals greater than 200
filter(total >= 200)
In the last exercise, you created colors_joined
. Now
you’ll create a bar plot with one bar for each color
(name
), showing the difference in fractions.
Because factors and visualization are beyond the scope of this
course, we’ve done some processing for you: here is the code that
created the colors_joined
table that will be used in the
video.
colors_joined <- batman_colors %>%
full_join(star_wars_colors, by = "color_id", suffix = c("_batman", "_star_wars")) %>%
replace_na(list(total_batman = 0, total_star_wars = 0)) %>%
inner_join(colors, by = c("color_id" = "id")) %>%
mutate(difference = fraction_batman - fraction_star_wars,
total = total_batman + total_star_wars) %>%
filter(total >= 200) %>%
mutate(name = fct_reorder(name, difference))
colors_joined
table to
display the most prominent colors in the Batman and Star Wars themes,
with the bars colored by their name
.# edited/added
colors_joined <- batman_colors %>%
full_join(star_wars_colors, by = "color_id", suffix = c("_batman", "_star_wars")) %>%
replace_na(list(total_batman = 0, total_star_wars = 0)) %>%
inner_join(colors, by = c("color_id" = "id")) %>%
mutate(difference = fraction_batman - fraction_star_wars,
total = total_batman + total_star_wars) %>%
filter(total >= 200) %>%
drop_na() %>% # edited/added
mutate(name = fct_reorder(name, difference))
color_palette <- setNames(colors_joined$rgb, colors_joined$name)
# Create a bar plot using colors_joined and the name and difference columns
ggplot(colors_joined, aes(name, difference, fill = name)) +
geom_col() +
coord_flip() +
scale_fill_manual(values = color_palette, guide = "none") +
labs(y = "Difference: Batman - Star Wars")
Put together all the types of join you learned in this course to analyze a new dataset: Stack Overflow questions, answers, and tags. This includes calculating and visualizing trends for some notable tags like dplyr and ggplot2. You’ll also master one more method for combining tables, the bind_rows verb, which stacks tables on top of each other.
Now we’ll join together questions
with
answers
so we can measure the time between questions and
answers.
Make sure to explore the tables and columns in the console before
starting the exercise. Can you tell how are questions identified in the
questions
table? How can you identify which answer
corresponds to which question using the answers
table?
questions
and
answers
tables using the suffixes "_question"
and "_answer"
, respectively.creation_date_question
from
creation_date_answer
within the as.integer()
function to create the gap
column.# edited/added
answers <- readRDS("answers.rds")
questions %>%
# Inner join questions and answers with proper suffixes
inner_join(answers, by = c("id" = "question_id"), suffix = c("_question", "_answer")) %>%
# Subtract creation_date_question from creation_date_answer to create gap
mutate(gap = as.integer(creation_date_answer - creation_date_question))
We can also determine how many questions actually yield answers. If
we count the number of answers for each question, we can then join the
answers counts with the questions
table.
question_id
column in the
answers
table to create the answer_counts
table.questions
table with the
answer_counts
table and include all observations from the
questions
table.n
column with 0s.# Count and sort the question id column in the answers table
answer_counts <- answers %>%
count(question_id, sort = TRUE)
# Combine the answer_counts and questions tables
questions %>%
left_join(answer_counts, by = c("id" = "question_id")) %>%
# Replace the NAs in the n column
replace_na(list(n = 0))
The table you created in the last exercise has been preloaded for you
as tagged_answers
. You can use this table to determine, on
average, how many answers each questions gets.
tagged_answers <- question_answer_counts %>%
inner_join(question_tags, by = c("id" = "question_id")) %>%
inner_join(tags, by = c("tag_id" = "id"))
Some of the important variables from this table include:
n
, the number of answers for each question, and
tag_name
, the name of each tag associated with each
question.
Let’s use some of our favorite dplyr verbs to find out how many answers each question gets on average.
tagged_answers
table by
tag_name
.tagged_answers
to get the count of
questions
and the average_answers
.questions
column in descending
order.# edited/added
tagged_answers <- question_answer_counts %>%
inner_join(question_tags, by = c("id" = "question_id")) %>%
inner_join(tags, by = c("tag_id" = "id"))
tagged_answers %>%
# Aggregate by tag_name
group_by(tag_name) %>%
# Summarize questions and average_answers
summarize(questions = n(),
average_answers = mean(n)) %>%
# Sort the questions in descending order
arrange(desc(questions))
Congratulations on making it through the course!
To review, you learned about 6 different joins and how to use classic dplyr verbs to create meaningful insights from your joined data.
You learned about the 4 mutating joins: inner join, which keeps only observations which match exactly between two tables, left joins, which keep all observations from the first table in your joins, right joins, which keep all observations from the second table in your joins, and full joins, which keep all observations from both tables.
You also learned the 2 filtering joins; semi joins, which filter the first table for observations which also exist in the second table, and anti joins, which filter the first table for observations that do not exist in the second table.
Pretty neat, huh? I hope you enjoyed the course, and feel confident that you can use these new tools in your future analyses. And can’t wait for you to JOIN us in future lessons!