Joining Data with dplyr

Course Description

Often in data science, you’ll encounter fascinating data that is spread across multiple tables. This course will teach you the skills you’ll need to join multiple tables together to analyze them in combination. You’ll practice your skills using a fun dataset about LEGOs from the Rebrickable website. The dataset contains information about the sets, parts, themes, and colors of LEGOs, but is spread across many tables. You’ll work with the data throughout the course as you learn a total of six different joins! You’ll learn four mutating joins: inner join, left join, right join, and full join, and two filtering joins: semi join and anti join. In the final chapter, you’ll apply your new skills to Stack Overflow data, containing each of the almost 300,000 Stack Oveflow questions that are tagged with R, including information about their answers, the date they were asked, and their score. Get ready to take your dplyr skills to the next level!

Joining Tables

Get started with your first joining verb: inner-join! You’ll learn to join tables together to answer questions about the LEGO dataset, which contains information across many tables about the sets, parts, themes, and colors of LEGOs over time.

Joining parts and part categories

The inner_join is the key to bring tables together. To use it, you need to provide the two tables that must be joined and the columns on which they should be joined.

In this exercise, you’ll join a list of LEGO parts, available as parts, with these parts’ corresponding categories, available as part_categories. For example, the part Sticker Sheet 1 for Set 1650-1 is from the Stickers part category. You can join these tables to see all parts’ categories!

library(dplyr)
library(tidyr)
library(ggplot2)

parts <- readRDS("G:/My Drive/Data Science/R/DataCamp/Data Analyst with R/05 Joining Data with dplyr/datasets/parts.rds")

part_categories <- readRDS("G:/My Drive/Data Science/R/DataCamp/Data Analyst with R/05 Joining Data with dplyr/datasets/part_categories.rds")

inventory_parts <- readRDS("G:/My Drive/Data Science/R/DataCamp/Data Analyst with R/05 Joining Data with dplyr/datasets/inventory_parts.rds")

colors <- readRDS("G:/My Drive/Data Science/R/DataCamp/Data Analyst with R/05 Joining Data with dplyr/datasets/colors.rds")

sets <- readRDS("G:/My Drive/Data Science/R/DataCamp/Data Analyst with R/05 Joining Data with dplyr/datasets/sets.rds")

inventories <- readRDS("G:/My Drive/Data Science/R/DataCamp/Data Analyst with R/05 Joining Data with dplyr/datasets/inventories.rds")

themes <- readRDS("G:/My Drive/Data Science/R/DataCamp/Data Analyst with R/05 Joining Data with dplyr/datasets/themes.rds")

question_tags <- readRDS("G:/My Drive/Data Science/R/DataCamp/Data Analyst with R/05 Joining Data with dplyr/datasets/question_tags.rds")

questions <- readRDS("G:/My Drive/Data Science/R/DataCamp/Data Analyst with R/05 Joining Data with dplyr/datasets/questions.rds")

tags <- readRDS("G:/My Drive/Data Science/R/DataCamp/Data Analyst with R/05 Joining Data with dplyr/datasets/tags.rds")

answers <- readRDS("G:/My Drive/Data Science/R/DataCamp/Data Analyst with R/05 Joining Data with dplyr/datasets/answers.rds")

# Add the correct verb, table, and joining column
# Use the suffix argument to replace .x and .y suffixes
parts %>% 
    inner_join(part_categories, by = c("part_cat_id" = "id"), suffix = c("_part", "_category"))

You can now see the part category associated with each part, and you used the suffix argument to make sure your table’s fields are clear.

Joining parts and inventories

The LEGO data has many tables that can be joined together. Often times, some of the things you care about may be a few tables away (we’ll get to that later in the course). For now, we know that parts is a list of all LEGO parts, and a new table, inventory_parts, has some additional information about those parts, such as the color_id of each part you would find in a specific LEGO kit.

Let’s join these two tables together to observe how joining parts with inventory_parts increases the size of your table because of the one-to-many relationship that exists between these two tables.

# Combine the parts and inventory_parts tables
parts %>%
inner_join(inventory_parts, "part_num")

Recall, this is an example of a one-to-many relationship. Notice that the table increased in the number of rows after the join.

Joining in either direction

An inner_join works the same way with either table in either position. The table that is specified first is arbitrary, since you will end up with the same information in the resulting table either way.

Let’s prove this by joining the same two tables from the last exercise in the opposite order!

# Combine the parts and inventory_parts tables
inventory_parts %>%
inner_join(parts, "part_num")

This is the same join as the last exercise, but the order of the tables is reversed. For an inner_join, either direction will yield a table that contains the same information! Note that the columns will appear in a different order depending on which table comes first.

Joining three tables

You can string together multiple joins with inner_join and the pipe (%>%), both with which you are already very familiar!

We’ll now connect sets, a table that tells us about each LEGO kit, with inventories, a table that tells us the specific version of a given set, and finally to inventory_parts, a table which tells us how many of each part is available in each LEGO kit.

So if you were building a Batman LEGO set, sets would tell you the name of the set, inventories would give you IDs for each of the versions of the set, and inventory_parts would tell you how many of each part would be in each version.

sets %>%
    # Add inventories using an inner join 
    inner_join(inventories, by = "set_num") %>%
    # Add inventory_parts using an inner join 
    inner_join(inventory_parts, by = c("id" = "inventory_id"))

What’s the most common color?

Now let’s join an additional table, colors, which will tell us the color of each part in each set, so that we can answer the question, “what is the most common color of a LEGO piece?”

# Count the number of colors and sort
sets %>%
    inner_join(inventories, by = "set_num") %>%
    inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
    inner_join(colors, by = c("color_id" = "id"), suffix = c("_set", "_color")) %>%
    count(name_color, sort = TRUE)

Notice that Black and White are the two most prominent colors.

2. Left and Right Joins

Learn two more mutating joins, the left and right join, which are mirror images of each other! You’ll learn use cases for each type of join as you explore parts and colors of LEGO themes. Then, you’ll explore how to join tables to themselves to understand the hierarchy of LEGO themes in the data.

Left joining two sets by part and color

You have learned how to left join two LEGO sets. Now you’ll practice your ability to do this looking at two new sets: the Millennium Falcon and Star Destroyer sets.

inventory_parts_joined <- inventories %>%
inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
select(-id, -version) %>%
arrange(desc(quantity))

millennium_falcon <- inventory_parts_joined %>%
  filter(set_num == "7965-1")
star_destroyer <- inventory_parts_joined %>%
  filter(set_num == "75190-1")

# Combine the star_destroyer and millennium_falcon tables
millennium_falcon %>%
left_join(star_destroyer, by = c("part_num", "color_id"), suffix = c("_falcon", "_star_destroyer"))

Using this table, and the part_num or color_id columns, you can determine the frequency that a specific part or piece appears in the Millennium Falcon and Star Destroyer sets!

Left joining two sets by color

In the last exercise, you joined two sets based on their part and color. What if you joined the datasets by color alone?

millennium_falcon <- inventory_parts_joined %>%
  filter(set_num == "7965-1")
star_destroyer <- inventory_parts_joined %>%
  filter(set_num == "75190-1")

# Aggregate Millennium Falcon for the total quantity in each part
millennium_falcon_colors <- millennium_falcon %>%
  group_by(color_id) %>%
  summarize(total_quantity = sum(quantity))

## `summarise()` ungrouping output (override with `.groups` argument)

millennium_falcon_colors

# Aggregate Star Destroyer for the total quantity in each part
star_destroyer_colors <- star_destroyer %>%
  group_by(color_id) %>%
  summarize(total_quantity = sum(quantity))

## `summarise()` ungrouping output (override with `.groups` argument)

# Left join the Millennium Falcon colors to the Star Destroyer colors
millennium_falcon_colors %>%
  left_join(star_destroyer_colors, by = "color_id", suffix = c("_falcon", "_star_destroyer"))

Sometimes, the level on which you’re comparing two sets- by piece or just by color- will require some thought and some pre-processing.

Finding an observation that doesn’t have a match

Left joins are really great for testing your assumptions about a data set and ensuring your data has integrity.

For example, the inventories table has a version column, for when a LEGO kit gets some kind of change or upgrade. It would be fair to assume that all sets (which joins well with inventories) would have at least a version 1. But let’s test this assumption out in the following exercise.

inventory_version_1 <- inventories %>%
  filter(version == 1)

# Join versions to sets
sets %>%
  left_join(inventory_version_1, by = "set_num") %>%
  # Filter for where version is na
  filter(is.na(version))

It looks like there are cases where a set does not have an original version.

Counting part colors

Sometimes you’ll want to do some processing before you do a join, and prioritize keeping the second (right) table’s rows instead. In this case, a right join is for you.

In the example below, we’ll count the part_cat_id from parts, before using a right_join to join with part_categories. The reason we do this is because we don’t only want to know the count of part_cat_id in parts, but we also want to know if there are any part_cat_ids not present in parts.

parts %>%
    # Count the part_cat_id
    count(part_cat_id) %>%
    # Right join part_categories
    right_join(part_categories, by = c("part_cat_id" = "id")) %>%
  # Filter for NA
    filter(is.na(n))

From this filter, you found an instance where a part category is present in one table, but missing from the other table. It’s important to understand which entries would be impacted by replace_na(), so that we know which entries we would be omitting by using that function.

Cleaning up your count

In both left and right joins, there is the opportunity for there to be NA values in the resulting table. Fortunately, the replace_na function can turn those NAs into meaningful values.

In the last exercise, we saw that the n column had NAs after the right_join. Let’s use the replace_na column, which takes a list of column names and the values with which NAs should be replaced, to clean up our table.

parts %>%
    count(part_cat_id) %>%
    right_join(part_categories, by = c("part_cat_id" = "id")) %>%
    # Use replace_na to replace missing values in the n column
    replace_na(list(n = 0))

Now you have replaced NAs with 0s for your analysis.

Joining themes to their children

Tables can be joined to themselves!

In the themes table, which is available for you to inspect in the console, you’ll notice there is both an id column and a parent_id column. Keeping that in mind, you can join the themes table to itself to determine the parent-child relationships that exist for different themes. Let’s try this out to discover what children the theme “Harry Potter” has.

themes %>% 
    # Inner join the themes table
    inner_join(themes, by = c("id" = "parent_id"), suffix = c("_parent", "_child")) %>%
    # Filter for the "Harry Potter" parent name 
    filter(name_parent == "Harry Potter")

It looks like the Harry Potter parent theme has a few children, which are named after the books and movies from the series.

Joining themes to their grandchildren

We can go a step further than looking at themes and their children. Some themes actually have grandchildren: their children’s children.

Here, we can inner join themes to a filtered version of itself again to establish a connection between our last join’s children and their children.

# Join themes to itself again to find the grandchild relationships
themes %>% 
  inner_join(themes, by = c("id" = "parent_id"), suffix = c("_parent", "_child")) %>%
  inner_join(themes, by = c("id_child" = "parent_id"), suffix = c("_parent", "_grandchild"))

It looks like there are 158 grandparent-grandchild relationships of LEGO themes. Now you’re a whiz at using self joins to learn about hierarchical relationships.

Left-joining a table to itself

So far, you’ve been inner joining a table to itself in order to find the children of themes like “Harry Potter” or “The Lord of the Rings”.

But some themes might not have any children at all, which means they won’t be included in the inner join. As you’ve learned in this chapter, you can identify those with a left_join and a filter().

themes %>% 
  # Left join the themes table to its own children
  left_join(themes, by = c("id" = "parent_id"), suffix = c("_parent", "_child")) %>%
  # Filter for themes that have no child themes
  filter(is.na(name_child))

From this table, we learned that there are 586 themes in total that do not have any child themes.

3. Full, Semi, and Anti Joins

In this chapter, you’ll cover three more joining verbs: full-join, semi-join, and anti-join. You’ll then use these verbs to answer questions about the similarities and differences between a variety of LEGO sets.

Differences between Batman and Star Wars

In this exercise you will compare two themes, each of which is made up of many sets.

First, you’ll need to join in the themes. Recall that doing so requires going through the sets first.

inventory_parts_joined <- inventories %>%
  inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
  arrange(desc(quantity)) %>%
  select(-id, -version)

# Start with inventory_parts_joined table
inventory_parts_joined %>%
# Combine with the sets table 
inner_join(sets, by = "set_num") %>%
# Combine with the themes table 
inner_join(themes, by = c("theme_id" = "id"), suffix = c("_set", "_theme"))

Aggregating each theme

Previously, you combined tables to compare themes. Before doing this comparison, you’ll want to aggregate the data to learn more about the pieces that are a part of each theme, as well as the colors of those pieces.

inventory_sets_themes <- inventory_parts_joined %>%
  inner_join(sets, by = "set_num") %>%
  inner_join(themes, by = c("theme_id" = "id"), suffix = c("_set", "_theme"))

batman <- inventory_sets_themes %>%
  filter(name_theme == "Batman")

star_wars <- inventory_sets_themes %>%
  filter(name_theme == "Star Wars")

# Count the part number and color id, weight by quantity
batman %>%
count(part_num, color_id, wt = quantity)

star_wars %>%
count(part_num, color_id, wt = quantity)

Now that your information about the pieces in each of these themes is more organized, you are closer to being able to compare them to learn more about the similarities and differences between these themes!

Full-joining Batman and Star Wars LEGO parts

Now that you’ve got separate tables for the pieces in the batman and star_wars themes, you’ll want to be able to combine them to see any similarities or differences between the two themes.

batman_parts <- batman %>%
  count(part_num, color_id, wt = quantity)

star_wars_parts <- star_wars %>%
  count(part_num, color_id, wt = quantity)

batman_parts %>%
  # Combine the star_wars_parts table 
full_join(star_wars_parts, by = c("part_num", "color_id"), suffix = c("_batman", "_star_wars")) %>%
  # Replace NAs with 0s in the n_batman and n_star_wars columns 
replace_na(list(n_batman = 0, n_star_wars = 0))

Now, you’ve created a comprehensive table that includes the part number, color id, and quantity of each piece in the Batman and Star Wars themes.

Comparing Batman and Star Wars LEGO parts

The table you created in the last exercise includes the part number of each piece, the color id, and the number of each piece in the Star Wars and Batman themes. However, we have more information about each of these parts that we can gain by combining this table with some of the information we have in other tables. Before we compare the themes, let’s ensure that we have enough information to make our findings more interpretable.

parts_joined <- batman_parts %>%
  full_join(star_wars_parts, by = c("part_num", "color_id"), suffix = c("_batman", "_star_wars")) %>%
  replace_na(list(n_batman = 0, n_star_wars = 0))

parts_joined %>%
  # Sort the number of star wars pieces in descending order 
  arrange(desc(n_star_wars)) %>%
  # Join the colors table to the parts_joined table
  inner_join(colors, by = c("color_id" = "id")) %>%
  # Join the parts table to the previous join 
  inner_join(parts, by = "part_num", suffix = c("_color", "_part"))

You can now see the pieces side by side from each of these themes. Since the pieces are sorted by number of Star Wars pieces in descending order, you can see that the most common Star Wars piece is Black and has the part number 2780. While there are 392 pieces of this part in the Star Wars theme, you can also see from the table that there are 104 pieces of the same part in the Batman theme.

Something within one set but not another

You have learned how to filter using the semi- and anti-join verbs to answer questions you have about your data. Let’s focus on the batwing dataset, and use our skills to determine which parts are in both the batwing and batmobile sets, and which sets are in one, but not the other. While answering these questions, we’ll also be determining whether or not the parts we’re looking at in both sets also have the same color in common.

batmobile <- inventory_parts_joined %>%
  filter(set_num == "7784-1") %>%
  select(-set_num)

batwing <- inventory_parts_joined %>%
  filter(set_num == "70916-1") %>%
  select(-set_num)

# Filter the batwing set for parts that are also in the batmobile set
batwing %>%
semi_join(batmobile, by = "part_num")

# Filter the batwing set for parts that aren't in the batmobile set
batwing %>%
anti_join(batmobile, by = "part_num")

Based on these joins, we now know that there are 126 parts in the batwing set that are also in the batmobile set, and 183 parts that are in the batwing set that aren’t in the batmobile set.

What colors are included in at least one set?

Besides comparing two sets directly, you could also use a filtering join like semi_join to find out which colors ever appear in any inventory part. Some of the colors could be optional, meaning they aren’t included in any sets.

# Use inventory_parts to find colors included in at least one set
colors %>%
semi_join(inventory_parts, by = c("id" = "color_id"))

Out of the 179 colors in the colors table, there are 134 colors that are included in at least one set.

Which set is missing version 1?

Each set included in the LEGO data has an associated version number. We want to understand the version we are looking at to learn more about the parts that are included. Before doing that, we should confirm that there aren’t any sets that are missing a particular version.

Let’s start by looking at the first version of each set to see if there are any sets that don’t include a first version.

# Use filter() to extract version 1 
version_1_inventories <- inventories %>%
filter(version == 1)

# Use anti_join() to find which set is missing a version 1
sets %>%
anti_join(version_1_inventories, by = "set_num")

This is likely a data quality issue, and anti_join is a great tool for finding problems like that.

Aggregating sets to look at their differences

To compare two individual sets, and the kinds of LEGO pieces that comprise them, we’ll need to aggregate the data into separate themes. Additionally, we’ll want to add a column so that we can understand the percentages of specific pieces that are part of each set, rather than looking at the numbers of pieces alone.

inventory_parts_themes <- inventories %>%
  inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
  arrange(desc(quantity)) %>%
  select(-id, -version) %>%
  inner_join(sets, by = "set_num") %>%
  inner_join(themes, by = c("theme_id" = "id"), suffix = c("_set", "_theme"))

batman_colors <- inventory_parts_themes %>%
  # Filter the inventory_parts_themes table for the Batman theme
  filter(name_theme == "Batman") %>%
  group_by(color_id) %>%
  summarize(total = sum(quantity)) %>%
  # Add a percent column of the total divided by the sum of the total 
  mutate(percent = total / sum(total))

## `summarise()` ungrouping output (override with `.groups` argument)

batman_colors

# Filter and aggregate the Star Wars set data; add a percent column
star_wars_colors <- inventory_parts_themes %>%
  filter(name_theme == "Star Wars") %>%
  group_by(color_id) %>%
  summarize(total = sum(quantity)) %>%
  mutate(percent = total / sum(total))

## `summarise()` ungrouping output (override with `.groups` argument)

star_wars_colors

In addition to being able to view the sets for Batman and Star Wars separately, adding the colum also allowed us to be able to look at the percentage differences between the sets, rather than only being able to compare the numbers of pieces. This will become more useful as we more forward in our analysis.

Combining sets

Prior to visualizing the data, you’ll want to combine these tables to be able to directly compare the themes’ colors.

# Join the Batman and Star Wars colors
batman_colors <- batman_colors %>%
  full_join(star_wars_colors, by = "color_id", suffix = c("_batman", "_star_wars"))

# Check to see if any colors are missing in any column
# One method is to use apply function to check each column for NAs
apply(batman_colors, 2, function(x) any(is.na(x)))

##          color_id      total_batman    percent_batman   total_star_wars 
##             FALSE              TRUE              TRUE              TRUE 
## percent_star_wars 
##              TRUE

# Another method is to check for the number of NAs each column has
colSums(is.na(batman_colors))

##          color_id      total_batman    percent_batman   total_star_wars 
##                 0                 6                 6                11 
## percent_star_wars 
##                11

# Both methods tell us that we have missing data in our columns
# Replace NAs in the total_batman and total_star_wars columns
batman_colors %>%
replace_na(list(total_batman = 0, total_star_wars = 0)) %>%
inner_join(colors, by = c("color_id" = "id")) %>%
# Create the difference and total columns
mutate(difference = percent_batman - percent_star_wars,
       total = total_batman + total_star_wars) %>%
# Filter for totals greater than 200
filter(total >= 200)

With this combined table that contains all the information from the batman_colors and star_wars_colors tables, you can now create an informative visualization to compare the colors of these sets.

Visualizing the difference: Batman and Star Wars

In the last exercise, you created colors_joined. Now you’ll create a bar plot with one bar for each color (name), showing the difference in percentages.

library(forcats)
colors_joined <- batman_colors %>%
  full_join(star_wars_colors, by = "color_id", suffix = c("_batman", "_star_wars")) %>%
  replace_na(list(total_batman = 0, total_star_wars = 0)) %>%
  inner_join(colors, by = c("color_id" = "id")) %>%
  mutate(difference = percent_batman - percent_star_wars,
         total = total_batman + total_star_wars) %>%
  filter(total >= 200) %>%
  mutate(name = fct_reorder(name, difference)) 

# We need to create our palette by using the setNames function
(test <- setNames(c(1, 2), c("A", "B")))

## A B 
## 1 2

# We will apply the setNames function to the two color columns
color_palette <- setNames(colors_joined$rgb, colors_joined$name)
color_palette

##             Black              Blue               Red            Yellow 
##         "#05131D"         "#0055BF"         "#C91A09"         "#F2CD37" 
##             White               Tan          Dark Tan         Trans-Red 
##         "#FFFFFF"         "#E4CD9E"         "#958A73"         "#C91A09" 
##      Trans-Yellow     Reddish Brown Light Bluish Gray  Dark Bluish Gray 
##         "#F5CD2F"         "#582A12"         "#A0A5A9"         "#6C6E68" 
## Medium Dark Flesh       Flat Silver        Sand Green        Light Gray 
##         "#CC702A"         "#898788"         "#A0BCAC"         "#9BA19D"

# Create a bar plot using colors_joined and the name and difference columns
ggplot(colors_joined, aes(x = name, y = difference, fill = name)) +
  geom_col() +
  coord_flip() +
  scale_fill_manual(values = color_palette, guide = FALSE) +
  labs(y = "Difference: Batman - Star Wars")

## Warning: Removed 1 rows containing missing values (position_stack).

As you can see from the plot, the Batman set has more black, yellow, and red, while the Star Wars set has more light bluish gray, white, and tan.

Case Study: Joins on Stack Overflow Data

Put together all the types of join you learned in this course to analyze a new dataset: Stack Overflow questions, answers, and tags. This includes calculating and visualizing trends for some notable tags like dplyr and ggplot2. You’ll also master one more method for combining tables, the bind_rows verb, which stacks tables on top of each other.

Left-joining questions and tags

Three of the Stack Overflow survey datasets are questions, question_tags, and tags:

questions: an ID and the score, or how many times the question has been upvoted; the data only includes R-based questions

question_tags: a tag ID for each question and the question’s id

tags: a tag id and the tag’s name, which can be used to identify the subject of each question, such as ggplot2 or dplyr

In this exercise, we’ll be stitching together these datasets and replacing NAs in important fields.

Note that we’ll be using left_joins in this exercise to ensure we keep all questions, even those without a corresponding tag.

# Join the questions and question_tags tables
# Join in the tags table
# Replace the NAs in the tag_name column
questions_with_tags <- questions %>%
  left_join(question_tags, by = c("id" = "question_id")) %>%
  left_join(tags, by = c("tag_id" = "id")) %>%
  replace_na(list(tag_name = "only-r"))

Comparing scores across tags

The complete dataset you created in the last exercise is available to you as questions_with_tags. Let’s do a quick bit of analysis on it! You’ll use familiar dplyr verbs like group_by, summarize, arrange, and n to find out the average score of the most asked questions.

questions_with_tags %>%
    # Group by tag_name
    group_by(tag_name) %>%
    # Get mean score and num_questions
    summarize(score = mean(score),
              num_questions = n()) %>%
    # Sort num_questions in descending order
    arrange(desc(num_questions))

## `summarise()` ungrouping output (override with `.groups` argument)

It looks like questions with the R tag get a relatively low score, but questions with the loops tag are even lower.

What tags never appear on R questions?

The tags table includes all Stack Overflow tags, but some have nothing to do with R. How could you filter for just the tags that never appear on an R question?

# Using a join, filter for tags that are never on an R question
tags %>%
anti_join(question_tags, by = c("id" = "tag_id"))

It looks like there are more than 40,000 tags that have never appeared along R!

Finding gaps between questions and answers

Now we’ll join together questions with answers so we can measure the time between questions and answers.

Make sure to explore the tables and columns in the console before starting the exercise. Can you tell how are questions identified in the questions table? How can you identify which answer corresponds to which question using the answers table?

questions %>%
    # Inner join questions and answers with proper suffixes
    inner_join(answers, by = c("id" = "question_id"), suffix = c("_question", "_answer")) %>%
    # Subtract creation_date_question from creation_date_answer to create gap
    mutate(gap = as.integer(creation_date_answer - creation_date_question))

Now we could use this information to identify how long it takes different questions to get answers.

Joining question and answer counts

We can also determine how many questions actually yield answers. If we count the number of answers for each question, we can then join the answers counts with the questions table.

# Count and sort the question id column in the answers table
answer_counts <- answers %>%
count(question_id, sort = TRUE)

# Combine the answer_counts and questions tables
questions %>%
    left_join(answer_counts, by = c("id" = "question_id")) %>%
    # Replace the NAs in the n column
    replace_na(list(n = 0))

You can use this combined table to see which questions have the most answers, and which questions have no answers.

Joining questions, answers, and tags

Let’s build on the last exercise by adding the tags table to our previous joins. This will allow us to do a better job of identifying which R topics get the most traction on Stack Overflow.

answer_counts <- answers %>%
    count(question_id, sort = TRUE)

question_answer_counts <- questions %>%
    left_join(answer_counts, by = c("id" = "question_id")) %>%
    replace_na(list(n = 0))

question_answer_counts %>%
    # Join the question_tags tables
    inner_join(question_tags, by = c("id" = "question_id")) %>%
    # Join the tags table
    inner_join(tags, by = c("tag_id" = "id"))

head(question_answer_counts)

Average answers by question

The table you created in the last exercise is available to you as tagged_answers. You can use this table to determine, on average, how many answers each questions gets.

tagged_answers <- question_answer_counts %>%
    inner_join(question_tags, by = c("id" = "question_id")) %>%
    inner_join(tags, by = c("tag_id" = "id"))

tagged_answers %>%
    # Aggregate by tag_name
    group_by(tag_name) %>%
    # Summarize questions and average_answers
    summarize(questions = n(),
              average_answers = mean(n)) %>%
    # Sort the questions in descending order
    arrange(desc(questions))

## `summarise()` ungrouping output (override with `.groups` argument)

You can see if you post a question about ggplot2, on average you’ll get an answer.

Joining questions and answers with tags

To learn more about the questions and answers table, you’ll want to use the question_tags table to understand the tags associated with each question that was asked, and each answer that was provided. You’ll be able to combine these tables using two inner joins on both the questions table and the answers table.

# Inner join the question_tags and tags tables with the questions table
questions %>%
  inner_join(question_tags, by = c("id" = "question_id")) %>%
  inner_join(tags, by = c("tag_id" = "id"))

# Inner join the question_tags and tags tables with the answers table
answers %>%
  inner_join(question_tags, by = c("question_id" = "question_id")) %>%
  inner_join(tags, by = c("tag_id" = "id"))

Now we will be able to combine each of these individual tables into a single cohesive table to have a better understanding of the information we have about the questions, answers, and associated tags.

Binding and counting posts with tags

You have the two tables from the previous exercise as questions_with_tags and answers_with_tags. First, you’ll want to combine these tables into a single table called posts_with_tags. Once the information is consolidated into a single table, you can add more information by creating a date variable using the lubridate package.

library(lubridate)

questions_with_tags <- questions %>%
  inner_join(question_tags, by = c("id" = "question_id")) %>%
  inner_join(tags, by = c("tag_id" = "id"))

answers_with_tags <- answers %>%
  inner_join(question_tags, by = "question_id") %>%
  inner_join(tags, by = c("tag_id" = "id"))

# Combine the two tables into posts_with_tags
posts_with_tags <- bind_rows(questions_with_tags %>% mutate(type = "question"), answers_with_tags %>% mutate(type = "answer"))


# Add a year column, then aggregate by type, year, and tag_name
posts_with_tags %>%
mutate(year = year(creation_date)) %>%
count(type, year, tag_name)

Now you have a single table that displays the type, year, tag_name, and count as n.

Visualizing questions and answers in tags

In the last exercise, you modified the posts_with_tags table to add a year column, and aggregated by type, year, and tag_name. The modified table is available as by_type_year_tag, and has one observation for each type (question/answer), year, and tag. Let’s create a plot to examine the information that the table contains about questions and answers for the dplyr and ggplot2 tags.

by_type_year_tag <- posts_with_tags %>%
  mutate(year = year(creation_date)) %>%
  count(type, year, tag_name)


# Filter for the dplyr and ggplot2 tag names 
by_type_year_tag_filtered <- by_type_year_tag %>%
  filter(tag_name %in% c("dplyr", "ggplot2"))

# Create a line plot faceted by the tag name 
ggplot(by_type_year_tag_filtered, aes(year, n, color = type)) +
  geom_line() +
  facet_wrap(~ tag_name)