Stack Overflow questions
We will analyze Stack Overflow questions, answers, and tags dataset.
This will include calculating and visualizing trends for some notable tags like dplyr and ggplot2.
library(readr)
library(tidyr)
Registered S3 method overwritten by 'dplyr':
method from
print.rowwise_df
library(dplyr)
Attaching package: 㤼㸱dplyr㤼㸲
The following objects are masked from 㤼㸱package:stats㤼㸲:
filter, lag
The following objects are masked from 㤼㸱package:base㤼㸲:
intersect, setdiff, setequal, union
questions <- read_rds(url("https://assets.datacamp.com/production/repositories/5284/datasets/89d5a716b4f41dbe4fcda1a7a1190f24f58f0e47/questions.rds"))
tags <- read_rds(url("https://assets.datacamp.com/production/repositories/5284/datasets/207c31b235786e73496fd7e58e416779911a9d98/tags.rds"))
question_tags <- read_rds(url("https://assets.datacamp.com/production/repositories/5284/datasets/966938d665c69bffd87393b345ea2837a94bab97/question_tags.rds"))
answers <- read_rds(url("https://assets.datacamp.com/production/repositories/5284/datasets/6cb9c039aa8326d98de37afefa32e1c458764638/answers.rds"))
head(questions)
head(tags)
head(question_tags)
head(answers)
Joining questions and answers
Finding gaps between questions and answers
Now we’ll join together questions with answers so we can measure the time between questions and answers.
questions %>%
# Inner join questions and answers with proper suffixes
inner_join(answers, c("id" = "question_id"), suffix = c("_question", "_answer")) %>%
# Subtract creation_date_question from creation_date_answer to create gap
mutate(gap = as.integer(creation_date_answer - creation_date_question))
Now we could use this information to identify how long it takes different questions to get answers.
Joining question and answer counts
We can also determine how many questions actually yield answers. If we count the number of answers for each question, we can then join the answers counts with the questions table.
# Count and sort the question id column in the answers table
answer_counts <- answers %>%
count(question_id, sort = TRUE)
# Combine the answer_counts and questions tables
question_answer_counts <- questions %>%
left_join(answer_counts, by = c("id" = "question_id"))%>%
# Replace the NAs in the n column
replace_na(list(n = 0))
head(question_answer_counts)
We can use this combined table to see which questions have the most answers, and which questions have no answers.
Average answers by question
We can use tagged_answers table to determine, on average, how many answers each questions gets.
Some of the important variables from this table include: n, the number of answers for each question, and tag_name, the name of each tag associated with each question.
tagged_answers %>%
# Aggregate by tag_name
group_by(tag_name) %>%
# Summarize questions and average_answers
summarize(questions = n(),
average_answers = mean(n)) %>%
# Sort the questions in descending order
arrange(desc(questions))
We can see if you post a question about ggplot2, on average you’ll get an answer.
The bind rows verb
Binding and counting posts with tags
First, we’ll want to combine these tables into a single table called posts_with_tags. Once the information is consolidated into a single table, we can add more information by creating a date variable using the lubridate package.
library(lubridate)
Attaching package: 㤼㸱lubridate㤼㸲
The following object is masked from 㤼㸱package:base㤼㸲:
date
# Combine the two tables into posts_with_tags
posts_with_tags <- bind_rows(questions_with_tags %>% mutate(type = "question"),
answers_with_tags %>% mutate(type = "answer"))
# Add a year column, then aggregate by type, year, and tag_name
by_type_year_tag <- posts_with_tags %>%
mutate(year = year(creation_date)) %>%
group_by(type, year, tag_name) %>%
count()
by_type_year_tag
---
title: "Joins on Stack Overflow Data"
output:
  html_notebook:
    toc: true
    toc_float: true
    toc_collapsed: true
    
toc_depth: 3
---
# Stack Overflow questions

We will analyze Stack Overflow questions, answers, and tags dataset.

This will include calculating and visualizing trends for some notable tags like dplyr and ggplot2.
```{r}
library(readr)
library(tidyr)
library(dplyr)
```

```{r}
questions <- read_rds(url("https://assets.datacamp.com/production/repositories/5284/datasets/89d5a716b4f41dbe4fcda1a7a1190f24f58f0e47/questions.rds"))
tags <- read_rds(url("https://assets.datacamp.com/production/repositories/5284/datasets/207c31b235786e73496fd7e58e416779911a9d98/tags.rds"))
question_tags <- read_rds(url("https://assets.datacamp.com/production/repositories/5284/datasets/966938d665c69bffd87393b345ea2837a94bab97/question_tags.rds"))
answers <- read_rds(url("https://assets.datacamp.com/production/repositories/5284/datasets/6cb9c039aa8326d98de37afefa32e1c458764638/answers.rds"))
```





```{r}
head(questions)
head(tags)
head(question_tags)
head(answers)
```
## Left-joining questions and tags

Three of the Stack Overflow survey datasets are questions, question_tags, and tags:

questions: an ID and the score, or how many times the question has been upvoted; the data only includes R-based questions
question_tags: a tag ID for each question and the question's id
tags: a tag id and the tag's name, which can be used to identify the subject of each question, such as ggplot2 or dplyr
In this exercise, we'll be stitching together these datasets and replacing NAs in important fields.

Note that we'll be using left_joins in this exercise to ensure we keep all questions, even those without a corresponding tag. However, since we know the questions data is all R data, we'll want to manually tag these as R questions with replace_na.
```{r}
# Join the questions and question_tags tables
questions_with_tags <- questions %>%
  left_join(question_tags, by = c("id" = "question_id")) %>%
  # Join the tags as well
  left_join(tags, by = c("tag_id" = "id")) %>%
  # Replace the NAs in the tag_name column
  replace_na(list(tag_name = "only-r"))

head(questions_with_tags)
```
We now have a dataset that we can analyze after all that joining.

## Comparing scores across tags

Let's do a quick bit of analysis on it! We'll use familiar dplyr verbs like group_by, summarize, arrange, and n to find out the average score of the most asked questions.
```{r}
questions_with_tags %>%
	# Group by tag_name
  group_by(tag_name) %>%
	# Get mean score and num_questions
	summarize(score = mean(score),
          	  num_questions = n()) %>%
	# Sort num_questions in descending order
	arrange(desc(num_questions))
```
It looks like questions with the R tag get a relatively low score, but questions with the loops tag are even lower.

### What tags never appear on R questions?

The tags table includes all Stack Overflow tags, but some have nothing to do with R. How could you filter for just the tags that never appear on an R question?
```{r}
# Using a join, filter for tags that are never on an R question
tags %>%
  anti_join(question_tags, c("id" = "tag_id")) %>%
  filter(tag_name != "only-r")
```
It looks like there are more than 40,000 tags that have never appeared along R!

# Joining questions and answers

## Finding gaps between questions and answers

Now we'll join together questions with answers so we can measure the time between questions and answers.
```{r}
questions %>%
	# Inner join questions and answers with proper suffixes
	inner_join(answers, c("id" = "question_id"), suffix = c("_question", "_answer")) %>%
	# Subtract creation_date_question from creation_date_answer to create gap
	mutate(gap = as.integer(creation_date_answer - creation_date_question))
```
Now we could use this information to identify how long it takes different questions to get answers.

## Joining question and answer counts

We can also determine how many questions actually yield answers. If we count the number of answers for each question, we can then join the answers counts with the questions table.
```{r}
# Count and sort the question id column in the answers table
answer_counts <- answers %>%
  count(question_id, sort = TRUE)

# Combine the answer_counts and questions tables
question_answer_counts <- questions %>%
	left_join(answer_counts, by = c("id" = "question_id"))%>%
	# Replace the NAs in the n column
  replace_na(list(n = 0))

head(question_answer_counts)
```
We can use this combined table to see which questions have the most answers, and which questions have no answers.

## Joining questions, answers, and tags

Let's build on the last exercise by adding the tags table to our previous joins. This will allow us to do a better job of identifying which R topics get the most traction on Stack Overflow.
```{r}
tagged_answers <- question_answer_counts %>%
	# Join the question_tags tables
	inner_join(question_tags, by = c("id" = "question_id")) %>%
	# Join the tags table
  inner_join(tags, by = c("tag_id" = "id"))
tagged_answers
```
Now we have a more holistic view of how questions are answered by each tag.

## Average answers by question

We can use tagged_answers table to determine, on average, how many answers each questions gets.

Some of the important variables from this table include: n, the number of answers for each question, and tag_name, the name of each tag associated with each question.
```{r}
tagged_answers %>%
	# Aggregate by tag_name
  group_by(tag_name) %>% 
	# Summarize questions and average_answers
    summarize(questions = n(),
              average_answers = mean(n)) %>%
	# Sort the questions in descending order
	arrange(desc(questions))
```
We can see if you post a question about ggplot2, on average you'll get an answer.

# The bind rows verb

## Joining questions and answers with tags

To learn more about the questions and answers table, we'll want to use the question_tags table to understand the tags associated with each question that was asked, and each answer that was provided. We'll be able to combine these tables using two inner joins on both the questions table and the answers table.

```{r}
# Inner join the question_tags and tags tables with the questions table
questions_with_tags <- questions %>%
  inner_join(question_tags, by = c("id" = "question_id")) %>%
  inner_join(tags, by = c("tag_id" = "id"))
questions_with_tags
# Inner join the question_tags and tags tables with the answers table
answers_with_tags <- answers %>%
  inner_join(question_tags, by = c("question_id" = "question_id")) %>%
  inner_join(tags, by = c("tag_id" = "id"))
answers_with_tags
```
Now we will be able to combine each of these individual tables into a single cohesive table to have a better understanding of the information we have about the questions, answers, and associated tags.

## Binding and counting posts with tags

First, we'll want to combine these tables into a single table called posts_with_tags. Once the information is consolidated into a single table, we can add more information by creating a date variable using the lubridate package.

```{r}
library(lubridate)
# Combine the two tables into posts_with_tags
posts_with_tags <- bind_rows(questions_with_tags %>% mutate(type = "question"),
                              answers_with_tags %>% mutate(type = "answer"))


# Add a year column, then aggregate by type, year, and tag_name
by_type_year_tag <- posts_with_tags %>%
  mutate(year = year(creation_date)) %>%
  group_by(type, year, tag_name) %>%
  count()
by_type_year_tag
```
## Visualizing questions and answers in tags

Let's create a plot to examine the information that the table contains about questions and answers for the dplyr and ggplot2 tags. 
```{r}
library(ggplot2)
# Filter for the dplyr and ggplot2 tag names 
by_type_year_tag_filtered <- by_type_year_tag %>%
  filter(tag_name == "dplyr" | tag_name == "ggplot2")

# Create a line plot faceted by the tag name 
ggplot(by_type_year_tag_filtered, aes(x = year, y = n, color = type)) +
  geom_line() +
  facet_wrap(~ tag_name)

```
Notice answers on dplyr questions are growing faster than dplyr questions themselves; meaning the average dplyr question has more answers than the average ggplot2 question.


