Stack Overflow questions
We will analyze a dataset of Stack Overflow questions, answers, and tags.
This will include calculating and visualizing trends for some notable tags like dplyr and ggplot2.
library(readr)
library(tidyr)
library(dplyr)
questions <- read_rds(url("https://assets.datacamp.com/production/repositories/5284/datasets/89d5a716b4f41dbe4fcda1a7a1190f24f58f0e47/questions.rds"))
tags <- read_rds(url("https://assets.datacamp.com/production/repositories/5284/datasets/207c31b235786e73496fd7e58e416779911a9d98/tags.rds"))
question_tags <- read_rds(url("https://assets.datacamp.com/production/repositories/5284/datasets/966938d665c69bffd87393b345ea2837a94bab97/question_tags.rds"))
answers <- read_rds(url("https://assets.datacamp.com/production/repositories/5284/datasets/6cb9c039aa8326d98de37afefa32e1c458764638/answers.rds"))
head(questions)
head(tags)
head(question_tags)
head(answers)
Joining questions and answers
Finding gaps between questions and answers
Now we’ll join together questions with answers so we can measure the time between questions and answers.
questions %>%
  # Inner join questions and answers with proper suffixes
  inner_join(answers, by = c("id" = "question_id"), suffix = c("_question", "_answer")) %>%
  # Subtract creation_date_question from creation_date_answer to create gap
  mutate(gap = as.integer(creation_date_answer - creation_date_question))
Now we could use this information to identify how long it takes different questions to get answers.
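As a minimal sketch of that idea, the joined table can be collapsed to one row per question holding the shortest gap, that is, the number of days until the first answer. The two small data frames below are hypothetical stand-ins for questions and answers, and the names questions_toy, answers_toy, and days_to_first_answer are illustrative only.

```r
library(dplyr)

# Hypothetical stand-ins for the questions and answers tables
questions_toy <- data.frame(
  id = c(1L, 2L, 3L),
  creation_date = as.Date(c("2020-01-01", "2020-01-05", "2020-01-10"))
)
answers_toy <- data.frame(
  id = c(10L, 11L, 12L),
  question_id = c(1L, 1L, 2L),
  creation_date = as.Date(c("2020-01-03", "2020-01-02", "2020-01-05"))
)

first_answer_gap <- questions_toy %>%
  # Same join as above: one row per (question, answer) pair
  inner_join(answers_toy, by = c("id" = "question_id"),
             suffix = c("_question", "_answer")) %>%
  # Gap in whole days between the question and each answer
  mutate(gap = as.integer(creation_date_answer - creation_date_question)) %>%
  # Keep the smallest gap for each question
  group_by(id) %>%
  summarize(days_to_first_answer = min(gap))

first_answer_gap
```

Because this uses an inner join, question 3, which has no answers in the toy data, does not appear in the result.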
Joining question and answer counts
We can also determine how many questions actually receive answers. If we count the number of answers for each question, we can then join the answer counts with the questions table.
# Count and sort the question id column in the answers table
answer_counts <- answers %>%
  count(question_id, sort = TRUE)
# Combine the answer_counts and questions tables
question_answer_counts <- questions %>%
  left_join(answer_counts, by = c("id" = "question_id")) %>%
  # Replace the NAs in the n column
  replace_na(list(n = 0))
head(question_answer_counts)
We can use this combined table to see which questions have the most answers, and which questions have no answers.
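A minimal sketch of both lookups, assuming a table shaped like question_answer_counts with an id column and an answer count n. The data frame question_answer_counts_toy and its values are hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for question_answer_counts
question_answer_counts_toy <- data.frame(
  id = c(1L, 2L, 3L, 4L),
  n  = c(3, 0, 1, 0)
)

# Question with the most answers
most_answered <- question_answer_counts_toy %>%
  arrange(desc(n)) %>%
  head(1)

# Questions that never received an answer
unanswered <- question_answer_counts_toy %>%
  filter(n == 0)

# Share of questions without any answer
share_unanswered <- mean(question_answer_counts_toy$n == 0)
```

The filter on n == 0 only works because the NAs were replaced with 0 after the left join; without that step the unanswered questions would have to be found with is.na(n).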
Average answers by question
We can use the tagged_answers table to determine how many answers each question gets on average, broken down by tag.
Some of the important variables from this table include: n, the number of answers for each question, and tag_name, the name of each tag associated with each question.
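The tagged_answers table is not constructed in this section. One plausible way to build it, sketched here as an assumption rather than the original construction, is to join question_answer_counts to question_tags and then to tags. The toy tables and the column names question_id, tag_id, id, and tag_name are assumed for illustration.

```r
library(dplyr)

# Hypothetical miniature versions of the three tables
question_answer_counts_toy <- data.frame(
  id = c(1L, 2L, 3L),
  n  = c(2, 0, 1)
)
question_tags_toy <- data.frame(
  question_id = c(1L, 1L, 2L, 3L),
  tag_id      = c(100L, 200L, 100L, 200L)
)
tags_toy <- data.frame(
  id       = c(100L, 200L),
  tag_name = c("dplyr", "ggplot2")
)

# Assumed construction: attach tag ids, then tag names, to each question
tagged_answers_toy <- question_answer_counts_toy %>%
  inner_join(question_tags_toy, by = c("id" = "question_id")) %>%
  inner_join(tags_toy, by = c("tag_id" = "id"))

# Same aggregation as in the text, applied to the toy table
tag_summary <- tagged_answers_toy %>%
  group_by(tag_name) %>%
  summarize(questions = n(),
            average_answers = mean(n))
```

A question with several tags appears once per tag in the joined table, so it contributes to the average of every tag it carries.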
tagged_answers %>%
  # Aggregate by tag_name
  group_by(tag_name) %>%
  # Summarize questions and average_answers
  summarize(questions = n(),
            average_answers = mean(n)) %>%
  # Sort the questions in descending order
  arrange(desc(questions))
The result shows that if you post a question tagged ggplot2, you can expect, on average, to get an answer.
The bind_rows verb
Binding and counting posts with tags
First, we’ll want to combine the questions_with_tags and answers_with_tags tables into a single table called posts_with_tags. Once the information is consolidated into a single table, we can add more information by creating a year variable from creation_date using the lubridate package.
library(lubridate)
# Combine the two tables into posts_with_tags
posts_with_tags <- bind_rows(questions_with_tags %>% mutate(type = "question"),
                             answers_with_tags %>% mutate(type = "answer"))
# Add a year column, then aggregate by type, year, and tag_name
by_type_year_tag <- posts_with_tags %>%
  mutate(year = year(creation_date)) %>%
  group_by(type, year, tag_name) %>%
  count()
by_type_year_tag
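To connect this table back to the goal of visualizing trends for dplyr and ggplot2, a minimal sketch follows. It assumes the ggplot2 package is installed and uses a hypothetical table, by_type_year_tag_toy, with the same four columns (type, year, tag_name, n); the counts are made up for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for by_type_year_tag: every combination of
# type, year, and tag, with placeholder counts
by_type_year_tag_toy <- expand.grid(
  type = c("question", "answer"),
  year = c(2018, 2019),
  tag_name = c("dplyr", "ggplot2", "regex"),
  stringsAsFactors = FALSE
)
by_type_year_tag_toy$n <- seq_len(nrow(by_type_year_tag_toy))

# Keep only the two tags of interest
by_type_year_tag_filtered <- by_type_year_tag_toy %>%
  filter(tag_name %in% c("dplyr", "ggplot2"))

# One line per tag, one panel per post type
trend_plot <- ggplot(by_type_year_tag_filtered,
                     aes(x = year, y = n, color = tag_name)) +
  geom_line() +
  facet_wrap(~ type)

trend_plot
```

Faceting by type keeps question and answer counts in separate panels, which makes it easier to compare the two tags within each kind of post.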
