This data has one observation for each tag-year pair: the number of questions asked with that tag in that year, and the total number of questions asked that year. For instance, 54 questions were asked with the .htaccess tag in 2008, out of a total of 58390 questions that year.
Rather than just the counts, we’re probably interested in a proportion: the fraction of questions that year that have that tag. So let’s add that to the table.
# Use mutate() to add a column called fraction to by_tag_year, representing number divided by year_total. Name the new table by_tag_year_fraction.
by_tag_year_fraction <- by_tag_year %>%
  mutate(fraction = number / year_total)
# Print by_tag_year_fraction.
by_tag_year_fraction
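As a quick sanity check of what mutate() computes here, the sketch below rebuilds the .htaccess example on a tiny stand-in table. Only the 54 and 58390 figures come from the text above; the table name `toy_by_tag_year`, the second row, and the extra `percent` column are assumptions made for illustration.

```r
library(dplyr)

# Toy stand-in for by_tag_year. The first row uses the figures quoted above;
# the second row ("hypothetical-tag", 100 questions) is made up.
toy_by_tag_year <- data.frame(
  year = c(2008, 2008),
  tag = c(".htaccess", "hypothetical-tag"),
  number = c(54, 100),
  year_total = c(58390, 58390)
)

# mutate() evaluates its arguments in order, so percent can reuse fraction.
toy_fraction <- toy_by_tag_year %>%
  mutate(
    fraction = number / year_total,
    percent = 100 * fraction
  )

toy_fraction
```

The .htaccess row should come out at a little under 0.1 percent of that year's questions, which matches dividing 54 by 58390 by hand.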
So far we’ve been learning and using the R programming language. Wouldn’t we like to be sure it’s a good investment for the future? Has it been keeping pace with other languages, or have people been switching out of it?
Let’s look at whether the fraction of Stack Overflow questions that are about R has been increasing or decreasing over time.
# Use filter() to get only the observations from by_tag_year_fraction that represent R, saving them as r_over_time.
r_over_time <- by_tag_year_fraction %>% filter(tag == "r")
# Print r_over_time.
r_over_time
Rather than looking at the results in a table, we often want to create a visualization. Change over time is usually visualized with a line plot.
# Load the ggplot2 package.
library(ggplot2)
# Plot r_over_time with year on the x-axis and fraction on the y-axis. Add a geom_line() layer to the plot to create a line plot.
ggplot(data = r_over_time, aes(x = year, y = fraction)) +
  geom_line(aes(group = 1))
Based on that graph, it looks like R has been growing pretty fast in the last decade. Good thing we’re practicing it now!
Besides R, two other interesting tags are dplyr and ggplot2, which we’ve already used in this analysis. They both also have Stack Overflow tags!
Instead of just looking at R, let’s look at all three tags and their change over time. Is each of those tags increasing as a fraction of overall questions? Are any of them decreasing?
# Combine the tags "r", "dplyr" and "ggplot2" into a vector named selected_tags using c().
selected_tags <- c("r", "dplyr", "ggplot2")
# Use filter() on by_tag_year_fraction, along with the %in% operator, to get only the subset of tags in selected_tags. Name the new table selected_tags_over_time.
selected_tags_over_time <- by_tag_year_fraction %>%
  filter(tag %in% selected_tags)
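To see what the %in% operator does inside filter(), it may help to run it on a plain vector first. In this small sketch, `some_tags` is a made-up vector standing in for the tag column.

```r
selected_tags <- c("r", "dplyr", "ggplot2")

# Hypothetical tag values, standing in for the tag column of the table.
some_tags <- c("r", "python", "ggplot2", "java")

# %in% returns one TRUE/FALSE per element of the left-hand side:
# TRUE where that element appears anywhere in selected_tags.
keep <- some_tags %in% selected_tags
keep

# filter() keeps the rows where the condition is TRUE, much like this subset.
some_tags[keep]
```

Unlike `==`, which compares element by element, %in% tests membership, which is why it is the right choice when matching against a vector of several tags.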
# Visualize the popularity of these three tags with a line plot in ggplot2 (with year on the x-axis and fraction on the y-axis) using color to represent tag.
ggplot(data = selected_tags_over_time, aes(x = year, y = fraction)) +
  geom_line(aes(group = tag, color = tag))
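Since the fractions involved are small, one optional tweak is to label the y-axis as percentages. The sketch below assumes the scales package is available (it is installed alongside ggplot2) and uses a made-up table, `toy_over_time`, whose tags and fractions are purely illustrative and not taken from the Stack Overflow data.

```r
library(ggplot2)

# Made-up data for illustration only: two hypothetical tags over three years.
toy_over_time <- data.frame(
  year = rep(2008:2010, times = 2),
  tag = rep(c("tag_a", "tag_b"), each = 3),
  fraction = c(0.001, 0.002, 0.004, 0.010, 0.008, 0.007)
)

# Same line plot as before, with the y-axis labels formatted as percentages.
p <- ggplot(data = toy_over_time, aes(x = year, y = fraction)) +
  geom_line(aes(group = tag, color = tag)) +
  scale_y_continuous(labels = scales::percent)

p
```

This changes only the axis labels. The underlying fraction column is left as it was.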
We’ve looked at selected tags like R, ggplot2, and dplyr, and seen that they’re each growing. What tags might be shrinking? A good place to start is to plot the tags that we just saw were the most asked-about of all time, including JavaScript, Java and C#.
# Use the filter() verb to filter by_tag_year_fraction only for the tags in highest_tags, which are the six largest tags.
highest_tags <- head(sorted_tags$tag)
by_tag_subset <- by_tag_year_fraction %>% filter(tag %in% highest_tags)
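The reason head() yields exactly six tags is that its default is n = 6. A small sketch on a made-up table, `toy_sorted`, which is a hypothetical stand-in for sorted_tags and is assumed to be ordered from most to least asked-about:

```r
# Hypothetical stand-in for sorted_tags: eight placeholder tags,
# already sorted by a made-up total in descending order.
toy_sorted <- data.frame(
  tag = paste0("tag_", 1:8),
  tag_total = seq(800, 100, by = -100)
)

# With no n argument, head() returns the first six elements.
top_six <- head(toy_sorted$tag)
top_six

# Pass n explicitly to take a different number of tags.
top_three <- head(toy_sorted$tag, n = 3)
top_three
```

Because head() only takes the first elements, the result is "the largest tags" only if the table was sorted in descending order beforehand.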
# Create a line plot of the fraction of questions each of these tags made up over time, using color to represent the tag.
ggplot(data = by_tag_subset, aes(x = year, y = fraction)) +
  geom_line(aes(group = tag, color = tag))