This data has one observation for each pair of a tag and a year, showing
the number of questions asked in that tag in that year and the total
number of questions asked in that year. For instance, there were 54
questions asked about the .htaccess tag in 2008, out of a
total of 58390 questions in that year.
Rather than raw counts, we’re probably more interested in a proportion: the fraction of that year’s questions that carry the tag. So let’s add that to the table.
# Add fraction column
by_tag_year_fraction <- by_tag_year %>%
  mutate(fraction = number / year_total)
# Print the new table
by_tag_year_fraction
## # A tibble: 40,518 × 5
##     year tag           number year_total fraction
##    <dbl> <chr>          <dbl>      <dbl> <dbl>
##  1  2008 .htaccess         54      58390 0.000925
##  2  2008 .net            5910      58390 0.101
##  3  2008 .net-2.0         289      58390 0.00495
##  4  2008 .net-3.5         319      58390 0.00546
##  5  2008 .net-4.0           6      58390 0.000103
##  6  2008 .net-assembly      3      58390 0.0000514
##  7  2008 .net-core          1      58390 0.0000171
##  8  2008 2d                42      58390 0.000719
##  9  2008 32-bit            19      58390 0.000325
## 10  2008 32bit-64bit        4      58390 0.0000685
## # … with 40,508 more rows
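As a quick sanity check on the new column, the same `mutate()` step can be reproduced on a few rows copied from the printed output. This is only a sketch: `toy_by_tag_year` and the extra `percent` column are names introduced here for illustration and are not part of the original analysis.

```r
library(dplyr)

# Toy data: three rows copied from the printed table above
toy_by_tag_year <- tibble(
  year       = c(2008, 2008, 2008),
  tag        = c(".htaccess", ".net", "2d"),
  number     = c(54, 5910, 42),
  year_total = c(58390, 58390, 58390)
)

# Same transformation as in the analysis, plus a hypothetical
# `percent` column that expresses the fraction on a 0-100 scale
toy_fraction <- toy_by_tag_year %>%
  mutate(
    fraction = number / year_total,
    percent  = 100 * fraction
  )

toy_fraction
```

The `fraction` values should agree with the first, second and eighth rows of the full table, up to the rounding tibble applies when printing.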
So far we’ve been learning and using the R programming language. Wouldn’t we like to be sure it’s a good investment for the future? Has it been keeping pace with other languages, or have people been switching away from it?
Let’s look at whether the fraction of Stack Overflow questions that are about R has been increasing or decreasing over time.
# Filter for R tags
r_over_time <- by_tag_year_fraction %>% filter(tag == "r")
# Print the new table
r_over_time
## # A tibble: 11 × 5
##     year tag   number year_total fraction
##    <dbl> <chr>  <dbl>      <dbl> <dbl>
##  1  2008 r          8      58390 0.000137
##  2  2009 r        524     343868 0.00152
##  3  2010 r       2270     694391 0.00327
##  4  2011 r       5845    1200551 0.00487
##  5  2012 r      12221    1645404 0.00743
##  6  2013 r      22329    2060473 0.0108
##  7  2014 r      31011    2164701 0.0143
##  8  2015 r      40844    2219527 0.0184
##  9  2016 r      44611    2226072 0.0200
## 10  2017 r      54415    2305207 0.0236
## 11  2018 r      28938    1085170 0.0267
Rather than looking at the results in a table, we often want to create a visualization. Change over time is usually visualized with a line plot.
# Load ggplot2
library(ggplot2)
# Create a line plot of fraction over time
r_over_time %>%
  ggplot(aes(x = year, y = fraction)) +
  geom_line()
Based on that graph, R’s share of questions has grown steadily over the last decade, from about 0.01% of questions in 2008 to about 2.7% in 2018. Good thing we’re practicing it now!
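Since the quantity of interest is a share of all questions, the line plot can be easier to read with the y-axis labelled in percent and the x-axis showing whole years. The following is a minimal sketch, not the original plot: `r_toy` is a hypothetical data frame rebuilt from the `fraction` values printed above, and the axis titles are our own wording.

```r
library(ggplot2)

# R's yearly fractions, copied from the printed r_over_time table
r_toy <- data.frame(
  year     = 2008:2018,
  fraction = c(0.000137, 0.00152, 0.00327, 0.00487, 0.00743, 0.0108,
               0.0143, 0.0184, 0.0200, 0.0236, 0.0267)
)

# Same line plot, with percent labels on y and whole-year breaks on x
p <- ggplot(r_toy, aes(x = year, y = fraction)) +
  geom_line() +
  scale_x_continuous(breaks = seq(2008, 2018, by = 2)) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Year", y = "Share of questions tagged r")

p
```

`scales::percent` comes from the scales package, which is installed alongside ggplot2; it only changes how the axis is labelled, not the underlying values.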
Besides R, two other interesting tags are dplyr and ggplot2, which we’ve already used in this analysis. They both also have Stack Overflow tags!
Instead of just looking at R, let’s look at all three tags and their change over time. Is each of those tags increasing as a fraction of overall questions? Are any of them decreasing?
# A vector of selected tags
selected_tags <- c("r", "dplyr", "ggplot2")
# Filter for those tags
selected_tags_over_time <- by_tag_year_fraction %>%
  filter(tag %in% selected_tags)
# Plot tags over time on a line plot using color to represent tag
selected_tags_over_time %>%
  ggplot(aes(x = year, y = fraction, color = tag)) +
  geom_line()
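The filter above relies on `%in%` rather than `==`, and the difference is easy to miss. As a small illustration in base R (the `tags` vector here is a made-up example, not taken from the data):

```r
selected_tags <- c("r", "dplyr", "ggplot2")

# Hypothetical tag column with some tags we want and some we don't
tags <- c("r", "python", "dplyr", "ggplot2", "java", "r")

# %in% asks, for each element of `tags`, whether it appears
# anywhere in `selected_tags`
keep_in <- tags %in% selected_tags
keep_in
# TRUE FALSE TRUE TRUE FALSE TRUE

# == compares position by position, recycling `selected_tags`
# ("r", "dplyr", "ggplot2", "r", "dplyr", "ggplot2"), so most
# matches are silently missed
keep_eq <- tags == selected_tags
keep_eq
# TRUE FALSE FALSE FALSE FALSE FALSE
```

Because the longer vector's length is a multiple of the shorter one's, R does not even warn about the recycling here, which is why `%in%` is the safer choice when filtering against a set of values.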
We’ve looked at selected tags like R, ggplot2, and dplyr, and seen that each of them is growing. Which tags might be shrinking? A good place to start is the tags we just saw were the most asked about of all time, including JavaScript, Java and C#.
# Get the six largest tags (head() returns the first six by default)
highest_tags <- head(sorted_tags$tag, 6)
# Filter for the six largest tags
by_tag_subset <- by_tag_year_fraction %>%
  filter(tag %in% highest_tags)
# Plot tags over time on a line plot using color to represent tag
by_tag_subset %>%
  ggplot(aes(x = year, y = fraction, color = tag)) +
  geom_line()
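The code above depends on `sorted_tags`, which was created earlier and is not shown in this part of the analysis. For readers following along without it, here is a rough sketch of how such a table could be built and used, assuming it holds one row per tag sorted by total question count. The names `toy_by_tag_year`, `toy_sorted_tags`, `tag_total` and `toy_highest_tags` are hypothetical, and the toy rows are a handful of counts copied from the printed tables, so the totals are not the real all-time totals.

```r
library(dplyr)

# Toy stand-in for by_tag_year: a few rows copied from the printed tables
toy_by_tag_year <- tibble(
  year   = c(2008, 2008, 2008, 2009),
  tag    = c(".net", ".htaccess", "r", "r"),
  number = c(5910, 54, 8, 524)
)

# Assumed construction: total questions per tag, largest first
toy_sorted_tags <- toy_by_tag_year %>%
  group_by(tag) %>%
  summarize(tag_total = sum(number)) %>%
  arrange(desc(tag_total))

# head() keeps the first n elements; n = 2 keeps this toy example small
toy_highest_tags <- head(toy_sorted_tags$tag, n = 2)
toy_highest_tags
```

With these toy rows, `.net` comes first and `r` second, since R's two rows are summed before sorting.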