1. Data on tags over time

How can we tell which programming languages and technologies are used by the most people? And how can we tell which languages are growing and which are shrinking, so that we know which are most worth investing time in?

One excellent source of data is Stack Overflow, a programming question and answer site with more than 16 million questions on programming topics. By measuring the number of questions about each technology, we can get an approximate sense of how many people are using it. We’re going to use open data from the Stack Exchange Data Explorer to examine how the relative popularity of languages like R, Python, Java and JavaScript has changed over time.

Each Stack Overflow question has one or more tags, each of which marks the question with a topic or technology. For instance, there are tags for languages like R or Python, and for packages like ggplot2 or pandas.

We’ll be working with a dataset with one observation for each tag in each year. The dataset includes both the number of questions asked in that tag in that year, and the total number of questions asked in that year.

# Load the readr and dplyr packages.
library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Load the dataset by_tag_year.csv into a variable named by_tag_year using the read_csv() function (not read.csv()).
by_tag_year <- read_csv("by_tag_year.csv")
## Rows: 40518 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): tag
## dbl (3): year, number, year_total
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Print by_tag_year
by_tag_year
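
The readr message above suggests specifying the column types. Here is a minimal sketch of what that could look like, assuming the four columns listed in the column specification. The inline CSV text and the name toy_by_tag_year are made up for illustration and are not the real dataset.

```r
library(readr)

# Toy CSV text standing in for by_tag_year.csv (values are illustrative only).
toy_csv <- I("tag,year,number,year_total\nr,2008,10,1000\nr,2009,30,2000\n")

# Declare the column types up front so read_csv() does not have to guess them
# and does not print the column specification message.
toy_by_tag_year <- read_csv(
  toy_csv,
  col_types = cols(
    tag = col_character(),
    year = col_double(),
    number = col_double(),
    year_total = col_double()
  )
)

toy_by_tag_year
```

The same col_types argument should work when the first argument is a file path such as "by_tag_year.csv".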

2. Now in fraction format

This data has one observation for each pair of a tag and a year, showing the number of questions asked in that tag in that year and the total number of questions asked in that year. For instance, there were 54 questions asked about the .htaccess tag in 2008, out of a total of 58390 questions in that year.

Rather than just the counts, we’re probably interested in a proportion: the fraction of questions that year that have that tag. So let’s add that to the table.

# Use mutate() to add a column called fraction to by_tag_year, representing number divided by year_total. Name the new table by_tag_year_fraction.
by_tag_year_fraction <- by_tag_year %>%
  mutate(fraction = number / year_total)

# Print by_tag_year_fraction.
by_tag_year_fraction
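
As a quick sanity check on the arithmetic, here is a small sketch that applies the same mutate() step to a one-row table built from the .htaccess figures quoted above (54 questions out of 58390 in 2008). The table name htaccess_2008 and the extra percent column are assumptions added for illustration.

```r
library(dplyr)

# One-row toy table using the .htaccess figures for 2008 quoted above.
htaccess_2008 <- tibble(
  tag = ".htaccess",
  year = 2008,
  number = 54,
  year_total = 58390
)

# Same mutate() step as above, plus a percent column for readability.
htaccess_2008 <- htaccess_2008 %>%
  mutate(
    fraction = number / year_total,
    percent = 100 * fraction
  )

htaccess_2008
```

The resulting fraction is a little under 0.001, which is to say a little under 0.1% of that year's questions.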

3. Has R been growing or shrinking?

So far we’ve been learning and using the R programming language. Wouldn’t we like to be sure it’s a good investment for the future? Has it been keeping pace with other languages, or have people been switching out of it?

Let’s look at whether the fraction of Stack Overflow questions that are about R has been increasing or decreasing over time.

# Use filter() to get only the observations from by_tag_year_fraction that represent R, saving them as r_over_time.
r_over_time <- by_tag_year_fraction %>% filter(tag == "r")

# Print r_over_time.
r_over_time

4. Visualizing change over time

Rather than looking at the results in a table, we often want to create a visualization. Change over time is usually visualized with a line plot.

# Load the ggplot2 package.
library(ggplot2)

# Plot r_over_time with year on the x-axis and fraction on the y-axis. Add a geom_line() layer to the plot to create a line plot.
ggplot(data = r_over_time, aes(x = year, y = fraction)) +
  geom_line(aes(group = 1))
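
Small fractions can be hard to read on an axis. One option, sketched here on made-up data, is to label the y-axis as percentages using the scales package, which is installed alongside ggplot2. The table toy_r_over_time and the axis titles are assumptions for illustration only.

```r
library(ggplot2)

# Toy data standing in for r_over_time (values are illustrative only).
toy_r_over_time <- data.frame(
  year = c(2008, 2009, 2010),
  fraction = c(0.001, 0.004, 0.008)
)

p <- ggplot(toy_r_over_time, aes(x = year, y = fraction)) +
  geom_line() +
  # Show the y-axis as percentages rather than raw fractions.
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Year", y = "Share of questions")

p
```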

5. How about dplyr and ggplot2?

Based on that graph, it looks like R has been growing pretty fast in the last decade. Good thing we’re practicing it now!

Besides R, two other interesting tags are dplyr and ggplot2, which we’ve already used in this analysis. They both also have Stack Overflow tags!

Instead of just looking at R, let’s look at all three tags and their change over time. Is each of those tags increasing as a fraction of overall questions? Are any of them decreasing?

# Combine the tags "r", "dplyr" and "ggplot2" into a vector named selected_tags using c().
selected_tags <- c("r", "dplyr", "ggplot2")

# Use filter() on by_tag_year_fraction, along with the %in% operator, to get only the subset of tags in selected_tags. Name the new table selected_tags_over_time.
selected_tags_over_time <- by_tag_year_fraction %>%
  filter(tag %in% selected_tags)

# Visualize the popularity of these three tags with a line plot in ggplot2 (with year on the x-axis and fraction on the y-axis) using color to represent tag.
ggplot(data = selected_tags_over_time, aes(x = year, y = fraction)) +
  geom_line(aes(group = tag, color = tag))
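
To see why filter() is paired with %in% here, it may help to look at the operator on its own. In this small sketch the vector candidate_tags and the result is_selected are hypothetical names used only for illustration.

```r
selected_tags <- c("r", "dplyr", "ggplot2")

# Hypothetical tags to test for membership in selected_tags.
candidate_tags <- c("r", "python", "ggplot2")

# %in% returns one logical value per element on the left-hand side:
# TRUE for "r" and "ggplot2", FALSE for "python".
is_selected <- candidate_tags %in% selected_tags
is_selected
```

Inside filter(), that logical vector decides which rows are kept. Writing tag == selected_tags instead would compare element by element with recycling, which is not a membership test.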

6. What are the most asked-about tags?

It’s sure been fun to visualize and compare tags over time. The dplyr and ggplot2 tags may not have as many questions as R, but we can tell they’re both growing quickly as well.

We might like to know which tags have the most questions overall, not just within a particular year. Right now, we have several rows for every tag, but we’ll be combining them into one. That means we want group_by() and summarize().

Let’s look at tags that have the most questions in history.

# Use the group_by() and summarize() verbs on by_tag_year to find the total number of questions for each tag, saving the column as tag_total. Then use the arrange() verb to sort the table in descending order of the tag_total column. Save the result to sorted_tags.
sorted_tags <- by_tag_year %>%
  group_by(tag) %>%
  summarize(tag_total = sum(number)) %>%
  arrange(desc(tag_total))

# Print sorted_tags.
sorted_tags
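
The way group_by() and summarize() collapse several rows per tag into one is easier to follow on a tiny table. The sketch below uses made-up tags and counts; toy_counts and toy_totals are hypothetical names, not part of the dataset.

```r
library(dplyr)

# Toy data: two tags observed in two years each (values are illustrative only).
toy_counts <- tibble(
  tag = c("a", "a", "b", "b"),
  year = c(2008, 2009, 2008, 2009),
  number = c(1, 2, 10, 20)
)

toy_totals <- toy_counts %>%
  group_by(tag) %>%                       # one group per tag
  summarize(tag_total = sum(number)) %>%  # collapse each group to a single row
  arrange(desc(tag_total))                # largest total first

toy_totals
```

The four input rows become two output rows, one per tag, with tag "b" (total 30) sorted ahead of tag "a" (total 3).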

7. How have large programming languages changed over time?

We’ve looked at selected tags like R, ggplot2, and dplyr, and seen that they’re each growing. What tags might be shrinking? A good place to start is to plot the tags we just saw to be the most asked-about of all time, including JavaScript, Java and C#.

# Take the first six tags from sorted_tags, which are the six largest tags, and save them as highest_tags.
highest_tags <- head(sorted_tags$tag, n = 6)

# Use the filter() verb to keep only the rows of by_tag_year_fraction whose tag is in highest_tags.
by_tag_subset <- by_tag_year_fraction %>% filter(tag %in% highest_tags)

# Create a line plot of the fraction of questions each of these tags made up over time, using color to represent the tag.
ggplot(data = by_tag_subset, aes(x = year, y = fraction)) +
  geom_line(aes(group = tag, color = tag))

8. Some more tags!

Wow, based on that graph we’ve seen a lot of changes in what programming languages are most asked about. C# gets fewer questions than it used to, and Python has grown quite impressively.

This Stack Overflow data is incredibly versatile. We can analyze any programming language, web framework, or tool whose change over time we’d like to see. Combined with the reproducibility of R and its libraries, we have ourselves a powerful method of uncovering insights about technology.

To demonstrate its versatility, let’s check out how three big mobile operating systems (Android, iOS, and Windows Phone) have compared in popularity over time. But remember: this code can be modified simply by changing the tag names!

# Combine the tags "android", "ios" and "windows-phone" into a vector named my_tags using c().
my_tags <- c("android", "ios", "windows-phone")

# Use filter() on by_tag_year_fraction to get only the subset of tags in my_tags. Name the new table by_tag_subset.
by_tag_subset <- by_tag_year_fraction %>% filter(tag %in% my_tags)

# Visualize the popularity of these tags with a line plot in ggplot2 (with year on the x-axis and fraction on the y-axis) using color to represent tag.
ggplot(data = by_tag_subset, aes(x = year, y = fraction)) +
  geom_line(aes(group = tag, color = tag))
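
Since the filter-and-plot code is the same each time and only the tag names change, one way to reuse it is to wrap it in a function. This is a sketch under assumptions: plot_tags() is a hypothetical helper that is not part of the original analysis, and toy_fraction is made-up data standing in for by_tag_year_fraction.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: keep only the rows of `data` whose tag is in `tags`,
# then draw one line per tag. Expects columns tag, year and fraction.
plot_tags <- function(data, tags) {
  data %>%
    filter(tag %in% tags) %>%
    ggplot(aes(x = year, y = fraction, color = tag)) +
    geom_line()
}

# Toy data standing in for by_tag_year_fraction (values are illustrative only).
toy_fraction <- tibble(
  tag = rep(c("android", "ios", "python"), each = 2),
  year = rep(c(2008, 2009), times = 3),
  fraction = c(0.01, 0.02, 0.005, 0.015, 0.03, 0.035)
)

p <- plot_tags(toy_fraction, c("android", "ios"))
p
```

With the real table, a call such as plot_tags(by_tag_year_fraction, my_tags) should produce the same kind of plot as the code above.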