How can you determine which programming languages and technologies are most widely used? Knowing which languages are gaining or losing popularity can help you decide where to focus your efforts.

One excellent data source is Stack Overflow, a programming question-and-answer site with more than 16 million questions on programming topics. Each Stack Overflow question is tagged with a label identifying its topic or technology. By counting the number of questions related to each technology, you can estimate the popularity of different programming languages.

In this project, you will use data from the Stack Exchange Data Explorer to examine the relative popularity of R compared to other programming languages.

You’ll work with a dataset containing one observation per tag per year, including the number of questions for that tag and the total number of questions that year.

stack_overflow_data.csv

Project Instructions

Discover the trends in the popularity of programming languages by answering the following questions:

What was the percentage of R questions for 2020? Save the result in a data frame, r_2020, containing five columns: year, tag, num_questions, year_total, and percentage.

Identify the five programming language tags with the highest total number of questions asked between 2015 and 2020 (inclusive). Save the tag names as highest_tags. This variable can be a character vector, tibble, or data frame (if the latter, please use the column name tag).

Bonus: try visualizing the data along the way!

Load the necessary packages

library(readr)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Load the dataset

data <- read_csv("stack_overflow_data.csv")
## Rows: 420066 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): tag
## dbl (3): year, num_questions, year_total
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

View the dataset

glimpse(data)
## Rows: 420,066
## Columns: 4
## $ year          <dbl> 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 20…
## $ tag           <chr> "treeview", "scheduled-tasks", "specifications", "render…
## $ num_questions <dbl> 69, 30, 21, 35, 6, 1, 159, 10, 4, 20, 11, 5, 19, 2, 19, …
## $ year_total    <dbl> 168541, 168541, 168541, 168541, 168541, 168541, 168541, …
summary(data)
##       year          tag            num_questions        year_total     
##  Min.   :2008   Length:420066      Min.   :     1.0   Min.   : 168541  
##  1st Qu.:2012   Class :character   1st Qu.:     2.0   1st Qu.:4787010  
##  Median :2015   Mode  :character   Median :     7.0   Median :5621997  
##  Mean   :2015                      Mean   :   142.6   Mean   :5222995  
##  3rd Qu.:2018                      3rd Qu.:    29.0   3rd Qu.:6431458  
##  Max.   :2020                      Max.   :264379.0   Max.   :6612772
head(data)
## # A tibble: 6 × 4
##    year tag             num_questions year_total
##   <dbl> <chr>                   <dbl>      <dbl>
## 1  2008 treeview                   69     168541
## 2  2008 scheduled-tasks            30     168541
## 3  2008 specifications             21     168541
## 4  2008 rendering                  35     168541
## 5  2008 http-post                   6     168541
## 6  2008 static-assert               1     168541

Identify the percentage of R questions

What was the percentage of R questions for 2020? Save the result in a data frame, r_2020, containing five columns: year, tag, num_questions, year_total, and percentage.

# percentage holds the tag's share as a fraction of year_total (multiply by 100 for percent)
r_2020 <- data %>%
  mutate(percentage = num_questions / year_total) %>%
  filter(year == 2020, tag == "r")
print(r_2020)
## # A tibble: 1 × 5
##    year tag   num_questions year_total percentage
##   <dbl> <chr>         <dbl>      <dbl>      <dbl>
## 1  2020 r             52662    5452545    0.00966
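
The value 0.00966 is a fraction of the year's questions, not a number in percent. As a minimal sketch, assuming you would rather report the share in percent, the row printed above can be rebuilt by hand and rescaled. The names r_row, r_2020_pct, and percent are illustrative and not part of the project.

```r
library(dplyr)

# Rebuild the single 2020 row for R from the values printed above
r_row <- data.frame(
  year = 2020,
  tag = "r",
  num_questions = 52662,
  year_total = 5452545
)

# Hypothetical extra column: the same share expressed in percent, rounded to 2 decimals
r_2020_pct <- r_row %>%
  mutate(
    percentage = num_questions / year_total,
    percent = round(100 * percentage, 2)
  )

r_2020_pct$percent
```

With these numbers the share works out to a little under 1 percent of all questions asked in 2020.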

Calculate the five most asked-about tags between 2015 and 2020

You’ll want to filter and group the tags from 2015 to 2020 before calculating the totals and extracting the top five.

# step 1: keep the six years from 2015 to 2020 (inclusive)
highest_tags <- data %>%
  filter(year >= 2015, year <= 2020) %>%
  # step 2: group by tag
  group_by(tag) %>%
  # step 3: calculate the total number of questions asked per tag
  summarize(tag_total = sum(num_questions)) %>%
  # step 4: arrange the tags in descending order of total questions
  arrange(desc(tag_total)) %>%
  # step 5: keep the tag and tag_total columns and take the top 5 rows
  select(tag, tag_total) %>%
  head(n = 5)

Print highest_tags

highest_tags
## # A tibble: 5 × 2
##   tag        tag_total
##   <chr>          <dbl>
## 1 javascript   1373634
## 2 python       1187838
## 3 java          982747
## 4 android       737330
## 5 c#            730045
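
The instructions also allow highest_tags to be a character vector. One possible way to get that form, sketched here on a small made-up dataset rather than the real file, is to replace the arrange/head steps with dplyr's slice_max() and finish with pull(). The data frame toy, its counts, and the name top_tags are assumptions for illustration only.

```r
library(dplyr)

# Toy data in the same shape as the project dataset; the counts are made up
toy <- data.frame(
  year = c(2014, 2015, 2016, 2015, 2016, 2015, 2016),
  tag = c("a", "a", "a", "b", "b", "c", "c"),
  num_questions = c(500, 10, 20, 40, 50, 5, 5)
)

top_tags <- toy %>%
  filter(between(year, 2015, 2020)) %>%   # keep 2015 to 2020, inclusive
  group_by(tag) %>%
  summarize(tag_total = sum(num_questions)) %>%
  slice_max(tag_total, n = 2) %>%          # keep the two largest totals
  pull(tag)                                # extract the tag names as a character vector

top_tags
```

Note that the 2014 row for tag "a" is dropped by the filter, so its large count does not affect the ranking. On the real data you would use n = 5; slice_max() keeps tied rows by default, so the result could contain more than n entries if totals tie.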

Bonus: Visualize your answer

Filter the data for the five largest tags, then plot each tag's share of questions over time on a line plot, using color to represent the tag.

# Create a percentage column: each tag's share of all questions asked that year
data_percentage <- data %>% 
  mutate(percentage = num_questions / year_total)

# Now create the data subset variable
data_subset <- data_percentage %>% 
  filter(tag %in% highest_tags$tag, year >= 2015, year <= 2020)

Plot data_subset as a line plot, using color to represent the tag

ggplot(data_subset, aes(x = year, y = percentage, color = tag)) + geom_line()
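
Because the percentage column holds fractions, the default y-axis shows small decimals. A minimal sketch of one way to tidy the plot, assuming the scales package is available (it is installed alongside ggplot2), is shown below on made-up values. The data frame toy_subset and the axis labels are illustrative choices, not part of the project.

```r
library(ggplot2)

# Toy shares in the same shape as data_subset; the values are made up
toy_subset <- data.frame(
  year = rep(2015:2017, times = 2),
  tag = rep(c("a", "b"), each = 3),
  percentage = c(0.030, 0.028, 0.027, 0.020, 0.024, 0.029)
)

p <- ggplot(toy_subset, aes(x = year, y = percentage, color = tag)) +
  geom_line() +
  # show whole years on the x-axis instead of fractional breaks
  scale_x_continuous(breaks = 2015:2017) +
  # format fractions such as 0.03 as "3.0%"
  scale_y_continuous(labels = scales::label_percent(accuracy = 0.1)) +
  labs(x = "Year", y = "Share of all questions", color = "Tag")

p
```

The same scale and label layers could be added to the data_subset plot above, with the x-axis breaks set to 2015:2020.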