How can you determine which programming languages and technologies are most widely used? Which languages are gaining or losing popularity? Answering these questions can help you decide where to focus your efforts.
One excellent data source is Stack Overflow, a programming question-and-answer site with more than 16 million questions on programming topics. Each Stack Overflow question is tagged with a label identifying its topic or technology. By counting the number of questions related to each technology, you can estimate the popularity of different programming languages.
In this project, you will use data from the Stack Exchange Data Explorer to examine the relative popularity of R compared to other programming languages.
You’ll work with a dataset containing one observation per tag per year, including the number of questions for that tag and the total number of questions that year.
stack_overflow_data.csv
Project Instructions
Discover the trends in the popularity of programming languages by answering the following questions:
What was the percentage of R questions for 2020? Save the result in a data frame, r_2020, containing five columns: year, tag, num_questions, year_total, and percentage.
Identify the five programming language tags with the highest total number of questions asked between 2015 and 2020 (inclusive). Save the tag names as highest_tags. This variable can be a character vector, tibble, or data frame (if the latter, please use the column name tag).
Bonus: try visualizing the data along the way!
Load the necessary packages
library(readr)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Load the dataset
data <- read_csv("stack_overflow_data.csv")
## Rows: 420066 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): tag
## dbl (3): year, num_questions, year_total
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
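The column-specification message above can be avoided by declaring the column types up front. The following is a minimal sketch, assuming the four columns reported in the message. The temporary file and its two rows are stand-ins (copied from the first rows of the dataset) so that the snippet runs without the real CSV, and the name sample_data is illustrative only.

```r
library(readr)

# Stand-in for stack_overflow_data.csv: a temporary file with two rows
# copied from the first rows of the real dataset.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "year,tag,num_questions,year_total",
  "2008,treeview,69,168541",
  "2008,scheduled-tasks,30,168541"
), tmp)

# Supplying col_types means readr does not have to guess the types,
# so the column-specification message is not printed.
sample_data <- read_csv(
  tmp,
  col_types = cols(
    year = col_double(),
    tag = col_character(),
    num_questions = col_double(),
    year_total = col_double()
  )
)
```

With the real file, the same col_types argument could be passed to the read_csv() call shown above.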
View the dataset
glimpse(data)
## Rows: 420,066
## Columns: 4
## $ year          <dbl> 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 20…
## $ tag           <chr> "treeview", "scheduled-tasks", "specifications", "render…
## $ num_questions <dbl> 69, 30, 21, 35, 6, 1, 159, 10, 4, 20, 11, 5, 19, 2, 19, …
## $ year_total    <dbl> 168541, 168541, 168541, 168541, 168541, 168541, 168541, …
summary(data)
##       year          tag            num_questions         year_total
##  Min.   :2008   Length:420066      Min.   :     1.0   Min.   : 168541
##  1st Qu.:2012   Class :character   1st Qu.:     2.0   1st Qu.:4787010
##  Median :2015   Mode  :character   Median :     7.0   Median :5621997
##  Mean   :2015                      Mean   :   142.6   Mean   :5222995
##  3rd Qu.:2018                      3rd Qu.:    29.0   3rd Qu.:6431458
##  Max.   :2020                      Max.   :264379.0   Max.   :6612772
head(data)
## # A tibble: 6 × 4
##    year tag             num_questions year_total
##   <dbl> <chr>                   <dbl>      <dbl>
## 1  2008 treeview                   69     168541
## 2  2008 scheduled-tasks            30     168541
## 3  2008 specifications             21     168541
## 4  2008 rendering                  35     168541
## 5  2008 http-post                   6     168541
## 6  2008 static-assert               1     168541
Identify the percentage of R questions
Compute each tag's share of the yearly total, then keep the 2020 row for the r tag. Save the result in a data frame, r_2020, containing five columns: year, tag, num_questions, year_total, and percentage.
# Multiply by 100 so the column holds a percentage rather than a proportion
r_2020 <- data %>%
  mutate(percentage = num_questions / year_total * 100) %>%
  filter(year == 2020, tag == "r")
print(r_2020)
## # A tibble: 1 × 5
##    year tag   num_questions year_total percentage
##   <dbl> <chr>         <dbl>      <dbl>      <dbl>
## 1  2020 r             52662    5452545      0.966
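As a quick sanity check, the share can be recomputed by hand from the num_questions and year_total values printed above. The variable names in this sketch are illustrative and not part of the project.

```r
# Values taken from the r_2020 row printed above
r_questions <- 52662
total_2020 <- 5452545

share <- r_questions / total_2020  # proportion of all 2020 questions
share_pct <- share * 100           # the same share expressed in percent

round(share, 5)      # 0.00966
round(share_pct, 2)  # 0.97
```

In other words, R accounted for just under 1% of the questions asked in 2020.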
Calculate the five most asked-about tags between 2015 and 2020
You’ll want to filter and group the tags from 2015 to 2020 before calculating the totals and extracting the top five.
highest_tags <- data %>%
  # step 1: keep the six years from 2015 to 2020 (inclusive)
  filter(year >= 2015, year <= 2020) %>%
  # step 2: group by tag
  group_by(tag) %>%
  # step 3: calculate the total number of questions asked per tag
  summarize(tag_total = sum(num_questions)) %>%
  # step 4: arrange the tags in descending order of questions asked
  arrange(desc(tag_total)) %>%
  # step 5: keep the tag and its total for the top five tags
  select(tag, tag_total) %>%
  head(n = 5)
Print highest_tags
highest_tags
## # A tibble: 5 × 2
##   tag        tag_total
##   <chr>          <dbl>
## 1 javascript   1373634
## 2 python       1187838
## 3 java          982747
## 4 android       737330
## 5 c#            730045
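The project instructions also accept a character vector for highest_tags. The following is a hedged sketch of that variant, using slice_max() and pull() from dplyr on a small made-up data frame. The names toy_totals and top_two and all counts are invented for illustration and are not taken from the dataset.

```r
library(dplyr)

# Made-up totals, standing in for the summarized per-tag counts
toy_totals <- tibble(
  tag = c("a", "b", "c", "d"),
  tag_total = c(10, 40, 30, 20)
)

top_two <- toy_totals %>%
  slice_max(tag_total, n = 2) %>%  # rows with the two largest totals
  pull(tag)                        # extract the tag column as a character vector

top_two
```

If highest_tags were stored this way, a later filter would use tag %in% highest_tags directly instead of tag %in% highest_tags$tag.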
Bonus: Visualize your answer
Filter the data for the five largest tags from 2015 onward, then plot each tag's share of questions over time on a line plot, using color to represent the tag.
# Add a percentage column: each tag's share of all questions asked that year
data_percentage <- data %>%
  mutate(percentage = num_questions / year_total * 100)
# Keep only the five largest tags for 2015 to 2020
data_subset <- data_percentage %>%
  filter(tag %in% highest_tags$tag, year >= 2015, year <= 2020)
Plot data_subset as a line plot, using color to represent the tag
ggplot(data_subset, aes(x = year, y = percentage, color = tag)) + geom_line()
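The default plot can be made easier to read with whole-year axis breaks and explicit labels. The following is a sketch on a small made-up data frame: toy_subset and its values are invented purely to keep the snippet self-contained, and with the real data data_subset would take its place (with breaks covering 2015:2020).

```r
library(ggplot2)

# Made-up values, standing in for data_subset
toy_subset <- data.frame(
  year = rep(2015:2017, times = 2),
  tag = rep(c("tag_a", "tag_b"), each = 3),
  percentage = c(5.0, 5.5, 6.0, 4.0, 3.5, 3.0)
)

p <- ggplot(toy_subset, aes(x = year, y = percentage, color = tag)) +
  geom_line() +
  scale_x_continuous(breaks = 2015:2017) +  # one tick per year, no fractional years
  labs(x = "Year", y = "Share of questions", color = "Tag")

p
```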