Final Project

Author

Christopher Newman

This project uses two sources of data: GitHub datasets covering issues, pull requests, and repositories, and a second dataset showing how popular each programming language was from 2004 to 2023, although we will focus only on 2011-2023. I found both on Kaggle. I am going to explore which programming languages are used the most overall and in each year. To do this, we will merge the datasets and add up values to find the all-time use of each programming language. I chose this topic because I have been coding ever since I got to college, and I would like to see which languages are used the most and the different languages that people use for different purposes.

Sources: GitHub: https://www.kaggle.com/datasets/isaacwen/github-programming-languages-data/data and https://www.kaggle.com/datasets/muhammadkhalid/most-popular-programming-languages-since-2004

https://blog.sagipl.com/top-programming-languages/

Loading packages

library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.3.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(lubridate)
library(tidyr)
library(ggplot2)
library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(highcharter)

Warning: package 'highcharter' was built under R version 4.3.3

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo

setwd("C:/Users/Christopher Newman/Downloads/R coding")

Loading datasets

Most_pop_lang_Long <- read_csv("Popularity of Programming Languages from 2004 to 2023.csv")

Rows: 227 Columns: 30
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): Date
dbl (29): Abap, Ada, C/C++, C#, Cobol, Dart, Delphi/Pascal, Go, Groovy, Hask...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

repos <- read_csv("repos.csv")

Rows: 453 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): language
dbl (1): num_repos

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

issues <- read_csv("issues.csv")

Rows: 3375 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): name
dbl (3): year, quarter, count

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

prs <- read_csv("prs.csv")

Rows: 3462 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): name
dbl (3): year, quarter, count

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Merging datasets

# Merge issues and prs datasets on name, year, and quarter
issues_prs <- inner_join(issues, prs, by = c("name", "year", "quarter"), suffix = c("_issues", "_prs"))

# Merge issues_prs with repos dataset using name and language
issues_prs_repos <- left_join(issues_prs, repos, by = c("name" = "language"))

# Prepare the pop_lang data by converting Date and extracting the Year
pop_lang <- Most_pop_lang_Long |>
  mutate(Date = parse_date_time(Date, orders = "my"), Year = year(Date)) |>
  select(Date, Year, JavaScript, Python, Java, `C/C++`, Ruby)

# Merge the issues_prs_repos with pop_lang using the Year
final_dataset <- left_join(issues_prs_repos, pop_lang, by = c("year" = "Year"))

Warning in left_join(issues_prs_repos, pop_lang, by = c(year = "Year")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 79 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

# Optional: Display the structure of the final dataset
str(final_dataset)

spc_tbl_ [34,908 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ name        : chr [1:34908] "Ruby" "Ruby" "Ruby" "Ruby" ...
 $ year        : num [1:34908] 2011 2011 2011 2011 2011 ...
 $ quarter     : num [1:34908] 3 3 3 3 3 3 3 3 3 3 ...
 $ count_issues: num [1:34908] 965 965 965 965 965 965 965 965 965 965 ...
 $ count_prs   : num [1:34908] 632 632 632 632 632 632 632 632 632 632 ...
 $ num_repos   : num [1:34908] 374802 374802 374802 374802 374802 ...
 $ Date        : POSIXct[1:34908], format: "2011-01-01" "2011-02-01" ...
 $ JavaScript  : num [1:34908] 6.81 6.93 6.94 7.07 7.13 7.17 7.05 7.03 7.04 7.08 ...
 $ Python      : num [1:34908] 6.65 6.87 6.91 6.84 6.91 6.89 6.9 6.89 6.98 7.2 ...
 $ Java        : num [1:34908] 28.4 28.2 28.1 28.2 28.2 ...
 $ C/C++       : num [1:34908] 12.2 12.5 12.5 12.5 12.5 ...
 $ Ruby        : num [1:34908] 2.62 2.55 2.55 2.49 2.47 2.57 2.58 2.66 2.57 2.57 ...
 - attr(*, "spec")=
  .. cols(
  ..   name = col_character(),
  ..   year = col_double(),
  ..   quarter = col_double(),
  ..   count = col_double()
  .. )
 - attr(*, "problems")=<externalptr>
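The many-to-many warning and the 34,908 rows in the structure above appear because pop_lang has one row per month, so each quarterly row in issues_prs_repos matches every monthly row of the same year (the Ruby 2011 Q3 counts repeat once per month). A minimal sketch of one way to avoid this, assuming a yearly average of the popularity index is acceptable for the analysis, is to collapse the monthly table to one row per year before joining. The tibbles `quarterly` and `monthly` are hypothetical stand-ins with made-up values, not the real data.

```r
library(dplyr)

# Hypothetical stand-ins for issues_prs_repos and pop_lang (values are illustrative)
quarterly <- tibble(
  name = c("Ruby", "Ruby", "Python"),
  year = c(2011, 2011, 2012),
  quarter = c(3, 4, 1),
  count_issues = c(965, 1000, 500)
)
monthly <- tibble(
  Year = c(2011, 2011, 2011, 2012, 2012),
  Python = c(6.65, 6.87, 6.91, 8.0, 8.2),
  Ruby = c(2.62, 2.55, 2.55, 2.4, 2.3)
)

# Collapse the monthly index to one row per year (mean of each language column)
yearly <- monthly |>
  group_by(Year) |>
  summarize(across(everything(), mean), .groups = "drop")

# Each quarterly row now matches exactly one yearly row;
# relationship = "many-to-one" makes dplyr raise an error if that ever stops being true
joined <- left_join(quarterly, yearly,
                    by = c("year" = "Year"),
                    relationship = "many-to-one")
```

With this shape the joined table keeps the same number of rows as the quarterly table, so later sums of count_issues and count_prs are not multiplied by the number of months.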

# Print the first few rows of the final dataset
print(head(final_dataset))

# A tibble: 6 × 12
  name   year quarter count_issues count_prs num_repos Date               
  <chr> <dbl>   <dbl>        <dbl>     <dbl>     <dbl> <dttm>             
1 Ruby   2011       3          965       632    374802 2011-01-01 00:00:00
2 Ruby   2011       3          965       632    374802 2011-02-01 00:00:00
3 Ruby   2011       3          965       632    374802 2011-03-01 00:00:00
4 Ruby   2011       3          965       632    374802 2011-04-01 00:00:00
5 Ruby   2011       3          965       632    374802 2011-05-01 00:00:00
6 Ruby   2011       3          965       632    374802 2011-06-01 00:00:00
# ℹ 5 more variables: JavaScript <dbl>, Python <dbl>, Java <dbl>,
#   `C/C++` <dbl>, Ruby <dbl>

# Transform the dataset to long format using 'gather'
lang_long <- Most_pop_lang_Long |>
  gather(Language, Usage, -Date)

print(lang_long)

# A tibble: 6,583 × 3
   Date           Language Usage
   <chr>          <chr>    <dbl>
 1 July 2004      Abap      0.34
 2 August 2004    Abap      0.35
 3 September 2004 Abap      0.41
 4 October 2004   Abap      0.4 
 5 November 2004  Abap      0.38
 6 December 2004  Abap      0.36
 7 January 2005   Abap      0.39
 8 February 2005  Abap      0.37
 9 March 2005     Abap      0.34
10 April 2005     Abap      0.34
# ℹ 6,573 more rows
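gather() still works, but the tidyr documentation marks it as superseded by pivot_longer(). A hedged sketch of the equivalent reshaping on a small made-up table follows; the table `wide` and its Ada values are illustrative only.

```r
library(tidyr)
library(tibble)

# Small wide table in the same shape as the popularity file (Ada values are made up)
wide <- tibble(
  Date = c("July 2004", "August 2004"),
  Abap = c(0.34, 0.35),
  Ada  = c(0.36, 0.36)
)

# One row per Date/Language pair, with the index value in Usage
long <- wide |>
  pivot_longer(cols = -Date, names_to = "Language", values_to = "Usage")
```

pivot_longer() returns the rows in a different order than gather() (date by date instead of language by language), which does not affect grouped sums such as Total_Usage.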

lang_hist <- lang_long |>
  group_by(Language) |>
  summarize(Total_Usage = sum(Usage, na.rm = TRUE))

# Create a horizontal bar chart using ggplot2
his <- ggplot(lang_hist, aes(x = reorder(Language, Total_Usage), y = Total_Usage)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  coord_flip() +
  labs(
    title = "Total Programming Language Popularity Distribution (2004-2023)",
    x = "Programming Language",
    y = "Total Popularity Index"
  ) +
  theme_minimal()
his

Comments: I first made this bar chart to get a broad view of how popular each language is and how many languages are in use. The totals add up every month in the file, which runs from 2004 to 2023.
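The totals in this chart are summed over every row of the popularity file. To restrict them to 2011-2023, one option is to pull the year out of the Date text before summing. A minimal sketch on made-up rows, assuming every Date value ends in a four-digit year; `lang_long_demo` and `lang_hist_demo` are hypothetical names used only for this example.

```r
library(dplyr)

# Made-up rows in the same shape as lang_long
lang_long_demo <- tibble(
  Date = c("July 2004", "March 2011", "May 2023"),
  Language = c("Python", "Python", "Python"),
  Usage = c(2.5, 7.0, 27.0)
)

lang_hist_demo <- lang_long_demo |>
  # Keep only the text after the last space, e.g. "July 2004" becomes 2004
  mutate(Year = as.integer(sub(".* ", "", Date))) |>
  filter(Year >= 2011, Year <= 2023) |>
  group_by(Language) |>
  summarize(Total_Usage = sum(Usage, na.rm = TRUE))
```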

# Identify the top 5 languages excluding R
top_5 <- lang_hist |>
  filter(Language != "R") |>
  arrange(desc(Total_Usage)) |>
  slice_head(n = 5) |>
  pull(Language)

# Include R explicitly in the list
selected_languages <- c(top_5, "R")

# Filter dataset to include only the selected languages
lang_filtered <- lang_hist |>
  filter(Language %in% selected_languages)

# Create a bar plot to compare R with the top 5 languages
ggplot(lang_filtered, aes(x = reorder(Language, Total_Usage), y = Total_Usage, fill = Language)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(
    title = "R vs Top 5 Most Used Programming Languages",
    x = "Programming Language",
    y = "Total Popularity Index"
  ) +
  theme_minimal()

Comments: Based on the bar chart, I wanted to see what the top 5 most used languages were, and I added R, since we are using it, to see how it compares with them.

# Convert 'Date' to a proper date object
Most_pop_lang_Long$Date <- as.Date(Most_pop_lang_Long$Date, format = "%B %Y")

# Extract the 'Year' from 'Date' as a new column
Most_pop_lang_Long$Year <- format(Most_pop_lang_Long$Date, "%Y")

# Transform the dataset to long format using 'gather'
lang_long <- Most_pop_lang_Long |>
  gather(Language, Usage, -Date, -Year)

# Ensure the 'Year' column is available
print(head(lang_long))  # Verify that 'Year' column is present

# A tibble: 6 × 4
  Date   Year  Language Usage
  <date> <chr> <chr>    <dbl>
1 NA     <NA>  Abap      0.34
2 NA     <NA>  Abap      0.35
3 NA     <NA>  Abap      0.41
4 NA     <NA>  Abap      0.4 
5 NA     <NA>  Abap      0.38
6 NA     <NA>  Abap      0.36

# Summarize the data by language and year
lang_hist <- lang_long |>
  group_by(Language, Year) |>
  summarize(Total_Usage = sum(Usage, na.rm = TRUE))

`summarise()` has grouped output by 'Language'. You can override using the
`.groups` argument.

# Create a ggplot object with grouped bars by language and year
p <- ggplot(lang_hist, aes(x = Year, y = Total_Usage, fill = Language, text = paste("Language:", Language, "<br>Year:", Year, "<br>Usage:", Total_Usage))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Programming Language Popularity by Year",
    x = "Year",
    y = "Total Popularity Index",
    fill = "Programming Language"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Convert to a plotly object to add interactivity
ggplotly(p, tooltip = "text")

Comments: Next I wanted to see how popular each language was over the years, but it did not go as planned. Whenever I run this section of code the years disappear, and I could not figure out why, so I made the chart interactive so that you can hover over each bar to see the numbers for each language.
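A likely cause of the missing years: as.Date() needs a day of the month, and strings such as "July 2004" have none, so format = "%B %Y" returns NA for every row (visible in the head(lang_long) output above, where Date and Year are both NA). A minimal sketch of one workaround, assuming English month names and an English or C locale, is to prepend a day before parsing. The vector `dates_chr` is a made-up sample in the same shape as the Date column.

```r
# Month-year strings in the same shape as the Date column
dates_chr <- c("July 2004", "August 2004", "January 2011")

# Without a day, as.Date() cannot build a date and returns NA
no_day <- as.Date(dates_chr, format = "%B %Y")

# Prepending "1 " supplies the missing day (first of the month)
with_day <- as.Date(paste("1", dates_chr), format = "%d %B %Y")

# The year can then be extracted the same way as in the chunk above
years <- format(with_day, "%Y")
```

Since lubridate is already loaded, its my() parser is another option for month-year text.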

# Summarize by year and language
lang_year_summary <- final_dataset |>
  group_by(name, year) |>
  summarize(Total_Issues = sum(count_issues, na.rm = TRUE),
            Total_PRS = sum(count_prs, na.rm = TRUE),
            Total_Repos = sum(num_repos, na.rm = TRUE)) |>
  ungroup()

`summarise()` has grouped output by 'name'. You can override using the
`.groups` argument.

# Create a ggplot object with grouped bars by programming language and year
p2 <- ggplot(lang_year_summary, aes(x = as.factor(year), y = Total_Repos, fill = name, 
                                   text = paste("Programming Language:", name,
                                                "<br>Year:", year,
                                                "<br>Total Repos:", Total_Repos))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Programming Language Popularity (Total Repos) by Year",
    x = "Year",
    y = "Total Repositories",
    fill = "Programming Language"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Use plotly to add interactivity
ggplotly(p2, tooltip = "text")

Comments: Next I made a bar chart that shows the total repositories over the years. I made it interactive so you can hover over the bars, since they are too thin to read individually.

p3 <- ggplot(lang_year_summary, aes(x = as.factor(year), y = Total_Issues, fill = name, 
                                   text = paste("Programming Language:", name,
                                                "<br>Year:", year,
                                                "<br>Total Issues:", Total_Issues))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Programming Language Popularity (Total Issues) by Year",
    x = "Year",
    y = "Total Issues",
    fill = "Programming Language"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Use plotly to add interactivity
ggplotly(p3, tooltip = "text")

Comments: I made the same chart as above, but for the total issues in each year.

# Merge issues and prs on 'name', 'year', and 'quarter'
issues_prs <- inner_join(issues, prs, by = c("name", "year", "quarter"), suffix = c("_issues", "_prs"))

# Plot trends over time for issues and PRs
p3 <- ggplot(issues_prs, aes(x = year, y = count_issues, color = name, group = name)) +
  geom_line(linewidth = 1.2) +
  labs(
    title = "Trends in Issues and Pull Requests by Programming Language",
    x = "Year",
    y = "Number of Issues",
    color = "Programming Language"
  ) +
  theme_minimal()

p3_interactive <- ggplotly(p3)
p3_interactive

Comments: I made a line graph from the merged issues and PRs data, which correlate with each other, to show the trend over the years.
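The line graph maps only count_issues to the y axis, and the data has several quarters per year. A hedged sketch of one way to prepare the data so that both issues and pull requests can be drawn with one point per year: sum the quarters, then reshape to long format. The tibble `issues_prs_demo` holds made-up values apart from the Ruby 2011 Q3 counts, and the column names `metric` and `count` are choices made for this example.

```r
library(dplyr)
library(tidyr)

# Made-up quarterly rows in the same shape as issues_prs
issues_prs_demo <- tibble(
  name = c("Ruby", "Ruby", "Python", "Python"),
  year = c(2011, 2011, 2011, 2011),
  quarter = c(3, 4, 3, 4),
  count_issues = c(965, 1000, 500, 600),
  count_prs = c(632, 700, 300, 350)
)

yearly_long <- issues_prs_demo |>
  # One row per language and year, summing over quarters
  group_by(name, year) |>
  summarize(Issues = sum(count_issues), PRs = sum(count_prs), .groups = "drop") |>
  # One row per language, year, and metric, ready for a color/linetype mapping
  pivot_longer(c(Issues, PRs), names_to = "metric", values_to = "count")
```

The result could then be passed to ggplot() with aes(x = year, y = count, color = name, linetype = metric) and geom_line().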

# Aggregate the total repositories by programming language
repos_summary <- final_dataset |>
  group_by(name) |>
  summarize(Total_Repos = sum(num_repos, na.rm = TRUE))

# Identify the top 10 programming languages by total repositories (excluding R initially)
top_10 <- repos_summary |>
  filter(name != "R") |>
  arrange(desc(Total_Repos)) |>
  slice_head(n = 10) |>
  pull(name)

# Include R explicitly in the selection
selected_languages <- unique(c(top_10, "R"))

# Filter the dataset to include only the top 10 plus R
repos_filtered <- repos_summary |>
  filter(name %in% selected_languages)

# Create a bar plot to compare R against other languages
ggplot(repos_filtered, aes(x = reorder(name, Total_Repos), y = Total_Repos, fill = name)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(
    title = "R vs Top 10 Programming Languages by Total Repositories",
    x = "Programming Language",
    y = "Total Repositories"
  ) +
  theme_minimal()

Comments: Next I made a graph showing the top 10 most used languages by total repositories, and I added R to see how it compares with the others.
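One caveat about these totals: repos.csv has a single num_repos value per language, and final_dataset repeats that value on every joined row, so sum(num_repos) grows with the number of rows a language has rather than with its repository count. A hedged sketch of ranking by the unrepeated value instead, using made-up numbers apart from Ruby's 374802; `final_demo` and `repos_once` are hypothetical names for this example.

```r
library(dplyr)

# Made-up rows mimicking how num_repos repeats in final_dataset
final_demo <- tibble(
  name = c("Ruby", "Ruby", "Ruby", "R", "R"),
  num_repos = c(374802, 374802, 374802, 50000, 50000)
)

# Keep one row per language so each repository count is used once
repos_once <- final_demo |>
  distinct(name, num_repos) |>
  arrange(desc(num_repos))
```

With the real data, the repos table already has one row per language and could be ranked directly.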

# Aggregate the total repositories by programming language
repos_summary <- final_dataset |>
  group_by(name) |>
  summarize(Total_Repos = sum(num_repos, na.rm = TRUE))

# Identify the top 10 programming languages by total repositories (excluding R initially)
top_10 <- repos_summary |>
  filter(name != "R") |>
  arrange(desc(Total_Repos)) |>
  slice_head(n = 10) |>
  pull(name)

# Include R explicitly in the selection
selected_languages <- unique(c(top_10, "R"))

# Filter the dataset to include only the top 10 plus R
repos_filtered <- repos_summary |>
  filter(name %in% selected_languages)

# Create a highcharter bar plot to compare R against other languages
highchart() |>
  hc_chart(type = "bar") |>
  hc_title(text = "R vs Top 10 Programming Languages by Total Repositories") |>
  hc_xAxis(categories = repos_filtered$name) |>
  hc_yAxis(title = list(text = "Total Repositories")) |>
  hc_add_series(name = "Total Repositories", data = repos_filtered$Total_Repos, colorByPoint = TRUE) |>
  hc_plotOptions(series = list(dataLabels = list(enabled = TRUE))) |>
  hc_tooltip(pointFormat = "Total Repositories: {point.y}")

Comments: Next I switched to highcharter to make the chart more interactive.

# Aggregate the total issues by programming language
issues_summary <- final_dataset |>
  group_by(name) |>
  summarize(Total_Issues = sum(count_issues, na.rm = TRUE))

# Identify the top 10 programming languages by total issues (excluding R initially)
top_10_issues <- issues_summary |>
  filter(name != "R") |>
  arrange(desc(Total_Issues)) |>
  slice_head(n = 10) |>
  pull(name)

# Include R explicitly in the selection
selected_languages_issues <- unique(c(top_10_issues, "R"))

# Filter the dataset to include only the top 10 plus R
issues_filtered <- issues_summary |>
  filter(name %in% selected_languages_issues)

# Create a highcharter bar plot to compare R against other languages by issues
highchart() |>
  hc_chart(type = "bar") |>
  hc_title(text = "R vs Top 10 Programming Languages by Total Issues") |>
  hc_xAxis(categories = issues_filtered$name) |>
  hc_yAxis(title = list(text = "Total Issues")) |>
  hc_add_series(name = "Total Issues", data = issues_filtered$Total_Issues, colorByPoint = TRUE) |>
  hc_plotOptions(series = list(dataLabels = list(enabled = TRUE))) |>
  hc_tooltip(pointFormat = "Total Issues: {point.y}")

Comments: I then did the same with total issues instead.

#Paragraph:

B. Source: https://blog.sagipl.com/top-programming-languages. This is where I found the image, and I found it interesting that R was in the top 10 languages to learn in 2024. The website also explains why each language is used and what each one is good for. So if you are interested in web/app development, you should learn Java, since it is the most used for that.

C. This visualization represents the many different languages there are and how popular each one is. The only pattern I saw was that Java, JavaScript, and Python were always in the top 5 in every graph, which suggests that a lot of companies use these languages and that a lot of people simply like coding in them. That is why I chose the GitHub dataset, since people there can code in whatever they want. I could not get the years to work for most of the visualizations, so if I could do one more thing, it would be to figure out why the years were deleted after a certain point in my code. I tried everything I could think of, so I decided to continue without using the years.