This project uses two sets of data: GitHub datasets covering issues, pull requests, and repositories, and a dataset showing how popular each programming language was from 2004 to 2023, although I will only focus on the years 2011-2023. I found both on Kaggle. I am going to explore which programming languages are used the most overall and also look at each year. To do this, I will merge the datasets and add up values to find the all-time use of each programming language. I chose this topic because I have been coding ever since I got to college, and I would like to see which languages are used the most and all of the different languages that people use for different reasons.
Sources: https://www.kaggle.com/datasets/isaacwen/github-programming-languages-data/data (GitHub data) and https://www.kaggle.com/datasets/muhammadkhalid/most-popular-programming-languages-since-2004 (language popularity)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
library(highcharter)
Warning: package 'highcharter' was built under R version 4.3.3
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
Most_pop_lang_Long <- read_csv("Popularity of Programming Languages from 2004 to 2023.csv")
Rows: 227 Columns: 30
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Date
dbl (29): Abap, Ada, C/C++, C#, Cobol, Dart, Delphi/Pascal, Go, Groovy, Hask...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
repos <- read_csv("repos.csv")
Rows: 453 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): language
dbl (1): num_repos
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
issues <- read_csv("issues.csv")
Rows: 3375 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): name
dbl (3): year, quarter, count
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
prs <- read_csv("prs.csv")
Rows: 3462 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): name
dbl (3): year, quarter, count
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Merging datasets
# Merge issues and prs datasets on name, year, and quarter
issues_prs <- inner_join(issues, prs, by = c("name", "year", "quarter"), suffix = c("_issues", "_prs"))

# Merge issues_prs with repos dataset using name and language
issues_prs_repos <- left_join(issues_prs, repos, by = c("name" = "language"))

# Prepare the pop_lang data by converting Date and extracting the Year
pop_lang <- Most_pop_lang_Long |>
  mutate(Date = parse_date_time(Date, orders = "my"), Year = year(Date)) |>
  select(Date, Year, JavaScript, Python, Java, `C/C++`, Ruby)

# Merge the issues_prs_repos with pop_lang using the Year
final_dataset <- left_join(issues_prs_repos, pop_lang, by = c("year" = "Year"))
Warning in left_join(issues_prs_repos, pop_lang, by = c(year = "Year")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 79 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
# Optional: Display the structure of the final dataset
str(final_dataset)
# Transform the dataset to long format using 'gather'
lang_long <- Most_pop_lang_Long |>
  gather(Language, Usage, -Date)
print(lang_long)
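The many-to-many warning above most likely arises because pop_lang has one row per month, while issues_prs_repos has one row per language and quarter, so every quarterly row matches every month of its year. A minimal sketch of one way around this is to collapse the popularity data to one row per year before joining. The toy tables, the values in them, and the choice of the yearly mean are assumptions for illustration, not the project's data.

```r
library(dplyr)

# Toy stand-ins for issues_prs_repos and pop_lang (all values are made up).
activity <- tibble(
  name = c("Python", "Python", "Java"),
  year = c(2020, 2020, 2020),
  quarter = c(1, 2, 1),
  count_issues = c(10, 12, 7)
)
monthly_pop <- tibble(
  Year = c(2020, 2020, 2020),
  Python = c(30, 31, 32),
  Java = c(17, 16, 18)
)

# Collapse the monthly index to one row per year (mean of the monthly values).
yearly_pop <- monthly_pop |>
  group_by(Year) |>
  summarize(across(everything(), mean), .groups = "drop")

# Each activity row now matches at most one popularity row, so the join
# keeps the original number of rows.
joined <- left_join(activity, yearly_pop,
                    by = c("year" = "Year"),
                    relationship = "many-to-one")
print(joined)
```

With the original monthly join, each quarterly row is repeated once per matching month, so any sums taken later from final_dataset would likely be scaled up by that repetition.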
# A tibble: 6,583 × 3
Date Language Usage
<chr> <chr> <dbl>
1 July 2004 Abap 0.34
2 August 2004 Abap 0.35
3 September 2004 Abap 0.41
4 October 2004 Abap 0.4
5 November 2004 Abap 0.38
6 December 2004 Abap 0.36
7 January 2005 Abap 0.39
8 February 2005 Abap 0.37
9 March 2005 Abap 0.34
10 April 2005 Abap 0.34
# ℹ 6,573 more rows
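The tidyr documentation describes gather() as superseded and recommends pivot_longer() for new code. A minimal sketch of the same reshaping with pivot_longer(), using a two-language toy table whose values are only illustrative:

```r
library(tidyr)
library(tibble)

# Toy wide table in the same shape as Most_pop_lang_Long (illustrative values).
wide <- tibble(
  Date = c("July 2004", "August 2004"),
  Abap = c(0.34, 0.35),
  Ada = c(0.36, 0.36)
)

# One row per Date/Language pair; every column except Date is stacked.
long <- wide |>
  pivot_longer(cols = -Date, names_to = "Language", values_to = "Usage")
print(long)
```

The result has the same columns as the gather() version, although pivot_longer() orders the rows by Date first rather than by Language.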
lang_hist <- lang_long |>
  group_by(Language) |>
  summarize(Total_Usage = sum(Usage, na.rm = TRUE))

# Create a bar chart using ggplot2
his <- ggplot(lang_hist, aes(x = reorder(Language, Total_Usage), y = Total_Usage)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  coord_flip() +
  labs(
    title = "Total Programming Language Popularity Distribution (2011-2023)",
    x = "Programming Language",
    y = "Total Popularity Index"
  ) +
  theme_minimal()
his
Comments: I first made this bar chart to get a wide view of how popular the languages are and how many are in use. The title says 2011-2023, but the code above sums every month in the dataset, from July 2004 onward.
# Identify the top 5 languages excluding R
top_5 <- lang_hist |>
  filter(Language != "R") |>
  arrange(desc(Total_Usage)) |>
  slice_head(n = 5) |>
  pull(Language)

# Include R explicitly in the list
selected_languages <- c(top_5, "R")

# Filter dataset to include only the selected languages
lang_filtered <- lang_hist |>
  filter(Language %in% selected_languages)

# Create a bar plot to compare R with the top 5 languages
ggplot(lang_filtered, aes(x = reorder(Language, Total_Usage), y = Total_Usage, fill = Language)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(
    title = "R vs Top 5 Most Used Programming Languages",
    x = "Programming Language",
    y = "Total Popularity Index"
  ) +
  theme_minimal()
Comments: Based on the bar chart, I wanted to see what the top 5 most used languages were. I added R as well, since we are using it, to see how it compares to the most used languages.
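Since the project focuses on 2011-2023 and lang_long still holds every month from July 2004 onward, the totals could be restricted to that window before summing. A minimal sketch, assuming the Date strings look like "July 2004" as in the printed tibble; the toy table and its usage values are made up for illustration:

```r
library(dplyr)
library(lubridate)

# Toy long-format data in the same shape as lang_long (values are made up).
lang_long_toy <- tibble(
  Date = c("July 2004", "January 2011", "March 2015", "May 2023"),
  Language = "Python",
  Usage = c(2.5, 8.0, 11.0, 27.0)
)

# Parse the month-year strings, keep 2011-2023 only, then total per language.
lang_hist_toy <- lang_long_toy |>
  mutate(Year = year(parse_date_time(Date, orders = "my"))) |>
  filter(between(Year, 2011, 2023)) |>
  group_by(Language) |>
  summarize(Total_Usage = sum(Usage, na.rm = TRUE), .groups = "drop")
print(lang_hist_toy)
```

Here the 2004 row is dropped and only the three rows from 2011 onward contribute to the total.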
# Convert 'Date' to a proper date object
Most_pop_lang_Long$Date <- as.Date(Most_pop_lang_Long$Date, format = "%B %Y")

# Extract the 'Year' from 'Date' as a new column
Most_pop_lang_Long$Year <- format(Most_pop_lang_Long$Date, "%Y")

# Transform the dataset to long format using 'gather'
lang_long <- Most_pop_lang_Long |>
  gather(Language, Usage, -Date, -Year)

# Ensure the 'Year' column is available
print(head(lang_long)) # Verify that 'Year' column is present
# A tibble: 6 × 4
Date Year Language Usage
<date> <chr> <chr> <dbl>
1 NA <NA> Abap 0.34
2 NA <NA> Abap 0.35
3 NA <NA> Abap 0.41
4 NA <NA> Abap 0.4
5 NA <NA> Abap 0.38
6 NA <NA> Abap 0.36
# Summarize the data by language and year
lang_hist <- lang_long |>
  group_by(Language, Year) |>
  summarize(Total_Usage = sum(Usage, na.rm = TRUE))
`summarise()` has grouped output by 'Language'. You can override using the
`.groups` argument.
# Create a ggplot object with grouped bars by language and year
p <- ggplot(lang_hist, aes(x = Year, y = Total_Usage, fill = Language,
                           text = paste("Language:", Language, "<br>Year:", Year, "<br>Usage:", Total_Usage))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Programming Language Popularity by Year",
    x = "Year",
    y = "Total Popularity Index",
    fill = "Programming Language"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Convert to a plotly object to add interactivity
ggplotly(p, tooltip = "text")
Comments: Next I wanted to see how popular each language was throughout the years, but it did not go as planned. Whenever I run this section of code the years disappear (the Date and Year columns come back as NA in the output above), and I could not figure out why, so I made the chart a mouse-over so you can see the numbers for each language.
# Summarize by year and language
lang_year_summary <- final_dataset |>
  group_by(name, year) |>
  summarize(
    Total_Issues = sum(count_issues, na.rm = TRUE),
    Total_PRS = sum(count_prs, na.rm = TRUE),
    Total_Repos = sum(num_repos, na.rm = TRUE)
  ) |>
  ungroup()
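A likely cause of the missing years is the as.Date() call: the Date strings contain only a month and a year, and R's strptime documentation notes that when a month is given, the day of the month must be given too, otherwise the result is NA. A minimal base R sketch of one possible fix, which prepends a placeholder day before parsing (the choice of day 1 is an assumption made for illustration, and month names are assumed to be parsed in an English locale):

```r
# Month-year strings as they appear in the Date column.
dates <- c("July 2004", "August 2004")

# No day in the string or the format, so parsing fails and returns NA.
broken <- as.Date(dates, format = "%B %Y")
print(broken)

# Prepend a placeholder day so the date is fully specified.
fixed <- as.Date(paste("1", dates), format = "%d %B %Y")
years <- format(fixed, "%Y")
print(fixed)
print(years)
```

The parse_date_time(Date, orders = "my") call used earlier for pop_lang is another way to get the same dates.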
`summarise()` has grouped output by 'name'. You can override using the
`.groups` argument.
# Create a ggplot object with grouped bars by programming language and year
p2 <- ggplot(lang_year_summary, aes(x = as.factor(year), y = Total_Repos, fill = name,
                                    text = paste("Programming Language:", name, "<br>Year:", year, "<br>Total Repos:", Total_Repos))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Programming Language Popularity (Total Repos) by Year",
    x = "Year",
    y = "Total Repositories",
    fill = "Programming Language"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Use plotly to add interactivity
ggplotly(p2, tooltip = "text")
Comments: Next I made a bar chart that shows the total repositories throughout the years. I made it interactive so you can hover over the bars, since you can’t really see each one otherwise.
p3 <- ggplot(lang_year_summary, aes(x = as.factor(year), y = Total_Issues, fill = name,
                                    text = paste("Programming Language:", name, "<br>Year:", year, "<br>Total Issues:", Total_Issues))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Programming Language Popularity (Total Issues) by Year",
    x = "Year",
    y = "Total Issues",
    fill = "Programming Language"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Use plotly to add interactivity
ggplotly(p3, tooltip = "text")
Comments: I made the same chart as above, but for total issues in each year.
# Merge issues and prs on 'name', 'year', and 'quarter'
issues_prs <- inner_join(issues, prs, by = c("name", "year", "quarter"), suffix = c("_issues", "_prs"))

# Plot the trend in issues over time for each language
p3 <- ggplot(issues_prs, aes(x = year, y = count_issues, color = name, group = name)) +
  geom_line(size = 1.2) +
  labs(
    title = "Trends in Issues and Pull Requests by Programming Language",
    x = "Year",
    y = "Number of Issues",
    color = "Programming Language"
  ) +
  theme_minimal()
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
p3_interactive <- ggplotly(p3)
p3_interactive
Comments: I made a line graph from the merged issues and PRs data, which correlate with each other. It shows the number of issues for each language over the years.
# Aggregate the total repositories by programming language
repos_summary <- final_dataset |>
  group_by(name) |>
  summarize(Total_Repos = sum(num_repos, na.rm = TRUE))

# Identify the top 9 programming languages by total repositories (excluding R),
# so that adding R gives 10 languages in total
top_10 <- repos_summary |>
  filter(name != "R") |>
  arrange(desc(Total_Repos)) |>
  slice_head(n = 9) |>
  pull(name)

# Include R explicitly in the selection
selected_languages <- unique(c(top_10, "R"))

# Filter the dataset to include only the top 9 plus R
repos_filtered <- repos_summary |>
  filter(name %in% selected_languages)

# Create a bar plot to compare R against other languages
ggplot(repos_filtered, aes(x = reorder(name, Total_Repos), y = Total_Repos, fill = name)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(
    title = "R vs Top 10 Programming Languages by Total Repositories",
    x = "Programming Language",
    y = "Total Repositories"
  ) +
  theme_minimal()
Comments: Now I made a graph showing the most used languages by total repositories: the top 9, plus R to see how it compares to the others, for 10 languages in total.
# Aggregate the total repositories by programming language
repos_summary <- final_dataset |>
  group_by(name) |>
  summarize(Total_Repos = sum(num_repos, na.rm = TRUE))

# Identify the top 9 programming languages by total repositories (excluding R),
# so that adding R gives 10 languages in total
top_10 <- repos_summary |>
  filter(name != "R") |>
  arrange(desc(Total_Repos)) |>
  slice_head(n = 9) |>
  pull(name)

# Include R explicitly in the selection
selected_languages <- unique(c(top_10, "R"))

# Filter the dataset to include only the top 9 plus R
repos_filtered <- repos_summary |>
  filter(name %in% selected_languages)

# Create a highcharter bar plot to compare R against other languages
highchart() |>
  hc_chart(type = "bar") |>
  hc_title(text = "R vs Top 10 Programming Languages by Total Repositories") |>
  hc_xAxis(categories = repos_filtered$name) |>
  hc_yAxis(title = list(text = "Total Repositories")) |>
  hc_add_series(name = "Total Repositories", data = repos_filtered$Total_Repos, colorByPoint = TRUE) |>
  hc_plotOptions(series = list(dataLabels = list(enabled = TRUE))) |>
  hc_tooltip(pointFormat = "Total Repositories: {point.y}")
Comments: Now I switched to highcharter to make it more interactive.
# Aggregate the total issues by programming language
issues_summary <- final_dataset |>
  group_by(name) |>
  summarize(Total_Issues = sum(count_issues, na.rm = TRUE))

# Identify the top 9 programming languages by total issues (excluding R),
# so that adding R gives 10 languages in total
top_10_issues <- issues_summary |>
  filter(name != "R") |>
  arrange(desc(Total_Issues)) |>
  slice_head(n = 9) |>
  pull(name)

# Include R explicitly in the selection
selected_languages_issues <- unique(c(top_10_issues, "R"))

# Filter the dataset to include only the top 9 plus R
issues_filtered <- issues_summary |>
  filter(name %in% selected_languages_issues)

# Create a highcharter bar plot to compare R against other languages by issues
highchart() |>
  hc_chart(type = "bar") |>
  hc_title(text = "R vs Top 10 Programming Languages by Total Issues") |>
  hc_xAxis(categories = issues_filtered$name) |>
  hc_yAxis(title = list(text = "Total Issues")) |>
  hc_add_series(name = "Total Issues", data = issues_filtered$Total_Issues, colorByPoint = TRUE) |>
  hc_plotOptions(series = list(dataLabels = list(enabled = TRUE))) |>
  hc_tooltip(pointFormat = "Total Issues: {point.y}")
Comments: Now I did the same, but with total issues instead.
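The relationship between issues and PRs could also be checked numerically, and the quarterly rows could be placed on a continuous time axis so that each quarter gets its own point on the line. A minimal sketch, assuming the merged column names count_issues and count_prs from above; the toy counts are made up, and the period column is a hypothetical helper that is not in the original data:

```r
library(dplyr)

# Toy stand-in for issues_prs (all counts are made up).
issues_prs_toy <- tibble(
  name = rep(c("Python", "Java"), each = 4),
  year = 2020,
  quarter = rep(1:4, times = 2),
  count_issues = c(10, 12, 15, 20, 8, 9, 7, 10),
  count_prs = c(20, 24, 30, 40, 5, 9, 6, 8)
)

# Hypothetical 'period' column: year plus the fraction of the year elapsed
# at the start of each quarter, usable as a continuous x axis.
with_period <- issues_prs_toy |>
  mutate(period = year + (quarter - 1) / 4)

# Pearson correlation between issue and PR counts within each language.
by_lang <- issues_prs_toy |>
  group_by(name) |>
  summarize(issue_pr_cor = cor(count_issues, count_prs), .groups = "drop")
print(by_lang)
```

A correlation near 1 for a language would support the idea that its issues and PRs rise and fall together.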
#Paragraph:
B. Source: https://blog.sagipl.com/top-programming-languages. This is where I found the image, and I found it interesting that R was in the top 10 languages to learn in 2024. The website also explains why each language is used and what each one is good for. So if you are interested in web/app development, you should learn Java, since according to the site it is most used for that.
C. These visualizations show the many different languages there are and how popular each one is. The only pattern I saw was that Java, JavaScript, and Python were always in the top 5 in every graph. This suggests that a lot of companies use these languages and that a lot of people simply like coding in them, which is why I chose the GitHub dataset: on GitHub, people can code in whatever they want. I could not get the years to work for most of the visualizations, so if I could do one more thing, it would be to figure out why the years disappeared after a certain point in my code. I tried everything I could think of, so I decided to continue without using the years.