GitHub user data analysis: micheleriva

This project analyzes the public GitHub data of a single user. The data is retrieved through the GitHub API and then wrangled and analyzed in R.

micheleriva is the GitHub handle of Michele Riva. The profile is available at https://github.com/micheleriva

1. Description: Senior Architect @nearform · @google GDE · @microsoft MVP · Book Author · International Speaker. Company: @nearform. Location: Italy. Email: not listed. Website: https://www.micheleriva.it. Twitter handle: @MicheleRivaCode

2. Followers: 469, Following: 442

3. Number of repositories: 49

Project coding

Loading the libraries required in the project

#install.packages("repurrrsive")
#install.packages("tidyverse")
#install.packages("httr")
#install.packages("gh")
#install.packages("ggplot2")
#install.packages("kableExtra")
#install.packages("ggthemes")
#install.packages("treemapify")
#install.packages("ggplotify")

library(repurrrsive)
## Warning: package 'repurrrsive' was built under R version 4.1.3
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.1.3
## Warning: package 'purrr' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(httr)
## Warning: package 'httr' was built under R version 4.1.3
library(gh)
## Warning: package 'gh' was built under R version 4.1.3
library(ggplot2)
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.1.3
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.1.3
library(treemapify)
## Warning: package 'treemapify' was built under R version 4.1.3
library(ggplotify)
## Warning: package 'ggplotify' was built under R version 4.1.3

micheleriva Profile details

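The profile is fetched from the GitHub API with the gh package, authenticated with a personal access token (mapped to Shrawani Misra) that is supplied through the GITHUB_TOKEN environment variable. Passing .limit = Inf tells gh to fetch every page of a result instead of stopping at a fixed number of records.

The user's login, name, public repository count and follower count are then shown in a kable table.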

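# The personal access token is expected in the GITHUB_TOKEN environment
# variable (for example, set in ~/.Renviron). Never hard-code a token in a
# report: anyone who can read the document can use it.
profile_michele <- gh("GET /users/:username",
                      username = "micheleriva", .limit = Inf)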

profile_michele <- tibble(
  login = profile_michele$login,
  name = profile_michele$name,
  public_repos = profile_michele$public_repos,
  followers = profile_michele$followers
)
head(profile_michele, n = 50) %>% 
  kable() %>% 
  kable_paper(bootstrap_options = "striped", full_width=F)
| login | name | public_repos | followers |
|---|---|---|---|
| micheleriva | Michele Riva | 49 | 469 |
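
The token handling described above can be sketched as a small helper. This is a minimal illustration, not part of the original analysis: `get_github_token()` and the `DEMO_GITHUB_TOKEN` variable are hypothetical names, and the placeholder value is not a real token. It only shows the idea of reading the secret from the environment and failing early when it is missing.

```r
# Hypothetical helper (an assumption, not from the original report):
# read a token from an environment variable and stop early if it is missing.
get_github_token <- function(var = "GITHUB_TOKEN") {
  token <- Sys.getenv(var, unset = "")
  if (!nzchar(token)) {
    stop("Environment variable ", var, " is not set; add it to ~/.Renviron.")
  }
  token
}

# Demo with a placeholder value so the example runs without a real token.
Sys.setenv(DEMO_GITHUB_TOKEN = "placeholder-not-a-real-token")
demo_token <- get_github_token("DEMO_GITHUB_TOKEN")
nzchar(demo_token)
```

Keeping the token out of the source means the report can be shared or published without exposing credentials.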

Data Collection

Fetching follower data of micheleriva

The followers of micheleriva are fetched with the gh package and collected in a data frame. Two helper functions deal with missing fields: null_list() replaces a NULL value with NA (and returns character values), and is.not.null() tests whether a field is present.

#Get followers info and populate in dataframe
followers_michele <- gh("/users/micheleriva/followers", .limit = Inf)
## i Running gh query
## i Running gh query, got 100 records of about 500
## i Running gh query, got 200 records of about 500
## i Running gh query, got 300 records of about 500
## i Running gh query, got 400 records of about 500
df_followers_michele <- data.frame(User=character(),
                           login=character(), 
                           public_repos=integer(), 
                           followers=integer()) 

null_list <- function(x){map_chr(x, ~{ifelse(is.null(.x), NA, .x)})}
is.not.null <- function(x) !is.null(x)
n <- length(followers_michele)

Next, a for loop iterates over every follower returned by the API (n records; the profile lists 469 followers). For each follower, the loop fetches the profile with gh and reads the name, public repository count and follower count. Followers with any of these fields missing (for example, no public name) are skipped; all others are appended to the data frame.

for (i in seq_len(n))
{
  # Login of the i-th follower, used to fetch further data
  login <- followers_michele[[i]]$login

  # Fetch that follower's profile (a single object, so no pagination is needed)
  f_profile <- gh("GET /users/:login", login = login)
  name <- f_profile$name
  public_repos <- f_profile$public_repos
  followers <- f_profile$followers

  # Skip followers with missing fields, then append the row to the data frame.
  # && is used because each test is a single TRUE/FALSE value.
  if (is.not.null(name) && is.not.null(login) && is.not.null(public_repos)
      && is.not.null(followers))
  {
    df_followers_michele <- rbind(df_followers_michele,
                                  data.frame(User = null_list(name),
                                             login = login,
                                             public_repos = null_list(public_repos),
                                             followers = null_list(followers)))
  }
}


head(df_followers_michele, n = 10) %>% 
  kable() %>% 
  kable_paper(bootstrap_options = "striped", full_width=F)
| User | login | public_repos | followers |
|---|---|---|---|
| Jonas Galvez | galvez | 62 | 427 |
| =Bill.Barnhill | BillBarnhill | 143 | 45 |
| Jiten (Jits) Bhagat | jits | 18 | 127 |
| Giorgio Pomettini | Pomettini | 51 | 75 |
| Andrey Esin | esin | 33 | 6209 |
| Antonio Caputo | tonycaputome | 27 | 40 |
| Daniel Vinciguerra | dvinciguerra | 97 | 133 |
| Fabrizio Guglielmino | guglielmino | 48 | 30 |
| X Code | singerdmx | 15 | 401 |
| Tristan de Cacqueray | TristanCacqueray | 267 | 83 |
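
The per-follower step can also be written without growing a data frame inside a loop. The following is a rough sketch under stated assumptions, not the method used in this report: `mock_profiles` is made-up data standing in for the lists returned by gh, and `profile_to_row()` is a hypothetical helper. It converts each profile to a one-row tibble with typed NA values for missing fields, binds the rows once, and then drops followers without a public name.

```r
library(dplyr)

# Hypothetical stand-ins for the lists returned by gh("GET /users/:login").
mock_profiles <- list(
  list(login = "alice", name = "Alice A.", public_repos = 12L, followers = 30L),
  list(login = "bob",   name = NULL,       public_repos = 3L,  followers = 1L)
)

# Convert one profile to a one-row tibble, replacing NULL fields with typed NA.
profile_to_row <- function(p) {
  or_na <- function(x, na) if (is.null(x)) na else x
  tibble::tibble(
    User         = or_na(p$name, NA_character_),
    login        = or_na(p$login, NA_character_),
    public_repos = or_na(p$public_repos, NA_integer_),
    followers    = or_na(p$followers, NA_integer_)
  )
}

# Bind all rows in one step, then keep only followers with a public name.
df_mock  <- bind_rows(lapply(mock_profiles, profile_to_row))
df_named <- filter(df_mock, !is.na(User))
```

Because the columns keep their integer type, no later conversion from character to numeric would be needed with this approach.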

Fetching repositories data of micheleriva

The repositories of micheleriva are handled the same way: the repository data is fetched with the gh package, checked for missing fields, and appended to a data frame. The loop runs over all 49 repositories returned by the API, and the number of closed issues is fetched separately for each repository.

# Get repository info and populate it in a data frame
repos_michele <- gh("GET /users/:username/repos", username = "micheleriva", .limit = Inf)
length(repos_michele)
## [1] 49
df_repoinfo <- data.frame(Repo_Name=character(),
                               size=integer(), 
                               forks=integer(), 
                               created_year=integer(),
                               open_issues_count=integer(),
                               closed_issues_count=integer()) 

for (i in seq_along(repos_michele))
{
  # Details of the i-th repo, used to fetch further data
  repo <- repos_michele[[i]]
  name <- repo$name
  size <- repo$size
  created_year <- as.integer(substring(repo$created_at, 1, 4))
  forks <- repo$forks_count
  open_issues_count <- repo$open_issues_count

  # Closed issues are not part of the repo record, so query them separately.
  # Note: GitHub's issues endpoint also returns pull requests, so this count
  # includes closed pull requests.
  closed_issues_url <- paste0(repo$url, "/issues?state=closed")
  closed_issues <- gh(closed_issues_url, .limit = Inf)
  closed_issues_count <- length(closed_issues)

  # Populate the data frame, skipping repos with missing fields
  if (is.not.null(name) && is.not.null(size) && is.not.null(forks)
      && is.not.null(created_year)
      && is.not.null(open_issues_count) && is.not.null(closed_issues_count))
  {
    df_repoinfo <- rbind(df_repoinfo,
                         data.frame(Repo_Name = null_list(name),
                                    size = null_list(size),
                                    forks = null_list(forks),
                                    created_year = null_list(created_year),
                                    open_issues_count = null_list(open_issues_count),
                                    closed_issues_count = null_list(closed_issues_count)))
  }
}
head(df_repoinfo, n = 15) %>% kable() %>% 
  kable_paper(bootstrap_options = "striped", full_width=F)
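| Repo_Name | size | forks | created_year | open_issues_count | closed_issues_count |
|---|---|---|---|---|---|
| aika | 51 | 0 | 2019 | 0 | 0 |
| Aquarium | 5238 | 1 | 2019 | 6 | 13 |
| c-vs-ts-wasm | 41 | 0 | 2018 | 0 | 1 |
| CadregaLisp | 3041 | 0 | 2018 | 0 | 1 |
| caesar-cipher | 240 | 0 | 2019 | 0 | 0 |
| coronablocker | 527 | 2 | 2020 | 1 | 0 |
| cvTracker | 36 | 0 | 2018 | 0 | 0 |
| dat-theme | 2537 | 0 | 2018 | 0 | 1 |
| DNN-Pose-Estimator | 4587 | 1 | 2018 | 0 | 1 |
| editorjs-go | 243 | 2 | 2020 | 1 | 1 |
| ElixirIdenticon | 717 | 0 | 2018 | 0 | 0 |
| Emotion-Detection-ML-Example | 6 | 10 | 2018 | 2 | 1 |
| face-detection | 1769 | 3 | 2018 | 1 | 0 |
| fastify-vite | 3267 | 0 | 2022 | 0 | 0 |
| gauguin | 1256 | 8 | 2020 | 3 | 3 |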
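
One caveat on the closed issue counts: GitHub's issues endpoint returns pull requests as well as issues, and pull requests carry a `pull_request` field. If only plain issues should be counted, the list could be filtered first. The sketch below runs on made-up items (`mock_closed` and `is_plain_issue()` are hypothetical names, not part of the original analysis) and is meant only to show the idea.

```r
# Hypothetical stand-ins for items returned by the repository issues endpoint.
mock_closed <- list(
  list(number = 1L, title = "Crash on start"),
  list(number = 2L, title = "Fix crash",
       pull_request = list(url = "https://example.invalid/pull/2")),
  list(number = 3L, title = "Typo in README")
)

# Keep only items without a `pull_request` field, i.e. plain issues.
is_plain_issue <- function(item) is.null(item$pull_request)
closed_issues_only <- Filter(is_plain_issue, mock_closed)

# Number of closed items that are issues rather than pull requests.
closed_issues_only_count <- length(closed_issues_only)
```

In the loop above, the same filter could be applied to `closed_issues` before taking its length.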

This completes the data collection.

Data Visualization

The user micheleriva has been active on GitHub for several years, so we first look at which years saw the most new repositories. A treemap of repository counts by creation year shows that 2018 is the year in which he created the most repositories.

df_repoinfo$size <- as.numeric(as.character(df_repoinfo$size))
df_repoinfo$forks <- as.numeric(as.character(df_repoinfo$forks))
df_repoinfo$created_year <- as.numeric(as.character(df_repoinfo$created_year))
df_repoinfo$open_issues_count <- as.numeric(as.character(df_repoinfo$open_issues_count))
df_repoinfo$closed_issues_count <- as.numeric(as.character(df_repoinfo$closed_issues_count))

df_repo_summary <- df_repoinfo %>% group_by(created_year) %>% 
  summarise(Repo_Count = n()) 

ggplot(df_repo_summary, aes(area = Repo_Count, fill = created_year, label = created_year)) +
  geom_treemap() +
  geom_treemap_text(colour = "grey", place = "centre", grow = TRUE) +
  scale_fill_viridis_c() +
  ggtitle("micheleriva: Number of repositories created per year")
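
The yearly count behind the treemap can be checked by hand on the 15 repositories shown in the repository preview table. The sketch below rebuilds just those rows (the `preview` data frame is an illustration, not the full `df_repoinfo` with 49 repositories, so the totals are partial) and counts repositories per creation year.

```r
library(dplyr)

# Creation years of the 15 repositories listed in the preview table
# (a subset of the 49 repositories, so the counts are illustrative only).
preview <- tibble::tibble(
  Repo_Name = c("aika", "Aquarium", "c-vs-ts-wasm", "CadregaLisp",
                "caesar-cipher", "coronablocker", "cvTracker", "dat-theme",
                "DNN-Pose-Estimator", "editorjs-go", "ElixirIdenticon",
                "Emotion-Detection-ML-Example", "face-detection",
                "fastify-vite", "gauguin"),
  created_year = c(2019, 2019, 2018, 2018, 2019, 2020, 2018, 2018,
                   2018, 2020, 2018, 2018, 2018, 2022, 2020)
)

# Count repositories per creation year, largest group first.
year_counts <- preview %>%
  count(created_year, name = "Repo_Count") %>%
  arrange(desc(Repo_Count))
```

Even in this subset, 2018 is the largest group with 8 of the 15 repositories, which is consistent with the treemap.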

Errors and issues are common in software development, even in code that has already been checked in and released to production. The number of open issues gives an idea of which repositories may currently have such problems.

Plotting the repositories that have at least one open issue, a few stand out: “krabs”, “vue-product-spinner” and “Aquarium” have a notable number of open issues. The plot also shows that “krabs” and “std”, both in the 2021 panel, have a considerable number of open issues despite the author's experience by then. This suggests that the issues of each repository are specific to it and that prior expertise does not always prevent them.

ggplot(data = df_repoinfo %>% filter(open_issues_count != 0), mapping = aes(
  x = Repo_Name, y = open_issues_count)) +
  geom_point(size = 4, colour = 'mediumseagreen') +
  facet_grid(~created_year) +
  theme_solarized() +
  ggtitle("micheleriva: Repositories with open issues") +
  xlab("Repository Name") + ylab("Number of Open Issues") +
  coord_flip()
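
One possible refinement, sketched here under assumptions rather than taken from the original analysis, is to order the repositories by their open issue count so that the outliers are easier to spot. The `open_preview` data frame below contains only the repositories with open issues from the 15-row preview table, so it does not include "krabs", "std" or "vue-product-spinner".

```r
library(ggplot2)

# Repositories with at least one open issue, taken from the preview table
# (a subset of the full data, used for illustration only).
open_preview <- data.frame(
  Repo_Name = c("Aquarium", "coronablocker", "editorjs-go",
                "Emotion-Detection-ML-Example", "face-detection", "gauguin"),
  open_issues_count = c(6, 1, 1, 2, 1, 3)
)

# Order the factor levels by open issue count so the axis is sorted.
open_preview$Repo_Name <- reorder(open_preview$Repo_Name,
                                  open_preview$open_issues_count)

# Build the plot object; the repository with most open issues ends up on top.
p_open <- ggplot(open_preview, aes(x = Repo_Name, y = open_issues_count)) +
  geom_point(size = 4, colour = "mediumseagreen") +
  xlab("Repository Name") + ylab("Number of Open Issues") +
  coord_flip()
```

Printing `p_open` draws the chart; the same `reorder()` call could be applied to `df_repoinfo$Repo_Name` before the faceted plot above.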