Assignment2-Misra-micheleriva

GitHub user data analysis: micheleriva

The aim of this project is to analyze the GitHub data of a particular user. For this purpose, I use the GitHub API to access the data and then wrangle and analyze it in R.

micheleriva is the GitHub user handle of Michele Riva. The user profile can be found at https://github.com/micheleriva

1. Description: Senior Architect @nearform · @google GDE · @microsoft MVP · Book Author · International Speaker. Company: @nearform; Location: Italy; Email: ciao@micheleriva.it; Website: https://www.micheleriva.it; Twitter handle: @MicheleRivaCode

2. Followers: 469, Following: 442

3. Number of repositories: 49

Project coding

Loading the libraries required in the project

#install.packages("repurrrsive")
#install.packages("tidyverse")
#install.packages("httr")
#install.packages("gh")
#install.packages("ggplot2")
#install.packages("kableExtra")
#install.packages("ggthemes")
#install.packages("treemapify")
#install.packages("ggplotify")

library(repurrrsive)

## Warning: package 'repurrrsive' was built under R version 4.1.3

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.1.3

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1

## Warning: package 'ggplot2' was built under R version 4.1.3

## Warning: package 'purrr' was built under R version 4.1.3

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(httr)

## Warning: package 'httr' was built under R version 4.1.3

library(gh)

## Warning: package 'gh' was built under R version 4.1.3

library(ggplot2)
library(kableExtra)

## Warning: package 'kableExtra' was built under R version 4.1.3

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(ggthemes)

## Warning: package 'ggthemes' was built under R version 4.1.3

library(treemapify)

## Warning: package 'treemapify' was built under R version 4.1.3

library(ggplotify)

## Warning: package 'ggplotify' was built under R version 4.1.3

micheleriva Profile details

Getting the profile details by accessing GitHub with a personal access token (mapped to Shrawani Misra), which is stored in an environment variable. Setting .limit = Inf removes the cap on the number of records fetched.

Showing the login, name, number of public repositories and number of followers of micheleriva in a kable table.

token <- "<personal-access-token>"  # redacted: never publish a real token
Sys.setenv(GITHUB_TOKEN = token)
profile_michele <- gh("GET /users/:username", 
                     username = "micheleriva",.limit=Inf)

profile_michele <- tibble(
  login = profile_michele$login,
  name = profile_michele$name,
  public_repos = profile_michele$public_repos,
  followers = profile_michele$followers
)
head(profile_michele, n = 50) %>% 
  kable() %>% 
  kable_paper(bootstrap_options = "striped", full_width=F)

login	name	public_repos	followers
micheleriva	Michele Riva	49	469
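
Keeping a personal access token out of the script text avoids leaking it when the report is shared. A minimal sketch of one way to do this, assuming the token was stored beforehand in an environment variable (for example in `~/.Renviron`); the helper name `read_github_token()` is hypothetical and not part of the gh package:

```r
# Hypothetical helper (not part of gh): read a token from an environment
# variable and fail loudly when it is missing or empty.
read_github_token <- function(var = "GITHUB_TOKEN") {
  token <- Sys.getenv(var, unset = "")
  if (!nzchar(token)) {
    stop("Environment variable ", var, " is not set", call. = FALSE)
  }
  token
}
```

With such a helper, the script would only contain the variable name, never the token itself.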

Data Collection

Fetching follower data of micheleriva

Fetching the followers of micheleriva using the gh package; the result is later loaded into a data frame. Two helpers guard against missing values: null_list() replaces NULL entries with NA and returns a character vector, and is.not.null() tests that a value is present.

#Get followers info and populate in dataframe
followers_michele <- gh("/users/micheleriva/followers", .limit = Inf)

## i Running gh query

## i Running gh query, got 100 records of about 500

## i Running gh query, got 200 records of about 500

## i Running gh query, got 300 records of about 500

## i Running gh query, got 400 records of about 500

df_followers_michele <- data.frame(User=character(),
                           login=character(), 
                           public_repos=integer(), 
                           followers=integer()) 

null_list <- function(x){map_chr(x, ~{ifelse(is.null(.x), NA, .x)})}
is.not.null <- function(x) !is.null(x)
n <- length(followers_michele)
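
To make the behaviour of such a NULL-to-NA helper concrete, here is a small sketch. It uses a base-R stand-in with the hypothetical name `null_to_na()` rather than the purrr-based `null_list()`, so that it runs without extra packages; the input list is invented for illustration.

```r
# Hypothetical base-R stand-in for null_list(): map NULL entries to NA and
# return every other entry as a character value.
null_to_na <- function(x) {
  vapply(
    x,
    function(el) if (is.null(el)) NA_character_ else as.character(el),
    character(1)
  )
}

# A list with a string, a missing value and an integer
example_values <- list("Jonas Galvez", NULL, 62L)
converted <- null_to_na(example_values)
```

Because every value comes back as character, numeric columns built this way need to be converted back with as.numeric() before plotting.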

Next, we loop over every follower returned by the query (the profile lists 469 followers in total). For each follower, we fetch the profile with the gh package and read the login, name, number of public repositories and number of followers. A row is appended to the data frame only when none of these fields is NULL, so followers with a missing field (for example, no public name) are skipped.

for (i in seq_len(n))
{
  # Login of the i-th follower, used to fetch further data 
  login <- followers_michele[[i]]$login
  
  # Fetch that follower's profile
  f_profile <- gh("GET /users/:login", login = login, .limit = Inf)
  name <- f_profile$name
  public_repos <- f_profile$public_repos
  followers <- f_profile$followers
  
  # Skip followers with any missing field; otherwise append a row
  if (is.not.null(name) && is.not.null(login) && is.not.null(public_repos) 
      && is.not.null(followers))
  {
    df_followers_michele <- rbind(df_followers_michele, data.frame(User=null_list(name),
                                                   login = (login),
                                                   public_repos=null_list(public_repos),
                                                   followers = null_list(followers)))
  }
}


head(df_followers_michele, n = 10) %>% 
  kable() %>% 
  kable_paper(bootstrap_options = "striped", full_width=F)

User	login	public_repos	followers
Jonas Galvez	galvez	62	427
=Bill.Barnhill	BillBarnhill	143	45
Jiten (Jits) Bhagat	jits	18	127
Giorgio Pomettini	Pomettini	51	75
Andrey Esin	esin	33	6209
Antonio Caputo	tonycaputome	27	40
Daniel Vinciguerra	dvinciguerra	97	133
Fabrizio Guglielmino	guglielmino	48	30
X Code	singerdmx	15	401
Tristan de Cacqueray	TristanCacqueray	267	83
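
Growing a data frame with rbind() inside a loop copies the table on every iteration. A minimal sketch of an alternative that builds one row per profile and binds them once at the end. The `mock_profiles` list stands in for the gh() responses: the first two entries reuse values from the table above, the third is invented to show how a missing name is skipped, and `profile_to_row()` is a hypothetical helper.

```r
# Mock profile records standing in for gh() responses (illustrative only).
mock_profiles <- list(
  list(login = "galvez", name = "Jonas Galvez", public_repos = 62L, followers = 427L),
  list(login = "jits", name = "Jiten (Jits) Bhagat", public_repos = 18L, followers = 127L),
  list(login = "no-name-user", name = NULL, public_repos = 3L, followers = 0L)
)

# Turn one profile into a one-row data frame, or NULL if any field is missing.
profile_to_row <- function(p) {
  fields <- list(p[["name"]], p[["login"]], p[["public_repos"]], p[["followers"]])
  if (any(vapply(fields, is.null, logical(1)))) {
    return(NULL)
  }
  data.frame(User = p[["name"]], login = p[["login"]],
             public_repos = p[["public_repos"]], followers = p[["followers"]],
             stringsAsFactors = FALSE)
}

# Build all rows first, drop the skipped profiles, then bind once.
rows <- lapply(mock_profiles, profile_to_row)
rows <- Filter(Negate(is.null), rows)
df_followers_sketch <- do.call(rbind, rows)
```

This keeps the numeric columns as integers, so no later conversion from character is needed.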

Fetching repositories data of micheleriva

We follow the same steps for the repositories of micheleriva: fetch the repository data using the gh package, check for NULL values and populate a data frame. Here the loop runs over 49 observations (the total number of repositories of micheleriva), and for each repository a second query retrieves its closed issues so that they can be counted.

# Get repositories info and populate it in a dataframe
repos_michele <- gh("GET /users/micheleriva/repos", username = "micheleriva",.limit = Inf)
length(repos_michele)

## [1] 49

df_repoinfo <- data.frame(Repo_Name=character(),
                               size=integer(), 
                               forks=integer(), 
                               created_year=integer(),
                               open_issues_count=integer(),
                               closed_issues_count=integer()) 

for (i in seq_along(repos_michele))
{ 
  # Details of the i-th repo, used to fetch further data 
  name <- repos_michele[[i]]$name
  size <- repos_michele[[i]]$size
  created_year <- as.integer(substring(repos_michele[[i]]$created_at,1,4))
  forks <- repos_michele[[i]]$forks_count
  open_issues_count <- repos_michele[[i]]$open_issues_count
  
  # Closed issues are not part of the repo record, so query them separately
  closed_issues_url <-
    paste0(repos_michele[[i]]$url,"/issues?state=closed")
  
  closed_issues <- gh(closed_issues_url,username = "micheleriva",.limit = Inf)
  closed_issues_count <- length(closed_issues)
  
  # Populate data to data frame
  if (is.not.null(name) && is.not.null(size) && is.not.null(forks)
      && is.not.null(created_year)
      && is.not.null(open_issues_count) && is.not.null(closed_issues_count))
  {
    df_repoinfo<-rbind(df_repoinfo, data.frame(Repo_Name = null_list(name),
                                                         size = null_list(size),
                                                         forks = null_list(forks),
                                                         created_year = null_list(created_year),
                                                         open_issues_count = null_list(open_issues_count),
                                                         closed_issues_count = null_list(closed_issues_count)))
  }
  
}
head(df_repoinfo, n = 15) %>% kable() %>% 
  kable_paper(bootstrap_options = "striped", full_width=F)

Repo_Name	size	forks	created_year	open_issues_count	closed_issues_count
aika	51	0	2019	0	0
Aquarium	5238	1	2019	6	13
c-vs-ts-wasm	41	0	2018	0	1
CadregaLisp	3041	0	2018	0	1
caesar-cipher	240	0	2019	0	0
coronablocker	527	2	2020	1	0
cvTracker	36	0	2018	0	0
dat-theme	2537	0	2018	0	1
DNN-Pose-Estimator	4587	1	2018	0	1
editorjs-go	243	2	2020	1	1
ElixirIdenticon	717	0	2018	0	0
Emotion-Detection-ML-Example	6	10	2018	2	1
face-detection	1769	3	2018	1	0
fastify-vite	3267	0	2022	0	0
gauguin	1256	8	2020	3	3
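
One caveat on the closed-issue counts: GitHub's REST API treats pull requests as issues, so a list returned by an issues endpoint can contain both, and pull requests are marked by a `pull_request` field. A minimal sketch of counting only true issues, using a mocked response list (the records below are invented for illustration) and a hypothetical helper `count_issues_only()`:

```r
# Mocked response: two plain issues and one pull request (illustrative only).
mock_closed <- list(
  list(number = 1L, title = "Bug report"),
  list(number = 2L, title = "Fix typo", pull_request = list(url = "placeholder")),
  list(number = 3L, title = "Feature request")
)

# Hypothetical helper: count the records that carry no pull_request field.
count_issues_only <- function(items) {
  is_issue <- vapply(items, function(it) is.null(it[["pull_request"]]), logical(1))
  sum(is_issue)
}

closed_issue_total <- count_issues_only(mock_closed)
```

Whether to exclude pull requests depends on what the count is meant to capture; the table above counts every record returned by the query.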

Thus our data collection has been done.

Data Visualization

The user micheleriva has been active on GitHub for several years, so we look at how his repositories are distributed over the years in which they were created. A treemap of the number of repositories per creation year shows that 2018 is the year in which he created the most repositories.

df_repoinfo$size <- as.numeric(as.character(df_repoinfo$size))
df_repoinfo$forks <- as.numeric(as.character(df_repoinfo$forks))
df_repoinfo$created_year <- as.numeric(as.character(df_repoinfo$created_year))
df_repoinfo$open_issues_count <- as.numeric(as.character(df_repoinfo$open_issues_count))
df_repoinfo$closed_issues_count <- as.numeric(as.character(df_repoinfo$closed_issues_count))

df_repo_summary <- df_repoinfo %>% group_by(created_year) %>% 
  summarise(Repo_Count = n()) 

ggplot(df_repo_summary, aes(area = Repo_Count, fill = created_year, label=created_year)) +
  geom_treemap() +
  geom_treemap_text( colour = "grey", place = "centre",grow = TRUE)+
  scale_fill_viridis_c()+
  ggtitle("micheleriva: Number of repositories created per year")
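
The per-year count behind the treemap can be sketched in base R. The example below uses only the created_year values of the 15 repositories displayed in the table earlier, not the full set of 49, so its counts are partial; the name `repo_summary_sketch` is chosen for illustration.

```r
# created_year values of the 15 repositories shown in the table
# (the full data set has 49 repositories, so these counts are partial).
created_year <- c(2019, 2019, 2018, 2018, 2019, 2020, 2018, 2018,
                  2018, 2020, 2018, 2018, 2018, 2022, 2020)

# Count repositories per creation year, a base-R analogue of
# group_by(created_year) %>% summarise(Repo_Count = n()).
year_counts <- table(created_year)
repo_summary_sketch <- data.frame(
  created_year = as.integer(names(year_counts)),
  Repo_Count = as.integer(year_counts)
)

# Year with the most repositories among the displayed rows
busiest_year <- repo_summary_sketch$created_year[which.max(repo_summary_sketch$Repo_Count)]
```

Among these 15 rows, 2018 is already the most frequent creation year, in line with the treemap.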

Errors and issues are easy to introduce, even in code that has been checked in for production. The count of open issues gives an idea of which repositories may be affected.

Plotting the repositories that have at least one open issue, we can see that a few of them stand out, such as “krabs”, “vue-product-spinner” and “Aquarium”, which have a comparatively large number of open issues. The plot is faceted by creation year, and “krabs” and “std”, which appear in the 2021 panel, still show a notable number of open issues despite the author's experience by then. This suggests that the issues of each repository are unique and that prior expertise does not always prevent them.

ggplot(data = df_repoinfo %>% filter(open_issues_count!=0), mapping = aes(
  x=Repo_Name, y=open_issues_count)) +
  geom_point(size=4, colour='mediumseagreen')+
  facet_grid(~created_year)+
  theme_solarized()+
  ggtitle("micheleriva: Repositories with open issue counts")+
  xlab("Repository Name") + ylab("Number of Open issues")+
    coord_flip()

Shrawani Misra

30/03/2022
