Overview

Public commit activity can provide insight into open source contribution activity among top technology companies with a GitHub presence.

Methodology

Given a list of companies of interest, identify the activity of authors who recently committed to repositories in each company's official GitHub organizations. Repositories and organizations owned by the author are excluded; repositories belonging to organizations of which the author is a member are not. Only committers from the past month are sampled. Repositories reported in the final results must show significant activity, either through the number and consistency of commits by an author in the recent past or through the number of authors with commits on the project.
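
The ownership rule above can be sketched in base R as follows. This is a minimal illustration, not the report's own code: `is_author_owned` is a hypothetical helper and the logins and repositories are made up. It assumes repository names have the form owner/name and treats a repository as author-owned when the owner segment matches the author's login, ignoring case.

```r
# Hypothetical helper: TRUE when the repository's owner segment equals the
# author's login (case-insensitive), i.e. the author owns the repository.
is_author_owned <- function(login, repo_full_name) {
  owner <- sub("/.*$", "", repo_full_name)  # text before the first "/"
  tolower(owner) == tolower(login)
}

# Toy data; logins and repositories are made up for illustration.
activity <- data.frame(
  login = c("octocat", "octocat", "hubot"),
  repo  = c("octocat/dotfiles", "kubernetes/kubernetes", "Hubot/scripts"),
  stringsAsFactors = FALSE
)

# Keep only activity on repositories the author does not own.
kept <- activity[!is_author_owned(activity$login, activity$repo), ]
print(kept)
```

Repositories under an organization the author merely belongs to pass this filter, which matches the rule stated above.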

Assumptions

  • The majority of committers on official company repositories were employees of the company at the time the commit was made.
  • Significant activity from company-identified committers on other repositories is company-sanctioned.

Companies of Interest

The following companies were identified as being of interest based on an internal report. I was not able to determine why these companies were chosen.

  • Amazon
  • Cisco
  • Facebook
  • Fujitsu
  • Google
  • Huawei
  • IBM
  • Microsoft
  • Oracle
  • Pivotal
  • Redhat

I’ve added the following companies to the report based on their activity in AI Open Source projects I analyze on a monthly basis.

  • Apple
  • Baidu
  • Intel
  • NVIDIA
  • Samsung
# TODO: This description needs to go into overview

# Organizations were found using the following search terms: <company name> github organization
# Organizations were manually verified
# Any listed as a "foundation" or lacking a company logo or company website url were excluded.

# TODO: cite these as sources/footnotes

# https://github.com/collections/open-source-organizations

# Intel - https://software.intel.com/en-us/code-samples/github
# Microsoft - https://opensource.microsoft.com/
# Amazon - https://amzn.github.io/

companies <- read_csv('
Apple,apple
Amazon.com,amzn
Amazon.com,alexa
Amazon.com,aws
Amazon.com,awslabs
Baidu,baidu
Baidu,ecomfe
Baidu,baidu-research
Baidu,baidu-aip
Baidu,fex-team
Cisco,cisco
Cisco,ciscodevnet
Facebook,facebook
Fujitsu,fujitsu
Google,google
Google,googlesamples
Google,googlecloudplatform
Huawei,huawei
Huawei,liteos
Huawei,huaweibigdata
Huawei,huaweicloud
Huawei,huawei-clouds
IBM,ibm
IBM,ibm-cloud
IBM,ibmresearch
IBM,ibmdatascience
IBM,ibm-watson-iot
IBM,watson-explorer
IBM,watson-developer-cloud
Intel,intel
Intel,01org
Intel,intellabs
Intel,intel-bigdata
Intel,intel-cloud
Microsoft,microsoft
Microsoft,azure
Microsoft,aspnet
Microsoft,powershell
NVIDIA,nvidia
NVIDIA,nvidiagameworks
NVIDIA,nvlabs
Oracle,oracle
Pivotal,pivotal-cf
Redhat,rht-labs
Redhat,redhat-openstack
Redhat,redhat-developer
Samsung,samsung
', col_names=c("name","github_org"))

Data Collection

Github Commits

The following functions use the GitHub API to identify contributors for companies of interest. Official company GitHub organizations were looked up manually and are provided in the companies data structure defined above.

Github API Base Functions

query_params <- list(
  client_id=params$gh_id, 
  client_secret=params$gh_secret)

get_gh_resp <- function (url, query) {
  req <- GET(url, query=query)
  json <- content(req, as = "text")
  resp <- fromJSON(json, flatten = TRUE)
  #resp_df <-  resp %>% unlist() %>% as.data.frame.list()
  return(resp)
}

Latest Active Repositories per Organization

The following functions query the Github API to return a list of the most recently updated repositories for each Github organization.

# get one page of repos for an organization (no updated-date filter is applied here)

get_gh_org_repos <- function (org, url, query) {
  org_url <- str_replace(url, ":org", org)
  # return the parsed response directly
  get_gh_resp(org_url, query)
}

orgs_url <- "https://api.github.com/orgs/:org/repos"

org_repos = data_frame()
for (n in 1:nrow(companies)) {
  org <- companies$github_org[n]
  org_resp <- get_gh_org_repos(org, orgs_url, query_params)
  
  if (!is.data.frame(org_resp)) {
    print(paste(org, org_resp$message))
    next()
  }
  
  # add the github org and company name for reference
  org_resp <- org_resp %>% 
    mutate(company=companies$name[n], github_org=org)
  
  org_repos <- bind_rows(org_resp, org_repos)
}

write_rds(org_repos, "data/org_repos.Rds")
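
The loop above makes one request per organization. The GitHub REST API returns list results in pages (30 items by default, up to 100 with per_page) and advertises further pages in a Link response header, so large organizations are probably truncated here. The sketch below shows one way to extract the next page URL from such a header. `next_page_url` is a hypothetical helper and the header string is illustrative; with httr the header would typically be read from `headers(req)$link`.

```r
# Hypothetical helper: return the URL tagged rel="next" in a Link header,
# or NA when there is no further page.
next_page_url <- function(link_header) {
  if (is.null(link_header) || is.na(link_header) || !nzchar(link_header)) {
    return(NA_character_)
  }
  # Each entry looks like: <https://...>; rel="next"
  entries <- trimws(strsplit(link_header, ",", fixed = TRUE)[[1]])
  is_next <- grepl('rel="next"', entries, fixed = TRUE)
  if (!any(is_next)) {
    return(NA_character_)
  }
  sub("^<([^>]+)>.*$", "\\1", entries[is_next][1])
}

# Illustrative header value, not captured from a real response.
example_header <- paste0(
  '<https://api.github.com/organizations/1/repos?page=2>; rel="next", ',
  '<https://api.github.com/organizations/1/repos?page=5>; rel="last"'
)
print(next_page_url(example_header))
```

A paging loop would keep requesting the returned URL and binding rows until the helper returns NA.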

Latest Commits on Organization Repositories

The following functions query the Github API to return the latest commits on the most recently updated repositories for each company.

get_gh_commits <- function (url, query) {
  req <- GET(url, query=query, accept("application/vnd.github.cloak-preview"))
  json <- content(req, as = "text")
  resp <- fromJSON(json, flatten = TRUE)
  #resp_df <-  resp %>% unlist() %>% as.data.frame.list()
  return(resp)
}

org_repos <- read_rds("data/org_repos.Rds")
org_commits <- data_frame()

for (n in 1:nrow(org_repos)) {
  commits_url <- str_replace(org_repos$commits_url[n], "\\{/sha\\}", "")
  org_commits_resp <- get_gh_commits(commits_url, query_params)
  
  if (!is.data.frame(org_commits_resp)) {
    print(paste(org_repos$full_name[n], org_commits_resp$message))
    next()
  }
  
  # add the github org and company name for reference
  org_commits_resp <- org_commits_resp %>% 
    mutate(company=org_repos$company[n], github_org=org_repos$github_org[n], repo=org_repos$full_name[n])
  
  org_commits <- bind_rows(org_commits, org_commits_resp)
}

write_rds(org_commits, "data/org_commits.Rds")

# get commits from the past 4 weeks
commits_since <- today() - weeks(4)

org_commits_latest <- org_commits %>%
  filter(commit.committer.date > commits_since)

write_rds(org_commits_latest, "data/org_commits_latest.Rds")
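
The filter above compares the API's timestamp strings with a Date and relies on implicit conversion. A more explicit variant, sketched below in base R, parses the timestamps first. It assumes the timestamps are UTC strings such as 2018-05-20T12:00:00Z; `parse_gh_time` and `filter_recent` are hypothetical helpers and the commits are toy data.

```r
# Parse timestamps of the form "2018-05-20T12:00:00Z" as UTC date-times.
parse_gh_time <- function(x) {
  as.POSIXct(x, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
}

# Keep rows whose `date` column falls within the last `days` days before `now`.
filter_recent <- function(commits, now, days = 28) {
  cutoff <- now - as.difftime(days, units = "days")
  commits[parse_gh_time(commits$date) > cutoff, , drop = FALSE]
}

# Toy commits; shas and dates are made up for illustration.
commits <- data.frame(
  sha  = c("a1", "b2", "c3"),
  date = c("2018-05-20T12:00:00Z", "2018-04-01T08:30:00Z", "2018-05-01T00:00:00Z"),
  stringsAsFactors = FALSE
)

# A fixed "now" keeps the example reproducible.
now <- as.POSIXct("2018-05-27 00:00:00", tz = "UTC")
recent <- filter_recent(commits, now)
print(recent$sha)
```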

Affiliated Authors

The following code builds a list of authors from the committer information on commits pulled from the organization repositories, using the GitHub user information returned by the GitHub API.

org_commits_latest <- read_rds("data/org_commits_latest.Rds")

committers <- org_commits_latest %>%
  group_by(company) %>%
  mutate(num_orgs = n_distinct(github_org),
         num_repos = n_distinct(repo)) %>%
  group_by(company, committer.login) %>%
  summarise(num_org_commits=n(), num_orgs=first(num_orgs), num_repos=first(num_repos))

write_rds(committers, "data/committers.Rds")
write_csv(committers, "data/committers.csv")
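
The per-committer counts above can be illustrated on toy data. The base R sketch below also drops rows with a missing login and the web-flow account (the committer GitHub records for commits made through its web interface), which the later sampling steps in this report filter out as well. The company names and logins are made up.

```r
# Toy commit log; company names and logins are made up for illustration.
commits <- data.frame(
  company = c("Acme", "Acme", "Acme", "Acme", "Globex"),
  login   = c("alice", "alice", "web-flow", NA, "carol"),
  stringsAsFactors = FALSE
)

# Drop commits with no linked login and commits made through the web UI.
keep <- !is.na(commits$login) & commits$login != "web-flow"
human <- commits[keep, ]

# Count commits per company and committer.
human$n <- 1L
per_committer <- aggregate(n ~ company + login, data = human, FUN = sum)
print(per_committer)
```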

The following plot shows the distribution of the number of authors per company, colored by the number of GitHub organizations sampled for that company.

committers <- read_rds("data/committers.Rds")

# Plot distribution of authors per company
ggplot(committers, aes(x=company)) +
  geom_bar(aes(fill=factor(num_orgs))) +
  coord_flip() +
  labs(title="Distribution of Authors",
       x="Company", 
       y="Count of Committers active in the past Month on Company-owned Repos", 
       fill="# Github Orgs")

Affiliated Authors Sample

Group committers by company and take a random sample of 10 committers per company. If a company has fewer than 10, use all of the ones we identified.
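
The sampling rule can be sketched in base R as a single capped per-group sample. `sample_per_group` is a hypothetical helper and the data are made up; the report's own code reaches the same result by sampling large groups and binding the small ones back on.

```r
# Hypothetical helper: sample up to `size` rows from each group,
# keeping every row when a group has `size` rows or fewer.
sample_per_group <- function(df, group_col, size) {
  pieces <- lapply(split(df, df[[group_col]]), function(g) {
    if (nrow(g) <= size) {
      return(g)
    }
    g[sample.int(nrow(g), size), , drop = FALSE]
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Toy committers: one large company and one small one.
set.seed(42)
committers <- data.frame(
  company = c(rep("Acme", 25), rep("Globex", 4)),
  login   = paste0("user", 1:29),
  stringsAsFactors = FALSE
)

sampled <- sample_per_group(committers, "company", 10)
print(table(sampled$company))
```

Sampling row indices with sample.int avoids the base R pitfall where sample() on a single number samples from 1 to that number.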

committers <- read_rds("data/committers.Rds")

# add org info
org_commits_latest <- read_rds("data/org_commits_latest.Rds")
committers <- committers %>% 
  inner_join(org_commits_latest %>% select(committer.login, github_org)) %>%
  unique()

# num committers
committers <- committers %>%
  group_by(company) %>%
  mutate(num_committers = n_distinct(committer.login))

write_rds(committers, "data/committers.Rds")

# build sample base

committers_sample_base <- committers %>%
  filter(! is.na(committer.login) & ! committer.login %in% c("web-flow"))

committers_less_than_sample_size <- committers_sample_base %>%
  filter(num_committers < 10)

committers_sample <- committers_sample_base %>%
  filter(num_committers >= 10) %>%
  sample_n(10)

committers_sample <- bind_rows(committers_sample, committers_less_than_sample_size)

write_rds(committers_sample, "data/committers_sample.Rds")
write_csv(committers_sample, "data/committers_sample.csv")
committers_sample <- read_rds("data/committers_sample.Rds")
# Plot distribution of committers per company
ggplot(committers_sample, aes(x=company)) +
  geom_bar(aes(fill=num_committers < 10)) +
  coord_flip() +
  labs(title="Author Sample Size Distribution",
       x="Company", 
       y="Author Sample Size per Company")

Affiliated Authors Commits

The following function uses the Github API to look up commits belonging to the authors identified from the organizations’ commits. Repositories belonging to the author are excluded.
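
A commit search query combines qualifiers such as author: and author-date:. The sketch below shows one way to assemble and URL-encode such a query in base R. `build_commit_search_url` is a hypothetical helper, not the function used in this report, and the login is a placeholder.

```r
# Hypothetical helper: build a commit search URL for one author,
# optionally restricted to commits authored after `since` (YYYY-MM-DD).
build_commit_search_url <- function(login, since = NULL) {
  terms <- paste0("author:", login)
  if (!is.null(since)) {
    terms <- c(terms, paste0("author-date:>", since))
  }
  # Encode the whole query so characters such as ":", ">" and spaces are safe.
  q <- utils::URLencode(paste(terms, collapse = " "), reserved = TRUE)
  paste0("https://api.github.com/search/commits?q=", q)
}

url <- build_commit_search_url("octocat", since = "2018-04-01")
print(url)
```

Encoding the query as a whole avoids hand-assembling separators and keeps the date comparison operator intact in the request.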

Use a sample due to GitHub Search API rate limiting:

get_gh_commits <- function (url, query) {
  req <- GET(url, query=query, accept("application/vnd.github.cloak-preview"))
  json <- content(req, as = "text")
  resp <- fromJSON(json, flatten = TRUE)
  #resp_df <-  resp %>% unlist() %>% as.data.frame.list()
  return(resp)
}

search_gh_commits_by_login <- function(login, orgname, query_params) {
  commits_search_url <- "https://api.github.com/search/commits?q="
  
  commits_search_author <- paste("author", login, sep=":")
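  # NOTE: this date qualifier is built but never added to the query below,
  # and orgname is unused, so the search is not restricted by date or org.
  commits_search_date <- paste("author-date", ">2018-04-01", sep=":")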
  
  commits_search_query <- paste(commits_search_author, sep="+")
  commits_search_url <- paste(commits_search_url, commits_search_query, sep="")
  print(commits_search_url)
  
  commits_resp <- get_gh_commits(commits_search_url, append(query_params, c(per_page=100)))
  return(commits_resp)
}

committers_sample <- read_rds("data/committers_sample.Rds")

affiliated_commits <- data_frame()
for (n in 1:nrow(committers_sample)) {
  print(paste(committers_sample$company[n], committers_sample$committer.login[n]))
  
  # Github API rate limits Search requests to 30 per minute
  if(n %% 30 == 0) {
    print(paste(n, "Rate Limit Nap"))
    Sys.sleep(60)
  }

  commits_resp <- search_gh_commits_by_login(
    committers_sample$committer.login[n], 
    committers_sample$github_org[n], 
    query_params)
  
  commits <- commits_resp$items
  
  commits <- commits %>% 
    mutate(login=committers_sample$committer.login[n], 
           company=committers_sample$company[n],
           github_org=committers_sample$github_org[n])

  affiliated_commits <- bind_rows(affiliated_commits, commits)
}

write_rds(affiliated_commits, "data/affiliated_commits_sample.Rds")
write_csv(affiliated_commits %>% select(-parents), "data/affiliated_commits_sample.csv")
affiliated_commits <- read_rds("data/affiliated_commits_sample.Rds")

affiliated_commits_summary <- affiliated_commits %>%
  filter(login == author.login) %>%
  group_by(company, repository.full_name) %>%
  summarise(num_authors=n_distinct(login), num_commits=n())

write_rds(affiliated_commits_summary, "data/affiliated_commits_sample_summary.Rds")
write_csv(affiliated_commits_summary, "data/affiliated_commits_sample_summary.csv")
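
The loop above pauses for 60 seconds on every 30th iteration, following the 30 requests per minute limit noted in its comment. An alternative is a sliding window over request timestamps, sketched below. `seconds_to_wait` is a hypothetical helper; times are plain numbers of seconds, and the limit and window defaults simply repeat the figures from that comment.

```r
# Hypothetical helper: how long to wait before the next request so that no
# more than `limit` requests fall inside any `window`-second span.
seconds_to_wait <- function(request_times, now, limit = 30, window = 60) {
  recent <- request_times[request_times > now - window]
  if (length(recent) < limit) {
    return(0)
  }
  # Wait until the oldest of the last `limit` requests leaves the window.
  oldest <- sort(recent, decreasing = TRUE)[limit]
  max(0, oldest + window - now)
}

# Thirty requests sent at seconds 1..30; a 31st at second 31 must wait.
print(seconds_to_wait(1:30, now = 31))
```

In a request loop this could be used as `Sys.sleep(seconds_to_wait(times, as.numeric(Sys.time())))`, appending the current time to `times` after each request.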

Pull commit activity for the full list (takes a while due to GitHub Search API rate limiting):

get_gh_commits <- function (url, query) {
  req <- GET(url, query=query, accept("application/vnd.github.cloak-preview"))
  json <- content(req, as = "text")
  resp <- fromJSON(json, flatten = TRUE)
  #resp_df <-  resp %>% unlist() %>% as.data.frame.list()
  return(resp)
}

search_gh_commits_by_login <- function(login, orgname, query_params) {
  commits_search_url <- "https://api.github.com/search/commits?q="
  
  commits_search_author <- paste("author", login, sep=":")
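  # NOTE: this date qualifier is built but never added to the query below,
  # and orgname is unused, so the search is not restricted by date or org.
  commits_search_date <- paste("author-date", ">2018-04-01", sep=":")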
  
  commits_search_query <- paste(commits_search_author, sep="+")
  commits_search_url <- paste(commits_search_url, commits_search_query, sep="")
  print(commits_search_url)
  
  commits_resp <- get_gh_commits(commits_search_url, append(query_params, c(per_page=100)))
  return(commits_resp)
}

committers <- read_rds("data/committers.Rds")

committers <- committers %>%
  filter(!is.na(committer.login) & 
           !committer.login %in% (
             "web-flow"
             ))

#TODO uncomment affiliated_commits <- data_frame()
for (n in 1:nrow(committers)) {
  committer <- as_data_frame(committers[n,])
  print(paste(committer$company, committer$committer.login))
  
  # check if already in commits
  if(committer$committer.login %in% affiliated_commits$login) {
    print(paste(n, ":", "already pulled commits for", committer$committer.login))
    next()
  }
  
  # Github API rate limits Search requests to 30 per minute
  if(n %% 30 == 0) {
    print(paste(n, "Rate Limit Nap"))
    Sys.sleep(60)
  }

  commits_resp <- search_gh_commits_by_login(
    committer$committer.login, 
    committer$github_org, 
    query_params)
  
  commits <- commits_resp$items

  if(length(commits) == 0) {
    print(paste(n, ":", "no commits for", committer$committer.login))
    commits <- data_frame()
  }

  # when using committer df, gets error "Evaluation error: $ operator is invalid for atomic vectors."
  commits <- commits %>% 
    mutate(login=committers$committer.login[n], 
           company=committers$company[n],
           github_org=committers$github_org[n])

  affiliated_commits <- bind_rows(affiliated_commits, commits)
}

write_rds(affiliated_commits, "data/affiliated_commits.Rds")
write_csv(affiliated_commits %>% select(-parents), "data/affiliated_commits.csv")
affiliated_commits <- read_rds("data/affiliated_commits.Rds")

affiliated_commits_summary <- affiliated_commits %>%
  filter(login == author.login) %>%
  separate(repository.full_name, c("org", "repo"), sep="/", remove=FALSE) %>%
  filter(str_to_lower(login) != str_to_lower(org)) %>% # remove repos the actor owns
  group_by(company, repository.full_name) %>%
  summarise(num_authors=n_distinct(login), 
            num_commits=n(),
            org=first(org),
            repo=first(repo))

write_rds(affiliated_commits_summary, "data/affiliated_commits_summary.Rds")
write_csv(affiliated_commits_summary, "data/affiliated_commits_summary.csv")

Affiliated Authors Events

The following function uses the Github Archive to look up events belonging to the authors identified from the organizations’ commits.

Build the SQL query for Google BigQuery using our contributor list.

For the actual query run for this dataset, see - (https://bigquery.cloud.google.com/savedquery/306220071795:2bc874b1376e4eb58f5bf5cd20cf979c)

# extract github logins
logins <- read_rds("data/committers.Rds")
logins <- logins %>%
  filter(!is.na(committer.login),
         !committer.login %in% c("web-flow")) %>%
  select(company, committer.login)

query <- paste0("SELECT * FROM [githubarchive:month.201805], [githubarchive:month.201804] WHERE actor.login IN('",
                paste(logins$committer.login, collapse="','"),
                "')")

print(query)
## [1] "SELECT * FROM [githubarchive:month.201805], [githubarchive:month.201804] WHERE actor.login IN('ajot','akiyano','almann','andrewlewis','ask-sdk','awood45','awstools','baldwinmatt','breedloj','btmash','cbommas','chrisradek','cjyclaire','ekandrotA9','etwillbefine','fenxiong','franklin-lobb','frozenberg','fukajun','haikuoliu','jahkeup','JakeMKelly','jasdel','joguSD','kstich','kvasukib','lukeseawalker','mattsb42-aws','MicheleBeargroup','mikemaas-amazon','minbi','mingma7','normj','petderek','pfifer','phillipberndt','richardpen','robm26','rohandubal','rohkat-aws','sahilpalvia','SalusaSecondus','sarah-johnson','sayalee','schroffd','sean-smith','sharanyad','spati2','sstevenkang','stealthycoin','steveataws','sungolivia','tgregg','tianrenz','wfus','aciidb0mb3r','adrian-prantl','alblue','AnthonyLatsis','aschwaighofer','bcardosolopes','BenchR267','benrimmington','chandlerc','cheshire','davezarzycki','dcci','dmbryson','DougGregor','gottesmm','hansmi','hyp','jakepetroules','JDevlieghere','jfbastien','jimingham','kubamracek','lattner','michaelrsweet','milseman','nkcsgexi','parkera','repzret','rintaro','rudkx','shajrawi','speachy','spevans','swift-ci','tkremenek','TNorthover','topperc','tstellar','vedantk','vsapsai','xedin','zayass','ZolotukhinM','Dafrok','dagamayank','dingelish','dmudiger','errorrik','flyhighzy','iMuduo','jhunters','LeuisKen','loatheb','luyuan','lvgreenTian','meteorasd555','Ovilia','sharannarang','zhangtao07','gilesheron','linuxwolf','mikewiebe','mstorsjo','pabuhler','sijchen','strfry','acdlite','facebook-github-bot','gaearon','gfosco','hhvm-bot','jingping2015','nlutsenko','PaulTaykalo','artygus','bshaffer','fhinkel','googleapis-publisher','houglum','mfschwartz','nlativy','ronshapiro','ryanmats','blueliuyun','cjj0144','edisonxiang','freesky-edward','iDiy','LiteIOT','modular-magician','niuzhenguo','rambleraptor','Savasw','solin319','Supowang08','SuYai','tenji','tmx1991','twowinter','youxiaobo','zengchen1024','abdonrd','anweshan','dpopp07','emmajdaws','germanattanasio','jerry1100','jesusprubio','komedani','lpatino10','srl295','stevemart','vpham16','adrian-wang','ahkok','bbian','carsonwang','ColinIanKing','ehsantn','gczsjdy','linhong-intel','littlezhou','qiyuangong','spandruvada','uartie','xhaihao','xuguangxin','ajaybhargavb','ajcvickers','amarzavery','AndriySvyryd','anpete','aspnetci','bergmeister','bgelens','boumenot','bricelam','bulentelmaci','daschult','dougbu','erezvani1529','finiteattractor','frankycrm14','HaoK','hglkrijger','hyonholee','JamesWTruher','jaredmoo','jianghaolu','jkotalik','johlju','JunTaoLuo','katmsft','kichalla','kwirkykat','lmazuel','maumar','milismsft','natemcmaster','NTaylorMullen','pakrym','PlagueHO','pmiddleton','pranavkm','rajshah11','rickle-msft','RikkiGibson','ryanbrandenburg','rynowak','sanbhatt','sarangan12','shahabhijeet','smitpatel','Tratcher','vinjiang','weltling','aaronp24','chrschorn','kbrenneman','kyaoNV','larsbishop','mingyuliutw','nbenty','pixar-oss','tmbdev','xunhuang1995','cjbj','Djelibeybi','doxiao','gvenzl','idodeclare','malibuworkcrew','mjwallin1','Orviss','rosemarymarano','vladak','albertoleal','animatedmax','apps-manager','bentarnoff','blgm','cf-london','cf-meganmoore','cshollingsworth','dlresende','dyozie','edwardecook','flangewad','georgeharley','ghanna2017','jacknewberry','jberney','jncd','kdolor-pivotal','kirederik','mingxiao','mkowalski','mlimonczenko','mrosecrance','ojhughes','parth-pandit','Pivotal-Christopher-Wong','pspinrad','reid47','seviet','snneji','spring-buildmaster','StevenLocke','terminatingcode','vikafed','fbricon','fche','gorkem','ibuziuk','Katka92','l0rd','LightGuard','oybed','pmkovar','rhopp','ScrewTSW','sgayou','sherl0cks','springdo','tdbeattie','Tompage1994','xsuchy','gombos','liaxim','NolaDonato','yichoi','zherczeg')"
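
The query above uses BigQuery legacy SQL, where a comma between tables means a union of those tables. Because the login list is pasted straight into the SQL text, a small validation step is worth considering. The sketch below assumes plain user logins made of letters, digits and hyphens; `build_events_query` is a hypothetical helper and the logins are placeholders.

```r
# Hypothetical helper: build the legacy SQL query for a set of logins and
# monthly tables, refusing logins with unexpected characters.
build_events_query <- function(logins, months) {
  logins <- unique(logins[!is.na(logins)])
  valid <- grepl("^[A-Za-z0-9-]+$", logins)
  if (!all(valid)) {
    stop("unexpected characters in login(s): ",
         paste(logins[!valid], collapse = ", "))
  }
  tables <- paste0("[githubarchive:month.", months, "]", collapse = ", ")
  in_list <- paste0("'", logins, "'", collapse = ",")
  paste0("SELECT * FROM ", tables, " WHERE actor.login IN(", in_list, ")")
}

# Placeholder logins; duplicates and NA values are dropped before quoting.
query <- build_events_query(c("octocat", "hubot", NA, "octocat"),
                            c("201805", "201804"))
print(query)
```

Logins that contain other characters would need to be handled separately rather than rejected.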

The resulting query was run in Google BigQuery (it was faster to do this through the web UI) and the results were saved to a table.

An archive of this table is available here: (https://storage.googleapis.com/open_source_community_metrics_exports/cbc_org-repo-network_org-by-repo-summary/login-events_20180527.csv)

login_events <- read_csv("https://storage.googleapis.com/open_source_community_metrics_exports/cbc_org-repo-network_org-by-repo-summary/login-events_20180527.csv")

# add company info
logins <- read_rds("data/committers.Rds")
logins <- logins %>%
  filter(!is.na(committer.login),
         !committer.login %in% c("web-flow")) %>%
  select(company, committer.login) %>%
  rename(actor_login=committer.login)

login_events <- login_events %>%
  inner_join(logins)

write_rds(login_events, "data/login_events.Rds")
login_events <- read_rds("data/login_events.Rds")

login_events_summary <- login_events %>%
  separate(repo_name, c("org", "repo"), sep="/", remove=FALSE) %>%
  filter(str_to_lower(actor_login) != str_to_lower(org)) %>% # remove repos the actor owns
  group_by(company, repo_name, type) %>%
  summarise(num_authors=n_distinct(actor_login), 
            num_events=n(),
            org=first(org),
            repo=first(repo))

write_rds(login_events_summary, "data/login_events_summary.Rds")
write_csv(login_events_summary, "data/login_events_summary.csv")

Analysis

Project Demographics

Language

Committer Diversity

Maturity

Type

Significant Commit Activity

Based on the commit log, what significant activity do we see from the committers?

TODO: Don’t define significance as number of commits, define as consistency over a broader time in the past
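
One possible reading of the TODO above is to measure consistency as the number of distinct weeks in which an author committed, alongside the raw commit count. The base R sketch below is only an illustration of that idea: the seven-day bins, the toy commits and the variable names are all assumptions, not a definition adopted by this report.

```r
# Toy commits: alice commits once a week, bob commits in a two-day burst.
commits <- data.frame(
  login = c(rep("alice", 4), rep("bob", 4)),
  repo  = "org/project",
  date  = as.Date(c("2018-04-02", "2018-04-10", "2018-04-18", "2018-04-26",
                    "2018-05-01", "2018-05-01", "2018-05-02", "2018-05-02")),
  stringsAsFactors = FALSE
)

# Assign each commit to a seven-day bin counted from 1970-01-01.
week_bin <- as.integer(commits$date) %/% 7L

# Same commit count, very different spread over time.
n_commits    <- tapply(commits$date, commits$login, length)
weeks_active <- tapply(week_bin, commits$login, function(w) length(unique(w)))
print(data.frame(n_commits, weeks_active))
```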

TODO: Maybe don’t separate events and commits?

affiliated_commits_summary <- read_rds("data/affiliated_commits_summary.Rds")

affiliated_commits_summary <- affiliated_commits_summary %>%
  mutate(org=str_to_lower(org)) %>%
  left_join(companies %>% rename(org_company=name, org=github_org))
## Joining, by = "org"
affiliated_commits_summary <- affiliated_commits_summary %>%
  mutate(owned_repo = company == org_company,
         owned_repo = ifelse(is.na(owned_repo), FALSE, owned_repo))

owned_org_summary <- affiliated_commits_summary %>%
  group_by(company, org) %>%
  summarise(
    owned_repo=first(owned_repo),
    total_commits=sum(num_commits),
    total_authors=sum(num_authors),
    num_repos=n_distinct(repository.full_name)
  )

owned_repo_summary <- affiliated_commits_summary %>%
  group_by(company, org, repo) %>%
  summarise(
    owned_repo=first(owned_repo),
    total_commits=sum(num_commits),
    total_authors=sum(num_authors)
  )

write_csv(owned_repo_summary, "data/commits_repo_summary.csv")
# proportion of org-owned repository activity vs non-org owned
ggplot(owned_org_summary, aes(x=company, y=num_repos, fill=owned_repo)) +
  geom_bar(stat="identity") +
  coord_flip()

# top significant repos per org if a huge list
ggplot(owned_org_summary %>% 
         filter(total_commits > 1, total_authors > 1) %>% 
         group_by(company,org) %>% 
         top_n(5, total_commits), 
       aes(x=reorder(org, total_commits), y=total_commits, fill=owned_repo)) +
  geom_bar(stat="identity", position="stack") +
  coord_flip() +
  facet_wrap(~ company, scales="free", ncol=3) +
  labs(title="Github Projects per Company with Most Commits", x="Github Project", y="Total Commits", fill="Own Repo?")

# top significant repos per org if a huge list
ggplot(owned_org_summary %>% 
         filter(total_commits > 1, total_authors > 1) %>% 
         group_by(company,org) %>% 
         top_n(2, total_authors), 
       aes(x=reorder(org, total_authors), y=total_authors, fill=owned_repo)) +
  geom_bar(stat="identity", position="stack") +
  coord_flip() +
  facet_wrap(~ company, scales="free", ncol=3) +
  labs(title="Github Projects per Company with Most Authors", x="Github Project", y="Total Authors", fill="Own Repo?")
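
The owned_repo flag used in the plots above marks whether the organization hosting a repository belongs to the same company as the committer. The lookup can be sketched in base R as follows; the companies and organizations are made up and the variable names are illustrative.

```r
# Toy lookup table of company-owned GitHub orgs (lowercase, as in the report).
companies <- data.frame(
  name       = c("Acme", "Acme", "Globex"),
  github_org = c("acme", "acme-labs", "globex"),
  stringsAsFactors = FALSE
)

# Toy activity summary: committer's company and the org hosting the repo.
summary_df <- data.frame(
  company = c("Acme", "Acme", "Globex"),
  org     = c("Acme-Labs", "kubernetes", "acme"),
  stringsAsFactors = FALSE
)

# Look up which company (if any) owns each org, ignoring case.
org_company <- companies$name[match(tolower(summary_df$org), companies$github_org)]

# Owned only when the org is known and belongs to the committer's company.
summary_df$owned_repo <- !is.na(org_company) & summary_df$company == org_company
print(summary_df)
```

Organizations missing from the lookup, and organizations owned by a different company, both come out as not owned.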

Huawei

I was specifically asked about Huawei, so here’s a breakdown.

# Huawei: total authors per Github org
ggplot(owned_org_summary %>% 
         filter(company == "Huawei"), 
       aes(x=reorder(org, total_authors), y=total_authors, fill=owned_repo)) +
  geom_bar(stat="identity", position="stack") +
  coord_flip() +
  labs(title="Github Projects with Huawei Authors", x="Github Project", y="Total Authors", fill="Own Repo?")

# Huawei: total authors per repo, faceted by Github org
ggplot(owned_repo_summary %>% 
         filter(company == "Huawei"), 
       aes(x=reorder(repo, total_authors), y=total_authors, fill=owned_repo)) +
  geom_bar(stat="identity", position="stack", show.legend = FALSE) +
  coord_flip() +
  labs(title="Github Projects with Huawei Authors", x="Github Project", y="Total Authors", fill="Own Repo?") +
  facet_wrap(~ org, scales="free", ncol=5)

# Huawei: total commits per Github org (more than one commit and more than one author)
ggplot(owned_org_summary %>% 
         filter(company == "Huawei", total_commits > 1, total_authors > 1),
       aes(x=reorder(org, total_commits), y=total_commits, fill=owned_repo)) +
  geom_bar(stat="identity", position="stack") +
  coord_flip() +
  labs(title="Github Projects with Huawei Authors", x="Github Project", y="Total Commits", fill="Own Repo?")

# Huawei: total commits per repo, faceted by Github org
ggplot(owned_repo_summary %>% 
         filter(company == "Huawei"),
       aes(x=reorder(repo, total_commits), y=total_commits, fill=owned_repo)) +
  geom_bar(stat="identity", position="stack", show.legend = FALSE) +
  coord_flip() +
  labs(title="Github Projects with Huawei Authors", x="Github Project", y="Total Commits", fill="Own Repo?") +
  facet_wrap(~ org, scales="free", ncol=5)

Individual Authors

Significance is defined by a large number of commits to a particular project.

Multiple Authors (Same Company)

Significance is defined as the presence of multiple authors from the same company.

Multiple Companies

Significance is defined as the presence of commits from multiple companies.
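
This definition can be sketched as a count of distinct companies per repository. The base R example below uses made-up data and illustrative variable names; it is not the analysis used in this report.

```r
# Toy per-company, per-repository commit summary.
repo_summary <- data.frame(
  company     = c("Acme", "Globex", "Acme", "Initech"),
  repo        = c("kubernetes/kubernetes", "kubernetes/kubernetes",
                  "acme/tool", "torvalds/linux"),
  num_commits = c(12, 7, 30, 3),
  stringsAsFactors = FALSE
)

# Number of distinct companies with commits on each repository.
companies_per_repo <- tapply(repo_summary$company, repo_summary$repo,
                             function(x) length(unique(x)))

# Repositories with commits from more than one company.
shared <- names(companies_per_repo)[companies_per_repo > 1]
print(shared)
```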

Significant Event Activity

# proportion of org-owned repository activity vs non-org owned

# top significant repos per org if a huge list
login_events_summary <- read_rds("data/login_events_summary.Rds")

login_events_summary <- login_events_summary %>%
  mutate(org=str_to_lower(org)) %>%
  left_join(companies %>% rename(org_company=name, org=github_org))
## Joining, by = "org"
login_events_summary <- login_events_summary %>%
  mutate(owned_repo = company == org_company,
         owned_repo = ifelse(is.na(owned_repo), FALSE, owned_repo))

events_org_summary <- login_events_summary %>%
  group_by(company, org) %>%
  summarise(
    owned_repo=first(owned_repo),
    total_events=sum(num_events),
    total_authors=sum(num_authors),
    num_repos=n_distinct(repo_name)
  )

events_org_type_summary <- login_events_summary %>%
  group_by(company, org, type) %>%
  summarise(
    owned_repo=first(owned_repo),
    total_events=sum(num_events),
    total_authors=sum(num_authors),
    num_repos=n_distinct(repo_name)
  )

events_repo_summary <- login_events_summary %>%
  group_by(company, org, repo) %>%
  summarise(
    owned_repo=first(owned_repo),
    total_events=sum(num_events),
    total_authors=sum(num_authors)
  )

events_repo_type_summary <- login_events_summary %>%
  group_by(company, org, repo, type) %>%
  summarise(
    owned_repo=first(owned_repo),
    total_events=sum(num_events),
    total_authors=sum(num_authors)
  )

write_csv(events_repo_summary, "data/events_repo_summary.csv")
write_csv(events_repo_type_summary, "data/events_repo_type_summary.csv")
# proportion of org-owned repository activity vs non-org owned
ggplot(events_org_summary, aes(x=company, y=num_repos, fill=owned_repo)) +
  geom_bar(stat="identity") +
  coord_flip()

# proportion of org-owned repository activity vs non-org owned
ggplot(events_org_type_summary, aes(x=company, y=num_repos, fill=type)) +
  geom_bar(stat="identity", position="dodge") +
  coord_flip() +
  facet_wrap(~ owned_repo)

# top significant repos per org if a huge list
ggplot(events_org_summary %>% 
         filter(total_events > 1) %>% 
         group_by(company,org) %>% 
         top_n(5, total_events), 
       aes(x=reorder(org, total_events), y=total_events, fill=owned_repo)) +
  geom_bar(stat="identity", position="stack") +
  coord_flip() +
  facet_wrap(~ company, scales="free", ncol=3) +
  labs(title="Github Projects per Company with Most Events", x="Github Project", y="Total Events", fill="Own Repo?")

ggplot(events_org_type_summary %>% 
         filter(total_events > 1, owned_repo == FALSE) %>% 
         group_by(company,org) %>% 
         top_n(5, total_events), 
       aes(x=reorder(org, total_events), y=total_events, fill=type)) +
  geom_bar(stat="identity", position="dodge") +
  coord_flip() +
  facet_wrap(~ company, scales="free", ncol=3) +
  labs(title="Github Projects per Company with Most Events", x="Github Project", y="Total Events", fill="Type")

# top significant repos per org if a huge list
ggplot(events_org_summary %>% 
         group_by(company,org) %>% 
         top_n(1, total_authors), 
       aes(x=reorder(org, total_authors), y=total_authors, fill=owned_repo)) +
  geom_bar(stat="identity", position="stack") +
  coord_flip() +
  facet_wrap(~ company, scales="free", ncol=3) +
  labs(title="Github Projects per Company with Most Authors", x="Github Project", y="Total Authors", fill="Own Repo?")

# Huawei: total authors per repo (events), faceted by Github org
ggplot(events_repo_summary %>% 
         filter(company == "Huawei"), 
       aes(x=reorder(repo, total_authors), y=total_authors, fill=owned_repo)) +
  geom_bar(stat="identity", position="stack", show.legend=FALSE) +
  coord_flip() +
  labs(title="Github Projects with Huawei Authors", x="Github Project", y="Total Authors", fill="Own Repo?") +
  facet_wrap(~ org, scales="free", ncol=4)

# Huawei: events by type on repos Huawei does not own, faceted by Github org
ggplot(events_repo_type_summary %>% 
         filter(company == "Huawei", owned_repo == FALSE),
       aes(x=reorder(repo, total_events), y=total_events, fill=type)) +
  geom_bar(stat="identity", position="dodge") +
  coord_flip() +
  labs(title="Github Projects with Huawei Authors", x="Github Project", y="Total Events", fill="") +
  facet_wrap(~ org, scales="free", ncol=3)

Multiple Companies

Significance is defined as the presence of events from multiple companies.

Summary