UMCES coauthors

Introduction

This analysis was done just out of curiosity and is not intended to be taken seriously.

This document presents an analysis of publicly available data on coauthor relationships among faculty members at the University of Maryland Center for Environmental Science (UMCES). The UMCES laboratories are: Appalachian Laboratory (AL), Chesapeake Biological Laboratory (CBL), Horn Point Laboratory (HPL), and Institute of Marine and Environmental Technology (IMET). Full code for these results is available on GitHub, but data folders with the faculty lists are kept offline.

Below is an interactive graph of UMCES faculty coauthorship, with details further below.

Data

The UMCES faculty list is taken as of 2020-02-26. If you are one of the faculty and would like your name removed from the analysis, just let me know.

The coauthorship network is based on data extracted from Google Scholar on 2020-10-12. Nodes of the network are UMCES faculty members; an edge between faculty members A and B exists if A is listed (at least once) among authors of publications in B’s profile or vice versa.
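The edge rule above can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis: the function name `build_edges`, the input layout (a dict from profile owner to a list of author lists), and exact string matching of names are all assumptions made for the example.

```python
def build_edges(faculty, profiles):
    """Return undirected coauthorship edges among faculty members.

    faculty  -- iterable of faculty names (hypothetical, exact-match names)
    profiles -- dict: profile owner -> list of publications,
                each publication being a list of author names
    """
    faculty = set(faculty)
    edges = set()
    for owner, publications in profiles.items():
        for authors in publications:
            for author in authors:
                # An edge exists if another faculty member appears at least
                # once among the authors of a paper in the owner's profile.
                if author in faculty and author != owner:
                    edges.add(tuple(sorted((owner, author))))
    return sorted(edges)


# Hypothetical example: only C has a profile; X is not on the faculty list.
faculty = ["A", "B", "C"]
profiles = {"C": [["A", "B", "C"], ["C", "X"]]}
edges = build_edges(faculty, profiles)
```

In this toy case the edges are A-C and B-C only: the link A-B is never seen because neither A nor B has a profile to read it from.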

Why Google Scholar (advantages):

Google Scholar is probably the most popular platform tracking academic work. Of the 66 people in the faculty list, I found Google Scholar profiles for 52.
Google Scholar adds new publications to profiles automatically, so its data are likely more up to date than those of other platforms.
Google Scholar indexes a broader range of publication sources than the more selective Web of Science and similar databases.
Google Scholar is easier to crawl than other websites.

Possible problems (disadvantages):

Google Scholar profiles were not available for some of the faculty. A coauthorship link could be recorded only when at least one of the two coauthors has a Google Scholar profile (for example, if faculty members A, B, and C are coauthors but only C has a Google Scholar profile, then the identified links will be A-C and B-C, without A-B).
The accuracy and completeness of Google Scholar data are not perfect; account owners put different amounts of effort into maintaining their profiles.
There could be computer errors when accessing the web data and when extracting and matching names, especially for common or short family names.
For publications with many coauthors, only the first few (up to 5-7) authors are tracked. Attempts to scrape full metadata for each publication were blocked by the server.
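To illustrate how name matching can go wrong, here is a minimal sketch of matching on family name plus first initial, ignoring case. The helper names `name_key` and `same_person`, the example names, and the assumption that a name is written as given name or initials followed by family name are all hypothetical; the actual matching code may differ.

```python
def name_key(full_name):
    """Reduce a name such as 'Jane Q. Doe' or 'JQ DOE' to ('doe', 'j')."""
    parts = full_name.replace(".", " ").split()
    if not parts:
        return None
    family = parts[-1].lower()       # assume the family name comes last
    initial = parts[0][0].lower()    # first letter of the given name(s)
    return (family, initial)


def same_person(faculty_name, listed_author):
    """Case-insensitive match on family name + first initial."""
    key = name_key(faculty_name)
    return key is not None and key == name_key(listed_author)


match_caps = same_person("Jane Q. Doe", "JQ DOE")      # all-caps spelling
false_match = same_person("Jane Doe", "John Doe")      # different people
no_match = same_person("Jane Doe", "M Doe")
```

The second call shows the limitation: two different people who share a family name and a first initial are treated as the same person, which is more likely for common or short family names.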

Summaries

Node degree

The plots below show “within-UMCES-collaborativeness” by counting how many coauthors from UMCES each faculty member has (that is, node degree in the coauthorship network). Red points represent faculty members without a Google Scholar account (see the data description above).
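As a small illustration of node degree, the sketch below counts neighbors from an edge list. The function name `node_degrees` and the toy data are assumptions for the example; edges are taken to be unique undirected pairs.

```python
def node_degrees(nodes, edges):
    """Count the number of distinct coauthors (neighbors) of each node."""
    degree = {node: 0 for node in nodes}  # isolated nodes keep degree 0
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


# Hypothetical network: D has no UMCES coauthors.
degrees = node_degrees(["A", "B", "C", "D"], [("A", "C"), ("B", "C")])
```

Here C has degree 2, A and B have degree 1, and D has degree 0.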

Faculty members have had different opportunities to establish collaborations within UMCES. For example, junior faculty members are likely to have fewer collaborations, as the next plot shows.

Betweenness centrality

Node degree is one of many measures of node centrality (roughly, the importance of a node in a network). Another common measure is betweenness centrality, which is based on the number of shortest paths in the network that pass through a given node (in other words, how often the node sits in a bridging position between other nodes).

The same plot is repeated below, grouped by faculty rank.
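Betweenness can be computed with Brandes' algorithm; the sketch below is a plain-Python version for an unweighted, undirected network. It returns unnormalized scores (each pair of endpoints counted once), which may differ by a scaling factor from what a given network library reports. The function name and toy graphs are assumptions for the example.

```python
from collections import deque


def betweenness(nodes, edges):
    """Unnormalized betweenness centrality (Brandes' algorithm, BFS version)."""
    adj = {v: [] for v in nodes}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)

    score = {v: 0.0 for v in nodes}
    for s in nodes:
        # Breadth-first search from s, counting shortest paths (sigma).
        order, dist = [], {s: 0}
        sigma = {v: 0 for v in nodes}
        sigma[s] = 1
        preds = {v: [] for v in nodes}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:   # v precedes w on a shortest path
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back to s.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every unordered pair was visited from both ends, so halve the totals.
    return {v: score[v] / 2 for v in nodes}


path = betweenness(["A", "B", "C"], [("A", "B"), ("B", "C")])
star = betweenness(["A", "B", "C", "D"], [("A", "C"), ("B", "C"), ("D", "C")])
```

In the path A-B-C, only B lies between the other two, so it scores 1. In the star, the center C lies on the shortest path of all three leaf pairs and scores 3.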

Number of publications

Below is the number of publications retrieved from Google Scholar for each faculty member.

Affiliation and collaboration

Here we investigate whether network clusters match the formal affiliation of faculty with the different laboratories. The network clusters represent communities (colored rectangles on the clustering dendrogram below) densely connected by the coauthorship links.

Of several readily available algorithms, the fast greedy algorithm was used; it identified 9 communities.
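"Fast greedy" usually refers to greedy modularity maximization: start with every node in its own community and repeatedly merge the pair of communities that increases modularity the most. The sketch below is a naive, unoptimized version of that idea for an unweighted graph without self-loops or duplicate edges. It is an assumption that this matches the algorithm used here in spirit; the function name and the toy graph are made up for the example, and tie-breaking may differ from any library implementation.

```python
def greedy_modularity_communities(nodes, edges):
    """Naive greedy modularity merging; returns a sorted list of communities."""
    m = len(edges)
    community = {v: {v} for v in nodes}   # community label -> members
    if m == 0:
        return sorted(sorted(c) for c in community.values())

    a = {v: 0.0 for v in nodes}   # a[i]: share of edge ends attached to i
    e = {}                        # e[(i, j)]: edges between i and j, over 2m
    for u, v in edges:
        a[u] += 1 / (2 * m)
        a[v] += 1 / (2 * m)
        key = tuple(sorted((u, v)))
        e[key] = e.get(key, 0.0) + 1 / (2 * m)

    while True:
        # Modularity gain of merging i and j is 2 * (e_ij - a_i * a_j);
        # only connected pairs can have a positive gain.
        best_gain, best_pair = 0.0, None
        for (i, j), e_ij in e.items():
            gain = 2 * (e_ij - a[i] * a[j])
            if gain > best_gain:
                best_gain, best_pair = gain, (i, j)
        if best_pair is None:
            break                 # no merge improves modularity
        i, j = best_pair
        del e[best_pair]
        # Re-attach all links of j to the merged community i.
        for key in [k for k in e if j in k]:
            other = key[0] if key[1] == j else key[1]
            new_key = tuple(sorted((i, other)))
            e[new_key] = e.get(new_key, 0.0) + e.pop(key)
        a[i] += a.pop(j)
        community[i] |= community.pop(j)

    return sorted(sorted(c) for c in community.values())


# Hypothetical graph: two triangles joined by a single bridge c-d.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"),
             ("d", "e"), ("d", "f"), ("e", "f"), ("c", "d")]
communities = greedy_modularity_communities("abcdef", toy_edges)
```

On this toy graph the procedure stops with the two triangles as separate communities, because merging them across the single bridge would lower modularity.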

Knowing the actual affiliations of the faculty, the clustering can be checked using several evaluation criteria, one of which is purity (Chapter 16 in Manning, Raghavan, and Schütze 2008):

\[Purity(\Omega,C) = \frac{1}{N}\sum_{k}\max_{j}|\omega_k\cap c_j|,\]

where \(\Omega=\{\omega_1,\ldots,\omega_K \}\) is the set of identified clusters and \(C=\{c_1,\ldots,c_J\}\) is the set of classes. That is, for each cluster \(k=1,\ldots,K\), find the class that is most frequent in that cluster and record the size of the overlap. Then sum the \(K\) overlap sizes and divide by the sample size \(N\).

When classes represent the laboratory affiliation (that was not used in clustering) and clusters are the communities obtained by tracking coauthorship links, \(Purity =\) 0.68.
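The purity formula can be sketched directly in Python. The function name `purity` and the six-node example are assumptions for illustration; the labels below are invented and are not the UMCES data.

```python
from collections import Counter


def purity(clusters, classes):
    """Purity of a clustering: clusters[i] and classes[i] label node i."""
    per_cluster = {}
    for k, c in zip(clusters, classes):
        per_cluster.setdefault(k, Counter())[c] += 1
    # For each cluster take the size of its most frequent class, then average.
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(clusters)


# Hypothetical labels for six nodes.
toy_clusters = [1, 1, 1, 2, 2, 3]
toy_classes = ["AL", "AL", "CBL", "CBL", "CBL", "HPL"]
toy_purity = purity(toy_clusters, toy_classes)
```

In this example the majority counts are 2, 2, and 1, so purity is 5/6. Note that purity tends to increase with the number of clusters and equals 1 when every node is its own cluster, so it should be read together with the number of communities.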

The matrix below shows the percentage distribution of within-UMCES collaborations. It answers the question: among an author's collaborators from UMCES, what proportion comes from the home lab and what proportion from the other labs?

Percentage distribution of within-UMCES collaborators
	AL	CBL	HPL	IMET	Total
AL	72.3	2.3	25.4	0.0	100
CBL	0.5	81.2	14.4	3.9	100
HPL	7.0	19.9	68.8	4.2	100
IMET	0.0	14.2	11.0	74.8	100

Example inference from the table above: of all UMCES collaborators of AL authors, 72.3% are from AL, 2.3% are from CBL, 25.4% are from HPL, and 0% are from IMET.
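One way such a table could be built from the network is sketched below. The counting rule is an assumption: each edge adds one collaborator to the row of each endpoint's lab, so a within-lab edge adds two to the diagonal. The function name and the three-person example are hypothetical.

```python
def collaborator_percentages(edges, lab_of, labs):
    """Row percentages of collaborators by lab (rows: home lab of the author)."""
    counts = {r: {c: 0 for c in labs} for r in labs}
    for u, v in edges:
        counts[lab_of[u]][lab_of[v]] += 1   # v is a collaborator of u
        counts[lab_of[v]][lab_of[u]] += 1   # u is a collaborator of v
    table = {}
    for r in labs:
        total = sum(counts[r].values())
        table[r] = {c: (100 * counts[r][c] / total if total else 0.0)
                    for c in labs}
    return table


# Hypothetical data: A and B are in AL, C is in CBL.
lab_of = {"A": "AL", "B": "AL", "C": "CBL"}
table = collaborator_percentages([("A", "B"), ("A", "C")], lab_of, ["AL", "CBL"])
```

Here AL authors have three collaborator links in total, two within AL and one to CBL, so the AL row is roughly 66.7% and 33.3%; the only collaborator of the CBL author is in AL, so the CBL row is 100% and 0%.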

Number of UMCES collaborators per 100 papers from a specific lab (rows)
	AL	CBL	HPL	IMET
AL	13.3	0.4	4.7	0.0
CBL	0.2	35.5	6.3	1.7
HPL	2.6	7.4	25.5	1.6
IMET	0.0	3.3	2.6	17.6

Example inference from the table above: per 100 publications from AL, there are on average 13.3 collaborators from AL, 0.4 collaborators from CBL, 4.7 collaborators from HPL, and 0 collaborators from IMET.

Please keep the data limitations in mind, such as tracking only about the first 6 authors of multi-author publications and the absence of Google Scholar accounts for some faculty members.
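The two tables appear to be two views of the same counts: dividing each row of the per-100-papers table by its row sum gives values close to the percentage table. The check below uses only the numbers printed in the tables above; that both tables derive from the same underlying counts is an inference, not something the analysis code has confirmed.

```python
labs = ["AL", "CBL", "HPL", "IMET"]

# Collaborators per 100 papers (rows: lab of the papers), copied from the table.
per_100 = {
    "AL":   [13.3, 0.4, 4.7, 0.0],
    "CBL":  [0.2, 35.5, 6.3, 1.7],
    "HPL":  [2.6, 7.4, 25.5, 1.6],
    "IMET": [0.0, 3.3, 2.6, 17.6],
}

# Percentage distribution of collaborators, copied from the table.
percent = {
    "AL":   [72.3, 2.3, 25.4, 0.0],
    "CBL":  [0.5, 81.2, 14.4, 3.9],
    "HPL":  [7.0, 19.9, 68.8, 4.2],
    "IMET": [0.0, 14.2, 11.0, 74.8],
}


def row_percentages(row):
    """Rescale a row so that it sums to 100."""
    total = sum(row)
    return [100 * x / total for x in row]


# Largest absolute difference between recomputed and reported percentages.
max_gap = max(
    abs(recomputed - reported)
    for lab in labs
    for recomputed, reported in zip(row_percentages(per_100[lab]), percent[lab])
)
```

The recomputed percentages agree with the reported ones to within about 0.2 percentage points, which is what one would expect when both tables are rounded to one decimal place.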

Conclusion

It’s been fun.

Updates (since 2020-07-16)

Updated publication data by redownloading it from Google Scholar
Added a table with percentage distribution of within-UMCES collaborators

Updates (since 2020-05-16)

Changed family name matching to be case-insensitive because some names on Google Scholar are spelled in all caps
Changed name matching from family name only to family name + initial for higher accuracy
Fixed a typo in one of the family names
Added Google Scholar data for one more person
Added number of publications plot and clustering

Next steps

Retrieve full lists of authors per paper (not truncated)
Count number of joint publications to get a weighted network
Update text matching
Text analytics on paper titles, journal titles
Update Google IDs if someone opened an account
Count average number of authors per paper
Add other information from Google Scholar such as citations and h-index
Add textbook references

References

Manning, C. D., P. Raghavan, and H. Schütze. 2008. Introduction to Information Retrieval. New York: Cambridge University Press.