L08- Characterizing the Whole Network

Data Description
Calculations and Visualizations
Discussion

This week we are exploring properties of the entire network. I am continuing the analysis of Wikipedia page-link frequency (citations) compared to the collaboration distance and Erdős number for a small sample of authors. A few more authors have been added to the list, but this is still a very small network. It is also probably skewed, because not all of the authors had entries in the database I used to collect the collaboration distance and Erdős number. I am still looking for an API or REST service to retrieve this data, so if you know of any R or Python library or REST service, let me know.

Data Description

A sample of sixteen network scientists was chosen at random from a list in Wikipedia¹. An independent sample was collected for each author, using the author’s name as the seed value, to a maximum depth of three for each iteration. The total number of links is the citation frequency of the author.
For each author, the Erdős number² ³ ⁴ and the collaboration distance⁵ were also collected.
The resulting dataset contains an originating node (the author’s name), the Erdős number (associated with the originating node) and the total citations (page-links) for the author.

Calculations and Visualizations

Two types of reciprocity are calculated. First the correlation:

## [1] 0.1121911

The correlation measure may be interpreted as the net tendency for edges of similar relative value (with respect to the mean edge value) to occur within the same dyads. If all edge values are identical, then the correlation reciprocity should be 1 by definition. The reciprocity correlation here is very low, which I suspect could be due to the wide range of the edge values:
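To make the interpretation above concrete, here is a minimal base-R sketch of a within-dyad correlation. It is an illustration under my own assumptions, not the source code of `grecip`: the function name `recip_correlation` and the toy matrices are hypothetical, and the normalization may differ in detail from the one the package uses.

```r
# Within-dyad correlation of edge values (illustrative sketch only).
# `x` is a square numeric matrix of edge values; the diagonal is ignored.
recip_correlation <- function(x) {
  off_diag <- row(x) != col(x)   # TRUE for every ordered pair i != j
  forward  <- x[off_diag]        # values x[i, j]
  backward <- t(x)[off_diag]     # matching values x[j, i]
  if (length(unique(forward)) == 1) {
    return(1)                    # identical edge values: 1 by convention
  }
  cor(forward, backward)
}

# Hypothetical toy matrices, not the lab data.
symmetric <- matrix(c(0, 1, 2,
                      1, 0, 3,
                      2, 3, 0), nrow = 3, byrow = TRUE)
one_way   <- matrix(c(0, 1, 1,
                      0, 0, 1,
                      0, 0, 0), nrow = 3, byrow = TRUE)

r_symmetric <- recip_correlation(symmetric)        # both directions always agree
r_one_way   <- recip_correlation(one_way)          # every dyad pairs a 1 with a 0
r_constant  <- recip_correlation(matrix(5, 3, 3))  # all edge values identical
```

In the symmetric case the two directions of every dyad carry the same value, so the correlation is at its maximum; in the one-way case each dyad pairs a high value with a low one, so it is at its minimum.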

Maximum edge value

## [1] 480

Minimum edge value

## [1] 0

Testing the dyadic reciprocity:

# Reciprocity
grecip(citation_erdos_mtrx, citation_coll_mtrx, measure = "dyadic")

##       Mut 
## 0.7416667

The dyadic reciprocity of the graph is the proportion of dyads that are symmetric. Since this is the basis of the graph, it should be at least 50%. The observed value is much higher, most likely because the samples are drawn from a small pool within a closely related set of data.
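The "proportion of symmetric dyads" can be sketched in a few lines of base R. This is an illustration rather than the package's implementation; `dyadic_reciprocity` and the toy matrix are hypothetical names and data. The last two lines assume the matrix has one row per author (16 nodes), in which case there are 120 dyads and the reported 0.7416667 corresponds to 89 symmetric dyads.

```r
# Proportion of dyads that are symmetric (mutual or null), illustrative sketch.
dyadic_reciprocity <- function(x) {
  a <- x != 0                    # dichotomize: is there a tie at all?
  upper <- upper.tri(a)          # one cell per unordered pair i < j
  mean(a[upper] == t(a)[upper])  # share of pairs where both directions agree
}

# Hypothetical 3-node example: 1 <-> 2 is mutual, 1 -> 3 is one-way,
# and 2, 3 are unconnected (a null dyad, which also counts as symmetric).
toy <- matrix(c(0, 1, 1,
                1, 0, 0,
                0, 0, 0), nrow = 3, byrow = TRUE)
toy_value <- dyadic_reciprocity(toy)

# Back-of-the-envelope check, assuming 16 nodes.
n_dyads <- choose(16, 2)
implied_symmetric <- round(0.7416667 * n_dyads)
```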

Transitivity

## [1] 0.8395311
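For readers unfamiliar with the measure, one common definition of transitivity is the share of two-step paths i -> j -> k that are closed by a direct tie i -> k. The sketch below implements that definition in base R; it is an assumption on my part that this matches the variant used for the number above, and `weak_transitivity` and the toy graph are hypothetical.

```r
# Share of two-paths i -> j -> k (i != k) that are closed by a tie i -> k.
weak_transitivity <- function(x) {
  a <- (x != 0) * 1              # dichotomize to a 0/1 matrix
  diag(a) <- 0                   # ignore self-ties
  two_paths <- a %*% a           # two_paths[i, k] = number of paths i -> j -> k
  diag(two_paths) <- 0           # drop paths that return to their start
  sum(two_paths * a) / sum(two_paths)
}

# Hypothetical undirected graph: a triangle 1-2-3 with a pendant node 4 on 3.
m <- matrix(0, 4, 4)
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4))
m[edges] <- 1                    # fill one direction
m[edges[, 2:1]] <- 1             # and the reverse, making the graph undirected

toy_transitivity <- weak_transitivity(m)
```

In this toy graph six of the ten ordered two-paths are closed by the triangle, while the four that pass through node 3 to or from node 4 are not.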

Centrality

The key player was identified using two centrality measures, eigenvector and degree centrality. Both point to the same key player, who, surprisingly, is Olaf Sporns. Visually, node #71 looks much more important than the others. Olaf had a lower citation count and collaborated less than Stephen P. Borgatti or Albert-László Barabási in this sample, so there are some other, as yet undiscovered, factors at play.

# Eigenvector centrality
sna::evcent(citation_erdos_mtrx, citation_coll_mtrx, auth_cit_erdos$erdos, g=1)

# Compute the scores once, then find the index of the highest-scoring author
ev_scores <- sna::evcent(citation_erdos_mtrx, citation_coll_mtrx, auth_cit_erdos$erdos)
indx <- which(ev_scores == max(ev_scores))

auth_cit_erdos$author[[indx]]

## [1] "Olaf_Sporns"

# Degree centrality

# Compute the scores once, then find the index of the highest-degree author
deg_scores <- sna::degree(citation_erdos_mtrx, citation_coll_mtrx, auth_cit_erdos$erdos)
deg_indx <- which(deg_scores == max(deg_scores))

auth_cit_erdos$author[[deg_indx]]

## [1] "Olaf_Sporns"
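The key-player lookup above boils down to "score every node, then pick the maximum". A self-contained base-R sketch of that idea follows. The star-shaped toy network and the author labels are hypothetical, and using `eigen()` on a symmetric matrix is my own simplification rather than what the package does internally.

```r
# Hypothetical undirected star: author "A" is tied to everyone else.
authors <- c("A", "B", "C", "D")
star <- matrix(0, 4, 4, dimnames = list(authors, authors))
star["A", -1] <- 1
star[-1, "A"] <- 1

# Degree centrality: number of ties per node.
degree_score <- rowSums(star)

# Eigenvector centrality: leading eigenvector of the adjacency matrix.
# eigen() sorts eigenvalues in decreasing order for symmetric input;
# abs() removes the arbitrary sign of the eigenvector.
decomposition <- eigen(star, symmetric = TRUE)
eigen_score <- abs(decomposition$vectors[, 1])

key_by_degree <- authors[which(degree_score == max(degree_score))]
key_by_eigen  <- authors[which.max(eigen_score)]
```

In a star both measures agree on the hub. In less regular networks they can disagree, because eigenvector centrality also rewards being tied to well-connected neighbours.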

# auth_cit_erdos
# actor_collab[]
g <- igraph::graph.data.frame(d = c(auth_cit_erdos, cit_collab, cit_erdos), directed = FALSE, vertices = NULL)
plot(g)

Removing Olaf:

For this comparison I chose reciprocity as the invariant. The network seems to be highly dependent upon Olaf. However, since this is not a highly disconnected graph, removing a key player doesn’t have a great impact: the reciprocity values below match those computed on the full network.

# Reciprocity 
grecip(citation_erdos_mtrx, citation_coll_mtrx, measure ="correlation")

## [1] 0.1121911

# Reciprocity 
grecip(citation_erdos_mtrx, citation_coll_mtrx, measure ="dyadic")

##       Mut 
## 0.7416667

# Rows and columns retained after the removal
keep <- c(2:13, 15, 16)
g <- igraph::graph.data.frame(d = c(new_auth_cit_erdos, cit_collab[keep, keep], cit_erdos[keep, keep]), directed = FALSE, vertices = NULL)
plot(g)
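One way to check how much an invariant depends on a single node is to drop that node's row and column and recompute the measure on the reduced matrix. The sketch below does this for dyadic reciprocity on a hypothetical four-author network; the helper `dyadic_reciprocity` and all data are illustrative assumptions, not the lab's objects.

```r
# Proportion of symmetric dyads (illustrative helper).
dyadic_reciprocity <- function(x) {
  a <- x != 0
  upper <- upper.tri(a)
  mean(a[upper] == t(a)[upper])
}

# Hypothetical directed network: A <-> B, A -> C, C <-> D.
authors <- c("A", "B", "C", "D")
m <- matrix(0, 4, 4, dimnames = list(authors, authors))
m["A", "B"] <- 1; m["B", "A"] <- 1
m["A", "C"] <- 1
m["C", "D"] <- 1; m["D", "C"] <- 1

# Remove author "A" by dropping the matching row and column.
keep <- authors != "A"
reduced <- m[keep, keep]

full_value    <- dyadic_reciprocity(m)
reduced_value <- dyadic_reciprocity(reduced)
```

Here the only asymmetric dyad involves "A", so removing "A" raises the value from 5/6 to 1. If the two values come out identical, it is worth confirming that the measure was actually recomputed on the reduced matrix.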

ANOVA, in lieu of t-test

TOOLS>STATISTICS>ANOVA
--------------------------------------------------------------------------------

Dependent variable:                     "C:\Users\dev1\MyData\data\Lab08\collab.##h" Col 1
Independent variable:                   "C:\Users\dev1\MyData\data\Lab08\erd.##h" Col 1
# of permutations:                      5000
Random seed:                            17568


        ANALYSIS OF VARIANCE

         Source             DF            SSQ    F-Statistic   Significance
 ============== ============== ============== ============== ==============
      Treatment              4         195.00         0.5216         0.7229
          Error             11        1028.00
          Total             15        1223.00

R-Square/Eta-Square: 0.159


----------------------------------------
Running time:  00:00:01
Output generated:  21 Oct 15 22:58:43
UCINET 6.587 Copyright (c) 1992-2015 Analytic Technologies
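The F-statistic and R-square in the table can be reproduced from the sums of squares and degrees of freedom it reports. The sketch below uses only numbers from the table above; the variable names are my own. The last line computes the classical F-distribution tail probability, which need not equal the permutation-based significance that UCINET reports.

```r
# Values copied from the ANOVA table.
ssq_treatment <- 195
df_treatment  <- 4
ssq_error     <- 1028
df_error      <- 11

# F is the ratio of the treatment and error mean squares.
f_stat <- (ssq_treatment / df_treatment) / (ssq_error / df_error)

# R-square (eta-square) is the treatment share of the total sum of squares.
eta_sq <- ssq_treatment / (ssq_treatment + ssq_error)

# Parametric tail probability, for comparison with the permutation result.
p_parametric <- pf(f_stat, df_treatment, df_error, lower.tail = FALSE)
```

Rounded, `f_stat` and `eta_sq` agree with the 0.5216 and 0.159 shown in the table.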

Discussion

Reciprocity doesn’t appear to be a strong indicator for this analysis because almost all actors (authors) are in dyadic relationships in a small-world network. Additionally, the network is mainly composed of authors who are, for the most part, working in a related field. As such, this is a co-citation matrix with a proportionality constant based on the collaboration number and the Erdős number. Therefore, a high dyadic reciprocity should be expected. The low result for the correlation reciprocity is most likely attributable to the wide range of edge values shown by the maximum and minimum above.

There are a couple of triads, but the transitivity is very sensitive to NA values. If the NA values are replaced, the transitivity can be computed; otherwise the result is NA. A larger sample would probably be more resilient.
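The NA sensitivity is easy to reproduce in base R, because a single missing cell propagates through the sums that count two-paths. The sketch below is illustrative only: `weak_transitivity` is a hypothetical helper, and replacing NA with 0 (treating a missing tie as absent) is an assumption that may or may not suit the real data.

```r
# Share of two-paths i -> j -> k (i != k) closed by a tie i -> k.
weak_transitivity <- function(x) {
  a <- (x != 0) * 1
  diag(a) <- 0
  two_paths <- a %*% a
  diag(two_paths) <- 0
  sum(two_paths * a) / sum(two_paths)
}

# Hypothetical complete triangle with one missing observation.
with_na <- matrix(1, 3, 3)
diag(with_na) <- 0
with_na[1, 2] <- NA

t_with_na <- weak_transitivity(with_na)  # the NA propagates to the result

# Treat the missing tie as absent (an assumption) and recompute.
filled <- with_na
filled[is.na(filled)] <- 0
t_filled <- weak_transitivity(filled)
```

With the missing tie set to 0, three of the four remaining two-paths are closed, so the measure becomes a valid number again.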

When trying the t-test, I kept getting an error that suggested using the ANOVA instead, so I’ve included it here to fulfill that requirement. With a significance of 0.7229, it indicates that the differences between groups are not statistically significant and could well be due to chance.

Most of the measurements, although interesting, are not very pertinent to the citation/collaboration network analysis, with the exception of the key player. That was an interesting result, since I wouldn’t have expected the outcome based on a visual inspection, though that may be due to the unlabeled nodes. I suspect, however, that it has more to do with the small size of the network.

Removing a key player in a highly disconnected graph doesn’t have as much of an impact as it would in a more highly connected graph.

¹ https://en.wikipedia.org/wiki/List_of_network_scientists
² An Erdős number describes a person’s degree of separation from Erdős himself, based on their collaboration with him.
³ Erdős alone was assigned the Erdős number of 0 (for being himself), while his immediate collaborators could claim an Erdős number of 1, their collaborators have Erdős number at most 2, and so on.
⁴ Retrieved from http://www.ams.org/mathscinet/collaborationDistance.html
⁵ Retrieved from http://www.ams.org/mathscinet/collaborationDistance.html


Lorren Kay

October 21, 2015
