
This project is part of my PhD dissertation. The network has already been assembled.

Let’s first load the required packages and the network file:

Loading libraries...

    library(igraph)

Descriptive Network Analysis

    # tbnet <- read_graph('~/Documents/sna/graphs_old/CAnet.graphml', format = 'graphml')
    tbnet <- read_graph('./graphs/TBnet.graphml', format = 'graphml')

Get a summary of the network:

    summary(tbnet)

IGRAPH UN-- 173 1937 -- 
+ attr: affil (v/c), numPub (v/n), place (v/c), country (v/c), name (v/c), timesCited (v/n), id (v/c), key (e/n),
| subject (e/c), year (e/n), wosid (e/c), journal (e/c), title (e/c), timesCited (e/n), doi (e/c)

Interpretation: Our co-authorship network is an undirected multigraph (it contains parallel edges) with 173 authors and 1937 scientific collaborations. Each node (author) in the network has 7 attributes: name, country, place, affil, numPub, timesCited and id. Each edge has 8 attributes: key, subject, year, wosid (Web of Science identification number), journal, title, timesCited and doi.

Computing Node Centrality Measures: We compute the following centrality measures: degree (the number of ties of a given author), betweenness (the number of shortest paths between pairs of other authors that pass through a given author), closeness (how few steps a given author needs to reach every other author in the network), eigenvector centrality (the degree to which an author is connected to other well-connected authors in the network), and brokerage (the degree to which an author occupies a brokerage position across all pairs of alters).

Degree centrality: Let’s first compute the degree and strength of the nodes in the network:

    d.tbnet <- degree(tbnet)
    s.tbnet <- graph.strength(simplify(tbnet))  # for weighted graph

Now, let’s plot a histogram of the degree and strength distributions:

Interpretation: While a substantial number of nodes have quite low degree, a non-trivial number of nodes have degrees of a higher order of magnitude. Given the nature of this distribution, a log-log scale is more effective at summarizing the degree information.

From here on, the analysis requires our network to be a simple graph. We therefore collapse our multigraph into a simple graph object, tbnet2.
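The conversion step itself is not shown. As a minimal sketch, assuming the parallel edges are merged with igraph's `simplify()` and the number of merged edges is kept as an edge weight (the toy data and the names `mg` and `sg` are illustrative, not taken from the project):

```r
library(igraph)

# Toy multigraph (hypothetical data): A and B co-authored two papers
el <- data.frame(from = c("A", "A", "A", "B"),
                 to   = c("B", "B", "C", "C"))
mg <- graph_from_data_frame(el, directed = FALSE)

# Give every collaboration a unit weight, then merge parallel edges,
# summing the weights and dropping all other edge attributes
E(mg)$weight <- 1
sg <- simplify(mg, edge.attr.comb = list(weight = "sum", "ignore"))

is_simple(sg)   # no parallel edges or loops remain
degree(sg)      # number of distinct co-authors per author
strength(sg)    # number of collaborations per author (sum of edge weights)
```

On a weighted simple graph built this way, degree counts distinct co-authors while strength counts collaborations, which is why the two measures can differ for the same author.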

Visualization

Let’s now visualize our co-authorship network tbnet2 using the Kamada-Kawai layout:

    # l.tbnet <- layout.kamada.kawai(tbnet)
    # l.tbnet2 <- layout.kamada.kawai(tbnet2)
    l.tbnet2 <- layout.kamada.kawai(tbnet2)

    # plot(tbnet, layout=l.tbnet, vertex.label=NA, vertex.size=2)
    # plot(tbnet2, layout=l.tbnet2, vertex.label=NA, vertex.size=2, vertex.color='red')
    plot(tbnet2, layout=l.tbnet2, vertex.label=NA, vertex.size=2, vertex.color='blue')  # We use l.tbnet2 as layout

    d.tbnet <- degree(tbnet2)                 # Degree of simple graph tbnet2
    V(tbnet2)$degree <- d.tbnet
    dd.tbnet <- degree.distribution(tbnet2)   # Compute degree distribution
    d <- 0:max(d.tbnet)                       # Degree values 0..max, one per entry of dd.tbnet
    ind <- (dd.tbnet != 0)
    a.nn.deg.tbnet <- graph.knn(tbnet2, V(tbnet2))$knn
    mean(d.tbnet)

[1] 17.36416

Now, plotting the degree distribution of the simplified graph.

Interpretation: From the plot on the right, the node degrees follow no clear distribution. Let’s investigate how nodes of different degrees are linked with each other in the co-authorship network. For this purpose, we use the notion of the average degree of the neighbors of a given node, and plot the average neighbor degree against node degree.

Interpretation: The plot above suggests that while nodes of higher degree tend to link with similar nodes, nodes of lower degree tend to link with nodes of both lower and higher degree. In other words, while prominent authors with many collaborations tend to collaborate with similar authors, young or less prolific authors tend to collaborate both with prolific authors and with authors who have very few collaborations.
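To make the notion of average neighbor degree concrete, here is a small sketch on a toy star graph (not the co-authorship data). It uses `knn()`, the current igraph name for `graph.knn()`; the object names are illustrative.

```r
library(igraph)

# Star with 5 vertices: vertex 1 is the hub, vertices 2..5 are leaves
star <- make_star(5, mode = "undirected")

deg <- degree(star)    # hub has degree 4, leaves have degree 1
nn  <- knn(star)$knn   # average degree of each vertex's neighbors

cbind(degree = deg, avg_neighbor_degree = nn)
```

In the star, the high-degree hub is linked only to low-degree leaves and vice versa, the opposite of the pattern described above for prominent authors.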

Let’s compute the other 3 node centrality measures:

    A <- get.adjacency(tbnet2, sparse = FALSE)
    g <- network::as.network.matrix(A)

Closeness centrality: captures the notion that a node is central if it is close to many other nodes. Considering a network \(G=(V,E)\), where \(V\) is the set of nodes and \(E\) the set of edges, the closeness centrality \(c_{Cl}(v)\) of a node \(v\) is defined as: \[c_{Cl}(v)=\frac{1}{\sum_{u\in V}dist(v,u)}\] where \(dist(v,u)\) is the geodesic distance between the nodes \(u,v \in V\).
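As a small illustration of this definition, the sketch below evaluates closeness on a toy path graph of three nodes (hypothetical data, unweighted), where the distances can be checked by hand:

```r
library(igraph)

# Path graph 1 - 2 - 3
path <- make_ring(3, circular = FALSE)

distances(path)         # matrix of geodesic distances
cl <- closeness(path)   # 1 / (sum of distances to all other nodes)
cl
```

The middle node is at distance 1 from both ends, so its closeness is 1/2; each end node is at distances 1 and 2 from the others, giving 1/3.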

    cl.tbnet <- closeness(tbnet2)
    V(tbnet2)$closeness <- cl.tbnet
    summary(cl.tbnet)

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0000368 0.0003116 0.0003175 0.0002907 0.0003211 0.0003280 

Betweenness centrality: summarizes the extent to which a node is located between other pairs of nodes. It reflects the perspective that importance relates to where a node is located with respect to the paths in the network graph. According to Freeman, it is defined as: \[c_{B}(v)=\sum_{s \neq t \neq v \in V}\frac{\sigma (s,t|v)}{\sigma (s,t)}\] where \(\sigma(s,t|v)\) is the total number of shortest paths between \(s\) and \(t\) that pass through \(v\), and \(\sigma (s,t)\) is the total number of shortest paths between \(s\) and \(t\) regardless of whether or not they pass through \(v\).
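A quick sanity check of this measure on a toy star graph (hypothetical data): every shortest path between two leaves passes through the hub, so the hub's betweenness equals the number of leaf pairs and the leaves score zero.

```r
library(igraph)

# Star with one hub (vertex 1) and three leaves
star <- make_star(4, mode = "undirected")

bw <- betweenness(star)
bw

choose(3, 2)   # number of leaf pairs, all routed through the hub
```

For undirected graphs igraph counts each pair of nodes once, so the hub's score is the number of unordered leaf pairs.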

    bw.tbnet <- betweenness(tbnet2)
    V(tbnet2)$betweenness <- bw.tbnet
    summary(bw.tbnet)

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.0000    0.8075   12.4900   99.0600   30.5600 3077.0000 

Eigenvector centrality: seeks to capture the idea that the more central the neighbors of a node are, the more central that node itself is. According to Bonacich and Katz [Page 48, book Kolaczyk and Csardi, 2nd edition, 2009], the eigenvector centrality measure is defined as: \[c_{E_i}(v)=\alpha \sum_{\{u,v\}\in E}c_{E_i}(u)\]

Where the vector \(\mathbf{c}_{E_i}=(c_{E_i}(1),\dots ,c_{E_i}(N_v))^T\) is the solution to the eigenvalue problem \(\mathbf{Ac}_{E_i}=\alpha^{-1}\mathbf{c}_{E_i}\), where \(\mathbf{A}\) is the adjacency matrix for the network \(G\). According to Bonacich, an optimal choice of \(\alpha^{-1}\) is the largest eigenvalue of \(\mathbf{A}\).
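The eigenvalue problem can be checked directly in base R. The sketch below uses a toy path graph of three nodes (hypothetical data, not the co-authorship network) and rescales the leading eigenvector so that its largest entry is 1, the same scaling as the maximum of 1 in an eigenvector centrality summary. The names `A`, `lambda` and `cen` are illustrative.

```r
# Adjacency matrix of the path graph 1 - 2 - 3
A <- matrix(c(0, 1, 0,
              1, 0, 1,
              0, 1, 0), nrow = 3, byrow = TRUE)

e <- eigen(A, symmetric = TRUE)   # eigenvalues are returned in decreasing order
lambda <- e$values[1]             # largest eigenvalue, i.e. alpha^(-1)
cen <- abs(e$vectors[, 1])        # leading eigenvector (sign is arbitrary)
cen <- cen / max(cen)             # rescale so the most central node scores 1

cen
# The defining equation A c = alpha^(-1) c holds for the rescaled vector too
all.equal(as.numeric(A %*% cen), lambda * cen)
```

The middle node, which is adjacent to both other nodes, receives the highest score.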

    ev.tbnet <- evcent(tbnet2)$vector
    V(tbnet2)$eigenv <- ev.tbnet
    summary(ev.tbnet)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.00000 0.04589 0.08708 0.13830 0.15230 1.00000 

Visualization highlighting Hubs and Authorities

    V(tbnet2)$hubScore <- hub.score(tbnet2)$vector
    V(tbnet2)$authScore <- authority.score(tbnet2)$vector
    summary(tbnet2)

IGRAPH UNW- 173 1502 -- 
+ attr: affil (v/c), numPub (v/n), place (v/c), country (v/c), name (v/c), timesCited (v/n), id (v/c), degree (v/n),
| closeness (v/n), betweenness (v/n), eigenv (v/n), hubScore (v/n), authScore (v/n), key (e/x), subject (e/x), year
| (e/x), wosid (e/x), journal (e/x), title (e/x), timesCited (e/n), doi (e/x), weight (e/n)

Note that the hub scores and authority scores are exactly equal to the eigenvector centrality scores of the nodes in the network: in an undirected network the adjacency matrix is symmetric, so hubs and authorities coincide. Exploring the auth_data data frame below makes this apparent.
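The reason can be verified in base R on a toy graph. Hub scores are the leading eigenvector of \(\mathbf{AA}^T\) and authority scores that of \(\mathbf{A}^T\mathbf{A}\); for a symmetric \(\mathbf{A}\) both equal \(\mathbf{A}^2\), which shares its leading eigenvector with \(\mathbf{A}\). A minimal sketch, assuming a connected, non-bipartite toy graph (a triangle with one pendant node) and an illustrative helper `lead()`:

```r
# Triangle 1-2-3 with a pendant node 4 attached to node 3 (toy example)
A <- matrix(0, 4, 4)
A[rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4))] <- 1
A <- A + t(A)   # symmetric adjacency matrix of an undirected graph

# Leading eigenvector of a symmetric matrix, rescaled to a maximum of 1
lead <- function(M) {
  v <- abs(eigen(M, symmetric = TRUE)$vectors[, 1])
  v / max(v)
}

eig  <- lead(A)            # eigenvector centrality
hub  <- lead(A %*% t(A))   # hub scores
auth <- lead(t(A) %*% A)   # authority scores

all.equal(eig, hub)
all.equal(hub, auth)
```

The three score vectors agree up to numerical precision, which is the pattern visible in the eigenV, hubScore and authScore columns.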

    auth_data <- data.frame(id=as.numeric(V(tbnet2)$id),
                            name=V(tbnet2)$name,
                            numPub=as.numeric(V(tbnet2)$numPub),
                            timesCited=as.numeric(V(tbnet2)$timesCited),
                            degree=V(tbnet2)$degree,
                            closeness=V(tbnet2)$closeness,
                            betweenness=V(tbnet2)$betweenness,
                            eigenV=V(tbnet2)$eigenv,
                            hubScore=V(tbnet2)$hubScore,
                            authScore=V(tbnet2)$authScore)
    auth_data$affiliation<-V(tbnet2)$place
    auth_data$city<-V(tbnet2)$affil
    auth_data$country<-V(tbnet2)$country
    auth_data$city[which(auth_data$city=='LONDON WC1E 7HT')]<-'LONDON'
    auth_data$city[which(auth_data$city=='LONDON WC1')]<-'LONDON'
    auth_data$city[which(auth_data$city=='LONDON NW3 2QG')]<-'LONDON'
    auth_data$country[which(auth_data$city=='COTONOU' & auth_data$country=='FRANCE')]<-'BENIN'
    auth_data$country[which(V(tbnet2)$country=='MA USA')]<-'USA'
    auth_data$city[which(auth_data$city=='BOBO DIOULASSO 01')]<-'BOBO DIOULASSO'

    # Obtaining coordinates
    # library(ggmap)
    # cord<-unlist(geocode(paste(auth_data$city[1],auth_data$country[1],sep=',')))
    # longlat<-data.frame(id=auth_data$id, name=auth_data$name, long=coord[1:1792],lat=coord[1793:3584])
    # write.csv(longlat,file = '~/Documents/sna/graphs/auth_coordinates.csv')

    # Getting node information
    # longlat<-read.csv('./graphs/auth_coordinates.csv',header = T)
    # longlat<-subset(longlat, select = -c(X))
    # auth_data$long<-longlat$long
    # auth_data$lat<-longlat$lat

    # Getting edge list and related information
    # eList<-get.edgelist(tbnet2,names = T)
    # colnames(eList)<-c('source','target')
    # eList<-as.data.frame(eList)
    # new1<-eList
    # new2<-eList
    # new1[]<-auth_data$long[match(unlist(eList),auth_data$name)]
    # new2[]<-auth_data$lat[match(unlist(eList),auth_data$name)]
    # edges<-data.frame(id=1:length(E(tbnet2)),source=eList$source,target=eList$target,weight=E(tbnet2)$weight,timesCited=E(tbnet2)$timesCited,long_source=new1$source,lat_source=new2$source,long_target=new1$target,lat_target=new2$target)

Characterizing Edges: Edge betweenness centrality extends the notion of vertex betweenness by assigning to each edge a value reflecting the number of shortest paths traversing that edge. We compute edge betweenness to assess which co-authorship collaborations are important for the flow of information. We then present the 10 most important collaborations in our tuberculosis co-authorship network.

    eb <- edge.betweenness(tbnet2)
    E(tbnet2)[order(eb, decreasing=T)[1:10]]   # index the edges of the graph eb was computed on

+ 10/1502 edges (vertex names):
 [1] ODOUN MATHIEU  --GNINAFON MARTIN   FAIHUN FRANK   --DE JONG BOUKE C   ODOUN MATHIEU  --TREBUCQ ARNAUD   
 [4] ZELLWEGER J P  --GNINAFON MARTIN   TREBUCQ ARNAUD --ADJONOU CHRISTINE ODOUN MATHIEU  --WACHINOU PRUDENCE
 [7] AFFOLABI DISSOU--BAHSOW OUMOU      AFFOLABI DISSOU--TOUNDOH N         AFFOLABI DISSOU--BEKOU W          
[10] AFFOLABI DISSOU--MAKPENON A       
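To see why high edge betweenness flags collaborations that bridge groups, consider a toy network (hypothetical authors A to F) made of two triangles joined by a single tie. The sketch uses `edge_betweenness()`, the current igraph name for `edge.betweenness()`:

```r
library(igraph)

# Two triangles (A, B, C) and (D, E, F) joined by the bridge C - D
el <- data.frame(from = c("A", "A", "B", "C", "D", "D", "E"),
                 to   = c("B", "C", "C", "D", "E", "F", "F"))
g <- graph_from_data_frame(el, directed = FALSE)

eb <- edge_betweenness(g)
eb

# The bridge is edge 4 (row 4 of el); all 3 x 3 cross-group pairs use it
E(g)[order(eb, decreasing = TRUE)[1]]
```

Every shortest path between the two triangles crosses the bridge, so it carries 9 paths while no other edge carries more than 4.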

Characterizing Network cohesion:

In this section, we assess the extent to which subsets of authors are cohesive with respect to their relations in the co-authorship network. Specifically, we aim to determine whether collaborators (co-authors) of a given author tend to collaborate with each other as well. Which subsets of collaborating authors tend to be more productive in our network? While there are many techniques to determine network cohesion, we choose to investigate local triads and global giant components, clique detection, and clustering (community detection) in our tuberculosis co-authorship network.

Cliques: According to Kolaczyk and Csardi (2009), cliques are complete subgraphs, that is, subsets of nodes in which every pair of nodes is connected by an edge. We compute the clique number (the size of the largest clique) of our tuberculosis co-authorship network, then the number and sizes of the maximal cliques.

    clique.number(tbnet2)

[1] 28

The largest clique in our tuberculosis co-authorship network contains 28 authors.

    table(sapply(maximal.cliques(tbnet2), length))


 3  4  5  6  7  8  9 11 12 13 15 16 19 21 28 
 1  5  7  5  5 10  4  4  1  1  1  2  1  1  1 
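The difference between the clique number and the census of maximal cliques is easy to see on a toy graph (hypothetical data): a complete graph on 4 nodes with one extra node attached by a single edge. The sketch uses `clique_num()` and `max_cliques()`, the current igraph names for `clique.number()` and `maximal.cliques()`.

```r
library(igraph)

g <- make_full_graph(4)      # K4: nodes 1..4, all pairs connected
g <- add_vertices(g, 1)      # add node 5
g <- add_edges(g, c(4, 5))   # attach node 5 to node 4

clique_num(g)                            # size of the largest clique
table(sapply(max_cliques(g), length))    # sizes of the maximal cliques
```

The clique number is 4 (the K4 itself), while the census reports two maximal cliques: one of size 4 and one of size 2 (the edge 4-5).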

Density and related notions of relative frequency: Defined as the frequency of realized edges relative to potential edges, the density of a subgraph \(H\) in \(G\) provides a measure of how close \(H\) is to being a clique in \(G\). Density values vary between 0 and 1: \[den(H)=\frac{|E_H|}{|V_H|(|V_H|-1)/2}\] Here we compute the overall density of our tuberculosis co-authorship network.

    graph.density(tbnet2)

[1] 0.1009544
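This value can be reproduced by hand from the density formula, using the 173 nodes and 1502 edges reported in the summary of the simple graph (the variable names are illustrative):

```r
n_nodes <- 173
n_edges <- 1502

# Realized edges divided by the number of possible node pairs
dens <- n_edges / (n_nodes * (n_nodes - 1) / 2)
dens

# Equivalent form: choose(n, 2) counts the possible pairs
n_edges / choose(n_nodes, 2)
```

About 10% of all possible co-authorship ties are realized.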

We assess the relative frequency with which connected triples in \(G\) close into triangles by computing its transitivity, defined as: \[cl_T = \frac{3\tau_\Delta (G)}{\tau_3 (G)}\] where \(\tau_\Delta (G)\) is the number of triangles in \(G\), and \(\tau_3 (G)\) is the number of connected triples (sometimes referred to as 2-stars). This measure is also referred to as the fraction of transitive triples. It is a measure of the global clustering of \(G\), summarizing the relative frequency with which connected triples close to form triangles.

    transitivity(tbnet2)

[1] 0.6305286

Another analogue of this measure is the local transitivity defined as: \[cl(v)=\tau_\Delta (v)/\tau_3 (v)\] where \(\tau_\Delta (v)\) denotes the number of triangles in \(G\) into which \(v \in V\) falls and \(\tau_3 (v)\) is the number of connected triples in \(G\) for which the two edges are both incident to \(v\). Here we compute the local transitivity for all the nodes in our tuberculosis co-authorship network.
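Both the global and the local versions can be checked by hand on a toy graph (hypothetical data): a triangle 1-2-3 with a pendant node 4 attached to node 3. It has 1 triangle and 5 connected triples (one centered at node 1, one at node 2, three at node 3).

```r
library(igraph)

g <- make_graph(c(1, 2,  1, 3,  2, 3,  3, 4), directed = FALSE)

glob <- transitivity(g)                   # 3 * 1 / 5
loc  <- transitivity(g, type = "local")   # one value per node

glob
loc
```

Nodes 1 and 2 have local transitivity 1, node 3 has 1/3 (one of its three triples is closed), and node 4 has no connected triple centered on it, so its value is undefined (NaN).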

    tr <- transitivity(tbnet2, 'local', vids = 1:length(V(tbnet2)))
    V(tbnet2)$transitivity <- tr
    auth_data$transitivity <- tr

Connectivity, Cuts, and Flows: In this section, we measure how close our tuberculosis co-authorship network is to separating into distinct subgraphs. We are also interested in assessing how well information flows in the network. We start with the concept of connectedness. Since our network is an undirected graph, we do not need to distinguish weak from strong connectivity. A graph \(G\) is said to be connected if every node in \(G\) is reachable from every other node.

    is.connected(tbnet2)

[1] FALSE

From the output above, we conclude that our co-authorship network is not connected.

Often, one of the connected components dominates the others, hence the idea of a giant component. Let’s take a census of the connected components of our co-authorship network:

    comps <- decompose.graph(tbnet2)
    table(sapply(comps, vcount))


 16 157 
  1   1 

From the census of connected components above, we can see that there are 2 components, containing 16 and 157 nodes respectively. The giant component contains \(157/173\approx 90.8\%\) of all the vertices in the network.
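The same census and the share of the largest component can be obtained without splitting the graph, using `components()`. A minimal sketch on a toy graph with two components (hypothetical data; the object names are illustrative):

```r
library(igraph)

# A triangle and a 5-cycle that are not connected to each other
g <- disjoint_union(make_full_graph(3), make_ring(5))

comp <- components(g)
comp$csize                    # component sizes
max(comp$csize) / vcount(g)   # share of nodes in the largest component

# Extract the largest component by size rather than by position
giant <- induced_subgraph(g, which(comp$membership == which.max(comp$csize)))
vcount(giant)
```

Selecting the component by its size avoids relying on the order in which components happen to be returned.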

We further devote closer attention to this giant component.

    comps <- decompose.graph(tbnet2)
    tbnet2.gc <- comps[[which.max(sapply(comps, vcount))]]   # select the largest component explicitly
    summary(tbnet2.gc)

IGRAPH UNW- 157 1382 -- 
+ attr: affil (v/c), numPub (v/n), place (v/c), country (v/c), name (v/c), timesCited (v/n), id (v/c), degree (v/n),
| closeness (v/n), betweenness (v/n), eigenv (v/n), hubScore (v/n), authScore (v/n), transitivity (v/n), key (e/x),
| subject (e/x), year (e/x), wosid (e/x), journal (e/x), title (e/x), timesCited (e/n), doi (e/x), weight (e/n)

Let’s plot this giant component:

    plot(tbnet2.gc, layout=layout.kamada.kawai(tbnet2.gc), vertex.label=NA, vertex.size=2, edge.width=0.08, vertex.color='lightblue')

One important characteristic often observed in a giant component is the so-called small-world property, which refers to the situation wherein the shortest-path distance between pairs of nodes is generally small and the clustering is relatively high. For our tuberculosis co-authorship network, let’s compute the average path length

    average.path.length(tbnet2.gc)

[1] 2.126164

and the diameter (the longest shortest path)

    diameter(tbnet2.gc)

[1] 5

Let’s assess the transitivity of our giant component:

    transitivity(tbnet2.gc)

[1] 0.6140667

Interpretation: We can see that the average path length of the giant component of our co-authorship network is small and the diameter is not much bigger. Hence our giant component shows the characteristics of a small world. In addition, the clustering in this network is high: about 61% of the connected triples close to form triangles.
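One common way to judge whether a clustering of 0.61 is "high" is to compare it with a random graph of the same size. The sketch below is an assumption-laden check that is not part of the analysis above: it takes an Erdős-Rényi G(n, m) graph with the same 157 nodes and 1382 edges as the baseline, with an arbitrary seed.

```r
library(igraph)

set.seed(1)                    # arbitrary seed, for reproducibility only
er <- sample_gnm(157, 1382)    # random graph: same node and edge counts

mean_distance(er)              # average path length of the random baseline
transitivity(er)               # clustering of the random baseline

1382 / choose(157, 2)          # density, roughly the clustering expected by chance
```

In such a random graph the clustering is expected to be close to the density (about 0.11), far below the observed 0.61, while path lengths stay short in both cases.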

We now investigate the concepts of vertex and edge cuts, derived from the concept of vertex (edge) connectivity. The vertex (edge) connectivity of a graph \(G\) is the largest integer \(k\) such that \(G\) is \(k\)-vertex- (edge-) connected.

    vertex.connectivity(tbnet2.gc)

[1] 1

    edge.connectivity(tbnet2.gc)

[1] 2

In the giant component of our co-authorship network, the vertex connectivity is 1 while the edge connectivity is 2. It therefore takes the removal of only a single well-chosen node (author), or of 2 collaboration ties, to break this subgraph into additional components.

A set of nodes (edges) whose removal disconnects the graph is called a vertex cut (edge cut). A single node (author) whose removal disconnects the graph is called a cut vertex, or articulation point, and can provide a sense of where a network is vulnerable. Let’s identify such weak points in our co-authorship network:

    tbnet2.cut.vertices <- articulation.points(tbnet2.gc)
    tbnet2.cut.vertices

+ 1/157 vertex, named:
[1] WACHINOU PRUDENCE

The author listed above is the only articulation point of our co-authorship network, and is also one of its most important nodes.
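A toy "bow-tie" graph (two triangles sharing one node, hypothetical data) reproduces the same combination of a vertex connectivity of 1 and an edge connectivity of 2. The sketch uses the current igraph names `vertex_connectivity()`, `edge_connectivity()` and `articulation_points()`.

```r
library(igraph)

# Triangles 1-2-3 and 3-4-5 share node 3
bowtie <- make_graph(c(1, 2,  2, 3,  1, 3,  3, 4,  4, 5,  3, 5), directed = FALSE)

vertex_connectivity(bowtie)   # removing node 3 alone disconnects the graph
edge_connectivity(bowtie)     # at least two edges must be removed
ap <- articulation_points(bowtie)
ap
```

Node 3 plays the role of the single cut vertex here: every path between the two triangles runs through it.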

    length(tbnet2.cut.vertices)

[1] 1

In our tuberculosis co-authorship network, less than 1% of the nodes are cut vertices (1 of 157), meaning that the vulnerability of the network depends on a very small set of authors.

Graph Partitioning: Regularly framed as a community detection problem, graph partitioning is an unsupervised method used in the analysis of network data to find subsets of nodes that demonstrate a ‘cohesiveness’ with respect to their underlying relational patterns. Cohesive subsets of nodes are generally well connected among themselves and well separated from the other nodes in the graph. Here, we perform two well-established methods of graph partitioning: hierarchical clustering and spectral clustering.

Hierarchical Clustering: Hierarchical clustering methods are of two kinds:
- agglomerative: “based on the successive coarsening of partitions through the process of merging”; it uses modularity as its metric.
- divisive: “based on the successive refinement of partitions through the process of splitting”

Here, we apply the agglomerative method on our tuberculosis co-authorship network:

    com.tbnet2 <- fastgreedy.community(tbnet2)
    V(tbnet2)$community <- com.tbnet2$membership
    length(com.tbnet2)

[1] 6

The agglomerative hierarchical clustering identifies 6 communities in our co-authorship network.
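To illustrate what the agglomerative method recovers, here is a sketch on a toy graph with an obvious community structure: two complete groups of 5 nodes joined by a single tie (hypothetical data). It uses `cluster_fast_greedy()`, the current igraph name for `fastgreedy.community()`.

```r
library(igraph)

# Two cliques of 5 nodes (1..5 and 6..10) joined by the single edge 1 - 6
g <- disjoint_union(make_full_graph(5), make_full_graph(5))
g <- add_edges(g, c(1, 6))

com <- cluster_fast_greedy(g)

length(com)       # number of communities found
sizes(com)        # community sizes
membership(com)   # community label of each node
modularity(com)   # modularity of the resulting partition
```

The greedy merging stops at the partition with the highest modularity, which here separates the two cliques.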

    sizes(com.tbnet2)

Community sizes
 1  2  3  4  5  6 
58 25 16 31 14 29 

The largest community contains 58 authors. The remaining communities contain between 14 and 31 authors.

Let’s now visualize the communities:

    plot(com.tbnet2, tbnet2, vertex.label = '',
         layout = l.tbnet2,
         mark.groups = NULL,
         # vertex.size = 3,
         # edge.color = memb.tbnet2,
         edge.width = 0.08,
         vertex.size = 1 + 4 * sqrt(bw.tbnet/1000)
    )

    gc.com <- fastgreedy.community(tbnet2.gc)
    length(gc.com)

[1] 9

There are 9 communities in the giant component of our network.

    sizes(gc.com)

Community sizes
 1  2  3  4  5  6  7  8  9 
49 25 12 12  5  4 14  5 31 

The 9 communities contain between 4 and 49 authors.

Let’s now visualize the 9 communities in the giant component:

    l.gc <- layout.kamada.kawai(tbnet2.gc)
    bw.gc <- betweenness(tbnet2.gc)
    V(tbnet2.gc)$community <- membership(gc.com)

    plot(gc.com, tbnet2.gc, vertex.label = '',
         # layout = l.tbnet2,
         layout = l.gc,
         mark.groups = NULL,
         # vertex.size = 3,
         # edge.color = memb.tbnet2,
         edge.width = 0.06,
         vertex.size = 1 + 4 * sqrt(bw.gc/1000)
    )

Let’s plot the 9 communities separately:

    par(cex.main = 3)
    par(mfrow = c(5, 2))
    # Plot original giant component graph
    plot(gc.com, tbnet2.gc, vertex.label = '',
         main = 'Main component',
         # layout = l.tbnet2,
         layout = l.gc,
         mark.groups = NULL,
         # vertex.size = 3,
         # edge.color = memb.tbnet2,
         edge.width = 0.05,
         vertex.size = 1 + 4 * sqrt(bw.gc/1000)
    )
    for (i in 1:9) {
      g <- which(V(tbnet2.gc)$community == i)   # vertices of community i
      G.group <- subgraph(tbnet2.gc, g)
      plot(G.group, vertex.label = '',
           main = paste('Community/Partition ', i),
           # layout = l.tbnet2,
           layout = l.gc[g, ],
           mark.groups = NULL,
           vertex.color = gc.com$membership[g],
           # vertex.size = 3,
           # edge.color = memb.tbnet2,
           edge.width = 0.06,
           vertex.size = 1 + 4 * sqrt(bw.gc[g]/1000)   # betweenness of this community's vertices only
      )
    }

At structural_properties.c:1945 :igraph_subgraph is deprecated from igraph 0.6, use igraph_induced_subgraph instead

We finally save all our generated R objects for later use.

    save(tbnet, file = './Rdata/TBnet.rda')
    save(tbnet2, file = './Rdata/TBnet2.rda')
    save(auth_data, file = './Rdata/TBauth_data.rda')
    save(edges, file = './Rdata/TBedges.rda')   # edges is only built by the commented-out edge-list code above
    save(tbnet2.gc, file = './Rdata/TBnet2.gc.rda')

    # source('plotly_map.R')

NEXT TUTORIAL: Mathematical Modeling for Network Graphs

