Using a dataset of authors who have collaborated on a scientific paper, we analyze the network of co-authorships. From this network we calculate centrality measures to identity the key actors in the network.
df <- read_csv("~/Library/CloudStorage/Box-Box/Taurean/Year_1/SSNA/authorship_edgelist.csv")
## New names:
## * `` -> ...1
## Rows: 1283 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): V1, V2
## dbl (1): ...1
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df2 <- as.matrix(df)
df2 <- df2[,-1]
head(df2)
## V1 V2
## [1,] "langston, wj" "bebianno, mj"
## [2,] "bebianno, mj" "langston, wj"
## [3,] "bebianno, mj" "langston, wj"
## [4,] "bebianno, mj" "langston, wj"
## [5,] "nott, ja" "bebianno, mj"
## [6,] "bebianno, mj" "nott, ja"
Note for some weird reason I cannot get labels to show up on my plot
G<-graph_from_edgelist (df2, directed = FALSE )
G <- simplify(G)
V(G)$label <- V(G)$name
print(G)
## IGRAPH 2382d0e UN-- 173 734 --
## + attr: name (v/c), label (v/c)
## + edges from 2382d0e (vertex names):
## [1] langston, wj--bebianno, mj langston, wj--nott, ja
## [3] langston, wj--serafim, ma langston, wj--coelho, mr
## [5] langston, wj--company, rm langston, wj--mingjiang, z
## [7] langston, wj--simkiss, k langston, wj--ryan, kp
## [9] bebianno, mj--nott, ja bebianno, mj--serafim, map
## [11] bebianno, mj--mudge, sm bebianno, mj--machado, lm
## [13] bebianno, mj--caetano, m bebianno, mj--falcao, m
## [15] bebianno, mj--gibbs, pe bebianno, mj--east, ja
## + ... omitted several edges
plot(G, vertex.size = 5, vertex_label = V(G)$label, edge.arrow.size =.25)
We calculate two centrality measures, betweenness and eigenvector centrality. Then we graph the regression line of the equation predicting eigenvector centrality based on betweenness. Based on our dataset we see that actor 2, Bebbianno, Mj is the most central actor of the network in terms of eigenvector centrality and betweenness. However, he still falls below the predicted value of eigenvector centrality based on his betweenness. This may be due to the fact that eigenvector centrality highest value is at 1. An interesting thing is that this data have several actors with low to zero betweenness centrality.
#Data frame creation
cent<-data.frame(name = V(G)$name, deg = degree(G), bet=betweenness(G, normalized= TRUE),eig=evcent(G)$vector) # evcent returns lots of data associated with the EC, but we only need the leading eigenvector
rownames(cent)<-rownames(df2) #Ids in this case
residuals<-lm(eig~bet,data=cent)$residuals # We just save the residuals from OLS which will be used in the next steps
cent<-transform(cent,res=residuals)
library(ggplot2) #Powerful but a little bit complicated platform
# Spelling out, cent is the dataset and we specifically are asking for two variables eig and bet, in addition we need the row names to be considered in the plot. Additionally I am asking the nodes or names to be weighed by the size of the residuals and colored by that as well.
# We use the residuals to color and shape the points of our plot, making it easier to spot outliers.
#x11()
#pdf("key actor.pdf", 15, 15)
p<-ggplot(cent,aes(x=bet,y=eig, label=rownames(cent),colour=res, size=(1/abs(res))))+xlab("Betweenness Centrality")+ylab("Eigenvector Centrality")
# p is the graph all the information necessary to, yet it requires some extra commands called adding layers
# geom_point() will add bubbles
p+geom_point()+labs(title="Key Actor Analysis for WTCC")
# geom_text() adds names instead
p+geom_text()+labs(title="Key Actor Analysis for WTCC")
# We, of course cn add both, just adjusting them so that they do not overlap
p + geom_point() + geom_text(hjust=1.5, vjust=1)+labs(title="Key Actor Analysis for WTCC")
#We can add the best line possible based on the data, to do this we need to compute the linear regression one more time and save the $\alpha$ and $\beta$ coefficients
coeffs<-as.data.frame(coef(lm(eig~bet,data=cent)))
#Then we just simply add the new variables to our old graph
p + geom_point() + geom_text(hjust=1.5, vjust=1)+labs(title="Key Actor Analysis for Authorship") + geom_abline(intercept = coeffs[1,], slope = coeffs[2,],colour = "red", size = 2,alpha=.25)
#Finally, we can get rid of the legend since the regression line is doing a good job guiding us to detect outliers
p + geom_point() + geom_text(hjust=1.5, vjust=1)+labs(title="Key Actor Analysis for Authorship") + geom_abline(intercept = coeffs[1,], slope = coeffs[2,],colour = "red", size = 2,alpha=.25) + theme(legend.position = "none")
dev.off()
## null device
## 1
topeig <- cent[order(-cent$eig),]
head(topeig)
## name deg bet eig res
## 2 bebianno, mj 172 0.863571072 1.0000000 -0.0430273
## 29 company, r 61 0.037441634 0.6280942 0.4715420
## 21 serafim, a 57 0.030615136 0.6104133 0.4611862
## 45 lopes, b 32 0.005575955 0.4445187 0.3221598
## 22 cravo, a 33 0.007618206 0.4429558 0.3184054
## 65 gomes, t 22 0.002940976 0.3138213 0.1942898
topeig <- cent[order(-cent$bet),]
head(topeig)
## name deg bet eig res
## 2 bebianno, mj 172 0.863571072 1.0000000 -0.0430273
## 29 company, r 61 0.037441634 0.6280942 0.4715420
## 21 serafim, a 57 0.030615136 0.6104133 0.4611862
## 22 cravo, a 33 0.007618206 0.4429558 0.3184054
## 45 lopes, b 32 0.005575955 0.4445187 0.3221598
## 65 gomes, t 22 0.002940976 0.3138213 0.1942898
#####################################################################
###################Figure 2##############################
library(igraph)
l <- layout_with_kk(G)
#V(G)$name<-rownames(df2)
V(G)$size<-abs((cent$bet)/max(cent$bet))*25 #The divisor is the highest betweenness
nodes<-V(G)$name # Setting a variable to manipulte names, nodes contains the IDs of the participants
#x<-quantile(cent$eig, .975)
#nodes[which(abs(cent$eig)<=(x))]<-NA # this gives the top 10%
V(G)$color <- rgb(0, 139, 0, max=255, 255/2)
V(G)$frame.color <- NA
E(G)$color <- rgb(161, 161, 161, max=255, 255/2)
#How many nodes would have names?
table(is.na(nodes))
##
## FALSE
## 173
plot(G,layout=l,vertex.label=nodes, vertex.label.dist=0.25,vertex.label.cex=0.25, vertex.label.color=rgb(0, 104, 139, max=255, 255/3),edge.width=1)
#pdf("actor_plot no isolates.pdf", 25, 25)
plot(G,layout=l,vertex.label=nodes, vertex.label.dist=0.0,vertex.label.cex=0.25, vertex.label.color=rgb(0, 104, 139, max=255, 255/3),edge.width=1)
title(main="Kek Actor Analysis LATTC", sub="Key actors weigthed by eigenvector and betweenness centrality", col.main="black", col.sub="black", cex.sub=1.2,cex.main=2,font.sub=2)
dev.off()
## null device
## 1
Top 5% includes 9 names, which means an eigenvector centrality above .27
#####################################################################
###################Figure 2##############################
library(igraph)
l <- layout_with_kk(G)
#V(G)$name<-rownames(df2)
V(G)$size<-abs((cent$bet)/max(cent$bet))*25 #The divisor is the highest betweenness
nodes<-V(G)$name # Setting a variable to manipulte names, nodes contains the IDs of the participants
x<-quantile(cent$eig, .95)
x
## 95%
## 0.270474
nodes[which(abs(cent$eig)<=(x))]<-NA # this gives the top 10%
V(G)$color <- rgb(0, 139, 0, max=255, 255/2)
V(G)$frame.color <- NA
E(G)$color <- rgb(161, 161, 161, max=255, 255/2)
#How many nodes would have names?
table(is.na(nodes))
##
## FALSE TRUE
## 9 164
plot(G,layout=l,vertex.label=nodes, vertex.label.dist=0.25,vertex.label.cex=0.25, vertex.label.color=rgb(0, 104, 139, max=255, 255/3),edge.width=1)
#pdf("actor_plot no isolates.pdf", 25, 25)
plot(G,layout=l,vertex.label=nodes, vertex.label.dist=0.0,vertex.label.cex=0.25, vertex.label.color=rgb(0, 104, 139, max=255, 255/3),edge.width=1)
title(main="Kek Actor Analysis Authorship", sub="Key actors weigthed by eigenvector and betweenness centrality", col.main="black", col.sub="black", cex.sub=1.2,cex.main=2,font.sub=2)
dev.off()
## null device
## 1