This is a report for the first assignment of the AMSI Summer school course Complex Networks
There are 32 files each is a vole networks. The networks are represented by adjacent matrixs. In this assignment, I need to:
1. calculate the average path length for all networks: using breadth first seaerch algorithm to transfer adjacent matrix to path matrix.
2. calculate the two clustering coefficients.
3. analyse how the average path length is affected by the size of the network and discuss if the vole networks fit the definition of small-world.
In order to calculate average path length, I need to implement breadth first seaerch algorithm to transfer adjacent matrix to path matrix.
# transfer the adjacent matrix to path matrix
## search function for one coloumn
s <- function(x){
b = which(x == 1) # return index of those there is an edge
b
}
## search function for target coloumns
myf01 <- function(x){
x <- as.matrix(x)
mylist <- apply(x, 2, s) # return index of edges for each target node. list
b <- as.vector(unlist(mylist)) # transfer list to a single vector
index <- duplicated(b) # find duplicate element
b <- b[!index] # delete duplicate element
b
}
## transfer function, using function s and myf01
breadthfirst1 <- function(B){
B <- as.matrix(B)
n <- ncol(B) # noumber of nodes
P <- B
P[,] = 0 # initiate the path matrix, a matrix with n*n size and everywhere is 0
visited <- NULL # initiate visited vector
for (r in 1:n){ # choose a root in every iteration
A <- B
verte <- myf01(A[,r]) # search nodes who are linked to the root
P[verte, r] <- 1 # for rth coloumn, set those nodes to be 1 in Path matrix
A[c(r, verte), ] <- 0 # set the visited nodes' rows to be 0 in the Adjacent matrix
visited <- c(r, verte) # add to visited
p <- 2 # path length
while (length(verte) > 0){
vvv <- myf01(A[, verte])
P[vvv, r] <- p
A[vvv, ] <- 0
visited <- c(visited, vvv)
verte <- vvv
p <- p + 1
}
}
P
}
Because there exist disconnected graphs, I need to think about how to calculate the avearage path length.
In a network, the average path length is the average shortest path between two nodes. If two nodes are disconnected, meaning there is no path between them, then the path length between them is in infinite. As a consequence, if a network contains disconnected components (collections of nodes that have no paths between them), then the mean path length also diverges to infinity. One way to avoid this problem is to calculate average path length only from nodes in the largest connected component.(Nykamp DQ, “Mean path length definition.”)
Thus here in the code I first delete isolate nodes. Then I identify disconnected parts and calculate the average path length for the largest connected component.
# deal with the disconnected part
# delete isolate nodes
deleteisolate <- function(A){
a <- apply(A, 2, function(x) sum(x) == 0)
A[!a, !a]
}
# identify disconnected part and get pathmatrix for each part
disjointpathmatrix <- function(A){
p <- A
q <- A
diag(p) <- 1
if (sum(p == 0) != 0){
a <- apply(p,2, function(x) which(x == 0))
index <- as.integer(row.names(p))
groups <- list()
pathmatrixs <- list()
i <- 1
while (length(a) != 0){
group <- lapply(a, function(x) identical(a[[1]], x))
groups[i] <- list(NULL)
groups[[i]] <- index[unlist(group)]
pathmatrixs[i] <- list(NULL)
pathmatrixs[[i]] <- q[unlist(group), unlist(group)]
q <- q[!unlist(group), !unlist(group)]
index <- index[!unlist(group)]
a <- a[!unlist(group)]
i <- i+1
}
pathmatrixs
}
else A
}
# calculate average path length for each part
averagepathlength <- function(A){
if (is.matrix(A)){
n <- ncol(A)
sum(A)/(n*(n-1))
}
else {
size <- unlist(lapply(A, ncol))
a <- which(size == max(size))
lapply(A, function(x) sum(x)/(ncol(x)*(ncol(x)-1)))[a]
}
}
By implementing the two formular in the lecture notes I wrote the following function to calculate two clastering coefficients.
# Clustering coefficients: first method
clustercoefficient1 <- function(a){
A <- as.matrix(a)
n <- ncol(A)
N <- 0
N3 <- 0
for (i in 1:n){
for (j in 1:n){
for (k in 1:n){
if (k > j & j> i) {
N <- N + A[i,j]*A[i,k]*A[j,k]
N3 <- N3 + A[i,j]*A[i,k] + A[j,i]*A[j,k] + A[k,i]*A[k,j]
}
}
}
}
3*N/N3
}
# Clustering coefficients: second method
clustercoefficient2 <- function(a){
A <- as.matrix(a)
n <- ncol(A)
N <- 0
N3 <- 0
Ci <- c()
for (i in 1:n){
for (j in 1:n){
for (k in 1:n){
if (k > j) {
N <- N + A[i,j]*A[i,k]*A[j,k]
N3 <- N3 + A[i,j]*A[i,k]
}
}
}
if(N3 != 0){
Ci <- c(Ci, N/N3)
}
else Ci <- c(Ci, 0)
N <- 0
N3 <- 0
}
sum(Ci)/n
}
# get the size, mean degree, average path length(select the max one) and two coefficient for each table
results <- function(a){
a <- as.matrix(a)
mytable <- deleteisolate(a) # delete isolate node first
pathmatrix <- breadthfirst1(mytable) # then make the pathmatrix
pathmatrixs <- disjointpathmatrix(pathmatrix)
averagepl <- max(unlist(averagepathlength(pathmatrixs)))
coeff1 <- clustercoefficient1(a)
coeff2 <- clustercoefficient2(a)
n <- ncol(a) # size of the network
lamada <- sum(apply(a, 2, sum))/n # mean degree
c(n, lamada, averagepl, coeff1, coeff2)
}
After creating all the functions, I read the 32 networks into R and using these functions to calculate the results.
# read all the tables in R and do the analysis
filename <- c()
for (i in 1:32){
filename[i] <- paste("gvole",i,".txt", sep = "")
}
analysis <- c()
options(digits=2)
setwd("~/Desktop/2016AMSI/complex networks/vole_adjacency_matrices")
for (i in 1:32){
a <- read.table(filename[i])
result <- results(a)
analysis <- cbind(analysis, result)
}
myresult <- as.data.frame(t(analysis))
row.names(myresult) <- 1:32
colnames(myresult) <- c("size", "average degree", "average path length", "cluster coefficient 1", "cluster coefficient 2")
myresult$bigo <- log(myresult$`size`)/log(myresult$`average degree`)
The following table is the results from previous calculation and ordered by the network size. And I draw three figures to analysis if vole networks fit the definition of small-world.
myresultordered <- myresult[order(myresult$`size`),]
myresultordered
## size average degree average path length cluster coefficient 1
## 21 17 0.3529412 1.666667 0.0000000
## 22 19 1.6842105 2.400000 0.4615385
## 20 21 0.8571429 1.300000 0.8000000
## 17 28 2.6428571 1.755556 0.5825243
## 1 29 2.2758621 3.409357 0.5526316
## 19 37 1.6756757 2.928105 0.3913043
## 5 43 5.4883721 2.848283 0.4722639
## 16 49 4.9795918 3.784864 0.5581395
## 27 49 2.8979592 3.752228 0.3954545
## 2 50 4.0800000 3.019231 0.4639831
## 18 64 3.7500000 4.000726 0.5060241
## 4 65 4.0000000 5.726062 0.4921466
## 32 69 3.4202899 5.590430 0.4230769
## 28 77 8.0000000 2.871497 0.5154420
## 26 81 3.2345679 3.816667 0.5315534
## 3 84 6.2857143 2.957242 0.4746094
## 6 87 6.9425287 2.918487 0.4570082
## 23 96 5.7291667 3.358098 0.4673807
## 29 98 5.3877551 3.442105 0.4428822
## 31 98 4.3877551 4.905285 0.5270270
## 25 107 5.4579439 4.270707 0.5188776
## 8 124 3.8870968 4.773226 0.4510135
## 10 125 4.0160000 5.862643 0.5436242
## 30 135 6.9037037 3.653841 0.4878301
## 9 141 4.9787234 5.060367 0.5236052
## 15 141 4.1560284 8.488258 0.5593220
## 24 141 6.9929078 3.369513 0.4448952
## 7 149 7.0469799 3.818441 0.5344353
## 11 164 8.0121951 3.676865 0.4974801
## 13 219 5.7534247 5.518799 0.5680603
## 14 232 5.2758621 4.295957 0.5795738
## 12 242 8.9504132 3.839340 0.5081858
## cluster coefficient 2 bigo
## 21 0.0000000 -2.720441
## 22 0.2368421 5.648295
## 20 0.1666667 -19.750302
## 17 0.5892857 3.428686
## 1 0.3471264 4.094679
## 19 0.2424710 6.994968
## 5 0.5901859 2.209051
## 16 0.5924684 2.424285
## 27 0.3512148 3.657703
## 2 0.4800577 2.782186
## 18 0.4359003 3.146484
## 4 0.5411622 3.011184
## 32 0.5610766 3.443132
## 28 0.6465211 2.088929
## 26 0.5502254 3.743476
## 3 0.6253054 2.410306
## 6 0.6173597 2.304787
## 23 0.6362381 2.614818
## 29 0.6491109 2.722457
## 31 0.6008437 3.100428
## 25 0.6117751 2.753465
## 8 0.5866843 3.550427
## 10 0.6181177 3.472891
## 30 0.6425180 2.538886
## 9 0.6206059 3.083006
## 15 0.6113849 3.473887
## 24 0.6612692 2.544485
## 7 0.6652077 2.562710
## 11 0.6723626 2.450722
## 13 0.7861623 3.079830
## 14 0.8099106 3.274968
## 12 0.7420513 2.504421
library(ggplot2)
qplot(myresultordered$`bigo`[-c(1,3)], myresultordered$`average path length`[-c(1,3)]) + ylab("average path length") + xlab("log(n)/log(λ)") + ggtitle("plot 1")
qplot(1:32, myresultordered$`cluster coefficient 1`) + geom_abline(intercept = 0.5, slope = 0) + ylab("cluster coefficient 1") + xlab("networks") + ggtitle("plot 2")
qplot(1:32, myresultordered$`cluster coefficient 2`) + geom_abline(intercept = 0.5, slope = 0) + ylab("cluster coefficient 2") + xlab("networks") + ggtitle("plot 3")
According to the definition of small-world, A small-world graph, G, is a graph with relatively small average path length and a relatively high cluster coefficient., we check if
1. average path length grow with network size as O(log(n)/log(λ)) where λ is the mean degree of the network.
2. cluster coefficient greater than 0.5
From plot1, we can see that average path length is growing as O(log(n)/log(λ)) getting bigger.(I delete two negtive number in O(log(n)/log(λ) since they don’t make sense and make the figure unclear).
Thus I think the vole network meet the requirement 1.
From plot2 and plot3, we can see that not all the cluster coefficient are larger than 0.5.
For plot2, half of the cluster coefficient are larger than 0.5 and these value tend to be bigger as size growing.
For plot3, all the large size networks’ cluster coefficient are larger than 0.5.
Thus I think the vole network meet the requirement2.
In total, we can say that vole network is a small-world network.