library('igraph')
library('ggplot2')
GraphML is XML-base file format for graphs. It has quite clear and flexible format. Along with XML it represents information in hierarchical way:
<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
<graph id="G" edgedefault="undirected">
<node id="n0"/>
<node id="n1"/>
<node id="n2"/>
<node id="n3"/>
<node id="n4"/>
<node id="n5"/>
<node id="n6"/>
<node id="n7"/>
<node id="n8"/>
<node id="n9"/>
<node id="n10"/>
<edge source="n0" target="n2"/>
<edge source="n1" target="n2"/>
<edge source="n2" target="n3"/>
<edge source="n3" target="n5"/>
<edge source="n3" target="n4"/>
<edge source="n4" target="n6"/>
<edge source="n6" target="n5"/>
<edge source="n5" target="n7"/>
<edge source="n6" target="n8"/>
<edge source="n8" target="n7"/>
<edge source="n8" target="n9"/>
<edge source="n8" target="n10"/>
</graph>
</graphml>
Another flexible format and a bit more readible format is GML (Graph Modelling Language).
graph
[
directed 1
node
[
id A
label "Node A"
]
node
[
id B
label "Node B"
]
node
[
id C
label "Node C"
]
edge
[
source B
target A
label "Edge B to A"
]
edge
[
source C
target A
label "Edge C to A"
]
]
Pajek format emerges from a large network analysis tool. It is not that readible, however it is often used to store big netoworks
*Vertices 82670
1 "entity"
2 "thing"
3 "anything"
4 "something"
5 "nothing"
6 "whole"
*arcs
1 2 5161
1 9 5615
2 9 7894
2 3 812
2 8 123
3 8 15
3 4 456
4 5 456
4 7 4
5 7 4568
5 6 456
6 4 4849
In the first case this is simply a pair of connected node that may be supported by a weight.
a;b
c;d
In the second case this is a sequence of nodes – each line contains neigbours of a node that is stays on the first place
1 3 4 5
2 3 8
3 9
A very rarely used format as it stores minimum amount of information about network and uses lots of space..
;A;B;C;D;E
A;0;1;0;1;0
B;1;0;0;0;0
C;0;0;1;0;0
D;0;1;0;1;0
E;0;0;0;0;0
Basically a single command is sufficient to read network data to R:
g <- read.graph(file = '...',format = '...', )
Only in case of adjacency matrix you have to do a bit more..
dat <- read.csv(file = '...', sep = ' ')
m <- as.matrix(dat)
g <- graph.adjacency(m, mode='...', weighted = ...)
Open extract files from supplimentary achive. Try to load and plot some of the networks there.
Now, we are goint to generate power-law distribution syntetically.
During this part we will determine the meaning of α and try to estimate it.
The functions below are for generating random variables distributed according to power-law and creating histogramm based on generated data.
generate_dist = function(xmin, xmax, alpha, size) {
r <- runif(size)
return((xmin^(-alpha+1)+r*(xmax^(-alpha+1)-xmin^(-alpha+1)))^(1/(-alpha+1)))
}
lorenz = function(h){
prob <- rev(h$counts/sum(h$counts))
cumProb <- cumsum(prob)
wage <- cumsum(prob*rev(h$mids))/sum(prob*rev(h$mids))
return(data.frame(wage=wage, prob=cumProb))
}
That is a very legitimate question. In fact α shows how fairly degree of the distribution is spread in the networks. To show this, lets generate distributions with various parametes.
xmin <- 1
xmax <- 100
alpha <- 3.1
n <- 1000
x <- generate_dist(xmin, xmax, alpha, n)
h <- hist(x, breaks = length(unique(x)))
l <- lorenz(h)
qplot(data=l, x = wage, y = prob, geom = 'line') +
xlab('Proportion of top degrees') +
ylab('Proportion of nodes')
As you may see, the bigger α is the more fairly degree is distrubuted among the vertices.
We are using build-in igraph function to estimate α of destribution
g = read.graph('networks/netscience.gml', format='gml')
d = degree(g)
h = hist(d, breaks = 1000)
fit = power.law.fit(d, 1, implementation = "plfit")
alpha = fit$alpha
print(alpha)
## [1] 2
The final part of the task is to calculate parameters and goodness-of-fit for several networks: - netscience.gml - Cit-HepPh.txt
Hints:
g = read.graph ...
degree(g)
power.law.fit
fit$alpha
fit$KS.p
Path is a finite or infinite sequence of edges which connect a sequence of vertices
g <- graph.formula(1-2-3-4-5-6, 1-3-7, 1-8-4)
plot(g)
shortest.paths(g)
## 1 2 3 4 5 6 7 8
## 1 0 1 1 2 3 4 2 1
## 2 1 0 1 2 3 4 2 2
## 3 1 1 0 1 2 3 1 2
## 4 2 2 1 0 1 2 2 1
## 5 3 3 2 1 0 1 3 2
## 6 4 4 3 2 1 0 4 3
## 7 2 2 1 2 3 4 0 3
## 8 1 2 2 1 2 3 3 0
The shortest path from one node to others
get.shortest.paths(g,1)$vpath
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1 2
##
## [[3]]
## [1] 1 3
##
## [[4]]
## [1] 1 3 4
##
## [[5]]
## [1] 1 3 4 5
##
## [[6]]
## [1] 1 3 4 5 6
##
## [[7]]
## [1] 1 3 7
##
## [[8]]
## [1] 1 8
get.shortest.paths(g,1,4)$vpath
## [[1]]
## [1] 1 3 4
get.all.shortest.paths(g,1,4)$res
## [[1]]
## [1] 1 8 4
##
## [[2]]
## [1] 1 3 4
diameter(g) # the length of the "longest shortest path"
## [1] 4
get.diameter(g)
## [1] 1 3 4 5 6
E(g)$color <- "blue"
E(g, path=get.diameter(g))$color <- "red"
plot(g)
radius(g) # is the length of the shortest paths among longest ones
## [1] 2
The shortest path between two certain nodes
get.shortest.paths(g,1,4)$vpath # only one arbitary path
## [[1]]
## [1] 1 3 4
get.all.shortest.paths(g,1,4)$res # all paths
## [[1]]
## [1] 1 8 4
##
## [[2]]
## [1] 1 3 4
diameter(g) # the length of the "longest shortest path"
## [1] 4
The longest path among shortest paths
diameter(g) # the length of the "longest shortest path"
## [1] 4
get.diameter(g)
## [1] 1 3 4 5 6
E(g)$color <- "blue"
E(g, path=get.diameter(g))$color <- "red"
plot(g)
radius(g) # is the length of the shortest paths among longest ones
## [1] 2
Recall from the previous seminar
vcount(g) # number of nodes
## [1] 8
ecount(g) # number of edges
## [1] 9
E(g) # list of edges
## Edge sequence:
##
## [1] 2 -- 1
## [2] 3 -- 1
## [3] 8 -- 1
## [4] 3 -- 2
## [5] 4 -- 3
## [6] 7 -- 3
## [7] 5 -- 4
## [8] 8 -- 4
## [9] 6 -- 5
V(g) # list of nodes
## Vertex sequence:
## [1] "1" "2" "3" "4" "5" "6" "7" "8"
Geting a histogram, by calculating the shortest path length between each pair of vertices
g<-graph.formula(1-2-3-4-5-6, 1-3-7, 1-8-4, 3-9-10-12, 11-12-13-3)
tab <- as.table(path.length.hist(g)$res)
names(tab) <- 1:length(tab)
barplot(tab)
The density of a graph is the ratio of the number of edges and the number of possible edges.
graph.density(g)
## [1] 0.1923077
The data was collected from the members of a university karate club by Wayne Zachary in 1977. Each node represents a member of the club, and each edge represents a tie between two members of the club. Zachary studied conflict and fission in this community network, as the karate club was split into two separate clubs. The network is very small: it has 34 vertices and 78 undirected edges.
g <- graph.famous("Zachary")
plot(g)
graph.density(g)
## [1] 0.1390374
diameter(g) # the length of the "longest shortest path"
## [1] 5
E(g)$color <- "blue"
E(g, path=get.diameter(g))$color <- "red"
plot(g,layout = layout.fruchterman.reingold)
tab <- as.table(path.length.hist(g)$res)
names(tab) <- 1:length(tab)
barplot(tab,legend = c(path.length.hist(g)$res),main = "The shortest path lengths distribution", col = c("lightblue", "mistyrose", "lightcyan","lavender", "lightgreen"),args.legend = list(title = "The number of paths"),xlab = "Path length", ylab = "The number of paths" )
See more information on http://snap.stanford.edu/data/cit-HepPh.html
g <- read.graph("./networks/Cit-HepPh.txt", format="edgelist")
vcount(g)
## [1] 9912554
ecount(g)
## [1] 421578
tab <- as.table(path.length.hist(g)$res)
max(tab)
## [1] 42044044
min(tab)
## [1] 9
names(tab) <- 1:length(tab)
barplot(tab,main = "The shortest path lengths distribution",xlab = "Path length", ylab = "The number of paths", cex.names = 0.5)
g <- graph.ring(8)
plot(g)
shortest.paths(g)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 0 1 2 3 4 3 2 1
## [2,] 1 0 1 2 3 4 3 2
## [3,] 2 1 0 1 2 3 4 3
## [4,] 3 2 1 0 1 2 3 4
## [5,] 4 3 2 1 0 1 2 3
## [6,] 3 4 3 2 1 0 1 2
## [7,] 2 3 4 3 2 1 0 1
## [8,] 1 2 3 4 3 2 1 0
get.diameter(g)
## [1] 1 2 3 4 5
diameter(g)
## [1] 4
radius(g)
## [1] 4
g <- graph.formula(1-2-3-4-5-6,7-8-9-3-10, 4-11)
E(g)$color <- "blue"
E(g, path=get.diameter(g))$color <- "red"
plot(g)
shortest.paths(g)
## 1 2 3 4 5 6 7 8 9 10 11
## 1 0 1 2 3 4 5 5 4 3 3 4
## 2 1 0 1 2 3 4 4 3 2 2 3
## 3 2 1 0 1 2 3 3 2 1 1 2
## 4 3 2 1 0 1 2 4 3 2 2 1
## 5 4 3 2 1 0 1 5 4 3 3 2
## 6 5 4 3 2 1 0 6 5 4 4 3
## 7 5 4 3 4 5 6 0 1 2 4 5
## 8 4 3 2 3 4 5 1 0 1 3 4
## 9 3 2 1 2 3 4 2 1 0 2 3
## 10 3 2 1 2 3 4 4 3 2 0 3
## 11 4 3 2 1 2 3 5 4 3 3 0
diameter(g)
## [1] 6
get.diameter(g)
## [1] 6 5 4 3 9 8 7
radius(g)
## [1] 3
set.seed(1)
g <- graph.formula(1-2-3-4-5-1)
E(g)$weight <- sample(seq_len(ecount(g)+5))
## Warning in eattrs[[name]][index] <- value: число единиц для замены не
## является произведением длины замены
E(g)$color <- "blue"
E(g, path=get.diameter(g))$color <- "red"
plot(g,edge.label = E(g)$weight)
shortest.paths(g)
## 1 2 3 4 5
## 1 0 3 8 6 4
## 2 3 0 5 9 7
## 3 8 5 0 7 9
## 4 6 9 7 0 2
## 5 4 7 9 2 0
diameter(g)
## [1] 9
diameter(g, weights=NA)
## [1] 2
E(g)$color <- "blue"
E(g, path=get.diameter(g, weights=NA))$color <- "red"
plot(g, edge.label = E(g)$weight)
shortest.paths(g, weights=NA)
## 1 2 3 4 5
## 1 0 1 2 2 1
## 2 1 0 1 2 2
## 3 2 1 0 1 2
## 4 2 2 1 0 1
## 5 1 2 2 1 0
A connected component is a maximal connected subgraph of G. Each vertex belongs to exactly one connected component, as does each edge. To check, whether the graph is connected use command
is.connected(g)
A directed graph is called weakly connected if replacing all of its directed edges with undirected edges produces a connected (undirected) graph. It is connected if it contains a directed path from u to v or a directed path from v to u for every pair of vertices u, v. It is strongly connected if it contains a directed path from u to v and a directed path from v to u for every pair of vertices u, v.
clusters(graph, mode=c("weak", "strong")) # select the required mode
no.clusters(graph, mode=c("weak", "strong"))
g<-read.graph(file='./networks/tutorial2_1.txt', format="edgelist")
plot(g)
clusters(g, mode=c("weak")) # select the required mode
## $membership
## [1] 1 2 3 1 1 1 1 4 4 4 4 2 1 4 4 4 4 4 4 2 4 4 4 4 4 2 2 2 2 1 1 1 1 1 2
## [36] 1 2 1 2 2 2 2 1 1 1 1
##
## $csize
## [1] 17 13 1 15
##
## $no
## [1] 4
clusters(g, mode=c("strong")) # select the required mode
## $membership
## [1] 35 25 24 19 20 21 22 13 14 17 4 2 36 6 5 7 8 11 12 26 9 10 15
## [24] 16 18 27 28 1 3 23 40 37 38 39 29 41 30 42 32 31 33 34 44 45 43 46
##
## $csize
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1
##
## $no
## [1] 46
lst <-cluster.distribution(g, cumulative = FALSE)
distr<-table(lst,seq(1,length(lst)))[2,]
barplot(distr,main = "The maximal connected component sizes distribution", ylab = "The number of components",xlab = "Size")
g<-read.graph(file='./networks/tutorial2_2.txt', format="edgelist")
plot(g)
clusters(g, mode=c("weak")) # select the required mode
## $membership
## [1] 1 1 1 1 1 1 1 1 1 1 1
##
## $csize
## [1] 11
##
## $no
## [1] 1
clusters(g, mode=c("strong")) # select the required mode
## $membership
## [1] 1 1 1 1 2 7 8 3 6 5 4
##
## $csize
## [1] 4 1 1 1 1 1 1 1
##
## $no
## [1] 8
lst <-cluster.distribution(g, cumulative = FALSE)
distr<-table(lst,seq(1,length(lst)))[2,]
barplot(distr,main = "The maximal connected component sizes distribution", ylab = "The number of components",xlab = "Size")
g<-read.graph(file='./networks/tutorial2_3.txt', format="edgelist")
plot(g)
clusters(g, mode=c("weak")) # select the required mode
## $membership
## [1] 1 1 1 1 1 1 1 2 3 3
##
## $csize
## [1] 7 1 2
##
## $no
## [1] 3
clusters(g, mode=c("strong")) # select the required mode
## $membership
## [1] 3 3 3 3 4 4 4 2 1 1
##
## $csize
## [1] 2 1 4 3
##
## $no
## [1] 4
lst <-cluster.distribution(g, cumulative = FALSE)
distr<-table(lst,seq(1,length(lst)))[2,]
barplot(distr,main = "The maximal connected component sizes distribution", ylab = "The number of components",xlab = "Size")
2pen triplet
plot(graph.formula(A-B-C))
Closed triplet
plot(graph.formula(A-B-C-A))
The transitivity the number of closed tripletsthe total number of triples centered on each of the vertices
vi - the i-th vertex of a graph ki - vertex degree (the number of edges incident to the vertex) eij - the edge between i-th and j-th verticies
The local clustering coefficient for node vi Cv=the number of closed triplets of the nodethe number of triples centered around the node
The global clustering coefficient of the graph is average over all local clustering coefficients Cv.
g <- graph.formula(A-B-C-A,C-D)
plot(g)
transitivity(g, type = "local")
## [1] 1.0000000 1.0000000 0.3333333 NaN
transitivity(g, type = "global") # transitivity(g)
## [1] 0.6
transitivity(g, vids="C", type = "local")
## [1] 0.3333333
transitivity(g, vids="A", type = "local")
## [1] 1
transitivity(g, vids="B", type = "local")
## [1] 1
g <- graph.formula(A - D, B - D, C - D, A - B - C)
plot(g)
transitivity(g, type = "local") # local clustering coefficient
## [1] 1.0000000 0.6666667 0.6666667 1.0000000
transitivity(g, vids="A", type = "local")
## [1] 1
transitivity(g, type = "global") # global clustering coefficient
## [1] 0.75
transitivity(g, type="localaverage") # average the local clustering coefficient
## [1] 0.8333333
transitivity(g, vids="C", type = "local")
## [1] 1
diameter(g)
## [1] 2
gw <- graph.formula(A-B-C-A : D-E, E-Y-A)
plot(gw)
transitivity(gw, vids="A", type = "local")
## [1] 0.3333333
transitivity(gw, vids="B", type = "local")
## [1] 1
transitivity(gw, vids="C", type = "local")
## [1] 0.3333333
transitivity(gw, vids="D", type = "local")
## [1] 0
transitivity(gw, type="localaverage")
## [1] 0.5
transitivity(gw, type = "global")
## [1] 0.4