On this first exercise, we will learn the basic functions of the Igraph package, a great tool to use Social Network Analysis metrics in our projects. We will use in this first exercise a dataset, “mathoverflow-ints.txt”, containing interactions between MathOverflow users with the next information:

You have to complete the code chunks in this document but also analyze the results, extract insights and describe them in an understandable way.

Please, write all your conclusions in blue

Both exercises may be done individually or in pairs; do not forget to write the name of the members of the team at the beginning of this document and also your team number in the name of this file (exercise1_PXX.Rmd)

Loading data

In this section, the goal is loading the dataset given for this exercise, building the graph object and analyzing basics metrics.

The summary we get after creating the graph means the following: This is q iGrqph object with randomly assigned name. Is is directed, as opposed to undirected (D), Named (N) meaning the name vertex attribute is set, and the next two spaces contain dashes (–) because the graph is not weighted, and not bipartite. There are 24818 nodes in the network and 506550 edges. On the next line, all of the attributes are listed separated by commas. In our case we have: -name (a vertex attribute, type is character) -timestamp (an edge attribute, numeric type) -type of interaction, answer to question etc (edge attribute, character type)

IGRAPH 2d1d067 DN– 24818 506550 – + attr: name (v/c), timestamp (e/n), type (e/c)

The graph has been created correctly: - Each observation in the .txt file defines an edge of the network (origin, destination, timestamp, type). Therefore the number of edges in the graph must be equal to the number of observations in the .txt file. it is the case (e = 506,550). - Similarly, the number of unique IDs in the first two columns is equal to the number of unique vertices/nodes (n = 24,818).

# CHUNK 1

# Reading the DF and renaming columns to more logical names
df<-fread("/Users/marina/Documents/IE/Semester 3/Social Networks Analysis /exercise1/mathoverflow-ints.txt")
names(df) <- c("to", "from", "timestamp", "type")

# Building the graph
g<-graph.data.frame(df, directed = TRUE)
g
## IGRAPH d6355e4 DN-- 24818 506550 -- 
## + attr: name (v/c), timestamp (e/n), type (e/c)
## + edges from d6355e4 (vertex names):
##  [1] 1   ->4  3   ->4  1   ->2  25  ->1  14  ->16 1   ->16 22  ->16
##  [8] 1   ->2  3   ->16 2   ->1  27  ->1  13  ->16 1   ->28 21  ->1 
## [15] 1   ->7  32  ->1  7   ->27 32  ->1  32  ->27 1450->22 37  ->1 
## [22] 37  ->1  1   ->2  37  ->1  32  ->1  19  ->16 40  ->2  42  ->2 
## [29] 2   ->32 2   ->32 45  ->3  45  ->32 40  ->7  45  ->44 40  ->42
## [36] 50  ->27 50  ->16 1   ->44 54  ->44 45  ->44 3   ->56 19  ->44
## [43] 60  ->32 42  ->40 42  ->1  65  ->2  65  ->4  65  ->3  40  ->2 
## [50] 1   ->44 66  ->65 3   ->66 37  ->65 45  ->1  65  ->2  66  ->1 
## + ... omitted several edges
summary(g)
## IGRAPH d6355e4 DN-- 24818 506550 -- 
## + attr: name (v/c), timestamp (e/n), type (e/c)
# Verifying that graph is correct
# Checking that the number of edges is the same as observations
length(df$to)
## [1] 506550
ecount(g) # both 506550
## [1] 506550
# Same number of vertices/nodes
v3 <- c(df$to,df$from)
length(unique(v3))
## [1] 24818
vcount(g) # both 24818
## [1] 24818
# Simplifying the graph and adding weights
E(g)$weight <- 1
gsim <- simplify(g, remove.multiple = TRUE, remove.loops = TRUE,
                 edge.attr.comb=c(weight="sum", "ignore"))

is_simple(gsim)
## [1] TRUE
is_weighted(gsim)
## [1] TRUE
summary(gsim) # now we see that it is weighted
## IGRAPH cbb644f DNW- 24818 227991 -- 
## + attr: name (v/c), weight (e/n)

Analyzing the graph

In this section, we will use Igraph functions to analyze some basic features:

The graph is not fully connected. There are 104 components. There are 4 different sizes of components: 1, 2, 3, 24668. The distribution shows you the probability of being a part of a component of a given size, 59 have 1 node (p=0.57), 41 have 2 nodes (p=0.40), 3 have 3 nodes (p=0.02), 1 has 24668 nodes (p=0.01).

The p-value .01 of the Kolmogorov-Smirnov test indicates that the weights do not fit a Power-Law distribution, as we can reject the test at 95% confidence.

# CHUNK 2

is_connected(gsim) # false
## [1] FALSE
count_components(gsim) #104
## [1] 104
ccs<-clusters(gsim) # to compute the number connected components 
ccs$no 
## [1] 104
ccs$csize
##   [1] 24668     3     1     1     2     2     1     1     1     2     2
##  [12]     1     1     2     1     2     2     2     2     2     1     2
##  [23]     2     1     2     2     3     2     2     1     1     2     2
##  [34]     3     2     1     2     2     2     1     1     2     2     1
##  [45]     1     2     2     2     1     2     1     1     2     2     2
##  [56]     2     2     1     1     1     2     1     1     2     1     2
##  [67]     1     1     1     1     1     1     1     1     1     1     1
##  [78]     1     2     2     1     1     1     1     1     1     1     1
##  [89]     1     1     1     1     2     1     1     2     1     1     1
## [100]     1     2     1     1     1
# Plot of distribution by size of CC
x <-order(table(ccs$membership),decreasing = TRUE)
y <- table(ccs$membership)
plot(x,y) 

# Distribution of number of nodes in CC
hist(ccs$csize, breaks = 24668,  freq = FALSE)

# Distribution of weight
plot(density(E(gsim)$weight)) 

fit_power_law(E(gsim)$weight, force.continuous = FALSE)
## $continuous
## [1] FALSE
## 
## $alpha
## [1] 3.107498
## 
## $xmin
## [1] 5
## 
## $logLik
## [1] -23978.95
## 
## $KS.stat
## [1] 0.01481846
## 
## $KS.p
## [1] 0.01863459

Node degree

A typical analysis when dealing with social networks is studying the degree distribution of the nodes. - Visualize the statisical distribution of the node degree. - Again, is it a power-law distribution? Explain why this is the case.

The distribution of the degrees shows us that in this network the probability of having a high degree is relatively high, as what we are looking at is how many connections each node has. By looking at the logs of the degrees we can see that the body of the the data follows a power-law distribution with the downwards sloping section, but that there is a lot of noise at the end. When testing for power-law, we reject the null hypothesis that the data follows the PL distribution with a p-value of .03, though the body of the data would probably fit.

# CHUNK 3
head(degree(gsim, mode = "total"))
##   1   3  25  14  22   2 
## 642 478 195   1 469 194
#Save degrees 
D<-degree(gsim, mode = "total")
# Plot the density
plot(density(D), log = 'xy', xlab = 'Degree', ylab = 'Density', main = 'Degree Density')
## Warning in xy.coords(x, y, xlabel, ylabel, log): 1 x value <= 0 omitted
## from logarithmic plot
## Warning in xy.coords(x, y, xlabel, ylabel, log): 66 y values <= 0 omitted
## from logarithmic plot

#Fit Power-Law
fit_power_law(degree(gsim), force.continuous = FALSE )
## $continuous
## [1] FALSE
## 
## $alpha
## [1] 1.817757
## 
## $xmin
## [1] 4
## 
## $logLik
## [1] -45841.43
## 
## $KS.stat
## [1] 0.01275773
## 
## $KS.p
## [1] 0.03492005

Building connected graphs

In general, in interaction networks, many connected components turn out after creating the graph. However, most of the nodes typically belong to the same giant component. In this section, the goal is: - Build a subgraph with all the nodes belonging to the largest connected component. - Validate the graph has been correctly created. The number of nodes of the new graph is the same as the size of the largest connected component of the simplified graph. Is_connected also shows that the graph is fully connected.

By looking at the average degree of the graphs, we can see that the simplified graph had a slightly smaller average degree than this new fully connected graph. This is because we have removed all of the nodes belonging to the 103 components of sizes 1, 2 or 3 nodes that have a very low number of edges and bring down that average.

# CHUNK 4

table(ccs$membership)
## 
##     1     2     3     4     5     6     7     8     9    10    11    12 
## 24668     3     1     1     2     2     1     1     1     2     2     1 
##    13    14    15    16    17    18    19    20    21    22    23    24 
##     1     2     1     2     2     2     2     2     1     2     2     1 
##    25    26    27    28    29    30    31    32    33    34    35    36 
##     2     2     3     2     2     1     1     2     2     3     2     1 
##    37    38    39    40    41    42    43    44    45    46    47    48 
##     2     2     2     1     1     2     2     1     1     2     2     2 
##    49    50    51    52    53    54    55    56    57    58    59    60 
##     1     2     1     1     2     2     2     2     2     1     1     1 
##    61    62    63    64    65    66    67    68    69    70    71    72 
##     2     1     1     2     1     2     1     1     1     1     1     1 
##    73    74    75    76    77    78    79    80    81    82    83    84 
##     1     1     1     1     1     1     2     2     1     1     1     1 
##    85    86    87    88    89    90    91    92    93    94    95    96 
##     1     1     1     1     1     1     1     1     2     1     1     2 
##    97    98    99   100   101   102   103   104 
##     1     1     1     1     2     1     1     1
# keep the largest CC in a graph
g2<-induced.subgraph(gsim, names(ccs$membership[ccs$membership == 1])) 

# Same size as the larges CC
ccs$csize
##   [1] 24668     3     1     1     2     2     1     1     1     2     2
##  [12]     1     1     2     1     2     2     2     2     2     1     2
##  [23]     2     1     2     2     3     2     2     1     1     2     2
##  [34]     3     2     1     2     2     2     1     1     2     2     1
##  [45]     1     2     2     2     1     2     1     1     2     2     2
##  [56]     2     2     1     1     1     2     1     1     2     1     2
##  [67]     1     1     1     1     1     1     1     1     1     1     1
##  [78]     1     2     2     1     1     1     1     1     1     1     1
##  [89]     1     1     1     1     2     1     1     2     1     1     1
## [100]     1     2     1     1     1
vcount(g2)
## [1] 24668
is_connected(g2)
## [1] TRUE
# Average node degree in all graphs
mean(degree(gsim))
## [1] 18.37304
mean(degree(g2))
## [1] 18.48006

Visualizing shortest paths

On this previous subgraph, you have to compute:

To this end, use the layout.fruchterman.reingold function to place the nodes in the visualization.

# CHUNK 5

# The shortest path from highest degree node
max(degree(g2)) # 2758 is degree
## [1] 2758
which.max(degree(g2)) #290 is the ID
## 290 
##  57
head(sort(degree(g2), decreasing = T))
##   290 11142   763   297  6094  1409 
##  2758  2755  2058  2011  1963  1897
# Saving the shortest paths 
gsp<- get.shortest.paths(g2, which.max(degree(g2)))
## Warning in get.shortest.paths(g2, which.max(degree(g2))): At
## structural_properties.c:4597 :Couldn't reach some vertices
#saving the whole list of vertices
gsp1<-gsp$vpath

which.max(sapply(gsp1, length)) # gives the index: 14734
## [1] 14734
head(sort(sapply(gsp1, length), decreasing = T)) # proves that there is only 1 longest one
## [1] 7 6 6 6 6 6
# Plot of the Path
splist<-gsp1[[14734]]

plot(induced.subgraph(g2, unlist(splist)))

# Plot of the full graph visualizing the shortest path

V(g2)$size = 2
V(g2)[splist]$size = 15
V(g2)[splist]$color = "blue"

layout = layout.fruchterman.reingold(g2)
layout[splist,][,1] = layout[splist,][,1] +10

plot(g2, layout=layout,
     vertex.label.cesx=3, edge.arrow.size=.10,vertex.label= NA)