On this first exercise, we will learn the basic functions of the Igraph package, a great tool to use Social Network Analysis metrics in our projects. We will use in this first exercise a dataset, “mathoverflow-ints.txt”, containing interactions between MathOverflow users with the next information:

Answers to questions
Comments to questions
Comments to answers

You have to complete the code chunks in this document but also analyze the results, extract insights and describe them in an understandable way.

Please, write all your conclusions in blue

Both exercises may be done individually or in pairs; do not forget to write the name of the members of the team at the beginning of this document and also your team number in the name of this file (exercise1_PXX.Rmd)

Loading data

In this section, the goal is loading the dataset given for this exercise, building the graph object and analyzing basics metrics.

Build the graph object with all the interactions. Include the timestamp and the type of the edge as edge attributes.
Describe the values provided by summary function on the graph object.

The summary we get after creating the graph means the following: This is q iGrqph object with randomly assigned name. Is is directed, as opposed to undirected (D), Named (N) meaning the name vertex attribute is set, and the next two spaces contain dashes (–) because the graph is not weighted, and not bipartite. There are 24818 nodes in the network and 506550 edges. On the next line, all of the attributes are listed separated by commas. In our case we have: -name (a vertex attribute, type is character) -timestamp (an edge attribute, numeric type) -type of interaction, answer to question etc (edge attribute, character type)

IGRAPH 2d1d067 DN– 24818 506550 – + attr: name (v/c), timestamp (e/n), type (e/c)

Validate the graph has been created correctly comparing with the information after reading the file.

The graph has been created correctly: - Each observation in the .txt file defines an edge of the network (origin, destination, timestamp, type). Therefore the number of edges in the graph must be equal to the number of observations in the .txt file. it is the case (e = 506,550). - Similarly, the number of unique IDs in the first two columns is equal to the number of unique vertices/nodes (n = 24,818).

Simplify the graph in order to remove loops and multiple edges. After simplifying the graph, a weight attribute should have been created, counting the number of interactions between two users.

# CHUNK 1

# Reading the DF and renaming columns to more logical names
df<-fread("/Users/marina/Documents/IE/Semester 3/Social Networks Analysis /exercise1/mathoverflow-ints.txt")
names(df) <- c("to", "from", "timestamp", "type")

# Building the graph
g<-graph.data.frame(df, directed = TRUE)
g

## IGRAPH d6355e4 DN-- 24818 506550 -- 
## + attr: name (v/c), timestamp (e/n), type (e/c)
## + edges from d6355e4 (vertex names):
##  [1] 1   ->4  3   ->4  1   ->2  25  ->1  14  ->16 1   ->16 22  ->16
##  [8] 1   ->2  3   ->16 2   ->1  27  ->1  13  ->16 1   ->28 21  ->1 
## [15] 1   ->7  32  ->1  7   ->27 32  ->1  32  ->27 1450->22 37  ->1 
## [22] 37  ->1  1   ->2  37  ->1  32  ->1  19  ->16 40  ->2  42  ->2 
## [29] 2   ->32 2   ->32 45  ->3  45  ->32 40  ->7  45  ->44 40  ->42
## [36] 50  ->27 50  ->16 1   ->44 54  ->44 45  ->44 3   ->56 19  ->44
## [43] 60  ->32 42  ->40 42  ->1  65  ->2  65  ->4  65  ->3  40  ->2 
## [50] 1   ->44 66  ->65 3   ->66 37  ->65 45  ->1  65  ->2  66  ->1 
## + ... omitted several edges

summary(g)

## IGRAPH d6355e4 DN-- 24818 506550 -- 
## + attr: name (v/c), timestamp (e/n), type (e/c)

# Verifying that graph is correct
# Checking that the number of edges is the same as observations
length(df$to)

## [1] 506550

ecount(g) # both 506550

## [1] 506550

# Same number of vertices/nodes
v3 <- c(df$to,df$from)
length(unique(v3))

## [1] 24818

vcount(g) # both 24818

## [1] 24818

# Simplifying the graph and adding weights
E(g)$weight <- 1
gsim <- simplify(g, remove.multiple = TRUE, remove.loops = TRUE,
                 edge.attr.comb=c(weight="sum", "ignore"))

is_simple(gsim)

## [1] TRUE

is_weighted(gsim)

## [1] TRUE

summary(gsim) # now we see that it is weighted

## IGRAPH cbb644f DNW- 24818 227991 -- 
## + attr: name (v/c), weight (e/n)

Analyzing the graph

In this section, we will use Igraph functions to analyze some basic features:

Is the graph connected? If it is not, compute the connected components and analyze the statistical distribution of the size of the resulting components.

The graph is not fully connected. There are 104 components. There are 4 different sizes of components: 1, 2, 3, 24668. The distribution shows you the probability of being a part of a component of a given size, 59 have 1 node (p=0.57), 41 have 2 nodes (p=0.40), 3 have 3 nodes (p=0.02), 1 has 24668 nodes (p=0.01).

Analyze the statistical distribution of the “weight” attribute created in the previous section and fit it to a power law distribution. Does it fit well enough?

The p-value .01 of the Kolmogorov-Smirnov test indicates that the weights do not fit a Power-Law distribution, as we can reject the test at 95% confidence.

# CHUNK 2

is_connected(gsim) # false

## [1] FALSE

count_components(gsim) #104

## [1] 104

ccs<-clusters(gsim) # to compute the number connected components 
ccs$no

## [1] 104

ccs$csize

##   [1] 24668     3     1     1     2     2     1     1     1     2     2
##  [12]     1     1     2     1     2     2     2     2     2     1     2
##  [23]     2     1     2     2     3     2     2     1     1     2     2
##  [34]     3     2     1     2     2     2     1     1     2     2     1
##  [45]     1     2     2     2     1     2     1     1     2     2     2
##  [56]     2     2     1     1     1     2     1     1     2     1     2
##  [67]     1     1     1     1     1     1     1     1     1     1     1
##  [78]     1     2     2     1     1     1     1     1     1     1     1
##  [89]     1     1     1     1     2     1     1     2     1     1     1
## [100]     1     2     1     1     1

# Plot of distribution by size of CC
x <-order(table(ccs$membership),decreasing = TRUE)
y <- table(ccs$membership)
plot(x,y)

# Distribution of number of nodes in CC
hist(ccs$csize, breaks = 24668,  freq = FALSE)

# Distribution of weight
plot(density(E(gsim)$weight))

fit_power_law(E(gsim)$weight, force.continuous = FALSE)

## $continuous
## [1] FALSE
## 
## $alpha
## [1] 3.107498
## 
## $xmin
## [1] 5
## 
## $logLik
## [1] -23978.95
## 
## $KS.stat
## [1] 0.01481846
## 
## $KS.p
## [1] 0.01863459

Node degree

A typical analysis when dealing with social networks is studying the degree distribution of the nodes. - Visualize the statisical distribution of the node degree. - Again, is it a power-law distribution? Explain why this is the case.

The distribution of the degrees shows us that in this network the probability of having a high degree is relatively high, as what we are looking at is how many connections each node has. By looking at the logs of the degrees we can see that the body of the the data follows a power-law distribution with the downwards sloping section, but that there is a lot of noise at the end. When testing for power-law, we reject the null hypothesis that the data follows the PL distribution with a p-value of .03, though the body of the data would probably fit.

# CHUNK 3
head(degree(gsim, mode = "total"))

##   1   3  25  14  22   2 
## 642 478 195   1 469 194

#Save degrees 
D<-degree(gsim, mode = "total")
# Plot the density
plot(density(D), log = 'xy', xlab = 'Degree', ylab = 'Density', main = 'Degree Density')

## Warning in xy.coords(x, y, xlabel, ylabel, log): 1 x value <= 0 omitted
## from logarithmic plot

## Warning in xy.coords(x, y, xlabel, ylabel, log): 66 y values <= 0 omitted
## from logarithmic plot

#Fit Power-Law
fit_power_law(degree(gsim), force.continuous = FALSE )

## $continuous
## [1] FALSE
## 
## $alpha
## [1] 1.817757
## 
## $xmin
## [1] 4
## 
## $logLik
## [1] -45841.43
## 
## $KS.stat
## [1] 0.01275773
## 
## $KS.p
## [1] 0.03492005

Building connected graphs

In general, in interaction networks, many connected components turn out after creating the graph. However, most of the nodes typically belong to the same giant component. In this section, the goal is: - Build a subgraph with all the nodes belonging to the largest connected component. - Validate the graph has been correctly created. The number of nodes of the new graph is the same as the size of the largest connected component of the simplified graph. Is_connected also shows that the graph is fully connected.

Analyze how node degree has changed from the original graph to this new subgraph. Find out which nodes have now a smaller number of connections and explain why.

By looking at the average degree of the graphs, we can see that the simplified graph had a slightly smaller average degree than this new fully connected graph. This is because we have removed all of the nodes belonging to the 103 components of sizes 1, 2 or 3 nodes that have a very low number of edges and bring down that average.

# CHUNK 4

table(ccs$membership)

## 
##     1     2     3     4     5     6     7     8     9    10    11    12 
## 24668     3     1     1     2     2     1     1     1     2     2     1 
##    13    14    15    16    17    18    19    20    21    22    23    24 
##     1     2     1     2     2     2     2     2     1     2     2     1 
##    25    26    27    28    29    30    31    32    33    34    35    36 
##     2     2     3     2     2     1     1     2     2     3     2     1 
##    37    38    39    40    41    42    43    44    45    46    47    48 
##     2     2     2     1     1     2     2     1     1     2     2     2 
##    49    50    51    52    53    54    55    56    57    58    59    60 
##     1     2     1     1     2     2     2     2     2     1     1     1 
##    61    62    63    64    65    66    67    68    69    70    71    72 
##     2     1     1     2     1     2     1     1     1     1     1     1 
##    73    74    75    76    77    78    79    80    81    82    83    84 
##     1     1     1     1     1     1     2     2     1     1     1     1 
##    85    86    87    88    89    90    91    92    93    94    95    96 
##     1     1     1     1     1     1     1     1     2     1     1     2 
##    97    98    99   100   101   102   103   104 
##     1     1     1     1     2     1     1     1

# keep the largest CC in a graph
g2<-induced.subgraph(gsim, names(ccs$membership[ccs$membership == 1])) 

# Same size as the larges CC
ccs$csize

##   [1] 24668     3     1     1     2     2     1     1     1     2     2
##  [12]     1     1     2     1     2     2     2     2     2     1     2
##  [23]     2     1     2     2     3     2     2     1     1     2     2
##  [34]     3     2     1     2     2     2     1     1     2     2     1
##  [45]     1     2     2     2     1     2     1     1     2     2     2
##  [56]     2     2     1     1     1     2     1     1     2     1     2
##  [67]     1     1     1     1     1     1     1     1     1     1     1
##  [78]     1     2     2     1     1     1     1     1     1     1     1
##  [89]     1     1     1     1     2     1     1     2     1     1     1
## [100]     1     2     1     1     1

vcount(g2)

## [1] 24668

is_connected(g2)

## [1] TRUE

# Average node degree in all graphs
mean(degree(gsim))

## [1] 18.37304

mean(degree(g2))

## [1] 18.48006

Visualizing shortest paths

On this previous subgraph, you have to compute:

The shortest path from the node with the highest degree to the rest of the graph.
From all these paths, select one path with the maximum length.
Visualize only this path using the plot function. visualize the node with max degree and all paths with higest path
Visualize this path within the giant component subgraph. save plot of full graph as png png() plot(g) and show png file dev.off()

To this end, use the layout.fruchterman.reingold function to place the nodes in the visualization.

# CHUNK 5

# The shortest path from highest degree node
max(degree(g2)) # 2758 is degree

## [1] 2758

which.max(degree(g2)) #290 is the ID

## 290 
##  57

head(sort(degree(g2), decreasing = T))

##   290 11142   763   297  6094  1409 
##  2758  2755  2058  2011  1963  1897

# Saving the shortest paths 
gsp<- get.shortest.paths(g2, which.max(degree(g2)))

## Warning in get.shortest.paths(g2, which.max(degree(g2))): At
## structural_properties.c:4597 :Couldn't reach some vertices

#saving the whole list of vertices
gsp1<-gsp$vpath

which.max(sapply(gsp1, length)) # gives the index: 14734

## [1] 14734

head(sort(sapply(gsp1, length), decreasing = T)) # proves that there is only 1 longest one

## [1] 7 6 6 6 6 6

# Plot of the Path
splist<-gsp1[[14734]]

plot(induced.subgraph(g2, unlist(splist)))

# Plot of the full graph visualizing the shortest path

V(g2)$size = 2
V(g2)[splist]$size = 15
V(g2)[splist]$color = "blue"

layout = layout.fruchterman.reingold(g2)
layout[splist,][,1] = layout[splist,][,1] +10

plot(g2, layout=layout,
     vertex.label.cesx=3, edge.arrow.size=.10,vertex.label= NA)

Social Network Analysis - Exercise 1

Pair15 - Marina Begović - Théo Tortorici

Loading data

Analyzing the graph

Node degree

Building connected graphs

Visualizing shortest paths