Data Source:
J. Leskovec, K. Lang, A. Dasgupta, M. Mahoney. Community Structure in
Large Networks: # Natural Cluster Sizes and the Absence of Large
Well-Defined Clusters. # Internet Mathematics 6(1) 29–123, 2009. #Data
Source: https://snap.stanford.edu/data/roadNet-CA.html #A road
network of California. Intersections and endpoints #are represented by
nodes and the roads connecting these intersections or #road endpoints
are represented by undirected edges.
- Learning Sources: Jaromir Kovarik website: https://sites.google.com/site/webpagesjaromir/home?authuser=0
# Scott, J. (2017).
- Social Network Analysis (4th ed.). Sage Publications.
Import Data
Set working directory.
Libraries.
Read my Data.
Remove the leading and trailing spaces in the column.
Split the names column into two columns.
Audit myData.
## from to
## 1 0 1
## 2 0 2
## 3 0 469
## 4 1 0
## 5 1 6
## 6 1 385
## [1] "from" "to"
## [1] 5533214 2
## 'data.frame': 5533214 obs. of 2 variables:
## $ from: chr "0" "0" "0" "1" ...
## $ to : chr "1" "2" "469" "0" ...
Convert all character columns to integer columns.
All the elements in ‘myData$to’ are also present in
‘myData$from’, and vice versa.
Convert the data frame to a graph object.
The data is comprised of 5533214 edges and 1965206
nodes/vertices.
## IGRAPH 4669222 UN-- 1965206 5533214 --
## + attr: name (v/c)
## + edges from 4669222 (vertex names):
## [1] 0 --1 0 --2 0 --469 0 --1 1 --6 1 --385
## [7] 0 --2 2 --3 0 --469 469--380 469--37415 1 --6
## [13] 6 --5 1 --385 385--384 385--386 2 --3 3 --4
## [19] 3 --419 3 --422 3 --4 4 --5 4 --98 4 --420
## [25] 3 --419 419--420 419--35698 3 --422 422--183 422--423
## [31] 422--37415 4 --5 6 --5 5 --98 4 --98 5 --98
## [37] 98 --470 98 --35729 4 --420 419--420 420--35709 7 --8
## [43] 7 --9 7 --79 7 --8 8 --33 7 --9 9 --10
## + ... omitted several edges
## [1] "igraph"
View all vertex attributes in a list.
## + 5/1965206 vertices, named, from 4669222:
## name
## 1 0
## 2 1
## 3 2
## 4 469
## 5 6
Audit the network
Verify again whether the network is undirected.
The network is undirected, as described in the data source.
## [1] FALSE
Review vertices and edges.
The network is comprised of 1965206 vertices and 5533214 edges.
Symmetry.
The number of symmetric edges is half of the number of symmetric
pairs (being equal to the number of total edges in the network of
roads). This means that every edge in the network has a corresponding
symmetric edge, indicating that there are no one-way streets or
unidirectional connections in the network.
Having a symmetric road network can have practical implications for
transportation planning and routing, since it allows for more flexible
and efficient travel between nodes. For example, if there is a traffic
jam on one side of an edge, drivers can still use the other side of the
edge to reach their destination.
Symmetric networks can be more resilient to disruptions, as there
are always multiple ways to reach a given node in the network.
Reciprocity.
Since the network is undirected, bidirectional edges are defined as
edges that have a reciprocal edge in the opposite direction. The result
is 1, which means that there is a perfect level of bidirectionality in
the network.
## [1] 1
Assortativity.
The assortativity coefficient for road network is 0.1260417. This
suggests that nodes with similar degree tend to be connected to each
other more often than expected by chance.
## [1] 0.1260417
## [1] 0.1260417
Density of the undirected network.
The density is approximately 5.68 x 10^-7. This indicates that the
network is very sparse, meaning that only a very small fraction of all
possible edges actually exist in the network.
## [1] 5533214
## [1] 1965206
## Warning in vcount(g) * vcount(g): NAs produced by integer overflow
## [1] NA
## [1] 2.865441e-06
Plot the data – plot(g).
The network is too large. Plotting the data plot(g) results in a
large igraph of 334.4Mb.
Analyze the entire network
Calculate the number of nodes in the largest weakly connected
component (WCC) of the network.
A weakly connected component is a group of nodes where there is a
path between any two nodes in the group, but the edges may be directed
or undirected. Since the network is undirected, I used mode”weak” to
cluster the network.
## [1] 1957027
Calculate the number of edges in the largest weakly connected
component (WCC) of the network.
A weakly connected component is a group of nodes where there is a
path between any two nodes in the group, but the edges may be directed
or undirected. Since the edges are undirected, I used mode”weak” to
cluster the network.
## [1] 2760388
Count the Nodes in largest SCC.
This represents the number of nodes in the largest strongly
connected component (SCC) of the network. A strongly connected component
is a group of nodes where there is a directed path between any two nodes
in the group.
## [1] 1957027
Calculate the number of edges in largest SCC.
This represents the number of edges in the largest strongly
connected component of the network.
## [1] 2760388
Average Clustering Coefficient.
This is a measure of how tightly clustered the network is. It is
calculated as the average of the local clustering coefficients of all
nodes in the graph. The local clustering coefficient of a node is the
proportion of pairs of the node’s neighbors that are also neighbors of
each other.
## [1] 0.05542
Calculate the global clustering coefficient.
Count the number of triangles in the graph.
There are 120676 triangles in the entire network.
## [1] 120676
Compute the fraction of closed triangles in the graph.
The fraction of closed triangles in a graph is defined as the ratio
between the number of closed triangles (triangles where all three
vertices are connected to each other) and the total number of connected
triples of vertices in the graph.
## [1] 0.02142
Compute the ratio of the size of the largest weakly connected
component to the total number of vertices in the network.
There are 2368 clusters in the entire network. The largest weakly
connected component includes approximately 99.6% of the vertices,
meaning that it includes almost all of the vertices in the network. This
indicates that the network is highly cohesive, with only a small number
of isolated vertices or small components.
Calculate the degree of connection between roads.
Overall, on average, each road is connected to about 5-6 other roads
in the network, both as an incoming and outgoing connection.
## [1] 5.63118
## [1] 5.63118
## [1] 5.63118
Eigenvector Centrality.
Eigenvector centrality measures the influence of a node in a
network, taking into account the centrality of its neighbors. The scalar
value of the eigenvector centrality for the entire network is 9.276724.
This value represents the overall influence of the network, taking into
account the centrality of all nodes and their connections.
## Length Class Mode
## vector 1965206 -none- numeric
## value 1 -none- numeric
## options 20 -none- list
## [1] 9.276724
Test for Randomness … to determine if the observed patterns of
connections in the network are likely to have occurred by chance or if
they are meaningful and indicative of underlying structures or
processes.
## Warning in ks.test.default(igraph::degree(randomgraph), igraph::degree(g)):
## p-value will be approximate in the presence of ties
##
## Asymptotic two-sample Kolmogorov-Smirnov test
##
## data: igraph::degree(randomgraph) and igraph::degree(g)
## D = 0.23933, p-value < 2.2e-16
## alternative hypothesis: two-sided
Plot the road network based on the degree of distribution.
Plot the road network using the density() function based on the
degree of distribution.
The degree of distribution of the road network is heavily skewed
compared to the random generated network. Another approach for analyzing
large networks is to use sampling methods, such as random sampling. This
can be more efficient than analyzing the entire network, especially if
the degree distribution is highly skewed.

Analyze the network using random sampling
For the purpose of this analysis, I am reducing the data set to a
sample of 300 rows.
## from to
## 2017743 695394 695395
## 3334670 1187240 1175408
## 4328362 1529088 1529090
## 3269750 1154424 1154482
## 2782437 977390 977391
## 3715420 1312379 1312352
## [1] 300 2
## 'data.frame': 300 obs. of 2 variables:
## $ from: int 695394 1187240 1529088 1154424 977390 1312379 1659386 139799 207174 1928616 ...
## $ to : int 695395 1175408 1529090 1154482 977391 1312352 1659387 139795 207172 1928609 ...
## [1] "from" "to"
The roads data frame has only two attributes.
I added the heterogeneity, as the third attribute. Heterogeneity can
be used as a measure of how interconnected different parts of a graph
are, and can be used in various network analysis applications.
The “heterogeneity” attribute, calculated based on the Jaccard
index, represents a measure of the similarity between the sets of
neighbors of two vertices in a graph. Specifically, it measures the
ratio of the number of common neighbors between two vertices to the
number of distinct neighbors they have.
A higher heterogeneity score (closer to 1) indicates that the two
vertices have a more diverse set of neighbors, while a lower score
(closer to 0) indicates that they share more common neighbors.
Randomly sample roads data frame based on the distribution of
heterogeneity, to ensure that data analyzed is representative, while the
size is being reduced to 150 variables.
Group heterogeneity into 3 bins
## [1] 60 4
Plot the graph with the node coordinates generated by
layout_with_fr, and the communities detected by the
cluster_edge_betweenness algorithm.
There are 60 communities in the network. This can help to identify
densely connected regions of the road network and potentially reveal
patterns in how different roads are used or connected.
## [1] 60

Create the network dataframe:
The links object is the data frame that specifies the edges of the
graph, with each row representing an edge and the columns representing
the source and target nodes of the edge.
## [1] TRUE
Plot the network using the fruchterman.reingold
layout.
Plot the network based on the the heterogeneity groups using
Fruchterman-Reingold layout algorithm.
## IGRAPH d15fd8c UN-- 120 60 --
## + attr: name (v/c), label (v/n), heterogeneity (e/n),
## | heterogeneity_group (e/x)
## + edges from d15fd8c (vertex names):
## [1] 1065844--1065730 1334290--1334403 1541164--1541166 1102756--1102236
## [5] 1777294--1775472 277910 --277916 1392208--1392211 140581 --140450
## [9] 1358615--1358591 493052 --493047 881504 --881536 321521 --321514
## [13] 1249164--1249165 436028 --436027 884118 --883872 833343 --833367
## [17] 1212260--1212257 842772 --842771 1892692--1892693 401902 --401903
## [21] 1744387--1754130 1873266--1873213 303117 --303198 115011 --115010
## [25] 1917361--1917360 171417 --171416 1821541--1821540 1041976--1041975
## + ... omitted several edges

Plot the network using the Louvain algorithm.
Analyze the sampled_data / the network
Network “dimensions”.
The network has 120 vertices and 60 edges.
Density of the network.
The density of the network is 0.008403361, which means that the
network is sparse,
indicating that only a very small fraction of all possible edges
actually exist in the network.
## [1] 0.008403361
Symmetry.
The number of symmetric edges is half of the number of symmetric
pairs, (being equal to the number of total edges in the network of
roads).
This means that every edge in the network has a corresponding
symmetric edge, indicating that there are no one-way streets or
unidirectional connections in the network.
Reciprocity of the network.
Since the network is undirected, bidirectional edges are defined as
edges that have a reciprocal edge in the opposite direction. The result
is 1, which means that there is a perfect level of bidirectionality in
the network.
## [1] 1
Calculate the degree of each node in the sampled network.
Every road has 1 connection in this network. Each road is connected
to 1 other road in the network, both as an incoming and outgoing
connection.
The largest weakly connected component includes only 1.67% of all of
the vertices in the network. This indicates that the sampled network is
poorly connected and cohesive, with many isolated vertices or small
components.
The eigenvector centrality for the entire network is 1. This value
represents the overall influence of the network, taking into account the
centrality of all nodes and their connections.
Test for randomness
to determine if the observed patterns of connections in the sampled
network are likely to have occurred by chance, or if they are meaningful
and indicative of underlying structures or processes.
## Warning in ks.test.default(igraph::degree(randomgraph),
## igraph::degree(network)): p-value will be approximate in the presence of ties
##
## Asymptotic two-sample Kolmogorov-Smirnov test
##
## data: igraph::degree(randomgraph) and igraph::degree(network)
## D = 0.375, p-value = 9.382e-08
## alternative hypothesis: two-sided
Plot the road network based on the degree of distribution.
Plot the road network using the density() function based on the
degree of distribution.
Similarly to the road network, the degree of distribution of the
sampled road network is heavily skewed compared to the random generated
network. It makes sense since the sampled road network is a random
sample of the (large) road network (the source).
