Data Source:

J. Leskovec, K. Lang, A. Dasgupta, M. Mahoney. Community Structure in Large Networks: # Natural Cluster Sizes and the Absence of Large Well-Defined Clusters. # Internet Mathematics 6(1) 29–123, 2009. #Data Source: https://snap.stanford.edu/data/roadNet-CA.html #A road network of California. Intersections and endpoints #are represented by nodes and the roads connecting these intersections or #road endpoints are represented by undirected edges.

  1. Learning Sources: Jaromir Kovarik website: https://sites.google.com/site/webpagesjaromir/home?authuser=0
  2. Scott, J. (2017). Social Network Analysis (4th ed.). Sage Publications.

Import Data

Set working directory.

Libraries.

Read my Data.

Remove the leading and trailing spaces in the column.

Split the names column into two columns.

Audit myData.

##   from  to
## 1    0   1
## 2    0   2
## 3    0 469
## 4    1   0
## 5    1   6
## 6    1 385
## [1] "from" "to"
## [1] 5533214       2
## 'data.frame':    5533214 obs. of  2 variables:
##  $ from: chr  "0" "0" "0" "1" ...
##  $ to  : chr  "1" "2" "469" "0" ...

Convert all character columns to integer columns.

All the elements in ‘myData$to’ are also present in ‘myData$from’, and vice versa.

Convert the data frame to a graph object.

The data is comprised of 5533214 edges and 1965206 nodes/vertices.

## IGRAPH a861cc7 UN-- 1965206 5533214 -- 
## + attr: name (v/c)
## + edges from a861cc7 (vertex names):
##  [1] 0  --1     0  --2     0  --469   0  --1     1  --6     1  --385  
##  [7] 0  --2     2  --3     0  --469   469--380   469--37415 1  --6    
## [13] 6  --5     1  --385   385--384   385--386   2  --3     3  --4    
## [19] 3  --419   3  --422   3  --4     4  --5     4  --98    4  --420  
## [25] 3  --419   419--420   419--35698 3  --422   422--183   422--423  
## [31] 422--37415 4  --5     6  --5     5  --98    4  --98    5  --98   
## [37] 98 --470   98 --35729 4  --420   419--420   420--35709 7  --8    
## [43] 7  --9     7  --79    7  --8     8  --33    7  --9     9  --10   
## + ... omitted several edges
## [1] "igraph"

View all vertex attributes in a list.

## + 5/1965206 vertices, named, from a861cc7:
##   name
## 1    0
## 2    1
## 3    2
## 4  469
## 5    6

Audit the network

Verify again whether the network is undirected.

The network is undirected, as described in the data source.

## [1] FALSE

Review vertices and edges.

The network is comprised of 1965206 vertices and 5533214 edges.

Symmetry.

The number of symmetric edges is half of the number of symmetric pairs (being equal to the number of total edges in the network of roads). This means that every edge in the network has a corresponding symmetric edge, indicating that there are no one-way streets or unidirectional connections in the network.

Having a symmetric road network can have practical implications for transportation planning and routing, since it allows for more flexible and efficient travel between nodes. For example, if there is a traffic jam on one side of an edge, drivers can still use the other side of the edge to reach their destination.

Symmetric networks can be more resilient to disruptions, as there are always multiple ways to reach a given node in the network.

Reciprocity.

Since the network is undirected, bidirectional edges are defined as edges that have a reciprocal edge in the opposite direction. The result is 1, which means that there is a perfect level of bidirectionality in the network.

## [1] 1

Assortativity.

The assortativity coefficient for road network is 0.1260417. This suggests that nodes with similar degree tend to be connected to each other more often than expected by chance.

## [1] 0.1260417
## [1] 0.1260417

Density of the undirected network.

The density is approximately 5.68 x 10^-7. This indicates that the network is very sparse, meaning that only a very small fraction of all possible edges actually exist in the network.

## [1] 5533214
## [1] 1965206
## Warning in vcount(g) * vcount(g): NAs produced by integer overflow
## [1] NA
## [1] 2.865441e-06

Plot the data – plot(g).

The network is too large. Plotting the data plot(g) results in a large igraph of 334.4Mb.

Analyze the entire network

Calculate the number of nodes in the largest weakly connected component (WCC) of the network.

A weakly connected component is a group of nodes where there is a path between any two nodes in the group, but the edges may be directed or undirected. Since the network is undirected, I used mode”weak” to cluster the network.

## [1] 1957027

Calculate the number of edges in the largest weakly connected component (WCC) of the network.

A weakly connected component is a group of nodes where there is a path between any two nodes in the group, but the edges may be directed or undirected. Since the edges are undirected, I used mode”weak” to cluster the network.

## [1] 2760388

Count the Nodes in largest SCC.

This represents the number of nodes in the largest strongly connected component (SCC) of the network. A strongly connected component is a group of nodes where there is a directed path between any two nodes in the group.

## [1] 1957027

Calculate the number of edges in largest SCC.

This represents the number of edges in the largest strongly connected component of the network.

## [1] 2760388

Average Clustering Coefficient.

This is a measure of how tightly clustered the network is. It is calculated as the average of the local clustering coefficients of all nodes in the graph. The local clustering coefficient of a node is the proportion of pairs of the node’s neighbors that are also neighbors of each other.

## [1] 0.05542

Calculate the global clustering coefficient.

The clustering coefficient of the entire network is low, implying that nodes in the network are not highly likely to form triangles or clusters of interconnected nodes. Instead, they tend to have relatively few connections to other nodes. This could be due to a variety of factors, such as the sparsity of the network or the nature of the relationships between the nodes.

## [1] 0.06039

Count the number of triangles in the graph.

There are 120676 triangles in the entire network.

## [1] 120676

Compute the fraction of closed triangles in the graph.

The fraction of closed triangles in a graph is defined as the ratio between the number of closed triangles (triangles where all three vertices are connected to each other) and the total number of connected triples of vertices in the graph.

## [1] 0.02142

Compute the ratio of the size of the largest weakly connected component to the total number of vertices in the network.

There are 2368 clusters in the entire network. The largest weakly connected component includes approximately 99.6% of the vertices, meaning that it includes almost all of the vertices in the network. This indicates that the network is highly cohesive, with only a small number of isolated vertices or small components.

Calculate the degree of connection between roads.

Overall, on average, each road is connected to about 5-6 other roads in the network, both as an incoming and outgoing connection.

## [1] 5.63118
## [1] 5.63118
## [1] 5.63118

Plot the distribution and the frequency of the values in the “degree” variable. Save the plot in jpeg format.

Eigenvector Centrality.

Eigenvector centrality measures the influence of a node in a network, taking into account the centrality of its neighbors. The scalar value of the eigenvector centrality for the entire network is 9.276724. This value represents the overall influence of the network, taking into account the centrality of all nodes and their connections.

##         Length  Class  Mode   
## vector  1965206 -none- numeric
## value         1 -none- numeric
## options      20 -none- list
## [1] 9.276724

Test for Randomness … to determine if the observed patterns of connections in the network are likely to have occurred by chance or if they are meaningful and indicative of underlying structures or processes.

## Warning in ks.test.default(igraph::degree(randomgraph), igraph::degree(g)):
## p-value will be approximate in the presence of ties
## 
##  Asymptotic two-sample Kolmogorov-Smirnov test
## 
## data:  igraph::degree(randomgraph) and igraph::degree(g)
## D = 0.23893, p-value < 2.2e-16
## alternative hypothesis: two-sided

Plot the road network based on the degree of distribution.

The observed roads network deviates significantly from a random model, the observed patterns are indicative of non-random processes or structures.

Plot the road network using the density() function based on the degree of distribution.

The degree of distribution of the road network is heavily skewed compared to the random generated network. Another approach for analyzing large networks is to use sampling methods, such as random sampling. This can be more efficient than analyzing the entire network, especially if the degree distribution is highly skewed.

Analyze the network using random sampling

For the purpose of this analysis, I am reducing the data set to a sample of 300 rows.

##            from      to
## 2017743  695394  695395
## 3334670 1187240 1175408
## 4328362 1529088 1529090
## 3269750 1154424 1154482
## 2782437  977390  977391
## 3715420 1312379 1312352
## [1] 300   2
## 'data.frame':    300 obs. of  2 variables:
##  $ from: int  695394 1187240 1529088 1154424 977390 1312379 1659386 139799 207174 1928616 ...
##  $ to  : int  695395 1175408 1529090 1154482 977391 1312352 1659387 139795 207172 1928609 ...
## [1] "from" "to"

The roads data frame has only two attributes.

I added the heterogeneity, as the third attribute. Heterogeneity can be used as a measure of how interconnected different parts of a graph are, and can be used in various network analysis applications.

The “heterogeneity” attribute, calculated based on the Jaccard index, represents a measure of the similarity between the sets of neighbors of two vertices in a graph. Specifically, it measures the ratio of the number of common neighbors between two vertices to the number of distinct neighbors they have.

A higher heterogeneity score (closer to 1) indicates that the two vertices have a more diverse set of neighbors, while a lower score (closer to 0) indicates that they share more common neighbors.

Randomly sample roads data frame based on the distribution of heterogeneity, to ensure that data analyzed is representative, while the size is being reduced to 150 variables.

Group heterogeneity into 3 bins

## [1] 60  4

Plot the graph with the node coordinates generated by layout_with_fr, and the communities detected by the cluster_edge_betweenness algorithm.

There are 60 communities in the network. This can help to identify densely connected regions of the road network and potentially reveal patterns in how different roads are used or connected.

## [1] 60

Plot g_roads where each node in the g_roads graph is colored based on the community it belongs to, as identified by the leading eigenvalue algorithm.

This can help identify groups of closely connected roads within the network. By identifying these communities, one can potentially reveal patterns in the road network that may not be immediately apparent from a visual inspection of the network. One may find that certain areas of the road network are more densely connected than others, or that there are clusters of roads that are more connected to each other than to the rest of the network. These patterns can provide valuable insights for urban planners and transportation engineers in terms of optimizing traffic flow and improving overall transportation infrastructure.

## [1] 60

Create the network dataframe:

Plot the network using the fruchterman.reingold layout.

Save the plot in jpeg format.

Plot the network based on the the heterogeneity groups using Fruchterman-Reingold layout algorithm.

## IGRAPH b8d3bdc UN-- 120 60 -- 
## + attr: name (v/c), label (v/n), heterogeneity (e/n),
## | heterogeneity_group (e/x)
## + edges from b8d3bdc (vertex names):
##  [1] 1065844--1065730 1334290--1334403 1541164--1541166 1102756--1102236
##  [5] 1777294--1775472 277910 --277916  1392208--1392211 140581 --140450 
##  [9] 1358615--1358591 493052 --493047  881504 --881536  321521 --321514 
## [13] 1249164--1249165 436028 --436027  884118 --883872  833343 --833367 
## [17] 1212260--1212257 842772 --842771  1892692--1892693 401902 --401903 
## [21] 1744387--1754130 1873266--1873213 303117 --303198  115011 --115010 
## [25] 1917361--1917360 171417 --171416  1821541--1821540 1041976--1041975
## + ... omitted several edges

Plot the network using the Louvain algorithm.

Save the plot in jpeg format.

Analyze the sampled_data / the network

Network “dimensions”.

The network has 120 vertices and 60 edges.

Density of the network.

The density of the network is 0.008403361, which means that the network is sparse,

indicating that only a very small fraction of all possible edges actually exist in the network.

## [1] 0.008403361

Symmetry.

The number of symmetric edges is half of the number of symmetric pairs, (being equal to the number of total edges in the network of roads).

This means that every edge in the network has a corresponding symmetric edge, indicating that there are no one-way streets or unidirectional connections in the network.

Reciprocity of the network.

Since the network is undirected, bidirectional edges are defined as edges that have a reciprocal edge in the opposite direction. The result is 1, which means that there is a perfect level of bidirectionality in the network.

## [1] 1

Calculate the degree of each node in the sampled network.

Every road has 1 connection in this network. Each road is connected to 1 other road in the network, both as an incoming and outgoing connection.

The largest weakly connected component includes only 1.67% of all of the vertices in the network. This indicates that the sampled network is poorly connected and cohesive, with many isolated vertices or small components.

The eigenvector centrality for the entire network is 1. This value represents the overall influence of the network, taking into account the centrality of all nodes and their connections.

Test for randomness

to determine if the observed patterns of connections in the sampled network are likely to have occurred by chance, or if they are meaningful and indicative of underlying structures or processes.

## Warning in ks.test.default(igraph::degree(randomgraph),
## igraph::degree(network)): p-value will be approximate in the presence of ties
## 
##  Asymptotic two-sample Kolmogorov-Smirnov test
## 
## data:  igraph::degree(randomgraph) and igraph::degree(network)
## D = 0.375, p-value = 9.382e-08
## alternative hypothesis: two-sided

Plot the road network based on the degree of distribution.

The observed roads network deviates significantly from a random model, the observed patterns are indicative of non-random processes or structures.

Plot the road network using the density() function based on the degree of distribution.

Similarly to the road network, the degree of distribution of the sampled road network is heavily skewed compared to the random generated network. It makes sense since the sampled road network is a random sample of the (large) road network (the source).