Part 1 Data Description and Transformation

For this project, I explored a dataset from the Network Data Repository focused on diseasome studies. Diseasome research maps diseases to the specific genes associated with them, helping uncover patterns of disease causation based on genetic mutations. I chose this dataset because I had little prior knowledge of the field and wanted to dive deeper into bioinformatics, particularly in understanding how genetic overlap between diseases can reveal new insights into their relationships.

library(readr)
library(igraph)
library(threejs)
library(scales)
library(viridis)
library(graphlayouts) 
library(dplyr)

Here I read in the raw edge list and make a basic plot of it.

biodata_edge <- read_table("C:/Users/sumner/downloads/bio-diseasome.txt")

basic_graph <- graph_from_data_frame(d=biodata_edge,directed = FALSE)

plot(basic_graph)

This dataset was a little tricky to work with because there were very few obvious cleaning steps needed. I chose not to remove many nodes, as those with few connections could still hold meaningful insights into rare or less-researched diseases. While the data contained no self-loops or isolated nodes, I still performed basic cleaning to verify this and created a subgraph of diseases with a degree greater than 10. This subgraph highlights the most interconnected diseases representing major bridges within the network. However, I ultimately chose not to focus solely on this subset, because the smaller, sparsely connected nodes may represent rare diseases where further genetic research could uncover important new links.

node_degree <- degree(basic_graph) #Named node_degree so the degree() function is not masked

important_nodes <- V(basic_graph)[node_degree > 10] #Diseases with more than 10 connections

subgraph <- induced_subgraph(basic_graph, vids = important_nodes)

plot(subgraph, vertex.label = NA, vertex.size = 5)
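The check for self-loops and isolated nodes mentioned above is not shown in the code, so here is a minimal sketch of how it could be done with igraph. The toy edge list and every name in it (`toy_edges`, `toy`, `toy_clean`) are hypothetical and only for illustration; this is not necessarily the exact check I ran.

```r
library(igraph)

# Toy edge list (hypothetical data): one self-loop on "a" and one duplicated a-b edge
toy_edges <- data.frame(
  from = c("a", "a", "a", "b"),
  to   = c("a", "b", "b", "c")
)
toy <- graph_from_data_frame(toy_edges, directed = FALSE)

n_loops    <- sum(which_loop(toy))      # edges that start and end at the same node
n_multiple <- sum(which_multiple(toy))  # repeated copies of an already existing edge
n_isolated <- sum(degree(toy) == 0)     # nodes with no connections at all

# simplify() drops self-loops and collapses duplicate edges into one
toy_clean <- simplify(toy, remove.multiple = TRUE, remove.loops = TRUE)
```

Running the same three counts on `basic_graph` would confirm whether any cleaning is needed before plotting.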

This data frame shows the total number of nodes and edges in the data.

num_nodes <- vcount(basic_graph)
num_edges <- ecount(basic_graph)

data.frame( 
  nodes = num_nodes,
  edges = num_edges
  )

##   nodes edges
## 1   516  1188

Part 2 Visualization

Basic Visualization

For my first visualization, I created a basic plot of the diseasome network using the Fruchterman-Reingold layout. This layout revealed a massive interconnected “hairball” structure, where many diseases clustered together so tightly that individual nodes and edges were hard to tell apart. In this plot, larger nodes represent diseases with more genetic connections, while smaller nodes correspond to diseases with fewer shared genes. Based on the density of this visualization, I decided to use alternative layouts in later analyses to better separate and highlight the structure of the network.

plot(basic_graph, 
     layout = layout_with_fr(basic_graph), #Fruchterman-Reingold layout
     vertex.size = sqrt(degree(basic_graph)) * 3, #Square-root scaling keeps hubs from crowding the plot
     vertex.label = NA,  
     vertex.color = "green", 
     edge.color = "gray"
)

Advanced Visualization

For my next visualization, I applied Louvain community detection to the network and used the viridis color palette to distinguish communities. I also scaled node sizes based on degree to highlight the most connected diseases. This plot revealed that many diseases cluster into distinct communities based on shared genetic mutations, with a few highly connected diseases acting as bridges that link multiple groups together. The use of 3D visualization made it easier to see the overall structure of the network and how certain nodes tie the whole network together.

cl <- cluster_louvain(basic_graph) #Louvain community detection

V(basic_graph)$community <- cl$membership #Stores community as a vertex attribute
numcommunity <- length(unique(cl$membership))

viridis_colors <- viridis(numcommunity, option = 'plasma')
V(basic_graph)$color <- viridis_colors[cl$membership] #Colors nodes by community

V(basic_graph)$size <- rescale(degree(basic_graph), to = c(0.5, 3)) #Sizes nodes by degree


graphjs(
  basic_graph,
  vertex.layout = layout_with_stress, #Used stress layout to handle hairballs and get a better layout than FR
  vertex.size = V(basic_graph)$size,
  vertex.color = V(basic_graph)$color,
  vertex.label = NA,
  edge.width = 0.5,
  edge.color = "gray"
)

For my second advanced visualization, I focused on highlighting the top 5% of diseases by degree. These highly connected diseases were colored red to make them stand out from the rest of the network. This plot revealed that nearly every community shares at least one central hub disease that acts as a bridge, tethering otherwise separate groups together. Identifying these key diseases is vital because it highlights the major points of genetic overlap across the data.

deg <- degree(basic_graph) #Finds degree from graph
threshold <- quantile(deg, 0.95) #Creates a threshold for the top 5 percent of node degree


V(basic_graph)$color <- ifelse(deg >= threshold, "red", "white")
V(basic_graph)$size <- rescale(deg, to = c(0.5, 5)) #Adjust sizing 


graphjs(
  basic_graph,
  vertex.layout = layout_with_stress,
  vertex.color = V(basic_graph)$color,
  vertex.size = V(basic_graph)$size,
  vertex.label = NA,  # no label clutter
  edge.color = "gray",
  edge.width = 0.5
)
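To make the top-5% cutoff concrete, here is a small sketch of the same `quantile()` thresholding on a made-up degree vector. The values in `toy_deg` are hypothetical and are not taken from the diseasome data.

```r
# Hypothetical degree values for ten nodes (not from the diseasome data)
toy_deg <- c(1, 1, 2, 2, 3, 3, 4, 5, 8, 20)

# 95th percentile of the degree distribution (R's default interpolation)
toy_threshold <- quantile(toy_deg, 0.95)

# Nodes at or above the threshold are flagged as hubs
toy_colors <- ifelse(toy_deg >= toy_threshold, "red", "white")
n_hubs <- sum(toy_colors == "red")
```

Because the comparison uses `>=`, ties at the threshold are all flagged, so on real data the red group can be slightly larger than exactly 5% of the nodes.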

Part 3 Analysis

btw <- betweenness(basic_graph, normalized = TRUE) #Calculates betweenness centrality
top_btw <- sort(btw, decreasing = TRUE)[1:10] #Grabs the 10 nodes with the highest betweenness


graph_density <- edge_density(basic_graph) #Share of possible edges that are present


cl <- cluster_louvain(basic_graph) #Louvain community detection
num_communities <- length(unique(cl$membership)) #Number of communities found
community_sizes <- sizes(cl) #Number of nodes in each community



top_btw_df <- data.frame(
  Node = names(top_btw),
  Betweenness = round(top_btw, 5)
)

community_sizes_df <- data.frame(
  Community = names(community_sizes),
  Size = as.integer(community_sizes)
)

summary_info_df <- data.frame(
  Metric = c("Graph Density", "Number of Communities"),
  Value = c(round(graph_density, 5), num_communities)
)

list(
  Community_Sizes = community_sizes_df,
  Top_Betweenness = top_btw_df,
  Graph_Summary = summary_info_df
)

## $Community_Sizes
##    Community Size
## 1          1   25
## 2          2   21
## 3          3   31
## 4          4   26
## 5          5   37
## 6          6   32
## 7          7   61
## 8          8   28
## 9          9   14
## 10        10   24
## 11        11   15
## 12        12   47
## 13        13   15
## 14        14    5
## 15        15   30
## 16        16   22
## 17        17   10
## 18        18    7
## 19        19   17
## 20        20   12
## 21        21   16
## 22        22    6
## 23        23    7
## 24        24    8
## 
## $Top_Betweenness
##     Node Betweenness
## 80    80     0.46937
## 257  257     0.40949
## 121  121     0.27820
## 169  169     0.23729
## 113  113     0.20029
## 310  310     0.17571
## 83    83     0.16830
## 93    93     0.13318
## 252  252     0.12387
## 24    24     0.11612
## 
## $Graph_Summary
##                  Metric    Value
## 1         Graph Density  0.00894
## 2 Number of Communities 24.00000

Community Size

First, this data frame shows the size of each community in the Louvain clustering, which groups related diseases based on genetic linkage. Bigger communities are clusters of diseases that share more genetic connections, while smaller ones may be less common or, more likely, less studied. This supports the community structure shown below.

Betweenness

This second data frame shows the ten nodes, by ID, with the highest betweenness. Betweenness measures how often a node lies on the shortest path between other nodes. It essentially shows how much these nodes connect the rest of the network: without these central diseases, we could miss the genetic overlap and the links that tie multiple diseases together. Studying them would be especially important because these “hubs” could shed light on multiple health issues through the investigation of one specific disease.
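As a small illustration of what betweenness captures, the sketch below builds a hypothetical toy network of two triangles joined through a single node `d`. The graph and the names `toy`, `toy_btw`, and `bridge` are made up for this example.

```r
library(igraph)

# Hypothetical toy network: triangles a-b-c and e-f-g, connected only through d
toy <- graph_from_literal(a-b, b-c, a-c, c-d, d-e, e-f, f-g, e-g)

# Normalized betweenness: share of shortest paths that pass through each node
toy_btw <- betweenness(toy, normalized = TRUE)

# The node that sits on the most shortest paths
bridge <- names(which.max(toy_btw))
```

Here `d` lies on every shortest path between the two triangles (9 of the 15 possible node pairs that exclude `d`), so it has the highest betweenness even though its degree is only 2.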

Density

This final data frame shows that even though there are 24 communities, the network as a whole has a very low density (about 0.009, meaning fewer than 1% of all possible disease pairs are directly linked). This is to be expected because, while there are many entries, most diseases share genes with only a few others. The number of communities gives a general idea of how many families of diseases are recorded in this data.
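The density value can be checked by hand from the node and edge counts reported in Part 1. For an undirected graph without self-loops, density is the number of edges divided by the number of possible node pairs. The variable names below are just for this worked example.

```r
# Counts reported earlier in the report
n_nodes <- 516
n_edges <- 1188

# Number of possible undirected edges between n nodes: n * (n - 1) / 2
max_edges <- n_nodes * (n_nodes - 1) / 2

# Density = observed edges / possible edges
density_by_hand <- n_edges / max_edges
round(density_by_hand, 5)
```

This agrees with the 0.00894 returned by `edge_density()` in the summary table.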

Community Visual

info_map <- cluster_infomap(basic_graph) # Detect communities using Infomap
comm <- info_map$membership # Community id for each vertex, in vertex order


# Edge endpoints as vertex indices (names = FALSE) so they can index comm directly
edge_list <- ends(basic_graph, es = E(basic_graph), names = FALSE) #Creates edge list
edge_crosses <- as.numeric(comm[edge_list[, 1]] != comm[edge_list[, 2]]) #1 if an edge joins two communities

community_colors <- viridis(length(unique(comm)), option = 'turbo') #One viridis color per community
par(mfrow = c(1, 2)) #Two plots side by side
par(mar = c(0, 0, 0, 0))

set.seed(123) #Same seed before each plot so both use the same layout
plot(
  basic_graph,
  layout = layout_with_graphopt, #Used graphopt layout for better visual
  vertex.color = community_colors[comm], #Colors each vertex by its community
  vertex.label = NA,
  vertex.size = 5
)

set.seed(123)
plot(
  basic_graph,
  layout = layout_with_graphopt, #Used graphopt layout for better visual
  vertex.color = community_colors[comm], #Colors each vertex by its community
  vertex.label = NA,
  vertex.size = 5,
  edge.color = ifelse(edge_crosses == 1, "red", "gray"), #Colors edges that cross communities red
  edge.width = 0.5
)

This plot shows the communities in the data, color-coded by community membership, with edges representing genetic ties. The first plot shows multiple disease communities spread across the data, with neighboring communities tied together through a small number of common bridge nodes. In the second plot, edges that cross between communities are set to be highlighted in red. No red edges appeared in my plot, which suggests that the disease communities are distinct from one another and are held together only through those common bridge nodes.
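A quick way to sanity-check the edge-crossing logic is to try it on a tiny graph where the answer is known in advance. The sketch below uses igraph's `crossing()` helper on a hypothetical toy graph with a hand-assigned membership; the graph, the membership vector, and the names `toy`, `toy_comm`, and `is_cross` are assumptions made only for illustration.

```r
library(igraph)

# Hypothetical toy graph: triangle a-b-c and triangle d-e-f joined by the edge c-d
toy <- graph_from_literal(a-b, b-c, a-c, c-d, d-e, e-f, d-f)

# Hand-assigned communities in vertex order a, b, c, d, e, f
toy_comm <- make_clusters(toy, membership = c(1, 1, 1, 2, 2, 2))

# TRUE for every edge whose endpoints sit in different communities
is_cross <- crossing(toy_comm, toy)

edge_cols <- ifelse(is_cross, "red", "gray")
n_cross <- sum(is_cross)
```

Only the `c-d` edge joins the two groups here, so exactly one edge is flagged. Running `sum(crossing(info_map, basic_graph))` on the real network would give a direct count of crossing edges to compare against the plot.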

Community Visual Subgraph

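info_map <- cluster_infomap(basic_graph)
comm <- info_map$membership
community_colors <- viridis(length(unique(comm)), option = 'turbo')

# Look up each subgraph vertex's community and flag subgraph edges that cross communities
sub_comm <- comm[match(V(subgraph)$name, V(basic_graph)$name)]
sub_ends <- ends(subgraph, es = E(subgraph), names = FALSE)
sub_crosses <- as.numeric(sub_comm[sub_ends[, 1]] != sub_comm[sub_ends[, 2]])

par(mfrow = c(1,2))
par(mar = c(0,0,0,0))

set.seed(123)
plot(
  subgraph,
  layout = layout_with_graphopt,
  vertex.color = community_colors[sub_comm],
  vertex.label = NA,
  vertex.size = 5
)

set.seed(123)
plot(
  subgraph,
  layout = layout_with_graphopt,
  vertex.color = community_colors[sub_comm],
  vertex.label = NA,
  vertex.size = 5,
  edge.color = ifelse(sub_crosses == 1, "red", "gray"),
  edge.width = 0.5
)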

Finally, with this plot I went back to the subgraph from before that only included nodes with more than 10 connections. With a lot of the extra nodes out of the way, the earlier conclusion still appears to hold: in my plot, no disease community shares edges with other communities. This makes sense in the context of the data because diseases tied to certain genetic pathways will not naturally form direct connections with diseases that have a different genetic basis. This separation supports the communities identified before.

Research Questions

Unfortunately, this dataset doesn’t contain disease labels or descriptive information, so I made do with ID labels. If I had access to a version with better labeling and disease classification, these are some research questions I would try to answer from these analyses.

RQ 1

Are diseases with a higher degree more likely to also have a higher betweenness across the network?

RQ 2

Are smaller disease communities associated with rarer diseases, while larger communities contain more common diseases with more research behind them?

RQ 3

How do bridge diseases differ from those that are clustered within each community?
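As a starting point for RQ 1, one possible approach is a rank correlation between degree and betweenness. The sketch below runs on a synthetic preferential-attachment graph as a stand-in, so the graph, the seed, and the names `toy`, `toy_deg`, `toy_btw`, and `rho` are all assumptions for illustration rather than results from the diseasome data.

```r
library(igraph)

set.seed(42)

# Synthetic stand-in network (not the diseasome data): 200 nodes, 2 edges added per new node
toy <- sample_pa(200, m = 2, directed = FALSE)

toy_deg <- degree(toy)
toy_btw <- betweenness(toy, normalized = TRUE)

# Spearman rank correlation: do high-degree nodes also tend to rank high on betweenness?
rho <- cor(toy_deg, toy_btw, method = "spearman")
```

Replacing `toy` with `basic_graph` would give the corresponding estimate for the disease network. Spearman is used here because both measures are heavily skewed, so ranks are a safer comparison than raw values.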

Final 295

Sumner Wilson

2025-04-30