For this project, I explored a dataset from the Network Data Repository focused on diseasome studies. Diseasome research maps diseases to the specific genes associated with them, helping uncover patterns of disease causation based on genetic mutations. I chose this dataset because I had little prior knowledge of the field and wanted to dive deeper into bioinformatics, particularly in understanding how genetic overlap between diseases can reveal new insights into their relationships.
library(readr)
library(igraph)
library(threejs)
library(scales)
library(viridis)
library(graphlayouts)
library(dplyr)
Here I read in the raw edge list and make a basic plot of it.
biodata_edge <- read_table("C:/Users/sumner/downloads/bio-diseasome.txt")
basic_graph <- graph_from_data_frame(d=biodata_edge,directed = FALSE)
plot(basic_graph)
This dataset needed very little obvious cleaning, which made it tricky to decide what, if anything, to remove. I chose not to remove many nodes, as those with few connections could still hold meaningful insights into rare or less-researched diseases. While the data contained no self-loops or isolated nodes, I still performed basic cleaning to verify this and created a subgraph of diseases with a degree greater than 10. This subgraph highlights the most interconnected diseases, which act as major hubs within the network. However, I ultimately chose not to focus solely on this subset, because the smaller, sparsely connected nodes may represent rare diseases where further genetic research could uncover important new links.
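node_degree <- degree(basic_graph) #Degree of every node (named node_degree so it does not mask igraph's degree())
important_nodes <- V(basic_graph)[node_degree > 10] #Keeps nodes with a degree greater than 10
subgraph <- induced_subgraph(basic_graph, vids = important_nodes)
plot(subgraph, vertex.label = NA, vertex.size = 5)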
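The verification step mentioned above is not shown in the code. A minimal sketch of how such a check could look follows. It uses a made-up toy edge list (`toy_edges` is hypothetical and, unlike the real data, deliberately contains one self-loop and one duplicate edge) so that each check has something to find.

```r
library(igraph)

# Hypothetical toy edge list standing in for the real data;
# it deliberately includes one self-loop (3-3) and one duplicate edge (1-2).
toy_edges <- data.frame(
  from = c(1, 1, 2, 3, 1),
  to   = c(2, 3, 3, 3, 2)
)
toy_graph <- graph_from_data_frame(toy_edges, directed = FALSE)

# Count self-loops, duplicate edges, and isolated (degree-0) nodes
n_loops    <- sum(which_loop(toy_graph))
n_multiple <- sum(which_multiple(toy_graph))
n_isolated <- sum(degree(toy_graph) == 0)

# simplify() drops self-loops and merges duplicate edges
clean_graph <- simplify(toy_graph, remove.multiple = TRUE, remove.loops = TRUE)
```

Running the same three counts on `basic_graph` would be one way to confirm that nothing needs to be removed before the analysis.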
This data frame shows the total number of nodes and edges in the data.
num_nodes <- vcount(basic_graph)
num_edges <- ecount(basic_graph)
data.frame(
nodes = num_nodes,
edges = num_edges
)
## nodes edges
## 1 516 1188
For my first visualization, I plotted the diseasome network using the Fruchterman-Reingold layout. This layout revealed a massive interconnected “hairball” structure, where many diseases cluster together so tightly that individual relationships are hard to see. In this plot, larger nodes represent diseases with more genetic connections, while smaller nodes correspond to diseases with fewer shared genes. Because this visualization is so dense, I decided to use alternative layouts in later analyses to better separate and highlight the structure of the network.
plot(basic_graph,
layout = layout_with_fr(basic_graph), #Fruchterman - Reingold layout
vertex.label.cex = 1,
vertex.size = sqrt(degree(basic_graph)) * 3, #Sqrt vertex size to not overly crowd plot
edge.arrow.size = 1,
vertex.label = NA,
vertex.color = "green",
edge.color = "gray"
)
For my next visualization, I applied Louvain community detection to the network and used the viridis color palette to distinguish communities. I also scaled node sizes based on degree to highlight the most connected diseases. This plot revealed that many diseases cluster into distinct communities based on shared genetic mutations, with a few highly connected diseases acting as bridges that link multiple groups together. The use of 3D visualization made it easier to see the overall structure of the network and how certain nodes tie the whole network together.
cl <- cluster_louvain(basic_graph) #Cluster louvain community
V(basic_graph)$community <- cl$membership #Creates community as a vertex attribute
numcommunity <- length(unique(cl$membership))
viridis_colors <- viridis((numcommunity),option = 'plasma')
V(basic_graph)$color <- viridis_colors[cl$membership] #Groups color by community
V(basic_graph)$size <- rescale(degree(basic_graph), to = c(0.5, 3)) #Sizes nodes by degree
graphjs(
basic_graph,
layout = layout_with_stress3D(basic_graph), #3D stress layout from graphlayouts (expects a connected graph) to handle hairballs better than FR
vertex.size = V(basic_graph)$size,
vertex.color = V(basic_graph)$color,
vertex.label = NA,
edge.width = 0.5,
edge.color = "gray"
)
For my second advanced visualization, I focused on highlighting the top 5% of diseases with the highest degree. These highly connected diseases were colored differently to make them stand out from the rest of the network. This plot revealed that nearly every community contains at least one central hub disease that acts as a bridge, tethering otherwise separate groups together. Identifying these key diseases is vital because it highlights the major points of genetic overlap across the data.
deg <- degree(basic_graph) #Finds degree from graph
threshold <- quantile(deg, 0.95) #Creates a threshold for the top 5 percent of node degree
V(basic_graph)$color <- ifelse(deg >= threshold, "red", "white")
V(basic_graph)$size <- rescale(deg, to = c(0.5, 5)) #Adjust sizing
graphjs(
basic_graph,
layout = layout_with_stress3D(basic_graph), #3D stress layout from graphlayouts (expects a connected graph)
vertex.color = V(basic_graph)$color,
vertex.size = V(basic_graph)$size,
vertex.label = NA, # no label clutter
edge.color = "gray",
edge.width = 0.5
)
btw <- betweenness(basic_graph, normalized = TRUE) #Calculates betweenness centrality
top_btw <- sort(btw, decreasing = TRUE)[1:10] #Grabs the 10 nodes with the highest betweenness
graph_density <- edge_density(basic_graph)
cl <- cluster_louvain(basic_graph) #Louvain community detection
num_communities <- length(unique(cl$membership)) #Number of communities found
community_sizes <- sizes(cl)
top_btw_df <- data.frame(
Node = names(top_btw),
Betweenness = round(top_btw, 5)
)
community_sizes_df <- data.frame(
Community = names(community_sizes),
Size = as.integer(community_sizes)
)
summary_info_df <- data.frame(
Metric = c("Graph Density", "Number of Communities"),
Value = c(round(graph_density, 5), num_communities)
)
list(
Community_Sizes = community_sizes_df,
Top_Betweenness = top_btw_df,
Graph_Summary = summary_info_df
)
## $Community_Sizes
## Community Size
## 1 1 25
## 2 2 21
## 3 3 31
## 4 4 26
## 5 5 37
## 6 6 32
## 7 7 61
## 8 8 28
## 9 9 14
## 10 10 24
## 11 11 15
## 12 12 47
## 13 13 15
## 14 14 5
## 15 15 30
## 16 16 22
## 17 17 10
## 18 18 7
## 19 19 17
## 20 20 12
## 21 21 16
## 22 22 6
## 23 23 7
## 24 24 8
##
## $Top_Betweenness
## Node Betweenness
## 80 80 0.46937
## 257 257 0.40949
## 121 121 0.27820
## 169 169 0.23729
## 113 113 0.20029
## 310 310 0.17571
## 83 83 0.16830
## 93 93 0.13318
## 252 252 0.12387
## 24 24 0.11612
##
## $Graph_Summary
## Metric Value
## 1 Graph Density 0.00894
## 2 Number of Communities 24.00000
The first data frame shows the size of each Louvain community, that is, how many diseases fall into each natural grouping of genetically linked diseases. Larger communities are groups in which many diseases are tied together by shared genes, while smaller ones may reflect diseases that are less common or less studied. This supports the community structure seen in the network plots.
The second data frame lists, by ID, the 10 nodes with the highest betweenness. Betweenness measures how often a node lies on the shortest path between other nodes, so it captures how much a disease acts as a bridge rather than simply how many neighbors it has. Without these central diseases, we could miss the genetic overlap that ties multiple diseases together. Studying them would be especially important because these “hubs” could shed light on several health issues through the investigation of one specific disease.
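To make the idea of betweenness concrete, here is a small sketch on a made-up network (the graph `toy` and its node names are hypothetical, not taken from the diseasome data). Two triangles are joined only through node D, so D has few neighbors but sits on every shortest path between the two groups.

```r
library(igraph)

# Hypothetical toy network: triangles A-B-C and E-F-G, joined only through node D
toy <- graph_from_literal(A-B, B-C, A-C, C-D, D-E, E-F, F-G, E-G)

toy_degree <- degree(toy)                        # number of neighbors per node
toy_btw    <- betweenness(toy, normalized = TRUE) # share of shortest paths through each node

# Side-by-side comparison of the two measures
data.frame(degree = toy_degree, betweenness = round(toy_btw, 3))
```

In this toy graph D has only two neighbors, yet it has the highest normalized betweenness (0.6), because all nine shortest paths between the two triangles run through it. A node can therefore be a bridge without being a high-degree hub.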
The final data frame shows that although there are 24 communities, the network as a whole has a very low density (0.00894). This is to be expected: there are many diseases, but each one shares genes with only a few others. The number of communities gives a general idea of how many families of diseases are recorded in this data.
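The reported density can be checked by hand from the node and edge counts given earlier (516 nodes, 1188 edges). The sketch below assumes an undirected graph with no self-loops or duplicate edges, where density is the number of edges present divided by the number of possible edges; the variable names are only illustrative.

```r
# Node and edge counts reported earlier in this report
n_nodes <- 516
n_edges <- 1188

# Density of an undirected simple graph: edges present / edges possible
possible_edges <- n_nodes * (n_nodes - 1) / 2
manual_density <- n_edges / possible_edges

round(manual_density, 5)
```

The result rounds to 0.00894, matching the value in the summary table: fewer than 1% of all possible disease pairs are directly linked.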
set.seed(123) #Seed so the InfoMap result is reproducible
info_map <- cluster_infomap(basic_graph) # Detect communities using InfoMap
comm <- info_map$membership #Community id of each vertex, in vertex order
edge_list <- ends(basic_graph, es = E(basic_graph), names = FALSE) #Edge list as vertex indices so it can index comm
edge_crosses <- as.numeric(comm[edge_list[,1]] != comm[edge_list[,2]]) #1 if an edge joins two different communities
community_colors <- viridis(length(unique(comm)), option = 'turbo') #One viridis color per community
set.seed(123)
graph_layout <- layout_with_graphopt(basic_graph) #Graphopt layout, computed once so both panels match
par(mfrow = c(1,2)) #Two panels side by side
par(mar = c(0,0,0,0))
plot(
basic_graph,
layout = graph_layout,
vertex.color = community_colors[comm], #Colors each vertex by its community
vertex.label = NA,
vertex.size = 5
)
plot(
basic_graph,
layout = graph_layout,
vertex.color = community_colors[comm], #Colors each vertex by its community
vertex.label = NA,
vertex.size = 5,
edge.color = ifelse(edge_crosses == 1, "red", "gray"), #Highlights edges that cross between communities
edge.width = 0.5
)
These plots show the communities in the data, color-coded by InfoMap community membership, with edges representing genetic ties. The first plot shows that the data contains multiple disease communities. Because every node belongs to exactly one community, any link between two communities has to be an edge whose endpoints have different memberships. The second plot colors those crossing edges red, so the red edges mark the diseases that tie otherwise separate communities together, while gray edges stay inside a single community.
sub_comm <- comm[match(V(subgraph)$name, V(basic_graph)$name)] #InfoMap community of each subgraph node, reused from the full graph
sub_ends <- ends(subgraph, es = E(subgraph), names = FALSE) #Subgraph edge list as vertex indices
sub_crosses <- as.numeric(sub_comm[sub_ends[,1]] != sub_comm[sub_ends[,2]]) #1 if a subgraph edge joins two communities
set.seed(123)
sub_layout <- layout_with_graphopt(subgraph) #Computed once so both panels match
par(mfrow = c(1,2))
par(mar = c(0,0,0,0))
plot(
subgraph,
layout = sub_layout,
vertex.color = community_colors[sub_comm],
vertex.label = NA,
vertex.size = 5
)
plot(
subgraph,
layout = sub_layout,
vertex.color = community_colors[sub_comm],
vertex.label = NA,
vertex.size = 5,
edge.color = ifelse(sub_crosses == 1, "red", "gray"),
edge.width = 0.5
)
Finally, with this plot I went back to the subgraph from before that only includes nodes with a degree greater than 10. With the low-degree nodes out of the way, it is easier to see how the hubs relate to the communities: gray edges join hubs within the same community, and red edges join hubs from different communities. Diseases that share a genetic pathway should mostly connect to each other rather than to diseases with a different genetic basis, so the more the gray edges outnumber the red ones, the more strongly this plot supports the communities found earlier.
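A quick way to sanity-check a count of edges between communities is igraph's `crossing()` helper, which flags every edge whose endpoints belong to different communities. The sketch below uses a made-up toy graph with hand-assigned memberships (`toy` and `toy_comm` are hypothetical, not the result of a clustering run on the diseasome data).

```r
library(igraph)

# Hypothetical toy network: two triangles joined by a single edge (C-D)
toy <- graph_from_literal(A-B, B-C, A-C, C-D, D-E, E-F, D-F)

# Hand-assigned memberships for illustration: A, B, C in community 1; D, E, F in community 2
toy_comm <- make_clusters(toy, membership = c(1, 1, 1, 2, 2, 2))

# crossing() returns TRUE for every edge whose endpoints are in different communities
is_crossing <- crossing(toy_comm, toy)
sum(is_crossing)
```

Here only the C-D edge crosses, so the count is 1. On the real data, `sum(crossing(info_map, basic_graph))` would give the number of edges that link different InfoMap communities.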
Unfortunately, this dataset doesn’t contain disease names or descriptive information, so I made do with ID labels. If I had access to a version with better labeling and disease classification, these are some research questions I would try to answer from these analyses.
Are diseases with higher degree also more likely to have higher betweenness, that is, to act as bridges between different diseases?
Are smaller disease communities associated with rarer diseases, while larger communities contain more common diseases with more research behind them?
How do bridge diseases differ from those that are clustered within each community?
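The first question could be approached even without disease labels. As a rough sketch, the code below computes a rank correlation between degree and betweenness and then picks out nodes that are bridges without being hubs. It runs on igraph's built-in Zachary karate club graph as a stand-in; the cutoffs (top 10% betweenness, below-median degree) are assumptions chosen for illustration, and the real analysis would use `basic_graph` instead.

```r
library(igraph)

# Stand-in network (igraph's built-in Zachary karate club graph);
# the real analysis would use basic_graph here.
toy <- make_graph("Zachary")

toy_degree <- degree(toy)
toy_btw    <- betweenness(toy, normalized = TRUE)

# Spearman rank correlation: do higher-degree nodes also rank higher on betweenness?
deg_btw_cor <- cor(toy_degree, toy_btw, method = "spearman")
deg_btw_cor

# Bridges that are not hubs: top 10% betweenness but below-median degree
# (both cutoffs are illustrative assumptions)
bridge_not_hub <- which(toy_btw >= quantile(toy_btw, 0.9) &
                        toy_degree < median(toy_degree))
```

A correlation close to 1 would suggest that hubs and bridges are largely the same diseases, while any nodes returned in `bridge_not_hub` would be natural candidates for the third question.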