class: center, middle, title-slide .title[ # MD7115: R for Graph, Network and Trees ] .subtitle[ ## Applications in Biomedical Research ] .author[ ###
Xiaotao Shen
Assistant Professor
shen-lab.org
xiaotao.shen@ntu.edu.sg
] .institute[ ### Lee Kong Chian School of Medicine, NTU Singapore ] .date[ ### 2025-03-30 ] --- layout: true background-image: url(data:image/png;base64,#images/lkc_logo.png) background-position: 100% 0% background-size: 20% <div class="my-footer"><span>Xiaotao Shen/www.shen-lab.org</span></div> --- # About the class - **Instructor**: Xiaotao Shen (https://www.shen-lab.org/) - **Email**: xiaotao.shen@ntu.edu.stg - **WhatsApp**: +1 5712679283 - **Address**: Level 3, EMB, NTU - **Github Repo**: https://github.com/jaspershen-lab/MD7115-NTU --- # Introduction In this course, we will learn how to use R to analyze and visualize data in the form of graphs, networks and trees. We will cover the following topics. 1. What is a graph, network and tree? 2. Their applications in biomedical research 3. How to use R to analyze and visualize them 4. Examples in publications --- # Install RStudio **Ref:** 1. https://rstudio-education.github.io/hopr/starting.html 2. https://posit.co/download/rstudio-desktop/ <img src="https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/3_materials/images/rstudio.png" alt="a graph" width="60%"> --- # RStudio project: step 1 Link: https://github.com/jaspershen-lab/MD7115-NTU <img src="https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/3_materials/images/step1.png" alt="a graph" width="80%"> --- # RStudio project: step 2 <img src="https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/3_materials/images/step2.png" alt="a graph" width="80%"> --- # RStudio project: step 3 <img src="https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/3_materials/images/step3.png" alt="a graph" width="80%"> --- # RStudio project: step 4 <img src="https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/3_materials/images/step4.png" alt="a graph" width="80%"> --- # RStudio project: step 5 <img src="https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/3_materials/images/step5.png" alt="a graph" width="80%"> --- # RStudio project: step 6 <img src="https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/3_materials/images/step6.png" alt="a graph" width="80%"> --- # Preparation: general packages 1 We need to install some general R packages before the class. Some packages are in Bioconductor and GitHub, so we need to install `BiocManager` and `remotes` first. ``` r # BiocManager is a package manager for Bioconductor if (!require("BiocManager", quietly = TRUE)){ install.packages("BiocManager") } # remotes is a package manager for GitHub if (!require("remotes", quietly = TRUE)){ install.packages("remotes") } ``` `tidyverse` is a collection of R packages designed for data science, and it includes `ggplot2`, which is a basic graph package for network and tree in R. ``` r if (!require("tidyverse", quietly = TRUE)){ install.packages("tidyverse") } ``` --- # Preparation: general packages 2 ``` r ##readr to read data (csv, tsv, etc.) if (!require("readr", quietly = TRUE)){ install.packages("readr") } ##tibble to create data.frame if (!require("tibble", quietly = TRUE)){ install.packages("tibble") } # shatdowtext is a package for adding shadow to text if (!require("shadowtext", quietly = TRUE)){ install.packages("shadowtext") } # ggsci is a package for scientific journal color palettes if (!require("ggsci", quietly = TRUE)){ install.packages("ggsci") } ``` --- # Preparation: packages for graph/network Some specific packages for graph and network. ``` r # igraph is a package for creating, manipulating, and analyzing network data if (!require("igraph", quietly = TRUE)){ install.packages("igraph") } # ggraph is a package for creating network visualizations using ggplot2 if(!require("ggraph", quietly = TRUE)){ install.packages("ggraph") } #tidygraph is a package for tidy manipulation of graph data if(!require("tidygraph", quietly = TRUE)){ install.packages("tidygraph") } # ggnetword is another package for creating network visualizations using ggplot2 if(!require("ggnetwork", quietly = TRUE)){ install.packages("ggnetwork") } # extrafont is a package for using extra fonts in ggplot2 if (!require(extrafont)) { install.packages("extrafont") } ``` --- # Preparation: packages for trees Some specific packages for tree analysis. ``` r # microbiomeViz is a package for visualizing microbiome data if (!require("microbiomeViz", quietly = TRUE)) { remotes::install_github("lch14forever/microbiomeViz") } # phyloseq is a package for microbiome data analysis if (!require("phyloseq", quietly = TRUE)) { BiocManager::install("phyloseq") } # ggtree is a package for visualizing phylogenetic trees if(!require("ggtree", quietly = TRUE)){ BiocManager::install("ggtree") } # treeio is a package for reading and writing phylogenetic tree data if(!require("treeio", quietly = TRUE)){ BiocManager::install("treeio") } # ggtreeExtra is a package for adding extra features to ggtree if(!require("ggtreeExtra", quietly = TRUE)){ BiocManager::install("ggtreeExtra") } ``` --- # What is a graph/network? .pull-left[ A graph is a collection of nodes (vertices) and edges (links) that connect pairs of nodes. A small undirected graph with numbered nodes. ] .pull-right[ <img src="https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/3_materials/images/graph.png" alt="a graph" width="70%"> ] --- # Biological networks - **Protein-protein interaction networks**: Protein-protein interaction networks (PPIN) are mathematical representations of the physical contacts between proteins in the cell. - **Metabolic networks**: A metabolic network is the complete set of metabolic and physical processes that determine the physiological and biochemical properties of a cell. - **Correlation network**: Correlation networks are a way to represent and analyze the relationships between different biological entities (like genes, proteins, or metabolites) based on how their levels or activity change together across various samples or conditions. - **Gene regulatory networks**: A gene regulatory network is a collection of molecular regulators that interact with each other and with other substances in the cell to regulate the gene expression levels of mRNA and proteins. - **Microbe-metabolite network**: A microbe-metabolite network is a network that describes the interactions between microbes and metabolites in a biological system. - **Drug-target networks**: A drug-target network is a network that describes the interactions between drugs and their target proteins in the cell. --- # Protein-protein interaction networks .pull-left[ Protein-protein interaction networks (PPIN) are mathematical representations of the physical contacts between proteins in the cell. * Nodes: proteins * Edges: interactions between proteins ] .pull-right[ <div style="display: flex; justify-content: center; align-items: center;"> <img src="https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/3_materials/images/ppi_network.png" alt="Description of the image" width="100%"> </div> > Jeong et al. Nature 2001 ] --- # Metabolic networks .pull-left[ A metabolic network is the complete set of metabolic and physical processes that determine the physiological and biochemical properties of a cell. As such, these networks comprise the chemical reactions of metabolism, the metabolic pathways, as well as the regulatory interactions that guide these reactions. * Nodes: protein (enzyme and metabolites) * Edges: chemical reactions * Database: KEGG, Reactome, HMDB... ] .pull-right[ <div style="display: flex; justify-content: center; align-items: center;"> <img src="https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/3_materials/images/metabolic_network.png" alt="Description of the image" width="100%"> </div> > https://en.wikipedia.org/wiki/Metabolic_network ] --- # Correlation networks .pull-left[ Correlation networks are a way to represent and analyze the relationships between different biological entities (like genes, proteins, or metabolites) based on how their levels or activity change together across various samples or conditions. * Node: Each node in the network represents a biological entity (e.g., a gene, a protein, or a metabolite). * Edge: Edges connect nodes and represent the correlation between the entities they represent. The strength or weight of the edge indicates the strength of the correlation. Gene Co-expression Networks ] .pull-right[ <div style="display: flex; justify-content: center; align-items: center;"> <img src="https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/3_materials/images/correlation_network.png" alt="Description of the image" width="100%"> </div> > Zhou et al. Cell Host & Microbe 2024 ] --- # Gene Co-expression Networks Step 1: Expression data. Step 2: DEG. Step 3: Correlation. Step 4: Network <div style="display: flex; justify-content: center; align-items: center;"> <img src="https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/3_materials/images/genes-10-00962-g001.png" alt="Description of the image" width="100%"> </div> > Delgado-Chaves et al. Genes 2019 --- # Some important properties of a network * **Degree Distribution**: The distribution of the number of connections (or "degree") each node has in the network. * **Shortest Paths**: The shortest path between two nodes in a network is the smallest number of edges that must be traversed to get from one node to the other. * **Centrality**: Measures the importance or influence of a node within the network, often based on the number of connections, shortest paths, or intermediate positions. * **Betweenness Centrality**: A measure of how often a node appears on the shortest path between two other nodes in the network. --- # Degree distribution .pull-left[ What is the degree of nodes? Degree is the number of edges connected to a node. In this network, node 3's degree is 3. node 1's degree is 1. ] .pull-right[ <img src="https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/3_materials/images/graph.png" alt="a graph" width="70%"> ] --- # Scale-free networks .pull-left[ A "scale-free network" refers to a network where the degree distribution, which is the distribution of the number of connections (or links) each node has, follows a power law. This means that a small number of nodes (the "hubs") have a very large number of connections, while the majority of nodes have relatively few connections. In biological network, scale-free networks are common, with a few highly connected nodes (hubs) and many nodes with only a few connections. **These hubs are often important for the overall structure and function of the network (hub gene/protein/metabolite).** ] .pull-right[ <img src="https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/3_materials/images/scale_free_network.png" alt="a graph" width="70%"> > https://link.springer.com/article/10.1007/s10142-022-00907-y ] --- # Graph/Network in R Some important R packages for graph and network analysis. * `igraph` A R package for creating, manipulating, and analyzing network data. [https://igraph.org/](https://igraph.org/) * `ggraph` A R package for creating network visualizations using `ggplot2`. [https://ggraph.data-imaginist.com/](https://ggraph.data-imaginist.com/) * `tidygraph` A R package for tidy manipulation of graph data. [https://tidygraph.data-imaginist.com/](https://tidygraph.data-imaginist.com/) --- # Create a network: from the data.frame ``` r library(tidyverse) library(tibble) library(igraph) library(ggraph) library(tidygraph) ##demo data head(highschool) ``` ``` ## from to year ## 1 1 14 1957 ## 2 1 15 1957 ## 3 1 21 1957 ## 4 1 54 1957 ## 5 1 55 1957 ## 6 2 21 1957 ``` --- # Create a network: from the data.frame ``` r graph = as_tbl_graph(highschool) class(graph) ``` ``` ## [1] "tbl_graph" "igraph" ``` ``` r graph ``` ``` ## # A tbl_graph: 70 nodes and 506 edges ## # ## # A directed multigraph with 1 component ## # ## # Node Data: 70 × 0 (active) ## # ## # Edge Data: 506 × 3 ## from to year ## <int> <int> <dbl> ## 1 1 13 1957 ## 2 1 14 1957 ## 3 1 20 1957 ## # ℹ 503 more rows ``` --- # Create a network: from the data.frame .pull-left[ ``` r plot(graph) ``` ] .pull-right[ <!-- --> ] --- # Create a network: from the node and edge data ``` r node_data = tibble::tibble( id = c("A", "B", "C", "D", "E"), label = c("Alice", "Bob", "Charlie", "David", "Eve"), sex = c("F", "M", "M", "F", "F") ) edge_data = tibble::tibble( from = c("A", "A", "B", "C", "D", "E"), to = c("B", "C", "D", "D", "E", "A"), relationship = c("friend", "neighbor", "friend", "neighbor", "friend", "friend") ) ``` --- # Create a network: from the node and edge data .pull-left[ ``` r node_data ``` ``` ## # A tibble: 5 × 3 ## id label sex ## <chr> <chr> <chr> ## 1 A Alice F ## 2 B Bob M ## 3 C Charlie M ## 4 D David F ## 5 E Eve F ``` ] .pull-right[ ``` r edge_data ``` ``` ## # A tibble: 6 × 3 ## from to relationship ## <chr> <chr> <chr> ## 1 A B friend ## 2 A C neighbor ## 3 B D friend ## 4 C D neighbor ## 5 D E friend ## 6 E A friend ``` ] --- # Create a network: from the node and edge data .pull-left[ `tbl_graph` is a function from the package `tidygraph` to create a network from the node and edge data. ``` r graph2 = tbl_graph(nodes = node_data, edges = edge_data, directed = FALSE) graph2 ``` ] .pull.right[ ``` ## # A tbl_graph: 5 nodes and 6 edges ## # ## # An undirected simple graph with 1 component ## # ## # Node Data: 5 × 3 (active) ## id label sex ## <chr> <chr> <chr> ## 1 A Alice F ## 2 B Bob M ## 3 C Charlie M ## 4 D David F ## 5 E Eve F ## # ## # Edge Data: 6 × 3 ## from to relationship ## <int> <int> <chr> ## 1 1 2 friend ## 2 1 3 neighbor ## 3 2 4 friend ## # ℹ 3 more rows ``` ] --- # Network visualization .pull-left[ * `ggraph` function is used to visualize the network. * `geom_edge_xxx` functions are used to add edges to the plot. * `geom_node_xxx` functions are used to add nodes to the plot. ``` r ggraph(graph2, layout = 'kk') + geom_edge_link() + geom_node_point() ``` ] .pull-right[ <!-- --> ] --- # Layout .pull-left[ The layout of one network is the arrangement of nodes and edges in the network. `ggraph` supports different layouts. * kk: Kamada-Kawai layout * fr: Fruchterman-Reingold layout * stress: Stress layout * Customized layout ``` r ggraph(graph2, layout = 'fr') + geom_edge_link() + geom_node_point() ``` ] .pull-right[ <!-- --> ] --- # Layout .pull-left[ Stress layout is a layout that tries to minimize the stress of the network. ``` r ggraph(graph2, layout = 'stress') + geom_edge_link() + geom_node_point() ``` ] .pull-right[ <!-- --> ] --- # Layout .pull-left[ Linear layout is a layout that arranges the nodes in a straight line. ``` r ggraph(graph2, layout = 'linear') + geom_edge_arc() + geom_node_point() ``` ] .pull-right[ <!-- --> ] --- # Layout .pull-left[ ``` r ggraph(graph2, layout = 'linear', circular = TRUE) + geom_edge_arc() + geom_node_point() + coord_fixed() ``` ] .pull-right[ <!-- --> ] # Node and edge attributes .pull-left[ ``` r ggraph(graph2, layout = 'linear', circular = TRUE) + geom_edge_arc(aes(color = relationship)) + geom_node_point(aes(color = sex), size = 5) + coord_fixed() ``` ] .pull-right[ <!-- --> ] --- # Add labels .pull-left[ `geom_node_text` function is used to add labels to the nodes. ``` r ggraph(graph2, layout = 'linear', circular = TRUE) + geom_edge_arc(aes(color = relationship)) + geom_node_point(aes(color = sex), size = 5) + geom_node_text(aes(label = label), repel = TRUE) + coord_fixed() ``` ] .pull-right[ <!-- --> ] --- # Case study ## 1. Correlation network ## 2. GO similarity network --- # Example 1: Correlation network <div style="display: flex; justify-content: center; align-items: center;"> <img src="https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/3_materials/images/ED_FIG7.jpg" alt="Description of the image" width="55%"> </div> > [Shen, et al. Nature Biomedical Engineering, 2024](https://www.shen-lab.org/publication/Multi-omics%20microsampling%20for%20the%20profiling%20of%20lifestyle-associated%20changes%20in%20health.pdf) --- # Example 1: Correlation network What's the network? .pull-left[ **Nodes**: - Wearable data: Step, HR, and CGM (continuous glucose monitoring) data - Omics data: Proteins, metabolites, and lipids **Edges**: - Lagged correlations - Blue: negative correlations - Red: positive correlations ] .pull-right[ <div style="display: flex; justify-content: center; align-items: center;"> <img src="https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/3_materials/images/ED_FIG7.jpg" alt="Description of the image" width="100%"> </div> ] --- # Read the data ``` r # node data if (!require(readr)) { install.packages("readr") } node_data = readr::read_csv( "https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/2_demo_data/example_node_data.csv" ) ``` --- # Read the data ``` r #edge data edge_data = readr::read_csv( "https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/2_demo_data/example_edge_data.csv" ) ``` --- # Node data .pull-left[ ``` r dim(node_data) ``` ``` ## [1] 319 4 ``` ``` r head(node_data) ``` ``` ## # A tibble: 6 × 4 ## node class mol_name class2 ## <chr> <chr> <chr> <chr> ## 1 CGM cgm CGM wearable ## 2 cortisol_1 cortisol Cortisol internal-omics ## 3 cytokine_34 cytokine IP10 internal-omics ## 4 cytokine_40 cytokine TNFB internal-omics ## 5 cytokine_20 cytokine IL15 internal-omics ## 6 cytokine_4 cytokine TGFA internal-omics ``` ] .pull-right[ 319 nodes - node: The ID of the node, character - class: The class of the node, character - mol_name: The name of the molecule, character - Other: all the other attributes of the edge ] --- # Edge data .pull-left[ ``` r dim(edge_data) ``` ``` ## [1] 337 11 ``` ``` r head(edge_data) ``` ``` ## # A tibble: 6 × 11 ## from to global_cor lagged_cor ## <chr> <chr> <dbl> <dbl> ## 1 CGM cytokine_… 0.274 0.292 ## 2 CGM cytokine_… 0.308 0.308 ## 3 CGM cytokine_… 0.327 0.327 ## 4 CGM cytokine_4 0.331 0.343 ## 5 CGM cytokine_6 0.343 0.355 ## 6 CGM lipid_113 -0.118 -0.324 ## # ℹ 7 more variables: shift_time <dbl>, ## # global_cor_p <dbl>, ## # lagged_cor_p <dbl>, ## # global_cor_p_adjust <dbl>, ## # lagged_cor_p_adjust <dbl>, ## # class1 <chr>, class <chr> ``` ] .pull-right[ 337 edges - from: The ID of the source node, character - to: The ID of the target node, character - Other: all the other attributes of the edge ] --- # Create a graph ``` r library(igraph) library(tidyverse) library(ggraph) library(tidygraph) graph_data = tidygraph::tbl_graph(nodes = node_data, edges = edge_data, directed = FALSE) %>% dplyr::mutate(Degree = centrality_degree(mode = 'all')) ``` `centrality_degree` is from the package `tidygraph` to calcualte the degree of the nodes. ``` r ?centrality_degree ``` --- # Create a graph ``` r graph_data ``` ``` ## # A tbl_graph: 319 nodes and 337 edges ## # ## # An undirected simple graph with 1 component ## # ## # Node Data: 319 × 5 (active) ## node class mol_name class2 Degree ## <chr> <chr> <chr> <chr> <dbl> ## 1 CGM cgm CGM wearable 20 ## 2 cortisol_1 cortisol Cortisol internal-omics 2 ## 3 cytokine_34 cytokine IP10 internal-omics 1 ## 4 cytokine_40 cytokine TNFB internal-omics 1 ## 5 cytokine_20 cytokine IL15 internal-omics 1 ## 6 cytokine_4 cytokine TGFA internal-omics 2 ## 7 cytokine_6 cytokine FLT3L internal-omics 1 ## 8 cytokine_12 cytokine IL10 internal-omics 2 ## 9 cytokine_17 cytokine PDGFAA internal-omics 1 ## 10 cytokine_3 cytokine EOTAXIN internal-omics 1 ## # ℹ 309 more rows ## # ## # Edge Data: 337 × 11 ## from to global_cor lagged_cor shift_time global_cor_p lagged_cor_p ## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 3 0.274 0.292 -40 0.0107 0.00607 ## 2 1 4 0.308 0.308 0 0.00393 0.00393 ## 3 1 5 0.327 0.327 0 0.00211 0.00211 ## # ℹ 334 more rows ## # ℹ 4 more variables: global_cor_p_adjust <dbl>, lagged_cor_p_adjust <dbl>, ## # class1 <chr>, class <chr> ``` --- # Visualize the graph .pull-left[ `coord_fixed` (in `ggplot2` package) is used to fix the aspect ratio of the plot. ``` r ggraph(graph_data, layout = 'kk') + geom_edge_link() + geom_node_point( ) + coord_fixed() ``` ] .pull-right[ <!-- --> ] --- # Another layout .pull-left[ ``` r ggraph(graph_data, layout = 'fr') + geom_edge_link() + geom_node_point( ) + coord_fixed() ``` ] .pull-right[ <!-- --> ] --- # Modification: more information in the network .pull-left[ ``` r plot = ggraph(graph_data, layout = 'kk') + geom_edge_link(aes( color = lagged_cor, width = -log(lagged_cor_p_adjust, 10) ), alpha = 0.5, show.legend = TRUE) + geom_node_point( aes(fill = class, size = Degree), shape = 21, alpha = 0.5, show.legend = TRUE ) + coord_fixed() plot ``` ] .pull-right[ <!-- --> ] --- # Modification: more information in the network .pull-left[ ``` r plot = plot + shadowtext::geom_shadowtext(aes(x = x, y = y, label = ifelse(class2 == "wearable", mol_name, NA), color = class), bg.color = "white", size = 5, show.legend = FALSE ) plot ``` ] .pull-right[ <!-- --> ] --- # Modification: more information in the network Set colors for nodes ``` r class_color = c( "lipidomics" = ggsci::pal_aaas()(10)[1], "metabolomics" = ggsci::pal_aaas()(10)[3], "cytokine" = ggsci::pal_aaas()(10)[4], "total_protein" = ggsci::pal_aaas()(10)[5], "cortisol" = ggsci::pal_aaas()(10)[6], "metabolic_panel" = ggsci::pal_aaas()(10)[7], "proteomics" = ggsci::pal_aaas()(10)[8] ) wearable_color = c( "sleep" = ggsci::pal_d3()(n = 10)[2], "cgm" = ggsci::pal_d3()(n = 10)[6], "hr" = ggsci::pal_d3()(n = 10)[7], "step" = ggsci::pal_d3()(n = 10)[9], "food" = ggsci::pal_d3()(n = 10)[10] ) ``` --- # Modification: more information in the network .pull-left[ ``` r plot = plot + scale_size_continuous(range = c(3, 10)) + scale_fill_manual(values = c(class_color, wearable_color)) + scale_color_manual(values = c(class_color, wearable_color)) + scale_edge_width_continuous(range = c(0.3, 2)) + ggraph::scale_edge_color_gradientn(colours = c(alpha("#3B4992FF", 0.7), "white", alpha("#EE0000FF", 0.7))) + ggraph::theme_graph() + theme( plot.background = element_rect(fill = "transparent", color = NA), panel.background = element_rect(fill = "transparent", color = NA), legend.position = "right", legend.background = element_rect(fill = "transparent", color = NA) ) plot ``` ] .pull-right[ <!-- --> ] --- # Change to circle layout ``` r g = graph_data library(igraph) library(ggraph) library(tidygraph) V(g)$type = bipartite_mapping(g)$type coords = create_layout(g, layout = "bipartite") %>% dplyr::select(node, class, mol_name, class2, x, y) head(coords) ``` ``` ## # A tibble: 6 × 6 ## node class mol_name class2 x y ## <chr> <chr> <chr> <chr> <dbl> <dbl> ## 1 CGM cgm CGM wearable 9.5 1 ## 2 cortisol_1 cortisol Cortisol internal-omics 175 0 ## 3 cytokine_34 cytokine IP10 internal-omics 1 0 ## 4 cytokine_40 cytokine TNFB internal-omics 6 0 ## 5 cytokine_20 cytokine IL15 internal-omics 9 0 ## 6 cytokine_4 cytokine TGFA internal-omics 56 0 ``` --- # Change to circle layout .pull-left[ ``` r coords$y[coords$class2 == "internal-omics"] = 1 coords$y[coords$class2 == "wearable"] = -0.5 coords$y[coords$class == "metabolomics"] = 1 coords$y[coords$class == "lipidomics"] = 0.5 coords$y[coords$class == "proteomics"] = 0 coords$y[coords$class == "cytokine"] = 0 coords$y[coords$class == "metabolic_panel"] = 0 coords$y[coords$class == "total_protein"] = 0 # calculate the x and y coordinates for circle layout coords = coords %>% dplyr::select(x, y) %>% dplyr::mutate( theta = x / (max(x) + 1) * 2 * pi, r = y + 1, x = r * cos(theta), y = r * sin(theta) ) ``` ] .pull-right[ ``` r head(coords) ``` ``` ## # A tibble: 6 × 4 ## x y theta r ## <dbl> <dbl> <dbl> <dbl> ## 1 0.491 0.0939 0.189 0.5 ## 2 -1.89 -0.663 3.48 2 ## 3 1.00 0.0199 0.0199 1 ## 4 0.993 0.119 0.119 1 ## 5 0.984 0.178 0.179 1 ## 6 0.442 0.897 1.11 1 ``` ] --- # Change to circle layout .pull-left[ ``` r my_graph = create_layout( graph = g, layout = "manual", x = coords$x, y = coords$y ) ``` ] .pull-right[ ``` r head(my_graph) ``` ``` ## # A tibble: 6 × 11 ## x y circular node class mol_name class2 Degree type ## <dbl> <dbl> <lgl> <chr> <chr> <chr> <chr> <dbl> <lgl> ## 1 0.491 0.0939 FALSE CGM cgm CGM wearable 20 FALSE ## 2 -1.89 -0.663 FALSE cortisol_1 cortisol Cortisol internal-o… 2 TRUE ## 3 1.00 0.0199 FALSE cytokine_34 cytokine IP10 internal-o… 1 TRUE ## 4 0.993 0.119 FALSE cytokine_40 cytokine TNFB internal-o… 1 TRUE ## 5 0.984 0.178 FALSE cytokine_20 cytokine IL15 internal-o… 1 TRUE ## 6 0.442 0.897 FALSE cytokine_4 cytokine TGFA internal-o… 2 TRUE ## # ℹ 2 more variables: .ggraph.orig_index <int>, .ggraph.index <int> ``` ] --- # Visualize the circle layout .pull-left[ ``` r plot = ggraph(my_graph, layout = 'bipartite') + geom_edge_diagonal(aes( color = lagged_cor, width = -log(lagged_cor_p_adjust, 10) ), alpha = 0.5, show.legend = TRUE) + geom_node_point( aes(fill = class, size = Degree), shape = 21, alpha = 0.5, show.legend = TRUE ) ``` ] .pull-right[ ``` r plot ``` <!-- --> ] --- # Visualize the circle layout .pull-left[ ``` r plot = plot + shadowtext::geom_shadowtext( aes(x, y, label = ifelse(class2 == "wearable", mol_name, NA), color = class), bg.color = "white", size = 5, show.legend = FALSE) + shadowtext::geom_shadowtext( aes(x, y, label = ifelse(class2 == "wearable", NA, mol_name), color = class, nudge_y = ifelse(class2 == "wearable", "inward", 'outward'), angle = -((-node_angle(x, y) + 90) %% 180) + 90), bg.color = "white", size = 2, show.legend = FALSE, check_overlap = TRUE) plot ``` ] .pull-right[ <!-- --> ] --- # Visualize the circle layout ``` r plot = plot + scale_size_continuous(range = c(3, 10)) + scale_fill_manual(values = c(class_color, wearable_color)) + scale_color_manual(values = c(class_color, wearable_color)) + scale_edge_width_continuous(range = c(0.3, 2)) + ggraph::scale_edge_color_gradientn(colours = c(alpha("#3B4992FF", 0.7), "white", alpha("#EE0000FF", 0.7))) + ggraph::theme_graph() + theme( plot.background = element_rect(fill = "transparent", color = NA), panel.background = element_rect(fill = "transparent", color = NA), legend.position = "bottom", legend.background = element_rect(fill = "transparent", color = NA) ) ``` --- # Visualize the circle layout .pull-left[ ``` r plot ``` ] .pull-right[ <!-- --> ] --- # Save the plot We need to install `extrafont` to save the plot. ``` r if (!require(extrafont)) { install.packages("extrafont") } ``` ``` r extrafont::loadfonts() ggsave(plot, filename = "network.pdf", width = 9.5, height = 10) ``` --- # Example 2: GO similarity network What's the network? .pull-left[ **Nodes**: - Enriched GO terms **Edges**: - Similarity ] .pull-right[ <div style="display: flex; justify-content: center; align-items: center;"> <img src="https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/3_materials/images/similarity_network_plot.jpg" alt="Description of the image" width="100%"> </div> ] --- # Load libraries ``` r library(readr) library(ggraph) library(tidygraph) library(igraph) library(tidyverse) library(simplifyEnrichment) library(tibble) ``` --- # Read the data ``` r pathway_result = readr::read_csv( "https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/2_demo_data/pathway_result.csv" ) head(pathway_result) ``` ``` ## # A tibble: 6 × 10 ## ONTOLOGY ID Description GeneRatio BgRatio pvalue p.adjust qvalue geneID ## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> ## 1 MF GO:0… transmembr… 14/114 86/206… 3.39e-17 7.46e-15 4.57e-15 ENSG0… ## 2 MF GO:0… growth fac… 16/114 143/20… 8.29e-17 9.12e-15 5.59e-15 ENSG0… ## 3 MF GO:0… transmembr… 12/114 69/206… 2.92e-15 2.14e-13 1.31e-13 ENSG0… ## 4 BP GO:0… wound heal… 23/112 495/20… 1.85e-15 4.95e-12 3.27e-12 ENSG0… ## 5 MF GO:0… protein ty… 12/114 151/20… 4.27e-11 2.35e- 9 1.44e- 9 ENSG0… ## 6 MF GO:0… cytokine b… 11/114 144/20… 4.39e-10 1.93e- 8 1.18e- 8 ENSG0… ## # ℹ 1 more variable: Count <dbl> ``` --- # Compute GO term similarity matrices: BP (Biological Process) ``` r bp_sim_matrix = simplifyEnrichment::GO_similarity(go_id = pathway_result$ID[pathway_result$ONTOLOGY == "BP"], ont = "BP", measure = "Wang") %>% as.data.frame() %>% tibble::rownames_to_column(var = "name1") %>% tidyr::pivot_longer(cols = -name1, names_to = "name2", values_to = "sim") %>% dplyr::filter(name1 != name2) %>% dplyr::filter(sim > 0.5) head(bp_sim_matrix) ``` ``` ## # A tibble: 6 × 3 ## name1 name2 sim ## <chr> <chr> <dbl> ## 1 GO:0042060 GO:0061041 0.616 ## 2 GO:0042060 GO:0090303 0.504 ## 3 GO:0042060 GO:0007596 0.552 ## 4 GO:1903034 GO:1903036 0.801 ## 5 GO:1903034 GO:0061041 0.820 ## 6 GO:1903034 GO:0090303 0.657 ``` --- # Compute GO term similarity matrices: BP (Biological Process) ``` r # Create a unique identifier for each GO term pair # This helps remove duplicate pairs (A-B and B-A are the same relationship) name = apply(bp_sim_matrix, 1, function(x) { paste(sort(x[1:2]), collapse = "_") }) # Remove duplicate GO term pairs while keeping the data structure bp_sim_matrix = bp_sim_matrix %>% dplyr::mutate(name = name) %>% dplyr::arrange(name) %>% dplyr::distinct(name, .keep_all = TRUE) %>% dplyr::select(-name) ``` --- # Compute GO term similarity matrices: MF (Molecular Function) ``` r mf_sim_matrix = simplifyEnrichment::GO_similarity(go_id = pathway_result$ID[pathway_result$ONTOLOGY == "MF"], ont = "MF", measure = "Wang") %>% as.data.frame() %>% tibble::rownames_to_column(var = "name1") %>% tidyr::pivot_longer(cols = -name1, names_to = "name2", values_to = "sim") %>% dplyr::filter(name1 != name2) %>% dplyr::filter(sim > 0.5) head(mf_sim_matrix) ``` ``` ## # A tibble: 6 × 3 ## name1 name2 sim ## <chr> <chr> <dbl> ## 1 GO:0019199 GO:0004714 0.854 ## 2 GO:0019199 GO:0004713 0.654 ## 3 GO:0019838 GO:0019955 0.605 ## 4 GO:0004714 GO:0019199 0.854 ## 5 GO:0004714 GO:0004713 0.727 ## 6 GO:0004713 GO:0019199 0.654 ``` --- # Compute GO term similarity matrices: MF (Molecular Function) ``` r name = apply(mf_sim_matrix, 1, function(x) { paste(sort(x[1:2]), collapse = "_") }) mf_sim_matrix = mf_sim_matrix %>% dplyr::mutate(name = name) %>% dplyr::arrange(name) %>% dplyr::distinct(name, .keep_all = TRUE) %>% dplyr::select(-name) ``` --- # Compute GO term similarity matrices: CC (Cellular Component) ``` r cc_sim_matrix = simplifyEnrichment::GO_similarity(go_id = pathway_result$ID[pathway_result$ONTOLOGY == "CC"], ont = "CC", measure = "Wang") %>% as.data.frame() %>% tibble::rownames_to_column(var = "name1") %>% tidyr::pivot_longer(cols = -name1, names_to = "name2", values_to = "sim") %>% dplyr::filter(name1 != name2) %>% dplyr::filter(sim > 0.5) head(cc_sim_matrix) ``` ``` ## # A tibble: 6 × 3 ## name1 name2 sim ## <chr> <chr> <dbl> ## 1 GO:0005775 GO:0043202 0.841 ## 2 GO:0005775 GO:0005788 0.725 ## 3 GO:0005775 GO:0005796 0.725 ## 4 GO:0031983 GO:0034774 0.591 ## 5 GO:0031983 GO:0060205 0.719 ## 6 GO:0043202 GO:0005775 0.841 ``` --- # Compute GO term similarity matrices: CC (Cellular Component) ``` r name = apply(cc_sim_matrix, 1, function(x) { paste(sort(x[1:2]), collapse = "_") }) cc_sim_matrix = cc_sim_matrix %>% dplyr::mutate(name = name) %>% dplyr::arrange(name) %>% dplyr::distinct(name, .keep_all = TRUE) %>% dplyr::select(-name) ``` --- # Final similarity matrix ``` r # Combine all three ontology similarity matrices into one sim_matrix = rbind(bp_sim_matrix, mf_sim_matrix, cc_sim_matrix) %>% tibble::as_tibble() # Preview the combined similarity matrix head(sim_matrix) ``` ``` ## # A tibble: 6 × 3 ## name1 name2 sim ## <chr> <chr> <dbl> ## 1 GO:0000302 GO:0006979 0.663 ## 2 GO:0001101 GO:0009410 0.605 ## 3 GO:0001101 GO:1902074 0.605 ## 4 GO:0001503 GO:0001649 0.564 ## 5 GO:0048771 GO:0001503 0.543 ## 6 GO:0050817 GO:0001503 0.543 ``` --- # Graph data: edge data ``` r edge_data = sim_matrix %>% # Rename columns to standard network terminology dplyr::rename(from = name1, to = name2) %>% # Keep only strong relationships (similarity > 0.5) dplyr::filter(sim > 0.5) # Display sample of edge data head(edge_data) ``` ``` ## # A tibble: 6 × 3 ## from to sim ## <chr> <chr> <dbl> ## 1 GO:0000302 GO:0006979 0.663 ## 2 GO:0001101 GO:0009410 0.605 ## 3 GO:0001101 GO:1902074 0.605 ## 4 GO:0001503 GO:0001649 0.564 ## 5 GO:0048771 GO:0001503 0.543 ## 6 GO:0050817 GO:0001503 0.543 ``` --- # Graph data: node data ``` r # Create node data from pathway results # Nodes represent individual GO terms node_data = pathway_result %>% dplyr::select(ID, everything()) %>% dplyr::rename(node = ID) # Display sample of node data head(node_data) ``` ``` ## # A tibble: 6 × 10 ## node ONTOLOGY Description GeneRatio BgRatio pvalue p.adjust qvalue geneID ## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> ## 1 GO:0… MF transmembr… 14/114 86/206… 3.39e-17 7.46e-15 4.57e-15 ENSG0… ## 2 GO:0… MF growth fac… 16/114 143/20… 8.29e-17 9.12e-15 5.59e-15 ENSG0… ## 3 GO:0… MF transmembr… 12/114 69/206… 2.92e-15 2.14e-13 1.31e-13 ENSG0… ## 4 GO:0… BP wound heal… 23/112 495/20… 1.85e-15 4.95e-12 3.27e-12 ENSG0… ## 5 GO:0… MF protein ty… 12/114 151/20… 4.27e-11 2.35e- 9 1.44e- 9 ENSG0… ## 6 GO:0… MF cytokine b… 11/114 144/20… 4.39e-10 1.93e- 8 1.18e- 8 ENSG0… ## # ℹ 1 more variable: Count <dbl> ``` --- # Create a graph data structure ``` r # Create a graph data structure using tidygraph # Combines node and edge information into a single object graph_data = tidygraph::tbl_graph(nodes = node_data, edges = edge_data, directed = FALSE) %>% # Calculate node degree (number of connections) as a centrality measure dplyr::mutate(degree = tidygraph::centrality_degree()) graph_data ``` ``` ## # A tbl_graph: 331 nodes and 712 edges ## # ## # An undirected simple graph with 66 components ## # ## # Node Data: 331 × 11 (active) ## node ONTOLOGY Description GeneRatio BgRatio pvalue p.adjust qvalue ## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> ## 1 GO:0019199 MF transmembra… 14/114 86/206… 3.39e-17 7.46e-15 4.57e-15 ## 2 GO:0019838 MF growth fact… 16/114 143/20… 8.29e-17 9.12e-15 5.59e-15 ## 3 GO:0004714 MF transmembra… 12/114 69/206… 2.92e-15 2.14e-13 1.31e-13 ## 4 GO:0042060 BP wound heali… 23/112 495/20… 1.85e-15 4.95e-12 3.27e-12 ## 5 GO:0004713 MF protein tyr… 12/114 151/20… 4.27e-11 2.35e- 9 1.44e- 9 ## 6 GO:0019955 MF cytokine bi… 11/114 144/20… 4.39e-10 1.93e- 8 1.18e- 8 ## 7 GO:0005125 MF cytokine ac… 14/114 286/20… 6.13e-10 2.25e- 8 1.38e- 8 ## 8 GO:0050840 MF extracellul… 8/114 56/206… 7.53e-10 2.37e- 8 1.45e- 8 ## 9 GO:1903034 BP regulation … 13/112 183/20… 1.99e-11 2.67e- 8 1.76e- 8 ## 10 GO:0008083 MF growth fact… 11/114 162/20… 1.54e- 9 4.25e- 8 2.60e- 8 ## # ℹ 321 more rows ## # ℹ 3 more variables: geneID <chr>, Count <dbl>, degree <dbl> ## # ## # Edge Data: 712 × 3 ## from to sim ## <int> <int> <dbl> ## 1 187 316 0.663 ## 2 177 240 0.605 ## 3 177 325 0.605 ## # ℹ 709 more rows ``` --- # Community/module detection ``` r # Detect subnetworks/communities within the graph # Edge betweenness clustering algorithm identifies modular structures subnetwork = igraph::cluster_edge_betweenness(graph = graph_data, weights = abs(edge_attr(graph_data, "sim"))) # Create module labels for each detected community cluster = paste("GO_Module", as.character(igraph::membership(subnetwork)), sep = "_") head(cluster) ``` ``` ## [1] "GO_Module_1" "GO_Module_2" "GO_Module_1" "GO_Module_3" "GO_Module_1" ## [6] "GO_Module_2" ``` --- # Add module information to the graph data ``` r # Add module information to the graph data graph_data = graph_data %>% tidygraph::mutate(module = cluster) graph_data ``` ``` ## # A tbl_graph: 331 nodes and 712 edges ## # ## # An undirected simple graph with 66 components ## # ## # Node Data: 331 × 12 (active) ## node ONTOLOGY Description GeneRatio BgRatio pvalue p.adjust qvalue ## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> ## 1 GO:0019199 MF transmembra… 14/114 86/206… 3.39e-17 7.46e-15 4.57e-15 ## 2 GO:0019838 MF growth fact… 16/114 143/20… 8.29e-17 9.12e-15 5.59e-15 ## 3 GO:0004714 MF transmembra… 12/114 69/206… 2.92e-15 2.14e-13 1.31e-13 ## 4 GO:0042060 BP wound heali… 23/112 495/20… 1.85e-15 4.95e-12 3.27e-12 ## 5 GO:0004713 MF protein tyr… 12/114 151/20… 4.27e-11 2.35e- 9 1.44e- 9 ## 6 GO:0019955 MF cytokine bi… 11/114 144/20… 4.39e-10 1.93e- 8 1.18e- 8 ## 7 GO:0005125 MF cytokine ac… 14/114 286/20… 6.13e-10 2.25e- 8 1.38e- 8 ## 8 GO:0050840 MF extracellul… 8/114 56/206… 7.53e-10 2.37e- 8 1.45e- 8 ## 9 GO:1903034 BP regulation … 13/112 183/20… 1.99e-11 2.67e- 8 1.76e- 8 ## 10 GO:0008083 MF growth fact… 11/114 162/20… 1.54e- 9 4.25e- 8 2.60e- 8 ## # ℹ 321 more rows ## # ℹ 4 more variables: geneID <chr>, Count <dbl>, degree <dbl>, module <chr> ## # ## # Edge Data: 712 × 3 ## from to sim ## <int> <int> <dbl> ## 1 187 316 0.663 ## 2 177 240 0.605 ## 3 177 325 0.605 ## # ℹ 709 more rows ``` --- # Extract node attributes with module assignments ``` r result_with_module = igraph::vertex_attr(graph_data) %>% do.call(cbind, .) %>% tibble::as_tibble() %>% # Convert p.adjust to numeric for proper sorting dplyr::mutate(p.adjust = as.numeric(p.adjust)) %>% # Sort by module and significance dplyr::arrange(module, p.adjust) # Further organize results by ontology type, module, and significance result_with_module = result_with_module %>% dplyr::arrange(ONTOLOGY, module, p.adjust) ``` --- # Extract node attributes with module assignments ``` r # Calculate the number of GO terms in each module module_content_number = result_with_module %>% dplyr::count(module) %>% dplyr::rename(module_content_number = n) # Add module size information to results result_with_module = result_with_module %>% dplyr::left_join(module_content_number, by = "module") head(result_with_module) ``` ``` ## # A tibble: 6 × 13 ## node ONTOLOGY Description GeneRatio BgRatio pvalue p.adjust qvalue geneID ## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> ## 1 GO:00140… BP phosphatid… 11/112 164/20… 1.399… 4.68e-7 3.090… ENSG0… ## 2 GO:00971… BP extrinsic … 13/112 262/20… 1.701… 5.06e-7 3.339… ENSG0… ## 3 GO:00480… BP phosphatid… 11/112 194/20… 8.197… 1.46e-6 9.652… ENSG0… ## 4 GO:00703… BP regulation… 14/112 367/20… 1.133… 1.69e-6 1.111… ENSG0… ## 5 GO:00328… BP regulation… 11/112 215/20… 2.378… 2.77e-6 1.826… ENSG0… ## 6 GO:00703… BP ERK1 and E… 14/112 393/20… 2.675… 2.94e-6 1.938… ENSG0… ## # ℹ 4 more variables: Count <chr>, degree <chr>, module <chr>, ## # module_content_number <int> ``` --- # Add module size information to graph data ``` r graph_data = graph_data %>% # Activate nodes for manipulation activate(what = "nodes") %>% dplyr::left_join(module_content_number, by = "module") # Display the final graph data structure graph_data ``` ``` ## # A tbl_graph: 331 nodes and 712 edges ## # ## # An undirected simple graph with 66 components ## # ## # Node Data: 331 × 13 (active) ## node ONTOLOGY Description GeneRatio BgRatio pvalue p.adjust qvalue ## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> ## 1 GO:0019199 MF transmembra… 14/114 86/206… 3.39e-17 7.46e-15 4.57e-15 ## 2 GO:0019838 MF growth fact… 16/114 143/20… 8.29e-17 9.12e-15 5.59e-15 ## 3 GO:0004714 MF transmembra… 12/114 69/206… 2.92e-15 2.14e-13 1.31e-13 ## 4 GO:0042060 BP wound heali… 23/112 495/20… 1.85e-15 4.95e-12 3.27e-12 ## 5 GO:0004713 MF protein tyr… 12/114 151/20… 4.27e-11 2.35e- 9 1.44e- 9 ## 6 GO:0019955 MF cytokine bi… 11/114 144/20… 4.39e-10 1.93e- 8 1.18e- 8 ## 7 GO:0005125 MF cytokine ac… 14/114 286/20… 6.13e-10 2.25e- 8 1.38e- 8 ## 8 GO:0050840 MF extracellul… 8/114 56/206… 7.53e-10 2.37e- 8 1.45e- 8 ## 9 GO:1903034 BP regulation … 13/112 183/20… 1.99e-11 2.67e- 8 1.76e- 8 ## 10 GO:0008083 MF growth fact… 11/114 162/20… 1.54e- 9 4.25e- 8 2.60e- 8 ## # ℹ 321 more rows ## # ℹ 5 more variables: geneID <chr>, Count <dbl>, degree <dbl>, module <chr>, ## # module_content_number <int> ## # ## # Edge Data: 712 × 3 ## from to sim ## <int> <int> <dbl> ## 1 187 316 0.663 ## 2 177 240 0.605 ## 3 177 325 0.605 ## # ℹ 709 more rows ``` --- # Select representative labels for each module ``` r # Choose the most significant GO term with highest count in each module cluster_label_module = igraph::as_data_frame(graph_data, what = "vertices") %>% dplyr::group_by(module) %>% dplyr::filter(p.adjust == min(p.adjust) & Count == max(Count)) %>% dplyr::slice_head(n = 1) %>% pull(Description) # Get all GO term descriptions for reference cluster_label_all = igraph::as_data_frame(graph_data, what = "vertices")$Description ``` --- # Visualization of GO term clusters .pull-left[ ``` r # Create network visualization plot = graph_data %>% ggraph(layout = 'fr', circular = FALSE) + geom_edge_link( aes(width = sim), strength = 1, color = "black", alpha = 1, show.legend = FALSE ) + geom_node_point( aes(fill = module, size = -log(p.adjust, 10)), shape = 21, alpha = 1, show.legend = FALSE ) + coord_fixed() plot ``` ] .pull-right[ <!-- --> ] --- # Visualization of GO term clusters .pull-left[ ``` r plot = plot + geom_node_text(aes( x = x, y = y, label = ifelse(Description %in% cluster_label_module, Description, NA) ), size = 3, repel = TRUE) plot ``` ] .pull-right[ <!-- --> ] --- # Visualization of GO term clusters .pull-left[ ``` r plot = plot + guides(fill = guide_legend(ncol = 1)) + scale_edge_width_continuous(range = c(0.1, 2)) + scale_size_continuous(range = c(1, 7)) plot ``` ] .pull-right[ <!-- --> ] --- # Visualization of GO term clusters .pull-left[ ``` r plot = plot + ggraph::theme_graph() + theme( plot.background = element_rect(fill = "transparent", color = NA), panel.background = element_rect(fill = "transparent", color = NA), legend.position = "right", legend.background = element_rect(fill = "transparent", color = NA) ) plot ``` ] .pull-right[ <!-- --> ] --- # Save the plot ``` r extrafont::loadfonts() ggplot2::ggsave(plot, filename = "similarity_network_plot.pdf", width = 9, height = 7) ``` --- # What is a tree in biology? What's the tree? * **Taxonomic trees (classification hierarchies)** * **Phylogenetic trees (evolutionary relationships)** * Gene trees (gene family evolution) * Species trees (speciation events) * Population trees (population structure) --- # Taxonomic trees .pull-left[ **Purpose:** * Organize and categorize organisms based on shared characteristics * Represent hierarchical relationships between taxonomic groups * Visualize abundance and diversity of taxonomic groups **Structure:** * Hierarchical: Each level represents a taxonomic rank * Nested: Lower ranks are contained within higher ranks * Non-overlapping: An organism belongs to exactly one group at each rank ] .pull-right[ <div style="display: flex; justify-content: center; align-items: center;"> <img src="https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/3_materials/images/Fig5_primer.png" alt="Description of the image" width="100%"> ] --- # Tree terminology * Root (domain level, such as Bacteria) * Internal nodes (phylum, class, order, family, genus) * Leaf nodes (species or ASVs or OTUs in microbiome data) * Branches connecting taxonomic levels --- # Taxonomic hierarchies in microbiology .pull-left[ **Standard taxonomic ranks:** * Domain (Bacteria, Archaea) * Phylum * Class * Order * Family * Genus * Species ] .pull-right[ **Microbiome-specific concepts:** * OTUs (Operational Taxonomic Units) * ASVs (Amplicon Sequence Variants) * MAGs (Metagenome-Assembled Genomes) ] --- # Microbiome data structure **Data formats:** * OTU/ASV tables with taxonomic annotations * Taxonomic abundance tables at different levels --- # Taxonomic tree representation .pull-left[ **Key attributes in microbiome taxonomic trees:** * Hierarchical classification * Relative abundance (often mapped to node/branch size) * Diversity metrics * Sample metadata ] .pull-right[ **Common visualization types:** * Traditional hierarchical trees * Circular/radial dendrograms * Treemaps * Sunburst plots * Heat trees ] --- # Create tree data from phyloseq object .pull-left[ `phyloseq` is a popular package for microbiome data analysis in R. ``` r library(phyloseq) data(GlobalPatterns) GP <- prune_taxa(taxa_sums(GlobalPatterns) > 0, GlobalPatterns) GP.chl <- subset_taxa(GP, Phylum=="Chlamydiae") plot_tree(GP.chl, color="SampleType", shape="Family", label.tips="Genus", size="Abundance") + ggtitle("tree annotation using phyloseq") ``` ] .pull-right[ <!-- --> ] --- # Create tree data from phyloseq object .pull-left[ ``` r library(scales) library(ggtree) p <- ggtree(GP.chl, ladderize = FALSE) + geom_text2(aes(subset = !isTip, label = label), hjust = -.2, size = 4) + geom_tiplab(aes(label = Genus), hjust = -.3) + geom_point(aes( x = x + hjust, color = SampleType, shape = Family, size = Abundance ), na.rm = TRUE) + scale_size_continuous(trans = log_trans(5)) + theme(legend.position = "right") + ggtitle("reproduce phyloseq by ggtree") print(p) ``` ] .pull-right[ <!-- --> ] --- # Example: Microbiome data .pull-left[ The taxonomic tree of microbiome data. ] .pull-right[ <div style="display: flex; justify-content: center; align-items: center;"> <img src="https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/3_materials/images/cell_host_microbe.png" alt="Description of the image" width="80%"> </div> > Zhou et al. Cell Host & Microbe 2024 ] --- # Load libraries ``` r if (!require("microbiomeViz", quietly = TRUE)) { remotes::install_github("lch14forever/microbiomeViz") } if (!require("phyloseq", quietly = TRUE)) { BiocManager::install("phyloseq") } library(phyloseq) library(microbiomeViz) library(tidyverse) library(ggtreeExtra) library(ggtree) ``` --- # Read the data ``` r personalized_score = readr::read_csv("https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/refs/heads/main/2_demo_data/personalized_score.csv") head(personalized_score) ``` ``` ## # A tibble: 6 × 22 ## genus class fc1_p fc2_p fc3_p between_mean1 within_mean1 ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Acetivibrio Stool 6.73e- 2 NA NA 0.754 0.649 ## 2 Acidaminococcus Stool 3.73e- 60 0.411 6.69e- 30 0.813 0.510 ## 3 Acutalibacter Stool 1.90e- 65 0.552 5.07e- 46 0.941 0.665 ## 4 Adlercreutzia Stool 1.41e-166 1.00 1.28e-189 0.947 0.702 ## 5 Aestuariispira Stool 5.08e- 10 NA NA 0.727 0.626 ## 6 Agathobacter Stool 0 1 0 0.763 0.575 ## # ℹ 15 more variables: family_mean1 <dbl>, total_number_between <dbl>, ## # total_number_family <dbl>, total_number_within <dbl>, ## # number_1_between <dbl>, number_1_family <dbl>, number_1_within <dbl>, ## # freq_between <dbl>, freq_family <dbl>, freq_within <dbl>, ## # fc1_p_adjust <dbl>, fc2_p_adjust <dbl>, fc3_p_adjust <dbl>, fc1 <dbl>, ## # fc2 <dbl> ``` --- # Read the data ``` r load(url("https://github.com/jaspershen-lab/MD7115-NTU/raw/main/2_demo_data/microbiome_data.RData")) microbiome_data ``` ``` ## phyloseq-class experiment-level object ## otu_table() OTU Table: [ 411 taxa and 2 samples ] ## sample_data() Sample Data: [ 2 samples by 1 sample variables ] ## tax_table() Taxonomy Table: [ 411 taxa by 7 taxonomic ranks ] ``` --- # Create a phylogenetic tree ``` r tree = microbiomeViz::parsePhyloseq( physeq = microbiome_data, use_abundance = FALSE, node.size.scale = 0, node.size.offset = 0 ) tree ``` ``` ## 'treedata' S4 object'. ## ## ...@ phylo: ## ## Phylogenetic tree with 411 tips and 311 internal nodes. ## ## Tip labels: ## g__Unclassified_Gp16, g__Unclassified_Gp3, g__Unclassified_Gp6, ## g__Blastocatella, g__Stenotrophobacter, g__Iamia, ... ## Node labels: ## r__Root, k__Bacteria, p__Acidobacteria, p__Actinobacteria, ## p__Armatimonadetes, p__Bacteroidetes, ... ## ## Rooted; includes branch lengths. ## ## with the following features available: ## 'taxaAbun', 'nodeSize', 'nodeClass'. ## ## # The associated data tibble abstraction: 722 × 6 ## # The 'node', 'label' and 'isTip' are from the phylo tree. ## node label isTip taxaAbun nodeSize nodeClass ## <int> <chr> <lgl> <dbl> <dbl> <fct> ## 1 1 g__Unclassified_Gp16 TRUE 5 0 g ## 2 2 g__Unclassified_Gp3 TRUE 5 0 g ## 3 3 g__Unclassified_Gp6 TRUE 5 0 g ## 4 4 g__Blastocatella TRUE 5 0 g ## 5 5 g__Stenotrophobacter TRUE 5 0 g ## 6 6 g__Iamia TRUE 5 0 g ## 7 7 g__Unclassified_Iamiaceae TRUE 5 0 g ## 8 8 g__Unclassified_Acidimicrobiales TRUE 5 0 g ## 9 9 g__Actinomyces TRUE 5 0 g ## 10 10 g__Actinotignum TRUE 5 0 g ## # ℹ 712 more rows ``` --- # Visualize the tree .pull-left[ ``` r raw_plot = tree.backbone( tree = tree, size = 0.3, shape = 16, layout = "circular", fill = "black", color = "black" ) raw_plot ``` ] .pull-right[ <!-- --> ] --- # Add nodes ``` r raw_plot$data$nodeClass2 = as.character(raw_plot$data$nodeClass) raw_plot$data$nodeClass2[is.na(raw_plot$data$nodeClass2)] = "Root" raw_plot$data$nodeClass2[raw_plot$data$nodeClass2 == "f"] = "Family" raw_plot$data$nodeClass2[raw_plot$data$nodeClass2 == "c"] = "Class" raw_plot$data$nodeClass2[raw_plot$data$nodeClass2 == "o"] = "Order" raw_plot$data$nodeClass2[raw_plot$data$nodeClass2 == "p"] = "Phylum" raw_plot$data$nodeClass2[raw_plot$data$nodeClass2 == "g"] = "Genus" raw_plot$data$nodeClass2[raw_plot$data$nodeClass2 == "k"] = "Kingdom" raw_plot$data$nodeSize2 = raw_plot$data$nodeSize raw_plot$data$nodeSize2[raw_plot$data$nodeClass2 == "Root"] = 3.5 raw_plot$data$nodeSize2[raw_plot$data$nodeClass2 == "Kingdom"] = 3 raw_plot$data$nodeSize2[raw_plot$data$nodeClass2 == "Phylum"] = 2.5 raw_plot$data$nodeSize2[raw_plot$data$nodeClass2 == "Class"] = 2 raw_plot$data$nodeSize2[raw_plot$data$nodeClass2 == "Order"] = 1.5 raw_plot$data$nodeSize2[raw_plot$data$nodeClass2 == "Family"] = 1 raw_plot$data$nodeSize2[raw_plot$data$nodeClass2 == "Genus"] = 0.5 ``` --- # Add nodes .pull-left[ ``` r # #####add node point raw_plot = raw_plot + ggtree::geom_point2(aes(color = nodeClass2, size = nodeSize2), show.legend = FALSE) + ggnewscale::new_scale(new_aes = "fill") raw_plot ``` ] .pull-right[ <!-- --> ] --- # Prepare information for heatmap ``` r ###label 2 is the name of taxa raw_plot$data$label2 = raw_plot$data$label %>% stringr::str_split("__") %>% purrr::map(function(x) { x[2] }) %>% unlist() raw_plot$data$label2[as.character(raw_plot$data$nodeClass) != "g"] = NA ``` --- # Prepare information for heatmap ``` r #####add new information personalized_score_fc1 = personalized_score %>% dplyr::select(genus, class, fc1) %>% tidyr::pivot_wider(names_from = class, values_from = "fc1") colnames(personalized_score_fc1)[-1] = paste(colnames(personalized_score_fc1)[-1], "fc1", sep = "_") personalized_score_fc2 = personalized_score %>% dplyr::select(genus, class, fc2) %>% tidyr::pivot_wider(names_from = class, values_from = "fc2") colnames(personalized_score_fc2)[-1] = paste(colnames(personalized_score_fc2)[-1], "fc2", sep = "_") personalized_score_fc1_p_adjust = personalized_score %>% dplyr::select(genus, class, fc1_p_adjust) %>% tidyr::pivot_wider(names_from = class, values_from = "fc1_p_adjust") colnames(personalized_score_fc1_p_adjust)[-1] = paste(colnames(personalized_score_fc1_p_adjust)[-1], "fc1_p_adjust", sep = "_") variable_info = tax_table(microbiome_data) ``` --- # Prepare information for heatmap ``` r new_info = data.frame(Genus = raw_plot$data$label2) %>% dplyr::left_join(as.data.frame(variable_info)[, c("Genus", "Kingdom", "Phylum", "Class", "Order", "Family")], by = "Genus") %>% dplyr::left_join(personalized_score_fc1, by = c("Genus" = "genus")) %>% dplyr::left_join(personalized_score_fc1_p_adjust, by = c("Genus" = "genus")) %>% dplyr::left_join(personalized_score_fc2, by = c("Genus" = "genus")) %>% dplyr::mutate(Stool = case_when(!is.na(Stool_fc1) ~ "Stool", TRUE ~ "no")) %>% dplyr::mutate(Skin = case_when(!is.na(Skin_fc1) ~ "Skin", TRUE ~ "no")) %>% dplyr::mutate(Oral = case_when(!is.na(Oral_fc1) ~ "Oral", TRUE ~ "no")) %>% dplyr::mutate(Nasal = case_when(!is.na(Nasal_fc1) ~ "Nasal", TRUE ~ "no")) ``` --- # Prepare information for heatmap ``` r new_info = new_info %>% dplyr::mutate(Stool_fc1_star = case_when( !is.na(Stool_fc1_p_adjust) & Stool_fc1_p_adjust < 0.05 ~ "*", TRUE ~ "" )) %>% dplyr::mutate(Skin_fc1_star = case_when( !is.na(Skin_fc1_p_adjust) & Skin_fc1_p_adjust < 0.05 ~ "*", TRUE ~ "" )) %>% dplyr::mutate(Oral_fc1_star = case_when( !is.na(Oral_fc1_p_adjust) & Oral_fc1_p_adjust < 0.05 ~ "*", TRUE ~ "" )) %>% dplyr::mutate(Nasal_fc1_star = case_when( !is.na(Nasal_fc1_p_adjust) & Nasal_fc1_p_adjust < 0.05 ~ "*", TRUE ~ "" )) raw_plot$data = cbind(raw_plot$data, new_info) ``` --- # Prepare information for heatmap ``` r head(raw_plot$data ) ``` ``` ## parent node branch.length label taxaAbun nodeSize nodeClass ## 1 563 1 1 g__Unclassified_Gp16 5 0 g ## 2 564 2 1 g__Unclassified_Gp3 5 0 g ## 3 565 3 1 g__Unclassified_Gp6 5 0 g ## 4 566 4 1 g__Blastocatella 5 0 g ## 5 566 5 1 g__Stenotrophobacter 5 0 g ## 6 567 6 1 g__Iamia 5 0 g ## isTip x y branch angle nodeClass2 nodeSize2 label2 ## 1 TRUE 6 26 5.5 22.77372 Genus 0.5 Unclassified_Gp16 ## 2 TRUE 6 27 5.5 23.64964 Genus 0.5 Unclassified_Gp3 ## 3 TRUE 6 28 5.5 24.52555 Genus 0.5 Unclassified_Gp6 ## 4 TRUE 6 29 5.5 25.40146 Genus 0.5 Blastocatella ## 5 TRUE 6 30 5.5 26.27737 Genus 0.5 Stenotrophobacter ## 6 TRUE 6 91 5.5 79.70803 Genus 0.5 Iamia ## Genus Kingdom Phylum Class Order ## 1 Unclassified_Gp16 Bacteria Acidobacteria Acidobacteria_Gp16 Gp16 ## 2 Unclassified_Gp3 Bacteria Acidobacteria Acidobacteria_Gp3 Gp3 ## 3 Unclassified_Gp6 Bacteria Acidobacteria Acidobacteria_Gp6 Gp6 ## 4 Blastocatella Bacteria Acidobacteria Blastocatellia Blastocatellales ## 5 Stenotrophobacter Bacteria Acidobacteria Blastocatellia Blastocatellales ## 6 Iamia Bacteria Actinobacteria Acidimicrobiia Acidimicrobiales ## Family Stool_fc1 Skin_fc1 Oral_fc1 Nasal_fc1 ## 1 Unclassified_Gp16 NA 0.2514517 NA NA ## 2 Unclassified_Gp3 NA NA NA -0.0029484029 ## 3 Unclassified_Gp6 NA NA NA -0.0003871467 ## 4 Blastocatellaceae NA 0.2162597 NA NA ## 5 Blastocatellaceae NA 0.1879493 NA NA ## 6 Iamiaceae NA 0.2225029 NA NA ## Stool_fc1_p_adjust Skin_fc1_p_adjust Oral_fc1_p_adjust Nasal_fc1_p_adjust ## 1 NA 1.271142e-05 NA NA ## 2 NA NA NA 1 ## 3 NA NA NA 1 ## 4 NA 7.330179e-10 NA NA ## 5 NA 3.469492e-43 NA NA ## 6 NA 1.615064e-02 NA NA ## Stool_fc2 Skin_fc2 Oral_fc2 Nasal_fc2 Stool Skin Oral Nasal Stool_fc1_star ## 1 NA NA NA NA no Skin no no ## 2 NA NA NA NA no no no Nasal ## 3 NA NA NA NA no no no Nasal ## 4 NA NA NA NA no Skin no no ## 5 NA NA NA NA no Skin no no ## 6 NA NA NA NA no Skin no no ## Skin_fc1_star Oral_fc1_star Nasal_fc1_star ## 1 * ## 2 ## 3 ## 4 * ## 5 * ## 6 * ``` --- # Highlight .pull-left[ ``` r ######add highlight hight_data = raw_plot$data %>% dplyr::filter(stringr::str_detect(label, "p__")) %>% dplyr::select(node, label) %>% dplyr::mutate(label = stringr::str_replace_all(label, "p__", "")) %>% dplyr::rename(id = node, type = label) plot1 = raw_plot + ggtree::geom_hilight( data = hight_data, mapping = aes(node = id, fill = type), alpha = .4 ) plot1 ``` ] .pull-right[ <!-- --> ] --- # Add heatmap: data preparation ``` r ##add fc1 information plot1$data$Stool_fc1[which(plot1$data$Stool_fc1 < 0)] = 0 plot1$data$Skin_fc1[which(plot1$data$Skin_fc1 < 0)] = 0 plot1$data$Oral_fc1[which(plot1$data$Oral_fc1 < 0)] = 0 plot1$data$Nasal_fc1[which(plot1$data$Nasal_fc1 < 0)] = 0 ``` --- # Add heatmap: data preparation ``` r ##add multiple tip information stool_fc1_info = plot1$data[, c("Stool_fc1", "isTip")] rownames(stool_fc1_info) = plot1$data$label stool_fc1_info = stool_fc1_info %>% dplyr::filter(isTip) %>% dplyr::select(-isTip) skin_fc1_info = plot1$data[, c("Skin_fc1", "isTip")] rownames(skin_fc1_info) = plot1$data$label skin_fc1_info = skin_fc1_info %>% dplyr::filter(isTip) %>% dplyr::select(-isTip) oral_fc1_info = plot1$data[, c("Oral_fc1", "isTip")] rownames(oral_fc1_info) = plot1$data$label oral_fc1_info = oral_fc1_info %>% dplyr::filter(isTip) %>% dplyr::select(-isTip) nasal_fc1_info = plot1$data[, c("Nasal_fc1", "isTip")] rownames(nasal_fc1_info) = plot1$data$label nasal_fc1_info = nasal_fc1_info %>% dplyr::filter(isTip) %>% dplyr::select(-isTip) ``` --- # Add heatmap ``` r plot1 = plot1 + ggnewscale::new_scale_fill() body_site_color = c( "Stool" = ggsci::pal_jama()(n = 7)[2], "Skin" = ggsci::pal_jama()(n = 7)[3], "Oral" = ggsci::pal_jama()(n = 7)[4], "Nasal" = ggsci::pal_jama()(n = 7)[5] ) ``` --- # Add heatmap .pull-left[ ``` r plot1 = plot1 + ggnewscale::new_scale_fill() plot2 = gheatmap(p = plot1, data = stool_fc1_info, offset = -0.1, width = .08, colnames_angle = 95, colnames_offset_y = .25, colnames = FALSE, color = alpha(body_site_color["Stool"], 1), legend_title = "Index1") + scale_fill_gradientn( colours = c(RColorBrewer::brewer.pal(n = 11, name = "BrBG"))[-c(5:7)], na.value = "white", limits = c(0, 0.8) ) + ggnewscale::new_scale(new_aes = "fill") + geom_tippoint( mapping = aes(shape = Stool_fc1_star), x = 6.35, size = 2, show.legend = FALSE) + scale_shape_manual(values = c("*" = "*")) + ggnewscale::new_scale(new_aes = "shape") plot2 ``` ] .pull-right[ <!-- --> ] --- # Add heatmap .pull-left[ ``` r plot3 = gheatmap(p = plot2, data = skin_fc1_info, offset = 0.5, width = .08, colnames_angle = 95, colnames_offset_y = .25, colnames = FALSE, color = alpha(body_site_color["Skin"], 1), legend_title = "Index1") + scale_fill_gradientn( colours = c(RColorBrewer::brewer.pal(n = 11, name = "BrBG"))[-c(5:7)], na.value = "white", limits = c(0, 0.8)) + ggnewscale::new_scale(new_aes = "fill") + geom_tippoint( mapping = aes(shape = Skin_fc1_star), size = 2, show.legend = FALSE) + scale_shape_manual(values = c("*" = "*")) + ggnewscale::new_scale(new_aes = "shape") plot3 ``` ] .pull-right[ <!-- --> ] --- # Add heatmap .pull-left[ ``` r plot4 = gheatmap( p = plot3, data = oral_fc1_info, offset = 1.1,width = .08,colnames_angle = 95,colnames_offset_y = .25, colnames = FALSE,color = alpha(body_site_color["Oral"], 1), legend_title = "Index1") + scale_fill_gradientn( colours = c(RColorBrewer::brewer.pal(n = 11, name = "BrBG"))[-c(5:7)], na.value = "white",limits = c(0, 0.8)) + ggnewscale::new_scale(new_aes = "fill") + geom_tippoint( mapping = aes(shape = Oral_fc1_star),x = 7.55,size = 2,show.legend = FALSE) + scale_shape_manual(values = c("*" = "*")) + ggnewscale::new_scale(new_aes = "shape") plot4 ``` ] .pull-right[ <!-- --> ] --- # Add heatmap .pull-left[ ``` r plot5 = gheatmap(p = plot4, data = nasal_fc1_info,offset = 1.7,width = .08, colnames_angle = 95,colnames_offset_y = .25,colnames = FALSE, color = alpha(body_site_color["Nasal"], 1),legend_title = "Index1") + scale_fill_gradientn( colours = c(RColorBrewer::brewer.pal(n = 11, name = "BrBG"))[-c(5:7)], na.value = "white",limits = c(0, 0.8)) + ggnewscale::new_scale(new_aes = "fill") + geom_tippoint( mapping = aes(shape = Nasal_fc1_star),x = 8.15,size = 2,show.legend = FALSE) + scale_shape_manual(values = c("*" = "*")) + ggnewscale::new_scale(new_aes = "shape") plot5 ``` ] .pull-right[ <!-- --> ] --- # Add node labels .pull-left[ ``` r plot6 = plot5 + ggnewscale::new_scale(new_aes = "color") + geom_tiplab( aes(label = label2, color = Phylum), offset = 2.5, size = 2, show.legend = FALSE ) plot6 ``` ] .pull-right[ <!-- --> ] --- # Save the plot ``` r ggplot2::ggsave(plot6, filename = "microbiome_tree_heatmap.pdf", width = 9, height = 7) ``` --- # More references * [Morden Statistics for Morden Biology](https://web.stanford.edu/class/bios221/book/) * [Introduction to R for Data Science](https://bookdown.org/jdholster1/idsr/) * [Network Analysis and Visualization with R and igraph](https://kateto.net/netscix2016.html) * [ggtree: visualization and annotation of phylogenetic trees](https://guangchuangyu.github.io/software/ggtree/) * [An Intro to Phylogenetic Tree Construction in R](https://fuzzyatelin.github.io/bioanth-stats/module-24/module-24.html) * [Data Integration, Manipulation and Visualization of Phylogenetic Trees](https://yulab-smu.top/treedata-book/index.html)