MD7115: R for Graph, Network and Trees

class: center, middle, title-slide

.title[
# MD7115: R for Graph, Network and Trees
]
.subtitle[
## Applications in Biomedical Research
]
.author[
### <br><br>Xiaotao Shen <br> Assistant Professor <br> <a href="http://shen-lab.org"> <i class="fa fa-home fa-fw"></i>  shen-lab.org</a><br> <a href="mailto:xiaotao.shen@ntu.edu.sg"><i class="fa fa-paper-plane fa-fw"></i>  xiaotao.shen@ntu.edu.sg</a><br>
]
.institute[
### Lee Kong Chian School of Medicine, NTU Singapore
]
.date[
### 2025-03-30
]

---

layout: true
background-image: url(data:image/png;base64,#images/lkc_logo.png)
background-position: 100% 0%
background-size: 20%
  
<div class="my-footer"><span>Xiaotao Shen/www.shen-lab.org</span></div>

---

# About the class

-   **Instructor**: Xiaotao Shen (https://www.shen-lab.org/)

-   **Email**: xiaotao.shen@ntu.edu.stg

-   **WhatsApp**: +1 5712679283

-   **Address**: Level 3, EMB, NTU

-   **Github Repo**: https://github.com/jaspershen-lab/MD7115-NTU

---

# Introduction

In this course, we will learn how to use R to analyze and visualize data in the form of graphs, networks and trees. We will cover the following topics.

1. What is a graph, network and tree?

2. Their applications in biomedical research

3. How to use R to analyze and visualize them

4. Examples in publications

---

# Install RStudio

**Ref:**

1. https://rstudio-education.github.io/hopr/starting.html

2. https://posit.co/download/rstudio-desktop/

---

# RStudio project: step 1

Link: https://github.com/jaspershen-lab/MD7115-NTU

---

# RStudio project: step 2

---

# RStudio project: step 3

---

# RStudio project: step 4

---

# RStudio project: step 5

---

# RStudio project: step 6

---

# Preparation: general packages 1

We need to install some general R packages before the class.

Some packages are in Bioconductor and GitHub, so we need to install `BiocManager` and `remotes` first.

``` r
# BiocManager is a package manager for Bioconductor
if (!require("BiocManager", quietly = TRUE)){
install.packages("BiocManager")
}
# remotes is a package manager for GitHub
if (!require("remotes", quietly = TRUE)){
install.packages("remotes")
}
```

`tidyverse` is a collection of R packages designed for data science, and it includes `ggplot2`, which is a basic graph package for network and tree in R.

``` r
if (!require("tidyverse", quietly = TRUE)){
install.packages("tidyverse")
}
```

---

# Preparation: general packages 2

``` r
##readr to read data (csv, tsv, etc.)
if (!require("readr", quietly = TRUE)){
install.packages("readr")
}
##tibble to create data.frame
if (!require("tibble", quietly = TRUE)){
install.packages("tibble")
}
# shatdowtext is a package for adding shadow to text
if (!require("shadowtext", quietly = TRUE)){
install.packages("shadowtext")
}
# ggsci is a package for scientific journal color palettes
if (!require("ggsci", quietly = TRUE)){
install.packages("ggsci")
}
```

---

# Preparation: packages for graph/network

Some specific packages for graph and network.

``` r
# igraph is a package for creating, manipulating, and analyzing network data
if (!require("igraph", quietly = TRUE)){
install.packages("igraph")
}
# ggraph is a package for creating network visualizations using ggplot2
if(!require("ggraph", quietly = TRUE)){
install.packages("ggraph")
}
#tidygraph is a package for tidy manipulation of graph data
if(!require("tidygraph", quietly = TRUE)){
install.packages("tidygraph")
}
# ggnetword is another package for creating network visualizations using ggplot2
if(!require("ggnetwork", quietly = TRUE)){
install.packages("ggnetwork")
}
# extrafont is a package for using extra fonts in ggplot2
if (!require(extrafont)) {
  install.packages("extrafont")
}
```

---

# Preparation: packages for trees

Some specific packages for tree analysis.

``` r
# microbiomeViz is a package for visualizing microbiome data
if (!require("microbiomeViz", quietly = TRUE)) {
  remotes::install_github("lch14forever/microbiomeViz")
}
# phyloseq is a package for microbiome data analysis
if (!require("phyloseq", quietly = TRUE)) {
  BiocManager::install("phyloseq")
}
# ggtree is a package for visualizing phylogenetic trees
if(!require("ggtree", quietly = TRUE)){
BiocManager::install("ggtree")  
}
# treeio is a package for reading and writing phylogenetic tree data
if(!require("treeio", quietly = TRUE)){
BiocManager::install("treeio")  
}
# ggtreeExtra is a package for adding extra features to ggtree
if(!require("ggtreeExtra", quietly = TRUE)){
BiocManager::install("ggtreeExtra")  
}
```

---

# What is a graph/network?

.pull-left[
A graph is a collection of nodes (vertices) and edges (links) that connect pairs of nodes.

A small undirected graph with numbered nodes.
]

.pull-right[
<img src="https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/3_materials/images/graph.png" alt="a graph" width="70%">
]

---

# Biological networks

-   **Protein-protein interaction networks**: Protein-protein interaction networks (PPIN) are mathematical representations of the physical contacts between proteins in the cell.

-   **Metabolic networks**: A metabolic network is the complete set of metabolic and physical processes that determine the physiological and biochemical properties of a cell.

-   **Correlation network**: Correlation networks are a way to represent and analyze the relationships between different biological entities (like genes, proteins, or metabolites) based on how their levels or activity change together across various samples or conditions.

-   **Gene regulatory networks**: A gene regulatory network is a collection of molecular regulators that interact with each other and with other substances in the cell to regulate the gene expression levels of mRNA and proteins.

-   **Microbe-metabolite network**: A microbe-metabolite network is a network that describes the interactions between microbes and metabolites in a biological system.

-   **Drug-target networks**: A drug-target network is a network that describes the interactions between drugs and their target proteins in the cell.

---

# Protein-protein interaction networks

.pull-left[

Protein-protein interaction networks (PPIN) are mathematical representations of the physical contacts between proteins in the cell.

* Nodes: proteins

* Edges: interactions between proteins

]

.pull-right[
<div style="display: flex; justify-content: center; align-items: center;">
    <img src="https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/3_materials/images/ppi_network.png" alt="Description of the image" width="100%">
</div>

> Jeong et al. Nature 2001

]

---

# Metabolic networks

.pull-left[

A metabolic network is the complete set of metabolic and physical processes that determine the physiological and biochemical properties of a cell. As such, these networks comprise the chemical reactions of metabolism, the metabolic pathways, as well as the regulatory interactions that guide these reactions.

* Nodes: protein (enzyme and metabolites)

* Edges: chemical reactions

* Database: KEGG, Reactome, HMDB...

]

.pull-right[
<div style="display: flex; justify-content: center; align-items: center;">
    <img src="https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/3_materials/images/metabolic_network.png" alt="Description of the image" width="100%">
</div>

> https://en.wikipedia.org/wiki/Metabolic_network

]

---

# Correlation networks

.pull-left[

Correlation networks are a way to represent and analyze the relationships between different biological entities (like genes, proteins, or metabolites) based on how their levels or activity change together across various samples or conditions.

* Node: Each node in the network represents a biological entity (e.g., a gene, a protein, or a metabolite).

* Edge: Edges connect nodes and represent the correlation between the entities they represent. The strength or weight of the edge indicates the strength of the correlation.

Gene Co-expression Networks

]

.pull-right[
<div style="display: flex; justify-content: center; align-items: center;">
    <img src="https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/3_materials/images/correlation_network.png" alt="Description of the image" width="100%">
</div>

> Zhou et al. Cell Host & Microbe 2024

]

---

# Gene Co-expression Networks

Step 1: Expression data. Step 2: DEG. Step 3: Correlation. Step 4: Network

> Delgado-Chaves et al. Genes 2019

---

# Some important properties of a network

* **Degree Distribution**:

The distribution of the number of connections (or "degree") each node has in the network.

* **Shortest Paths**:

The shortest path between two nodes in a network is the smallest number of edges that must be traversed to get from one node to the other.

* **Centrality**:

Measures the importance or influence of a node within the network, often based on the number of connections, shortest paths, or intermediate positions.

* **Betweenness Centrality**:

A measure of how often a node appears on the shortest path between two other nodes in the network.

---

# Degree distribution

.pull-left[
What is the degree of nodes?

Degree is the number of edges connected to a node.

In this network, node 3's degree is 3. node 1's degree is 1.

]

.pull-right[
<img src="https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/3_materials/images/graph.png" alt="a graph" width="70%">
]

---

# Scale-free networks

.pull-left[

A "scale-free network" refers to a network where the degree distribution, which is the distribution of the number of connections (or links) each node has, follows a power law.

This means that a small number of nodes (the "hubs") have a very large number of connections, while the majority of nodes have relatively few connections.

In biological network, scale-free networks are common, with a few highly connected nodes (hubs) and many nodes with only a few connections.

**These hubs are often important for the overall structure and function of the network (hub gene/protein/metabolite).**

]

.pull-right[
<img src="https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/3_materials/images/scale_free_network.png" alt="a graph" width="70%">

> https://link.springer.com/article/10.1007/s10142-022-00907-y

]

---

# Graph/Network in R

Some important R packages for graph and network analysis.

* `igraph`

A R package for creating, manipulating, and analyzing network data.

[https://igraph.org/](https://igraph.org/)

* `ggraph`

A R package for creating network visualizations using `ggplot2`.

[https://ggraph.data-imaginist.com/](https://ggraph.data-imaginist.com/)

* `tidygraph`

A R package for tidy manipulation of graph data.

[https://tidygraph.data-imaginist.com/](https://tidygraph.data-imaginist.com/)

---

# Create a network: from the data.frame

``` r
library(tidyverse)
library(tibble)
library(igraph)
library(ggraph)
library(tidygraph)

##demo data

head(highschool)
```

```
##   from to year
## 1    1 14 1957
## 2    1 15 1957
## 3    1 21 1957
## 4    1 54 1957
## 5    1 55 1957
## 6    2 21 1957
```

---

# Create a network: from the data.frame

``` r
graph = as_tbl_graph(highschool)

class(graph)
```

```
## [1] "tbl_graph" "igraph"
```

``` r
graph
```

```
## # A tbl_graph: 70 nodes and 506 edges
## #
## # A directed multigraph with 1 component
## #
## # Node Data: 70 × 0 (active)
## #
## # Edge Data: 506 × 3
##    from    to  year
##   <int> <int> <dbl>
## 1     1    13  1957
## 2     1    14  1957
## 3     1    20  1957
## # ℹ 503 more rows
```

---

# Create a network: from the data.frame

.pull-left[

``` r
plot(graph)
```
]

.pull-right[
![](data:image/png;base64,#MD7115-Xiaotao_files/figure-html/unnamed-chunk-9-1.png)
]

---

# Create a network: from the node and edge data

``` r
node_data =
  tibble::tibble(
    id = c("A", "B", "C", "D", "E"),
    label = c("Alice", "Bob", "Charlie", "David", "Eve"),
    sex = c("F", "M", "M", "F", "F")
  )

edge_data =
  tibble::tibble(
    from = c("A", "A", "B", "C", "D", "E"),
    to = c("B", "C", "D", "D", "E", "A"),
    relationship = c("friend", "neighbor", "friend", "neighbor", "friend", "friend")
  )
```

---

# Create a network: from the node and edge data

.pull-left[

``` r
node_data
```

```
## # A tibble: 5 × 3
##   id    label   sex  
##   <chr> <chr>   <chr>
## 1 A     Alice   F    
## 2 B     Bob     M    
## 3 C     Charlie M    
## 4 D     David   F    
## 5 E     Eve     F
```
]

.pull-right[

``` r
edge_data
```

```
## # A tibble: 6 × 3
##   from  to    relationship
##   <chr> <chr> <chr>       
## 1 A     B     friend      
## 2 A     C     neighbor    
## 3 B     D     friend      
## 4 C     D     neighbor    
## 5 D     E     friend      
## 6 E     A     friend
```
]

---

# Create a network: from the node and edge data

.pull-left[

`tbl_graph` is a function from the package `tidygraph` to create a network from the node and edge data.

``` r
graph2 =
  tbl_graph(nodes = node_data, edges = edge_data, directed = FALSE)
graph2
```
]

.pull.right[

```
## # A tbl_graph: 5 nodes and 6 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 5 × 3 (active)
##   id    label   sex  
##   <chr> <chr>   <chr>
## 1 A     Alice   F    
## 2 B     Bob     M    
## 3 C     Charlie M    
## 4 D     David   F    
## 5 E     Eve     F    
## #
## # Edge Data: 6 × 3
##    from    to relationship
##   <int> <int> <chr>       
## 1     1     2 friend      
## 2     1     3 neighbor    
## 3     2     4 friend      
## # ℹ 3 more rows
```
]

---

# Network visualization

.pull-left[

* `ggraph` function is used to visualize the network.

* `geom_edge_xxx` functions are used to add edges to the plot.

* `geom_node_xxx` functions are used to add nodes to the plot.

``` r
ggraph(graph2, layout = 'kk') +
  geom_edge_link() +
  geom_node_point()
```
]

.pull-right[
![](data:image/png;base64,#MD7115-Xiaotao_files/figure-html/unnamed-chunk-16-1.png)
]

---

# Layout

.pull-left[

The layout of one network is the arrangement of nodes and edges in the network. `ggraph` supports different layouts.

* kk: Kamada-Kawai layout

* fr: Fruchterman-Reingold layout

* stress: Stress layout

* Customized layout

``` r
ggraph(graph2, layout = 'fr') +
  geom_edge_link() +
  geom_node_point()
```
]

.pull-right[
![](data:image/png;base64,#MD7115-Xiaotao_files/figure-html/unnamed-chunk-18-1.png)
]

---

# Layout

.pull-left[

Stress layout is a layout that tries to minimize the stress of the network.

``` r
ggraph(graph2, layout = 'stress') +
  geom_edge_link() +
  geom_node_point()
```
]

.pull-right[
![](data:image/png;base64,#MD7115-Xiaotao_files/figure-html/unnamed-chunk-20-1.png)
]

---

# Layout

.pull-left[

Linear layout is a layout that arranges the nodes in a straight line.

``` r
ggraph(graph2, layout = 'linear') +
   geom_edge_arc() +
  geom_node_point()
```
]

.pull-right[
![](data:image/png;base64,#MD7115-Xiaotao_files/figure-html/unnamed-chunk-22-1.png)
]

---

# Layout

.pull-left[

``` r
ggraph(graph2, layout = 'linear', 
       circular = TRUE) +
   geom_edge_arc() +
  geom_node_point() +
   coord_fixed()
```
]

.pull-right[
![](data:image/png;base64,#MD7115-Xiaotao_files/figure-html/unnamed-chunk-24-1.png)
]

# Node and edge attributes

.pull-left[

``` r
ggraph(graph2, layout = 'linear', circular = TRUE) +
  geom_edge_arc(aes(color = relationship)) +
  geom_node_point(aes(color = sex),
                  size = 5) +
  coord_fixed()
```
]

.pull-right[
![](data:image/png;base64,#MD7115-Xiaotao_files/figure-html/unnamed-chunk-26-1.png)
]

---

# Add labels

.pull-left[

`geom_node_text` function is used to add labels to the nodes.

``` r
ggraph(graph2, layout = 'linear', 
       circular = TRUE) +
  geom_edge_arc(aes(color = relationship)) +
  geom_node_point(aes(color = sex),
                  size = 5) +
  geom_node_text(aes(label = label),
                 repel = TRUE) +
  coord_fixed()
```
]

.pull-right[
![](data:image/png;base64,#MD7115-Xiaotao_files/figure-html/unnamed-chunk-28-1.png)
]

---

# Case study

## 1. Correlation network

## 2. GO similarity network

---

# Example 1: Correlation network

> [Shen, et al. Nature Biomedical Engineering, 2024](https://www.shen-lab.org/publication/Multi-omics%20microsampling%20for%20the%20profiling%20of%20lifestyle-associated%20changes%20in%20health.pdf)

---

# Example 1: Correlation network

What's the network?

.pull-left[
**Nodes**:

-   Wearable data: Step, HR, and CGM (continuous glucose monitoring) data

-   Omics data: Proteins, metabolites, and lipids

**Edges**:

-   Lagged correlations

-   Blue: negative correlations

-   Red: positive correlations
]

.pull-right[
<div style="display: flex; justify-content: center; align-items: center;">
    <img src="https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/3_materials/images/ED_FIG7.jpg" alt="Description of the image" width="100%">
</div>
]

---

# Read the data

``` r
# node data
if (!require(readr)) {
  install.packages("readr")
}
node_data =
  readr::read_csv(
    "https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/2_demo_data/example_node_data.csv"
  )
```

---

# Read the data

``` r
#edge data
edge_data =
  readr::read_csv(
    "https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/2_demo_data/example_edge_data.csv"
  )
```

---

# Node data

.pull-left[

``` r
dim(node_data)
```

```
## [1] 319   4
```

``` r
head(node_data)
```

```
## # A tibble: 6 × 4
##   node        class    mol_name class2        
##   <chr>       <chr>    <chr>    <chr>         
## 1 CGM         cgm      CGM      wearable      
## 2 cortisol_1  cortisol Cortisol internal-omics
## 3 cytokine_34 cytokine IP10     internal-omics
## 4 cytokine_40 cytokine TNFB     internal-omics
## 5 cytokine_20 cytokine IL15     internal-omics
## 6 cytokine_4  cytokine TGFA     internal-omics
```
]

.pull-right[

319 nodes

- node: The ID of the node, character

- class: The class of the node, character

- mol_name: The name of the molecule, character

- Other: all the other attributes of the edge
]

---

# Edge data

.pull-left[

``` r
dim(edge_data)
```

```
## [1] 337  11
```

``` r
head(edge_data)
```

```
## # A tibble: 6 × 11
##   from  to         global_cor lagged_cor
##   <chr> <chr>           <dbl>      <dbl>
## 1 CGM   cytokine_…      0.274      0.292
## 2 CGM   cytokine_…      0.308      0.308
## 3 CGM   cytokine_…      0.327      0.327
## 4 CGM   cytokine_4      0.331      0.343
## 5 CGM   cytokine_6      0.343      0.355
## 6 CGM   lipid_113      -0.118     -0.324
## # ℹ 7 more variables: shift_time <dbl>,
## #   global_cor_p <dbl>,
## #   lagged_cor_p <dbl>,
## #   global_cor_p_adjust <dbl>,
## #   lagged_cor_p_adjust <dbl>,
## #   class1 <chr>, class <chr>
```
]

.pull-right[

337 edges

- from: The ID of the source node, character

- to: The ID of the target node, character

- Other: all the other attributes of the edge
]

---

# Create a graph

``` r
library(igraph)
library(tidyverse)
library(ggraph)
library(tidygraph)
graph_data =
  tidygraph::tbl_graph(nodes = node_data,
                       edges = edge_data,
                       directed = FALSE) %>%
  dplyr::mutate(Degree = centrality_degree(mode = 'all'))
```

`centrality_degree` is from the package `tidygraph` to calcualte the degree of the nodes.

``` r
?centrality_degree
```

---

# Create a graph

``` r
graph_data
```

```
## # A tbl_graph: 319 nodes and 337 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 319 × 5 (active)
##    node        class    mol_name class2         Degree
##    <chr>       <chr>    <chr>    <chr>           <dbl>
##  1 CGM         cgm      CGM      wearable           20
##  2 cortisol_1  cortisol Cortisol internal-omics      2
##  3 cytokine_34 cytokine IP10     internal-omics      1
##  4 cytokine_40 cytokine TNFB     internal-omics      1
##  5 cytokine_20 cytokine IL15     internal-omics      1
##  6 cytokine_4  cytokine TGFA     internal-omics      2
##  7 cytokine_6  cytokine FLT3L    internal-omics      1
##  8 cytokine_12 cytokine IL10     internal-omics      2
##  9 cytokine_17 cytokine PDGFAA   internal-omics      1
## 10 cytokine_3  cytokine EOTAXIN  internal-omics      1
## # ℹ 309 more rows
## #
## # Edge Data: 337 × 11
##    from    to global_cor lagged_cor shift_time global_cor_p lagged_cor_p
##   <int> <int>      <dbl>      <dbl>      <dbl>        <dbl>        <dbl>
## 1     1     3      0.274      0.292        -40      0.0107       0.00607
## 2     1     4      0.308      0.308          0      0.00393      0.00393
## 3     1     5      0.327      0.327          0      0.00211      0.00211
## # ℹ 334 more rows
## # ℹ 4 more variables: global_cor_p_adjust <dbl>, lagged_cor_p_adjust <dbl>,
## #   class1 <chr>, class <chr>
```

---

# Visualize the graph

.pull-left[

`coord_fixed` (in `ggplot2` package) is used to fix the aspect ratio of the plot.

``` r
ggraph(graph_data, layout = 'kk') +
  geom_edge_link() +
  geom_node_point(
  ) +
  coord_fixed()
```
]

.pull-right[
![](data:image/png;base64,#MD7115-Xiaotao_files/figure-html/unnamed-chunk-39-1.png)
]

---

# Another layout

.pull-left[

``` r
ggraph(graph_data, layout = 'fr') +
  geom_edge_link() +
  geom_node_point(
  ) +
  coord_fixed()
```
]

.pull-right[
![](data:image/png;base64,#MD7115-Xiaotao_files/figure-html/unnamed-chunk-41-1.png)
]

---

# Modification: more information in the network

.pull-left[

``` r
plot =  
ggraph(graph_data, layout = 'kk') +
  geom_edge_link(aes(
    color = lagged_cor,
    width = -log(lagged_cor_p_adjust, 10)
  ),
  alpha = 0.5,
  show.legend = TRUE) +
  geom_node_point(
    aes(fill = class, size = Degree),
    shape = 21,
    alpha = 0.5,
    show.legend = TRUE
  ) +
  coord_fixed()
plot
```
]

.pull-right[
![](data:image/png;base64,#MD7115-Xiaotao_files/figure-html/unnamed-chunk-43-1.png)
]

---

# Modification: more information in the network

.pull-left[

``` r
plot = plot +
  shadowtext::geom_shadowtext(aes(x = x, y = y,
      label = ifelse(class2 == "wearable", mol_name, NA), color = class),
    bg.color = "white",
    size = 5,
    show.legend = FALSE
  )
plot
```
]

.pull-right[
![](data:image/png;base64,#MD7115-Xiaotao_files/figure-html/unnamed-chunk-45-1.png)
]

---

# Modification: more information in the network

Set colors for nodes

``` r
class_color =
  c(
    "lipidomics" = ggsci::pal_aaas()(10)[1],
    "metabolomics" = ggsci::pal_aaas()(10)[3],
    "cytokine" = ggsci::pal_aaas()(10)[4],
    "total_protein" = ggsci::pal_aaas()(10)[5],
    "cortisol" = ggsci::pal_aaas()(10)[6],
    "metabolic_panel" = ggsci::pal_aaas()(10)[7],
    "proteomics" = ggsci::pal_aaas()(10)[8]
  )

wearable_color =
  c(
    "sleep" = ggsci::pal_d3()(n = 10)[2],
    "cgm"  = ggsci::pal_d3()(n = 10)[6],
    "hr" = ggsci::pal_d3()(n = 10)[7],
    "step"  = ggsci::pal_d3()(n = 10)[9],
    "food" = ggsci::pal_d3()(n = 10)[10]
  )
```

---

# Modification: more information in the network

.pull-left[

``` r
plot = 
  plot +
  scale_size_continuous(range = c(3, 10)) +
  scale_fill_manual(values = c(class_color, wearable_color)) +
  scale_color_manual(values = c(class_color, wearable_color)) +
  scale_edge_width_continuous(range = c(0.3, 2)) +
  ggraph::scale_edge_color_gradientn(colours = c(alpha("#3B4992FF", 0.7), "white", alpha("#EE0000FF", 0.7))) +
  ggraph::theme_graph() +
  theme(
    plot.background = element_rect(fill = "transparent", color = NA),
    panel.background = element_rect(fill = "transparent", color = NA),
    legend.position = "right",
    legend.background = element_rect(fill = "transparent", color = NA)
  )
plot
```
]

.pull-right[
![](data:image/png;base64,#MD7115-Xiaotao_files/figure-html/unnamed-chunk-48-1.png)
]

---

# Change to circle layout

``` r
g = graph_data
library(igraph)
library(ggraph)
library(tidygraph)

V(g)$type = bipartite_mapping(g)$type

coords =
  create_layout(g, layout = "bipartite") %>%
  dplyr::select(node, class, mol_name, class2, x, y)

head(coords)
```

```
## # A tibble: 6 × 6
##   node        class    mol_name class2             x     y
##   <chr>       <chr>    <chr>    <chr>          <dbl> <dbl>
## 1 CGM         cgm      CGM      wearable         9.5     1
## 2 cortisol_1  cortisol Cortisol internal-omics 175       0
## 3 cytokine_34 cytokine IP10     internal-omics   1       0
## 4 cytokine_40 cytokine TNFB     internal-omics   6       0
## 5 cytokine_20 cytokine IL15     internal-omics   9       0
## 6 cytokine_4  cytokine TGFA     internal-omics  56       0
```

---

# Change to circle layout

.pull-left[

``` r
coords$y[coords$class2 == "internal-omics"] = 1
coords$y[coords$class2 == "wearable"] = -0.5

coords$y[coords$class == "metabolomics"] = 1
coords$y[coords$class == "lipidomics"] = 0.5
coords$y[coords$class == "proteomics"] = 0
coords$y[coords$class == "cytokine"] = 0
coords$y[coords$class == "metabolic_panel"] = 0
coords$y[coords$class == "total_protein"] = 0

# calculate the x and y coordinates for circle layout
coords =
  coords %>%
  dplyr::select(x, y) %>%
  dplyr::mutate(
    theta = x / (max(x) + 1) * 2 * pi,
    r = y + 1,
    x = r * cos(theta),
    y = r * sin(theta)
  )
```
]

.pull-right[

``` r
head(coords)
```

```
## # A tibble: 6 × 4
##        x       y  theta     r
##    <dbl>   <dbl>  <dbl> <dbl>
## 1  0.491  0.0939 0.189    0.5
## 2 -1.89  -0.663  3.48     2  
## 3  1.00   0.0199 0.0199   1  
## 4  0.993  0.119  0.119    1  
## 5  0.984  0.178  0.179    1  
## 6  0.442  0.897  1.11     1
```
]

---

# Change to circle layout

.pull-left[

``` r
my_graph =
  create_layout(
    graph = g,
    layout = "manual",
    x = coords$x,
    y = coords$y
  )
```
]

.pull-right[

``` r
head(my_graph)
```

```
## # A tibble: 6 × 11
##        x       y circular node        class    mol_name class2      Degree type 
##    <dbl>   <dbl> <lgl>    <chr>       <chr>    <chr>    <chr>        <dbl> <lgl>
## 1  0.491  0.0939 FALSE    CGM         cgm      CGM      wearable        20 FALSE
## 2 -1.89  -0.663  FALSE    cortisol_1  cortisol Cortisol internal-o…      2 TRUE 
## 3  1.00   0.0199 FALSE    cytokine_34 cytokine IP10     internal-o…      1 TRUE 
## 4  0.993  0.119  FALSE    cytokine_40 cytokine TNFB     internal-o…      1 TRUE 
## 5  0.984  0.178  FALSE    cytokine_20 cytokine IL15     internal-o…      1 TRUE 
## 6  0.442  0.897  FALSE    cytokine_4  cytokine TGFA     internal-o…      2 TRUE 
## # ℹ 2 more variables: .ggraph.orig_index <int>, .ggraph.index <int>
```
]

---

# Visualize the circle layout

.pull-left[

``` r
plot =
  ggraph(my_graph, layout = 'bipartite') +
  geom_edge_diagonal(aes(
    color = lagged_cor,
    width = -log(lagged_cor_p_adjust, 10)
  ),
  alpha = 0.5,
  show.legend = TRUE) +
  geom_node_point(
    aes(fill = class, size = Degree),
    shape = 21,
    alpha = 0.5,
    show.legend = TRUE
  )
```
]

.pull-right[

``` r
plot
```

![](data:image/png;base64,#MD7115-Xiaotao_files/figure-html/unnamed-chunk-55-1.png)
]

---

# Visualize the circle layout

.pull-left[

``` r
plot = plot +
  shadowtext::geom_shadowtext(
    aes(x, y, label = ifelse(class2 == "wearable", mol_name, NA),
      color = class),
    bg.color = "white", size = 5, show.legend = FALSE) +
  shadowtext::geom_shadowtext(
    aes(x, y,
      label = ifelse(class2 == "wearable", NA, mol_name),
      color = class,
      nudge_y = ifelse(class2 == "wearable", "inward", 'outward'),
      angle = -((-node_angle(x, y) + 90) %% 180) + 90),
    bg.color = "white", size = 2, show.legend = FALSE, check_overlap = TRUE)
plot
```
]

.pull-right[
![](data:image/png;base64,#MD7115-Xiaotao_files/figure-html/unnamed-chunk-57-1.png)
]

---

# Visualize the circle layout

``` r
plot =
  plot +
  scale_size_continuous(range = c(3, 10)) +
  scale_fill_manual(values = c(class_color, wearable_color)) +
  scale_color_manual(values = c(class_color, wearable_color)) +
  scale_edge_width_continuous(range = c(0.3, 2)) +
  ggraph::scale_edge_color_gradientn(colours = c(alpha("#3B4992FF", 0.7), "white", alpha("#EE0000FF", 0.7))) +
  ggraph::theme_graph() +
  theme(
    plot.background = element_rect(fill = "transparent", color = NA),
    panel.background = element_rect(fill = "transparent", color = NA),
    legend.position = "bottom",
    legend.background = element_rect(fill = "transparent", color = NA)
  ) 
```

---

# Visualize the circle layout

.pull-left[

``` r
plot
```
]

.pull-right[
![](data:image/png;base64,#MD7115-Xiaotao_files/figure-html/unnamed-chunk-60-1.png)
]

---

# Save the plot

We need to install `extrafont` to save the plot.

``` r
if (!require(extrafont)) {
  install.packages("extrafont")
}
```

``` r
extrafont::loadfonts()
ggsave(plot, filename = "network.pdf", width = 9.5, height = 10)
```

---

# Example 2: GO similarity network

What's the network?

.pull-left[
**Nodes**:

- Enriched GO terms

**Edges**:

- Similarity

]

.pull-right[
<div style="display: flex; justify-content: center; align-items: center;">
    <img src="https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/3_materials/images/similarity_network_plot.jpg" alt="Description of the image" width="100%">
</div>
]

---

# Load libraries

``` r
library(readr)
library(ggraph)
library(tidygraph)
library(igraph)
library(tidyverse)
library(simplifyEnrichment)
library(tibble)
```

---

# Read the data

``` r
pathway_result =
  readr::read_csv(
    "https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/2_demo_data/pathway_result.csv"
  )
head(pathway_result)
```

```
## # A tibble: 6 × 10
##   ONTOLOGY ID    Description GeneRatio BgRatio   pvalue p.adjust   qvalue geneID
##   <chr>    <chr> <chr>       <chr>     <chr>      <dbl>    <dbl>    <dbl> <chr> 
## 1 MF       GO:0… transmembr… 14/114    86/206… 3.39e-17 7.46e-15 4.57e-15 ENSG0…
## 2 MF       GO:0… growth fac… 16/114    143/20… 8.29e-17 9.12e-15 5.59e-15 ENSG0…
## 3 MF       GO:0… transmembr… 12/114    69/206… 2.92e-15 2.14e-13 1.31e-13 ENSG0…
## 4 BP       GO:0… wound heal… 23/112    495/20… 1.85e-15 4.95e-12 3.27e-12 ENSG0…
## 5 MF       GO:0… protein ty… 12/114    151/20… 4.27e-11 2.35e- 9 1.44e- 9 ENSG0…
## 6 MF       GO:0… cytokine b… 11/114    144/20… 4.39e-10 1.93e- 8 1.18e- 8 ENSG0…
## # ℹ 1 more variable: Count <dbl>
```

---

# Compute GO term similarity matrices: BP (Biological Process)

``` r
bp_sim_matrix =
  simplifyEnrichment::GO_similarity(go_id = pathway_result$ID[pathway_result$ONTOLOGY == "BP"],
                                    ont = "BP",
                                    measure = "Wang") %>%
  as.data.frame() %>%
  tibble::rownames_to_column(var = "name1") %>%
  tidyr::pivot_longer(cols = -name1,
                      names_to = "name2",
                      values_to = "sim") %>%
  dplyr::filter(name1 != name2) %>%
  dplyr::filter(sim > 0.5)
head(bp_sim_matrix)
```

```
## # A tibble: 6 × 3
##   name1      name2        sim
##   <chr>      <chr>      <dbl>
## 1 GO:0042060 GO:0061041 0.616
## 2 GO:0042060 GO:0090303 0.504
## 3 GO:0042060 GO:0007596 0.552
## 4 GO:1903034 GO:1903036 0.801
## 5 GO:1903034 GO:0061041 0.820
## 6 GO:1903034 GO:0090303 0.657
```

---

# Compute GO term similarity matrices: BP (Biological Process)

``` r
# Create a unique identifier for each GO term pair
# This helps remove duplicate pairs (A-B and B-A are the same relationship)
name = apply(bp_sim_matrix, 1, function(x) {
  paste(sort(x[1:2]), collapse = "_")
})

# Remove duplicate GO term pairs while keeping the data structure
bp_sim_matrix =
  bp_sim_matrix %>%
  dplyr::mutate(name = name) %>%
  dplyr::arrange(name) %>%
  dplyr::distinct(name, .keep_all = TRUE) %>%
  dplyr::select(-name)
```

---

# Compute GO term similarity matrices: MF (Molecular Function)

``` r
mf_sim_matrix =
  simplifyEnrichment::GO_similarity(go_id = pathway_result$ID[pathway_result$ONTOLOGY == "MF"],
                                    ont = "MF",
                                    measure = "Wang") %>%
  as.data.frame() %>%
  tibble::rownames_to_column(var = "name1") %>%
  tidyr::pivot_longer(cols = -name1,
                      names_to = "name2",
                      values_to = "sim") %>%
  dplyr::filter(name1 != name2) %>%
  dplyr::filter(sim > 0.5)
head(mf_sim_matrix)
```

```
## # A tibble: 6 × 3
##   name1      name2        sim
##   <chr>      <chr>      <dbl>
## 1 GO:0019199 GO:0004714 0.854
## 2 GO:0019199 GO:0004713 0.654
## 3 GO:0019838 GO:0019955 0.605
## 4 GO:0004714 GO:0019199 0.854
## 5 GO:0004714 GO:0004713 0.727
## 6 GO:0004713 GO:0019199 0.654
```

---

# Compute GO term similarity matrices: MF (Molecular Function)

``` r
name = apply(mf_sim_matrix, 1, function(x) {
  paste(sort(x[1:2]), collapse = "_")
})

mf_sim_matrix =
  mf_sim_matrix %>%
  dplyr::mutate(name = name) %>%
  dplyr::arrange(name) %>%
  dplyr::distinct(name, .keep_all = TRUE) %>%
  dplyr::select(-name)
```

---

# Compute GO term similarity matrices: CC (Cellular Component)

``` r
cc_sim_matrix =
  simplifyEnrichment::GO_similarity(go_id = pathway_result$ID[pathway_result$ONTOLOGY == "CC"],
                                    ont = "CC",
                                    measure = "Wang") %>%
  as.data.frame() %>%
  tibble::rownames_to_column(var = "name1") %>%
  tidyr::pivot_longer(cols = -name1,
                      names_to = "name2",
                      values_to = "sim") %>%
  dplyr::filter(name1 != name2) %>%
  dplyr::filter(sim > 0.5)
head(cc_sim_matrix)
```

```
## # A tibble: 6 × 3
##   name1      name2        sim
##   <chr>      <chr>      <dbl>
## 1 GO:0005775 GO:0043202 0.841
## 2 GO:0005775 GO:0005788 0.725
## 3 GO:0005775 GO:0005796 0.725
## 4 GO:0031983 GO:0034774 0.591
## 5 GO:0031983 GO:0060205 0.719
## 6 GO:0043202 GO:0005775 0.841
```

---

# Compute GO term similarity matrices: CC (Cellular Component)

``` r
name = apply(cc_sim_matrix, 1, function(x) {
  paste(sort(x[1:2]), collapse = "_")
})

cc_sim_matrix =
  cc_sim_matrix %>%
  dplyr::mutate(name = name) %>%
  dplyr::arrange(name) %>%
  dplyr::distinct(name, .keep_all = TRUE) %>%
  dplyr::select(-name)
```

---

# Final similarity matrix

``` r
# Combine all three ontology similarity matrices into one
sim_matrix =
  rbind(bp_sim_matrix, mf_sim_matrix, cc_sim_matrix) %>%
  tibble::as_tibble()

# Preview the combined similarity matrix
head(sim_matrix)
```

```
## # A tibble: 6 × 3
##   name1      name2        sim
##   <chr>      <chr>      <dbl>
## 1 GO:0000302 GO:0006979 0.663
## 2 GO:0001101 GO:0009410 0.605
## 3 GO:0001101 GO:1902074 0.605
## 4 GO:0001503 GO:0001649 0.564
## 5 GO:0048771 GO:0001503 0.543
## 6 GO:0050817 GO:0001503 0.543
```

---

# Graph data: edge data

``` r
edge_data =
  sim_matrix %>%
  # Rename columns to standard network terminology
  dplyr::rename(from = name1, to = name2) %>%
  # Keep only strong relationships (similarity > 0.5)
  dplyr::filter(sim > 0.5)

# Display sample of edge data
head(edge_data)
```

```
## # A tibble: 6 × 3
##   from       to           sim
##   <chr>      <chr>      <dbl>
## 1 GO:0000302 GO:0006979 0.663
## 2 GO:0001101 GO:0009410 0.605
## 3 GO:0001101 GO:1902074 0.605
## 4 GO:0001503 GO:0001649 0.564
## 5 GO:0048771 GO:0001503 0.543
## 6 GO:0050817 GO:0001503 0.543
```

---

# Graph data: node data

``` r
# Create node data from pathway results
# Nodes represent individual GO terms
node_data =
  pathway_result %>%
  dplyr::select(ID, everything()) %>%
  dplyr::rename(node = ID)

# Display sample of node data
head(node_data)
```

```
## # A tibble: 6 × 10
##   node  ONTOLOGY Description GeneRatio BgRatio   pvalue p.adjust   qvalue geneID
##   <chr> <chr>    <chr>       <chr>     <chr>      <dbl>    <dbl>    <dbl> <chr> 
## 1 GO:0… MF       transmembr… 14/114    86/206… 3.39e-17 7.46e-15 4.57e-15 ENSG0…
## 2 GO:0… MF       growth fac… 16/114    143/20… 8.29e-17 9.12e-15 5.59e-15 ENSG0…
## 3 GO:0… MF       transmembr… 12/114    69/206… 2.92e-15 2.14e-13 1.31e-13 ENSG0…
## 4 GO:0… BP       wound heal… 23/112    495/20… 1.85e-15 4.95e-12 3.27e-12 ENSG0…
## 5 GO:0… MF       protein ty… 12/114    151/20… 4.27e-11 2.35e- 9 1.44e- 9 ENSG0…
## 6 GO:0… MF       cytokine b… 11/114    144/20… 4.39e-10 1.93e- 8 1.18e- 8 ENSG0…
## # ℹ 1 more variable: Count <dbl>
```

---

# Create a graph data structure

``` r
# Create a graph data structure using tidygraph
# Combines node and edge information into a single object
graph_data =
  tidygraph::tbl_graph(nodes = node_data,
                       edges = edge_data,
                       directed = FALSE) %>%
  # Calculate node degree (number of connections) as a centrality measure
  dplyr::mutate(degree = tidygraph::centrality_degree())
graph_data
```

```
## # A tbl_graph: 331 nodes and 712 edges
## #
## # An undirected simple graph with 66 components
## #
## # Node Data: 331 × 11 (active)
##    node       ONTOLOGY Description  GeneRatio BgRatio   pvalue p.adjust   qvalue
##    <chr>      <chr>    <chr>        <chr>     <chr>      <dbl>    <dbl>    <dbl>
##  1 GO:0019199 MF       transmembra… 14/114    86/206… 3.39e-17 7.46e-15 4.57e-15
##  2 GO:0019838 MF       growth fact… 16/114    143/20… 8.29e-17 9.12e-15 5.59e-15
##  3 GO:0004714 MF       transmembra… 12/114    69/206… 2.92e-15 2.14e-13 1.31e-13
##  4 GO:0042060 BP       wound heali… 23/112    495/20… 1.85e-15 4.95e-12 3.27e-12
##  5 GO:0004713 MF       protein tyr… 12/114    151/20… 4.27e-11 2.35e- 9 1.44e- 9
##  6 GO:0019955 MF       cytokine bi… 11/114    144/20… 4.39e-10 1.93e- 8 1.18e- 8
##  7 GO:0005125 MF       cytokine ac… 14/114    286/20… 6.13e-10 2.25e- 8 1.38e- 8
##  8 GO:0050840 MF       extracellul… 8/114     56/206… 7.53e-10 2.37e- 8 1.45e- 8
##  9 GO:1903034 BP       regulation … 13/112    183/20… 1.99e-11 2.67e- 8 1.76e- 8
## 10 GO:0008083 MF       growth fact… 11/114    162/20… 1.54e- 9 4.25e- 8 2.60e- 8
## # ℹ 321 more rows
## # ℹ 3 more variables: geneID <chr>, Count <dbl>, degree <dbl>
## #
## # Edge Data: 712 × 3
##    from    to   sim
##   <int> <int> <dbl>
## 1   187   316 0.663
## 2   177   240 0.605
## 3   177   325 0.605
## # ℹ 709 more rows
```

---

# Community/module detection

``` r
# Detect subnetworks/communities within the graph
# Edge betweenness clustering algorithm identifies modular structures
subnetwork =
  igraph::cluster_edge_betweenness(graph = graph_data, weights = abs(edge_attr(graph_data, "sim")))

# Create module labels for each detected community
cluster =
  paste("GO_Module", as.character(igraph::membership(subnetwork)), sep = "_")
head(cluster)
```

```
## [1] "GO_Module_1" "GO_Module_2" "GO_Module_1" "GO_Module_3" "GO_Module_1"
## [6] "GO_Module_2"
```

---

# Add module information to the graph data

``` r
# Add module information to the graph data
graph_data =
  graph_data %>%
  tidygraph::mutate(module = cluster)
graph_data
```

```
## # A tbl_graph: 331 nodes and 712 edges
## #
## # An undirected simple graph with 66 components
## #
## # Node Data: 331 × 12 (active)
##    node       ONTOLOGY Description  GeneRatio BgRatio   pvalue p.adjust   qvalue
##    <chr>      <chr>    <chr>        <chr>     <chr>      <dbl>    <dbl>    <dbl>
##  1 GO:0019199 MF       transmembra… 14/114    86/206… 3.39e-17 7.46e-15 4.57e-15
##  2 GO:0019838 MF       growth fact… 16/114    143/20… 8.29e-17 9.12e-15 5.59e-15
##  3 GO:0004714 MF       transmembra… 12/114    69/206… 2.92e-15 2.14e-13 1.31e-13
##  4 GO:0042060 BP       wound heali… 23/112    495/20… 1.85e-15 4.95e-12 3.27e-12
##  5 GO:0004713 MF       protein tyr… 12/114    151/20… 4.27e-11 2.35e- 9 1.44e- 9
##  6 GO:0019955 MF       cytokine bi… 11/114    144/20… 4.39e-10 1.93e- 8 1.18e- 8
##  7 GO:0005125 MF       cytokine ac… 14/114    286/20… 6.13e-10 2.25e- 8 1.38e- 8
##  8 GO:0050840 MF       extracellul… 8/114     56/206… 7.53e-10 2.37e- 8 1.45e- 8
##  9 GO:1903034 BP       regulation … 13/112    183/20… 1.99e-11 2.67e- 8 1.76e- 8
## 10 GO:0008083 MF       growth fact… 11/114    162/20… 1.54e- 9 4.25e- 8 2.60e- 8
## # ℹ 321 more rows
## # ℹ 4 more variables: geneID <chr>, Count <dbl>, degree <dbl>, module <chr>
## #
## # Edge Data: 712 × 3
##    from    to   sim
##   <int> <int> <dbl>
## 1   187   316 0.663
## 2   177   240 0.605
## 3   177   325 0.605
## # ℹ 709 more rows
```

---

# Extract node attributes with module assignments

``` r
result_with_module =
  igraph::vertex_attr(graph_data) %>%
  do.call(cbind, .) %>%
  tibble::as_tibble() %>%
  # Convert p.adjust to numeric for proper sorting
  dplyr::mutate(p.adjust = as.numeric(p.adjust)) %>%
  # Sort by module and significance
  dplyr::arrange(module, p.adjust)

# Further organize results by ontology type, module, and significance
result_with_module =
  result_with_module %>%
  dplyr::arrange(ONTOLOGY, module, p.adjust)
```

---

# Extract node attributes with module assignments

``` r
# Calculate the number of GO terms in each module
module_content_number =
  result_with_module %>%
  dplyr::count(module) %>%
  dplyr::rename(module_content_number = n)

# Add module size information to results
result_with_module =
  result_with_module %>%
  dplyr::left_join(module_content_number, by = "module")
head(result_with_module)
```

```
## # A tibble: 6 × 13
##   node      ONTOLOGY Description GeneRatio BgRatio pvalue p.adjust qvalue geneID
##   <chr>     <chr>    <chr>       <chr>     <chr>   <chr>     <dbl> <chr>  <chr> 
## 1 GO:00140… BP       phosphatid… 11/112    164/20… 1.399…  4.68e-7 3.090… ENSG0…
## 2 GO:00971… BP       extrinsic … 13/112    262/20… 1.701…  5.06e-7 3.339… ENSG0…
## 3 GO:00480… BP       phosphatid… 11/112    194/20… 8.197…  1.46e-6 9.652… ENSG0…
## 4 GO:00703… BP       regulation… 14/112    367/20… 1.133…  1.69e-6 1.111… ENSG0…
## 5 GO:00328… BP       regulation… 11/112    215/20… 2.378…  2.77e-6 1.826… ENSG0…
## 6 GO:00703… BP       ERK1 and E… 14/112    393/20… 2.675…  2.94e-6 1.938… ENSG0…
## # ℹ 4 more variables: Count <chr>, degree <chr>, module <chr>,
## #   module_content_number <int>
```

---

# Add module size information to graph data

``` r
graph_data =
  graph_data %>%
  # Activate nodes for manipulation
  activate(what = "nodes") %>%
  dplyr::left_join(module_content_number, by = "module")

# Display the final graph data structure
graph_data
```

```
## # A tbl_graph: 331 nodes and 712 edges
## #
## # An undirected simple graph with 66 components
## #
## # Node Data: 331 × 13 (active)
##    node       ONTOLOGY Description  GeneRatio BgRatio   pvalue p.adjust   qvalue
##    <chr>      <chr>    <chr>        <chr>     <chr>      <dbl>    <dbl>    <dbl>
##  1 GO:0019199 MF       transmembra… 14/114    86/206… 3.39e-17 7.46e-15 4.57e-15
##  2 GO:0019838 MF       growth fact… 16/114    143/20… 8.29e-17 9.12e-15 5.59e-15
##  3 GO:0004714 MF       transmembra… 12/114    69/206… 2.92e-15 2.14e-13 1.31e-13
##  4 GO:0042060 BP       wound heali… 23/112    495/20… 1.85e-15 4.95e-12 3.27e-12
##  5 GO:0004713 MF       protein tyr… 12/114    151/20… 4.27e-11 2.35e- 9 1.44e- 9
##  6 GO:0019955 MF       cytokine bi… 11/114    144/20… 4.39e-10 1.93e- 8 1.18e- 8
##  7 GO:0005125 MF       cytokine ac… 14/114    286/20… 6.13e-10 2.25e- 8 1.38e- 8
##  8 GO:0050840 MF       extracellul… 8/114     56/206… 7.53e-10 2.37e- 8 1.45e- 8
##  9 GO:1903034 BP       regulation … 13/112    183/20… 1.99e-11 2.67e- 8 1.76e- 8
## 10 GO:0008083 MF       growth fact… 11/114    162/20… 1.54e- 9 4.25e- 8 2.60e- 8
## # ℹ 321 more rows
## # ℹ 5 more variables: geneID <chr>, Count <dbl>, degree <dbl>, module <chr>,
## #   module_content_number <int>
## #
## # Edge Data: 712 × 3
##    from    to   sim
##   <int> <int> <dbl>
## 1   187   316 0.663
## 2   177   240 0.605
## 3   177   325 0.605
## # ℹ 709 more rows
```

---

# Select representative labels for each module

``` r
# Choose the most significant GO term with highest count in each module
cluster_label_module =
  igraph::as_data_frame(graph_data, what = "vertices") %>%
  dplyr::group_by(module) %>%
  dplyr::filter(p.adjust == min(p.adjust) &
                  Count == max(Count)) %>%
  dplyr::slice_head(n = 1) %>%
  pull(Description)

# Get all GO term descriptions for reference
cluster_label_all =
  igraph::as_data_frame(graph_data, what = "vertices")$Description
```

---

# Visualization of GO term clusters

.pull-left[

``` r
# Create network visualization
plot =
  graph_data %>%
  ggraph(layout = 'fr', circular = FALSE) +
  geom_edge_link(
    aes(width = sim),
    strength = 1,
    color = "black",
    alpha = 1,
    show.legend = FALSE
  ) +
  geom_node_point(
    aes(fill = module, size = -log(p.adjust, 10)),
    shape = 21,
    alpha = 1,
    show.legend = FALSE
  ) +
  coord_fixed()
plot
```
]

.pull-right[
![](data:image/png;base64,#MD7115-Xiaotao_files/figure-html/unnamed-chunk-82-1.png)
]

---

# Visualization of GO term clusters

.pull-left[

``` r
plot =
  plot +
  geom_node_text(aes(
    x = x,
    y = y,
    label = ifelse(Description %in% cluster_label_module, Description, NA)
  ),
  size = 3,
  repel = TRUE)
plot
```
]

.pull-right[
![](data:image/png;base64,#MD7115-Xiaotao_files/figure-html/unnamed-chunk-84-1.png)
]

---

# Visualization of GO term clusters

.pull-left[

``` r
plot =
  plot +
  guides(fill = guide_legend(ncol = 1)) +
  scale_edge_width_continuous(range = c(0.1, 2)) +
  scale_size_continuous(range = c(1, 7))
plot
```
]

.pull-right[
![](data:image/png;base64,#MD7115-Xiaotao_files/figure-html/unnamed-chunk-86-1.png)
]

---

# Visualization of GO term clusters

.pull-left[

``` r
plot =
  plot +
  ggraph::theme_graph() +
  theme(
    plot.background = element_rect(fill = "transparent", color = NA),
    panel.background = element_rect(fill = "transparent", color = NA),
    legend.position = "right",
    legend.background = element_rect(fill = "transparent", color = NA)
  )
plot
```
]

.pull-right[
![](data:image/png;base64,#MD7115-Xiaotao_files/figure-html/unnamed-chunk-88-1.png)
]

---

# Save the plot

``` r
extrafont::loadfonts()
ggplot2::ggsave(plot,
                filename = "similarity_network_plot.pdf",
                width = 9,
                height = 7)
```

---

# What is a tree in biology?

What's the tree?

* **Taxonomic trees (classification hierarchies)**

* **Phylogenetic trees (evolutionary relationships)**

* Gene trees (gene family evolution)

* Species trees (speciation events)

* Population trees (population structure)

---

# Taxonomic trees

.pull-left[

**Purpose:**

* Organize and categorize organisms based on shared characteristics

* Represent hierarchical relationships between taxonomic groups

* Visualize abundance and diversity of taxonomic groups

**Structure:**

* Hierarchical: Each level represents a taxonomic rank

* Nested: Lower ranks are contained within higher ranks

* Non-overlapping: An organism belongs to exactly one group at each rank

]

.pull-right[

<div style="display: flex; justify-content: center; align-items: center;">
    <img src="https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/3_materials/images/Fig5_primer.png" alt="Description of the image" width="100%">
]

---

# Tree terminology

* Root (domain level, such as Bacteria)

* Internal nodes (phylum, class, order, family, genus)

* Leaf nodes (species or ASVs or OTUs in microbiome data)

* Branches connecting taxonomic levels

---

# Taxonomic hierarchies in microbiology

.pull-left[

**Standard taxonomic ranks:**

* Domain (Bacteria, Archaea)

* Phylum

* Class

* Order

* Family

* Genus

* Species
]

.pull-right[

**Microbiome-specific concepts:**

* OTUs (Operational Taxonomic Units)

* ASVs (Amplicon Sequence Variants)

* MAGs (Metagenome-Assembled Genomes)

]

---

# Microbiome data structure

**Data formats:**

* OTU/ASV tables with taxonomic annotations

* Taxonomic abundance tables at different levels

---

# Taxonomic tree representation

.pull-left[

**Key attributes in microbiome taxonomic trees:**

* Hierarchical classification

* Relative abundance (often mapped to node/branch size)

* Diversity metrics

* Sample metadata

]

.pull-right[

**Common visualization types:**

* Traditional hierarchical trees

* Circular/radial dendrograms

* Treemaps

* Sunburst plots

* Heat trees

]

---

# Create tree data from phyloseq object

.pull-left[

`phyloseq` is a popular package for microbiome data analysis in R.

``` r
library(phyloseq)
data(GlobalPatterns)
GP <- prune_taxa(taxa_sums(GlobalPatterns) > 0, GlobalPatterns)
GP.chl <- subset_taxa(GP, Phylum=="Chlamydiae")

plot_tree(GP.chl, color="SampleType", 
          shape="Family", 
          label.tips="Genus", size="Abundance") + 
  ggtitle("tree annotation using phyloseq")
```
]

.pull-right[

![](data:image/png;base64,#MD7115-Xiaotao_files/figure-html/unnamed-chunk-91-1.png)
]

---

# Create tree data from phyloseq object

.pull-left[

``` r
library(scales)
library(ggtree)
p <- ggtree(GP.chl, ladderize = FALSE) +
  geom_text2(aes(subset = !isTip, label = label),
             hjust = -.2,
             size = 4) +
  geom_tiplab(aes(label = Genus), hjust = -.3) +
  geom_point(aes(
    x = x + hjust,
    color = SampleType,
    shape = Family,
    size = Abundance
  ),
  na.rm = TRUE) +
  scale_size_continuous(trans = log_trans(5)) +
  theme(legend.position = "right") +
  ggtitle("reproduce phyloseq by ggtree")
print(p)
```
]

.pull-right[
![](data:image/png;base64,#MD7115-Xiaotao_files/figure-html/unnamed-chunk-93-1.png)
]

---

# Example: Microbiome data

.pull-left[

The taxonomic tree of microbiome data.

]

.pull-right[
<div style="display: flex; justify-content: center; align-items: center;">
    <img src="https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/main/3_materials/images/cell_host_microbe.png" alt="Description of the image" width="80%">
</div>

> Zhou et al. Cell Host & Microbe 2024

]

---

# Load libraries

``` r
if (!require("microbiomeViz", quietly = TRUE)) {
  remotes::install_github("lch14forever/microbiomeViz")
}
if (!require("phyloseq", quietly = TRUE)) {
  BiocManager::install("phyloseq")
}
library(phyloseq)
library(microbiomeViz)
library(tidyverse)
library(ggtreeExtra)
library(ggtree)
```

---

# Read the data

``` r
personalized_score = 
  readr::read_csv("https://raw.githubusercontent.com/jaspershen-lab/MD7115-NTU/refs/heads/main/2_demo_data/personalized_score.csv")

head(personalized_score)
```

```
## # A tibble: 6 × 22
##   genus           class     fc1_p  fc2_p      fc3_p between_mean1 within_mean1
##   <chr>           <chr>     <dbl>  <dbl>      <dbl>         <dbl>        <dbl>
## 1 Acetivibrio     Stool 6.73e-  2 NA     NA                 0.754        0.649
## 2 Acidaminococcus Stool 3.73e- 60  0.411  6.69e- 30         0.813        0.510
## 3 Acutalibacter   Stool 1.90e- 65  0.552  5.07e- 46         0.941        0.665
## 4 Adlercreutzia   Stool 1.41e-166  1.00   1.28e-189         0.947        0.702
## 5 Aestuariispira  Stool 5.08e- 10 NA     NA                 0.727        0.626
## 6 Agathobacter    Stool 0          1      0                 0.763        0.575
## # ℹ 15 more variables: family_mean1 <dbl>, total_number_between <dbl>,
## #   total_number_family <dbl>, total_number_within <dbl>,
## #   number_1_between <dbl>, number_1_family <dbl>, number_1_within <dbl>,
## #   freq_between <dbl>, freq_family <dbl>, freq_within <dbl>,
## #   fc1_p_adjust <dbl>, fc2_p_adjust <dbl>, fc3_p_adjust <dbl>, fc1 <dbl>,
## #   fc2 <dbl>
```

---

# Read the data

``` r
load(url("https://github.com/jaspershen-lab/MD7115-NTU/raw/main/2_demo_data/microbiome_data.RData"))

microbiome_data
```

```
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 411 taxa and 2 samples ]
## sample_data() Sample Data:       [ 2 samples by 1 sample variables ]
## tax_table()   Taxonomy Table:    [ 411 taxa by 7 taxonomic ranks ]
```

---

# Create a phylogenetic tree

``` r
tree =
  microbiomeViz::parsePhyloseq(
    physeq = microbiome_data,
    use_abundance = FALSE,
    node.size.scale = 0,
    node.size.offset = 0
  )
tree
```

```
## 'treedata' S4 object'.
## 
## ...@ phylo:
## 
## Phylogenetic tree with 411 tips and 311 internal nodes.
## 
## Tip labels:
##   g__Unclassified_Gp16, g__Unclassified_Gp3, g__Unclassified_Gp6,
## g__Blastocatella, g__Stenotrophobacter, g__Iamia, ...
## Node labels:
##   r__Root, k__Bacteria, p__Acidobacteria, p__Actinobacteria,
## p__Armatimonadetes, p__Bacteroidetes, ...
## 
## Rooted; includes branch lengths.
## 
## with the following features available:
##   'taxaAbun', 'nodeSize', 'nodeClass'.
## 
## # The associated data tibble abstraction: 722 × 6
## # The 'node', 'label' and 'isTip' are from the phylo tree.
##     node label                            isTip taxaAbun nodeSize nodeClass
##    <int> <chr>                            <lgl>    <dbl>    <dbl> <fct>    
##  1     1 g__Unclassified_Gp16             TRUE         5        0 g        
##  2     2 g__Unclassified_Gp3              TRUE         5        0 g        
##  3     3 g__Unclassified_Gp6              TRUE         5        0 g        
##  4     4 g__Blastocatella                 TRUE         5        0 g        
##  5     5 g__Stenotrophobacter             TRUE         5        0 g        
##  6     6 g__Iamia                         TRUE         5        0 g        
##  7     7 g__Unclassified_Iamiaceae        TRUE         5        0 g        
##  8     8 g__Unclassified_Acidimicrobiales TRUE         5        0 g        
##  9     9 g__Actinomyces                   TRUE         5        0 g        
## 10    10 g__Actinotignum                  TRUE         5        0 g        
## # ℹ 712 more rows
```

---

# Visualize the tree

.pull-left[

``` r
raw_plot =
  tree.backbone(
    tree = tree,
    size = 0.3,
    shape = 16,
    layout = "circular",
    fill = "black",
    color = "black"
  )
raw_plot
```
]

.pull-right[
![](data:image/png;base64,#MD7115-Xiaotao_files/figure-html/unnamed-chunk-99-1.png)
]

---

# Add nodes

``` r
raw_plot$data$nodeClass2 =
  as.character(raw_plot$data$nodeClass)

raw_plot$data$nodeClass2[is.na(raw_plot$data$nodeClass2)] = "Root"
raw_plot$data$nodeClass2[raw_plot$data$nodeClass2 == "f"] = "Family"
raw_plot$data$nodeClass2[raw_plot$data$nodeClass2 == "c"] = "Class"
raw_plot$data$nodeClass2[raw_plot$data$nodeClass2 == "o"] = "Order"
raw_plot$data$nodeClass2[raw_plot$data$nodeClass2 == "p"] = "Phylum"
raw_plot$data$nodeClass2[raw_plot$data$nodeClass2 == "g"] = "Genus"
raw_plot$data$nodeClass2[raw_plot$data$nodeClass2 == "k"] = "Kingdom"

raw_plot$data$nodeSize2 = raw_plot$data$nodeSize
raw_plot$data$nodeSize2[raw_plot$data$nodeClass2 == "Root"] = 3.5
raw_plot$data$nodeSize2[raw_plot$data$nodeClass2 == "Kingdom"] = 3
raw_plot$data$nodeSize2[raw_plot$data$nodeClass2 == "Phylum"] = 2.5
raw_plot$data$nodeSize2[raw_plot$data$nodeClass2 == "Class"] = 2
raw_plot$data$nodeSize2[raw_plot$data$nodeClass2 == "Order"] = 1.5
raw_plot$data$nodeSize2[raw_plot$data$nodeClass2 == "Family"] = 1
raw_plot$data$nodeSize2[raw_plot$data$nodeClass2 == "Genus"] = 0.5
```

---

# Add nodes

.pull-left[

``` r
# #####add node point
raw_plot =
  raw_plot +
  ggtree::geom_point2(aes(color = nodeClass2, size = nodeSize2), show.legend = FALSE) +
  ggnewscale::new_scale(new_aes = "fill")

raw_plot
```
]

.pull-right[
![](data:image/png;base64,#MD7115-Xiaotao_files/figure-html/unnamed-chunk-102-1.png)
]

---