---
title: "Module 11 Workshop Exercises"
output:
  html_notebook: default
  word_document: default
---

# Overview

In today's workshop we will be looking to:

- Explore various network databases

We will also see how to:

- Use R to access information in network databases


# Networks

Networks are everywhere, and they are formed by interactions between the elements in a network. Those elements might be computers, if we're thinking about the network of computers on campus, or they might be species in a complex ecosystem. Today, we'll explore various networks available in the biosciences.

Biological networks are often incomplete. Some networks are very complete, meaning we know all interactions, because we have designed them to be. Others, like interactions between organisms (microbes to macrobes) in coastal wetlands, may be severely underspecified - meaning we are missing many real interactions. As we add more layers of information to networks, we often change (sometimes drastically) the topology of a network.
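
As a rough illustration of how a single missing interaction can change topology, here is a minimal sketch using the `igraph` package (introduced later in this workshop). The node names and edges are made up for illustration; they are not from any real dataset.

```{r}
library(igraph)

# Hypothetical "observed" network: two groups with no known link between them
toy_observed <- graph_from_literal(A-B, B-C, D-E)

# The same hypothetical network once one more interaction (C-D) has been documented
toy_updated <- graph_from_literal(A-B, B-C, C-D, D-E)

components(toy_observed)$no  # number of disconnected pieces before
components(toy_updated)$no   # number of disconnected pieces after
```

In this toy case, one added edge merges two separate pieces into one connected network.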

Networks are useful tools to help us explain a complex world. Useful as networks are, they are not without faults. It is important to remember that our ideas about networks, and the theory and practice of analyzing them, are human inventions. Organisms in nature do not care about networks. Networks are an emergent property of organisms living their lives, day to day. Networks also imply something static about interactions. This can be very misleading, particularly in natural systems where species members or diets may be ephemeral (or short lived).

So, as we examine networks here or as you use them in your research, remain critical about what interactions in a network mean and whether the network being used is a fairly complete representation of possible interactions.

## Ecological network databases

There is a wide variety of ecological databases. Most public databases describe interactions of species in a given environment.

Here are a few:

- Mangal: https://www.mangal.io
- Web of Life: https://www.web-of-life.es/
- Globi: https://www.globalbioticinteractions.org/

The transition to public availability in ecological data is fairly new. As a result, many of these databases are underdeveloped compared to some of the extensive cellular pathway databases. However, there is a lot of interest in sharing these types of data for developing ecological models, so there is rapid growth in ecological data sharing.

## Pathway databases

Major sources of biological pathways are in cellular signaling, cellular metabolism, and nutrient cycling. We'll focus on a couple of pathway databases.

- Reactome: https://reactome.org/
- BioCyc: https://biocyc.org/ (we'll focus on EcoCyc)

### Reactome

Take a look at the Pathway Browser.

---

**Question**

What kind of information do you find available for Homo sapiens? List 3 or 4.

---

### BioCyc

Now, let's take a look at BioCyc. In the search field, input "glycolysis". Select "Glycolysis I".

---

**Question**

Describe 3 types of information you could gather from this database entry.

---

## Functional interaction databases

There are many different types of interactions in a cell that result in changes in cellular behavior. Some of the primary ones that draw much biological interest are protein-protein interactions and protein-metabolite interactions. Here, we'll examine a protein-protein interaction database, BioGRID.

- BioGRID: https://thebiogrid.org/

### BioGRID

BioGRID is a curated database of interactions defined by experiments and computation in published literature. There is also substantial integration into various comparative bioinformatics tools we have covered elsewhere in the course (NCBI Entrez, etc.).

Perform a search for human "ACE2".

---

**Question**

Examine the Network of ACE2 interactions in humans. What are the known chemical interactions based on the BioGRID data?

---


# Using R to analyze networks

There are a number of R packages to help you visualize and analyze data. There are also a growing number of interfaces that help you interact with large databases. We'll explore a few here.

Beyond some of the basic functionality I will demonstrate here, there are a number of great online resources. For example, check out the site put together by Katherine Ognyanova (https://kateto.net/networks-r-igraph/).

Our exercise will focus on analyzing networks from the Mangal database (https://mangal.io/).

```{r}
library(rmangal)
library(igraph)

mgs <- search_datasets("Aspen Parkland") # search for datasets involving "Aspen Parkland" environments (these datasets are from a marshland in southern Manitoba)

mgn <- get_collection(mgs) #download the networks from Mangal
```

```{r}
class(mgn)
mgn

ig1 <- as.igraph(mgn[[3]]) # convert the Mangal data into igraph format for analysis with the igraph package / we'll focus on a large interaction dataset
```

Let's look at what is in the graph (a.k.a. network). You will see it displayed as a matrix.
```{r}
head(ig1)
```

---

**Question**

What is your interpretation of the `.` and `1` values in this matrix? What are the rows and columns?

---

**What the heck?! What do those numbers mean?!** Great questions! Let's see if we can figure this out. We'll start by interrogating the original data we downloaded from Mangal.

```{r}
mgn[[3]] #we chose the 3rd dataset from our query.
```

If we look in our Environment tab, we'll see `mgn` is an Object variable meaning it has several types of information stored in it. To access the components of an object in R, we can use the `$` after our variable name. Let's try it!

In the space below, type `mgn[[3]]$` and wait a second after the `$` or press the down arrow key. You should see a dropdown menu appear full of variable names. These are the components of this object variable. Let's access the `nodes`. I think we have our answer! Converting between data formats sometimes results in data loss, as appeared to happen when we converted to the `igraph` format above. In this case, the data is not lost; it is just associated with our igraph object a bit differently. We'll see below.

```{r}
mgn[[3]]$nodes
```
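
To make the `$` idea concrete, here is a minimal sketch using a made-up named list. `toy_obj` and everything in it are hypothetical; they only imitate the general idea of an object with named components and are not the actual structure of `mgn[[3]]`.

```{r}
# A hypothetical stand-in for a downloaded network object: a named list
toy_obj <- list(
  nodes = data.frame(node_id = 1:3,
                     original_name = c("frog", "fish", "insect"),
                     stringsAsFactors = FALSE),
  interactions = data.frame(node_from = c(1, 2), node_to = c(3, 3))
)

names(toy_obj)               # the component names offered after the $
toy_obj$nodes                # pull out one component by name
toy_obj$nodes$original_name  # $ can be chained to reach a single column
```

Calling `names()` on an object is a quick way to see the same list that the dropdown menu shows.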

Now, let's see our network!

```{r}
plot(ig1)
```

The `igraph` package adds additional plotting capabilities for networks. As we see here, the plot looks a little cumbersome with the default settings. We can modify a variety of plotting parameters if we want, and those can be found in the `igraph` Help page.

Let's start by changing our node labels to something useful.

```{r}

vertLabels <- vertex_attr(ig1)$original_name #extract the original names from our data

plot(
  ig1,
  vertex.label=vertLabels
  )
```
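
The sketch below shows, on a made-up example, how extra node information can travel with an `igraph` object as vertex attributes. The tables `toy_nodes` and `toy_links` are hypothetical and are not the Mangal data; the point is only that attributes such as `original_name` are stored alongside the graph rather than in the matrix view.

```{r}
library(igraph)

# Hypothetical node table: the first column is the ID, other columns become attributes
toy_nodes <- data.frame(id = c("n1", "n2", "n3"),
                        original_name = c("frog", "fish", "insect"),
                        stringsAsFactors = FALSE)
# Hypothetical interaction table: one row per edge
toy_links <- data.frame(from = c("n1", "n2"), to = c("n3", "n3"),
                        stringsAsFactors = FALSE)

toy_g <- graph_from_data_frame(toy_links, directed = TRUE, vertices = toy_nodes)

vertex_attr_names(toy_g)    # which attributes are attached to the nodes
V(toy_g)$original_name      # same values as vertex_attr(toy_g)$original_name
as_adjacency_matrix(toy_g)  # the matrix view: rows and columns are nodes
```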

Here is an example that might make things look better.

```{r}
plot(ig1,
     vertex.label=vertLabels,
     vertex.label.cex=0.2, # reduce label size on nodes (a.k.a. vertices)
     vertex.size=5, # make the vertices a little smaller
     edge.arrow.size=0 # since we might not know anything about directionality
     )
```
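
Network layouts are usually computed with some randomness, so the same plot can look different each time it is drawn. The sketch below, on a small built-in example graph rather than our Mangal network, shows one way to fix the node positions so a figure can be reproduced. The seed value and the sizes are arbitrary choices.

```{r}
library(igraph)

set.seed(42)                            # fix the random seed so the layout is repeatable
toy_ring <- make_ring(10)               # a small built-in example graph with 10 nodes
toy_layout <- layout_with_fr(toy_ring)  # Fruchterman-Reingold coordinates, one row per node

plot(toy_ring,
     layout = toy_layout,     # reuse the same coordinates every time we plot
     vertex.size = 10,
     vertex.label.cex = 0.8)
```

Storing the layout in a variable also lets us draw several versions of a plot (for example, with different cluster colors) with the nodes in the same positions.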

There are wide variety of approaches and metrics for understanding networks... in fact there are whole areas of math and computer science dedicated to it. We'll just sample a few ideas here.

Maybe we want to try and identify clusters (or subnetworks). Here, we use edge betweenness, the number of shortest paths between nodes that run through each edge, as the metric to define clusters.
```{r}
ceb <- cluster_edge_betweenness(as.undirected(ig1))

plot(ceb, 
     ig1,
     vertex.label=vertLabels,
     vertex.label.cex=0.2, # reduce label size on nodes (a.k.a. vertices)
     vertex.size=5, # make the vertices a little smaller
     edge.arrow.size=0 # since we might not know anything about directionality
     )
```
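
To see what edge betweenness clustering is doing, it can help to try it on a network small enough to check by eye. The sketch below uses a made-up network of two triangles joined by one "bridge" edge; the node names are hypothetical.

```{r}
library(igraph)

# Hypothetical toy network: two triangles (A-B-C and D-E-F) joined by the edge C-D
toy_two <- graph_from_literal(A-B, B-C, A-C, D-E, E-F, D-F, C-D)

edge_betweenness(toy_two)  # the bridge edge carries the most shortest paths

toy_ceb <- cluster_edge_betweenness(toy_two)
membership(toy_ceb)  # which cluster each node was assigned to
sizes(toy_ceb)       # how many nodes ended up in each cluster
```

Because every shortest path between the two triangles has to cross the bridge, that edge has the highest betweenness, and cutting it separates the network into its two triangles.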

Or, we can use greedy algorithms to search for the most "optimal" clusters. The details are beyond our scope; I just want you to see that there are different methods out there. As we can see, the outcomes differ somewhat depending on the clustering approach.

```{r}
cfg <- cluster_fast_greedy(as.undirected(ig1))

plot(cfg, 
     as.undirected(ig1),
     vertex.label=vertLabels,
     vertex.label.cex=0.2, # reduce label size on nodes (a.k.a. vertices)
     vertex.size=5, # make the vertices a little smaller
     edge.arrow.size=0 # since we might not know anything about directionality
)
```
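
Rather than comparing two clusterings only by eye, we can put numbers on them. The sketch below does this on a made-up toy network (the node names are hypothetical), using modularity as a score for each clustering and normalized mutual information ("nmi") as a measure of how similar the two groupings are. The same calls should, in principle, also work on the `ceb` and `cfg` objects created above.

```{r}
library(igraph)

# Hypothetical toy network: two triangles joined by one edge
toy_cmp <- graph_from_literal(A-B, B-C, A-C, D-E, E-F, D-F, C-D)

toy_eb <- cluster_edge_betweenness(toy_cmp)  # edge betweenness clustering
toy_fg <- cluster_fast_greedy(toy_cmp)       # greedy modularity clustering

modularity(toy_eb)  # higher values mean denser connections within clusters
modularity(toy_fg)
compare(toy_eb, toy_fg, method = "nmi")  # 1 = identical groupings, 0 = unrelated
```
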
There are a number of other metrics that are often used to describe networks.

Things like distance can tell us about how spread out elements of the network are. This calculates the number of edges that would need to be traversed to get from one point to another.
```{r}
distances(ig1)
mean(distances(ig1)) # this reports the average over the whole distance matrix (including each node's zero distance to itself)
```
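
A small worked example can make the distance matrix easier to read. The sketch below uses a made-up chain of four nodes (hypothetical names), where the distances can be counted by hand. It also shows `mean_distance()`, which averages over pairs of distinct nodes rather than over every cell of the matrix.

```{r}
library(igraph)

toy_path <- graph_from_literal(A-B-C-D)  # a simple chain: A to B to C to D

distances(toy_path)        # matrix of shortest-path lengths; the diagonal is 0
mean(distances(toy_path))  # averages every cell, including the zero diagonal
mean_distance(toy_path)    # averages over pairs of distinct nodes only
diameter(toy_path)         # the longest shortest path in the network
```

The two averages differ because the matrix counts each pair twice and also includes each node's distance to itself, so it is worth being clear about which one is being reported.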

We may also want to understand how many connections/interactions each node has, on average.

```{r}
degree(ig1, mode="all")
hist(degree(ig1, mode="all"), # makes a simple histogram plot of these data
     main="Distribution of Node Connections",
     xlab="Number of connections at each node"
)

mean(degree(ig1, mode="all"))
```
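
When a network is directed, the connections at a node can also be split into incoming and outgoing edges. The sketch below uses a made-up food web in which edges are assumed to point from consumer to resource; that direction is a hypothetical convention for this example, not necessarily the one used in the Mangal data.

```{r}
library(igraph)

# Hypothetical toy food web: each row is one interaction (consumer -> resource)
toy_web <- graph_from_data_frame(
  data.frame(from = c("heron", "heron", "frog", "fish"),
             to   = c("frog", "fish", "insect", "insect"),
             stringsAsFactors = FALSE),
  directed = TRUE
)

degree(toy_web, mode = "out")  # edges leaving each node
degree(toy_web, mode = "in")   # edges arriving at each node
degree(toy_web, mode = "all")  # incoming plus outgoing edges
```
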
As we can imagine, networks are everywhere. There is an extraordinary body of theory behind analyzing networks, but it always requires a deeper understanding of the systems studied to interpret what these theoretical values and graphs mean in the real world.