This project explores the structure of the Saccharomyces
cerevisiae (yeast) protein-protein interaction network using the
yeast dataset from the igraphdata package in
R. Proteins often interact as part of larger biological processes, and
this network reveals how those interactions form structural and
functional modules within cells.
The goal of this project is to use real-world biological data to
strengthen my skills in network visualization and analysis. I analyze
key structural features of the network: degree centrality, betweenness
centrality, connected components, and community structure. I also create
visualizations using ggraph to show how tightly clustered and highly
connected proteins play key roles in the yeast interaction network.
This project investigates:
- Which proteins in the yeast network act as key hubs based on centrality?
- What community structures exist within the network, and how do they relate to biological function?
- How can different visual layouts help communicate these patterns effectively?
Dataset description:
- Nodes (vertices): proteins
- Edges: physical interactions between proteins
- Size: 2,361 proteins and 7,182 edges
- Source: Gabor Csardi (2022), loaded by igraphdata
- Link: igraphdata package
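The code that creates `g` is not shown in this write-up. The following is a minimal sketch of how the dataset could be loaded, assuming the igraph, igraphdata, and tidygraph packages are installed. The `upgrade_graph()` call and the names `n_proteins` and `n_interactions` are my own additions for illustration. Printing the counts is a quick way to check the size quoted above against the installed package version.

```r
library(igraph)
library(igraphdata)
library(tidygraph)

# Load the yeast interaction network shipped with igraphdata.
data(yeast, package = "igraphdata")

# Assumption: the stored object may predate the installed igraph version,
# so convert it to the current internal format before using it.
yeast <- upgrade_graph(yeast)

# Wrap the igraph object as a tbl_graph so tidygraph verbs work on it.
g <- as_tbl_graph(yeast)

# Report the number of nodes and edges actually present in the object.
n_proteins <- gorder(g)
n_interactions <- gsize(g)
cat("Proteins:", n_proteins, " Interactions:", n_interactions, "\n")
```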
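g <- g %>% mutate(degree = centrality_degree())
# g$degree looks up a graph-level attribute and returns NULL on a tbl_graph,
# so summarize the node column instead.
g %>% activate(nodes) %>% pull(degree) %>% summary()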
Interpretation: Degree centrality shows how many direct interactions each protein has. Proteins with high degree values may act as hubs in biological processes.
| Systematic ID | Standard Name | Degree | Biological Role |
|---|---|---|---|
| YPR110C | RPC40 | 118 | RNA polymerase subunit (transcription) |
| YPL131W | RPL5 | 115 | Ribosomal protein (translation) |
| YNL178W | RPS3 | 114 | Ribosomal protein (DNA repair) |
| YIL021W | RPB3 | 113 | RNA polymerase II subunit |
| YOL127W | RPL25 | 113 | Ribosomal protein |
# Lollipop chart of the ten highest-degree proteins
g %>%
  as_tibble() %>%                              # node table of the tbl_graph
  arrange(desc(degree)) %>%
  slice_head(n = 10) %>%
  mutate(name = fct_reorder(name, degree)) %>% # order the y axis by degree
  ggplot(aes(x = degree, y = name)) +
  geom_segment(aes(xend = 0, yend = name),
               color = "gray50", linewidth = 0.8) +
  geom_point(aes(size = degree, color = degree),
             shape = 19, alpha = 0.8) +
  geom_text(aes(label = degree),
            hjust = -0.5, size = 3.5, color = "black") +
  scale_color_viridis_c(option = "plasma", begin = 0.2) +
  scale_size_continuous(range = c(3, 6)) +
  labs(title = "Top 10 Hub Proteins in Yeast Network",
       subtitle = "Size and color reflect degree centrality",
       x = "Number of Interactions (Degree)",
       y = "Protein",
       caption = "Data: Gabor Csardi (2022)") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        panel.grid.major.y = element_blank(),
        plot.title = element_text(face = "bold"))
Interpretation: Proteins group into tight clusters, potentially representing functional protein complexes. Fast-greedy modularity optimization helps identify these communities.
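The plotting code for this figure is not shown, so here is a hedged sketch of the general approach on a toy graph rather than the yeast data. It assumes igraph, tidygraph, and ggraph are installed. The graph `toy` (two triangles joined by one bridge edge), the column name `community`, and the layout choice are all hypothetical stand-ins for illustration.

```r
library(igraph)
library(tidygraph)
library(ggraph)

# Toy stand-in for the yeast network: two triangles joined by the edge C-D.
toy <- graph_from_literal(A-B, B-C, A-C, D-E, E-F, D-F, C-D) %>%
  as_tbl_graph()

# Assign each node to a community using fast-greedy modularity optimization.
toy <- toy %>%
  activate(nodes) %>%
  mutate(community = group_fast_greedy())

# Count how many communities were found.
n_communities <- length(unique(igraph::V(toy)$community))

# Draw the graph with a force-directed layout, colouring nodes by community.
set.seed(1)  # the "fr" layout is randomised
p <- ggraph(toy, layout = "fr") +
  geom_edge_link(colour = "gray60") +
  geom_node_point(aes(colour = factor(community)), size = 5) +
  geom_node_text(aes(label = name), vjust = -1) +
  labs(colour = "Community") +
  theme_void()
```

Printing `p` draws the figure. On the yeast network the same pipeline would start from `g` instead of `toy`, although with thousands of nodes the labels would likely need to be dropped or restricted to hubs.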
Interpretation: Instead of a dense dendrogram, this plot tracks the modularity score at each merge step of the clustering. A pink vertical line marks the step with peak modularity, indicating the best community split found during clustering.
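A plot of this kind can be sketched as follows. This is an illustration on a toy graph, not the code behind the figure, and it assumes igraph and ggplot2 are installed. It relies on `cluster_fast_greedy()` returning a `modularity` vector whose first entry is the score before any merge; the names `toy`, `trace`, and `peak` are hypothetical.

```r
library(igraph)
library(ggplot2)

# Toy stand-in for the yeast network: two triangles joined by the edge C-D.
toy <- graph_from_literal(A-B, B-C, A-C, D-E, E-F, D-F, C-D)

# Fast-greedy clustering records the modularity after every merge.
fc <- cluster_fast_greedy(toy)

# One row per stage: step 0 is "every node in its own community".
trace <- data.frame(
  merge_step = seq_along(fc$modularity) - 1,
  modularity = fc$modularity
)

# The stage with the highest modularity defines the reported partition.
peak <- trace[which.max(trace$modularity), ]

p <- ggplot(trace, aes(x = merge_step, y = modularity)) +
  geom_line() +
  geom_point() +
  geom_vline(xintercept = peak$merge_step, colour = "pink", linewidth = 1) +
  labs(x = "Merge step", y = "Modularity") +
  theme_minimal()
```

For the toy graph the curve rises while nodes inside each triangle are merged and drops at the final step, when the two triangles are forced into one community.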
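g <- g %>% mutate(betweenness = centrality_betweenness())
# g$betweenness looks up a graph-level attribute and returns NULL on a tbl_graph,
# so summarize the node column instead.
g %>% activate(nodes) %>% pull(betweenness) %>% summary()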
Interpretation: Proteins with high betweenness scores likely serve as bridges between clusters, helping connect different biological modules.
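To make the bridging idea concrete, here is a small sketch on a toy graph, assuming only that igraph is installed. The graph `toy` and the name `bridge_scores` are hypothetical; the example calls `igraph::betweenness()` directly rather than going through tidygraph.

```r
library(igraph)

# Two triangles (A, B, C) and (D, E, F) joined only by the edge C-D.
toy <- graph_from_literal(A-B, B-C, A-C, D-E, E-F, D-F, C-D)

# Betweenness counts the shortest paths that pass through each node.
bridge_scores <- sort(betweenness(toy), decreasing = TRUE)
print(bridge_scores)
```

Every shortest path between the two triangles must pass through C and D, so these two nodes carry all of the betweenness while the remaining nodes score zero. In a real interaction network the contrast is less extreme, but the highest-scoring proteins play the same connecting role.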
The main takeaway from this project is how network visualization and analysis can bring clarity to complex biological systems like the yeast protein-protein interaction network. Using centrality measures and community detection helped highlight not only which proteins are most connected, but also how they cluster into functional modules. Visual tools like ggraph were essential in making these patterns more understandable and engaging.
While the analysis revealed meaningful clusters and hub proteins, it’s important to consider that some patterns might reflect experimental or dataset biases rather than true biological centrality. For instance, certain proteins may appear more connected simply because they’ve been studied more extensively.
If I had more time, I’d explore dynamic or interactive visualizations, possibly with Shiny or plotly, to let users explore the network themselves. It would also be interesting to compare multiple community detection algorithms side by side to see how the groupings change. A key limitation of this project is that the analysis relies on a single static dataset; protein networks are often context-dependent, and different environmental or cellular conditions might reveal interaction patterns not captured here.