1 Narrative Summary of the Project

This project explores the structure of the Saccharomyces cerevisiae (yeast) protein-protein interaction network using the yeast dataset from the igraphdata package in R. Proteins often interact as part of larger biological processes, and this network reveals how those interactions form structural and functional modules within cells.

The goal of this project is to use real-world biological data to strengthen my skills in network visualization and analysis. I analyze key structural features such as degree centrality, betweenness centrality, connected components, and community structure. I also create visualizations with ggraph to show how tightly clustered, highly connected proteins occupy central positions in the yeast interaction network.

This project investigates:

  1. Which proteins in the yeast network act as key hubs based on centrality?

  2. What community structures exist within the network, and how do they relate to biological function?

  3. How can different visual layouts help communicate these patterns effectively?

2 Setup: Packages, Data Loading, and Network Description

Dataset description:

  - Nodes (vertices): proteins
  - Edges: physical interactions between proteins
  - Size: 2,361 proteins and 7,182 edges
  - Source: Gabor Csardi (2022), loaded by igraphdata
  - Link: igraphdata package
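The setup code is not shown in this section. A minimal sketch of what it might look like follows. The package list is an assumption based on the functions used later in the report, and `g` is the object name the later code expects. The printed counts can be compared with the dataset description.

```r
# Packages assumed by the rest of the report (an assumption, not the original setup)
library(igraph)
library(igraphdata)
library(tidygraph)
library(ggraph)
library(tidyverse)

# Load the yeast protein interaction network shipped with igraphdata
data(yeast, package = "igraphdata")
yeast <- upgrade_graph(yeast)   # convert in case it was saved by an older igraph

# Convert to a tidygraph object so dplyr verbs work on nodes and edges
g <- as_tbl_graph(yeast)

# Basic size checks: number of nodes, edges, and connected components
vcount(g)
ecount(g)
count_components(g)
```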

3 Network Plot
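The plot itself is not reproduced in this section. As a rough sketch, an overview plot of the whole network could be drawn with ggraph as below. The Fruchterman-Reingold layout, colors, and sizes are illustrative choices, not necessarily the ones used for the original figure.

```r
library(igraph)
library(igraphdata)
library(tidygraph)
library(ggraph)
library(ggplot2)

data(yeast, package = "igraphdata")
g <- as_tbl_graph(upgrade_graph(yeast))

set.seed(42)  # force-directed layouts start from random positions
p_net <- ggraph(g, layout = "fr") +
  geom_edge_link(alpha = 0.1, colour = "gray50") +     # faint edges reduce clutter
  geom_node_point(size = 0.6, colour = "steelblue") +  # one point per protein
  labs(title = "Yeast protein-protein interaction network") +
  theme_void()
p_net
```

With a few thousand nodes, low edge alpha and small points keep the dense core readable.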

4 Network Analysis

4.1 Degree Centrality

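g <- g %>% mutate(degree = centrality_degree())
# degree is stored as a node attribute, so read it from the node table;
# g$degree looks up a graph-level attribute and returns NULL
g %>% as_tibble() %>% pull(degree) %>% summary()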

Interpretation: Degree centrality shows how many direct interactions each protein has. Proteins with high degree values may act as hubs in biological processes.
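To make the definition concrete, here is a small toy example (not part of the yeast data): a star graph in which one central node touches every other node. The graph and the name `star` are made up for illustration.

```r
library(igraph)
library(tidygraph)
library(dplyr)

# Toy star graph: vertex 1 is the hub, vertices 2 to 5 are leaves
star <- as_tbl_graph(make_star(5, mode = "undirected")) %>%
  mutate(degree = centrality_degree())

# Degree is the number of edges attached to each node
star_degree <- star %>% as_tibble() %>% pull(degree)
star_degree
# 4 1 1 1 1
```

The hub has degree 4 and every leaf has degree 1, which is the pattern that hub proteins show on a much larger scale.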

5 Biological Context: Key Hub Proteins

Top 5 Hub Proteins

| Systematic ID | Standard Name | Degree | Biological Role                        |
|---------------|---------------|--------|----------------------------------------|
| YPR110C       | RPC40         | 118    | RNA polymerase subunit (transcription) |
| YPL131W       | RPL5          | 115    | Ribosomal protein (translation)        |
| YNL178W       | RPS3          | 114    | Ribosomal protein (DNA repair)         |
| YIL021W       | RPB3          | 113    | RNA polymerase II subunit              |
| YOL127W       | RPL25         | 113    | Ribosomal protein                      |

5.1 Top 10 Most Connected Proteins

# Lollipop chart of the ten highest-degree proteins
g %>%
  as_tibble() %>%                               # node table with the degree column
  arrange(desc(degree)) %>% 
  slice_head(n = 10) %>%                        # keep the top 10 rows
  mutate(name = fct_reorder(name, degree)) %>%  # order the y-axis by degree (forcats)
  ggplot(aes(x = degree, y = name)) +
  # stem of each lollipop, drawn from 0 to the degree value
  geom_segment(aes(xend = 0, yend = name), 
               color = "gray50", linewidth = 0.8) +
  geom_point(aes(size = degree, color = degree), 
             shape = 19, alpha = 0.8) +
  # print the degree value to the right of each point
  geom_text(aes(label = degree), 
            hjust = -0.5, size = 3.5, color = "black") +
  scale_color_viridis_c(option = "plasma", begin = 0.2) +
  scale_size_continuous(range = c(3, 6)) +
  labs(title = "Top 10 Hub Proteins in Yeast Network",
       subtitle = "Size and color reflect degree centrality",
       x = "Number of Interactions (Degree)",
       y = "Protein",
       caption = "Data: Gabor Csardi (2022)") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        panel.grid.major.y = element_blank(),
        plot.title = element_text(face = "bold"))

5.2 Community Detection (Fast-Greedy)

Interpretation: Proteins group into tight clusters, potentially representing functional protein complexes. Fast-greedy modularity optimization helps identify these communities.
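The community detection code is not shown in this section. A minimal sketch using igraph's `cluster_fast_greedy()` could look like the following. Simplifying the graph first is an assumption made here because the algorithm does not accept multiple edges; the names `g_simple` and `comm` are illustrative.

```r
library(igraph)
library(igraphdata)

data(yeast, package = "igraphdata")

# Fast-greedy needs a graph without multiple edges, so drop duplicates and loops
g_simple <- simplify(upgrade_graph(yeast))

# Greedy agglomerative optimization of modularity
comm <- cluster_fast_greedy(g_simple)

length(comm)                                    # number of communities found
modularity(comm)                                # modularity of the chosen split
head(sort(sizes(comm), decreasing = TRUE), 5)   # sizes of the five largest communities

# Store the community label on each protein for later plotting or summaries
V(g_simple)$community <- as.integer(membership(comm))
```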

5.3 Fast-Greedy Modularity Score Plot

Interpretation: Instead of a dense dendrogram, this plot tracks the modularity score across merge steps. A pink vertical line marks the peak modularity, which indicates the best community split found during clustering.
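One way such a plot could be produced is sketched below. It relies on the `modularity` vector that `cluster_fast_greedy()` stores in its result, which holds one score per stage of the merge process. The object names and styling are illustrative, not the original plotting code.

```r
library(igraph)
library(igraphdata)
library(ggplot2)

data(yeast, package = "igraphdata")
comm <- cluster_fast_greedy(simplify(upgrade_graph(yeast)))

# One modularity score per stage of the agglomerative merge process
mod_df <- data.frame(
  stage      = seq_along(comm$modularity),
  modularity = comm$modularity
)

# Stage at which modularity is highest
peak_stage <- mod_df$stage[which.max(mod_df$modularity)]

p_mod <- ggplot(mod_df, aes(x = stage, y = modularity)) +
  geom_line() +
  geom_vline(xintercept = peak_stage, colour = "pink", linewidth = 1) +
  labs(x = "Merge stage", y = "Modularity",
       title = "Fast-greedy modularity across merge stages") +
  theme_minimal()
p_mod
```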

5.4 Betweenness Centrality

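g <- g %>% mutate(betweenness = centrality_betweenness())
# betweenness is stored as a node attribute, so read it from the node table;
# g$betweenness looks up a graph-level attribute and returns NULL
g %>% as_tibble() %>% pull(betweenness) %>% summary()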

Interpretation: Proteins with high betweenness scores likely serve as bridges between clusters, helping connect different biological modules.
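The bridging idea can be illustrated with a toy graph (made up for this example, not taken from the yeast data): two triangles joined through a single connector node. The connector has the lowest degree but the highest betweenness.

```r
library(igraph)
library(tidygraph)
library(dplyr)

# Two triangles (1-2-3 and 5-6-7) linked only through vertex 4
edges <- matrix(c(1, 2,  1, 3,  2, 3,
                  3, 4,  4, 5,
                  5, 6,  5, 7,  6, 7),
                ncol = 2, byrow = TRUE)

toy <- as_tbl_graph(graph_from_edgelist(edges, directed = FALSE)) %>%
  mutate(degree      = centrality_degree(),
         betweenness = centrality_betweenness())

toy_nodes <- toy %>% as_tibble()
toy_nodes$degree
# 2 2 3 2 3 2 2
toy_nodes$betweenness
# 0 0 8 9 8 0 0
```

Vertex 4 lies on every shortest path between the two triangles (3 x 3 = 9 pairs), so it scores highest on betweenness even though it has only two neighbors.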

6 Final Summary

The main takeaway from this project is how network visualization and analysis can bring clarity to complex biological systems like the yeast protein-protein interaction network. Using centrality measures and community detection helped highlight not only which proteins are most connected, but also how they cluster into functional modules. Visual tools like ggraph were essential in making these patterns more understandable and engaging.

While the analysis revealed meaningful clusters and hub proteins, it’s important to consider that some patterns might reflect experimental or dataset biases rather than true biological centrality. For instance, certain proteins may appear more connected simply because they’ve been studied more extensively.

If I had more time, I’d explore dynamic or interactive visualizations, possibly with Shiny or plotly, to let users explore the network themselves. It would also be interesting to compare multiple community detection algorithms side by side to see how the groupings change. A key limitation of this project is that the analysis relies on one static dataset; protein networks are often context-dependent, and different environmental or cellular conditions might reveal other interaction patterns not captured here.