Introduction

AlphaFold is a powerful tool for predicting protein structures from sequence data. The third and newest version, AlphaFold 3, can also predict how proteins interact with each other and with other molecules, such as ions, from sequence data. AlphaFold 3’s outputs include iPTM (interface predicted template modeling score, usually written ipTM) and pTM (predicted template modeling score), which are critical for judging how much confidence to place in a predicted structure or complex. More formally, iPTM measures the model’s confidence in the predicted relative positions of the chains in a complex and is often used as an indicator of interaction potential, while pTM measures confidence in the overall predicted structure.

The objective of this study was to develop a streamlined, automated workflow for AlphaFold 3 runs to predict protein interactions and process iPTM and pTM data efficiently. This workflow was ultimately applied to evaluate the interactions of proteins identified as either enriched or de-enriched in Endo-U immunoprecipitation experiments.

In Mus musculus, or the common house mouse, Endo-U serves as an RNA processing enzyme with roles in RNA cleavage and turnover. By studying Endo-U’s interactions with other proteins and/or ions, we can better understand how to regulate gene expression and RNA stability or processing.

Using existing mass spectrometry data, candidate proteins for stable interaction with Endo-U (enriched proteins) and internal negative controls (de-enriched proteins) were identified. AlphaFold 3 predictions for these candidates then allowed us to test whether the predicted interaction confidence differed significantly between the two groups.

Methodology and Workflow

We began by streamlining a Google Sheets workflow to organize, automate, and generate JSON files for AlphaFold input.

Figure 1. JSON Generator Page of Google Sheets Method.

Figure 2. Lookup Page of Google Sheets Method.

This Google Sheets method was adapted from Yuri Malina at UC Berkeley’s HHMI Meyer Lab. Our adapted version is available publicly here.

Briefly, the JSON Generator Page generates formatted JSON files based on the inputs to columns D through H. There are spaces to add ions, DNA sequences, or ligands to each build. Multiple JSON files may be run at once through the outputs in row 1. The sequences are found automatically through the Protein Lookup Table, shown in Figure 2. These sequences, along with the name and length of each protein or ion, are derived from UniProt.

After adapting this Google Sheets workflow for server run automation, our aim was to generate positive and negative control tests to evaluate and standardize the iPTM and pTM values that AlphaFold 3 produces.
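To make the JSON generation step concrete, the following is a minimal sketch in base R of how one build could be assembled from a lookup table. It is not the Google Sheets implementation. The field names (`name`, `modelSeeds`, `sequences`, `proteinChain`, `ion`, `count`) are assumptions based on the AlphaFold Server job format as we understand it, and the real format may require additional fields. The helper functions, the protein names, and the sequences are hypothetical placeholders.

```r
# Sketch only: assemble an AlphaFold Server-style job as a JSON string.
# Field names are ASSUMED from the public server job format; verify before use.
# Inputs must not contain double quotes, since no escaping is done here.

protein_entry <- function(sequence, count = 1) {
  sprintf('{"proteinChain": {"sequence": "%s", "count": %d}}',
          sequence, as.integer(count))
}

ion_entry <- function(ion, count = 1) {
  sprintf('{"ion": {"ion": "%s", "count": %d}}', ion, as.integer(count))
}

make_job <- function(name, entries) {
  sprintf('{"name": "%s", "modelSeeds": [], "sequences": [%s]}',
          name, paste(entries, collapse = ", "))
}

# Hypothetical lookup table: protein name -> sequence (placeholder sequences)
lookup <- c(PROT_A = "MKTAYIAKQR", PROT_B = "MSDNGELEDK")

# One build: 2 copies of PROT_A, 1 copy of PROT_B, 1 calcium ion
job <- make_job(
  "demo_build",
  c(protein_entry(lookup[["PROT_A"]], count = 2),
    protein_entry(lookup[["PROT_B"]]),
    ion_entry("CA"))
)

# The server accepts a list of jobs, so wrap the single job in brackets
json_text <- paste0("[", job, "]")
cat(json_text, "\n")
```

Several builds could be batched by pasting multiple `make_job()` results together with commas before wrapping them in brackets, which mirrors running several JSON outputs at once.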

To that end, we first randomly created three groups of proteins from UniProt protein data to serve as negative controls. To do this, we downloaded all FASTA files from UniProt under the “REVIEWED” category, assigned a number to each FASTA file, and randomly drew numbers between the lower and upper bounds to select the proteins in each group. For positive controls, we used known protein interactions within Mus musculus specifically and ran those interactions as builds on the AlphaFold 3 server.
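The random selection of negative-control proteins can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the number of numbered entries (`n_entries`), the seed, and the helper `draw_group()` are all hypothetical.

```r
# Sketch only: draw random entry numbers to form negative-control groups.
set.seed(2024)          # hypothetical seed, for reproducibility of the sketch
n_entries <- 17000L     # hypothetical count of numbered reviewed FASTA entries

# Draw `group_size` distinct entry numbers between 1 and n_entries
draw_group <- function(n_entries, group_size = 2L) {
  sample.int(n_entries, size = group_size)  # samples without replacement
}

# Three negative-control groups of two proteins each
negative_groups <- lapply(1:3, function(i) draw_group(n_entries))
names(negative_groups) <- paste0("neg_", 1:3)
print(negative_groups)
```

Each drawn number would then be mapped back to its FASTA entry to obtain the sequences for the build.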

Using this method, we generated four positive control groups with high iPTM values (>0.75) and a mean pTM value of about 0.71. We also randomly generated three negative control groups, with iPTM values between 0.19 and 0.43 and pTM values between 0.31 and 0.40.

iPTM and pTM Values for Control Runs

| Build_Name | Inputs | iPTM_Value | pTM_Value | Type |
|---|---|---|---|---|
| Run 1 | 2x HBA_MOUSE, 1x HBB2_MOUSE, 1x HBB1_MOUSE, 1x Ca2+ | 0.76 | 0.81 | + |
| Run 2 | 1x CCNB1_MOUSE, 1x CDK1_MOUSE | 0.87 | 0.75 | + |
| Run 3 | 1x IGKC_MOUSE, 1x IGH1M_MOUSE | 0.94 | 0.44 | + |
| Run 4 | 1x KAPCA_MOUSE, 1x IPKA_MOUSE | 0.87 | 0.85 | + |
| Run 5 | 1x Q8BIC7_MOUSE, 1x SYT7_MOUSE | 0.34 | 0.31 | - |
| Run 6 | 1x KLRD1_MOUSE, 1x DUS15_MOUSE | 0.43 | 0.40 | - |
| Run 7 | 1x GCNA_MOUSE, 1x MA1B1_MOUSE, 1x OTUD3_MOUSE | 0.19 | 0.40 | - |

The high iPTM values for the positive control groups make sense because the proteins we chose for those groups are confirmed to interact through wet lab methods. Their pTM values were also high, with the exception of Run 3 (0.44). The low iPTM and pTM values for the negative control groups also make sense because we would expect the likelihood of an interaction between randomly chosen proteins to be quite slim. After verifying these controls, we moved on to the analysis of experimental data for Endo-U.

In our experimental data, Endo-U was immunoprecipitated from both the cytoplasmic and nuclear extracts of cells because Endo-U is present in both the cytoplasm and the nucleus, indicating that it could have different interacting proteins in the two cell compartments. Using a knockout control to clean up the immunoprecipitated protein data, all of the proteins in each sample were identified and quantified by mass spectrometry. The enrichment value, or ratio of each protein’s amount in the wild-type vs. knockout sample, indicates likely candidate proteins for stable interaction with Endo-U. Enriched proteins have a log2 ratio greater than 0, while de-enriched proteins have a negative log2 ratio. These values were found for each protein in the experimental data. Following this, our goal was to use our automated Google Sheets method to run AlphaFold 3 builds, with Endo-U as the interacting partner, for every experimental protein found to be statistically significant in the cytoplasmic extract or the nuclear extract. The results of this investigation can be found here, as well as visualized below.
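As a small worked example of the enrichment value described above, the sketch below computes a log2 wild-type/knockout ratio and labels each protein by its sign. The protein names and abundance values are hypothetical placeholders, not experimental data, and the column names are assumptions made for illustration.

```r
# Sketch only: log2 enrichment ratio from wild-type vs. knockout abundances.
# All values below are made-up placeholders, not measurements.
quant <- data.frame(
  protein   = c("P1", "P2", "P3"),
  wild_type = c(800, 100, 50),
  knockout  = c(100, 100, 200)
)

# log2 ratio > 0: more abundant in wild-type (enriched)
# log2 ratio < 0: more abundant in knockout (de-enriched)
quant$log2_ratio <- log2(quant$wild_type / quant$knockout)

quant$status <- ifelse(quant$log2_ratio > 0, "enriched",
                ifelse(quant$log2_ratio < 0, "de-enriched", "unchanged"))

print(quant)
```

Using the log2 scale makes the measure symmetric: an 8-fold enrichment gives +3 and an 8-fold depletion gives -3.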

Discussion and Evaluation

# load plotting library (mean_cl_normal also requires the Hmisc package to be installed)
library(ggplot2)

# df control: iPTM values for Runs 1-7 (four positive, three negative controls)
df_control <- data.frame(
  group = c('pos', 'pos', 'pos', 'pos', 'neg', 'neg', 'neg'),
  iPTM = c(0.76, 0.87, 0.94, 0.87, 0.34, 0.43, 0.19)
)

# barplot for controls: bars = group means, points = individual runs,
# error bars = 95% confidence interval of the mean
ggplot(df_control, aes(x = group, y = iPTM)) +
  geom_bar(stat = "summary", fun = "mean", fill = "lightblue", color = "black", width = 0.6) +
  geom_point(position = position_jitter(width = 0.15), size = 3, alpha = 0.7) +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.2) +
  theme_minimal() +
  labs(title = "iPTMs for Positive and Negative Controls", x = "Group", y = "iPTM") +
  theme(plot.title = element_text(hjust = 0.5))

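In this bar plot, iPTM values for the negative and positive controls are visualized. Bars show the group means, points show the individual runs, and error bars show the 95% confidence interval of the mean.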

# df experimental
df_experimental <- data.frame(
  group = rep(c('enriched', 'deenriched'), each = 11),
  iPTM = c(0.15, 0.09, 0.17, 0.36, 0.3, 0.17, 0.2, 0.6, 0.33, 0.14, 0.26, 0.27, 0.19, 0.2, 0.12, 0.39, 0.21, 0.38, 0.15, 0.17, 0.17, 0.52)
)

# barplot experimental
ggplot(df_experimental, aes(x = group, y = iPTM)) +
  geom_bar(stat = "summary", fun = "mean", fill = "gray", color = "black", width = 0.6) +
  geom_point(position = position_jitter(width = 0.15), size = 3, alpha = 0.7) +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.2) +
  theme_minimal() +
  labs(title = "iPTMs for Enriched and DeEnriched Experimental Data", x = "Enriched/DeEnriched", y = "iPTM") +
  theme(plot.title = element_text(hjust = 0.5))

In this bar plot, iPTM values for enriched and de-enriched proteins are visualized, with each point representing one AlphaFold 3 build with Endo-U.

Following the visualization of the iPTM values for both the control groups and the experimental groups, we used Welch’s two-sample t-test (the default in R’s t.test) to determine whether there were statistically significant differences in the iPTM data. We ran one t-test between the positive and negative control iPTM values and another between the enriched and de-enriched iPTM values.

# t-test for controls
pos_control <- df_control[df_control$group == 'pos', "iPTM"]
neg_control <- df_control[df_control$group == 'neg', "iPTM"]
t_test_result <- t.test(pos_control, neg_control)

# t-test experimental
enriched_experimental <- df_experimental[df_experimental$group == 'enriched', "iPTM"]
deenriched_experimental <- df_experimental[df_experimental$group == 'deenriched', "iPTM"]
t_test_result_experimental <- t.test(enriched_experimental, deenriched_experimental)

print(t_test_result)
## 
##  Welch Two Sample t-test
## 
## data:  pos_control and neg_control
## t = 6.8124, df = 3.1227, p-value = 0.005686
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.2932527 0.7867473
## sample estimates:
## mean of x mean of y 
##      0.86      0.32
print(t_test_result_experimental)
## 
##  Welch Two Sample t-test
## 
## data:  enriched_experimental and deenriched_experimental
## t = 0, df = 19.638, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1200915  0.1200915
## sample estimates:
## mean of x mean of y 
## 0.2518182 0.2518182
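To show where the Welch statistics for the controls come from, the sketch below recomputes t, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand from the control iPTM values given above. The variable names are ours and are introduced only for illustration.

```r
# Sketch only: Welch two-sample t-test computed by hand for the control iPTMs.
pos <- c(0.76, 0.87, 0.94, 0.87)  # positive controls (Runs 1-4)
neg <- c(0.34, 0.43, 0.19)        # negative controls (Runs 5-7)

# Squared standard error of each group mean (variances are NOT pooled)
se2_pos <- var(pos) / length(pos)
se2_neg <- var(neg) / length(neg)

# Welch t statistic
t_stat <- (mean(pos) - mean(neg)) / sqrt(se2_pos + se2_neg)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (se2_pos + se2_neg)^2 /
  (se2_pos^2 / (length(pos) - 1) + se2_neg^2 / (length(neg) - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

print(round(c(t = t_stat, df = df_welch, p = p_value), 4))
```

The non-integer degrees of freedom (about 3.12) reflect both the small sample sizes and the unequal group variances, which is why the confidence interval for the difference in means is wide.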

Discussion

We can see a statistically significant difference in iPTM between the positive and negative controls (p ≈ 0.006), which supports the use of AlphaFold 3 iPTM values to distinguish known interactions from random protein groupings, although with only four positive and three negative control runs this validation is limited. In contrast, the t-test for our experimental data shows no significant difference between the enriched and de-enriched proteins (p = 1, with both group means equal to about 0.25). Nevertheless, AlphaFold 3 predicted a comparatively high iPTM value of 0.6 for the interaction between HNRPD_MOUSE and Endo-U, although this is still below the values above 0.75 seen for our positive controls. This points to an area of future exploration, with possible wet lab confirmation of an interaction between HNRPD_MOUSE and Endo-U.

Our automation pipeline streamlines AlphaFold 3 runs and iPTM/pTM analysis. Our results for Endo-U provide insights into its protein interactions and will inform future experiments in this area.

Acknowledgements

Thank you to Dr. Fedor Karginov for his support throughout this work.