Introduction

This script demonstrates how to map Ensembl Gene IDs to Ensembl Peptide ID Versions using the biomaRt package, match them against the UniProt mapping results (output from miPFinder2), and extract the peptide IDs found in both. The matched IDs are saved as a CSV file for further evolutionary analyses.

knitr::opts_chunk$set(echo = TRUE)
# Load Required Library
# We begin by loading the biomaRt library for interacting with the Ensembl database.
library(biomaRt)
# Step 1: Read the IGFinder Output File and Extract Ensembl Gene IDs

# We load a tab-delimited file containing Ensembl Gene IDs. Ensure the file has a header row and that the Gene IDs are in the first column, since that is the column extracted below.
input_file <- "homo_sapiens-intronless_genes.tsv"
data <- read.delim(input_file, header = TRUE)

# Extract the column with Ensembl Gene IDs 
ensembl_gene_ids <- data[[1]]  
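
Before querying Ensembl, it can help to sanity-check the extracted IDs. The following is a minimal sketch and not part of the original script: the helper name `clean_gene_ids` is hypothetical, and the pattern assumes unversioned human gene IDs of the form `ENSG` followed by 11 digits.

```r
# Hypothetical helper: tidy a vector of Ensembl Gene IDs before querying BioMart.
clean_gene_ids <- function(ids) {
  ids <- trimws(as.character(ids))        # drop stray whitespace
  ids <- ids[!is.na(ids) & nzchar(ids)]   # drop missing or empty entries
  ids <- unique(ids)                      # query each gene only once
  # Assumption: unversioned human gene IDs look like ENSG + 11 digits
  is_valid <- grepl("^ENSG[0-9]{11}$", ids)
  if (any(!is_valid)) {
    warning(sum(!is_valid), " ID(s) do not look like Ensembl Gene IDs and were dropped")
  }
  ids[is_valid]
}

# Toy input: three real-looking IDs (one padded, one duplicated), an NA, and a malformed entry
example_ids <- c("ENSG00000043591", " ENSG00000054598", "ENSG00000101898",
                 "ENSG00000101898", NA, "not_an_id")
cleaned_ids <- suppressWarnings(clean_gene_ids(example_ids))
print(cleaned_ids)
```

In the script above, the cleaned vector could be passed to `getBM()` as `values` in place of the raw column.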
# Step 2: Convert Ensembl Gene IDs to Ensembl Peptide ID Versions
# Using biomaRt, we map the Ensembl Gene IDs to their corresponding Ensembl Peptide ID Versions.
# Connect to the Ensembl BioMart database
mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# Retrieve mapping of Ensembl Gene IDs to Peptide ID Versions
gene_peptide_versions <- getBM(
  attributes = c("ensembl_gene_id", "ensembl_peptide_id_version"),
  filters = "ensembl_gene_id",
  values = ensembl_gene_ids,
  mart = mart
)
head(gene_peptide_versions)

##   ensembl_gene_id ensembl_peptide_id_version
## 1 ENSG00000043591          ENSP00000358301.2
## 2 ENSG00000054598          ENSP00000493906.1
## 3 ENSG00000101898          ENSP00000496921.1
## 4 ENSG00000101898          ENSP00000520739.1
## 5 ENSG00000101898          ENSP00000520740.1
## 6 ENSG00000101898          ENSP00000520741.1
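
As the output above shows, one gene can map to several peptide IDs (ENSG00000101898 appears four times in the first six rows). A short sketch of how this one-to-many relationship could be summarised is given below. It rebuilds the six rows shown above as toy data rather than querying Ensembl, and the filter for empty or missing peptide IDs is an assumption about how genes without a protein product might be returned.

```r
# Toy data copied from the head() output shown above
gene_peptide_example <- data.frame(
  ensembl_gene_id = c("ENSG00000043591", "ENSG00000054598",
                      rep("ENSG00000101898", 4)),
  ensembl_peptide_id_version = c("ENSP00000358301.2", "ENSP00000493906.1",
                                 "ENSP00000496921.1", "ENSP00000520739.1",
                                 "ENSP00000520740.1", "ENSP00000520741.1"),
  stringsAsFactors = FALSE
)

# Assumption: genes without a protein product may come back with an
# empty or NA peptide ID, so such rows are dropped before counting
has_peptide <- !is.na(gene_peptide_example$ensembl_peptide_id_version) &
  nzchar(gene_peptide_example$ensembl_peptide_id_version)
gene_peptide_example <- gene_peptide_example[has_peptide, ]

# Number of peptide IDs returned per gene
peptides_per_gene <- table(gene_peptide_example$ensembl_gene_id)
print(peptides_per_gene)
```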

# Step 3: Read the UniProt Mapping Results (Output from miPFinder2)
# We load UniProt mapping results and extract Ensembl Peptide ID Versions for matching.
uniprot_file <- "/Users/katiaavinapadilla/Desktop/uniprot_mapping_results.tsv"
uniprot_data <- read.delim(uniprot_file, header = TRUE, sep = "\t")
# Extract the column with Ensembl Peptide ID Versions 
ensembl_peptide_versions <- uniprot_data[[2]] 
head(ensembl_peptide_versions)

## [1] "ENSP00000373254.5" "ENSP00000388094.1" "ENSP00000319222.9"
## [4] "ENSP00000443758.1" "ENSP00000261169.6" "ENSP00000312856.6"
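
Matching on versioned IDs only succeeds when both files carry exactly the same version suffix. If the two inputs were produced against different Ensembl releases, which is an assumption and not something the source states, comparing unversioned IDs may be a useful cross-check. A minimal sketch with a hypothetical helper `strip_version`:

```r
# Hypothetical helper: remove the trailing ".<version>" from an Ensembl ID
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)

# First three IDs from the output shown above
versioned_ids <- c("ENSP00000373254.5", "ENSP00000388094.1", "ENSP00000319222.9")
unversioned_ids <- strip_version(versioned_ids)
print(unversioned_ids)
```

IDs without a version suffix pass through unchanged, so the helper can be applied to both inputs before comparing them.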

# Step 4: Find Matches Between Ensembl Peptide ID Versions in Both Files
# We find the Ensembl Peptide ID Versions that match between the IGFinder output and the miPFinder2 results.

# Find matches between the two datasets
matched_versions <- gene_peptide_versions$ensembl_peptide_id_version[
  gene_peptide_versions$ensembl_peptide_id_version %in% ensembl_peptide_versions
]
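# Save the matched versions as a CSV file with a named column
# (writing the bare vector would label the column "x")
write.csv(
  data.frame(ensembl_peptide_id_version = matched_versions),
  "Human-IGmips.csv",
  row.names = FALSE
)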
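
The matching step above keeps only the peptide IDs. For downstream analyses it may be convenient to keep the gene ID next to each matched peptide. The sketch below shows one way to do this with base R. The toy data frame reuses IDs from the outputs shown earlier, but the overlap between the two toy inputs is invented for illustration and is not a result of the original analysis.

```r
# Toy stand-ins for the two inputs; the overlap is illustrative only
gene_peptide_example <- data.frame(
  ensembl_gene_id = c("ENSG00000043591", "ENSG00000054598", "ENSG00000101898"),
  ensembl_peptide_id_version = c("ENSP00000358301.2", "ENSP00000493906.1",
                                 "ENSP00000496921.1"),
  stringsAsFactors = FALSE
)
uniprot_peptide_example <- c("ENSP00000493906.1", "ENSP00000373254.5")

# Keep whole rows so each matched peptide stays linked to its gene
is_match <- gene_peptide_example$ensembl_peptide_id_version %in% uniprot_peptide_example
matched_rows <- unique(gene_peptide_example[is_match, ])

# Write to a temporary file here; a real run would use its own output path
out_file <- tempfile(fileext = ".csv")
write.csv(matched_rows, out_file, row.names = FALSE)
print(read.csv(out_file, stringsAsFactors = FALSE))
```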

Conclusion

This workflow maps Ensembl Gene IDs to Ensembl Peptide ID Versions, matches them against the miPFinder2 results, and saves the matches as a CSV file for further evolutionary analyses. The output, Human-IGmips.csv, lists the Ensembl Peptide ID Versions of the intronless gene (IG)-encoded microproteins identified through this matching process.

Mapping and Matching Peptide IDs Using miPFinder2 and IGFinder Outputs

Katia Aviña Padilla

2025-01-20
