Introduction

This script demonstrates how to map Ensembl Gene IDs to Ensembl Gene ID Versions (the stable gene ID followed by a dot and a version number) using the biomaRt package, match the results against the IDs in a UniProt mapping file, and retrieve the corresponding protein sequences from Ensembl. The sequences are then saved in FASTA format for further analysis.

Load Required Library

We begin by loading the biomaRt package, which provides an R interface to the Ensembl BioMart service.

library(biomaRt)

Step 1: Read the File and Extract Ensembl Gene IDs

We load a tab-separated file containing Ensembl Gene IDs. The file must have a header row, and the code below assumes the Gene IDs are in the first column.

input_file <- "homo sapiens-intronless_genes.tsv"
data <- read.delim(input_file, header = TRUE)
ensembl_gene_ids <- data[[1]]  # Assuming the first column contains Ensembl Gene IDs
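Because the IDs are taken by position, it can help to sanity-check them before querying Ensembl. The following is a minimal sketch on a toy table, not part of the original script: the helper `clean_gene_ids()`, the column name `gene_id`, and the IDs themselves are hypothetical. It assumes unversioned human gene IDs, which have the form `ENSG` followed by 11 digits.

```r
# Toy stand-in for the input TSV; the column name and IDs are made up.
toy <- read.delim(
  text = paste0(
    "gene_id\tnote\n",
    "ENSG00000000001\ta\n",
    " ENSG00000000002 \tb\n",
    "ENSG00000000001\tduplicate\n",
    "not_an_id\tc\n"
  ),
  header = TRUE, stringsAsFactors = FALSE
)

# Hypothetical helper: trim, de-duplicate, and keep only ID-shaped values.
clean_gene_ids <- function(ids) {
  ids <- trimws(as.character(ids))               # remove stray whitespace
  ids <- unique(ids[!is.na(ids) & nzchar(ids)])  # drop NA, empty, duplicates
  ok <- grepl("^ENSG[0-9]{11}$", ids)            # unversioned human gene ID
  if (any(!ok)) {
    warning("Dropping ", sum(!ok), " value(s) that do not look like Ensembl Gene IDs")
  }
  ids[ok]
}

ids <- suppressWarnings(clean_gene_ids(toy[[1]]))
print(ids)
```

In the real workflow the same helper could be applied to `data[[1]]` before the IDs are passed to biomaRt.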

Step 2: Convert Ensembl Gene IDs to Ensembl Gene ID Versions

Using biomaRt, we connect to the human gene dataset and look up the versioned ID for each Ensembl Gene ID.

mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

gene_id_versions <- getBM(
  attributes = c("ensembl_gene_id", "ensembl_gene_id_version"),
  filters = "ensembl_gene_id",
  values = ensembl_gene_ids,
  mart = mart
)
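The table returned by `getBM()` is not guaranteed to follow the order of the input, and IDs that the mart does not recognise are simply absent from it. The sketch below shows one way to check for this, using a made-up data frame in place of a real query result; `query_ids`, `mapping`, and all ID values are illustrative assumptions.

```r
# Stand-ins for `ensembl_gene_ids` and the getBM() result; values are made up.
query_ids <- c("ENSG00000000001", "ENSG00000000002", "ENSG00000000003")
mapping <- data.frame(
  ensembl_gene_id = c("ENSG00000000003", "ENSG00000000001"),
  ensembl_gene_id_version = c("ENSG00000000003.7", "ENSG00000000001.2"),
  stringsAsFactors = FALSE
)

# IDs that were queried but did not come back
unmapped <- setdiff(query_ids, mapping$ensembl_gene_id)

# Versioned IDs re-aligned to the input order; unmapped IDs become NA
aligned <- mapping$ensembl_gene_id_version[
  match(query_ids, mapping$ensembl_gene_id)
]

print(unmapped)
print(aligned)
```

With the real objects, `query_ids` would be `ensembl_gene_ids` and `mapping` would be `gene_id_versions`.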

Step 3: Read the UniProt Mapping Results

We load UniProt mapping results and extract Ensembl Gene ID Versions for matching.

uniprot_file <- "uniprot_mapping_results.tsv"
uniprot_data <- read.delim(uniprot_file, header = TRUE)

# Extract the second column (Ensembl Gene ID Versions)
ensembl_gene_versions <- uniprot_data[[2]]  # Replace with actual column name if available
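Selecting the column by position breaks silently if the file layout changes. As a hedged alternative, the sketch below looks the column up by name and only falls back to the position when the name is missing. The helper `get_id_column()`, the column names `From` and `To`, and the toy rows are assumptions for illustration; the actual header of `uniprot_mapping_results.tsv` is not shown in this document.

```r
# Toy mapping table; column names and values are hypothetical.
toy_uniprot <- read.delim(
  text = paste0(
    "From\tTo\n",
    "P00001\tENSG00000000001.2\n",
    "P00002\tENSG00000000003.7\n",
    "P00003\t\n"
  ),
  header = TRUE, stringsAsFactors = FALSE
)

# Hypothetical helper: prefer the named column, fall back to a position,
# and return the unique non-empty values.
get_id_column <- function(df, name, fallback_index) {
  col <- if (name %in% names(df)) df[[name]] else df[[fallback_index]]
  col <- trimws(as.character(col))
  unique(col[!is.na(col) & nzchar(col)])
}

versions <- get_id_column(toy_uniprot, "To", 2)
print(versions)
```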

Step 4: Find Matches Between Ensembl Gene ID Versions

We find the Ensembl Gene ID Versions that match between the BioMart and UniProt results.

matched_versions <- gene_id_versions$ensembl_gene_id_version[
  gene_id_versions$ensembl_gene_id_version %in% ensembl_gene_versions
]
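An exact `%in%` match requires the version suffix to be identical in both sources. If the UniProt file and the BioMart query reflect different Ensembl releases, a gene can be missed only because its version number changed. The sketch below, on made-up IDs, compares exact matching with matching on the unversioned ID; `strip_version()` is a hypothetical helper, not part of the original script.

```r
# Hypothetical helper: remove a trailing ".<digits>" version suffix.
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)

# Made-up versioned IDs standing in for the two sources.
biomart_versions <- c("ENSG00000000001.2", "ENSG00000000002.5", "ENSG00000000003.7")
uniprot_versions <- c("ENSG00000000001.2", "ENSG00000000003.6")

# Exact match, as in the step above
exact <- biomart_versions[biomart_versions %in% uniprot_versions]

# Match on the gene ID alone, ignoring the version
by_gene <- biomart_versions[
  strip_version(biomart_versions) %in% strip_version(uniprot_versions)
]

# Genes present in both sources but with different versions
version_only_mismatch <- setdiff(by_gene, exact)

print(exact)
print(version_only_mismatch)
```

Whether a version-only mismatch should count as a match depends on the analysis; the sketch only makes such cases visible.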

Step 5: Retrieve Protein Sequences for Matched IDs

We use biomaRt to retrieve protein sequences corresponding to the matched Ensembl Gene ID Versions.

fasta_sequences <- getBM(
  attributes = c("ensembl_gene_id_version", "peptide"),
  filters = "ensembl_gene_id_version",
  values = matched_versions,
  mart = mart
)
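A gene can have more than one protein-coding transcript, so the result may contain several rows per gene, and some rows may carry no usable sequence. The sketch below tidies such a table before export. It is run on a made-up data frame and rests on two assumptions that should be checked against the real output: that missing sequences appear as the placeholder text `Sequence unavailable`, and that peptides may end with a `*` stop marker. `clean_peptides()` is a hypothetical helper.

```r
# Made-up stand-in for the sequence table; IDs and peptides are illustrative.
toy_seqs <- data.frame(
  ensembl_gene_id_version = c("ENSG00000000001.2", "ENSG00000000001.2",
                              "ENSG00000000003.7"),
  peptide = c("MKTAYIAK*", "Sequence unavailable", "MSTNPKPQ*"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop unusable rows, strip a trailing "*",
# and remove exact duplicate records.
clean_peptides <- function(df, placeholder = "Sequence unavailable") {
  keep <- !is.na(df$peptide) & nzchar(df$peptide) & df$peptide != placeholder
  df <- df[keep, , drop = FALSE]
  df$peptide <- sub("\\*$", "", df$peptide)  # assumed stop-codon marker
  df <- df[!duplicated(df), , drop = FALSE]
  rownames(df) <- NULL
  df
}

cleaned <- clean_peptides(toy_seqs)
print(cleaned)
```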

Step 6: Save the Sequences in FASTA Format

Finally, we save the retrieved protein sequences in FASTA format.

output_fasta <- "matched_protein_sequences.fasta"
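write_fasta <- function(data, file) {
  # One record per row: a ">" header line followed by the peptide sequence.
  records <- sprintf(">%s\n%s", data[["ensembl_gene_id_version"]], data[["peptide"]])
  # writeLines() avoids sink(), which leaves output redirected if an error
  # occurs mid-loop; it also writes an empty file when data has zero rows.
  writeLines(records, file)
}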
write_fasta(fasta_sequences, output_fasta)

cat("FASTA sequences saved to", output_fasta, "\n")

## FASTA sequences saved to matched_protein_sequences.fasta
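The export above writes each sequence on a single line, which is valid FASTA, though some tools expect wrapped sequence lines. As an optional variation, the sketch below wraps sequences at a fixed width and then reads the file back to count records. The helpers `wrap_sequence()` and `format_fasta()`, the 60-character width, and the toy IDs and peptides are all assumptions for illustration.

```r
# Hypothetical helper: split one sequence into chunks of at most `width`.
wrap_sequence <- function(s, width = 60) {
  if (!nzchar(s)) return(character(0))
  starts <- seq(1, nchar(s), by = width)
  substring(s, starts, pmin(starts + width - 1, nchar(s)))
}

# Hypothetical helper: build header and wrapped sequence lines per record.
format_fasta <- function(ids, seqs, width = 60) {
  unlist(
    Map(function(id, s) c(paste0(">", id), wrap_sequence(s, width)), ids, seqs),
    use.names = FALSE
  )
}

# Made-up records: one long peptide (130 residues) and one short one.
toy_ids <- c("geneA.1", "geneB.2")
toy_pep <- c(strrep("M", 130), "MKT")
fasta_lines <- format_fasta(toy_ids, toy_pep)

# Write to a temporary file and count the records by their ">" headers.
tmp <- tempfile(fileext = ".fasta")
writeLines(fasta_lines, tmp)
n_records <- sum(startsWith(readLines(tmp), ">"))
print(n_records)
```

Counting header lines after writing is a cheap check that the number of records in the file equals the number of rows exported.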

Conclusion

This workflow provides a structured approach to mapping gene IDs, matching with UniProt data, and extracting protein sequences using BioMart. The resulting FASTA file can be used for downstream analyses such as sequence alignment or functional annotation.

Mapping and Extracting Protein Sequences

Katia Aviña Padilla

2024-12-09