This script demonstrates how to map Ensembl Gene IDs to versioned Ensembl
Gene IDs using the biomaRt package, match the versioned IDs against a
UniProt mapping file, and retrieve the corresponding protein sequences. The
sequences are then saved in FASTA format for further analysis.
We begin by loading the biomaRt library for interacting
with the Ensembl database.
library(biomaRt)
We load a tab-separated file containing Ensembl Gene IDs. The code below assumes the file has a header row and that the Gene IDs are in the first column.
input_file <- "homo sapiens-intronless_genes.tsv"
data <- read.delim(input_file, header = TRUE)
ensembl_gene_ids <- data[[1]] # Assuming the first column contains Ensembl Gene IDs
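Identifier lists exported from other tools sometimes contain duplicates, blank cells, or IDs that already carry a version suffix, which an unversioned `ensembl_gene_id` filter would not match. The following is a minimal sketch of a cleaning step, assuming this applies to your input. The helper `clean_gene_ids` and the example vector are hypothetical and not part of the original script.

```r
# Hypothetical helper: tidy a vector of Ensembl Gene IDs before querying BioMart.
clean_gene_ids <- function(ids) {
  ids <- trimws(as.character(ids))        # drop leading/trailing whitespace
  ids <- sub("\\.[0-9]+$", "", ids)       # strip a trailing ".<version>" if present
  ids <- ids[!is.na(ids) & nzchar(ids)]   # remove missing or empty entries
  unique(ids)                             # each ID only needs to be queried once
}

# Toy input standing in for data[[1]]
example_ids <- c("ENSG00000139618", " ENSG00000139618.15", NA, "", "ENSG00000157764")
cleaned_ids <- clean_gene_ids(example_ids)
print(cleaned_ids)
```

In the workflow itself, the cleaned vector would take the place of `ensembl_gene_ids` in the query that follows.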
Using biomaRt, we map the Ensembl Gene IDs to their
corresponding versions.
mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
gene_id_versions <- getBM(
  attributes = c("ensembl_gene_id", "ensembl_gene_id_version"),
  filters = "ensembl_gene_id",
  values = ensembl_gene_ids,
  mart = mart
)
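`getBM()` returns rows only for the IDs it finds, so retired or mistyped IDs drop out without a warning. A small check along the following lines can make that visible. It is a sketch on toy data that needs no network access; the `toy_*` objects and their identifiers are placeholders invented for illustration, not values looked up in Ensembl.

```r
# Toy stand-ins for ensembl_gene_ids and the data frame returned by getBM()
toy_query_ids <- c("ENSG00000000001", "ENSG00000000002", "ENSG00000000009")
toy_mapping <- data.frame(
  ensembl_gene_id = c("ENSG00000000001", "ENSG00000000002"),
  ensembl_gene_id_version = c("ENSG00000000001.5", "ENSG00000000002.3"),
  stringsAsFactors = FALSE
)

# IDs that were queried but are absent from the result
unmapped_ids <- setdiff(toy_query_ids, toy_mapping$ensembl_gene_id)

message(
  length(unmapped_ids), " of ", length(unique(toy_query_ids)),
  " queried IDs returned no versioned ID"
)
print(unmapped_ids)
```

Applied to the real objects, the same `setdiff()` call would compare `ensembl_gene_ids` with `gene_id_versions$ensembl_gene_id`.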
We load UniProt mapping results and extract Ensembl Gene ID Versions for matching.
uniprot_file <- "uniprot_mapping_results.tsv"
uniprot_data <- read.delim(uniprot_file, header = TRUE)
# Extract the second column (Ensembl Gene ID Versions)
ensembl_gene_versions <- uniprot_data[[2]] # Replace with actual column name if available
We find the Ensembl Gene ID Versions that match between the BioMart and UniProt results.
matched_versions <- gene_id_versions$ensembl_gene_id_version[
  gene_id_versions$ensembl_gene_id_version %in% ensembl_gene_versions
]
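Because the comparison is an exact string match, a gene whose version number differs between the two sources (for example, because they were generated from different Ensembl releases) will not be matched. The sketch below shows one way to spot such cases on toy data. `strip_version` and all identifiers are hypothetical placeholders.

```r
# Hypothetical helper: remove the ".<version>" suffix from a versioned ID
strip_version <- function(x) sub("\\.[0-9]+$", "", x)

# Toy stand-ins for the BioMart and UniProt versioned IDs
biomart_versions <- c("ENSG00000000001.5", "ENSG00000000002.3", "ENSG00000000003.7")
uniprot_versions <- c("ENSG00000000001.5", "ENSG00000000002.2")

# Exact matches, as in the main workflow
exact_match <- intersect(biomart_versions, uniprot_versions)

# Genes present in both sources when the version suffix is ignored
same_gene <- biomart_versions[
  strip_version(biomart_versions) %in% strip_version(uniprot_versions)
]

# Genes found in both sources but with differing version numbers
version_only_mismatch <- setdiff(same_gene, exact_match)
print(version_only_mismatch)
```

A non-empty `version_only_mismatch` would suggest checking which Ensembl release each input file was derived from before relying on the exact match.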
We use biomaRt to retrieve protein sequences
corresponding to the matched Ensembl Gene ID Versions.
fasta_sequences <- getBM(
  attributes = c("ensembl_gene_id_version", "peptide"),
  filters = "ensembl_gene_id_version",
  values = matched_versions,
  mart = mart
)
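Sequence queries can return rows that are not usable protein sequences. In our experience BioMart reports the placeholder text `Sequence unavailable` where no peptide exists, and peptide strings may end with a `*` marking the stop codon; treat both as assumptions to verify against your own results. A minimal cleaning sketch on toy data, where `clean_sequences` and the example rows are hypothetical:

```r
# Toy stand-in for the data frame returned by the sequence query
toy_sequences <- data.frame(
  ensembl_gene_id_version = c("GENE_A.1", "GENE_B.2", "GENE_C.1"),
  peptide = c("MKTAYIAK*", "Sequence unavailable", "MSTNPKPQ"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop unusable rows and trim a trailing stop symbol
clean_sequences <- function(df) {
  keep <- !is.na(df$peptide) & df$peptide != "Sequence unavailable"
  df <- df[keep, , drop = FALSE]
  df$peptide <- sub("\\*$", "", df$peptide)  # remove a single trailing "*"
  df
}

cleaned_sequences <- clean_sequences(toy_sequences)
print(cleaned_sequences)
```

Whether to strip the trailing `*` depends on the downstream tool, since some aligners reject it while others ignore it.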
Finally, we save the retrieved protein sequences in FASTA format.
output_fasta <- "matched_protein_sequences.fasta"
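write_fasta <- function(data, file) {
  # Redirect output to the file and restore it even if an error occurs
  sink(file)
  on.exit(sink())
  # seq_len() avoids iterating over 1:0 when data has no rows
  for (i in seq_len(nrow(data))) {
    cat(">", data[i, "ensembl_gene_id_version"], "\n", sep = "")
    cat(data[i, "peptide"], "\n", sep = "")  # sep = "" avoids a trailing space
  }
}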
write_fasta(fasta_sequences, output_fasta)
cat("FASTA sequences saved to", output_fasta, "\n")
## FASTA sequences saved to matched_protein_sequences.fasta
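A quick way to confirm the file was written as intended is to read it back and count the header lines. The sketch below does this on a small temporary file so that it runs on its own; the record names and sequences are invented for illustration.

```r
# Write a tiny FASTA file to a temporary location (toy records)
fasta_path <- tempfile(fileext = ".fasta")
writeLines(c(">GENE_A.1", "MKTAYIAK", ">GENE_C.1", "MSTNPKPQ"), fasta_path)

# Read the file back and pick out the header lines, which start with ">"
fasta_lines <- readLines(fasta_path)
header_lines <- grep("^>", fasta_lines, value = TRUE)

n_records <- length(header_lines)
record_ids <- sub("^>", "", header_lines)

cat("Records found:", n_records, "\n")
print(record_ids)
```

For the real output, pointing `readLines()` at `output_fasta` and comparing `n_records` with `nrow(fasta_sequences)` would show whether every retrieved sequence was written.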
This workflow provides a structured approach to mapping gene IDs, matching with UniProt data, and extracting protein sequences using BioMart. The resulting FASTA file can be used for downstream analyses such as sequence alignment or functional annotation.