KEGG is a widely used database for biological pathways. This document demonstrates how to programmatically extract gene names and their descriptions from a KEGG pathway using R and the KEGGREST package.
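Install the required package if you haven’t already (KEGGREST is distributed through Bioconductor, so it can be installed with BiocManager::install("KEGGREST")), then load it: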
library(KEGGREST)
The function below fetches the pathway information using a KEGG pathway ID and extracts gene names with their full descriptions.
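extract_gene_info <- function(kegg_id) {
  # Fetch the pathway record from KEGG (requires an internet connection)
  pathway_data <- keggGet(kegg_id)
  # GENE alternates between gene IDs (odd positions) and
  # "symbol; description" strings (even positions)
  gene_info <- pathway_data[[1]]$GENE
  # Keep only the "symbol; description" strings
  gene_names <- gene_info[seq(2, length(gene_info), by = 2)]
  # Split each string into the gene symbol and its description
  split_gene_names <- strsplit(gene_names, "; ")
  gene_table <- data.frame(
    GeneName = sapply(split_gene_names, `[`, 1),
    FullDescription = sapply(split_gene_names, `[`, 2),
    stringsAsFactors = FALSE
  )
  return(gene_table)
}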
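The parsing step can be tried without a network connection. The sketch below is a more defensive variant of the same idea, not part of KEGGREST: `parse_gene_field` is a hypothetical helper name, and the `"id1"`/`"id2"` entries are placeholders standing in for real gene IDs. It assumes the GENE field has the same alternating layout the function above relies on, returns an empty table when that field is missing, and splits only on the first "; " so that a description containing that separator is kept whole.

```r
# Hypothetical helper (not part of KEGGREST): parse a GENE-style vector
# that alternates gene IDs with "symbol; description" strings.
parse_gene_field <- function(gene_info) {
  # Return an empty table when the record has no usable GENE field
  if (length(gene_info) < 2) {
    return(data.frame(GeneName = character(0),
                      FullDescription = character(0),
                      stringsAsFactors = FALSE))
  }
  # Even positions hold the "symbol; description" strings
  entries <- gene_info[seq(2, length(gene_info), by = 2)]
  has_sep <- grepl("; ", entries, fixed = TRUE)
  data.frame(
    # Everything before the first "; " is the gene symbol
    GeneName = sub("; .*$", "", entries),
    # Everything after the first "; " is the description (NA if absent)
    FullDescription = ifelse(has_sep,
                             sub("^.*?; ", "", entries, perl = TRUE),
                             NA_character_),
    stringsAsFactors = FALSE
  )
}

# Mock input: placeholder IDs, descriptions taken from the mismatch repair example
mock_gene <- c(
  "id1", "POLD3; DNA polymerase delta 3, accessory subunit [KO:K03504]",
  "id2", "MLH3; mutL homolog 3 [KO:K08739]"
)
parsed <- parse_gene_field(mock_gene)
print(parsed)
```

With a real record, the same helper could be called as `parse_gene_field(keggGet(kegg_id)[[1]]$GENE)`.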
Let’s extract gene information from the Mismatch Repair pathway in humans (hsa03430):
gene_table <- extract_gene_info("hsa03430")
head(gene_table, 10)
##    GeneName                                              FullDescription
## 1     POLD3        DNA polymerase delta 3, accessory subunit [KO:K03504]
## 2      MLH3                                   mutL homolog 3 [KO:K08739]
## 3      MSH6                                   mutS homolog 6 [KO:K08737]
## 4      RPA4                           replication protein A4 [KO:K10741]
## 5      LIG1        DNA ligase 1 [KO:K10747] [EC:6.5.1.1 6.5.1.6 6.5.1.7]
## 6      MLH1                                   mutL homolog 1 [KO:K08734]
## 7      MSH2                                   mutS homolog 2 [KO:K08735]
## 8      MSH3                                   mutS homolog 3 [KO:K08736]
## 9      PCNA               proliferating cell nuclear antigen [KO:K04802]
## 10     PMS2 PMS1 homolog 2, mismatch repair system component [KO:K10858]
This approach lets you fetch gene information from a KEGG pathway and organize it as a data frame with one row per gene, ready for reporting and further analyses.
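As one example of further analysis, the bracketed annotations in FullDescription can be separated from the free-text description. The sketch below works on a two-row stand-in table copied from the output above, so it runs without calling KEGG. The column names `KO` and `Description` are illustrative choices, and the code assumes each description carries at most one `[KO:...]` tag; if a tag lists several identifiers, they are kept together as one string.

```r
# Small stand-in for the table returned by extract_gene_info()
gene_table <- data.frame(
  GeneName = c("MLH1", "LIG1"),
  FullDescription = c("mutL homolog 1 [KO:K08734]",
                      "DNA ligase 1 [KO:K10747] [EC:6.5.1.1 6.5.1.6 6.5.1.7]"),
  stringsAsFactors = FALSE
)

# Capture the contents of the "[KO:...]" tag; NA when no tag is present
ko_match <- regmatches(
  gene_table$FullDescription,
  regexec("\\[KO:([^]]+)\\]", gene_table$FullDescription)
)
gene_table$KO <- sapply(
  ko_match,
  function(m) if (length(m) >= 2) m[2] else NA_character_
)

# Remove every bracketed tag to leave a plain-text description
gene_table$Description <- trimws(
  gsub("\\s*\\[[^]]*\\]", "", gene_table$FullDescription)
)

print(gene_table[, c("GeneName", "KO", "Description")])
```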
Feel free to adapt this document for your own analyses!