Introduction

This R notebook demonstrates how to retrieve and analyze genetic variant information for the PTPN11 gene using data from ClinVar and the Ensembl Variant Effect Predictor (VEP). The analysis focuses on a specific variant, rs121918459, to illustrate data retrieval and interpretation.

Setup: Install and Load Required Libraries

First, we need to install and load the necessary R packages. If you are running this notebook for the first time, uncomment and run the install.packages and devtools::install_github lines. BiocManager::install is used for Bioconductor packages.

# Install clinvaR from GitHub (uncomment and run if not already installed)
# install.packages("devtools")
# devtools::install_github("jamesdiao/clinvaR")
# BiocManager::install("ensemblVEP")
library(DT)
library(ensemblVEP)
library(clinvaR)
library(httr)
library(jsonlite)

# Confirm successful loading of libraries
print("Required libraries loaded successfully.")
## [1] "Required libraries loaded successfully."

Step 1: Download ClinVar Data

We will download the ClinVar database using the get_clinvar() function from the clinvaR package. This may take some time depending on your internet connection as it retrieves a large dataset.

After downloading, we will filter the data to only include entries related to the PTPN11 gene.

# Download ClinVar data
clinvar_data <- get_clinvar()
# Subset data for the PTPN11 gene
ptpn11_subset <- clinvar_data[which(clinvar_data$GENE == "PTPN11") ,]
print(paste("Entries for PTPN11 gene:", nrow(ptpn11_subset)))
## [1] "Entries for PTPN11 gene: 259"
# Display the first few rows of the PTPN11 subset
head(as.data.frame(ptpn11_subset))
##                  VAR_ID   GENE CHROM       POS REF ALT          ID  CLNSIG
## 159034 12_112856737_G_C PTPN11    12 112856737   G   C rs545369072 0, 0, 0
## 159035 12_112856777_G_A PTPN11    12 112856777   G   A rs886048964 0, 0, 0
## 159037 12_112856871_T_G PTPN11    12 112856871   T   G rs587781131       2
## 159038 12_112856920_C_T PTPN11    12 112856920   C   T rs267606990 5, 5, 5
## 159039 12_112856925_C_G PTPN11    12 112856925   C   G rs886041517       0
## 159040 12_112856937_G_T PTPN11    12 112856937   G   T rs886048965 0, 0, 0
##              CLNDBN pathogenic_incl_conflicts pathogenic_no_conflicts
## 159034 Noonan_s....                     FALSE                   FALSE
## 159035 Noonan_s....                     FALSE                   FALSE
## 159037 not_spec....                     FALSE                   FALSE
## 159038 Noonan_s....                      TRUE                    TRUE
## 159039 not_spec....                     FALSE                   FALSE
## 159040 Noonan_s....                     FALSE                   FALSE

Step 2: Identify a Specific Variant

From the PTPN11 subset, we will search for a specific variant using its ClinVar ID, rs121918459. This variant is c.188A>G.

# Search for the specific variant rs121918459 (c.188A>G) in the PTPN11 subset
specific_variant <- ptpn11_subset[which(ptpn11_subset$ID == "rs121918459"),]

if (nrow(specific_variant) > 0) {
  print("Specific variant rs121918459 found:")
  
  # Display as interactive DT table
  datatable(specific_variant,
            caption = "Variant rs121918459 (c.188A>G) in PTPN11",
            rownames = FALSE,
            options = list(
              scrollX = TRUE,  # Enable horizontal scrolling
              autoWidth = TRUE,
              pageLength = 10,
              columnDefs = list(
                list(width = '100px', targets = "_all")  # Set minimum column width
              ),
              dom = 't'  # Only show table (no pagination needed for single row)
            ),
            class = 'cell-border stripe hover') %>%
    formatStyle(columns = colnames(specific_variant), 
                fontSize = '12px',
                backgroundColor = styleEqual(1, '#f0f8ff'))
  
} else {
  print("Specific variant rs121918459 not found in the PTPN11 subset.")
}
## [1] "Specific variant rs121918459 found:"

Step 3: Define a Function to Query Ensembl VEP

To get detailed annotations for the variant, we will use the Ensembl Variant Effect Predictor (VEP) API. A custom R function get_vep_variant is defined below to interact with the Ensembl REST API. This function constructs the API request based on chromosome, position, reference allele, alternative allele, species, and genome build.

get_vep_variant <- function(chromosome, position, ref_allele, alt_allele,
                            species = "human", genome = "GRCh38") {

  # Select the appropriate server based on genome build
  if (genome == "GRCh37" || genome == "hg19") {
    server <- "https://grch37.rest.ensembl.org"
  } else if (genome == "GRCh38" || genome == "hg38") {
    server <- "https://rest.ensembl.org"
  } else {
    stop("Genome must be either 'GRCh37' or 'GRCh38'")
  }

  # Construct the region string for the VEP API
  # The VEP API expects format: chr:start-end/allele
  region <- paste0(chromosome, ":", position, "-", position, "/", alt_allele)

  # Build the API endpoint
  endpoint <- paste0("/vep/", species, "/region/", region)

  # Make the HTTP GET request to the Ensembl VEP API
  response <- GET(paste0(server, endpoint),
                  content_type("application/json"),
                  accept("application/json"))

  # Parse the JSON response
  if (status_code(response) == 200) {
    result <- fromJSON(toJSON(content(response)))
    return(result)
  } else {
    stop(paste("API request failed with status:", status_code(response), "and message:", content(response, "text")))
  }
}

print("get_vep_variant function defined.")
## [1] "get_vep_variant function defined."

Step 4: Query VEP for the Specific Variant (GRCh37)

Now, we will use the get_vep_variant function to query the Ensembl VEP API for our specific variant (rs121918459) using the GRCh37 genome build. This will provide detailed annotations such as transcript consequences.

# GRCh37 variant
variant_grch37 <- get_vep_variant(specific_variant$CHROM, specific_variant$POS, specific_variant$REF, specific_variant$ALT, genome = "GRCh37")

# Display transcript consequences as interactive table
if (!is.null(variant_grch37$transcript_consequences)) {
  tc_df <- as.data.frame(variant_grch37$transcript_consequences)
  
  datatable(tc_df,
            caption = "Transcript Consequences for Variant",
            filter = 'top',
            options = list(
              pageLength = 10,
              scrollX = TRUE,
              autoWidth = TRUE,
              columnDefs = list(list(width = '150px', targets = "_all"))
            ))
}

Conclusion

This notebook successfully demonstrated how to:

  1. Set up an R environment with necessary bioinformatics packages.
  2. Download and filter ClinVar data for a gene of interest (PTPN11).
  3. Identify a specific genetic variant (rs121918459).
  4. Query the Ensembl VEP API to obtain detailed annotations for the variant, specifically focusing on the GRCh37 genome build.

This workflow can be adapted to analyze other genes and variants, providing valuable insights for genetic research and clinical interpretation.