Scraping Tables From Snpedia For Genetic Analysis Using Rvest

In this tutorial, I’m going to show you how to scrape data tables of SNPs from Snpedia, and how to use that data to check to see if you have those SNPs in your own genome. I’m also going to show you how to save that information in a general form for future use.

First, we load the rvest package

library(rvest)

Next we go to the Snpedia page for our desired gene, which in this example is the SUOX gene. https://www.snpedia.com/index.php/SUOX

If you go to this page, or any other gene page on snpedia, you will get a table of SNP rsid’s’ that are relevant to this gene along with information about them. The point of this exercise is to import, or scrape this table into R without having to do tedious data entry.

Here are the rvest commands for reading the html of this page and extracting our desired element.

rootnode<-read_html("https://www.snpedia.com/index.php/SUOX")

tablenode<-html_elements(rootnode, "table")

dat<-html_table(tablenode[[2]])

Rvest is a tidyverse package. As with other tidyverse packages, it simplifies coding but at the expense of turning the logic of its functions into something of a black box. So it is difficult to know what went wrong in instances where your code doesn’t work, because you don’t actually know why your code works. Or the code actually just isn’t adaptable to more complex scenarios.

For example, I was trying to scrap a data table from the Methylation report that LiveWello generates, and rvest just did not work for it at all.

When this happens, go to the webpage and inspect the html elements. Move your cursor over the various elements until the table you want gets highlighted. This can be trickier than you’d think because table nodes can be nested within many other nodes. Once you find the table node that you need, copy and paste it to a notepad doc, save it, and then use the read_html command on that file.

In this case, rvest worked fine.

head(dat)

## # A tibble: 6 x 4
##   ``          `Max Magnitude` `Chromosome position` Summary                     
##   <chr>                 <int> <chr>                 <chr>                       
## 1 rs11171718                0 55,995,838            ""                          
## 2 rs121908007               5 56,004,039            "sulfite oxidase deficiency"
## 3 rs121908008               5 56,004,183            ""                          
## 4 rs121908009               5 56,004,978            ""                          
## 5 rs202085145               8 56,002,720            ""                          
## 6 rs34962186                0 55,995,105            ""

Generally, what we want to do is match the rsid’s of the SNPs that matter here with the rsid’s in an actual genomic file, so we can see if that individual has the bad SNPs of a gene.

First we change the name of the rsid column so that it matches the name of the rsid column in our other table. In this case it is simply “rsid”.

names(dat)[1]<-"rsid"

Next we want to filter this data frame so that it only contains SNPs that matter. This information is contained in the Max Magnitude column, which is higher the more impactful a SNP is

dat2<-dat[dat$`Max Magnitude`>0,]

For the purpose of analysis we are only interested in the rsid column from this data frame, which we are going to use to build a new dataframe. For general comparative analysis, we want to have a data frame that contains the rsid, the minor allele, and the gene that they belong to.

Unfortunately the minor allele is not in the data table and needs to be looked up by hand, which we then take and put into it’s own vector.

For the gene variable, we can just create a vector that repeats the gene name. Lastly, we combine these vectors together into a data frame.

rsid<-dat2$rsid
`Minor Allele`<-c("A","A","A","T")
Gene<-rep("SUOX", length(rsid))

suox_df<-as.data.frame(cbind(rsid,`Minor Allele`,Gene))
suox_df

##          rsid Minor Allele Gene
## 1 rs121908007            A SUOX
## 2 rs121908008            A SUOX
## 3 rs121908009            A SUOX
## 4 rs202085145            T SUOX

Now we are going to merge this data frame with an individual’s ancestry data to see if they have any of these SNPs

df1 <- read.csv("AncestryDNA.txt",sep = "", skip=18)
suox_kp<-merge(df1,suox_df, by="rsid", all.y=T)
suox_kp

##          rsid chromosome position allele1 allele2 Minor Allele Gene
## 1 rs121908007         12 56397823       G       G            A SUOX
## 2 rs121908008         12 56397967       C       C            A SUOX
## 3 rs121908009         12 56398762       G       G            A SUOX
## 4 rs202085145         NA       NA    <NA>    <NA>            T SUOX

all.y basically means that we are adding matched information from x to y. The inverse all.x would be a huge data frame of all the persons sequenced rsid’s.

We can see just by visual inspection that this person doesn’t have any of the bad SNPs in this gene. However, if you are dealing with a huge number of SNPs you want to have a way of filtering out all of the negative results.

suox_kp2<-suox_kp[which(suox_kp$allele1==suox_kp$`Minor Allele`|suox_kp$allele2==suox_kp$`Minor Allele`),]
suox_kp2

## [1] rsid         chromosome   position     allele1      allele2     
## [6] Minor Allele Gene        
## <0 rows> (or 0-length row.names)

For those who are uninitiated to Base R, “which” is used here to filter out the NAs, the “|” symbol is used to logically symbolize “or”, and “==” tests whether two statements are equivalent. Basically we are testing to see if either of the alleles are equal to the minor allele. Since this is not the case we just get an empty data frame.

We (myself) did a good job creating a generally applicable data frame with suox_df, which can be used on any genomic file to test for these SNPs. So we should save it, especially so we aren’t dependent on this Snpedia page’s continued existence for this information. Here’s how to save just the data frame for future reference.

write.csv(suox_df, file="suox_df", row.names = F)

Scraping Tables From Snpedia For Genetic Analysis Using Rvest

2024-01-23