For the last 4 year in my Computational Biology course students work on an unknown gene or mutation as part of an independent data analysis project. This year, I expanded and adapted this to my 260-person introductory biology course. For these exercises, my intro bio students have been given the ID code (accession number) for a mutation, and from that code they will determine various related biological attributes, such as the gene the mutation is in, what chromosome it is on, etc. This script loads student submissions and accesses two online databases to collect the information necessary to check student submissions.
This code will be shared with my undergraduate teaching assistants who have taken Computational Biology with me so they can see what I’m doing and help me with the assignments. The overall format of this code is similar to what I’d use in a class or share with collaborators who may need to use the code without my future input.
Key approaches used in this script
ggplot2tryCatchThe data used in this script are on GitHub here: https://github.com/brouwern/biosci0160_2024
For the first exercise students took a mutation ID number and located the gene it was contained in using a web-based program. From the gene name, they can access information about the protein coded for by the gene, such as its 3-dimensional structure.
As part of assignment students turned in
First, I want to check that the information that they submitted are valid gene names and accession numbers. Second, I want to download information that will be used to check future assignments, such as the length of the gene and the size of the protein it codes for.
Creating this workflow requires dealing with several issues
The code below carries out the following steps:
For testing code, set prototype to TRUE and the number of entries to query from database.
prototype <- FALSE
prototype_N <- 100
verbose <- FALSE
This workflows requires several packages to interface with the APIs of the bioinformatics databases.
Un-comment the code as needed to download the packages.
NOTE: The htmltools package is used by
Biostrings and was giving me some issues; I needed to
re-install Rtools to get everything to work.
## CRAN bioinformatics packages
#install.packages("rentrez")
#install.packages("UniprotR")
## BioConductor bioinformatics packages
#install.packages("BiocManager")
#BiocManager::install("Biostrings")
#BiocManager::install("GenomicAlignments")
## BioConductor dependencies
#install.packages("htmltools")
## Other
#install.packages(XML)
## Bioinformatics database APIs
library(rentrez)
library(UniprotR)
library(Biostrings)
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind,
## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
## get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
## table, tapply, union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:grDevices':
##
## windows
## Loading required package: XVector
## Loading required package: GenomeInfoDb
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
# other
library(XML)
library(ggplot2)
Students submitted data on a Google Doc, which was download and anonymized.
The code below load data from students and does basic cleaning.
# load data
dat <- read.csv("introbio_unknowns_spr24 - gene_master_list.csv",
skip = 1,
na.strings = c("NA", ""))
# change blank space character . inserted by R to _
names(dat) <- gsub("\\.", "_", names(dat))
# shorten names from version given to students
names(dat) <- gsub("AlphaFold_Accession_Number", "Access", names(dat))
# remove NAs
## **TODO** - create workflow to flag students who
## have not submitted work
dat2 = dat[is.na(dat$Access) == FALSE, ]
dim(dat2)
## [1] 230 13
The key column of these data is “Access”, which is the
student-supplied accession number for a gene.
TODO:
head(dat2$Access,10)
## [1] "Q93052" "Q14511" "F6VA37" "O14529" "A0A852CDZ5"
## [6] "A0A024R2I2" "A0A4W2F114" "Q92556" "A0A2J8P183" "A0A7L0SVW8"
The accession number submitted by students is a UniProt accession
number. The UniProt API can be easily accessed using the CRAN package
UniprotR.
The UniprotR package has the function
GetProteinAnnontate()`, which downloads fields from the
database summary information. Some sample calls:
UniprotR::GetProteinAnnontate(dat2$Access[1] , "sequence")
UniprotR::GetProteinAnnontate(dat2$Access[1] , "cc_tissue_specificity")
UniprotR::GetProteinAnnontate(dat2$Access[1] , "go_c")
UniprotR::GetProteinAnnontate(dat2$Access[1] , "go_f")
UniprotR::GetProteinAnnontate(dat2$Access[1] , "cc_subcellular_location")
UniprotR::GetProteinAnnontate(dat2$Access[1] , "xref_refseq")
UniprotR::GetProteinAnnontate(dat2$Access[1] , "xref_geneid")
UniprotR::GetProteinAnnontate(dat2$Access[1] , "organism_name")
UniprotR::GetProteinAnnontate(dat2$Access[1] , "gene_primary")
UniprotR::GetProteinAnnontate(dat2$Access[1] , "gene_names")
UniprotR::GetProteinAnnontate() can download a single
column of data
UniprotR::GetProteinAnnontate(dat$Access[1] , "length")
…or multiple columns if supplied a vector of column names, which is what I’ll do:
UniprotR::GetProteinAnnontate(dat$Access[1] , c("xref_geneid","length","mass"))
The focal columns I’ll access from UniProt are:
# dataframe to hold output from UniProt
df <- data.frame(# metadata
## student-supplied data
"uniprot_access" = dat2$Access, # accession number recorded by student
"uniprot_works" = "yes", # changed to "no" if dbase query fails
# output from dbase query
"geneid" = NA, # NCBI geneid from UniProt (xref_geneid)
"gene_primary" = NA,
"gene_names" = NA, # may be complex output
"organism_name" = NA,
#"protein_name" = NA, #long-form text
"prot_length" = NA, # length of protein
"prot_mass" = NA) # mass of protein
# columns from database to extract
## columns names
cols_get = c("xref_geneid",
"gene_primary",
"gene_names",
"organism_name",
#"protein_name",
"length",
"mass")
## number of columns
n_cols_get = length(cols_get)
Querying the database throws errors if
Errors caught using tryCatch(), which has a complex
syntax. I always refer to this StackOverflow
post.
Here are some sample calls to tryCatch().
This accession works:
# valid call
access_i <- tryCatch(expr = UniprotR::GetProteinAnnontate(dat2$Access[18] , cols_get),
error = function(cond){
rep("error",n_cols_get)}
)
This accession throws an error. When an error is thrown, a vector containing “error” is returned to fill in the dataframe instead of the desired data.
#bad call
access_i <- tryCatch(expr = UniprotR::GetProteinAnnontate(dat2$Access[19] , cols_get),
error = function(cond){
rep("error",n_cols_get)}
)
## [1] "Internal server error. Most likely a temporary problem, but if the problem persists please contact us."
TODO: Troubleshoot error handing
Types of errors:
This for() loop requests the data for each
student-supplied accession number.
The code will throw a warning if the loop appears to have exited early.
TODO: Wrap this into a function that allows flexibility in e.g. which columns to get
# number of proteins to loop over
n_rows <- length(dat2$Access)
#verbose <- TRUE
#prototype <- TRUE
#prototype_N <- 5
for(i in 1:n_rows){
if(verbose == TRUE){
print(i)
}
# query dbase
## use tryCatch() to skip over errors
access_i <- tryCatch(expr = UniprotR::GetProteinAnnontate(dat2$Access[i] , cols_get),
error = function(cond){
rep("error",n_cols_get)}
)
# store information
df[i, -c(1,2)] <- access_i
if(prototype == TRUE & i == prototype_N){
break
}
}
## [1] "Internal server error. Most likely a temporary problem, but if the problem persists please contact us."
## Internet connection problem occurs and the function will return the original error
## [1] "Internal server error. Most likely a temporary problem, but if the problem persists please contact us."
# flag in case for() loop exits early
if(i < n_rows){
warning(paste("Only", i, "of",n_rows,"irrerations run."))
}
Annotate output, e.g. did tryCatch throw an error, or even if error is thrown was there no geneid?
df$uniprot_works <- ifelse(df$geneid == "error","NO","yes" )
df$uniprot_works <- as.factor(df$uniprot_works)
df$no_geneid <- ifelse(df$geneid == "NA","*","")
df$no_geneid <- as.factor(df$no_geneid )
message("NOTE: NO on these tables indicate how many times requests to the database had issues")
## NOTE: NO on these tables indicate how many times requests to the database had issues
summary(df$uniprot_works)
## NO yes
## 3 227
summary(df$no_geneid)
## *
## 186 44
Clean output and format relevant things as numbers rather than text.
#- remove ";" from geneid column
df$geneid <- gsub(";","",df$geneid)
# shorten species names
df$organism_name[grep("Homo", df$organism_name )] <- "human"
df$organism_name[grep("Equus ", df$organism_name )] <- "horse"
df$organism_name[grep("^Mus ", df$organism_name )] <- "mouse"
df$organism_name[grep("^Drosophila", df$organism_name )] <- "fruit_fly"
df$organism_name[grep("Ramphastos", df$organism_name )] <- "toucan"
# format text as numbers as needed
## length
df$prot_length <- ifelse(df$prot_length == "NA", NA, df$prot_length)
df$prot_length <- as.numeric(df$prot_length)
## Warning: NAs introduced by coercion
#mass
df$prot_mass <- ifelse(df$prot_mass == "NA", NA, df$prot_mass)
df$prot_mass <- as.numeric(df$prot_mass)
## Warning: NAs introduced by coercion
Save output
write.csv(df, file = "temp_uniprot_output.csv")
head(df,10)
## uniprot_access uniprot_works geneid gene_primary gene_names
## 1 Q93052 yes 4026 LPP LPP
## 2 Q14511 yes 4739 NEDD9 NEDD9 CASL
## 3 F6VA37 yes 100055851 PPP1R3A PPP1R3A
## 4 O14529 yes 23316 CUX2 CUX2 CUTL2 KIAA0293
## 5 A0A852CDZ5 yes NA Klhl1_0 Klhl1_0 RAMSUL_R01126
## 6 A0A024R2I2 yes NA NA NA
## 7 A0A4W2F114 yes NA SPINK14 SPINK14
## 8 Q92556 yes 9844 ELMO1 ELMO1 KIAA0281
## 9 A0A2J8P183 yes NA NA CK820_G0007481
## 10 A0A7L0SVW8 yes NA Specc1 Specc1 PODPOD_R00299
## organism_name prot_length prot_mass no_geneid
## 1 human 612 65746
## 2 human 834 92861
## 3 horse 1119 126097
## 4 human 1486 161677
## 5 toucan 168 17271 *
## 6 NA NA NA *
## 7 Bos indicus x Bos taurus (Hybrid cattle) 47 5258 *
## 8 human 727 83829
## 9 Pan troglodytes (Chimpanzee) 689 79281 *
## 10 Podilymbus podiceps (Pied-billed grebe) 1069 118393 *
Plot the data to make sure there are no obviously unreasonable values
theme_set(theme_bw())
ggplot(data = df,
aes(x = prot_length)) +
geom_histogram() +
xlab("Protein length (amino acids)") +
ggtitle("Histogram of length of proteins in student data") +
theme(plot.title = element_text(size=22, face = "bold"),
axis.title.x = element_text(size = 18, angle = 0, hjust = .5, vjust = 0, face = "bold"),
axis.title.y = element_text(size = 18, angle = 90, hjust = .5, vjust = .5, face = "bold"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 6 rows containing non-finite values (`stat_bin()`).
theme_set(theme_bw())
ggplot(data = df,
aes(x = prot_length)) +
geom_histogram() +
xlab("Protein mass (kilodaltons; kDa)") +
ggtitle("Histogram of mass of proteins in student data") +
theme(plot.title = element_text(size=22, face = "bold"),
axis.title.x = element_text(size = 18, angle = 0, hjust = .5, vjust = 0, face = "bold"),
axis.title.y = element_text(size = 18, angle = 90, hjust = .5, vjust = .5, face = "bold"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 6 rows containing non-finite values (`stat_bin()`).
ggplot(data = df,
aes(x = prot_mass,
y = prot_length)) +
geom_point() +
geom_smooth(method='lm') +
ggtitle("Scatterplot of mass of proteins vsersus length") +
xlab("Protein mass") +
ylab("Protein length") +
theme(plot.title = element_text(size=22, face = "bold"),
axis.title.x = element_text(size = 18, angle = 0, hjust = .5, vjust = 0, face = "bold"),
axis.title.y = element_text(size = 18, angle = 90, hjust = .5, vjust = .5, face = "bold"))
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 6 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 6 rows containing missing values (`geom_point()`).