When counting the number of individuals within a sample with a particular genotype at thousands of loci across the genome, it is important that calculations are efficient. It is not uncommon to screen individuals at > 50,000 SNP loci, with sample size often being > 30. Factoring in the possibility of multiple population samples, the number of computations can become very large.
The example below compares three methods of counting the number of individuals of a particular genotype class.
table: base R function.tabulate: base R function (The ‘workhorse’ of table).Tab: A simple Rcpp definition of the table function (see below).# read some genotype data using internal diveRsity function
dat <- diveRsity:::rgp("test.gen")
# extract an example locus
gt <- apply(dat$genos[[1]][,1,], 1, paste, collapse = "")
# remove "NANA" (paste coerces NA to "NA")
gt[gt == "NANA"] <- NA
gt <- as.vector(na.omit(gt))
gt
## [1] "126130" "126130" "130130" "126130" "126130" "130130" "126130"
## [8] "126126" "126130" "126130" "126130" "130130" "126130" "130130"
## [15] "126130" "130130" "126130" "130130" "126126" "130130" "126130"
## [22] "126130" "126130" "130130" "126126" "126130" "130134" "126126"
## [29] "126130" "118126" "126130" "130134" "130130" "126130" "130130"
## [36] "130130" "126126" "130130" "126138" "126130" "126130" "126126"
## [43] "126126" "126130" "126130" "126126"
Rcpp::cppFunction('IntegerVector Tab(CharacterVector x){
return table(x);
}')
library(microbenchmark)
microbenchmark(
table = table(gt),
tabulate = tabulate(as.factor(gt)),
Tab = Tab(gt),
times = 100000
)
## Unit: microseconds
## expr min lq median uq max neval
## table 197.135 206.116 210.819 225.358 128375.6 1e+05
## tabulate 108.617 115.459 118.452 125.721 141217.6 1e+05
## Tab 5.559 7.698 11.974 13.257 122821.2 1e+05
tabulates performance without as.factorThe as.factor expression adds time to the tabulate evaluation. Although it is not relaistic for my purpose, here are the comparisons without it.
gtf <- as.factor(gt)
library(microbenchmark)
microbenchmark(
table = table(gt),
tabulate = tabulate(gtf),
Tab = Tab(gt),
times = 100000
)
## Unit: microseconds
## expr min lq median uq max neval
## table 196.279 205.688 209.964 222.793 128809.646 1e+05
## tabulate 9.408 11.974 15.395 17.106 3846.047 1e+05
## Tab 5.559 7.698 11.546 12.829 2593.965 1e+05