Which method of tabulating genotypes is fastest?

When counting the number of individuals within a sample with a particular genotype at thousands of loci across the genome, it is important that calculations are efficient. It is not uncommon to screen individuals at > 50,000 SNP loci, with sample size often being > 30. Factoring in the possibility of multiple population samples, the number of computations can become very large.

The example below compares three methods of counting the number of individuals of a particular genotype class.

  1. table: base R function.
  2. tabulate: base R function (The ‘workhorse’ of table).
  3. Tab: A simple Rcpp definition of the table function (see below).

Prepare some suitable data

# read some genotype data using internal diveRsity function
dat <- diveRsity:::rgp("test.gen")
# extract an example locus
gt <- apply(dat$genos[[1]][,1,], 1, paste, collapse = "")
# remove "NANA" (paste coerces NA to "NA")
gt[gt == "NANA"] <- NA
gt <- as.vector(na.omit(gt))
gt
##  [1] "126130" "126130" "130130" "126130" "126130" "130130" "126130"
##  [8] "126126" "126130" "126130" "126130" "130130" "126130" "130130"
## [15] "126130" "130130" "126130" "130130" "126126" "130130" "126130"
## [22] "126130" "126130" "130130" "126126" "126130" "130134" "126126"
## [29] "126130" "118126" "126130" "130134" "130130" "126130" "130130"
## [36] "130130" "126126" "130130" "126138" "126130" "126130" "126126"
## [43] "126126" "126130" "126130" "126126"

Define an Rcpp Tab function

Rcpp::cppFunction('IntegerVector Tab(CharacterVector x){
                  return table(x);
                  }')

Some benchmarking

library(microbenchmark)
microbenchmark(
  table = table(gt),
  tabulate = tabulate(as.factor(gt)),
  Tab = Tab(gt),
  times = 100000
)
## Unit: microseconds
##      expr     min      lq  median      uq      max neval
##     table 197.135 206.116 210.819 225.358 128375.6 1e+05
##  tabulate 108.617 115.459 118.452 125.721 141217.6 1e+05
##       Tab   5.559   7.698  11.974  13.257 122821.2 1e+05

tabulates performance without as.factor

The as.factor expression adds time to the tabulate evaluation. Although it is not relaistic for my purpose, here are the comparisons without it.

gtf <- as.factor(gt)
library(microbenchmark)
microbenchmark(
  table = table(gt),
  tabulate = tabulate(gtf),
  Tab = Tab(gt),
  times = 100000
)
## Unit: microseconds
##      expr     min      lq  median      uq        max neval
##     table 196.279 205.688 209.964 222.793 128809.646 1e+05
##  tabulate   9.408  11.974  15.395  17.106   3846.047 1e+05
##       Tab   5.559   7.698  11.546  12.829   2593.965 1e+05