Building on Stephen Turner’s analysis here.
tl;dr:
length(unique(x)) appears to be faster than n_distinct(x) for both numeric and character vectors of length 10 to 10,000 and with varying amounts of duplication. The less duplication there is in the vector, the slower n_distinct(x) appears to be, with little effect on length(unique(x)).
library(dplyr)
library(microbenchmark)
set.seed(2015-08-05)  # evaluated arithmetically, so the seed is 2002
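time_distinct <- function(uniques, len, class) {
  # Draw `len` values from 1:uniques, then coerce to the requested class
  x <- sample(uniques, len, replace = TRUE)
  x <- as(x, class)
  m <- microbenchmark(n_distinct(x),
                      length(unique(x)),
                      times = 1000L)
  # Fix the unit (microseconds) so timings are comparable across runs
  ret <- summary(m, unit = "us")
  # Record the number of distinct values actually present in x
  ret$uniques <- length(unique(x))
  ret
}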
timings <- expand.grid(uniques = round(10 ^ seq(0, 5, 1)),
                       len = round(10 ^ seq(1, 4, 1)),
                       class = c("numeric", "character"),
                       stringsAsFactors = FALSE) %>%
  group_by(uniques, len, class) %>%
  do(do.call(time_distinct, .))
library(ggplot2)
ggplot(timings, aes(uniques, median, color = expr, lty = class)) +
  geom_line() +
  scale_x_log10() +
  geom_errorbar(aes(ymin = lq, ymax = uq), width = .1) +
  facet_wrap(~ len, scales = "free") +
  xlab("Number of unique elements in vector") +
  ylab("Median time")
Note that for all the vectors here, elements were sampled with equal probability (meaning the frequency of each value was binomial, and approximately Poisson distributed when there are many possible values). It is possible that vectors with a more lopsided distribution (e.g. a power law) will lead to different results.
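To check the lopsided case, one option is to sample with unequal probabilities. This is only a sketch and was not part of the benchmark above: sample_skewed() is a hypothetical helper, and the weights 1 / k^exponent are just one assumed way to get a power-law-like (Zipf-style) distribution.

```r
library(dplyr)
library(microbenchmark)

set.seed(1)

# Hypothetical helper: draw `len` values from 1:uniques, where value k is
# drawn with probability proportional to 1 / k^exponent (Zipf-like weights).
sample_skewed <- function(uniques, len, exponent = 1) {
  weights <- 1 / seq_len(uniques) ^ exponent
  as.numeric(sample.int(uniques, len, replace = TRUE, prob = weights))
}

# Same number of candidate values, equal versus skewed sampling probabilities
x_uniform <- as.numeric(sample.int(1000, 10000, replace = TRUE))
x_skewed  <- sample_skewed(1000, 10000)

microbenchmark(n_distinct(x_uniform),
               length(unique(x_uniform)),
               n_distinct(x_skewed),
               length(unique(x_skewed)),
               times = 100L)
```

Keep in mind that the skewed vector will usually contain fewer distinct values than the uniform one, so any difference in timing mixes the effect of skew with the effect of duplication.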
Also note that these timings are only for vectors outside a data frame. Preliminary results show n_distinct is usually faster when run within group_by() %>% summarize() (data not shown).
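The grouped comparison could be set up along these lines. This is a minimal sketch, not the code behind the preliminary results: the data frame df, its columns g and x, and the sizes (100 groups, 10,000 rows) are all assumptions chosen for illustration.

```r
library(dplyr)
library(microbenchmark)

set.seed(1)

# Assumed toy data: 10,000 rows spread over up to 100 groups
df <- data.frame(g = sample.int(100, 10000, replace = TRUE),
                 x = as.numeric(sample.int(1000, 10000, replace = TRUE)))
grouped <- group_by(df, g)

# Count distinct values of x within each group, both ways
microbenchmark(n_distinct    = summarize(grouped, n = n_distinct(x)),
               length_unique = summarize(grouped, n = length(unique(x))),
               times = 100L)
```

Grouping is done once, outside the benchmark, so that only the summarize() step is timed. Results are likely to depend on the dplyr version, since the way summarize() evaluates its expressions has changed over time.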
sessionInfo()
## R version 3.2.0 (2015-04-16)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.3 (Yosemite)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_1.0.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.0 digest_0.6.8 MASS_7.3-40 grid_3.2.0
## [5] plyr_1.8.3 gtable_0.1.2 formatR_1.2 magrittr_1.5
## [9] scales_0.2.5 evaluate_0.7 stringi_0.5-5 reshape2_1.4.1
## [13] rmarkdown_0.7 labeling_0.3 proto_0.3-10 tools_3.2.0
## [17] stringr_1.0.0 munsell_0.4.2 yaml_2.1.13 colorspace_1.2-6
## [21] htmltools_0.2.6 knitr_1.10.5