archSeries::aorist and my version of the same analysis; my_aorist.I performed this purely out of interest in how the packages approach (data.table and for loops) and my approach (data.table::foverlaps() and dplyr). The key difference to these approaches is that archSeries::aorist() uses a loop and a series of comparisons to find date range overlaps, where as my_aorist uses the data.table::foverlaps() function to do this. The foverlaps() function is a C++ implementation of a range join. There is also a slightly different approach to the two functions calculations of the aoristic weight, but it does not effect execution time or the results (to a very small rounding error).
my_aorist() executes faster when the site/event sample is less the 5,000. At less than 1,000 sites/events, it is about 0.25 seconds faster.aorist() executes faster over about 5,000 sites/events. At about 10,000 sites/events it is about 0.25 seconds faster.my_aorist() scales poorly with an increase of about 0.55 seconds per 10,000 sites/eventaorist() scales very well, nearly linearly, with an increase of only about 0.03 seconds per 10,000 sites/eventsfoverlaps() was scaled so much worse than the supposedly slower implementation of a for loop. The answer is dplyr!dplyr code, the my_aorist() with foverlaps() executes much quicker and scales as well as archSeries::aorist()data.table munge functions and data.table::foverlaps() may be the best compromise of computation time.In the function profiling at the bottom of this analysis, using profVis, I think I found the answer to the scaling issue. The biggest time usage of the aorist() is <= and %in% operator that is executed somewhere within the data.table evaluations (not explicit in the function code). The slowest part of the my_aorist() is in the series of tryCatch and doTryCatch that execute within mutate_tbl.df(), presumably within dplyr::mutate() that I use. So the foverlaps() is not the culprit and leads to the overall quicker execution. This idea was confirmed when I repeated the below code with a version of my_aorist() that removed the dplyr code; in which the foverlaps() code scale equally to aorist(), but 0.31 seconds faster on average across a range of 10 to 10,000 sites/events.
The dplyr data munge in the my_aorist() function does not scale well and slows it down pretty dramatically once over 5,000 sites/events. This may be entirely because of my crappy coding, but I’ll have try to see if that can improve the time. The foverlaps() is lightning fast and without the dplyr slowdown makes my_aorist() much faster than aorist() across a full range of 10 to 10,000 sites/events. It might be interesting to work foverlaps() into the archSeries::aorist() function for a speed boost. Does 0.3 seconds matter to anyone running this analysis; probably not. However, it was fun to find out how these different implementations work.
library("ggplot2")
library("profvis")
library("dplyr")
library("tidyr")
library("archSeries")
archSeries::aorist() functionI am testing this as a Global environment function to eliminate any time difference that running it from the installed archSeries package may lead to. (PS - there is no difference, I ran it both ways)
archSeries_aorist <- function (data, weight = 1, start.date = 0, end.date = 2000,
bin.width = 100){
require(data.table)
data <- data.table(cbind(data, weight))
data <- data[End >= start.date & Start <= end.date]
data[, `:=`(duration, End - Start)]
data[, `:=`(weight.per.unit, weight/duration)]
breaks <<- seq(start.date, end.date, bin.width)
labels <- numeric(0)
for (i in 1:(length(breaks) - 1)) {
labels[i] <- paste(breaks[i], breaks[i + 1], sep = "-")
}
aorist <- data.table(bin = labels, bin.no = 1:length(labels),
aorist = 0)
for (i in 1:length(labels)) {
bin.1 <- breaks[i]
bin.2 <- breaks[i + 1]
data[, `:=`(assign("a", labels[i]), 0)]
data[Start >= bin.1 & Start < bin.2, `:=`(assign("a",
labels[i]), (bin.2 - Start) * weight.per.unit)]
data[End > bin.1 & End <= bin.2, `:=`(assign("a", labels[i]),
(End - bin.1) * weight.per.unit)]
data[Start < bin.1 & End > bin.2, `:=`(assign("a", labels[i]),
bin.width * weight.per.unit)]
data[Start >= bin.1 & End <= bin.2, `:=`(assign("a",
labels[i]), as.double(weight))]
aorist$aorist[i] <- sum(data[, get(labels[i])], na.rm = TRUE)
}
aorist
}
my_aorist() functionHere is my implementation of the aorist function using data.table::foverlaps() and dplyr.
my_aorist <- function(data, weight = 1, start.date = 0, end.date = 2000,
bin.width = 100, round_int = 4) {
require(dplyr)
require(data.table)
time_steps <- data.frame(
step = seq(1:(abs(end.date)/bin.width)), # end not start b/c pos dates
Start = seq(start.date,(end.date-bin.width), by = bin.width),
End = seq((start.date+bin.width),end.date, by = bin.width))
setDT(time_steps)
setDT(data)
setkey(time_steps, Start, End)
overlap_names <- data.table::foverlaps(data, time_steps,
type = "any", which = FALSE) %>%
filter(i.Start != End & i.End != Start)
aorist <- overlap_names %>%
data.frame() %>%
group_by(name) %>%
mutate(site_count = n(),
W = (bin.width / (i.End - i.Start)), # SWITCHED b/c of pos dates
W = round(W,round_int)) %>%
arrange(name, step) %>%
group_by(step) %>%
summarise(step_W = sum(W),
median_step_W = median(W),
site_count = n()) %>%
right_join(., time_steps, by = "step") %>%
replace_na(list(step_W = 0, site_count = 0, median_step_W = 0)) %>%
mutate(bin.no = step,
bin = paste0(Start, "-", End),
aorist = step_W) %>%
dplyr::select(bin, bin.no, aorist)
return(aorist)
}
The for loop to execute both functions over a range of sites/events and record the execution time. I commented out the progress bar code because I don’t think knitr will appreciate it.
execute_time_results <- data.frame(execute_time_results) %>%
mutate(diff = my_aorist - aorist)
execute_time_summary <- execute_time_results %>%
summarise(site_count_min = floor(min(site_count)),
site_count_max = floor(max(site_count)),
mean_aorist = round(mean(aorist),3),
mean_my_aorist = round(mean(my_aorist),3),
max_diff = round(min(diff),3), # reverse b/c negative time
min_diff = round(max(diff),4), # reverse b/c negative time
mean_diff = round(mean(diff),3)) %>%
t()
print(execute_time_summary)
## [,1]
## site_count_min 10.0000
## site_count_max 10000.0000
## mean_aorist 0.2700
## mean_my_aorist 0.2450
## max_diff -0.3590
## min_diff 0.2682
## mean_diff -0.0250
quick check to see the rates of execution time increases over range of sites/events
aor_lm <- lm(aorist ~ site_count, data = execute_time_results)
aor_lm <- (as.numeric(coef(aor_lm))[2]) * 10000
my_aor_lm <- lm(my_aorist ~ site_count, data = execute_time_results)
my_aor_lm <- (as.numeric(coef(my_aor_lm))[2]) * 10000
message(paste0("Initially, my_aoristic is ", round(execute_time_results[2,"diff"],3),
" seconds faster at ", execute_time_results[2,"site_count"], " sites, is equivalent at 5000 sites, and is ",
round(execute_time_results[201,"diff"],3), " seconds slower at 10,000 sites. For every 1000 sites added: aoristic = ", round(aor_lm,4), " seconds increase; my_aoristic = ",
round(my_aor_lm,4), " seconds increase in execution time."))
## Initially, my_aoristic is -0.205 seconds faster at 50 sites, is equivalent at 5000 sites, and is 0.239 seconds slower at 10,000 sites. For every 1000 sites added: aoristic = 0.0098 seconds increase; my_aoristic = 0.4158 seconds increase in execution time.
plot_dat <- execute_time_results %>%
dplyr::select(site_count, aorist, my_aorist) %>%
gather(model, time, -site_count)
ggplot(plot_dat, aes(x = site_count, y = time, group = model, color = model)) +
geom_line(size = 1) +
theme_bw()
These are best to use the print() function to open the profiles in your browser
profvis({
aor_profile <- aorist(sites2, end.date = 5000, bin.width = 100)
# print(aor_profile)
})
profvis({
my_aor_profile <- my_aorist(sites2, end.date = 5000, bin.width = 100)
# print(aor_profile)
})
dplyrThe same code chunks as above mashed into one chunk with a the only change being the commenting out of dplyr code from my_aorist2() function. While this shows the speed increase that is achieved by foverlaps(), it is not a fair comparison as the removal of mutate() will still need to be replaced by some data munging. Or, just deal with the half-second of execution time and move along :)
my_aorist2 <- function(data, weight = 1, start.date = 0, end.date = 2000,
bin.width = 100, round_int = 4) {
require(dplyr)
require(data.table)
time_steps <- data.frame(
step = seq(1:(abs(end.date)/bin.width)), # end not start b/c pos dates
Start = seq(start.date,(end.date-bin.width), by = bin.width),
End = seq((start.date+bin.width),end.date, by = bin.width))
setDT(time_steps)
setDT(data)
setkey(time_steps, Start, End)
overlap_names <- data.table::foverlaps(data, time_steps,
type = "any", which = FALSE) %>%
filter(i.Start != End & i.End != Start)
aorist <- overlap_names # %>%
# data.frame() %>%
# group_by(name) %>%
# mutate(site_count = n(),
# W = (bin.width / (i.End - i.Start)), # SWITCHED b/c of pos dates
# W = round(W,round_int)) %>%
# arrange(name, step) %>%
# group_by(step) %>%
# summarise(step_W = sum(W),
# median_step_W = median(W),
# site_count = n()) %>%
# right_join(., time_steps, by = "step") %>%
# replace_na(list(step_W = 0, site_count = 0, median_step_W = 0)) %>%
# mutate(bin.no = step,
# bin = paste0(Start, "-", End),
# aorist = step_W) %>%
# dplyr::select(bin, bin.no, aorist)
return(aorist)
}
execute_time_results <- matrix(nrow = length(sim_site_count_seq), ncol = 3)
colnames(execute_time_results) <- c("site_count", "aorist", "my_aorist")
# pb <- txtProgressBar(min = 0, max = length(sim_site_count_seq), style = 3, char="*")
for(i in seq_along(sim_site_count_seq)){
sim_site_count <- sim_site_count_seq[i]
sites2 <- data.frame(
name = seq(1:sim_site_count),
begin = sample(seq(-5000,-1000, by = 100), sim_site_count, replace = TRUE)) %>%
mutate(end = begin + sample(seq(100, 3000, by = 100),
length(begin), replace = TRUE),
end = ifelse(end > 0, 0, end)) %>%
dplyr::select(begin, end, name) %>%
mutate(begin = abs(begin), end = abs(end)) %>%
dplyr::rename(Start = end, End = begin)
aor.start.time <- Sys.time()
aor <- archSeries_aorist(sites2, end.date = 5000, bin.width = 100)
aor.end.time <- Sys.time()
my.aor.start.time <- Sys.time()
my_aor <- my_aorist2(sites2, end.date = 5000, bin.width = 100) # my_aorist2()
my.aor.end.time <- Sys.time()
aor.execute.time <- aor.end.time - aor.start.time
my.aor.execute.time <- my.aor.end.time - my.aor.start.time
execute_time_results[i,1] <- sim_site_count
execute_time_results[i,2] <- aor.execute.time
execute_time_results[i,3] <- my.aor.execute.time
# setTxtProgressBar(pb, i)
}
# close(pb)
execute_time_results <- data.frame(execute_time_results) %>%
mutate(diff = my_aorist - aorist)
execute_time_summary <- execute_time_results %>%
summarise(site_count_min = floor(min(site_count)),
site_count_max = floor(max(site_count)),
mean_aorist = round(mean(aorist),3),
mean_my_aorist = round(mean(my_aorist),3),
max_diff = round(min(diff),3), # reverse b/c negative time
min_diff = round(max(diff),4), # reverse b/c negative time
mean_diff = round(mean(diff),3)) %>%
t()
print(execute_time_summary)
## [,1]
## site_count_min 10.0000
## site_count_max 10000.0000
## mean_aorist 0.2490
## mean_my_aorist 0.0250
## max_diff -0.3960
## min_diff -0.1475
## mean_diff -0.2230
aor_lm <- lm(aorist ~ site_count, data = execute_time_results)
aor_lm <- (as.numeric(coef(aor_lm))[2]) * 10000
my_aor_lm <- lm(my_aorist ~ site_count, data = execute_time_results)
my_aor_lm <- (as.numeric(coef(my_aor_lm))[2]) * 10000
message(paste0("Initially, my_aoristic is ", round(execute_time_results[2,"diff"],3),
" seconds faster at ", execute_time_results[2,"site_count"], " sites, is equivalent at 5000 sites, and is ",
round(execute_time_results[201,"diff"],3), " seconds slower at 10,000 sites. For every 1000 sites added: aoristic = ", round(aor_lm,4), " seconds increase; my_aoristic = ",
round(my_aor_lm,4), " seconds increase in execution time."))
## Initially, my_aoristic is -0.196 seconds faster at 50 sites, is equivalent at 5000 sites, and is -0.287 seconds slower at 10,000 sites. For every 1000 sites added: aoristic = 0.0403 seconds increase; my_aoristic = 0.0426 seconds increase in execution time.
plot_dat <- execute_time_results %>%
dplyr::select(site_count, aorist, my_aorist) %>%
gather(model, time, -site_count)
ggplot(plot_dat, aes(x = site_count, y = time, group = model, color = model)) +
geom_line(size = 1) +
theme_bw()