Make Your R Code Faster
In this tutorial we look at how to make R code run faster. It covers two topics: measuring code performance and improving code performance. The tutorial was run on a computer with the following configuration:
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] bookdown_0.22 digest_0.6.27 R6_2.5.0 jsonlite_1.7.2
## [5] magrittr_2.0.1 evaluate_0.14 rmdformats_1.0.2 stringi_1.6.2
## [9] rlang_0.4.11 jquerylib_0.1.4 bslib_0.2.5.1 rmarkdown_2.8
## [13] tools_4.1.0 stringr_1.4.0 xfun_0.23 yaml_2.2.1
## [17] compiler_4.1.0 htmltools_0.5.1.1 knitr_1.33 sass_0.4.0
Measuring Code Performance
Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered.
— Donald Knuth.
Before you can make your code faster, you first need to figure out what’s making it slow. This sounds easy, but it’s not. Even experienced programmers have a hard time identifying bottlenecks in their code.
install.packages("microbenchmark")
install.packages("tictoc")
install.packages("profvis")
install.packages("biglm")
install.packages("ggplot2movies")
install.packages("tidyverse")
install.packages("tidytable")
install.packages("furrr")
library(microbenchmark)
library(tictoc)
library(profvis)
Measuring the performance of code is commonly called code profiling. In R, profiling can be done by measuring how long each line of code takes to run. To help with this we will use the profvis package, which visualises the run time and the memory used by each line of code.
profvis({
data(movies, package = "ggplot2movies") # Load data
movies = movies[movies$Comedy == 1,]
plot(movies$year, movies$rating)
model = loess(rating ~ year, data = movies) # loess regression line
j = order(movies$year)
lines(movies$year[j], model$fitted[j]) # Add line to the plot
})
profvis({
x <- integer()
for (i in 1:1e4) {
x <- c(x, i)
}
})
Another way to profile code is to use the microbenchmark and tictoc packages.
The microbenchmark package times an expression over repeated runs. Its default is 100 repetitions; the examples below set times = 10.
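test1 <- function() {
  x <- integer()
  for (i in 1:1e4) {
    x <- c(x, i)
  }
}
test2 <- function() {
  data(movies, package = "ggplot2movies") # Load data
  movies = movies[movies$Comedy == 1,]
  plot(movies$year, movies$rating)
  model = loess(rating ~ year, data = movies) # loess regression line
  j = order(movies$year)
  lines(movies$year[j], model$fitted[j]) # Add line to the plot
}
# Call the functions with (); a bare name such as test1 only times looking up the object
microbenchmark(test1(), times = 10)
microbenchmark(test2(), times = 10)
The outputs below were recorded with the original calls microbenchmark(test1, times = 10) and microbenchmark(test2, times = 10), which do not run the functions, so they show nanoseconds rather than the real run time of each function:
## Unit: nanoseconds
## expr min lq mean median uq max neval
## test1 0 1 421 1 2 4102 10
## Unit: nanoseconds
## expr min lq mean median uq max neval
## test2 0 0 470.8 1 1 4601 10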
We can also compare the timings of two functions.
set.seed(123)
dta <- data.frame(x=rnorm(5e5),
y=rnorm(5e5))
library(biglm)
## Loading required package: DBI
bench1 <- microbenchmark(times = 10,
lm=lm(y~x,data = dta),
biglm=biglm(y~x,data = dta))
bench1
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## lm 67.8616 213.1343 203.2664 227.1873 231.8688 285.8814 10 a
## biglm 62.0750 73.4747 182.3374 148.4539 269.1159 381.3075 10 a
plot(bench1)
Besides microbenchmark, we can also use the tictoc package, which is simpler but less accurate because it times a single run.
tic()
x <- integer()
for (i in 1:1000) {
x <- c(x, i)
}
toc()
## 0 sec elapsed
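A single run of a loop this small finishes faster than the clock can resolve, which is why the elapsed time above reads as 0. As a rough cross-check that needs no extra packages, base R's `system.time()` can wrap many repetitions so that the total becomes measurable. This is only a minimal sketch: the helper name `grow_vector` and the repetition count are illustrative assumptions, not part of the tutorial.

```r
# Hypothetical helper: the same growing-vector loop that was timed with tic()/toc()
grow_vector <- function(n) {
  x <- integer()
  for (i in seq_len(n)) {
    x <- c(x, i)
  }
  x
}

# Repeat the call so the total elapsed time rises above the clock resolution
reps <- 200
timing <- system.time(
  for (r in seq_len(reps)) grow_vector(1000)
)

# Average elapsed seconds per call (the value depends on the machine)
timing[["elapsed"]] / reps
```

Dividing by the number of repetitions gives a per-call estimate, which is the same idea microbenchmark automates with finer timers.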
Improving Code Performance
In this section we go through tips that can be used to improve the performance of our code.
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.
— Donald Knuth
Once you’ve used profiling to identify a bottleneck, you need to make it faster. It’s difficult to provide general advice on improving performance, but I try my best with four techniques that can be applied in many situations. I’ll also suggest a general strategy for performance optimisation that helps ensure that your faster code is still correct.
It’s easy to get caught up in trying to remove all bottlenecks. Don’t! Your time is valuable and is better spent analysing your data, not eliminating possible inefficiencies in your code. Be pragmatic: don’t spend hours of your time to save seconds of computer time. To enforce this advice, you should set a goal time for your code and optimise only up to that goal. This means you will not eliminate all bottlenecks. Some you will not get to because you’ve met your goal. Others you may need to pass over and accept either because there is no quick and easy solution or because the code is already well optimised and no significant improvement is possible. Accept these possibilities and move on to the next candidate.
Before getting to the tips, there are two pitfalls to watch for when improving code performance:
- Writing faster but incorrect code.
- Writing code that you think is faster, but is actually no better.
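The first pitfall can be guarded against by checking that a candidate replacement returns the same result as the original before timing it. A minimal base R sketch follows; the names `mean_ref`, `mean_fast` and `x_check` are hypothetical and chosen only for illustration.

```r
mean_ref  <- function(x) mean(x)              # reference implementation
mean_fast <- function(x) sum(x) / length(x)   # candidate replacement

set.seed(123)
x_check <- rnorm(1e5)

# all.equal() compares with a numeric tolerance, which suits floating-point results;
# stopifnot() halts with an error if the two versions disagree
stopifnot(isTRUE(all.equal(mean_ref(x_check), mean_fast(x_check))))
```

A check like this is cheap to keep next to the benchmark, so a later "optimisation" that changes the result fails loudly instead of silently.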
To avoid the second pitfall, always profile the code as described earlier.
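mean1 <- function(x) mean(x)
mean2 <- function(x) sum(x) / length(x)
set.seed(123)
x <- rnorm(1e7)
# Call the functions on x; bare names such as mean1 only time looking up the objects
bench2 <- microbenchmark(mean1(x), mean2(x))
The output below was recorded with the original call microbenchmark(mean1, mean2), which does not run the functions, hence the warning and the nanosecond timings:
## Warning in microbenchmark(mean1, mean2): Could not measure a positive execution
## time for 30 evaluations.
bench2
## Unit: nanoseconds
## expr min lq mean median uq max neval cld
## mean1 0 0 5.85 1 1 501 100 a
## mean2 0 0 32.85 1 1 3201 100 a
plot(bench2)
Next we go through the tips for improving code performance.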
- Use the right package
The easiest way to improve code performance is to find the right package. One tip is to look for packages that have Rcpp as a dependency, that is, packages among the reverse dependencies of Rcpp. For the definition of a reverse dependency in R, see the Package Dependencies page. Rcpp's reverse dependencies are listed on its CRAN page.
We can also browse the CRAN Task Views and select High-Performance Computing.
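- Avoid using as.data.frame
as.data.frame performs many checks and conversions. When the input is already known to be a list of equal-length vectors, a minimal constructor such as quickdf below can skip them, at the cost of doing no input validation.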
quickdf <- function(l) {
class(l) <- "data.frame"
attr(l, "row.names") <- .set_row_names(length(l[[1]]))
l
}
l <- lapply(1:26, function(i) runif(1e3))
names(l) <- letters
bench3 <- microbenchmark(
as.data.frame = as.data.frame(l),
quick_df = quickdf(l)
)
bench3
## Unit: microseconds
## expr min lq mean median uq max neval cld
## as.data.frame 857.101 947.7010 1139.71290 1150.900 1229.4005 2426.301 100 b
## quick_df 5.601 8.3015 38.41796 12.301 19.7505 2311.900 100 a
plot(bench3)
- Vectorize your code
Vectorising your code is not just about avoiding for loops, although that’s often a step. Vectorising is about taking a whole-object approach to a problem, thinking about vectors, not scalars. There are two key attributes of a vectorised function:
- The loops in a vectorised function are written in C instead of R. Loops in C are much faster because they have much less overhead.
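To make the whole-object idea concrete, here is a small sketch comparing an explicit R loop with the equivalent vectorised expression. The function names `sum_sq_loop` and `sum_sq_vec` are hypothetical, and the timings will vary from machine to machine.

```r
# Scalar thinking: visit each element in an R-level loop
sum_sq_loop <- function(x) {
  total <- 0
  for (xi in x) {
    total <- total + xi^2
  }
  total
}

# Whole-object thinking: `^` and sum() iterate over the vector in C
sum_sq_vec <- function(x) sum(x^2)

set.seed(123)
v <- rnorm(1e5)

# Both versions should agree up to floating-point tolerance
stopifnot(isTRUE(all.equal(sum_sq_loop(v), sum_sq_vec(v))))

# Compare elapsed time over 20 repetitions (values depend on the machine)
system.time(for (r in 1:20) sum_sq_loop(v))
system.time(for (r in 1:20) sum_sq_vec(v))
```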
ttesku <- function(i){
x1 <- rnorm(100,mean = 20)
x2 <- rnorm(100,mean = 30)
t.test(x=x1,y=x2,mu=25)
}
bench5 <- microbenchmark(
test1 = for(i in 1:100) ttesku(i),
test2 = lapply(1:100, ttesku),
test3 = purrr::map(1:100,ttesku) #tidyverse
)
bench5
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## test1 10.115202 10.757801 12.58951 11.5169 13.10535 24.2689 100 b
## test2 8.763401 9.340201 11.30903 10.2926 11.89360 21.9016 100 a
## test3 8.864501 9.622001 11.51580 10.2728 11.97950 28.1515 100 a
plot(bench5)
Another example is bootstrapping.
set.seed(123)
x <- rnorm(1000)
bbku <- function(i,x){
mysamp <-sample(x = x,size = 1000,replace = TRUE)
mean(mysamp)
}
bench6 <- microbenchmark(
test1 = for(i in 1:100) bbku(i,x),
test2 = lapply(1:100, bbku,x=x),
test3 = purrr::map(1:100, bbku,x=x) #tidyverse
)
bench6
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## test1 5.617701 5.833051 6.855607 6.113951 6.807501 22.3854 100 b
## test2 4.310101 4.440851 5.288226 4.570451 5.209950 17.4523 100 a
## test3 4.616200 4.764801 5.875752 4.969851 5.503801 17.7264 100 a
plot(bench6)
- Use the data.table package
When working with large data, the data.table package is generally faster than dplyr or base R.
set.seed(123)
dta <- purrr::map_dfc(1:50,~rnorm(1e4))
## New names:
## * NA -> ...1
## * NA -> ...2
## * NA -> ...3
## * NA -> ...4
## * NA -> ...5
## * ...
bench7 <- microbenchmark(
dplyr= dplyr::select(dta,25:40),
data_table = dta[,25:40] # dta is a tibble here, so this times plain bracket subsetting rather than data.table
)
bench7
## Unit: microseconds
## expr min lq mean median uq max neval cld
## dplyr 1045.300 1251.3505 1712.36696 1479.451 1697.6005 10258.500 100 b
## data_table 17.402 18.6515 27.95597 24.451 27.4505 165.601 100 a
plot(bench7)
Further examples that use dplyr are available on the web.
In general, dplyr syntax is easier to learn than data.table syntax.
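For comparison, column selection on a genuine data.table object could be sketched as follows. This assumes the data.table package is installed (it is not in the install list at the top of the tutorial), and the object names `df`, `dt` and `sub_dt` are illustrative only.

```r
library(data.table)

set.seed(123)
# 50 numeric columns of 1e4 rows, built with base R
df <- as.data.frame(matrix(rnorm(50 * 1e4), ncol = 50))
dt <- as.data.table(df)

# Select columns 25 to 40 by position using .SDcols
sub_dt <- dt[, .SD, .SDcols = 25:40]

# The result is still a data.table with 16 columns
dim(sub_dt)
```

Wrapping `dt[, .SD, .SDcols = 25:40]` and `dplyr::select(df, 25:40)` in one microbenchmark() call would give a like-for-like comparison on the same data.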
- Use parallel programming
library(furrr)
## Loading required package: future
availableCores()
## system
## 16
plan(multisession, workers = 4)
set.seed(123)
x <- rnorm(1000)
bbku <- function(i,x){
mysamp <-sample(x = x,size = 1000,replace = TRUE)
mean(mysamp)
}
bench8 <- microbenchmark(
test1 = for(i in 1:100) bbku(i,x),
test2 = lapply(1:100, bbku,x=x),
test3 = purrr::map(1:100, bbku,x=x)#tidyverse
,
test4 = future_map(1:100, bbku,x=x)
)
bench8
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## test1 5.713401 5.869851 7.223552 6.238551 7.070051 17.9296 100 a
## test2 4.293901 4.475751 5.252349 4.606651 5.354650 14.1775 100 a
## test3 4.622502 4.756151 5.567922 4.913752 5.337051 12.6565 100 a
## test4 39.058501 45.474251 61.492798 50.146550 59.570851 794.1181 100 b
plot(bench8)
The results above show that the parallel version is slower than the ordinary approaches, most likely because the cost of sending work to the workers outweighs the small amount of computation in each task. We therefore repeat the experiment with a different scenario.
bbku <- function(i,x){
mysamp <-sample(x = x,size = 50,replace = TRUE)
mean(mysamp)
}
bench9 <- microbenchmark(times = 5,
test3 = purrr::map(1:10000, bbku,x=x)#tidyverse
,
test4 = future_map(1:10000, bbku,x=x)
)
bench9
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## test3 88.6522 97.4297 119.91036 108.6018 148.4616 156.4065 5 a
## test4 83.9375 86.9011 88.88176 86.9415 93.0463 93.5824 5 a
plot(bench9)
- Use Rcpp
The Rcpp package helps speed up code because it lets us write the slow parts in C++. Rcpp is the bridge between R and C++.
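As a minimal sketch of what this looks like in practice, `Rcpp::cppFunction()` compiles a C++ function given as a string and makes it callable from R. This assumes the Rcpp package and a working C++ toolchain (Rtools on Windows) are installed, neither of which is in the install list above; the function name `sum_cpp` is hypothetical.

```r
library(Rcpp)

# Compile a small C++ function and expose it to R under the name sum_cpp
cppFunction("
double sum_cpp(NumericVector x) {
  double total = 0;
  int n = x.size();
  for (int i = 0; i < n; ++i) {
    total += x[i];
  }
  return total;
}
")

set.seed(123)
x <- rnorm(1e6)

# The C++ loop should agree with base R's sum() up to floating-point tolerance
stopifnot(isTRUE(all.equal(sum_cpp(x), sum(x))))
```

For a function this simple, base R's `sum()` is already implemented in C, so the benefit of Rcpp shows up mainly for loops that cannot be expressed with existing vectorised functions.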