In April 2014, after seeing somewhat futile discussions about the slowness of lm()
in comparision with the speed of Rcpp
, where I think some smart people did compare the compiled code more appropriately with lm.fit()
, I wondered if I should not create a version of lm.fit()
which is really mostly about doing the basic computation and as little overhead as possible under side constraint of very little new code in R.
The result was the .lm.fit()
function (with an initial dot just so Joe Average User would not even see it, because it should be for advanced users only).
x2 <- rep(1:10, 10)
x3 <- rep(1:10, each = 10)
true.coef <- c(1, -2, 3)
Xb <- cbind(1, x2, x3) %*% true.coef
y <- Xb + rnorm(Xb)
dat <- data.frame(y,x2,x3)
Now define the basic regression functions, each returning a length-2 vector of estimated coefficients.
## The "default" based on lm() :
lmcoefs <- function(data, ind)
coef(lm(y ~ x2 + x3, data = data[ind, ]))
## good ole' lsfit() --- clearly faster
lsfCf <- function(data, ind) {
d <- as.matrix(data)[ind,,drop=FALSE]
coef(lsfit(d[,2:3], d[,1]))# does include intercept
}
## The (almost 10 x ) faster versions, using lm.fit() :
lmc2 <- function(data, ind) {
d <- as.matrix(data)[ind,,drop=FALSE]
coef(lm.fit(cbind(1, d[,c("x2","x3")]), d[,"y"]))
}
## integer indexing ..
lmc2n <- function(data, ind) {
d <- as.matrix(data)[ind,,drop=FALSE]
coef(lm.fit(cbind(1, d[,2:3]), d[,1]))
}
## in R >= 3.1.0: .lm.fit()
lmcF <- function(data, ind) {
d <- as.matrix(data)[ind,,drop=FALSE]
coef(.lm.fit(cbind(1,d[,c("x2","x3")]), d[,"y"]))
}
Now we do a (small, for demo purposes) bootstrap of this, and use the microbenchmark
package for speed comparisons, including graphical:
## B <- 1000 # realistic
B <- 64 # small, as we will *repeat* in microbenchmark
n <- nrow(dat)
set.seed(1)
str(inds <- replicate(B, sample(n,n, replace=TRUE)))
## int [1:100, 1:64] 27 38 58 91 21 90 95 67 63 7 ...
## is n x B --> will use *columns* of inds --> apply(inds, 2, *) below
library("microbenchmark")
## takes about 10 sec
mb <- microbenchmark(lm = c1 <- apply(inds, 2, lmcoefs, data=dat),
lm.fit = c2 <- apply(inds, 2, lmc2, data=dat),
lm.fi.i = c3 <- apply(inds, 2, lmc2n, data=dat),
lsfit = c4 <- apply(inds, 2, lsfCf, data=dat),
.lm.fit = c5 <- apply(inds, 2, lmcF, data=dat))
and do “prove” that all 5 ways are “equivalent”:
AE <- function(x,y, tol = 1e-14)
all.equal(x,y, tolerance=tol, check.attributes=FALSE)
stopifnot(AE(c1, c5), AE(c2, c5),
AE(c3, c5), AE(c4, c5))
so indeed, all are equql.
From the benchmark tabular result,
options(width = 90)
mb
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## lm 55.194260 56.165809 60.393790 57.231655 59.525773 189.45719 100 c
## lm.fit 6.307245 6.509221 6.838964 6.663676 6.911828 12.08894 100 ab
## lm.fi.i 6.378394 6.540182 7.147848 6.668329 6.886033 21.08978 100 b
## lsfit 7.188545 7.350980 7.765667 7.553481 7.962530 9.70730 100 b
## .lm.fit 4.018050 4.127954 4.526282 4.209046 4.487509 24.29353 100 a
it is already clear that .lm.fit()
wins hands down; note that the cld
(based on hypothesis of equality testing) is not robust, and as you can see from the following log-scaled graphics, indeed, outliers in a strongly right skewed distribution are quite “dominant”.
So we should not trust the cld
column above, very much but rather look at the (notched) boxplots:
boxplot(mb, unit="ms", notch=TRUE)