Introduction

In April 2014, after seeing somewhat futile discussions about the slowness of lm() in comparison with the speed of Rcpp (where, I think, some smart people more appropriately compared the compiled code with lm.fit()), I wondered whether I should create a version of lm.fit() that does just the basic computation with as little overhead as possible, under the side constraint of very little new code in R.

The result was the .lm.fit() function (with an initial dot just so Joe Average User would not even see it: it is meant for advanced users only).
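
As a tiny illustration of the interface (a minimal sketch; see ?.lm.fit): unlike lm(), there is no formula and no data frame, just a ready-made model matrix x and a response y, and the result is a bare list without any post-processing:

X <- cbind(1, 1:7)               ## model matrix: explicit intercept column + one covariate
y <- 2 + 3 * (1:7) + rnorm(7)
z <- .lm.fit(X, y)
z$coefficients                   ## unnamed; should be close to c(2, 3)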

Small Speed Demonstration: Bootstrapping Linear Regressions

Generate data:

x2 <- rep(1:10, 10)          ## 1,2,...,10 repeated 10 times             (n = 100)
x3 <- rep(1:10, each = 10)   ## 1,...,1, 2,...,2, ..., 10,...,10  (crossed with x2)

true.coef <- c(1, -2, 3)     ## (intercept, coef of x2, coef of x3)
Xb <- cbind(1, x2, x3) %*% true.coef
y  <- Xb + rnorm(Xb)         ## rnorm(Xb) draws length(Xb) standard normal errors
dat <- data.frame(y, x2, x3)
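
As a quick sanity check (the exact values depend on the random seed, which we have not fixed here), the full-data fit should be close to true.coef:

coef(lm(y ~ x2 + x3, data = dat))   ## approximately  1, -2, 3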

Now define the basic regression functions, each returning a length-3 vector of estimated coefficients (intercept, x2, x3).

## The "default" based on  lm() :
lmcoefs <- function(data, ind)
  coef(lm(y ~ x2 + x3, data = data[ind, ]))

## good ole'  lsfit() --- clearly faster
lsfCf <- function(data, ind) {
    d <- as.matrix(data)[ind,,drop=FALSE]
    coef(lsfit(d[,2:3], d[,1]))# does include intercept
}

## The (almost 10x) faster versions, using  lm.fit() :
lmc2 <- function(data, ind) {
    d <- as.matrix(data)[ind,,drop=FALSE]
    coef(lm.fit(cbind(1, d[,c("x2","x3")]), d[,"y"]))
}
## integer indexing ..
lmc2n <- function(data, ind) {
    d <- as.matrix(data)[ind,,drop=FALSE]
    coef(lm.fit(cbind(1, d[,2:3]), d[,1]))
}

## in R >= 3.1.0: .lm.fit()
lmcF <- function(data, ind) {
    d <- as.matrix(data)[ind,,drop=FALSE]
    coef(.lm.fit(cbind(1,d[,c("x2","x3")]), d[,"y"]))
}
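
One hedged caveat, which does not bite for the well-conditioned X used here: .lm.fit() returns the coefficients in the (possibly pivoted) column order of its QR decomposition and, unlike lm.fit(), does not re-order them back. A general-purpose wrapper should undo the pivot, e.g. along these lines (coefLmFit is an illustrative name, not part of R):

coefLmFit <- function(X, y) {
    z <- .lm.fit(X, y)
    z$coefficients[order(z$pivot)]  ## back into the original column order of X
}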

Now we do a (small, for demo purposes) bootstrap of this and use the microbenchmark package for speed comparisons, including graphical ones:

## B <- 1000 # realistic
B <- 64 # small, as we will *repeat* in microbenchmark

n <- nrow(dat)
set.seed(1)
str(inds <- replicate(B, sample(n,n, replace=TRUE)))
##  int [1:100, 1:64] 27 38 58 91 21 90 95 67 63 7 ...
## is  n x B --> will use *columns* of inds -->  apply(inds, 2, *)  below

library("microbenchmark")
## takes about 10 sec
mb <- microbenchmark(lm      = c1 <- apply(inds, 2, lmcoefs, data=dat),
                     lm.fit  = c2 <- apply(inds, 2, lmc2,    data=dat),
                     lm.fi.i = c3 <- apply(inds, 2, lmc2n,   data=dat),
                     lsfit   = c4 <- apply(inds, 2, lsfCf,   data=dat),
                     .lm.fit = c5 <- apply(inds, 2, lmcF,    data=dat))

and then “prove” that all 5 ways are equivalent (check.attributes = FALSE is needed since, e.g., coefficient names differ between the methods):

AE <- function(x,y, tol = 1e-14)
    all.equal(x,y, tolerance=tol, check.attributes=FALSE)

stopifnot(AE(c1, c5), AE(c2, c5),
          AE(c3, c5), AE(c4, c5))

so indeed, all are equal.

From the benchmark tabular result,

options(width = 90)
mb
## Unit: milliseconds
##     expr       min        lq      mean    median        uq       max neval cld
##       lm 55.194260 56.165809 60.393790 57.231655 59.525773 189.45719   100   c
##   lm.fit  6.307245  6.509221  6.838964  6.663676  6.911828  12.08894   100 ab 
##  lm.fi.i  6.378394  6.540182  7.147848  6.668329  6.886033  21.08978   100  b 
##    lsfit  7.188545  7.350980  7.765667  7.553481  7.962530   9.70730   100  b 
##  .lm.fit  4.018050  4.127954  4.526282  4.209046  4.487509  24.29353   100 a

it is already clear that .lm.fit() wins hands down. Note, however, that the cld column (a compact letter display, based on pairwise tests of equality) is not robust: as the following log-scaled graphics show, the timing distributions are strongly right-skewed, so outliers are quite “dominant”.
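
The win can be quantified from the medians, e.g. via summary() (a small sketch; summary(mb) returns the same numbers as the printed table):

s <- summary(mb)
med <- setNames(s$median, s$expr)
med["lm"]     / med[".lm.fit"]  ## ~ 13.6  (.lm.fit() vs lm())
med["lm.fit"] / med[".lm.fit"]  ## ~  1.6  (.lm.fit() vs lm.fit())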

So we should not trust the cld column above very much, but rather look at the (notched) boxplots:

boxplot(mb, unit = "ms", notch = TRUE)  ## time axis is log-scaled by default for microbenchmark's boxplot method
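
Alternatively, if the ggplot2 package is installed, microbenchmark provides an autoplot() method (assuming a reasonably recent version of the package):

library(ggplot2)
autoplot(mb)  ## violin-style display of the timing distributions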