Recently Norm Matloff described a formulation for random effects models in a paper and a blog post. One example he mentions involves a sample of about 1 million movie ratings from https://movielens.org. The sample can be downloaded from http://grouplens.org/datasets/movielens/ as a zip archive ml-1m.zip.
The zip archive contains files ratings.dat, movies.dat and users.dat in a directory ml-1m. We can use an unz connection to access the ratings.dat directly. The format is a bit unusual in that the field separator is two consecutive colons (::). Specifying the separator as : produces seven fields of which only the odd-numbered fields are meaningful.
str(dd <- read.delim(unz("/home/bates/Downloads/ml-1m.zip","ml-1m/ratings.dat"),sep=':',header=FALSE))
## 'data.frame': 1000209 obs. of 7 variables:
## $ V1: int 1 1 1 1 1 1 1 1 1 1 ...
## $ V2: logi NA NA NA NA NA NA ...
## $ V3: int 1193 661 914 3408 2355 1197 1287 2804 594 919 ...
## $ V4: logi NA NA NA NA NA NA ...
## $ V5: int 5 3 3 4 5 3 5 5 4 4 ...
## $ V6: logi NA NA NA NA NA NA ...
## $ V7: int 978300760 978302109 978301968 978300275 978824291 978302268 978302039 978300719 978302268 978301368 ...
We create a data frame from user, movie and rating and save it in compressed RDS (R data set) format.
ml1m <- data.frame(user = factor(dd[[1]]), movie = factor(dd[[3]]), rating = as.double(dd[[5]]))
rm(dd)
saveRDS(ml1m,file="ml1m.rds",compress='xz')
A preliminary model would be a straight variance-component model to examine the variability due to movie, the variability due to user and the residual variability. Using the lmer function from the lme4 package this model is fit as
system.time(m1 <- lmer(rating ~ 1 + (1|user) + (1|movie), ml1m,REML=FALSE))
## user system elapsed
## 346.012 232.872 165.140
summary(m1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: rating ~ 1 + (1 | user) + (1 | movie)
## Data: ml1m
##
## AIC BIC logLik deviance df.resid
## 2663980 2664027 -1331986 2663972 1000205
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -5.4200 -0.6128 0.0810 0.7000 5.4128
##
## Random effects:
## Groups Name Variance Std.Dev.
## user (Intercept) 0.1298 0.3603
## movie (Intercept) 0.3688 0.6073
## Residual 0.8139 0.9022
## Number of obs: 1000209, groups: user, 6040; movie, 3706
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 3.33902 0.01146 291.3
In his blog posting Norm stated that this model took nearly 20 minutes to fit, which was a motivation for an alternative approach. I don’t get comparable results. On my desktop, a Dell Optiplex 7010, the fit took less than 3 minutes in R using lmer and less than half a minute in Julia using lmm from the MixedModels package.
julia> versioninfo()
Julia Version 0.4.1
Commit cbe1bee (2015-11-08 10:33 UTC)
Platform Info:
System: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz
WORD_SIZE: 64
BLAS: libopenblas (NO_LAPACKE DYNAMIC_ARCH NO_AFFINITY Sandybridge)
LAPACK: liblapack.so.3
LIBM: libopenlibm
LLVM: libLLVM-3.3
julia> using DataFrames,MixedModels,RCall
julia> ml1m = rcopy(rcall(:readRDS,"/home/bates/git/stat692/ml1m.rds"));
julia> @time m1 = fit!(lmm(rating ~ 1 + (1|user) + (1|movie),ml1m));
29.575176 seconds (42.99 M allocations: 1.631 GB, 1.23% gc time)
julia> m1
Linear mixed model fit by maximum likelihood
logLik: -1331986.005811, deviance: 2663972.011622, AIC: 2663980.011622, BIC: 2664027.274500
Variance components:
Variance Std.Dev.
user 0.12985210 0.36034996
movie 0.36879694 0.60728654
Residual 0.81390867 0.90216887
Number of obs: 1000209; levels of grouping factors: 6040, 3706
Fixed-effects parameters:
Estimate Std.Error z value
(Intercept) 3.33902 0.0114624 291.302
The RCall package allows for evaluation of R expressions in an embedded R process. The formula and data arguments to lmm are similar to those for lmer. By default lmm models are fit by maximum likelihood, not REML.