MovieLens

Recently Norm Matloff described a formulation for random effects models in a paper and a blog post. One example he mentions involves a sample of about 1 million movie ratings from https://movielens.org. The sample can be downloaded from http://grouplens.org/datasets/movielens/ as a zip archive ml-1m.zip.

Loading the data

The zip archive contains files ratings.dat, movies.dat and users.dat in a directory ml-1m. We can use an unz connection to access the ratings.dat directly. The format is a bit unusual in that the field separator is two consecutive colons (::). Specifying the separator as : produces seven fields of which only the odd-numbered fields are meaningful.

str(dd <- read.delim(unz("/home/bates/Downloads/ml-1m.zip","ml-1m/ratings.dat"),sep=':',header=FALSE))

## 'data.frame':    1000209 obs. of  7 variables:
##  $ V1: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ V2: logi  NA NA NA NA NA NA ...
##  $ V3: int  1193 661 914 3408 2355 1197 1287 2804 594 919 ...
##  $ V4: logi  NA NA NA NA NA NA ...
##  $ V5: int  5 3 3 4 5 3 5 5 4 4 ...
##  $ V6: logi  NA NA NA NA NA NA ...
##  $ V7: int  978300760 978302109 978301968 978300275 978824291 978302268 978302039 978300719 978302268 978301368 ...

We create a data frame from user, movie and rating and save it in compressed RDS (R data set) format.

ml1m <- data.frame(user = factor(dd[[1]]), movie = factor(dd[[3]]), rating = as.double(dd[[5]]))
rm(dd)
saveRDS(ml1m,file="ml1m.rds",compress='xz')

Fitting a linear mixed-effects model

A preliminary model would be a straight variance-component model to examine the variability due to movie, the variability due to user and the residual variability. Using the lmer function from the lme4 package this model is fit as

system.time(m1 <- lmer(rating ~ 1 + (1|user) + (1|movie), ml1m,REML=FALSE))

##    user  system elapsed 
## 346.012 232.872 165.140

summary(m1)

## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: rating ~ 1 + (1 | user) + (1 | movie)
##    Data: ml1m
## 
##      AIC      BIC   logLik deviance df.resid 
##  2663980  2664027 -1331986  2663972  1000205 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.4200 -0.6128  0.0810  0.7000  5.4128 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  user     (Intercept) 0.1298   0.3603  
##  movie    (Intercept) 0.3688   0.6073  
##  Residual             0.8139   0.9022  
## Number of obs: 1000209, groups:  user, 6040; movie, 3706
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  3.33902    0.01146   291.3

In his blog posting Norm stated that this model took nearly 20 minutes to fit, which was a motivation for an alternative approach. I don’t get comparable results. On my desktop, a Dell Optiplex 7010, the fit took less than 3 minutes in R using lmer and less than half a minute in Julia using lmm from the MixedModels package.

julia> versioninfo()
Julia Version 0.4.1
Commit cbe1bee (2015-11-08 10:33 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (NO_LAPACKE DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: liblapack.so.3
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

julia> using DataFrames,MixedModels,RCall

julia> ml1m = rcopy(rcall(:readRDS,"/home/bates/git/stat692/ml1m.rds"));

julia> @time m1 = fit!(lmm(rating ~ 1 + (1|user) + (1|movie),ml1m));
 29.575176 seconds (42.99 M allocations: 1.631 GB, 1.23% gc time)

julia> m1
Linear mixed model fit by maximum likelihood
 logLik: -1331986.005811, deviance: 2663972.011622, AIC: 2663980.011622, BIC: 2664027.274500

Variance components:
           Variance   Std.Dev.  
 user     0.12985210 0.36034996
 movie    0.36879694 0.60728654
 Residual 0.81390867 0.90216887
 Number of obs: 1000209; levels of grouping factors: 6040, 3706

  Fixed-effects parameters:
             Estimate Std.Error z value
(Intercept)   3.33902 0.0114624 291.302

The RCall package allows for evaluation of R expressions in an embedded R process. The formula and data arguments to lmm are similar to those for lmer. By default lmm models are fit by maximum likelihood, not REML.

MovieLens

Douglas Bates

November 19, 2015

Loading the data

Fitting a linear mixed-effects model