Degenerate design matrices

One problem that sometimes arises in regression and related analyses is a degenerate design matrix – in other words, perfect multicollinearity.

set.seed(101)
## x1 and x2 are identical by construction, so the design matrix is rank-deficient
d <- data.frame(y = rnorm(5), x1 = 1:5, x2 = 1:5, f1 = rep(c("a", "b"), c(2, 
    3)), f2 = rep(c("A", "B"), c(3, 2)))

Different modeling functions and packages handle this with varying degrees of success. lm detects it and automatically sets one of the degenerate coefficients to NA (there's not much user control over which coefficient is "aliased" in this way):

lm(y ~ x1 + x2, data = d)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = d)
## 
## Coefficients:
## (Intercept)           x1           x2  
##     -0.2653       0.0936           NA
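
If you want to find out after the fact which coefficients were dropped, the fitted object records this; a minimal check using base R's alias() and the aliased component of summary():

fit <- lm(y ~ x1 + x2, data = d)
## alias() shows the linear dependencies among the terms (here x2 = x1)
alias(fit)
## summary()$aliased is a named logical vector: TRUE marks the NA'd coefficients
which(summary(fit)$aliased)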

The modeling functions in nlme give variously useful error messages:

library("nlme")
gls(y ~ x1 + x2, data = d)
## Error: computed "gls" fit is singular, rank 3
lme(y ~ x1 + x2, random = ~1 | f1, data = d)
## Error: Singularity in backsolve at level 0, block 1
detach("package:nlme")

lme4 gives a moderately cryptic but precise error message:

library("lme4")
suppressWarnings(lmer(y ~ x1 + x2 + (1 | f1), data = d))
## Error: rank of X = 2 < ncol(X) = 3
detach("package:lme4")

Sometimes the problem is forehead-slappingly obvious, but what if it's not? You can construct the design matrix yourself with R's model.matrix() function, using the same formula you put into your model (only the right-hand side is needed):

library("Matrix")  ## for rankMatrix()
X1 <- model.matrix(~x1 + x2, data = d)
(r <- rankMatrix(X1))
## [1] 2
## attr(,"method")
## [1] "tolNorm2"
## attr(,"useGrad")
## [1] FALSE
## attr(,"tol")
## [1] 1.186e-14
r < ncol(X1)  ## this signals trouble
## [1] TRUE
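
Base R's qr() gives essentially the same check if you'd rather not depend on the Matrix package:

qr(X1)$rank              ## also 2 here
qr(X1)$rank < ncol(X1)   ## TRUE: trouble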

You can see which particular columns are involved by doing a singular value decomposition:

(s <- svd(X1))
## $d
## [1] 1.068e+01 9.361e-01 1.250e-16
## 
## $u
##         [,1]    [,2]    [,3]
## [1,] -0.1478  0.7604 -0.2075
## [2,] -0.2778  0.4721  0.6696
## [3,] -0.4077  0.1838 -0.3390
## [4,] -0.5377 -0.1045 -0.5009
## [5,] -0.6676 -0.3928  0.3778
## 
## $v
##         [,1]    [,2]    [,3]
## [1,] -0.1908  0.9816  0.0000
## [2,] -0.6941 -0.1349  0.7071
## [3,] -0.6941 -0.1349 -0.7071

and seeing which columns are associated with the near-zero elements of s$d:

mcols <- s$v[, (s$d < .Machine$double.eps)]
setNames(mcols, colnames(X1))
## (Intercept)          x1          x2 
##      0.0000      0.7071     -0.7071

This tells us that columns 2 and 3 of the design matrix (associated with x1 and x2) are perfectly collinear.
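
If you have the caret package installed, its findLinearCombos() function automates this diagnosis (a sketch; caret isn't otherwise required here):

library("caret")
## $linearCombos lists the sets of mutually dependent columns;
## $remove suggests which columns to drop to restore full rank
findLinearCombos(X1)
detach("package:caret")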

A similar situation can arise with factors, especially when one tries to fit an interaction term where not all combinations of levels have been measured (either because they are logically impossible – structural zeros – or because they simply weren't measured).

For example, the a:B combination is missing in this fake data set:

with(d, table(f1, f2))
##    f2
## f1  A B
##   a 2 0
##   b 1 2
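
With two factors the empty cell is visible at a glance; with more factors or more levels, a programmatic check may be easier:

## TRUE if any combination of factor levels is unobserved
any(with(d, table(f1, f2)) == 0)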

We get results similar to those from the collinear-predictor example above:

lm(y ~ f1 * f2, data = d)
## 
## Call:
## lm(formula = y ~ f1 * f2, data = d)
## 
## Coefficients:
## (Intercept)          f1b          f2B      f1b:f2B  
##       0.113       -0.788        0.938           NA
library("nlme")
gls(y ~ f1 * f2, data = d)
## Error: computed "gls" fit is singular, rank 4
X2 <- model.matrix(~f1 * f2, data = d)
setNames(zapsmall(svd(X2)$v[, 4]), colnames(X2))
## (Intercept)         f1b         f2B     f1b:f2B 
##      0.0000      0.0000      0.7071     -0.7071
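
Since this is now the second time we've run the same diagnosis, it may be worth wrapping it in a small helper (a hypothetical function, with an arbitrarily chosen relative tolerance):

## report the columns of v associated with near-zero singular values
check_collinear <- function(formula, data, tol = 1e-10) {
    X <- model.matrix(formula, data = data)
    s <- svd(X)
    bad <- s$d < tol * max(s$d)
    if (!any(bad)) return(invisible(NULL))  ## design matrix is full rank
    structure(zapsmall(s$v[, bad, drop = FALSE]),
        dimnames = list(colnames(X), NULL))
}
check_collinear(~f1 * f2, data = d)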

As pointed out in the GLMM FAQ, the way to deal with this is to convert the model to a one-way design (i.e. compute the interaction explicitly and drop unused levels):

d <- transform(d, f12 = droplevels(interaction(f1, f2)))
gls(y ~ f12, data = d)
## Generalized least squares fit by REML
##   Model: y ~ f12 
##   Data: d 
##   Log-restricted-likelihood: -1.898
## 
## Coefficients:
## (Intercept)      f12b.A      f12b.B 
##      0.1132     -0.7882      0.1494 
## 
## Degrees of freedom: 5 total; 2 residual
## Residual standard error: 0.4419
detach("package:nlme")
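
Note that the one-way design has only three estimable cell means (a.A, b.A, and b.B), so no coefficient corresponds to the unobserved a:B combination.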