# Degenerate design matrices

One problem that arises sometimes in regression and related problems is a degenerate design matrix – or in other words perfect multicollinearity.

``````set.seed(101)
d <- data.frame(y = rnorm(5), x1 = 1:5, x2 = 1:5, f1 = rep(c("a", "b"), c(2,
3)), f2 = rep(c("A", "B"), c(3, 2)))
``````

Different modeling functions/packages deal differently well with this. `lm` detects it automatically and sets one of the degenerate coefficients to `NA` automatically (there's not much user control over which coefficient is “aliased” in this way):

``````lm(y ~ x1 + x2, data = d)
``````
``````##
## Call:
## lm(formula = y ~ x1 + x2, data = d)
##
## Coefficients:
## (Intercept)           x1           x2
##     -0.2653       0.0936           NA
``````

The modeling functions in `lme` give variously useful error messages:

``````library("nlme")
gls(y ~ x1 + x2, data = d)
``````
``````## Error: computed "gls" fit is singular, rank 3
``````
``````lme(y ~ x1 + x2, random = ~1 | f1, data = d)
``````
``````## Error: Singularity in backsolve at level 0, block 1
``````
``````detach("package:nlme")
``````

`lme4` gives a moderately cryptic but precise error message:

``````library("lme4")
suppressWarnings(lmer(y ~ x1 + x2 + (1 | f1), data = d))
``````
``````## Error: rank of X = 2 < ncol(X) = 3
``````
``````detach("package:lme4")
``````

Sometimes the problem is forehead-slappingly obvious, but what if it's not? You can construct your own design matrix using R's `model.matrix()` function, with the same formula (you only need the right-hand side) that you put into your model:

``````X1 <- model.matrix(~x1 + x2, data = d)
(r <- rankMatrix(X1))
``````
``````##  2
## attr(,"method")
##  "tolNorm2"
##  FALSE
## attr(,"tol")
##  1.186e-14
``````
``````r < ncol(X1)  ## this signals trouble
``````
``````##  TRUE
``````

You can see which particular columns are involved by doing a singular value decomposition:

``````(s <- svd(X1))
``````
``````## \$d
##  1.068e+01 9.361e-01 1.250e-16
##
## \$u
##         [,1]    [,2]    [,3]
## [1,] -0.1478  0.7604 -0.2075
## [2,] -0.2778  0.4721  0.6696
## [3,] -0.4077  0.1838 -0.3390
## [4,] -0.5377 -0.1045 -0.5009
## [5,] -0.6676 -0.3928  0.3778
##
## \$v
##         [,1]    [,2]    [,3]
## [1,] -0.1908  0.9816  0.0000
## [2,] -0.6941 -0.1349  0.7071
## [3,] -0.6941 -0.1349 -0.7071
``````

and seeing which columns are associated with the near-zero elements of `s\$d`:

``````mcols <- s\$v[, (s\$d < .Machine\$double.eps)]
setNames(mcols, colnames(X1))
``````
``````## (Intercept)          x1          x2
##      0.0000      0.7071     -0.7071
``````

This tells us that columns 2 and 3 of the design matrix (associated with `x1` and `x2` are perfectly collinear.

A similar situation can arise with factors, especially when one tries to fit an interaction term where not all combinations of levels have been measured (either because they are logically impossible – structural zeros – or because they simply weren't measured.

For example, the `a:B` combination is missing in this fake data set:

``````with(d, table(f1, f2))
``````
``````##    f2
## f1  A B
##   a 2 0
##   b 1 2
``````

We will get similar results as when we tried to fit above:

``````lm(y ~ f1 * f2, data = d)
``````
``````##
## Call:
## lm(formula = y ~ f1 * f2, data = d)
##
## Coefficients:
## (Intercept)          f1b          f2B      f1b:f2B
##       0.113       -0.788        0.938           NA
``````
``````library("nlme")
gls(y ~ f1 * f2, data = d)
``````
``````## Error: computed "gls" fit is singular, rank 4
``````
``````X2 <- model.matrix(~f1 * f2, data = d)
setNames(zapsmall(svd(X2)\$v[, 4]), colnames(X2))
``````
``````## (Intercept)         f1b         f2B     f1b:f2B
##      0.0000      0.0000      0.7071     -0.7071
``````

As pointed out in the GLMM faq, the way to deal with this problem is to convert the problem into a one-way design (i.e. compute the interaction explicitly and drop unused levels):

``````d <- transform(d, f12 = droplevels(interaction(f1, f2)))
gls(y ~ f12, data = d)
``````
``````## Generalized least squares fit by REML
##   Model: y ~ f12
##   Data: d
##   Log-restricted-likelihood: -1.898
##
## Coefficients:
## (Intercept)      f12b.A      f12b.B
##      0.1132     -0.7882      0.1494
##
## Degrees of freedom: 5 total; 2 residual
## Residual standard error: 0.4419
``````
``````detach("package:nlme")
``````