Stuff with contrasts

improved level naming

It would be very nice to have the capability to name the levels of model.matrix-generated input variables better. For example, the input variables that are generated when contr.sum() is used are just numbered, not named. I think there is some magic going on within model.matrix that allows the contrasts to be sensibly named only if the default contrasts are used (i.e. no contrasts are explicitly specified). For example, here are the names of the input variables generated by the default contrasts:

set.seed(102)
d <- data.frame(y = runif(100), f = factor(sample(c("low", "medium", "high"), 
    100, replace = TRUE), levels = c("low", "medium", "high")))
colnames(model.matrix(y ~ f, data = d))
## [1] "(Intercept)" "fmedium"     "fhigh"

But if we explicitly specify treatment contrasts, even though we get the same model matrix (not shown), the names are now useless:

colnames(model.matrix(y ~ f, data = d, contrasts = list(f = contr.treatment)))
## [1] "(Intercept)" "f2"          "f3"

Similarly, if we ask for

colnames(model.matrix(y ~ f, data = d, contrasts = list(f = contr.sum)))
## [1] "(Intercept)" "f1"          "f2"

What if we set the default contrasts using options()?

old_opts <- options(contrasts = c("contr.treatment", "contr.poly"))
colnames(model.matrix(y ~ f, data = d))
## [1] "(Intercept)" "fmedium"     "fhigh"
options(contrasts = c("contr.sum", "contr.poly"))
colnames(model.matrix(y ~ f, data = d))
## [1] "(Intercept)" "f1"          "f2"
options(old_opts)

The code for creating the column names of the model matrix is actually at lines 578ff of https://github.com/wch/r-source/blob/trunk/src/library/stats/src/model.c … it would require either a lot more staring at this code, or source-level debugging, t find out exactly where the difference in behaviour (i.e., why default contrasts get sensible names and non-default contrasts only get numbers) is coming from.

The car package defines a contr.Sum() function that gives slightly improved input variable names, but they're only a little bit better. The main point is that they indicate clearly that sum-to-zero contrasts are being used, but they don't help with guessing which levels are being compared.

library("car")
colnames(model.matrix(y ~ f, data = d, contrasts = list(f = contr.Sum)))
## [1] "(Intercept)" "f[S.1]"      "f[S.2]"

A clunky workaround is to specify a contrast matrix with defined column names:

cm <- contr.sum(3)
colnames(cm) <- c("a", "b")
colnames(model.matrix(y ~ f, data = d, contrasts = list(f = cm)))
## [1] "(Intercept)" "fa"          "fb"

We could even cheat and add a separator:

colnames(cm) <- paste0(".", c("a", "b"))
colnames(model.matrix(y ~ f, data = d, contrasts = list(f = cm)))
## [1] "(Intercept)" "f.a"         "f.b"

I've defined a contr.Sum2() (not shown) (I called it contr.Sum2 to differentiate it from the car version) that automates this somewhat: unfortunately, it can't be specified simply as contr.Sum2 – the optional factor or levels arguments must be provided …

colnames(m <- model.matrix(y ~ f, data = d, contrasts = list(f = contr.Sum2(factor = d$f))))
## [1] "(Intercept)" "f[S.low]"    "f[S.medium]"

I don't see an easy way to make this more convenient without fairly deep hacking …

weighted sum-to-zero contrasts

What does a weighted contrast matrix look like? Do we have to mess with the intercept column? The easy way to do it (presumably) is to center the design matrix:

colMeans(m)
## (Intercept)    f[S.low] f[S.medium] 
##        1.00        0.07        0.09
m[, -1] <- scale(m[, -1], center = TRUE, scale = FALSE)
zapsmall(colMeans(m))
## (Intercept)    f[S.low] f[S.medium] 
##           1           0           0
head(m)
##   (Intercept) f[S.low] f[S.medium]
## 1           1    -1.07       -1.09
## 2           1    -0.07        0.91
## 3           1    -1.07       -1.09
## 4           1    -0.07        0.91
## 5           1     0.93       -0.09
## 6           1    -1.07       -1.09
prop.table(table(d$f))
## 
##    low medium   high 
##   0.35   0.37   0.28
apply(m[, -1], 2, unique)
##      f[S.low] f[S.medium]
## [1,]    -1.07       -1.09
## [2,]    -0.07        0.91
## [3,]     0.93       -0.09

If I could do the simple algebra to figure out where these numbers come from, this could fairly easily be incorporated in contr.Sum2()

orthogonal contrasts???