Probits vs logits

Author

Bob O’Hara

Introduction

This is some of the code that formed the basis of the exam at NTNU for TMA4315 in Novermber 2024. The problem was to look at changes in the use of logits and probits over time, and was inspired by the data in Cramer (2003).

Web of Science was searched for “logit”, and then “probit” on 15th November 2024, and the papers that mention these terms anywhere were downloaded.

Data formatting

The data are on figshare. This downloads it, and does some re-formatting for the analysis:

AllData <- read.csv("https://figshare.com/ndownloader/files/51210113")
AllData <- AllData[AllData$UseInModel,] # remove fields with too few data
AllData$Sum <- AllData$logit+AllData$probit
AllData <- AllData[AllData$Sum>0,] # remove rows with no logits or probits
AllData$OD <- factor(1:nrow(AllData)) # For overdispersion
AllData$YearC <- AllData$Year-1990 # make 1990 the intercept
Counts <- as.matrix(AllData[,c("probit", "logit")])

Average over field

First we sum over the fields to just get the data by year:

tmpD <- by(AllData, list(AllData$Year), function(x) c(sum(x$logit), sum(x$probit)),
               simplify=TRUE)
Data <- data.frame(do.call(rbind, tmpD))
names(Data) <- c("logit", "probit")
Data$Year <- as.numeric(rownames(Data))
Data$YearC <- Data$Year - 1970
Data$PropProbit <- Data$probit/(Data$probit+Data$logit)
Data <- Data[Data$Year<2021,]

Then we plot the data:

par(mar=c(4.1,5.1,1,1), cex=1.3)
plot(Data$Year, Data$PropProbit, xlab="Year", ylab=
       "Proportion of papers using\na probit link function")

So we can see a decrease in the use of probits over time. We can fit models to this. We assume a binomial distribution and a linear effect of time on the link scale. We can, of course, use a probit or logit link:

mlgt <- glm(cbind(Data$probit, Data$logit) ~ YearC, 
          family=binomial("logit"), data=Data)
mprbt <- glm(cbind(Data$probit, Data$logit) ~ YearC, 
          family=binomial("probit"), data=Data)
predL <- predict(mlgt, type = "response")
predP <- predict(mprbt, type = "response")

We can plot the data and see they give almost the same predictions:

par(mar=c(4.1,5.1,1,1), cex=1.3)
plot(Data$Year, Data$PropProbit, xlab="Year", ylab=
       "Proportion of papers using\na probit link function")
lines(Data$Year, predL, col=2)
lines(Data$Year, predP, col=3)
legend(2005, 0.7, c("logit", "probit"), lty=1, col=2:3)

The residual deviances are 143.3 and 143.7 for the logit and probit models respectively. So the probit model fits ever so slightly better, but the difference is so tiny we can’t see the logit curve on the plot. Indeed, the largest difference in probabilities is 0.084%, i.e. less than one tenth of a percent.

We cannot formally test the difference between the probit and logit models with standard tools, because the models are not nested. But we have a way.

Comparing the models with a parametric bootstrap

The data are Bernoulli distributions, so for each year we can re-sample them using the proportion of probits as our \(p\). For each simulation we can fit both models, and compare the deviances:

ResampleDev <- function(data) {
  data$N <- data$logit + data$probit
  data$simprbt <- rbinom(nrow(data), size=data$N, prob=data$PropProbit)
  simY <- cbind(data$simprbt, data$N-data$simprbt)
  mlgt.s <- glm(simY ~ YearC, family=binomial("logit"), data=data)
  mprbt.s <- glm(simY ~ YearC, family=binomial("probit"), data=data)
  deviance(mlgt.s) - deviance(mprbt.s)
}

RepDev <- replicate(1e4, ResampleDev(Data)) # 10,000 replicates

hist(RepDev)
abline(v=0, col=2)

We can see that almost all of the differences in deviance are below 0, so the probit is preferred. In fact, only 0.3% are above 0, so we can say there is evidence for the probit being better. However, the differences are still very small: in practice both models work fine (which is why their predictions are almost the same).

Now with added fields

Now we can look at the use of logits and probits varies between fields.

Resp <- cbind(AllData$probit, AllData$logit)
library(lme4)
mm1l <- glmer(Resp ~ YearC + (1|Field), family=binomial("logit"), data=AllData)
mm2l <- glmer(Resp ~ YearC + (1|Field) + (1|OD), family=binomial("logit"), data=AllData)


mm1p <- glmer(Resp ~ YearC + (1|Field), family=binomial("probit"), data=AllData)
mm2p <- glmer(Resp ~ YearC + (1|Field) + (1|OD), family=binomial("probit"), data=AllData)

# These are not stable
# mm3l <- glmer(Resp ~ YearC + (1+Year|Field) + (1|OD), 
#               family=binomial("logit"), data=AllData)
# mm3p <- glmer(Resp ~ YearC + (1+Year|Field) + (1|OD), 
#               family=binomial("probit"), data=AllData)

Summ.l <- summary(mm2l)
Summ.p <- summary(mm2p)

The summaries contain a lot of information, so here are just the fixed effect coefficients, and the random effects, as standard deviations for the logit model:

Summ.l$coefficients
               Estimate  Std. Error    z value     Pr(>|z|)
(Intercept)  0.03092855 0.152661840  0.2025952 8.394515e-01
YearC       -0.01358016 0.001730758 -7.8463678 4.282596e-15
Summ.l$varcor
 Groups Name        Std.Dev.
 OD     (Intercept) 0.13927 
 Field  (Intercept) 0.75142 

And for the probits:

Summ.p$coefficients
                Estimate  Std. Error    z value     Pr(>|z|)
(Intercept)  0.011907798 0.091265521  0.1304742 8.961912e-01
YearC       -0.008288644 0.001063765 -7.7918002 6.606114e-15
Summ.p$varcor
 Groups Name        Std.Dev.
 OD     (Intercept) 0.084504
 Field  (Intercept) 0.448513

Both tell the same story: a trend in time, with logits winning, and most of the variation in the intercept is between fields. In essence, this is what was causing most of the overdispersion we saw earlier.

Finally, some model checking:

library(ggplot2)
library(sjPlot)

gg1F <- plot_model(mm2l, type = "re", sort.est = "(Intercept)", y.offset = 0.4, 
                   dot.size = 1.5, title = "Field")
gg1OD <- plot_model(mm2l, type = "re", sort.est = "(Intercept)", y.offset = 0.4, 
                    dot.size = 1.5, title = "OD")

gg2 <- plot_model(mm2l, type = "diag", prnt.plot = FALSE, title = "OD", geom.size = 1)

gg2f <- gg2$Field + theme(strip.text.x = element_blank()) + ggtitle("Field")
gg2od <- gg2$OD + theme(strip.text.x = element_blank()) + ggtitle("OD")

gg3 <- ggplot() + geom_density(aes(x = ranef(mm2l)$Field[[1]])) + labs(x = "x", y = "y", title = "Density of RI")

df <- data.frame(fitted = fitted(mm2l), resid = residuals(mm2l, scaled = TRUE))
gg4 <- ggplot(df, aes(fitted,resid)) + geom_point(pch = 21) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_smooth(se = FALSE, col = "red", size = 0.5, method = "loess") +
  labs(x = "Fitted values", y = "Residuals", title = "Residuals vs Fitted values")
gg5 <- ggplot(df, aes(sample=resid)) + stat_qq(pch = 19) +
  geom_abline(intercept = 0, slope = 1, linetype = "dotted") +
  labs(x = "Theoretical quantiles", y = "Standardized residuals", title = "Normal Q-Q for residuals")

library(ggpubr)

ggarrange(gg1F[[2]], gg1OD[[1]], gg2f, gg2od, gg4, gg5, 
          ncol=2, nrow=3)

On the whole this looks OK, but the distributions sometimes have think tails. And Entomologists are weird in liking probits too much.