One of the most important aspects of improving our sample size estimation for logistic regression is to realize that you need at least 96 subjects just to estimate the intercept. -Frank Harrell
https://twitter.com/f2harrell/status/936216929324425216
This figure comes from requiring that an estimated probability fall within 0.1 of the true value with 95% confidence. We can demonstrate this idea with a simulation.
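The figure of 96 is consistent with the usual normal-approximation formula for the margin of error of a proportion. The arithmetic below is a sketch, not a quotation of Harrell's own derivation. It assumes the worst case p = 0.5 (where the variance p(1 - p) is largest), a margin of error d = 0.1, and the 95% normal quantile z = 1.96:

```latex
n = \frac{z^{2}\, p(1 - p)}{d^{2}}
  = \frac{1.96^{2} \times 0.5 \times 0.5}{0.1^{2}}
  = \frac{0.9604}{0.01}
  \approx 96
```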
First we sample 96 subjects from a binomial population with a probability of 0.55. In other words, the TRUE probability of “success” is 0.55. We can do that with the rbinom function. Think of the code below as 96 flips of a coin that has a 0.55 probability of landing heads. We store the result in x.
x <- rbinom(n = 96, size = 1, prob = 0.55)
Now we pretend we don’t know the true probability is 0.55 and try to estimate it with our sample, x. We can do that with an intercept-only logistic regression model. The coef function extracts the intercept, which is on the log-odds scale. The plogis function calculates the inverse logit, which converts the intercept to a probability.
m <- glm(x ~ 1, family = binomial)
# use inverse-logit to get probability
plogis(coef(m))
## (Intercept)
## 0.5
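As a sanity check, the back-transformed intercept of an intercept-only logistic regression should match the plain sample proportion of successes. The sketch below illustrates that with a freshly simulated sample. The seed and the names p_glm and p_mean are my own choices for illustration, so the numbers will not match the output above.

```r
set.seed(123)  # arbitrary seed, only so this sketch is reproducible

# simulate 96 subjects with a true success probability of 0.55
x <- rbinom(n = 96, size = 1, prob = 0.55)
m <- glm(x ~ 1, family = binomial)

# probability implied by the fitted intercept
p_glm <- unname(plogis(coef(m)))

# plain sample proportion of successes
p_mean <- mean(x)

# the two should agree up to numerical tolerance
all.equal(p_glm, p_mean, tolerance = 1e-6)
```

In other words, the model is a slightly roundabout way of computing mean(x), which is why needing 96 subjects for the intercept is the same as needing 96 subjects to estimate a single proportion.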
Finally we check if our estimated probability is within 0.1 of the TRUE probability of 0.55.
abs(plogis(coef(m)) - 0.55) < 0.1
## (Intercept)
## TRUE
Now let’s do that 1000 times and see the proportion of times our estimate is within 0.1 of the TRUE probability of 0.55. The replicate function allows us to replicate a chunk of code as many times as we like. The result is stored in a vector I called “rep.out”. This is a vector of TRUE/FALSE values. Taking the mean of that vector returns the proportion of TRUE values.
rep.out <- replicate(n = 1000, {
  x <- rbinom(n = 96, size = 1, prob = 0.55)
  m <- glm(x ~ 1, family = binomial)
  abs(plogis(coef(m)) - 0.55) < 0.10
})
mean(rep.out)
## [1] 0.944
Our estimate comes within 0.1 of the true probability 94.4 percent of the time, which is close to the 95% confidence that the figure of 96 subjects is based on.
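The simulated proportion can also be compared with an exact calculation. The sketch below assumes that the estimate from the intercept-only model is simply the number of successes divided by 96, so the probability of landing within 0.1 of 0.55 can be summed directly from the binomial distribution. The names n, p_true, within, exact and mc_se are illustrative choices of mine.

```r
n <- 96
p_true <- 0.55

# every possible number of successes out of 96
k <- 0:n

# which counts give an estimate k / n within 0.1 of the true probability?
within <- abs(k / n - p_true) < 0.1

# exact probability of getting one of those counts
exact <- sum(dbinom(k[within], size = n, prob = p_true))
exact

# approximate Monte Carlo standard error of a proportion
# estimated from 1000 replications
mc_se <- sqrt(exact * (1 - exact) / 1000)
mc_se
```

The exact value should sit near 0.95, and the Monte Carlo standard error gives a sense of how far a 1000-replication result can be expected to wander from it.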