Logit(percent rank) is nice and normalish

I like using percent ranks because they remove the absurdity of outliers from raw data.

library(ggplot2)  # ggplot(), aes(), geom_histogram()
library(dplyr)    # percent_rank()

# data with outlier
raw <- c(1:100, 1000)
ggplot(NULL, aes(raw)) +
  geom_histogram()

# nicely behaved percentile data
percentiles <- percent_rank(raw)
ggplot(NULL, aes(percentiles)) +
  geom_histogram()

So much prettier now!
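
For the curious, here is a rough base R sketch of what percent_rank() is doing, assuming dplyr's documented behavior of rescaling minimum ranks onto the 0 to 1 range. The helper name percent_rank_sketch is mine, not part of any package, and it assumes there are no missing values.

```r
# hypothetical helper: a base R stand-in for dplyr::percent_rank()
# (assumes no NA values in x)
percent_rank_sketch <- function(x) {
  # rank with ties sharing the lowest rank, then rescale to [0, 1]
  (rank(x, ties.method = "min") - 1) / (length(x) - 1)
}

raw <- c(1:100, 1000)
sketch <- percent_rank_sketch(raw)

# the outlier at 1000 is just the top rank, one even step above 100
tail(sketch, 2)
```

Only the ordering of the data survives, so 1000 lands at 1 and its neighbor 100 lands at 0.99, no matter how far apart they were on the raw scale.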

But what if I want to get stats like mean and standard deviation? This can be goofy for percentile data because it is bounded between 0 and 1. For example, if the mean were 0.9 with a standard deviation of 0.2, then one standard deviation up would be 1.10, a data point that could never exist unless we were willing to accept 110%.

In logistic regression this kind of problem is dealt with by applying the logit function, log(x / (1 - x)), so that data that originally ran from 0 to 1 now runs from -∞ to ∞. On that unbounded scale, an interval like the mean +/- one standard deviation makes sense again.
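
To make that concrete, here is a small sketch using base R's qlogis() and plogis(), which compute the logit and its inverse. The mean of 0.9 and standard deviation of 0.2 come from the example above; the logit-scale standard deviation of 1 is a made-up number, purely for illustration.

```r
# on the raw percentile scale, mean + 1 sd escapes the 0 to 1 range
m <- 0.9
s <- 0.2
m + s  # 1.1, not a valid percentile

# on the logit scale any interval is legal
center <- qlogis(m)  # same as log(m / (1 - m))
s_logit <- 1         # hypothetical sd on the logit scale, illustration only
upper_logit <- center + s_logit

# plogis() is the inverse logit, so the result lands back between 0 and 1
upper <- plogis(upper_logit)
upper
```

However large the step on the logit scale, the back-transformed value can approach 1 but never pass it.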

How does the percent rank data look if we apply a logit to it?

logit <- function(x) log(x / (1 - x))

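# generate and show the logit-transformed data
# (percent ranks of exactly 0 and 1 become -Inf and Inf, which geom_histogram() drops with a warning)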
normalized <- logit(percentiles)
ggplot(NULL, aes(normalized)) +
  geom_histogram()

Oh pretty day. Look at that beautiful, nearly normal distribution centered on 0. The logit of percentile data is something we can work with!
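
One wrinkle: percent ranks include exactly 0 and 1, where the logit is infinite, so the smallest and largest points cannot contribute to a mean or standard deviation. A possible workaround, which is my own assumption and not part of the analysis above, is to use midpoint ranks, (rank - 0.5) / n, which stay strictly inside 0 and 1. The sketch below uses only base R, and the name midpoint_rank is hypothetical.

```r
raw <- c(1:100, 1000)

# base R equivalent of the percent ranks used above (no ties, no NAs here)
percentiles <- (rank(raw, ties.method = "min") - 1) / (length(raw) - 1)

# the endpoints blow up under the logit
range(qlogis(percentiles))  # -Inf and Inf

# hypothetical alternative: midpoint ranks never hit 0 or 1
midpoint_rank <- function(x) (rank(x) - 0.5) / length(x)
normalized <- qlogis(midpoint_rank(raw))

# every point is finite now, so mean and sd are well defined
all(is.finite(normalized))
mean(normalized)
sd(normalized)
```

Because the midpoint ranks are symmetric around 0.5, the mean on the logit scale comes out at essentially 0, and every observation, including the outlier, gets a finite value.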

Benjamin Haley

April 20, 2015