This vignette describes the composition and calculation of a summary statistic that we have referred to as the product of ranks. Others have called it the rank product (Breitling 2004; Eisinga 2013).
For \(N\) units having \(M\) metrics, \(i = 1 \dots N\), it is defined as:
\[ {RP}_i = \prod_{j=1}^{M}{{R_{ij}} \over {N_{j}}} \]
… where the rank \(R_{ij}\) of value \(X_{ij}\) (for unit \(i\) and metric \(j\)) is defined such that ranks vary from 1 (highest) to N (lowest), assuming no ties and no missing data. The terms of the cumulative product are the rank fractions, which under the same assumptions vary from \(\frac{1}{N}\) (highest) to \(1\) (lowest).
Missing data. Without loss of generality, one may exclude missing data from the computations by substituting the geometric mean for the cumulative product. In this case \(N\) will vary by metric, and the rank fraction for metric \(j\) will take on values from \(\frac{1}{N_j}\) to \(1\). The code below reports the number of non-missing terms for each unit as \(k\).
Ties. When there are ties, a method must be chosen to break them. The code below assigns the highest common rank to the group of tied values, consistent with the choice embodied in the calculations of percentiles in CalEnviroScreen 2.0. If a number of values within a given indicator \(\textbf{X}_{j}\) are tied for last place (perhaps equal to zero), then the rank fractions for that indicator will vary from \(\frac{1}{N_j}\) to \(\frac{N_j - z_j}{N_j}\), where \(z_j\) is the count of values tied for last place.
Implementation. For numerical accuracy and simplicity, the code below calculates mean(-log(Frac)), where Frac is the rank fraction. The geometric mean would be exp(mean(log(Frac))), but the exp() is not needed as it does not affect the final ordering. The negation is employed solely for interpretability. It also does not affect the ordering, except to reverse it so that the statistic of interest will be positive and increasing.
suppressPackageStartupMessages({
library(testthat) # install.packages("testthat")
library(dplyr) # install.packages("dplyr")
library(tidyr) # install.packages("tidyr")
library(CalEnviroScreen) # devtools::install_github("baaqmd/CalEnviroScreen")
})
data(CES2, package = "CalEnviroScreen")
CES2_tbl <- CES2_data %>%
gather(Variable, Value, -FIPS) %>%
filter(!is.na(Value)) %>%
group_by(Variable) %>%
mutate(Rank = rank(-Value, ties.method = "min"), # rank: 1 (highest) to N (lowest)
Frac = Rank / n()) # rank fraction: 1/N (highest) to 1.0 (lowest)
cut_quantile <- function (x, n = 20, ..., na.rm = TRUE) {
q <- quantile(x, seq(0, 1, len=n+1), na.rm = na.rm)
cut(x, breaks = q, labels = names(q)[-length(q)], include.lowest = TRUE, ordered_result = TRUE, na.rm = na.rm)
}
CES2_rankprod <- CES2_tbl %>%
group_by(FIPS) %>%
summarize(k = n(), Score = mean(-log(Frac))) %>% # yields positive & increasing scores
mutate(Range = cut_quantile(Score))
show(CES2_rankprod)
## Source: local data frame [8,035 x 4]
##
## FIPS k Score Range
## 1 06001400100 19 0.8014 25%
## 2 06001400200 19 0.7087 15%
## 3 06001400300 19 0.7087 15%
## 4 06001400400 19 0.8386 30%
## 5 06001400500 19 0.9633 45%
## 6 06001400600 19 0.9527 45%
## 7 06001400700 19 1.2256 70%
## 8 06001400800 19 1.6038 90%
## 9 06001400900 19 1.4199 80%
## 10 06001401000 19 1.5626 85%
## .. ... .. ... ...
with(CES2_rankprod, table(Range) %>% rev() %>% cumsum() %>% head())
## 95% 90% 85% 80% 75% 70%
## 402 804 1206 1607 2009 2411
with(CES2_rankprod, density(exp(-Score)) %>% plot(bty='n'))
write.csv(CES2_rankprod, file = "CES2_rankprod.csv", row.names = FALSE)
Breitling R, Armengaud P, Amtmann A, Herzyk P. Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Lett. 2004;573(1-3):83-92.
Eisinga R, Breitling R, Heskes T. The exact probability distribution of the rank product statistics for replicated experiments. FEBS Lett. 2013;587:677-682.
rank_nonmissing <- function (x, ...) {
i <- which(!is.na(x))
as.integer(replace(x, i, rank(x[i], ...)))
}
rank1 <- CES2_tbl %>% select(FIPS, Variable, Rank) %>% spread(Variable, Rank)
rank2 <- CES2_data %>% mutate_each(funs(rank = rank_nonmissing(-1 * ., ties.method="min")), -FIPS)
show(rank2)
## Source: local data frame [8,035 x 20]
##
## FIPS Ozone PM25 DieselPM DrinkWat PestUse ToxRel Traffic Cleanup
## 1 06001400100 6258 6245 2663 7731 3012 3704 2634 3127
## 2 06001400200 6258 6263 584 7731 3012 3874 1459 5446
## 3 06001400300 6258 6269 589 7731 3012 3953 1487 4785
## 4 06001400400 6258 6284 591 7731 3012 3875 1750 5446
## 5 06001400500 6258 6306 584 7731 3012 3873 3210 3524
## 6 06001400600 6258 6296 602 7731 3012 3984 840 4162
## 7 06001400700 6258 6317 600 7731 3012 3995 3012 1661
## 8 06001400800 6258 6348 743 7731 3012 3989 4667 93
## 9 06001400900 6258 6338 686 7731 3012 4103 5708 214
## 10 06001401000 6258 6308 362 7731 3012 4135 1520 920
## .. ... ... ... ... ... ... ... ... ...
## Variables not shown: GndWat (int), HazWaste (int), ImpWat (int), SolWaste
## (int), Age (int), Asthma (int), LBW (int), Edu (int), LingIso (int),
## Poverty (int), Unemp (int)
expect_equal(rank1, rank2)
frac1 <- CES2_tbl %>% select(FIPS, Variable, Frac) %>% spread(Variable, Frac)
frac2 <- rank2 %>% mutate_each(funs(frac = . / sum(!is.na(.))), -FIPS)
show(frac2)
## Source: local data frame [8,035 x 20]
##
## FIPS Ozone PM25 DieselPM DrinkWat PestUse ToxRel Traffic
## 1 06001400100 0.7852 0.7861 0.33143 0.9664 0.3749 0.4617 0.3278
## 2 06001400200 0.7852 0.7884 0.07268 0.9664 0.3749 0.4829 0.1816
## 3 06001400300 0.7852 0.7891 0.07330 0.9664 0.3749 0.4927 0.1851
## 4 06001400400 0.7852 0.7910 0.07355 0.9664 0.3749 0.4830 0.2178
## 5 06001400500 0.7852 0.7938 0.07268 0.9664 0.3749 0.4827 0.3995
## 6 06001400600 0.7852 0.7925 0.07492 0.9664 0.3749 0.4966 0.1045
## 7 06001400700 0.7852 0.7952 0.07467 0.9664 0.3749 0.4979 0.3749
## 8 06001400800 0.7852 0.7991 0.09247 0.9664 0.3749 0.4972 0.5808
## 9 06001400900 0.7852 0.7978 0.08538 0.9664 0.3749 0.5114 0.7104
## 10 06001401000 0.7852 0.7941 0.04505 0.9664 0.3749 0.5154 0.1892
## .. ... ... ... ... ... ... ... ...
## Variables not shown: Cleanup (dbl), GndWat (dbl), HazWaste (dbl), ImpWat
## (dbl), SolWaste (dbl), Age (dbl), Asthma (dbl), LBW (dbl), Edu (dbl),
## LingIso (dbl), Poverty (dbl), Unemp (dbl)
expect_equal(frac1, frac2)
GM <- function (x) prod(na.omit(x)) ^ (1 / sum(!is.na(x)))
GM2 <- function (x) exp(mean(log(na.omit(x))))
score_GM <- function (.data) .data %>% select(-FIPS) %>% apply(GM, MARGIN = 1)
score_GM2 <- function (.data) .data %>% select(-FIPS) %>% apply(GM2, MARGIN = 1)
head(frac2 %>% score_GM)
## [1] 0.4487 0.4923 0.4923 0.4323 0.3816 0.3857
expect_equal(frac2 %>% score_GM, frac2 %>% score_GM2)
expect_equal(frac2 %>% score_GM, CES2_rankprod$Score %>% function (x) exp(-x))
expect_equal(-1 * log(frac2 %>% score_GM), CES2_rankprod$Score)
logGM <- function (x) log(prod(x, na.rm = TRUE)) / sum(!is.na(x))
score_logGM <- function (.data) .data %>% select(-FIPS) %>% apply(logGM, MARGIN = 1)
head(frac2 %>% score_logGM %>% exp())
## [1] 0.4487 0.4923 0.4923 0.4323 0.3816 0.3857