Setup

library(pacman); p_load(ggplot2, dplyr)

The Graph

People often dispute one group’s overrepresentation in some area, where that overrepresentation is attributed to a normally distributed trait, by picking a low threshold for the trait, noting that the larger group has more people above it in absolute terms, and concluding that the overrepresentation makes no sense. They usually need to look further out in the tail. To illustrate, consider a group of 7.5 million people with a higher average trait value (mean 110, SD 15) versus a group of 197 million people with a lower average (mean 100, SD 15).

set.seed(69)
# Simulate both groups: same SD (15), means 10 points apart, very different sizes.
Low  <- data.frame(Trait = rnorm(197e6, mean = 100, sd = 15))
High <- data.frame(Trait = rnorm(7.5e6, mean = 110, sd = 15))

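# Stack the two groups and plot each group's share of the population at every trait value.
# y = after_stat(count) weights each density by group size; the default (density) would not.
rbind(Low %>% mutate(Group = "Low"),
      High %>% mutate(Group = "High")) %>%
  ggplot(aes(x = Trait, y = after_stat(count), fill = Group)) +
  geom_density(position = "fill", alpha = 0.5) +
  scale_y_continuous(labels = scales::percent) +
  theme_bw() + ylab("Percent") + xlab("Trait Points")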

Despite being 3.7% of the population here, the high-scoring group outnumbers the larger, lower-scoring group at the upper extreme. If such extremes matter, then absolute comparisons made at lower thresholds will be misleading.
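The same point can be checked without simulating 200 million draws. The sketch below is not part of the original analysis; it assumes the group sizes, means, and SD used above, and the threshold values and variable names are chosen purely for illustration. It computes the expected number of people above each threshold with pnorm, then solves for the trait value at which the size-weighted densities of the two groups are equal.

```r
# Assumed parameters, copied from the simulation above
n_low  <- 197e6; n_high  <- 7.5e6
mu_low <- 100;   mu_high <- 110
sigma  <- 15

# Illustrative thresholds (hypothetical choices, not from the original)
thresholds <- c(100, 130, 145, 160, 175, 190)

# Expected number of people above each threshold in each group
tail_counts <- data.frame(
  Threshold = thresholds,
  Low  = n_low  * pnorm(thresholds, mu_low,  sigma, lower.tail = FALSE),
  High = n_high * pnorm(thresholds, mu_high, sigma, lower.tail = FALSE)
)

# Share of everyone above the threshold who belongs to the high-scoring group
tail_counts$HighShare <- tail_counts$High / (tail_counts$Low + tail_counts$High)
print(tail_counts)

# Trait value where n_high * f_high(x) equals n_low * f_low(x)
# (closed form for two normals with a common SD)
crossover <- (mu_low + mu_high) / 2 +
  sigma^2 / (mu_high - mu_low) * log(n_low / n_high)
print(crossover)
```

Under these assumptions the densities cross at a trait value of roughly 178.5, more than five standard deviations above the larger group's mean. At a threshold of 100 the smaller group is only a small fraction of those above it, and that share rises steadily as the threshold moves out.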
