Bias in promotion?

Imagine that there are 10 women and 15 men in a promotion pool. Here's such a pool in R:

females = rep(1, 10)
males = rep(0, 15)
pool = c(males, females)

We're going to study how likely it is to get five females out of eight promotions in such a pool, under the hypothesis that sex is not a factor. (Such a hypothesis, that nothing is going on, is called a “Null Hypothesis”.)

The following, pick() is an R function that samples without replacement from the pool

pick <- function(n) {
    sum(sample(pool, n))
}

The pick() function was written to be general, to work for any size promotion group. Our promotion group had 8 people. Let's simulate a random selection of 8 in a world where the null hypothesis is true.

pick(8)

## [1] 6

The above is a possible result, but we don't know if it is a common or an unlikely result in the null-hypothesis world. Let's run the simulation many times and see what we get:

n = 100
Captain <- replicate(n, pick(8))

We could look at this in several ways, e.g. a table

table(Captain)

## Captain
##  1  2  3  4  5  6 
##  6 28 33 23  9  1

or a graphic such as a histogram:

hist(Captain, col = "lightblue", ylim = c(0, 0.4 * n), breaks = seq(-1.5, 10.5, 
    1))

plot of chunk unnamed-chunk-6

Ultimately, we're interested in how often, in the null-hypothesis world, we would get a result of five or more females:

table(Captain >= 5)

## 
## FALSE  TRUE 
##    90    10

Note in passing. I think this expression is a little more obscure and verbose than needed:

length(which(Captain > 4))/n

## [1] 0.1

Of course, it's a matter of taste. The point of which() is that it extracts the indices of the entries which satisfy the criterion. To me, it's more natural to use the condition itself, e.g.

mean(Captain >= 5)

## [1] 0.1

Doing it in a `mosaic` style

There's nothing wrong with the above, but I think it has several layers of code that are not strictly needed, e.g. using 0,1 for the sex, using rep(), and so on.

Here's one way to do it in mosaic

First, load the package. (All the stuff in the .Rmd file is there just to avoid filling the html file with the messages from loading mosaic.)

require(mosaic)

Create the pool. In doing so, stick with the convention of data consisting of cases and variables, organized into a table or data frame. Here, there is just one variable, sex. But in general there might be several.

pool = data.frame(sex = c("F", "F", "F", "F", "F", "F", "F", "F", "F", "F", 
    "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M"))

Make sure that we got it right:

tally(~sex, data = pool)

## 
##     F     M Total 
##    10    15    25

It just takes one time of having to type that whole expression above, before students appreciate why rep() is nice:

pool = data.frame(sex = c(rep("F", 10), rep("M", 15)))

But always check!

tally(~sex, data = pool)

## 
##     F     M Total 
##    10    15    25

Now, do one simulation where we select 8 people from the pool.

sample(pool, 8)

##    sex orig.ids
## 16   M       16
## 12   M       12
## 20   M       20
## 9    F        9
## 1    F        1
## 25   M       25
## 24   M       24
## 8    F        8

If we do this many times, we'll get somewhat different results from trial to trial:

sample(pool, 8)

##    sex orig.ids
## 1    F        1
## 9    F        9
## 20   M       20
## 13   M       13
## 7    F        7
## 10   F       10
## 19   M       19
## 3    F        3

sample(pool, 8)

##    sex orig.ids
## 6    F        6
## 20   M       20
## 13   M       13
## 21   M       21
## 18   M       18
## 9    F        9
## 22   M       22
## 5    F        5

sample(pool, 8)

##    sex orig.ids
## 15   M       15
## 23   M       23
## 5    F        5
## 25   M       25
## 16   M       16
## 21   M       21
## 14   M       14
## 2    F        2

Our interest is in the number of females who are promoted. Let's do three trials:

tally(~sex, data = sample(pool, 8))

## 
##     F     M Total 
##     2     6     8

tally(~sex, data = sample(pool, 8))

## 
##     F     M Total 
##     3     5     8

tally(~sex, data = sample(pool, 8))

## 
##     F     M Total 
##     4     4     8

To get a better ideal of the variation, let's repeat this many times and collect the results of the trials

trials = do(1000) * tally(~sex, data = sample(pool, 8))
head(trials)

##   F M Total
## 1 4 4     8
## 2 5 3     8
## 3 5 3     8
## 4 4 4     8
## 5 6 2     8
## 6 3 5     8

What did we get? Let's see how often each possibility arose for a number of promoted women:

tally(~F, data = trials)

## 
##     0     1     2     3     4     5     6     7 Total 
##     6    62   178   331   284   113    24     2  1000

It looks like having 5 women promoted out of this pool is fairly likely to happen under the null hypothesis.

tally(~F >= 5, data = trials)

## 
##  TRUE FALSE Total 
##   139   861  1000

Out of the 1000 trials, there were 5 or more women more than 10% of the time. So, there's no reason to reject the null hypothesis.

Bias in promotion?

Doing it in a mosaic style

Doing it in a `mosaic` style