Riddler Classic

From The Riddler at FiveThirtyEight: "From Kathy Bischoping comes a question we’ve all asked ourselves at one time or another:

I have 10 pairs of socks in a drawer. Each pair is distinct from another and consists of two matching socks. Alas, I’m negligent when it comes to folding my laundry, and so the socks are not folded into pairs. This morning, fumbling around in the dark, I pull the socks out of the drawer, randomly and one at a time, until I have a matching pair of socks among the ones I’ve removed from the drawer.

On average, how many socks will I pull out of the drawer in order to get my first matching pair?

(Note: This is different from asking how many socks I must pull out of the drawer to guarantee that I have a matching pair. The answer to that question, by the pigeonhole principle, is 11 socks. This question is instead asking about the average.)

Extra credit: Instead of 10 pairs of socks, what if I have some large number N pairs of socks?"

Solution

We will take a bare-bones approach to this problem, and define a random variable \(X\) to be the number of socks it takes to finally draw out a pair. In order to get a pair, you need to select at least two, and as we se from the questoin itself, we’ll get a pair for sure by the 11th. So, possible values of \(X\) range from 2 to 11.

After selecting one sock out of the drawer, you will now have a 1/19 chance that the second will match (one sock of the remaining 19 matches yours). On the same token, there is a 18/19 chance that you select another sock that doesn’t match yours.

Now with 18 socks left in the drawer, there is a 2/18 chance you select a sock that matches one of the two that you have, and a 16/18 chance that you do not. In the event you do not, you hold 3 unmatched socks with 17 left in the drawer, giving you a 3/17 chance of a match and a 14/17 chance that you do not match.

Continuing in this fashion, we can fill out the distribution for \(X\):

x <- 2:11
px <- c(1/19, 
        18/19*2/18, 
        18/19*16/18*3/17, 
        18/19*16/18*14/17*4/16, 
        18/19*16/18*14/17*12/16*5/15,
        18/19*16/18*14/17*12/16*10/15*6/14, 
        18/19*16/18*14/17*12/16*10/15*8/14*7/13, 
        18/19*16/18*14/17*12/16*10/15*8/14*6/13*8/12, 
        18/19*16/18*14/17*12/16*10/15*8/14*6/13*4/12*9/11, 
        18/19*16/18*14/17*12/16*10/15*8/14*6/13*4/12*2/11)
distX <- data.frame(x, px)
distX
##     x          px
## 1   2 0.052631579
## 2   3 0.105263158
## 3   4 0.148606811
## 4   5 0.173374613
## 5   6 0.173374613
## 6   7 0.148606811
## 7   8 0.106692070
## 8   9 0.060966897
## 9  10 0.024941003
## 10 11 0.005542445

To get the expected value, we multiply the rows together and sum the resulting products.

sum(x*px)
## [1] 5.675464

If you keep track of the calculation for this number, you will find out that it could be written as \(\frac{2^{18}}{11*13*17*19}\). The average number of socks to be selected from the drawer if there are 2, 4, 6, 8, or 10 socks in the drawer reveals the values \(2^1\), \(\frac{2^3}{3}\), \(\frac{2^4}{5}\), \(\frac{2^7}{5*7}\), and \(\frac{2^8}{7*9}\), respectively.

This shows a pattern of some sort, but it isn’t obvious. Perhaps there is a limiting value? We build a function that will take any number of pairs of socks and return the expected number of socks that need to be drawn before a pair is finally found along with a simulator to check our work.

ExNumSocks <- function(N){
   x <- 2:(N+1)  ## The possible values of socks to be drawn
   px <- c(1/(2*N-1))  ## The 2nd draw probability
   if(N==1){return(2*px)}
   for(i in 2:N){
      p <- i/(2*N-i)
      for(j in 2:i){
         p <- p*(2*N-2*(j-1))/(2*N-j+1) ## This fills out the probability dist. 
      }
      px <- c(px, p)
   }
   return(sum(x*px))  ## Returns the expected value.
}

## Does it work for 10? Answer should be 5.675464
ExNumSocks(10)
## [1] 5.675464
pairSocks <- function(n){ #The number of pairs of socks n
        socks <- c(1:n,1:n)
        socksSelected <- c()
        while(length(socksSelected) == length(unique(socksSelected))){
                s <- sample(socks, 1)
                socks <- socks[-min(which(socks==s))]
                socksSelected <- c(socksSelected,s)
        }
        return(length(socksSelected))
}

## Check the function a few times. With seed set at 42, it should be 9, 4, and 5. 
set.seed(42)
pairSocks(10)
## [1] 9
pairSocks(10)
## [1] 4
pairSocks(10)
## [1] 5

How does the average value compare with \(N\) in the long run? Let’s explore some really large values.

ExNumSocks(1000)
## [1] 56.05692
ExNumSocks(2500)
## [1] 88.62712
ExNumSocks(10000)
## [1] 177.2476

The mean number of socks does grow slowly, but how? I’ll have to admit, I would have been stuck for a long time if it wasn’t for Mark Hannan’s hint/question on Twitter: “Should the square root of pi be playing a role in the Classic extra credit?”

Indeed it does! Let me show you how really quick before an explanation of how it works. It turns out, that if we look at \(\sqrt{N\pi}\) for large numbers of pairs of socks \(N\), this is really close to the expected number of socks!

Take a look:

abs(sqrt(1000*pi) - ExNumSocks(1000))
## [1] 0.007006677
abs(sqrt(2500*pi) - ExNumSocks(2500))
## [1] 0.004431245
abs(sqrt(10000*pi) - ExNumSocks(10000))
## [1] 0.002215581

Why is this occurring? By visiting Wikipedia’s List of Formula involving \(\pi\) and scrolling down far enough, you will come upon Stirling’s approximation: \[ \pi = \lim_{n\rightarrow \infty}\frac{2^{4n}}{n\binom{2n}{n}^2} \] Since the square root function is a nice, continuous function, we can apply the square root on both sides of the equation and move the square root inside the limit. This gives \[ \sqrt{\pi} = \lim_{n\rightarrow \infty}\frac{2^{2n}}{\sqrt{n}\binom{2n}{n}} \]

It takes a lot of simplifying, but you can get the expression inside the limit to eventually look like \[ \frac{2^n \cdot n!}{\sqrt{n}\left((2n-1)(2n-3)\cdots 3\cdot 1\right)}. \]

This is beginning to look a little like the pattern of the expected values (without the square root of n in the denominator). Using the above formula with 10 (again, without the square root in the bottom), notice what happens after simplifying!

\[ \begin{aligned} \frac{2^{10}\cdot 10!}{19\cdot 17\cdot 15\cdot \ldots \cdot 3\cdot 1} &= \frac{2^{10}\cdot 10\cdot 9\cdot 8\cdot \ldots \cdot 2 \cdot 1}{19\cdot 17\cdot 15\cdot \ldots \cdot 3 \cdot 1} \\ &= \frac{2^{10}\cdot 10\cdot 8\cdot 6\cdot 4\cdot 2}{19\cdot 17\cdot 15\cdot 13\cdot 11} \\ &= \frac{2^{18}\cdot 5\cdot 3}{19\cdot 17\cdot 15\cdot 13\cdot 11} \\ &= \frac{2^{18}}{19\cdot 17\cdot 13\cdot 11} \\ &= 5.675464 \end{aligned} \] This matches up exactly with our calculation from before! So, by letting \[ E_n = \frac{2^n \cdot n!}{\left((2n-1)(2n-3)\cdots 3\cdot 1\right)}, \] we have a nice expression for the expected number of socks we need to select out of a drawer full of n pairs of socks (unfolded) before we end up with a pair. Let’s build a second function using this expression (much easier) and compare it with the first.

ExNumSocks2 <- function(n){
  return(2^(2*n)/choose(2*n,n))
}

## First function values from 1 to 15
sapply(1:15, ExNumSocks)
##  [1] 2.000000 2.666667 3.200000 3.657143 4.063492 4.432900 4.773893
##  [8] 5.092152 5.391691 5.675464 5.945724 6.204234 6.452403 6.691381
## [15] 6.922118
## Second function values from 1 to 15
sapply(1:15, ExNumSocks2)
##  [1] 2.000000 2.666667 3.200000 3.657143 4.063492 4.432900 4.773893
##  [8] 5.092152 5.391691 5.675464 5.945724 6.204234 6.452403 6.691381
## [15] 6.922118

These are equivalent! Since the limit of \(E_n/\sqrt{n}\) approaches \(\sqrt{\pi}\) as \(n\) grows without bound, then to estimate \(E_n\) for large \(n\), we simply calculate \(\sqrt{n\cdot \pi}\).

It wouldn’t be fun without a cool plot!

library(ggplot2)
library(ggthemes)
x <- 1:25
y1 <- sapply(x, ExNumSocks)
y2 <- sapply(x, function(x){return(sqrt(x*pi))})
Socks.df <- data.frame(Pairs = x, `Expected Number` = y1, `Limiting Number` = y2)

g = ggplot() + 
  geom_line(data = Socks.df, aes(x = Pairs, y = Expected.Number), color = "blue") +
  geom_line(data = Socks.df, aes(x = Pairs, y = Limiting.Number), color = "red") +
  theme_fivethirtyeight()

print(g)

Coworkers

The Riddler Express problem directly from The Riddler at FiveThirtyEight:

“You’re new at your job, and your office is voting on a theme for its holiday party. It’s fallen on you to record the percent of your coworkers (including yourself) who voted for each one. Well, since you’re in a hurry, you just write down everything in the percentage that comes before the decimal point. So for example, 35.0 percent, 35.17 percent and 35.92 percent would all be written simply as “35 percent.”

After the votes are tallied, you found that the winner received 73 percent of the vote (at least, that’s what you wrote down), while second place had 58 percent, and third place had 32 percent. Your first realization is that you work with a bunch of cheaters who voted more than once. But your second thought is that you might be able to use this information to figure out how many people work in your office. (As I said, you’re new, and this isn’t something you know off the top of your head.)

Based on these percentages, what’s the minimum number of people who could work in your office?

Extra credit: Your office could be filled with many possible numbers of people. Based on the percentages given in the problem, what’s the greatest number of people your office can’t have?"

Solution

We need to find different numbers of coworkers in which there is a percentage of those coworkers that falls in the intervals \([32,33)\), \([58,59)\), and \([73,74)\). We expect this to get more and more likely as we get closer and closer to 100.

It turns out that we don’t need to worry ourselves with which of our coworkers voted more than once. We simply need to concentrate on how these percentages are percentages of the total number of our coworkers.

Let’s find what numbers between 10 and 100 can work.

set73 <- c()
set58 <- c()
set32 <- c()
for(n in 10:100){
  for(a in 1:(n-1)){
    temp <- trunc(100*a/n)
    if(temp == 73){
      set73 <- c(set73, n)
    }
    else if(temp==58){
      set58 <- c(set58, n)
    }
    else if(temp == 32){
      set32 <- c(set32, n) 
    }
  }
}

# Numbers with a ratio [.73,.74)
set73
##  [1]  15  19  23  26  30  34  38  41  42  45  46  49  52  53  56  57  60
## [18]  61  63  64  65  67  68  69  71  72  73  75  76  78  79  80  82  83
## [35]  84  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100
# Numbers with a ratio [.58,.59)
set58
##  [1]  12  17  24  29  31  34  36  39  41  43  46  48  50  51  53  55  56
## [18]  58  60  62  63  65  67  68  70  72  73  74  75  77  78  79  80  81
## [35]  82  84  85  86  87  89  90  91  92  93  94  95  96  97  98  99 100
# Numbers with a ratio [.32,.33)
set32
##  [1]  25  28  31  34  37  40  43  46  49  50  52  53  55  56  58  59  61
## [18]  62  64  65  67  68  70  71  73  74  75  76  77  78  79  80  81  82
## [35]  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99
## [52] 100

What numbers to all of these sets have in common? What is the minimum of this set, and what is the largest value not in this set?

MainSet <- intersect(set32, intersect(set58, set73))
print("The intersection of all three sets")
## [1] "The intersection of all three sets"
MainSet
##  [1]  34  46  53  56  65  67  68  73  75  78  79  80  82  84  86  87  89
## [18]  90  91  92  93  94  95  96  97  98  99 100
print("The answer to the first question")
## [1] "The answer to the first question"
min(MainSet)
## [1] 34
print("The answer to the extra credit")
## [1] "The answer to the extra credit"
max(setdiff(10:100, MainSet))
## [1] 88

Why is 34 the answer to the first question? According to the instructions we would have written 11 votes as 32% (since it is 32.35294%), 20 votes as 58% (since it is 58.82353%), and 25 votes as 73% (since it is 73.52941%). All numbers smaller than 34 cannot do this for all three percentages.

Why is 88 the answer to the extra credit? Although 29 would be written as 32% of 88, and 65 would be written as 73% of 88, 51 would be written as 57% and 52 would be written as 59%. All other values above 88 are shared in all three sets.