The German Tank Problem

A Shiny Solution

What is the German Tank Problem?

  • Formally, the problem of estimating the maximum of a discrete uniform distribution from a sample drawn without replacement.

  • Named after its WW2 origin: the Allies wanted to estimate the total number of German tanks from the serial numbers of captured tanks.

  • The reasoning relies on the mediocrity principle: it is very unlikely that a random sample of serial numbers would all be clustered at the beginning or the end of the range.

  • Has also been used to estimate iPod and Commodore 64 production. You can use it with randomly sampled user IDs to estimate traffic to websites, etc.
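
The mediocrity-principle argument can be checked with a quick calculation. The following is a minimal sketch in R, assuming a hypothetical fleet of N = 100 tanks and a sample of k = 4 (both values are chosen purely for illustration and do not come from the slides). The chance that every sampled serial number falls in the lower half of the range is the number of ways to pick k serials from the lower half divided by the number of ways to pick k from the whole range.

```r
# Hypothetical values, chosen only for illustration
N <- 100  # true number of tanks
k <- 4    # number of captured tanks

# P(all k serial numbers are <= N/2) when sampling without replacement
p_low_half <- choose(N / 2, k) / choose(N, k)
p_low_half
```

The result is roughly 6%, a little below (1/2)^k, and it shrinks quickly as k grows, which is why the largest observed serial number carries so much information about the total.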

Frequentist Approach

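# Observed serial numbers
obs<-c(2,6,7,14)
m<-max(obs)     # largest observed serial number
k<-length(obs)  # number of observations
# Point estimate of the total: sample maximum plus the average gap
freqN<-m+(m/k)-1
freqN
[1] 16.5
# 95% confidence interval (continuous approximation)
lowconfinv<-m/(0.975^(1/k))
highconfinv<-m/(0.025^(1/k))
paste0("[",format(lowconfinv,digits=5),",",
       format(highconfinv,digits=5),"]")
[1] "[14.089,35.208]"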

Point estimate with a 95% confidence interval.
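
One way to build confidence in these formulas is to simulate the problem with a known answer. The sketch below assumes a hypothetical true total of N = 100 and repeatedly draws k = 4 serial numbers without replacement; `N`, `reps` and the seed are illustrative choices, not values from the slides. It then checks how the point estimate behaves on average and how often the interval contains the true total.

```r
set.seed(1)   # arbitrary seed, for reproducibility only
N <- 100      # hypothetical true number of tanks
k <- 4        # sample size, as in the example above
reps <- 10000 # number of simulated captures

# Largest serial number in each simulated sample (sampling without replacement)
maxima <- replicate(reps, max(sample.int(N, k)))

# Same point estimate and interval formulas as above, applied to every sample
est   <- maxima + maxima / k - 1
lower <- maxima / 0.975^(1 / k)
upper <- maxima / 0.025^(1 / k)

mean_est <- mean(est)                        # average point estimate
coverage <- mean(lower <= N & N <= upper)    # share of intervals containing N
mean_est
coverage
```

The average estimate should land close to N, and the coverage should come out near, though not exactly at, 95%, because the interval relies on a continuous approximation to a discrete problem.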

Bayesian Approach

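# Observed serial numbers
obs<-c(2,6,7,14)
m<-max(obs)     # largest observed serial number
k<-length(obs)  # number of observations
# Posterior mean of the total (requires k > 2)
bayesMean<-(m-1)*((k-1)/(k-2))
# Posterior standard deviation (requires k > 3)
bayesSD<-sqrt(((k-1)*(m-1)*(m-k+1))
              /((k-3)*((k-2)^2)))
paste0(format(bayesMean,digits=5),"±",
       format(bayesSD,digits=5))
[1] "19.5±10.356"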

The Bayesian approach can estimate the full probability distribution of the total. Only its mean and standard deviation are computed here, as computing a plot of the distribution can be computationally intensive.
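
For a single set of observations, the posterior can still be tabulated directly. The sketch below assumes a flat prior on the total, under which the posterior probability of a total n (for n >= m and k >= 2) is ((k-1)/k) * choose(m-1, k-1) / choose(n, k). The cut-off `n_max` is an arbitrary truncation point introduced here for illustration; it is not part of the original slides.

```r
obs <- c(2, 6, 7, 14)
m <- max(obs)     # largest observed serial number
k <- length(obs)  # number of observations

# Candidate totals; n_max is an arbitrary truncation point (assumption)
n_max <- 10000
n <- m:n_max

# Posterior P(N = n | m, k) under a flat prior, valid for k >= 2
post <- ((k - 1) / k) * choose(m - 1, k - 1) / choose(n, k)

post_total <- sum(post)      # should be very close to 1
post_mean  <- sum(n * post)  # should be very close to bayesMean

# Smallest total with at least 95% of the posterior mass at or below it
upper95 <- n[which(cumsum(post) >= 0.95)[1]]
upper95
```

Something like `plot(n[1:100], post[1:100], type = "h")` would then draw the left-hand part of the distribution. The posterior has a long right tail, so a truncated sum reproduces the mean well but converges much more slowly for the standard deviation.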

Warnings

  • Does not perform well with small numbers of observations.
  • Must account for sampling bias (e.g. what if the Germans sent all the old tanks to Africa?)
  • What if the ID numbers come from several different numbering sequences?
  • Be careful with your own data - are you giving away more info than you realise?
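
The sampling-bias warning can be illustrated with a small simulation. This is a sketch under made-up assumptions: a hypothetical total of N = 1000 tanks, samples of k = 10, and a biased scenario in which only the older half of the serial numbers can ever be captured. The helper `estimate` is a name introduced here for illustration; it applies the frequentist point estimate shown earlier.

```r
set.seed(42)  # arbitrary seed, for reproducibility only
N <- 1000     # hypothetical true number of tanks
k <- 10       # tanks captured per simulated sample
reps <- 2000  # number of simulated samples

# Frequentist point estimate: sample maximum plus the average gap
estimate <- function(s) max(s) + max(s) / length(s) - 1

# Unbiased capture: any serial number from 1..N can turn up
unbiased <- replicate(reps, estimate(sample.int(N, k)))

# Biased capture: only the older half (1..N/2) can turn up
biased <- replicate(reps, estimate(sample.int(N / 2, k)))

mean_unbiased <- mean(unbiased)
mean_biased   <- mean(biased)
c(unbiased = mean_unbiased, biased = mean_biased)
```

With unbiased samples the average estimate sits near the true total, while the biased samples point to about half of it. The estimator has no way to detect this from the serial numbers alone.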