Problem

You are an Allied general and you have to figure out how many tanks the Nazis are producing every month. You are lucky: your forces have captured some of their tanks, and you know that all tanks produced carry sequential serial numbers (what a mistake to make). How would you figure out how many tanks are produced per month? For simplicity, let’s assume that serial numbers are reset every month and that the Nazis produce only one type of tank.

Solutions

First, this problem can be solved with the help of statistics, and that is how it was done in WW2. Here are the numbers the statisticians got, compared to the intelligence estimate and the actual production (source):

Month         Statistical estimate   Intelligence estimate   German records
June 1940     169                    1000                    122
June 1941     244                    1550                    271
August 1942   327                    1550                    342

As the table shows, statistics is useful. Relying on gut feeling alone (they called it an intelligence estimate) is like trying to hit a target blindfolded.

Before I give you the answer to how the statistical estimate was calculated, let’s think. What would you do if you had this kind of problem (and no statistician was to be found)? My first guess would be the highest serial number captured. The probability that a single captured tank is the last one produced last month is 1/n, where n is the number of tanks produced last month; with k captured tanks it is k/n. (I assume that the last tank produced in the previous month is sent to the battlefield this month, because transporting it takes time.) The more tanks produced, the lower the probability that I captured the last one. So my initial guess might not be very accurate.
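The chance of holding the very last tank can be checked with a quick simulation. This is only a sketch: the production figure, the number of captured tanks and the seed are illustrative values, not numbers from the story. For a single captured tank the chance is 1/n; for k tanks drawn without replacement it is k/n.

```r
# Sketch: chance that k captured tanks include the last one produced.
# n_tanks, k, n_sims and the seed are illustrative assumptions.
set.seed(1)
n_tanks <- 200    # tanks produced last month (assumed)
k <- 5            # tanks captured (assumed)
n_sims <- 20000   # number of simulated captures

# draw k distinct serial numbers and check whether the top one is among them
hit <- replicate(n_sims, max(sample(n_tanks, k)) == n_tanks)

p_sim <- mean(hit)        # simulated probability
p_exact <- k / n_tanks    # exact probability when sampling without replacement
c(simulated = p_sim, exact = p_exact)
```

With these values the exact probability is only 2.5%, which is why the highest captured serial number alone is usually too low.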

My second guess would be to add 5% to the highest serial number captured. The logic is that the probability of having captured the last serial number produced is pretty low (I assume that quite a lot of tanks are produced). Why add 5%? God knows; it just looks like a beautiful proportion to add :).

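As you can already see, my initial and second guesses are not too stat-savvy. Now let’s add the equation that statisticians used during WW2 (at least one of the equations): N = m + m/k - 1,

where (matching the thirdModel function in the simulation code):

m is the highest serial number observed,
k is the number of tanks captured.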
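To make the equation concrete, here is a small worked example. In it, m is taken to be the highest serial number observed and k the number of tanks captured, which is how the simulation code in this post uses the formula. The function name estimate_tanks and the four serial numbers are made up for illustration.

```r
# Sketch of the estimator N = m + m/k - 1.
# estimate_tanks and the serial numbers below are hypothetical.
estimate_tanks <- function(serials) {
  m <- max(serials)      # highest serial number observed
  k <- length(serials)   # number of tanks captured
  m + m / k - 1
}

captured <- c(19, 40, 42, 60)
estimate_tanks(captured)   # 60 + 60/4 - 1 = 74
```

The intuition is that the average gap between observed serial numbers (roughly m/k) is added on top of the maximum.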

Now, how do we test whether the last solution is the best? With some Monte Carlo simulations, of course. Here is the code:

#make a bunch of data frames/vectors for the simulation
real=c()
third.guess=data.frame(NULL)
initial.guess=data.frame(NULL)
second.guess=data.frame(NULL)
resids1=data.frame(NULL)
resids2=data.frame(NULL)
resids3=data.frame(NULL)
#function for the third guess: N = m + m/k - 1
thirdModel = function(samp) {
    max(samp) + max(samp)/length(samp) - 1
}
#simulation: true production runs from 100 to 1000 in steps of 10
x = c(10:100)
for(i in x) {
    trueTop = 10*i
    real[i-9]=trueTop
    for(j in 1:50) {
        observeds = sample(1:trueTop, 5) #in every simulation we capture 5 random tanks
        third.guess[i-9,j] = thirdModel(observeds)
        initial.guess[i-9,j]=max(observeds)
        #NOTE: this adds 20% (factor 1.2), not the 5% described in the text;
        #with 5 captured tanks it always equals the third guess plus 1
        second.guess[i-9,j]=max(observeds)*1.2
        #residuals are computed as true value minus guess
        resids1[i-9,j] = trueTop - initial.guess[i-9,j]
        resids2[i-9,j] = trueTop - second.guess[i-9,j]
        resids3[i-9,j] = trueTop - third.guess[i-9,j]
    }
}

To plot the results (I like ggplot) we need to transform the data a little bit:

#add a column to dataframes
initial.guess$real=real
second.guess$real=real
third.guess$real=real
resids1$real=real
resids2$real=real
resids3$real=real
#reshape the data from wide to long
library(reshape2)
initial.guess.melt=melt(initial.guess, id="real")
initial.guess.melt$variable="initial.guess"
second.guess.melt=melt(second.guess, id="real")
second.guess.melt$variable="second.guess"
third.guess.melt=melt(third.guess, id="real")
third.guess.melt$variable="third.guess"
#rbind tables
data.guess=rbind(initial.guess.melt, second.guess.melt, third.guess.melt)

And now let’s plot just the initial and second models:

library(ggplot2)
ggplot(subset(data.guess, variable!="third.guess"), aes(x=real, y=value))+
    geom_jitter(aes(color=variable), alpha=0.5)+
    geom_abline(intercept = 0, slope=1)+
    ylab("guess")+
    xlab("number of tanks produced")+
    scale_colour_discrete(name="Model",
                         breaks=c("initial.guess", "second.guess"),
                         labels=c("Initial model", "Second model"))+
    theme_minimal()

[Plot: guesses of the initial and second models against the number of tanks produced]

The plot has many dots: we simulated production between 100 and 1000 tanks in steps of 10, with 50 simulations at each step. As the plot shows, the initial guess tends to underestimate the number of tanks produced (it can never overestimate it). The second guess is more balanced (sometimes overestimating, sometimes underestimating). So our second model might improve the accuracy of the guess.
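The size of the underestimate can also be worked out directly. For k serial numbers drawn without replacement from 1..N, the expected maximum is k(N+1)/(k+1), a standard result for the discrete uniform distribution. The sketch below compares that value with a simulation; N, k, the seed and the number of runs are illustrative choices.

```r
# Sketch: how far below N the highest captured serial number sits on average.
# N, k, n_sims and the seed are illustrative assumptions.
set.seed(2)
N <- 500          # true number of tanks (assumed)
k <- 5            # tanks captured (assumed)
n_sims <- 20000

sim_max <- replicate(n_sims, max(sample(N, k)))   # highest serial in each capture
expected_max <- k * (N + 1) / (k + 1)             # 5 * 501 / 6 = 417.5

c(simulated = mean(sim_max), exact = expected_max)
```

So with 5 captured tanks out of 500, the initial guess falls short by about 80 tanks on average.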

Now let’s compare the second and third models:

ggplot(subset(data.guess, variable!="initial.guess"), aes(x=real, y=value))+
    geom_jitter(aes(color=variable), alpha=0.5)+
    geom_abline(intercept = 0, slope=1)+
    ylab("guess")+
    xlab("number of tanks produced")+
    scale_colour_discrete(name="Model",
                         breaks=c("second.guess", "third.guess"),
                         labels=c("Second model", "Third model"))+ 
    theme_minimal()

[Plot: guesses of the second and third models against the number of tanks produced]

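Now the dots overlap almost completely, so my second model is not much worse than the third model. Wow! Note, though, that the simulation code multiplies the maximum by 1.2 rather than 1.05, and with 5 captured tanks that is exactly the third model’s guess plus 1, so the overlap is built in. Let’s look at the residuals (true number of tanks - guess, as computed in the simulation code) to understand how accurate the models are.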

library(reshape2)
resids1.melt=melt(resids1, id="real")
resids1.melt$variable="resids1"
resids2.melt=melt(resids2, id="real")
resids2.melt$variable="resids2"
resids3.melt=melt(resids3, id="real")
resids3.melt$variable="resids3"
data.resids=rbind(resids1.melt, resids2.melt, resids3.melt)

ggplot(subset(data.resids, variable!="resids1"), aes(x=real, y=value))+
    geom_jitter(aes(color=variable), alpha=0.5)+
    geom_abline(intercept = 0, slope=0)+
    ylab("true number of tanks - guess")+
    xlab("number of tanks produced")+
    scale_colour_discrete(name="Residuals",
                         breaks=c("resids2", "resids3"),
                         labels=c("Second model", "Third model"))+ 
    theme_minimal()

[Plot: residuals of the second and third models against the number of tanks produced]

Residuals look almost the same for the second and third models. Let’s calculate the mean, median and sum of the residuals.

library(plyr)
ddply(data.resids, "variable", summarize,
      mean=mean(value),
      median=median(value),
      sum=sum(value))
##   variable   mean median    sum
## 1  resids1 89.302   58.0 406324
## 2  resids2 -2.838  -17.6 -12911
## 3  resids3 -1.838  -16.6  -8361
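The gap of exactly 1 between the resids2 and resids3 rows (in both mean and median) is built into the simulation: with 5 captured tanks, m + m/5 - 1 equals 1.2*m - 1, and the second guess was coded as 1.2*m. A minimal check, with an arbitrary seed and illustrative variable names:

```r
# Sketch: with k = 5, the 1.2 multiplier and the third model differ by exactly 1.
set.seed(3)
samp <- sample(1:500, 5)   # 5 captured serial numbers (illustrative)

second <- max(samp) * 1.2                              # second guess as coded in the simulation
third  <- max(samp) + max(samp) / length(samp) - 1     # third model

second - third   # equals 1, up to floating point rounding
```

So in the first simulation the two models are really the same estimator shifted by one tank.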

As seen from the sum of residuals, the third model is the most accurate (closest to 0), the second one is close and the first one is way off. So adding a fixed percentage to the initial model (20% in this simulation, see the factor 1.2 in the code) significantly increased its accuracy. But let’s not draw too wild conclusions here. The problem with my second model is that it is quite rigid.
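A caveat on the metric: a sum (or mean) of residuals measures bias, because overestimates and underestimates cancel each other out. A complementary measure is the root mean squared error (RMSE), where they cannot cancel. The sketch below is not part of the original analysis; it uses a fixed production of 500, 5 captured tanks and a 5% second guess as illustrative assumptions.

```r
# Sketch: bias and RMSE of the three guesses for one fixed scenario.
# N, k, n_sims, the seed and the 1.05 factor are illustrative assumptions.
set.seed(4)
N <- 500; k <- 5; n_sims <- 10000
m <- replicate(n_sims, max(sample(N, k)))   # highest serial per simulated capture

guesses <- data.frame(
  initial = m,
  second  = m * 1.05,        # the 5% rule described in the text
  third   = m + m / k - 1
)

# bias: mean of (true - guess), the sign convention of the simulation code
# rmse: typical size of the error, where over- and underestimates cannot cancel
accuracy <- data.frame(
  bias = sapply(guesses, function(g) mean(N - g)),
  rmse = sapply(guesses, function(g) sqrt(mean((N - g)^2)))
)
accuracy
```

A model can have a bias close to zero and still be far off in individual cases, so it is worth looking at both columns.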

To show this, let’s fix production at 500 tanks, increase the sample size (number of captured tanks) from 1 to 100, and plot the guessed number of tanks against the real number of tanks produced.

real=c()
third.guess=data.frame(NULL)
initial.guess=data.frame(NULL)
second.guess=data.frame(NULL)
n=c(1:100)
#simulation: true production is fixed, sample size grows from 1 to 100
trueTop=500
for(i in n) {
    for(j in 1:50) {
        observeds = sample(1:trueTop, n[i])
        third.guess[i,j] = thirdModel(observeds)
        initial.guess[i,j]=max(observeds)
        second.guess[i,j]=max(observeds)*1.05 #this time the 5% rule from the text
    }
}
#add sample size for every simulation
initial.guess$n=n
second.guess$n=n
third.guess$n=n
#reshape the data from wide to long
initial.guess.melt=melt(initial.guess, id="n")
initial.guess.melt$variable="initial.guess"
second.guess.melt=melt(second.guess, id="n")
second.guess.melt$variable="second.guess"
third.guess.melt=melt(third.guess, id="n")
third.guess.melt$variable="third.guess"
data.guess=rbind(initial.guess.melt, second.guess.melt, third.guess.melt)
#plot it
ggplot(subset(data.guess, variable!="initial.guess"), aes(x=n, y=value))+
    geom_jitter(aes(color=variable), alpha=0.5)+
    geom_abline(intercept = 500, slope=0)+
    ylab("number of tanks")+
    xlab("sample size")+
    scale_colour_discrete(name="Model",
                         breaks=c("second.guess", "third.guess"),
                         labels=c("Second model", "Third model"))+ 
    theme_minimal()

[Plot: guesses of the second and third models against sample size, with the true value of 500 as a horizontal line]

The actual number of tanks produced is 500 in every case. The smaller the sample size, the less accurate the models tend to be (variability is bigger). As the plot shows, as the sample size (the number of tanks we capture from the enemy) increases, the second model tends to overestimate the number of tanks produced, while the third model nicely gets more accurate. It is logical that the second model overestimates: as the number of captured tanks increases, so does the probability that we capture the tank with the last serial number (or one very close to it). If we add 5% to this number, we will definitely overestimate. And in this case (when the sample size is big) our first model might be no worse than the second one (it might even be better).
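The rigidity can be made explicit. A fixed share p added to the maximum m gives the same guess as m + m/k - 1 only when m*p = m/k - 1, that is, when k = m/(p*m + 1). The helper below (break_even_k is a made-up name for this sketch) computes that sample size for a maximum near 500.

```r
# Sketch: sample size k at which adding a fixed share p to the maximum m
# gives the same guess as m + m/k - 1. break_even_k is a hypothetical helper.
break_even_k <- function(m, p) m / (p * m + 1)

break_even_k(m = 500, p = 0.05)   # 500 / 26, about 19 tanks
break_even_k(m = 500, p = 0.20)   # 500 / 101, about 5 tanks
```

So with 5% the second guess is lower than the third one when fewer than about 19 tanks are captured and higher when more are captured, which fits the overestimation seen at large sample sizes.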

ggplot(subset(data.guess, variable!="third.guess"), aes(x=n, y=value))+
    geom_jitter(aes(color=variable), alpha=0.5)+
    geom_abline(intercept = 500, slope=0)+
    ylab("number of tanks")+
    xlab("sample size")+
    scale_colour_discrete(name="Model",
                         breaks=c("initial.guess", "second.guess"),
                         labels=c("Initial model", "Second model"))+
    theme_minimal()

[Plot: guesses of the initial and second models against sample size, with the true value of 500 as a horizontal line]

As the plot shows, our first model only slightly underestimates the number of tanks produced as the sample size increases. The second model tends to overestimate it (by more than the first model underestimates).

Conclusion

So, as we have seen, sequential serial numbers give intelligence (or whoever else your enemy might be) quite accurate ways to calculate your real production. There are easier ways (take the maximum serial number observed and add 5% to it) and slightly more complex ways (the equation statisticians used in WW2). If possible, use the more complex ways, because they tend to be more accurate. The easier methods are not necessarily worse, but you have to know the criteria under which they are comparable with the complex methods (and this makes these methods complex too). And if you are a tank producer, don’t use sequential serial numbers. Better make a huge pool of numbers (hundreds or thousands of times bigger than your usual production capacity) and randomly pick serial numbers from it.
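The effect of picking serial numbers at random from a large pool can be sketched in a few lines. The production figure, the pool size and the seed are assumptions chosen for illustration; the estimator is the same m + m/k - 1 used throughout.

```r
# Sketch: random serial numbers from a large pool defeat the estimator.
# production, pool_size and the seed are illustrative assumptions.
set.seed(5)
production <- 300        # tanks actually built (assumed)
pool_size  <- 300000     # pool 1000 times larger than production (assumed)
serials <- sample(pool_size, production)   # random, non-sequential serial numbers

captured <- sample(serials, 5)             # the enemy captures five tanks
estimate <- max(captured) + max(captured) / length(captured) - 1
estimate   # tracks the size of the pool, not the 300 tanks actually built
```

The enemy ends up estimating the size of the number pool, which says nothing about real production.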

More articles and blog posts about the German tank problem can be found here, here and here.