Question1

Please visit RPubs webpage here for the compiled report:

http://rpubs.com/columbia202/TDI_challenge_ar3177_columbia_question_1

if not, copy paste the code into an R Markdown file, and Knit as HTML.

by: Alireza Roshan Ghias (ar3177@columbia.edu)

Summary

The algorithm is very simple: Draw a coin, see if it is sufficient to pay, if not, draw another one, etc. If the sum was enough, calculate the change. Do this for n times, and calculate the mean and the standard deviation.

The question is how to get 10 digits of precision for both mean and SD.

The short answer is, in order to calculate the mean and SD with 10 digits of precision, we need a sample size of 2 × 10²⁰!

In the first part of the code, the sample size needed to have 10 digits of precision for the “mean” is calculated from Central limit theorem.

In the second part, since CLT does not hold for SD, the number of samples is estimated based on an ad hoc method, i.e. by modeling the SD over number of samples.

The Code

coins = c(0.01, 0.05, 0.1, 0.25, 0.5)
price = 0.25    # in dollars
niter = 10000   # number of samples, i.e. the times we calculate change
change = rep(NA, niter)

for (i in 1:niter) {
  total = 0
  change[i] = total - price
  while (change[i] < 0) {
    # we draw a coin with equal probability
    total = total + sample(x = coins, size = 1)
    change[i] = total - price
  }
}

For sample size of 10⁴, Mean of change is: 0.1601. Standard deviation of change is: 0.1417.

To get the precision of the mean, we can use Central Limit Theorem (CLT) and calculate the standard error of the mean. SE of the mean is: 0.0014. Now if we want to have 10 digits of precision for the mean, the SE of the mean should be around 1e-11, and the number of samples should be: 2.0071 × 10²⁰, which is simply not possible to be done by my computer!

Calculating the SD of SD

For standard deviation, however, the CLT does not hold, so in order to calculate the SD of the SD, we calculate SD for various sample sizes, and using an exponential model, we estimate the number of samples needed.

rm(list = ls())

coins = c(0.01, 0.05, 0.1, 0.25, 0.5)
price = 0.25    # in dollars
niter = c(10, 100, 1000, 10000)
m = 10

stand_dev = as.data.frame(matrix(data = NA, nrow = m, ncol = length(niter)))
colnames(stand_dev) = as.character(niter)
for (k in 1:length(niter)) { # run the code for different number of iterations
  for (j in 1:m) { # m counts for each number of iterations
    change = rep(NA, niter[k])
    for (i in 1:niter[k]) { # calculate the change for each iteration
      total = 0
      change[i] = total - price
      while (change[i] < 0) {
        # we draw a coin with equal probability
        total = total + sample(x = coins, size = 1)
        change[i] = total - price
      }
    }
    # calculate the SD of the change for m counts of 'niter' times
    stand_dev[j, k] = sd(change)
  }
}

Plotting the SD of change for different sample sizes.

require(ggplot2)
require(reshape2)

stand_dev_melt = melt(stand_dev)
stand_dev_melt$variable = as.numeric(as.character(stand_dev_melt$variable))
ggplot(stand_dev_melt, aes(variable, value)) + 
  geom_point() + 
  scale_x_log10("Sample size") + 
  scale_y_continuous("Standard deviation of change")

plot of chunk unnamed-chunk-3

As we can see, the SD converges as sample size increases. To find out what sample size is needed for the SD to be precise to 10 digits, we need to calculate the SD of SD.

sos = data.frame(x = niter,
                 y = sapply(stand_dev, function(x) sd(x)))
ggplot(sos, aes(x, y)) + 
  geom_point(size = 4) +
  stat_smooth(method = "lm", col = "red") +
  scale_x_log10("Sample size") + 
  scale_y_log10("SD of SD")

plot of chunk unnamed-chunk-4

SD of SD seems to have a linear relationship with sample size in log-log scale, so we can fit a line, and predict the sample size needed

exponential.model <- lm(data = sos, log10(y)~ log10(x))
summary(exponential.model)

## 
## Call:
## lm(formula = log10(y) ~ log10(x), data = sos)
## 
## Residuals:
##      10     100    1000   10000 
## -0.0234  0.0615 -0.0527  0.0146 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -0.9945     0.0741   -13.4   0.0055 **
## log10(x)     -0.5848     0.0270   -21.6   0.0021 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0605 on 2 degrees of freedom
## Multiple R-squared:  0.996,  Adjusted R-squared:  0.994 
## F-statistic:  468 on 1 and 2 DF,  p-value: 0.00213

coef = exponential.model$coefficients
num_sample = 10^((-11 - coef[1])/coef[2])

In order to have 10 digits of precision for SD, the sd(sd(x)) should be 1e-11. So the number of samples needed is 1.2875 × 10¹⁷!

End.