Please visit RPubs webpage here for the compiled report:
http://rpubs.com/columbia202/TDI_challenge_ar3177_columbia_question_1
if not, copy paste the code into an R Markdown file, and Knit as HTML.
by: Alireza Roshan Ghias (ar3177@columbia.edu)
The algorithm is very simple: Draw a coin, see if it is sufficient to pay, if not, draw another one, etc. If the sum was enough, calculate the change. Do this for n times, and calculate the mean and the standard deviation.
The question is how to get 10 digits of precision for both mean and SD.
The short answer is, in order to calculate the mean and SD with 10 digits of precision, we need a sample size of 2 × 1020!
In the first part of the code, the sample size needed to have 10 digits of precision for the “mean” is calculated from Central limit theorem.
In the second part, since CLT does not hold for SD, the number of samples is estimated based on an ad hoc method, i.e. by modeling the SD over number of samples.
coins = c(0.01, 0.05, 0.1, 0.25, 0.5)
price = 0.25 # in dollars
niter = 10000 # number of samples, i.e. the times we calculate change
change = rep(NA, niter)
for (i in 1:niter) {
total = 0
change[i] = total - price
while (change[i] < 0) {
# we draw a coin with equal probability
total = total + sample(x = coins, size = 1)
change[i] = total - price
}
}
For sample size of 104, Mean of change is: 0.1601. Standard deviation of change is: 0.1417.
To get the precision of the mean, we can use Central Limit Theorem (CLT) and calculate the standard error of the mean. SE of the mean is: 0.0014. Now if we want to have 10 digits of precision for the mean, the SE of the mean should be around 1e-11, and the number of samples should be: 2.0071 × 1020, which is simply not possible to be done by my computer!
For standard deviation, however, the CLT does not hold, so in order to calculate the SD of the SD, we calculate SD for various sample sizes, and using an exponential model, we estimate the number of samples needed.
rm(list = ls())
coins = c(0.01, 0.05, 0.1, 0.25, 0.5)
price = 0.25 # in dollars
niter = c(10, 100, 1000, 10000)
m = 10
stand_dev = as.data.frame(matrix(data = NA, nrow = m, ncol = length(niter)))
colnames(stand_dev) = as.character(niter)
for (k in 1:length(niter)) { # run the code for different number of iterations
for (j in 1:m) { # m counts for each number of iterations
change = rep(NA, niter[k])
for (i in 1:niter[k]) { # calculate the change for each iteration
total = 0
change[i] = total - price
while (change[i] < 0) {
# we draw a coin with equal probability
total = total + sample(x = coins, size = 1)
change[i] = total - price
}
}
# calculate the SD of the change for m counts of 'niter' times
stand_dev[j, k] = sd(change)
}
}
Plotting the SD of change for different sample sizes.
require(ggplot2)
require(reshape2)
stand_dev_melt = melt(stand_dev)
stand_dev_melt$variable = as.numeric(as.character(stand_dev_melt$variable))
ggplot(stand_dev_melt, aes(variable, value)) +
geom_point() +
scale_x_log10("Sample size") +
scale_y_continuous("Standard deviation of change")
As we can see, the SD converges as sample size increases. To find out what sample size is needed for the SD to be precise to 10 digits, we need to calculate the SD of SD.
sos = data.frame(x = niter,
y = sapply(stand_dev, function(x) sd(x)))
ggplot(sos, aes(x, y)) +
geom_point(size = 4) +
stat_smooth(method = "lm", col = "red") +
scale_x_log10("Sample size") +
scale_y_log10("SD of SD")
SD of SD seems to have a linear relationship with sample size in log-log scale, so we can fit a line, and predict the sample size needed
exponential.model <- lm(data = sos, log10(y)~ log10(x))
summary(exponential.model)
##
## Call:
## lm(formula = log10(y) ~ log10(x), data = sos)
##
## Residuals:
## 10 100 1000 10000
## -0.0234 0.0615 -0.0527 0.0146
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.9945 0.0741 -13.4 0.0055 **
## log10(x) -0.5848 0.0270 -21.6 0.0021 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0605 on 2 degrees of freedom
## Multiple R-squared: 0.996, Adjusted R-squared: 0.994
## F-statistic: 468 on 1 and 2 DF, p-value: 0.00213
coef = exponential.model$coefficients
num_sample = 10^((-11 - coef[1])/coef[2])
In order to have 10 digits of precision for SD, the sd(sd(x)) should be 1e-11. So the number of samples needed is 1.2875 × 1017!
End.