Random Numbers, Humanity’s Favorite Pastime

Random numbers and simulations are key to introducing some of the most fundamental concepts in statistics, because they let us build an intuition for these universal laws without complex calculus or formal mathematical proofs. Computer-aided random number generation in R lets us conduct virtual experiments at a scale impossible by hand! This is powerful, as one can experimentally learn about the nature of probability using nothing but R.

Learning goals:
* Understanding the Central Limit Theorem (CLT) and the Law of Large Numbers (LLN), and relating these to how sample means behave
* Mastering Random Number Generation (RNG)
* Developing custom R functions to automate statistical tasks and conduct virtual experiments efficiently
* Conducting virtual sampling experiments to observe the LLN in action and see how sample means converge toward the population mean as sample size increases
* Exploring sampling distributions and understanding how these differ from the population and sample, and why these tend to become normally distributed
* Applying concepts to real-world problems, such as guessing games and gambling, by simulating these situations and their corresponding probability distributions

Virtual Dice Simulation

Dice are one of the oldest random number generators humanity has ever created, with the oldest known dice dating back to roughly 3000-2500 BCE. Dice have been found at sites as far apart as the Indus Valley and the Orkney archipelago in Scotland, suggesting that they were also widespread.

# setting random seed to ensure the random numbers generated are reproducible
# function specifies a starting point for the computer when it generates a random number sequence
# this can be any number, but if not supplied, usually comes from seconds on the computer system clock
#       (computer counts seconds using 1st Jan 1970 as starting point)
# if you don’t set the random seed, random draws will be different each time
set.seed(10092025)

# throwing a virtual dice once -> this does the actual simulation of a dice throw
#       1. 1st argument -> vector of numbers, characters, factors or any R-object with "length"
#       2. 2nd argument -> size -> sets number of throws, or how many samples to be taken
#       3. 3rd argument -> replacement -> ensures randomness by sampling with replacement
# sample(x, size, replace = FALSE, prob = NULL)
sample(1:6, size=1, replace=TRUE)
## [1] 5
# throwing a virtual dice five times
sample(1:6, size=5, replace=TRUE)
## [1] 2 4 2 3 1
# possible to sample with any kind of dice, even with more faces than would be feasibly possible in real life
sample(1:10000, size=5, replace=TRUE)
## [1] 5453 9714  880 8684  132
# possible to throw normal dice 1 million times and to count the number of times each face appeared
# 1 million -> 1,000,000 --> 1e6
onemillionthrows <- sample(1:6, size=1e6, replace=TRUE)
table(onemillionthrows)
## onemillionthrows
##      1      2      3      4      5      6 
## 167163 166619 166563 166443 166417 166795
# how many times should we expect to see each face?
1e6/6 # calculating expected frequency of each face on a fair dice
## [1] 166666.7
# each face should appear approximately 1,000,000÷6 = 166,666.7 times on average
# this expected value serves as theoretical baseline to compare against actual simulation results
#       and verify if dice is behaving fairly
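As a quick sanity check of that baseline, we can compare the observed proportions to the theoretical probability of 1/6. A minimal sketch, reusing the onemillionthrows object from above:

# converting counts to proportions and comparing against the theoretical probability 1/6
observedProp <- table(onemillionthrows) / 1e6
round(observedProp - 1/6, 4) # deviations should be very close to zero for a fair dice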

Virtual Sampling from a Population

In statistics we are often interested in some measurable quantity (a variable) that describes some aspect of a set of items (a population). Variables can be anything, from the weight of unintended bycatch in the fishing industry to concentrations of radioactive material in soils. Variables come from measurements on a set of items called a population. The populations for the previously mentioned variables would be all fish that become unintended bycatch, or all soils in a given area, respectively.

In real life it is almost always impossible to know the values of all items in a population; in other cases collecting them would be unethical due to privacy concerns, or because sampling is disruptive or even destructive. So we resort to measuring only a subset of the population: a sample. Commonly, the mean - or average - is of interest because it summarizes the central tendency of the data (and for roughly symmetric distributions it splits the population into two halves, values smaller vs. larger than the mean).

A central problem in statistics is estimating the true population mean from a limited sample of the population. Since the mean extracted from a random sample is only an approximation, it comes with some level of uncertainty. Understanding this uncertainty is at the core of descriptive statistics. One prominent question is how large a sample has to be to estimate the true mean within a certain level of confidence. Policy and action are often based on these descriptive statistics and their uncertainty, which can have serious consequences.

# reading in the data with read.csv()
danish <- read.csv("guessAnonymousPolitiken.csv") # assumes the file is in the working directory; otherwise supply the full path

# first, a quick 1-2-3 overview of the data with summary() and head()
summary(danish)
##      Guess       
##  Min.   :  0.00  
##  1st Qu.: 17.00  
##  Median : 28.00  
##  Mean   : 32.41  
##  3rd Qu.: 45.00  
##  Max.   :100.00
head(danish)
##   Guess
## 1    55
## 2    41
## 3     7
## 4    37
## 5    10
## 6   100
# rule of thumb is roughly sqrt(n) bins -> sqrt(19196)
# cex.axis sets the magnification of the axis labels so all values are legible
hist(danish$Guess, breaks=sqrt(length(danish$Guess)), col="steelblue2", ylim=c(0,1500),
     xlab="Anonymized Guess", ylab="Frequency of Guess",
     main="Anonymized Guesses in Politiken Game", cex.axis=1)

# adding vertical lines for the mean (32.41) and the winning guess (2/3 of the mean)
danMu <- mean(danish$Guess) # the mean
danMed <- median(danish$Guess) # the median
danWin <- 2/3*danMu # the winning number
abline(v=c(danMu, danWin), col=c("red","green"), lwd=2, lty=2) 
legend("topright", legend=c("Mean", "Winning Guess"), col=c("red", "green"), lty=2, lwd=2, bty="n")

# bty -> box type

Exercise 7

What was the mean and what was the winning value (2/3 of the mean) - both rounded off to 1 decimal - in 2005?

# calculating mean in 2005, rounded to one decimal
danMurounded <- round(danMu,1)
# calculating the winning guess in 2005, rounded to one decimal
danWinrounded <- round(danWin,1)

# printing the answer
print(paste("On average, participants guessed", danMurounded, "in 2005, such that the winning guess 
            was", danWinrounded, ".")) # not really necessary (see below)
## [1] "On average, participants guessed 32.4 in 2005, such that the winning guess \n            was 21.6 ."
# storing the population size, since it's useful for future calculations
npop <- length(danish$Guess)

# how many winning guesses? (out of curiosity --> our class' mean guess)
sum(danish == 16)
## [1] 238
sum(danish == 16)/npop # the proportion of the total population
## [1] 0.01239842
# how many people wanted to screw over the smart ones? (also out of curiosity)
sum(danish == 100) 
## [1] 124
sum(danish == 100)/npop # again, the proportion of the total population
## [1] 0.006459679

Answer 7: On average, participants guessed 32.4 in 2005, which means that the winning guess was 21.6.


Virtual Sampling from a Population…continued

Remember that the game rules explicitly state that contestants must guess a number, and whoever guesses closest to 2/3 of the average of all submitted guesses wins. Interestingly, the mode of the back-calculated expected means tells us what the largest group of contestants believed the average guess of their competitors would be.
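The back-calculation rests on simple arithmetic: if a guess equals 2/3 of the mean a contestant expected, then that expected mean equals the guess divided by 2/3 (i.e. multiplied by 1.5). For example, a hypothetical guess of 22 implies an expected mean of 33:

# back-calculating the expected mean from a single (hypothetical) guess of 22
22 / (2/3) # equivalent to 22 * 1.5, giving 33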

# dividing each participant's guess by 2/3, since each contestant's guess was (in theory) the
#       mean they expected for all other guesses multiplied by 2/3
expectedMean <- danish$Guess/(2/3)

# plotting the distribution of expected means
hist(expectedMean, breaks=101,col="skyblue",xlab="Expected Mean", main="Contestants' Guess of the Mean")

# getting all unique values and sorting them (so a value doesn't appear more than once)
# unique function returns a vector, data frame or array like x but with duplicate elements/rows
#       removed
uniqv <- sort(unique(expectedMean))

# calculating the mode of distribution
#       1. table(expectedMean) counts how often each value appears
#       2. which.max() finds the position of the maximum count
# which.max gives the index (position) of the most frequent value (mode = value occurring most often)
#       - max(expectedMean) -> highest value in the distribution (NOT the mode)
#       - which.max(table(expectedMean)) -> position of the most frequently occurring value
modus <- which.max(table(expectedMean))
# table uses cross-classifying factors to build a contingency table of counts at each
#       combination of factor levels
# note: max(table(expectedMean)) would return the highest frequency count itself, whereas
#       which.max(table(expectedMean)) returns the position where that maximum occurs

# adding vertical lines to view the mode better
# mode extracted from sorted unique values based on the index of the most frequently occurring value!
abline(v=uniqv[modus], lwd=4,lty=2,col="red")
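Because this table/which.max recipe comes up often, it can be wrapped in a small helper function. A minimal sketch (getMode is our own name, not a built-in; R's built-in mode() reports an object's storage type, not the statistical mode):

# a small helper returning the most frequent value (statistical mode) of a vector
getMode <- function(x){
  counts <- table(x)                            # frequency of each unique value
  as.numeric(names(counts)[which.max(counts)])  # the value with the highest count
}
getMode(expectedMean) # should match uniqv[modus] from above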

Exercise 8 - Null Hypothesis Simulation

Use the random number generator to simulate a contest where all contestants simply choose a number at random. The values that can be randomly chosen should be a sequence from 0 to 100, as in the game. The sample size (e.g. ”number of throws” in the die example above) should be equal to the number of contestants in the Danish competition. Set the replace argument in the sample function to the correct and most appropriate state (TRUE or FALSE). What is the expected mean from this simulation? Use this random seed: set.seed(13092021) - and round your answer to 1 decimal.

# setting the seed and setting up sampling simulation based on Danish competition parameters
set.seed(13092021)
simulation <- data.frame("Guess"=sample(0:100, npop, replace=TRUE)) # remember to start from 0
# replace=TRUE because many contestants can choose the same number (and npop is far larger than
#       the 101 possible values, so sampling without replacement would not even be possible)
# the column is named "Guess" to mirror the real dataset and make later code easier to reuse

# plotting the distribution of guesses if every contestant simply picks a number at random
hist(simulation$Guess, breaks=sqrt(npop), col="skyblue", xlab="Random Guess",
     main="Contestants' Random Number Generation")

randomMean <- round(mean(simulation$Guess), 1)
# the distribution of guesses is roughly uniform, as expected when contestants pick truly at random

Answer 8: The expected mean from the above random simulation is 49.4.


Custom Functions in R

In programming, functions are sets of code that bundle several instructions meant to be used repeatedly, or instructions that are complex - or very long - and are therefore better contained in a separate unit that can be called when needed (keeping them inline reduces the legibility of your script).

A function usually accepts arguments and returns one or more values. In R, functions have the following basic form:

     myfunct <- function(arguments){
       instructions
     }

They may accept arguments, which are preset in an argument list, and then have a set of instructions in the body of the function that may use the supplied arguments to perform some calculation or action.

# hello world function, typical in introductory texts on programming
HelloWorld <- function(){
today <- date()
print("Hello World! - I awoke at exactly: ")
print(today)
}
# essentially a function, with no arguments, but with a set of instructions in the body
HelloWorld()
## [1] "Hello World! - I awoke at exactly: "
## [1] "Sat Sep 13 16:48:31 2025"

Sample Size and Accuracy

Researchers must decide on a sample size whenever sampling from a population. It can be set according to the desired statistical power, the known variation in the population, or - more often than not - the number of samples you can afford in time and resources. In general, the more samples taken, the closer the estimated mean gets to the true population mean.

# first, creating a sampling function with the following arguments:
#       - n = sample size
#       - dataset = dataset to sample from
#       - colname = column of interest
# then, creating a virtual sample from the dataset -> replace=FALSE since we don't want to sample
#       any participant more than once in this case
simSampler <- function(n=NA, dataset=danish, colname="Guess"){
  samples <- sample(dataset[, colname], size=n, replace=FALSE)
  return(samples)
}
# last, returning the random sample generated through sampling function -> basically sampling from
#       a set of guesses (sample()) like 1:6 earlier for dice simulation

# 3 arguments! -> "dataset" and "colname" have default values, "n" does not
# argument "colname" is included so a specific column of the stored dataset can be used, making
#       the function general and applicable to future datasets with different column names
# "n" defaults to NA to force the user to supply a value -> otherwise sample() throws an error!

# trying out an example
# first, setting sample size to n=1
set.seed(13092021)
simSampler(n=1, dataset=danish)
## [1] 8
# not supplying a dataset results in default (which is object "danish")
set.seed(13092021)
simSampler(n=1) # same answer as above
## [1] 8
# a random sample of any size may be simulated (up to N=19196) from the population of guesses

# but the most interesting quantity for this game is the estimated true mean
# therefore a mean calculation step is added inside the function

simSamplerMean <- function(n, dataset=danish, colname="Guess"){
  samples <- sample(dataset[ , colname], size=n, replace=FALSE)
  average <- mean(samples)
  return(average)
}

# resetting the random seed and trying a range of sample sizes, with default values for dataset and colname
set.seed(13092021)
simSamplerMean(n=1)
## [1] 8
simSamplerMean(n=2)
## [1] 39
simSamplerMean(n=4)
## [1] 33
simSamplerMean(n=16)
## [1] 39.5625
simSamplerMean(n=111)
## [1] 30.31532
simSamplerMean(n=1111)
## [1] 32.94779

Systematic Simulation through Range of Sample Sizes

According to the Law of Large Numbers, the greater the sample size, the closer the sample mean gets to the true population mean! This is also evident from sampling in R, as shown below.

# using sapply -> takes 2 main arguments: sapply(argument 1, argument 2)
#       - argument 1 -> a list or vector of values that sapply supplies, one at a time, to argument 2
#       - argument 2 -> the function to apply
# a tiny demonstration first, before using it to systematically simulate a range of sample sizes
set.seed(13092021)
result <- sapply(1:3, print)
## [1] 1
## [1] 2
## [1] 3
# using sample size 1:100 to conduct sampling simulation
# sapply -> "simplified apply"; applies a function to each element of a vector or list
sampleSizes <- 1:100
result <- sapply(sampleSizes, simSamplerMean) # applies simSamplerMean function to sampleSizes vector
plot(sampleSizes, result, type="b", col="orangered", lwd=2, xlab="Sample Size", ylab="Sample Mean")

# adding ablines to better visualize convergence toward true population mean
abline(h=mean(result), col="blue", lty=4, lwd=2)
abline(h=c(mean(result)+1, mean(result)-1), col="black", lty=4, lwd=2)

# evidently, it’s indeed true that the greater the sample size the closer we get to the true mean!
# this is known as the Law of Large Numbers
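One useful detail not shown above: any additional named arguments given to sapply() are passed straight on to the function being applied, so the defaults of simSamplerMean() can be overridden without writing a wrapper. A brief sketch (the explicit values below equal the defaults, so the result should be unchanged):

# extra named arguments after the function name are forwarded to simSamplerMean()
set.seed(13092021)
resultExplicit <- sapply(sampleSizes, simSamplerMean, dataset=danish, colname="Guess")
identical(result, resultExplicit) # should be TRUE, since the explicit arguments equal the defaults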

Exercise 9

What would happen if you extended the vector of sample sizes saved in the object ”sampleSizes” closer and closer to 19196? Adapt the code to go beyond an N of 100 and re-plot the figure to find your answer.

# resetting seed for reproducibility
set.seed(13092021)

# using sample sizes from 1 up to npop to conduct the sampling simulation
sampleSizes <- 1:npop # increasing n to size of population
result <- sapply(sampleSizes, simSamplerMean) # applies simSamplerMean function to sampleSizes vector
plot(sampleSizes, result, type="b", col="lightpink", lwd=2, xlab="Sample Size", ylab="Sample Mean", 
     main="Law of Large Numbers in Action")

# adding ablines to better visualize convergence toward true population mean
abline(h=mean(result), col="red", lty=4, lwd=2)
abline(h=c(mean(result)+1, mean(result)-1), col="black", lty=4, lwd=2)

# LLN says the error in estimating the mean declines with 1/sqrt(n)...let's see if this is true!
# first, calculating the population standard deviation
sdPop <- sd(danish$Guess) # the standard error at sample size n is sdPop/sqrt(n); n varies along the x-axis, so the division happens inside curve()
# adding in the standard error curves (make sure to put add=TRUE or curve() creates a separate plot)
# make sure to add in the mean!
curve(danMu+3*sdPop/sqrt(x), lwd=2, col="darkblue", add=TRUE) # upper boundary of the data cluster; x is the sample size
curve(danMu-3*sdPop/sqrt(x), lwd=2, col="darkblue", add=TRUE) # lower boundary of the data cluster

# 3*standard error comes from the empirical "three-sigma rule" for normal distributions:
#       ~ 68% of sample means fall within ± 1 standard error of the true mean
#       ~ 95% of sample means fall within ± 2 standard errors
#       ~ 99.7% of sample means fall within ± 3 standard errors

Answer 9: The Law of Large Numbers! As sample size increases, the error in estimating the mean declines with 1/sqrt(n).

Recall your theory! This is the Law of Large Numbers (LLN) in action. The error in estimating the mean declines with 1/sqrt(n).
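To make the 1/sqrt(n) claim concrete, the theoretical standard error of the mean (the population standard deviation divided by sqrt(n)) can be tabulated for a few sample sizes. A minimal numerical illustration using the danish guesses:

# theoretical standard error of the mean, sd/sqrt(n), for a handful of sample sizes
nValues <- c(10, 100, 1000, 10000)
data.frame(n=nValues, se=sd(danish$Guess)/sqrt(nValues)) # each 100-fold increase in n shrinks the error 10-fold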


Simulating Repeated Sampling from Population

How many people do we need to persuade to get the correct answer, ±1, 95% of the time? That is to say, if we were to repeat our scheme 100 times, we would get the correct answer - give or take 1 - in 95 out of 100 cases. With this 95% confidence level, the odds of submitting an answer that is NOT within ±1 of the winning answer are 1 in 20 - these are pretty good odds!

To do this, simply repeat the simulation - at a given sample size - thousands of times, then calculate how often the result falls within the acceptable range.
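That "how often" question can be answered directly by computing the proportion of simulated sample means landing within ±1 of the true mean. A brief sketch reusing simSamplerMean() and danMu from above (the exact proportion depends on the random seed and the chosen sample size):

# proportion of 1000 simulated sample means (n=80) falling within +-1 of the true mean
set.seed(13092021)
within1 <- abs(replicate(1000, simSamplerMean(n=80)) - danMu) <= 1
mean(within1) # the fraction of simulations giving the "correct answer, give or take 1"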

set.seed(13092021)
nsimulations <- 1000 # number of simulations
results80 <- replicate(nsimulations, simSamplerMean(n=80)) # conduct the simulation
head(results80)
## [1] 32.9000 29.1375 31.2250 36.2750 30.2875 33.7750
# calculating mean of all the means
results80Mu <- mean(results80)
results80Med <- median(results80)

results10 <- replicate(nsimulations, simSamplerMean(n=10))
head(results10)
## [1] 41.1 31.4 29.9 19.5 27.5 20.4
# calculating mean of all the means
results10Mu <- mean(results10)
results10Med <- median(results10)

# now plotting everything next to each other for comparison using multi-panel graph
par(mfrow=c(3,1)) # 3x1 panel figure -> 3 rows, 1 column

# plotting histograms of sample means with marker lines

# n = npop
hist(danish$Guess, nclass=50, col="skyblue", xlim=c(0,100), xlab="Guess", main="Contestant Guesses (npop)")
abline(v=c(danMu, danMed), col=c("red", "darkred"), lty=4, lwd=2)
legend("topright",legend=c("Mean","Median"), col=c("red","darkred"), lty=2, lwd=2, bty="n")

# n = 80
hist(results80, nclass=50, col="skyblue", xlim=c(0,100), ylim=c(0,100), xlab="Sample Mean", 
     main="n = 80")
abline(v=c(results80Mu, results80Med), col=c("red", "darkred"), lty=4, lwd=2)
legend("topright",legend=c("Mean","Median"), col=c("red","darkred"), lty=2, lwd=2, bty="n")

# n = 10
hist(results10, nclass=50, col="skyblue", xlim=c(0,100), ylim=c(0,100), xlab="Sample Mean", 
     main="n = 10")
abline(v=c(results10Mu, results10Med), col=c("red", "darkred"), lty=4, lwd=2)
legend("topright",legend=c("Mean","Median"), col=c("red","darkred"), lty=2, lwd=2, bty="n")

In the above plot we see several things at once:
* First, the distribution of sampled means is much narrower than the full population’s distribution.
* Second, the distribution of the sampled means is centered around the true mean. This tells us that if we take a sample from the Danish participants’ guesses - the population - our sample mean will tend to be close to the true mean of the population.
* Third, the simulation tells us that the more actual people we persuade to tell us their guess, the narrower the distribution of simulated sample means will become. Hence, the less uncertain our estimate of the true mean will be. Our example follows the Law of Large Numbers!
* Finally, the distribution of simulated means does NOT follow the distribution of the population - instead it takes a symmetric, bell-shaped form (a normal curve - the Central Limit Theorem at work!).
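This last point can be checked visually by overlaying a normal curve - with the same mean and standard deviation as the simulated means - on a density-scale histogram. A minimal sketch using the results80 object from above:

# resetting to a single-panel layout for this standalone check
par(mfrow=c(1,1))
# histogram on the density scale (freq=FALSE) so the normal curve can be overlaid directly
hist(results80, nclass=50, freq=FALSE, col="skyblue", xlab="Sample Mean",
     main="Sample Means (n = 80) vs. Normal Curve")
curve(dnorm(x, mean=mean(results80), sd=sd(results80)), col="red", lwd=2, add=TRUE)
# the simulated sampling distribution hugs the bell-shaped normal curve closely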

Exercise 10

  1. Calculate the mean and the median of the population.
  2. Next, do the same for the simulated “sampling distributions” at sample sizes 10 and 80.
  3. Finally, calculate the difference between the mean and median for each case (you will have 3 differences). What is happening to this difference?
set.seed(13092021) # for reproducibility 
nsimulations <- 1000

results10 <- replicate(nsimulations, simSamplerMean(n=10))
results80 <- replicate(nsimulations, simSamplerMean(n=80))

# calculating mean and median for actual population and simulated samples
# actual population
differencepop <- danMu-danMed
# n=80
difference80 <- results80Mu-results80Med
# n=10
difference10 <- results10Mu-results10Med

# differences comparison
differencepop
## [1] 4.406804
difference80
## [1] 0.0749
difference10
## [1] 0.4973
# the mean and median converge closer together with greater sample size! difference gets smaller

# expressing this new dataset in a data.frame
muMed <- data.frame("Mean"=c("npop"=danMu,"n=80"=results80Mu,"n=10"=results10Mu), 
           "Median"=c("npop"=danMed,"n=80"=results80Med,"n=10"=results10Med))
muMed$Mean-muMed$Median
## [1] 4.406804 0.074900 0.497300

95% Confidence Intervals and the Standard Normal

It does not appear that a sample size of 80 is sufficient to ensure that we will get the correct mean ±1. The “quantile” function can be used to calculate any quantile, or percentile. In statistics, quantiles and percentiles are cut points that divide a population, or a set of samples, into sections containing equal fractions of the whole. The 1st percentile, for example, is the cut-off between the smallest 1% of values and the largest 99% of values in a sample.
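As a toy illustration (using the made-up vector 1:100 rather than the guesses), quantile() returns such cut-off points directly:

# the 1st, 50th and 99th percentiles of the numbers 1 to 100
quantile(1:100, prob=c(0.01, 0.5, 0.99)) # the 50% value is the median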

dev.off()
## null device 
##           1
# calculating 2.5% percentile and 97.5% percentile gives us interval in which 95% of simulated
#       sample means fall
quantile(danish$Guess, prob=c(0.025,0.975))
##  2.5% 97.5% 
##     1    83
quantile(results10, prob=c(0.025,0.975))
##    2.5%   97.5% 
## 20.3975 47.2000
quantile(results80, prob=c(0.025,0.975))
##     2.5%    97.5% 
## 27.77469 37.58812
# this information can be easily displayed in 3x1 multi-panel visualization
par(mfrow=c(3,1)) # 3x1 panel

# actual population
hist(danish$Guess, breaks=50, col="skyblue", xlab="Guess", main="Population")
abline(v=quantile(danish$Guess, prob=c(0.025,0.975)), lty=2, col="red", lwd=2)
# n=10
hist(results10, breaks=50, col="skyblue", xlab="Sample Mean", main="n=10", xlim=c(0,100))
abline(v=quantile(results10, prob=c(0.025,0.975)), lty=2, col="red", lwd=2)
# n=80
hist(results80, breaks=50, col="skyblue", xlab="Sample Mean", main="n=80", xlim=c(0,100))
abline(v=quantile(results80, prob=c(0.025,0.975)), lty=2, col="red", lwd=2)
# step 1. plotting normal curve with z scores
curve(dnorm(x, mean=0, sd=1), xlim=c(-5,5), main="The Standard Normal Distribution", xlab="z", 
      ylab="Probability Density")

# step 2. finding median, mean and mode of standard normal
qnorm(0.5, mean=0, sd=1) # gives 0 because median=0 in standard normal
## [1] 0
# in standard normal, mean = median = mode because distribution is perfectly symmetrical

# step 3. adding in line marker for median, mean and mode
abline(v=0, lty=3, lwd=2, col="red") # lty=3 gives a dotted line

# step 4. between which z-scores are 95% of the data if the data is normally distributed?
qnorm(c(0.025,0.975), mean=0, sd=1)
## [1] -1.959964  1.959964
abline(v=c(-1.959964,1.959964), lty=3, lwd=2, col="green")

# step 5. adding in a legend
legend("topleft", legend=c("median = mean = mode", "95% confidence level"), col=c("red","green"), 
       lty=3, lwd=2, bty="n")

Exercise 11

You now have enough information to calculate the sample size needed to ensure that we get the correct mean, ±1, 95% of the time. With the R code supplied above, find the needed sample size, to the closest 100 people.

# setting seed for reproducibility
set.seed(13092021)

# calculating the difference between the 2.5% and 97.5% quantiles using the diff function
# trying n=1600
diff(quantile(replicate(nsimulations, simSamplerMean(n=1600)), prob=c(0.025,0.975)))
##    97.5% 
## 2.002516
# this gives the width of the confidence interval! -> it must be at most 2 (i.e. ±1), so continue

# trying larger sample sizes: n=1800, then n=1700
set.seed(13092021)
diff(quantile(replicate(nsimulations, simSamplerMean(n=1800)), prob=c(0.025,0.975)))
##    97.5% 
## 1.780625
# it seems sample size needed is between 1600 < n < 1800
set.seed(13092021)
diff(quantile(replicate(nsimulations, simSamplerMean(n=1700)), prob=c(0.025,0.975)))
##    97.5% 
## 1.872206
# a sample size of roughly n=1700 is needed (at n=1600 the interval is still just wider than 2)
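The simulated answer can be cross-checked with a back-of-the-envelope calculation: if the sample means are approximately normal, the 95% interval has half-width 1.96 * sd / sqrt(n), so setting that half-width to 1 and solving for n gives n = (1.96 * sd)^2. A minimal sketch (this ignores the finite-population correction that applies when sampling such a large share of the 19196 guesses without replacement, which makes the true requirement somewhat smaller):

# analytic approximation of the sample size needed for a +-1 margin at 95% confidence
margin  <- 1                 # desired half-width of the 95% interval
zCrit   <- qnorm(0.975)      # critical z-value, ~1.96
nNeeded <- (zCrit * sd(danish$Guess) / margin)^2
ceiling(nNeeded)             # should land in the same ballpark as the simulation above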