# making a function for a set of spins for a single player
# bet = argument that sets the size of our bet in Euros
# spins = number of spins
# winprob = probability of winning (must be between 0 and 1)
# payoff = the payoff of a winning bet
roulette <- function(bet, spins=400, winprob=1/38, payoff=35){
# simulating the spins and bets (if win: +payoff*bet, if lose: -bet; e.g. +5 x 35 vs -5)
winnings <- sample(c(payoff*bet,-bet), spins, prob=c(winprob,1-winprob),
replace=TRUE)
# calculating the total winnings!
sum(winnings)
} # end of function
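# quick sanity check of the function (the seed here is arbitrary, chosen only for illustration)
set.seed(1)
roulette(bet = 5) # net winnings of a single player after the default 400 spins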
# number of players (n=10,000)
nplayers <- 10000
# size of the bet (5 euros per spin)
bet <- 5
# using sapply to simulate 400 roulette spins per player
# each player's bet is always 5 euros, so repeating this 10000 times
set.seed(13092021) # for reproducibility
?rep # help page: rep(x, n) replicates the value x, n times
totals <- sapply(rep(bet,nplayers), roulette) # applying roulette function to each player's 400 bets
head(totals, 15) # first 15 resulting winnings -> all but four are negative!
## [1] -740 340 -740 -380 700 -740 -20 -200 340 -740 -920 340 -20 -200 -20
# plotting winnings distribution
# first, making a red-green color palette
disCol <- colorRampPalette(c('#a50026','#d73027','#f46d43','#fdae61','#fee08b','#d9ef8b',
                             '#66bd63','#a6d96a','#1a9850','#006837'))
# second, plotting the total winnings
hist(totals, breaks=40, col=disCol(40), main="Distribution of Total Winnings", xlab="Winnings")
# adding net breakeven point (where winnings = 0)
abline(v=0, lty=2, lwd=2)
# last, adding arrows and text for greater clarity
# area of profit
arrows(100,1100,2000,1100, col="forestgreen", lwd=1)
text(1200,1200, expression(bolditalic(Profit)), col="forestgreen")
# area of loss
arrows(-100,1100,-2000,1100, col="red", lwd=1)
text(-1200,1200, expression(bolditalic(Loss)), col="red")
# calculating average winnings
mean(totals)
## [1] -109.028
Based on the graph above, the distribution is interestingly not centered around zero; rather, it sits clearly below zero! This suggests that Roulette is rigged and unfairly favors the casino. The average winnings of 10,000 players, each playing 400 spins and betting 5 euros at a time, center around mean(totals) = -€109.03. This means that one can expect to lose approximately €109 after roughly 10 hours of playing Roulette (about 40 spins per hour). Therefore, the expected income for the Casino per player is roughly 10.9 Euros per hour (109/10), which boils down to an expected loss of about 27 cents per spin, or 5.4 cents per euro wagered.
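The per-spin expectation can also be checked directly from the bet parameters, without any simulation. A minimal sketch, simply plugging the winning probability and payoff from the roulette() function into the expected-value formula:

# expected value of a single 5-euro bet on American roulette (38 pockets)
evSpin <- (1/38)*(35*5) + (37/38)*(-5)
evSpin # about -0.26 euros per spin, i.e. roughly 5.3 cents lost per euro wagered
evSpin*400 # about -105.26 euros expected after 400 spins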
However, what about the variation in winnings? What is the maximum loss we simulated and the maximum win? What fraction of people actually made a profit? These questions concern the spread of the distribution!
# first, calculating the range
minMax <- range(totals) # saves vector of minimum and maximum winnings as object
minMax[2] - minMax[1] # maximum minus minimum gives the width of the range
## [1] 4320
# in Roulette simulation, distance between biggest loser and luckiest player was €4,320
# second, calculating standard deviation
sd(totals)
## [1] 578.0215
# in the Roulette simulation, individual players' final winnings deviated from the expected
# -€109.028 loss by €578.022 on average (one standard deviation)
# this implies that when you go and play Roulette for hours - betting exactly as described above - a
# reasonable expectation is to lose about €109, give or take €578
# a player's winnings will **probably** lie between a loss of €1,820 and a profit of €2,500
# (the most extreme outcomes observed in this simulation)
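# a rough check of the "give or take" claim (a sketch, not part of the original exercise):
# what fraction of players ended up within two standard deviations of the mean?
lims <- mean(totals) + c(-2, 2)*sd(totals)
lims # the band of "typical" outcomes
mean(totals >= lims[1] & totals <= lims[2]) # share of players inside this band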
Look at the true population values of the mean, standard deviation and of the range (given in Figure 8.3). Where did we make the largest error compared to the true population values? And why do you think this is the case? Save your answer.
Extra information: With robust statistics, statisticians mean statistics that do not change - or change little - when you take another sample or increase your sample size. So here you should consider how these statistics are calculated and which one you would expect to change the least with a slightly different sample.
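To make the robustness idea concrete before answering, one option is to rerun the simulator on a few fresh, smaller samples and watch which statistic jumps around the most. A minimal sketch (the seeds and the sample size of 1,000 players are arbitrary choices for illustration):

# comparing mean, standard deviation and range across five fresh samples
sapply(1:5, function(i){
  set.seed(i)
  tot <- sapply(rep(bet, 1000), roulette)
  c(mean = mean(tot), sd = sd(tot), range = diff(range(tot)))
})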
Side note: in R Markdown, a “!” in front of a link - as in ![caption](path/to/image.png) - tells the renderer “this is an image” and embeds the file in the document.
# first, running the Roulette simulator
# making a function for a set of spins for a single player
# bet = argument that sets the size of our bet in Euros
# spins = number of spins
# winprob = probability of winning (must be between 0 and 1)
# payoff = the payoff of a winning bet
roulette <- function(bet, spins=400, winprob=1/38, payoff=35){
winnings <- sample(c(payoff*bet,-bet), spins, prob=c(winprob,1-winprob),
replace=TRUE)
sum(winnings)
}
# number of players (n=10,000)
nplayers <- 10000
# size of the bet (5 euros per spin)
bet <- 5
# using sapply to simulate 400 roulette spins per player, repeating this 10000 times
set.seed(13092021) # for reproducibility
totals <- sapply(rep(bet,nplayers), roulette)
# then, demonstrating how to calculate the exact population answer from a binomial distribution
# DO NOT NEED TO KNOW THIS!
# creating object with the total win amount for every possible number of winning spins (k = 0..400)
wins <- (0:400)*5*35
# creating object with the matching total losses (if k spins are won, 400-k are lost)
losses <- (400:0)*(-5)
# calculating net balance for all combinations
net <- losses + wins
# calculating likelihood of winning using binomial distribution density function
pwins <- dbinom(0:400, size=400, prob=1/38) # binomial distribution (number of successes in n trials)
?dbinom
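# sanity check: the probabilities over all 401 possible outcomes should sum to 1
sum(pwins)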
# calculating mean and standard deviation
muTrue <- round(sum(pwins*net), 3)
sdTrue <- round(sqrt(sum(pwins*(net-muTrue)^2)), 3)
# calculating range (restricted to outcomes with non-zero density)
tmp <- range(net[pwins > 0])
rangeTrue <- tmp[2]-tmp[1]
# answering the question
# the range
(rangeTrue-diff(range(totals)))/rangeTrue
## [1] 0.9114391
# the standard deviation
(sdTrue-sd(totals))/sdTrue
## [1] -0.00305333
# the mean
(muTrue-mean(totals))/muTrue
## [1] -0.03576755
The range shows the largest relative error because it depends entirely on extreme values - the minimum and maximum observations. With only 10,000 simulated players, we’re unlikely to capture the true population extremes, making range highly sensitive to sampling variation. Here, if you get unlucky and don’t sample someone who wins really big or loses everything, your range will be completely wrong.
The mean and standard deviation are much more robust because they use information from all data points rather than just the extremes.
- The mean is particularly stable due to the Law of Large Numbers: as sample size increases, sample means converge toward the true population mean. Here, even if a few numbers are off, thousands of other data points balance them out. This is why taking bigger samples gives you better averages.
- The standard deviation is almost as robust as the mean, since it is calculated from deviations around the sample mean. However, it can show slightly more variation because it involves squared deviations, which amplify the effect of sampling fluctuations. Since spread depends partly on extreme values, it is more affected by sampling luck than the mean, but much less than the range.
In summary: Range (least robust) < Standard Deviation < Mean (most robust). This demonstrates why statisticians prefer measures that incorporate all data points rather than those dependent on extreme values when making inferences about populations.
Think of it this way: if you flip a coin 10 times vs 1000 times, which gives you a result closer to 50% heads? The 1000 flips, because more data = more reliable results. Range ignores most of your data, mean uses all of it, standard deviation uses all of it but is sensitive to extremes.
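The coin-flip intuition is easy to simulate as well; a throwaway sketch (the seed is arbitrary):

# proportion of heads in 10 flips versus 1000 flips
set.seed(42)
mean(rbinom(10, 1, 0.5)) # can easily land far from 0.5
mean(rbinom(1000, 1, 0.5)) # typically very close to 0.5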
Our simulation, using random numbers, was one way to sample 10,000 times from this theoretical distribution. Another way would be to actually play Roulette and record our winnings! With a large enough sample, this will start to resemble the distribution of the population. Once we have the distribution, we can estimate the probability of observing certain values. Often, this information is far more valuable than the mean alone.
# what is the likelihood of net profit?
# first, counting instances of having winnings greater than 0 (net win)
# then, dividing it by the count of all observed winnings
probProfit <- sum(totals>0)/length(totals)
probProfit # -> only 36.05% chance!
## [1] 0.3605
# likelihood of walking out of the casino with a profit is well below 50% (only a 36.05% chance,
# close to the true population proportion of roughly 0.36)...playing Roulette is unlikely to make one rich!
# what is the likelihood of a modest profit of >= €200?
probProfitModest<-sum(totals>=200)/length(totals)
probProfitModest # -> only 25.51% chance! even lower than breaking even (true population proportion: 0.2587)
## [1] 0.2551
# what would the luckiest 1% bring home?
quantile(totals, prob=0.99) # -> €1,420
## 99%
## 1420
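For comparison, the exact population counterparts of these probabilities can be read straight off the binomial distribution from the earlier chunk, reusing the net and pwins objects (a sketch, not part of the original exercise):

# exact population probabilities from the binomial distribution
sum(pwins[net > 0]) # true probability of any profit (about 0.36)
sum(pwins[net >= 200]) # true probability of at least 200 euros profit (0.2587)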
As shown above, the 10,000 samples collectively start to reveal the shape of the population’s distribution. This is useful in most situations, as we often want to calculate the fraction of values smaller or larger than some threshold, and a large enough sample lets us estimate these statistics with reasonable precision. Still, even with a very large sample (10,000 players), some uncertainty remains: our estimated mean is slightly off from the true mean.
One possibility is to look at the average error made. If we repeated the Roulette simulation many times, and every time we calculated a mean, then we could calculate this average error by computing the mean distance from our sample means to our true mean. This quantity is the standard error:
- Standard error: \[SE = \frac{\sigma}{\sqrt{n}}\]
- σ is the standard deviation, usually estimated from a sample
- n is the sample size
The actual sampling distribution for the previous Roulette simulation can also be simulated in R, without the need for a formula: by rerunning the simulation 500 times and calculating the mean each time, the average error can be estimated directly!
# calculating sample size
n <- length(totals)
# calculating standard error
se <- sd(totals)/sqrt(n)
se # display value of se
## [1] 5.780215
# estimated standard error is 5.78, which means our estimate of the population mean (-€109.03) is
# within one standard error of the true mean (-€105.26)
set.seed(13092021)
# repeatedly doing the simulation and calculating the mean
repeated <- replicate(500, mean(sapply(rep(bet,nplayers), roulette)))
# the standard error is simply the standard deviation of this distribution!
# plotting it for visuals
hist(repeated,main="Sampling distribution", xlab="repeatedly calculated mean")
abline(v=-105.26, lty=2, lwd=2) # true population mean
# the mean after 500 repeats of the simulation
muRepeated <- mean(repeated)
muRepeated
## [1] -105.1336
# the standard error estimated from the sampling distribution
seRepeated <- sd(repeated)
seRepeated
## [1] 5.9008
Conclusion: We see three things: 1) the sampling distribution is again symmetrical, 2) the mean after 500 repeated samples (-105.13) is much closer to the true mean (-105.26), and 3) the standard error estimated from the sampling distribution (5.90) is very close to the one calculated with the standard error formula (5.78).
Use the function “quantile” to calculate the 2.5% and 97.5% percentiles of the sampling distribution (the distribution of means saved in the object repeated). Next, estimate how many times you need to add, or subtract, the standard error from the true mean to reach these two values. Round your answers to the nearest 1 (no decimals). Save your answer.
# calculating 2.5% and 97.5% percentiles of the sampling distribution
percentiles <- quantile(repeated, prob=c(0.025,0.975))
per025 <- percentiles[1] # the 2.5% percentile
per975 <- percentiles[2] # the 97.5% percentile
# calculating how many standard errors these percentiles lie from the mean of the sampling distribution
x <- (per975 - muRepeated)/seRepeated
x
## 97.5%
## 1.90811
y <- (per025 - muRepeated)/(-seRepeated)
y
## 2.5%
## 1.726329
# rounding values to nearest integer
round(x, 0)
## 97.5%
## 2
round(y, 0)
## 2.5%
## 2
# an interval of roughly 2 standard errors on either side of the mean contains ~95% of the sample means
From the above, it is clear that an interval of approximately 2 times the standard deviation of the sampling distribution (i.e. 2 standard errors) around the mean contains about 95% of the observed sample means!
Conclusion:
- Law of Large Numbers: With 500 repeated samples, our mean (-105.13) moved much closer to the theoretical population mean (-105.26) than any single sample mean, demonstrating that more samples yield more accurate estimates.
- Central Limit Theorem: The sampling distribution of the 500 repeated sample means formed a symmetrical, normal-looking distribution regardless of the original roulette data’s non-normal distribution, confirming that sample means become normally distributed when sample sizes are large enough.
- 2 Standard Errors Rule: We found that approximately 2 standard errors capture 95% of the sample means, which is no coincidence but the foundation of confidence intervals. The formula SE = σ/√n accurately predicted the variability we actually observed, showing how uncertainty in sample estimates follows predictable patterns.
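As a closing illustration, the 2-standard-errors rule gives a quick 95% confidence interval for the mean winnings from the single sample alone. A minimal sketch, assuming the sampling distribution is approximately normal (as the histogram above suggests):

# approximate 95% confidence interval for the mean winnings, using mean +/- 2*SE
ci <- mean(totals) + c(-2, 2)*se
ci # this interval should cover the true mean of about -105.26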