# making a function for a set of spins for a single player
# bet = argument that sets the size of our bet in Euros
# spins = number of spins
# winprob = probability of winning (must be between 0 and 1)
# payoff = the payoff of a winning bet
roulette <- function(bet, spins=400, winprob=1/38, payoff=35){
# simulating the spins and bets (if win: +payoff*bet, if lose: -bet; e.g. +5 x 35 vs -5)
winnings <- sample(c(payoff*bet,-bet), spins, prob=c(winprob,1-winprob),
replace=TRUE)
# calculating the total winnings!
sum(winnings)
} # end of function
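# quick sanity check of the function (the seed here is arbitrary, chosen only for illustration)
set.seed(1)
roulette(bet = 5) # net winnings of a single player after the default 400 spins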
# number of players (n=10,000)
nplayers <- 10000
# size of the bet (5 euros per spin)
bet <- 5
# using sapply to simulate 400 roulette spins per player
# each player's bet is always 5 euros, so repeating this 10000 times
set.seed(13092021) # for reproducibility
?rep # help page: rep(x, n) replicates the value x, n times
totals <- sapply(rep(bet,nplayers), roulette) # applying roulette function to each player's 400 bets
head(totals, 15) # first 15 resulting winnings -> all but four are negative!
## [1] -740 340 -740 -380 700 -740 -20 -200 340 -740 -920 340 -20 -200 -20
# plotting winnings distribution
# first, making a red-green color palette
disCol <- colorRampPalette(c('#a50026','#d73027','#f46d43','#fdae61','#fee08b','#d9ef8b',
                             '#66bd63','#a6d96a','#1a9850','#006837'))
# second, plotting the total winnings
hist(totals, breaks=40, col=disCol(40), main="Distribution of Total Winnings", xlab="Winnings")
# adding net breakeven point (where winnings = 0)
abline(v=0, lty=2, lwd=2)
# last, adding arrows and text for greater clarity
# area of profit
arrows(100,1100,2000,1100, col="forestgreen", lwd=1)
text(1200,1200, expression(bolditalic(Profit)), col="forestgreen")
# area of loss
arrows(-100,1100,-2000,1100, col="red", lwd=1)
text(-1200,1200, expression(bolditalic(Loss)), col="red")
# calculating average winnings
mean(totals)
## [1] -109.028
Based on the graph above, the distribution is interestingly not centered around zero; rather, it sits clearly below zero! This suggests that Roulette is rigged and unfairly favors the casino. The average winnings of 10,000 players, each playing 400 spins and betting 5 euros at a time, center around mean(totals) = -€109.03. This means that one can expect to lose approximately €109 after roughly 10 hours of playing Roulette (about 40 spins per hour). Therefore, the expected income for the Casino per player is roughly 10.9 Euros per hour (109/10), which boils down to an expected loss of about 27 cents per spin, or 5.4 cents per euro wagered.
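The per-spin expectation can also be checked directly from the bet parameters, without any simulation. A minimal sketch, simply plugging the winning probability and payoff from the roulette() function into the expected-value formula:

# expected value of a single 5-euro bet on American roulette (38 pockets)
evSpin <- (1/38)*(35*5) + (37/38)*(-5)
evSpin # about -0.26 euros per spin, i.e. roughly 5.3 cents lost per euro wagered
evSpin*400 # about -105.26 euros expected after 400 spins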
However, what about the variation in winnings? What is the maximum loss we simulated and the maximum win? What fraction of people actually made a profit? These questions concern the spread of the distribution!
# first, calculating the range
minMax <- range(totals) # saves vector of minimum and maximum winnings as object
minMax[2] - minMax[1] # maximum minus minimum gives the width of the range
## [1] 4320
# in Roulette simulation, distance between biggest loser and luckiest player was €4,320
# second, calculating standard deviation
sd(totals)
## [1] 578.0215
# in the Roulette simulation, individual players' final winnings deviated from the expected
# -€109.028 loss by €578.022 on average (one standard deviation)
# this implies that when you go and play Roulette for hours - betting exactly as described above - a
# reasonable expectation is to lose about €109, give or take €578
# a player's winnings will **probably** lie between a loss of €1,820 and a profit of €2,500
# (the most extreme outcomes observed in this simulation)
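# a rough check of the "give or take" claim (a sketch, not part of the original exercise):
# what fraction of players ended up within two standard deviations of the mean?
lims <- mean(totals) + c(-2, 2)*sd(totals)
lims # the band of "typical" outcomes
mean(totals >= lims[1] & totals <= lims[2]) # share of players inside this band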
Look at the true population values of the mean, standard deviation and of the range (given in Figure 8.3). Where did we make the largest error compared to the true population values? And why do you think this is the case? Save your answer.
Extra information: With robust statistics, statisticians mean statistics that do not change - or change little - when you take another sample or increase your sample size. So here you should consider how these statistics are calculated and which one you would expect to change the least with a slightly different sample.
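To make the robustness idea concrete before answering, one option is to rerun the simulator on a few fresh, smaller samples and watch which statistic jumps around the most. A minimal sketch (the seeds and the sample size of 1,000 players are arbitrary choices for illustration):

# comparing mean, standard deviation and range across five fresh samples
sapply(1:5, function(i){
  set.seed(i)
  tot <- sapply(rep(bet, 1000), roulette)
  c(mean = mean(tot), sd = sd(tot), range = diff(range(tot)))
})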
Side note: in R Markdown, a “!” in front of a link - as in ![caption](path/to/image.png) - tells the renderer “this is an image” and embeds the file in the document.
# first, running the Roulette simulator
# making a function for a set of spins for a single player
# bet = argument that sets the size of our bet in Euros
# spins = number of spins
# winprob = probability of winning (must be between 0 and 1)
# payoff = the payoff of a winning bet
roulette <- function(bet, spins=400, winprob=1/38, payoff=35){
winnings <- sample(c(payoff*bet,-bet), spins, prob=c(winprob,1-winprob),
replace=TRUE)
sum(winnings)
}
# number of players (n=10,000)
nplayers <- 10000
# size of the bet (5 euros per spin)
bet <- 5
# using sapply to simulate 400 roulette spins per player, repeating this 10000 times
set.seed(13092021) # for reproducibility
totals <- sapply(rep(bet,nplayers), roulette)
# then, demonstrating how to calculate the exact population answer from a binomial distribution
# DO NOT NEED TO KNOW THIS!
# creating object with the total win amount for every possible number of winning spins (k = 0..400)
wins <- (0:400)*5*35
# creating object with the matching total losses (if k spins are won, 400-k are lost)
losses <- (400:0)*(-5)
# calculating net balance for all combinations
net <- losses + wins
# calculating likelihood of winning using binomial distribution density function
pwins <- dbinom(0:400, size=400, prob=1/38) # binomial distribution (number of successes in n trials)
?dbinom
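# sanity check: the probabilities over all 401 possible outcomes should sum to 1
sum(pwins)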
# calculating mean and standard deviation
muTrue <- round(sum(pwins*net), 3)
sdTrue <- round(sqrt(sum(pwins*(net-muTrue)^2)), 3)
# calculating range (restricted to outcomes with non-zero density)
tmp <- range(net[pwins > 0])
rangeTrue <- tmp[2]-tmp[1]
# answering the question
# the range
(rangeTrue-diff(range(totals)))/rangeTrue
## [1] 0.9114391
# the standard deviation
(sdTrue-sd(totals))/sdTrue
## [1] -0.00305333
# the mean
(muTrue-mean(totals))/muTrue
## [1] -0.03576755
The range shows the largest relative error because it depends entirely on extreme values - the minimum and maximum observations. With only 10,000 simulated players, we’re unlikely to capture the true population extremes, making range highly sensitive to sampling variation. Here, if you get unlucky and don’t sample someone who wins really big or loses everything, your range will be completely wrong.
The mean and standard deviation are much more robust because they use information from all data points rather than just the extremes.
- The mean is particularly stable due to the Law of Large Numbers: as sample size increases, sample means converge toward the true population mean. Here, even if a few numbers are off, thousands of other data points balance them out. This is why taking bigger samples gives you better averages.
- The standard deviation is almost as robust as the mean, since it is calculated from deviations around the sample mean. However, it can show slightly more variation because it involves squared deviations, which amplify the effect of sampling fluctuations. Since spread depends partly on extreme values, it is more affected by sampling luck than the mean, but much less than the range.
In summary: Range (least robust) < Standard Deviation < Mean (most robust). This demonstrates why statisticians prefer measures that incorporate all data points rather than those dependent on extreme values when making inferences about populations.
Think of it this way: if you flip a coin 10 times vs 1000 times, which gives you a result closer to 50% heads? The 1000 flips, because more data = more reliable results. Range ignores most of your data, mean uses all of it, standard deviation uses all of it but is sensitive to extremes.
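The coin-flip intuition is easy to simulate as well; a throwaway sketch (the seed is arbitrary):

# proportion of heads in 10 flips versus 1000 flips
set.seed(42)
mean(rbinom(10, 1, 0.5)) # can easily land far from 0.5
mean(rbinom(1000, 1, 0.5)) # typically very close to 0.5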
Our simulation, using random numbers, was one way to sample 10,000 times from this theoretical distribution. Another way would be to actually play Roulette and record our winnings! With a large enough sample, this will start to resemble the distribution of the population. Once we have the distribution, we can estimate the probability of observing certain values. Often, this information is far more valuable than the mean alone.
# what is the likelihood of net profit?
# first, counting instances of having winnings greater than 0 (net win)
# then, dividing it by the count of all observed winnings
probProfit <- sum(totals>0)/length(totals)
probProfit # -> only 36.05% chance!
## [1] 0.3605
# likelihood of walking out of the casino with a profit is well below 50% (only a 36.05% chance,
# close to the true population proportion of roughly 0.36)...playing Roulette is unlikely to make one rich!
# what is the likelihood of a modest profit of >= €200?
probProfitModest<-sum(totals>=200)/length(totals)
probProfitModest # -> only 25.51% chance! even lower than breaking even (true population proportion: 0.2587)
## [1] 0.2551
# what would the luckiest 1% bring home?
quantile(totals, prob=0.99) # -> €1,420
## 99%
## 1420
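For comparison, the exact population counterparts of these probabilities can be read straight off the binomial distribution from the earlier chunk, reusing the net and pwins objects (a sketch, not part of the original exercise):

# exact population probabilities from the binomial distribution
sum(pwins[net > 0]) # true probability of any profit (about 0.36)
sum(pwins[net >= 200]) # true probability of at least 200 euros profit (0.2587)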
As shown above, the 10,000 samples collectively start to reveal the shape of the population’s distribution. This is useful in most situations, as we often want to calculate the fraction of values smaller or larger than some threshold, and a large enough sample lets us estimate these statistics with reasonable precision. Still, even with a very large sample (10,000 players), some uncertainty remains: our estimated mean is slightly off from the true mean.
One possibility is to look at the average error made. If we repeated the Roulette simulation many times, and every time we calculated a mean, then we could calculate this average error by computing the mean distance from our sample means to our true mean. This quantity is the standard error:
- Standard error: \[SE = \frac{\sigma}{\sqrt{n}}\]
- σ is the standard deviation, usually estimated from a sample
- n is the sample size
The actual sampling distribution for the previous Roulette simulation can also be simulated in R, without the need for a formula: by rerunning the simulation 500 times and calculating the mean each time, the average error can be estimated directly!
# calculating sample size
n <- length(totals)
# calculating standard error
se <- sd(totals)/sqrt(n)
se # display value of se
## [1] 5.780215
# estimated standard error is 5.78, which means our estimate of the population mean (-€109.03) is
# within one standard error of the true mean (-€105.26)
set.seed(13092021)
# repeatedly doing the simulation and calculating the mean
repeated <- replicate(500, mean(sapply(rep(bet,nplayers), roulette)))
# the standard error is simply the standard deviation of this distribution!
# plotting it for visuals
hist(repeated,main="Sampling distribution", xlab="repeatedly calculated mean")
abline(v=-105.26, lty=2, lwd=2) # true population mean
# the mean after 500 repeats of the simulation
muRepeated <- mean(repeated)
muRepeated
## [1] -105.1336
# the standard error estimated from the sampling distribution
seRepeated <- sd(repeated)
seRepeated
## [1] 5.9008
Conclusion: We see three things: 1) the sampling distribution is again symmetrical, 2) the mean after 500 repeated samples (-105.13) is much closer to the true mean (-105.26), and 3) the standard error estimated from the sampling distribution (5.90) is very close to the one calculated with the standard error formula (5.78).
Use the function “quantile” to calculate the 2.5% and 97.5% percentiles of the sampling distribution (the distribution of means saved in the object repeated). Next, estimate how many times you need to add, or subtract, the standard error from the true mean to reach these two values. Round your answers to the nearest 1 (no decimals). Save your answer.
# calculating 2.5% and 97.5% percentiles of the sampling distribution
percentiles <- quantile(repeated, prob=c(0.025,0.975))
per025 <- percentiles[1] # the 2.5% percentile
per975 <- percentiles[2] # the 97.5% percentile
# calculating how many standard errors these percentiles lie from the mean of the sampling distribution
x <- (per975 - muRepeated)/seRepeated
x
## 97.5%
## 1.90811
y <- (per025 - muRepeated)/(-seRepeated)
y
## 2.5%
## 1.726329
# rounding values to nearest integer
round(x, 0)
## 97.5%
## 2
round(y, 0)
## 2.5%
## 2
# an interval of roughly 2 standard errors on either side of the mean contains ~95% of the sample means
From the above, it is clear that an interval of approximately 2 times the standard deviation of the sampling distribution (i.e. 2 standard errors) around the mean contains about 95% of the observed sample means!
Conclusion:
- Law of Large Numbers: With 500 repeated samples, our mean (-105.13) moved much closer to the theoretical population mean (-105.26) than any single sample mean, demonstrating that more samples yield more accurate estimates.
- Central Limit Theorem: The sampling distribution of the 500 repeated sample means formed a symmetrical, normal-looking distribution regardless of the original roulette data’s non-normal distribution, confirming that sample means become normally distributed when sample sizes are large enough.
- 2 Standard Errors Rule: We found that approximately 2 standard errors capture 95% of the sample means, which is no coincidence but the foundation of confidence intervals. The formula SE = σ/√n accurately predicted the variability we actually observed, showing how uncertainty in sample estimates follows predictable patterns.
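As a closing illustration, the 2-standard-errors rule gives a quick 95% confidence interval for the mean winnings from the single sample alone. A minimal sketch, assuming the sampling distribution is approximately normal (as the histogram above suggests):

# approximate 95% confidence interval for the mean winnings, using mean +/- 2*SE
ci <- mean(totals) + c(-2, 2)*se
ci # this interval should cover the true mean of about -105.26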