Question 3
Financial institutions in the capital markets sector submit financial returns on a monthly basis. Hence, there is up to 10 years of data for these returns.
For a particular return X, which is typically a floating-point number with 2 decimal places, supervisor B is concerned that some of these returns might not be correct. Specifically, he noticed that some firms had, from time to time, submitted returns which are exactly the same. B has been told by his supervisor to design a statistical test, involving the use of a Monte Carlo algorithm, to identify firms with more than a 95% chance of having submitted too few unique returns, assuming these financial returns follow a normal distribution.
You are required to write simple pseudocode that implements this test. Assume you need at least 10,000 runs for the Monte Carlo step, and that the code you design can be applied to any number of firms.
Answer to Question 3
Bayesian A/B Testing
- Essentially, the scenario in Question 3 can be framed as Bayesian A/B Testing. In addition, Monte Carlo Simulations can be used to approximate integrals by sampling from a distribution and calculating the proportion of sampled points that fall in the region we are interested in. Here, we assume that financial returns follow a Normal Distribution.
- To illustrate the use of Bayesian A/B Testing, we can model a particular financial return, A, from a particular firm with a Normal Distribution. We also model another financial return, B, from the same firm with a Normal Distribution. Essentially, we are asking the question: “What is the probability that return A is different from return B?”
- R has an rnorm function that allows us to sample from a Normal Distribution. It also has an rbeta function that allows us to sample from a Beta Distribution. Here, we use rnorm to sample 10,000 times from each of the Normal Distributions we have modelled.
# Pseudo R Code
# Number of trials = 10,000.
# Return A: illustrative mean = 50, standard deviation = 100.
# Return B: illustrative mean = 60, standard deviation = 110.
n.trials <- 10000
a.samples <- rnorm(n.trials, mean = 50, sd = 100)
b.samples <- rnorm(n.trials, mean = 60, sd = 110)
# Proportion of trials in which the draw for B exceeds the draw for A.
return.chance <- sum(b.samples > a.samples) / n.trials
- If we assume that return.chance = 0.96, this is roughly analogous to a p-value of 0.04 from a one-tailed t-test, which is statistically significant at the 5% level. A Null Hypothesis Significance Test is used to determine whether two samples are likely to be the result of sampling from the same distribution or not.
- Here, we set a significance level (alpha) of 0.05, which indicates a 5% risk of concluding that a difference exists between return A and return B when there is no actual difference. It is the probability of rejecting the null hypothesis when it is true.
- Besides the level of significance, Monte Carlo Simulations can show us the magnitude of relative improvements in our simulations. We can plot the ratio b.samples / a.samples to get a distribution of relative improvements. The Cumulative Distribution Function (CDF) of this ratio can be useful for interpreting our results.
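The ratio and its CDF can be sketched as follows. This is only an illustration: the means and standard deviations are hypothetical and are chosen much tighter than in the pseudo code above, because the ratio is only easy to interpret when a.samples stays well away from zero. ecdf and quantile are base R functions.

```r
set.seed(1)                          # for reproducibility only
n.trials <- 10000

# Hypothetical parameters (assumption): small spreads keep a.samples positive.
a.samples <- rnorm(n.trials, mean = 50, sd = 5)
b.samples <- rnorm(n.trials, mean = 60, sd = 6)

# Relative size of B compared with A in each trial.
ratio <- b.samples / a.samples

# Empirical CDF of the ratio; ratio.cdf(q) is the share of trials with ratio <= q.
ratio.cdf <- ecdf(ratio)
share.no.improvement <- ratio.cdf(1)   # share of trials where B is not larger than A

# A few summary points of the distribution of relative improvements.
ratio.summary <- quantile(ratio, c(0.05, 0.5, 0.95))
```

When every value in a.samples is positive, ratio.cdf(1) agrees with the share of trials in which b.samples <= a.samples, so 1 - ratio.cdf(1) plays the same role as return.chance. Calling plot(ratio.cdf) draws the curve.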
Rejection Sampling
- If we do not have a sampling function like rnorm or rbeta, we will need to build a Rejection Sampler. This allows us to sample from a distribution whose density we can evaluate but cannot sample from directly.
- For example, we can plot a Beta Distribution and randomly sample two points, C and D, from the rectangle that encloses it. We then look at whether each point falls under the curve or not. If it does, we accept the sample. Otherwise, we reject it.
- In our plot of the Beta Distribution, point C falls outside the curve, so we reject it. However, point D falls under the curve, so we accept it. In an (x, y) pair, the ‘y’ value determines whether or not we accept the point, while the ‘x’ value is what we keep as the sample.
- We will draw 10,000 candidate points for our Beta Distribution to show how the algorithm works.
# Pseudo R Code
# Number of trials = 10,000.
# xs = sampled 'x' values, drawn uniformly from [0, 1].
# ys = sampled 'y' values, drawn uniformly from [0, 4].
n.pairs <- 10000
xy.pairs <- data.frame(xs = runif(n.pairs, min = 0, max = 1),
                       ys = runif(n.pairs, min = 0, max = 4))
- After obtaining the samples, we need a function that tells us whether each sampled point is inside or outside the curve. We define a “which.nearest” function to find the index of the grid ‘x’ value that is nearest to our sample.
# Pseudo R Code
which.nearest <- function(x, xs){
# Assumes 'xs' values are sorted in increasing order.
# Index of the first grid value greater than 'x' (NA if there is none).
first.greater <- which(xs > x)[1]
# Edge cases: 'x' lies at or beyond either end of the grid.
if (is.na(first.greater)) return(length(xs))
if (first.greater == 1) return(1)
# 1 if the last smaller value is nearer to 'x', 2 if the first greater value is nearer.
smallest.diff <- which.min(abs(c(xs[first.greater - 1], xs[first.greater]) - x))
# Return the index of whichever neighbour is nearer.
first.greater + (smallest.diff - 2)}
- Next, we go on to either reject or accept our sampled points.
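# Pseudo R Code
# xy.pairs$posterior = height of the Posterior Probability Distribution (the probability distribution of an unknown quantity, treated as a random variable, conditional on the evidence obtained from an experiment or survey) at each sampled 'x'.
# Assumes 'xs' is the sorted grid of 'x' values and 'posterior.ys' holds the density at each grid point.
xy.pairs$posterior <- sapply(xy.pairs$xs, function(x){
posterior.ys[which.nearest(x, xs)]})
# Accept a point when its 'y' value falls under the curve.
xy.pairs$accept <- xy.pairs$ys < xy.pairs$posterior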
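The accept/reject step relies on a grid of 'x' values and the curve heights on that grid. A minimal end-to-end sketch follows. It assumes a Beta(2, 5) target, which is not specified in the text above and is picked only because its peak stays below the box height of 4. It also uses which.min(abs(xs - x)) as a compact stand-in for the nearest-grid-point lookup.

```r
set.seed(42)                         # for reproducibility only
n.pairs <- 10000

# Assumed target (hypothetical): Beta(2, 5), evaluated on a sorted grid over [0, 1].
xs <- seq(0, 1, by = 0.001)
posterior.ys <- dbeta(xs, shape1 = 2, shape2 = 5)
stopifnot(max(posterior.ys) < 4)     # the sampling box must cover the whole curve

# Uniform (x, y) proposals over the box [0, 1] x [0, 4].
xy.pairs <- data.frame(xs = runif(n.pairs, min = 0, max = 1),
                       ys = runif(n.pairs, min = 0, max = 4))

# Height of the curve at the grid point nearest to each proposed x.
xy.pairs$posterior <- sapply(xy.pairs$xs, function(x) {
  posterior.ys[which.min(abs(xs - x))]
})

# Accept points under the curve; the accepted x values form the sample.
xy.pairs$accept <- xy.pairs$ys < xy.pairs$posterior
accepted.xs <- xy.pairs$xs[xy.pairs$accept]

acceptance.rate <- mean(xy.pairs$accept)  # curve area (1) over box area (4), so near 1/4
sample.mean <- mean(accepted.xs)          # should be close to the Beta(2, 5) mean of 2/7
```

About a quarter of the proposals are kept. A tighter box, or a proposal distribution shaped more like the target, would waste fewer draws.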
- We can plot this to look at all our sampled points.
- Monte Carlo Simulations make it relatively easy to work with distributions that are difficult to analyse. Classical statistics relies heavily on the Normal Distribution because it often appears in nature and is a good approximating tool. However, there are many distributions in the real world for which the Normal Distribution is a poor approximation. With Rejection Sampling, we can sample from such distributions and use Monte Carlo Simulations to compute integrals for our analysis.
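Returning to the scenario in Question 3, the same sample-and-count idea can be aimed directly at the number of unique values a firm has submitted. The sketch below is one possible reading of the question, not a definitive test. The helper flag.few.unique, the firm names and the data are all hypothetical. It fits a Normal Distribution to each firm's returns and simulates 10,000 series of the same length, rounded to 2 decimal places. It then estimates the chance that a genuinely normal series would contain more unique values than the firm actually reported. Firms where this chance exceeds 0.95 are flagged.

```r
# Hypothetical helper (assumption, not from the text above). Expects at least 2 returns.
flag.few.unique <- function(returns, n.runs = 10000, threshold = 0.95) {
  returns <- round(returns, 2)
  n <- length(returns)
  observed.unique <- length(unique(returns))
  # A constant series has zero spread, so no normal model can be fitted; flag it outright.
  if (n > 1 && observed.unique == 1) {
    return(list(observed.unique = observed.unique, chance = NA_real_, flagged = TRUE))
  }
  mu <- mean(returns)
  sigma <- sd(returns)
  # One simulated series per column, rounded to 2 decimal places like the real data.
  sims <- matrix(round(rnorm(n * n.runs, mean = mu, sd = sigma), 2), nrow = n)
  sim.unique <- apply(sims, 2, function(s) length(unique(s)))
  # Share of simulated series with MORE unique values than the firm reported.
  chance <- mean(sim.unique > observed.unique)
  list(observed.unique = observed.unique, chance = chance, flagged = chance > threshold)
}

# Made-up data for two hypothetical firms, 120 monthly returns each.
set.seed(2024)
firm.returns <- list(
  F1 = rnorm(120, mean = 5e6, sd = 1e6),                    # no deliberate repeats
  F2 = sample(round(rnorm(20, mean = 5e6, sd = 1e6), 2),
              120, replace = TRUE)                          # heavy reuse of 20 values
)

# Works for any number of firms: one list element per firm.
results <- lapply(firm.returns, flag.few.unique)
flagged.firms <- names(results)[sapply(results, function(r) r$flagged)]
```

Because the mean and standard deviation are estimated from the same data being tested, heavy repetition can distort the fitted distribution. A production version might estimate them more robustly.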