Welcome everyone to this R demo on the Fundamentals of Bayesian Analysis.
Before we begin, I’d like to share something important with you. The material we’ll be using in this workshop, particularly the RMarkdown file, is adapted from the “Fundamentals of Bayesian Data Analysis in R” course available on DataCamp (I have made a lot of edits to ensure smooth transitions between sections and clear comprehension). After reviewing numerous introductory materials on Bayesian analysis, I’ve found that this course offers the best introduction for those who are completely new to the Bayesian approach. It provides a clear and accessible entry point into this fascinating area.
Now, let’s set our expectations for today’s session. Bayesian analysis is an extensive field, and it’s essential to understand that becoming an expert in Bayesian methods is a journey that extends well beyond what a one-hour workshop can offer. It’s similar to trying to master all of statistics in a single hour - a challenging, if not impossible, task.
Bayesian analysis is not just a single model; it’s an approach, a different way of thinking about and analyzing data compared to the classical statistical (frequentist) approach. It’s adaptable and can be applied to various models, depending on your data and the research questions you wish to answer.
In today’s workshop, we’ll be covering the very basics of Bayesian analysis. Consider this session as your first step into a broader and deeply engaging world. While we’ll only be scratching the surface, I hope this introduction will ignite your interest and provide a solid foundation for your further exploration into Bayesian analysis.
For those keen to delve deeper, I’m more than happy to recommend additional resources for further study. Remember, today is about setting the groundwork, a starting point from which your understanding and expertise can grow.
Thank you for joining me on this intellectual adventure. Let’s embark on this learning journey together!
Bayesian Inference in a Nutshell:
A method for figuring out unobservable quantities given known facts that uses probability to describe the uncertainty over what the values of the unknown quantities could be.
Bayesian inference == Probabilistic inference.
You are making probability statements about parameters themselves, not just about the data.
Probability
The role of probability distributions in Bayesian data analysis is to represent uncertainty, and the role of Bayesian inference is to update probability distributions to reflect what has been learned from data.
Let’s look at a brief illustrative example.
A Bayesian model for the proportion of success
ex: What is the proportion of successful treatments with a new drug?
prop_model(data)
The data argument is a vector of successes and failures represented by 1s and 0s. The output of prop_model is a plot showing what the model learns about the underlying proportion of success from each data point, in the order you entered them.
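The prop_model() function itself ships with the course materials, so its code isn't shown here. As a rough idea of what it does under the hood, here is a minimal stand-in of my own (a sketch, not the course's implementation): it draws proportions from a uniform prior, keeps only the draws consistent with the observed data, and plots the resulting posterior (the real prop_model additionally shows the posterior after each successive data point).
# Hypothetical minimal stand-in for prop_model(), for intuition only
prop_model_sketch <- function(data, n_draws = 100000) {
  proportion <- runif(n_draws, min = 0, max = 1)   # uniform prior over the proportion
  n_success <- rbinom(n_draws, size = length(data), prob = proportion)
  posterior <- proportion[n_success == sum(data)]  # keep draws that reproduce the data
  hist(posterior, xlim = c(0, 1))
  invisible(posterior)
}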
Zombie apocalypse:
Let us say there is a new drug to treat zombieism. It has never been
tested before, but the results from this pilot test have two recoveries
out of 4 subjects. Here, prop_model
plots the successive
posterior probabilities that the zombie antidote will be successful:
data <- c(1, 0, 0, 1 )
prop_model(data)
## Warning in ggridges::geom_density_ridges(stat = "identity", color = "white", :
## Ignoring unknown parameters: `size`
At n=0 there is no data, and all the model knows is that it's equally probable that the proportion of success is anything from 0% to 100%. At n=4 all the data has been added, and the model knows a little bit more.
Next we observe several more subjects; however, they are all failures, leaving just 2 successes out of 13 subjects. Here is how the posterior probability changes with the new information:
data <- c(1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 ) # still only 2 successes out of 13
prop_model(data)
## Warning in ggridges::geom_density_ridges(stat = "identity", color = "white", :
## Ignoring unknown parameters: `size`
It seems like it's not a perfect drug, but between 5% and 40% cured zombies is better than nothing!
What if we observe a lot more?
data <- rbinom(50, 1, 0.2) # the function generates 50 random numbers with parameters n = 1 (number of trials) and p = 0.2 (probability of success in each trial).
prop_model(data)
## Warning in ggridges::geom_density_ridges(stat = "identity", color = "white", :
## Ignoring unknown parameters: `size`
A subjective observation: the more data collected about a variable, the closer the posterior distribution approaches the 'true' probability of a success.
data = c(1, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0)
# Extract and explore the posterior
posterior <- prop_model(data)
## Warning in ggridges::geom_density_ridges(stat = "identity", color = "white", :
## Ignoring unknown parameters: `size`
head(posterior)
## [1] 0.12036586 0.21641050 0.27290850 0.10608833 0.19117715 0.09285969
# returns a large random sample from the posterior over the underlying proportion of success
# each value in posterior is a possible "proportion of success" based on your data and prior
Let’s plot the posterior distribution:
hist(posterior, breaks = 30, xlim = c(0,1 ), col = 'palegreen4' )
Hey, that's a good looking plot! This histogram depicts a roughly bell-shaped distribution for the proportion of cured zombies, with most of the probability between about 5% and 40%.
Compare this histogram to the plot produced directly by prop_model. You should notice that the histogram and the posterior distribution (at n=13) describe the same distribution.
Summarizing the zombie drug experiment
The point of working with samples from a probability distribution is that it makes it easy to calculate new measures of interest. The following tasks are about doing just this!
A point estimate is a single number used to summarize what’s known about a parameter of interest. It can be seen as a “best guess” of the value of the parameter. A commonly used point estimate is the median of the posterior. It’s the midpoint of the distribution, and it’s equally probable for the parameter value to be larger than the median as it is to be smaller than it.
# Calculate the median
median(posterior)
## [1] 0.1878639
So, a best guess is that the drug would cure around 19% of all zombies. Another common summary is to report an interval that includes the parameter of interest with a certain probability. This is called a credible interval (CI). With a posterior represented as a vector of samples, you can calculate a CI using the quantile() function.
# Calculate the credible 90% interval
quantile(posterior, c(0.05, 0.95))
## 5% 95%
## 0.06034651 0.38731579
According to the credible interval, there is a 90% probability that the proportion of zombies the drug would cure is between 6% and 38%, which is aligned with the green histogram we observed earlier. (Here we have to be careful to remember that the percentage of cured zombies and the probability percentage are two different things.)
Now, there is a rival zombie laboratory that is also working on a drug. They claim that they are certain that their drug cures 7% of the zombies it's administered to. Can we calculate how probable it is that our drug is better? Yes, we can! But it's a two-stage process.
# Calculate the sum
sum(posterior > 0.07) # the number of posterior samples in which our drug's cure rate exceeds the rival's claimed 7%
## [1] 9277
# Calculate the probability
sum(posterior > 0.07)/length(posterior) # the probability that our drug has a higher cure rate than the rival drug
## [1] 0.9277
Put the results in a journal
“Given the data of two cured and 11 relapsed zombies, and using the Bayesian model described before, there is a 90% probability that our drug cures between 6% and 39% of treated zombies. Further, there is a 93% probability that our drug cures zombies at a higher rate than the current state-of-the-art drug.”
Bayesian Inference = Data + a Generative Model + Priors
A generative model can be a computer program, a mathematical expression, or a set of rules that can be fed parameter values and used to simulate data.
Take a generative model for a spin
Below you have the R code that implements the generative model we just developed.
# The generative zombie drug model
# Set parameters
prop_success <- 0.42 # The probability of curing the zombie
n_zombies <- 100 # The total number of zombies we have
# Simulating data
data <- c()
for(zombie in 1:n_zombies) {
data[zombie] <- runif(1, min = 0, max = 1) < prop_success
} # Runs a loop for each zombie, where a random number between 0 and 1 is generated. If the number is less than prop_success, it means the zombie is cured (TRUE or 1), otherwise not cured (FALSE or 0).
data <- as.numeric(data) # Convert the logical vector to numeric (TRUE/FALSE to 1/0)
data # The simulated zombie data showing whether each zombie gets cured after receiving our drug (1 = cured; 0 = not cured)
## [1] 0 1 0 1 0 0 0 0 0 1 1 1 1 0 1 0 0 0 0 0 1 1 1 0 0 0 0 1 0 1 1 0 1 1 0 0 0
## [38] 1 1 0 0 0 0 1 1 0 0 1 1 1 0 0 1 1 1 0 1 0 1 1 0 1 0 1 0 1 0 0 1 0 0 0 0 1
## [75] 0 1 1 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0 1 0 0 0 1 0 1 1
# Count cured
data <- sum(data)
data
## [1] 47
Perfect! Some zombies got cured in this simulation, but far from all.
Take the binomial distribution for a spin
It turns out that the generative model you ran in the last exercise already has a name. It's called the binomial process or the binomial distribution. In R you can use the rbinom function to simulate data from a binomial distribution. The rbinom function takes three arguments:
n: the number of times you want to run the generative model.
size: the number of trials (for example, the number of zombies you're giving the drug).
prob: the underlying proportion of success, as a number between 0.0 and 1.0.
# Try out rbinom
rbinom(n = 1, size = 100, prob = 0.42)
## [1] 42
# Change the parameters
rbinom(n = 200, size = 100, prob = 0.42)
## [1] 40 42 44 44 41 44 44 39 42 44 38 41 43 48 44 38 43 42 50 52 39 37 51 46 42
## [26] 41 46 44 31 38 37 43 41 48 38 50 38 31 42 46 36 37 36 41 46 43 45 50 47 45
## [51] 43 43 47 46 49 43 54 43 36 41 40 43 37 45 41 41 36 46 40 46 43 43 41 46 36
## [76] 42 35 36 45 41 43 51 41 44 46 50 33 35 34 48 41 39 40 40 49 40 40 45 42 44
## [101] 53 50 44 38 38 41 42 46 41 42 42 42 32 41 51 41 40 41 44 45 37 46 41 48 40
## [126] 40 39 41 53 50 42 47 43 47 44 48 43 36 44 39 45 40 51 48 33 49 45 45 41 41
## [151] 48 42 38 39 38 48 49 45 49 44 46 45 46 41 33 41 42 40 42 40 42 43 46 40 47
## [176] 45 37 43 43 33 35 40 40 49 52 36 43 49 45 43 44 47 34 42 47 39 39 40 44 55
Nice! That’s a lot of simulated zombies right there. Each number shown here represents the number of cured zombies we would observe out of 100 if we had given the drug to all of the zombies. Note that we repeat this process 200 times.
We can apply our generative model to a relatively large number of simulated trials. The results will form the probability distribution for our process:
cured_zombies <- rbinom(n = 100000, size = 100, prob = 0.07 )
post_df <- data.frame('Cured_Zombies' = cured_zombies )
ggplot(post_df, aes(x = Cured_Zombies ) ) +
geom_histogram(binwidth = 1, fill = 'palegreen4', color = "#e9ecef" ) +
ggtitle('Cured Zombies Probability Distribution' )
More often, though, we already have the data and do not need a generative model to create it; what we do not know are the parameters that drive the process. Bayesian inference is used to invert the probability, working backwards to learn about the underlying process given the data at hand.
New Situations: instead of the zombie example, let’s look at click-through rates….
How many visitors to a site?
To get more visitors to your website you are considering paying for an ad to be shown 100 times on a popular social media site. According to the social media site, their ads get clicked on 10% of the time. You would like the ad campaign to result in at least 5 visitors to your site.
set.seed(123)
# Fill in the parameters
n_samples <- 100000 # repeat the process 100000 times
n_ads_shown <- 100 # each time 100 ads are shown
proportion_clicks <- 0.1 # according to the social media site, 10% of the ads get clicked
n_visitors <- rbinom(n_samples, size = n_ads_shown,
prob = proportion_clicks) # The rbinom() function will generate n_samples random numbers, each one representing the number of clicks out of n_ads_shown, with the probability of a click being proportion_clicks. This could model, for example, how many times people click on an advertisement after it has been shown a certain number of times, with a given click-through rate (proportion_clicks).
post_df <- data.frame('Site_Visits' = n_visitors )
ggplot(post_df, aes(x = Site_Visits ) ) +
geom_histogram(binwidth = 1, fill = 'cadetblue', color = "#e9ecef" ) +
ggtitle('Number of Visitors Distribution' )
This graph shows how many site visits we could get if we decide to show these 100 ads (repeating the whole process 100,000 times). Notice that sometimes we get fewer clicks and sometimes we get more: the number of visitors is itself uncertain, even with the click-through rate fixed at 10%.
Question
You would like the ad campaign to result in at least 5 visitors to your site.
Eyeballing the plot you just produced, what is the probability you will get 5 or more visitors because of the ad? (Roughly more than 90%)
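Rather than eyeballing, you can compute this directly from the simulated samples in n_visitors from the chunk above; the result should indeed come out well above 90%.
# Proportion of simulations with 5 or more visitors
sum(n_visitors >= 5) / length(n_visitors)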
And now to address the Priors.
Prior Probability Distribution: represents how uncertain the model is about the parameters before seeing any data. Ideally, this uncertainty should reflect our own uncertainty about those parameters. The reason it is called a prior is that it represents the uncertainty before having included the information in the data.
Returning to our website click-through example: you're not so sure that your ad will get clicked on exactly 10% of the time. Instead of assigning proportion_clicks a single value, you are now going to assign it a large number of values drawn from a probability distribution.
n_samples <- 100000
n_ads_shown <- 100
set.seed(123)
# model a uniform probability distribution that ranges from 0 to 0.2 to represent the uncertainty of our value for this prior
proportion_clicks <- runif(n = n_samples, min = 0.0, max = 0.2 )
n_visitors <- rbinom(n_samples, n_ads_shown, proportion_clicks )
Because the rbinom function is vectorized, the first value of proportion_clicks is used to sample the first value in n_visitors, the second value in proportion_clicks is used for the second in n_visitors, and so on. The result is that the samples in n_visitors now also incorporate the uncertainty in what the underlying proportion of clicks could be.
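As a tiny illustration of this vectorization (my own example, not from the course), you can pass a short vector of probabilities to rbinom and each draw will use the corresponding element:
# The first draw uses prob = 0.05, the second 0.10, the third 0.15
rbinom(n = 3, size = 100, prob = c(0.05, 0.10, 0.15))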
Visualize the probability distribution for the prior, proportion_clicks.
pclick_df <- data.frame('probability_click' = proportion_clicks )
pclick_plot <- ggplot(pclick_df, aes(x = probability_click ) ) +
geom_histogram(fill = 'cadetblue', color = "#e9ecef" ) +
ggtitle('Probability Distribution of the Prior' )
nvis_df <- data.frame('Site_Visits' = n_visitors )
nvis_plot <- ggplot(nvis_df, aes(x = Site_Visits ) ) +
geom_histogram(binwidth = 1, fill = 'cadetblue', color = "#e9ecef" ) +
ggtitle('Number of Visitors Distribution' )
grid.arrange(pclick_plot, nvis_plot, ncol = 2 )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
sum(n_visitors >=5) / length(n_visitors)
## [1] 0.75295
You shouldn't be surprised to see that the uncertainty over proportion_clicks is just as you specified it to be: uniform between 0.0 and 0.2 (except for some small variations in the height of the bars, because we took a random sample using runif).
Adding uncertainty changed the answer. With proportion_clicks fixed at 10%, the probability of getting at least 5 visitors was roughly 90-95%; with the uniform prior from 0% to 20%, it drops to about 75%.
The envelope of the n_visitors data from this generative model is much different. The added uncertainty in the prior increases the uncertainty in the number of visitors the site will receive.
Now we have a generative model and priors, and together these form a Bayesian model. So if a statistician says she has developed a Bayesian model of something, that just means she has specified a generative model for the data plus priors over what is uncertain in the model. The only thing we need now to do Bayesian inference is some data. And as it happens, you got tired of me talking about probability distributions and just decided to run your ad campaign. And it went better than expected!
When the ad was shown 100 times, 13 people clicked and visited your site. So now we have data, and we are ready to do some Bayesian inference: ready to use this data to learn how likely people are to click on your ad.
How could we do that? Let's take a look at the model we have so far and see if we can figure that out. The prior over proportion_clicks and the generative model for n_visitors together define a prior probability distribution over both the parameters and the future, unknown data. I'm going to put these together in a data frame and call it prior (prior_df in the chunk below).
# This data frame represents the joint probability distribution over proportion_clicks and n_visitors together.
prior_df <- data.frame('proportion_clicks' = proportion_clicks, 'n_visitors' = n_visitors )
Now the prior data frame (prior_df) represents the joint probability distribution of proportion_clicks and n_visitors together. We could have a look at the first couple of samples in this data frame.
But it's easier to see what's going on if we plot this data frame as a scatter plot.
#visualize as a scatter plot....
prior_plot <- ggplot(prior_df, aes(x = n_visitors, y = proportion_clicks ) ) +
geom_point(alpha = 1/10 ) +
theme(legend.position = 'none')
prior_plot <- ggMarginal(prior_plot, type = 'histogram',
xparams = list(bins=20),
yparams = list(bins=20))
prior_plot
On the x-axis we have the number of visitors, and on the y-axis we have the underlying proportion of clicks. The histograms show the marginal distributions, and you should recognize them both: the one for the proportion of clicks is uniform between 0% and 20%, just as we defined it to be, and the one for n_visitors is the one we looked at in the last exercise. We clearly see a pattern in the plot: if we knew that proportion_clicks was exactly 10%, we could condition on this, that is, remove all samples that don't fulfill the condition of proportion_clicks being 10%. This would also reduce the uncertainty in how many visitors we would get. If we knew the proportion of clicks was 10%, we would be pretty certain we would end up with between 2 and 15 visitors.
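As a sketch of what that conditioning would look like in code (the 0.005 band width is my own arbitrary choice, needed because proportion_clicks is continuous), we keep only the samples where proportion_clicks is approximately 10% and look at the spread of n_visitors:
# Keep samples where proportion_clicks is roughly 10%, then summarize n_visitors
near_10 <- prior_df[abs(prior_df$proportion_clicks - 0.10) < 0.005, ]
quantile(near_10$n_visitors, c(0.025, 0.975))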
However, there is no point in doing this, because the problem was precisely that we didn't know what the underlying proportion of clicks could be.
But what we can know is the data, and we can condition on the data too. Say that 5 out of 100 clicked when we ran the ad campaign. If we remove all samples that don't fulfill this condition, we also reduce the uncertainty in what the underlying proportion of clicks could be.
We end up with a distribution that doesn't look at all like the uniform distribution between 0% and 20% we defined before. And if we actually had gotten 5 visitors, we could have concluded that proportion_clicks probably lies between 3% and 10%.
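Here is the same idea in code, conditioning on the hypothetical outcome of 5 visitors; the resulting interval should land roughly in the 3% to 10% range described above.
# Condition on the (hypothetical) data of 5 visitors out of 100 shown ads
five_visitors <- prior_df[prior_df$n_visitors == 5, ]
quantile(five_visitors$proportion_clicks, c(0.05, 0.95))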
And similarly we could condition on other values for the data. We have now reached the essence of what Bayesian inference is: it is conditioning on the data in order to learn about which parameter values likely gave rise to this data. So, when you ran the ad campaign you got 13 site visits out of 100 shown ads. With this data, we can now update the Bayesian model.
# Create the prior data frame
prior <- data.frame(proportion_clicks, n_visitors)
# Examine the prior data frame
head(prior)
## proportion_clicks n_visitors
## 1 0.05751550 6
## 2 0.15766103 9
## 3 0.08179538 11
## 4 0.17660348 11
## 5 0.18809346 16
## 6 0.00911130 2
The reason we've called it prior is that it represents the uncertainty before (that is, prior to) having included the information in the data. Let's do that now!
prior$n_visitors represented the uncertainty over how many visitors you would get because of the ad campaign. But now you know you got exactly 13 visitors.
# Create the posterior data frame
posterior <- prior[prior$n_visitors == 13, ]
# Visualize posterior proportion clicks
hist(posterior$proportion_clicks, col = "palegreen4")
Just now, you updated the probability distribution over the underlying proportion of clicks (proportion_clicks) using new data. Now we want to use this updated proportion_clicks to predict how many visitors we would get if we reran the ad campaign.
The result from the last example is still in the data frame posterior, but if you look at posterior$n_visitors you'll see it's just 13 repeated over and over again. This makes sense, as posterior represents what the model knew about the outcome of the last ad campaign after having seen the data.
# Assign posterior to a new variable called prior
prior <- posterior
# Take a look at the first rows in prior
head(prior)
## proportion_clicks n_visitors
## 28 0.1188284 13
## 58 0.1506616 13
## 100 0.1023011 13
## 115 0.1441193 13
## 180 0.1163500 13
## 223 0.1456789 13
# Let's replace prior$n_visitors with a new sample and visualize the result
n_samples <- nrow(prior)
n_ads_shown <- 100
prior$n_visitors <- rbinom(n_samples, size = n_ads_shown,
prob = prior$proportion_clicks)
hist(prior$n_visitors, col = "palegreen4")
The plot shows a probability distribution over the number of site visitors a new ad campaign would bring in. It now looks pretty likely that there would be more than, say, 5 visitors because of the campaign.
Calculate the probability that you will get 5 or more visitors next time you run your ad campaign.
# Calculate the probability that you will get 5 or more visitors
sum(prior$n_visitors >= 5)/length(prior$n_visitors)
## [1] 0.9854338
From the observation of 13 visits out of 100 shown ads, we find that it is very probable (about 99%) that future ad campaigns will generate at least 5 visits.
Bayes is very Flexible!
Since the parameter we care about is a proportion between 0 and 1, we will use the beta distribution, via rbeta(), to generate samples for our informed prior.
# Explore using the rbeta function
beta_sample <- rbeta(n = 1000000, shape1 = 1, shape2 = 1)
# Visualize the results
hist(beta_sample)
The above distribution is an uninformative prior because it assigns equal probability to all values from 0 to 1.
The rbeta() function in R is used to generate random variates from the beta distribution, which has three main parameters:
n: the number of random variates to generate, i.e., how many random values you want to create from the beta distribution.
shape1 (α): the first shape parameter of the beta distribution, often associated with the number of "successes" or "positive outcomes". It influences the distribution's skewness and the location of the mode.
shape2 (β): the second shape parameter, often associated with the number of "failures" or "negative outcomes".
The mean of the beta distribution is
\[ \text{mean} = \frac{\text{shape1}}{\text{shape1} + \text{shape2}} \]
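As a quick sanity check of this mean formula (a small illustration of my own, not part of the course code), the sample mean of a large rbeta() draw should sit very close to shape1 / (shape1 + shape2):
# Beta(100, 20) has mean 100 / (100 + 20) = 0.8333...
mean(rbeta(n = 1000000, shape1 = 100, shape2 = 20))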
Properties Based on shape1 and shape2:
# Modify the parameters
beta_sample1 <- data.frame('dist' = rbeta(n = 1000000, shape1 = 100, shape2 = 20) )
# Visualize the results
close21 <- ggplot(beta_sample1, aes(x = dist ) ) +
geom_histogram(fill = 'cadetblue', color = "#e9ecef" ) +
ggtitle('rbeta(shape1 = 100, shape2 = 20)' )
# Modify the parameters
beta_sample2 <- data.frame('dist' = rbeta(n = 1000000, shape1 = 20, shape2 = 100) )
# Visualize the results
close20 <- ggplot(beta_sample2, aes(x = dist ) ) +
geom_histogram(fill = 'cadetblue', color = "#e9ecef" ) +
ggtitle('rbeta(shape1 = 20, shape2 = 100)' )
grid.arrange(close21, close20, ncol = 2 )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Explore using the rbeta function
# Modify the parameters
beta_sample1 <- data.frame('dist' = rbeta(n = 1000000, shape1 = 0.5, shape2 = 10) )
# Visualize the results
close21 <- ggplot(beta_sample1, aes(x = dist ) ) +
geom_histogram(fill = 'cadetblue', color = "#e9ecef" ) +
ggtitle('rbeta(shape1 = 0.5, shape2 = 10)' )
# Modify the parameters
beta_sample2 <- data.frame('dist' = rbeta(n = 1000000, shape1 = 10, shape2 = 0.5) )
# Visualize the results
close20 <- ggplot(beta_sample2, aes(x = dist ) ) +
geom_histogram(fill = 'cadetblue', color = "#e9ecef" ) +
ggtitle('rbeta(shape1 = 10, shape2 = 0.5)' )
grid.arrange(close21, close20, ncol = 2 )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Now for some excitement! Let's update the model developed in the previous example with our new, informed prior distribution for proportion_clicks.
Old Prior:
n_draws <- 1000000 # a million draws
n_ads_shown <- 100
# Change the prior on proportion_clicks
# The uniform distribution is continuous and all numbers within the specified range (from min to max) have an equal chance of being drawn. In the context of this function call, runif will generate n_draws random numbers between 0.0 and 0.2.
proportion_clicks <-
runif(n = n_draws, # n: the number of random numbers to generate
min = 0.0, # min: the lower bound of the uniform distribution
max = 0.2) # max: the upper bound of the uniform distribution
# The rbinom function generates n_draws binomially distributed integers. Each integer represents the number of successes in n_ads_shown trials with a success probability given by proportion_clicks. If proportion_clicks is a vector, each draw uses a different success probability, useful for simulating varying click-through rates across different ads.
n_visitors <- rbinom(n_draws, # the number of random variates to generate
size = n_ads_shown, # the number of trials in each binomial experiment
prob = proportion_clicks) # the probability of success on each trial
# Use the prior that every click-through rate between 0 and 0.2 is equally likely
prior <- data.frame(proportion_clicks, n_visitors)
# Plug in the data: we actually got 13 click-throughs out of 100 ads
posterior <- prior[prior$n_visitors == 13, ]
# This plots the prior and the posterior in the same plot
par(mfcol = c(2, 1))
hist(prior$proportion_clicks,
xlim = c(0, 0.25))
hist(posterior$proportion_clicks,
xlim = c(0, 0.25))
New Prior:
n_draws <- 100000
n_ads_shown <- 100
# Change the prior on proportion_clicks (roughly 5% of click-through rate)
proportion_clicks <- rbeta(n_draws,
shape1 = 5, # together with shape2 = 95, this gives a prior mean of 5 / (5 + 95) = 5%
shape2 = 95)
n_visitors <- rbinom(n_draws,
size = n_ads_shown,
prob = proportion_clicks) # use the updated proportion_clicks
prior <- data.frame(proportion_clicks, n_visitors)
posterior <- prior[prior$n_visitors == 13, ]
# This plots the prior and the posterior in the same plot
par(mfcol = c(2, 1))
hist(prior$proportion_clicks,
xlim = c(0, 0.25))
hist(posterior$proportion_clicks,
xlim = c(0, 0.25))
Observe the change in the prior distribution and the leftward shift in the posterior distribution. Importantly, the informed prior tempers the overly optimistic results suggested by our small data sample, which showed a 13% click rate. The shape of the posterior distribution is influenced by both the observed data and prior knowledge.
So, which posterior should we choose? The naive model gives precedence to our collected data, while the informed posterior integrates external information. If this external information is accurate, the informed posterior is likely to provide more reliable estimates for future outcomes.
The next reason to use Bayesian data analysis is that it is easy to compare and contrast outcomes from different Bayesian models. A typical example of when you want to make comparisons is when you have two different experimental groups, like two different treatments or two different methods, and you want to compare these and see which seems best.
For example, say that the ads you’ve been running so far have been video ads, but you’ve been thinking that text ads could be more effective.
To try this out you also paid for 100 text ads to be shown on the social media site; they resulted in 6 clicks and visits, compared to 13 for the video ads.
It seems like video ads are more effective, but how sure should you be of this?
We could run the same model on the data from the video ads and the text ads, and take a look at the corresponding posteriors over the underlying proportions of clicks.
# Define parameters
n_draws <- 100000
n_ads_shown <- 100
proportion_clicks <- runif(n_draws, min = 0.0, max = 0.2)
n_visitors <- rbinom(n = n_draws, size = n_ads_shown,
prob = proportion_clicks)
prior <- data.frame(proportion_clicks, n_visitors)
# Create the posteriors for video and text ads
posterior_video <- prior[prior$n_visitors == 13, ]
posterior_text <- prior[prior$n_visitors == 6, ]
# Visualize the posteriors
hist(posterior_video$proportion_clicks, xlim = c(0, 0.25))
hist(posterior_text$proportion_clicks, xlim = c(0, 0.25))
It looks like it’s more probable that the proportion of clicks is lower for the text ad, but there is some overlap between the probability distributions. What we would want is to compare the performance of the text and video ads in such a way that we got out a new probability distribution showing the probable difference. And this is easy to get, especially since these two distributions are represented by long vectors of samples.
Here I've taken the samples that make up these two probability distributions, given them the names video_prop and text_prop (prop as in proportion), and put them into a data frame called posterior. As long as these samples are in a random order, and as long as I do it row by row, I can now calculate any type of derived quantity, and the resulting new distribution of samples will correctly retain the uncertainty of the original two distributions.
Now, we were interested in the difference, so for each row let's subtract text_prop from video_prop and put the result into the column prop_diff.
posterior <- data.frame(
video_prop = posterior_video$proportion_clicks[1:4000],
text_prop = posterior_text$proportion_clicks[1:4000])
# Calculate the posterior difference: video_prop - text_prop
posterior$prop_diff = posterior$video_prop - posterior$text_prop
Looking at the first couple of samples in prop_diff, we see that for most rows video ads are better than text ads. The whole vector of samples in prop_diff now represents the posterior probability distribution over the difference between video ads and text ads.
head(posterior, n = 20)
## video_prop text_prop prop_diff
## 1 0.12191776 0.05636659 0.06555117
## 2 0.16729588 0.06636324 0.10093264
## 3 0.18777864 0.06583992 0.12193872
## 4 0.12357419 0.06044150 0.06313269
## 5 0.15671284 0.04564712 0.11106572
## 6 0.15900959 0.08428587 0.07472372
## 7 0.11709679 0.03873027 0.07836652
## 8 0.14110583 0.07872896 0.06237688
## 9 0.10117732 0.04810872 0.05306860
## 10 0.12368113 0.05130128 0.07237985
## 11 0.08648399 0.07107574 0.01540825
## 12 0.11130187 0.06074299 0.05055887
## 13 0.19612095 0.05520095 0.14092000
## 14 0.14441862 0.06284006 0.08157855
## 15 0.14416403 0.09015081 0.05401322
## 16 0.18797261 0.08664352 0.10132909
## 17 0.11501931 0.03980307 0.07521624
## 18 0.13669431 0.05874641 0.07794790
## 19 0.16727311 0.06516589 0.10210721
## 20 0.11931239 0.04156272 0.07774967
So how does this distribution of prop_diff look?
# Visualize prop_diff
hist(posterior$prop_diff, xlim = c(-0.2, 0.25))
# Calculate the median of prop_diff
median(posterior$prop_diff)
## [1] 0.06703002
# Calculate the proportion
# That is, calculate the proportion of samples in posterior$prop_diff that are more than zero.
sum(posterior$prop_diff > 0 ) / length(posterior$prop_diff )
## [1] 0.949
Let's bring in the informed prior information as we did earlier and plot the distributions for video_prop, text_prop, and prop_diff.
posterior <- data.frame('video_prop' = rbeta(n = 1000000, shape1 = 13, shape2 = 87 ),
'text_prop' = rbeta(n = 1000000, shape1 = 6, shape2 = 94 ) ) %>%
mutate('prop_diff' = video_prop - text_prop )
posterior_long <- posterior %>%
pivot_longer(cols = everything(), names_to = 'dist', values_to = 'val' )
ggplot(posterior_long ,
aes(x=val, fill=dist)) + geom_histogram(alpha=0.4, position="identity")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Coming up next, a third reason Bayes is useful is because you can use the result of a Bayesian analysis to do Decision Analysis. What’s that? Simply put, it’s when you take the result of a statistical analysis and post-process it to make it more about what you really care about, in order to make informed decisions. Often, you don’t really care about the values of parameters. What you really care about is to save the most lives or to make the most money.
In the case of your zombie website you definitely care about money, the reason you want more visitors to your site is that you want them to buy your patented zombie repellent. Now that we understand the posterior distributions, which advertising campaign is likely to be more profitable: video or text?
Looking at the posteriors for the underlying proportion of clicks you got earlier, there is pretty strong evidence that video ads would give the most visitors per ad shown, but will they result in the most money?
Let’s do a small decision analysis to figure that out.
The social media site charges:
video_cost <- 0.25 #per video ad
text_cost <- 0.05 #per text ad
visitor_spend <- 2.53 #return for each visitor to site
Let's again work directly with the posterior samples, so we can calculate anything row-wise and the resulting distribution will be correct. Let's start with how much we earn on average per shown video ad: the proportion of people that will click, times how much we'll make on average per visitor, minus the cost of showing the ad.
posterior$video_profit <- posterior$video_prop * visitor_spend - video_cost
posterior$text_profit <- posterior$text_prop * visitor_spend - text_cost
posterior$profit_diff <- posterior$video_profit - posterior$text_profit
head(posterior)
## video_prop text_prop prop_diff video_profit text_profit profit_diff
## 1 0.10621877 0.09904858 0.007170194 0.018733491 0.20059290 -0.18185941
## 2 0.13863127 0.06481615 0.073815115 0.100737113 0.11398487 -0.01324776
## 3 0.08289982 0.08121904 0.001680783 -0.040263451 0.15548417 -0.19574762
## 4 0.10068524 0.05442333 0.046261915 0.004733661 0.08769102 -0.08295736
## 5 0.13012003 0.11315650 0.016963533 0.079203684 0.23628595 -0.15708226
## 6 0.11554587 0.07848810 0.037057772 0.042331049 0.14857489 -0.10624384
Looking at the first row, we see that if the underlying proportion of clicks actually was 0.106 for video ads, you would make about 1.9 cents on average per ad, and for most samples you'll make a profit.
We can calculate the text ad profit in the same way, but this time subtracting the cost of a text ad instead. And finally, we can calculate profit_diff, the probability distribution over the profit difference between video and text ads.
This distribution is now much closer to what we really care about: which type of ad will give us the most money, where a positive number is in favor of video ads and a negative number is in favor of text ads. We could now use profit_diff directly to make a data-informed decision. So what is that decision?
posterior_long <- posterior %>%
select(c(video_profit, text_profit)) %>%
pivot_longer(cols = everything(), names_to = 'dist', values_to = 'val' )
ggplot(posterior_long ,
aes(x=val, fill=dist)) + geom_histogram(alpha=0.4, position="identity")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(posterior, aes(x = profit_diff ) ) +
geom_histogram() +
geom_vline(xintercept = 0, color = 'red' ) +
geom_vline(xintercept = median(posterior$profit_diff ),
color = 'blue' ) +
annotate(geom = 'text',
x = median(posterior$profit_diff ) - 0.01,
y = 30000, label = 'Median Value',
color = 'blue', angle = 90 )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Calculate a "best guess" for the difference in profits
median(posterior$profit_diff)
## [1] -0.02483565
# Calculate the probability that text ads are better than video ads
sum(posterior$profit_diff < 0 ) / length(posterior$profit_diff )
## [1] 0.5964
So it seems that the evidence does not strongly favor either text or video ads. But if forced to choose at this point, we would choose text ads.
In conclusion, even though text ads get a lower proportion of clicks, they are also much cheaper. And, as you have calculated, there is a 59.6% probability that text ads are better.
The fourth reason to use Bayes is that you can also completely change the underlying statistical model, often with not too much effort. Let’s take a look at an example where you will completely switch out the binomial model we have worked with so far.
Why? Well, you have some new data to analyze, where the Binomial model isn’t going to work well. See, you happen to have a friend, also in the zombie e-commerce business, and she has offered to put up an ad for your site as a banner on her site, for a small daily fee, of course. The thing is, you wouldn’t pay per view, the banner is always up, you would pay per day.
As a trial, she already put it up on her site for just one day, and during that time it resulted in 19 clicks and visits to your site. The question is: how many daily site visits should we expect, on average, if we pay for this banner? Now it's obvious that the binomial model won't directly work here: it models the number of successes out of a total number of shown ads, but now the ad is shown all the time, and we just have a daily count. What we could do is split each day into minutes and assume some unknown proportion of minutes results in a success, a click on the ad. One problem with this generative model is that it can't model more than one click per minute; either a minute is a success or it's not. We could, sort of, fix this by splitting the day into seconds instead, or milliseconds, or even smaller units.
In the limit, we end up with another generative model that also has a name: the Poisson distribution.
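To make this limit argument concrete (an illustration of my own, not part of the course code), compare a binomial model where the day is split into minutes with a Poisson model that has the same mean; the two simulated distributions should look nearly identical.
# Split the day into minutes: 24 * 60 trials, each with a small success probability
# chosen so that the expected number of clicks per day is 20
n_minutes <- 24 * 60
binom_clicks <- rbinom(n = 100000, size = n_minutes, prob = 20 / n_minutes)
pois_clicks <- rpois(n = 100000, lambda = 20)
par(mfcol = c(2, 1))
hist(binom_clicks)
hist(pois_clicks)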
It has one parameter: The mean number of events per measured time unit, in this case, it’s the number of clicks per day.
In R you can sample from the Poisson distribution using the rpois function, and just like with the rbinom function you can plug in different parameter values and use it to generate simulated data. For example, if we knew that the mean number of clicks per day was 20, then this would give you the probability distribution over how many clicks you would expect to get, say, tomorrow. Now, of course, for the banner ad we don't know the underlying mean number of clicks, so let's try to find that out.
#let's say the mean clicks is 20/day
n_clicks <- rpois(n = 100000, lambda = 20 )
hist(n_clicks )
# Simulate from a Poisson distribution and visualize the result
# let's say the mean clicks is 3/day
x <- rpois(n = 10000, lambda = 3)
hist(x )
Note that the Poisson distribution cannot be negative; it is defined for non-negative integers (0, 1, 2, …) because it models the number of times an event occurs in a fixed interval of time or space.
Here is a good example of the Poisson distribution. Let's say that you run an ice cream stand and on cloudy days you sell 11.5 ice creams on average. Unfortunately, it's a cloudy day, and you won't break even unless you sell 15 or more ice creams. Assuming the Poisson model is reasonable, use x to calculate the probability that you'll break even.
Tip: For this, you need to calculate what proportion of samples in x are >= 15.
# Simulate 10,000 cloudy days' worth of ice cream sales from a Poisson distribution with a mean of 11.5
x <- rpois(n = 10000, lambda = 11.5)
hist(x )
# Calculate the probability of break-even
sum(x >= 15 )/length(x )
## [1] 0.1842
Is it likely that you will break even on a cloudy day, or should you stay at home?
Clicks per day instead of clicks per ad
When you put up a banner on your friend’s site you got 19 clicks in a day, how many daily clicks should you expect this banner to generate on average?
To accommodate the new banner data you’re now going to change the model so that it uses a Poisson distribution instead.
# Change the model according to instructions
n_draws <- 100000
mean_clicks <- runif(n_draws, min = 0, max = 80)
n_visitors <- rpois(n = n_draws, mean_clicks)
prior <- data.frame(mean_clicks, n_visitors)
posterior <- prior[prior$n_visitors == 19, ]
# Visualize mean_clicks
hist(prior$mean_clicks )
hist(posterior$mean_clicks)
The model is complete!
What range could you expect the mean number of daily clicks to be in?
12 ~ 28 daily clicks on average.
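Rather than reading the range off the histogram, you can compute a credible interval from the posterior samples with the same quantile() approach used earlier; it should roughly match the eyeballed range.
# 90% credible interval for the underlying mean number of daily clicks
quantile(posterior$mean_clicks, c(0.05, 0.95))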
The method we’ve been using to fit Bayesian models so far has involved generating a large number of samples representing the joint probability distribution over parameters and unknown data. And then conditioning on the observed data by filtering away all samples that don’t match.
This is a simple method, and we implemented it in just a couple of lines of R code, but the bad news is that this computational method scales horribly, both with larger data sets and with more complicated models.
The good news is that faster computational methods exist. To work with and understand them, one does need to know a bit about probability theory, but the basic notation and rules used in the rest of this section are all you'll need for now.
In earlier exercises we didn't calculate probabilities directly; we simulated. If we wanted to figure out the probability that a generative model would generate a certain value, we simulated a large number of samples and then counted the proportion of samples taking on this certain value. This was easy to do for generative models that correspond to common probability distributions, like the binomial or the Poisson distribution, as you can use the "r"-functions, like rbinom and rpois, for this.
For example, here is how we got p(n_visitors = 13 | prob_success = 10%), the probability that 13 out of 100 people would visit your site if the underlying proportion of clicks was 10%.
n_visitors <- rbinom(n = 100000, size = 100, prob = 0.1)
sum(n_visitors == 13) / length(n_visitors)
## [1] 0.0741
But one thing that faster Bayesian computational methods have in common is that they require that this probability can be calculated directly rather than simulated. Fortunately, for these common distributions someone has already figured out how to do this, and in R we can use the "d"-functions, like dbinom and dpois, for this. Using dbinom we can directly calculate the probability that 13 out of 100 would visit your site, like this. This is obviously much more efficient than having to generate 100,000 or so samples first.
#calculate P(n_visitors = 13 | prob_success = 10%)
dbinom(13, size = 100, prob = 0.1)
## [1] 0.07430209
We can also manipulate the results using the rules we learned earlier. For example, the probability of getting 13 or 14 visitors is about 12.6%, calculated here using the sum rule.
#calculate P(n_visitors = 13 or n_visitors = 14 | prob_success = 10 )
dbinom(13, size = 100, prob = 0.1 ) + dbinom(14, size = 100, prob = 0.1 )
## [1] 0.1256059
Finally, if we want to get a whole probability distribution, we’ll have to calculate the probability for a range of values. But the d-functions are generally vectorized, so that is easy!
If we wanted the probability distribution over the number of visitors we would expect, conditional on the proportion of success being 10%, we could calculate it like this. Instead of the resulting probability distribution being represented by a large number of samples, as in the earlier exercises, it's now represented by a vector giving the probability of each outcome. And we can plot it like this.
#calculate P(n_visitors | prob_success = 0.1 )
n_visitors <- seq(0, 100, by = 1 )
probability <- dbinom(n_visitors, size = 100, prob = 0.1 )
plot(n_visitors, probability, type = 'h' )
The binomial is an example of a discrete distribution; it defines probability over whole numbers: 0, 1, 2, 3, and so on. But there are also continuous distributions, and earlier we used the uniform distribution like this. There is also a d-version of runif, dunif, and you might expect that it would return the probability of the outcome being 0.12, assuming a uniform distribution between 0 and 0.2.
dunif(x = 0.12, min = 0, max = 0.2 )
## [1] 5
But it does not. For starters, a probability can be at most 1, and here we got a 5 back. That's because for continuous distributions you can't really talk about the probability of a single value; the probability of getting exactly 0.12 is effectively zero.
So instead you get the probability density, a number that on its own doesn't tell you much, but that you can view as a relative probability: a value with twice the probability density of another value is twice as probable. This is also where the d in dunif and dbinom comes from: it stands for density.
x = seq(0, 0.2, by = 0.01)
dunif(x, min = 0, max = 0.2 )
## [1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
Since we're using a uniform distribution here, we should not be surprised that any value between 0 and 0.2 returns the same number: 5. Within a uniform probability distribution any value is equally likely, and the density has to integrate to 1 over the range, so it equals 1 / (0.2 - 0) = 5.
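The course doesn't go into this, but if you do want an actual probability from a continuous distribution, you integrate the density over an interval. In R the corresponding "p"-functions give cumulative probabilities, so, for example:
# Probability that a uniform(0, 0.2) value falls between 0.10 and 0.14
punif(0.14, min = 0, max = 0.2) - punif(0.10, min = 0, max = 0.2)
# equals (0.14 - 0.10) * 5 = 0.2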
From rbinom to dbinom
Below is code that uses rbinom to estimate the probability of getting 13 visitors out of 100 shown ads when the underlying proportion of clicks is 10%, followed by the same quantity calculated directly with dbinom. In probability notation, both are calculating P(n_visitors = 13 | proportion_clicks = 10%).
# The rbinom approach
n_ads_shown <- 100
proportion_clicks <- 0.1
n_visitors <- rbinom(n = 99999,
size = n_ads_shown, prob = proportion_clicks)
prob_13_visitors <- sum(n_visitors == 13) / length(n_visitors)
prob_13_visitors
## [1] 0.07447074
# The dbinom approach
n_ads_shown <- 100
proportion_clicks <- 0.1
prob_13_visitors <- dbinom(13,
size = n_ads_shown, prob = proportion_clicks)
prob_13_visitors
## [1] 0.07430209
Alright! You got a similar result as with rbinom, but dbinom is much more efficient! You do not need to simulate the data 99999 times to compute the probability.
In the exercises you did earlier, we simulated a large number of samples to end up with a joint distribution over the uncertain parameter, the underlying proportion of clicks, and the future data, how many clicks we would get out of a hundred shown ads. Then we conditioned on the data we got, thirteen visitors, and that allowed the model to become more certain about the underlying proportion of clicks.
This, to condition on the data, is what Bayesian inference is. Now, you’ve learned the basic probability rules. You’ve learned how to calculate probabilities and densities using probability distributions. We’re going to put this together to do Bayesian inference.
But instead of simulating, we'll calculate everything directly in R, using the same example as in the earlier exercises.
# Set the total number of ads shown
n_ads_shown <- 100
# Create a sequence of visitors from 0 to 100
n_visitors <- seq(0, 100, by = 1)
# Generate a sequence for the proportion of clicks from 0 to 1 by 0.01 increments
proportion_clicks <- seq(0, 1, by = 0.01)
# Create all combinations of proportion of clicks and number of visitors
pars <- expand.grid(proportion_clicks = proportion_clicks,
n_visitors = n_visitors)
# Assign a uniform prior distribution to proportion of clicks between 0 and 0.2
pars$prior <- dunif(pars$proportion_clicks, min = 0, max = 0.2)
# Calculate the likelihood of the number of visitors clicking based on the binomial distribution
pars$likelihood <- dbinom(pars$n_visitors, size = n_ads_shown,
prob = pars$proportion_clicks)
# Compute the unnormalized posterior probability
pars$probability <- pars$likelihood * pars$prior
# Normalize the probabilities so they sum to 1
pars$probability <- pars$probability / sum(pars$probability)
head(pars)
## proportion_clicks n_visitors prior likelihood probability
## 1 0.00 0 5 1.000000000 0.0476190476
## 2 0.01 0 5 0.366032341 0.0174301115
## 3 0.02 0 5 0.132619556 0.0063152169
## 4 0.03 0 5 0.047552508 0.0022644051
## 5 0.04 0 5 0.016870319 0.0008033485
## 6 0.05 0 5 0.005920529 0.0002819300
# Check that the sum of the normalized probabilities is 1
sum(pars$probability)
## [1] 1
Now, after all this calculation, if we plot the probability of each proportion_clicks and n_visitors combination, we get this:
ggplot(pars, aes(x = n_visitors, y = proportion_clicks,
fill = probability ) ) +
geom_tile() +
xlim(c(0, 35 ) ) +
ylim(c(0, 0.2 ) ) +
scale_fill_gradient(low="white", high="darkblue")
## Warning: Removed 9555 rows containing missing values or values outside the scale range
## (`geom_tile()`).
Pretty similar, right? Finally, let’s bring in the actual data. As there were 13 clicks out of a hundred shown ads, we can now remove all rows where n_visitors is something else. And while we’re at it we’ll normalize again, to make sure the remaining probabilities sum to one. Here’s the result:
pars13 <- pars[ pars$n_visitors == 13, ]
pars13$probability <- pars13$probability / sum(pars13$probability)
ggplot(pars13, aes(x = n_visitors, y = proportion_clicks,
fill = probability ) ) +
geom_tile() +
xlim(c(0, 35 ) ) +
ylim(c(0, 0.2 ) ) +
scale_fill_gradient(low="white", high="darkblue")
## Warning: Removed 82 rows containing missing values or values outside the scale range
## (`geom_tile()`).
If we now zoom in on this slice and plot from the side, instead of looking at it from above, we see that we have retrieved the same probability distribution over the likely underlying proportions of clicks as we got when we simulated in the exercise we did earlier. But running the simulation is much slower on my computer than calculating it directly!
plot(pars13$proportion_clicks, pars13$probability, type = "h")
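To put a rough number on "much slower" (a quick-and-dirty check of my own, not part of the course code), you can time both approaches with base R's system.time(); the exact numbers will depend on your machine.
# Simulation-based (rejection sampling) approach
system.time({
  prop_sim <- runif(100000, min = 0, max = 0.2)
  visits_sim <- rbinom(100000, size = 100, prob = prop_sim)
  post_sim <- prop_sim[visits_sim == 13]
})
# Direct calculation over a grid of parameter values
system.time({
  grid <- seq(0, 1, by = 0.01)
  post_direct <- dbinom(13, size = 100, prob = grid) * dunif(grid, min = 0, max = 0.2)
  post_direct <- post_direct / sum(post_direct)
})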
We have seen Bayesian modeling as a way of quantifying uncertainty using probability, and we have covered what is needed to do Bayesian inference. You need data, a Bayesian model that consists of a generative model and prior probability distributions over parameters and unknowns, and finally you need a computational method to fit the model.
We looked at one basic computational method: rejection sampling, where you directly sample from the generative model and then condition on the data by filtering away all samples that don't match. There are other computational methods, such as grid approximation (essentially what we did when we calculated the probabilities directly over a grid of parameter values), Markov chain Monte Carlo methods, and many others.
We mainly looked at two generative models: The binomial distribution and the Poisson distribution.
But this was a quick introduction, and there are many many things we didn’t cover. We only looked at rather simple Bayesian models, and you might be left with the impression that Bayes is for these simple situations. But a Bayesian approach can be used all over machine learning and statistics, and there are Bayesian versions of things like time series models and deep learning. A great strength with Bayes is that you have great flexibility when setting prior distributions. But to be honest, we didn’t talk much about how to decide what priors and models to use.
We didn't contrast Bayesian statistics with classical statistics, which is so often done. They are two different approaches, and it is better to learn the good things about both of them. We didn't look at more advanced computational methods (there are many Markov chain Monte Carlo algorithms), and we didn't look at the many good computational tools and packages that now exist, like Stan and JAGS. Exploring one of these tools would be a good next step after this workshop.
Bayesian data analysis is increasingly being used all over data science and machine learning, and even if we just scratched the surface, you should now have the necessary knowledge to explore this exciting topic further.
Some useful resources you may want to check out for advancing your knowledge about Bayesian analysis:
From the website: An engaging, sophisticated, and fun introduction to the field of Bayesian statistics, Bayes Rules!: An Introduction to Applied Bayesian Modeling brings the power of modern Bayesian thinking, modeling, and computing to a broad audience. In particular, the book is an ideal resource for advanced undergraduate statistics students and practitioners with comparable experience. The book assumes that readers are familiar with the content covered in a typical undergraduate-level introductory statistics course. Readers will also, ideally, have some experience with undergraduate-level probability, calculus, and the R statistical software. Readers without this background will still be able to follow along, so long as they are eager to pick up these tools on the fly, as all R code is provided. Bayes Rules! empowers readers to weave Bayesian approaches into their everyday practice. Discussions and applications are data driven. A natural progression from fundamental to multivariable, hierarchical models emphasizes a practical and generalizable model building process. The evaluation of these Bayesian models reflects the fact that a data analysis does not exist in a vacuum.