Welcome everyone to this R demo on the Fundamentals of Bayesian Analysis.

Before we begin, I’d like to share something important with you. The material we’ll be using in this workshop, particularly the RMarkdown file, is adapted from the “Fundamentals of Bayesian Data Analysis in R” course available on DataCamp (I have made a lot of edits to ensure smooth transitions between sections and clear comprehension). After reviewing numerous introductory materials on Bayesian analysis, I’ve found that this course offers the best introduction for those who are completely new to the Bayesian approach. It provides a clear and accessible entry point into this fascinating area.

Now, let’s set our expectations for today’s session. Bayesian analysis is an extensive field, and it’s essential to understand that becoming an expert in Bayesian methods is a journey that extends well beyond what a one-hour workshop can offer. It’s similar to trying to master all of statistics in a single hour - a challenging, if not impossible, task.

Bayesian analysis is not just a single model; it’s an approach, a different way of thinking about and analyzing data compared to the classical statistical (frequentist) approach. It’s adaptable and can be applied to various models, depending on your data and the research questions you wish to answer.

In today’s workshop, we’ll be covering the very basics of Bayesian analysis. Consider this session as your first step into a broader and deeply engaging world. While we’ll only be scratching the surface, I hope this introduction will ignite your interest and provide a solid foundation for your further exploration into Bayesian analysis.

For those keen to delve deeper, I’m more than happy to recommend additional resources for further study. Remember, today is about setting the groundwork, a starting point from which your understanding and expertise can grow.

Thank you for joining me on this intellectual adventure. Let’s embark on this learning journey together!

What is Bayesian Inference?

A First Taste of Bayes

Bayesian Inference in a Nutshell:

A method for figuring out unobservable quantities given known facts that uses probability to describe the uncertainty over what the values of the unknown quantities could be.

Let’s Try Some Bayesian Data Analysis

Bayesian inference == Probabilistic inference.

You are making probability statements about parameters themselves, not just about the data.

Probability

  • A number from 0 to 1
  • A statement about the certainty / uncertainty of an event
  • 1 is complete certainty that a case is TRUE
  • 0 is complete certainty that a case is FALSE
  • Not only for binary yes / no events, but can be applied to continuous variables

The role of probability distributions in Bayesian data analysis is to represent uncertainty, and the role of Bayesian inference is to update probability distributions to reflect what has been learned from data.

Let’s look at a brief illustrative example.

A Bayesian model for the proportion of success

ex: What is the proportion of successful treatments with a new drug?

  • prop_model(data )
  • The data is a vector of failures represented by 1s and 0s
  • There is an unknown underlying proportion of success
  • If the data point is a success, it is only affected by the proportion of success
  • Prior to seeing any data, any underlying proportion of success is equally likely
  • The result is a probability distribution that represents what the model knows about the underlying proportion of success

The output of prop_model is a plot showing what the model learns about the underlying proportion of success from each data point in the order you entered them.

Zombie apocalypse:

Let us say there is a new drug to treat zombieism. It has never been tested before, but the results from this pilot test have two recoveries out of 4 subjects. Here, prop_model plots the successive posterior probabilities that the zombie antidote will be successful:

data <- c(1, 0, 0, 1 )
prop_model(data)
## Warning in ggridges::geom_density_ridges(stat = "identity", color = "white", :
## Ignoring unknown parameters: `size`

At n=0 there is no data, and all the model knows is that it’s equally probable that the proportion of success is anything from 0% to 100%. At n=4 all data has been added, and the model knows a little bit more.

Next we observe several more subjects, however, they are all failures, leaving just 2 successes out of 13 subjects. Here is how the posterior probability changes with the new case information:

data <- c(1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 ) # only 2 failures
prop_model(data)
## Warning in ggridges::geom_density_ridges(stat = "identity", color = "white", :
## Ignoring unknown parameters: `size`

It seems like it’s not a perfect drug, but between 5% to 40% cured zombies is better than nothing!

What if we observe a lot more?

data <- rbinom(50, 1, 0.2) # the function generates 50 random numbers with parameters n = 1 (number of trials) and p = 0.2 (probability of success in each trial).
prop_model(data)
## Warning in ggridges::geom_density_ridges(stat = "identity", color = "white", :
## Ignoring unknown parameters: `size`

Subjective observation time: the more information collected about a variable, the closer the observed prior distribution approaches the ‘true’ probability of a success.

  • Take a prior probability distribution use observations from the data to update a posterior probability distribution
  • prior - a probability distribution that represents what the model knows before seeing the data
  • posterior - a probability distribution that represents what the model knows after having seen the data.
data = c(1, 0, 0, 1, 0, 0,
         0, 0, 0, 0, 0, 0, 0)
         
# Extract and explore the posterior
posterior <- prop_model(data)
## Warning in ggridges::geom_density_ridges(stat = "identity", color = "white", :
## Ignoring unknown parameters: `size`

head(posterior) 
## [1] 0.12036586 0.21641050 0.27290850 0.10608833 0.19117715 0.09285969
# returns a large random sample from the posterior over the underlying proportion of success
# each value in posterior is a possible "proportion of success" based on your data and prior

Let’s plot the posterior distribution:

hist(posterior, breaks = 30, xlim = c(0,1 ), col = 'palegreen4' )

Hey, that’s a good looking plot! This histogram depicts a bell-shaped distribution of the probabilities for curing a zombie, centered around 5-40%.

Compare this histogram to the plot produced directly by prop_model. You should notice that the histogram and the posterior distribution (at n=13) describe the same distribution.

Summarizing the zombie drug experiment

The point of working with samples from a probability distribution is that it makes it easy to calculate new measures of interest. The following tasks are about doing just this!

A point estimate is a single number used to summarize what’s known about a parameter of interest. It can be seen as a “best guess” of the value of the parameter. A commonly used point estimate is the median of the posterior. It’s the midpoint of the distribution, and it’s equally probable for the parameter value to be larger than the median as it is to be smaller than it.

# Calculate the median
median(posterior)
## [1] 0.1878639

So, a best guess is that the drug would cure around 18% of all zombies. Another common summary is to report an interval that includes the parameter of interest with a certain probability. This is called a credible interval (CI). With a posterior represented as a vector of samples you can calculate a CI using the quantile() function.

# Calculate the credible 90% interval
quantile(posterior, c(0.05, 0.95))
##         5%        95% 
## 0.06034651 0.38731579

According to the credible interval, there is a 90% probability that the proportion of zombies the drug would cure is between 6% and 38%, which is aligned with the green histogram we observed eaerlier. (Here we have to be careful to remember that the percentage of cured zombies and the percentage of probability are two different things.)

Now, there is a rival zombie laboratory that is also working on a drug. They claim that they are certain that their drug cures 7% of the zombies it’s administered to. Can we calculate how probable it is that our drug is better? Yes, we can! But it’s a two stage process.

# Calculate the sum
sum(posterior > 0.07) # the number of cases in which our drug is better than the rival drug
## [1] 9277
# Calculate the probability
sum(posterior > 0.07)/length(posterior) # the probability of our drug shows higher curing rate than the rival drug
## [1] 0.9277

Put the results in a journal

“Given the data of two cured and 11 relapsed zombies, and using the Bayesian model described before, there is a 90% probability that our drug cures between 6% and 39% of treated zombies. Further, there is 93% probability that our drug cures zombies at a higher rate than the current state of the art drug.”

How does Bayesian Inference Work?

The Parts Needed for Bayesian Inference

Bayesian Inference = Data + a Generative Model + Priors

Generative Model can be a computer program, mathematical expression or a set of rules that can be fed parameters values to be used to simulate data.

Take a generative model for a spin

Below you have the R code that implements the generative model we just developed.

# The generative zombie drug model

# Set parameters
prop_success <- 0.42 # The probability of curing the zombie
n_zombies <- 100 # The total number of zombies we have

# Simulating data
data <- c()
for(zombie in 1:n_zombies) {
  data[zombie] <- runif(1, min = 0, max = 1) < prop_success
} # Runs a loop for each zombie, where a random number between 0 and 1 is generated. If the number is less than prop_success, it means the zombie is cured (TRUE or 1), otherwise not cured (FALSE or 0).

data <- as.numeric(data) # Convert the logical vector to numeric (TRUE/FALSE to 1/0)

data # The simulated zombie data showing whether each zombie gets cured after receiving our drug (0 = cured; 1 = not cured)
##   [1] 0 1 0 1 0 0 0 0 0 1 1 1 1 0 1 0 0 0 0 0 1 1 1 0 0 0 0 1 0 1 1 0 1 1 0 0 0
##  [38] 1 1 0 0 0 0 1 1 0 0 1 1 1 0 0 1 1 1 0 1 0 1 1 0 1 0 1 0 1 0 0 1 0 0 0 0 1
##  [75] 0 1 1 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0 1 0 0 0 1 0 1 1
# Count cured
data <- sum(data)
data
## [1] 47

Perfect! Some zombies got cured in this simulation, but far from all.

Take the binomial distribution for a spin

It turns out that the generative model you ran last exercise already has a name. It’s called the binomial process or the binomial distribution. In R you can use the rbinom function to simulate data from a binomial distribution. The rbinom function takes three arguments:

  • n The number of times you want to run the generative model
  • size The number of trials. (For example, the number of zombies you’re giving the drug.)
  • prob The underlying proportion of success as a number between 0.0 and 1.0.
# Try out rbinom
rbinom(n = 1, size = 100, prob = 0.42)
## [1] 42
# Change the parameters
rbinom(n = 200, size = 100, prob = 0.42)
##   [1] 40 42 44 44 41 44 44 39 42 44 38 41 43 48 44 38 43 42 50 52 39 37 51 46 42
##  [26] 41 46 44 31 38 37 43 41 48 38 50 38 31 42 46 36 37 36 41 46 43 45 50 47 45
##  [51] 43 43 47 46 49 43 54 43 36 41 40 43 37 45 41 41 36 46 40 46 43 43 41 46 36
##  [76] 42 35 36 45 41 43 51 41 44 46 50 33 35 34 48 41 39 40 40 49 40 40 45 42 44
## [101] 53 50 44 38 38 41 42 46 41 42 42 42 32 41 51 41 40 41 44 45 37 46 41 48 40
## [126] 40 39 41 53 50 42 47 43 47 44 48 43 36 44 39 45 40 51 48 33 49 45 45 41 41
## [151] 48 42 38 39 38 48 49 45 49 44 46 45 46 41 33 41 42 40 42 40 42 43 46 40 47
## [176] 45 37 43 43 33 35 40 40 49 52 36 43 49 45 43 44 47 34 42 47 39 39 40 44 55

Nice! That’s a lot of simulated zombies right there. Each number shown here represents the number of cured zombies we would observe out of 100 if we had given the drug to all of the zombies. Note that we repeat this process 200 times.

Using a Generative Model

We can apply our generative model to a relatively large number of simulated trials. The results will for the probability distribution for our process:

cured_zombies <- rbinom(n = 100000, size = 100, prob = 0.07 )
post_df <- data.frame('Cured_Zombies' = cured_zombies )

ggplot(post_df, aes(x = Cured_Zombies ) ) +
  geom_histogram(binwidth = 1, fill = 'palegreen4', color = "#e9ecef" ) +
  ggtitle('Cured Zombies Probability Distribution' )

However, it is more often the case that we have data, and do not need a generative model to create it. However, we do not know the parameters that drive the process. Bayesian Inference is used to invert the probability to work backwards and learn about the underlying process given the data at hand.

New Situations: instead of the zombie example, let’s look at click-through rates….

How many visitors to a site?

To get more visitors to your website you are considering paying for an ad to be shown 100 times on a popular social media site. According to the social media site, their ads get clicked on 10% of the time. You would like the ad campaign to result in at least 5 visitors to your site.

set.seed(123) 

# Fill in the parameters
n_samples <- 100000 # repeat the process 100000 times
n_ads_shown <- 100 # each time 100 ads are shown
proportion_clicks <- 0.1 # according to the social media site, 10% of the ads get clicked

n_visitors <- rbinom(n_samples, size = n_ads_shown, 
                     prob = proportion_clicks) # The rbinom() function will generate n_samples random numbers, each one representing the number of clicks out of n_ads_shown, with the probability of a click being proportion_clicks. This could model, for example, how many times people click on an advertisement after it has been shown a certain number of times, with a given click-through rate (proportion_clicks).

post_df <- data.frame('Site_Visits' = n_visitors )

ggplot(post_df, aes(x = Site_Visits ) ) +
  geom_histogram(binwidth = 1, fill = 'cadetblue', color = "#e9ecef" ) +
  ggtitle('Number of Visitors Distribution' )

This graph shows how many site visits we can get if we decide to show these 100 ads (if we repeat the process 100000 times). Notice that sometimes we get fewer clicks, and sometimes we get more. The probability of getting a certain number of clicks actually varies across time.

Question

You would like the ad campaign to result in at least 5 visitors to your site.

Eyeballing the plot you just produced, what is the probability you will get 5 or more visitors because of the ad? (Roughly more than 90%)

Prepresenting Uncertainty with Priors

And now to address the Priors.

Prior Probability Distribution: represents how uncertain the model is about the parameters before seeing any data. Ideally, this uncertainty should reflect our uncertainty. The reason it is called a prior is because it represents the uncertainty before having included information about the data.

Take our website click-through example, you’re not so sure that your ad will get clicked on exactly 10% of the time. Instead of assigning proportion_clicks a single value you are now going to assign it a large number of values drawn from a probability distribution.

n_samples <- 100000
n_ads_shown <- 100

set.seed(123)

# model a uniform probability distribution that ranges from 0 to 0.2 to represent the uncertainty of our value for this prior
proportion_clicks <- runif(n = n_samples, min = 0.0, max = 0.2 )
n_visitors <- rbinom(n_samples, n_ads_shown, proportion_clicks )

Because the rbinom function is vectorized the first value of proportion_clicks is used to sample the first value in n_visitors, the second value in proportion_clicks is used for the second in n_visitors, and so on. The result is that the samples in n_visitors now also incorporate the uncertainty in what the underlying proportion of clicks could be.

Visualize the probability distribution for the prior, proportion_clicks.

pclick_df <- data.frame('probability_click' = proportion_clicks )

pclick_plot <- ggplot(pclick_df, aes(x = probability_click ) ) +
  geom_histogram(fill = 'cadetblue', color = "#e9ecef" ) +
  ggtitle('Probability Distribution of the Prior' )

nvis_df <- data.frame('Site_Visits' = n_visitors )

nvis_plot <- ggplot(nvis_df, aes(x = Site_Visits ) ) +
  geom_histogram(binwidth = 1, fill = 'cadetblue', color = "#e9ecef" ) +
  ggtitle('Number of Visitors Distribution' )

grid.arrange(pclick_plot, nvis_plot, ncol = 2 )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

sum(n_visitors >=5) / length(n_visitors)
## [1] 0.75295

You shouldn’t be surprised to see that the uncertainty over proportion_clicks is just as you specified it to be: Uniform between 0.0 and 0.2 (except for some small variations in the height of the bars because we took a random sample using runif).

Uncertainty changed the distribution. We went from a random distribution of about 90-95% to, with added uncertainty of 0 to 20%, 75.3% likelihood that at least 5 people will click on an ad.

The envelope of the n_visitors data from this generative model is much different. The added uncertainty for the prior increases the uncertainty in the number of visitors the site will receive.

Bayesian Models and Conditioning (The role of your data)

Now we have a generative model and priors, and together these form a Bayesian model. So if some statistician is talking about that she’s developed a Bayesian model of something, that just means she has specified a generative model for the data plus priors over what is uncertain in the model. The only thing we need now to do Bayesian inference is some data. And it’s actually the case that you got tired of me talking about probability distributions and just decided to run your ad campaign. And it went better than expected!

When the ad was shown a 100 times, 13 people clicked and visited your site. So now we have data, and we are ready to do some Bayesian inference, we are ready to use this data to learn about how likely people are to click on your ad.

How could we do that? Let’s take a look at the model we have so far and see if we can figure that out. The prior over proportion_clicks and the generative model for n_visitors together define a prior probability distribution over both parameters and future, unknown data. I’m going to put these together in a data frame and call it just prior.

# This data frame represents the joint probability distribution over proportion_clicks and n_visitors together.
prior_df <- data.frame('proportion_clicks' = proportion_clicks, 'n_visitors' = n_visitors )

Now the data frame prior represents the joint probability distribution of proportion_clicks and n_visitors together. We could have a look at the first couple of samples in this data frame.

But it’s easier to see what going on if we plot this data frame as a scatter plot.

#visualize as a scatter plot....
prior_plot <- ggplot(prior_df, aes(x = n_visitors, y = proportion_clicks ) ) +
  geom_point(alpha = 1/10 ) +
  theme(legend.position = 'none')
prior_plot <- ggMarginal(prior_plot, type = 'histogram', 
                          xparams = list(bins=20),
                          yparams = list(bins=20))
prior_plot

On the x-axis we have the number of visitors, and on the y-axis we have the underlying proportion of clicks. The histograms show the marginal distributions, and you should recognize them both, the one for the proportion of clicks is uniform between 0% and 20%, just as we defined it to be, and the one for n_visitors is the one we looked at last exercise. We clearly see a pattern in the plot:

  • The higher the underlying proportion of clicks, the more visitors we’ll probably get.
  • The more visitors we get, the higher the underlying proportion of clicks probably is.

If we knew that proportion_clicks was exactly 10%, we could condition on this, that is, remove all samples that don’t fulfill the condition of proportion_clicks being 10%. This would also reduce the uncertainty in how many visitors we would get. If we knew the proportion of clicks was 10%, we would be pretty certain we would end up with between 2 and 15 visitors.

However, there is no point to this, because the problem was that we didn’t know what the underlying proportions of clicks could be.

But what we can know is the data, and we can condition on the data too. Say that 5 out of a 100 clicked when we ran the ad campaign. If we remove all samples that doesn’t fulfill this condition, we also reduce the uncertainty in what the underlying proportion of clicks could be.

We end up with a distribution that doesn’t look at all like the uniform distribution between 0% and 20% we defined before. And if we actually would have gotten 5 visitors we could have concluded that proportion_clicks probably would be between 3% and 10%.

And similarly we could condition on other values for the data. We have now reached the essence of what Bayesian inference is, it is conditioning on the data, in order to learn about what parameters values likely gave rise to this data. So, when you ran the ad campaign you got 13 site visits out of a 100 shown ads. With this data, we can now update the Bayesian model.

# Create the prior data frame
prior <- data.frame(proportion_clicks, n_visitors)

# Examine the prior data frame
head(prior)
##   proportion_clicks n_visitors
## 1        0.05751550          6
## 2        0.15766103          9
## 3        0.08179538         11
## 4        0.17660348         11
## 5        0.18809346         16
## 6        0.00911130          2

The reason we’ve called it prior is because it represents the uncertainty before (that is, prior to) having included the information in the data. Let’s do that now!

prior$n_visitors represented the uncertainty over how many visitors you would get because of the ad campaign. But now you know you got exactly 13 visitors.

# Create the posterior data frame
posterior <- prior[prior$n_visitors == 13, ]

# Visualize posterior proportion clicks
hist(posterior$proportion_clicks, col = "palegreen4")

Just now, you updated the probability distribution over the underlying proportions of clicks (proportion_clicks) using new data. Now we want to use this updated proportion_clicks to predict how many visitors we would get if we reran the ad campaign.

The result from the last example is still in the data frame posterior, but if you look at posterior$n_visits you’ll see it’s just 13 repeated over and over again. This makes sense as posterior represents what the model knew about the outcome of the last ad campaign after having seen the data.

# Assign posterior to a new variable called prior
prior <- posterior

# Take a look at the first rows in prior
head(prior)
##     proportion_clicks n_visitors
## 28          0.1188284         13
## 58          0.1506616         13
## 100         0.1023011         13
## 115         0.1441193         13
## 180         0.1163500         13
## 223         0.1456789         13
# Let's replace prior$n_visitors with a new sample and visualize the result
n_samples <-  nrow(prior)
n_ads_shown <- 100
prior$n_visitors <- rbinom(n_samples, size = n_ads_shown,
                           prob = prior$proportion_clicks)
hist(prior$n_visitors, col = "palegreen4")

The plot shows a probability distribution over the number of site visitors a new ad campaign would bring in. It now looks pretty likely that there would be more than, say, 5 visitors because of the campaign.

Calculate the probability that you will get 5 or more visitors next time you run your ad campaign.

# Calculate the probability that you will get 5 or more visitors
sum(prior$n_visitors >= 5)/length(prior$n_visitors)
## [1] 0.9854338

From the observation of 13/100 visits, we find that it is very probable (99% probable) that future add campaigns will generate at least 5 visits.

Why use Bayesian Data Analysis?

Four Good Things with Bayes

Bayes is very Flexible!

  1. Can include information sources in addition to the data
  • Background Information
  • Expert Opinion
  • Common Knowledge
  • An example:
    • You: So what are really the range of proportion of clicks you see for ads?
    • Social media company person: Hi You! Most ads gets clicked on 5% of the time, but for some ads it is as low as 2% and for others as high as 8%.
    • You: Ah, but you’ve written 10% on your webpage? 😕
    • Social media company person: That’s marketing, don’t listen to them! 😜
  • Most ads get 5% clicks with a range of 2-8%. This information can be incorporated by updating the priors of a Bayesian statistical model.
  1. Can make many comparisons between groups or data sets
  • Can test different experimental conditions
  • Can easily find the probable difference between treatment groups
  • ex: suppose you have two different treatment groups
  1. Can use the result of Bayesian Analysis to do Decision Analysis
  • Decision Analysis: take the result of a statistical analysis and post-process it to apply to a process of interest.
  • The posterior distributions are not of principal interest, rather an outcome to meet a goal (e.g. highest return/clickthrough)
  1. Can change the underlying statistical model
  • If new data indicates the need for a change e.g. from a binomial model to include a new variable(s)

Can include information sources in addition to the data

Since we now primarily work with probability, we will use the beta distribution with rbeta() as a generative model for our informed pior

# Explore using the rbeta function
beta_sample <- rbeta(n = 1000000, shape1 = 1, shape2 = 1)

# Visualize the results
hist(beta_sample)

The above distribution is an uninformative prior because it gives an equally likely probability for all values from 0 to 1.

The rbeta() function in R is used to generate random variates from the beta distribution, which has three main parameters:

  • n: Number of random variates to generate. This indicates how many random values you want to create from the beta distribution.
  • shape1 (α): This is the first shape parameter of the beta distribution, often associated with the number of “successes” or “positive outcomes” in some contexts. It influences the distribution’s skewness and the location of the mode.
  • shape2 (β): This is the second shape parameter, often associated with the number of “failures” or “negative outcomes”.
  • shape1 and shape 2 both affects the skewness and the location of the mode (the value with the highest frequency).
  • The larger the shape1 parameter relative to shape2, the more the distribution skews towards 1.
  • The larger the shape2 parameter relative to shape1, the more the distribution skews towards 0.
  • The mean of the beta distribution:

\[ \text{mean} = \frac{\text{shape1}}{\text{shape1} + \text{shape2}} \]

Properties Based on shape1 and shape2:

  • If shape1 and shape2 are both greater than 1, the distribution is bell-shaped and unimodal (single peak).
  • If shape1 = shape2 > 1, the distribution is symmetric around the mean.
  • If shape1 = shape2 = 1, the distribution is uniform.
  • If shape1 > 1 and shape2 < 1 (or vice versa), the distribution is skewed towards 0 or 1, respectively.
# Modify the parameters
beta_sample1 <- data.frame('dist' = rbeta(n = 1000000, shape1 = 100, shape2 = 20) )
# Visualize the results
close21 <- ggplot(beta_sample1, aes(x = dist ) ) +
  geom_histogram(fill = 'cadetblue', color = "#e9ecef" ) +
  ggtitle('rbeta(nshape1 = 100, shape2 = 20)' )
# Modify the parameters
beta_sample2 <- data.frame('dist' = rbeta(n = 1000000, shape1 = 20, shape2 = 100) )
# Visualize the results
close20 <- ggplot(beta_sample2, aes(x = dist ) ) +
  geom_histogram(fill = 'cadetblue', color = "#e9ecef" ) +
  ggtitle('rbeta(nshape1 = 20, shape2 = 100)' )

grid.arrange(close21, close20, ncol = 2 )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Explore using the rbeta function

# Modify the parameters
beta_sample1 <- data.frame('dist' = rbeta(n = 1000000, shape1 = 0.5, shape2 = 10) )
# Visualize the results
close21 <- ggplot(beta_sample1, aes(x = dist ) ) +
  geom_histogram(fill = 'cadetblue', color = "#e9ecef" ) +
  ggtitle('rbeta(nshape1 = 0.5, shape2 = 10)' )
# Modify the parameters
beta_sample2 <- data.frame('dist' = rbeta(n = 1000000, shape1 = 10, shape2 = 0.5) )
# Visualize the results
close20 <- ggplot(beta_sample2, aes(x = dist ) ) +
  geom_histogram(fill = 'cadetblue', color = "#e9ecef" ) +
  ggtitle('rbeta(nshape1 = 10, shape2 = 0.5)' )

grid.arrange(close21, close20, ncol = 2 )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Now for some excitement!: Let’s update the model developed in the previous example with our new, informed prior distribution for proportion_clicks.

Old Prior:

n_draws <- 1000000 # a million draws
n_ads_shown <- 100

# Change the prior on proportion_clicks
# The uniform distribution is continuous and all numbers within the specified range (from min to max) have an equal chance of being drawn. In the context of this function call, runif will generate n_draws random numbers between 0.0 and 0.2, inclusive of 0.0 but exclusive of 0.2.
proportion_clicks <- 
  runif(n = n_draws,  # n: the number of random numbers to generate
        min = 0.0,    # min: the lower bound of the uniform distribution
        max = 0.2)    # max: the upper bound of the uniform distribution


# The rbinom function generates n_draws binomially distributed integers. Each integer represents the number of successes in n_ads_shown trials with a success probability given by proportion_clicks. If proportion_clicks is a vector, each draw uses a different success probability, useful for simulating varying click-through rates across different ads.
n_visitors <- rbinom(n_draws,                     # the number of random variates to generat
                     size = n_ads_shown,          # the number of trials in each binomial experiment
                     prob = proportion_clicks)    # the probability of success on each trial

# Use the prior that the probability of get a click-through rate between 0 and 0.2 is equal
prior <- data.frame(proportion_clicks, n_visitors)
# Plug in the data that we actually get 13 clicks-through out of 100 ads
posterior <- prior[prior$n_visitors == 13, ]

# This plots the prior and the posterior in the same plot
par(mfcol = c(2, 1))
hist(prior$proportion_clicks, 
     xlim = c(0, 0.25))
hist(posterior$proportion_clicks, 
     xlim = c(0, 0.25))

New Prior:

n_draws <- 100000
n_ads_shown <- 100

# Change the prior on proportion_clicks (roughly 5% of click-through rate)
proportion_clicks <- rbeta(n_draws, 
                           shape1 = 5, # the probability of getting a success (click-through) is 5%
                           shape2 = 95) 

n_visitors <- rbinom(n_draws, 
                     size = n_ads_shown, 
                     prob = proportion_clicks) # use the updated proportion_clicks
prior <- data.frame(proportion_clicks, n_visitors)
posterior <- prior[prior$n_visitors == 13, ]

# This plots the prior and the posterior in the same plot
par(mfcol = c(2, 1))
hist(prior$proportion_clicks, 
     xlim = c(0, 0.25))
hist(posterior$proportion_clicks, 
     xlim = c(0, 0.25))

Observe the change in the prior distribution and the leftward shift in the posterior distribution. Importantly, the informed prior tempers the overly optimistic results suggested by our small data sample, which showed a 13% click rate. The shape of the posterior distribution is influenced by both the observed data and prior knowledge.

So, which posterior should we choose? The naive model gives precedence to our collected data, while the informed posterior integrates external information. If this external information is accurate, the informed posterior is likely to provide more reliable estimates for future outcomes.

Can make many comparisons between groups or data sets

Next up on reasons to use Bayesian data analysis is that it is easy to compare and contrast any outcomes from Bayesian models. A typical example of when you want to make comparisons is when you have two different experimental groups, like two different treatments or two different methods, and you want to compare these and see which seems the best.

For example, say that the ads you’ve been running so far have been video ads, but you’ve been thinking that text ads could be more effective.

To try this out you also paid for 100 text adds to be shown on the social media site:

  • text ads –> 6 clicks and visits to your site.
  • video ads –> 13 clicks and visits to your site.

It seems like video ads are more effective, but how sure should you be of this?

We could run the same model on the data from the video ads and the text ads, and take a look at the corresponding posteriors over the underlying proportions of clicks.

# Define parameters
n_draws <- 100000
n_ads_shown <- 100
proportion_clicks <- runif(n_draws, min = 0.0, max = 0.2)
n_visitors <- rbinom(n = n_draws, size = n_ads_shown, 
                     prob = proportion_clicks)
prior <- data.frame(proportion_clicks, n_visitors)

# Create the posteriors for video and text ads
posterior_video <- prior[prior$n_visitors == 13, ]
posterior_text <- prior[prior$n_visitors == 6, ]

# Visualize the posteriors
hist(posterior_video$proportion_clicks, xlim = c(0, 0.25))

hist(posterior_text$proportion_clicks, xlim = c(0, 0.25))

It looks like it’s more probable that the proportion of clicks is lower for the text ad, but there is some overlap between the probability distributions. What we would want is to compare the performance of the text and video ads in such a way that we got out a new probability distribution showing the probable difference. And this is easy to get, especially since these two distributions are represented by long vectors of samples.

Here I’ve taken the samples that make up these two probability distributions, given them the names video_prop and text_prop, prop as in proportion, and put them into a data frame called posterior. As long as these samples are in a random order, and as long as I do it row by row, I can now calculate any type of derivative quantity and the resulting new distribution of samples will correctly retain the uncertainty of these original two distributions. Now, we were interested in the difference, so for each row let’s subtract text_prop from video_prop and put it into the column prop_diff.

posterior <- data.frame(
    video_prop = posterior_video$proportion_clicks[1:4000],
    text_prop  = posterior_text$proportion_click[1:4000])

# Calculate the posterior difference: video_prop - text_prop
posterior$prop_diff = posterior$video_prop - posterior$text_prop 

Looking at the first couple of samples in prop_diff we see that for most rows video ads are better than text ads. The whole vector of samples in prop_diff now represents the posterior probability distribution over the difference between video ads and text ads.

head(posterior, n = 20)
##    video_prop  text_prop  prop_diff
## 1  0.12191776 0.05636659 0.06555117
## 2  0.16729588 0.06636324 0.10093264
## 3  0.18777864 0.06583992 0.12193872
## 4  0.12357419 0.06044150 0.06313269
## 5  0.15671284 0.04564712 0.11106572
## 6  0.15900959 0.08428587 0.07472372
## 7  0.11709679 0.03873027 0.07836652
## 8  0.14110583 0.07872896 0.06237688
## 9  0.10117732 0.04810872 0.05306860
## 10 0.12368113 0.05130128 0.07237985
## 11 0.08648399 0.07107574 0.01540825
## 12 0.11130187 0.06074299 0.05055887
## 13 0.19612095 0.05520095 0.14092000
## 14 0.14441862 0.06284006 0.08157855
## 15 0.14416403 0.09015081 0.05401322
## 16 0.18797261 0.08664352 0.10132909
## 17 0.11501931 0.03980307 0.07521624
## 18 0.13669431 0.05874641 0.07794790
## 19 0.16727311 0.06516589 0.10210721
## 20 0.11931239 0.04156272 0.07774967

So how does this distribution of prop_diff look?

# Visualize prop_diff
hist(posterior$prop_diff, xlim = c(-0.2, 0.25))

# Calculate the median of prop_diff
median(posterior$prop_diff)
## [1] 0.06703002
# Calculate the proportion
# That is, calculate the proportion of samples in posterior$prop_diff that are more than zero.
sum(posterior$prop_diff > 0 ) / length(posterior$prop_diff )
## [1] 0.949

Let’s update the prior information as we did earlier and plot the distribution for video_prop, text_prop, and prop_diff.

posterior <- data.frame('video_prop' = rbeta(n = 1000000, shape1 = 13, shape2 = 87 ),
                         'text_prop' = rbeta(n = 1000000, shape1 = 6, shape2 = 94 ) ) %>%
  mutate('prop_diff' = video_prop - text_prop )

posterior_long <- posterior %>%
  pivot_longer(cols = everything(), names_to = 'dist', values_to = 'val' )

ggplot(posterior_long , 
       aes(x=val, fill=dist)) + geom_histogram(alpha=0.4, position="identity")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Decision Analysis

Coming up next, a third reason Bayes is useful is because you can use the result of a Bayesian analysis to do Decision Analysis. What’s that? Simply put, it’s when you take the result of a statistical analysis and post-process it to make it more about what you really care about, in order to make informed decisions. Often, you don’t really care about the values of parameters. What you really care about is to save the most lives or to make the most money.

In the case of your zombie website you definitely care about money, the reason you want more visitors to your site is that you want them to buy your patented zombie repellent. Now that we understand the posterior distributions, which advertising campaign is likely to be more profitable: video or text?

Looking at the posteriors for the underlying proportion of clicks you got earlier, there is pretty strong evidence that video ads would give the most visitors per ad shown, but will they result in the most money?

Let’s do a small decision analysis to figure that out.

The social media site charges:

  • 25 cents for a video ad,
  • 5 cents for a text ad
  • based on historical data we make 2.53 dollars per visitor on average.
video_cost <- 0.25 #per video ad
text_cost <- 0.05 #per text ad
visitor_spend <- 2.53 #return for each visitor to site

Let’s again work directly with the posterior samples. So we can calculate anything row-wise, and the resulting distribution will be correct. Let’s start by how much we earn on average per shown video ad. This is then the proportion of people that will click times how much we’ll make on average per visitor minus the cost of showing the ad.

posterior$video_profit <- posterior$video_prop * visitor_spend - video_cost
posterior$text_profit <- posterior$text_prop * visitor_spend - text_cost
posterior$profit_diff <- posterior$video_profit - posterior$text_profit

head(posterior)
##   video_prop  text_prop   prop_diff video_profit text_profit profit_diff
## 1 0.10621877 0.09904858 0.007170194  0.018733491  0.20059290 -0.18185941
## 2 0.13863127 0.06481615 0.073815115  0.100737113  0.11398487 -0.01324776
## 3 0.08289982 0.08121904 0.001680783 -0.040263451  0.15548417 -0.19574762
## 4 0.10068524 0.05442333 0.046261915  0.004733661  0.08769102 -0.08295736
## 5 0.13012003 0.11315650 0.016963533  0.079203684  0.23628595 -0.15708226
## 6 0.11554587 0.07848810 0.037057772  0.042331049  0.14857489 -0.10624384

Looking at the first row we see that if the underlying proportion of clicks actually was 0.106 for video ads, you would make 1.8 cents on average per ad, and it seems for most samples, you’ll make maek a profit.

We can calculate the text ad profit in the same way, but this time subtracting the cost of a text ad instead. And finally, we can calculate profit_diff, the probability distribution over the profit difference between video and text ads.

This distribution is now much closer to what we really care about: which type of ad will give us the most money, where a positive number is in favor of video ads and a negative number in favor of text ads. We could now use profit_diff directly to make a data-informed decision, so what is that decision?

posterior_long <- posterior %>%
  select(c(video_profit, text_profit)) %>%
  pivot_longer(cols = everything(), names_to = 'dist', values_to = 'val' )

ggplot(posterior_long , 
       aes(x=val, fill=dist)) + geom_histogram(alpha=0.4, position="identity")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(posterior, aes(x = profit_diff ) ) +
  geom_histogram() +
  geom_vline(xintercept = 0, color = 'red' ) +
  geom_vline(xintercept = median(posterior$profit_diff ),
              color = 'blue' ) +
  annotate(geom = 'text', 
            x = median(posterior$profit_diff ) - 0.01,
            y = 30000, label = 'Median Value', 
            color = 'blue', angle = 90 )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Calculate a "best guess" for the difference in profits
median(posterior$profit_diff)
## [1] -0.02483565
# Calculate the probability that text ads are better than video ads
sum(posterior$profit_diff < 0 ) / length(posterior$profit_diff )
## [1] 0.5964

So it seems that the evidence does not strongly favor neither text nor video ads. But if forced to choose at this point, we would choose text ads.

In conclusion, even though text ads get a lower proportion of clicks, they are also much cheaper. And, as you have calculated, there is a 59.6% probability that text ads are better.

Change Anything and Everything

The fourth reason to use Bayes is that you can also completely change the underlying statistical model, often with not too much effort. Let’s take a look at an example where you will completely switch out the binomial model we have worked with so far.

Why? Well, you have some new data to analyze, where the Binomial model isn’t going to work well. See, you happen to have a friend, also in the zombie e-commerce business, and she has offered to put up an ad for your site as a banner on her site, for a small daily fee, of course. The thing is, you wouldn’t pay per view, the banner is always up, you would pay per day.

As a trial, she already put it up on her site for just one day, and during that time it resulted in 19 clicks and visits to your site. The question is: How many daily site visits, should we expect, on average, if we pay for this banner? Now it’s obvious that the binomial model won’t directly work here, it models the number of successes out of a total number of shown ads, but now the ad is shown all the time, and we just have a daily count. What we could do is that we split each day into minutes and assume some unknown proportion of minutes results in a success, a click on the ad. One problem with this generative model is that it can’t model more than one click per minute, either it’s a success or not. We could, sort of, fix this by splitting the day into seconds instead, or milliseconds, or even smaller.

In the limit, we end up with another generative model that also has a name: the Poisson distribution.

It has one parameter: The mean number of events per measured time unit, in this case, it’s the number of clicks per day.

In R you can sample from the Poisson distribution using the rpois function, and just like with the rbinom function you can plug in different parameter values and use it to generate simulated data. For example, if we knew that the mean number of clicks per day was 20, then this would give you the probability distribution over how many clicks you would expect to get, say, tomorrow. Now, of course, for the banner ad we don’t know the underlying mean number of clicks, so let’s try to find that out.

#let's say the mean clicks is 20/day
n_clicks <- rpois(n = 100000, lambda = 20 )
hist(n_clicks )

# Simulate from a Poisson distribution and visualize the result
# let's say the mean clicks is 3/day
x  <- rpois(n = 10000, lambda = 3)
hist(x )

Note that the Poisson distribution cannot be negative; it is defined for non-negative integers (0, 1, 2, …) because it models the number of times an event occurs in a fixed interval of time or space.

Here is a good example for poisson distribution.

Let’s say that you run an ice cream stand and on cloudy days you on average sell 11.5 ice creams. Unfortunately, it’s a cloudy day, and you won’t break even unless you sell 15 or more ice creams. Assuming the Poisson model is reasonable, use x to calculate the probability that you’ll break even.

Tip: For this, you need to calculate what proportion of samples in x are >= 15.

# Create a model that quantifies the relationship between the quantity of ice cream sold and the corresponding likelihood of sales events. 
x  <- rpois(n = 10000, lambda = 11.5)
hist(x )

# Calculate the probability of break-even
sum(x >= 15 )/length(x )
## [1] 0.1842

Is it likely that you will break even on a cloudy day, or should you stay at home?

Clicks per day instead of clicks per ad

When you put up a banner on your friend’s site you got 19 clicks in a day, how many daily clicks should you expect this banner to generate on average?

To accommodate the new banner data you’re now going to change the model so that it uses a Poisson distribution instead.

# Change the model according to instructions
n_draws <- 100000
mean_clicks <- runif(n_draws, min = 0, max = 80)
n_visitors <- rpois(n = n_draws, mean_clicks)

prior <- data.frame(mean_clicks, n_visitors)
posterior <- prior[prior$n_visitors == 19, ]

# Visualize mean_clicks
hist(prior$mean_clicks )

hist(posterior$mean_clicks)

The model is complete!

What range could you expect the mean number of daily clicks to be in?

12 ~ 28 daily clicks on average.

Bayesian Inference with Bayes’ Theorem

The method we’ve been using to fit Bayesian models so far has involved generating a large number of samples representing the joint probability distribution over parameters and unknown data. And then conditioning on the observed data by filtering away all samples that don’t match.

This is a simple method and we implemented in just a couple of lines of R code, but the bad news is that this computational method scales horribly, both with larger data sets and with more complicated models.

But good news!

  • There are tons of skilled scientists out there working hard on new methods to allow you to fit Bayesian models more efficiently.
  • The result of using a more efficient method will still be the same as if you had used the slower method, as the Bayesian model is still the same, so everything you’ve learned so far still applies. The only difference is that with a faster method you’ll get the result immediately.

To work with and understand faster computational methods one does need to know a bit about probability theory.

  • Probability
    • A number between 0 and 1
    • A statement of certainty/uncertainty
  • Mathematical Notation
    • P(n = 1) is a probability
    • P(n) is a probability distribution
    • P(n = 1|m = 1) is a conditional probability
    • P(n|m = 1) is a conditional probability distribution
  • Manipulating (independent) Probabilities
    • The Sum Rule (if possibilities are mutually exclusive)
      • p(roll a dice that has 1 or 2 or 3) = 1/6 + 1/6 + 1/6 = 0.5
    • The Product Rule (if possibilities are unrelated or independent)
      • p(roll a dice two times and get 6s in a row) = 1/6 * 1/6 = 1/36 = 2.8%

This was all the probability notation and basic rules you’ll need for now.

Simulation versus Calculation

In earlier exercises we didn’t calculate probabilities directly, we simulated.

If we wanted to figure out the probability that a generative model would generate a certain value, we simulated a large number of samples and then counted the proportion of samples taking on this certain value. This was easy to do for generative models that correspond to common probability distributions, like the binomial or the Poisson distribution, as you can use the “r”-functions, like rbinom and rpois for this.

For example, here is how we got p(n_visitors = 13 | prob_success = 10%), the probability that 13 out of a 100 would visit your site if the underlying proportion of clicks was 10%.

n_visitors <- rbinom(n = 100000, size = 100, prob = 0.1)

sum(n_visitors == 13) / length(n_visitors)
## [1] 0.0741

But one thing that faster Bayesian computational methods have in common is that they require that this probability can be calculated directly rather than simulated. Fortunately, for these common distributions someone already figured out how to do this and in R we can use the “d”-functions, like dbinom and dpois, for this. Using dbinom we can directly calculate the probability that 13 out of a 100 would visit your site, like this. This is obviously much more efficient than to have to generate 100,000 or so samples first.

#calculate P(n_visitors = 13 | prob_success = 10%)
dbinom(13, size = 100, prob = 0.1)
## [1] 0.07430209

We can also mapulate the results based on the rule we learned earlier. For example, the probability of getting 13 or 14 visitors would be 12.6% percent. Here calculated using the sum rule.

#calculate P(n_visitors = 13 or n_visitors = 14 | prob_success = 10 )
dbinom(13, size = 100, prob = 0.1 ) + dbinom(14, size = 100, prob = 0.1 )
## [1] 0.1256059

Finally, if we want to get a whole probability distribution, we’ll have to calculate the probability for a range of values. But the d-functions are generally vectorized, so that is easy!

If we wanted the probability distribution over the number of visitors we would expect conditional on the proportion of success being 10% we could calculate it like this. Instead of the resulting probability distribution being represented by a large number of samples, as in earlier chapters, it’s now represented by a vector giving the probability of each outcome. And we can plot it like this.

#calculate P(n_visitors | prob_success = 0.1 )
n_visitors <- seq(0, 100, by = 1 )
probability <- dbinom(n_visitors, size = 100, prob = 0.1 )
plot(n_visitors, probability, type = 'h' )

The binomial is an example of a discrete distribution; it defines probability over whole numbers: 1, 2, 3, and so on. But there are also continuous distributions, and earlier we used the uniform distribution like this. There is also a d-version of runif: dunif, and you might expect that this would return the probability of the outcome being 0.12 assuming a uniform distribution between 0 and 0.2.

dunif(x = 0.12, min = 0, max = 0.2 )
## [1] 5

But it does not. For starters, a probability can at most be 1, and here we got a 5 back. That’s because for continuous distributions you can’t really talk about the probability of a single value, the probability of getting exactly 0.12 is effectively zero.

So instead you get the probability density, a number that on its own doesn’t tell you much, but that you can view as a relative probability: A value with twice the probability density as another value is twice as probable. This is also where the d in dunif and dbinom comes from, it stands for density.

x = seq(0, 0.2, by = 0.01)
dunif(x, min = 0, max = 0.2 )
##  [1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5

Since we’re here using uniform distribution, we should not be surprised that any value between 0 and 0.2 returns the same number: 5. Within a uniform probability distribution, any value is equally likely.

From rbinom to dbinom

Below is currently code that

  • Simulates the number of clicks/visitors (n_clicks) from 100 shown ads using the rbinom function given that the underlying proportion of clicks is 10%
  • Calculates the probability of getting 13 visitors (prob_13_visitors).

That is, in probability notation it’s calculating P(n_visitors = 13 | proportion_clicks = 10%).

# The rbinom approach
n_ads_shown <- 100
proportion_clicks <- 0.1
n_visitors <- rbinom(n = 99999, 
    size = n_ads_shown, prob = proportion_clicks)
prob_13_visitors <- sum(n_visitors == 13) / length(n_visitors)
prob_13_visitors
## [1] 0.07447074
# The dbinom approach
n_ads_shown <- 100
proportion_clicks <- 0.1
prob_13_visitors <- dbinom(13, 
    size = n_ads_shown, prob = proportion_clicks)
prob_13_visitors
## [1] 0.07430209

Alright! You got a similar result as with rbinom, but dbinom is much more efficient! You do not need to simulate the data 99999 times to compute the probability.

Bayesian Calculation

In the exercises you did earlier, we simulated a large number of samples to end up with a joint distribution over the uncertain parameter, the underlying proportion of clicks, and the future data, how many clicks we would get out of a hundred shown ads. Then we conditioned on the data we got, thirteen visitors, and that allowed the model to become more certain about the underlying proportion of clicks.

This, to condition on the data, is what Bayesian inference is. Now, you’ve learned the basic probability rules. You’ve learned how to calculate probabilities and densities using probability distributions. We’re going to put this together to do Bayesian inference.

But instead of simulation, we’ll calculate everything directly in R, but using the same example as in earlier exercises.

# Set the total number of ads shown
n_ads_shown <- 100

# Create a sequence of visitors from 0 to 100
n_visitors <- seq(0, 100, by = 1)

# Generate a sequence for the proportion of clicks from 0 to 1 by 0.01 increments
proportion_clicks <- seq(0, 1, by = 0.01)

# Create all combinations of proportion of clicks and number of visitors
pars <- expand.grid(proportion_clicks = proportion_clicks,
                     n_visitors = n_visitors)

# Assign a uniform prior distribution to proportion of clicks between 0 and 0.2
pars$prior <- dunif(pars$proportion_clicks, min = 0, max = 0.2)

# Calculate the likelihood of the number of visitors clicking based on the binomial distribution
pars$likelihood <- dbinom(pars$n_visitors, size = n_ads_shown,
                           prob = pars$proportion_clicks)

# Compute the unnormalized posterior probability
pars$probability <- pars$likelihood * pars$prior

# Normalize the probabilities so they sum to 1
pars$probability <- pars$probability / sum(pars$probability)

head(pars)
##   proportion_clicks n_visitors prior  likelihood  probability
## 1              0.00          0     5 1.000000000 0.0476190476
## 2              0.01          0     5 0.366032341 0.0174301115
## 3              0.02          0     5 0.132619556 0.0063152169
## 4              0.03          0     5 0.047552508 0.0022644051
## 5              0.04          0     5 0.016870319 0.0008033485
## 6              0.05          0     5 0.005920529 0.0002819300
# Check that the sum of the normalized probabilities is 1
sum(pars$probability)
## [1] 1

Now, after all this calculation, if we plot the probability of each proportion_clicks and n_visitors combination, we get this:

ggplot(pars, aes(x = n_visitors, y = proportion_clicks,
                   fill = probability ) ) +
  geom_tile() +
  xlim(c(0, 35 ) ) +
  ylim(c(0, 0.2 ) ) +
  scale_fill_gradient(low="white", high="darkblue")
## Warning: Removed 9555 rows containing missing values or values outside the scale range
## (`geom_tile()`).

Pretty similar, right? Finally, let’s bring in the actual data. As there were 13 clicks out of a hundred shown ads, we can now remove all rows where n_visitors is something else. And while we’re at it we’ll normalize again, to make sure the remaining probabilities sum to one. Here’s the result:

pars13 <- pars[ pars$n_visitors == 13, ]
pars13$probability <- pars13$probability / sum(pars$ probability )

ggplot(pars13, aes(x = n_visitors, y = proportion_clicks,
                   fill = probability ) ) +
  geom_tile() +
  xlim(c(0, 35 ) ) +
  ylim(c(0, 0.2 ) ) +
  scale_fill_gradient(low="white", high="darkblue")
## Warning: Removed 82 rows containing missing values or values outside the scale range
## (`geom_tile()`).

If we now zoom in on this slice and plot from the side, instead of looking at it from above, we see that we have retrieved the same probability distribution over the likely underlying proportions of clicks as we got when we simulated in the exercise we did earlier. But running the simulation is much slower on my computer than calculating it directly!

plot(pars13$proportion_clicks, pars13$probability, type = "h")

Closing remarks

We have seen Bayesian modeling as a way of quantifying uncertainty using probability, and we have covered what is needed to do Bayesian inference. You need data, a Bayesian model that consists of a generative model and prior probability distributions over parameters and unknowns, and finally you need a computational method to fit the model.

We looked at one basic computational method: Rejection sampling, where you directly sample from the generative model and then condition on the data by filtering away all samples that don’t match. There are other computational methods such as grid approximation, Markov chain Monte Carlo method and many others.

We mainly looked at two generative models: The binomial distribution and the Poisson distribution.

But this was a quick introduction, and there are many many things we didn’t cover. We only looked at rather simple Bayesian models, and you might be left with the impression that Bayes is for these simple situations. But a Bayesian approach can be used all over machine learning and statistics, and there are Bayesian versions of things like time series models and deep learning. A great strength with Bayes is that you have great flexibility when setting prior distributions. But to be honest, we didn’t talk much about how to decide what priors and models to use.

We didn’t contrast Bayesian statistics with classical statistics, which is so often done. They are two different approaches, and better to learn the good things with both of them. We didn’t look at more advanced computational methods, like there are many Markov chain Monte Carlo algorithms, and we didn’t look at the many good computational tools and packages that now exist, like Stan and JAGS. And to explore one of these tools would be a good next step after this workshop.

Bayesian data analysis is increasingly being used all over data science and machine learning, and even if we just scratched the surface, you should now have the necessary knowledge to explore this exciting topic further.

Some useful resources you may want to check out for advancing your knowledge about Bayesian analysis:

  • Johnson, A. A., Ott, M. Q., & Dogucu, M. (2022). Bayes rules! An introduction to applied Bayesian modeling. CRC Press.

From the website: An engaging, sophisticated, and fun introduction to the field of Bayesian statistics, Bayes Rules!: An Introduction to Applied Bayesian Modeling brings the power of modern Bayesian thinking, modeling, and computing to a broad audience. In particular, the book is an ideal resource for advanced undergraduate statistics students and practitioners with comparable experience. the book assumes that readers are familiar with the content covered in a typical undergraduate-level introductory statistics course. Readers will also, ideally, have some experience with undergraduate-level probability, calculus, and the R statistical software. Readers without this background will still be able to follow along so long as they are eager to pick up these tools on the fly as all R code is provided.Bayes Rules! empowers readers to weave Bayesian approaches into their everyday practice. Discussions and applications are data driven. A natural progression from fundamental to multivariable, hierarchical models emphasizes a practical and generalizable model building process. The evaluation of these Bayesian models reflects the fact that a data analysis does not exist in a vacuum.