CONTINUOUS BERNOULLI TOPIC

Abstract

What is a Probability Distribution?

The set of possible values a variable can take, together with how frequently they occur, is called its distribution.

Let Y be the actual outcome of an event and y one of its possible outcomes. The likelihood of reaching the outcome y is denoted "P(Y = y)" or simply "p(y)".

Example: let Y be the number of red marbles we draw out of a bag and y = 5; then we express the probability of getting exactly 5 red marbles as P(Y = 5) or p(5).

The function p(y) is called the probability function.

Probability distributions, or simply probabilities, measure the likelihood of an outcome depending on how often it features in the sample space.

We usually describe a distribution with two summary quantities:

1. Mean, denoted μ ("mu", a Greek letter)

2. Variance, denoted σ² ("sigma squared")

In simple terms:

Mean -> the average value.

Variance -> how spread out the data is. We measure this "spread" by how far the values lie from the mean; the more dispersed the data, the higher the variance.
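As a minimal illustration of these two summaries, the following R sketch computes the sample mean, variance, and standard deviation of some made-up values (the numbers are purely illustrative):

# Illustrative data: heights (in cm) of five people
x <- c(160, 165, 170, 175, 180)
mu <- mean(x)        # mean: the average value
sigma2 <- var(x)     # sample variance: how spread out the values are
sigma <- sd(x)       # standard deviation: the square root of the variance
print(c(mean = mu, variance = sigma2, sd = sigma))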

The continuous Bernoulli is a distribution over the interval [0, 1], parameterized by probs, a (batch of) parameters taking values in (0, 1). Note that, unlike in the Bernoulli case, probs does not correspond to a probability; the same name is used because of the similarity with the Bernoulli.

The continuous Bernoulli distribution arises in deep learning and computer vision, specifically in the context of variational autoencoders for modeling the pixel intensities of natural images.

Types of probability distributions

Below are the details of the main types of distributions.

Experiments such as rolling a die or picking a card have a finite number of outcomes; they follow "discrete distributions". Other experiments, such as recording times and distances in track and field, have infinitely many possible outcomes; they follow "continuous distributions".

Discrete probability distributions

A discrete distribution describes the probability of occurrence of each value of a discrete random variable. The number of spoiled apples out of 6 in your refrigerator can be an example of a discrete probability distribution.

Each possible value of the discrete random variable can be associated with a non-zero probability in a discrete probability distribution.

Binomial Distribution

The binomial distribution is a discrete distribution with a finite number of possibilities. When observing a series of what are known as Bernoulli trials, the binomial distribution emerges. A Bernoulli trial is a scientific experiment with only two outcomes: success or failure.

Consider a random experiment in which you toss a biased coin six times, with a 0.4 chance of getting a head on each toss. If 'getting a head' is considered a 'success', the binomial distribution gives the probability of r successes for each value of r.

The binomial random variable represents the number of successes (r) in n consecutive independent Bernoulli trials.

One use of the binomial distribution is that many practical problems, for instance in business, can be modelled and solved with it.

Examples: Will the oil price go up or down?

What is the probability that the stock market will crash? etc.

The binomial distribution computes the probabilities of events where only two possible outcomes can occur (success or failure); e.g. when you ask whether the stock market will crash, the outcome of interest is simply whether it crashes or not.
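As a small sketch of the biased-coin example above (6 tosses, probability 0.4 of a head), the following R snippet uses the built-in dbinom() function to list the probability of each possible number of heads:

r <- 0:6                                 # possible numbers of heads in 6 tosses
p_r <- dbinom(r, size = 6, prob = 0.4)   # P(exactly r heads)
data.frame(heads = r, probability = round(p_r, 4))
sum(p_r)                                 # the probabilities sum to 1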

Bernoulli’s Distribution

The continuous Bernoulli can be thought of as a continuous relaxation of the Bernoulli distribution, which is defined on the discrete set {0, 1} by the probability mass function:

p(x) = p^x * (1 - p)^(1 - x), for x in {0, 1}.

The Bernoulli distribution is a variant of the Binomial distribution in which only one experiment is conducted, resulting in a single observation. As a result, the Bernoulli distribution describes events that have exactly two outcomes.

The Bernoulli random variable’s expected value is p, which is also known as the Bernoulli distribution’s parameter.

The outcome of the experiment is coded as 0 or 1, so a Bernoulli random variable takes only the values 0 and 1.

The Bernoulli distribution is the basis of the extremely widely used Binomial distribution. The binomial has parameters N and p; the Bernoulli is the same but with N = 1, so it covers, e.g., one toss of a coin.

So every Binomial with N = 1 is a Bernoulli. In most real-life applications N > 1; the single-trial case (one flip of a coin) is so basic that the term "distribution" is rarely used for it on its own, but it is the building block, and this is one of the main uses of the Bernoulli.
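A quick R check of this relationship, with an illustrative value p = 0.3: setting size = 1 in the binomial mass function reduces it to the Bernoulli formula p^x * (1 - p)^(1 - x):

p <- 0.3
dbinom(0, size = 1, prob = p)   # P(X = 0) = 1 - p = 0.7
dbinom(1, size = 1, prob = p)   # P(X = 1) = p     = 0.3
# these match p^x * (1 - p)^(1 - x) for x in {0, 1}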

Poisson Distribution

A Poisson distribution is a probability distribution used in statistics to show how many times an event is likely to happen over a given period of time. To put it another way, it’s a count distribution. Poisson distributions are frequently used to comprehend independent events at a constant rate over a given time interval. Simeon Denis Poisson, a French mathematician, was the inspiration for the name.

It has two parameters (as used in common sampling routines):

1. lam: the known average number of occurrences (the rate)

2. size: the shape of the returned array of samples

The main uses of this distribution are:

1. To determine how much variation there will likely be from the average number of occurrences.

2. To estimate the probable maximum and minimum number of times the event will occur within the given time interval.

Companies can use the Poisson distribution to examine how they might take steps to improve their operational efficiency.
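As a hedged sketch of these uses, suppose (purely as an assumption for illustration) that a shop receives on average 4 customers per hour. The R snippet below uses the built-in Poisson functions to look at the variation around that average and a probable minimum/maximum count:

lambda <- 4                        # assumed average number of occurrences per hour
dpois(0:10, lambda)                # probability of seeing 0, 1, ..., 10 customers
qpois(c(0.025, 0.975), lambda)     # a rough probable minimum and maximum count
samples <- rpois(1000, lambda)     # simulate 1000 hours
mean(samples); var(samples)        # both are close to lambda for a Poisson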

Continuous Probability Distributions

A continuous distribution describes the probabilities of a continuous random variable's possible values. A continuous random variable has an infinite and uncountable set of possible values (known as the range). Measured time is an example of a continuous random variable: a duration can be anywhere from 1 second to 1 billion seconds, and so on.

The area under the curve of a continuous random variable’s PDF is used to calculate its probability. As a result, only value ranges can have a non-zero probability. A continuous random variable’s probability of equaling some value is always zero.

Now let us look at some varieties of continuous probability distributions.

Normal Distribution

Normal Distribution is one of the most basic continuous distribution types. Gaussian distribution is another name for it. This probability distribution is symmetrical around its mean value. It also demonstrates that data close to the mean occurs more frequently than data far from it. For the standard normal distribution the mean is 0 and the variance is 1; in general both the mean and the variance are finite.

Uses

The main use of this distribution is that much continuous data in nature and psychology displays this bell-shaped curve when graphed; another important use is to find the probability of observations in a distribution lying between a lower and an upper limit.

[Figure: a sample graph of the Normal distribution.]
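The following R sketch illustrates that second use, computing the probability of an observation falling between a lower and an upper limit (the mean, standard deviation, and limits below are assumed values for illustration):

mu <- 0; sigma <- 1                          # standard normal: mean 0, variance 1
pnorm(1, mu, sigma) - pnorm(-1, mu, sigma)   # P(-1 < X < 1), about 0.683
pnorm(2, mu, sigma) - pnorm(-2, mu, sigma)   # P(-2 < X < 2), about 0.954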

Continuous Uniform Distribution

In the continuous uniform distribution, all outcomes are equally possible: every value in the interval [a, b] has the same chance of occurring. The random variable is spread evenly over this symmetric distribution, with constant density 1/(b - a).
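A small sketch of that constant 1/(b - a) density, using illustrative limits a = 2 and b = 6:

a <- 2; b <- 6
dunif(3, min = a, max = b)        # density is 1/(b - a) = 0.25 anywhere in [a, b]
punif(5, a, b) - punif(3, a, b)   # P(3 < X < 5) = (5 - 3)/(b - a) = 0.5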

Log-Normal Distribution

The random variables whose logarithm values follow a normal distribution are plotted using this distribution. Take a look at the random variables X and Y. The variable represented in this distribution is Y = ln(X), where ln denotes the natural logarithm of X values. The size distribution of rain droplets can be plotted using log normal distribution.
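A brief R sketch of the definition above: if X is log-normal, then log(X) follows a normal distribution. The parameter values are illustrative assumptions:

x <- rlnorm(10000, meanlog = 0, sdlog = 0.5)   # X follows a log-normal distribution
hist(log(x), breaks = 40)                      # log(X) looks bell-shaped (normal)
mean(log(x)); sd(log(x))                       # close to meanlog = 0 and sdlog = 0.5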

Continuous Bernoulli distribution

The Bernoulli distribution is the simplest discrete distribution, and it is the building block for other, more complicated discrete distributions.

The continuous Bernoulli distribution arises in deep learning and computer vision, specifically in the context of variational autoencoders, for modelling the pixel intensities of natural images. As such, it defines a proper probabilistic counterpart for the commonly used binary cross entropy loss, which is often applied to continuous, [0, 1]-valued data. This practice amounts to ignoring the normalizing constant of the continuous Bernoulli distribution, since the binary cross entropy loss only defines a true log-likelihood for discrete, {0, 1}-valued data, for which p(x | λ) = λ^x * (1 - λ)^(1 - x).

Here we introduce and fully characterize the continuous Bernoulli distribution, both as a means to study the impact of this widespread modelling error, and to provide a proper VAE for [0, 1]-valued data. Before these details, let us ask the central question: who cares?

In order to analyse the implications of this modelling error, we introduce the continuous Bernoulli, a novel distribution on [0, 1], which is parameterized by λ ∈ (0, 1) and defined by:

X ∼ CB(λ) ⇐⇒ p(x | λ) ∝ p̃(x | λ) = λ^x * (1 − λ)^(1 − x).

Formulas:

Here we discuss the formulas of the continuous Bernoulli distribution: the probability density function, the mean, and the variance. From the variance we can also obtain the standard deviation.

The probability density function (pdf) is,

pdf(x; probs) = probs**x * (1 - probs)**(1 - x) * C(probs)

where C(probs) = 2 * atanh(1 - 2 * probs) / (1 - 2 * probs) if probs != 0.5, else 2.

While the normalizing constant C(probs) is a continuous function of probs (even at probs = 0.5), computing it at values close to 0.5 can result in numerical instabilities due to 0/0 errors. A Taylor approximation of C(probs) is therefore used for values of probs in a small interval [lims[0], lims[1]] around 0.5.
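Below is a minimal R sketch of this density, written directly from the formula above. The function names dcontbern and C_const, and the simple cutoff used near probs = 0.5 in place of the Taylor approximation, are my own illustrative choices, not a standard API:

# Normalizing constant C(probs); near probs = 0.5 the exact expression is 0/0,
# so fall back to its limiting value 2 inside a small interval.
C_const <- function(probs, eps = 1e-3) {
  ifelse(abs(probs - 0.5) < eps,
         2,
         2 * atanh(1 - 2 * probs) / (1 - 2 * probs))
}

# Continuous Bernoulli density on [0, 1]
dcontbern <- function(x, probs) {
  probs^x * (1 - probs)^(1 - x) * C_const(probs)
}

dcontbern(0.2, probs = 0.7)              # density at x = 0.2 for lambda = 0.7
integrate(dcontbern, 0, 1, probs = 0.7)  # integrates to 1 (up to numerical error)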

Real-life applications of the distribution

Variational autoencoders (VAE)

The VAE has become one of the most widely used tools in machine learning, applied to a broad range of data types and modelling problems. When designing a VAE for [0, 1]-valued data, the continuous Bernoulli distribution is the appropriate likelihood to consider.

Though using a Bernoulli likelihood on such continuous data will not throw an obvious type error, the implied object is no longer a coherent probabilistic model, due to a neglected normalizing constant. This practice is extremely pervasive in the VAE literature, including the seminal work of Kingma and Welling [20] (who, while aware of it, set it aside as an inconvenience), highly-cited follow-up work (for example [25, 37, 17, 6] to name but a few), VAE tutorials [7, 1], including those in hugely popular deep learning frameworks such as PyTorch [32] and Keras [3], and more.

Second, one might suppose this error can be interpreted or fixed via data augmentation, binarizing data (which is also a common practice), stipulating a different lower bound, or as a non-probabilistic model with a "negative binary cross-entropy" objective. The original paper explores these possibilities and finds them wanting. Also, one might be tempted to call the Bernoulli VAE a toy model or a minor point. Let us avoid that trap: MNIST is likely the single most widely used dataset in machine learning, and the VAE is quickly becoming one of our most popular probabilistic models.

Third, and most importantly, empirical results show three key findings:

(i) as a result of this error, the Bernoulli VAE significantly underperforms the continuous Bernoulli VAE across a range of evaluation metrics, models, and datasets;

(ii) a further unexpected finding is that this performance loss is significant even when the data is close to binary, a result that becomes clear by consideration of continuous Bernoulli limits; and

(iii) comparing the continuous Bernoulli to beta likelihood and Gaussian likelihood VAEs again finds the continuous Bernoulli performant. Altogether, this work suggests that careful treatment of data type – neither ignoring normalizing constants nor defaulting immediately to a Gaussian likelihood – can produce optimal results when modelling some of the most core datasets in machine learning.

In practice, the continuous Bernoulli is mostly used to fix this pervasive error in variational autoencoders.

The underlying Bernoulli model describes situations with exactly two outcomes, such as winning a championship or the result of tossing a coin.

Through the continuous Bernoulli distribution we can also relate to, and recover probabilities from, other distributions such as the Bernoulli, beta, and exponential distributions.

The continuous Bernoulli VAE, and why the normalizing constant matters.

We define the continuous Bernoulli VAE analogously to the Bernoulli VAE:

Z_n ∼ N(0, I_M) and X_n | Z_n ∼ CB(λ_θ(z_n)), for n = 1, ..., N,    (9)

where again λ_θ : R^M → R^D is a neural network with parameters θ, and CB(λ) now denotes the product of D independent continuous Bernoulli distributions. Operationally, this modification results only in a change to the optimized objective; for clarity we compare the ELBO of the continuous Bernoulli VAE (first line), E(p, θ, φ), to that of the Bernoulli VAE (second line):

E(p, θ, φ) = Σ_{n=1..N} [ -KL(q_φ || p_0) + E_{q_φ}[ Σ_{d=1..D} x_{n,d} log λ_{θ,d}(z_n) + (1 - x_{n,d}) log(1 - λ_{θ,d}(z_n)) + log C(λ_{θ,d}(z_n)) ] ]

E(p̃, θ, φ) = Σ_{n=1..N} [ -KL(q_φ || p_0) + E_{q_φ}[ Σ_{d=1..D} x_{n,d} log λ_{θ,d}(z_n) + (1 - x_{n,d}) log(1 - λ_{θ,d}(z_n)) ] ]

Analogously, we denote θ*(p) and φ*(p) as the maximizers of the continuous Bernoulli ELBO:

(θ*(p), φ*(p)) = argmax_{(θ, φ)} E(p, θ, φ).
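To make concrete that the only difference between the two objectives is the log C(λ) term, here is a short R sketch (my own illustration, with arbitrary values for one pixel x and one decoder output λ) comparing the Bernoulli-style binary cross entropy reconstruction term with the continuous Bernoulli log-likelihood:

x <- 0.8; lambda <- 0.6                                    # illustrative pixel value and decoder output
bce_term <- x * log(lambda) + (1 - x) * log(1 - lambda)    # Bernoulli VAE reconstruction term
logC <- log(2 * atanh(1 - 2 * lambda) / (1 - 2 * lambda))  # log normalizing constant (lambda != 0.5)
cb_term <- bce_term + logC                                 # continuous Bernoulli reconstruction term
c(bernoulli = bce_term, continuous_bernoulli = cb_term)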

[Figure: probability distributions for different input parameter values.]

Warped dataset

The most common justification for the Bernoulli VAE is that MNIST pixel values are ‘close’ to binary. An important study is thus to ask how the performance of continuous Bernoulli VAE vs the Bernoulli VAE changes as a function of this ‘closeness.’ We formalize this concept by introducing a warping function fγ(x) that, depending on the warping parameter γ, transforms individual pixel values to produce a dataset that is anywhere from fully binarized (every pixel becomes {0, 1}) to fully degraded (every pixel becomes 0.5).

Analysis:

We can use the continuous Bernoulli distribution in connection with some other distributions; here we also make use of the beta distribution.

Below is the R code to construct a continuous Bernoulli distribution.

tfd_continuous_bernoulli(
  logits = NULL,
  probs = NULL,
  dtype = tf$float32,
  validate_args = FALSE,
  allow_nan_stats = TRUE,
  name = "ContinuousBernoulli"
)

Arguments

logits An N-D Tensor. Each entry in the Tensor parameterizes an independent continuous Bernoulli distribution with parameter sigmoid(logits). Only one of logits or probs should be passed in. Note that this does not correspond to the log-odds as in the Bernoulli case.
probs An N-D Tensor representing the parameter of a continuous Bernoulli. Each entry in the Tensor parameterizes an independent continuous Bernoulli distribution. Only one of logits or probs should be passed in. Note that this also does not correspond to a probability as in the Bernoulli case.
dtype The type of the event samples. Default: float32.
validate_args Logical, default FALSE. When TRUE distribution parameters are checked for validity despite possibly degrading runtime performance. When FALSE invalid inputs may silently render incorrect outputs. Default value: FALSE.
allow_nan_stats Logical, default TRUE. When TRUE, statistics (e.g., mean, mode, variance) use the value NaN to indicate the result is undefined. When FALSE, an exception is raised if one or more of the statistic’s batch members are undefined.
name name prefixed to Ops created by this class.
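Assuming the tfprobability R package (with a working TensorFlow Probability installation) is available, a usage sketch might look like the following; the parameter value 0.7 is arbitrary:

library(tfprobability)                       # R interface to TensorFlow Probability
d <- tfd_continuous_bernoulli(probs = 0.7)   # lambda = 0.7 (not a probability here)
tfd_sample(d, 5)                             # draw 5 values in [0, 1]
tfd_prob(d, 0.3)                             # density at x = 0.3
tfd_mean(d)                                  # mean of the distribution (not equal to 0.7)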

Another program (sampling from a standard Bernoulli using the Rlab package)

library(Rlab)                          # provides rbern() for Bernoulli sampling
set.seed(98999)                        # make the simulation reproducible
N <- 1000                              # number of trials to simulate
random_values <- rbern(N, prob = 0.5)  # draw 1000 Bernoulli(0.5) samples (0s and 1s)
print(random_values)
hist(random_values, breaks = 10, main = "")  # histogram of the 0/1 outcomes

Problems

Data augmentation problem in VAE

Since the expectation of a Bernoulli random variable is precisely its parameter, the Bernoulli VAE might (erroneously) be assumed to be equivalent to a continuous Bernoulli VAE on an infinitely augmented dataset, obtained by sampling binary data whose mean is given by the observed data; indeed this idea is suggested by Kingma and Welling. This interpretation does not hold: it would result in a reconstruction term as in the first line of the corresponding pair of equations (not reproduced here), while a correct Bernoulli VAE on the augmented dataset would have a reconstruction term given by the second line. The two are not equal, since the order of expectation cannot be switched, because q_φ depends on the data in the second line.

Conclusion:

The continuous Bernoulli distribution is linked to many other distributions that are used to solve machine learning problems, and it offers a straightforward route to a solution. Like the Bernoulli distribution, it lets us draw the required conclusions from the data; and since its values lie in [0, 1], we can read off probabilities directly.

Through this we have studied how to fix a pervasive error in variational autoencoders, which are used in deep learning, and how the continuous Bernoulli can be applied to large datasets, across a range of metrics, and to obtain sharper reconstructed images.

References:

https://en.wikipedia.org/wiki/Continuous_Bernoulli_distribution