Whenever we make a critical decision in our lives, we tend to seek suggestions and recommendations from different people. We try to analyse information from various sources and, based on that analysis, decide what to do.
To understand this process more intuitively, let's take an example. Last week one of my friends recommended that I watch a movie called “Attack of the Killer Tomatoes!”. Looking at the title, I felt a bit suspicious and started wondering whether it is actually as good as my friend was claiming. The obvious next step was to gather more information about the movie. I checked the trailer on YouTube, according to which it seemed like a pretty funny movie. Despite the trailer's claims, such as “perhaps the funniest film ever made”, I decided to check the reviews and ratings on IMDb and Rotten Tomatoes, where it was rated below 5. As I trust IMDb ratings more than anything when it comes to choosing movies, I decided to pass on “Attack of the Killer Tomatoes!” and watch it some other time. In real life we mostly follow this kind of reasoning, where we combine various pieces of evidence and come to a conclusion.
Mathematically, we can formulate this process as follows: we treat “Attack of the Killer Tomatoes!” as a binary random variable that can take the value “Good” or “Bad”. Initially I had a recommendation for the movie, so my probability distribution leaned toward the value Good, i.e. P(“Attack of the Killer Tomatoes!” = Good) was 0.7. As we gather data from different sources, we can start inferring whether the movie is good or bad and updating our initial belief. Sadly for me, in the end P(“Attack of the Killer Tomatoes!” = Good) was below 0.5, so I decided not to watch the movie.
A very elegant mathematical framework, Bayes' theorem, given by the English philosopher and statistician Thomas Bayes, allows us to do precisely this kind of reasoning. According to Bayes' rule: \[P(A|B)=\frac{P(A,B)}{P(B)}=\frac{P(A) \times P(B|A)}{P(B)}\]
where \(P(A)\) (the prior distribution) represents our prior knowledge of the world, \(\frac{P(B|A)}{P(B)}\) represents how probable the data is given our prior knowledge (the likelihood, normalized by the evidence), and \(P(A|B)\) represents the posterior, or updated, probability distribution. In simple words, you take what you know about the world (your prior distribution), then take the new data and evaluate how well it conforms with your prior knowledge. Multiply the two distributions and you get an updated belief about the world, which in the next iteration you can use as your new prior distribution/belief.
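To make this concrete, here is a tiny sketch of the movie example in R. The prior of 0.7 comes from the story above; the two likelihood values (how probable a below-5 rating is if the movie is actually good versus actually bad) are made-up numbers, purely for illustration:
#prior belief after my friend's recommendation
prior_good<-0.7
prior_bad<-1-prior_good
#made-up likelihoods: probability of seeing a below-5 rating if the movie is good vs bad
p_low_given_good<-0.2
p_low_given_bad<-0.8
#evidence P(low rating), summing over both possibilities
p_low<-p_low_given_good*prior_good + p_low_given_bad*prior_bad
#posterior via Bayes' theorem
posterior_good<-p_low_given_good*prior_good/p_low
posterior_good
#roughly 0.37, below 0.5, so the movie gets skipped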
To make this clearer, let's first take a simple example that helps visualize the process, and then we will take a toy problem and solve it using Bayesian estimation.
For the purpose of demonstration I am using the R programming language. All the code after this point is written in R.
First, let's import a few libraries we will be using for our examples: ‘ggplot2’ is a library for visualization, and ‘mvtnorm’ provides density and random-number functions for multivariate normal distributions.
library(ggplot2)
library(mvtnorm)
Assume that you have a random variable A which can take values from 1 to 6. You want to know which value of A is the most probable. You think that it is somewhere close to 3.5, and your belief can be described by a normal distribution centred at 3.5 with a standard deviation of 0.5. This distribution can be plotted as follows:
plot(seq(1, 6, 0.01), dnorm(seq(1,6,0.01), 3.5,0.5), xlab = "A", ylab="P(A)", main="Prior Distribution", type="l")
Now you start getting data B from some source. To calculate \(P(A|B)\) we need the distribution \(P(A,B)\), the joint distribution of our prior knowledge and the data. In real life obtaining this distribution is hard, and we normally employ various assumptions and approximation techniques to estimate it. Later in this article, while solving the toy problem, we will use a simple assumption to estimate it. For now, let's assume you already have this distribution and that it can be represented by the 3D plot drawn below. We would like to multiply our prior distribution with this joint distribution to get an updated distribution. But as you can see, our prior is a univariate distribution and \(P(A,B)\) is bivariate, so we cannot directly multiply these two distributions. Instead, we take a slice out of \(P(A,B)\), shown in the graph below as the conditional probability \(P(B|A=3)\).

If you are familiar with the basic rules of probability theory, you already know that the probabilities of all events must sum to 1, i.e. \(\int P(A)\,dA=1\) and \(\int\!\!\int P(A,B)\,dA\,dB=1\); in simple words, the area under the curve/surface of a probability distribution must be 1. Since \(P(A,B)\) is a probability distribution, the area under its surface is 1, so if we take a slice out of it, that slice \(P(B|A=3)\) is no longer a valid probability distribution. Therefore, if we multiplied this slice with our prior we would get an invalid probability distribution. Hence we need to normalize it, dividing by \(P(B)\), so that we get a valid posterior probability distribution.
#parameters of the joint distribution P(A,B): a bivariate normal centred at (4, 4)
mean=c(4,4)
sigma=matrix(c(1,0,0,1), nrow = 2, ncol = 2)
#evaluate the joint density on a 6 x 6 grid
dist<-matrix(rep(1, 36), nrow = 6, ncol = 6)
for(j in c(1:6))
{
  for(k in c(1:6))
  {
    dist[j,k]<-dmvnorm(c(j,k), mean=mean, sigma=sigma)
  }
}
par(mfrow=c(1,2))
xlab<-seq(1,6,1)
ylab<-seq(1,6,1)
persp(x=xlab,y=ylab,
z=dist,
col="lightblue",
theta=130,
phi=20,
r=30,
d=0.1,
expand=1,
shade=0.5,
ticktype="detailed",scale=TRUE, xlab="A", ylab="B", zlab="P(A,B)",
nticks=5, zlim=c(0,0.2), main="Probability Distribution of P(A,B)"
)
#take a slice of the joint distribution at A = 3
slice<-dist[3,]
plot(slice, xlim=c(1,6), type="l", ylab="P(B|A=3)", xlab="B", main="Conditional Probability Distribution P(B|A=3)")
par(mfrow=c(1,1))
So let's summarize what we have covered so far: we had our prior knowledge about the world, we took some data, computed the conditional distribution of the data, and normalized it by the distribution of the data. By plugging these values into Bayes' theorem we get a posterior distribution. In later iterations, when a new piece of data arrives, we treat this posterior as our prior and compute a new posterior.
This recursive process is commonly known as a Recursive Bayesian Filter.
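To make the recursion concrete, here is a minimal sketch of one such update for the small example above, written independently of the plotting code. Purely as an assumption for illustration, we take the likelihood to be a normal distribution centred on the true value of A with a standard deviation of 1, and we pretend we observed the value B = 3:
#grid of possible values of A
a_values<-1:6
#prior belief over A: normal around 3.5 with sd 0.5, normalized over the grid
prior<-dnorm(a_values, 3.5, 0.5)
prior<-prior/sum(prior)
#assumed likelihood of the observation B = 3 for each possible value of A
b_observed<-3
likelihood<-dnorm(b_observed, mean=a_values, sd=1)
#Bayes update: multiply prior and likelihood, then normalize
posterior<-prior*likelihood
posterior<-posterior/sum(posterior)
posterior
#this posterior becomes the prior when the next observation arrives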
Now let's describe a toy problem and try to solve it using a Recursive Bayesian Filter.
I am an ardent fan of the Iron Man movies, so I will take an example in which we help Iron Man kill the bad guy: Whiplash. Imagine that there is a wall 40 ft high and 40 ft wide, and Whiplash is hiding behind it. During his duel with Whiplash, Iron Man's infrared sensor got damaged, and he is left with a single bullet with which he wants to kill Whiplash. Because the IR sensor is damaged, Iron Man is getting erroneous heat signatures on the wall. He therefore decides to take 100 readings and send them to Jarvis. Now, as Jarvis, it is our job to figure out precisely where Whiplash is hiding.
So let's see how we can help him kill Whiplash using the Recursive Bayesian Estimation we discussed earlier.
In the code block below we perform the following steps:
1. Create a matrix of size 41 × 41, representing the wall behind which Whiplash is hiding.
2. Record the true location of Whiplash, which Iron Man does not know; in the plot this location is shown as a red dot.
3. Assume that the readings from the infrared sensor are noisy, and sample 100 readings from a normal distribution with mean at the true location and a variance of 10.
#CREATE A WALL
wall<-matrix(rep(1, 1681), nrow = 41, ncol = 41)
#LET'S SAY THAT THE TRUE LOCATION OF WHIPLASH IS AT COORDINATE (20, 20) ON THE WALL
trueCol<-20
trueRow<-20
# 100 readings created by the infrared sensor: we sample 100 points from a bivariate normal distribution with mean centred on the true location of Whiplash; because of the damage, the sensor now has a variance of 10 in each coordinate.
data<-as.data.frame((rmvnorm(100, mean=c(trueRow, trueCol), sigma= matrix(c(10,0,0,10), nrow = 2, ncol = 2))))
#Lets plot this
ggplot(data, aes(x=V1, y=V2))+geom_point(shape=7, col="Blue")+geom_point(aes(x=trueRow, y=trueCol), col="RED", size=5)+scale_x_continuous(limits=c(0,40))+scale_y_continuous(limits=c(0,40))+labs(title="WALL", x="X Coordinate", y="Y Coordinate")
So the true location of Whiplash is the red dot on the wall, but after taking 100 readings from the infrared sensor, Iron Man can only see the blue dots. Our job now is to estimate the position of the red dot by looking at these 100 blue dots.
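(As a quick sanity check before doing anything Bayesian, we could simply average the 100 noisy readings; the recursive estimate built below instead processes the readings one at a time and keeps a full probability distribution over the wall.)
#crude point estimate: the mean of the 100 noisy readings
colMeans(data)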
To start using Recursive Bayesian estimation we need a prior distribution representing our initial knowledge about the world. Here we can either assume that Iron Man knows nothing about the location of Whiplash and use a uniform distribution (i.e. he assumes Whiplash can be anywhere with equal probability), or, if he saw Whiplash disappear behind the left side of the wall, we can create a distribution incorporating that knowledge. To keep things simple, for now I will assume that he knows nothing about the location of Whiplash. (Update: I have also included the code to create a prior in which Iron Man thinks that Whiplash is somewhere around the coordinate (30, 35).)
#initial distribution representing a uniform prior distribution
initial_dist<-wall/sum(wall)
#comment the line above and uncomment the block below to centre the initial belief around the coordinate (30, 35)
# mean=c(30,35)
# sigma=matrix(c(1,0,0,1), nrow = 2, ncol = 2)
# initial_dist<-wall
# for(j in c(1:41))
# {
#   for(k in c(1:41))
#   {
#     #matrix index j corresponds to wall coordinate j-1, hence the +1 shift on the mean
#     initial_dist[j,k]<-dmvnorm(c(j,k), mean=mean+1, sigma=sigma)
#   }
# }
# #normalize so the prior sums to 1
# initial_dist<-initial_dist/sum(initial_dist)
#display the plot
persp(x=seq(0,40,1), y=seq(0,40,1),
z=initial_dist,
col="lightblue",
theta=30,
phi=20,
r=30,
d=0.1,
expand=1,
shade=0.5,
ticktype="detailed",scale=TRUE, xlab="x Coordinate", ylab="y coordinate", zlab="Probability",
nticks=5, zlim=c(-0.1,1), main="Surface plot of probability distribution"
)
As you can see, the distribution is uniform and the probability of Whiplash being at any point on the wall is the same.
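(A quick check that this prior really is a valid probability distribution: every one of the 41 × 41 cells carries the same probability, and everything sums to one.)
#sanity check on the uniform prior
sum(initial_dist)      #should be 1
initial_dist[1,1]      #1/1681, the same for every cell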
Now let's take this prior distribution and start our recursive Bayesian estimation. We perform the following steps to get a posterior distribution:
1. Take each data point and create a normal distribution around it. (Earlier in this article I mentioned that estimating the joint or conditional distribution is difficult in real life and that we employ various assumptions and approximations. Here we assume that the conditional distribution is simply a normal distribution with mean at the location of the data point and a variance of 20 in each coordinate.)
2. Multiply this distribution around the data point with our prior distribution to get a posterior distribution.
3. The result is no longer a valid probability distribution, so normalize it.
4. Make this posterior the new prior and repeat for all the data points.
The output of this part is a video in which you can clearly see the result of each iteration as a surface plot and a heat map.
At the end of iterating through all 100 data points, you can clearly see that we have estimated the location of Whiplash to be near (20, 20), since the probability around (20, 20) is the highest.
#convert data to a matrix
data<-data.matrix(data, rownames.force=NA)
#create a matrix representing posterior distribution, initially same as prior distribution
final_dist<-initial_dist
#iterate through all the data points
for( i in (1:nrow(data)))
{
# make the posterior distribution the new prior (on the first iteration this simply keeps the uniform prior)
initial_dist<-final_dist
#define mean of the distribution as the data point
mean=c(data[i,1],data[i,2])
#and covariance matrix
sigma= matrix(c(20,0,0,20), nrow = 2, ncol = 2)
for(j in c(1:41))
{
  for(k in c(1:41))
  {
    #evaluate the assumed conditional (normal) density at this cell;
    #matrix index j corresponds to wall coordinate j-1, hence the +1 shift on the mean
    pd<-dmvnorm(c(j,k), mean=mean+1, sigma=sigma)
    #multiply the density with our prior distribution and store the result in the posterior distribution
    final_dist[j,k]<-initial_dist[j,k]*pd
  }
}
#normalize the posterior distribution
final_dist=final_dist/sum(final_dist)
#Display the plots
#display 2 graphs in same window
par(mfrow=c(1,2))
sub=paste("Point Number : ",i)
# Surface plot for posterior distribution
persp(x=seq(0,40,1), y=seq(0,40,1),
z=final_dist,
col="lightblue",
theta=30,
phi=20,
r=30,
d=0.1,
expand=1,
shade=0.5,
ticktype="detailed",scale=TRUE, xlab="x coordinate", ylab="y coordinate", zlab="Probability", sub=sub,
nticks=5, zlim=c(-0.1,1), main="Surface plot of posterior distribution",
)
#heatmap of posterior distribution
image(x=seq(0,40,1), y=seq(0,40,1),z=final_dist, xlab="x Coordinate", ylab="y coordinate",main="Heat map of posterior distribution")
par(mfrow=c(1,1))
}
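Once the loop has consumed all 100 readings, we can read a point estimate off the posterior, for example the cell with the highest probability. A small sketch, using the final_dist matrix produced by the loop above (remember that matrix index i corresponds to wall coordinate i - 1):
#most probable cell of the final posterior (the maximum a posteriori estimate)
map_index<-arrayInd(which.max(final_dist), dim(final_dist))
#convert matrix indices (1..41) back to wall coordinates (0..40)
map_index-1
#this should be close to the true location (20, 20)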