
Predictive Analytics: Likelihood

Rasim Muzaffer Musal

Review: Likelihood

  • To understand the next set of slides, we will first go over some important terminology and notation.

  • Imagine observing values (Y) that can be designated as success/failure (loan default/not, defective/non-defective, etc.).

\[ Y=\{0,1,1,0,0,1\} \]

Generating Functions

  • If we want to predict or infer about Y, we should have an idea about \(g(y)\), the generating function of Y.

  • Unless you do simulations, in most real-world problems you will not know \(g(y)\).

  • This will be an important consideration when choosing between AIC and BIC.

  • AIC is optimal for predicting new data and BIC is optimal for identifying \(g(y)\). We will discuss this if we have time.

Review: Bernoulli Likelihood

  • The likelihood is the joint probability distribution of Y given the parameter(s).

  • We estimate the parameter(s) of the likelihood in order to understand what drives Y’s values.

  • Assumptions lie behind the choice of the likelihood.

  • The easiest assumption for \(Y=\{0,1,1,0,0,1\}\) is i.i.d. Bernoulli. \[ \begin{aligned} P(Y=0) \times P(Y=1) \times P(Y=1) \times & \\ P(Y=0) \times P(Y=0) \times P(Y=1) & \end{aligned} \]

Review: Bernoulli Likelihood

  • If you recall, the Bernoulli distribution has the form \[ \begin{aligned} P(Y=y)=p^{y} \times (1-p)^{1-y} \end{aligned} \]
  • Therefore with the i.i.d. assumption the above form reduces to

\[ \begin{aligned} (1-p) \times p \times p \times (1-p) \times (1-p) \times p \end{aligned} \]

  • Any time you enter a valid value for \(p\) you obtain a likelihood.
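  • As a quick check in R, dbinom with size = 1 evaluates this Bernoulli pmf (p = 0.3 is an arbitrary illustrative value):

# P(Y=0) and P(Y=1) for a Bernoulli r.v. with p = 0.3
dbinom(c(0,1), size=1, prob=0.3)
[1] 0.7 0.3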

Review: Bernoulli Likelihood

  • The above form can be written as \[ \begin{aligned} p^{3} \times (1-p)^{3} \end{aligned} \]

  • So if \(p=0.1\) the above form is evaluated via \[0.1^{3} \times (0.9)^{3} = 0.000729\]

  • And if \(p=0.2\) the above form is evaluated via \[0.2^{3} \times (0.8)^{3} = 0.004096\]
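  • The same two evaluations in R, as a one-line check of the arithmetic above:

# Evaluate p^3 * (1-p)^3 at the two hypothetical values
p=c(0.1,0.2)
p^3*(1-p)^3
[1] 0.000729 0.004096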

Review: Bernoulli Likelihood

  • Since the above is calculated for a discrete r.v., what we calculated is \(P(Y_{1},\ldots,Y_{6}|p)=L(Y_{1},\ldots,Y_{6};p)\).

  • If the data is continuous, the likelihood is the joint density: \(f(Y_{1},\ldots,Y_{6}|p)=L(Y_{1},\ldots,Y_{6};p)\).

  • Recall that if the random variable is continuous, the likelihood is not a probability; it is the product of the density heights.
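  • One consequence worth seeing in R: density heights are not probabilities and can exceed 1 (the sd = 0.1 Normal below is an illustrative choice):

# A Normal density with a small sd has heights well above 1,
# so a product of densities is not a probability
dnorm(0, mean=0, sd=0.1)
[1] 3.989423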

Review: Bernoulli Likelihood

  • Since \(P(Y|p=0.2)>P(Y|p=0.1)\), we can claim that \(p=0.2\) is a parameter value that fits the data better.

  • It should then be self-evident that we would want to identify the value of \(p\) which maximizes the likelihood.

  • Finding the parameter value(s) which give the highest likelihood value for a given class of models (e.g., Bernoulli i.i.d.) is referred to as finding the maximum likelihood estimate(s).

Review: Bernoulli Likelihood

  • In short:

  • The likelihood in the case of i.i.d. variables has been evaluated via \[ \begin{aligned} \text{Likelihood} = P(Y|p) = \prod_{i=1}^{n} P(Y=y_{i}|p) \end{aligned} \]

  • So we can create a vector of hypothetical \(p\) values and identify which value of \(p\) maximizes the likelihood.

Evaluating Bernoulli Likelihoods

library(ggplot2)
# p is going to be the vector of hypothetical values that go from 0 to 1 in increments of 0.01.
p=seq(from=0,to=1,by=0.01)
# Vectorized operations: we do not have to declare the Likelihood
# vector; the expression below is evaluated for each element of p

Likelihood=p^3*(1-p)^3
# Find the index at which the likelihood is maximized
which.max(Likelihood)
[1] 51
#The likelihood value itself
Likelihood[which.max(Likelihood)]
[1] 0.015625
#Find the p value which maximizes Likelihood
p[which.max(Likelihood)]
[1] 0.5
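  • The grid search agrees with the closed form: for i.i.d. Bernoulli data the MLE of \(p\) is the sample proportion of successes.

# 3 successes out of 6 observations
Y=c(0,1,1,0,0,1)
mean(Y)
[1] 0.5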

Plot of Likelihood and p

# ggplot2 expects a data frame, so use data.frame() rather than
# cbind(), which would return a matrix
pandLikelihood=data.frame(p,Likelihood)
ggplot(data=pandLikelihood)+geom_point(aes(y=Likelihood,x=p))

Likelihood of Normal Distribution

  • The workhorse distribution of regression models is the Normal/Gaussian.

  • The Normal distribution has the form for r.v. Z: \[ \begin{aligned} f(Z=z|\mu,\sigma)=\frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(z-\mu)^{2}}{2\sigma^{2}}} \end{aligned} \]

  • The parameters of the distribution are \(\mu\) and \(\sigma\).

  • In standard regression we model \(\mu\) and estimate \(\sigma\) from data.
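  • As a quick check, coding the formula directly agrees with R’s dnorm (z = 1.3, \(\mu=0\), \(\sigma=1\) are arbitrary illustrative values):

# Density formula written out by hand
z=1.3; mu=0; sigma=1
1/(sigma*sqrt(2*pi))*exp(-(z-mu)^2/(2*sigma^2))
[1] 0.1713686
# Built-in equivalent
dnorm(z, mean=mu, sd=sigma)
[1] 0.1713686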

Likelihood of Linear Regression Models

  • Assume you observed the following set of values \(Z=\{3,1,7.2,-4,5\}\).

  • The likelihood is \(f(Z|\mu,\sigma)\), assuming the observations are i.i.d. Gaussian/Normal.

  • Assume initially that \(\mu\) is 0 and \(\sigma\) is 1.

Calculating the likelihood via Normal Dist.

  • Calculate the likelihood for the 5 values of Z assuming a normal distribution with mean 0 and sd 1.
  • Z=[3,1,7.2,-4,5], dnorm(Z,mean=0,sd=1)
  • The densities are the y-axis values associated with the red dots corresponding to the Z values (see the sketch below).
  • The likelihood is the multiplication of all the densities (heights).

Assuming Gaussian Density

[Figure: standard Normal density curve with the five Z values marked as red dots]

Calculating the likelihood for the 5 observations

[Figure: the density heights at the five Z values, whose product is the likelihood]
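  • The figures are not reproduced here; a minimal ggplot2 sketch of the described plot (the standard Normal density curve with the five Z values marked as red dots) might look like:

Z=c(3,1,7.2,-4,5)
# Grid of z values for drawing the density curve
curve_df=data.frame(z=seq(-6,8,by=0.01))
# Density heights at the observed Z values
points_df=data.frame(z=Z,height=dnorm(Z,mean=0,sd=1))
ggplot()+
  geom_line(data=curve_df,aes(x=z,y=dnorm(z,mean=0,sd=1)))+
  geom_point(data=points_df,aes(x=z,y=height),color="red")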

Comparing Likelihoods

  • Which of the two distributions supports the data better?

Floating Point Precision

  • To formally compare model likelihoods we need the product of \(N\) density values. Since this product tends to approach 0 or infinity as the number of data points increases, we work with the log likelihood in our calculations.
# The largest exponent of 10 that can be evaluated at the time of writing on my system.  
 
10^308
[1] 1e+308
# Resulting Inf
10^309
[1] Inf
# Can you see the problem?
10^309==10^310
[1] TRUE

Calculating the likelihood in R

# Recall log(a^b)=b*log(a) so we can use the equality on both
#sets of numbers and compare whether they are equal.
309*log(10)==310*log(10)
[1] FALSE
# A similar issue occurs as the numbers get smaller and approach 0 (underflow).

# Values of the variable
Z=c(3,1,7.2,-4,5)
# Evaluating the density
dnorm(Z,mean=0,sd=1)
[1] 4.431848e-03 2.419707e-01 2.207990e-12 1.338302e-04 1.486720e-06
# Likelihood
prod(dnorm(Z,mean=0,sd=1))
[1] 4.711162e-25

Calculating the likelihood in R

  • There are cases where taking the exponential of the sum of logs might be necessary to recover the original value.
# Log Likelihood recall log(a*b)=log(a)+log(b)
sum(dnorm(Z,mean=0,sd=1,log=TRUE))
[1] -56.01469
#and of course since a*b=exp(log(a)+log(b))
exp(sum(dnorm(Z,mean=0,sd=1,log=TRUE)))
[1] 4.711162e-25
  • However, in most cases we just need to remember that if \(\log(L_{2}) > \log(L_{1})\) then \(L_{2} > L_{1}\).
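  • A quick demonstration of the underflow problem the log scale avoids (1e-200 is an arbitrary tiny value):

x=rep(1e-200,5)
# The direct product underflows to 0
prod(x)
[1] 0
# The sum of logs stays finite
sum(log(x))
[1] -2302.585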

Comparing log Likelihoods

  • To make decisions, comparing plots of likelihoods is not ideal; we compare the log likelihoods directly.
  • Let L1 be the log likelihood when mean is 0 and sd is 1.
  • Let L2 be the log likelihood when mean is mean of Z and sd is sd of Z.
L1=sum(dnorm(Z,mean=0,sd=1,log=TRUE))
L1
[1] -56.01469
L2=sum(dnorm(Z,mean=mean(Z),sd=sd(Z),log=TRUE))
L2
[1] -13.85757
L2>L1
[1] TRUE

Simulating a Linear Regression

  • Note that the mean of the predictions is the mean of the target variable.
# We will simulate independent variable X1
X1=rnorm(10000,mean=0,sd=1)
# Simulate dependent variable Y1
Y1=0.7*X1+rnorm(10000,0,1)
# Correlation between the two variables
cor(Y1,X1)
[1] 0.5699093
# Create a simple linear regression object
object=lm(Y1~X1)
# Mean of Y is
mean(Y1)
[1] -0.009112675
# Mean of the predicted values is
mean(predict(object))
[1] -0.009112675
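  • This is not a coincidence: with an intercept in the model, the OLS residuals sum to zero, which forces the mean of the fitted values to equal the mean of Y1. A quick check:

# Residuals sum to (essentially) zero, up to floating point error
sum(residuals(object))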

Evaluating the Maximum Likelihood of the Model

# Predictions in the object
Y_pred=predict(object)
#Mean of the dependent variable
mean_Y1 = mean(Y1)
# Y1-predict(object) could equivalently be computed with residuals(object)
var_YPred=mean(((Y1-predict(object))^2))

# The maximized log likelihood: mean and sd are the ML estimates.
# Note that Y1 and Y_pred are vectors!
sum(dnorm(Y1,mean=Y_pred,sd=sqrt(var_YPred),log=TRUE))
[1] -14207.43
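  • As a cross-check, this should essentially match R’s built-in logLik for the lm object, since (if I recall the implementation correctly) logLik for lm plugs in the same ML variance estimate, the residual sum of squares divided by n:

# Should agree with the manual sum above, approximately -14207.43
logLik(object)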
  • We will now pick different values for \(\mu\) and \(\sigma\) and see different values for the likelihood.

Evaluating other Likelihoods for regression

#Other likelihood values
sum(dnorm(Y1,mean=Y_pred,sd=sqrt(var_YPred)-0.5,log=TRUE))
[1] -22222.06
#Other likelihood values
sum(dnorm(Y1,mean=Y_pred,sd=sqrt(var_YPred)+0.5,log=TRUE))
[1] -15480.96
#Other likelihood values
sum(dnorm(Y1,mean=Y_pred-1,sd=sqrt(var_YPred),log=TRUE))
[1] -19189.42
# Shifting the mean by -1 or +1 gives the same value because the
# OLS residuals sum to zero
sum(dnorm(Y1,mean=Y_pred+1,sd=sqrt(var_YPred),log=TRUE))
[1] -19189.42

Summary:

  • For probabilistic models, the likelihood is a measure of how well the model fits the data.

  • The larger the likelihood, the better the fit.

  • Most measures of fit based on the likelihood have a negative multiplier (e.g., \(-2\log L\)).

  • Therefore, for these measures we select the model that has the minimum negative log likelihood.
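  • As a concrete illustration, AIC and BIC follow this pattern: both are \(-2\log L\) plus a penalty term, so smaller values are better. A minimal sketch using the lm object fitted earlier:

# AIC = -2*logLik + 2*k and BIC = -2*logLik + log(n)*k,
# where k is the number of estimated parameters
# (here k = 3: intercept, slope, and sigma)
AIC(object)
BIC(object)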