
Predictive Analytics: Likelihood

Rasim Muzaffer Musal

Review: Likelihood

  • To understand the next set of slides, we will first go over some important terminology and notation.

  • Imagine observing values (Y) that can be designated as success/failure (loan default/not, defective/non-defective, etc.).

\[ Y=\{0,1,1,0,0,1\} \]

Generating Functions

  • If we want to predict or infer about Y, we should have an idea about \(g(y)\), the generating function of Y.

  • Unless you do simulations, in most real-world problems you will not know \(g(y)\).

  • This will be an important consideration when choosing between AIC and BIC.

  • AIC is optimal for predicting new data and BIC is optimal for identifying \(g(y)\). We will discuss this if we have time.

Review: Bernoulli Likelihood

  • The likelihood is the joint probability distribution of Y given the parameter(s).

  • We estimate the parameter(s) of the likelihood in order to understand what drives Y’s values.

  • Assumptions lie behind the choice of the likelihood.

  • The easiest assumption for \(Y=\{0,1,1,0,0,1\}\) is i.i.d. Bernoulli. \[ \begin{aligned} P(Y=0) \times P(Y=1) \times P(Y=1) \times & \\ P(Y=0) \times P(Y=0) \times P(Y=1) & \end{aligned} \]

Review: Bernoulli Likelihood

  • If you recall, the Bernoulli distribution has the form \[ \begin{aligned} P(Y=y)=p^{y} \times (1-p)^{1-y} \end{aligned} \]
  • Therefore with the i.i.d. assumption the above form reduces to

\[ \begin{aligned} (1-p) \times p \times p \times (1-p) \times (1-p) \times p \end{aligned} \]

  • Any time you enter a valid value for \(p\) you obtain a likelihood.
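  • As a quick check in R, dbinom with size = 1 evaluates this Bernoulli pmf (p = 0.3 is an arbitrary illustrative value):

# P(Y=0) and P(Y=1) for a Bernoulli r.v. with p = 0.3
dbinom(c(0,1), size=1, prob=0.3)
[1] 0.7 0.3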

Review: Bernoulli Likelihood

  • The above form can be written as \[ \begin{aligned} p^{3} \times (1-p)^{3} \end{aligned} \]

  • So if \(p=0.1\) the above form is evaluated via \[0.1^{3} \times (0.9)^{3} = 0.000729\]

  • And if \(p=0.2\) the above form is evaluated via \[0.2^{3} \times (0.8)^{3} = 0.004096\]
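  • The same two evaluations in R, as a one-line check of the arithmetic above:

# Evaluate p^3 * (1-p)^3 at the two hypothetical values
p=c(0.1,0.2)
p^3*(1-p)^3
[1] 0.000729 0.004096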

Review: Bernoulli Likelihood

  • Since the above is calculated for a discrete r.v., what we calculated is \(P(Y_{1},\ldots,Y_{6}|p)=L(Y_{1},\ldots,Y_{6};p)\).

  • If the data is continuous, the likelihood is the joint density: \(f(Y_{1},\ldots,Y_{6}|p)=L(Y_{1},\ldots,Y_{6};p)\).

  • Recall that if the random variable is continuous, the likelihood is not a probability; it is the product of the density heights.
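  • One consequence worth seeing in R: density heights are not probabilities and can exceed 1 (the sd = 0.1 Normal below is an illustrative choice):

# A Normal density with a small sd has heights well above 1,
# so a product of densities is not a probability
dnorm(0, mean=0, sd=0.1)
[1] 3.989423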

Review: Bernoulli Likelihood

  • Since \(P(Y|p=0.2)>P(Y|p=0.1)\), we can claim that \(p=0.2\) is a parameter value that fits the data better.

  • It should then be self-evident that we would want to identify the value of \(p\) which maximizes the likelihood.

  • Finding the parameter value(s) which give the highest likelihood value for a given class of models (e.g., Bernoulli i.i.d.) is referred to as finding the maximum likelihood estimate(s).

Review: Bernoulli Likelihood

  • In short:

  • The likelihood in the case of i.i.d. variables has been evaluated via \[ \begin{aligned} \text{Likelihood} = P(Y|p) = \prod_{i=1}^{n} P(Y=y_{i}|p) \end{aligned} \]

  • So we can create a vector of hypothetical \(p\) values and identify which value of \(p\) maximizes the likelihood.

Evaluating Bernoulli Likelihoods

library(ggplot2)
# p is going to be the vector of hypothetical values that go from 0 to 1 in increments of 0.01.
p=seq(from=0,to=1,by=0.01)
# Vectorized operations: we do not have to declare the Likelihood
# vector; the expression below is evaluated for each element of p

Likelihood=p^3*(1-p)^3
# Find the index at which the likelihood is maximized
which.max(Likelihood)
[1] 51
#The likelihood value itself
Likelihood[which.max(Likelihood)]
[1] 0.015625
#Find the p value which maximizes Likelihood
p[which.max(Likelihood)]
[1] 0.5
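  • The grid search agrees with the closed form: for i.i.d. Bernoulli data the MLE of \(p\) is the sample proportion of successes.

# 3 successes out of 6 observations
Y=c(0,1,1,0,0,1)
mean(Y)
[1] 0.5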

Plot of Likelihood and p

# ggplot2 expects a data frame, so use data.frame() rather than
# cbind(), which would return a matrix
pandLikelihood=data.frame(p,Likelihood)
ggplot(data=pandLikelihood)+geom_point(aes(y=Likelihood,x=p))

Likelihood of Normal Distribution

  • The workhorse distribution of regression models is the Normal/Gaussian.

  • The Normal distribution has the form for r.v. Z: \[ \begin{aligned} f(Z=z|\mu,\sigma)=\frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(z-\mu)^{2}}{2\sigma^{2}}} \end{aligned} \]

  • The parameters of the distribution are \(\mu\) and \(\sigma\).

  • In standard regression we model \(\mu\) and estimate \(\sigma\) from data.
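  • As a quick check, coding the formula directly agrees with R’s dnorm (z = 1.3, \(\mu=0\), \(\sigma=1\) are arbitrary illustrative values):

# Density formula written out by hand
z=1.3; mu=0; sigma=1
1/(sigma*sqrt(2*pi))*exp(-(z-mu)^2/(2*sigma^2))
[1] 0.1713686
# Built-in equivalent
dnorm(z, mean=mu, sd=sigma)
[1] 0.1713686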

Likelihood of Linear Regression Models

  • Assume you observed the following set of values \(Z=\{3,1,7.2,-4,5\}\).

  • The likelihood is \(f(Z|\mu,\sigma)\), assuming the observations are i.i.d. Gaussian/Normal.

  • Assume initially that \(\mu\) is 0 and \(\sigma\) is 1.

Calculating the likelihood via Normal Dist.

  • Calculate the likelihood for the 5 values of Z assuming a normal distribution with mean 0 and sd 1.
  • Z=[3,1,7.2,-4,5], dnorm(Z,mean=0,sd=1)
  • The densities are the y-axis values associated with the red dots corresponding to the Z values (see the sketch below).
  • The likelihood is the multiplication of all the densities (heights).

Assuming Gaussian Density

[Figure: standard Normal density curve with the five Z values marked as red dots]

Calculating the likelihood for the 5 observations

[Figure: the density heights at the five Z values, whose product is the likelihood]
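  • The figures are not reproduced here; a minimal ggplot2 sketch of the described plot (the standard Normal density curve with the five Z values marked as red dots) might look like:

Z=c(3,1,7.2,-4,5)
# Grid of z values for drawing the density curve
curve_df=data.frame(z=seq(-6,8,by=0.01))
# Density heights at the observed Z values
points_df=data.frame(z=Z,height=dnorm(Z,mean=0,sd=1))
ggplot()+
  geom_line(data=curve_df,aes(x=z,y=dnorm(z,mean=0,sd=1)))+
  geom_point(data=points_df,aes(x=z,y=height),color="red")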

Comparing Likelihoods

  • Which of the two distributions supports the data better?

Floating Point Precision

  • To formally compare model likelihoods we need the product of \(N\) density values. Since this product tends to approach 0 or infinity as the number of data points increases, we work with the log likelihood in our calculations.
# The largest exponent of 10 that can be evaluated at the time of writing on my system.  
 
10^308
[1] 1e+308
# Resulting Inf
10^309
[1] Inf
# Can you see the problem?
10^309==10^310
[1] TRUE

Calculating the likelihood in R

# Recall log(a^b)=b*log(a) so we can use the equality on both
#sets of numbers and compare whether they are equal.
309*log(10)==310*log(10)
[1] FALSE
# A similar issue occurs as the numbers get smaller and approach 0 (underflow).

# Values of the variable
Z=c(3,1,7.2,-4,5)
# Evaluating the density
dnorm(Z,mean=0,sd=1)
[1] 4.431848e-03 2.419707e-01 2.207990e-12 1.338302e-04 1.486720e-06
# Likelihood
prod(dnorm(Z,mean=0,sd=1))
[1] 4.711162e-25

Calculating the likelihood in R

  • There are cases where taking the exponential of the sum of logs might be necessary to recover the original value.
# Log Likelihood recall log(a*b)=log(a)+log(b)
sum(dnorm(Z,mean=0,sd=1,log=TRUE))
[1] -56.01469
#and of course since a*b=exp(log(a)+log(b))
exp(sum(dnorm(Z,mean=0,sd=1,log=TRUE)))
[1] 4.711162e-25
  • However, in most cases we just need to remember that if \(\log(L_{2}) > \log(L_{1})\) then \(L_{2} > L_{1}\).
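  • A quick demonstration of the underflow problem the log scale avoids (1e-200 is an arbitrary tiny value):

x=rep(1e-200,5)
# The direct product underflows to 0
prod(x)
[1] 0
# The sum of logs stays finite
sum(log(x))
[1] -2302.585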

Comparing log Likelihoods

  • To make decisions, comparing plots of likelihoods is not ideal; we compare the log likelihoods directly.
  • Let L1 be the log likelihood when mean is 0 and sd is 1.
  • Let L2 be the log likelihood when mean is mean of Z and sd is sd of Z.
L1=sum(dnorm(Z,mean=0,sd=1,log=TRUE))
L1
[1] -56.01469
L2=sum(dnorm(Z,mean=mean(Z),sd=sd(Z),log=TRUE))
L2
[1] -13.85757
L2>L1
[1] TRUE

Simulating a Linear Regression

  • Note that the mean of the predictions is the mean of the target variable.
# We will simulate independent variable X1
X1=rnorm(10000,mean=0,sd=1)
# Simulate dependent variable Y1
Y1=0.7*X1+rnorm(10000,0,1)
# Correlation between the two variables
cor(Y1,X1)
[1] 0.5699093
# Create a simple linear regression object
object=lm(Y1~X1)
# Mean of Y is
mean(Y1)
[1] -0.009112675
# Mean of the predicted values is
mean(predict(object))
[1] -0.009112675
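  • This is not a coincidence: with an intercept in the model, the OLS residuals sum to zero, which forces the mean of the fitted values to equal the mean of Y1. A quick check:

# Residuals sum to (essentially) zero, up to floating point error
sum(residuals(object))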

Evaluating the Maximum Likelihood of the Model

# Predictions in the object
Y_pred=predict(object)
#Mean of the dependent variable
mean_Y1 = mean(Y1)
# Y1-predict(object) could equivalently be computed with residuals(object)
var_YPred=mean(((Y1-predict(object))^2))

# The maximized log likelihood: mean and sd are the ML estimates.
# Note that Y1 and Y_pred are vectors!
sum(dnorm(Y1,mean=Y_pred,sd=sqrt(var_YPred),log=TRUE))
[1] -14207.43
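  • As a cross-check, this should essentially match R’s built-in logLik for the lm object, since (if I recall the implementation correctly) logLik for lm plugs in the same ML variance estimate, the residual sum of squares divided by n:

# Should agree with the manual sum above, approximately -14207.43
logLik(object)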
  • We will now pick different values for \(\mu\) and \(\sigma\) and see different values for the likelihood.

Evaluating other Likelihoods for regression

#Other likelihood values
sum(dnorm(Y1,mean=Y_pred,sd=sqrt(var_YPred)-0.5,log=TRUE))
[1] -22222.06
#Other likelihood values
sum(dnorm(Y1,mean=Y_pred,sd=sqrt(var_YPred)+0.5,log=TRUE))
[1] -15480.96
#Other likelihood values
sum(dnorm(Y1,mean=Y_pred-1,sd=sqrt(var_YPred),log=TRUE))
[1] -19189.42
# Shifting the mean by -1 or +1 gives the same value because the
# OLS residuals sum to zero
sum(dnorm(Y1,mean=Y_pred+1,sd=sqrt(var_YPred),log=TRUE))
[1] -19189.42

Summary:

  • For probabilistic models, the likelihood is a measure of how well the model fits the data.

  • The larger the likelihood, the better the fit.

  • Most measures of fit based on the likelihood have a negative multiplier (e.g., \(-2\log L\)).

  • Therefore, for these measures we select the model that has the minimum negative log likelihood.
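  • As a concrete illustration, AIC and BIC follow this pattern: both are \(-2\log L\) plus a penalty term, so smaller values are better. A minimal sketch using the lm object fitted earlier:

# AIC = -2*logLik + 2*k and BIC = -2*logLik + log(n)*k,
# where k is the number of estimated parameters
# (here k = 3: intercept, slope, and sigma)
AIC(object)
BIC(object)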