Background

How do we infer sample data to population parameters? One way is to use maximum likelihood estimation method.
Let us consider a set of observations \(y_1, y_2, ..., y_n\), the likelihood function \(L\) is the join probability density of obtaining the data that are actually observed.
It is considered as a function of unknown parameters with the observed data held fixed.
The maximum likelihood estimators are those values of the parameters for which the data actually observed are most likely. In other words, the values that maximizes the likelihood function.

Background

We have i.i.d sample of data, assuming that the data is generated from some parametric density function or the probability mass function which is indexed by parameter \(\theta\).
Then we construct a likelihood function using the assumption that the data are iid which is simply the product of density or pmf evaluated at each sample data points \(y_i\).
We take the log of that function simply because it is computationally easier among other reasons.
We may have infinite number of possible parameter values, and we want to choose the one that best fits the data. In order to do this, we need to maximize the log likelihood function.

Generating data

library(dplyr)
library(plotly)
N = 1000
beta = 5
sigma_2 = 5
X = rnorm(N, 0, 5)
Z = rnorm(N, 0, sqrt(sigma_2))
Y = beta*X + Z

DT = data.table::data.table(X, Y, Z)
head(DT)

##             X          Y          Z
## 1: -5.1031125 -22.608491  2.9070710
## 2:  2.3169936  11.783020  0.1980519
## 3:  3.6623491  19.698005  1.3862593
## 4:  5.4599034  28.181253  0.8817356
## 5: -1.3758629  -8.999021 -2.1197066
## 6:  0.4845064   4.092999  1.6704669

Kernel density Estimation

a non-parametric method to estimate the probability density function of a random variable based on kernels as weights
provides a way to see the shape of data

Deriving the log likelihood function

log_like = function(theta, Y, X) {
  X = as.matrix(X)
  Y = as.matrix(Y)
  N = nrow(X)
  beta = theta[1]
  sigma_2 = theta[2]
  e = Y - beta*X
  log_lik = -0.5*N*log(2*pi) - 0.5*N*log(sigma_2) - ((t(e) %*% e)/(2*sigma_2))
  return(-log_lik)
}

Deriving log likelihood values

log_like_plot = function(beta, sigma_2) {
  X = as.matrix(DT$X)
  Y = as.matrix(DT$Y)
  N = nrow(X)
  e = Y - beta*X
  log_lik = -0.5*N*log(2*pi) - 0.5*N*log(sigma_2) - ((t(e) %*% e)/(2*sigma_2))
  return(log_lik)

}

Vectorizing log likelihood values

log_like_plot = Vectorize(log_like_plot)
beta_vals= seq(-10,10,by=1)
sigma_2_vals = seq(1,10,by=1)
log_vals = outer(beta_vals, sigma_2_vals,log_like_plot)

Plotting log likelihood function

The surface in the plot summarizes all the log likelihood values for different beta and sigma values and what we want to do is find the values of beta and sigma squared that hits the highest part of the surface.
In other words, we want to get different combination of beta and sigma squared values that maximizes the log likelihood function.

Finding MLE Estimates

library(knitr)

MLE_Estimates = optim(fn = log_like,
                      par = c(1,1),
                      lower = c(-Inf, -Inf),
                      upper = c(Inf, Inf),
                      hessian = TRUE,
                      method = "L-BFGS-B",
                      Y = DT$Y,
                      X = DT$X
                      )

Examining Estimates

MLE_par = MLE_Estimates$par
MLE_SE = sqrt(diag(solve(MLE_Estimates$hessian)))
MLE = data.table::data.table(param = c("beta", "sigma_2"),
                             estimates = MLE_par,
                             sd = MLE_SE)
kable(MLE)

param	estimates	sd
beta	4.993428	0.0146031
sigma_2	5.156092	0.2305873

We can see that MLE for beta is close to the true value of beta but sigma squared is quite far away and standard deviation is also quite large

Deriving likelihood values

log_like_plot_beta = function(beta) {
  sigma_2 = MLE_par[2]
  X = as.matrix(DT$X)
  Y = as.matrix(DT$Y)
  N = nrow(X)
  e = Y - beta*X
  log_lik = -0.5*N*log(2*pi) - 0.5*N*log(sigma_2) - ((t(e) %*% e)/(2*sigma_2))
  return(-log_lik)
}

Deriving likelihood values

log_like_plot_sigma2 = function(sigma_2) {
  beta = MLE_par[1]
  X = as.matrix(DT$X)
  Y = as.matrix(DT$Y)
  N = nrow(X)
  e = Y - beta*X
  log_lik = -0.5*N*log(2*pi) - 0.5*N*log(sigma_2) - ((t(e) %*% e)/(2*sigma_2))
  return(log_lik)
}

Vectorizing Estimates

log_like_plot_beta = Vectorize(log_like_plot_beta)
log_like_plot_sigma2 = Vectorize(log_like_plot_sigma2)

Plotting Sigma Squared

Binomial Model

Suppose we have a binomial distribution for N independent bernoulli trials (e.g., coin flips, success or failure etc). There is a probability function that links the number of successes and the probability associated with it.
The probability function gives the probability of data, given a specific parameter.
Let’s see the barplot for probability function where N = 25 trials and fixed parameter theta = 0.5. In each of these 25 trials, the probability of landing heads is 50%.

Plot Summary

The plot represents the probability of getting each possible number of heads in 25 outcomes.We can see that most of the occurences are between 12 and 15.
The most probable outcome is 12 and 13.
It tells the probability of each discrete outcome.

Binomial Model in action

There is a likelihood function that gives the likelihood of a range of parameters, given a specific data point.
Now, our input is parameters instead of data, this is called likelihood function.

Binomial Example

Suppose we observe ten successes in 25 trials. Here we are not asking about the probability of 10 successes but instead asking what is the parameter that would have given us 10 successes.
Let us see the plot explained above. Suppose that in the sequence of 25 coin flips, we observe 10 successes(heads).
What is the maximum likelihood estimate for theta? Assume binomial model.

Defining likelihood function

negative_log_likelihood_binom = function(data, par) {
  return(-log(dbinom(data, size = 25, prob = par)))
}

Finding parameters that maximizes the likelihood function

optim(par = 0.5, fn = negative_log_likelihood_binom, data = 10)

## $par
## [1] 0.4
## 
## $value
## [1] 1.82537
## 
## $counts
## function gradient 
##       28       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

theta = seq(from = 0, to = 1, by = 0.01)

Summary

The plot is telling us that the likelihood of observing 10 successes over the range of possible values of theta. We can see from the plot, as theta approaches 0.4, the likelihood goes up.
We can ask the question that what value of theta is most likely, given the data we observed?
We can see that our optimized parameter MLE = par = 0.4.

Maximum Likelihood Estimation

Background

Background

Generating data

Kernel density Estimation

Deriving the log likelihood function

Deriving log likelihood values

Vectorizing log likelihood values

Plotting log likelihood function

Plotting log likelihood function

Finding MLE Estimates

Examining Estimates

Deriving likelihood values

Deriving likelihood values

Vectorizing Estimates

Plotting beta

Plotting Sigma Squared

Binomial Model

Density Plot

Plot Summary

Binomial Model in action

Binomial Example

Defining likelihood function

Finding parameters that maximizes the likelihood function

Plotting likelihood

Summary