The purpose of this document is to demonstrate the steps in calculating Maximum Likelihood (ML) estimates of parameters of models given data. I’ll start by showing how to do this with the simple linear model. I’ll then show how to calculate ML parameter estimates for Gonzalez-Vallejo’s (2001) Proportional Difference (PD) model.

Three basic steps

There are three basics steps to obtaining ML estimates for model parameters

Get (or simulate) data

Pretty straighforward. If you don’t have data yet, you can simulate data and see how well you can recover model parameters.

Define the Loss Function

Defining the loss function is the most important part of the ML process. A loss function takes model parameters and data as input, and returns a quantitative value (a loss value) that indicates how consistent the data are with the model given specified parameter values. With ML estimation, we will define the loss function as the deviance, -2 times the sum of the log-likelihoods of the data. Formally:

\[dev = -2 \times \sum_{i=1}^{N} ln(l_{\vec{p}}(x_{i}))\]

Where \(l_{\bar{p}}(x_{i})\) is the likelihood of the ith data point given a vector of parameter values \(\vec{p}\), and N is the total number of data points. Likelihoods are defined by specific probability distributions, such as the Binomial for binary (0 or 1) data or Normal for continuous data. The specific model you are testing will specify the exact likelihood for any given data point.

Optimize: Find parameter values that minimize the loss function.

Once you have data and have defined the loss function, you can use optimization procedures to find the specific parameter values that minimize the loss function (and simultaneously maximize the log-likelihood of the data). There is no one optimization procedure to rule them all. No optimization procedure is perfect - that is, none can guarantee to find the best parameter values. However, for this tutorial we’ll use the default optimization procedures built in to the optim function in R.

Example 1: Linear regression with 1 predictor

To start, let’s calculate ML estimates for the simple linear model. I’ll generate random predictors x, then noisy responses y. We will assume the underlying model is y ~ ax + b + e. Our goal is to calculate ML estimates for a, b and the standard deviation of errors (e)

First, let’s create some data. We’ll have data.x be the independent variable, and data.y be the dependent variable.

I’ll set data.x to 100 random Normal samples with mean 10 and sd 2.

data.x <- rnorm(n = 100, mean = 10, sd = 2)

Next, we’ll create the true model values (true.y) by first specifying the true parameters of the model. Let’s set a (slope) to 3 and b (intercept) to 8.

a.true <- 3
b.true <- 8

true.y <- data.x * a.true + b.true

Now, we’ll create response data (data.y) by adding noise. We’ll first specify the true noise standard deviation (err.sd.true) to 1. We’ll then create the noise by generating random Normal samples with mean 0 and sd equal to err.sd.true. Finally, we’ll create the response data by adding the noise to the true model values:

err.sd.true <- 1  # Set noise sd
noise <- rnorm(n = 100, mean = 0, sd = 2)  # Generate noise
data.y <- true.y + noise  # Add noise to true (latent) responses

Let’s look at the data:

plot(data.x, data.y)

Step 1: Create a loss function to be minimized.

The loss function is the main function that specifies the model. It should take a single vector of parameter values as an input, calculate model fits to the response data using those parameter values, and return a loss value. For maximum-likelihood estimation, we’ll use deviance (-2 times sum of log likelihoods). By minimizing deviance, we will maximize likelihood.

For this example, we’ll assume a simple linear model \(N(\mu=a \times x + b, \sigma)\) with three parameters: a (the slope), b (the intercept), and \(\sigma\), the standard deviation of the errors.

lm.loss <- function(par) {
  
  a.par <- par[1]  # The current slope
  b.par <- par[2]  # The current intercept
  err.sigma <- par[3]  # The current error standard deviation
  
  # If the error standard deviation is invalid (i.e.; negative), then we need to return a very high deviance
# This will tell the optimization procedure to stay away from invalid (either statistically or psychologically)
#   parameter values.

if(err.sigma < 0) {deviance <- 10000000}

  # If the error standard deviation is valid (i.e.; > 0), then calculate the deviance...

  if(err.sigma > 0) {
  
  # Calculate the likelihood of each data point.
  # Here is where you specify the model and how you calculate likelihoods.
    
likelihoods <- dnorm(data.y, mean = data.x * a.par + b.par, sd = err.sigma)

# Now let's convert the vector of likelihoods to a summary deviance score (-2 times sub log-lik)

 # Calculate log-likelihoods of each data point
log.likelihoods <- log(likelihoods)
 
 # Calculate the deviance (-2 times sum of log-likelihoods)
deviance <- -2 * sum(log.likelihoods)

}

return(deviance)

}

Let’s try the function out. We’ll calculate the loss (deviance) of the data for a few parameter values. We’ll use a = 1, b = 5, err.sd = 20.

dev.temp <- lm.loss(c(1, 5, 20))
dev.temp # print value

## [1] 927.1694

This says that for parameter values of a = 1, b = 5 and error standard deviation = 20, the deviance is 927.17. Note: Even though I don’t specify the data when running the function, the function will look for the data in the current working directory. Since I specified these before, the function will find them.

Let’s try with some values that are closer to the true values (hopefully the deviance decreases). We’ll try setting a = 3.2, b = 7.5, and error standard deviation = 2

dev.temp <- lm.loss(c(3.2, 7.5, 2))
dev.temp

## [1] 485.3211

As you can see, our new deviance value is smaller that our previous one. But which parameter values result in the smallest deviance? In the next step, we’ll figure this out using an optimization function.

Step 2: Uze optimization functions to find parameter values that minimize the deviance function for the data

Now we’ll use R’s optimization functions to find the parameter values that minimize the loss function (and in return, maximize the likelihood). There are many different optimization functions in R. The simplest is optim (others include nlminb). To use optim, we set the par argument to be the starting parameter values, and the fn argument to the loss function we previously specified.

parameter.fits <- optim(par = c(0, 0, 10),
      fn = lm.loss, hessian = T
      )

parameter.fits

## $par
## [1] 3.032675 7.735734 2.052022
## 
## $value
## [1] 427.5332
## 
## $counts
## function gradient 
##      126       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
## 
## $hessian
##              [,1]        [,2]       [,3]
## [1,] 5187.5415282 486.8750697 -0.2324712
## [2,]  486.8750697  47.4969713 -0.0244107
## [3,]   -0.2324712  -0.0244107 94.9660933

Let’s look at the results. The best parameters are a = 3.03, b = 7.74, and err sd = 2.05. The fitted parameter values we got are pretty close to the true parameter values of a = 3, b = 8 and err sd = 1. The minimum deviance (using the previous parameter values) was 427.53. The convergence value of 0 means that the optimization function was able to converge on a single value (this is a good thing). If you get a value of 1 for convergence, it suggests that your final parameter estimates might not be reliable.

Getting confidence intervals for parameter values

After you run an optimization procedure, you can obtain confidence intervals for parameter values using the parameter Hessian matrix (for detailed information on this, see Lewandowsky and Farrell (2010)).

hessian <- parameter.fits$hessian
hessian.inv <- solve(hessian)
parameter.se <- sqrt(diag(hessian.inv))
parameter.se

## [1] 0.07129174 0.74505275 0.10261616

This tells us that the standard error of the slope (a) parameter is 0.07, the standard error for the intercept (b) parameter is 0.75, and the standard error for the error (\(\sigma\)) parameter is 0.1. We can now calculate 95% confidence intervals for each parameter by adding and subracting 1.96 times the standard error from the ML estimates:

CI.matrix <- as.data.frame(matrix(NA, nrow = 3, ncol = 3))

CI.matrix[1,] <- parameter.fits$par
CI.matrix[2,] <- parameter.fits$par - 1.96 * parameter.se
CI.matrix[3,] <- parameter.fits$par + 1.96 * parameter.se
names(CI.matrix) <- c("a", "b", "sigma")
rownames(CI.matrix) <- c("ML", "95% Lower bound", "95% Upper bound")

CI.matrix

##                        a        b    sigma
## ML              3.032675 7.735734 2.052022
## 95% Lower bound 2.892943 6.275431 1.850894
## 95% Upper bound 3.172407 9.196038 2.253150

To interpret the confidence intervals, we’d conclude that we are 95% confident that the population value of the slope a is in the interval [2.89, 3.17], the population value of the intercept b is in the interval [6.28, 9.2], and the population value of the error standard deviation \(\sigma\) is in the interval [1.85, 2.25].

Frequentist statistics Warning!!! You can not conclude from a frequentist confidence interval (like the ones we just calculated) that the probability that the true population value is between these intervals is 95%. This is a common mistake even made by experienced scientists (see comment by Herzog and Ostwald (2013)). To make this conclusion, you’d need to use Bayesian statistics. For a tutorial on calculating 95% highest density intervals, see Kruschke (2010) and Lee and Wagenmakers (2014).

We’re pretty much done, but before moving on to a psychological example, let’s compare the parameter values we just got using our custom optimization procedure with simple regression functions in R. Since the regression functions in R use normal equations that are are equivalent to maximizing log-likelihoods, we should be able to calculate the same values using simple regression functions in R. Let’s test this:

built.in <- lm(data.y ~ data.x)
built.in

## 
## Call:
## lm(formula = data.y ~ data.x)
## 
## Coefficients:
## (Intercept)       data.x  
##       7.734        3.033

Sure enough, R’s built in linear regression function calculates a slope (a) of 3.03 and an intercept (b) value of 7.73. These are identical to what we got before.

Example 2: The Proportional Difference (PD) model

Now we’ll move on to using ML to estimate parmaters of a psychological model of decision making. For this example, we’ll use the Proportional Difference (PD) model from Gonzalez-Vallejo (2002). The model describes how people make choices between multiple options, where each option is defined by a finite number of numerical attributes. For example, consider the decision between a smartphone A with 128GB of storage and a 10 hour battery life, and a smartphone B with 64GB of storange and a 15 hour battery life:

Attributes	Smartphone A	Smartphone B
Storage	128 GB	64 GB
Battery life	10 hours	15 hours

The PD model assumes that people make trade-offs between option attributes by calculating the proportional difference of one option’s attributes to the other. The proportional benefits of each attribute for each option are then summed to calculate an overall d score which indicates the overall value of an option relative to the other.

For example, given our previous example, we can calculate the d score as:

\[d = \frac{128-64}{128} + \frac{10-15}{15}=\frac{1}{8}\]

This means that smartphone A has a slight advantage over smartphone B.

The d score is a function of the two choice options. However, a person’s probability of choosing option A is a function of both the d value, a person’s decision threshold \(\delta\), and a noise parameter \(\sigma\). The larger a person’s \(\delta\) value is, the more evidence (i.e.; more extreme \(d\) value) they need to make a clear (i.e.; high probability) decision. The lower the person’s \(\delta\) value is, the less evidence (i.e.; less extreme \(d\) value) they need.

Formally, a person’s probability of selecting an option is defined by the cumulative normal distribution from \(-\infty\) to \(d - \delta\):

\[ p(choice = x) = \int_\infty^{d - \delta} N(\mu = 0, \sigma = \sigma)\]

We can visualize this probability here:

pd.lik <- function(x) {pnorm(x, mean = 0, sd = 1)}
plot(1, xlim = c(-2, 2), ylim = c(0, 1), 
     type = "n", xlab = expression(d - delta), ylab = "p(Choose x)")

rect(-100, -100, 100, 100, col = gray(.95, alpha = .8))
abline(h = seq(0, 1, .2), col = gray(1, alpha = .8), lwd = 3)
abline(h = seq(.1, 1.1, .2), col = gray(1, alpha = .8), lwd = 1)
abline(v = seq(-2, 2, .5), col = gray(1, alpha = .8), lwd = 3)
abline(v = seq(-2.25, 2.25, .5), col = gray(1, alpha = .8), lwd = 1)

curve(pd.lik, from = -2, to = 2, add = T, lwd = 2)

As you can see, the larger \(d\) is relative to \(\delta\) (that is the larger the difference \(d-\delta\)) the more likely a person is to pick the option favored by d. When \(d-\delta\) is exactly 0 (meaning that d is exactly at the person’s decision threshold \(d\)), the person is indifferent and is equally likely to choose both options.

Step 0: Create Stimuli

Before we can generate data, we’ll need to define some choice stimuli. We’ll create 20 stimuli, where each stimuli has defined two attributes a and b.

n.stim <- 200

option1 <- data.frame(a = sample(1:10, size = n.stim, replace = T),
                      p = sample(50:55, size = n.stim, replace = T))

option2 <- data.frame(b = sample(1:10, size = n.stim, replace = T),
                      q = sample(50:55, size = n.stim, replace = T))

Now we can calculate d values. Before we can calculate them, we’ll first need to calculate the denominator for the d calculations (max(|a|, |b|) and max(|p|, |q|))

library(matrixStats)

## matrixStats v0.10.3 (2014-10-01) successfully loaded. See ?matrixStats for help.

den.ab <- rowMaxs(cbind(option1$a, option2$b))
den.pq <- rowMaxs(cbind(option1$p, option2$q))

Now let’s calculate the d values for each stimuli pair:

d <- ((option1$a - option2$b) / den.ab - (option1$p - option2$q) / den.pq)

Here are the final results:

pd.stimuli <- cbind(option1, option2, d)
head(pd.stimuli)

##   a  p b  q           d
## 1 2 54 8 51 -0.80555556
## 2 9 55 4 50  0.46464646
## 3 8 53 8 55  0.03636364
## 4 8 51 9 51 -0.11111111
## 5 9 52 7 50  0.18376068
## 6 9 53 7 52  0.20335430

Step 1: Data

Now let’s create some participant data. We’ll set the true value of delta (decision threshold) to .5 and the random disturbance standard deviation to 1.

true.delta <- .2
err.sd <- 1

p.choose.x <- pnorm(pd.stimuli$d - true.delta, mean = 0, sd = 1)

Now we’ll create the actual choice data from the probabilities. The code looks a bit complicated, but it’s just a loop calculating a single for each stimuli based on the choice probability

choices <- sapply(1:length(p.choose.x), function(x) {
  
  prob <- p.choose.x[x]
  outcome <- sample(c("x", "y"), size = 1, replace = 1, prob = c(prob, 1 - prob))
  
})

Step 2: Define loss function

Let’s calculate the loss function. According to Gonzalez-Vallejo (2001), the probability of selecting option x should be proportional to the difference between the stimuli’s d value and the participant’s decision threshold delta (\(\delta\)). A difference of 0 indicates indifference (0.50 probability of choosing x), a large positive value of the difference indicates a high probability of choosing x, and an extreme negative value indicates a very low probability of choosing x. As specified by Gonzalez-Vallejo (2001), we’ll use the cumulative Normal distribution (pnorm function in R) to quantify the specific probability:

\[ p(choice = x) = \int_\infty^{d - \delta} N(\mu = 0, \sigma = \sigma)\]

where \(\sigma\) is the noise standard deviation.

Alltogether, we have two free parameters that we need to optimize: \(\delta\), the participant’s decision threshold, and \(\sigma\), the noise standard deviation.

pd.loss <- function(par) {
  
  delta <- par[1]
  err.sigma <- par[2]
  
  
  # If the error standard deviation is invalid (i.e.; negative), then we need to return a very high deviance
# This will tell the optimization procedure to stay away from invalid (either statistically or psychologically)
#   parameter values. I will restrict delta to be between -5 and + 5, and err.sigma to be greater than 0:

if(err.sigma < 0 | delta > 5 | delta < -5) {
  
  deviance <- 1000000
  
}

  
  # If the parameters are valid, then calculate the deviance...
  
  if(err.sigma > 0) {
  
# Now, calculate the likelihood of choosing option x for each stimuli
    
p.choose.x <- pnorm(pd.stimuli$d - delta, 
                       mean = 0, 
                       sd = err.sigma
                       )

# Likelihood of choosing y is just 1 - p(choose x)

p.choose.y <- 1 - p.choose.x

# Now calculate the likelihood of each observed choice

likelihoods <- rep(NA, length(choices))  # Set up the vector
likelihoods[choices == "x"] <- p.choose.x[choices == "x"]  # Likelihoods for x choices
likelihoods[choices == "y"] <- p.choose.y[choices == "y"]  # Likelihoods for y choices


# Because we don't like likelihoods of 0 (which should rarely happen), convert 0s to a very small number
likelihoods[likelihoods == 0] <- .0001

# Now let's convert the vector of likelihoods to a summary deviance score (-2 times sub log-lik)

 # Calculate log-likelihoods of each data point
log.likelihoods <- log(likelihoods)

 # Calculate the deviance (-2 times sum of log-likelihoods)
deviance <- -2 * sum(log.likelihoods)

}


return(deviance)

}

Step 3: Optimize

Now let’s fit the model to the data. Again, the true parameter values are delta = 0.2 and error sd = 1. Let’s see if we get ML values close to these:

parameter.fits <- optim(par = c(0, 1),
      fn = pd.loss, hessian = T
      )

parameter.fits

## $par
## [1] 0.1279175 1.2523457
## 
## $value
## [1] 256.7897
## 
## $counts
## function gradient 
##       43       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
## 
## $hessian
##            [,1]      [,2]
## [1,] 152.642956 -5.231999
## [2,]  -5.231999 24.802799

We got a ML delta value of 0.13, pretty close to the true value of 0.2. If you increase the number of stimuli, you should find that the ML value is closer to the true value.

Now, let’s calculate 95% confidence intervals for the parameters (again, for a more thorough explanation of this procedure, see Lewandowsky and Farrell (2010))

hessian <- parameter.fits$hessian
hessian.inv <- solve(hessian)
parameter.se <- sqrt(diag(hessian.inv))

CI.matrix <- as.data.frame(matrix(NA, nrow = 3, ncol = 2))

CI.matrix[1,] <- parameter.fits$par
CI.matrix[2,] <- parameter.fits$par - 1.96 * parameter.se
CI.matrix[3,] <- parameter.fits$par + 1.96 * parameter.se
names(CI.matrix) <- c("delta", "sigma")
rownames(CI.matrix) <- c("ML", "95% Lower bound", "95% Upper bound")

CI.matrix

##                       delta     sigma
## ML               0.12791749 1.2523457
## 95% Lower bound -0.03130098 0.8573599
## 95% Upper bound  0.28713596 1.6473314

It looks like our 95% confidence intervals for the delta and sigma parameters are [-0.03, 0.29] and [0.86, 1.65] respectively.

Simulation

Just to see how consistently ML can recover parameter values from the PD model, let’s conduct a simulation. I’ll calculate ML estimates from 200 simulated stimuli from participants with different delta values ranging from -1.0 to 1.0 in steps of 0.25. I’ll simulate each person’s responses 100 times.

n.stim.seq <- c(100)   # Number of stimuli for each participant
true.delta.seq <- seq(-1, 1, .25) # Participant delta
err.sd <- 1 # Error sd
n.sim <- 100 # Number of simulations

library(matrixStats)

result.df <- expand.grid(true.delta = true.delta.seq,
                         n.stim = n.stim.seq,
                         sim = 1:n.sim,
                         delta.fit = NA,
                         sd.fit = NA
                         )



for(i in 1:nrow(result.df)) {
  
  n.stim <- result.df$n.stim[i]
  true.delta <- result.df$true.delta[i]
  
option1 <- data.frame(a = sample(1:10, size = n.stim, replace = T),
                      p = sample(50:55, size = n.stim, replace = T))

option2 <- data.frame(b = sample(1:10, size = n.stim, replace = T),
                      q = sample(50:55, size = n.stim, replace = T))


den.ab <- rowMaxs(cbind(option1$a, option2$b))
den.pq <- rowMaxs(cbind(option1$p, option2$q))

d <- ((option1$a - option2$b) / den.ab - (option1$p - option2$q) / den.pq)

pd.stimuli <- cbind(option1, option2, d)


p.choose.x <- pnorm(pd.stimuli$d - true.delta, mean = 0, sd = 1)

choices <- sapply(1:length(p.choose.x), function(x) {
  
  prob <- p.choose.x[x]
  outcome <- sample(c("x", "y"), size = 1, replace = 1, prob = c(prob, 1 - prob))
  
})

parameter.fits <- optim(par = c(0, 1),
      fn = pd.loss
      )


result.df[i, 4:5] <- parameter.fits$par

}

Let’s check the results. I’ll graph the distribution of fitted delta values as a function of the true underlying delta values. The distributions are represented as beanplots with horizontal lines showing the mean of the distributions. I’ll put a dashed red line showing where the true delta value for each group of simulations is. If the ML procedure works, the horizontal mean line of the fitted delta distributions should be close to the red line.

library(beanplot)
library(RColorBrewer)

plot(1, xlim = c(0, length(true.delta.seq) + 1), 
     type = "n", ylim = c(-2, 2), xaxt = "n", 
     ylab = "Fitted Delta Value", xlab = "True Delta Value",
     main = "Simulated ML fits of proportional difference model"
     )

abline(h = seq(-2, 2, 1), col = gray(.8))
abline(h = seq(-2, 2, .25), col = gray(.8), lwd = .5)

abline(a = -1.25, b = .25, lty = 2, col = "red", lwd = 2)

with(result.df[result.df$n.stim == 100,], 
     beanplot(delta.fit ~ true.delta, col = gray(.8),
              ylim = c(-2, 2), add = T
              ))

Looks pretty good. The only consistent problems we see are with extreme delta values (-1 and + 1). Here, the average fitted value seems to be a bit more extreme than the true values. This is not terribly surprising: if a participant’s choice is perfectly consistent with d, then the fitting procedure will try to assign a very extreme \(\delta\) value to them.

References

Gonzalez-Vallejo, Claudia. 2002. “Making Trade-Offs: A Probabilistic and Context-Sensitive Model of Choice Behavior.” Psychological Review 109 (1). American Psychological Association: 137.

Herzog, Stefan, and Dirk Ostwald. 2013. “Experimental Biology: Sometimes Bayesian Statistics Are Better.” Nature 494 (7435). Nature Publishing Group: 35–35.

Kruschke, John. 2010. Doing Bayesian Data Analysis: A Tutorial Introduction with R. Academic Press.

Lee, Michael D, and Eric-Jan Wagenmakers. 2014. Bayesian Cognitive Modeling: A Practical Course. Cambridge University Press.

Lewandowsky, Stephan, and Simon Farrell. 2010. Computational Modeling in Cognition: Principles and Practice. Sage.

Maximum-Likelihood Fitting in R

Nathaniel Phillips

8 Sep 2015