The TMB (Template Model Builder) library links the high-level, statistically-oriented programming language R to the computational speed of C++ for likelihood calculations. This example walks through how to generate simple parameter estimates for different distributions using R, and then Template Model Builder.
Unlike many of the function calls in base R and in the regression and modeling packages available on CRAN, using Template Model Builder requires you to literally write out the equations that calculate your likelihood in order to run a model. As such, this page goes into the weeds of how to calculate a joint negative log likelihood in R first (familiar territory), then switches to TMB, in hopes that this will make the C++ versions more intelligible.
My hope is that this document can serve as the foundation of an understanding of how to use TMB, by starting from the ground up.
So, let’s start by getting a workspace set up: loading the required libraries, etc.
rm(list=ls()) # Removing everything from the R workspace
library(TMB) # Template model builder
library(data.table) #data manipulation package
library(ggplot2) # Plotting package
library(ggthemes) # makes plots more elegant
First, let’s create a data set, where we know ahead of time what the distribution parameters are. This has the advantage of allowing the programmer to know ahead of time what the “right answer” for parameters should be, which will allow us to check our code for errors.
To keep things simple, we’ll start out with data sampled from a normal distribution.
simulated_data<-data.table(sim_norm=rnorm(10000,50,6))
# Generating a column within the data set ("sim_norm") that has 10000 observations
# pulled from a normal distribution of mean 50, with a standard deviation of 6
Let’s say for the sake of example that you don’t already know the distribution parameters that were used to create this data set, and you wanted to find out. To do this, we could roughly use the following process:
1. Decide which kind of distribution the data might have come from (here, a normal distribution).
2. Guess a set of parameter values for that distribution.
3. Evaluate how likely it is that the observed data came from a distribution with those parameters.
4. Adjust the parameter guesses until the data are as likely as possible.
Steps 2-4 are easy to complete with the help of computers and statistical packages through functions and optimization. Our intuition would tell us that data points closer to the mean of the distribution are probably more likely to have been generated by a process with those parameters than those far away from the mean. We also know that data points further from the mean are less surprising if the data is very noisy (has a high standard deviation).
Let’s use R’s “dnorm” function to evaluate how likely it is that an individual point came from a distribution, and visualize that data point as a red line being portrayed over the distribution:
# Visualizing the data point over the distribution
ggplot(data=simulated_data)+
geom_density(aes(sim_norm))+
geom_vline(xintercept=48,colour='red')+
labs(title="A Likely Data Point (red)")+
theme_tufte()
# Discovering the likelihood of the number 48 being
# generated from a normal distribution with
# mean 50, SD 6
dnorm(48,50,6)
## [1] 0.0628972
Now, let’s evaluate the likelihood of a data point that seems less likely to have come from that distribution:
ggplot(data=simulated_data)+
geom_density(aes(sim_norm))+
geom_vline(xintercept=75,colour='blue')+
labs(title="A Less Likely Data Point (blue)")+theme_tufte()
dnorm(75,50,6)
## [1] 1.129383e-05
As expected, the likelihood of the data point farther away from the mean is smaller. But that’s just one data point– what if you have a set of numbers that you think could have been generated from a given distribution?
To evaluate the likelihood of the whole data set, we evaluate each data point’s probability of having been generated from that distribution, and then combine the probabilities. When evaluating the probability of multiple independent events, you need to multiply the probabilities together to get the likelihood that all of the events will occur. However, multiplying thousands of small probabilities together quickly underflows to zero on a computer; so, instead of multiplying things together, we use the handy math trick of log-transforming the likelihoods, which then allows the log-probabilities to be added together rather than multiplied.
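To see why this trick matters in practice, here is a quick sketch using the simulated data from above (the second result will vary slightly with your simulation): the product of 10,000 densities underflows to zero in double precision, while the sum of the log-densities stays perfectly usable.
# Multiplying 10,000 small densities together underflows to exactly 0
prod(dnorm(simulated_data$sim_norm,50,6))
# Summing the log-densities instead gives a finite, usable number
sum(dnorm(simulated_data$sim_norm,50,6,log=TRUE))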
Adding the log-likelihoods together creates a “joint log likelihood”. We would want to maximize this number; however, computers like to optimize by finding minimums, so instead we create a “joint negative log likelihood” (JNLL) for the entire data set by starting off with a JNLL of 0, and then subtracting each data point’s log-likelihood from that number. The idea here is that an optimizer will be able to tweak the parameters of the distribution until it reaches the lowest JNLL that it can.
So, to calculate the JNLL of a data set given a set of parameters, you could do something like the following:
# JNLL function Version #1
# Defining a function to calculate the JNLL, containing input data, and parameters in a list:
jnll_norm<-function(data,parameters){
mean_param<-parameters[1] # This function knows the first object in the list is going to be our parameter guess for the mean
sd_param<-parameters[2] # and the second object will be our guess for the standard deviation
jnll<-0 # Setting the Joint Negative Log-Likelihood to zero -- throughout the loop,
# this will grow as each data point's log-likelihood is subtracted from it
for (each_datapoint in data){ # Creating a loop through each data point
# Find the log-likelihood of that point:
# log=TRUE means dnorm() returns the log of the likelihood
likelihood_of_single_point<-dnorm(each_datapoint,mean_param,sd_param,log=TRUE)
# Subtract that from the jnll
jnll<-jnll-likelihood_of_single_point
}
return(jnll)
}
A far more efficient way of calculating this would be to pass the vector of data values, rather than a single data point, to the dnorm function:
# JNLL function Version #2
jnll_norm<-function(data,parameters){
mean_param<-parameters[1]
sd_param<-parameters[2]
vector_of_likelihoods<-dnorm(data,mean_param,sd_param,log=TRUE) #the dnorm() function can take a vector!
jnll<- -1*sum(vector_of_likelihoods) # now, we just sum the log-likelihoods produced by dnorm() above and flip the sign
return(jnll)
}
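As a quick sanity check of the vectorized version (a minimal sketch; the exact value depends on your simulated data), calling it at the true parameters should give exactly the same answer as negating the summed log-densities directly:
# Both of these lines should print the same number
jnll_norm(data=simulated_data$sim_norm,parameters=c(50,6))
-1*sum(dnorm(simulated_data$sim_norm,50,6,log=TRUE))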
Let’s test out this function on three different guesses, and visualize the guesses compared to our data that we have “observed” (in this case, simulated).
# Creating a data set based on the "real" (simulated) values, and then
# creating data sets that would be produced by distributions based on 3 parameter guesses
jnll_vis<-rbind(data.table(data=simulated_data$sim_norm, type="Simulated Data (mean=50,sd=6)"),
data.table(data=rnorm(10000,40,10), type="Guess #1 (mean=40,sd=10)"),
data.table(data=rnorm(10000,55,5), type="Guess #2 (mean=55,sd=5)"),
data.table(data=rnorm(10000,50,6), type="Guess #3 (mean=50,sd=6)"))
ggplot(data=jnll_vis)+
geom_density(aes(data,colour=type))+
labs(title="Data vs. Parameter Guesses")
It looks like the third guess (which uses the same parameter values that simulated the data in the first place) matches best, but let’s check which guess has the lowest joint negative log-likelihood:
# Guess #1
jnll_norm(data=simulated_data$sim_norm,parameters=c(40,10))
## [1] 38979.67
# Guess #2
jnll_norm(data=simulated_data$sim_norm,parameters=c(55,5))
## [1] 37534.1
# Guess #3
jnll_norm(data=simulated_data$sim_norm,parameters=c(50,6))
## [1] 32097.36
Looks like Guess # 3 (the true distribution parameters) wins out! But, that guess-and-check method is certainly far from ideal. Luckily, R has optimizers that can search through “parameter space” and find the combinations that achieve the lowest JNLL. So, let’s try using one of R’s optimizers to find out what distribution our simulated data came from.
We’ll use optim(), a function in the {stats} library in R (which is automatically loaded), to do our guesswork for us (and more intelligently).
When we’re guessing at a mean and standard deviation parameter as intelligent human scientists, we realize that there’s no such thing as a negative standard deviation. However, an optimizer just searching through parameter space isn’t that smart, and it could try giving our optimization function negative values for our standard deviation parameter, which breaks the dnorm() function and produces NaN:
dnorm(10,4,-1)
## Warning in dnorm(10, 4, -1): NaNs produced
## [1] NaN
So, we need to modify our function to calculate the JNLL to safeguard it from the fact that we want our optimizers to be searching through ALL of parameter space– negative to positive infinity. Lucky for us, we can use the log transform/exponentiation to make sure the parameter passed to dnorm is positive:
dnorm(10,4,exp(-1))
## [1] 1.873835e-58
Much better. Let’s put this idea into action by modifying our JNLL function:
# JNLL function Version #3
jnll_norm<-function(data,parameters){
mean_param<-parameters[1]
sd_param<-exp(parameters[2]) #**** CHANGE HERE****: the sd_param has been exponentiated
vector_of_likelihoods<-dnorm(data,mean_param,sd_param,log=TRUE)
jnll<- -1*sum(vector_of_likelihoods)
return(jnll)
}
Note that this means that when the optimizer returns what it thinks the best values are for our parameters, it will actually be returning the log of the standard deviation parameter.
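In other words, since we simulated the data with a standard deviation of 6, a well-behaved optimizer should report something close to log(6) for that second parameter, rather than 6 itself:
log(6) # roughly what we should expect the optimizer to return for the SD parameter
## [1] 1.791759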
With this in mind, it’s time to put one of R’s optimizers into action! We can pass our data, starting guesses, and function that we need to minimize to optim(). We save the output of this function into an object I’m calling Opt (for optimized).
# Now, we can use the optim() function in R to try to find better parameters to describe the data.
Opt <- optim(par=c(0,0), # "par" needs to be a vector or list, with your starting guesses
# ordered the same as how you define your parameters (in this case,
# mean first, standard deviation second).
fn=jnll_norm, # "fn" is the function you are using to calculate JNLL
data=simulated_data$sim_norm)
It’s worth noting that we include this “data=…” argument in our call to optim() because our JNLL function uses the name “data” for one of its arguments. If, for example, your function had differently named arguments, your function and call to optim() might have looked like this:
# equally valid, yet ridiculous function and optimization call
silly_names<-function(bumblebee_flight_paths,wild_guesses){
center<-wild_guesses[1]
noise<-exp(wild_guesses[2])
return(-1*sum(dnorm(bumblebee_flight_paths,center,noise,log=TRUE) ))
}
Opt_silly <- optim(par=c(0,0), # The "par" argument needs to keep the same name, no matter how you've named
# the parameter argument in your function.
fn=silly_names, # "fn" stays the same as well
bumblebee_flight_paths=simulated_data$sim_norm) # but the argument you use to enter your data changes
So, what’s in Opt?
names(Opt)
## [1] "par" "value" "counts" "convergence" "message"
For now, let’s concentrate on two of the named objects within Opt– “par”, and “value”. “par” contains a vector of the parameter estimates that the optimizer thought minimized the JNLL, in the same order they were entered into the function. “value” contains the ending JNLL that the optimizer achieved.
Opt$value
## [1] 32097.21
This isn’t inherently meaningful by itself, but we can use this likelihood value to compare this distribution to other distributions or models.
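For example, we can check it against the JNLL at the true simulation parameters from our guess-and-check exercise above; the optimizer’s result should be at least as low (remember that the second parameter of this version of the function is the log of the standard deviation):
# JNLL at the true parameters, using the current (Version #3) function
jnll_norm(data=simulated_data$sim_norm,parameters=c(50,log(6)))
# Opt$value should be less than or equal to this number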
Opt$par
## [1] 49.962140 1.791221
And there are our estimates for the mean and standard deviation! We know that this simulated data set was created using mean=50, and sd=6. So, why is our standard deviation parameter too small?
…Because it’s actually returning the log of our standard deviation parameter! Let’s exponentiate our optimizer’s best guess and see if it matches up with what we expect (something around 6):
exp(Opt$par[2])
## [1] 5.996769
Much better. However, remembering which parameters need to be transformed can be tricky, which is one reason to use named lists in both your JNLL function and your list of parameters, to make sure you understand your outputs.
Instead of simply passing your JNLL function a vector of starting guesses, you can use named lists to add some clarity and produce a more readable output.
# JNLL function Version #4
jnll_norm<-function(data,parameters){
mean_param<-parameters["mean"]
sd_param<-exp(parameters["log_sd"])
vector_of_likelihoods<-dnorm(data,mean_param,sd_param,log=TRUE)
jnll<- -1*sum(vector_of_likelihoods)
return(jnll)
}
Opt <- optim(par=list("log_sd"=0,"mean"=0), # Now, this list needs to contain the named elements you specify in your
# function-- the order doesn't matter.
fn=jnll_norm, # "fn" is the function you are using to calculate JNLL
data=simulated_data$sim_norm)
Opt$par
## log_sd mean
## 1.791221 49.962140
The following function will return your parameter values in a more convenient data format: a data.table, with an extra column for any parameter transformations that we need (specifically, if the optimizer returns the log of a parameter, we exponentiate it back to the natural scale).
get_params<-function(optimized){
# pulling the parameter estimates into a data table with columns for names and values
params<-data.table(names=names(optimized$par),values=optimized$par)
# adding the transformed versions into another column (if necessary)
params[grepl("log_",names)==T,trans:=exp(values)]
return(params)
}
parameters<-get_params(Opt)
print(parameters)
## names values trans
## 1: log_sd 1.791221 5.996769
## 2: mean 49.962140 NA
For this example so far, we’ve only looked at the normal distribution. Here’s an example of the same essential process, but this time, with a beta distribution:
simulated_data[,sim_beta:=rbeta(10000,70,3)]
# Generating a column ("sim_beta") that has 10000 observations
# pulled from a beta distribution with parameters a=70, b=3
jnll_beta<-function(data,parameters){
# Now we have two parameters that determine the shape of the
# beta distribution, and both need to be positive, so both get exponentiated.
shape1<-exp(parameters["log_shape1"])
shape2<-exp(parameters["log_shape2"])
vector_of_likelihoods<-dbeta(data,shape1,shape2,log=TRUE)
jnll<- -1*sum(vector_of_likelihoods)
return(jnll)
}
Opt_beta <- optim(par=list("log_shape1"=0,"log_shape2"=0),
fn=jnll_beta,
data=simulated_data$sim_beta)
params<-get_params(Opt_beta)
print(params)
## names values trans
## 1: log_shape1 4.238086 69.275119
## 2: log_shape2 1.087793 2.967718
What if your data set was really, really large, and you wanted to use the power of C++ to achieve a similar goal? TMB allows for this, and more.
TMB allows you to write your objective function (to calculate joint negative log likelihood) in C++ instead of R. You can use the same optimizers with the C++ code that you use with your R function, and although the C++ Code and the R code involve slightly different syntax, they are structured similarly.
The C++ code uses named inputs, akin to the way we defined the parameter inputs in our final JNLL function in R, where the function looked specifically for named parameters like “mean”. In addition to naming the parameters, we also name our data inputs. This is because in future work in TMB, you’ll probably have (at least) two kinds of data you want to keep straight: the column that represents your dependent variable, and the data table/matrix of values that correspond to your covariates. TMB’s function MakeADFun() matches these names to the data and parameter objects declared in your C++ code. In general, a common convention is to use the suffix “i” if the data is 1-dimensional (just a column), and the suffix “ij” if the data is 2-dimensional (like a data set of covariates).
To prepare our data for use with TMB, we first need to create a named list of parameters with our “first guess” starting values, and a list that explicitly names our data set.
Params = list("mean"=0, "log_sd"=0)
Data = list("y_i"=simulated_data$sim_norm)
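Our example only has the one outcome column, but as a purely illustrative sketch of that suffix convention (the object names here are hypothetical and aren’t used anywhere below), a model with covariates might pass both a vector and a matrix:
# Hypothetical illustration of the "i"/"ij" suffix convention (not used in this example)
Data_hypothetical = list("y_i" = simulated_data$sim_norm, # 1-dimensional outcome vector
"X_ij" = cbind(1, simulated_data$sim_beta)) # 2-dimensional covariate matrix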
Next, we need to write some C++ code that calculates the JNLL based on our distribution, parameters, and data.
Let’s take a look at the completed C++ file to discover the mean and standard deviation of this data set, and then break down each component.
#include <TMB.hpp>
template<class Type>
Type objective_function<Type>::operator() ()
{
// Data
DATA_VECTOR( y_i );
// Parameters
PARAMETER( mean );
PARAMETER( log_sd );
Type sd = exp(log_sd);
Type jnll = 0;
int n_data = y_i.size();
// Probability of data conditional on parameter values
for( int i=0; i<n_data; i++){
jnll -= dnorm( y_i(i), mean, sd, true );
}
return jnll;
}
Not so bad, eh? One thing to note is that “//” denotes a comment in C++, just the way R uses “#” for the same function. It’s helpful to know the following things when writing in C++ for TMB (these will make more sense as we break down what’s in the C++ file):
Each line that isn’t a comment, or a bracket (“{” or “}”) needs to end with a semicolon (“;”).
C++ needs to know what to expect. Unlike R, C++ needs to have its objects and data types/classes clearly defined. As such, you can’t just create a variable on the fly in your C++ code: you need to declare ahead of time what kind of data it’s going to hold. If a value is an integer that will remain fixed for the rest of the script (like the length of a data vector), you can declare it as an int. However, if something depends on the parameters or will change over the course of the function/C++ script, it is usually declared with the type “Type”.
Loops in C++ are super, super fast, so unlike in R, you don’t have to worry about avoiding loop structures in your code.
Now that we’re acquainted with the basics, let’s take a deeper dive into each component of the above code. The first block of code loads in the TMB C++ library, and defines what follows as an “objective function” that R will be able to use. To use TMB, always include the following text above every C++ script:
#include <TMB.hpp> //This is the equivalent of loading a library
template<class Type>
Type objective_function<Type>::operator() ()
{ //this says that the objective fn is starting now
Next, we define our data inputs, and tell the C++ file what specific parameters to look for.
We are pointing the code to look for a data vector named “y_i”, as well as values for two parameters: one named “mean”, and one named “log_sd”.
// Data
DATA_VECTOR( y_i ); //Note that this is the exact same name as what is in the R code version list.
// Parameters
PARAMETER( mean ); //tells C++ to look for named parameters "mean" and "log_sd"
PARAMETER( log_sd );
This next section is where we calculate any downstream values/do any transformations that are required before the numbers passed from the optimizer (in this case, the values for mean and log_sd) are entered into the actual objective function. So, here, we need to make sure that we exponentiate the standard deviation parameter.
With “Type sd = exp(log_sd)”, we are creating an object of type “Type”, called “sd”, that is the exponentiated version of the log_sd parameter that gets passed through the optimizer.
We also define the variable jnll as the number 0 (data type “Type”), from which we will subtract the log-likelihood for each data point.
Next, we calculate the length of the vector y_i, which allows us to iterate through the right number of data points when calculating the JNLL.
Type sd = exp(log_sd);
Type jnll = 0;
int n_data = y_i.size();
Now, it’s time to actually calculate the Negative Log-Likelihood for each individual data point.
This loop calculates the log-likelihood of each point, given the parameters (mean and sd, in this case). Note that C++ indexes from 0, while R indexes from 1, so the iteration must start at 0 and end at n-1; the condition i < n_data in the for loop accomplishes exactly that.
Although functions like dnorm() often have the same syntax as their R counterparts, this is not always the case, so always check the TMB documentation for these functions before finalizing your results.
for( int i=0; i<n_data; i++){ // iterating from i = 0 to n_data - 1
jnll -= dnorm( y_i(i), mean, sd, true ); //calculate individual data point likelihood, subtract from jnll.
} //closing the loop through each data point
Now, we just define what the function returns, and close the function.
// Reporting
return jnll;
} //closing the function
To use your C++/TMB code, you first need to compile it. On Windows machines, compiling will create .dll (“dynamically linked library”) and .o files; on Macs, it will create a .so file rather than a .dll. A result of “0” means that the compilation worked, and you can check for the compiled code files in the same directory as the file path you provide for your C++ code.
compile( "normal_distribution.cpp" ) # compile() is a function within the TMB library
## Note: Using Makevars in C:/Users/stubbsrw/Documents/.R/Makevars
## [1] 0
Rather than simply compiling the code, I like to make sure I am using the most up-to-date version of the compiled code– so, each time I run the code, I look for existing compiled files, and erase them before trying the compile. Since the “.o” file extension is common to both Windows and Macs, I look for that file extension and then eliminate all related extensions. This block of code is an example of that strategy:
# Compile C++ TMB Code
cpp_version<-"normal_distribution"
if (file.exists(paste0(cpp_version,".o"))) file.remove(paste0(cpp_version,".so"), paste0(cpp_version,".o"),paste0(cpp_version,".dll"))
## Warning in file.remove(paste0(cpp_version, ".so"), paste0(cpp_version,
## ".o"), : cannot remove file 'normal_distribution.so', reason 'No such file
## or directory'
## Warning in file.remove(paste0(cpp_version, ".so"), paste0(cpp_version,
## ".o"), : cannot remove file 'normal_distribution.dll', reason 'Permission
## denied'
## [1] FALSE TRUE FALSE
compile(paste0(cpp_version,".cpp")) # Compile that file!
## Note: Using Makevars in C:/Users/stubbsrw/Documents/.R/Makevars
## [1] 0
if (file.exists(paste0(cpp_version,".o"))) # Check to see if it went through:
print("Yep, that seems to have worked!") else stop("Whoah, looks like your code didn't actually compile...")
## [1] "Yep, that seems to have worked!"
Now that you’ve compiled the code, it’s ready to use; however, it isn’t linked to R yet. To do this, we need to load the compiled dynamically linked library (.dll or .so) into R so that it’s ready to use.
dyn.load(dynlib(cpp_version))
# I use "cpp_version" that I defined above to make sure I'm always
# being consistent with what I'm compiling and loading.
Now, it’s time to take advantage of TMB as a wrapper for C++. With one simple call, MakeADFun(), TMB packages our data, parameters, and objective function for optimization.
Obj <- MakeADFun(data=Data, parameters=Params, DLL=cpp_version)
Now, we can plug this object into R’s optimizers. Note that the data argument isn’t necessary this time; it’s already been wrapped up in Obj$fn. There’s one more piece that we can add: the function’s gradient (Obj$gr), which the call to MakeADFun() creates for us.
From my understanding, the gradient is basically the set of first derivatives of the JNLL with respect to each parameter: it describes how quickly the JNLL changes as the optimizer moves through parameter space.
Gradient-based methods take steps in the direction of steepest descent. Basically, the optimizer looks for the fastest way to get where it wants to go (the minimum, which it may approach past local minimums along the way) by taking the steps where each move makes the most gains.
The curvature of the JNLL surface near that minimum is also what TMB uses to report standard errors for the parameters it estimates: the more sharply the JNLL rises when a parameter moves away from its estimate, the more certain (smaller standard error) that estimate is.
Opt<- optim(par=Obj$par, fn=Obj$fn, gr=Obj$gr)
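As an aside, many TMB examples use nlminb() (another optimizer in R’s {stats} library) instead of optim(); it takes the starting values, the objective function, and the gradient as its first three arguments. A minimal, roughly equivalent sketch:
# An alternative optimizer call; the results should closely match optim() above
Opt_nlminb <- nlminb(start=Obj$par, objective=Obj$fn, gradient=Obj$gr)
Opt_nlminb$par # parameter estimates (mean and log_sd)
Opt_nlminb$objective # the minimized JNLL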
Let’s check out our parameter estimates:
params<-get_params(Opt)
print(params)
## names values trans
## 1: mean 49.962140 NA
## 2: log_sd 1.791221 5.996769
TMB can also report the standard errors of what it estimates using the sdreport() function. Let’s check it out:
sd<-sdreport( Obj ) # standard errors
## outer mgc: 8.767897
## outer mgc: 8.770823
## outer mgc: 8.764414
## outer mgc: 28.73039
## outer mgc: 11.23456
sd
## sdreport(.) result
## Estimate Std. Error
## mean 49.962140 0.059967749
## log_sd 1.791221 0.007074174
## Maximum gradient component: 8.767897
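If you want those estimates and standard errors in a form you can manipulate further, calling summary() on the sdreport object returns them as a matrix (the values will match the printout above):
sd_summary <- summary(sd) # a matrix with "Estimate" and "Std. Error" columns
sd_summary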
This is pretty cool.
Next up– moving beyond simple distribution parameters onto simple linear models.