Introduction

If you haven’t built many classification models, a great question you might ask yourself is: “How do I determine how well my classification model is performing?”

To keep things simple, two primary choices you might consider for evaluating the “goodness” of your model are: 1) the overall error rate or 2) the cross-entropy. The first option is the most intuitive and simple to compute: you just count the number of incorrect predictions and divide by the total number of predictions.
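For instance, here’s a minimal Python sketch of that calculation (the labels and lists below are made up for illustration, not the trigram data used later):

```python
# Overall error rate: the fraction of predictions that were wrong.
# `predictions` and `actuals` are hypothetical parallel lists of class labels.
predictions = ["run", "eat", "play", "shout", "run"]
actuals     = ["run", "eat", "shout", "shout", "play"]

errors = sum(1 for p, a in zip(predictions, actuals) if p != a)
error_rate = errors / len(actuals)
print(f"error rate: {error_rate:.2f}")  # 2 wrong out of 5 -> 0.40
```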

The cross-entropy (aka the negative log likelihood) is a more robust measure of accuracy for many classification problems than the overall error rate, and we’ll see why shortly. You can find plenty on this topic if you want to dig into the theory, but my goal here is to help you develop a good intuition for the concept, which is easiest to see with an example.

Let’s say we are building a language model designed to predict the next word in a given phrase. We have a set of bigrams, each of which can only be completed with one of the following 4 words: run, shout, play, eat. Our prediction model assigns a probability to each of these words and then outputs the word with the highest probability. For a problem structured like this, we can consider each word to be a class. Say we have a set of 10 trigrams we plan to use to test our models:

like to run
love to eat
when I play
makes me shout
he will run
she will play
they will shout
Tom will eat
want to shout
they might play

Now let’s say that you have two models, MODEL1 and MODEL2, and you want to determine which one is better. The input to each model is the first two words (aka the bigram prefix) of a trigram from the list above, and the output is a probability, for each of the 4 possible words (classes), that it correctly completes the trigram. Each model then predicts the word with the highest probability. Let’s say you run both models on the bigram prefixes listed above and get the following results:

So which one is better? It’s hard to say if we look only at the overall error rate. But our models generate probabilities to make their predictions, and if those probabilities were the ones shown in the table below, which model would you say is better now?

You can download the LibreOffice Calc version of this spreadsheet here or the Excel version of this spreadsheet here.

To make the comparison easier, the colored cells are the probabilities each model assigned to the correct word (class). If a cell is green, this probability was the highest one computed and resulted in a correct prediction. If a cell is yellow, this probability was not the highest one computed and resulted in an incorrect prediction. If we look closely, we can see that MODEL1 tends to assign higher probabilities than MODEL2 to the green cells, and it also assigns higher probabilities than MODEL2 to the yellow cells. In other words, MODEL1 assigned higher probabilities to the correct class both when it was right (green cells) and when it was wrong (yellow cells). Knowing this, which model is doing a better job? MODEL1, because even though its overall error rate is the same as MODEL2’s, it assigns higher probabilities to the correct class regardless of whether it made a correct prediction.

So now we have an idea of how one model might be better than another even when it has the same overall error rate, but how can we account for this mathematically? The average cross-entropy does the trick nicely in this case. The cross-entropy (aka the negative log likelihood of the data) is a cost function that captures the idea we described earlier.

Recall what a cost function is: the larger its value, the worse your model is performing. So our cost function should increase when a model assigns lower probabilities to the correct class and decrease when it assigns higher probabilities to the correct class. Let’s take a look at the formula for the average cross-entropy and see whether it behaves the way we just described:

\[C = -\frac{1}{N}\sum_{n=1}^N \sum_{k=1}^K t_{n, k} \ln{y_{n, k}}\]

The term \(y_{n, k}\) is the probability the model outputs for class k on sample n. The term \(t_{n, k}\) is a binary-valued (can only be 0 or 1) indicator variable which is 1 when k is the correct class for sample n and 0 for every other class. Because only one \(t_{n, k}\) is 1 for each sample, the inner sum collapses to the log of the probability assigned to the correct class. The cross-entropy and average cross-entropy are computed in the spreadsheets linked above.
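As a concrete illustration, here is a small Python sketch of the formula. The probabilities below are made up for illustration, not the values from the spreadsheet, and since only one \(t_{n, k}\) is 1 per sample, the code simply sums the log of the probability assigned to the correct class:

```python
import math

# A minimal sketch of the average cross-entropy formula above.
# `probs` holds y_{n,k}: one row per test sample, one probability per class
# in the order [run, shout, play, eat]. `correct` gives the index of the
# correct class for each sample, playing the role of the indicator t_{n,k}.
# All numbers here are hypothetical, not the spreadsheet values.

def average_cross_entropy(probs, correct):
    """C = -(1/N) * sum over samples of ln(probability of the correct class)."""
    total = sum(math.log(row[k]) for row, k in zip(probs, correct))
    return -total / len(probs)

probs = [
    [0.7, 0.1, 0.1, 0.1],  # 0.7 assigned to the correct class "run"
    [0.1, 0.2, 0.1, 0.6],  # 0.6 assigned to the correct class "eat"
    [0.2, 0.5, 0.2, 0.1],  # only 0.2 on the correct class "play" (a miss)
]
correct = [0, 3, 2]

print(round(average_cross_entropy(probs, correct), 3))  # ~0.826
```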

If the model were perfect, it would assign a probability of 1 to the correct class for every sample. This would result in \(C = -(\frac{1}{N})N(\ln{1}) = 0\). Going in the other direction, the worse our model is, the higher the value we would expect for C. We can see this by imagining what happens when the model assigns low probabilities to the correct class: as \(y_{n, k}\) for the correct class gets smaller, \(\ln{y_{n, k}}\) becomes more negative (e.g. ln(0.1) = -2.3 and ln(0.01) = -4.6), so C gets larger.
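To connect this back to the MODEL1 / MODEL2 comparison, here is a sketch with made-up probabilities (not the ones in the spreadsheet): both models miss the same two samples, so their error rates match, but the one that puts more probability on the correct class gets the lower (better) average cross-entropy:

```python
import math

# Hypothetical probabilities each model assigned to the *correct* class for
# the same five test samples. Both models got the last two samples wrong,
# so their overall error rates are identical.
model_1 = [0.8, 0.7, 0.9, 0.4, 0.3]
model_2 = [0.5, 0.4, 0.6, 0.2, 0.1]

for name, y in [("MODEL1", model_1), ("MODEL2", model_2)]:
    c = -sum(math.log(p) for p in y) / len(y)
    print(f"{name}: average cross-entropy = {c:.3f}")
# MODEL1 comes out lower (better), matching the intuition above.
```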

If your model assigns some kind of score (e.g. a Stupid Back-Off language model) instead of probabilities, as Bayesian classifiers do (e.g. a Katz Back-Off language model), the same formula can still be applied if you normalize your scores so that they behave like probabilities, falling between 0 and 1.
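As a rough illustration (assuming the scores are non-negative, as Stupid Back-Off scores are), you could normalize by dividing each score by the total; the scores below are hypothetical:

```python
# Hypothetical raw scores for the four candidate words. Dividing each score
# by their sum yields probability-like values between 0 and 1 that sum to 1.
scores = {"run": 0.4, "shout": 1.2, "play": 0.9, "eat": 0.1}

total = sum(scores.values())
probs = {word: score / total for word, score in scores.items()}
print(probs)
```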

Hopefully this little tutorial provides some good intuition into the value of using cross-entropy to evaluate the accuracy of classification models. As mentioned earlier, cross-entropy is also referred to as the negative log likelihood of the data. The derivation of the formula above is rather straightforward, but we’ll discuss it in more detail in Part II.