The Kaggle March Machine Learning competition uses the following scoring formula:

\[LogLoss = - \frac{1}{n} \sum\limits_{i=1}^n [y_i \cdot log_e(\hat{y_i}) + (1-y_i) \cdot log_e(1-\hat{y_i}) ]\]

where \(n\) is the number of games played, \(\hat{y_i}\) is the predicted probability of team1 beating team2, and \(y_i\) is 1 if team1 wins and 0 if team2 wins.

A smaller LogLoss is better. Let’s write this function in R and see how it behaves for a few sets of predictions and results.

LogLoss <- function(pred, res){
  # pred: predicted probabilities that team1 wins; res: 1 if team1 won, 0 if team2 won
  (-1/length(pred)) * sum(res * log(pred) + (1 - res) * log(1 - pred))
}

pred <- rep(0.5, 10)
res <- rep(c(0,1), 5)
pred; res
##  [1] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
##  [1] 0 1 0 1 0 1 0 1 0 1
LogLoss(pred, res)
## [1] 0.6931472
pred <- rep(c(.4,.6), 5)
res <- rep(c(0,1), 5)
pred; res
##  [1] 0.4 0.6 0.4 0.6 0.4 0.6 0.4 0.6 0.4 0.6
##  [1] 0 1 0 1 0 1 0 1 0 1
LogLoss(pred, res)
## [1] 0.5108256
pred <- rep(0.5, 5)
res <- c(1,1,1,1,0)
LogLoss(pred, res)
## [1] 0.6931472
pred <- rep(0.7, 5)
res <- c(1,1,1,1,0)
LogLoss(pred, res)
## [1] 0.5261345
pred <- rep(0.8, 5)
res <- c(1,1,1,1,0)
LogLoss(pred, res)
## [1] 0.5004024
pred <- rep(0.9, 5)
res <- c(1,1,1,1,0)
LogLoss(pred, res)
## [1] 0.5448054
pred <- rep(0.95, 5)
res <- c(1,1,1,1,0)
LogLoss(pred, res)
## [1] 0.6401811
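
The last few examples hint at a general point: with four wins and one loss, a constant prediction of 0.8 (the observed win rate) gives the smallest LogLoss, and moving in either direction makes it worse. A quick sketch that sweeps a grid of constant predictions (reusing the LogLoss function above; the grid values are just illustrative) shows this directly:

res <- c(1, 1, 1, 1, 0)                       # four wins and one loss, as above
p_grid <- seq(0.05, 0.95, by = 0.01)          # candidate constant predictions (illustrative grid)
losses <- sapply(p_grid, function(p) LogLoss(rep(p, length(res)), res))
p_grid[which.min(losses)]                     # the minimizer should be 0.8, the observed win rate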

What is the logic behind LogLoss?

The predictions for which

\[- \frac{1}{n} \sum\limits_{i=1}^n [y_i \cdot log_e(\hat{y_i}) + (1-y_i) \cdot log_e(1-\hat{y_i}) ]\]

is smallest must have

\[ \sum\limits_{i=1}^n [y_i \cdot log_e(\hat{y_i}) + (1-y_i) \cdot log_e(1-\hat{y_i}) ]\]

that is largest, and therefore (since \(e^x\) is increasing)

\[ e^{\sum\limits_{i=1}^n [y_i \cdot log_e(\hat{y_i}) + (1-y_i) \cdot log_e(1-\hat{y_i}) ]}\]

that is largest. We can rearrange this to:

\[ \prod{e^{[y_i \cdot log_e(\hat{y_i}) + (1-y_i) \cdot log_e(1-\hat{y_i}) ]}}\]

and

\[ \prod{e^{y_i \cdot log_e(\hat{y_i})} \cdot e^{(1-y_i) \cdot log_e(1-\hat{y_i}) }}\]

\[ \prod{[ \hat{y_i}^{y_i} \cdot (1-\hat{y_i})^{1-y_i} ]}\]

For every game, one of these two factors is just 1 (something raised to the \(0^{th}\) power). Ultimately, the product just amounts to:

\[ \prod{p_i}\] where \(p_i\) is the probability that was predicted for the outcome that actually occurred.

In other words, we are maximizing the probability that the observed results would have occurred if the predicted probabilities were correct. Put another way, the predictions with the lowest LogLoss are the predictions for which \(P(results|predictions)\) is the highest.
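
As a quick numerical check of this last claim, \(e^{-n \cdot LogLoss}\) should match the product of the probabilities assigned to the outcomes that actually happened. A short sketch (the pred and res values here are just made up for illustration):

pred <- c(0.7, 0.6, 0.9, 0.4, 0.8)            # hypothetical predicted probabilities for five games
res  <- c(1, 1, 0, 0, 1)                      # hypothetical results
n <- length(pred)
exp(-n * LogLoss(pred, res))                  # P(results | predictions)
prod(ifelse(res == 1, pred, 1 - pred))        # the same product, computed directly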