Overview

You may recall from my first blog post [https://rpubs.com/mutuelinvestor/577619] that I’m pursuing my MS in data science to advance my horse picking abilities. To that end, logistic regression is likely to be an important tool in my arsenal. Here’s why.

Speed Figures provide valuable information to my horse picking process. Speed Figures are numbers, or scores, assigned to each race a horse has run in the past - its past performance history. Some equate this information to a company’s financial statements, as it gives the horse player vital information to shape her investment decision.

Figures take into account many factors: how wide a horse was running (ground loss), the surface the race was run on (dirt, grass, mud, slop), and any trouble a horse may have encountered during the running of the race. The ability to accurately answer the question “Will horse A improve in this race or not?” would give a handicapper an edge at the races. This yes-or-no question is therefore a likely response variable for a logistic regression in my future.

Figure 1. Sample Speed Figures

_________

_________

You can see from the figure above that horses, like humans, are not machines. Sometimes they run fast, sometimes they run not so fast, and sometimes they run faster than ever. If a horse player can use historical data to predict, before the race is run, whether a horse will improve or not, she would have a big advantage. Before we take on this analysis, it’s important to have a firm grasp of logistic regression. This is the subject of my blog post. In this post we’ll (1) review the basics of logistic regression, (2) drill down on Log Odds, and (3) develop a better understanding of logistic regression coefficients.

The basics of logistic regression

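Logistic regression boils down to the following equation:

log(p/(1-p)) = β0 + β1X

Here p is the probability of the outcome (for example, that a horse improves), so the left-hand side is the log of the odds. Unlike linear regression, the equation carries no additive error term; the randomness comes from the yes-or-no outcome itself.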
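To make the equation concrete, here is a minimal Python sketch of moving between log odds and probability. The coefficient values and the input x are hypothetical, chosen only for illustration; they do not come from any fitted model.

```python
import math


def log_odds(p: float) -> float:
    """Logit: map a probability in (0, 1) to log odds."""
    return math.log(p / (1.0 - p))


def probability(z: float) -> float:
    """Inverse logit: map log odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients and input, for illustration only.
beta0 = -1.0
beta1 = 0.5
x = 3.0

z = beta0 + beta1 * x   # right-hand side of the equation: the log odds
p = probability(z)      # the probability implied by those log odds

print(f"log odds = {z:.3f}, probability = {p:.3f}")
```

Feeding p back into log_odds recovers z, which is all the equation says: the model is linear on the log-odds scale, not on the probability scale.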

Why Log Odds

To understand Log Odds, we must first understand odds and probabilities. Odds are the ratio of something happening to something not happening. Probability, by contrast, is the ratio of something happening to everything that could happen. Assume we have a horse that could win 4 out of 10 races. The odds are 4 to 6 (4 wins to 6 non-wins), or 4/6 ≈ 0.67, and the probability is 4/10 = 0.4, as depicted in Figure 2.

Figure 2. Odds vs Probability

_________

_________
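The same 4-wins-in-10-races example can be checked with a short Python sketch. It uses exact fractions so the conversion between odds and probability is easy to verify by hand; the variable names are my own.

```python
from fractions import Fraction

wins, races = 4, 10
losses = races - wins

prob = Fraction(wins, races)    # wins / everything that could happen
odds = Fraction(wins, losses)   # wins / non-wins

# Each quantity can be recovered from the other.
odds_from_prob = prob / (1 - prob)
prob_from_odds = odds / (1 + odds)

print(f"probability = {prob} ({float(prob):.2f})")
print(f"odds        = {odds} ({float(odds):.4f})")
```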

Let’s focus for a moment on the odds, 4/6:

1. If the numerator (wins) were to increase toward infinity, our odds would also approach infinity, because a very large number divided by 6 is still very large.

2. Conversely, if the denominator (non-wins) were to approach infinity, our odds would approach zero, because 4 divided by a very large number is approximately zero.

Odds of exactly 1 (as many wins as non-wins) mark the break-even point. We therefore have a situation where odds in favor of winning range between 1 and infinity, while odds in favor of not winning are squeezed between 0 and 1. The range of the odds in favor of winning (1 to infinity) is much wider than the range of the odds in favor of losing (0 to 1). Enter Log Odds.

Log Odds are nothing but the log of the odds - log(odds). It’s a transform that makes the magnitude of the odds symmetric.

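Returning to our original example (using base-10 logs):

* 4 to 6 = 0.6667 => log10(0.6667) = -0.176

* 6 to 4 = 1.5000 => log10(1.5000) = 0.176

By applying the log function we have created a symmetric range around an origin of zero - does that sound familiar? The symmetry holds for any base; with the natural log, which logistic regression uses, the two values are -0.405 and 0.405.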
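As a quick check of the symmetry, this small Python sketch takes the log of the 4-to-6 odds and of the reversed 6-to-4 odds, in both base 10 and the natural log. It only uses the standard math module.

```python
import math

odds_for = 4 / 6       # 4 wins to 6 non-wins
odds_against = 6 / 4   # the same horse, viewed from the losing side

for name, log in (("log10", math.log10), ("ln", math.log)):
    a = log(odds_for)
    b = log(odds_against)
    # Reciprocal odds give logs of equal size and opposite sign.
    print(f"{name}: {a:+.3f} and {b:+.3f}")
```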

In Figure 3, on the left, you see odds for pairs of numbers that sum to 100 (10 to 90, 20 to 80, etc.). On the right side you see the log of those odds and the reason Log(Odds) is important: the values spread out symmetrically around zero, which moves us toward a shape resembling a normal distribution. This makes Log(Odds) very useful for finding probabilities in win / lose scenarios.

Figure 3. Odds vs Log(Odds)

_________

_________
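The numbers behind a comparison like Figure 3 can be sketched in a few lines of Python. This is my own reconstruction, assuming the splits run from 10-to-90 up to 90-to-10 in steps of 10; the actual data behind the figure may differ.

```python
import math

rows = []
for wins in range(10, 100, 10):
    losses = 100 - wins
    odds = wins / losses
    rows.append((wins, losses, odds, math.log(odds)))

print(f"{'split':>9} {'odds':>8} {'log(odds)':>10}")
for wins, losses, odds, log_odds in rows:
    print(f"{wins:>3} to {losses:<3}{odds:>8.3f} {log_odds:>10.3f}")
```

The odds column is lopsided: everything below even money is packed between 0 and 1, while everything above stretches out to 9. The log(odds) column is a mirror image around zero.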

Understanding the logistic regression coefficients

The coefficients we get from a logistic regression tell us how much each variable contributes to the log odds. However, both the type of variable, categorical or continuous, and the amount of change in the explanatory variable affect the coefficient’s interpretation. Let’s see an example.

Figure 4 is output from a sample logistic regression that answers the question: will a horse in a mixed gender race win or lose?

Figure 4. Logistic Regression Results - Win or not win

Variable        Coefficient    p value
Gender          4.00           0.030
Speed Figure    2.00           0.001

We have one categorical variable, gender, and one continuous variable, Speed Figure. Let’s start with the categorical variable.

Categorical Coefficient - First we notice the p-value for Gender is less than 0.05. This tells us that it is statistically significant, and we can proceed to our interpretation. To interpret this result, we have to know what a 0 (low) and a 1 (high) correspond to. In this case 0 = female and 1 = male. We read the values in Figure 4 as odds ratios, that is, exponentiated coefficients. An odds ratio above 1 corresponds to a positive coefficient, so as gender “increases” from 0 to 1, the odds of winning also increase. Since 1 = male, the interpretation is that being male multiplies the odds of winning by 4 (equivalently, it adds log(4) ≈ 1.39 to the log odds).
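Here is a minimal Python sketch of that interpretation, assuming the Gender value of 4.00 in Figure 4 is an odds ratio (an exponentiated coefficient). The baseline odds for a female horse are a made-up number, used only to show what the multiplier does.

```python
import math

odds_ratio_gender = 4.0                        # from Figure 4, read as an odds ratio
raw_coefficient = math.log(odds_ratio_gender)  # the same effect on the log-odds scale

# Hypothetical baseline: a female horse with odds of 1 to 4 (0.25).
female_odds = 0.25
male_odds = female_odds * odds_ratio_gender    # the odds are multiplied by 4

female_prob = female_odds / (1 + female_odds)
male_prob = male_odds / (1 + male_odds)

print(f"log-odds coefficient: {raw_coefficient:.3f}")
print(f"female: odds {female_odds:.2f}, probability {female_prob:.2f}")
print(f"male:   odds {male_odds:.2f}, probability {male_prob:.2f}")
```

With this baseline the odds quadruple, but the probability of winning only goes from 0.20 to 0.50. An odds ratio of 4 is a statement about odds, not a fourfold increase in the chance of winning.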

Continuous Coefficient - Our continuous variable is Speed Figure. Let’s assume that the higher the Speed Figure, the faster the horse. Again, let’s start our interpretation with the p value. The p value is less than 0.05, so the variable is significant and it’s safe to proceed. Because Speed Figure is continuous, the interpretation of the coefficient (odds ratio) is a little different, but we can use the same logic. This odds ratio is interpreted in terms of each unit increase on the scale (i.e., going from 1 to 2, 2 to 3, 3 to 4, etc.). Thus, for each one-unit increase in the Speed Figure, the odds of winning increase by a factor of 2. This means that the odds of winning for a horse with a figure of 3 are 2 times the odds for a horse with a figure of 2. Note that this is a statement about odds, not probability: doubling the odds does not double the chance of winning. Likewise, inverting the odds ratio (1/2, or 0.5) describes how much lower the odds are for a horse with a figure of 2 than for a horse with a figure of 3. All of these comparisons are between horses with adjacent figures (i.e., 1 vs. 2, 2 vs. 3, and so on). To compare a horse with a figure of 2 to a horse with a figure of 5, things are a bit different.

At a Speed Figure of 2, the odds are 2 times the odds at 1; at 3, the odds are 4 times the odds at 1 (since they are 2 times the odds at a figure of 2, which are in turn 2 times the odds at a figure of 1). Following this logic, when skipping ahead more than one point at a time, you use the following equation:

(Odds Ratio ^ number of intervals difference) = multiplicative change in odds

So, for a horse with a figure of 5 (4 intervals from a figure of 1), the horse’s odds of winning are 2^4 = 16 times the odds of a horse with a figure of 1. Returning to the comparison of a figure of 2 with a figure of 5 (3 intervals), the odds are 2^3 = 8 times greater.

Hope this helps advance your understanding of Logistic Regression and Horse Racing. Thanks for reading.
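The interval rule is easy to sketch in Python. The function name odds_multiplier is my own, and the odds ratio of 2 is the Speed Figure value from Figure 4, read as an odds ratio.

```python
def odds_multiplier(odds_ratio: float, intervals: int) -> float:
    """Factor by which the odds change after moving `intervals` units."""
    return odds_ratio ** intervals


speed_figure_or = 2.0  # Speed Figure value from Figure 4, read as an odds ratio

print(odds_multiplier(speed_figure_or, 5 - 1))  # figure 5 vs. figure 1
print(odds_multiplier(speed_figure_or, 5 - 2))  # figure 5 vs. figure 2
print(odds_multiplier(speed_figure_or, 2 - 3))  # figure 2 vs. figure 3
```

A negative number of intervals covers the inverted case: moving down one unit halves the odds.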

References:

Odds and Probability - https://www.ctspedia.org/do/view/CTSpedia/OddsTerm

Simply Explained Logistic Regression with Example in R - https://towardsdatascience.com/simply-explained-logistic-regression-with-example-in-r-b919acb1d6b3

Interpreting Binary Logistic Regression

WHAT and WHY of Log Odds - https://towardsdatascience.com/https-towardsdatascience-com-what-and-why-of-log-odds-64ba988bf704

Why use Odds Ratios in Logistic Regression - https://www.theanalysisfactor.com/why-use-odds-ratios/