Overview
You may recall from my first blog post [https://rpubs.com/mutuelinvestor/577619] that I’m pursuing my MS in data science to advance my horse picking abilities. To that end, logistic regression is likely to be an important tool in my arsenal. Here’s why.
Speed Figures provide valuable information to my horse picking process. Speed Figures are numbers or scores assigned to each race a horse has run in the past - a past performance history. Some equate this information to a company’s financial statements, as they afford the horse player vital information to shape her investment decision.
Figures take into account many factors: how wide a horse was running (ground loss), the surface the race was run on (dirt, grass, mud, slop), as well as any trouble a horse may have encountered during the running of the race. The ability to accurately answer the question “Will horse A improve this race or not?” would give a handicapper an edge at the races. Therefore this question is a likely response variable for a logistic regression in my future.
You can see from the figure above that horses, like humans, are not machines. Sometimes they run fast, sometimes they run not so fast, and sometimes they run faster than ever. If a horse player can use historical data to predict, before the race is run, whether a horse will improve or not, she would have a big advantage. Before we take on this analysis, it’s important to have a firm grasp of logistic regression. This is the subject of my blog post. In this post we’ll (1) review the basics of logistic regression, (2) drill down on Log Odds, and (3) develop a better understanding of logistic regression coefficients.
The basics of logistic regression
Logistic regression boils down to the following equation:
log(p/(1-p)) = β0 + β1X
The left-hand side, log(p/(1-p)), is called the logit or simply the log odds. If you look closely, the ratio inside the log is the probability of the desired outcome being true (p) divided by the probability of it not being true (1-p), which is the odds; taking the log of the odds gives the logit function. Looking to the right-hand side, we see that our logit is linearly related to X. Recall that in linear regression β1 gives the average increase in Y associated with a 1 unit increase in X. In logistic regression a 1 unit increase in X changes the logit by β1. If β1 is positive the change in the logit, or log odds, will be positive, and vice versa.
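To make the link between the logit and the probability concrete, here is a minimal Python sketch. The coefficient values (beta0 = -2.0, beta1 = 0.5) are hypothetical and chosen only for illustration; they do not come from any fitted model in this post.

```python
import math

def logit(p):
    """Log odds of a probability p, where 0 < p < 1."""
    return math.log(p / (1 - p))

def inverse_logit(z):
    """Convert log odds z back into a probability."""
    return 1 / (1 + math.exp(-z))

# Hypothetical coefficients, for illustration only
beta0, beta1 = -2.0, 0.5

for x in (0, 1, 2):
    z = beta0 + beta1 * x  # the right-hand side of the equation
    print(f"X={x}: log odds={z:+.2f}, probability={inverse_logit(z):.3f}")
```

Each step in X moves the log odds by exactly beta1, while the probability moves by a different amount each time, which is why the coefficients are read on the log odds scale.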
Why Log Odds
To understand Log Odds, first we must understand odds and probabilities. Odds are the ratio of something happening to something not happening. Probability, by contrast, is the ratio of something happening to everything that could happen. Assume we have a horse that could win 4 out of 10 races. The odds, 4 to 6 (4 wins to 6 non-wins), and the probability, 4/10, for our horse are depicted in Figure 2.
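The 4-out-of-10 example can be worked through in a few lines of Python. This is only a sketch; the helper names odds_from_probability and probability_from_odds are my own and not part of any library.

```python
from fractions import Fraction

wins, losses = 4, 6

odds = Fraction(wins, losses)                # 4 to 6, which reduces to 2/3
probability = Fraction(wins, wins + losses)  # 4 out of 10, which reduces to 2/5

def odds_from_probability(p):
    """Odds = p / (1 - p)."""
    return p / (1 - p)

def probability_from_odds(o):
    """Probability = odds / (1 + odds)."""
    return o / (1 + o)

print("odds:", odds, "=", round(float(odds), 4))
print("probability:", probability, "=", float(probability))
print("odds recovered from probability:", odds_from_probability(probability))
```

Using Fraction keeps the arithmetic exact, so the round trip between odds and probability can be checked without floating-point noise.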
Let’s focus for a moment on the odds as a ratio:
1. If the numerator were to increase toward infinity (wins become ever more favored), our odds would also approach infinity, because an ever-larger number divided by 6 grows without bound.
2. Conversely, if the denominator were to approach infinity (non-wins become ever more favored), our odds would approach zero, because 4 divided by an ever-larger number shrinks toward zero.
Therefore we have a situation where odds that favor winning range between 1 and infinity, while odds that favor not winning (losing) are squeezed between 0 and 1. The range of odds in favor of winning (1 to infinity) is much wider than the range of odds in favor of not winning (0 to 1), which makes the two hard to compare. Enter Log Odds.
* 4 to 6 = 0.6667 => log10(0.6667) = -0.176
* 6 to 4 = 1.5000 => log10(1.5000) = 0.176
By applying the log function we have created a symmetric range around an origin of zero - does that sound familiar?
In Figure 3 on the left you see odds for pairs of numbers which sum to 100 (10 to 90, 20 to 80, etc.). On the right side you see the log of those odds and the reason Log(Odds) is important - the values are spread symmetrically around zero, which moves us toward a roughly normal-looking distribution. This will make Log(Odds) very useful for finding probabilities in win / lose scenarios.
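The symmetry described above can be checked numerically. The sketch below uses evenly spaced pairs (10 to 90, 20 to 80, and so on) rather than the random pairs behind Figure 3, so it illustrates the idea rather than reproducing the figure. The values -0.176 and 0.176 quoted earlier correspond to base-10 logs; logistic regression itself works with the natural log, and the symmetry holds in either base.

```python
import math

# Pairs of (wins, losses) that sum to 100: (10, 90), (20, 80), ..., (90, 10)
pairs = [(w, 100 - w) for w in range(10, 100, 10)]

odds = [w / l for w, l in pairs]
log_odds = [math.log(o) for o in odds]      # natural log
log10_odds = [math.log10(o) for o in odds]  # base-10 log

for (w, l), o, ln_o, l10_o in zip(pairs, odds, log_odds, log10_odds):
    print(f"{w:>2} to {l:>2}: odds={o:6.3f}  ln={ln_o:+.3f}  log10={l10_o:+.3f}")
```

The raw odds are bunched between 0 and 1 on the losing side and stretch out well above 1 on the winning side, while each log value on one side has a mirror image of opposite sign on the other.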
Understand the logistic regression coefficients
The coefficients we get from a logistic regression tell us how much each variable contributes to the log odds. However, both the type of variable, categorical or continuous, and the amount of change in the explanatory variable affect the coefficient’s interpretation. Let’s see an example.
Figure 4 is the output from a sample logistic regression that answers the question: will a horse in a mixed-gender race win or lose?
We have one categorical variable, gender, and one continuous variable, Speed Figure. Let’s start with the categorical variable.
Categorical Coefficient - First we notice the p-value for Gender is less than 0.05. This tells us that it is statistically significant, and we can proceed to our interpretation. To interpret this result, we have to know what a 0 (low) and a 1 (high) correspond to. In this case 0 = female and 1 = male. The Gender coefficient is positive, which means that as gender “increases,” the odds of winning also increase. Since 1 = male, the interpretation of the Gender coefficient is that being male multiplies the odds of winning by 4 (equivalently, it adds ln(4), about 1.39, to the log odds).
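Here is a small Python sketch of how a coefficient and an odds ratio relate. It assumes the 4 in the Gender interpretation is an odds ratio (the exponentiated coefficient), and it uses a made-up baseline win probability of 0.10 for a female horse purely for illustration; neither number is taken from Figure 4.

```python
import math

# Assumption: the reported value of 4 is an odds ratio, i.e. exp(coefficient)
odds_ratio_gender = 4.0
beta_gender = math.log(odds_ratio_gender)  # coefficient on the log odds scale

# Hypothetical baseline: a female horse with a 10% chance of winning
p_female = 0.10
odds_female = p_female / (1 - p_female)

# Being male multiplies the odds (not the probability) by the odds ratio
odds_male = odds_female * odds_ratio_gender
p_male = odds_male / (1 + odds_male)

print(f"coefficient (log odds scale): {beta_gender:.3f}")
print(f"female: odds={odds_female:.3f}, probability={p_female:.3f}")
print(f"male:   odds={odds_male:.3f}, probability={p_male:.3f}")
```

Under these assumptions the odds are exactly 4 times larger for the male horse, but the probability is not 4 times larger, which is why it pays to keep odds and probability separate when reading the output.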
Continuous Coefficient - Our continuous variable is Speed Figure. Let’s assume the higher a speed figure, the faster the horse. Again, let’s start our interpretation with the p-value. The p-value is less than 0.05, so the variable is significant and it’s safe to proceed. Because Speed Figure is continuous, the interpretation of the coefficient (odds ratio) is a little different, but we can use the same logic. This odds ratio is interpreted in terms of each unit increase on the scale (i.e., going from 1 to 2, 2 to 3, 3 to 4, etc.). Thus, for each one-unit increase in the Speed Figure, the odds of winning increase by a factor of 2. This means that a horse with a figure of 3 has 2 times the odds of winning of a horse with a figure of 2. Viewed from the other direction, the odds ratio is inverted (1/2, or .5): a horse with a figure of 2 has half the odds of winning of a horse with a figure of 3. All of these comparisons are between horses with adjacent figures (i.e., 1 vs. 2, 2 vs. 3, and so on). To compare a horse with a figure of 2 to a horse with a figure of 5, things are a bit different.
At a Speed Figure of 2, the odds are 2 times those at a figure of 1; at 3, the odds are 4 times those at 1 (since they are 2 times the odds at a figure of 2, which are 2 times the odds at a figure of 1). Following this logic, to skip ahead more than one point at a time, you use the following equation:
(Odds Ratio ^ number of intervals) = multiplicative change in odds
So, for a horse with a figure of 5 (4 intervals from a figure of 1), the horse’s odds of winning are 2^4 = 16 times those of a horse with a figure of 1.
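The interval rule is easy to sketch in Python. The function name odds_multiplier is my own, and the per-unit odds ratio of 2 is simply the value used in the example above.

```python
def odds_multiplier(odds_ratio, figure_from, figure_to):
    """Factor by which the odds change when moving from one figure to another."""
    intervals = figure_to - figure_from
    return odds_ratio ** intervals

odds_ratio_speed = 2.0  # per-unit odds ratio from the example above

print(odds_multiplier(odds_ratio_speed, 1, 5))  # 4 intervals: 2^4
print(odds_multiplier(odds_ratio_speed, 2, 5))  # 3 intervals: 2^3
print(odds_multiplier(odds_ratio_speed, 3, 2))  # -1 interval: 2^-1
```

The second call answers the earlier question about comparing a figure of 2 with a figure of 5, and the third shows that moving down the scale simply inverts the ratio.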
Hope this helps advance your understanding of Logistic Regression and Horse Racing. Thanks for reading.