DATA 606 Data Project

Part 1 - Introduction

Accurate speed figures are a necessity for any serious horse player. Ragozin speed figures have a reputation for being some of the best in the industry. That said, as the popularity of the Ragozin figures increases, their marginal value decreases, because more horse players are using the same data to uncover winning wagers at the race track. Conventional speed figure wisdom says to look for horses who show a slight improvement over the best figure they have ever run (TOP). For example, if a horse has a TOP of 10 and then runs a 9.75, this would be an indication that the horse is in great condition and is about to run a big race and a new TOP, say an 8 or 9 (lower is better). Ragozin speed figure players look for this and other patterns with the objective of finding horses that are about to run better than they have ever run before. These horses often represent lucrative betting interests because they are doing something they have never done before and that the betting public may not be expecting.

The aim of this study was to analyze figure patterns in depth to determine whether logit regression could be used to predict new TOPs. Specifically, I looked at 2- and 3-race speed figure patterns (along with other variables) to see if they can predict new TOPs. Accordingly, my dependent variable (isTop) was coded as 1 if the horse ran a new TOP and 0 if he did not.

The 2-Race and/or 3-Race patterns were coded relative to a horse's current TOP (best speed figure). There are four possible results of a race:

T - represents a new TOP (Coded as 1 Success)
P - means a horse paired or equaled her existing TOP (i.e., the horse has a TOP of 10 and runs another 10) (Coded as 0 Failure)
E - is an effort, a race from .25 to 5 points higher (slower) than the existing TOP (Coded as 0 Failure)
X - is an off (poor) race, a speed figure more than 5 points higher (slower) than the current TOP (Coded as 0 Failure)

With that as background, here are some sample patterns:

XXX - represents a horse that ran three off races prior to the upcoming race
TXP - represents a horse that ran a TOP (T) most recently, prior to that ran an X, and prior to that Paired his TOP (P).
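
The coding rules and pattern construction described above can be sketched in a few lines of R. This is a minimal illustration, not the code used to build the data set: the function names classify_outcome and two_race_pattern are hypothetical, figures are given in points rather than the hundredths stored in FgrVle, and each race history is assumed to be ordered most recent first.

```r
# Hypothetical helper: code figures relative to the TOP held going into the
# race (lower figures are faster).
classify_outcome <- function(figure, top) {
  move <- figure - top                         # > 0 means slower than the TOP
  ifelse(move < 0, "T",                        # new TOP
         ifelse(move == 0, "P",                # paired the TOP
                ifelse(move <= 5, "E", "X")))  # effort vs. off race
}

# Hypothetical helper: for each race, paste together the outcomes of the two
# races that preceded it. `outcomes` is assumed to be ordered most recent first.
two_race_pattern <- function(outcomes) {
  n <- length(outcomes)
  if (n < 3) return(rep(NA_character_, n))
  c(paste0(outcomes[2:(n - 1)], outcomes[3:n]), NA_character_, NA_character_)
}

# Toy example with made-up figures for a horse whose TOP is 10.
outcomes_demo <- classify_outcome(c(9.75, 10, 12, 16), top = 10)
print(outcomes_demo)

# Outcome history of the first six ABELTASMAN rows shown in Part 2.
pattern_demo <- two_race_pattern(c("X", "T", "P", "X", "P", "E"))
print(pattern_demo)
```

The two oldest races in a history have no complete 2-race pattern, so the sketch returns NA for them.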

Through exploratory data analysis the research question for this analysis has been refined to:

RESEARCH QUESTION:

Are 2-race speed figure patterns and other categorical and numerical data (gender, age, racing surface, etc.) employed in a LOGIT model able to predict whether a horse will run a TOP in his next race?

Part 2 - Data

I collected the data myself from Ragozin Speed Figures (http://thesheets.com) for the 2018 Breeders' Cup races run at Churchill Downs on November 3, 2018. The data set comprises the past performances of 140 thoroughbred race horses (1,921 race records in total).

# load data
setwd('\\Ragozin\\Files\\csv\\')
inputData <- read.csv(file='ProjectData.csv', stringsAsFactors = T)

The following head and str statements provide insight into the data set:

head(inputData)

##   RecordID Nbr       Hrse Gndr Age Top Outcome Outcome1       Figure
## 1     1350  15 ABELTASMAN    F   4 175       X      X13 15 vsfAWSA30
## 2     1351  14 ABELTASMAN    F   4 175       T       T2  2- wBAWSr25
## 3     1352  13 ABELTASMAN    F   4 375       P       P0   4- wAWBE 9
## 4     1353  12 ABELTASMAN    F   4 375       X      X10 13" vtAWCD 4
## 5     1354  11 ABELTASMAN    F   3 375       P       P0   4- VAWDM 3
## 6     1355  10 ABELTASMAN    F   3 375       E       E2 5" YstAWPX23
##   FgrVle Trnr TrkCde Srfce  Wthr Trble SmrtMny4 isTop TopType Move
## 1   1500  BBt     SA  Dirt Clear  <NA>     <NA>     0    1325 1325
## 2    175  BBt     Sr  Dirt Clear  <NA>     <NA>     1    -200 -200
## 3    375  BBt     BE  Dirt Clear  <NA>     <NA>     0       0    0
## 4   1350  BBt     CD  Dirt Clear   SML     <NA>     0     975  975
## 5    375  BBt     DM  Dirt Clear  <NA>     <NA>     0       0    0
## 6    550  BBt     PX  Dirt Clear   SML     <NA>     0     175  175
##   SoundnessFlag      RACE_DATE REST DaysSinceTop P12  P12D P123   P123D
## 1             1 9/30/2018 0:00   36           36  TP  T2P0  TPX T2P0X10
## 2             0 8/25/2018 0:00   77          441  PX P0X10  PXP P0X10P0
## 3             0  6/9/2018 0:00   36          364  XP X10P0  XPE X10P0E2
## 4             0  5/4/2018 0:00  182          328  PE  P0E2  PEX  P0E2X6
## 5             1 11/3/2017 0:00   41          146  EX  E2X6  EXT  E2X6T1
## 6             0 9/23/2017 0:00   62          105  XT  X6T1  XTE  X6T1E2
##   LEVEL WITHLASIX LEVLESSTOP        FVz
## 1   375         0        200  0.5938993
## 2   550         0        375 -1.3970850
## 3   550         0        175 -1.0965590
## 4   550         0        175  0.3685049
## 5   750         0        375 -1.0965590
## 6   750         0        375 -0.8335988

str(inputData)

## 'data.frame':    1921 obs. of  31 variables:
##  $ RecordID     : int  1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 ...
##  $ Nbr          : int  15 14 13 12 11 10 9 8 7 6 ...
##  $ Hrse         : Factor w/ 140 levels "ABELTASMAN","ALMANAAR",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Gndr         : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Age          : int  4 4 4 4 3 3 3 3 3 3 ...
##  $ Top          : int  175 175 375 375 375 375 375 375 500 500 ...
##  $ Outcome      : Factor w/ 4 levels "E","P","T","X": 4 3 2 4 2 1 4 3 1 1 ...
##  $ Outcome1     : Factor w/ 47 levels "E0","E1","E2",..: 25 14 7 22 7 3 40 9 3 5 ...
##  $ Figure       : Factor w/ 1722 levels "- 1- VwAWBE 9",..: 859 946 1016 790 1013 1078 1340 1019 1712 1333 ...
##  $ FgrVle       : int  1500 175 375 1350 375 550 925 375 750 925 ...
##  $ Trnr         : Factor w/ 103 levels "AAn","ADr","AFe",..: 10 10 10 10 10 10 10 10 10 10 ...
##  $ TrkCde       : Factor w/ 57 levels "Ai","AP","AQ",..: 47 48 4 6 10 45 48 4 6 47 ...
##  $ Srfce        : Factor w/ 2 levels "Dirt","Turf": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Wthr         : Factor w/ 5 levels "Big Wind","Clear",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Trble        : Factor w/ 2 levels "BIG","SML": NA NA NA 2 NA 2 NA NA 2 NA ...
##  $ SmrtMny4     : Factor w/ 1 level "MONY": NA NA NA NA NA NA NA NA NA NA ...
##  $ isTop        : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ TopType      : int  1325 -200 0 975 0 175 550 -125 250 425 ...
##  $ Move         : int  1325 -200 0 975 0 175 550 -125 250 425 ...
##  $ SoundnessFlag: int  1 0 0 0 1 0 0 0 0 0 ...
##  $ RACE_DATE    : Factor w/ 601 levels "1/1/2018 0:00",..: 588 504 398 329 132 568 423 342 330 284 ...
##  $ REST         : int  36 77 36 182 41 62 43 36 27 35 ...
##  $ DaysSinceTop : int  36 441 364 328 146 105 43 98 62 35 ...
##  $ P12          : Factor w/ 16 levels "EE","EP","ET",..: 10 8 14 5 4 15 9 1 3 9 ...
##  $ P12D         : Factor w/ 449 levels "E0E2","E0E3",..: 196 141 278 127 56 374 175 37 91 192 ...
##  $ P123         : Factor w/ 63 levels "EEE","EEP","EET",..: 40 30 53 20 15 57 33 3 9 35 ...
##  $ P123D        : Factor w/ 1201 levels "E0E2E2","E0E2E5",..: 718 553 895 481 208 1053 622 124 371 691 ...
##  $ LEVEL        : int  375 550 550 550 750 750 750 750 925 950 ...
##  $ WITHLASIX    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LEVLESSTOP   : int  200 375 175 175 375 375 375 375 425 450 ...
##  $ FVz          : num  0.594 -1.397 -1.097 0.369 -1.097 ...

Please see the Appendix for a description of key variables.

Part 3 - Exploratory data analysis

The original project proposal included exploratory data analysis for a variety of independent variables versus race outcomes (T, P, E, X). I have included some of these plots below for the more statistically significant variables and have added an information value analysis.

plot(inputData$Outcome ~ inputData$P12, ylab = "Race Outcome", xlab="2 Race Pattern", main="EDA Plot\n P12 vs Outcome")

This plot is encouraging, as it seems to demonstrate that different patterns yield different outcomes. For example, the XX pattern seems to result in noticeably more X outcomes and far fewer T outcomes. Compare this to the EE pattern, which seems to yield few Xs and many Ts.

cdplot(inputData$Outcome ~ inputData$DaysSinceTop, ylab = "Race Outcome", xlab="Days Since Top", main="EDA Plot\n Days Since Top vs Outcome")

Days Since Top also offers some potentially valuable insights and raises some interesting questions. For example, why are there so few TOPs at 600 days compared with 700 days?

cdplot(inputData$Outcome ~ inputData$LEVLESSTOP, ylab = "Race Outcome", xlab="Level Less Top", main="EDA Plot\n LEVLESSTOP vs Outcome")

LEVLESSTOP (Level less TOP) measures how close a horse's median figure (its Level) is to its TOP. The hypothesis is that the closer the Level is to the TOP, the better the horse's condition and thus the higher the chance of a new TOP.

plot(inputData$Outcome ~ inputData$WITHLASIX, ylab = "Race Outcome", xlab="With Lasix", main="EDA Plot\n WITHLASIX vs Outcome")

Lasix is a breathing medication that is known to cause horses to improve (run new TOPs).

Next, an Information Value analysis was conducted.

Compute Information Values To Assist With Model Construction

library(smbinning)

Separate the categorical and continuous variables and create the information value data frame iv_df

factor_vars <- c("Gndr", "Trnr", "Srfce", "Wthr", "Trble", "P12", "P123")

continuous_vars <- c("Age", "Trnr", "FgrVle", "SoundnessFlag", "REST", "DaysSinceTop", "LEVEL", "LEVLESSTOP")  # note: "Trnr" is a factor already listed in factor_vars, so it appears twice in iv_df

iv_df <- data.frame(VARS=c(factor_vars, continuous_vars), IV=numeric(15))

Compute IV for categorical variables

for(factor_var in factor_vars){
  smb <- smbinning.factor(inputData, y="isTop", x=factor_var)  # WOE table
  if(!is.character(smb)){  # smbinning returns a message string when it cannot bin the variable
    iv_df[iv_df$VARS == factor_var, "IV"] <- smb$iv
  }
}

Compute IV for continuous variables

for(continuous_var in continuous_vars){
  smb <- smbinning(inputData, y="isTop", x=continuous_var)  # WOE table
  if(!is.character(smb)){  # skip variables for which no binning could be computed
    iv_df[iv_df$VARS == continuous_var, "IV"] <- smb$iv
  }
}

Sorted Information Values

iv_df <- iv_df[order(-iv_df$IV), ]  # sort
iv_df

##             VARS     IV
## 8            Age 0.6611
## 14         LEVEL 0.6533
## 13  DaysSinceTop 0.3570
## 10        FgrVle 0.0728
## 4           Wthr 0.0088
## 1           Gndr 0.0044
## 5          Trble 0.0028
## 3          Srfce 0.0021
## 2           Trnr 0.0000
## 6            P12 0.0000
## 7           P123 0.0000
## 9           Trnr 0.0000
## 11 SoundnessFlag 0.0000
## 12          REST 0.0000
## 15    LEVLESSTOP 0.0000
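
Because the table above reports an IV of 0 for P12, P123 and Trnr, a manual cross-check may be useful. The sketch below computes the information value of a categorical predictor directly from its definition, IV = sum over levels of (share of successes - share of failures) x ln(share of successes / share of failures). The function name information_value and the toy data are hypothetical, and this is not what smbinning does internally. One possible explanation for the zeros, not verified against this data, is the maxcat argument of smbinning.factor, which per the package documentation defaults to 10, while P12 has 16 levels, P123 has 63 and Trnr has 103.

```r
# Hypothetical helper: information value of a categorical predictor x for a
# 0/1 outcome y, computed from the definition in base R.
information_value <- function(x, y) {
  counts <- table(x, factor(y, levels = c(0, 1)))
  pct_fail    <- counts[, "0"] / sum(counts[, "0"])  # share of failures per level
  pct_success <- counts[, "1"] / sum(counts[, "1"])  # share of successes per level
  keep <- pct_fail > 0 & pct_success > 0             # WOE is undefined otherwise
  woe <- log(pct_success[keep] / pct_fail[keep])
  sum((pct_success[keep] - pct_fail[keep]) * woe)
}

# Made-up data: "EE" horses run a TOP 7 times in 10, "XX" horses 3 times in 10.
pattern_demo <- c(rep("EE", 10), rep("XX", 10))
is_top_demo  <- c(rep(1, 7), rep(0, 3), rep(1, 3), rep(0, 7))
iv_informative <- information_value(pattern_demo, is_top_demo)
print(iv_informative)

# A predictor unrelated to the outcome should have an IV of 0.
iv_flat <- information_value(rep(c("A", "B"), 10), rep(c(1, 1, 0, 0), 5))
print(iv_flat)
```

With the real data, a call such as information_value(inputData$P12, inputData$isTop) would give a figure to compare against the package output.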

Utilizing the EDA and Information Value analysis, the following logit model was developed:

isTop ~ P12 + Age + LEVEL + REST + DaysSinceTop + LEVLESSTOP + WITHLASIX

Prior to executing the model, training and test data sets were created. The training set was balanced to address a class bias (roughly twice as many failures as successes) identified in the data.

input_ones <- inputData[which(inputData$isTop == 1), ]  # all 1's
input_zeros <- inputData[which(inputData$isTop == 0), ]  # all 0's

set.seed(100)  # for repeatability of samples

input_ones_training_rows <- sample(1:nrow(input_ones), 0.80*nrow(input_ones))  # 1's for training
input_zeros_training_rows <- sample(1:nrow(input_zeros), 0.80*nrow(input_ones))  # 0's for training. Pick as many 0's as 1's

training_ones <- input_ones[input_ones_training_rows, ]  
training_zeros <- input_zeros[input_zeros_training_rows, ]
trainingData <- rbind(training_ones, training_zeros)  # row bind the 1's and 0's

Create Test Data

test_ones <- input_ones[-input_ones_training_rows, ]
test_zeros <- input_zeros[-input_zeros_training_rows, ]
testData <- rbind(test_ones, test_zeros)  # row bind the 1's and 0's
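
The split sizes implied by this code can be checked with a little arithmetic. The class counts used below (616 TOPs and 1,305 non-TOPs) are not stated anywhere in the report; they are inferred from the 1,921 observations, the model's 983 null degrees of freedom, and the confusion matrix totals, so treat them as an assumption. The sketch also assumes that the fractional sample size is truncated.

```r
n_ones  <- 616    # assumed number of rows with isTop == 1 (inferred, not reported)
n_zeros <- 1305   # assumed number of rows with isTop == 0 (1921 - 616)

n_train_each <- floor(0.80 * n_ones)     # same number of 1's and 0's in training
train_n      <- 2 * n_train_each         # balanced training set
test_ones    <- n_ones  - n_train_each   # 1's left over for testing
test_zeros   <- n_zeros - n_train_each   # all remaining 0's go to testing

print(c(train = train_n, test_ones = test_ones, test_zeros = test_zeros))
print(round(n_zeros / n_ones, 2))        # failures per success in the full data
```

Under these assumptions the training set is balanced, while the test set keeps many more failures than successes, which matters when reading the confusion matrix.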

LOGIT MODEL

logitMod <- glm(isTop ~  P12 + Age + LEVEL + REST + DaysSinceTop + LEVLESSTOP + WITHLASIX, data=trainingData, family=binomial(link="logit"))
predicted <- predict(logitMod, testData, type="response")

library(InformationValue)

Model Diagnostics

summary(logitMod)

## 
## Call:
## glm(formula = isTop ~ P12 + Age + LEVEL + REST + DaysSinceTop + 
##     LEVLESSTOP + WITHLASIX, family = binomial(link = "logit"), 
##     data = trainingData)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.50656  -0.86390   0.00222   0.83450   2.56681  
## 
## Coefficients:
##                Estimate Std. Error z value          Pr(>|z|)    
## (Intercept)  -0.6416033  0.6012761  -1.067          0.285941    
## P12EP         0.4527946  0.5315251   0.852          0.394282    
## P12ET        -1.5381124  0.2918994  -5.269 0.000000136927051 ***
## P12EX        -1.4418454  0.3850519  -3.745          0.000181 ***
## P12PE        -0.0281328  0.4333672  -0.065          0.948240    
## P12PP         1.0487447  0.6437564   1.629          0.103291    
## P12PT        -0.3193348  0.3955602  -0.807          0.419495    
## P12PX        -0.6538708  1.0559560  -0.619          0.535770    
## P12TE        -1.6164688  0.2846815  -5.678 0.000000013614724 ***
## P12TP        -1.3146419  0.4252256  -3.092          0.001991 ** 
## P12TT        -2.2479603  0.3116663  -7.213 0.000000000000548 ***
## P12TX        -2.6931280  0.5466808  -4.926 0.000000837900513 ***
## P12XE        -2.0435501  0.5183640  -3.942 0.000080701579383 ***
## P12XP        -0.0400964  0.7456788  -0.054          0.957117    
## P12XT        -3.4468574  0.5537403  -6.225 0.000000000482534 ***
## P12XX        -4.1326697  0.5981503  -6.909 0.000000000004878 ***
## Age          -0.1521894  0.1133295  -1.343          0.179307    
## LEVEL         0.0012973  0.0002490   5.211 0.000000188200759 ***
## REST          0.0085309  0.0017619   4.842 0.000001286031167 ***
## DaysSinceTop -0.0064995  0.0010496  -6.192 0.000000000592595 ***
## LEVLESSTOP    0.0029239  0.0004292   6.812 0.000000000009627 ***
## WITHLASIX     1.5200866  0.3064316   4.961 0.000000702734394 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1364.11  on 983  degrees of freedom
## Residual deviance:  995.43  on 962  degrees of freedom
## AIC: 1039.4
## 
## Number of Fisher Scoring iterations: 5
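
The estimates above are on the log-odds scale, and the P12 coefficients are measured relative to the EE pattern, the factor's first level, which has no row of its own. Exponentiating an estimate gives an odds ratio, which may be easier to interpret. The sketch below does this for three estimates copied from the output; in a live session exp(coef(logitMod)) would give the full set.

```r
# Estimates copied from the summary(logitMod) output above.
log_odds <- c(WITHLASIX = 1.5200866, P12XX = -4.1326697, REST = 0.0085309)

# exp() turns a log-odds coefficient into an odds ratio.
odds_ratios <- exp(log_odds)
print(round(odds_ratios, 3))
```

Read this way, and holding the other variables fixed, the WITHLASIX estimate corresponds to odds of a new TOP roughly 4.6 times higher, while an XX pattern corresponds to odds of roughly 0.016 times those of an EE pattern.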

Concordance(testData$isTop, predicted)

## $Concordance
## [1] 0.847905
## 
## $Discordance
## [1] 0.152095
## 
## $Tied
## [1] -0.00000000000000002775558
## 
## $Pairs
## [1] 100812
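
Concordance is the share of (TOP, non-TOP) pairs in which the horse that actually ran a TOP received the higher predicted probability. The sketch below illustrates the pair-counting idea on made-up numbers; the function name concordance_by_pairs is hypothetical and this is not the InformationValue implementation. The tiny negative Tied value in the output above is floating-point rounding and is effectively zero.

```r
# Hypothetical helper: compare every (1, 0) pair of predictions.
concordance_by_pairs <- function(actual, predicted) {
  p_ones  <- predicted[actual == 1]
  p_zeros <- predicted[actual == 0]
  diffs <- outer(p_ones, p_zeros, "-")   # one cell per (1, 0) pair
  list(concordance = mean(diffs > 0),    # the 1 was ranked above the 0
       discordance = mean(diffs < 0),    # the 0 was ranked above the 1
       tied        = mean(diffs == 0),
       pairs       = length(diffs))
}

# Made-up outcomes and predicted probabilities, for illustration only.
actual_demo    <- c(1, 1, 0, 0, 0)
predicted_demo <- c(0.9, 0.4, 0.5, 0.3, 0.4)
result_demo <- concordance_by_pairs(actual_demo, predicted_demo)
print(result_demo)

# Pair count implied by the column totals of the confusion matrix in this section.
pairs_reported <- (31 + 93) * (643 + 170)
print(pairs_reported)
```

The last line reproduces the Pairs value reported above: the number of actual TOPs in the test set times the number of actual non-TOPs.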

plotROC(testData$isTop, predicted)

confusionMatrix(testData$isTop, predicted, threshold = .55)

##     0  1
## 0 643 31
## 1 170 93

sensitivity(testData$isTop, predicted, threshold = .55)

## [1] 0.75

specificity(testData$isTop, predicted, threshold = .55)

## [1] 0.7908979
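
The sensitivity and specificity reported above follow directly from the confusion matrix counts, whose rows are predicted classes and whose columns are actual classes. The sketch below reproduces them by hand. It also derives the share of predicted TOPs that were actual TOPs, a figure the report does not state but which follows from the same counts.

```r
# Counts copied from the confusion matrix above
# (rows = predicted class, columns = actual class).
cm <- matrix(c(643, 170, 31, 93), nrow = 2,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

sensitivity_by_hand <- cm["1", "1"] / sum(cm[, "1"])  # actual TOPs that were flagged
specificity_by_hand <- cm["0", "0"] / sum(cm[, "0"])  # actual non-TOPs that were cleared
ppv_by_hand         <- cm["1", "1"] / sum(cm["1", ])  # flagged horses that ran a TOP

print(round(c(sensitivity = sensitivity_by_hand,
              specificity = specificity_by_hand,
              ppv         = ppv_by_hand), 4))
```

Because the test set contains far more non-TOPs than TOPs, the last ratio (93 of 263 flagged horses) is much lower than the sensitivity, which is worth keeping in mind when judging how the model might perform in betting.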

Logit Model Observations:

- The model produced roughly symmetric deviance residuals (median 0.002, first and third quartiles -0.864 and 0.835).
- The variables LEVEL, REST, DaysSinceTop, LEVLESSTOP and WITHLASIX all have p-values below 0.001.
- 8 of the 15 estimated P12 2-Race pattern coefficients (EE is the reference level) have p-values below 0.001, and one more (TP) is below 0.01.
- The model has Concordance of 0.848 and AUROC = 0.8476.
- The confusion matrix indicates a sensitivity of 75% at the .55 threshold.
- The confusion matrix indicates a specificity of 79% at the .55 threshold.

Taken together, these statistics suggest that the model fits the data well and discriminates well on the test set.

Part 4 - Inference

We establish the following hypotheses to determine whether the logit model is statistically significant.

H0 - the deviance of the null (intercept-only) model equals the deviance of the logit model with P12 and the other independent variables.

HA - the deviances of the two models are not equal.

We test this by comparing the difference in deviance with a chi-square distribution whose degrees of freedom equal the difference in degrees of freedom between the two models:

1-pchisq(1364.11-995.43,983-962) = 0

The reported p-value of 0 (the true value is too small to display) means we reject the null hypothesis. This provides strong evidence that the model is statistically significant.
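
The test above can be reproduced as a short worked example using the deviances and degrees of freedom from the model summary. Subtracting pchisq from 1 rounds to exactly 0 in floating point, so this sketch asks for the upper tail directly, which shows how small the p-value is.

```r
# Values copied from the summary(logitMod) output.
null_deviance     <- 1364.11
null_df           <- 983
residual_deviance <- 995.43
residual_df       <- 962

chi_sq  <- null_deviance - residual_deviance   # drop in deviance
df_diff <- null_df - residual_df               # number of added coefficients

# Upper-tail probability of a chi-square statistic at least this large.
p_value <- pchisq(chi_sq, df = df_diff, lower.tail = FALSE)
print(c(chi_sq = chi_sq, df = df_diff))
print(p_value)
```

The 21 degrees of freedom correspond to the 15 P12 coefficients plus the six other predictors in the model.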

Part 5 - Conclusion

The LOGIT model developed in this analysis is a promising first step toward my goal of applying data science to horse racing. The model yielded numerous statistics and indicators that support a strong goodness of fit. From a practical standpoint, a sensitivity score in the 70s would give any horse player a significant leg up if the model stood up to real-world application in racetrack betting markets. Before taking that step, however, I believe additional analysis is warranted in the following areas:

- The data set was composed of 2018 Breeders' Cup horses. This is a truly biased sample when you consider that these are the very best horses in the world. A broader sample may prove useful for improving this analysis.
- The 2-Race and 3-Race patterns are factors with many levels. I believe, but have yet to find supporting documentation, that variables of this type require a significantly larger data set so that every possible pattern is captured with a statistically appropriate number of samples. This may explain why some P12 patterns were not significant while others were.
- The current study does not incorporate intra-race competition between horses. Developing variables that capture this dynamic may improve the model's response.
- I believe my Information Value analysis may suffer from an error. As it currently stands, it reports an IV of 0 for several categorical variables (P12, P123 and Trnr), even though the model results suggest that P12 carries predictive information.

References

Logistic Regression: https://r-statistics.co/Logistic-Regression-With-R.html
Understanding the Summary Output for a Logistic Regression in R: https://www.youtube.com/watch?v=xl5dZo_BSJk
Statistics with R: Logistic Regression, Lesson 19 by Courtney Brown: https://www.youtube.com/watch?v=EocjYP5h0cE
https://www.wired.com/2002/03/betting/

Appendix

Description of Key Variables

Level - Median speed figure value
P12 - Two Race Pattern
P12D - Two Race Pattern with more detail, therefore more sparse.
Rest - Days since the previous race
Age - Generally 2 through 7
Gender - Male or Female
Racing Surface - Dirt or Turf
Soundness Flag - Variable that indicates if a horse may have soundness issues (slow start, boring in or out, etc.)
WithLASIX - A value that indicates if the horse received Lasix for the first time within the last four races. (Lasix helps horses breathe better.)
Time weighted move - A move is a race's performance relative to the current TOP.
FVz - The z-score of the horse's past performance figure value
Trainer - The initials of the horse's trainer
Outcome - Result of the race as a category (T, P, E or X).
Outcome1 - More detailed version of Outcome
Move - The integer outcome of the race (the figure relative to the TOP).
LEVLESSTOP - A numeric value that is derived by subtracting the current TOP from the current Level.