Accurate speed figures are a necessity for any serious horse player. Ragozin speed figures have a reputation for being some of the best in the industry. That said, as the popularity of the Ragozin figures increases, their marginal value decreases, because more horse players are using the same data to uncover winning wagers at the race track. Conventional speed figure wisdom says to look for horses who show a slight improvement over the best figure they have ever run (TOP). For example, if a horse has a TOP (best figure ever run) of 10 and then runs a 9.75, this would be an indication that the horse is in great condition and is about to run a big race and a new TOP, say an 8 or 9 (lower is better). Ragozin speed figure players look for this and other patterns with the objective of finding horses that are about to run better than they have ever run before. These horses often represent lucrative betting interests because they are about to do something they have never done before and that the betting public may not be expecting.
The aim of this study was to analyze figure patterns in depth to determine whether logit regression could be used to predict new TOPs. Specifically, I looked at 2- and 3-race speed figure patterns (along with other variables) to see if they can predict new TOPs. My dependent variable (isTop) was therefore coded as 1 if the horse ran a new TOP and 0 if he did not.
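As a rough illustration of how isTop can be derived, the sketch below walks one horse's figures in chronological order, tracks the running best (lowest) figure, and flags a race as a new TOP when its figure beats the prior best. The figure values are the six FgrVle values from the sample rows of the data set (figures appear to be stored as integers scaled by 100). The vector names and the starting TOP of 375 are assumptions made for illustration, and this is not necessarily how the data set itself was built.

```r
# Figures for one horse, oldest race first (lower is better).
fgr <- c(550, 375, 1350, 375, 175, 1500)
start_top <- 375  # assumed best figure before the first race listed

# Best figure going INTO each race: running minimum, shifted by one race.
prior_top <- head(cummin(c(start_top, fgr)), -1)

# Negative move = the horse ran faster than its previous best.
move <- fgr - prior_top
is_top <- as.integer(move < 0)

print(data.frame(fgr, prior_top, move, is_top))
```

Only the fifth race (175 against a prior best of 375) is flagged, which agrees with the Move and isTop columns for those rows.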
The 2-Race and/or 3-Race patterns were coded relative to a horse's current TOP (best speed figure). There are four possible results of a race:
With that as background, here are some sample patterns:
Through exploratory data analysis, the research question for this analysis has been refined to:
RESEARCH QUESTION:
Are 2-race speed figure patterns and other categorical and numerical data (gender, age, racing surface, etc.) employed in a LOGIT model able to predict whether a horse will run a TOP in his next race?
The data was self-collected from Ragozin Speed Figures (http://thesheets.com) for the 2018 Breeders' Cup races run at Churchill Downs on November 3, 2018. The data comprises the past performances of approximately 145 thoroughbred race horses.
# load data
setwd('\\Ragozin\\Files\\csv\\')
inputData <- read.csv(file='ProjectData.csv', stringsAsFactors = T)
The following head and str statements provide insight into the data set:
head(inputData)
##   RecordID Nbr       Hrse Gndr Age Top Outcome Outcome1       Figure
## 1     1350  15 ABELTASMAN    F   4 175       X      X13 15 vsfAWSA30
## 2     1351  14 ABELTASMAN    F   4 175       T       T2  2- wBAWSr25
## 3     1352  13 ABELTASMAN    F   4 375       P       P0   4- wAWBE 9
## 4     1353  12 ABELTASMAN    F   4 375       X      X10 13" vtAWCD 4
## 5     1354  11 ABELTASMAN    F   3 375       P       P0   4- VAWDM 3
## 6     1355  10 ABELTASMAN    F   3 375       E       E2 5" YstAWPX23
##   FgrVle Trnr TrkCde Srfce  Wthr Trble SmrtMny4 isTop TopType Move
## 1   1500  BBt     SA  Dirt Clear  <NA>     <NA>     0    1325 1325
## 2    175  BBt     Sr  Dirt Clear  <NA>     <NA>     1    -200 -200
## 3    375  BBt     BE  Dirt Clear  <NA>     <NA>     0       0    0
## 4   1350  BBt     CD  Dirt Clear   SML     <NA>     0     975  975
## 5    375  BBt     DM  Dirt Clear  <NA>     <NA>     0       0    0
## 6    550  BBt     PX  Dirt Clear   SML     <NA>     0     175  175
##   SoundnessFlag      RACE_DATE REST DaysSinceTop P12  P12D P123   P123D
## 1             1 9/30/2018 0:00   36           36  TP  T2P0  TPX T2P0X10
## 2             0 8/25/2018 0:00   77          441  PX P0X10  PXP P0X10P0
## 3             0  6/9/2018 0:00   36          364  XP X10P0  XPE X10P0E2
## 4             0  5/4/2018 0:00  182          328  PE  P0E2  PEX  P0E2X6
## 5             1 11/3/2017 0:00   41          146  EX  E2X6  EXT  E2X6T1
## 6             0 9/23/2017 0:00   62          105  XT  X6T1  XTE  X6T1E2
##   LEVEL WITHLASIX LEVLESSTOP        FVz
## 1   375         0        200  0.5938993
## 2   550         0        375 -1.3970850
## 3   550         0        175 -1.0965590
## 4   550         0        175  0.3685049
## 5   750         0        375 -1.0965590
## 6   750         0        375 -0.8335988
str(inputData)
## 'data.frame': 1921 obs. of 31 variables:
## $ RecordID : int 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 ...
## $ Nbr : int 15 14 13 12 11 10 9 8 7 6 ...
## $ Hrse : Factor w/ 140 levels "ABELTASMAN","ALMANAAR",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Gndr : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ Age : int 4 4 4 4 3 3 3 3 3 3 ...
## $ Top : int 175 175 375 375 375 375 375 375 500 500 ...
## $ Outcome : Factor w/ 4 levels "E","P","T","X": 4 3 2 4 2 1 4 3 1 1 ...
## $ Outcome1 : Factor w/ 47 levels "E0","E1","E2",..: 25 14 7 22 7 3 40 9 3 5 ...
## $ Figure : Factor w/ 1722 levels "- 1- VwAWBE 9",..: 859 946 1016 790 1013 1078 1340 1019 1712 1333 ...
## $ FgrVle : int 1500 175 375 1350 375 550 925 375 750 925 ...
## $ Trnr : Factor w/ 103 levels "AAn","ADr","AFe",..: 10 10 10 10 10 10 10 10 10 10 ...
## $ TrkCde : Factor w/ 57 levels "Ai","AP","AQ",..: 47 48 4 6 10 45 48 4 6 47 ...
## $ Srfce : Factor w/ 2 levels "Dirt","Turf": 1 1 1 1 1 1 1 1 1 1 ...
## $ Wthr : Factor w/ 5 levels "Big Wind","Clear",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Trble : Factor w/ 2 levels "BIG","SML": NA NA NA 2 NA 2 NA NA 2 NA ...
## $ SmrtMny4 : Factor w/ 1 level "MONY": NA NA NA NA NA NA NA NA NA NA ...
## $ isTop : int 0 1 0 0 0 0 0 1 0 0 ...
## $ TopType : int 1325 -200 0 975 0 175 550 -125 250 425 ...
## $ Move : int 1325 -200 0 975 0 175 550 -125 250 425 ...
## $ SoundnessFlag: int 1 0 0 0 1 0 0 0 0 0 ...
## $ RACE_DATE : Factor w/ 601 levels "1/1/2018 0:00",..: 588 504 398 329 132 568 423 342 330 284 ...
## $ REST : int 36 77 36 182 41 62 43 36 27 35 ...
## $ DaysSinceTop : int 36 441 364 328 146 105 43 98 62 35 ...
## $ P12 : Factor w/ 16 levels "EE","EP","ET",..: 10 8 14 5 4 15 9 1 3 9 ...
## $ P12D : Factor w/ 449 levels "E0E2","E0E3",..: 196 141 278 127 56 374 175 37 91 192 ...
## $ P123 : Factor w/ 63 levels "EEE","EEP","EET",..: 40 30 53 20 15 57 33 3 9 35 ...
## $ P123D : Factor w/ 1201 levels "E0E2E2","E0E2E5",..: 718 553 895 481 208 1053 622 124 371 691 ...
## $ LEVEL : int 375 550 550 550 750 750 750 750 925 950 ...
## $ WITHLASIX : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LEVLESSTOP : int 200 375 175 175 375 375 375 375 425 450 ...
## $ FVz : num 0.594 -1.397 -1.097 0.369 -1.097 ...
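Judging from the rows printed above, the P12 and P123 columns appear to be built by concatenating the Outcome codes of the races that preceded each row, most recent first. The sketch below reproduces that for the first ten rows of the data (all for the same horse, listed newest first). The construction is inferred from the printed output and the variable names are my own, so treat it as an assumption, not the original preparation code.

```r
# Outcome codes for the first ten rows (newest race first), as shown by str().
outcome <- c("X", "T", "P", "X", "P", "E", "X", "T", "E", "E")
n <- length(outcome)

# 2-race pattern: outcomes of the two races run before the current row.
p12 <- rep(NA_character_, n)
i2 <- seq_len(n - 2)
p12[i2] <- paste0(outcome[i2 + 1], outcome[i2 + 2])

# 3-race pattern: outcomes of the three races run before the current row.
p123 <- rep(NA_character_, n)
i3 <- seq_len(n - 3)
p123[i3] <- paste0(outcome[i3 + 1], outcome[i3 + 2], outcome[i3 + 3])

# Rows without enough earlier races stay NA.
print(data.frame(outcome, p12, p123))
```

On real data the same step would have to be applied within each horse, so that a pattern never spans two different horses.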
Please see the Appendix for a description of key variables.
The original project proposal included exploratory data analysis for a variety of independent variables versus race outcomes (T, P, E, X). I have included some of these plots below for the more statistically significant variables and have added an information value analysis.
plot(inputData$Outcome ~ inputData$P12, ylab = "Race Outcome", xlab="2 Race Pattern", main="EDA Plot\n P12 vs Outcome")
This plot is encouraging, as it seems to demonstrate that different patterns yield different outcomes. For example, the XX pattern seems to result in significantly more X outcomes and far fewer T outcomes. Compare this to the EE pattern, which seems to yield few Xs and many Ts.
cdplot(inputData$Outcome ~ inputData$DaysSinceTop, ylab = "Race Outcome", xlab="Days Since Top", main="EDA Plot\n Days Since Top vs Outcome")
The Days Since Top plot also offers some potentially valuable insights and raises some interesting questions. For example, why are there so few TOPs at 600 days compared with 700 days?
cdplot(inputData$Outcome ~ inputData$LEVLESSTOP, ylab = "Race Outcome", xlab="Level Less Top", main="EDA Plot\n LEVLESSTOP vs Outcome")
LEVLESSTOP (LEVEL less TOP) seeks to capture whether, on average, a horse is running figures close to its TOP. The hypothesis is that the closer a horse is running to its TOP, the better its condition and thus the higher its chance of a new TOP.
plot(inputData$Outcome ~ inputData$WITHLASIX, ylab = "Race Outcome", xlab="With Lasix", main="EDA Plot\n WITHLASIX vs Outcome")
Lasix (furosemide) is a medication given to limit bleeding in the lungs during a race, and it is known to cause horses to improve (run new TOPs).
Compute Information Values To Assist With Model Construction
library(smbinning)
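Separate categorical and continuous variables and create the information value data frame iv_df
factor_vars <- c("Gndr", "Trnr", "Srfce", "Wthr", "Trble", "P12", "P123")
# "Trnr" is a factor and is already in factor_vars; listing it here as well gives it two rows in iv_df
continuous_vars <- c("Age", "Trnr", "FgrVle", "SoundnessFlag", "REST", "DaysSinceTop", "LEVEL", "LEVLESSTOP")
all_vars <- c(factor_vars, continuous_vars)
iv_df <- data.frame(VARS=all_vars, IV=numeric(length(all_vars))) # one row per listed variable, IV initialised to 0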
Compute IV for categorical variables
for(factor_var in factor_vars){
  smb <- smbinning.factor(inputData, y="isTop", x=factor_var) # WOE table
  if(!is.character(smb)){ # skip if smbinning returned an error message instead of a table
    iv_df[iv_df$VARS == factor_var, "IV"] <- smb$iv
  }
}
Compute IV for continuous variables
for(continuous_var in continuous_vars){
  smb <- smbinning(inputData, y="isTop", x=continuous_var) # WOE table
  if(!is.character(smb)){ # skip if smbinning returned an error message instead of a table
    iv_df[iv_df$VARS == continuous_var, "IV"] <- smb$iv
  }
}
Sorted Information Values
iv_df <- iv_df[order(-iv_df$IV), ] # sort
iv_df
## VARS IV
## 8 Age 0.6611
## 14 LEVEL 0.6533
## 13 DaysSinceTop 0.3570
## 10 FgrVle 0.0728
## 4 Wthr 0.0088
## 1 Gndr 0.0044
## 5 Trble 0.0028
## 3 Srfce 0.0021
## 2 Trnr 0.0000
## 6 P12 0.0000
## 7 P123 0.0000
## 9 Trnr 0.0000
## 11 SoundnessFlag 0.0000
## 12 REST 0.0000
## 15 LEVLESSTOP 0.0000
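For readers unfamiliar with information value, the sketch below computes it by hand for a categorical predictor using the usual weight-of-evidence definition: for each level, WOE = ln(share of 1's / share of 0's), and IV is the sum over levels of (share of 1's - share of 0's) x WOE. The function name iv_factor and the toy data are hypothetical. The sketch is base R only, and it is not the smbinning implementation, which also applies its own binning and eligibility rules.

```r
# Hand-rolled information value for a categorical predictor x
# against a 0/1 response y (both classes must be present).
iv_factor <- function(x, y) {
  tab <- table(x, y)                     # counts per level and class
  pct_1 <- tab[, "1"] / sum(tab[, "1"])  # share of all 1's in each level
  pct_0 <- tab[, "0"] / sum(tab[, "0"])  # share of all 0's in each level
  keep <- pct_1 > 0 & pct_0 > 0          # WOE is undefined for empty cells; skip them
  woe <- log(pct_1[keep] / pct_0[keep])
  sum((pct_1[keep] - pct_0[keep]) * woe)
}

# Toy data: level "A" is mostly 1's, level "B" is mostly 0's.
x <- c("A", "A", "A", "A", "B", "B", "B", "B")
y <- c(1, 1, 1, 0, 1, 0, 0, 0)
iv_toy <- iv_factor(x, y)
print(iv_toy)  # equals log(3) for this toy example
```

Running a check like this on a variable such as P12 is one way to see whether an IV of exactly 0 reflects the data or a variable that was skipped.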
isTop ~ P12 + Age + LEVEL + REST + DaysSinceTop + LEVLESSTOP + WITHLASIX
Prior to fitting the model, training and test data sets were created. The training set was built to address a class bias identified in the data (roughly twice as many failures as successes) by sampling an equal number of 1's and 0's. The remaining rows form the test set.
input_ones <- inputData[which(inputData$isTop == 1), ] # all 1's
input_zeros <- inputData[which(inputData$isTop == 0), ] # all 0's
set.seed(100) # for repeatability of samples
n_train <- floor(0.80*nrow(input_ones)) # training rows per class (80% of the 1's)
input_ones_training_rows <- sample(1:nrow(input_ones), n_train) # 1's for training
input_zeros_training_rows <- sample(1:nrow(input_zeros), n_train) # 0's for training. Pick as many 0's as 1's
training_ones <- input_ones[input_ones_training_rows, ]
training_zeros <- input_zeros[input_zeros_training_rows, ]
trainingData <- rbind(training_ones, training_zeros) # row bind the 1's and 0's
Create Test Data
test_ones <- input_ones[-input_ones_training_rows, ]
test_zeros <- input_zeros[-input_zeros_training_rows, ]
testData <- rbind(test_ones, test_zeros) # row bind the 1's and 0's
logitMod <- glm(isTop ~ P12 + Age + LEVEL + REST + DaysSinceTop + LEVLESSTOP + WITHLASIX, data=trainingData, family=binomial(link="logit"))
predicted <- predict(logitMod, testData, type="response")
library(InformationValue)
Model Diagnostics
summary(logitMod)
##
## Call:
## glm(formula = isTop ~ P12 + Age + LEVEL + REST + DaysSinceTop +
## LEVLESSTOP + WITHLASIX, family = binomial(link = "logit"),
## data = trainingData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.50656 -0.86390 0.00222 0.83450 2.56681
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.6416033 0.6012761 -1.067 0.285941
## P12EP 0.4527946 0.5315251 0.852 0.394282
## P12ET -1.5381124 0.2918994 -5.269 0.000000136927051 ***
## P12EX -1.4418454 0.3850519 -3.745 0.000181 ***
## P12PE -0.0281328 0.4333672 -0.065 0.948240
## P12PP 1.0487447 0.6437564 1.629 0.103291
## P12PT -0.3193348 0.3955602 -0.807 0.419495
## P12PX -0.6538708 1.0559560 -0.619 0.535770
## P12TE -1.6164688 0.2846815 -5.678 0.000000013614724 ***
## P12TP -1.3146419 0.4252256 -3.092 0.001991 **
## P12TT -2.2479603 0.3116663 -7.213 0.000000000000548 ***
## P12TX -2.6931280 0.5466808 -4.926 0.000000837900513 ***
## P12XE -2.0435501 0.5183640 -3.942 0.000080701579383 ***
## P12XP -0.0400964 0.7456788 -0.054 0.957117
## P12XT -3.4468574 0.5537403 -6.225 0.000000000482534 ***
## P12XX -4.1326697 0.5981503 -6.909 0.000000000004878 ***
## Age -0.1521894 0.1133295 -1.343 0.179307
## LEVEL 0.0012973 0.0002490 5.211 0.000000188200759 ***
## REST 0.0085309 0.0017619 4.842 0.000001286031167 ***
## DaysSinceTop -0.0064995 0.0010496 -6.192 0.000000000592595 ***
## LEVLESSTOP 0.0029239 0.0004292 6.812 0.000000000009627 ***
## WITHLASIX 1.5200866 0.3064316 4.961 0.000000702734394 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1364.11 on 983 degrees of freedom
## Residual deviance: 995.43 on 962 degrees of freedom
## AIC: 1039.4
##
## Number of Fisher Scoring iterations: 5
Concordance(testData$isTop, predicted)
## $Concordance
## [1] 0.847905
##
## $Discordance
## [1] 0.152095
##
## $Tied
## [1] -0.00000000000000002775558
##
## $Pairs
## [1] 100812
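The concordance statistic can be read as follows: take every pair made of one actual 1 and one actual 0, and count the share of pairs in which the 1 received the higher predicted probability. The sketch below spells that out in base R on toy values. The function name concordance_sketch and the toy vectors are hypothetical, and this is my reading of the statistic, not the InformationValue source code.

```r
# Share of (actual 1, actual 0) pairs in which the 1 gets the higher score.
concordance_sketch <- function(actual, predicted) {
  p1 <- predicted[actual == 1]
  p0 <- predicted[actual == 0]
  diffs <- outer(p1, p0, "-")  # one cell per (1, 0) pair
  list(concordance = mean(diffs > 0),
       discordance = mean(diffs < 0),
       tied        = mean(diffs == 0),
       pairs       = length(diffs))
}

# Toy example: two 1's and two 0's give four pairs, three of them concordant.
res <- concordance_sketch(actual    = c(1, 1, 0, 0),
                          predicted = c(0.9, 0.4, 0.5, 0.1))
print(res)
```

The pair count reported above is consistent with this reading: the test set holds 124 actual 1's and 813 actual 0's, and 124 x 813 = 100812.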
plotROC(testData$isTop, predicted)
confusionMatrix(testData$isTop, predicted, threshold = .55)
## 0 1
## 0 643 31
## 1 170 93
sensitivity(testData$isTop, predicted, threshold = .55)
## [1] 0.75
specificity(testData$isTop, predicted, threshold = .55)
## [1] 0.7908979
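The sensitivity and specificity values can be recomputed directly from the confusion matrix, which makes the layout of the matrix explicit: columns are actual classes and rows are predicted classes (the only orientation that reproduces the two values printed above). The sketch below does the arithmetic in base R. The object names are my own, and the precision line is an extra quantity added for context, not part of the original output.

```r
# Confusion matrix at the .55 threshold: rows = predicted, columns = actual.
cm <- matrix(c(643, 170, 31, 93), nrow = 2,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

sens <- cm["1", "1"] / sum(cm[, "1"])  # actual TOPs the model caught: 93 / 124
spec <- cm["0", "0"] / sum(cm[, "0"])  # actual non-TOPs correctly rejected: 643 / 813
prec <- cm["1", "1"] / sum(cm["1", ])  # predicted TOPs that were real TOPs: 93 / 263

print(round(c(sensitivity = sens, specificity = spec, precision = prec), 4))
```

Because the test set keeps the natural class imbalance, the precision figure is worth reading alongside sensitivity when judging how the model might perform as a betting aid.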
Logit Model Observations:
- The model produced roughly symmetric deviance residuals (median near 0, quartiles of -0.86 and 0.83).
- The variables LEVEL, REST, DaysSinceTop, LEVLESSTOP and WITHLASIX all had p-values below 0.001; Age was not significant (p = 0.18).
- Of the 15 P12 2-race pattern levels estimated (EE is the reference level), 8 had p-values below 0.001 and one more (TP) had a p-value below 0.01.
- The model has a Concordance of 0.848 and AUROC = 0.8476.
- The confusion matrix indicates a sensitivity of 75% at the .55 threshold.
- The confusion matrix indicates a specificity of 79% at the .55 threshold.
The model has produced numerous statistics and indicators suggesting a strong fit.
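Logit coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to interpret. The sketch below does this for three estimates copied from the summary output. The vector is typed in by hand so the example runs on its own; with the fitted model available, exp(coef(logitMod)) would give the full set.

```r
# Selected estimates copied from summary(logitMod).
coefs <- c(WITHLASIX = 1.5200866, P12XX = -4.1326697, REST = 0.0085309)

# exp(coefficient) = multiplicative change in the odds of a new TOP.
odds_ratios <- exp(coefs)
print(round(odds_ratios, 4))
```

Read this way, WITHLASIX multiplies the odds of a new TOP by roughly 4.6 with the other variables held fixed, while the XX pattern cuts the odds to under 2% of those for the EE reference pattern.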
We establish the following hypotheses to determine if the logit model is statistically significant.
H0 - the deviance of the null model = the deviance of the logit model with P12 and the other independent variables.
HA - the deviances of the two models are not equal.
We test this by comparing the difference in deviance to a chi-square distribution with degrees of freedom equal to the difference in residual degrees of freedom:
1-pchisq(1364.11-995.43,983-962) = 0
The p-value is effectively 0 (too small to display), so we reject the null hypothesis. This provides strong evidence that the model is statistically significant.
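The same test can be written so that the p-value is not rounded to exactly 0: 1 - pchisq(...) loses precision once the upper tail is smaller than machine epsilon, whereas asking pchisq for the upper tail directly does not. The sketch below uses the deviances and degrees of freedom from the summary output. The variable names are my own, and with the fitted model available the four inputs could be read from logitMod$null.deviance, logitMod$deviance, logitMod$df.null and logitMod$df.residual.

```r
# Values copied from summary(logitMod).
null_dev  <- 1364.11
resid_dev <- 995.43
null_df   <- 983
resid_df  <- 962

lr_stat <- null_dev - resid_dev  # drop in deviance from adding the predictors
lr_df   <- null_df - resid_df    # number of parameters added

# Upper-tail probability computed directly, avoiding 1 - (value very close to 1).
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(statistic = lr_stat, df = lr_df, p_value = p_value))
```

The result is a tiny but non-zero p-value, which supports the same conclusion as the calculation above.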
The LOGIT model developed in this analysis is a promising first step toward my goal of applying data science to horse racing. The model yielded numerous statistics and indicators that support a strong goodness of fit. From a practical standpoint, a sensitivity score in the 70s would give any horse player a significant leg up if the model stood up to real-world application in racetrack betting markets. Before taking that step, however, I believe additional analysis is warranted in the following areas:
The data set comprised 2018 Breeders' Cup horses. This is a truly biased sample when you consider that these are among the very best horses in the world. A broader sample may prove useful in improving this analysis.
The 2-Race and 3-Race patterns are factors with many levels. I believe, but have yet to find supporting documentation, that variables of this type require a significantly larger data set so that all possible patterns are captured with a statistically appropriate number of samples for each pattern. This may explain why some P12 patterns were not significant while others were.
The current study does not incorporate intra-race competition between horses. Developing variables that capture this dynamic may improve the model’s response.
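I believe my Information Value analysis may suffer from an error. As it currently stands, it reports an IV of 0 for several variables, including the categorical pattern variables P12 and P123, even though these variables appear to carry information. One possibility worth checking is that smbinning.factor declines to bin factors with more levels than its maxcat argument allows (10 by default) and returns a message string instead of a table, which the loop then leaves at 0.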
Description of Key Variables