Introduction

Ever since I was a teenager, I’ve enjoyed gambling, and horseracing in particular. I grew up in Saratoga Springs, home to one of the most prestigious and renowned race tracks in the country, if not the most. The excitement is palpable at the track, with tourists, hardened gamblers, and fans of the sport all cheering on their selections as they come down the stretch. I am a fan of the sport in two capacities. Firstly, I’m a fan the way anyone else is a fan of baseball or football. The great horses of my era were never great investments, but they have memorable wins that elicit nostalgia and excitement. Secondly, I am a fan as a gambler. Horseracing and poker are the two main gambling games that are mathematically possible to beat. They represent the dream that if your skill eclipses the house edge, then you’re not simply throwing your money away, but are potentially making money. Of course, it’s no small feat to beat the house edge in either game, but for the past few years, that’s exactly what I’ve been trying to do.

My model started as a very naive attempt to beat the 2013 Kentucky Derby. I input some data from the previous two Kentucky Derbies, by hand, and performed a linear regression on finish position. The horse I predicted to win, Palace Malice, came in 12th place. Interestingly enough, he led for half the race before tiring, and went on to be a very successful older horse. (The Kentucky Derby is a race for three-year-olds, roughly the equivalent of an Olympic-age athlete for humans. An ‘older horse’ is a horse that is four or older, and is considered fully adult.) Despite my incompetence, I found the exercise enjoyable, and I thought that I was surely not the first person to attempt this task. After some searching, I came across an article that seemed to be exactly what I was looking for. The author created a model for racing that identified value, and used statistics that I could understand. I still have this paper three-hole-punched, in a binder, with the sections separated by tabs. I even wrote a table of contents for it. During 2013, this was the most formative influence on my model.

The first breakthrough I had was realizing that I didn’t have to input data by hand; I could just buy it. So with Dayes’ thesis as my guide, I purchased some data about horses that ran at Aqueduct, a track in New York City that runs winter meets. My plan was to get my model ready and play the Aqueduct meet for the 2013-2014 season. I ended up with a linear regression model in Excel that had an R^2 of about .07 and took an hour every day to deploy on fresh data. Despite being very clunky and subject to accidental deletion and overwriting, I actually did OK during the Aqueduct meet. I lost money, of course, but my ROI was about $0.94 per dollar bet. The house cut in horseracing varies between 15% and 20%, so I was quite pleased with myself, and I knew that with some adjustments, I would have something valuable. Since then, I’ve continued to develop my model, as well as my process, and it’s become something worthy of documentation.

Preparing the Data

Past performances

This data is available before each race and includes information like how fast each horse has run in the past, how much money they’ve won, and other related data. These files are identical to what I would get before going to the track on a given day, so this is where all of the predictive features will come from. The data comes in four file types: .DRF files are one record per race and contain data about each race such as distance, surface, and purse; .DR2 files are one record per horse and contain summary information about each horse such as post position, jockey, and trainer; .DR3 files are one record per previous start for each horse and contain information about each previous start such as finish position and speed; .DR4 files are one record per horse and contain proprietary statistics from BRIS, the most important of which is their power number, an overall estimate of each horse’s strength.

I have the data archived in their .DR* file formats, but I have additional data stored in .csv format from earlier iterations of the model that could not handle the native file extensions. For the .csv files, the first character of the file name is a 1, 2, 3, or 4, indicating which of the four file types it corresponds to. I will use regular expressions and the plyr package to import these files and build my dataset.

library(plyr)
library(dplyr)
library(stringr)
library(caret)

#First the .DR* files
setwd("C:/Users/Coby/Documents/DailyFiles/Aggregation of Files/DR_type")

#Race data from .DRF
temp = list.files(pattern = ".DRF$")
myfiles = lapply(temp, read.csv, header = FALSE)
RaceData1 = ldply(myfiles, data.frame)
RaceData1 = RaceData1[,c(1,2,3,4,5,7, 23)]
names(RaceData1) = c("Track", "RDate", "RNum", "Dist","Surface","RType", "PostTime")

#Horse data from .DR2
temp = list.files(pattern = ".DR2$")
myfiles = lapply(temp, read.csv, header = FALSE)
myfiles2 = lapply(temp, read.csv, header = FALSE, quote = "")
HorseData1 = ldply(myfiles, data.frame)
program = ldply(myfiles2, data.frame)
HorseData1 = HorseData1[,c(1,2,3,4,19,21,24,25,41,43,64,65,69,71,75,189,201,202)]
names(HorseData1) = c("Track", "RDate", "RNum", "PP", "TrainRecordCurYear", "JockRecordCurYear", "HName", "YOB", "Medication", 
                     "EquipChange", "Year", "StartsYear", "EarnYear", "StartsPrev", "EarnPrev", "BRISRunstyle", "ProgramNum", "MLOdds")
x = str_replace(program[,201], "[:punct:]", "")
HorseData1$ProgramNum = str_replace(x, "[:punct:]", "")

#Past performance data from .DR3
temp = list.files(pattern = ".DR3$")
myfiles = lapply(temp, read.csv, header = FALSE)
PastPerf1 = ldply(myfiles, data.frame)
PastPerf1 = PastPerf1[,c(1,2,3,4,5,10,11,14,35,41,53,56,57,61,64,84)]
names(PastPerf1) = c("Track", "RDate", "RNum", "PP", "PrevRDate", "PrevCondition", "PrevDist","NumHor", "Purse", "FinishPos", 
                     "BLMargin", "BRIS2fPace", "BRIS4fPace","BRISLatePace", "BRISSpeed", "ClaimedFlag")

#BRIS files from .DR4
temp = list.files(pattern = ".DR4$")
myfiles = lapply(temp, read.csv, header = FALSE)
BRISStats1 = ldply(myfiles, data.frame)
BRISStats1 = BRISStats1[,c(1,2,3,4,5,10)]
names(BRISStats1) = c("Track", "RDate", "RNum", "PP", "PowerRating", "BestSpeedLife")


#Next, the .csv files
setwd("C:/Users/Coby/Documents/DailyFiles/Aggregation of Files/1_4_csv")

#Race data from .csv
temp = list.files(pattern = "^1")
myfiles = lapply(temp, read.csv, header = FALSE)
RaceData2 = ldply(myfiles, data.frame)
RaceData2 = RaceData2[,c(1,2,3,4,5,7, 23)]
names(RaceData2) = c("Track", "RDate", "RNum", "Dist","Surface","RType", "PostTime")

#Horse Data from .csv
temp = list.files(pattern = "^2")
myfiles = lapply(temp, read.csv, header = FALSE)
myfiles2 = lapply(temp, read.csv, header = FALSE, quote = "")
HorseData2 = ldply(myfiles, data.frame)
program = ldply(myfiles2, data.frame)
HorseData2 = HorseData2[,c(1,2,3,4,19,21,24,25,41,43,64,65,69,71,75,189,201,202)]
names(HorseData2) = c("Track", "RDate", "RNum", "PP", "TrainRecordCurYear", "JockRecordCurYear", "HName", "YOB", "Medication", 
                     "EquipChange", "Year", "StartsYear", "EarnYear", "StartsPrev", "EarnPrev", "BRISRunstyle", "ProgramNum", "MLOdds")
x = str_replace(program[,201], "[:punct:]", "")
HorseData2$ProgramNum = str_replace(x, "[:punct:]", "")

#Past performance data from .csv
temp = list.files(pattern = "^3")
myfiles = lapply(temp, read.csv, header = FALSE)
PastPerf2 = ldply(myfiles, data.frame)
PastPerf2 = PastPerf2[,c(1,2,3,4,5,10,11,14,35,41,53,56,57,61,64,84)]
names(PastPerf2) = c("Track", "RDate", "RNum", "PP", "PrevRDate", "PrevCondition", "PrevDist","NumHor", "Purse", "FinishPos", 
                     "BLMargin", "BRIS2fPace", "BRIS4fPace","BRISLatePace", "BRISSpeed", "ClaimedFlag")

#BRIS files from .csv
temp = list.files(pattern = "^4")
myfiles = lapply(temp, read.csv, header = FALSE)
BRISStats2 = ldply(myfiles, data.frame)
BRISStats2 = BRISStats2[,c(1,2,3,4,5,10)]
names(BRISStats2) = c("Track", "RDate", "RNum", "PP", "PowerRating", "BestSpeedLife")


#Putting them together

#Unique to ensure no duplicates

RaceData = unique(rbind(RaceData1, RaceData2))
HorseData = unique(rbind(HorseData1, HorseData2))
PastPerf = unique(rbind(PastPerf1, PastPerf2))
BRISStats = unique(rbind(BRISStats1, BRISStats2))

#Clean up our workspace

rm(RaceData1, RaceData2, HorseData1, HorseData2, PastPerf1, PastPerf2, BRISStats1, BRISStats2, temp, myfiles, myfiles2, program, x)

ls()

Results

This is data that is available after each race. These files contain the finish order, the final odds, and the final lineup of horses for each race. In the data available pre-race, I can see which horses are signed up to run and an estimated odds line called the ‘morning line odds’. When the race goes off, some horses that were scheduled to run may end up scratching. Horses may scratch for a variety of reasons, including health issues, inclement weather, or simply nerves, and this is at the trainer’s or owner’s discretion.

Like the pre-race data, the results come in different file formats. One is the .RES format, a single file with all of the data I use. The other is a series of files with more data about other aspects of the race; this data is in .1 and .2 files. For my purposes today, both results formats have the same information: final odds, finish position, and beaten lengths. I will perform a very similar routine on these files as I did for the pre-race data.

#First the .RES data
setwd("C:/Users/Coby/Documents/DailyFiles/Aggregation of Files/RES_type")
temp = list.files(pattern = ".RES$")
myfiles = lapply(temp, read.csv, header = FALSE)
Results = ldply(myfiles, data.frame)
Results = Results[,c(1,2,3,4,5,11,12,14,16,20,21,22)]
names(Results) = c('Track', 'RDate', 'RNum', 'Dist', 'Surface', 'FTime', 
                   'TrackCondition', 'PP', 'HName', 'FinishPos', 'finishBL', 'FOdds')

#Then the .1 and .2
setwd("C:/Users/Coby/Documents/DailyFiles/Aggregation of Files/1_6_res")
temp = list.files(pattern = "\\.1$")
myfiles = lapply(temp, read.csv, header = FALSE)
resRace = ldply(myfiles, data.frame)
resRace = resRace[,c(1,2,3,5,9,44,38)]
names(resRace) = c('Track', 'RDate', 'RNum', 'Dist', 'Surface', 'FTime', 'TrackCondition')

temp = list.files(pattern = "\\.2$")
myfiles = lapply(temp, read.csv, header = FALSE)
resHorse = ldply(myfiles, data.frame)
resHorse = resHorse[,c(1,2,3,5,8,31,60,73)]
names(resHorse) = c('Track', 'RDate', 'RNum', 'HName' ,'PP','FOdds', 'FinishPos', 'finishBL')


#Putting it together
res = merge(resRace, resHorse)
Results = unique(rbind(Results, res[,c('Track', 'RDate', 'RNum', 'Dist', 'Surface', 'FTime', 
                                   'TrackCondition', 'PP', 'HName', 'FinishPos', 'finishBL', 'FOdds')]))

rm(res, resHorse, resRace, myfiles, temp)

ls()

Building Features

Now that my raw data is imported, I can begin cleaning it, calculating new fields, and preparing it for regression.

Processing each data set

I will begin by calculating some fields on my data sets to help quantify some of the raw data.

# Race data
RaceData$PostTime = gsub("[\\(, \\), \\/]", "", substr(RaceData$PostTime, 1, 7))

#Horse data

##Trainer win percentage
HorseData$TrainStartCurYear = as.numeric(substr(HorseData$TrainRecordCurYear, 1, 4))
HorseData$TrainWinCurYear = as.numeric(substr(HorseData$TrainRecordCurYear, 5, 8))
HorseData$TrainWinPct = HorseData$TrainWinCurYear/(HorseData$TrainStartCurYear + 1) # +1 eliminates undefined

##Jockey win percentage
HorseData$JocStartCurYear = as.numeric(substr(HorseData$JockRecordCurYear, 1, 4))
HorseData$JocWinCurYear = as.numeric(substr(HorseData$JockRecordCurYear, 5, 8))
HorseData$JockWinPct = HorseData$JocWinCurYear/(HorseData$JocStartCurYear + 1)

##Horse earnings
HorseData$earnPer = (HorseData$EarnYear + HorseData$EarnPrev)/(HorseData$StartsYear+HorseData$StartsPrev)
HorseData$logEarn = log(HorseData$earnPer + 1)

##Equipment changes
HorseData$addBlinks = ifelse(HorseData$EquipChange == 1, 1, 0)
HorseData$remBlinks = ifelse(HorseData$EquipChange == 2, 1, 0)

HorseData = select(HorseData, Track, RDate, RNum, PP, HName, BRISRunstyle, TrainWinPct, JockWinPct, logEarn, addBlinks, remBlinks, MLOdds,
                   ProgramNum)

#Past performance data

##Find days since previous race
PastPerf$RDateD = as.Date(as.character(PastPerf$RDate), "%Y%m%d")
PastPerf$PrevDateD = as.Date(as.character(PastPerf$PrevRDate), "%Y%m%d")
PastPerf$DaysOld = PastPerf$RDateD - PastPerf$PrevDateD

##Finish position is messy, including records for horses that didn't finish.  
##We will say in these cases that they simply finished last.
##If a horse's finish position is not a number, we can assume that they didn't finish.
##I am not interested in the reason for not finishing.
PastPerf$FinishPos = ifelse(is.na(as.numeric(as.character(PastPerf$FinishPos))), 
                            PastPerf$NumHor, 
                            as.numeric(as.character(PastPerf$FinishPos)))
PastPerf$FinishPos = ifelse(PastPerf$FinishPos %in% c(0,89,92,99), PastPerf$NumHor, PastPerf$FinishPos)

##The field BLMargin is either beaten lengths or margin of victory, depending on if the horse won or not. 
PastPerf$BL = as.numeric(ifelse(PastPerf$FinishPos == 1, 0, PastPerf$BLMargin))
PastPerf$Margin = as.numeric(ifelse(PastPerf$FinishPos == 1, PastPerf$BLMargin, 0))

##First interesting variable.  I rate past races based on their finish percentile, purse, and how much they lost/won by.
##Further work could be done to optimize this variable.  I did not consider the weights and factors very much when creating this variable.
PastPerf$raceRating = sqrt((1 - PastPerf$FinishPos/PastPerf$NumHor)*(log(PastPerf$Purse)*(log(1+PastPerf$Margin)+log(1+PastPerf$BL))))
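##Illustrative example (hypothetical numbers): a horse finishing 3rd of 10, beaten 4 lengths in a
##$25,000 purse race, rates sqrt((1 - 3/10)*(log(25000)*(log(1+0) + log(1+4)))) = sqrt(0.7*16.3) ~ 3.4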

#In order to use the past performance data, I must condense the features into 1-per-horse variables.
#I chose to summarise different variables over different time periods.
#I chose 120 days to be 'recent', and use that time period for form information.
#I chose 365 days to be a good barometer of the horse's ability, whether or not it is in good form.
#I chose 730 days to be the furthest back it is worth going for a horse, and use this time frame for class data.

##subset just races in the past 120 days for recency
PastPerf[is.na(PastPerf)] = 0
pastPerf120 = filter(PastPerf, DaysOld <= 120)
pastPerf120 = group_by(pastPerf120, Track, RDate, RNum, PP)
pastPerf120 = summarise(pastPerf120, countRaces120 = length(Track), avgBL = mean(BL), avgRaceRating = mean(raceRating))

##Subset just races in past 365 days
pastPerf365 = filter(PastPerf, DaysOld <= 365, !is.na(BRISSpeed))
pastPerf365 = group_by(pastPerf365, Track, RDate, RNum, PP)
pastPerf365 = summarise(pastPerf365, avgSpeed = mean(BRISSpeed), avg2FSpeed = mean(BRIS2fPace), 
                        avg4FSpeed = mean(BRIS4fPace), avgLateSpeed = mean(BRISLatePace),
                        turfRating = (1-(length(PrevRDate) - sum(PrevCondition %in% c('FM', 'YL')))/length(PrevRDate))*
                          sum(PrevCondition %in% c('FM', 'YL') & FinishPos <=3)/sum(PrevCondition %in% c('FM', 'YL')),
                        avgRaceRating365 = mean(raceRating))

##subset just races in past 730 days
pastPerf730 = filter(PastPerf, DaysOld <= 730)
pastPerf730 = group_by(pastPerf730, Track, RDate, RNum, PP)
pastPerf730 = summarise(pastPerf730, logPurse = log(mean(Purse)), highestLevel = log(max(ifelse(FinishPos == 1, Purse, 0))+1))


##Combine all the past Perf data
PastPerf = left_join(pastPerf730, pastPerf365)
PastPerf = left_join(PastPerf, pastPerf120)
PastPerf[is.na(PastPerf)] = 0
PastPerf$countsq = PastPerf$countRaces120*PastPerf$countRaces120

Combining predictors

Now I will combine the predictor data into one big data set.

# Putting it together
h = inner_join(HorseData, BRISStats)
h = inner_join(h, RaceData)
h = left_join(h, PastPerf)
h = unique(h)
h[is.na(h)] = 0
h$Track = str_trim(h$Track)

Calculating additional fields and accounting for in-race competition

A main issue with predicting the outcome of horseraces is dealing with the fact that each entry is competing only against the other horses in the same race. This issue shows up when you consider the difference between a low class race with a purse of $10,000 and a high class race with a purse of $500,000. The winner of the low class race might have run many times before with speed figures in the 70’s, perhaps earning about $1,000 per start. The winner of the high class race, however, is more likely to be running speed figures in the 90’s and earning at least $25,000 per start. When I train a model, all it will see is (70, $1,000, winner) and (90, $25,000, winner). It’s necessary to prepare the data so that the model can evaluate the relative strength of each horse. To fix this problem, I simply create fields that measure each horse’s ability, across a number of stats, relative to that of their race mates.

i = group_by(h, Track, RDate, RNum)
j = summarise(i, meanEarn = mean(logEarn), meanPurse = mean(logPurse), 
              meanSpeed = mean(avgSpeed), meanES = mean(avg2FSpeed), mean4f = mean(avg4FSpeed),  meanLS = mean(avgLateSpeed), 
              numHor = length(PP), meanPower = mean(PowerRating), meanTurfRat = mean(turfRating), maxLevel = max(highestLevel), 
              meanRaceRat365 = mean(avgRaceRating365), meanRaceRat = mean(avgRaceRating), meanBL = mean(avgBL))
h = inner_join(i, j)

h$compTurfRat = ifelse(h$meanTurfRat == 0, 0, h$turfRating - h$meanTurfRat)
h$compRaceRat = ifelse(h$meanRaceRat == 0, 0, h$avgRaceRating - h$meanRaceRat)
h$compRaceRat365 = ifelse(h$meanRaceRat365 == 0, 0, h$avgRaceRating365 - h$meanRaceRat365)
h$compLevel = ifelse(h$maxLevel == 0, 0, h$highestLevel - h$maxLevel)
h$compBL = ifelse(h$meanBL == 0, 0, h$avgBL - h$meanBL)
h$compEarn = ifelse(h$logEarn == 0, 0, h$logEarn - h$meanEarn)
h$compPurse = ifelse(h$logPurse == 0, 0, h$logPurse- h$meanPurse)
h$compSpeed = ifelse(h$avgSpeed == 0, 0, h$avgSpeed - h$meanSpeed)
h$compES = ifelse(h$avg2FSpeed == 0, 0, h$avg2FSpeed - h$meanES)
h$comp4f = ifelse(h$avg4FSpeed == 0, 0, h$avg4FSpeed - h$mean4f)
h$compLS = ifelse(h$avgLateSpeed == 0, 0, h$avgLateSpeed - h$meanLS)
h$compPower = ifelse(h$PowerRating == 0, 0, h$PowerRating - h$meanPower)
h$raceID = factor(paste(h$Track, h$RDate, h$RNum, sep = ""))



rm(BRISStats, PastPerf, HorseData, RaceData, i, j, pastPerf730, pastPerf365, pastPerf120)

ls()

Incorporating results

Now that I have a pretty good set of predictors, I will bring in the results data to get ready to model. Unfortunately, there is not a good ID field to match pre-race and results data. The best I can do is a combination of track, date, race number, and horse name. The would-be unique ID is track, date, race number, and horse number. The issue is that the results file does not have the ‘program number’, which wouldn’t change over time, but the ‘post position’, which will change if horses scratch. If a horse is the #5, the 5 refers to its program number, i.e. the number you bet on. If the #4 horse doesn’t run, though, the #5 will shift in and run from post position 4.

h$Track = str_trim(as.character(h$Track))
Results$Track = str_trim(as.character(Results$Track))
h$RDate = as.character(h$RDate)
Results$RDate = as.character(Results$RDate)
h$HName = toupper(h$HName)
Results$HName = toupper(Results$HName)
Results$HName = str_trim(gsub("\\(.*$", "", Results$HName ))
h = inner_join(h,Results, by = c("Track", "RNum", "RDate", "HName"))

rm(Results)

ls()

Final preparations

The last thing I want to do before training my model is filter out maiden races and bad data, and break the data out into training and test sets for validation. It’s important that I separate training and test by race, not at the horse level. This will come into play later, in a simulation. For the time being, I will also only train on dirt sprints, i.e. races under a mile run on a dirt track. This subset of races has proved to be easier to model than other types, and is the only segment on which I have achieved profitability. My hope is that with more data, I will be able to successfully model different types of races.

h = filter(h, FinishPos != 92)
h = filter(h, finishBL < 40)
h = filter(h, FOdds > 0)
h = filter(h, !(RType %in% c('S', 'M', 'MO')))
h[is.na(h)] = 0
#Surface values of 'T' (turf) can get read in as logical TRUE by read.csv; convert them back to 'T'
h$Surface.x = ifelse(as.character(h$Surface.x) == 'TRUE', 'T', as.character(h$Surface.x))
h$Dist.x = abs(h$Dist.x)
h=filter(h, Dist.x >= 1100)

ds = filter(h, Surface.x %in% c('d', 'D') & Dist.x < 1760)

# ##Perform principle component analysis on speed and class variables
# speedPC = prcomp(select(h, compSpeed, compES, comp4f, compLS)[4:7])
# h$speedPC1 = speedPC$x[,1]
# h$speedPC2 = speedPC$x[,2]
# h$speedPC3 = speedPC$x[,3]
# 
# classPC = prcomp(select(h, compEarn, compPurse, compLevel, compRaceRat365)[4:7])
# h$classPC = classPC$x[,1]


##########
trainRaces = sample(unique(ds$raceID), size = floor(length(unique(ds$raceID))*.7))

dsTrain = filter(ds, raceID %in% trainRaces)
dsTest = filter(ds, !(raceID %in% trainRaces))

Training the Model

Now that the hard work is taken care of, I can finally get to the prediction. My system works in two steps. First, I train a model to predict the beaten lengths of each horse; the lower the number, the better the horse. I predict this on a log scale, because the difference between losing by one length and two lengths is much larger than the difference between 11 lengths and 12. The second step is to simulate the race. I use a Monte Carlo technique to simulate the normal variations in performance between my model prediction and what might happen over thousands of races. For each simulated race, I record the ‘winner’, and tally the count of wins for each horse. My predicted odds line is based on the percentage of the simulated races that each horse wins. Without further ado, I’ll train the beaten lengths model.
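
As a quick illustration of that last step (the numbers here are made up; the formula is the same one I apply later as 20000/wins - 1):

wins = 5000                  #suppose a horse wins 5,000 of 20,000 simulated races
winProb = wins/20000         #estimated win probability of 0.25
fairOdds = 1/winProb - 1     #fair odds of 3, i.e. a fair price of 3-1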

# hBLModel = train(log(finishBL+1) ~ compSpeed + compES + comp4f + compLS + BestSpeedLife + compEarn + compPurse + compLevel + 
#                     compRaceRat + TrainWinPct + JockWinPct + numHor + compPower + BRISRunstyle + countRaces120 + countsq, 
#                     method = 'lm', data = hTrain)

dsBLModel = train(log(finishBL+1) ~ compSpeed + compES + comp4f + compLS + BestSpeedLife + compEarn + compPurse + compLevel + 
                    compRaceRat + TrainWinPct + JockWinPct + numHor + compPower + BRISRunstyle + countRaces120 + countsq, 
                    method = 'gbm', data = dsTrain,verbose = F)

dsBLModel

## Rsquared = 0.2088379, RMSE = 0.8713505

Evaluating the Model

We can see that the R^2 is about 20%, which isn’t bad considering how much variance there is in a horse race. My objective is to make money, though, so the question is how to test my model for profitability. My answer is to simulate the races based on my predicted beaten lengths and count how many simulated races each horse wins. Once I have an idea of each horse’s probability of winning, I can test a betting strategy for profitability. I will simulate the training set with a variety of standard deviations. The model’s standard deviation (RMSE) is about .871, so that’s a good starting point, but in the past I’ve always found that using a somewhat larger standard deviation in my simulations works best.

dsTrain$preBL = predict(dsBLModel, dsTrain)

#functions
#Simulate n performances for a single horse.
#Note that val is added to a draw already centered at val, so the simulated values are
#centered at 2*val; only the relative ordering across horses in a race matters below.
monteCarlo = function(val, StD, n) {
  result = val + rnorm(n, mean = val, sd = StD)
  return(result)
}

#Simulate every race in x (raceID, post position, predicted beaten lengths) n times
#and count how many simulated races each horse 'wins' (i.e. has the lowest value).
simDayMin = function(x, StD, n) {
  results = vector("numeric", 0)
  for (i in 1:length(unique(x$raceID))) {
    xrace = subset(x, x$raceID == as.character(unique(x$raceID))[i])
    sim = vector()
    #one column of simulated values per horse in the race
    for (j in 1:nrow(xrace)) {
      sim = cbind(sim, monteCarlo(as.numeric(xrace[j,3]), StD, n))
    }
    #seed the tally with one 'win' per horse so table() returns an entry for every
    #horse, even one that never wins a simulation (hence the -1 where this is called)
    winners = as.vector(1:nrow(xrace))
    for (k in 1:n) {
      winners = c(winners, which.min(sim[k,]))
    }
    results = c(results, as.vector(table(winners)))
  }
  return(results)
}


#Sim results with different variances
dsTrain = dsTrain[order(dsTrain$raceID, dsTrain$PP.y),]
simTrain = dsTrain[,c("raceID", "PP.y", "preBL")]
dsTrain$wins871 = simDayMin(simTrain[,c(1,2,3)], .871, 20000) -1
dsTrain$wins721 = simDayMin(simTrain[,c(1,2,3)], .721, 20000) -1
dsTrain$wins1021 = simDayMin(simTrain[,c(1,2,3)], 1.021, 20000) -1
dsTrain$wins1171 = simDayMin(simTrain[,c(1,2,3)], 1.171, 20000) -1
dsTrain$wins1321 = simDayMin(simTrain[,c(1,2,3)], 1.321, 20000) -1
dsTrain$wins1471 = simDayMin(simTrain[,c(1,2,3)], 1.471, 20000) -1
dsTrain$wins1621 = simDayMin(simTrain[,c(1,2,3)], 1.621, 20000) -1
dsTrain$wins1771 = simDayMin(simTrain[,c(1,2,3)], 1.771, 20000) -1


#group horses by approximate percentage of winning
#I use 15 groups with about equal numbers of horses
library(Hmisc)
dsTrain$group721 = cut2(dsTrain$wins721, g = 15)
dsTrain$group871 = cut2(dsTrain$wins871, g = 15)
dsTrain$group1021 = cut2(dsTrain$wins1021, g = 15)
dsTrain$group1171 = cut2(dsTrain$wins1171, g = 15)
dsTrain$group1321 = cut2(dsTrain$wins1321, g = 15)
dsTrain$group1471 = cut2(dsTrain$wins1471, g = 15)
dsTrain$group1621 = cut2(dsTrain$wins1621, g = 15)
dsTrain$group1771 = cut2(dsTrain$wins1771, g = 15)

Visualizing the best simulation

Now that I have predicted win data and actual win data, and I’ve grouped the horses, I will plot predicted vs. actual win percentage for each simulation. The best choice will have a nice linear relationship along the line y = x, i.e. if my predicted win percentage is 15%, the actual win percentage will be 15%. Let’s see how the plots look.

x = group_by(dsTrain, group721)
x = summarise(x, trueWinPct = sum(FinishPos == 1)/length(raceID), predWinPct = mean(wins721)/20000)

qplot(x$predWinPct, x$trueWinPct) + geom_smooth(method = 'lm')

x = group_by(dsTrain, group871)
x = summarise(x, trueWinPct = sum(FinishPos == 1)/length(raceID), predWinPct = mean(wins871)/20000)

qplot(x$predWinPct, x$trueWinPct) + geom_smooth(method = 'lm')

x = group_by(dsTrain, group1021)
x = summarise(x, trueWinPct = sum(FinishPos == 1)/length(raceID), predWinPct = mean(wins1021)/20000)

qplot(x$predWinPct, x$trueWinPct) + geom_smooth(method = 'lm')

x = group_by(dsTrain, group1171)
x = summarise(x, trueWinPct = sum(FinishPos == 1)/length(raceID), predWinPct = mean(wins1171)/20000)

qplot(x$predWinPct, x$trueWinPct) + geom_smooth(method = 'lm')

x = group_by(dsTrain, group1321)
x = summarise(x, trueWinPct = sum(FinishPos == 1)/length(raceID), predWinPct = mean(wins1321)/20000)

qplot(x$predWinPct, x$trueWinPct) + geom_smooth(method = 'lm')

x = group_by(dsTrain, group1471)
x = summarise(x, trueWinPct = sum(FinishPos == 1)/length(raceID), predWinPct = mean(wins1471)/20000)

qplot(x$predWinPct, x$trueWinPct) + geom_smooth(method = 'lm')

x = group_by(dsTrain, group1621)
x = summarise(x, trueWinPct = sum(FinishPos == 1)/length(raceID), predWinPct = mean(wins1621)/20000)

qplot(x$predWinPct, x$trueWinPct) + geom_smooth(method = 'lm')

x = group_by(dsTrain, group1771)
x = summarise(x, trueWinPct = sum(FinishPos == 1)/length(raceID), predWinPct = mean(wins1771)/20000)

qplot(x$predWinPct, x$trueWinPct) + geom_smooth(method = 'lm')
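
One small addition that is not part of the original code, but makes the y = x comparison easier to judge: any of the plots above can be overlaid with the identity line.

qplot(x$predWinPct, x$trueWinPct) + geom_smooth(method = 'lm') +
  geom_abline(slope = 1, intercept = 0, linetype = 'dashed')   #reference line y = x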

The simulation with a standard deviation of 1.321 looks the best. The best fit line has exactly the right equation and the points have a nice linear relationship. I’ll use that standard deviation for testing profitability.

Profitability

Now I will define a betting strategy and test it on the training data and the test data. I will have to simulate the test data using the same 1.321 standard deviation that I applied to the training set. My betting strategy is based on the Kelly Criterion, in which the optimal bet size is proportional to how valuable the bet is. For example, if a horse that I think should be 2-1 is given 10-1 by the betting public, that is an extremely valuable opportunity, and I would bet more than if the same horse had odds of 3-1.
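
To make that example concrete, here is the bet-sizing formula the code below uses (the Kelly fraction for a win bet), evaluated for the hypothetical horse above; p is my estimated win probability and b is the public odds, and the numbers are purely illustrative.

kellyFraction = function(p, b) (p*(b + 1) - 1)/b   #same form as the betSize calculation below

kellyFraction(1/3, 10)   #fair odds of 2-1 (p = 1/3) at public odds of 10-1: bet ~0.27 of a unit
kellyFraction(1/3, 3)    #the same horse at public odds of 3-1: bet only ~0.11 of a unit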

dsTest$preBL = predict(dsBLModel, dsTest)   #the test set also needs predicted beaten lengths before simulating
dsTest = dsTest[order(dsTest$raceID, dsTest$PP.y),]
simTest = dsTest[,c("raceID", "PP.y", "preBL")]
dsTest$wins1321 = simDayMin(simTest[,c(1,2,3)], 1.321, 20000) -1

dsTrain$fairOdds = 20000/dsTrain$wins1321 - 1
dsTest$fairOdds = 20000/dsTest$wins1321 - 1
dsTrain$betSize = ((dsTrain$wins1321/20000)*(dsTrain$FOdds +1)-1)/dsTrain$FOdds
dsTest$betSize = ((dsTest$wins1321/20000)*(dsTest$FOdds +1)-1)/dsTest$FOdds


#Baseline bet: If the track odds are higher than my fair odds, bet, otherwise, don't bet
dsTrain$bet1 = ifelse(dsTrain$fairOdds <= dsTrain$FOdds, ifelse(dsTrain$FinishPos == 1,dsTrain$FOdds*dsTrain$betSize,-dsTrain$betSize), 0)

#If my fair odds are less than 1.5, then I feel really good about the horse and I will bet it again if the track odds undervalue it.
dsTrain$bet2 = ifelse(dsTrain$fairOdds <= 1.5 & dsTrain$fairOdds <= dsTrain$FOdds,
                      ifelse(dsTrain$FinishPos == 1,dsTrain$FOdds*dsTrain$betSize,-dsTrain$betSize), 0)
# 
# #If the track undervalues a horse by 50% or more I will bet it again as long as the fair odds are under 15 
# #(variance gets bigger with bigger odds)
# dsTrain$bet3 = ifelse(dsTrain$fairOdds <= 15 & dsTrain$fairOdds*1.5 <= dsTrain$FOdds,
#                       ifelse(dsTrain$FinishPos == 1,dsTrain$FOdds*dsTrain$betSize,-dsTrain$betSize), 0)
# 
# #If a horse with fair odds under 10 is undervalued by 100% I will bet it one more time
# dsTrain$bet4 = ifelse(dsTrain$fairOdds <= 10 & dsTrain$fairOdds*2 <= dsTrain$FOdds,
#                       ifelse(dsTrain$FinishPos == 1,dsTrain$FOdds*dsTrain$betSize,-dsTrain$betSize), 0)

#Same thing for test set
dsTest$bet1 = ifelse(dsTest$fairOdds <= dsTest$FOdds, ifelse(dsTest$FinishPos == 1,dsTest$FOdds*dsTest$betSize,-dsTest$betSize), 0)

dsTest$bet2 = ifelse(dsTest$fairOdds <= 1.5 & dsTest$fairOdds <= dsTest$FOdds,
                      ifelse(dsTest$FinishPos == 1,dsTest$FOdds*dsTest$betSize,-dsTest$betSize), 0)
# 
# dsTest$bet3 = ifelse(dsTest$fairOdds <= 15 & dsTest$fairOdds*1.5 <= dsTest$FOdds,
#                       ifelse(dsTest$FinishPos == 1,dsTest$FOdds*dsTest$betSize,-dsTest$betSize), 0)
# 
# dsTest$bet4 = ifelse(dsTest$fairOdds <= 10 & dsTest$fairOdds*2 <= dsTest$FOdds,
#                       ifelse(dsTest$FinishPos == 1,dsTest$FOdds*dsTest$betSize,-dsTest$betSize), 0)

dsTrain$payout = dsTrain$bet1+dsTrain$bet2#+dsTrain$bet3+dsTrain$bet4
dsTest$payout = dsTest$bet1+dsTest$bet2#+dsTest$bet3+dsTest$bet4


roiTrain = 1 + sum(dsTrain$payout)/sum(abs(dsTrain$betSize))

roiTest = 1 + sum(dsTest$payout)/sum(abs(dsTest$betSize))


roiTrain
## [1] 1.011316
roiTest
## [1] 1.000198

Deployment of the model

I would love to squeak out a few more percentage points of ROI, but my model is definitely doing something right. I am essentially breaking even against a track cut of ~20%. After taking into account the bonuses that various betting sites offer, I am definitely on the right side of things. Of course, I don’t actually bet algorithmically with my break-even model. I use it as a guide when I go to the track, but I make all kinds of bets - not just to win. To deploy the model into a ‘deliverable’, I retrain it on all of my data, including the test set. Then I create a table output with the relevant information, which looks like this:

PostTime Track RNum RType Distance Surface RDate
1:35 GP 3 AO 6 D 20160320
Name Number FairOdds MLOdds PowerRating Run_Style Early_Speed Late_Speed
DEFER HEAVEN 3 2.21 1.4 140.20 E 94 84
BROTHERSOFTHETIME 4 2.48 3.0 131.18 P 78 100
PUNTROOSKIE 6 5.18 4.0 132.02 E/P 90 89
TOASTING MASTER 2 6.38 6.0 132.78 E 90 80
SINGANOTHERSONG 5 17.93 12.0 126.48 E 93 80
HEAR THAT TUNE 1 18.83 6.0 128.52 E 90 84
PostTime Track RNum RType Distance Surface RDate
1:50 AQU 2 R 6 d 20160320
Name Number FairOdds MLOdds PowerRating Run_Style Early_Speed Late_Speed
CORAMOSS 2 1.02 0.8 124.82 S 90 69
ICE COUTURE 6 4.24 4.0 117.59 P 76 83
QUICK HIT FEVER 4 6.76 12.0 116.95 E/P 89 79
PARADISE PEAK 3 8.50 10.0 115.74 S 88 69
LITTLE BEAR CAT 5 15.14 8.0 115.74 E/P 86 61
ISI ANTONIA 1 51.14 6.0 117.80 E 37 30
PostTime Track RNum RType Distance Surface RDate
2:30 OP 1 A 6 D 20160320
Name Number FairOdds MLOdds PowerRating Run_Style Early_Speed Late_Speed
CHANTMEUPBABY 5 3.71 4.0 119.13 P 87 79
MEANBONE 8 4.84 4.0 117.97 S 86 79
MAIZE ROAD 7 5.38 3.5 116.20 S 83 85
SANDHILL SAMMY 11 6.94 10.0 118.89 E/P 88 70
GUSKA MON SHOES 3 8.81 10.0 116.29 P 82 81
SUSPECT A STORM 4 10.50 12.0 119.26 E/P 90 74
OUR DOC TANNER 1 11.11 5.0 118.63 E/P 88 75
ARROGANT 12 30.55 15.0 112.41 E/P 83 74
THE ARKANSAN 10 59.61 20.0 111.85 E/P 81 67
TIPSY SUSPECT 2 69.52 20.0 109.95 E 87 63
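
As a rough illustration only (this is not my actual deployment code), a table like the one above could be assembled from the simulated win counts along these lines, where ‘today’ stands in for a hypothetical data frame of the current day’s entries with a wins1321 column produced by simDayMin:

today$FairOdds = round(20000/today$wins1321 - 1, 2)   #fair odds from simulated win counts
sheet = today[order(today$Track, today$RNum, today$FairOdds),
              c("Track", "RNum", "HName", "ProgramNum", "FairOdds", "MLOdds", "PowerRating", "BRISRunstyle")]
sheet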

Conclusion

This will always be a work in progress for me, a fun hobby that helps me learn data science skills. Thank you for reading my write-up. If you have any comments or questions, feel free to contact me at coby5252@gmail.com.