Final Project


Understanding Ground Loss in Horse Racing

An Analysis of Trakus Race Distance Data With R.


Author: James Mundy

Date: December 11, 2019



Abstract


In horse racing, the winner is not always the fastest horse. The distance run by each horse is also a key determinant. Historically, horseplayers have reflected ground loss in their pre-race analysis by making a somewhat generic adjustment to each horse that was likely to race wide.


This analysis seeks to develop a better understanding of ground loss and to answer the following questions:

  • Does field size (the number of horses in a race) play a role in ground loss?
  • Does running style have a positive or negative impact on ground loss?
  • Could a ground loss model be employed to gain a competitive advantage?






















Data

Getting The DATA


The data phase of the project included screen scraping data, writing scraped data to csv files, importing a folder of csv files into a data frame and reviewing the imported data for accuracy and completeness. Specifics steps complete can be viewed below and are listed here:

  • Wrote script to screen scrape data
  • Loaded csv files from folder into tibble with readr:read_csv and purrr::map
  • Previewed data using following functions: skimr::skim, dplyr::glimpse, visdat::viz_data

As you will see below, overall the data was in good shape and is ready for the data wrangling phase of our workflow.


Screen Sraped Data


The screen scraping script logs into the trakus website and scrapes the race data specified in the script. The image below provides and example of the data that is scraped. For this project, I scraped all the races from Acqueduct race track for the month of December 2019. This process provided more than 24,000 obseratons, before filtering.

Screen Sraper Script

library(rjson)
library(httr)
library(XML)
library(stringr)
library(lubridate)
library(tidyverse)

# This script scrape data from Trakus T-Charts

#Login
login <- function(){
  r <- POST("http://tnetwork.trakus.com/tnet/Login.aspx", body = list(TEXT_Username="JamesMundy",TEXT_Password="jamba1ayA",BUTTON_Submit="Log+In"))
  a<-cookies(r)
  ASP.NET_SessionID <-a[7][[1]][1]
  userCredentials <-a[7][[1]][2]
  ASPAUTH <-a[7][[1]][3]
}

#Convert Fraction to doubles
mixedToFloat <- function(x){
  is.integer  <- grepl("^\\d+$", x)
  is.fraction <- grepl("^\\d+\\/\\d+$", x)
  is.mixed    <- grepl("^\\d+ \\d+\\/\\d+$", x)
  #stopifnot(all(is.integer | is.fraction | is.mixed))
  
  numbers <- strsplit(x, "[ /]")
  
  ifelse(is.integer,  as.numeric(sapply(numbers, `[`, 1)),
         ifelse(is.fraction, as.numeric(sapply(numbers, `[`, 1)) /
                  as.numeric(sapply(numbers, `[`, 2)),
                as.numeric(sapply(numbers, `[`, 1)) +
                  as.numeric(sapply(numbers, `[`, 2)) /
                  as.numeric(sapply(numbers, `[`, 3))))
}

getLengths <- function(x){
  case_when(
    x == "Neck" ~ 0.25,
    x == "Head" ~ 0.1,
    x == "Nose" ~ 0.05,
    nchar(x) > 0 ~ mixedToFloat(x),
    TRUE ~ 0
  )
  
}



#Match Abbreviation with VenueID
start <- function(data,race_date)
{
  trk <- str_to_lower(data)
  d <- mdy(race_date)
  filedate <- str_replace_all(toString(d),"-","")
  filename <- str_c(data,filedate,".csv", sep = '')
  
  
  setwd(file.path("C:","Users", "mutue", "OneDrive", "Documents", "TimeForm", "Data", "TCharts"))
  l <- list(CD = 18,BEL = 23,DM = 10,GP = 20,TAM = 21,AQU = 24,SA = 4,SAR = 25,KEE = 7)
  cat(l[[data]],race_date)
  login()
  race_program(l[[data]],race_date,filename)
}

#Go to specific date and track
race_program <- function(V_ID,race_date,filename){
  
  tchartConn <<-file(filename,"w")
  
  url <-paste(c('http://tnetwork.trakus.com/tnet/t_RaceDay.aspx?VenueID=',toString(V_ID),'&Type=TBRED&Date=',toString(race_date)),collapse='')
  r <- GET(url)
  temp<-content(r, "text")
  doc <- htmlParse(temp)
  #Grab all event id
  temp_event_id_list<-xpathSApply(doc,"//a[contains(@href,'t_Recap.aspx?EventID=')]/@href")
  temp_event_id_list<-substr(temp_event_id_list,22,40)
  event_id_list <- list()
  counter<-1
  #Scrape through each event
  for( i in temp_event_id_list)
  {
    event_id_list[counter]<-i
    counter<-counter+1
    race_detail(i, tchartConn)
  }
  close(tchartConn, type ='w')
  
}

#Scrape Race in every range
race_detail <- function(event_id, f){
  
  url <- paste(c('http://tnetwork.trakus.com/tnet/t_Recap.aspx?EventID=',toString(event_id),'&PostSelect=0'),collapse='')
  r <- GET(url)
  temp<-content(r, "text")
  doc <- htmlParse(temp)
  race_info<-xpathSApply(doc,"//*[@class=\"recapTextHeader\"]/td/div")[[1]]
  race_info<-toString(xmlValue(race_info))
  #race_info<-paste(c(toString(xmlValue(race_info[1]$text)),toString(xmlValue(race_info[3]$i)),toString(xmlValue(race_info[5]$text)),toString(xmlValue(race_info[7]$text)),toString(xmlValue(race_info[9]$text))),collapse='|')
  race_info <- str_replace_all(race_info,"(<).*?(>)","") 
  race_info <- str_replace_all(race_info, "(\\s{3,200})"," ")

  
  track_name <- str_extract(race_info,'.+?(?=\\sRace)')
  race_number <- str_extract(race_info,'(Race\\s\\d+)')
  purse <- str_replace(str_extract(race_info, "(\\$\\d+)"),"(\\$)","")
  race_date <- str_extract(race_info, "([A-Z][a-z]+\\s\\d+\\,\\s\\d{4})")
  race_date <- mdy(race_date)
  race_date <- toString(race_date)
  distance <- str_extract(race_info,'(\\d\\sFurlongs|\\d\\.\\d Furlongs|\\d\\s+\\d\\/\\d{1,2}\\sMiles)')
  surface <- str_extract(race_info,'(Dirt|Turf|Sythetic|Outer Turf|Inner Turf)')
  
  final_select<-length(xpathSApply(doc,"//*[@id=\"PostSelect\"]/option"))
  print(final_select)
  #Loop through each segment
  for(range_select in 2:final_select)
  {
    url <- paste(c('http://tnetwork.trakus.com/tnet/t_Recap.aspx?EventID=',toString(event_id),'&PostSelect=',toString(range_select)),collapse='')
    print(url)
    r <- GET(url)
    temp<-content(r, "text")
    doc <- htmlParse(temp)
    
    horse_num  <- x <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[2]/img",xmlAttrs)
    horse_num <- horse_num[3,]
    horse_num <- str_sub(horse_num,-6,-4)
    horse_num <- str_extract(horse_num,'[0-9ABC]{1,2}')
    
    
    
    horse_name <- x <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[3]",xmlValue)
    horse_time <- x <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[4]",xmlValue)
    horse_dist <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[5]",xmlValue)
    horse_rail <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[6]",xmlValue)
    horse_avg <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[7]",xmlValue)
    horse_ahead <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[8]",xmlValue)
    horse_ahead <- str_replace(horse_ahead,'(\\dP\\d:)',"")
    horse_ahead <- str_replace(horse_ahead,'(\\()',"")
    horse_ahead <- str_replace(horse_ahead,'(\\))',"")
    horse_ahead <- getLengths(horse_ahead)
    ifelse(horse_ahead == "","0",horse_ahead)
    
    
    
    horse_cu_time <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[11]",xmlValue)
    horse_cu_peak <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[12]",xmlValue)
    horse_cu_dist <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[13]",xmlValue)
    horse_cu_delta <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[14]",xmlValue)
    horse_cu_delta <- str_replace_all(horse_cu_delta,'---','0')
    horse_cu_avg <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[15]",xmlValue)
    
    for(count in 1:length(horse_name))
    {
      
      
      cat(track_name, race_number, purse, race_date, distance, surface, count, range_select, horse_num[count], horse_name[count], horse_time[count],horse_dist[count],horse_rail[count],horse_avg[count],horse_ahead[count],horse_cu_time[count],horse_cu_peak[count],horse_cu_dist[count],horse_cu_delta[count],horse_cu_avg[count],'\n')
      writeLines(paste(c(track_name, race_number, purse, race_date, distance, surface, count, range_select, horse_num[count], horse_name[count], horse_time[count],horse_dist[count],horse_rail[count],horse_avg[count],horse_ahead[count],horse_cu_time[count],horse_cu_peak[count],horse_cu_dist[count],horse_cu_delta[count],horse_cu_avg[count]),collapse = ','), f)
  
    }
  }
}


start("AQU","12/8/2019")



Skim Function


The skim function is an alternative to the summary function. It displays most of the numerical attributes from summary, but it also displays missing values, more quantile information and an inline histogram for each variable.

Data summary
Name tchart
Number of rows 24184
Number of columns 20
_______________________
Column type frequency:
character 7
Date 1
numeric 12
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
trk 0 1 8 8 0 1 0
rnbr 0 1 6 7 0 10 0
rdst 0 1 10 12 0 8 0
srf 0 1 4 10 0 3 0
pst 0 1 1 2 0 17 0
hn 0 1 3 23 0 1402 0
ct 0 1 4 7 0 7654 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
rdt 0 1 2019-11-01 2019-12-08 2019-11-16 24

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
rprs 0 1 69020.14 48369.81 28000.00 41000.00 62000.00 70000.00 400000.00 ▇▁▁▁▁
pos 0 1 4.94 2.85 1.00 3.00 5.00 7.00 14.00 ▇▇▃▂▁
sxt 0 1 8.92 4.55 2.00 5.00 9.00 12.00 25.00 ▇▇▅▂▁
st 0 1 6.17 0.47 3.49 5.85 6.14 6.43 12.28 ▁▇▁▁▁
sdst 0 1 334.24 3.73 323.00 332.00 333.00 336.00 419.00 ▇▁▁▁▁
rf 0 1 11.78 8.28 1.60 5.50 10.30 16.40 107.80 ▇▁▁▁▁
samph 0 1 37.11 2.63 18.60 35.50 37.20 38.90 64.90 ▁▇▇▁▁
sla 0 1 1.20 2.15 0.00 0.25 0.75 1.50 66.00 ▇▁▁▁▁
cpk 0 1 39.93 2.26 23.70 39.00 40.30 41.40 64.90 ▁▇▇▁▁
cdst 0 1 2644.66 1522.65 323.00 1329.00 2659.00 3689.00 8099.00 ▇▇▅▂▁
cdlt 0 1 2.92 14.11 -113.00 -1.00 0.00 8.00 117.00 ▁▁▇▁▁
camph 0 1 37.39 1.61 23.70 36.50 37.50 38.50 42.60 ▁▁▁▇▂



Glimpse Function


The glimpse function displays a vertical preview of the dataset. It allows one to easily preview data type and sample data.

## Observations: 24,184
## Variables: 20
## $ trk   <chr> "Aqueduct", "Aqueduct", "Aqueduct", "Aqueduct", "Aqueduc...
## $ rnbr  <chr> "Race 1", "Race 1", "Race 1", "Race 1", "Race 1", "Race ...
## $ rprs  <dbl> 28000, 28000, 28000, 28000, 28000, 28000, 28000, 28000, ...
## $ rdt   <date> 2019-11-01, 2019-11-01, 2019-11-01, 2019-11-01, 2019-11...
## $ rdst  <chr> "6.5 Furlongs", "6.5 Furlongs", "6.5 Furlongs", "6.5 Fur...
## $ srf   <chr> "Dirt", "Dirt", "Dirt", "Dirt", "Dirt", "Dirt", "Dirt", ...
## $ pos   <dbl> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5,...
## $ sxt   <dbl> 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4,...
## $ pst   <chr> "7", "3", "4", "2", "5", "6", "1", "7", "3", "4", "6", "...
## $ hn    <chr> "Ma Meatloaf", "Shanghai Bonnie", "Eight Oaks", "Smokin ...
## $ st    <dbl> 6.49, 6.51, 6.56, 6.80, 6.86, 6.89, 6.97, 5.45, 5.45, 5....
## $ sdst  <dbl> 332, 332, 331, 331, 332, 332, 333, 332, 333, 333, 332, 3...
## $ rf    <dbl> 28.1, 16.7, 19.9, 10.4, 30.4, 35.2, 6.3, 19.8, 12.8, 13....
## $ samph <dbl> 34.9, 34.7, 34.4, 33.2, 33.0, 32.9, 32.5, 41.5, 41.7, 40...
## $ sla   <dbl> 0.25, 0.25, 1.75, 0.25, 0.25, 0.50, 0.00, 0.25, 1.00, 2....
## $ ct    <chr> "6.49", "6.51", "6.56", "6.80", "6.86", "6.89", "6.97", ...
## $ cpk   <dbl> 34.9, 34.7, 34.4, 33.2, 33.0, 32.9, 32.5, 41.5, 41.7, 40...
## $ cdst  <dbl> 332, 332, 331, 331, 332, 332, 333, 664, 665, 664, 664, 6...
## $ cdlt  <dbl> 0, 0, -1, -1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 2, 2, 1, ...
## $ camph <dbl> 34.9, 34.7, 34.4, 33.2, 33.0, 32.9, 32.5, 37.9, 37.9, 37...



Vis_dat Function


The Vis_dat function is a great way to visualize the data type and missing data within a data frame.


Back to top

Wrangling

Wrangling The Data


In the data wrangling portion of my workflow I utiilize several tidyverse packages and functions to change data types, rename variables, sort, filter, select and create variables. New variables added include:

  • ground loss - the distance greater than the stated race distance run by a horse in feet
  • feet per second - measure of velocity of the horse in a segment (sfps) or cummulative (cpfs)
  • running style - captures if a horse runs on the lead, close to the lead, mid pack or from the back
  • field - number of horses in the race

Factors were created using the Forcats package for the following variables:

  • Surface (srf) - 3 Levels
  • Running Style (rs) - 4 Levels
  • Race Distance String (rdst_str) - 8 Levels

Finally, the skim and viz_dat function were utilized a second time to review the data and the modified tibble rendered to a data table for additonal review.


Wrangling Code


Click the Code button to see data wrangling code and comments.

tchart <- tchart %>% 
  #Convert variable to double
  mutate(rnbr = as.double(str_extract(rnbr, "(\\d+)"))) %>% 
  mutate(rprs = as.double(str_extract(rprs, "(\\d+)"))) %>%
  mutate(rf = as.double(rf)) %>%
  #Adjusting sxt variable to start from 1 instead of 2 - represents each 1/16th of the race.
  mutate(sxt = sxt-1) %>% 
  #Creating Ground Loss, Feet Per Second
  mutate(grdlss = (cdst- (sxt * 330))) %>% 
  mutate(sfps = sdst / st) %>%
  mutate(sla = if_else(is.na(sla),0,sla)) %>% 
  #Converting cummulative time to seconds
  mutate(ct = if_else(nchar(ct)>6,getSecs(ct),as.numeric(ct))) %>% 
  #Calculate cum feet per second
  mutate(cfps = as.numeric(cdst) / as.numeric(ct)) %>%
  group_by(trk, rdt, rnbr, sxt) %>% 
  mutate(field = max(pos)) %>%
  mutate(bf = cumsum(sla) - sla) %>% 
  mutate(bf =  bf * 9) %>% 
  mutate(rsv = bf/max(bf)) %>% 
  ungroup() %>% 
  #Start of Running Style creation
  group_by(trk, rdt, rnbr) %>% 
  mutate(rsv = if_else(max(sxt)<16 & sxt==4, rsv,NA_real_)) %>%
  mutate(rsv = if_else(max(sxt)>=16 & sxt==8, bf/max(bf),rsv)) %>%
  mutate(rsbf = if_else(max(sxt)<16 & sxt==4, bf,NA_real_)) %>%
  mutate(rsbf = if_else(max(sxt)>=16 & sxt==8, bf,rsbf)) %>%
  ungroup() %>% 
  group_by(trk, rdt, rnbr, sxt) %>%
  mutate(rail_rank = min_rank(rf)) %>% 
  mutate(has_entry = if_else(str_detect(pst,"([A-Z])"),1L,NA_integer_)) %>% 
  mutate(rail_rank = if_else(sxt == 1, rail_rank, NA_integer_)) %>% 
  arrange(desc(has_entry)) %>%
  fill(has_entry) %>% 
  arrange(sxt) %>% 
  ungroup() %>% 
  #Calculate Running Style For Horses
  group_by(trk, rdt, rnbr, hn) %>%
  fill(rail_rank) %>% 
  arrange(trk, rdt, rnbr, hn, rsv) %>% 
  fill(rsv) %>% 
  fill(rsbf) %>% 
  #Completing Running Style Calculation
  mutate(rs = case_when(
         rsbf < 5.00 ~ "Ldr",
         rsv <= 0.333 ~ "EP",
         rsv <=0.66 ~ "MP",
         rsv <=1 ~ "Clsr",
         TRUE ~ "Z")) %>% 
  ungroup() %>% 
  arrange(trk, rdt, rnbr,sxt,bf) %>% 
  group_by(trk, rdt, rnbr) %>% 
  filter(sxt==max(sxt)) %>%
  #Create Race distance as string
  rename(rdst_str = rdst) %>% 
  #Create race distance as numeric
  mutate(rdst = sxt * 330) %>% 
  mutate(rtyp = if_else(max(sxt)<16,"S","R")) %>% 
  
  #Adjusting Post Position When there is an Entry 1A
  mutate(pp = if_else(has_entry==1,rail_rank, as.integer(pst))) %>% 
  mutate(pp = if_else(is.na(pp),as.integer(str_trim(pst)),pp)) %>%
  ungroup() %>%  
  
  # Create Factors
  mutate(srf = factor(srf,levels=surface_levels)) %>% 
  mutate(rs = factor(rs, levels=rs_levels)) %>% 
  mutate(rdst_str = factor(rdst_str, levels = distance_levels)) %>%
  mutate(ppfld = if_else(pp <= field,log10(pp*field),0)) %>% 
  mutate(fs = if_else(field <=5, "s",if_else(field>9,"l","m"))) %>% 
  
  #Exclude turf races because of relatively small sample this time of year
  filter(srf == "Dirt") %>%  
  #Exclude race greater than 1 mile.  Not many races greater than a mile were run in November 2019
  filter(rdst <= 5280) %>% 
  filter(pp > 0) %>%
  select(trk, rdt, rnbr, rdst_str, rdst, srf, field, pp, hn, pos, rs, ppfld, grdlss, cdst, fs)



Skim Function


Review data again after data wrangling has been completed.

Data summary
Name tchart
Number of rows 887
Number of columns 15
_______________________
Column type frequency:
character 3
Date 1
factor 3
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
trk 0 1 8 8 0 1 0
hn 0 1 4 22 0 749 0
fs 0 1 1 1 0 3 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
rdt 0 1 2019-11-01 2019-12-08 2019-11-17 24

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
rdst_str 0 1 FALSE 4 6 F: 369, 8 F: 260, 6.5: 167, 7 F: 91
srf 0 1 FALSE 1 Dir: 887, Tur: 0, Out: 0
rs 0 1 FALSE 4 EP: 344, Ldr: 197, MP: 188, Cls: 158

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
rnbr 0 1 4.93 2.82 1 2.00 5.00 7.00 10.00 ▇▇▆▆▅
rdst 0 1 4476.76 556.09 3960 3960.00 4290.00 5280.00 5280.00 ▇▃▂▁▆
field 0 1 8.03 2.07 4 6.00 8.00 9.00 14.00 ▆▇▆▃▁
pp 0 1 5.21 3.09 1 3.00 5.00 7.00 16.00 ▇▆▃▁▁
pos 0 1 4.52 2.59 1 2.00 4.00 6.00 14.00 ▇▇▃▂▁
ppfld 0 1 1.31 0.57 0 1.08 1.45 1.69 2.29 ▂▂▃▇▂
grdlss 0 1 62.26 15.25 35 51.00 61.00 72.00 172.00 ▇▆▁▁▁
cdst 0 1 4539.02 559.40 3995 4021.00 4349.00 5332.00 5383.00 ▇▃▂▁▆



Vis_dat Function


Viz_dat again confirms there is no missing data.




Explore

Exploring The Data


We will begin to answer questions about distribution, correlation. The answers to these questions have implication on our modeling phase and should begin to answer the questions on the impact of field size and running style on ground loss.

Ground Loss By Distance and Running Style


We will start the explore phase with density plots of response variable - Ground Loss. Does this variable have a normal distribution?




Ground Loss By Distance and Post Position


Field size is define here as small (five or less), medium (6 -9) or large (10+).




Ground Loss By Distance and Field Size ((l)arge, (m)edium, (s)mall)


Field size is define here as small (five or less), medium (6 -9) or large (10+).




Correlation With Ground Loss (log10) - Post Position by Distance and Running Style


Our Post Position by distance and running style makes a strong argument for running style making a difference in ground loss.




Correlation With Ground Loss (log10) - Field Size by Distance and Running Style


We will begin to answer questions about distribution, correlation. The answers to these questions will determine our ability to successfully model our response variable - ground loss. We will start this excercise by creating a data frames for our numeric and character variable




Back to top

Model

Building Models


The modeling strategy was a many models approach. Specifically, nesting was utilized to create specific models at the intersection of running style and race distance. Traditional Linear (lm) and linear with spline (lm w/ ns()) were employed. The broom package is used to display modeling resuls:


Model 1


Model 1 is nested (a model for each nested group) linear model thas has been applied to the Running stlye and Race Distance Group. For example for each combination of running style and race distance, a model is calculated. The model is defined as:

  • lm(log(grdlss) ~ pp +log(field))

Model 1 Results

Model 2


Similar Model 1, Model 2 is nest spline model thas has also been applied to the Running stlye and Race Distance Group. The model is defined as:

  • lm(log(grdlss) ~ ns(pp, df=5))

Model 2 Results

Back to top

Communicate

The Findings


What are the answers to the questions posed at the beginning of this analysis.


Questions and Answers

Back to top