DATA 607 - Final Project

Final Project

Understanding Ground Loss in Horse Racing

An Analysis of Trakus Race Distance Data With R.

Author: James Mundy

Date: December 11, 2019

Abstract

In horse racing, the winner is not always the fastest horse. The distance run by each horse is also a key determinant. Historically, horseplayers have reflected ground loss in their pre-race analysis by making a somewhat generic adjustment to each horse that was likely to race wide.

This analysis seeks to develop a better understanding of ground loss and to answer the following questions:

Does field size (the number of horses in a race) play a role in ground loss?
Does running style have a positive or negative impact on ground loss?
Could a ground loss model be employed to gain a competitive advantage?

Data

Getting The DATA

The data phase of the project included screen scraping data, writing scraped data to csv files, importing a folder of csv files into a data frame and reviewing the imported data for accuracy and completeness. Specifics steps complete can be viewed below and are listed here:

Wrote script to screen scrape data
Loaded csv files from folder into tibble with readr:read_csv and purrr::map
Previewed data using following functions: skimr::skim, dplyr::glimpse, visdat::viz_data

As you will see below, overall the data was in good shape and is ready for the data wrangling phase of our workflow.

Screen Sraped Data

The screen scraping script logs into the trakus website and scrapes the race data specified in the script. The image below provides and example of the data that is scraped. For this project, I scraped all the races from Acqueduct race track for the month of December 2019. This process provided more than 24,000 obseratons, before filtering.

Screen Sraper Script

library(rjson)
library(httr)
library(XML)
library(stringr)
library(lubridate)
library(tidyverse)

# This script scrape data from Trakus T-Charts

#Login
login <- function(){
  r <- POST("http://tnetwork.trakus.com/tnet/Login.aspx", body = list(TEXT_Username="JamesMundy",TEXT_Password="jamba1ayA",BUTTON_Submit="Log+In"))
  a<-cookies(r)
  ASP.NET_SessionID <-a[7][[1]][1]
  userCredentials <-a[7][[1]][2]
  ASPAUTH <-a[7][[1]][3]
}

#Convert Fraction to doubles
mixedToFloat <- function(x){
  is.integer  <- grepl("^\\d+$", x)
  is.fraction <- grepl("^\\d+\\/\\d+$", x)
  is.mixed    <- grepl("^\\d+ \\d+\\/\\d+$", x)
  #stopifnot(all(is.integer | is.fraction | is.mixed))
  
  numbers <- strsplit(x, "[ /]")
  
  ifelse(is.integer,  as.numeric(sapply(numbers, `[`, 1)),
         ifelse(is.fraction, as.numeric(sapply(numbers, `[`, 1)) /
                  as.numeric(sapply(numbers, `[`, 2)),
                as.numeric(sapply(numbers, `[`, 1)) +
                  as.numeric(sapply(numbers, `[`, 2)) /
                  as.numeric(sapply(numbers, `[`, 3))))
}

getLengths <- function(x){
  case_when(
    x == "Neck" ~ 0.25,
    x == "Head" ~ 0.1,
    x == "Nose" ~ 0.05,
    nchar(x) > 0 ~ mixedToFloat(x),
    TRUE ~ 0
  )
  
}



#Match Abbreviation with VenueID
start <- function(data,race_date)
{
  trk <- str_to_lower(data)
  d <- mdy(race_date)
  filedate <- str_replace_all(toString(d),"-","")
  filename <- str_c(data,filedate,".csv", sep = '')
  
  
  setwd(file.path("C:","Users", "mutue", "OneDrive", "Documents", "TimeForm", "Data", "TCharts"))
  l <- list(CD = 18,BEL = 23,DM = 10,GP = 20,TAM = 21,AQU = 24,SA = 4,SAR = 25,KEE = 7)
  cat(l[[data]],race_date)
  login()
  race_program(l[[data]],race_date,filename)
}

#Go to specific date and track
race_program <- function(V_ID,race_date,filename){
  
  tchartConn <<-file(filename,"w")
  
  url <-paste(c('http://tnetwork.trakus.com/tnet/t_RaceDay.aspx?VenueID=',toString(V_ID),'&Type=TBRED&Date=',toString(race_date)),collapse='')
  r <- GET(url)
  temp<-content(r, "text")
  doc <- htmlParse(temp)
  #Grab all event id
  temp_event_id_list<-xpathSApply(doc,"//a[contains(@href,'t_Recap.aspx?EventID=')]/@href")
  temp_event_id_list<-substr(temp_event_id_list,22,40)
  event_id_list <- list()
  counter<-1
  #Scrape through each event
  for( i in temp_event_id_list)
  {
    event_id_list[counter]<-i
    counter<-counter+1
    race_detail(i, tchartConn)
  }
  close(tchartConn, type ='w')
  
}

#Scrape Race in every range
race_detail <- function(event_id, f){
  
  url <- paste(c('http://tnetwork.trakus.com/tnet/t_Recap.aspx?EventID=',toString(event_id),'&PostSelect=0'),collapse='')
  r <- GET(url)
  temp<-content(r, "text")
  doc <- htmlParse(temp)
  race_info<-xpathSApply(doc,"//*[@class=\"recapTextHeader\"]/td/div")[[1]]
  race_info<-toString(xmlValue(race_info))
  #race_info<-paste(c(toString(xmlValue(race_info[1]$text)),toString(xmlValue(race_info[3]$i)),toString(xmlValue(race_info[5]$text)),toString(xmlValue(race_info[7]$text)),toString(xmlValue(race_info[9]$text))),collapse='|')
  race_info <- str_replace_all(race_info,"(<).*?(>)","") 
  race_info <- str_replace_all(race_info, "(\\s{3,200})"," ")

  
  track_name <- str_extract(race_info,'.+?(?=\\sRace)')
  race_number <- str_extract(race_info,'(Race\\s\\d+)')
  purse <- str_replace(str_extract(race_info, "(\\$\\d+)"),"(\\$)","")
  race_date <- str_extract(race_info, "([A-Z][a-z]+\\s\\d+\\,\\s\\d{4})")
  race_date <- mdy(race_date)
  race_date <- toString(race_date)
  distance <- str_extract(race_info,'(\\d\\sFurlongs|\\d\\.\\d Furlongs|\\d\\s+\\d\\/\\d{1,2}\\sMiles)')
  surface <- str_extract(race_info,'(Dirt|Turf|Sythetic|Outer Turf|Inner Turf)')
  
  final_select<-length(xpathSApply(doc,"//*[@id=\"PostSelect\"]/option"))
  print(final_select)
  #Loop through each segment
  for(range_select in 2:final_select)
  {
    url <- paste(c('http://tnetwork.trakus.com/tnet/t_Recap.aspx?EventID=',toString(event_id),'&PostSelect=',toString(range_select)),collapse='')
    print(url)
    r <- GET(url)
    temp<-content(r, "text")
    doc <- htmlParse(temp)
    
    horse_num  <- x <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[2]/img",xmlAttrs)
    horse_num <- horse_num[3,]
    horse_num <- str_sub(horse_num,-6,-4)
    horse_num <- str_extract(horse_num,'[0-9ABC]{1,2}')
    
    
    
    horse_name <- x <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[3]",xmlValue)
    horse_time <- x <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[4]",xmlValue)
    horse_dist <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[5]",xmlValue)
    horse_rail <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[6]",xmlValue)
    horse_avg <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[7]",xmlValue)
    horse_ahead <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[8]",xmlValue)
    horse_ahead <- str_replace(horse_ahead,'(\\dP\\d:)',"")
    horse_ahead <- str_replace(horse_ahead,'(\\()',"")
    horse_ahead <- str_replace(horse_ahead,'(\\))',"")
    horse_ahead <- getLengths(horse_ahead)
    ifelse(horse_ahead == "","0",horse_ahead)
    
    
    
    horse_cu_time <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[11]",xmlValue)
    horse_cu_peak <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[12]",xmlValue)
    horse_cu_dist <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[13]",xmlValue)
    horse_cu_delta <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[14]",xmlValue)
    horse_cu_delta <- str_replace_all(horse_cu_delta,'---','0')
    horse_cu_avg <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[15]",xmlValue)
    
    for(count in 1:length(horse_name))
    {
      
      
      cat(track_name, race_number, purse, race_date, distance, surface, count, range_select, horse_num[count], horse_name[count], horse_time[count],horse_dist[count],horse_rail[count],horse_avg[count],horse_ahead[count],horse_cu_time[count],horse_cu_peak[count],horse_cu_dist[count],horse_cu_delta[count],horse_cu_avg[count],'\n')
      writeLines(paste(c(track_name, race_number, purse, race_date, distance, surface, count, range_select, horse_num[count], horse_name[count], horse_time[count],horse_dist[count],horse_rail[count],horse_avg[count],horse_ahead[count],horse_cu_time[count],horse_cu_peak[count],horse_cu_dist[count],horse_cu_delta[count],horse_cu_avg[count]),collapse = ','), f)
  
    }
  }
}


start("AQU","12/8/2019")

Load Data

Map and read_csv functions are used in conjunction to load all the csv files in the TChars folders. Files are then reduced with rbind into a single data frame - tcharts.

files <- dir("C:/Users/mutue/OneDrive/Documents/TimeForm/Data/TCharts/", full.names = T) 

tchart <- files %>%
  map(read_csv, col_names=c("trk",
                            "rnbr","rprs","rdt","rdst","srf","pos","sxt","pst","hn","st","sdst","rf","samph","sla","ct","cpk","cdst","cdlt","camph")) %>%     # read in all the files individually, using
                       # the function read_csv() from the readr package
  reduce(rbind)        # reduce with rbind into one dataframe

Skim Function

The skim function is an alternative to the summary function. It displays most of the numerical attributes from summary, but it also displays missing values, more quantile information and an inline histogram for each variable.

skim(tchart)

Data summary
Name	tchart
Number of rows	24184
Number of columns	20
_______________________
Column type frequency:
character	7
Date	1
numeric	12
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
trk	1	8	8	1
rnbr	1	6	7	10
rdst	1	10	12	8
srf	1	4	10	3
pst	1	1	2	17
hn	1	3	23	1402
ct	1	4	7	7654

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
rdt	0	1	2019-11-01	2019-12-08	2019-11-16	24

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
rprs	1	69020.14	48369.81	28000.00	41000.00	62000.00	70000.00	400000.00	▇▁▁▁▁
pos	1	4.94	2.85	1.00	3.00	5.00	7.00	14.00	▇▇▃▂▁
sxt	1	8.92	4.55	2.00	5.00	9.00	12.00	25.00	▇▇▅▂▁
st	1	6.17	0.47	3.49	5.85	6.14	6.43	12.28	▁▇▁▁▁
sdst	1	334.24	3.73	323.00	332.00	333.00	336.00	419.00	▇▁▁▁▁
rf	1	11.78	8.28	1.60	5.50	10.30	16.40	107.80	▇▁▁▁▁
samph	1	37.11	2.63	18.60	35.50	37.20	38.90	64.90	▁▇▇▁▁
sla	1	1.20	2.15	0.00	0.25	0.75	1.50	66.00	▇▁▁▁▁
cpk	1	39.93	2.26	23.70	39.00	40.30	41.40	64.90	▁▇▇▁▁
cdst	1	2644.66	1522.65	323.00	1329.00	2659.00	3689.00	8099.00	▇▇▅▂▁
cdlt	1	2.92	14.11	-113.00	-1.00	0.00	8.00	117.00	▁▁▇▁▁
camph	1	37.39	1.61	23.70	36.50	37.50	38.50	42.60	▁▁▁▇▂

Glimpse Function

The glimpse function displays a vertical preview of the dataset. It allows one to easily preview data type and sample data.

glimpse(tchart)

## Observations: 24,184
## Variables: 20
## $ trk   <chr> "Aqueduct", "Aqueduct", "Aqueduct", "Aqueduct", "Aqueduc...
## $ rnbr  <chr> "Race 1", "Race 1", "Race 1", "Race 1", "Race 1", "Race ...
## $ rprs  <dbl> 28000, 28000, 28000, 28000, 28000, 28000, 28000, 28000, ...
## $ rdt   <date> 2019-11-01, 2019-11-01, 2019-11-01, 2019-11-01, 2019-11...
## $ rdst  <chr> "6.5 Furlongs", "6.5 Furlongs", "6.5 Furlongs", "6.5 Fur...
## $ srf   <chr> "Dirt", "Dirt", "Dirt", "Dirt", "Dirt", "Dirt", "Dirt", ...
## $ pos   <dbl> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5,...
## $ sxt   <dbl> 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4,...
## $ pst   <chr> "7", "3", "4", "2", "5", "6", "1", "7", "3", "4", "6", "...
## $ hn    <chr> "Ma Meatloaf", "Shanghai Bonnie", "Eight Oaks", "Smokin ...
## $ st    <dbl> 6.49, 6.51, 6.56, 6.80, 6.86, 6.89, 6.97, 5.45, 5.45, 5....
## $ sdst  <dbl> 332, 332, 331, 331, 332, 332, 333, 332, 333, 333, 332, 3...
## $ rf    <dbl> 28.1, 16.7, 19.9, 10.4, 30.4, 35.2, 6.3, 19.8, 12.8, 13....
## $ samph <dbl> 34.9, 34.7, 34.4, 33.2, 33.0, 32.9, 32.5, 41.5, 41.7, 40...
## $ sla   <dbl> 0.25, 0.25, 1.75, 0.25, 0.25, 0.50, 0.00, 0.25, 1.00, 2....
## $ ct    <chr> "6.49", "6.51", "6.56", "6.80", "6.86", "6.89", "6.97", ...
## $ cpk   <dbl> 34.9, 34.7, 34.4, 33.2, 33.0, 32.9, 32.5, 41.5, 41.7, 40...
## $ cdst  <dbl> 332, 332, 331, 331, 332, 332, 333, 664, 665, 664, 664, 6...
## $ cdlt  <dbl> 0, 0, -1, -1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 2, 2, 1, ...
## $ camph <dbl> 34.9, 34.7, 34.4, 33.2, 33.0, 32.9, 32.5, 37.9, 37.9, 37...

Vis_dat Function

The Vis_dat function is a great way to visualize the data type and missing data within a data frame.

vis_dat(tchart)

Wrangling

Wrangling The Data

In the data wrangling portion of my workflow I utiilize several tidyverse packages and functions to change data types, rename variables, sort, filter, select and create variables. New variables added include:

ground loss - the distance greater than the stated race distance run by a horse in feet
feet per second - measure of velocity of the horse in a segment (sfps) or cummulative (cpfs)
running style - captures if a horse runs on the lead, close to the lead, mid pack or from the back
field - number of horses in the race

Factors were created using the Forcats package for the following variables:

Surface (srf) - 3 Levels
Running Style (rs) - 4 Levels
Race Distance String (rdst_str) - 8 Levels

Finally, the skim and viz_dat function were utilized a second time to review the data and the modified tibble rendered to a data table for additonal review.

Wrangling Code

Click the Code button to see data wrangling code and comments.

tchart <- tchart %>% 
  #Convert variable to double
  mutate(rnbr = as.double(str_extract(rnbr, "(\\d+)"))) %>% 
  mutate(rprs = as.double(str_extract(rprs, "(\\d+)"))) %>%
  mutate(rf = as.double(rf)) %>%
  #Adjusting sxt variable to start from 1 instead of 2 - represents each 1/16th of the race.
  mutate(sxt = sxt-1) %>% 
  #Creating Ground Loss, Feet Per Second
  mutate(grdlss = (cdst- (sxt * 330))) %>% 
  mutate(sfps = sdst / st) %>%
  mutate(sla = if_else(is.na(sla),0,sla)) %>% 
  #Converting cummulative time to seconds
  mutate(ct = if_else(nchar(ct)>6,getSecs(ct),as.numeric(ct))) %>% 
  #Calculate cum feet per second
  mutate(cfps = as.numeric(cdst) / as.numeric(ct)) %>%
  group_by(trk, rdt, rnbr, sxt) %>% 
  mutate(field = max(pos)) %>%
  mutate(bf = cumsum(sla) - sla) %>% 
  mutate(bf =  bf * 9) %>% 
  mutate(rsv = bf/max(bf)) %>% 
  ungroup() %>% 
  #Start of Running Style creation
  group_by(trk, rdt, rnbr) %>% 
  mutate(rsv = if_else(max(sxt)<16 & sxt==4, rsv,NA_real_)) %>%
  mutate(rsv = if_else(max(sxt)>=16 & sxt==8, bf/max(bf),rsv)) %>%
  mutate(rsbf = if_else(max(sxt)<16 & sxt==4, bf,NA_real_)) %>%
  mutate(rsbf = if_else(max(sxt)>=16 & sxt==8, bf,rsbf)) %>%
  ungroup() %>% 
  group_by(trk, rdt, rnbr, sxt) %>%
  mutate(rail_rank = min_rank(rf)) %>% 
  mutate(has_entry = if_else(str_detect(pst,"([A-Z])"),1L,NA_integer_)) %>% 
  mutate(rail_rank = if_else(sxt == 1, rail_rank, NA_integer_)) %>% 
  arrange(desc(has_entry)) %>%
  fill(has_entry) %>% 
  arrange(sxt) %>% 
  ungroup() %>% 
  #Calculate Running Style For Horses
  group_by(trk, rdt, rnbr, hn) %>%
  fill(rail_rank) %>% 
  arrange(trk, rdt, rnbr, hn, rsv) %>% 
  fill(rsv) %>% 
  fill(rsbf) %>% 
  #Completing Running Style Calculation
  mutate(rs = case_when(
         rsbf < 5.00 ~ "Ldr",
         rsv <= 0.333 ~ "EP",
         rsv <=0.66 ~ "MP",
         rsv <=1 ~ "Clsr",
         TRUE ~ "Z")) %>% 
  ungroup() %>% 
  arrange(trk, rdt, rnbr,sxt,bf) %>% 
  group_by(trk, rdt, rnbr) %>% 
  filter(sxt==max(sxt)) %>%
  #Create Race distance as string
  rename(rdst_str = rdst) %>% 
  #Create race distance as numeric
  mutate(rdst = sxt * 330) %>% 
  mutate(rtyp = if_else(max(sxt)<16,"S","R")) %>% 
  
  #Adjusting Post Position When there is an Entry 1A
  mutate(pp = if_else(has_entry==1,rail_rank, as.integer(pst))) %>% 
  mutate(pp = if_else(is.na(pp),as.integer(str_trim(pst)),pp)) %>%
  ungroup() %>%  
  
  # Create Factors
  mutate(srf = factor(srf,levels=surface_levels)) %>% 
  mutate(rs = factor(rs, levels=rs_levels)) %>% 
  mutate(rdst_str = factor(rdst_str, levels = distance_levels)) %>%
  mutate(ppfld = if_else(pp <= field,log10(pp*field),0)) %>% 
  mutate(fs = if_else(field <=5, "s",if_else(field>9,"l","m"))) %>% 
  
  #Exclude turf races because of relatively small sample this time of year
  filter(srf == "Dirt") %>%  
  #Exclude race greater than 1 mile.  Not many races greater than a mile were run in November 2019
  filter(rdst <= 5280) %>% 
  filter(pp > 0) %>%
  select(trk, rdt, rnbr, rdst_str, rdst, srf, field, pp, hn, pos, rs, ppfld, grdlss, cdst, fs)

Skim Function

Review data again after data wrangling has been completed.

skim(tchart)

Data summary
Name	tchart
Number of rows	887
Number of columns	15
_______________________
Column type frequency:
character	3
Date	1
factor	3
numeric	8
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
trk	1	8	8	1
hn	1	4	22	749
fs	1	1	1	3

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
rdt	0	1	2019-11-01	2019-12-08	2019-11-17	24

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
rdst_str	1	FALSE	4	6 F: 369, 8 F: 260, 6.5: 167, 7 F: 91
srf	1	FALSE	1	Dir: 887, Tur: 0, Out: 0
rs	1	FALSE	4	EP: 344, Ldr: 197, MP: 188, Cls: 158

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
rnbr	1	4.93	2.82	1	2.00	5.00	7.00	10.00	▇▇▆▆▅
rdst	1	4476.76	556.09	3960	3960.00	4290.00	5280.00	5280.00	▇▃▂▁▆
field	1	8.03	2.07	4	6.00	8.00	9.00	14.00	▆▇▆▃▁
pp	1	5.21	3.09	1	3.00	5.00	7.00	16.00	▇▆▃▁▁
pos	1	4.52	2.59	1	2.00	4.00	6.00	14.00	▇▇▃▂▁
ppfld	1	1.31	0.57	0	1.08	1.45	1.69	2.29	▂▂▃▇▂
grdlss	1	62.26	15.25	35	51.00	61.00	72.00	172.00	▇▆▁▁▁
cdst	1	4539.02	559.40	3995	4021.00	4349.00	5332.00	5383.00	▇▃▂▁▆

Vis_dat Function

Viz_dat again confirms there is no missing data.

vis_dat(tchart)

Outliers and Small N

Box plots identify outliers so they can be removed from the data set. We are now ready to advance to the Explore phase of the workflow.

p <- ggplot(tchart, aes(x=rdst_str, y=grdlss, fill=rdst_str)) +
    geom_boxplot(alpha=0.7) +
    stat_summary(fun.y=mean, geom="point", shape=20, size=5, color="red", fill="red") +
    theme(legend.position="none") +
    ylim(0, 150) +
    theme_fivethirtyeight()
  
  
 p + theme(legend.position = "none")

 tchart <- tchart %>% 
  mutate(outlier = if_else((rdst_str=="6 Furlongs" | rdst_str == "6.5 Furlongs") & grdlss > 100, 1,0)) %>% 
  filter(outlier == 0)
 

p <- ggplot(tchart, aes(x=rdst_str, y=grdlss, fill=rdst_str)) +
    geom_boxplot(alpha=0.7) +
    stat_summary(fun.y=mean, geom="point", shape=20, size=5, color="red", fill="red") +
    theme(legend.position="none") +
    ylim(0, 150) +
    theme_fivethirtyeight()
  
  
 p + theme(legend.position = "none")

Explore

Exploring The Data

We will begin to answer questions about distribution, correlation. The answers to these questions have implication on our modeling phase and should begin to answer the questions on the impact of field size and running style on ground loss.

Ground Loss By Distance and Running Style

We will start the explore phase with density plots of response variable - Ground Loss. Does this variable have a normal distribution?

  ggplot(tchart, aes(grdlss))+
  scale_x_log10() +
  geom_density(fill="blue", color="blue") +
  facet_grid(rs ~ rdst_str, scale = "free") +
  labs(x=element_blank(), y= element_blank()) +
  theme_fivethirtyeight()

Ground Loss By Distance and Post Position

Field size is define here as small (five or less), medium (6 -9) or large (10+).

  ggplot(tchart, aes(grdlss))+
  scale_x_log10() +
  geom_density(fill="blue", color="blue") +
  facet_grid(pp ~ rdst_str, scale = "free") +
  labs(x=element_blank(), y= element_blank()) +
  theme_fivethirtyeight()

Ground Loss By Distance and Field Size ((l)arge, (m)edium, (s)mall)

Field size is define here as small (five or less), medium (6 -9) or large (10+).

  ggplot(tchart, aes(grdlss))+
  scale_x_log10() +
  geom_density(fill="blue", color="blue") +
  facet_grid(fs ~ rdst_str, scale = "free") +
  labs(x=element_blank(), y= element_blank()) +
  theme_fivethirtyeight()

Correlation With Ground Loss (log10) - Post Position by Distance and Running Style

Our Post Position by distance and running style makes a strong argument for running style making a difference in ground loss.

  p <- ggplot(data = tchart, mapping = aes(x = pp, y = grdlss)) +
    geom_point(alpha = 0.2) +
    scale_y_log10() +
    scale_x_log10() +
    geom_smooth() +
    facet_grid(rs ~ rdst_str) +
    theme_fivethirtyeight()
 p

Correlation With Ground Loss (log10) - Field Size by Distance and Running Style

We will begin to answer questions about distribution, correlation. The answers to these questions will determine our ability to successfully model our response variable - ground loss. We will start this excercise by creating a data frames for our numeric and character variable

  p <- ggplot(data = tchart, mapping = aes(x = field, y = grdlss)) +
    geom_point(alpha = 0.2) +
    scale_y_log10() +
    scale_x_log10() +
    geom_smooth() +
    facet_grid(rs ~ rdst_str) +
    theme_fivethirtyeight()
 p

Model

Building Models

The modeling strategy was a many models approach. Specifically, nesting was utilized to create specific models at the intersection of running style and race distance. Traditional Linear (lm) and linear with spline (lm w/ ns()) were employed. The broom package is used to display modeling resuls:

Model 1

Model 1 is nested (a model for each nested group) linear model thas has been applied to the Running stlye and Race Distance Group. For example for each combination of running style and race distance, a model is calculated. The model is defined as:

lm(log(grdlss) ~ pp +log(field))

t1 <- tchart %>% 
  nest(-rs, -rdst_str) %>% 
  mutate(mod1 = map(data, ~ lm(log(grdlss) ~ pp +log(field) , data = .)),
         results = map(mod1, glance)) %>% 
  unnest(results) %>% 
  select(rdst_str, rs, r.squared, adj.r.squared, statistic, p.value)

Model 1 Results

datatable(t1)

ggplot(data = t1, aes(x = factor(rs), y = r.squared)) +
    geom_bar(stat = "identity", aes(fill =p.value)) +
    facet_grid(~rdst_str) +
    labs(x = "Running Style", y = expression(R^{2}))

Model 2

Similar Model 1, Model 2 is nest spline model thas has also been applied to the Running stlye and Race Distance Group. The model is defined as:

lm(log(grdlss) ~ ns(pp, df=5))

t2 <- tchart %>% 
  nest(-rs, -rdst_str) %>% 
  mutate(mod4 = map(data, ~ lm(log(grdlss) ~ ns(pp, df = 3), data = .)),
         results = map(mod4, glance)) %>% 
  unnest(results) %>% 
  select(rdst_str, rs, r.squared, adj.r.squared, statistic, p.value)

Model 2 Results

datatable(t2)

ggplot(data = t2, aes(x = factor(rs), y = r.squared)) +
    geom_bar(stat = "identity", aes(fill =p.value)) +
    facet_grid(~rdst_str) +
    labs(x = "Running Style", y = expression(R^{2}))

Communicate

The Findings

What are the answers to the questions posed at the beginning of this analysis.

Questions and Answers

Does field size (the number of horses in a race) play a role in ground loss?

Yes, I believe the box plots below support the premise that field size does implact ground loss. Additionally, it would appear that dispersion of ground loss is also positively correlated with field size.

p <- ggplot(tchart, aes(rs, grdlss)) +
     geom_boxplot() +
     facet_grid(rdst_str~ fs) +
    theme_fivethirtyeight()

p

Does running style have a positive or negative impact on ground loss?

It appears that running style can have both a postive and/or negative impact on ground loss. This is evident in the visualizations below, which may also reflect jockey tactics.

p <- ggplot(data = tchart, mapping = aes(x = field, y = grdlss)) +
    geom_point(alpha = 0.2) +
    scale_y_log10() +
    scale_x_log10() +
    geom_smooth() +
    facet_grid(rs ~ rdst_str) +
    theme_fivethirtyeight()
 p

Could a ground loss model be employed to gain a competitive advantage?

Yes, I belive a ground loss model could provide a competative advantage. Unfortunately, the models in this analysis are not that model. I do believe, however, that multiple models is the correct approach and that with more data and the correct model construction a robust value-added model is possible.