An Analysis of Trakus Race Distance Data With R.
Author: James Mundy
Date: December 11, 2019
In horse racing, the winner is not always the fastest horse. The distance run by each horse is also a key determinant. Historically, horseplayers have reflected ground loss in their pre-race analysis by making a somewhat generic adjustment to each horse that was likely to race wide.
This analysis seeks to develop a better understanding of ground loss and to answer the following questions: Does field size impact ground loss? Does running style impact ground loss? And could a ground loss model provide horseplayers with a competitive advantage?
The data phase of the project included screen scraping the data, writing the scraped data to csv files, importing the folder of csv files into a data frame, and reviewing the imported data for accuracy and completeness. The specific steps completed can be viewed below.
As you will see below, the data was in good shape overall and is ready for the data wrangling phase of our workflow.
The screen scraping script logs into the Trakus website and scrapes the race data specified in the script. The image below provides an example of the data that is scraped. For this project, I scraped all the races from Aqueduct race track from November 1 through December 8, 2019. This process provided more than 24,000 observations before filtering.
library(rjson)
library(httr)
library(XML)
library(stringr)
library(lubridate)
library(tidyverse)
# This script scrapes race data from Trakus T-Charts
#Login
login <- function(){
r <- POST("http://tnetwork.trakus.com/tnet/Login.aspx", body = list(TEXT_Username="JamesMundy",TEXT_Password="jamba1ayA",BUTTON_Submit="Log+In"))
a<-cookies(r)  # login response cookies; the positional indexing below is specific to the Trakus response
ASP.NET_SessionID <-a[7][[1]][1]
userCredentials <-a[7][[1]][2]
ASPAUTH <-a[7][[1]][3]
}
#Convert whole, fractional and mixed numbers (e.g. "3 1/4") to doubles
mixedToFloat <- function(x){
is.integer <- grepl("^\\d+$", x)
is.fraction <- grepl("^\\d+\\/\\d+$", x)
is.mixed <- grepl("^\\d+ \\d+\\/\\d+$", x)
#stopifnot(all(is.integer | is.fraction | is.mixed))
numbers <- strsplit(x, "[ /]")
ifelse(is.integer, as.numeric(sapply(numbers, `[`, 1)),
ifelse(is.fraction, as.numeric(sapply(numbers, `[`, 1)) /
as.numeric(sapply(numbers, `[`, 2)),
as.numeric(sapply(numbers, `[`, 1)) +
as.numeric(sapply(numbers, `[`, 2)) /
as.numeric(sapply(numbers, `[`, 3))))
}
#Convert beaten-margin text (Nose/Head/Neck or lengths such as "1 1/2") to a numeric number of lengths
getLengths <- function(x){
case_when(
x == "Neck" ~ 0.25,
x == "Head" ~ 0.1,
x == "Nose" ~ 0.05,
nchar(x) > 0 ~ mixedToFloat(x),
TRUE ~ 0
)
}
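# Quick illustrative checks of the two helpers above (assumed example inputs):
#   mixedToFloat(c("2", "1/2", "3 1/4"))        # -> 2.00 0.50 3.25
#   getLengths(c("Neck", "Head", "1 1/2", ""))  # -> 0.25 0.10 1.50 0.00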
#Map the track abbreviation to its Trakus VenueID, then log in and scrape the requested card
start <- function(data,race_date)
{
trk <- str_to_lower(data)
d <- mdy(race_date)
filedate <- str_replace_all(toString(d),"-","")
filename <- str_c(data,filedate,".csv", sep = '')
setwd(file.path("C:","Users", "mutue", "OneDrive", "Documents", "TimeForm", "Data", "TCharts"))
l <- list(CD = 18,BEL = 23,DM = 10,GP = 20,TAM = 21,AQU = 24,SA = 4,SAR = 25,KEE = 7)
cat(l[[data]],race_date)
login()
race_program(l[[data]],race_date,filename)
}
#Go to specific date and track
race_program <- function(V_ID,race_date,filename){
tchartConn <<-file(filename,"w")
url <-paste(c('http://tnetwork.trakus.com/tnet/t_RaceDay.aspx?VenueID=',toString(V_ID),'&Type=TBRED&Date=',toString(race_date)),collapse='')
r <- GET(url)
temp<-content(r, "text")
doc <- htmlParse(temp)
#Grab all event id
temp_event_id_list<-xpathSApply(doc,"//a[contains(@href,'t_Recap.aspx?EventID=')]/@href")
temp_event_id_list<-substr(temp_event_id_list,22,40)
event_id_list <- list()
counter<-1
#Scrape through each event
for( i in temp_event_id_list)
{
event_id_list[counter]<-i
counter<-counter+1
race_detail(i, tchartConn)
}
close(tchartConn, type ='w')
}
#Scrape the details for a single race across every segment (PostSelect range)
race_detail <- function(event_id, f){
url <- paste(c('http://tnetwork.trakus.com/tnet/t_Recap.aspx?EventID=',toString(event_id),'&PostSelect=0'),collapse='')
r <- GET(url)
temp<-content(r, "text")
doc <- htmlParse(temp)
race_info<-xpathSApply(doc,"//*[@class=\"recapTextHeader\"]/td/div")[[1]]
race_info<-toString(xmlValue(race_info))
#race_info<-paste(c(toString(xmlValue(race_info[1]$text)),toString(xmlValue(race_info[3]$i)),toString(xmlValue(race_info[5]$text)),toString(xmlValue(race_info[7]$text)),toString(xmlValue(race_info[9]$text))),collapse='|')
race_info <- str_replace_all(race_info,"(<).*?(>)","")
race_info <- str_replace_all(race_info, "(\\s{3,200})"," ")
track_name <- str_extract(race_info,'.+?(?=\\sRace)')
race_number <- str_extract(race_info,'(Race\\s\\d+)')
purse <- str_replace(str_extract(race_info, "(\\$\\d+)"),"(\\$)","")
race_date <- str_extract(race_info, "([A-Z][a-z]+\\s\\d+\\,\\s\\d{4})")
race_date <- mdy(race_date)
race_date <- toString(race_date)
distance <- str_extract(race_info,'(\\d\\sFurlongs|\\d\\.\\d Furlongs|\\d\\s+\\d\\/\\d{1,2}\\sMiles)')
surface <- str_extract(race_info,'(Dirt|Turf|Sythetic|Outer Turf|Inner Turf)')
final_select<-length(xpathSApply(doc,"//*[@id=\"PostSelect\"]/option"))
print(final_select)
#Loop through each segment
for(range_select in 2:final_select)
{
url <- paste(c('http://tnetwork.trakus.com/tnet/t_Recap.aspx?EventID=',toString(event_id),'&PostSelect=',toString(range_select)),collapse='')
print(url)
r <- GET(url)
temp<-content(r, "text")
doc <- htmlParse(temp)
horse_num <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[2]/img",xmlAttrs)
horse_num <- horse_num[3,]
horse_num <- str_sub(horse_num,-6,-4)
horse_num <- str_extract(horse_num,'[0-9ABC]{1,2}')
horse_name <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[3]",xmlValue)
horse_time <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[4]",xmlValue)
horse_dist <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[5]",xmlValue)
horse_rail <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[6]",xmlValue)
horse_avg <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[7]",xmlValue)
horse_ahead <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[8]",xmlValue)
horse_ahead <- str_replace(horse_ahead,'(\\dP\\d:)',"")
horse_ahead <- str_replace(horse_ahead,'(\\()',"")
horse_ahead <- str_replace(horse_ahead,'(\\))',"")
horse_ahead <- getLengths(horse_ahead)
# getLengths() already returns 0 for blank cells; guard any values it could not parse
horse_ahead <- ifelse(is.na(horse_ahead), 0, horse_ahead)
horse_cu_time <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[11]",xmlValue)
horse_cu_peak <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[12]",xmlValue)
horse_cu_dist <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[13]",xmlValue)
horse_cu_delta <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[14]",xmlValue)
horse_cu_delta <- str_replace_all(horse_cu_delta,'---','0')
horse_cu_avg <-xpathSApply(doc,"//tr[@bgcolor=\"white\"]/td[15]",xmlValue)
for(count in 1:length(horse_name))
{
cat(track_name, race_number, purse, race_date, distance, surface, count, range_select, horse_num[count], horse_name[count], horse_time[count],horse_dist[count],horse_rail[count],horse_avg[count],horse_ahead[count],horse_cu_time[count],horse_cu_peak[count],horse_cu_dist[count],horse_cu_delta[count],horse_cu_avg[count],'\n')
writeLines(paste(c(track_name, race_number, purse, race_date, distance, surface, count, range_select, horse_num[count], horse_name[count], horse_time[count],horse_dist[count],horse_rail[count],horse_avg[count],horse_ahead[count],horse_cu_time[count],horse_cu_peak[count],horse_cu_dist[count],horse_cu_delta[count],horse_cu_avg[count]),collapse = ','), f)
}
}
}
start("AQU","12/8/2019")Map and read_csv functions are used in conjunction to load all the csv files in the TChars folders. Files are then reduced with rbind into a single data frame - tcharts.
files <- dir("C:/Users/mutue/OneDrive/Documents/TimeForm/Data/TCharts/", full.names = T)
tchart <- files %>%
  # read each file individually with read_csv() from the readr package
  map(read_csv, col_names = c("trk", "rnbr", "rprs", "rdt", "rdst", "srf", "pos", "sxt", "pst", "hn",
                              "st", "sdst", "rf", "samph", "sla", "ct", "cpk", "cdst", "cdlt", "camph")) %>%
  # reduce with rbind into one data frame
  reduce(rbind)
The skim function is an alternative to the summary function. It displays most of the numerical attributes from summary, but it also displays missing values, more quantile information and an inline histogram for each variable.
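The data summary below comes from the skimr package; a call along these lines produces it:
library(skimr)
skim(tchart)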
| Name | tchart |
| Number of rows | 24184 |
| Number of columns | 20 |
| _______________________ | |
| Column type frequency: | |
| character | 7 |
| Date | 1 |
| numeric | 12 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| trk | 0 | 1 | 8 | 8 | 0 | 1 | 0 |
| rnbr | 0 | 1 | 6 | 7 | 0 | 10 | 0 |
| rdst | 0 | 1 | 10 | 12 | 0 | 8 | 0 |
| srf | 0 | 1 | 4 | 10 | 0 | 3 | 0 |
| pst | 0 | 1 | 1 | 2 | 0 | 17 | 0 |
| hn | 0 | 1 | 3 | 23 | 0 | 1402 | 0 |
| ct | 0 | 1 | 4 | 7 | 0 | 7654 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| rdt | 0 | 1 | 2019-11-01 | 2019-12-08 | 2019-11-16 | 24 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| rprs | 0 | 1 | 69020.14 | 48369.81 | 28000.00 | 41000.00 | 62000.00 | 70000.00 | 400000.00 | ▇▁▁▁▁ |
| pos | 0 | 1 | 4.94 | 2.85 | 1.00 | 3.00 | 5.00 | 7.00 | 14.00 | ▇▇▃▂▁ |
| sxt | 0 | 1 | 8.92 | 4.55 | 2.00 | 5.00 | 9.00 | 12.00 | 25.00 | ▇▇▅▂▁ |
| st | 0 | 1 | 6.17 | 0.47 | 3.49 | 5.85 | 6.14 | 6.43 | 12.28 | ▁▇▁▁▁ |
| sdst | 0 | 1 | 334.24 | 3.73 | 323.00 | 332.00 | 333.00 | 336.00 | 419.00 | ▇▁▁▁▁ |
| rf | 0 | 1 | 11.78 | 8.28 | 1.60 | 5.50 | 10.30 | 16.40 | 107.80 | ▇▁▁▁▁ |
| samph | 0 | 1 | 37.11 | 2.63 | 18.60 | 35.50 | 37.20 | 38.90 | 64.90 | ▁▇▇▁▁ |
| sla | 0 | 1 | 1.20 | 2.15 | 0.00 | 0.25 | 0.75 | 1.50 | 66.00 | ▇▁▁▁▁ |
| cpk | 0 | 1 | 39.93 | 2.26 | 23.70 | 39.00 | 40.30 | 41.40 | 64.90 | ▁▇▇▁▁ |
| cdst | 0 | 1 | 2644.66 | 1522.65 | 323.00 | 1329.00 | 2659.00 | 3689.00 | 8099.00 | ▇▇▅▂▁ |
| cdlt | 0 | 1 | 2.92 | 14.11 | -113.00 | -1.00 | 0.00 | 8.00 | 117.00 | ▁▁▇▁▁ |
| camph | 0 | 1 | 37.39 | 1.61 | 23.70 | 36.50 | 37.50 | 38.50 | 42.60 | ▁▁▁▇▂ |
The glimpse function displays a vertical preview of the dataset, making it easy to check each variable's data type and see sample values.
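The preview below comes from a glimpse call along these lines:
glimpse(tchart)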
## Observations: 24,184
## Variables: 20
## $ trk <chr> "Aqueduct", "Aqueduct", "Aqueduct", "Aqueduct", "Aqueduc...
## $ rnbr <chr> "Race 1", "Race 1", "Race 1", "Race 1", "Race 1", "Race ...
## $ rprs <dbl> 28000, 28000, 28000, 28000, 28000, 28000, 28000, 28000, ...
## $ rdt <date> 2019-11-01, 2019-11-01, 2019-11-01, 2019-11-01, 2019-11...
## $ rdst <chr> "6.5 Furlongs", "6.5 Furlongs", "6.5 Furlongs", "6.5 Fur...
## $ srf <chr> "Dirt", "Dirt", "Dirt", "Dirt", "Dirt", "Dirt", "Dirt", ...
## $ pos <dbl> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5,...
## $ sxt <dbl> 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4,...
## $ pst <chr> "7", "3", "4", "2", "5", "6", "1", "7", "3", "4", "6", "...
## $ hn <chr> "Ma Meatloaf", "Shanghai Bonnie", "Eight Oaks", "Smokin ...
## $ st <dbl> 6.49, 6.51, 6.56, 6.80, 6.86, 6.89, 6.97, 5.45, 5.45, 5....
## $ sdst <dbl> 332, 332, 331, 331, 332, 332, 333, 332, 333, 333, 332, 3...
## $ rf <dbl> 28.1, 16.7, 19.9, 10.4, 30.4, 35.2, 6.3, 19.8, 12.8, 13....
## $ samph <dbl> 34.9, 34.7, 34.4, 33.2, 33.0, 32.9, 32.5, 41.5, 41.7, 40...
## $ sla <dbl> 0.25, 0.25, 1.75, 0.25, 0.25, 0.50, 0.00, 0.25, 1.00, 2....
## $ ct <chr> "6.49", "6.51", "6.56", "6.80", "6.86", "6.89", "6.97", ...
## $ cpk <dbl> 34.9, 34.7, 34.4, 33.2, 33.0, 32.9, 32.5, 41.5, 41.7, 40...
## $ cdst <dbl> 332, 332, 331, 331, 332, 332, 333, 664, 665, 664, 664, 6...
## $ cdlt <dbl> 0, 0, -1, -1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 2, 2, 1, ...
## $ camph <dbl> 34.9, 34.7, 34.4, 33.2, 33.0, 32.9, 32.5, 37.9, 37.9, 37...
The vis_dat function (from the visdat package) is a great way to visualize the data types and missing data within a data frame.
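A call along these lines produces the plot:
library(visdat)
vis_dat(tchart)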
In the data wrangling portion of my workflow, I utilize several tidyverse packages and functions to change data types, rename variables, sort, filter, select and create variables. New variables added include ground loss (grdlss), segment and cumulative feet per second (sfps, cfps), field size (field), running style (rs), a numeric race distance (rdst), an adjusted post position (pp), a post position by field size term (ppfld), and a field size category (fs).
Factors were created using the forcats package for the following variables: surface (srf), running style (rs) and race distance (rdst_str).
Finally, the skim and vis_dat functions were utilized a second time to review the data, and the modified tibble was rendered to a data table for additional review.
Click the Code button to see data wrangling code and comments.
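The wrangling chunk below references a time-conversion helper (getSecs) and three factor-level vectors (surface_levels, rs_levels, distance_levels) that are defined elsewhere in the script. A minimal sketch of plausible definitions, inferred from the data shown above, follows; the author's actual definitions may differ.
# Assumed definitions - inferred from the data, not the author's originals
surface_levels  <- c("Dirt", "Turf", "Outer Turf", "Inner Turf")
rs_levels       <- c("Ldr", "EP", "MP", "Clsr", "Z")
distance_levels <- c("6 Furlongs", "6.5 Furlongs", "7 Furlongs", "8 Furlongs")
# Convert a "m:ss.ss" cumulative time string to seconds
getSecs <- function(x) {
  parts <- str_split_fixed(x, ":", 2)
  as.numeric(parts[, 1]) * 60 + as.numeric(parts[, 2])
}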
tchart <- tchart %>%
#Convert variable to double
mutate(rnbr = as.double(str_extract(rnbr, "(\\d+)"))) %>%
mutate(rprs = as.double(str_extract(rprs, "(\\d+)"))) %>%
mutate(rf = as.double(rf)) %>%
#Adjusting sxt variable to start from 1 instead of 2 - represents each 1/16th of the race.
mutate(sxt = sxt-1) %>%
#Creating Ground Loss, Feet Per Second
mutate(grdlss = (cdst- (sxt * 330))) %>%
mutate(sfps = sdst / st) %>%
mutate(sla = if_else(is.na(sla),0,sla)) %>%
#Converting cummulative time to seconds
mutate(ct = if_else(nchar(ct)>6,getSecs(ct),as.numeric(ct))) %>%
#Calculate cum feet per second
mutate(cfps = as.numeric(cdst) / as.numeric(ct)) %>%
group_by(trk, rdt, rnbr, sxt) %>%
mutate(field = max(pos)) %>%
mutate(bf = cumsum(sla) - sla) %>%
mutate(bf = bf * 9) %>%
mutate(rsv = bf/max(bf)) %>%
ungroup() %>%
#Start of Running Style creation
group_by(trk, rdt, rnbr) %>%
mutate(rsv = if_else(max(sxt)<16 & sxt==4, rsv,NA_real_)) %>%
mutate(rsv = if_else(max(sxt)>=16 & sxt==8, bf/max(bf),rsv)) %>%
mutate(rsbf = if_else(max(sxt)<16 & sxt==4, bf,NA_real_)) %>%
mutate(rsbf = if_else(max(sxt)>=16 & sxt==8, bf,rsbf)) %>%
ungroup() %>%
group_by(trk, rdt, rnbr, sxt) %>%
mutate(rail_rank = min_rank(rf)) %>%
mutate(has_entry = if_else(str_detect(pst,"([A-Z])"),1L,NA_integer_)) %>%
mutate(rail_rank = if_else(sxt == 1, rail_rank, NA_integer_)) %>%
arrange(desc(has_entry)) %>%
fill(has_entry) %>%
arrange(sxt) %>%
ungroup() %>%
#Calculate Running Style For Horses
group_by(trk, rdt, rnbr, hn) %>%
fill(rail_rank) %>%
arrange(trk, rdt, rnbr, hn, rsv) %>%
fill(rsv) %>%
fill(rsbf) %>%
#Completing Running Style Calculation
mutate(rs = case_when(
rsbf < 5.00 ~ "Ldr",
rsv <= 0.333 ~ "EP",
rsv <=0.66 ~ "MP",
rsv <=1 ~ "Clsr",
TRUE ~ "Z")) %>%
ungroup() %>%
arrange(trk, rdt, rnbr,sxt,bf) %>%
group_by(trk, rdt, rnbr) %>%
filter(sxt==max(sxt)) %>%
#Create Race distance as string
rename(rdst_str = rdst) %>%
#Create race distance as numeric
mutate(rdst = sxt * 330) %>%
mutate(rtyp = if_else(max(sxt)<16,"S","R")) %>%
#Adjusting Post Position When there is an Entry 1A
mutate(pp = if_else(has_entry==1,rail_rank, as.integer(pst))) %>%
mutate(pp = if_else(is.na(pp),as.integer(str_trim(pst)),pp)) %>%
ungroup() %>%
# Create Factors
mutate(srf = factor(srf,levels=surface_levels)) %>%
mutate(rs = factor(rs, levels=rs_levels)) %>%
mutate(rdst_str = factor(rdst_str, levels = distance_levels)) %>%
mutate(ppfld = if_else(pp <= field,log10(pp*field),0)) %>%
mutate(fs = if_else(field <=5, "s",if_else(field>9,"l","m"))) %>%
#Exclude turf races because of relatively small sample this time of year
filter(srf == "Dirt") %>%
#Exclude race greater than 1 mile. Not many races greater than a mile were run in November 2019
filter(rdst <= 5280) %>%
filter(pp > 0) %>%
select(trk, rdt, rnbr, rdst_str, rdst, srf, field, pp, hn, pos, rs, ppfld, grdlss, cdst, fs)
Review the data again after data wrangling has been completed.
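As described above, the post-wrangling review uses skim and vis_dat a second time, plus an interactive table; the DT::datatable call here is an assumption based on that description:
skim(tchart)
vis_dat(tchart)
DT::datatable(tchart)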
| Name | tchart |
| Number of rows | 887 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| Date | 1 |
| factor | 3 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| trk | 0 | 1 | 8 | 8 | 0 | 1 | 0 |
| hn | 0 | 1 | 4 | 22 | 0 | 749 | 0 |
| fs | 0 | 1 | 1 | 1 | 0 | 3 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| rdt | 0 | 1 | 2019-11-01 | 2019-12-08 | 2019-11-17 | 24 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| rdst_str | 0 | 1 | FALSE | 4 | 6 F: 369, 8 F: 260, 6.5: 167, 7 F: 91 |
| srf | 0 | 1 | FALSE | 1 | Dir: 887, Tur: 0, Out: 0 |
| rs | 0 | 1 | FALSE | 4 | EP: 344, Ldr: 197, MP: 188, Cls: 158 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| rnbr | 0 | 1 | 4.93 | 2.82 | 1 | 2.00 | 5.00 | 7.00 | 10.00 | ▇▇▆▆▅ |
| rdst | 0 | 1 | 4476.76 | 556.09 | 3960 | 3960.00 | 4290.00 | 5280.00 | 5280.00 | ▇▃▂▁▆ |
| field | 0 | 1 | 8.03 | 2.07 | 4 | 6.00 | 8.00 | 9.00 | 14.00 | ▆▇▆▃▁ |
| pp | 0 | 1 | 5.21 | 3.09 | 1 | 3.00 | 5.00 | 7.00 | 16.00 | ▇▆▃▁▁ |
| pos | 0 | 1 | 4.52 | 2.59 | 1 | 2.00 | 4.00 | 6.00 | 14.00 | ▇▇▃▂▁ |
| ppfld | 0 | 1 | 1.31 | 0.57 | 0 | 1.08 | 1.45 | 1.69 | 2.29 | ▂▂▃▇▂ |
| grdlss | 0 | 1 | 62.26 | 15.25 | 35 | 51.00 | 61.00 | 72.00 | 172.00 | ▇▆▁▁▁ |
| cdst | 0 | 1 | 4539.02 | 559.40 | 3995 | 4021.00 | 4349.00 | 5332.00 | 5383.00 | ▇▃▂▁▆ |
Box plots are used to identify outliers so they can be removed from the data set before we advance to the explore phase of the workflow.
p <- ggplot(tchart, aes(x=rdst_str, y=grdlss, fill=rdst_str)) +
geom_boxplot(alpha=0.7) +
stat_summary(fun.y=mean, geom="point", shape=20, size=5, color="red", fill="red") +
theme(legend.position="none") +
ylim(0, 150) +
theme_fivethirtyeight()
p + theme(legend.position = "none")
tchart <- tchart %>%
mutate(outlier = if_else((rdst_str=="6 Furlongs" | rdst_str == "6.5 Furlongs") & grdlss > 100, 1,0)) %>%
filter(outlier == 0)
p <- ggplot(tchart, aes(x=rdst_str, y=grdlss, fill=rdst_str)) +
geom_boxplot(alpha=0.7) +
stat_summary(fun.y=mean, geom="point", shape=20, size=5, color="red", fill="red") +
theme(legend.position="none") +
ylim(0, 150) +
theme_fivethirtyeight()
p + theme(legend.position = "none")
We will begin to answer questions about distribution and correlation. The answers to these questions have implications for our modeling phase and should begin to answer the questions on the impact of field size and running style on ground loss.
We will start the explore phase with density plots of the response variable, ground loss. Does this variable have a normal distribution?
ggplot(tchart, aes(grdlss))+
scale_x_log10() +
geom_density(fill="blue", color="blue") +
facet_grid(rs ~ rdst_str, scale = "free") +
labs(x=element_blank(), y= element_blank()) +
theme_fivethirtyeight()
Next, the same density plots are faceted by post position rather than running style.
ggplot(tchart, aes(grdlss))+
scale_x_log10() +
geom_density(fill="blue", color="blue") +
facet_grid(pp ~ rdst_str, scale = "free") +
labs(x=element_blank(), y= element_blank()) +
theme_fivethirtyeight()
Field size is defined here as small (five or fewer), medium (6-9) or large (10+).
ggplot(tchart, aes(grdlss))+
scale_x_log10() +
geom_density(fill="blue", color="blue") +
facet_grid(fs ~ rdst_str, scale = "free") +
labs(x=element_blank(), y= element_blank()) +
theme_fivethirtyeight()
The plot of ground loss against post position, faceted by distance and running style, makes a strong argument that running style makes a difference in ground loss.
p <- ggplot(data = tchart, mapping = aes(x = pp, y = grdlss)) +
geom_point(alpha = 0.2) +
scale_y_log10() +
scale_x_log10() +
geom_smooth() +
facet_grid(rs ~ rdst_str) +
theme_fivethirtyeight()
p
We will continue to answer questions about distribution and correlation. The answers to these questions will determine our ability to successfully model our response variable, ground loss. We will start this exercise by creating data frames for our numeric and character variables.
p <- ggplot(data = tchart, mapping = aes(x = field, y = grdlss)) +
geom_point(alpha = 0.2) +
scale_y_log10() +
scale_x_log10() +
geom_smooth() +
facet_grid(rs ~ rdst_str) +
theme_fivethirtyeight()
p
The modeling strategy was a many-models approach. Specifically, nesting was utilized to create specific models at the intersection of running style and race distance. Traditional linear models (lm) and linear models with a spline term (lm with ns()) were employed. The broom package is used to display the modeling results:
Model 1 is a nested linear model (a model for each nested group) applied to the running style and race distance groups; for each combination of running style and race distance, a separate model is calculated, as sketched below.
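A minimal sketch of the nested approach, assuming ground loss is regressed on post position (grdlss ~ pp is an assumption rather than the confirmed formula), looks like this:
library(broom)
t1 <- tchart %>%
  group_by(rs, rdst_str) %>%                 # one group per running style x distance combination
  nest() %>%                                 # list-column of per-group data
  mutate(model   = map(data, ~ lm(grdlss ~ pp, data = .x)),   # assumed formula
         glanced = map(model, glance)) %>%   # r.squared, p.value, etc. for each model
  unnest(glanced) %>%
  ungroup()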
ggplot(data = t1, aes(x = factor(rs), y = r.squared)) +
geom_bar(stat = "identity", aes(fill =p.value)) +
facet_grid(~rdst_str) +
labs(x = "Running Style", y = expression(R^{2}))Similar Model 1, Model 2 is nest spline model thas has also been applied to the Running stlye and Race Distance Group. The model is defined as:
What are the answers to the questions posed at the beginning of this analysis?
Yes, I believe the box plots below support the premise that field size does impact ground loss. Additionally, it would appear that the dispersion of ground loss is also positively correlated with field size.
p <- ggplot(tchart, aes(rs, grdlss)) +
geom_boxplot() +
facet_grid(rdst_str~ fs) +
theme_fivethirtyeight()
p
It appears that running style can have either a positive or a negative impact on ground loss. This is evident in the visualizations below, which may also reflect jockey tactics.
p <- ggplot(data = tchart, mapping = aes(x = field, y = grdlss)) +
geom_point(alpha = 0.2) +
scale_y_log10() +
scale_x_log10() +
geom_smooth() +
facet_grid(rs ~ rdst_str) +
theme_fivethirtyeight()
p
Yes, I believe a ground loss model could provide a competitive advantage. Unfortunately, the models in this analysis are not that model. I do believe, however, that the many-models approach is correct and that, with more data and the right model construction, a robust value-added model is possible.