The Data

The data is horse racing past performance data. This is a challenging data set:

  • Data is stacked_ -The data set includes several types of data and is stacked within the file. For example, there are records for Starters, Past Performances, Race, Exoctic Wagers, Information, and Workouts. Each record type is designated by a record(S, P, I, E, W). The data is also relatively wide with more than 40 unnamed columns.
  • Missing Data - Some horses in field are international so there is incomplete data for these horse
  • Data Type Conversiion Required - Initially all of the data is character, therefore some data will need to be converted to other types (data, integer, etc.)
  • New Variables required - additional feilds will be required so mutate will be employed to create these fields
  • One to Many Data - There is a natural one to many relationship to this data so I will attempt to create a list column to reflect this

My Game Plan

My game plan for this data set follows:

  • Import Data - Use readr to import the data
  • Get To A Working Data Set - Remove record type and combine record type to get to a good starting point
  • __Create some helper functions to be used in the Wrangling / Transform step
  • Wrangling / Transform Data Create headers, variables, select and convert data
  • Remove Missing Remove the international horses with imcomplete data
  • Create List Column There is a natural 1 to Many (One horse has many past performances) to this data I will try to capture this in a tidy way
  • Analyze I will leverage the list column to calculate the desired metrics to answer the questions.

My Questions

1. What horses have the top 10 most wins?

2. What horses have the top 10 win percentage?

3. What horses have the top 10 in the money percentage?

Transform Data

The steps to transform the dataset are set forth below:

  1. Import Data

Utilize readr and and its col_names and col_types parameter to import the file. Column C1 captures the stacked nature of the file. The letter designation indicates the type of record (S,R, I, W, etc.)

## # A tibble: 4,871 x 46
##    c1    c2     c3    c4    c5    c6    c7    c8    c9    c10   c11   c12  
##    <chr> <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
##  1 H     Sarat~ SAR   2019~ 10    <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  2 R     Maiden 2019~ 1     Dirt  1 1/~ $90,~ (UP ~ 7     <NA>  <NA>  <NA> 
##  3 E     SAR    2019~ 1     "\"E~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  4 E     SAR    2019~ 1     Trif~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  5 E     SAR    2019~ 1     Supe~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  6 E     SAR    2019~ 1     Pick~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  7 E     SAR    2019~ 1     "Dou~ 1     2019~ SAR   Stre~ Stre~ "\"N~ Geld~
##  8 I     2019   Vet ~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  9 P     WO     Stre~ J Ca~ Jul ~ 95    9     Maid~ $65K  1 1/~ Synth Fast~
## 10 P     WO     Stre~ J Ca~ Jun ~ 94    10    Maid~ $95K  1 1/~ Turf  Firm~
## # ... with 4,861 more rows, and 34 more variables: c13 <chr>, c14 <chr>,
## #   c15 <chr>, c16 <chr>, c17 <chr>, c18 <chr>, c19 <chr>, c20 <chr>,
## #   c21 <chr>, c22 <chr>, c23 <chr>, c24 <chr>, c25 <chr>, c26 <chr>,
## #   c27 <chr>, c28 <chr>, c29 <chr>, c30 <chr>, c31 <chr>, c32 <chr>,
## #   c33 <chr>, c34 <chr>, c35 <chr>, c36 <chr>, c37 <chr>, c38 <chr>,
## #   c39 <chr>, c40 <chr>, c41 <chr>, c42 <chr>, c43 <chr>, c44 <chr>,
## #   c45 <chr>, c46 <chr>
  1. Get To A Good Starting Point

Now that the data is imported, I will get to a good starting point by filtering to the records that I will utilize (P and I), ensure a Horse name exists on each row and adding a record_id to preserve the order of the file and provide a good point of reference.

record_id horse
1 Street Talkin Guy
2 Street Talkin Guy
3 Street Talkin Guy
4 Street Talkin Guy
5 One Eyed Jack
6 Four Ten
  1. Create Helper Functions

Here I create two helper functions that will use in conjunction with mutate to create two new variables past_performance_code and furlongs (integer), Past performance code is a code that will uniquely identify a past performance line and furlongs is the distance of the race in furlongs ( a furlong is 1/8 of mile.)

get_pp_code <- function(dt_str, track, race, finish) {
  d <- str_trunc(dt_str,2, side = "left", ellipsis = "")
  m <- str_sub(dt_str,5,2)
  m <- month(ymd(dt_str),label=TRUE, abbr=TRUE)
  yr <- str_trunc(dt_str,4, side = "right", ellipsis = "")
  yr <- str_trunc(yr,2, side = "left", ellipsis = "")
  r <- str_pad(race, 2, "left", "0")
  p <- str_pad(finish,2, "left", "0")
  s <- "_"
  x <- str_c(d,m,yr,s,r,track,p, sep="")
}


get_furlongs <- function(d) {
  case_when(
    str_detect(d,'3 Furlongs') ~ 300,
    str_detect(d,'3 1/2 Furlongs') ~ 350, 
    str_detect(d,'3 3/4 Furlongs') ~ 375,
    str_detect(d,'4 Furlongs') ~ 400,
    str_detect(d,'4 1/4 Furlongs') ~ 425,
    str_detect(d,'4 1/2 Furlongs') ~ 450,
    str_detect(d,'5 Furlongs') ~ 500,
    str_detect(d,'About 5 Furlongs') ~ 500,
    str_detect(d,'5 1/4 Furlongs') ~ 525,
    str_detect(d,'5 Furlongs 80 Yards') ~ 536,
    str_detect(d,'5 1/2 Furlongs') ~ 550,
    str_detect(d,'About 5 1/2 Furlongs') ~ 550,
    str_detect(d,'6 Furlongs') ~ 600,
    str_detect(d,'About 6 Furlongs') ~ 600,
    str_detect(d,'6 1/2 Furlongs') ~ 650,
    str_detect(d,'About 6 1/2 Furlongs') ~ 650,
    str_detect(d,'7 Furlongs') ~ 700,
    str_detect(d,'About 7 Furlongs') ~ 700,
    str_detect(d,'7 1/2 Furlongs') ~ 750,
    str_detect(d,'1 Mile') ~ 800,
    str_detect(d,'About 1 Mile') ~ 800,
    str_detect(d,'1 Mile 40 Yards') ~ 818,
    str_detect(d,'About 1 Mile 40 Yards') ~ 818,
    str_detect(d,'1 Mile 70 Yards') ~ 832,
    str_detect(d,'About 1 Mile 70 Yards') ~ 832,
    str_detect(d,'1 1/16 Miles') ~ 850,
    str_detect(d,'About 1 1/16 Miles') ~ 850,
    str_detect(d,'1 1/8 Miles') ~ 900,
    str_detect(d,'1 3/16 Miles') ~ 950,
    str_detect(d,'1 1/4 Miles') ~ 1000,
    str_detect(d,'1 5/16 Miles') ~ 1040,
    str_detect(d,'1 5/16 Miles') ~ 1050,
    str_detect(d,'1 3/8 Miles') ~ 1100,
    str_detect(d,'1 7/16 Miles') ~ 1150,
    str_detect(d,'1 1/2 Miles') ~ 1200,
    str_detect(d,'1 9/16 Miles') ~ 1250,
    str_detect(d,'1 5/8 Miles') ~ 1300,
    str_detect(d,'1 11/16 Miles') ~ 1350,
    str_detect(d,'1 3/4 Miles') ~ 1400,
    str_detect(d,'1 13/16 Miles') ~ 1450,
    str_detect(d,'1 7/8 Miles') ~ 1500,
    str_detect(d,'1 15/16 Miles') ~ 1550,
    str_detect(d,'2 Miles') ~ 1600,
    TRUE ~ 0
  )
}
  1. Data Wrangling / Transform

In this step I will continue to build out the tidy data set by renaming columns (trk_cde, jockey, off_odds, dist_str), creating new variables with mutate (distance, pp_code) and converting variable types (post, finish and field to integer)

record_id horse pp_cde race_cls jockey off_odds distance surface surface_cond post field finish
1 Street Talkin Guy NA NA NA NA 0 NA NA NA NA NA
2 Street Talkin Guy 07Jul19_09WO02 Maiden Special Weight J Campbell 6-1 850 Synth Fast 7 1 8 2
3 Street Talkin Guy 08Jun19_10WO10 Maiden Special Weight J Campbell 8-1 850 Turf Firm 9 13 13 10
4 Street Talkin Guy 19May19_01WO03 Maiden Special Weight J Campbell 14-1 650 Synth Fast 8 1 7 3
5 One Eyed Jack 29Jun19_06BEL03 Maiden Special Weight J Ortiz 3-1 600 Dirt Fast 7 2 6 3
6 Four Ten 17Jul19_07SAR03 Maiden Special Weight J Alvarado 3-1 700 Dirt Sloppy 6 1 5 3
  1. Remove Row With Missing Data

A subset of the data includes international horses. These horses have incomplete data sets and will be removed from the data.

horse post field finish
Street Talkin Guy 1 8 2
Street Talkin Guy 13 13 10
Street Talkin Guy 1 7 3
One Eyed Jack 2 6 3
Four Ten 1 5 3
Four Ten 6 8 8
  1. Create A List Column

Now I should have a tibble with a data list column of past performances for each horse and summary stats of the past performances for each horse

## # A tibble: 77 x 9
##    horse      data      starts   win place  show avg_finish win_pct itm_pct
##    <chr>      <list>     <int> <dbl> <dbl> <dbl>      <dbl>   <dbl>   <dbl>
##  1 Noble Tho~ <tibble ~     24     8    14    17       2.83  0.333    0.708
##  2 First App~ <tibble ~     26     6    12    14       3.62  0.231    0.538
##  3 Irish Val~ <tibble ~     23     6    11    16       2.78  0.261    0.696
##  4 Mo Maveri~ <tibble ~     17     5     9    11       3     0.294    0.647
##  5 Dontblame~ <tibble ~     11     4     6     9       2.82  0.364    0.818
##  6 Super Dude <tibble ~     11     4     7     9       2.36  0.364    0.818
##  7 Dover Cli~ <tibble ~     18     4    10    10       3.44  0.222    0.556
##  8 Wantagh Q~ <tibble ~     47     4     9    15       5.15  0.0851   0.319
##  9 Cross Bor~ <tibble ~     16     4     7     8       3.62  0.25     0.5  
## 10 Free Kitty <tibble ~     22     4    12    17       3.14  0.182    0.773
## # ... with 67 more rows

Answers To Questions

1. What horses have the top 10 most wins?

horse starts win place show win_pct itm_pct avg_finish
Noble Thought 24 8 14 17 0.3333333 0.7083333 2.833333
First Appeal 26 6 12 14 0.2307692 0.5384615 3.615385
Irish Valor 23 6 11 16 0.2608696 0.6956522 2.782609
Mo Maverick 17 5 9 11 0.2941176 0.6470588 3.000000
Dontblamerocket 11 4 6 9 0.3636364 0.8181818 2.818182
Super Dude 11 4 7 9 0.3636364 0.8181818 2.363636
Dover Cliffs 18 4 10 10 0.2222222 0.5555556 3.444444
Wantagh Queen 47 4 9 15 0.0851064 0.3191489 5.148936
Cross Border 16 4 7 8 0.2500000 0.5000000 3.625000
Free Kitty 22 4 12 17 0.1818182 0.7727273 3.136364

2. What horses have the top win percentage?

horse starts win place show win_pct itm_pct avg_finish
Sister Peacock 7 4 6 7 0.5714286 1.0000000 1.571429
O’Keeffe 7 4 6 6 0.5714286 0.8571429 2.000000
Big Bennys Tribute 4 2 3 3 0.5000000 0.7500000 3.000000
South of France 7 3 5 6 0.4285714 0.8571429 2.428571
Missmizz 7 3 4 5 0.4285714 0.7142857 2.857143
Karama 5 2 4 4 0.4000000 0.8000000 2.600000
Skamania 8 3 4 4 0.3750000 0.5000000 3.000000
Dontblamerocket 11 4 6 9 0.3636364 0.8181818 2.818182
Super Dude 11 4 7 9 0.3636364 0.8181818 2.363636
Noble Thought 24 8 14 17 0.3333333 0.7083333 2.833333

3. What horses have the top in the money percentage?

horse starts win place show win_pct itm_pct avg_finish
Sister Peacock 7 4 6 7 0.5714286 1.0000000 1.571429
Graded On a Curve 3 1 2 3 0.3333333 1.0000000 2.000000
Chelsea Cloisters 6 1 4 6 0.1666667 1.0000000 2.166667
One Eyed Jack 1 0 0 1 0.0000000 1.0000000 3.000000
Majority Rules 2 0 2 2 0.0000000 1.0000000 2.000000
Morning Gold 1 0 1 1 0.0000000 1.0000000 2.000000
O’Keeffe 7 4 6 6 0.5714286 0.8571429 2.000000
South of France 7 3 5 6 0.4285714 0.8571429 2.428571
Abyssinian 6 2 4 5 0.3333333 0.8333333 2.333333
Dontblamerocket 11 4 6 9 0.3636364 0.8181818 2.818182