Data Preparation

# Load the data (adjust the working directory to wherever ProjectData.csv lives)
setwd('\\Ragozin\\Files\\csv\\')
dat <- read.csv(file = 'ProjectData.csv', stringsAsFactors = TRUE)
str(dat)
## 'data.frame':    1931 obs. of  31 variables:
##  $ RecordID     : int  1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 ...
##  $ Nbr          : int  15 14 13 12 11 10 9 8 7 6 ...
##  $ Hrse         : Factor w/ 140 levels "ABELTASMAN","ALMANAAR",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Gndr         : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Age          : int  4 4 4 4 3 3 3 3 3 3 ...
##  $ Top          : int  175 175 375 375 375 375 375 375 500 500 ...
##  $ Outcome      : Factor w/ 4 levels "E","P","T","X": 4 3 2 4 2 1 4 3 1 1 ...
##  $ Outcome1     : Factor w/ 47 levels "E0","E1","E2",..: 25 14 7 22 7 3 40 9 3 5 ...
##  $ Figure       : Factor w/ 1732 levels "- 1- VwAWBE 9",..: 859 946 1025 790 1022 1088 1350 1028 1722 1343 ...
##  $ FgrVle       : int  1500 175 375 1350 375 550 925 375 750 925 ...
##  $ Trnr         : Factor w/ 103 levels "AAn","ADr","AFe",..: 10 10 10 10 10 10 10 10 10 10 ...
##  $ TrkCde       : Factor w/ 57 levels "","Ai","AP","AQ",..: 47 48 5 7 11 45 48 5 7 47 ...
##  $ Srfce        : Factor w/ 2 levels "Dirt","Turf": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Wthr         : Factor w/ 6 levels "","Big Wind",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Trble        : Factor w/ 2 levels "BIG","SML": NA NA NA 2 NA 2 NA NA 2 NA ...
##  $ SmrtMny4     : Factor w/ 1 level "MONY": NA NA NA NA NA NA NA NA NA NA ...
##  $ isTop        : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ TopType      : int  1325 -200 0 975 0 175 550 -125 250 425 ...
##  $ Move         : int  1325 -200 0 975 0 175 550 -125 250 425 ...
##  $ SoundnessFlag: int  1 0 0 0 1 0 0 0 0 0 ...
##  $ RACE_DATE    : Factor w/ 608 levels "2015-01-08 00:00:00",..: 598 577 528 506 425 401 355 329 309 293 ...
##  $ REST         : int  36 77 36 182 41 62 43 36 27 35 ...
##  $ DaysSinceTop : int  36 441 364 328 146 105 43 98 62 35 ...
##  $ P12          : Factor w/ 16 levels "EE","EP","ET",..: 10 8 14 5 4 15 9 1 3 9 ...
##  $ P12D         : Factor w/ 449 levels "E0E2","E0E3",..: 196 141 278 127 56 374 175 37 91 192 ...
##  $ P123         : Factor w/ 63 levels "EEE","EEP","EET",..: 40 30 53 20 15 57 33 3 9 35 ...
##  $ P123D        : Factor w/ 1201 levels "E0E2E2","E0E2E5",..: 718 553 895 481 208 1053 622 124 371 691 ...
##  $ LEVEL        : int  375 550 550 550 750 750 750 750 925 950 ...
##  $ WITHLASIX    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LEVLESSTOP   : int  200 375 175 175 375 375 375 375 425 450 ...
##  $ FVz          : num  0.594 -1.397 -1.097 0.369 -1.097 ...

Research question

Are three-race past performance patterns, rest, and certain categorical data (gender, age, racing surface, etc.) predictive of a horse’s next-out performance in thoroughbred horse racing?

Cases

Each case will include the three past performance results/patterns (speed figures) for a thoroughbred horse, the amount of rest between races, and categorical data including gender, age, and race surface. The ultimate set of variables will be determined after some preliminary exploratory analysis. I estimate approximately 2000 total cases in the data set.
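As a rough illustration of how a three-race pattern could be assembled, the sketch below pastes together the outcomes of the three races that precede each row. It assumes rows are ordered most recent race first within each horse, which is what the first rows of the str output suggest (the first P123 value lines up with the outcomes of the next three rows). The toy data frame, the lead_by() helper, and the P123_sketch column are hypothetical names used only for this example.

```r
# Toy data: one hypothetical horse, rows ordered most recent race first.
toy <- data.frame(
  Hrse    = rep("HORSE_A", 6),
  Nbr     = 6:1,
  Outcome = c("X", "T", "P", "X", "P", "E"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: shift a vector up by k positions, padding the tail with NA.
lead_by <- function(x, k) c(x, rep(NA, k))[seq_along(x) + k]

# Paste the outcomes of the three prior races (most recent first).
# Rows without three prior races get NA.
build_pattern <- function(outcome) {
  p1 <- lead_by(outcome, 1)
  p2 <- lead_by(outcome, 2)
  p3 <- lead_by(outcome, 3)
  ifelse(is.na(p3), NA, paste0(p1, p2, p3))
}

# Apply the pattern builder separately within each horse.
toy$P123_sketch <- ave(toy$Outcome, toy$Hrse, FUN = build_pattern)
print(toy)
```

Rows near the start of a horse's career have fewer than three prior races, so their pattern is left missing rather than padded.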

Data collection

The data was self-collected from Ragozin Speed Figures (http://thesheets.com) for the 2018 Breeders' Cup races run at Churchill Downs on November 3, 2018. The data comprises the past performances of 148 thoroughbred race horses.

Type of study

This is an observational study.

Data Source

The data is self-collected from Ragozin Speed Figures (http://thesheets.com).

Dependent Variable

The dependent variable is isTop, a binary variable that represents whether a horse ran a new top in the past performance race. It takes on two values: 1 = a new Top and 0 = no new Top. A Top is the fastest (lowest) speed figure a horse has earned in its career. The binary nature of the dependent variable lends itself to a logit regression model.

Independent Variables

Per the str output above, the data set includes 30 other variables that are candidates for independent variables. Summary information for some of the more pertinent variables follows:

  1. Level - Median speed figure value
  2. P123 - Three-race pattern
  3. P123D - Three-race pattern with more detail, therefore more sparse
  4. Rest - Time since last race
  5. Age - Generally 2 through 7
  6. Gender - Male or female
  7. Racing Surface - Dirt or turf
  8. Soundness Flag - Indicates whether a horse may have soundness issues (slow start, boring in or out, etc.)
  9. WithLASIX - Indicates whether the horse received Lasix for the first time within the last four races (Lasix helps a horse breathe better)
  10. Time-weighted move - A move is the performance relative to the current top
  11. FVz - The z-score of the horse's past performance figure value
  12. Trainer - The initials of the horse's trainer
  13. Outcome - Result of the race as a categorical pattern
  14. Outcome1 - More detailed version of Outcome
  15. Move - The integer outcome of the race
  16. Additional variables are available for use
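To make the planned logit regression concrete, here is a minimal sketch of how isTop could be modeled with glm. It runs on synthetic data, not on ProjectData.csv: the sim data frame and its randomly generated values are assumptions for illustration, and the choice of predictors is only a guess at what the exploratory analysis might support. The 0.32 success rate roughly mirrors the isTop mean in the summary output.

```r
set.seed(1)
n <- 200

# Synthetic stand-in for a few candidate predictors (not the real data).
sim <- data.frame(
  REST  = sample(6:200, n, replace = TRUE),
  Age   = sample(2:7, n, replace = TRUE),
  Gndr  = factor(sample(c("F", "M"), n, replace = TRUE)),
  Srfce = factor(sample(c("Dirt", "Turf"), n, replace = TRUE))
)

# Synthetic binary response: 1 = new Top, 0 = no new Top.
sim$isTop <- rbinom(n, size = 1, prob = 0.32)

# Logit regression: binomial family with the logit link.
fit <- glm(isTop ~ REST + Age + Gndr + Srfce,
           data = sim, family = binomial(link = "logit"))
print(summary(fit))

# Fitted probabilities of a new Top for each synthetic case.
p <- predict(fit, type = "response")
```

Because the synthetic response is generated independently of the predictors, the coefficients here carry no meaning; the block only shows the shape of the call.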

Relevant summary statistics

The following section provides summary statistics and charts of my dataset. In addition to the summary function, I’ve used various charts to plot outcomes against independent variables. The outcomes are: T - Top, E - Effort, P - Paired, and X - a race 5 points away from top. This information will be used to shape my logit regression model.

summary(dat)
##     RecordID           Nbr                    Hrse      Gndr    
##  Min.   :   1.0   Min.   : 1.000   DISCREETLOVER:  44   F: 735  
##  1st Qu.: 496.5   1st Qu.: 4.000   RICHARDSBOY  :  32   M:1196  
##  Median : 983.0   Median : 8.000   BUCCHERO     :  28           
##  Mean   : 993.1   Mean   : 9.126   WARRIORSCLUB :  28           
##  3rd Qu.:1490.5   3rd Qu.:13.000   CHANTELINE   :  26           
##  Max.   :2022.0   Max.   :44.000   HUNT         :  26           
##                                    (Other)      :1747           
##       Age             Top         Outcome    Outcome1          Figure    
##  Min.   :2.000   Min.   :-100.0   E:738   E2     :287   #NAME?    :  95  
##  1st Qu.:3.000   1st Qu.: 500.0   P:185   T0     :199   ~= 12 FR  :   6  
##  Median :3.000   Median : 750.0   T:625   P0     :185   ~^= 11 IRE:   5  
##  Mean   :3.548   Mean   : 808.1   X:383   E4     :151   ~= 10 GB  :   5  
##  3rd Qu.:4.000   3rd Qu.:1075.0           T2     :147   ~= 6 GB   :   5  
##  Max.   :7.000   Max.   :2875.0           E3     :129   ~= 7 FR   :   5  
##                                           (Other):833   (Other)   :1810  
##      FgrVle          Trnr          TrkCde     Srfce             Wthr     
##  Min.   :-100   PMr    :  88   SA     :247   Dirt:1116            :   3  
##  1st Qu.: 750   CCB    :  80   BE     :202   Turf: 815   Big Wind :  88  
##  Median :1000   BBt    :  64   CD     :178               Clear    :1756  
##  Mean   :1105   DOl    :  62   Sr     :170               Huge Wind:  34  
##  3rd Qu.:1325   SMA    :  60   DM     :126               rain     :  48  
##  Max.   :9900   (Other):1344   KE     :110               snow     :   2  
##                 NA's   : 233   (Other):898                               
##   Trble      SmrtMny4        isTop           TopType       
##  BIG :  41   MONY:  22   Min.   :0.0000   Min.   :-1525.0  
##  SML : 208   NA's:1909   1st Qu.:0.0000   1st Qu.:  -25.0  
##  NA's:1682               Median :0.0000   Median :  150.0  
##                          Mean   :0.3237   Mean   :  234.7  
##                          3rd Qu.:1.0000   3rd Qu.:  425.0  
##                          Max.   :1.0000   Max.   : 9450.0  
##                                                            
##       Move         SoundnessFlag                   RACE_DATE   
##  Min.   :-1525.0   Min.   :0.00000   2018-08-25 00:00:00:  31  
##  1st Qu.:  -25.0   1st Qu.:0.00000   2018-05-05 00:00:00:  29  
##  Median :  150.0   Median :0.00000   2017-10-07 00:00:00:  24  
##  Mean   :  234.7   Mean   :0.05334   2017-11-04 00:00:00:  24  
##  3rd Qu.:  425.0   3rd Qu.:0.00000   2018-10-06 00:00:00:  23  
##  Max.   : 9450.0   Max.   :1.00000   2018-06-09 00:00:00:  22  
##                                      (Other)            :1778  
##       REST         DaysSinceTop        P12           P12D     
##  Min.   :  6.00   Min.   :  0.0   EE     :429   E2T0   : 163  
##  1st Qu.: 28.00   1st Qu.: 35.0   TE     :285   T0E2   : 148  
##  Median : 34.00   Median : 90.0   ET     :216   E2E2   :  59  
##  Mean   : 50.35   Mean   :150.5   TT     :213   T2T0   :  26  
##  3rd Qu.: 50.00   3rd Qu.:233.0   XX     :159   E2T2   :  24  
##  Max.   :542.00   Max.   :832.0   EX     :113   P0E2   :  24  
##  NA's   :1                        (Other):516   (Other):1487  
##       P123          P123D          LEVEL        WITHLASIX     
##  EEE    : 276   E2E2T0 : 144   Min.   :  75   Min.   :0.0000  
##  TEE    : 194   T0E2E2 : 140   1st Qu.: 875   1st Qu.:0.0000  
##  TTE    : 105   T2T0E  :  20   Median :1125   Median :0.0000  
##  ETE    : 101   E2E2E2 :  17   Mean   :1160   Mean   :0.1455  
##  XXX    : 100   E2T0E  :  16   3rd Qu.:1400   3rd Qu.:0.0000  
##  EET    :  87   P0T0E  :  11   Max.   :2875   Max.   :1.0000  
##  (Other):1068   (Other):1583                                  
##    LEVLESSTOP        FVz           
##  Min.   :   0   Min.   :-1.810308  
##  1st Qu.: 150   1st Qu.:-0.533073  
##  Median : 313   Median :-0.157416  
##  Mean   : 352   Mean   : 0.000007  
##  3rd Qu.: 500   3rd Qu.: 0.330939  
##  Max.   :1525   Max.   :13.215990  
## 
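The FVz column has a mean of roughly zero in the summary above, which is what a z-score of FgrVle would produce. The sketch below shows the calculation on the first ten FgrVle values from the str output. These z-scores are relative to those ten values only, so they will not match FVz, which appears to be standardized against the whole column: for example, (1500 - 1105) / 665 is about 0.594, where 1105 is the rounded FgrVle mean from the summary and 665 is my own back-calculated standard deviation, not a value reported in the output.

```r
# First ten FgrVle values, copied from the str() output.
fgr <- c(1500, 175, 375, 1350, 375, 550, 925, 375, 750, 925)

# z-score: distance from the mean in units of standard deviation.
z <- (fgr - mean(fgr)) / sd(fgr)

# scale() performs the same centering and scaling.
z_scale <- as.numeric(scale(fgr))

print(round(z, 3))
```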

Outcomes by Age

plot(dat$Outcome ~ dat$Age)

Outcomes by Rest

# Create five equal-width bins so REST can be plotted as a category
dat$REST_bins <- cut(dat$REST, 5, labels = c("B1", "B2", "B3", "B4", "B5"))
# na.rm is not a graphical parameter, so it is left out of the plot() call
plot(dat$Outcome ~ dat$REST_bins)
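Equal-width bins on a right-skewed variable such as REST (median 34, maximum 542 in the summary above) place most races in the first bin. One alternative, sketched here on synthetic data, is to cut at quantiles so each bin holds a similar share of cases. The rest vector is a made-up stand-in for dat$REST, and the use of quintiles is an assumption, not a choice made in this analysis.

```r
set.seed(42)

# Synthetic right-skewed stand-in for dat$REST (days of rest).
rest <- round(rexp(500, rate = 1 / 45)) + 6

# Quintile break points; unique() guards against duplicate breaks from ties.
breaks <- unique(quantile(rest, probs = seq(0, 1, by = 0.2), na.rm = TRUE))

# include.lowest = TRUE keeps the minimum value inside the first bin.
rest_q <- cut(rest, breaks = breaks, include.lowest = TRUE)

print(table(rest_q))
```

With the real data, the same call could take dat$REST in place of rest, and the resulting factor could be plotted against Outcome in the same way as the equal-width bins.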

Outcomes by Gender

plot(dat$Outcome ~ dat$Gndr)

Outcomes by Pattern - P123

plot(dat$Outcome ~ dat$P123)

Outcomes by Soundness

plot(dat$Outcome ~ dat$SoundnessFlag)

Outcomes by With Lasix

plot(dat$Outcome ~ dat$WITHLASIX)

Outcomes by Surface

plot(dat$Outcome ~ dat$Srfce)
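The plots of Outcome against a categorical variable show, for each category, the share of races falling into each outcome. The same numbers can be read off a table of row proportions. The sketch below uses a tiny made-up data frame (toy is hypothetical and does not reflect the real counts) to show the calculation.

```r
# Tiny made-up example: ten races with an outcome and a surface.
toy <- data.frame(
  Outcome = factor(c("E", "T", "P", "X", "E", "T", "E", "X", "T", "E")),
  Srfce   = factor(c("Dirt", "Dirt", "Dirt", "Dirt", "Dirt",
                     "Turf", "Turf", "Turf", "Turf", "Turf"))
)

# Cross-tabulate surface (rows) by outcome (columns).
tab <- table(toy$Srfce, toy$Outcome)

# Row proportions: share of each outcome within a surface.
props <- prop.table(tab, margin = 1)
print(round(props, 2))
```

Applied to the real data, table(dat$Srfce, dat$Outcome) would take the place of the toy table.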

Outcomes by Days Since Top

cdplot(dat$Outcome ~ dat$DaysSinceTop)

Outcomes by Level (Median Value over 5 Races) Less Top

cdplot(dat$Outcome ~ dat$LEVLESSTOP)