# load data
setwd('\\Ragozin\\Files\\csv\\')
dat <- read.csv(file='ProjectData.csv', stringsAsFactors = TRUE)
str(dat)
## 'data.frame': 1931 obs. of 31 variables:
## $ RecordID : int 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 ...
## $ Nbr : int 15 14 13 12 11 10 9 8 7 6 ...
## $ Hrse : Factor w/ 140 levels "ABELTASMAN","ALMANAAR",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Gndr : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ Age : int 4 4 4 4 3 3 3 3 3 3 ...
## $ Top : int 175 175 375 375 375 375 375 375 500 500 ...
## $ Outcome : Factor w/ 4 levels "E","P","T","X": 4 3 2 4 2 1 4 3 1 1 ...
## $ Outcome1 : Factor w/ 47 levels "E0","E1","E2",..: 25 14 7 22 7 3 40 9 3 5 ...
## $ Figure : Factor w/ 1732 levels "- 1- VwAWBE 9",..: 859 946 1025 790 1022 1088 1350 1028 1722 1343 ...
## $ FgrVle : int 1500 175 375 1350 375 550 925 375 750 925 ...
## $ Trnr : Factor w/ 103 levels "AAn","ADr","AFe",..: 10 10 10 10 10 10 10 10 10 10 ...
## $ TrkCde : Factor w/ 57 levels "","Ai","AP","AQ",..: 47 48 5 7 11 45 48 5 7 47 ...
## $ Srfce : Factor w/ 2 levels "Dirt","Turf": 1 1 1 1 1 1 1 1 1 1 ...
## $ Wthr : Factor w/ 6 levels "","Big Wind",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Trble : Factor w/ 2 levels "BIG","SML": NA NA NA 2 NA 2 NA NA 2 NA ...
## $ SmrtMny4 : Factor w/ 1 level "MONY": NA NA NA NA NA NA NA NA NA NA ...
## $ isTop : int 0 1 0 0 0 0 0 1 0 0 ...
## $ TopType : int 1325 -200 0 975 0 175 550 -125 250 425 ...
## $ Move : int 1325 -200 0 975 0 175 550 -125 250 425 ...
## $ SoundnessFlag: int 1 0 0 0 1 0 0 0 0 0 ...
## $ RACE_DATE : Factor w/ 608 levels "2015-01-08 00:00:00",..: 598 577 528 506 425 401 355 329 309 293 ...
## $ REST : int 36 77 36 182 41 62 43 36 27 35 ...
## $ DaysSinceTop : int 36 441 364 328 146 105 43 98 62 35 ...
## $ P12 : Factor w/ 16 levels "EE","EP","ET",..: 10 8 14 5 4 15 9 1 3 9 ...
## $ P12D : Factor w/ 449 levels "E0E2","E0E3",..: 196 141 278 127 56 374 175 37 91 192 ...
## $ P123 : Factor w/ 63 levels "EEE","EEP","EET",..: 40 30 53 20 15 57 33 3 9 35 ...
## $ P123D : Factor w/ 1201 levels "E0E2E2","E0E2E5",..: 718 553 895 481 208 1053 622 124 371 691 ...
## $ LEVEL : int 375 550 550 550 750 750 750 750 925 950 ...
## $ WITHLASIX : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LEVLESSTOP : int 200 375 175 175 375 375 375 375 425 450 ...
## $ FVz : num 0.594 -1.397 -1.097 0.369 -1.097 ...
Are 3-race past-performance patterns, rest, and certain categorical data (gender, age, racing surface, etc.) predictive of a horse’s next-out performance in thoroughbred horse racing?
Each case will include the three past-performance results/patterns (speed figures) for a thoroughbred horse, the amount of rest between races, and categorical data including gender, age and race surface. The final set of variables will be determined after some preliminary exploratory analysis. I estimate approximately 2000 total cases in the data set.
The data was self-collected from Ragozin Speed Figures (http://thesheets.com) for the 2018 Breeders' Cup races run at Churchill Downs on November 3, 2018. The data set comprises the past performances of 148 thoroughbred racehorses.
This is an observational study.
The dependent variable will be binary and will indicate whether a horse ran a new Top in the past-performance race. A Top is the fastest (lowest) speed figure a horse has earned in its career. The binary nature of the dependent variable lends itself to a logit (logistic) regression model.
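As a rough illustration of what such a logit model looks like in R, here is a minimal sketch. The data frame `toy` and all of its values are simulated for the example (only the column names REST, Age, Srfce and isTop are borrowed from the data set), and the choice of predictors is an assumption, not the final model.

```r
# Synthetic stand-in for the project data; every value below is simulated.
set.seed(42)
n <- 500
toy <- data.frame(
  REST  = rpois(n, lambda = 40),                          # days between races
  Age   = sample(2:7, n, replace = TRUE),
  Srfce = factor(sample(c("Dirt", "Turf"), n, replace = TRUE))
)
# Simulate a 0/1 response whose log-odds fall slightly as rest increases
toy$isTop <- rbinom(n, size = 1, prob = plogis(-0.5 - 0.01 * toy$REST))

# Logit model: binomial family, whose default link is the logit
fit <- glm(isTop ~ REST + Age + Srfce, data = toy, family = binomial)

summary(fit)$coefficients              # estimates on the log-odds scale
exp(coef(fit))                         # the same estimates as odds ratios
head(predict(fit, type = "response"))  # fitted probabilities of a new Top
```

The coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are usually easier to interpret.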
The dependent variable is isTop. The variable takes on two values: 1 = a new Top and 0 = no new Top.
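The coding of isTop can be cross-checked against Outcome. In the summary() output, 625 of the 1931 rows have Outcome "T" and the mean of isTop is 0.3237, and in the str() output the first ten rows have isTop equal to 1 exactly where Outcome is "T". This suggests, but does not prove, that isTop flags the "T" outcomes. The sketch below reproduces that check on the ten rows shown by str(); the names `outcome` and `is_top` are illustrative.

```r
# Share of "T" outcomes, using the counts printed by summary(): 625 of 1931 rows
round(625 / 1931, 4)   # compare with the reported mean of isTop

# First ten Outcome values as shown in the str() output (codes 4 3 2 4 2 1 4 3 1 1)
outcome <- factor(c("X", "T", "P", "X", "P", "E", "X", "T", "E", "E"),
                  levels = c("E", "P", "T", "X"))

# Assumed recoding: 1 when the outcome is a new Top, 0 otherwise
is_top <- as.integer(outcome == "T")
table(outcome, is_top)
```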
Per the str output above, the data set includes 31 variables, which leaves 30 candidate independent variables besides isTop. Summary information for some of the more pertinent variables follows:
The following section provides summary statistics and charts of my data set. In addition to the summary function, I’ve used various charts to plot outcomes against independent variables. The outcomes are: T - Top, E - Effort, P - Paired and X - a race 5 points away from the Top. This information will be used to shape my logit regression model.
summary(dat)
## RecordID Nbr Hrse Gndr
## Min. : 1.0 Min. : 1.000 DISCREETLOVER: 44 F: 735
## 1st Qu.: 496.5 1st Qu.: 4.000 RICHARDSBOY : 32 M:1196
## Median : 983.0 Median : 8.000 BUCCHERO : 28
## Mean : 993.1 Mean : 9.126 WARRIORSCLUB : 28
## 3rd Qu.:1490.5 3rd Qu.:13.000 CHANTELINE : 26
## Max. :2022.0 Max. :44.000 HUNT : 26
## (Other) :1747
## Age Top Outcome Outcome1 Figure
## Min. :2.000 Min. :-100.0 E:738 E2 :287 #NAME? : 95
## 1st Qu.:3.000 1st Qu.: 500.0 P:185 T0 :199 ~= 12 FR : 6
## Median :3.000 Median : 750.0 T:625 P0 :185 ~^= 11 IRE: 5
## Mean :3.548 Mean : 808.1 X:383 E4 :151 ~= 10 GB : 5
## 3rd Qu.:4.000 3rd Qu.:1075.0 T2 :147 ~= 6 GB : 5
## Max. :7.000 Max. :2875.0 E3 :129 ~= 7 FR : 5
## (Other):833 (Other) :1810
## FgrVle Trnr TrkCde Srfce Wthr
## Min. :-100 PMr : 88 SA :247 Dirt:1116 : 3
## 1st Qu.: 750 CCB : 80 BE :202 Turf: 815 Big Wind : 88
## Median :1000 BBt : 64 CD :178 Clear :1756
## Mean :1105 DOl : 62 Sr :170 Huge Wind: 34
## 3rd Qu.:1325 SMA : 60 DM :126 rain : 48
## Max. :9900 (Other):1344 KE :110 snow : 2
## NA's : 233 (Other):898
## Trble SmrtMny4 isTop TopType
## BIG : 41 MONY: 22 Min. :0.0000 Min. :-1525.0
## SML : 208 NA's:1909 1st Qu.:0.0000 1st Qu.: -25.0
## NA's:1682 Median :0.0000 Median : 150.0
## Mean :0.3237 Mean : 234.7
## 3rd Qu.:1.0000 3rd Qu.: 425.0
## Max. :1.0000 Max. : 9450.0
##
## Move SoundnessFlag RACE_DATE
## Min. :-1525.0 Min. :0.00000 2018-08-25 00:00:00: 31
## 1st Qu.: -25.0 1st Qu.:0.00000 2018-05-05 00:00:00: 29
## Median : 150.0 Median :0.00000 2017-10-07 00:00:00: 24
## Mean : 234.7 Mean :0.05334 2017-11-04 00:00:00: 24
## 3rd Qu.: 425.0 3rd Qu.:0.00000 2018-10-06 00:00:00: 23
## Max. : 9450.0 Max. :1.00000 2018-06-09 00:00:00: 22
## (Other) :1778
## REST DaysSinceTop P12 P12D
## Min. : 6.00 Min. : 0.0 EE :429 E2T0 : 163
## 1st Qu.: 28.00 1st Qu.: 35.0 TE :285 T0E2 : 148
## Median : 34.00 Median : 90.0 ET :216 E2E2 : 59
## Mean : 50.35 Mean :150.5 TT :213 T2T0 : 26
## 3rd Qu.: 50.00 3rd Qu.:233.0 XX :159 E2T2 : 24
## Max. :542.00 Max. :832.0 EX :113 P0E2 : 24
## NA's :1 (Other):516 (Other):1487
## P123 P123D LEVEL WITHLASIX
## EEE : 276 E2E2T0 : 144 Min. : 75 Min. :0.0000
## TEE : 194 T0E2E2 : 140 1st Qu.: 875 1st Qu.:0.0000
## TTE : 105 T2T0E : 20 Median :1125 Median :0.0000
## ETE : 101 E2E2E2 : 17 Mean :1160 Mean :0.1455
## XXX : 100 E2T0E : 16 3rd Qu.:1400 3rd Qu.:0.0000
## EET : 87 P0T0E : 11 Max. :2875 Max. :1.0000
## (Other):1068 (Other):1583
## LEVLESSTOP FVz
## Min. : 0 Min. :-1.810308
## 1st Qu.: 150 1st Qu.:-0.533073
## Median : 313 Median :-0.157416
## Mean : 352 Mean : 0.000007
## 3rd Qu.: 500 3rd Qu.: 0.330939
## Max. :1525 Max. :13.215990
##
Outcomes by Age
plot(dat$Outcome ~ dat$Age)
Outcomes by Rest
# Create five equal-width bins to plot rest
dat$REST_bins <- cut(dat$REST, 5, labels = c("B1", "B2", "B3", "B4", "B5"))
#dat$REST_bins <- cut(dat$REST, 10)
# na.rm is not a graphical parameter, so it is not passed to plot()
plot(Outcome ~ REST_bins, data = dat)
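One caveat on the binning: cut(x, 5) builds five equal-width intervals, and REST is strongly right-skewed (median 34, maximum 542 in the summary above), so most races are likely to land in the first bin. A possible alternative is to place the breaks at quantiles so that each bin holds a similar number of races. The sketch below uses a simulated vector `rest`, not the project data, and the names `brks` and `rest_q` are illustrative.

```r
set.seed(1)
# Simulated, right-skewed days of rest (stand-in for dat$REST)
rest <- round(rexp(1000, rate = 1 / 45)) + 6

# Equal-width bins, as cut(x, 5) produces: most values fall in B1
table(cut(rest, 5, labels = paste0("B", 1:5)))

# Quantile-based bins: breaks at the 0%, 20%, ..., 100% quantiles;
# unique() guards against duplicated breaks when many values are tied
brks <- unique(quantile(rest, probs = seq(0, 1, by = 0.2), na.rm = TRUE))
rest_q <- cut(rest, breaks = brks, include.lowest = TRUE)
table(rest_q)
```

With include.lowest = TRUE the smallest value is kept in the first interval rather than turned into NA.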
Outcomes by Gender
plot(dat$Outcome ~ dat$Gndr)
Outcomes by Pattern (P123)
plot(dat$Outcome ~ dat$P123)
Outcomes by Soundness
plot(dat$Outcome ~ dat$SoundnessFlag)
Outcomes by Lasix Use
plot(dat$Outcome ~ dat$WITHLASIX)
Outcomes by Surface
plot(dat$Outcome ~ dat$Srfce)
Outcomes by Days Since Top
cdplot(dat$Outcome ~ dat$DaysSinceTop)
Outcomes by Level (Median Value over 5 Races) Less Top
cdplot(dat$Outcome ~ dat$LEVLESSTOP)