データ元

http://dokoka.org/wiki.cgi?page=R%A4%C8%B6%A5%C7%CF%A5%C7%A1%BC%A5%BF%A4%C7%B3%D8%A4%D6%C5%FD%B7%D7%B3%D8%A1%A1%C2%E81%B2%F3%A1%A1%A5%C7%A1%BC%A5%BF%A5%BD%A1%BC%A5%B9%A4%C8%A4%B7%A4%C6%A4%CE%B6%A5%C7%CF%A4%CE%CC%A5%CE%CF

考え方

http://introndatalab.com/blog/7

データクレンジング

> jra <- read.csv("jra_race_result.csv")
> str(jra)
'data.frame':   198401 obs. of  30 variables:
 $ 開催日      : Factor w/ 426 levels "2013-01-05","2013-01-06",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ 競馬場      : Factor w/ 10 levels "京都","阪神",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ レース番号  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ レース名    : Factor w/ 1130 levels "1932-1950sダービーメモリーズトキノミノルカップ",..: 152 152 152 152 152 152 152 152 152 152 ...
 $ コース      : Factor w/ 3 levels "ダート","芝",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ 周回        : Factor w/ 8 levels "右","右2周","左",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ 距離        : int  1200 1200 1200 1200 1200 1200 1200 1200 1200 1200 ...
 $ 馬場状態    : Factor w/ 4 levels "重","不良","良",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ 賞金        : num  500 200 125 75 50 0 0 0 0 0 ...
 $ 頭数        : int  16 16 16 16 16 16 16 16 16 16 ...
 $ 着順        : int  1 2 3 4 5 6 7 8 9 10 ...
 $ 枠番        : int  6 2 7 1 5 5 3 2 4 3 ...
 $ 馬番        : int  12 4 13 1 10 9 6 3 7 5 ...
 $ 馬名        : Factor w/ 24373 levels "アーカイブ","アークアーセナル",..: 22934 350 2327 14213 5678 1554 18783 14065 6986 9814 ...
 $ 性別        : Factor w/ 3 levels "セ","牡","牝": 2 3 3 2 2 2 2 2 3 3 ...
 $ 年齢        : int  3 3 3 3 3 3 3 3 3 3 ...
 $ 騎手        : Factor w/ 257 levels "A.アッゼニ","A.シュタルケ",..: 75 168 74 61 140 135 240 88 136 90 ...
 $ タイム      : num  73.6 73.6 73.7 74 74.3 74.3 74.3 74.4 74.6 74.8 ...
 $ 着差        : Factor w/ 22 levels "","1 1/2馬身",..: 1 20 20 4 4 20 19 11 3 3 ...
 $ 通過順      : Factor w/ 8698 levels "","01-01","01-01-01",..: 2 233 2279 1201 6226 4370 233 7299 4182 6226 ...
 $ 上り3F      : num  39.4 39.3 39.1 39.5 38.9 39.4 40 38.8 39.7 39.4 ...
 $ 斤量        : num  56 54 54 56 56 56 53 56 54 54 ...
 $ 馬体重      : int  484 454 396 484 462 474 506 448 448 476 ...
 $ 増減        : int  2 -2 0 6 2 8 -4 8 -4 -2 ...
 $ 人気        : int  1 7 5 2 3 6 10 14 11 4 ...
 $ オッズ      : num  1.9 44.1 11.8 5.3 6.1 ...
 $ ブリンカー  : Factor w/ 2 levels "","B": 1 1 1 1 1 1 1 1 1 1 ...
 $ 調教師      : Factor w/ 321 levels "C.チャン","C.ファウンズ",..: 120 204 57 27 133 64 256 77 221 282 ...
 $ 調教コメント: Factor w/ 181 levels "nan","いま一息",..: 46 102 46 74 41 9 57 180 95 13 ...
 $ 調教評価    : Factor w/ 5 levels "A","B","C","D",..: 2 3 2 2 2 3 3 3 3 3 ...
> # install.packages("lubridate")
> library(lubridate)
> jra$date <-  ymd(jra$開催日)
> # install.packages("BH")
> # install.packages("dplyr")
> library(dplyr)
> n.t.2500 <- filter(jra,競馬場=="中山") %>% filter(コース=="芝") %>% filter(距離=="2500")
> 
> n.t.2500$isIn2nd <- 0
> n.t.2500$isIn2nd[which(n.t.2500$着順 %in% 1:2)] <- 1
> 
> n <- nrow(n.t.2500)
> 
> UmaNyusyoRatio <- c()
> KisyuNyusyoRatio <- c()
> for(i in 1:n){
+   uma <- n.t.2500$馬名[i]
+   kishu <- n.t.2500$騎手[i]
+   time <- n.t.2500$date[i]
+   d1 <- filter(jra,馬名==uma)
+   p <- nrow(d1)
+   if(length(which((time - d1$date)>0))>0){
+     d2 <- d1[which((time - d1$date)>0),]
+     UmaNyusyoRatio[i] <- sum(d2$着順 %in% 1:2)/p
+   }
+   d3 <- filter(jra,騎手==kishu)
+   p <- nrow(d3)
+   if(length(which((time - d3$date)>0))>0){
+     d4 <- d3[which((time - d3$date)>0),]
+     KisyuNyusyoRatio[i] <- sum(d4$着順 %in% 1:2)/p
+   }
+ }

ロジスティック回帰

> dat <- data.frame(isIn2nd=n.t.2500$isIn2nd, UmaNyusyoRatio,KisyuNyusyoRatio)
> glm(isIn2nd ~ UmaNyusyoRatio + KisyuNyusyoRatio, family = binomial,data=dat)

Call:  glm(formula = isIn2nd ~ UmaNyusyoRatio + KisyuNyusyoRatio, family = binomial, 
    data = dat)

Coefficients:
     (Intercept)    UmaNyusyoRatio  KisyuNyusyoRatio  
          -2.421             2.054             3.999  

Degrees of Freedom: 408 Total (i.e. Null);  406 Residual
  (20 observations deleted due to missingness)
Null Deviance:      344.6 
Residual Deviance: 334.5    AIC: 340.5

有馬記念で