I got curious about the relationship between the base-runner situation and home runs.
So I looked into it.
The code is up on GitHub.
https://github.com/gghatano/analyze_mlbdata_with_R
Let's compute the home run rate for each runner situation. First, load the data: put the files downloaded from Retrosheet somewhere convenient and read them in.
library(dplyr)
library(data.table)
library(magrittr)
library(xtable)
library(ggplot2)  # needed for the plot at the end
dat = fread("../../../data/all2013.csv")
## Read 190907 rows and 97 (of 97) columns from 0.076 GB file in 00:00:03
name = fread("../../batting_data/names.csv", header = FALSE) %>% unlist
dat %>% setnames(name)
Let's count home runs by the number of runners on base.
dat_hr_runner =
  dat %>%
  # keep official at-bats only
  dplyr::filter(AB_FL == "T") %>%
  # number of occupied bases (0 to 3)
  mutate(runners = (BASE1_RUN_ID != "") + (BASE2_RUN_ID != "") + (BASE3_RUN_ID != "")) %>%
  # Retrosheet event code 23 is a home run
  mutate(HR_FL = (EVENT_CD == 23)) %>%
  group_by(runners) %>%
  dplyr::summarise(atbats = n(), hrs = sum(HR_FL)) %>%
  mutate(runners = as.integer(runners))
dat_hr_runner %>%
  xtable(digits = 4) %>% print("html")
| | runners | atbats | hrs |
|---|---|---|---|
| 1 | 0 | 96284 | 2811 |
| 2 | 1 | 47666 | 1295 |
| 3 | 2 | 18516 | 459 |
| 4 | 3 | 3604 | 96 |
Now the home run rate (home runs per at-bat).
dat_hr_runner %>%
  dplyr::mutate(hr_rate = hrs / atbats) %>%
  xtable(digits = 4) %>% print("html")
| | runners | atbats | hrs | hr_rate |
|---|---|---|---|---|
| 1 | 0 | 96284 | 2811 | 0.0292 |
| 2 | 1 | 47666 | 1295 | 0.0272 |
| 3 | 2 | 18516 | 459 | 0.0248 |
| 4 | 3 | 3604 | 96 | 0.0266 |
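The home run rate with the bases empty is a little higher. I wonder why.
Let's look a bit more closely and count home runs for each runner situation, that is, for each combination of occupied bases. In the code below, each situation is encoded as a number: 1 for a runner on first, 10 for second, 100 for third, added together.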
dat_runner_123 =
  dat %>%
  dplyr::filter(AB_FL == "T") %>%
  # encode occupied bases: first = 1, second = 10, third = 100
  mutate(runners = (BASE1_RUN_ID != "") * 1 + (BASE2_RUN_ID != "") * 10 + (BASE3_RUN_ID != "") * 100) %>%
  mutate(runners = as.integer(runners)) %>%
  # Retrosheet event code 23 is a home run
  mutate(HR_FL = (EVENT_CD == 23)) %>%
  group_by(runners) %>%
  dplyr::summarise(atbat = n(), hrs = sum(HR_FL))
dat_runner_123 %>%
  xtable(digits = 4) %>% print("html")
| | runners | atbat | hrs |
|---|---|---|---|
| 1 | 0 | 96284 | 2811 |
| 2 | 1 | 30299 | 879 |
| 3 | 10 | 13239 | 310 |
| 4 | 11 | 11070 | 285 |
| 5 | 100 | 4128 | 106 |
| 6 | 101 | 4529 | 117 |
| 7 | 110 | 2917 | 57 |
| 8 | 111 | 3604 | 96 |
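As a quick sanity check, the per-base table should collapse back to the per-count table from earlier. Here is a minimal sketch in base R, with the counts typed in by hand from the table above. The names `situation`, `n_runners` and `by_count` are made up for this check and do not appear in the original code.

```r
# Counts copied by hand from the per-base table above (2013 season).
situation <- data.frame(
  runners = c(0, 1, 10, 11, 100, 101, 110, 111),
  atbat   = c(96284, 30299, 13239, 11070, 4128, 4529, 2917, 3604),
  hrs     = c(2811, 879, 310, 285, 106, 117, 57, 96)
)

# Number of runners = sum of the three digits of the code
# (ones = first base, tens = second base, hundreds = third base).
situation$n_runners <- with(
  situation,
  runners %% 10 + (runners %/% 10) %% 10 + runners %/% 100
)

# Collapse to one row per runner count.
by_count <- aggregate(cbind(atbat, hrs) ~ n_runners, data = situation, FUN = sum)
print(by_count)
```

The at-bat totals come out to 96284, 47666, 18516 and 3604, and the home run totals to 2811, 1295, 459 and 96, the same as in the per-count table.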
Looks right. Let's compute the home run rate.
dat_runner_123 %>%
  mutate(hr_rate = hrs / atbat) %>%
  xtable(digits = 4) %>% print("html")
| | runners | atbat | hrs | hr_rate |
|---|---|---|---|---|
| 1 | 0 | 96284 | 2811 | 0.0292 |
| 2 | 1 | 30299 | 879 | 0.0290 |
| 3 | 10 | 13239 | 310 | 0.0234 |
| 4 | 11 | 11070 | 285 | 0.0257 |
| 5 | 100 | 4128 | 106 | 0.0257 |
| 6 | 101 | 4529 | 117 | 0.0258 |
| 7 | 110 | 2917 | 57 | 0.0195 |
| 8 | 111 | 3604 | 96 | 0.0266 |
Interesting. Home runs seem harder to come by with runners on second and third (code 110).
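The numeric codes are easy to misread, so here is a small sketch that turns them into labels. `runner_label` is a hypothetical helper written for this note, not something from the original code or from any package.

```r
# Hypothetical helper: turn a code such as 110 into a label such as "2B+3B".
runner_label <- function(code) {
  bases <- c("1B", "2B", "3B")
  # Ones, tens and hundreds digits say whether first, second, third are occupied.
  occupied <- c(code %% 10, (code %/% 10) %% 10, code %/% 100) == 1
  if (!any(occupied)) "empty" else paste(bases[occupied], collapse = "+")
}

codes <- c(0, 1, 10, 11, 100, 101, 110, 111)
labels <- vapply(codes, runner_label, character(1))
print(data.frame(runners = codes, situation = labels))
```

With this, code 110 reads as "2B+3B" and code 101 as "1B+3B".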
Shall we get 95% confidence intervals as well?
dat_hr_123_conf =
  dat_runner_123 %>%
  group_by(runners) %>%
  # binom.test() gives an exact binomial confidence interval (95% by default)
  summarise(hr_rate = hrs / atbat,
            hr_rate_low = binom.test(hrs, atbat)$conf.int[1],
            hr_rate_high = binom.test(hrs, atbat)$conf.int[2])
dat_hr_123_conf %>%
  xtable(digits = 4) %>% print("html")
| | runners | hr_rate | hr_rate_low | hr_rate_high |
|---|---|---|---|---|
| 1 | 0 | 0.0292 | 0.0281 | 0.0303 |
| 2 | 1 | 0.0290 | 0.0271 | 0.0310 |
| 3 | 10 | 0.0234 | 0.0209 | 0.0261 |
| 4 | 11 | 0.0257 | 0.0229 | 0.0289 |
| 5 | 100 | 0.0257 | 0.0211 | 0.0310 |
| 6 | 101 | 0.0258 | 0.0214 | 0.0309 |
| 7 | 110 | 0.0195 | 0.0148 | 0.0252 |
| 8 | 111 | 0.0266 | 0.0216 | 0.0324 |
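For anyone who wants to reproduce these intervals without the Retrosheet file, here is a minimal base R sketch. It types in the counts by hand from the per-base table and runs `binom.test` once per row, keeping both ends of the interval. The data frame name `situation` and the use of `mapply` are choices made for this sketch, not the approach used in the post.

```r
# Counts copied by hand from the per-base table earlier in the post.
situation <- data.frame(
  runners = c(0, 1, 10, 11, 100, 101, 110, 111),
  atbat   = c(96284, 30299, 13239, 11070, 4128, 4529, 2917, 3604),
  hrs     = c(2811, 879, 310, 285, 106, 117, 57, 96)
)

# One binom.test() call per row; conf.int is the exact (Clopper-Pearson)
# interval at the default 95% level. mapply() returns a 2 x 8 matrix,
# so transpose it to get one row per situation.
ci <- t(mapply(
  function(x, n) binom.test(x, n)$conf.int[1:2],
  situation$hrs, situation$atbat
))

situation$hr_rate      <- situation$hrs / situation$atbat
situation$hr_rate_low  <- ci[, 1]
situation$hr_rate_high <- ci[, 2]
print(round(situation[, c("runners", "hr_rate", "hr_rate_low", "hr_rate_high")], 4))
```

One thing the intervals make visible: the interval for code 110 (roughly 0.0148 to 0.0252) lies entirely below the interval for the bases-empty case (roughly 0.0281 to 0.0303), while most of the other situations overlap one another.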
While we're at it, let's visualize this.
dat_hr_123_conf %>%
  # treat the situation code as a category, not a number
  mutate(runners = as.factor(runners)) %>%
  ggplot(aes(x = runners)) +
  geom_point(aes(y = hr_rate), size = 5) +
  geom_errorbar(aes(ymin = hr_rate_low, ymax = hr_rate_high)) +
  ggtitle("HR-rate in each runner-situation (with Confidence Interval)")
That's all.