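Using play-by-play results for every MLB plate appearance in April 2013, I'll practice dplyr::summarise_each.
The data and code are at https://github.com/gghatano/analyze_mlbdata_with_R/tree/master/batting_data/game_analysis/summarise .
First, load the data: read the April 2013 plate-appearance results .csv with fread.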
library(data.table) ## fread, setnames
library(dplyr)
library(xtable)
library(ggplot2) ## needed for the scatter plots further down
dat = fread("../dat2013_04.csv")
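Next, compute batting average, slugging percentage, and pitches seen for April.
Each one is just a sum (hits, total bases, pitches) divided by at-bats. Since the operation is the same for every column, the new summarise_each function should make this easy.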
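One of the three sums is a pitch count, which the code below takes as the length of the PITCH_SEQ_TX string. Assuming the data follows the Retrosheet pitch-sequence conventions, that string can also contain characters that are not pitches (pickoff throws `1`, `2`, `3`, and markers such as `.`, `>`, `*`, `+`, `N`), so `nchar` alone may overcount. A minimal sketch of a stricter count, using made-up sequences rather than rows from the real file:

```r
# Hypothetical pitch sequences (illustrative only, not taken from the dataset)
seqs <- c("BCFX", "B1BFX", "CS.>S")

# Characters assumed to be non-pitch events under the Retrosheet conventions
non_pitch <- "[+*.123>N]"

raw_count   <- nchar(seqs)                        # what nchar(PITCH_SEQ_TX) gives
pitch_count <- nchar(gsub(non_pitch, "", seqs))   # count after dropping non-pitch characters

print(data.frame(seqs, raw_count, pitch_count))
```

Whether the difference matters depends on how often such characters occur in the actual data.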
dat_april =
  dat %>%
  select(BAT_ID, AB_FL, H_FL, PITCH_SEQ_TX) %>% ## use the at-bat flag, hit flag (its sum serves as total bases), and pitch sequence
  mutate(pitches = nchar(PITCH_SEQ_TX), HIT_FL = ifelse(H_FL > 0, 1, 0)) %>% ## pitch count and a hit-or-not flag
  select(-PITCH_SEQ_TX) %>% ## no longer needed
  group_by(BAT_ID) %>% ## per batter
  summarise_each(funs(sum)) %>% ## sum at-bats, total bases, pitches, and hits
  dplyr::filter(AB_FL > 50) %>% ## keep batters with more than 50 at-bats
  mutate_each(funs(./AB_FL), vars = H_FL:HIT_FL) %>% ## divide total bases, pitches, and hits by at-bats
  select(BAT_ID, vars1:vars3) %>% ## the ratios land in vars1..vars3
  setnames(c("retroID", "SLG", "pitches", "average"))
dat_april %>% head
## retroID SLG pitches average
## 1 ackld001 0.2857 4.824 0.2527
## 2 alony001 0.4479 4.896 0.2917
## 3 altuj001 0.4352 4.102 0.3241
## 4 alvap001 0.3146 5.315 0.1798
## 5 amara001 0.3585 4.491 0.2453
## 6 andre001 0.3300 4.910 0.2600
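In later dplyr releases, summarise_each, mutate_each, and funs were deprecated in favor of across(). As a rough sketch of the same sum-then-divide pattern in that style, here is a version run on a small made-up table (the `toy` data and the name `toy_rates` are illustrative assumptions, and the 50 at-bat filter is omitted because the toy data is tiny):

```r
library(dplyr)

# Hypothetical toy data: one row per plate appearance (not the real dataset)
toy <- tibble(
  BAT_ID  = c("a", "a", "a", "b", "b"),
  AB_FL   = c(1, 1, 0, 1, 1),  # 1 if the plate appearance counts as an at-bat
  H_FL    = c(0, 2, 0, 1, 4),  # bases gained on a hit (0 = no hit)
  pitches = c(3, 5, 6, 1, 4)   # pitches seen in the plate appearance
)

toy_rates <- toy %>%
  mutate(HIT_FL = as.numeric(H_FL > 0)) %>%                  # hit-or-not flag
  group_by(BAT_ID) %>%
  summarise(across(everything(), sum)) %>%                   # sum every non-grouping column
  mutate(across(c(H_FL, pitches, HIT_FL), ~ .x / AB_FL)) %>% # divide each sum by at-bats
  rename(retroID = BAT_ID, SLG = H_FL, average = HIT_FL)

print(toy_rates)
```

Because across() overwrites the selected columns in place, the renaming step can refer to the original column names instead of positional names such as vars1.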
All that remains is to match each retroID to the player's full name and display the results.
## top 10 by batting average
fullname = fread("../../fullname.csv")
dat_april %>%
  inner_join(fullname, by = "retroID") %>%
  select(name, average) %>%
  arrange(desc(average)) %>%
  head(10) %>%
  xtable %>% print(type = "html")
rank | name | average |
---|---|---|
1 | Carlos Santana | 0.39 |
2 | James Loney | 0.37 |
3 | Torii Hunter | 0.37 |
4 | Chris Johnson | 0.37 |
5 | Jean Segura | 0.37 |
6 | Miguel Cabrera | 0.36 |
7 | Carlos Gomez | 0.36 |
8 | Wilin Rosario | 0.35 |
9 | Chris Davis | 0.35 |
10 | Nate McLouth | 0.35 |
## top 10 by slugging percentage
fullname = fread("../../fullname.csv")
dat_april %>%
  inner_join(fullname, by = "retroID") %>%
  select(name, SLG) %>%
  arrange(desc(SLG)) %>%
  head(10) %>%
  xtable %>% print(type = "html")
rank | name | SLG |
---|---|---|
1 | Justin Upton | 0.73 |
2 | Chris Davis | 0.73 |
3 | Carlos Santana | 0.72 |
4 | Bryce Harper | 0.72 |
5 | Travis Hafner | 0.67 |
6 | Mark Reynolds | 0.65 |
7 | Wilin Rosario | 0.65 |
8 | Dexter Fowler | 0.62 |
9 | Carlos Gomez | 0.62 |
10 | Troy Tulowitzki | 0.60 |
## top 10 by pitches per at-bat
fullname = fread("../../fullname.csv")
dat_april %>%
  inner_join(fullname, by = "retroID") %>%
  select(name, pitches) %>%
  arrange(desc(pitches)) %>%
  head(10) %>%
  xtable %>% print(type = "html")
rank | name | pitches |
---|---|---|
1 | Billy Butler | 6.53 |
2 | Lucas Duda | 6.43 |
3 | A.J. Ellis | 6.28 |
4 | Will Venable | 6.22 |
5 | Lance Berkman | 6.12 |
6 | David Wright | 6.01 |
7 | Mark Ellis | 5.96 |
8 | Rickie Weeks | 5.96 |
9 | Joey Votto | 5.93 |
10 | Nick Swisher | 5.93 |
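The three rankings above differ only in the column being sorted, so the repeated pipeline could be folded into a small function. The helper below (`top_by`, a hypothetical name) is one possible sketch, shown on made-up rows rather than on dat_april:

```r
library(dplyr)

# Hypothetical helper: the top-n rows of `df`, ranked by the column named in `stat`
top_by <- function(df, stat, n = 10) {
  df %>%
    select(name, all_of(stat)) %>%   # keep the name and the requested statistic
    arrange(desc(.data[[stat]])) %>% # sort descending by that statistic
    head(n)
}

# Made-up example rows (illustrative only)
toy <- tibble(
  name    = c("Player A", "Player B", "Player C"),
  average = c(0.20, 0.40, 0.30),
  SLG     = c(0.50, 0.30, 0.60)
)

print(top_by(toy, "SLG", n = 2))
```

With the real data, the input would be the joined table, for example `dat_april %>% inner_join(fullname, by = "retroID") %>% top_by("average")`, with the result piped into xtable as before.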
Next, the relationship between pitches seen and batting average.
dat_april %>%
  ggplot() +
  geom_point(aes(x = average, y = pitches)) +
  ggtitle("average vs pitches")
There doesn't seem to be any relationship.
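A visual impression like this can be backed up with a correlation coefficient. As a sketch, the snippet below applies base R's `cor` to the six rows printed by `head` earlier; six rows are far too few to conclude anything, so this only illustrates the call (the data frame name `first_six` is an assumption for the example):

```r
# The six rows shown in the head() output earlier in this post
first_six <- data.frame(
  retroID = c("ackld001", "alony001", "altuj001", "alvap001", "amara001", "andre001"),
  pitches = c(4.824, 4.896, 4.102, 5.315, 4.491, 4.910),
  average = c(0.2527, 0.2917, 0.3241, 0.1798, 0.2453, 0.2600)
)

# Pearson correlation between batting average and pitches per at-bat
r <- cor(first_six$average, first_six$pitches)
print(r)
```

On the full table, the equivalent call would be `cor(dat_april$average, dat_april$pitches)`.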
What about pitches and slugging percentage?
dat_april %>%
  ggplot() +
  geom_point(aes(x = SLG, y = pitches)) +
  ggtitle("SLG vs pitches")
## batting average against slugging percentage
dat_april %>%
  ggplot() +
  geom_point(aes(x = SLG, y = average)) +
  ggtitle("average vs SLG")