Using dplyr's summarise_each

Loading the data

We practice dplyr::summarise_each on play-by-play data covering every MLB plate appearance in April 2013.

The data and code are at https://github.com/gghatano/analyze_mlbdata_with_R/tree/master/batting_data/game_analysis/summarise .

First, load the data: read the April 2013 plate-appearance CSV with fread.

library(data.table)
library(dplyr)
library(ggplot2) ## needed for the plots at the end
library(xtable)
dat = fread("../dat2013_04.csv")

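Aggregation

We compute each batter's April batting average, slugging percentage, and pitches seen per at-bat.

Each statistic is a per-batter total (hits, total bases, pitches) divided by at-bats. Since the operation is the same for every column, the new summarise_each function should handle them all in one step.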

dat_april = 
  dat %>% 
  select(BAT_ID, AB_FL, H_FL, PITCH_SEQ_TX) %>% ## at-bat flag, hit value (bases), pitch sequence
  mutate(pitches = nchar(PITCH_SEQ_TX), HIT_FL = ifelse(H_FL > 0, 1, 0)) %>% ## pitch count and hit-or-not flag
  select(-PITCH_SEQ_TX) %>% ## no longer needed
  group_by(BAT_ID) %>% ## per batter
  summarise_each(funs(sum)) %>% ## totals: at-bats, total bases, pitches, hits
  dplyr::filter(AB_FL > 50) %>% ## keep batters with more than 50 at-bats
  mutate_each(funs(./AB_FL), vars = H_FL:HIT_FL) %>% ## divide total bases, pitches, hits by at-bats
  select(BAT_ID, vars1:vars3) %>% ## the divided columns come out as vars1..vars3
  setnames(c("retroID", "SLG", "pitches", "average"))

dat_april %>% head
##    retroID    SLG pitches average
## 1 ackld001 0.2857   4.824  0.2527
## 2 alony001 0.4479   4.896  0.2917
## 3 altuj001 0.4352   4.102  0.3241
## 4 alvap001 0.3146   5.315  0.1798
## 5 amara001 0.3585   4.491  0.2453
## 6 andre001 0.3300   4.910  0.2600
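
In dplyr 1.0.0 and later, summarise_each(), mutate_each() and funs() are deprecated in favour of across(). As a rough sketch of the same sum-then-divide pattern in that style, the block below uses a small made-up data frame (`toy` and its values are hypothetical, not taken from the MLB data) and skips the 50 at-bat filter. It is one possible translation, not the author's code.

```r
library(dplyr)

# Hypothetical toy data: one row per plate appearance
toy <- data.frame(
  BAT_ID  = c("a", "a", "a", "b", "b"),
  AB_FL   = c(1, 1, 0, 1, 1),  # 1 if the plate appearance counts as an at-bat
  H_FL    = c(2, 0, 0, 1, 4),  # bases gained on a hit (0 = no hit)
  pitches = c(3, 5, 6, 2, 4),  # pitches seen in the plate appearance
  stringsAsFactors = FALSE
)

toy_rates <- toy %>%
  mutate(HIT_FL = ifelse(H_FL > 0, 1, 0)) %>%                    # hit-or-not flag
  group_by(BAT_ID) %>%
  summarise(across(everything(), sum)) %>%                       # per-batter totals
  mutate(across(c(H_FL, pitches, HIT_FL), ~ .x / AB_FL)) %>%     # divide by at-bats
  select(retroID = BAT_ID, SLG = H_FL, pitches, average = HIT_FL)

print(toy_rates)
```

Because across() keeps the original column names, the vars1..vars3 step is not needed here; the columns are renamed directly in select().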

All that remains is to match each retroID to the player's full name and display the result.

April batting average ranking.

fullname= fread("../../fullname.csv")
dat_april %>% 
  inner_join(fullname, by = "retroID") %>%
  select(name, average) %>% 
  arrange(desc(average))%>% 
  head(10) %>%
  xtable %>% print(type="html")
    name            average
 1  Carlos Santana     0.39
 2  James Loney        0.37
 3  Torii Hunter       0.37
 4  Chris Johnson      0.37
 5  Jean Segura        0.37
 6  Miguel Cabrera     0.36
 7  Carlos Gomez       0.36
 8  Wilin Rosario      0.35
 9  Chris Davis        0.35
10  Nate McLouth       0.35
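
The join, select, sort and head steps are the same for every ranking, so they could be wrapped in a small helper. The sketch below assumes a hypothetical function name `top_by()` and uses made-up stand-in tables (`stats`, `names_tbl`) in place of dat_april and fullname; it needs dplyr 1.0.0 or later for all_of().

```r
library(dplyr)

# Hypothetical stand-ins for dat_april and fullname
stats <- data.frame(
  retroID = c("aaaa001", "bbbb001", "cccc001"),
  SLG     = c(0.40, 0.55, 0.30),
  average = c(0.25, 0.31, 0.28),
  stringsAsFactors = FALSE
)
names_tbl <- data.frame(
  retroID = c("aaaa001", "bbbb001", "cccc001"),
  name    = c("Player A", "Player B", "Player C"),
  stringsAsFactors = FALSE
)

# top_by() is a hypothetical helper, not part of dplyr
top_by <- function(stats, names_tbl, column, n = 10) {
  stats %>%
    inner_join(names_tbl, by = "retroID") %>%  # attach full names
    select(name, all_of(column)) %>%           # keep the name and one statistic
    arrange(desc(.data[[column]])) %>%         # largest value first
    head(n)
}

top_avg <- top_by(stats, names_tbl, "average", n = 2)
print(top_avg)
```

With a helper like this, each ranking becomes a single call such as top_by(dat_april, fullname, "SLG"), and fullname only needs to be read once.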

April slugging percentage ranking.

fullname= fread("../../fullname.csv")
dat_april %>% 
  inner_join(fullname, by = "retroID") %>%
  select(name, SLG) %>% 
  arrange(desc(SLG))%>% 
  head(10) %>%
  xtable %>% print(type="html")
    name              SLG
 1  Justin Upton     0.73
 2  Chris Davis      0.73
 3  Carlos Santana   0.72
 4  Bryce Harper     0.72
 5  Travis Hafner    0.67
 6  Mark Reynolds    0.65
 7  Wilin Rosario    0.65
 8  Dexter Fowler    0.62
 9  Carlos Gomez     0.62
10  Troy Tulowitzki  0.60

April ranking by average pitches seen per at-bat.

fullname= fread("../../fullname.csv")
dat_april %>% 
  inner_join(fullname, by = "retroID") %>%
  select(name, pitches) %>% 
  arrange(desc(pitches))%>% 
  head(10) %>%
  xtable %>% print(type="html")
    name           pitches
 1  Billy Butler      6.53
 2  Lucas Duda        6.43
 3  A.J. Ellis        6.28
 4  Will Venable      6.22
 5  Lance Berkman     6.12
 6  David Wright      6.01
 7  Mark Ellis        5.96
 8  Rickie Weeks      5.96
 9  Joey Votto        5.93
10  Nick Swisher      5.93
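
The pitch counts come from nchar(PITCH_SEQ_TX), which counts every character in the sequence string. If the column follows the Retrosheet pitch-sequence convention, some characters are annotations rather than pitches (for example pickoff throws written as digits, or markers such as `*`, `.`, `>` and `+`), so the raw length may overstate the count. The sketch below shows one way to count only pitch letters. The helper name `count_pitches()` is hypothetical, and the choice of which characters to drop (everything that is not an uppercase letter, plus `N` for "no pitch") is an assumption that should be checked against the Retrosheet documentation.

```r
# count_pitches() is a hypothetical helper.
# Assumption: uppercase letters other than "N" are pitches; digits and
# punctuation are annotations (pickoff throws, runner markers, and so on).
count_pitches <- function(pitch_seq) {
  pitches_only <- gsub("[^A-Z]|N", "", pitch_seq, perl = TRUE)
  nchar(pitches_only)
}

# Made-up example sequences
counts <- count_pitches(c("BBX", "*B1CS", "C.FN>X", ""))
print(counts)  # 3 3 3 0
```

In the pipeline this would replace nchar(PITCH_SEQ_TX) inside mutate(). Note also that the totals are divided by at-bats, while pitches are summed over all plate appearances, which raises the per-at-bat figure further.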

Plots

Pitches seen versus batting average.

dat_april %>% 
  ggplot() + 
  geom_point(aes(x=average, y = pitches)) + 
  ggtitle("average vs pitches")

[Figure: scatter plot, average (x) vs pitches (y)]

There is no apparent relationship.

What about pitches and slugging percentage?

dat_april %>% 
  ggplot() + 
  geom_point(aes(x=SLG, y = pitches)) + 
  ggtitle("SLG vs pitches")

[Figure: scatter plot, SLG (x) vs pitches (y)]

Finally, slugging percentage versus batting average.

dat_april %>% 
  ggplot() + 
  geom_point(aes(x=SLG, y = average)) + 
  ggtitle("average vs SLG")

[Figure: scatter plot, SLG (x) vs average (y)]
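
A correlation matrix is one way to put a number on what the scatter plots suggest. The sketch below runs cor() on the six rows printed by head() earlier (`toy_april` is a hypothetical stand-in for dat_april). Six rows are far too few to conclude anything, so this only illustrates the call; on the real data the same line would be cor(dat_april[, c("SLG", "pitches", "average")]) or similar.

```r
# The six rows shown by head(dat_april), used as a stand-in for the full table
toy_april <- data.frame(
  SLG     = c(0.2857, 0.4479, 0.4352, 0.3146, 0.3585, 0.3300),
  pitches = c(4.824, 4.896, 4.102, 5.315, 4.491, 4.910),
  average = c(0.2527, 0.2917, 0.3241, 0.1798, 0.2453, 0.2600)
)

# Pearson correlation between every pair of columns
cor_mat <- cor(toy_april)
print(round(cor_mat, 2))
```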