Administrative

Please indicate

  • Roughly how much time you spent on this HW so far: an hour and a half
  • The URL of the RPubs published URL here.
  • What gave you the most trouble: getting the mean slugging percentage by year and league in Problem 2
  • Any comments you have: n/a
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
library(Lahman)
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following object is masked from 'package:purrr':
## 
##     compact
library(ggplot2)
data("Teams")
head(Teams)
##   yearID lgID teamID franchID divID Rank  G Ghome  W  L DivWin WCWin LgWin
## 1   1871   NA    BS1      BNA  <NA>    3 31    NA 20 10   <NA>  <NA>     N
## 2   1871   NA    CH1      CNA  <NA>    2 28    NA 19  9   <NA>  <NA>     N
## 3   1871   NA    CL1      CFC  <NA>    8 29    NA 10 19   <NA>  <NA>     N
## 4   1871   NA    FW1      KEK  <NA>    7 19    NA  7 12   <NA>  <NA>     N
## 5   1871   NA    NY2      NNA  <NA>    5 33    NA 16 17   <NA>  <NA>     N
## 6   1871   NA    PH1      PNA  <NA>    1 28    NA 21  7   <NA>  <NA>     Y
##   WSWin   R   AB   H X2B X3B HR BB SO SB CS HBP SF  RA  ER  ERA CG SHO SV
## 1  <NA> 401 1372 426  70  37  3 60 19 73 NA  NA NA 303 109 3.55 22   1  3
## 2  <NA> 302 1196 323  52  21 10 60 22 69 NA  NA NA 241  77 2.76 25   0  1
## 3  <NA> 249 1186 328  35  40  7 26 25 18 NA  NA NA 341 116 4.11 23   0  0
## 4  <NA> 137  746 178  19   8  2 33  9 16 NA  NA NA 243  97 5.17 19   1  0
## 5  <NA> 302 1404 403  43  21  1 33 15 46 NA  NA NA 313 121 3.72 32   1  0
## 6  <NA> 376 1281 410  66  27  9 46 23 56 NA  NA NA 266 137 4.95 27   0  0
##   IPouts  HA HRA BBA SOA   E DP   FP                    name
## 1    828 367   2  42  23 225 NA 0.83    Boston Red Stockings
## 2    753 308   6  28  22 218 NA 0.82 Chicago White Stockings
## 3    762 346  13  53  34 223 NA 0.81  Cleveland Forest Citys
## 4    507 261   5  21  17 163 NA 0.80    Fort Wayne Kekiongas
## 5    879 373   7  42  22 227 NA 0.83        New York Mutuals
## 6    747 329   3  53  16 194 NA 0.84  Philadelphia Athletics
##                           park attendance BPF PPF teamIDBR teamIDlahman45
## 1          South End Grounds I         NA 103  98      BOS            BS1
## 2      Union Base-Ball Grounds         NA 104 102      CHI            CH1
## 3 National Association Grounds         NA  96 100      CLE            CL1
## 4               Hamilton Field         NA 101 107      KEK            FW1
## 5     Union Grounds (Brooklyn)         NA  90  88      NYU            NY2
## 6     Jefferson Street Grounds         NA 102  98      ATH            PH1
##   teamIDretro
## 1         BS1
## 2         CH1
## 3         CL1
## 4         FW1
## 5         NY2
## 6         PH1

Problem 1.

Define two new variables in the Teams data frame: batting average (BA) and slugging percentage (SLG). Batting average is the ratio of hits (H) to at-bats (AB), and slugging percentage is the total bases divided by at-bats. To compute the total bases, you get 1 for a single, 2 for a double, 3 for a triple, and 4 for a home run.

slg_pct <-
Teams %>%
  mutate(BA = H/AB, SLG = (X2B*2 + X3B*3 + HR*4 + (H-X2B-X3B-HR)) / AB)
  
head(slg_pct)
##   yearID lgID teamID franchID divID Rank  G Ghome  W  L DivWin WCWin LgWin
## 1   1871   NA    BS1      BNA  <NA>    3 31    NA 20 10   <NA>  <NA>     N
## 2   1871   NA    CH1      CNA  <NA>    2 28    NA 19  9   <NA>  <NA>     N
## 3   1871   NA    CL1      CFC  <NA>    8 29    NA 10 19   <NA>  <NA>     N
## 4   1871   NA    FW1      KEK  <NA>    7 19    NA  7 12   <NA>  <NA>     N
## 5   1871   NA    NY2      NNA  <NA>    5 33    NA 16 17   <NA>  <NA>     N
## 6   1871   NA    PH1      PNA  <NA>    1 28    NA 21  7   <NA>  <NA>     Y
##   WSWin   R   AB   H X2B X3B HR BB SO SB CS HBP SF  RA  ER  ERA CG SHO SV
## 1  <NA> 401 1372 426  70  37  3 60 19 73 NA  NA NA 303 109 3.55 22   1  3
## 2  <NA> 302 1196 323  52  21 10 60 22 69 NA  NA NA 241  77 2.76 25   0  1
## 3  <NA> 249 1186 328  35  40  7 26 25 18 NA  NA NA 341 116 4.11 23   0  0
## 4  <NA> 137  746 178  19   8  2 33  9 16 NA  NA NA 243  97 5.17 19   1  0
## 5  <NA> 302 1404 403  43  21  1 33 15 46 NA  NA NA 313 121 3.72 32   1  0
## 6  <NA> 376 1281 410  66  27  9 46 23 56 NA  NA NA 266 137 4.95 27   0  0
##   IPouts  HA HRA BBA SOA   E DP   FP                    name
## 1    828 367   2  42  23 225 NA 0.83    Boston Red Stockings
## 2    753 308   6  28  22 218 NA 0.82 Chicago White Stockings
## 3    762 346  13  53  34 223 NA 0.81  Cleveland Forest Citys
## 4    507 261   5  21  17 163 NA 0.80    Fort Wayne Kekiongas
## 5    879 373   7  42  22 227 NA 0.83        New York Mutuals
## 6    747 329   3  53  16 194 NA 0.84  Philadelphia Athletics
##                           park attendance BPF PPF teamIDBR teamIDlahman45
## 1          South End Grounds I         NA 103  98      BOS            BS1
## 2      Union Base-Ball Grounds         NA 104 102      CHI            CH1
## 3 National Association Grounds         NA  96 100      CLE            CL1
## 4               Hamilton Field         NA 101 107      KEK            FW1
## 5     Union Grounds (Brooklyn)         NA  90  88      NYU            NY2
## 6     Jefferson Street Grounds         NA 102  98      ATH            PH1
##   teamIDretro        BA       SLG
## 1         BS1 0.3104956 0.4220117
## 2         CH1 0.2700669 0.3737458
## 3         CL1 0.2765599 0.3912310
## 4         FW1 0.2386059 0.2935657
## 5         NY2 0.2870370 0.3497151
## 6         PH1 0.3200625 0.4348165
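
As a sanity check on the formula, the first row can be worked by hand. The sketch below copies the 1871 Boston Red Stockings counts from the `head(Teams)` output above; the variable names `singles`, `total_bases`, and `total_bases_short` are introduced here for illustration and are not part of the assignment.

```r
# Counts copied from row 1 of head(Teams): 1871 Boston Red Stockings.
H <- 426; X2B <- 70; X3B <- 37; HR <- 3; AB <- 1372

# Singles are the hits that were not doubles, triples, or home runs.
singles <- H - X2B - X3B - HR

# Total bases: weight each kind of hit by the number of bases it is worth.
total_bases <- singles + 2 * X2B + 3 * X3B + 4 * HR

# Equivalent shortcut: every hit is worth one base, plus extras for
# doubles (+1), triples (+2), and home runs (+3).
total_bases_short <- H + X2B + 2 * X3B + 3 * HR

BA  <- H / AB
SLG <- total_bases / AB
print(round(c(BA = BA, SLG = SLG), 7))
```

Both values should agree with the BA and SLG columns printed for row 1 above.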

Problem 2.

Plot a time series of SLG since 1954 by league (lgID). Is slugging percentage typically higher in the American League (AL) or the National League?

year1954 <-
  filter(slg_pct, yearID >= 1954) 
  

avg_slg <- ddply(year1954, .(yearID, lgID), summarize, AvgSlg = mean(SLG))
head(avg_slg)
##   yearID lgID    AvgSlg
## 1   1954   AL 0.3732352
## 2   1954   NL 0.4067245
## 3   1955   AL 0.3810866
## 4   1955   NL 0.4068172
## 5   1956   AL 0.3935263
## 6   1956   NL 0.4008650
ggplot(avg_slg, mapping = aes(x = yearID, y = AvgSlg, color = lgID)) +
  geom_point() +
  labs(x = "Year", y = "Average Slugging Percentage", title = "Average Slugging Percentage vs. Year")

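## I first tried dplyr functions but kept getting "Error in order(yearID): object 'yearID' not found".
## The likely cause is that plyr was loaded after dplyr (see the warning at the top), so plyr's
## summarize() and arrange() masked the dplyr versions and ignored the grouping.
## After some searching I used plyr's ddply() above instead. The original attempt was: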
##slg_pct %>%
  ##group_by(lgID, yearID) %>%
  ##filter(yearID >= 1954) %>%
  ##select(yearID, SLG) %>%
  ##summarize(AvgSlg = mean(SLG, na.rm = TRUE)) %>%
  ##arrange(yearID)
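
The commented-out attempt above can probably be made to work by calling the dplyr verbs with an explicit `dplyr::` prefix, so that it does not matter whether plyr has masked them. This is a minimal sketch rather than the submitted solution: `toy` is a hypothetical stand-in for `slg_pct`, its SLG values are made up for illustration, and the `.groups` argument assumes dplyr 1.0 or later.

```r
library(dplyr)

# Toy stand-in for slg_pct; these SLG values are invented, not real statistics.
toy <- data.frame(
  yearID = c(1953L, 1954L, 1954L, 1954L, 1954L, 1955L, 1955L),
  lgID   = c("AL", "AL", "AL", "NL", "NL", "AL", "NL"),
  SLG    = c(0.380, 0.370, 0.380, 0.400, 0.410, 0.385, 0.395)
)

# Namespace-qualified verbs always resolve to dplyr, even if plyr was
# attached afterwards and masked summarize() and arrange().
avg_slg_dplyr <- toy %>%
  dplyr::filter(yearID >= 1954) %>%
  dplyr::group_by(yearID, lgID) %>%
  dplyr::summarize(AvgSlg = mean(SLG, na.rm = TRUE), .groups = "drop") %>%
  dplyr::arrange(yearID, lgID)

print(avg_slg_dplyr)
```

Filtering before grouping keeps the pipeline simple, and there is no need to `select()` columns before summarizing. Loading plyr before the tidyverse, as the startup warning suggests, would be another way to avoid the masking.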

Until shortly after 1970, the National League generally had the higher slugging percentage; after that point the American League seems to be higher. The designated hitter (DH) was introduced in the AL in 1973, which is probably why the AL has had a higher SLG since around that year.

Problem 3.

Display the top 15 teams ranked in terms of slugging percentage in MLB history. Repeat this using teams since 1969.

##all time 
slg_pct %>%
  group_by(teamIDBR, yearID) %>%
  select(SLG) %>%
  arrange(desc(SLG)) %>%
  head(n=15)
## Adding missing grouping variables: `teamIDBR`, `yearID`
## Source: local data frame [15 x 3]
## Groups: teamIDBR, yearID [15]
## 
##    teamIDBR yearID       SLG
##       <chr>  <int>     <dbl>
## 1       BOS   2003 0.4908996
## 2       NYY   1927 0.4890593
## 3       NYY   1930 0.4877019
## 4       SEA   1997 0.4845030
## 5       BSN   1894 0.4843345
## 6       CLE   1994 0.4838389
## 7       SEA   1996 0.4835921
## 8       NYY   1936 0.4834556
## 9       COL   2001 0.4829525
## 10      BLN   1894 0.4828089
## 11      CHC   1930 0.4809174
## 12      CLE   1995 0.4787192
## 13      TEX   1999 0.4786763
## 14      COL   1997 0.4777798
## 15      NYY   2009 0.4775618
##since 1969
slg_pct %>% 
  group_by(teamIDBR, yearID) %>%
  filter(yearID >= 1969) %>%
  select(SLG) %>%
  arrange(desc(SLG)) %>%
  head(n=15)
## Adding missing grouping variables: `teamIDBR`, `yearID`
## Source: local data frame [15 x 3]
## Groups: teamIDBR, yearID [15]
## 
##    teamIDBR yearID       SLG
##       <chr>  <int>     <dbl>
## 1       BOS   2003 0.4908996
## 2       SEA   1997 0.4845030
## 3       CLE   1994 0.4838389
## 4       SEA   1996 0.4835921
## 5       COL   2001 0.4829525
## 6       CLE   1995 0.4787192
## 7       TEX   1999 0.4786763
## 8       COL   1997 0.4777798
## 9       NYY   2009 0.4775618
## 10      HOU   2000 0.4766607
## 11      ATL   2003 0.4754850
## 12      CLE   1996 0.4752684
## 13      ANA   2000 0.4724591
## 14      COL   1996 0.4724508
## 15      BOS   2004 0.4723776

Problem 4.

The Angels have at times been called the California Angels (CAL), the Anaheim Angels (ANA), and the Los Angeles Angels (LAA). Find the 10 most successful seasons in Angels history. Have they ever won the World Series?

##10 most successful seasons
slg_pct %>%
  group_by(teamIDBR) %>%
  filter(teamIDBR == "CAL" | teamIDBR == "ANA" | teamIDBR == "LAA") %>%
  mutate(WinPct = W / (W+L)) %>%
  select(W, L, WinPct, WSWin) %>%
  arrange(desc(WinPct)) %>%
  head(n=10)
## Adding missing grouping variables: `teamIDBR`
## Source: local data frame [10 x 5]
## Groups: teamIDBR [3]
## 
##    teamIDBR     W     L    WinPct WSWin
##       <chr> <int> <int>     <dbl> <chr>
## 1       LAA   100    62 0.6172840     N
## 2       ANA    99    63 0.6111111     Y
## 3       LAA    98    64 0.6049383     N
## 4       LAA    97    65 0.5987654     N
## 5       LAA    95    67 0.5864198     N
## 6       LAA    94    68 0.5802469     N
## 7       CAL    93    69 0.5740741     N
## 8       CAL    92    70 0.5679012     N
## 9       ANA    92    70 0.5679012     N
## 10      CAL    91    71 0.5617284     N
##see if they have any other WS wins
slg_pct %>%
  group_by(teamIDBR) %>%
  filter(teamIDBR == "CAL" | teamIDBR == "ANA" | teamIDBR == "LAA") %>%
  mutate(WinPct = W / (W+L)) %>%
  select(yearID, W, L, WinPct, WSWin) %>%
  arrange(desc(WinPct)) %>%
  filter(WSWin == "Y")
## Adding missing grouping variables: `teamIDBR`
## Source: local data frame [1 x 6]
## Groups: teamIDBR [1]
## 
##   teamIDBR yearID     W     L    WinPct WSWin
##      <chr>  <int> <int> <int>     <dbl> <chr>
## 1      ANA   2002    99    63 0.6111111     Y

They have won the World Series only once, as the Anaheim Angels, in 2002.
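
The three-way `==` filter used in this problem could also be written with `%in%`, which is easier to extend if another team code is needed. The sketch below is an illustration under stated assumptions: `teams_small` and `angels_ids` are hypothetical names, and the four rows are copied from output shown earlier in this document rather than drawn from the full `Teams` table.

```r
library(dplyr)

# Rows copied from output shown earlier in this document (not the full data set).
teams_small <- data.frame(
  teamIDBR = c("LAA", "ANA", "CAL", "BOS"),
  W        = c(100L, 99L, 93L, 20L),
  L        = c(62L, 63L, 69L, 10L),
  WSWin    = c("N", "Y", "N", NA)
)

# All team codes the Angels franchise has used, per the problem statement.
angels_ids <- c("CAL", "ANA", "LAA")

# Keep only Angels seasons, compute winning percentage, and sort best first.
angels <- teams_small %>%
  dplyr::filter(teamIDBR %in% angels_ids) %>%
  dplyr::mutate(WinPct = W / (W + L)) %>%
  dplyr::arrange(dplyr::desc(WinPct))

# Seasons that ended in a World Series win; rows with NA in WSWin are dropped.
ws_wins <- dplyr::filter(angels, WSWin == "Y")

print(angels)
print(ws_wins)
```

No `group_by()` is needed here, since every step works row by row; leaving it out also avoids the "Adding missing grouping variables" message.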