Administrative

Please indicate

  • Roughly how much time you spent on this HW so far: approx. 4 hours
  • The URL of the RPubs published URL here.
  • What gave you the most trouble: For Q3, the question was somewhat ambiguous, so I included answers for both interpretations. Excluding the teams that existed before 1969 was also troublesome until I figured out how to do it.
  • Any comments you have:
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(Lahman)
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats

Problem 1.

Define two new variables in the Teams data frame: batting average (BA) and slugging percentage (SLG). Batting average is the ratio of hits (H) to at-bats (AB), and slugging percentage is the total bases divided by at-bats. To compute the total bases, you get 1 for a single, 2 for a double, 3 for a triple, and 4 for a home run.

Teams1 <- Teams %>%
  mutate(BA = H / AB) %>%
  mutate(SLG = ((H - X2B - X3B - HR) + 2 * X2B + 3 * X3B + 4 * HR) / AB)
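
As a quick sanity check of the formula above, here is a minimal sketch on a made-up season line (the numbers are illustrative and are not taken from the Lahman data). It also shows that the total-bases expression can be shortened, since every hit counts once and extra-base hits only add their surplus bases.

```r
# Toy season line (made-up numbers, not taken from Lahman)
H   <- 150  # hits
X2B <- 30   # doubles
X3B <- 5    # triples
HR  <- 20   # home runs
AB  <- 500  # at-bats

# Singles are the hits that are not doubles, triples, or home runs
singles     <- H - X2B - X3B - HR
total_bases <- singles + 2 * X2B + 3 * X3B + 4 * HR

BA  <- H / AB            # 0.3
SLG <- total_bases / AB  # 0.5

# Equivalent shortcut: each hit counts once, extra-base hits add the surplus
total_bases_short <- H + X2B + 2 * X3B + 3 * HR
stopifnot(total_bases == total_bases_short)
```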

Problem 2.

Plot a time series of SLG since 1954 by league (lgID). Is slugging percentage typically higher in the American League (AL) or the National League?

Teams2 <- Teams1 %>%
  select(SLG, yearID, lgID, teamID) %>%
  # Note: yearID > 1954 starts the series at 1955; use yearID >= 1954 to include the 1954 season
  filter(!is.na(yearID) & yearID > 1954) %>%
  filter(lgID == "NL" | lgID == "AL") %>%
  group_by(lgID, yearID) %>%
  summarize(avg_SLG = mean(SLG))
head(Teams2)
## Source: local data frame [6 x 3]
## Groups: lgID [1]
## 
##     lgID yearID   avg_SLG
##   <fctr>  <int>     <dbl>
## 1     AL   1955 0.3810866
## 2     AL   1956 0.3935263
## 3     AL   1957 0.3823205
## 4     AL   1958 0.3829396
## 5     AL   1959 0.3839940
## 6     AL   1960 0.3874953
ggplot(Teams2, aes(x = yearID, y = avg_SLG, color = lgID)) +
  geom_point() + geom_line()

The National League appears to have led in slugging percentage during the 1960s. After a back-and-forth period in the 1970s, the American League has been consistently ahead of the National League.
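
The league values plotted here are unweighted means of the team-level SLG. A minimal sketch, on two made-up teams, of how that can differ slightly from pooling total bases and at-bats across the league is shown below; the numbers and variable names are assumptions for illustration only.

```r
# Two hypothetical teams in one league-season (made-up numbers)
total_bases <- c(600, 450)
at_bats     <- c(1500, 1000)

# Approach used in this report: average the team-level SLG values
team_slg       <- total_bases / at_bats   # 0.40 and 0.45
mean_of_ratios <- mean(team_slg)          # 0.425

# Alternative: pool total bases and at-bats before dividing
pooled_slg <- sum(total_bases) / sum(at_bats)  # 0.42
```

With real team data the at-bat totals are similar from team to team, so the two approaches should usually give close results.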


Problem 3.

Display the top 15 teams ranked in terms of slugging percentage in MLB history. Repeat this using teams since 1969.

For this problem, I interpreted the question in two different ways.

Here is the first one:

for (i in 1969:1970) {
  print(
    Teams1 %>%
      filter(!is.na(yearID) & yearID == i) %>%
      arrange(-SLG) %>%
      select(yearID, lgID, name, SLG) %>%
      head(15)
  )
}
##    yearID lgID                  name       SLG
## 1    1969   NL       Cincinnati Reds 0.4222577
## 2    1969   AL        Boston Red Sox 0.4149982
## 3    1969   AL     Baltimore Orioles 0.4135556
## 4    1969   AL       Minnesota Twins 0.4084904
## 5    1969   NL    Pittsburgh Pirates 0.3977959
## 6    1969   AL        Detroit Tigers 0.3874288
## 7    1969   NL          Chicago Cubs 0.3835443
## 8    1969   NL        Atlanta Braves 0.3796703
## 9    1969   AL   Washington Senators 0.3781898
## 10   1969   AL     Oakland Athletics 0.3758461
## 11   1969   NL Philadelphia Phillies 0.3720414
## 12   1969   NL  San Francisco Giants 0.3609792
## 13   1969   NL   St. Louis Cardinals 0.3592847
## 14   1969   NL   Los Angeles Dodgers 0.3588214
## 15   1969   NL        Montreal Expos 0.3585532
##    yearID lgID                 name       SLG
## 1    1970   NL      Cincinnati Reds 0.4357401
## 2    1970   AL       Boston Red Sox 0.4276423
## 3    1970   NL         Chicago Cubs 0.4146786
## 4    1970   NL San Francisco Giants 0.4091072
## 5    1970   NL   Pittsburgh Pirates 0.4057123
## 6    1970   NL       Atlanta Braves 0.4035341
## 7    1970   AL      Minnesota Twins 0.4028816
## 8    1970   AL    Baltimore Orioles 0.4010821
## 9    1970   AL    Cleveland Indians 0.3935567
## 10   1970   AL    Oakland Athletics 0.3919271
## 11   1970   NL     San Diego Padres 0.3911540
## 12   1970   NL       Houston Astros 0.3905633
## 13   1970   NL  Los Angeles Dodgers 0.3822690
## 14   1970   NL  St. Louis Cardinals 0.3789770
## 15   1970   AL       Detroit Tigers 0.3736284

Note: I only looped over 1969 and 1970, since printing every year from 1969 to 2015 produces a very long list.
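
The per-year ranking can also be written without a loop. The sketch below is one possible alternative, assuming dplyr 1.0.0 or later for slice_max(); it runs on a small made-up data frame (toy, with hypothetical teams A, B, and C) rather than on Teams1, and keeps the top 2 rows per year instead of the top 15.

```r
library(dplyr)

# Made-up data: three hypothetical teams over two seasons
toy <- data.frame(
  yearID = c(1969, 1969, 1969, 1970, 1970, 1970),
  name   = c("A", "B", "C", "A", "B", "C"),
  SLG    = c(0.42, 0.41, 0.36, 0.43, 0.38, 0.41),
  stringsAsFactors = FALSE
)

# Keep the n highest-SLG rows within each year, with no explicit loop
top2 <- toy %>%
  group_by(yearID) %>%
  slice_max(SLG, n = 2) %>%
  ungroup()

print(top2)
```

On the real data, the same pattern would use Teams1 in place of toy and n = 15.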

Taking the average SLG of each team across all of its seasons:

Teams1 %>%
  filter(!is.na(yearID)) %>%
  group_by(name) %>%
  summarize(mean_SLG = mean(SLG)) %>%
  arrange(-mean_SLG) %>%
  head(15)
## # A tibble: 15 × 2
##                             name  mean_SLG
##                            <chr>     <dbl>
## 1               Colorado Rockies 0.4425149
## 2                 Anaheim Angels 0.4223234
## 3             Cincinnati Redlegs 0.4199253
## 4           Boston Red Stockings 0.4165024
## 5              Toronto Blue Jays 0.4164657
## 6           Arizona Diamondbacks 0.4152269
## 7               Milwaukee Braves 0.4135638
## 8  Los Angeles Angels of Anaheim 0.4133163
## 9                  Texas Rangers 0.4132939
## 10              New York Yankees 0.4104905
## 11          Tampa Bay Devil Rays 0.4059601
## 12                Tampa Bay Rays 0.4052192
## 13               Florida Marlins 0.4051400
## 14              Seattle Mariners 0.4037377
## 15               Minnesota Twins 0.3993335

My second interpretation of this question is to filter the rows to seasons since 1969, and then to drop the teams that already existed before 1969.

Teams1 %>%
  filter(!is.na(yearID)) %>%
  filter(yearID >= 1969) %>%
  group_by(yearID) %>%
  arrange(-SLG) %>%
  select(yearID, lgID, name, SLG) %>%
  head(15)
## Source: local data frame [15 x 4]
## Groups: yearID [10]
## 
##    yearID   lgID              name       SLG
##     <int> <fctr>             <chr>     <dbl>
## 1    2003     AL    Boston Red Sox 0.4908996
## 2    1997     AL  Seattle Mariners 0.4845030
## 3    1994     AL Cleveland Indians 0.4838389
## 4    1996     AL  Seattle Mariners 0.4835921
## 5    2001     NL  Colorado Rockies 0.4829525
## 6    1995     AL Cleveland Indians 0.4787192
## 7    1999     AL     Texas Rangers 0.4786763
## 8    1997     NL  Colorado Rockies 0.4777798
## 9    2009     AL  New York Yankees 0.4775618
## 10   2000     NL    Houston Astros 0.4766607
## 11   2003     NL    Atlanta Braves 0.4754850
## 12   1996     AL Cleveland Indians 0.4752684
## 13   2000     AL    Anaheim Angels 0.4724591
## 14   1996     NL  Colorado Rockies 0.4724508
## 15   2004     AL    Boston Red Sox 0.4723776

Then, filtering out the teams that existed before 1969:

Teams_before_1969 <-
  Teams1 %>%
  select(yearID, teamID) %>%
  filter(yearID < 1969)

Teams1 %>%
  anti_join(Teams_before_1969, by = "teamID") %>%
  group_by(yearID) %>%
  arrange(-SLG) %>%
  select(yearID, lgID, teamID, name, SLG) %>%
  head(15)
## Source: local data frame [15 x 5]
## Groups: yearID [9]
## 
##    yearID   lgID teamID              name       SLG
##     <int> <fctr> <fctr>             <chr>     <dbl>
## 1    1997     AL    SEA  Seattle Mariners 0.4845030
## 2    1996     AL    SEA  Seattle Mariners 0.4835921
## 3    2001     NL    COL  Colorado Rockies 0.4829525
## 4    1999     AL    TEX     Texas Rangers 0.4786763
## 5    1997     NL    COL  Colorado Rockies 0.4777798
## 6    2000     AL    ANA    Anaheim Angels 0.4724591
## 7    1996     NL    COL  Colorado Rockies 0.4724508
## 8    1999     NL    COL  Colorado Rockies 0.4716585
## 9    1995     NL    COL  Colorado Rockies 0.4707649
## 10   2001     AL    TEX     Texas Rangers 0.4707124
## 11   2000     AL    TOR Toronto Blue Jays 0.4692619
## 12   1996     AL    TEX     Texas Rangers 0.4686075
## 13   2005     AL    TEX     Texas Rangers 0.4683345
## 14   1998     AL    SEA  Seattle Mariners 0.4676617
## 15   2006     AL    TOR Toronto Blue Jays 0.4628306

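For the second method, we see younger teams, such as the Seattle Mariners (data available from 1977), the Texas Rangers (from 1972), the Colorado Rockies (from 1993), and the Toronto Blue Jays (from 1977). Note that the Anaheim Angels (ANA) also appear: the anti-join matches on teamID, and ANA has no rows before 1969, even though the franchise played earlier under other IDs.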
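
To make the behavior of the anti-join concrete, here is a minimal sketch on a tiny made-up table (toy_teams is a hypothetical stand-in for Teams1 with only two columns). It shows that any teamID with at least one pre-1969 row is dropped entirely, while a teamID that first appears later is kept, even if the franchise is older.

```r
library(dplyr)

# Made-up stand-in for Teams1: one row per team-season
toy_teams <- data.frame(
  yearID = c(1961, 1965, 2000, 1997, 1950, 2003),
  teamID = c("LAA", "CAL", "ANA", "SEA", "BOS", "BOS"),
  stringsAsFactors = FALSE
)

# Every teamID that has at least one season before 1969
before_1969 <- toy_teams %>%
  filter(yearID < 1969) %>%
  select(teamID)

# Keep only rows whose teamID never appears before 1969
new_teams <- toy_teams %>%
  anti_join(before_1969, by = "teamID")

print(new_teams)  # the ANA and SEA rows remain
```

Here the 2003 BOS row is removed because BOS also has a 1950 row, while the ANA row survives because that ID has no earlier seasons in the table.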


Problem 4.

The Angels have at times been called the California Angels (CAL), the Anaheim Angels (ANA), and the Los Angeles Angels (LAA). Find the 10 most successful seasons in Angels history. Have they ever won the World Series?

Teams %>%
  filter(teamID == "CAL" | teamID == "ANA" | teamID == "LAA") %>%
  mutate(win_rate = W / L) %>%
  select(yearID, teamID, Rank, W, L, win_rate, WSWin) %>%
  arrange(Rank, -win_rate) %>%
  head(10)
##    yearID teamID Rank   W  L win_rate WSWin
## 1    2008    LAA    1 100 62 1.612903     N
## 2    2014    LAA    1  98 64 1.531250     N
## 3    2009    LAA    1  97 65 1.492308     N
## 4    2005    LAA    1  95 67 1.417910     N
## 5    2007    LAA    1  94 68 1.382353     N
## 6    1982    CAL    1  93 69 1.347826     N
## 7    1986    CAL    1  92 70 1.314286     N
## 8    2004    ANA    1  92 70 1.314286     N
## 9    1979    CAL    1  88 74 1.189189     N
## 10   2002    ANA    2  99 63 1.571429     Y
Teams %>%
  filter(teamID == "CAL" | teamID == "ANA" | teamID == "LAA") %>%
  select(yearID, teamID, Rank, W, L, DivWin, WSWin) %>%
  arrange(desc(WSWin)) %>%
  head(10)
##    yearID teamID Rank  W  L DivWin WSWin
## 1    2002    ANA    2 99 63      N     Y
## 2    1961    LAA    8 70 91   <NA>     N
## 3    1962    LAA    3 86 76   <NA>     N
## 4    1963    LAA    9 70 91   <NA>     N
## 5    1964    LAA    5 82 80   <NA>     N
## 6    1965    CAL    7 75 87   <NA>     N
## 7    1966    CAL    6 80 82   <NA>     N
## 8    1967    CAL    5 84 77   <NA>     N
## 9    1968    CAL    8 67 95   <NA>     N
## 10   1969    CAL    3 71 91      N     N

The 10 most successful seasons for the Angels are shown in the first table above. The win_rate column is the ratio of total wins W to total losses L in that season (a win-loss ratio rather than a winning percentage). The rows are sorted first by Rank and then by win_rate, so that seasons with the same Rank are ordered by how successful they were.
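
For reference, here is a small worked example of the win_rate calculation next to the more common winning percentage W / (W + L), using the 2008 record from the table (100 wins, 62 losses). The variable names are chosen for illustration.

```r
# 2008 Angels record, as listed in the table above
W <- 100
L <- 62

# Win-loss ratio, which is what the win_rate column holds
win_loss_ratio <- W / L   # about 1.6129

# Winning percentage: the share of games won
win_pct <- W / (W + L)    # about 0.617

# The two are linked by win_pct = ratio / (1 + ratio)
stopifnot(abs(win_pct - win_loss_ratio / (1 + win_loss_ratio)) < 1e-12)
```

Because the winning percentage is an increasing function of the win-loss ratio, sorting by either measure orders the seasons the same way.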

The 2008 season was the most successful, with Rank 1 and a win-loss ratio of 1.6129032, followed by 2014, 2009, and so on, as shown in the table. In total, the Angels have finished first in their division 9 times.

The second table shows that the Angels have won the World Series only once, in 2002.