Popularity of first names for MLB teams

On his podcast, The PosCast, Joe Posnanski and guest Michael Schur have been talking about players on the Yankees who sound like made-up people, such as Chasen Shreve. This project tries to quantify that idea by looking at how common, or uncommon, the first names on a given team are.

I downloaded the yearly lists of baby names from the US Social Security Administration, which are available at https://www.ssa.gov/oact/babynames/limits.html. This data set includes counts of first names given to US babies going back to 1880, as long as at least 5 babies were given a particular name in that year. I used SQL to join this data set with batting, pitching, and birth year data from the Lahman database to generate plate-appearance weighted lists of name popularity. The resulting data set is available as a gist at https://gist.github.com/bdilday/6a398341cbc4a72bf405e0f39b680dc5
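The SQL itself isn't reproduced here, so the following is only a minimal sketch of how a per-year name fraction could be computed and joined to players with dplyr. The data is invented, the table and column names (ssa, players, birthyear, n) are hypothetical, and taking the fraction over all rows for a year, rather than over one sex, is an assumption.

```r
library(dplyr)

# Toy stand-in for the SSA files: one row per name and birth year (counts are invented).
ssa <- tibble(
  year = c(1980, 1980, 1980, 1990, 1990),
  name = c("John", "Chris", "Doyle", "John", "Chasen"),
  n    = c(600, 300, 100, 450, 50)
)

# Toy stand-in for Lahman player records (hypothetical column names).
players <- tibble(
  playerid  = c("a01", "b01", "c01"),
  namefirst = c("John", "Chasen", "Catfish"),
  birthyear = c(1980, 1990, 1980)
)

# Fraction of babies given each name within each birth year.
name_frac <- ssa %>%
  group_by(year) %>%
  mutate(namefrac = n / sum(n)) %>%
  ungroup() %>%
  select(year, name, namefrac)

# Left join so that names missing from the SSA list are kept, with a fraction of zero.
players_frac <- players %>%
  left_join(name_frac, by = c("namefirst" = "name", "birthyear" = "year")) %>%
  mutate(namefrac = coalesce(namefrac, 0))
```

In this toy example a name that never appears in the SSA rows, like Catfish, ends up with a fraction of zero rather than being dropped.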

Using a database of US names means that the names of some players born in a different country might end up being counted as uncommon, when that’s not really correct or in the spirit of what I’m trying to measure. Since the Lahman database includes country of birth, I can also look at just US-born players.

Here I read in the data from the gist,

library(readr)
library(dplyr)
library(ggplot2)
df <- read_csv('https://gist.githubusercontent.com/bdilday/6a398341cbc4a72bf405e0f39b680dc5/raw/cf7fa1babbea21c1bb09f8866fd1c20475f84a10/mlb_popular_names.csv')

The namefrac column is the fraction of babies given that name in the year of the player’s birth. So for example, Chasen has the value 2.8e-5 in both of Shreve’s seasons, 2014 and 2015,

df %>% filter(namefirst=='Chasen') %>% select(teamid, yearid, namefirst, namelast, namefrac, nameweight)

## # A tibble: 4 x 6
##   teamid yearid namefirst namelast namefrac nameweight
##    <chr>  <int>     <chr>    <chr>    <dbl>      <dbl>
## 1    ATL   2014    Chasen   Shreve  2.8e-05   0.000000
## 2    ATL   2014    Chasen   Shreve  2.8e-05   0.001400
## 3    NYA   2015    Chasen   Shreve  2.8e-05   0.000000
## 4    NYA   2015    Chasen   Shreve  2.8e-05   0.007028

The nameweight column is namefrac times the number of plate appearances. Plate appearances include PA as a batter and batters faced as a pitcher. Those are kept separate, which is why there are two entries for each team-year combination shown above. I left it this way since they’ll get combined anyway when I take the weighted average for a team.
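To illustrate why the separate batting and pitching rows don't matter for the weighted average, here is a small sketch on invented numbers. The average is sum(nameweight) / sum(pa), and it comes out the same whether or not each player is first collapsed to a single row. The data and the names x_separate and x_combined are hypothetical.

```r
library(dplyr)

# Invented rows for one team-season; player p1 appears twice (as batter and as pitcher).
rows <- tibble(
  playerid = c("p1", "p1", "p2"),
  pa       = c(50, 250, 700),
  namefrac = c(0.001, 0.001, 0.01)
) %>%
  mutate(nameweight = namefrac * pa)

# Weighted average taken directly over the separate rows.
x_separate <- sum(rows$nameweight) / sum(rows$pa)

# Same average after first collapsing each player to a single row.
combined <- rows %>%
  group_by(playerid) %>%
  summarise(pa = sum(pa), nameweight = sum(nameweight))
x_combined <- sum(combined$nameweight) / sum(combined$pa)
```

Both routes sum the same numerators and denominators, so the two values agree.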

The highest values of namefrac belong to “John” in the early 20th century, at around 8.5%.

There are a number of players whose names don’t appear in the SSA lists and so have zero weight. The zero-weight names with the most plate appearances are,

df %>% arrange(namefrac, desc(pa)) %>% select(teamid, yearid, namefirst, namelast, pa)

## # A tibble: 129,749 x 5
##    teamid yearid namefirst  namelast    pa
##     <chr>  <int>     <chr>     <chr> <int>
## 1     NY1   1903   Christy Mathewson  1521
## 2     NY1   1908   Christy Mathewson  1499
## 3     SLA   1938      Bobo    Newsom  1475
## 4     NY1   1904   Christy Mathewson  1456
## 5     DET   1944     Dizzy     Trout  1421
## 6     BRO   1923  Burleigh    Grimes  1418
## 7     BOS   1935       Wes   Ferrell  1391
## 8     NY1   1901   Christy Mathewson  1383
## 9     PIT   1928  Burleigh    Grimes  1377
## 10    SLA   1933      Bump    Hadley  1365
## # ... with 129,739 more rows

The highest plate-appearance weighted values are,

df %>% arrange(desc(nameweight)) %>% select(teamid, yearid, namefirst, namelast, pa)

## # A tibble: 129,749 x 5
##    teamid yearid namefirst  namelast    pa
##     <chr>  <int>     <chr>     <chr> <int>
## 1     DET   1904    George    Mullin  1597
## 2     DET   1907    George    Mullin  1493
## 3     DET   1905    George    Mullin  1473
## 4     DET   1906    George    Mullin  1408
## 5     DET   1903    George    Mullin  1345
## 6     CLE   1923    George      Uhle  1548
## 7     PHI   1908    George McQuillan  1409
## 8     CHA   1936      John Whitehead  1035
## 9     BLF   1914    George     Suggs  1278
## 10    CIN   1912    George     Suggs  1256
## # ... with 129,739 more rows

Now I group by team and season to look at the most and least commonly named teams. I generate one data frame with players from all countries and one with players born in the USA only.

team.year <- df %>% group_by(teamid, yearid) %>% mutate(spa=sum(pa), sw=sum(nameweight), x=sw/spa)

team.year.usa <- df %>% filter(birthcountry=='USA') %>% group_by(teamid, yearid) %>% mutate(spa=sum(pa), sw=sum(nameweight), x=sw/spa)
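Because mutate keeps every player row, the team-season value x is repeated on each of that team's rows. If one row per team-season is preferred, a summarise-based variant is a possible alternative. This is a sketch on invented data, not the code behind the tables in this post, and team_year_summary is a hypothetical name.

```r
library(dplyr)

# Invented player rows for two team-seasons.
toy <- tibble(
  teamid     = c("AAA", "AAA", "BBB", "BBB"),
  yearid     = c(2000, 2000, 2000, 2000),
  pa         = c(400, 600, 500, 500),
  nameweight = c(4, 0.6, 1, 0.5)
)

# Collapse to one row per team-season, then compute the PA-weighted average.
team_year_summary <- toy %>%
  group_by(teamid, yearid) %>%
  summarise(spa = sum(pa), sw = sum(nameweight)) %>%
  ungroup() %>%
  mutate(x = sw / spa) %>%
  arrange(desc(x))
```

With this shape, sorting by x lists teams directly instead of listing the players on them.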

The most extreme values, for both commonly and uncommonly named teams, belong to early 20th century teams,

team.year %>% arrange(desc(x)) %>% select(teamid:pa, x)

## Source: local data frame [129,749 x 5]
## Groups: teamid, yearid [2,411]
## 
##    teamid yearid  playerid    pa          x
##     <chr>  <int>     <chr> <int>      <dbl>
## 1     CHA   1901 skopejo01    31 0.08738300
## 2     CHA   1901 skopejo01     0 0.08738300
## 3     CHA   1902 durhajo01    17 0.05222803
## 4     CHA   1902 durhajo01    94 0.05222803
## 5     CHA   1902 hugheed01     4 0.05222803
## 6     DET   1902  eganwi01     8 0.04509903
## 7     DET   1902  eganwi01     0 0.04509903
## 8     DET   1902 mccarar01    29 0.04509903
## 9     DET   1902 mccarar01     0 0.04509903
## 10    DET   1902 mullige01   128 0.04509903
## # ... with 129,739 more rows

team.year %>% arrange(x) %>% select(teamid:pa, x)

## Source: local data frame [129,749 x 5]
## Groups: teamid, yearid [2,411]
## 
##    teamid yearid  playerid    pa     x
##     <chr>  <int>     <chr> <int> <dbl>
## 1     BRO   1902 winhala01     2     0
## 2     BRO   1902 winhala01     0     0
## 3     BSN   1902  haleda01    17     0
## 4     BSN   1902  haleda01     0     0
## 5     PHI   1903 rudoldu01     1     0
## 6     WS1   1903 robinra01   406     0
## 7     NY1   1904  amesre01    43     0
## 8     NY1   1904  amesre01   464     0
## 9     NY1   1904 mathech01   142     0
## 10    NY1   1904 mathech01  1456     0
## # ... with 129,739 more rows

This is because those teams have players born prior to 1880, whose birth years aren’t covered by the SSA data. If I plot the PA-weighted commonality of names against year, and then against total PAs, I get,

g <- team.year %>% ggplot(aes(x=yearid, y=x)) 
print(g + geom_point())

g <- team.year %>% ggplot(aes(x=spa, y=x)) 
print(g + geom_point())
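A rough way to check how much of a team-season the data actually covers is to count the distinct players and total PAs behind each average. The sketch below uses invented rows with the same teamid, yearid, playerid, and pa columns; coverage and n_players are hypothetical names.

```r
library(dplyr)

# Invented rows: one thinly covered early team-season and one well covered later one.
toy <- tibble(
  teamid   = c("AAA", "AAA", "BBB", "BBB", "BBB", "BBB"),
  yearid   = c(1901, 1901, 1961, 1961, 1961, 1961),
  playerid = c("p1", "p1", "p2", "p3", "p3", "p4"),
  pa       = c(31, 0, 600, 20, 800, 550)
)

# Distinct players and total PAs contributing to each team-season.
coverage <- toy %>%
  group_by(teamid, yearid) %>%
  summarise(n_players = n_distinct(playerid), spa = sum(pa)) %>%
  ungroup() %>%
  arrange(spa)
```

Team-seasons near the top of such a list rest on very few plate appearances, so a single name can dominate their average.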

From here on I’m going to somewhat arbitrarily focus on 1961 and later.

Here are the most commonly named teams, 1961 or later,

team.year %>% filter(yearid>=1961) %>% summarise(x=mean(x)) %>% arrange(desc(x))

## Source: local data frame [1,446 x 3]
## Groups: teamid [39]
## 
##    teamid yearid           x
##     <chr>  <int>       <dbl>
## 1     SFN   1994 0.009791392
## 2     KCA   1993 0.009779969
## 3     SFN   1976 0.009673519
## 4     SFN   1978 0.008682502
## 5     KCA   1994 0.008601047
## 6     PHI   1985 0.008424861
## 7     PIT   2001 0.008414243
## 8     MIN   1990 0.008382733
## 9     SEA   2000 0.008245771
## 10    MIN   1992 0.008206225
## # ... with 1,436 more rows

Most commonly named teams, US players only,

team.year.usa %>% filter(yearid>=1961) %>% summarise(x=mean(x)) %>% arrange(desc(x))

## Source: local data frame [1,446 x 3]
## Groups: teamid [39]
## 
##    teamid yearid           x
##     <chr>  <int>       <dbl>
## 1     KCA   1993 0.011419529
## 2     NYA   1998 0.010578570
## 3     KC1   1964 0.010468000
## 4     ATL   2001 0.010457720
## 5     PIT   2001 0.010380819
## 6     SFN   1994 0.010325332
## 7     SLN   2005 0.010264645
## 8     SFN   1976 0.009778363
## 9     SEA   2000 0.009762225
## 10    SEA   2002 0.009712772
## # ... with 1,436 more rows

Least commonly named teams,

team.year %>% filter(yearid>=1961) %>% summarise(x=mean(x)) %>% arrange(x)

## Source: local data frame [1,446 x 3]
## Groups: teamid [39]
## 
##    teamid yearid            x
##     <chr>  <int>        <dbl>
## 1     NYA   1976 0.0007468444
## 2     ATL   1967 0.0007639050
## 3     MIL   2005 0.0009436827
## 4     CLE   1983 0.0010348522
## 5     TOR   1983 0.0010467887
## 6     CHA   1972 0.0010566808
## 7     BOS   1976 0.0010800753
## 8     ATL   1968 0.0011027922
## 9     OAK   1973 0.0011100424
## 10    NYA   1973 0.0011518698
## # ... with 1,436 more rows

And finally, the least commonly named teams, US players only,

team.year.usa %>% filter(yearid>=1961) %>% summarise(x=mean(x)) %>% arrange(x)

## Source: local data frame [1,446 x 3]
## Groups: teamid [39]
## 
##    teamid yearid            x
##     <chr>  <int>        <dbl>
## 1     MIL   2005 0.0008289797
## 2     NYA   1976 0.0008333503
## 3     ATL   1967 0.0008569369
## 4     CHA   1972 0.0011256615
## 5     TOR   1983 0.0011458445
## 6     CLE   1983 0.0011833945
## 7     CHA   1975 0.0012113704
## 8     CHA   1974 0.0012283803
## 9     OAK   1973 0.0012469752
## 10    NYA   1977 0.0012501884
## # ... with 1,436 more rows

Here is a peek at a few of those teams, showing which players got a lot of playing time while having uncommon names,

df %>% filter(teamid=='MIL', yearid==2005, birthcountry=='USA', pa>=450, namefrac<=0.01) %>% select(teamid, yearid, namefirst, namelast, pa, namefrac) %>%  arrange(desc(pa))

## # A tibble: 7 x 6
##   teamid yearid namefirst namelast    pa namefrac
##    <chr>  <int>     <chr>    <chr> <int>    <dbl>
## 1    MIL   2005     Chris  Capuano   949 0.000814
## 2    MIL   2005      Doug    Davis   946 0.000080
## 3    MIL   2005     Brady    Clark   646 0.000300
## 4    MIL   2005       Ben   Sheets   633 0.000348
## 5    MIL   2005      Lyle  Overbay   615 0.000170
## 6    MIL   2005     Geoff  Jenkins   594 0.000024
## 7    MIL   2005      Bill     Hall   540 0.000159

df %>% filter(teamid=='NYA', yearid==1976, birthcountry=='USA', pa>=450, namefrac<=0.01) %>% select(teamid, yearid, namefirst, namelast, pa, namefrac) %>%  arrange(desc(pa))

## # A tibble: 10 x 6
##    teamid yearid namefirst  namelast    pa namefrac
##     <chr>  <int>     <chr>     <chr> <int>    <dbl>
## 1     NYA   1976   Catfish    Hunter  1210 0.000000
## 2     NYA   1976      Dock     Ellis   886 0.000027
## 3     NYA   1976       Roy     White   709 0.004857
## 4     NYA   1976     Chris Chambliss   668 0.000734
## 5     NYA   1976   Thurman    Munson   645 0.000137
## 6     NYA   1976     Graig   Nettles   645 0.000004
## 7     NYA   1976       Ken  Holtzman   627 0.000357
## 8     NYA   1976    Mickey    Rivers   603 0.000315
## 9     NYA   1976     Doyle Alexander   547 0.000292
## 10    NYA   1976    Willie  Randolph   488 0.003358

df %>% filter(teamid=='TOR', yearid==1983, birthcountry=='USA', pa>=450, namefrac<=0.01) %>% select(teamid, yearid, namefirst, namelast, pa, namefrac) %>%  arrange(desc(pa))

## # A tibble: 7 x 6
##   teamid yearid namefirst  namelast    pa namefrac
##    <chr>  <int>     <chr>     <chr> <int>    <dbl>
## 1    TOR   1983      Dave     Stieb  1141 0.001013
## 2    TOR   1983       Jim    Clancy   955 0.000827
## 3    TOR   1983       Jim      Gott   776 0.002664
## 4    TOR   1983    Willie    Upshaw   640 0.002882
## 5    TOR   1983     Lloyd    Moseby   590 0.000877
## 6    TOR   1983     Doyle Alexander   482 0.000292
## 7    TOR   1983     Cliff   Johnson   474 0.000077
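The three queries above differ only in team and season, so they could be wrapped in a small helper. This is a sketch with a hypothetical function name, uncommon_regulars, and hypothetical argument names; the defaults mirror the thresholds used above, and the rows are invented just to show the call.

```r
library(dplyr)

# Hypothetical helper wrapping the repeated filter/select/arrange pipeline.
# Argument names are chosen so they don't clash with the column names used here.
uncommon_regulars <- function(data, team_code, season, min_pa = 450, max_frac = 0.01) {
  data %>%
    filter(teamid == team_code, yearid == season, birthcountry == "USA",
           pa >= min_pa, namefrac <= max_frac) %>%
    select(teamid, yearid, namefirst, namelast, pa, namefrac) %>%
    arrange(desc(pa))
}

# Invented rows, only to demonstrate the call.
toy <- tibble(
  teamid       = c("AAA", "AAA", "AAA", "BBB"),
  yearid       = c(2005, 2005, 2005, 2005),
  namefirst    = c("Ann", "Bob", "Cal", "Dee"),
  namelast     = c("One", "Two", "Three", "Four"),
  birthcountry = c("USA", "USA", "D.R.", "USA"),
  pa           = c(500, 900, 700, 600),
  namefrac     = c(0.0003, 0.0008, 0.0001, 0.02)
)

result <- uncommon_regulars(toy, "AAA", 2005)
```

With the real data, the call would look like uncommon_regulars(df, "MIL", 2005).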