On his podcast, the PosCast, Joe Posnanski and guest Michael Schur have been talking about the players on the Yankees who sound like made-up people, like Chasen Shreve. This project tries to quantify that idea by looking at how common, or uncommon, the names on a given team are.
I downloaded the yearly lists of baby names from the US Social Security Administration, which are available from https://www.ssa.gov/oact/babynames/limits.html. This data set includes a count of first names given to US babies going back to 1880, as long as at least 5 babies were given a particular name. I used SQL to join this data set with batting, pitching, and birth year data from the Lahman database to generate plate-appearance weighted lists of name popularity. That data set is available as a gist at https://gist.github.com/bdilday/6a398341cbc4a72bf405e0f39b680dc5
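The join described above can also be sketched in R rather than SQL. This is only an illustration on invented numbers, not the query that produced the gist: the data frames `ssa` and `players`, the columns `n_babies` and `birthyear`, and the choice of denominator (all babies on that year's list) are assumptions made for the example.

```r
library(dplyr)

# Toy stand-in for the SSA lists: one row per name and birth year.
# The counts are invented for illustration.
ssa <- data.frame(
  birthyear = c(1989, 1989, 1989),
  name      = c("Michael", "Chris", "Chasen"),
  n_babies  = c(600, 390, 10),
  stringsAsFactors = FALSE
)

# Toy stand-in for the player data (hypothetical columns and values).
players <- data.frame(
  namefirst = c("Chasen", "Chris", "Catfish"),
  birthyear = c(1989, 1989, 1989),
  pa        = c(250, 500, 100),
  stringsAsFactors = FALSE
)

# Fraction of listed babies with each name, per birth year.
# Assumption: the denominator is the total over all names on the list.
name_fracs <- ssa %>%
  group_by(birthyear) %>%
  mutate(namefrac = n_babies / sum(n_babies)) %>%
  ungroup() %>%
  select(birthyear, name, namefrac)

# A left join keeps players whose names are missing from the list;
# their missing fraction is then set to zero.
joined <- players %>%
  left_join(name_fracs, by = c("namefirst" = "name", "birthyear" = "birthyear")) %>%
  mutate(
    namefrac   = coalesce(namefrac, 0),
    nameweight = namefrac * pa
  )

print(joined)
```

The left join is what produces zero-weight names: a first name with no matching row in the list comes back as `NA` and is replaced by 0.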
Using a database of US names means that the names of some players born in other countries might be counted as uncommon, when that’s not really correct or in the spirit of what I’m trying to measure. Since the Lahman database includes country of birth, I can also look at just US-born players.
Here I read in the data from the gist,
library(readr)
library(dplyr)
library(ggplot2)
df <- read_csv('https://gist.githubusercontent.com/bdilday/6a398341cbc4a72bf405e0f39b680dc5/raw/cf7fa1babbea21c1bb09f8866fd1c20475f84a10/mlb_popular_names.csv')
The namefrac column is the fraction of babies given that name in the year of the player’s birth. Because it depends only on birth year, it is the same in every season; for example, Chasen has the value 2.8e-5 for both 2014 and 2015,
df %>% filter(namefirst=='Chasen') %>% select(teamid, yearid, namefirst, namelast, namefrac, nameweight)
## # A tibble: 4 x 6
## teamid yearid namefirst namelast namefrac nameweight
## <chr> <int> <chr> <chr> <dbl> <dbl>
## 1 ATL 2014 Chasen Shreve 2.8e-05 0.000000
## 2 ATL 2014 Chasen Shreve 2.8e-05 0.001400
## 3 NYA 2015 Chasen Shreve 2.8e-05 0.000000
## 4 NYA 2015 Chasen Shreve 2.8e-05 0.007028
The nameweight column is namefrac times the number of plate appearances. Plate appearances include PA as a batter and batters faced as a pitcher. Those are kept in separate rows, which is why there are two entries for each team-year combination shown above. I left it this way since they’ll get combined anyway when I take the weighted average for a team.
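As a quick check of that last point, here is a small sketch with invented PA values (the `role` column and the numbers are assumptions for illustration, not values from the gist). It shows that the team value is the same whether a player's batting and pitching rows are kept separate or summed first, and that both equal the PA-weighted mean of namefrac.

```r
# Toy rows for one team-season; the PA values are invented.
toy <- data.frame(
  namefirst = c("Chasen", "Chasen", "John"),
  role      = c("batting", "pitching", "batting"),
  namefrac  = c(2.8e-5, 2.8e-5, 0.02),
  pa        = c(0, 250, 600),
  stringsAsFactors = FALSE
)
toy$nameweight <- toy$namefrac * toy$pa

# Team value with batting and pitching rows kept separate.
x_separate <- sum(toy$nameweight) / sum(toy$pa)

# Same calculation after first summing each player's rows.
per_player <- aggregate(cbind(pa, nameweight) ~ namefirst, data = toy, FUN = sum)
x_combined <- sum(per_player$nameweight) / sum(per_player$pa)

# Both agree with the PA-weighted mean of namefrac.
x_weighted <- weighted.mean(toy$namefrac, toy$pa)

print(c(x_separate, x_combined, x_weighted))
```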
The highest values of namefrac belong to “John” in the early 20th century, at around 8.5%.
There are a number of players whose names don’t appear in the SSA list and so have zero weight. The player-seasons with the most plate appearances among zero-weight names are,
df %>% arrange(namefrac, desc(pa)) %>% select(teamid, yearid, namefirst, namelast, pa)
## # A tibble: 129,749 x 5
## teamid yearid namefirst namelast pa
## <chr> <int> <chr> <chr> <int>
## 1 NY1 1903 Christy Mathewson 1521
## 2 NY1 1908 Christy Mathewson 1499
## 3 SLA 1938 Bobo Newsom 1475
## 4 NY1 1904 Christy Mathewson 1456
## 5 DET 1944 Dizzy Trout 1421
## 6 BRO 1923 Burleigh Grimes 1418
## 7 BOS 1935 Wes Ferrell 1391
## 8 NY1 1901 Christy Mathewson 1383
## 9 PIT 1928 Burleigh Grimes 1377
## 10 SLA 1933 Bump Hadley 1365
## # ... with 129,739 more rows
The highest plate-appearance weighted values are,
df %>% arrange(desc(nameweight)) %>% select(teamid, yearid, namefirst, namelast, pa)
## # A tibble: 129,749 x 5
## teamid yearid namefirst namelast pa
## <chr> <int> <chr> <chr> <int>
## 1 DET 1904 George Mullin 1597
## 2 DET 1907 George Mullin 1493
## 3 DET 1905 George Mullin 1473
## 4 DET 1906 George Mullin 1408
## 5 DET 1903 George Mullin 1345
## 6 CLE 1923 George Uhle 1548
## 7 PHI 1908 George McQuillan 1409
## 8 CHA 1936 John Whitehead 1035
## 9 BLF 1914 George Suggs 1278
## 10 CIN 1912 George Suggs 1256
## # ... with 129,739 more rows
Now I group by team and season to look at the most and least commonly named teams. I generate one data frame with all countries and one with players born in the USA only.
team.year <- df %>% group_by(teamid, yearid) %>% mutate(spa=sum(pa), sw=sum(nameweight), x=sw/spa)
team.year.usa <- df %>% filter(birthcountry=='USA') %>% group_by(teamid, yearid) %>% mutate(spa=sum(pa), sw=sum(nameweight), x=sw/spa)
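Because `mutate()` keeps every player row, the team-season value `x` is repeated once per row, which is why the tables that follow still list individual `playerid`s. If one row per team-season is preferred, `summarise()` gives the same `x`. A minimal sketch on invented data (the column names follow the gist, the team codes and values do not):

```r
library(dplyr)

# Toy data: two hypothetical teams, two player rows each.
toy <- data.frame(
  teamid     = c("AAA", "AAA", "BBB", "BBB"),
  yearid     = c(2000, 2000, 2000, 2000),
  pa         = c(100, 300, 200, 200),
  nameweight = c(1.0, 0.6, 0.2, 0.0),
  stringsAsFactors = FALSE
)

# Collapse to one row per team-season, then compute the weighted value.
team_summary <- toy %>%
  group_by(teamid, yearid) %>%
  summarise(spa = sum(pa), sw = sum(nameweight)) %>%
  ungroup() %>%
  mutate(x = sw / spa) %>%
  arrange(desc(x))

print(team_summary)
```

With the data collapsed this way, sorting by `x` ranks teams directly, without the repeated rows.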
The most extreme values for both commonly and uncommonly named teams come from early 20th century teams,
team.year %>% arrange(desc(x)) %>% select(teamid:pa, x)
## Source: local data frame [129,749 x 5]
## Groups: teamid, yearid [2,411]
##
## teamid yearid playerid pa x
## <chr> <int> <chr> <int> <dbl>
## 1 CHA 1901 skopejo01 31 0.08738300
## 2 CHA 1901 skopejo01 0 0.08738300
## 3 CHA 1902 durhajo01 17 0.05222803
## 4 CHA 1902 durhajo01 94 0.05222803
## 5 CHA 1902 hugheed01 4 0.05222803
## 6 DET 1902 eganwi01 8 0.04509903
## 7 DET 1902 eganwi01 0 0.04509903
## 8 DET 1902 mccarar01 29 0.04509903
## 9 DET 1902 mccarar01 0 0.04509903
## 10 DET 1902 mullige01 128 0.04509903
## # ... with 129,739 more rows
team.year %>% arrange(x) %>% select(teamid:pa, x)
## Source: local data frame [129,749 x 5]
## Groups: teamid, yearid [2,411]
##
## teamid yearid playerid pa x
## <chr> <int> <chr> <int> <dbl>
## 1 BRO 1902 winhala01 2 0
## 2 BRO 1902 winhala01 0 0
## 3 BSN 1902 haleda01 17 0
## 4 BSN 1902 haleda01 0 0
## 5 PHI 1903 rudoldu01 1 0
## 6 WS1 1903 robinra01 406 0
## 7 NY1 1904 amesre01 43 0
## 8 NY1 1904 amesre01 464 0
## 9 NY1 1904 mathech01 142 0
## 10 NY1 1904 mathech01 1456 0
## # ... with 129,739 more rows
This is because those teams have players born prior to 1880, who aren’t covered by the SSA data. If I plot the PA-weighted commonality of names against year, and then against total PAs, I get,
g <- team.year %>% ggplot(aes(x=yearid, y=x))
print(g + geom_point())
g <- team.year %>% ggplot(aes(x=spa, y=x))
print(g + geom_point())
From here on I’m going to somewhat arbitrarily focus on 1961 and later.
Here are the most commonly named teams, 1961 or later,
team.year %>% filter(yearid>=1961) %>% summarise(x=mean(x)) %>% arrange(desc(x))
## Source: local data frame [1,446 x 3]
## Groups: teamid [39]
##
## teamid yearid x
## <chr> <int> <dbl>
## 1 SFN 1994 0.009791392
## 2 KCA 1993 0.009779969
## 3 SFN 1976 0.009673519
## 4 SFN 1978 0.008682502
## 5 KCA 1994 0.008601047
## 6 PHI 1985 0.008424861
## 7 PIT 2001 0.008414243
## 8 MIN 1990 0.008382733
## 9 SEA 2000 0.008245771
## 10 MIN 1992 0.008206225
## # ... with 1,436 more rows
Most commonly named teams, US players only,
team.year.usa %>% filter(yearid>=1961) %>% summarise(x=mean(x)) %>% arrange(desc(x))
## Source: local data frame [1,446 x 3]
## Groups: teamid [39]
##
## teamid yearid x
## <chr> <int> <dbl>
## 1 KCA 1993 0.011419529
## 2 NYA 1998 0.010578570
## 3 KC1 1964 0.010468000
## 4 ATL 2001 0.010457720
## 5 PIT 2001 0.010380819
## 6 SFN 1994 0.010325332
## 7 SLN 2005 0.010264645
## 8 SFN 1976 0.009778363
## 9 SEA 2000 0.009762225
## 10 SEA 2002 0.009712772
## # ... with 1,436 more rows
Least commonly named teams,
team.year %>% filter(yearid>=1961) %>% summarise(x=mean(x)) %>% arrange(x)
## Source: local data frame [1,446 x 3]
## Groups: teamid [39]
##
## teamid yearid x
## <chr> <int> <dbl>
## 1 NYA 1976 0.0007468444
## 2 ATL 1967 0.0007639050
## 3 MIL 2005 0.0009436827
## 4 CLE 1983 0.0010348522
## 5 TOR 1983 0.0010467887
## 6 CHA 1972 0.0010566808
## 7 BOS 1976 0.0010800753
## 8 ATL 1968 0.0011027922
## 9 OAK 1973 0.0011100424
## 10 NYA 1973 0.0011518698
## # ... with 1,436 more rows
And finally, the least commonly named teams, US players only,
team.year.usa %>% filter(yearid>=1961) %>% summarise(x=mean(x)) %>% arrange(x)
## Source: local data frame [1,446 x 3]
## Groups: teamid [39]
##
## teamid yearid x
## <chr> <int> <dbl>
## 1 MIL 2005 0.0008289797
## 2 NYA 1976 0.0008333503
## 3 ATL 1967 0.0008569369
## 4 CHA 1972 0.0011256615
## 5 TOR 1983 0.0011458445
## 6 CLE 1983 0.0011833945
## 7 CHA 1975 0.0012113704
## 8 CHA 1974 0.0012283803
## 9 OAK 1973 0.0012469752
## 10 NYA 1977 0.0012501884
## # ... with 1,436 more rows
To take a peek at a few of those teams and see which players got a lot of playing time while having uncommon names,
df %>% filter(teamid=='MIL', yearid==2005, birthcountry=='USA', pa>=450, namefrac<=0.01) %>% select(teamid, yearid, namefirst, namelast, pa, namefrac) %>% arrange(desc(pa))
## # A tibble: 7 x 6
## teamid yearid namefirst namelast pa namefrac
## <chr> <int> <chr> <chr> <int> <dbl>
## 1 MIL 2005 Chris Capuano 949 0.000814
## 2 MIL 2005 Doug Davis 946 0.000080
## 3 MIL 2005 Brady Clark 646 0.000300
## 4 MIL 2005 Ben Sheets 633 0.000348
## 5 MIL 2005 Lyle Overbay 615 0.000170
## 6 MIL 2005 Geoff Jenkins 594 0.000024
## 7 MIL 2005 Bill Hall 540 0.000159
df %>% filter(teamid=='NYA', yearid==1976, birthcountry=='USA', pa>=450, namefrac<=0.01) %>% select(teamid, yearid, namefirst, namelast, pa, namefrac) %>% arrange(desc(pa))
## # A tibble: 10 x 6
## teamid yearid namefirst namelast pa namefrac
## <chr> <int> <chr> <chr> <int> <dbl>
## 1 NYA 1976 Catfish Hunter 1210 0.000000
## 2 NYA 1976 Dock Ellis 886 0.000027
## 3 NYA 1976 Roy White 709 0.004857
## 4 NYA 1976 Chris Chambliss 668 0.000734
## 5 NYA 1976 Thurman Munson 645 0.000137
## 6 NYA 1976 Graig Nettles 645 0.000004
## 7 NYA 1976 Ken Holtzman 627 0.000357
## 8 NYA 1976 Mickey Rivers 603 0.000315
## 9 NYA 1976 Doyle Alexander 547 0.000292
## 10 NYA 1976 Willie Randolph 488 0.003358
df %>% filter(teamid=='TOR', yearid==1983, birthcountry=='USA', pa>=450, namefrac<=0.01) %>% select(teamid, yearid, namefirst, namelast, pa, namefrac) %>% arrange(desc(pa))
## # A tibble: 7 x 6
## teamid yearid namefirst namelast pa namefrac
## <chr> <int> <chr> <chr> <int> <dbl>
## 1 TOR 1983 Dave Stieb 1141 0.001013
## 2 TOR 1983 Jim Clancy 955 0.000827
## 3 TOR 1983 Jim Gott 776 0.002664
## 4 TOR 1983 Willie Upshaw 640 0.002882
## 5 TOR 1983 Lloyd Moseby 590 0.000877
## 6 TOR 1983 Doyle Alexander 482 0.000292
## 7 TOR 1983 Cliff Johnson 474 0.000077
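The three queries above differ only in team and season, so they could be wrapped in a small helper. The function below is a hypothetical convenience, not part of the original analysis: its name and arguments are invented here, and the toy data frame reuses two rows from the MIL 2005 output plus two made-up rows that the filter should drop.

```r
library(dplyr)

# Hypothetical helper wrapping the filter used for each team above.
uncommon_regulars <- function(data, team_id, year_id,
                              min_pa = 450, max_frac = 0.01) {
  data %>%
    filter(teamid == team_id, yearid == year_id, birthcountry == "USA",
           pa >= min_pa, namefrac <= max_frac) %>%
    select(teamid, yearid, namefirst, namelast, pa, namefrac) %>%
    arrange(desc(pa))
}

# Toy data: the first two rows come from the output above,
# the last two are invented so the filter has something to remove.
toy <- data.frame(
  teamid       = c("MIL", "MIL", "MIL", "MIL"),
  yearid       = c(2005, 2005, 2005, 2005),
  namefirst    = c("Doug", "Chris", "Example", "Example"),
  namelast     = c("Davis", "Capuano", "LowPA", "NonUSA"),
  pa           = c(946, 949, 100, 600),
  namefrac     = c(0.000080, 0.000814, 0.0001, 0.0001),
  birthcountry = c("USA", "USA", "USA", "CAN"),
  stringsAsFactors = FALSE
)

result <- uncommon_regulars(toy, "MIL", 2005)
print(result)
```

Against the real data, the call would look like `uncommon_regulars(df, 'NYA', 1976)`, with the thresholds adjustable through `min_pa` and `max_frac`.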