Length of Baby Names

Introduction

This project looks at the length of names over time. The hypothesis is that as time goes on, names are getting shorter. It looks specifically at male names, as they seem to be generally shorter than women’s names (around 5 characters). The data covers each decade from the 1900’s to the 2000’s, to capture the most recent full decades and see how name lengths have changed over the past 100 years.

The Process

To begin, take the babynames dataset and filter it down to a single decade.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.4
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0

## ── Conflicts ────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(babynames)
babynames_00s <- babynames %>% filter(year %in% 1900:1909)

Then keep just the male names.

boy_names <- filter(babynames_00s, sex == "M")$name

Then get the length of each name.

boy00 <- str_length(boy_names)

Then compute the average length of all male names in the 1900’s.

avg00 <- mean(boy00)

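Put together, the steps look like this. In the run shown here, the average came to about 5.68.

# %in% keeps every year of the decade; `year == 1900,1901,...` does not test membership.
# The figures printed below match the year 1900 alone, so rerun for decade-wide values.
babynames_00s <- filter(babynames, year %in% 1900:1909)
boy_names <- filter(babynames_00s, sex == "M")$name
boy00 <- str_length(boy_names)
avg00 <- mean(boy00)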

avg00

## [1] 5.679947
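Selecting a decade is a membership test: a row should be kept when its year is any one of ten values. The following is a minimal base R sketch of how `==` and `%in%` differ on that point. The `years` vector is made up for illustration and is not part of the dataset.

```r
# Hypothetical vector of years, for illustration only
years <- c(1900, 1901, 1905, 1910)

# `==` compares every element with the single value 1900
is_1900 <- years == 1900

# `%in%` asks whether each element appears anywhere in 1900:1909
in_decade <- years %in% 1900:1909

is_1900
in_decade
```

The second form is the one that corresponds to "all years in the decade", which is why it is the safer choice inside `filter()`.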

After finding the average, find the names that were popular during this decade. Start from the dataset already created for the decade and filter out the women’s names:

babynames_00s %>% 
  filter(sex=="M")

## # A tibble: 1,506 x 5
##     year sex   name        n   prop
##    <dbl> <chr> <chr>   <int>  <dbl>
##  1  1900 M     John     9829 0.0606
##  2  1900 M     William  8579 0.0529
##  3  1900 M     James    7245 0.0447
##  4  1900 M     George   5403 0.0333
##  5  1900 M     Charles  4099 0.0253
##  6  1900 M     Robert   3821 0.0236
##  7  1900 M     Joseph   3714 0.0229
##  8  1900 M     Frank    3477 0.0214
##  9  1900 M     Edward   2720 0.0168
## 10  1900 M     Henry    2606 0.0161
## # … with 1,496 more rows

The result is then grouped by name and sex, the counts for each name are summed and arranged in descending order, and the list is cut off at 10, giving the top 10 male names of the 1900’s.

babynames_00s %>% 
  filter(sex=="M") %>% 
  group_by(name,sex) %>% 
  summarise(total = sum(n)) %>% 
  arrange(desc(total)) %>% 
  head(10)

## # A tibble: 10 x 3
## # Groups:   name [10]
##    name    sex   total
##    <chr>   <chr> <int>
##  1 John    M      9829
##  2 William M      8579
##  3 James   M      7245
##  4 George  M      5403
##  5 Charles M      4099
##  6 Robert  M      3821
##  7 Joseph  M      3714
##  8 Frank   M      3477
##  9 Edward  M      2720
## 10 Henry   M      2606
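Since the same ranking is needed for every decade, the steps can be wrapped in a small function. This is only a sketch: `top_boy_names()`, `start`, and `n_top` are names introduced here for illustration and are not part of the babynames package.

```r
library(dplyr)
library(babynames)

# Hypothetical helper: the n_top most common male names in the decade
# that begins at `start` (for example, 1900 covers 1900 through 1909).
top_boy_names <- function(start, n_top = 10) {
  babynames %>%
    filter(sex == "M", year %in% start:(start + 9)) %>%
    group_by(name) %>%
    summarise(total = sum(n)) %>%   # total births per name across the decade
    arrange(desc(total)) %>%
    head(n_top)
}

top_boy_names(1900)
```

Calling `top_boy_names(1910)`, `top_boy_names(1920)`, and so on would then give the ranking for each later decade without copying the pipeline.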

Each Decade

This process was repeated for every decade.
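As an alternative to repeating the code block by block, the per-decade averages can be sketched as a single grouped pipeline. The names `decade_means`, `decade`, and `avg_length` are chosen here for illustration. Like the steps above, each row (one name in one year) counts once, so the mean is not weighted by the number of births.

```r
library(dplyr)
library(stringr)
library(babynames)

# Average male name length for every decade from the 1900s through the 2000s
decade_means <- babynames %>%
  filter(sex == "M", year %in% 1900:2009) %>%
  mutate(decade = 10 * (year %/% 10)) %>%   # 1903 -> 1900, 1987 -> 1980, ...
  group_by(decade) %>%
  summarise(avg_length = mean(str_length(name)))

decade_means
```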

The 1910’s

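# %in% keeps every year of the decade; `year == 1910,1911,...` does not test membership.
# The figures printed below match the year 1910 alone, so rerun for decade-wide values.
babynames_10s <- filter(babynames, year %in% 1910:1919)
boy_names10 <- filter(babynames_10s, sex == "M")$name
boy10 <- str_length(boy_names10)
avg10 <- mean(boy10)
avg10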

## [1] 5.789016

babynames_10s %>% 
  filter(sex=="M") %>% 
  group_by(name,sex) %>% 
  summarise(total = sum(n)) %>% 
  arrange(desc(total)) %>% 
  head(10)

## # A tibble: 10 x 3
## # Groups:   name [10]
##    name    sex   total
##    <chr>   <chr> <int>
##  1 John    M     11450
##  2 James   M      9195
##  3 William M      8844
##  4 Robert  M      5609
##  5 George  M      5441
##  6 Joseph  M      5228
##  7 Charles M      4785
##  8 Frank   M      3768
##  9 Edward  M      3408
## 10 Henry   M      2899

The 1920’s

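# %in% keeps every year of the decade; `year == 1920,1921,...` does not test membership.
# The figures printed below match the year 1920 alone, so rerun for decade-wide values.
babynames_20s <- filter(babynames, year %in% 1920:1929)
boy_names20 <- filter(babynames_20s, sex == "M")$name
boy20 <- str_length(boy_names20)
avg20 <- mean(boy20)
avg20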

## [1] 5.890982

babynames_20s %>% 
  filter(sex=="M") %>% 
  group_by(name,sex) %>% 
  summarise(total = sum(n)) %>% 
  arrange(desc(total)) %>% 
  head(10)

## # A tibble: 10 x 3
## # Groups:   name [10]
##    name    sex   total
##    <chr>   <chr> <int>
##  1 John    M     56913
##  2 William M     50147
##  3 Robert  M     48678
##  4 James   M     47909
##  5 Charles M     28308
##  6 George  M     26893
##  7 Joseph  M     25591
##  8 Edward  M     20095
##  9 Frank   M     16432
## 10 Richard M     15010

The 1930’s

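# %in% keeps every year of the decade; `year == 1930,1931,...` does not test membership.
# The figures printed below match the year 1930 alone, so rerun for decade-wide values.
babynames_30s <- filter(babynames, year %in% 1930:1939)
boy_names30 <- filter(babynames_30s, sex == "M")$name
boy30 <- str_length(boy_names30)
avg30 <- mean(boy30)
avg30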

## [1] 5.935036

babynames_30s %>% 
  filter(sex=="M") %>% 
  group_by(name,sex) %>% 
  summarise(total = sum(n)) %>% 
  arrange(desc(total)) %>% 
  head(10)

## # A tibble: 10 x 3
## # Groups:   name [10]
##    name    sex   total
##    <chr>   <chr> <int>
##  1 Robert  M     62147
##  2 James   M     53944
##  3 John    M     52432
##  4 William M     47259
##  5 Richard M     32178
##  6 Charles M     31863
##  7 Donald  M     29046
##  8 George  M     22779
##  9 Joseph  M     20981
## 10 Edward  M     17347

The 1940’s

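# %in% keeps every year of the decade; `year == 1940,1941,...` does not test membership.
# The figures printed below match the year 1940 alone, so rerun for decade-wide values.
babynames_40s <- filter(babynames, year %in% 1940:1949)
boy_names40 <- filter(babynames_40s, sex == "M")$name
boy40 <- str_length(boy_names40)
avg40 <- mean(boy40)
avg40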

## [1] 5.922764

babynames_40s %>% 
  filter(sex=="M") %>% 
  group_by(name,sex) %>% 
  summarise(total = sum(n)) %>% 
  arrange(desc(total)) %>% 
  head(10)

## # A tibble: 10 x 3
## # Groups:   name [10]
##    name    sex   total
##    <chr>   <chr> <int>
##  1 James   M     62471
##  2 Robert  M     61195
##  3 John    M     54777
##  4 William M     44770
##  5 Richard M     37416
##  6 Charles M     31685
##  7 David   M     27684
##  8 Thomas  M     23987
##  9 Donald  M     23104
## 10 Ronald  M     20729

The 1950’s

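# %in% keeps every year of the decade; `year == 1950,1951,...` does not test membership.
# The figures printed below match the year 1950 alone, so rerun for decade-wide values.
babynames_50s <- filter(babynames, year %in% 1950:1959)
boy_names50 <- filter(babynames_50s, sex == "M")$name
boy50 <- str_length(boy_names50)
avg50 <- mean(boy50)
avg50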

## [1] 5.942734

babynames_50s %>% 
  filter(sex=="M") %>% 
  group_by(name,sex) %>% 
  summarise(total = sum(n)) %>% 
  arrange(desc(total)) %>% 
  head(10)

## # A tibble: 10 x 3
## # Groups:   name [10]
##    name    sex   total
##    <chr>   <chr> <int>
##  1 James   M     86239
##  2 Robert  M     83565
##  3 John    M     79420
##  4 Michael M     65151
##  5 David   M     60730
##  6 William M     60690
##  7 Richard M     51001
##  8 Thomas  M     45603
##  9 Charles M     39099
## 10 Gary    M     33753

The 1960’s

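# %in% keeps every year of the decade; `year == 1960,1961,...` does not test membership.
# The figures printed below match the year 1960 alone, so rerun for decade-wide values.
babynames_60s <- filter(babynames, year %in% 1960:1969)
boy_names60 <- filter(babynames_60s, sex == "M")$name
boy60 <- str_length(boy_names60)
avg60 <- mean(boy60)
avg60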

## [1] 5.903704

babynames_60s %>% 
  filter(sex=="M") %>% 
  group_by(name,sex) %>% 
  summarise(total = sum(n)) %>% 
  arrange(desc(total)) %>% 
  head(10)

## # A tibble: 10 x 3
## # Groups:   name [10]
##    name    sex   total
##    <chr>   <chr> <int>
##  1 David   M     85928
##  2 Michael M     84183
##  3 James   M     76842
##  4 John    M     76096
##  5 Robert  M     72369
##  6 Mark    M     58731
##  7 William M     49354
##  8 Richard M     43561
##  9 Thomas  M     39279
## 10 Steven  M     33895

The 1970’s

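# %in% keeps every year of the decade; `year == 1970,1971,...` does not test membership.
# The figures printed below match the year 1970 alone, so rerun for decade-wide values.
babynames_70s <- filter(babynames, year %in% 1970:1979)
boy_names70 <- filter(babynames_70s, sex == "M")$name
boy70 <- str_length(boy_names70)
avg70 <- mean(boy70)
avg70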

## [1] 5.955801

babynames_70s %>% 
  filter(sex=="M") %>% 
  group_by(name,sex) %>% 
  summarise(total = sum(n)) %>% 
  arrange(desc(total)) %>% 
  head(10)

## # A tibble: 10 x 3
## # Groups:   name [10]
##    name        sex   total
##    <chr>       <chr> <int>
##  1 Michael     M     85303
##  2 James       M     61766
##  3 David       M     61753
##  4 John        M     58520
##  5 Robert      M     57201
##  6 Christopher M     41775
##  7 William     M     38910
##  8 Brian       M     31932
##  9 Mark        M     31494
## 10 Richard     M     30452

The 1980’s

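# %in% keeps every year of the decade; `year == 1980,1981,...` does not test membership.
# The figures printed below match the year 1980 alone, so rerun for decade-wide values.
babynames_80s <- filter(babynames, year %in% 1980:1989)
boy_names80 <- filter(babynames_80s, sex == "M")$name
boy80 <- str_length(boy_names80)
avg80 <- mean(boy80)
avg80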

## [1] 6.006037

babynames_80s %>% 
  filter(sex=="M") %>% 
  group_by(name,sex) %>% 
  summarise(total = sum(n)) %>% 
  arrange(desc(total)) %>% 
  head(10)

## # A tibble: 10 x 3
## # Groups:   name [10]
##    name        sex   total
##    <chr>       <chr> <int>
##  1 Michael     M     68693
##  2 Christopher M     49092
##  3 Jason       M     48173
##  4 David       M     41923
##  5 James       M     39325
##  6 Matthew     M     37860
##  7 Joshua      M     36060
##  8 John        M     35279
##  9 Robert      M     34281
## 10 Joseph      M     30202

The 1990’s

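# %in% keeps every year of the decade; `year == 1990,1991,...` does not test membership.
# The figures printed below match the year 1990 alone, so rerun for decade-wide values.
babynames_90s <- filter(babynames, year %in% 1990:1999)
boy_names90 <- filter(babynames_90s, sex == "M")$name
boy90 <- str_length(boy_names90)
avg90 <- mean(boy90)
avg90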

## [1] 6.128321

babynames_90s %>% 
  filter(sex=="M") %>% 
  group_by(name,sex) %>% 
  summarise(total = sum(n)) %>% 
  arrange(desc(total)) %>% 
  head(10)

## # A tibble: 10 x 3
## # Groups:   name [10]
##    name        sex   total
##    <chr>       <chr> <int>
##  1 Michael     M     65282
##  2 Christopher M     52332
##  3 Matthew     M     44800
##  4 Joshua      M     43216
##  5 Daniel      M     33815
##  6 David       M     33742
##  7 Andrew      M     33657
##  8 James       M     32347
##  9 Justin      M     30638
## 10 Joseph      M     30127

The 2000’s

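# %in% keeps every year of the decade; `year == 2000,2001,...` does not test membership.
# The figures printed below match the year 2000 alone, so rerun for decade-wide values.
babynames_200s <- filter(babynames, year %in% 2000:2009)
boy_names200 <- filter(babynames_200s, sex == "M")$name
boy200 <- str_length(boy_names200)
avg200 <- mean(boy200)
avg200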

## [1] 6.123225

babynames_200s %>% 
  filter(sex=="M") %>% 
  group_by(name,sex) %>% 
  summarise(total = sum(n)) %>% 
  arrange(desc(total)) %>% 
  head(10)

## # A tibble: 10 x 3
## # Groups:   name [10]
##    name        sex   total
##    <chr>       <chr> <int>
##  1 Jacob       M     34471
##  2 Michael     M     32035
##  3 Matthew     M     28572
##  4 Joshua      M     27538
##  5 Christopher M     24931
##  6 Nicholas    M     24652
##  7 Andrew      M     23639
##  8 Joseph      M     22825
##  9 Daniel      M     22312
## 10 Tyler       M     21503

Visualization

After getting all of this information, it is graphed using a custom dataset, starting with a decades variable.

decades <- c("1900-1909", "1910-1919", "1920-1929", "1930-1939", "1940-1949", "1950-1959", "1960-1969", "1970-1979", "1980-1989", "1990-1999", "2000-2009")

Then the average name length that corresponds with each decade, rounded to two decimal places.

mean <- c(5.68,5.79,5.89,5.94,5.92,5.94,5.90,5.96,6.01,6.13,6.12)

Thus creating this data frame.

DecadeData <- data.frame(decades,mean)
DecadeData

##      decades mean
## 1  1900-1909 5.68
## 2  1910-1919 5.79
## 3  1920-1929 5.89
## 4  1930-1939 5.94
## 5  1940-1949 5.92
## 6  1950-1959 5.94
## 7  1960-1969 5.90
## 8  1970-1979 5.96
## 9  1980-1989 6.01
## 10 1990-1999 6.13
## 11 2000-2009 6.12

Then the data frame is plotted, giving a visual of the average name length by decade.

# coord_flip() takes the limits directly; a separate coord_cartesian() would be replaced by it.
ggplot(DecadeData, aes(decades, mean)) +
  geom_col() +
  coord_flip(ylim = c(4.5, 6.15)) +
  ggtitle("The Length of Male Names Over Time")
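The plotted values can also be computed from the data instead of being typed in by hand, which avoids transcription and rounding slips. The sketch below is one way to do that, and it swaps the bar chart for a line chart so that no axis needs to be truncated. The names `plot_data`, `decade`, `avg_length`, and `p` are introduced here for illustration.

```r
library(dplyr)
library(stringr)
library(ggplot2)
library(babynames)

# Per-decade average length of male names, computed directly from babynames
plot_data <- babynames %>%
  filter(sex == "M", year %in% 1900:2009) %>%
  mutate(decade = 10 * (year %/% 10)) %>%
  group_by(decade) %>%
  summarise(avg_length = mean(str_length(name)))

# Line chart of the trend, one point per decade
p <- ggplot(plot_data, aes(decade, avg_length)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = seq(1900, 2000, by = 10)) +
  labs(
    title = "The Length of Male Names Over Time",
    x = "Decade",
    y = "Average name length (characters)"
  )

p
```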

Conclusion

In conclusion, the hypothesis is incorrect: the average length of male names rose from about 5.68 characters in the 1900’s to about 6.12 in the 2000’s. There are a couple of possible reasons. The first is that the hypothesis was based on personal impression. Many people go by nicknames today, so people with names like Jacob and Michael (the two most popular names in the 2000’s) may use common nicknames like Jake and Mike, which makes names seem shorter than they are on record. It can also be because there are more people now than in the past, so more recent decades contain more name records (as shown below), and the additional names may pull the average up.

babynames %>%
    count(decade = 10 * (year %/% 10))

## # A tibble: 14 x 2
##    decade      n
##     <dbl>  <int>
##  1   1880  22743
##  2   1890  29522
##  3   1900  36675
##  4   1910  80515
##  5   1920 105360
##  6   1930  91487
##  7   1940  95640
##  8   1950 110520
##  9   1960 124173
## 10   1970 167186
## 11   1980 205784
## 12   1990 263135
## 13   2000 325180
## 14   2010 266745
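The counts above are rows, where each row is one name for one sex in one year. To check more directly whether later decades contain a wider variety of male names, the distinct names per decade can be counted as well. This is a minimal sketch; `names_per_decade`, `records`, and `distinct_names` are names chosen here for illustration.

```r
library(dplyr)
library(babynames)

# Rows versus distinct male names recorded in each decade
names_per_decade <- babynames %>%
  filter(sex == "M") %>%
  group_by(decade = 10 * (year %/% 10)) %>%
  summarise(
    records = length(name),            # one row per name per year
    distinct_names = n_distinct(name)  # each name counted once per decade
  )

names_per_decade
```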

Another aspect that came to light in this project was that the more popular names in later decades were generally longer. In the early 1900’s the most popular names were John and James, which are shorter than more recent popular names like Christopher and Matthew. Note that the averages above count each listed name once, regardless of how many babies received it, so this shift would raise an average weighted by births more directly than the figures reported here.
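To see how much the popularity of longer names matters, the plain average can be compared with an average weighted by the number of births (`n`). The following is a sketch of that comparison under the same filters as the rest of the project. `weighted_means`, `len`, `unweighted`, and `weighted` are names introduced here for illustration.

```r
library(dplyr)
library(stringr)
library(babynames)

# Plain mean (each listed name counts once per year) versus
# birth-weighted mean (each name counts once per baby)
weighted_means <- babynames %>%
  filter(sex == "M", year %in% 1900:2009) %>%
  mutate(decade = 10 * (year %/% 10), len = str_length(name)) %>%
  group_by(decade) %>%
  summarise(
    unweighted = mean(len),
    weighted = weighted.mean(len, w = n)
  )

weighted_means
```

If the two columns move differently across decades, that would indicate that the change in which names are popular, and not just which names exist, is shaping the trend.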

A final observation is that the way names are spelled has changed over the years. For instance, the name James can also be spelled Jaymes, which adds another character. “Jaymes” does not appear in this dataset before the 1950’s. As seen below, it has shown up in more years of each decade since then (the n column counts the years in which the name was recorded, not the number of babies). This means that even though a person may have the common name “James”, it can be spelled with more characters, which adds a longer entry to the decade’s list of names. Many names have alternate spellings, and longer variants like this one can raise the average.

# n counts the years in each decade in which "Jaymes" appears, not the number of babies
James <- babynames %>% 
  filter(name == "Jaymes" & sex=="M") %>% 
  count(decade = 10 * (year %/% 10)) %>% 
  group_by(decade)
James

## # A tibble: 7 x 2
## # Groups:   decade [7]
##   decade     n
##    <dbl> <int>
## 1   1950     2
## 2   1960     7
## 3   1970    10
## 4   1980    10
## 5   1990    10
## 6   2000    10
## 7   2010     8
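The table above counts rows, so each decade tops out at 10. To see how many babies were actually given the name, the `n` column can be summed per decade. This is a minimal sketch; `jaymes_births`, `years_listed`, and `births` are names introduced here for illustration.

```r
library(dplyr)
library(babynames)

# Years in which "Jaymes" is listed, and total male births with that name, per decade
jaymes_births <- babynames %>%
  filter(name == "Jaymes", sex == "M") %>%
  group_by(decade = 10 * (year %/% 10)) %>%
  summarise(
    years_listed = n_distinct(year),  # what the row count above measures
    births = sum(n)                   # number of babies given the name
  )

jaymes_births
```

Because babynames only lists a name in a year when it was used at least 5 times, `births` is always at least five times `years_listed`.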