This project looks at the length of names over time. The hypothesis is that names are getting shorter as time goes on. It looks specifically at male names, which seem to be generally shorter than women's (around 5 characters). The data covers each decade from the 1900s to the 2000s, to get a grasp of the most recent data and see whether name length has changed over the past 100 years.
To begin, take the babynames dataset and filter it down to a single decade.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.4
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(babynames)
babynames %>% filter (year %in% 1900:1909) -> babynames_00s
Then keep just the male names
boy_names <- filter(babynames_00s, sex == "M")$name
Then get the length of each name
str_length(boy_names) -> boy00
Then compute the average length of all male names in the 1900s
mean(boy00) ->avg00
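Put together, the average comes to 5.68. Note that the filter below keeps only the year 1900, so this figure and the printed output that follows describe that single year rather than the full decade.
babynames_00s <- filter(babynames, year == 1900)  # keeps 1900 only; year %in% 1900:1909 would select the full decade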
boy_names <- filter(babynames_00s, sex == "M")$name
str_length(boy_names) -> boy00
mean(boy00) ->avg00
avg00
## [1] 5.679947
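The same steps can be wrapped in a small helper so that the year range is explicit. This is only a sketch, not the code that produced the numbers above: `avg_name_length()` and the `toy_names` table are hypothetical names introduced for illustration, and the made-up rows only mimic the `year`, `sex`, and `name` columns of `babynames`.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: mean length of the names recorded for one sex
# within a range of years. Each row (one name in one year) counts once.
avg_name_length <- function(data, years, sex_code = "M") {
  data %>%
    filter(year %in% years, sex == sex_code) %>%
    pull(name) %>%
    str_length() %>%
    mean()
}

# Toy data with the same column names as babynames (made-up rows).
toy_names <- data.frame(
  year = c(1900, 1900, 1905, 1909, 1910),
  sex  = c("M", "F", "M", "M", "M"),
  name = c("John", "Mary", "William", "Frank", "Christopher"),
  stringsAsFactors = FALSE
)

# Averages John (4), William (7), and Frank (5); the 1910 row is excluded.
avg_name_length(toy_names, 1900:1909)
```

On the real data the call would presumably look like `avg_name_length(babynames, 1900:1909)`.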
After finding the average, find the names that were actually popular during this decade. Start from the dataset created above and keep only the male names.
babynames_00s %>%
filter(sex=="M")
## # A tibble: 1,506 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1900 M John 9829 0.0606
## 2 1900 M William 8579 0.0529
## 3 1900 M James 7245 0.0447
## 4 1900 M George 5403 0.0333
## 5 1900 M Charles 4099 0.0253
## 6 1900 M Robert 3821 0.0236
## 7 1900 M Joseph 3714 0.0229
## 8 1900 M Frank 3477 0.0214
## 9 1900 M Edward 2720 0.0168
## 10 1900 M Henry 2606 0.0161
## # … with 1,496 more rows
The data is then grouped by name and sex, the counts for each name are summed and arranged in descending order, and the list is cut off at 10, giving the top 10 male names of the 1900s.
babynames_00s %>%
filter(sex=="M") %>%
group_by(name,sex) %>%
summarise(total = sum(n)) %>%
arrange(desc(total)) %>%
head(10)
## # A tibble: 10 x 3
## # Groups: name [10]
## name sex total
## <chr> <chr> <int>
## 1 John M 9829
## 2 William M 8579
## 3 James M 7245
## 4 George M 5403
## 5 Charles M 4099
## 6 Robert M 3821
## 7 Joseph M 3714
## 8 Frank M 3477
## 9 Edward M 2720
## 10 Henry M 2606
This process was repeated for every decade
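Rather than copying the block once per decade, the decade can be derived from the year and used as a grouping variable, using the same `10 * (year %/% 10)` expression that appears with `count()` later in this write-up. A minimal sketch, assuming a table with `year`, `sex`, and `name` columns like `babynames`; `length_by_decade()` and `toy_names` are hypothetical names, and the rows are made up.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: one row per decade with the unweighted mean
# name length for the chosen sex.
length_by_decade <- function(data, sex_code = "M") {
  data %>%
    filter(sex == sex_code) %>%
    mutate(decade = 10 * (year %/% 10)) %>%
    group_by(decade) %>%
    summarise(mean_length = mean(str_length(name))) %>%
    ungroup()
}

# Toy data (made-up rows) covering two decades.
toy_names <- data.frame(
  year = c(1900, 1907, 1912, 1918, 1918),
  sex  = c("M", "M", "M", "M", "F"),
  name = c("John", "Edward", "Robert", "Christopher", "Mary"),
  stringsAsFactors = FALSE
)

length_by_decade(toy_names)
```

Grouping by a computed decade column covers every year in the decade and removes the need for one variable per decade.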
The 1910’s
babynames_10s <- filter(babynames, year == 1910)  # keeps 1910 only, as in the output below; year %in% 1910:1919 would select the full decade
boy_names10 <- filter(babynames_10s, sex == "M")$name
str_length(boy_names10) -> boy10
mean(boy10) ->avg10
avg10
## [1] 5.789016
babynames_10s %>%
filter(sex=="M") %>%
group_by(name,sex) %>%
summarise(total = sum(n)) %>%
arrange(desc(total)) %>%
head(10)
## # A tibble: 10 x 3
## # Groups: name [10]
## name sex total
## <chr> <chr> <int>
## 1 John M 11450
## 2 James M 9195
## 3 William M 8844
## 4 Robert M 5609
## 5 George M 5441
## 6 Joseph M 5228
## 7 Charles M 4785
## 8 Frank M 3768
## 9 Edward M 3408
## 10 Henry M 2899
The 1920’s
babynames_20s <- filter(babynames, year == 1920)  # keeps 1920 only, as in the output below; year %in% 1920:1929 would select the full decade
boy_names20 <- filter(babynames_20s, sex == "M")$name
str_length(boy_names20) -> boy20
mean(boy20) ->avg20
avg20
## [1] 5.890982
babynames_20s %>%
filter(sex=="M") %>%
group_by(name,sex) %>%
summarise(total = sum(n)) %>%
arrange(desc(total)) %>%
head(10)
## # A tibble: 10 x 3
## # Groups: name [10]
## name sex total
## <chr> <chr> <int>
## 1 John M 56913
## 2 William M 50147
## 3 Robert M 48678
## 4 James M 47909
## 5 Charles M 28308
## 6 George M 26893
## 7 Joseph M 25591
## 8 Edward M 20095
## 9 Frank M 16432
## 10 Richard M 15010
The 1930’s
babynames_30s <- filter(babynames, year == 1930)  # keeps 1930 only, as in the output below; year %in% 1930:1939 would select the full decade
boy_names30 <- filter(babynames_30s, sex == "M")$name
str_length(boy_names30) -> boy30
mean(boy30) ->avg30
avg30
## [1] 5.935036
babynames_30s %>%
filter(sex=="M") %>%
group_by(name,sex) %>%
summarise(total = sum(n)) %>%
arrange(desc(total)) %>%
head(10)
## # A tibble: 10 x 3
## # Groups: name [10]
## name sex total
## <chr> <chr> <int>
## 1 Robert M 62147
## 2 James M 53944
## 3 John M 52432
## 4 William M 47259
## 5 Richard M 32178
## 6 Charles M 31863
## 7 Donald M 29046
## 8 George M 22779
## 9 Joseph M 20981
## 10 Edward M 17347
The 1940’s
babynames_40s <- filter(babynames, year == 1940)  # keeps 1940 only, as in the output below; year %in% 1940:1949 would select the full decade
boy_names40 <- filter(babynames_40s, sex == "M")$name
str_length(boy_names40) -> boy40
mean(boy40) -> avg40
avg40
## [1] 5.922764
babynames_40s %>%
filter(sex=="M") %>%
group_by(name,sex) %>%
summarise(total = sum(n)) %>%
arrange(desc(total)) %>%
head(10)
## # A tibble: 10 x 3
## # Groups: name [10]
## name sex total
## <chr> <chr> <int>
## 1 James M 62471
## 2 Robert M 61195
## 3 John M 54777
## 4 William M 44770
## 5 Richard M 37416
## 6 Charles M 31685
## 7 David M 27684
## 8 Thomas M 23987
## 9 Donald M 23104
## 10 Ronald M 20729
The 1950’s
babynames_50s <- filter(babynames, year == 1950)  # keeps 1950 only, as in the output below; year %in% 1950:1959 would select the full decade
boy_names50 <- filter(babynames_50s, sex == "M")$name
str_length(boy_names50) -> boy50
mean(boy50) ->avg50
avg50
## [1] 5.942734
babynames_50s %>%
filter(sex=="M") %>%
group_by(name,sex) %>%
summarise(total = sum(n)) %>%
arrange(desc(total)) %>%
head(10)
## # A tibble: 10 x 3
## # Groups: name [10]
## name sex total
## <chr> <chr> <int>
## 1 James M 86239
## 2 Robert M 83565
## 3 John M 79420
## 4 Michael M 65151
## 5 David M 60730
## 6 William M 60690
## 7 Richard M 51001
## 8 Thomas M 45603
## 9 Charles M 39099
## 10 Gary M 33753
The 1960’s
babynames_60s <- filter(babynames, year == 1960)  # keeps 1960 only, as in the output below; year %in% 1960:1969 would select the full decade
boy_names60 <- filter(babynames_60s, sex == "M")$name
str_length(boy_names60) -> boy60
mean(boy60)->avg60
avg60
## [1] 5.903704
babynames_60s %>%
filter(sex=="M") %>%
group_by(name,sex) %>%
summarise(total = sum(n)) %>%
arrange(desc(total)) %>%
head(10)
## # A tibble: 10 x 3
## # Groups: name [10]
## name sex total
## <chr> <chr> <int>
## 1 David M 85928
## 2 Michael M 84183
## 3 James M 76842
## 4 John M 76096
## 5 Robert M 72369
## 6 Mark M 58731
## 7 William M 49354
## 8 Richard M 43561
## 9 Thomas M 39279
## 10 Steven M 33895
The 1970’s
babynames_70s <- filter(babynames, year == 1970)  # keeps 1970 only, as in the output below; year %in% 1970:1979 would select the full decade
boy_names70 <- filter(babynames_70s, sex == "M")$name
str_length(boy_names70) -> boy70
mean(boy70) -> avg70
avg70
## [1] 5.955801
babynames_70s %>%
filter(sex=="M") %>%
group_by(name,sex) %>%
summarise(total = sum(n)) %>%
arrange(desc(total)) %>%
head(10)
## # A tibble: 10 x 3
## # Groups: name [10]
## name sex total
## <chr> <chr> <int>
## 1 Michael M 85303
## 2 James M 61766
## 3 David M 61753
## 4 John M 58520
## 5 Robert M 57201
## 6 Christopher M 41775
## 7 William M 38910
## 8 Brian M 31932
## 9 Mark M 31494
## 10 Richard M 30452
The 1980’s
babynames_80s <- filter(babynames, year == 1980)  # keeps 1980 only, as in the output below; year %in% 1980:1989 would select the full decade
boy_names80 <- filter(babynames_80s, sex == "M")$name
str_length(boy_names80) -> boy80
mean(boy80) -> avg80
avg80
## [1] 6.006037
babynames_80s %>%
filter(sex=="M") %>%
group_by(name,sex) %>%
summarise(total = sum(n)) %>%
arrange(desc(total)) %>%
head(10)
## # A tibble: 10 x 3
## # Groups: name [10]
## name sex total
## <chr> <chr> <int>
## 1 Michael M 68693
## 2 Christopher M 49092
## 3 Jason M 48173
## 4 David M 41923
## 5 James M 39325
## 6 Matthew M 37860
## 7 Joshua M 36060
## 8 John M 35279
## 9 Robert M 34281
## 10 Joseph M 30202
The 1990’s
babynames_90s <- filter(babynames, year == 1990)  # keeps 1990 only, as in the output below; year %in% 1990:1999 would select the full decade
boy_names90 <- filter(babynames_90s, sex == "M")$name
str_length(boy_names90) -> boy90
mean(boy90) ->avg90
avg90
## [1] 6.128321
babynames_90s %>%
filter(sex=="M") %>%
group_by(name,sex) %>%
summarise(total = sum(n)) %>%
arrange(desc(total)) %>%
head(10)
## # A tibble: 10 x 3
## # Groups: name [10]
## name sex total
## <chr> <chr> <int>
## 1 Michael M 65282
## 2 Christopher M 52332
## 3 Matthew M 44800
## 4 Joshua M 43216
## 5 Daniel M 33815
## 6 David M 33742
## 7 Andrew M 33657
## 8 James M 32347
## 9 Justin M 30638
## 10 Joseph M 30127
The 2000’s
babynames_200s <- filter(babynames, year == 2000)  # keeps 2000 only, as in the output below; year %in% 2000:2009 would select the full decade
boy_names200 <- filter(babynames_200s, sex == "M")$name
str_length(boy_names200) -> boy200
mean(boy200) -> avg200
avg200
## [1] 6.123225
babynames_200s %>%
filter(sex=="M") %>%
group_by(name,sex) %>%
summarise(total = sum(n)) %>%
arrange(desc(total)) %>%
head(10)
## # A tibble: 10 x 3
## # Groups: name [10]
## name sex total
## <chr> <chr> <int>
## 1 Jacob M 34471
## 2 Michael M 32035
## 3 Matthew M 28572
## 4 Joshua M 27538
## 5 Christopher M 24931
## 6 Nicholas M 24652
## 7 Andrew M 23639
## 8 Joseph M 22825
## 9 Daniel M 22312
## 10 Tyler M 21503
After gathering all of this information, it is graphed using a custom dataset, starting with a decades variable
decades <-c("1900-1909", "1910-1919", "1920-1929", "1930-1939", "1940-1949", "1950-1959", "1960-1969", "1970-1979", "1980-1989", "1990-1999", "2000-2009")
Then the average name length that corresponds to each decade
mean <- c(5.68,5.79,5.89,5.94,5.92,5.94,5.90,5.95,6.00,6.13,6.12)
This creates the following data frame
DecadeData <- data.frame(decades,mean)
DecadeData
## decades mean
## 1 1900-1909 5.68
## 2 1910-1919 5.79
## 3 1920-1929 5.89
## 4 1930-1939 5.94
## 5 1940-1949 5.92
## 6 1950-1959 5.94
## 7 1960-1969 5.90
## 8 1970-1979 5.95
## 9 1980-1989 6.00
## 10 1990-1999 6.13
## 11 2000-2009 6.12
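The labels and rounded averages can also be generated instead of typed by hand, which avoids transcription slips. The sketch below reuses the full-precision averages printed earlier in this write-up; `decade_start`, `avg_full`, and `decade_table` are hypothetical names introduced for illustration.

```r
# Start year of each decade, 1900 through 2000.
decade_start <- seq(1900, 2000, by = 10)

# Labels such as "1900-1909", built from the start years.
decades <- paste0(decade_start, "-", decade_start + 9)

# Full-precision averages as printed in the per-decade output above.
avg_full <- c(5.679947, 5.789016, 5.890982, 5.935036, 5.922764, 5.942734,
              5.903704, 5.955801, 6.006037, 6.128321, 6.123225)

# Hypothetical table equivalent to DecadeData, rounded to two decimals.
decade_table <- data.frame(decades, mean = round(avg_full, 2))
decade_table
```

Rounding with `round()` gives 5.96 for the 1970s and 6.01 for the 1980s, slightly different from the hand-typed 5.95 and 6.00.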
Then the data frame is plotted, giving a visual of the average lengths by decade
# coord_flip() takes the axis limits directly; adding it after coord_cartesian() replaces that coordinate system and discards the limits
ggplot(DecadeData, aes(decades, mean)) + geom_col() + coord_flip(ylim = c(4.5, 6.15)) + ggtitle("The Length of Male Names Over Time")
In conclusion, the hypothesis is incorrect. There are a couple of possible reasons. The first is that the hypothesis was based on personal opinion. Many people go by nicknames today, so people with names like Jacob and Michael (the two most popular names in the 2000s) may use common nicknames like Jake and Mike. It may also be because there are more people now than there were in the past, so more names are recorded in recent decades (as shown below), which may raise the average.
babynames %>%
count(decade = 10 * (year %/% 10))
## # A tibble: 14 x 2
## decade n
## <dbl> <int>
## 1 1880 22743
## 2 1890 29522
## 3 1900 36675
## 4 1910 80515
## 5 1920 105360
## 6 1930 91487
## 7 1940 95640
## 8 1950 110520
## 9 1960 124173
## 10 1970 167186
## 11 1980 205784
## 12 1990 263135
## 13 2000 325180
## 14 2010 266745
Another aspect that came to light in this project is that the more popular names in later decades were generally longer. In the early 1900s the most popular names were John and James, which are shorter than more modern popular names like Christopher and Matthew. The averages above give every listed name equal weight regardless of how many babies received it, so this shift is not directly captured by them; an average weighted by the number of births (n) would reflect it.
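To make the difference between counting each name once and counting each baby concrete, the sketch below computes both averages on a toy table. `compare_means()` and `toy_names` are hypothetical names, the counts are made up, and the table only mimics the `name` and `n` columns of `babynames`.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: compares the per-name (unweighted) mean length with
# the per-baby mean, which weights each name by its count n.
compare_means <- function(data) {
  data %>%
    mutate(len = str_length(name)) %>%
    summarise(
      per_name = mean(len),
      per_baby = weighted.mean(len, w = n)
    )
}

# Made-up counts: the longer name is three times as common.
toy_names <- data.frame(
  name = c("John", "Christopher"),
  n    = c(100, 300),
  stringsAsFactors = FALSE
)

compare_means(toy_names)
```

In this toy case the per-name mean is 7.5 characters while the per-baby mean is 9.25, because the longer name is given to more babies.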
A final observation is that the spelling of names has changed over the years. For instance, James can also be spelled Jaymes, which adds a character. Before the 1950s, “Jaymes” does not appear in this dataset. As seen below, it appears in more years per decade from the 1950s onward (n here counts the years in which the name is listed, not the number of babies). This means that a person may have the common name “James” spelled with more characters, which raises the average for the decade. Many names have alternate spellings, and where the alternates are longer they push the average up.
James <- babynames %>%
filter(name %in% c("Jaymes") & sex=="M") %>%
count(decade = 10 * (year %/% 10)) %>%
group_by(decade)
James
## # A tibble: 7 x 2
## # Groups: decade [7]
## decade n
## <dbl> <int>
## 1 1950 2
## 2 1960 7
## 3 1970 10
## 4 1980 10
## 5 1990 10
## 6 2000 10
## 7 2010 8
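Because `count()` tallies rows, the table above shows how many years per decade the spelling is listed rather than how many babies received it. A possible variation that reports both figures is sketched below; `spelling_by_decade()` and `toy_names` are hypothetical names, and the rows are made up to mimic the `year`, `sex`, `name`, and `n` columns of `babynames`.

```r
library(dplyr)

# Hypothetical helper: for a set of spellings, report per decade the number
# of year-rows in which they appear and the total number of babies.
spelling_by_decade <- function(data, spellings, sex_code = "M") {
  data %>%
    filter(name %in% spellings, sex == sex_code) %>%
    mutate(decade = 10 * (year %/% 10)) %>%
    group_by(decade) %>%
    summarise(years_listed = n(), babies = sum(n)) %>%
    ungroup()
}

# Toy data (made-up counts).
toy_names <- data.frame(
  year = c(1950, 1955, 1962, 1962),
  sex  = c("M", "M", "M", "M"),
  name = c("Jaymes", "Jaymes", "Jaymes", "James"),
  n    = c(5, 7, 6, 900),
  stringsAsFactors = FALSE
)

spelling_by_decade(toy_names, "Jaymes")
```

Summing `n` shows how common a spelling is, while the row count only shows in how many years it was recorded at all.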