Babynames: U.S. City Names

Author

Lena Gunn

Hypothesis:

My baby names project assesses how popular U.S. city names have been as baby names over time, and how often each was used in total, using the count() function. I take a closer look at the top 10 names overall, the top 10 female names, and the top 10 male names. My hypothesis is that Charlotte, Austin, Elizabeth, and Allen will be four of the most popular names.

The external data set of U.S. cities comes from Wikipedia.

First I'll load the necessary packages and read in the city data.

library(readxl)
city_names <- read_excel("city_names.xlsx",
                         col_types = c("numeric", "text", "text",
                                       "numeric", "numeric", "text", "text",
                                       "text", "text", "text", "text"))

Warning: Expecting numeric in A2 / R2C1: got 'rank'

Warning: Expecting numeric in D2 / R2C4: got 'estimate'

Warning: Expecting numeric in E2 / R2C5: got 'census'

New names:
• `2022` -> `2022...1`
• `2022` -> `2022...4`
• `` -> `...10`
• `` -> `...11`

library(babynames)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

colnames(city_names)[2] <- "name"

babynames |>
  inner_join(city_names, relationship = "many-to-many") -> merged

Joining with `by = join_by(name)`

merged |> 
  arrange(desc(prop))

# A tibble: 10,246 × 15
    year sex   name      n   prop `2022...1` `State[a]` `2022...4` `2020` Change
   <dbl> <chr> <chr> <int>  <dbl>      <dbl> <chr>           <dbl>  <dbl> <chr> 
 1  1880 F     Eliz…  1939 0.0199        211 New Jersey     134283 137298 −2.20%
 2  1882 F     Eliz…  2186 0.0189        211 New Jersey     134283 137298 −2.20%
 3  1883 F     Eliz…  2255 0.0188        211 New Jersey     134283 137298 −2.20%
 4  1881 F     Eliz…  1852 0.0187        211 New Jersey     134283 137298 −2.20%
 5  1884 F     Eliz…  2549 0.0185        211 New Jersey     134283 137298 −2.20%
 6  1885 F     Eliz…  2582 0.0182        211 New Jersey     134283 137298 −2.20%
 7  1886 F     Eliz…  2680 0.0174        211 New Jersey     134283 137298 −2.20%
 8  1887 F     Eliz…  2681 0.0172        211 New Jersey     134283 137298 −2.20%
 9  1888 F     Eliz…  3224 0.0170        211 New Jersey     134283 137298 −2.20%
10  1889 F     Eliz…  3058 0.0162        211 New Jersey     134283 137298 −2.20%
# ℹ 10,236 more rows
# ℹ 5 more variables: `2020 land area` <chr>, `2020 population density` <chr>,
#   Location <chr>, ...10 <chr>, ...11 <chr>
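The join above is declared as many-to-many, which matters if the city table lists the same name more than once (for example, two cities in different states). In that case every baby-name row is repeated once per matching city, and later row counts are inflated. Here is a minimal sketch of the effect and one possible fix. It uses toy tables with made-up values, not the real babynames or city_names data, and the names toy_babies and toy_cities are hypothetical.

```r
library(dplyr)

# Toy stand-ins for babynames and city_names (all values are made up)
toy_babies <- tibble(
  year = c(2000, 2001, 2000),
  sex  = c("F", "F", "M"),
  name = c("Aurora", "Aurora", "Austin"),
  n    = c(100L, 120L, 300L)
)
toy_cities <- tibble(
  name  = c("Aurora", "Aurora", "Austin"),
  state = c("Colorado", "Illinois", "Texas")
)

# Joining as-is repeats each Aurora row once per matching city
dup_join <- inner_join(toy_babies, toy_cities, by = "name",
                       relationship = "many-to-many")

# Keeping one row per city name first yields one row per year/sex/name
unique_join <- toy_babies |>
  inner_join(distinct(toy_cities, name), by = "name")

nrow(dup_join)     # 5
nrow(unique_join)  # 3
```

Using distinct() this way drops the state and population columns, so it only suits analyses that need the name alone.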

Then I counted the rows for each city name in the merged data, with male and female records combined, and kept the 10 names that appear most often.

merged |> 
  count(name) |> 
  arrange(desc(n)) |> 
  head(10) -> top
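Note that count(name) tallies rows, so a name that shows up in many years ranks high even if few babies received it. To total the babies instead, count() accepts a wt argument. This is a small sketch of the difference on made-up toy data (the tibble toy and the output column names rows and babies are hypothetical):

```r
library(dplyr)

# Toy data: Lynn appears in more years, Austin has more babies (made-up values)
toy <- tibble(
  year = c(1990, 1991, 1992, 1990),
  name = c("Lynn", "Lynn", "Lynn", "Austin"),
  n    = c(10L, 12L, 8L, 500L)
)

# Unweighted: number of rows per name
row_counts <- count(toy, name, name = "rows", sort = TRUE)

# Weighted: sum of n per name, i.e. number of babies
baby_counts <- count(toy, name, wt = n, name = "babies", sort = TRUE)

row_counts$name[1]   # "Lynn"
baby_counts$name[1]  # "Austin"
```

Which version is right depends on the question: the first ranks names by how many records they have, the second by how many babies were given the name.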

Observation #1

I then created a bar graph of the top 10 names overall, faceted by sex, to break down how often each city name shows up as a baby name for males and for females. Each bar counts the rows in the merged data for that name, so it reflects how many records carry the name across all years rather than the number of babies.

merged |>
  filter(name %in% c('Aurora', 'Allen', 'Elizabeth', 'Garland',
                     'Quincy', 'Carmel', 'Lynn', 'Dallas', 'Eugene', 'Cary')) |>
  # color = n removed: geom_bar() counts rows and drops that aesthetic with a warning
  ggplot(aes(name, fill = sex)) +
  geom_bar() + coord_flip() + facet_wrap(~sex)

Findings

This graph is interesting because I did not expect to see names such as Garland and Carmel on the list. Another interesting takeaway is that I did not expect Lynn or Elizabeth to show up so often for males.
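Because geom_bar() with no y aesthetic counts rows, a bar for a name recorded for males in many years can look long even when few boys received it. One way to plot total babies instead is to sum n per name and sex and draw the result with geom_col(). This is a rough sketch on made-up toy data (toy_merged and totals are hypothetical names), not a rerun of the chart above:

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the merged data (all values are made up)
toy_merged <- tibble(
  year = c(1990, 1991, 1990, 1991),
  sex  = c("F", "F", "M", "M"),
  name = c("Lynn", "Lynn", "Lynn", "Lynn"),
  n    = c(400L, 380L, 20L, 15L)
)

# Sum babies per name and sex across all years
totals <- toy_merged |>
  group_by(name, sex) |>
  summarise(total = sum(n), .groups = "drop")

# geom_col() draws the precomputed totals instead of counting rows
p <- ggplot(totals, aes(name, total, fill = sex)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~sex)

totals$total  # 780 35
```

In this toy example both sexes have two rows, so a row-count chart would show equal bars, while the totals differ by more than a factor of 20.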

Observation #2

I then created a bar graph to see how often each of the top 10 male names appears in the merged data.

merged |>
  filter(name %in% c('Allen', 'Austin', 'Cary', 'Cleveland', 'Dallas',
                     'Eugene', 'Everett', 'Garland', 'Henderson', 'Houston')) |>
  filter(sex == 'M') |>
  # color = n removed: geom_bar() counts rows and drops that aesthetic with a warning
  ggplot(aes(name)) + geom_bar() + coord_flip()

Observation #3

I then created a line graph to see how often each of these names was used in each year.

merged |> 
  filter(name %in% c('Allen', 'Austin', 'Cary', 'Cleveland', 'Dallas', 
                     'Eugene', 'Everett', 'Garland', 'Henderson', 'Houston')) |> 
  filter(sex == 'M') |>
  ggplot(aes(year, n, color = name)) + geom_line()

Findings

The graph shows a couple of different peaks. One that I thought was interesting is Eugene, which peaked in the 1930s and has declined since. It would be interesting to see whether something significant happened in Eugene, Oregon in that time frame. The most significant peak was Austin in the late 1990s: the name shot up fast and then rapidly declined. It would also be interesting to see whether something significant happened in Austin, Texas during this time, or whether a very popular public figure was named Austin.
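Peaks like these can also be read from the data rather than estimated from the plot. Here is a minimal sketch, assuming a table shaped like merged with year, name, sex, and n columns. The values below are made up for illustration and are not the real counts, and toy_merged and peaks are hypothetical names.

```r
library(dplyr)

# Toy stand-in for the male rows of the merged data (all values are made up)
toy_merged <- tibble(
  year = c(1995, 1996, 1997, 1925, 1926, 1927),
  name = c("Austin", "Austin", "Austin", "Eugene", "Eugene", "Eugene"),
  sex  = "M",
  n    = c(100L, 150L, 120L, 90L, 80L, 70L)
)

# For each name, keep the single year with the largest n
peaks <- toy_merged |>
  group_by(name) |>
  slice_max(order_by = n, n = 1, with_ties = FALSE) |>
  ungroup() |>
  select(name, peak_year = year, peak_n = n)

peaks$peak_year  # 1996 1925
```

Run against the real merged data, the same pipeline would give one peak year per name, which could then be compared with events in the matching city.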

Observation #4

I then created a bar graph to see how often each of the top 10 female names appears in the merged data.

merged |>
  filter(name %in% c('Aurora', 'Charlotte', 'Elizabeth', 'Odessa', 'Savannah',
                     'Carmel', 'Lynn', 'Dallas', 'Eugene', 'Cary')) |>
  filter(sex == 'F') |>
  # color = n removed: geom_bar() counts rows and drops that aesthetic with a warning
  ggplot(aes(name)) + geom_bar() + coord_flip()

Observation #5

I then created a line graph to see how often each of these names was used in each year.

merged |> 
  filter(name %in% c('Aurora', 'Charlotte', 'Elizabeth', 'Odessa', 'Savannah', 
                     'Carmel', 'Lynn', 'Dallas', 'Eugene', 'Cary')) |> 
  filter(sex == 'F') |>
  ggplot(aes(year, n, color=name)) + geom_line()

Findings

This graph had many different peaks, the most significant belonging to Elizabeth. The name fluctuated throughout the years, hit its greatest peak in the late 1990s, and has been in decline since then. Charlotte peaked in the 1940s and declined until the 2000s, when it rapidly increased. It would be interesting to see whether the rapid growth of the city of Charlotte correlates with the rapid growth of the name.

Conclusion

After reviewing the visualizations, I was very surprised by the results. I would say there is some validity to my hypothesis that Elizabeth, Charlotte, Austin, and Allen would appear among the top 10 names. Elizabeth and Allen were in the top 10 names overall, with male and female combined. Allen and Austin were both in the top 10 male names, and Elizabeth and Charlotte were both in the top 10 female names. It was interesting to see when each name spiked in popularity. Future research could look at whether anything significant happened in a city in the year its name spiked, or at how many babies born in a particular city or state were given a name matching it.
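The comparison in this conclusion can be reproduced with a few lines of base R. The sketch below copies the three top-10 name vectors from the filter() calls earlier in the report and checks which hypothesis names fall in each. The variable names are my own and were not part of the original analysis.

```r
# Names from the hypothesis
hypothesis <- c("Charlotte", "Austin", "Elizabeth", "Allen")

# Top-10 lists as used in the filter() calls above
top_overall <- c("Aurora", "Allen", "Elizabeth", "Garland", "Quincy",
                 "Carmel", "Lynn", "Dallas", "Eugene", "Cary")
top_male    <- c("Allen", "Austin", "Cary", "Cleveland", "Dallas",
                 "Eugene", "Everett", "Garland", "Henderson", "Houston")
top_female  <- c("Aurora", "Charlotte", "Elizabeth", "Odessa", "Savannah",
                 "Carmel", "Lynn", "Dallas", "Eugene", "Cary")

# Keep the hypothesis names that appear in each list
in_overall <- hypothesis[hypothesis %in% top_overall]
in_male    <- hypothesis[hypothesis %in% top_male]
in_female  <- hypothesis[hypothesis %in% top_female]

in_overall  # "Elizabeth" "Allen"
in_male     # "Austin" "Allen"
in_female   # "Charlotte" "Elizabeth"
```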