Babynames: Vowels Over Time

Author

Brian Walsh

Hypothesis

My hypothesis is that the proportion of baby names starting with vowels has been increasing over the years.

First, I’ll load the necessary packages.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(babynames)
library(scales)

Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor

Next, I’ll load the external dataset:

canada_babynames <- read_csv("StateNames.csv")
Rows: 5647426 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Name, Gender, State
dbl (3): Id, Year, Count

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

I found the state names data on Kaggle.

Observation 1:

Hypothesis: The proportion of baby names starting with ‘A’ has increased over time.

# Filter names starting with 'A'
a_names <- babynames %>%
  filter(substr(name, 1, 1) == "A")

# Total babies per year with names starting with 'A' (a raw count, not a proportion)
a_name_totals <- a_names %>%
  group_by(year) %>%
  summarize(total = sum(n))

# Plot the yearly totals
ggplot(a_name_totals, aes(x = year, y = total)) +
  geom_line() +
  labs(title = "Number of Babies Given Names Starting with 'A' Over Time",
       x = "Year",
       y = "Number of Babies") +
  scale_y_continuous(labels = comma) + # Format y-axis labels with commas
  theme_minimal()
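
Since the hypothesis is about a proportion, a raw yearly count can be misleading, because the total number of recorded births also changes from year to year. Below is a minimal sketch of one way to compute a share instead. `initial_share` is a hypothetical helper name (it is not part of the babynames package), it assumes a data frame with `year`, `name`, and `n` columns like `babynames`, and the `toy` data is made up purely to show the arithmetic.

```r
library(dplyr)

# Hypothetical helper: for each year, the share of all recorded babies
# whose name starts with one of the letters in `initials`.
# Assumes `data` has columns `year`, `name`, and `n`, as `babynames` does.
initial_share <- function(data, initials) {
  data %>%
    mutate(matches_initial = substr(name, 1, 1) %in% initials) %>%
    group_by(year) %>%
    summarize(proportion = sum(n[matches_initial]) / sum(n))
}

# Made-up toy data, only to illustrate the calculation
toy <- data.frame(
  year = c(2000, 2000, 2001, 2001),
  name = c("Anna", "Bob", "Eve", "Adam"),
  n    = c(30, 70, 50, 50)
)

# 2000: 30 / 100 = 0.3; 2001: 50 / 100 = 0.5
a_share_toy <- initial_share(toy, "A")
```

On the real data, `initial_share(babynames, "A")` would give the yearly share for ‘A’, and `initial_share(babynames, c("A", "E", "I", "O", "U"))` would cover the overall vowel hypothesis. The result can be plotted with the same `ggplot()` call as above, swapping `labels = comma` for `labels = percent` from the scales package. Note that the denominator is recorded births only, since the babynames data leaves out names used fewer than five times in a year.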