The Most Popular First Letter of Baby Names in NYC
In this project, I first determine the most popular first letter of baby names in New York City for each year from 2011 to 2014. Second, I find the most popular first letter for each sex in each of those years.
I use the NYC Popular Baby Names data.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(readr)
Popular_Baby_Names <- read_csv("Popular_Baby_Names.csv")
## Rows: 47545 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Gender, Ethnicity, Child's First Name
## dbl (3): Year of Birth, Count, Rank
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
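Because every later step sums the `Count` column, any row that appears more than once in the file would inflate the totals. I have not verified whether this file contains repeated rows, so the following is only a sketch of how such a check could look. The tibble `toy_raw` is made up for illustration and is not taken from the NYC data.

```r
library(dplyr)

# Hypothetical toy rows: the first two are exact duplicates.
toy_raw <- tibble(
  `Year of Birth`      = c(2011, 2011, 2011),
  Gender               = c("FEMALE", "FEMALE", "MALE"),
  `Child's First Name` = c("Ava", "Ava", "Jacob"),
  Count                = c(20, 20, 30)
)

# Number of rows that are exact repeats of an earlier row.
n_repeated <- nrow(toy_raw) - nrow(distinct(toy_raw))

# distinct() keeps one copy of each fully identical row.
toy_clean <- distinct(toy_raw)

n_repeated
sum(toy_clean$Count)
```

If `n_repeated` were greater than zero for the real file, running `distinct()` before the summaries would be one option to consider.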
First, I added a column containing the first letter of each name.
Popular_Baby_Names %>%
# R strings are 1-indexed, so the first character is at position 1
mutate(first_letter = substr(`Child's First Name`, 1, 1)) %>%
arrange(desc(first_letter)) -> Popular_Baby_Names
#knitr::kable(Popular_Baby_Names)
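To make the first-letter step concrete, here is a minimal sketch on a few made-up names. The tibble `toy_names` is hypothetical, and the `toupper()` call is an added assumption (not part of the analysis above) that would only matter if the file mixed upper- and lower-case initials.

```r
library(dplyr)

# Hypothetical toy rows; the real file has many more names.
toy_names <- tibble(
  `Child's First Name` = c("Olivia", "JAYDEN", "ava", "Zoe")
)

toy_letters <- toy_names %>%
  # substr(x, 1, 1) keeps only the character at position 1.
  # toupper() is an assumption so that "ava" and "Ava" share a group.
  mutate(first_letter = toupper(substr(`Child's First Name`, 1, 1)))

toy_letters$first_letter
```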
The file loaded above is the Popular Baby Names dataset from NYC Open Data.
Second, I totaled the counts for each first letter and sorted the letters from most to least popular.
Popular_Baby_Names %>%
group_by(first_letter) %>%
summarize(letter_count = sum(Count)) %>%
arrange(desc(letter_count)) -> Popular_First_Letter
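The summary above adds up the number of babies (`Count`) for each letter rather than counting how many distinct names start with that letter. A small sketch on made-up numbers shows the difference. The tibble `toy_counts` is hypothetical.

```r
library(dplyr)

# Hypothetical toy data: two "A" rows, one "J" row, one "M" row.
toy_counts <- tibble(
  first_letter = c("A", "A", "J", "M"),
  Count        = c(10, 15, 12, 30)
)

toy_totals <- toy_counts %>%
  group_by(first_letter) %>%
  # sum(Count) totals babies per letter; n() would count rows instead.
  summarize(letter_count = sum(Count), .groups = "drop") %>%
  arrange(desc(letter_count))

toy_totals
```

In this toy case "M" ranks first with 30 babies even though "A" has more rows, which is the behavior the analysis relies on.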
Third, I totaled the counts by first letter and year, then kept only the most popular letter in each year.
Popular_Baby_Names %>%
group_by(first_letter, `Year of Birth`) %>%
summarize(letter_count = sum(Count)) %>%
arrange(desc(letter_count)) -> Popular_First_Letter_Year
## `summarise()` has grouped output by 'first_letter'. You can override using the
## `.groups` argument.
Most_Popular_First_Letter_Year <- Popular_First_Letter_Year %>%
group_by(`Year of Birth`) %>%
top_n(1, letter_count) %>%
filter(`Year of Birth` <2015)
The most popular first letter overall is “A”, which ranks first in every year from 2011 to 2014.
Most_Popular_First_Letter_Year
## # A tibble: 4 × 3
## # Groups: Year of Birth [4]
## first_letter `Year of Birth` letter_count
## <chr> <dbl> <dbl>
## 1 A 2014 52285
## 2 A 2012 51934
## 3 A 2011 51655
## 4 A 2013 50786
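The dplyr documentation marks `top_n()` as superseded in favor of `slice_max()`. As a sketch of the same "top letter per year" step with the newer function, on made-up numbers (the tibble `toy_letter_year` is hypothetical):

```r
library(dplyr)

# Hypothetical yearly totals for two letters.
toy_letter_year <- tibble(
  first_letter    = c("A", "J", "A", "J"),
  `Year of Birth` = c(2011, 2011, 2012, 2012),
  letter_count    = c(50, 40, 45, 48)
)

toy_top <- toy_letter_year %>%
  group_by(`Year of Birth`) %>%
  # Keep the row with the largest letter_count in each year.
  slice_max(letter_count, n = 1) %>%
  ungroup() %>%
  arrange(`Year of Birth`)

toy_top
```

Unlike the table above, this version returns the rows ordered by year rather than by count.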
Then I made a line graph of the yearly totals for each first letter from 2011 to 2014.
Popular_First_Letter_Year %>%
filter(`Year of Birth` <2015) %>%
ggplot(aes(`Year of Birth`,letter_count, color = first_letter)) + geom_line()
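With a numeric year on the x-axis, ggplot2 may choose fractional tick marks. One possible refinement, sketched here on made-up numbers, sets whole-year breaks and readable labels. The tibble `toy_plot_data` and the label text are my own assumptions.

```r
library(dplyr)
library(ggplot2)

# Hypothetical yearly totals for two letters.
toy_plot_data <- tibble(
  first_letter    = rep(c("A", "J"), each = 4),
  `Year of Birth` = rep(2011:2014, times = 2),
  letter_count    = c(50, 52, 51, 53, 40, 38, 37, 35)
)

toy_plot <- toy_plot_data %>%
  ggplot(aes(`Year of Birth`, letter_count, color = first_letter)) +
  geom_line() +
  # Show one tick per year instead of automatic breaks.
  scale_x_continuous(breaks = 2011:2014) +
  labs(x = "Year of birth", y = "Babies", color = "First letter")

toy_plot
```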
Next, I repeat the analysis by gender, starting with females: I total the counts per letter and year, then find the top letter in each year.
Popular_Baby_Names_Females <- Popular_Baby_Names %>%
filter(Gender %in% "FEMALE")
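# Overall totals per letter for females, stored under its own name
# so the yearly table created next does not overwrite it
Popular_Baby_Names_Females %>%
group_by(first_letter) %>%
summarize(letter_count = sum(Count)) %>%
arrange(desc(letter_count)) -> Popular_First_Letter_Females_Overall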
Popular_Baby_Names_Females %>%
group_by(first_letter, `Year of Birth`) %>%
summarize(letter_count = sum(Count)) %>%
arrange(desc(letter_count)) -> Popular_First_Letter_Females
## `summarise()` has grouped output by 'first_letter'. You can override using the
## `.groups` argument.
Most_Popular_First_Letter_Year_Females <- Popular_First_Letter_Females %>%
group_by(`Year of Birth`) %>%
top_n(1, letter_count) %>%
filter(`Year of Birth` <2015)
For females, “A” is still the most common first letter in each year from 2011 to 2014, as the table and line graph below show.
Most_Popular_First_Letter_Year_Females
## # A tibble: 4 × 3
## # Groups: Year of Birth [4]
## first_letter `Year of Birth` letter_count
## <chr> <dbl> <dbl>
## 1 A 2014 25977
## 2 A 2012 24487
## 3 A 2013 24270
## 4 A 2011 23130
Popular_First_Letter_Females %>%
filter(`Year of Birth` < 2015) %>%
ggplot(aes(`Year of Birth`,letter_count, color = first_letter)) + geom_line()
Next, I repeat the same steps for males.
Popular_Baby_Names_Males <- Popular_Baby_Names %>%
filter(Gender %in% "MALE")
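# Overall totals per letter for males, stored under its own name
# so the yearly table created next does not overwrite it
Popular_Baby_Names_Males %>%
group_by(first_letter) %>%
summarize(letter_count = sum(Count)) %>%
arrange(desc(letter_count)) -> Popular_First_Letter_Males_Overall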
Popular_Baby_Names_Males %>%
filter(`Year of Birth`<2015) %>%
group_by(first_letter, `Year of Birth`) %>%
summarize(letter_count = sum(Count)) %>%
arrange(desc(letter_count)) -> Popular_First_Letter_Males
## `summarise()` has grouped output by 'first_letter'. You can override using the
## `.groups` argument.
# Years were already limited to 2011-2014 when building Popular_First_Letter_Males
Most_Popular_First_Letter_Year_Males <- Popular_First_Letter_Males %>%
group_by(`Year of Birth`) %>%
top_n(1, letter_count)
For males, the most popular first letter was “J” in every year from 2011 to 2014.
Most_Popular_First_Letter_Year_Males
## # A tibble: 4 × 3
## # Groups: Year of Birth [4]
## first_letter `Year of Birth` letter_count
## <chr> <dbl> <dbl>
## 1 J 2011 36853
## 2 J 2012 34918
## 3 J 2013 32274
## 4 J 2014 29975
Popular_First_Letter_Males %>%
ggplot(aes(`Year of Birth`,letter_count, color = first_letter)) + geom_line()
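The female and male sections repeat the same steps on two filtered copies of the data. As an alternative sketch, adding `Gender` to the grouping finds the top letter per year for both genders in one pipeline. The tibble `toy_gender` is made up for illustration, and `slice_max()` stands in for the `top_n()` call used above.

```r
library(dplyr)

# Hypothetical rows with the same column names as the NYC data.
toy_gender <- tibble(
  Gender          = c("FEMALE", "FEMALE", "MALE", "MALE", "FEMALE", "MALE"),
  `Year of Birth` = c(2011, 2011, 2011, 2011, 2012, 2012),
  first_letter    = c("A", "S", "J", "A", "A", "J"),
  Count           = c(30, 20, 40, 25, 35, 45)
)

toy_top_by_gender <- toy_gender %>%
  # Total babies per gender, year, and first letter.
  group_by(Gender, `Year of Birth`, first_letter) %>%
  summarize(letter_count = sum(Count), .groups = "drop") %>%
  # Keep the top letter within each gender and year.
  group_by(Gender, `Year of Birth`) %>%
  slice_max(letter_count, n = 1) %>%
  ungroup() %>%
  arrange(Gender, `Year of Birth`)

toy_top_by_gender
```

A single table like this would also allow one plot with a facet per gender, using `facet_wrap(~ Gender)`.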
In conclusion, the most popular first letter of baby names in New York City was “A” in every year from 2011 to 2014. For females, “A” was also the most common first letter, while for males it was “J”. In the future, I would like to compare New York City with New York State to see how the most popular first letters differ.