The Popular Baby Names dataset from NYC Open Data provides information on the most common first names given to babies born in New York City each year, starting from 2011. It includes variables such as the baby’s gender, the mother’s reported ethnicity, the baby’s first name, the number of babies given that name in a given year, and the name’s rank in popularity within that demographic group. The data is based on official birth certificates and allows users to explore trends in baby naming patterns across different ethnicities, genders, and years within NYC. I will use this dataset to compare the top names by count in a bar chart.
Load in the library
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 77287 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Gender, Ethnicity, Child's First Name
dbl (3): Year of Birth, Count, Rank
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
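The column-specification message above is what readr's read_csv() prints, but the call that created babynames is not shown. The following is a minimal sketch of that step, assuming the data was read with read_csv(). The two data rows are invented toy values that keep the example self-contained; in practice the I(...) literal would be replaced by the path to the downloaded CSV.

```r
library(readr)

# Toy stand-in for the downloaded CSV (the two data rows are invented for illustration).
# The header row matches the column names reported in the message above.
csv_text <- "Year of Birth,Gender,Ethnicity,Child's First Name,Count,Rank
2011,FEMALE,HISPANIC,EXAMPLEA,13,75
2011,MALE,HISPANIC,EXAMPLEB,21,67"

# I() tells readr (2.0 or later) to treat the string as literal data, not a file path.
# show_col_types = FALSE quiets the column-specification message.
babynames <- read_csv(I(csv_text), show_col_types = FALSE)

print(names(babynames))
```

At this point the column names still contain spaces and an apostrophe, which is why the next step normalizes them.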
Clean and Summarize the data
names(babynames) <- tolower(names(babynames))
names(babynames) <- gsub(" ", "_", names(babynames))
names(babynames) <- gsub("'", "", names(babynames))
# gsub() replaces the spaces between words in the headers with underscores and removes apostrophes
head(babynames)
 year_of_birth     gender           ethnicity         childs_first_name
 Min.   :2011   Length:77287       Length:77287       Length:77287
 1st Qu.:2012   Class :character   Class :character   Class :character
 Median :2013   Mode  :character   Mode  :character   Mode  :character
 Mean   :2013
 3rd Qu.:2014
 Max.   :2021
     count             rank
 Min.   : 10.00   Min.   :  1.0
 1st Qu.: 13.00   1st Qu.: 37.0
 Median : 20.00   Median : 58.0
 Mean   : 33.85   Mean   : 56.8
 3rd Qu.: 36.00   3rd Qu.: 78.0
 Max.   :446.00   Max.   :102.0
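To make the header cleanup easier to check, the three renaming steps can be wrapped in one small helper and applied to the original column names. This is only a sketch: clean_header is a hypothetical name that does not appear in the original code, and the input vector is the set of column names reported when the file was read.

```r
# Hypothetical helper (not in the original analysis) that bundles the three renaming steps.
clean_header <- function(x) {
  x <- tolower(x)                        # lower-case everything
  x <- gsub(" ", "_", x, fixed = TRUE)   # spaces become underscores
  gsub("'", "", x, fixed = TRUE)         # apostrophes are dropped
}

# Column names as reported when the CSV was read.
raw_names <- c("Year of Birth", "Gender", "Ethnicity",
               "Child's First Name", "Count", "Rank")

cleaned <- clean_header(raw_names)
print(cleaned)
```

The result matches the column names in the summary above. With dplyr loaded, the same helper could be applied in a pipeline as babynames |> rename_with(clean_header).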
# Group the data by name and gender, then sum the count of babies for each name-gender combination
topnames <- babynames |>
  group_by(childs_first_name, gender) |>
  # "drop" tells summarise() to ungroup after summarizing
  summarise(Total_Count = sum(count, na.rm = TRUE), .groups = "drop") |>
  # Group again by name only and create a new column that holds the total number of babies with that name, regardless of gender
  group_by(childs_first_name) |>
  mutate(count = sum(Total_Count)) |>
  # Remove all grouping
  ungroup() |>
  # Keep only one row per name. Since a name can now have multiple gender rows, this keeps the first occurrence
  distinct(childs_first_name, .keep_all = TRUE) |>
  # Sort by the combined count (total popularity)
  arrange(desc(count)) |>
  slice_head(n = 10)
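The effect of the two grouping passes and of distinct() is easier to see on a few rows. The sketch below runs the same steps on a toy table; the names and counts are invented for illustration and are not taken from the dataset.

```r
library(dplyr)

# Toy rows (invented values): "Avery" appears under both genders, "Liam" under one.
toy <- tibble(
  childs_first_name = c("Avery", "Avery", "Avery", "Liam", "Liam"),
  gender            = c("FEMALE", "MALE", "FEMALE", "MALE", "MALE"),
  count             = c(30, 12, 25, 100, 90)
)

# Pass 1: one row per name-gender combination.
by_gender <- toy |>
  group_by(childs_first_name, gender) |>
  summarise(Total_Count = sum(count, na.rm = TRUE), .groups = "drop")

# Pass 2: add the per-name total, keep the first row per name, sort by the total.
combined <- by_gender |>
  group_by(childs_first_name) |>
  mutate(count = sum(Total_Count)) |>
  ungroup() |>
  distinct(childs_first_name, .keep_all = TRUE) |>
  arrange(desc(count))

print(by_gender)
print(combined)
```

In this toy case "Avery" ends up with a combined count of 67 but keeps only the FEMALE row, because that row comes first after summarising. The gender shown for a name in the final table is therefore whichever gender row happens to appear first, not necessarily the only gender the name was given to.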
Bar Chart
# theme_fivethirtyeight() comes from the ggthemes package, which tidyverse does not load
library(ggthemes)

# Custom colors
custom_colors <- c("FEMALE" = "#CB12FF", "MALE" = "#FF6912")

# Plot with enhancements
ggplot(topnames, aes(x = reorder(childs_first_name, count),
                     y = count,
                     fill = gender)) +
  geom_col(show.legend = TRUE) +
  scale_fill_manual(values = custom_colors) +
  coord_flip() +
  labs(
    title = "Top 10 Most Popular Baby Names in NYC (All Years)",
    subtitle = "Grouped by Gender",
    x = "Baby Name",
    y = "Total Number of Babies",
    fill = "Gender",
    caption = "Source: NYC Open Data - Popular Baby Names"
  ) +
  theme_fivethirtyeight() # Change from default ggplot2 theme
To explore trends in baby naming patterns in New York City, I analyzed the Popular Baby Names dataset from NYC Open Data. Before visualizing the data, I cleaned it to ensure accuracy and consistency. This process included checking for missing values, normalizing the column names (especially those with spaces or special characters), and grouping the data by both the child’s first name and gender. I then calculated a combined count column (named count in the code), which sums the number of babies given each name across all years, regardless of gender. To prepare the dataset for visualization, I kept only one row per name and sorted the names by popularity.

The resulting visualization is a horizontal bar chart that displays the top 10 most popular baby names in NYC, colored by gender. To improve clarity and aesthetics, I used custom hex color codes to distinguish male and female names and applied theme_fivethirtyeight() for a cleaner look. The chart includes axis labels, a descriptive title, a source caption, and a color-coded legend to guide interpretation.

Several patterns emerged from the chart. Male names like “Liam” and “Noah” ranked among the most popular, while names like “Olivia” and “Emma” led among females. The dominance of certain names suggests cultural influences and changing trends over time. Some names may be used for more than one gender, but because only one row per name was kept, each bar shows a single gender even though its length reflects the combined count. Overall, the visualization highlights naming trends that span multiple years and offers insight into gender-based name preferences.

While the final chart was successful, there were features I hoped to include but could not implement. For example, I wanted to show how name popularity changed over time using an animated time-series chart with the gganimate package. I also considered adding a breakdown by ethnicity to explore demographic influences on naming, but the added complexity made the chart less readable in a static format. These ideas remain potential improvements for future iterations.

This project helped me apply data cleaning, transformation, and visualization techniques using R and the tidyverse. It also revealed social patterns through names and demonstrated the value of thoughtful data presentation. Although I faced a few limitations, the experience gave me practical insight into real-world data analysis and the creative possibilities of visual storytelling.
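As a starting point for the time-series idea mentioned above, the yearly totals per name can be computed with the same group_by() and summarise() pattern used earlier and drawn as a static line chart. This is a sketch under stated assumptions, not part of the original analysis: the toy rows, the ethnicity labels, and the object names yearly and trend_plot are all invented for illustration.

```r
library(dplyr)
library(ggplot2)

# Toy rows standing in for the cleaned babynames data (values invented for illustration).
toy <- tibble(
  year_of_birth     = c(2011, 2011, 2012, 2012, 2011, 2012),
  ethnicity         = c("GROUP A", "GROUP B", "GROUP A", "GROUP B", "GROUP A", "GROUP A"),
  childs_first_name = c("Liam", "Liam", "Liam", "Liam", "Emma", "Emma"),
  count             = c(40, 30, 50, 35, 60, 55)
)

# Sum counts across ethnicity groups to get one total per name and year.
yearly <- toy |>
  group_by(childs_first_name, year_of_birth) |>
  summarise(total = sum(count, na.rm = TRUE), .groups = "drop")

# Static line chart of the yearly totals, one line per name.
trend_plot <- ggplot(yearly, aes(x = year_of_birth, y = total,
                                 colour = childs_first_name)) +
  geom_line() +
  geom_point() +
  labs(x = "Year of Birth", y = "Number of Babies", colour = "Name")

print(yearly)
```

A plot like this could in principle be animated by adding a gganimate transition such as transition_reveal(year_of_birth). That step is left out here because it was not implemented in this project.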