Project 1

Author

Cristian Mendez

The Popular Baby Names dataset from NYC Open Data lists the most common first names given to babies born in New York City each year, starting from 2011. It includes variables such as the baby’s gender, the mother’s reported ethnicity, the baby’s first name, the number of babies given that name in a given year, and the name’s rank in popularity within that demographic group. The data is based on official birth certificates and lets users explore naming trends across ethnicities, genders, and years within NYC. I will use this dataset to compare the top names by count in a bar chart.

Load the libraries

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)

Load the dataset into your global environment

setwd("C:/Users/cmend/Downloads")
babynames <- read_csv("Popular_Baby_Names.csv")

Rows: 77287 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Gender, Ethnicity, Child's First Name
dbl (3): Year of Birth, Count, Rank

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Clean and Summarize the data

names(babynames) <- tolower(names(babynames))
names(babynames) <- gsub(" ","_",names(babynames))
names(babynames) <- gsub("'","",names(babynames))
# tolower() lowercases the headers; gsub() then replaces spaces with underscores and removes apostrophes
head(babynames)

# A tibble: 6 × 6
  year_of_birth gender ethnicity childs_first_name count  rank
          <dbl> <chr>  <chr>     <chr>             <dbl> <dbl>
1          2011 FEMALE HISPANIC  GERALDINE            13    75
2          2011 FEMALE HISPANIC  GIA                  21    67
3          2011 FEMALE HISPANIC  GIANNA               49    42
4          2011 FEMALE HISPANIC  GISELLE              38    51
5          2011 FEMALE HISPANIC  GRACE                36    53
6          2011 FEMALE HISPANIC  GUADALUPE            26    62
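The three renaming steps above can be checked on a small example. The sketch below uses base R only and a toy data frame whose values are made up for illustration; `clean_header()` is a hypothetical helper name, not part of the original analysis.

```r
# Toy data with raw headers like those in the CSV (values are invented)
raw <- data.frame(
  "Year of Birth" = c(2011, 2012),
  "Child's First Name" = c("AVA", "LIAM"),
  "Count" = c(12, 30),
  check.names = FALSE  # keep spaces and apostrophes in the headers
)

# Hypothetical helper: lowercase, spaces -> underscores, drop apostrophes
clean_header <- function(x) {
  x <- tolower(x)
  x <- gsub(" ", "_", x, fixed = TRUE)
  gsub("'", "", x, fixed = TRUE)
}

names(raw) <- clean_header(names(raw))
names(raw)
```

Wrapping the steps in one function keeps the cleaning rule in a single place if more files with similar headers need the same treatment.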

summary(babynames)

 year_of_birth     gender           ethnicity         childs_first_name 
 Min.   :2011   Length:77287       Length:77287       Length:77287      
 1st Qu.:2012   Class :character   Class :character   Class :character  
 Median :2013   Mode  :character   Mode  :character   Mode  :character  
 Mean   :2013                                                           
 3rd Qu.:2014                                                           
 Max.   :2021                                                           
     count             rank      
 Min.   : 10.00   Min.   :  1.0  
 1st Qu.: 13.00   1st Qu.: 37.0  
 Median : 20.00   Median : 58.0  
 Mean   : 33.85   Mean   : 56.8  
 3rd Qu.: 36.00   3rd Qu.: 78.0  
 Max.   :446.00   Max.   :102.0

# Sum the counts for each name-gender pair across all years
topnames <- babynames |>
  group_by(childs_first_name, gender) |>
  summarise(Total_Count = sum(count, na.rm = TRUE), .groups = "drop") |> # "drop" ungroups after summarising
  group_by(childs_first_name) |> # group again, this time by name only
  mutate(count = sum(Total_Count)) |> # total number of babies with that name, regardless of gender
  ungroup() |> # remove all grouping
  distinct(childs_first_name, .keep_all = TRUE) |> # keep one row per name; a name with rows for both genders keeps only its first row and that row's gender label
  arrange(desc(count)) |> # sort by the combined total
  slice_head(n = 10) # keep the 10 most popular names
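Because `distinct()` keeps the first row for each name, a name used for both genders is labelled with whichever gender happens to come first, not necessarily the more common one. One possible alternative, shown here as a sketch on made-up toy data (the column names `gender_count` and `total_count` are illustrative assumptions), is to keep the gender with the larger count using `slice_max()`.

```r
library(dplyr)

# Toy data: RILEY appears under both genders (values are invented)
toy <- tibble::tibble(
  childs_first_name = c("RILEY", "RILEY", "LIAM", "EMMA", "EMMA"),
  gender = c("FEMALE", "MALE", "MALE", "FEMALE", "FEMALE"),
  count = c(10, 40, 30, 25, 20)
)

top_toy <- toy |>
  group_by(childs_first_name, gender) |>
  summarise(gender_count = sum(count, na.rm = TRUE), .groups = "drop") |>
  group_by(childs_first_name) |>
  mutate(total_count = sum(gender_count)) |>            # total across genders
  slice_max(gender_count, n = 1, with_ties = FALSE) |>  # keep the dominant gender's row
  ungroup() |>
  arrange(desc(total_count)) |>
  slice_head(n = 2)

top_toy
```

In this toy example RILEY is labelled MALE because 40 of its 50 babies are male, whereas the `distinct()` approach would have labelled it FEMALE.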

Bar Chart

# Custom color 
custom_colors <- c("FEMALE" = "#CB12FF", "MALE" = "#FF6912")

# Plot with enhancements
ggplot(topnames, aes(x = reorder(childs_first_name,  count),
                             y = count,
                             fill = gender)) +
  geom_col(show.legend = TRUE) +
  scale_fill_manual(values = custom_colors) +
  coord_flip() +
  labs(
    title = "Top 10 Most Popular Baby Names in NYC (All Years)",
    subtitle = "Grouped by Gender",
    x = "Baby Name",
    y = "Total Number of Babies",
    fill = "Gender",
    caption = "Source: NYC Open Data - Popular Baby Names"
  ) +
  theme_fivethirtyeight()  # Change from default ggplot2 theme

To explore trends in baby naming patterns in New York City, I analyzed the Popular Baby Names dataset from NYC Open Data. Before visualizing, I cleaned the data for consistency. This included checking for missing values, standardizing the column names (lowercasing them, replacing spaces with underscores, and removing apostrophes), and grouping the data by both child’s first name and gender. I then calculated a combined total for each name, stored in the count column of topnames, which sums the number of babies given that name across all years, regardless of gender. To prepare the data for plotting, I kept one row per name and sorted the names by popularity.

The resulting visualization is a horizontal bar chart of the top 10 most popular baby names in NYC, colored by gender. For clarity and aesthetics, I used custom hex color codes to distinguish male and female names and applied theme_fivethirtyeight() from ggthemes for a cleaner look. The chart includes axis labels, a descriptive title, a source caption, and a color-coded legend to guide interpretation.

Several patterns emerged from the chart. Male names like “Liam” and “Noah” ranked among the most popular, while names like “Olivia” and “Emma” led among females. The dominance of certain names suggests cultural influences and changing trends over time. Some names may be used for more than one gender, but the chart shows only one gender per name because only the first row for each name was kept. The visualization highlights naming trends that span multiple years and offers insight into gender-based name preferences.

While the final chart was successful, there were features I hoped to include but couldn’t implement. For example, I wanted to show how name popularity changed over time using an animated time-series chart with the gganimate package. I also considered adding a breakdown by ethnicity to explore demographic influences on naming, but the added complexity made the chart less readable in a static format. These ideas remain potential improvements for future iterations.

Overall, this project helped me apply data cleaning, transformation, and visualization techniques using R and the tidyverse. It also revealed social patterns through names and showed the value of thoughtful data presentation. Although I faced a few limitations, the experience gave me useful practice with real-world data analysis and visual storytelling.
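As a possible starting point for the time-series idea mentioned above, the sketch below sums counts per name and year and draws a static line chart. It runs on invented toy data; the names `yearly`, `yearly_count`, and `trend_plot` are illustrative assumptions and not part of the original analysis.

```r
library(dplyr)
library(ggplot2)

# Toy data: several rows per name and year, as when a name appears
# in more than one demographic group (values are invented)
toy <- tibble::tibble(
  year_of_birth = c(2011, 2011, 2012, 2012, 2012),
  childs_first_name = c("LIAM", "EMMA", "LIAM", "LIAM", "EMMA"),
  count = c(20, 15, 25, 10, 18)
)

# One total per name per year
yearly <- toy |>
  group_by(year_of_birth, childs_first_name) |>
  summarise(yearly_count = sum(count, na.rm = TRUE), .groups = "drop")

# Static trend chart: one line per name
trend_plot <- ggplot(yearly, aes(x = year_of_birth,
                                 y = yearly_count,
                                 colour = childs_first_name)) +
  geom_line() +
  geom_point() +
  labs(x = "Year of Birth", y = "Number of Babies", colour = "Name")

# If gganimate is installed, adding
#   + gganimate::transition_reveal(year_of_birth)
# to trend_plot is one way the lines could be revealed year by year.
```

On the real data, the same summary could be restricted to the names in topnames before plotting so that the chart stays readable.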