Top 4 Popular Names in NYC by Ethnicity

Author

Hannah Le

This analysis uses a dataset from the NYC Open Data repository of popular baby names in New York City, broken down by gender and ethnic group. The dataset includes the following variables:

- Year of Birth: The year in which the babies were born.
- Gender: The gender of the babies.
- Ethnicity: The ethnicity of the babies.
- Child’s First Name: The first names of the babies.
- Count: The number of babies given a specific name in a given year.
- Rank: The popularity rank of the name within the respective year and group.

My goal is to examine the most common baby names among different ethnic groups in New York City. I’ll focus on the top four most popular names within each ethnicity, looking for patterns and trends in name choices. This study tries to shed light on how cultural background affects naming practices in one of the world’s most diverse cities. The project also explores gender and ethnicity differences in baby name trends over time, as well as how a name’s popularity (Rank) and frequency (Count) relate to each other.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)

setwd("~/Desktop/Data 110")
baby_names <- read_csv("popularbabynames.csv")

Rows: 69214 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Gender, Ethnicity, Child's First Name
dbl (3): Year of Birth, Count, Rank

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(baby_names)

# A tibble: 6 × 6
  `Year of Birth` Gender Ethnicity `Child's First Name` Count  Rank
            <dbl> <chr>  <chr>     <chr>                <dbl> <dbl>
1            2011 FEMALE HISPANIC  GERALDINE               13    75
2            2011 FEMALE HISPANIC  GIA                     21    67
3            2011 FEMALE HISPANIC  GIANNA                  49    42
4            2011 FEMALE HISPANIC  GISELLE                 38    51
5            2011 FEMALE HISPANIC  GRACE                   36    53
6            2011 FEMALE HISPANIC  GUADALUPE               26    62

Remove duplicate rows

baby_names <- baby_names %>% distinct()
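The read_csv() message above reports 69214 rows, while str() later shows 21,612 after distinct(), so most rows were exact duplicates. A minimal sketch of how that could be checked before dropping them, using a small made-up tibble (`toy` and its values are hypothetical, not taken from the dataset):

```r
library(dplyr)

# Hypothetical toy data; the second row is an exact copy of the first
toy <- tibble(
  Year  = c(2011, 2011, 2011, 2012),
  Name  = c("GIA", "GIA", "GRACE", "GIA"),
  Count = c(21, 21, 36, 25)
)

n_before <- nrow(toy)
n_dupes  <- sum(duplicated(toy))   # rows that repeat an earlier row exactly
deduped  <- toy %>% distinct()     # keeps the first copy of each row

# The number of rows removed should equal the number of duplicates found
n_before - nrow(deduped) == n_dupes
```

Counting duplicates first makes it easier to notice when a large share of the file is repeated, as appears to be the case here.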

unique(baby_names$Ethnicity)

[1] "HISPANIC"                   "WHITE NON HISPANIC"        
[3] "ASIAN AND PACIFIC ISLANDER" "BLACK NON HISPANIC"        
[5] "ASIAN AND PACI"             "BLACK NON HISP"            
[7] "WHITE NON HISP"
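The output above lists seven labels, three of which look like truncated versions of the full group names. Assuming each truncated label refers to the same group as its full-length counterpart (an assumption, not something the dataset confirms here), they could be merged so that only four groups remain. A minimal sketch on a hypothetical tibble named `toy`, using `case_match()` from dplyr 1.1.0 or later:

```r
library(dplyr)

# Hypothetical toy column holding the seven labels printed above
toy <- tibble(
  Ethnicity = c("HISPANIC",
                "ASIAN AND PACI", "ASIAN AND PACIFIC ISLANDER",
                "BLACK NON HISP", "BLACK NON HISPANIC",
                "WHITE NON HISP", "WHITE NON HISPANIC")
)

# Assumption: each truncated label is the same group as the full label
recoded <- toy %>%
  mutate(Ethnicity = case_match(
    Ethnicity,
    "ASIAN AND PACI" ~ "ASIAN AND PACIFIC ISLANDER",
    "BLACK NON HISP" ~ "BLACK NON HISPANIC",
    "WHITE NON HISP" ~ "WHITE NON HISPANIC",
    .default = Ethnicity   # leave every other label unchanged
  ))

unique(recoded$Ethnicity)
```

Merging the labels before grouping would also keep every row matched to one of the four colors named in scale_fill_manual() later in the analysis.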

Transform the Gender and Ethnicity columns of the baby_names dataset into factors

baby_names <- baby_names %>%
  mutate(Gender = factor(Gender),
         Ethnicity = factor(Ethnicity))

Cleaned data

str(baby_names)

tibble [21,612 × 6] (S3: tbl_df/tbl/data.frame)
 $ Year of Birth     : num [1:21612] 2011 2011 2011 2011 2011 ...
 $ Gender            : Factor w/ 2 levels "FEMALE","MALE": 1 1 1 1 1 1 1 1 1 1 ...
 $ Ethnicity         : Factor w/ 7 levels "ASIAN AND PACI",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ Child's First Name: chr [1:21612] "GERALDINE" "GIA" "GIANNA" "GISELLE" ...
 $ Count             : num [1:21612] 13 21 49 38 36 26 126 14 17 17 ...
 $ Rank              : num [1:21612] 75 67 42 51 53 62 8 74 71 71 ...

# View the column names in dataset

colnames(baby_names)

[1] "Year of Birth"      "Gender"             "Ethnicity"         
[4] "Child's First Name" "Count"              "Rank"

Rename the column

baby_names <- baby_names %>%
  rename(Year = `Year of Birth`)

Regression Analysis

# Gender and Ethnicity were already converted to factors above; as.factor() leaves them unchanged
baby_names$Gender <- as.factor(baby_names$Gender)
baby_names$Ethnicity <- as.factor(baby_names$Ethnicity)
# Fit a simple linear regression with a single predictor: Count ~ Year
lm_model <- lm(Count ~ Year, data = baby_names)

Summary of the model

summary(lm_model)


Call:
lm(formula = Count ~ Year, data = baby_names)

Residuals:
   Min     1Q Median     3Q    Max 
-25.55 -19.75 -13.22   1.98 415.78 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1108.08076  165.83037   6.682 2.41e-11 ***
Year          -0.53333    0.08226  -6.483 9.16e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 37.68 on 21610 degrees of freedom
Multiple R-squared:  0.001941,  Adjusted R-squared:  0.001895 
F-statistic: 42.03 on 1 and 21610 DF,  p-value: 9.165e-11

Model Analysis (Reference: ChatGPT)

# Diagnostic Plots
par(mfrow = c(2, 2))
plot(lm_model)

The equation for regression analysis (Reference: ChatGPT)

$$\text{Count} = \beta_0 + \beta_1 \cdot \text{Year} + \epsilon$$

Count is the dependent variable (response), representing the number of babies given a specific name. Year is the independent variable (predictor), indicating the year the name was recorded. $\beta_0$ is the intercept, $\beta_1$ is the coefficient for Year, and $\epsilon$ is the error term.
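To make the equation concrete, the estimates from the summary(lm_model) output can be plugged in by hand. The sketch below copies the two coefficients and the R-squared from that output; the helper `predict_count()` and the year 2015 are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from the summary(lm_model) output above
b0 <- 1108.08076   # intercept
b1 <- -0.53333     # change in expected Count per additional year

# Hypothetical helper: expected Count for a given year under the fitted line
predict_count <- function(year) b0 + b1 * year

years <- c(2011, 2015)   # 2011 appears in the data; 2015 is only illustrative
fitted_counts <- predict_count(years)
round(fitted_counts, 2)

# Share of variance in Count explained by Year, as a percentage
r_squared <- 0.001941
round(100 * r_squared, 2)
```

The negative slope means the expected count per name falls by roughly half a baby per year, while the small R-squared suggests Year on its own accounts for very little of the variation in Count.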

Visualization: Bar Graph of Top 4 Most Popular Baby Names by Ethnicity

# Keep the best-ranked rows per ethnicity (top_n() keeps ties on Rank, so a group can return more than 4 rows)
top_4_names <- baby_names %>%
  group_by(Ethnicity) %>%
  top_n(4, -Rank) %>%
  ungroup()
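Because Rank is assigned within each year and group, the same rank value can recur across years, and top_n() keeps ties, so the filter above may return more than four rows per ethnicity. One alternative, sketched here under the assumption that "top 4" means the four names with the highest total Count across all years, is to sum the counts per name first and then take the four largest totals. The tibble `toy` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data in the same shape as baby_names (values invented)
toy <- tibble(
  Ethnicity = c(rep("HISPANIC", 6), rep("WHITE NON HISPANIC", 6)),
  `Child's First Name` = c("LIAM", "LIAM", "MIA", "SOFIA", "JAYDEN", "EMMA",
                           "DAVID", "DAVID", "ESTHER", "OLIVIA", "JOSEPH", "LEAH"),
  Count = c(100, 90, 80, 70, 60, 10, 120, 110, 95, 85, 75, 5)
)

top_4_by_total <- toy %>%
  # Add up each name's counts across all of its rows
  group_by(Ethnicity, `Child's First Name`) %>%
  summarise(Total = sum(Count), .groups = "drop") %>%
  # Keep exactly four names per ethnicity, breaking ties arbitrarily
  group_by(Ethnicity) %>%
  slice_max(Total, n = 4, with_ties = FALSE) %>%
  ungroup()

top_4_by_total
```

Setting with_ties = FALSE guarantees at most four rows per group, at the cost of dropping a tied name arbitrarily; whether that trade-off is acceptable depends on how the chart is meant to be read.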

Clean up Ethnicity levels (remove leading/trailing spaces, standardize case) (Reference: ChatGPT)

top_4_names$Ethnicity <- trimws(top_4_names$Ethnicity)  # Remove whitespace
top_4_names$Ethnicity <- tolower(top_4_names$Ethnicity) # Convert to lowercase to match the keys used in scale_fill_manual()

ggplot(top_4_names, aes(x = reorder(`Child's First Name`, -Count), 
                         y = Count, 
                         fill = Ethnicity)) +  
  geom_bar(stat = "identity", 
           position = position_dodge(width = 0.8), 
           width = 0.7) +
  labs(title = "Top 4 Most Popular Baby Names by Ethnicity in NYC",
       x = "Child's First Name",
       y = "Count of Babies",
       caption = "Source: NYC Open Data") +
  scale_fill_manual(values = c("hispanic" = "red", 
                                "asian and pacific islander" = "purple", 
                                "white non hispanic" = "blue", 
                                "black non hispanic" = "green")) +  
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(face = "bold", size = 14),
        legend.position = "top")

Essay

In this study, I examined a dataset of popular baby names in New York City. I cleaned the data by removing duplicate rows and used exploratory data analysis to visualize trends.

A linear regression model was fit to study the association between a name’s frequency (Count) and the year of birth (Year). The p-value indicates whether a variable’s coefficient is statistically distinguishable from zero. The Year variable in this model has a very low p-value (9.16e-11, well below 0.05), so its negative coefficient (about -0.53 per year) is statistically significant. However, the R-squared is only about 0.002, which means Year alone explains roughly 0.2% of the variance in Count.

Because the objective was to focus on the top 4 most common baby names for each group, I combined grouping and filtering techniques: I grouped the data by ethnicity and kept the best-ranked names within each group, and the chart orders the names by descending count.

Ethnic groups clearly have different preferences when it comes to names. Popular names within an ethnic group reflect cultural influences. For example, the most popular names among Hispanic babies differ from those among White non-Hispanic or Asian and Pacific Islander babies. The number of babies given the most popular names also varies greatly by ethnicity. This may indicate that groups differ in how diverse their name choices are, or that some cultures favor more common names while others favor more unusual ones.

While this analysis showed the main trends and fit a linear regression, adding more variables or using more complex models could improve the results. Although the chart offers helpful insights, there were more components I wanted to include but could not because of time or technical limits. An interactive graphic, for example, would let users hover over or click bars to see exact values or other contextual details about the names, such as past popularity trends.