This analysis uses a dataset from the NYC Open Data repository on popular baby names in New York City, broken down by gender and ethnic group. The dataset includes the following variables:
- Year of Birth: The year in which the babies were born.
- Gender: The gender of the babies.
- Ethnicity: The ethnicity of the babies.
- Child’s First Name: The first names of babies.
- Count: The number of babies given a specific name in a given year.
- Rank: The popularity rank of the name within the respective year and group.

My goal is to examine the most common baby names among different ethnic groups in New York City. I’ll focus on the four most popular names within each ethnicity, looking for patterns and trends in name choices. This study tries to shed light on how cultural background affects naming conventions in one of the world’s most diverse cities. The project also explores gender and ethnicity differences in baby name trends over time, as well as how the popularity (Rank) and frequency (Count) of a name relate to each other.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 69214 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Gender, Ethnicity, Child's First Name
dbl (3): Year of Birth, Count, Rank
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
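The message above suggests declaring the column types explicitly. A minimal sketch of what that might look like follows. The inline sample rows and the object names `sample_csv` and `baby_names_sample` are hypothetical stand-ins for the real file, which is not shown here; only the column names and types come from the column specification above.

```r
library(readr)

# Hypothetical inline sample standing in for the real CSV file; the rows are made up.
sample_csv <- I(
"Year of Birth,Gender,Ethnicity,Child's First Name,Count,Rank
2011,FEMALE,HISPANIC,EXAMPLE_A,13,75
2011,MALE,HISPANIC,EXAMPLE_B,20,50
")

# Declaring the types up front silences the column specification message
# and makes the read fail loudly if a column is missing.
baby_names_sample <- read_csv(
  sample_csv,
  col_types = cols(
    `Year of Birth` = col_double(),
    Gender = col_character(),
    Ethnicity = col_character(),
    `Child's First Name` = col_character(),
    Count = col_double(),
    Rank = col_double()
  )
)
```

With the real data, the first argument would be the path to the downloaded CSV rather than the inline sample.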
The equation for the regression analysis (reference: ChatGPT)
$$\text{Count} = \beta_0 + \beta_1(\text{Year}) + \epsilon$$

Count is the dependent variable (response), representing the number of babies given a specific name. Year is the independent variable (predictor), indicating the year the name was recorded. $\beta_0$ is the intercept, $\beta_1$ is the coefficient for Year, and $\epsilon$ is the error term.
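As a rough illustration of how a model of this form can be fitted in R, the sketch below uses `lm()` on a small made-up data frame. The `toy` values are hypothetical and are not taken from the NYC dataset; with the real data, the predictor would be the `Year of Birth` column.

```r
# Hypothetical toy data: made-up counts for one name across years,
# used only to show the mechanics of the fit.
toy <- data.frame(
  Year  = c(2011, 2012, 2013, 2014, 2015, 2016),
  Count = c(120, 115, 118, 105, 98, 96)
)

# Fit Count = b0 + b1 * Year + error by ordinary least squares.
fit <- lm(Count ~ Year, data = toy)

coef(fit)                  # estimates of the intercept (b0) and the slope (b1)
summary(fit)$coefficients  # adds standard errors, t values, and p-values

# p-value for the Year coefficient
p_year <- summary(fit)$coefficients["Year", "Pr(>|t|)"]
```

The sign of the Year coefficient gives the direction of the trend, and `p_year` is the value that would be compared against the 0.05 threshold.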
Visualization: Bar Graph of Top 4 Most Popular Baby Names by Ethnicity
# Filter the dataset to get the top 4 names per ethnicity.
# Note: top_n() keeps ties, so a group can return more than 4 rows.
top_4_names <- baby_names %>%
  group_by(Ethnicity) %>%
  top_n(4, -Rank) %>%
  ungroup()

top_4_names$Ethnicity <- trimws(top_4_names$Ethnicity)  # Remove whitespace
top_4_names$Ethnicity <- tolower(top_4_names$Ethnicity) # Convert to lowercase if desired

# Dodged bar chart of counts per name, colored by ethnicity
ggplot(top_4_names, aes(x = reorder(`Child's First Name`, -Count), y = Count, fill = Ethnicity)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.8), width = 0.7) +
  labs(
    title = "Top 4 Most Popular Baby Names by Ethnicity in NYC",
    x = "Child's First Name",
    y = "Count of Babies",
    caption = "Source: NYC Open Data"
  ) +
  scale_fill_manual(values = c(
    "hispanic" = "red",
    "asian and pacific islander" = "purple",
    "white non hispanic" = "blue",
    "black non hispanic" = "green"
  )) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "top"
  )
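The manual fill scale above only colors labels that exactly match its four names, so any spelling variant left in the Ethnicity column would not get one of the specified colors. The sketch below shows one way such variants might be harmonized before plotting. The truncated labels are an assumption about how the raw column might look, not values confirmed by this report, and `eth` and `eth_clean` are hypothetical names.

```r
library(dplyr)

# Hypothetical labels: the truncated variants below are an assumption about
# the raw Ethnicity column, not values confirmed by the source.
eth <- data.frame(
  Ethnicity = c("HISPANIC", "WHITE NON HISP", "White Non Hispanic ",
                "ASIAN AND PACI", "BLACK NON HISPANIC")
)

eth_clean <- eth %>%
  mutate(
    # Same cleaning as in the report: trim whitespace, then lowercase
    Ethnicity = tolower(trimws(Ethnicity)),
    # Map truncated variants onto the labels used in scale_fill_manual()
    Ethnicity = case_when(
      Ethnicity == "white non hisp" ~ "white non hispanic",
      Ethnicity == "black non hisp" ~ "black non hispanic",
      Ethnicity == "asian and paci" ~ "asian and pacific islander",
      TRUE ~ Ethnicity
    )
  )

# Labels that still have no color assigned in the manual scale
unmatched <- setdiff(
  unique(eth_clean$Ethnicity),
  c("hispanic", "asian and pacific islander",
    "white non hispanic", "black non hispanic")
)
```

Running the `setdiff()` check on the real column is a quick way to see whether any cleaning of this kind is needed.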
Essay
In this study, I examined a dataset of popular baby names in New York City. I cleaned the data by removing duplicate entries and used exploratory data analysis to visualize trends. A linear regression model was developed to study name frequency (Count) alongside popularity (Rank); as the equation above specifies, the model uses Year as the predictor of Count. The p-value indicates the statistical significance of each variable in the model. The Year variable has a low p-value (p < 0.05), which indicates that it is statistically significant and contributes to explaining the variance in baby name popularity.

Because the objective was to focus on the four most common baby names in each group, I aggregated the data by ethnicity and name count. This required a combination of grouping and filtering: I sorted the names in descending order of count and kept only the top four for each ethnicity.

Ethnic groups clearly have different naming preferences. The popular names within a particular ethnic group reflect cultural influences. For example, the most popular names for Hispanic babies differ from those for White non-Hispanic and Asian and Pacific Islander babies. The number of infants bearing the most popular names also varies greatly by ethnicity. This may indicate that groups differ in name diversity, or that some cultures favor more common names and others more unusual ones.

While this research showed the main trends and included a linear regression, adding more variables or more complex models could improve the results. Although the chart offers helpful insights, there were additional components I wanted to include but could not because of time or technical limits. An interactive graphic, for example, would let users hover over or click bars to see exact values or other contextual details about the names, such as past popularity trends.
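The cleaning, grouping, and filtering steps described in the essay can be sketched roughly as follows. This is one possible reading of those steps, not necessarily the exact code used in the analysis. The `toy` data frame, its placeholder names, and the `Total` column are hypothetical, and `slice_max()` is used here in place of `top_n()` so that each group returns exactly four names.

```r
library(dplyr)

# Hypothetical rows with placeholder names; the values are made up.
toy <- data.frame(
  Ethnicity = c(rep("hispanic", 7), rep("asian and pacific islander", 5)),
  Name  = c("A", "A", "B", "C", "D", "E", "E", "P", "Q", "R", "S", "T"),
  Year  = c(2011, 2011, 2011, 2011, 2011, 2011, 2012, 2011, 2011, 2011, 2011, 2011),
  Count = c(50, 50, 40, 30, 20, 10, 15, 25, 35, 5, 45, 15)
)

top_4 <- toy %>%
  distinct() %>%                                       # drop exact duplicate rows
  group_by(Ethnicity, Name) %>%
  summarise(Total = sum(Count), .groups = "drop") %>%  # total count per name per group
  group_by(Ethnicity) %>%
  slice_max(Total, n = 4, with_ties = FALSE) %>%       # keep exactly four names per group
  ungroup() %>%
  arrange(Ethnicity, desc(Total))                      # descending order of counts
```

Summing counts across years before ranking is a design choice made for this sketch. Ranking within a single year would need an extra `filter()` on the year column first.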