Introduction

Baby names can reflect social, cultural, and demographic patterns over time.
This investigation uses open data from the U.S. Social Security Administration, accessed through Data.gov.
The dataset records baby names from Social Security card applications.
The focus of this analysis is California baby name popularity from 2000 to 2025.
In this report, popularity means the number of babies given a specific name in a specific year and sex group.
This topic is interesting because name popularity is far from evenly distributed: most names are uncommon, while a small number of names are very popular.

Problem Statement

This investigation asks:

How does baby name popularity differ by sex in California from 2000 to 2025?

To answer this question, this report will:
- describe the data source and pre-processing steps;
- summarise the main numerical and categorical variables;
- visualise the distribution and yearly trends in name popularity;
- test whether mean name popularity differs between female and male records;
- test whether sex is associated with popularity category.

Data Source

The data is open data from the U.S. Social Security Administration, accessed through Data.gov.
Data.gov lists the access level as public and the licence as CC0 1.0 Public Domain.
The full dataset contains baby name records by state, sex, year of birth, name, and number of babies.
The data were not collected by random sampling; they are administrative records based on Social Security card applications.
To keep the investigation manageable, this analysis only uses California records from 2000 to 2025.

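# Packages used in this report: import, wrangling, plots, and tables
library(readr)
library(dplyr)
library(ggplot2)
library(knitr)

# CA.TXT has no header row, so column names are supplied here
baby_names_ca <- read_csv(
  "CA.TXT",
  col_names = c("state", "sex", "year", "name", "number"),
  show_col_types = FALSE
)

# Keep only births from 2000 to 2025
baby_names_ca <- baby_names_ca %>%
  filter(year >= 2000, year <= 2025)

head(baby_names_ca)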


# Data Description and Pre-processing

Important variables:

- `state`: U.S. state abbreviation. This analysis uses California only.
- `sex`: sex recorded as `F` or `M`.
- `year`: year of birth.
- `name`: baby name.
- `number`: number of babies given that name in that year and sex group.

Pre-processing steps:

- Imported the California text file from the ZIP dataset.
- Added meaningful column names.
- Filtered the data to years 2000 to 2025.
- Converted `sex` into a labelled factor.
- Checked for missing values.
- Created a popularity category for the categorical association test.


``` r
baby_names_ca <- baby_names_ca %>%
  mutate(
    # Label the sex codes for readable tables and plots
    sex = factor(sex, levels = c("F", "M"), labels = c("Female", "Male")),
    # Cut points are the 33rd and 66th percentiles of all filtered records
    popularity_category = case_when(
      number < quantile(number, 0.33, na.rm = TRUE) ~ "Low",
      number < quantile(number, 0.66, na.rm = TRUE) ~ "Medium",
      TRUE ~ "High"
    ),
    popularity_category = factor(popularity_category, levels = c("Low", "Medium", "High"))
  )

# Count missing values in each original column
missing_table <- baby_names_ca %>%
  summarise(
    missing_state = sum(is.na(state)),
    missing_sex = sum(is.na(sex)),
    missing_year = sum(is.na(year)),
    missing_name = sum(is.na(name)),
    missing_number = sum(is.na(number))
  )

kable(missing_table)
```

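| missing_state | missing_sex | missing_year | missing_name | missing_number |
|--------------:|------------:|-------------:|-------------:|---------------:|
| 0 | 0 | 0 | 0 | 0 |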

Descriptive Statistics

The main numerical variable is `number`.
It measures how many babies were given a specific name in a specific year and sex group.
Descriptive statistics summarise the centre, spread, and range of name popularity.

summary_table <- baby_names_ca %>%
  group_by(sex) %>%
  summarise(
    Min = min(number, na.rm = TRUE),
    Q1 = quantile(number, 0.25, na.rm = TRUE),
    Median = median(number, na.rm = TRUE),
    Mean = mean(number, na.rm = TRUE),
    Q3 = quantile(number, 0.75, na.rm = TRUE),
    Max = max(number, na.rm = TRUE),
    SD = sd(number, na.rm = TRUE),
    n = n(),
    Missing = sum(is.na(number)),
    .groups = "drop"
  )

kable(summary_table, digits = 2)

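| sex | Min | Q1 | Median | Mean | Q3 | Max | SD | n | Missing |
|:-------|----:|---:|-------:|------:|---:|-----:|-------:|-------:|--------:|
| Female | 5 | 7 | 12 | 52.38 | 33 | 3645 | 158.70 | 101362 | 0 |
| Male | 5 | 7 | 12 | 79.54 | 36 | 4344 | 259.35 | 73485 | 0 |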

Descriptive Statistics Interpretation

The mean number of babies per female name record is 52.38.
The mean number of babies per male name record is 79.54.
For both groups the median (12) is well below the mean.
This indicates that the data is right-skewed: most name records have relatively low counts, while a small number of very popular names pull the mean upwards.
The highest count for one name-year-sex record is 4344.
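
One way to read the gap between the two means is to multiply each group's record count by its mean, which recovers the implied total number of babies behind those records. The sketch below uses only the rounded values printed in the summary table, so the totals are approximate; the object and column names are illustrative and not part of the original analysis.

```r
# Values copied from the summary table above; names are illustrative
summary_stats <- data.frame(
  sex  = c("Female", "Male"),
  n    = c(101362, 73485),
  mean = c(52.38, 79.54)
)

# n * mean recovers the approximate total babies behind each group's records
summary_stats$implied_total <- summary_stats$n * summary_stats$mean

summary_stats
```

With these figures the male group has fewer name records but a larger implied total. This would be consistent with male births being spread over fewer name-year records, although names with fewer than 5 occurrences are not in the data.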

Visualisation 1: Distribution

ggplot(baby_names_ca, aes(x = number)) +
  geom_histogram(bins = 40) +
  labs(
    title = "Distribution of Baby Name Popularity in California, 2000-2025",
    x = "Number of babies with the name",
    y = "Frequency"
  )

The histogram shows that the distribution is strongly right-skewed.
Most name records have low counts.
A small number of name records have much higher counts and may appear as outliers.
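
Because almost all records fall into the first few bins of a linear axis, a log-scaled x-axis may make both ends of the distribution easier to see. The sketch below is one possible variant rather than part of the original analysis; `toy_counts` is simulated, hypothetical data standing in for `baby_names_ca`, and `p_log` is an illustrative name.

```r
library(ggplot2)

# Hypothetical right-skewed counts standing in for baby_names_ca$number;
# the floor of 5 mimics the minimum count seen in the summary table
set.seed(1)
toy_counts <- data.frame(
  number = 5 + round(rlnorm(2000, meanlog = 2, sdlog = 1.5))
)

# Same kind of histogram, but with a log10 x-axis to spread out the low counts
p_log <- ggplot(toy_counts, aes(x = number)) +
  geom_histogram(bins = 40) +
  scale_x_log10() +
  labs(
    title = "Distribution of name counts on a log scale (simulated data)",
    x = "Number of babies with the name (log scale)",
    y = "Frequency"
  )

p_log
```

To try this on the real data, `toy_counts` would be replaced with `baby_names_ca`.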

Visualisation 2: Sex Comparison

ggplot(baby_names_ca, aes(x = sex, y = number)) +
  geom_boxplot() +
  labs(
    title = "Baby Name Popularity by Sex in California, 2000-2025",
    x = "Sex",
    y = "Number of babies with the name"
  )

The boxplot compares the distribution of name counts for female and male records.
It shows the median, spread, and outliers for each sex group.
Outliers are expected because some names are much more popular than most other names.

Visualisation 3: Yearly Trend

yearly_totals <- baby_names_ca %>%
  group_by(year, sex) %>%
  summarise(total_babies = sum(number, na.rm = TRUE), .groups = "drop")

ggplot(yearly_totals, aes(x = year, y = total_babies, colour = sex)) +
  geom_line() +
  labs(
    title = "Total Babies Recorded by Sex in California, 2000-2025",
    x = "Year",
    y = "Total number of babies",
    colour = "Sex"
  )

This graph shows how the total number of babies recorded in the dataset changed over time.
It also allows comparison between female and male records across years.

Task 5: Hypothesis Test and Confidence Interval

Research question:

Is the average baby name count different between female and male names in California from 2000 to 2025?

Hypotheses:

\[H_0: \mu_F = \mu_M\]

\[H_A: \mu_F \ne \mu_M\]

where:

\(\mu_F\) is the mean number of babies per female name record.
\(\mu_M\) is the mean number of babies per male name record.

Task 5: Assumptions

The response variable `number` is numerical.
The explanatory variable `sex` has two independent groups: Female and Male.
Welch’s two-sample t-test is used because it does not require equal variances.
The distribution is strongly right-skewed, so the t-test result should be interpreted with caution.
However, both groups contain more than 70,000 records, so by the central limit theorem the sampling distribution of each group mean is approximately normal and the t-test is reasonably robust to non-normality.

t_test_result <- t.test(number ~ sex, data = baby_names_ca)
t_test_result

## 
##  Welch Two Sample t-test
## 
## data:  number by sex
## t = -25.176, df = 112769, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -29.27315 -25.04438
## sample estimates:
## mean in group Female   mean in group Male 
##             52.38224             79.54101

Task 5: Result Interpretation

The p-value is less than 2.2 × 10^-16.
The 95% confidence interval for the difference in means (Female minus Male) is from -29.27 to -25.04, which does not include 0.
At the 5% significance level, the p-value is less than 0.05, so H0 is rejected.
This means there is statistically significant evidence that the mean baby name count differs between female and male records, with male records averaging about 25 to 29 more babies per record.
The result should be understood as a difference in average count per name record, not a claim about every individual baby name.
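
To show where the reported statistic and interval come from, the Welch calculation can be reproduced by hand from the group summaries. This is a minimal sketch that assumes the means from the t-test output and the rounded standard deviations and record counts from the summary table; the variable names are illustrative.

```r
# Summary values copied from the output above; variable names are illustrative
mean_f <- 52.38224; sd_f <- 158.70; n_f <- 101362
mean_m <- 79.54101; sd_m <- 259.35; n_m <- 73485

# Variance of each group's sample mean
v_f <- sd_f^2 / n_f
v_m <- sd_m^2 / n_m

# Welch standard error, t statistic, and Welch-Satterthwaite degrees of freedom
se     <- sqrt(v_f + v_m)
t_stat <- (mean_f - mean_m) / se
df_w   <- (v_f + v_m)^2 / (v_f^2 / (n_f - 1) + v_m^2 / (n_m - 1))

# 95% confidence interval for the Female minus Male difference in means
ci <- (mean_f - mean_m) + c(-1, 1) * qt(0.975, df_w) * se

round(c(t = t_stat, df = df_w, lower = ci[1], upper = ci[2]), 2)
```

Because the standard deviations are rounded to two decimal places, the values should agree with the `t.test()` output only approximately.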

Task 6: Categorical Association

Research question:

Is there an association between sex and popularity category?

Hypotheses:

\[H_0: \text{Sex and popularity category are independent.}\]

\[H_A: \text{Sex and popularity category are associated.}\]

`popularity_category` was created from the `number` variable using approximate terciles, with cut points at the 33rd and 66th percentiles.
The categories are Low, Medium, and High popularity.

association_table <- table(baby_names_ca$sex, baby_names_ca$popularity_category)
association_table

##         
##            Low Medium  High
##   Female 31359  34895 35108
##   Male   22285  25025 26175
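
The terciles are only approximate because `number` is an integer count with many tied values and the cut points use strict `<` comparisons, so the three categories need not each hold a third of the records. A quick way to see the actual split is to compute column shares from the printed table. The sketch below re-enters those counts by hand; `association_counts` and `category_share` are illustrative names.

```r
# Counts copied from the contingency table printed above
association_counts <- matrix(
  c(31359, 34895, 35108,
    22285, 25025, 26175),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    sex = c("Female", "Male"),
    popularity_category = c("Low", "Medium", "High")
  )
)

# Share of all records falling in each popularity category, in percent
category_share <- colSums(association_counts) / sum(association_counts)
round(category_share * 100, 1)
```

On these counts the Low category holds roughly 31% of records, with Medium and High at roughly 34% and 35%.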

Task 6: Chi-square Test

chi_result <- chisq.test(association_table)
chi_result

## 
##  Pearson's Chi-squared test
## 
## data:  association_table
## X-squared = 18.663, df = 2, p-value = 8.86e-05

chi_result$expected

##         
##               Low   Medium     High
##   Female 31098.41 34736.72 35526.87
##   Male   22545.59 25183.28 25756.13

The chi-square test checks whether sex and popularity category are independent.
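The expected counts are checked because the chi-square approximation is only reliable when expected cell counts are sufficiently large (a common rule of thumb is at least 5). Here the smallest expected count is 22545.59, so this condition is easily met.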
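
The expected-count check can also be done in code, and the Pearson residuals returned by `chisq.test()` show which cells depart most from independence. The following is a sketch under the assumption that the printed contingency table is re-entered by hand; `observed` and `chi_check` are illustrative names.

```r
# Counts copied from the contingency table printed earlier
observed <- matrix(
  c(31359, 34895, 35108,
    22285, 25025, 26175),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    sex = c("Female", "Male"),
    popularity_category = c("Low", "Medium", "High")
  )
)

chi_check <- chisq.test(observed)

# Rule-of-thumb check: every expected count should be at least 5
all(chi_check$expected >= 5)

# Pearson residuals: (observed - expected) / sqrt(expected).
# A negative value means fewer records than expected under independence.
round(chi_check$residuals, 2)
```

On these counts, female records are under-represented in the High category and male records are over-represented there.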

Task 6: Result Interpretation

The chi-square statistic is 18.66 with 2 degrees of freedom.
The p-value is 8.86 × 10^-5.
At the 5% significance level, the p-value is less than 0.05, so H0 is rejected.
This means there is statistical evidence of an association between sex and popularity category.
However, Cramér’s V is 0.01, which indicates that the association is very weak in practical terms; with about 175,000 records, even small departures from independence can be statistically significant.
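
The code that produced Cramér's V is not shown in this report, so the sketch below illustrates one way to obtain it from the printed chi-square statistic and the total number of records. The variable names are illustrative.

```r
# Values copied from the chi-square output and the summary table above
chi_sq  <- 18.663           # X-squared from the Pearson chi-squared test
n_total <- 101362 + 73485   # total number of name records
n_rows  <- 2                # sex: Female, Male
n_cols  <- 3                # popularity category: Low, Medium, High

# Cramér's V = sqrt(chi-square / (n * (min(rows, cols) - 1)))
cramers_v <- sqrt(chi_sq / (n_total * (min(n_rows, n_cols) - 1)))
round(cramers_v, 2)
```

Cramér's V ranges from 0 to 1, so a value this close to 0 supports reading the association as statistically detectable but practically negligible.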

Task 6: Association Plot

association_df <- baby_names_ca %>%
  count(sex, popularity_category) %>%
  group_by(sex) %>%
  mutate(percent = n / sum(n) * 100)

ggplot(association_df, aes(x = popularity_category, y = percent, fill = sex)) +
  geom_col(position = "dodge") +
  labs(
    title = "Popularity Category by Sex",
    x = "Popularity category",
    y = "Percentage of records",
    fill = "Sex"
  )

This bar chart shows, within each sex, the percentage of records that fall in each popularity category.
It helps visually compare whether female and male names have similar or different popularity patterns.

Discussion

This investigation found that baby name popularity in California from 2000 to 2025 is highly uneven.
Most names appeared with low counts, while a smaller number of names were very popular.
The descriptive statistics and histogram support this because the distribution is right-skewed.
The Welch t-test found a statistically significant difference in mean name count between female and male records, with male records averaging about 27 more babies per record.
The chi-square test found evidence of an association between sex and popularity category, although Cramér’s V of 0.01 shows that this association is very weak.

Strengths and Limitations

Strengths:

- The dataset is official, open, public, and clearly documented.
- It covers many years of data.
- It includes both numerical and categorical variables, making it suitable for descriptive statistics, hypothesis testing, and categorical association.

Limitations:

- The dataset only includes names with at least 5 occurrences, so very rare names are excluded.
- This analysis focuses only on California, so results may not represent the whole United States.
- The analysis shows statistical patterns but does not explain the cultural or social reasons why certain names became popular.
- Repeated yearly records mean observations are not completely independent in a real-world sense, so results should be interpreted as exploratory.

Final Conclusion

The key take-home message is that baby name popularity in California is strongly uneven, with most names having low counts and a small number of names being highly popular. The analysis also found statistically significant sex-related differences, although the categorical association was very weak in practical strength.
The results suggest that baby name popularity can be meaningfully analysed using numerical summaries, visualisations, a t-test, and a chi-square test.
Overall, this dataset is suitable for the assignment because it is open, clearly documented, and supports the required statistical analyses.

References

Social Security Administration. (2026). Baby Names from Social Security Card Applications - State and District of Columbia Data. Data.gov. https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-state-and-district-of-columbia-data
Social Security Administration. (2026). State-specific baby names dataset. https://www.ssa.gov/oact/babynames/state/namesbystate.zip
R Core Team. (2026). R: A language and environment for statistical computing. R Foundation for Statistical Computing.

Analysis of Baby Name Popularity in California

MATH1324 Assignment 2
