Analysis of Baby Name Popularity in California

MATH1324 Assignment 2

LE KHANH TOAN - 3932146

31 May, 2026

Introduction

Problem Statement

How does baby name popularity differ by sex in California from 2000 to 2025?

Data Source

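library(readr)    # read_csv()
library(dplyr)    # %>%, filter(), mutate(), summarise(), count()
library(ggplot2)  # ggplot() and geoms used in the visualisations
library(knitr)    # kable()

# The file has no header row, so the column names are supplied explicitly.
baby_names_ca <- read_csv(
  "CA.TXT",
  col_names = c("state", "sex", "year", "name", "number"),
  show_col_types = FALSE
)

# Keep only the study period.
baby_names_ca <- baby_names_ca %>%
  filter(year >= 2000, year <= 2025)

head(baby_names_ca)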

Data Description and Pre-processing

Important variables:

- `state`: U.S. state abbreviation. This analysis uses California only.
- `sex`: sex recorded as `F` or `M`.
- `year`: year of birth.
- `name`: baby name.
- `number`: number of babies given that name in that year and sex group.

Pre-processing steps:

- Imported the California text file from the ZIP dataset.
- Added meaningful column names.
- Filtered the data to years 2000 to 2025.
- Converted `sex` into a labelled factor.
- Checked for missing values.
- Created a popularity category for the categorical association test.
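
The popularity categories in the last step come from comparing each count against the 33rd and 66th percentiles of `number`. The following is a minimal base-R sketch of that cut-off logic on a made-up vector (`toy_counts` is hypothetical and not taken from the dataset); it uses nested `ifelse()` in place of `case_when()` so that it runs without any packages.

``` r
# Hypothetical counts, chosen only to illustrate the binning rule.
toy_counts <- c(5, 6, 7, 9, 12, 20, 33, 150, 900)

# Cut-offs at the 33rd and 66th percentiles (R's default quantile type).
cut_low <- quantile(toy_counts, 0.33, names = FALSE)
cut_mid <- quantile(toy_counts, 0.66, names = FALSE)

# Same ordering of conditions as the case_when() call in the report:
# below the first cut-off is "Low", below the second is "Medium", else "High".
toy_category <- ifelse(
  toy_counts < cut_low, "Low",
  ifelse(toy_counts < cut_mid, "Medium", "High")
)
toy_category <- factor(toy_category, levels = c("Low", "Medium", "High"))

table(toy_category)
```

On real data with many tied small counts, the three groups need not be equal in size, because every record sharing a value falls on the same side of a cut-off.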


``` r
baby_names_ca <- baby_names_ca %>%
  mutate(
    sex = factor(sex, levels = c("F", "M"), labels = c("Female", "Male")),
    popularity_category = case_when(
      number < quantile(number, 0.33, na.rm = TRUE) ~ "Low",
      number < quantile(number, 0.66, na.rm = TRUE) ~ "Medium",
      TRUE ~ "High"
    ),
    popularity_category = factor(popularity_category, levels = c("Low", "Medium", "High"))
  )

missing_table <- baby_names_ca %>%
  summarise(
    missing_state = sum(is.na(state)),
    missing_sex = sum(is.na(sex)),
    missing_year = sum(is.na(year)),
    missing_name = sum(is.na(name)),
    missing_number = sum(is.na(number))
  )

kable(missing_table)
```

| missing_state | missing_sex | missing_year | missing_name | missing_number |
|--------------:|------------:|-------------:|-------------:|---------------:|
|             0 |           0 |            0 |            0 |              0 |

Descriptive Statistics

summary_table <- baby_names_ca %>%
  group_by(sex) %>%
  summarise(
    Min = min(number, na.rm = TRUE),
    Q1 = quantile(number, 0.25, na.rm = TRUE),
    Median = median(number, na.rm = TRUE),
    Mean = mean(number, na.rm = TRUE),
    Q3 = quantile(number, 0.75, na.rm = TRUE),
    Max = max(number, na.rm = TRUE),
    SD = sd(number, na.rm = TRUE),
    n = n(),
    Missing = sum(is.na(number)),
    .groups = "drop"
  )

kable(summary_table, digits = 2)
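
| sex    | Min | Q1 | Median |  Mean | Q3 |  Max |     SD |      n | Missing |
|:-------|----:|---:|-------:|------:|---:|-----:|-------:|-------:|--------:|
| Female |   5 |  7 |     12 | 52.38 | 33 | 3645 | 158.70 | 101362 |       0 |
| Male   |   5 |  7 |     12 | 79.54 | 36 | 4344 | 259.35 |  73485 |       0 |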

Descriptive Statistics Interpretation

Visualisation 1: Distribution

ggplot(baby_names_ca, aes(x = number)) +
  geom_histogram(bins = 40) +
  labs(
    title = "Distribution of Baby Name Popularity in California, 2000-2025",
    x = "Number of babies with the name",
    y = "Frequency"
  )

Visualisation 2: Sex Comparison

ggplot(baby_names_ca, aes(x = sex, y = number)) +
  geom_boxplot() +
  labs(
    title = "Baby Name Popularity by Sex in California, 2000-2025",
    x = "Sex",
    y = "Number of babies with the name"
  )

Visualisation 3: Yearly Trend

yearly_totals <- baby_names_ca %>%
  group_by(year, sex) %>%
  summarise(total_babies = sum(number, na.rm = TRUE), .groups = "drop")

ggplot(yearly_totals, aes(x = year, y = total_babies, colour = sex)) +
  geom_line() +
  labs(
    title = "Total Babies Recorded by Sex in California, 2000-2025",
    x = "Year",
    y = "Total number of babies",
    colour = "Sex"
  )

Task 5: Hypothesis Test and Confidence Interval

Research question:

Is the average baby name count different between female and male names in California from 2000 to 2025?

Hypotheses:

\[H_0: \mu_F = \mu_M\]

\[H_A: \mu_F \ne \mu_M\]

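where \(\mu_F\) and \(\mu_M\) are the population mean values of `number` (babies per name record) for female and male names, respectively.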

Task 5: Assumptions

t_test_result <- t.test(number ~ sex, data = baby_names_ca)
t_test_result
## 
##  Welch Two Sample t-test
## 
## data:  number by sex
## t = -25.176, df = 112769, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -29.27315 -25.04438
## sample estimates:
## mean in group Female   mean in group Male 
##             52.38224             79.54101
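
The Welch statistic above can be cross-checked from the rounded group summaries in the descriptive statistics table. The sketch below is not part of the original analysis; the variable names are illustrative, and because the inputs are rounded to two decimals the results agree with the `t.test()` output only approximately.

``` r
# Group summaries copied from the descriptive statistics table (rounded).
mean_f <- 52.38; sd_f <- 158.70; n_f <- 101362
mean_m <- 79.54; sd_m <- 259.35; n_m <- 73485

# Variance of each group's sample mean.
var_f <- sd_f^2 / n_f
var_m <- sd_m^2 / n_m

# Welch standard error and t statistic for (Female - Male).
se_diff <- sqrt(var_f + var_m)
t_stat  <- (mean_f - mean_m) / se_diff

# Welch-Satterthwaite approximation to the degrees of freedom.
df_welch <- (var_f + var_m)^2 /
  (var_f^2 / (n_f - 1) + var_m^2 / (n_m - 1))

# 95% confidence interval for the difference in means.
ci <- (mean_f - mean_m) + c(-1, 1) * qt(0.975, df_welch) * se_diff

round(c(t = t_stat, df = df_welch, lower = ci[1], upper = ci[2]), 2)
```

The values should land close to the reported t = -25.176, df = 112769 and interval (-29.27, -25.04); small differences come from the rounding of the inputs.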

Task 5: Result Interpretation

Task 6: Categorical Association

Research question:

Is there an association between sex and popularity category?

Hypotheses:

\[H_0: \text{Sex and popularity category are independent.}\]

\[H_A: \text{Sex and popularity category are associated.}\]

association_table <- table(baby_names_ca$sex, baby_names_ca$popularity_category)
association_table
##         
##            Low Medium  High
##   Female 31359  34895 35108
##   Male   22285  25025 26175

Task 6: Chi-square Test

chi_result <- chisq.test(association_table)
chi_result
## 
##  Pearson's Chi-squared test
## 
## data:  association_table
## X-squared = 18.663, df = 2, p-value = 8.86e-05
chi_result$expected
##         
##               Low   Medium     High
##   Female 31098.41 34736.72 35526.87
##   Male   22545.59 25183.28 25756.13
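
The expected counts and the test statistic above follow from the observed table: each expected count is the row total times the column total divided by the grand total. The following base-R sketch rebuilds them from the counts printed earlier; the object names are illustrative. The last step computes Cramér's V, which is an addition not present in the original analysis and is included only as one possible way to gauge the strength of the association.

``` r
# Observed counts copied from the sex-by-popularity table.
observed <- matrix(
  c(31359, 34895, 35108,
    22285, 25025, 26175),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("Female", "Male"),
                  popularity = c("Low", "Medium", "High"))
)

n_total <- sum(observed)

# Expected count under independence = row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / n_total

# Pearson chi-square statistic, degrees of freedom, and p-value.
chi_sq <- sum((observed - expected)^2 / expected)
df_chi <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_val  <- pchisq(chi_sq, df = df_chi, lower.tail = FALSE)

# Cramer's V (assumption: used here as an effect-size measure, not in the report).
cramers_v <- sqrt(chi_sq / (n_total * (min(dim(observed)) - 1)))

round(c(chi_sq = chi_sq, df = df_chi, p = p_val, V = cramers_v), 5)
```

A value of V near zero would suggest that the association is weak in practical terms even when the p-value is small; with roughly 175,000 records, small departures from independence can reach statistical significance.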

Task 6: Result Interpretation

Task 6: Association Plot

association_df <- baby_names_ca %>%
  count(sex, popularity_category) %>%
  group_by(sex) %>%
  mutate(percent = n / sum(n) * 100)

ggplot(association_df, aes(x = popularity_category, y = percent, fill = sex)) +
  geom_col(position = "dodge") +
  labs(
    title = "Popularity Category by Sex",
    x = "Popularity category",
    y = "Percentage of records",
    fill = "Sex"
  )

Discussion

Strengths and Limitations

Strengths:

Limitations:

Final Conclusion

References