LE KHANH TOAN - 3932146
31 May, 2026
How does baby name popularity differ by sex in California from 2000 to 2025?
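``` r
# Packages: readr, dplyr and ggplot2 come with tidyverse; kable() is in knitr
library(tidyverse)
library(knitr)

# Read the California file; the columns are named here via col_names
baby_names_ca <- read_csv(
  "CA.TXT",
  col_names = c("state", "sex", "year", "name", "number"),
  show_col_types = FALSE
)

# Keep only records from 2000 to 2025
baby_names_ca <- baby_names_ca %>%
  filter(year >= 2000, year <= 2025)

head(baby_names_ca)
```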
# Data Description and Pre-processing
Important variables:
- `state`: U.S. state abbreviation. This analysis uses California only.
- `sex`: sex recorded as `F` or `M`.
- `year`: year of birth.
- `name`: baby name.
- `number`: number of babies given that name in that year and sex group.
Pre-processing steps:
- Imported the California text file from the ZIP dataset.
- Added meaningful column names.
- Filtered the data to years 2000 to 2025.
- Converted `sex` into a labelled factor.
- Checked for missing values.
- Created a popularity category for the categorical association test.
``` r
baby_names_ca <- baby_names_ca %>%
  mutate(
    # Recode sex from F/M to a labelled factor
    sex = factor(sex, levels = c("F", "M"), labels = c("Female", "Male")),
    # Approximate terciles of `number`: below the 0.33 quantile is "Low",
    # below the 0.66 quantile is "Medium", everything else is "High"
    popularity_category = case_when(
      number < quantile(number, 0.33, na.rm = TRUE) ~ "Low",
      number < quantile(number, 0.66, na.rm = TRUE) ~ "Medium",
      TRUE ~ "High"
    ),
    popularity_category = factor(popularity_category, levels = c("Low", "Medium", "High"))
  )

# Count missing values in each column
missing_table <- baby_names_ca %>%
  summarise(
    missing_state = sum(is.na(state)),
    missing_sex = sum(is.na(sex)),
    missing_year = sum(is.na(year)),
    missing_name = sum(is.na(name)),
    missing_number = sum(is.na(number))
  )

kable(missing_table)
```

| missing_state | missing_sex | missing_year | missing_name | missing_number |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
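
To make the tercile rule behind `popularity_category` concrete, here is a minimal base-R sketch on a made-up vector of ten counts. The toy values and the names `toy_number`, `cutoffs`, and `toy_category` are illustrative assumptions, not part of the dataset. The sketch uses nested `ifelse()` in place of `case_when()` so that it runs without extra packages, but it applies the same cut-offs: the 0.33 and 0.66 quantiles, compared with a strict `<`.

``` r
# Toy counts (hypothetical values, NOT taken from CA.TXT)
toy_number <- c(5, 5, 6, 7, 9, 12, 20, 33, 150, 900)

# Cut-offs at the 0.33 and 0.66 quantiles, as in the pre-processing step
cutoffs <- quantile(toy_number, c(0.33, 0.66), na.rm = TRUE)

# Same logic as the case_when() call: Low, then Medium, otherwise High
toy_category <- ifelse(
  toy_number < cutoffs[[1]], "Low",
  ifelse(toy_number < cutoffs[[2]], "Medium", "High")
)
toy_category <- factor(toy_category, levels = c("Low", "Medium", "High"))

# Number of toy records in each category
table(toy_category)
```

Because the comparison is strict and name counts are heavily tied at small values, the three groups need not be exactly equal in size, which is consistent with describing the categories as approximate terciles.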

``` r
# Descriptive statistics for `number`, computed separately for each sex
summary_table <- baby_names_ca %>%
  group_by(sex) %>%
  summarise(
    Min = min(number, na.rm = TRUE),
    Q1 = quantile(number, 0.25, na.rm = TRUE),
    Median = median(number, na.rm = TRUE),
    Mean = mean(number, na.rm = TRUE),
    Q3 = quantile(number, 0.75, na.rm = TRUE),
    Max = max(number, na.rm = TRUE),
    SD = sd(number, na.rm = TRUE),
    n = n(),
    Missing = sum(is.na(number)),
    .groups = "drop"
  )

kable(summary_table, digits = 2)
```

| sex | Min | Q1 | Median | Mean | Q3 | Max | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|---|
| Female | 5 | 7 | 12 | 52.38 | 33 | 3645 | 158.70 | 101362 | 0 |
| Male | 5 | 7 | 12 | 79.54 | 36 | 4344 | 259.35 | 73485 | 0 |

``` r
# Histogram of name counts across all records
ggplot(baby_names_ca, aes(x = number)) +
  geom_histogram(bins = 40) +
  labs(
    title = "Distribution of Baby Name Popularity in California, 2000-2025",
    x = "Number of babies with the name",
    y = "Frequency"
  )
```

``` r
# Boxplot of name counts by sex
ggplot(baby_names_ca, aes(x = sex, y = number)) +
  geom_boxplot() +
  labs(
    title = "Baby Name Popularity by Sex in California, 2000-2025",
    x = "Sex",
    y = "Number of babies with the name"
  )
```

``` r
# Total recorded babies per year and sex
yearly_totals <- baby_names_ca %>%
  group_by(year, sex) %>%
  summarise(total_babies = sum(number, na.rm = TRUE), .groups = "drop")

ggplot(yearly_totals, aes(x = year, y = total_babies, colour = sex)) +
  geom_line() +
  labs(
    title = "Total Babies Recorded by Sex in California, 2000-2025",
    x = "Year",
    y = "Total number of babies",
    colour = "Sex"
  )
```

Research question:
Is the average baby name count different between female and male names in California from 2000 to 2025?
Hypotheses:
\[H_0: \mu_F = \mu_M\]
\[H_A: \mu_F \ne \mu_M\]
where \(\mu_F\) and \(\mu_M\) are the mean name counts for the Female and Male groups, and:

- `number` is numerical.
- `sex` has two independent groups: Female and Male.

```
##
##  Welch Two Sample t-test
##
## data:  number by sex
## t = -25.176, df = 112769, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -29.27315 -25.04438
## sample estimates:
## mean in group Female   mean in group Male
##             52.38224             79.54101
```
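
As a sanity check on the output above, the Welch statistic can be recomputed by hand from the group means, standard deviations, and sample sizes reported in this document, using t = (mean_F - mean_M) / sqrt(s_F^2 / n_F + s_M^2 / n_M). The sketch below assumes the rounded SDs from the descriptive table (158.70 and 259.35), so it should agree with the printed values only up to rounding. All variable names are illustrative.

``` r
# Summary statistics copied from the descriptive table and the t-test output
mean_f <- 52.38224; sd_f <- 158.70; n_f <- 101362
mean_m <- 79.54101; sd_m <- 259.35; n_m <- 73485

# Squared standard error contributed by each group
se2_f <- sd_f^2 / n_f
se2_m <- sd_m^2 / n_m
se_diff <- sqrt(se2_f + se2_m)

# Welch t statistic for the difference in means (Female minus Male)
mean_diff <- mean_f - mean_m
t_stat <- mean_diff / se_diff

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (se2_f + se2_m)^2 /
  (se2_f^2 / (n_f - 1) + se2_m^2 / (n_m - 1))

# 95% confidence interval for the difference in means
ci <- mean_diff + c(-1, 1) * qt(0.975, df_welch) * se_diff

round(c(t = t_stat, df = df_welch), 2)
round(ci, 2)
```

Output in the form shown above is what `t.test(number ~ sex, data = baby_names_ca)` prints, since `t.test()` uses the unequal-variance (Welch) version unless `var.equal = TRUE` is set.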
Research question:
Is there an association between sex and popularity category?
Hypotheses:
\[H_0: \text{Sex and popularity category are independent.}\]
\[H_A: \text{Sex and popularity category are associated.}\]
`popularity_category` was created from the `number` variable using approximate terciles.

```
##
##            Low Medium  High
##   Female 31359  34895 35108
##   Male   22285  25025 26175
##
##  Pearson's Chi-squared test
##
## data:  association_table
## X-squared = 18.663, df = 2, p-value = 8.86e-05
##
##               Low   Medium     High
##   Female 31098.41 34736.72 35526.87
##   Male   22545.59 25183.28 25756.13
```
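
The test can be reproduced directly from the observed counts printed above, without access to the raw file. This is a sketch under that assumption: the object names `observed` and `chi_result` are illustrative (the original output refers to `association_table`), and the counts are copied from the first table in the output.

``` r
# Observed counts of records by sex and popularity category (from the output above)
observed <- matrix(
  c(31359, 34895, 35108,
    22285, 25025, 26175),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    sex = c("Female", "Male"),
    popularity_category = c("Low", "Medium", "High")
  )
)

# Pearson chi-squared test of independence
chi_result <- chisq.test(observed)

chi_result$statistic            # test statistic
chi_result$parameter            # degrees of freedom: (2 - 1) * (3 - 1)
round(chi_result$expected, 2)   # expected counts under independence
```

`chisq.test()` applies a continuity correction only to 2 x 2 tables, so here the statistic is the plain Pearson sum of (observed - expected)^2 / expected over the six cells.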

``` r
# Percentage of records in each popularity category, within each sex
association_df <- baby_names_ca %>%
  count(sex, popularity_category) %>%
  group_by(sex) %>%
  mutate(percent = n / sum(n) * 100)

ggplot(association_df, aes(x = popularity_category, y = percent, fill = sex)) +
  geom_col(position = "dodge") +
  labs(
    title = "Popularity Category by Sex",
    x = "Popularity category",
    y = "Percentage of records",
    fill = "Sex"
  )
```

Strengths:
Limitations: