Back when my wife and I were deciding names for our daughter, I played around with some plots of the popularity of certain baby names over the decades. I do a lot of time series stuff in research and hierarchical models where I am interested groups of curves that change in similar ways over time. Thus, I did a quick-and-curious check for baby names with similar curves.
In a cubic-polynomial growth curve model, the trend is the weighted sum of three components. We might talk about “a linear trend” or “quadratic component” of the overall growth curve.
library(tidyverse)
#> -- Attaching packages ------------------------------------------------------------------ tidyverse 1.2.1 --
#> v ggplot2 2.2.1 v purrr 0.2.4
#> v tibble 1.4.2 v dplyr 0.7.4
#> v tidyr 0.8.0 v stringr 1.3.0
#> v readr 1.1.1 v forcats 0.3.0
#> -- Conflicts --------------------------------------------------------------------- tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
library(babynames)
library(ggplot2)
polypoly::poly_plot(poly(x = 1:100, degree = 3)) +
stat_summary(aes(color = "sum"), fun.y = "sum", geom = "line", size = 1)
We could include more and more degrees to find more and more trends but I find the cubic model is a good place to start.
So if I fit this model for each name, I can get a measure of the linear, quadratic, or cubic trends of each name’s trajectory.
f_names <- babynames %>%
filter(sex == "F")
# Fit a cubic growth curve model for each name
models <- f_names %>%
nest(-name) %>%
# Require 20 years of samples
filter(map_int(data, nrow) > 20) %>%
mutate(model = purrr::map(data, . %>% lm(prop ~ poly(year, 3), .)))
# A model for each name
models
#> # A tibble: 16,622 x 3
#> name data model
#> <chr> <list> <list>
#> 1 Mary <tibble [136 x 4]> <S3: lm>
#> 2 Anna <tibble [136 x 4]> <S3: lm>
#> 3 Emma <tibble [136 x 4]> <S3: lm>
#> 4 Elizabeth <tibble [136 x 4]> <S3: lm>
#> 5 Minnie <tibble [136 x 4]> <S3: lm>
#> 6 Margaret <tibble [136 x 4]> <S3: lm>
#> 7 Ida <tibble [136 x 4]> <S3: lm>
#> 8 Alice <tibble [136 x 4]> <S3: lm>
#> 9 Bertha <tibble [136 x 4]> <S3: lm>
#> 10 Sarah <tibble [136 x 4]> <S3: lm>
#> # ... with 16,612 more rows
# Extract coefficients
coefs <- models %>%
mutate(coefs = map(model, broom::tidy)) %>%
unnest(coefs)
coefs
#> # A tibble: 66,488 x 6
#> name term estimate std.error statistic p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Mary (Intercept) 0.0332 0.000450 74.0 2.62e-109
#> 2 Mary poly(year, 3)1 -0.264 0.00524 -50.4 4.76e- 88
#> 3 Mary poly(year, 3)2 -0.0230 0.00524 -4.40 2.25e- 5
#> 4 Mary poly(year, 3)3 0.0493 0.00524 9.41 2.10e- 16
#> 5 Anna (Intercept) 0.00895 0.000107 83.5 3.99e-116
#> 6 Anna poly(year, 3)1 -0.0821 0.00125 -65.7 1.01e-102
#> 7 Anna poly(year, 3)2 0.0506 0.00125 40.5 2.38e- 76
#> 8 Anna poly(year, 3)3 -0.00409 0.00125 -3.28 1.35e- 3
#> 9 Emma (Intercept) 0.00529 0.000105 50.4 4.07e- 88
#> 10 Emma poly(year, 3)1 -0.0301 0.00122 -24.6 3.94e- 51
#> # ... with 66,478 more rows
Now, we can identify the top 10 names for each growth curve component. We’ll look at both positive and negative coefficients.
# Keep names with 10 largest coefficients for each growth curve component
positive_trends <- coefs %>%
group_by(term) %>%
top_n(10, estimate) %>%
mutate(sign = "positive") %>%
ungroup()
# Find most negative ones
negative_trends <- coefs %>%
group_by(term) %>%
top_n(10, -estimate) %>%
mutate(sign = "negative") %>%
ungroup()
# Combine with data
trends <- bind_rows(negative_trends, positive_trends)
trends_with_data <- f_names %>%
inner_join(trends)
#> Joining, by = "name"
peak_years <- trends_with_data %>%
group_by(term, name) %>%
top_n(1, prop) %>%
ungroup()
To find the spikiest names, plot the names with 10 most negative quadratic effects.
spiky <- trends_with_data %>%
filter(term == "poly(year, 3)2", sign == "negative")
ggplot(spiky) +
aes(x = year, y = prop, color = name) +
geom_line() +
hrbrthemes::scale_y_percent() +
ggrepel::geom_text_repel(
aes(label = name),
data = peak_years %>% inner_join(spiky),
size = 5,
segment.alpha = 0) +
ylab("Percent of births with name") +
labs(color = NULL) +
guides(color = "none")
#> Joining, by = c("year", "sex", "name", "n", "prop", "term", "estimate", "std.error", "statistic", "p.value", "sign")
Names with largest positive quadratic effects are bowls.
bowly <- trends_with_data %>%
filter(term == "poly(year, 3)2", sign == "positive")
ggplot(bowly) +
aes(x = year, y = prop, color = name) +
geom_line() +
hrbrthemes::scale_y_percent() +
ggrepel::geom_text_repel(
aes(label = name),
data = peak_years %>% inner_join(bowly),
size = 5,
segment.alpha = 0) +
ylab("Percent of births with name") +
labs(color = NULL) +
guides(color = "none")
#> Joining, by = c("year", "sex", "name", "n", "prop", "term", "estimate", "std.error", "statistic", "p.value", "sign")
Names with largest cubic effects fit names that fall at the beginning and end of the dataset.
this_trend <- trends_with_data %>%
filter(term == "poly(year, 3)3")
ggplot(this_trend) +
aes(x = year, y = prop, color = name) +
geom_line() +
hrbrthemes::scale_y_percent() +
ggrepel::geom_text_repel(
aes(label = name),
data = peak_years %>% inner_join(this_trend),
size = 5,
segment.alpha = 0) +
ylab("Percent of births with name") +
labs(color = NULL) +
guides(color = "none") +
facet_wrap("sign")
#> Joining, by = c("year", "sex", "name", "n", "prop", "term", "estimate", "std.error", "statistic", "p.value", "sign")
The intercept is the average growth curve value, or the area under the curve. In the positive cases, these are names that were the most popular on average, either by being very trendy at some point or being consistently moderately popular.
this_trend <- trends_with_data %>%
filter(term == "(Intercept)")
ggplot(this_trend) +
aes(x = year, y = prop, color = name) +
geom_line() +
hrbrthemes::scale_y_percent() +
ggrepel::geom_text_repel(
aes(label = name),
data = peak_years %>% inner_join(this_trend),
size = 5,
segment.alpha = 0) +
ylab("Percent of births with name") +
labs(color = NULL) +
guides(color = "none") +
facet_wrap("sign")
#> Joining, by = c("year", "sex", "name", "n", "prop", "term", "estimate", "std.error", "statistic", "p.value", "sign")
This growth curve component is picking up on names that spiked recently or distantly.
this_trend <- trends_with_data %>%
filter(term == "poly(year, 3)1")
ggplot(this_trend) +
aes(x = year, y = prop, color = name) +
geom_line() +
hrbrthemes::scale_y_percent() +
ggrepel::geom_text_repel(
aes(label = name),
data = peak_years %>% inner_join(this_trend),
size = 5,
segment.alpha = 0) +
ylab("Percent of births with name") +
labs(color = NULL) +
guides(color = "none") +
facet_wrap("sign")
#> Joining, by = c("year", "sex", "name", "n", "prop", "term", "estimate", "std.error", "statistic", "p.value", "sign")