Demo of growth curve analysis on babynames

Back when my wife and I were deciding names for our daughter, I played around with some plots of the popularity of certain baby names over the decades. I do a lot of time series stuff in research and hierarchical models where I am interested groups of curves that change in similar ways over time. Thus, I did a quick-and-curious check for baby names with similar curves.

The polynomial growth cruve

In a cubic-polynomial growth curve model, the trend is the weighted sum of three components. We might talk about “a linear trend” or “quadratic component” of the overall growth curve.

library(tidyverse)
#> -- Attaching packages ------------------------------------------------------------------ tidyverse 1.2.1 --
#> v ggplot2 2.2.1     v purrr   0.2.4
#> v tibble  1.4.2     v dplyr   0.7.4
#> v tidyr   0.8.0     v stringr 1.3.0
#> v readr   1.1.1     v forcats 0.3.0
#> -- Conflicts --------------------------------------------------------------------- tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()
library(babynames)
library(ggplot2)

polypoly::poly_plot(poly(x = 1:100, degree = 3)) + 
  stat_summary(aes(color = "sum"), fun.y = "sum", geom = "line", size = 1)

We could include more and more degrees to find more and more trends but I find the cubic model is a good place to start.

So if I fit this model for each name, I can get a measure of the linear, quadratic, or cubic trends of each name’s trajectory.

f_names <- babynames %>%
  filter(sex == "F")

# Fit a cubic growth curve model for each name
models <- f_names %>%
  nest(-name) %>%
  # Require 20 years of samples
  filter(map_int(data, nrow) > 20) %>%
  mutate(model = purrr::map(data, . %>% lm(prop ~ poly(year, 3), .)))

# A model for each name
models
#> # A tibble: 16,622 x 3
#>    name      data               model   
#>    <chr>     <list>             <list>  
#>  1 Mary      <tibble [136 x 4]> <S3: lm>
#>  2 Anna      <tibble [136 x 4]> <S3: lm>
#>  3 Emma      <tibble [136 x 4]> <S3: lm>
#>  4 Elizabeth <tibble [136 x 4]> <S3: lm>
#>  5 Minnie    <tibble [136 x 4]> <S3: lm>
#>  6 Margaret  <tibble [136 x 4]> <S3: lm>
#>  7 Ida       <tibble [136 x 4]> <S3: lm>
#>  8 Alice     <tibble [136 x 4]> <S3: lm>
#>  9 Bertha    <tibble [136 x 4]> <S3: lm>
#> 10 Sarah     <tibble [136 x 4]> <S3: lm>
#> # ... with 16,612 more rows

# Extract coefficients
coefs <- models %>%
  mutate(coefs = map(model, broom::tidy)) %>%
  unnest(coefs)
coefs
#> # A tibble: 66,488 x 6
#>    name  term           estimate std.error statistic   p.value
#>    <chr> <chr>             <dbl>     <dbl>     <dbl>     <dbl>
#>  1 Mary  (Intercept)     0.0332   0.000450     74.0  2.62e-109
#>  2 Mary  poly(year, 3)1 -0.264    0.00524     -50.4  4.76e- 88
#>  3 Mary  poly(year, 3)2 -0.0230   0.00524      -4.40 2.25e-  5
#>  4 Mary  poly(year, 3)3  0.0493   0.00524       9.41 2.10e- 16
#>  5 Anna  (Intercept)     0.00895  0.000107     83.5  3.99e-116
#>  6 Anna  poly(year, 3)1 -0.0821   0.00125     -65.7  1.01e-102
#>  7 Anna  poly(year, 3)2  0.0506   0.00125      40.5  2.38e- 76
#>  8 Anna  poly(year, 3)3 -0.00409  0.00125      -3.28 1.35e-  3
#>  9 Emma  (Intercept)     0.00529  0.000105     50.4  4.07e- 88
#> 10 Emma  poly(year, 3)1 -0.0301   0.00122     -24.6  3.94e- 51
#> # ... with 66,478 more rows

Spiky names

Now, we can identify the top 10 names for each growth curve component. We’ll look at both positive and negative coefficients.

# Keep names with 10 largest coefficients for each growth curve component
positive_trends <- coefs %>% 
  group_by(term) %>% 
  top_n(10, estimate) %>% 
  mutate(sign = "positive") %>% 
  ungroup()

# Find most negative ones
negative_trends <- coefs %>% 
  group_by(term) %>% 
  top_n(10, -estimate) %>% 
  mutate(sign = "negative") %>% 
  ungroup()

# Combine with data
trends <- bind_rows(negative_trends, positive_trends)
trends_with_data <- f_names %>%
  inner_join(trends)
#> Joining, by = "name"

peak_years <- trends_with_data %>%
  group_by(term, name) %>%
  top_n(1, prop) %>%
  ungroup()

To find the spikiest names, plot the names with 10 most negative quadratic effects.

spiky <- trends_with_data %>% 
  filter(term == "poly(year, 3)2", sign == "negative") 
  
ggplot(spiky) +
  aes(x = year, y = prop, color = name) +
  geom_line() +
  hrbrthemes::scale_y_percent() +
  ggrepel::geom_text_repel(
    aes(label = name), 
    data = peak_years %>% inner_join(spiky), 
    size = 5,
    segment.alpha = 0) +
  ylab("Percent of births with name") +
  labs(color = NULL) +
  guides(color = "none")
#> Joining, by = c("year", "sex", "name", "n", "prop", "term", "estimate", "std.error", "statistic", "p.value", "sign")

Bowly names

Names with largest positive quadratic effects are bowls.

bowly <- trends_with_data %>% 
  filter(term == "poly(year, 3)2", sign == "positive") 
  
ggplot(bowly) +
  aes(x = year, y = prop, color = name) +
  geom_line() +
  hrbrthemes::scale_y_percent() +
  ggrepel::geom_text_repel(
    aes(label = name), 
    data = peak_years %>% inner_join(bowly), 
    size = 5,
    segment.alpha = 0) +
  ylab("Percent of births with name") +
  labs(color = NULL) +
  guides(color = "none")
#> Joining, by = c("year", "sex", "name", "n", "prop", "term", "estimate", "std.error", "statistic", "p.value", "sign")

The cubic trends finds early or late bumps

Names with largest cubic effects fit names that fall at the beginning and end of the dataset.

this_trend <- trends_with_data %>% 
  filter(term == "poly(year, 3)3") 
  
ggplot(this_trend) +
  aes(x = year, y = prop, color = name) +
  geom_line() +
  hrbrthemes::scale_y_percent() +
  ggrepel::geom_text_repel(
    aes(label = name), 
    data = peak_years %>% inner_join(this_trend), 
    size = 5,
    segment.alpha = 0) +
  ylab("Percent of births with name") +
  labs(color = NULL) +
  guides(color = "none") + 
  facet_wrap("sign")
#> Joining, by = c("year", "sex", "name", "n", "prop", "term", "estimate", "std.error", "statistic", "p.value", "sign")

The intercept fits area under the curve

The intercept is the average growth curve value, or the area under the curve. In the positive cases, these are names that were the most popular on average, either by being very trendy at some point or being consistently moderately popular.

this_trend <- trends_with_data %>% 
  filter(term == "(Intercept)") 
  
ggplot(this_trend) +
  aes(x = year, y = prop, color = name) +
  geom_line() +
  hrbrthemes::scale_y_percent() +
  ggrepel::geom_text_repel(
    aes(label = name), 
    data = peak_years %>% inner_join(this_trend), 
    size = 5,
    segment.alpha = 0) +
  ylab("Percent of births with name") +
  labs(color = NULL) +
  guides(color = "none") + 
  facet_wrap("sign")
#> Joining, by = c("year", "sex", "name", "n", "prop", "term", "estimate", "std.error", "statistic", "p.value", "sign")

The linear trend is kind of weird

This growth curve component is picking up on names that spiked recently or distantly.

this_trend <- trends_with_data %>% 
  filter(term == "poly(year, 3)1") 
  
ggplot(this_trend) +
  aes(x = year, y = prop, color = name) +
  geom_line() +
  hrbrthemes::scale_y_percent() +
  ggrepel::geom_text_repel(
    aes(label = name), 
    data = peak_years %>% inner_join(this_trend), 
    size = 5,
    segment.alpha = 0) +
  ylab("Percent of births with name") +
  labs(color = NULL) +
  guides(color = "none") + 
  facet_wrap("sign")
#> Joining, by = c("year", "sex", "name", "n", "prop", "term", "estimate", "std.error", "statistic", "p.value", "sign")