Introduction

I’d like to create a function that generates made-up first names. I plan to use the babynames package to train my function (so the resulting function will be biased towards generating names that look and sound like the names of people born in the United States). Such a function should satisfy various requirements; for now I’ll focus on two: the generated names should look and sound like real first names, and at least some of them should be novel, in the sense of not simply being copies of names in the training data.

A first attempt

I’ve already ruled out generating random strings, because they won’t look or sound like first names:

dumb <- function(k) {
    name_lengths <- sample(3:10, k, replace = TRUE)
    make_name <- function(len) {
        paste(sample(letters, len, replace = TRUE), collapse = "")
    }
    vapply(name_lengths, make_name, FUN.VALUE = character(1))
}

# these don't seem like first names
set.seed(57575)
dumb(20)
##  [1] "wgqgymxo"   "qkdljdoai"  "saeoufm"    "girrs"      "icnxxwh"   
##  [6] "uhnoig"     "lasefmxkh"  "krup"       "hdvoixokf"  "qgb"       
## [11] "qkzq"       "xfskscluw"  "fbph"       "qef"        "qyfrqvdbe" 
## [16] "uwtivwnwoj" "pgrveugmqy" "kiqlnik"    "uhrh"       "civsl"

Perhaps I could improve a bit by having the letters of the names I generate match the overall distribution of letters in real names. I start out by summarizing the babynames data:

library(magrittr)
library(dplyr)
library(babynames)
library(stringr)

name_data <- babynames %>%
    mutate(name = str_to_lower(name)) %>%
    group_by(sex, name) %>%
    summarise(n = sum(n)) %>%
    ungroup
name_data
## # A tibble: 104,110 × 3
##      sex      name     n
##    <chr>     <chr> <int>
## 1      F     aabha    21
## 2      F aabriella    10
## 3      F     aaden     5
## 4      F   aadhira    29
## 5      F    aadhya   639
## 6      F      aadi    16
## 7      F   aadison    11
## 8      F   aaditri    26
## 9      F   aadrika    10
## 10     F   aadvika    13
## # ... with 104,100 more rows

Now I calculate the frequency of each letter in the overall data, and use those frequencies to inform the name generation.

library(dplyr)
library(babynames)
library(tidyr)

character_frequencies <- name_data %>%
    # i'm ignoring sex for now:
    group_by(name) %>% summarise(n = sum(n)) %>%
    mutate(character = str_split(name, "")) %>%
    unnest %>%
    group_by(character) %>%
    summarise(frequency = sum(as.numeric(n))) %>%
    mutate(frequency = frequency / sum(frequency))

# in case you were curious:
library(ggplot2)
ggplot(character_frequencies, 
       aes(reorder(character, -frequency), frequency)) + 
    geom_bar(stat = "identity")

slightly_less_dumb <- function(k) {
    name_lengths <- sample(3:10, k, replace = TRUE)
    make_name <- function(len) {
        paste(
            sample(
                character_frequencies$character, 
                len, 
                replace = TRUE,
                prob = character_frequencies$frequency), collapse = "")
    }
    vapply(name_lengths, make_name, FUN.VALUE = character(1))
}

set.seed(89797)
slightly_less_dumb(20)
##  [1] "rdndrmher" "nnolaei"   "btawol"    "saamb"     "onstaiah" 
##  [6] "pfre"      "roalea"    "tboeal"    "iaa"       "ncr"      
## [11] "cgnarlye"  "brm"       "riaitsnea" "pnrjnn"    "srescrn"  
## [16] "dly"       "tdaledra"  "ecnls"     "snl"       "ehdjnree"

OK, this is already looking better than before, but there are still sequences of characters that would never appear in a real name. The slightly_less_dumb function still does not consider context: the probability of generating an “a”, for example, is always the same, even if the previous two characters were already “a”.

Considering context

So the next improvement is to have a model that generates not only more realistic sets of characters, but more realistic sequences of characters. I’ll do this by creating a function that iteratively builds a name, one character at a time, at each step considering the letters that have already appeared. Specifically, for some value n, I’ll create a function that can generate the next character in a name, given the last n characters.

In order to do that, I’ll first have to be able to calculate the conditional probability of a character given the past n characters. To avoid the problem that naturally arises at the start of a name (when there are no prior characters to look at), I’ll pad each name in the training data with the appropriate number of spaces. Similarly, I’ll append an explicit stop character “#” so that my model can also learn when a name is complete. My code below uses the term “n-gram” (usually a sequence of n words in a text) to refer instead to a string of n characters. Hopefully this doesn’t cause confusion.
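
To make the padding concrete, here is a quick illustration with a hypothetical name “ann” and a context of 3 characters (tokenize_character_shingles comes from the tokenizers package, which I also load below):

library(tokenizers)

# the padded name "   ann#" yields one 4-gram for every character the model
# would have to generate, each pairing a 3-character context with the
# character that follows it; the result should be "   a", "  an", " ann", "ann#"
tokenize_character_shingles("   ann#", n = 4, strip_non_alphanum = FALSE)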

library(purrr)
library(tokenizers)

conditional_distribution <- function(context_size = 3L) {
    param_n <- context_size + 1
    pad <- paste(rep(" ", context_size), collapse = "")
    
    nm <- name_data %>% mutate(name = paste0(pad, name, "#"))
    
    # the n-grams are the "context" along with the following letter
    ngrams <- nm %>%
        mutate(ngram = tokenize_character_shingles(
            name, n = param_n, strip_non_alphanum = FALSE
        )) %>% unnest %>%
        group_by(sex, ngram) %>%
        summarise(cnt = sum(n))
    
    # finally, i calculate the distribution over the last letter,
    # given the context_size letters that came before
    ngrams %>%
        mutate(prev = str_sub(ngram, 1, context_size),
               nxt = str_sub(ngram, param_n, param_n)) %>%
        group_by(sex, prev, nxt) %>%
        summarise(joint_count = sum(cnt)) %>%
        mutate(pct = joint_count / sum(joint_count)) %>%
        ungroup %>%
        select(-joint_count)
}

I’m now able to calculate the conditional distribution of the next character, given the previous n characters. For instance:

conditional_3 <- conditional_distribution(3L)
## if the letters "joa" appear in a female name:
conditional_3 %>% filter(sex == "F", prev == "joa")
## # A tibble: 6 × 4
##     sex  prev   nxt          pct
##   <chr> <chr> <chr>        <dbl>
## 1     F   joa     # 1.276488e-04
## 2     F   joa     h 1.660473e-05
## 3     F   joa     k 1.349134e-05
## 4     F   joa     l 8.509923e-05
## 5     F   joa     n 9.992642e-01
## 6     F   joa     q 4.929528e-04
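
As a side benefit of the space padding, the distribution over a name’s first character needs no special treatment; it is simply the conditional distribution given the all-space context (output omitted):

# how likely is each letter to start a female name?
conditional_3 %>% filter(sex == "F", prev == "   ")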

Quick optimization

I’m not going to focus too much on speed here, since I’m more interested in prototyping a variety of solutions, but there is one easy and obvious optimization. Given that my model will be building names character by character, a clear bottleneck will be how long it takes to look up the appropriate distribution, given the characters that have already appeared. While dplyr::filter is plenty fast for many purposes, I can speed up the lookup by storing the conditional distributions in a hash table (an R environment, whose name lookups are hashed) keyed on the previous n characters:

optimize_conditional_distribution <- function(conditional) {
    conditional %>%
        split(.$sex) %>%
        map(~split(., .$prev)) %>%
        map(list2env)
}

Let’s estimate what I’ve saved:

conditional_3_optimized <- optimize_conditional_distribution(conditional_3)
library(microbenchmark)
microbenchmark(
    regular = filter(conditional_3, sex == "F", prev == "joa"),
    optimized = conditional_3_optimized[["F"]][["joa"]]
)
## Unit: nanoseconds
##       expr     min      lq       mean  median      uq     max neval cld
##    regular 1054491 1194000 1364394.69 1287513 1410486 2536632   100   b
##  optimized       0     380    2166.92    1901    2661   49417   100  a
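
In other words, the environment lookup is roughly 700 times faster in the median case (about 1.3 ms down to about 2 µs), a saving that compounds because every generated character will require one of these lookups.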

A more intelligent generator

I now have the pieces to build a name-generator that understands a bit of context. First I’ll create a helper function that returns the appropriate distribution given a conditional distribution data structure along with sex and the past n characters:

next_letter_dist <- function(conditional, past, sex) {
    df <- conditional[[sex]][[past]]
    # if this context never appeared in the training data, fall back to a
    # zero-probability stop character; that way the fallback contributes
    # nothing when several distributions are combined via interpolation later
    if (is.null(df))
        return(data_frame(nxt = "#", pct = 0))
    dplyr::select(df, nxt, pct)
}
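
As a quick sanity check (using the conditional_3_optimized structure from the previous section), this should reproduce the “joa” distribution shown earlier, with just the nxt and pct columns (output omitted):

next_letter_dist(conditional_3_optimized, "joa", "F")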

This generator function produces name generators that consider a given amount of context:

generator <- function(context_size) {
    
    # calculate the conditional distributions
    cond <- conditional_distribution(context_size)
    cond <- optimize_conditional_distribution(cond)
    
    function(sex = NULL, starter = NULL) {
        if (is.null(starter))
            starter <- paste(rep(" ", context_size), collapse = "")
        if (is.null(sex)) sex <- sample(c("M", "F"), 1L)
        
        # generate a character
        name_so_far <- starter
        temp <- next_letter_dist(cond, str_sub(name_so_far, -context_size), sex)
        next_char <- sample(temp$nxt, size = 1L, prob = temp$pct)

        # continue generating the next character, until the "#" is generated
        while (next_char != "#") {
            name_so_far <- paste0(name_so_far, next_char)
            temp <- next_letter_dist(cond, str_sub(name_so_far, -context_size), sex)
            next_char <- sample(temp$nxt, size = 1L, prob = temp$pct)
        }
        str_trim(name_so_far)
    }
}

Generator results

gen3 <- generator(3L)
set.seed(78737)
gen3_sample <- replicate(25, gen3())
gen3_sample
##  [1] "dylan"    "david"    "michard"  "jim"      "vernice"  "racie"   
##  [7] "joyce"    "steven"   "ron"      "richael"  "wilfred"  "charrell"
## [13] "davis"    "irenda"   "sheidi"   "stashan"  "kate"     "theryl"  
## [19] "jessie"   "laurene"  "jose"     "dana"     "ruby"     "eri"     
## [25] "joanes"

This is clearly the strongest effort so far. The names look real, and yet some of them are genuinely novel, in the sense that they do not appear in the original babynames dataset:

data_frame(name = gen3_sample) %>%
    anti_join(name_data, by = "name")
## # A tibble: 6 × 1
##      name
##     <chr>
## 1 michard
## 2  irenda
## 3  sheidi
## 4 stashan
## 5  theryl
## 6  joanes

There are still a few problems, though. First off, how do I know what context_size to use? We can think of the slightly_less_dumb generator as a special case where the context_size is 0. As we saw, that wasn’t enough to capture the real structure in names: we ended up with names such as “rdndrmher” that contain unlikely (and unpronounceable) sequences of letters. As the context_size grows, though, we start to encounter very sparse distributions. For instance, we looked earlier at the conditional distribution of the next character given that the sex is “F” and the prior three characters are “joa”. Notice there aren’t many possibilities:

conditional_3_optimized[["F"]][["joa"]]
## # A tibble: 6 × 4
##     sex  prev   nxt          pct
##   <chr> <chr> <chr>        <dbl>
## 1     F   joa     # 1.276488e-04
## 2     F   joa     h 1.660473e-05
## 3     F   joa     k 1.349134e-05
## 4     F   joa     l 8.509923e-05
## 5     F   joa     n 9.992642e-01
## 6     F   joa     q 4.929528e-04

As a result, we can never see reasonable fake names such as “joasie” or “joamma”:

set.seed(87656)
replicate(100, gen3(sex = "F", "joa")) %>% unique
##  [1] "joannet"        "joan"           "joanna"         "joann"         
##  [5] "joannette"      "joannon"        "joanne"         "joanicole"     
##  [9] "joannifer"      "joannamantella" "joanette"       "joannabeth"    
## [13] "joannie"        "joannah"        "joannedy"       "joanniseles"

One way to deal with this would be to ensure that the distribution of the next character assigns some small probability to every character in the alphabet. We could do this, for instance, by subtracting a tiny amount of probability from the amounts we’ve calculated and dividing it among all the characters that don’t appear in the conditional distribution.
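
Here is a minimal sketch of that idea; the smooth_distribution helper and the epsilon value are my own inventions, not something used elsewhere in this post. It discounts the observed probabilities by a small factor and spreads the reclaimed mass evenly over the characters that never followed the given context:

smooth_distribution <- function(dist, epsilon = 0.01) {
    # "dist" is a data frame of nxt/pct pairs, as returned by next_letter_dist
    alphabet <- c(letters, "#")
    unseen <- setdiff(alphabet, dist$nxt)
    if (length(unseen) == 0) return(dist)
    bind_rows(
        # shrink the observed probabilities so they sum to 1 - epsilon ...
        mutate(dist, pct = pct * (1 - epsilon)),
        # ... and divide the reclaimed epsilon evenly among unseen characters
        data_frame(nxt = unseen, pct = epsilon / length(unseen))
    )
}

# now "joasie" is possible, if unlikely (output omitted):
smooth_distribution(next_letter_dist(conditional_3_optimized, "joa", "F"))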

But there is another way: create several models, each based on a different amount of context, that each produce a distribution over the next letter, and then combine those distributions via interpolation. The models that use a lot of context push the overall result towards the names in the training data, while the models that use less context can provide more “surprises”.
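
To make the arithmetic concrete: if, for some context, a 1-gram model assigns the next letter “a” a probability of 0.10 while a 3-gram model assigns it 0.90, then weights of 0.2 and 0.8 yield an interpolated probability of 0.2 × 0.10 + 0.8 × 0.90 = 0.74 (these numbers are invented purely for illustration).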

Here’s code to create a single name generator that interpolates among several generators that each utilize a different amount of context:

# the "weights" argument allows us to vary the weighting of various models
interpolated_generator <- function(context_size, weights) {
    stopifnot(length(context_size) == length(weights))
    # all.equal avoids spurious failures from floating-point rounding
    stopifnot(isTRUE(all.equal(sum(weights), 1)))
    
    # calculate the conditional distributions
    cond <- map(context_size, conditional_distribution)
    cond <- map(cond, optimize_conditional_distribution)
    
    function(sex = NULL, starter = NULL) {
        if (is.null(starter))
            starter <- paste(rep(" ", max(context_size)), collapse = "")
        if (is.null(sex)) sex <- sample(c("M", "F"), 1L)
        
        # interpolate function takes what we have of the name, 
        # and generates the distribution of the next character
        interpolate <- function(name_so_far) {
            combine_distributions <- function(dist1, dist2) {
                dist1 %>% 
                    full_join(dist2, by = "nxt") %>%
                    # fill in NA probs with 0, so that if one of them is non-0, 
                    # i can still add the two together
                    replace_na(list(pct.x = 0, pct.y = 0)) %>%
                    transmute(nxt, pct = pct.x + pct.y)
            }
            
            # i run the next_letter_dist function on each distribution
            temp <- map2(.x = cond, .y = str_sub(name_so_far, -context_size), 
                         ~next_letter_dist(.x, .y, sex))
            # i then adjust the resulting probs by 
            # multiplying by the appropriate weight
            temp <- map2(temp, weights, ~mutate(.x, pct = pct * .y))
            # finally add the resulting probabilities together
            reduce(temp, combine_distributions)
        }
        
        # generate a character
        name_so_far <- starter
        temp <- interpolate(name_so_far)
        next_char <- sample(temp$nxt, size = 1L, prob = temp$pct)

        # continue generating the next character, until the "#" is generated
        while (next_char != "#") {
            name_so_far <- paste0(name_so_far, next_char)
            temp <- interpolate(name_so_far)
            next_char <- sample(temp$nxt, size = 1L, prob = temp$pct)
        }
        str_trim(name_so_far)
    }
}

Playing with name generators

Let’s make up some names! I’ll create multiple models, each utilizing differing amounts of context:

geni_1 <- interpolated_generator(c(1, 2), c(.5, .5))
geni_2 <- interpolated_generator(c(1, 2, 3), c(.2, .2, .6))
geni_3 <- interpolated_generator(c(1, 2, 3), c(.05, .2, .75))
geni_4 <- interpolated_generator(c(2, 3, 4), c(.1, .45, .45))

Let’s look at what sorts of names they produce:

set.seed(575790)
replicate(10, geni_1()) %>% unique
##  [1] "d"              "kettanda"       "wald"           "carlare"       
##  [5] "phie"           "ale"            "ca"             "williantertlen"
##  [9] "drer"           "reyelacia"
replicate(25, geni_2()) %>% unique
##  [1] "ricialomber"    "sa"             "lizabelliann"   "eth"           
##  [5] "john"           "debe"           "lare"           "hordianorah"   
##  [9] "pa"             "alph"           "ke"             "ges"           
## [13] "er"             "nor"            "hael"           "suisha"        
## [17] "jesselld"       "ely"            "ber"            "zeben"         
## [21] "jan"            "margannemillie" "kath"           "elle"
replicate(25, geni_3()) %>% unique
##  [1] "chris"       "kiari"       "frandon"     "leo"         "shia"       
##  [6] "grell"       "dam"         "mativan"     "hellia"      "madisue"    
## [11] "callisa"     "pamey"       "tammie"      "virgaren"    "meline"     
## [16] "edwardo"     "line"        "bren"        "dossie"      "elsymie"    
## [21] "rrie"        "makelliette" "heath"       "fery"        "holynth"
replicate(25, geni_4()) %>% unique
##  [1] "lynna"     "romy"      "david"     "devon"     "sarah"    
##  [6] "willie"    "clan"      "teritty"   "william"   "barry"    
## [11] "chael"     "raymond"   "carson"    "samuel"    "antonie"  
## [16] "alin"      "stellie"   "brina"     "mackenzie" "jack"     
## [21] "tyler"     "faye"      "manuel"    "louise"    "and"

Next steps

My name generator model now has a number of parameters. I can specify which context sizes to consider, as well as how to weight them. If I had some way to evaluate the quality of the generated names, I could treat this as an optimization problem and find the set of parameters that result in the highest quality.

One possible approach is to notice that, with a little extra effort, the code I’ve written can also be used to compute the likelihood of a given name or set of names. For instance, here is a function that uses the conditional_3_optimized data we created earlier to calculate the log-likelihood of particular names. It treats each character in a name as depending only on the prior 3 characters, and treats different names as independent of each other:

calc_likelihood <- function(nm, sex) {
    nm <- paste0("   ", nm, "#")
    data_frame(
        nm, sex,
        chars = tokenize_character_shingles(nm, 4L, strip_non_alphanum = FALSE)
    ) %>% 
        unnest %>%
        mutate(
            prev = str_sub(chars, 1, 3),
            nxt = str_sub(chars, -1)
        ) %>%
        mutate(
            prob = map2(prev, sex, 
                        ~next_letter_dist(conditional_3_optimized, .x, .y))) %>%
        mutate(prob = map2(prob, nxt, ~filter(.x, nxt == .y)),
               prob = unlist(map(prob, ~.$pct))) %>%
        mutate(prob = log(prob)) %>%
        summarise(s = sum(prob)) %>%
        pull(s)
}

Now we are able to compare the log-likelihoods of different names:

calc_likelihood("tarak", sex = "M")
## [1] -17.95987
calc_likelihood("brad", sex = "M")
## [1] -7.403383
calc_likelihood("brad", sex = "F")
## [1] -15.13948
# we are also able to calculate the log-likelihood of a number of names:
calc_likelihood(c("john", "joanna", "rodney"), sex = c("M", "F", "M"))
## [1] -17.22259
calc_likelihood(c("josiah", "jessandria", "rooney"), sex = c("M", "F", "M"))
## [1] -37.25118

It’s possible (likely) for this function to encounter tokens in the validation data that never appear in the training data, so we’d have to make some adjustments to give non-zero likelihood to unseen tokens. With that, we could then experiment with different interpolated name generators, and choose the one that maximizes the likelihood of the validation data.
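
As a hypothetical sketch of that last step (the calc_likelihood_interpolated function and the validation data frame below don’t exist; they stand in for a smoothed, interpolated analogue of calc_likelihood and for a held-out slice of the name data):

# score several candidate weightings on held-out names, then keep the
# weighting that makes the validation names most likely
candidate_weights <- list(c(.2, .2, .6), c(.05, .2, .75), c(1, 1, 1) / 3)
scores <- map_dbl(candidate_weights, function(w) {
    calc_likelihood_interpolated(
        validation$name, validation$sex,
        context_size = c(1, 2, 3), weights = w
    )
})
candidate_weights[[which.max(scores)]]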