Getting and Cleaning Name Data from the U.S. Census Bureau and SSA for More (A)merican Random Sampling

by Austin Routt

Have you ever wanted to be someone else? God knows I have, my friend, but even the first step, creating a new name at random, always seems to be a hassle for me. Whether it's a lack of time or the fact that the first and last name combinations I come up with just aren't believable, I cannot seem to create new personas at a rate that meets my needs. To make random American name generation fast and believable, I have taken and processed data from the U.S. Census Bureau and the Social Security Administration.

What follows is a brief description of the malefirstnames.csv, femalefirstnames.csv, and surnames.csv data files: what they are, how they were derived, and how one might use them. The primary purpose of this data set is to support a random name generator. Unlike most random name generators, it lets one take into account the weighted distribution of American first and last names when sampling at random by gender. Since the first name data contains probabilities based on the occurrences of U.S. baby names from 1880 to 2013, and the surname data has frequencies based on the 2000 census, one need only reference these probabilities when taking a sample, e.g. sample(male$name, size = 1, prob = male$probability). Although easing the implementation of fast and believable millennium-era American name generation is its primary aim, feel free to reuse this data in any way you see fit.

First Name Data

The files malefirstnames.csv and femalefirstnames.csv contain lists of male and female first names, respectively. They also hold the frequency of each name's occurrence within the national U.S. population for people born from 1880 to 2013. The frequency data is based on the National Data on baby names provided by the Social Security Administration.

Here is a list of the top 10 male and female names:

##       Male    Freq    Prob    Female    Freq     Prob
## 1    James 5091189 0.03147      Mary 4046787 0.025708
## 2     John 5073958 0.03136 Elizabeth 1591439 0.010110
## 3   Robert 4789776 0.02961  Patricia 1570123 0.009975
## 4  Michael 4292994 0.02653  Jennifer 1461136 0.009282
## 5  William 4038447 0.02496     Linda 1449996 0.009211
## 6    David 3562957 0.02202   Barbara 1432413 0.009100
## 7   Joseph 2552666 0.01578  Margaret 1234912 0.007845
## 8  Richard 2551558 0.01577     Susan 1116573 0.007093
## 9  Charles 2345723 0.01450   Dorothy 1105005 0.007020
## 10  Thomas 2275889 0.01407     Sarah 1054061 0.006696

Last Name Data

The file surnames.csv contains a list of last names, along with their probability of occurrence within the United States circa 2000. These were taken from the Census 2000 data on surnames occurring 100 or more times.

Here is a list of the top 10 last names:

##    Last Names    Freq     Prob
## 1       Smith 2376206 0.009814
## 2     Johnson 1857160 0.007670
## 3    Williams 1534042 0.006336
## 4       Brown 1380145 0.005700
## 5       Jones 1362755 0.005628
## 6      Miller 1127803 0.004658
## 7       Davis 1072335 0.004429
## 8      Garcia  858289 0.003545
## 9   Rodriguez  804240 0.003322
## 10     Wilson  783051 0.003234

Data Processing

First, the original data sets were obtained and unzipped into a working directory via the following code blocks:

##Step 1a Fetch Social Security National Baby Name data, if not present, and unzip in the user's working directory


myzip1 <- "names.zip"
#if data file has not yet been downloaded, fetch it
if (!file.exists(myzip1)) {
    download.file("http://www.ssa.gov/oact/babynames/names.zip",
                  destfile = myzip1, method = "curl")
    unzip(myzip1)
}

##Step 1b Fetch Census 2000 data for surnames occurring 100 or more times, if not present, and unzip in the user's working directory


myzip2 <- "surnames.zip"
#if data file has not yet been downloaded, fetch it
if (!file.exists(myzip2)) {
    download.file("https://www.census.gov/genealogy/www/data/2000surnames/names.zip",
                  destfile = myzip2, method = "curl")
    unzip(myzip2)
}

Next, the R environment was checked to see if the required data sets were already present in memory. Since they were not, they were promptly read into memory. Note that the first name data, obtained from the Social Security Administration, is split into multiple text files, one per year.

##Step 2a, if baby name data isn't already available in memory read in all files

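if(!(exists("datalist"))){

    #list.files() expects a regular expression, so match a literal ".txt" at the end of the file name
    filenames <- list.files(".", pattern = "\\.txt$", full.names = TRUE)
    #each yearly file has three unnamed columns: name (V1), sex (V2), count (V3)
    datalist <- lapply(filenames, read.table, sep = ",", fill = TRUE, header = FALSE, stringsAsFactors = FALSE)

}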

##Step 2b, if surname data isn't already available in memory read in the file

if(!(exists("surname"))){

    surname <- read.csv("app_c.csv", stringsAsFactors = FALSE)

}

Since the baby name data is held as a list of data frames, datalist, with one data frame per year, these are merged into one and then aggregated to give the total occurrence of each unique name for each gender. Following the aggregation, the data was divided by gender into two data frames: male and female. Both were then reordered based on each name's total frequency of occurrence.

##Step 3a Merge all name data frames into 1 

merge.all <- function(x, y){
    merge(x, y, all=TRUE) }

out <- Reduce(merge.all, datalist)

##Step4a aggregate all names by adding the sum occurrence of each

h <- aggregate(V3 ~ V1 + V2, data = out, FUN = sum)


##Step5a, separate by gender and reorder from most frequent to least
female <- h[h$V2 == "F", ]
male <- h[h$V2 == "M", ]

#keep only the name (column 1) and total count (column 3)
male <- male[order(male$V3, decreasing = TRUE), c(1,3) ]
female <- female[order(female$V3, decreasing = TRUE), c(1,3) ]
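
One caveat about Step 3a is worth sketching. Called without a `by` argument, merge() joins on every column the two data frames share, which here is all three (V1, V2, V3). A row that appears identically in two yearly files (same name, sex, and count) is therefore matched and kept once rather than twice. The toy data frames below are invented for illustration and are not SSA data; stacking with rbind() is shown only as one possible alternative, not as the method used to build the published files.

```r
# Two hypothetical "years" that happen to share one identical row (Ann, F, 5).
y1 <- data.frame(V1 = c("Ann", "Bob"), V2 = c("F", "M"), V3 = c(5L, 7L),
                 stringsAsFactors = FALSE)
y2 <- data.frame(V1 = c("Ann", "Bob"), V2 = c("F", "M"), V3 = c(5L, 9L),
                 stringsAsFactors = FALSE)
toy <- list(y1, y2)

# Full outer join on all shared columns: the duplicated Ann row collapses.
merged <- Reduce(function(x, y) merge(x, y, all = TRUE), toy)

# Row-wise stacking keeps every row from every year.
stacked <- do.call(rbind, toy)

# Total count per name and sex under each approach.
merged_tot  <- aggregate(V3 ~ V1 + V2, data = merged,  FUN = sum)
stacked_tot <- aggregate(V3 ~ V1 + V2, data = stacked, FUN = sum)

merged_tot
stacked_tot
```

In this toy case the two totals differ only for the name whose rows were identical across years, so any effect on real data would depend on how often a name's yearly count repeats exactly.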

For my current purposes, most of the 11 original variables in the surname data set, although interesting, are unnecessary. In the future I may find a use for the racial probability data, but lacking the same information for first names makes it extraneous at this point. Thus, using R's subsetting capabilities, only the names and their corresponding frequencies were retained.

##Step5b, retain only the surname and frequency in dataframe

surname <- surname[, c(1,3)]

Next, using the count/frequency data for each data frame, probabilities were calculated for all names; the probability of a name is the name's count divided by the total count of all names within a data frame. For instance, of the 242,121,073 people counted in the surname data, 2,376,206 share the last name “Smith.” Thus, for a person drawn at random from that data, the probability of having the last name Smith, the most popular surname in America, is about 0.98%.

##Step6a find each name's probability, for both males and females, and store this information in a third column

#vectorized: divide the whole count column by its total in one step
male[, 3] <- male[, 2] / sum(male[, 2])
female[, 3] <- female[, 2] / sum(female[, 2])

##Step6b find each last name's probability, store this information in a third column

surname[, 3] <- surname[, 2] / sum(surname[, 2])
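
The arithmetic behind the Smith example can be checked directly, and the same division can be applied to a whole column at once. The sketch below uses the two Smith figures quoted above; the three-row `toy` data frame is invented purely to show that probabilities computed this way sum to 1.

```r
# Figures quoted in the text above.
smith_count <- 2376206
total_count <- 242121073
p_smith <- smith_count / total_count
round(100 * p_smith, 2)   # percentage, about 0.98

# Hypothetical counts, only to illustrate the column-wise calculation.
toy <- data.frame(name = c("A", "B", "C"), frequency = c(50, 30, 20),
                  stringsAsFactors = FALSE)
toy$probability <- toy$frequency / sum(toy$frequency)
sum(toy$probability)      # probabilities over a full data frame sum to 1
```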

For clarity, the columns are renamed to the more descriptive name, frequency, and probability. Also, the case of each surname is changed so that only its first letter is capitalized.

##Step7a rename each variable to an appropriate description
names(male) <- c("name", "frequency", "probability")
names(female) <- c("name", "frequency", "probability")



##Step7b change surname variable names and the case of names

names(surname) <- c("name", "frequency", "probability")

##Create a function that capitalizes the first letter of a string and converts the rest to lower case
r_ucfirst <- function (str) {
  paste(toupper(substring(str, 1, 1)), tolower(substring(str, 2)), sep = "")
}

surname$name <- r_ucfirst(surname$name)
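
As a quick illustration of what r_ucfirst() does, the sketch below applies it to two sample strings, assuming the raw surnames arrive in upper case. The function is repeated so the snippet runs on its own, and the input vector is made up for the example.

```r
# Same helper as in Step 7b, repeated so this snippet is self-contained.
r_ucfirst <- function(str) {
  paste(toupper(substring(str, 1, 1)), tolower(substring(str, 2)), sep = "")
}

# Works element-wise on a character vector.
recased <- r_ucfirst(c("SMITH", "MCPHERSON"))
recased
# [1] "Smith"     "Mcpherson"
```

Only the first character is upper-cased, so interior capitals are not restored; this is why a surname such as "Mcpherson" appears in that form in the generated names.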

Finally, all data frames are exported to the working directory in the form of comma separated files.

##Step8a write first name data to two separate .csv files

write.csv(male, file = "malefirstnames.csv",row.names=FALSE)
write.csv(female, file = "femalefirstnames.csv",row.names=FALSE)

##Step8b write surname data to a new .csv file

write.csv(surname, file = "surnames.csv",row.names=FALSE)

In an R environment with the appropriate files in your working directory, the following commands read in the name data:

male <- read.csv("malefirstnames.csv", stringsAsFactors=FALSE)
female <- read.csv("femalefirstnames.csv", stringsAsFactors=FALSE)
last <- read.csv("surnames.csv", stringsAsFactors=FALSE)

You can then take weighted random samples via:

sample(male$name, size = 1, prob = male$probability)
sample(female$name, size = 1, prob = female$probability)
sample(last$name, size = 1, prob = last$probability)
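
To see that the `prob` argument really does weight the draw, one can sample many times and compare the observed shares with the supplied probabilities. The following is a minimal sketch on an invented three-name table (the names, probabilities, and seed are placeholders, not values from the CSV files).

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# Hypothetical stand-in for one of the name tables.
toy <- data.frame(name = c("Ann", "Bea", "Cat"),
                  probability = c(0.7, 0.2, 0.1),
                  stringsAsFactors = FALSE)

# Without replace = TRUE, sample() draws without replacement,
# so no name could appear more than once.
draws <- sample(toy$name, size = 10000, replace = TRUE, prob = toy$probability)

# Share of each name among the draws, in the same order as the table.
observed <- table(factor(draws, levels = toy$name)) / length(draws)
round(observed, 2)
```

With enough draws, the observed shares should sit close to the probability column, which is the property the name generator relies on.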

rName: An Example

Here I have illustrated basic use of the data via a random name generating function, called rName(), which takes in a gender word, either “Male” or “Female”, and outputs a name.

rName <- function(gender = "Male"){

    #read each data file only if it is not already available
    if(!(exists("male"))){
        male <- read.csv("malefirstnames.csv", stringsAsFactors=FALSE)
    }
    if(!(exists("female"))){
        female <- read.csv("femalefirstnames.csv", stringsAsFactors=FALSE)
    }
    if(!(exists("last"))){
        last <- read.csv("surnames.csv", stringsAsFactors=FALSE)
    }

    #pick the first name table that matches the requested gender
    if(gender == "Male"){
        first <- male
    } else if(gender == "Female"){
        first <- female
    } else {
        stop("gender must be \"Male\" or \"Female\"")
    }

    #draw a weighted first name, then a weighted last name
    paste(sample(first$name, size = 1, prob = first$probability),
          sample(last$name, size = 1, prob = last$probability))
}

Now I will demonstrate the wonder that is weighted random sampling by using rName to generate 100 American names. Behold:

set.seed(5559898)

count = 0
onehundredcoins <- sample(0:1, size = 100, replace = TRUE)
onehundrednames <- rName("Male")

for(i in onehundredcoins){

    count = count + 1

    if(i == 0){
        onehundrednames[count] <- rName("Male")
        }
    if(i==1){
        onehundrednames[count] <- rName("Female")
        }

    }
onehundrednames

##   [1] "Tanner Guidry"       "Edward Hays"         "Maggie Kopke"       
##   [4] "Ella Marth"          "Walter Boldt"        "Joshua Bridges"     
##   [7] "Becky Nelson"        "Iona Sota"           "Henry Slavin"       
##  [10] "Bryan Powell"        "Michelle Munn"       "Ann Chavez"         
##  [13] "Mary Delotto"        "Efren Mann"          "Nicholas Harter"    
##  [16] "Terrence Jeudy"      "Jody Kramer"         "Heather Cano"       
##  [19] "Christopher Curry"   "Xavier White"        "Dale Aranda"        
##  [22] "James Mcpherson"     "Sandra Johnson"      "Shae Wiant"         
##  [25] "Linda Works"         "Angel Weaver"        "Toby Nelson"        
##  [28] "Tatum Mendoca"       "Ashley Elwell"       "Ethan Thomas"       
##  [31] "Richard Dasenbrock"  "Gerald Flauding"     "Joan Pollock"       
##  [34] "Christine Womack"    "Geoffrey Laun"       "Dale Botello"       
##  [37] "Sarah Garcia"        "Angel Reese"         "Ryan Mcintire"      
##  [40] "Pauline Bauer"       "Aviyah Perry"        "Richard Fajardo"    
##  [43] "Donna Michel"        "Sherrita Bishop"     "Norman Mccullar"    
##  [46] "Erik Reeves"         "Jaxxon Gray"         "Beth Ottaviani"     
##  [49] "Yolanda Jackson"     "Beverly Jenkins"     "Everett Bowman"     
##  [52] "Jean Valencia"       "Brian Massa"         "Raymond Graham"     
##  [55] "Emily Kilpatrick"    "Otto Mazzarella"     "Mark Spencer"       
##  [58] "George Sturdavant"   "Shamus Barksdale"    "Mary Moore"         
##  [61] "Doris Lassiter"      "Patricia Letourneau" "Robert Dopson"      
##  [64] "Randall Arellano"    "Daniel Bast"         "John Forbes"        
##  [67] "Marion Traylor"      "Ted Reinke"          "Joan Haskell"       
##  [70] "Amanda Saucier"      "Jack Jones"          "Maria Hommen"       
##  [73] "Eugene Tesch"        "Katherine French"    "Curtis Hoard"       
##  [76] "Isaac Wood"          "Jack Hanson"         "Kenneth Richard"    
##  [79] "Henry Aguirre"       "Cody Buitron"        "Alison Ngo"         
##  [82] "Kathleen Fluty"      "George Youngblood"   "Paula Williams"     
##  [85] "Douglas Crosby"      "Evelyn Jordan"       "Daniel Nix"         
##  [88] "Peggy Duncan"        "Gerardo Murphey"     "Richard Weasel"     
##  [91] "Anastasia Hudley"    "Ernest Conn"         "Madelyn Dimon"      
##  [94] "Jeremy Smith"        "Rudolph Weber"       "Joseph Longoria"    
##  [97] "Nicholas Nelson"     "Lois Harris"         "Ruth Doria"         
## [100] "Evelyn Mcgee"
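
For larger batches, the two weighted draws can also be done in a single call each instead of once per name. The helper below, `rNames()`, is a hypothetical variant and not part of the original code; it takes the name tables as arguments, and the toy tables and seed are invented so the sketch runs without the CSV files.

```r
# Hypothetical helper: draw n full names in one call.
# `first` and `last` are data frames with `name` and `probability` columns.
rNames <- function(n, first, last) {
  paste(sample(first$name, size = n, replace = TRUE, prob = first$probability),
        sample(last$name,  size = n, replace = TRUE, prob = last$probability))
}

# Toy inputs for illustration; the real tables would be read from the CSV files.
toy_first <- data.frame(name = c("Mary", "Linda"), probability = c(0.6, 0.4),
                        stringsAsFactors = FALSE)
toy_last  <- data.frame(name = c("Smith", "Jones"), probability = c(0.5, 0.5),
                        stringsAsFactors = FALSE)

set.seed(42)  # arbitrary seed
five <- rNames(5, toy_first, toy_last)
five
```

Because this draws all first names before any last names, it consumes random numbers in a different order than the loop above, so it would not reproduce the same 100 names from the same seed.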