Walkthrough

Load libraries

# Force UTF-8 encoding because of Spanish characters
options(encoding = "UTF-8")
library(tidyverse)
library(lubridate)
library(stringr)

Tuning Parameters

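A lot of randomization is used in this process, so I’ll set the seed for reproducibility.
As promised, I set the population size to 50,000 people: 55% female, 45% male.
The average age and standard deviation can be tuned to limit how many people come out under 18. Ages are drawn from a normal distribution, so there is no hard lower bound; with the defaults below (mean 35, sd 10), roughly 4 to 5% of the draws fall under 18.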

# Set random seed for reproducibility; comment this out if you want different data on each run
set.seed(317)

# General parameters
population_size <- 50000
percentage_male <- .45
customer_age_average <- 35
customer_age_sd <- 10
customer_start_num <- 1
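
As a rough check on the age parameters, the expected share of people under 18 can be estimated from the normal distribution before generating anything. The sketch below also includes a hypothetical helper, `draw_adult_ages` (not part of the original script), that redraws any age below a minimum. It is only one possible way to enforce a floor, and it assumes a truncated age distribution is acceptable for your test data.

```r
customer_age_average <- 35
customer_age_sd <- 10

# Expected share of draws below 18 under a normal distribution
share_under_18 <- pnorm(18, mean = customer_age_average, sd = customer_age_sd)

# Hypothetical helper: redraw any age below min_age until none remain
draw_adult_ages <- function(n, mean, sd, min_age = 18) {
  ages <- rnorm(n, mean = mean, sd = sd)
  too_young <- ages < min_age
  while (any(too_young)) {
    ages[too_young] <- rnorm(sum(too_young), mean = mean, sd = sd)
    too_young <- ages < min_age
  }
  ages
}

set.seed(317)
ages <- draw_adult_ages(1000, customer_age_average, customer_age_sd)
min(ages) >= 18
```

Redrawing shifts the average age slightly upward, so the mean may need a small adjustment if it matters for your tests.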

Create our people

Generate a unique id for each member of population

The first column is a unique customer id.
The ids are sequential integers that start at customer_start_num, with one id per member of the population.

# tibble() replaces the deprecated data_frame()
customer_id <- tibble(unique_id = customer_start_num:(population_size + customer_start_num - 1))
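
To see why the upper bound is `population_size + customer_start_num - 1`, here is a small sketch with an assumed starting value of 1001 and a population of 5 (both values are for illustration only). `seq()` with `length.out` gives the same result without the arithmetic.

```r
customer_start_num <- 1001   # assumed starting value, for illustration only
population_size <- 5

# Same sequence as customer_start_num:(population_size + customer_start_num - 1)
unique_id <- seq(from = customer_start_num, length.out = population_size)
unique_id
```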

Create birth dates

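The birth date combines three independent random draws: a month, a day (capped at 28 so that every month is valid), and a year derived from a normally distributed age.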

year_today <- as.integer(year(Sys.Date()))

birth_month <- sample(1:12, population_size, replace = TRUE)
birth_day <- sample(1:28, population_size, replace = TRUE)
birth_year <- as.integer(year_today - rnorm(population_size, mean = customer_age_average, sd = customer_age_sd))

# Zero-pad each component so the pieces concatenate to a fixed-width YYYYMMDD string
date_df <- data.frame(
  BIRTH_MONTH = str_pad(as.character(birth_month), 2, pad = "0"),
  BIRTH_DAY = str_pad(as.character(birth_day), 2, pad = "0"),
  BIRTH_YEAR = str_pad(as.character(birth_year), 4, pad = "0")
)

date_of_birth_final <- date_df %>%
  transmute(date_of_birth = paste0(BIRTH_YEAR, BIRTH_MONTH, BIRTH_DAY))

rm(year_today, birth_day, birth_month, birth_year, date_df)

Look at birthdates

This is the required format: YYYYMMDD, stored as a character string.

head(date_of_birth_final)

##   date_of_birth
## 1      19840621
## 2      19850202
## 3      19950701
## 4      19821202
## 5      19710723
## 6      19860402
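
One way to confirm that every generated string is a real calendar date in YYYYMMDD form is to parse it back with `as.Date()`. This is a minimal sketch in base R rather than the original stringr code; the year is fixed at 2017 and the sample size at 1,000 purely so the example is self-contained.

```r
set.seed(317)
n <- 1000
year_today <- 2017L  # fixed here so the example does not depend on Sys.Date()

birth_month <- sample(1:12, n, replace = TRUE)
birth_day   <- sample(1:28, n, replace = TRUE)
birth_year  <- as.integer(year_today - rnorm(n, mean = 35, sd = 10))

# sprintf zero-pads each component to a fixed width: YYYYMMDD
date_of_birth <- sprintf("%04d%02d%02d", birth_year, birth_month, birth_day)

# Parsing back to Date returns NA for anything that is not a valid date
parsed <- as.Date(date_of_birth, format = "%Y%m%d")
anyNA(parsed)
```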

Create customer names

Use percentage_male to generate the correct proportion of each gender.
The approach is to load the most common girls’ and boys’ first names, as well as the most common surnames, in Argentina.
Then randomly sample them to build the “Two Firstname, Two Lastname” structure.

“Two Firstname, Two Lastname” structure

In Latin America it is very common for people to have two first names and two last names … or more.
This helps us because we can use all four slots to help ensure that each record is unique.

Number of names used

There are 80 female first names
There are 80 male first names
There are 101 last names

Note that you will get people with the same first & middle names, and the same first & second last names. It’s just the nature of having a limited selection. Since this is test data there is no reason to dedupe the names.
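
The counts above are enough to estimate how often repeats should occur. The sketch below assumes every name is equally likely to be drawn, which matches uniform sampling with `sample()`, and uses the 55% female share of 50,000 (27,500 women). The expected-pairs figure is the usual birthday-problem approximation, not an exact count.

```r
n_first <- 80     # first names available per gender
n_last  <- 101    # surnames available

# Distinct four-part names available for one gender
combos_per_gender <- n_first^2 * n_last^2

# Chance that first and middle name coincide, and that both surnames coincide
p_same_first_middle <- 1 / n_first
p_same_surnames     <- 1 / n_last

# Expected number of pairs of women sharing the same full name
# (birthday-problem approximation: n * (n - 1) / (2 * N))
n_female <- 27500
expected_duplicate_pairs <- n_female * (n_female - 1) / (2 * combos_per_gender)
```

With about 65 million combinations per gender, only a handful of exact full-name repeats are expected, so it is the unique id, not the name, that guarantees each record is distinct.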

Sources

I Googled Argentina baby names and surnames to get names that would be familiar to an Argentine user.

Get popular Latin-American baby names and surnames

Source: First Names - http://www.babynames.ch/Info/Hitparade/poAr2009f
Source: Last Names - http://surnames.behindthename.com/top/lists/argentina/2006

Load Name files

female_first_names <- read_csv("female_names.csv")
male_first_names <- read_csv("male_names.csv")
last_names <- read_csv("last_names.csv")
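
The CSV files themselves are not shown. Since the code below refers to a `Name` column, a file layout like the following would work; this is an assumption about the files, sketched with inline text and base R's `read.csv()` so that it runs without the original data.

```r
# Stand-in for female_names.csv: one header row, one name per line (assumed layout)
csv_text <- "Name\nLola\nDaniela\nOriana"

female_first_names <- read.csv(text = csv_text, stringsAsFactors = FALSE)
female_first_names$Name
```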

Ladies First

Create dataframe with random ladies’ first names put into first_name and middle_name

# tibble() replaces the deprecated data_frame();
# round() keeps the row count an exact integer despite floating-point arithmetic
female_first_name_df <- tibble(
  gender = "FEMALE",
  first_name = as.character(sample(
    female_first_names$Name,
    round(population_size * (1 - percentage_male)),
    replace = TRUE
  )),
  middle_name = as.character(sample(
    female_first_names$Name,
    round(population_size * (1 - percentage_male)),
    replace = TRUE
  ))
)

Next the men

# The men fill whatever is left of the population after the women
male_first_name_df <- tibble(
  gender = "MALE",
  first_name = as.character(sample(
    male_first_names$Name,
    population_size - nrow(female_first_name_df),
    replace = TRUE
  )),
  middle_name = as.character(sample(
    male_first_names$Name,
    population_size - nrow(female_first_name_df),
    replace = TRUE
  ))
)

Bind them together to get a full set of first names

first_names_final <- bind_rows(female_first_name_df, male_first_name_df)
sample_n(first_names_final, 6)

## # A tibble: 6 x 3
##   gender first_name middle_name
##    <chr>      <chr>       <chr>
## 1 FEMALE       Lola   Florencia
## 2 FEMALE    Daniela     Delfina
## 3   MALE    Gabriel      Alexis
## 4 FEMALE     Oriana      Magali
## 5   MALE      Elias     Facundo
## 6   MALE     Thiago       Diego
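
A quick way to confirm the gender split is to tabulate the combined data. The sketch below uses a population of 1,000 and tiny stand-in name pools (taken from the sample output above) instead of the real name files, and sticks to base R so it is self-contained.

```r
set.seed(317)
population_size <- 1000
percentage_male <- 0.45

n_female <- round(population_size * (1 - percentage_male))
n_male   <- population_size - n_female

female_pool <- c("Lola", "Daniela", "Oriana")   # stand-in for the 80 female names
male_pool   <- c("Gabriel", "Elias", "Thiago")  # stand-in for the 80 male names

first_names <- rbind(
  data.frame(gender = "FEMALE",
             first_name = sample(female_pool, n_female, replace = TRUE),
             stringsAsFactors = FALSE),
  data.frame(gender = "MALE",
             first_name = sample(male_pool, n_male, replace = TRUE),
             stringsAsFactors = FALSE)
)

# Share of each gender in the combined data
gender_share <- prop.table(table(first_names$gender))
gender_share
```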

Create last names

There is no need to subset based on gender.

# tibble() replaces the deprecated data_frame()
last_names_final <- tibble(
  first_last_name = as.character(sample(
    last_names$Name,
    population_size,
    replace = TRUE
  )),
  second_last_name = as.character(sample(
    last_names$Name,
    population_size,
    replace = TRUE
  ))
)

Bind first & last names to create `full_names`

full_names_final <- first_names_final %>%
  bind_cols(last_names_final) %>%
  mutate_all(toupper)
sample_n(full_names_final, 6)

## # A tibble: 6 x 5
##   gender first_name middle_name first_last_name second_last_name
##    <chr>      <chr>       <chr>           <chr>            <chr>
## 1   MALE     MATIAS     GONZALO         PEREIRA          PEREIRA
## 2   MALE      DYLAN       MATEO       RODRIGUEZ           MARTIN
## 3   MALE        IAN        JUAN            RUIZ         IGLESIAS
## 4   MALE      PABLO      MARTIN         LORENZO          QUIROGA
## 5   MALE  JUAN CRUZ      MAXIMO             PAZ          COLOMBO
## 6 FEMALE    CANDELA      RENATA          BLANCO           IBANEZ

Build addresses and phone numbers

Load table of cities, states, postal codes, and area codes

Unfortunately I do not recall where I got this data from; it was probably Wikipedia. I can post it to a public data site if anyone wants to reproduce this.

address_phone_info <- read_csv("argentina_address_info.csv")
head(address_phone_info)

## # A tibble: 6 x 5
##                 city state country_code postal_code area_code
##                <chr> <chr>        <chr>       <int>     <int>
## 1    ALMIRANTE BROWN    CT           AR        1846       114
## 2         AVELLANEDA    TM           AR        1870       114
## 3       BAHIA BLANCA    JY           AR        8000       114
## 4       BAHIA BLANCA    JY           AR        8002       291
## 5 BANDA DEL RIO SALI    CD           AR        4109       381
## 6    CANADA DE GOMEZ    CR           AR        1439       114

Pull sample address data for `population_size`, with replacement

City
State
Country code
Postal code
Area code

The street address is very basic, but still plausible to Argentines:

Generate a number between 1 & 999
Generate a letter between A & Z
Add some boilerplate text to create an address like: Ruta 317 y Av P

address_samples <- address_phone_info[sample(
  nrow(address_phone_info),
  population_size,
  replace = TRUE
), ]
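
Sampling whole rows by index, rather than sampling each column separately, keeps each city paired with its own postal code and area code. Here is a small sketch of the idea, using three rows copied from the table above as a stand-in for the full file.

```r
# Three rows copied from the table above, standing in for the full CSV
address_phone_info <- data.frame(
  city        = c("ALMIRANTE BROWN", "AVELLANEDA", "BANDA DEL RIO SALI"),
  postal_code = c(1846L, 1870L, 4109L),
  area_code   = c(114L, 114L, 381L),
  stringsAsFactors = FALSE
)

set.seed(317)
# Draw row numbers with replacement, then pull those rows intact
row_index <- sample(nrow(address_phone_info), 10, replace = TRUE)
address_samples <- address_phone_info[row_index, ]

# Every sampled city/postal code pair exists in the source table
all(paste(address_samples$city, address_samples$postal_code) %in%
      paste(address_phone_info$city, address_phone_info$postal_code))
```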

Create street addresses

I’ve written this as a simple loop for now, with the intention of turning it into a function at a later date.

# Preallocate the result instead of growing it inside the loop
street_address <- character(nrow(address_samples))

for (i in seq_len(nrow(address_samples))) {
  street_address[i] <- paste(
    "Ruta",
    sample(1:999, 1),                         # street number, 1 to 999
    "y",
    "Av",
    toupper(sample(c(letters, LETTERS), 1))   # a single letter, A to Z
  )
}
head(street_address)

## [1] "Ruta 705 y Av T" "Ruta 909 y Av P" "Ruta 518 y Av L" "Ruta 357 y Av Z"
## [5] "Ruta 740 y Av V" "Ruta 251 y Av Q"
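
The loop can also be written as a single vectorized call, because `sample()` can draw all the numbers and letters at once and `paste()` works element-wise. This is a sketch of that alternative, not the code used above: it consumes random numbers in a different order, so the same seed will not reproduce the addresses shown.

```r
set.seed(317)
n <- 6   # small count for illustration

# Draw all street numbers and letters in one go
street_address <- paste(
  "Ruta",
  sample(1:999, n, replace = TRUE),
  "y",
  "Av",
  sample(LETTERS, n, replace = TRUE)
)
street_address
```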

Create phone numbers

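# Preallocate, then build each number as area code + 3-digit exchange + 4-digit line number
telephone_nums <- character(nrow(address_samples))
for (i in seq_len(nrow(address_samples))) {
  telephone_nums[i] <- paste0(
    address_samples$area_code[i],   # column 5 of address_samples
    sample(222:899, 1),
    sample(1000:9999, 1)
  )
}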
head(telephone_nums)

## [1] "1144102737" "1147462119" "3437642245" "1144027961" "1148349578"
## [6] "3876578617"
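
The phone numbers can be built without a loop in the same way. The sketch below assumes three-digit area codes like the ones in the sample rows, which yields ten-digit numbers; as with any vectorized rewrite, the random draws happen in a different order than in the loop, so the results will differ for the same seed.

```r
set.seed(317)
area_code <- c(114L, 291L, 381L)   # sample area codes from the table above
n <- length(area_code)

# area code + 3-digit exchange + 4-digit line number
telephone_nums <- paste0(
  area_code,
  sample(222:899, n, replace = TRUE),
  sample(1000:9999, n, replace = TRUE)
)
telephone_nums
```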

Bind street address and phone number to other address elements, and re-order variables

street <- tibble(address = street_address)
phone <- tibble(phone_num = telephone_nums)

full_address_final <- address_samples %>%
  bind_cols(street) %>%
  bind_cols(phone) %>%
  select(address, city, state, country_code, postal_code, phone_num)

# Remove intermediate variables
rm(address_phone_info, address_samples, phone, street, 
   street_address, telephone_nums)

Create final data set

customer_file <- customer_id %>%
  bind_cols(full_names_final) %>%
  bind_cols(full_address_final) %>%
  bind_cols(date_of_birth_final)

glimpse(customer_file)

## Observations: 50,000
## Variables: 13
## $ unique_id        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
## $ gender           <chr> "FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMA...
## $ first_name       <chr> "LUCIA", "ORIANA", "PILAR", "ORNELLA", "SOL",...
## $ middle_name      <chr> "IARA", "SOFIA", "GUADALUPE", "PAULA", "MACAR...
## $ first_last_name  <chr> "PEREZ", "PEREYRA", "AGUILAR", "SANCHEZ", "AR...
## $ second_last_name <chr> "SANCHEZ", "VERA", "VIDAL", "OTERO", "SUAREZ"...
## $ address          <chr> "Ruta 705 y Av T", "Ruta 909 y Av P", "Ruta 5...
## $ city             <chr> "CAPITAL FEDERAL", "POSADAS", "CORRIENTES", "...
## $ state            <chr> "CF", "MN", "CR", "SA", "CD", "SC", "NQ", "CD...
## $ country_code     <chr> "AR", "AR", "AR", "AR", "AR", "AR", "AR", "AR...
## $ postal_code      <int> 1437, 3300, 3402, 1650, 5003, 9120, 1629, 551...
## $ phone_num        <chr> "1144102737", "1147462119", "3437642245", "11...
## $ date_of_birth    <chr> "19840621", "19850202", "19950701", "19821202...

dim(customer_file)

## [1] 50000    13
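
Because the pieces are bound by position rather than joined on a key, a few sanity checks on the result can be worthwhile. The sketch below runs them on a three-row toy version of the data (values copied from the output above) and uses base R's `cbind()` in place of `bind_cols()`. The specific checks are suggestions, not part of the original script.

```r
# Three-row toy versions of the pieces, with values copied from the output above
customer_id   <- data.frame(unique_id = 1:3)
full_names    <- data.frame(first_name = c("LUCIA", "ORIANA", "PILAR"),
                            stringsAsFactors = FALSE)
date_of_birth <- data.frame(date_of_birth = c("19840621", "19850202", "19950701"),
                            stringsAsFactors = FALSE)

pieces <- list(customer_id, full_names, date_of_birth)

# Column binding is positional, so every piece must have the same number of rows
stopifnot(length(unique(vapply(pieces, nrow, integer(1)))) == 1)
customer_file <- do.call(cbind, pieces)

checks <- c(
  ids_unique = anyDuplicated(customer_file$unique_id) == 0,
  no_missing = !any(is.na(customer_file)),
  dob_format = all(grepl("^[0-9]{8}$", customer_file$date_of_birth))
)
checks
```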

How to create 50,000 Argentines … or more

Kier O’Neil

December 28, 2017

Background

Inspiration

Why 50,000?

Outcome

Randomization

Argentine vs Argentinian

Walkthrough

Load libraries

Tuning Parameters

Create our people

Generate a unique id for each member of population

Create birth dates

Look at birthdates

Create customer names

“Two Firstname, Two Lastname” structure

Number of names used

Sources

Get popular Latin-American baby names and surnames

Load Name files

Ladies First

Next the men

Bind them together to get a full set of first names

Create last names

Bind first & last names to create `full_names`

Build addresses and phone numbers

Load table of cities, states, postal codes, and area codes

Pull sample address data for `population_size`, with replacement

Create street addresses

Create phone numbers

Bind street address and phone number to other address elements, and re-order variables

Create final data set

Voila!! We have 50,000 “people”

Final Notes

END
