Background

In this installment of Magic with Data I am going to create 50,000 Agentine citizens from thin air.

Inspiration

Earlier this year I was working on a project for Argentina and it looked like the team was not going to receive real data from the source anytime soon, and our partner that uses the data for their application was ready to test their development.

I started thinking, “How can we create people that will look real to the Argentine end-user, then attach that data to simulated transactions so that the average transaction count per person was tunable?”

Why 50,000?

I chose 50,000 because it’s a nice round number and if I apply 50,000 people to 100,000 transactions there will be an average of two transactions per person.

This walkthrough only creates the people and does not create the full transaction set.

Outcome

The team ended up receiving enough real sample data that our partner could test so this solution was never used. It is a quick & dirty approach, the next evolution would be to turn it into a function, but this walkthrough accomplishes the goals of creating “people” so that the reader can see the step-by-step process.

Randomization

Note the randomization points during the process. There is almost no chance of a duplicate person being created and the population can be of nearly any size.

Argentine vs Argentinian

These terms are synonymous and, from what I can tell, neither is incorrect.
I chose to use Argentine because that seems like the more formal version, and I have the deepest respect & admiration for my colleagues from Argentina.
Source: https://en.wikipedia.org/wiki/Argentines

Walkthrough

Load libraries

# Force UTF-8 encoding because of spanish characters
options(encoding = "UTF-8")
library(tidyverse)
library(lubridate)
library(stringr)

Tuning Parameters

  • There is a lot of randomization used in this process so I’ll set the seed for reproducibility.
  • As promised, I set the population size to 50,000 people, 55% female, 45% male.
  • The average age and sd can be tuned so that you don’t get anyone under 18.
# Set random seed for reproducibility; comment if you want different data
set.seed(317)

# General parameters
population_size <- 50000
percentage_male <- .45
customer_age_average <- 35
customer_age_sd <- 10
customer_start_num <- 1

Create our people

Generate a unique id for each member of population

  • The first column is a unique customer id
  • The id uses the customer_start_num to generate sequential integer id’s from that value to the population_size.
customer_id <- data_frame(unique_id = customer_start_num:(population_size + customer_start_num - 1))

Create birth dates

Uses three different points of randomization

year_today <- as.integer(year(Sys.Date()))

birth_month <- sample(1:12, population_size, replace = TRUE)
birth_day <- sample(1:28, population_size, replace = TRUE)
birth_year <- as.integer(year_today - rnorm(population_size, mean = customer_age_average, sd = customer_age_sd))

date_df <- data.frame(
  BIRTH_MONTH = str_pad(as.character(birth_month), 2, pad = "0"),
  BIRTH_DAY = str_pad(as.character(birth_day), 2, pad = "0"),
  BIRTH_YEAR = str_pad(as.character(birth_year), 2, pad = "0")
)

date_of_birth_final <- date_df %>%
  transmute(date_of_birth = paste0(BIRTH_YEAR, BIRTH_MONTH, BIRTH_DAY))

rm(year_today, birth_day, birth_month, birth_year, date_df)

Look at birthdates

This is the format that was required

head(date_of_birth_final)
##   date_of_birth
## 1      19840621
## 2      19850202
## 3      19950701
## 4      19821202
## 5      19710723
## 6      19860402

Create customer names

  • Use percentage male to generate the correct proportion
  • The approach is to load data for the most common girls & boys first names, as well as the most common surnames in Argentina.
  • Then randomly sample them to build the “Two Firstname, Two Lastname” structure.

“Two Firstname, Two Lastname” structure

  • In Latin America is is very common for people to have two first names and two last names … or more.
  • This helps us because we can use all four slots to help ensure that each record is unique.

Number of names used

  • There are 80 female first names
  • There are 80 male first names
  • There are 101 last names

Note that you will get people with the same first & middle names, and the same first & second last names. It’s just the nature of having a limited selection. Since this is test data there is no reason to dedupe the names.

Sources

I Googled Argentina baby names and surnames to get names that would be familiar to an Argentine user.

Load Name files

female_first_names <- read_csv("female_names.csv")
male_first_names <- read_csv("male_names.csv")
last_names <- read_csv("last_names.csv")

Ladies First

Create dataframe with random ladies’ first names put into first_name and middle_name

female_first_name_df <- data_frame(
  gender = "FEMALE",
  first_name = as.character(sample(
    female_first_names$Name,
    population_size * (1 - percentage_male),
    replace = TRUE
  )),
  middle_name = as.character(sample(
    female_first_names$Name,
    population_size * (1 - percentage_male),
    replace = TRUE
  ))
)

Next the men

male_first_name_df <- data.frame(
  gender = "MALE",
  first_name = as.character(sample(
    male_first_names$Name,
    population_size - nrow(female_first_name_df),
    replace = TRUE
  )),
  middle_name = as.character(sample(
    male_first_names$Name,
    population_size - nrow(female_first_name_df),
    replace = TRUE
  ))
)

Bind them together to get a full set of first names

first_names_final <- rbind(female_first_name_df, male_first_name_df)
sample_n(first_names_final, 6)
## # A tibble: 6 x 3
##   gender first_name middle_name
##    <chr>      <chr>       <chr>
## 1 FEMALE       Lola   Florencia
## 2 FEMALE    Daniela     Delfina
## 3   MALE    Gabriel      Alexis
## 4 FEMALE     Oriana      Magali
## 5   MALE      Elias     Facundo
## 6   MALE     Thiago       Diego

Create last names

There is no need to subset based on gender.

last_names_final <- data_frame(
        first_last_name = as.character(sample(
                last_names$Name,
                population_size,
                replace = TRUE
        )),
        second_last_name = as.character(sample(
                last_names$Name,
                population_size,
                replace = TRUE
        ))
)

Bind first & last names to create full_names

full_names_final <- first_names_final %>%
        bind_cols(last_names_final) %>%
        mutate_all(toupper)
sample_n(full_names_final, 6)
## # A tibble: 6 x 5
##   gender first_name middle_name first_last_name second_last_name
##    <chr>      <chr>       <chr>           <chr>            <chr>
## 1   MALE     MATIAS     GONZALO         PEREIRA          PEREIRA
## 2   MALE      DYLAN       MATEO       RODRIGUEZ           MARTIN
## 3   MALE        IAN        JUAN            RUIZ         IGLESIAS
## 4   MALE      PABLO      MARTIN         LORENZO          QUIROGA
## 5   MALE  JUAN CRUZ      MAXIMO             PAZ          COLOMBO
## 6 FEMALE    CANDELA      RENATA          BLANCO           IBANEZ

Build addresses and phone numbers

Load table of cities, states, postal codes, and area codes

Unfortunately I do not recall where I got this data from, probably wikipedia. I can post to a public data site if anyone wants to reproduce this.

address_phone_info <- read_csv("argentina_address_info.csv")
head(address_phone_info)
## # A tibble: 6 x 5
##                 city state country_code postal_code area_code
##                <chr> <chr>        <chr>       <int>     <int>
## 1    ALMIRANTE BROWN    CT           AR        1846       114
## 2         AVELLANEDA    TM           AR        1870       114
## 3       BAHIA BLANCA    JY           AR        8000       114
## 4       BAHIA BLANCA    JY           AR        8002       291
## 5 BANDA DEL RIO SALI    CD           AR        4109       381
## 6    CANADA DE GOMEZ    CR           AR        1439       114

Pull sample address data for population_size, with replacement

  • City
  • State
  • Country code
  • Postal code
  • Area code

The street address is very basic, but still plausible to Argentines:

  • Generate a number between 1 & 999
  • Generate a letter between A & Z
  • Add some boilerplate text to create an address like: Ruta 317 y Av P
address_samples <- address_phone_info[sample(
  nrow(address_phone_info),
  population_size,
  replace = TRUE
), ]

Create street addresses

I’ve set this up to functionalize at a later date

street_address <- vector(mode = "character")

for(i in seq_len(nrow(address_samples))) {
  street_address[i] <- paste(
    "Ruta",
    as.character(as.integer(sample(1:999, 1))),
    "y",
    "Av",
    toupper(as.character(sample(c(letters, LETTERS), 1)))
  )
}
head(street_address)
## [1] "Ruta 705 y Av T" "Ruta 909 y Av P" "Ruta 518 y Av L" "Ruta 357 y Av Z"
## [5] "Ruta 740 y Av V" "Ruta 251 y Av Q"

Create phone numbers

telephone_nums <- vector()
for(i in seq_len(nrow(address_samples))) {
        telephone_nums[i] <- paste0(
                address_samples[i, 5],
                as.character(sample(222:899, 1)),
                as.character(sample(1000:9999, 1))
        )
}
head(telephone_nums)
## [1] "1144102737" "1147462119" "3437642245" "1144027961" "1148349578"
## [6] "3876578617"

Bind street address and phone number to other address elements, and re-order variables

street <- tibble(address = street_address)
phone <- tibble(phone_num = telephone_nums)

full_address_final <- address_samples %>%
        bind_cols(street) %>%
        bind_cols(phone) %>%
        select(address, city, state, country_code, postal_code, 
               phone_num)

# Remove intermediate variables
rm(address_phone_info, address_samples, phone, street, 
   street_address, telephone_nums)

Create final data set

customer_file <- customer_id %>%
  bind_cols(full_names_final) %>%
  bind_cols(full_address_final) %>%
  bind_cols(date_of_birth_final)

glimpse(customer_file)
## Observations: 50,000
## Variables: 13
## $ unique_id        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
## $ gender           <chr> "FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMA...
## $ first_name       <chr> "LUCIA", "ORIANA", "PILAR", "ORNELLA", "SOL",...
## $ middle_name      <chr> "IARA", "SOFIA", "GUADALUPE", "PAULA", "MACAR...
## $ first_last_name  <chr> "PEREZ", "PEREYRA", "AGUILAR", "SANCHEZ", "AR...
## $ second_last_name <chr> "SANCHEZ", "VERA", "VIDAL", "OTERO", "SUAREZ"...
## $ address          <chr> "Ruta 705 y Av T", "Ruta 909 y Av P", "Ruta 5...
## $ city             <chr> "CAPITAL FEDERAL", "POSADAS", "CORRIENTES", "...
## $ state            <chr> "CF", "MN", "CR", "SA", "CD", "SC", "NQ", "CD...
## $ country_code     <chr> "AR", "AR", "AR", "AR", "AR", "AR", "AR", "AR...
## $ postal_code      <int> 1437, 3300, 3402, 1650, 5003, 9120, 1629, 551...
## $ phone_num        <chr> "1144102737", "1147462119", "3437642245", "11...
## $ date_of_birth    <chr> "19840621", "19850202", "19950701", "19821202...
dim(customer_file)
## [1] 50000    13

Voila!! We have 50,000 “people”

Final Notes

END