In this installment of Magic with Data I am going to create 50,000 Agentine citizens from thin air.
Earlier this year I was working on a project for Argentina and it looked like the team was not going to receive real data from the source anytime soon, and our partner that uses the data for their application was ready to test their development.
I started thinking, “How can we create people that will look real to the Argentine end-user, then attach that data to simulated transactions so that the average transaction count per person was tunable?”
I chose 50,000 because it’s a nice round number and if I apply 50,000 people to 100,000 transactions there will be an average of two transactions per person.
This walkthrough only creates the people and does not create the full transaction set.
The team ended up receiving enough real sample data that our partner could test so this solution was never used. It is a quick & dirty approach, the next evolution would be to turn it into a function, but this walkthrough accomplishes the goals of creating “people” so that the reader can see the step-by-step process.
Note the randomization points during the process. There is almost no chance of a duplicate person being created and the population can be of nearly any size.
These terms are synonymous and, from what I can tell, neither is incorrect.
I chose to use Argentine because that seems like the more formal version, and I have the deepest respect & admiration for my colleagues from Argentina.
Source: https://en.wikipedia.org/wiki/Argentines
# Force UTF-8 encoding because of spanish characters
options(encoding = "UTF-8")
library(tidyverse)
library(lubridate)
library(stringr)
# Set random seed for reproducibility; comment if you want different data
set.seed(317)
# General parameters
population_size <- 50000
percentage_male <- .45
customer_age_average <- 35
customer_age_sd <- 10
customer_start_num <- 1
customer_start_num to generate sequential integer id’s from that value to the population_size.customer_id <- data_frame(unique_id = customer_start_num:(population_size + customer_start_num - 1))
Uses three different points of randomization
year_today <- as.integer(year(Sys.Date()))
birth_month <- sample(1:12, population_size, replace = TRUE)
birth_day <- sample(1:28, population_size, replace = TRUE)
birth_year <- as.integer(year_today - rnorm(population_size, mean = customer_age_average, sd = customer_age_sd))
date_df <- data.frame(
BIRTH_MONTH = str_pad(as.character(birth_month), 2, pad = "0"),
BIRTH_DAY = str_pad(as.character(birth_day), 2, pad = "0"),
BIRTH_YEAR = str_pad(as.character(birth_year), 2, pad = "0")
)
date_of_birth_final <- date_df %>%
transmute(date_of_birth = paste0(BIRTH_YEAR, BIRTH_MONTH, BIRTH_DAY))
rm(year_today, birth_day, birth_month, birth_year, date_df)
This is the format that was required
head(date_of_birth_final)
## date_of_birth
## 1 19840621
## 2 19850202
## 3 19950701
## 4 19821202
## 5 19710723
## 6 19860402
Note that you will get people with the same first & middle names, and the same first & second last names. It’s just the nature of having a limited selection. Since this is test data there is no reason to dedupe the names.
I Googled Argentina baby names and surnames to get names that would be familiar to an Argentine user.
Source: First Names - http://www.babynames.ch/Info/Hitparade/poAr2009f
Source: Last Names - http://surnames.behindthename.com/top/lists/argentina/2006
female_first_names <- read_csv("female_names.csv")
male_first_names <- read_csv("male_names.csv")
last_names <- read_csv("last_names.csv")
Create dataframe with random ladies’ first names put into first_name and middle_name
female_first_name_df <- data_frame(
gender = "FEMALE",
first_name = as.character(sample(
female_first_names$Name,
population_size * (1 - percentage_male),
replace = TRUE
)),
middle_name = as.character(sample(
female_first_names$Name,
population_size * (1 - percentage_male),
replace = TRUE
))
)
male_first_name_df <- data.frame(
gender = "MALE",
first_name = as.character(sample(
male_first_names$Name,
population_size - nrow(female_first_name_df),
replace = TRUE
)),
middle_name = as.character(sample(
male_first_names$Name,
population_size - nrow(female_first_name_df),
replace = TRUE
))
)
first_names_final <- rbind(female_first_name_df, male_first_name_df)
sample_n(first_names_final, 6)
## # A tibble: 6 x 3
## gender first_name middle_name
## <chr> <chr> <chr>
## 1 FEMALE Lola Florencia
## 2 FEMALE Daniela Delfina
## 3 MALE Gabriel Alexis
## 4 FEMALE Oriana Magali
## 5 MALE Elias Facundo
## 6 MALE Thiago Diego
There is no need to subset based on gender.
last_names_final <- data_frame(
first_last_name = as.character(sample(
last_names$Name,
population_size,
replace = TRUE
)),
second_last_name = as.character(sample(
last_names$Name,
population_size,
replace = TRUE
))
)
full_namesfull_names_final <- first_names_final %>%
bind_cols(last_names_final) %>%
mutate_all(toupper)
sample_n(full_names_final, 6)
## # A tibble: 6 x 5
## gender first_name middle_name first_last_name second_last_name
## <chr> <chr> <chr> <chr> <chr>
## 1 MALE MATIAS GONZALO PEREIRA PEREIRA
## 2 MALE DYLAN MATEO RODRIGUEZ MARTIN
## 3 MALE IAN JUAN RUIZ IGLESIAS
## 4 MALE PABLO MARTIN LORENZO QUIROGA
## 5 MALE JUAN CRUZ MAXIMO PAZ COLOMBO
## 6 FEMALE CANDELA RENATA BLANCO IBANEZ
Unfortunately I do not recall where I got this data from, probably wikipedia. I can post to a public data site if anyone wants to reproduce this.
address_phone_info <- read_csv("argentina_address_info.csv")
head(address_phone_info)
## # A tibble: 6 x 5
## city state country_code postal_code area_code
## <chr> <chr> <chr> <int> <int>
## 1 ALMIRANTE BROWN CT AR 1846 114
## 2 AVELLANEDA TM AR 1870 114
## 3 BAHIA BLANCA JY AR 8000 114
## 4 BAHIA BLANCA JY AR 8002 291
## 5 BANDA DEL RIO SALI CD AR 4109 381
## 6 CANADA DE GOMEZ CR AR 1439 114
population_size, with replacementThe street address is very basic, but still plausible to Argentines:
address_samples <- address_phone_info[sample(
nrow(address_phone_info),
population_size,
replace = TRUE
), ]
I’ve set this up to functionalize at a later date
street_address <- vector(mode = "character")
for(i in seq_len(nrow(address_samples))) {
street_address[i] <- paste(
"Ruta",
as.character(as.integer(sample(1:999, 1))),
"y",
"Av",
toupper(as.character(sample(c(letters, LETTERS), 1)))
)
}
head(street_address)
## [1] "Ruta 705 y Av T" "Ruta 909 y Av P" "Ruta 518 y Av L" "Ruta 357 y Av Z"
## [5] "Ruta 740 y Av V" "Ruta 251 y Av Q"
telephone_nums <- vector()
for(i in seq_len(nrow(address_samples))) {
telephone_nums[i] <- paste0(
address_samples[i, 5],
as.character(sample(222:899, 1)),
as.character(sample(1000:9999, 1))
)
}
head(telephone_nums)
## [1] "1144102737" "1147462119" "3437642245" "1144027961" "1148349578"
## [6] "3876578617"
street <- tibble(address = street_address)
phone <- tibble(phone_num = telephone_nums)
full_address_final <- address_samples %>%
bind_cols(street) %>%
bind_cols(phone) %>%
select(address, city, state, country_code, postal_code,
phone_num)
# Remove intermediate variables
rm(address_phone_info, address_samples, phone, street,
street_address, telephone_nums)
customer_file <- customer_id %>%
bind_cols(full_names_final) %>%
bind_cols(full_address_final) %>%
bind_cols(date_of_birth_final)
glimpse(customer_file)
## Observations: 50,000
## Variables: 13
## $ unique_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
## $ gender <chr> "FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMA...
## $ first_name <chr> "LUCIA", "ORIANA", "PILAR", "ORNELLA", "SOL",...
## $ middle_name <chr> "IARA", "SOFIA", "GUADALUPE", "PAULA", "MACAR...
## $ first_last_name <chr> "PEREZ", "PEREYRA", "AGUILAR", "SANCHEZ", "AR...
## $ second_last_name <chr> "SANCHEZ", "VERA", "VIDAL", "OTERO", "SUAREZ"...
## $ address <chr> "Ruta 705 y Av T", "Ruta 909 y Av P", "Ruta 5...
## $ city <chr> "CAPITAL FEDERAL", "POSADAS", "CORRIENTES", "...
## $ state <chr> "CF", "MN", "CR", "SA", "CD", "SC", "NQ", "CD...
## $ country_code <chr> "AR", "AR", "AR", "AR", "AR", "AR", "AR", "AR...
## $ postal_code <int> 1437, 3300, 3402, 1650, 5003, 9120, 1629, 551...
## $ phone_num <chr> "1144102737", "1147462119", "3437642245", "11...
## $ date_of_birth <chr> "19840621", "19850202", "19950701", "19821202...
dim(customer_file)
## [1] 50000 13