‘rethnicity’ Package

Predicting Ethnicity from Names in R

Martin Tran

Introduction to `rethnicity`

What is it? An R package that uses a machine learning model to predict race and ethnicity from names alone.
How does it work? It uses a neural network trained on a large Florida voter registration dataset.
Author: Fangzhou Xie (2021)

# Load required libraries
library(rethnicity)
library(dplyr)
library(ggplot2)
library(tidyr)
library(Lahman) # baseball data, used here as a source of player names

Function

predict_ethnicity()

firstnames = NULL,
lastnames = NULL,
method = "fullname" or "lastname",
threads = 0,
na.rm = FALSE

Examples

single_predict <- predict_ethnicity(
  firstnames = "Martin", 
  lastnames = "Tran", 
  method = "fullname")

# Display the classification result
single_predict

  firstname lastname prob_asian   prob_black prob_hispanic  prob_white  race
1    Martin     Tran  0.9897194 0.0007328312   0.004585689 0.004962085 asian

last_predict <- predict_ethnicity(
  lastnames = "Martin", 
  method = "lastname")
last_predict

  lastname prob_asian prob_black prob_hispanic prob_white  race
1   Martin 0.07798092  0.3111516     0.3073014  0.3035661 black
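The last-name-only result above is close to a three-way tie, which the single `race` label hides. The sketch below reuses the printed probabilities to show one way to quantify that. It assumes the `race` column is simply the class with the highest probability (consistent with the outputs shown here, but not confirmed from the package source), and the `margin` measure is a hypothetical illustration, not part of `rethnicity`.

```r
# Probabilities copied from the printed output for the last name "Martin"
last_predict <- data.frame(
  lastname = "Martin",
  prob_asian = 0.07798092,
  prob_black = 0.3111516,
  prob_hispanic = 0.3073014,
  prob_white = 0.3035661,
  race = "black"
)

prob_cols <- c("prob_asian", "prob_black", "prob_hispanic", "prob_white")
probs <- as.matrix(last_predict[, prob_cols])

# Class with the highest probability, with the "prob_" prefix removed
top_label <- sub("^prob_", "", prob_cols[max.col(probs, ties.method = "first")])

# Gap between the best and second-best class (hypothetical confidence measure)
sorted <- t(apply(probs, 1, sort, decreasing = TRUE))
margin <- sorted[, 1] - sorted[, 2]

data.frame(last_predict["lastname"], top_label, margin)
```

A margin this small (well under one percentage point) suggests treating the label as uncertain rather than as a firm classification.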

Data Set

raw_roster <- Lahman::People

baseball_names <- raw_roster %>%
  filter(debut >= "2000-01-01") %>% # keep players who debuted on or after 2000-01-01
  select(playerID, nameFirst, nameLast) %>%
  drop_na(nameFirst, nameLast)

head(baseball_names)

   playerID nameFirst nameLast
1 aardsda01     David  Aardsma
2  abadan01      Andy     Abad
3  abadfe01  Fernando     Abad
4 abbotan01    Andrew   Abbott
5 abbotco01      Cory   Abbott
6  abelmi01      Mick     Abel
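The `debut` filter above compares the column against a date written as a string. Whether `debut` is stored as character or as `Date` may depend on the installed Lahman version (an assumption worth checking with `str(Lahman::People$debut)`). A minimal sketch of the same filter with an explicit conversion, using a made-up three-row roster rather than the real data:

```r
# Toy roster with debut dates stored as ISO "YYYY-MM-DD" strings (illustrative values)
toy_roster <- data.frame(
  playerID = c("p1", "p2", "p3"),
  debut = c("1998-04-01", "2000-01-01", "2015-06-12")
)

cutoff <- as.Date("2000-01-01")

# Convert explicitly so the comparison is between Date objects, not strings
recent <- toy_roster[as.Date(toy_roster$debut) >= cutoff, ]
recent$playerID
```

Converting explicitly makes the intent clear and avoids relying on how strings happen to sort.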

predict_ethnicity() in action

roster_with_demographics <- baseball_names |> 
  mutate(
    predict_ethnicity(
      lastnames = nameLast, 
      firstnames = nameFirst,
      method = "fullname")) |> 
  select(playerID, firstname, lastname, prob_asian, prob_black, prob_hispanic, prob_white, race)

head(roster_with_demographics)

   playerID firstname lastname prob_asian  prob_black prob_hispanic  prob_white
1 aardsda01     David  Aardsma 0.07009414 0.347950069    0.08085186 0.501103934
2  abadan01      Andy     Abad 0.68531211 0.007956145    0.25888873 0.047843014
3  abadfe01  Fernando     Abad 0.08253610 0.006101720    0.90657341 0.004788768
4 abbotan01    Andrew   Abbott 0.10490424 0.274247768    0.04656111 0.574286882
5 abbotco01      Cory   Abbott 0.08479635 0.298825897    0.02083483 0.595542921
6  abelmi01      Mick     Abel 0.22109019 0.104949849    0.04501239 0.628947570
      race
1    white
2    asian
3 hispanic
4    white
5    white
6    white
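The pipeline above calls `predict_ethnicity()` inside `mutate()` without naming the result. This works because `mutate()` splices the columns of an unnamed data frame into the output. The sketch below shows that mechanism with `toy_predict()`, a hypothetical stand-in that returns one row per input name and does not call `rethnicity` at all:

```r
library(dplyr)

# Hypothetical stand-in for predict_ethnicity(): one output row per input name
toy_predict <- function(firstnames, lastnames) {
  data.frame(
    firstname = firstnames,
    lastname = lastnames,
    name_length = nchar(paste(firstnames, lastnames))
  )
}

players <- data.frame(
  playerID = c("aardsda01", "abadan01"),
  nameFirst = c("David", "Andy"),
  nameLast = c("Aardsma", "Abad")
)

# The unnamed data frame returned by toy_predict() becomes three new columns
result <- players |>
  mutate(toy_predict(nameFirst, nameLast))

names(result)
```

This pattern relies on the function returning rows in the same order as its inputs, which is why the subsequent `select()` can pair `playerID` with the predicted columns.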

aggregated_data <- roster_with_demographics %>%
  mutate(predicted_race = race) %>% 
  count(predicted_race) %>%
  mutate(percentage = n / sum(n) * 100)

aggregated_data

  predicted_race    n percentage
1          asian  546   9.084859
2          black 1241  20.648918
3       hispanic 1571  26.139767
4          white 2652  44.126456
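Counting the `race` labels treats every prediction as certain. One alternative, sketched here as an illustration rather than as the method used in this deck, is to average the probability columns so that uncertain predictions contribute fractionally to each group. The values are the six rows printed by `head(roster_with_demographics)`, so the result describes only those six players:

```r
# First six rows of predicted probabilities, copied from the printed output
head_probs <- data.frame(
  prob_asian    = c(0.07009414, 0.68531211, 0.08253610, 0.10490424, 0.08479635, 0.22109019),
  prob_black    = c(0.347950069, 0.007956145, 0.006101720, 0.274247768, 0.298825897, 0.104949849),
  prob_hispanic = c(0.08085186, 0.25888873, 0.90657341, 0.04656111, 0.02083483, 0.04501239),
  prob_white    = c(0.501103934, 0.047843014, 0.004788768, 0.574286882, 0.595542921, 0.628947570)
)

# Average each probability column, then express it as a percentage
soft_share <- colMeans(head_probs) * 100
names(soft_share) <- sub("^prob_", "", names(soft_share))

round(soft_share, 1)
```

Applied to the full roster, the same `colMeans()` call on the four `prob_` columns would give a distribution to compare against the hard counts.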

ggplot(aggregated_data, aes(x = reorder(predicted_race, -percentage), y = percentage, fill = predicted_race)) +
  geom_col(show.legend = FALSE, color = "black", width = 0.6) +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), vjust = -0.5, fontface = "bold", size = 5) +
  labs(
    title = "Predicted Ethnic Distribution of MLB Players After 2000",
    x = "Predicted Ethnicity",
    y = "Percentage of Roster"
  ) +
  theme(
    plot.title = element_text(face = "bold", size = 18)
  )

Usefulness?

Biased toward United States names; predictions are less reliable for international names
Useful for estimating ethnic background in data sets that contain names
Depends on the Florida voter registration data it was trained on

THE END
