‘rethnicity’ Package

Predicting Ethnicity from Names in R

Martin Tran

Introduction to rethnicity

  • What is it? An R package that uses a machine learning model to predict race and ethnicity from names alone.
  • How does it work? It uses a neural network trained on a large Florida voter registration dataset.
  • Author: Fangzhou Xie (2021)
# Load required libraries
library(rethnicity)
library(dplyr)
library(ggplot2)
library(tidyr)
library(Lahman) # baseball database, used here as a source of player names

Function

  • predict_ethnicity() arguments and defaults:
  • firstnames = NULL
  • lastnames = NULL
  • method = "fullname" or "lastname"
  • threads = 0
  • na.rm = FALSE
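The arguments above can be sketched on a few names at once. This is a minimal illustration, assuming the rethnicity package is installed and that predict_ethnicity() accepts character vectors of equal length, as it does in the roster example later in this deck. The names are taken from the examples in these slides.

```r
library(rethnicity)

# Names reused from elsewhere in this deck, purely for illustration
first <- c("Martin", "Fernando", "Andrew")
last  <- c("Tran", "Abad", "Abbott")

# "fullname" method: first and last names are matched element by element
full_pred <- predict_ethnicity(
  firstnames = first,
  lastnames = last,
  method = "fullname")

# "lastname" method: only last names are supplied
last_pred <- predict_ethnicity(
  lastnames = last,
  method = "lastname")

# Compare the predicted label from each method
full_pred[, c("firstname", "lastname", "race")]
last_pred[, c("lastname", "race")]
```

Each call returns one row per name, with a probability column for each category and a race column holding the predicted label.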

Examples

single_predict <- predict_ethnicity(
  firstnames = "Martin", 
  lastnames = "Tran", 
  method = "fullname")

# Display the classification result
single_predict
  firstname lastname prob_asian   prob_black prob_hispanic  prob_white  race
1    Martin     Tran  0.9897194 0.0007328312   0.004585689 0.004962085 asian
# Predict from the last name only
last_predict <- predict_ethnicity(
  lastnames = "Martin", 
  method = "lastname")

# Display the classification result
last_predict
  lastname prob_asian prob_black prob_hispanic prob_white  race
1   Martin 0.07798092  0.3111516     0.3073014  0.3035661 black
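In both outputs above, the race label matches the category with the highest probability. The base R sketch below assumes that is how the label is chosen (the outputs here are consistent with it, but the slides do not confirm it) and measures how close the runner-up is for the last-name-only prediction.

```r
# Probabilities copied from the last-name-only prediction for "Martin" above
probs <- c(asian = 0.07798092, black = 0.3111516,
           hispanic = 0.3073014, white = 0.3035661)

# Category with the highest probability
top_label <- names(probs)[which.max(probs)]

# Gap between the best and the second-best category
sorted <- sort(probs, decreasing = TRUE)
margin <- unname(sorted[1] - sorted[2])

top_label
margin
```

The margin here is under 0.004, so "black", "hispanic", and "white" are nearly tied for the last name "Martin", whereas the full-name prediction for "Martin Tran" is close to certain.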

Data Set

raw_roster <- Lahman::People

baseball_names <- raw_roster %>%
  filter(debut >= "2000-01-01") %>% # keep players who debuted on or after January 1, 2000
  select(playerID, nameFirst, nameLast) %>%
  drop_na(nameFirst, nameLast)

head(baseball_names)
   playerID nameFirst nameLast
1 aardsda01     David  Aardsma
2  abadan01      Andy     Abad
3  abadfe01  Fernando     Abad
4 abbotan01    Andrew   Abbott
5 abbotco01      Cory   Abbott
6  abelmi01      Mick     Abel

predict_ethnicity() in action

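# predict_ethnicity() returns a data frame; an unnamed data frame inside
# mutate() is unpacked into separate columns of the result
roster_with_demographics <- baseball_names |> 
  mutate(
    predict_ethnicity(
      firstnames = nameFirst,
      lastnames = nameLast, 
      method = "fullname")) |> 
  select(playerID, firstname, lastname, prob_asian, prob_black, prob_hispanic, prob_white, race)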

head(roster_with_demographics)
   playerID firstname lastname prob_asian  prob_black prob_hispanic  prob_white     race
1 aardsda01     David  Aardsma 0.07009414 0.347950069    0.08085186 0.501103934    white
2  abadan01      Andy     Abad 0.68531211 0.007956145    0.25888873 0.047843014    asian
3  abadfe01  Fernando     Abad 0.08253610 0.006101720    0.90657341 0.004788768 hispanic
4 abbotan01    Andrew   Abbott 0.10490424 0.274247768    0.04656111 0.574286882    white
5 abbotco01      Cory   Abbott 0.08479635 0.298825897    0.02083483 0.595542921    white
6  abelmi01      Mick     Abel 0.22109019 0.104949849    0.04501239 0.628947570    white
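Several of the six predictions shown above win by a modest probability. One way to surface that is to flag rows whose highest probability falls below a cutoff. This is a base R sketch on the six rows shown above; the 0.6 threshold and the max_prob and low_confidence column names are hypothetical choices for illustration, not part of rethnicity.

```r
# First six rows of predicted probabilities, copied from the output above
preds <- data.frame(
  playerID = c("aardsda01", "abadan01", "abadfe01",
               "abbotan01", "abbotco01", "abelmi01"),
  prob_asian = c(0.07009414, 0.68531211, 0.08253610,
                 0.10490424, 0.08479635, 0.22109019),
  prob_black = c(0.347950069, 0.007956145, 0.006101720,
                 0.274247768, 0.298825897, 0.104949849),
  prob_hispanic = c(0.08085186, 0.25888873, 0.90657341,
                    0.04656111, 0.02083483, 0.04501239),
  prob_white = c(0.501103934, 0.047843014, 0.004788768,
                 0.574286882, 0.595542921, 0.628947570)
)

prob_cols <- c("prob_asian", "prob_black", "prob_hispanic", "prob_white")

# Row-wise maximum across the four probability columns
preds$max_prob <- do.call(pmax, unname(preds[prob_cols]))

# Hypothetical cutoff: 0.6 is an arbitrary illustration, not a package default
threshold <- 0.6
preds$low_confidence <- preds$max_prob < threshold

preds[, c("playerID", "max_prob", "low_confidence")]
```

With this cutoff, three of the six players (aardsda01, abbotan01, abbotco01) are flagged, which suggests treating the aggregated percentages as rough estimates rather than exact counts.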
# Count players per predicted category and convert counts to percentages
aggregated_data <- roster_with_demographics %>%
  count(predicted_race = race) %>%
  mutate(percentage = n / sum(n) * 100)

aggregated_data
  predicted_race    n percentage
1          asian  546   9.084859
2          black 1241  20.648918
3       hispanic 1571  26.139767
4          white 2652  44.126456

# Bar chart of predicted shares, ordered from largest to smallest
ggplot(aggregated_data, aes(x = reorder(predicted_race, -percentage), y = percentage, fill = predicted_race)) +
  geom_col(show.legend = FALSE, color = "black", width = 0.6) +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), vjust = -0.5, fontface = "bold", size = 5) +
  labs(
    title = "Predicted Ethnic Distribution of MLB Players Debuting Since 2000",
    x = "Predicted Ethnicity",
    y = "Percentage of Players"
  ) +
  theme(
    plot.title = element_text(face = "bold", size = 18)
  )

Usefulness

  • Biased toward names common in the United States; predictions are unreliable for international names
  • Useful for estimating ethnic composition in data sets that contain only names
  • Depends entirely on the Florida voter registration data it was trained on

THE END