Painting A Very, Very Clear Picture of Reduplication

See what I did there? If not, then strap in, as we zoom in on one of the lesser-studied – and thereby, more exciting – phenomena of linguistics: reduplication.

Reduplication is a crosslinguistic phenomenon, whereby an entire word, or part of a word, is repeated in order to form a new meaning, or to enhance its existing meaning. Depending on the language surveyed, the scope and purpose of reduplication differ. Handily, reduplication exists in English in various forms, which allows for some concrete examples:

Bye, bye; no-no – exact reduplication (both fixed and productive in EN)
Super duper, okie dokie – rhyming reduplication (normally fixed in EN)
Flip-flop, tic-tac-toe – ablaut reduplication
Fancy-shmancy – shm reduplication (normally productive in EN)

And its forms vary depending on usage, as well. For example:

A: “I’m going to be practicing my falsetto in here. Is there any glass in this room?”

B: “Well, uhm. There’s… plexiglass.”

A: “Uhm. No. I meant, is there any glass glass in this room.”

In this interaction, we can see that A is emphasizing that they are referring to the authentic form of glass, and not to things that are called glass, but aren’t actually glass. This is referred to as contrastive focus reduplication.
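To make the English patterns listed above a bit more concrete, here is a rough sketch in R that sorts a two-part form into one of those categories. The function name `classify_redup` and the rules inside it are my own assumptions for illustration: it only looks at spelling (not pronunciation), treats "y" as a consonant, and will misjudge plenty of real words.

```r
# A spelling-based toy classifier (hypothetical helper, not part of the notebook's pipeline)
classify_redup <- function(base, copy) {
  base <- tolower(base)
  copy <- tolower(copy)

  # everything after the initial consonant cluster, e.g. "super" -> "uper"
  rime <- function(w) sub("^[^aeiou]*", "", w)
  # the consonants only, e.g. "flip" -> "flp"
  skeleton <- function(w) gsub("[aeiou]", "", w)

  if (base == copy) return("exact")
  if (startsWith(copy, "shm") && rime(base) == rime(copy)) return("shm")
  if (rime(base) == rime(copy)) return("rhyming")
  if (skeleton(base) == skeleton(copy)) return("ablaut")
  "other"
}

pairs <- list(c("bye", "bye"), c("super", "duper"),
              c("flip", "flop"), c("fancy", "shmancy"))
sapply(pairs, function(p) classify_redup(p[1], p[2]))
```

The order of the checks matters: "shmancy" also rhymes with "fancy", so the shm test has to run before the general rhyme test.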

Believe it or not, English reduplication covers only a narrow subset of reduplication across the world’s languages, and we are going to paint a picture of this phenomenon globally, both quantitatively and illustratively.

Gathering the Data

The Graz Database on Reduplication is a database that documents reduplication across 82 languages globally. The goal of this notebook is to compile this data in a neat table format, clean the data, and see if we can observe any interesting patterns. It’s important to note that this set of 82 documented languages is by no means an exact representation of what reduplication looks like globally, but we hope that the data is diverse and unbiased enough to come as close as possible.

These languages are listed in a directory and distributed among several links, each of which gives us a ton of useful information on a per-language basis. Between the tables and the dropdowns, we can automatically collect details on each language, as well as the semantic, syntactic, morphological, and phonological properties of its reduplication.

The first step of this process entails scraping the data using R. The code utilized to create our master table is below.

Code:

# note: this cell does NOT output a table
# ============== WEB SCRAPING SETUP ================= #
library(tidyverse) # the og
library(rvest) # for webscraping

# ========= USER VARIABLES ( modify me :) ) =========

# one string ONLY per cell
single_keys <- c(
  "Area",
  "General Data",
  "Reduplication Form-Function",
  "Relationship Form-Function",
  "Reduplication System",
  "Comments",
  "Diachrony",
  "Productivity",
  "Repetitive Operations",
  "Stylistic Information",
  "Recursive Operations",
  "Typological Information"
)

# needs a list for every cell
multi_keys <- c(
  "Alternative Names",
  "Family/Group"
)

# by default, multi-valued
dropdown_keys <- c(
  "Functions",
  "Semantics"
)

import_local_data = FALSE
csv_path = ""

# ======= SETUP ( don't modify me >:( ) =======

single_keys <- make.names(single_keys, unique = TRUE)
multi_keys <- make.names(multi_keys, unique = TRUE)
dropdown_keys <- make.names(dropdown_keys, unique = TRUE)
table_keys <- union(single_keys, multi_keys)
bullet_point <- "[\u0095\u2022]"

# ============== FUNCTIONS ================= #

# returns: (xml_doc) contents of the page
get_page <- function(url) return(read_html(url))

# returns: (string) name of the language
get_language_name <- function(page){
  lang_name <- page %>% 
    html_element("body") %>% html_text() %>% 
    str_match("Language:\\s*([^\\n\\()]+)") %>% 
    .[, 2] %>% 
    str_squish()
  return(lang_name)
}

# returns: (character vector) normalized whitespace
fix_ws <- function(x) {
  return(x %>% str_replace_all("\r", "") %>%
           str_replace_all("\t", " ") %>%
           str_replace_all(bullet_point, "") %>% str_squish())
}

# returns: (tibble with key/val columns) the biggest table
get_table <- function(page) {
  # --- for the table itself ---
  res <- page %>% 
    html_table(fill = TRUE) %>% 
    # grabs the biggest table by row count
    pluck(which.max(map_int(., nrow))) %>% 
    as_tibble(.name_repair = "minimal") %>% 
    select(1, 2) %>% set_names(c("key", "value")) %>% 
    # often the site lists a key once, and the rows below have blank keys. let's squish them into the above key
    mutate(
      # fill() only fills NA, so blank keys are converted to NA first
      key = na_if(fix_ws(key), ""),
      value = fix_ws(value)
    ) %>% 
    fill(key, .direction = "down") %>% 
    group_by(key) %>% 
    summarize(value = list(value[value != ""]), .groups = "drop") %>%
    mutate(key = make.names(key, unique = TRUE)) %>% 
    filter(key %in% table_keys)
  return(res)
}

# returns: (tibble with key/val columns) the specified dropdown boxes
# note: this function *in particular* was written using ChatGPT
get_dropdown <- function(page) {
  rows <- page %>% html_elements("tr")
  rows <- rows[map_lgl(rows, ~ length(html_elements(.x, "select")) > 0)]
  if (length(rows) == 0) return(tibble(key = character(), value = list()))

  out <- map_dfr(rows, function(r) {
    cells <- r %>% html_elements(xpath = ".//th|.//td")
    if (length(cells) == 0) return(NULL)

    key <- cells[[1]] %>% html_text2() %>% fix_ws()
    key <- str_remove(key, ":\\s*$") %>% fix_ws()
    key <- make.names(key, unique = TRUE)

    if (!(key %in% dropdown_keys)) return(NULL)

    opts <- r %>% html_elements("select option") %>% html_text2() %>% fix_ws()
    opts <- opts[opts != ""]
    opts <- opts[!str_to_lower(opts) %in% c("select value", "select", "value")]
    opts <- unique(opts)

    tibble(key = key, value = list(opts))
  })
  out
}


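# returns: (string or vector of strings) one singular cell 
make_cell <- function(key, values) {
  values <- values[!is.na(values)]  # a single NA would turn the whole collapsed string into NA
  text <- fix_ws(str_c(values, collapse = "\n"))
  # guard against zero-length and NA results so the if() never sees a non-logical condition
  if(length(text) == 0 || is.na(text) || text == ""){
    return(NA)
  }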
  
  if(key %in% single_keys){
    return(text)
  }
  
  if(key %in% multi_keys){
    split_chars = "[;,\n]"
    output <- str_split(text, split_chars)[[1]] %>% 
      fix_ws() %>% discard(~ .x == "") %>% unique()
    return(output)
  }
  
  return(NA)
}

# returns: (one row tibble) entire row for a language
scrape_language <- function(url) {
  page <- get_page(url)
  lang_name <- get_language_name(page)

  table <- get_table(page) %>%
    mutate(key = make.names(key, unique = TRUE),
      value = map2(key, value, make_cell))

  dropdown <- get_dropdown(page) %>%
    mutate(key = make.names(key, unique = TRUE))

  table <- table %>% filter(!key %in% dropdown$key) %>% bind_rows(dropdown)

  table %>%
    pivot_wider(names_from = key, values_from = value) %>%
    mutate(language = lang_name, .before = 1) %>%
    mutate(url = url)
}

# returns: (vector of strings) a list of all languages on the site
get_language_urls <- function() {
  # get the first search tree page
  base <- "https://reduplication.uni-graz.at/redup/"
  dir_root <- "tree_root.php?languagePage="
  lang_root <- "tree_lang\\.php\\?idlang=\\d+"

  first_page <- get_page(paste0(base, dir_root, 1))
  
  max_page <- first_page %>% 
    html_elements("a") %>% html_attr("href") %>% 
    str_extract("languagePage=\\d+")  %>% 
    str_remove("languagePage=") %>% 
    as.integer() %>% 
    max(na.rm = TRUE)
  
  # list of all available pages
  dir_urls <- paste0(base, dir_root, seq_len(max_page))
  
  all_hyperlinks <- character()
  
  for (dir_url in dir_urls){
    page_hyperlinks <- get_page(dir_url) %>% 
      html_elements("a") %>% html_attr("href")
    page_hyperlinks <- page_hyperlinks[str_detect(page_hyperlinks, lang_root)]
    all_hyperlinks <- c(all_hyperlinks, page_hyperlinks)
  }
  return(paste0(base, all_hyperlinks))
}

# this function *in particular* was mostly written by ChatGPT
# returns: nothing, but exports a csv/tsv/rds depending on user specifications
export <- function(df, path = getwd(), name = deparse(substitute(df)),
                      csv = FALSE, tsv = FALSE, rds = TRUE,
                      na = "NA", schema_ext = ".schema.json") {
  library(jsonlite)
  stopifnot(is.data.frame(df))
  dir.create(path, showWarnings = FALSE, recursive = TRUE)
  is_list <- vapply(df, is.list, logical(1))
  classes <- lapply(df, function(x) class(x))
  factor_levels <- lapply(df, function(x) if (is.factor(x)) levels(x) else NULL)
  schema <- list(name = name, na = na, list_cols = names(df)[is_list],
    classes = lapply(classes, function(x) paste(x, collapse = "|")),
    factor_levels = factor_levels)
  out <- df
  if (any(is_list)) {
    for (col in names(df)[is_list]) {
      out[[col]] <- vapply(df[[col]], function(x) {
        if (is.null(x)) {
          "null"
        } else {
          toJSON(x, auto_unbox = FALSE, null = "null")
        }
      }, character(1))
    }
  }
  if (csv) {
    csv_path <- file.path(path, paste0(name, ".csv"))
    write.csv(out, csv_path, row.names = FALSE, na = na)
  }
  if (tsv) {
    tsv_path <- file.path(path, paste0(name, ".tsv"))
    write.table(out, tsv_path, sep = "\t", row.names = FALSE, col.names = TRUE,
                       quote = TRUE, na = na)
  }
  if (rds) {
    rds_path <- file.path(path, paste0(name, ".rds"))
    saveRDS(df, rds_path)
  }
  schema_path <- file.path(path, paste0(name, schema_ext))
  write_json(schema, schema_path, pretty = TRUE, auto_unbox = TRUE)
  invisible(list(data = out, schema = schema))
}

# this function *in particular* was mostly written by ChatGPT
# returns: a data frame from a CSV file
import <- function(csv_path,
                   schema_path = sub("\\.csv$", ".schema.json", csv_path),
                   na = "NA") {
  df <- utils::read.csv(
    csv_path,
    stringsAsFactors = FALSE,
    check.names = FALSE,
    na.strings = na,
    colClasses = "character"
  )
  if (!file.exists(schema_path)) return(df)
  schema <- jsonlite::read_json(schema_path, simplifyVector = FALSE)
  `%||%` <- function(x, y) if (is.null(x)) y else x
  list_cols <- unlist(schema$list_cols %||% character(0))
  for (col in intersect(list_cols, names(df))) {
    df[[col]] <- lapply(df[[col]], function(s) {
      if (is.na(s) || identical(s, "null")) return(NULL)
      jsonlite::fromJSON(s, simplifyVector = TRUE)
    })
  }
  if (!is.null(schema$factor_levels)) {
    for (col in intersect(names(schema$factor_levels), names(df))) {
      lev_raw <- schema$factor_levels[[col]]
      lev <- unlist(lev_raw)

      if (length(lev) > 0) {
        df[[col]] <- factor(df[[col]], levels = lev)
      }
    }
  }
  df
}

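# returns: nothing, but assigns master_table and tibble_table again globally
reload_master_table <- function(path = csv_path){
  # rm() has to target the global environment; called bare inside a function
  # it only looks in the function's own frame and warns that nothing was found
  if(exists("master_table", envir = .GlobalEnv)){
    rm(list = "master_table", envir = .GlobalEnv)
  }
  master_table <<- import(path)
  tibble_table <<- as_tibble(master_table)
}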

# ================================================
# =============== SCRAPE THE DATA ================
# ================================================

# local_data can be passed in to circumvent the web scraping step (especially
# useful during testing). import_local_data is at the top
if(!import_local_data){
  lang_urls <- get_language_urls()
  
  # we'll scrape using a list of tibbles, before we make the big table
  # makes it way easier to work with because the data is really messy (like, diff columns per page)
  tibble_table <- vector("list", length(lang_urls))
  
  # for each lang url, scrape language data from the page (notably excludes examples, which were written in directly)
  # seq_along() is safe even when no urls were found (1:length() would give c(1, 0))
  for(i in seq_along(lang_urls)){
    lang <- lang_urls[[i]]
    tibble_table[[i]] <- scrape_language(lang)
    Sys.sleep(0.1) # to prevent spam requests
    if(i%%10 == 0) print(paste0(i, "/", length(lang_urls), " scraped so far"))
  } 
  master_table <- bind_rows(tibble_table)
  
}else{
  # clears var if exists
  reload_master_table()
}
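
The fill-and-group step inside get_table is probably the least obvious part of the scraper, so here is a small sketch of the idea on a toy table. The rows and values below are invented for illustration and are not taken from the Graz site; the one detail worth noticing is that `tidyr::fill()` only fills `NA`, so blank strings need to be converted first.

```r
library(dplyr)
library(tidyr)

# Hypothetical rows shaped like a scraped key/value table: the second
# "Family/Group" value sits on a row whose key cell is blank.
toy <- tibble(
  key   = c("Area", "Family/Group", "", "Productivity"),
  value = c("Oceania", "Austronesian", "Oceanic", "highly productive")
)

collapsed <- toy %>%
  mutate(key = na_if(key, "")) %>%          # "" -> NA so that fill() can see it
  fill(key, .direction = "down") %>%        # carry the last real key downwards
  group_by(key) %>%
  summarize(value = list(value), .groups = "drop")

collapsed$value[[which(collapsed$key == "Family/Group")]]
```

After this step each key appears once, and its value is a list holding every row that belonged to it, which is the shape make_cell expects.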

Functions of Reduplication

Let’s start simple and count the syntactic and semantic functions that reduplication serves:

superhero_bg_color = "#0F2537"
# ONLY USE THIS IF FUNCTION IS AN AXIS, OTHERWISE SKEWS THE DATA
unnest_functions <- function(table){
  table %>%
  unnest_longer(Functions, values_to = "func")
}

# determine most frequently used function of reduplication (via "func")
unnest_functions(master_table) %>%
  count(func, sort = TRUE, name = "func_n") %>%
  ggplot(aes(x = reorder(func, func_n), 
             y = func_n, 
             fill = func_n)) +
  geom_col(color = "white") +
  # plain two-color gradient; scale_fill_gradient2 is a diverging scale with a white midpoint at 0
  scale_fill_gradient(
    low = "#9E5BD2",
    high = "#702DA4",
    name = "Freq."
  ) +
  coord_flip() +
  labs(x = "Function", y = NULL, fill = "Freq.") +
  theme_minimal() +
  theme(
    plot.background = element_rect(fill = superhero_bg_color, color = NA),
    panel.background = element_rect(fill = "#333333"),
    axis.text   = element_text(color = "white"),
    axis.title  = element_text(color = "white"),
    plot.title  = element_text(color = "white"),
    plot.subtitle = element_text(color = "white"),
    legend.text = element_text(color = "white"),
    legend.title= element_text(color = "white"),
    legend.background   = element_rect(fill = superhero_bg_color, color = NA),
    legend.key          = element_rect(fill = superhero_bg_color, color = NA),
    legend.box.background = element_rect(fill = superhero_bg_color, color = NA)
  )
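
A quick aside on the warning above unnest_functions: unnesting a list column gives one row per language-function pair, so any later row count silently weights languages by how many functions they have. A minimal sketch with two made-up languages (the names and values are placeholders, not database entries):

```r
library(dplyr)
library(tidyr)

# Two hypothetical languages; the first has three functions, the second has one
toy <- tibble(
  language  = c("Lang A", "Lang B"),
  Functions = list(c("plural", "intensity", "diminution"), "plural")
)

long <- toy %>% unnest_longer(Functions, values_to = "func")

nrow(long)                  # rows = language-function pairs
n_distinct(long$language)   # languages = still two
```

This is why the later charts count with `distinct(url, ...)` or `n_distinct(url)` instead of raw rows whenever the unit of interest is the language.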

For simplicity purposes, all examples will be transcribed using the Roman alphabet, as opposed to native scripts.

The most common meaning that reduplication carries among the sampled languages is the pluralization of nouns. This makes intuitive sense. “Day” means day – in the absence of a suffix or a numeric prefix, we can assume that “day day” could mean “days”, “several days”, “day by day”, etc.
- Example from Chinese: tiān – day → tiāntiān – every day
We also observe intensification: this often makes adjectives or adverbs more intense. In some cases, with verbs, it can even stretch the meaning into a more devious version of the original.
- Example from Hawaiian: wiki – quick → wikiwiki – very quick (the origin of Wikipedia’s name!)
- Example from Thai: yāk – difficult → yākyāk – very difficult
- Example from Indonesian: meminta – to ask for → meminta-minta – to beg for, to mooch
Diminution is a wide umbrella. Normally, it refers to reducing the intensity of something, reducing a whole to a part, referring to a diminutive version of it, or dismissing something as inferior.
- Example from Malagasy: haingana – quickly → haingankaingana – somewhat quickly
  - Notice: This is the opposite of how Hawaiian is using reduplication!
- Example from Maori: kimo – blink → kikimo – wink
- Example from Tagalog: bahay – house → bahay-bahayan – doll house, pretend house
- Example from Indonesian: kamu – you → kamu-kamu – you (derogatory, with frustration)
Word class derivation refers to a shift in part of speech: the reduplicated form belongs to a different word class than its base.
- Example from Fijian: rere – fear (noun) → rerere – fearful (adjective)
- Example from Marshallese: bahat – smoke (noun) → bahathat – to smoke (verb)
Lexical enrichment is a category defined by Graz that essentially means “the reduplicated form formed a new word in its own right”. The base form is often missing. An example of this in English, for illustration, would be “yo-yo”. Yo-yo is a term in and of itself, and does not modify a noun “yo” in any capacity – it just exists on its own.
- Example from Tok Pisin: wil – wheel → wil-wil – bicycle
- Example from Acoma: kudu – round → kudu-kudu – candy
- Example from Japanese: doki – nervous → doki-doki – thump-thump (sound of a heartbeat)
  - Japanese onomatopoeia (擬音語 and 擬態語) makes HEAVY use of reduplication. Other frequent examples include: ira-ira, waku-waku, kira-kira, pachi-pachi, potsu-potsu, etc.
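
The examples above also differ in how much of the word gets copied. As a rough sketch (my own reading of the spelled forms, not an analysis taken from the database), three tiny R helpers can reproduce several of them: copying the whole word, copying the first consonant-vowel pair, and copying the final consonant-vowel-consonant chunk. The function names are hypothetical, and the regular expressions only understand plain a/e/i/o/u spellings.

```r
# copy the whole word, optionally with a separator: wiki -> wikiwiki
redup_full <- function(word, sep = "") paste0(word, sep, word)

# copy the initial (C)V and prefix it: kimo -> kikimo
redup_initial_cv <- function(word) {
  cv <- regmatches(word, regexpr("^[^aeiou]?[aeiou]", word))
  if (length(cv) == 0) return(NA_character_)  # word has no usable (C)V onset
  paste0(cv, word)
}

# copy the final CVC and suffix it: bahat -> bahathat
redup_final_cvc <- function(word) {
  cvc <- regmatches(word, regexpr("[^aeiou][aeiou][^aeiou]$", word))
  if (length(cvc) == 0) return(NA_character_)  # word does not end in CVC
  paste0(word, cvc)
}

redup_full("wiki")          # Hawaiian example
redup_full("wil", "-")      # Tok Pisin example
redup_initial_cv("kimo")    # Maori example
redup_initial_cv("rere")    # Fijian example
redup_final_cvc("bahat")    # Marshallese example
```

Real reduplication rules are phonological rather than orthographic, so treat this purely as a way to see the shapes side by side.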

Patterns in the Data

Tidying code:

# THIS CELL DOES NOT OUTPUT A TABLE
# ============ TIDYING UP THE DATA AND MAKING NEW COLUMNS ===========

# make a word order column
master_table <- master_table %>% 
  mutate( Word.Order = str_extract( as.character(Typological.Information), "(?:SOV|SVO|VSO|VOS|OSV|OVS)" ) ) %>% 
  mutate(Word.Order = na_if(Word.Order, ""))

# split up families (that didn't come out right...), two new columns
master_table <- master_table %>% 
  mutate(Family = map_chr(Family.Group, ~ {
      x <- .x
      if(is.null(x) || length(x) < 1) return(NA_character_)
      as.character(x[[1]])
    })
  ) %>% 
  mutate(Subfamily = map_chr(Family.Group, ~ {
      x <- .x
      if(is.null(x) || length(x) < 2) return(NA_character_)
      as.character(x[[2]])
    })
  ) 

# productivity scores column -- i'll rank from high to low
master_table <- master_table %>%
  mutate(
    Productivity = map_chr(Productivity, ~{
      x <- .x
      if (is.null(x) || length(x) == 0) return(NA_character_)
      if (is.list(x)) x <- unlist(x, use.names = FALSE)
      x <- as.character(x)
      x <- x[!is.na(x)]
      if (length(x) == 0) return(NA_character_)

      txt <- str_squish(str_to_lower(paste(x, collapse = " ")))

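      # negated phrases are checked first, because "not productive" also contains the word "productive"
      if (str_detect(txt, "\\bnot (\\w+ )?productive|unproductive|non-?productive")) "low" else
      if (str_detect(txt, "fully productive|highly productive|very productive")) "high" else
      if (str_detect(txt, "semi-productive|partly productive|\\bproductive\\b")) "mid" else
      if (str_detect(txt, "limited|restricted|lexicali|rare")) "low" else
        NA_character_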
    })
  )

# reduplication types column
master_table <- master_table %>%
  mutate(
    Redup.Type = map_chr(Reduplication.System, function(x) {
      if (is.list(x)) x <- unlist(x, use.names = FALSE)
      x <- as.character(x); x <- x[!is.na(x)]
      if (length(x) == 0) return(NA_character_)
      txt <- str_squish(str_to_lower(paste(x, collapse = " ")))

      full    <- str_detect(txt, "full redup|full-redup|total redup")
      partial <- str_detect(txt, "partial redup|syllab|mora|segment|prosodic|geminat")
      case_when(
        full & partial ~ "Full + Partial",
        partial ~ "Partial-ish",
        full ~ "Full-ish",
        TRUE ~ NA_character_
      )
    })
  )

# three flag columns (boolean) for polysynthetic vs agglutinative vs tonal
master_table <- master_table %>%
  mutate(
    Polysynthetic = map_lgl(Typological.Information, function(x) {
      if (is.list(x)) x <- unlist(x, use.names = FALSE)
      x <- as.character(x); x <- x[!is.na(x)]
      if (length(x) == 0) return(FALSE)
      str_detect(str_to_lower(paste(x, collapse = " ")), "polysynth")
    })
  ) %>%
  mutate(
    Agglutinative = map_lgl(Typological.Information, function(x) {
      if (is.list(x)) x <- unlist(x, use.names = FALSE)
      x <- as.character(x); x <- x[!is.na(x)]
      if (length(x) == 0) return(FALSE)
      str_detect(str_to_lower(paste(x, collapse = " ")), "agglutin")
    })
  ) %>%
  mutate(
    Tonal = map_lgl(Typological.Information, function(x) {
      if (is.list(x)) x <- unlist(x, use.names = FALSE)
      x <- as.character(x); x <- x[!is.na(x)]
      if (length(x) == 0) return(FALSE)
      str_detect(str_to_lower(paste(x, collapse = " ")), "\\btone\\b|tonal")
    })
  )
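
One thing to keep in mind with keyword cascades like the productivity scoring is that they are order-sensitive: whichever pattern is tested first wins. The sketch below uses a hypothetical helper, `score_productivity`, and made-up description strings to show one possible guard, namely testing negated phrases before the positive ones. It uses base R's `grepl()` so it runs without any packages.

```r
# Hypothetical stand-alone scorer; the patterns are illustrative assumptions
score_productivity <- function(txt) {
  txt <- tolower(txt)
  # negations first: "not productive" would otherwise match the plain word below
  if (grepl("\\bnot (\\w+ )?productive|unproductive", txt, perl = TRUE)) return("low")
  if (grepl("fully productive|highly productive|very productive", txt)) return("high")
  if (grepl("\\bproductive\\b", txt, perl = TRUE)) return("mid")
  NA_character_
}

# The plain-word pattern alone cannot tell these two apart:
grepl("\\bproductive\\b", c("productive", "not productive"), perl = TRUE)

sapply(c("Highly productive", "productive", "not very productive", "no data"),
       score_productivity)
```

Free-text fields will always have phrasings that slip through, so spot-checking a handful of scored rows against their original text is still worthwhile.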

I was curious to see if there were any patterns on the basis of language family, word order, or geographical area. The next two charts, which break the functions down by family, actually convey some interesting information:

table_fg_normalized2 <- unnest_functions(master_table) %>% 
  filter(func != "other") %>% # not useful
  filter(!is.na(Family)) %>% 
  group_by(Family) %>%
  filter(n_distinct(url) >= 5) %>% # keep only families with >= 5
  ungroup()

n_lang <- table_fg_normalized2 %>%
  distinct(url, Family) %>%
  count(Family, name = "n_lang")

plot_df <- table_fg_normalized2 %>%
  distinct(url, Family, func) %>%
  count(Family, func, name = "n") %>%
  left_join(n_lang, by = "Family") %>%
  mutate(proportion = n / n_lang) %>%
  # extra line so there's not a blank cell
  complete(Family, func, fill = list(n = 0, proportion = 0))

ggplot(plot_df, aes(x = func, y = Family, fill = proportion)) +
  geom_tile(color = "white") +
  scale_fill_viridis_c(trans = "sqrt", labels = scales::percent, limits = c(0, 1)) +
  labs(title = "Function frequency per family", 
       x = "Function", fill = "% of langs") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
  theme(
    plot.background = element_rect(fill = superhero_bg_color, color = NA),
    panel.background = element_rect(fill = "#333333"),
    axis.text   = element_text(color = "white"),
    axis.title  = element_text(color = "white"),
    plot.title  = element_text(color = "white"),
    plot.subtitle = element_text(color = "white"),
    legend.text = element_text(color = "white"),
    legend.title= element_text(color = "white"),
    legend.background   = element_rect(fill = superhero_bg_color, color = NA),
    legend.key          = element_rect(fill = superhero_bg_color, color = NA),
    legend.box.background = element_rect(fill = superhero_bg_color, color = NA)
  )

Note for the above chart: language families with fewer than 5 languages in the data were excluded.

The most interesting finding of the above chart is that the most common function of reduplication is pluralization in practically every single observed language family EXCEPT for Indo-European. Instead, Indo-European languages normally utilize reduplication for lexical enrichment or diminution. Overall, though, Indo-European languages tend to utilize reduplication less than their counterparts.

Another interesting thing to note is that Austronesian languages utilize all five functions of reduplication at an exceptionally high rate – in fact, each of the five functions has an occurrence rate of over 50%! This plays neatly into the next chart. The previous chart focused on occurrence rate within each language family, whereas the next one focuses on each family's share of all function occurrences across the whole dataset.

table_fg_normalized <- unnest_functions(master_table) %>% 
  filter(func != "other") %>% # not useful
  filter(!is.na(Family)) %>% 
  group_by(Family) %>%
  mutate(n_lang = n_distinct(url)) %>% # langs per family
  ungroup() %>%
  filter(n_lang >= 5) %>% # keep only families with >= 5
  distinct(url, Family, func) %>% 
  count(Family, func) %>%
  mutate(proportion = n / sum(n)) %>%
  # fill absent family/function pairs with zeros so no tile is left blank
  complete(Family, func, fill = list(n = 0, proportion = 0))

ggplot(table_fg_normalized, aes(x = func, y = Family, fill = proportion)) +
  geom_tile(color = "white") +
  scale_fill_viridis_c(trans = "sqrt", labels = scales::percent) +
  theme_minimal() +
  labs(
    title = "Dataset mass",
    #subtitle = "Each tile = number of languages in family with function / total",
    x = "func",
    y = "Family",
    fill = "Share of all\n(lang/func)"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
  theme(
    plot.background = element_rect(fill = superhero_bg_color, color = NA),
    panel.background = element_rect(fill = "#333333"),
    axis.text   = element_text(color = "white"),
    axis.title  = element_text(color = "white"),
    plot.title  = element_text(color = "white"),
    plot.subtitle = element_text(color = "white"),
    legend.text = element_text(color = "white"),
    legend.title= element_text(color = "white"),
    legend.background   = element_rect(fill = superhero_bg_color, color = NA),
    legend.key          = element_rect(fill = superhero_bg_color, color = NA),
    legend.box.background = element_rect(fill = superhero_bg_color, color = NA)
  )

It may be a bit difficult to understand what is going on here, so let’s explain it via the elephant in the room: why is the Austronesian row so bright?

This is because a disproportionately large portion of all observed reduplication functions appear specifically in Austronesian languages. Since this chart divides by the total number of occurrences rather than by family size, a family with many sampled languages will look brighter regardless. Read together with the previous chart, though, it suggests that Austronesian languages very often utilize reduplication for 3-5 functions, whereas other language families normally observe only 1-3 functions per language.
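
If the difference between the two heatmaps is still fuzzy, this small sketch may help. It uses invented counts for two hypothetical families ("Big" with 20 languages, "Small" with 5) and computes both normalizations side by side: the within-family rate from the first chart and the share of all occurrences from the second.

```r
# Made-up counts: how many languages in each family show each function
counts <- data.frame(
  family = c("Big", "Big", "Small", "Small"),
  func   = c("plural", "intensity", "plural", "intensity"),
  n      = c(18, 15, 4, 3)
)
n_lang <- c(Big = 20, Small = 5)  # hypothetical number of languages per family

# first chart: occurrences divided by the size of the family
counts$within_family <- unname(counts$n / n_lang[counts$family])
# second chart: occurrences divided by all occurrences in the dataset
counts$share_of_all  <- counts$n / sum(counts$n)

counts
```

In this toy table, pluralization appears in 80% of the "Small" family's languages but accounts for only 10% of all occurrences, simply because the family is small. The second chart therefore mixes "how versatile is reduplication in this family" with "how many of this family's languages are in the sample".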

#library(patchwork)
#p1 = unnest_functions(master_table) %>%
  #mutate(func = str_squish(func)) %>%
  #filter(!is.na(func), func != "", !is.na(Word.Order)) %>%
  #filter(Word.Order != "OVS") %>% # too few
  #filter(func != "other") %>% 
  #count(func, Word.Order) %>%
  #group_by(func) %>% 
  #mutate(total = sum(n)) %>% ungroup() %>%
  #ggplot(aes(x = fct_reorder(func, total), y = n, fill = Word.Order)) +
  #geom_col(position = "fill") +
  #coord_flip() +
  #theme(
    #plot.background = element_rect(fill = superhero_bg_color, color = NA),
    #panel.background = element_rect(fill = "#333333"),
    #axis.text   = element_text(color = "white"),
    #axis.title  = element_text(color = "white"),
    #plot.title  = element_text(color = "white"),
    #plot.subtitle = element_text(color = "white"),
    #legend.text = element_text(color = "white"),
    #legend.title= element_text(color = "white"),
    #legend.background   = element_rect(fill = superhero_bg_color, color = NA),
    #legend.key          = element_rect(fill = superhero_bg_color, color = NA),
    #legend.box.background = element_rect(fill = superhero_bg_color, color = NA)
  #)

lang_funcs <- unnest_functions(master_table) %>%
  mutate(func = str_squish(func)) %>%
  filter(!is.na(func), func != "", func != "other",
         !is.na(Word.Order), Word.Order != "OVS") %>%
  distinct(url, Word.Order, func)

per_lang <- lang_funcs %>%
  group_by(url, Word.Order) %>%
  summarise(n_funcs = n_distinct(func), .groups = "drop")

p2 = ggplot(per_lang, aes(x = Word.Order, y = n_funcs)) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.15, outlier.shape = NA) +
  geom_jitter(width = 0.05, alpha = 0.35, size = 1) +
  labs(x = "Word order", y = "# redup. functions per language") +
  theme_minimal() +
  theme(
    plot.background = element_rect(fill = superhero_bg_color, color = NA),
    panel.background = element_rect(fill = "#333333"),
    axis.text   = element_text(color = "white"),
    axis.title  = element_text(color = "white"),
    plot.title  = element_text(color = "white"),
    plot.subtitle = element_text(color = "white"),
    legend.text = element_text(color = "white"),
    legend.title= element_text(color = "white"),
    legend.background   = element_rect(fill = superhero_bg_color, color = NA),
    legend.key          = element_rect(fill = superhero_bg_color, color = NA),
    legend.box.background = element_rect(fill = superhero_bg_color, color = NA)
  )
#wrap_plots(p1, p2, ncol = 2)
p2

Is this not the coolest chart? It’s called a violin plot. So, what does this tell us? Well, at least in this data set, languages with verb-initial word order tend to have a higher number of reduplication functions on average, with medians around 4, whereas SOV and SVO have medians closer to 2-3. Of course, while VSO has substantial data, VOS is very limited by sample size, so it’s important to be careful not to draw conclusions from this. Further, the overall sample size of the dataset to begin with is pretty small, so it’s important not to make overgeneralizations, like, “this proves that VSO and VOS languages have more verbose reduplication repertoires”.
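
Since a violin plot does not show how many points sit behind each shape, it can help to print the group sizes next to the medians. The sketch below does that on a made-up data frame shaped like per_lang (the numbers are invented purely for illustration and are not the notebook's results).

```r
# Hypothetical per-language function counts, grouped by word order
per_lang_toy <- data.frame(
  word_order = c(rep("SOV", 6), rep("SVO", 5), rep("VSO", 4), rep("VOS", 2)),
  n_funcs    = c(2, 3, 2, 1, 3, 2,   3, 2, 2, 4, 3,   4, 5, 3, 4,   4, 5)
)

# group size and median, one entry per word order
group_n      <- tapply(per_lang_toy$n_funcs, per_lang_toy$word_order, length)
group_median <- tapply(per_lang_toy$n_funcs, per_lang_toy$word_order, median)

data.frame(
  word_order = names(group_n),
  n          = as.vector(group_n),
  median     = as.vector(group_median)
)
```

A median computed from two languages looks exactly as confident on the plot as one computed from twenty, which is why the table is a useful companion.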

area_redup <- master_table %>%
  mutate(
    Area_1 = purrr::map_chr(Area, ~{
      x <- .x
      if (is.null(x) || length(x) < 1) NA_character_ else as.character(x[[1]])
    })
  ) %>%
  filter(!is.na(Area_1), !is.na(Redup.Type)) %>%
  count(Area_1, Redup.Type) %>%
  group_by(Area_1) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

ggplot(area_redup, aes(x = Area_1, y = prop, fill = Redup.Type)) +
  geom_col(position = "fill") +
  coord_flip() +
  labs(x = "Area", y = "Proportion within area", fill = "Redup type") +
  theme(
    plot.background = element_rect(fill = superhero_bg_color, color = NA),
    panel.background = element_rect(fill = "#333333"),
    axis.text   = element_text(color = "white"),
    axis.title  = element_text(color = "white"),
    plot.title  = element_text(color = "white"),
    plot.subtitle = element_text(color = "white"),
    legend.text = element_text(color = "white"),
    legend.title= element_text(color = "white"),
    legend.background   = element_rect(fill = superhero_bg_color, color = NA),
    legend.key          = element_rect(fill = superhero_bg_color, color = NA),
    legend.box.background = element_rect(fill = superhero_bg_color, color = NA)
  )

Conclusion

This notebook set out to do two things:

- Turn the Graz Database on Reduplication into a tidy, usable dataset
- Use that dataset to get a data-driven sketch of what reduplication does across languages

Even in a small sample, a few patterns pop out. Reduplication is most commonly a strategy for pluralization. Austronesian languages disproportionately utilize reduplication, and have a much wider net of functionality on a per-language basis. Reduplication is overall underutilized in Indo-European languages. Reduplication also serves a host of functions which, even within clean categories, can vary quite substantially in scope. It also takes on many morphological forms: sometimes the entire word gets repeated, sometimes it’s just the stem.

Important note

While this data is neat to look at, it’s really important to note that the sample of languages (n=82) is relatively small, and a lot of the extracted data was unlikely to correlate with reduplication functions to begin with. Another important disclaimer is that I cannot verify the authenticity of all of the entries on the database, nor the examples. There was a specific Portuguese example in the database (“bibichinho”), where I received direct feedback from L1 speakers that the listed example was unnatural. The authors of the database compiled information from a variety of sources, surely with varying degrees of authenticity. Nonetheless, I do hope this inspires further research into reduplication, as it seems to be an understudied area in linguistics.
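
To put a rough number on "relatively small": the family heatmap keeps families with as few as 5 languages, and a proportion estimated from 5 languages is very uncertain. The sketch below uses base R's `binom.test()` to compare the exact 95% confidence interval for a hypothetical 3 out of 5 against a hypothetical 49 out of 82 (roughly the same rate at the size of the whole sample). Both counts are invented for illustration.

```r
# exact (Clopper-Pearson) 95% intervals for two made-up proportions near 60%
ci_family <- binom.test(x = 3,  n = 5)$conf.int   # one small family
ci_sample <- binom.test(x = 49, n = 82)$conf.int  # whole-sample scale

round(ci_family, 2)
round(ci_sample, 2)

# interval widths: the small-family estimate is far less precise
diff(ci_family)
diff(ci_sample)
```

This treats languages as independent draws, which related languages are not, so if anything the real uncertainty is larger than these intervals suggest.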

Mapping Reduplication Across Languages

Michael Tombolini

2025