Preamble

Just loading some packages, nothing to see here.

pacman::p_load(rvest, stringr, dplyr, stringi, ggplot2, knitr)

Intro

Building an Introductory Example

In this section, I’ll take a moment to set up an example. It’s not necessary to understand why things are happening, so let’s just view my next code block as a fait accompli. I am merely “showing you how the sausage is made”, for clarity. The discussion that follows is the important bit.

autoalphabeticize_df <- function(df, lang = "english") {
    
    df %>% 
        filter(is.element(Number, 1:10)) %>% 
        arrange(NumberWord) %>% 
        mutate(
            AlphabeticalOrder = 1:10, 
            MatchingNumber = AlphabeticalOrder == Number,
            Language = rep(lang, 10)
        ) %>% 
        select(Language, AlphabeticalOrder, everything())
}

NumberWord = c("one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten")

example_df <- autoalphabeticize_df(tibble(Number = 1:10, NumberWord))

Example

Consider the following table of English number words, ranked in alphabetical order.

example_df

## # A tibble: 10 x 5
##    Language AlphabeticalOrder Number NumberWord MatchingNumber
##    <chr>                <int>  <int> <chr>      <lgl>         
##  1 english                  1      8 eight      FALSE         
##  2 english                  2      5 five       FALSE         
##  3 english                  3      4 four       FALSE         
##  4 english                  4      9 nine       FALSE         
##  5 english                  5      1 one        FALSE         
##  6 english                  6      7 seven      FALSE         
##  7 english                  7      6 six        FALSE         
##  8 english                  8     10 ten        FALSE         
##  9 english                  9      3 three      FALSE         
## 10 english                 10      2 two        FALSE

We can see that the first word in the sorted table is “eight” (8), and the second is “five” (5). Continuing down the list, we see that the numerical and lexicographical orderings do not coincide for any of the first ten English number words.

I would like to introduce the term “autoalphabeticity” to denote the number of number words whose position in alphabetical order matches their numerical value. It would then follow that English has an autoalphabeticity of 0 on the first 10 numbers.
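As a minimal sketch of this idea in base R: sort the words, then count the positions where the alphabetical rank equals the number itself. The helper name `aa_score` is my own shorthand for this illustration and isn't used elsewhere in this post, and `method = "radix"` is an assumption I've added so the ordering doesn't depend on the locale.

```r
# Hypothetical helper: counts how many numbers keep their position
# when their words are sorted alphabetically.
aa_score <- function(words) {
    # order() gives, for each alphabetical rank, the number whose word lands there
    alphabetical <- order(words, method = "radix")
    sum(alphabetical == seq_along(words))
}

english <- c("one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten")
aa_score(english)
## [1] 0
```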

This made me wonder if any language has a high autoalphabeticity. The purpose of this work is to determine the autoalphabeticity on the first 10 numbers of all the languages found on the website https://www.omniglot.com/language/numbers/index.htm .

Methods

Gathering Data

We start by getting a list of all the languages.

langs <- read_html("https://www.omniglot.com/language/numbers/index.htm") %>% #get website index
    html_nodes("div ol li p a") %>% #navigate to the appropriate nodes
    html_attr("href") %>% #get the href tag
    .[!is.na(.)] %>%  #remove things that didn't have href tag
    str_split_fixed("\\.", 2) %>% #remove .htm from all language names
    .[ , 1] #grab the first column (the one with all the language names)

length(langs)

## [1] 541

There are 541 languages with numbering systems available for inspection on this website.

Determining Autoalphabeticity

First, we need to check whether a language is written right-to-left. If it’s in the list of languages below, we’ll reverse its number words before sorting. If it’s not in the list, we’ll assume it’s left-to-right. I pulled the list of languages from https://en.wikipedia.org/wiki/Right-to-left by hand, and also checked them against the tables on www.omniglot.com/.

is_rtl <- function(lang) {
    
    rtl_langs <- c(
        "Arabic", 
        "Garshuni", 
        "Karshuni", 
        "Kazakh", 
        "Kurdish", 
        "Kyrgyz", 
        "Turkmen", 
        "Uzbek", 
        "Persian", 
        "Baluchi", 
        "Pashto", 
        "Azerbaijani", 
        "Talysh", 
        "Uyghur", 
        "Dari", 
        "Kashmiri", 
        "Punjabi*", 
        "Saraiki", 
        "Sindhi*", 
        "Urdu", 
        "Ida'an", 
        "Idahan", 
        "Cham", 
        "Tuareg", 
        "Tamasheq", 
        "Bedawi", 
        "Beja", 
        "Dongolawi", 
        "Andaandi", 
        "Nobiin", 
        "Zarma", 
        "Tadaksahak", 
        "Hausa", 
        "Dyula", 
        "Jola-Fonyi", 
        "Balanta", 
        "*Mandinka", 
        "Fula", 
        "Ajami", 
        "Wolofal" 
    ) %>% 
        tolower()
    
    if (is.element(lang, rtl_langs)) return(TRUE)
    
    FALSE
}

I’ll note that 3 of the languages in the list have an asterisk. Mandinka has an asterisk in front of it to denote that it could also be written in an RTL script, but is written LTR on www.omniglot.com/. Punjabi and Sindhi have a trailing asterisk because they can be written in LTR or RTL scripts, and both are represented (in a single column!) on the site. Since the LTR script comes first and I’m only analyzing each language once, I’m sort of erasing the RTL autoalphabeticity of these two languages, but I don’t feel like trying to make weird exceptions for them.
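One side effect of the asterisks is worth spelling out: `is_rtl()` compares whole strings, so an asterisked entry never matches the plain language name, and those languages end up treated as left-to-right. Here is a small sketch with a shortened copy of the list. I'm assuming here that the site's names for these languages are simply the lowercase words shown.

```r
# A shortened copy of the list, for illustration only
rtl_sample <- tolower(c("Arabic", "Urdu", "Punjabi*", "*Mandinka"))

is.element("urdu", rtl_sample)      # TRUE: plain entries match lowercase names
is.element("punjabi", rtl_sample)   # FALSE: "punjabi*" is a different string
is.element("mandinka", rtl_sample)  # FALSE: "*mandinka" is a different string
```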

So now we need a function to determine how autoalphabetical each language is (over the first 10 numbers).

autoalphabeticity <- function(df, rtl = FALSE){
    
    #reassign df, otherwise the reversed words are computed and then thrown away
    if (rtl) df <- df %>% mutate(NumberWord = stri_reverse(NumberWord))
    
    df %>% 
        filter(is.element(Number, 1:10)) %>%
        .[1:10, ] %>% #only select the first 10 (avoids dealing with 1.000 and the like)
        arrange(NumberWord) %>% #order data frame by number word (cardinal)
        mutate(AlphabeticalOrder = 1:10) %>% 
        filter(AlphabeticalOrder == Number) %>% 
        nrow()
}
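To illustrate what the reversal step does to the sort, here is a small base-R sketch. The helper `reverse_string` is a hypothetical stand-in for `stri_reverse`, and I'm using English words only because they're easy to read. Reversing each word means the sort effectively compares words starting from their last character.

```r
# Hypothetical base-R stand-in for stringi::stri_reverse()
reverse_string <- function(x) {
    vapply(x, function(w) intToUtf8(rev(utf8ToInt(w))), character(1), USE.NAMES = FALSE)
}

words <- c("one", "two", "three")
reversed <- reverse_string(words)   # "eno" "owt" "eerht"

order(words, method = "radix")      # 1 3 2  (one, three, two)
order(reversed, method = "radix")   # 3 1 2  (eerht, eno, owt)
```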

Retrieving Numerical Tables

Our function to process languages will start with the language name, and then grab the website, and filter down to the language table. From there we’ll clean up the table.

Before we write that function, let’s write a helper function to parse tables.

Most commonly, tables have one column called “Number” or “Numeral”, and another column called “Cardinal”. But these tables are probably made by hand, which means they are riddled with inconsistencies, and there are many tables that are exceptions to this norm.

Because programming is weird, let’s start with a list of our exceptions to what a “normal” table should include. These are things to try out if we meet a table that doesn’t match the above.

try_weird_cases <- function(df) {
    
    #a helper function
    contains_col <- function(name) is.element(name, colnames(df))
    
    #We'll try to break things down to a case-by-case basis
    
    #if we have only 2 cols, assume they're in the correct order.
    if (ncol(df) == 2) {
        return(df)
    
    #next, consider a format that is common for tables of indian languages.
    #check whether we have a column of numbers that is not named (which has been given
    #the name "num" in the process_lang function we haven't written yet),
    #and a column that is named "number", which has number words. 
    } else if (contains_col("num") & contains_col("Number") & contains_col("Numeral")) {
        df <- df %>% 
            mutate(NumberWord = Number) %>% 
            select(num, NumberWord)
    
    #in some character languages such as chinese, I'm going to rely on pronunciation as
    #the guiding metric
    } else if (contains_col("Number") & contains_col("Pronunciation")) {
        df <- df %>% 
            mutate(NumberWord = Pronunciation) %>% 
            select(Number, NumberWord)
    
    #in some indian language tables, if we have number and numeral columns 
    #but no cardinal column, the number column has the number words
    } else if (contains_col("Number") & contains_col("Numeral")) {
        df <- df %>% 
            mutate(NumberWord = Number) %>% #copy the column itself, not the string "Number"
            select(Numeral, NumberWord)

    #in the case of korean and japanese, we'll look for some key words  
    #to be a marker for the cardinal numbers
    } else if (contains_col("Numeral")) {
        df <- df %>% 
            select(
                Numeral, 
                starts_with("Native"), 
                starts_with("General"), 
                contains("Counting")
            )
        
        colnames(df)[2] <- "NumberWord"
            
    }
    
    df
}

Next we’ll build the function that takes the language, builds a url, finds the table of numbers and number words, and then cleans said table to be usable.

Needless to say, the last bit is the trickiest. And the above function is an ancillary piece that aids in this process.

process_lang <- function(lang, baseURL = "https://www.omniglot.com/language/numbers/") {

    url <- paste0(baseURL, lang, ".htm")
    
    df <- read_html(url) %>% #read website
            html_node("table") %>% #find table
            html_table(fill = TRUE) #extract table as data frame
            
    
    #the tables on this site do follow very strong patterns
    #but, from the errors I've seen, I think that if there is 
    #an unnamed column it is the number column.
    
    #however, we can't always give it the name "Number" because that is sometimes 
    #the column name for a number written with different numbering system, and sometimes
    #the digits 1-10 are written in the "Numeral" column, not the unnamed column.
    
    #we'll try to circumvent all of this by making a "num" column, reorganizing the columns 
    #to be in alphabetical order, then selecting cols that start with "num". This is 
    #particularly tricky because data frames that have an empty column name cause R to break 
    #easily.
    if (is.element("", colnames(df))) {
        
        df <- df[ , order(names(df))]
        colnames(df)[1] <- "num"
    }
    
    #some tables have masculine and feminine numbers listed as different subcolumns within
    #one column. These get parsed as two columns with the same name. 
    #We'll use bind_cols to append a number suffix to the second of these
    #twin columns, and then drop the second one.
    df <- df %>% 
        bind_cols() %>% #this appends a number suffix to col names that had the same name.
        select(-ends_with("1")) #drop all cols that end with the number 1.
    
    tmp <- df %>% select(starts_with("num"), contains("cardinal"))
    
    if (ncol(tmp) >= 2) {
        df <- tmp
        
    } else {
        df <- try_weird_cases(df)
        
    }
        
    colnames(df)[1] <- "Number"
    df$Number <- df$Number %>% as.numeric()
    
    
    colnames(df)[2] <- "NumberWord"
    
    df
    
}
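The twin-column step leans on how `bind_cols()` renames repeated columns, which may depend on the dplyr version installed. As a hedged alternative, here is a base-R sketch of the same idea using `make.unique()`. The toy table and its values are made up for illustration and are not taken from the site.

```r
# Toy table (made-up values) where one header was parsed into two same-named columns
toy <- data.frame(
    c("1", "2", "1.000"),
    c("uno", "dos", "mil"),
    c("una", "dos", "mil"),
    stringsAsFactors = FALSE
)
names(toy) <- c("Numeral", "Cardinal", "Cardinal")

# make.unique() appends ".1", ".2", ... to repeated names
names(toy) <- make.unique(names(toy))   # "Numeral" "Cardinal" "Cardinal.1"

# keep only the first of each set of twin columns
toy <- toy[ , !grepl("\\.[0-9]+$", names(toy)), drop = FALSE]

# note that "1.000" parses as the number 1, which is why the scoring
# function keeps only the first 10 rows that match 1:10
toy$Number <- as.numeric(toy$Numeral)
```

Dropping by a `.digit` suffix would also remove any genuine column whose name ends that way, so this is only a sketch under the assumption that no such column exists.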

Analysis

Now let’s get to processing! We’ve built this thing, now let’s set it on an unsuspecting website and grab our sweet sweet number words.

autoalphabeticity_score <- rep(-1, length(langs))
x <- length(langs)

for (i in 1:x) {
    
    autoalphabeticity_score[i] <- tryCatch(
        
        process_lang(langs[i]) %>% 
            autoalphabeticity(rtl = is_rtl(langs[i])), 
        
        error = function(e) return(NA)
    )
    
    print(str_glue("{langs[i]} language is complete! {i}/{x}"))
}

save(autoalphabeticity_score, file = "autoalphabeticity_score.RData")

I’ve saved the data so I don’t have to reprocess it every time I update this file. It takes more than 5 mins to run each time. The price of success.
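A common way to avoid rerunning the slow step by hand is to guard it with `file.exists()`. The sketch below is an assumption about how that could look, not what I actually ran: `slow_scrape` and `get_scores` are hypothetical names, the scores are placeholders, and a temporary file is used so the example is safe to run anywhere.

```r
# Hypothetical placeholder for the slow scraping loop
slow_scrape <- function() c(0, 1, NA, 2)

cache_file <- file.path(tempdir(), "autoalphabeticity_score_demo.rds")

get_scores <- function(path) {
    if (file.exists(path)) {
        readRDS(path)            # reuse the cached result
    } else {
        scores <- slow_scrape()  # compute once...
        saveRDS(scores, path)    # ...and cache it for next time
        scores
    }
}

first_run  <- get_scores(cache_file)  # computes and writes the cache (if not already there)
second_run <- get_scores(cache_file)  # reads the cache
```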

load(file = "autoalphabeticity_score.RData")

After analysis we see some interesting results!

raw_results <- tibble(langs, autoalphabeticity_score)

results <- raw_results %>% 
    filter(!is.na(autoalphabeticity_score))

na_results <- raw_results %>% filter(is.na(autoalphabeticity_score))

results

## # A tibble: 533 x 2
##    langs     autoalphabeticity_score
##    <chr>                       <dbl>
##  1 abaza                           0
##  2 abellen                         0
##  3 abenaki                         0
##  4 abkhaz                          1
##  5 abui                            2
##  6 acholi                          0
##  7 adaizan                         2
##  8 adyghe                          0
##  9 afar                            2
## 10 afrikaans                       0
## # ... with 523 more rows

Above we can see a few real results, whereas below, we have the few languages that didn’t process properly.

na_results

## # A tibble: 8 x 2
##   langs           autoalphabeticity_score
##   <chr>                             <dbl>
## 1 adi                                  NA
## 2 malayindonesian                      NA
## 3 javanese                             NA
## 4 latvia                               NA
## 5 malayindonesian                      NA
## 6 montagnais                           NA
## 7 nivacle                              NA
## 8 paraujano                            NA

## Of the 541 languages listed on the website index, only 8 resulted in error. This means that a
## whopping 533 languages, accounting for 98.5% of the total available, were used for analysis.

A histogram shows that most languages (389 of 533, about 73%) have an autoalphabeticity score of 0 or 1, but there are a handful with more.

ggplot(results, aes(autoalphabeticity_score)) + geom_bar() + theme_minimal()

Digging into the numbers directly, we see that there are eight languages with a score of 4.

table(results$autoalphabeticity_score)

## 
##   0   1   2   3   4 
## 208 181  92  44   8
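As a quick sanity check on these counts, here is a short sketch that recomputes the totals and percentages. The numbers are copied by hand from the output above and from the language count on the index page.

```r
# Counts copied from the table() output above
score_counts <- c("0" = 208, "1" = 181, "2" = 92, "3" = 44, "4" = 8)

n_scored <- sum(score_counts)   # 533 languages with a score
n_listed <- 541                 # languages on the index page

# percent of listed languages that were successfully processed
pct_processed <- round(100 * n_scored / n_listed, 1)   # 98.5

# percent of scored languages with a score of 0 or 1
pct_low <- round(100 * sum(score_counts[c("0", "1")]) / n_scored, 1)   # 73
```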

Let’s take a moment to look at these highly autoalphabetic languages.

high_aa_langs <- results %>% 
    filter(autoalphabeticity_score >= 4) %>% 
    arrange(desc(autoalphabeticity_score))

high_aa_langs

## # A tibble: 8 x 2
##   langs    autoalphabeticity_score
##   <chr>                      <dbl>
## 1 finnish                        4
## 2 juhuri                         4
## 3 karelian                       4
## 4 lingala                        4
## 5 nkore                          4
## 6 phom                           4
## 7 scots                          4
## 8 vai                            4

We’ll get the inside scoop by marvelling at the ordering of these languages’ number words firsthand.

get_lang_df <- function(lang) process_lang(lang) %>% autoalphabeticize_df(lang)

high_aa_df <- lapply(high_aa_langs$langs, get_lang_df) %>% 
    bind_rows() #put languages together in one data frame

#I use kable to make utf8 characters render properly.
kable(
    high_aa_df %>% mutate(NumberWord = enc2utf8(NumberWord)),
    align = c("l", "c", "c", "l", "c")
)

Language	AlphabeticalOrder	Number	NumberWord	MatchingNumber
finnish	1	8	kahdeksan	FALSE
finnish	2	2	kaksi	TRUE
finnish	3	3	kolme	TRUE
finnish	4	6	kuusi	FALSE
finnish	5	10	kymmenen	FALSE
finnish	6	4	neljä	FALSE
finnish	7	7	seitsemän	TRUE
finnish	8	5	viisi	FALSE
finnish	9	9	yhdeksän	TRUE
finnish	10	1	yksi	FALSE
juhuri	1	10	дегь(dəh)	FALSE
juhuri	2	2	дуь, дуьдуь(dy, dydy)	TRUE
juhuri	3	1	ек, еки, той(jək, jəki, toy)	FALSE
juhuri	4	9	нуьгь(nyh)	FALSE
juhuri	5	5	пенж(pənç)	TRUE
juhuri	6	3	се, сесе(sə, səsə)	FALSE
juhuri	7	7	хьофд(ħofd)	TRUE
juhuri	8	8	хьэшд(ħəşd)	TRUE
juhuri	9	4	чор(cor)	FALSE
juhuri	10	6	шеш(şəş)	FALSE
karelian	1	8	kahekšan	FALSE
karelian	2	2	kakši	TRUE
karelian	3	3	kolme	TRUE
karelian	4	6	kuuǯi	FALSE
karelian	5	10	kymmenen	FALSE
karelian	6	4	neljjä	FALSE
karelian	7	7	seiččemän	TRUE
karelian	8	5	viizi	FALSE
karelian	9	9	yhekšän	TRUE
karelian	10	1	yksi	FALSE
lingala	1	9	libwá	FALSE
lingala	2	2	míbalé	TRUE
lingala	3	4	mínei	FALSE
lingala	4	3	mísató	FALSE
lingala	5	5	mítáno	TRUE
lingala	6	1	mókó	FALSE
lingala	7	6	motóba	FALSE
lingala	8	8	nwámbe	TRUE
lingala	9	7	sámbó, nsámbó	FALSE
lingala	10	10	zómi	TRUE
nkore	1	1	emwe	TRUE
nkore	2	2	ibiri	TRUE
nkore	3	10	ikumi	FALSE
nkore	4	4	ina	TRUE
nkore	5	3	ishatu	FALSE
nkore	6	5	itaano	FALSE
nkore	7	6	mukaaga	FALSE
nkore	8	8	munaana	TRUE
nkore	9	7	mushanju	FALSE
nkore	10	9	mwenda	FALSE
phom	1	4	ali, ǝli	FALSE
phom	2	10	an, ʌn	FALSE
phom	3	3	chem, cʌm	TRUE
phom	4	1	hük, hɨk	FALSE
phom	5	5	nga, ŋa	TRUE
phom	6	7	nyet, ɲʌt	FALSE
phom	7	2	nyi, ɲi	FALSE
phom	8	8	shet, Šʌt	TRUE
phom	9	9	shü, Šɯ	TRUE
phom	10	6	vok, wɔk	FALSE
scots	1	1	ane / wan	TRUE
scots	2	8	echt	FALSE
scots	3	5	five	FALSE
scots	4	4	fower	TRUE
scots	5	9	nine	FALSE
scots	6	6	sax	TRUE
scots	7	7	seiven	TRUE
scots	8	10	ten	FALSE
scots	9	3	three	FALSE
scots	10	2	twa	FALSE
vai	1	1	dóndolɔ̀ndɔ́	TRUE
vai	2	2	férafɛ̀(ʔ)á	TRUE
vai	3	4	nánináánì	FALSE
vai	4	3	ságbasàk͡pá	FALSE
vai	5	5	sōrusóó(ʔ)ú	TRUE
vai	6	7	sumférasɔ̂ŋ fɛ̀(ʔ)á (5 + 2)	FALSE
vai	7	6	sūndóndosɔ̂ŋ lɔ̀ndɔ́ (5 + 1)	FALSE
vai	8	9	sūnnánisɔ̂ŋ náánì (5 + 4)	FALSE
vai	9	8	sūnságbasɔ̂ŋ sàk͡pá (5 + 3)	FALSE
vai	10	10	tantâŋ	TRUE

I haven’t reviewed all the possible corner cases, so some languages could be categorized incorrectly. That would be a real shame.

‾\_(ツ)_/‾

Conclusion

It’s really uncommon for a language to be highly autoalphabetical over the first 10 counting numbers. Although having an autoalphabeticity score of 0 is the mode over the first 10 counting numbers, it would be interesting to see the mode over the first hundred or thousand counting numbers.

Also, I have reason to believe that of the 8 listed languages that weren’t analyzed, some had broken web-pages, but others had different formatting than my scripting expected. I could program the exceptions in by hand, but there’s no real fun in that. :P

If you were looking for some sort of punch line, some deeper meaning behind this work, then I’ve wasted your time quite thoroughly. I don’t see any application of this work, except my own amusement, and the amusement of others like me. :D

Autoalphabeticity Over the First 10 Counting Numbers

Kenneth L Osborne

February 14, 2020