Preamble

Just loading some packages, nothing to see here.

Intro

Building an Introductory Example

In this section, I’ll take a moment to set up an example. It’s not necessary to understand why things are happening, so let’s just view my next code block as a fait accompli. I am merely showing you “how the sausage is made,” for clarity. The discussion that follows is the important bit.

Example

Consider the following table of English number words, ranked in alphabetical order.

## # A tibble: 10 x 5
##    Language AlphabeticalOrder Number NumberWord MatchingNumber
##    <chr>                <int>  <int> <chr>      <lgl>         
##  1 english                  1      8 eight      FALSE         
##  2 english                  2      5 five       FALSE         
##  3 english                  3      4 four       FALSE         
##  4 english                  4      9 nine       FALSE         
##  5 english                  5      1 one        FALSE         
##  6 english                  6      7 seven      FALSE         
##  7 english                  7      6 six        FALSE         
##  8 english                  8     10 ten        FALSE         
##  9 english                  9      3 three      FALSE         
## 10 english                 10      2 two        FALSE

We can see that the first word in the sorted table is “eight” (8), and the second is “five” (5). Continuing down the list, we see that the numerical and lexicographical orderings do not coincide for any of the numbers, if we consider only the first ten English number words.
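The table above can be rebuilt in a few lines of base R. This is only a sketch: the names `number_words`, `alpha_order` and `english` are mine, and I am assuming the sort is a plain `order()` on the words.

```r
# Sketch only: rebuild the English table from the ten number words.
number_words <- c("one", "two", "three", "four", "five",
                  "six", "seven", "eight", "nine", "ten")

# order() returns, for each alphabetical position, the index of the word
# that lands there. Since the words are stored in counting order, that
# index is also the number itself.
alpha_order <- order(number_words)

english <- data.frame(
    Language          = "english",
    AlphabeticalOrder = seq_along(number_words),
    Number            = alpha_order,
    NumberWord        = number_words[alpha_order],
    stringsAsFactors  = FALSE
)

# A number "matches" when its alphabetical and numerical positions coincide.
english$MatchingNumber <- english$AlphabeticalOrder == english$Number

print(english)
```

For English, every entry of `MatchingNumber` comes out `FALSE`, which is what the table shows.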

I would like to introduce the term “autoalphabeticity” to denote this concept: the count of number words whose alphabetical position matches their numerical value. It would then follow that English has an autoalphabeticity of 0 on the first 10 numbers.

This made me wonder if any language has a high autoalphabeticity. The purpose of this work is to determine the autoalphabeticity on the first 10 numbers of all the languages found on the website https://www.omniglot.com/language/numbers/index.htm .

Methods

Determining Autoalphabeticity

Check if a language is read right-to-left. If it’s in the list of languages below, we’ll flip it. If it’s not in the list, we’ll assume it’s left-to-right. I pulled the list of languages from https://en.wikipedia.org/wiki/Right-to-left by hand, and also checked them against the tables on www.omniglot.com/.

I’ll note that 3 of the languages in the list have an asterisk. Mandinka has an asterisk in front of it, to denote that it could also be written in an RTL script, but is written LTR on www.omniglot.com/. Punjabi and Saraiki can be written in LTR or RTL scripts, and both are represented (in a single column!) on the site. Since the LTR script is first and I’m only analyzing each language once, that means I’m sort of erasing the Punjabi and Saraiki RTL autoalphabeticity, but I don’t feel like trying to make weird exceptions for these two cases.
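Here is a rough sketch of the flipping step, reading “flip” as “reverse the characters of each word”. That reading is an assumption on my part. The two-entry `rtl_langs` vector is only an illustrative stand-in for the full hand-collected list, and `reverse_chars` and `flip_if_rtl` are hypothetical helper names.

```r
# Illustrative subset only; the full hand-collected list is not reproduced here.
rtl_langs <- c("arabic", "hebrew")

# Reverse the characters of each word in a character vector.
reverse_chars <- function(words) {
    vapply(
        strsplit(words, ""),
        function(chars) paste(rev(chars), collapse = ""),
        character(1)
    )
}

# Flip words only for languages in the RTL list; anything else is assumed LTR.
flip_if_rtl <- function(words, lang, rtl = rtl_langs) {
    if (lang %in% rtl) reverse_chars(words) else words
}

flip_if_rtl(c("abc", "de"), "hebrew")   # characters reversed
flip_if_rtl(c("abc", "de"), "english")  # left untouched
```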

So now we need a function to determine how autoalphabetical each language is (over the first 10 numbers).
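As a minimal sketch of what such a function might look like (the name `autoalphabeticity` and the exact column handling are my assumptions, not necessarily what was run for this analysis): sort the words alphabetically, then count the rows whose number equals their position in the sorted list. Note that `order()` on strings depends on the locale, so accented words may sort differently from machine to machine.

```r
# Sketch: count how many of the numbers 1-10 sit at their own alphabetical position.
# Assumes a data frame with a numeric `Number` column and a character `NumberWord` column.
autoalphabeticity <- function(df) {
    df <- df[df$Number %in% 1:10, ]        # keep only the first ten numbers
    df <- df[order(df$NumberWord), ]       # alphabetical order (locale-dependent)
    sum(df$Number == seq_len(nrow(df)))    # matches between position and number
}

# The Scots number words, as listed in the results table further down.
scots <- data.frame(
    Number     = 1:10,
    NumberWord = c("ane / wan", "twa", "three", "fower", "five",
                   "sax", "seiven", "echt", "nine", "ten"),
    stringsAsFactors = FALSE
)

autoalphabeticity(scots)  # 4, in line with the Scots rows in the results
```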

Retrieving Numerical Tables

Our function to process languages will start with the language name, and then grab the website, and filter down to the language table. From there we’ll clean up the table.

Before we write that function, let’s write a helper function to parse tables.

Most commonly, tables have one column called “Number” or “Numeral”, and another column called “Cardinal”. But these tables are probably made by hand, which means they are riddled with inconsistencies, and there are many tables that are exceptions to this norm.

Because programming is weird, let’s start with a list of our exceptions to what a “normal” table should include. These are things to try out if we meet a table that doesn’t match the above.

try_weird_cases <- function(df) {
    
    #a helper function
    contains_col <- function(name) is.element(name, colnames(df))
    
    #We'll try to break things down to a case-by-case basis
    
    #if we have only 2 cols, assume they're in the correct order.
    if (ncol(df) == 2) {
        return(df)
    
    #next, consider a format that is common for tables of indian languages.
    #check whether we have a column of numbers that is not named (which has been given
    #the name "num" in the process_lang function we haven't written yet),
    #and a column that is named "Number", which has number words. 
    } else if (contains_col("num") && contains_col("Number") && contains_col("Numeral")) {
        df <- df %>% 
            mutate(NumberWord = Number) %>% 
            select(num, NumberWord)
    
    #in some character languages such as chinese, I'm going to rely on pronunciation as
    #the guiding metric
    } else if (contains_col("Number") && contains_col("Pronunciation")) {
        df <- df %>% 
            mutate(NumberWord = Pronunciation) %>% 
            select(Number, NumberWord)
    
    #in some indian language tables, if we have number and numeral columns 
    #but no cardinal column, the number column has the number words
    } else if (contains_col("Number") && contains_col("Numeral")) {
        df <- df %>% 
            mutate(NumberWord = Number) %>% #copy the column itself, not the literal string "Number"
            select(Numeral, NumberWord)

    #in the case of korean and japanese, we'll look for some key words  
    #to be a marker for the cardinal numbers
    } else if (contains_col("Numeral")) {
        df <- df %>% 
            select(
                Numeral, 
                starts_with("Native"), 
                starts_with("General"), 
                contains("Counting")
            )
        
        colnames(df)[2] <- "NumberWord"
            
    }
    
    df
}

Next we’ll build the function that takes the language, builds a url, finds the table of numbers and number words, and then cleans said table to be usable.

Needless to say, the last bit is the trickiest. And the above function is an ancillary piece that aids in this process.

process_lang <- function(lang, baseURL = "https://www.omniglot.com/language/numbers/") {

    url <- paste0(baseURL, lang, ".htm")
    
    df <- read_html(url) %>% #read website
            html_node("table") %>% #find table
            html_table(fill = TRUE) #extract table as data frame
            
    
    #the tables on this site do follow very strong patterns
    #but, from the errors I've seen, I think that if there is 
    #an unnamed column it is the number column.
    
    #however, we can't always give it the name "Number" because that is sometimes 
    #the column name for a number written with different numbering system, and sometimes
    #the digits 1-10 are written in the "Numeral" column, not the unnamed column.
    
    #we'll try to circumvent all of this by making a "num" column, reorganizing the columns 
    #to be in alphabetical order, then selecting cols that start with "num". This is 
    #particularly tricky because data frames that have an empty column name cause R to break 
    #easily.
    if (is.element("", colnames(df))) {
        
        df <- df[ , order(names(df))]
        colnames(df)[1] <- "num"
    }
    
    #some tables have masculine and feminine numbers listed as different subcolumns within
    #one column. These get parsed as two columns with the same name. 
    #We'll use bind_cols to append a number suffix to the second of these
    #twin columns, and then drop the second one.
    df <- df %>% 
        bind_cols() %>% #this appends a number suffix to col names that had the same name.
        select(-ends_with("1")) #drop all cols that end with the number 1.
    
    tmp <- df %>% select(starts_with("num"), contains("cardinal"))
    
    if (ncol(tmp) >= 2) {
        df <- tmp
        
    } else {
        df <- try_weird_cases(df)
        
    }
        
    colnames(df)[1] <- "Number"
    df$Number <- df$Number %>% as.numeric()
    
    
    colnames(df)[2] <- "NumberWord"
    
    df
    
}
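One caveat on the twin-column step: it leans on whatever suffix `bind_cols()` happens to append to duplicated names, and I believe that can differ between dplyr versions. As an alternative sketch (not what the function above does), base R can keep the first of each set of same-named columns without relying on suffixes at all. `drop_twin_columns` and the toy data frame are made up for illustration.

```r
# Keep only the first column for each column name; later twins are dropped.
drop_twin_columns <- function(df) {
    df[, !duplicated(names(df)), drop = FALSE]
}

# Toy table with a doubled "Cardinal" column, like a masculine/feminine split.
toy <- data.frame(
    Numeral  = 1:2,
    Cardinal = c("a", "b"),
    Cardinal = c("c", "d"),
    check.names      = FALSE,   # allow the duplicated name
    stringsAsFactors = FALSE
)

names(drop_twin_columns(toy))
```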

Analysis

Now let’s get to processing! We’ve built this thing, now let’s set it on an unsuspecting website and grab our sweet sweet number words.

I’ve saved the data so I don’t have to reprocess it every time I update this file. It takes more than 5 mins to run each time. The price of success.
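The save-and-reload pattern looks roughly like this. It is a sketch: `load_or_compute`, the temporary file path, and the stand-in `slow_step` are all hypothetical, and the real run would put the scraping loop where `slow_step` is.

```r
# Return the cached result if it exists; otherwise compute it and cache it.
load_or_compute <- function(path, compute) {
    if (file.exists(path)) {
        return(readRDS(path))
    }
    result <- compute()
    saveRDS(result, path)
    result
}

cache_file <- tempfile(fileext = ".rds")  # hypothetical cache location

# Stand-in for the slow scraping step; counts how often it actually runs.
calls <- 0
slow_step <- function() {
    calls <<- calls + 1
    data.frame(langs = "english", autoalphabeticity_score = 0,
               stringsAsFactors = FALSE)
}

first  <- load_or_compute(cache_file, slow_step)  # runs slow_step and saves
second <- load_or_compute(cache_file, slow_step)  # reads the saved file instead
```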

After analysis we see some interesting results!

## # A tibble: 533 x 2
##    langs     autoalphabeticity_score
##    <chr>                       <dbl>
##  1 abaza                           0
##  2 abellen                         0
##  3 abenaki                         0
##  4 abkhaz                          1
##  5 abui                            2
##  6 acholi                          0
##  7 adaizan                         2
##  8 adyghe                          0
##  9 afar                            2
## 10 afrikaans                       0
## # ... with 523 more rows

Above we can see a few real results, whereas below, we have the few languages that didn’t process properly.

## # A tibble: 8 x 2
##   langs           autoalphabeticity_score
##   <chr>                             <dbl>
## 1 adi                                  NA
## 2 malayindonesian                      NA
## 3 javanese                             NA
## 4 latvia                               NA
## 5 malayindonesian                      NA
## 6 montagnais                           NA
## 7 nivacle                              NA
## 8 paraujano                            NA
## Of the 541 languages listed on the website index, only 8 resulted in error. This means that a
## whopping 533 languages, accounting for 98.5% of the total available, were used for analysis.
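One way to end up with `NA` for a language whose page fails to parse is to wrap the per-language call in `tryCatch()`. The sketch below uses a toy scorer in place of the real scraping function, so `safe_score` and `toy_scorer` are hypothetical names and nothing here touches the network.

```r
# Run a scoring function on one language; return NA instead of stopping on error.
safe_score <- function(lang, score_fun) {
    tryCatch(score_fun(lang), error = function(e) NA_real_)
}

# Toy stand-in for the real scorer: fails for one made-up language name.
toy_scorer <- function(lang) {
    if (lang == "broken") stop("no usable table found")
    0
}

langs  <- c("english", "broken")
scores <- vapply(langs, safe_score, numeric(1), score_fun = toy_scorer)

failed <- names(scores)[is.na(scores)]   # languages that did not process
failed

# Share of languages analyzed, using the counts reported above.
round(100 * 533 / 541, 1)
```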

A histogram shows that most languages (389 of 533, roughly 73%) have an autoalphabeticity score of 0 or 1, and the counts fall off quickly from there.

Digging into the numbers directly, we see that there are eight languages with a score of 4.

## 
##   0   1   2   3   4 
## 208 181  92  44   8
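For anyone who wants to poke at the distribution without re-scraping, the score vector can be rebuilt from the counts above. This is a sketch with my own variable names.

```r
# Counts per score, copied from the table above.
counts <- c("0" = 208, "1" = 181, "2" = 92, "3" = 44, "4" = 8)

# Expand the counts back into one score per language.
scores <- rep(as.numeric(names(counts)), times = counts)

table(scores)                    # same tabulation as above
sum(scores <= 1)                 # languages scoring 0 or 1
round(100 * mean(scores <= 1))   # their share of all analyzed languages, in percent
```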

Let’s take a moment to look at these highly autoalphabetic languages.

## # A tibble: 8 x 2
##   langs    autoalphabeticity_score
##   <chr>                      <dbl>
## 1 finnish                        4
## 2 juhuri                         4
## 3 karelian                       4
## 4 lingala                        4
## 5 nkore                          4
## 6 phom                           4
## 7 scots                          4
## 8 vai                            4

We’ll get the inside scoop by marvelling at the ordering of these languages’ number words firsthand.

Language  AlphabeticalOrder  Number  NumberWord                    MatchingNumber
finnish                   1       8  kahdeksan                     FALSE
finnish                   2       2  kaksi                         TRUE
finnish                   3       3  kolme                         TRUE
finnish                   4       6  kuusi                         FALSE
finnish                   5      10  kymmenen                      FALSE
finnish                   6       4  neljä                         FALSE
finnish                   7       7  seitsemän                     TRUE
finnish                   8       5  viisi                         FALSE
finnish                   9       9  yhdeksän                      TRUE
finnish                  10       1  yksi                          FALSE
juhuri                    1      10  дегь(dəh)                     FALSE
juhuri                    2       2  дуь, дуьдуь(dy, dydy)         TRUE
juhuri                    3       1  ек, еки, той(jək, jəki, toy)  FALSE
juhuri                    4       9  нуьгь(nyh)                    FALSE
juhuri                    5       5  пенж(pənç)                    TRUE
juhuri                    6       3  се, сесе(sə, səsə)            FALSE
juhuri                    7       7  хьофд(ħofd)                   TRUE
juhuri                    8       8  хьэшд(ħəşd)                   TRUE
juhuri                    9       4  чор(cor)                      FALSE
juhuri                   10       6  шеш(şəş)                      FALSE
karelian                  1       8  kahekšan                      FALSE
karelian                  2       2  kakši                         TRUE
karelian                  3       3  kolme                         TRUE
karelian                  4       6  kuuǯi                         FALSE
karelian                  5      10  kymmenen                      FALSE
karelian                  6       4  neljjä                        FALSE
karelian                  7       7  seiččemän                     TRUE
karelian                  8       5  viizi                         FALSE
karelian                  9       9  yhekšän                       TRUE
karelian                 10       1  yksi                          FALSE
lingala                   1       9  libwá                         FALSE
lingala                   2       2  míbalé                        TRUE
lingala                   3       4  mínei                         FALSE
lingala                   4       3  mísató                        FALSE
lingala                   5       5  mítáno                        TRUE
lingala                   6       1  mókó                          FALSE
lingala                   7       6  motóba                        FALSE
lingala                   8       8  nwámbe                        TRUE
lingala                   9       7  sámbó, nsámbó                 FALSE
lingala                  10      10  zómi                          TRUE
nkore                     1       1  emwe                          TRUE
nkore                     2       2  ibiri                         TRUE
nkore                     3      10  ikumi                         FALSE
nkore                     4       4  ina                           TRUE
nkore                     5       3  ishatu                        FALSE
nkore                     6       5  itaano                        FALSE
nkore                     7       6  mukaaga                       FALSE
nkore                     8       8  munaana                       TRUE
nkore                     9       7  mushanju                      FALSE
nkore                    10       9  mwenda                        FALSE
phom                      1       4  ali, ǝli                      FALSE
phom                      2      10  an, ʌn                        FALSE
phom                      3       3  chem, cʌm                     TRUE
phom                      4       1  hük, hɨk                      FALSE
phom                      5       5  nga, ŋa                       TRUE
phom                      6       7  nyet, ɲʌt                     FALSE
phom                      7       2  nyi, ɲi                       FALSE
phom                      8       8  shet, Šʌt                     TRUE
phom                      9       9  shü, Šɯ                       TRUE
phom                     10       6  vok, wɔk                      FALSE
scots                     1       1  ane / wan                     TRUE
scots                     2       8  echt                          FALSE
scots                     3       5  five                          FALSE
scots                     4       4  fower                         TRUE
scots                     5       9  nine                          FALSE
scots                     6       6  sax                           TRUE
scots                     7       7  seiven                        TRUE
scots                     8      10  ten                           FALSE
scots                     9       3  three                         FALSE
scots                    10       2  twa                           FALSE
vai                       1       1  dóndolɔ̀ndɔ́                    TRUE
vai                       2       2  férafɛ̀(ʔ)á                    TRUE
vai                       3       4  nánináánì                     FALSE
vai                       4       3  ságbasàk͡pá                    FALSE
vai                       5       5  sōrusóó(ʔ)ú                   TRUE
vai                       6       7  sumférasɔ̂ŋ fɛ̀(ʔ)á (5 + 2)     FALSE
vai                       7       6  sūndóndosɔ̂ŋ lɔ̀ndɔ́ (5 + 1)     FALSE
vai                       8       9  sūnnánisɔ̂ŋ náánì (5 + 4)      FALSE
vai                       9       8  sūnságbasɔ̂ŋ sàk͡pá (5 + 3)     FALSE
vai                      10      10  tantâŋ                        TRUE

I haven’t reviewed all the possible corner cases, so some languages could be categorized incorrectly. It could be a really big shame.

¯\_(ツ)_/¯

Conclusion

It’s really uncommon for a language to be highly autoalphabetical over the first 10 counting numbers: only 8 of the 533 languages analyzed reach a score of 4, and none score higher. Although having an autoalphabeticity score of 0 is the mode over the first 10 counting numbers, it would be interesting to see the mode over the first hundred or thousand counting numbers.

Also, I have reason to believe that of the 8 listed languages that weren’t analyzed, some had broken web-pages, but others had different formatting than my scripting expected. I could program the exceptions in by hand, but there’s no real fun in that. :P

If you were looking for some sort of punch line, some deeper meaning behind this work, then I’ve wasted your time quite thoroughly. I don’t see any application of this work, except my own amusement, and the amusement of others like me. :D