Preamble
Just loading some packages, nothing to see here.
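The package-loading code itself is not shown in this post. Judging only from the functions called later on, a preamble along these lines would probably cover them. The exact package list is an assumption, since the original preamble is hidden.

```r
# Assumed preamble: the post does not show which packages were actually loaded.
library(dplyr)    # %>%, filter(), arrange(), mutate(), select(), bind_cols(), bind_rows()
library(tibble)   # tibble()
library(rvest)    # read_html(), html_nodes(), html_node(), html_attr(), html_table()
library(stringr)  # str_split_fixed(), str_glue()
library(stringi)  # stri_reverse()
library(knitr)    # kable()
```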
Intro
Building an Introductory Example
In this section, I’ll take a moment to set up an example. It’s not necessary to understand why things are happening, so let’s just treat my next code block as a fait accompli. I am merely showing you “how the sausage is made”, for clarity. The discussion that follows is the important bit.
autoalphabeticize_df <- function(df, lang = "english") {
  df %>%
    filter(is.element(Number, 1:10)) %>%
    arrange(NumberWord) %>%
    mutate(
      AlphabeticalOrder = 1:10,
      MatchingNumber = AlphabeticalOrder == Number,
      Language = rep(lang, 10)
    ) %>%
    select(Language, AlphabeticalOrder, everything())
}
NumberWord <- c("one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten")
example_df <- autoalphabeticize_df(tibble(Number = 1:10, NumberWord))
Example
Consider the following table of English number words, ranked in alphabetical order.
## # A tibble: 10 x 5
## Language AlphabeticalOrder Number NumberWord MatchingNumber
## <chr> <int> <int> <chr> <lgl>
## 1 english 1 8 eight FALSE
## 2 english 2 5 five FALSE
## 3 english 3 4 four FALSE
## 4 english 4 9 nine FALSE
## 5 english 5 1 one FALSE
## 6 english 6 7 seven FALSE
## 7 english 7 6 six FALSE
## 8 english 8 10 ten FALSE
## 9 english 9 3 three FALSE
## 10 english 10 2 two FALSE
We can see that the first word in the sorted table is “eight” (8), and the second word is “five” (5). Continuing down through the list, we see that the numerical and lexicographical orderings do not coincide for any of the numbers, if we consider only the first ten English number words.
I would like to introduce the term “autoalphabeticity” to denote this concept. It would then follow that English has an autoalphabeticity of 0 on the first 10 numbers.
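To make the definition concrete, here is a minimal base-R sketch, separate from the tidyverse functions used elsewhere in this post, that counts how many positions agree between numeric and alphabetical order. The helper name `count_matches` is made up for illustration. Sorting uses `method = "radix"`, which compares in the C locale. That is an assumption that works for plain lowercase ASCII words but may order accented letters differently than a locale-aware sort would.

```r
# Hypothetical helper: number of words whose alphabetical rank equals their numeric value.
# `words` must be given in numeric order (element k is the word for k).
count_matches <- function(words) {
  # order() returns, for each alphabetical position, the number that lands there
  alphabetical <- order(words, method = "radix")
  sum(alphabetical == seq_along(words))
}

english <- c("one", "two", "three", "four", "five",
             "six", "seven", "eight", "nine", "ten")
count_matches(english)  # 0: no English number word sits in its own position
```

A score of 10 would mean the words are already in alphabetical order when counted from one to ten.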
This made me wonder if any language has a high autoalphabeticity. The purpose of this work is to determine the autoalphabeticity over the first 10 numbers for every language listed at https://www.omniglot.com/language/numbers/index.htm.
Methods
Gathering Data
We start by getting a list of all the languages.
langs <- read_html("https://www.omniglot.com/language/numbers/index.htm") %>% # get website index
  html_nodes("div ol li p a") %>% # navigate to the appropriate nodes
  html_attr("href") %>% # get the href attribute
  .[!is.na(.)] %>% # remove nodes that didn't have an href attribute
  str_split_fixed("\\.", 2) %>% # split the .htm extension off each file name
  .[ , 1] # grab the first column (the one with all the language names)
length(langs)
## [1] 541
There are 541 languages with numbering systems available for inspection on this website.
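The last steps of that pipeline turn link targets into bare language names. As a small offline illustration, the same clean-up can be written in base R with `sub()`. The `hrefs` vector here is invented sample input shaped like the site's links, not scraped data.

```r
# Invented sample of href values, including one link with no href (NA)
hrefs <- c("abaza.htm", NA, "abkhaz.htm", "afrikaans.htm")

hrefs <- hrefs[!is.na(hrefs)]           # drop nodes that had no href attribute
lang_names <- sub("\\..*$", "", hrefs)  # keep everything before the first dot
lang_names
```

Keeping only the text before the first dot matches what `str_split_fixed("\\.", 2)` followed by taking the first column does.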
Determining Autoalphabeticity
First we need to check whether a language is written right-to-left (RTL). If it’s in the list of languages below, we’ll reverse its number words before sorting. If it’s not in the list, we’ll assume it’s left-to-right (LTR). I pulled the list of languages from https://en.wikipedia.org/wiki/Right-to-left by hand, and also checked them against the tables on www.omniglot.com/.
is_rtl <- function(lang) {
  # Entries carrying an asterisk do not match a plain language name,
  # so those languages fall through to the left-to-right default.
  rtl_langs <- c(
    "Arabic", "Garshuni", "Karshuni", "Kazakh", "Kurdish", "Kyrgyz", "Turkmen", "Uzbek",
    "Persian", "Baluchi", "Pashto", "Azerbaijani", "Talysh", "Uyghur", "Dari", "Kashmiri",
    "Punjabi*", "Saraiki", "Sindhi*", "Urdu", "Ida'an", "Idahan", "Cham", "Tuareg",
    "Tamasheq", "Bedawi", "Beja", "Dongolawi", "Andaandi", "Nobiin", "Zarma", "Tadaksahak",
    "Hausa", "Dyula", "Jola-Fonyi", "Balanta", "*Mandinka", "Fula", "Ajami", "Wolofal"
  ) %>%
    tolower()
  # TRUE if the language is in the list, FALSE otherwise
  is.element(lang, rtl_langs)
}
I’ll note that 3 of the languages in the list have an asterisk. Mandinka has an asterisk in front of it to denote that it can also be written in an RTL script, but is written LTR on www.omniglot.com/. Punjabi and Sindhi can be written in LTR or RTL scripts, and both scripts are represented (in a single column!) on the site. Since the LTR script comes first and I’m only analyzing each language once, I’m sort of erasing the Punjabi and Sindhi RTL autoalphabeticity, but I don’t feel like making weird exceptions for these two cases.
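Here is a quick base-R sketch of what the lookup and the flip amount to. The shortened `rtl_sample` list and the `reverse_word` helper are illustrative stand-ins (the post's code uses the full list and `stri_reverse`), and the sample word is plain ASCII only to keep the output easy to read.

```r
# Shortened, lower-cased stand-in for the full list above
rtl_sample <- tolower(c("Arabic", "Urdu", "Punjabi*"))

is.element("arabic", rtl_sample)   # TRUE: treated as right-to-left
is.element("punjabi", rtl_sample)  # FALSE: "punjabi*" is not the same string as "punjabi"

# Hypothetical helper: reverse the characters of a single word
reverse_word <- function(word) {
  paste(rev(strsplit(word, "")[[1]]), collapse = "")
}
reverse_word("abc")  # "cba"
```

Because an asterisked entry never equals a bare language name, those languages fall through to the LTR default.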
So now we need a function to determine how autoalphabetical each language is (over the first 10 numbers).
autoalphabeticity <- function(df, rtl = FALSE) {
  # reverse RTL number words; the result has to be assigned back to df to take effect
  if (rtl) df <- df %>% mutate(NumberWord = stri_reverse(NumberWord))
  df %>%
    filter(is.element(Number, 1:10)) %>%
    .[1:10, ] %>% # only select the first 10 (avoids dealing with 1.000 and the like)
    arrange(NumberWord) %>% # order data frame by number word (cardinal)
    mutate(AlphabeticalOrder = 1:10) %>%
    filter(AlphabeticalOrder == Number) %>%
    nrow()
}
Retrieving Numerical Tables
Our function to process languages will start with the language name, and then grab the website, and filter down to the language table. From there we’ll clean up the table.
Before we write that function, let’s write a helper function to parse tables.
Most commonly, tables have one column called “Number” or “Numeral”, and another column called “Cardinal”. But these tables are probably made by hand, which means they are riddled with inconsistencies, and there are many tables that are exceptions to this norm.
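For the common layout, picking the two useful columns comes down to matching on column names. Below is a base-R sketch of that idea on a made-up table. `toy_table` and its contents are invented, and the case-insensitive patterns are meant to mirror helpers like `starts_with("num")` and `contains("cardinal")`, not to reproduce the site's real tables.

```r
# Invented table in the "normal" layout: a numeral column, a cardinal column, and an extra
toy_table <- data.frame(
  Numeral  = 1:3,
  Cardinal = c("one", "two", "three"),
  Ordinal  = c("first", "second", "third"),
  stringsAsFactors = FALSE
)

cols <- names(toy_table)
is_number_col   <- grepl("^num", cols, ignore.case = TRUE)      # name starts with "num"
is_cardinal_col <- grepl("cardinal", cols, ignore.case = TRUE)  # name contains "cardinal"

picked <- toy_table[, is_number_col | is_cardinal_col, drop = FALSE]
names(picked)  # "Numeral" "Cardinal"
```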
Because programming is weird, let’s start with a list of our exceptions to what a “normal” table should include. These are things to try out if we meet a table that doesn’t match the above.
try_weird_cases <- function(df) {
  # a helper function
  contains_col <- function(name) is.element(name, colnames(df))
  # We'll try to break things down on a case-by-case basis.
  # If we have only 2 cols, assume they're in the correct order.
  if (ncol(df) == 2) {
    return(df)
    # Next, consider a format that is common for tables of Indian languages.
    # Check whether we have a column of numbers that is not named (which has been given
    # the name "num" in the process_lang function we haven't written yet),
    # and a column that is named "Number", which has number words.
  } else if (contains_col("num") && contains_col("Number") && contains_col("Numeral")) {
    df <- df %>%
      mutate(NumberWord = Number) %>%
      select(num, NumberWord)
    # In some character-based languages such as Chinese, I'm going to rely on
    # pronunciation as the guiding metric.
  } else if (contains_col("Number") && contains_col("Pronunciation")) {
    df <- df %>%
      mutate(NumberWord = Pronunciation) %>%
      select(Number, NumberWord)
    # In some Indian language tables, if we have Number and Numeral columns
    # but no Cardinal column, the Number column has the number words.
  } else if (contains_col("Number") && contains_col("Numeral")) {
    df <- df %>%
      mutate(NumberWord = Number) %>% # copy the column, not the literal string "Number"
      select(Numeral, NumberWord)
    # In the case of Korean and Japanese, we'll look for some key words
    # to be a marker for the cardinal numbers.
  } else if (contains_col("Numeral")) {
    df <- df %>%
      select(
        Numeral,
        starts_with("Native"),
        starts_with("General"),
        contains("Counting")
      )
    colnames(df)[2] <- "NumberWord"
  }
  df
}
Next we’ll build the function that takes the language, builds a URL, finds the table of numbers and number words, and then cleans said table to be usable.
Needless to say, the last bit is the trickiest. And the above function is an ancillary piece that aids in this process.
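One of the clean-up problems is duplicate column names, which show up when a table splits a single column into masculine and feminine sub-columns. As a rough base-R illustration of the rename-then-drop idea, here is a sketch on an invented `toy` table. It uses `make.unique()`, which is a swapped-in technique for illustration and not what the function below does.

```r
# Invented table where "Cardinal" appears twice (e.g. masculine and feminine forms)
toy <- data.frame(1:2, c("word_m1", "word_m2"), c("word_f1", "word_f2"),
                  stringsAsFactors = FALSE)
names(toy) <- c("Numeral", "Cardinal", "Cardinal")

names(toy) <- make.unique(names(toy))  # "Numeral" "Cardinal" "Cardinal.1"
toy <- toy[, !grepl("\\.1$", names(toy)), drop = FALSE]  # drop the second twin
names(toy)  # "Numeral" "Cardinal"
```

Dropping the second twin keeps only the first set of forms, so any difference between the two sub-columns is ignored.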
process_lang <- function(lang, baseURL = "https://www.omniglot.com/language/numbers/") {
  url <- paste0(baseURL, lang, ".htm")
  df <- read_html(url) %>% # read website
    html_node("table") %>% # find table
    html_table(fill = TRUE) # extract table as data frame
  # The tables on this site don't follow very strong patterns,
  # but, from the errors I've seen, I think that if there is
  # an unnamed column it is the number column.
  # However, we can't always give it the name "Number" because that is sometimes
  # the column name for a number written with a different numbering system, and sometimes
  # the digits 1-10 are written in the "Numeral" column, not the unnamed column.
  # We'll try to circumvent all of this by making a "num" column, reorganizing the columns
  # to be in alphabetical order, then selecting cols that start with "num". This is
  # particularly tricky because data frames that have an empty column name cause R to break
  # easily.
  if (is.element("", colnames(df))) {
    df <- df[ , order(names(df))]
    colnames(df)[1] <- "num"
  }
  # Some tables have masculine and feminine numbers listed as different subcolumns within
  # one column. These get parsed as two columns with the same name.
  # We'll use bind_cols to append a number suffix to the second of these
  # twin columns, and then drop the second one.
  df <- df %>%
    bind_cols() %>% # this appends a number suffix to col names that had the same name
    select(-ends_with("1")) # drop all cols that end with the number 1
  tmp <- df %>% select(starts_with("num"), contains("cardinal"))
  if (ncol(tmp) >= 2) {
    df <- tmp
  } else {
    df <- try_weird_cases(df)
  }
  colnames(df)[1] <- "Number"
  df$Number <- df$Number %>% as.numeric()
  colnames(df)[2] <- "NumberWord"
  df
}
Analysis
Now let’s get to processing! We’ve built this thing, now let’s set it on an unsuspecting website and grab our sweet sweet number words.
autoalphabeticity_score <- rep(-1, length(langs))
x <- length(langs)
for (i in seq_len(x)) {
  autoalphabeticity_score[i] <- tryCatch(
    process_lang(langs[i]) %>%
      autoalphabeticity(rtl = is_rtl(langs[i])),
    error = function(e) return(NA)
  )
  print(str_glue("{langs[i]} language is complete! {i}/{x}"))
}
save(autoalphabeticity_score, file = "autoalphabeticity_score.RData")
I’ve saved the data so I don’t have to reprocess it every time I update this file. It takes more than 5 minutes to run each time. The price of success.
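The save-once idea can be sketched as a small cache check: load the file if it exists, otherwise compute and save. The `slow_compute` function, the `get_scores` wrapper, and the temporary file path are placeholders for illustration. In the post the expensive step is the scraping loop and the file is `autoalphabeticity_score.RData`.

```r
# Placeholder for the expensive scraping loop
slow_compute <- function() c(0, 1, 2)

cache_file <- file.path(tempdir(), "scores_demo.RData")  # demo path, not the post's file
if (file.exists(cache_file)) unlink(cache_file)          # start the demo from a clean slate

# Hypothetical wrapper: reuse saved scores when available, otherwise compute and save them
get_scores <- function(path) {
  if (file.exists(path)) {
    load(path)  # restores the object named `scores` into this function's environment
  } else {
    scores <- slow_compute()
    save(scores, file = path)
  }
  scores
}

first  <- get_scores(cache_file)  # computes and writes the cache
second <- get_scores(cache_file)  # reads the cache instead of recomputing
identical(first, second)
```

The trade-off is that a stale cache has to be deleted by hand whenever the scraping code changes.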
After analysis we see some interesting results!
raw_results <- tibble(langs, autoalphabeticity_score)
results <- raw_results %>%
  filter(!is.na(autoalphabeticity_score))
na_results <- raw_results %>% filter(is.na(autoalphabeticity_score))
results
## # A tibble: 533 x 2
## langs autoalphabeticity_score
## <chr> <dbl>
## 1 abaza 0
## 2 abellen 0
## 3 abenaki 0
## 4 abkhaz 1
## 5 abui 2
## 6 acholi 0
## 7 adaizan 2
## 8 adyghe 0
## 9 afar 2
## 10 afrikaans 0
## # ... with 523 more rows
Above we can see a few real results, whereas below, we have the few languages that didn’t process properly.
## # A tibble: 8 x 2
## langs autoalphabeticity_score
## <chr> <dbl>
## 1 adi NA
## 2 malayindonesian NA
## 3 javanese NA
## 4 latvia NA
## 5 malayindonesian NA
## 6 montagnais NA
## 7 nivacle NA
## 8 paraujano NA
## Of the 541 languages listed on the website index, only 8 resulted in error. This means that a
## whopping 533 languages, accounting for 98.5% of the total available, were used for analysis.
A histogram shows that the vast majority of languages have an autoalphabeticity score of 1 or 0, but there are a handful with more.
Digging into the numbers directly, we see that there are eight languages with a score of 4.
##
## 0 1 2 3 4
## 208 181 92 44 8
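The counts above are enough to back up the “vast majority” claim with a quick calculation. This sketch just re-types the table into a vector (the name `score_counts` is made up here) and redraws a bar chart in base R. The post's original plotting code is not shown, so the chart is only an approximation of the missing histogram.

```r
# Frequency of each autoalphabeticity score, copied from the table above
score_counts <- c("0" = 208, "1" = 181, "2" = 92, "3" = 44, "4" = 8)

total <- sum(score_counts)  # 533, the number of languages analyzed

# Share of languages scoring 0 or 1, in percent
low_share <- round(100 * sum(score_counts[c("0", "1")]) / total, 1)
low_share

# Rough stand-in for the histogram
barplot(score_counts, xlab = "Autoalphabeticity score", ylab = "Number of languages")
```

With these counts, 389 of the 533 languages (about 73%) score 0 or 1.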
Let’s take a moment to look at these highly autoalphabetic languages.
high_aa_langs <- results %>%
  filter(autoalphabeticity_score >= 4) %>%
  arrange(desc(autoalphabeticity_score))
high_aa_langs
## # A tibble: 8 x 2
## langs autoalphabeticity_score
## <chr> <dbl>
## 1 finnish 4
## 2 juhuri 4
## 3 karelian 4
## 4 lingala 4
## 5 nkore 4
## 6 phom 4
## 7 scots 4
## 8 vai 4
We’ll get the inside scoop by marvelling at the ordering of these languages’ number words firsthand.
get_lang_df <- function(lang) process_lang(lang) %>% autoalphabeticize_df(lang)
high_aa_df <- lapply(high_aa_langs$langs, get_lang_df) %>%
  bind_rows() # put languages together in one data frame
# I use kable to make UTF-8 characters render properly.
kable(
  high_aa_df %>% mutate(NumberWord = enc2utf8(NumberWord)),
  align = c("l", "c", "c", "l", "c")
)

| Language | AlphabeticalOrder | Number | NumberWord | MatchingNumber |
|---|---|---|---|---|
| finnish | 1 | 8 | kahdeksan | FALSE |
| finnish | 2 | 2 | kaksi | TRUE |
| finnish | 3 | 3 | kolme | TRUE |
| finnish | 4 | 6 | kuusi | FALSE |
| finnish | 5 | 10 | kymmenen | FALSE |
| finnish | 6 | 4 | neljä | FALSE |
| finnish | 7 | 7 | seitsemän | TRUE |
| finnish | 8 | 5 | viisi | FALSE |
| finnish | 9 | 9 | yhdeksän | TRUE |
| finnish | 10 | 1 | yksi | FALSE |
| juhuri | 1 | 10 | дегь(dəh) | FALSE |
| juhuri | 2 | 2 | дуь, дуьдуь(dy, dydy) | TRUE |
| juhuri | 3 | 1 | ек, еки, той(jək, jəki, toy) | FALSE |
| juhuri | 4 | 9 | нуьгь(nyh) | FALSE |
| juhuri | 5 | 5 | пенж(pənç) | TRUE |
| juhuri | 6 | 3 | се, сесе(sə, səsə) | FALSE |
| juhuri | 7 | 7 | хьофд(ħofd) | TRUE |
| juhuri | 8 | 8 | хьэшд(ħəşd) | TRUE |
| juhuri | 9 | 4 | чор(cor) | FALSE |
| juhuri | 10 | 6 | шеш(şəş) | FALSE |
| karelian | 1 | 8 | kahekšan | FALSE |
| karelian | 2 | 2 | kakši | TRUE |
| karelian | 3 | 3 | kolme | TRUE |
| karelian | 4 | 6 | kuuǯi | FALSE |
| karelian | 5 | 10 | kymmenen | FALSE |
| karelian | 6 | 4 | neljjä | FALSE |
| karelian | 7 | 7 | seiččemän | TRUE |
| karelian | 8 | 5 | viizi | FALSE |
| karelian | 9 | 9 | yhekšän | TRUE |
| karelian | 10 | 1 | yksi | FALSE |
| lingala | 1 | 9 | libwá | FALSE |
| lingala | 2 | 2 | míbalé | TRUE |
| lingala | 3 | 4 | mínei | FALSE |
| lingala | 4 | 3 | mísató | FALSE |
| lingala | 5 | 5 | mítáno | TRUE |
| lingala | 6 | 1 | mókó | FALSE |
| lingala | 7 | 6 | motóba | FALSE |
| lingala | 8 | 8 | nwámbe | TRUE |
| lingala | 9 | 7 | sámbó, nsámbó | FALSE |
| lingala | 10 | 10 | zómi | TRUE |
| nkore | 1 | 1 | emwe | TRUE |
| nkore | 2 | 2 | ibiri | TRUE |
| nkore | 3 | 10 | ikumi | FALSE |
| nkore | 4 | 4 | ina | TRUE |
| nkore | 5 | 3 | ishatu | FALSE |
| nkore | 6 | 5 | itaano | FALSE |
| nkore | 7 | 6 | mukaaga | FALSE |
| nkore | 8 | 8 | munaana | TRUE |
| nkore | 9 | 7 | mushanju | FALSE |
| nkore | 10 | 9 | mwenda | FALSE |
| phom | 1 | 4 | ali, ǝli | FALSE |
| phom | 2 | 10 | an, ʌn | FALSE |
| phom | 3 | 3 | chem, cʌm | TRUE |
| phom | 4 | 1 | hük, hɨk | FALSE |
| phom | 5 | 5 | nga, ŋa | TRUE |
| phom | 6 | 7 | nyet, ɲʌt | FALSE |
| phom | 7 | 2 | nyi, ɲi | FALSE |
| phom | 8 | 8 | shet, Šʌt | TRUE |
| phom | 9 | 9 | shü, Šɯ | TRUE |
| phom | 10 | 6 | vok, wɔk | FALSE |
| scots | 1 | 1 | ane / wan | TRUE |
| scots | 2 | 8 | echt | FALSE |
| scots | 3 | 5 | five | FALSE |
| scots | 4 | 4 | fower | TRUE |
| scots | 5 | 9 | nine | FALSE |
| scots | 6 | 6 | sax | TRUE |
| scots | 7 | 7 | seiven | TRUE |
| scots | 8 | 10 | ten | FALSE |
| scots | 9 | 3 | three | FALSE |
| scots | 10 | 2 | twa | FALSE |
| vai | 1 | 1 | dóndolɔ̀ndɔ́ | TRUE |
| vai | 2 | 2 | férafɛ̀(ʔ)á | TRUE |
| vai | 3 | 4 | nánináánì | FALSE |
| vai | 4 | 3 | ságbasàk͡pá | FALSE |
| vai | 5 | 5 | sōrusóó(ʔ)ú | TRUE |
| vai | 6 | 7 | sumférasɔ̂ŋ fɛ̀(ʔ)á (5 + 2) | FALSE |
| vai | 7 | 6 | sūndóndosɔ̂ŋ lɔ̀ndɔ́ (5 + 1) | FALSE |
| vai | 8 | 9 | sūnnánisɔ̂ŋ náánì (5 + 4) | FALSE |
| vai | 9 | 8 | sūnságbasɔ̂ŋ sàk͡pá (5 + 3) | FALSE |
| vai | 10 | 10 | tantâŋ | TRUE |
I haven’t reviewed all the possible corner cases, so some languages could be categorized incorrectly. That would be a real shame.
‾\_(ツ)_/‾
Conclusion
It’s really uncommon for a language to be highly autoalphabetical over the first 10 counting numbers. Although having an autoalphabeticity score of 0 is the mode over the first 10 counting numbers, it would be interesting to see the mode over the first hundred or thousand counting numbers.
Also, I have reason to believe that of the 8 listed languages that weren’t analyzed, some had broken web-pages, but others had different formatting than my scripting expected. I could program the exceptions in by hand, but there’s no real fun in that. :P
If you were looking for some sort of punch line, some deeper meaning behind this work, then I’ve wasted your time quite thoroughly. I don’t see any application of this work, except my own amusement, and the amusement of others like me. :D