Linguistic Similarity in Indo-European Languages

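# tidyverse attaches readr (read_csv), dplyr, forcats, and ggplot2
library(tidyverse)
# DiagrammeR provides grViz() for the family tree
library(DiagrammeR)
# The path is relative to the current working directory
language <- read_csv("Downloads/language.csv")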

By Sara Cantor

Language families group languages by genealogical relationship. Languages placed in the same family are presumed to descend from a common ancestor. Within a family, languages are further divided into subgroups, also known as branches or genera. The more recently two languages branched off from a shared ancestor, the more linguistic features they tend to share. If two languages are mutually intelligible, they usually share a good deal of grammar and vocabulary and are generally very close genealogically. Take, for example, Dutch and Afrikaans. Afrikaans is a daughter language of Middle Dutch, and Modern Dutch evolved from Middle Dutch as well. As a result, speakers of the two languages can understand one another to a certain extent, although Dutch speakers tend to find Afrikaans easier to follow than Afrikaans speakers find Dutch. Afrikaans has been heavily influenced by other languages, such as Malay and Portuguese, but a Dutch speaker will not encounter many false cognates when trying to interpret Afrikaans.

grViz("
digraph boxes_and_circles{
  graph [overlap = TRUE, fontsize = 20]

  node [shape = plaintext,
        fontname = Arial,
        fixedsize = false,
        
        fontsize = 30]
 
  'Proto-Indo-European'; Italic; Germanic; Latin; French; Spanish; Romanian; Portuguese; Italian;
  'North Germanic'; 'West Germanic'; 'Old Norse'; Swedish;
  Norwegian; Icelandic; 'Anglo-Frisian'; 'Old Dutch'; 'Old High German'; 'Old English'; 'Old Frisian';
  'Middle Dutch'; 'Middle English'; 'Modern English'; Frisian; Flemish; Dutch; Afrikaans;
  'Middle High German'; German; Yiddish
  
  'Proto-Indo-European' -> Italic,Germanic
  Italic -> Latin Latin -> Spanish,Romanian,French,Portuguese,Italian
  Germanic -> 'North Germanic', 'West Germanic'
  'North Germanic' -> 'Old Norse',Swedish 'Old Norse' -> Norwegian, Icelandic
  'West Germanic' -> 'Anglo-Frisian','Old Dutch','Old High German'
  'Anglo-Frisian' -> 'Old English','Old Frisian' 'Old Dutch' -> 'Middle Dutch'
  'Old English', French -> 'Middle English' 'Middle English' -> 'Modern English'
  'Old Frisian' -> 'Frisian' 'Middle Dutch' -> Flemish,Dutch,Afrikaans
  'Old High German' -> 'Middle High German' 'Middle High German' -> German,Yiddish
}
")

As shown in the family tree above, which pictures some of the languages in the Indo-European family (other branches are omitted for readability), Dutch and Afrikaans both descend from the same node, Middle Dutch. If we go back up the tree to West Germanic and follow the Old High German branch down, we arrive at German. German and Dutch share some similarities, and speakers of one might be able to pick out useful information in the other with some effort. German and Afrikaans, on the other hand, are not similar enough to be considered mutually intelligible.
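The "go back a couple of branches" reasoning can be sketched in code. The snippet below is a minimal illustration in base R: it copies only the relevant edges from the tree above into a child-to-parent lookup and finds the closest shared ancestor of two languages. The names `parent`, `ancestors`, and `common_ancestor` are hypothetical helpers introduced here, not part of the original analysis.

```r
# Child -> parent lookup, copied from the edges of the tree above
# (only the Dutch and High German lines are included)
parent <- c(
  "Afrikaans" = "Middle Dutch",
  "Dutch" = "Middle Dutch",
  "Flemish" = "Middle Dutch",
  "Middle Dutch" = "Old Dutch",
  "Old Dutch" = "West Germanic",
  "German" = "Middle High German",
  "Yiddish" = "Middle High German",
  "Middle High German" = "Old High German",
  "Old High German" = "West Germanic",
  "West Germanic" = "Germanic",
  "Germanic" = "Proto-Indo-European"
)

# Walk up the tree and collect every ancestor, nearest first
ancestors <- function(lang) {
  path <- character(0)
  while (lang %in% names(parent)) {
    lang <- parent[[lang]]
    path <- c(path, lang)
  }
  path
}

# The first ancestor of `a` that is also an ancestor of `b`
common_ancestor <- function(a, b) {
  shared <- intersect(ancestors(a), ancestors(b))
  shared[1]
}

common_ancestor("Dutch", "Afrikaans")
common_ancestor("German", "Afrikaans")
```

Under this lookup, Dutch and Afrikaans meet at Middle Dutch, while German and Afrikaans only meet further up, at West Germanic.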

All data in this paper comes from The World Atlas of Language Structures. The data set is incomplete, so many features are unrecorded for many languages. Below is a chart showing which genera are represented in the Indo-European family.

# One row per (genus, language); stacking the rows gives the number of languages per genus
ie <- language %>%
  mutate(genus = genus %>% fct_infreq() %>% fct_rev()) %>%
  filter(family == "Indo-European") %>%
  group_by(genus) %>%
  count(Name) %>%
  na.omit()

ggplot(ie) +
  geom_col(aes(x = genus, y = n, fill = genus), width = 0.5, show.legend = FALSE) +
  coord_flip() +
  theme(axis.text.x = element_text(vjust = 0.6), axis.title.y = element_blank()) +
  # After coord_flip() the counts run along the horizontal axis, which is the y aesthetic
  labs(title = "Indo-European Genera", y = "Number of Languages in Genus")
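The incompleteness mentioned above can be quantified before any plotting. The following is a minimal sketch, assuming the feature columns hold `NA` where WALS has no value. The rows in `toy` are hypothetical placeholders, not WALS data. With the real data, `toy` would be replaced by `language`.

```r
library(dplyr)

# Hypothetical toy rows standing in for language.csv (not real WALS values)
toy <- tibble::tribble(
  ~Name,    ~family,         ~Tone,      ~DefiniteArticles,
  "lang_a", "Indo-European", "No tones", NA,
  "lang_b", "Indo-European", NA,         "Definite affix",
  "lang_c", "Indo-European", "No tones", "Definite affix",
  "lang_d", "Other",         NA,         NA
)

# Share of Indo-European languages with no recorded value, per feature
missing_share <- toy %>%
  filter(family == "Indo-European") %>%
  summarize(across(c(Tone, DefiniteArticles), ~ mean(is.na(.x))))

missing_share
```

A high share for a feature is a signal that any chart built on it describes only a small subset of the family.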

Similarities in Articles

Mutual intelligibility is not the only metric for judging the similarity of languages, and a lack of mutual intelligibility does not mean a lack of similarity. Staying with the Germanic languages, we can focus on one feature: articles, or the lack thereof. According to the data, most Germanic languages have both definite and indefinite articles, but they express them in different ways.

# One row per Germanic language with both article features recorded
germanic_articles <- language %>%
  filter(genus == "Germanic") %>%
  group_by(DefiniteArticles, IndefiniteArticles, Name) %>%
  count(Name) %>%
  na.omit()

ggplot(germanic_articles, aes(x = DefiniteArticles, y = n)) +
  geom_col(aes(fill = Name), show.legend = FALSE) +
  # Label each stacked segment once with its language name, centered in the segment
  geom_text(aes(label = Name, group = Name), position = position_stack(vjust = 0.5)) +
  scale_x_discrete(labels = c("Definite word distinct from demonstrative", "Definite affix")) +
  theme(axis.text.y = element_blank(), axis.title.x = element_blank(), axis.ticks = element_blank())

ggplot(germanic_articles, aes(x = IndefiniteArticles, y = n)) +
  geom_col(aes(fill = Name), show.legend = FALSE) +
  # Label each stacked segment once with its language name, centered in the segment
  geom_text(aes(label = Name, group = Name), position = position_stack(vjust = 0.5)) +
  scale_x_discrete(labels = c("Indefinite word distinct from 'one'", "Indefinite word same as 'one'", "No indefinite")) +
  theme(axis.text.y = element_blank(), axis.title.x = element_blank(), axis.ticks = element_blank())

In the charts above, there is not much variety in how the most widely spoken Germanic languages mark definiteness. There is a little more variety in the indefinites, with Icelandic having no indefinite article at all: a bare noun is read as indefinite (“dagur” translates to “a day”). Each of the other languages has an indefinite article word. The division of the definite articles is consistent with the family tree above, with West Germanic languages having definite words distinct from the demonstratives and North Germanic languages having definite affixes. The indefinites, however, do not fit the genealogy so neatly. Danish, a North Germanic language, groups with the West Germanic languages Dutch, English, and Frisian, while German groups with Norwegian and Swedish.
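The comparison between branch and article type can also be made as a cross-tabulation rather than read off the bars. The sketch below re-enters the eight languages by hand, using the categories shown in the charts and the branch assignments from the tree and the text. The `articles` table and its `branch` column are constructions for this illustration, since the original data frame is not assumed to carry a branch column.

```r
library(dplyr)

# Values transcribed from the two charts above; `branch` follows the tree and text
articles <- tibble::tribble(
  ~Name,       ~branch,          ~definite,       ~indefinite,
  "Dutch",     "West Germanic",  "distinct word", "distinct from 'one'",
  "English",   "West Germanic",  "distinct word", "distinct from 'one'",
  "Frisian",   "West Germanic",  "distinct word", "distinct from 'one'",
  "German",    "West Germanic",  "distinct word", "same as 'one'",
  "Danish",    "North Germanic", "affix",         "distinct from 'one'",
  "Icelandic", "North Germanic", "affix",         "none",
  "Norwegian", "North Germanic", "affix",         "same as 'one'",
  "Swedish",   "North Germanic", "affix",         "same as 'one'"
)

# Definite articles: each branch falls into a single category
definite_by_branch <- count(articles, branch, definite)

# Indefinite articles: categories cut across the two branches
indefinite_by_branch <- count(articles, branch, indefinite)

definite_by_branch
indefinite_by_branch
```

The first table has one row per branch, while the second has several rows per branch, which is the mismatch with the genealogy described above.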

Tones in Indo-European Languages

When we think of linguistic tone, we do not usually look to Europe first. Tone is generally associated with languages spoken in Africa and Asia. The World Atlas of Language Structures, however, records at least three Indo-European languages that use tone to distinguish meaning.

# Indo-European languages grouped by tone category and genus
langtone <- language %>%
  filter(family == "Indo-European") %>%
  group_by(Tone, genus) %>%
  count(Name) %>%
  na.omit()
# Number of languages in each (Tone, genus) group
langtone_count <- langtone %>%
  summarize(Count = n())

ggplot(langtone_count, aes(x = Tone, y = Count)) +
  geom_col(aes(fill = Tone), show.legend = FALSE) +
  labs(title = "Number of Indo-European Languages in Each Tonal Category",
       x = "Tonal Category", y = "Number of languages") +
  scale_x_discrete(labels = c("No tones", "Simple tone system"))

These languages are Kalami, Latvian, and Norwegian. Kalami, an Indic language, is closely related to Kashmiri, which has no tones. Both belong to the Dardic group, yet they differ in tone. Norwegian shows up again: it has a simple tone system, while its sister language Icelandic has none. Even within a single genus, languages vary in whether they use tone.
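The claim that tone varies within a genus can be checked by counting the distinct tone categories in each genus. This is a minimal sketch using only the five languages named above. The `tones` table is typed in by hand for illustration, and the "Baltic" genus label for Latvian is supplied here rather than taken from the text.

```r
library(dplyr)

# Hand-entered rows for the five languages discussed in this section
tones <- tibble::tribble(
  ~Name,       ~genus,     ~Tone,
  "Kalami",    "Indic",    "Simple tone system",
  "Kashmiri",  "Indic",    "No tones",
  "Norwegian", "Germanic", "Simple tone system",
  "Icelandic", "Germanic", "No tones",
  "Latvian",   "Baltic",   "Simple tone system"
)

# Genera that contain more than one tone category
mixed_genera <- tones %>%
  group_by(genus) %>%
  summarize(categories = n_distinct(Tone)) %>%
  filter(categories > 1)

mixed_genera
```

Run against the full `language` data frame, the same pipeline would list every genus whose members disagree on tone.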

Consonant and Vowel Inventories

Another metric for comparing languages is the size of their consonant and vowel inventories. Many Indo-European languages pair a large vowel inventory with an average-sized consonant inventory, but across the family as a whole the data is scattered, with no single consistent pattern. If we narrow the focus to one genus, Romance for example, the picture tightens considerably: Romance languages overwhelmingly follow the large-vowel, average-consonant pattern.

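# Number of Indo-European languages in each consonant/vowel inventory combination
ie_con_vow <- language %>%
  filter(family == "Indo-European") %>%
  filter(!is.na(ConsonantInventories), !is.na(VowelQualityInventories)) %>%
  distinct(Name, ConsonantInventories, VowelQualityInventories) %>%
  count(ConsonantInventories, VowelQualityInventories)
# The same tally restricted to the Romance genus
r_con_vow <- language %>%
  filter(genus == "Romance") %>%
  filter(!is.na(ConsonantInventories), !is.na(VowelQualityInventories)) %>%
  distinct(Name, ConsonantInventories, VowelQualityInventories) %>%
  count(ConsonantInventories, VowelQualityInventories)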

ggplot(ie_con_vow) +
  geom_point(aes(x = ConsonantInventories, y = VowelQualityInventories, size = n)) +
  labs(title = "Consonant vs Vowel Inventories in Indo-European Languages", x="Consonants", y="Vowels") +
  scale_x_discrete(labels = c("Average", "Moderately large", "Large")) +
  scale_y_discrete(labels = c("Average", "Large")) +
  scale_size_continuous(name = "Number", range= c(1,6)) +
  theme_bw()

ggplot(r_con_vow) +
  geom_point(aes(x = ConsonantInventories, y = VowelQualityInventories, size = n)) +
  labs(title = "Consonant vs Vowel Inventories in Romance Languages", x="Consonants", y="Vowels") +
  scale_x_discrete(labels = c("Average", "Moderately large", "Large")) +
  scale_y_discrete(labels = c("Average", "Large")) +
  scale_size_continuous(name = "Number", range= c(1,6)) +
  theme_bw()
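How strongly a group follows the large-vowel, average-consonant pattern can be expressed as a share rather than judged from point sizes. The sketch below shows one way to compute it. The rows in `toy`, including the language and genus names, are hypothetical placeholders and not WALS values, and the category strings are assumed to match the axis labels used in the plots.

```r
library(dplyr)

# Hypothetical toy rows (not WALS data), shaped like the columns used above
toy <- tibble::tribble(
  ~Name,    ~genus,    ~ConsonantInventories, ~VowelQualityInventories,
  "lang_a", "genus_x", "Average",             "Large",
  "lang_b", "genus_x", "Average",             "Large",
  "lang_c", "genus_x", "Moderately large",    "Large",
  "lang_d", "genus_y", "Large",               "Average",
  "lang_e", "genus_y", "Average",             "Large"
)

# Share of languages per genus with average consonants and large vowels
pattern_share <- toy %>%
  group_by(genus) %>%
  summarize(
    languages = n(),
    share = mean(ConsonantInventories == "Average" &
                   VowelQualityInventories == "Large")
  )

pattern_share
```

Comparing the share for Romance with the share for the whole family would put a number on the difference between the two plots.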

Even in a category as broad as “Indo-European,” most languages fall into the same tonal category, but the outliers follow no obvious pattern and belong to separate genera. When we narrowed the focus to a single genus, Germanic, and examined a different feature, articles, the data was more consistent, though still with a few outliers. These glimpses into The World Atlas of Language Structures suggest that genealogical closeness alone is not a surefire predictor of linguistic similarity.