Lexical Distance Across European Languages
Swadesh Lists – Orthographic vs Phonetic Distance
Lexical Distance?
- Romance:
- French (fr), Spanish (es)
- Germanic:
- North:
- Danish (da), Norwegian (no), Swedish (sv).
- West:
- German (de)
- English (en)
Data
- **Swadesh lists** (basic, cross-linguistically comparable vocabulary).
- Types:
- Orthographic - the way it is written
- Phonetic - the way it is spoken
- Phonemized from the written Swadesh lists using the eSpeak engine
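The phonemization step could be sketched roughly as below. This is a minimal illustration, not the pipeline actually used for these slides: it assumes the `espeak` command-line tool is installed and on the PATH, and the helper names `espeak_command` and `phonemize` are hypothetical. Voice names differ between eSpeak versions, so the voice passed in is also an assumption.

```python
import subprocess


def espeak_command(word, voice):
    """Build the eSpeak call for one word (hypothetical helper).

    -q suppresses audio, -x writes phoneme mnemonics to stdout,
    -v selects the voice (language).
    """
    return ["espeak", "-q", "-x", "-v", voice, word]


def phonemize(word, voice):
    """Return eSpeak's phoneme mnemonics for a single word.

    Requires a local eSpeak installation; raises CalledProcessError
    if the command fails.
    """
    result = subprocess.run(
        espeak_command(word, voice),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

A call such as `phonemize("water", "en")` returns the raw mnemonic string. Any cleanup applied afterwards (for example, removing stress marks) is not specified in the slides and is left out here.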
Data Structure: Swadesh Lists
- Orthographic:
# A tibble: 3 × 9
English Danish Norwegian Swedish German Spanish French POS Specific
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 I jeg jeg jag ich yo je Pronoun personal
2 you du du du du tú tu Pronoun personal
3 it han han han er él il Pronoun personal
- Phonetic:
# A tibble: 3 × 10
id en fr de es da no sv POS Specific
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 aI Z@- _|IC J^o jAj jE:I jA:g Pronoun personal
2 2 ju: ty du: tu du du-: du- Pronoun personal
3 3 It il _|Er el h?&n han han Pronoun personal
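One way to work with the two tables is to merge them into a single long-format list, one record per concept and language. The sketch below copies the three sample rows shown above. It assumes the rows of both tables are aligned by position (the orthographic table has no id column), and the names `LANG_CODES`, `to_long`, and `language_pairs` are illustrative only.

```python
from itertools import combinations

# Language name -> code, as listed in the language overview.
LANG_CODES = {
    "English": "en", "Danish": "da", "Norwegian": "no", "Swedish": "sv",
    "German": "de", "Spanish": "es", "French": "fr",
}

# Sample rows copied from the orthographic table.
ORTHO_ROWS = [
    {"English": "I", "Danish": "jeg", "Norwegian": "jeg", "Swedish": "jag",
     "German": "ich", "Spanish": "yo", "French": "je",
     "POS": "Pronoun", "Specific": "personal"},
    {"English": "you", "Danish": "du", "Norwegian": "du", "Swedish": "du",
     "German": "du", "Spanish": "tú", "French": "tu",
     "POS": "Pronoun", "Specific": "personal"},
    {"English": "it", "Danish": "han", "Norwegian": "han", "Swedish": "han",
     "German": "er", "Spanish": "él", "French": "il",
     "POS": "Pronoun", "Specific": "personal"},
]

# Sample rows copied from the phonetic table.
PHON_ROWS = [
    {"id": 1, "en": "aI", "fr": "Z@-", "de": "_|IC", "es": "J^o",
     "da": "jAj", "no": "jE:I", "sv": "jA:g",
     "POS": "Pronoun", "Specific": "personal"},
    {"id": 2, "en": "ju:", "fr": "ty", "de": "du:", "es": "tu",
     "da": "du", "no": "du-:", "sv": "du-",
     "POS": "Pronoun", "Specific": "personal"},
    {"id": 3, "en": "It", "fr": "il", "de": "_|Er", "es": "el",
     "da": "h?&n", "no": "han", "sv": "han",
     "POS": "Pronoun", "Specific": "personal"},
]


def to_long(ortho_rows, phon_rows):
    """Merge both tables into one record per (concept, language).

    Assumes the two tables list the same concepts in the same order.
    """
    records = []
    for ortho, phon in zip(ortho_rows, phon_rows):
        for name, code in LANG_CODES.items():
            records.append({
                "id": phon["id"],
                "lang": code,
                "ortho": ortho[name],
                "phon": phon[code],
                "pos": phon["POS"],
            })
    return records


records = to_long(ORTHO_ROWS, PHON_ROWS)

# Every unordered pair of languages that can be compared.
language_pairs = list(combinations(sorted(LANG_CODES.values()), 2))
```

With seven languages this yields 21 language pairs, which is the set of comparisons behind the "Distances from All" plots.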
Research Questions & Expectations
Orthographic vs Phonetic Representation
- For each language, we have two parallel representations:
- Orthographic: the standard written form (e.g., English “water”).
- Phonetic: a phonemized form generated using eSpeak or similar tools.
- Reasons to keep both:
- Orthographic distance reflects how similar the spelling systems are.
- Phonetic distance reflects how similar the sounds actually are.
- Many interesting cases:
- Words that look similar but sound different.
- Words that look different but sound quite similar.
- Comparing orthographic and phonetic distance helps disentangle spelling conventions from phonological similarity.
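The "look similar but sound different" case can be illustrated with the three sample rows from the data slide. The sketch below only checks for exact matches, so it is a simplification of the distance-based comparison; the structure `SAMPLE` and the function name are hypothetical.

```python
# (orthographic, phonemized) forms for da, no, sv, copied from the sample rows.
SAMPLE = {
    "I": {"da": ("jeg", "jAj"), "no": ("jeg", "jE:I"), "sv": ("jag", "jA:g")},
    "you": {"da": ("du", "du"), "no": ("du", "du-:"), "sv": ("du", "du-")},
    "it": {"da": ("han", "h?&n"), "no": ("han", "han"), "sv": ("han", "han")},
}


def same_spelling_different_sound(sample, lang_a, lang_b):
    """List concepts spelled identically in both languages but phonemized differently."""
    hits = []
    for concept, forms in sample.items():
        ortho_a, phon_a = forms[lang_a]
        ortho_b, phon_b = forms[lang_b]
        if ortho_a == ortho_b and phon_a != phon_b:
            hits.append(concept)
    return hits


danish_norwegian = same_spelling_different_sound(SAMPLE, "da", "no")
norwegian_swedish = same_spelling_different_sound(SAMPLE, "no", "sv")
```

On these three rows, all Danish and Norwegian forms are spelled the same yet phonemized differently, which is the kind of gap that only shows up when both representations are kept.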
Levenshtein Distance
- Levenshtein distance treats words as strings of characters.
- The distance between two words = minimum number of:
- Insertions
- Deletions
- Substitutions required to transform one into the other.
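- Normalized version, roughly:
- distance(word₁, word₂) / max(length(word₁), length(word₂))
- This puts every value on a scale from 0 (identical) to 1 (maximally different), i.e. from most to least similar
- The normalized distances are then averaged for each language pair, over all words or by part of speech (POS)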
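The distance, its normalization, and the averaging described above can be sketched as follows. This is a plain textbook implementation under stated assumptions, not necessarily the code behind the plots: every character counts as one unit (so in a phonemized string a length mark such as `:` is its own symbol), no lowercasing is applied, and the function names are illustrative.

```python
from collections import defaultdict


def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Levenshtein distance divided by the longer length: 0 = identical, 1 = maximally different."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def mean_distance(word_pairs):
    """Average normalized distance over an iterable of (word_a, word_b) pairs."""
    distances = [normalized_distance(a, b) for a, b in word_pairs]
    if not distances:
        raise ValueError("no word pairs given")
    return sum(distances) / len(distances)


def mean_distance_by_pos(rows):
    """Average normalized distance per POS; rows are (pos, word_a, word_b) tuples."""
    grouped = defaultdict(list)
    for pos, a, b in rows:
        grouped[pos].append(normalized_distance(a, b))
    return {pos: sum(values) / len(values) for pos, values in grouped.items()}
```

For the Norwegian and Swedish sample rows, `mean_distance([("jeg", "jag"), ("du", "du"), ("han", "han")])` averages one substitution in a three-letter word with two exact matches.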
Plot 1 – Orthographic Distance (from EN)
Analysis of English Distances
- Lexical distance is one specific measure of similarity
- Foreign influences (Norman and Scandinavian) mainly affect the lexicon
- Orthography depends on many other factors
- The types of words in the data can skew distances
Plot 2 – Distances from All (Orthographic)
Plot 3 – Distances from All (Phonetic)
Plot 4 – Phonetic vs. Orthographic Distances Scatter
Phon vs Ortho Analysis
- Generally aligned, but the difference between the West Germanic and Scandinavian languages is consistently more pronounced phonetically.
- The orthographies of the Germanic languages have been revised over time and share the Latin script used for English
- The da-no pair is an outlier
Analysis of Lexical Distances Overview
Some patterns in the data are expected: the Scandinavian languages are close to one another
Some are unexpected: English comes out closer to the Romance languages
A more zoomed-in look at the words is necessary: part-of-speech (POS) classification
Applicable System
Pronoun, Verb, Noun, Adverb, Adjective
Grammatical Particle (gram)
Lexical Distance by Part of Speech
Plot 5 – POS-Specific Distances from English Heatmap (Ortho)
Plot 6 – POS-Specific Distances from English Heatmap (Phon)
Analysis of POS English Distances
- Closeness to German for grammatical, functional words
- structural similarity approximation
- reflects language family relationship
- English is close to the Scandinavian languages in the noun, adjective, and adverb categories
- consistent with lexical borrowing at a low social level
- For French, the category closest to English is nouns
- consistent with lexical borrowing at a high social level
Plot 7 – POS-Specific Distances from Swedish Heatmap (Ortho)
Plot 8 – POS-Specific Distances from Swedish Heatmap (Phon)
Plot 9 – POS - Norwegian vs Selected Languages (Ortho)
Plot 10 – POS - French vs Selected Languages (Ortho)
POS Analysis
- Much more consistent across categories than English
- Some unexplained features of interest:
- Spanish adverbs show a greater distance from French
- The Scandinavian languages are closest to one another in pronouns and nouns
Plot 11 – Phonetic vs. Orthographic Distances Comparison by POS
Plot 12 – Distribution of Proximity Classification by POS
Analysis
- Nouns are the category most often borrowed through language contact
- Grammatical particles show the highest average difference, which reflects the structural relationship between the languages
- they are less likely to be borrowed
Linguistic Interpretation
- Scandinavian languages (da/no/sv):
- Very low distances reflect:
- Recent common ancestry.
- High degree of mutual intelligibility.
- Shared vocabulary and similar morphosyntax.
- English and other Germanic languages:
- English’s distances to German and Scandinavian languages:
- Reflect a shared Germanic core.
- Are modulated by heavy French and Latin borrowing.
- This is one reason English/German are much further apart than the Scandinavian pairs.
- Within POS:
- Function words may track deep genetic relationships.
- Content words may reflect more recent borrowing.
Limitations
- Data limitations
- Swadesh lists are relatively small and focused on basic vocabulary.
- They may not capture specialized lexical domains or contemporary slang.
- Methodological limitations
- Automatic phonemization (e.g., via eSpeak) can introduce errors or bias.
- Levenshtein distance:
- Treats all character edits equally.
Limitations (continued)
- Sampling limitations
- Only a handful of European languages are included.
- Wider claims about typology or universals would require more languages and families.
Future Study
- More Languages
- Analyzing different groups in the Indo-European family
- Applying the method to different, smaller language families
- Phonetic Improvements
- Less black-box phonemization
- A method that takes sound similarity into account, along with phonotactics and assimilation rules
- Grammatical Distance
- Finding a way to quantify grammatical difference between languages
- Structural, morphological rules similarity
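The sound-similarity idea under Phonetic Improvements could start from a weighted edit distance, where substituting similar sounds costs less than substituting unrelated ones. The sketch below is one possible starting point, not a method used in this study. The cost function `toy_cost` and its vowel class are placeholders chosen for illustration and carry no phonological authority; a real version would derive costs from phonetic features.

```python
def weighted_levenshtein(a, b, substitution_cost, indel_cost=1.0):
    """Edit distance where substitutions are priced by a caller-supplied function.

    a and b may be strings or lists of phoneme symbols.
    """
    previous = [j * indel_cost for j in range(len(b) + 1)]
    for i, sym_a in enumerate(a, start=1):
        current = [i * indel_cost]
        for j, sym_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + indel_cost,                         # deletion
                current[j - 1] + indel_cost,                      # insertion
                previous[j - 1] + substitution_cost(sym_a, sym_b),  # substitution
            ))
        previous = current
    return previous[-1]


# Placeholder similarity class, for illustration only.
VOWELS = set("aeiou")


def toy_cost(x, y):
    """Hypothetical cost: identical symbols are free, vowel-for-vowel swaps are cheaper."""
    if x == y:
        return 0.0
    if x in VOWELS and y in VOWELS:
        return 0.5
    return 1.0


def uniform_cost(x, y):
    """Reproduces plain Levenshtein: every substitution costs 1."""
    return 0.0 if x == y else 1.0
```

With `uniform_cost` the function reduces to the unweighted distance, so results from the current analysis would remain comparable while alternative cost schemes are explored.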