Lexical Distance Across European Languages

Swadesh Lists – Orthographic vs Phonetic Distance

Quinn Weisenfeld

Lexical Distance?

  • Lexical distance ≈ how different or similar basic vocabulary (lexicon) is across languages.

  • Set of European languages:

-   Romance:
    -   French (fr), Spanish (es)
-   Germanic:
    -   North:
        -   Danish (da), Norwegian (no), Swedish (sv)
    -   West:
        -   German (de), English (en)

Data

-   **Swadesh lists** (basic, cross-linguistically comparable vocabulary).
  • Types:
    • Orthographic: the words as they are written
    • Phonetic: the words as they are spoken
      • obtained by phonemizing the written Swadesh lists with the eSpeak engine

Data Structure: Swadesh Lists

-Orthographic:

# A tibble: 3 × 9
  English Danish Norwegian Swedish German Spanish French POS     Specific
  <chr>   <chr>  <chr>     <chr>   <chr>  <chr>   <chr>  <chr>   <chr>   
1 I       jeg    jeg       jag     ich    yo      je     Pronoun personal
2 you     du     du        du      du     tú      tu     Pronoun personal
3 it      han    han       han     er     él      il     Pronoun personal

-Phonetic:

# A tibble: 3 × 10
     id en    fr    de    es    da    no    sv    POS     Specific
  <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>   <chr>   
1     1 aI    Z@-   _|IC  J^o   jAj   jE:I  jA:g  Pronoun personal
2     2 ju:   ty    du:   tu    du    du-:  du-   Pronoun personal
3     3 It    il    _|Er  el    h?&n  han   han   Pronoun personal

Research Questions & Expectations

  • Goal: Quantify and display the differences between the lexicons of a sample of European languages

    • RQ1. Do language groups (Romance, Germanic, and the Scandinavian branch within Germanic) form visible clusters in lexical distance space?

    • RQ2. How does English align with:

      • German and the Scandinavian languages?
      • Romance languages (French, Spanish)?
    • RQ3. How does the picture change when we use phonetic distance instead of orthographic distance?

    • Hypotheses:

      • Scandinavian languages (da/no/sv) should be extremely close to each other.
      • English should pattern with other Germanic languages, but also show closeness to French.

Orthographic vs Phonetic Representation

  • For each language, we have two parallel representations:
    • Orthographic: the standard written form (e.g., English “water”).
    • Phonetic: a phonemized form generated using eSpeak or similar tools.
  • Reasons to keep both:
    • Orthographic distance reflects how similar the spelling systems are.
    • Phonetic distance reflects how similar the sounds actually are.
  • Many interesting cases:
    • Words that look similar but sound different.
    • Words that look different but sound quite similar.
  • Comparing orthographic and phonetic distance helps disentangle spelling conventions from phonological similarity.
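
As a rough illustration of the two representations, the sketch below compares the same word pairs orthographically and phonetically, using forms copied from the sample rows in the data-structure slides. It assumes a plain character-level Levenshtein distance normalized by the longer string, and it treats every ASCII character of the eSpeak transcription as one symbol, which is a simplification (length marks such as ":" count as separate symbols). The helper names are illustrative, not the project's own code.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + cost))    # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer string (0.0 for two empty strings)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# (language pair, orthographic forms, phonetic forms) taken from the sample rows.
EXAMPLES = [
    ("de-no", ("du", "du"), ("du:", "du-:")),
    ("da-sv", ("jeg", "jag"), ("jAj", "jA:g")),
]

for pair, ortho, phon in EXAMPLES:
    d_ortho = normalized_distance(*ortho)
    d_phon = normalized_distance(*phon)
    print(f"{pair}: orthographic={d_ortho:.2f} phonetic={d_phon:.2f}")
```

In both pairs the phonetic distance comes out larger than the orthographic one, which is the kind of gap that comparing the two representations is meant to surface.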

Levenshtein Distance

  • Levenshtein distance treats words as strings of characters.
  • The distance between two words = the minimum number of single-character edits required to transform one into the other:
    • Insertions
    • Deletions
    • Substitutions
  • Normalized version, roughly:
    • distance(word₁, word₂) / max(length(word₁), length(word₂))
    • puts all values on a scale from 0 (identical) to 1 (completely different)
  • These values are averaged over word pairs for each pair of languages, either across the whole list or within each POS
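
The procedure above can be sketched in a few lines of Python. This is a minimal sketch, not the analysis code behind the plots: the function names are made up for illustration, the comparison is case-sensitive, and the data is only the three orthographic rows shown in the data-structure sample, so the resulting numbers are not the study's results.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + cost))    # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the length of the longer word."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# The three orthographic sample rows, one word list per language code.
SAMPLE = {
    "en": ["I", "you", "it"],
    "da": ["jeg", "du", "han"],
    "no": ["jeg", "du", "han"],
    "sv": ["jag", "du", "han"],
    "de": ["ich", "du", "er"],
    "es": ["yo", "tú", "él"],
    "fr": ["je", "tu", "il"],
}


def mean_distance(words_a: list[str], words_b: list[str]) -> float:
    """Average normalized distance over aligned word pairs (same concept, same row)."""
    pairs = list(zip(words_a, words_b))
    return sum(normalized_distance(a, b) for a, b in pairs) / len(pairs)


# One averaged distance per unordered language pair.
pairwise = {
    (x, y): mean_distance(SAMPLE[x], SAMPLE[y])
    for x, y in combinations(SAMPLE, 2)
}

for (x, y), d in sorted(pairwise.items(), key=lambda item: item[1]):
    print(f"{x}-{y}: {d:.3f}")
```

Averaging within a part of speech would follow the same pattern, with the word lists filtered by the POS column before calling `mean_distance`.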

Results

Plot 1 – Orthographic Distance (from EN)

Analysis of English Distances

  • Lexical distance is one specific measure of similarity
  • Foreign influences on English mainly affect the lexicon: Norman French and Scandinavian
  • Orthography depends on many other factors
  • The types of words in the data can skew distances

Plot 2 – Distances from All (Orthographic)

Plot 3 – Distances from All (Phonetic)

Plot 4 – Phonetic vs. Orthographic Distances Scatter

Phon vs Ortho Analysis

  • Generally aligned, but the difference between the West Germanic and Scandinavian languages is consistently more pronounced phonetically.
  • Orthographies of the Germanic languages have been revised and, like English, are based on the Latin script
  • The da-no pair is an outlier

Analysis of Lexical Distances Overview

  • Some expected patterns in the data: closeness of the Scandinavian languages

  • Some unexpected: English closer to Romance

  • A more zoomed-in look at the words is necessary: Part of Speech (POS) classification

    • Applicable System

      • Pronoun, Verb, Noun, Adverb, Adjective

      • Grammatical Particle (gram)

Lexical Distance by Part of Speech

Plot 5 – POS-Specific Distances from English Heatmap (Ortho)

Plot 6 – POS-Specific Distances from English Heatmap (Phon)

Analysis of POS English Distances

  • Closeness to German for grammatical, functional words
    • approximates structural similarity
    • reflects language family relationship
  • English is close to the Scandinavian languages in the noun, adjective, and adverb categories
    • consistent with lexical borrowing at a low social level
  • For French, the category closest to English is nouns
    • consistent with lexical borrowing at a high social level

Plot 7 – POS-Specific Distances from Swedish Heatmap (Ortho)

Plot 8 – POS-Specific Distances from Swedish Heatmap (Phon)

Plot 9 – POS - Norwegian vs Selected Languages (Ortho)

Plot 10 – POS - French vs Selected Languages (Ortho)

POS Analysis

  • Swedish, Norwegian, and French distances are much more consistent across categories than English distances
  • Some unexplained features of interest:
    • Spanish adverbs show a greater distance from French
    • The Scandinavian languages are closest to each other in pronouns and nouns

Plot 11 – Phonetic vs. Orthographic Distances Comparison by POS

Plot 12 – Distribution of Proximity Classification POS

Analysis

  • Nouns are most often borrowed under language contact
  • Grammatical particles show the highest average difference, which reflects structural relation
    • they are less likely to be borrowed

Linguistic Interpretation

  • Scandinavian languages (da/no/sv):
    • Very low distances reflect:
      • Recent common ancestry.
      • High degree of mutual intelligibility.
      • Shared vocabulary and similar morphosyntax.
  • English and other Germanic languages:
    • English’s distances to German and the Scandinavian languages:
      • Reflect a shared Germanic core.
      • Are modulated by heavy French and Latin borrowing.
    • This is one reason English and German are much further apart than the Scandinavian pairs. Within POS:
      • Function words may track deep genetic relationships.
      • Content words may reflect more recent borrowing.

Limitations

  • Data limitations
    • Swadesh lists are relatively small and focused on basic vocabulary.
    • They may not capture specialized lexical domains or contemporary slang.
  • Methodological limitations
    • Automatic phonemization (e.g., via eSpeak) can introduce errors or bias.
    • Levenshtein distance treats all character edits equally.
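
To make the equal-cost limitation concrete, here is a hedged sketch of an edit distance whose substitution cost depends on the two symbols involved. The cost scheme is purely hypothetical (vowel-for-vowel substitutions cost 0.5, everything else 1.0) and operates on orthographic characters; it is not a method used in this study, and a real version would need linguistically motivated costs over phonetic symbols.

```python
# Hypothetical symbol class: written vowels of the sampled languages.
VOWELS = set("aeiouyæøåäö")


def substitution_cost(a: str, b: str) -> float:
    """Illustrative cost: 0 for a match, 0.5 for vowel-to-vowel, 1.0 otherwise."""
    if a == b:
        return 0.0
    if a in VOWELS and b in VOWELS:
        return 0.5
    return 1.0


def weighted_edit_distance(s: str, t: str, indel_cost: float = 1.0) -> float:
    """Levenshtein-style dynamic programme with symbol-dependent substitution costs."""
    prev = [j * indel_cost for j in range(len(t) + 1)]
    for i, cs in enumerate(s, start=1):
        curr = [i * indel_cost]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + indel_cost,                    # deletion
                            curr[j - 1] + indel_cost,                # insertion
                            prev[j - 1] + substitution_cost(cs, ct)))  # substitution or match
        prev = curr
    return prev[-1]


# Danish "jeg" vs Swedish "jag" differ only in a vowel; "du" vs "tu" differ in a consonant.
print(weighted_edit_distance("jeg", "jag"))
print(weighted_edit_distance("du", "tu"))
```

Under plain Levenshtein distance both pairs would score 1; with the illustrative weights the vowel change counts for less than the consonant change.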

  • Sampling limitations
    • Only a handful of European languages are included.
    • Wider claims about typology or universals would require more languages and families.

Future Study

  • More Languages
    • Analyzing different groups in the Indo-European family
    • Applying the method to different, smaller language families
  • Phonetic Improvements
    • Less black-box phonemizing
    • A method that takes sound similarity into account, along with phonotactics and assimilation rules
  • Grammatical Distance
    • Finding a way to quantify grammatical difference between languages
    • Similarity of structural and morphological rules