Lexical Distance Across European Languages

Swadesh Lists – Orthographic vs Phonetic Distance

Quinn Weisenfeld

2025-01-01

Lexical Distance?

  • Lexical distance ≈ how different or similar basic vocabulary (lexicon) is across languages.

  • Set of European languages:

-   Romance:
    -   French (fr), Spanish (es)
-   Germanic:
    -   North:
        -   Danish (da), Norwegian (no), Swedish (sv)
    -   West:
        -   German (de)
        -   English (en)

Data

-   **Swadesh lists** (basic, cross-linguistically comparable vocabulary).
  • Types:
    • Orthographic - the way it is written
    • Phonetic - the way it is spoken
      • obtained by phonemizing the written Swadesh lists with the eSpeak engine
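
A minimal sketch of how a written word might be phonemized from a script, assuming an eSpeak command-line binary is installed. The binary name `espeak-ng`, the helper names, and the voice codes are assumptions for illustration; the slides do not say how eSpeak was actually invoked. The flags used are `-q` (no audio), `-x` (print phoneme mnemonics) and `-v` (voice).

```python
import shutil
import subprocess


def espeak_command(word, voice, binary="espeak-ng"):
    """Build the argument list for one phonemization call (hypothetical helper)."""
    # -q: do not play audio, -x: write phoneme mnemonics to stdout, -v: voice
    return [binary, "-q", "-x", "-v", voice, word]


def phonemize(word, voice, binary="espeak-ng"):
    """Return eSpeak's phoneme mnemonics for `word`, or None if the binary is missing."""
    if shutil.which(binary) is None:
        return None  # eSpeak is not installed on this machine
    result = subprocess.run(
        espeak_command(word, voice, binary),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Calling `phonemize("jeg", "da")` would return the mnemonic string for the Danish word if the binary is available. The exact output depends on the eSpeak version and voice, so no expected value is given here.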

Data Structure: Swadesh Lists

-   Orthographic:

# A tibble: 3 × 9
  English Danish Norwegian Swedish German Spanish French POS     Specific
  <chr>   <chr>  <chr>     <chr>   <chr>  <chr>   <chr>  <chr>   <chr>   
1 I       jeg    jeg       jag     ich    yo      je     Pronoun personal
2 you     du     du        du      du     tú      tu     Pronoun personal
3 it      han    han       han     er     él      il     Pronoun personal

-   Phonetic:

# A tibble: 3 × 10
     id en    fr    de    es    da    no    sv    POS     Specific
  <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>   <chr>   
1     1 aI    Z@-   _|IC  J^o   jAj   jE:I  jA:g  Pronoun personal
2     2 ju:   ty    du:   tu    du    du-:  du-   Pronoun personal
3     3 It    il    _|Er  el    h?&n  han   han   Pronoun personal

Research Questions & Expectations

  • Goal: Quantify and display the differences between the lexicons of a sample of European languages

    • RQ1. Do language groups (Romance, West Germanic, Scandinavian/North Germanic) form visible clusters in lexical distance space?

    • RQ2. How does English align with:

      • German and the Scandinavian languages?
      • Romance languages (French, Spanish)?
    • RQ3. How does the picture change when we use phonetic distance instead of orthographic distance?

    • Hypotheses:

      • Scandinavian languages (da/no/sv) should be extremely close to each other.
      • English should pattern with other Germanic languages, but also show closeness to French.

Orthographic vs Phonetic Representation

  • For each language, we have two parallel representations:
    • Orthographic: the standard written form (e.g., English “water”).
    • Phonetic: a phonemized form generated using eSpeak or similar tools.
  • Reasons to keep both:
    • Orthographic distance reflects how similar the spelling systems are.
    • Phonetic distance reflects how similar the sounds actually are.
  • Many interesting cases:
    • Words that look similar but sound different.
    • Words that look different but sound quite similar.
  • Comparing orthographic and phonetic distance helps disentangle spelling conventions from phonological similarity.

Levenshtein Distance

  • Levenshtein distance treats words as strings of characters.
  • The distance between two words = minimum number of:
    • Insertions
    • Deletions
    • Substitutions required to transform one into the other.
  • Normalized version, roughly:
    • distance(word₁, word₂) / max(length(word₁), length(word₂))
    • puts all values on a scale from 0 to 1, from most to least similar
  • These normalized distances are averaged over word pairs for each language pair, either over the whole list or by POS
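
The definitions above can be sketched in a few lines of Python. This is an illustrative implementation, not necessarily the one used for the plots. The function names are made up here, and the rows are the three orthographic sample rows from the "Data Structure" slide, with only three language columns kept.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution (0 if equal)
            ))
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Levenshtein distance divided by the length of the longer word (0 to 1)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def mean_distance(rows, lang_a, lang_b, pos=None):
    """Average normalized distance for one language pair, optionally within one POS."""
    selected = [r for r in rows if pos is None or r["POS"] == pos]
    if not selected:
        return None  # no words in this category
    values = [normalized_distance(r[lang_a], r[lang_b]) for r in selected]
    return sum(values) / len(values)


# Sample rows copied from the orthographic tibble (subset of columns).
ROWS = [
    {"Danish": "jeg", "Swedish": "jag", "German": "ich", "POS": "Pronoun"},
    {"Danish": "du", "Swedish": "du", "German": "du", "POS": "Pronoun"},
    {"Danish": "han", "Swedish": "han", "German": "er", "POS": "Pronoun"},
]

print(mean_distance(ROWS, "Danish", "Swedish"))
print(mean_distance(ROWS, "Danish", "German", pos="Pronoun"))
```

On these three rows the Danish-Swedish average is 1/9 (only jeg/jag differs, by one letter out of three) and the Danish-German average is 2/3. The same function applies to the phonetic strings: Danish and Norwegian "jeg" are identical in writing (distance 0), while their sample eSpeak strings "jAj" and "jE:I" give a normalized distance of 0.75. Note that comparing the strings character by character treats a multi-character mnemonic such as "E:" as two symbols. The comparison is also case-sensitive.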

Results

Plot 1 – Orthographic Distance (All Language Pairs)

-   English: far from others

-   Scandinavian languages: close, especially Norwegian and Danish

-   German: far from English

Analysis of English Distances

  • Lexical distance is only one specific measure of similarity
  • Foreign influences mainly affect the lexicon: Norman and Scandinavian
  • Orthography depends on many other factors
  • The types of words in the data set also matter

Plot 2 - Phonetic Distance (All Language Pairs)

Plot 3 – Phonetic vs. Orthographic Distances Scatter

-   Generally aligned, but the difference between the West Germanic and Scandinavian languages is consistently more pronounced phonetically.

-   Orthographies of Germanic languages revised and based on English Latin script

-   The da-no pair is an outlier

Plot 4 – Distances from All (Orthographic)

Plot 5 – Distances from all (Phonetic)

Analysis of Lexical Distances Overview

  • Some expected patterns in the data: closeness of the Scandinavian languages

  • Some unexpected: English closer to Romance

  • A more zoomed-in look at the words is necessary: Part of Speech (POS) classification

    • Applicable System

      • Pronoun, Verb, Noun, Adverb, Adjective

      • Grammatical Particle (gram)

Plot 6 – POS-Specific Distances from English Heatmap (Ortho)

Plot 7 – POS-Specific Distances from English Heatmap (Phon)

Plot 8 – POS-Specific Distances from Swedish Heatmap (Ortho)

Plot 9 – POS-Specific Distances from Swedish Heatmap (Phon)

Lexical Distance by Part of Speech

Plot 10 – POS - English vs Selected Languages

Analysis of POS English Distances

  • Closeness to German for grammatical, functional words

    • structural similarity approximation

    • reflects language family relationship

  • English is close to the Scandinavian languages in the noun, adjective and adverb categories

    • consistent with lexical borrowing at a low social level
  • For French, the category closest to English is nouns

    • consistent with lexical borrowing at a high social level

Plot 11 – POS - Swedish vs Selected Languages

Plot 11 – POS - French vs Selected Languages

POS Analysis

  • Much more consistent across categories than English
  • Some features of interest, so far unexplained:
    • Spanish adverbs are at a greater distance from French
    • The Scandinavian languages are closest in pronouns and nouns

Plot 12 – Phonetic vs. Orthographic Distances Comparison by POS

Analysis

  • Nouns have the highest rate of similarity:
    • Nouns are most often borrowed with language contact
  • Grammatical particles show the highest average difference
    • reflects structural relationship
    • less likely to be borrowed

Linguistic Interpretation

  • Scandinavian languages (da/no/sv):
    • Very low distances reflect:
      • Recent common ancestry.
      • High degree of mutual intelligibility.
      • Shared vocabulary and similar morphosyntax.
  • English and other Germanic languages:
    • English’s distances to German and Scandinavian languages:
      • Reflect a shared Germanic core.
      • Are modulated by heavy French and Latin borrowing.
        • One reason English/German are much further apart than the Scandinavian pairs.
  • Within POS:
    • Function words may track deep genetic relationships.
    • Content words may reflect more recent borrowing.

Linguistic Interpretation

  • Orthography vs phonology:
    • Orthographic distances are influenced by:
      • Spelling reforms.
      • Historical spelling conventions.
    • Phonetic distances are more directly tied to:
      • Sound change.
      • Phonological structure.
    • Which measure is better is up for debate:
      • it depends on each language and its development history

Limitations

  • Data limitations
    • Swadesh lists are relatively small and focused on basic vocabulary.
    • may not capture specialized lexical domains or contemporary slang.
  • Methodological limitations
    • Automatic phonemization (e.g., via eSpeak) can introduce errors or bias.
    • Levenshtein distance:
      • Treats all character edits equally.

      • Ignores phonological features (e.g., similarity between different sounds).

      • Considers only the most common word for each meaning

      • Ignores frequency of word use
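
To illustrate the "treats all character edits equally" point, here is a hedged sketch of a weighted variant in which substituting similar sounds costs less than substituting unrelated ones. It was not used in this study. The function names, the cost table and the 0.5 value are all hypothetical; the table only lists three pairs of stops that differ in voicing alone.

```python
def weighted_levenshtein(a, b, sub_cost, indel_cost=1.0):
    """Edit distance where the substitution cost is supplied by the caller."""
    previous = [j * indel_cost for j in range(len(b) + 1)]
    for i, sym_a in enumerate(a, start=1):
        current = [i * indel_cost]
        for j, sym_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + indel_cost,                  # deletion
                current[j - 1] + indel_cost,               # insertion
                previous[j - 1] + sub_cost(sym_a, sym_b),  # substitution
            ))
        previous = current
    return previous[-1]


# Hypothetical similarity classes: stop pairs that differ only in voicing.
SIMILAR_PAIRS = {frozenset("td"), frozenset("pb"), frozenset("kg")}


def toy_sub_cost(x, y):
    """Toy cost: 0 for identical symbols, 0.5 for listed similar pairs, else 1."""
    if x == y:
        return 0.0
    if frozenset((x, y)) in SIMILAR_PAIRS:
        return 0.5  # assumed discount, chosen only for illustration
    return 1.0


# Spanish "tu" and Danish "du" from the phonetic sample rows.
print(weighted_levenshtein("tu", "du", toy_sub_cost))
```

With this toy table, "tu" vs "du" costs 0.5 instead of the 1 that plain Levenshtein assigns. A real version would need costs derived from phonological features rather than a hand-picked list.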

Limitations

  • Sampling limitations
    • Only a handful of European languages are included.
    • Wider claims about typology or universals would require more languages and families.

Future Study

  • More Languages
    • Analyzing different groups in the Indo-European family
    • Applying the method to different, smaller language families
  • Phonetic Improvements
    • Less black-box phonemization
    • A method that takes into account sound similarity, phonotactics and assimilation rules
  • Grammatical Distance
    • Finding a way to quantify grammatical difference between languages
    • Structural, morphological rules similarity