Dutch word formation

Non-linguistic perspective

Jacek Pardyak

26 march 2019

Outline

Motivation

To communicate you need to learn vocabulary (list of words) and grammar (art of manipulating with words from vocabulary). Then you can:

Second-language learners of Dutch struggle with phenomena like:

  1. kannen - kunnen - kennen - kinnen

  2. braken - breken - broeken - breuken

  3. morgen - zorgen - worgen - borgen

Questions

  1. From which word to start to pass maximal number of words?

  2. How to decide which word to choose next?

Dutch word formation through the eyes of a linguist

The major word formation processes in Dutch are :

  1. compounding e.g.:
  1. derivation (affixation or alternation) e.g.:

We focus further only on alternation (aka substitution), regardless of vowel or consonant.

Definitions

A graph (aka network) is a mathematical structure used to model pairwise relations between objects.

Graph is made up of vertices (aka nodes or points) which are connected by edges (aka arcs, links or lines).

Edges are defined by a relation between vertices, e.g. LinkedIn connection, people acquaintance, film co-starring, nominations etc.

Relation of substitution: two words x and y are related ( x ~ y ) iff they differ by exactly one letter.

This relation is:

Thus this is not an equivalence relation.

Definitions

Indicators of centrality identify the most important vertices within a graph.

Degree centrality is defined as the number of edges incident upon a vertex.

Closeness centrality average length of the shortest path between a vertex and all other vertices in the graph.

Betweenness centrality quantifies the number of times a vertex acts as a bridge along the shortest path between two other vertices.

A graph diameter is the longest geodesic distance (length of the shortest path between two vertices) in the graph.

Farthest vertices are vertices which are connected by the diameter path.

Graph clique is a complete (each pair of graph vertices is connected by an edge) sub graph.

Largest clique is a clique for which there is no other clique including more vertices.

Definitions

Circle is a path such that the first node of the path corresponds to the last.

Shortest circle is a circle for which there is no other circle including less vertices.

Connected component (aka component) is a subgraph in which each vertex is reachable from any other vertex in the subgraph but is not reachable from vertex in different subgraph.

Pseudoword (aka non-word) is a unit of speech or text that appears to be an actual word in a certain language, while in fact it has no meaning in the vocabulary.

Pseudoword generator is a program that generates non-words for psycholinguistic experiments, language exams, all kind of naming (product, project, brand, company, domain, child etc.).

Data preparation

  1. Data comes from: www.opentaal.org. We use all “basic-approved”, “basic-unapproved” and “inflected” datasets.

  2. Proper names (abbreviations, geographical names, first names, nationality names, etc.) are written with small letters.

  3. Pairs from the list (Tweetekenklanken):

au, ei, eu, ie, ij, oe, ou, ui

are treated as an one letter.

  1. Substitution relation is established thanks to the base::agrep function:
agrep(pattern = word, x = words, value = T,
                 max.distance = list(
                   insertions = 0,
                   deletions = 0,
                   substitutions = 1,
                   cost = 1), useBytes = TRUE)

Graph construction and visualization

  1. Graph is constructed with igraph::graph_from_data_frame function:
graph_from_data_frame(d = data_out,
                      directed = FALSE,
                      vertices = NULL)
  1. Interactive graph visualizations made with visNetwork::visNetwork function:
visNetwork(nodes = graph$nodes,
           edges = graph$edges)
  1. Other interactive visualizations made with plotly::plot_ly function:
plot_ly(x = deg,
        type = "histogram",
        histnorm = "probability")
  1. Dashboard is produces with shiny and shinydashboard libraries.

Example of substitution subgraph

Words in the sentence: Dank je wel voor de mooie dag form sub graph:

Substitution graph vertices

## + 343075/343075 vertices, named, from 59a2474:
##     [1] 's anderendaags                   's avonds                        
##     [3] 's middags                        's nachts                        
##     [5] 's ochtends                       's-gravenhage                    
##     [7] 's-hertogenbosch                  06-nummer                        
##     [9] 1 aprilgek                        1 aprilgrap                      
##    [11] 1 aprilmop                        5 meiherdenking                  
##    [13] 5 meiviering                      8 uurjournaal                    
##    [15] 80-jarige                         a-biljet                         
##    [17] a-bom                             a-kant                           
##    [19] a-merk                            a-omroep                         
## + ... omitted several vertices

Substitution graph edges

## + 168620/168620 edges from 59a2474 (vertex names):
##   [1] b --eu b --m  b --a  b --au b --c  b --d  b --e  b --ei b --f  b --g 
##  [11] b --h  b --i  b --ie b --ij b --j  b --k  b --l  b --n  b --o  b --p 
##  [21] b --q  b --r  b --s  b --t  b --u  b --ui b --v  b --w  b --x  b --y 
##  [31] b --z  b --à  eu--m  eu--a  eu--au eu--c  eu--d  eu--e  eu--ei eu--f 
##  [41] eu--g  eu--h  eu--i  eu--ie eu--ij eu--j  eu--k  eu--l  eu--n  eu--o 
##  [51] eu--p  eu--q  eu--r  eu--s  eu--t  eu--u  eu--ui eu--v  eu--w  eu--x 
##  [61] eu--y  eu--z  eu--à  m --a  m --au m --c  m --d  m --e  m --ei m --f 
##  [71] m --g  m --h  m --i  m --ie m --ij m --j  m --k  m --l  m --n  m --o 
##  [81] m --p  m --q  m --r  m --s  m --t  m --u  m --ui m --v  m --w  m --x 
##  [91] m --y  m --z  m --à  a --au a --c  a --d  a --e  a --ei a --f  a --g 
## + ... omitted several edges

Substitution graph degree centrality

Vertices with the highest degree:

## mans  les  mas  man  ras  len 
##   46   43   43   42   41   41

Vertices with the lowest degree:

##   kinderbijslaginstellingen         allesoverkoepelende 
##                           0                           0 
##      uitvloeiingsgesteenten      aansprakelijkgestelden 
##                           0                           0 
## gepensioneerdenverenigingen          regentengeslachten 
##                           0                           0

The ‘mans’ word neighborhood (highest degree vertex)

Relative frequency of vertex degree

Dependency between vertex degree and vertex length

Dependency between vertex degree, length and frequency of occurrence

Dependency between vertex degree, length and logarithmized frequency

Excercise 1

Is that possible to put more dots on the red line?

Excercise 1 - solution

No. One-letter words form polygon. All the vertices are connected together with sides and diagonals. Each vertex degree is the same.

Substitution graph closeness centrality

Vertices with the highest closeness:

##       belten       marten       bolten       manten       bonten 
## 8.776674e-12 8.776673e-12 8.776673e-12 8.776673e-12 8.776673e-12 
##       banten 
## 8.776673e-12

Vertices with the lowest closeness:

##   kinderbijslaginstellingen         allesoverkoepelende 
##                8.496169e-12                8.496169e-12 
##      uitvloeiingsgesteenten      aansprakelijkgestelden 
##                8.496169e-12                8.496169e-12 
## gepensioneerdenverenigingen          regentengeslachten 
##                8.496169e-12                8.496169e-12

The ‘belten’ word neighborhood (highest closeness vertex)

Substitution graph betweenness centrality

Vertices with the highest betweenness:

##  booten  bootee  geleed  geleen  marins marijns 
## 9284392 8350595 6086974 5935422 5910797 5579622

Vertices with the lowest betweenness:

##   kinderbijslaginstellingen         allesoverkoepelende 
##                           0                           0 
##      uitvloeiingsgesteenten      aansprakelijkgestelden 
##                           0                           0 
## gepensioneerdenverenigingen          regentengeslachten 
##                           0                           0

The ‘booten’ word neighborhood (highest betweenness vertex)

Excercise 2

What is wrong with the ‘booten’ - graph and how to fix it?

Excercise 2 - solution

The farthest substitution graph vertices

The farthest vertices are trawler and groeiend. Their distance is equal to: 57

The shortest path between the farthest vertices:

trawler trailer trainer trainee trainde traande traafde
draafde draaide graaide geaaide geaarde gesarde gesierde
gevierde gevoerde geroerde geroemde geramde gerande gelande
belande belandt belangt belangd gelangd gelengd geleegd
geleerd gekeerd gekkerd gekkere lekkere lekkers wekkers
weekers wrekers brekers brakers krakers kraters fraters
flaters fluiters sluiters sluipers gluipers gluiperd gluipend
sluipend slapend slakend blakend blazend blozend bloeiend
gloeiend groeiend

The sequence must be read per row from the start forward as well as from stop backward.

The shortest path between the farthest vertices

The largest substitution graph clique

The shortest substitution graph circle

Connected components in substitution graph

We have 343075 vertices distributed among 269291 components. Examples of components with the highest size:

##    Component member Component size
## 44           aarlen          10966
## 73            allah           9027
## 28             adsl           5081

Examples of components with size >= 12 (mean size of non-trivial components):

##        Component member Component size
## 187902         aanhaakt             12
## 223778        herstelde             12
## 243137           ombind             12

Examples of components with size = 2:

##    Component member Component size
## 16         a-biljet              2
## 17            a-bom              2
## 34           aow'er              2

Examples of components with the lowest size:

##   Component member Component size
## 1  's anderendaags              1
## 2        's avonds              1
## 3       's middags              1

Relative frequency of component size (encoded)

Dependency between component size (encoded) and it’s member length

Dependency between component size (encoded), it’s member length and frequency of occurrence (logarithmized)

Pseudoword generation

The procedure:

  1. Choose a non-trivial component (size > 2),

  2. Pick two members,

  3. Generate candidates related sequentially with members,

  4. From candidates choose vertex with the highest degree.

Examples:

Excercise 3

Check your WhatsApp and go to (https://goo.gl/forms/ExYWq8aUL7fGb13x2)[https://goo.gl/forms/ExYWq8aUL7fGb13x2].

Read the assignment and fill in the form.

Further work

Summary

We can use my Shiny App to recap most important concepts in graph theory.

References