Jacek Pardyak
26 march 2019
Motivation,
Dutch word formation - linguistic perspective,
Graph theory - introduction,
Relation of letter substitution,
Substitution graph properties,
Pseudoword generation,
Exercises,
Shiny application.
To communicate you need to learn vocabulary (list of words) and grammar (art of manipulating with words from vocabulary). Then you can:
receive messages (perceive): listen and read,
send messages (produce): write and speak.
Second-language learners of Dutch struggle with phenomena like:
kannen - kunnen - kennen - kinnen
braken - breken - broeken - breuken
morgen - zorgen - worgen - borgen
How to identify slightly differing words?
What are central (the most similar to other) Dutch words?
How far we can get sequentially altering one letter in a chosen word? And further:
From which word to start to pass maximal number of words?
How to decide which word to choose next?
What is the minimal number of sequential substitutions after which we come back to the same word?
How to find non-words which look and sound like Dutch words?
The major word formation processes in Dutch are :
oude-mannen-huis,
zak-geld & geld-zak,
arbeids-ongeschiktheids-verzekerings-maatschappij.
woon-t | ge-woon-d | on-ge-woon | won-en,
Amsterdam-se | Amsterdam-mer | Amsterdam-mers | Amsterdam-mertje,
sluit-en | sloot | slot | sleut-el.
We focus further only on alternation (aka substitution), regardless of vowel or consonant.
A graph (aka network) is a mathematical structure used to model pairwise relations between objects.
Graph is made up of vertices (aka nodes or points) which are connected by edges (aka arcs, links or lines).
Edges are defined by a relation between vertices, e.g. LinkedIn connection, people acquaintance, film co-starring, nominations etc.
Relation of substitution: two words x and y are related ( x ~ y ) iff they differ by exactly one letter.
This relation is:
not reflexive (x ~ x),
symmetric (if x ~ y then y ~ x ),
not transitive (if x ~ y and y ~ z then x ~ z).
Thus this is not an equivalence relation.
Indicators of centrality identify the most important vertices within a graph.
Degree centrality is defined as the number of edges incident upon a vertex.
Closeness centrality average length of the shortest path between a vertex and all other vertices in the graph.
Betweenness centrality quantifies the number of times a vertex acts as a bridge along the shortest path between two other vertices.
A graph diameter is the longest geodesic distance (length of the shortest path between two vertices) in the graph.
Farthest vertices are vertices which are connected by the diameter path.
Graph clique is a complete (each pair of graph vertices is connected by an edge) sub graph.
Largest clique is a clique for which there is no other clique including more vertices.
Circle is a path such that the first node of the path corresponds to the last.
Shortest circle is a circle for which there is no other circle including less vertices.
Connected component (aka component) is a subgraph in which each vertex is reachable from any other vertex in the subgraph but is not reachable from vertex in different subgraph.
Pseudoword (aka non-word) is a unit of speech or text that appears to be an actual word in a certain language, while in fact it has no meaning in the vocabulary.
Pseudoword generator is a program that generates non-words for psycholinguistic experiments, language exams, all kind of naming (product, project, brand, company, domain, child etc.).
Data comes from: www.opentaal.org. We use all “basic-approved”, “basic-unapproved” and “inflected” datasets.
Proper names (abbreviations, geographical names, first names, nationality names, etc.) are written with small letters.
Pairs from the list (Tweetekenklanken):
au, ei, eu, ie, ij, oe, ou, ui
are treated as an one letter.
agrep(pattern = word, x = words, value = T,
max.distance = list(
insertions = 0,
deletions = 0,
substitutions = 1,
cost = 1), useBytes = TRUE)graph_from_data_frame(d = data_out,
directed = FALSE,
vertices = NULL)visNetwork(nodes = graph$nodes,
edges = graph$edges)plot_ly(x = deg,
type = "histogram",
histnorm = "probability")Words in the sentence: Dank je wel voor de mooie dag form sub graph:
## + 343075/343075 vertices, named, from 59a2474:
## [1] 's anderendaags 's avonds
## [3] 's middags 's nachts
## [5] 's ochtends 's-gravenhage
## [7] 's-hertogenbosch 06-nummer
## [9] 1 aprilgek 1 aprilgrap
## [11] 1 aprilmop 5 meiherdenking
## [13] 5 meiviering 8 uurjournaal
## [15] 80-jarige a-biljet
## [17] a-bom a-kant
## [19] a-merk a-omroep
## + ... omitted several vertices
## + 168620/168620 edges from 59a2474 (vertex names):
## [1] b --eu b --m b --a b --au b --c b --d b --e b --ei b --f b --g
## [11] b --h b --i b --ie b --ij b --j b --k b --l b --n b --o b --p
## [21] b --q b --r b --s b --t b --u b --ui b --v b --w b --x b --y
## [31] b --z b --à eu--m eu--a eu--au eu--c eu--d eu--e eu--ei eu--f
## [41] eu--g eu--h eu--i eu--ie eu--ij eu--j eu--k eu--l eu--n eu--o
## [51] eu--p eu--q eu--r eu--s eu--t eu--u eu--ui eu--v eu--w eu--x
## [61] eu--y eu--z eu--à m --a m --au m --c m --d m --e m --ei m --f
## [71] m --g m --h m --i m --ie m --ij m --j m --k m --l m --n m --o
## [81] m --p m --q m --r m --s m --t m --u m --ui m --v m --w m --x
## [91] m --y m --z m --à a --au a --c a --d a --e a --ei a --f a --g
## + ... omitted several edges
Vertices with the highest degree:
## mans les mas man ras len
## 46 43 43 42 41 41
Vertices with the lowest degree:
## kinderbijslaginstellingen allesoverkoepelende
## 0 0
## uitvloeiingsgesteenten aansprakelijkgestelden
## 0 0
## gepensioneerdenverenigingen regentengeslachten
## 0 0
Is that possible to put more dots on the red line?
No. One-letter words form polygon. All the vertices are connected together with sides and diagonals. Each vertex degree is the same.
Vertices with the highest closeness:
## belten marten bolten manten bonten
## 8.776674e-12 8.776673e-12 8.776673e-12 8.776673e-12 8.776673e-12
## banten
## 8.776673e-12
Vertices with the lowest closeness:
## kinderbijslaginstellingen allesoverkoepelende
## 8.496169e-12 8.496169e-12
## uitvloeiingsgesteenten aansprakelijkgestelden
## 8.496169e-12 8.496169e-12
## gepensioneerdenverenigingen regentengeslachten
## 8.496169e-12 8.496169e-12
Vertices with the highest betweenness:
## booten bootee geleed geleen marins marijns
## 9284392 8350595 6086974 5935422 5910797 5579622
Vertices with the lowest betweenness:
## kinderbijslaginstellingen allesoverkoepelende
## 0 0
## uitvloeiingsgesteenten aansprakelijkgestelden
## 0 0
## gepensioneerdenverenigingen regentengeslachten
## 0 0
What is wrong with the ‘booten’ - graph and how to fix it?
The farthest vertices are trawler and groeiend. Their distance is equal to: 57
The shortest path between the farthest vertices:
| trawler | trailer | trainer | trainee | trainde | traande | traafde |
| draafde | draaide | graaide | geaaide | geaarde | gesarde | gesierde |
| gevierde | gevoerde | geroerde | geroemde | geramde | gerande | gelande |
| belande | belandt | belangt | belangd | gelangd | gelengd | geleegd |
| geleerd | gekeerd | gekkerd | gekkere | lekkere | lekkers | wekkers |
| weekers | wrekers | brekers | brakers | krakers | kraters | fraters |
| flaters | fluiters | sluiters | sluipers | gluipers | gluiperd | gluipend |
| sluipend | slapend | slakend | blakend | blazend | blozend | bloeiend |
| gloeiend | groeiend |
The sequence must be read per row from the start forward as well as from stop backward.
We have 343075 vertices distributed among 269291 components. Examples of components with the highest size:
## Component member Component size
## 44 aarlen 10966
## 73 allah 9027
## 28 adsl 5081
Examples of components with size >= 12 (mean size of non-trivial components):
## Component member Component size
## 187902 aanhaakt 12
## 223778 herstelde 12
## 243137 ombind 12
Examples of components with size = 2:
## Component member Component size
## 16 a-biljet 2
## 17 a-bom 2
## 34 aow'er 2
Examples of components with the lowest size:
## Component member Component size
## 1 's anderendaags 1
## 2 's avonds 1
## 3 's middags 1
The procedure:
Choose a non-trivial component (size > 2),
Pick two members,
Generate candidates related sequentially with members,
From candidates choose vertex with the highest degree.
Examples:
ingaan - … - induikt = > ingakt
aangan - … - fantast => aangast
strakjes - … - fooitjes => sooitjes
Check your WhatsApp and go to (https://goo.gl/forms/ExYWq8aUL7fGb13x2)[https://goo.gl/forms/ExYWq8aUL7fGb13x2].
Read the assignment and fill in the form.
construct graph with different relation (one insertion or deletion),
join both graphs (relation of one substitution, insertion or deletion),
construct graphs for all European languages (mutual intelligibility).
We can use my Shiny App to recap most important concepts in graph theory.
Network analysis with R and igraph: NetSci X Tutorial http://kateto.net/networks-r-igraph
All sources of this presentation: https://github.com/JacekPardyak/DWF/