Exploratory Data Analysis of Clustered African HaBiT Corpora

Descriptions of clusters, mostly by example. Amharic, Oromo, Tigrinya, and Somali.

Lars Bungum

Self-Organizing Maps (SOMs) and Hierarchical Clustering

The clustering process is done in three steps:

The document collection is transformed into an n x m matrix, where n is the number of documents and m is the dimension of each vector representing a document.
A SOM of some size (32x32 nodes in subsequent experiments) is created and trained with the above matrix.
A hierarchical clustering algorithm is run on the resulting map, and the documents belonging to each cluster are dumped into sub-collections.
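The first step above can be sketched in a few lines of Python. This is only an illustration under assumptions: the slides do not say how documents were weighted, so raw term counts are used here, and the function name `vectorize` and the toy documents are hypothetical.

```python
from collections import Counter

def vectorize(documents):
    """Turn tokenized documents into an n x m term-count matrix."""
    # Vocabulary: every distinct token, sorted so column order is stable.
    vocabulary = sorted({token for doc in documents for token in doc})
    column = {token: j for j, token in enumerate(vocabulary)}
    matrix = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for token, count in Counter(doc).items():
            row[column[token]] = count
        matrix.append(row)
    return vocabulary, matrix

# Hypothetical toy collection; the real corpora are far larger.
docs = [
    "news report about politics".split(),
    "church news".split(),
    "politics and church".split(),
]
vocabulary, matrix = vectorize(docs)
n, m = len(matrix), len(matrix[0])  # n documents, m vector dimensions
```

Here n is 3 and m is 6; in the experiments reported later, m is in the order of 10^5, which is why the matrix is reduced to a small map before clustering.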

A SOM Before Training

[Figure: a SOM before training]

Same SOM After Training

[Figure: the same SOM after training]
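What training does to the map can be illustrated with a minimal online SOM in plain Python. This is a sketch, not the implementation used for the experiments: the 4x4 grid, the Gaussian neighbourhood, the linear decay schedules, the initialization range and all function names are assumptions made for the example.

```python
import math
import random

def init_som(rows, cols, dim, rng):
    """Untrained map: each grid node gets a random weight vector near 0.5."""
    return {(r, c): [rng.uniform(0.4, 0.6) for _ in range(dim)]
            for r in range(rows) for c in range(cols)}

def best_matching_unit(weights, x):
    """Grid position of the node whose weight vector is closest to x."""
    return min(weights,
               key=lambda node: sum((w - v) ** 2 for w, v in zip(weights[node], x)))

def quantization_error(weights, data):
    """Mean Euclidean distance from each input to its best matching unit."""
    total = 0.0
    for x in data:
        w = weights[best_matching_unit(weights, x)]
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(w, x)))
    return total / len(data)

def train_som(weights, data, rng, epochs=50, lr0=0.5):
    """Online training with linearly decaying learning rate and neighbourhood."""
    rows = 1 + max(r for r, _ in weights)
    cols = 1 + max(c for _, c in weights)
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr = lr0 * decay
        sigma = max(sigma0 * decay, 0.5)
        for x in rng.sample(data, len(data)):  # shuffled pass over the data
            br, bc = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                # Nodes close to the winner on the grid move most.
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                for k in range(len(w)):
                    w[k] += lr * h * (x[k] - w[k])

# Hypothetical 2-D inputs forming two groups.
data = [[0.1, 0.1], [0.15, 0.2], [0.2, 0.1],
        [0.9, 0.9], [0.85, 0.8], [0.8, 0.9]]
rng = random.Random(0)
weights = init_som(4, 4, 2, rng)
qe_before = quantization_error(weights, data)
train_som(weights, data, rng)
qe_after = quantization_error(weights, data)
```

The quantization error drops as the node vectors move from their random starting positions toward the inputs, which is the change the before and after pictures show.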

The Entire Process

[Figures: two plots illustrating the entire process]

Clustering of SOMs

Hierarchical agglomerative clustering merges the two closest clusters at each step until only one remains.

Distance metric: quantify proximity between node vectors.
Clustering method: quantify proximity between clusters.
Number of clusters: determined by inserting a vertical bar in the dendrogram.
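The three choices above (distance metric, clustering method, cut) map onto three small functions in the following sketch. Euclidean distance and average linkage are assumptions picked for brevity, the node vectors are made up, and the naive search over all pairs is only suitable for small inputs.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Distance metric: proximity between two node vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(A, B, vectors, metric):
    """Clustering method: proximity between two clusters of node indices."""
    return sum(metric(vectors[i], vectors[j]) for i in A for j in B) / (len(A) * len(B))

def agglomerate(vectors, metric=euclidean, linkage=average_linkage):
    """Merge the closest pair until one cluster is left; return the merges."""
    clusters = [(i,) for i in range(len(vectors))]
    history = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: linkage(pair[0], pair[1], vectors, metric))
        height = linkage(a, b, vectors, metric)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        history.append((a, b, height))
    return history

def cut(history, n_points, height):
    """Clusters left when the dendrogram is cut at the given height."""
    clusters = {(i,) for i in range(n_points)}
    for a, b, h in history:
        if h > height:  # merge heights never decrease with average linkage
            break
        clusters -= {a, b}
        clusters.add(a + b)
    return sorted(sorted(c) for c in clusters)

# Hypothetical node vectors standing in for trained SOM nodes.
nodes = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]]
history = agglomerate(nodes)
three_clusters = cut(history, len(nodes), height=2.0)
two_clusters = cut(history, len(nodes), height=8.0)
```

Moving the cut from height 2.0 to 8.0 changes the result from three clusters to two, which is the effect of sliding the bar along the dendrogram.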

Number of Clusters

[Figure: number of clusters]

Experiments

Experiments were conducted on Amharic, Oromo, Tigrinya and Somali.

Clusters were explored in terms of size, most frequent n-grams, and an LDA analysis of each cluster.
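Counting the most frequent n-grams of a cluster can be sketched as follows. The analysis itself was done with the quanteda package; this Python version is only an illustration, with hypothetical function names and placeholder tokens. Joining the words of an n-gram with an underscore mirrors the form of the terms in the topic lists.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(documents, sizes=(1, 2, 3), k=4):
    """The k most frequent n-grams over all documents of a cluster."""
    counts = Counter()
    for tokens in documents:
        for n in sizes:
            counts.update(ngrams(tokens, n))
    return counts.most_common(k)

# Placeholder tokens standing in for one cluster's documents.
cluster = ["a b a b".split(), "a b c".split()]
most_frequent_bigram = top_ngrams(cluster, sizes=(2,), k=1)
```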

The full analyses of all four languages are found at:

Amharic

The clustering process partitioned the documents into 3 clusters. Documents were vectorized with dimension 158,892.

Each cluster was loaded with the quanteda package, giving the following sizes:

Corpus 1 consisting of 2,275 documents.
Corpus 2 consisting of 461 documents.
Corpus 3 consisting of 514 documents.

Analysis of Clustering Methods

[Figure: analysis of clustering methods, Amharic]

Exploration of Cluster-Corpus Properties

[Figure: cluster-corpus properties, Amharic]

First Four Words of 13/20 Topics of LDA Topic Model

Cluster 1

1 "።ነገር_ግን" "ነው::" "።_‹_‹" "ግብረ-ሰዶማውያን"
2 "!!!" "።_ነገር_ግን" "ነው::" "ማለት_ነው_።"
3 "፲_፱_፻" "ዓ/ም" "።_፲_፱" "/_ም-"
4 "..." "ወ/_ሮ" "ነው::" "ያልታወቀ_፤_ወንድ"
5 "።_ነገር_ግን" "ዶ/ር" "..." "ንዑስ_አንቀጽ("
6 "---" "ደጋግሞደጋግሞ_ደጋግሞ" "*_" "!!!"
7 "http:/" "://" "..." "]http:"
8 "_*" "።_‹_‹" "።_ነገር_ግን" "ማለት_ነው_።"
9 "..." "ዓ/ም" "read_more_»" "._read_more"
10 "..." "---" "ዶ/ር" "/_ት_ብርቱካን"
11 "..." "ነበር_።[" "ዓ/_ም" "።_ነገር_ግን"
12 "*_" "።ነገር_ግን" "ማለት_ነው_።" "..."
13 "ማለት_ነው_።" "።_ነገር_ግን" "ዶ/ር" ".._."

Cluster 2

"===" "..." "ብፁዕ_ወቅዱስ_አቡነ" "ወቅዱስ_አቡነ_ጳውሎስ"
"..." "readmore_»" "ዜና:-" "ኢሳት_ዜና:"
"ማለትነው_።" "?_ኦባንግ_፦" "'_ወያነ'" "።_ጎልጉል_፦"
"ዓ/_ም" "---" "ኢሳት_ዜና:" "ዜና:-"
"..." "!!!" "ቀን_፳_፻" "፳_፻_፭"
"%e1%" "ወ/_ሮ" "/_ሮ_አዜብ" "ዶ/_ር"
"!!!" "ማለትነው_።" "ሚ/_ር" "ብቻ_ነው_።"
"።ነገር_ግን" "ቀን_2005_ዓ.ም" "2005_ዓ.ም." "ማለት_ነው_።"
"ንዑስአንቀጽ/" "..." "377/_96" ".._read"
"።አዲስ_ዘመን" "።_ነገር_ግን" "አዲስ_ዘመን_፦" ".._."
"..." "!!!" "።_‹_‹" "ማለት_ነው_።"
"..." "^{^^"} "ጠ_/_ሚ" "/_ሚ_መለስ"
"መለስ/_ወያኔ" "ፀረ-_ሰላም" "፲_፱_፻" "።_ነገር_ግን"

Tigrinya

The clustering process partitioned the documents into 3 clusters. Documents were vectorized with dimension 158,892.

Each cluster was loaded with the quanteda package, giving the following sizes:

Corpus 1 consisting of 958 documents.
Corpus 2 consisting of 758 documents.

Analysis of Clustering Methods

[Figure: analysis of clustering methods, Tigrinya]

Exploration of Cluster-Corpus Properties

[Figure: cluster-corpus properties, Tigrinya]

First Four Words of 13/20 Topics of LDA Topic Model

Cluster 1

1 "'ዩ_።" "ናይ_የሆዋ_መሰኻኽር" "..." "ቤተ-ክርስትያን"
2 "'_ዩ_።" "..." "እዩ_።_እዚ" "ቅድስተ_ቅዱሳን_ድንግል"
3 "'_ዩ_።" "*_" "እዩ።_እቲ" "..."
4 "'_ዩ_።" "ማለት_ኢዩ_።" "..." "ኣብ'ዚ"
5 "'_ዩ_።" "ከምኡ'ውን" "*_" "ዝናን-_ባህታን"
6 "እዩ_ነይሩ_።" "'_ዩ_።" "እዩ_።_እዚ" "ኣብ'ዚ"
7 "'_ዩ_።" "*_" "ቅዱሳንድንግል_ማርያም" "ቅድስተ_ቅዱሳን_ድንግል"
8 "'_ዩ_።" "እዩ_።_እዚ" "እዩ_።_እቲ" "እዩ_።_ኣብ"
9 "*_" "..." "መጽናዕቲመጽሓፍ_ቅዱስ" "ይኽእል_እዩ_።"
10 "እዩ_።_እዚ" "መድኃኒና_ኢየሱስ_ክርስቶስ" "።>>" "ይኽእል_እዩ_።"
11 "'_ዩ_።" "*_" "እዩ።_እዚ" "።_"
12 "---" "'ዩ_።" ".._." "ዕደጋ_ፈፃሚ_ኣካል"

Cluster 2

1 "..." "እዩ።(" "እዩነይሩ_።"
2 "እዩ_።_እዚ" "---" "ማለት_እዩ_።"
3 "..." "እዩ_።\"" "ማለትእዩ_።"
4 "ከኣ_፡\"" "ድማ፡\"" "\"በሎም_።"
5 "..." "።_ይኹን_እምበር" "ይኹን_እምበር:"
6 "ቅድስትድንግል_ማርያም" "እዩ_።(" "እዩ።_እዚ"
7 "'_ዩ_።" "ቅድስት_ድንግል_ማርያም" "ቤተ-ክርስቲያን"
8 "እዩ_።_እዚ" "መድኃኒናን_ኢየሱስ_ክርስቶስ" "ጐይታናን_መድኃኒናን_ኢየሱስ"
9 "..." "እዩ_።_ኣብ" "እዩ_።("
10 "እዩ።(" "..." "እዩነይሩ_።"
11 "እዩ_ነይሩ_።" "እዩ_።(" "እዩ።_እዚ"
12 "\"_በሎም_።" "\"_በሎ_።" "ድማ_፡\""
13 "እዩ_እሞ_፡" "፡_በሎ_።" "፡_በሎም_።"

Conclusions

Clustering Method

Ward's clustering method consistently found more evenly sized clusters, which visually resembled the original SOM maps. This was also the finding of Bungum & Gamback (2014, 2015).
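One commonly given reason for Ward's method favouring evenly sized clusters is visible in its merge criterion: the cost of a merge is the squared distance between the two centroids, weighted by a factor that grows with the sizes of both clusters. The sketch below computes that cost for made-up one-dimensional clusters; the function names are hypothetical, and the example is not a claim about what drove the results reported here.

```python
def centroid(points):
    """Component-wise mean of a list of vectors."""
    return [sum(column) / len(points) for column in zip(*points)]

def ward_cost(A, B):
    """Increase in within-cluster sum of squares if A and B are merged."""
    ca, cb = centroid(A), centroid(B)
    d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(A) * len(B) / (len(A) + len(B)) * d2

# Same centroid distance (2.0) in all three cases, different cluster sizes.
small_small = ward_cost([[0.0]], [[2.0]])
large_small = ward_cost([[0.0]] * 9, [[2.0]])
large_large = ward_cost([[0.0]] * 9, [[2.0]] * 9)
```

At equal centroid distance, merging two single points is the cheapest and merging two nine-point clusters the most expensive, so small clusters tend to be merged before large ones grow further.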

Interpretation of Results

Command of the languages is necessary to interpret these results. Alternatively, the corpora could be evaluated on an application for which automatic measures are available.

Possible Uses of Corpora

Unsupervised clustering could be used, e.g., for domain adaptation of applications such as machine translation and spell checking. Beyond that, the clusters could be used to create multilingual domain models (identifying similar domains across languages).