knitr::opts_chunk$set(
  echo = TRUE,
  fig.showtext = TRUE,
  dpi = 300
)

library(showtext)
showtext_auto()
showtext_opts(dpi = 300)
font_add_google("Noto Sans Myanmar", "burmese")

Background

That “the” is the most frequently used word in English no longer surprises most people with some interest in language teaching. Similarly, “say” is the most frequent lexical verb after the lexical uses of auxiliaries such as ‘have’, ‘do’ and ‘be’ in their different forms.

However, there are no such insights for Burmese yet. Gauging from the available dictionaries, people have a general idea of where the script is most heavily used, or of which letters begin the fewest words, but not at the depth and accuracy that a corpus analysis can offer.

This study attempts to address that gap by building a mini corpus of 1885 Burmese pop songs and analysing how the frequencies of ‘words’ differ across composers.

Script

So what does Burmese script look like? Here is a graphic for starters: a humorous comparison of Burmese and other Asian languages, by an illustrator whose name coincidentally carries the same idiomatic meaning in both English and Burmese: “Itchy Feet in Asia”.

knitr::include_graphics("http://languagelog.ldc.upenn.edu/~bgzimmer/itchyfeet.png")

Another easy way to describe Burmese script, this time without body parts, is that the letters are basically circles with openings facing in different directions, sometimes featuring eyes and limbs. Apologies: my attempt to explain the script without body parts has failed again.

knitr::include_graphics("https://freelanguage.org/sites/default/files/styles/node_image_style/public/burmese-writing-system_omniglot-com_0.png?itok=h9qJfrj-")

Various glyphs are then added to these letters to form syllables, hence the description “alphasyllabary”. Burmese is mostly monosyllabic: root words are predominantly single syllables, and most of the vocabulary consists of compound words built from these monosyllabic roots.

E.g.

စာ+အုပ် (sa+ouq), meaning “book”, literally “text-bundle”

မိ+သား+စု (mi+tha+su), meaning “family”, literally “mother/maternal + people/offspring + collective/gathering”

Challenge

As with many other alphasyllabary languages, word segmentation poses a big problem in the field of Burmese NLP. Word boundaries are difficult to decide because each syllable is always a word in its own right, and only the context determines whether it is a monosyllabic unit or part of a multi-syllabic chunk.

Moreover, available Burmese text tends to use spaces quite liberally. It is common to type a space after each syntactic constituent, but spaces also appear because of the idiosyncratic habits of individual typists. So spaces cannot be used to decide word or phrase boundaries either.

Temporary Fix

As a workaround, this study uses syllable-segmentation code made available by a Burmese NLP researcher to create a corpus in which frequencies can be calculated reliably.
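To make the idea concrete, here is a minimal base-R sketch of regex-based syllable segmentation. It is not the segmenter used in this study. It assumes one simplified rule: a new syllable starts at any consonant (U+1000 to U+1021) that is not stacked under the previous one (not preceded by the virama, U+1039) and is not a syllable-final consonant (not followed by asat, U+103A, or by the virama). The function name `segment_syllables` is hypothetical, and digits, punctuation and independent vowels are not handled.

```r
# Simplified, hypothetical syllable segmenter (an illustration only,
# not the code used to build the corpus).
segment_syllables <- function(text) {
  # Typed spaces are unreliable in Burmese text, so drop them first.
  text <- gsub("\\s+", "", text, perl = TRUE)

  # A consonant starts a syllable unless it is preceded by virama (U+1039)
  # or followed by asat (U+103A) or virama (U+1039).
  pattern <- "(?<!\u1039)([\u1000-\u1021])(?![\u103a\u1039])"
  marked <- gsub(pattern, " \\1", text, perl = TRUE)

  # Split on the inserted spaces and discard empty pieces.
  pieces <- strsplit(trimws(marked), "\\s+", perl = TRUE)[[1]]
  pieces[nzchar(pieces)]
}

# "book" (sa + ouq), written with Unicode escapes so the file encoding
# does not matter.
book <- "\u1005\u102c\u1021\u102f\u1015\u103a"
syllables <- segment_syllables(book)
print(length(syllables))
```

Because spaces are stripped before segmenting, the same lyric typed with different spacing habits yields the same syllables, which is the property the frequency counts rely on.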

Corpus

The corpus consists of 1885 randomly collected Burmese songs from the 80s to the present. All the lyrics were broken into syllables, hereafter referred to as ‘words’. The syllable-segmented texts, saved as .txt files, were loaded into the WordSmith_5 concordancer, as it is the only concordancer software that reads Burmese script accurately and produces corpus data. The frequency data was exported as an Excel file.
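For readers without a concordancer, the kind of tally described above can be sketched in base R. This is an illustration, not a reproduction of what WordSmith does: it assumes the lyrics are already syllable-segmented with spaces between syllables, and the function name `count_syllables` and the sample input are hypothetical. The output columns mirror the `Word`, `Freq` and `Percentage` columns of the exported data.

```r
# Hypothetical helper: tally space-separated syllables into a frequency table.
count_syllables <- function(lines) {
  # Split every line on whitespace and pool the tokens.
  tokens <- unlist(strsplit(lines, "\\s+"))
  tokens <- tokens[nzchar(tokens)]

  # Count each distinct token, most frequent first.
  counts <- sort(table(tokens), decreasing = TRUE)

  data.frame(
    Word = names(counts),
    Freq = as.integer(counts),
    # Share of all tokens, expressed in percent.
    Percentage = 100 * as.integer(counts) / length(tokens),
    stringsAsFactors = FALSE
  )
}

# Placeholder tokens stand in for segmented Burmese syllables.
freq_table <- count_syllables(c("a b a", "c a"))
print(freq_table)
```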

library(readxl)
library(dplyr)
# Path to the workbook; spaces inside an R string need no backslash escapes
excel_file <- "/Users/kaunghlazan/Downloads/Burmese Song Corpus /Main Song Lyrics Corpus/Final_Workbook_for_R.xlsx"

# Get all sheet names
sheets <- excel_sheets(excel_file)

# Read and combine all sheets
All_Lyrics <- lapply(sheets, function(sheet) {
  read_excel(excel_file, sheet = sheet)
}) %>% bind_rows()

# View the result
head(All_Lyrics)

Table (1) First six rows of the frequency data. The columns are:

  1. Word = the syllable
  2. Freq = the frequency of the syllable across the songs by a particular composer
  3. Percentage = the syllable’s percentage share of all the lyrics by that composer
  4. Total_Songs = the number of songs by that composer in this corpus
  5. Composer = the composer’s name


summary(All_Lyrics)
##      Word                Freq           Percentage         Composer        
##  Length:16769       Min.   :    1.0   Min.   :0.000143   Length:16769      
##  Class :character   1st Qu.:    4.0   1st Qu.:0.010700   Class :character  
##  Mode  :character   Median :   11.0   Median :0.032237   Mode  :character  
##                     Mean   :   76.1   Mean   :0.125231                     
##                     3rd Qu.:   37.0   3rd Qu.:0.106203                     
##                     Max.   :39129.0   Max.   :7.007939                     
##   Total_Songs   
##  Min.   : 20.0  
##  1st Qu.: 28.0  
##  Median : 43.0  
##  Mean   :174.6  
##  3rd Qu.: 49.0  
##  Max.   :955.0

Table (2) Summary of the whole Dataset

With this dataset, we may be able to answer the following question.

Which is the most frequent Burmese word?

library(ggplot2)
library(tidyverse)
All_Lyrics %>% 
  slice_max(order_by = Freq, n = 20) %>% 
  mutate(Ordered_Word = fct_reorder(Word, Freq)) %>% 
  ggplot(aes(x = Freq, y = Ordered_Word, fill = Composer)) +
  geom_col() +
  labs(
    title = "Burmese Word Frequencies by Composers", 
    x = "Frequency", 
    y = "Word",
    fill = "Composer"
  ) +
  theme_light()


That was a bad graph. The words need to be summarized across composers first: the graph above shows only the raw word counts of individual composers, so it compares word counts drawn from 955 songs with word counts drawn from 20 to 40 songs.
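The imbalance is easy to see with a small made-up example. The sketch below uses hypothetical toy data (two composers, two placeholder words) and base R's `ave()` to express each count as a share of its own composer's total, which is the role the `Percentage` column plays in the real dataset. The same step could be written with `group_by()` and `mutate()`.

```r
# Hypothetical toy data: composer A has a far larger catalogue than composer B.
toy_lyrics <- data.frame(
  Word     = c("w1", "w2", "w1", "w2"),
  Freq     = c(900, 100, 30, 70),
  Composer = c("A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# Total count per composer, repeated on each of that composer's rows.
composer_totals <- ave(toy_lyrics$Freq, toy_lyrics$Composer, FUN = sum)

# Share of each word within its own composer's lyrics, in percent.
toy_lyrics$Share <- 100 * toy_lyrics$Freq / composer_totals

print(toy_lyrics)
```

In raw counts, composer A dominates every word. In shares, "w2" turns out to be far more characteristic of composer B (70 percent) than of composer A (10 percent), which the raw counts hide.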

All_Lyrics %>% 
  ggplot(aes(x = Freq, y = Word)) +
  geom_col() +
  scale_x_log10() +  
  labs(
    title = "Burmese Word Frequencies by Composers", 
    x = "Frequency (log10)", 
    y = "Word"
  ) +
  facet_wrap(~Composer) +
  theme_light()

This is an even worse graph: lots of noise and no information. Summarizing should come to the rescue.

All_Lyrics_Combined <- All_Lyrics %>%
  group_by(Word) %>%
  summarise(Freq = sum(Freq)) %>%
  arrange(desc(Freq))
library(showtext)
showtext_auto()
font_add_google("Noto Sans Myanmar", "burmese")

All_Lyrics_Combined %>%
  slice_max(order_by = Freq, n = 20) %>%  
  mutate(Ordered_Word = fct_reorder(Word, Freq)) %>%
  ggplot(aes(x = Freq, y = Ordered_Word)) +
  geom_col(fill = "steelblue") +  
  labs(
    title = "Top 20 Most Frequent Burmese Words Across All Composers",
    x = "Total Frequency",
    y = "Word"
  ) +
  theme(
    axis.text.y = element_text(family = "burmese"),  # Only Burmese words
    strip.text = element_text(family = "burmese")     # Composer names if Burmese
  )

Figure (XX) Top 20 Most Frequent Burmese Words Across All Composers

အ (AH) and မ (MA)
These two letters, which are also words, are two of the most versatile in Burmese. အ (AH) has 13 entries in Burmese. Its lexical meanings are: the last letter of the Burmese alphabet; to be mute; to be at a loss for words; acoustically damaged, e.g. a cracked gong losing resonance