Background

The fact that “the” is the most frequently used word in English is no longer a surprise to most people with some interest in language teaching. Likewise, “say” is the most frequently used lexical verb, after the lexical uses of auxiliaries such as ‘have’, ‘do’ or ‘be’ in their different forms.

However, there are no such insights for Burmese yet. People have a general idea of where the script is most heavily used, or which letters have the fewest words beginning with them, gauging from the available dictionaries, but not at the depth and accuracy that a corpus analysis can offer.

This study attempts to address that gap by developing a mini corpus of 1,885 Burmese pop songs and analysing how the frequencies of ‘words’ differ across composers.

Script

So what does Burmese script look like? Here is an illustration for starters: a humorous comparison of the scripts of several Asian languages, including Burmese, by an illustrator whose name, “Itchy Feet in Asia”, coincidentally means the same thing in both English and Burmese idiomatic expressions.

knitr::include_graphics("http://languagelog.ldc.upenn.edu/~bgzimmer/itchyfeet.png")



Another easy way to describe Burmese script without body parts is that the letters are basically circles with openings facing in different directions, sometimes featuring eyes and limbs. Apologies: my attempt to explain the script without body parts failed again.

knitr::include_graphics("https://freelanguage.org/sites/default/files/styles/node_image_style/public/burmese-writing-system_omniglot-com_0.png?itok=h9qJfrj-")
The Burmese writing system (Source: Omniglot)




Various glyphs are then added onto these letters to make syllables, hence the description “alphasyllabary”. Burmese is mostly monosyllabic, meaning that root words are predominantly single syllables, and the majority of the vocabulary consists of compound words formed from monosyllabic roots.

E.g.

စာ+အုပ် - sa+ouq, meaning “book”, literally “text-bundle”

မိ+သား+စု - Mi+Tha+Su, meaning “family”, literally “Mother/maternal + people/offspring + collective/gathering”

Challenge

As with many other languages written with alphasyllabaries, word segmentation poses a big problem in the field of Burmese NLP. It is difficult to decide word boundaries because each syllable is itself a word, and it is context that decides whether it stands as a monosyllabic unit or forms part of a multi-syllabic chunk.

Moreover, available Burmese text uses spaces quite liberally. It is common to type a space after each syntactic constituent, but spaces also appear according to the idiosyncratic habits of any given typist. So spaces cannot be used to decide word or phrase boundaries either.

Temporary Fix

So this study uses a syllable segmenter made available by a Burmese NLP researcher to create a corpus in which frequencies can be calculated reliably.
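
For readers curious about how such a segmenter works, below is a minimal sketch of rule-based syllable breaking in R, loosely modelled on published regex-based approaches. It is not the segmenter actually used in this study, and it is a simplification that does not handle every orthographic case (independent vowels, digits, punctuation and loanwords are ignored).

library(stringr)

# Rough rule: start a new syllable before any Burmese consonant (U+1000-U+1021)
# that is not the lower member of a stacked pair (preceded by U+1039) and is not
# a syllable-final consonant (followed by asat U+103A or U+1039).
segment_syllables <- function(text) {
  marked <- str_replace_all(
    text,
    "(?<!\u1039)([\u1000-\u1021])(?![\u103A\u1039])",
    "\u200B\\1"   # insert a zero-width space as a break marker
  )
  sylls <- unlist(str_split(marked, "\u200B"))
  sylls[sylls != ""]
}

segment_syllables("စာအုပ်")   # expected: "စာ" "အုပ်"
segment_syllables("မိသားစု")  # expected: "မိ" "သား" "စု"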

Corpus

The corpus consists of 1,885 Burmese songs, collected at random, from the 1980s to the present. All the lyrics were broken into syllables, hereafter referred to as ‘words’. The syllable-segmented texts in .txt files were loaded into the WordSmith 5 concordancer, as it is the only concordancer software found to read Burmese script accurately and produce corpus data. The frequency data was then exported to an Excel file.


library(readxl)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.4.1
## Warning: package 'forcats' was built under R version 4.4.1
## Warning: package 'lubridate' was built under R version 4.4.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.1     ✔ readr     2.1.5
## ✔ ggplot2   4.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
excel_file <- "/Users/kaunghlazan/Downloads/Burmese Song Corpus /Main Song Lyrics Corpus/Final_Workbook_for_R.xlsx"

# Get all sheet names
sheets <- excel_sheets(excel_file)

# Read and combine all sheets
All_Lyrics <- lapply(sheets, function(sheet) {
  read_excel(excel_file, sheet = sheet)
}) %>% bind_rows()

# View the result
head(All_Lyrics)

Table (1) First six rows of the frequency data:

  1. Word = the syllable,
  2. Freq = the frequency of the syllable in the collection of songs by the particular composer,
  3. Percentage = the percentage the syllable contributes to all the lyrics by that composer,
  4. Total_Songs = the number of songs by the particular composer in this corpus,
  5. Composer = the composer’s name


summary(All_Lyrics)
##      Word                Freq           Percentage         Composer        
##  Length:16769       Min.   :    1.0   Min.   :0.000143   Length:16769      
##  Class :character   1st Qu.:    4.0   1st Qu.:0.010700   Class :character  
##  Mode  :character   Median :   11.0   Median :0.032237   Mode  :character  
##                     Mean   :   76.1   Mean   :0.125231                     
##                     3rd Qu.:   37.0   3rd Qu.:0.106203                     
##                     Max.   :39129.0   Max.   :7.007939                     
##   Total_Songs   
##  Min.   : 20.0  
##  1st Qu.: 28.0  
##  Median : 43.0  
##  Mean   :174.6  
##  3rd Qu.: 49.0  
##  Max.   :955.0
Table (2) Summary of the whole Dataset

So, which Burmese word is used most frequently?

All_Lyrics_Combined <- All_Lyrics %>%
  group_by(Word) %>%
  summarise(Freq = sum(Freq)) %>%
  arrange(desc(Freq))
library(showtext)
## Loading required package: sysfonts
## Warning: package 'sysfonts' was built under R version 4.4.1
## Loading required package: showtextdb
library(ggplot2)
library(dplyr)
library(forcats)
showtext_auto()
font_add_google("Noto Sans Myanmar", "burmese")

All_Lyrics_Combined %>%
  slice_max(order_by = Freq, n = 20) %>%  
  mutate(Ordered_Word = fct_reorder(Word, Freq)) %>%
  ggplot(aes(x = Freq, y = Ordered_Word)) +
  geom_col(fill = "steelblue") +  
  labs(
    title = "Top 20 Most Frequent Burmese Words Across All Composers",
    x = "Total Frequency",
    y = "Word"
  ) +
  theme(
    axis.text.y = element_text(family = "burmese"),  # Only Burmese words
    strip.text = element_text(family = "burmese")     # Composer names if Burmese
  )


Figure (1) Top 20 Most Frequent Burmese Words Across All Composers

အ (AH) and မ (MA)
These two letters/words are two of the most versatile words in Burmese. အ (AH) has 13 entries in the Burmese dictionary. Its lexical meanings include: the last letter of the Burmese alphabet; to be mute; to be at a loss for words; to be acoustically damaged, e.g. a cracked gong losing its resonance; to be naive; to have lost influence; of soil, to have lost fertility; and of sculptures, to be made in bad proportions. Functionally, it is a noun-maker particle.
မ (Ma) is a negating particle, with 13 lexical entries and 4 functional entries in the dictionary. That these two syllables appear at the top of the list across the 1,885 songs is also in line with a similar test study of a smaller corpus of Burmese novels.
The illustration below confirms this again, showing the top frequencies for each composer.

library(showtext)
showtext_auto()
font_add_google("Noto Sans Myanmar", "burmese")

New_Lyrics<- All_Lyrics %>% 
  filter(Composer != "Various")
New_Lyrics %>% 
  group_by(Composer) %>%
  slice_max(order_by = Freq, n = 5) %>%
  ungroup() %>%
  mutate(
    # Create ordered factor manually
    Word = paste(Composer, Word, sep = "___"),
    Word = reorder(Word, Freq)
  ) %>%
  ggplot(aes(x = Freq, y = Word)) +
  geom_col(fill = "steelblue") +
  scale_x_log10()+
  facet_wrap(~Composer, scales = "free_y") +
  scale_y_discrete(labels = function(x) gsub("^.*___", "", x)) +  # This line is necessary: it is the difference between a graph and a heap of jumbled letters. 
  labs(
    title = "Figure(2) Top 5 Words by Composers", 
    x = "Frequency (log10)",
    y = NULL
  ) +
  theme_light() +
  theme(
    axis.text.x = element_text(size = 5),
    axis.text.y = element_text(family = "burmese")
  )

The faceted figures above demonstrate the consistently high frequency of ‘Ah’ and show three composers, El Phyu, Lin Lin, and Saw K Sel, whose ‘Ma’ was not as frequent as the others’. It may be hypothesised that their lyrics are more lexically dense and use fewer functional words, since the illustration below shows numbers otherwise comparable to the other composers. Testing such a hypothesis would require a highly rigorous tagging of all the lyrics.



Lexical Density

Lexical density here is calculated as the ratio between the number of types (the number of unique words) and the square root of the total number of tokens, known as the root type-token ratio (RTTR). The rooted TTR is selected to adjust for some composers’ disproportionately larger or smaller bodies of lyrics.
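
As a quick illustration of the formula, here is a toy calculation with invented numbers (not taken from the corpus): a composer with 1,200 unique syllables across 45,000 total syllables would have an RTTR of about 5.66.

# Toy RTTR calculation with made-up numbers, purely to illustrate the
# types / sqrt(tokens) formula used in the summary below.
toy_types  <- 1200    # hypothetical number of unique syllables
toy_tokens <- 45000   # hypothetical total syllable count
round(toy_types / sqrt(toy_tokens), 2)
## [1] 5.66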

The following illustration provides the RTTR values. A higher RTTR means the composer uses a greater variety of words in their songs. This can be achieved by the composer skilfully arranging a high number of lexical words within the word-count and rhyme constraints of a song, like a small travel bag in which all the essentials are tightly packed, leaving no wasted space.

library(tidyverse)

Lexical_density <- New_Lyrics %>%
  group_by(Composer) %>%
  summarise(
    Unique_Tokens = n(),
    Total_Tokens = sum(Freq),
    Type_to_Text_Ratio = round(Unique_Tokens / sqrt(Total_Tokens), 2)
  ) 

Lexical_density %>% 
  ggplot(aes(x=Unique_Tokens, y=Type_to_Text_Ratio, color=Composer))+
  geom_point(size=5, show.legend = FALSE)+
  geom_label(aes(label = Composer), 
             vjust = -0.1, 
             hjust = 1,
             angle=45,
             size = 2) +
   labs(
    title = "Lexical Density Measure by RTTR" )+
      theme_light()+
  theme(legend.position = "none")


However, the corpus analysis tool would not have registered the different senses of polysemous words.
For example,
ဘုရားသိကြားလို့ မမကို အိမ်ကနေ ရအောင် နိင်ရင် အချစ်တသက်လုံး လျော့စေရဘူး။ တအိမ်လုံး ဝိုင်း ထားစေရမယ်
There are five senses of မ (Ma) in this sentence, but the corpus tools would have counted just one type, which might have a slight impact on the true measure of the composer’s lexical complexity.
Below is the translation with the five different senses of ‘Ma’ highlighted.
If deities help me snatch you (an older female romantic prospect) out of your over-protective family, my love for you will never diminish. I will make sure my whole family treats you like a princess too. (This is not from any of the lyrics, but a sentence demonstrating the polysemous nature of just one syllable.)


2-Gram and 3-Gram Analyses

The monosyllabic analysis offers some insight into the frequencies and lexical densities of the various composers. However, 2-gram and 3-gram analyses give more insight into the compound words at play.
The table below shows that the top two bisyllabic words are ‘achit’ (love) and ‘bawa’ (life), and the top trisyllabic words are ‘Hna-lone-thar’ (heart) and ‘ta-yout-tel’ (alone). These are the top results of the 2-gram and 3-gram lists produced by WordSmith 5.
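
The n-gram counts themselves come from WordSmith 5, but a simple cross-check could be run in R along the following lines. This is only a sketch: the count_ngrams() helper and the short syllables vector are invented for illustration; the real input would be the syllable-segmented lyrics, one song at a time so that n-grams do not cross song boundaries.

library(dplyr)

# Count adjacent syllable n-grams in a vector of syllables (hypothetical helper).
count_ngrams <- function(syllables, n) {
  starts <- seq_len(length(syllables) - n + 1)
  tibble(
    ngram = vapply(
      starts,
      function(i) paste0(syllables[i:(i + n - 1)], collapse = ""),
      character(1)
    )
  ) %>%
    count(ngram, sort = TRUE, name = "Frequency")
}

# Toy input: a handful of syllables, not real lyrics.
syllables <- c("အ", "ချစ်", "ဆုံး", "အ", "ချစ်")
count_ngrams(syllables, 2)   # "အချစ်" should come out on top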

library(tidyverse)

NGramData <- tribble(
  ~N, ~`N-Gram`,         ~Frequency, ~Translation,
  2,  "အချစ်",           9534,       "Love",
  2,  "ဘဝ",              5738,       "Life",
  2,  "တစ်ယောက်",        3194,       "partitive word for a person",
  2,  "မရှိ",             2838,       "do not have/be not",
  2,  "အချိန်",          2796,       "time",
  2,  "အိပ်မက်",         2379,       "dream",
  2,  "ကြင်နာ",          2201,       "affectionate",
  2,  "နှလုံး",           2145,       "Heart",
  2,  "ကလေး",            2144,       "Babe",
  3,  "နှလုံးသား",        1859,       "Heart",
  3,  "တစ်ယောက်တည်း",    1687,       "alone",
  3,  "တို့နှစ်ယောက်",     1250,       "two of us",
  3,  "မင်းအတွက်",       1002,       "for you",
  3,  "အကြင်နာ",         862,        "affection",
  3,  "အရာရာ",           854,        "everything",
  3,  "အမြဲတမ်း",        828,        "always/forever",
  3,  "အချစ်တွေ",        789,        "love (plural)",
  3,  "ရဲ့အချစ်",        754,        "the love of"
)

NGramData %>% knitr::kable()
|  N | N-Gram           | Frequency | Translation                  |
|---:|:-----------------|----------:|:-----------------------------|
|  2 | အချစ်             |      9534 | Love                         |
|  2 | ဘဝ               |      5738 | Life                         |
|  2 | တစ်ယောက်          |      3194 | partitive word for a person  |
|  2 | မရှိ               |      2838 | do not have/be not           |
|  2 | အချိန်            |      2796 | time                         |
|  2 | အိပ်မက်           |      2379 | dream                        |
|  2 | ကြင်နာ            |      2201 | affectionate                 |
|  2 | နှလုံး             |      2145 | Heart                        |
|  2 | ကလေး              |      2144 | Babe                         |
|  3 | နှလုံးသား          |      1859 | Heart                        |
|  3 | တစ်ယောက်တည်း      |      1687 | alone                        |
|  3 | တို့နှစ်ယောက်       |      1250 | two of us                    |
|  3 | မင်းအတွက်         |      1002 | for you                      |
|  3 | အကြင်နာ           |       862 | affection                    |
|  3 | အရာရာ            |       854 | everything                   |
|  3 | အမြဲတမ်း          |       828 | always/forever               |
|  3 | အချစ်တွေ          |       789 | love (plural)                |
|  3 | ရဲ့အချစ်           |       754 | the love of                  |

Another comparison: the use of “chit” (love, ချစ်) by each composer.

library(tidyverse)

All_Lyrics_Love<-All_Lyrics %>% 
  rename(
    Love = Freq
  )

Total_Freq_Table <- All_Lyrics_Love %>% 
  group_by(Composer) %>%
    mutate(Total_Freq = sum(Love)) %>%
    ungroup()
  

Love_and_total <- Total_Freq_Table %>% 
  filter(Word == "ချစ်", Composer != "Various") %>%
  mutate(Other_Words = Total_Freq - Love) %>% 
  pivot_longer(
    cols = c(Other_Words, Love),
    names_to = "Word_Category",
    values_to = "Frequencies"
  ) %>% 
mutate(Word_Category = factor(
    Word_Category,
    levels = c("Other_Words", "Love") # Forces 'Other_Words' to the bottom
  ))


Love_and_total %>% 
  ggplot(aes(x = Composer, y = Frequencies, fill = Word_Category)) +
  geom_col(position = "stack") + # Use 'stack' for total counts
  labs(
    title = "Frequency of 'Love' vs. 'Other Words' by Composer",
    x = "Composer",
    y = "Total Frequency"
  ) +
    # --- Conditional Labeling ---
  geom_label(
    aes(
      label = if_else(
          Word_Category == "Other_Words", # Condition: ONLY label the bottom segment
          Composer,                       # Label content: The Composer's name
          NA_character_                   # Otherwise, show nothing
      )
    ), 
    position = position_stack(vjust = 0.5), # Centers label vertically in the segment
    angle = 90, 
    size = 3,
    color = "black" # Use a color that contrasts with the "Other_Words" bar color
  ) +
  
  scale_y_log10()+
   labs(
    title = "Frequency of 'love' vs other words" )+
  theme(legend.position = "right")+ 
  theme_light()+
  theme(axis.text.x = element_text(angle = 45, size=8, hjust=1))
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_label()`).