Of Chains and Linguistics: Applying the Markov Model to Language

I apply the Markov chain to help me understand my “native” tongue of Singlish better.

true
May 13, 2021

As I transition from student to a member of the full-time workforce, I have been doing a lot of reflecting on moments in NUS and the various takeaways. My final-year project, or Honors Dissertation, was a defining moment in my four years — especially since it was rare for a Business student to choose the research route to meet degree requirements.

Because of my Business Analytics specialization, my coursework was closely tied with data science. Interestingly, most of my projects ended up revolving around natural language processing (NLP) and/or text analytics. For my final-year project, I embarked on the ambitious objective of analyzing the grammar of Singlish (Singapore colloquial English) — something not yet widely explored in the field. It turned out to be a really meaningful way to cap off my final semester, as an opportunity to apply the various data science skills I learnt as well as experience from previous projects.

One of the models I used was the Markov chain, which I first learnt about in early 2020 as part of a stochastic modelling module. After using it previously in a text analytics project, I decided to take it further and explore its potential for this project.

State to state

Simply put, a Markov chain models a number of distinct, known states (known as a state space), and the probabilities that a system would transit from one state to another.

Example illustration by V. Powell

The defining feature of a Markov chain is that each transition is independent, that is, the probability of each transition is not affected by the probabilities of any past transition. This is referred to as the Markov property or “memorylessness”.

Markov chains are also categorized into discrete-time or continuous-time, depending on whether the system’s transitions follow discrete or continuous time steps.

A discrete-time Markov chain could be a simple model tracking the weather each day. On the other hand, a continuous-time Markov chain is usually more complicated — an example might be a queuing model, where people would join and leave the queue at rates following, say, an exponential distribution with means of 5 and 3 minutes respectively.

Applying this to text

We can consider text as a discrete-time system — the state space being all possible words, and at each time step, the system “jumps” from one word to another, eventually forming phrases, sentences, and so on.

Let’s see how this works with a small set of words! I’ve also added in some made-up probabilities, so just imagine that if the current word is “he”, the probability that the next word will be “is” would be 100%.

Interestingly, this forms the basis of predictive text! (I also explored this in my previous project.) Of course, modern predictive text software has grown much more complex, using even more cutting-edge models.

Here comes the big data

Now that we have a better idea of how we can model text as transitions between words, we can then apply this to large amounts of text.

In particular, since my project aimed to provide a more quantitative analysis of Singlish, I had to work out a way to parse though large amounts of text, and then get the transition probabilities between each and every word. As such, I wrote code that did the following:

(And, of course, not forgetting the pre-processing which involved tasks like standardizing everything to lowercase, as well as manual correction of commonly misspelled words.)

The result would look something like this:

Using the same example from the previous section
start_word end_word prob
going home 1.0
he is 1.0
is going 0.6
is really 0.4
really tired 1.0

My project dataset comprised of 20,000 sentences from a dataset of text messages contributed by NUS students and compiled by the School of Computing (which I call the NUS corpus), as well as 10,000 sentences scraped randomly from Singaporean local forums (Sammyboy and HardwareZone — do note that many of their sub-forums are NSFW).

It took about 20 minutes for my code to process all my text data, perhaps showing how much more memory text data tends to take as compared to, say, numeric data — though I’ve since improved on the code to run more efficiently. Hopefully this made things faster!

A more quantitative approach to linguistics

Extensive research has been done on the “patterns” of Singlish, and literature states that particular grammar features are common in Singlish. Can I compute the probability of them, based on my data?

Yes! Enter the parts_of_speech table, from the Moby Project by Grady Ward:

Using the same example from the previous section
word pos
going verb
he pronoun
home noun
is verb
really adverb
tired adjective

The next piece of the puzzle comes in modelling grammatical features as — once again — transition probabilities. For example, let’s look at one the features I investigated — the use of “got” as a possessive function (as opposed to “have” as in standard English):

Remember the transitions table? Since it contains all word-to-word transitions and their frequencies, we can then use it alongside the parts_of_speech table, and apply what I call Join, Filter, Aggregate (which should ring a bell for those familiar with analytics)!

Let’s see how this approach works:

And let’s see we can apply it to our previous example of “got” as a possessive function:

And there we have it! For my study, I got some interesting results for this particular grammar feature:

At first glance, the probabilities were lower than expected. However, this appeared to align with existing research, which propose that the use of Singlish lies on a “spectrum”. That is, the Singlish seen in our everyday usage actually lies somewhere in between perfectly standard English and a hypothetical language with maximum Singlish features.

My data was something like this

For this project, a large proportion of data comes from students, who tend to use relatively more standard English even when texting. The other data source, local forums, may also be where users tend to largely follow to standard English grammar — though the extensive usage of Singaporean vernacular makes the text easily identifiable as Singlish.

Conclusion

I concluded my project by proposing that if a “uniquely Singaporean” text generation model were to be made, the Markov model might be one way to model the specific grammatical features that make Singlish, well, Singlish.

As a closing remark, you may also remember the key assumption of independence between state transitions in a Markov model. I acknowledge that this may not be true in the case of text — as the usage of certain words may impact the distribution of other words down the line in a sentence. This could be mitigated through a model which models phrases (groups of more than 1 word), which I also explored and wrote code for, though at least for my project the results were not as promising. (Come to think of it, this may be a topic for another post!)

Nonetheless, this was my way of bringing a more quantitative approach to a line of research which has always been, for the most part, qualitative.

Project repository

Medium version

If this has been interesting to you, I am happy to discuss this further with you via LinkedIn or !