Word Forecast

Bjenk Ellefsen

About forecasting words

In natural language processing, prediction usually means estimating which word is likely to come next in a sequence. This is done with probabilistic models called ngrams. Predicting the next unit in a linguistic sequence has important applications today: machine translation, speech recognition, spelling correction, typing assistance and more.

As the final exercise for the Coursera Data Science Specialization, we were tasked with building such a model and an app from scratch. This is my very humble effort. I say humble because, given the time constraints, I have to admit that my model has not shown high accuracy, although it performs well otherwise. With more time, I could improve it many times over. For now, I have followed the well-known approach of “build it, make it work and make it better after”.

Forecasting words

  • Enter text as if writing a sentence.
  • The words are then compared with the datasets for matching sequences.
  • The most likely words are then shown, and the highest-scoring word is presented on the right.

A few words on the model

In natural language processing, the most common tasks are tokenizing and forming ngrams. Tokenizing is a computational process that breaks text up into words, sentences or any other meaningful units for analysis, which are called tokens. Ngrams are then formed from the tokens. Briefly, an “ngram” is a sequence of items that follow one another. The “n” in “ngram” indicates the number of items, as in “unigram”, “bigram”, “trigram”, and so on. Ngrams are a concept from the statistical modeling of naturally evolved human languages. We call these languages “natural” to distinguish them from formal languages such as programming languages.
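As an illustration of the two steps just described, here is a minimal Python sketch. It is not the code behind the app: the function names `tokenize` and `ngrams` are hypothetical, and the tokenizer assumes a very simple rule (lowercase, keep runs of letters and apostrophes), which a real pipeline would likely refine.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Assumption for this sketch: a token is a run of letters or apostrophes.
    """
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("The cat sat on the mat")
bigrams = ngrams(tokens, 2)    # pairs of adjacent words
trigrams = ngrams(tokens, 3)   # triples of adjacent words
```

A text shorter than n tokens simply yields no ngrams of that order, which is why the helper returns an empty list in that case.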

In terms of language modeling, an ngram model is a probabilistic model that predicts the next item in a sequence from the n-1 items that precede it. To build a model, we must first break down natural language data sources into ngrams. In this case, we built five ngram tables, from 1-gram to 5-gram.
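To make the idea of ngram tables concrete, the following Python sketch counts how often each ngram occurs, for every order from 1 to 5. It is only an assumption about how such tables could be stored (one `Counter` per order); the name `build_tables`, the whitespace tokenization and the toy corpus are all hypothetical.

```python
from collections import Counter


def build_tables(sentences, max_n=5):
    """Return {n: Counter mapping n-gram tuples to counts} for n = 1..max_n."""
    tables = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenizer
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                tables[n][tuple(tokens[i:i + n])] += 1
    return tables


# Toy corpus, invented for the example.
corpus = ["the cat sat on the mat", "the cat ate the fish"]
tables = build_tables(corpus)

unigram_the = tables[1][("the",)]          # frequency of the word "the"
bigram_the_cat = tables[2][("the", "cat")]  # frequency of "the cat"
```

Ngrams are counted within each sentence only, so no ngram spans a sentence boundary in this sketch.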

Stupid Backoff smoothing

We used the Stupid Backoff algorithm developed at Google by Brants, Popat, Xu, Och, and Dean (2007). This scheme differs from other smoothing methods in that it does not produce normalized probabilities but relies on a score calculated from relative frequencies.

\( S(w_i|w^{i-1}_{i-k+1}) = \begin{cases} \frac{f(w^i_{i-k+1})}{f(w^{i-1}_{i-k+1})} & \text{if } f(w^i_{i-k+1}) > 0 \\ \alpha S(w_i|w^{i-1}_{i-k+2}) & \text{otherwise} \end{cases} \)

While \( \alpha \) may be made to depend on \( k \), it is set here to 0.4, the value Brants et al. used in the experiments reported in their paper. This scheme has proved computationally less expensive and offers good results with large-scale data. The overall approach is that the model takes the last words typed and tries to find matching sequences among the 5-grams. If no match is found, it backs off to the 4-grams, then to the 3-grams, the 2-grams and finally the unigrams. Each time it backs off, it calculates the score according to the Stupid Backoff scheme.
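The back-off procedure can be sketched in Python as follows. This is an illustrative reading of the formula, not the app's actual code: `build_tables`, `score` and `predict` are hypothetical names, the corpus is invented, and `predict` scores every word in the vocabulary by brute force, which a real app would probably avoid. At the unigram level the score is taken to be the word's frequency divided by the total number of tokens.

```python
from collections import Counter

ALPHA = 0.4  # back-off factor, the value reported by Brants et al. (2007)


def build_tables(sentences, max_n=5):
    """Return {n: Counter mapping n-gram tuples to counts} for n = 1..max_n."""
    tables = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                tables[n][tuple(tokens[i:i + n])] += 1
    return tables


def score(word, context, tables, alpha=ALPHA):
    """Stupid Backoff score of `word` given a tuple of preceding words."""
    max_n = max(tables)
    # Keep at most max_n - 1 words of context (4 words for 5-gram tables).
    context = tuple(context)[-(max_n - 1):] if max_n > 1 else ()
    penalty = 1.0
    while context:
        seen = tables[len(context) + 1][context + (word,)]
        if seen > 0:
            # Relative frequency of the full n-gram among its context.
            return penalty * seen / tables[len(context)][context]
        context = context[1:]  # drop the oldest word and back off
        penalty *= alpha
    total = sum(tables[1].values())
    if total == 0:
        return 0.0
    return penalty * tables[1][(word,)] / total


def predict(text, tables, k=5):
    """Return the k highest-scoring (word, score) pairs for the typed text."""
    context = tuple(text.lower().split())
    scores = {w: score(w, context, tables) for (w,) in tables[1]}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]
tables = build_tables(corpus)
top_five = predict("the cat", tables)
backed_off = score("sat", ("big", "dog"), tables)
```

In this toy corpus, “big dog sat” never occurs, so the score for “sat” after “big dog” falls back to the bigram “dog sat” and is multiplied once by 0.4. Because these scores are not normalized, they only make sense for ranking candidate words, not as probabilities.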

A modest word forecasting app

Without further rambling, here is the simple application that predicts the next word based on the text you enter. The application uses a sample of text data from blogs, Twitter and news sources.

All that is needed is to type in some words and click on the button. The top five predicted words are shown on the left, and the top prediction is shown on the right.