Simple text prediction Shiny app

Hemantakumar Hegde
2020-03-21

Information about the app

  • Data source

    • Built using a corpus of English-language news, blog and Twitter posts totalling 583 MB of data.
  • Development

    • Building this was an enjoyable experience because of the challenges involved:
    • Resource-intensive preprocessing and the search for suitable tools. I ran out of the limited RAM available and had to rewrite the preprocessing script to handle the data in 100 chunks and then consolidate the results. That took about 3 hours but consumed the whole corpus.
    • I noticed how much processing power was required, and that the R language was so far able to use only 1 of the processor's 8 cores!
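The chunk-and-consolidate idea above can be sketched as follows. This is a minimal illustration in Python, not the app's R preprocessing script: the function names (`chunks`, `count_tokens`, `process_in_chunks`) are hypothetical, whitespace tokenization stands in for the real preprocessing, and the sketch fixes the number of lines per chunk rather than the number of chunks.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List


def chunks(lines: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` lines without loading everything."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def count_tokens(chunk: List[str]) -> Counter:
    """Stand-in for the real preprocessing: count lower-cased whitespace tokens."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def process_in_chunks(lines: Iterable[str], size: int) -> Counter:
    """Process one chunk at a time and consolidate the partial counts."""
    total = Counter()
    for chunk in chunks(lines, size):
        total.update(count_tokens(chunk))  # only one chunk is held in memory
    return total
```

Because `lines` can be any iterable, an open file handle can be passed directly, so only one chunk of text is in memory at a time; the consolidated counts are the same whatever chunk size is chosen.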

…continued

  • Choices to make about removing sparse words, foreign-language words and misspellings, character encoding, whether to keep or drop features, treatment of punctuation and accented letters, preserving capitalization, etc.

  • Starting to think about the required algorithm from scratch!

  • Performance

    • The final total size of the algorithm is 111 MB.
    • Loads in about 5 seconds on the Shiny server if the app is in sleep mode, and within a second if it is active.
    • Returns the prediction within 2 seconds.

The algorithm

Ended up with an algorithm which started as a Markov chain and is now similar to the Katz back-off model! How my algorithm works:

  • Tried creating n2grams, n3grams and so on up to n6grams of word tokens (created only within individual sentences), but I am now using ONLY the n6grams, as I noticed the others were redundant.

  • The algorithm tokenizes the input text and stems it (as the training data features were also stemmed).

  • Then it tries to match up to 5 words of the input (in sequence) to the stored n6grams. If it cannot match all 5 words, it falls back to matching only 4, and so on, down to a single word.
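The steps above can be sketched roughly as follows. This is an illustration in Python, not the app's R code, and it rests on several assumptions: stemming is omitted, sentences are split on `.`, `!` and `?`, the input words are matched against the words immediately preceding the last word of each 6-gram, and the most frequent continuation wins. All function names are hypothetical.

```python
import re
from collections import Counter, defaultdict

N = 6  # only 6-grams are stored, as described in the slides


def tokenize(sentence):
    """Lower-cased word tokens; stemming is left out of this sketch."""
    return re.findall(r"[a-z']+", sentence.lower())


def six_grams(text):
    """Yield 6-grams built within individual sentences only."""
    for sentence in re.split(r"[.!?]+", text):
        words = tokenize(sentence)
        for i in range(len(words) - N + 1):
            yield tuple(words[i:i + N])


def build_index(text):
    """Index each 6-gram's final word by the 1 to 5 words that precede it."""
    index = defaultdict(Counter)
    for gram in six_grams(text):
        *context, nxt = gram
        for k in range(1, N):
            index[tuple(context[-k:])][nxt] += 1
    return index


def predict(index, text):
    """Try the last 5 input words, then 4, and so on down to 1."""
    words = tokenize(text)
    for k in range(min(N - 1, len(words)), 0, -1):
        candidates = index.get(tuple(words[-k:]))
        if candidates:
            return candidates.most_common(1)[0][0]
    return None  # nothing matched at any length
```

With an index built from a training text, `predict(index, "some input text")` returns the best-matching next word, or `None` when even the last single word is unknown. Deriving the shorter contexts from the 6-grams is one way to see why separately stored shorter n-grams can be redundant, at the cost of ignoring sentences shorter than six words.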

How to use

  • Click here to open the app. Just type your text and click on predict.
  • It can predict 1 word or a sequence of a selected number of words (memoryless Markov chain with a state of 2 to 5 words). Please also tick “Randomize”, otherwise it will repeat the same sequence of words.
  • The result is displayed appended to your input text.
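To show why the “Randomize” option matters when predicting a sequence, here is a small sketch in Python (not the app's R code). The lookup table `TABLE` is a made-up toy example, and the names `next_word` and `generate` are hypothetical; the sketch assumes that without randomizing the most frequent continuation is always chosen, and that with randomizing the continuation is sampled in proportion to its count.

```python
import random
from collections import Counter

# Hypothetical toy table: context (tuple of words) -> counts of the next word.
TABLE = {
    ("i",): Counter({"am": 3, "like": 1}),
    ("am",): Counter({"so": 2, "here": 1}),
    ("so",): Counter({"happy": 1}),
}


def next_word(table, words, max_state=5, randomize=False, rng=random):
    """Back off from the longest available context to a single word."""
    for k in range(min(max_state, len(words)), 0, -1):
        counts = table.get(tuple(words[-k:]))
        if counts:
            if randomize:
                # Sample a continuation in proportion to its count.
                choices, weights = zip(*counts.items())
                return rng.choices(choices, weights=weights, k=1)[0]
            return counts.most_common(1)[0][0]  # always the same choice
    return None


def generate(table, text, n_words, randomize=False, rng=random):
    """Append up to n_words predicted words to the input text."""
    words = text.lower().split()
    for _ in range(n_words):
        word = next_word(table, words, randomize=randomize, rng=rng)
        if word is None:
            break  # no known continuation
        words.append(word)
    return " ".join(words)
```

Without `randomize=True` the same input always yields the same continuation, which is the repetition the slide warns about; with it, repeated calls can follow different paths through the table.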

Thank you