Text Prediction Application

Poobalan
22 April 2016

This application predicts the next word from user input, using at most the last two words entered. Predictions are based on the Twitter, blog, and news datasets provided by SwiftKey.

Challenges and Solutions

Challenges faced

  • hardware limitations (8 GB RAM, Intel i5 1.8 GHz processor)
  • data cleansing (punctuation, URLs, @, #, slang, profanity, non-UTF characters, extra whitespace, mixed upper/lower case, numbers, special characters, typos, elongated expressions such as “hahahaha”, etc.)
  • data size (over 4 million rows combined across the three datasets)

Solutions/Workarounds

  • hardware and data size: use a smaller sample of about 10% of the provided datasets.
  • data cleansing: remove URLs, RTs, @, #, profanity, non-UTF characters, extra whitespace, numbers, special characters, and repeated characters (such as aaaaa, ooooook), and convert text to lowercase.
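
The sampling and cleansing steps above could be sketched roughly as follows. This is a minimal Python illustration, not the application's actual code: the function names `sample_lines` and `clean_line`, the tiny `PROFANITIES` set, the seed, and the choice to drop all non-ASCII characters are assumptions made for the example.

```python
import random
import re

# Hypothetical placeholder; the source does not say which profanity list was used.
PROFANITIES = {"darn", "heck"}

def sample_lines(lines, fraction=0.10, seed=42):
    """Keep each line with probability `fraction` (about 10% by default)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

def clean_line(text):
    """Normalize one line of raw text along the lines listed above."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)      # URLs
    text = re.sub(r"\brt\b", " ", text)                       # retweet markers
    text = re.sub(r"[@#]\w+", " ", text)                      # mentions and hashtags
    text = text.encode("ascii", "ignore").decode("ascii")     # assumption: drop non-ASCII
    text = re.sub(r"(.)\1{2,}", r"\1", text)                 # collapse runs of 3+ characters
    text = re.sub(r"[^a-z' ]", " ", text)                     # numbers, special characters
    words = [w for w in text.split() if w not in PROFANITIES]
    return " ".join(words)                                     # also squeezes whitespace

print(clean_line("RT @bob: Ooooook, visit http://x.co NOW 123 #yay"))
```

Collapsing a run of three or more identical characters to a single character turns "ooooook" into "ok" while leaving ordinary double letters (as in "good") untouched.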

Algorithm

Two algorithms were used:

Simple Back-off

Simple Back-off first looks for candidate words in a 3-word table (trigram), then in a 2-word table (bigram). If both searches fail, it returns the word with the highest occurrence in a 1-word table (unigram).
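
The back-off order described above can be illustrated with a small Python sketch. The table layout (counters keyed by word tuples) and the names `build_tables`, `best_continuation`, and `predict_backoff` are assumptions for illustration; the application's own implementation may differ.

```python
from collections import Counter

def build_tables(sentences):
    """Count unigrams, bigrams, and trigrams from whitespace-tokenized text."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for sentence in sentences:
        words = sentence.split()
        uni.update(words)
        bi.update(zip(words, words[1:]))
        tri.update(zip(words, words[1:], words[2:]))
    return uni, bi, tri

def best_continuation(table, context):
    """Most frequent final word among n-grams whose prefix equals `context`."""
    matches = {gram[-1]: n for gram, n in table.items() if gram[:-1] == context}
    return max(matches, key=matches.get) if matches else None

def predict_backoff(text, uni, bi, tri):
    """Try the trigram table, then the bigram table, then the top unigram."""
    words = tuple(text.lower().split())
    if len(words) >= 2:
        word = best_continuation(tri, words[-2:])
        if word:
            return word
    if len(words) >= 1:
        word = best_continuation(bi, words[-1:])
        if word:
            return word
    return uni.most_common(1)[0][0]

corpus = ["i love you", "i love you", "i love cake",
          "we love cake", "we love cake", "cake is good"]
uni, bi, tri = build_tables(corpus)
print(predict_backoff("i love", uni, bi, tri))       # found in the trigram table
print(predict_backoff("they love", uni, bi, tri))    # falls back to the bigram table
print(predict_backoff("hello world", uni, bi, tri))  # falls back to the top unigram
```

The linear scan in `best_continuation` keeps the sketch short; a table indexed by context would be the more realistic choice for lookups against hundreds of thousands of rows.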

Simple Good-Turing

This algorithm takes into account that the user may enter a word that is not in the dictionary, and it estimates probabilities for such unseen words to make a better prediction. It checks the 3-word table first and, if no match is found, the 2-word table. If no match is found in either table, it returns a “not found” message.
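
To show how probability can be reserved for unseen words, here is a minimal Python sketch of the basic Good-Turing estimate, where a count c is replaced by c* = (c + 1) N_{c+1} / N_c and N_c is the number of distinct items seen exactly c times. It is an illustration under assumptions, not the application's code: the full Simple Good-Turing method also smooths the N_c values before applying this formula, which is omitted here, and the function name `good_turing` and the fallback to the raw count when N_{c+1} is zero are choices made for the example.

```python
from collections import Counter

def good_turing(counts):
    """Return (adjusted_counts, unseen_mass) using the basic Good-Turing estimate."""
    total = sum(counts.values())
    # N_c: how many distinct items were seen exactly c times
    freq_of_freq = Counter(counts.values())
    adjusted = {}
    for item, c in counts.items():
        n_c, n_next = freq_of_freq[c], freq_of_freq[c + 1]
        # c* = (c + 1) * N_{c+1} / N_c; keep the raw count if N_{c+1} is zero (assumption)
        adjusted[item] = (c + 1) * n_next / n_c if n_next else float(c)
    # Probability mass reserved for items that were never seen: N_1 / N
    unseen_mass = freq_of_freq[1] / total
    return adjusted, unseen_mass

counts = Counter({"a": 3, "b": 2, "c": 2, "d": 1, "e": 1, "f": 1})
adjusted, unseen_mass = good_turing(counts)
print(adjusted["d"], adjusted["b"], unseen_mass)
```

In this toy example three of the ten observations are singletons, so 30% of the probability mass is set aside for items that never appeared in the data.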

Usage Instructions

1. Enter text, choose a prediction method, and click the Submit button in the sidebar.

2. The resulting prediction will appear in the main panel.

The application is accessible at https://libra22.shinyapps.io/TextPredictor/

Performance and Limitations

Performance

  • The application loads in under 10 seconds.
  • Prediction with Simple Back-off takes under 3 seconds.
  • Prediction with Simple Good-Turing takes under 5 seconds.

Limitations

  • predictions use at most the last two words entered, due to resource limitations
  • n-gram tables are small (500k rows or fewer per table), due to resource limitations
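
One way a table-size cap like the one above could be enforced is to keep only the most frequent n-grams. The Python sketch below is an assumption about how that might look, not the method actually used; the name `capped_ngram_table` and its parameters are hypothetical.

```python
from collections import Counter

def capped_ngram_table(sentences, n, max_rows=500_000):
    """Count n-grams and keep at most `max_rows` of the most frequent ones."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        # zip over n shifted copies of the word list yields consecutive n-grams
        counts.update(zip(*(words[i:] for i in range(n))))
    return counts.most_common(max_rows)

table = capped_ngram_table(["a b a b a c"], n=2, max_rows=2)
print(table)
```

Dropping the rarest rows trades coverage for memory: contexts that appear only in the discarded rows can no longer be matched in that table.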