Text Prediction Application

Poobalan
22 April 2016

This application predicts the next word from user input, using at most the last two words entered. Predictions are based on the Twitter, blog, and news datasets provided by SwiftKey.

Challenges and Solutions

Challenges faced

Hardware limitations (8 GB RAM, Intel i5 1.8 GHz processor)
Data cleansing (punctuation, URLs, @, #, slang, profanities, non-UTF characters, extra whitespace, mixed lower/upper case, numbers, special characters, typos, expressive words such as hahahaha, etc.)
Data size (over 4 million rows across the three datasets combined)

Solutions/Workarounds

Hardware: use a smaller sample, about 10% of the provided datasets.

Data cleansing: remove URLs, RTs, @, #, profanities, non-UTF characters, extra whitespace, numbers, special characters, and repeated characters (such as aaaaa, ooooook), and convert text to lowercase.
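The sampling and cleaning steps above could be sketched roughly as follows. This is a minimal illustration in Python, not the application's actual code: the function names, the placeholder profanity list, the choice to drop all non-ASCII characters, and the choice to remove whole @ and # tokens are all assumptions.

```python
import random
import re

# Hypothetical placeholder list; the profanity list used by the app is not given.
PROFANITIES = {"darn", "heck"}

def sample_lines(lines, fraction=0.10, seed=42):
    """Keep roughly `fraction` of the lines, chosen at random."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

def clean_line(text):
    """Apply the listed cleaning steps to one line of text."""
    # Assumption: treat every non-ASCII character as unwanted.
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.lower()
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # URLs
    text = re.sub(r"\brt\b", " ", text)                # retweet markers
    text = re.sub(r"[@#]\w+", " ", text)               # @ and # tokens
    text = re.sub(r"[^a-z' ]", " ", text)              # numbers, punctuation, special characters
    text = re.sub(r"(.)\1{2,}", r"\1", text)           # collapse runs of 3+ repeated characters
    # split() also removes extra whitespace
    words = [w for w in text.split() if w not in PROFANITIES]
    return " ".join(words)
```

Collapsing repeated characters is lossy: a word such as "coooool" becomes "col" rather than "cool", so a real pipeline may want a gentler rule.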

Algorithm

Two algorithms were used:

Simple Back-off

Simple Back-off checks for candidate words in a 3-word table (trigram), then in a 2-word table (bigram), and, if both searches fail, returns the word with the highest occurrence in a 1-word table (unigram).
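A minimal Python sketch of this back-off lookup is shown here for illustration. The table layout (dictionaries keyed by the preceding words) and the function names are assumptions, not the application's actual implementation.

```python
from collections import Counter, defaultdict

def build_tables(sentences):
    """Count unigrams, bigrams, and trigrams from cleaned sentences."""
    unigrams = Counter()
    bigrams = defaultdict(Counter)   # previous word -> counts of next word
    trigrams = defaultdict(Counter)  # previous two words -> counts of next word
    for sentence in sentences:
        words = sentence.lower().split()
        unigrams.update(words)
        for w1, w2 in zip(words, words[1:]):
            bigrams[w1][w2] += 1
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            trigrams[(w1, w2)][w3] += 1
    return unigrams, bigrams, trigrams

def predict_backoff(text, unigrams, bigrams, trigrams):
    """Try the trigram table, then the bigram table, then the top unigram."""
    words = text.lower().split()[-2:]  # use at most the last two words
    if len(words) == 2 and tuple(words) in trigrams:
        return trigrams[tuple(words)].most_common(1)[0][0]
    if words and words[-1] in bigrams:
        return bigrams[words[-1]].most_common(1)[0][0]
    if unigrams:
        return unigrams.most_common(1)[0][0]
    return None  # empty tables: nothing to predict

corpus = ["i love new york", "i love new shoes",
          "i love new york city", "the new car"]
tables = build_tables(corpus)
print(predict_backoff("i love new", *tables))   # trigram match
print(predict_backoff("brand new", *tables))    # falls back to bigram
print(predict_backoff("hello world", *tables))  # falls back to unigram
```

Keying each table by its context makes every lookup a single dictionary access, which matters when the tables hold hundreds of thousands of rows.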

Simple Good-Turing

This algorithm accounts for the possibility that the user enters a word that is not in the dictionary, so it estimates probabilities for such unseen words to make a better prediction. It checks the 3-word table first and, if no match is found, the 2-word table. If no match is found in either table, it returns a “not found” message.
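The idea of reserving probability for unseen words can be illustrated with the basic Good-Turing estimate, sketched below in Python. This is an assumption-laden simplification: Simple Good-Turing as commonly described also smooths the frequency-of-frequency counts with a log-linear fit, which is omitted here, and the function name and fallback rule are hypothetical rather than taken from the application.

```python
from collections import Counter

def good_turing(counts):
    """Return (adjusted_counts, unseen_mass) for a table of n-gram counts.

    Uses the basic formula c* = (c + 1) * N[c+1] / N[c], where N[c] is the
    number of distinct n-grams seen exactly c times. When N[c+1] is zero
    the raw count is kept, which is a simplification.
    """
    freq_of_freq = Counter(counts.values())
    total = sum(counts.values())
    adjusted = {}
    for ngram, c in counts.items():
        n_c, n_next = freq_of_freq[c], freq_of_freq[c + 1]
        adjusted[ngram] = (c + 1) * n_next / n_c if n_next else float(c)
    # Probability mass reserved for n-grams never seen in the sample: N[1] / N
    unseen_mass = freq_of_freq[1] / total if total else 0.0
    return adjusted, unseen_mass

example = {"a b": 1, "c d": 1, "e f": 1, "g h": 2, "i j": 2, "k l": 3}
adjusted, unseen_mass = good_turing(example)
```

In this example three of the six n-grams occur once out of ten total occurrences, so the estimate reserves 3/10 of the probability mass for unseen n-grams and discounts the observed counts accordingly.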

Usage Instructions

1. Enter your text, choose a prediction method, and click the submit button in the sidebar.

2. The resulting prediction will appear in the main panel.

The application is accessible at https://libra22.shinyapps.io/TextPredictor/

Performance and Limitations

Performance

The application loads in under 10 seconds.
Prediction using Simple Back-off takes under 3 seconds.
Prediction using Simple Good-Turing takes under 5 seconds.

Limitations

Prediction is based on at most the last two words entered, due to resource limitations.

Small n-gram tables (500k rows or fewer per table), due to resource limitations.
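One way such a row cap might be applied is to keep only the most frequent n-grams in each table. The Python sketch below assumes frequency-based pruning; the source does not say how the tables were actually reduced, and the function name is hypothetical.

```python
from collections import Counter

def top_ngrams(counts, max_rows=500_000):
    """Keep only the `max_rows` most frequent n-grams from a count table."""
    return Counter(dict(counts.most_common(max_rows)))

table = Counter({"of the": 3, "in the": 2, "zebra crossing": 1})
pruned = top_ngrams(table, max_rows=2)
```

Pruning by frequency trades coverage of rare phrases for a smaller memory footprint and faster lookups.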