Word Predictor App

Sarathy Jay
May 15 2016


About the App

This is a smart text prediction app that learned multi-word combinations from a large set of Twitter, public blog and news datasets.


Key Features

  • Performs data cleaning to remove special characters and profanity
  • The datasets are loaded into R data frames for faster prediction
  • Text mining and NLP techniques are used to create n-grams (1, 2, 3 and 4 words)

How the app works

User Interface

  • A text box to capture user input. The user can type in one or a few words
  • A button to perform the word prediction action

Background Process

  • The app takes the user's input and cleans it (removes punctuation, special characters, extra white space and profanity)
  • Loads the previously created n-grams into memory
  • Runs the prediction model using the built-in algorithm
  • Outputs all the possible next-word predictions in a drop-down box
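
The input-cleaning step above can be sketched roughly as follows. This is an illustration in Python, not the app's R code. The function name clean_input, the placeholder profanity list, the lowercasing and the choice to keep apostrophes are all assumptions made for the example.

```python
import re

# Hypothetical placeholder list; the app's actual profanity list is not given.
PROFANITY = {"darn", "heck"}

def clean_input(text, profanity=PROFANITY):
    """Return the text lowercased, stripped of punctuation and special
    characters, with extra white space collapsed and profanity dropped."""
    text = text.lower()                      # assumption: matching is case-insensitive
    text = re.sub(r"[^a-z' ]+", " ", text)   # keep letters, apostrophes and spaces only
    words = [w for w in text.split() if w not in profanity]
    return " ".join(words)                   # split/join collapses repeated spaces

cleaned = clean_input("Hello,   WORLD!! darn it")
print(cleaned)
```

Cleaning the user's input the same way the training text was cleaned keeps the typed words comparable to the stored n-grams.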

Algorithm

The app’s algorithm is based on n-grams. An n-gram is a contiguous sequence of n items from a given sequence of text or speech. Depending on the application, the items can be phonemes, syllables, letters, words or base pairs. N-grams are typically collected from a text or speech corpus.
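
As a small illustration of the definition above, word-level n-grams can be extracted with a sliding window. This is a Python sketch, not the app's R code, and the helper name ngrams is hypothetical.

```python
def ngrams(words, n):
    """Return every contiguous run of n words as a tuple, in order."""
    # A window of size n starting at each valid position i.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "the quick brown fox".split()
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)
print(bigrams)
print(trigrams)
```

A four-word sentence yields three bigrams and two trigrams. Asking for n-grams longer than the sentence yields an empty list.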


Corpus

The corpus consisted of English-language blogs, news articles and tweets. From it, we built a set of n-grams (unigrams, bigrams, trigrams and 4-grams) that predict the most likely next word in a sentence, based on how frequently that word was used in the corpus we analyzed.
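
A frequency-based predictor of this kind might look like the sketch below, written in Python for illustration rather than taken from the app's R code. The names build_model and predict_next and the tiny example corpus are hypothetical. Falling back from the longest matching context to shorter ones is an assumption, since the source does not say how the four n-gram tables are combined.

```python
from collections import Counter, defaultdict

def build_model(sentences, max_n=4):
    """Map each context (a tuple of 0 to max_n-1 words) to counts of the next word."""
    model = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                *context, nxt = words[i:i + n]
                model[tuple(context)][nxt] += 1
    return model

def predict_next(model, text, max_n=4, top_k=3):
    """Return up to top_k candidate next words, most frequent first."""
    words = text.lower().split()
    # Assumption: try the longest available context first, then shorten it.
    for k in range(min(max_n - 1, len(words)), -1, -1):
        context = tuple(words[len(words) - k:])
        candidates = model.get(context)
        if candidates:
            return [w for w, _ in candidates.most_common(top_k)]
    return []

# Hypothetical miniature corpus, standing in for the blog, news and tweet data.
corpus = [
    "thanks for the follow",
    "thanks for the help",
    "thanks for the follow back",
    "see you at the game",
]
model = build_model(corpus)
print(predict_next(model, "thanks for the"))
print(predict_next(model, "look at the"))
```

In the second call the three-word context never occurs in the example corpus, so the sketch falls back to the two-word context "at the". With an empty context it returns the most frequent single words.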

Conclusion

The app is available through Shiny for exploration.

Link to Shiny app: Word Predictor

References