Chan
18/01/2016
The objective of the project is to develop an application that accepts multiple text inputs from the user and generates a prediction of the next word.
The text data used in this application comes from the HC Corpora text corpus. The corpus is over 500MB in size and contains over 100 million words sourced from blogs, news and Twitter.
Techniques such as natural language processing (NLP), n-grams and Markov chains have been used to produce the prediction model.
The following features are supported in this application:
Preprocessing - Removes all numbers, punctuation, special characters and extra whitespace, and converts all words to lowercase
Tokenization - Truncates the input string to its last 4 words. All words are used if there are fewer than 4
Pattern Matching - Attempts to match the input against the 4-gram, 3-gram, 2-gram and 1-gram frequency matrices
Next Word Prediction - The matching pattern with the highest frequency in the frequency matrices supplies the predicted next word
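The four features above can be sketched in a few lines of Python. This is an illustration only, not the application's own code: the function names, the layout of `tables`, and the choice to fall back from the longest n-gram to the shortest are all assumptions. Each n-gram table is assumed to be keyed by its first n-1 words, so the 4-gram table is looked up with the last 3 words of the input.

```python
import re

def preprocess(text):
    # Lowercase, replace anything that is not a letter or whitespace
    # with a space, then split on whitespace (this also drops extra spaces).
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return text.split()

def predict_next(text, tables, max_order=4):
    # tables (hypothetical layout): {n: {context_tuple: {next_word: count}}}
    # where context_tuple holds the first n-1 words of each n-gram.
    tokens = preprocess(text)[-4:]          # keep at most the last 4 words
    for order in range(max_order, 0, -1):   # try 4-gram first, 1-gram last
        k = order - 1                       # number of context words needed
        if k > len(tokens):
            continue
        context = tuple(tokens[-k:]) if k else ()
        candidates = tables.get(order, {}).get(context)
        if candidates:
            # The most frequent continuation is the prediction.
            return max(candidates, key=candidates.get)
    return None

# Toy frequency tables, made up for the example.
tables = {
    2: {("the",): {"cat": 3, "dog": 5}},
    1: {(): {"the": 10}},
}
print(predict_next("See THE!!", tables))
print(predict_next("xyz", tables))
```

In this sketch an unseen context simply falls through to the next shorter n-gram, ending with the most frequent single word.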
Natural language processing (NLP)
Ability to process text and make the information accessible to computer applications. This approach is used to cleanse the text corpus by stemming and by removing numbers, punctuation and special characters.
n-gram model
Probabilistic model for predicting the next item in a sequence, based on contiguous runs of n items (n-grams) taken from a given sequence of words. This model is used to generate unigrams, bigrams, trigrams and quadgrams from the tokenized text corpus.
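As a rough illustration of how unigrams through quadgrams might be extracted and counted, here is a minimal Python sketch. The helper names `ngrams` and `ngram_counts` are hypothetical, and the token list is a toy example rather than corpus data.

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of length n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(tokens, max_n=4):
    # One frequency table per n-gram order, from unigrams up to max_n.
    return {n: Counter(ngrams(tokens, n)) for n in range(1, max_n + 1)}

tokens = "the cat sat on the mat".split()
counts = ngram_counts(tokens)
print(counts[1].most_common(1))   # most frequent unigram
print(ngrams(tokens, 4))          # all quadgrams in the toy sentence
```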
Markov chain
Stochastic model describing a chain of linked events, where what happens next depends only on the current state of the system. This model is used to compute the probabilities of each n-gram token and store them in term frequency matrices.
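One common way to turn n-gram counts into Markov transition probabilities is a maximum-likelihood estimate: the count of a state followed by a given word, divided by the count of that state followed by any word. The Python sketch below assumes this estimate; the function name and the toy tokens are hypothetical, and the application itself may compute its matrices differently.

```python
from collections import Counter, defaultdict

def transition_probabilities(tokens, order=1):
    # The state is the previous `order` words; count which word follows it.
    counts = defaultdict(Counter)
    for i in range(len(tokens) - order):
        state = tuple(tokens[i:i + order])
        counts[state][tokens[i + order]] += 1
    # Normalise each state's counts so they sum to 1.
    probs = {}
    for state, followers in counts.items():
        total = sum(followers.values())
        probs[state] = {word: c / total for word, c in followers.items()}
    return probs

tokens = "a b a c a b".split()
probs = transition_probabilities(tokens, order=1)
print(probs[("a",)])   # distribution over the words that follow "a"
```

With `order=3` the same function would give the next-word distribution behind a 4-gram table.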
The user begins by typing a word or phrase in the input box. The application then refreshes and displays the "List of entered words" and the "Next word prediction".
The application can be accessed online on RStudio's Shinyapp Server