Chris Dolan
This project was completed as part of the Coursera Data Science Specialization.
The main goal of this project was to design and implement a model that takes a partial phrase as input and tries to predict the next word in the phrase.
An app was designed using RStudio and Shiny Apps to serve as the user interface for this project.
To develop the model, a corpus of three text collections was provided, drawn from Twitter, news, and blogs by HC Corpora.
The corpus provided for the project was sampled, cleaned, and then separated into n-grams of varying length. The n-grams formed the basis of the model.
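The cleaning and n-gram steps can be sketched as follows. This is a minimal illustration in Python, not the project's own code: the cleaning rules (lowercasing, keeping only letters and apostrophes), the helper names `clean` and `ngrams`, and the two sample lines are all assumptions made for the example, and the sampling step is omitted.

```python
import re
from collections import Counter

def clean(text):
    """Lowercase the text, drop everything except letters and apostrophes,
    and split the result into word tokens. (Assumed cleaning rules.)"""
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)
    return text.split()

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical stand-in for lines sampled from the corpus.
lines = ["I love New York!", "I love San Francisco."]

# One frequency table per n-gram length.
counts = {n: Counter() for n in (1, 2, 3)}
for line in lines:
    tokens = clean(line)
    for n in counts:
        counts[n].update(ngrams(tokens, n))
```

Here `counts[2]` holds the bigram frequencies, `counts[3]` the trigram frequencies, and so on; longer n-grams can be added by extending the tuple of lengths.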
The n-grams and their interpolated appearance frequencies in the corpus were organized into look-up tables that are searched based on the user's input. By using interpolation, the frequency calculations were able to take into account how often a word is used in a certain context: the well-known example is that “Francisco” is common only after “San” (San Francisco).
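As a rough illustration of an interpolated frequency, the sketch below blends a bigram estimate with a unigram estimate using fixed weights. The text does not say which interpolation scheme the project used, so simple linear interpolation, the weights `LAMBDA_BIGRAM` and `LAMBDA_UNIGRAM`, and the toy sentence are all assumptions made for this example.

```python
from collections import Counter, defaultdict

# Toy corpus (hypothetical) standing in for the cleaned sample.
tokens = "san francisco is foggy and san francisco is hilly".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
total = sum(unigrams.values())

# Assumed interpolation weights; they must sum to 1.
LAMBDA_BIGRAM, LAMBDA_UNIGRAM = 0.7, 0.3

def interpolated(prev, word):
    """Blend the context-specific (bigram) estimate with the
    context-free (unigram) estimate of how likely `word` is."""
    p_bigram = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    p_unigram = unigrams[word] / total
    return LAMBDA_BIGRAM * p_bigram + LAMBDA_UNIGRAM * p_unigram

# Look-up table: previous word -> {candidate next word: interpolated score}.
table = defaultdict(dict)
for prev, word in bigrams:
    table[prev][word] = interpolated(prev, word)
```

In this toy data, `interpolated("san", "francisco")` is much larger than `interpolated("is", "francisco")`, because "francisco" only ever follows "san"; the unigram term alone would score it the same in both contexts.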
To predict the next word in a phrase, a simplified interpolation and back-off model was developed.
The partial phrase entered by the user is used to search the n-gram tables. If the phrase is not found in a table, the model “backs off” to the next lower-order n-gram table and searches for a shortened phrase.
The app relies on the Markov assumption and uses at most four words of the input to make a prediction.
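The back-off search described above can be sketched like this. It is a simplified illustration rather than the app's actual code: the function name `predict`, the layout of `tables` (context length mapped to a dictionary from context tuple to predicted word), and the sample entries are hypothetical.

```python
def predict(phrase, tables, max_context=4, default=None):
    """Return the predicted next word for `phrase`.

    Only the last `max_context` words are used. If that context is not
    in the matching table, the oldest word is dropped and the next
    lower-order table is searched, until a match is found or no words
    remain.
    """
    words = phrase.lower().split()[-max_context:]
    while words:
        hit = tables.get(len(words), {}).get(tuple(words))
        if hit is not None:
            return hit
        words = words[1:]  # back off: shorten the context by one word
    return default

# Hypothetical look-up tables, keyed by context length.
tables = {
    1: {("san",): "francisco", ("the",): "best"},
    2: {("live", "in"): "san"},
}

first = predict("I want to live in San", tables)   # backs off to ("san",)
second = predict("We live in", tables)              # matches ("live", "in")
unknown = predict("zzz", tables, default="the")     # nothing matches
```

Returning a fallback such as a very common word when nothing matches is one possible design choice; the source does not state what the app does in that case.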
A partial phrase is entered on the left-hand side
The user hits submit
The predicted word is displayed in the center of the screen
Instructions and background information are found on the information tab