12/27/2020

Overview

The goal of the capstone project is to build a Shiny application based on a predictive text model using corpus data.

The user provides a word or short phrase, and the application tries to predict the next word.

Three English-language corpora from the Internet (news, blogs, and tweets) were used to train the model. They were analyzed and processed to extract word collocation information, which the Shiny application uses to provide a fast and reliable prediction at the time of the user's request.

Application Interface

Application

This tab is the working interface to the application. The input area is on the left, containing a text field where a word or phrase can be typed in as well as a Predict button to calculate the results.

The output area is on the right. The application echoes the input word or phrase, shows the word predicted to follow it, and shows the full phrase obtained by appending the predicted word to the input.

Model

This tab contains a brief description of usage and of the model behind the scenes.

Construction of the Model

Three English-language corpora from the Internet (news, blogs, and tweets) were used to train the model. Random subsets containing 5,000, 5,000 and 10,000 entries, respectively, were selected for constructing the model.
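The sampling step can be sketched as follows. This is an illustrative Python sketch, not the application's own code; the function name sample_entries, the seed parameter, and the stand-in data are hypothetical, and it assumes each corpus is available as a list with one entry per element.

```python
import random

def sample_entries(entries, k, seed=None):
    """Return k entries drawn at random without replacement.

    If the corpus holds k entries or fewer, all of them are returned.
    The seed argument is a hypothetical convenience for reproducibility.
    """
    rng = random.Random(seed)
    if k >= len(entries):
        return list(entries)
    return rng.sample(entries, k)

# Hypothetical stand-ins for the three corpora (one entry per element).
news = [f"news entry {i}" for i in range(20000)]
blogs = [f"blog entry {i}" for i in range(20000)]
tweets = [f"tweet {i}" for i in range(40000)]

# Subset sizes taken from the text: 5,000 / 5,000 / 10,000.
news_sample = sample_entries(news, 5000, seed=1)
blogs_sample = sample_entries(blogs, 5000, seed=1)
tweets_sample = sample_entries(tweets, 10000, seed=1)
```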

These three datasets are converted into proper corpora. They are preprocessed by replacing hyphens with spaces, removing punctuation marks and numbers, collapsing extra whitespace, and converting all strings to lowercase.
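A minimal sketch of these cleaning steps, written in Python for illustration only (it is not the application's own code). The function name preprocess is hypothetical, and the sketch assumes that "punctuation marks" means the ASCII punctuation characters in Python's string.punctuation.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(text):
    """Apply the cleaning steps described above to one entry."""
    text = text.replace("-", " ")              # hyphens become spaces first
    text = text.translate(_PUNCT_TABLE)        # drop remaining punctuation
    text = re.sub(r"\d+", "", text)            # drop numbers
    text = re.sub(r"\s+", " ", text).strip()   # collapse extra whitespace
    return text.lower()                        # lowercase everything

cleaned = preprocess("Well-known  FACTS, from 2020!")
```

Hyphens are replaced before punctuation is removed so that a compound such as "well-known" becomes two words rather than being fused into one.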

The frequencies of n-grams for n = 2, 3, 4 and 5 are computed for each of the datasets. The results are combined by adding the frequencies of n-grams that occur in more than one dataset, and are then sorted by frequency of appearance. The resulting data are stored in an RData file, which is later loaded by the application.
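The counting and combining steps can be sketched as below. This is an illustrative Python sketch under stated assumptions, not the application's own code: the names ngram_counts and combined_table and the toy sentences are hypothetical, entries are assumed to be already preprocessed, and words are assumed to be separated by single spaces.

```python
from collections import Counter

def ngram_counts(lines, n):
    """Count every run of n consecutive words within each line."""
    counts = Counter()
    for line in lines:
        words = line.split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

def combined_table(datasets, n):
    """Add the n-gram counts of all datasets and sort by frequency."""
    total = Counter()
    for lines in datasets:
        total.update(ngram_counts(lines, n))  # update() adds counts together
    return total.most_common()                # most frequent n-gram first

# Toy stand-ins for the three preprocessed datasets.
news = ["the cat sat on the mat"]
blogs = ["the cat ran"]
tweets = ["on the mat", "the cat sat"]

# One frequency table per n-gram order used by the model.
tables = {n: combined_table([news, blogs, tweets], n) for n in (2, 3, 4, 5)}
bigrams = tables[2]
```

In this toy data the bigram "the cat" occurs once in each dataset, so its combined frequency is 3 and it sorts to the top of the bigram table.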

Model Application

The application uses the information loaded from the RData file to find matches for the user's input and returns the next word if a match is found.

The application counts the number of words, n, in the input phrase and starts by looking for entries in the (n+1)-gram dataset whose first n words match the input. If there are one or more matches, the algorithm returns the last word of the matching (n+1)-gram with the highest frequency. If there are no matches, it removes the first word of the input phrase and repeats the process on the shortened phrase until the algorithm converges to a solution.
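The lookup described above can be sketched as follows. This is an illustrative Python sketch, not the application's own code. The function name predict_next, the nested-dictionary layout of the tables, and the toy frequencies are hypothetical. Two behaviors are assumptions that the text does not specify: inputs longer than four words are cut to their last four words (the largest table holds 5-grams), and None is returned when no shortened phrase matches.

```python
MAX_PREFIX = 4  # longest usable prefix, since the largest table holds 5-grams

def predict_next(phrase, tables):
    """Return the most frequent next word, backing off to shorter prefixes.

    tables maps an n-gram order to {prefix tuple: {next word: frequency}}.
    """
    words = phrase.lower().split()
    words = words[-MAX_PREFIX:]  # assumption: keep only the last four words
    while words:
        # An n-word prefix is looked up in the (n+1)-gram table.
        table = tables.get(len(words) + 1, {})
        candidates = table.get(tuple(words))
        if candidates:
            # Pick the continuation with the highest frequency.
            return max(candidates, key=candidates.get)
        words = words[1:]  # no match: drop the first word and retry
    return None  # assumption: nothing matched at any prefix length

# Toy tables with made-up frequencies, for illustration only.
tables = {
    2: {("the",): {"cat": 3, "mat": 2}, ("cat",): {"sat": 2, "ran": 1}},
    3: {("the", "cat"): {"sat": 2, "ran": 1}},
}

direct = predict_next("the cat", tables)           # matched in the 3-gram table
backed_off = predict_next("I saw the cat", tables) # shortened to "the cat"
unmatched = predict_next("hello world", tables)    # no prefix matches
```

Here "I saw the cat" finds no match as a four-word or three-word prefix, so the lookup shortens it twice and finally matches "the cat" in the 3-gram table.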

Application URL:

https://doctormanuel.shinyapps.io/TextPredictor/