Capstone Text Prediction

Peter Makai
29-09-2018

Introduction

This app is a simple text prediction tool. Given text input, it predicts the word most likely to follow the last words entered.

To construct the app, I used publicly available data from Twitter, news articles and blog posts provided by SwiftKey, a text prediction company. I cleaned and preprocessed the data in order to build a database that could provide meaningful predictions.

Performance issues were dealt with by working on a sample of the data rather than the full data set.

Finally, I built a Shiny app that serves the predictions.

Methods

First, I cleaned the data: I removed non-standard characters such as punctuation marks, removed stopwords, and converted all words to lower case, so that a word at the beginning of a sentence is counted together with the same word later in a sentence.
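
The cleaning steps above could look roughly like the following. This is an illustrative Python sketch, not the code used in the app; the function name clean_text and the short STOPWORDS list are hypothetical, and the sketch assumes plain ASCII English text.

```python
import re

# Tiny illustrative stopword list (an assumption); a real list is much longer.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "it"}

def clean_text(text):
    """Lower-case the text, drop non-letter characters, and remove stopwords."""
    text = text.lower()
    # Replace every character that is not a lowercase ASCII letter
    # or whitespace with a space (this removes punctuation and digits).
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS]

print(clean_text("The weather is nice, and THE sun is out!"))
```

Lower-casing before the stopword check means that "The" and "THE" are both treated as the stopword "the".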

Second, I created a specialized database for text data and investigated the most common words, as well as the most common combinations of two and three words. In technical terms, I created a corpus object and extracted unigrams, bigrams and trigrams from it.
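
As a rough illustration of what counting uni-, bi- and trigrams means, here is a minimal Python sketch. It is not the corpus tooling used for the app; the helper name ngrams and the toy sentence are assumptions made for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy input standing in for the cleaned corpus.
tokens = "i like green tea i like black tea".split()

unigram_counts = Counter(ngrams(tokens, 1))
bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

# The most frequent n-grams can then be inspected with most_common().
print(bigram_counts.most_common(3))
```

In this toy input the bigram ("i", "like") occurs twice, while every trigram occurs only once.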

Third, I used the bigrams to build a table that shows, for each word, the probability of each word that can follow it. This is called a Markov model. The Markov model was created separately, and only the finished model is incorporated into the Shiny app, which greatly improves performance.
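
A bigram Markov model of this kind can be sketched as follows. The probability of a next word given the current word is estimated as the count of that bigram divided by the count of all bigrams starting with the current word. This is an illustrative Python sketch under that assumption, not the app's actual code; build_bigram_model and predict_next are hypothetical names.

```python
from collections import Counter, defaultdict

def build_bigram_model(tokens):
    """Map each word to {next_word: P(next_word | word)} from bigram counts."""
    counts = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        counts[current][nxt] += 1
    model = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        model[current] = {w: c / total for w, c in followers.items()}
    return model

def predict_next(model, text):
    """Return the most probable follower of the last word, or None if unseen."""
    words = text.lower().split()
    if not words or words[-1] not in model:
        return None
    followers = model[words[-1]]
    return max(followers, key=followers.get)

# Toy input standing in for the cleaned, sampled corpus.
tokens = "i like tea i like tea i like coffee".split()
model = build_bigram_model(tokens)

print(model["like"])
print(predict_next(model, "I really like"))
```

Because the table is built once, ahead of time, a prediction at run time is only a dictionary lookup on the last word, which fits the idea of shipping only the finished model with the app.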

The Shiny app

The Shiny app is functional but has no additional design features at this stage. It provides a text box for entering text and supplies the most likely next word for that input. The predicted word is shown in the text box.

The link to the Shiny app

The app can be found here:

https://pmakai.shinyapps.io/Textpred/