Capstone Project: Next-Word Text Prediction

Gary Clarke
14/01/2021

Data input and preliminary cleanse and sampling

The following steps were applied to the raw data:

  • Load data - the raw data was extracted in binary mode and saved as UTF-8 encoded files.

  • Sample data - a random sample of the data was taken to create a data file small enough to keep processing fast, yet large enough to represent the full data set when predicting text.

  • Clean data - various elements were stripped out so that processing would not be corrupted: numbers, whitespace, punctuation, and stopwords were removed, and the text was converted to lower case. A sketch of these steps follows this list.
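A minimal sketch of these three steps in R with quanteda is shown below; the file name, the 10% sampling rate, and the seed are illustrative assumptions rather than the project's exact values.

    library(quanteda)

    # Load: open the raw file in binary mode and read it as UTF-8 text
    con <- file("en_US.twitter.txt", open = "rb")   # assumed file name
    raw_lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
    close(con)

    # Sample: a fixed seed and a 10% subset keep the run fast and reproducible
    set.seed(42)
    sampled <- sample(raw_lines, floor(length(raw_lines) * 0.10))

    # Clean: lower-case the text, then strip numbers, punctuation, and stopwords
    toks <- tokens(char_tolower(sampled),
                   remove_numbers = TRUE,
                   remove_punct   = TRUE)
    toks <- tokens_remove(toks, stopwords("en"))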

How the algorithm processes the user input

  • The quanteda and tm libraries were used to manipulate the data, and the Shiny library was used to create the app.

    1. Manipulate the data into a usable form for predicting text

A corpus was created and a document-feature matrix was used to process the data into n-grams. An n-gram is a contiguous sequence of n items from a given sample of text or speech. The programme used unigrams (1), bigrams (2), trigrams (3), quadgrams (4), and pentagrams (5).
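A minimal sketch of this step, reusing the cleaned tokens from the earlier sketch; the helper name ngram_freq is illustrative. tokens_ngrams() joins n adjacent tokens, and summing a document-feature matrix gives each n-gram's corpus-wide count.

    library(quanteda)

    ngram_freq <- function(toks, n) {
      ng <- tokens_ngrams(toks, n = n, concatenator = " ")
      m  <- dfm(ng)                          # document-feature matrix
      sort(colSums(m), decreasing = TRUE)    # named vector: n-gram -> count
    }

    unigrams   <- ngram_freq(toks, 1)
    bigrams    <- ngram_freq(toks, 2)
    trigrams   <- ngram_freq(toks, 3)
    quadgrams  <- ngram_freq(toks, 4)
    pentagrams <- ngram_freq(toks, 5)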

    2. Return the results for an input sentence or phrase

The algorithm contains a function that takes the input phrase and “reads” the last word. The algorithm loops through the n-grams, scoring the frequency of the candidate next words they contain. The results are tabulated in descending order of frequency, and the top ten words are returned as a list for the app to display.
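A minimal sketch of the lookup-and-rank step, shown here against the bigram table only for brevity; predict_next and its arguments are illustrative names, not the project's actual code.

    # Find every bigram whose first word matches the last word of the
    # input, then return the ten most frequent continuations
    predict_next <- function(phrase, bigram_freq, top_n = 10) {
      words     <- unlist(strsplit(tolower(trimws(phrase)), "\\s+"))
      last_word <- tail(words, 1)
      hits      <- bigram_freq[startsWith(names(bigram_freq),
                                          paste0(last_word, " "))]
      next_words <- sub("^\\S+\\s+", "", names(hits))
      head(unique(next_words[order(hits, decreasing = TRUE)]), top_n)
    }

    predict_next("a sunny", bigrams)   # returns up to ten candidate words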

Stupid Backoff is used to improve the model

In the algorithm a “stupid backoff” model is implemented.

The theory behind a backoff model is that it estimates the conditional probability of a word given its history in the n-gram. It accomplishes this estimation by backing off through progressively shorter history models, so that the model with the most reliable information about a given history is used.

So, for the example phrase “a sunny day”, the model would score “day” by looking first for the trigram “a sunny day” in the corpus, then “backing off” to the bigram “sunny day”, and finally to the unigram “day”, returning the frequency of “day” divided by the total number of words in the corpus.
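A minimal sketch of that scoring scheme over the frequency tables built earlier; the 0.4 discount is the value suggested in the original stupid backoff paper (Brants et al., 2007), and the function name is illustrative.

    # Use the longest n-gram with a non-zero count, discounting the score
    # by lambda each time the model backs off to a shorter history; the
    # base case is the unigram count over the total words in the corpus
    ngram_tables <- list(unigrams, bigrams, trigrams, quadgrams, pentagrams)

    sb_score <- function(word, history, lambda = 0.4) {
      history <- tail(history, length(ngram_tables) - 1)  # at most 4 words
      n       <- length(history) + 1
      penalty <- 1
      while (n > 1) {
        hist_key <- paste(tail(history, n - 1), collapse = " ")
        num <- ngram_tables[[n]][paste(hist_key, word)]
        den <- ngram_tables[[n - 1]][hist_key]
        if (!is.na(num) && !is.na(den) && den > 0)
          return(penalty * unname(num) / unname(den))
        penalty <- penalty * lambda    # back off one level
        n <- n - 1
      }
      count <- ngram_tables[[1]][word]
      if (is.na(count)) count <- 0
      penalty * unname(count) / sum(ngram_tables[[1]])
    }

    sb_score("day", c("a", "sunny"))   # trigram, then bigram, then unigram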

Shiny

  • ui and server files

The Shiny ui and server files were created to run the app: the user inputs a phrase and the app outputs the top ten most likely “next” words.
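A minimal single-file sketch of such a ui/server pair, assuming the predict_next() function sketched earlier; the widget ids and labels are illustrative.

    library(shiny)

    # ui: a text box for the phrase and a table for the predictions
    ui <- fluidPage(
      titlePanel("Next Word Prediction"),
      textInput("phrase", "Type a phrase:"),
      tableOutput("predictions")
    )

    # server: recompute the prediction table whenever the phrase changes
    server <- function(input, output) {
      output$predictions <- renderTable({
        req(nzchar(input$phrase))
        words <- predict_next(input$phrase, bigrams)
        data.frame(Rank = seq_along(words), Word = words)
      })
    }

    shinyApp(ui = ui, server = server)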

  • Using the app

The phrase is typed into the input box and the predicted words are returned in an ordered list, most frequent first.