Capstone Project: Next Word Prediction

Gary Clarke
14/01/2021

Data input, preliminary cleaning and sampling

The following steps were applied to the raw data:

  • Load data - the raw files were read and written out with UTF-8 encoding in binary form.

  • Sample data - a sample was taken to produce a file small enough to keep processing fast, yet large enough to represent the full data set when predicting text.

  • Clean data - elements that could corrupt processing were stripped out: numbers, surplus whitespace, punctuation and stopwords were removed, and the text was converted to lower case (a code sketch of these steps follows this list).
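
A minimal sketch of these three steps is shown below, using base R and the tm library. The file path and the 5% sample rate are placeholders for illustration, not the values used in the project.

```r
library(tm)

set.seed(1234)  # make the sample reproducible

# Load: read one of the raw files as UTF-8 text
# (path and 5% sample rate are assumed for illustration)
lines <- readLines("final/en_US/en_US.twitter.txt",
                   encoding = "UTF-8", skipNul = TRUE)

# Sample: keep a subset that is fast to process but still representative
sample_lines <- sample(lines, size = round(0.05 * length(lines)))

# Clean: lower case, then strip numbers, punctuation, stopwords and
# surplus whitespace
corpus <- VCorpus(VectorSource(sample_lines))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, stripWhitespace)
```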

1. Manipulate the data into a usable form for predicting text

  • Libraries Quanteda and tm were used to manipulate the data; library Shiny was used to create the app.

A corpus was created and a document-feature matrix was used to process the data into n-grams.

An n-gram is a contiguous sequence of n items from a given sample of text or speech. The programme used unigrams (1), bigrams (2), trigrams (3), quadgrams (4) and pentagrams (5).
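
The sketch below shows one way the 1- to 5-gram frequency tables could be built with quanteda; the object names and exact calls are illustrative rather than the project's actual code.

```r
library(quanteda)

# `sample_lines` stands in for the sampled, cleaned text from the previous
# step; a tiny placeholder is used here so the sketch runs on its own
sample_lines <- c("the quick brown fox jumps over the lazy dog",
                  "it was a quick brown dog")

corp <- corpus(sample_lines)
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)

# For each n from 1 to 5, build the n-gram tokens, count them in a
# document-feature matrix (dfm) and keep a sorted frequency table
ngram_freq <- lapply(1:5, function(n) {
  ng <- tokens_ngrams(toks, n = n, concatenator = " ")
  sort(colSums(dfm(ng)), decreasing = TRUE)
})
names(ngram_freq) <- c("unigram", "bigram", "trigram", "quadgram", "pentagram")
```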

2. Return results for an input sentence or phrase

The algorithm contains a function that takes the input phrase and “reads” the last word.

The algorithm then loops through the n-gram tables, scoring each candidate next word by the frequency with which it follows the last word identified by that function.

The results are ranked and the top ten words are returned as a list for the app to display.
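
A simplified sketch of this prediction step is given below. It assumes the n-gram counts are stored as named frequency vectors like the `ngram_freq` tables above; the function name and scoring rule are illustrative and may differ from the project's actual algorithm.

```r
# Score candidate next words for `phrase` against the bigram..pentagram
# frequency tables and return the ten highest-scoring words
predict_next <- function(phrase, ngram_freq, top_n = 10) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  last_word <- tail(words, 1)          # "read" the last word of the input

  scores <- numeric(0)
  for (tbl in ngram_freq[-1]) {        # skip the unigram table
    grams <- strsplit(names(tbl), " ")
    for (i in seq_along(grams)) {
      g <- grams[[i]]
      # credit the final word of any n-gram whose second-to-last word
      # matches the input's last word, weighted by the n-gram's frequency
      if (g[length(g) - 1] == last_word) {
        nxt <- g[length(g)]
        scores[nxt] <- sum(scores[nxt], tbl[[i]], na.rm = TRUE)
      }
    }
  }
  names(sort(scores, decreasing = TRUE))[seq_len(min(top_n, length(scores)))]
}

# Example, using the ngram_freq tables built above:
# predict_next("it was a quick", ngram_freq)
```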

Shiny

  • ui and server files

The shiny ui and server files were created so that the app can run, take a phrase as user input, and output the ten most likely “next” words.
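
The sketch below collapses the ui and server into a single-file app for brevity; it assumes a `predict_next()` function and `ngram_freq` tables like those sketched above, and the widget labels are placeholders.

```r
library(shiny)

# ui: a text box for the input phrase and a table for the predictions
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Type a phrase:"),
  tableOutput("predictions")
)

# server: rerun the prediction whenever the phrase changes
server <- function(input, output) {
  output$predictions <- renderTable({
    req(input$phrase)
    words <- predict_next(input$phrase, ngram_freq)
    data.frame(rank = seq_along(words), word = words)
  })
}

shinyApp(ui = ui, server = server)
```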

  • Using the app

The phrase is typed into the box and the predicted words are returned in an ordered list, most frequent first.