Gary Clarke
14/01/2021
The following steps were applied to the raw data:
Load data - the data was extracted and saved in binary form, using UTF-8 encoding.
Sample data - a sample of the data was taken to create a data file small enough not to slow the processing but large enough to represent the full data set when predicting the text.
Clean data - various elements were stripped out so that they would not corrupt the processing: numbers, extra whitespace, punctuation and stopwords. The text was also converted to lower case.
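The sampling and cleaning steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the code used for the app (the source does not show its code). The function names, the sampling fraction, the seed and the tiny stopword list are all assumptions made for the example.

```python
import random
import re
import string

# Hypothetical, deliberately tiny stopword list; the source does not say
# which stopword list was actually used.
STOPWORDS = {"a", "an", "the", "and", "of", "to", "in", "is"}


def sample_lines(lines, fraction, seed=42):
    """Return a reproducible random sample of roughly `fraction` of the lines."""
    rng = random.Random(seed)
    # Keep at least one line, but never ask for more lines than exist.
    k = min(len(lines), max(1, int(len(lines) * fraction)))
    return rng.sample(lines, k)


def clean_text(text):
    """Lower-case the text, then strip numbers, punctuation, stopwords
    and surplus whitespace."""
    text = text.lower()
    # Replace runs of digits with a space so neighbouring words stay apart.
    text = re.sub(r"\d+", " ", text)
    # Remove ASCII punctuation only; typographic quotes would need extra handling.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # split() with no argument also collapses repeated whitespace.
    words = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(words)
```

For example, clean_text("The 3 Cats, and 2 DOGS!") returns "cats dogs". Fixing the seed keeps the sample the same between runs, which makes later results easier to compare.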
A corpus was created and a data frequency matrix was used to process the data into n-grams.
An n-gram is a contiguous sequence of n items from a given sample of text or speech. The programme used unigrams (1), bigrams (2), trigrams (3), quadgrams (4) and pentagrams (5).
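To make the definition concrete, here is a short Python sketch that splits cleaned text into n-grams of orders 1 to 5 and counts how often each occurs. It is an illustration only: the helper names and the dictionary-of-counters layout are assumptions, and the source's own frequency matrix may be organised differently.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    # If there are fewer than n tokens the range is empty and so is the result.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(sentences, max_n=5):
    """Map each order n (1..max_n) to a Counter of n-gram frequencies."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.split()
        for n in counts:
            counts[n].update(ngrams(tokens, n))
    return counts
```

For example, ngrams("new york city".split(), 2) returns [("new", "york"), ("york", "city")].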
The algorithm contains a function that takes the input phrase and “reads” the last word.
The algorithm then loops through the n-grams, scoring each candidate next word by how frequently it follows the last word identified in the previous step.
The result is tabulated in order of frequency and the top ten words are returned as a list for the app to display.
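One possible reading of the prediction steps above is sketched below in Python. It is an interpretation, not the algorithm as implemented in the app: the names last_word and predict_next, the layout of the n-gram tables, and the rule that an n-gram contributes when its second-to-last word matches the phrase's last word are all assumptions.

```python
from collections import Counter


def last_word(phrase):
    """Return the final word of the phrase in lower case, or None if it is empty."""
    words = phrase.lower().split()
    return words[-1] if words else None


def predict_next(phrase, ngram_tables, top=10):
    """Return up to `top` candidate next words, most frequent first.

    ngram_tables is an iterable of mappings from n-gram tuples to counts,
    one mapping per n-gram order (hypothetical layout).
    """
    word = last_word(phrase)
    if word is None:
        return []
    scores = Counter()
    for table in ngram_tables:
        for gram, count in table.items():
            # An n-gram contributes when it ends with (last word, candidate).
            if len(gram) >= 2 and gram[-2] == word:
                scores[gram[-1]] += count
    return [w for w, _ in scores.most_common(top)]


# Toy tables with made-up counts, for illustration only.
bigrams = Counter({("new", "york"): 5, ("new", "car"): 2, ("old", "car"): 7})
trigrams = Counter({("brand", "new", "car"): 1, ("a", "new", "day"): 4})
suggestions = predict_next("I want a New", [bigrams, trigrams])
# suggestions == ["york", "day", "car"]
```

Summing counts across n-gram orders is the simplest scoring rule; weighting longer n-grams more heavily, or backing off from longer to shorter ones, would be alternative designs.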
The Shiny ui and server files were created so that the app can run, take a phrase as user input and output the top ten most likely “next” words.
The phrase is typed into the box and the predicted words are returned in an ordered list, most frequent first.