Gary Clarke
14/01/2021
The following steps were applied to the raw data:
Load data - the data was extracted and the files were saved in binary form using UTF-8 encoding.
Sample data - a sample of the data was taken to create a file small enough not to slow the processing but large enough to represent the full data set when predicting text.
Clean data - numbers, extra whitespace, punctuation and stopwords were stripped out so that they would not distort the processing, and the text was converted to lower case.
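The sampling and cleaning steps above can be sketched in base R as follows. This is a minimal illustration, not the app's actual code: the input lines, the seed, the sample size and the short stopword list are all assumptions (tm and quanteda provide fuller stopword lists and their own cleaning functions).

```r
# Hypothetical stand-in for lines read from the raw text files.
raw_lines <- c(
  "It was a Sunny Day in 2021!",
  "The weather   is nice, and the day is long.",
  "We saw 3 birds on a sunny day."
)

# Take a random sample of the lines; the seed and sample size are assumptions.
set.seed(42)
sampled <- sample(raw_lines, size = 2)

# Small illustrative stopword list (assumption, not the list used by the app).
stop_words <- c("a", "an", "and", "in", "is", "it", "on", "the", "was", "we")

clean_text <- function(x, stops = stop_words) {
  x <- tolower(x)                      # transform to lower case
  x <- gsub("[[:digit:]]+", " ", x)    # remove numbers
  x <- gsub("[[:punct:]]+", " ", x)    # remove punctuation
  # Split on runs of whitespace, which also collapses extra whitespace
  words <- strsplit(trimws(x), "[[:space:]]+")
  # Drop stopwords and rebuild each line
  vapply(words, function(w) paste(w[!w %in% stops], collapse = " "), character(1))
}

cleaned <- clean_text(sampled)
```

Calling clean_text() on a single line returns that line with only the content words left, which is the form the n-gram step then works on.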
The quanteda and tm libraries were used to manipulate the data. The shiny library was used to create the app.
A corpus was created and a document-feature matrix (a table of term frequencies) was used to process the data into n-grams. An n-gram is a contiguous sequence of n items from a given sample of text or speech. For example, “day” is a unigram, “sunny day” is a bigram and “a sunny day” is a trigram, and so on. The programme used unigrams (1), bigrams (2), trigrams (3), quadgrams (4) and pentagrams (5).
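To make the n-gram definition concrete, here is a small base R sketch that builds n-grams from a token vector and counts them. The helper name make_ngrams and the toy tokens are hypothetical; the app itself relied on quanteda and tm for this step, so this only illustrates the idea.

```r
# Build all n-grams of size n from a character vector of tokens.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

# Toy token sequence (assumption, for illustration only).
tokens <- c("a", "sunny", "day", "a", "sunny", "morning")

unigrams <- make_ngrams(tokens, 1)
bigrams  <- make_ngrams(tokens, 2)
trigrams <- make_ngrams(tokens, 3)

# Frequency table of the bigrams, most frequent first.
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
```

The same function with n = 4 or n = 5 would give the quadgrams and pentagrams.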
The algorithm defines a function that takes the input phrase and “reads” its last word. Using this word, it loops through the n-grams looking for matching words and scores the frequency of the next words they contain. The scores are tabulated in descending order and the top ten words are returned as a list for the app to display.
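The lookup described above might look roughly like this for bigrams. The function name predict_next, the counts and the layout of the count vector (names of the form "word1 word2") are assumptions made for the sketch, not the app's real data structures.

```r
# Hypothetical bigram counts: names are "word1 word2", values are frequencies.
bigram_counts <- c(
  "sunny day" = 5, "sunny morning" = 3, "sunny side" = 1, "rainy day" = 4
)

predict_next <- function(phrase, counts, top_n = 10) {
  # "Read" the last word of the input phrase
  words <- strsplit(trimws(tolower(phrase)), "[[:space:]]+")[[1]]
  last_word <- words[length(words)]

  # Split each bigram into its first and second word
  parts  <- strsplit(names(counts), " ", fixed = TRUE)
  first  <- vapply(parts, `[`, character(1), 1)
  second <- vapply(parts, `[`, character(1), 2)

  # Keep bigrams starting with the last word and score the words that follow
  hits <- counts[first == last_word]
  names(hits) <- second[first == last_word]

  # Most frequent first, limited to the top_n candidates
  head(names(sort(hits, decreasing = TRUE)), top_n)
}

predict_next("What a sunny", bigram_counts)
```

With these toy counts, the candidates after "sunny" come back ordered by frequency: "day", "morning", "side".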
The algorithm uses a “stupid backoff” model.
A backoff model estimates the conditional probability of a word given its history in the n-gram. It does this by backing off through progressively shorter histories, so that the model with the most reliable information about a given history is used.
So in the example “a sunny day”, the model would score “day” by first looking for the trigram “a sunny day” in the corpus, then backing off to the bigram “sunny day”, and finally falling back to the unigram “day”, returning the frequency of “day” divided by the total number of words in the corpus.
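A recursive sketch of this backoff, using toy counts, is shown below. Everything here is illustrative: the counts, the helper names and the backoff factor alpha = 0.4 are assumptions (0.4 is the value usually quoted for stupid backoff, and the source does not say which factor, if any, the app applies).

```r
# Hypothetical n-gram counts keyed by the n-gram string (toy numbers).
counts <- c(
  "a" = 4, "sunny" = 3, "day" = 5, "morning" = 2,
  "a sunny" = 2, "sunny day" = 2, "sunny morning" = 1,
  "a sunny morning" = 1
)
total_words <- 14  # sum of the unigram counts above

# Count of an n-gram, or 0 if it was never seen.
get_count <- function(ngram) {
  if (ngram %in% names(counts)) counts[[ngram]] else 0
}

# Score `word` given `history`, a character vector of the preceding words.
stupid_backoff <- function(word, history, alpha = 0.4) {
  if (length(history) == 0) {
    # Unigram case: frequency of the word divided by the total words
    return(get_count(word) / total_words)
  }
  context <- paste(history, collapse = " ")
  full <- get_count(paste(context, word))
  if (full > 0) {
    # The longer n-gram was seen: use its relative frequency
    return(full / get_count(context))
  }
  # Otherwise drop the earliest word of the history and discount by alpha
  alpha * stupid_backoff(word, history[-1], alpha)
}

stupid_backoff("day", c("a", "sunny"))
```

In this toy corpus the trigram “a sunny day” is absent, so the call backs off once to the bigram “sunny day” and multiplies its relative frequency (2/3) by alpha. The results are scores rather than true probabilities, since they are not normalised to sum to one.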
The shiny ui and server files were created so that the app can take a phrase as user input and output the top ten most likely “next” words.
The phrase is typed into the box and the predicted words are returned in an ordered list, most frequent first.
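A minimal single-file sketch of such an app is given below. It is not the author's ui and server files: the input and output ids, the title and the placeholder predict_next function (which just returns fixed words) are assumptions for illustration.

```r
library(shiny)

# Hypothetical placeholder for the real prediction function.
predict_next <- function(phrase) {
  if (!nzchar(trimws(phrase))) return(character(0))
  c("day", "morning", "side")
}

# UI: a text box for the phrase and a table for the predicted words.
ui <- fluidPage(
  titlePanel("Next word prediction"),
  textInput("phrase", "Type a phrase:"),
  tableOutput("predictions")
)

# Server: recompute the predictions whenever the phrase changes.
server <- function(input, output, session) {
  output$predictions <- renderTable({
    words <- predict_next(input$phrase)
    if (length(words) == 0) return(NULL)
    data.frame(rank = seq_along(words), word = words)
  })
}

app <- shinyApp(ui = ui, server = server)
# Run interactively with: runApp(app)
```

Splitting ui and server into separate ui.R and server.R files, as the app does, works the same way; the single-file form is used here only to keep the sketch short.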
The background colour was set in shiny by loading “library(shinyWidgets)” and adding the following call inside the page definition in the ui file:
setBackgroundColor(color = c("#e6e6fa", "#b6fffd"), gradient = "radial", direction = c("top", "left"))