Nathalie Descusse-Brown
30th June 2019
The purpose of the MindReader app is to predict the user's next word.
The user enters a sentence and the app returns the most likely next word.
The app was trained on a corpus, or collection, of three US-originated documents: a blog file, a news file and a Twitter file. The corpus was tidied up to filter out most 'bad words' (e.g. rude or offensive words).
Extra white space was removed and the entire corpus was converted to lower case for standardisation.
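For illustration only, the cleaning described above might be written roughly as follows; the input file names, the badwords.txt list and the use of tm::removeWords() are assumptions rather than the code actually used.

library(tm)                                                      # for removeWords()
blog_lines    <- readLines("en_US.blogs.txt",   warn = FALSE)    # assumed file names
news_lines    <- readLines("en_US.news.txt",    warn = FALSE)
twitter_lines <- readLines("en_US.twitter.txt", warn = FALSE)
corpus <- tolower(c(blog_lines, news_lines, twitter_lines))      # lower case
corpus <- gsub("\\s+", " ", trimws(corpus))                      # tidy white space
badwords <- readLines("badwords.txt", warn = FALSE)              # placeholder profanity list
corpus <- removeWords(corpus, badwords)                          # filter out bad words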
An analysis of bigrams (sequences of two words) and trigrams (sequences of three words) was then performed. All bigrams and trigrams with a frequency greater than one across the entire corpus were stored in a data frame, which forms the dictionary the app looks up when the user enters an input.
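The code that produced the dictionary is not shown in this document, so the following is only a rough data.table sketch of the idea: count bigrams and trigrams, keep those seen more than once, and split off the last word as the prediction. The make_ngrams() helper and the prefix/prediction column names are made up for illustration; DTbitrisplit.csv is the dictionary file the app reads.

library(data.table)

# Build all n-grams of size n from the cleaned lines
make_ngrams <- function(lines, n) {
  unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(w) {
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
}

bigrams  <- make_ngrams(corpus, 2)
trigrams <- make_ngrams(corpus, 3)

# Count frequencies and keep only the n-grams that occur more than once
DTngram <- data.table(ngram = c(bigrams, trigrams))[, .N, by = ngram][N > 1]

# Split each n-gram into the prefix to match on and the predicted last word
DTngram[, prefix     := sub("\\s\\S+$", "", ngram)]   # everything but the last word
DTngram[, prediction := sub("^.*\\s",   "", ngram)]   # the last word
setorder(DTngram, -N)
fwrite(DTngram, "DTbitrisplit.csv")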
The left-hand side of the app is reserved for user input, where the user is prompted to enter any sentence.
The predicted next word is then displayed in red in the main panel. Et voilà!
The user can enter any sentence and the app will return the most likely next word based on the analysis of the corpus described above.
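A minimal Shiny sketch of that layout (not the app's actual ui.R/server.R) could look like the following, reusing the DTbitrisplit.csv dictionary and the textpredictbitrigram() function shown later on.

library(shiny)
library(data.table)
source("textpredictbitrigram.R")

ui <- fluidPage(
  titlePanel("MindReader"),
  sidebarLayout(
    sidebarPanel(
      textInput("sentence", "Enter any sentence:")                       # user input on the left
    ),
    mainPanel(
      span(textOutput("prediction", inline = TRUE), style = "color:red") # predicted word in red
    )
  )
)

server <- function(input, output) {
  DTngram <- fread("DTbitrisplit.csv", quote = "")                       # n-gram dictionary
  output$prediction <- renderText({
    req(input$sentence)                                                  # wait for some input
    textpredictbitrigram(input$sentence, DTngram)
  })
}

shinyApp(ui, server)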
The function that looks up the dictionary and matches it against the user input is defined in textpredictbitrigram.R.
The code below shows what is returned when the function is called within the app.
library(data.table)                                 # fread() comes from data.table
source('textpredictbitrigram.R')                    # load the prediction function
DTngram <- fread("DTbitrisplit.csv", quote = "")    # bigram/trigram dictionary
textpredictbitrigram("you mean the world to", DTngram)
[1] "me"
Benchmarking was performed with the benchmark.R script and test data provided by https://github.com/hfoffani/dsci-benchmark. The results are reported below:
The app works as follows:
A few considerations for improving the current MindReader app: