MindReader

Nathalie Descusse-Brown
30th June 2019

What does the MindReader app do?

The purpose of the MindReader app is to predict the user's next word.
The user enters a sentence and the app returns the most likely next word.

The app was trained on a corpus (a collection of texts) made up of three US-originated documents: a blog file, a news file and a Twitter file. The corpus was cleaned to filter out most 'bad words' (e.g. rude or offensive words).

Surplus white space was removed and the entire corpus was converted to lower case for standardisation.
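The cleaning steps described above could look roughly like the following. This is only a minimal sketch in base R, not the app's actual cleaning code: the helper name `clean_text`, its `bad_words` argument and the example word list are all hypothetical, and the order of the steps is an assumption.

```r
# Hypothetical helper illustrating the cleaning steps; not the app's own code.
clean_text <- function(lines, bad_words = character(0)) {
  x <- tolower(lines)                        # standardise to lower case
  x <- trimws(gsub("\\s+", " ", x))          # collapse and trim surplus white space
  if (length(bad_words) > 0) {
    # Remove listed words as whole words only (assumes plain alphabetic words)
    pattern <- paste0("\\b(", paste(bad_words, collapse = "|"), ")\\b")
    x <- gsub(pattern, "", x, perl = TRUE)
    x <- trimws(gsub("\\s+", " ", x))        # tidy gaps left by the removals
  }
  x
}

clean_text("  You   MEAN the darn World  ", bad_words = "darn")
# [1] "you mean the world"
```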

Bigrams (sequences of 2 words) and trigrams (sequences of 3 words) were then counted. Every bigram and trigram occurring more than once across the entire corpus was stored in a data frame, which forms the dictionary the app looks up when the user enters an input.
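To illustrate how such a dictionary might be built, here is a small base R sketch that counts n-grams in a toy corpus and keeps those seen more than once. The helper `count_ngrams`, the column names `ngram` and `freq`, and the three example sentences are assumptions made for illustration; the real dictionary was built from the full corpus and its layout may differ.

```r
# Hypothetical helper: count n-grams of order n in already-cleaned lines.
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(words) {
    words <- words[nzchar(words)]
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- table(grams)
  data.frame(ngram = as.character(names(counts)),
             freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Toy corpus (made up for this example)
corpus <- c("you mean the world to me",
            "you mean the world to us",
            "the world is round")

# Combine bigram and trigram counts, then keep only frequencies above 1
dictionary <- rbind(count_ngrams(corpus, 2), count_ngrams(corpus, 3))
dictionary <- dictionary[dictionary$freq > 1, ]
```

In this toy corpus, "the world" is kept with a frequency of 3, whereas "to me" occurs only once and is dropped.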

What does the MindReader app look like?

The left-hand side of the app is reserved for user input, where the user is prompted to enter any sentence.

The predicted next word is then displayed in red in the main panel. Et voilà!

How to use the MindReader app?

The user can enter any sentence and the app will return the most likely next word, based on the analysis of the corpus. The function that looks up the dictionary and matches it against the user input is defined in “textpredictbitrigram.R”.

The code below shows what is returned when the function is called within the app.

library(data.table)  # provides fread()
source('textpredictbitrigram.R')
DTngram <- fread("DTbitrisplit.csv", quote="")
textpredictbitrigram("you mean the world to", DTngram)

[1] "me"

How does it perform?

Benchmarking was performed with the benchmark.R script provided by https://github.com/hfoffani/dsci-benchmark. The results are reported below:

How does the MindReader app work?

The app works as follows:

First, the app searches its dictionary for all trigrams starting with the last two words of the input and returns the last word of the trigram with the highest frequency.
If no matching trigram is found, the app looks up all bigrams starting with the last word of the input and returns the second word of the bigram with the highest frequency.
If no matching bigram is found, the app returns the most common unigram (single word), “the”.
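The three steps above can be sketched as a simple back-off lookup. This is not the code in “textpredictbitrigram.R”: the function name `predict_next_sketch`, the toy dictionaries and their column names (`w1`, `w2`, `w3`, `freq`) are all assumptions made for this example, and the real dictionary layout may differ.

```r
# Toy dictionaries; contents and column names are hypothetical.
trigrams <- data.frame(
  w1 = c("the", "the", "world"),
  w2 = c("world", "world", "to"),
  w3 = c("to", "is", "me"),
  freq = c(5L, 2L, 4L), stringsAsFactors = FALSE)
bigrams <- data.frame(
  w1 = c("to", "to", "world"),
  w2 = c("me", "the", "to"),
  freq = c(6L, 3L, 5L), stringsAsFactors = FALSE)

predict_next_sketch <- function(input, trigrams, bigrams, fallback = "the") {
  words <- strsplit(trimws(tolower(input)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  k <- length(words)
  # Step 1: trigrams starting with the last two words of the input
  if (k >= 2) {
    hits <- trigrams[trigrams$w1 == words[k - 1] & trigrams$w2 == words[k], ]
    if (nrow(hits) > 0) return(hits$w3[which.max(hits$freq)])
  }
  # Step 2: bigrams starting with the last word of the input
  if (k >= 1) {
    hits <- bigrams[bigrams$w1 == words[k], ]
    if (nrow(hits) > 0) return(hits$w2[which.max(hits$freq)])
  }
  # Step 3: fall back to the most common unigram
  fallback
}

predict_next_sketch("you mean the world to", trigrams, bigrams)
# [1] "me"
```

With these toy tables, an input such as "hello there" matches neither a trigram nor a bigram, so the sketch falls back to "the".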

What does the future hold for the MindReader app?

A few considerations for improvement of the current MindReader app:

Prediction accuracy could be improved further by also using 4-grams and 5-grams (sequences of 4 and 5 words, respectively) when looking up the dictionary. However, memory requirements would need to be addressed first: even with the efficient R file-reading function 'fread', looking up a dictionary containing 4-grams and 5-grams currently slows the app down significantly.
Although most 'bad words' have been removed, some words containing symbols have been left in. Whether these words are meaningful, and whether removing them would affect prediction accuracy, still needs to be assessed.