For that reason, unigrams, bigrams, trigrams, 4-grams and 5-grams are created with the N-gram package, and we decide to keep all of them, as long as each appears at least twice in the body of our data.
At the same time, in order to deal with profanity issues, we discard any N-grams that contain words from the “SwearWords.csv” list that can be found at www.bannedwordlist.com.
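As an illustration of this preprocessing step, here is a minimal Python sketch that builds the pruned, profanity-filtered frequency tables; the function name, the token-list representation and the `swear_words` set are our own assumptions, not the actual implementation:

```python
from collections import Counter

def build_ngram_tables(tokens, swear_words, max_n=5, min_count=2):
    """Build frequency tables for 1- to 5-grams, keeping only N-grams
    seen at least `min_count` times and containing no banned words."""
    tables = {}
    for n in range(1, max_n + 1):
        counts = Counter(tuple(tokens[i:i + n])
                         for i in range(len(tokens) - n + 1))
        tables[n] = {gram: c for gram, c in counts.items()
                     if c >= min_count
                     and not any(w in swear_words for w in gram)}
    return tables
```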
The 'Prediction Model' algorithm
The prediction model is based on an optimized Stupid Backoff (λ = 0.4) N-gram frequency algorithm.
The 5-grams are the first N-grams to be used: the algorithm takes the last four words the user has provided and looks up “probabilities” for the fifth word in the 5-gram frequency tables of our “train” text corpus, which serve as frequency dictionaries.
If no match is found, the 4-grams are used, taking into account the last three words of the user input.
If again no match is found, the algorithm continues the same procedure with the trigrams and the bigrams, until it eventually falls back to proposing the most frequent single words (unigrams) of our text corpus, regardless of the user input.
If, as is most often the case, the search finds one or more suggestions, the candidates coming from the lower-order N-gram frequency dictionaries are discounted by the back-off weight λ, i.e. multiplied by 0.4 for each level of back-off.
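The back-off loop can be sketched as follows, again as a hypothetical Python illustration of Stupid Backoff scoring over the frequency dictionaries built above; the λ = 0.4 discount per back-off level comes from the description above, while the function names and the use of relative frequencies within each order are our own assumptions:

```python
def predict_next(tables, context, lam=0.4, top_k=3):
    """Rank candidate next words with Stupid Backoff: start from
    5-grams (last four words of `context`) and back off to shorter
    N-grams, multiplying each lower order's scores by `lam`."""
    scores = {}
    weight = 1.0
    for n in range(5, 1, -1):              # 5-grams, 4-grams, ..., bigrams
        if len(context) >= n - 1:
            history = tuple(context[-(n - 1):])
            matches = {gram[-1]: c for gram, c in tables[n].items()
                       if gram[:-1] == history}
            total = sum(matches.values())  # proxy for the history count
            for word, count in matches.items():
                # keep the highest-order (least discounted) score per word
                scores.setdefault(word, weight * count / total)
        weight *= lam                      # discount the next lower order
    if not scores:                         # no match at any order:
        total = sum(tables[1].values())    # propose the top unigrams
        scores = {g[0]: weight * c / total
                  for g, c in tables[1].items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Calling `predict_next(tables, user_input.split())` on the tables from the previous sketch would then return the top-ranked candidate words for the user's input.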