Next-Word Prediction Language Modeling Application

Georgios Tsagiannis (geotsa)
13/07/2020

The goal of the project

The goal of the project is to build an application that, like a smartphone keyboard, predicts the next word of the text a user is typing. The model's performance is scored against the next-word prediction benchmark test that can be found at: https://github.com/hfoffani/dsci-benchmark (psw:capstone4)

Processing of the data

  • The prediction algorithm is built on a huge corpus of raw text data drawn from blogs, Twitter, and news sites,
  • Our first concern is data manipulation and meticulous data cleaning,
  • The second step, the exploratory analysis (summarized at: https://github.com/gtsa/NLP_Next-Word_Prediction_Language_Model/blob/master/Report.html), aims at a better understanding of the properties of the data and at a first estimate of the trade-off between the amount of data needed and our technical ability to process it efficiently (accurately and fast),
  • For that reason, unigrams, bigrams, trigrams, 4-grams, and 5-grams are created with the N-gram package; we end up using all of them, as long as they appear at least twice in the body of our data,
  • At the same time, to deal with profanity issues, we choose not to use those N-grams that contain words from the “SwearWords.csv” list found at www.bannedwordlist.com (a sketch of this step follows the list).
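Below is a minimal sketch of how such frequency dictionaries can be built. It is illustrative rather than the project's exact code: `corpus` is assumed to be a character vector of cleaned, lower-cased sentences, and “SwearWords.csv” is assumed to hold one banned word per line.

```r
# Illustrative N-gram table construction (not the project's exact code).
swear_words <- tolower(readLines("SwearWords.csv"))

build_ngrams <- function(corpus, n) {
  tokens <- strsplit(corpus, "\\s+")
  grams <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  freq <- table(grams)
  freq <- freq[freq >= 2]                 # keep N-grams seen at least twice
  # profanity filter: drop any N-gram containing a banned word
  keep <- !vapply(strsplit(names(freq), " "),
                  function(w) any(w %in% swear_words), logical(1))
  sort(freq[keep], decreasing = TRUE)
}

# one frequency dictionary per N-gram order, 1 through 5
ngram_tables <- lapply(1:5, function(n) build_ngrams(corpus, n))
```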

The 'Prediction Model' algorithm

  • The prediction model is based on an optimized Stupid Back-off (λ = 0.4) N-gram frequency algorithm,
  • 5-grams are the first N-grams to be used: the algorithm takes into account the last four words the user has provided and looks up “probabilities” for the fifth one in the N-gram frequency tables of our “train” text corpus, which serve as frequency dictionaries,
  • If no match is found, the 4-grams are used (taking into account the last three words of the user input),
  • If no match is found, the algorithm continues the same procedure with the trigrams and the bigrams, eventually falling back to proposing the most frequent single words (unigrams) of our text corpus, regardless of the user input,
  • If, as is most often the case, the search finds one or more suggestions, the matches drawn from the lower-order frequency dictionaries are weighted with a lower weight λ for each back-off step (see the sketch after this list).
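The back-off lookup itself can be sketched as follows. `predict_next` is a hypothetical helper, and `ngram_tables` is assumed to be the list of frequency tables from the previous sketch (indexed by N-gram order); the deployed model may differ in its details.

```r
# Illustrative Stupid Back-off lookup with lambda = 0.4.
predict_next <- function(input, ngram_tables, lambda = 0.4, top = 3) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  unigrams <- head(names(sort(ngram_tables[[1]], decreasing = TRUE)), top)
  if (length(words) == 0) return(unigrams)

  scores <- numeric(0)
  weight <- 1
  for (n in seq(min(5, length(words) + 1), 2)) {
    tbl     <- ngram_tables[[n]]
    grams   <- names(tbl)
    context <- paste(tail(words, n - 1), collapse = " ")
    hit     <- startsWith(grams, paste0(context, " "))
    if (any(hit)) {
      s        <- weight * as.numeric(tbl[hit]) / sum(tbl[hit])
      names(s) <- sub(".* ", "", grams[hit])   # candidate = last word
      scores   <- c(scores, s[!names(s) %in% names(scores)])
    }
    weight <- weight * lambda                  # back off with a lower weight
  }
  # fall back to the most frequent unigrams if too few suggestions are found
  head(unique(c(names(sort(scores, decreasing = TRUE)), unigrams)), top)
}

predict_next("thank you very", ngram_tables)   # e.g. "much", ...
```

Each step down the back-off chain multiplies the weight by λ = 0.4, so a candidate found in the 5-gram table always outranks one found only in a lower-order table.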

The Shiny Application


  • The app is titled “NLP Next-Word Prediction App”
  • https://gtsa.shinyapps.io/NLP_Next-Word_Predicition_App/
  • Just like typing on a smartphone, the user can simply type a single word or sentence(s) in the input box provided.
  • Abbreviations, numbers, symbols, and punctuation are removed from the input before the model predicts the next word.
  • The app displays the 3 most probable options for the next word, based on the input.
  • The most probable among them appears in the green-bordered box.
  • The user can click on one of the suggestions if there is a match, or just keep typing (a minimal interface sketch follows).
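The interface can be sketched with a minimal Shiny app, assuming `predict_next()` and `ngram_tables` from the previous sketches are available; the layout of the deployed app at the URL above may differ.

```r
# Minimal, illustrative Shiny interface (hypothetical layout).
library(shiny)

clean_input <- function(x) {
  # strip numbers, symbols and punctuation before prediction
  gsub("[^a-z' ]", " ", tolower(x))
}

ui <- fluidPage(
  titlePanel("NLP Next-Word Prediction App"),
  textInput("text", "Type a word or sentence:", width = "100%"),
  uiOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderUI({
    req(nchar(input$text) > 0)
    preds <- predict_next(clean_input(input$text), ngram_tables, top = 3)
    tagList(
      # the most probable suggestion gets the green border
      span(preds[1], style = "border: 2px solid green; padding: 4px;"),
      lapply(preds[-1], function(p)
        span(p, style = "border: 1px solid grey; padding: 4px;"))
    )
  })
}

shinyApp(ui, server)
```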


Thank you very much!