Gianmarco Polotti
November 8, 2020
Final project for the Coursera Data Science Specialization
Web App for the prediction of the next word in an unknown sequence of words.
The algorithm learns from a large dataset provided by the Coursera-SwiftKey team.
The 4 algorithms are:
They will be detailed in the next slides.
Data Cleaning: raw text contains a lot of useless information that needs to be removed before analysis. I decided to remove, in order: mentions, URLs, emojis, numbers, punctuation and extra spaces; everything is then converted to lowercase. Stop words are removed using the standard “stopwords-iso” dataset.
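The cleaning order described above can be sketched as follows. This is a minimal illustration in Python, not the author's R code: the regular expressions, the function name `clean_text` and the tiny `STOPWORDS` set are assumptions made for the example (the project itself uses the full “stopwords-iso” list).

```python
import re

# Tiny illustrative stop-word set (assumption); the project uses "stopwords-iso".
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and"}

def clean_text(text, stopwords=STOPWORDS):
    """Apply the cleaning steps in the order listed in the slide."""
    text = re.sub(r"@\w+", " ", text)                   # mentions
    text = re.sub(r"(https?://|www\.)\S+", " ", text)   # URLs
    # Common emoji and symbol ranges (not an exhaustive emoji definition)
    text = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", " ", text)
    text = re.sub(r"\d+", " ", text)                    # numbers
    text = re.sub(r"[^\w\s]|_", " ", text)              # punctuation
    text = re.sub(r"\s+", " ", text).strip()            # repeated spaces
    text = text.lower()                                 # unify case
    # Drop stop words last, once the text is normalised
    return " ".join(w for w in text.split() if w not in stopwords)

sample = "@bob Check https://example.com NOW!!! 3 cats 😀 and   the dog"
cleaned = clean_text(sample)
```

Mentions and URLs are removed before punctuation so that the `@` and `://` markers are still present when their patterns run.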
Data Analysis: the most frequent occurrences are analysed with standard NLP packages common in R, such as the tokenizers package. Useful frequency distributions of single words (unigrams), pairs of words (bigrams) and triplets of words (trigrams) are shown below.
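To make the three distributions concrete, here is a small Python sketch of n-gram counting. It stands in for what an R tokenizer package would produce and is not the author's code; the helper name `ngram_counts` and the toy sentence are assumptions.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "new york city new york state".split()
unigrams = ngram_counts(tokens, 1)   # single words
bigrams = ngram_counts(tokens, 2)    # pairs of words
trigrams = ngram_counts(tokens, 3)   # triplets of words

# The most frequent bigram in this toy sentence
top_bigram = bigrams.most_common(1)[0]
```

Sorting each counter by frequency gives the data behind a frequency histogram for each n-gram size.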
Model: from the original dataset, I built three dictionaries that collect the distributions of unigrams, bigrams and trigrams respectively. The dictionaries are stored in binary format so that the model can load them quickly. The new phrase is cleaned and decomposed into its component words.
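One way the three dictionaries could be built, stored in binary form and queried is sketched below in Python. The slide does not state how the dictionaries are combined at prediction time, so the back-off lookup here (longest context first, then shorter ones) is an assumption, as are the names `build_dictionary` and `predict` and the use of `pickle` as the binary format.

```python
import pickle
from collections import Counter, defaultdict

def build_dictionary(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        table[context][tokens[i + n - 1]] += 1
    return dict(table)

def predict(words, dictionaries, top=3):
    """Assumed strategy: try the longest context first, then back off."""
    for n in sorted(dictionaries, reverse=True):
        k = n - 1                          # number of context words needed
        if len(words) < k:
            continue
        context = tuple(words[len(words) - k:])
        followers = dictionaries[n].get(context)
        if followers:
            total = sum(followers.values())
            return [(w, c / total) for w, c in followers.most_common(top)]
    return []

corpus = "i like green tea i like green apples i like black tea".split()
dictionaries = {n: build_dictionary(corpus, n) for n in (1, 2, 3)}

# Round-trip through a binary blob, as a stand-in for the stored dictionaries
dictionaries = pickle.loads(pickle.dumps(dictionaries))

suggestions = predict(["i", "like"], dictionaries)
```

With this toy corpus, the context "i like" is found in the trigram dictionary, while an unseen first word makes the lookup fall back to the bigram dictionary.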
Web App: the phrase is analysed after the Find button is pressed. The histogram shows the most probable words with their probabilities. Before entering a new input, the New button must be pressed in order to avoid a reactive update.