Word Prediction Web App

Gianmarco Polotti
November 8, 2020

Final project for the Coursera Data Science Specialization

Project Key Points

Web App that predicts the next word in an arbitrary sequence of words.
The algorithm is trained on a large dataset provided by the Coursera-SwiftKey team.
The 4 steps of the work are:
1. exploratory analysis,
2. data cleaning,
3. model construction,
4. Web App compilation.
They are detailed in the next slides.

Exploratory and Data Cleaning

Data Cleaning: raw text contains a lot of useless information that needs to be removed before analysis. I decided to remove, in order: mentions, URLs, emojis, numbers, punctuation and extra spaces; everything is then converted to lowercase. Stop words are removed using the standard “stopwords-iso” dataset.
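The cleaning order above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the project's R code; the regular expressions are assumptions, the function name clean_text is hypothetical, and STOPWORDS is a tiny stand-in for the real “stopwords-iso” list.

```python
import re

# Hypothetical stand-in for the English "stopwords-iso" list (assumption).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}

def clean_text(text, stopwords=STOPWORDS):
    """Apply the cleaning steps in the order listed on the slide."""
    text = re.sub(r"@\w+", " ", text)                  # mentions
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # URLs
    text = re.sub(r"[^\x00-\x7F]+", " ", text)         # emojis (crude: drops all non-ASCII)
    text = re.sub(r"\d+", " ", text)                   # numbers
    text = re.sub(r"[^\w\s]|_", " ", text)             # punctuation
    text = re.sub(r"\s+", " ", text).strip().lower()   # extra spaces, then lowercase
    # Stop-word removal comes last, on the normalized tokens.
    return " ".join(w for w in text.split() if w not in stopwords)

print(clean_text("Hey @bob, check https://x.com/a1 NOW!!! I have 2 cats and the dog."))
```

Removing mentions and URLs before punctuation matters: once the "@" or "://" characters are stripped, those tokens can no longer be recognized.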

Data Analysis: the most frequent occurrences are analysed with standard NLP packages common in R, such as the tokenizers package. The frequency distributions of single words (unigrams), pairs of words (bigrams) and triplets of words (trigrams) are shown below.

[Figure: frequency distributions of the most common unigrams, bigrams and trigrams]
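The counting behind such frequency distributions can be sketched in a few lines. This is a Python illustration under the assumption that the text is already cleaned and split into tokens; the project itself relies on R packages, and the helper name ngram_counts is hypothetical.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "new york city new york state".split()
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

# The most frequent bigram in this toy sentence.
print(bigrams.most_common(1))
```

Sorting each counter by frequency gives the data for one histogram per n-gram order.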

Model and Web App

Model: from the original dataset, I built three dictionaries that collect the distributions of unigrams, bigrams and trigrams respectively. The dictionaries are stored in binary format so that the model can load them quickly. The new phrase is cleaned and split into its component words.
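The slides do not state how the three dictionaries are combined into a prediction. One common option, shown here purely as an assumption, is a simple backoff: look up the last two words in the trigram table, fall back to the last word in the bigram table, and finally to the overall unigram frequencies. The sketch is in Python rather than the project's R, and build_tables and predict_next are hypothetical names.

```python
from collections import Counter, defaultdict

def build_tables(tokens):
    """Build unigram counts plus bigram and trigram continuation tables."""
    uni = Counter(tokens)
    bi = defaultdict(Counter)
    tri = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        bi[a][b] += 1
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        tri[(a, b)][c] += 1
    return uni, bi, tri

def predict_next(words, uni, bi, tri, k=3):
    """Return up to k (word, probability) pairs using simple backoff (assumed scheme)."""
    if len(words) >= 2 and tuple(words[-2:]) in tri:
        candidates = tri[tuple(words[-2:])]   # longest matching context
    elif words and words[-1] in bi:
        candidates = bi[words[-1]]            # back off to one word of context
    else:
        candidates = uni                      # no context matched
    total = sum(candidates.values())
    return [(w, c / total) for w, c in candidates.most_common(k)]

corpus = "i like green tea i like green apples i like black tea".split()
uni, bi, tri = build_tables(corpus)
print(predict_next(["i", "like"], uni, bi, tri))
```

The probabilities are relative frequencies within the matched context, which is the kind of value a histogram of candidate words could display.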

Web App: the phrase is analysed after the Find button is pressed. The histogram shows the most probable words with their probabilities. Before entering a new input, the New button must be pressed to avoid reactive updating.

[Figure: Web App]

Algorithm Block Diagrams

[Figure: algorithm block diagrams]