Archel
11/7/2021
This presentation briefly describes the final project of Coursera’s “Data Science Specialization” course. The project builds a predictive model that recommends words in response to a user prompt. The idea came from SwiftKey (now Microsoft SwiftKey), the predictive keyboard for Android and iOS devices originally developed by TouchType.
The original SwiftKey dataset covers several languages, including German, French, and Russian. In this project, however, I used only the English dataset. It consists of three separate files from different sources: blogs, news, and Twitter. In each file the data is separated by lines, so the dataset is effectively a collection of sentences.
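As a minimal sketch of how such a line-separated file could be loaded (the helper name `read_entries` and the file path in the comment are placeholders, not taken from the project), each non-empty line can be treated as one entry:

```python
def read_entries(lines):
    """Return one cleaned entry per non-empty line.

    `lines` can be any iterable of strings, including an open file handle.
    """
    return [line.strip() for line in lines if line.strip()]


# With a real file, the call would look like this (the path is a placeholder):
# with open("path/to/english_blogs.txt", encoding="utf-8") as handle:
#     entries = read_entries(handle)

sample = ["First entry of text.\n", "\n", "  Second entry.  \n"]
print(read_entries(sample))
```

Accepting any iterable of strings keeps the helper independent of where the text comes from, so the same function works for all three source files.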
Before feeding the data into the model, there are several preprocessing steps that need to be done:
Several predictive modelling algorithms can be used for next-word prediction. In this case, I used Katz’s Back-off model, one of the most widely used algorithms in language modelling. Katz’s Back-off (or simply Back-off) is a generative n-gram language model that estimates the probability of a word given its history, that is, the words that precede it in the n-gram. When a history has not been observed together with a candidate word, the model backs off to a shorter history.
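The back-off idea can be illustrated with a small Python sketch. This is not the project's code. It is a simplified variant that assumes a fixed absolute discount in place of the Good-Turing discounting used in the original Katz formulation. All function names, the `discount` value, and the toy sentences are hypothetical.

```python
from collections import Counter, defaultdict


def build_followers(sentences, max_order=4):
    """Map each history (tuple of up to max_order - 1 words) to next-word counts."""
    followers = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for h in range(max_order):
                if i - h < 0:
                    break
                followers[tuple(tokens[i - h:i])][word] += 1
    return followers


def backoff_prob(word, history, followers, discount=0.5):
    """Discounted probability of `word` after `history`, backing off when unseen."""
    history = tuple(history)
    if not history:
        # Base case: plain unigram relative frequency.
        unigrams = followers[()]
        return unigrams[word] / sum(unigrams.values())
    seen = followers.get(history)
    if not seen:
        # History never observed: drop its oldest word and try again.
        return backoff_prob(word, history[1:], followers, discount)
    total = sum(seen.values())
    if word in seen:
        # Observed n-gram: subtract a fixed discount (assumption, not Good-Turing).
        return (seen[word] - discount) / total
    # Unobserved n-gram: share the mass freed by discounting, in proportion
    # to the lower-order probabilities of the unobserved words.
    leftover = discount * len(seen) / total
    lower_unseen = 1.0 - sum(
        backoff_prob(w, history[1:], followers, discount) for w in seen
    )
    if lower_unseen <= 0:
        return 0.0
    return leftover * backoff_prob(word, history[1:], followers, discount) / lower_unseen


def predict_next(prompt_tokens, followers, max_order=4, top_k=3, discount=0.5):
    """Return the top_k most probable next words for a tokenized prompt."""
    history = tuple(prompt_tokens[-(max_order - 1):]) if max_order > 1 else ()
    scores = {w: backoff_prob(w, history, followers, discount) for w in followers[()]}
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


# Toy corpus, invented for illustration only.
toy = [s.split() for s in ["i love you", "i love cats", "i love you", "you love cats"]]
model = build_followers(toy)
print(predict_next("i love".split(), model))
```

Scoring every vocabulary word is only practical for a toy corpus. A real application would more likely precompute and prune the n-gram tables.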
In this project I built n-gram models from 2-grams up to 4-grams. The pipeline is as follows:
The image below shows the web application I built for this project. The text box on the left is where the user types a sentence, and the predicted words appear on the right.
In this project we have seen how a Back-off (n-gram) model can predict or recommend words from user input. We started with a raw dataset, cleaned and preprocessed it, and fed the resulting corpus into the model. This is a common approach because it does not require a deep understanding of NLP. In addition, it is quite memory- and time-efficient.
This project could be improved further by using other approaches to language modelling, such as deep learning models. These can capture meaning and connections between words, which the Back-off algorithm cannot do.