Thawatchai Phakwithoonchai
03 May 2020
———————————————————————————
This is the final project of the “Data Science Capstone” course, part of the Coursera Data Science Specialization offered by Johns Hopkins University (JHU).
The objective of this project is to create a data product (application) that predicts the next word from the preceding words, phrase, or sentence. The datasets, provided by JHU in cooperation with SwiftKey, cover four languages: English (en_US), Russian (ru_RU), German (de_DE), and Finnish (fi_FI). Each language dataset consists of three files containing text gathered from blogs, news, and Twitter. Only the English datasets are used to build the language model for this application.
———————————————————————————
This application was created, modified, and improved through multiple iterations of the following general stages:
———————————————————————————
Create the n-gram datasets; in this application, the quadgram (n = 4) is the highest-order n-gram used to build the language model
The Markov assumption is applied to simplify the n-gram language model: the next word is predicted from only the last n - 1 words (at most three here) rather than from the whole input
Kneser–Ney smoothing is applied to calculate the probability distribution of the n-grams in the cleaned dataset
Back-off is also applied: when a higher-order n-gram model has learned too little about a given context, the model falls back to a lower-order n-gram that uses less context
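To make the smoothing step concrete, the following is a minimal sketch of interpolated Kneser–Ney for the bigram case only, written in Python for illustration. It is not the application's own code: the function names, the toy sentence, and the discount of 0.75 are assumptions chosen for the example, and the real model extends the same idea up to quadgrams.

```python
from collections import Counter, defaultdict


def bigram_counts(tokens):
    """Count adjacent word pairs (prev, word) in a list of tokens."""
    return Counter(zip(tokens, tokens[1:]))


def kneser_ney_bigram(counts, discount=0.75):
    """Build p(word, prev): the interpolated Kneser-Ney bigram estimate.

    `discount` is an illustrative value, not one taken from the application.
    """
    context_total = Counter()      # how often prev is followed by any word
    followers = defaultdict(set)   # distinct words seen after prev
    preceders = defaultdict(set)   # distinct words seen before word
    for (prev, word), c in counts.items():
        context_total[prev] += c
        followers[prev].add(word)
        preceders[word].add(prev)
    n_bigram_types = len(counts)

    def prob(word, prev):
        # Continuation probability: in how many distinct contexts the word appears.
        p_cont = len(preceders.get(word, ())) / n_bigram_types
        total = context_total[prev]
        if total == 0:
            # Unseen context: rely on the continuation probability alone.
            return p_cont
        discounted = max(counts[(prev, word)] - discount, 0) / total
        # Probability mass freed by discounting, redistributed via p_cont.
        lam = discount * len(followers.get(prev, ())) / total
        return discounted + lam * p_cont

    return prob


# Toy usage with a hypothetical sentence.
tokens = "the cat sat on the mat the cat ran".split()
p = kneser_ney_bigram(bigram_counts(tokens))
print(p("cat", "the"), p("mat", "the"))
```

The discounted term rewards pairs that were actually observed, while the continuation term favors words that follow many different contexts, which is what distinguishes Kneser–Ney from plain frequency-based smoothing.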
———————————————————————————
The application is developed and deployed on a Shiny server
The web interface lets users enter the preceding words, phrase, or sentence in the text box and then click the “Submit” button
The language model returns the top 10 words for each n-gram order, ranked by probability
An info icon in the sidebar tab provides brief information about the model, its performance, and references
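The way results are grouped by n-gram order can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the deployed Shiny code: `build_tables` and `predict` are hypothetical names, and the scores are plain relative frequencies rather than the Kneser–Ney probabilities the application uses. The lookup starts at the quadgram level, uses only the last n - 1 input words as context, and skips any order whose context was never seen.

```python
from collections import Counter, defaultdict

MAX_N = 4  # quadgram is the highest order in the model described above


def build_tables(tokens, max_n=MAX_N):
    """Map each order n to {context tuple: Counter of next words}."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i:i + n]
            tables[n][tuple(context)][word] += 1
    return tables


def predict(tables, text, top_k=10, max_n=MAX_N):
    """Return {n: [(word, relative frequency), ...]}, highest order first."""
    words = text.lower().split()
    results = {}
    for n in range(max_n, 0, -1):
        if len(words) < n - 1:
            continue  # input too short to form an (n - 1)-word context
        # Markov assumption: keep only the last n - 1 words as context.
        context = tuple(words[len(words) - (n - 1):])
        followers = tables[n].get(context)
        if not followers:
            continue  # context unseen at this order: back off to a lower one
        total = sum(followers.values())
        results[n] = [(w, c / total) for w, c in followers.most_common(top_k)]
    return results


# Toy usage with a hypothetical corpus.
tables = build_tables("the cat sat on the mat the cat ran".split())
for n, candidates in predict(tables, "on the").items():
    print(n, candidates)
```

Returning one ranked list per order, rather than a single merged list, mirrors the interface described above, where the top words are shown separately for each n-gram.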