Application for Word Prediction

Thawatchai Phakwithoonchai
03 May 2020

Overview

———————————————————————————
This is the final project of the “Data Science Capstone” course, which is part of the Data Science Specialization offered on Coursera by Johns Hopkins University (JHU).

The objective of this project is to create a data product: an application that predicts the next word based on the prior words, phrase, or sentence. The datasets, provided by JHU in cooperation with SwiftKey, cover multiple languages: English (en_US), Russian (ru_RU), German (de_DE), and Finnish (fi_FI). Each language dataset consists of three files containing text gathered from blogs, news, and Twitter. For this application, only the English dataset is used to build the language model.
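As a rough sketch of the data-loading step, the three English files can be read into R as shown below. The final/en_US/ paths assume the usual layout of the SwiftKey download (adjust them to the local copy), and the 5% sampling rate is only an illustrative choice to keep memory use manageable.

    # Read the English corpus (paths assume the SwiftKey final/en_US/ layout).
    read_corpus <- function(path) {
      readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
    }

    blogs   <- read_corpus("final/en_US/en_US.blogs.txt")
    news    <- read_corpus("final/en_US/en_US.news.txt")
    twitter <- read_corpus("final/en_US/en_US.twitter.txt")

    # Work with a random sample of each source while building the model.
    set.seed(2020)
    sample_text <- c(sample(blogs,   round(0.05 * length(blogs))),
                     sample(news,    round(0.05 * length(news))),
                     sample(twitter, round(0.05 * length(twitter))))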

Application Development Life Cycle

———————————————————————————
This application was created, modified, and improved through multiple iterations of the following general stages:

  • Getting and cleaning the data (convert to lowercase; remove URLs, e-mail addresses, Twitter handles (@…), hashtags, numbers, punctuation, profanity, and extra whitespace); a cleaning sketch is shown after this list
  • Exploratory data analysis to visualize word frequencies
  • Using an n-gram language model to build the word prediction framework
  • Measuring model performance with perplexity (a small helper is sketched after this list)
  • Developing and deploying the predictive text application on shinyapps.io
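The cleaning steps listed above can be expressed as a chain of regular-expression substitutions. A minimal sketch in base R, where bad_words is a placeholder for whichever profanity list is actually used:

    # Minimal cleaning pipeline; `bad_words` is a placeholder profanity list.
    clean_text <- function(x, bad_words = character(0)) {
      x <- tolower(x)
      x <- gsub("https?://\\S+|www\\.\\S+", " ", x, perl = TRUE)  # URLs
      x <- gsub("\\S+@\\S+", " ", x, perl = TRUE)                 # e-mail addresses
      x <- gsub("@\\w+", " ", x, perl = TRUE)                     # Twitter handles
      x <- gsub("#\\w+", " ", x, perl = TRUE)                     # hashtags
      x <- gsub("[0-9]+", " ", x)                                 # numbers
      x <- gsub("[[:punct:]]+", " ", x)                           # punctuation
      if (length(bad_words) > 0) {                                # profanity
        pattern <- paste0("\\b(", paste(bad_words, collapse = "|"), ")\\b")
        x <- gsub(pattern, " ", x, perl = TRUE)
      }
      gsub("\\s+", " ", trimws(x), perl = TRUE)                   # extra whitespace
    }

    clean_text("Check https://example.com @user #nlp in 2020!!")
    # [1] "check in"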
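Perplexity, the evaluation metric named above, is the inverse probability of a held-out test set normalized by the number of words, i.e. the exponential of the average negative log-probability the model assigns to each test word. A tiny helper, assuming those per-word probabilities have already been collected:

    # Perplexity: exp(-(1/N) * sum(log p(w_i | history))).
    # Lower values mean the model is less "surprised" by the held-out text.
    perplexity <- function(word_probs) {
      stopifnot(all(word_probs > 0))
      exp(-mean(log(word_probs)))
    }

    # Toy example with four held-out words and their model probabilities.
    perplexity(c(0.25, 0.10, 0.05, 0.20))
    # about 7.95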

Algorithm and Model

———————————————————————————

  • Create n-gram datasets of different orders; in this application, quadgrams (n = 4) are the highest-order n-grams used to build the language model (a counting sketch follows this list)

  • The Markov assumption is applied to simplify the n-gram language model: the probability of the next word is conditioned only on the previous n - 1 words

  • Kneser–Ney smoothing is applied to estimate the probability distribution of the n-grams in the cleaned dataset

  • Back-off is also applied: when the given context is short, or the higher-order n-gram was not observed during training, the model falls back to a lower-order n-gram (the prediction sketch after this list shows this lookup)
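As a rough sketch of the first step, the code below builds frequency tables for n-grams of order 1 to 4 from the cleaned, tokenized sample (re-using clean_text() and sample_text from the earlier sketches). For simplicity it ignores sentence boundaries; the prefix/word split and Kneser–Ney smoothing are applied to these counts afterwards and are not shown.

    # Frequency tables for n-grams of order 1 to 4 (base R only).
    count_ngrams <- function(tokens, n) {
      if (length(tokens) < n) return(table(character(0)))
      grams <- vapply(seq_len(length(tokens) - n + 1),
                      function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
                      character(1))
      sort(table(grams), decreasing = TRUE)
    }

    tokens <- unlist(strsplit(clean_text(sample_text), " ", fixed = TRUE))
    ngram_counts <- lapply(1:4, function(n) count_ngrams(tokens, n))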
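A minimal sketch of the prediction step, under the assumption that the Kneser–Ney smoothed probabilities have been precomputed into a list of four lookup tables (one per order, with illustrative columns prefix, word, and prob). For reference, the Kneser–Ney bigram estimate has the form P(w | u) = max(c(u, w) - d, 0) / c(u) + lambda(u) * Pcont(w), where d is a fixed discount, lambda(u) redistributes the discounted mass, and Pcont(w) is the continuation probability of w. The function backs off from the quadgram table to lower orders whenever the context is too short or unseen.

    # Back-off lookup from quadgrams down to unigrams.
    # `ngram_tables` is assumed to be a list indexed by order (1 to 4), each a
    # data frame with columns: prefix, word, prob (Kneser-Ney probability).
    predict_next <- function(input, ngram_tables, top_n = 10) {
      tokens <- strsplit(clean_text(input), " ", fixed = TRUE)[[1]]
      for (n in 4:2) {                                  # quadgram -> bigram
        if (length(tokens) < n - 1) next                # not enough context
        ctx  <- paste(tail(tokens, n - 1), collapse = " ")
        hits <- subset(ngram_tables[[n]], prefix == ctx)
        if (nrow(hits) > 0) {                           # context was observed
          hits <- hits[order(-hits$prob), ]
          return(head(hits$word, top_n))
        }
      }
      uni <- ngram_tables[[1]]                          # final back-off: unigrams
      head(uni$word[order(-uni$prob)], top_n)
    }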

Application

———————————————————————————

  • The application is developed with Shiny and deployed on shinyapps.io

  • The web interface lets users type the prior words, phrase, or sentence into a text box and then click the “Submit” button (a minimal interface sketch follows this list)

  • The language model output shows the top 10 candidate words for each n-gram order, ranked by probability.

  • An info icon on the sidebar tab provides brief information about the model, its performance, and references.
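A minimal sketch of the interface described above, built with Shiny. Widget IDs, labels, and layout are illustrative rather than the exact ones used in the deployed app; predict_next() and ngram_tables refer to the back-off sketch shown earlier.

    library(shiny)

    ui <- fluidPage(
      titlePanel("Word Prediction"),
      sidebarLayout(
        sidebarPanel(
          textInput("text", "Enter a word, phrase, or sentence:"),
          actionButton("submit", "Submit")
        ),
        mainPanel(tableOutput("predictions"))
      )
    )

    server <- function(input, output) {
      # ngram_tables is assumed to be loaded once at start-up (e.g. from .rds files).
      preds <- eventReactive(input$submit, {
        predict_next(input$text, ngram_tables)
      })
      output$predictions <- renderTable({
        data.frame(rank = seq_along(preds()), word = preds())
      })
    }

    shinyApp(ui, server)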