Data Science Capstone Text Prediction App

West Pang
14 May 2017

Introduction

The goal of this capstone project is to mimic the experience of being a data scientist by applying the data science techniques learned across all 9 specialization courses to build a text prediction app for SwiftKey.

The datasets used to build the prediction library are provided on the Coursera course website. They are derived from a corpus called HC Corpora (www.corpora.heliohost.org) and include three corpus files (blogs, news, and twitter) for each of four locales (English, German, Russian, and Finnish). This app uses only the US English files.

The text prediction app receives a word, phrase, or sentence from the user and predicts the next possible word.

Cleaning and N-Gram Tokenization

The datasets (blogs, news, and twitter) are combined into a single corpus, which is then cleaned by removing irrelevant content: URL links, punctuation, profanity, numbers, non-ASCII characters, and unnecessary whitespace.
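
The cleaning steps listed above could look roughly like the following Python sketch. It is only an illustration, not the code used in the project: the function name clean_text, the tiny PROFANITY stand-in list, the lowercasing step, and the decision to delete apostrophes (so "it's" becomes "its") are all assumptions.

```python
import re

# Hypothetical stand-in; the actual profanity list is not given in the text.
PROFANITY = {"darn", "heck"}

def clean_text(line, profanity=PROFANITY):
    """Strip URLs, non-ASCII characters, numbers, punctuation,
    profanity and extra whitespace from one line of text."""
    text = line.lower()                                    # assumption: case-fold first
    text = re.sub(r"(https?://|www\.)\S+", " ", text)      # URL links
    text = text.encode("ascii", "ignore").decode("ascii")  # non-ASCII characters
    text = re.sub(r"\d+", " ", text)                       # numbers
    text = text.replace("'", "")                           # assumption: "it's" -> "its"
    text = re.sub(r"[^a-z\s]", " ", text)                  # remaining punctuation
    words = [w for w in text.split() if w not in profanity]
    return " ".join(words)                                 # single spaces only
```

For example, clean_text("Visit http://example.com NOW!!  It's 100% darn café time") returns "visit now its caf time".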

The cleaned data is, however, difficult to tokenize on an ordinary computer because of its extremely large size. A divide-and-conquer strategy is used: the data is split into 10 chunks, 2-gram and 3-gram tokenization is performed on each chunk separately, and the per-chunk results are then aggregated to form the 2-gram and 3-gram prediction models.
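
A minimal sketch of the chunk-and-aggregate idea, assuming the cleaned corpus is available as a list of lines and that whitespace splitting is an acceptable tokenizer. The names ngrams and count_ngrams_in_chunks are hypothetical. Because n-gram counts are additive, summing the per-chunk counters gives the same totals as counting the whole corpus at once.

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield consecutive n-word tuples from a list of tokens."""
    return zip(*(tokens[i:] for i in range(n)))

def count_ngrams_in_chunks(lines, n, n_chunks=10):
    """Count n-grams chunk by chunk, then merge the partial counts."""
    total = Counter()
    chunk_size = max(1, -(-len(lines) // n_chunks))  # ceiling division
    for start in range(0, len(lines), chunk_size):
        partial = Counter()                          # counts for this chunk only
        for line in lines[start:start + chunk_size]:
            partial.update(ngrams(line.split(), n))
        total.update(partial)                        # aggregate into the full model
    return total
```

In this sketch n-grams never span two lines, so the chunk boundaries do not change the result.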

2-gram and 3-gram tokens with a frequency of less than 4 are discarded, because they contribute little probability and accuracy to the prediction. To further reduce the size of the 2-gram and 3-gram token files, only the 75% quantile of the tokens is selected. The result forms the prediction library used by the prediction algorithm.
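
The frequency cut-off can be sketched as a simple filter over a count table. The function name prune_rare and the dict-of-counts layout are assumptions made for illustration. The quantile-based reduction is left out of the sketch because the text does not say exactly how it is computed.

```python
def prune_rare(counts, min_count=4):
    """Keep only n-grams seen at least min_count times
    (tokens with a frequency of less than 4 are dropped)."""
    return {gram: c for gram, c in counts.items() if c >= min_count}
```

For example, prune_rare({("the", "cat"): 4, ("a", "dog"): 3}) keeps only ("the", "cat").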

Prediction Algorithm

A backoff algorithm is used for prediction: it tries higher-order n-grams first and falls back to lower-order n-grams until a reasonable probability is found. If the input has at least 2 words, the final 2 words of the input are compared to the first 2 words of each tri-gram. If no appropriate match is found, the algorithm backs off to the next lower order and compares the final word of the input to the first word of each bi-gram. If there is still no appropriate match, the highest-probability word from the uni-gram model is picked.
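
The backoff steps can be sketched as follows. This is a simplified illustration, not the app's implementation: the function name predict_next and the three count dictionaries are assumptions, and candidates are ranked by raw count, which gives the same order as ranking by conditional probability because all candidates at one level share the same prefix. The default of 5 suggestions matches the number of words the app offers.

```python
def predict_next(text, trigrams, bigrams, unigrams, k=5):
    """Suggest up to k next words, backing off tri-gram -> bi-gram -> uni-gram.

    trigrams: {(w1, w2, w3): count}, bigrams: {(w1, w2): count},
    unigrams: {w: count}  (assumed layout of the prediction library)
    """
    words = text.lower().split()

    def top(candidates):
        # Highest count first; ties broken alphabetically for a stable order.
        ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
        return [word for word, _ in ranked[:k]]

    if len(words) >= 2:                      # match the final 2 words
        prefix = tuple(words[-2:])
        found = {g[2]: c for g, c in trigrams.items() if g[:2] == prefix}
        if found:
            return top(found)
    if len(words) >= 1:                      # back off: match the final word
        found = {g[1]: c for g, c in bigrams.items() if g[0] == words[-1]}
        if found:
            return top(found)
    return top(unigrams)                     # last resort: most frequent words
```

The linear scan over each table keeps the sketch short. With tables of realistic size, indexing the counts by prefix would avoid scanning every n-gram on each request.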

How the App Works

The user keys in a word, phrase, or sentence. Upon submission, the app predicts and suggests the next word. The user can choose 1 of the 5 suggested words to complete the phrase or sentence, and the final phrase or sentence is then displayed.