West Pang
14 May 2017
The goal of this capstone project is to mimic the experience of being a data scientist by applying techniques learned across all 9 specialization courses to build a text prediction app for SwiftKey.
The datasets used to build the prediction library are provided on the Coursera course website. They are derived from a corpus called HC Corpora (www.corpora.heliohost.org) and include three corpus files (blogs, news, and twitter) for each of four locales (English, German, Russian, and Finnish). This app uses only the US English files.
The text prediction app receives a word, phrase, or sentence from the user and predicts the most likely next word.
The three datasets (blogs, news, and twitter) are combined into a single corpus, which is then cleaned by removing irrelevant content: URL links, punctuation, profanity, numbers, non-ASCII characters, and unnecessary whitespace.
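The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's actual code: the function name `clean_text`, the two-word `PROFANITY` set, and the choice to replace punctuation with a space are all assumptions made for the example.

```python
import re

# Hypothetical placeholder list; the source does not say which profanity list was used.
PROFANITY = {"darn", "heck"}

def clean_text(text, profanity=PROFANITY):
    """Apply the cleaning steps named in the text to one string."""
    # 1. Drop URL links (anything starting with http://, https:// or www.).
    text = re.sub(r"(?:https?://|www\.)\S+", " ", text)
    # 2. Drop non-ASCII characters.
    text = text.encode("ascii", "ignore").decode("ascii")
    # 3. Drop numbers.
    text = re.sub(r"\d+", " ", text)
    # 4. Replace punctuation (anything that is not a letter or whitespace) with a space.
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # 5. Drop profanity (case-insensitive match against the word list).
    words = [w for w in text.split() if w.lower() not in profanity]
    # 6. Re-joining on single spaces removes unnecessary whitespace.
    return " ".join(words)

print(clean_text("Visit http://example.com NOW, it costs 20 darn dollars!"))
```

Replacing punctuation with a space splits contractions such as "don't" into two tokens; deleting the apostrophe instead is an equally reasonable choice.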
The cleaned data is still difficult to tokenize on an ordinary computer because of its extremely large file size. A divide-and-conquer strategy is therefore used: the data is split into 10 chunks, 2-gram and 3-gram tokenization is performed on each chunk separately, and the per-chunk results are aggregated to form the 2-gram and 3-gram prediction models.
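A minimal sketch of this divide-and-conquer counting, assuming the cleaned corpus is available as a list of text lines. The helper names (`ngrams`, `split_into_chunks`, `count_ngrams`, `count_ngrams_chunked`) are hypothetical, and for brevity the per-chunk counts are merged in memory, whereas a real run on the full corpus would more likely write each chunk's counts to disk before aggregating.

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield consecutive n-word tuples from a list of tokens."""
    return zip(*(tokens[i:] for i in range(n)))

def split_into_chunks(lines, n_chunks=10):
    """Split the lines into at most n_chunks roughly equal groups."""
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def count_ngrams(lines, n):
    """Count n-grams within each line (n-grams do not span lines)."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(line.split(), n))
    return counts

def count_ngrams_chunked(lines, n, n_chunks=10):
    """Tokenize each chunk separately, then aggregate the chunk counts."""
    total = Counter()
    for chunk in split_into_chunks(lines, n_chunks):
        total.update(count_ngrams(chunk, n))  # adds counts key by key
    return total

lines = ["a b c", "a b d", "b c"]
bigrams = count_ngrams_chunked(lines, 2, n_chunks=2)
trigrams = count_ngrams_chunked(lines, 3, n_chunks=2)
```

Because counts are additive, the aggregated result is the same as counting over the whole corpus at once, provided each line stays within a single chunk.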
2-gram and 3-gram tokens with a frequency of less than 4 are discarded, since they contribute little to prediction probability and accuracy. To further reduce the size of the 2-gram and 3-gram token files, the remaining tokens are filtered again using the 75% quantile as the selection threshold. The result forms the prediction library used by the prediction algorithm.
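The pruning step might look like the sketch below. Two points are assumptions rather than facts from the project: the 75% quantile rule is read here as "keep tokens whose frequency is at or above the 75th percentile of the surviving frequencies", and `predict_next` is a hypothetical lookup added only to show how such a library could be queried. The function names and the nearest-rank quantile method are likewise illustrative.

```python
import math

def quantile_nearest_rank(values, q):
    """Nearest-rank quantile of a non-empty collection of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def prune(counts, min_count=4, q=0.75):
    """Drop n-grams seen fewer than min_count times, then keep those at or
    above the q-quantile of the remaining frequencies (assumed reading)."""
    kept = {gram: c for gram, c in counts.items() if c >= min_count}
    if not kept:
        return {}
    threshold = quantile_nearest_rank(kept.values(), q)
    return {gram: c for gram, c in kept.items() if c >= threshold}

def predict_next(library, context):
    """Hypothetical lookup: most frequent last word among n-grams whose
    leading words equal `context`; None if nothing matches."""
    context = tuple(context)
    candidates = {g[-1]: c for g, c in library.items() if g[:-1] == context}
    return max(candidates, key=candidates.get) if candidates else None

counts = {("a", "b"): 10, ("a", "c"): 3, ("b", "c"): 4,
          ("c", "d"): 6, ("d", "e"): 8}
library = prune(counts)
```

With these toy counts, the frequency filter removes `("a", "c")`, the quantile threshold works out to 8, and the library keeps `("a", "b")` and `("d", "e")`.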