Data Science Specialization SwiftKey Capstone (Final Report)

Cheong Kok Hoe
22 Aug 2015

Background & Building the Model

  • The capstone project focuses on natural language processing (NLP) and text mining. The product built is a next-word predictor, similar to SwiftKey.
  • Training data came from the data source. Only the US English news, blogs and Twitter files (about 760 MB in total) were used to build the model. Because of their size, the files were partitioned into manageable chunks before text mining was performed.
  • Training data were cleaned: text was converted to lower case, and punctuation, stopwords, profanity, numbers and extra whitespace were removed. The quanteda and tm packages were used to extract word features and relationships.
  • N-gram models (1-gram, 2-gram and 3-gram) were built. To stay within the shinyapps.io memory quota and keep application response time low, tokenized terms with a frequency of 1 were removed.
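
The cleaning and n-gram counting steps above can be sketched as follows. This is a minimal Python illustration, not the author's R code (which used quanteda and tm): the function names, the tiny stopword and profanity sets, and the sample sentences are all assumptions made for the example. Note that the simple regular expression also splits contractions such as "don't", which a real pipeline may handle differently.

```python
import re
from collections import Counter

# Hypothetical, tiny stand-ins for the real stopword and profanity lists.
STOPWORDS = {"the", "a", "an", "of", "to", "is"}
PROFANITY = {"darn"}


def clean(text):
    """Lower-case the text, drop punctuation and numbers, then drop
    stopwords and profanity. Returns a list of tokens."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # removes punctuation and digits
    tokens = text.split()                  # split() also collapses extra whitespace
    return [t for t in tokens if t not in STOPWORDS and t not in PROFANITY]


def count_ngrams(docs, n):
    """Count every n-gram (as a tuple of tokens) across the cleaned documents."""
    counts = Counter()
    for doc in docs:
        tokens = clean(doc)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def prune(counts, min_count=2):
    """Drop n-grams seen fewer than min_count times (here: frequency of 1)."""
    return {ngram: c for ngram, c in counts.items() if c >= min_count}


# Made-up sample documents, for illustration only.
docs = [
    "The cat sat on the mat.",
    "The cat sat on the sofa, 2 times!",
    "A dog sat on the mat.",
]
bigrams = prune(count_ngrams(docs, 2))
print(bigrams)
```

Pruning singletons trades a little coverage for a much smaller table, since most distinct n-grams in a large corpus occur only once.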

Using Model to Predict Next Word

  • N-gram models with Katz's back-off model were used.
  • The input phrase is cleaned in the same way as the data used to build the model.

The model is executed in the following sequence:

  1. The last 2 words of the input phrase are checked against the 3-gram model. If there is no match, go to step 3.
  2. If there is at least 1 match, return the last word of each of the most frequent matches (up to 5).
  3. The last word of the input phrase is checked against the 2-gram model. If there is no match, go to step 5.
  4. If there is at least 1 match, return the last word of each of the most frequent matches (up to 5).
  5. Return the 5 most frequent 1-grams. This ensures that a suggestion is always returned.
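
The five steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the application's actual R code: the n-gram tables hold made-up counts, and the names `top_continuations` and `predict_next` are hypothetical. The sketch ranks candidates by raw frequency, exactly as the steps describe, and does not apply the discounting and back-off weights of a full Katz back-off model.

```python
from collections import Counter

# Hypothetical toy models: each key is an n-gram tuple, each value its count.
TRIGRAMS = Counter({("cat", "sat", "on"): 3, ("cat", "sat", "down"): 2})
BIGRAMS = Counter({("sat", "on"): 5, ("sat", "down"): 2, ("dog", "barked"): 4})
UNIGRAMS = Counter({("on",): 9, ("sat",): 7, ("cat",): 6,
                    ("dog",): 4, ("down",): 3, ("barked",): 2})


def top_continuations(model, prefix, k=5):
    """Return the last word of up to k most frequent n-grams starting with prefix."""
    matches = [(ngram, c) for ngram, c in model.items() if ngram[:-1] == prefix]
    # Sort by descending count; break ties alphabetically for a stable result.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [ngram[-1] for ngram, _ in matches[:k]]


def predict_next(tokens, k=5):
    """Back off from the 3-gram to the 2-gram to the 1-gram model.
    tokens is the already-cleaned input phrase as a list of words."""
    if len(tokens) >= 2:                       # steps 1 and 2
        words = top_continuations(TRIGRAMS, tuple(tokens[-2:]), k)
        if words:
            return words
    if len(tokens) >= 1:                       # steps 3 and 4
        words = top_continuations(BIGRAMS, tuple(tokens[-1:]), k)
        if words:
            return words
    return top_continuations(UNIGRAMS, (), k)  # step 5: always returns something


print(predict_next(["cat", "sat"]))   # matched in the 3-gram model
print(predict_next(["big", "dog"]))   # backs off to the 2-gram model
print(predict_next(["purple"]))       # backs off to the 1-gram model
```

Scanning every n-gram on each lookup is fine for a toy table; a real application would more likely index the tables by prefix so that each lookup is a single dictionary access.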

Instructions to use the Application (1/2)

Click PredictNextWord to launch the application.

  1. Only English is supported. The first time the application is run, it takes about 30 seconds to load the model into memory. Look for the message highlighted in 'Red'.
  2. Enter a phrase in the text box, highlighted in 'Green'.

Instructions to use the Application (2/2)

  1. The application cleans the input phrase in the same way as the data used to train the model: it converts the text to lower case and removes punctuation, stopwords, profanity, numbers and extra whitespace.
  2. Click the Go! button.
  3. Up to 5 predicted words will be displayed.