capstone-pitching-swiftkey

Pham Ngoc Hieu - Kuriboh Kuet
29-10-2020

Introduction

This is the capstone project for the Data Science Specialization on Coursera.
The project was sponsored by SwiftKey.
It focuses on predicting the next word the user is likely to type and suggesting it so that they can type faster.

Supported features

UI

  • User types anything in the Text Area.
  • The suggested words will appear on the left panel.
  • User can either keep typing or click on the suggested words.
  • Anything the user types will be added to the data set to “improve” the prediction.

Cleaning the data set

The data set was provided by SwiftKey.

  • Subset a small proportion of the data set
  • Clean the corpus: remove punctuation, convert characters to lowercase, and remove numbers
  • Remove curse words using a third-party word bank
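library(magrittr)  # provides the %>% pipe
# Clean a tm corpus: whitespace, punctuation, numbers, case, then curse words.
# replacePunctuation and bad_words_bank are assumed to be defined elsewhere.
clean_corpus <- function(corpus) {
    corpus %>%
        tm::tm_map(tm::stripWhitespace) %>%
        tm::tm_map(replacePunctuation) %>%
        tm::tm_map(tm::removeNumbers) %>%
        tm::tm_map(tm::content_transformer(tolower)) %>%
        # The corpus arrives through the pipe, so it is not passed again here.
        tm::tm_map(tm::removeWords, bad_words_bank)
}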
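
The subsetting step listed above is not shown in the slides. A minimal sketch of how it might look follows; the helper name `sample_lines`, the default proportion, and the seed are assumptions made for illustration, not the values used in the project.

```r
# Hypothetical helper: keep a random fraction of the input lines.
sample_lines <- function(lines, proportion = 0.01, seed = 1234) {
    set.seed(seed)  # fixed seed so the same subset is drawn on every run
    n_keep <- max(1, round(length(lines) * proportion))
    lines[sample.int(length(lines), n_keep)]
}

# Made-up input standing in for the raw text lines.
example_lines <- paste("line", 1:1000)
subset_lines <- sample_lines(example_lines, proportion = 0.05)
length(subset_lines)
```

Sampling before cleaning keeps the corpus small enough to process in memory, at the cost of seeing fewer rare word pairs.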

Algorithm

Build a transition matrix for the Markov Chain.

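# Normalise each row of the distribution matrix so that it sums to 1.
build_transition_matrix <- function(distribution_matrix) {
    t(apply(distribution_matrix, 1, function(current_row) {
        row_sum <- sum(current_row)
        # Leave all-zero rows untouched to avoid dividing by zero (NaN).
        if (row_sum == 0) current_row else current_row / row_sum
    }))
}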
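
To make the row normalisation concrete, the sketch below builds a small distribution matrix of word-pair counts from a made-up token sequence and normalises it. The tokens and variable names are assumptions for illustration; the normalisation is written inline with `rowSums` so the example runs on its own.

```r
# Toy token sequence standing in for the cleaned corpus (made up).
tokens <- c("i", "like", "tea", "i", "like", "coffee", "i", "drink", "tea")

# Count how often each word (column) follows each word (row).
vocab <- sort(unique(tokens))
current_word <- factor(head(tokens, -1), levels = vocab)
next_word <- factor(tail(tokens, -1), levels = vocab)
distribution_matrix <- unclass(table(current_word, next_word))

# Divide each row by its sum; rows without successors are left at 0.
row_sums <- rowSums(distribution_matrix)
transition_matrix <- distribution_matrix / ifelse(row_sums == 0, 1, row_sums)

round(transition_matrix["i", ], 2)
```

In this toy corpus "i" is followed by "like" twice and by "drink" once, so its row holds 2/3 for "like" and 1/3 for "drink".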

The next word the user might type is predicted from the transition matrix.
The transition matrix is smoothed using Katz's back-off model.
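
The prediction step can be sketched as a lookup of the current word's row, with a fallback when that word has no observed successors. This is a deliberately simplified back-off, not the project's implementation: a full Katz back-off also discounts the observed counts and redistributes the reserved probability mass to the lower-order model, which is omitted here. The helper name `predict_next` and all probabilities are made up for illustration.

```r
# Toy bigram transition matrix (rows: current word, columns: next word).
vocab <- c("coffee", "i", "like", "tea")
transition_matrix <- matrix(0, nrow = 4, ncol = 4, dimnames = list(vocab, vocab))
transition_matrix["i", "like"] <- 1
transition_matrix["like", c("coffee", "tea")] <- c(0.25, 0.75)

# Toy unigram probabilities used as the fallback distribution.
unigram_prob <- c(coffee = 0.1, i = 0.4, like = 0.3, tea = 0.2)

# Hypothetical helper: return up to n most probable next words.
predict_next <- function(word, transition_matrix, unigram_prob, n = 3) {
    known <- word %in% rownames(transition_matrix) &&
        sum(transition_matrix[word, ]) > 0
    # Back off to unigram probabilities when no successors were observed.
    probs <- if (known) transition_matrix[word, ] else unigram_prob
    probs <- sort(probs[probs > 0], decreasing = TRUE)
    names(head(probs, n))
}

predict_next("like", transition_matrix, unigram_prob)    # uses the bigram row
predict_next("unseen", transition_matrix, unigram_prob)  # falls back to unigrams
```

The fallback means the suggestion panel is never empty, even for words that never appeared in the training subset.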