Natural Language Processing (NLP) Swiftkey Capstone

Lim Kah Kheng
6th Jan 2016

Introduction

The purpose is to analyse three corpora (news, blogs and Twitter data) and use them to build an algorithm that predicts the next word in a sentence.

This covers cleaning and analysing the data, taking a sample of it and building a predictive model.

The Shiny application is hosted at https://jkklim.shinyapps.io/swiftkey/

Process

  1. Determine the size of the corpus and select 1% of its data to speed up loading of data into Shiny.

  2. Clean the data and extract all unigrams, bigrams and trigrams.

  3. Create a model for each of unigrams, bigrams and trigrams, with each model sorted by number of occurrences.

  4. Take the input and compare it with the different models. Return the first three matches.
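
Steps 1 to 3 above can be sketched roughly as follows. This is a minimal illustration in Python, not the code behind the Shiny application: the function names (sample_lines, clean, ngrams, build_models), the cleaning rule (lowercase, keep only letters and apostrophes) and the toy input lines are all assumptions made for the example.

```python
import random
import re
from collections import Counter

def sample_lines(lines, fraction=0.01, seed=42):
    """Keep roughly `fraction` of the lines, chosen at random (step 1)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

def clean(text):
    """Lowercase the text and keep only letters and apostrophes (assumed rule)."""
    text = re.sub(r"[^a-z' ]+", " ", text.lower())
    return text.split()

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (step 2)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_models(lines, max_n=3):
    """Count n-grams for n = 1..max_n; each Counter can be sorted by count (step 3)."""
    models = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = clean(line)
        for n in models:
            models[n].update(ngrams(tokens, n))
    return models

# Toy input standing in for the sampled corpus lines.
lines = ["I love you so much!", "I love you, too.", "I love coffee"]
models = build_models(lines)
top_bigrams = models[2].most_common(2)  # bigrams sorted by occurrence
```

Calling most_common() on each Counter gives the n-grams in descending order of occurrence, which is the sorted form that the lookup in step 4 relies on.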

Algorithm

Katz's back-off model (https://en.wikipedia.org/wiki/Katz's_back-off_model) is used.

The sentence is split into an array of words, which is compared with the different models. If there is a match, the most frequent continuation is returned, and the algorithm tries to return the first three matches.
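
The lookup described above might look something like the sketch below. It is a count-based simplification: it backs off from trigrams to bigrams to unigrams and ranks candidates by raw occurrence, whereas the full Katz model also discounts counts and redistributes probability mass, which is omitted here. The predict function, its parameter k and the toy models dictionary are hypothetical and exist only for illustration.

```python
from collections import Counter

def predict(sentence, models, k=3):
    """Suggest up to k next words, backing off from the longest n-gram model."""
    words = sentence.lower().split()
    suggestions = []
    for n in range(max(models), 0, -1):
        # The context is the last n-1 words; unigrams need no context.
        context = tuple(words[-(n - 1):]) if n > 1 else ()
        if len(context) < n - 1:
            continue  # sentence too short for this model, back off
        candidates = Counter({
            gram[-1]: count
            for gram, count in models[n].items()
            if gram[:-1] == context
        })
        # Take matches in descending order of occurrence, skipping repeats.
        for word, _ in candidates.most_common():
            if word not in suggestions:
                suggestions.append(word)
            if len(suggestions) == k:
                return suggestions
    return suggestions

# Toy n-gram counts (made up for the example), keyed by n-gram order.
models = {
    1: Counter({("the",): 5, ("you",): 3, ("coffee",): 2}),
    2: Counter({("love", "you"): 4, ("love", "coffee"): 2}),
    3: Counter({("i", "love", "you"): 3, ("i", "love", "coffee"): 1}),
}

result = predict("I love", models)
```

With these toy counts, the trigram model supplies "you" and "coffee", and the third suggestion comes from backing off to the unigram model.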