Capstone report

1 Background of SwiftKey project:

SwiftKey is a smart keyboard that makes it easier for people to type on their mobile devices. This project uses natural language processing (NLP) to build a next-word prediction model in R. For example, when you type a sentence or a phrase, the keyboard presents some options for what the next word might be.

In this project, we use English-language text data for prediction. Three data files are used: twitts (Twitter posts), blogs and news. This report gives a general idea of how the app works.

2 Concept of n-gram:

An n-gram is a contiguous sequence of n items from a given sequence of text or speech. Depending on the application, the items can be phonemes, syllables, letters, words or base pairs. N-grams are typically collected from a text or speech corpus. An n-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram”; size 3 is a “trigram”. In this project we use 2-grams (bigrams), 3-grams (trigrams), and 4-grams (quadgrams) for prediction.

Procedure of NLP in R:

The next step is to tokenize the sentences. Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. For example, tokenizing “how are you” into 2-grams returns two elements: “how are” and “are you”.
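The tokenization example above can be sketched in a few lines. The report's model was built in R; the snippet below is written in Python purely for illustration, and the helper names `tokenize` and `ngrams` are hypothetical, not taken from the project code.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every contiguous run of n tokens, joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("how are you")
bigrams = ngrams(tokens, 2)
print(bigrams)  # ['how are', 'are you']
```

A phrase with fewer than n words simply yields an empty list, so the same helper can be reused for 3-grams and 4-grams.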

3 Basic summary of data:

Let’s see how many lines and words there are in each data file: twitts, blogs and news.

##     file   lines    words
## 1 twitts 2360148 30433240
## 2   news 1010242 35710862
## 3  blogs  899288 38222278
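Counts like those in the table can be obtained by reading each file line by line. The Python sketch below is only an illustration of the idea, not the R code that produced the table. It assumes a word is any whitespace-separated token, which may differ from the definition used for the figures above, and it runs on an in-memory sample because the file paths are not given in this report.

```python
import io

def count_lines_and_words(lines):
    """Count lines and whitespace-separated words in an iterable of text lines."""
    n_lines = 0
    n_words = 0
    for line in lines:
        n_lines += 1
        n_words += len(line.split())
    return n_lines, n_words

# Stand-in for a real data file. With the actual data one would pass an
# open text file handle instead of this in-memory sample.
sample = io.StringIO("how are you\nsee you later today\n")
n_lines, n_words = count_lines_and_words(sample)
print(n_lines, n_words)  # 2 7
```

Reading line by line keeps memory use low, which matters for files with millions of lines.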

4 Visualization of data:

Unigram word cloud:

Visualization for twitts, news and blogs (from left to right):

Combining news, twitts and blogs into one data set gives the following:

Unigram corpus visualization:

2-gram corpus visualization:

3-gram corpus visualization:

4-gram corpus visualization:

5 Future plan (Shiny app):

About the Shiny app:

My app is designed for use on a mobile phone. It has a text input part and an output part that shows the predicted next word.

The points below give me a clear structure for building this app:

The interface of my Shiny app consists of two pages: an input and output page, and an introduction page.
The first page has two parts: an input part (an n-gram method selector and a sentence text input) and an output part (the single predicted next word, plus the other possible words).

Instructions on how to use it:

First, select an n-gram method (2-gram, 3-gram or 4-gram), then type a sentence in the text input part.
Second, click the “update” button and check the output to see the possible next words.
The input sentence is cleaned (converted to lower case, with extra whitespace, punctuation, stopwords, etc. removed). The model I created then uses the cleaned sentence to predict the next word.
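The cleaning step described above could look roughly like the following. This is a Python sketch for illustration only (the app itself is planned in R with Shiny). The function name `clean_input` and the tiny stopword list are assumptions, since the report does not say which stopword list is used.

```python
import re

# Tiny illustrative stopword list; the list used in the project is not given.
STOPWORDS = {"a", "an", "the", "of", "to"}

def clean_input(sentence, stopwords=STOPWORDS):
    """Lowercase, strip non-letters, collapse whitespace, drop stopwords."""
    text = sentence.lower()
    # Replace anything that is not a letter a-z or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() with no arguments also collapses repeated whitespace.
    words = text.split()
    return [w for w in words if w not in stopwords]

cleaned = clean_input("  I went to  the Store, today! ")
print(cleaned)  # ['i', 'went', 'store', 'today']
```

Whatever cleaning is applied to the input should match the cleaning applied to the training text, otherwise the cleaned words may not be found in the n-gram tables.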

About my n-gram model:

The 4-gram method takes the last 3 words of the input sentence or phrase, searches the 4-gram table for entries that begin with those 3 words, and picks the one with the highest frequency. The last word of that 4-gram is the prediction. In the same way, the 3-gram method takes the last 2 words and the 2-gram method takes the last word.
Back-off method: if the 4-gram method returns no match, the model backs off to the next lower-order method (the 3-gram method), and so on, until it finds possible output words.
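The lookup and back-off logic described above can be sketched as follows. This is a simplified Python illustration, not the project's R implementation. The names `build_tables` and `predict_next` and the toy corpus are hypothetical, and the sketch uses plain frequency back-off with no discounting or smoothing.

```python
from collections import Counter

def build_tables(tokens, max_n=4):
    """Count every n-gram (as a tuple of words) for n = 2..max_n."""
    return {
        n: Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        for n in range(2, max_n + 1)
    }

def predict_next(words, tables, max_n=4):
    """Try the highest-order table first, then back off to lower orders."""
    for n in range(max_n, 1, -1):
        if len(words) < n - 1:
            continue  # not enough context words for this order
        context = tuple(words[-(n - 1):])
        # Collect the final word of every n-gram that starts with the context.
        candidates = Counter({
            gram[-1]: count
            for gram, count in tables[n].items()
            if gram[:-1] == context
        })
        if candidates:
            # most_common returns the highest-frequency words first.
            return [w for w, _ in candidates.most_common(3)]
    return []  # no table contained a matching n-gram

# Toy corpus standing in for the real n-gram tables.
corpus = "how are you today how are you doing how are things".split()
tables = build_tables(corpus)
print(predict_next(["how", "are"], tables))        # ['you', 'things']
print(predict_next(["so", "how", "are"], tables))  # ['you', 'things']
```

In the second call no 4-gram starts with “so how are”, so the lookup backs off to the 3-gram table and matches on “how are”. Returning the top few candidates instead of just one fits the planned output of a single next word plus other possible words.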

6 References:

http://en.wikipedia.org/wiki/N-gram
https://class.coursera.org/nlp
The data were downloaded from a corpus called HC Corpora: http://www.corpora.heliohost.org
See the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details.