April 14, 2018

Introduction

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:

            I went to the

the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. In this capstone you will work on understanding and building predictive text models like those used by SwiftKey.

What I've Done

I created a Shiny app, hosted at ShinyApp Link

It is a word predictor built on NGramTokenizer from the RWeka package.

The dataset, provided by JHU and SwiftKey, is enormous. So, to build a data product that stays responsive even on mobile devices, the data had to be sampled down (under-sampled) to a significantly smaller size.
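A minimal sketch of that sampling step, assuming the three standard English corpus files and an arbitrary 1% sampling rate (the file names and the rate are illustrative, not the exact values used in the app):

    # Hypothetical under-sampling: keep roughly 1% of the lines from each file.
    set.seed(1234)

    sample_file <- function(path, rate = 0.01) {
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      lines[rbinom(length(lines), size = 1, prob = rate) == 1]
    }

    sampled <- c(sample_file("en_US.blogs.txt"),
                 sample_file("en_US.news.txt"),
                 sample_file("en_US.twitter.txt"))
    writeLines(sampled, "sample.txt")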

Moreover, building the bi-gram, 3-gram, 4-gram (and so on) tables from such a large dataset requires considerable computing power. These processes consume a lot of system memory, so I came up with a little trick to make the task feasible on a less powerful machine.

Explanation

The trick I implemented was as follows (a sketch of the workflow appears after the list):

  • Create the "Corpus" separately, then dump it into an RData file.
  • Restart the R session. This way all the memory previously used up is cleared.
  • Load the RData file and, separately, do the uni-gram, bi-gram (and so on) tokenization.
  • Again, dump these results as simple data frames of words and their frequencies of occurrence.
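A minimal sketch of that workflow, assuming the tm package for the Corpus and the sampled file produced earlier (the file names and cleaning steps are illustrative, not necessarily the exact ones used):

    # Session 1: build the corpus and dump it into an RData file.
    library(tm)

    corpus <- VCorpus(VectorSource(readLines("sample.txt", encoding = "UTF-8")))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, stripWhitespace)
    save(corpus, file = "corpus.RData")

    # -- Restart the R session here, so the memory used so far is cleared. --

    # Session 2: load the corpus and tokenize one n-gram order at a time.
    library(tm)
    library(RWeka)

    load("corpus.RData")
    bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    tdm  <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))
    freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

    # Dump a plain data frame of n-grams and their frequencies of occurrence.
    bigrams <- data.frame(ngram = names(freq), freq = freq, row.names = NULL)
    save(bigrams, file = "bigrams.RData")

Repeating the second session for each n-gram order keeps the peak memory usage to a single tokenization at a time.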

Screenshot

This is what the app looks like!

Guide

How to use the app:

  • Enter the text in the input box.
  • The predictions appear in the main panel.
  • At least 6 word predictions will be shown.
  • The word predictor uses the number of words in the input to select the tokenizer.
  • For example: 2 words -> bi-gram; 3 words -> 3-gram.
  • If the input has more than 3 words:
  • First, the 4-gram tokenizer runs. If it generates 6 predictions, those are printed.
  • Otherwise, it falls back to the 3-gram, which may in turn call the 2-gram if it still does not generate enough predictions. (A sketch of this back-off logic follows the list.)
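The back-off described above might look roughly like this, assuming the frequency data frames (bigrams, trigrams, fourgrams) dumped by the tokenization step, each with ngram and freq columns (the function and column names are assumptions for illustration):

    # Hypothetical back-off predictor; table and column names are assumed.
    predict_next <- function(input, n_pred = 6) {
      words  <- tolower(strsplit(trimws(input), "\\s+")[[1]])
      tables <- list("2" = bigrams, "3" = trigrams, "4" = fourgrams)
      # 2 input words -> bi-gram, 3 -> 3-gram, more than 3 -> 4-gram.
      start_n <- min(max(length(words), 2), 4)
      preds   <- character(0)

      for (n in seq(start_n, 2)) {
        context <- paste(tail(words, n - 1), collapse = " ")
        tbl     <- tables[[as.character(n)]]
        hits    <- tbl[startsWith(tbl$ngram, paste0(context, " ")), ]
        # Keep the final word of each matching n-gram, most frequent first.
        found   <- sub(".* ", "", hits$ngram[order(-hits$freq)])
        preds   <- unique(c(preds, found))
        if (length(preds) >= n_pred) break  # enough predictions; stop backing off
      }
      head(preds, n_pred)
    }

For example, predict_next("I went to the") would start with the 4-gram table and back off through the 3-gram and 2-gram tables until it has collected 6 candidate words.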

Important Links