Capstone Project for Data Science Course

Alok

24 Jan 2016

Shiny App (takes a phrase as input and outputs a prediction of the next word)

Introduction and Problem Statement

SwiftKey is an input method for Android and iOS devices such as smartphones and tablets. It uses a blend of artificial intelligence technologies to predict the next word the user intends to type, which saves time when typing on those devices.

Build a predictive data model and Shiny app based on the given data set

Download the data, clean the data set, subset the data, create the corpus, run exploratory analysis, and identify the algorithm
Build the n-gram model (1-gram, 2-gram, 3-gram)
Build a model to handle unseen n-grams
Build a data model to predict the next word based on the previous 1, 2, or 3 words
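
The n-gram counting step above can be sketched as follows. This is a minimal illustration in Python, not the app's R code (which relies on tm and RWeka); the function names and the three-line toy corpus are made up for the example, and tokenization is a plain whitespace split.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(lines, max_n=3):
    """Count 1-grams up to max_n-grams over an iterable of text lines."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = line.lower().split()  # naive whitespace tokenizer
        for n in range(1, max_n + 1):
            counts[n].update(ngrams(tokens, n))
    return counts

# Hypothetical toy corpus, standing in for the Twitter, blog and news text.
toy_corpus = ["thanks for the follow", "thanks for the help", "for the win"]
counts = count_ngrams(toy_corpus)
```

Each entry of counts is one frequency table (unigram, bigram, trigram), which is the shape of data the prediction step looks up.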

Algorithm

The app uses the Stupid Backoff algorithm, which is mainly used in web-scale applications. Step 1: count n-grams offline. Step 2: compute pseudo-probabilities at run time.
Generate the merged table of n-grams and their counts from the three sources (Twitter, blogs, news), with the 5th most frequent words sorted by frequency and stored in R's "data.table"
The prediction algorithm tries the trigram first; if no match exists it backs off to the bigram, and if that also fails it falls back to the most frequent single words in the corpus. Three predicted words are always provided.
Input text in the app is processed as follows: remove English stop words, remove punctuation and numbers, split and unlist the phrase, convert to lowercase, keep only alphabetic characters, build the unigram, bigram, and trigram, and look up the data frame for the next 1, 2, and 3 predicted words.
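
A minimal sketch of the clean-then-back-off lookup described above, written in Python for illustration only (the app itself is in R). Several things here are assumptions: the back-off factor 0.4 is the value suggested in the original Stupid Backoff paper and may not be what the app uses, the layout of tables (context tuple mapped to follower counts) is hypothetical, the counts are toy values, and stop-word removal is left out for brevity.

```python
import re
from collections import Counter

ALPHA = 0.4  # assumed back-off factor; the app's actual value is not stated

def clean(phrase):
    """Lowercase the phrase, keep alphabetic characters only, split into tokens."""
    return re.sub(r"[^a-z\s]", " ", phrase.lower()).split()

def predict(phrase, tables, k=3):
    """Rank next-word candidates, longest context first.

    tables maps a context tuple to a Counter of following words;
    the empty tuple () holds the unigram counts.
    """
    tokens = clean(phrase)
    scores = {}
    weight = 1.0
    for n in range(min(2, len(tokens)), -1, -1):  # 2-word, 1-word, empty context
        context = tuple(tokens[len(tokens) - n:])
        followers = tables.get(context, Counter())
        total = sum(followers.values())
        for word, count in followers.items():
            # A word keeps the score from the longest context in which it was seen.
            scores.setdefault(word, weight * count / total)
        weight *= ALPHA  # discount every step down to a shorter context
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]

# Hypothetical toy tables.
tables = {
    ("thanks", "for"): Counter({"the": 2}),
    ("for",): Counter({"the": 3, "a": 1}),
    (): Counter({"the": 5, "for": 4, "a": 2, "thanks": 2}),
}
print(predict("Thanks for", tables))
```

Because the unigram table is always consulted last, the function returns k words whenever the corpus has at least k distinct words, which matches the "three predicted words are always provided" behaviour.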

Application Instruction

Text Prediction shiny App: Simple SwiftKey (load time 30s)

How to use the app: enter a phrase in English from the data set in the top-left panel. Select the number of predicted words; the default is three, and you can increase it to get more predictions from the n-gram model. Finally, press SUBMIT. The right-hand side shows your input phrase and the predicted words, or a WARNING message.
The Shiny web app is backed by an R text prediction model trained on 15% of the data (660000 lines out of 4269678 lines)
Main libraries used: tm, stringi, slam, SnowballC, RWeka, shiny
The model is stored as RDS and takes about 3.6 MB on disk
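
Drawing a 15% training subset can be sketched as below. This is an illustrative Python version, not the app's R code; the function name, the fixed seed, and the stand-in list of 1000 lines are assumptions made for the example.

```python
import random

def sample_lines(lines, fraction=0.15, seed=2016):
    """Return a reproducible random subset of lines, kept in original order."""
    rng = random.Random(seed)          # fixed seed so the subset is repeatable
    k = round(len(lines) * fraction)   # number of lines to keep
    keep = sorted(rng.sample(range(len(lines)), k))
    return [lines[i] for i in keep]

# Stand-in for the real corpus lines.
all_lines = [f"line {i}" for i in range(1000)]
train = sample_lines(all_lines)
```

Sampling by index and sorting keeps the chosen lines in their original order, and the fixed seed makes the same subset come back on every run.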

Limitation and Future work

Current data frames: Unigram (7.2 MB, 59086 records), Bigram (32.2 MB, 210103 records), Trigram (4.2 MB, 20933 records)
Build a larger corpus from 30% of the data (currently only 15%, because the work was done on a desktop PC with 4 GB of RAM); in future, use a server with 16 GB of RAM and a recent high-end CPU.
Try other text-mining predictive algorithms and write a white paper with a delta analysis, detailing the model's perplexity.
Creating the n-grams sometimes fails with the error message "BigramTokenizer <- TermDocumentMatrix(eng_all, control = ctrl2): Error in .jcall("RWekaInterfaces", "[S", "tokenize", .jcast(tokenizer, : java.lang.NullPointerException". Find the root cause of this error and fix it.