Capstone: Predict Next Word

FW.Tang
23 Aug 2015

Introduction

“Predict Next Word” is a prototype that predicts the most likely next word from the user's input.

The prediction model is built from blog, Twitter, and news text. Natural language processing techniques are used to clean and process the sources, and the model is created by tokenizing the text into n-grams. The user's input is parsed and matched against the n-gram model with backoff, in stages, to find the best match.
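The staged backoff lookup can be sketched roughly as follows. This is a minimal illustration in base R, not the code behind the prototype: the toy `corpus`, the helper names `build_ngrams` and `predict_backoff`, and the choice of trigram, then bigram, then unigram stages are all assumptions made for the example.

```r
# Toy corpus standing in for the blog, Twitter and news text (made-up data).
corpus <- c("i spent the day at home",
            "we spent the night there",
            "she spent a week at home",
            "the day was long")

# Hypothetical helper: count the n-grams in `lines`, most frequent first.
build_ngrams <- function(lines, n) {
  grams <- unlist(lapply(strsplit(tolower(lines), "\\s+"), function(w) {
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  if (length(grams) == 0) {
    return(data.frame(ngram = character(0), freq = integer(0),
                      stringsAsFactors = FALSE))
  }
  counts <- table(grams)
  df <- data.frame(ngram = names(counts), freq = as.vector(counts),
                   stringsAsFactors = FALSE)
  # Sort by descending frequency, breaking ties alphabetically.
  df[order(-df$freq, df$ngram), ]
}

tables <- list(`3` = build_ngrams(corpus, 3),
               `2` = build_ngrams(corpus, 2),
               `1` = build_ngrams(corpus, 1))

# Hypothetical helper: try the trigram table first, then back off to
# bigrams, and finally fall back to the most frequent single words.
predict_backoff <- function(input, tables, k = 3) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  for (n in 3:2) {
    ctx_len <- n - 1
    if (length(words) < ctx_len) next
    prefix <- paste0(paste(tail(words, ctx_len), collapse = " "), " ")
    tab <- tables[[as.character(n)]]
    hits <- tab[startsWith(tab$ngram, prefix), ]
    if (nrow(hits) > 0) {
      # Keep only the last word of each matching n-gram.
      return(head(sub(".* ", "", hits$ngram), k))
    }
  }
  head(tables[["1"]]$ngram, k)
}

predict_backoff("we spent", tables)        # matched at the trigram stage
predict_backoff("I really spent", tables)  # no trigram match, backs off to bigrams
```

In the second call no trigram starts with "really spent", so the lookup falls through to the bigram table and uses only the last word of the input.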

The prototype predicts your next word based on the data collected from these sources.

The “Predict Next Word” Program

How it works:

  1. After the web page loads, the user enters one or more words in the input field

  2. The system detects the change in the input field and triggers the prediction model

  3. The predicted word(s) are displayed

    - Best choice: the single best word found by the model

    - Other choices: the next-best words found
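
The three steps above could be wired together in Shiny along the following lines. This is only a sketch under stated assumptions: it requires the `shiny` package, the input and output ids (`phrase`, `best`, `others`) are invented for the example, and `predict_words` is a hypothetical stand-in that returns fixed candidates rather than calling the real model.

```r
library(shiny)

# Hypothetical stand-in for the real model: returns ranked candidate words.
predict_words <- function(text) {
  if (!nzchar(trimws(text))) return(character(0))
  c("the", "a", "enough", "some", "about")
}

ui <- fluidPage(
  textInput("phrase", "Enter one or more words:"),
  strong("Best choice:"), textOutput("best"),
  strong("Other choices:"), textOutput("others")
)

server <- function(input, output, session) {
  # Re-evaluated automatically whenever input$phrase changes.
  candidates <- reactive(predict_words(input$phrase))

  # The first candidate is shown as the best choice.
  output$best <- renderText({
    if (length(candidates()) == 0) "" else candidates()[1]
  })

  # The remaining candidates are shown as the other choices.
  output$others <- renderText(paste(candidates()[-1], collapse = ", "))
}

app <- shinyApp(ui, server)
# To launch the app in an interactive session: runApp(app)
```

Because `candidates` is a reactive expression, there is no explicit change handler: Shiny reruns the prediction whenever the text input changes, which corresponds to step 2.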
    

Illustration

Sample text input: “spent ”. The input is passed to the prediction function, which returns “spent the” as the top prediction, for example.

## Load the model (a table of bigrams saved as bigram.rds)
bigramTable <- readRDS(file = "bigram.rds")

## Select the bigrams that contain the word "spent" followed by a space
bigramChoice <- bigramTable[grep(paste0("\\b", "spent", " "), bigramTable$BigramTokenizer, ignore.case = TRUE), ]

## Keep the first five matches as the predictions
prediction <- as.character(bigramChoice[1:5, "BigramTokenizer"])

## Display the predicted words
prediction[1:5]
[1] "spent the"    "spent a"      "spent enough" "spent some"  
[5] "spent about" 
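
One detail worth noting in the illustration: indexing with `1:5` pads the result with `NA` when fewer than five bigrams match. A small wrapper can avoid that and return only the predicted words. The sketch below is an assumption-laden example, not the prototype's code: the helper name `next_words` is hypothetical, and the toy table only mimics the `BigramTokenizer` column of the real `bigram.rds` with made-up rows.

```r
# Toy table in the same shape as the one loaded from bigram.rds
# (column name taken from the illustration; rows and order are made up).
bigramTable <- data.frame(
  BigramTokenizer = c("spent the", "spent a", "well spent",
                      "misspent youth", "spent some"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return up to n words that follow `word`,
# in table order, with no NA padding when there are fewer matches.
next_words <- function(table, word, n = 5) {
  bigrams <- table$BigramTokenizer
  # Case-insensitive match on bigrams whose first word is `word`.
  hits <- bigrams[startsWith(tolower(bigrams), paste0(tolower(word), " "))]
  # Drop the first word and the space, keeping only the predicted word.
  substring(head(hits, n), nchar(word) + 2)
}

next_words(bigramTable, "spent")
next_words(bigramTable, "zzz")   # no match: returns an empty character vector
```

Matching on a literal prefix rather than a regular expression also means that input containing characters such as `.` or `(` is treated as plain text.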

About the Function

The function is available at https://fwtang.shinyapps.io/PredictNextWord/.

Note: This prediction model is only a prototype. It can be further refined to improve accuracy and performance, and even extended to translate to “emoji”.

More work to be done!

Thank you.