Data Scientist Capstone Project Presentation

Ke Xiaobing
22 Aug 2015

Introduction of the application

This application predicts the next word for a phrase entered by the user. The prediction algorithm is based on datasets downloaded from HC Corpora, which provides three text files: one from Twitter, one from blogs and one from news sites. After data processing and data modeling, the application was built and published to shinyapps.io.

When the Shiny application launches, it takes around 30 seconds to load the datasets used for prediction.

Application showcase

The interface of the application is shown here.

screenshot

The steps to predict the next word of a phrase are as follows:

Step 1: Enter a word or phrase in the text box in the left panel.
Step 2: Click the Submit button.
Step 3: Check the predicted next word on the right side.

Algorithm of word prediction

Data loading: almost all the data is loaded in batches for processing and modeling.
Data processing: data cleansing, such as removal of URLs, links, non-English words, numbers, whitespace, punctuation and profanity.
Data modeling: build bigrams, trigrams and quadgrams from the loaded datasets and save the results to files.
Build the Shiny app for word prediction: use the bigrams, trigrams and quadgrams to predict the next word of the input phrase.
Use a simplified back-off model: search the quadgram table first; on a miss, search the trigram table; on a miss there, search the bigram table.
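
The cleaning, n-gram building and back-off lookup described above can be sketched roughly as follows. This is a minimal illustration in Python, not the application's R code (which relies on "tm" and "RWeka"). The function names, the tiny PROFANITY set and the sample sentences are all hypothetical, and the cleaning rules are a simplified assumption of what the real pipeline does.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stand-in for the downloaded profanity list.
PROFANITY = {"darn"}

def clean(text):
    """Lowercase, strip URLs, digits and punctuation, drop listed words."""
    text = text.lower()
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # URLs and links
    text = re.sub(r"[^a-z\s]", " ", text)  # numbers, punctuation, non-ASCII
    return [w for w in text.split() if w not in PROFANITY]

def build_tables(lines, max_n=4):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for line in lines:
        words = clean(line)
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                tables[n][context][words[i + n - 1]] += 1
    return tables

def predict(phrase, tables, max_n=4):
    """Try the quadgram table first, then back off to trigrams and bigrams."""
    words = clean(phrase)
    for n in range(max_n, 1, -1):
        if len(words) < n - 1:
            continue  # phrase too short for this n-gram order
        context = tuple(words[-(n - 1):])
        followers = tables[n].get(context)
        if followers:
            return followers.most_common(1)[0][0]
    return None  # no match in any table

# Made-up sample corpus, for illustration only.
corpus = [
    "I want to go home",
    "I want to go out",
    "I want to go home now",
]
tables = build_tables(corpus)
print(predict("I want to go", tables))  # → home (quadgram hit)
print(predict("please go", tables))     # → home (backs off to the bigram table)
```

This sketch returns only the most frequent follower and applies no discounting or smoothing when it backs off, which matches the "simplified" back-off described above rather than a full Katz back-off model.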

Additional information

The word prediction application is hosted on shinyapps.io: https://kexiaobing.shinyapps.io/ShinyApp-Capstone2
The profanity words are downloaded from website: http://www.bannedwordlist.com/lists/swearWords.csv
The R package used for text mining is “tm”, and the R package used for n-gram generation is “RWeka”.
The prediction accuracy and the application's performance in Shiny still need to be improved in the future.