Data Scientist Capstone Project Presentation

Ke Xiaobing
12 Aug 2015

Introduction to the application

This application predicts the next word for a phrase entered by the user. The prediction algorithm is based on datasets downloaded from HC Corpora, which consist of three text files: one from Twitter, one from blogs and one from news sites. After data processing and data modeling, the application was built and published to shinyapps.io.

Application showcase

The interface of the application is shown below.

screenshot

The steps to predict the next word of a phrase are as follows:

Step 1: Enter your words or phrase into the text box in the left panel.
Step 2: Click the Submit button.
Step 3: Check the predicted next word on the right side.

Algorithm of word prediction

Data loading: because the given datasets are very large, only part of each dataset is loaded for processing and data modeling.
Data processing: data cleansing, such as removal of URLs, links, non-English words, numbers, extra whitespace, punctuation and profanity.
N-gram building: build bigrams, trigrams and quadgrams from the loaded datasets and save the results to files.
Shiny app: build the Shiny app for word prediction, using the bigrams, trigrams and quadgrams to predict the next word of the input phrase.
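
The steps above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the application's R code (which uses tm and RWeka). The tiny corpus, the one-word profanity list, and the function names clean, build_tables and predict are all hypothetical. The lookup order (quadgram first, then trigram, then bigram) is an assumption; the slides only state that the three n-gram sets are used for prediction.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stand-ins: a tiny corpus and profanity list, not the HC Corpora data.
CORPUS = [
    "Thanks for the follow! See you at http://example.com",
    "thanks for the great day",
    "Thanks for the follow back in 2015",
    "I hope you have a great day",
]
PROFANITY = {"darn"}


def clean(line):
    """Lowercase, drop URLs, numbers and punctuation, then remove profanity."""
    line = line.lower()
    line = re.sub(r"https?://\S+|www\.\S+", " ", line)  # URLs and links
    line = re.sub(r"[^a-z\s]", " ", line)  # anything that is not a-z or whitespace
    # split() also collapses repeated whitespace
    return [w for w in line.split() if w not in PROFANITY]


def build_tables(lines, max_n=4):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for line in lines:
        words = clean(line)
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                tables[n][context][words[i + n - 1]] += 1
    return tables


def predict(phrase, tables):
    """Assumed lookup order: quadgram table first, then trigram, then bigram."""
    words = clean(phrase)
    for n in sorted(tables, reverse=True):
        context = tuple(words[-(n - 1):])
        if len(context) == n - 1 and context in tables[n]:
            return tables[n][context].most_common(1)[0][0]
    return None  # no matching context in any table


TABLES = build_tables(CORPUS)
print(predict("Thanks for the", TABLES))  # quadgram match: follow
print(predict("what a great", TABLES))    # falls back to the trigram table: day
```

Trying the longest context first favours the most specific evidence, and the shorter tables act as a fallback when the full phrase was never seen in the sample.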

Additional information

The word prediction application is hosted on shinyapps.io: https://kexiaobing.shinyapps.io/ShinyApp-Capstone
The profanity word list was downloaded from: http://www.bannedwordlist.com/lists/swearWords.csv
The R package used for text mining is “tm”, and the R package used for n-gram generation is “RWeka”.