Ke Xiaobing
22 Aug 2015
This application predicts the next word for a phrase entered by the user. The prediction algorithm is based on datasets downloaded from HC Corpora, which provides three text files: one from Twitter, one from blogs and one from news sites. After data processing and data modeling, the application was built and published to shinyapps.io.
When the Shiny application launches, it takes about 30 seconds to load the datasets used for prediction.
The interface of the application is shown here.
The steps to predict the next word of the phrases are as follows:
Data processing, including data cleansing: removal of URLs, links, non-English words, numbers, extra whitespace, punctuation and profanity.
Data modeling: build bigram, trigram and quadgram tables from the loaded datasets and save the results to files.
Build the Shiny app for word prediction: use the bigram, trigram and quadgram tables to predict the next word of the input phrase.
Use the simplified back-off model: search the quadgram table first; on a miss, search the trigram table; on a miss there, search the bigram table.
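The steps above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the author's R code (which uses tm and RWeka). The function names, the toy corpus and the one-word profanity set are all hypothetical, and the sketch assumes that the most frequent continuation wins at each level.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stand-in for the downloaded profanity list.
PROFANITY = {"darn"}

def clean(text):
    """Lowercase, then drop URLs, non-letters, extra whitespace and profanity."""
    text = re.sub(r"(https?://|www\.)\S+", " ", text.lower())
    text = re.sub(r"[^a-z]+", " ", text)  # numbers, punctuation, non-ASCII
    return [w for w in text.split() if w not in PROFANITY]

def build_tables(lines, max_n=4):
    """For n = 2..max_n, map each (n-1)-word context to next-word counts."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for line in lines:
        words = clean(line)
        for n in tables:
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                tables[n][context][words[i + n - 1]] += 1
    return tables

def predict(phrase, tables):
    """Simplified back-off: try the quadgram table, then trigram, then bigram."""
    words = clean(phrase)
    for n in sorted(tables, reverse=True):
        context = tuple(words[-(n - 1):])
        if len(context) == n - 1 and context in tables[n]:
            return tables[n][context].most_common(1)[0][0]
    return None  # no table contains a matching context

# Toy corpus (assumption, for illustration only).
corpus = [
    "Thanks for the follow http://t.co/abc",
    "thanks for the help!!",
    "Thanks for the follow 123",
    "see the light",
]
tables = build_tables(corpus)
print(predict("Thanks for the", tables))   # quadgram hit
print(predict("please see the", tables))   # backs off to the trigram table
print(predict("into the", tables))         # backs off to the bigram table
```

This sketch returns only the single most frequent continuation and applies no discounting when backing off, which matches the "simplified" description above but not a full Katz back-off model.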
The word prediction application is hosted on shinyapps.io: https://kexiaobing.shinyapps.io/ShinyApp-Capstone2
The profanity word list was downloaded from http://www.bannedwordlist.com/lists/swearWords.csv
The R package used for text mining is “tm”, and the R package used for n-gram generation is “RWeka”.
Future work is needed to improve the prediction accuracy and the performance of the Shiny app.