Ke Xiaobing
16 Aug 2015
This application predicts the next word for a phrase entered by the user. The prediction algorithm is based on datasets downloaded from HC Corpora, which provides three text files: one from Twitter, one from blogs and one from news sites. After data processing and modeling, the application was built and published to shinyapps.io.
The interface of the application is shown here.
The steps to predict the next word of a phrase are as follows:
Data loading: the given datasets are very large, so only part of the data is loaded for processing and modeling.
Data processing: cleanse the data by removing URLs, links, non-English words, numbers, extra whitespace, punctuation and profanity.
N-gram building: build bigrams, trigrams and quadgrams from the loaded data and save the results to files.
Shiny app: build the word prediction app, which uses the bigrams, trigrams and quadgrams to predict the next word of the input phrase.
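The n-gram lookup in the last two steps can be illustrated with a minimal sketch. It is written in Python for illustration only; the application itself is built in R. The function names, the toy corpus and the backoff order (quadgram first, then trigram, then bigram) are assumptions, since the steps above do not say how the three n-gram tables are combined.

```python
from collections import Counter, defaultdict


def build_ngram_table(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        table[context][tokens[i + n - 1]] += 1
    return table


def predict_next(phrase, tables):
    """Try the longest context first, then back off to shorter ones.

    `tables` maps n (for example 4, 3, 2) to a table from build_ngram_table.
    Returns None when no context matches. The backoff order is an assumption.
    """
    words = phrase.lower().split()
    for n in sorted(tables, reverse=True):
        if len(words) < n - 1:
            continue  # phrase is too short for this n-gram order
        context = tuple(words[-(n - 1):])
        followers = tables[n].get(context)
        if followers:
            return followers.most_common(1)[0][0]
    return None


# Toy corpus standing in for the cleaned sample (hypothetical, not the real data).
corpus = "thanks for the follow thanks for the follow thanks for the support".split()
tables = {n: build_ngram_table(corpus, n) for n in (2, 3, 4)}

print(predict_next("Thanks for the", tables))
```

When the last three words of the phrase are not found in the quadgram table, the sketch falls back to the last two words and then to the last word, so a phrase such as "hello for the" is still answered from the trigram table.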
The word prediction application is hosted on shinyapps.io: https://kexiaobing.shinyapps.io/ShinyApp-Capstone2
The profanity word list was downloaded from: http://www.bannedwordlist.com/lists/swearWords.csv
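The cleansing step, including the profanity filter, might look roughly like the following Python sketch. It is an illustration only, not the R code used in the application. The two-word `PROFANITY` set is a hypothetical placeholder for the downloaded list, and dropping every token that is not made of plain a-z letters is only a crude approximation of removing non-English words.

```python
import re
import string

# Hypothetical stand-in for the downloaded profanity list.
PROFANITY = {"darn", "heck"}


def clean_text(text, banned=PROFANITY):
    """Lowercase the text and keep only plain alphabetic, non-banned words."""
    # Remove URLs and links before splitting into tokens.
    text = re.sub(r"(https?://|www\.)\S+", " ", text.lower())
    kept = []
    for token in text.split():
        word = token.strip(string.punctuation)  # trim surrounding punctuation
        # Keep a-z words (optionally with one apostrophe, as in "don't");
        # numbers and tokens with other characters are dropped.
        if re.fullmatch(r"[a-z]+(?:'[a-z]+)?", word) and word not in banned:
            kept.append(word)
    return " ".join(kept)  # joining on single spaces also collapses whitespace


print(clean_text("Check http://example.com NOW, 3 darn times!!"))
```

In practice the banned set would be read from the downloaded file rather than hard-coded.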
The R package used for text mining is “tm”, and the R package used for n-gram generation is “RWeka”.