Eric Lim B G
26 April 2015
Coursera, Data Science Specialisation
Capstone Project - SwiftKey (Rpres)
Natural Language Processing (NLP) and Text Mining (TM) are areas of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages.
A Shiny app has been developed to demonstrate the use of data science techniques from NLP and TM to build a predictive model for next-word prediction.
The Shiny app is composed of:
Training data from the HC Corpora corpus is used to build the model. A 0.1% sample is drawn from each of the US-locale blogs, news and Twitter datasets to meet the required performance.
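As a rough illustration of the sampling step, the sketch below draws a reproducible 0.1% random sample of lines. It is written in Python for brevity rather than in the R packages the project uses, and the function name `sample_lines`, the fixed seed, and the choice of sampling whole lines without replacement are all assumptions, not details taken from the project.

```python
import random


def sample_lines(lines, fraction=0.001, seed=2015):
    """Return a reproducible random sample of roughly `fraction` of `lines`.

    Hypothetical helper: the name, the seed and line-level sampling
    without replacement are assumptions made for this illustration.
    """
    if not lines:
        return []
    # Keep at least one line so very small inputs still yield a sample.
    k = max(1, round(len(lines) * fraction))
    rng = random.Random(seed)  # fixed seed keeps the sample repeatable
    return rng.sample(lines, k)
```

Applied to a list of 10,000 lines with the default fraction, this returns 10 lines, and repeated calls with the same seed return the same 10 lines.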
Below is summary information on the datasets. Full exploratory analysis results are available in the project's milestone report.
  File      Lines     Chars CharsNWhite TotalWords
1 blogs    899288 206824382   170389539   37570839
2 news    1010242 203223154   169860866   34494539
3 twitter 2360148 162096031   134082634   30451128
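The column names suggest simple per-file counts. The Python sketch below shows one way such counts could be computed. It assumes that `CharsNWhite` means characters excluding whitespace (consistent with it being smaller than `Chars` in every row) and that words are whitespace-delimited; the milestone report may count differently. The function name `summarize` is hypothetical.

```python
def summarize(name, lines):
    """Count lines, characters, non-whitespace characters and words.

    Assumptions for this illustration: `CharsNWhite` is read as
    "characters that are not whitespace", and a word is any
    whitespace-delimited token.
    """
    chars = sum(len(line) for line in lines)
    chars_nwhite = sum(
        1 for line in lines for ch in line if not ch.isspace()
    )
    words = sum(len(line.split()) for line in lines)
    return {
        "File": name,
        "Lines": len(lines),
        "Chars": chars,
        "CharsNWhite": chars_nwhite,
        "TotalWords": words,
    }
```

For the two lines "hello world" and "a b", this gives 2 lines, 14 characters, 12 non-whitespace characters and 4 words.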
The main algorithm used to build the model comes from the tm and RWeka packages. The following steps are performed after sampling each dataset:
The Shiny app accepts an editorial profile and phrases as input from the user, and outputs a prediction of the next word. More instructions are available under the application's “Help”.
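To make the idea of next-word prediction concrete, here is a minimal Python sketch of a generic n-gram frequency model with simple backoff: it tries the longest available context first and falls back to shorter ones. This is a common textbook approach and an assumption on my part, not a description of the app's actual model; the function names, the regular-expression tokenizer and the trigram limit are all hypothetical, and the editorial profile input is ignored.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def build_model(lines, max_n=3):
    """Map each context of 1..max_n-1 words to counts of the next word."""
    model = defaultdict(Counter)
    for line in lines:
        words = tokenize(line)
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                model[context][words[i + n - 1]] += 1
    return model


def predict_next(model, phrase, max_n=3):
    """Return the most frequent follower of the longest known context.

    Backs off from the longest context to shorter ones; returns None
    when no context from the phrase was seen in training.
    """
    words = tokenize(phrase)
    for size in range(min(max_n - 1, len(words)), 0, -1):
        context = tuple(words[-size:])
        if context in model:
            return model[context].most_common(1)[0][0]
    return None
```

With a model built from the lines "i love new york", "i love new shoes", "new york is big" and "new york city", the phrase "I love" is completed with "new". The phrase "visit new" has no matching two-word context, so the sketch backs off to the single word "new" and returns "york".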