Alex Pilugin
05.02.2018
This is a brief for an NLP project based on the SwiftKey data. The goal of the project is to predict the next word a user will type, using a model 'learned' from blog, news and Twitter texts. I used two datasets, in English and Russian, to build a bilingual Shiny app. The project consists of 5 steps:
First, I combined all the files (news, blogs and Twitter) and sampled them (1/2 of the English data, all of the Russian data). Then I cleaned the text using:
I passed the pre-processed data to the quanteda package to build n-grams (orders 1 to 5). The detailed features were saved in a list and then pruned to approximately the 5% most popular words. Pruning limits the list size to 300-400 MB and keeps prediction time convenient.
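The n-gram counting and pruning step can be sketched in a few lines. This is only an illustration in Python, not the quanteda code used in the project: the function names `build_ngrams` and `prune` are hypothetical, and the pruning rule (keep the most frequent 5% of entries per n-gram order) is an assumption about how the "5% of most popular" cut was applied.

```python
from collections import Counter


def build_ngrams(tokens, max_order=5):
    """Count every n-gram of order 1..max_order in a list of tokens."""
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts


def prune(counter, keep_fraction=0.05):
    """Keep only the most frequent share of entries (always at least one).

    Assumption: pruning is done by frequency rank within one n-gram order.
    """
    keep = max(1, int(len(counter) * keep_fraction))
    return Counter(dict(counter.most_common(keep)))


if __name__ == "__main__":
    tokens = "the cat sat on the mat the cat ran".split()
    counts = build_ngrams(tokens, max_order=3)
    print(counts[2].most_common(1))
    print(prune(counts[1], keep_fraction=0.2))
```

Dropping the rare tail is what keeps the lookup tables small: most distinct n-grams occur only once, so a small share of the entries covers a large share of the observed text.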
Using these n-grams, I built two models based on the Katz back-off and Kneser-Ney algorithms. Both models take up to 4 words as input and predict ten candidates for the next (fifth) word, sorted by probability of appearance. I then calculated the accuracy of each model on a previously held-out, out-of-sample test dataset.
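The back-off idea and the accuracy check can be illustrated with a deliberately simplified sketch. It is not Katz back-off or Kneser-Ney: it applies no discounting or smoothing and simply ranks candidates by raw count, trying the longest matching context first and falling back to shorter ones. All names here (`build_table`, `predict`, `top_k_accuracy`) are hypothetical, and counting a prediction as correct when the true word is among the top k candidates is an assumption about how accuracy was measured.

```python
from collections import Counter, defaultdict


def build_table(sentences, max_order=5):
    """Map each context (0 to max_order-1 preceding words) to next-word counts."""
    table = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for k in range(max_order):
                if i - k < 0:
                    break
                table[tuple(tokens[i - k:i])][word] += 1
    return table


def predict(table, context, k=10, max_context=4):
    """Return up to k candidate next words, longest context first.

    Simplified back-off: no discounting, candidates ranked by raw count.
    """
    context = tuple(context)[-max_context:]
    ranked = []
    for start in range(len(context) + 1):
        followers = table.get(context[start:])
        if not followers:
            continue
        for word, _count in followers.most_common():
            if word not in ranked:
                ranked.append(word)
            if len(ranked) == k:
                return ranked
    return ranked


def top_k_accuracy(table, test_sentences, k=10, max_context=4):
    """Share of test positions whose true next word is among the k guesses."""
    hits = total = 0
    for tokens in test_sentences:
        for i in range(1, len(tokens)):
            history = tokens[max(0, i - max_context):i]
            hits += tokens[i] in predict(table, history, k, max_context)
            total += 1
    return hits / total if total else 0.0


if __name__ == "__main__":
    train = [["i", "love", "nlp"], ["i", "love", "r"], ["i", "love", "nlp"]]
    table = build_table(train)
    print(predict(table, ["i", "love"], k=2))
    print(top_k_accuracy(table, [["i", "love", "nlp"]], k=1))
```

A real Katz or Kneser-Ney model would replace the raw counts with discounted probabilities and redistribute the reserved probability mass to the lower-order n-grams, but the lookup order (longest context first, then shorter ones) stays the same.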
Finally, the Shiny app emulates a prediction app based on NLP technology, with one difference: the predict button. Brief manual for the Shiny app:
Many thanks to the authors of the following docs and articles
To get better results, I would suggest using different methods based on neural networks. I think that increasing the corpus size or enhancing the n-gram features would never give more than a 10-20% gain in accuracy. So my further work is to learn how to use neural networks in NLP.