Words Guessing App

Alex Pilugin
05.02.2018

Overview

This is a brief overview of an NLP project based on the SwiftKey dataset. The goal of the project is to predict the next typed word using data 'learned' from blog, news, and Twitter texts. I used two datasets, in English and Russian, to build a bilingual Shiny app. The project consists of five steps:

  • Data pre-processing
  • Building n-grams
  • Data modelling
  • Accuracy estimation
  • Shiny app

The Shiny app demonstrates the integrated result of the data processing and prediction models.

Data pre-processing

First, I combined all the files (news, blogs, and Twitter) and sampled them (½ for English, the full dataset for Russian). Then I cleaned the text by:

  • Converting to lower case
  • Removing punctuation
  • Removing hashtags and URLs
  • Trimming unnecessary spaces
  • Converting all apostrophes to a single form (')

I then split the sample into train and test sets (80/20) and removed all one-letter words except 'i'; a rough sketch of these steps follows.
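The exact code is not part of this brief; a minimal sketch of the sampling and cleaning steps in base R might look like the following (the file names, regular expressions, and random seed are assumptions, not the original implementation):

    # Minimal sketch of the sampling and cleaning steps (assumed file names,
    # regexes and seed; not the original code)
    set.seed(1234)

    raw_text <- c(readLines("en_US.blogs.txt",   skipNul = TRUE),
                  readLines("en_US.news.txt",    skipNul = TRUE),
                  readLines("en_US.twitter.txt", skipNul = TRUE))
    sampled  <- sample(raw_text, floor(length(raw_text) / 2))   # 1/2 sample for English

    clean <- tolower(sampled)                                   # to lower case
    clean <- gsub("(https?://|www\\.)\\S+", " ", clean)         # remove URLs
    clean <- gsub("#\\S+", " ", clean)                          # remove hashtags
    clean <- gsub("[\u2018\u2019\u0060\u00B4]", "'", clean)     # one apostrophe form
    clean <- gsub("[^a-z' ]", " ", clean)                       # remove punctuation/digits
    clean <- gsub("(?<![a-z'])[a-hj-z](?![a-z'])", " ",         # drop one-letter words
                  clean, perl = TRUE)                           # except 'i' (keeps "don't")
    clean <- trimws(gsub("\\s+", " ", clean))                   # trim extra spaces

    idx   <- sample(length(clean), floor(0.8 * length(clean)))  # 80/20 train/test split
    train <- clean[idx]
    test  <- clean[-idx]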

Models and accuracy

I fed the pre-processed data into the quanteda package to build n-grams (n = 1 to 5). The resulting features were saved in a list and then pruned to roughly the 5% most frequent entries. Pruning limits the list size to 300-400 MB and keeps prediction times reasonable.
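As a rough illustration, the n-gram tables could be built with quanteda along these lines; the storage as sorted named vectors, the concatenator, and the exact pruning rule are assumptions rather than the original code:

    library(quanteda)

    toks <- tokens(train,
                   remove_punct   = TRUE,
                   remove_symbols = TRUE,
                   remove_numbers = TRUE)

    # Build 1- to 5-gram frequency tables, keeping roughly the 5% most frequent entries
    ngram_freqs <- lapply(1:5, function(n) {
      ng    <- tokens_ngrams(toks, n = n, concatenator = " ")
      dfm_n <- dfm(ng)
      keep  <- ceiling(0.05 * nfeat(dfm_n))      # ~5% most popular n-grams
      topfeatures(dfm_n, keep)                   # sorted named vector: n-gram -> count
    })
    names(ngram_freqs) <- paste0("n", 1:5)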

Using these n-grams, I built two models, one based on Katz's Backoff and one on the Kneser-Ney algorithm. Both models take up to four words of input and return the ten most likely candidates for the next word, sorted by probability of appearance.
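To show the shape of the prediction step, here is a heavily simplified back-off lookup over the tables from the previous sketch. It omits the Katz discounting and the Kneser-Ney continuation counts entirely, so it is closer to a plain frequency back-off than to either of the actual models:

    # Heavily simplified back-off lookup over the pruned tables above (illustration only)
    predict_next <- function(input, ngram_freqs, top_n = 10) {
      words <- tail(strsplit(tolower(trimws(input)), "\\s+")[[1]], 4)   # last 4 words
      if (length(words) == 0) return(NULL)
      for (k in length(words):1) {
        prefix  <- paste(tail(words, k), collapse = " ")
        table_k <- ngram_freqs[[paste0("n", k + 1)]]                    # (k+1)-gram table
        hits    <- table_k[startsWith(names(table_k), paste0(prefix, " "))]
        if (length(hits) > 0) {
          cand <- head(hits / sum(hits), top_n)                         # relative frequencies
          return(data.frame(word = sub(".* ", "", names(cand)),
                            prob = as.numeric(cand)))
        }
      }
      uni <- head(ngram_freqs$n1 / sum(ngram_freqs$n1), top_n)          # unigram fallback
      data.frame(word = names(uni), prob = as.numeric(uni))
    }

    predict_next("thanks for the", ngram_freqs)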

I then measured each model's accuracy on the previously held-out test dataset:

  • English: accuracy KB = 14.1%, KN = 14.3%; prediction time KB = 0.09 sec, KN = 0.15 sec
  • Russian: accuracy KB = 11.4%, KN = 11.8%; prediction time KB = 0.14 sec, KN = 0.13 sec
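A simple way to estimate this kind of top-ten accuracy on the held-out test set is sketched below; the evaluation sample size and the use of predict_next() from the earlier sketch are assumptions, not the original benchmark code. Prediction time could be measured the same way by wrapping the call in system.time().

    # Top-10 accuracy: a hit when the true next word is among the ten candidates
    eval_lines <- sample(test, 1000)                 # assumed evaluation sample size
    hits <- 0; trials <- 0

    for (s in eval_lines) {
      w <- strsplit(s, "\\s+")[[1]]
      if (length(w) < 2) next
      preds  <- predict_next(paste(head(w, -1), collapse = " "), ngram_freqs)
      trials <- trials + 1
      if (!is.null(preds) && tail(w, 1) %in% preds$word) hits <- hits + 1
    }

    hits / trials                                    # top-10 accuracy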

Shiny App

Finally, the Shiny app emulates an NLP-based prediction app, the main difference being a “Predict” button that triggers the guess. A brief manual for the app (a minimal UI sketch follows the list):

  • Choose the language: English or Russian
  • Choose the algorithm: Katz's Backoff (KB), Kneser-Ney (KN), or both
  • Type the first words and press “Predict” to guess the next one
  • Wait a couple of seconds; the app freezes briefly the first time
  • Results are arranged in a table (or tables), sorted in descending order by probability
  • Type as much as you wish to test accuracy and speed
  • Make sure the input language matches the selected prediction language
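For reference, a stripped-down version of such an interface might look like this in Shiny; the widget labels and the call to the hypothetical predict_next() helper are assumptions, and the real app is more elaborate:

    library(shiny)

    ui <- fluidPage(
      titlePanel("Next-word prediction (sketch)"),
      selectInput("lang", "Language",  choices = c("English", "Russian")),
      selectInput("algo", "Algorithm", choices = c("Katz's Backoff (KB)", "Kneser-Ney (KN)", "Both")),
      textInput("phrase", "Type the first words"),
      actionButton("go", "Predict"),
      tableOutput("result")
    )

    server <- function(input, output) {
      # predictions are computed only when the "Predict" button is pressed
      preds <- eventReactive(input$go, {
        # language/algorithm switching omitted; uses the hypothetical predict_next()
        predict_next(input$phrase, ngram_freqs)
      })
      output$result <- renderTable(preds())
    }

    shinyApp(ui, server)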

References and Further Work

Many thanks to the following docs and articles.

To get better results, I would suggest using different methods based on neural networks. I think that increasing the corpus size or enhancing the n-gram features would never give more than an extra 10-20% in accuracy. So my further work is to learn how to use neural networks for NLP.