Words Guessing App

Alex Pilugin
05.02.2018

Overview

This is a brief overview of an NLP project based on the SwiftKey dataset. The goal of the project is to predict the next typed word using data 'learned' from blog, news, and Twitter texts. I used two datasets, in English and Russian, to build a bilingual Shiny app. The project consists of five steps:

  • Data pre-processing
  • Building n-grams
  • Data modelling
  • Accuracy estimation
  • Shiny app

The Shiny app demonstrates the integrated result of the data processing and prediction models.

Data pre-processing

First, I combined all the files (news, blogs, and Twitter) and sampled them (½ for English, the full dataset for Russian). Then I cleaned the text by:

  • Converting to lower case
  • Removing punctuation
  • Removing hashtags and URLs
  • Trimming unnecessary spaces
  • Converting all apostrophes to a single form (')

I then split the sample into train and test sets (80/20) and removed all one-letter words except 'i'; a rough sketch of these steps follows.
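The exact code is not part of this brief; a minimal sketch of the sampling and cleaning steps in base R might look like the following (the file names, regular expressions, and random seed are assumptions, not the original implementation):

    # Minimal sketch of the sampling and cleaning steps (assumed file names,
    # regexes and seed; not the original code)
    set.seed(1234)

    raw_text <- c(readLines("en_US.blogs.txt",   skipNul = TRUE),
                  readLines("en_US.news.txt",    skipNul = TRUE),
                  readLines("en_US.twitter.txt", skipNul = TRUE))
    sampled  <- sample(raw_text, floor(length(raw_text) / 2))   # 1/2 sample for English

    clean <- tolower(sampled)                                   # to lower case
    clean <- gsub("(https?://|www\\.)\\S+", " ", clean)         # remove URLs
    clean <- gsub("#\\S+", " ", clean)                          # remove hashtags
    clean <- gsub("[\u2018\u2019\u0060\u00B4]", "'", clean)     # one apostrophe form
    clean <- gsub("[^a-z' ]", " ", clean)                       # remove punctuation/digits
    clean <- gsub("(?<![a-z'])[a-hj-z](?![a-z'])", " ",         # drop one-letter words
                  clean, perl = TRUE)                           # except 'i' (keeps "don't")
    clean <- trimws(gsub("\\s+", " ", clean))                   # trim extra spaces

    idx   <- sample(length(clean), floor(0.8 * length(clean)))  # 80/20 train/test split
    train <- clean[idx]
    test  <- clean[-idx]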

Models and accuracy

I fed the pre-processed data into the quanteda package to build n-grams (n = 1 to 5). The resulting features were saved in a list and then pruned to roughly the 5% most frequent entries. Pruning limits the list size to 300-400 MB and keeps prediction times reasonable.
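As a rough illustration, the n-gram tables could be built with quanteda along these lines; the storage as sorted named vectors, the concatenator, and the exact pruning rule are assumptions rather than the original code:

    library(quanteda)

    toks <- tokens(train,
                   remove_punct   = TRUE,
                   remove_symbols = TRUE,
                   remove_numbers = TRUE)

    # Build 1- to 5-gram frequency tables, keeping roughly the 5% most frequent entries
    ngram_freqs <- lapply(1:5, function(n) {
      ng    <- tokens_ngrams(toks, n = n, concatenator = " ")
      dfm_n <- dfm(ng)
      keep  <- ceiling(0.05 * nfeat(dfm_n))      # ~5% most popular n-grams
      topfeatures(dfm_n, keep)                   # sorted named vector: n-gram -> count
    })
    names(ngram_freqs) <- paste0("n", 1:5)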

Using these n-grams, I built two models, one based on Katz's Backoff and one on the Kneser-Ney algorithm. Both models take up to four words of input and return the ten most likely candidates for the next word, sorted by probability of appearance.
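To show the shape of the prediction step, here is a heavily simplified back-off lookup over the tables from the previous sketch. It omits the Katz discounting and the Kneser-Ney continuation counts entirely, so it is closer to a plain frequency back-off than to either of the actual models:

    # Heavily simplified back-off lookup over the pruned tables above (illustration only)
    predict_next <- function(input, ngram_freqs, top_n = 10) {
      words <- tail(strsplit(tolower(trimws(input)), "\\s+")[[1]], 4)   # last 4 words
      if (length(words) == 0) return(NULL)
      for (k in length(words):1) {
        prefix  <- paste(tail(words, k), collapse = " ")
        table_k <- ngram_freqs[[paste0("n", k + 1)]]                    # (k+1)-gram table
        hits    <- table_k[startsWith(names(table_k), paste0(prefix, " "))]
        if (length(hits) > 0) {
          cand <- head(hits / sum(hits), top_n)                         # relative frequencies
          return(data.frame(word = sub(".* ", "", names(cand)),
                            prob = as.numeric(cand)))
        }
      }
      uni <- head(ngram_freqs$n1 / sum(ngram_freqs$n1), top_n)          # unigram fallback
      data.frame(word = names(uni), prob = as.numeric(uni))
    }

    predict_next("thanks for the", ngram_freqs)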

I then measured each model's accuracy on the previously held-out test dataset:

  • English: accuracy KB = 14.1%, KN = 14.3%; prediction time KB = 0.09 sec, KN = 0.15 sec
  • Russian: accuracy KB = 11.4%, KN = 11.8%; prediction time KB = 0.14 sec, KN = 0.13 sec
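A simple way to estimate this kind of top-ten accuracy on the held-out test set is sketched below; the evaluation sample size and the use of predict_next() from the earlier sketch are assumptions, not the original benchmark code. Prediction time could be measured the same way by wrapping the call in system.time().

    # Top-10 accuracy: a hit when the true next word is among the ten candidates
    eval_lines <- sample(test, 1000)                 # assumed evaluation sample size
    hits <- 0; trials <- 0

    for (s in eval_lines) {
      w <- strsplit(s, "\\s+")[[1]]
      if (length(w) < 2) next
      preds  <- predict_next(paste(head(w, -1), collapse = " "), ngram_freqs)
      trials <- trials + 1
      if (!is.null(preds) && tail(w, 1) %in% preds$word) hits <- hits + 1
    }

    hits / trials                                    # top-10 accuracy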

Shiny App

Finally, the Shiny app emulates an NLP-based prediction app, the main difference being a “Predict” button that triggers the guess. A brief manual for the app (a minimal UI sketch follows the list):

  • Choose the language: English or Russian
  • Choose the algorithm: Katz's Backoff (KB), Kneser-Ney (KN), or both
  • Type the first words and press “Predict” to guess the next one
  • Wait a couple of seconds; the app freezes briefly the first time
  • Results are arranged in a table (or tables), sorted in descending order by probability
  • Type as much as you wish to test accuracy and speed
  • Make sure the input language matches the selected prediction language
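For reference, a stripped-down version of such an interface might look like this in Shiny; the widget labels and the call to the hypothetical predict_next() helper are assumptions, and the real app is more elaborate:

    library(shiny)

    ui <- fluidPage(
      titlePanel("Next-word prediction (sketch)"),
      selectInput("lang", "Language",  choices = c("English", "Russian")),
      selectInput("algo", "Algorithm", choices = c("Katz's Backoff (KB)", "Kneser-Ney (KN)", "Both")),
      textInput("phrase", "Type the first words"),
      actionButton("go", "Predict"),
      tableOutput("result")
    )

    server <- function(input, output) {
      # predictions are computed only when the "Predict" button is pressed
      preds <- eventReactive(input$go, {
        # language/algorithm switching omitted; uses the hypothetical predict_next()
        predict_next(input$phrase, ngram_freqs)
      })
      output$result <- renderTable(preds())
    }

    shinyApp(ui, server)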

References and Further Work

Many thanks to the following docs and articles.

To get better results, I would suggest using different methods based on neural networks. I think that increasing the corpus size or enhancing the n-gram features would never give more than an extra 10-20% in accuracy. So my further work is to learn how to use neural networks for NLP.