Keyboard Word Prediction

Model developed for the Coursera - SwiftKey Capstone Project

author: aiooo

date: 13.12.2014

Aim of the project

The aim of the project was to build an application that can quickly and accurately predict the next word typed on a keyboard. The predictions are based on the frequency of occurrence of different phrases in a selected text corpus.

The final application was published on the ShinyApps server and had to stay within its memory limit (up to 30MB).

The final size of the application is 27MB. It can be found at https://aiooo.shinyapps.io/shiny/.

Data

The dataset is sourced from HC Corpora, a set of publicly available texts, and consists of US English news, blog and Twitter corpora.

The following types of data were sampled to perform the frequency analysis:

  • 10 000 lines of news
  • 10 000 lines of blogs and
  • 40 000 lines of tweets.

The data were cleaned, skimmed and tokenized into 2-, 3- and 4-word phrases (n-grams).
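The tokenization step can be illustrated with a minimal Python sketch. This is not the project's actual code: the cleaning rule (lowercasing and keeping only letters, apostrophes and spaces) and the helper names `clean`, `ngrams` and `count_ngrams` are assumptions made for illustration.

```python
import re
from collections import Counter

def clean(line):
    """Lowercase a line and keep only letters, apostrophes and spaces.

    This cleaning rule is an assumption; the original pipeline may differ.
    """
    line = re.sub(r"[^a-z' ]+", " ", line.lower())
    return line.split()

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(lines, sizes=(2, 3, 4)):
    """Count 2-, 3- and 4-grams over all lines, one Counter per size."""
    counts = {n: Counter() for n in sizes}
    for line in lines:
        tokens = clean(line)
        for n in sizes:
            counts[n].update(ngrams(tokens, n))
    return counts

# Two made-up lines standing in for the sampled corpus.
sample = ["Thanks for the follow!", "Thanks for the great day."]
counts = count_ngrams(sample)
```

With these two sample lines, `counts[2]` holds the pair ("thanks", "for") twice, while each 4-gram appears once.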

Prediction algorithm

The algorithm is based on a simple phrase frequency count with frequency weighting. The model function counts the frequency of 2-, 3- and 4-word phrases (n-grams). The frequencies of 4-grams and 3-grams are then multiplied by a weighting factor so that they are comparable with those of 2-grams.

The last word of the matching phrase is split off and becomes the predicted word. If the phrase does not occur in the corpus, the application returns 'the' as the default prediction.
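The idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the application's implementation: the multipliers in `WEIGHTS` are hypothetical (the actual factors are not given here), and summing the weighted counts across n-gram sizes is one plausible way to combine them.

```python
from collections import Counter, defaultdict

# Hypothetical multipliers per n-gram size; the real factors are not stated.
WEIGHTS = {2: 1.0, 3: 2.0, 4: 3.0}
DEFAULT_WORD = "the"  # fallback when no phrase matches

def build_table(lines):
    """Map each context (the first n-1 words of an n-gram) to the
    weighted counts of the word that follows it."""
    table = defaultdict(Counter)
    for line in lines:
        tokens = line.lower().split()
        for n, weight in WEIGHTS.items():
            for i in range(len(tokens) - n + 1):
                *context, last = tokens[i:i + n]
                table[tuple(context)][last] += weight
    return table

def predict(table, phrase):
    """Return the highest-scoring next word, or the default if nothing matches."""
    tokens = phrase.lower().split()
    scores = Counter()
    for n in WEIGHTS:
        context = tuple(tokens[-(n - 1):])
        # Skip sizes for which the typed phrase is too short.
        if len(context) == n - 1 and context in table:
            scores.update(table[context])
    if not scores:
        return DEFAULT_WORD
    return scores.most_common(1)[0][0]

# A tiny made-up corpus for demonstration.
corpus = ["thanks for the follow", "thanks for the follow", "for the record"]
table = build_table(corpus)
```

Here `predict(table, "thanks for the")` returns "follow", because that word follows the matching 2-, 3- and 4-word contexts most often, while an unseen phrase such as "zzz qqq" falls back to "the".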

Functionality

The web application is simple and straightforward. The user enters a phrase and the predicted word is shown automatically in the field below.

https://aiooo.shinyapps.io/shiny/