Model developed for the Coursera - SwiftKey Capstone Project
author: aiooo
date: 13.12.2014
The aim of the project was to build an application that can quickly and accurately predict the next word typed on a keyboard. The predictions are based on how frequently different phrases occur in a selected text corpus.
The final application was published on the ShinyApps server and had to stay within its memory limit (up to 30 MB).
The final size of the application is 27 MB. It can be found here.
The dataset is sourced from HC Corpora, a set of publicly available texts, and consists of US English news, blog and Twitter corpora.
The following types of data were sampled to perform the frequency analysis:
The data were cleaned, skimmed and tokenized into 2-, 3- and 4-word phrases (n-grams).
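The tokenization step can be illustrated with a minimal Python sketch. It is not the project's own code: the function names `tokenize` and `ngrams` and the cleaning rule (lowercase, keep only letters and apostrophes) are assumptions made for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and keep only runs of letters and apostrophes (assumed cleaning rule)."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Example: split one sentence into 2-, 3- and 4-word phrases.
tokens = tokenize("Thanks for the follow, see you at the game!")
phrases = {n: ngrams(tokens, n) for n in (2, 3, 4)}
```

A text shorter than `n` words simply yields no phrases of that length, so no special handling is needed for short tweets.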
The algorithm is based on a simple phrase frequency count with frequency weighting. The model function counts the frequency of 2-, 3- and 4-word phrases (n-grams). The total frequencies of 4-grams and 3-grams are then multiplied by a weighting factor so that they are comparable with those of 2-grams.
The last word of the matching phrase is split off and becomes the predicted word. If the phrase doesn't occur in the corpus, the application uses 'the' as the default outcome.
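The counting, weighting and fallback described above might look roughly like the following Python sketch. The weight values in `WEIGHTS`, the names `build_model` and `predict`, and the way scores from different phrase lengths are summed are all assumptions; the text only states that 3-gram and 4-gram frequencies are scaled to be comparable with 2-grams and that 'the' is the default.

```python
from collections import Counter, defaultdict

# Hypothetical weights per phrase length; the actual multipliers are not given.
WEIGHTS = {2: 1.0, 3: 2.0, 4: 3.0}
DEFAULT_WORD = "the"

def build_model(tokens, weights=WEIGHTS):
    """Map each context (all words of a phrase but the last) to weighted counts of the last word."""
    model = defaultdict(Counter)
    for n, weight in weights.items():
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            model[gram[:-1]][gram[-1]] += weight
    return model

def predict(model, phrase):
    """Sum weighted counts over the last 1, 2 and 3 typed words; fall back to the default word."""
    words = phrase.lower().split()
    scores = Counter()
    for k in (1, 2, 3):
        if len(words) >= k:
            context = tuple(words[-k:])
            if context in model:
                scores.update(model[context])
    if not scores:
        return DEFAULT_WORD
    return scores.most_common(1)[0][0]

corpus = "i love new york i love new shoes i love new york".split()
model = build_model(corpus)
```

With this toy corpus, `predict(model, "love new")` favours the word that follows "new" and "love new" most often, while an unseen phrase returns the default word.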
The web application is simple and explicit. The user enters a phrase and the predicted word is automatically shown in the field below.