Capstone project: text prediction with SwiftKey

Derek Corcoran
January 24, 2016


Quanteda

To analyze the data, the quanteda package was used. The main reasons for choosing this package were:

  • More economical than tm
  • It does not rely on other programs
  • It is easy to combine corpora by adding them

Data used for modelling

  • 100,000 lines of each type of data were used
  • The corpus was tokenized, profanity was filtered out, and punctuation was removed
  • It was tokenized into unigrams, bigrams, trigrams, fourgrams, and fivegrams (a preprocessing sketch follows this list)
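
A minimal sketch of this preprocessing with quanteda is shown below. The object names sample_lines and profanity are hypothetical placeholders (a character vector of text lines and a list of words to remove), not the names used in the actual project, and the code assumes a recent quanteda release.

```r
library(quanteda)

# Minimal preprocessing sketch (assumes quanteda >= 3.x).
# `sample_lines` (a character vector of text lines) and `profanity`
# (a character vector of words to drop) are hypothetical placeholders.
corp <- corpus(sample_lines)
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_remove(toks, pattern = profanity)

# Build unigrams through fivegrams and count the most frequent ones
ngram_freqs <- lapply(1:5, function(n) {
  ng <- tokens_ngrams(toks, n = n, concatenator = " ")
  topfeatures(dfm(ng), 10)
})
```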

Below is an example of the frequency of the most common fivegrams, fourgrams, and trigrams:

fivegram                  n     fourgram              n     trigram         n
at the end of the         286   the end of the        717   one of the      2693
in the middle of the      158   at the end of         550   a lot of        2509
for the first time in     116   the rest of the       543   to be a         1471
the end of the day        111   for the first time    484   the end of      1394

Prediction

  • The model counts the number of words (n) in the sentence written so far
  • It looks up the most frequent (n+1)-grams that begin with exactly those n words and offers their final words as probable next words
  • If no such (n+1)-gram exists, it drops the first word of the sentence (e.g. from "i won a" to "won a") and applies the same algorithm again
  • If not even one word matches the first word of a bigram (last chance), it recommends the 6 most used words in English (see the sketch after this list)
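
The backoff lookup described above could be sketched as follows. Here ngram_tables is an assumed structure (a list of data frames, one per n-gram order, with columns prefix, word, and n), and the fallback word list is an illustrative guess; neither is taken from the project's actual code.

```r
# Sketch of the backoff prediction described above. `ngram_tables` is assumed
# to be a list of data frames, one per n-gram order, each with columns:
#   prefix (the first n words), word (the next word), n (its count).
# This format and the fallback word list are hypothetical.
predict_next <- function(sentence, ngram_tables,
                         top_words = c("the", "to", "and", "a", "of", "i")) {
  words <- tolower(unlist(strsplit(trimws(sentence), "\\s+")))
  # Try the longest usable prefix first, then back off one word at a time
  for (len in seq(min(length(words), length(ngram_tables)), 1)) {
    prefix <- paste(tail(words, len), collapse = " ")
    tab <- ngram_tables[[len]]                    # table of (len + 1)-grams
    hits <- tab[tab$prefix == prefix, ]
    if (nrow(hits) > 0) {
      return(head(hits$word[order(-hits$n)], 3))  # top candidate next words
    }
  }
  top_words  # last chance: the most used English words
}
```

For example, predict_next("for the first time", ngram_tables) would first look for fivegrams starting with those four words and back off to shorter prefixes only if none are found.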

App

  • Use this great app to test word predictions