The "Next Word" App

Shaddyjr
10/11/16

Description

This lightweight app accepts a string typed by the user and predicts the 'next word' most likely to follow it.

How it works

  • Data collected from blogs, Twitter, and news reports
    • English, German, Finnish, and Russian
  • Used bigrams from the corpora to form a simple predictive model
    • The most frequent 'next words' are used
  • Only a fraction of the data was practical to use for building the predictive model
    • As a result, accuracy is low (3.725%)

Sample of the bigram table:
      BigramTerms         Count GoodTuringCounts
 [1,] "color being"       "308" "0"             
 [2,] "color codes"       "208" "0"             
 [3,] "colorful resident" "124" "1"             
 [4,] "come a"            "234" "0"             
 [5,] "come to"           "757" "0"             
 [6,] "comes from"        "115" "0"             
 [7,] "comes off"         "123" "0"             
 [8,] "comes on"          "374" "0"             
 [9,] "coming out"        "234" "0"             
[10,] "coming up"         "330" "0"             
[11,] "commercial hehe"   "69"  "0"             
[12,] "commercial or"     "210" "1"             
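
To make the approach above concrete, here is a minimal Python sketch of a bigram "next word" predictor. It is not the app's actual code: the whitespace tokenizer, the function names (`build_bigram_model`, `predict_next`, `accuracy`), and the three-sentence toy corpus are all assumptions made for illustration. The `accuracy` helper shows one plausible way a figure like the one quoted above could be measured, namely the fraction of held-out bigrams whose second word is predicted correctly.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def build_bigram_model(sentences):
    """Map each word to a Counter of the words observed directly after it."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        tokens = tokenize(sentence)
        for first, second in zip(tokens, tokens[1:]):
            followers[first][second] += 1
    return followers


def predict_next(model, text):
    """Return the most frequent follower of the last word in text, or None."""
    tokens = tokenize(text)
    if not tokens or tokens[-1] not in model:
        return None
    return model[tokens[-1]].most_common(1)[0][0]


def accuracy(model, sentences):
    """Fraction of held-out bigrams whose second word is predicted correctly."""
    hits = total = 0
    for sentence in sentences:
        tokens = tokenize(sentence)
        for first, second in zip(tokens, tokens[1:]):
            total += 1
            hits += predict_next(model, first) == second
    return hits / total if total else 0.0


# Toy corpus (hypothetical); the real app was trained on a large text sample.
corpus = [
    "the guests come to the party",
    "they come to town",
    "good ideas come a long way",
]
model = build_bigram_model(corpus)
print(predict_next(model, "When will you come"))  # to
```

Only the last word of the input is used, which is what keeps the lookup table small but also limits how much context the prediction can draw on.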

Practical Applications

Most useful for mobile apps

  • Users want quick predictions
  • The app's main data file takes up only ~190 KB

Functionality could also include:

  • Filling in missing content (like a “Rosetta Stone”)
  • Generating random but well-formed sentences
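
Both ideas can be sketched on top of a bigram table. The Python below is an illustrative assumption, not part of the app: `fill_gap` and `generate` are hypothetical helper names, and the corpus is a toy one. `fill_gap` picks the word that best bridges a left and right neighbour, and `generate` walks the table, sampling each next word in proportion to its count. Note that a bigram chain only guarantees that each adjacent word pair was seen in the data, so the output is not always a grammatical sentence.

```python
import random
from collections import Counter, defaultdict


def build_bigram_model(sentences):
    """Map each word to a Counter of the words observed directly after it."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for first, second in zip(tokens, tokens[1:]):
            followers[first][second] += 1
    return followers


def fill_gap(model, left, right):
    """Pick the word w that maximises count(left, w) * count(w, right), or None."""
    best, best_score = None, 0
    for word, count in model.get(left, {}).items():
        score = count * model.get(word, {}).get(right, 0)
        if score > best_score:
            best, best_score = word, score
    return best


def generate(model, start, max_words=10, rng=None):
    """Random walk over the bigram table, weighted by observed counts."""
    rng = rng or random.Random()
    words = [start]
    # Stop at the length limit or when the last word has no known follower.
    while len(words) < max_words and words[-1] in model:
        candidates = model[words[-1]]
        next_word = rng.choices(
            list(candidates), weights=list(candidates.values())
        )[0]
        words.append(next_word)
    return " ".join(words)


# Toy corpus (hypothetical).
corpus = [
    "we come to the party",
    "they come to town",
    "come a long way",
]
model = build_bigram_model(corpus)
print(fill_gap(model, "come", "the"))  # to
print(generate(model, "they", max_words=6, rng=random.Random(0)))
```

Passing a seeded `random.Random` makes the generated text reproducible, which is handy when comparing models.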

Future Plans

  • Using trigrams would yield a more accurate prediction

  • Using Good-Turing estimation would also increase model accuracy

  • Including all languages, made possible with UTF-8 conversion
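
As a rough sketch of the first two plans, the Python below shows a trigram lookup that backs off to bigrams, and the basic Good-Turing adjusted count c* = (c + 1) · N(c+1) / N(c), where N(c) is the number of distinct n-grams seen exactly c times. Everything here is an assumption for illustration: the function names, the toy corpus, and the choice to keep the raw count when N(c+1) is zero (fuller treatments smooth the N(c) values first).

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """All runs of n consecutive tokens, as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


def build_models(sentences):
    """Return (bigram, trigram) follower tables built from the sentences."""
    bigram = defaultdict(Counter)
    trigram = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for w1, w2 in ngrams(tokens, 2):
            bigram[w1][w2] += 1
        for w1, w2, w3 in ngrams(tokens, 3):
            trigram[(w1, w2)][w3] += 1
    return bigram, trigram


def predict_next(bigram, trigram, text):
    """Use the last two words if that pair was seen, else back off to one."""
    tokens = text.lower().split()
    if len(tokens) >= 2 and tuple(tokens[-2:]) in trigram:
        return trigram[tuple(tokens[-2:])].most_common(1)[0][0]
    if tokens and tokens[-1] in bigram:
        return bigram[tokens[-1]].most_common(1)[0][0]
    return None


def good_turing_counts(counts):
    """Adjusted counts c* = (c + 1) * N[c + 1] / N[c].

    Assumption: when no n-gram has count c + 1, the raw count is kept.
    """
    freq_of_freq = Counter(counts.values())
    adjusted = {}
    for item, c in counts.items():
        if freq_of_freq[c + 1] > 0:
            adjusted[item] = (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]
        else:
            adjusted[item] = float(c)
    return adjusted


# Toy corpus (hypothetical): after "to" alone, "town" is most common,
# but after "come to" the most common continuation is "the".
corpus = [
    "come to the party",
    "come to the city",
    "come to town",
    "go to town",
    "go to town",
]
bigram, trigram = build_models(corpus)
print(predict_next(bigram, trigram, "please come to"))  # the
print(predict_next(bigram, trigram, "to"))              # town
```

The two printed predictions differ because the extra word of context changes which continuation is most frequent, which is the intuition behind expecting trigrams to improve accuracy.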