Capstone Project: Word Guesser

Ryan Swartz
2014 Dec 16

How It Works

  • Using the app takes only one step: entering a snippet of text
  • The app then uses its underlying algorithm to report the five words most likely to come next in the snippet


Underlying Algorithm

Several components form the underlying algorithm of the Word Guesser app:

  • A back-off model was trained on 30% of the English Twitter data, a limit chosen to keep memory use manageable
  • First, matches in the quad-gram data from this training set are considered most likely (more on that in the next slide) and are listed in decreasing order of their probability of completing the snippet
  • Backing off, matches in the tri-gram data are considered the next most likely and are listed in decreasing order of probability
  • For good measure, the app applies a Kneser-Ney approach to the bi-grams from this training data to find the words most likely to complete the snippet, which rounds out the list of guesses (Gauthier)
  • The results from the back-off and Kneser-Ney steps are then combined into one list, from which the app displays the top five guesses
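
The pipeline described above can be sketched in a few lines of Python. This is a minimal illustration and not the app's actual code: the function names (`ngram_counts`, `kneser_ney_bigram`, `predict`), the whitespace tokenizer, the toy corpus, and the discount of 0.75 are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def ngram_counts(sentences, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i:i + n]
            table[tuple(context)][word] += 1
    return table


def kneser_ney_bigram(bigrams, prev, discount=0.75):
    """Interpolated Kneser-Ney probability of every known word following `prev`.

    The discount value is an assumed default, not taken from the app.
    """
    followers = bigrams.get((prev,), Counter())
    total = sum(followers.values())
    # Continuation count: in how many distinct contexts each word appears.
    continuation = Counter()
    for nexts in bigrams.values():
        for word in nexts:
            continuation[word] += 1
    bigram_types = sum(continuation.values())
    if total == 0:
        # Unseen previous word: fall back to the continuation probability alone.
        return {w: c / bigram_types for w, c in continuation.items()}
    lam = discount * len(followers) / total
    return {
        w: max(followers[w] - discount, 0) / total + lam * c / bigram_types
        for w, c in continuation.items()
    }


def predict(snippet, quad, tri, bi, k=5):
    """Quad-gram matches first, then tri-gram matches, then Kneser-Ney bi-grams."""
    tokens = snippet.lower().split()
    guesses = []
    for table, n in ((quad, 4), (tri, 3)):
        context = tuple(tokens[-(n - 1):])
        if len(context) == n - 1:
            for word, _ in table.get(context, Counter()).most_common():
                if word not in guesses:
                    guesses.append(word)
    if tokens:
        scores = kneser_ney_bigram(bi, tokens[-1])
        for word in sorted(scores, key=scores.get, reverse=True):
            if word not in guesses:
                guesses.append(word)
    return guesses[:k]


# Toy corpus standing in for the Twitter training sample.
corpus = [
    "i want to go home".split(),
    "i want to go out".split(),
    "i want to go home".split(),
    "we need to eat now".split(),
]
quad, tri, bi = (ngram_counts(corpus, n) for n in (4, 3, 2))
print(predict("i want to go", quad, tri, bi))
```

Listing the higher-order matches first and appending lower-order ones only when they are new mirrors the ordering in the bullets; a full back-off model would also reweight probabilities across orders, which this sketch leaves out.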

Kneser-Ney: [image]
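
For reference, the standard interpolated Kneser-Ney estimate for bi-grams can be written as follows. This is the textbook form and not necessarily the exact variant used in the app; the symbols c (count) and d (discount, commonly set near 0.75) are notation chosen for this note.

```latex
P_{KN}(w_i \mid w_{i-1}) =
  \frac{\max\bigl(c(w_{i-1} w_i) - d,\ 0\bigr)}{c(w_{i-1})}
  + \lambda(w_{i-1})\, P_{cont}(w_i)

\lambda(w_{i-1}) =
  \frac{d}{c(w_{i-1})}\, \bigl|\{\, w : c(w_{i-1} w) > 0 \,\}\bigr|

P_{cont}(w_i) =
  \frac{\bigl|\{\, w' : c(w' w_i) > 0 \,\}\bigr|}
       {\bigl|\{\, (w', w'') : c(w' w'') > 0 \,\}\bigr|}
```

The continuation probability rewards words that follow many different words, rather than words that are merely frequent, which is why it suits the final fallback step.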

How It Performs

In 20 separate tests, each using a random sample of 1,000 tweets from the remaining 70% of the Twitter corpus, the application performed as follows:

  • On average, the model correctly predicts the next word in the snippet with its first guess 11.8% of the time
  • When the algorithm is altered to use only tri-grams and Kneser-Ney, the accuracy dips to 10.9%
  • This gap between the two approaches persisted and widened when accuracy was measured as having the correct word among the first five guesses
  • Using this criterion, the application has a final accuracy of 25.5% on average
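
The two accuracy measures can be computed with a short harness like the one below. It is a sketch under stated assumptions: the slides do not say which word of each tweet was held out, so predicting the final word is assumed here, and `top_k_accuracy`, `evaluate_runs`, and `dummy_predict` are hypothetical names.

```python
import random
from statistics import mean


def top_k_accuracy(predict_fn, tweets, k=5):
    """Return (first-guess accuracy, top-k accuracy) over the given tweets.

    Assumption: the last word of each tweet is the prediction target.
    """
    first_hits = top_k_hits = trials = 0
    for tweet in tweets:
        tokens = tweet.lower().split()
        if len(tokens) < 2:
            continue  # no context to predict from
        snippet, target = " ".join(tokens[:-1]), tokens[-1]
        guesses = predict_fn(snippet)[:k]
        trials += 1
        if guesses and guesses[0] == target:
            first_hits += 1
        if target in guesses:
            top_k_hits += 1
    if trials == 0:
        return 0.0, 0.0
    return first_hits / trials, top_k_hits / trials


def evaluate_runs(predict_fn, held_out, runs=20, sample_size=1000, seed=0):
    """Average both accuracies over several random samples of held-out tweets."""
    rng = random.Random(seed)
    results = [
        top_k_accuracy(predict_fn, rng.sample(held_out, min(sample_size, len(held_out))))
        for _ in range(runs)
    ]
    return mean(r[0] for r in results), mean(r[1] for r in results)


def dummy_predict(snippet):
    """Placeholder predictor that always returns the same five guesses."""
    return ["home", "out", "now", "you", "day"]


sample = ["i want to go home", "we went out", "good morning"]
first_acc, top5_acc = top_k_accuracy(dummy_predict, sample)
print(first_acc, top5_acc)
```

Swapping `dummy_predict` for the real predictor, and `sample` for the held-out 70% of the corpus, would reproduce the style of test described above.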

How To Make It Better

One way to improve the performance of the application is to include additional data:

  • More Twitter
  • Wikipedia articles
  • Definitions from Urban Dictionary to capture slang

Increasing the efficiency of the algorithm would also improve the user experience through faster response times

Lastly, once accuracy and speed have improved, making the site more visually appealing might attract more users