Next Word Prediction

David Seibel
April 2015

seibeldb@gmail.com | LinkedIn

Johns Hopkins

Data Science Specialization

Capstone Project

Next Word Prediction for Tweets

  • The Capstone project requires learning natural language processing (NLP) concepts and building an application that predicts the next word, given a partial phrase. The idea of completing someone else's sentence has far-reaching implications, so I added a sentence prediction feature and an extra tab called 'Watch It Work' so you can see the underlying data.

  • Click This to run my app from the web!

  • SwiftKey provided guidance. It was founded in Britain by Jon Reynolds and Ben Medlock, in 2008, to build technology that makes it easy for everyone to create and communicate on mobile. Stephen Hawking uses a customized version.

Natural Language Processing (NLP)

NLP is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human languages. Many challenges in NLP involve natural language understanding and generation. NLP resources used include:

  • The Stanford NLP and Machine Learning classes on Coursera and YouTube.
  • Wiki pages on many NLP topics.
  • The book “Speech and Language Processing” by D. Jurafsky and J. Martin.
  • R packages and documentation for Text Mining, NLP and data manipulation including: tm, NLP, openNLP, stylo, qdap, RWeka, stringi, and data.table.

Model Description

Pre-processing involved: converting to lower case, marking sentence boundaries, converting apostrophes, removing punctuation, and tokenizing. N-grams are word sequences; tetragrams have four words. The model uses indexed data tables to match the last three words of the user input with the first three words in the tetragram collection. The highest-frequency matching tetragram is selected and its fourth word becomes the “prediction”. When there is no tetragram match, the model falls back to trigrams and then bigrams. The word 'I' is predicted when the user provides no input. The word 'beautiful' is printed when no match is found at all. This method is simple and fast, with an out-of-sample accuracy of 19%. In addition, it has a feature called context that selects a lower-frequency prediction when that prediction matches the context of the prior words in the sentence. This improves accuracy but must be controlled to prevent run-on sentence prediction. Its vocabulary is limited at this time.
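
The back-off lookup described above can be sketched in a few lines. This is a minimal illustration in Python, not the app's actual R code: the function names (tokenize, build_tables, predict), the dictionary-based tables, and the choice to simply drop apostrophes are all assumptions made for the example, and the context feature is left out.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase, split on sentence boundaries, strip punctuation, tokenize.

    Assumption: apostrophes are simply dropped ("don't" -> "dont").
    Returns one list of word tokens per non-empty sentence.
    """
    sentences = re.split(r"[.!?]+", text.lower())
    result = []
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.replace("'", ""))
        if tokens:
            result.append(tokens)
    return result


def build_tables(corpus, max_n=4):
    """For n = 2..max_n, map each (n-1)-word prefix to next-word counts."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for tokens in tokenize(corpus):
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                prefix = tuple(tokens[i:i + n - 1])
                tables[n][prefix][tokens[i + n - 1]] += 1
    return tables


def predict(tables, phrase, max_n=4):
    """Return the most frequent next word, backing off 4-gram -> 3 -> 2."""
    sentences = tokenize(phrase)
    if not sentences:
        return "I"  # default when the user provides no input
    tokens = sentences[-1]  # only the current sentence is matched
    for n in range(max_n, 1, -1):
        if len(tokens) < n - 1:
            continue  # not enough words for this n-gram order
        candidates = tables[n].get(tuple(tokens[-(n - 1):]))
        if candidates:
            return candidates.most_common(1)[0][0]
    return "beautiful"  # default when nothing matches at all


corpus = ("I love you so much. I love you so much. "
          "I love you too. Thank you so much.")
tables = build_tables(corpus)
print(predict(tables, "I love you"))  # tetragram match
print(predict(tables, "hello you"))   # backs off to bigrams
```

Storing each prefix as a dictionary key plays the role of the indexed data tables: a lookup costs about the same no matter how large the n-gram collection grows, which is what keeps this kind of model fast.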

Let's Get Started

Click here to launch my app pictured below!