Data Science Capstone Project

M. Dagli
10/6/2016

The Problem

“Next word prediction” is a common need in many software applications, from text messaging to word processing.

  • In text messaging, accurate prediction makes the user's experience easier and more enjoyable, as it saves the time and effort of typing entire words.
  • In word processing, accurate prediction can also assist with computer-aided checks for linguistic errors.

The goal of this capstone project was to develop a Shiny Application that guesses a user's next word based on a predictive model.

Corpora Development

  • To create a predictive model, a corpus of blogs (899 thousand lines), news articles (77 thousand), and tweets (2.5 million lines) was used.
  • To make development of the predictive model computationally feasible given limited time and processing power, 2% of each dataset was randomly sampled, and the samples were combined into a single data set.
  • The data set was then cleaned: hashtags and extra white space were removed, and all words were converted to lower case. Stopwords and punctuation were retained, as both appeared helpful for this predictive model.
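
The sampling and cleaning steps above could be sketched roughly as follows. This is an illustrative Python sketch, not the project's own code: the function names `sample_lines` and `clean_line`, the fixed seed, the hashtag pattern, and the toy input lines are all assumptions introduced here.

```python
import random
import re

def sample_lines(lines, fraction=0.02, seed=42):
    """Return a random subset containing roughly `fraction` of the lines."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    k = min(len(lines), max(1, round(len(lines) * fraction)))
    return rng.sample(lines, k)

def clean_line(text):
    """Lower-case, drop hashtags, and collapse extra white space.

    Stopwords and punctuation are deliberately kept.
    """
    text = text.lower()
    text = re.sub(r"#\w+", "", text)  # assumed hashtag pattern: '#' plus word characters
    text = re.sub(r"\s+", " ", text)  # collapse runs of white space
    return text.strip()

# Hypothetical toy inputs standing in for the blog, news, and tweet files.
blogs = ["A  Blog line, number %d." % i for i in range(100)]
news = ["News line %d!" % i for i in range(50)]
tweets = ["Tweet %d #hashtag  here" % i for i in range(200)]

# Sample each source separately, then combine into one cleaned data set.
combined = [clean_line(line)
            for source in (blogs, news, tweets)
            for line in sample_lines(source)]
```

Sampling each source separately before combining keeps the 2% proportion per source, which matches the description above.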

Model Development

  • An n-gram predictive model was used.
  • The Quanteda package was used to create document-feature matrices (DFMs) of bigram, trigram, and quadgram frequencies. Each of these DFMs was used to create its respective n-gram term-frequency matrix. To maximize efficiency, each DFM was reduced to include only n-grams with at least 2 occurrences.
  • Prediction was performed using a simple maximum likelihood estimate (MLE) with a “stupid backoff” algorithm, as follows.
      1. The input is transformed in the same way as the training set.
      2. The last three input words are searched for in the quadgram frequency matrix.
      3. If there is no match, back off to searching for the last two words in the trigram frequency matrix.
      4. If there is no match, back off to searching for the last word in the bigram frequency matrix.
      5. If there is still no match, display “no match”.
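
The n-gram tables and the backoff steps above can be illustrated with a small Python sketch. It is not the application's code (which used Quanteda in R): the function names `build_ngram_tables` and `predict_next`, the dictionary-based tables, and the toy corpus are assumptions made for illustration, and the input cleaning is simplified to lower-casing and splitting on white space.

```python
from collections import Counter, defaultdict

def build_ngram_tables(sentences, max_n=4, min_count=2):
    """Map each (n-1)-word context to a Counter of next words, for n = 2..max_n."""
    counts = {n: Counter() for n in range(2, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.split()
        for n in counts:
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1

    tables = {n: defaultdict(Counter) for n in counts}
    for n, ngram_counts in counts.items():
        for ngram, count in ngram_counts.items():
            if count >= min_count:  # keep only n-grams seen at least min_count times
                tables[n][ngram[:-1]][ngram[-1]] = count
    return tables

def predict_next(text, tables, max_n=4):
    """Back off from the longest context to the shortest; None means no match."""
    tokens = text.lower().split()  # simplified stand-in for the full cleaning step
    for n in range(max_n, 1, -1):
        if len(tokens) < n - 1:
            continue  # not enough input words for this n-gram order
        context = tuple(tokens[-(n - 1):])
        candidates = tables[n].get(context)
        if candidates:
            # The most frequent continuation is the maximum likelihood choice
            # for this context.
            return candidates.most_common(1)[0][0]
    return None

# Hypothetical toy corpus, already cleaned.
corpus = [
    "thanks for the follow",
    "thanks for the follow",
    "thanks for the help",
    "see you at the end of the day",
    "at the end of the day",
]
tables = build_ngram_tables(corpus)

print(predict_next("Thanks for the", tables))  # matched in the quadgram table
print(predict_next("out of the", tables))      # backs off to the trigram table
print(predict_next("xyz", tables))             # no match at any order
```

In this sketch a return value of `None` corresponds to step 5, where the application would display “no match”.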

The Application

Here is an example of the application in action!

[Screenshots of the application]