Data Science Capstone Project

M. Dagli
10/6/2016

The Problem

“Next word prediction” is a common need in many software applications, from text messaging to word processing.

  • In text messaging, accurate prediction makes the user's experience easier and more enjoyable by saving the time and effort of typing entire words.
  • In word processing, accurate prediction can also assist with computer-aided checks for linguistic errors.

The goal of this capstone project was to develop a Shiny Application that guesses a user's next word based on a predictive model.

Corpora Development

  • To create a predictive model, a corpus of blogs (899 thousand lines), news articles (77 thousand lines), and tweets (2.5 million lines) was used.
  • To make development of the predictive model computationally feasible given limited time and processing power, a random 2% sample was drawn from each dataset and the samples were combined into a single data set.
  • The data set was then cleaned: hashtags and extra white space were removed, and all words were converted to lower case. Stopwords and punctuation were retained, as both appeared to be helpful for this predictive model.
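The sampling and cleaning steps above can be sketched as follows. This is a minimal Python illustration, not the project's own code: the function names `sample_lines` and `clean_line`, the fixed seed, and the hashtag pattern `#\w+` are all assumptions made for the example.

```python
import random
import re


def sample_lines(lines, fraction=0.02, seed=42):
    """Return a random sample of roughly `fraction` of the lines.

    The seed is an assumption, added only to make the sample reproducible.
    """
    rng = random.Random(seed)
    k = min(len(lines), max(1, round(len(lines) * fraction)))
    return rng.sample(lines, k)


def clean_line(text):
    """Remove hashtags, lowercase, and collapse extra white space.

    Stopwords and punctuation are deliberately left in place.
    """
    text = re.sub(r"#\w+", "", text)          # assumed hashtag pattern
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of white space


if __name__ == "__main__":
    blogs = [f"Blog line {i}   #tag" for i in range(1000)]
    sample = [clean_line(line) for line in sample_lines(blogs)]
    print(len(sample), sample[0])
```

Applying the same `clean_line` function to user input at prediction time keeps the input consistent with the training data.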

Model Development

  • An n-gram predictive model was used.
  • The Quanteda package was used to create bigram, trigram, and quadgram document-feature matrices, which were then converted to n-gram term-frequency matrices. To maximize efficiency, only n-grams with at least 2 occurrences were used in the model.
  • Prediction was performed using maximum likelihood estimation (MLE) with a “stupid backoff” algorithm, as follows:
      1. The input is transformed in the same way as the training set.
      2. The last 3 input words are searched for in the quadgram frequency matrix.
      3. If there is no match, back off and search for the last 2 words in the trigram frequency matrix.
      4. If there is no match, back off and search for the last word in the bigram frequency matrix.
      5. If there is still no match, display “no match.”
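The frequency tables and the backoff lookup described above can be sketched in a few lines. This is a hedged Python illustration rather than the Quanteda-based implementation used in the project: `build_tables` and `predict_next` are hypothetical names, tokens are split on white space only, and the input transform is reduced to lowercasing.

```python
from collections import Counter, defaultdict


def build_tables(lines, orders=(2, 3, 4), min_count=2):
    """For each order n, map an (n-1)-word prefix to a Counter of next words.

    N-grams seen fewer than `min_count` times are dropped.
    """
    counts = {n: Counter() for n in orders}
    for line in lines:
        tokens = line.split()  # assumption: simple white-space tokenization
        for n in orders:
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1

    tables = {n: defaultdict(Counter) for n in orders}
    for n in orders:
        for gram, count in counts[n].items():
            if count >= min_count:
                tables[n][gram[:-1]][gram[-1]] = count
    return tables


def predict_next(text, tables):
    """Return the most frequent next word, backing off from long to short n-grams."""
    tokens = text.lower().split()  # assumption: stands in for the full cleaning step
    for n in sorted(tables, reverse=True):
        prefix = tuple(tokens[-(n - 1):])
        if len(prefix) < n - 1:
            continue  # not enough input words for this order
        followers = tables[n].get(prefix)
        if followers:
            return followers.most_common(1)[0][0]
    return "no match"


if __name__ == "__main__":
    corpus = ["thanks for the follow"] * 2 + ["for the win"] * 3 + ["the end"]
    tables = build_tables(corpus)
    print(predict_next("Thanks for the", tables))
    print(predict_next("go for the", tables))
```

Within a single order, picking the most frequent continuation of a prefix is the same as picking the word with the highest maximum likelihood estimate, since every candidate shares the same prefix count.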

The Application

Here is an example of the application in action!

[Two images of the application in action]