Coursera Data Science - Capstone Project

Alexander Schniertshauer
2016-06-03

This presentation describes a Shiny application that predicts the next word in a sentence. The app was developed for the Capstone Project of the Coursera Data Science Specialization, which Johns Hopkins University offers through Coursera in partnership with SwiftKey.

Assignment

The goals of the project were to:

  • Use publicly available data from the HC Corpus (English texts from news, tweets and blogs containing a total of about 2.4 million records) to create a word prediction model based on patterns in the texts, such as n-gram frequencies.
  • Make sure that the corpus texts are properly cleansed and preprocessed before the model is created.
  • Build an app with Shiny that incorporates the prediction model so that a user can key in words (as with SwiftKey) and get a prediction for the next word.

Solution Principles

The solution is built on the following principles:

  • Convert and cleanse the HC Corpus so that each document contains exactly one sentence, punctuation, numbers, etc. are removed, and all characters are changed to lowercase.
  • Draw a random sample (30%) to balance completeness of the corpus against efficiency in building the language model. Use a separate holdout set to measure accuracy.
  • Determine conditional frequencies of n-grams (unigrams, bigrams, trigrams and four-grams) based on the cleansed sample. Filter out all n-grams with frequency 1.
  • Use stupid backoff as the prediction strategy. Compute the required scores from the conditional n-gram frequencies and save them in a data frame to be used by the Shiny app.
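
The principles above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the code behind the app. The function names (`clean_sentence`, `count_ngrams`, `build_score_table`) are hypothetical, the cleansing rule (replace everything except the letters a to z with a space) is a simplification, and the score is assumed to be the plain relative frequency count(context + word) / count(context).

```python
import re
from collections import Counter


def clean_sentence(text):
    """Lowercase the text and drop punctuation, digits and other non-letters."""
    text = re.sub(r"[^a-z\s]+", " ", text.lower())
    return text.split()


def count_ngrams(sentences, max_n=4):
    """Return {n: Counter of n-gram tuples} for n = 1..max_n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = clean_sentence(sentence)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts


def build_score_table(counts, min_count=2):
    """Map each context tuple to [(next_word, score), ...], best score first.

    n-grams seen fewer than min_count times are dropped, which mirrors
    filtering out all n-grams with frequency 1.
    """
    total_unigrams = sum(counts[1].values())
    table = {}
    for n, counter in counts.items():
        for ngram, count in counter.items():
            if count < min_count:
                continue
            context, word = ngram[:-1], ngram[-1]
            # Unigrams have an empty context, so divide by the corpus size.
            denominator = counts[n - 1][context] if n > 1 else total_unigrams
            table.setdefault(context, []).append((word, count / denominator))
    for suggestions in table.values():
        suggestions.sort(key=lambda pair: pair[1], reverse=True)
    return table
```

For a toy corpus such as `["The cat sat.", "The cat ran!", "the cat sat"]`, the table maps the context `("the", "cat")` to the single next word `sat`, because the trigram "the cat ran" occurs only once and is filtered out.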

App - Algorithm

The word prediction proceeds in the following steps:

  • Load the base file with n-grams from the sampled corpus, next words and scores.
  • Convert the input to uni-, bi- and trigrams, removing punctuation, numbers and whitespace and changing all characters to lowercase.
  • Look for matches between the input and the base file using the trigram. If there are three matches, return their next words from the base file as recommendations.
  • If there are fewer than three matches, look in the bigrams. Sort the resulting matches (trigrams and bigrams) by score and return the three next words with the highest scores.
  • If there are still fewer than three matches with the bigrams, fall back to unigrams. If that is still not sufficient, include the three most common words (the, to, and) with their scores.
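
The lookup steps above can be sketched as follows, again in Python rather than the app's own code. The table `SCORES`, the default scores in `DEFAULTS` and the function names are hypothetical toy values for illustration. The discount `ALPHA = 0.4` per back-off level is the value commonly used with stupid backoff; the slides do not state which value the app uses, or whether the discount is applied when the scores are stored or at lookup time.

```python
import re

ALPHA = 0.4  # assumed back-off discount, not taken from the slides

# Hypothetical toy score table: context tuple -> [(next_word, score), ...]
SCORES = {
    ("thanks", "for", "the"): [("follow", 0.5), ("support", 0.25)],
    ("for", "the"): [("first", 0.2), ("follow", 0.1)],
    ("the",): [("same", 0.05)],
}
# Most common words as a last resort; the scores are placeholders.
DEFAULTS = [("the", 0.05), ("to", 0.03), ("and", 0.02)]


def tokenize(text):
    """Lowercase the input and keep only runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def predict(text, scores=SCORES, defaults=DEFAULTS, k=3, alpha=ALPHA):
    """Return up to k (word, score) suggestions, best first."""
    words = tokenize(text)
    best = {}
    max_context = min(3, len(words))
    # Try the longest context first, then back off to shorter ones.
    for level, size in enumerate(range(max_context, 0, -1)):
        context = tuple(words[-size:])
        for word, score in scores.get(context, []):
            discounted = score * alpha ** level
            if discounted > best.get(word, 0.0):
                best[word] = discounted
        if len(best) >= k:
            break
    if len(best) < k:
        # Still short of k suggestions: add the most common words.
        for word, score in defaults:
            best.setdefault(word, score * alpha ** max_context)
    ranked = sorted(best.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```

With this toy table, `predict("Thanks for the")` suggests `follow`, `support` and `first` in that order: the first two come from the three-word context, and the third is found only after backing off to the two-word context.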

App - Look and Feel

My Shiny app looks like this: [screenshot: Test]

Go to my app to test it.

Results and Learnings

My main takeaways are:

  • A very interesting challenge, as I had never worked with NLP before.
  • A hopefully pleasant interface, since this is a really important element for any user (who is most likely not a data scientist).
  • I am happy with the accuracy achieved: repeated sampling on my holdout set gave an accuracy of 26% to 31% (the percentage of cases where one of the three suggestions matched the next word in the text), which demonstrated the strength and value of the sampling approach combined with stupid backoff.
  • Tapping into a new area was more labor-intensive than I expected and left me no time to explore fundamentally different approaches (I initially also wanted to try word2vec…).
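
The accuracy measure quoted above (the share of cases where one of the three suggestions equals the actual next word) can be written down as a short sketch. This is an illustration in Python, not the evaluation script used for the project. It assumes that every word position after the first in each holdout sentence counts as one case, and `always_common` is a hypothetical stand-in for the real predictor.

```python
def top_k_accuracy(sentences, predict, k=3):
    """Share of positions where the true next word is among the first k suggestions.

    sentences: list of token lists from the holdout set.
    predict:   function mapping the preceding tokens to a ranked list of words.
    """
    hits = 0
    total = 0
    for tokens in sentences:
        for i in range(1, len(tokens)):
            suggestions = predict(tokens[:i])[:k]
            hits += tokens[i] in suggestions
            total += 1
    return hits / total if total else 0.0


def always_common(previous_tokens):
    """Hypothetical baseline predictor: always suggest the three most common words."""
    return ["the", "to", "and"]
```

Passing a baseline such as `always_common` gives a lower bound against which the accuracy of a back-off model can be compared.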