Next Word Prediction

Yanhua Hou
03/30/17

Presentation for Coursera Data Science Capstone Project

Overview

These slides present the predictive text model built for the Data Science Capstone project.

  • Aim: build a Shiny app that predicts the next word a user will type, given the word sequence entered so far.
  • Content
    • Data source and processing
    • Algorithm for the next-word prediction
    • ShinyApp
  • Link

Data Preparation

  • The data, downloaded from the HC Corpora corpus, consist of English documents from three sources: Twitter, blogs, and news articles. A random 5% sample is drawn from each file and the samples are combined; 80% of the combined data is used for training, 10% for testing, and 10% for validation.

  • Data cleaning involves

    • eliminating emojis and URLs
    • replacing abbreviations and contractions with their full forms
    • converting words to lower case
    • removing profanity and words containing numbers
    • replacing punctuation with spaces
    • removing numbers and meaningless one-, two-, or three-letter words
    • removing extra white space
  • Building an n-gram dictionary

    • create word combinations from 1-grams (covering 98% of the vocabulary) up to 5-grams; for the higher-order grams, drop those that appear only once, then save the tables (see the sketch below).
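
A minimal sketch of this preparation pipeline, assuming base R plus the data.table package; the file names, regular expressions, and helper names here are illustrative assumptions, not the app's actual code:

    # Sample, split, clean, and count n-grams (illustrative sketch)
    library(data.table)
    set.seed(1234)

    sample_lines <- function(path, rate = 0.05) {
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      lines[runif(length(lines)) < rate]           # keep ~5% of the lines
    }
    corpus <- c(sample_lines("en_US.twitter.txt"),
                sample_lines("en_US.blogs.txt"),
                sample_lines("en_US.news.txt"))

    # 80/10/10 split into training / testing / validation
    idx   <- sample(seq_along(corpus))
    n     <- length(corpus)
    train <- corpus[idx[1:floor(0.8 * n)]]
    test  <- corpus[idx[(floor(0.8 * n) + 1):floor(0.9 * n)]]
    valid <- corpus[idx[(floor(0.9 * n) + 1):n]]

    clean_text <- function(x) {
      x <- tolower(x)                              # lower case
      x <- gsub("http\\S+|www\\.\\S+", " ", x)     # drop URLs
      x <- gsub("\\S*[0-9]+\\S*", " ", x)          # words containing numbers
      x <- gsub("[[:punct:]]+", " ", x)            # punctuation -> spaces
      gsub("\\s+", " ", trimws(x))                 # squeeze extra white space
    }
    train <- clean_text(train)

    # Count n-grams; higher-order grams that occur only once are dropped
    ngram_counts <- function(lines, n) {
      grams <- unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(w) {
        if (length(w) < n) return(character(0))
        vapply(seq_len(length(w) - n + 1),
               function(i) paste(w[i:(i + n - 1)], collapse = " "), "")
      }))
      dt <- data.table(gram = grams)[, .(freq = .N), by = gram]
      if (n > 1) dt <- dt[freq > 1]
      setorder(dt, -freq)
      dt
    }
    ngrams <- lapply(1:5, function(n) ngram_counts(train, n))  # saved with saveRDS()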

Algorithm for the Next-word Prediction

  • A Stupid Backoff scheme drives the next-word prediction.
  • Inputs from the user are a partial phrase and the number of matches to show ('n_res').
  • The phrase is cleaned and its last 'nlastW' words are extracted ('nlastW' is the size of the context).
  • Search for 'n_res' candidates, backing off from the min(nlastW + 1, 5)-gram table down to the bigram table, and remove duplicates. If too few matches are found, or the words entered are out of vocabulary, supplement the list with the most frequent unigrams.
  • Rank the candidates by their Stupid Backoff scores.
  • Return a data frame containing the next word, its score, and the order of the n-gram used for the prediction (sketched below).
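
Stupid Backoff scores a candidate word as count(context + word) / count(context), multiplying the score by a fixed factor each time the model backs off to a shorter context (0.4 in Brants et al., 2007; the app's exact factor is an assumption). A minimal sketch of the lookup, reusing clean_text and the ngrams tables from the earlier sketch:

    lambda <- 0.4                                  # backoff factor (assumed)

    predict_next <- function(phrase, n_res = 5) {
      words  <- strsplit(clean_text(phrase), " ", fixed = TRUE)[[1]]
      nlastW <- min(length(words), 4)              # at most a 4-word context
      out <- data.table(nextword = character(0), score = numeric(0),
                        ngram = integer(0))
      for (k in rev(seq_len(nlastW))) {            # (k+1)-gram down to bigram
        ctx     <- paste(tail(words, k), collapse = " ")
        ctx_cnt <- ngrams[[k]][gram == ctx, sum(freq)]
        if (ctx_cnt == 0) next                     # unseen context: back off
        hits <- ngrams[[k + 1]][startsWith(gram, paste0(ctx, " "))]
        if (nrow(hits) == 0) next
        out <- rbind(out, hits[, .(nextword = sub(".* ", "", gram),
                                   score = lambda^(nlastW - k) * freq / ctx_cnt,
                                   ngram = k + 1L)])
      }
      out <- out[!duplicated(nextword)]            # keep highest-order match
      if (nrow(out) < n_res) {                     # too few matches or OOV:
        uni <- ngrams[[1]][!gram %in% out$nextword]  # top unigrams as filler
        uni <- head(uni, n_res - nrow(out))
        out <- rbind(out, uni[, .(nextword = gram,
                                  score = lambda^nlastW * freq / sum(ngrams[[1]]$freq),
                                  ngram = 1L)])
      }
      setorder(out, -score)
      head(out, n_res)                             # nextword, score, n-gram order
    }

For example, predict_next("thanks for the", 3) would return the three highest-scoring candidate continuations.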

ShinyApp