Text Prediction Using N-Grams

Coursera via Johns Hopkins University
Data Science Specialization
Capstone Project with SwiftKey

By: Leigh Matthews

February 5, 2018

Building a Predictive Text Model: Goals

  • Build an algorithm that predicts the next word from a given word or phrase, using Natural Language Processing

  • The very large HC Corpora dataset of raw data from blogs, news, and Twitter is analyzed as one file in R

  • Summary statistics for the raw and cleaned (preprocessed) data are explored

  • N-grams are built from the tidied corpus and analyzed for use in the predictive text model

Building the Algorithm

  • N-gram modeling is used for 1-grams through 4-grams (for the Shiny app, only 1-grams and 2-grams are used due to app limitations)

  • The raw dataset was cleaned by removing punctuation, numbers, extra whitespace, and stopwords and by converting all text to lowercase; the data was then stemmed and transformed

  • N-grams are built with RWeka and then visualized (see the sketch after this list)

  • Only the highest-frequency words/phrases were retained for each n-gram
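
The sketch below illustrates these cleaning and n-gram steps with tm, SnowballC, and RWeka. It is a minimal example: the file name en_US.sample.txt stands in for a sample of the raw corpus rather than the project's actual file.

    library(tm)
    library(SnowballC)
    library(RWeka)

    # Hypothetical sample file drawn from the raw corpus
    raw <- readLines("en_US.sample.txt", skipNul = TRUE)
    corpus <- VCorpus(VectorSource(raw))

    # Cleaning steps listed above
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    corpus <- tm_map(corpus, stripWhitespace)
    corpus <- tm_map(corpus, stemDocument)

    # Bigram tokenizer via RWeka; adjust min/max for other n-gram orders
    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))

    # Retain only the highest-frequency bigrams
    freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
    head(freq, 10)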

Shiny App Interface

  • The application provides a text input box where the user types a word or phrase

  • The typed words are fed to the built algorithm, which predicts the next word

  • The lookup backs off from the 2-gram to the 1-gram model (due to app restrictions; a sketch follows this list)

  • The highest-probability word is returned
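
Below is a minimal sketch of how such a backoff lookup could be wired into Shiny. The function predict_next and the bigrams/unigrams frequency tables (columns word1, word2, freq) are hypothetical names used for illustration, not the deployed app's code.

    library(shiny)
    library(dplyr)

    # Backoff: try the 2-gram table first, else fall back to the top 1-gram.
    # Assumes precomputed frequency tables:
    #   bigrams:  word1, word2, freq
    #   unigrams: word1, freq
    predict_next <- function(phrase, bigrams, unigrams) {
      last_word <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 1)
      hit <- bigrams %>% filter(word1 == last_word) %>% arrange(desc(freq))
      if (nrow(hit) > 0) return(hit$word2[1])
      unigrams %>% arrange(desc(freq)) %>% slice(1) %>% pull(word1)
    }

    ui <- fluidPage(
      textInput("phrase", "Type a word or phrase:"),
      textOutput("prediction")
    )
    server <- function(input, output) {
      output$prediction <- renderText({
        req(input$phrase)
        predict_next(input$phrase, bigrams, unigrams)  # tables loaded elsewhere, e.g. via readRDS()
      })
    }
    shinyApp(ui, server)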

R Packages Used

The project uses several language-processing packages:

  • tm: used to read the corpus of documents in a folder and create a VCorpus (see the sketch at the end of this section)

  • NLP and SnowballC: used to clean the data and create n-grams

  • RWeka: used to create a tokenizer and build n-grams from a TermDocumentMatrix

  • dplyr: used to identify and plot the most frequent terms and n-grams

The tm package was the primary package used for this project.
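
As an illustration of that workflow, a folder of corpus documents can be read into a VCorpus as sketched below (final/en_US is an assumed path to the unzipped corpus files):

    library(tm)

    # Read every text file in the folder into a VCorpus
    docs <- VCorpus(DirSource("final/en_US", encoding = "UTF-8"),
                    readerControl = list(language = "en"))
    summary(docs)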

The final application is deployed on the Shiny server at: https://leigh-math.shinyapps.io/Capstone/