Text Prediction Using N-Grams

Coursera via Johns Hopkins University
Data Science Specialization
Capstone Project with SwiftKey

By: Leigh Matthews

February 5, 2018

Building a Predictive Text Model: Goals

  • Build an algorithm that predicts the next word from a given word or phrase, using Natural Language Processing

  • The very large HC Corpora dataset of raw data from blogs, news, and Twitter is analyzed as one file in R

  • Summary statistics for the raw and cleaned (preprocessed) data are explored

  • N-grams are built from the tidied corpus and analyzed for use in the predictive text model

Building the Algorithm

  • N-gram modeling is used for 1-grams through 4-grams (for the Shiny app, only 1-grams and 2-grams are used due to app limitations)

  • The raw dataset was cleaned by removing punctuation, numbers, extra whitespace, and stopwords and by converting all text to lowercase; the data was then stemmed and transformed

  • N-grams are built with RWeka and then visualized (see the sketch after this list)

  • Only the highest-frequency words/phrases were retained for each n-gram
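
The sketch below illustrates these cleaning and n-gram steps with tm, SnowballC, and RWeka. It is a minimal example: the file name en_US.sample.txt stands in for a sample of the raw corpus rather than the project's actual file.

    library(tm)
    library(SnowballC)
    library(RWeka)

    # Hypothetical sample file drawn from the raw corpus
    raw <- readLines("en_US.sample.txt", skipNul = TRUE)
    corpus <- VCorpus(VectorSource(raw))

    # Cleaning steps listed above
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    corpus <- tm_map(corpus, stripWhitespace)
    corpus <- tm_map(corpus, stemDocument)

    # Bigram tokenizer via RWeka; adjust min/max for other n-gram orders
    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))

    # Retain only the highest-frequency bigrams
    freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
    head(freq, 10)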

Shiny App Interface

  • The application provides a text input box where the user types a word or phrase

  • The typed words are fed to the built algorithm, which predicts the next word

  • The lookup backs off from the 2-gram to the 1-gram model (due to app restrictions; a sketch follows this list)

  • The highest-probability word is returned
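
Below is a minimal sketch of how such a backoff lookup could be wired into Shiny. The function predict_next and the bigrams/unigrams frequency tables (columns word1, word2, freq) are hypothetical names used for illustration, not the deployed app's code.

    library(shiny)
    library(dplyr)

    # Backoff: try the 2-gram table first, else fall back to the top 1-gram.
    # Assumes precomputed frequency tables:
    #   bigrams:  word1, word2, freq
    #   unigrams: word1, freq
    predict_next <- function(phrase, bigrams, unigrams) {
      last_word <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 1)
      hit <- bigrams %>% filter(word1 == last_word) %>% arrange(desc(freq))
      if (nrow(hit) > 0) return(hit$word2[1])
      unigrams %>% arrange(desc(freq)) %>% slice(1) %>% pull(word1)
    }

    ui <- fluidPage(
      textInput("phrase", "Type a word or phrase:"),
      textOutput("prediction")
    )
    server <- function(input, output) {
      output$prediction <- renderText({
        req(input$phrase)
        predict_next(input$phrase, bigrams, unigrams)  # tables loaded elsewhere, e.g. via readRDS()
      })
    }
    shinyApp(ui, server)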

R Packages Used

The project uses several language-processing packages:

  • tm: used to read the corpus of documents in a folder and create a VCorpus (see the sketch at the end of this section)

  • NLP and SnowballC: used to clean the data and create n-grams

  • RWeka: used to create a tokenizer and build n-grams from a TermDocumentMatrix

  • dplyr: used to identify and plot the most frequent terms and n-grams

The tm package was the primary package used for this project.
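
As an illustration of that workflow, a folder of corpus documents can be read into a VCorpus as sketched below (final/en_US is an assumed path to the unzipped corpus files):

    library(tm)

    # Read every text file in the folder into a VCorpus
    docs <- VCorpus(DirSource("final/en_US", encoding = "UTF-8"),
                    readerControl = list(language = "en"))
    summary(docs)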

The final application is deployed on the Shiny server at: https://leigh-math.shinyapps.io/Capstone/