Coursera via Johns Hopkins University
Data Science Specialization
Capstone Project with SwiftKey
By: Leigh Matthews
February 5, 2018
Build an algorithm that uses Natural Language Processing to predict the next word from a given word or phrase
A very large corpus of raw blog, news, and Twitter data (HC Corpora) is analyzed as one file in R
Summary statistics for the raw and cleaned (preprocessed) data are explored (see the first sketch below)
N-grams are built from the tidy corpus data and analyzed for use in the predictive text model
N-gram models are built for 1-grams through 4-grams (due to Shiny app limitations, only 1-grams and 2-grams are used in the app)
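A minimal sketch of the raw-data exploration, assuming the corpus files sit in a local data/ folder; the file names and paths are illustrative assumptions, not the project's actual layout.

```r
library(stringi)

# Assumed locations of the three raw text files (blogs, news, Twitter)
files <- c(blogs   = "data/en_US.blogs.txt",
           news    = "data/en_US.news.txt",
           twitter = "data/en_US.twitter.txt")

# Basic summary statistics per source: line count, word count, longest line
raw_summary <- t(sapply(files, function(f) {
  txt <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  c(lines     = length(txt),
    words     = sum(stri_count_words(txt)),
    max_chars = max(nchar(txt)))
}))
raw_summary
```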
The raw dataset is cleaned by removing punctuation, capitalization, numbers, extra white space, and stopwords; the data is then stemmed and transformed (see the sketch below)
N-grams are built using RWeka and then visualized
Only the highest-frequency words/phrases are retained for each n-gram
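A minimal sketch of the cleaning and n-gram steps described above, using tm, SnowballC, and RWeka; it assumes the sampled corpus text is already in a character vector named raw_text (an illustrative name).

```r
library(tm)         # cleaning transformations
library(SnowballC)  # stemming backend for stemDocument
library(RWeka)      # n-gram tokenizer

# Build a corpus and apply the preprocessing steps listed above
corpus <- VCorpus(VectorSource(raw_text))
corpus <- tm_map(corpus, content_transformer(tolower))   # remove capitalization
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)

# Bigram tokenizer passed to the TermDocumentMatrix (min = max = 2 gives 2-grams)
bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm_2 <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))

# Keep only the highest-frequency 2-grams
freq_2 <- sort(rowSums(as.matrix(tdm_2)), decreasing = TRUE)
head(freq_2, 10)
```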
The application provides a text input box where the user types a word or phrase
The typed words are passed to the trained algorithm, which predicts the next word
The model backs off from the 2-gram to the 1-gram table (due to app restrictions)
The highest-probability word is returned as the prediction
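A minimal sketch of this back-off logic, assuming two named frequency vectors from the n-gram step: freq_2 (bigram counts, names like "of the") and freq_1 (unigram counts). The function and object names are illustrative, not the app's actual code.

```r
predict_next_word <- function(phrase, freq_2, freq_1) {
  words     <- strsplit(tolower(phrase), "\\s+")[[1]]
  last_word <- tail(words, 1)

  # 2-gram level: bigrams whose first word matches the last word typed
  hits <- freq_2[grepl(paste0("^", last_word, " "), names(freq_2))]
  if (length(hits) > 0) {
    best <- names(hits)[which.max(hits)]
    return(strsplit(best, " ")[[1]][2])   # second word of the best bigram
  }

  # Back off to the 1-gram level: most frequent single word overall
  names(freq_1)[which.max(freq_1)]
}

# Example call
predict_next_word("thanks for", freq_2, freq_1)
```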
Capstone Progress Report
The project uses several language processing packages:
tm: used to read the corpus of documents from a folder and create a VCorpus
NLP and SnowballC: used to clean the data and create n-grams
RWeka: used to create a tokenizer and build n-grams from a TermDocumentMatrix
dplyr: used to identify and plot the most frequently used terms and n-grams
The tm package was the primary package used for this project
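A minimal sketch of how these packages fit together; the folder path and frequency cutoff are assumptions, and in practice a sample of the corpus keeps the matrices manageable.

```r
library(tm)
library(RWeka)
library(dplyr)

# tm: read every text file in the folder into a VCorpus
docs <- VCorpus(DirSource("data/final/en_US", encoding = "UTF-8"),
                readerControl = list(language = "en"))

# RWeka: unigram tokenizer used when building the TermDocumentMatrix
unigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
tdm <- TermDocumentMatrix(docs, control = list(tokenize = unigram_tokenizer))

# dplyr: rank terms by total frequency across the documents
m <- as.matrix(tdm)
term_freq <- data.frame(term = rownames(m), count = rowSums(m)) %>%
  arrange(desc(count)) %>%
  head(20)
term_freq
```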
The final application is deployed on the shiny server at: https://leigh-math.shinyapps.io/Capstone/