Capstone Final Project

2/26/2020

NLP SwiftKey Word Prediction Text

      Coursera/Johns Hopkins University 
      Data Science Specialization

Cleaning & Data Exploration

Two large text files can be downloaded from the SwiftKey dataset. Data cleaning was required to reduce the size of the files to 20K each and to build a large corpus from the data. The corpus was then analyzed after removing a number of unneeded text characteristics:

* Replace unused characters with spaces
* Convert to lowercase
* Remove inappropriate language, punctuation, numbers, etc.
* N-grams were extracted from the corpus (uni, bi, tri) and then charted
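The cleaning and N-gram extraction steps listed above can be sketched roughly as follows. This is an illustration in Python, not the project's R code; the function names, the `BLOCKED_WORDS` placeholder list, and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical placeholder: the actual word filter list is not given in the source.
BLOCKED_WORDS = {"badword"}

def clean_text(text):
    """Lowercase the text, replace unwanted characters with spaces, and tokenize."""
    text = text.lower()
    # Anything that is not a letter, apostrophe, or whitespace becomes a space
    # (this drops punctuation and numbers).
    text = re.sub(r"[^a-z'\s]", " ", text)
    # Splitting on whitespace also collapses the runs of spaces created above.
    return [t for t in text.split() if t not in BLOCKED_WORDS]

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

sample = "The cat sat on the mat. The cat ate 2 fish!"
tokens = clean_text(sample)
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

print(unigrams.most_common(1))  # [(('the',), 3)]
print(bigrams.most_common(1))   # [(('the', 'cat'), 2)]
```

Note that this sketch lets N-grams run across sentence boundaries, since punctuation is removed before tokenizing. Whether the original pipeline does the same is not stated.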

Algorithm & Model Establishing

An N-gram model with a back-off strategy was used for the Natural Language Processing (NLP) task. The corpus was tokenized three times with RWeka, producing 1-gram, 2-gram, and 3-gram tables. The algorithm predicts the next word based on the last 3 words the user entered, searching the 3-grams first. If no next word is found, it backs off to the 2-grams, then to the 1-grams. If there is still no match, it defaults to the most frequently seen word.

Word Predict Sample

[alt text][logo]

[logo]: https://github.com/dans515c/capstoneproject/blob/master/WordPredictImage.PNG "see word predict sample"
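The back-off strategy described in this section can be sketched as below. This is a minimal Python illustration, not the app's R implementation. The names `build_model` and `predict_next` are hypothetical, and the sketch assumes that an n-gram lookup uses the preceding n-1 words as context, so the 3-gram step looks at the last two words.

```python
from collections import Counter, defaultdict

def build_model(tokens, max_n=3):
    """Map each context (tuple of preceding words) to a Counter of next words."""
    model = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i:i + n]
            model[tuple(context)][word] += 1
    return model

def predict_next(model, words, max_n=3):
    """Try the longest context first, then back off to shorter ones."""
    for k in range(max_n - 1, -1, -1):
        context = tuple(words[-k:]) if k else ()
        # Skip contexts longer than the input, and contexts never seen in training.
        if len(context) == k and context in model:
            return model[context].most_common(1)[0][0]
    return None  # only reached when the model is empty

tokens = "the cat sat on the mat the cat ate fish".split()
model = build_model(tokens)

print(predict_next(model, ["on", "the"]))       # mat  (3-gram match)
print(predict_next(model, ["dog", "the"]))      # cat  (backs off to 2-gram)
print(predict_next(model, ["purple", "dog"]))   # the  (most frequent word)
```

The empty context `()` holds the plain word counts, so the final fallback to the most frequently seen word needs no special case.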

Shiny App

[alt text][app]

[app]: https://github.com/dans515c/capstoneproject/blob/master/WordPredictApp.PNG "see word predict app sample"

App Detail and Resources

Average response time: within 2-3 seconds

Application memory usage: only 169 MB (measured with mem_used())

Application is running at: https://dans515e.shinyapps.io/CapstoneShinyApp/

GitHub link for the code files: https://github.com/dans515c/capstoneproject