Coursera/Johns Hopkins University
Data Science Specialization
2/26/2020
Two large text files can be downloaded from the SwiftKey dataset. Data cleaning was required to reduce the files to 20K each. The reduced files were combined into a large corpus, which was analyzed after removing unneeded text characteristics:

* Replace unused characters with spaces
* Convert to lowercase
* Remove inappropriate language, punctuation, numbers, etc.
* Extract n-grams (uni-, bi-, tri-) from the corpus and chart them
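The cleaning and n-gram extraction steps above can be sketched roughly as follows. This is a minimal Python illustration, not the project's R/RWeka code: the function names, the `BLOCKED_WORDS` placeholder list, and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical placeholder; the actual list of inappropriate words
# used in the project is not given in the source.
BLOCKED_WORDS = {"badword"}

def clean_text(text):
    """Lowercase the text, blank out unwanted characters, drop blocked words."""
    text = text.lower()
    # Anything that is not a letter, apostrophe, or whitespace becomes a space,
    # which removes punctuation and numbers in one pass.
    text = re.sub(r"[^a-z'\s]", " ", text)
    return [tok for tok in text.split() if tok not in BLOCKED_WORDS]

def ngram_counts(tokens, n):
    """Count contiguous n-grams, each represented as a tuple of n tokens.

    Sentence boundaries are ignored in this sketch, so n-grams may span them.
    """
    return Counter(zip(*(tokens[i:] for i in range(n))))

tokens = clean_text("The cat sat on the mat. The cat ate 2 fish!")
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

print(bigrams.most_common(1))
```

The resulting counters can be sorted by frequency to produce the kind of uni-, bi-, and tri-gram charts mentioned above.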
An n-gram model with a back-off strategy was used for the natural language processing (NLP). The data were tokenized three times, into 1-grams, 2-grams, and 3-grams, using RWeka. The algorithm predicts the next word from the last 3 words the user entered, searching the 3-grams first. If no next word is found there, it falls back to the 2-grams, then to the 1-grams. If there is still no match, it defaults to the most frequently seen word.

## Word Predict Sample

[Word predict sample](https://github.com/dans515c/capstoneproject/blob/master/WordPredictImage.PNG "see word predict sample")
[Word predict app sample](https://github.com/dans515c/capstoneproject/blob/master/WordPredictApp.PNG "see word predict app sample")
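For readers who want to see the back-off lookup behind these predictions in code, here is a minimal Python sketch. It is not the app's R implementation: `build_model`, `predict_next`, and the toy token list are hypothetical, and the sketch assumes the 3-gram lookup conditions on the last two words entered, the 2-gram lookup on the last word, and the 1-gram fallback on no context at all.

```python
from collections import Counter, defaultdict

def build_model(tokens, max_n=3):
    """Map each context (a tuple of 0 to max_n-1 words) to next-word counts."""
    model = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i:i + n]
            model[tuple(context)][word] += 1
    return model

def predict_next(model, words, max_n=3):
    """Try the longest context first, then back off to shorter ones."""
    for k in range(max_n - 1, -1, -1):
        # k == 0 is the empty context, i.e. plain unigram frequencies.
        context = tuple(words[-k:]) if k else ()
        if len(context) == k and context in model:
            return model[context].most_common(1)[0][0]
    return None  # only reached when the model has no data at all

# Toy corpus, assumed for illustration only.
tokens = "the cat sat on the mat the cat ate fish".split()
model = build_model(tokens)

print(predict_next(model, ["on", "the"]))       # 3-gram match
print(predict_next(model, ["saw", "the"]))      # backs off to the 2-grams
print(predict_next(model, ["purple", "dog"]))   # backs off to the most frequent word
```

Storing the unigram counts under the empty context keeps the final "most frequently seen word" fallback inside the same loop as the other back-off steps.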
Average response time under 2-3 seconds
Application memory usage is only 169 MB (as reported by `mem_used()`)
Application is running at: https://dans515e.shinyapps.io/CapstoneShinyApp/
The code files are available on GitHub: https://github.com/dans515c/capstoneproject