Data Science Capstone - Natural Language Processing

Susan Martin
Feb 7, 2021

An implementation of a predictive N-gram model

...to predict the next word using natural language processing

Executive summary

This report describes an N-gram prediction model built with natural language processing (NLP) in R. It is the final project, the 'Data Science Capstone Project', of the Coursera/Johns Hopkins Data Science Specialization. The data is provided in collaboration with Coursera and SwiftKey and can be found here:

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Project Goal

The goal of this project is to build a fully working predictive model with reasonable performance and accuracy. It is meant to motivate and pique curiosity. We use an N-gram prediction model tuned for performance and accuracy, and we provide supporting statistics.

We aim to appeal to a wide audience, technical or not, so technical jargon is limited and demonstration is emphasized. We hope to convey that data science, in addition to drawing on traditional disciplines such as mathematics and statistics, also depends on the instincts of the creative artist, and that it is of practical use today.

Project description - the model

The N-gram prediction model predicts the next word likely to be used in a sentence by considering the last word or words given, after the algorithm has been run against a large pre-processed text data set. To implement the prediction model, the data is cleaned and tokenized into dictionaries of words. Training data sets are set aside for various evaluations. Finally, we build the model, noting the accuracy of its output and its response time. Word clouds and bar plots are provided to visualize the data sets once converted to predictive n-gram format. The appendix focuses on implementations of methods for measuring accuracy.
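The steps described above (clean, tokenize, count n-grams, predict) can be sketched in a few lines of base R. This is a minimal illustration and not the code behind the app: the function names tokenize, build_model and predict_next are hypothetical, the four-line corpus is invented, and the back-off from trigrams to bigrams is an assumption about how such a model might handle unseen contexts. It returns '?' when nothing matches, mirroring the convention of the web app.

```r
# Minimal sketch in base R; all names below are hypothetical.

# Lower-case the text, replace anything that is not a letter, apostrophe
# or space with a space, then split on whitespace.
tokenize <- function(text) {
  text <- gsub("[^a-z' ]", " ", tolower(text))
  tokens <- unlist(strsplit(text, "\\s+"))
  tokens[tokens != ""]
}

# Record every (context, next word) pair for n = 2..max_n, line by line,
# so that n-grams never span two separate lines.
build_model <- function(lines, max_n = 3) {
  stopifnot(max_n >= 2)
  n_col <- integer(0)
  context <- character(0)
  nxt <- character(0)
  for (line in lines) {
    tokens <- tokenize(line)
    for (n in 2:max_n) {
      if (length(tokens) < n) next
      for (i in seq_len(length(tokens) - n + 1)) {
        n_col <- c(n_col, n)
        context <- c(context, paste(tokens[i:(i + n - 2)], collapse = " "))
        nxt <- c(nxt, tokens[i + n - 1])
      }
    }
  }
  data.frame(n = n_col, context = context, nxt = nxt,
             stringsAsFactors = FALSE)
}

# Try the longest context first and back off to shorter ones (an assumed
# strategy). Return the most frequent follower, or "?" if none is found.
predict_next <- function(model, phrase) {
  tokens <- tokenize(phrase)
  for (n in sort(unique(model$n), decreasing = TRUE)) {
    k <- n - 1
    if (length(tokens) < k) next
    context <- paste(utils::tail(tokens, k), collapse = " ")
    followers <- model$nxt[model$n == n & model$context == context]
    if (length(followers) > 0) {
      return(names(which.max(table(followers))))
    }
  }
  "?"
}

# Invented toy corpus, for illustration only.
corpus <- c("Thank you very much!", "Thank you so much.",
            "Thank you very kindly", "I like it very much")
model <- build_model(corpus)

print(predict_next(model, "thank you"))    # trigram context "thank you"
print(predict_next(model, "really very"))  # unseen trigram, backs off to "very"
print(predict_next(model, "xyzzy"))        # unseen word
```

Growing vectors inside loops keeps the sketch short but would be far too slow for a corpus of realistic size, where precomputed frequency tables are the usual choice.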

Limitations and further exploration

Producing the idiomatic response typical of spoken English is beyond the ability of this kind of NLP model. For example, if you enter 'Bright lights, big' into the web app below, you would expect the response 'city'. However, the algorithm outputs 'deal', yielding the phrase 'Bright lights big deal'. A grey area between the literal and the idiomatic remains within NLP.
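One plausible reading of this result, assuming the model ranks candidates purely by how often they follow the preceding word, is that 'deal' simply follows 'big' more often than 'city' does in the training text. The sketch below shows the effect on an invented four-line corpus; it is not the SwiftKey data, and the variable names are hypothetical.

```r
# Invented toy corpus, for illustration only (not the SwiftKey data).
corpus <- c("this is a big deal", "no big deal", "such a big deal",
            "bright lights big city")

# For each line, collect the word that immediately follows "big".
followers <- unlist(lapply(strsplit(corpus, " ", fixed = TRUE),
                           function(words) {
  hits <- which(words == "big")
  hits <- hits[hits < length(words)]  # ignore "big" at the end of a line
  words[hits + 1]
}))

# Count the followers and pick the most frequent one.
counts <- table(followers)
best <- names(which.max(counts))

print(counts)
print(best)
```

The idiom appears in the corpus, but it is outvoted: a purely frequency-based choice has no notion that 'Bright lights' changes what should follow 'big'.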

R code used for exploratory analysis:
The appendix includes several accuracy models available to the user. https://github.com/suszanna/Predictive-Shiny-app

Shiny web app for your use:
Input: any English phrase. Output: the most likely next word ('?' means not found). https://suszanna.shinyapps.io/ngramPredictor/