Data Science Capstone-Natural Language Processing
Christopher Papanicolas
July 29, 2017
Introduction
- Johns Hopkins, along with SwiftKey, sponsored a project in which students analyze three text corpora, explore the data with various analysis tools, and build an app that predicts the next word for a sentence, phrase, or single word.
Objectives
- Describe the application
- Explain the technical details of the prediction algorithm
- Discuss limitations and where we go from here
The Application
- Delete the word prediction shown in the input box. (An error message will appear; ignore it.)
- Add a sentence, phrase, or word to the box and a prediction will be returned.
- The predicted word is based on its frequency in the index we built for our model.

Modeling of App
- A sample of US blogs, US news, and Twitter text was extracted and cleaned (tm_map from the tm package).
- Exploratory analysis was done on the text.
- The corpus was cleaned, tokenized, and indexed into dataframes of n-grams with their respective probabilities in descending order.
- Each dataframe represented either unigrams, bigrams, trigrams, or quadgrams.
- The dataframes for each n-gram were combined into one dataframe.
- The Katz Backoff Model and the Markov Chain model were used to create a function that predicts the next word using the n-gram dataframe.
- The prediction was based on the frequency of the next word in the given n-gram dataframe.
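The modeling steps above can be sketched in a few lines. This is a minimal illustration in Python, not the app's actual R code: the function names `build_ngram_tables` and `predict_next` are hypothetical, tokenization is a plain whitespace split, and the back-off is a simplified "longest matching context first" lookup that omits the discounting used in the full Katz model.

```python
from collections import Counter, defaultdict


def build_ngram_tables(sentences, max_n=4):
    """Map each context (a tuple of up to max_n - 1 words) to a Counter
    of the words that follow it. The empty context () holds unigram counts."""
    tables = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()  # assumption: naive tokenizer
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                next_word = tokens[i + n - 1]
                tables[context][next_word] += 1
    return tables


def predict_next(tables, text, max_n=4):
    """Return the most frequent next word for the longest context found,
    backing off to shorter contexts (and finally unigrams) when needed."""
    tokens = text.lower().split()
    longest = min(max_n - 1, len(tokens))
    for k in range(longest, -1, -1):
        context = tuple(tokens[len(tokens) - k:])  # last k words
        candidates = tables.get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None  # nothing indexed at all
```

For example, after `tables = build_ngram_tables(["i love new york", "i love new york", "i love new shoes"])`, calling `predict_next(tables, "i love new")` matches the three-word context directly, while an input ending in an unseen phrase falls back to shorter contexts until one is found.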
Limitations and Future Work
Limitations
- Prediction is based strictly on probabilities
- Only a small sample of the corpus was used due to memory restrictions
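One common way to work within a memory limit is to sample lines while streaming the source, so the full corpus is never held in memory. The sketch below is an assumption about how that could look, not a description of how the sample for this project was drawn; `sample_lines` and its parameters are hypothetical names.

```python
import random


def sample_lines(lines, fraction, seed=0):
    """Yield roughly `fraction` of the lines from any iterable of strings.

    Lines are examined one at a time, so memory use stays constant
    regardless of corpus size. A fixed seed keeps the sample reproducible.
    """
    rng = random.Random(seed)
    for line in lines:
        if rng.random() < fraction:
            yield line.rstrip("\n")
```

Because an open file is itself an iterable of lines, it can be passed to `sample_lines` directly, and raising `fraction` is then a one-parameter change when more memory becomes available.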
Future Steps
- Index a larger sample of the corpus and create a larger dictionary
- Continue to find ways to improve performance and memory usage
- Utilize grammar structure and word associations
- Use models to remove noise in the data and other models for better accuracy
- Use a larger n-gram list to get greater precision