Data Science Capstone - Predictive Text Data Product

Marek Kluczynski
19/06/2017

Statement of Problem

This slide deck presents my project for the Data Science Capstone course from Johns Hopkins University on Coursera.

The problem this app tries to solve is predicting text from input. For instance, when someone types “I went to the”, the application should suggest options for the next word.

To solve this problem I built an n-gram model: given one or two input words, the model predicts the most likely next word.
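
One common way to formalise "most likely next word" is shown below. It is stated as an assumption, not necessarily the exact rule used in the app: pick the word that most often follows the input words in the training corpus. Here C(...) is a hypothetical symbol for the number of times a word sequence occurs in the corpus, and w_1, w_2 are the input words (drop w_1 for a one-word input).

```latex
\hat{w} = \arg\max_{w} \; C(w_1 \, w_2 \, w)
        = \arg\max_{w} \; \frac{C(w_1 \, w_2 \, w)}{C(w_1 \, w_2)}
```

The two forms select the same word, because the denominator does not depend on w.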

Approach to Solving the Problem (Slide 1)

The technology I used to solve this problem was as follows:

RStudio as an integrated development environment for coding.
Microsoft R Open with the Intel MKL libraries (http://bit.ly/2bUIg5D), which allow more efficient parallel processing in Windows environments.
Microsoft Azure cloud to scale up to large amounts of CPU and RAM. The VM instance used for model training had 112 GB of RAM and 32 CPU cores.

Approach to Solving the Problem (Slide 2)

The steps for processing the data were as follows:

Ingest the data into an R data frame and cleanse it by removing non-ASCII characters, stripping white space and converting everything to lower case. This was implemented using the tm package and gsub.
Tokenise the data using RWeka and output a document term matrix. The tokenisation step allowed the size of the input n-gram (1, 2, 3, etc. words) to be configured.
Using this document term matrix, create an R data frame containing the search term, the follow-on term and the number of occurrences in the corpus.

The output was a set of data frames for the data product.
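
The processing steps above can be sketched in a few lines. This is a minimal Python illustration, not the project's R code: it does not use tm or RWeka, the names `cleanse` and `ngram_table` are hypothetical, and stripping punctuation and digits is an assumption beyond what the slides state.

```python
import re
from collections import Counter

def cleanse(text):
    """Lower-case, drop non-ASCII characters, and collapse white space."""
    text = text.encode("ascii", errors="ignore").decode("ascii").lower()
    # Assumption: keep only letters, apostrophes and spaces.
    text = re.sub(r"[^a-z' ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def ngram_table(documents, n):
    """Count (search term, follow-on term) pairs; the search term is n words long."""
    counts = Counter()
    for doc in documents:
        words = cleanse(doc).split()
        for i in range(len(words) - n):
            search_term = " ".join(words[i:i + n])
            follow_on = words[i + n]
            counts[(search_term, follow_on)] += 1
    # One row per pair, most frequent first, mirroring the data frame columns.
    return [
        {"search_term": s, "follow_on": f, "occurrences": c}
        for (s, f), c in counts.most_common()
    ]

# Toy corpus for illustration only.
corpus = ["I went to the shop.", "I went to the park!", "We went to the shop"]
table = ngram_table(corpus, n=2)
```

Changing `n` changes the number of input words in the search term, which corresponds to the configurable n-gram size in the tokenisation step.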

Predictive Text Data Product (Slide 1)

Using the data frames, a data product was put together (http://bit.ly/2rwxU1S). It takes input text and then does the following:

Cleanses the input text by removing non-ASCII alpha characters, stripping white space and converting all characters to lower case. This was done using the tm package and gsub.
Displays the most likely word for the search term, based on a lookup in the data frame.
Displays a word cloud (using the wordcloud package) of up to the 10 most likely follow-on terms, based on the data frame lookup.
Displays an emotion classification graph for the user's input text, using syuzhet.
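
The cleanse-and-lookup part of the list above can be sketched as follows. This is a hedged Python illustration, not the app's R code: the table `NGRAMS` holds made-up counts, the function names are hypothetical, and keying the lookup on the last word typed assumes a one-word-input table. The word cloud and emotion graph are not shown.

```python
import re

# Hypothetical toy table: search term -> {follow-on term: occurrences}.
NGRAMS = {
    "the": {"shop": 5, "park": 3, "end": 2},
    "to": {"the": 9, "be": 4},
}

def cleanse(text):
    """Lower-case, keep ASCII letters only, and collapse white space."""
    text = text.encode("ascii", errors="ignore").decode("ascii").lower()
    text = re.sub(r"[^a-z ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def candidates(input_text, table, top_k=10):
    """Return up to top_k (word, count) pairs for the last word typed."""
    words = cleanse(input_text).split()
    if not words:
        return []
    follow_ons = table.get(words[-1], {})
    # Highest count first; ties broken alphabetically for a stable order.
    return sorted(follow_ons.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

def predict(input_text, table):
    """Return the single most likely next word, or None if the term is unseen."""
    ranked = candidates(input_text, table, top_k=1)
    return ranked[0][0] if ranked else None
```

For example, `predict("I went to the", NGRAMS)` looks up "the" and returns the follow-on term with the highest count, while `candidates(...)` returns the ranked list that a word cloud could be drawn from.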

Predictive Text Data Product (Slide 2)

The data product could be described as a minimum viable product and does have some issues, namely:

Even with the 32-core, 112 GB RAM VM, training the n-gram model beyond 2 input words was not possible.
The data product itself uses a cut-down, one-word-input data frame due to memory limitations.

If I were to approach this problem again from scratch, the data cleansing method would likely remain the same. However, I might use a predictive model such as a neural net, which may give better performance. It should be noted that the method used above gave good results on the course's test quizzes, with a high degree of matches.