Data Science Capstone Presentation

Tiago Marques
2016/09/25

Summary

This project, which completes the Data Science specialization, proposed building an application that predicts the next word of a text.

The project explored natural language processing (NLP) in a challenging way. The topic was new to the Coursera Data Science students, which led to intensive research, not only to find the best way to implement the project but also to understand which tools were most helpful for the task.

Over a seven-week process covering all the data transformation steps, from data acquisition and cleaning through exploratory analysis and model creation, a solution was built that tries to balance the accuracy users expect with the lightness required of a mobile-ready application.

Method

The model was trained on 3 text files from different sources (blogs, news and Twitter). The following operations were performed on each data file:

English language filtering
Offensive words exclusion
Sentence split
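
The three operations above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the ASCII-ratio language check, the placeholder word list and all function names are assumptions, and the sketch removes offensive words rather than whole sentences, which the slides do not specify.

```python
import re
import string

# Hypothetical placeholder list; a real run would load a full word list.
OFFENSIVE_WORDS = {"darn", "heck"}


def looks_english(sentence, threshold=0.9):
    """Crude language filter (assumption): keep text that is mostly ASCII."""
    if not sentence:
        return False
    ascii_chars = sum(1 for ch in sentence if ord(ch) < 128)
    return ascii_chars / len(sentence) >= threshold


def remove_offensive(sentence, banned=OFFENSIVE_WORDS):
    """Drop every word that appears in the banned list, ignoring case."""
    kept = [
        word for word in sentence.split()
        if word.lower().strip(string.punctuation) not in banned
    ]
    return " ".join(kept)


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def preprocess(text):
    """Apply the three steps in the order listed on the slide."""
    sentences = split_sentences(text)
    english = [s for s in sentences if looks_english(s)]
    return [remove_offensive(s) for s in english]
```

A regular-expression splitter like this one will break on abbreviations such as "Dr. Smith", so a dedicated sentence tokenizer would be a safer choice on real data.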

After the steps above, the files were concatenated into a single file, from which the training and test data were sampled (0.7 train; 0.3 test). Before the corpus was created, some additional cleaning was performed, including removal of punctuation and numbers and conversion to lower case.
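
As an illustration of the sampling and cleaning described here, the sketch below splits a list of sentences 70/30 and normalizes each one. The function names and the fixed random seed are assumptions made for the example; the slides do not say how the sampling was done.

```python
import random
import string

# Translation table that deletes ASCII punctuation and digits.
_DROP_CHARS = str.maketrans("", "", string.punctuation + string.digits)


def clean_sentence(sentence):
    """Lower-case the text, delete punctuation and digits, tidy whitespace."""
    return " ".join(sentence.lower().translate(_DROP_CHARS).split())


def split_train_test(sentences, train_fraction=0.7, seed=42):
    """Shuffle a copy of the sentences and cut it into train and test parts."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    corpus = [f"Sentence number {i}, with punctuation!" for i in range(10)]
    train, test = split_train_test(corpus)
    print(len(train), len(test))
    print(clean_sentence(train[0]))
```

Deleting punctuation outright turns "don't" into "dont"; whether the original pipeline behaved this way is not stated.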

From the training corpus, 4 types of N-grams were created (1-grams, 2-grams, 3-grams and 4-grams); those with a frequency below 4 were discarded.
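
The N-gram tables and the frequency cut-off can be sketched as below. This is an assumed, simplified version that tokenizes on whitespace and counts N-grams within each sentence; the names `count_ngrams` and `prune` are illustrative only.

```python
from collections import Counter


def count_ngrams(sentences, max_order=4):
    """Count every 1-gram up to max_order-gram inside each sentence."""
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts


def prune(counts, min_count=4):
    """Keep only the N-grams seen at least min_count times."""
    return {
        n: Counter({gram: c for gram, c in table.items() if c >= min_count})
        for n, table in counts.items()
    }


if __name__ == "__main__":
    sentences = ["a b c"] * 4 + ["a b d"]
    pruned = prune(count_ngrams(sentences))
    for order, table in pruned.items():
        print(order, dict(table))
```

Pruning rare N-grams shrinks the tables considerably, which matters for an application meant to stay light, at the cost of losing predictions for rare contexts.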

Application

The application lets a user freely explore text creation through an interface that presents a large number of probable next words, which update automatically every time the text changes. Two methods were implemented:

Stupid backoff (presents the 50 most probable words)
Kneser-Ney backoff (presents the 6 most probable words, because the computation is time-consuming)
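
To show how the first method ranks candidates, here is a minimal sketch of Stupid Backoff scoring. A word is scored by its relative frequency after the longest matching context; each time the context has to be shortened, the score is multiplied by a fixed factor. The factor 0.4 is the value suggested in the original Stupid Backoff paper; whether this application used it is an assumption, as are the function names and the linear scan over each table.

```python
from collections import Counter


def count_ngrams(sentences, max_order=4):
    """Count every 1-gram up to max_order-gram inside each sentence."""
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts


def stupid_backoff(context, counts, alpha=0.4, top_k=50):
    """Return up to top_k (word, score) pairs, best first."""
    max_order = max(counts)
    tokens = tuple(context.lower().split())
    history = tokens[-(max_order - 1):] if max_order > 1 else ()
    total_unigrams = sum(counts[1].values())
    scores = {}
    # Try the longest context first, then shorter ones, down to unigrams.
    for k in range(len(history), -1, -1):
        prefix = history[len(history) - k:]
        penalty = alpha ** (len(history) - k)
        denominator = counts[k][prefix] if k else total_unigrams
        if not denominator:
            continue  # this context was never seen, back off further
        for ngram, count in counts[k + 1].items():
            word = ngram[-1]
            # Keep the score from the longest context in which the word occurs.
            if ngram[:-1] == prefix and word not in scores:
                scores[word] = penalty * count / denominator
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    corpus = ["i love you", "i love you", "i love cats", "you love cats"]
    print(stupid_backoff("i love", count_ngrams(corpus), top_k=3))
```

The scores are not normalized probabilities, which is part of what makes the method cheap enough to return 50 candidates on every keystroke. A practical implementation would index N-grams by prefix instead of scanning whole tables.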

Application Screenshot.

Conclusions and Future Work

The accuracy tests showed that the backoff implementation (N-grams with a frequency greater than 3) achieved a perplexity of 142.987.
The high volume of data gave a better understanding of how code efficiency affects computation time, and required a careful selection of tools.
In future developments, implementing this project on a big data platform would enable a much more robust solution.
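
For reference, the perplexity figure in the conclusions follows the standard definition: the exponential of the average negative log-probability that the model assigns to each word of the test data. The sketch below shows only that definition; how the project obtained per-word probabilities from its backoff scores is not stated, so the input list is an assumption.

```python
import math


def perplexity(word_probabilities):
    """Perplexity of a test sequence given the probability of each word.

    Every probability must be greater than zero; math.log raises
    ValueError for a zero probability.
    """
    if not word_probabilities:
        raise ValueError("at least one probability is required")
    log_sum = sum(math.log(p) for p in word_probabilities)
    return math.exp(-log_sum / len(word_probabilities))


if __name__ == "__main__":
    # A model that always assigns probability 0.25 is as uncertain as
    # a uniform choice among 4 words.
    print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Lower values are better: under this definition, a perplexity near 143 means the model is on average about as uncertain as a uniform choice among 143 words.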