Data Science Capstone Presentation

Alejandro Balderas

18 March 2018

Overview

This is the final project for the Data Science Specialization offered by Johns Hopkins University on Coursera. The project builds a predictive text model using Natural Language Processing techniques such as n-grams and the Katz back-off model.

The input data for this project consists of text from Twitter, blogs and news feeds, provided by SwiftKey. An exploratory data analysis phase was completed and can be found under the following link:

Report.

The Data

The data set is split into tokens using the quanteda package, and the tokens are then combined into n-grams of different lengths. The number of times each n-gram appears in the text is stored alongside each feature. These frequencies are then used to estimate the probability that a certain word comes after a given sequence of words. Below you can see the top trigrams from the blog data set:

      feature frequency
1  one_of_the      4859
2    a_lot_of      4095
3  as_well_as      2292
4 some_of_the      2283
5     to_be_a      2275
6    it_was_a      2273

With this information we can assume, for example, that after the text “the end of” the most probable next word is “the”.
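The counting step described above can be illustrated with a minimal sketch. It is written in plain Python rather than with the quanteda package used in the project, and the tokenizer, the function names and the toy corpus are all assumptions made for illustration, not the project's actual code or data.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens, joined with '_' like the features above."""
    return Counter("_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def next_word_probs(counts, context):
    """Relative frequency of each word that follows `context` (an '_'-joined prefix)."""
    prefix = context + "_"
    matches = {ng[len(prefix):]: c for ng, c in counts.items() if ng.startswith(prefix)}
    total = sum(matches.values())
    return {word: c / total for word, c in matches.items()} if total else {}

# Toy corpus, invented for illustration only.
corpus = "one of the best days was one of the last days and one of a kind"
trigrams = ngram_counts(tokenize(corpus), 3)
probs = next_word_probs(trigrams, "one_of")
print(trigrams.most_common(2))
print(probs)
```

In this toy corpus "one_of_the" occurs twice and "one_of_a" once, so the estimated probability of "the" after "one of" is 2/3.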

The algorithm

The algorithm takes the last 4 words of the input and searches for them as the first 4 words of a 5-gram; if a match is found, the fifth word of that 5-gram is taken as the prediction. If no match is found, the algorithm “backs off”: it takes the last 3 words and searches for them in the 4-gram data. This process continues with shorter n-grams until a match is found. If no match is found at any level, a random sample of the 6 most common words in the data set is returned.
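The back-off lookup described above can be sketched roughly as follows. This is a simplified illustration, not the application's code: the names `build_tables` and `predict` and the toy text are hypothetical, the sketch returns the single most frequent continuation, and it assumes the fallback picks one word at random from the 6 most common words. It also omits the count discounting that a full Katz back-off model applies.

```python
import random
from collections import Counter, defaultdict

def build_tables(tokens, max_n=5):
    """Map n -> {context tuple of n-1 words -> Counter of next words}."""
    tables = {}
    for n in range(2, max_n + 1):
        table = defaultdict(Counter)
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            table[context][tokens[i + n - 1]] += 1
        tables[n] = table
    return tables

def predict(words, tables, common_words, max_n=5, rng=random):
    """Try the longest context first, then back off to shorter ones."""
    for n in range(max_n, 1, -1):
        context = tuple(words[-(n - 1):])
        if len(context) < n - 1:
            continue  # not enough input words for this n-gram order
        followers = tables[n].get(context)
        if followers:
            return followers.most_common(1)[0][0]
    # No match at any order: random pick among the most common words.
    return rng.choice(common_words)

# Toy text, invented for illustration only.
tokens = ("at the end of the day we sat at the end of the road "
          "and at the end of the day we left").split()
tables = build_tables(tokens)
common_words = [w for w, _ in Counter(tokens).most_common(6)]

print(predict("it was the end of the".split(), tables, common_words))
print(predict("by the side of the".split(), tables, common_words))
```

The first call matches "the end of the" directly in the 5-gram table. The second finds nothing for the 4-word and 3-word contexts and only succeeds after backing off to "of the".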

The Application

The application can be found under the following link:

Shiny App.

Have Fun

Try out the application and see for yourself whether it delivers the expected outcome.

As an add-on, I wrote some code that generates a random sentence based on the words you have already typed. Try it out in the extra tab of the app.
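The slides do not say how the random sentence is produced. One possible approach, shown here purely as an assumption, is to sample each next word in proportion to how often it followed the previous word. The function names and the toy text below are hypothetical.

```python
import random
from collections import Counter, defaultdict

def build_bigram_table(tokens):
    """Map each word to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev][nxt] += 1
    return table

def random_sentence(seed_words, table, length=8, rng=None):
    """Extend seed_words by sampling followers weighted by their counts."""
    rng = rng or random.Random()
    words = list(seed_words)
    if not words:
        return ""
    while len(words) < length:
        followers = table.get(words[-1])
        if not followers:
            break  # the last word was never seen with a follower
        choices, weights = zip(*followers.items())
        words.append(rng.choices(choices, weights=weights, k=1)[0])
    return " ".join(words)

# Toy text, invented for illustration only.
tokens = "the cat sat on the mat and the cat ran".split()
table = build_bigram_table(tokens)
sentence = random_sentence(["the"], table, length=5, rng=random.Random(0))
print(sentence)
```

Passing a seeded `random.Random` makes the output repeatable; leaving `rng` unset gives a different sentence on each call.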