JHU Data Science Capstone Project

Mathieu C.
2020-04-20

Objective and Resources (1/2)

Hi, and welcome to this presentation. These slides are part of the final project (Capstone) of the Data Science Specialization.

The aim of the project is to create a model that predicts the next word of a given (and obviously incomplete) English sentence.

The model should be able to run on modest hardware, such as smartphones, and in web apps. It will be showcased on Shinyapps.io.

Objective and Resources (2/2)

To accomplish this, I've been given the following things:

A corpus of around 3 million lines of text from 3 sources: news, Twitter and blog articles.
Documentation on NLP (Natural Language Processing) with R
RStudio, GitHub and Shinyapps.io

Big picture

Here are the fundamental steps to achieve the goal at hand:

Explore the corpus of text and clean it (swear words, numbers, symbols, etc.)
Produce good-quality n-grams that are representative of the whole corpus
Store the frequencies of the n-grams in tables small enough to be handled by the app
Build an algorithm that predicts the next word of a sentence
Build the application
Prepare this presentation to showcase the app

Data cleaning and challenges (1/2)

The main challenge I had with this corpus was probably the tremendous amount of memory and computing power needed to process all of it.

To keep the application reasonably fast, I chose to narrow the corpus down to what was necessary. This also reduces computing time considerably.

Exploring the data quickly showed that a sample of only about 20% of the corpus preserved the proportion of words matching an English dictionary. The model seemed to run fine on this sample, and the resulting files were relatively small.
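The check described above can be sketched as follows. This is only an illustration in Python, not the R code used in the project; the function names, the toy corpus, the toy dictionary and the fixed seed are all assumptions made for the example.

```python
import random

def sample_lines(lines, fraction=0.2, seed=42):
    """Return a random subset holding roughly `fraction` of the lines."""
    rng = random.Random(seed)          # fixed seed so the sample is reproducible
    k = int(len(lines) * fraction)
    return rng.sample(lines, k)

def dictionary_coverage(lines, dictionary):
    """Share of lowercased, whitespace-separated tokens found in `dictionary`."""
    tokens = [tok.lower() for line in lines for tok in line.split()]
    if not tokens:
        return 0.0
    return sum(tok in dictionary for tok in tokens) / len(tokens)

# Toy stand-ins for the real corpus and English dictionary (hypothetical data).
corpus = ["the cat sat on the mat", "zzxq the dog ran", "a bird sang"] * 100
english = {"the", "cat", "sat", "on", "mat", "dog", "ran", "a", "bird", "sang"}

subset = sample_lines(corpus)
print(len(subset))                                  # 60 lines out of 300
print(round(dictionary_coverage(corpus, english), 3))
print(round(dictionary_coverage(subset, english), 3))
```

If the two coverage figures stay close, the sample is a reasonable stand-in for the full corpus on this one criterion.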

Data cleaning and challenges (2/2)

For the cleaning, I used a package named quanteda, available on CRAN. It removes most of the “junk” characters and words.

Some manual wrangling was necessary to remove the rest, along with a dictionary of swear words so that the model cannot suggest profanities.
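The kind of cleaning described here could look roughly like the sketch below. It is written in plain Python for illustration and does not reproduce quanteda's behaviour; the `clean_tokens` helper, the regular expression and the one-word `banned` list are assumptions.

```python
import re

def clean_tokens(text, profanity):
    """Lowercase, keep only letters and apostrophes, drop profane tokens."""
    text = text.lower()
    # Replace digits, punctuation and symbols with spaces.
    text = re.sub(r"[^a-z' ]+", " ", text)
    tokens = [tok.strip("'") for tok in text.split()]
    return [tok for tok in tokens if tok and tok not in profanity]

banned = {"darn"}  # hypothetical placeholder for a real swear-word list
print(clean_tokens("Well, DARN it! I've got 2 tickets #lucky", banned))
# ['well', 'it', "i've", 'got', 'tickets', 'lucky']
```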

For the n-grams, I decided to go up to a four-gram frequency table, so that predictions can be based on the last three words the user typed.
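As a rough illustration of such a frequency table, the Python sketch below maps each context of n-1 words to the counts of the words that follow it. The `build_ngram_table` name and the three toy sentences are assumptions; the project itself builds its tables in R.

```python
from collections import Counter, defaultdict

def build_ngram_table(sentences, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            table[context][tokens[i + n - 1]] += 1
    return table

# Toy, already-cleaned sentences (hypothetical data).
sentences = [
    "thanks for the follow".split(),
    "thanks for the help".split(),
    "thanks for the follow".split(),
]
four_grams = build_ngram_table(sentences, 4)
print(four_grams[("thanks", "for", "the")].most_common())
# [('follow', 2), ('help', 1)]
```

Calling the same function with n set to 3 or 2 gives the three-gram and two-gram tables.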

Model and Algorithm (1/2)

Three models are used as a backoff, as follows:

The model takes the last three words (if there are at least three) and looks for a match in the four-gram data frame. If there is one, it uses the (normalized) probability of the match(es) among the four-grams to return the five or more most probable words. If there is no match, it returns a default word, the most frequent word in the whole corpus, along with a (dummy) zero probability.
Whether or not the first attempt was successful, the model then does the same thing with the three-gram data frame (provided there are at least two words, of course). It also returns either a data frame of five or more suggestions or a dummy prediction.

Model and Algorithm (2/2)

The exact same process is repeated one last time with the very last word using the two-gram data.
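A minimal Python sketch of this three-level lookup is given below, assuming each table maps a context tuple to next-word counts. The function names, the `top_k=5` cut-off and the toy tables are assumptions for illustration, not the project's actual R implementation.

```python
def predict_level(table, context, default_word, top_k=5):
    """Return [(word, probability)] for one n-gram level, or a dummy fallback."""
    counts = table.get(tuple(context))
    if not counts:
        return [(default_word, 0.0)]   # no match: default word, zero probability
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(word, count / total) for word, count in ranked[:top_k]]

def predict_all_levels(tables, words, default_word):
    """Query the four-, three- and two-gram tables with the last 3, 2 and 1 words."""
    results = {}
    for n in (4, 3, 2):
        if len(words) >= n - 1:        # skip a level if the input is too short
            results[n] = predict_level(tables[n], words[-(n - 1):], default_word)
    return results

# Toy tables (hypothetical counts).
tables = {
    4: {("thanks", "for", "the"): {"follow": 2, "help": 1}},
    3: {("for", "the"): {"follow": 2, "first": 2, "help": 1}},
    2: {("the",): {"same": 3, "follow": 2}},
}
print(predict_all_levels(tables, ["thanks", "for", "the"], "the"))
print(predict_all_levels(tables, ["hello", "the"], "the"))
```

In the second call the four-gram level is skipped because only two words are available, and the three-gram level falls back to the dummy prediction.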

In the end, the predictions are pooled and weighted by the raw probability of each word appearing in the corpus (its unigram frequency). This breaks ties between equally probable words (same count) and selects, by default, the one that “should” appear more often.

The final prediction is the word with the highest weighted probability.
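The slides do not spell out the exact weighting formula, so the Python sketch below uses an assumed one: each candidate's n-gram probability is multiplied by one plus its unigram probability, and the best score per word is kept. The `pool_and_rank` name and the toy numbers are also hypothetical.

```python
def pool_and_rank(level_predictions, unigram_prob):
    """Pool candidates from all levels and rank them by a weighted score."""
    scores = {}
    for candidates in level_predictions:
        for word, prob in candidates:
            # Assumed weighting: the unigram frequency nudges the n-gram
            # probability, which separates otherwise tied candidates.
            score = prob * (1.0 + unigram_prob.get(word, 0.0))
            scores[word] = max(score, scores.get(word, 0.0))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

levels = [
    [("follow", 0.5), ("help", 0.5)],  # tied four-gram candidates
    [("the", 0.0)],                    # dummy prediction from a failed level
]
unigrams = {"follow": 0.001, "help": 0.004, "the": 0.05}
ranking = pool_and_rank(levels, unigrams)
print(ranking[0][0])  # help
```

Here "follow" and "help" are tied at the four-gram level, and the more frequent unigram "help" wins.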

The Application

The application should be pretty self-explanatory. You can type a sentence on the left and, after a few seconds, the model will return its prediction.

Tips for the app:

Avoid using numbers or punctuation; the predictor will get rid of them anyway.
The text is automatically converted to lowercase, so there is no need to pay attention to capitalization.
The app takes a little time to load on startup and to predict, normally no more than a few seconds. Please be patient!

A few words before I leave

Just a few things I would like you to know before you try out the app:

This is my first project ever in the Natural Language Processing field and the app is far from perfect, so any critical observation is welcome. There is, and always will be, room for improvement.
The whole Data Science Specialization was quite an adventure and I've discovered more things than I could describe. It made me a better coder and a better scientist, and it is something I'll remember for a very long time.
I hope you enjoyed reading this short presentation. Head over to the app to test it out!