Mathieu C.
2020-04-20
Hi, and welcome to this presentation. These slides are part of the final project (Capstone) of the Data Science Specialization.
The aim of the project is to create a model able to predict the next word of a given (and obviously incomplete) English sentence.
The model should be able to run on modest hardware, such as smartphones, and in web apps. It will be showcased on Shinyapps.io.
To accomplish this, I've been given the following things:
Here are the fundamental steps to achieve the goal at hand:
The main challenge with this corpus was probably the tremendous amount of space and computing power needed to process it in full.
To keep the application reasonably fast, I chose to narrow the corpus down to what was necessary. This also reduces computing time considerably.
Exploring the data quickly showed that about 20% of the corpus was enough: the proportion of words matching an English dictionary was preserved, the model seemed to run fine, and the resulting files were relatively small.
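The sampling check described above can be sketched as follows. This is a minimal illustration in Python, not the code used in the project (which relied on R): the function names, the fixed seed and the whitespace tokenization are all assumptions made for the example.

```python
import random

def sample_lines(lines, fraction=0.2, seed=42):
    """Keep each line independently with probability `fraction` (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]

def dictionary_coverage(lines, dictionary):
    """Share of whitespace-separated tokens that appear in `dictionary`."""
    tokens = [tok.lower() for line in lines for tok in line.split()]
    if not tokens:
        return 0.0
    return sum(tok in dictionary for tok in tokens) / len(tokens)
```

Comparing `dictionary_coverage` on the full corpus and on the sample gives a quick indication of whether the sample keeps the same proportion of dictionary words.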
For the cleaning, I used a package named quanteda, available on CRAN. It removes most of the “junk” characters and words.
Some manual wrangling was necessary to remove the rest, along with a dictionary of swear words so that my model cannot suggest profanities.
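To give an idea of what this kind of cleaning involves, here is a small Python sketch. It does not reproduce quanteda's behaviour: the regular expressions, the `clean_tokens` name and the choice to drop banned words token by token are assumptions for illustration only.

```python
import re

def clean_tokens(text, banned=frozenset()):
    """Lowercase, strip URLs and non-letter characters, drop banned words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z' ]+", " ", text)     # keep only letters, apostrophes and spaces
    # Assumption: profanities are removed token by token
    return [tok for tok in text.split() if tok not in banned]
```

For instance, `clean_tokens("Hello, WORLD! 123", banned={"world"})` keeps only `hello`.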
For the n-grams, I decided to go up to a four-gram frequency table, so that predictions can be based on the last three words the user typed.
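An n-gram frequency table is simply a count of every run of n consecutive words. The Python sketch below shows the idea on already tokenized sentences; the `ngram_counts` name and the tuple-keyed `Counter` are illustrative choices, not the data structure used in the app.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count every run of n consecutive tokens across tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        # A sentence shorter than n tokens contributes nothing
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts
```

Calling it with `n` set to 2, 3 and 4 would give the bigram, three-gram and four-gram tables.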
Three models are used in a backoff scheme, as follows:
The model takes the last three words (if there are at least three) and looks for a match in the four-gram dataframe. If there is one, it uses the (normalized) probability of the match(es) among the four-grams to return the five or more most probable words. If there is no match, it returns a default word, the most frequent word in the whole corpus, along with a (dummy) zero probability.
Whether or not the first attempt was successful, the model then does the very same thing to check for matches in the three-gram dataframe (provided there are at least two words, of course). It also returns either a dataframe of five or more suggestions or a dummy prediction.
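The lookup performed at each level can be sketched like this in Python. It is only an approximation of the steps above: the `predict_from_table` name, the dictionary of tuple keys and the hard cut at `k` candidates (the app may return more than five when counts are tied) are assumptions.

```python
def predict_from_table(context, table, n, default_word, k=5):
    """Candidates from an n-gram count table (n >= 2) for the last n-1 words.

    Returns (word, probability) pairs, or a dummy prediction
    [(default_word, 0.0)] when the context is too short or nothing matches.
    """
    if len(context) < n - 1:
        return [(default_word, 0.0)]
    prefix = tuple(context[-(n - 1):])
    matches = {gram[-1]: c for gram, c in table.items() if gram[:-1] == prefix}
    if not matches:
        return [(default_word, 0.0)]
    total = sum(matches.values())  # normalize over the matching n-grams only
    ranked = sorted(matches.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, count / total) for word, count in ranked[:k]]
```

The same function is called once per table (for example with `n=4`, then `n=3`), and each call yields its own list of candidates regardless of how the previous one went.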
In the end, the predictions are pooled and weighted by the raw probability of each word appearing in the corpus (unigram frequency). This makes it possible to break ties between similarly probable words (same count) and to select, by default, the one that “should” appear more often.
The final prediction is the word with the highest weighted probability.
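One possible way to pool and weight the candidates is sketched below in Python. The exact weighting used in the app is not spelled out here, so the formula (n-gram probability multiplied by unigram probability, keeping the best score per word) and the `final_prediction` name are assumptions.

```python
def final_prediction(pools, unigram_prob):
    """Pick the word with the highest weighted score across all candidate pools.

    `pools` is a non-empty list of [(word, probability), ...] lists, one per model.
    Assumed weighting: n-gram probability times unigram probability.
    """
    scores = {}
    for pool in pools:
        for word, prob in pool:
            score = prob * unigram_prob.get(word, 0.0)
            scores[word] = max(scores.get(word, 0.0), score)
    # Ties on the weighted score fall back to the more frequent word
    return max(scores, key=lambda w: (scores[w], unigram_prob.get(w, 0.0)))
```

With this weighting, a dummy prediction (zero probability) can only win when every model failed, in which case the default word is returned.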
The application should be pretty self-explanatory. You can input a sentence on the left and, after a few seconds, the model will return its prediction.
Tips for the app:
Just a few things I would like you to know before you try out the app:
This is my first project ever in the Natural Language Processing field and the app is far from perfect, so any critical observation is welcome. There is, and there always will be, room for improvement.
The whole Data Science Specialization was quite an adventure and I've discovered more things than I could describe. It made me a better coder, a better scientist and it will be something I'll remember for a very long time.
I hope you enjoyed reading this short presentation. Head over to the app to test it out!