Capstone Project Presentation

Aman Tiwari
24-11-2021

Introduction

Our goal for this Data Science Capstone was to create a Shiny web application that, given a meaningful sequence of words as input, predicts the word most likely to follow or complete that sequence. This functionality could be used to speed up typing by suggesting the next word so the user can select it rather than type it out. Another potential application is search-query autocomplete.

To provide this functionality we built a 5-gram probabilistic language model and used Stupid Backoff to rank next-word candidates. Our language model built its understanding of English from three corpora: a collection of blog posts, a collection of news articles, and a collection of tweets (none of which comes from the user's own input).
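
As a rough illustration, a minimal R sketch of the data-loading step might look like the following; the file names assume the standard SwiftKey `en_US` layout and the 10% sampling fraction is an assumption for illustration, not the exact value we used.

```r
# Minimal sketch: load the three English corpora from the SwiftKey dataset.
# Paths and the sampling fraction are illustrative assumptions.
set.seed(1234)

read_corpus <- function(path, sample_frac = 0.1) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  # Work with a random sample to keep memory manageable during development.
  sample(lines, size = floor(length(lines) * sample_frac))
}

blogs  <- read_corpus("final/en_US/en_US.blogs.txt")
news   <- read_corpus("final/en_US/en_US.news.txt")
tweets <- read_corpus("final/en_US/en_US.twitter.txt")

corpus_text <- c(blogs, news, tweets)
```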

Steps followed to build the model

  1. First, we downloaded the data from SwiftKey using this link.
  2. Next, we performed some EDA to examine the basic characteristics of the data; the results can be seen in the milestone report.
  3. We then cleaned the data and built term-frequency matrices (see the first sketch after this list).
  4. We then pruned the n-grams to eliminate ultra-low-frequency items, which consume a significant amount of memory while adding little value to the model (second sketch below).
  5. Finally, we built our prediction model using the Stupid Backoff mechanism, explained in more detail on the next slide.
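
A minimal sketch of the cleaning and term-frequency step (item 3), assuming the quanteda and data.table packages; our exact cleaning rules may have differed.

```r
# Clean/tokenize the text and build one n-gram frequency table per order (1-5).
library(quanteda)
library(data.table)

toks <- tokens(
  corpus_text,
  remove_punct   = TRUE,
  remove_numbers = TRUE,
  remove_symbols = TRUE,
  remove_url     = TRUE
)
toks <- tokens_tolower(toks)

ngram_freq <- function(toks, n) {
  ng     <- tokens_ngrams(toks, n = n, concatenator = " ")
  dfm_ng <- dfm(ng)
  data.table(ngram = featnames(dfm_ng), count = colSums(dfm_ng))
}

freq_tables <- lapply(1:5, function(n) ngram_freq(toks, n))
```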
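
Pruning (item 4) can then be as simple as dropping every n-gram below a count threshold; the cut-off of 4 used below is an illustrative assumption, not necessarily the value used in the app.

```r
# Drop ultra-low-frequency n-grams; the threshold is an illustrative choice.
min_count   <- 4
freq_tables <- lapply(freq_tables, function(dt) dt[count >= min_count])
```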

Next Word Candidate Search

The algorithm first takes the last four words typed and searches for 5-grams that begin with those four words. If fewer than 5 candidates are found, the app takes the last three words typed and searches for 4-grams that begin with those three words, and so on, until it has found at least 5 matches. If the app cannot find any suitable candidates, it simply returns its 5 most likely unigrams, based on their maximum-likelihood estimates.
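
To make the search concrete, here is a simplified R sketch of the candidate lookup with Stupid Backoff scoring. It assumes a hypothetical list `tables`, derived from the pruned frequency tables, in which each table of order 2-5 has been split into a `prefix` column (the first n-1 words) and a `next_word` column; the discount factor of 0.4 is the value commonly quoted for Stupid Backoff and stands in for whatever the app actually uses.

```r
library(data.table)

# Simplified Stupid Backoff candidate search (not the app's exact code).
predict_next <- function(input, tables, n_candidates = 5, lambda = 0.4) {
  words <- tolower(strsplit(trimws(input), "\\s+")[[1]])
  if (length(words) == 0) {
    return(head(tables[[1]][order(-count)]$ngram, n_candidates))
  }

  k_max      <- min(4, length(words))
  candidates <- data.table(next_word = character(), score = numeric())

  # Try the longest available prefix first, then back off one word at a time.
  for (k in k_max:1) {
    p       <- paste(tail(words, k), collapse = " ")
    matches <- tables[[k + 1]][prefix == p]
    if (nrow(matches) > 0) {
      # Relative frequency, discounted by lambda for each backoff step taken.
      steps  <- k_max - k
      scored <- matches[, .(next_word,
                            score = lambda^steps * count / sum(count))]
      candidates <- rbind(candidates, scored)
    }
    if (length(unique(candidates$next_word)) >= n_candidates) break
  }

  if (nrow(candidates) == 0) {
    # No match at any order: fall back to the most frequent unigrams.
    return(head(tables[[1]][order(-count)]$ngram, n_candidates))
  }

  # Keep the best score per word and return the top candidates.
  candidates <- candidates[, .(score = max(score)), by = next_word]
  head(candidates[order(-score)]$next_word, n_candidates)
}
```

A call such as `predict_next("thanks for the", tables)` would then return up to five ranked completions.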

Try the app

Use the app and feel free to provide your feedback. App link: https://5i64zv-razorscythe.shinyapps.io/Next_word_prediction/