The Next Word Is!

Data Science Capstone Project

Avinayan

Predicting the Next Word

  • The aim of this project is to predict the next word of a sentence or phrase.
  • Pitch:
    1. On smartphones and other small form-factor devices, keypads keep shrinking.
    2. This makes typing harder.
    3. Companies like SwiftKey address this by predicting what the user is likely to type next, which improves the typing experience.
  • In this project, we will use data science to predict the user's next word with reasonable accuracy.

Model Strategy and Algorithm

  • First, the provided data had to be prepared for the algorithm.
  • This involved using the R tm package on the sample data.
  • The preprocessing steps were: converting to lower case; removing numbers, punctuation, stopwords and profanity; and reducing words to their stems.
  • The corpus was then converted into n-gram tokens. (n = 1, 2, 3, 4)
  • The probability of occurrence of each n-gram was computed, and the n-grams were sorted by probability.
  • This process was repeated and refined to get a good n-gram model.
  • The n-grams, ranked by their probability scores, form the basis for predicting the next word.
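
The steps above can be illustrated with a minimal sketch. It is written in Python rather than with the tm package, so it is not the project's actual code. The function names (tokenize, build_ngram_model, predict_next) are hypothetical, stopword removal, profanity filtering and stemming are left out for brevity, and the fallback from longer to shorter contexts is an assumption about how the sorted n-grams might be queried.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (drops numbers and punctuation)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def build_ngram_model(sentences, max_n=4):
    """Count n-grams for n = 1..max_n.

    The result maps a context (tuple of the preceding n-1 words) to a
    Counter of the words observed right after that context.
    """
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = tokenize(sentence)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                next_word = tokens[i + n - 1]
                model[context][next_word] += 1
    return model


def predict_next(model, phrase, max_n=4):
    """Return (word, probability) for the most likely next word.

    Assumption: try the longest available context first, then fall back
    to shorter contexts, ending with plain unigram frequency.
    """
    tokens = tokenize(phrase)
    for k in range(max_n - 1, -1, -1):
        if k > len(tokens):
            continue  # the phrase is shorter than this context length
        context = tuple(tokens[len(tokens) - k:])
        candidates = model.get(context)
        if candidates:
            word, count = candidates.most_common(1)[0]
            return word, count / sum(candidates.values())
    return None, 0.0


# Toy corpus, invented purely for illustration.
corpus = [
    "Thanks for the help",
    "Thanks for the ride",
    "Thanks for the help today",
]
model = build_ngram_model(corpus)
print(predict_next(model, "Thanks for the"))
```

On this toy corpus, the context "thanks for the" is followed by "help" twice and "ride" once, so the sketch returns "help" with an estimated probability of 2/3.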

Shiny App

  • The final predictive model was deployed on a Shiny server.
  • The link is provided here: The Next Word Is!
  • How to Use this App:
    1. Enter your phrase or sentence in the input box and hit Submit.
    2. The most likely next word for the phrase or sentence is displayed in the right-hand panel.
  • Note: Please wait a few extra seconds on the first attempt while the model loads.

Conclusion and Acknowledgements

  • This model provides a reasonable level of accuracy and is good at predicting commonly used words.
  • Possible future enhancements:
    • Utilize cloud computing infrastructure for additional RAM capacity (which was limited on my laptop).
    • Utilize ensemble methods to improve the model accuracy.
    • Use external data and data idiosyncratic to the user (reinforcement learning) to improve prediction accuracy.
  • Acknowledgements:
    • Professors at JHU for the excellent content in this specialization and for the opportunity to apply in practice what was learned.
    • SwiftKey for the data, support and consulting throughout the Capstone project.