Word Prediction with N-Grams

Andrey Kotov
15.07.2016

Introduction

This is the final project of the Data Science Specialization by Johns Hopkins University on Coursera.

https://www.coursera.org/specializations/jhu-data-science

The goal of the capstone project is to create a data product: an application that analyzes the user's input and predicts the next word.

Predict a word? How?

Imagine you have a large body of text. Some word combinations are frequent, some are rare.

These word combinations are called N-grams, and they can be used to build a word prediction algorithm. When you type something, the algorithm looks for a common phrase of 2, 3 or 4 words that starts with the last words you typed.

Found one? Then that is the prediction.

All you have to do is develop this algorithm: take some text, construct frequency tables, and build an application. We did it.
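The steps above can be sketched in a few lines of Python. This is only an illustration, not the code behind the application: the function names (`tokenize`, `build_tables`, `predict`), the tiny corpus, and the simple "try the longest phrase first, then shorter ones" lookup are all assumptions made for this example.

```python
import re
from collections import Counter, defaultdict

MAX_N = 4  # longest phrase length considered (2, 3 or 4 words)

def tokenize(text):
    """Lowercase the text and keep only words; punctuation is ignored."""
    return re.findall(r"[a-z']+", text.lower())

def build_tables(text, max_n=MAX_N):
    """Frequency tables: map each (n-1)-word prefix to a Counter of next words."""
    words = tokenize(text)
    tables = defaultdict(Counter)
    for n in range(2, max_n + 1):
        for i in range(len(words) - n + 1):
            prefix = tuple(words[i:i + n - 1])
            tables[prefix][words[i + n - 1]] += 1
    return tables

def predict(tables, typed, max_n=MAX_N):
    """Return the most frequent next word for the typed text, or None."""
    words = tokenize(typed)
    # Try the longest known prefix first, then fall back to shorter ones.
    for k in range(min(max_n - 1, len(words)), 0, -1):
        prefix = tuple(words[-k:])
        if prefix in tables:
            return tables[prefix].most_common(1)[0][0]
    return None

# Hypothetical toy corpus, used only for this illustration.
corpus = "thank you very much . thank you for coming . thank you very much indeed"
tables = build_tables(corpus)
print(predict(tables, "I want to thank you"))  # prints: very
```

In the toy corpus "thank you" is followed by "very" twice and by "for" once, so "very" is the prediction. A real application would build the tables from a much larger text.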

Shinyapps.io application

What's next?

We know of some ways to improve this application:

  • Punctuation. The current version ignores it, but periods and commas are very important. There is no need to predict a new word after a period, because the sentence is complete.

  • Typos. We need another algorithm to correct typos. When the user presses the spacebar, it is time to predict a new word. But while the user is still typing a word, we can help by correcting typos. How? That is another question.
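For the punctuation point, one possible approach is sketched below. It is not part of the current application, and the helper names `split_sentences` and `sentence_is_complete` are hypothetical. The idea is to split the text into sentences before counting N-grams, and to skip prediction when the input already ends a sentence. The sketch is deliberately naive: abbreviations such as "e.g." would be split incorrectly.

```python
import re

def split_sentences(text):
    """Split on '.', '!' or '?' so N-grams never span two sentences."""
    parts = re.split(r"[.!?]+", text)
    return [p.strip() for p in parts if p.strip()]

def sentence_is_complete(typed):
    """True when the typed text ends with sentence-final punctuation."""
    return bool(re.search(r"[.!?]\s*$", typed))

print(split_sentences("Thank you. See you soon!"))
print(sentence_is_complete("Thank you. "))
```

Each sentence returned by `split_sentences` could then be fed to the frequency-table step separately, and the application could show no prediction whenever `sentence_is_complete` is true.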

Thank you!

https://github.com/hokumski/capstone-project-datascience