Data Science Specialization

Capstone Project

Mihaly Varadi
Biotech engineering

Introduction

The following few slides offer a brief introduction to a word prediction tool designed for the Data Science Specialization Capstone Project.

The goal of the project was to create an online tool that uses a large text dataset to predict the next word in a sentence quickly, and preferably with high accuracy.

Algorithm design

Briefly, the algorithm design steps were the following:

  • Dataset assembly - English Twitter, News and Blogs data combined
  • Data sampling - Due to RAM limitations the combined dataset had to be sampled; 500k random lines of text were used
  • The text corpus was processed: numbers, extra whitespace and punctuation were removed
  • Stop words were not filtered out, as filtering them decreased the accuracy
  • 3-gram Bayesian probabilities were calculated and saved
  • The algorithm takes a user-provided input string and, based on its last two words, attempts to predict the next word using the pre-calculated probabilities
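
The steps above can be sketched in a few lines of code. This is a minimal illustration in Python, not the code behind the app; the function names (sample_lines, tokenize, build_trigram_table, predict_next) are hypothetical, lowercasing is an added assumption, and the "probabilities" here are plain relative frequencies of the third word given the previous two.

```python
import random
import re
from collections import Counter, defaultdict


def sample_lines(lines, k, seed=0):
    """Draw up to k random lines, mirroring the sampling step."""
    rng = random.Random(seed)
    return rng.sample(lines, min(k, len(lines)))


def tokenize(text):
    """Remove numbers and punctuation, collapse whitespace, keep stop words.

    Lowercasing is an assumption; the slides do not mention it.
    """
    return re.sub(r"[^a-z\s]+", " ", text.lower()).split()


def build_trigram_table(lines):
    """Map each two-word context to the relative frequency of the next word."""
    counts = defaultdict(Counter)
    for line in lines:
        tokens = tokenize(line)
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            counts[(w1, w2)][w3] += 1
    table = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        table[context] = {word: n / total for word, n in followers.items()}
    return table


def predict_next(table, text):
    """Return the most probable word after the last two words, or None."""
    tokens = tokenize(text)
    if len(tokens) < 2:
        return None
    followers = table.get((tokens[-2], tokens[-1]))
    if not followers:
        return None
    return max(followers, key=followers.get)


corpus = [
    "I love data science",
    "I love data science a lot",
    "I love data analysis",
]
table = build_trigram_table(sample_lines(corpus, 500_000))
print(predict_next(table, "We love data"))
```

Because the table is computed once and each prediction is a single dictionary lookup, response time does not depend on the size of the sampled corpus, only memory use does.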

Implementation

The word prediction tool was implemented as a Shiny app, deployed at https://mvaradi.shinyapps.io/data_science_capstone/

To use the app, type a sentence on the left side of the screen and press 'Submit'. The predicted word will appear on the right side of the screen.

The algorithm is fast, and as accurate as the underlying data sample allows.

Summary

The word prediction tool presented here is fast, lightweight and reasonably accurate, relying on Bayesian probabilities for predicting the next word in a sentence.

The algorithm is scalable: increasing the size of the training set and including 4-grams could further increase the accuracy.
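
One way 4-grams might be added is sketched below in Python. This is an assumption about a possible extension, not something the current tool does: the function names are hypothetical, and the fallback rule (try the longest matching context first, then shorter ones) is a simple backoff chosen for illustration.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase (an assumption), strip numbers and punctuation, split on whitespace."""
    return re.sub(r"[^a-z\s]+", " ", text.lower()).split()


def build_ngram_counts(lines, max_n=4):
    """Count next words for every context of 1 to max_n - 1 words."""
    counts = defaultdict(Counter)
    for line in lines:
        tokens = tokenize(line)
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                counts[context][tokens[i + n - 1]] += 1
    return counts


def predict_with_backoff(counts, text, max_n=4):
    """Use the longest context seen in training; fall back to shorter ones."""
    tokens = tokenize(text)
    for size in range(min(max_n - 1, len(tokens)), 0, -1):
        followers = counts.get(tuple(tokens[-size:]))
        if followers:
            return followers.most_common(1)[0][0]
    return None


corpus = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "a dog sat on the mat",
]
counts = build_ngram_counts(corpus)
print(predict_with_backoff(counts, "my cat sat on the"))
```

Longer contexts are more specific but are seen less often, so the fallback to shorter contexts keeps the tool able to answer when a three-word context never occurred in the training sample.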