Little Wordie-Teller: predicting your next word

Ashley You
25th March, 2018

This data product was built to fulfill the requirements of the Coursera Data Science Capstone Project.

Introduction

The goal of the Coursera Capstone project is to build a Shiny application that is able to predict the next word.

Little Wordie-Teller was designed for this purpose and deployed on shinyapps.io. It is a lightweight App built on a large corpus of text from HC Corpora, provided through the Coursera website.

This presentation aims to discuss:

  • Methods of data pre-processing;
  • Prediction algorithm;
  • Features of the App;
  • Future directions.

Methods

The dataset was downloaded using R. Only the English text files were used: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt.

Prediction algorithm

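Little Wordie-Teller is built on a Markov n-gram model, which assumes that the next word can be predicted from the few words that precede it (the previous N-1 words for an N-gram). A simple backoff strategy chooses the next word: the longest available context is tried first, then progressively shorter ones. If the input is not found in the training data, a word sampled from the top 100 unigrams serves as the “best guess”.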
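The backoff idea can be illustrated with a minimal sketch. The App itself is written in R; the Python below is not its actual code, only an illustration under stated assumptions: whitespace tokenization, a maximum order of three (trigrams), picking the single most frequent follower, and uniform sampling from the top unigrams. The names build_ngram_tables and predict_next are hypothetical.

```python
import random
from collections import Counter, defaultdict


def build_ngram_tables(tokens, max_n=3):
    """Count which word follows each context of 1..max_n-1 words."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            tables[n][context][tokens[i + n - 1]] += 1
    return tables, Counter(tokens)


def predict_next(text, tables, unigrams, top_k=100, rng=random):
    """Try the longest context first and back off to shorter ones.

    If no context matches, sample uniformly from the top_k most
    frequent unigrams (an assumed reading of the "best guess" step).
    """
    words = text.lower().split()
    for n in sorted(tables, reverse=True):
        if len(words) < n - 1:
            continue  # not enough input words for this order
        context = tuple(words[-(n - 1):])
        followers = tables[n].get(context)
        if followers:
            return followers.most_common(1)[0][0]
    candidates = [w for w, _ in unigrams.most_common(top_k)]
    return rng.choice(candidates)


# Toy corpus, for illustration only.
corpus = "i love data science and i love data products".split()
tables, unigrams = build_ngram_tables(corpus)
print(predict_next("I love", tables, unigrams))       # trigram match
print(predict_next("unseen love", tables, unigrams))  # backs off to bigram
print(predict_next("zzz qqq", tables, unigrams))      # unigram fallback
```

The first two calls both return "data": the first from the trigram table, the second after backing off to the bigram context "love". The third call finds no match at any order and falls back to a randomly chosen frequent word.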

How the App works

Little Wordie-Teller is hosted here. It is built with the flexdashboard package for Shiny. To increase loading speed, the App uses the shiny_prerendered runtime. Packages, n-gram data tables, and prediction functions are prerendered, so the App should load and predict fairly quickly.

Future directions

Little Wordie-Teller is fun to use because it responds almost instantly. However, its prediction accuracy is low, because the size limits on Shiny app deployment restrict how much n-gram data the App can carry.

My future directions from this project are:

  • Learn more about CSS to improve the App's interface;
  • Make the data preprocessing more robust. On my six-year-old computer, cleaning the data in R was not as smooth as I had hoped. Python may be more effective for cleaning text data, so it will be my next learning focus;
  • Explore other NLP models, such as neural-network-based language models.

For more details, please see my GitHub page here.