Next-Word Prediction Shiny App

Pavel Zimin
September 13, 2017

Introduction

  • This Shiny app was created to fulfill requirements for the Capstone Project for the Data Science Specialization at Coursera.
  • The objective of this Shiny app is to predict the next word from the preceding text.
  • Next-word prediction is widely used in mobile phone applications to help users type.

Corpora

  • The corpora were provided by the Data Science Specialization at Coursera.
  • They were collected from online sources with a web crawler.
  • The corpora cover four languages: Finnish, Russian, English and German. This project uses only the English data set.
  • The data come from three sources: blogs, news and Twitter.

Algorithm

  • The data were first read and cleaned by converting letters to lowercase, removing punctuation and removing profanity.
  • Data from blogs, news and Twitter were combined and sampled for further analysis. Because the provided data set is large, only 1% of the data was used.
  • Unigrams, bigrams and trigrams were extracted and their frequencies counted.
  • The Katz back-off algorithm was used to build the model: trigram estimates are used where available, with some probability mass discounted and redistributed to bigrams and unigrams.
  • 5-fold cross-validation was performed to tune the model parameters.
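
The steps above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the app's actual code (a Shiny app would normally be written in R). The function names, the tiny example corpus, the placeholder profanity list and the fixed discount d = 0.5 are all assumptions. In particular, a fixed absolute discount stands in for the Good-Turing discounts of the classic Katz formulation, since the source does not say which discounting scheme was used.

```python
import re
from collections import Counter

# Hypothetical placeholder list; the app's real profanity list is not given.
PROFANITY = {"darn", "heck"}


def clean(text):
    """Lowercase, replace punctuation with spaces, and drop profanity."""
    text = re.sub(r"[^a-z\s']", " ", text.lower())
    return [w for w in text.split() if w not in PROFANITY]


def count_ngrams(sentences, n_max=3):
    """Return {n: Counter of n-gram tuples} for n = 1..n_max."""
    counts = {n: Counter() for n in range(1, n_max + 1)}
    for sentence in sentences:
        tokens = clean(sentence)
        for n in range(1, n_max + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts


def katz_prob(word, context, counts, d=0.5):
    """Back-off estimate of P(word | context) with a fixed discount d."""
    if not context:
        # Base case: plain unigram relative frequency.
        return counts[1][(word,)] / sum(counts[1].values())
    n = len(context) + 1
    # Words observed after this context, with their n-gram counts.
    seen = {g[-1]: c for g, c in counts[n].items() if g[:-1] == context}
    if not seen:
        # Context never observed: back off to the shorter context.
        return katz_prob(word, context[1:], counts, d)
    total = sum(seen.values())
    if word in seen:
        # Observed n-gram: use the discounted relative frequency.
        return (seen[word] - d) / total
    # Unseen n-gram: share the discounted mass among unseen words,
    # in proportion to their lower-order probabilities.
    alpha = d * len(seen) / total
    denom = 1.0 - sum(katz_prob(w, context[1:], counts, d) for w in seen)
    if denom <= 1e-12:
        return 0.0
    return alpha * katz_prob(word, context[1:], counts, d) / denom


def predict_next(text, counts, d=0.5):
    """Return the vocabulary word with the highest back-off probability."""
    context = tuple(clean(text)[-2:])  # last two words, for a trigram model
    vocab = [g[0] for g in counts[1]]
    return max(vocab, key=lambda w: katz_prob(w, context, counts, d))


corpus = [
    "I love data science",
    "I love data analysis",
    "I love data science!",
    "You love cats",
]
counts = count_ngrams(corpus)
print(predict_next("I love", counts))     # data
print(predict_next("love data", counts))  # science
```

In this sketch the discount d plays the role of a tunable model parameter: it is the kind of value that cross-validation would select by scoring predictions on held-out folds.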

Screenshot

Here is a screenshot of the app. The app features a simple interface: enter some text, and the app returns the predicted next word.

[Screenshot of the app]