title: "Next word prediction"
author: "Enrique Figueroa"
date: "2021-12-19"

Next Word Prediction

Capstone Project

by Enrique Figueroa

December 19, 2021

Introduction

  • The Shiny App predicts the next word of a user-entered English phrase.
  • Next word prediction is a familiar feature of, for instance, word processors and programming IDEs. We try to mimic its basic functionality in R code.
  • Text sources have been provided by Coursera. These are English texts from three sources: news, tweets and blog contributions.
  • The user enters a short phrase (up to three words) in the App input box. The App finds and displays the most probable next words, ranked by the conditional probability of each word following the entered phrase.

  • The App is stored at shinyapps.io.
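
The conditional probability mentioned above can be written out for the three-word case. The source does not state which estimator the App uses, so the following is an assumption: the usual maximum-likelihood estimate from n-gram counts, where $w_1, w_2$ are the last two words entered and $w_3$ is a candidate next word.

$$
P(w_3 \mid w_1, w_2) \approx \frac{\mathrm{count}(w_1\, w_2\, w_3)}{\mathrm{count}(w_1\, w_2)}
$$

Under this assumption the denominator is the same for every candidate $w_3$, so ranking candidates by probability is equivalent to ranking the matching 3-grams by raw frequency.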

Data Processing

  • As reported in Part 1 of the course, we worked on a random sample of the large text bodies provided.
  • First, we remove punctuation, numbers, extra white space and graph characters; convert all words to lowercase; and drop words that occur too frequently.
  • Next, a corpus is created in which all words are indexed, a necessary step for creating n-grams (“contiguous sequence of n items from a given sample of text”).
  • Afterwards, ordered tables of the most frequent 1-, 2- and 3-grams are created and stored in .Rdata file format.
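
The processing steps above can be sketched in base R. This is a minimal illustration, not the project's actual code: the helper names `clean_text` and `ngram_table`, the sample sentences and the output file name are all assumptions, and the removal of overly frequent words is left out for brevity.

```r
# Hypothetical helper: normalise raw text roughly as described in the list above.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]+", " ", x)  # drop punctuation
  x <- gsub("[[:digit:]]+", " ", x)  # drop numbers
  x <- gsub("\\s+", " ", x)          # collapse runs of white space
  trimws(x)
}

# Hypothetical helper: frequency table of n-grams, most frequent first.
ngram_table <- function(lines, n) {
  tokens <- strsplit(clean_text(lines), " ", fixed = TRUE)
  grams <- unlist(lapply(tokens, function(w) {
    w <- w[nzchar(w)]
    if (length(w) < n) return(character(0))
    starts <- seq_len(length(w) - n + 1)
    vapply(starts, function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  if (length(grams) == 0) {
    return(data.frame(ngram = character(0), freq = integer(0),
                      stringsAsFactors = FALSE))
  }
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = names(counts), freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Made-up sample standing in for the news, tweet and blog texts.
sample_lines <- c("The cat sat on the mat.",
                  "The cat sat on the sofa!",
                  "A dog sat on the mat 2 times")

bigrams  <- ngram_table(sample_lines, 2)
trigrams <- ngram_table(sample_lines, 3)

# Store the tables for later use by the App (file name is an assumption).
save(bigrams, trigrams, file = file.path(tempdir(), "ngrams.Rdata"))
```

A dedicated text-mining package would probably be a better fit for the full data set; base R is used here only to keep the sketch self-contained.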

Shiny App

  • The bag-of-words (BoW) tables created in the processing step are available to the App as .Rdata files.

  • The user's input is processed so that it can be fed to the function that retrieves the most probable words. For instance:

    • If more than three words are provided, only the last two are considered.
  • The top 3 predictions will be returned if available.

  • If a search of the 3-gram table is unsuccessful, the App falls back to a search of the 2-gram table.

  • The App can be found at github.com/efignav.
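
The lookup logic described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the App's own code: the function names `lookup` and `predict_next` are hypothetical, the two tables hold made-up counts in place of the stored .Rdata contents, and the sketch always uses the last two words of the input.

```r
# Toy n-gram tables standing in for the stored .Rdata contents (made-up counts).
trigrams <- data.frame(
  ngram = c("thanks for the", "thanks for your", "for the first", "for the record"),
  freq  = c(40L, 25L, 12L, 9L),
  stringsAsFactors = FALSE
)
bigrams <- data.frame(
  ngram = c("of the", "of a", "of course", "of my"),
  freq  = c(90L, 30L, 20L, 10L),
  stringsAsFactors = FALSE
)

# Find n-grams that start with `prefix` and return the last word of the
# k most frequent ones.
lookup <- function(tbl, prefix, k) {
  hits <- tbl[startsWith(tbl$ngram, paste0(prefix, " ")), , drop = FALSE]
  hits <- hits[order(-hits$freq), , drop = FALSE]
  sub("^.* ", "", head(hits$ngram, k))
}

predict_next <- function(phrase, k = 3) {
  # Normalise the input the same way the training text was normalised.
  cleaned <- trimws(gsub("[^a-z ]+", " ", tolower(phrase)))
  words <- strsplit(cleaned, " +")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return(character(0))

  out <- character(0)
  if (length(words) >= 2) {
    # Try the 3-gram table with the last two words as the prefix.
    out <- lookup(trigrams, paste(tail(words, 2), collapse = " "), k)
  }
  if (length(out) == 0) {
    # Fall back to the 2-gram table with only the last word.
    out <- lookup(bigrams, tail(words, 1), k)
  }
  out
}

predict_next("Many thanks for the")  # uses "for the" as the 3-gram prefix
predict_next("kind of")              # no 3-gram match, falls back to 2-grams
```

Returning an empty character vector when nothing matches lets the calling code decide what to show the user in that case.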

User Instructions

Since the goal is only a proof of concept for the algorithms behind next word prediction, the user interface is simple:

  • On the left panel, enter a phrase composed of 1, 2 or 3 words and press the Enter button.
  • The top 3 predictions are displayed on the right panel.
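
A two-panel layout like the one described above could be built along these lines. This is only a sketch: the input and output ids (`phrase`, `enter`, `predictions`) and the labels are assumptions, and `predict_next` is a placeholder that returns fixed words instead of querying the n-gram tables.

```r
library(shiny)

# Placeholder predictor (hypothetical); the real App would query the n-gram tables.
predict_next <- function(phrase, k = 3) head(c("the", "a", "to"), k)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  sidebarLayout(
    # Left panel: phrase input and the Enter button.
    sidebarPanel(
      textInput("phrase", "Enter a phrase (1 to 3 words):"),
      actionButton("enter", "Enter")
    ),
    # Right panel: the top predictions.
    mainPanel(
      h4("Top predictions"),
      verbatimTextOutput("predictions")
    )
  )
)

server <- function(input, output, session) {
  # Recompute only when the button is pressed, not on every keystroke.
  top_words <- eventReactive(input$enter, predict_next(input$phrase))
  output$predictions <- renderPrint(top_words())
}

app <- shinyApp(ui, server)
# Launch interactively with: runApp(app)
```

Tying the computation to the button press with `eventReactive` matches the "enter a phrase, then press the button" flow and avoids a table search on every keystroke.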

Further Development

  • Incorporate recent R libraries into the preprocessing steps.
  • Extend the model to 4- and 5-grams.
  • Record all user input so the model can learn from it.