Shiny Word Predictor App

Nick
January 2020

Introduction

This Shiny app forms part of the Data Science Capstone Project to develop a Natural Language Processing (NLP) word prediction tool.

The Word Predictor takes words entered by the user and predicts the next word based on a corpus of documents that includes Twitter feeds, News, and Blogs.

The app is intended to provide functionality similar to the word suggestions on a smartphone messaging platform.

The Data

The data was made available to students via this link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.

The quanteda package was used to process the text data. Each of the three data sets was sampled at 30% of its size, and the samples were combined with their source (news feeds, Twitter feeds or blog posts) recorded as a new document variable.

An example of the sampling algorithm to reduce the object size and improve performance is shown below:

corp_blogs <- corpus(blogs[sample(length(blogs), length(blogs) * len.prop)])

Here len.prop is the sampling proportion (0.3, or 30%).
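The same sampling step can be sketched in base R for all three sources. This is a minimal illustration under stated assumptions, not the app's actual code: the toy vectors, the helper sample_lines and the fixed seed are hypothetical, and the real project built quanteda corpus objects rather than a data frame.

```r
# Hypothetical stand-ins for the three raw text vectors (one element per line).
blogs   <- paste("blog line", 1:10)
news    <- paste("news line", 1:20)
twitter <- paste("tweet", 1:30)

len.prop <- 0.3   # sampling proportion described in the text
set.seed(2020)    # assumption: a fixed seed so the sample is reproducible

# Draw a random subset of a character vector, without replacement.
sample_lines <- function(x, prop) {
  x[sample(length(x), round(length(x) * prop))]
}

sampled <- list(
  blogs   = sample_lines(blogs, len.prop),
  news    = sample_lines(news, len.prop),
  twitter = sample_lines(twitter, len.prop)
)

# Keep the source next to each line, mirroring the document-variable idea.
combined <- data.frame(
  text   = unlist(sampled, use.names = FALSE),
  source = rep(names(sampled), lengths(sampled)),
  stringsAsFactors = FALSE
)

table(combined$source)
```

Setting a seed is a design choice rather than something the original states; without it, every rebuild of the model would draw a different 30% sample.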

The Data Model

The model was built by sampling the three data sets and combining them into one corpus via the process corpus.R function.

Three n-gram models were built, consisting of 2-, 3- and 4-grams. These are combined into a list and loaded into the Shiny application as an RDS file.

The four most frequently occurring predicted end words are presented as action buttons in descending order of frequency in the Shiny app.
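The model-building step described above can be sketched as follows. This is a base R illustration only: the original used quanteda for tokenisation and n-gram counting, and the toy texts, the helpers make_ngrams and ngram_table, and the temporary file path are all hypothetical. The column names ngram and frequency follow those used later in the app's filtering code.

```r
# Toy corpus standing in for the sampled text (hypothetical data).
texts <- c("thanks for the follow", "thanks for the help", "thanks for the follow")

# Return all n-grams of size n from one string, as space-joined strings.
make_ngrams <- function(txt, n) {
  tokens <- strsplit(tolower(txt), "\\s+")[[1]]
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count n-grams across the corpus and sort by descending frequency.
ngram_table <- function(texts, n) {
  counts <- table(unlist(lapply(texts, make_ngrams, n = n)))
  df <- data.frame(ngram = names(counts), frequency = as.integer(counts),
                   stringsAsFactors = FALSE)
  df[order(-df$frequency), ]
}

# Index 1 holds 2-grams, index 2 holds 3-grams, index 3 holds 4-grams,
# so the list can be indexed by the number of context words.
comb.list <- lapply(2:4, function(n) ngram_table(texts, n))

# Round trip through an RDS file, as the Shiny app would do at start-up.
rds_path <- tempfile(fileext = ".rds")   # hypothetical location
saveRDS(comb.list, rds_path)
comb.list <- readRDS(rds_path)

head(comb.list[[3]], 4)
```

Storing the tables pre-sorted means the app only has to filter and take the first four rows at prediction time.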

Key Algorithms Developed

A word count function determines how many words the user has entered, so that a suitable n-gram model can be selected. This function was sourced from Stack Overflow (https://stackoverflow.com/a/55396237/8158951):

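# Count the words in txt; the pattern is intended to keep hyphenated
# words and contractions together as single words.
words <- function(txt) {
  length(attributes(gregexpr("(\\w|\\w\\-\\w|\\w\\'\\w)+", txt)[[1]])$match.length)
}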

The number of words is counted using words(start.words); the count, capped at three, is assigned to the variable last.x.words:

last.x.words <- as.numeric(ifelse(words(start.words) >= 3, 3, words(start.words)))
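As a worked example of the counting and capping step, the sketch below applies the function to a hypothetical five-word input. The variable last.x.words.alt is not in the original; it only shows that min() would express the same cap more briefly.

```r
# Word counter as given in the text.
words <- function(txt) {
  length(attributes(gregexpr("(\\w|\\w\\-\\w|\\w\\'\\w)+", txt)[[1]])$match.length)
}

start.words <- "thanks for the follow and"   # hypothetical user input

# Cap the context length at three words, as in the original expression.
last.x.words <- as.numeric(ifelse(words(start.words) >= 3, 3, words(start.words)))

# Equivalent, shorter form of the same cap (an alternative, not the app's code).
last.x.words.alt <- min(words(start.words), 3)

last.x.words
```

Note that gregexpr() reports "no match" as a single -1, so this counter returns 1 rather than 0 for an empty string.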

Where three or more words are entered, the 4-gram model is applied to predict the fourth word in the set. Where no match is found, the algorithm works backwards through the shorter models (the 3-gram, then the 2-gram).
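The back-off behaviour can be sketched as a loop from the longest usable context to the shortest. This is an assumption-laden illustration in base R, not the app's implementation: the function predict_next, its arguments and the toy tables are hypothetical, and prefix matching with startsWith() stands in for whatever pattern the app actually builds.

```r
# Hypothetical n-gram tables, indexed by number of context words:
# [[1]] 2-grams, [[2]] 3-grams, [[3]] 4-grams.
comb.list <- list(
  data.frame(ngram = c("the follow", "the help"), frequency = c(5, 3),
             stringsAsFactors = FALSE),
  data.frame(ngram = "for the follow", frequency = 2,
             stringsAsFactors = FALSE),
  data.frame(ngram = "thanks for the follow", frequency = 1,
             stringsAsFactors = FALSE)
)

# Predict the next word, backing off to shorter contexts when nothing matches.
predict_next <- function(start.words, models, top = 4) {
  tokens <- strsplit(trimws(tolower(start.words)), "\\s+")[[1]]
  if (length(tokens) == 0) return(character(0))
  for (k in seq(min(length(tokens), length(models)), 1)) {
    context <- paste(tail(tokens, k), collapse = " ")
    tbl <- models[[k]]
    hits <- tbl[startsWith(tbl$ngram, paste0(context, " ")), ]
    if (nrow(hits) > 0) {
      hits <- hits[order(-hits$frequency), ]
      # The prediction is the final word of each matching n-gram.
      return(head(sub(".* ", "", hits$ngram), top))
    }
  }
  character(0)   # nothing matched at any level
}

predict_next("thanks for the", comb.list)       # matched in the 4-gram table
predict_next("many thanks to the", comb.list)   # backs off to the 2-gram table
```

In the second call neither the 4-gram nor the 3-gram table contains the context, so the result comes from the 2-gram entries beginning with "the".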

dplyr is used to select the appropriate n-gram model from the RDS list object, indexed with [[last.x.words]], filter it against the entered words, and return the first four rows sorted by descending frequency:

comb.list[[last.x.words]] %>%
  filter(stringr::str_detect(ngram, first.pattern)) %>%
  arrange(desc(frequency)) %>%
  ungroup() %>%
  slice(1:4)
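The pipeline above relies on first.pattern, which is not defined in this document. One plausible construction, shown here purely as an assumption, anchors the last few entered words at the start of the n-gram; the toy table, the context variable and the final extraction of the predicted word with stringr::word() are likewise illustrative.

```r
library(dplyr)
library(stringr)

# Hypothetical 4-gram table in slot 3; the other slots are left empty here.
comb.list <- vector("list", 3)
comb.list[[3]] <- data.frame(
  ngram     = c("thanks for the help", "thanks for the follow", "no thanks for the"),
  frequency = c(1, 4, 9),
  stringsAsFactors = FALSE
)

start.words  <- "thanks for the"   # hypothetical user input
last.x.words <- 3

# Assumed construction: take the last n words and anchor them, followed by a
# space, at the start of the n-gram. Regex metacharacters typed by the user
# are not escaped in this sketch.
context       <- word(start.words, -last.x.words, -1)
first.pattern <- paste0("^", context, " ")

predictions <- comb.list[[last.x.words]] %>%
  filter(str_detect(ngram, first.pattern)) %>%
  arrange(desc(frequency)) %>%
  slice(1:4) %>%
  mutate(prediction = word(ngram, -1)) %>%   # final word of each n-gram
  pull(prediction)

predictions
```

The leading "^" keeps an n-gram such as "no thanks for the" from matching, even though it has the highest frequency in the toy table.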

Word Predictor App Instructions

The default sentence can be modified by typing over it, or a prediction can be accepted by clicking the relevant action button. Do not add a space after the last word; if you do, the predictor will not interpret the input. The text box can be enlarged by dragging the handle in its lower right corner.