nextword: A Word Prediction App

Bryan Briones
2022

The Motivation

News articles, blogs, and tweets (a la Twitter), especially those written in English, provide a body of text so rich that it is practically a goldmine for natural language processing.

Text mining and natural language processing were applied to a collection of texts from news, blogs, and tweets to create a corpus of sampled texts. The Shiny app named nextword was built on this corpus.

In nextword, the user types a phrase of any length, after which the app predicts a single word that comes after that phrase.

The Inner Workings

nextword was built in the R programming language with the help of appropriate packages.

The corpus of blog, tweet, and news text files comes from a dataset compiled by the company SwiftKey: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.

Lines from the blog, tweet, and news text files were read into R and then converted to data frames.

These files are so large that performing NLP or running a word prediction algorithm on the entire corpus would be impractical and would try anyone's patience. A sample of each data frame was therefore taken instead, with the sampling rate settled at 3%.
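
The sampling step can be sketched in base R roughly as follows. This is a minimal sketch under stated assumptions, not the app's actual code: the helper name sample_lines, the seed, and the file name in the commented usage are hypothetical.

```r
# Hypothetical helper: keep a random fraction of the lines in a character vector.
sample_lines <- function(lines, fraction = 0.03, seed = 1234) {
  set.seed(seed)                               # assumed seed, for reproducibility
  n_keep <- ceiling(length(lines) * fraction)  # number of lines to keep
  lines[sample.int(length(lines), n_keep)]     # sample without replacement
}

# Usage on a real file (file name is an assumption about the archive contents):
# blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
# blogs_sample <- data.frame(text = sample_lines(blogs), stringsAsFactors = FALSE)

# Small in-memory demonstration
demo_lines  <- sprintf("line %d", 1:10)
demo_sample <- sample_lines(demo_lines, fraction = 0.5)
print(demo_sample)
```

Fixing the seed makes the sample repeatable, which helps when the n-gram tables need to be rebuilt later.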

The Inner Workings (continued)

Sampled lines from the blog, tweet, and news data frames were combined into a single data frame named samples. This data frame underwent a data-cleaning process that included lower-casing (for the sake of uniformity), punctuation and number removal, and white space stripping.
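
The cleaning steps listed above could look something like this in base R. The function name clean_text is hypothetical, and the app may well have used a text-mining package instead; this sketch only mirrors the four steps named on this slide.

```r
# Hypothetical helper mirroring the cleaning steps described above.
clean_text <- function(x) {
  x <- tolower(x)                    # lower-case for uniformity
  x <- gsub("[[:punct:]]+", "", x)   # remove punctuation
  x <- gsub("[[:digit:]]+", "", x)   # remove numbers
  x <- gsub("\\s+", " ", x)          # collapse runs of white space
  trimws(x)                          # strip leading and trailing white space
}

cleaned <- clean_text("  Hello, World!  It's 2022...  ")
print(cleaned)
```

Note that removing punctuation outright turns contractions such as "it's" into "its"; whether that is acceptable depends on how the n-grams are used.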

The next step, tokenization, creates a series of n-grams. Three sets were made: two-word, three-word, and four-word n-grams.
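
As an illustration of what tokenization into n-grams produces, here is a small base R sketch. The helper make_ngrams is hypothetical and is not the tokenizer the app actually used; it simply slides a window of n words across one cleaned line.

```r
# Hypothetical helper: build all n-word sequences from one cleaned line.
make_ngrams <- function(line, n) {
  words <- strsplit(line, " +")[[1]]
  words <- words[nzchar(words)]                 # drop empty tokens
  if (length(words) < n) return(character(0))   # line too short for this n
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

line <- "thanks for the follow"
two_word   <- make_ngrams(line, 2)
three_word <- make_ngrams(line, 3)
four_word  <- make_ngrams(line, 4)
print(two_word)
```

Applying such a function to every sampled line and counting the results (for example with table()) gives the frequency of each n-gram.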

The n-grams were saved as .RDS files and hosted in a GitHub repository. The word prediction algorithm, written afterward, reads these files and uses them to predict the next word.
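
The slides do not describe the prediction algorithm itself, so the following is only a sketch of one common approach: a simple back-off lookup that tries the longest available prefix first. The toy tables, their column names, the made-up counts, and the function predict_next are all assumptions for illustration.

```r
# Toy frequency tables with made-up counts; real tables would come from the corpus.
bigrams  <- data.frame(prefix = c("the", "the", "for"),
                       word   = c("end", "follow", "the"),
                       freq   = c(5, 2, 7), stringsAsFactors = FALSE)
trigrams <- data.frame(prefix = c("for the", "for the"),
                       word   = c("follow", "win"),
                       freq   = c(4, 1), stringsAsFactors = FALSE)

# Round-trip one table through an .RDS file, the storage format named above.
path <- tempfile(fileext = ".RDS")
saveRDS(trigrams, path)
trigrams <- readRDS(path)

# Hypothetical back-off lookup: tables must be ordered longest prefix first.
predict_next <- function(phrase, tables) {
  words <- strsplit(trimws(tolower(phrase)), " +")[[1]]
  for (tbl in tables) {
    k <- length(strsplit(tbl$prefix[1], " ")[[1]])  # words in this table's prefix
    if (length(words) < k) next                     # phrase too short, back off
    key  <- paste(tail(words, k), collapse = " ")
    hits <- tbl[tbl$prefix == key, ]
    if (nrow(hits) > 0) return(hits$word[which.max(hits$freq)])
  }
  NA_character_                                     # no match at any level
}

p1 <- predict_next("Thanks for the", list(trigrams, bigrams))
p2 <- predict_next("at the", list(trigrams, bigrams))
print(c(p1, p2))
```

In this sketch the first phrase is matched in the three-word table, while the second finds no match there and falls back to the two-word table.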

How the App Works...

The user types a phrase in the input box.

The app is available on the web at https://brnbrns.shinyapps.io/nextword.