Bryan Briones
2022
News articles, blogs, and tweets (a la Twitter), especially those written in English, provide a body of text so rich that it is practically a goldmine for natural language processing.
Text mining and natural language processing were performed on a collection of texts from news, blogs, and tweets to create a corpus of sampled texts. The Shiny app named nextword was built on this corpus.
In nextword, the user types a phrase of any length, after which the app predicts a single word that comes after that phrase.
nextword was built in the R programming language with the help of appropriate packages.
The blog, tweet, and news text files come from a corpus compiled by the company SwiftKey: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
Lines from the blog, tweet, and news text files were read into R and then converted to data frames.
These files are so large that performing NLP or running a word prediction algorithm on the entire corpus would be impractical and would try anyone's patience. A random sample was therefore taken from each data frame; 3% was settled on as a reasonable size.
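The sampling step can be sketched as follows. The app itself was written in R; this Python version is only an illustration, and the function name `sample_lines`, the fixed seed, and the toy data are assumptions rather than details from the original project.

```python
import random


def sample_lines(lines, fraction=0.03, seed=2022):
    """Return a random sample of about `fraction` of the lines, without replacement."""
    if not lines:
        return []
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the sample reproducible
    k = max(1, round(len(lines) * fraction))  # keep at least one line
    return rng.sample(lines, k)


if __name__ == "__main__":
    # Toy stand-ins for the three sources; the real files are far larger.
    blogs = [f"blog line {i}" for i in range(1000)]
    tweets = [f"tweet line {i}" for i in range(2000)]
    news = [f"news line {i}" for i in range(500)]

    # Sample each source separately, then combine.
    samples = sample_lines(blogs) + sample_lines(tweets) + sample_lines(news)
    print(len(samples))
```

Sampling each source separately, rather than sampling the pooled lines, keeps every source represented in proportion to its own size.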
Sampled lines from the blog, tweet, and news data frames were combined into a single data frame, samples. This data frame underwent a data-cleaning process that included lower-casing (for the sake of uniformity), punctuation and number removal, and whitespace stripping.
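A minimal sketch of the cleaning steps named above, written in Python for illustration (the original work was done in R, and the exact cleaning functions it used are not stated here). The helper name `clean_line` is hypothetical, and this version removes ASCII punctuation only.

```python
import string

# Translation table that deletes ASCII punctuation and digits.
# Assumption: non-ASCII punctuation (for example curly quotes) is left untouched.
_DROP = str.maketrans("", "", string.punctuation + string.digits)


def clean_line(text):
    """Lower-case, drop punctuation and numbers, and collapse whitespace."""
    text = text.lower()
    text = text.translate(_DROP)
    # split() with no arguments discards leading, trailing, and repeated whitespace
    return " ".join(text.split())


if __name__ == "__main__":
    print(clean_line("  Hello, World!  It's 2022...  "))
```

Deleting punctuation outright turns "it's" into "its"; replacing punctuation with a space instead would give "it s", so the choice affects which n-grams appear later.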
The next step, tokenization, creates a series of n-grams. Three sets were made: two-word, three-word, and four-word n-grams.
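To make the tokenization step concrete, here is a small Python sketch (the app itself used R). It splits each cleaned line on whitespace and counts every run of n consecutive words; the function names `ngrams` and `count_ngrams` are illustrative, not taken from the project.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(lines, n):
    """Count n-grams over all lines; n-grams never span two lines."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(line.split(), n))
    return counts


if __name__ == "__main__":
    cleaned = ["the cat sat on the mat", "the cat ran away"]
    for n in (2, 3, 4):  # two-, three-, and four-word n-grams
        print(n, count_ngrams(cleaned, n).most_common(2))
```

A line shorter than n words simply contributes no n-grams of that size.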
The n-grams were saved as .RDS files and hosted in a GitHub repository, where the word prediction algorithm (written afterward) reads them to make its predictions.
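The text does not spell out how the prediction algorithm consults the n-grams, so the following is only one plausible approach, a simple backoff lookup, sketched in Python rather than the app's R. It tries the longest available context first (the last three words against the four-word n-grams) and falls back to shorter contexts when no match is found. The names `build_table` and `predict_next` and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def build_table(lines, n):
    """Map each (n-1)-word prefix to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens) - n + 1):
            prefix = tuple(tokens[i:i + n - 1])
            table[prefix][tokens[i + n - 1]] += 1
    return table


def predict_next(phrase, tables):
    """Return the most frequent follower, backing off from long to short prefixes.

    `tables` maps n (2, 3, 4) to the table built for that n-gram size.
    Returns None when no prefix of the phrase has been seen.
    """
    tokens = phrase.lower().split()
    for n in sorted(tables, reverse=True):
        if len(tokens) < n - 1:
            continue  # phrase is too short for this n-gram size
        prefix = tuple(tokens[-(n - 1):])
        followers = tables[n].get(prefix)
        if followers:
            return followers.most_common(1)[0][0]
    return None


if __name__ == "__main__":
    corpus = ["i love new york", "i love new york", "i love new shoes", "we love pizza"]
    tables = {n: build_table(corpus, n) for n in (2, 3, 4)}
    print(predict_next("I love new", tables))
    print(predict_next("they love", tables))
```

Because only the last few words of the phrase are used, the input can be of any length, which matches the behavior described for the app.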
The user types a phrase in the input box.
The app is available on the web at https://brnbrns.shinyapps.io/nextword.