Coursera Capstone Project - Word prediction app

pguillemi

9/11/2022

Rationale - about the app

Writing takes time
Predicting words saves time

– People usually write using a large but finite set of expressions
– A large corpus of texts can be broken down into short, manageable word sequences (n-grams)
– The most common n-grams can be used for prediction: given the first n words, we can identify the few words that usually come next

About the app

This app aims at speed without sacrificing accuracy

– Thorough input data pre-processing
– Simple yet effective scoring system and prediction functions
– App with neat interface and two modes to select from

Input data and processing

The text corpus consisted of ~70 M words from Twitter, blogs and news
– A custom function was created to capture 95% of the most used words in each source
– Only words that appeared in all three sources were kept, which yields a dictionary of ~10,000 words
– Profanity was removed
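
The dictionary steps above can be sketched as follows. This is an illustrative Python sketch, not the app's R code: the function names are hypothetical, and it assumes "95% of most used words" means the most frequent words whose counts together cover 95% of all word occurrences in a source.

```python
from collections import Counter

def top_coverage_words(tokens, coverage=0.95):
    """Most frequent words whose counts together cover `coverage` of all tokens.

    Assumption: this cumulative-coverage reading is one interpretation of
    "95% of most used words"; the original custom function is not shown.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    kept, running = set(), 0
    for word, n in counts.most_common():
        if running >= coverage * total:
            break
        kept.add(word)
        running += n
    return kept

def build_dictionary(sources, coverage=0.95, profanity=frozenset()):
    """Keep words present in every source's top list, then drop profanity.

    `sources` maps a source name (e.g. "twitter") to its list of tokens.
    """
    per_source = [top_coverage_words(toks, coverage) for toks in sources.values()]
    common = set.intersection(*per_source)
    return common - set(profanity)
```

Intersecting the per-source lists is what keeps source-specific vocabulary (for example, Twitter slang) out of the final dictionary.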

Extensive n-gram creation… filtered
– The whole corpus was used to create prediction n-grams with custom functions and file parsing, using tidytext and the tidyverse packages
– 2- to 6-word n-grams were extracted and split into n-gram ~ result combos (the leading 1 to 5 words and the word that follows them)
– Only those that appeared more than once were kept
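
A minimal sketch of the extraction and filtering described above, written in Python for illustration (the app itself relies on tidytext and custom R functions). The names `ngram_combos` and `keep_repeated` are hypothetical, and the input is assumed to be an already tokenized, cleaned list of words.

```python
from collections import Counter

def ngram_combos(tokens, min_n=2, max_n=6):
    """Count (prefix, result) combos for every n-gram of length min_n..max_n.

    The prefix is the first n-1 words of the n-gram; the result is its last word.
    """
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            counts[(tuple(gram[:-1]), gram[-1])] += 1
    return counts

def keep_repeated(counts, min_count=2):
    """Keep only the combos that appear more than once in the corpus."""
    return {combo: c for combo, c in counts.items() if c >= min_count}
```

Dropping combos seen only once removes most of the long tail, which is typically where the bulk of the table size comes from.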

Simple scoring - robust prediction
– Inside the functions, words were replaced by numeric IDs to reduce data file sizes
– Each n-gram ~ result combo was given a numerical score based on the number of words in the n-gram (1 to 5), its frequency in the corpus (a proportion in the range of 0 to 1) and a 1-point deduction for common stop-words
– The prediction function uses the data.table package's fast joins and filters, then adds up the scores for each result to provide the most probable next words
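
The scoring and lookup can be sketched roughly as below. This Python sketch stands in for the data.table joins and makes several assumptions that the slides do not confirm: the three score components are simply added, the frequency is taken as the combo's share among all combos with the same prefix, and the stop-word list is a tiny illustrative one. All names are hypothetical.

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "of", "to", "and"}  # illustrative list, not the app's

def score_combos(counts, stop_words=STOP_WORDS):
    """Build prefix -> {result: score} from (prefix, result) -> count.

    Assumed score: prefix length + share of the prefix's occurrences,
    minus 1 when the result is a stop-word.
    """
    prefix_totals = defaultdict(int)
    for (prefix, _), c in counts.items():
        prefix_totals[prefix] += c
    table = defaultdict(dict)
    for (prefix, result), c in counts.items():
        score = len(prefix) + c / prefix_totals[prefix]
        if result in stop_words:
            score -= 1
        table[prefix][result] = score
    return table

def predict(text, table, top=3, max_prefix=5):
    """Look up the last 1..max_prefix words and add up the scores per result."""
    words = text.lower().split()
    totals = defaultdict(float)
    for n in range(1, min(max_prefix, len(words)) + 1):
        prefix = tuple(words[-n:])
        for result, score in table.get(prefix, {}).items():
            totals[result] += score
    # Highest total first; ties broken alphabetically for stable output.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top]]
```

Because longer prefixes contribute a larger integer part, a result backed by a long matching n-gram outranks one that only matches the last word, while the fractional part separates results within the same n-gram length.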

App with neat interface and two modes to select from

A Shiny app was created, with two modes

Trigger mode: text is entered and prediction runs after pressing the “Go!” button

Real-time mode: text input is evaluated as it is typed, and prediction works in two ways
– When the last character is a space, the full string is used to predict a whole word
– When the last characters are letters or numbers, they are treated as the beginning of the next word, and the results, ordered by score, are updated accordingly
– A short debounce had to be added to avoid overflowing server memory. This limitation isn’t needed when running the app locally
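
The two real-time behaviours can be sketched as a small dispatch function. This is a hedged Python illustration rather than the app's Shiny code: `realtime_predict` and the `candidates_for` callback are hypothetical names, any non-space trailing character is treated as part of a word fragment, and the debounce itself is left out.

```python
def realtime_predict(text, candidates_for, top=3):
    """Predict the next word, or complete the word currently being typed.

    `candidates_for(context_words)` is an assumed callback returning a list
    of (word, score) pairs for the given context.
    """
    if not text.strip():
        return []
    if text[-1].isspace():
        # Input ends with a space: predict a whole new word from the full string.
        context, fragment = text.split(), ""
    else:
        # Input ends mid-word: the trailing characters start the next word.
        *context, fragment = text.split()
    ranked = sorted(candidates_for(context), key=lambda ws: (-ws[1], ws[0]))
    matches = [word for word, _ in ranked if word.startswith(fragment.lower())]
    return matches[:top]
```

Filtering after ranking keeps the score order intact, so the suggestions narrow down as more letters are typed instead of being reshuffled.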

Press predicted words and walk through predictions
– Predictions are shown as buttons that can be pressed to update the input directly!

Easy text input reset
– Just press the reset button to clear text input

I want to use it!

App: https://pguillemi.shinyapps.io/Word_predictor_pguillemi

Milestone report: https://rpubs.com/pguillemi/957851
(Note: this report analysed 90% of the most common words; afterwards, the app’s dictionary was upgraded to cover 95% of the most common words)

Questions and comments:

Thanks!!