Next Word Prediction App

Mateo Cordoba

Introduction

Objective: Build a predictive text application that suggests the next word as the user types.
Technology Stack:
- R for data processing and modeling
- Shiny for the web application
- HTML, CSS, JavaScript for custom UI

Data Preparation

Data Sources:
- News, blogs, and Twitter datasets (English language).
Sampling:
- Sample 5% of the total data for analysis.
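
As a rough illustration of the sampling step, the sketch below keeps a reproducible 5% of the lines in a character vector. The helper name `sample_lines`, the seed, and the toy input are assumptions made for this example; the original script may sample differently (for instance per file).

```r
# Hypothetical helper: keep a random fraction of the lines in a character vector.
sample_lines <- function(lines, fraction = 0.05, seed = 1234) {
  set.seed(seed)  # fixed seed so the same sample is drawn on every run
  n_keep <- round(length(lines) * fraction)
  lines[sort(sample.int(length(lines), n_keep))]
}

# Toy stand-in for one source file (in practice: readLines() on each dataset).
corpus_lines <- sprintf("line number %d", 1:200)
sampled <- sample_lines(corpus_lines)
length(sampled)  # 10 lines, i.e. 5% of 200
```
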
Data Cleaning Steps:
- Remove non-English characters, URLs, email addresses, Twitter handles, and hashtags.
- Strip out punctuation and numbers.
- Remove profane words using a pre-defined list.
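
The cleaning steps above could look roughly like this in base R. This is a minimal sketch, not the project's actual code: the function name `clean_corpus_line`, the regular expressions, and the two-word `profanity` placeholder are all assumptions, and "non-English characters" is interpreted here as "non-ASCII characters".

```r
# Hypothetical cleaning helper built on base R regular expressions.
# `profanity` is a placeholder standing in for the pre-defined word list.
clean_corpus_line <- function(x, profanity = c("darn", "heck")) {
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")      # drop non-ASCII bytes
  x <- gsub("(https?://|www\\.)\\S+", " ", x, perl = TRUE)    # URLs
  x <- gsub("\\S+@\\S+", " ", x, perl = TRUE)                 # email addresses
  x <- gsub("[@#]\\w+", " ", x, perl = TRUE)                  # Twitter handles, hashtags
  x <- gsub("[[:punct:][:digit:]]+", " ", x)                  # punctuation, numbers
  bad <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
  x <- gsub(bad, " ", x, ignore.case = TRUE, perl = TRUE)     # profane words
  trimws(gsub("[[:space:]]+", " ", x))                        # collapse whitespace
}

raw <- "Check https://example.com now!! @user #tag mail bob@example.com 42 times, darn caf\u00e9"
cleaned <- clean_corpus_line(raw)
cleaned  # "Check now mail times caf"
```

Email addresses are removed before handles so that the `@` pattern does not leave half of an address behind.
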

N-gram Construction

Tokenization:
- Text is tokenized into unigrams, bigrams, trigrams, and quadgrams.
Frequency Calculation:
- Count how often each n-gram occurs in the sampled text.
- Store the counts in data frames.
Data Storage:
- Save each frequency table (unigram, bigram, trigram, quadgram) to its own .RData file for later use.
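
To make the construction concrete, here is a small base R sketch that builds n-grams, counts them, and saves one table. The helper names, the `ngram`/`freq` column names, and the output file name are assumptions for illustration; a text-mining package may well have been used in the real pipeline.

```r
# Hypothetical helper: all n-grams of order n from one cleaned line,
# with words joined by "_" to match the stored n-gram format.
make_ngrams <- function(line, n) {
  words <- strsplit(line, "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = "_"),
         character(1))
}

# Count n-grams across all lines; returns a data frame sorted by frequency.
ngram_frequencies <- function(lines, n) {
  grams <- unlist(lapply(lines, make_ngrams, n = n))
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = names(counts), freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

toy_lines <- c("i love new york", "i love new ideas", "new york is big")
bigram_freq <- ngram_frequencies(toy_lines, 2)

# One file per n-gram order; the path and file name are placeholders.
out_file <- file.path(tempdir(), "bigram_freq.RData")
save(bigram_freq, file = out_file)
```

The same call with `n = 1`, `3`, or `4` would produce the unigram, trigram, and quadgram tables.
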

Functions

cleaning_text_input()

-   Convert text to lowercase.
-   Remove punctuation, digits, and stopwords.
-   Replace spaces with underscores to match n-gram format.
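
A minimal sketch of these three steps follows. It is named `cleaning_text_input_sketch` to make clear that it is not the project's own `cleaning_text_input()`, and the short stopword vector is a placeholder for whatever stopword list the app actually uses.

```r
# Hypothetical re-creation of the input-cleaning steps listed above.
cleaning_text_input_sketch <- function(text,
                                       stopwords = c("the", "a", "an", "of", "to", "and")) {
  text <- tolower(text)                                   # lowercase
  text <- gsub("[[:punct:][:digit:]]+", " ", text)        # punctuation, digits
  words <- strsplit(trimws(text), "[[:space:]]+")[[1]]
  words <- words[nzchar(words) & !(words %in% stopwords)] # stopwords
  paste(words, collapse = "_")                            # spaces -> underscores
}

cleaning_text_input_sketch("The 2 cats of New York!")  # "cats_new_york"
```
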

next_word_function()

-   Count the words in the cleaned input.
-   Match the trailing words of the input against the highest-order n-gram table available (quadgrams first).
-   If no match is found, back off to the next lower order.
-   If no n-gram matches at all, return the most frequent unigram.
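
The back-off logic described above can be sketched as follows. The toy frequency tables, the `ngram`/`freq` column names, and the function name `next_word_sketch` are assumptions for this example; the real `next_word_function()` works on the saved .RData tables and may differ in detail.

```r
# Toy frequency tables keyed by n-gram order (stand-ins for the saved tables).
ngram_tables <- list(
  `1` = data.frame(ngram = c("the", "york", "city"),
                   freq = c(50L, 9L, 7L), stringsAsFactors = FALSE),
  `2` = data.frame(ngram = c("new_york", "new_ideas", "york_city"),
                   freq = c(8L, 3L, 5L), stringsAsFactors = FALSE),
  `3` = data.frame(ngram = c("love_new_york", "in_new_york"),
                   freq = c(4L, 2L), stringsAsFactors = FALSE),
  `4` = data.frame(ngram = "i_love_new_york",
                   freq = 2L, stringsAsFactors = FALSE)
)

# Hypothetical back-off predictor; expects input already joined with "_".
next_word_sketch <- function(cleaned_input, tables = ngram_tables, max_order = 4) {
  words <- strsplit(cleaned_input, "_", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]
  k <- min(length(words), max_order - 1)  # number of context words to try first
  while (k >= 1) {
    context <- paste(tail(words, k), collapse = "_")
    tbl <- tables[[as.character(k + 1)]]
    hits <- tbl[startsWith(tbl$ngram, paste0(context, "_")), , drop = FALSE]
    if (nrow(hits) > 0) {
      best <- hits$ngram[which.max(hits$freq)]
      return(sub(".*_", "", best))        # last word of the best n-gram
    }
    k <- k - 1                            # back off to a shorter context
  }
  uni <- tables[["1"]]
  uni$ngram[which.max(uni$freq)]          # fallback: most frequent unigram
}

next_word_sketch("i_love_new")  # "york" (quadgram match)
next_word_sketch("visit_new")   # "york" (backs off to bigrams)
next_word_sketch("zebra")       # "the"  (unigram fallback)
```
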

Shiny Application

UI:

-   Built using a custom HTML template with CSS and JavaScript.
-   Mimics a keyboard-like experience for the user.

Server Logic:

-   Calls `next_word_function()` to predict the next word as the user types.
-   Updates the UI in real time to show the predicted word.
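
For readers unfamiliar with Shiny, the reactive wiring might look like the sketch below. It deliberately swaps the custom HTML template for standard `textInput()` and `textOutput()` widgets, and uses a stand-in `predict_stub()` in place of `next_word_function()`; the input and output IDs are likewise assumptions.

```r
library(shiny)

# Stand-in predictor so the sketch is self-contained; the real app would
# call next_word_function() here.
predict_stub <- function(text) {
  words <- strsplit(tolower(trimws(text)), "[[:space:]]+")[[1]]
  last_word <- if (length(words) > 0) words[length(words)] else ""
  lookup <- c(new = "york", good = "morning")
  if (last_word %in% names(lookup)) lookup[[last_word]] else "the"
}

ui <- fluidPage(
  textInput("user_text", "Type here:"),
  textOutput("prediction")
)

server <- function(input, output, session) {
  # renderText() re-runs whenever input$user_text changes,
  # so the prediction follows the user's typing.
  output$prediction <- renderText({
    predict_stub(input$user_text)
  })
}

# Launch only in an interactive session.
if (interactive()) shinyApp(ui, server)
```
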

Try It Out! Check Out the Shiny App:

-   Visit the link below to try the predictive text application.

Shiny app

GitHub repository