Next word prediction app

Mateo Cordoba

Introduction

  • Objective: Build a predictive text application that suggests the next word as the user types.
  • Technology Stack:
    • R for data processing and modeling
    • Shiny for the web application
    • HTML, CSS, JavaScript for custom UI

Data Preparation

  • Data Sources:
    • News, blogs, and Twitter datasets (English language).
  • Sampling:
    • Sampled 5% of the total data for analysis.
  • Data Cleaning Steps:
    • Remove non-English characters, URLs, email addresses, Twitter handles, and hashtags.
    • Strip out punctuation and numbers.
    • Remove profane words using a pre-defined list.
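
The sampling and cleaning steps above could be sketched in base R roughly as follows. This is a minimal illustration, not the project's actual code: `raw_lines`, `sample_lines()`, `clean_corpus_line()`, and the regular expressions are all assumptions, and the profanity list is passed in as a plain character vector.

```r
# Hypothetical stand-in for lines read from the news, blogs, and Twitter files.
raw_lines <- c(
  "Check out https://example.com for more info!!!",
  "Email me at someone@example.com or ping @some_user #rstats",
  "I have 2 cats and 3 dogs."
)

# Keep a random fraction of the lines (5% by default, as in the slides).
sample_lines <- function(lines, fraction = 0.05, seed = 123) {
  set.seed(seed)
  n_keep <- max(1, round(length(lines) * fraction))
  lines[sample.int(length(lines), n_keep)]
}

# Clean a character vector of lines; `profanity` is an assumed word list.
clean_corpus_line <- function(x, profanity = character(0)) {
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")     # drop non-ASCII characters
  x <- tolower(x)
  x <- gsub("(https?://|www\\.)\\S+", " ", x, perl = TRUE)  # URLs
  x <- gsub("\\S+@\\S+", " ", x, perl = TRUE)               # email addresses
  x <- gsub("[@#]\\w+", " ", x, perl = TRUE)                # Twitter handles and hashtags
  x <- gsub("[^a-z ]", " ", x)                              # punctuation and numbers
  if (length(profanity) > 0) {
    pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
    x <- gsub(pattern, " ", x, perl = TRUE)                 # listed profane words
  }
  trimws(gsub("\\s+", " ", x, perl = TRUE))                 # collapse leftover whitespace
}

# The toy vector is tiny, so keep every line instead of 5%.
cleaned <- clean_corpus_line(sample_lines(raw_lines, fraction = 1))
```

Email addresses are removed before handles so that the `@` inside an address is not mistaken for a Twitter handle.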

N-gram Construction

  • Tokenization:
    • Text is tokenized into unigrams, bigrams, trigrams, and quadgrams.
  • Frequency Calculation:
    • Each n-gram's frequency in the sampled corpus is counted and stored in a data frame.
  • Data Storage:
    • The unigram, bigram, trigram, and quadgram frequency tables are saved as separate .RData files for later use.
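
As a rough illustration of the tokenization and counting steps, the sketch below builds underscore-joined n-grams in base R and tabulates their frequencies. The helper names (`make_ngrams()`, `ngram_frequencies()`), the toy corpus, and the output file name are assumptions; the original project may well use a text-mining package instead.

```r
# Build all n-grams of order n from one cleaned line; the words of each
# n-gram are joined with underscores.
make_ngrams <- function(line, n) {
  words <- strsplit(line, " +")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = "_"),
         character(1))
}

# Count n-grams across all lines and return a data frame sorted by frequency.
ngram_frequencies <- function(lines, n) {
  grams <- unlist(lapply(lines, make_ngrams, n = n))
  if (length(grams) == 0) {
    return(data.frame(ngram = character(0), freq = integer(0),
                      stringsAsFactors = FALSE))
  }
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = names(counts), freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Toy corpus (hypothetical); the real input is the cleaned 5% sample.
corpus <- c("the cat sat on the mat", "the cat ate the fish")
bigrams <- ngram_frequencies(corpus, 2)

# Save one table per n-gram order; the file name here is illustrative.
out_file <- file.path(tempdir(), "bigrams.RData")
save(bigrams, file = out_file)
```

Calling `ngram_frequencies(corpus, n)` with `n` set to 1, 3, or 4 would produce the other tables in the same way.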

Functions:

cleaning_text_input()

-   Convert text to lowercase.
-   Remove punctuation, digits, and stopwords.
-   Replace spaces with underscores to match n-gram format.
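
A minimal sketch of these three steps is shown below. It is named `clean_input_sketch()` to make clear that it is not the app's own `cleaning_text_input()`, and the short stopword vector is a placeholder assumption standing in for whatever list the project uses.

```r
clean_input_sketch <- function(text,
                               stopwords = c("the", "a", "an", "of", "and")) {
  text <- tolower(text)
  # Replace punctuation and digits with spaces so words stay separated.
  text <- gsub("[[:punct:][:digit:]]", " ", text)
  words <- strsplit(trimws(text), "\\s+")[[1]]
  # Drop empty tokens and stopwords (placeholder list, see above).
  words <- words[nzchar(words) & !(words %in% stopwords)]
  # Join with underscores to match the n-gram table format.
  paste(words, collapse = "_")
}

clean_input_sketch("The weather, in 2024, is NICE!")
# "weather_in_is_nice"
```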

next_word_function()

-   Determine the number of words in the input.
-   Match the last words of the input against the highest-order n-gram table available (up to three words of context for a quadgram).
-   If no match is found, back off to smaller n-grams.
-   If no suitable n-gram is found, return the most frequent unigram.
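
The back-off logic described above might look roughly like the sketch below. The function name `predict_next_word()`, the toy frequency tables, and the choice to pick the single most frequent match are assumptions made for illustration; the app's `next_word_function()` may differ in its details.

```r
# Toy frequency tables in underscore-joined format (hypothetical data),
# ordered from unigrams to quadgrams.
tables <- list(
  data.frame(ngram = c("the", "cat", "sat"), freq = c(10, 4, 2),
             stringsAsFactors = FALSE),
  data.frame(ngram = c("the_cat", "the_dog", "cat_sat"), freq = c(3, 1, 2),
             stringsAsFactors = FALSE),
  data.frame(ngram = c("the_cat_sat", "the_cat_ran"), freq = c(2, 1),
             stringsAsFactors = FALSE),
  data.frame(ngram = "on_the_cat_sat", freq = 1, stringsAsFactors = FALSE)
)

# `input` is cleaned text with words joined by underscores.
predict_next_word <- function(input, tables) {
  words <- strsplit(input, "_", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]
  # A table of order k + 1 needs k words of context.
  max_context <- min(length(words), length(tables) - 1)
  # Try the longest context first, then back off one word at a time.
  for (k in rev(seq_len(max_context))) {
    context <- paste(tail(words, k), collapse = "_")
    tbl <- tables[[k + 1]]
    hits <- tbl[startsWith(tbl$ngram, paste0(context, "_")), ]
    if (nrow(hits) > 0) {
      best <- hits$ngram[which.max(hits$freq)]
      return(tail(strsplit(best, "_", fixed = TRUE)[[1]], 1))
    }
  }
  # No context matched: return the most frequent unigram.
  tables[[1]]$ngram[which.max(tables[[1]]$freq)]
}

predict_next_word("the_cat", tables)  # trigram match: "sat"
predict_next_word("my_cat", tables)   # backs off to the bigram "cat_sat": "sat"
predict_next_word("hello", tables)    # no match, top unigram: "the"
```

Matching on the context plus a trailing underscore keeps a context such as `the` from matching an unrelated n-gram that merely begins with the same letters.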

Shiny Application

UI:

-   Built using a custom HTML template with CSS and JavaScript.
-   Mimics a keyboard-like experience for the user.

Server Logic:

-   Calls `next_word_function()` to predict the next word as the user types.
-   Updates the UI in real time to show the predicted word.
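
The reactive wiring can be sketched as below. This is an assumption-laden illustration: it uses Shiny's standard `fluidPage()` layout rather than the custom HTML template described above, the input and output IDs (`user_text`, `prediction`) are invented, and `predict_stub()` is a placeholder for the real prediction function.

```r
# Placeholder predictor so the sketch is self-contained; the real app
# would call its own prediction function here.
predict_stub <- function(text) {
  if (!nzchar(trimws(text))) "the" else "and"
}

# Stand-in UI: a text box and a text output (the real app uses a custom
# HTML template instead).
build_ui <- function() {
  shiny::fluidPage(
    shiny::textInput("user_text", "Start typing:"),
    shiny::textOutput("prediction")
  )
}

server <- function(input, output, session) {
  # renderText() re-runs whenever input$user_text changes, so the
  # suggestion refreshes as the user types.
  output$prediction <- shiny::renderText({
    predict_stub(input$user_text)
  })
}

# To launch locally (requires the shiny package):
# shiny::shinyApp(ui = build_ui(), server = server)
```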

Try It Out! Check Out the Shiny App:

-   Visit the links below to try the predictive text application and browse the source code.

Shiny app

GitHub repository