Introduction
- Objective: Build a predictive text application that suggests the next word as the user types.
- Technology Stack:
- R for data processing and modeling
- Shiny for the web application
- HTML, CSS, JavaScript for custom UI
Data Preparation
- Data Sources:
- News, blogs, and Twitter datasets (English language).
- Sampling:
- Sampled 5% of the total data for analysis.
- Data Cleaning Steps:
- Remove non-English characters, URLs, email addresses, Twitter handles, and hashtags.
- Strip out punctuation and numbers.
- Remove profane words using a pre-defined list.
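The sampling and cleaning steps above can be sketched in base R as follows. This is a minimal illustration, not the project's actual code: the helper names `sample_lines` and `clean_corpus_text`, the seed, the regular expressions, and the lowercasing step are all assumptions, and the profanity list is passed in as a plain character vector.

```r
# Hypothetical helper: keep a reproducible random fraction of the lines.
sample_lines <- function(lines, fraction = 0.05, seed = 1234) {
  set.seed(seed)  # fixed seed (assumed) so the sample can be reproduced
  n_keep <- max(1, round(length(lines) * fraction))
  lines[sort(sample.int(length(lines), n_keep))]
}

# Hypothetical helper: one possible implementation of the cleaning steps.
clean_corpus_text <- function(text, profanity = character(0)) {
  text <- tolower(text)                                           # assumed step
  text <- gsub("(https?://|www\\.)\\S+", " ", text, perl = TRUE)  # URLs
  text <- gsub("\\S+@\\S+", " ", text, perl = TRUE)               # email addresses
  text <- gsub("[@#]\\w+", " ", text, perl = TRUE)                # handles, hashtags
  # Drop everything except a-z and spaces: punctuation, digits,
  # and any remaining non-English characters.
  text <- gsub("[^a-z ]", " ", text, perl = TRUE)
  # Split into words, drop empty strings and listed profane words, re-join.
  words <- strsplit(text, " +")
  vapply(words, function(w) {
    paste(w[nzchar(w) & !(w %in% profanity)], collapse = " ")
  }, character(1))
}

# Stand-in corpus; the real input would come from readLines() on each file.
corpus <- sprintf("line number %d", 1:1000)
sampled <- sample_lines(corpus)

cleaned <- clean_corpus_text(
  "Check http://example.com NOW!! mail me@test.org @user #tag 123 darn",
  profanity = "darn"
)
```

Email addresses are removed before handles so that the part after `@` is not mistaken for a Twitter handle.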
N-gram Construction
- Tokenization:
- Text is tokenized into unigrams, bigrams, trigrams, and quadgrams.
- Frequency Calculation:
- Count how often each n-gram occurs in the sampled text.
- Frequencies are stored in data frames and saved as .RData files for later use.
- Data Storage:
- Unigram, bigram, trigram, and quadgram frequencies are stored in separate R data files.
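As a rough sketch of the tokenization and counting described above, the base R code below builds n-grams joined with underscores and tabulates their frequencies. The function names, the column names `ngram` and `frequency`, and the temporary file path are assumptions made for illustration; the original project may use a tokenizer package instead.

```r
# Hypothetical helper: all n-grams of order n, words joined by "_".
build_ngrams <- function(lines, n) {
  tokens_per_line <- strsplit(lines, "\\s+")
  unlist(lapply(tokens_per_line, function(tokens) {
    tokens <- tokens[nzchar(tokens)]
    if (length(tokens) < n) return(character(0))  # line too short for this order
    starts <- seq_len(length(tokens) - n + 1)
    vapply(starts, function(i) {
      paste(tokens[i:(i + n - 1)], collapse = "_")
    }, character(1))
  }))
}

# Hypothetical helper: frequency table as a data frame, most frequent first.
ngram_frequencies <- function(lines, n) {
  counts <- table(build_ngrams(lines, n))
  freq <- data.frame(
    ngram = names(counts),
    frequency = as.integer(counts),
    stringsAsFactors = FALSE
  )
  freq[order(-freq$frequency, freq$ngram), ]
}

lines <- c("the cat sat", "the cat ran")
bigram_freq <- ngram_frequencies(lines, 2)

# Save one table per n-gram order; a temporary path stands in for the real one.
path <- tempfile(fileext = ".RData")
save(bigram_freq, file = path)
```

N-grams are built per line, so no n-gram spans two unrelated lines of the corpus.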
Functions:
cleaning_text_input()
- Convert text to lowercase.
- Remove punctuation, digits, and stopwords.
- Replace spaces with underscores to match n-gram format.
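One way these three steps might look in base R is shown below. The short stopword vector is a placeholder assumption; the source does not say which stopword list the app uses.

```r
# Sketch of cleaning_text_input(); the default stopword list is illustrative only.
cleaning_text_input <- function(text,
                                stopwords = c("the", "a", "an", "of", "and")) {
  text <- tolower(text)
  text <- gsub("[[:punct:][:digit:]]+", " ", text)   # punctuation and digits
  words <- strsplit(trimws(text), "\\s+")[[1]]
  words <- words[nzchar(words) & !(words %in% stopwords)]
  paste(words, collapse = "_")                       # match the n-gram key format
}

cleaned_input <- cleaning_text_input("Thanks for the 2 Great Days!")
```

With the placeholder stopword list, the example input becomes `thanks_for_great_days`.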
next_word_function()
- Determine the number of words in the input.
- Match the input with the largest possible n-gram.
- If no match is found, back off to smaller n-grams.
- If no suitable n-gram is found, return the most frequent unigram.
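The back-off logic above could be sketched as follows. This is an assumption-laden illustration rather than the app's code: it expects the cleaned, underscore-joined input, takes the frequency tables as a list ordered from unigrams upward (a hypothetical `tables` argument), and assumes each table has `ngram` and `frequency` columns. The toy tables stop at trigrams, but the same loop covers quadgrams if a fourth table is supplied.

```r
next_word_function <- function(input, tables) {
  words <- strsplit(input, "_", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]
  # An n-gram of order k + 1 needs k words of context.
  max_context <- min(length(words), length(tables) - 1)
  for (k in rev(seq_len(max_context))) {             # longest context first
    prefix <- paste0(paste(tail(words, k), collapse = "_"), "_")
    candidates <- tables[[k + 1]]
    hits <- candidates[startsWith(candidates$ngram, prefix), ]
    if (nrow(hits) > 0) {
      best <- hits$ngram[which.max(hits$frequency)]
      return(sub(".*_", "", best))                   # keep only the last word
    }
  }
  # Nothing matched at any order: fall back to the most frequent unigram.
  unigrams <- tables[[1]]
  unigrams$ngram[which.max(unigrams$frequency)]
}

# Toy tables for illustration only.
tables <- list(
  data.frame(ngram = c("the", "day", "you"),
             frequency = c(10, 4, 6), stringsAsFactors = FALSE),
  data.frame(ngram = c("thank_you", "great_day"),
             frequency = c(5, 3), stringsAsFactors = FALSE),
  data.frame(ngram = c("have_great_day", "have_great_time"),
             frequency = c(2, 3), stringsAsFactors = FALSE)
)

p_trigram <- next_word_function("have_great", tables)  # matched in trigrams
p_backoff <- next_word_function("so_great", tables)    # backs off to bigrams
p_default <- next_word_function("zebra", tables)       # falls back to unigrams
```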
Shiny Application
UI:
- Built using a custom HTML template with CSS and JavaScript.
- Mimics a keyboard-like experience for the user.
Server Logic:
- Uses the `next_word_function` to predict the next word as the user types.
- Updates the UI in real time to show the predicted word.
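A stripped-down sketch of this server logic is shown below. It swaps the custom HTML template for a plain `fluidPage` to stay short, and `predict_stub` is a hypothetical stand-in for the call to `next_word_function`; the input and output IDs are likewise assumptions.

```r
library(shiny)

# Stand-in predictor; the real app would clean the text and call
# next_word_function() on it instead.
predict_stub <- function(text) {
  if (nzchar(trimws(text))) "you" else ""
}

ui <- fluidPage(
  textInput("typed", "Start typing:"),
  textOutput("prediction")
)

server <- function(input, output, session) {
  # renderText re-runs whenever input$typed changes, so the
  # prediction follows the user's typing.
  output$prediction <- renderText({
    predict_stub(input$typed)
  })
}

# shinyApp(ui, server)  # uncomment to launch the app locally
```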
Try It Out! Check Out the Shiny App:
- Visit the links below to try the predictive text application and browse the code.
Shiny app
GitHub repository