The goal of this project is to develop a predictive text algorithm
and a Shiny application that mimics “Smart Keyboard”
functionality.
This report explores the dataset provided by SwiftKey, summarizes its
major features, and outlines the roadmap for the final prediction
model.
The analysis is based on a large corpus of text from three sources:
Blogs, News, and Twitter.
Below is a summary of the raw data files.
| Source | Lines | Words |
|---|---|---|
| Blogs | 899288 | 37546250 |
| News | 1010242 | 34762395 |
| Twitter | 2360148 | 30093413 |
Because the files are very large, a 1% sample was taken from each
source to perform exploratory analysis.
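A minimal sketch of how such a 1% sample might be drawn, using base R only. The helper name `sample_lines`, the seed, and the toy input are assumptions for illustration; the report does not state how the sample was actually taken.

```r
# Hypothetical helper: keep a random fraction of the lines in a character vector.
sample_lines <- function(lines, frac = 0.01, seed = 1234) {
  if (length(lines) == 0) return(lines)
  set.seed(seed)                                   # fixed seed keeps the sample reproducible
  n_keep <- max(1, floor(length(lines) * frac))    # always keep at least one line
  lines[sort(sample.int(length(lines), n_keep))]   # sorted indices preserve original order
}

# Toy stand-in for one source file; in practice `lines` would come from readLines().
toy_lines <- paste("line", 1:1000)
toy_sample <- sample_lines(toy_lines, frac = 0.01)
length(toy_sample)  # 10 lines, i.e. 1% of 1000
```

Sampling each source separately, as described above, keeps all three sources represented in proportion to their size.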
During the cleaning process:
| last | worked | that | he |
| week | up | were | down |
| he | about | not | in |
| was | some | actually | his |
| so | things | real | bed |
# Packages assumed by this chunk: dplyr for the pipeline, cld2 for detect_language()
library(dplyr)
library(cld2)

# Word frequencies with each word's share and cumulative share of all tokens
word_freq <- tibble(word = cleaned_tokens) %>%
  count(word, sort = TRUE) %>%
  mutate(freq = n / sum(n), cumulative = cumsum(freq))

# Coverage analysis: rank of the first word at which cumulative coverage reaches 50% and 90%
cover_50 <- word_freq %>%
  filter(cumulative >= 0.5) %>%
  slice(1) %>%
  pull("word")
count_50 <- which(word_freq$word == cover_50)
cover_90 <- word_freq %>%
  filter(cumulative >= 0.9) %>%
  slice(1) %>%
  pull("word")
count_90 <- which(word_freq$word == cover_90)

# Foreign language analysis: flag unique words that cld2 detects as non-English
foreign_check <- word_freq %>%
  mutate(detected_lang = detect_language(word),
         is_foreign = !is.na(detected_lang) & detected_lang != "en")
percent_foreign <- paste0(round(mean(foreign_check$is_foreign) * 100, 2), "%")
A relatively small number of unique words accounts for most of the
language used. To cover 50% of all word instances in the sample, only
135 unique words are needed.
Based on the cld2 language detection library, approximately
4.53% of the unique words in the sample appear to be from
foreign languages.
The plots below show the top 20 Unigrams (single words), Bigrams (two-word phrases), and Trigrams (three-word phrases).
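The counts behind n-gram plots like these can be sketched in base R as follows. The helpers `make_ngrams` and `top_ngrams` and the toy token vector are hypothetical names introduced for illustration. The report's own plots may have been produced with a different tokenizer or package.

```r
# Hypothetical helper: build all n-grams of order `n` from a token vector.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical helper: the `k` most frequent n-grams as a data frame.
top_ngrams <- function(tokens, n, k = 20) {
  counts <- sort(table(make_ngrams(tokens, n)), decreasing = TRUE)
  head(data.frame(ngram = as.character(names(counts)),
                  n = as.integer(counts),
                  stringsAsFactors = FALSE), k)
}

# Toy token vector standing in for the cleaned sample.
toy_tokens <- c("i", "love", "you", "i", "love", "it", "i", "love", "you")
top_ngrams(toy_tokens, n = 2, k = 2)  # "i love" (3) first, then "love you" (2)
```

Note that this sketch treats the tokens as one continuous stream. On a real corpus, building n-grams line by line avoids phrases that span two unrelated lines.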
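Moving forward, the prediction strategy will rely on a Stupid Backoff model: the highest-order n-gram that matches the user's input is tried first, and the model backs off to shorter n-grams, with a fixed penalty, when no match is found.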
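As a rough illustration of Stupid Backoff scoring, the sketch below uses base R and a toy token vector. The function names (`count_ngrams`, `sb_score`, `predict_next`), the count-table layout, and the backoff multiplier `alpha = 0.4` (a commonly used value) are assumptions for this example, not the final model's implementation.

```r
# Count n-grams of order `n`; returns a named integer vector (name = n-gram).
count_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(integer(0))
  starts <- seq_len(length(tokens) - n + 1)
  grams <- vapply(starts,
                  function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
                  character(1))
  tab <- table(grams)
  setNames(as.integer(tab), names(tab))
}

# Stupid Backoff score of `word` given `context` (a vector of preceding tokens).
# `counts[[n]]` holds the n-gram counts; each backoff step multiplies by `alpha`.
sb_score <- function(word, context, counts, alpha = 0.4) {
  context <- tail(context, length(counts) - 1)  # keep only as much context as we have tables for
  penalty <- 1
  repeat {
    k <- length(context)
    if (k == 0) {                               # base case: unigram relative frequency
      uni <- counts[[1]]
      c_w <- if (word %in% names(uni)) uni[[word]] else 0
      return(penalty * c_w / sum(uni))
    }
    ctx_key <- paste(context, collapse = " ")
    full_key <- paste(ctx_key, word)
    if (full_key %in% names(counts[[k + 1]])) { # match found at this order
      return(penalty * counts[[k + 1]][[full_key]] / counts[[k]][[ctx_key]])
    }
    context <- context[-1]                      # back off: drop the oldest context word
    penalty <- penalty * alpha
  }
}

# Score every vocabulary word and return the best one.
predict_next <- function(context, counts, alpha = 0.4) {
  vocab <- names(counts[[1]])
  scores <- vapply(vocab, sb_score, numeric(1),
                   context = context, counts = counts, alpha = alpha)
  names(scores)[which.max(scores)]
}

toy_tokens <- c("i", "love", "you", "i", "love", "it", "i", "love", "you")
toy_counts <- lapply(1:3, function(n) count_ngrams(toy_tokens, n))
predict_next(c("i", "love"), toy_counts)  # "you": "i love you" occurs 2 of the 3 times "i love" does
```

Scoring the whole vocabulary is fine for a toy example. A real application would more likely look up only the candidate words that follow the matched context.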
The final Shiny App will feature a reactive interface where the user
can type text, and the top predicted word will appear instantly.
To ensure the app is fast and lightweight for mobile simulation, the
N-gram tables will be optimized and pruned of very low-frequency
entries.
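Pruning of this kind could look like the sketch below, which drops n-grams seen fewer than a threshold number of times. The helper `prune_ngrams`, the column names `ngram` and `n`, and the threshold of 2 are assumptions for illustration. The actual cutoff would need to be tuned against accuracy and memory use.

```r
# Hypothetical helper: drop n-grams seen fewer than `min_count` times.
prune_ngrams <- function(ngram_table, min_count = 2) {
  ngram_table[ngram_table$n >= min_count, , drop = FALSE]
}

# Toy bigram table standing in for a real n-gram table.
toy_table <- data.frame(
  ngram = c("i love", "love you", "you i", "love it", "it i"),
  n = c(3L, 2L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)
pruned <- prune_ngrams(toy_table, min_count = 2)
nrow(pruned)  # 2 rows remain: "i love" and "love you"
```

Because n-grams that occur only once typically make up a large share of the table, even a low threshold can shrink it considerably.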